
Conversation

@patrislav (Contributor)

This PR introduces the following packages:

  • product with the Product entity and Repository interface
  • price with the Price entity and Repository interface
  • mysql with the product.Repository and price.Repository MySQL implementations, allowing for querying and inserting products and prices
  • aws with the implementation of AWS pricing ingestion, currently supporting only EC2 and EBS

These are almost the same as their implementations in the internal PoC, with only minimal changes.

@patrislav patrislav self-assigned this Oct 9, 2020
@patrislav (Contributor, Author)

Addressed the comments

@xescugc (Member) left a comment

Ok so I have some global things to mention:

  • For now, as we only have AWS, it's fine, but the ingester should be abstracted: use the aws/ingester as an interface that any provider (GCP/AZ...) should fulfil, because at the end we are reading CSV, storing Product, and storing Price, and this could easily be generalized into an interface. I normally try to have this ASAP so it's easy to test things (as you can fake providers and test the logic) and also to have the right architecture from the beginning.

  • I would use the pkg we have to abstract SQL on YD (I'll make them OSS today just to speed this up), so then you do not need to use "github.com/DATA-DOG/go-sqlmock" for anything. And if you need to test it, use an e2e test type.

I think this is all for now.

@patrislav (Contributor, Author)

> For now, as we only have AWS, it's fine, but the ingester should be abstracted: use the aws/ingester as an interface that any provider (GCP/AZ...) should fulfil, because at the end we are reading CSV, storing Product, and storing Price, and this could easily be generalized into an interface. I normally try to have this ASAP so it's easy to test things (as you can fake providers and test the logic) and also to have the right architecture from the beginning.

Yeah, I'm planning to have an interface of:

type Ingester interface {
    Ingest(ctx context.Context, service, location string) error
}

The AWS ingester already fulfils it. The reason is that each provider has a different way of importing the pricing data: AWS has CSV and JSON files, GCP has a gRPC API (abstracted by the official Go library), and Azure has a REST interface. So I'm not sure if the Ingester interface can be abstracted differently.
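For testing, a provider can then be faked behind this interface, e.g. (a hypothetical test double, purely for illustration):

// fakeIngester is a hypothetical stand-in that satisfies the Ingester interface.
type fakeIngester struct {
    err error // the error to return, if any
}

func (f *fakeIngester) Ingest(ctx context.Context, service, location string) error {
    return f.err
}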

> I would use the pkg we have to abstract SQL on YD (I'll make them OSS today just to speed this up), so then you do not need to use "github.com/DATA-DOG/go-sqlmock" for anything. And if you need to test it, use an e2e test type.

I'm using sqlmock's expectations also as a way to test the filter query generation; I'm not sure if the same can be done with the SQL interface. But once you publish it, I'll take a look and use it.

@xescugc (Member)

xescugc commented Oct 13, 2020

Sorry, I forgot to comment on the Migrate ones. I would use https://github.com/lopezator/migrator; it's a simple lib that basically does that. I have already used it in the past on several projects. I can send you some code to directly copy the implementation :)

> The AWS ingester already fulfils it. The reason is that each provider has a different way of importing the pricing data: AWS has CSV and JSON files, GCP has a gRPC API (abstracted by the official Go library), and Azure has a REST interface. So I'm not sure if the Ingester interface can be abstracted differently.

I'm sure you can make a common one. Isn't the CSV/GCP/API just the way to download the file? From there you can read it and return product.Product and price.Price so that the caller can do the logic to store them. Basically, the common interface has to return the list of Products/Prices, as those entities are already reusable between services, so it makes no sense to give them provider-specific logic when they're already abstract.

Or it could even return a chan which the caller would consume; there are multiple solutions that would, IMO, make this easier to test/use.
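The shape being suggested here is roughly (a hypothetical signature, just to illustrate the idea):

type Ingester interface {
    // Ingest parses the provider's pricing source and streams the
    // resulting entities back to the caller.
    Ingest(ctx context.Context, service, location string) (<-chan product.Product, <-chan price.Price, error)
}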

> I'm using sqlmock's expectations also as a way to test the filter query generation; I'm not sure if the same can be done with the SQL interface. But once you publish it, I'll take a look and use it.

My main problem with testing SQL by injecting data into it (even with a mock) is that you are testing SQL, and it's already been tested for years. Also, you are inserting data which did not go through your "use case", so it's not real data; at the end you are testing a query that you may never run, which is a useless test. If you want to test that, I'd prefer you use some kind of e2e test in which you Create and then Get, as those are actual use cases of the code, instead of faking data; and in the end you also need e2e tests hehe.

And with the lib you'll be able to check which SQL you are generating.

@patrislav (Contributor, Author)

> Sorry, I forgot to comment on the Migrate ones. I would use https://github.com/lopezator/migrator; it's a simple lib that basically does that. I have already used it in the past on several projects. I can send you some code to directly copy the implementation :)

Sure, I will use that lib 👍
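For reference, wiring it up should look roughly like this, going by the library's README (the migration name and SQL below are placeholders):

// a minimal sketch assuming lopezator/migrator's API
m, err := migrator.New(
    migrator.Migrations(
        &migrator.Migration{
            Name: "v0 initial schema",
            Func: func(tx *sql.Tx) error {
                _, err := tx.Exec("CREATE TABLE products (id INT AUTO_INCREMENT PRIMARY KEY)")
                return err
            },
        },
    ),
)
if err != nil {
    return err
}
if err := m.Migrate(db); err != nil { // db is an open *sql.DB
    return err
}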

> I'm sure you can make a common one. Isn't the CSV/GCP/API just the way to download the file? From there you can read it and return product.Product and price.Price so that the caller can do the logic to store them. Basically, the common interface has to return the list of Products/Prices, as those entities are already reusable between services, so it makes no sense to give them provider-specific logic when they're already abstract.

> Or it could even return a chan which the caller would consume; there are multiple solutions that would, IMO, make this easier to test/use.

It would be perfect to have the return type be e.g. (chan product.Product, chan price.Price, chan error), but the problem is that to add a price, the product must already be in the database and we need to know its ID to associate the price with it. One alternative could be to have it as:

Ingest(productRepo product.Repository, priceCh chan<- price.Price) error

So the Ingester must create Products itself but sends the Prices on the priceCh channel. This is inconsistent, though, so do you have some idea? Returning all of the Products and all of the Prices at once makes it impossible to read them streamed. And we could associate Prices with Products using (vendor, sku) instead of ID, but then there's still the data race problem: a Price might be attempted to be inserted before its Product, and with concurrent channels that can happen.

> My main problem with testing SQL by injecting data into it (even with a mock) is that you are testing SQL, and it's already been tested for years. Also, you are inserting data which did not go through your "use case", so it's not real data; at the end you are testing a query that you may never run, which is a useless test. If you want to test that, I'd prefer you use some kind of e2e test in which you Create and then Get, as those are actual use cases of the code, instead of faking data; and in the end you also need e2e tests hehe.

> And with the lib you'll be able to check which SQL you are generating.

I hoped that this way I could avoid e2e tests, given how functionally simple it is. 😅

@patrislav (Contributor, Author)

Pushed updates with:

  • new Ingester interface with the Ingest(ctx context.Context, results chan<- *price.WithProduct) error method signature
  • new IngestPricing function that uses an Ingester (and a Backend that is an aggregation of all repositories) to import pricing data
  • using the migrator library for DB migrations
  • using the sqlr interfaces

I hope I addressed all the comments and haven't forgotten about any.
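Roughly, the new pieces fit together like this (a sketch only; the Backend fields and repository methods are illustrative, not the actual code):

func IngestPricing(ctx context.Context, ing Ingester, b *Backend) error {
    results := make(chan *price.WithProduct)
    errs := make(chan error, 1)

    // Run the ingester in the background; close results when it finishes
    // so the range below terminates.
    go func() {
        defer close(results)
        if err := ing.Ingest(ctx, results); err != nil {
            errs <- err
        }
    }()

    for pp := range results {
        // hypothetical repository calls: store the product first so the
        // price can reference its ID (a real implementation would also
        // drain results on early return)
        if err := b.Product.Create(ctx, &pp.Product); err != nil {
            return err
        }
        if err := b.Price.Create(ctx, pp); err != nil {
            return err
        }
    }

    select {
    case err := <-errs:
        return err
    default:
        return nil
    }
}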

@xescugc (Member) left a comment

Overall I would say it's OK; it's quite long now.

Some points:

  • There are comments with a 👍 that are not actually addressed; check them before actually copying this to the OSS one.
  • Whenever we have concurrency and channels, I would strongly suggest adding much more documentation than you have now (which is none), because when you have to re-read the code and guess the use of the channels it's not funny haha.
  • There are some tests still missing; IMO we should have an e2e test mocking the HTTP Do and returning a custom CSV, and checking that the rest works.

The Backend, I guess, is just a helper to aggregate the initialization of the repositories; it's a good idea 😄.

And I think this is all; I would say you can move this to the OSS PR.

@patrislav (Contributor, Author)

> There are comments with a +1 that are not actually addressed; check them before actually copying this to the OSS one.

Yeah, I got confused, as I used +1 as an acknowledgement instead of as a "done." I will go through everything again and address them.

> Whenever we have concurrency and channels, I would strongly suggest adding much more documentation than you have now (which is none), because when you have to re-read the code and guess the use of the channels it's not funny haha.

Yep, those parts could certainly use more comments 👍

> There are some tests still missing; IMO we should have an e2e test mocking the HTTP Do and returning a custom CSV, and checking that the rest works.

OK, will work on it.

> And I think this is all; I would say you can move this to the OSS PR.

I was planning to have this be the OSS repo 😅 But yeah, we could instead develop here (merging all the PRs here), use it internally, and only then, when we decide to open-source it (and have a name), move it to a new OSS repo.

@xescugc (Member)

xescugc commented Oct 20, 2020

> I was planning to have this be the OSS repo 😅 But yeah, we could instead develop here (merging all the PRs here), use it internally, and only then, when we decide to open-source it (and have a name), move it to a new OSS repo.

Ah, got confused; too many reviews, I thought it was the PoC haha.

@patrislav (Contributor, Author)

Added an e2e package with a test for AWS ingestion, and hopefully this time addressed all of the comments 😄

@xescugc (Member) left a comment

Would be cool to also use -race on the tests, as we are using concurrency.

But for me you can RS so I can make a better review :)

@patrislav patrislav force-pushed the pk-aws-ingestion branch 2 times, most recently from 678244d to 0757639 Compare October 26, 2020 11:49
@patrislav (Contributor, Author)

> Would be cool to also use -race on the tests, as we are using concurrency.

Tested using -race, no failure reported 👍
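For the record, that's just running the suite with Go's race detector enabled:

go test -race ./...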

> But for me you can RS so I can make a better review :)

Rebased

@xescugc (Member) left a comment

Just small comments.

@patrislav (Contributor, Author)

Addressed the comments; ready for review again. (I thought I commented this yesterday, but apparently GH didn't accept it.)

@kerak19 (Contributor) left a comment

A few things:

@patrislav (Contributor, Author)

Addressed the comments

aws/ingester.go Outdated
}

// Ingest reads pricing data from AWS and sends the results to the provided channel.
func (ing *Ingester) Ingest(ctx context.Context, results chan<- *price.WithProduct) error {
@kerak19 (Contributor)

The Ingest method should return the results channel instead of accepting it; creating and closing a channel in different places is a weird pattern, so we should avoid it.
It should run a goroutine inside, so it won't block.
As for the error, we can return a separate channel for it or wrap *price.WithProduct:

type result struct {
    wp  *price.WithProduct
    err error
}

Then check the error for every new channel message.
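The consuming side would then look roughly like this (hypothetical, assuming Ingest returns a <-chan result):

for res := range ing.Ingest(ctx) {
    if res.err != nil {
        return res.err
    }
    // handle res.wp ...
}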

@xescugc (Member)

I agree with @kerak19 that it has to return the chan, but IDK about this struct mmmm.

I'm not that familiar with concurrency patterns, but maybe expect a channel to push the error? Or maybe have it like the scanner, where you have to check .Err() after it finishes to see if it had any error. So basically the caller would need to do:

for pp := range ingester.Ingest(ctx) {
    // consume pp ...
}

if err := ingester.Err(); err != nil {
    // handle the error
}

But the one from @kerak19 is good too; I would make it public, but the rest LGTM :)

@kerak19 (Contributor)

I followed this answer from SO: https://stackoverflow.com/a/25142269; it seems like this pattern is common.

The ingester.Err() may also be good.

@xescugc (Member)

YY, LGTM :) I remember the ingester.Err() pattern from a Go blog post; either of them is fine for me.

@patrislav (Contributor, Author)

The idea behind this signature of Ingest was to allow the caller to control the concurrency in only one place for every kind of ingester. E.g. the caller is the one to specify the size of the channel that it will receive from (and thus the rate at which it can receive), and it controls how the concurrency is done. This way all the complexity is moved to only one place (the IngestPricing func in ingestion.go) instead of being duplicated in every ingester (there's only one, AWS, for now, but more will come.)

@kerak19 (Contributor)

Unfortunately, that brings a lot of unnecessary complexity.
The caller shouldn't care about the size of the channel (and if they do, they should have the option of an additional chSize argument), because it's just a stream that they can read from at whatever rate they want (the channel's buffer should be bigger than 1 anyway, in case the writes are faster than the reads).

If more providers come, I'm sure we can handle it. For now we shouldn't introduce any unnecessary anti-patterns.
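That optional argument could look roughly like this (a hypothetical sketch, not the actual code):

// Ingest with a hypothetical optional buffer size; the default of 10 is illustrative.
func (ing *Ingester) Ingest(ctx context.Context, chSize ...int) <-chan result {
    size := 10
    if len(chSize) > 0 {
        size = chSize[0]
    }
    out := make(chan result, size)
    go func() {
        defer close(out)
        // read, parse, and send results on out ...
    }()
    return out
}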

ingestion.go Outdated
// 1) the context is cancelled;
// 2) an error happened on the backend (sending the error to the errs channel);
// 3) the priceProducts is closed (sending to the done channel).
go func() {
@kerak19 (Contributor)

With Ingest returning priceProducts, this goroutine may be removed.
Then on error you may just

return err

The errs and done channels will no longer be necessary: the error will be returned directly, and the context will take care of done.

ingestion.go Outdated
Comment on lines 49 to 50
case <-ctx.Done():
return
@kerak19 (Contributor)

Now, about the ctx: I think this method should ignore it and just pass it to Ingest. Then you'll be able to do
Then you'll be able to do

for pp := range priceProducts {
    // do the work and check pp.err
}

This way you can safely range over the channel, while Ingest takes care of the context error.

@patrislav (Contributor, Author)

Addressed the comments; Ingester.Ingest now returns the results channel and the IngestPricing func is much simplified.

@kerak19 (Contributor) left a comment

I'd give it an even bigger buffer, but LGTM anyway. RS

@patrislav (Contributor, Author)

Rebased.

> I'd give it an even bigger buffer, but LGTM anyway.

The ingester uses a bufio.Reader internally, so by keeping the channel size small, the values sent on the channel stay more in line with what was read from the buffer, which should make progress tracking more accurate.

@kerak19 (Contributor) left a comment

> The ingester uses a bufio.Reader internally, so by keeping the channel size small, the values sent on the channel stay more in line with what was read from the buffer, which should make progress tracking more accurate.

The one thing is, Ingest can (and should) finish before the whole ingestion process is done (CSV reads are faster than DB writes). This creates a problem where progress says we're done (the CSV is done) while we're still writing the data, and I don't like that it's being artificially held back because of that (i.e. by limiting the buffer). IMO progress should be tracked by IngestPricing, because that reflects what actually happens right now.

Unfortunately, it'd require additional changes, and TBH I don't think it's necessary (at least for now). As you said earlier, the optimizations can be done later, so LGTM; waiting for @xescugc.

@xescugc (Member) left a comment

Just two comments; the rest LGTM.

We still have no CI, so I assume it works fine, but we should also add the linter.

Comment on lines 2 to 12

type Migration struct {
    Name string
    SQL  string
}

var Migrations = [1]Migration{
    v0Initial,
}
@xescugc (Member)

I would write some docs.
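For instance, roughly (suggested doc comments only; the exact wording is up to you):

// Migration is a single named schema migration, applied in order.
type Migration struct {
    // Name uniquely identifies the migration.
    Name string
    // SQL holds the statements executed when the migration is applied.
    SQL string
}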

aws/ingester.go Outdated
Comment on lines 20 to 21
// VendorName uniquely identifies this vendor implementation.
const VendorName = "aws"
@xescugc (Member)

I would not use VendorName, as we said to name it Provider, so maybe ProviderName, at least for now. I normally like having a unified list of those with an enumer generator, but for now let's go with this.

@patrislav (Contributor, Author)

Changed it to ProviderName for now, but I will keep this in mind when we start adding more providers. What would be the best place for this enum? A new provider package?
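If it ends up in a new provider package, the enum could look roughly like this (a hypothetical sketch; the enumer tool would generate the string conversion methods):

//go:generate enumer -type=Provider
type Provider int

const (
    AWS Provider = iota
    GCP
    Azure
)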

@xescugc (Member) left a comment

LGTM RS

@patrislav (Contributor, Author)

patrislav commented Nov 6, 2020

> We still have no CI, so I assume it works fine, but we should also add the linter.

Rebased and opened an issue for this: #7

> The one thing is, Ingest can (and should) finish before the whole ingestion process is done (CSV reads are faster than DB writes). This creates a problem where progress says we're done (the CSV is done) while we're still writing the data, and I don't like that it's being artificially held back because of that (i.e. by limiting the buffer). IMO progress should be tracked by IngestPricing, because that reflects what actually happens right now.

Yes, this would best be tracked by IngestPricing; however, there are obstacles: 1) each provider implementation might track progress differently (or not track at all; e.g. I'm not sure whether this can be done with GCP), and 2) only the ingester knows the total size of the file and how much has been read so far. That's what the bufio.Reader is used for: it reads ahead from the download stream (up to 100 MB by default), and the ingester reads from the buffer as the writing progresses. I agree that it might make less sense now that the bulk of the logic was moved around 🤔

So this will be something to deal with when the rest of the optimizations are implemented.
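For what it's worth, the "how much has been read" part could later be exposed with a small counting reader wrapped around the download stream (a hypothetical sketch, not the current implementation):

// countingReader is a hypothetical io.Reader wrapper that tracks bytes read,
// so progress can be computed against the known total size of the file.
type countingReader struct {
    r     io.Reader
    total int64 // total file size, e.g. from Content-Length
    read  int64 // bytes read so far; updated atomically
}

func (c *countingReader) Read(p []byte) (int, error) {
    n, err := c.r.Read(p)
    atomic.AddInt64(&c.read, int64(n))
    return n, err
}

// Progress returns a value in [0, 1].
func (c *countingReader) Progress() float64 {
    return float64(atomic.LoadInt64(&c.read)) / float64(c.total)
}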

@xescugc (Member) left a comment

Yeah, once we have all the code we can optimize/remove some things; we'll see at that moment. IIRC one PR is still missing to have most of the MVP done, no?

@patrislav (Contributor, Author)

> Yeah, once we have all the code we can optimize/remove some things; we'll see at that moment. IIRC one PR is still missing to have most of the MVP done, no?

Yes, #6 is the other PR and will contain the estimation part. Other than that, the two PRs only include EC2 and (partially) EBS; more resources (especially RDS, which is supposed to be part of the MVP) will be added in subsequent PRs. I split it up this way to reduce the review area.

@patrislav patrislav merged commit 3096805 into master Nov 6, 2020
@patrislav patrislav deleted the pk-aws-ingestion branch November 6, 2020 14:58