
Conversation

@patrislav (Contributor)

This PR introduces the following packages:

  • product with the Product entity and Repository interface
  • price with the Price entity and Repository interface
  • mysql with the product.Repository and price.Repository MySQL implementations, allowing for querying and inserting products and prices
  • aws with the implementation of AWS pricing ingestion, currently supporting only EC2 and EBS

These are almost the same as their implementations in the internal PoC, with only minimal changes.

@patrislav patrislav self-assigned this Oct 9, 2020
@patrislav (Contributor, Author)

Addressed the comments

@xescugc (Member) left a comment

Ok so I have some global things to mention:

  • For now, as we only have AWS, it's fine, but the ingester should be abstracted: use the aws/ingester as an interface that any provider (GCP/AZ...) should fulfil, because at the end we are reading CSV, storing Product, and storing Price, and this could easily be generalized into an interface. I normally try to have this ASAP so it's easy to test things (as you can fake providers and test the logic) and also to have the right architecture from the beginning.

  • I would use the pkg we have to abstract SQL on YD (I'll make them OSS today just to speed this up), so then you do not need to use "github.com/DATA-DOG/go-sqlmock" for anything. And if you need to test it, use an e2e test type.

I think this is all for now.

@patrislav (Contributor, Author)

> For now, as we only have AWS, it's fine, but the ingester should be abstracted: use the aws/ingester as an interface that any provider (GCP/AZ...) should fulfil, because at the end we are reading CSV, storing Product, and storing Price, and this could easily be generalized into an interface. I normally try to have this ASAP so it's easy to test things (as you can fake providers and test the logic) and also to have the right architecture from the beginning.

Yeah, I'm planning to have an interface of:

type Ingester interface {
    Ingest(ctx context.Context, service, location string) error
}

The AWS ingester already fulfils it. The reason is that each provider has a different way of importing the pricing data: AWS has CSV and JSON files, GCP has a gRPC API (abstracted by the official Go library), and Azure has a REST interface. So I'm not sure if the Ingester interface can be abstracted differently.
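For testing, a provider can then be faked behind this interface, e.g. (a hypothetical test double, purely for illustration):

// fakeIngester is a hypothetical stand-in that satisfies the Ingester interface.
type fakeIngester struct {
    err error // the error to return, if any
}

func (f *fakeIngester) Ingest(ctx context.Context, service, location string) error {
    return f.err
}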

> I would use the pkg we have to abstract SQL on YD (I'll make them OSS today just to speed this up), so then you do not need to use "github.com/DATA-DOG/go-sqlmock" for anything. And if you need to test it, use an e2e test type.

I'm using sqlmock's expectations also as a way to test the filter query generation; I'm not sure if the same can be done with the SQL interface. But once you publish it, I'll take a look and use it.

@xescugc (Member)

xescugc commented Oct 13, 2020

Sorry, I forgot to comment on the Migrate ones. I would use https://github.com/lopezator/migrator; it's a simple lib that basically does that. I have already used it in the past on several projects. I can send you some code to directly copy the implementation :)

> The AWS ingester already fulfils it. The reason is that each provider has a different way of importing the pricing data: AWS has CSV and JSON files, GCP has a gRPC API (abstracted by the official Go library), and Azure has a REST interface. So I'm not sure if the Ingester interface can be abstracted differently.

I'm sure you can make a common one. Isn't the CSV/GCP/API just the way to download the file? From there you can read it and return product.Product and price.Price so that the caller can do the logic to store them. Basically, the common interface has to return the list of Products/Prices, as those entities are already reusable between services, so it makes no sense to give them provider-specific logic when they're already abstract.

Or it could even return a chan which the caller would consume; there are multiple solutions that would, IMO, make this easier to test/use.
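The shape being suggested here is roughly (a hypothetical signature, just to illustrate the idea):

type Ingester interface {
    // Ingest parses the provider's pricing source and streams the
    // resulting entities back to the caller.
    Ingest(ctx context.Context, service, location string) (<-chan product.Product, <-chan price.Price, error)
}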

> I'm using sqlmock's expectations also as a way to test the filter query generation; I'm not sure if the same can be done with the SQL interface. But once you publish it, I'll take a look and use it.

My main problem with testing SQL by injecting data into it (even with a mock) is that you are testing SQL, and it's already been tested for years. Also, you are inserting data which did not go through your "use case", so it's not real data; at the end you are testing a query that you may never run, which is a useless test. If you want to test that, I'd prefer you use some kind of e2e test in which you Create and then Get, as those are actual use cases of the code, instead of faking data; and in the end you also need e2e tests hehe.

And with the lib you'll be able to check which SQL you are generating.

@patrislav (Contributor, Author)

> Sorry, I forgot to comment on the Migrate ones. I would use https://github.com/lopezator/migrator; it's a simple lib that basically does that. I have already used it in the past on several projects. I can send you some code to directly copy the implementation :)

Sure, I will use that lib 👍
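For reference, wiring it up should look roughly like this, going by the library's README (the migration name and SQL below are placeholders):

// a minimal sketch assuming lopezator/migrator's API
m, err := migrator.New(
    migrator.Migrations(
        &migrator.Migration{
            Name: "v0 initial schema",
            Func: func(tx *sql.Tx) error {
                _, err := tx.Exec("CREATE TABLE products (id INT AUTO_INCREMENT PRIMARY KEY)")
                return err
            },
        },
    ),
)
if err != nil {
    return err
}
if err := m.Migrate(db); err != nil { // db is an open *sql.DB
    return err
}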

> I'm sure you can make a common one. Isn't the CSV/GCP/API just the way to download the file? From there you can read it and return product.Product and price.Price so that the caller can do the logic to store them. Basically, the common interface has to return the list of Products/Prices, as those entities are already reusable between services, so it makes no sense to give them provider-specific logic when they're already abstract.

> Or it could even return a chan which the caller would consume; there are multiple solutions that would, IMO, make this easier to test/use.

It would be perfect to have the return type be e.g. (chan product.Product, chan price.Price, chan error), but the problem is that to add a price, the product must already be in the database and we need to know its ID to associate the price with it. One alternative could be to have it as:

Ingest(productRepo product.Repository, priceCh chan<- price.Price) error

So the Ingester must create Products itself but sends the Prices on the priceCh channel. This is inconsistent, though, so do you have some idea? Returning all of the Products and all of the Prices at once makes it impossible to read them streamed. And we could associate Prices with Products using (vendor, sku) instead of ID, but then there's still the data race problem: a Price might be attempted to be inserted before its Product, and with concurrent channels that can happen.

> My main problem with testing SQL by injecting data into it (even with a mock) is that you are testing SQL, and it's already been tested for years. Also, you are inserting data which did not go through your "use case", so it's not real data; at the end you are testing a query that you may never run, which is a useless test. If you want to test that, I'd prefer you use some kind of e2e test in which you Create and then Get, as those are actual use cases of the code, instead of faking data; and in the end you also need e2e tests hehe.

> And with the lib you'll be able to check which SQL you are generating.

I hoped that this way I could avoid e2e tests, given how functionally simple it is. 😅

@patrislav (Contributor, Author)

Pushed updates with:

  • new Ingester interface with the Ingest(ctx context.Context, results chan<- *price.WithProduct) error method signature
  • new IngestPricing function that uses an Ingester (and a Backend that is an aggregation of all repositories) to import pricing data
  • using the migrator library for DB migrations
  • using the sqlr interfaces

I hope I addressed all the comments and haven't forgotten about any.
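Roughly, the new pieces fit together like this (a sketch only; the Backend fields and repository methods are illustrative, not the actual code):

func IngestPricing(ctx context.Context, ing Ingester, b *Backend) error {
    results := make(chan *price.WithProduct)
    errs := make(chan error, 1)

    // Run the ingester in the background; close results when it finishes
    // so the range below terminates.
    go func() {
        defer close(results)
        if err := ing.Ingest(ctx, results); err != nil {
            errs <- err
        }
    }()

    for pp := range results {
        // hypothetical repository calls: store the product first so the
        // price can reference its ID (a real implementation would also
        // drain results on early return)
        if err := b.Product.Create(ctx, &pp.Product); err != nil {
            return err
        }
        if err := b.Price.Create(ctx, pp); err != nil {
            return err
        }
    }

    select {
    case err := <-errs:
        return err
    default:
        return nil
    }
}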

@xescugc (Member) left a comment

Overall I would say it's OK; it's quite long now.

Some points:

  • There are comments with a 👍 that are not actually addressed; check them before actually copying this to the OSS one.
  • Whenever we have concurrency and channels, I would strongly suggest adding much more documentation than you have now (which is none), because when you have to re-read the code and guess the use of the channels it's not funny haha.
  • There are some tests still missing; IMO we should have an e2e test mocking the HTTP Do and returning a custom CSV, and checking that the rest works.

The Backend, I guess, is just a helper to aggregate the initialization of the repositories; it's a good idea 😄.

And I think this is all; I would say you can move this to the OSS PR.

@patrislav (Contributor, Author)

> There are comments with a +1 that are not actually addressed; check them before actually copying this to the OSS one.

Yeah, I got confused, as I used +1 as an acknowledgement instead of as a "done." I will go through everything again and address them.

> Whenever we have concurrency and channels, I would strongly suggest adding much more documentation than you have now (which is none), because when you have to re-read the code and guess the use of the channels it's not funny haha.

Yep, those parts could certainly use more comments 👍

> There are some tests still missing; IMO we should have an e2e test mocking the HTTP Do and returning a custom CSV, and checking that the rest works.

OK, will work on it.

> And I think this is all; I would say you can move this to the OSS PR.

I was planning to have this be the OSS repo 😅 But yeah, we could instead develop here (merging all the PRs here), use it internally, and only then, when we decide to open-source it (and have a name), move it to a new OSS repo.

@xescugc (Member)

xescugc commented Oct 20, 2020

> I was planning to have this be the OSS repo 😅 But yeah, we could instead develop here (merging all the PRs here), use it internally, and only then, when we decide to open-source it (and have a name), move it to a new OSS repo.

Ah, got confused; too many reviews, I thought it was the PoC haha.

@patrislav (Contributor, Author)

Added an e2e package with a test for AWS ingestion, and hopefully this time addressed all of the comments 😄

@xescugc (Member) left a comment

Would be cool to also use -race on the tests, as we are using concurrency.

But for me you can RS so I can make a better review :)

@patrislav patrislav force-pushed the pk-aws-ingestion branch 2 times, most recently from 678244d to 0757639 Compare October 26, 2020 11:49
@patrislav (Contributor, Author)

> Would be cool to also use -race on the tests, as we are using concurrency.

Tested using -race, no failure reported 👍
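For the record, that's just running the suite with Go's race detector enabled:

go test -race ./...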

> But for me you can RS so I can make a better review :)

Rebased

@xescugc (Member) left a comment

Just small comments.

@patrislav (Contributor, Author)

Addressed the comments; ready for review again. (I thought I commented this yesterday, but apparently GH didn't accept it.)

@kerak19 (Contributor) left a comment

A few things:

@patrislav (Contributor, Author)

Addressed the comments

aws/ingester.go Outdated
}

// Ingest reads pricing data from AWS and sends the results to the provided channel.
func (ing *Ingester) Ingest(ctx context.Context, results chan<- *price.WithProduct) error {
@kerak19 (Contributor)

The Ingest method should return the results channel instead of accepting it; creating and closing a channel in different places is a weird pattern, so we should avoid it.
It should run a goroutine inside, so it won't block.
As for the error, we can return a separate channel for it or wrap *price.WithProduct:

type result struct {
    wp  *price.WithProduct
    err error
}

Then check the error for every new channel message.
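The consuming side would then look roughly like this (hypothetical, assuming Ingest returns a <-chan result):

for res := range ing.Ingest(ctx) {
    if res.err != nil {
        return res.err
    }
    // handle res.wp ...
}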

@xescugc (Member)

I agree with @kerak19 that it has to return the chan, but IDK about this struct mmmm.

I'm not that familiar with concurrency patterns, but maybe expect a channel to push the error? Or maybe have it like the scanner, where you have to check .Err() after it finishes to see if it had any error. So basically the caller would need to do:

for pp := range ingester.Ingest(ctx) {
    // consume pp ...
}

if err := ingester.Err(); err != nil {
    // handle the error
}

But the one from @kerak19 is good too; I would make it public, but the rest LGTM :)

@kerak19 (Contributor)

I followed this answer from SO: https://stackoverflow.com/a/25142269; it seems like this pattern is common.

The ingester.Err() may also be good.

@xescugc (Member)

YY, LGTM :) I remember the ingester.Err() pattern from a Go blog post; either of them is fine for me.

@patrislav (Contributor, Author)

The idea behind this signature of Ingest was to allow the caller to control the concurrency in only one place for every kind of ingester. E.g. the caller is the one to specify the size of the channel that it will receive from (and thus the rate at which it can receive), and it controls how the concurrency is done. This way all the complexity is moved to only one place (the IngestPricing func in ingestion.go) instead of being duplicated in every ingester (there's only one, AWS, for now, but more will come.)

@kerak19 (Contributor)

Unfortunately, that brings a lot of unnecessary complexity.
The caller shouldn't care about the size of the channel (and if they do, they should have the option of an additional chSize argument), because it's just a stream that they can read from at whatever rate they want (the channel's buffer should be bigger than 1 anyway, in case the writes are faster than the reads).

If more providers come, I'm sure we can handle it. For now we shouldn't introduce any unnecessary anti-patterns.
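That optional argument could look roughly like this (a hypothetical sketch, not the actual code):

// Ingest with a hypothetical optional buffer size; the default of 10 is illustrative.
func (ing *Ingester) Ingest(ctx context.Context, chSize ...int) <-chan result {
    size := 10
    if len(chSize) > 0 {
        size = chSize[0]
    }
    out := make(chan result, size)
    go func() {
        defer close(out)
        // read, parse, and send results on out ...
    }()
    return out
}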

ingestion.go Outdated
// 1) the context is cancelled;
// 2) an error happened on the backend (sending the error to the errs channel);
// 3) the priceProducts is closed (sending to the done channel).
go func() {
@kerak19 (Contributor)

With Ingest returning priceProducts, this goroutine may be removed.
Then on error you may just

return err

The errs and done channels will no longer be necessary: the error will be returned directly, and the context will take care of done.

ingestion.go Outdated
Comment on lines 49 to 50
case <-ctx.Done():
return
@kerak19 (Contributor)

Now, about the ctx: I think this method should ignore it and just pass it to Ingest. Then you'll be able to do
Then you'll be able to do

for pp := range priceProducts {
    // do the work and check pp.err
}

This way you can safely range over the channel, while Ingest takes care of the context error.

@patrislav (Contributor, Author)

Addressed the comments; Ingester.Ingest now returns the results channel and the IngestPricing func is much simplified.

@kerak19 (Contributor) left a comment

I'd give it an even bigger buffer, but LGTM anyway. RS

@patrislav (Contributor, Author)

Rebased.

> I'd give it an even bigger buffer, but LGTM anyway.

The ingester uses a bufio.Reader internally, so by keeping the channel size small, the values sent on the channel stay more in line with what was read from the buffer, which should make progress tracking more accurate.

@kerak19 (Contributor) left a comment

> The ingester uses a bufio.Reader internally, so by keeping the channel size small, the values sent on the channel stay more in line with what was read from the buffer, which should make progress tracking more accurate.

The one thing is, Ingest can (and should) finish before the whole ingestion process is done (CSV reads are faster than DB writes). This creates a problem where progress says we're done (the CSV is done) while we're still writing the data, and I don't like that it's being artificially held back because of that (i.e. by limiting the buffer). IMO progress should be tracked by IngestPricing, because that reflects what actually happens right now.

Unfortunately, it'd require additional changes, and TBH I don't think it's necessary (at least for now). As you said earlier, the optimizations can be done later, so LGTM; waiting for @xescugc.

@xescugc (Member) left a comment

Just two comments; the rest LGTM.

We still have no CI, so I assume it works fine, but we should also add the linter.

Comment on lines 2 to 12

type Migration struct {
    Name string
    SQL  string
}

var Migrations = [1]Migration{
    v0Initial,
}
@xescugc (Member)

I would write some docs.
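For instance, roughly (suggested doc comments only; the exact wording is up to you):

// Migration is a single named schema migration, applied in order.
type Migration struct {
    // Name uniquely identifies the migration.
    Name string
    // SQL holds the statements executed when the migration is applied.
    SQL string
}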

aws/ingester.go Outdated
Comment on lines 20 to 21
// VendorName uniquely identifies this vendor implementation.
const VendorName = "aws"
@xescugc (Member)

I would not use VendorName, as we said to name it Provider, so maybe ProviderName, at least for now. I normally like having a unified list of those with an enumer generator, but for now let's go with this.

@patrislav (Contributor, Author)

Changed it to ProviderName for now, but I will keep this in mind when we start adding more providers. What would be the best place for this enum? A new provider package?
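If it ends up in a new provider package, the enum could look roughly like this (a hypothetical sketch; the enumer tool would generate the string conversion methods):

//go:generate enumer -type=Provider
type Provider int

const (
    AWS Provider = iota
    GCP
    Azure
)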

@xescugc (Member) left a comment

LGTM RS

@patrislav (Contributor, Author)

patrislav commented Nov 6, 2020

> We still have no CI, so I assume it works fine, but we should also add the linter.

Rebased and opened an issue for this: #7

> The one thing is, Ingest can (and should) finish before the whole ingestion process is done (CSV reads are faster than DB writes). This creates a problem where progress says we're done (the CSV is done) while we're still writing the data, and I don't like that it's being artificially held back because of that (i.e. by limiting the buffer). IMO progress should be tracked by IngestPricing, because that reflects what actually happens right now.

Yes, this would best be tracked by IngestPricing; however, there are obstacles: 1) each provider implementation might track progress differently (or not track at all; e.g. I'm not sure whether this can be done with GCP), and 2) only the ingester knows the total size of the file and how much has been read so far. That's what the bufio.Reader is used for: it reads ahead from the download stream (up to 100 MB by default), and the ingester reads from the buffer as the writing progresses. I agree that it might make less sense now that the bulk of the logic was moved around 🤔

So this will be something to deal with when the rest of the optimizations are implemented.
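For what it's worth, the "how much has been read" part could later be exposed with a small counting reader wrapped around the download stream (a hypothetical sketch, not the current implementation):

// countingReader is a hypothetical io.Reader wrapper that tracks bytes read,
// so progress can be computed against the known total size of the file.
type countingReader struct {
    r     io.Reader
    total int64 // total file size, e.g. from Content-Length
    read  int64 // bytes read so far; updated atomically
}

func (c *countingReader) Read(p []byte) (int, error) {
    n, err := c.r.Read(p)
    atomic.AddInt64(&c.read, int64(n))
    return n, err
}

// Progress returns a value in [0, 1].
func (c *countingReader) Progress() float64 {
    return float64(atomic.LoadInt64(&c.read)) / float64(c.total)
}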

@xescugc (Member) left a comment

Yeah, once we have all the code we can optimize/remove some things; we'll see at that moment. IIRC one PR is still missing to have most of the MVP done, no?

@patrislav (Contributor, Author)

> Yeah, once we have all the code we can optimize/remove some things; we'll see at that moment. IIRC one PR is still missing to have most of the MVP done, no?

Yes, #6 is the other PR and will contain the estimation part. Other than that, the two PRs only include EC2 and (partially) EBS; more resources (especially RDS, which is supposed to be part of the MVP) will be added in subsequent PRs. I split it up this way to reduce the review area.

@patrislav patrislav merged commit 3096805 into master Nov 6, 2020
@patrislav patrislav deleted the pk-aws-ingestion branch November 6, 2020 14:58