AWS ingestion #5
Addressed the comments |
xescugc
left a comment
Ok so I have some global things to mention:
- For now, as we only have AWS, it's fine, but the ingester should be abstracted, using aws/ingester as an interface that any Provider (GCP/AZ...) should fulfill, because in the end we are: reading the CSV, storing the Product, storing the Price. This could easily be generalized into an interface. I normally try to have this ASAP so it's easy to test things (you can fake providers and test the logic) and so we have the right architecture from the beginning.
- I would use the pkg we have to abstract SQL on YD (I'll make it OSS today just to speed this up) so that you do not need to use "github.com/DATA-DOG/go-sqlmock" for anything. And if you need to test it, use an e2e type of test.
I think this is all for now.
Yeah, I'm planning to have an interface of:
    type Ingester interface {
        Ingest(ctx context.Context, service, location string) error
    }
The AWS ingester already fulfils it. The reason is that each provider has a different way of importing the pricing data: AWS has CSV and JSON files, GCP has a gRPC API (abstracted by the official Go library), Azure has a REST interface. So I'm not sure if the Ingester interface can be abstracted differently.
I'm using the |
Sorry forgot to comment on the
I'm sure you can make a common one, Is the CSV/GCP/API the way to Download the file no? Ok from that you can read it and return Or it could even return a
My main problem with testing SQL by injecting data into it (even if it's with a mock) is that you are testing SQL, and it's already been tested for years. Also, you are inserting data which is not going through your "Use-Case", so it's not real data; in the end you are testing a query that you may never run, which is a useless test. If you want to test that I prefer you to use some kind of And with the lib you'll be able to check which SQL you are generating.
Sure, I will use that lib 👍
It would be perfect to have the return type be e.g.
    Ingest(productRepo product.Repository, priceCh chan<- price.Price) error
So the Ingester must create Products itself but sends the Prices on the
I hoped that this way I can avoid e2e tests, given how functionally simple it is. 😅 |
|
Pushed updates with:
I hope I addressed all the comments and haven't forgotten about any. |
xescugc
left a comment
Overall I would say it's ok, it's quite long now.
Some points:
- There are comments with a 👍 that are not actually addressed; check them before actually copying this to the OSS one.
- Whenever we have concurrency and channels I would strongly suggest adding much more documentation than you have now (which is none), because when you have to re-read the code and guess the use of the channels it's not fun haha.
- There are some tests still missing; IMO we should have an e2e mocking the HTTP.Do and returning a custom CSV, to see that the rest works.
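As a rough sketch of the last point, the HTTP call can be hidden behind a small interface so that a test supplies a canned CSV. The HTTPClient interface, fakeClient, fetchRows, the URL and the CSV column names are all assumptions made for illustration; the PR's actual types may differ.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// HTTPClient is a hypothetical interface with the same Do method as
// *http.Client, so a real client or a fake can be injected.
type HTTPClient interface {
	Do(req *http.Request) (*http.Response, error)
}

// fakeClient answers every request with a fixed CSV body.
type fakeClient struct{ body string }

func (c fakeClient) Do(req *http.Request) (*http.Response, error) {
	return &http.Response{
		StatusCode: http.StatusOK,
		Body:       io.NopCloser(strings.NewReader(c.body)),
		Request:    req,
	}, nil
}

// fetchRows is a hypothetical stand-in for the ingester's download step:
// it requests the URL and parses the whole response body as CSV.
func fetchRows(client HTTPClient, url string) ([][]string, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return csv.NewReader(resp.Body).ReadAll()
}

func main() {
	// The column names below are made up for this example.
	client := fakeClient{body: "SKU,PricePerUnit\nABC123,0.0116\n"}
	rows, err := fetchRows(client, "https://example.invalid/pricing.csv")
	fmt.Println(rows, err)
}
```

Everything downstream of the download (parsing, storing products and prices) then runs against known input.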
The Backend I guess is just a helper to aggregate the initialization of the repositories, it's a good idea 😄 .
And I think this is all, I would say you can move this to the OSS PR
Yeah, I got confused as I used +1 as an acknowledgement, instead of "done." I will go through everything again and address them.
Yep, those parts could certainly use more comments 👍
OK, will work on it.
I was planning to have this be the OSS repo 😅 But yeah, could instead develop here (merging all the PR's here), use it internally, and only then when we decide to open-source it (and have a name), we'll move it to a new OSS repo. |
Aff got confused, too many reviews I thought it was the PoC haha. |
|
Added |
xescugc
left a comment
Would be cool to also use the -race on the tests as we are using concurrency.
But for me you can RS so I can make a better review :)
678244d to 0757639
Tested using
Rebased |
xescugc
left a comment
Just small comments.
|
Addressed the comments, ready for review again (I thought I commented this yesterday but apparently, GH didn't accept it.) |
kerak19
left a comment
Few things
|
Addressed the comments |
aws/ingester.go
Outdated
    }

    // Ingest reads pricing data from AWS and sends the results to the provided channel.
    func (ing *Ingester) Ingest(ctx context.Context, results chan<- *price.WithProduct) error {
The Ingest method should return the results channel instead of accepting it. Creating and closing a channel in different places is a weird pattern, so we should avoid it.
It should run a goroutine inside, so it won't block.
About the error, we can return a separate channel for it or wrap *price.WithProduct:
    type result struct {
        wp  *price.WithProduct
        err error
    }
then check the error for every new channel msg.
I agree with @kerak19, it has to return the chan, but IDK about this struct, mmmm.
I'm not that familiar with concurrency patterns, but maybe expect a channel to push the error to? Or maybe have it like the scan, where you have to check the .Err() after it finishes to see if it had any error. So basically the caller would need to do:
    for i := range ingester.Ingest(ctx) {
    }
    if err := ingester.Err(); err != nil {
    }
But the one from @kerak19 is good too; I would make it public, but the rest LGTM :)
I followed this answer from SO: https://stackoverflow.com/a/25142269, seems like this pattern is common.
The ingester.Err() may be also good.
YY LGTM :) I remembered the ingester.Err() being from a Go blog post; either of them is fine for me.
The idea behind this signature of Ingest was to allow the caller to control the concurrency in only one place for every kind of ingester. E.g. the caller is the one to specify the size of the channel that it will receive from (and thus the rate at which it can receive) and it controls how the concurrency is done. This way all the complexity is moved to only one place (the IngestPricing func in ingestion.go) instead of being duplicated in every ingester (there's only one - AWS - for now, but more will come.)
Unfortunately that brings a lot of unnecessary complexity.
The caller shouldn't care about the size of the channel (and if they do, they should have the option of an additional chSize argument), because it's just a stream that they can read from at whatever rate they want (the channel's buffer should be bigger than 1 anyway, in case the writes are faster than the reads).
If more providers come, I'm sure we can handle it. For now we shouldn't bring in any unnecessary anti-patterns.
ingestion.go
Outdated
    // 1) the context is cancelled;
    // 2) an error happened on the backend (sending the error to the errs channel);
    // 3) the priceProducts is closed (sending to the done channel).
    go func() {
With the Ingest returning priceProducts this goroutine may be removed.
Then on error you may just
    return err
The errs and done channels will no longer be necessary. The error will be returned directly and the context will take care of done.
ingestion.go
Outdated
    case <-ctx.Done():
        return
Now about the ctx, I think this method should ignore it and just pass it to the Ingest.
Then you'll be able to do
    for pp := range priceProducts {
        // do the work and check pp.err
    }
This way you can safely range on the channel, while the Ingest will take care of the context error.
|
Addressed the comments, |
kerak19
left a comment
I'd give it an even bigger buffer, but LGTM for me anyway. RS
f443f41 to 628c43c
|
Rebased.
The ingester uses a |
kerak19
left a comment
The ingester uses a bufio.Buffer internally, so by keeping the channel size small, the values sent on the channel are more in-line with what was read from the buffer - which should make progress tracking more accurate.
The one thing is, the Ingest can (and should) finish before the whole ingestion process is done (csv reads are faster than DB writes) - this creates a problem, where progress says we're done (the csv is done), but we're still writing the data - and I don't like that it's being artificially held back because of it (i.e. limiting the buffer). IMO progress should be tracked by IngestPricing, because it reflects what actually happens right now.
Unfortunately it'd require additional changes and TBH I don't think it's necessary (at least for now). As you said earlier, the optimizations can be done later, so LGTM and w8 for @xescugc.
xescugc
left a comment
Just 2 comments and the rest LGTM.
We still have no CI so I assume it works fine but we should also add the linter.
    type Migration struct {
        Name string
        SQL  string
    }

    var Migrations = [1]Migration{
        v0Initial,
I would write some docs.
aws/ingester.go
Outdated
    // VendorName uniquely identifies this vendor implementation.
    const VendorName = "aws"
I would not use VendorName, as we said we'd name it Provider, so maybe ProviderName, at least for now. I normally like having a unified list of those with an enumer generator, but for now let's go with this.
Changed it to ProviderName for now, but will keep this in mind when we start adding more providers. What would be the best place for this enum? New provider package?
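For reference, a unified provider list could look something like the following hand-written sketch. The Provider type, its constants, the "gcp" and "azure" names and ParseProvider are hypothetical; only "aws" comes from the diff above. A generator such as enumer would produce similar String and parse helpers instead of the manual map used here.

```go
package main

import "fmt"

// Provider is a hypothetical enum of supported cloud providers.
type Provider int

const (
	AWS Provider = iota + 1 // start at 1 so the zero value means "unset"
	GCP
	Azure
)

// providerNames maps each Provider to its unique string identifier.
var providerNames = map[Provider]string{
	AWS:   "aws",
	GCP:   "gcp",
	Azure: "azure",
}

// String returns the provider's identifier, or a placeholder for unknown values.
func (p Provider) String() string {
	if name, ok := providerNames[p]; ok {
		return name
	}
	return fmt.Sprintf("Provider(%d)", int(p))
}

// ParseProvider converts an identifier such as "aws" back into a Provider.
func ParseProvider(s string) (Provider, error) {
	for p, name := range providerNames {
		if name == s {
			return p, nil
		}
	}
	return 0, fmt.Errorf("unknown provider %q", s)
}

func main() {
	fmt.Println(AWS, GCP, Azure)
	p, err := ParseProvider("aws")
	fmt.Println(p, err)
	_, err = ParseProvider("unknown")
	fmt.Println(err)
}
```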
xescugc
left a comment
LGTM RS
2290f67 to c533d1c
Rebased and opened an issue for this: #7
Yes, this would be best tracked by So this will be something to deal with when the rest of the optimizations is implemented. |
xescugc
left a comment
Yeah, once we have all the code we can optimize/remove some things; we'll see at that moment. IIRC one PR is still missing to have most of the MVP done, no?
Yes, #6 is the other PR that will contain the estimation part. Other than this, the two only include EC2 and (partially) EBS, more resources (esp. RDS that is supposed to be part of the MVP) will be added in subsequent PR's. I split it up this way to reduce the review area. |
This PR introduces the following packages:
- product, with the Product entity and Repository interface
- price, with the Price entity and Repository interface
- mysql, with the product.Repository and price.Repository MySQL implementations, allowing for querying and inserting products and prices
- aws, with the implementation of AWS pricing ingestion, currently supporting only EC2 and EBS
These are almost the same as their implementation in the internal PoC, with only minimal changes.