Worry more about fingerprint clashes #717
These are the relevant bits: https://github.com/cortexproject/cortex/blob/master/pkg/querier/chunk_store_queryable.go#L51-L68 The idea is that two different series could have the same fingerprint, and then we just say the chunks for both belong to one series.
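To make the failure mode concrete, here is a minimal sketch of what happens when chunks are grouped under a hash key alone. Everything in it is hypothetical: `Label`, `weakFingerprint`, and `groupByFingerprint` are illustration-only names, and the byte-sum hash is deliberately poor so that a collision is easy to reproduce. It is not `client.FastFingerprint` and not the querier code linked above.

```go
package main

import (
	"fmt"
	"strings"
)

// Label is a simplified, hypothetical stand-in for a label pair.
type Label struct{ Name, Value string }

// weakFingerprint is a deliberately poor hash (a plain byte sum) chosen so
// that collisions are trivial to trigger. It is NOT client.FastFingerprint.
func weakFingerprint(ls []Label) uint64 {
	var sum uint64
	for _, l := range ls {
		for i := 0; i < len(l.Name); i++ {
			sum += uint64(l.Name[i])
		}
		for i := 0; i < len(l.Value); i++ {
			sum += uint64(l.Value[i])
		}
	}
	return sum
}

// describe renders a label set as name=value pairs for printing.
func describe(ls []Label) string {
	parts := make([]string, 0, len(ls))
	for _, l := range ls {
		parts = append(parts, l.Name+"="+l.Value)
	}
	return strings.Join(parts, ",")
}

// groupByFingerprint groups series under their hash only, without checking
// that the label sets sharing a bucket are actually equal.
func groupByFingerprint(series [][]Label) map[uint64][]string {
	out := map[uint64][]string{}
	for _, ls := range series {
		fp := weakFingerprint(ls)
		out[fp] = append(out[fp], describe(ls))
	}
	return out
}

func main() {
	// "ab" and "ba" have the same byte sum, so these two series collide.
	a := []Label{{"__name__", "requests_total"}, {"job", "ab"}}
	b := []Label{{"__name__", "requests_total"}, {"job", "ba"}}

	groups := groupByFingerprint([][]Label{a, b})
	for fp, members := range groups {
		fmt.Printf("fingerprint %d holds %d series: %v\n", fp, len(members), members)
	}
}
```

With a real 64-bit hash the collision is far less likely, but the structure of the bug is the same: two distinct label sets end up in one bucket and their samples get interleaved.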
From a quick look through the code, I have the feeling that we also suffer from hash collisions in the results cache (see
* Don't rely on hashes for collecting chunks together

  This fixes #717.

  We hit an issue in production when a customer issued a simple query that touched 20K series, many of which had hash collisions on `client.FastFingerprint`. This made our querier code merge multiple series together, which caused counters to go up and down and caused rate() to return weird artifacts. It is quite tricky to debug, and I now have a test that proves we will suffer if we don't handle collisions explicitly.

  After looking at several solutions for fixing that, I've finally settled on something super simple: just use a string as the map key.

  Here is a benchmark where I insert and look up 100K series:

      goos: linux
      goarch: amd64
      pkg: github.com/cortexproject/cortex/pkg/distributor
      BenchmarkSeriesMap-8   842   91633894 ns/op
      PASS
      ok   github.com/cortexproject/cortex/pkg/distributor   86.307s

  Now I could have just used labels.String(), but the performance there is quite bad:

      goos: linux
      goarch: amd64
      pkg: github.com/cortexproject/cortex/pkg/distributor
      BenchmarkSeriesMap-8   195   373104859 ns/op
      PASS
      ok   github.com/cortexproject/cortex/pkg/distributor   110.456s

  Compare this to using client.Fingerprint for the hashing:

      goos: linux
      goarch: amd64
      pkg: github.com/cortexproject/cortex/pkg/distributor
      BenchmarkSeriesMap-8   1273   54778130 ns/op

  This means the string key is roughly 70% slower in this section (about 92 ms versus 55 ms per run), but as explained below this is very small when compared to the overall query.

  ----------------

  The reason I've stuck with the simple approach is that it takes <100ms to do this over 100K series, and in most cases the network time to load all the chunks and iterate through them is several orders of magnitude higher.

  And tbh, I've looked at manually handling collisions, and we need to do labels.Equal(l1, l2) to see if the series we're looking at is the actual series in the map or a collision, and the perf there is actually worse. I'm open to ideas here.

  Also note that the value type of each map is different, so any more complex shared solution would require interface{}, which is arguably worse. sha256 has been tried and is worse in comparison.

  Signed-off-by: Goutham Veeramachaneni <[email protected]>

* Add changelog entry

  Signed-off-by: Goutham Veeramachaneni <[email protected]>

* Address feedback

  Signed-off-by: Goutham Veeramachaneni <[email protected]>
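As a rough illustration of the "string as the map key" idea described in the commit message, the sketch below builds a key by concatenating label names and values with a separator byte. This is a simplified assumption of how such a key could be built, not the code from the commit: `Label`, `seriesKey`, and the `0xff` separator are hypothetical, and the sketch assumes the labels arrive sorted by name and contain only valid UTF-8 (which never includes the byte `0xff`).

```go
package main

import (
	"fmt"
	"strings"
)

// Label is a simplified, hypothetical stand-in for a label pair.
type Label struct{ Name, Value string }

// sep separates names and values inside the key. 0xff never occurs in valid
// UTF-8, so {"a","bc"} and {"ab","c"} cannot produce the same key.
const sep byte = 0xff

// seriesKey builds a map key from the full label set. Unlike a 64-bit hash,
// two different label sets always yield different keys (given the
// assumptions above), so no collision handling is needed.
func seriesKey(ls []Label) string {
	var b strings.Builder
	for _, l := range ls {
		b.WriteString(l.Name)
		b.WriteByte(sep)
		b.WriteString(l.Value)
		b.WriteByte(sep)
	}
	return b.String()
}

func main() {
	a := []Label{{"__name__", "requests_total"}, {"job", "ab"}}
	b := []Label{{"__name__", "requests_total"}, {"job", "ba"}}

	// Chunk identifiers are plain strings here purely for illustration.
	chunksBySeries := map[string][]string{}
	chunksBySeries[seriesKey(a)] = append(chunksBySeries[seriesKey(a)], "chunk-1")
	chunksBySeries[seriesKey(b)] = append(chunksBySeries[seriesKey(b)], "chunk-2")
	chunksBySeries[seriesKey(a)] = append(chunksBySeries[seriesKey(a)], "chunk-3")

	fmt.Println("series:", len(chunksBySeries))
	fmt.Println("chunks for a:", chunksBySeries[seriesKey(a)])
}
```

The trade-off matches the benchmark discussion above: building and hashing a full string costs more than a single fingerprint, in exchange for never merging two distinct series.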
Not fixed by #3192 which addresses a similar problem in the querier. I was thinking of code like this: cortex/pkg/chunk/series_store.go Line 248 in 0a33f9d
I also found cortex/pkg/distributor/distributor.go Line 717 in 0a33f9d
cortex/pkg/distributor/query.go Lines 114 to 120 in 0a33f9d
Whilst the first example can be dropped when we deprecate chunks (#4268), the others still seem to be valid.
The chunk store fetch code assumes that fingerprints uniquely identify a timeseries; this is fairly likely when they are looking at a single metric, but we still could get clashes.
Perhaps use the sha256 from the index?
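For reference, a sha256-based key could look roughly like the sketch below. It is only an assumption about the general shape of the idea: `Label` and `seriesDigest` are hypothetical names, and this is not how the Cortex index derives its series IDs. A `[32]byte` array is comparable in Go, so the digest can be used directly as a map key.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Label is a simplified, hypothetical stand-in for a label pair.
type Label struct{ Name, Value string }

// seriesDigest hashes the full label set with SHA-256. A 0xff byte separates
// names and values; it cannot appear in valid UTF-8 label text, so field
// boundaries stay unambiguous. Labels are assumed to be sorted by name.
func seriesDigest(ls []Label) [sha256.Size]byte {
	h := sha256.New()
	for _, l := range ls {
		h.Write([]byte(l.Name))
		h.Write([]byte{0xff})
		h.Write([]byte(l.Value))
		h.Write([]byte{0xff})
	}
	var out [sha256.Size]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	a := []Label{{"__name__", "requests_total"}, {"job", "ab"}}
	b := []Label{{"__name__", "requests_total"}, {"job", "ba"}}

	// Fixed-size arrays are comparable, so the digest works as a map key.
	chunksBySeries := map[[sha256.Size]byte][]string{}
	chunksBySeries[seriesDigest(a)] = append(chunksBySeries[seriesDigest(a)], "chunk-1")
	chunksBySeries[seriesDigest(b)] = append(chunksBySeries[seriesDigest(b)], "chunk-2")

	fmt.Println("series:", len(chunksBySeries))
}
```

A 256-bit digest makes accidental collisions negligible, at the cost of more hashing work per series than a 64-bit fingerprint.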