Skip to content

Worry more about fingerprint clashes #717

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
bboreham opened this issue Feb 19, 2018 · 4 comments · Fixed by #3192
Open

Worry more about fingerprint clashes #717

bboreham opened this issue Feb 19, 2018 · 4 comments · Fixed by #3192
Labels
component/querier good first issue These are great first issues. If you are looking for a place to start, start here! help wanted type/bug

Comments

@bboreham
Copy link
Contributor

The chunk store fetch code assumes that fingerprints uniquely identify a timeseries; this is fairly likely when they are looking at a single metric, but we still could get clashes.

Perhaps use the sha256 from the index?

@gouthamve
Copy link
Contributor

This is the relevant bits: https://github.com/cortexproject/cortex/blob/master/pkg/querier/chunk_store_queryable.go#L51-L68

The idea is that 2 different series could have same fingerprint and then we just say the chunks for both belong to one.

@pracucci
Copy link
Contributor

pracucci commented Dec 5, 2019

From a quick look through the code, I've the feeling that we also suffer hash collisions in results cache too (see cache.HashKey usage).

gouthamve added a commit to gouthamve/cortex that referenced this issue Sep 17, 2020
This fixes cortexproject#717

We hit issue in production when our customer issued a query that issued
a simple query that touched 20K series, many of which had hash
collisions on `client.FastFingerprint`. This made our querier code
merge multiple series together which caused counters to go up
and down and caused rate() to return weird artifacts.

It is quite tricky to debug and I now have a test up which proves that
we will suffer if we don't handle collisions explicitly. After looking
at several solutions for fixing that, I've finally settled on something
super simple, just use a string as the map key.

Here is a benchmark where I insert and lookup 100K series:

goos: linux
goarch: amd64
pkg: github.com/cortexproject/cortex/pkg/distributor
BenchmarkSeriesMap-8         842          91633894 ns/op
PASS
ok      github.com/cortexproject/cortex/pkg/distributor 86.307s

Now I could have just used labels.String(), but the performance there is
quite bad:

goos: linux
goarch: amd64
pkg: github.com/cortexproject/cortex/pkg/distributor
BenchmarkSeriesMap-8         195         373104859 ns/op
PASS
ok      github.com/cortexproject/cortex/pkg/distributor 110.456s

Compare this to using client.Fingerprint for the hashing:

goos: linux
goarch: amd64
pkg: github.com/cortexproject/cortex/pkg/distributor
BenchmarkSeriesMap-8        1273          54778130 ns/op

This means that we've gotten 70% in this section, but as explained below
this is very small when compared to the overall query.

----------------

Now the reason I've stuck to the simplified case is because it takes
<100ms for doing this over 100K series, and in most cases the network
time to load all the chunks and iterate through them is several orders
of magnitude higher.

And tbh, I've looked at manually handling collsions, and we need to do
labels.Equal(l1, l2) to see if the series we're looking is the actual
series in the map or a collision, and the perf there actually worse.

I'm open to ideas here. Also note that the values of each map are
different, and any complex solution would require interface{} which is
arguably worse. sha256 has been tried and is worse in comparision.

Signed-off-by: Goutham Veeramachaneni <[email protected]>
pracucci pushed a commit that referenced this issue Sep 17, 2020
* Dont rely on hashes for collecting chunks together

This fixes #717

We hit issue in production when our customer issued a query that issued
a simple query that touched 20K series, many of which had hash
collisions on `client.FastFingerprint`. This made our querier code
merge multiple series together which caused counters to go up
and down and caused rate() to return weird artifacts.

It is quite tricky to debug and I now have a test up which proves that
we will suffer if we don't handle collisions explicitly. After looking
at several solutions for fixing that, I've finally settled on something
super simple, just use a string as the map key.

Here is a benchmark where I insert and lookup 100K series:

goos: linux
goarch: amd64
pkg: github.com/cortexproject/cortex/pkg/distributor
BenchmarkSeriesMap-8         842          91633894 ns/op
PASS
ok      github.com/cortexproject/cortex/pkg/distributor 86.307s

Now I could have just used labels.String(), but the performance there is
quite bad:

goos: linux
goarch: amd64
pkg: github.com/cortexproject/cortex/pkg/distributor
BenchmarkSeriesMap-8         195         373104859 ns/op
PASS
ok      github.com/cortexproject/cortex/pkg/distributor 110.456s

Compare this to using client.Fingerprint for the hashing:

goos: linux
goarch: amd64
pkg: github.com/cortexproject/cortex/pkg/distributor
BenchmarkSeriesMap-8        1273          54778130 ns/op

This means that we've gotten 70% in this section, but as explained below
this is very small when compared to the overall query.

----------------

Now the reason I've stuck to the simplified case is because it takes
<100ms for doing this over 100K series, and in most cases the network
time to load all the chunks and iterate through them is several orders
of magnitude higher.

And tbh, I've looked at manually handling collsions, and we need to do
labels.Equal(l1, l2) to see if the series we're looking is the actual
series in the map or a collision, and the perf there actually worse.

I'm open to ideas here. Also note that the values of each map are
different, and any complex solution would require interface{} which is
arguably worse. sha256 has been tried and is worse in comparision.

Signed-off-by: Goutham Veeramachaneni <[email protected]>

* Add changelog entry

Signed-off-by: Goutham Veeramachaneni <[email protected]>

* Address feedback

Signed-off-by: Goutham Veeramachaneni <[email protected]>
@bboreham
Copy link
Contributor Author

bboreham commented Sep 21, 2020

Not fixed by #3192 which addresses a similar problem in the querier.

I was thinking of code like this:

filtered, keys := filterChunksByUniqueFingerprint(filtered)

I also found

metrics[m.Fingerprint()] = m

fp := ss.Metric.Fingerprint()
mss, ok := fpToSampleStream[fp]
if !ok {
mss = &model.SampleStream{
Metric: ss.Metric,
}
fpToSampleStream[fp] = mss

@bboreham bboreham reopened this Sep 21, 2020
@gouthamve gouthamve added good first issue These are great first issues. If you are looking for a place to start, start here! help wanted and removed not-as-easy-as-it-looks labels Sep 21, 2020
@bboreham
Copy link
Contributor Author

Whilst the first example can be dropped when we deprecate chunks (#4268), the others still seem to be valid.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/querier good first issue These are great first issues. If you are looking for a place to start, start here! help wanted type/bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants