Skip to content

Dont rely on hashes for collecting chunks together #3192

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Sep 17, 2020

Conversation

gouthamve
Copy link
Contributor

@gouthamve gouthamve commented Sep 16, 2020

This fixes #717

We hit issue in production when our customer issued a query that issued
a simple query that touched 20K series, many of which had hash
collisions on client.FastFingerprint. This made our querier code
merge multiple series together which caused counters to go up
and down and caused rate() to return weird artifacts.

It is quite tricky to debug and I now have a test up which proves that
we will suffer if we don't handle collisions explicitly. After looking
at several solutions for fixing that, I've finally settled on something
super simple, just use a string as the map key.

Here is a benchmark where I insert and lookup 100K series:

goos: linux
goarch: amd64
pkg: github.com/cortexproject/cortex/pkg/distributor
BenchmarkSeriesMap-8         842          91633894 ns/op
PASS
ok      github.com/cortexproject/cortex/pkg/distributor 86.307s

Now I could have just used labels.String(), but the performance there is
quite bad:

goos: linux
goarch: amd64
pkg: github.com/cortexproject/cortex/pkg/distributor
BenchmarkSeriesMap-8         195         373104859 ns/op
PASS
ok      github.com/cortexproject/cortex/pkg/distributor 110.456s

Compare this to using client.Fingerprint for the hashing:

goos: linux
goarch: amd64
pkg: github.com/cortexproject/cortex/pkg/distributor
BenchmarkSeriesMap-8        1273          54778130 ns/op

This means that we've gotten 70% in this section, but as explained below
this is very small when compared to the overall query.


Now the reason I've stuck to the simplified case is because it takes
<100ms for doing this over 100K series, and in most cases the network
time to load all the chunks and iterate through them is several orders
of magnitude higher.

And tbh, I've looked at manually handling collsions, and we need to do
labels.Equal(l1, l2) to see if the series we're looking is the actual
series in the map or a collision, and the perf there actually worse.

I'm open to ideas here. Also note that the values of each map are
different, and any complex solution would require interface{} which is
arguably worse. sha256 has been tried and is worse in comparision.

Signed-off-by: Goutham Veeramachaneni [email protected]

What this PR does:

Which issue(s) this PR fixes:
Fixes #717

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@gouthamve gouthamve requested review from pracucci, pstibrany and bboreham and removed request for pracucci September 16, 2020 15:35
Copy link
Contributor

@jtlisi jtlisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I agree the extra time spent here is marginal with respect to the overall query path and the added complexity of a collision handling approach is probably not worth the time spent implementing it.

suggestion: Since label.Bytes() resets the length of the provided []bytes slice I updated the code to do the following:

query.go

// LabelsToKeyString is used to form a string to be used as
// the hashKey. Don't print, use l.String() for printing.
func LabelsToKeyString(l labels.Labels, b []byte) string {
	return string(l.Bytes(b))
}

query_test.go

func benchmarkSeriesMap(numSeries int, b *testing.B) {
	series := makeSeries(numSeries)
	sm := map[string]int{}

        // instantiate a reusable slice
	buf := make([]byte, 0, 1024)
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		for i, s := range series {
			sm[LabelsToKeyString(s, buf)] = i
		}

By reusing the byte array you can shave off about 10%. Although this may not be worth it.

$ /usr/local/bin/go test -benchmem -run=^$ github.com/cortexproject/cortex/pkg/distributor -bench ^BenchmarkSeriesMap
goos: darwin
goarch: amd64
pkg: github.com/cortexproject/cortex/pkg/distributor
BenchmarkSeriesMap-8   	      13	  82358056 ns/op	29421232 B/op	  200303 allocs/op
PASS
ok  	github.com/cortexproject/cortex/pkg/distributor	3.026s

Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with this approach. Please include memory allocations in your benchmarks, and see comment about lowering length when allocating byte slice.

for _, c := range chunks {
fp := client.Fingerprint(c.Metric)
chunksBySeries[fp] = append(chunksBySeries[fp], c)
key := distributor.LabelsToKeyString(c.Metric)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this function to some utils or "client" package where other fingerprinting functions are?

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (modulo Peter's comments)

This fixes cortexproject#717

We hit issue in production when our customer issued a query that issued
a simple query that touched 20K series, many of which had hash
collisions on `client.FastFingerprint`. This made our querier code
merge multiple series together which caused counters to go up
and down and caused rate() to return weird artifacts.

It is quite tricky to debug and I now have a test up which proves that
we will suffer if we don't handle collisions explicitly. After looking
at several solutions for fixing that, I've finally settled on something
super simple, just use a string as the map key.

Here is a benchmark where I insert and lookup 100K series:

goos: linux
goarch: amd64
pkg: github.com/cortexproject/cortex/pkg/distributor
BenchmarkSeriesMap-8         842          91633894 ns/op
PASS
ok      github.com/cortexproject/cortex/pkg/distributor 86.307s

Now I could have just used labels.String(), but the performance there is
quite bad:

goos: linux
goarch: amd64
pkg: github.com/cortexproject/cortex/pkg/distributor
BenchmarkSeriesMap-8         195         373104859 ns/op
PASS
ok      github.com/cortexproject/cortex/pkg/distributor 110.456s

Compare this to using client.Fingerprint for the hashing:

goos: linux
goarch: amd64
pkg: github.com/cortexproject/cortex/pkg/distributor
BenchmarkSeriesMap-8        1273          54778130 ns/op

This means that we've gotten 70% in this section, but as explained below
this is very small when compared to the overall query.

----------------

Now the reason I've stuck to the simplified case is because it takes
<100ms for doing this over 100K series, and in most cases the network
time to load all the chunks and iterate through them is several orders
of magnitude higher.

And tbh, I've looked at manually handling collsions, and we need to do
labels.Equal(l1, l2) to see if the series we're looking is the actual
series in the map or a collision, and the perf there actually worse.

I'm open to ideas here. Also note that the values of each map are
different, and any complex solution would require interface{} which is
arguably worse. sha256 has been tried and is worse in comparision.

Signed-off-by: Goutham Veeramachaneni <[email protected]>
Signed-off-by: Goutham Veeramachaneni <[email protected]>
@gouthamve
Copy link
Contributor Author

I tried the 3 options and am getting some very confusing answers:

With b := make([]byte, 0, 512)

goos: linux
goarch: amd64
pkg: github.com/cortexproject/cortex/pkg/ingester/client
BenchmarkSeriesMap-8         794          91165196 ns/op        28800429 B/op     200002 allocs/op
PASS
ok      github.com/cortexproject/cortex/pkg/ingester/client     82.069s

With b := make([]byte, 0, 1024)

goos: linux
goarch: amd64
pkg: github.com/cortexproject/cortex/pkg/ingester/client
BenchmarkSeriesMap-8         792          91004586 ns/op        28800442 B/op     200002 allocs/op
PASS
ok      github.com/cortexproject/cortex/pkg/ingester/client     81.759s

With

numBytes := 0
for _, lv := range l {
	numBytes += 2 + len(lv.Name) + len(lv.Value)
}
b := make([]byte, 0, numBytes
goos: linux
goarch: amd64
pkg: github.com/cortexproject/cortex/pkg/ingester/client
BenchmarkSeriesMap-8         702          99405354 ns/op        57600479 B/op     400002 allocs/op
PASS
ok      github.com/cortexproject/cortex/pkg/ingester/client     80.696s

Signed-off-by: Goutham Veeramachaneni <[email protected]>
@pstibrany
Copy link
Contributor

pstibrany commented Sep 17, 2020

I tried the 3 options and am getting some very confusing answers:

You're right. This make call doesn't really cause any allocation, thanks to Go optimizations:

./query.go:259:24: l does not escape
./query.go:260:11: make([]byte, 0, 1024) does not escape
./query.go:261:15: string(l.Bytes(b)) escapes to heap

(running go test -c -gcflags '-m' in the package)

@pracucci pracucci merged commit 9f3ce2c into cortexproject:master Sep 17, 2020
@gouthamve gouthamve deleted the fix-717 branch September 17, 2020 11:03
@pracucci pracucci mentioned this pull request Sep 21, 2020
3 tasks
Comment on lines +173 to +174
hashToChunkseries := map[string]ingester_client.TimeSeriesChunk{}
hashToTimeSeries := map[string]ingester_client.TimeSeries{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are now misnamed: the map key is not a hash.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Worry more about fingerprint clashes
5 participants