-
Notifications
You must be signed in to change notification settings - Fork 817
Fixed per-series sharding token generation in the distributor #1888
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed per-series sharding token generation in the distributor #1888
Conversation
Signed-off-by: Marco Pracucci <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As PR #1726 hasn't been merged yet, perhaps this should have been a comment on that PR? What is the correct merge order now?
@pstibrany It is merged. |
You're right. So no problem there. This is a tricky bug, good that you've found it. |
pkg/distributor/distributor.go
Outdated
|
||
// In case of a fatal error (ie. logic bug) we should fail the entire | ||
// request and not consider it a partial error. | ||
if err == errShardLabelsNotSorted { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seem like it might be a little brittle to me - previously all errors from validateSeries
were "soft", now some are special.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. From a design perspective, I think the main issue is that both labels removal and sharding token generation is done inside the validateSeries()
which sounds "weird" (it does more than "validation").
I've pushed a commit where both labels removal and sharding token generation has been moved outside.
What's your take now?
…he sharding token Signed-off-by: Marco Pracucci <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, though I think in general it would be good to move more logic out of push and into other functions, which is part of why I included more than just "validation" in the valdiateSeries
function. The name was not clear.
Also looking at the function again it's clear I'd intended to do something different with the keys and that got lost in the review at some point. I don't know what the stance on something like this is, but validateLabels
could have been updated to return uint32, client.PreallocTimseries, error
.
Since we've managed to re-shard the world at least twice in the past year, would it be worth having a unit test that fails if it happens again? |
Signed-off-by: Marco Pracucci <[email protected]>
Agree @bboreham. I've added a test hardcoding some expected tokens, given we don't want them to change over the time. What's about now? I've also run the same test (removing the case related to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
What this PR does:
The PR #1726 introduced some changes in the distributor to allow to drop labels on a per-tenant basis. These changes also unintentionally changed the actual per-series sharding token: we noticed it while rolling out in one of our clusters, where an almost full resharding occurred (switching the cluster from 5M active series to 9M active series until the old sharing chunks have been flushed).
Before the PR #1726, the token was generated based on the labels without the HA replica label. After the PR #1726, the token is generated based on the labels with the HA replica label.
In this PR I'm fixing it, rolling back to the old behaviour, while honoring the intention of the PR #1726: the token should be generated based on the labels without the HA replica label and without the dropped ones.
I haven't updated the changelog because the PR #1726 has been merged after the
0.4.0
release, so it's still in the unreleased limbo.Which issue(s) this PR fixes:
No issue
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]