Skip to content

Fixed per-series sharding token generation in the distributor #1888

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

pracucci
Copy link
Contributor

@pracucci pracucci commented Dec 6, 2019

What this PR does:
The PR #1726 introduced some changes in the distributor to allow to drop labels on a per-tenant basis. These changes also unintentionally changed the actual per-series sharding token: we noticed it while rolling out in one of our clusters, where an almost full resharding occurred (switching the cluster from 5M active series to 9M active series until the old sharing chunks have been flushed).

Before the PR #1726, the token was generated based on the labels without the HA replica label. After the PR #1726, the token is generated based on the labels with the HA replica label.

In this PR I'm fixing it, rolling back to the old behaviour, while honoring the intention of the PR #1726: the token should be generated based on the labels without the HA replica label and without the dropped ones.

I haven't updated the changelog because the PR #1726 has been merged after the 0.4.0 release, so it's still in the unreleased limbo.

Which issue(s) this PR fixes:
No issue

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As PR #1726 hasn't been merged yet, perhaps this should have been a comment on that PR? What is the correct merge order now?

@pracucci
Copy link
Contributor Author

pracucci commented Dec 6, 2019

As PR #1726 hasn't been merged yet

@pstibrany It is merged.

@pstibrany
Copy link
Contributor

@pstibrany It is merged.

You're right. So no problem there. This is a tricky bug, good that you've found it.


// In case of a fatal error (ie. logic bug) we should fail the entire
// request and not consider it a partial error.
if err == errShardLabelsNotSorted {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seem like it might be a little brittle to me - previously all errors from validateSeries were "soft", now some are special.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. From a design perspective, I think the main issue is that both labels removal and sharding token generation is done inside the validateSeries() which sounds "weird" (it does more than "validation").

I've pushed a commit where both labels removal and sharding token generation has been moved outside.

What's your take now?

@pracucci pracucci requested a review from gouthamve December 9, 2019 17:52
Copy link
Contributor

@cstyan cstyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, though I think in general it would be good to move more logic out of push and into other functions, which is part of why I included more than just "validation" in the valdiateSeries function. The name was not clear.

Also looking at the function again it's clear I'd intended to do something different with the keys and that got lost in the review at some point. I don't know what the stance on something like this is, but validateLabels could have been updated to return uint32, client.PreallocTimseries, error.

@bboreham
Copy link
Contributor

bboreham commented Dec 9, 2019

Since we've managed to re-shard the world at least twice in the past year, would it be worth having a unit test that fails if it happens again?

@pracucci
Copy link
Contributor Author

pracucci commented Dec 10, 2019

Since we've managed to re-shard the world at least twice in the past year, would it be worth having a unit test that fails if it happens again?

Agree @bboreham. I've added a test hardcoding some expected tokens, given we don't want them to change over the time. What's about now?

I've also run the same test (removing the case related to DropLabels) against 0.4.0 and it passes.

Copy link
Contributor

@bboreham bboreham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@bboreham bboreham merged commit b5f5c19 into cortexproject:master Dec 11, 2019
@pracucci pracucci deleted the fix-resharing-due-to-distributor-key-generation-logic branch December 11, 2019 13:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants