Skip to content

Add ingesters shuffle sharding support on the read path #3252

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

pracucci
Copy link
Contributor

@pracucci pracucci commented Sep 30, 2020

What this PR does:
This PR aims to implement the ingesters shuffle sharding on the read path, as described in this design doc.

The idea is that, when distributors/ingesters shuffle sharding is enabled, we only query data from the subset of ingesters holding the data for a specific tenant.

This PR builds on top of #3299.

Documentation has not been updated yet.

Which issue(s) this PR fixes:
N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great job!

@@ -88,18 +91,24 @@ func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
f.DurationVar(&cfg.LookbackDelta, "querier.lookback-delta", 5*time.Minute, "Time since the last sample after which a time series is considered stale and ignored by expression evaluations.")
f.StringVar(&cfg.SecondStoreEngine, "querier.second-store-engine", "", "Second store engine to use for querying. Empty = disabled.")
f.Var(&cfg.UseSecondStoreBeforeTime, "querier.use-second-store-before-time", "If specified, second store is only used for queries before this timestamp. Default value 0 means secondary store is always queried.")
f.DurationVar(&cfg.ShuffleShardingIngestersLookbackPeriod, "querier.shuffle-sharding-ingesters-lookback-period", 0, "When distributor's sharding strategy is shuffle-sharding and this setting is > 0, queriers fetch in-memory series from the minimum set of required ingesters, selecting only ingesters which may have received series since 'now - lookback period'. If this setting is 0, queriers always query all ingesters (ingesters shuffle sharding on read path is disabled).")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Advice on what is the reasonable value to use or how to compute it would be useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since adding all such details here may be "complicated" I was thinking to do it in the doc. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

@pracucci pracucci force-pushed the ingesters-read-path-shuffle-sharding branch 2 times, most recently from 2fd3266 to 6a3110f Compare October 12, 2020 08:03
@pracucci pracucci force-pushed the ingesters-read-path-shuffle-sharding branch from 6a3110f to 6a2f9a7 Compare October 12, 2020 13:44
Copy link
Contributor

@ranton256 ranton256 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few questions and comments, nothing I found was related to correctness or blocking.

@@ -43,9 +45,16 @@ func TestIngesterSharding(t *testing.T) {
defer s.Close()

flags := BlocksStorageFlags
flags["-distributor.shard-by-all-labels"] = "true"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does ingester shuffle sharding on read path depend on shard-by-all-labels being set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it doesn't. I enabled it in this test just because it's how we (and to my knowledge most of users) run Cortex, so to have an integration test which better reflects real setups.

return nil, err
}
if reallyAll {
replicationSet.MaxErrors = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was this reallyAll branch that's being removed used for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. reallyAll was used only by UserStats() to make sure the request will fail if querying any ingester will fail (by default we allow 1 failure, due to quorum). The logic has not changed: I've just moved replicationSet.MaxErrors = 0 to UserStats().

@pracucci pracucci force-pushed the ingesters-read-path-shuffle-sharding branch from bea8d1f to 612077c Compare October 13, 2020 12:29
@@ -7,7 +7,7 @@ SLOW=
TAGS=
PARALLEL=
RACE="-race -covermode=atomic"
TIMEOUT=5m
TIMEOUT=10m
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests run with -race in CI and, when race is enabled, they're very slow, especially new "stress" tests I've added for shuffle sharding. However, at this stage I believe we should prefer coverage over speed. We'll later work to speed up CI.

Copy link
Contributor

@ranton256 ranton256 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the answers and the unit test.

… the shard size is increased or instances in a new zone are added

Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
@pracucci pracucci force-pushed the ingesters-read-path-shuffle-sharding branch from 67a8cb6 to a08b64c Compare October 15, 2020 11:20
@pracucci pracucci merged commit 0ecd515 into cortexproject:master Oct 15, 2020
@pracucci pracucci deleted the ingesters-read-path-shuffle-sharding branch October 15, 2020 12:01
gotjosh added a commit to gotjosh/cortex that referenced this pull request Oct 20, 2020
…rgid-ctx

* 'master' of github.com:cortexproject/cortex:
  Enforce integration tests default flags config to never be overwritten (cortexproject#3370)
  Avoid deletion of blocks which are not shipped (cortexproject#3346)
  Upgrade Thanos to latest master (cortexproject#3363)
  Migrate CircleCI workflows to GitHub Actions (2/3) (cortexproject#3341)
  Remove comments that doesn't seem right (cortexproject#3361)
  add ingester interface (cortexproject#3352)
  Fail fast an ingester if unable to load existing TSDBs (cortexproject#3354)
  Fixed Gossip memberlist members joining when addresses are configured using DNS-based service discovery (cortexproject#3360)
  Export distributor method to get ingester replication set (cortexproject#3356)
  Correct link for Block Storage reference (cortexproject#3234)
  Added section on Cleaner. (cortexproject#3327)
  Update prometheus vendor to master (cortexproject#3345)
  adding GHA CI env variable check (cortexproject#3351)
  Add ingesters shuffle sharding support on the read path (cortexproject#3252)
simonswine pushed a commit to grafana/e2e that referenced this pull request Jan 13, 2022
…t/cortex#3252)

* Added ingesters shuffle sharding support on the read path

Signed-off-by: Marco Pracucci <[email protected]>

* Committed broken test to signal there's an issue

Signed-off-by: Marco Pracucci <[email protected]>

* Fixed shuffle sharding consistency when zone-awareness is enabled and the shard size is increased or instances in a new zone are added

Signed-off-by: Marco Pracucci <[email protected]>

* Added ShuffleShardingIngestersLookbackPeriod validation

Signed-off-by: Marco Pracucci <[email protected]>

* Tiny getIngestersForQuery() optimisation

Signed-off-by: Marco Pracucci <[email protected]>

* Added unit tests on distributor changes

Signed-off-by: Marco Pracucci <[email protected]>

* Added integration test

Signed-off-by: Marco Pracucci <[email protected]>

* Fixed race in unit test

Signed-off-by: Marco Pracucci <[email protected]>

* Added CHANGELOG entry

Signed-off-by: Marco Pracucci <[email protected]>

* Removed ring.ByZoneAndAddr because unused

Signed-off-by: Marco Pracucci <[email protected]>

* Addressed review comments

Signed-off-by: Marco Pracucci <[email protected]>

* Fixed TestRing_ShuffleShardWithLookback_CorrectnessWithFuzzy after wrong rebase

Signed-off-by: Marco Pracucci <[email protected]>

* Fixed tests after rebase

Signed-off-by: Marco Pracucci <[email protected]>

* Added unit test on ReplicationSet.GetAddresses()

Signed-off-by: Marco Pracucci <[email protected]>

* Improved TestRing_ShuffleShardWithLookback

Signed-off-by: Marco Pracucci <[email protected]>

* Fixed TestRing_ShuffleShardWithLookback_CorrectnessWithFuzzy

Signed-off-by: Marco Pracucci <[email protected]>

* Increased unit tests timeout

Signed-off-by: Marco Pracucci <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants