Add shuffle sharding grouper/planner #4357
Conversation
I wanted to discuss a change in the block compaction behavior that this PR would introduce. The current implementation of the Thanos compactor will always compact the first set of overlapping blocks if such a set exists. This means that if the most recently ingested set of blocks from multiple ingesters overlaps, those blocks will be compacted. When this happens, there is potentially a "missing" block in the compaction with the Thanos planner, since an ingester may not have fully uploaded its block by the time the compaction begins. So if there are 3 overlapping blocks when the compaction begins, and they are the latest blocks passed to the Thanos planner, the planner will plan a compaction of those 3 blocks even if a fourth ingester has yet to upload a block.

With this PR, overlapping blocks will not be compacted if they are the last set of blocks. In the example above, the 3 blocks won't be compacted while they are the latest ones and don't cover a full range. In a real-world situation, this would only affect customers who stop ingesting blocks: the last group of n blocks, where n is the number of ingesters, will remain uncompacted for as long as they are the latest blocks. The impact of leaving the last n blocks uncompacted would be increased storage size as well as query time (if customers continue to query even after stopping ingesting blocks). One thing to note with the Thanos approach: there can be duplicate work if the blocks are compacted and another overlapping block is uploaded after the compaction begins.

A couple of alternative approaches I considered: grouping overlapping blocks before grouping by compactable ranges, which makes the compaction behavior with these changes the same as Thanos; or, if no new blocks arrive within the smallest block range measured from the max time of all the blocks, compacting the overlapping blocks even if they are the latest ones. Something else I considered is making this a toggle to let users define their own preference, but I think that isn't ideal, as it would mean either supporting the toggle indefinitely or eventually having users switch to a single behavior.

Small example illustrating what's mentioned above: 4 total blocks with 1 block incoming (not yet uploaded).
Thanos compaction: Compacting the above blocks with the current (Thanos) compaction, using time ranges [20, 120, 240], would result in these blocks:
Afterwards, once block 5 is fully uploaded, the final resulting blocks from a single run of the compaction will be:
With these blocks, another compaction will be needed to fully compact the overlapping blocks 2-5.

New compaction behavior: With this PR and the shuffle-sharding strategy, the blocks would remain uncompacted until a block more recent than blocks 2-5 is uploaded. Once that block is uploaded, blocks 2-5 would be compacted in one compaction.

The downside of this approach is that the uncompacted blocks 2-5 are stored for longer than with the current (Thanos) approach, since the planner waits for a more recent block to be uploaded before compacting them. Given the above, I was wondering what your thoughts are on which approach would be preferable?
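To make the rule concrete, here is a minimal, self-contained Go sketch of the skip condition described above. This is not the PR's actual code: the `block` type, the single `rangeMs` parameter, and the alignment check are simplifying assumptions for illustration.

```go
package main

import "fmt"

// block is a simplified stand-in for TSDB block metadata.
type block struct {
	MinTime int64 // ms, inclusive
	MaxTime int64 // ms, exclusive
}

// coversFullRange reports whether the group spans a complete, aligned
// compaction range of length rangeMs.
func coversFullRange(group []block, rangeMs int64) bool {
	min, max := group[0].MinTime, group[0].MaxTime
	for _, b := range group[1:] {
		if b.MinTime < min {
			min = b.MinTime
		}
		if b.MaxTime > max {
			max = b.MaxTime
		}
	}
	return min%rangeMs == 0 && max-min >= rangeMs
}

// shouldCompact encodes the rule discussed above: an overlapping group is
// skipped when it contains the globally newest block (globalMaxTime) and
// does not yet cover a full range, because another ingester may still be
// uploading an overlapping block.
func shouldCompact(group []block, rangeMs, globalMaxTime int64) bool {
	containsNewest := false
	for _, b := range group {
		if b.MaxTime == globalMaxTime {
			containsNewest = true
		}
	}
	return !containsNewest || coversFullRange(group, rangeMs)
}

func main() {
	// Blocks 2-5 from the example: latest, overlapping, spanning only [20, 40),
	// checked against the 120 range from the example configuration.
	group := []block{{20, 40}, {20, 40}, {20, 40}, {20, 40}}
	fmt.Println(shouldCompact(group, 120, 40)) // false: wait for a newer block
}
```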
This PR replaces and implements the changes recommended in #4318
Discussed in the community call and leaving the blocks uncompacted is okay.
The code is long and I didn't read through every line. Broadly it looks ok.
I did wonder why the word "thanos" shows up so often - if the code is copied from Thanos it should say so, and if not can you just explain your thinking to me?
```go
garbageCollectedBlocks: garbageCollectedBlocks,
hashFunc:               hashFunc,
compactions: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
	Name: "thanos_compact_group_compactions_total",
```
Do we want to add new metrics in Cortex starting with "thanos_"?
Added a note about where the metrics were copied from in Thanos. With these changes, wouldn't the metrics in Cortex remain the same? They are only used when creating a new group using compact.NewGroup, which is what is being done now (https://github.com/cortexproject/cortex/blob/master/vendor/github.com/thanos-io/thanos/pkg/compact/compact.go#L262-L312)
Then it might be better to expose those metrics as another function in Thanos?
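For illustration, a hypothetical helper along the lines of that suggestion could look like the sketch below. The `GroupMetrics` type, the `NewGroupMetrics` name, and the `group` label are assumptions for this sketch, not existing Thanos API; only the two metric names are taken from the code under review.

```go
package compact

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// GroupMetrics bundles the per-group compaction counters so callers such as
// Cortex could reuse the canonical metric definitions instead of copying them.
type GroupMetrics struct {
	Compactions        *prometheus.CounterVec
	CompactionFailures *prometheus.CounterVec
}

// NewGroupMetrics registers the counters with reg and returns them for use
// when constructing compaction groups.
func NewGroupMetrics(reg prometheus.Registerer) *GroupMetrics {
	return &GroupMetrics{
		Compactions: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
			Name: "thanos_compact_group_compactions_total",
			Help: "Total number of group compaction attempts that resulted in a new block.",
		}, []string{"group"}), // label assumed for illustration
		CompactionFailures: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
			Name: "thanos_compact_group_compactions_failures_total",
			Help: "Total number of failed group compactions.",
		}, []string{"group"}),
	}
}
```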
Any way I can help with this PR? We're running into limits in our compaction (we have about 25M active time series in a single-tenant Cortex). I'd be happy to run pre-release compactor builds if this needs some kind of validation.
The error message from the build is:
Since this looks like a useful PR that we want to merge, and I don't own the original branch, I will create a new branch to work on resolving the error.
Signed-off-by: Albert <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
Signed-off-by: Albert <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
…ortexproject#4262) * add MaxRetries to WaitInstanceState Signed-off-by: Albert <[email protected]> * update CHANGELOG.md Signed-off-by: Albert <[email protected]> * Add timeout for waiting on compactor to become ACTIVE in the ring. Signed-off-by: Albert <[email protected]> * add MaxRetries variable back to WaitInstanceState Signed-off-by: Albert <[email protected]> * Fix linting issues Signed-off-by: Albert <[email protected]> * Remove duplicate entry from changelog Signed-off-by: Albert <[email protected]> * Address PR comments and set timeout to be configurable Signed-off-by: Albert <[email protected]> * Address PR comments and fix tests Signed-off-by: Albert <[email protected]> * Update unit tests Signed-off-by: Albert <[email protected]> * Update changelog and fix linting Signed-off-by: Albert <[email protected]> * Fixed CHANGELOG entry order Signed-off-by: Marco Pracucci <[email protected]> Co-authored-by: Albert <[email protected]> Co-authored-by: Marco Pracucci <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
* MergeIterator: allocate less memory at first We were allocating 24x the number of streams of batches, where each batch holds up to 12 samples. By allowing `c.batches` to reallocate when needed, we avoid the need to pre-allocate enough memory for all possible scenarios. * chunk_test: fix inaccurate end time on chunks The `through` time is supposed to be the last time in the chunk, and having it one step higher was throwing off other tests and benchmarks. * MergeIterator benchmark: add more realistic sizes At 15-second scrape intervals a chunk covers 30 minutes, so 1,000 chunks is about three weeks, a highly unrepresentative test. Instant queries, such as those done by the ruler, will only fetch one chunk from each ingester. Signed-off-by: Bryan Boreham <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
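A rough Go sketch of the allocation change this commit describes; the type names, the 24x factor's placement, and `newBatchBuffer` are assumptions, not the actual MergeIterator code.

```go
package main

import "fmt"

// batch stands in for the iterator's batch type; each holds up to 12 samples.
type batch [12]float64

// newBatchBuffer shows the change in spirit: start with a small capacity and
// let append reallocate on demand, instead of reserving 24x the stream count.
func newBatchBuffer(numStreams int) []batch {
	// Before (assumed): make([]batch, 0, numStreams*24)
	return make([]batch, 0, numStreams) // grows later only if actually needed
}

func main() {
	buf := newBatchBuffer(4)
	for i := 0; i < 10; i++ {
		buf = append(buf, batch{}) // append reallocates past the initial cap
	}
	fmt.Println(len(buf), cap(buf) >= 10)
}
```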
* Expose default configuration values for memberlist. Set the defaults for various memberlist configuration values based on the "Default LAN" configuration. The only result of this change is that the defaults are now visible and are in the documentation. This also means that if the default values change, then the changes are visible in the documentation, whereas before they would have gone unnoticed. To prevent this being a breaking change, the existing behaviour is retained, in case anyone is explicitly setting the values to zero and expecting the default to be used. Signed-off-by: Steve Simpson <[email protected]> * Remove use of zero value as default value indicator. Signed-off-by: Steve Simpson <[email protected]> * Review comments. Signed-off-by: Steve Simpson <[email protected]> * Review comments. Signed-off-by: Steve Simpson <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
cortexproject#4342) * Allow setting ring heartbeat timeout to zero to disable timeout check. This change allows the various ring heartbeat timeouts to be configured with zero, as a means of disabling the timeout. This is expected to be used with a separate enhancement to allow disabling heartbeats. When the heartbeat timeout is disabled, instances will always appear as healthy in the ring. Signed-off-by: Steve Simpson <[email protected]> * Review comments. Signed-off-by: Steve Simpson <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
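A minimal sketch of the zero-disables-timeout semantics this commit describes, assuming a `lastHeartbeat` timestamp and a configurable timeout; it is illustrative, not the actual ring health-check code.

```go
package main

import (
	"fmt"
	"time"
)

// isHealthy reports whether an instance's heartbeat is recent enough.
// A zero timeout disables the check, so the instance always looks healthy.
func isHealthy(lastHeartbeat time.Time, timeout time.Duration, now time.Time) bool {
	if timeout == 0 {
		return true // timeout disabled: never mark unhealthy
	}
	return now.Sub(lastHeartbeat) <= timeout
}

func main() {
	stale := time.Now().Add(-10 * time.Minute)
	fmt.Println(isHealthy(stale, time.Minute, time.Now())) // false
	fmt.Println(isHealthy(stale, 0, time.Now()))           // true: disabled
}
```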
…time. (cortexproject#4317) * Add a new config and metric for reporting ruler query execution wall time. Signed-off-by: Tyler Reid <[email protected]> * Spacing and PR number fixup Signed-off-by: Tyler Reid <[email protected]> * Wrap the defer in a function to make it defer after the return rather than after the if block. Add a unit test to validate we're tracking time correctly. Signed-off-by: Tyler Reid <[email protected]> * Use seconds for our duration rather than nanoseconds Signed-off-by: Tyler Reid <[email protected]> * Review comment fixes Signed-off-by: Tyler Reid <[email protected]> * Update config flag in the config docs Signed-off-by: Tyler Reid <[email protected]> * Pass counter rather than counter vector for metrics query function Signed-off-by: Tyler Reid <[email protected]> * Fix comment in MetricsQueryFunction Signed-off-by: Tyler Reid <[email protected]> * Move query metric and log to separate function. Add log message for ruler query time. Signed-off-by: Tyler Reid <[email protected]> * Update config file and change log to show this a per user metric Signed-off-by: Tyler Reid <[email protected]> * code review fixes Signed-off-by: Tyler Reid <[email protected]> * update log message for ruler query metrics Signed-off-by: Tyler Reid <[email protected]> * Remove append and just use the array for key values in the log messag Signed-off-by: Tyler Reid <[email protected]> * Add query-frontend component to front end log message Signed-off-by: Tyler Reid <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
I thought it would be good to put a security page into the docs, so that it shows up in a search. Content is just pointing at other resources. Signed-off-by: Bryan Boreham <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
…xproject#4345) * Optimise memberlist kv store access by storing data unencoded. The following profile data was taken from running 50 idle ingesters with memberlist, with almost everything at default values (5s heartbeats):

```
52.16% mergeBytesValueForKey
+- 52.16% mergeValueForKey
+- 47.84% computeNewValue
+- 27.24% codec Proto Decode
+- 26.25% mergeWithTime
```

It is apparent from this that a lot of time is spent on the memberlist receive path, as might be expected; specifically, the merging of the update into the current state. The cost however is not in decoding the incoming states (which occurs in `mergeBytesValueForKey` before `mergeValueForKey`), but in fact in decoding the _current state_ of the value in the store (as it is stored encoded). The ring state was measured at 123K (50 ingesters), so it makes sense that decoding could be costly. This can be avoided by storing the value in its decoded `Mergeable` form. When doing this, care has to be taken to deep copy the value when accessed, as it is modified in place before being updated in the store, and accessed outside the store mutex. Note a side effect of this change is that it is no longer straightforward to expose the `memberlist_kv_store_value_bytes` metric, as this reported the size of the encoded data; therefore it has been removed. Signed-off-by: Steve Simpson <[email protected]> * Typo. Signed-off-by: Steve Simpson <[email protected]> * Review comments. Signed-off-by: Steve Simpson <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
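The deep-copy-on-access pattern this commit describes can be sketched as follows. The `Mergeable` interface here is a simplified stand-in for the real memberlist kv interface, and the `store` type is illustrative only.

```go
package kvsketch

import "sync"

// Mergeable is a stand-in for the memberlist kv Mergeable interface.
type Mergeable interface {
	Merge(other Mergeable) Mergeable
	Clone() Mergeable // deep copy
}

type store struct {
	mu     sync.Mutex
	values map[string]Mergeable // kept decoded, never re-encoded on merge
}

// merge applies an incoming update directly against the decoded current
// state, avoiding the costly decode of the stored value on every receive.
func (s *store) merge(key string, incoming Mergeable) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cur, ok := s.values[key]; ok {
		s.values[key] = cur.Merge(incoming)
		return
	}
	s.values[key] = incoming
}

// get deep-copies the value, because callers modify it in place and use it
// outside the store's mutex before writing it back.
func (s *store) get(key string) Mergeable {
	s.mu.Lock()
	defer s.mu.Unlock()
	if v, ok := s.values[key]; ok {
		return v.Clone()
	}
	return nil
}
```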
…o. (cortexproject#4344) * Allow disabling of ring heartbeats by setting relevant options to zero. Signed-off-by: Steve Simpson <[email protected]> * Review comments. Signed-off-by: Steve Simpson <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
…#4346) * Expose configuration of memberlist packet compression. Allows manually specifying whether memberlist should compress packets via a new configuration flag: `-memberlist.enable-compression`. This typically has little benefit for Cortex: as the ring state messages are already compressed with Snappy, the second layer of compression does not achieve any additional saving. It's not clear cut whether there might still be some benefit for internal memberlist messages; this needs to be evaluated in an environment of some reasonable scale. Signed-off-by: Steve Simpson <[email protected]> * Review comments. Signed-off-by: Steve Simpson <[email protected]> * Review comments. Signed-off-by: Steve Simpson <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
…exproject#4348) It was only waiting one second for the second sync to complete, which is probably a harsher deadline than necessary for overloaded systems. Signed-off-by: Steve Simpson <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
…xproject#4349) The test is writing a single silence and checking a metric which indicates whether replicating the silence has been attempted yet. This is so we can check later on that no replication activity occurs. The assertions later on in the test are passing, but the first one is not, indicating that the replication doesn't trigger early enough. This makes sense because the replication is not synchronous with the writing of the silence. Signed-off-by: Steve Simpson <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
) * Add proposal document Signed-off-by: Gofman <[email protected]> Signed-off-by: ilangofman <[email protected]> * Minor text modifications Signed-off-by: ilangofman <[email protected]> * Implement requested changes to the proposal Signed-off-by: ilangofman <[email protected]> * Fix mention of Compactor instead of purger in proposal Signed-off-by: ilangofman <[email protected]> * Fixed wording and spelling in proposal Signed-off-by: ilangofman <[email protected]> * Update the cache invalidation method Signed-off-by: ilangofman <[email protected]> * Fix wording on cache invalidation section Signed-off-by: ilangofman <[email protected]> * Minor wording additions Signed-off-by: ilangofman <[email protected]> * Remove white-noise from text Signed-off-by: ilangofman <[email protected]> * Remove the deleting state and change cache invalidation Signed-off-by: ilangofman <[email protected]> * Add deleted state and update cache invalidation Signed-off-by: ilangofman <[email protected]> * Add one word to clear things up Signed-off-by: ilangofman <[email protected]> * update api limits section Signed-off-by: ilangofman <[email protected]> * ran clean white noise Signed-off-by: ilangofman <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
Signed-off-by: Albert <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
Signed-off-by: Albert <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
Conventionally the minimum time would be before the maximum. Apparently none of the tests were depending on this. Signed-off-by: Bryan Boreham <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
We need to add the merged value back to the map. Extract merging as a separate function so it can be tested. Adapt the existing test to cover multiple series. Signed-off-by: Bryan Boreham <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
Rearrange `CHANGELOG.md` to conform to instructions in `pull_request_template.md`. Also add a `-` to a CLI flag to conform to instructions in `design-patterns-and-conventions.md`. Signed-off-by: Andrew Seigner <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
* Introduce `http` config settings in Azure storage Cortex v1.11.0 included thanos-io/thanos#3970, which added configuration options to Azure's http client and transport, replacing usage of `http.DefaultClient`. Unfortunately since Cortex was not setting this config, Cortex implicitly switched from `http.DefaultClient` to all empty values (e.g. `MaxIdleConns: 0` rather than 100). Introduce `http` config settings to Azure storage. This motivated moving `s3.HTTPConfig` into a new `pkg/storage/bucket/config` package, to allow `azure` and `s3` to share it. Also update the instructions for running the website to include installing `embedmd`. Signed-off-by: Andrew Seigner <[email protected]> * feedback: `config.HTTP` -> `http.Config` also back out changelog cleanup Signed-off-by: Andrew Seigner <[email protected]> * Back out accidental changelog addition Signed-off-by: Andrew Seigner <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
* Update Thanos to latest main Update Thanos dependency to include thanos-io/thanos#4928, to conserve memory. Signed-off-by: Andrew Seigner <[email protected]> * Update changelog to summarize user-facing changes Signed-off-by: Andrew Seigner <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
* Adding test case for dropping metrics by name to understand better flow of distributor Signed-off-by: Pedro Tanaka <[email protected]> * Adding test case and new metric for dropped samples Signed-off-by: Pedro Tanaka <[email protected]> * Updating CHANGELOG with new changes Signed-off-by: Pedro Tanaka <[email protected]> * Fixing linting problem on distributor file Signed-off-by: Pedro Tanaka <[email protected]> * Reusing discarded samples metric from validate package Signed-off-by: Pedro Tanaka <[email protected]> * Compare labelset with len() instead of comparing to nil Signed-off-by: Pedro Tanaka <[email protected]> * Undoing unnecessary changes on tests and distributor Signed-off-by: Pedro Tanaka <[email protected]> * Small rename on comment Signed-off-by: Pedro Tanaka <[email protected]> * Fixing linting offenses Signed-off-by: Pedro Tanaka <[email protected]> * Reseting validation dropped samples metric to avoid getting metrics from other test runs Signed-off-by: Pedro Tanaka <[email protected]> * Resolving problems after rebase conflicts Signed-off-by: Pedro Tanaka <[email protected]> * Registering counter for dropped metrics in test Signed-off-by: Pedro Tanaka <[email protected]> * Checking if user label drop configuration did not drop __name__ label Signed-off-by: Pedro Tanaka <[email protected]> * Do not check for name label, adding new test Signed-off-by: Pedro Tanaka <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
* Disable block deletion marks migration by default Flag is named `-compactor.block-deletion-marks-migration-enabled`. This feature was added in v1.7, so we expect most users to have upgraded by now. Signed-off-by: Bryan Boreham <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
Signed-off-by: Julien Pivotto <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
…ct#4602) * Upgrade Go to 1.17.5 for integration tests Signed-off-by: Arve Knudsen <[email protected]> * Upgrade to Go 1.17 in Dockerfiles Signed-off-by: Arve Knudsen <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
* Update build image. Signed-off-by: Peter Štibraný <[email protected]> * CHANGELOG.md Signed-off-by: Peter Štibraný <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
Signed-off-by: Arve Knudsen <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
This reverts commit f2656f8. Signed-off-by: Arve Knudsen <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
…#4440)" (cortexproject#4613) This reverts commit a635a1e. Signed-off-by: Arve Knudsen <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
* Federated ruler proposal Signed-off-by: Rees Dooley <[email protected]> Co-authored-by: Rees Dooley <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
) This reverts commit 19f3802. Signed-off-by: Arve Knudsen <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
…exproject#4614) Signed-off-by: Arve Knudsen <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
…er (cortexproject#4615) Signed-off-by: Arve Knudsen <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
…t#4617) Signed-off-by: Arve Knudsen <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
)" (cortexproject#4611) This reverts commit 32b1b40. Signed-off-by: Arve Knudsen <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
…project#4619) Signed-off-by: Arve Knudsen <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
Signed-off-by: Arve Knudsen <[email protected]> Signed-off-by: Alvin Lin <[email protected]>
Move the change log line to unreleased section Signed-off-by: Alvin Lin <[email protected]>
Signed-off-by: Alvin Lin <[email protected]>
Signed-off-by: Alvin Lin <[email protected]>
Signed-off-by: Alvin Lin <[email protected]>
Signed-off-by: Alvin Lin <[email protected]>
Force-pushed from db0595b to 8e78a51
Please see #4624 instead.
Signed-off-by: Albert [email protected]
What this PR does:
Implements generation of parallelizable plans for the proposal outlined in #4272 using a shuffle-sharding grouper and planner. Currently the parallelizable plans are generated, but every compactor runs every planned compaction; the actual sharding will happen in a subsequent PR.
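For context, a minimal Go sketch (not the PR's actual code) of how one tenant's blocks can be bucketed into per-time-range groups, each with a stable hash ID, so that each group forms an independently plannable compaction unit; the `groupKey` format and FNV-based group IDs are assumptions for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// block is a simplified stand-in for TSDB block metadata.
type block struct {
	MinTime, MaxTime int64 // ms
}

// rangeStart returns the start of the aligned compaction range a block falls in.
func rangeStart(b block, rangeMs int64) int64 {
	return (b.MinTime / rangeMs) * rangeMs
}

// groupBlocks buckets one tenant's blocks into per-range groups, each keyed
// by a stable hash so different compactors can deterministically claim groups.
func groupBlocks(tenant string, blocks []block, rangeMs int64) map[uint32][]block {
	groups := map[uint32][]block{}
	for _, b := range blocks {
		h := fnv.New32a()
		fmt.Fprintf(h, "%s/%d", tenant, rangeStart(b, rangeMs))
		id := h.Sum32()
		groups[id] = append(groups[id], b)
	}
	return groups
}

func main() {
	blocks := []block{{0, 20}, {20, 40}, {120, 140}}
	for id, g := range groupBlocks("tenant-1", blocks, 120) {
		fmt.Printf("group %08x: %d block(s)\n", id, len(g))
	}
}
```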
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]