Randomize cluster startup node order during topology refresh#4060
Open
petyaslavova wants to merge 9 commits into
Open
Randomize cluster startup node order during topology refresh#4060petyaslavova wants to merge 9 commits into
petyaslavova wants to merge 9 commits into
Conversation
…logy_reinitialization
🛡️ Jit Security Scan Results✅ No security findings were detected in this PR
Security scan by Jit
|
…ceive the last failed node as argument and it is moved to be the last option for topology refresh
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 9006304. Configure here.
…etter maint notifications behaviour the randomization is mocked to keep the original order
| for _ in range(execute_attempts): | ||
| if self._initialize: | ||
| await self.initialize() | ||
| await self.initialize(last_failed_node_name=last_failed_node_name) |
Collaborator
There was a problem hiding this comment.
So async initialisation never passes additional_startup_nodes_info?
Collaborator
Author
There was a problem hiding this comment.
@vladvildanov Currently, no - it was added for the maintenance notifications handling in sync client. I'm adding this now, so that later, when the maintenance notifications are added to the sync client we will have the same order of the arguments.
…logy_reinitialization
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Randomizes the startup node iteration order during cluster topology initialization for both sync and async clients. This prevents many clients from consistently querying the same first startup node when reinitializing cluster state.
The implementation copies
startup_nodesto a list, shuffles it when multiple nodes are available, and then proceeds with the existing initialization flow. Sync behavior still includes any additional startup nodes after the shuffled startup node list, preserving the existing MOVED refresh path behavior.Adds sync and async cluster tests that use the real cluster fixture and mock only
random.shuffleto make the order deterministic. The tests verify that initialization queries the node that becomes first after shuffling.Fixes #4049
Note
Medium Risk
Changes cluster topology refresh ordering and retry initialization behavior (sync + asyncio), which can affect how clients recover from node failures and MOVED/connection errors. Test changes reduce timing flakiness but don’t alter lock semantics.
Overview
Cluster topology refresh is now randomized. Both sync and asyncio
NodesManager.initialize()copystartup_nodesto a list andrandom.shuffle()it (when multiple nodes exist) so clients don’t consistently query the same first node during reinitialization.Failed nodes are deprioritized on refresh. Connection/timeout errors now annotate the exception with
last_failed_node_name, and subsequentinitialize()calls pass this through so the failed node is tried after other startup and additional startup nodes.Tests updated/added. New sync/async cluster tests assert shuffling is applied (via deterministic
random.shufflemocking), and lock blocking-timeout tests now use patched fake time modules to avoid real-time assertions and flakiness.Reviewed by Cursor Bugbot for commit 883effe. Bugbot is set up for automated code reviews on this repo. Configure here.