Randomize cluster startup node order during topology refresh #4060

Open
petyaslavova wants to merge 9 commits into master from ps_randomize_startup_nodes_on_cluster_topology_reinitialization

Collaborator

@petyaslavova petyaslavova commented May 12, 2026

Randomizes the startup node iteration order during cluster topology initialization for both sync and async clients. This prevents many clients from consistently querying the same first startup node when reinitializing cluster state.

The implementation copies startup_nodes to a list, shuffles it when multiple nodes are available, and then proceeds with the existing initialization flow. Sync behavior still includes any additional startup nodes after the shuffled startup node list, preserving the existing MOVED refresh path behavior.
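The ordering described above can be sketched as follows. This is a minimal illustration, not the code from the PR: `ordered_startup_nodes` is a hypothetical helper, and it assumes `startup_nodes` is a mapping from node name to node object and that additional startup nodes arrive as a plain sequence.

```python
import random


def ordered_startup_nodes(startup_nodes, additional_startup_nodes=()):
    """Hypothetical helper: return nodes in the order initialization would try them.

    Assumes startup_nodes is a dict of name -> node and
    additional_startup_nodes is any iterable of nodes.
    """
    # Copy first so the caller's mapping is never reordered or mutated.
    nodes = list(startup_nodes.values())
    # Shuffling a single-element (or empty) list is pointless, so skip it.
    if len(nodes) > 1:
        random.shuffle(nodes)
    # Additional nodes keep their order and always follow the shuffled list.
    return nodes + list(additional_startup_nodes)
```

Because the shuffle happens on a copy, repeated refreshes each draw a fresh order while the configured `startup_nodes` stay untouched.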

Adds sync and async cluster tests that use the real cluster fixture and mock only random.shuffle to make the order deterministic. The tests verify that initialization queries the node that becomes first after shuffling.
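A rough sketch of the mocking approach, under the assumption that the code under test calls `random.shuffle` through the `random` module. The names `pick_first_node`, `reverse_in_place`, and `first_node_with_mocked_shuffle` are hypothetical stand-ins; the actual tests run against the real cluster fixture rather than plain lists.

```python
import random
from unittest import mock


def pick_first_node(startup_nodes):
    """Stand-in for the part of initialization that chooses which node to query first."""
    nodes = list(startup_nodes)
    if len(nodes) > 1:
        random.shuffle(nodes)
    return nodes[0]


def reverse_in_place(items):
    """Deterministic replacement for random.shuffle: reverses the list in place."""
    items.reverse()


def first_node_with_mocked_shuffle(startup_nodes):
    # Patch the attribute on the random module so the call inside
    # pick_first_node resolves to the deterministic replacement.
    with mock.patch("random.shuffle", side_effect=reverse_in_place) as shuffle_mock:
        first = pick_first_node(startup_nodes)
    return first, shuffle_mock.call_count
```

With the shuffle replaced by a reversal, a test can assert that the last configured node is the one queried first, and that the shuffle is skipped entirely when only one node is configured.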

Fixes #4049


Note

Medium Risk
Changes cluster topology refresh ordering and retry initialization behavior (sync + asyncio), which can affect how clients recover from node failures and MOVED/connection errors. Test changes reduce timing flakiness but don’t alter lock semantics.

Overview
Cluster topology refresh is now randomized. Both sync and asyncio NodesManager.initialize() copy startup_nodes to a list and random.shuffle() it (when multiple nodes exist) so clients don’t consistently query the same first node during reinitialization.

Failed nodes are deprioritized on refresh. Connection/timeout errors now annotate the exception with last_failed_node_name, and subsequent initialize() calls pass this through so the failed node is tried after other startup and additional startup nodes.
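One way to picture the deprioritization is the sketch below. It is an illustration under assumptions, not the PR's implementation: `order_with_failed_last` is a hypothetical function that works on node names only, and the deduplication of additional nodes is a simplification chosen for the example.

```python
import random


def order_with_failed_last(startup_names, additional_names=(), last_failed_node_name=None):
    """Hypothetical ordering: shuffled startup nodes, then additional nodes,
    with the most recently failed node moved to the very end."""
    names = list(startup_names)
    if len(names) > 1:
        random.shuffle(names)
    # Additional nodes follow the shuffled startup nodes, without duplicates.
    candidates = names + [n for n in additional_names if n not in names]
    # The failed node is still tried, but only after every other candidate.
    if last_failed_node_name in candidates:
        candidates.remove(last_failed_node_name)
        candidates.append(last_failed_node_name)
    return candidates
```

Keeping the failed node in the list, rather than dropping it, means a refresh can still succeed if that node has recovered and every other candidate is unreachable.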

Tests updated/added. New sync/async cluster tests assert shuffling is applied (via deterministic random.shuffle mocking), and lock blocking-timeout tests now use patched fake time modules to avoid real-time assertions and flakiness.

Reviewed by Cursor Bugbot for commit 883effe.

@petyaslavova petyaslavova added the maintenance Maintenance (CI, Releases, etc) label May 12, 2026
jit-ci Bot commented May 12, 2026

🛡️ Jit Security Scan Results

✅ No security findings were detected in this PR


Security scan by Jit

…ceive the last failed node as argument and it is moved to be the last option for topology refresh
Comment thread redis/asyncio/cluster.py Outdated
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 9006304.

Comment thread redis/asyncio/cluster.py
…etter maint notifications behaviour the randomization is mocked to keep the original order
@petyaslavova petyaslavova requested a review from vladvildanov May 15, 2026 13:01
Comment thread redis/asyncio/cluster.py
         for _ in range(execute_attempts):
             if self._initialize:
-                await self.initialize()
+                await self.initialize(last_failed_node_name=last_failed_node_name)
Collaborator

So async initialisation never passes additional_startup_nodes_info?

Collaborator Author

@vladvildanov Currently, no. It was added for the maintenance notifications handling in the sync client. I'm adding this now so that later, when maintenance notifications are added to the async client, we will have the same order of arguments.

Labels

maintenance Maintenance (CI, Releases, etc)

Development

Successfully merging this pull request may close these issues.

CLUSTER SLOTS re-initialization cascade: deterministic startup_nodes ordering causes network saturation on first-slot node in large clusters

2 participants