Fix ClusterClient behavior when cluster topology is refreshed. Fix several places where connections might leak. #3917
Conversation
Pull request overview
This PR improves cluster topology refresh behavior by preserving connection pools during topology changes rather than destroying and recreating them. Previously, cluster topology refreshes would disconnect all connections and set the node's redis connection to None, causing unnecessary connection churn and potential race conditions with in-use connections.
Changes:
- Connection pools are now preserved during topology refresh with lazy reconnection for in-use connections and immediate disconnection for idle connections to clear stale state
- Failing nodes are moved to the end of the cache rather than removed, improving node selection behavior during reinitialization
- Transaction connection lifecycle improved to properly release connections on errors, preventing leaks
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| redis/cluster.py | Implements connection pool preservation during topology refresh, adds move_node_to_end_of_cached_nodes method, and fixes transaction connection leaks in error paths |
| redis/asyncio/cluster.py | Async version of connection pool preservation, adds ClusterNode methods for connection lifecycle management (disconnect_if_needed, update_active_connections_for_reconnect, disconnect_free_connections) |
| redis/asyncio/connection.py | Adds reset_should_reconnect method to clear reconnect flag after disconnect |
| tests/test_cluster.py | Updates test assertion to reflect new behavior where all node connections are reused during topology refresh |
| tests/test_cluster_transaction.py | Adds proper cleanup in try/finally blocks to ensure mock connections don't interfere with teardown, sets host/port on mock connections for find_connection_owner |
| tests/test_asyncio/test_cluster_transaction.py | Similar cleanup improvements for async tests, removes mock connections from node's _free list after tests |
| tests/test_asyncio/test_cluster.py | Adds cleanup in try/finally for mock connections, adds comprehensive unit tests for new ClusterNode connection handling methods and move_node_to_end_of_cached_nodes |
| tests/conftest.py | Adds host/port attributes to mock_connection fixture needed by find_connection_owner |
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (2)
redis/asyncio/cluster.py:1354
- If `disconnect_if_needed` raises an exception, the connection that was acquired will not be released back to the pool, potentially causing a connection leak. Consider wrapping the disconnect call in a try-except block or moving the disconnect handling into a try-finally structure to ensure the connection is always released. Alternatively, handle disconnect errors gracefully within `disconnect_if_needed` itself.
async def execute_command(self, *args: Any, **kwargs: Any) -> Any:
    # Acquire connection
    connection = self.acquire_connection()
    # Handle lazy disconnect for connections marked for reconnect
    await self.disconnect_if_needed(connection)
    # Execute command
    await connection.send_packed_command(connection.pack_command(*args), False)
    # Read response
    try:
        return await self.parse_response(connection, args[0], **kwargs)
    finally:
        # Release connection
        self._free.append(connection)
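To make the suggestion concrete, here is a minimal sketch of the try/finally shape the comment asks for, assuming the method otherwise keeps the structure shown above. It only reuses the names already visible in the snippet and is an illustration of the review comment, not necessarily the change the PR ends up making:

async def execute_command(self, *args: Any, **kwargs: Any) -> Any:
    # Acquire connection
    connection = self.acquire_connection()
    try:
        # Handle lazy disconnect for connections marked for reconnect;
        # a failure here can no longer strand the connection outside the pool
        await self.disconnect_if_needed(connection)
        # Execute command
        await connection.send_packed_command(connection.pack_command(*args), False)
        # Read response
        return await self.parse_response(connection, args[0], **kwargs)
    finally:
        # Release connection on every exit path
        self._free.append(connection)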
redis/asyncio/cluster.py:1381
- If `disconnect_if_needed` raises an exception at line 1360, the connection that was acquired will not be released back to the pool at line 1379, potentially causing a connection leak. Consider wrapping the entire method body in a try-finally block to ensure the connection is always released, or handle disconnect errors gracefully within `disconnect_if_needed` itself.
async def execute_pipeline(self, commands: List["PipelineCommand"]) -> bool:
    # Acquire connection
    connection = self.acquire_connection()
    # Handle lazy disconnect for connections marked for reconnect
    await self.disconnect_if_needed(connection)
    # Execute command
    await connection.send_packed_command(
        connection.pack_commands(cmd.args for cmd in commands), False
    )
    # Read responses
    ret = False
    for cmd in commands:
        try:
            cmd.result = await self.parse_response(
                connection, cmd.args[0], **cmd.kwargs
            )
        except Exception as e:
            cmd.result = e
            ret = True
    # Release connection
    self._free.append(connection)
    return ret
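Again as an illustration of the comment rather than the actual fix, the same pattern applied to execute_pipeline could look roughly like this, using only the names already present in the snippet above:

async def execute_pipeline(self, commands: List["PipelineCommand"]) -> bool:
    # Acquire connection
    connection = self.acquire_connection()
    try:
        # Handle lazy disconnect for connections marked for reconnect
        await self.disconnect_if_needed(connection)
        # Send all commands as one packed payload
        await connection.send_packed_command(
            connection.pack_commands(cmd.args for cmd in commands), False
        )
        # Read responses, recording per-command errors instead of raising
        ret = False
        for cmd in commands:
            try:
                cmd.result = await self.parse_response(
                    connection, cmd.args[0], **cmd.kwargs
                )
            except Exception as e:
                cmd.result = e
                ret = True
        return ret
    finally:
        # Release connection on every exit path, including disconnect/send errors
        self._free.append(connection)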
Force-pushed from f2de9c1 to 71a5913
Force-pushed from 1087ffd to bdd9d11
… just marking in use connections for reconnect and disconnecting free ones instead.
…hed_nodes(sync client version)
…ion. Removing unneeded connection mock attributes.
Force-pushed from 1a57ee1 to 854dfa9
Preserve connection pools during cluster topology refresh
Previously, when a cluster topology refresh occurred (due to connection errors, failovers, or MOVED responses), the entire connection pool was disconnected and the node's redis connection was set to None. This caused unnecessary connection churn and could lead to race conditions where in-use connections were abruptly disconnected mid-command.
This change improves the behavior by preserving existing connection pools during topology changes.
In-use connections are now marked for lazy reconnection (they complete their current operation and reconnect on next use), while free/idle connections are disconnected immediately to clear stale state like READONLY mode.
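A rough sketch of that policy is shown below; the attribute and method names used here (`_in_use`, `_free`, `mark_for_reconnect`, `disconnect`) are illustrative placeholders, not necessarily the ones the PR adds:

def handle_topology_refresh(pool) -> None:
    # In-use connections are only flagged: they finish their current
    # command and reconnect the next time they are checked out.
    for conn in pool._in_use:
        conn.mark_for_reconnect()
    # Free/idle connections are disconnected immediately so stale state
    # (for example READONLY mode) does not leak across the topology change.
    for conn in pool._free:
        conn.disconnect()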
Additionally, failing nodes are moved to the end of the cache rather than being removed, so they are tried last during reinitialization; this also improves the behavior described in #3693.
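The node-ordering change boils down to popping and re-inserting the failing entry so it sorts last. A minimal sketch over a plain dict (Python dicts preserve insertion order), assuming a cache keyed by node name rather than the exact NodesManager internals:

def move_node_to_end_of_cached_nodes(nodes_cache: dict, node_name: str) -> None:
    # Pop and re-insert the failing node so iteration order puts it last
    # and reinitialization contacts healthier nodes first.
    node = nodes_cache.pop(node_name, None)
    if node is not None:
        nodes_cache[node_name] = node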
Transaction connection lifecycle has also been improved to properly release connections back to the pool on errors, preventing connection leaks.
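The transaction fix follows the usual acquire/try/finally/release discipline. A hedged sketch with hypothetical helper names (`checkout`, `release`), not the actual Pipeline code paths:

def execute_transaction(pool, stack):
    connection = pool.checkout()
    try:
        # Send the queued commands and collect their replies; any error
        # raised here still flows through the finally block below.
        for args, kwargs in stack:
            connection.send_command(*args, **kwargs)
        return [connection.read_response() for _ in stack]
    finally:
        # Always hand the connection back, so error paths cannot leak it.
        pool.release(connection)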