
RedisClusterClient enters unrecoverable state during AWS ElastiCache upgrades #3929

@madhavich3134

Description

Our service talks to an Amazon ElastiCache Redis cluster through the Python redis async cluster client, configured as:

import redis.asyncio as redis  # async cluster client, per the description above
from redis.backoff import EqualJitterBackoff
from redis.cluster import LoadBalancingStrategy
from redis.retry import Retry  # on redis-py 5.x the async client used redis.asyncio.retry.Retry

client = redis.RedisCluster(
    host=redis_settings.url,
    port=redis_settings.port,
    max_connections=redis_settings.max_connections,
    health_check_interval=redis_settings.health_check_interval_s,
    socket_timeout=redis_settings.socket_timeout_s,
    socket_connect_timeout=redis_settings.socket_connect_timeout_s,
    decode_responses=False,
    require_full_coverage=False,
    retry=Retry(
        EqualJitterBackoff(
            base=redis_settings.retry_base_s,
            cap=redis_settings.retry_cap_s,
        ),
        retries=redis_settings.retry_attempts,
    ),
    load_balancing_strategy=LoadBalancingStrategy.ROUND_ROBIN,
    dynamic_startup_nodes=True,
)

The Redis cluster client is created once at pod startup and reused for all subsequent API calls.
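For context, the client is held as a process-wide singleton, roughly like the sketch below (get_redis and make_cluster_client are illustrative helpers, not our exact code; make_cluster_client wraps the constructor call above):

_client: redis.RedisCluster | None = None

def get_redis() -> redis.RedisCluster:
    # Built once at pod startup (on first use) and shared by every request.
    global _client
    if _client is None:
        _client = make_cluster_client()  # the constructor call shown above
    return _client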

During an online ElastiCache upgrade or node replacement, Redis nodes are temporarily restarted and cluster topology changes occur. While this is in progress, concurrent API requests experience connection failures and cluster resolution errors. After the upgrade completes and the cluster becomes healthy again, the existing Redis client does not recover and continues to surface errors on every request.

Observed error patterns include:

ConnectionRefusedError: [Errno 111] Connect call failed

redis.exceptions.ConnectionError

redis.exceptions.RedisClusterException

redis.exceptions.MaxConnectionsError

IndexError: pop from an empty deque

At this point, API calls are no longer able to connect to the Redis cluster, and the service remains in a broken state indefinitely. Recovery only occurs after restarting the pods, which recreates the Redis client and allows cluster topology and connections to be re-established.
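The only in-process equivalent we have found is to do what the pod restart does: discard the wedged client and build a new one. A minimal sketch of that workaround, assuming the helpers above and a hypothetical failure threshold:

from redis.exceptions import ConnectionError, MaxConnectionsError, RedisClusterException

FAILURE_THRESHOLD = 5  # hypothetical; tune to your traffic
_consecutive_failures = 0

async def execute_with_rebuild(*command_args):
    global _client, _consecutive_failures
    try:
        result = await get_redis().execute_command(*command_args)
        _consecutive_failures = 0
        return result
    except (ConnectionError, MaxConnectionsError, RedisClusterException):
        _consecutive_failures += 1
        if _consecutive_failures >= FAILURE_THRESHOLD:
            # Drop the broken client (and its pools and slot cache) and rebuild.
            await _client.aclose()
            _client = make_cluster_client()
            _consecutive_failures = 0
        raise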

This behavior suggests that the Redis cluster client does not automatically self-heal or rebuild its internal connection pool and slot cache after topology changes during ElastiCache online upgrades when the client is long-lived and reused across requests.
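The closest thing we found to nudging the existing client rather than replacing it is the async client's initialize() coroutine, which (as we understand it) re-runs cluster discovery against the startup nodes; whether it also resets the per-node connection pools is exactly what we could not confirm. A minimal sketch:

async def refresh_topology(client: redis.RedisCluster) -> None:
    # Assumed behavior: re-runs cluster discovery and rebuilds the slot
    # cache on the existing async client object.
    await client.initialize()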

Could you help identify any configuration changes or client settings that would allow the Redis Python client to recover automatically after an online ElastiCache upgrade? I recently upgraded the Redis Python client from version 5.2 to 7.1.0 in the hope that the newer version would handle Redis Cluster topology changes more gracefully, but the client still does not recover without restarting the pods.
