
RedisClusterClient enters unrecoverable state during AWS ElastiCache upgrades #3929

@madhavich3134

Description

Our service talks to an Amazon ElastiCache Redis cluster through the Python redis async cluster client, configured as:

import redis.asyncio as redis  # async cluster client, per the description above
from redis.backoff import EqualJitterBackoff
from redis.cluster import LoadBalancingStrategy
from redis.retry import Retry  # on redis-py 5.x the async client used redis.asyncio.retry.Retry

client = redis.RedisCluster(
    host=redis_settings.url,
    port=redis_settings.port,
    max_connections=redis_settings.max_connections,
    health_check_interval=redis_settings.health_check_interval_s,
    socket_timeout=redis_settings.socket_timeout_s,
    socket_connect_timeout=redis_settings.socket_connect_timeout_s,
    decode_responses=False,
    require_full_coverage=False,
    retry=Retry(
        EqualJitterBackoff(
            base=redis_settings.retry_base_s,
            cap=redis_settings.retry_cap_s,
        ),
        retries=redis_settings.retry_attempts,
    ),
    load_balancing_strategy=LoadBalancingStrategy.ROUND_ROBIN,
    dynamic_startup_nodes=True,
)

The Redis cluster client is created once at pod startup and reused for all subsequent API calls.
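For context, the client is held as a process-wide singleton, roughly like the sketch below (get_redis and make_cluster_client are illustrative helpers, not our exact code; make_cluster_client wraps the constructor call above):

_client: redis.RedisCluster | None = None

def get_redis() -> redis.RedisCluster:
    # Built once at pod startup (on first use) and shared by every request.
    global _client
    if _client is None:
        _client = make_cluster_client()  # the constructor call shown above
    return _client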

During an online ElastiCache upgrade or node replacement, Redis nodes are temporarily restarted and cluster topology changes occur. While this is in progress, concurrent API requests experience connection failures and cluster resolution errors. After the upgrade completes and the cluster becomes healthy again, the existing Redis client does not recover and continues to surface errors on every request.

Observed error patterns include:

ConnectionRefusedError: [Errno 111] Connect call failed

redis.exceptions.ConnectionError

redis.exceptions.RedisClusterException

redis.exceptions.MaxConnectionsError

IndexError: pop from an empty deque

At this point, API calls are no longer able to connect to the Redis cluster, and the service remains in a broken state indefinitely. Recovery only occurs after restarting the pods, which recreates the Redis client and allows cluster topology and connections to be re-established.
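The only in-process equivalent we have found is to do what the pod restart does: discard the wedged client and build a new one. A minimal sketch of that workaround, assuming the helpers above and a hypothetical failure threshold:

from redis.exceptions import ConnectionError, MaxConnectionsError, RedisClusterException

FAILURE_THRESHOLD = 5  # hypothetical; tune to your traffic
_consecutive_failures = 0

async def execute_with_rebuild(*command_args):
    global _client, _consecutive_failures
    try:
        result = await get_redis().execute_command(*command_args)
        _consecutive_failures = 0
        return result
    except (ConnectionError, MaxConnectionsError, RedisClusterException):
        _consecutive_failures += 1
        if _consecutive_failures >= FAILURE_THRESHOLD:
            # Drop the broken client (and its pools and slot cache) and rebuild.
            await _client.aclose()
            _client = make_cluster_client()
            _consecutive_failures = 0
        raise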

This behavior suggests that the Redis cluster client does not automatically self-heal or rebuild its internal connection pool and slot cache after topology changes during ElastiCache online upgrades when the client is long-lived and reused across requests.
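The closest thing we found to nudging the existing client rather than replacing it is the async client's initialize() coroutine, which (as we understand it) re-runs cluster discovery against the startup nodes; whether it also resets the per-node connection pools is exactly what we could not confirm. A minimal sketch:

async def refresh_topology(client: redis.RedisCluster) -> None:
    # Assumed behavior: re-runs cluster discovery and rebuilds the slot
    # cache on the existing async client object.
    await client.initialize()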

Could you help identify any configuration changes or client settings that would allow the Redis Python client to recover automatically after an online ElastiCache upgrade? I recently upgraded the Redis Python client from version 5.2 to 7.1.0 in the hope that the newer version would handle Redis Cluster topology changes more gracefully, but the client still does not recover without restarting the pods.
