Description
The service connects to an Amazon ElastiCache Redis cluster through the Python redis async cluster client, configured as follows:

```python
import redis
# Imports inferred from the bare names used below.
from redis.backoff import EqualJitterBackoff
from redis.cluster import LoadBalancingStrategy
from redis.retry import Retry

client = redis.RedisCluster(
    host=redis_settings.url,
    port=redis_settings.port,
    max_connections=redis_settings.max_connections,
    health_check_interval=redis_settings.health_check_interval_s,
    socket_timeout=redis_settings.socket_timeout_s,
    socket_connect_timeout=redis_settings.socket_connect_timeout_s,
    decode_responses=False,
    require_full_coverage=False,
    # Retry transient failures with equal-jitter exponential backoff.
    retry=Retry(
        EqualJitterBackoff(
            base=redis_settings.retry_base_s,
            cap=redis_settings.retry_cap_s,
        ),
        retries=redis_settings.retry_attempts,
    ),
    load_balancing_strategy=LoadBalancingStrategy.ROUND_ROBIN,
    # Replace the startup nodes with nodes discovered from the cluster.
    dynamic_startup_nodes=True,
)
```
The Redis cluster client is created once at pod startup and reused for all subsequent API calls.
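To make that lifecycle concrete, it is roughly equivalent to the sketch below (illustrative names only, shown with the sync client for brevity; this is not the actual service code):

```python
from typing import Optional

import redis

_client: Optional[redis.RedisCluster] = None

def get_redis() -> redis.RedisCluster:
    """Return the process-wide cluster client, creating it on first use."""
    global _client
    if _client is None:
        # Placeholder arguments; the real constructor call is shown above.
        _client = redis.RedisCluster(host="localhost", port=6379)
    return _client
```

Every API handler goes through the same client instance, so all requests share one set of connection pools and one slot cache for the lifetime of the pod.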
During an online ElastiCache upgrade or node replacement, Redis nodes are temporarily restarted and cluster topology changes occur. While this is in progress, concurrent API requests experience connection failures and cluster resolution errors. After the upgrade completes and the cluster becomes healthy again, the existing Redis client does not recover and continues to surface errors on every request.
Observed error patterns include:

```
ConnectionRefusedError: [Errno 111] Connect call failed
redis.exceptions.ConnectionError
redis.exceptions.RedisClusterException
redis.exceptions.MaxConnectionsError
IndexError: pop from an empty deque
```
At this point, API calls are no longer able to connect to the Redis cluster, and the service remains in a broken state indefinitely. Recovery only occurs after restarting the pods, which recreates the Redis client and allows cluster topology and connections to be re-established.
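The only in-process mitigation I can think of is to mimic that pod restart by discarding and rebuilding the client after sustained failures, along the lines of this sketch (assumed names and threshold, only `get` wrapped for brevity, sync client again; we do not currently run this):

```python
from typing import Callable, Optional

import redis
from redis.exceptions import ConnectionError, RedisClusterException

FAILURE_THRESHOLD = 5  # assumed value; tune to taste

class SelfHealingCluster:
    """Rebuild the cluster client after repeated failures, mimicking
    in-process what a pod restart achieves today."""

    def __init__(self, factory: Callable[[], redis.RedisCluster]) -> None:
        # factory returns a fresh client, e.g. via the constructor shown above.
        self._factory = factory
        self._client = factory()
        self._failures = 0

    def get(self, key: str) -> Optional[bytes]:
        try:
            value = self._client.get(key)
            self._failures = 0
            return value
        # OSError covers the raw ConnectionRefusedError seen in the logs.
        except (ConnectionError, RedisClusterException, OSError):
            self._failures += 1
            if self._failures >= FAILURE_THRESHOLD:
                # Drop the wedged client (stale slot cache and pools) and build
                # a new one, which re-resolves the cluster topology from scratch.
                try:
                    self._client.close()
                except Exception:
                    pass
                self._client = self._factory()
                self._failures = 0
            raise
```

This feels like re-implementing recovery that the client should ideally perform itself.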
This behavior suggests that the Redis cluster client does not automatically self-heal or rebuild its internal connection pool and slot cache after topology changes during ElastiCache online upgrades when the client is long-lived and reused across requests.
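If there is a supported way to force that rebuild at runtime, that would likely be enough. The closest thing I have found looks like internal API rather than a public hook (an assumption on my part):

```python
# Assumption: NodesManager.initialize() re-discovers the topology and rebuilds
# the slot cache, but it does not appear to be a supported public entry point.
client.nodes_manager.initialize()
```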
Could you help identify any configuration changes or client settings that would allow the Redis Python client to recover automatically after an online ElastiCache upgrade? I recently upgraded the Redis Python client from version 5.2 to 7.1.0 in the hope that the newer version would handle Redis Cluster topology changes more gracefully, but the client still does not recover without restarting the pods.
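For reference, the next configuration experiment on my list (placeholder values; I have not confirmed it addresses the problem) is to widen the set of errors the Retry object treats as retryable, so that cluster-level failures during the topology change are retried as well:

```python
from redis.backoff import EqualJitterBackoff
from redis.exceptions import ClusterDownError, ConnectionError, TimeoutError
from redis.retry import Retry

retry = Retry(
    EqualJitterBackoff(base=0.1, cap=5.0),  # placeholder backoff values
    retries=5,                              # placeholder retry count
    supported_errors=(ConnectionError, TimeoutError, ClusterDownError),
)
```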