Release the read lock while creating connections in refresh_connections #191
Merged
barshaul merged 2 commits into amazon-contributing:main on Oct 9, 2024
Conversation
… while creating a new connection
refresh_connections by adjusting lock management
eifrah-aws
reviewed
Sep 15, 2024
redis/src/cluster_async/mod.rs
Outdated
    )
    .await;
    tasks.push(async move {
        let connections_container = inner.conn_lock.read().await;
Making a lock "public" is not a good idea. We should expose an atomic API and not the lock itself.
For example:
    fn do_something(&self) -> Result<(), Box<dyn Error>> {
        let _lk = self.lock.write()?;
        ...
    }
If the "something" is complex, we should add an API:
    fn write_lock_and_do<F>(&self, callback: F) -> Result<(), Box<dyn Error>>
    where
        F: Fn() -> Result<(), Box<dyn Error>>,
    {
        let _lk = self.lock.write()?;
        callback()
    }
This way we have full control over the lock and can avoid misuse of it.
Author
I think that's a good idea; let's do it in a separate PR.
    match result {
        (address, Ok(node)) => {
            let connections_container = inner.conn_lock.read().await;
            connections_container.replace_or_add_connection_for_address(address, node);
We should expose an API on inner for this function (replace_or_add_connection_for_address) and avoid exposing the lock here.
redis/src/cluster_async/mod.rs
Outdated
                }
            }
        }
        info!("refresh connections completed");
Is this something that happens often? If it does, please move this to debug!
eifrah-aws
approved these changes
Oct 9, 2024
PR Description:
Main Changes:
Lock Management Improvement:
In the previous implementation, the read lock (inner.conn_lock.read()) was held throughout the entire connection refresh process (for all connections sent to refresh), including while attempting to establish connections (via get_or_create_conn). If connections were slow or timed out, the lock was held for an extended duration, blocking other tasks requiring a write lock.
The new implementation releases the read lock before making connection attempts. If the connection is successfully established, the read lock is reacquired to update the connection container. This approach ensures that other operations needing the lock (e.g., write operations) can proceed while connections are being established.
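The release-then-reacquire pattern described above can be sketched roughly as follows. This is a minimal synchronous illustration using std::sync::RwLock rather than the async lock used in redis-rs. The Container alias, get_or_create, and refresh are hypothetical names, and the inner Mutex is only an assumption standing in for however the real container allows updates through a read guard.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, RwLock};

// Hypothetical stand-in for the connection container: the outer RwLock plays the
// role of `conn_lock`; the inner Mutex lets holders of a read guard update entries
// (an assumption made for this sketch, not the redis-rs type).
type Container = RwLock<Mutex<HashMap<String, u32>>>;

// Hypothetical stand-in for a slow connection attempt; an empty address fails.
fn get_or_create(address: &str, existing: Option<u32>) -> Result<u32, String> {
    match existing {
        Some(id) => Ok(id),
        None if address.is_empty() => Err("connection refused".to_string()),
        None => Ok(address.len() as u32),
    }
}

// Returns, per address, whether a writer could have taken the lock during the connect.
fn refresh(container: &Container, addresses: &[&str]) -> Vec<bool> {
    let mut writer_was_free = Vec::new();
    for &address in addresses {
        // Phase 1: short read lock to snapshot any existing entry.
        let existing = {
            let guard = container.read().unwrap();
            // Bind to a local so the inner guard is released before `guard` is.
            let found = guard.lock().unwrap().get(address).copied();
            found
        }; // read guard dropped here

        // Phase 2: connect with no lock held; a writer is free to proceed.
        writer_was_free.push(container.try_write().is_ok());
        let result = get_or_create(address, existing);

        // Phase 3: reacquire the read lock only to store a successful result.
        if let Ok(id) = result {
            let guard = container.read().unwrap();
            guard.lock().unwrap().insert(address.to_string(), id);
        }
    }
    writer_was_free
}

fn main() {
    let container = Container::default();
    let free = refresh(&container, &["10.0.0.1:6379", ""]);
    println!("writer free during connects: {:?}", free);
    println!(
        "stored connections: {}",
        container.read().unwrap().lock().unwrap().len()
    );
}
```

The failed attempt (the empty address) never touches the lock after phase 1, so a refused connection cannot leave a guard alive in this arrangement.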
Unclear Deadlock Behavior:
A deadlock scenario was observed while testing the update_slotmap_moved branch (on amazon-contributing/redis-rs) during failover testing. The root cause of the deadlock remains unclear. The branch introduces changes that attempt to acquire a write lock on the connection container, which leads to the issue. However, even after removing the content of the update_upon_moved function (leaving only the lock acquisition), the deadlock persisted, suggesting that the problem isn't directly tied to the logic in the function itself.
It seems like there is an unusual race condition occurring, causing the lock to enter an undefined state where neither reads nor writes are able to acquire it. This lock state is leading to the deadlock, with all tasks attempting to use the lock getting blocked.
The issue arose in the following situation:
1. refresh_connections is triggered and acquires the read lock, while get_or_create_conn is waiting for a connection to complete.
2. update_upon_moved tries to acquire the write lock but is blocked since the read lock is held by refresh_connections.
3. refresh_connections fails with a Connection refused (os error 111) and exits, but the lock is not properly released.
Important: It is unclear why this "deadlock" occurs and why the lock isn't released after the function exits. Despite attempts to explicitly drop the lock right before the function returns, the issue persisted. However, with the new lock-release-before-connection strategy, the problem no longer appears.
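The first two steps of that sequence, and the release that would normally be expected in the third, can be reproduced in isolation. The sketch below is an assumption-laden simplification: it uses std::sync::RwLock with non-blocking try_write calls instead of the async lock, and observe_writer_blocking is a hypothetical name. It does not reproduce the stuck state itself, whose cause is unknown.

```rust
use std::sync::RwLock;

// Returns (writer blocked while the read guard is held, writer succeeds after release).
fn observe_writer_blocking() -> (bool, bool) {
    // Hypothetical stand-in for the connection container behind `conn_lock`.
    let conn_lock = RwLock::new(Vec::<String>::new());

    // Step 1: the refresh path takes the read lock and keeps it while "connecting".
    let read_guard = conn_lock.read().unwrap();

    // Step 2: the MOVED path asks for the write lock and cannot get it.
    let blocked_while_held = conn_lock.try_write().is_err();

    // Expected behavior on exit: dropping the guard frees the lock for the writer.
    drop(read_guard);
    let acquired_after_release = conn_lock.try_write().is_ok();

    (blocked_while_held, acquired_after_release)
}

fn main() {
    let (blocked, acquired) = observe_writer_blocking();
    println!("writer blocked while read guard held: {blocked}");
    println!("writer acquired after guard dropped: {acquired}");
}
```

One related detail: tokio's RwLock is documented as fair (write-preferring), so once a writer is queued, later readers wait behind it. That may help explain why reads stall as well as writes while the original read guard is alive, though it does not explain why the guard would outlive the function.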
Testing:
This issue and change were tested by simulating node failovers on the update_slotmap_moved branch, verifying that the client successfully recovers without getting stuck, allowing the system to quickly find the promoted replica and maintain operations.
We still need to investigate the root cause of the lock issue (looks like a tokio bug?), but this change resolves the deadlock and improves lock management.
Deadlock Test Logs: