session: fix closing semantics #328

Draft
wants to merge 9 commits into
base: master
from

wprzytula
Collaborator

@wprzytula wprzytula commented Jun 25, 2025

Note: generated with GPT-4o and manually redacted.

Fix Session Closing Semantics and Enable AsyncTests Suite

Summary

This pull request introduces critical improvements to the session closing mechanism. Additionally, it enables the AsyncTests::Close test suite, aligns the behavior with the expectations of the CPP Driver, and satisfies the contract defined in the cassandra.h documentation regarding cass_session_free() and cass_session_close().

Key Changes

  1. Satisfy cass_session_free Contract:

    • Updated cass_session_free to wait for the session to close before deallocating, as specified in the cassandra.h documentation. This ensures that all requests are completed before the session is freed.
  2. Empirical Proof of Flaws in Current Implementation:

    • A temporary commit (913a7c2b) was introduced to empirically demonstrate flaws in the current implementation. The AsyncTests::Close test was tuned to fail by changing the number of concurrent requests to 5, adding a 100 ms sleep time, and switching to a multi-threaded runtime with a hardcoded number of worker threads (2). This highlights issues with session closure concurrent with running requests.
    • Subsequent commits address these flaws by ensuring synchronous read-locking of the session upon scheduling a request.
    • Note: This temporary commit will be removed before merging the PR into the master branch.
  3. Prevent Logical Races:

    • Modified the session locking mechanism to acquire the lock synchronously before returning from cass_session_execute[_batch] functions. This ensures that the session cannot be closed while requests are running, preventing logical races and request failures.
    • Care was taken to avoid deadlocks in the current_thread runtime by executing remaining futures while waiting for the lock to be released (RUNTIME.block_on(lock.read())).
  4. Avoid Blocking Locks:

    • Replaced blocking_read() with RUNTIME.block_on(lock.read()) to prevent thread-blocking issues. This change allows the async runtime to continue processing queued tasks while waiting for the lock to be released, reducing the risk of deadlocks, especially on the current_thread executor, but possibly also on the multi_thread executor if locks are taken in callbacks.
  5. Enable AsyncTests::Close Suite:

    • The Session::execute(_batch) methods already clone the Session's Arc, which prevents use-after-free (UAF) when the session is closed while requests are still running [introduced in c1e40d7].
    • The RwLock mechanism now ensures that the session is protected from premature drops by synchronously taking a read lock for all running requests. This guarantees that cass_session_close() and cass_session_free() block until all requests are completed, aligning with the expectations of the AsyncTests::Close suite.
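
The synchronous locking described in items 3 and 5 can be sketched as follows. This is a minimal model under stated assumptions, not the wrapper's code: it uses `std::sync::RwLock` and OS threads in place of the tokio lock and runtime, and `SessionState`, `execute` and `close` are hypothetical names. It illustrates that `execute` returns only after the request holds its read lock, so a later `close` (the writer) must wait for every request that was already scheduled.

```rust
use std::sync::{mpsc, Arc, RwLock};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the session state: `None` means "closed".
type SessionState = Arc<RwLock<Option<String>>>;

/// Starts a "request" and returns only once that request holds the read lock.
fn execute(session: &SessionState) -> thread::JoinHandle<Result<String, &'static str>> {
    let (locked_tx, locked_rx) = mpsc::channel();
    let session = Arc::clone(session);
    let handle = thread::spawn(move || {
        let guard = session.read().unwrap();
        // Tell the caller that the lock is held before doing any work.
        locked_tx.send(()).unwrap();
        thread::sleep(Duration::from_millis(50)); // simulated request latency
        let result = match &*guard {
            Some(name) => Ok(format!("ok on {name}")),
            None => Err("session closed"),
        };
        result
    });
    // Synchronous part: do not return until the read lock has been taken.
    locked_rx.recv().unwrap();
    handle
}

/// Closing needs the write lock, so it waits for all read guards to be released.
fn close(session: &SessionState) {
    *session.write().unwrap() = None;
}

fn main() {
    let session: SessionState = Arc::new(RwLock::new(Some("demo".to_string())));
    let requests: Vec<_> = (0..5).map(|_| execute(&session)).collect();
    close(&session); // blocks until all five requests have finished
    for request in requests {
        assert_eq!(request.join().unwrap(), Ok("ok on demo".to_string()));
    }
    assert!(session.read().unwrap().is_none());
}
```

In this model every request scheduled before `close` succeeds, because the writer cannot proceed while any read guard is alive.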

Files Changed

  • Makefile: Enabled the AsyncTests::Close suite.
  • scylla-rust-wrapper/src/session.rs: Comprehensive changes to improve session locking, prevent logical races, and satisfy the cass_session_free contract.
  • scylla-rust-wrapper/src/future.rs: pub(crate)'d one method to allow extracting close_fut functionality.
  • scylla-rust-wrapper/src/lib.rs and tests/src/integration/tests/test_async.cpp: Temporary changes to demonstrate flaws in the current implementation.

Notes to reviewers

  • Verify that the AsyncTests::Close test's semantics are satisfied.
  • Ensure that session operations behave as expected under concurrent request and session closure scenarios.
  • Make sure that the locking semantics are correct:
    • preparing statements,
    • executing statements,
    • executing batches,
    • connecting session,
    • closing session.
  • Confirm that no deadlocks occur in the current_thread runtime.
  • Validate that the temporary commit (913a7c2b) is removed before merging.

Fixes: #304

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have implemented Rust unit tests for the features/changes introduced.
  • I have enabled appropriate tests in Makefile in {SCYLLA,CASSANDRA}_(NO_VALGRIND_)TEST_FILTER.
  • I added appropriate Fixes: annotations to PR description.

@wprzytula wprzytula self-assigned this Jun 25, 2025
@wprzytula wprzytula added this to the 0.6 milestone Jun 25, 2025
@wprzytula wprzytula added the bug (Something isn't working) and P1 (P1 priority item - very important) labels Jun 25, 2025
@wprzytula wprzytula requested review from Copilot and Lorak-mmk June 25, 2025 10:12
@Copilot Copilot AI left a comment

Pull Request Overview

This PR improves the session closing mechanism by enforcing synchronous lock acquisition to prevent request/session race conditions and to meet the cass_session_free contract. It also adjusts test parameters and updates the runtime configuration to support proper concurrent behavior and ensure that the AsyncTests::Close suite runs reliably.

  • Update session locking to use synchronous (block_on) guard acquisition in multiple API functions.
  • Adjust test parameters (concurrent request count and sleep delays) for temporary diagnostic purposes.
  • Update runtime configuration to a multi-threaded tokio runtime with two worker threads and enable AsyncTests suite in the Makefile.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/src/integration/tests/test_async.cpp | Reduced concurrent requests and added a sleep delay to simulate timing; ensure temporary changes are reverted or documented. |
| scylla-rust-wrapper/src/session.rs | Updated synchronous lock acquisition logic and added detailed comments; consider refactoring repeated patterns to improve maintainability. |
| scylla-rust-wrapper/src/lib.rs | Switched to a multi-threaded runtime with two worker threads. |
| scylla-rust-wrapper/src/future.rs | Exposed into_raw as a public method. |
| Makefile | Enabled the AsyncTests suite. |
Comments suppressed due to low confidence (5)

tests/src/integration/tests/test_async.cpp:19

  • Reducing the number of concurrent requests to 5u may undermine the purpose of high-concurrency testing; ensure that this temporary change is reverted or clearly documented before merging.
#define NUMBER_OF_CONCURRENT_REQUESTS 5u

tests/src/integration/tests/test_async.cpp:57

  • The insertion of a fixed sleep time could artificially delay test execution; consider parameterizing or removing this once temporary diagnostics are complete.
      insert.set_sleep_time(100);

scylla-rust-wrapper/src/session.rs:85

  • [nitpick] Consider clarifying the rationale for using a synchronous lock here (and throughout the file) by referencing the CPP Driver requirements, to improve maintainability and ease future updates.
        // TODO: It's not clear whether this lock should be taken synchronously or asynchronously.

scylla-rust-wrapper/src/session.rs:241

  • [nitpick] The repeated pattern and similar comments for synchronous lock acquisition across functions suggest that this logic could be abstracted into a helper function to reduce redundancy and simplify future changes.
    keyspace_length: size_t,

scylla-rust-wrapper/src/session.rs:635

  • [nitpick] Consider addressing the TODO by verifying whether the default setting for the query consistency is appropriate; updating this comment with a concrete action or follow-up reference would be beneficial.
        // TODO: investigate if this is correct.

@wprzytula wprzytula force-pushed the fix-session-close branch from fead714 to b477893 Compare June 25, 2025 10:14
@wprzytula wprzytula marked this pull request as ready for review June 25, 2025 10:37
@wprzytula wprzytula removed the request for review from Lorak-mmk June 30, 2025 16:53
@wprzytula
Collaborator Author

wprzytula commented Jun 30, 2025

The approach taken is wrong: callbacks will panic because of the problem described in #329.
This must wait until I find an approach that fixes #329 and supports the current_thread runtime at the same time.

@wprzytula wprzytula marked this pull request as draft July 1, 2025 06:36
@wprzytula wprzytula modified the milestones: 0.6, 0.5.1 Jul 6, 2025
wprzytula added 9 commits July 6, 2025 14:51
This increases our likelihood of experiencing interleavings that make
the test AsyncTests.Close fail.
This is a preparatory step for fixing cass_session_free in the next
commit.
Quoting `cassandra.h` documentation of `cass_session_free`:
> Frees a session instance. If the session is still connected it will
> be synchronously closed before being deallocated.

This is definitely not the case in the Rust wrapper, which simply
deallocates the session without waiting for the requests to complete.
This commit makes `cass_session_free` wait for the session to close,
satisfying the contract.
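
A minimal sketch of the "close synchronously, then deallocate" contract quoted above. It is a model under stated assumptions, not the wrapper's implementation: `Session`, `close_blocking` and `session_free` are hypothetical names, and joined OS threads stand in for in-flight requests.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical session that tracks the threads of requests still running.
struct Session {
    in_flight: Mutex<Vec<thread::JoinHandle<()>>>,
    completed: Arc<AtomicUsize>,
}

impl Session {
    fn new() -> Self {
        Session {
            in_flight: Mutex::new(Vec::new()),
            completed: Arc::new(AtomicUsize::new(0)),
        }
    }

    /// Schedules a simulated request and records its handle.
    fn execute(&self) {
        let completed = Arc::clone(&self.completed);
        let handle = thread::spawn(move || {
            thread::sleep(Duration::from_millis(20)); // simulated latency
            completed.fetch_add(1, Ordering::SeqCst);
        });
        self.in_flight.lock().unwrap().push(handle);
    }

    /// Synchronous close: returns once every in-flight request has finished.
    fn close_blocking(&self) {
        let handles = std::mem::take(&mut *self.in_flight.lock().unwrap());
        for handle in handles {
            handle.join().unwrap();
        }
    }
}

/// Models the contract: close first, deallocate only afterwards.
fn session_free(session: Box<Session>) {
    session.close_blocking();
    drop(session);
}

fn main() {
    let session = Box::new(Session::new());
    let completed = Arc::clone(&session.completed);
    for _ in 0..5 {
        session.execute();
    }
    session_free(session);
    // All requests finished before the session was deallocated.
    assert_eq!(completed.load(Ordering::SeqCst), 5);
}
```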
The current implementation of the driver is flawed. In order
to empirically prove this, I tune the AsyncTests.Close test
to fail with the current implementation. The tuning involves
changing the number of concurrent requests to 5 and adding
a sleep time of 100 milliseconds between each request.
Moreover, the runtime is changed to use a multi-threaded
runtime with 2 worker threads to allow for concurrent execution
with interleaving that promotes the issue.

This commit is intended to demonstrate the issue with the current
implementation of the driver, specifically in how it handles
session closure concurrent with running requests. The test should fail
because some requests will only try to read-lock the session
after the session has been closed, leading to failure of those requests.

The following commits will address the issue by read-locking the session
synchronously upon scheduling a request (via either of
`cass_session_execute[_batch]`).

This commit shall be removed before merging this PR into master.
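
The failing interleaving described in this commit message can be modeled deterministically. The sketch below is an illustration under stated assumptions, not the driver's code: `std::sync::RwLock` and an OS thread replace the tokio lock and task, `execute_lazily` and `close` are hypothetical names, and the `start` channel stands in for the scheduler running the task late.

```rust
use std::sync::{mpsc, Arc, RwLock};
use std::thread;

/// Hypothetical stand-in for the session state: `None` means "closed".
type SessionState = Arc<RwLock<Option<String>>>;

/// Flawed scheduling: the read lock is taken lazily, inside the task itself.
fn execute_lazily(
    session: &SessionState,
    start: mpsc::Receiver<()>,
) -> thread::JoinHandle<Result<String, &'static str>> {
    let session = Arc::clone(session);
    thread::spawn(move || {
        start.recv().unwrap(); // the task only starts running "later"
        let guard = session.read().unwrap();
        let result = match &*guard {
            Some(name) => Ok(format!("ok on {name}")),
            None => Err("session closed"),
        };
        result
    })
}

fn close(session: &SessionState) {
    *session.write().unwrap() = None;
}

fn main() {
    let session: SessionState = Arc::new(RwLock::new(Some("demo".to_string())));
    let (start_tx, start_rx) = mpsc::channel();
    let request = execute_lazily(&session, start_rx);
    close(&session); // no read guard exists yet, so close wins immediately
    start_tx.send(()).unwrap(); // only now does the request reach the lock
    assert_eq!(request.join().unwrap(), Err("session closed"));
}
```

Because the request was scheduled before `close` but registered with the lock after it, the request observes a closed session.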
AsyncTests.Close test specifies that the session only be closed after
all running requests are finished. This is not necessarily the case
in the current implementation, because the session lock is taken for
reading by the request futures only in the async block. Races may thus
occur if the session is closed while the request is being executed
and it has not yet acquired the lock, leading to request failures.

To be in line with the CPP Driver, we take the session lock
synchronously, before returning from `cass_session_execute[_batch]`
functions. This way we ensure that the session is not closed
while there are any requests running, and that the session is closed
only after all requests are finished.

Care is taken to ensure that the current_thread runtime does not deadlock
when waiting for the lock, by executing remaining futures while waiting
for the lock to be released (by calling `RUNTIME.block_on`).
There are a number of places in the code where we use `blocking_read()`
to access the session data. This is suboptimal because it blocks the
current thread, which can lead to deadlocks (on the current_thread
executor). To accommodate this case, we switch to using
`RUNTIME.block_on(lock.read())` to allow the async runtime to work
on queued tasks while waiting for the lock to be released.
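
Why a wait that keeps running queued tasks makes progress on a single thread can be shown with a toy model. This is a heavily simplified sketch and an assumption-laden illustration: `ToyRuntime` and its methods are hypothetical, a `Cell<bool>` stands in for the lock, and no tokio API is used.

```rust
use std::cell::Cell;
use std::collections::VecDeque;
use std::rc::Rc;

type Task = Box<dyn FnOnce()>;

/// Toy single-threaded "runtime": tasks run only when the queue is drained.
struct ToyRuntime {
    queue: VecDeque<Task>,
}

impl ToyRuntime {
    /// Models a thread-blocking wait: no queued task can run meanwhile, so if
    /// the lock holder is a queued task the wait would never finish.
    /// Returns `false` instead of hanging, to keep the sketch runnable.
    fn wait_blocking(&self, locked: &Cell<bool>) -> bool {
        !locked.get()
    }

    /// Models waiting through the runtime: queued tasks keep running
    /// until the lock becomes free or no work is left.
    fn wait_driving_tasks(&mut self, locked: &Cell<bool>) -> bool {
        while locked.get() {
            match self.queue.pop_front() {
                Some(task) => task(),
                None => return false, // nothing left that could release the lock
            }
        }
        true
    }
}

fn main() {
    let locked = Rc::new(Cell::new(true));
    let mut rt = ToyRuntime { queue: VecDeque::new() };

    // The only task able to release the "lock" is still queued.
    let releaser = Rc::clone(&locked);
    rt.queue.push_back(Box::new(move || releaser.set(false)));

    assert!(!rt.wait_blocking(&locked)); // would deadlock for real
    assert!(rt.wait_driving_tasks(&locked)); // succeeds by running the releaser
}
```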
Since the commit c1e40d7, the Session
`execute(_batch)` methods now clone the Session's Arc, which prevents
UAF if the Session is closed while the requests are still running.

That commit's message says: "we cannot enable `AsyncTests::Close` yet
since it expects that prematurely dropped session awaits all async tasks
before closing". This is now taken care of by the previous commit.
The `RwLock` that the `CassSessionInner` is protected with is taken
synchronously for reading by all running requests. This protects
the Session from the premature drop, as the `RwLock` will not let
the "writer" - `cass_session_close`'s future - proceed until all
requests are done. This means that `cass_session_close()`'s future
resolves only when all requests are done.
This also means that `cass_session_free()` blocks until all requests
are done, which is exactly what `AsyncTests::Close` expects.

Thus, we can enable the `AsyncTests::Close` test suite.
@wprzytula wprzytula force-pushed the fix-session-close branch from b477893 to e3c2b0a Compare July 6, 2025 12:51
@wprzytula
Collaborator Author

Rebased on master.

Labels
bug (Something isn't working), P1 (P1 priority item - very important)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fix AsyncTests.*_Close (await pending futures when closing the session)
1 participant