Skip to content

DC-aware LBP - better support the deprecated parameters #321

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Jun 17, 2025

Conversation

wprzytula
Copy link
Collaborator

@wprzytula wprzytula commented Jun 12, 2025

Background

The parameters: used_hosts_per_remote_dc, allow_remote_dcs_for_local_cl of the DC-aware LBP have been long deprecated. However, they've been still supported in the cpp-driver, so we should probably support it. At least, we should accept the same set of values as the cpp-driver, which has not been true: values other than 0 would trigger CASS_ERROR_LIB_BAD_PARAMS, and 0 would result in a behaviour different than cpp-driver's: it would be ignored and not restrict remote DC hosts at all.

What's done

More informative warnings/errors

The messages logged when an invalid / not supported / partially supported value is given as a parameter are made more informative.

Made used_hosts_per_remote_dc semantics closer to cpp-driver's

Due to Rust driver API's, it'd be very hard to support it correctly in the case of >0, because of reasons explained in #315. Therefore, the partial support approach is taken:

  • case =0 -> employ CassHostFilter, disable DC failover and we're done. Our semantics here are in line with the CPP Driver.
  • case >0 -> Treat this as +inf case - don't limit connections nor requests to remote nodes. Emit a warning that the semantics are different than in CPP Driver.

Allowed used_hosts_per_remote_dc > 0 and described its limitations

Any nonzero value is treated as +inf, which means that the datacenter failover is permitted without restrictions. All remote nodes can have opened connections to and be targeted with requests.

Fixed used_hosts_per_remote_dc = 0 semantics and made it fully consistent with cpp-driver's

  1. Datacenter failover is disabled, meaning that remote nodes do not appear in query plans.
  2. If a DC is remote to all LBPs and those LBPs disallow DC failover, then the driver will not connect to any hosts in that DC. This addresses the need to minimise overhead and costs of needless connecting to remote DCs.

Implemented full support of allow_remote_dcs_for_local_cl

This was as simple as using another filtering logic in the existing FilteringLoadBalancingPolicy.

Testing

Unit tests

Unit tests were introduced for internal abstractions used to configure the CassHostFilter to filter out remote nodes, if those are disallowed by all LBPs. Existing unit tests were adjusted to expect used_hosts_per_remote_dc > 0 as supported.
A unit test was written for the backbone of allow_remote_dcs_for_local_cl, DcLocalConsistencyAllowance.

Integration tests

  1. Enabled DCExecutionProfileTest_DCAware test.
  2. Enabled DcAwarePolicyTest_UsedHostsRemoteDc test.
  3. Written a missing test for the allow_remote_dcs_for_local_cl option: DcAwarePolicyTest_AllowRemoteDcsForLocalCl.

Unresolved questions

Behaviour of both our previous and current implementation of functions taking used_hosts_per_remote_dc parameter differ from one described in cassandra.h. Should we adjust the documentation there to be in line with the implementation?

Fixes: #315

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have implemented Rust unit tests for the features/changes introduced.
    - [ ] I have enabled appropriate tests in Makefile in {SCYLLA,CASSANDRA}_(NO_VALGRIND_)TEST_FILTER.
  • I added appropriate Fixes: annotations to PR description.

@wprzytula wprzytula requested review from Lorak-mmk and Copilot June 12, 2025 11:58
@wprzytula wprzytula self-assigned this Jun 12, 2025
@wprzytula wprzytula added the P2 P2 item - probably some people use this, let's implement that label Jun 12, 2025
@wprzytula wprzytula added this to the 0.5 milestone Jun 12, 2025
Copilot

This comment was marked as outdated.

@wprzytula wprzytula force-pushed the dc-aware-lb-better-support branch 5 times, most recently from edaf1f6 to df9bf55 Compare June 12, 2025 16:54
Comment on lines +836 to +850
let permit_dc_failover = if used_hosts_per_remote_dc > 0 {
// TODO: update cassandra.h documentation to reflect this behaviour.
tracing::warn!(
"cass_*_set_load_balance_dc_aware(_n): `used_hosts_per_remote_dc` parameter is only partially \
supported in the driver: `0` is supported correctly, and any value `>0` has the semantics of \"+inf\", \
which means no limit on the number of hosts per remote DC. This is different from the original cpp-driver! \
To clarify, you can understand this parameter as \"permit_dc_failover\", with `0` being `false` and `>0` \
being `true`."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand it is difficult to support connection-opening part of this parameter. We could expose cpp_rust_unstable method to allow that.

What is the problem with exposing LBP part of that? You can wrap LBP, and make an iterator that counts returned nodes per dc, and filters out nodes of DCs that reached the limit.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also handle the failover-for-local-cl parameter in wrapper.

Copy link
Collaborator Author

@wprzytula wprzytula Jun 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand it is difficult to support connection-opening part of this parameter. We could expose cpp_rust_unstable method to allow that.

It's not the matter of the Rust driver's API; it's how its connection pooling works inside.
It's even more about the very cpp-driver's mechanism being flawed. See my findings in #315:

After the initial metadata fetch, connection pools are created only for the non-ignored known nodes.

  • NOTE: How I understand the code, connection pool can later be created only for newly added nodes. This means that if all the nodes from a remote DC that we have opened connections to get DOWN, then the driver will not open connections to another node from that DC!

With such flawed semantics AND a large and nontrivial effort that would be needed to do this in the Rust Driver's inner parts, I'm strongly for not supporting <1, +inf) interval in the same way as cpp-driver.

What is the problem with exposing LBP part of that? You can wrap LBP, and make an iterator that counts returned nodes per dc, and filters out nodes of DCs that reached the limit.

Yes, we could do that. But what would be the benefit of this? I see none. The important part is not to open unnecessary connections. With connections already opened, I don't see any benefit from omitting a strict subset of remote targets from the query plan.

If you see any benefit, please share it. But not seeing it, I'm not going to implement this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With such flawed semantics

Oh I'm not suggesting to follow such weird behaviors. The important part is limiting the connections to N per remote DC - if we have exact semantics more that are more reasonable than cpp-driver, even better.

Ofc there is no need to do this in this PR.

cass_{cluster,exec_profile}_set_load_balance_dc_aware(_n) now warn
on null or empty local DC name provided.
`set_load_balance_dc_aware_n` unnecessarily `unwrap()`ped the local DC
name instead of pattern matching. This commit fixes that.
cass_{cluster,exec_profile}_set_load_balance_dc_aware(_n) now warn
if the deprecated parameters:
- `used_hosts_per_remote_dc`, or
- `allow_remote_dcs_for_local_cl`
are used (set to non-zero value).

This commit does not change the behavior of the driver, it just informs
the user that these parameters are no longer supported and should be set
to 0.

NOTE: next few commits implement (partial) support for these parameters
and thus change the messages introduces here.
The parameter `used_hosts_per_remote_dc` of the DC-aware LBP has been
long deprecated. However, it's been still supported in the cpp-driver.
Due to Rust driver API's, it'd be very hard to support it correctly
in the case of `>0`, because of reasons explained in
<scylladb#315>.
Therefore, the partial support approach is taken:
- case `=0` -> employ `HostFilter`, disable DC failover and we're done.
  Our semantics here are in line with the CPP Driver.
- case `>0` -> Treat this as `+inf` case - don't limit connections nor
  requests to remote nodes. Emit a warning that the semantics are
  different than in CPP Driver.

This commit only configures the LBP to set `permit_dc_failover` to false
when `used_hosts_per_remote_dc` is `0`, and to true when it's greater
than `0`.

The next commit will implement the connection management based on this
parameter: If a DC is remote to **all** LBPs **and** those LBPs disallow
DC failover, then the driver will not connect to any hosts in that DC.
This will finish proper support for the `used_hosts_per_remote_dc=0`
case.
The parameter `used_hosts_per_remote_dc` of the DC-aware LBP has been
long deprecated. However, it's been still supported in the cpp-driver.
Due to Rust driver API's, it'd be very hard to support it correctly
in the case of `>0`, because of reasons explained in
<scylladb#315>.
Therefore, the partial support approach is taken:
- case `=0` -> employ `HostFilter`, disable DC failover and we're done.
  Our semantics here are in line with the CPP Driver.
- case `>0` -> Treat this as `+inf` case - don't limit connections nor
  requests to remote nodes. Emit a warning that the semantics are
  different than in CPP Driver.

This commit introduces specific connection management for the case
when `used_hosts_per_remote_dc=0`: if a DC is remote to **all** LBPs
**and** those LBPs disallow DC failover, then the driver will not
connect to any hosts in that DC.

Implementation is done using the existing `Filtering{Config,Info}`
machinery. LBPs declare whether they restrict DCs to the specific local
DC, or allow all DCs - `DcRestriction` enum serves this purpose.
Then, `CassHostFilter` is built from the composition of DC restrictions,
by building up an `AllowedDcs` enum: if any LBP allows all DCs,
then `AllowedDcs::All` is used, otherwise, we build a whitelist of
allowed DCs is constructed (`AllowedDcs::Whitelist(Vec<String>)`).
Finally, if `AllowedDcs` is `Whitelist`, then `CassHostFilter` filters
out hosts whose DCs are not in that whitelist.

This finishes proper support for the `used_hosts_per_remote_dc=0` case.
@wprzytula wprzytula force-pushed the dc-aware-lb-better-support branch from 0736b58 to fc33900 Compare June 17, 2025 11:38
The purpose of `DcLocalConsistencyAllowance` is to determine whether
remote DCs are allowed to be contacted when a local consistency level
is used. It's going to be used in the next commit, to support
`allow_remote_dcs_for_local_cl` parameter of the DC-aware LBP.
Even though the parameter has long been deprecated, we should support
it. This commit makes both functions:
- `cass_cluster_set_load_balance_dc_aware`,
- `cass_execution_profile_set_load_balance_dc_aware`
handle the parameter `allow_remote_dcs_for_local_cl` correctly.
Those logs are emitted at trace level, so they will not clutter
the output. They are useful for debugging.
This reverts commit 6b44c2c.

The adjustments are no longer needed, because the offending parameters
are now supported.
@wprzytula wprzytula force-pushed the dc-aware-lb-better-support branch from fc33900 to b30520a Compare June 17, 2025 11:44
@wprzytula wprzytula changed the title DC-aware LBP - better support of used_hosts_per_remote_dc parameter DC-aware LBP - better support the deprecated parameters Jun 17, 2025
@wprzytula wprzytula requested review from Lorak-mmk and Copilot June 17, 2025 11:57
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR improves the DC-aware load balancing policy by updating the semantics of the deprecated parameters so that nonzero used_hosts_per_remote_dc values are now treated as permitting DC failover and by fully supporting the allow_remote_dcs_for_local_cl option. Key changes include:

  • Updating integration tests to use the new behavior (switching from 0 to 1 in parameters and adjusting assertions).
  • Modifying the filtering and load balancing policy logic to incorporate DcRestriction and DcLocalConsistencyAllowance.
  • Adjusting API functions in the cluster and exec_profile modules to reflect the updated parameter semantics.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
tests/src/integration/tests/test_exec_profile.cpp Updated test profiles with revised LBP parameters and assertions.
tests/src/integration/tests/test_dc_aware_policy.cpp Added tests for allow_remote_dcs_for_local_cl and adjusted validation of execution profiles.
scylla-rust-wrapper/src/load_balancing.rs Introduced new enums and logic for handling remote DC allowance in LBP, with additional logging.
scylla-rust-wrapper/src/exec_profile.rs Modified exec profile configuration to align with new DC-aware policy behavior.
scylla-rust-wrapper/src/cluster.rs Updated cluster configuration and related tests to support the revised parameter semantics.
Makefile Updated test filters to reflect changes in DC-aware policy tests.
Comments suppressed due to low confidence (1)

tests/src/integration/tests/test_dc_aware_policy.cpp:88

  • [nitpick] If the commented-out assertions represent expected behavior under certain conditions, consider clarifying the intent or adding a note explaining the rationale. If they are no longer necessary, consider removing them to improve code clarity.
        // are down and removed them from the host list.

"Filtering policy got {:?} in fallback and decided to {}.",
node,
match (is_host_allowed, is_dc_allowed) {
(false, false) => "DROP it because neither host nor DC are not allowed",
Copy link
Preview

Copilot AI Jun 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The log message here contains a double negative that may be confusing. Consider rephrasing it to something like "DROP it because neither host nor DC is allowed" to improve clarity.

Suggested change
(false, false) => "DROP it because neither host nor DC are not allowed",
(false, false) => "DROP it because neither host nor DC is allowed",

Copilot uses AI. Check for mistakes.

@wprzytula
Copy link
Collaborator Author

wprzytula commented Jun 17, 2025

The Scylla ITs (5.4.8) check is failing. The logs are the following:

 ccm> ccm node1 stop --not-gently
ccm> ccm node2 stop --not-gently
ccm> ccm node1 start --wait-other-notice --wait-for-binary-proto
unknown file: Failure
C++ exception with description "Traceback (most recent call last):
  File "/home/runner/.local/bin/ccm", line 74, in <module>
    cmd.run()
  File "/home/runner/.local/lib/python3.11/site-packages/ccmlib/cmds/node_cmds.py", line 196, in run
    self.node.start(not self.options.no_join_ring,
  File "/home/runner/.local/lib/python3.11/site-packages/ccmlib/scylla_node.py", line 762, in start
    scylla_process = self._start_scylla(args=args, marks=marks, update_pid=update_pid,
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/.local/lib/python3.11/site-packages/ccmlib/scylla_node.py", line 413, in _start_scylla
    node.watch_log_for_alive(self, from_mark=mark, timeout=t)
  File "/home/runner/.local/lib/python3.11/site-packages/ccmlib/node.py", line 535, in watch_log_for_alive
    self.watch_log_for(tofind, from_mark=from_mark, timeout=timeout, filename=filename)
  File "/home/runner/.local/lib/python3.11/site-packages/ccmlib/node.py", line 499, in watch_log_for
    raise TimeoutError(time.strftime("%d %b %Y %H:%M:%S", time.gmtime()) + " [" + self.name + "] Missing: " + str(
ccmlib.node.TimeoutError: 17 Jun 2025 13:10:28 [node3] Missing: ['127.0.0.1.* now UP']:
INFO  2025-06-17 13:08:04,593 [shard 0:main] raft_.....
See system.log for remainder
" thrown in TearDown().
[  FAILED  ] DcAwarePolicyTest.Integration_Cassandra_UsedHostsRemoteDc (165376 ms)

It looks like ccm encounters an error trying to start a node that was killed. Interestingly, the problem does not appear with the newer Scyllas. @Lorak-mmk any ideas?

My observations: the test that fails is one that I've newly added.

EDIT: we decided that we can remove that old Scylla version from the CI.

The Scylla 5.4.8 was introduced to the CI 9 months ago, in
acf9d3d. The rationale was to test on
the same version as the cpp-driver CI. There's no use in testing neither
cpp-driver nor this driver with such old version anymore.

As some tests fail due to ccm errors on that version, we remove it from
the CI.
The test DCExecutionProfileTest.DCAware was disabled because it expected
deterministic behaviour of the token-unaware round robin load balancing.
As the Rust Driver round-robins nodes with a random shift, the test
would fail. I reworked it to check that the node in the other DC is not
used, instead of expecting a specific node in the local DC to be used.

The test is now enabled in the Makefile.
DCAwarePolicy integration test was missing a test case for the
`allow_remote_dcs_for_local_cl` parameter. This commit adds a test
case that verifies the behavior of the DCAwarePolicy when this
parameter is set to either logical value.
Once we:
- implemented `get_attempted_hosts_from_future`, and
- implemented correct support of the two deprecated options:
  - `used_hosts_per_remote_dc`,
  - `allow_remote_dcs_for_local_cl`,
we can enable the `DcAwarePolicyTest` test suite.
@wprzytula wprzytula force-pushed the dc-aware-lb-better-support branch from b30520a to 946ccca Compare June 17, 2025 16:07
@wprzytula wprzytula merged commit 14bb41a into scylladb:master Jun 17, 2025
11 checks passed
@wprzytula wprzytula deleted the dc-aware-lb-better-support branch June 17, 2025 16:55
@wprzytula wprzytula mentioned this pull request Jun 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 P2 item - probably some people use this, let's implement that
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LBP: support DCAwarePolicy's used_hosts_per_remote_dc parameter
2 participants