LBP: support DCAwarePolicy's used_hosts_per_remote_dc parameter #315

Closed
@wprzytula

Description

CPP Driver semantics

Both cass_{cluster,execution_profile}_set_load_balancing_dc_aware() accept the deprecated used_hosts_per_remote_dc parameter.
Although DataStax deprecated the parameter in 2019, the CPP Driver still supports it. We must decide whether to support it fully, partially (with caveats), or not at all.

Big Picture semantics

The used_hosts_per_remote_dc parameter controls the number of hosts per remote DC that may be used by the driver. The same number applies to every remote DC, and each remote DC is limited independently.

It influences two key aspects of the driver's behaviour:

  1. Which (how many) remote nodes the driver opens connections to.
  2. Which (how many) remote nodes the driver puts in the query plans.

Both limits work similarly in the following way:

  • The driver keeps a set of nodes for each DC.
  • The driver maintains the state (UP/DOWN) of each node, based exclusively on CQL events.
  • When the driver needs to do something with the known nodes (open connections or send a request), it first filters them: it retains all nodes from the local DC and only the first used_hosts_per_remote_dc nodes (in an arbitrary, driver-internal order) from each remote DC. All other remote nodes are ignored.
  • After the initial metadata fetch, connection pools are created only for the non-ignored known nodes.
    • NOTE: As I understand the code, a connection pool can later be created only for newly added nodes. This means that if all the nodes in a remote DC that we have opened connections to go DOWN, the driver will not open connections to another node from that DC!
  • When a request is issued, the LBP includes only non-ignored nodes in the query plan.
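
To make the filtering step concrete, here is a minimal sketch of it. This is not the CPP Driver's code: HostInfo and retain_usable_hosts are hypothetical names made up for illustration, and the input order stands in for the driver's arbitrary internal order.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified host record (the real driver uses Host::Ptr).
struct HostInfo {
  std::string address;
  std::string dc;
};

// Keeps every host from the local DC and at most `used_hosts_per_remote_dc`
// hosts from each remote DC, in the order they appear in `known_hosts`.
// An empty `local_dc` makes every host count as local, mirroring
// DCAwarePolicy::distance().
std::vector<HostInfo> retain_usable_hosts(const std::vector<HostInfo>& known_hosts,
                                          const std::string& local_dc,
                                          std::size_t used_hosts_per_remote_dc) {
  std::map<std::string, std::size_t> kept_per_remote_dc;
  std::vector<HostInfo> usable;
  for (const HostInfo& host : known_hosts) {
    if (local_dc.empty() || host.dc == local_dc) {
      usable.push_back(host);  // Local hosts are never limited.
      continue;
    }
    std::size_t& kept = kept_per_remote_dc[host.dc];
    if (kept < used_hosts_per_remote_dc) {
      ++kept;
      usable.push_back(host);
    }
    // Any other remote host is ignored: no pool, no place in query plans.
  }
  return usable;
}
```

With one local node, two nodes in dc2, one node in dc3 and a limit of 1, the sketch keeps the local node, the first dc2 node and the dc3 node; with a limit of 0 only the local node remains.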

In-depth semantics

  1. Used in DCAwarePolicy::DCAwareQueryPlan::compute_next() to limit the maximum index of a host in the host vector (essentially wrapping modulo the bound). This affects only the query plan, not opened connections.
  2. Used in DCAwarePolicy::distance() to bound the number of hosts per remote DC (the first used_hosts_per_remote_dc remote hosts get REMOTE distance, the rest get IGNORE).
CassHostDistance DCAwarePolicy::distance(const Host::Ptr& host) const {
  if (local_dc_.empty() || host->dc() == local_dc_) {
    return CASS_HOST_DISTANCE_LOCAL;
  }

  const CopyOnWriteHostVec& hosts = per_remote_dc_live_hosts_.get_hosts(host->dc());
  size_t num_hosts = std::min(hosts->size(), used_hosts_per_remote_dc_);
  for (size_t i = 0; i < num_hosts; ++i) {
    if ((*hosts)[i]->address() == host->address()) {
      return CASS_HOST_DISTANCE_REMOTE;
    }
  }

  return CASS_HOST_DISTANCE_IGNORE;
}
  • As can be seen, the first used_hosts_per_remote_dc_ hosts (in the arbitrary map order) in each remote DC are considered REMOTE, and the rest are IGNOREd. This does not take into account whether connections exist; it depends purely on whether the node is considered UP or DOWN based on cluster topology events.

  • This is used, among other things, to determine whether to open a connection to a given node.

    • Upon Session initialization, in SessionBase::on_initialize(), cluster_->available_hosts() is called. This returns only hosts that are not ignored.
      • Cluster::is_host_ignored() scans all LBPs and returns true if ALL of them ignore the host.
    • RequestProcessor::internal_host_add() also creates a connection pool for a new host only if the host is not ignored.
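
The "ignored only if ALL policies ignore it" rule can be sketched as follows. This is a simplified illustration, not the driver's Cluster::is_host_ignored(): the Distance enum and the DistanceFn alias are hypothetical stand-ins for CassHostDistance and for a policy's distance() method.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for CassHostDistance.
enum class Distance { Local, Remote, Ignore };

// Hypothetical stand-in for a load balancing policy: maps a host address
// to the distance that policy assigns to the host.
using DistanceFn = std::function<Distance(const std::string& address)>;

// A host is ignored only if every policy ignores it; one policy that
// reports Local or Remote is enough for the host to get a connection pool.
// Note: with an empty policy list std::all_of returns true, so this sketch
// assumes at least one policy is always configured.
bool is_host_ignored(const std::vector<DistanceFn>& policies, const std::string& address) {
  return std::all_of(policies.begin(), policies.end(),
                     [&address](const DistanceFn& distance) {
                       return distance(address) == Distance::Ignore;
                     });
}
```

One consequence worth keeping in mind: if several execution profiles use different policies, a remote host ignored by one profile may still get a connection pool because another profile does not ignore it.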

Discussion

Option 1: Dated deprecation -> simply ignore?

The discussion should start by noting that the parameter was deprecated in 2019. This suggests what is probably the simplest possible solution: ignore the parameter and emit a warning when it is non-zero. OTOH, the problem is that our current behaviour (the only one configurable in the driver) is more like passing +inf (connections opened to all remote nodes) than like passing 0 (no connections opened to remote nodes).

Option 2: Let's support it! (At least partly...)

  1. Query plan - doable. The LBP has access to nodes' status (based on opened connections rather than CQL events, which is a more robust indicator), so it could easily retain only a specified number of nodes in the query plan. This, however, seems to be the less important part of these semantics; (not) opening connections is the more important part, as we want to avoid the overhead of excess TCP/CQL connections that would go unused anyway.

  2. Connection pool - not doable with the current Rust Driver. We can only support the particular case of used_hosts_per_remote_dc=0, by employing DcHostFilter.

  • For used_hosts_per_remote_dc>0, we first have to establish the semantics.
    • As noted above, the CPP Driver misbehaves when the only nodes from a given remote DC that it has connections to go DOWN. Should we fix this (IMO a bug) in our CPP-Rust implementation?
    • For nodes in remote DCs, should we open connections to all shards or just one shard?
    • Does the present API (an unsigned parameter) mean that some bound must always be given? If so, a user who wants effectively no bound must pass a large number (say, the maximum possible unsigned value). This is ugly.
    • Why would anyone want to pass anything other than {0, 1, +inf}? What is even the use case for this?

A solution to consider: support just two cases:

  • =0 -> employ DcHostFilter, disallow DC failover in the LBP, and we're done. Our semantics here are in line with the CPP Driver.
  • !=0 -> treat this as the +inf case: don't limit connections to remote nodes. Emit a warning that the semantics differ from the CPP Driver's, and attach a deprecation notice.
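
A rough sketch of how the two cases could be mapped, written in C++ only to show the decision logic. Everything here is hypothetical: RemoteDcSettings, map_used_hosts_per_remote_dc and the warning text are invented for this example and are not existing driver API.

```cpp
#include <string>

// Hypothetical description of what the wrapper would configure.
struct RemoteDcSettings {
  bool restrict_to_local_dc;  // true: employ a DC host filter (DcHostFilter)
  bool permit_dc_failover;    // whether the LBP may fall back to remote DCs
  std::string warning;        // empty when nothing needs to be logged
};

// Maps the deprecated parameter onto the two supported cases.
RemoteDcSettings map_used_hosts_per_remote_dc(unsigned used_hosts_per_remote_dc) {
  if (used_hosts_per_remote_dc == 0) {
    // Same semantics as the CPP Driver: remote nodes are never used.
    return {true, false, ""};
  }
  // Any non-zero value is treated as +inf: remote nodes are not limited.
  // The warning wording below is an assumption, not the actual message.
  return {false, true,
          "used_hosts_per_remote_dc is deprecated; non-zero values are "
          "treated as unlimited, which differs from the CPP Driver"};
}
```

This also sidesteps the "maximum unsigned value as no bound" question, since every non-zero value behaves the same way.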

Metadata

Labels: P2 (P2 item - probably some people use this, let's implement that)