Skip to content

(Maybe) Bug: data race in index_gt::update when using key-based metrics with parallel updates #678

@yoonseok-kim

Description

@yoonseok-kim

Hi, I’d like to report a potential data race issue in index_gt::update.

Describe the bug

Our product maintains vector data separately, so we use index_gt instead of index_dense_gt and implement our own metric. This metric retrieves our vector data by looking it up using the node’s key.

However, when using a metric that accesses vectors via keys, running index_gt::update in parallel can lead to a data race, because index_gt::update does not protect key updates with a lock.

Below is a snippet from our metric implementation:

  explicit custom_metric_t(...) {}
  float operator()(member_citerator_t const& a, member_citerator_t const& b) {
    return calculate_distance(to_vector(get_key(a)), to_vector(get_key(b)));
  }

Below is a code that read the key of a node in index_gt

friend inline vector_key_t get_key(member_iterator_gt const& it) noexcept { return it.key(); }

Below is a code that write the key of a node in index_gt::update

updated_node.key(key);

I would like to ask for clarification regarding this behavior. During parallel execution of index_gt::update, custom metrics that look up vectors by key cannot be guaranteed to be thread-safe. My assumption is that this was acceptable because index_dense_gt uses slot-based vector lookup, so ensuring thread-safe key reads/writes inside index_gt::update was not necessary. Is this assumption correct, or is the current behavior unintended?

Additionally, I have found that with a small modification to index_gt::update, it is possible to make parallel updates safe when updating the vector of an existing key. I have submitted a PR with this proposed fix. While this change has no impact on index_dense_gt (which uses slot-based vector lookup), it can offer benefits for key-based lookup approaches like ours, allowing update operations to be parallelized without data races when the key itself is not being modified. (ref. #679)

Steps to reproduce

  1. vector data is stored separately, and vectors are retrieved based on keys (so using index_gt).
  2. a custom metric is implemented that performs vector lookup using these keys.
  3. operations such as index_gt::insert + index_gt::update or index_gt::update + index_gt::update are executed in parallel.
  4. a data race occurs due to concurrent read/write access to the node’s key.

Expected behavior

With index_gt combined with a key-based vector lookup metric, I would like to make the operations thread-safe so that we can safely parallelize (index_gt::add + index_gt::update) and (index_gt::update + index_gt::update), or at least enable safe parallel execution in cases where the key is not being modified.

USearch version

v2.21.2

Operating System

Red Hat Enterprise Linux release 8.6

Hardware architecture

x86

Which interface are you using?

C++ implementation

Contact Details

[email protected]

Are you open to being tagged as a contributor?

  • I am open to being mentioned in the project .git history as a contributor

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions