-
Notifications
You must be signed in to change notification settings - Fork 263
Description
Hi, I’d like to report a potential data race issue in index_gt::update.
Describe the bug
Our product maintains vector data separately, so we use index_gt instead of index_dense_gt and implement our own metric. This metric retrieves our vector data by looking it up using the node’s key.
However, when using a metric that accesses vectors via keys, running index_gt::update in parallel can lead to a data race, because index_gt::update does not protect key updates with a lock.
Below is a snippet from our metric implementation:
explicit custom_metric_t(...) {}
float operator()(member_citerator_t const& a, member_citerator_t const& b) {
return calculate_distance(to_vector(get_key(a)), to_vector(get_key(b)));
}Below is a code that read the key of a node in index_gt
USearch/include/usearch/index.hpp
Line 2029 in 13c3fd9
| friend inline vector_key_t get_key(member_iterator_gt const& it) noexcept { return it.key(); } |
Below is a code that write the key of a node in index_gt::update
USearch/include/usearch/index.hpp
Line 2990 in 13c3fd9
| updated_node.key(key); |
I would like to ask for clarification regarding this behavior. During parallel execution of index_gt::update, custom metrics that look up vectors by key cannot be guaranteed to be thread-safe. My assumption is that this was acceptable because index_dense_gt uses slot-based vector lookup, so ensuring thread-safe key reads/writes inside index_gt::update was not necessary. Is this assumption correct, or is the current behavior unintended?
Additionally, I have found that with a small modification to index_gt::update, it is possible to make parallel updates safe when updating the vector of an existing key. I have submitted a PR with this proposed fix. While this change has no impact on index_dense_gt (which uses slot-based vector lookup), it can offer benefits for key-based lookup approaches like ours, allowing update operations to be parallelized without data races when the key itself is not being modified. (ref. #679)
Steps to reproduce
- vector data is stored separately, and vectors are retrieved based on keys (so using
index_gt). - a custom metric is implemented that performs vector lookup using these keys.
- operations such as
index_gt::insert+index_gt::updateorindex_gt::update+index_gt::updateare executed in parallel. - a data race occurs due to concurrent read/write access to the node’s key.
Expected behavior
With index_gt combined with a key-based vector lookup metric, I would like to make the operations thread-safe so that we can safely parallelize (index_gt::add + index_gt::update) and (index_gt::update + index_gt::update), or at least enable safe parallel execution in cases where the key is not being modified.
USearch version
v2.21.2
Operating System
Red Hat Enterprise Linux release 8.6
Hardware architecture
x86
Which interface are you using?
C++ implementation
Contact Details
Are you open to being tagged as a contributor?
- I am open to being mentioned in the project
.githistory as a contributor
Is there an existing issue for this?
- I have searched the existing issues
Code of Conduct
- I agree to follow this project's Code of Conduct