-
Notifications
You must be signed in to change notification settings - Fork 194
How to Use FALCONN
This is a guide to using FALCONN. As a complement to this introduction, we also have a Doxygen-generated API reference for C++ and a pdoc-generated API reference for Python.
As of now, FALCONN supports static datasets. We plan to add support for dynamic datasets shortly. Also, please be aware that FALCONN is currently not thread-safe. We are planning to release a multi-threaded version next.
We start with summarizing the C++ API. For the Python API, see below.
To use FALCONN, include the main header file falconn/lsh_nn_table.h.
All classes provided by FALCONN are in the namespace falconn.
The main wrapper class encapsulating an LSH data structure is LSHNearestNeighborTable.
Instances of this class should be created with the function construct_table.
The function construct_table has one required template parameter PointType, which is the type of an individual point to be handled by the data structure. PointType can be one of the following options:
DenseVector<float>DenseVector<double>SparseVector<float>SparseVector<double>
DenseVector is simply a typedef for the dynamically-sized vector class defined by the Eigen library, which offers convenient vector operations (see Eigen's documention for a brief introduction). SparseVector is a typedef for a std::vector of pairs, where the first component is the index of a coordinate, and the second component is the value (elements must be sorted by the first component).
The function construct_table has two parameters:
- A dataset
points, whose type isstd::vector<PointType>. - The LSH table parameters
params, whose type isLSHConstructionParameters.
The function construct_table returns a unique_ptr to the actual LSH table object.
The struct LSHConstructionParameters has several fields.
Some of the fields must be set in every case, while other fields are only necessary for certain hash families.
A good starting point is to use the function get_default_parameters that sets all parameters to reasonable default values, and then to fine-tune the returned LSH parameters if necessary.
The following five parameters are always required. If the meaning of some parameters is not clear, see our LSH Primer.
-
dimensionis the dimension of the dataset. -
lsh_familycan either beLSHFamily::HyperplaneorLSHFamily::Crosspolytope. -
distance_functionis the distance function used for selecting the near neighbors among the filtered points. Currently it can be eitherDistanceFunction::NegativeInnerProduct, which corresponds to the negated dot product, orDistanceFunction::EuclideanSquared, which corresponds to the squared Euclidean distance. -
storage_hash_table: how the low-level hash tables are stored. Set it toStorageHashTable::BitPackedFlatHashTableif the number of bins is not much larger than the number of points and toStorageHashTable::LinearProbingHashTable, otherwise. -
num_setup_threads: number of threads used to set up the hash table. Zero indicates that FALCONN should use the maximum number of available hardware threads (or1if this number cannot be determined). The number of threads used is always at most the number of tablesl. -
kis the number of hash functions per hash table. -
lis the number of hash tables.
The following parameters are required for the cross-polytope hash:
-
last_cp_dimensionis the dimension of the last of thekcross-polytopes. This parameter allows more granular control over the effective number of bits in the cross-polytope hash. The parameterlast_cp_dimensionmust be between1and the smallest power of two that is not less thandimension.
To simplify setting the parameters k and
last_cp_dimension for the cross-polytope hash, we provide a helper function
compute_number_of_hash_functions with the following interface: it takes a parameter number_of_hash_bits
and the partially set construction parameters (in particular, the fields dimension and lsh_family must be set).
The function then sets the fields k and last_cp_dimension automatically such that the number of buckets
is equal to 2number_of_hash_bits (i.e., the number of effective bits in the hash is number_of_hash_bits).
We recommend using this subroutine (for more details, see the example we provide).
-
num_rotationsis the number of pseudo-random rotations (see LSH Families for details). For sufficiently dense data, we recommend a value of 1. For sparse data, we recommend a value of 2.
Optional parameters for all hash functions:
-
seedis the randomness seed which is used in the hash functions.
Optional parameters specific to the cross-polytope hash:
-
feature_hashing_dimensionis the intermediate feature hashing dimension in the sparse version of the cross-polytope hash. The value offeature_hashing_dimensionmust be a power of two. In a nutshell, the meaning of this parameter is as follows: We first perform feature hashing to reduce the dimension of the sparse vector tofeature_hashing_dimensionand then apply the dense cross-polytope hash in this lower-dimensional space. Smaller values offeature_hashing_dimensionlead to a faster hash computation, but the quality of the hash function also decreases. For many high-dimensional and sparse data sets (such as bag of words, etc.), a feature hashing dimension of 512, 1024, or 2048 works well.
After setting up an LSH table, the next important step is setting the number of probes per query (see our LSH Primer for a discussion). This can be done using the member function set_num_probes. By default, we query one bucket
per hash table. This conservative choice corresponds to "vanilla" (i.e., non-multiprobe) LSH and is often suboptimal.
Setting the right number of probes for the desired query accuracy is crucial to get good performance with FALCONN.
Note that set_num_probes sets the number of probes we make for all hash tables together. Our GloVe example shows how to set the number of probes automatically.
In contrast to the other LSH parameters such as k and l, the number of probes can be changed after the table has been
constructed. Changing the number of probes involves little to no overhead.
Finally, we are ready to answer queries. On a very high level, an LSH table answers queries in two steps. First, the table retrieves a list of candidate data points using the multiprobe LSH procedure. Second, it scans this candidate list and chooses the best points according to the desired distance metric.
FALCONN provides low-level routines for generating the list of candidates, as well as more high-level functions that filter this list according to the specified distance metric.
- High-level methods:
-
find_nearest_neighborreturns the closest data point among the candidates generated by the LSH look-up -
find_k_nearest_neighborsreturns the k closest data points among the candidates generated by the LSH look-up -
find_near_neighborsreturns all the candidates within a given distance threshold.
-
- Low-level methods:
-
get_candidates_with_duplicatesreturns all candidates from the tables with possible duplicates. This is the lowest level query method. -
get_unique_candidatesalso returns all candidates as above, but eliminates duplicates from the returned candidate list. -
get_unique_sorted_candidates: the same as above, but the candidates are sorted according to their index / key (not their distance!). This can be helpful for memory locality when making a pass over all candidates.
-
We provide a thin Python wrapper around the C++ API described above, so most of the above discussion applies also to Python. For details, see our pdoc-generated documentation. As of now, the Python wrapper supports only dense data. Datasets are given as two-dimensional NumPy arrays, where rows are data points.
An important point is that the NumPy array holding a dataset must stay valid while using the LSH data structure. Otherwise, the code might crash silently.