|
1 | 1 | # LSH.jl
|
2 | 2 |
|
3 |
| -Documentation for the LSH.jl package. |
| 3 | +LSH.jl is a Julia package for performing [locality-sensitive hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) with various similarity functions. |
| 4 | + |
| 5 | +## Introduction |
| 6 | +One of the simplest methods for classifying, categorizing, and grouping data is to measure how similar pairs of data points are. For instance, the classical [``k``-nearest neighbors algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) takes a similarity function |
| 7 | + |
| 8 | +```math |
| 9 | +s:X\times X\to\mathbb{R} |
| 10 | +``` |
| 11 | + |
| 12 | +and a query point ``x\in X``, where ``X`` is the input space. It then computes ``s(x,y)`` for every point ``y`` in a database, and keeps the ``k`` points that are most similar to ``x``. |
| 13 | + |
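The brute-force search described above can be sketched in a few lines of plain Julia. This is only an illustration, not part of LSH.jl: the names `cosine_similarity` and `brute_force_knn` are hypothetical, and cosine similarity is assumed as the similarity function ``s``.

```julia
using LinearAlgebra

# Assumed similarity function s: cosine similarity between two vectors.
cosine_similarity(x, y) = dot(x, y) / (norm(x) * norm(y))

# Hypothetical helper: indices of the k database points most similar to x.
function brute_force_knn(s, x, database, k)
    # One similarity evaluation per database point: O(n) calls to s.
    scores = [s(x, y) for y in database]
    # rev=true orders scores from largest to smallest; keep the first k indices.
    return partialsortperm(scores, 1:k; rev=true)
end

database = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.1]
neighbors = brute_force_knn(cosine_similarity, query, database, 2)
```

Every query evaluates `s` against the entire database, which is the cost that the two issues below are concerned with.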
| 14 | +Broadly, there are two computational issues with this approach: |
| 15 | + |
| 16 | +- First, the database may be massive, much larger than could possibly fit in memory. This would make the brute-force approach of computing ``s(x,y)`` for every point ``y`` in the database far too expensive to be practical. |
| 17 | +- Second, the dimensionality of the data may be high enough that computing ``s(x,y)`` is itself expensive. In addition, the similarity function may simply be intrinsically difficult to compute. For instance, calculating Wasserstein distance entails solving a very high-dimensional linear program. |
| 18 | + |
| 19 | +To address these problems, researchers have developed a variety of techniques to accelerate similarity search, including: |
| 20 | + |
| 21 | +- [``k``-d trees](https://en.wikipedia.org/wiki/K-d_tree) |
| 22 | +- [Ball trees](https://en.wikipedia.org/wiki/Ball_tree) |
| 23 | +- Data reduction techniques |
| 24 | + |
| 25 | +## Locality-sensitive hashing |
| 26 | +*Locality-sensitive hashing* (LSH) is a technique for accelerating similarity search. It works by applying a hash function to the query point ``x`` and restricting the search to those points in the database whose hash collides with that of ``x``. The hash functions are sampled at random from a family of *locality-sensitive hash functions*. These hash functions have the property that ``\Pr[h(x) = h(y)]`` (i.e., the probability of a hash collision) increases as ``x`` and ``y`` become more similar. |
| 27 | + |
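To make the collision property concrete, here is a minimal sketch of one well-known locality-sensitive family for cosine similarity, the random-hyperplane (sign random projection) hash. It uses only the Julia standard library and does not reflect LSH.jl's own API; `random_hyperplane_hash` and `collision_rate` are hypothetical names chosen for this illustration.

```julia
using LinearAlgebra, Random

# Sample one hash function h(x) = (⟨r, x⟩ ≥ 0) for a random Gaussian vector r.
function random_hyperplane_hash(rng, dim)
    r = randn(rng, dim)
    return x -> dot(r, x) >= 0
end

rng = MersenneTwister(0)
hashes = [random_hyperplane_hash(rng, 3) for _ in 1:2000]

# Empirical estimate of Pr[h(x) = h(y)] over the sampled hash functions.
collision_rate(x, y) = count(h -> h(x) == h(y), hashes) / length(hashes)

x    = [1.0, 0.0, 0.0]
near = [1.0, 0.1, 0.0]    # small angle to x, so high cosine similarity
far  = [-1.0, 0.2, 0.0]   # large angle to x, so low cosine similarity

rate_near = collision_rate(x, near)
rate_far  = collision_rate(x, far)
```

For this family the collision probability is ``1 - \theta/\pi``, where ``\theta`` is the angle between the two inputs, so `rate_near` should come out much larger than `rate_far`.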
| 28 | +LSH.jl provides locality-sensitive hash functions for a variety of similarity and distance measures. Currently, LSH.jl supports hash functions for: |
| 29 | + |
| 30 | +- Cosine similarity (`cossim`) |
| 31 | +- Jaccard similarity (`jaccard`) |
| 32 | +- ``L^1`` (Manhattan / "taxicab") distance (`ℓ1`) |
| 33 | +- ``L^2`` (Euclidean) distance (`ℓ2`) |
| 34 | +- Inner product (`inner_prod`) |
| 35 | +- Function-space hashes (`L1`, `L2`, and `cossim`) |