Skip to content

raphamorim/fuzzy

Repository files navigation

Fuzzy

A comprehensive fuzzy search library implementing state-of-the-art algorithms and data structures for approximate string matching, text search, and similarity detection.

Features

Distance Metrics

  • Levenshtein Distance: Classic edit distance algorithm
  • Damerau-Levenshtein Distance: Supports transpositions
  • Myers' Algorithm: Efficient diff algorithm for edit distance

Data Structures

  • BK-Tree: Metric tree for efficient similarity search
  • Suffix Array: For substring search and pattern matching
  • FM-Index: Compressed full-text index based on Burrows-Wheeler Transform

Indexing Methods

  • N-gram Indexing: Character n-gram based search with Jaccard similarity
  • Trigram Index: Optimized 3-gram indexing for approximate matching
  • Q-gram Distance: Distance metric based on q-gram profiles

Locality-Sensitive Hashing

  • MinHash LSH: For finding similar documents
  • SimHash: Near-duplicate detection using hamming distance

Advanced Algorithms

  • Wu-Manber: Bit-parallel approximate string matching
  • Ukkonen A*: A* search for approximate pattern matching
  • Bitonic Sort: Parallel-friendly sorting for fuzzy matches

Benchmark

Installation

go get github.com/raphamorim/fuzzy

Usage Examples

BK-Tree for Spell Checking

tree := fuzzy.NewBKTree()
words := []string{"book", "books", "cake", "boo", "boon", "cook"}
for _, word := range words {
    tree.Add(word)
}

// Find words within edit distance 2 of "bok"
suggestions := tree.Search("bok", 2)
// Returns: ["book", "boo", "cook"]

N-gram Search

ng := fuzzy.NewNGram(3)
ng.Add("hello world", 0)
ng.Add("hello there", 1)
ng.Add("goodbye world", 2)

// Search with similarity threshold
results := ng.Search("helo wrld", 0.3)
// Returns indices of matching documents

LSH for Document Similarity

lsh := fuzzy.NewLSH(10, 5, 3) // 10 hash tables, 5 hash functions, 3-shingles
lsh.Add("The quick brown fox jumps over the lazy dog")
lsh.Add("The fast brown fox jumps over the lazy cat")

similar := lsh.Query("The quick brown fox leaps over the lazy dog", 0.7)
// Returns indices of similar documents

Wu-Manber Approximate Search

wm := fuzzy.NewWuManber("pattern")
text := "This text contains a patern with an error"
matches := wm.Search(text, 1) // Allow 1 error

for _, match := range matches {
    fmt.Printf("Found at position %d-%d with %d errors\n", 
        match.Start, match.End, match.Distance)
}

Benchmarks

Run benchmarks with:

go test -bench=. ./...

Algorithm Comparison

Key Differences

Algorithm Type Best Use Case Strengths Limitations
BK-Tree Tree-based Spell checking, typo correction Exact edit distance, efficient for small distances Slower for large edit distances
LSH Hash-based Near-duplicate detection, document similarity Very fast queries, probabilistic Approximate results, needs tuning
Suffix Array Index-based Substring search, pattern matching Exact matches, handles large texts High memory usage, slow updates
Wu-Manber Bit-parallel Pattern search with errors Fast streaming search, configurable errors Limited pattern length
N-gram Statistical Partial matching, similarity scoring Flexible similarity metrics, language-agnostic Less precise than edit distance

Algorithm Complexity

Algorithm Build Time Search Time Space
BK-Tree O(n log n) O(log n) O(n)
Suffix Array O(n log n) O(log n + m) O(n)
FM-Index O(n log n) O(m) O(n)
N-gram Index O(n) O(m + k) O(n)
LSH O(n) O(1) O(n)
Wu-Manber O(m) O(n) O(m)

Where:

  • n = text/corpus size
  • m = pattern size
  • k = number of results

Comparison with Other Libraries

  • sahilm/fuzzy: Uses a simple character-by-character matching algorithm optimized for editor-style filename matching
  • lithammer/fuzzysearch: Implements basic fuzzy matching with Levenshtein distance ranking

This library implements advanced algorithms not found in those libraries:

  • BK-trees and other metric tree structures
  • Suffix arrays and FM-indexes for large text search
  • N-gram based indexing
  • Locality-sensitive hashing approaches
  • Advanced edit distance algorithms like Myers' algorithm and Wu-Manber

License

GPL-3.0

About

fuzzy matching with Levenshtein, Damerau-Levenshtein, Bitap and n-gram

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published