Fuzzy

A comprehensive fuzzy search library implementing state-of-the-art algorithms and data structures for approximate string matching, text search, and similarity detection.

Features

Distance Metrics

Levenshtein Distance: Classic edit distance algorithm
Damerau-Levenshtein Distance: Supports transpositions
Myers' Algorithm: Efficient diff algorithm for edit distance

Data Structures

BK-Tree: Metric tree for efficient similarity search
Suffix Array: For substring search and pattern matching
FM-Index: Compressed full-text index based on Burrows-Wheeler Transform

Indexing Methods

N-gram Indexing: Character n-gram based search with Jaccard similarity
Trigram Index: Optimized 3-gram indexing for approximate matching
Q-gram Distance: Distance metric based on q-gram profiles

Locality-Sensitive Hashing

MinHash LSH: For finding similar documents
SimHash: Near-duplicate detection using hamming distance

Advanced Algorithms

Wu-Manber: Bit-parallel approximate string matching
Ukkonen A*: A* search for approximate pattern matching
Bitonic Sort: Parallel-friendly sorting for fuzzy matches

Installation

go get github.com/raphamorim/fuzzy

Usage Examples

BK-Tree for Spell Checking

tree := fuzzy.NewBKTree()
words := []string{"book", "books", "cake", "boo", "boon", "cook"}
for _, word := range words {
    tree.Add(word)
}

// Find words within edit distance 2 of "bok"
suggestions := tree.Search("bok", 2)
// Returns: ["book", "boo", "cook"]

N-gram Search

ng := fuzzy.NewNGram(3)
ng.Add("hello world", 0)
ng.Add("hello there", 1)
ng.Add("goodbye world", 2)

// Search with similarity threshold
results := ng.Search("helo wrld", 0.3)
// Returns indices of matching documents

LSH for Document Similarity

lsh := fuzzy.NewLSH(10, 5, 3) // 10 hash tables, 5 hash functions, 3-shingles
lsh.Add("The quick brown fox jumps over the lazy dog")
lsh.Add("The fast brown fox jumps over the lazy cat")

similar := lsh.Query("The quick brown fox leaps over the lazy dog", 0.7)
// Returns indices of similar documents

Wu-Manber Approximate Search

wm := fuzzy.NewWuManber("pattern")
text := "This text contains a patern with an error"
matches := wm.Search(text, 1) // Allow 1 error

for _, match := range matches {
    fmt.Printf("Found at position %d-%d with %d errors\n", 
        match.Start, match.End, match.Distance)
}

Benchmarks

Run benchmarks with:

go test -bench=. ./...

Algorithm Comparison

Key Differences

Algorithm	Type	Best Use Case	Strengths	Limitations
BK-Tree	Tree-based	Spell checking, typo correction	Exact edit distance, efficient for small distances	Slower for large edit distances
LSH	Hash-based	Near-duplicate detection, document similarity	Very fast queries, probabilistic	Approximate results, needs tuning
Suffix Array	Index-based	Substring search, pattern matching	Exact matches, handles large texts	High memory usage, slow updates
Wu-Manber	Bit-parallel	Pattern search with errors	Fast streaming search, configurable errors	Limited pattern length
N-gram	Statistical	Partial matching, similarity scoring	Flexible similarity metrics, language-agnostic	Less precise than edit distance

Algorithm Complexity

Algorithm	Build Time	Search Time	Space
BK-Tree	O(n log n)	O(log n)	O(n)
Suffix Array	O(n log n)	O(log n + m)	O(n)
FM-Index	O(n log n)	O(m)	O(n)
N-gram Index	O(n)	O(m + k)	O(n)
LSH	O(n)	O(1)	O(n)
Wu-Manber	O(m)	O(n)	O(m)

Where:

n = text/corpus size
m = pattern size
k = number of results

Comparison with Other Libraries

sahilm/fuzzy: Uses a simple character-by-character matching algorithm optimized for editor-style filename matching
lithammer/fuzzysearch: Implements basic fuzzy matching with Levenshtein distance ranking

This library implements advanced algorithms not found in those libraries:

BK-trees and other metric tree structures
Suffix arrays and FM-indexes for large text search
N-gram based indexing
Locality-sensitive hashing approaches
Advanced edit distance algorithms like Myers' algorithm and Wu-Manber

License

GPL-3.0

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
benchmark		benchmark
testdata		testdata
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
Taskfile.yml		Taskfile.yml
bktree.go		bktree.go
bktree_test.go		bktree_test.go
go.mod		go.mod
go.sum		go.sum
lsh.go		lsh.go
lsh_test.go		lsh_test.go
ngram.go		ngram.go
ngram_test.go		ngram_test.go
suffixarray.go		suffixarray.go
suffixarray_test.go		suffixarray_test.go
wumanber.go		wumanber.go
wumanber_test.go		wumanber_test.go

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Fuzzy

Features

Distance Metrics

Data Structures

Indexing Methods

Locality-Sensitive Hashing

Advanced Algorithms

Installation

Usage Examples

BK-Tree for Spell Checking

N-gram Search

LSH for Document Similarity

Wu-Manber Approximate Search

Benchmarks

Algorithm Comparison

Key Differences

Algorithm Complexity

Comparison with Other Libraries

License

About

Uh oh!

Releases

Packages

Languages

License

raphamorim/fuzzy

Folders and files

Latest commit

History

Repository files navigation

Fuzzy

Features

Distance Metrics

Data Structures

Indexing Methods

Locality-Sensitive Hashing

Advanced Algorithms

Installation

Usage Examples

BK-Tree for Spell Checking

N-gram Search

LSH for Document Similarity

Wu-Manber Approximate Search

Benchmarks

Algorithm Comparison

Key Differences

Algorithm Complexity

Comparison with Other Libraries

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages