Initial Full Text Search design documentation #46
Open
allenss-amazon wants to merge 28 commits into valkey-io:main from allenss-amazon:allenss-fulltext-docs
Changes from all commits (28 commits):
- aef048e First crack at full text design docs
- e06b6d1 First crack at full text design docs
- 2da08c2 First crack at full text design docs
- 4c189fd More.
- c100eab more
- 6a7b160 more
- 5c5d37c Add Fuzzy and rename
- aae4738 more
- 7441f4a Add estimate description
- aca4907 Even more
- 27060f9 More
- 50635c3 Merge branch 'valkey-io:main' into allenss-fulltext-docs
- 020281e Update
- c97edbc Fix spelling errors
- aaaaac6 Merge branch 'valkey-io:main' into allenss-fulltext-docs
- 64809ad Merge branch 'valkey-io:main' into allenss-fulltext-docs
- 31827d5 Update docs/full-text/index.md
- 5e79836 Update docs/full-text/index.md
- 2ad4e6c Update docs/full-text/phrase.md
- 1ff1702 Update docs/full-text/phrase.md
- 3e8e73c Update src/indexes/text/text.h
- 34de205 Update src/indexes/text/wildcard_iterator.h
- b3e7851 Update src/indexes/text/wildcard_iterator.h
- 9f60f2b Update src/indexes/text_index.h
- 43683da Update src/indexes/text/wildcard_iterator.h
- b92ae91 fix missing comma
- 338742a Merge branch 'main' into allenss-fulltext-docs
- 8b10951 Merge branch 'valkey-io:main' into allenss-fulltext-docs

All commits authored by allenss-amazon.

@@ -0,0 +1,10 @@
# Fuzzy Matching

There are many good blog posts on Levenshtein automata:

https://julesjacobs.com/2015/06/17/disqus-levenshtein-simple-and-fast.html

https://fulmicoton.com/posts/levenshtein/

The bottom line is that the prefix-tree representation of the data allows efficient fuzzy search for matches.
It's expected that building the Levenshtein automaton takes O(edit-distance * length-of-query-string) time, and that the automaton allows efficient searching of a prefix tree because it can prune large subtrees based on the evaluation of the sub-tree prefix.
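
Below is a minimal sketch of the technique from the posts above: walk the prefix tree while keeping one Levenshtein DP row per node, and prune any subtree whose best possible distance already exceeds the budget. The dict-based trie with a `WORD_END` marker is a hypothetical stand-in for the real _RadixTree_, not this module's actual data structure.

```python
# Minimal sketch, not the module's implementation. The trie is a nested
# dict with a WORD_END marker standing in for the real RadixTree.
WORD_END = "$"

def fuzzy_search(trie, query, max_dist):
    """Return (word, distance) for every word within max_dist edits of query."""
    results = []
    # DP row for the empty prefix: edit distance to each prefix of the query.
    row = list(range(len(query) + 1))
    for ch, child in trie.items():
        if ch != WORD_END:
            _walk(child, ch, query, row, ch, max_dist, results)
    return results

def _walk(node, ch, query, prev_row, prefix, max_dist, results):
    # Extend the DP table by one row for the appended character `ch`.
    row = [prev_row[0] + 1]
    for i in range(1, len(query) + 1):
        cost = 0 if query[i - 1] == ch else 1
        row.append(min(row[i - 1] + 1,           # insertion
                       prev_row[i] + 1,          # deletion
                       prev_row[i - 1] + cost))  # substitution
    if WORD_END in node and row[-1] <= max_dist:
        results.append((prefix, row[-1]))
    # Prune: if no cell is within budget, nothing below this node can match.
    if min(row) <= max_dist:
        for c, child in node.items():
            if c != WORD_END:
                _walk(child, c, query, row, prefix + c, max_dist, results)
```

For example, a trie containing only `cat` searched with query `bat` and `max_dist` 1 yields `[("cat", 1)]`, while subtrees whose every DP cell exceeds the budget are skipped entirely.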

@@ -0,0 +1,30 @@
# Text Index

The _TextIndex_ object is logically a sequence of 4-tuples: (_Word_, _Key_, _Field_, _Position_). The search operators can be streamlined when the tuples can be iterated in that order, henceforth referred to as lexical order.
Lexical ordering allows operations like intersection and union that operate on multiple iteration sequences to perform merge-like operations in linear time.
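
As a concrete illustration of the merge property, a linear-time intersection of two lexically ordered tuple streams might look like the following sketch (the tuple layout and names are illustrative only):

```python
# Sketch only: intersect two lexically ordered streams of
# (word, key, field, position) tuples with a single linear merge pass.
def intersect_ordered(a, b):
    ia, ib = iter(a), iter(b)
    ta, tb = next(ia, None), next(ib, None)
    while ta is not None and tb is not None:
        if ta < tb:
            ta = next(ia, None)
        elif tb < ta:
            tb = next(ib, None)
        else:
            yield ta
            ta, tb = next(ia, None), next(ib, None)
```

Union works the same way, emitting from whichever stream is currently smaller instead of skipping.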

In addition to the standard CRUD operations, _TextIndex_ provides a _WordIterator_ that efficiently iterates over sequences of tuples where the _Word_ element shares a common prefix or, optionally, a common suffix, again in lexical order. _WordIterator_ also optimizes other operations; e.g., it's efficient to move from one _Key_ to another _Key_ without iterating over the intervening _Field_ and/or _Position_ entries -- typically in O(1) or worst case O(log #keys) time.
From this capability the various search operators are constructed: word search, phrase search, and fuzzy search.
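
The shape of that capability, sketched as a hypothetical Python interface (the names and signatures here are assumptions, not the module's actual API):

```python
# Hypothetical interface sketch; names and signatures are assumptions.
class WordIterator:
    def next_tuple(self):
        """Advance to the next (word, key, field, position) tuple,
        in lexical order."""
        ...

    def seek_key(self, key):
        """Skip directly to the first tuple whose key is >= key,
        without visiting the intervening field/position entries;
        expected O(1), worst case O(log #keys)."""
        ...
```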

The _TextIndex_ object is implemented as a two-level hierarchy of objects. At the top level is a _RadixTree_ which maps a _Word_ into a _Postings_ object, which is a container of (_Key_, _Field_, _Position_) triples.
The use of the top-level _RadixTree_ allows efficient implementation of operations on subsets of the index that consist of _Words_ that have a common prefix or suffix.
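
In outline, the two-level hierarchy looks like this (a Python sketch of the logical structure only, not the actual implementation):

```python
from collections import defaultdict

# Logical structure only: a word-keyed map stands in for the RadixTree,
# and a plain list stands in for the Postings container.
class TextIndex:
    def __init__(self):
        self.words = defaultdict(list)  # RadixTree: word -> Postings

    def insert(self, word, key, field, position):
        self.words[word].append((key, field, position))
```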
|
||
Both the _Postings_ and _RadixTree_ implementations must adapt efficiently across a wide range in the number of items they contain. | ||
It's expected that both objects will have multiple internal representations to balance time/space efficiency at different scales. | ||
Likely the initial implementation will have two representations, i.e., | ||
a space-efficient implementation with O(N) insert/delete/iterate times and a time-efficient implementation with O(1) or O(Log N) insert/delete/iterate times. | ||
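
For illustration, the adaptive idea might look like the sketch below; the crossover threshold and the two concrete representations are assumptions, not decided implementation details:

```python
import bisect

SMALL_LIMIT = 128  # assumed crossover point, illustration only

class AdaptiveContainer:
    """Sorted list while small (space-efficient, O(N) insert);
    hash-based set once large (time-efficient, O(1) insert)."""
    def __init__(self):
        self.small = []    # compact sorted list
        self.large = None  # becomes a set once size crosses SMALL_LIMIT

    def insert(self, item):
        if self.large is not None:
            self.large.add(item)
        else:
            bisect.insort(self.small, item)
            if len(self.small) > SMALL_LIMIT:
                self.large = set(self.small)
                self.small = []
```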

Like all of the Valkey search operators, the text search operators (word, phrase, and fuzzy search) must support both the pre-filtering and post-filtering modes when combined with vector search.
At the conceptual level, the only real difference between the pre- and post-filtering modes is that in the post-filtering mode the search is performed across all _TextIndex_ entries with a particular _Field_, whereas in the pre-filtering mode the search is performed on _TextIndex_ entries with a particular _Key_.

While there are many time/space tradeoffs possible for the pre-filtering case, it is proposed to handle it with the same code as the post-filtering case, only operating over a _TextIndex_ that has been constrained to a single _Key_.
In other words, for each user-declared Schema there will be one _TextIndex_ constructed across all of the _Key_, _Field_ and _Position_ entries. This _TextIndex_ object will support all non-vector and post-filtered vector query operations. In addition, each Schema will have a secondary hashmap that provides one _TextIndex_ object for each _Key_ to support pre-filtering vector queries.
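
The resulting layout, sketched with hypothetical names:

```python
from collections import defaultdict

def make_text_index():
    # Stand-in for a TextIndex: word -> [(key, field, position), ...]
    return defaultdict(list)

class SchemaIndexes:
    def __init__(self):
        # One schema-wide index over every key: serves all non-vector
        # queries and post-filtered vector queries.
        self.schema_wide = make_text_index()
        # Secondary hashmap: one small index per key, for pre-filtered
        # vector queries (and, as noted below, for key deletion).
        self.per_key = {}  # key -> TextIndex constrained to that key
```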

As it turns out, this secondary per-key hashmap is also useful to support key deletion, as it contains exactly the words contained by the fields of the key and nothing else. This use-case drives the need for the _RadixTree_ and _Postings_ objects to have representations optimized for very low numbers of entries.

## Defrag

The _Postings_ objects contained within the schema-wide _TextIndex_ object will account for the majority of the consumed memory. Defrag is implemented by using the _WordIterator_ to visit each _Postings_ object and defrag it.

@@ -0,0 +1,39 @@
Text indexes are commonly referred to as inverted because they are not indexes from names to values, but rather from values to names.
Within a running Valkey instance we can think of the text index as a collection of tuples and then reason about how these tuples are indexed for efficient operation.
Tuple members are:

- _Schema_ -- The user-visible index (aka index-schema).
- _Field_ -- The TEXT field (aka attribute) definition within a _Schema_.
- _Word_ -- A lexical element. Query operators work on words.
- _Key_ -- The Valkey key containing this _Word_, needed for result generation as well as for combining with other search operators.
- _Position_ -- The location within the _Field_ (stored as a _Word_ offset), needed for exact phrase matching. Future versions may extend the _Position_ to include the byte offset within the _Field_ to efficiently support highlighting.
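
Written out as a record for concreteness (the field names are purely illustrative):

```python
from typing import NamedTuple

class IndexTuple(NamedTuple):  # illustrative names only
    schema: str    # user-visible index
    field: str     # TEXT field within the schema
    word: str      # lexical element
    key: str       # Valkey key containing the word
    position: int  # word offset within the field
```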

There are some choices to make in how to index this information. There aren't any search operations that are keyed by _Position_, so this tuple element isn't a candidate for indexing.

However, looking across the various operations that the search module needs to perform, it's clear that both _Key_-based and _Word_-based organizations are useful.

The ingestion engine wants a _Key_-based organization in order to efficiently locate tuples for removal (ingestion => remove old values, then maybe insert new ones). It turns out that vector queries can also use a _Key_-based organization in some filtering modes.

Text query operations want a _Word_-based organization.
So the choice is how to index the other members of a tuple: _Schema_, _Field_, _Key_ and _Position_.
There are three different choices for the _Word_-based dictionary, with very different time/space consequences.

One choice would be to have a single per-node _Word_ dictionary. While this offers the best dictionary space efficiency, it requires each _Postings_ object to contain the remaining tuple entries (_Schema_, _Field_, _Key_ and _Position_) for every _Word_ present in the corpus. This prohibits taking advantage of the high rate of duplication in the _Schema_ and _Field_ tuple members.
A major problem with this choice is that in order to delete a _Schema_, you must crawl the entire _Word_ dictionary.
There are use-cases where Schema creation and deletion are fairly frequent, so this becomes a poor choice.

Another choice would be to organize a _Word_ dictionary for each _Schema_.
Now the _Postings_ object need only provide the _Field_, _Key_ and _Position_ entries.
This has the advantage of eliminating the highly redundant _Schema_ tuple member, and the disadvantages of duplicating space for words that appear in multiple Schemas and increasing the size of the _Postings_ object to record the _Field_. More on this option below.

The last choice would be a per-_Field_ word dictionary. Now the _Postings_ object need only provide the _Key_ and _Position_ entries.
Extending the pattern of the per-_Schema_ word dictionary, this has the advantage of eliminating both of the highly redundant tuple members (_Schema_ and _Field_), with the disadvantage of duplicating words found in multiple fields in the corpus.

Having ruled out the per-node word dictionary, the choice between per-_Schema_ and per-_Field_ is evaluated. The difference in _Postings_ object size between these two choices need not be very large.
In particular, because the vast majority of indexes will likely have a small number of text fields, only a very small number of bits is required to represent a field. These bits can be combined with the _Position_ field, resulting in a per-_Schema_ posting that is only an epsilon larger than the per-_Field_ posting.
Thus it's likely that the space savings of the per-_Schema_ word dictionary will dominate, making it the most space-efficient choice.
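
A sketch of that packing, assuming (purely for illustration) 4 bits for the field, i.e., up to 16 TEXT fields per schema:

```python
FIELD_BITS = 4                      # assumed width, illustration only
FIELD_MASK = (1 << FIELD_BITS) - 1

def pack(position: int, field_id: int) -> int:
    assert 0 <= field_id <= FIELD_MASK
    return (position << FIELD_BITS) | field_id

def unpack(packed: int) -> tuple[int, int]:
    return packed >> FIELD_BITS, packed & FIELD_MASK  # (position, field_id)
```

A per-_Schema_ posting then stores one such packed integer where a per-_Field_ posting would store the bare position.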

Another reason to choose per-_Schema_ is that the query language of Redisearch is optimized for multi-field text searching.
For example, the query string `(@t1|@t2):entire` searches for the word `entire` in two different text fields. The per-_Schema_ organization requires only a single word lookup and a single _Postings_ traversal, while the per-_Field_ organization would require two word lookups and two _Postings_ traversals.
It should be noted that the Redisearch default for text queries is to search _all_ fields of type TEXT (unlike all other query operators, which require a single field to be specified).
Thus the per-_Schema_ organization is chosen.

@@ -0,0 +1,9 @@
# Exact Phrase Matching

The exact phrase search operator looks for sequences of words within the same field of one key. In the query language, an exact phrase consists of a sequence of word specifiers enclosed in double quotes.
Each word specifier can be a word, a word wildcard match, or a fuzzy word match.
The exact phrase search also has two metadata parameters: _Slop_ and _Inorder_. The _Slop_ parameter is a maximum distance between words, i.e., with _Slop_ == 0 the words must be adjacent, while with _Slop_ == 1 there can be up to 1 non-matching word between the matching words. For example, with _Slop_ == 1 the phrase "quick fox" matches the text "quick brown fox". The _Inorder_ parameter indicates whether the word specifiers must be found in the text field in the exact order specified in the query, or whether they can be found in any order within the constraints of _Slop_.

Iterators for each word specifier are constructed from the query and iterated. As each matching word is found, the corresponding _Postings_ objects are consulted and intersected to locate keys that contain all of the words. Once such a key is located, a position iterator for that key is used to determine whether the _Slop_ and _Inorder_ sequencing requirements are satisfied.
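
A brute-force sketch of that final position check, assuming the postings have already been intersected down to a single key and field (the real implementation would merge position iterators rather than enumerate combinations):

```python
from itertools import product

def phrase_matches(position_lists, slop, inorder):
    """position_lists[i] holds the sorted word offsets of the i-th specifier."""
    for combo in product(*position_lists):
        pos = list(combo) if inorder else sorted(combo)
        # Each adjacent pair must be increasing, with at most `slop`
        # non-matching words between the two matches.
        if all(0 <= b - a - 1 <= slop for a, b in zip(pos, pos[1:])):
            return True
    return False

# "quick fox" against "quick brown fox" with Slop == 1:
# phrase_matches([[0], [2]], slop=1, inorder=True) -> True
```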

The implementation will need to provide some form of self-policing to ensure that the timeout requirements are honored, as it's quite possible for these nested searches to explode combinatorially.

@@ -0,0 +1,163 @@
import requests
from bs4 import BeautifulSoup
import re
import redis

# -------------------------------------------------------------------
# A simple radix tree (one character per node, i.e., a trie)
# -------------------------------------------------------------------
class RadixTree:
    def __init__(self):
        self.children = {}
        self.postings = 0

    def insert(self, word):
        """
        Insert a single word into the radix tree.
        """
        if len(word) == 0:
            self.postings = self.postings + 1
        else:
            if word[0] not in self.children:
                self.children[word[0]] = RadixTree()
            self.children[word[0]].insert(word[1:])

    def count_nodes(self):
        """
        Returns total number of nodes in the tree (including the root).
        """
        count = len(self.children)
        if self.postings > 0:
            count = count + 1
        for c in self.children.values():
            count += c.count_nodes()
        return count

    def count_single_child_nodes(self):
        """
        Returns how many nodes have exactly one child.
        """
        count = 0
        if self.postings == 0 and len(self.children) == 1:
            count = 1
        for c in self.children.values():
            count += c.count_single_child_nodes()
        return count

# -------------------------------------------------------------------
# Helper function to fetch random Wikipedia text
# -------------------------------------------------------------------
def fetch_random_wikipedia_page_text():
    """
    Fetches a random Wikipedia page and returns its text content.
    """
    url = "https://en.wikipedia.org/wiki/Special:Random"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except Exception as e:
        print(f"Failed to fetch page: {e}")
        return ""

    soup = BeautifulSoup(response.text, "html.parser")

    # Extract text from <p> tags as a simple approach
    paragraphs = soup.find_all("p")
    page_text = ""
    for p in paragraphs:
        page_text += p.get_text() + "\n"

    # Simple cleaning:
    # remove extra whitespace, newlines, references in brackets, etc.
    page_text = re.sub(r"\[\d+\]", "", page_text)  # remove reference markers
    page_text = re.sub(r"\s+", " ", page_text).strip()
    return page_text

client = redis.Redis(host='localhost', port=6379, decode_responses=True)

def memory_usage():
    # "total.allocated" comes from the MEMORY STATS reply.
    x = client.execute_command("memory stats")
    return int(x["total.allocated"])

# -------------------------------------------------------------------
# Main script to scrape, build tree, and show statistics
# -------------------------------------------------------------------
def main():
    key_count = 0
    key_size = 0
    word_size = 0
    word_count = 0
    stop_word_count = 0
    postings_space4 = 0
    postings_space8 = 0

    NUM_PAGES = 10000
    NUM_STATS = 100
    tree = RadixTree()

    client.execute_command("flushall sync")
    client.execute_command("ft.create x on hash schema x text")

    start_usage = memory_usage()
    print("At startup memory = ", start_usage)

    stop_words = {"a", "is", "the", "an", "and", "are", "as", "at", "be",
                  "but", "by", "for", "if", "in", "into", "it", "no", "not",
                  "of", "on", "or", "such", "that", "their", "then", "there",
                  "these", "they", "this", "to", "was", "will", "with"}

    for i in range(NUM_PAGES):
        text = fetch_random_wikipedia_page_text()

        if not text:
            # If the text is empty (fetch failure), skip
            continue

        # Split into sentences (very naive). You can improve with nltk sent_tokenize.
        sentences = re.split(r'[.!?]+', text)

        for sentence in sentences:
            # Store each sentence as its own hash key, then further
            # split into words (again naive)
            client.hset(str(key_count), mapping={"x": sentence})
            words = re.split(r'\W+', sentence)
            words = [w.lower() for w in words if w]  # remove empty strings, to lower
            key_count = key_count + 1
            key_size = key_size + len(sentence)
            per_sentence = {}
            for word in words:
                word_count = word_count + 1
                word_size = word_size + len(word)
                if word in stop_words:
                    stop_word_count = stop_word_count + 1
                else:
                    if word in per_sentence:
                        postings_space4 = postings_space4 + 1
                        postings_space8 = postings_space8 + 1
                    else:
                        postings_space4 = postings_space4 + 5
                        postings_space8 = postings_space8 + 9
                        per_sentence[word] = True
                    tree.insert(word)
        if (i % NUM_STATS) == 0:
            # Periodically compute and print statistics
            total_nodes = tree.count_nodes()
            single_child_nodes = tree.count_single_child_nodes()
            word_nodes = total_nodes - single_child_nodes
            space8 = (word_nodes * 8) + \
                     int(single_child_nodes * 1.5) + \
                     postings_space8
            space4 = (word_nodes * 8) + int(single_child_nodes * 1.5) + \
                     postings_space4 + (key_count * 4)

            space44 = space4 - (word_nodes * 4)

            redis_usage = memory_usage() - start_usage

            print(f"Keys:{key_count} AvgKeySize:{key_size/key_count:.1f} Words:{word_count}/{stop_word_count} AvgWord:{word_size/word_count:.1f} Postings_space:{postings_space8//1024}/{postings_space4//1024}KB Space:{space8//1024}/{space4//1024}/{space44//1024}KB Space/Word:{space8/word_count:.1f}/{space4/word_count:.1f} Space/Corpus:{space8/key_size:.1f}/{space4/key_size:.1f}/{space44/key_size:.1f} Redis:{redis_usage//1024}KB /Key:{redis_usage//key_count} /Word:{redis_usage//word_count} Ratio:{space8/redis_usage:.1f}/{space4/redis_usage:.1f}")

if __name__ == "__main__":
    main()
Review comment:
Impersonating browsers is inappropriate.
Please invent a real UA.