Initial Full Text Search design documentation #46

Open
allenss-amazon wants to merge 28 commits into base: main
Changes from all commits (28 commits)
aef048e
First crack at full text design docs
allenss-amazon Feb 3, 2025
e06b6d1
First crack at full text design docs
allenss-amazon Feb 3, 2025
2da08c2
First crack at full text design docs
allenss-amazon Feb 3, 2025
4c189fd
More.
allenss-amazon Feb 4, 2025
c100eab
more
allenss-amazon Feb 5, 2025
6a7b160
more
allenss-amazon Feb 6, 2025
5c5d37c
Add Fuzzy and rename
allenss-amazon Feb 8, 2025
aae4738
more
allenss-amazon Feb 9, 2025
7441f4a
Add estimate description
allenss-amazon Feb 9, 2025
aca4907
Even more
allenss-amazon Feb 14, 2025
27060f9
More
allenss-amazon Feb 14, 2025
50635c3
Merge branch 'valkey-io:main' into allenss-fulltext-docs
allenss-amazon Mar 29, 2025
020281e
Update
allenss-amazon Mar 30, 2025
c97edbc
Fix spelling errors
allenss-amazon Mar 30, 2025
aaaaac6
Merge branch 'valkey-io:main' into allenss-fulltext-docs
allenss-amazon May 22, 2025
64809ad
Merge branch 'valkey-io:main' into allenss-fulltext-docs
allenss-amazon Jun 13, 2025
31827d5
Update docs/full-text/index.md
allenss-amazon Jun 13, 2025
5e79836
Update docs/full-text/index.md
allenss-amazon Jun 13, 2025
2ad4e6c
Update docs/full-text/phrase.md
allenss-amazon Jun 13, 2025
1ff1702
Update docs/full-text/phrase.md
allenss-amazon Jun 13, 2025
3e8e73c
Update src/indexes/text/text.h
allenss-amazon Jun 13, 2025
34de205
Update src/indexes/text/wildcard_iterator.h
allenss-amazon Jun 13, 2025
b3e7851
Update src/indexes/text/wildcard_iterator.h
allenss-amazon Jun 13, 2025
9f60f2b
Update src/indexes/text_index.h
allenss-amazon Jun 13, 2025
43683da
Update src/indexes/text/wildcard_iterator.h
allenss-amazon Jun 13, 2025
b92ae91
fix missing comma
allenss-amazon Jun 13, 2025
338742a
Merge branch 'main' into allenss-fulltext-docs
allenss-amazon Jun 17, 2025
8b10951
Merge branch 'valkey-io:main' into allenss-fulltext-docs
allenss-amazon Jun 28, 2025
20 changes: 11 additions & 9 deletions .vscode/cspell.json
@@ -13,23 +13,25 @@
// words - list of words to be always considered correct
"words": [
"absl",
"vmsdk",
"redis",
"Valkey",
"nonexistentkey",
"valkeysearch",
"bazel",
"Externalizer",
"highwayhash",
"hnsw",
"hnswlib",
"Inorder",
"MRMW",
"Externalizer",
"highwayhash",
"mstime",
"NOLINTNEXTLINE",
"nonexistentkey",
"redis",
"Redisearch",
"synchronistically",
"bazel",
"Valkey",
"valkeysearch",
"vmsdk"
],
// flagWords - list of words to be always considered incorrect
// This is useful for offensive words and common spelling errors.
// For example "hte" should be "the"
"flagWords": []
}
}
10 changes: 10 additions & 0 deletions docs/full-text/fuzzy.md
@@ -0,0 +1,10 @@
# Fuzzy Matching

There are many good blog posts on Levenshtein automata.

https://julesjacobs.com/2015/06/17/disqus-levenshtein-simple-and-fast.html

https://fulmicoton.com/posts/levenshtein/

The bottom line is that the prefix tree representation of the data allows efficient fuzzy search for matches.
It's expected that building the Levenshtein automaton takes O(edit-distance * query-string-length) time and that the automaton allows efficient searching of a prefix tree, because it can prune large subtrees based on the evaluation of the subtree's prefix.
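 
A minimal Python sketch of the idea (not the module's implementation), reusing the character-per-node `RadixTree` from `scrape.py` in this PR (where `postings > 0` marks a complete word) and the classic Levenshtein DP-row computation rather than a compiled automaton:

```python
# Minimal sketch, not the module's implementation. Assumes a character-per-node
# trie like the RadixTree in scrape.py: `children` is a dict of child nodes and
# `postings > 0` marks a complete word.
def fuzzy_search(root, query, max_dist):
    """Return (word, distance) pairs for words within `max_dist` edits of `query`."""
    results = []
    first_row = list(range(len(query) + 1))  # distance from "" to each query prefix

    def walk(node, prefix, prev_row):
        for ch, child in node.children.items():
            # Build the next DP row for the prefix extended by `ch`.
            row = [prev_row[0] + 1]
            for i in range(1, len(query) + 1):
                cost = 0 if query[i - 1] == ch else 1
                row.append(min(row[i - 1] + 1,           # insertion
                               prev_row[i] + 1,          # deletion
                               prev_row[i - 1] + cost))  # substitution
            if child.postings > 0 and row[-1] <= max_dist:
                results.append((prefix + ch, row[-1]))
            # Prune: if every cell exceeds max_dist, no extension of this
            # prefix can ever come within the edit-distance budget.
            if min(row) <= max_dist:
                walk(child, prefix + ch, row)

    walk(root, "", first_row)
    return results
```

The pruning step is the key property: once every cell of the current row exceeds the edit-distance budget, no word below that prefix can match, so the entire subtree is skipped.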
30 changes: 30 additions & 0 deletions docs/full-text/index.md
@@ -0,0 +1,30 @@
# Text Index

The _TextIndex_ object is logically a sequence of 4-tuples: (_Word_, _Key_, _Field_, _Position_). The search operators can be streamlined when the tuples can be iterated in that order, henceforth referred to as lexical order.
Lexical ordering allows operations like intersection and union that operate on multiple
iteration sequences to perform merge-like operations in linear time.
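
As an illustration (not the module's iterator API), two lexically ordered tuple streams, each for a single word, can be intersected on _Key_ with a two-pointer merge that consumes each stream exactly once:

```python
# Illustration only, not the module's iterator API. Each stream is a list of
# (word, key, field, position) tuples for a single word, already in lexical
# order, so keys appear in non-decreasing order within each stream.
def intersect_keys(stream_a, stream_b):
    result = []
    i = j = 0
    while i < len(stream_a) and j < len(stream_b):
        key_a, key_b = stream_a[i][1], stream_b[j][1]
        if key_a == key_b:
            result.append(key_a)
            # Skip the remaining tuples for this key in both streams.
            while i < len(stream_a) and stream_a[i][1] == key_a:
                i += 1
            while j < len(stream_b) and stream_b[j][1] == key_b:
                j += 1
        elif key_a < key_b:
            i += 1
        else:
            j += 1
    return result
```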

In addition to the standard CRUD operations, _TextIndex_ provides a _WordIterator_ that efficiently iterates over sequences of tuples where the _Word_ element shares a common prefix or,
optionally, a common suffix, again in lexical order. _WordIterator_ also optimizes other operations, e.g.,
it's efficient to move from one _Key_ to another _Key_ without iterating over the intervening _Field_ and/or _Position_ entries -- typically in O(1) and at worst O(log #keys) time.
From this capability the various search operators are constructed: word search, phrase search, and fuzzy search.

The _TextIndex_ object is implemented as a two-level hierarchy of objects. At the top level is a _RadixTree_ which maps a _Word_ into a _Postings_ object, which is a container of (_Key_, _Field_, _Position_) triples.
The use of the top-level _RadixTree_ allows efficient implementation of operations on subsets of the index that consist of _Words_ that have a common prefix or suffix.
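
A structural sketch of the hierarchy (illustrative Python only, not the planned implementation; a plain dict stands in for the _RadixTree_ and the simplest possible _Postings_ shape is used):

```python
# Structural sketch only. A dict stands in for the RadixTree; Postings uses
# the simplest possible shape rather than an adaptive representation.
from collections import defaultdict

class Postings:
    """(key, field, position) triples for a single word."""
    def __init__(self):
        self.by_key = defaultdict(lambda: defaultdict(list))  # key -> field -> positions

    def add(self, key, field, position):
        self.by_key[key][field].append(position)

class TextIndex:
    def __init__(self):
        self.words = {}  # word -> Postings (a RadixTree in the actual design)

    def insert(self, word, key, field, position):
        self.words.setdefault(word, Postings()).add(key, field, position)

    def word_iterator(self, prefix):
        # The RadixTree makes this a subtree walk; a dict needs a full scan.
        for word in sorted(self.words):
            if word.startswith(prefix):
                yield word, self.words[word]
```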

Both the _Postings_ and _RadixTree_ implementations must adapt efficiently across a wide range of item counts.
It's expected that both objects will have multiple internal representations to balance time/space efficiency at different scales.
The initial implementation will likely have two representations, i.e.,
a space-efficient representation with O(N) insert/delete/iterate times and a time-efficient representation with O(1) or O(log N) insert/delete/iterate times.
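
One hypothetical shape for such an adaptive container (illustrative names and threshold only):

```python
# Hypothetical sketch of an adaptive container: a small sorted list
# (space-efficient, O(N) updates) that promotes itself to a hash-based
# representation (time-efficient, ~O(1) updates) past a size threshold.
import bisect

PROMOTE_AT = 32  # illustrative threshold, not a tuned value

class AdaptivePostings:
    def __init__(self):
        self.small = []    # sorted list of (key, field, position)
        self.large = None  # dict: (key, field) -> set of positions

    def add(self, key, field, position):
        if self.large is not None:
            self.large.setdefault((key, field), set()).add(position)
            return
        bisect.insort(self.small, (key, field, position))
        if len(self.small) > PROMOTE_AT:
            # Promote: rebuild the hash-based representation once.
            self.large = {}
            for k, f, p in self.small:
                self.large.setdefault((k, f), set()).add(p)
            self.small = []
```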

Like all of the Valkey search operators, the text search operators (word, phrase, and fuzzy search) must support both the pre-filtering and post-filtering modes when combined with vector search.
At the conceptual level, the only real difference between the pre- and post-filtering modes is that in the post-filtering mode the search is performed across all _TextIndex_ entries with a particular _Field_, whereas in the pre-filtering mode the search is performed only over _TextIndex_ entries with a particular _Key_.

While there are many time/space tradeoffs possible for the pre-filtering case, it is proposed to handle the pre-filtering case with the same code as the post-filtering case, only operating over a _TextIndex_ that has been constrained to a single _Key_.
In other words, for each user-declared Schema there will be one _TextIndex_ constructed across all of the _Key_, _Field_ and _Position_ entries. This _TextIndex_ object will support all non-vector and post-filtered vector query operations. In addition, each Schema will have a secondary hashmap that provides one _TextIndex_ object for each _Key_ to support pre-filtering vector queries.
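
In sketch form (hypothetical names, reusing the illustrative `TextIndex`/`Postings` shapes above):

```python
# Hypothetical sketch of the per-Schema layout described above: one corpus-wide
# TextIndex for non-vector and post-filtered queries, plus a per-key map of
# small TextIndex objects for pre-filtered vector queries and key deletion.
class SchemaTextIndexes:
    def __init__(self):
        self.corpus_index = TextIndex()  # all keys, fields, positions
        self.per_key = {}                # key -> TextIndex for just that key

    def insert(self, word, key, field, position):
        self.corpus_index.insert(word, key, field, position)
        self.per_key.setdefault(key, TextIndex()).insert(word, key, field, position)

    def delete_key(self, key):
        # The per-key index lists exactly the words this key contributed,
        # so deletion does not need to crawl the corpus-wide dictionary.
        key_index = self.per_key.pop(key, None)
        if key_index is None:
            return
        for word in key_index.words:
            postings = self.corpus_index.words.get(word)
            if postings is not None:
                postings.by_key.pop(key, None)
                if not postings.by_key:
                    del self.corpus_index.words[word]
```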

As it turns out, this secondary per-key hashmap is also useful to support key deletion, as it contains exactly the words contained by the fields of the key and nothing else. This use-case drives the need for the _RadixTree_ and _Postings_ objects to have representations optimized for very low numbers of entries.

## Defrag

The _Postings_ objects contained within the schema-wide _TextIndex_ object will consume the majority of the memory. Defrag is implemented by using the _WordIterator_ to visit each _Postings_ object and defragment it.
39 changes: 39 additions & 0 deletions docs/full-text/overview.md
@@ -0,0 +1,39 @@
Text indexes are commonly referred to as inverted because they are not indexes from names to values, but rather from values to names.
Within a running Valkey instance we can think of the text index as a collection of tuples and then reason about how these tuples are indexed for efficient operation.
Tuple members are:

- _Schema_ -- The user-visible index (aka index-schema)
- _Field_ -- The TEXT field (aka attribute) definition within a _Schema_.
- _Word_ -- A lexical element. Query operators work on words.
- _Key_ -- The Valkey key containing this _Word_, needed for result generation as well as combining with other search operators.
- _Position_ -- Location within the _Field_ (stored as a _Word_ offset), needed for exact phrase matching. Future versions may extend the _Position_ to include the byte offset within the _Field_ to efficiently support highlighting.

There are some choices to make in how to index this information. There aren't any search operations that are keyed by _Position_, so this tuple element isn't a candidate for indexing.

However, when looking across the various operations that the search module needs to perform it's clear that both _Key_-based and _Word_-based organizations are useful.

The ingestion engine wants a _Key_-based organization in order to efficiently locate tuples for removal (ingestion => remove old values then maybe insert new ones). It turns out that vector queries can also use a _Key_-based organization in some filtering modes.

Text query operations want a _Word_-based organization.
So the choice is how to index the other members of a tuple: _Schema_, _Field_, _Key_ and _Position_.
There are three different choices for the _Word_-based dictionary, each with very different time/space consequences.

One choice would be to have a single per-node _Word_ dictionary. While this offers the best dictionary space efficiency, it will require each _Postings_ object to contain the remaining tuple entries: _Schema_, _Field_, _Key_ and _Position_ for every _Word_ present in the corpus. This prohibits taking advantage of the high rate of duplication in the _Schema_ and _Field_ tuple members.
A major problem with this choice is that in order to delete a _Schema_, you must crawl the entire _Word_ dictionary.
There are use-cases where Schema creation and deletion are fairly frequent, so this becomes a poor choice.

Another choice would be to organize a _Word_ dictionary for each _Schema_.
Now, the _Postings_ object need only provide the _Field_, _Key_ and _Position_ entries.
This has the advantage of eliminating the highly redundant _Schema_ tuple member, and the disadvantages of duplicating space for words that appear in multiple Schemas and of increasing the size of the _Postings_ object to record the _Field_. More on this option below.

The last choice would be a per-_Field_ word dictionary. Now the _Postings_ object need only provide the _Key_ and _Position_ entries.
Extending the pattern of the per-_Schema_ word dictionary, this has the advantage of eliminating both of the highly redundant tuple members (_Schema_ and _Field_), with the disadvantage of duplicating words found in multiple fields in the corpus.

Having ruled out the per-node word dictionary, the choice between per-_Schema_ and per-_Field_ remains. The difference in _Postings_ object size between these two choices need not be very large.
In particular, because the vast majority of indexes will likely have a small number of text fields, only a few bits are required to represent a field, and these can be packed together with the _Position_ field, resulting in a per-_Schema_ posting that is only marginally larger than the per-_Field_ posting.
Thus it's likely that the space savings of the per-_Schema_ word dictionary will dominate, making it the most space-efficient choice.
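
For example (an illustrative encoding only, not a committed layout), with at most 16 TEXT fields per schema the field id needs only 4 bits and can ride along with the word position in a single 32-bit value:

```python
# Illustrative packing only, not the committed in-memory layout.
FIELD_BITS = 4       # supports up to 16 TEXT fields per schema
POSITION_BITS = 28   # supports ~268M word positions per field

def pack(field_id: int, position: int) -> int:
    assert 0 <= field_id < (1 << FIELD_BITS)
    assert 0 <= position < (1 << POSITION_BITS)
    return (position << FIELD_BITS) | field_id

def unpack(packed: int) -> tuple[int, int]:
    # Returns (field_id, position).
    return packed & ((1 << FIELD_BITS) - 1), packed >> FIELD_BITS

# pack(3, 1000) == 16003; unpack(16003) == (3, 1000)
```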

Another reason to choose per-_Schema_ is that the query language of Redisearch is optimized for multi-field text searching.
For example, the query string `(@t1|@t2):entire` searches for the word `entire` in two different text fields. The per-_Schema_ organization requires only a single word lookup and a single _Postings_ traversal, while the per-_Field_ organization would require two word lookups and two _Postings_ traversals.
It should be noted that the Redisearch default for text queries is to search _all_ fields of type TEXT (which differs from all other query operators, which require a single field to be specified).
Thus the per-_Schema_ organization is chosen.
9 changes: 9 additions & 0 deletions docs/full-text/phrase.md
@@ -0,0 +1,9 @@
# Exact Phrase Matching

The exact phrase search operator looks for sequences of words within the same field of one key. In the query language, an exact phrase consists of a sequence of word specifiers enclosed in double quotes.
Each word specifier can be a word, a word wildcard match, or a fuzzy word match.
The exact phrase search also has two metadata parameters: _Slop_ and _Inorder_. The _Slop_ parameter is the maximum distance between words, i.e., with _Slop_ == 0 the words must be adjacent, and with _Slop_ == 1 there can be up to one non-matching word between the matching words. The _Inorder_ parameter indicates whether the word specifiers must be found in the text field in exactly the order specified in the query or whether they can be found in any order within the constraints of _Slop_.

Iterators for each word specifier are constructed from the query and iterated. As each matching word is found, the corresponding _Postings_ objects are consulted and intersected to locate keys that contain all of the words. Once such a key is located, a position iterator for that key is used to determine whether the _Slop_ and _Inorder_ sequencing requirements are satisfied.
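
A simplified sketch of that final check for a single (key, field) candidate, reading _Slop_ as the per-gap limit described above and covering only the _Inorder_ case (each word specifier is assumed to supply a sorted list of positions):

```python
# Simplified sketch, Inorder case only. positions_per_word[i] is the sorted
# list of positions where the i-th word specifier matched in this field.
# Returns True if the words appear in query order with at most `slop`
# non-matching words between consecutive matches.
def phrase_match_inorder(positions_per_word, slop):
    def extend(word_idx, prev):
        if word_idx == len(positions_per_word):
            return True
        for pos in positions_per_word[word_idx]:
            if pos <= prev:
                continue
            if pos - prev - 1 > slop:
                break  # positions are sorted; later ones are only farther away
            if extend(word_idx + 1, pos):
                return True
        return False

    return any(extend(1, start) for start in positions_per_word[0])
```

The backtracking in this sketch is exactly the kind of nested search that can explode combinatorially, which motivates the self-policing described below.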

The implementation will need to provide some form of self-policing to ensure that the timeout requirements are honored as it's quite possible for these nested searches to explode combinatorially.
163 changes: 163 additions & 0 deletions docs/full-text/scrape.py
@@ -0,0 +1,163 @@
import requests
from bs4 import BeautifulSoup
import re
import time
import redis

# -------------------------------------------------------------------
# A simple Radix Tree implementation
# -------------------------------------------------------------------
class RadixTree:
    def __init__(self):
        self.children = {}
        self.postings = 0

    def insert(self, word):
        """
        Insert a single word into the radix tree.
        """
        if len(word) == 0:
            self.postings = self.postings + 1
        else:
            if word[0] not in self.children:
                self.children[word[0]] = RadixTree()
            self.children[word[0]].insert(word[1:])

    def count_nodes(self):
        """
        Returns total number of nodes in the tree (including the root).
        """
        count = len(self.children)
        if self.postings > 0:
            count = count + 1
        for c in self.children.values():
            count += c.count_nodes()
        return count

    def count_single_child_nodes(self):
        """
        Returns how many nodes have exactly one child.
        """
        count = 0
        if self.postings == 0 and len(self.children) == 1:
            count = 1
        for c in self.children.values():
            count += c.count_single_child_nodes()
        return count

# -------------------------------------------------------------------
# Helper function to fetch random Wikipedia text
# -------------------------------------------------------------------
def fetch_random_wikipedia_page_text():
    """
    Fetches a random Wikipedia page and returns its text content.
    """
    url = "https://en.wikipedia.org/wiki/Special:Random"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
[Review comment · Contributor] Impersonating browsers is inappropriate. Please invent a real UA.
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except Exception as e:
        print(f"Failed to fetch page: {e}")
        return ""

    soup = BeautifulSoup(response.text, "html.parser")

    # Extract text from <p> tags as a simple approach
    paragraphs = soup.find_all("p")
    page_text = ""
    for p in paragraphs:
        page_text += p.get_text() + "\n"

    # Simple cleaning
    # Remove extra whitespace, newlines, references in brackets, etc.
    page_text = re.sub(r"\[\d+\]", "", page_text)  # remove reference markers
    page_text = re.sub(r"\s+", " ", page_text).strip()
    return page_text

client = redis.Redis(host='localhost', port=6379, decode_responses=True)

def memory_usage():
    x = client.execute_command("memory stats")
    return int(x["total.allocated"])

# -------------------------------------------------------------------
# Main script to scrape, build tree, and show statistics
# -------------------------------------------------------------------
def main():
    key_count = 0
    key_size = 0
    word_size = 0
    word_count = 0
    stop_word_count = 0
    postings_space4 = 0
    postings_space8 = 0
    reverse_space = 0

    NUM_PAGES = 10000
    NUM_STATS = 100
    tree = RadixTree()

    client.execute_command("flushall sync")
    client.execute_command("ft.create x on hash schema x text")

    start_usage = memory_usage()
    print("At startup memory = ", start_usage)

    stop_words = {"a":0, "is":0, "the":0, "an":0, "and":0, "are":0, "as":0, "at":0, "be":0, "but":0, "by":0, "for":0, "if":0, "in":0, "into":0, "it":0, "no":0, "not":0, "of":0, "on":0, "or":0, "such":0, "that":0, "their":0, "then":0, "there":0, "these":0, "they":0, "this":0, "to":0, "was":0, "will":0, "with":0}

    for i in range(NUM_PAGES):
        text = fetch_random_wikipedia_page_text()

        if not text:
            # If the text is empty (fetch failure)", skip
[Review comment · Contributor] Suggested change: `# If the text is empty (fetch failure), skip` (drop the stray quotation mark).
            continue

        # Split into sentences (very naive). You can improve with nltk sent_tokenize.
        sentences = re.split(r'[.!?]+', text)

        for sentence in sentences:
            # Further split into words (again naive)
            client.hset(str(key_count), mapping={"x": sentence})
            words = re.split(r'\W+', sentence)
            words = [w.lower() for w in words if w]  # remove empty strings, to lower
            key_count = key_count + 1
            key_size = key_size + len(sentence)
            per_sentence = {}
            for word in words:
                word_count = word_count + 1
                word_size = word_size + len(word)
                if word in stop_words:
                    stop_word_count = stop_word_count + 1
                else:
                    if word in per_sentence:
                        postings_space4 = postings_space4 + 1
                        postings_space8 = postings_space8 + 1
                    else:
                        postings_space4 = postings_space4 + 5
                        postings_space8 = postings_space8 + 9
                        per_sentence[word] = True
                    tree.insert(word)
        if (i % NUM_STATS) == 0:
            # After inserting all words from pages, compute statistics
            total_nodes = tree.count_nodes()
            single_child_nodes = tree.count_single_child_nodes()
            word_nodes = total_nodes - single_child_nodes
            space8 = (word_nodes * 8) + \
                     int(single_child_nodes * 1.5) + \
                     postings_space8
            space4 = (word_nodes * 8) + int(single_child_nodes * 1.5) + postings_space4 + \
                     (key_count * 4)

            space44 = space4 - (word_nodes * 4)

            redis_usage = memory_usage() - start_usage

            print(f"Keys:{key_count} AvgKeySize:{key_size/key_count:.1f} Words:{word_count}/{stop_word_count} AvgWord:{word_size/word_count:.1f} Postings_space:{postings_space8//1024}/{postings_space4//1024}KB Space:{space8//1024}/{space4//1024}/{space44//1024}KB Space/Word:{space8/word_count:.1f}/{space4/word_count:.1f} Space/Corpus:{space8/key_size:.1f}/{space4/key_size:.1f}/{space44/key_size:.1f} Redis:{redis_usage//1024}KB /Key:{redis_usage//key_count} /Word:{redis_usage//word_count} Ratio:{space8/redis_usage:.1f}/{space4/redis_usage:.1f}" )

if __name__ == "__main__":
    main()

