Initial Full Text Search design documentation #46
Open
allenss-amazon wants to merge 28 commits into valkey-io:main from allenss-amazon:allenss-fulltext-docs
Changes from all commits (28 commits):
- aef048e First crack at full text design docs
- e06b6d1 First crack at full text design docs
- 2da08c2 First crack at full text design docs
- 4c189fd More.
- c100eab more
- 6a7b160 more
- 5c5d37c Add Fuzzy and rename
- aae4738 more
- 7441f4a Add estimate description
- aca4907 Even more
- 27060f9 More
- 50635c3 Merge branch 'valkey-io:main' into allenss-fulltext-docs
- 020281e Update
- c97edbc Fix spelling errors
- aaaaac6 Merge branch 'valkey-io:main' into allenss-fulltext-docs
- 64809ad Merge branch 'valkey-io:main' into allenss-fulltext-docs
- 31827d5 Update docs/full-text/index.md
- 5e79836 Update docs/full-text/index.md
- 2ad4e6c Update docs/full-text/phrase.md
- 1ff1702 Update docs/full-text/phrase.md
- 3e8e73c Update src/indexes/text/text.h
- 34de205 Update src/indexes/text/wildcard_iterator.h
- b3e7851 Update src/indexes/text/wildcard_iterator.h
- 9f60f2b Update src/indexes/text_index.h
- 43683da Update src/indexes/text/wildcard_iterator.h
- b92ae91 fix missing comma
- 338742a Merge branch 'main' into allenss-fulltext-docs
- 8b10951 Merge branch 'valkey-io:main' into allenss-fulltext-docs

All commits authored by allenss-amazon.

@@ -0,0 +1,10 @@
# Fuzzy Matching

There are many good blog posts on Levenshtein automata:

https://julesjacobs.com/2015/06/17/disqus-levenshtein-simple-and-fast.html

https://fulmicoton.com/posts/levenshtein/

The bottom line is that the prefix-tree representation of the data allows efficient fuzzy search for matches.
It's expected that building the Levenshtein automaton takes O(edit-distance * length-of-query-string) time, and that the automaton allows efficient searching of a prefix tree because it can prune large subtrees based on the evaluation of the sub-tree prefix.
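
Below is a minimal sketch of the technique from the posts above: walk the prefix tree while keeping one Levenshtein DP row per node, and prune any subtree whose best possible distance already exceeds the budget. The dict-based trie with a `WORD_END` marker is a hypothetical stand-in for the real _RadixTree_, not this module's actual data structure.

```python
# Minimal sketch, not the module's implementation. The trie is a nested
# dict with a WORD_END marker standing in for the real RadixTree.
WORD_END = "$"

def fuzzy_search(trie, query, max_dist):
    """Return (word, distance) for every word within max_dist edits of query."""
    results = []
    # DP row for the empty prefix: edit distance to each prefix of the query.
    row = list(range(len(query) + 1))
    for ch, child in trie.items():
        if ch != WORD_END:
            _walk(child, ch, query, row, ch, max_dist, results)
    return results

def _walk(node, ch, query, prev_row, prefix, max_dist, results):
    # Extend the DP table by one row for the appended character `ch`.
    row = [prev_row[0] + 1]
    for i in range(1, len(query) + 1):
        cost = 0 if query[i - 1] == ch else 1
        row.append(min(row[i - 1] + 1,           # insertion
                       prev_row[i] + 1,          # deletion
                       prev_row[i - 1] + cost))  # substitution
    if WORD_END in node and row[-1] <= max_dist:
        results.append((prefix, row[-1]))
    # Prune: if no cell is within budget, nothing below this node can match.
    if min(row) <= max_dist:
        for c, child in node.items():
            if c != WORD_END:
                _walk(child, c, query, row, prefix + c, max_dist, results)
```

For example, a trie containing only `cat` searched with query `bat` and `max_dist` 1 yields `[("cat", 1)]`, while subtrees whose every DP cell exceeds the budget are skipped entirely.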

@@ -0,0 +1,30 @@
# Text Index

The _TextIndex_ object is logically a sequence of 4-tuples: (_Word_, _Key_, _Field_, _Position_). The search operators can be streamlined when the tuples can be iterated in that order, henceforth referred to as lexical order.
Lexical ordering allows operations like intersection and union that operate on multiple iteration sequences to perform merge-like operations in linear time.
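
As a concrete illustration of the merge property, a linear-time intersection of two lexically ordered tuple streams might look like the following sketch (the tuple layout and names are illustrative only):

```python
# Sketch only: intersect two lexically ordered streams of
# (word, key, field, position) tuples with a single linear merge pass.
def intersect_ordered(a, b):
    ia, ib = iter(a), iter(b)
    ta, tb = next(ia, None), next(ib, None)
    while ta is not None and tb is not None:
        if ta < tb:
            ta = next(ia, None)
        elif tb < ta:
            tb = next(ib, None)
        else:
            yield ta
            ta, tb = next(ia, None), next(ib, None)
```

Union works the same way, emitting from whichever stream is currently smaller instead of skipping.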

In addition to the standard CRUD operations, _TextIndex_ provides a _WordIterator_ that efficiently iterates over sequences of tuples where the _Word_ element shares a common prefix or, optionally, a common suffix, again in lexical order. _WordIterator_ also optimizes other operations; e.g., it's efficient to move from one _Key_ to another _Key_ without iterating over the intervening _Field_ and/or _Position_ entries -- typically in O(1) or worst case O(log #keys) time.
From this capability the various search operators are constructed: word search, phrase search, and fuzzy search.
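
The shape of that capability, sketched as a hypothetical Python interface (the names and signatures here are assumptions, not the module's actual API):

```python
# Hypothetical interface sketch; names and signatures are assumptions.
class WordIterator:
    def next_tuple(self):
        """Advance to the next (word, key, field, position) tuple,
        in lexical order."""
        ...

    def seek_key(self, key):
        """Skip directly to the first tuple whose key is >= key,
        without visiting the intervening field/position entries;
        expected O(1), worst case O(log #keys)."""
        ...
```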

The _TextIndex_ object is implemented as a two-level hierarchy of objects. At the top level is a _RadixTree_ which maps a _Word_ into a _Postings_ object, which is a container of (_Key_, _Field_, _Position_) triples.
The use of the top-level _RadixTree_ allows efficient implementation of operations on subsets of the index that consist of _Words_ that have a common prefix or suffix.
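
In outline, the two-level hierarchy looks like this (a Python sketch of the logical structure only, not the actual implementation):

```python
from collections import defaultdict

# Logical structure only: a word-keyed map stands in for the RadixTree,
# and a plain list stands in for the Postings container.
class TextIndex:
    def __init__(self):
        self.words = defaultdict(list)  # RadixTree: word -> Postings

    def insert(self, word, key, field, position):
        self.words[word].append((key, field, position))
```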
|
||
Both the _Postings_ and _RadixTree_ implementations must adapt efficiently across a wide range in the number of items they contain. | ||
It's expected that both objects will have multiple internal representations to balance time/space efficiency at different scales. | ||
Likely the initial implementation will have two representations, i.e., | ||
a space-efficient implementation with O(N) insert/delete/iterate times and a time-efficient implementation with O(1) or O(Log N) insert/delete/iterate times. | ||
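
For illustration, the adaptive idea might look like the sketch below; the crossover threshold and the two concrete representations are assumptions, not decided implementation details:

```python
import bisect

SMALL_LIMIT = 128  # assumed crossover point, illustration only

class AdaptiveContainer:
    """Sorted list while small (space-efficient, O(N) insert);
    hash-based set once large (time-efficient, O(1) insert)."""
    def __init__(self):
        self.small = []    # compact sorted list
        self.large = None  # becomes a set once size crosses SMALL_LIMIT

    def insert(self, item):
        if self.large is not None:
            self.large.add(item)
        else:
            bisect.insort(self.small, item)
            if len(self.small) > SMALL_LIMIT:
                self.large = set(self.small)
                self.small = []
```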

Like all of the Valkey search operators, the text search operators (word, phrase, and fuzzy search) must support both the pre-filtering and post-filtering modes when combined with vector search.
At the conceptual level, the only real difference between the pre- and post-filtering modes is that in the post-filtering mode the search is performed across all _TextIndex_ entries with a particular _Field_, whereas in the pre-filtering mode the search is performed on _TextIndex_ entries with a particular _Key_.

While there are many time/space tradeoffs possible for the pre-filtering case, it is proposed to handle it with the same code as the post-filtering case, only operating over a _TextIndex_ that has been constrained to a single _Key_.
In other words, for each user-declared Schema there will be one _TextIndex_ constructed across all of the _Key_, _Field_ and _Position_ entries. This _TextIndex_ object will support all non-vector and post-filtered vector query operations. In addition, each Schema will have a secondary hashmap that provides one _TextIndex_ object for each _Key_ to support pre-filtering vector queries.
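
The resulting layout, sketched with hypothetical names:

```python
from collections import defaultdict

def make_text_index():
    # Stand-in for a TextIndex: word -> [(key, field, position), ...]
    return defaultdict(list)

class SchemaIndexes:
    def __init__(self):
        # One schema-wide index over every key: serves all non-vector
        # queries and post-filtered vector queries.
        self.schema_wide = make_text_index()
        # Secondary hashmap: one small index per key, for pre-filtered
        # vector queries (and, as noted below, for key deletion).
        self.per_key = {}  # key -> TextIndex constrained to that key
```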

As it turns out, this secondary per-key hashmap is also useful to support key deletion, as it contains exactly the words contained by the fields of the key and nothing else. This use-case drives the need for the _RadixTree_ and _Postings_ objects to have representations optimized for very low numbers of entries.

## Defrag

The _Postings_ objects contained within the schema-wide _TextIndex_ object will account for the majority of the consumed memory. Defrag is implemented by using the _WordIterator_ to visit each _Postings_ object and defrag it.

@@ -0,0 +1,39 @@
Text indexes are commonly referred to as inverted because they are not indexes from names to values, but rather from values to names.
Within a running Valkey instance we can think of the text index as a collection of tuples and then reason about how these tuples are indexed for efficient operation.
Tuple members are:

- _Schema_ -- The user-visible index (aka index-schema).
- _Field_ -- The TEXT field (aka attribute) definition within a _Schema_.
- _Word_ -- A lexical element. Query operators work on words.
- _Key_ -- The Valkey key containing this _Word_, needed for result generation as well as for combining with other search operators.
- _Position_ -- The location within the _Field_ (stored as a _Word_ offset), needed for exact phrase matching. Future versions may extend the _Position_ to include the byte offset within the _Field_ to efficiently support highlighting.
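
Written out as a record for concreteness (the field names are purely illustrative):

```python
from typing import NamedTuple

class IndexTuple(NamedTuple):  # illustrative names only
    schema: str    # user-visible index
    field: str     # TEXT field within the schema
    word: str      # lexical element
    key: str       # Valkey key containing the word
    position: int  # word offset within the field
```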

There are some choices to make in how to index this information. There aren't any search operations that are keyed by _Position_, so this tuple element isn't a candidate for indexing.

However, looking across the various operations that the search module needs to perform, it's clear that both _Key_-based and _Word_-based organizations are useful.

The ingestion engine wants a _Key_-based organization in order to efficiently locate tuples for removal (ingestion => remove old values, then maybe insert new ones). It turns out that vector queries can also use a _Key_-based organization in some filtering modes.

Text query operations want a _Word_-based organization.
So the choice is how to index the other members of a tuple: _Schema_, _Field_, _Key_ and _Position_.
There are three different choices for the _Word_-based dictionary, with very different time/space consequences.

One choice would be to have a single per-node _Word_ dictionary. While this offers the best dictionary space efficiency, it requires each _Postings_ object to contain the remaining tuple entries (_Schema_, _Field_, _Key_ and _Position_) for every _Word_ present in the corpus. This prohibits taking advantage of the high rate of duplication in the _Schema_ and _Field_ tuple members.
A major problem with this choice is that in order to delete a _Schema_, you must crawl the entire _Word_ dictionary.
There are use-cases where Schema creation and deletion are fairly frequent, so this becomes a poor choice.

Another choice would be to organize a _Word_ dictionary for each _Schema_.
Now the _Postings_ object need only provide the _Field_, _Key_ and _Position_ entries.
This has the advantage of eliminating the highly redundant _Schema_ tuple member, and the disadvantages of duplicating space for words that appear in multiple Schemas and increasing the size of the _Postings_ object to record the _Field_. More on this option below.

The last choice would be a per-_Field_ word dictionary. Now the _Postings_ object need only provide the _Key_ and _Position_ entries.
Extending the pattern of the per-_Schema_ word dictionary, this has the advantage of eliminating both of the highly redundant tuple members (_Schema_ and _Field_), with the disadvantage of duplicating words found in multiple fields in the corpus.

Having ruled out the per-node word dictionary, the choice between per-_Schema_ and per-_Field_ is evaluated. The difference in _Postings_ object size between these two choices need not be very large.
In particular, because the vast majority of indexes will likely have a small number of text fields, only a very small number of bits is required to represent a field. These bits can be combined with the _Position_ field, resulting in a per-_Schema_ posting that is only an epsilon larger than the per-_Field_ posting.
Thus it's likely that the space savings of the per-_Schema_ word dictionary will dominate, making it the most space-efficient choice.
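
A sketch of that packing, assuming (purely for illustration) 4 bits for the field, i.e., up to 16 TEXT fields per schema:

```python
FIELD_BITS = 4                      # assumed width, illustration only
FIELD_MASK = (1 << FIELD_BITS) - 1

def pack(position: int, field_id: int) -> int:
    assert 0 <= field_id <= FIELD_MASK
    return (position << FIELD_BITS) | field_id

def unpack(packed: int) -> tuple[int, int]:
    return packed >> FIELD_BITS, packed & FIELD_MASK  # (position, field_id)
```

A per-_Schema_ posting then stores one such packed integer where a per-_Field_ posting would store the bare position.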

Another reason to choose per-_Schema_ is that the query language of Redisearch is optimized for multi-field text searching.
For example, the query string `(@t1|@t2):entire` searches for the word `entire` in two different text fields. The per-_Schema_ organization requires only a single word lookup and a single _Postings_ traversal, while the per-_Field_ organization would require two word lookups and two _Postings_ traversals.
It should be noted that the Redisearch default for text queries is to search _all_ fields of type TEXT (unlike all other query operators, which require a single field to be specified).
Thus the per-_Schema_ organization is chosen.

@@ -0,0 +1,9 @@
# Exact Phrase Matching

The exact phrase search operator looks for sequences of words within the same field of one key. In the query language, an exact phrase consists of a sequence of word specifiers enclosed in double quotes.
Each word specifier can be a word, a word wildcard match, or a fuzzy word match.
The exact phrase search also has two metadata parameters: _Slop_ and _Inorder_. The _Slop_ parameter is a maximum distance between words, i.e., with _Slop_ == 0 the words must be adjacent, while with _Slop_ == 1 there can be up to 1 non-matching word between the matching words. For example, with _Slop_ == 1 the phrase "quick fox" matches the text "quick brown fox". The _Inorder_ parameter indicates whether the word specifiers must be found in the text field in the exact order specified in the query, or whether they can be found in any order within the constraints of _Slop_.

Iterators for each word specifier are constructed from the query and iterated. As each matching word is found, the corresponding _Postings_ objects are consulted and intersected to locate keys that contain all of the words. Once such a key is located, a position iterator for that key is used to determine whether the _Slop_ and _Inorder_ sequencing requirements are satisfied.
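
A brute-force sketch of that final position check, assuming the postings have already been intersected down to a single key and field (the real implementation would merge position iterators rather than enumerate combinations):

```python
from itertools import product

def phrase_matches(position_lists, slop, inorder):
    """position_lists[i] holds the sorted word offsets of the i-th specifier."""
    for combo in product(*position_lists):
        pos = list(combo) if inorder else sorted(combo)
        # Each adjacent pair must be increasing, with at most `slop`
        # non-matching words between the two matches.
        if all(0 <= b - a - 1 <= slop for a, b in zip(pos, pos[1:])):
            return True
    return False

# "quick fox" against "quick brown fox" with Slop == 1:
# phrase_matches([[0], [2]], slop=1, inorder=True) -> True
```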

The implementation will need to provide some form of self-policing to ensure that the timeout requirements are honored, as it's quite possible for these nested searches to explode combinatorially.

@@ -0,0 +1,163 @@
import requests
from bs4 import BeautifulSoup
import re
import redis

# -------------------------------------------------------------------
# A simple radix tree (one character per node, i.e., a trie)
# -------------------------------------------------------------------
class RadixTree:
    def __init__(self):
        self.children = {}
        self.postings = 0

    def insert(self, word):
        """
        Insert a single word into the radix tree.
        """
        if len(word) == 0:
            self.postings = self.postings + 1
        else:
            if word[0] not in self.children:
                self.children[word[0]] = RadixTree()
            self.children[word[0]].insert(word[1:])

    def count_nodes(self):
        """
        Returns total number of nodes in the tree (including the root).
        """
        count = len(self.children)
        if self.postings > 0:
            count = count + 1
        for c in self.children.values():
            count += c.count_nodes()
        return count

    def count_single_child_nodes(self):
        """
        Returns how many nodes have exactly one child.
        """
        count = 0
        if self.postings == 0 and len(self.children) == 1:
            count = 1
        for c in self.children.values():
            count += c.count_single_child_nodes()
        return count

# -------------------------------------------------------------------
# Helper function to fetch random Wikipedia text
# -------------------------------------------------------------------
def fetch_random_wikipedia_page_text():
    """
    Fetches a random Wikipedia page and returns its text content.
    """
    url = "https://en.wikipedia.org/wiki/Special:Random"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except Exception as e:
        print(f"Failed to fetch page: {e}")
        return ""

    soup = BeautifulSoup(response.text, "html.parser")

    # Extract text from <p> tags as a simple approach
    paragraphs = soup.find_all("p")
    page_text = ""
    for p in paragraphs:
        page_text += p.get_text() + "\n"

    # Simple cleaning:
    # remove extra whitespace, newlines, references in brackets, etc.
    page_text = re.sub(r"\[\d+\]", "", page_text)  # remove reference markers
    page_text = re.sub(r"\s+", " ", page_text).strip()
    return page_text

client = redis.Redis(host='localhost', port=6379, decode_responses=True)

def memory_usage():
    # "total.allocated" comes from the MEMORY STATS reply.
    x = client.execute_command("memory stats")
    return int(x["total.allocated"])

# -------------------------------------------------------------------
# Main script to scrape, build tree, and show statistics
# -------------------------------------------------------------------
def main():
    key_count = 0
    key_size = 0
    word_size = 0
    word_count = 0
    stop_word_count = 0
    postings_space4 = 0
    postings_space8 = 0

    NUM_PAGES = 10000
    NUM_STATS = 100
    tree = RadixTree()

    client.execute_command("flushall sync")
    client.execute_command("ft.create x on hash schema x text")

    start_usage = memory_usage()
    print("At startup memory = ", start_usage)

    stop_words = {"a", "is", "the", "an", "and", "are", "as", "at", "be",
                  "but", "by", "for", "if", "in", "into", "it", "no", "not",
                  "of", "on", "or", "such", "that", "their", "then", "there",
                  "these", "they", "this", "to", "was", "will", "with"}

    for i in range(NUM_PAGES):
        text = fetch_random_wikipedia_page_text()

        if not text:
            # If the text is empty (fetch failure), skip
            continue

        # Split into sentences (very naive). You can improve with nltk sent_tokenize.
        sentences = re.split(r'[.!?]+', text)

        for sentence in sentences:
            # Store each sentence as its own hash key, then further
            # split into words (again naive)
            client.hset(str(key_count), mapping={"x": sentence})
            words = re.split(r'\W+', sentence)
            words = [w.lower() for w in words if w]  # remove empty strings, to lower
            key_count = key_count + 1
            key_size = key_size + len(sentence)
            per_sentence = {}
            for word in words:
                word_count = word_count + 1
                word_size = word_size + len(word)
                if word in stop_words:
                    stop_word_count = stop_word_count + 1
                else:
                    if word in per_sentence:
                        postings_space4 = postings_space4 + 1
                        postings_space8 = postings_space8 + 1
                    else:
                        postings_space4 = postings_space4 + 5
                        postings_space8 = postings_space8 + 9
                        per_sentence[word] = True
                    tree.insert(word)
        if (i % NUM_STATS) == 0:
            # Periodically compute and print statistics
            total_nodes = tree.count_nodes()
            single_child_nodes = tree.count_single_child_nodes()
            word_nodes = total_nodes - single_child_nodes
            space8 = (word_nodes * 8) + \
                     int(single_child_nodes * 1.5) + \
                     postings_space8
            space4 = (word_nodes * 8) + int(single_child_nodes * 1.5) + \
                     postings_space4 + (key_count * 4)

            space44 = space4 - (word_nodes * 4)

            redis_usage = memory_usage() - start_usage

            print(f"Keys:{key_count} AvgKeySize:{key_size/key_count:.1f} Words:{word_count}/{stop_word_count} AvgWord:{word_size/word_count:.1f} Postings_space:{postings_space8//1024}/{postings_space4//1024}KB Space:{space8//1024}/{space4//1024}/{space44//1024}KB Space/Word:{space8/word_count:.1f}/{space4/word_count:.1f} Space/Corpus:{space8/key_size:.1f}/{space4/key_size:.1f}/{space44/key_size:.1f} Redis:{redis_usage//1024}KB /Key:{redis_usage//key_count} /Word:{redis_usage//word_count} Ratio:{space8/redis_usage:.1f}/{space4/redis_usage:.1f}")

if __name__ == "__main__":
    main()
Review comment:
Impersonating browsers is inappropriate.
Please invent a real UA.