Commit d321153

Merge pull request #1466 from redis/DOC-5149-python-vec-json-examples
DOC-5149 and DOC-5153 added Python and Go vector JSON examples
2 parents f6319b5 + 93fb85f commit d321153

2 files changed

+244
-14
lines changed

content/develop/clients/go/vecsearch.md

Lines changed: 124 additions & 7 deletions
@@ -32,7 +32,9 @@ In the example below, we use the
3232
[`huggingfaceembedder`](https://pkg.go.dev/github.com/henomis/lingoose/embedder/huggingface)
3333
package from the [`LinGoose`](https://pkg.go.dev/github.com/henomis/lingoose)
3434
framework to generate vector embeddings to store and index with
35-
Redis Query Engine.
35+
Redis Query Engine. The code is first demonstrated for hash documents with a
36+
separate section to explain the
37+
[differences with JSON documents](#differences-with-json-documents).
3638

3739
## Initialize
3840

@@ -80,10 +82,10 @@ the embeddings for this example are both available for free.
8082

8183
The `huggingfaceembedder` model outputs the embeddings as a
8284
`[]float32` array. If you are storing your documents as
83-
[hash]({{< relref "/develop/data-types/hashes" >}}) objects
84-
(as we are in this example), then you must convert this array
85-
to a `byte` string before adding it as a hash field. In this example,
86-
we will use the function below to produce the `byte` string:
85+
[hash]({{< relref "/develop/data-types/hashes" >}}) objects, then you
86+
must convert this array to a `byte` string before adding it as a hash field.
87+
The function shown below uses Go's [`binary`](https://pkg.go.dev/encoding/binary)
88+
package to produce the `byte` string:
8789

8890
```go
8991
func floatsToBytes(fs []float32) []byte {
@@ -101,7 +103,8 @@ func floatsToBytes(fs []float32) []byte {
101103
Note that if you are using [JSON]({{< relref "/develop/data-types/json" >}})
102104
objects to store your documents instead of hashes, then you should store
103105
the `[]float32` array directly without first converting it to a `byte`
104-
string.
106+
string (see [Differences with JSON documents](#differences-with-json-documents)
107+
below).
105108

106109
## Create the index
107110

@@ -187,7 +190,7 @@ hf := huggingfaceembedder.New().
187190
## Add data
188191

189192
You can now supply the data objects, which will be indexed automatically
190-
when you add them with [`hset()`]({{< relref "/commands/hset" >}}), as long as
193+
when you add them with [`HSet()`]({{< relref "/commands/hset" >}}), as long as
191194
you use the `doc:` prefix specified in the index definition.
192195

193196
Use the `Embed()` method of `huggingfaceembedder`
@@ -310,6 +313,120 @@ As you would expect, the result for `doc:0` with the content text *"That is a ve
310313
is the result that is most similar in meaning to the query text
311314
*"That is a happy person"*.
312315

316+
## Differences with JSON documents
317+
318+
Indexing JSON documents is similar to hash indexing, but there are some
319+
important differences. JSON allows much richer data modelling with nested fields, so
320+
you must supply a [path]({{< relref "/develop/data-types/json/path" >}}) in the schema
321+
to identify each field you want to index. However, you can declare a short alias for each
322+
of these paths (using the `As` option) to avoid typing it in full for
323+
every query. Also, you must set `OnJSON` to `true` when you create the index.
324+
325+
The code below shows these differences, but the index is otherwise very similar to
326+
the one created previously for hashes:
327+
328+
```go
_, err = rdb.FTCreate(ctx,
    "vector_json_idx",
    &redis.FTCreateOptions{
        OnJSON: true,
        Prefix: []any{"jdoc:"},
    },
    &redis.FieldSchema{
        FieldName: "$.content",
        As:        "content",
        FieldType: redis.SearchFieldTypeText,
    },
    &redis.FieldSchema{
        FieldName: "$.genre",
        As:        "genre",
        FieldType: redis.SearchFieldTypeTag,
    },
    &redis.FieldSchema{
        FieldName: "$.embedding",
        As:        "embedding",
        FieldType: redis.SearchFieldTypeVector,
        VectorArgs: &redis.FTVectorArgs{
            HNSWOptions: &redis.FTHNSWOptions{
                Dim:            384,
                DistanceMetric: "L2",
                Type:           "FLOAT32",
            },
        },
    },
).Result()
```
359+
360+
Use [`JSONSet()`]({{< relref "/commands/json.set" >}}) to add the data
361+
instead of [`HSet()`]({{< relref "/commands/hset" >}}). The maps
362+
that specify the fields have the same structure as the ones used for `HSet()`.
363+
364+
An important difference with JSON indexing is that the vectors are
365+
specified using lists instead of binary strings. The loop below is similar
366+
to the one used previously to add the hash data, but it doesn't use the
367+
`floatsToBytes()` function to encode the `float32` array.
368+
369+
```go
for i, emb := range embeddings {
    _, err = rdb.JSONSet(ctx,
        fmt.Sprintf("jdoc:%v", i),
        "$",
        map[string]any{
            "content":   sentences[i],
            "genre":     tags[i],
            "embedding": emb.ToFloat32(),
        },
    ).Result()

    if err != nil {
        panic(err)
    }
}
```
386+
387+
The query is almost identical to the one for the hash documents. This
388+
demonstrates how the right choice of aliases for the JSON paths can
389+
save you having to write complex queries. An important thing to notice
390+
is that the vector parameter for the query is still specified as a
391+
binary string (using the `floatsToBytes()` function), even though the data for
392+
the `embedding` field of the JSON was specified as an array.
393+
394+
```go
jsonQueryEmbedding, err := hf.Embed(ctx, []string{
    "That is a happy person",
})

if err != nil {
    panic(err)
}

jsonBuffer := floatsToBytes(jsonQueryEmbedding[0].ToFloat32())

jsonResults, err := rdb.FTSearchWithArgs(ctx,
    "vector_json_idx",
    "*=>[KNN 3 @embedding $vec AS vector_distance]",
    &redis.FTSearchOptions{
        Return: []redis.FTSearchReturn{
            {FieldName: "vector_distance"},
            {FieldName: "content"},
        },
        DialectVersion: 2,
        Params: map[string]any{
            "vec": jsonBuffer,
        },
    },
).Result()
```
420+
421+
Apart from the `jdoc:` prefixes for the keys, the result from the JSON
422+
query is the same as for the hash query:
423+
424+
```
ID: jdoc:0, Distance:0.114169843495, Content:'That is a very happy person'
ID: jdoc:1, Distance:0.610845327377, Content:'That is a happy dog'
ID: jdoc:2, Distance:1.48624765873, Content:'Today is a sunny day'
```
429+
313430
## Learn more
314431

315432
See

content/develop/clients/redis-py/vecsearch.md

Lines changed: 120 additions & 7 deletions
@@ -28,10 +28,12 @@ similarity of an embedding generated from some query text with embeddings stored
2828
or JSON fields, Redis can retrieve documents that closely match the query in terms
2929
of their meaning.
3030

31-
In the example below, we use the
31+
The example below uses the
3232
[`sentence-transformers`](https://pypi.org/project/sentence-transformers/)
3333
library to generate vector embeddings to store and index with
34-
Redis Query Engine.
34+
Redis Query Engine. The code is first demonstrated for hash documents with a
35+
separate section to explain the
36+
[differences with JSON documents](#differences-with-json-documents).
3537

3638
## Initialize
3739

@@ -50,6 +52,7 @@ from sentence_transformers import SentenceTransformer
5052
from redis.commands.search.query import Query
5153
from redis.commands.search.field import TextField, TagField, VectorField
5254
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
55+
from redis.commands.json.path import Path
5356

5457
import numpy as np
5558
import redis
@@ -86,7 +89,7 @@ except redis.exceptions.ResponseError:
8689
pass
8790
```
8891

89-
Next, we create the index.
92+
Next, create the index.
9093
The schema in the example below specifies hash objects for storage and includes
9194
three fields: the text content to index, a
9295
[tag]({{< relref "/develop/interact/search-and-query/advanced-concepts/tags" >}})
@@ -127,10 +130,10 @@ Use the `model.encode()` method of `SentenceTransformer`
127130
as shown below to create the embedding that represents the `content` field.
128131
The `astype()` method that follows the `model.encode()` call specifies that
129132
we want a vector of `float32` values. The `tobytes()` method encodes the
130-
vector components together as a single binary string rather than the
131-
default Python list of `float` values.
132-
Use the binary string representation when you are indexing hash objects
133-
(as we are here), but use the default list of `float` for JSON objects.
133+
vector components together as a single binary string.
134+
Use the binary string representation when you are indexing hashes
135+
or running a query (but use a list of `float` for
136+
[JSON documents](#differences-with-json-documents)).
134137

135138
```python
136139
content = "That is a very happy person"
@@ -226,6 +229,116 @@ As you would expect, the result for `doc:0` with the content text *"That is a ve
226229
is the result that is most similar in meaning to the query text
227230
*"That is a happy person"*.
228231

232+
## Differences with JSON documents
233+
234+
Indexing JSON documents is similar to hash indexing, but there are some
235+
important differences. JSON allows much richer data modelling with nested fields, so
236+
you must supply a [path]({{< relref "/develop/data-types/json/path" >}}) in the schema
237+
to identify each field you want to index. However, you can declare a short alias for each
238+
of these paths (using the `as_name` keyword argument) to avoid typing it in full for
239+
every query. Also, you must specify `IndexType.JSON` when you create the index.
240+
241+
The code below shows these differences, but the index is otherwise very similar to
242+
the one created previously for hashes:
243+
244+
```py
schema = (
    TextField("$.content", as_name="content"),
    TagField("$.genre", as_name="genre"),
    VectorField(
        "$.embedding", "HNSW", {
            "TYPE": "FLOAT32",
            "DIM": 384,
            "DISTANCE_METRIC": "L2"
        },
        as_name="embedding"
    )
)

r.ft("vector_json_idx").create_index(
    schema,
    definition=IndexDefinition(
        prefix=["jdoc:"], index_type=IndexType.JSON
    )
)
```
265+
266+
Use [`json().set()`]({{< relref "/commands/json.set" >}}) to add the data
267+
instead of [`hset()`]({{< relref "/commands/hset" >}}). The dictionaries
268+
that specify the fields have the same structure as the ones used for `hset()`,
269+
but `json().set()` receives them in a positional argument instead of
270+
the `mapping` keyword argument.
271+
272+
An important difference with JSON indexing is that the vectors are
273+
specified using lists instead of binary strings. Generate the list
274+
using the `tolist()` method instead of `tobytes()` as you would with a
275+
hash.
276+
277+
```py
content = "That is a very happy person"

r.json().set("jdoc:0", Path.root_path(), {
    "content": content,
    "genre": "persons",
    "embedding": model.encode(content).astype(np.float32).tolist(),
})

content = "That is a happy dog"

r.json().set("jdoc:1", Path.root_path(), {
    "content": content,
    "genre": "pets",
    "embedding": model.encode(content).astype(np.float32).tolist(),
})

content = "Today is a sunny day"

r.json().set("jdoc:2", Path.root_path(), {
    "content": content,
    "genre": "weather",
    "embedding": model.encode(content).astype(np.float32).tolist(),
})
```
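
Reviewer note (not part of the commit): the list-versus-binary-string distinction described above can be illustrated without Redis or a model. The sketch below uses only the standard library; the helper names are hypothetical, and it assumes the binary form is packed little-endian `float32` values, which is what `astype(np.float32).tobytes()` would be expected to produce on a little-endian machine.

```python
import struct


def to_json_vector(values):
    """Return the vector as a plain list of floats (the form used for JSON documents)."""
    return [float(v) for v in values]


def to_binary_vector(values):
    """Pack the vector as little-endian float32 bytes (the form used for hashes and query parameters)."""
    return struct.pack(f"<{len(values)}f", *values)


def from_binary_vector(buf):
    """Unpack little-endian float32 bytes back into a list (inverse of to_binary_vector)."""
    return list(struct.unpack(f"<{len(buf) // 4}f", buf))


vec = [0.5, -1.25, 2.0]           # toy 3-dimensional vector, not a real embedding
as_list = to_json_vector(vec)     # what a JSON document would store
as_bytes = to_binary_vector(vec)  # what a hash field or the `vec` query parameter would hold

print(as_list)
print(len(as_bytes))              # 4 bytes per float32 component
```

A 384-dimensional embedding, as declared in the schema, would therefore occupy 1536 bytes in its binary form.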
302+
303+
The query is almost identical to the one for the hash documents. This
304+
demonstrates how the right choice of aliases for the JSON paths can
305+
save you having to write complex queries. An important thing to notice
306+
is that the vector parameter for the query is still specified as a
307+
binary string (using the `tobytes()` method), even though the data for
308+
the `embedding` field of the JSON was specified as a list.
309+
310+
```py
q = Query(
    "*=>[KNN 3 @embedding $vec AS vector_distance]"
).return_field("vector_distance").return_field("content").dialect(2)

query_text = "That is a happy person"

res = r.ft("vector_json_idx").search(
    q, query_params={
        "vec": model.encode(query_text).astype(np.float32).tobytes()
    }
)
```
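
Reviewer note (not part of the commit): because the query refers to the `embedding` alias rather than the `$.embedding` path, the same query string works for both the hash index and the JSON index. The sketch below assembles that string from its parts; `knn_query` is a hypothetical helper, and the tag pre-filter in the second call is an assumed usage of the query syntax that does not appear in this diff.

```python
def knn_query(k, vector_field, param="vec",
              score_alias="vector_distance", prefilter="*"):
    """Build a KNN query string (hypothetical helper, for illustration only)."""
    return f"{prefilter}=>[KNN {k} @{vector_field} ${param} AS {score_alias}]"


# The same string as the query in the example, using the `embedding` alias.
query_string = knn_query(3, "embedding")
print(query_string)

# Assumed variation: restrict the search to one genre with a tag pre-filter.
filtered_string = knn_query(3, "embedding", prefilter="(@genre:{pets})")
print(filtered_string)
```

Either string could then be passed to `Query(...)` in place of the literal, with the `vec` parameter still supplied as a binary string through `query_params`.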
323+
324+
Apart from the `jdoc:` prefixes for the keys, the result from the JSON
325+
query is the same as for the hash query:
326+
327+
```
Result{
    3 total,
    docs: [
        Document {
            'id': 'jdoc:0',
            'payload': None,
            'vector_distance': '0.114169985056',
            'content': 'That is a very happy person'
        },
        .
        .
        .
```
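
Reviewer note (not part of the commit): the output above shows `vector_distance` returned as a string, so it needs converting before any numeric comparison. The sketch below shows one way to do that. The `summarize` helper is hypothetical, and `SimpleNamespace` stands in for the `Document` objects in `res.docs` so that the snippet runs without a Redis server; the single sample row is copied from the output above.

```python
from types import SimpleNamespace


def summarize(docs):
    """Return (id, distance, content) tuples sorted by ascending distance."""
    rows = [
        (doc.id, float(doc.vector_distance), doc.content)  # string -> float
        for doc in docs
    ]
    return sorted(rows, key=lambda row: row[1])


# Stand-in for `res.docs`, using the one document shown in the output above.
sample_docs = [
    SimpleNamespace(
        id="jdoc:0",
        vector_distance="0.114169985056",
        content="That is a very happy person",
    ),
]

rows = summarize(sample_docs)
for doc_id, distance, content in rows:
    print(f"ID: {doc_id}, Distance: {distance:.4f}, Content: '{content}'")
```

With a live connection, `summarize(res.docs)` would be the corresponding call.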
341+
229342
## Learn more
230343

231344
See
