
Commit 2c2322e

Authored by berkecanrizai, pre-commit-ci[bot] and Spycsh
add: Pathway vector store and retriever as LangChain component (#342)
* nb
* init changes
* docker
* example data
* docs(readme): update, add commands
* fix: formatting, data sources
* docs(readme): update instructions, add comments
* fix: rm unused parts
* fix: image name, compose env vars
* fix: rm unused part
* fix: logging name
* fix: env var
* [pre-commit.ci] auto fixes from pre-commit.com hooks, see https://pre-commit.ci
* fix: rename pw docker
* docs(readme): update input sources
* feat: mv vector store, naming, clarify instructions, improve ingestion components
* tests: add pw retriever test; fix: update docker to include libmagic
* implement suggestions from review, entrypoint, reqs, comments, https_proxy
* fix: update docker tags in test and readme
* tests: add separate pathway vectorstore test

Signed-off-by: Berke <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Sihan Chen <[email protected]>
1 parent 6d4b668 commit 2c2322e

15 files changed: +618 −0 lines
Lines changed: 104 additions & 0 deletions
# Retriever Microservice with Pathway

## 🚀Start Microservices

### With the Docker CLI

We suggest using `docker compose` to run this app; refer to the [`docker compose`](#with-docker-compose) section below.

If you prefer to run the services separately, follow this section.

#### (Optional) Start the TEI embedding service separately

> Note: Docker Compose starts this service as well, so this step is optional.

```bash
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=${your_langchain_api_key}
export LANGCHAIN_PROJECT="opea/retriever"
model=BAAI/bge-base-en-v1.5
revision=refs/pr/4
# TEI_EMBEDDING_ENDPOINT="http://${your_ip}:6060" # if you want to use the hosted embedding service, example: "http://127.0.0.1:6060"

# then run:
docker run -p 6060:80 -e http_proxy=$http_proxy -e https_proxy=$https_proxy --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.2 --model-id $model --revision $revision
```

Health check the embedding service with:

```bash
curl 127.0.0.1:6060/embed -X POST -d '{"inputs":"What is Deep Learning?"}' -H 'Content-Type: application/json'
```

If the model supports re-ranking, you can also use:

```bash
curl 127.0.0.1:6060/rerank -X POST -d '{"query":"What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' -H 'Content-Type: application/json'
```

#### Start the Retriever Service

The retriever service queries the Pathway vector store on incoming requests. Make sure the Pathway vector store is already running; see the [Pathway vector store README](../../../vectorstores/langchain/pathway/README.md).

The retriever service expects the Pathway host and port variables in order to connect to the vector DB. Set the Pathway vector store environment variables:

```bash
export PATHWAY_HOST=0.0.0.0
export PATHWAY_PORT=8666
```

```bash
# make sure you are in the root folder of the repo
docker build -t opea/retriever-pathway:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/retrievers/langchain/pathway/docker/Dockerfile .

docker run -p 7000:7000 -e PATHWAY_HOST=${PATHWAY_HOST} -e PATHWAY_PORT=${PATHWAY_PORT} -e http_proxy=$http_proxy -e https_proxy=$https_proxy --network="host" opea/retriever-pathway:latest
```

### With Docker Compose

First, set the environment variables:

```bash
export PATHWAY_HOST=0.0.0.0
export PATHWAY_PORT=8666
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=${your_langchain_api_key}
export LANGCHAIN_PROJECT="opea/retriever"
model=BAAI/bge-base-en-v1.5
revision=refs/pr/4
# TEI_EMBEDDING_ENDPOINT="http://${your_ip}:6060" # if you want to use the hosted embedding service, example: "http://127.0.0.1:6060"
```

The text embeddings inference service expects the `RETRIEVE_MODEL_ID` variable to be set:

```bash
export RETRIEVE_MODEL_ID=BAAI/bge-base-en-v1.5
```

Note that the following Docker Compose file sets `network_mode: host` on the retriever image to allow connecting to the local vector store. This will start both the embedding and retriever services:

```bash
cd comps/retrievers/langchain/pathway/docker

docker compose -f docker_compose_retriever.yaml build
docker compose -f docker_compose_retriever.yaml up

# shut down the containers
docker compose -f docker_compose_retriever.yaml down
```

Make sure the retriever service is working as expected:

```bash
curl http://0.0.0.0:7000/v1/health_check -X GET -H 'Content-Type: application/json'
```

Send an example query:

```bash
exm_embeddings=$(python -c "import random; embedding = [random.uniform(-1, 1) for _ in range(768)]; print(embedding)")

curl http://0.0.0.0:7000/v1/retrieval -X POST -d "{\"text\":\"What is the revenue of Nike in 2023?\",\"embedding\":${exm_embeddings}}" -H 'Content-Type: application/json'
```
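The same request can be built from Python. A minimal sketch that mirrors the shell snippet above; the helper name `build_retrieval_request` is illustrative (not part of the service), and the random 768-dimensional embedding is a stand-in for a real TEI embedding:

```python
import random


def build_retrieval_request(text: str, dim: int = 768) -> dict:
    """Build the JSON payload expected by the /v1/retrieval endpoint.

    The embedding here is random, mirroring the shell example above; in a
    real pipeline it would come from the TEI embedding service.
    """
    embedding = [random.uniform(-1, 1) for _ in range(dim)]
    return {"text": text, "embedding": embedding}


payload = build_retrieval_request("What is the revenue of Nike in 2023?")
print(len(payload["embedding"]))  # 768

# To actually send it (requires the retriever service to be running):
# import requests
# resp = requests.post("http://0.0.0.0:7000/v1/retrieval", json=payload)
# print(resp.json())
```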
Lines changed: 30 additions & 0 deletions
```dockerfile
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM langchain/langchain:latest

ARG ARCH="cpu"

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
    libgl1-mesa-glx \
    libjemalloc-dev \
    vim

RUN useradd -m -s /bin/bash user && \
    mkdir -p /home/user && \
    chown -R user /home/user/

COPY comps /home/user/comps

USER user

RUN pip install --no-cache-dir --upgrade pip && \
    if [ ${ARCH} = "cpu" ]; then pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu; fi && \
    pip install --no-cache-dir -r /home/user/comps/retrievers/langchain/pathway/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

WORKDIR /home/user/comps/retrievers/langchain/pathway

ENTRYPOINT ["bash", "entrypoint.sh"]
```
Lines changed: 35 additions & 0 deletions
```yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

version: "3.8"

services:
  tei_xeon_service:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.2
    container_name: tei-xeon-server
    ports:
      - "6060:80"
    volumes:
      - "./data:/data"
    shm_size: 1g
    command: --model-id ${RETRIEVE_MODEL_ID}
  retriever:
    image: opea/retriever-pathway:latest
    container_name: retriever-pathway-server
    ports:
      - "7000:7000"
    ipc: host
    network_mode: host
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      PATHWAY_HOST: ${PATHWAY_HOST}
      PATHWAY_PORT: ${PATHWAY_PORT}
      TEI_EMBEDDING_ENDPOINT: ${TEI_EMBEDDING_ENDPOINT}
      LANGCHAIN_API_KEY: ${LANGCHAIN_API_KEY}
    restart: unless-stopped

networks:
  default:
    driver: bridge
```
Lines changed: 6 additions & 0 deletions
```bash
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

pip --no-cache-dir install -r requirements-runtime.txt

python retriever_pathway.py
```
Lines changed: 1 addition & 0 deletions
```text
langsmith
```
Lines changed: 12 additions & 0 deletions
```text
docarray[full]
fastapi
frontend==0.0.3
huggingface_hub
langchain_community == 0.2.0
opentelemetry-api
opentelemetry-exporter-otlp
opentelemetry-sdk
pathway
prometheus-fastapi-instrumentator
sentence_transformers
shortuuid
```
Lines changed: 52 additions & 0 deletions
```python
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os
import time

from langchain_community.vectorstores import PathwayVectorClient
from langsmith import traceable

from comps import (
    EmbedDoc,
    SearchedDoc,
    ServiceType,
    TextDoc,
    opea_microservices,
    register_microservice,
    register_statistics,
    statistics_dict,
)

host = os.getenv("PATHWAY_HOST", "127.0.0.1")
port = int(os.getenv("PATHWAY_PORT", 8666))

EMBED_MODEL = os.getenv("EMBED_MODEL", "BAAI/bge-base-en-v1.5")

tei_embedding_endpoint = os.getenv("TEI_EMBEDDING_ENDPOINT")


@register_microservice(
    name="opea_service@retriever_pathway",
    service_type=ServiceType.RETRIEVER,
    endpoint="/v1/retrieval",
    host="0.0.0.0",
    port=7000,
)
@traceable(run_type="retriever")
@register_statistics(names=["opea_service@retriever_pathway"])
def retrieve(input: EmbedDoc) -> SearchedDoc:
    start = time.time()
    documents = pw_client.similarity_search(input.text, input.fetch_k)

    docs = [TextDoc(text=r.page_content) for r in documents]

    time_spent = time.time() - start
    statistics_dict["opea_service@retriever_pathway"].append_latency(time_spent, None)  # noqa: E501
    return SearchedDoc(retrieved_docs=docs, initial_query=input.text)


if __name__ == "__main__":
    # Create the vector store client
    pw_client = PathwayVectorClient(host=host, port=port)
    opea_microservices["opea_service@retriever_pathway"].start()
```
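For reference, the handler above returns a `SearchedDoc` with a `retrieved_docs` list of `TextDoc` entries plus the `initial_query`, so clients receive JSON of that shape. A small, hypothetical client-side helper (not part of the service) for pulling out the document texts:

```python
def extract_texts(searched_doc: dict) -> list:
    """Pull the plain document texts out of a SearchedDoc-style response.

    The `retrieved_docs` / `initial_query` field names mirror the object
    returned by retrieve() above; this helper is purely illustrative.
    """
    return [doc["text"] for doc in searched_doc.get("retrieved_docs", [])]


# Example response shape (hypothetical content):
response = {
    "retrieved_docs": [{"text": "Deep learning is..."}, {"text": "Nike reported..."}],
    "initial_query": "What is the revenue of Nike in 2023?",
}
print(extract_texts(response))
```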

comps/vectorstores/README.md

Lines changed: 4 additions & 0 deletions
For details, please refer to this [readme](langchain/pgvector/README.md)

## Vectorstores Microservice with Pinecone

For details, please refer to this [readme](langchain/pinecone/README.md)

## Vectorstores Microservice with Pathway

For details, please refer to this [readme](langchain/pathway/README.md)
Lines changed: 25 additions & 0 deletions
```dockerfile
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM pathwaycom/pathway:0.13.2-slim

ENV DOCKER_BUILDKIT=1
ENV PYTHONUNBUFFERED=1

RUN apt-get update && apt-get install -y \
    poppler-utils \
    libreoffice \
    libmagic-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt /app/

RUN pip install --no-cache-dir -r requirements.txt

COPY vectorstore_pathway.py /app/

CMD ["python", "vectorstore_pathway.py"]
```
Lines changed: 84 additions & 0 deletions
# Start the Pathway Vector DB Server

Set the environment variables for Pathway and the embedding model.

> Note: If you are using `TEI_EMBEDDING_ENDPOINT`, make sure the embedding service is already running;
> see the instructions in the [retriever README](../../../retrievers/langchain/pathway/README.md).

```bash
export PATHWAY_HOST=0.0.0.0
export PATHWAY_PORT=8666
# TEI_EMBEDDING_ENDPOINT="http://${your_ip}:6060" # uncomment if you want to use the hosted embedding service, example: "http://127.0.0.1:6060"
```

## Configuration

### Setting up the Pathway data sources

Pathway can listen to many sources simultaneously, such as local files, S3 folders, cloud storage, and any data stream. Whenever a new file is added or an existing file is modified, Pathway parses, chunks and indexes the documents in real time.

See [pathway-io](https://pathway.com/developers/api-docs/pathway-io) for more information.

You can easily connect to the data inside a folder with the Pathway file system connector. The data will be updated automatically by Pathway whenever the content of the folder changes. In this example, we create a single data source that reads the files under the `./data` folder.

You can manage your data sources by configuring `data_sources` in `vectorstore_pathway.py`.

```python
import pathway as pw

data = pw.io.fs.read(
    "./data",
    format="binary",
    mode="streaming",
    with_metadata=True,
)  # This creates a Pathway connector that tracks
# all the files in the ./data directory

data_sources = [data]
```

### Other configurations (parser, splitter and embedder)

The Pathway vector store handles the ingestion and processing of the documents.
This allows you to configure the parser, the splitter and the embedder.
Whenever a file is added or modified in one of the sources, Pathway will automatically ingest it.

By default, the `ParseUnstructured` parser, the `langchain.text_splitter.CharacterTextSplitter` splitter and the `BAAI/bge-base-en-v1.5` embedder are used.

For more information, see the relevant Pathway docs:

- [Vector store docs](https://pathway.com/developers/api-docs/pathway-xpacks-llm/vectorstore)
- [Parsers docs](https://pathway.com/developers/api-docs/pathway-xpacks-llm/parsers)
- [Splitters docs](https://pathway.com/developers/api-docs/pathway-xpacks-llm/splitters)
- [Embedders docs](https://pathway.com/developers/api-docs/pathway-xpacks-llm/embedders)
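To illustrate what the splitter stage does conceptually, here is a toy fixed-size character chunker. The real `CharacterTextSplitter` is separator-aware and configurable, so treat this purely as a sketch of turning one document into overlapping chunks before embedding:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list:
    """Naive fixed-size character chunking with overlap.

    This is NOT Pathway's actual splitter; it only shows the general idea
    of splitting a document into overlapping pieces prior to embedding.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


doc = "".join(str(i % 10) for i in range(500))
chunks = chunk_text(doc)
print(len(chunks))  # 3
```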
## Building and running

Build the Docker image and run the Pathway Vector Store:

```bash
cd comps/vectorstores/langchain/pathway

docker build --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -t opea/vectorstore-pathway:latest .

# with a locally loaded model; you may add the `EMBED_MODEL` env variable to configure the model
docker run -e PATHWAY_HOST=${PATHWAY_HOST} -e PATHWAY_PORT=${PATHWAY_PORT} -e http_proxy=$http_proxy -e https_proxy=$https_proxy -v ./data:/app/data -p ${PATHWAY_PORT}:${PATHWAY_PORT} opea/vectorstore-pathway:latest

# with the hosted embedder (the network argument is needed for the vector server to reach the embedding service)
docker run -e PATHWAY_HOST=${PATHWAY_HOST} -e PATHWAY_PORT=${PATHWAY_PORT} -e TEI_EMBEDDING_ENDPOINT=${TEI_EMBEDDING_ENDPOINT} -e http_proxy=$http_proxy -e https_proxy=$https_proxy -v ./data:/app/data -p ${PATHWAY_PORT}:${PATHWAY_PORT} --network="host" opea/vectorstore-pathway:latest
```

## Health check the vector store

Wait until the server finishes indexing the docs, then send the following request to check it:

```bash
curl -X 'POST' \
  "http://$PATHWAY_HOST:$PATHWAY_PORT/v1/statistics" \
  -H 'accept: */*' \
  -H 'Content-Type: application/json'
```

This should respond with something like:

> `{"file_count": 1, "last_indexed": 1724325093, "last_modified": 1724317365}`
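A small helper for deciding readiness from that statistics payload. The function name `is_ready` and the interpretation of the fields (indexed at least one file, and the last indexing pass is not older than the last modification) are illustrative assumptions based on the example response above:

```python
def is_ready(stats: dict) -> bool:
    """Heuristic readiness check on the /v1/statistics response.

    Assumes the field names shown in the example response above; the
    exact semantics of last_indexed/last_modified are an assumption.
    """
    return (
        stats.get("file_count", 0) > 0
        and stats.get("last_indexed", 0) >= stats.get("last_modified", 0)
    )


print(is_ready({"file_count": 1, "last_indexed": 1724325093, "last_modified": 1724317365}))  # True
```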
