
Commit 2c2322e

Authored by berkecanrizai, pre-commit-ci[bot] and Spycsh
add: Pathway vector store and retriever as LangChain component (#342)
* nb
* init changes
* docker
* example data
* docs(readme): update, add commands
* fix: formatting, data sources
* docs(readme): update instructions, add comments
* fix: rm unused parts
* fix: image name, compose env vars
* fix: rm unused part
* fix: logging name
* fix: env var
* [pre-commit.ci] auto fixes from pre-commit.com hooks, see https://pre-commit.ci
* fix: rename pw docker
* docs(readme): update input sources
* feat: mv vector store, naming, clarify instructions, improve ingestion components
* tests: add pw retriever test; fix: update docker to include libmagic
* implement suggestions from review, entrypoint, reqs, comments, https_proxy
* fix: update docker tags in test and readme
* tests: add separate pathway vectorstore test

Signed-off-by: Berke <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Sihan Chen <[email protected]>
1 parent 6d4b668 commit 2c2322e

15 files changed: +618 −0 lines
Lines changed: 104 additions & 0 deletions
# Retriever Microservice with Pathway

## 🚀Start Microservices

### With the Docker CLI

We suggest using `docker compose` to run this app; refer to the [`docker compose`](#with-docker-compose) section below.

If you prefer to run the services separately, follow this section.

#### (Optional) Start the TEI embedding service separately

> Note: Docker Compose starts this service as well, so this step is optional.

```bash
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=${your_langchain_api_key}
export LANGCHAIN_PROJECT="opea/retriever"
model=BAAI/bge-base-en-v1.5
revision=refs/pr/4
# TEI_EMBEDDING_ENDPOINT="http://${your_ip}:6060" # if you want to use the hosted embedding service, example: "http://127.0.0.1:6060"

# then run:
docker run -p 6060:80 -e http_proxy=$http_proxy -e https_proxy=$https_proxy --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.2 --model-id $model --revision $revision
```

Health check the embedding service with:

```bash
curl 127.0.0.1:6060/embed -X POST -d '{"inputs":"What is Deep Learning?"}' -H 'Content-Type: application/json'
```

If the model supports re-ranking, you can also use:

```bash
curl 127.0.0.1:6060/rerank -X POST -d '{"query":"What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' -H 'Content-Type: application/json'
```

#### Start the Retriever Service

The retriever service queries the Pathway vector store on incoming requests. Make sure the Pathway vector store is already running; see the [Pathway vector store README](../../../vectorstores/langchain/pathway/README.md).

The retriever service expects the Pathway host and port variables in order to connect to the vector DB. Set the Pathway vector store environment variables:

```bash
export PATHWAY_HOST=0.0.0.0
export PATHWAY_PORT=8666
```

```bash
# make sure you are in the root folder of the repo
docker build -t opea/retriever-pathway:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/retrievers/langchain/pathway/docker/Dockerfile .

docker run -p 7000:7000 -e PATHWAY_HOST=${PATHWAY_HOST} -e PATHWAY_PORT=${PATHWAY_PORT} -e http_proxy=$http_proxy -e https_proxy=$https_proxy --network="host" opea/retriever-pathway:latest
```

### With Docker Compose

First, set the environment variables:

```bash
export PATHWAY_HOST=0.0.0.0
export PATHWAY_PORT=8666
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=${your_langchain_api_key}
export LANGCHAIN_PROJECT="opea/retriever"
model=BAAI/bge-base-en-v1.5
revision=refs/pr/4
# TEI_EMBEDDING_ENDPOINT="http://${your_ip}:6060" # if you want to use the hosted embedding service, example: "http://127.0.0.1:6060"
```

The text embeddings inference service expects the `RETRIEVE_MODEL_ID` variable to be set:

```bash
export RETRIEVE_MODEL_ID=BAAI/bge-base-en-v1.5
```

Note that the following Docker Compose file sets `network_mode: host` on the retriever image to allow connecting to the local vector store. This will start both the embedding and retriever services:

```bash
cd comps/retrievers/langchain/pathway/docker

docker compose -f docker_compose_retriever.yaml build
docker compose -f docker_compose_retriever.yaml up

# shut down the containers
docker compose -f docker_compose_retriever.yaml down
```

Make sure the retriever service is working as expected:

```bash
curl http://0.0.0.0:7000/v1/health_check -X GET -H 'Content-Type: application/json'
```

Send an example query:

```bash
exm_embeddings=$(python -c "import random; embedding = [random.uniform(-1, 1) for _ in range(768)]; print(embedding)")

curl http://0.0.0.0:7000/v1/retrieval -X POST -d "{\"text\":\"What is the revenue of Nike in 2023?\",\"embedding\":${exm_embeddings}}" -H 'Content-Type: application/json'
```
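The same request can be built from Python. A minimal sketch that mirrors the shell snippet above; the helper name `build_retrieval_request` is illustrative (not part of the service), and the random 768-dimensional embedding is a stand-in for a real TEI embedding:

```python
import random


def build_retrieval_request(text: str, dim: int = 768) -> dict:
    """Build the JSON payload expected by the /v1/retrieval endpoint.

    The embedding here is random, mirroring the shell example above; in a
    real pipeline it would come from the TEI embedding service.
    """
    embedding = [random.uniform(-1, 1) for _ in range(dim)]
    return {"text": text, "embedding": embedding}


payload = build_retrieval_request("What is the revenue of Nike in 2023?")
print(len(payload["embedding"]))  # 768

# To actually send it (requires the retriever service to be running):
# import requests
# resp = requests.post("http://0.0.0.0:7000/v1/retrieval", json=payload)
# print(resp.json())
```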
Lines changed: 30 additions & 0 deletions
```dockerfile
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM langchain/langchain:latest

ARG ARCH="cpu"

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
    libgl1-mesa-glx \
    libjemalloc-dev \
    vim

RUN useradd -m -s /bin/bash user && \
    mkdir -p /home/user && \
    chown -R user /home/user/

COPY comps /home/user/comps

USER user

RUN pip install --no-cache-dir --upgrade pip && \
    if [ ${ARCH} = "cpu" ]; then pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu; fi && \
    pip install --no-cache-dir -r /home/user/comps/retrievers/langchain/pathway/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

WORKDIR /home/user/comps/retrievers/langchain/pathway

ENTRYPOINT ["bash", "entrypoint.sh"]
```
Lines changed: 35 additions & 0 deletions
```yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

version: "3.8"

services:
  tei_xeon_service:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.2
    container_name: tei-xeon-server
    ports:
      - "6060:80"
    volumes:
      - "./data:/data"
    shm_size: 1g
    command: --model-id ${RETRIEVE_MODEL_ID}
  retriever:
    image: opea/retriever-pathway:latest
    container_name: retriever-pathway-server
    ports:
      - "7000:7000"
    ipc: host
    network_mode: host
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      PATHWAY_HOST: ${PATHWAY_HOST}
      PATHWAY_PORT: ${PATHWAY_PORT}
      TEI_EMBEDDING_ENDPOINT: ${TEI_EMBEDDING_ENDPOINT}
      LANGCHAIN_API_KEY: ${LANGCHAIN_API_KEY}
    restart: unless-stopped

networks:
  default:
    driver: bridge
```
Lines changed: 6 additions & 0 deletions
```bash
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

pip --no-cache-dir install -r requirements-runtime.txt

python retriever_pathway.py
```
Lines changed: 1 addition & 0 deletions
```text
langsmith
```
Lines changed: 12 additions & 0 deletions
```text
docarray[full]
fastapi
frontend==0.0.3
huggingface_hub
langchain_community == 0.2.0
opentelemetry-api
opentelemetry-exporter-otlp
opentelemetry-sdk
pathway
prometheus-fastapi-instrumentator
sentence_transformers
shortuuid
```
Lines changed: 52 additions & 0 deletions
```python
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os
import time

from langchain_community.vectorstores import PathwayVectorClient
from langsmith import traceable

from comps import (
    EmbedDoc,
    SearchedDoc,
    ServiceType,
    TextDoc,
    opea_microservices,
    register_microservice,
    register_statistics,
    statistics_dict,
)

host = os.getenv("PATHWAY_HOST", "127.0.0.1")
port = int(os.getenv("PATHWAY_PORT", 8666))

EMBED_MODEL = os.getenv("EMBED_MODEL", "BAAI/bge-base-en-v1.5")

tei_embedding_endpoint = os.getenv("TEI_EMBEDDING_ENDPOINT")


@register_microservice(
    name="opea_service@retriever_pathway",
    service_type=ServiceType.RETRIEVER,
    endpoint="/v1/retrieval",
    host="0.0.0.0",
    port=7000,
)
@traceable(run_type="retriever")
@register_statistics(names=["opea_service@retriever_pathway"])
def retrieve(input: EmbedDoc) -> SearchedDoc:
    start = time.time()
    documents = pw_client.similarity_search(input.text, input.fetch_k)

    docs = [TextDoc(text=r.page_content) for r in documents]

    time_spent = time.time() - start
    statistics_dict["opea_service@retriever_pathway"].append_latency(time_spent, None)  # noqa: E501
    return SearchedDoc(retrieved_docs=docs, initial_query=input.text)


if __name__ == "__main__":
    # Create the vector store client
    pw_client = PathwayVectorClient(host=host, port=port)
    opea_microservices["opea_service@retriever_pathway"].start()
```
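For reference, the handler above returns a `SearchedDoc` with a `retrieved_docs` list of `TextDoc` entries plus the `initial_query`, so clients receive JSON of that shape. A small, hypothetical client-side helper (not part of the service) for pulling out the document texts:

```python
def extract_texts(searched_doc: dict) -> list:
    """Pull the plain document texts out of a SearchedDoc-style response.

    The `retrieved_docs` / `initial_query` field names mirror the object
    returned by retrieve() above; this helper is purely illustrative.
    """
    return [doc["text"] for doc in searched_doc.get("retrieved_docs", [])]


# Example response shape (hypothetical content):
response = {
    "retrieved_docs": [{"text": "Deep learning is..."}, {"text": "Nike reported..."}],
    "initial_query": "What is the revenue of Nike in 2023?",
}
print(extract_texts(response))
```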

comps/vectorstores/README.md

Lines changed: 4 additions & 0 deletions
For details, please refer to this [readme](langchain/pgvector/README.md)

## Vectorstores Microservice with Pinecone

For details, please refer to this [readme](langchain/pinecone/README.md)

## Vectorstores Microservice with Pathway

For details, please refer to this [readme](langchain/pathway/README.md)
Lines changed: 25 additions & 0 deletions
```dockerfile
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM pathwaycom/pathway:0.13.2-slim

ENV DOCKER_BUILDKIT=1
ENV PYTHONUNBUFFERED=1

RUN apt-get update && apt-get install -y \
    poppler-utils \
    libreoffice \
    libmagic-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt /app/

RUN pip install --no-cache-dir -r requirements.txt

COPY vectorstore_pathway.py /app/

CMD ["python", "vectorstore_pathway.py"]
```
Lines changed: 84 additions & 0 deletions
# Start the Pathway Vector DB Server

Set the environment variables for Pathway and the embedding model.

> Note: If you are using `TEI_EMBEDDING_ENDPOINT`, make sure the embedding service is already running;
> see the instructions in the [retriever README](../../../retrievers/langchain/pathway/README.md).

```bash
export PATHWAY_HOST=0.0.0.0
export PATHWAY_PORT=8666
# TEI_EMBEDDING_ENDPOINT="http://${your_ip}:6060" # uncomment if you want to use the hosted embedding service, example: "http://127.0.0.1:6060"
```

## Configuration

### Setting up the Pathway data sources

Pathway can listen to many sources simultaneously, such as local files, S3 folders, cloud storage, and any data stream. Whenever a new file is added or an existing file is modified, Pathway parses, chunks and indexes the documents in real time.

See [pathway-io](https://pathway.com/developers/api-docs/pathway-io) for more information.

You can easily connect to the data inside a folder with the Pathway file system connector. The data will be updated automatically by Pathway whenever the content of the folder changes. In this example, we create a single data source that reads the files under the `./data` folder.

You can manage your data sources by configuring `data_sources` in `vectorstore_pathway.py`.

```python
import pathway as pw

data = pw.io.fs.read(
    "./data",
    format="binary",
    mode="streaming",
    with_metadata=True,
)  # This creates a Pathway connector that tracks
# all the files in the ./data directory

data_sources = [data]
```

### Other configurations (parser, splitter and embedder)

The Pathway vector store handles the ingestion and processing of the documents.
This allows you to configure the parser, the splitter and the embedder.
Whenever a file is added or modified in one of the sources, Pathway will automatically ingest it.

By default, the `ParseUnstructured` parser, the `langchain.text_splitter.CharacterTextSplitter` splitter and the `BAAI/bge-base-en-v1.5` embedder are used.

For more information, see the relevant Pathway docs:

- [Vector store docs](https://pathway.com/developers/api-docs/pathway-xpacks-llm/vectorstore)
- [Parsers docs](https://pathway.com/developers/api-docs/pathway-xpacks-llm/parsers)
- [Splitters docs](https://pathway.com/developers/api-docs/pathway-xpacks-llm/splitters)
- [Embedders docs](https://pathway.com/developers/api-docs/pathway-xpacks-llm/embedders)
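To illustrate what the splitter stage does conceptually, here is a toy fixed-size character chunker. The real `CharacterTextSplitter` is separator-aware and configurable, so treat this purely as a sketch of turning one document into overlapping chunks before embedding:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list:
    """Naive fixed-size character chunking with overlap.

    This is NOT Pathway's actual splitter; it only shows the general idea
    of splitting a document into overlapping pieces prior to embedding.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


doc = "".join(str(i % 10) for i in range(500))
chunks = chunk_text(doc)
print(len(chunks))  # 3
```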
## Building and running

Build the Docker image and run the Pathway Vector Store:

```bash
cd comps/vectorstores/langchain/pathway

docker build --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -t opea/vectorstore-pathway:latest .

# with a locally loaded model; you may add the `EMBED_MODEL` env variable to configure the model
docker run -e PATHWAY_HOST=${PATHWAY_HOST} -e PATHWAY_PORT=${PATHWAY_PORT} -e http_proxy=$http_proxy -e https_proxy=$https_proxy -v ./data:/app/data -p ${PATHWAY_PORT}:${PATHWAY_PORT} opea/vectorstore-pathway:latest

# with the hosted embedder (the network argument is needed for the vector server to reach the embedding service)
docker run -e PATHWAY_HOST=${PATHWAY_HOST} -e PATHWAY_PORT=${PATHWAY_PORT} -e TEI_EMBEDDING_ENDPOINT=${TEI_EMBEDDING_ENDPOINT} -e http_proxy=$http_proxy -e https_proxy=$https_proxy -v ./data:/app/data -p ${PATHWAY_PORT}:${PATHWAY_PORT} --network="host" opea/vectorstore-pathway:latest
```

## Health check the vector store

Wait until the server finishes indexing the docs, then send the following request to check it:

```bash
curl -X 'POST' \
  "http://$PATHWAY_HOST:$PATHWAY_PORT/v1/statistics" \
  -H 'accept: */*' \
  -H 'Content-Type: application/json'
```

This should respond with something like:

> `{"file_count": 1, "last_indexed": 1724325093, "last_modified": 1724317365}`
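A small helper for deciding readiness from that statistics payload. The function name `is_ready` and the interpretation of the fields (indexed at least one file, and the last indexing pass is not older than the last modification) are illustrative assumptions based on the example response above:

```python
def is_ready(stats: dict) -> bool:
    """Heuristic readiness check on the /v1/statistics response.

    Assumes the field names shown in the example response above; the
    exact semantics of last_indexed/last_modified are an assumption.
    """
    return (
        stats.get("file_count", 0) > 0
        and stats.get("last_indexed", 0) >= stats.get("last_modified", 0)
    )


print(is_ready({"file_count": 1, "last_indexed": 1724325093, "last_modified": 1724317365}))  # True
```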
