Commit f84d91a

Authored by srinarayan-srikanthan, pre-commit-ci[bot], BaoHuiling, and XinyuYe-Intel
Add dataprep support for CLIP-based models for the VideoRAGQnA example for v1.0 (#621)
Squashed commits:

* dataprep service (srinarayan-srikanthan)
* dataprep updates (srinarayan-srikanthan)
* rearranged dirs (srinarayan-srikanthan)
* added readme (srinarayan-srikanthan)
* removed checks (srinarayan-srikanthan)
* added features (srinarayan-srikanthan)
* added get method (srinarayan-srikanthan)
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* add dim at init, rm unused (BaoHuiling)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* add wait after connect DB (BaoHuiling)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* remove unused (BaoHuiling)
* Update comps/dataprep/vdms/README.md (BaoHuiling, co-authored by XinyuYe-Intel)
* add test script for mm case (BaoHuiling)
* add return value and update readme (BaoHuiling)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* check bug (BaoHuiling)
* fix mm-script (BaoHuiling)
* add into dataprep workflow (BaoHuiling)
* rm whitespace (BaoHuiling)
* updated readme and added test script (srinarayan-srikanthan)
* removed unused file (srinarayan-srikanthan)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* move test script (BaoHuiling)
* restructured repo (srinarayan-srikanthan)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* updates path in test script (srinarayan-srikanthan)
* add name for build (BaoHuiling)

Signed-off-by: srinarayan-srikanthan <[email protected]>
Signed-off-by: BaoHuiling <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: BaoHuiling <[email protected]>
Co-authored-by: XinyuYe-Intel <[email protected]>
1 parent 4165c7d · commit f84d91a

20 files changed: +1475 −0 lines

.github/workflows/docker/compose/dataprep-compose-cd.yaml

Lines changed: 4 additions & 0 deletions

```diff
@@ -23,3 +23,7 @@ services:
     build:
       dockerfile: comps/dataprep/pinecone/langchain/Dockerfile
     image: ${REGISTRY:-opea}/dataprep-pinecone:${TAG:-latest}
+  dataprep-vdms:
+    build:
+      dockerfile: comps/dataprep/vdms/multimodal_langchain/docker/Dockerfile
+    image: ${REGISTRY:-opea}/dataprep-vdms:${TAG:-latest}
```

comps/dataprep/vdms/README.md

Lines changed: 189 additions & 0 deletions
# Dataprep Microservice with VDMS

For the dataprep microservice, we currently provide one framework: `Langchain`.

<!-- We also provide `Langchain_ray`, which uses Ray to parallelize dataprep for a multi-file performance improvement (observed 5x-15x speedup when processing 1000 files/links). -->

The folders are organized the same way, so you can use either framework for the dataprep microservice by following the instructions below.

# 🚀1. Start Microservice with Python (Option 1)

## 1.1 Install Requirements

Install the single-process version (for processing 1-10 files):

```bash
apt-get update
apt-get install -y default-jre tesseract-ocr libtesseract-dev poppler-utils
cd langchain
pip install -r requirements.txt
```

<!-- - option 2: Install the multi-process version (for processing more than 10 files)

```bash
cd langchain_ray; pip install -r requirements_ray.txt
``` -->

## 1.2 Start VDMS Server

Please refer to this [readme](../../vectorstores/langchain/vdms/README.md).

## 1.3 Setup Environment Variables

```bash
export http_proxy=${your_http_proxy}
export https_proxy=${your_https_proxy}
export VDMS_HOST=${host_ip}
export VDMS_PORT=55555
export COLLECTION_NAME=${your_collection_name}
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_PROJECT="opea/gen-ai-comps:dataprep"
export PYTHONPATH=${path_to_comps}
```

## 1.4 Start Document Preparation Microservice for VDMS with Python Script

Start the document preparation microservice for VDMS with the command below.

Start the single-process version (for processing 1-10 files):

```bash
python prepare_doc_vdms.py
```

<!-- - option 2: Start the multi-process version (for processing more than 10 files)

```bash
python prepare_doc_redis_on_ray.py
``` -->

# 🚀2. Start Microservice with Docker (Option 2)

## 2.1 Start VDMS Server

Please refer to this [readme](../../vectorstores/langchain/vdms/README.md).

## 2.2 Setup Environment Variables

```bash
export http_proxy=${your_http_proxy}
export https_proxy=${your_https_proxy}
export VDMS_HOST=${host_ip}
export VDMS_PORT=55555
export TEI_ENDPOINT=${your_tei_endpoint}
export COLLECTION_NAME=${your_collection_name}
export SEARCH_ENGINE="FaissFlat"
export DISTANCE_STRATEGY="L2"
export PYTHONPATH=${path_to_comps}
```

## 2.3 Build Docker Image

- Build the docker image with langchain

Build the single-process version (for processing 1-10 files):

```bash
cd ../../../
docker build -t opea/dataprep-vdms:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/vdms/langchain/Dockerfile .
```

<!-- - option 2: Build the multi-process version (for processing more than 10 files)

```bash
cd ../../../../
docker build -t opea/dataprep-on-ray-vdms:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/vdms/langchain_ray/Dockerfile .
``` -->

## 2.4 Run Docker with CLI

Start the single-process version (for processing 1-10 files):

```bash
docker run -d --name="dataprep-vdms-server" -p 6007:6007 --runtime=runc --ipc=host \
  -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e TEI_ENDPOINT=$TEI_ENDPOINT \
  -e COLLECTION_NAME=$COLLECTION_NAME -e VDMS_HOST=$VDMS_HOST -e VDMS_PORT=$VDMS_PORT \
  opea/dataprep-vdms:latest
```

<!-- - option 2: Start the multi-process version (for processing more than 10 files)

```bash
docker run -d --name="dataprep-vdms-server" -p 6007:6007 --runtime=runc --ipc=host \
  -e http_proxy=$http_proxy -e https_proxy=$https_proxy \
  -e COLLECTION_NAME=$COLLECTION_NAME -e VDMS_HOST=$VDMS_HOST -e VDMS_PORT=$VDMS_PORT \
  -e TIMEOUT_SECONDS=600 opea/dataprep-on-ray-vdms:latest
``` -->

# 🚀3. Check Microservice Status

Follow the microservice logs with:

```bash
docker container logs -f dataprep-vdms-server
```

# 🚀4. Consume Microservice

Once the document preparation microservice for VDMS is started, users can use the commands below to invoke the microservice, which converts documents to embeddings and saves them to the database.

Make sure the file path after `files=@` is correct.

- Single file upload

```bash
curl -X POST \
  -H "Content-Type: multipart/form-data" \
  -F "files=@./file1.txt" \
  http://localhost:6007/v1/dataprep
```

You can specify `chunk_size` and `chunk_overlap` with the following command:

```bash
curl -X POST \
  -H "Content-Type: multipart/form-data" \
  -F "files=@./LLAMA2_page6.pdf" \
  -F "chunk_size=1500" \
  -F "chunk_overlap=100" \
  http://localhost:6007/v1/dataprep
```
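To illustrate what `chunk_size` and `chunk_overlap` control, here is a minimal fixed-size chunker. This is a hypothetical sketch for intuition only; the service's actual splitting is handled by LangChain and is more sophisticated.

```python
def chunk_text(text: str, chunk_size: int = 1500, chunk_overlap: int = 100) -> list[str]:
    """Split text into windows of up to chunk_size characters, where each
    consecutive pair of chunks shares chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i : i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]


chunks = chunk_text("x" * 3000)
print(len(chunks))  # 3 chunks covering [0:1500], [1400:2900], [2800:3000]
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from both neighboring chunks.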

- Multiple file upload

```bash
curl -X POST \
  -H "Content-Type: multipart/form-data" \
  -F "files=@./file1.txt" \
  -F "files=@./file2.txt" \
  -F "files=@./file3.txt" \
  http://localhost:6007/v1/dataprep
```

- Links upload (not supported for llama_index now)

```bash
curl -X POST \
  -F 'link_list=["https://www.ces.tech/"]' \
  http://localhost:6007/v1/dataprep
```

or

```python
import json

import requests

proxies = {"http": ""}
url = "http://localhost:6007/v1/dataprep"
urls = [
    "https://towardsdatascience.com/no-gpu-no-party-fine-tune-bert-for-sentiment-analysis-with-vertex-ai-custom-jobs-d8fc410e908b?source=rss----7f60cf5620c9---4"
]
payload = {"link_list": json.dumps(urls)}

try:
    resp = requests.post(url=url, data=payload, proxies=proxies)
    print(resp.text)
    resp.raise_for_status()  # raise an exception for unsuccessful HTTP status codes
    print("Request successful!")
except requests.exceptions.RequestException as e:
    print("An error occurred:", e)
```
Lines changed: 39 additions & 0 deletions

```dockerfile
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM python:3.11-slim

ENV LANG=C.UTF-8

ARG ARCH="cpu"

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
    build-essential \
    libcairo2-dev \
    libgl1-mesa-glx \
    libjemalloc-dev \
    vim

RUN useradd -m -s /bin/bash user && \
    mkdir -p /home/user && \
    chown -R user /home/user/

USER user

COPY comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip setuptools && \
    if [ ${ARCH} = "cpu" ]; then pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu; fi && \
    pip install --no-cache-dir -r /home/user/comps/dataprep/vdms/langchain/requirements.txt

ENV PYTHONPATH=/home/user

USER root

RUN mkdir -p /home/user/comps/dataprep/vdms/langchain/uploaded_files && chown -R user /home/user/comps/dataprep/vdms/langchain

USER user

WORKDIR /home/user/comps/dataprep/vdms/langchain

ENTRYPOINT ["python", "prepare_doc_vdms.py"]
```
Lines changed: 2 additions & 0 deletions

```python
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
```
Lines changed: 33 additions & 0 deletions

```python
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os


def getEnv(key, default_value=None):
    env_value = os.getenv(key, default=default_value)
    print(f"{key}: {env_value}")
    return env_value


# Embedding model
EMBED_MODEL = getEnv("EMBED_MODEL", "BAAI/bge-base-en-v1.5")

# VDMS configuration
VDMS_HOST = getEnv("VDMS_HOST", "localhost")
VDMS_PORT = int(getEnv("VDMS_PORT", 55555))
COLLECTION_NAME = getEnv("COLLECTION_NAME", "rag-vdms")
SEARCH_ENGINE = getEnv("SEARCH_ENGINE", "FaissFlat")
DISTANCE_STRATEGY = getEnv("DISTANCE_STRATEGY", "L2")

# LLM/Embedding endpoints
TGI_LLM_ENDPOINT = getEnv("TGI_LLM_ENDPOINT", "http://localhost:8080")
TGI_LLM_ENDPOINT_NO_RAG = getEnv("TGI_LLM_ENDPOINT_NO_RAG", "http://localhost:8081")
TEI_EMBEDDING_ENDPOINT = getEnv("TEI_ENDPOINT")

# chunk parameters
CHUNK_SIZE = getEnv("CHUNK_SIZE", 1500)
CHUNK_OVERLAP = getEnv("CHUNK_OVERLAP", 100)

current_file_path = os.path.abspath(__file__)
parent_dir = os.path.dirname(current_file_path)
```
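One subtlety in the config above: `os.getenv` returns a string whenever the variable is set in the environment, but `getEnv` returns the raw default (possibly an `int`) when it is not. That is why `VDMS_PORT` is wrapped in `int(...)`; `CHUNK_SIZE` and `CHUNK_OVERLAP` would need the same coercion before any arithmetic. A quick demonstration, re-implementing the helper here for illustration:

```python
import os


def getEnv(key, default_value=None):
    # Same shape as the helper in the diff above: read, log, return.
    env_value = os.getenv(key, default=default_value)
    print(f"{key}: {env_value}")
    return env_value


os.environ.pop("CHUNK_SIZE", None)
unset = getEnv("CHUNK_SIZE", 1500)      # variable unset -> the int default comes back
os.environ["CHUNK_SIZE"] = "1500"
from_env = getEnv("CHUNK_SIZE", 1500)   # variable set -> a string from the environment
print(type(unset).__name__, type(from_env).__name__)  # int str
```

So callers should apply `int(...)` (as `VDMS_PORT` does) to any numeric setting that may be overridden via the environment.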
Lines changed: 28 additions & 0 deletions

```yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

version: "3"
services:
  vdms-vector-db:
    image: intellabs/vdms:latest
    container_name: vdms-vector-db
    ports:
      - "55555:55555"
  dataprep-vdms:
    image: opea/dataprep-vdms:latest
    container_name: dataprep-vdms-server
    ports:
      - "6007:6007"
    ipc: host
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      VDMS_HOST: ${VDMS_HOST}
      VDMS_PORT: ${VDMS_PORT}
      COLLECTION_NAME: ${COLLECTION_NAME}
    restart: unless-stopped

networks:
  default:
    driver: bridge
```
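Compose starts `vdms-vector-db` and `dataprep-vdms` together, but VDMS may not be accepting connections the instant its container is up (note the "add wait after connect DB" commit in the history above). A hedged readiness-probe sketch, assuming only that VDMS listens on the mapped TCP port:

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True  # something is accepting connections
        except OSError:
            time.sleep(0.5)  # refused or unreachable; retry until deadline
    return False


# Usage (with the compose stack above running):
#   wait_for_port("localhost", 55555)
```

Running a probe like this before the first ingestion request avoids spurious connection errors during startup.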
