Commit b99e658

srajabos, pre-commit-ci[bot], lianhao, dmsuehir, and intelsharath authored and committed
[Bug: 1375] Fix Readme errors in dataprep component for all VectorDBs (opea-project#1377)
* [Bug: 1375] Fix Readme errors in dataprep component for all VectorDBs

  Fixes opea-project#1375

  Signed-off-by: Piroozan, Nariman <[email protected]>
  Signed-off-by: Ghosh, Soumyadip <[email protected]>
  Signed-off-by: Jaini, Pallavi <[email protected]>
  Signed-off-by: Kavulya, Soila <[email protected]>
  Signed-off-by: Shifani Rajabose <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

  Signed-off-by: Shifani Rajabose <[email protected]>

* Improve dataprep CI and fix pptx file ingesting bug (opea-project#1334)

  - Fix permission issue for when ingesting pptx file with embedded image
  - Add more test coverage to the dataprep CI and unify common dataprep CI test code for DB backends: qdrant, milvus, redis, pgvector

  Signed-off-by: Lianhao Lu <[email protected]>
  Signed-off-by: Shifani Rajabose <[email protected]>

* Fix docker compose command in embedding BridgeTower readme (opea-project#1374)

  Signed-off-by: Dina Suehiro Jones <[email protected]>
  Signed-off-by: Shifani Rajabose <[email protected]>

* Changes to checkin text2graph microservice (opea-project#1357)

  Signed-off-by: Raghava, Sharath <[email protected]>
  Signed-off-by: Shifani Rajabose <[email protected]>

* [Bug: 1375] Fix Readme errors in dataprep component for all VectorDBs

  Fixes opea-project#1375

  Signed-off-by: Piroozan, Nariman <[email protected]>
  Signed-off-by: Ghosh, Soumyadip <[email protected]>
  Signed-off-by: Jaini, Pallavi <[email protected]>
  Signed-off-by: Kavulya, Soila <[email protected]>
  Signed-off-by: Shifani Rajabose <[email protected]>

---------

Signed-off-by: Shifani Rajabose <[email protected]>
Signed-off-by: Lianhao Lu <[email protected]>
Signed-off-by: Dina Suehiro Jones <[email protected]>
Signed-off-by: Raghava, Sharath <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Lianhao Lu <[email protected]>
Co-authored-by: Dina Suehiro Jones <[email protected]>
Co-authored-by: intelsharath <[email protected]>
Co-authored-by: Liang Lv <[email protected]>
Signed-off-by: pallavi.jaini <[email protected]>
1 parent 97d75f1 commit b99e658

10 files changed, 88 insertions(+), 459 deletions(-)

comps/dataprep/src/README_elasticsearch.md

Lines changed: 10 additions & 37 deletions
@@ -1,68 +1,41 @@
 # Dataprep Microservice with Elasticsearch
 
-## 🚀1. Start Microservice with Python(Option 1)
+## 🚀1. Start Microservice with Docker
 
-### 1.1 Install Requirements
-
-```bash
-pip install -r requirements.txt
-```
-
-### 1.2 Setup Environment Variables
-
-```bash
-export ES_CONNECTION_STRING=http://localhost:9200
-export INDEX_NAME=${your_index_name}
-```
-
-### 1.3 Start Elasticsearch
+### 1.1 Start Elasticsearch
 
 Please refer to this [readme](../../third_parties/elasticsearch/src/README.md).
 
-### 1.4 Start Document Preparation Microservice for Elasticsearch with Python Script
-
-Start document preparation microservice for Elasticsearch with below command.
-
-```bash
-python prepare_doc_elastic.py
-```
-
-## 🚀2. Start Microservice with Docker (Option 2)
-
-### 2.1 Start Elasticsearch
-
-Please refer to this [readme](../../third_parties/elasticsearch/src/README.md).
-
-### 2.2 Setup Environment Variables
+### 1.2 Setup Environment Variables
 
 ```bash
 export ES_CONNECTION_STRING=http://localhost:9200
 export INDEX_NAME=${your_index_name}
 ```
 
-### 2.3 Build Docker Image
+### 1.3 Build Docker Image
 
 ```bash
 cd GenAIComps
 docker build -t opea/dataprep:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/src/Dockerfile .
 ```
 
-### 2.4 Run Docker with CLI (Option A)
+### 1.4 Run Docker with CLI (Option A)
 
 ```bash
 docker run --name="dataprep-elasticsearch" -p 6011:6011 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e ES_CONNECTION_STRING=$ES_CONNECTION_STRING -e INDEX_NAME=$INDEX_NAME -e TEI_EMBEDDING_ENDPOINT=$TEI_EMBEDDING_ENDPOINT -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e DATAPREP_COMPONENT_NAME="OPEA_DATAPREP_ELASTICSEARCH" opea/dataprep:latest
 ```
 
-### 2.5 Run with Docker Compose (Option B)
+### 1.5 Run with Docker Compose (Option B)
 
 ```bash
 cd comps/dataprep/deployment/docker_compose/
 docker compose -f compose_elasticsearch.yaml up -d
 ```
 
-## 🚀3. Consume Microservice
+## 🚀2. Consume Microservice
 
-### 3.1 Consume Upload API
+### 2.1 Consume Upload API
 
 Once document preparation microservice for Elasticsearch is started, user can use below command to invoke the
 microservice to convert the document to embedding and save to the database.
@@ -74,7 +47,7 @@ curl -X POST \
 http://localhost:6011/v1/dataprep/ingest
 ```
 
-### 3.2 Consume get API
+### 2.2 Consume get API
 
 To get uploaded file structures, use the following command:
 
@@ -103,7 +76,7 @@ Then you will get the response JSON like this:
 ]
 ```
 
-### 4.3 Consume delete API
+### 2.3 Consume delete API
 
 To delete uploaded file/link, use the following command.
 

comps/dataprep/src/README_milvus.md

Lines changed: 15 additions & 53 deletions
@@ -1,88 +1,50 @@
 # Dataprep Microservice with Milvus
 
-## 🚀1. Start Microservice with Python (Option 1)
+## 🚀1. Start Microservice with Docker
 
-### 1.1 Requirements
-
-```bash
-pip install -r requirements.txt
-apt-get install tesseract-ocr -y
-apt-get install libtesseract-dev -y
-apt-get install poppler-utils -y
-```
-
-### 1.2 Start Milvus Server
+### 1.1 Start Milvus Server
 
 Please refer to this [readme](../../third_parties/milvus/src/README.md).
 
-### 1.3 Setup Environment Variables
+### 1.2 Setup Environment Variables
 
 ```bash
 export no_proxy=${your_no_proxy}
 export http_proxy=${your_http_proxy}
 export https_proxy=${your_http_proxy}
-export MILVUS_HOST=${your_milvus_host_ip}
+export MILVUS_HOST=${your_host_ip}
 export MILVUS_PORT=19530
 export COLLECTION_NAME=${your_collection_name}
-export TEI_EMBEDDING_ENDPOINT=${your_embedding_endpoint}
-export HUGGINGFACEHUB_API_TOKEN=${your_huggingface_api_token}
+export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
+export EMBEDDING_MODEL_ID=${your_embedding_model_id}
 ```
 
-### 1.4 Start TEI Embedding Service
+### 1.3 Start TEI Embedding Service
 
 First, start the TEI embedding server.
 
 ```bash
 your_port=6010
 model="BAAI/bge-base-en-v1.5"
 docker run -p $your_port:80 -v ./data:/data --name tei_server -e http_proxy=$http_proxy -e https_proxy=$https_proxy --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.5 --model-id $model
-```
-
-Setup environment variables:
-
-```bash
 export TEI_EMBEDDING_ENDPOINT="http://localhost:$your_port"
-export MILVUS_HOST=${your_host_ip}
 ```
 
-### 1.5 Start Document Preparation Microservice for Milvus with Python Script
-
-Start document preparation microservice for Milvus with below command.
-
-```bash
-python prepare_doc_milvus.py
-```
-
-## 🚀2. Start Microservice with Docker (Option 2)
-
-### 2.1 Start Milvus Server
-
-Please refer to this [readme](../../third_parties/milvus/src/README.md).
-
-### 2.2 Build Docker Image
+### 1.4 Build Docker Image
 
 ```bash
 cd ../../..
 # build dataprep milvus docker image
 docker build -t opea/dataprep:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy --build-arg no_proxy=$no_proxy -f comps/dataprep/src/Dockerfile .
 ```
 
-### 2.3 Setup Environment Variables
-
-```bash
-export TEI_EMBEDDING_ENDPOINT="http://localhost:$your_port"
-export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
-export EMBEDDING_MODEL_ID=${your_embedding_model_id}
-export MILVUS_HOST=${your_host_ip}
-```
-
-### 2.3 Run Docker with CLI (Option A)
+### 1.5 Run Docker with CLI (Option A)
 
 ```bash
 docker run -d --name="dataprep-milvus-server" -p 6010:6010 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy -e TEI_EMBEDDING_ENDPOINT=${TEI_EMBEDDING_ENDPOINT} -e MILVUS_HOST=${MILVUS_HOST} -e HUGGINGFACEHUB_API_TOKEN=${HUGGINGFACEHUB_API_TOKEN} -e DATAPREP_COMPONENT_NAME="OPEA_DATAPREP_MILVUS" opea/dataprep:latest
 ```
 
-### 2.4 Run with Docker Compose (Option B)
+### 1.5 Run with Docker Compose (Option B)
 
 ```bash
 mkdir model
@@ -94,9 +56,9 @@ cd ../
 docker compose -f compose_milvus.yaml up -d
 ```
 
-## 🚀3. Consume Microservice
+## 🚀2. Consume Microservice
 
-### 3.1 Consume Upload API
+### 2.1 Consume Upload API
 
 Once document preparation microservice for Milvus is started, user can use below command to invoke the microservice to convert the document to embedding and save to the database.
 
@@ -179,7 +141,7 @@ Note: If you specify "table_strategy=llm", You should first start TGI Service, p
 curl -X POST -H "Content-Type: application/json" -d '{"path":"/home/user/doc/your_document_name","process_table":true,"table_strategy":"hq"}' http://localhost:6010/v1/dataprep/ingest
 ```
 
-### 3.2 Consume get API
+### 2.2 Consume get API
 
 To get uploaded file structures, use the following command:
 
@@ -208,7 +170,7 @@ Then you will get the response JSON like this:
 ]
 ```
 
-### 3.3 Consume delete API
+### 2.3 Consume delete API
 
 To delete uploaded file/link, use the following command.
 
@@ -234,7 +196,7 @@ curl -X POST \
 http://localhost:6010/v1/dataprep/delete
 ```
 
-## 🚀4. Troubleshooting
+## 🚀3. Troubleshooting
 
 1. If you get errors from TEI Embedding Endpoint like `cannot find this task, maybe it has expired` while uploading files, try to reduce the `chunk_size` in the curl command like below (the default chunk_size=1500).
 

comps/dataprep/src/README_multimodal.md

Lines changed: 14 additions & 62 deletions
@@ -7,61 +7,13 @@ This `dataprep` microservice accepts the following from the user and ingests the
 - Audio (wav files)
 - PDFs (with text and images)
 
-## 🚀1. Start Microservice with Python(Option 1)
+## 🚀1. Start Microservice with Docker
 
-### 1.1 Install Requirements
-
-```bash
-# Install ffmpeg static build
-wget https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz
-mkdir ffmpeg-git-amd64-static
-tar -xvf ffmpeg-git-amd64-static.tar.xz -C ffmpeg-git-amd64-static --strip-components 1
-export PATH=$(pwd)/ffmpeg-git-amd64-static:$PATH
-cp $(pwd)/ffmpeg-git-amd64-static/ffmpeg /usr/local/bin/
-
-pip install -r requirements.txt
-```
-
-### 1.2 Start Redis Stack Server
+### 1.1 Start Redis Stack Server
 
 Please refer to this [readme](../../third_parties/redis/src/README.md).
 
-### 1.3 Setup Environment Variables
-
-```bash
-export your_ip=$(hostname -I | awk '{print $1}')
-export REDIS_URL="redis://${your_ip}:6379"
-export INDEX_NAME=${your_redis_index_name}
-export PYTHONPATH=${path_to_comps}
-```
-
-### 1.4 Start LVM Microservice (Optional)
-
-This is required only if you are going to consume the _generate_captions_ API of this microservice as in [Section 4.3](#43-consume-generate_captions-api).
-
-Please refer to this [readme](../../lvms/src/README.md) to start the LVM microservice.
-After LVM is up, set up environment variables.
-
-```bash
-export your_ip=$(hostname -I | awk '{print $1}')
-export LVM_ENDPOINT="http://${your_ip}:9399/v1/lvm"
-```
-
-### 1.5 Start Data Preparation Microservice for Redis with Python Script
-
-Start document preparation microservice for Redis with below command.
-
-```bash
-python prepare_videodoc_redis.py
-```
-
-## 🚀2. Start Microservice with Docker (Option 2)
-
-### 2.1 Start Redis Stack Server
-
-Please refer to this [readme](../../third_parties/redis/src/README.md).
-
-### 2.2 Start LVM Microservice (Optional)
+### 1.2 Start LVM Microservice (Optional)
 
 This is required only if you are going to consume the _generate_captions_ API of this microservice as described [here](#43-consume-generate_captions-api).
 
@@ -73,7 +25,7 @@ export your_ip=$(hostname -I | awk '{print $1}')
 export LVM_ENDPOINT="http://${your_ip}:9399/v1/lvm"
 ```
 
-### 2.3 Setup Environment Variables
+### 1.3 Setup Environment Variables
 
 ```bash
 export your_ip=$(hostname -I | awk '{print $1}')
@@ -84,39 +36,39 @@ export INDEX_NAME=${your_redis_index_name}
 export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
 ```
 
-### 2.4 Build Docker Image
+### 1.4 Build Docker Image
 
 ```bash
 cd ../../../../
 docker build -t opea/dataprep:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/src/Dockerfile .
 ```
 
-### 2.5 Run Docker with CLI (Option A)
+### 1.5 Run Docker with CLI (Option A)
 
 ```bash
 docker run -d --name="dataprep-multimodal-redis" -p 6007:5000 --runtime=runc --ipc=host -e no_proxy=$no_proxy -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e REDIS_HOST=$your_ip -e REDIS_URL=$REDIS_URL -e INDEX_NAME=$INDEX_NAME -e LVM_ENDPOINT=$LVM_ENDPOINT -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e MULTIMODAL_DATAPREP=true -e DATAPREP_COMPONENT_NAME="OPEA_DATAPREP_MULTIMODALREDIS" opea/dataprep-multimodal-redis:latest
 ```
 
-### 2.6 Run with Docker Compose (Option B - deprecated, will move to genAIExample in future)
+### 1.6 Run with Docker Compose (Option B - deprecated, will move to genAIExample in future)
 
 ```bash
 cd comps/dataprep/multimodal/redis/langchain
 docker compose -f compose_redis_multimodal.yaml up -d
 ```
 
-## 🚀3. Status Microservice
+## 🚀2. Status Microservice
 
 ```bash
 docker container logs -f dataprep-multimodal-redis
 ```
 
-## 🚀4. Consume Microservice
+## 🚀3. Consume Microservice
 
 Once this dataprep microservice is started, user can use the below commands to invoke the microservice to convert images, videos, text, and PDF files to embeddings and save to the Redis vector store.
 
 This microservice provides 3 different ways for users to ingest files into Redis vector store corresponding to the 3 use cases.
 
-### 4.1 Consume _ingest_ API
+### 3.1 Consume _ingest_ API
 
 **Use case:** This API is used for videos accompanied by transcript files (`.vtt` format), images accompanied by text caption files (`.txt` format), and PDF files containing a mix of text and images.
 
@@ -163,7 +115,7 @@ curl -X POST \
 http://localhost:6007/v1/dataprep/ingest
 ```
 
-### 4.2 Consume _generate_transcripts_ API
+### 3.2 Consume _generate_transcripts_ API
 
 **Use case:** This API should be used when a video has meaningful audio or recognizable speech but its transcript file is not available, or for audio files with speech.
 
@@ -189,7 +141,7 @@ curl -X POST \
 http://localhost:6007/v1/dataprep/generate_transcripts
 ```
 
-### 4.3 Consume _generate_captions_ API
+### 3.3 Consume _generate_captions_ API
 
 **Use case:** This API should be used when uploading an image, or when uploading a video that does not have meaningful audio or does not have audio.
 
@@ -223,7 +175,7 @@ curl -X POST \
 http://localhost:6007/v1/dataprep/generate_captions
 ```
 
-### 4.4 Consume get API
+### 3.4 Consume get API
 
 To get names of uploaded files, use the following command.
 
@@ -233,7 +185,7 @@ curl -X POST \
 http://localhost:6007/v1/dataprep/get
 ```
 
-### 4.5 Consume delete API
+### 3.5 Consume delete API
 
 To delete uploaded files and clear the database, use the following command.
 