Commit 7686cfa

Refine Dataprep Milvus MS (#570)
Signed-off-by: letonghan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 2360e5a commit 7686cfa

5 files changed

+1408
-238
lines changed

comps/dataprep/milvus/README.md

Lines changed: 126 additions & 13 deletions
@@ -1,8 +1,8 @@
11
# Dataprep Microservice with Milvus
22

3-
## 🚀Start Microservice with Python
3+
## 🚀1. Start Microservice with Python (Option 1)
44

5-
### Install Requirements
5+
### 1.1 Requirements
66

77
```bash
88
pip install -r requirements.txt
@@ -11,11 +11,11 @@ apt-get install libtesseract-dev -y
1111
apt-get install poppler-utils -y
1212
```
1313

14-
### Start Milvus Server
14+
### 1.2 Start Milvus Server
1515

1616
Please refer to this [readme](../../../vectorstores/langchain/milvus/README.md).
1717

18-
### Setup Environment Variables
18+
### 1.3 Setup Environment Variables
1919

2020
```bash
2121
export no_proxy=${your_no_proxy}
@@ -27,30 +27,76 @@ export COLLECTION_NAME=${your_collection_name}
2727
export MOSEC_EMBEDDING_ENDPOINT=${your_embedding_endpoint}
2828
```
2929

30-
### Start Document Preparation Microservice for Milvus with Python Script
30+
### 1.4 Start Mosec Embedding Service
31+
32+
First, build the Mosec embedding serving Docker image.
33+
34+
```bash
35+
cd ../../..
36+
docker build --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy -t opea/embedding-mosec-endpoint:latest -f comps/embeddings/langchain-mosec/mosec-docker/Dockerfile .
37+
```
38+
39+
Then start the mosec embedding server.
40+
41+
```bash
42+
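your_port=6009 # any free host port; 6010 is used by the dataprep service, and 6009 matches docker-compose-dataprep-milvus.yaml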
43+
docker run -d --name="embedding-mosec-endpoint" -p $your_port:8000 opea/embedding-mosec-endpoint:latest
44+
```
45+
46+
Set up the environment variables:
47+
48+
```bash
49+
export MOSEC_EMBEDDING_ENDPOINT="http://localhost:$your_port"
50+
export MILVUS=${your_host_ip}
51+
```
52+
53+
### 1.5 Start Document Preparation Microservice for Milvus with Python Script
3154

3255
Start the document preparation microservice for Milvus with the command below.
3356

3457
```bash
3558
python prepare_doc_milvus.py
3659
```
3760

38-
## 🚀Start Microservice with Docker
61+
## 🚀2. Start Microservice with Docker (Option 2)
62+
63+
### 2.1 Start Milvus Server
64+
65+
Please refer to this [readme](../../../vectorstores/langchain/milvus/README.md).
3966

40-
### Build Docker Image
67+
### 2.2 Build Docker Image
4168

4269
```bash
43-
cd ../../../../
70+
cd ../../..
71+
# build mosec embedding docker image
72+
docker build --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy -t opea/embedding-mosec-endpoint:latest -f comps/embeddings/langchain-mosec/mosec-docker/Dockerfile .
73+
# build dataprep milvus docker image
4474
docker build -t opea/dataprep-milvus:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy --build-arg no_proxy=$no_proxy -f comps/dataprep/milvus/docker/Dockerfile .
4575
```
4676

47-
### Run Docker with CLI
77+
### 2.3 Setup Environment Variables
78+
79+
```bash
80+
export MOSEC_EMBEDDING_ENDPOINT="http://localhost:$your_port"
81+
export MILVUS=${your_host_ip}
82+
```
83+
84+
### 2.4 Run Docker with CLI (Option A)
85+
86+
```bash
87+
docker run -d --name="dataprep-milvus-server" -p 6010:6010 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy -e MOSEC_EMBEDDING_ENDPOINT=${MOSEC_EMBEDDING_ENDPOINT} -e MILVUS=${MILVUS} opea/dataprep-milvus:latest
88+
```
89+
90+
### 2.5 Run with Docker Compose (Option B)
4891

4992
```bash
50-
docker run -d --name="dataprep-milvus-server" -p 6010:6010 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy -e MOSEC_EMBEDDING_ENDPOINT=${your_embedding_endpoint} -e MILVUS=${your_milvus_host_ip} opea/dataprep-milvus:latest
93+
cd docker
94+
docker compose -f docker-compose-dataprep-milvus.yaml up -d
5195
```
5296

53-
## Invoke Microservice
97+
## 🚀3. Consume Microservice
98+
99+
### 3.1 Consume Upload API
54100

55101
Once the document preparation microservice for Milvus is running, use the command below to convert a document to embeddings and save them to the database.
56102

@@ -65,13 +111,13 @@ curl -X POST \
65111
http://localhost:6010/v1/dataprep
66112
```
67113

68-
You can specify chunk_size and chunk_size by the following commands.
114+
You can specify chunk_size and chunk_overlap with the following command. To avoid large chunks, pass a small chunk_size such as 500, as below (the default is 1500).
69115

70116
```bash
71117
curl -X POST \
72118
-H "Content-Type: multipart/form-data" \
73119
-F "files=@./file.pdf" \
74-
-F "chunk_size=1500" \
120+
-F "chunk_size=500" \
75121
-F "chunk_overlap=100" \
76122
http://localhost:6010/v1/dataprep
77123
```
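
The same upload can be scripted. The following is a minimal Python sketch, not part of this commit, that builds a multipart request equivalent to the curl command above using only the standard library. The helper name `build_upload_request` and the hand-written multipart encoding are assumptions for illustration; the URL and the form field names (`files`, `chunk_size`, `chunk_overlap`) are taken from the curl example.

```python
import uuid
import urllib.request


def build_upload_request(file_name, file_bytes, chunk_size=500, chunk_overlap=100,
                         url="http://localhost:6010/v1/dataprep"):
    """Build (but do not send) a multipart/form-data POST like the curl example.

    This helper is hypothetical; it is not an API of the dataprep microservice.
    """
    boundary = uuid.uuid4().hex
    parts = []
    # Plain form fields, mirroring -F "chunk_size=..." and -F "chunk_overlap=...".
    for name, value in (("chunk_size", chunk_size), ("chunk_overlap", chunk_overlap)):
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    # The file part, mirroring -F "files=@./file.pdf".
    file_header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="files"; '
        f'filename="{file_name}"\r\nContent-Type: application/octet-stream\r\n\r\n'
    )
    parts.append(file_header.encode() + file_bytes + b"\r\n")
    # Closing boundary marks the end of the multipart body.
    parts.append(f"--{boundary}--\r\n".encode())
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=b"".join(parts), headers=headers, method="POST")
```

To send the request, pass the returned object to `urllib.request.urlopen` while the dataprep service is running on port 6010.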
@@ -132,3 +178,70 @@ Note: If you specify "table_strategy=llm", You should first start TGI Service, p
132178
```bash
133179
curl -X POST -H "Content-Type: application/json" -d '{"path":"/home/user/doc/your_document_name","process_table":true,"table_strategy":"hq"}' http://localhost:6010/v1/dataprep
134180
```
181+
182+
### 3.2 Consume get_file API
183+
184+
To get the structure of the uploaded files, use the following command:
185+
186+
```bash
187+
curl -X POST \
188+
-H "Content-Type: application/json" \
189+
http://localhost:6010/v1/dataprep/get_file
190+
```
191+
192+
The response is JSON like this:
193+
194+
```json
195+
[
196+
{
197+
"name": "uploaded_file_1.txt",
198+
"id": "uploaded_file_1.txt",
199+
"type": "File",
200+
"parent": ""
201+
},
202+
{
203+
"name": "uploaded_file_2.txt",
204+
"id": "uploaded_file_2.txt",
205+
"type": "File",
206+
"parent": ""
207+
}
208+
]
209+
```
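
A client usually needs only the `id` values from this response, since those are what the delete API expects. The sketch below is a hedged Python illustration: the function name `uploaded_file_ids` is hypothetical, and filtering on `"type": "File"` assumes that other entry types may appear, which the example response does not confirm.

```python
import json


def uploaded_file_ids(response_text):
    """Return the `id` of every entry whose `type` is "File".

    Hypothetical helper; the field names come from the example response above.
    """
    entries = json.loads(response_text)
    return [entry["id"] for entry in entries if entry.get("type") == "File"]


# Sample payload copied from the example response above.
sample = """
[
  {"name": "uploaded_file_1.txt", "id": "uploaded_file_1.txt", "type": "File", "parent": ""},
  {"name": "uploaded_file_2.txt", "id": "uploaded_file_2.txt", "type": "File", "parent": ""}
]
"""
print(uploaded_file_ids(sample))  # ['uploaded_file_1.txt', 'uploaded_file_2.txt']
```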
210+
211+
### 3.3 Consume delete_file API
212+
213+
To delete an uploaded file or link, use the following command.
214+
215+
The `file_path` here should be the `id` returned by the `/v1/dataprep/get_file` API.
216+
217+
```bash
218+
# delete link
219+
curl -X POST \
220+
-H "Content-Type: application/json" \
221+
-d '{"file_path": "https://www.ces.tech/.txt"}' \
222+
http://localhost:6010/v1/dataprep/delete_file
223+
224+
# delete file
225+
curl -X POST \
226+
-H "Content-Type: application/json" \
227+
-d '{"file_path": "uploaded_file_1.txt"}' \
228+
http://localhost:6010/v1/dataprep/delete_file
229+
230+
# delete all files and links, will drop the entire db collection
231+
curl -X POST \
232+
-H "Content-Type: application/json" \
233+
-d '{"file_path": "all"}' \
234+
http://localhost:6010/v1/dataprep/delete_file
235+
```
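
The three curl calls differ only in the `file_path` value, so a script can share one helper. This is a minimal Python sketch under that assumption; `build_delete_request` is a hypothetical name, and the URL and JSON body shape are copied from the curl examples above.

```python
import json
import urllib.request


def build_delete_request(file_path, url="http://localhost:6010/v1/dataprep/delete_file"):
    """Build (but do not send) the JSON POST used by the delete_file examples.

    `file_path` is an id from get_file, a link id, or "all" to drop the whole collection.
    """
    body = json.dumps({"file_path": file_path}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Because `"all"` drops the entire collection, a caller may want to require explicit confirmation before building that request.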
236+
237+
## 🚀4. Troubleshooting
238+
239+
1. If the Mosec Embedding Endpoint returns errors such as `cannot find this task, maybe it has expired` while uploading files, try reducing the `chunk_size` in the curl command as shown below (the default is chunk_size=1500).
240+
241+
```bash
242+
curl -X POST \
243+
-H "Content-Type: multipart/form-data" \
244+
-F "files=@./file.pdf" \
245+
-F "chunk_size=500" \
246+
http://localhost:6010/v1/dataprep
247+
```
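
The advice above can be automated by retrying with progressively smaller chunks. The following Python sketch is only one possible policy and is not part of the microservice: the halving strategy, the `min_chunk_size` floor, and the `upload` callable (any function that takes a chunk size and raises on failure) are all assumptions made for illustration.

```python
def upload_with_shrinking_chunks(upload, chunk_size=1500, min_chunk_size=100):
    """Call upload(chunk_size), halving chunk_size after each failure.

    `upload` is a hypothetical callable that raises an exception on failure.
    Returns a (chunk_size_used, upload_result) tuple on the first success.
    """
    last_error = None
    while chunk_size >= min_chunk_size:
        try:
            return chunk_size, upload(chunk_size)
        except Exception as error:  # a real client would catch a narrower error type
            last_error = error
            chunk_size //= 2
    raise RuntimeError("upload failed for every chunk_size tried") from last_error
```

Starting from the default of 1500, this tries 1500, 750, 375, and 187 before giving up at the assumed floor of 100.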
Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
1+
# Copyright (C) 2024 Intel Corporation
2+
# SPDX-License-Identifier: Apache-2.0
3+
4+
version: "3"
5+
services:
6+
  etcd:
7+
    container_name: milvus-etcd
8+
    image: quay.io/coreos/etcd:v3.5.5
9+
    environment:
10+
      - ETCD_AUTO_COMPACTION_MODE=revision
11+
      - ETCD_AUTO_COMPACTION_RETENTION=1000
12+
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
13+
      - ETCD_SNAPSHOT_COUNT=50000
14+
    volumes:
15+
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
16+
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
17+
    healthcheck:
18+
      test: ["CMD", "etcdctl", "endpoint", "health"]
19+
      interval: 30s
20+
      timeout: 20s
21+
      retries: 3
22+
23+
  minio:
24+
    container_name: milvus-minio
25+
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
26+
    environment:
27+
      MINIO_ACCESS_KEY: minioadmin
28+
      MINIO_SECRET_KEY: minioadmin
29+
    ports:
30+
      - "9001:9001"
31+
      - "9000:9000"
32+
    volumes:
33+
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
34+
    command: minio server /minio_data --console-address ":9001"
35+
    healthcheck:
36+
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
37+
      interval: 30s
38+
      timeout: 20s
39+
      retries: 3
40+
41+
  standalone:
42+
    container_name: milvus-standalone
43+
    image: milvusdb/milvus:v2.4.6
44+
    command: ["milvus", "run", "standalone"]
45+
    security_opt:
46+
      - seccomp:unconfined
47+
    environment:
48+
      ETCD_ENDPOINTS: etcd:2379
49+
      MINIO_ADDRESS: minio:9000
50+
    volumes:
51+
      - ${DOCKER_VOLUME_DIRECTORY:-.}/milvus.yaml:/milvus/configs/milvus.yaml
52+
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
53+
    healthcheck:
54+
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
55+
      interval: 30s
56+
      start_period: 90s
57+
      timeout: 20s
58+
      retries: 3
59+
    ports:
60+
      - "19530:19530"
61+
      - "9091:9091"
62+
    depends_on:
63+
      - "etcd"
64+
      - "minio"
65+
66+
  mosec-embedding:
67+
    image: opea/embedding-mosec-endpoint:latest
68+
    container_name: embedding-mosec-server
69+
    ports:
70+
      - "6009:8000"
71+
    ipc: host
72+
    environment:
73+
      http_proxy: ${http_proxy}
74+
      https_proxy: ${https_proxy}
75+
    restart: unless-stopped
76+
77+
  dataprep-milvus:
78+
    image: opea/dataprep-milvus:latest
79+
    container_name: dataprep-milvus-server
80+
    ports:
81+
      - "6010:6010"
82+
    ipc: host
83+
    environment:
84+
      no_proxy: ${no_proxy}
85+
      http_proxy: ${http_proxy}
86+
      https_proxy: ${https_proxy}
87+
      MOSEC_EMBEDDING_ENDPOINT: ${MOSEC_EMBEDDING_ENDPOINT}
88+
      MILVUS: ${MILVUS}
89+
    restart: unless-stopped
90+
91+
networks:
92+
  default:
93+
    driver: bridge
