Commit 29fe569

XuhuiRen, pre-commit-ci[bot], and lvliang-intel authored
Enable GraphRAG with Neo4J (#682)
* add graphrag for neo4j Signed-off-by: XuhuiRen <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add Signed-off-by: XuhuiRen <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add Signed-off-by: XuhuiRen <[email protected]>
* add Signed-off-by: XuhuiRen <[email protected]>
* fix ut Signed-off-by: XuhuiRen <[email protected]>
* fix Signed-off-by: XuhuiRen <[email protected]>
* add Signed-off-by: XuhuiRen <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Update retriever_neo4j.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add Signed-off-by: XuhuiRen <[email protected]>
* Update test_retrievers_neo4j_langchain.sh
* add Signed-off-by: XuhuiRen <[email protected]>
* Update test_retrievers_neo4j_langchain.sh
* Update test_retrievers_neo4j_langchain.sh
* Update test_retrievers_neo4j_langchain.sh
* add docker Signed-off-by: XuhuiRen <[email protected]>
* Update retrievers-compose-cd.yaml
* Update test_retrievers_neo4j_langchain.sh
* Update config.py
* Update test_retrievers_neo4j_langchain.sh
* Update test_retrievers_neo4j_langchain.sh
* Update config.py
* Update test_retrievers_neo4j_langchain.sh
* Update requirements.txt
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Update requirements.txt
* Update requirements.txt
* Update requirements.txt

---------

Signed-off-by: XuhuiRen <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: lvliang-intel <[email protected]>
1 parent 18092f3 commit 29fe569

16 files changed, +850 -0 lines changed


.github/workflows/docker/compose/dataprep-compose-cd.yaml

Lines changed: 4 additions & 0 deletions
@@ -27,3 +27,7 @@ services:
     build:
       dockerfile: comps/dataprep/vdms/langchain/Dockerfile
     image: ${REGISTRY:-opea}/dataprep-vdms:${TAG:-latest}
+  dataprep-neo4j:
+    build:
+      dockerfile: comps/dataprep/neo4j/langchain/Dockerfile
+    image: ${REGISTRY:-opea}/dataprep-neo4j:${TAG:-latest}

.github/workflows/docker/compose/retrievers-compose-cd.yaml

Lines changed: 4 additions & 0 deletions
@@ -27,3 +27,7 @@ services:
     build:
       dockerfile: comps/retrievers/multimodal/redis/langchain/Dockerfile
     image: ${REGISTRY:-opea}/multimodal-retriever-redis:${TAG:-latest}
+  retriever-neo4j:
+    build:
+      dockerfile: comps/retrievers/neo4j/langchain/Dockerfile
+    image: ${REGISTRY:-opea}/retriever-neo4j:${TAG:-latest}
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM python:3.11-slim

ENV LANG=C.UTF-8

ARG ARCH="cpu"

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
    build-essential \
    default-jre \
    libgl1-mesa-glx \
    libjemalloc-dev

RUN useradd -m -s /bin/bash user && \
    mkdir -p /home/user && \
    chown -R user /home/user/

USER user

COPY comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip setuptools && \
    if [ ${ARCH} = "cpu" ]; then pip install --no-cache-dir torch torchvision --index-url https://download.pytorch.org/whl/cpu; fi && \
    pip install --no-cache-dir -r /home/user/comps/dataprep/neo4j/langchain/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

USER root

# Create the upload directory under the neo4j component and hand it to the non-root user.
RUN mkdir -p /home/user/comps/dataprep/neo4j/langchain/uploaded_files && chown -R user /home/user/comps/dataprep/neo4j/langchain/uploaded_files

USER user

WORKDIR /home/user/comps/dataprep/neo4j/langchain

ENTRYPOINT ["python", "prepare_doc_neo4j.py"]
Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
# Dataprep Microservice with Neo4J

## 🚀Start Microservice with Python

### Install Requirements

```bash
pip install -r requirements.txt
apt-get install libtesseract-dev -y
apt-get install poppler-utils -y
```

### Start Neo4J Server

To launch Neo4j locally, first ensure that Docker is installed. Then launch the database with the following command.

```bash
docker run \
    -p 7474:7474 -p 7687:7687 \
    -v $PWD/data:/data -v $PWD/plugins:/plugins \
    --name neo4j-apoc \
    -d \
    -e NEO4J_AUTH=neo4j/password \
    -e NEO4J_PLUGINS=\[\"apoc\"\] \
    neo4j:latest
```
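The container can take a few seconds before it accepts connections. The following is a minimal sketch, using only the Python standard library, of waiting for the Bolt port (7687 in the command above) before starting the microservice. `wait_for_port` is a hypothetical helper that is not part of this commit, and an open TCP port only suggests the database is up; it does not confirm that Neo4j is ready to serve queries.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(interval)
    return False


# Usage (assumes the docker run command above was executed on this machine):
# if not wait_for_port("localhost", 7687, timeout=60.0):
#     raise SystemExit("Neo4j Bolt port is not reachable")
```

Polling the port keeps the sketch dependency-free; a stricter check would open a driver session and run a trivial query.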

### Setup Environment Variables

```bash
export no_proxy=${your_no_proxy}
export http_proxy=${your_http_proxy}
export https_proxy=${your_http_proxy}
export NEO4J_URI=${your_neo4j_url}
export NEO4J_USERNAME=${your_neo4j_username}
export NEO4J_PASSWORD=${your_neo4j_password}
export PYTHONPATH=${path_to_comps}
```

### Start Document Preparation Microservice for Neo4J with Python Script

Start the document preparation microservice for Neo4J with the following command.

```bash
python prepare_doc_neo4j.py
```

## 🚀Start Microservice with Docker

### Build Docker Image

```bash
cd ../../../../
docker build -t opea/dataprep-neo4j:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/neo4j/langchain/Dockerfile .
```

### Run Docker with CLI

```bash
docker run -d --name="dataprep-neo4j-server" -p 6007:6007 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy opea/dataprep-neo4j:latest
```

### Setup Environment Variables

```bash
export no_proxy=${your_no_proxy}
export http_proxy=${your_http_proxy}
export https_proxy=${your_http_proxy}
export NEO4J_URI=${your_neo4j_url}
export NEO4J_USERNAME=${your_neo4j_username}
export NEO4J_PASSWORD=${your_neo4j_password}
```

### Run Docker with Docker Compose

```bash
cd comps/dataprep/neo4j/langchain
docker compose -f docker-compose-dataprep-neo4j.yaml up -d
```

## Invoke Microservice

Once the document preparation microservice for Neo4J is running, invoke it with the following command to convert a document into embeddings and save them to the database.

```bash
curl -X POST \
    -H "Content-Type: multipart/form-data" \
    -F "files=@./file1.txt" \
    http://localhost:6007/v1/dataprep
```

You can specify `chunk_size` and `chunk_overlap` with the following command.

```bash
curl -X POST \
    -H "Content-Type: multipart/form-data" \
    -F "files=@./file1.txt" \
    -F "chunk_size=1500" \
    -F "chunk_overlap=100" \
    http://localhost:6007/v1/dataprep
```
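The same requests can be issued from Python. The sketch below builds the multipart body by hand with the standard library so that it needs no extra packages. `build_dataprep_request` is a hypothetical helper name, the URL is the one used in the curl examples, and the `application/octet-stream` part type is an assumption (curl may label a `.txt` upload differently).

```python
import urllib.request
import uuid


def build_dataprep_request(url, filename, content, chunk_size=None, chunk_overlap=None):
    """Build a multipart/form-data POST like the curl examples (hypothetical helper)."""
    boundary = uuid.uuid4().hex
    body = bytearray()

    # Optional text fields; each is omitted entirely when left as None.
    for name, value in (("chunk_size", chunk_size), ("chunk_overlap", chunk_overlap)):
        if value is not None:
            body += f"--{boundary}\r\n".encode()
            body += f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode()
            body += f"{value}\r\n".encode()

    # The uploaded document, sent under the "files" field as in -F "files=@...".
    body += f"--{boundary}\r\n".encode()
    body += f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'.encode()
    body += b"Content-Type: application/octet-stream\r\n\r\n"
    body += content + b"\r\n"
    body += f"--{boundary}--\r\n".encode()

    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=bytes(body), headers=headers, method="POST")


# Usage (needs the running service, so it is left commented out):
# with open("file1.txt", "rb") as f:
#     req = build_dataprep_request("http://localhost:6007/v1/dataprep", "file1.txt", f.read(), 1500, 100)
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.read().decode())
```

Building the request separately from sending it makes the payload easy to inspect before it reaches the service.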

Table extraction from PDF documents is supported. You can specify `process_table` and `table_strategy` with the following command. `table_strategy` selects how tables are interpreted for table retrieval. Moving from "fast" to "hq" to "llm" trades processing speed for deeper table understanding. The default strategy is "fast".

Note: If you specify `table_strategy=llm`, first start the TGI service (see sections 1.2.1 and 1.3.1 in https://github.com/opea-project/GenAIComps/tree/main/comps/llms/README.md), then `export TGI_LLM_ENDPOINT="http://${your_ip}:8008"`.

To ensure the quality and comprehensiveness of the extracted entities, we recommend using `gpt-4o` as the default model for parsing documents. To enable the OpenAI service, `export OPENAI_API_KEY=xxxx` before using this service.

```bash
curl -X POST \
    -H "Content-Type: multipart/form-data" \
    -F "files=@./your_file.pdf" \
    -F "process_table=true" \
    -F "table_strategy=hq" \
    http://localhost:6007/v1/dataprep
```
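A client can reject an unsupported strategy before uploading a large PDF. The sketch below is a hypothetical client-side helper, not code from this commit. The three allowed values are taken from the README text above, and how the service itself validates them is not shown in this diff.

```python
# Strategies named in the README, ordered from fastest to deepest table understanding.
TABLE_STRATEGIES = ("fast", "hq", "llm")


def table_form_fields(process_table=True, table_strategy="fast"):
    """Return the table-related form fields as strings (hypothetical helper)."""
    if table_strategy not in TABLE_STRATEGIES:
        raise ValueError(
            f"table_strategy must be one of {TABLE_STRATEGIES}, got {table_strategy!r}"
        )
    # Form fields are sent as text, so the boolean is lowered to "true"/"false".
    return {
        "process_table": "true" if process_table else "false",
        "table_strategy": table_strategy,
    }
```

The returned dictionary maps directly onto the `-F "process_table=true"` and `-F "table_strategy=hq"` fields of the curl command.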
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os

# Neo4J configuration
NEO4J_URL = os.getenv("NEO4J_URI", "bolt://localhost:7687")
NEO4J_USERNAME = os.getenv("NEO4J_USERNAME", "neo4j")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD", "test")

# LLM/Embedding endpoints
TGI_LLM_ENDPOINT = os.getenv("TGI_LLM_ENDPOINT", "http://localhost:8080")
TGI_LLM_ENDPOINT_NO_RAG = os.getenv("TGI_LLM_ENDPOINT_NO_RAG", "http://localhost:8081")
TEI_EMBEDDING_ENDPOINT = os.getenv("TEI_ENDPOINT")
OPENAI_KEY = os.getenv("OPENAI_API_KEY")
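Note that the Python names and the environment variable names differ: `NEO4J_URL` is read from `NEO4J_URI`, and `OPENAI_KEY` from `OPENAI_API_KEY`. The following sketch shows one way such settings could be loaded with an early check on the URI scheme. `load_neo4j_settings` is a hypothetical helper that is not part of this commit; the defaults mirror the ones in the file above, and the scheme list is the set of connection schemes used by the Neo4j drivers.

```python
import os
from urllib.parse import urlparse

# Connection URI schemes accepted by the Neo4j drivers.
NEO4J_SCHEMES = {"bolt", "bolt+s", "bolt+ssc", "neo4j", "neo4j+s", "neo4j+ssc"}


def load_neo4j_settings(env=None):
    """Read Neo4j settings from a mapping (os.environ by default) and validate the URI scheme."""
    env = os.environ if env is None else env
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")
    scheme = urlparse(uri).scheme
    if scheme not in NEO4J_SCHEMES:
        # Catches mistakes such as pointing at the HTTP browser port (http://host:7474).
        raise ValueError(f"NEO4J_URI has unsupported scheme {scheme!r}: {uri}")
    return {
        "url": uri,
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", "test"),
    }
```

Passing the environment in as a mapping keeps the function easy to exercise without touching the real process environment.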
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

version: "3"
services:
  neo4j-vector-db:
    image: neo4j/neo4j
    container_name: neo4j-graph-db
    # Neo4j HTTP (7474) and Bolt (7687) ports, as in the README docker run command.
    ports:
      - "7474:7474"
      - "7687:7687"
  tgi_gaudi_service:
    image: ghcr.io/huggingface/tgi-gaudi:2.0.1
    container_name: tgi-service
    ports:
      - "8088:80"
    volumes:
      - "./data:/data"
    shm_size: 1g
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      HF_TOKEN: ${HF_TOKEN}
    command: --model-id ${LLM_MODEL_ID} --auto-truncate --max-input-tokens 1024 --max-total-tokens 2048
  dataprep-neo4j:
    image: opea/gen-ai-comps:dataprep-neo4j-xeon-server
    container_name: dataprep-neo4j-server
    depends_on:
      - neo4j-vector-db
      - tgi_gaudi_service
    ports:
      - "6007:6007"
    ipc: host
    # Variable names match what config.py reads with os.getenv.
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      NEO4J_URI: ${NEO4J_URI}
      NEO4J_USERNAME: ${NEO4J_USERNAME}
      NEO4J_PASSWORD: ${NEO4J_PASSWORD}
      TGI_LLM_ENDPOINT: ${TGI_LLM_ENDPOINT}
      OPENAI_API_KEY: ${OPENAI_API_KEY}
    restart: unless-stopped

networks:
  default:
    driver: bridge
