Commit 838d16d

Changes to checkin text2graph microservice (opea-project#1357)
Signed-off-by: Raghava, Sharath <[email protected]>
1 parent 60c20b5 commit 838d16d

File tree

13 files changed: +696 −0 lines changed

Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+# this file should be run in the root of the repo
+services:
+  text2graph:
+    build:
+      dockerfile: comps/text2graph/src/Dockerfile
+    image: ${REGISTRY:-opea}/text2graph:${TAG:-latest}

comps/cores/mega/constants.py

Lines changed: 1 addition & 0 deletions

@@ -34,6 +34,7 @@ class ServiceType(Enum):
     ANIMATION = 17
     IMAGE2IMAGE = 18
     TEXT2SQL = 19
+    TEXT2GRAPH = 20


 class MegaServiceEndpoint(Enum):

comps/text2graph/deployment/docker_compose/README.md

Whitespace-only changes.

Lines changed: 29 additions & 0 deletions

@@ -0,0 +1,29 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+services:
+  text2graph:
+    image: opea/text2graph:latest
+    container_name: text2graph
+    ports:
+      - ${TEXT2GRAPH_PORT:-8090}:8090
+    environment:
+      - no_proxy=${no_proxy}
+      - https_proxy=${https_proxy}
+      - http_proxy=${http_proxy}
+      - LLM_MODEL_ID=${LLM_MODEL_ID:-"Babelscape/rebel-large"}
+      - HUGGINGFACEHUB_API_TOKEN=${HUGGINGFACEHUB_API_TOKEN}
+    ipc: host
+    restart: always
+
+  text2graph-gaudi:
+    image: opea/text2graph:${TAG:-latest}
+    container_name: text2graph-gaudi-server
+    ports:
+      - ${TEXT2GRAPH_PORT:-9090}:8080
+    environment:
+      - TGI_LLM_ENDPOINT=${TGI_LLM_ENDPOINT:-8080}:8080
+
+networks:
+  default:
+    driver: bridge
comps/text2graph/src/Dockerfile

Lines changed: 47 additions & 0 deletions

@@ -0,0 +1,47 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+FROM ubuntu:22.04
+
+WORKDIR /home/graph_extract
+
+FROM python:3.11-slim
+ENV LANG=C.UTF-8
+ARG ARCH=cpu
+
+RUN apt-get update -y && apt-get install vim -y && apt-get install -y --no-install-recommends --fix-missing \
+    build-essential
+
+RUN useradd -m -s /bin/bash user && \
+    mkdir -p /home/user && \
+    chown -R user /home/user/
+
+COPY comps /home/user/comps
+
+RUN pip install --no-cache-dir --upgrade pip setuptools && \
+    if [ ${ARCH} = "cpu" ]; then \
+        pip install --no-cache-dir --extra-index-url https://download.pytorch.org/whl/cpu -r /home/user/comps/text2graph/src/requirements.txt; \
+    else \
+        pip install --no-cache-dir -r /home/user/comps/text2graph/src/requirements.txt; \
+    fi
+
+ENV https_proxy=${https_proxy}
+ENV http_proxy=${http_proxy}
+ENV no_proxy=${no_proxy}
+ENV LLM_ID=${LLM_ID:-"Babelscape/rebel-large"}
+ENV SPAN_LENGTH=${SPAN_LENGTH:-"1024"}
+ENV OVERLAP=${OVERLAP:-"100"}
+ENV MAX_LENGTH=${MAX_NEW_TOKENS:-"256"}
+ENV HUGGINGFACEHUB_API_TOKEN=${HF_TOKEN}
+ENV HF_TOKEN=${HF_TOKEN}
+ENV LLM_MODEL_ID=${LLM_ID}
+ENV TGI_PORT=8008
+ENV PYTHONPATH="/home/user/":$PYTHONPATH
+
+USER user
+
+WORKDIR /home/user/comps/text2graph/src/
+
+RUN bash -c 'source /home/user/comps/text2graph/src/setup_service_env.sh'
+
+ENTRYPOINT ["python", "opea_text2graph_microservice.py"]

comps/text2graph/src/README.md

Lines changed: 118 additions & 0 deletions

# Text to graph triplet extractor

Creating graphs from text means converting unstructured text into structured data, which is challenging. The task has gained significant traction with the advent of Large Language Models (LLMs), bringing it into the mainstream. There are two main approaches to extracting graph triplets, depending on the LLM architecture: decoder-only and encoder-decoder models.

## Decoder Models

Decoder-only models are faster during inference because they skip the encoding step. This makes them ideal for tasks where the input-output mapping is simple or where multitasking is required, and suitable for generating outputs from prompts or when computational efficiency is a priority. In certain cases, however, decoder-only models struggle with tasks requiring deep contextual understanding or with highly heterogeneous input-output structures.

## Encoder-Decoder Models

This microservice takes an encoder-decoder approach to graph triplet extraction. Models like REBEL are based on BART-family architectures and fine-tuned for relation extraction and classification tasks. This approach handles complex relations and varied data sources better. Encoder-decoder models often achieve high performance on benchmarks due to their ability to encode contextual information effectively, and they are well suited to tasks that require detailed parsing of text into structured formats, such as knowledge graph construction from unstructured data.

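As a concrete illustration of what such a model produces, here is a minimal sketch of triplet extraction with REBEL, adapted from the model card's decoding scheme. It is independent of this microservice's own wrapper code, and the simplified parser assumes one object per subject.

```python
from transformers import pipeline

# Load REBEL as a seq2seq generation pipeline.
extractor = pipeline(
    "text2text-generation",
    model="Babelscape/rebel-large",
    tokenizer="Babelscape/rebel-large",
)

def decode_triplets(generated: str):
    """Parse REBEL's linearized <triplet>/<subj>/<obj> markup into
    (head, relation, tail) tuples. Simplified from the model card."""
    triplets, head, tail, rel, current = [], "", "", "", None
    cleaned = generated.replace("<s>", "").replace("</s>", "").replace("<pad>", "")
    for token in cleaned.split():
        if token == "<triplet>":  # a new triplet starts; flush the previous one
            if head and rel and tail:
                triplets.append((head.strip(), rel.strip(), tail.strip()))
            head, tail, rel, current = "", "", "", "head"
        elif token == "<subj>":   # tokens after <subj> form the tail entity
            current = "tail"
        elif token == "<obj>":    # tokens after <obj> form the relation label
            current = "rel"
        elif current == "head":
            head += " " + token
        elif current == "tail":
            tail += " " + token
        elif current == "rel":
            rel += " " + token
    if head and rel and tail:
        triplets.append((head.strip(), rel.strip(), tail.strip()))
    return triplets

# Decode the generated ids ourselves so the special tokens are kept.
ids = extractor("Punta Cana is a resort town in the Dominican Republic.",
                return_tensors=True, return_text=False)[0]["generated_token_ids"]
print(decode_triplets(extractor.tokenizer.batch_decode([ids])[0]))
```
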
# Features

Given input text from a document or one or more strings, the service identifies graph triplets and nodes. Subsequent processing still needs to be done, such as entity disambiguation to merge duplicate entities, before generating Cypher code; a sketch of that step follows.

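A minimal sketch of such a post-processing step, assuming triplets are plain (head, relation, tail) tuples; this helper is hypothetical and not part of the service:

```python
def normalize(entity: str) -> str:
    # Collapse whitespace and case so trivially duplicated entities match.
    return " ".join(entity.lower().split())

def merge_duplicates(triplets):
    # Deduplicate triplets whose entities differ only in surface form.
    merged = {(normalize(h), r, normalize(t)) for h, r, t in triplets}
    return sorted(merged)

print(merge_duplicates([
    ("Punta Cana", "country", "Dominican Republic"),
    ("punta cana", "country", "Dominican  Republic"),
]))
```
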
## Implementation

The text-to-graph microservice extracts triplets from unstructured text supplied as a document, text file, or string. The service is hosted in a Docker container, and extraction requires both the parsing logic and a hosted LLM. The LLM is served with TGI on Gaudi, and runs natively on CPUs.

# 🚀1. Start Microservice with Docker

Option 1: running on CPUs.

## Install Requirements

```bash
pip install -r requirements.txt
```

## Environment variables: configure LLM parameters based on the selected model

```bash
export LLM_ID=${LLM_ID:-"Babelscape/rebel-large"}
export SPAN_LENGTH=${SPAN_LENGTH:-"1024"}
export OVERLAP=${OVERLAP:-"100"}
export MAX_LENGTH=${MAX_NEW_TOKENS:-"256"}
export HUGGINGFACEHUB_API_TOKEN=""
export LLM_MODEL_ID=${LLM_ID}
export TGI_PORT=8008
```
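
SPAN_LENGTH and OVERLAP control how long inputs are split into overlapping spans so that relations straddling a span boundary are not lost. A hypothetical illustration of that kind of chunking, not the service's actual code:

```python
def chunk(text: str, span_length: int = 1024, overlap: int = 100):
    # Slide a window of span_length characters, stepping forward by
    # span_length - overlap so consecutive spans share context.
    spans, start, step = [], 0, span_length - overlap
    while start < len(text):
        spans.append(text[start:start + span_length])
        start += step
    return spans

spans = chunk("a long document " * 500)
print(len(spans), len(spans[0]))
```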

## Echo env variables

```bash
echo "Extractor details"
echo LLM_ID=${LLM_ID}
echo SPAN_LENGTH=${SPAN_LENGTH}
echo OVERLAP=${OVERLAP}
echo MAX_LENGTH=${MAX_LENGTH}
```

### Start TGI Service

```bash
export HUGGINGFACEHUB_API_TOKEN=${HUGGINGFACEHUB_API_TOKEN}
export LLM_MODEL_ID="mistralai/Mistral-7B-Instruct-v0.3"
export TGI_PORT=8008

docker run -d --name="text2graph-tgi-endpoint" --ipc=host -p $TGI_PORT:80 -v ./data:/data --shm-size 1g -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} -e model=${LLM_MODEL_ID} ghcr.io/huggingface/text-generation-inference:2.1.0 --model-id $LLM_MODEL_ID
```

### Verify the TGI Service

```bash
export your_ip=$(hostname -I | awk '{print $1}')
curl http://${your_ip}:${TGI_PORT}/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
  -H 'Content-Type: application/json'
```

### Setup Environment Variables to host TGI

```bash
export TGI_LLM_ENDPOINT="http://${your_ip}:${TGI_PORT}"
```

### Start Text2Graph Microservice with Docker

Command to build the text2graph microservice:

```bash
docker build -f Dockerfile -t user_name:graph_extractor ../../../
```

Command to launch the text2graph microservice:

```bash
docker run -i -t --net=host --ipc=host -p 8090 user_name:graph_extractor
```

This launches the text2graph microservice container and runs it interactively.

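Once the container is up, you can send it a quick request. The route and payload below are assumptions for illustration; confirm the actual schema against the Swagger page mentioned under validation:

```python
import requests

# Hypothetical smoke test; /v1/text2graph and the payload shape are assumed.
resp = requests.post(
    "http://localhost:8090/v1/text2graph",
    json={"input_text": "Intel Corporation is headquartered in Santa Clara."},
    timeout=120,
)
print(resp.status_code, resp.text)
```
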
# Validation and testing

## Text to triplets

The test directory is under GenAIComps/tests/text2graph/ and contains two files:

- example_from_file.py: an example Python script that downloads a text file and extracts triplets.
- test_text2graph_opea.sh: the main script that checks service health, builds the Docker image, and extracts and generates triplets.

## Check if services are up

### Setup validation process

For setup, use http://localhost:8090/docs for the Swagger documentation, the list of endpoints, and an interactive GUI.
