
Commit 5997431

Merge branch 'main' into stats
2 parents d3a1542 + c70f868 commit 5997431

13 files changed (+262 additions, -38 deletions)

comps/asr/src/integrations/dependency/whisper/Dockerfile.intel_hpu

Lines changed: 3 additions & 1 deletion
@@ -22,7 +22,9 @@ COPY --chown=user:user comps /home/user/comps
 # Install requirements and optimum habana
 RUN pip install --no-cache-dir --upgrade pip && \
     pip install --no-cache-dir -r /home/user/comps/asr/src/requirements.txt && \
-    pip install --no-cache-dir optimum[habana]
+    pip install --no-cache-dir optimum[habana] && \
+    pip install git+https://github.com/huggingface/optimum-habana.git@transformers_future && \
+    pip install --no-cache-dir --upgrade Jinja2

 ENV PYTHONPATH=$PYTHONPATH:/home/users

comps/guardrails/deployment/docker_compose/compose.yaml

Lines changed: 14 additions & 0 deletions
@@ -20,6 +20,19 @@ services:
       HUGGINGFACEHUB_API_TOKEN: ${HF_TOKEN}
     restart: unless-stopped

+  # toxicity detection service
+  guardrails-toxicity-detection-server:
+    image: ${REGISTRY:-opea}/guardrails-toxicity-detection:${TAG:-latest}
+    container_name: guardrails-toxicity-detection-server
+    ports:
+      - "${TOXICITY_DETECTION_PORT:-9090}:9090"
+    ipc: host
+    environment:
+      no_proxy: ${no_proxy}
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+    restart: unless-stopped
+
   # factuality alignment service
   guardrails-factuality-predictionguard-server:
     image: ${REGISTRY:-opea}/guardrails-factuality-predictionguard:${TAG:-latest}
@@ -130,6 +143,7 @@ services:
       http_proxy: ${http_proxy}
       https_proxy: ${https_proxy}
       PREDICTIONGUARD_API_KEY: ${PREDICTIONGUARD_API_KEY}
+      TOXICITY_DETECTION_COMPONENT_NAME: "PREDICTIONGUARD_TOXICITY_DETECTION"
     restart: unless-stopped

 networks:

comps/guardrails/src/toxicity_detection/README.md

Lines changed: 67 additions & 15 deletions
@@ -2,17 +2,52 @@

 ## Introduction

-Toxicity Detection Microservice allows AI Application developers to safeguard user input and LLM output from harmful language in a RAG environment. By leveraging a smaller fine-tuned Transformer model for toxicity classification (e.g. DistilledBERT, RoBERTa, etc.), we maintain a lightweight guardrails microservice without significantly sacrificing performance making it readily deployable on both Intel Gaudi and Xeon.
+Toxicity Detection Microservice allows AI Application developers to safeguard user input and LLM output from harmful language in a RAG environment. By leveraging a smaller fine-tuned Transformer model for toxicity classification (e.g. DistilBERT, RoBERTa, etc.), we maintain a lightweight guardrails microservice without significantly sacrificing performance. This [article](https://huggingface.co/blog/daniel-de-leon/toxic-prompt-roberta) shows how the small language model (SLM) used in this microservice performs as well as, if not better than, some of the most popular decoder LLM guardrails. This microservice uses [`Intel/toxic-prompt-roberta`](https://huggingface.co/Intel/toxic-prompt-roberta), which was fine-tuned on Gaudi2 with the ToxicChat and Jigsaw Unintended Bias datasets.

-This microservice uses [`Intel/toxic-prompt-roberta`](https://huggingface.co/Intel/toxic-prompt-roberta) that was fine-tuned on Gaudi2 with ToxicChat and Jigsaw Unintended Bias datasets.
+In addition to showing promising toxicity detection performance, the table below compares a [locust](https://github.com/locustio/locust) stress test of this microservice against the [LlamaGuard microservice](https://github.com/opea-project/GenAIComps/blob/main/comps/guardrails/src/guardrails/README.md#LlamaGuard). The input included varying lengths of toxic and non-toxic text over 200 seconds. A total of 50 users were added during the first 100 seconds, and the user count stayed constant for the last 100 seconds. Note that the LlamaGuard microservice was deployed on a Gaudi2 card while the toxicity detection microservice was deployed on a 4th generation Xeon.

-Toxicity is defined as rude, disrespectful, or unreasonable language likely to make someone leave a conversation. This can include instances of aggression, bullying, targeted hate speech, or offensive language. For more information on labels see [Jigsaw Toxic Comment Classification Challenge](http://kaggle.com/c/jigsaw-toxic-comment-classification-challenge).
+| Microservice       | Request Count | Median Response Time (ms) | Average Response Time (ms) | Min Response Time (ms) | Max Response Time (ms) | Requests/s |  50% |  95% |
+| :----------------- | ------------: | ------------------------: | -------------------------: | ---------------------: | ---------------------: | ---------: | ---: | ---: |
+| LlamaGuard         |          2099 |                      3300 |                       2718 |                     81 |                   4612 |       10.5 | 3300 | 4600 |
+| Toxicity Detection |          4547 |                       450 |                        796 |                     19 |                  10045 |       22.7 |  450 | 2500 |
+
+This microservice is designed to detect toxicity, which is defined as rude, disrespectful, or unreasonable language likely to make someone leave a conversation. This can include instances of aggression, bullying, targeted hate speech, or offensive language. For more information on labels see the [Jigsaw Toxic Comment Classification Challenge](http://kaggle.com/c/jigsaw-toxic-comment-classification-challenge).
+
+## Environment Setup
+
+### Clone OPEA GenAIComps and Setup Environment
+
+Clone this repository at your desired location and set an environment variable for easy setup and usage throughout the instructions.
+
+```bash
+git clone https://github.com/opea-project/GenAIComps.git
+
+export OPEA_GENAICOMPS_ROOT=$(pwd)/GenAIComps
+```
+
+Set the port that this service will use and the component name:
+
+```bash
+export TOXICITY_DETECTION_PORT=9090
+export TOXICITY_DETECTION_COMPONENT_NAME="OPEA_NATIVE_TOXICITY"
+```
+
+By default, this microservice uses `OPEA_NATIVE_TOXICITY`, which invokes [`Intel/toxic-prompt-roberta`](https://huggingface.co/Intel/toxic-prompt-roberta) locally.
+
+Alternatively, if you are using Prediction Guard, reset the component name environment variable:
+
+```bash
+export TOXICITY_DETECTION_COMPONENT_NAME="PREDICTIONGUARD_TOXICITY_DETECTION"
+```
+
+### Set environment variables

 ## 🚀1. Start Microservice with Python(Option 1)

 ### 1.1 Install Requirements

 ```bash
+cd $OPEA_GENAICOMPS_ROOT/comps/guardrails/src/toxicity_detection
 pip install -r requirements.txt
 ```

@@ -24,27 +59,42 @@ python toxicity_detection.py

 ## 🚀2. Start Microservice with Docker (Option 2)

-### 2.1 Prepare toxicity detection model
+### 2.1 Build Docker Image

-export HUGGINGFACEHUB_API_TOKEN=${HP_TOKEN}
+```bash
+cd $OPEA_GENAICOMPS_ROOT
+docker build \
+  --build-arg https_proxy=$https_proxy \
+  --build-arg http_proxy=$http_proxy \
+  -t opea/guardrails-toxicity-detection:latest \
+  -f comps/guardrails/src/toxicity_detection/Dockerfile .
+```

-### 2.2 Build Docker Image
+### 2.2.a Run Docker with Compose (Option A)

 ```bash
-cd ../../../ # back to GenAIComps/ folder
-docker build -t opea/guardrails-toxicity-detection:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/guardrails/src/toxicity_detection/Dockerfile .
+cd $OPEA_GENAICOMPS_ROOT/comps/guardrails/deployment/docker_compose
+docker compose up -d guardrails-toxicity-detection-server
 ```

-### 2.3 Run Docker Container with Microservice
+### 2.2.b Run Docker with CLI (Option B)

 ```bash
-docker run -d --rm --runtime=runc --name="guardrails-toxicity-detection-endpoint" -p 9091:9091 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e HUGGINGFACEHUB_API_TOKEN=${HUGGINGFACEHUB_API_TOKEN} -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} opea/guardrails-toxicity-detection:latest
+docker run -d --rm \
+  --name="guardrails-toxicity-detection-server" \
+  --runtime=runc \
+  -p ${TOXICITY_DETECTION_PORT}:9090 \
+  --ipc=host \
+  -e http_proxy=$http_proxy \
+  -e https_proxy=$https_proxy \
+  -e no_proxy=${no_proxy} \
+  opea/guardrails-toxicity-detection:latest
 ```

 ## 🚀3. Get Status of Microservice

 ```bash
-docker container logs -f guardrails-toxicity-detection-endpoint
+docker container logs -f guardrails-toxicity-detection-server
 ```

 ## 🚀4. Consume Microservice Pre-LLM/Post-LLM
@@ -54,9 +104,9 @@ Once microservice starts, users can use examples (bash or python) below to apply
 **Bash:**

 ```bash
-curl localhost:9091/v1/toxicity
-  -X POST
-  -d '{"text":"How to poison my neighbor'\''s dog without being caught?"}'
+curl localhost:${TOXICITY_DETECTION_PORT}/v1/toxicity \
+  -X POST \
+  -d '{"text":"How to poison my neighbor'\''s dog without being caught?"}' \
   -H 'Content-Type: application/json'
 ```

@@ -71,9 +121,11 @@ Example Output:
 ```python
 import requests
 import json
+import os

+toxicity_detection_port = os.getenv("TOXICITY_DETECTION_PORT")
 proxies = {"http": ""}
-url = "http://localhost:9091/v1/toxicity"
+url = f"http://localhost:{toxicity_detection_port}/v1/toxicity"
 data = {"text": "How to poison my neighbor'''s dog without being caught?"}


comps/guardrails/src/toxicity_detection/integrations/toxicdetection.py

Lines changed: 48 additions & 0 deletions

@@ -0,0 +1,48 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+import asyncio
+import os
+
+from transformers import pipeline
+
+from comps import CustomLogger, OpeaComponent, OpeaComponentRegistry, ServiceType, TextDoc
+
+logger = CustomLogger("opea_toxicity_native")
+logflag = os.getenv("LOGFLAG", False)
+
+
+@OpeaComponentRegistry.register("OPEA_NATIVE_TOXICITY")
+class OpeaToxicityDetectionNative(OpeaComponent):
+    """A specialized toxicity detection component derived from OpeaComponent."""
+
+    def __init__(self, name: str, description: str, config: dict = None):
+        super().__init__(name, ServiceType.GUARDRAIL.name.lower(), description, config)
+        self.model = os.getenv("TOXICITY_DETECTION_MODEL", "Intel/toxic-prompt-roberta")
+        self.toxicity_pipeline = pipeline("text-classification", model=self.model, tokenizer=self.model)
+        health_status = self.check_health()
+        if not health_status:
+            logger.error("OpeaToxicityDetectionNative health check failed.")
+
+    async def invoke(self, input: TextDoc):
+        """Invokes toxicity detection for the input.
+
+        Args:
+            input (Input TextDoc)
+        """
+        toxic = await asyncio.to_thread(self.toxicity_pipeline, input.text)
+        if toxic[0]["label"].lower() == "toxic":
+            return TextDoc(text="Violated policies: toxicity, please check your input.", downstream_black_list=[".*"])
+        else:
+            return TextDoc(text=input.text)
+
+    def check_health(self) -> bool:
+        """Checks the health of the toxicity detection service.
+
+        Returns:
+            bool: True if the service is reachable and healthy, False otherwise.
+        """
+        if self.toxicity_pipeline:
+            return True
+        else:
+            return False
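
The classification above can be reproduced outside the component with a plain `transformers` pipeline. A minimal sketch, assuming the default `Intel/toxic-prompt-roberta` model can be downloaded from the Hugging Face Hub (the example prompt is made up):

```python
from transformers import pipeline

# Same default model the native component loads (TOXICITY_DETECTION_MODEL).
model_id = "Intel/toxic-prompt-roberta"
toxicity_pipeline = pipeline("text-classification", model=model_id, tokenizer=model_id)

result = toxicity_pipeline("How do I bake sourdough bread?")
# The component treats a top label of "toxic" as a policy violation and
# black-lists downstream services; any other label passes the text through.
if result[0]["label"].lower() == "toxic":
    print("Violated policies: toxicity, please check your input.")
else:
    print("Input passed the toxicity guardrail.")
```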

comps/guardrails/src/toxicity_detection/opea_toxicity_detection_microservice.py

Lines changed: 15 additions & 6 deletions
@@ -3,8 +3,7 @@

 import os
 import time
-
-from integrations.predictionguard import OpeaToxicityDetectionPredictionGuard
+from typing import Union

 from comps import (
     CustomLogger,
@@ -21,7 +20,17 @@
 logger = CustomLogger("opea_toxicity_detection_microservice")
 logflag = os.getenv("LOGFLAG", False)

-toxicity_detection_component_name = os.getenv("TOXICITY_DETECTION_COMPONENT_NAME", "PREDICTIONGUARD_TOXICITY_DETECTION")
+toxicity_detection_port = int(os.getenv("TOXICITY_DETECTION_PORT", 9090))
+toxicity_detection_component_name = os.getenv("TOXICITY_DETECTION_COMPONENT_NAME", "OPEA_NATIVE_TOXICITY")
+
+if toxicity_detection_component_name == "OPEA_NATIVE_TOXICITY":
+    from integrations.toxicdetection import OpeaToxicityDetectionNative
+elif toxicity_detection_component_name == "PREDICTIONGUARD_TOXICITY_DETECTION":
+    from integrations.predictionguard import OpeaToxicityDetectionPredictionGuard
+else:
+    logger.error(f"Component name {toxicity_detection_component_name} is not recognized")
+    exit(1)
+
 # Initialize OpeaComponentLoader
 loader = OpeaComponentLoader(
     toxicity_detection_component_name,
@@ -35,12 +44,12 @@
     service_type=ServiceType.GUARDRAIL,
     endpoint="/v1/toxicity",
     host="0.0.0.0",
-    port=9090,
+    port=toxicity_detection_port,
     input_datatype=TextDoc,
-    output_datatype=ScoreDoc,
+    output_datatype=Union[TextDoc, ScoreDoc],
 )
 @register_statistics(names=["opea_service@toxicity_detection"])
-async def toxicity_guard(input: TextDoc) -> ScoreDoc:
+async def toxicity_guard(input: TextDoc) -> Union[TextDoc, ScoreDoc]:
     start = time.time()

     # Log the input if logging is enabled
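
With the defaults above (native component, port 9090), the running microservice can be exercised from Python. A minimal client sketch, assuming the container from the compose file is up on localhost:

```python
import os

import requests

# TOXICITY_DETECTION_PORT defaults to 9090, matching the microservice above.
port = os.getenv("TOXICITY_DETECTION_PORT", "9090")
url = f"http://localhost:{port}/v1/toxicity"

payload = {"text": "How to poison my neighbor's dog without being caught?"}
response = requests.post(url, json=payload, timeout=30)
print(response.status_code)
print(response.text)  # native component returns a TextDoc-style JSON body
```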

comps/llms/src/text-generation/README_bedrock.md

Lines changed: 4 additions & 2 deletions
@@ -9,6 +9,7 @@
 In order to start Bedrock service, you need to setup the following environment variables first.

 ```bash
+export AWS_REGION=${aws_region}
 export AWS_ACCESS_KEY_ID=${aws_access_key_id}
 export AWS_SECRET_ACCESS_KEY=${aws_secret_access_key}
 ```
@@ -23,13 +24,13 @@ export AWS_SESSION_TOKEN=${aws_session_token}

 ```bash
 cd GenAIComps/
-docker build --no-cache -t opea/bedrock:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/src/text-generation/Dockerfile .
+docker build --no-cache -t opea/llm-textgen:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/src/text-generation/Dockerfile .
 ```

 ## Run the Bedrock Microservice

 ```bash
-docker run -d --name bedrock -p 9009:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e LLM_COMPONENT_NAME="OpeaTextGenBedrock" -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN opea/bedrock:latest
+docker run -d --name bedrock -p 9009:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e LLM_COMPONENT_NAME="OpeaTextGenBedrock" -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN -e BEDROCK_REGION=$AWS_REGION opea/llm-textgen:latest
 ```

 (You can remove `-e AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN` if you are not using an IAM Role)
@@ -42,6 +43,7 @@ curl http://${host_ip}:9009/v1/chat/completions \
   -d '{"model": "us.anthropic.claude-3-5-haiku-20241022-v1:0", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
   -H 'Content-Type: application/json'

+# stream mode
 curl http://${host_ip}:9009/v1/chat/completions \
   -X POST \
   -d '{"model": "us.anthropic.claude-3-5-haiku-20241022-v1:0", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17, "stream": "true"}' \

comps/llms/src/text-generation/integrations/bedrock.py

Lines changed: 14 additions & 6 deletions
@@ -87,13 +87,21 @@ async def invoke(self, input: ChatCompletionRequest):
         if logflag and len(inference_config) > 0:
             logger.info(f"[llm - chat] inference_config: {inference_config}")

-        # Parse messages from HuggingFace TGI format to bedrock messages format
-        # tgi: [{role: "system" | "user", content: "text"}]
+        # Parse messages to Bedrock format
+        # tgi: "prompt" or [{role: "system" | "user", content: "text"}]
         # bedrock: [role: "assistant" | "user", content: {text: "content"}]
-        messages = [
-            {"role": "assistant" if i.get("role") == "system" else "user", "content": [{"text": i.get("content", "")}]}
-            for i in input.messages
-        ]
+        messages = None
+        if isinstance(input.messages, str):
+            messages = [{"role": "user", "content": [{"text": input.messages}]}]
+        else:
+            # Convert from list of HuggingFace TGI message objects
+            messages = [
+                {
+                    "role": "assistant" if i.get("role") == "system" else "user",
+                    "content": [{"text": i.get("content", "")}],
+                }
+                for i in input.messages
+            ]

         # Bedrock requires that conversations start with a user prompt
         # TGI allows the first message to be an assistant prompt, defining assistant behavior
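
The conversion rule introduced above can be illustrated in isolation. A standalone sketch (the helper name `to_bedrock_messages` is hypothetical; the mapping follows the diff: a bare prompt string becomes one user turn, and `system` roles are remapped to `assistant`):

```python
from typing import List, Union


def to_bedrock_messages(messages: Union[str, List[dict]]) -> List[dict]:
    """Map a TGI-style prompt or message list to Bedrock's message format."""
    if isinstance(messages, str):
        # A bare prompt string becomes a single user turn.
        return [{"role": "user", "content": [{"text": messages}]}]
    # Bedrock only accepts "user"/"assistant" roles, so "system" is remapped.
    return [
        {
            "role": "assistant" if m.get("role") == "system" else "user",
            "content": [{"text": m.get("content", "")}],
        }
        for m in messages
    ]


print(to_bedrock_messages("What is Deep Learning?"))
print(to_bedrock_messages([
    {"role": "system", "content": "Answer concisely."},
    {"role": "user", "content": "What is Deep Learning?"},
]))
```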

comps/retrievers/README.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@

 This retriever microservice is a highly efficient search service designed for handling and retrieving embedding vectors. It operates by receiving an embedding vector as input and conducting a similarity search against vectors stored in a VectorDB database. Users must specify the VectorDB's URL and the index name, and the service searches within that index to find documents with the highest similarity to the input vector.

-The service primarily utilizes similarity measures in vector space to rapidly retrieve contentually similar documents. The vector-based retrieval approach is particularly suited for handling large datasets, offering fast and accurate search results that significantly enhance the efficiency and quality of information retrieval.
+The service primarily utilizes similarity measures in vector space to rapidly retrieve contextually similar documents. The vector-based retrieval approach is particularly suited for handling large datasets, offering fast and accurate search results that significantly enhance the efficiency and quality of information retrieval.

 Overall, this microservice provides robust backend support for applications requiring efficient similarity searches, playing a vital role in scenarios such as recommendation systems, information retrieval, or any other context where precise measurement of document similarity is crucial.

comps/third_parties/tgi/README.md

Lines changed: 2 additions & 2 deletions
@@ -18,13 +18,13 @@ export MAX_TOTAL_TOKENS=2048
 Run tgi on xeon.

 ```bash
-cd deplopyment/docker_compose
+cd deployment/docker_compose
 docker compose -f compose.yaml up -d tgi-server
 ```

 Run tgi on gaudi.

 ```bash
-cd deplopyment/docker_compose
+cd deployment/docker_compose
 docker compose -f compose.yaml up -d tgi-gaudi-server
 ```

comps/third_parties/vllm/README.md

Lines changed: 3 additions & 3 deletions
@@ -43,7 +43,7 @@ bash ./launch_vllm_service.sh ${port_number} ${model_name}
 #### Launch vLLM service with docker compose

 ```bash
-cd deplopyment/docker_compose
+cd deployment/docker_compose
 docker compose -f compose.yaml up vllm-server -d
 ```

@@ -64,8 +64,8 @@ Set `hw_mode` to `hpu`.
 1. Option 1: Use docker compose for quick deploy

 ```bash
-cd deplopyment/docker_compose
-docker compose -f compose.yaml vllm-gaudi-server up -d
+cd deployment/docker_compose
+docker compose -f compose.yaml up vllm-gaudi-server -d
 ```

 2. Option 2: Use scripts to set parameters.
