
[ChatQnA] Failed to run reranking and embedding service on H100. #442

@PeterYang12

Description

Hi maintainers,
I followed the README at https://github.com/opea-project/GenAIExamples/tree/main/ChatQnA/docker/gpu and tried to deploy ChatQnA with docker-compose. However, tei-reranking-server and tei-embedding-server failed to start. Here are the error logs:

$ docker logs tei-reranking-server
2024-07-23T04:25:26.629348Z  INFO text_embeddings_router: router/src/main.rs:140: Args { model_id: "BAA*/***-********-*ase", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, hf_api_token: None, hostname: "427706bc91b6", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, cors_allow_origin: None }
2024-07-23T04:25:26.635961Z  INFO hf_hub: /root/.cargo/git/checkouts/hf-hub-1aadb4c6e2cbe1ba/b167f69/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-07-23T04:25:28.061305Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:20: Starting download
2024-07-23T04:25:28.061547Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:37: Model artifacts downloaded in 254.674µs
2024-07-23T04:25:28.900779Z  WARN text_embeddings_router: router/src/lib.rs:165: Could not find a Sentence Transformers config
2024-07-23T04:25:28.900827Z  INFO text_embeddings_router: router/src/lib.rs:169: Maximum number of tokens per request: 512
2024-07-23T04:25:28.925017Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:23: Starting 112 tokenization workers
2024-07-23T04:25:51.644147Z  INFO text_embeddings_router: router/src/lib.rs:194: Starting model backend
Error: Could not create backend

Caused by:
    Could not start backend: Runtime compute cap 90 is not compatible with compile time compute cap 80
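
For reference, the runtime side of the mismatch can be confirmed on the host with nvidia-smi (assuming a driver recent enough to support the compute_cap query field):

$ nvidia-smi --query-gpu=name,compute_cap --format=csv
# H100 (Hopper) reports compute capability 9.0, which matches the "Runtime compute cap 90" above,
# while the :1.2 image used here was built for compute cap 80 (Ampere-class).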

I root-caused the failure: the image ghcr.io/huggingface/text-embeddings-inference:1.2 that the reranking and embedding services use is incompatible with some GPUs. For example, my H100 is a Hopper-architecture card, so it should use the image ghcr.io/huggingface/text-embeddings-inference:hopper-1.5 instead. See the compatibility table at https://github.com/huggingface/text-embeddings-inference/tree/main. A sketch of the compose change is below.
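
For illustration only, the change in the compose file would look roughly like this (the service names below are assumed and may differ from the actual file; only the image tag changes):

services:
  tei-embedding-service:
    image: ghcr.io/huggingface/text-embeddings-inference:hopper-1.5   # was :1.2
  tei-reranking-service:
    image: ghcr.io/huggingface/text-embeddings-inference:hopper-1.5   # was :1.2

After updating the tags, recreating the two services with docker compose up -d should let the backend start on Hopper GPUs.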

I filed a PR to fix this issue. Please correct me if I am wrong, or let me know if you have a better fix.
