
[ChatQnA] Failed to run reranking and embedding service on H100. #442

@PeterYang12

Description

Hi maintainers,
I followed the README at https://github.com/opea-project/GenAIExamples/tree/main/ChatQnA/docker/gpu and tried to deploy ChatQnA with docker-compose. However, tei-reranking-server and tei-embedding-server failed to start. Here are the error logs:

$ docker logs tei-reranking-server
2024-07-23T04:25:26.629348Z  INFO text_embeddings_router: router/src/main.rs:140: Args { model_id: "BAA*/***-********-*ase", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, hf_api_token: None, hostname: "427706bc91b6", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, cors_allow_origin: None }
2024-07-23T04:25:26.635961Z  INFO hf_hub: /root/.cargo/git/checkouts/hf-hub-1aadb4c6e2cbe1ba/b167f69/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-07-23T04:25:28.061305Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:20: Starting download
2024-07-23T04:25:28.061547Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:37: Model artifacts downloaded in 254.674µs
2024-07-23T04:25:28.900779Z  WARN text_embeddings_router: router/src/lib.rs:165: Could not find a Sentence Transformers config
2024-07-23T04:25:28.900827Z  INFO text_embeddings_router: router/src/lib.rs:169: Maximum number of tokens per request: 512
2024-07-23T04:25:28.925017Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:23: Starting 112 tokenization workers
2024-07-23T04:25:51.644147Z  INFO text_embeddings_router: router/src/lib.rs:194: Starting model backend
Error: Could not create backend

Caused by:
    Could not start backend: Runtime compute cap 90 is not compatible with compile time compute cap 80
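
For reference, the runtime side of the mismatch can be confirmed on the host with nvidia-smi (assuming a driver recent enough to support the compute_cap query field):

$ nvidia-smi --query-gpu=name,compute_cap --format=csv
# H100 (Hopper) reports compute capability 9.0, which matches the "Runtime compute cap 90" above,
# while the :1.2 image used here was built for compute cap 80 (Ampere-class).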

I root-caused the failure: the image ghcr.io/huggingface/text-embeddings-inference:1.2 that the reranking and embedding services use is incompatible with some GPUs. For example, my H100 is a Hopper-architecture card, so it should use the image ghcr.io/huggingface/text-embeddings-inference:hopper-1.5 instead. See the compatibility table at https://github.com/huggingface/text-embeddings-inference/tree/main. A sketch of the compose change is below.
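
For illustration only, the change in the compose file would look roughly like this (the service names below are assumed and may differ from the actual file; only the image tag changes):

services:
  tei-embedding-service:
    image: ghcr.io/huggingface/text-embeddings-inference:hopper-1.5   # was :1.2
  tei-reranking-service:
    image: ghcr.io/huggingface/text-embeddings-inference:hopper-1.5   # was :1.2

After updating the tags, recreating the two services with docker compose up -d should let the backend start on Hopper GPUs.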

I filed a PR to fix this issue. Please correct me if I am wrong, or let me know if you have a better fix.
