Priority: P1-Stopper
OS type: Ubuntu
Hardware type: Gaudi/AMD GPU
Running nodes: Single Node

Description
Feature Objective:
Set vLLM as the default serving framework on Gaudi and AMD GPU for all remaining GenAI examples, leveraging its optimized performance to improve throughput and reduce latency in inference tasks.
Feature Details:
Replace TGI with vLLM as the default serving backend for inference on Xeon/Gaudi/AMD GPU devices.
Update serving configurations to align with vLLM's inference architecture.
Benchmark performance to validate vLLM's advantage in time to first token (TTFT), time per output token (TPOT), and scalability on Xeon/Gaudi/AMD GPU hardware (see the sketch after this list).
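As a rough illustration of the benchmarking step, the following minimal Python sketch streams one completion from a vLLM OpenAI-compatible endpoint and times the token events to estimate TTFT and TPOT. The endpoint URL, model name, and prompt are placeholders for illustration, not values from this issue.

```python
# Minimal TTFT/TPOT probe against an OpenAI-compatible vLLM endpoint.
# ENDPOINT, MODEL, and the prompt are assumed placeholder values.
import json
import time

import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # assumed vLLM address
MODEL = "meta-llama/Llama-3.1-8B-Instruct"         # assumed model id

payload = {
    "model": MODEL,
    "prompt": "Explain continuous batching in one paragraph.",
    "max_tokens": 128,
    "stream": True,
}

start = time.perf_counter()
token_times = []
with requests.post(ENDPOINT, json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # vLLM streams server-sent events: lines of the form "data: {...}".
        if not line or not line.startswith(b"data: "):
            continue
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":
            break
        json.loads(chunk)  # validate the event; payload itself not needed
        token_times.append(time.perf_counter())

if not token_times:
    raise SystemExit("no tokens received")
ttft = token_times[0] - start
# TPOT: mean gap between consecutive output tokens after the first one.
tpot = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)
print(f"TTFT: {ttft * 1000:.1f} ms, TPOT: {tpot * 1000:.1f} ms/token")
```

A real benchmark would additionally sweep concurrency and prompt/output lengths; this only shows where the two metrics come from.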
Expected Outcome:
Adopting vLLM as the default framework improves the user experience by significantly lowering latency while exceeding the current TGI throughput levels on Xeon/Gaudi/AMD GPU.
Feature Scope:
Intel Xeon and Gaudi
In v1.3, upgrade the vllm-fork version to v0.6.6.post1+Gaudi-1.20.0.
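Because the deployments will pin this exact build, a quick runtime check that a deployed server reports the expected version can catch stale images. Below is a minimal sketch, assuming the deployment exposes vLLM's /version route; the base URL and expected version string are placeholders.

```python
# Sanity-check that a running vLLM server matches the pinned build.
# BASE_URL and EXPECTED are assumed placeholder values.
import requests

BASE_URL = "http://localhost:8000"  # assumed vLLM service address
EXPECTED = "0.6.6.post1"            # version pinned for the release

resp = requests.get(f"{BASE_URL}/version", timeout=10)
resp.raise_for_status()
reported = resp.json().get("version", "")
if not reported.startswith(EXPECTED):
    raise SystemExit(f"version mismatch: got {reported!r}, want {EXPECTED}*")
print(f"vLLM server reports version {reported}")
```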
Examples:
- AgentQnA: Already supported
- AudioQnA:
  - [AudioQnA] Enable vLLM and set it as default LLM serving #1657
  - Adapt latest changes of audioqna GenAIInfra#890
- ChatQnA:
  - [ChatQnA] Switch to vLLM as default llm backend on Xeon #1403
  - [ChatQnA] Switch to vLLM as default llm backend on Gaudi #1404
- CodeGen:
  - Enable CodeGen vLLM #1636
  - codegen: add vLLM as default inference engine GenAIInfra#883
- CodeTrans:
  - Enable vllm for CodeTrans #1626
  - codetrans: add vLLM as default inference engine GenAIInfra#881
- DocSum:
  - Enable vllm for DocSum #1716
  - Use vLLM as default inference backend for DocSum GenAIInfra#928
- FaqGen (merged into ChatQnA):
  - Set vLLM as default model for FaqGen #1580
  - Merge faqgen to chatqna GenAIInfra#910
- VisualQnA:
  - Set vLLM as default model for VisualQnA #1644
  - Add vLLM backend for visualqna GenAIInfra#905
Components:
- LVM: vLLM lvm integration GenAIComps#1362
- vLLM version: Upgrade vLLM Gaudi version to v0.6.6 GenAIComps#1346
AMD GPU (ROCm)
Examples:
- AgentQnA: Adding files to deploy AgentQnA application on ROCm vLLM #1613
- ChatQnA: Adding files to deploy ChatQnA application on ROCm vLLM #1560
- CodeGen: Adding files to deploy CodeGen application on ROCm vLLM #1544
- CodeTrans: Adding files to deploy CodeTrans application on ROCm vLLM #1545
- DocSum: Adding files to deploy DocSum application on ROCm vLLM #1572
- FaqGen: This example has been merged into ChatQnA.
- AudioQnA: Adding files to deploy AudioQnA application on ROCm vLLM #1655
- VisualQnA: AMD will not support it in v1.3.
- SearchQnA: Adding files to deploy SearchQnA application on ROCm vLLM #1649
- Translation: Adding files to deploy Translation application on ROCm vLLM #1648
Components: