Priority: P1-Stopper
OS type: Ubuntu
Hardware type: Gaudi/AMD GPU
Running nodes: Single Node

Description
Feature Objective:
Set vLLM as the default serving framework on Gaudi and AMD GPU for all remaining GenAI examples, leveraging its optimized performance to improve throughput and reduce latency in inference tasks.
Feature Details:
Replace TGI with vLLM as the default serving backend for inference on Xeon/Gaudi/AMD GPU devices.
Update serving configurations to align with vLLM's inference architecture.
Benchmark performance to validate vLLM's advantage in time to first token (TTFT), time per output token (TPOT), and scalability on Xeon/Gaudi/AMD GPU hardware (see the sketch after this list).
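As a rough illustration of the benchmarking step, the following minimal Python sketch streams one completion from a vLLM OpenAI-compatible endpoint and times the token events to estimate TTFT and TPOT. The endpoint URL, model name, and prompt are placeholders for illustration, not values from this issue.

```python
# Minimal TTFT/TPOT probe against an OpenAI-compatible vLLM endpoint.
# ENDPOINT, MODEL, and the prompt are assumed placeholder values.
import json
import time

import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # assumed vLLM address
MODEL = "meta-llama/Llama-3.1-8B-Instruct"         # assumed model id

payload = {
    "model": MODEL,
    "prompt": "Explain continuous batching in one paragraph.",
    "max_tokens": 128,
    "stream": True,
}

start = time.perf_counter()
token_times = []
with requests.post(ENDPOINT, json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # vLLM streams server-sent events: lines of the form "data: {...}".
        if not line or not line.startswith(b"data: "):
            continue
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":
            break
        json.loads(chunk)  # validate the event; payload itself not needed
        token_times.append(time.perf_counter())

if not token_times:
    raise SystemExit("no tokens received")
ttft = token_times[0] - start
# TPOT: mean gap between consecutive output tokens after the first one.
tpot = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)
print(f"TTFT: {ttft * 1000:.1f} ms, TPOT: {tpot * 1000:.1f} ms/token")
```

A real benchmark would additionally sweep concurrency and prompt/output lengths; this only shows where the two metrics come from.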
Expected Outcome:
Adopting vLLM as the default framework improves the user experience by significantly lowering latency while exceeding the current TGI throughput levels on Xeon/Gaudi/AMD GPU.
Feature Scope:
Intel Xeon and Gaudi
In v1.3, upgrade the vllm-fork version to v0.6.6.post1+Gaudi-1.20.0.
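Because the deployments will pin this exact build, a quick runtime check that a deployed server reports the expected version can catch stale images. Below is a minimal sketch, assuming the deployment exposes vLLM's /version route; the base URL and expected version string are placeholders.

```python
# Sanity-check that a running vLLM server matches the pinned build.
# BASE_URL and EXPECTED are assumed placeholder values.
import requests

BASE_URL = "http://localhost:8000"  # assumed vLLM service address
EXPECTED = "0.6.6.post1"            # version pinned for the release

resp = requests.get(f"{BASE_URL}/version", timeout=10)
resp.raise_for_status()
reported = resp.json().get("version", "")
if not reported.startswith(EXPECTED):
    raise SystemExit(f"version mismatch: got {reported!r}, want {EXPECTED}*")
print(f"vLLM server reports version {reported}")
```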
Examples:
- AgentQnA: Already supported
- AudioQnA:
  - [AudioQnA] Enable vLLM and set it as default LLM serving #1657
  - Adapt latest changes of audioqna GenAIInfra#890
- ChatQnA:
  - [ChatQnA] Switch to vLLM as default llm backend on Xeon #1403
  - [ChatQnA] Switch to vLLM as default llm backend on Gaudi #1404
- CodeGen:
  - Enable CodeGen vLLM #1636
  - codegen: add vLLM as default inference engine GenAIInfra#883
- CodeTrans:
  - Enable vllm for CodeTrans #1626
  - codetrans: add vLLM as default inference engine GenAIInfra#881
- DocSum:
  - Enable vllm for DocSum #1716
  - Use vLLM as default inference backend for DocSum GenAIInfra#928
- FaqGen (merged into ChatQnA):
  - Set vLLM as default model for FaqGen #1580
  - Merge faqgen to chatqna GenAIInfra#910
- VisualQnA:
  - Set vLLM as default model for VisualQnA #1644
  - Add vLLM backend for visualqna GenAIInfra#905
Components:
- LVM: vLLM lvm integration GenAIComps#1362
- vLLM version: Upgrade vLLM Gaudi version to v0.6.6 GenAIComps#1346
AMD GPU (ROCm)
Examples:
- AgentQnA: Adding files to deploy AgentQnA application on ROCm vLLM #1613
- ChatQnA: Adding files to deploy ChatQnA application on ROCm vLLM #1560
- CodeGen: Adding files to deploy CodeGen application on ROCm vLLM #1544
- CodeTrans: Adding files to deploy CodeTrans application on ROCm vLLM #1545
- DocSum: Adding files to deploy DocSum application on ROCm vLLM #1572
- FaqGen: This example has been merged into ChatQnA.
- AudioQnA: Adding files to deploy AudioQnA application on ROCm vLLM #1655
- VisualQnA: AMD will not support it in v1.3.
- SearchQnA: Adding files to deploy SearchQnA application on ROCm vLLM #1649
- Translation: Adding files to deploy Translation application on ROCm vLLM #1648
Components: