
[Feature] vLLM enablement for 8 GenAI examples #1436

@joshuayao


Priority: P1-Stopper

OS type: Ubuntu

Hardware type: Gaudi / AMD GPU

Running nodes: Single Node

Description

Feature Objective:

Set vLLM as the default serving framework on Gaudi and AMD GPU for all remaining GenAI examples, leveraging its optimized inference performance to improve throughput and reduce latency.

Feature Details:

- Replace TGI with vLLM as the default serving backend for inference on Xeon/Gaudi/AMD GPU devices.
- Update serving configurations to align with vLLM's architecture for inference.
- Run performance benchmarks to validate vLLM's advantage in time to first token (TTFT), time per output token (TPOT), and scalability on Xeon/Gaudi/AMD GPU hardware.
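For the backend swap, the change in each example typically lands in its compose file. The fragment below is a rough sketch of what a vLLM service on Gaudi could look like; the image name, model ID, ports, and environment variables are illustrative assumptions, not the project's actual configuration — consult each example's real compose file.

```yaml
# Hypothetical docker-compose service: vLLM replacing TGI as the serving backend on Gaudi.
# Image name, model ID, ports, and device settings below are placeholders.
services:
  vllm-service:
    image: opea/vllm-gaudi:latest            # assumed image name
    ports:
      - "8008:80"
    environment:
      HF_TOKEN: ${HF_TOKEN}
      HABANA_VISIBLE_DEVICES: all
    runtime: habana                          # Gaudi container runtime
    cap_add:
      - SYS_NICE
    # vLLM exposes an OpenAI-compatible API; model here is illustrative.
    command: --model meta-llama/Meta-Llama-3-8B-Instruct --port 80
```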

Expected Outcome:

Adopting vLLM as the default framework improves the user experience by significantly lowering latency while exceeding the current TGI throughput levels on Xeon/Gaudi/AMD GPU.
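The two latency metrics above are derived from a streamed response: TTFT is the delay from request start to the first generated token, and TPOT is the average gap between subsequent tokens. A minimal sketch of that computation (function and parameter names are illustrative, not an OPEA API):

```python
# Sketch: computing TTFT and TPOT from a request start time and the
# arrival timestamps of each streamed token. Illustrative only.

def ttft_tpot(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Return (TTFT, TPOT) in the same time unit as the inputs."""
    if not token_times:
        raise ValueError("no tokens received")
    # Time to first token: first arrival minus request start.
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, 0.0
    # Time per output token: mean inter-token gap after the first token.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot
```

In a real benchmark, the timestamps would be captured while iterating over the server's streaming (OpenAI-compatible) response.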

Feature Scope:

Intel Xeon and Gaudi

Execution plan:

In v1.3, upgrade the vllm-fork version to v0.6.6.post1+Gaudi-1.20.0.
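Pinning that fork version would amount to an install fragment along these lines (repository layout and build steps are assumptions; check the HabanaAI/vllm-fork release notes for the exact tag and supported install path):

```shell
# Sketch: pinning the Gaudi vLLM fork to the tag named in the plan above.
# Clone/checkout/install steps are assumed, not the project's build recipe.
git clone https://github.com/HabanaAI/vllm-fork.git
cd vllm-fork
git checkout v0.6.6.post1+Gaudi-1.20.0   # tag taken from this issue's execution plan
pip install -e .
```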

Examples:

Components:

AMD GPU (ROCm)

Examples:

Components:


Status: Done
