[ChatQnA] Switch to vLLM as default llm backend on Xeon #1403
Merged
Switching from TGI to vLLM as the default LLM serving backend on Xeon for the ChatQnA example to improve performance. Benchmarking on a Xeon server with vLLM and TGI as the LLM serving backend, across different ISL/OSL combinations and various numbers of queries and concurrency levels, shows that the geomean of the measured LLM serving performance on a 7B model improves with vLLM over TGI on several metrics, including average total latency, average TTFT, average TPOT, and throughput. TGI is still offered as a deployment option for LLM serving. In addition, vLLM replaces TGI in the other provided end-to-end ChatQnA pipelines, including the without-rerank pipeline and the pipelines that use Pinecone or Qdrant as the vector DB.
Implements opea-project#1213
Signed-off-by: Wang, Kai Lawrence <[email protected]>
Dependency Review: ✅ No vulnerabilities or license issues found.
XinyuYe-Intel approved these changes on Jan 17, 2025.
yao531441 approved these changes on Jan 17, 2025.
chyundunovDatamonsters pushed a commit to chyundunovDatamonsters/OPEA-GenAIExamples that referenced this pull request on Mar 4, 2025:
…#1403) Switching from TGI to vLLM as the default LLM serving backend on Xeon for the ChatQnA example to improve performance. opea-project#1213 Signed-off-by: Wang, Kai Lawrence <[email protected]> Signed-off-by: Chingis Yundunov <[email protected]>
Description
Switch to vLLM as the default LLM backend on Xeon for ChatQnA pipeline.
Switching from TGI to vLLM as the default LLM serving backend on Xeon for the ChatQnA example to improve performance. Benchmarking on a Xeon server with both vLLM and TGI as the LLM serving backend, across different ISL/OSL combinations and various numbers of queries and concurrency levels, shows that the geomean of the measured LLM serving performance on a 7B model improves with vLLM over TGI on several metrics, including average total latency, average TTFT, average TPOT, and throughput. TGI is still offered as a deployment option for LLM serving. In addition, vLLM replaces TGI in the other provided end-to-end ChatQnA pipelines, including the without-rerank pipeline and the pipelines that use Pinecone or Qdrant as the vector DB. This PR also aligns the parameters of the LLM service in all ChatQnA test scripts with those in the README file.
Issues
#1213
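As a hedged illustration of what the default switch means for deployment, the sketch below shows how the ChatQnA example on Xeon might be brought up with vLLM as the default backend and how TGI could still be selected. The directory path, compose file names (compose.yaml, compose_tgi.yaml), and environment variables are assumptions based on the usual GenAIExamples layout, not verified against this PR.

```bash
# Deployment sketch only -- the path, compose file names, and variables below are
# assumptions, not taken from this PR.
cd GenAIExamples/ChatQnA/docker_compose/intel/cpu/xeon

# Environment commonly expected by the ChatQnA compose files (placeholder values).
export host_ip=$(hostname -I | awk '{print $1}')
export HUGGINGFACEHUB_API_TOKEN=<your_hf_token>

# Default pipeline: vLLM serves the LLM.
docker compose -f compose.yaml up -d

# TGI remains available as an opt-in alternative (file name assumed):
# docker compose -f compose_tgi.yaml up -d
```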
Type of change
Dependencies
n/a
Tests
TGI version: 2.4.0
vLLM version: 0.6.6.post2.dev151+gbd828722
Benchmarked and compared LLM serving performance on a GNR (Granite Rapids Xeon) server with out-of-the-box (OOB) vLLM and OOB TGI backends via GenAIEval. The geomean performance of vLLM is better than that of TGI for average total latency, average TTFT, average TPOT, and throughput on a 7B LLM across 4 sets of ISL/OSL (128/128, 128/1024, 1024/128, 1024/1024), measured at different num_queries/concurrency settings, including 32/8 and 128/32.
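For a quick sanity check outside the GenAIEval harness, the hedged sketch below sends a single request to the vLLM OpenAI-compatible endpoint to confirm the serving container responds before a benchmark run. The port (9009) and model identifier are placeholders and may not match the compose defaults; the numbers reported above were collected with GenAIEval, not with this ad-hoc request.

```bash
# Sanity-check sketch only -- port and model id are placeholders.
export host_ip=$(hostname -I | awk '{print $1}')

# vLLM exposes an OpenAI-compatible API, so one chat completion is enough to
# confirm the backend is up and generating tokens.
curl http://${host_ip}:9009/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<served-7b-model-id>",
        "messages": [{"role": "user", "content": "What is deep learning?"}],
        "max_tokens": 32
      }'
```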