Implement prompt logprobs & Batched topk for computing logprobs #1328
Conversation
@WoosukKwon @Yard1 This PR is ready for review.

Profiling result:

# main without logprobs:
(vllm) zhuohan@zhuohan-1:~/vllm/vllm/benchmarks$ python benchmark_throughput.py --backend vllm --model huggyllama/llama-7b --dataset ../../data/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 2000
Namespace(backend='vllm', dataset='../../data/ShareGPT_V3_unfiltered_cleaned_split.json', model='huggyllama/llama-7b', tokenizer='huggyllama/llama-7b', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=2000, seed=0, hf_max_batch_size=None, trust_remote_code=False, dtype='auto')
INFO 10-13 20:28:36 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 10-13 20:29:24 llm_engine.py:72] Initializing an LLM engine with config: model='huggyllama/llama-7b', tokenizer='huggyllama/llama-7b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 10-13 20:29:24 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 10-13 20:29:29 llm_engine.py:207] # GPU blocks: 7455, # CPU blocks: 512
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [03:51<00:00, 8.65it/s]
Throughput: 8.65 requests/s, 4186.71 tokens/s
# this branch without logprobs:
(vllm) zhuohan@zhuohan-1:~/vllm/vllm/benchmarks$ python benchmark_throughput.py --backend vllm --model huggyllama/llama-7b --dataset ../../data/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 2000
Namespace(backend='vllm', dataset='../../data/ShareGPT_V3_unfiltered_cleaned_split.json', model='huggyllama/llama-7b', tokenizer='huggyllama/llama-7b', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=2000, seed=0, hf_max_batch_size=None, trust_remote_code=False, dtype='auto')
INFO 10-13 18:56:14 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 10-13 18:56:57 llm_engine.py:72] Initializing an LLM engine with config: model='huggyllama/llama-7b', tokenizer='huggyllama/llama-7b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 10-13 18:56:57 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 10-13 18:57:02 llm_engine.py:207] # GPU blocks: 7455, # CPU blocks: 512
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [03:51<00:00, 8.65it/s]
Throughput: 8.65 requests/s, 4185.26 tokens/s
# main with logprobs 5
(vllm) zhuohan@zhuohan-1:~/vllm/vllm/benchmarks$ python benchmark_throughput.py --backend vllm --model huggyllama/llama-7b --dataset ../../data/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 2000
Namespace(backend='vllm', dataset='../../data/ShareGPT_V3_unfiltered_cleaned_split.json', model='huggyllama/llama-7b', tokenizer='huggyllama/llama-7b', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=2000, seed=0, hf_max_batch_size=None, trust_remote_code=False, dtype='auto')
INFO 10-13 20:42:18 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 10-13 20:43:02 llm_engine.py:72] Initializing an LLM engine with config: model='huggyllama/llama-7b', tokenizer='huggyllama/llama-7b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 10-13 20:43:02 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 10-13 20:43:07 llm_engine.py:207] # GPU blocks: 7455, # CPU blocks: 512
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [05:11<00:00, 6.42it/s]
Throughput: 6.41 requests/s, 3102.15 tokens/s
# this branch with logprobs 5
(vllm) zhuohan@zhuohan-1:~/vllm/vllm/benchmarks$ python benchmark_throughput.py --backend vllm --model huggyllama/llama-7b --dataset ../../data/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 2000
Namespace(backend='vllm', dataset='../../data/ShareGPT_V3_unfiltered_cleaned_split.json', model='huggyllama/llama-7b', tokenizer='huggyllama/llama-7b', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=2000, seed=0, hf_max_batch_size=None, trust_remote_code=False, dtype='auto')
INFO 10-13 20:48:56 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 10-13 20:49:44 llm_engine.py:72] Initializing an LLM engine with config: model='huggyllama/llama-7b', tokenizer='huggyllama/llama-7b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 10-13 20:49:44 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 10-13 20:49:49 llm_engine.py:207] # GPU blocks: 7455, # CPU blocks: 512
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [04:02<00:00, 8.24it/s]
Throughput: 8.23 requests/s, 3980.15 tokens/s
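In summary: without logprobs, throughput is unchanged (8.65 requests/s on both main and this branch); with logprobs 5, this branch reaches 8.23 requests/s versus 6.41 requests/s on main, roughly a 28% improvement and close to the no-logprobs baseline.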
This looks good to me! I will update my PR once this is merged. We should definitely consider a broader refactor here to precompute as many things as possible in `_prepare_inputs` to avoid multiple loops and CPU-GPU syncing unless necessary.
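The kind of refactor being suggested could look roughly like the sketch below (the helper name `prepare_sampling_tensors` is hypothetical, not the actual `_prepare_inputs` implementation): build index tensors once on the CPU and move them to the device in a single transfer, so the sampler does not create small tensors or trigger CPU-GPU syncs inside per-sequence loops.

```python
import torch

def prepare_sampling_tensors(sampled_token_ids, device=None):
    # Hypothetical sketch of the idea above, not the actual vLLM code:
    # materialize the index tensor once and move it to the device in a
    # single transfer, instead of building small tensors (and forcing
    # CPU-GPU synchronization) inside per-sequence loops in the sampler.
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    indices = torch.tensor(sampled_token_ids, dtype=torch.long)
    return indices.to(device, non_blocking=True)

# The sampler could then gather the logprobs of the sampled tokens for the
# whole batch in one indexing operation, e.g.:
#   token_logprobs = logprobs[torch.arange(indices.numel()), indices]
```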
@zhuohan123 Awesome! Thanks for the hard work. Please check my comments.
prompt_logprobs: List[Optional[List[Optional[Dict[int, int]]]]],
sample_logprobs: List[List[Optional[Dict[int, int]]]],
Suggested change:
- prompt_logprobs: List[Optional[List[Optional[Dict[int, int]]]]],
- sample_logprobs: List[List[Optional[Dict[int, int]]]],
+ prompt_logprobs: List[Optional[List[Optional[Dict[int, float]]]]],
+ sample_logprobs: List[List[Optional[Dict[int, float]]]],
A dumb question: Why do we need `Optional` here? In which case is it used?
For `sample_logprobs`, there should be no `Optional`. I have fixed the code. For `prompt_logprobs`, there are two cases:

- If a request does not query prompt logprobs, the `prompt_logprobs` for that request will be `None`.
- The first token of the prompt will not have a log probability, so its entry will always be `None`. This is the same behavior as the OpenAI endpoint (see the usage sketch below).
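A minimal usage sketch of the behavior described above (assuming the `LLM`/`SamplingParams` interface plus the `prompt_logprobs` parameter added in this PR; the model name and prompt are only examples):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="huggyllama/llama-7b")
# Ask for the top-5 log probabilities of every prompt token.
params = SamplingParams(max_tokens=16, prompt_logprobs=5)
output = llm.generate(["Hello, my name is"], params)[0]

# The first entry is always None: the first prompt token has no preceding
# context, so no log probability is assigned to it (same as OpenAI's `echo`).
assert output.prompt_logprobs[0] is None
for token_id, logprob_dict in zip(output.prompt_token_ids[1:],
                                  output.prompt_logprobs[1:]):
    print(token_id, logprob_dict)  # {token_id: log probability, ...}
```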
Got it. Thanks for the explanation!
BTW, I got 2 logprobs per token when running […]
A similar question here. Does the design support […]?
The […]
@WoosukKwon This PR is ready for review again.
Is this now available? I am eager to use the PPL method to do the evals!
LGTM! Thanks for the hard work!
Has this been implemented? It doesn't seem to be returning prompt logits when I specify it.
…-project#1328) Co-authored-by: Yunmo Chen <[email protected]>
Why do I come across some logprob=-inf?
This PR:

- Adds `prompt_logprobs` to `SamplingParams` and `RequestOutput`. This makes vLLM support returning the log probabilities of prompt tokens, which is required to support `echo` in the OpenAI server.
- Computing `topk` logits is done in a batched fashion.

This PR will have merge conflicts with #1337. I think a good plan is to perform the optimization in #1337 along with the refactoring of `InputMetadata` after this PR is merged.

TODOs:

- `InputMetadata`
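For reference, a rough standalone sketch of what computing top-k logprobs "in a batched fashion" means (illustrative only, not the sampler code from this PR):

```python
import torch

def batched_topk_logprobs(logits: torch.Tensor, k: int = 5):
    # logits has shape (num_token_positions, vocab_size) for the whole batch.
    # log_softmax and topk run once over all positions in a single batched
    # call, instead of looping over sequences one at a time.
    logprobs = torch.log_softmax(logits, dim=-1)
    topk_logprobs, topk_token_ids = torch.topk(logprobs, k, dim=-1)
    return topk_logprobs, topk_token_ids

# Example with random logits: 8 token positions, a LLaMA-sized vocabulary.
logits = torch.randn(8, 32000)
values, ids = batched_topk_logprobs(logits, k=5)
print(values.shape, ids.shape)  # torch.Size([8, 5]) torch.Size([8, 5])
```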