Implement prompt logprobs & Batched topk for computing logprobs #1328
Conversation
@WoosukKwon @Yard1 This PR is ready for review.

Profiling result:

# main without logprobs:
(vllm) zhuohan@zhuohan-1:~/vllm/vllm/benchmarks$ python benchmark_throughput.py --backend vllm --model huggyllama/llama-7b --dataset ../../data/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 2000
Namespace(backend='vllm', dataset='../../data/ShareGPT_V3_unfiltered_cleaned_split.json', model='huggyllama/llama-7b', tokenizer='huggyllama/llama-7b', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=2000, seed=0, hf_max_batch_size=None, trust_remote_code=False, dtype='auto')
INFO 10-13 20:28:36 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 10-13 20:29:24 llm_engine.py:72] Initializing an LLM engine with config: model='huggyllama/llama-7b', tokenizer='huggyllama/llama-7b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 10-13 20:29:24 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 10-13 20:29:29 llm_engine.py:207] # GPU blocks: 7455, # CPU blocks: 512
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [03:51<00:00, 8.65it/s]
Throughput: 8.65 requests/s, 4186.71 tokens/s
# this branch without logprobs:
(vllm) zhuohan@zhuohan-1:~/vllm/vllm/benchmarks$ python benchmark_throughput.py --backend vllm --model huggyllama/llama-7b --dataset ../../data/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 2000
Namespace(backend='vllm', dataset='../../data/ShareGPT_V3_unfiltered_cleaned_split.json', model='huggyllama/llama-7b', tokenizer='huggyllama/llama-7b', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=2000, seed=0, hf_max_batch_size=None, trust_remote_code=False, dtype='auto')
INFO 10-13 18:56:14 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 10-13 18:56:57 llm_engine.py:72] Initializing an LLM engine with config: model='huggyllama/llama-7b', tokenizer='huggyllama/llama-7b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 10-13 18:56:57 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 10-13 18:57:02 llm_engine.py:207] # GPU blocks: 7455, # CPU blocks: 512
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [03:51<00:00, 8.65it/s]
Throughput: 8.65 requests/s, 4185.26 tokens/s
# main with logprobs 5
(vllm) zhuohan@zhuohan-1:~/vllm/vllm/benchmarks$ python benchmark_throughput.py --backend vllm --model huggyllama/llama-7b --dataset ../../data/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 2000
Namespace(backend='vllm', dataset='../../data/ShareGPT_V3_unfiltered_cleaned_split.json', model='huggyllama/llama-7b', tokenizer='huggyllama/llama-7b', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=2000, seed=0, hf_max_batch_size=None, trust_remote_code=False, dtype='auto')
INFO 10-13 20:42:18 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 10-13 20:43:02 llm_engine.py:72] Initializing an LLM engine with config: model='huggyllama/llama-7b', tokenizer='huggyllama/llama-7b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 10-13 20:43:02 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 10-13 20:43:07 llm_engine.py:207] # GPU blocks: 7455, # CPU blocks: 512
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [05:11<00:00, 6.42it/s]
Throughput: 6.41 requests/s, 3102.15 tokens/s
# this branch with logprobs 5
(vllm) zhuohan@zhuohan-1:~/vllm/vllm/benchmarks$ python benchmark_throughput.py --backend vllm --model huggyllama/llama-7b --dataset ../../data/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 2000
Namespace(backend='vllm', dataset='../../data/ShareGPT_V3_unfiltered_cleaned_split.json', model='huggyllama/llama-7b', tokenizer='huggyllama/llama-7b', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=2000, seed=0, hf_max_batch_size=None, trust_remote_code=False, dtype='auto')
INFO 10-13 20:48:56 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 10-13 20:49:44 llm_engine.py:72] Initializing an LLM engine with config: model='huggyllama/llama-7b', tokenizer='huggyllama/llama-7b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 10-13 20:49:44 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 10-13 20:49:49 llm_engine.py:207] # GPU blocks: 7455, # CPU blocks: 512
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [04:02<00:00, 8.24it/s]
Throughput: 8.23 requests/s, 3980.15 tokens/s
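In summary: without logprobs, throughput is unchanged (8.65 requests/s on both main and this branch); with logprobs 5, this branch reaches 8.23 requests/s versus 6.41 requests/s on main, roughly a 28% improvement and close to the no-logprobs baseline.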
This looks good to me! I will update my PR once this is merged. We should definitely consider a broader refactor here to precompute as many things as possible in `_prepare_inputs` to avoid multiple loops and CPU-GPU syncing unless necessary.
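The kind of refactor being suggested could look roughly like the sketch below (the helper name `prepare_sampling_tensors` is hypothetical, not the actual `_prepare_inputs` implementation): build index tensors once on the CPU and move them to the device in a single transfer, so the sampler does not create small tensors or trigger CPU-GPU syncs inside per-sequence loops.

```python
import torch

def prepare_sampling_tensors(sampled_token_ids, device=None):
    # Hypothetical sketch of the idea above, not the actual vLLM code:
    # materialize the index tensor once and move it to the device in a
    # single transfer, instead of building small tensors (and forcing
    # CPU-GPU synchronization) inside per-sequence loops in the sampler.
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    indices = torch.tensor(sampled_token_ids, dtype=torch.long)
    return indices.to(device, non_blocking=True)

# The sampler could then gather the logprobs of the sampled tokens for the
# whole batch in one indexing operation, e.g.:
#   token_logprobs = logprobs[torch.arange(indices.numel()), indices]
```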
@zhuohan123 Awesome! Thanks for the hard work. Please check my comments.
prompt_logprobs: List[Optional[List[Optional[Dict[int, int]]]]],
sample_logprobs: List[List[Optional[Dict[int, int]]]],
Suggested change:
- prompt_logprobs: List[Optional[List[Optional[Dict[int, int]]]]],
- sample_logprobs: List[List[Optional[Dict[int, int]]]],
+ prompt_logprobs: List[Optional[List[Optional[Dict[int, float]]]]],
+ sample_logprobs: List[List[Optional[Dict[int, float]]]],
A dumb question: Why do we need `Optional` here? In which case is it used?
For `sample_logprobs`, there should be no `Optional`. I have fixed the code. For `prompt_logprobs`, there are two cases:

- If a request does not query prompt logprobs, the `prompt_logprobs` for that request will be `None`.
- The first token of the prompt will not have a log probability, so its entry will always be `None`. This is the same behavior as the OpenAI endpoint (see the usage sketch below).
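A minimal usage sketch of the behavior described above (assuming the `LLM`/`SamplingParams` interface plus the `prompt_logprobs` parameter added in this PR; the model name and prompt are only examples):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="huggyllama/llama-7b")
# Ask for the top-5 log probabilities of every prompt token.
params = SamplingParams(max_tokens=16, prompt_logprobs=5)
output = llm.generate(["Hello, my name is"], params)[0]

# The first entry is always None: the first prompt token has no preceding
# context, so no log probability is assigned to it (same as OpenAI's `echo`).
assert output.prompt_logprobs[0] is None
for token_id, logprob_dict in zip(output.prompt_token_ids[1:],
                                  output.prompt_logprobs[1:]):
    print(token_id, logprob_dict)  # {token_id: log probability, ...}
```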
Got it. Thanks for the explanation!
BTW, I got 2 logprobs per token when running […]
A similar question here. Does the design support […]?
The […]
@WoosukKwon This PR is ready for review again.
Is this now available? I am eager to use the PPL method to do the evals!
LGTM! Thanks for the hard work!
Has this been implemented? It doesn't seem to be returning prompt logits when I specify it.
…-project#1328) Co-authored-by: Yunmo Chen <[email protected]>
Why do I come across some logprob=-inf?
This PR:

- Adds `prompt_logprobs` to `SamplingParams` and `RequestOutput`. This makes vLLM support returning the log probabilities of prompt tokens, which is required to support `echo` in the OpenAI server.
- Computing `topk` logits is done in a batched fashion.

This PR will have merge conflicts with #1337. I think a good plan is to perform the optimization in #1337 along with the refactoring of `InputMetadata` after this PR is merged.

TODOs:

- `InputMetadata`
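For reference, a rough standalone sketch of what computing top-k logprobs "in a batched fashion" means (illustrative only, not the sampler code from this PR):

```python
import torch

def batched_topk_logprobs(logits: torch.Tensor, k: int = 5):
    # logits has shape (num_token_positions, vocab_size) for the whole batch.
    # log_softmax and topk run once over all positions in a single batched
    # call, instead of looping over sequences one at a time.
    logprobs = torch.log_softmax(logits, dim=-1)
    topk_logprobs, topk_token_ids = torch.topk(logprobs, k, dim=-1)
    return topk_logprobs, topk_token_ids

# Example with random logits: 8 token positions, a LLaMA-sized vocabulary.
logits = torch.randn(8, 32000)
values, ids = batched_topk_logprobs(logits, k=5)
print(values.shape, ids.shape)  # torch.Size([8, 5]) torch.Size([8, 5])
```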