
Conversation

sh1ng (Contributor) commented Dec 15, 2023

A follow-up to #42, cc @zhuohan123.

torch.jit.script and TorchScript can't be used because the forward methods take parameters whose types are not supported: https://pytorch.org/docs/stable/jit_language_reference.html#supported-type.
torch.jit.trace looks even more challenging.
I was only able to make the model run with torch.compile plus a few @torch.compiler.disable annotations. Unfortunately, I only see a performance degradation (RTX 3090):
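For context, below is a minimal sketch of the pattern, shown on a hypothetical toy module rather than the vLLM models in this PR (ToyAttention/ToyModel are stand-ins): the model is wrapped in torch.compile, and the few methods whose inputs Dynamo cannot handle are marked with @torch.compiler.disable so they keep running eagerly.

    import torch
    import torch.nn as nn

    class ToyAttention(nn.Module):
        # Stand-in for a layer torch.compile cannot trace (e.g. one calling a custom CUDA kernel).
        @torch.compiler.disable
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Runs eagerly; Dynamo inserts a graph break around this call.
            return torch.softmax(x @ x.transpose(-2, -1), dim=-1) @ x

    class ToyModel(nn.Module):
        def __init__(self, hidden: int = 64):
            super().__init__()
            self.proj = nn.Linear(hidden, hidden)
            self.attn = ToyAttention()

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.attn(self.proj(x))

    model = ToyModel().eval()
    compiled = torch.compile(model)          # default Inductor backend
    out = compiled(torch.randn(2, 16, 64))   # first call triggers compilation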

$ python benchmarks/benchmark_throughput.py --input-len 128 --output-len 128 --num-prompts 1000 --warm-up-prompts 1000 --model facebook/opt-125m
Namespace(backend='vllm', dataset=None, input_len=128, output_len=128, model='facebook/opt-125m', tokenizer='facebook/opt-125m', quantization=None, tensor_parallel_size=1, compile_model=False, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto')
INFO 12-15 04:11:19 llm_engine.py:73] Initializing an LLM engine with config: model='facebook/opt-125m', tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, compile_model=False, seed=0)
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py:114: UserWarning: WARNING: failed to get cudart_version from onnxruntime build info.
  warnings.warn("WARNING: failed to get cudart_version from onnxruntime build info.")
2023-12-15 04:11:23,098 filelock [DEBUG] - Attempting to acquire lock 139820010183728 on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,098 filelock [DEBUG] - Lock 139820010183728 acquired on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,221 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/facebook/opt-125m/revision/main HTTP/1.1" 200 3748
2023-12-15 04:11:23,223 filelock [DEBUG] - Attempting to release lock 139820010183728 on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,223 filelock [DEBUG] - Lock 139820010183728 released on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,223 filelock [DEBUG] - Attempting to acquire lock 139821305789792 on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,223 filelock [DEBUG] - Lock 139821305789792 acquired on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,330 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/facebook/opt-125m/revision/main HTTP/1.1" 200 3748
2023-12-15 04:11:23,331 filelock [DEBUG] - Attempting to release lock 139821305789792 on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,331 filelock [DEBUG] - Lock 139821305789792 released on /tmp/facebook-opt-125m.lock
INFO 12-15 04:11:24 llm_engine.py:223] # GPU blocks: 34503, # CPU blocks: 7281
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████| 1000/1000 [00:10<00:00, 91.47it/s]
Throughput: 91.43 requests/s, 23406.65 tokens/s
$ python benchmarks/benchmark_throughput.py --input-len 128 --output-len 128 --num-prompts 1000 --warm-up-prompts 1000 --model facebook/opt-125m --compile-model True
Namespace(backend='vllm', dataset=None, input_len=128, output_len=128, model='facebook/opt-125m', tokenizer='facebook/opt-125m', quantization=None, tensor_parallel_size=1, compile_model=True, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto')
INFO 12-15 04:10:07 llm_engine.py:73] Initializing an LLM engine with config: model='facebook/opt-125m', tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, compile_model=True, seed=0)
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py:114: UserWarning: WARNING: failed to get cudart_version from onnxruntime build info.
  warnings.warn("WARNING: failed to get cudart_version from onnxruntime build info.")
2023-12-15 04:10:11,338 filelock [DEBUG] - Attempting to acquire lock 140378255170608 on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,338 filelock [DEBUG] - Lock 140378255170608 acquired on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,500 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/facebook/opt-125m/revision/main HTTP/1.1" 200 3748
2023-12-15 04:10:11,502 filelock [DEBUG] - Attempting to release lock 140378255170608 on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,503 filelock [DEBUG] - Lock 140378255170608 released on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,503 filelock [DEBUG] - Attempting to acquire lock 140378858978608 on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,503 filelock [DEBUG] - Lock 140378858978608 acquired on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,617 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/facebook/opt-125m/revision/main HTTP/1.1" 200 3748
2023-12-15 04:10:11,619 filelock [DEBUG] - Attempting to release lock 140378858978608 on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,619 filelock [DEBUG] - Lock 140378858978608 released on /tmp/facebook-opt-125m.lock
INFO 12-15 04:10:18 llm_engine.py:223] # GPU blocks: 34524, # CPU blocks: 7281
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
  torch.has_cuda,
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
  torch.has_cudnn,
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
  torch.has_mps,
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
  torch.has_mkldnn,
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████| 1000/1000 [00:11<00:00, 85.64it/s]
Throughput: 85.61 requests/s, 21915.10 tokens/s

LLaMA 2 7B:

$ python benchmarks/benchmark_throughput.py --input-len 128 --output-len 128 --num-prompts 1000 --warm-up-prompts 1000 --model h2oai/h2ogpt-4096-llama2-7b-chat 
Namespace(backend='vllm', dataset=None, input_len=128, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=False, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto')
INFO 12-15 04:45:25 llm_engine.py:73] Initializing an LLM engine with config: model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, compile_model=False, seed=0)
INFO 12-15 04:45:25 tokenizer.py:32] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py:114: UserWarning: WARNING: failed to get cudart_version from onnxruntime build info.
  warnings.warn("WARNING: failed to get cudart_version from onnxruntime build info.")
2023-12-15 04:45:29,229 filelock [DEBUG] - Attempting to acquire lock 140026768499296 on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:45:29,229 filelock [DEBUG] - Lock 140026768499296 acquired on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:45:29,321 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/h2oai/h2ogpt-4096-llama2-7b-chat/revision/main HTTP/1.1" 200 2270
2023-12-15 04:45:29,325 filelock [DEBUG] - Attempting to release lock 140026768499296 on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:45:29,325 filelock [DEBUG] - Lock 140026768499296 released on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
INFO 12-15 04:45:32 llm_engine.py:223] # GPU blocks: 881, # CPU blocks: 512
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████| 2000/2000 [04:24<00:00,  7.56it/s]
Throughput: 7.56 requests/s, 1936.53 tokens/s
$ python benchmarks/benchmark_throughput.py --input-len 128 --output-len 128 --num-prompts 1000 --warm-up-prompts 1000 --model h2oai/h2ogpt-4096-llama2-7b-chat --compile-model True
Namespace(backend='vllm', dataset=None, input_len=128, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=True, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto')
INFO 12-15 04:35:47 llm_engine.py:73] Initializing an LLM engine with config: model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, compile_model=True, seed=0)
INFO 12-15 04:35:47 tokenizer.py:32] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py:114: UserWarning: WARNING: failed to get cudart_version from onnxruntime build info.
  warnings.warn("WARNING: failed to get cudart_version from onnxruntime build info.")
2023-12-15 04:35:50,578 filelock [DEBUG] - Attempting to acquire lock 139709952418400 on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:35:50,578 filelock [DEBUG] - Lock 139709952418400 acquired on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:35:50,689 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/h2oai/h2ogpt-4096-llama2-7b-chat/revision/main HTTP/1.1" 200 2270
2023-12-15 04:35:50,692 filelock [DEBUG] - Attempting to release lock 139709952418400 on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:35:50,692 filelock [DEBUG] - Lock 139709952418400 released on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
INFO 12-15 04:36:08 llm_engine.py:223] # GPU blocks: 876, # CPU blocks: 512
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████| 2000/2000 [04:27<00:00,  7.49it/s]
Throughput: 7.49 requests/s, 1916.22 tokens/s
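In short, with --compile-model=True, opt-125m drops from 91.43 to 85.61 requests/s (23406.65 to 21915.10 tokens/s) and llama2-7b drops from 7.56 to 7.49 requests/s (1936.53 to 1916.22 tokens/s).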

This PR can be considered a first step toward using torch.compile for further improvements.

BTW, the onnxrt backend returns:

onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Type Error: Type (seq(tensor(float16))) of output arg (_val_9) of node (_inline_aten_split_with_sizesn0) does not match expected type (seq(tensor(float))).
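For reference, selecting that backend is just a matter of passing it to torch.compile; the sketch below uses a hypothetical toy module (the error above came from compiling the actual vLLM model this way, not from this snippet), and it assumes onnxruntime is installed.

    import torch
    import torch.nn as nn

    # Hypothetical stand-in; the real failure happened with the vLLM model in fp16.
    toy = nn.Sequential(nn.Linear(64, 64), nn.GELU()).half().cuda()

    # Route compilation through ONNX Runtime instead of the default Inductor backend.
    compiled = torch.compile(toy, backend="onnxrt")
    out = compiled(torch.randn(2, 64, dtype=torch.float16, device="cuda"))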

sh1ng changed the title from "Compile model with torch.compile, unfortunatly without performance improvments" to "Compiled model with torch.compile, unfortunately without performance improvements" on Dec 15, 2023
sh1ng force-pushed the try-torch-compiler branch from 2bf9d5c to 179a630 on December 21, 2023 13:07
sh1ng (Contributor, Author) commented Dec 21, 2023

$ python benchmarks/benchmark_throughput.py --output-len 128 --num-prompts 1000 --warm-up-prompts 100 --model h2oai/h2ogpt-4096-llama2-7b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json
Namespace(backend='vllm', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', input_len=None, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=False, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=False)
Throughput: 4.77 requests/s, 1873.85 tokens/s
$ python benchmarks/benchmark_throughput.py --output-len 128 --num-prompts 1000 --warm-up-prompts 100 --model h2oai/h2ogpt-4096-llama2-7b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json --enforce-eager
Namespace(backend='vllm', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', input_len=None, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=False, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=True)
Throughput: 4.65 requests/s, 1827.69 tokens/s
$ python benchmarks/benchmark_throughput.py --output-len 128 --num-prompts 1000 --warm-up-prompts 100 --model h2oai/h2ogpt-4096-llama2-7b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json --enforce-eager --compile-model=True
Namespace(backend='vllm', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', input_len=None, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=True, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=True)
Throughput: 4.53 requests/s, 1778.97 tokens/s
$ python benchmarks/benchmark_throughput.py --output-len 128 --num-prompts 1000 --warm-up-prompts 100 --model h2oai/h2ogpt-4096-llama2-7b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json --compile-model=True
Namespace(backend='vllm', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', input_len=None, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=True, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=False)
Throughput: 4.67 requests/s, 1835.51 tokens/s
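To summarize the four ShareGPT runs above (llama2-7b-chat, output-len 128):

    enforce_eager=False, compile_model=False: 4.77 req/s, 1873.85 tok/s
    enforce_eager=True,  compile_model=False: 4.65 req/s, 1827.69 tok/s
    enforce_eager=True,  compile_model=True:  4.53 req/s, 1778.97 tok/s
    enforce_eager=False, compile_model=True:  4.67 req/s, 1835.51 tok/s

In both the eager and non-eager configurations, compile_model=True slightly reduces throughput.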

UIHCRITT commented Dec 22, 2023

Using your code, I ran vicuna-7b on a single L40 with torch 2.1.0+cu121 and vllm 0.2.2. I found:

(float + compile, bs=1, code from gpt-fast): 50.07 tokens/sec
(float + vllm, bs=1): 46.24 tokens/sec
(float + vllm + compile): 42.71 tokens/sec

So it also seems that torch.compile brings no performance improvement here.
On the other hand, I tried wrapping the model forward in torch.compile() directly, instead of using @torch.compiler.disable, in execute_model of vllm/worker/worker.py.

Before:

    output = self.model(
        input_ids=input_tokens,
        positions=input_positions,
        kv_caches=self.gpu_cache,
        input_metadata=input_metadata,
        cache_events=cache_events,
    )

After:

    def _model_forward(model, input_ids, positions, kv_caches, input_metadata, cache_events):
        return model(input_ids, positions, kv_caches, input_metadata, cache_events)

    model_forward = torch.compile(_model_forward, mode="reduce-overhead", fullgraph=True)
    output = model_forward(
        model=self.model,
        input_ids=input_tokens,
        positions=input_positions,
        kv_caches=self.gpu_cache,
        input_metadata=input_metadata,
        cache_events=cache_events,
    )

When I run this code, I get NotImplementedError: ProcessGroupVariable().
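That error suggests Dynamo hit a torch.distributed process-group operation it cannot trace under fullgraph=True. A possible workaround (untested here, just a sketch) is to allow graph breaks so those calls fall back to eager mode:

    # Allow Dynamo to break the graph around untraceable ops
    # (such as process-group collectives) instead of requiring one full graph.
    model_forward = torch.compile(_model_forward, mode="reduce-overhead", fullgraph=False)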

Lvjinhong commented:
For the latest version, v0.2.7, is there any meaningful acceleration from the compiler?

sh1ng force-pushed the try-torch-compiler branch from 45ee43e to 2637c51 on February 28, 2024 20:41
sh1ng force-pushed the try-torch-compiler branch from 2637c51 to 73f0f1a on February 28, 2024 20:53
sh1ng mentioned this pull request on Mar 1, 2024
github-actions bot commented:

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

github-actions bot added the "stale" (Over 90 days of inactivity) label on Oct 30, 2024
mergify bot commented Oct 30, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @sh1ng please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

github-actions bot added the "unstale" (Received activity after being labelled stale) label and removed the "stale" (Over 90 days of inactivity) label on Nov 2, 2024
hmellor closed this on Feb 17, 2025
WeNeedMoreCode pushed a commit to WeNeedMoreCode/vllm that referenced this pull request Dec 15, 2025
What this PR does / why we need it?
Test vllm_ascend/envs.py, which contains the environment variable definitions.

Does this PR introduce any user-facing change?
N/A

How was this patch tested?
CI passed with the newly added test.

vLLM version: v0.10.0
vLLM main:
vllm-project@9532a6d

- vLLM version: v0.10.0
- vLLM main:
vllm-project@b4e081c

---------

Signed-off-by: chengyuan <[email protected]>
Co-authored-by: chengyuan <[email protected]>
wz1qqx pushed a commit to wz1qqx/vllm that referenced this pull request Dec 29, 2025
…or Instance-Worker Management (vllm-project#2131)

* Refactor: Replace Complex Mappings with Hierarchical Tree Structure for Instance-Worker Management

Signed-off-by: baoloongmao <[email protected]>

* Fix gemini comment

Signed-off-by: baoloongmao <[email protected]>

* Add todo comment

Signed-off-by: baoloongmao <[email protected]>

---------

Signed-off-by: baoloongmao <[email protected]>

Labels: frontend, needs-rebase, unstale
