
Conversation

sh1ng (Contributor) commented Dec 15, 2023

A follow-up to #42, cc @zhuohan123.

torch.jit.script and TorchScript can't be used because the forward methods take parameters whose types are not supported: https://pytorch.org/docs/stable/jit_language_reference.html#supported-type.
torch.jit.trace looks even more challenging.
I was only able to make the model run with torch.compile plus a few @torch.compiler.disable annotations. Unfortunately, I only see a performance degradation (RTX 3090):
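For context, below is a minimal sketch of the pattern, shown on a hypothetical toy module rather than the vLLM models in this PR (ToyAttention/ToyModel are stand-ins): the model is wrapped in torch.compile, and the few methods whose inputs Dynamo cannot handle are marked with @torch.compiler.disable so they keep running eagerly.

    import torch
    import torch.nn as nn

    class ToyAttention(nn.Module):
        # Stand-in for a layer torch.compile cannot trace (e.g. one calling a custom CUDA kernel).
        @torch.compiler.disable
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Runs eagerly; Dynamo inserts a graph break around this call.
            return torch.softmax(x @ x.transpose(-2, -1), dim=-1) @ x

    class ToyModel(nn.Module):
        def __init__(self, hidden: int = 64):
            super().__init__()
            self.proj = nn.Linear(hidden, hidden)
            self.attn = ToyAttention()

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.attn(self.proj(x))

    model = ToyModel().eval()
    compiled = torch.compile(model)          # default Inductor backend
    out = compiled(torch.randn(2, 16, 64))   # first call triggers compilation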

$ python benchmarks/benchmark_throughput.py --input-len 128 --output-len 128 --num-prompts 1000 --warm-up-prompts 1000 --model facebook/opt-125m
Namespace(backend='vllm', dataset=None, input_len=128, output_len=128, model='facebook/opt-125m', tokenizer='facebook/opt-125m', quantization=None, tensor_parallel_size=1, compile_model=False, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto')
INFO 12-15 04:11:19 llm_engine.py:73] Initializing an LLM engine with config: model='facebook/opt-125m', tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, compile_model=False, seed=0)
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py:114: UserWarning: WARNING: failed to get cudart_version from onnxruntime build info.
  warnings.warn("WARNING: failed to get cudart_version from onnxruntime build info.")
2023-12-15 04:11:23,098 filelock [DEBUG] - Attempting to acquire lock 139820010183728 on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,098 filelock [DEBUG] - Lock 139820010183728 acquired on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,221 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/facebook/opt-125m/revision/main HTTP/1.1" 200 3748
2023-12-15 04:11:23,223 filelock [DEBUG] - Attempting to release lock 139820010183728 on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,223 filelock [DEBUG] - Lock 139820010183728 released on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,223 filelock [DEBUG] - Attempting to acquire lock 139821305789792 on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,223 filelock [DEBUG] - Lock 139821305789792 acquired on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,330 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/facebook/opt-125m/revision/main HTTP/1.1" 200 3748
2023-12-15 04:11:23,331 filelock [DEBUG] - Attempting to release lock 139821305789792 on /tmp/facebook-opt-125m.lock
2023-12-15 04:11:23,331 filelock [DEBUG] - Lock 139821305789792 released on /tmp/facebook-opt-125m.lock
INFO 12-15 04:11:24 llm_engine.py:223] # GPU blocks: 34503, # CPU blocks: 7281
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████| 1000/1000 [00:10<00:00, 91.47it/s]
Throughput: 91.43 requests/s, 23406.65 tokens/s
$ python benchmarks/benchmark_throughput.py --input-len 128 --output-len 128 --num-prompts 1000 --warm-up-prompts 1000 --model facebook/opt-125m --compile-model True
Namespace(backend='vllm', dataset=None, input_len=128, output_len=128, model='facebook/opt-125m', tokenizer='facebook/opt-125m', quantization=None, tensor_parallel_size=1, compile_model=True, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto')
INFO 12-15 04:10:07 llm_engine.py:73] Initializing an LLM engine with config: model='facebook/opt-125m', tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, compile_model=True, seed=0)
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py:114: UserWarning: WARNING: failed to get cudart_version from onnxruntime build info.
  warnings.warn("WARNING: failed to get cudart_version from onnxruntime build info.")
2023-12-15 04:10:11,338 filelock [DEBUG] - Attempting to acquire lock 140378255170608 on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,338 filelock [DEBUG] - Lock 140378255170608 acquired on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,500 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/facebook/opt-125m/revision/main HTTP/1.1" 200 3748
2023-12-15 04:10:11,502 filelock [DEBUG] - Attempting to release lock 140378255170608 on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,503 filelock [DEBUG] - Lock 140378255170608 released on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,503 filelock [DEBUG] - Attempting to acquire lock 140378858978608 on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,503 filelock [DEBUG] - Lock 140378858978608 acquired on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,617 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/facebook/opt-125m/revision/main HTTP/1.1" 200 3748
2023-12-15 04:10:11,619 filelock [DEBUG] - Attempting to release lock 140378858978608 on /tmp/facebook-opt-125m.lock
2023-12-15 04:10:11,619 filelock [DEBUG] - Lock 140378858978608 released on /tmp/facebook-opt-125m.lock
INFO 12-15 04:10:18 llm_engine.py:223] # GPU blocks: 34524, # CPU blocks: 7281
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
  torch.has_cuda,
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
  torch.has_cudnn,
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
  torch.has_mps,
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
  torch.has_mkldnn,
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████| 1000/1000 [00:11<00:00, 85.64it/s]
Throughput: 85.61 requests/s, 21915.10 tokens/s

LLaMA 2 7B:

$ python benchmarks/benchmark_throughput.py --input-len 128 --output-len 128 --num-prompts 1000 --warm-up-prompts 1000 --model h2oai/h2ogpt-4096-llama2-7b-chat 
Namespace(backend='vllm', dataset=None, input_len=128, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=False, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto')
INFO 12-15 04:45:25 llm_engine.py:73] Initializing an LLM engine with config: model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, compile_model=False, seed=0)
INFO 12-15 04:45:25 tokenizer.py:32] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py:114: UserWarning: WARNING: failed to get cudart_version from onnxruntime build info.
  warnings.warn("WARNING: failed to get cudart_version from onnxruntime build info.")
2023-12-15 04:45:29,229 filelock [DEBUG] - Attempting to acquire lock 140026768499296 on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:45:29,229 filelock [DEBUG] - Lock 140026768499296 acquired on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:45:29,321 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/h2oai/h2ogpt-4096-llama2-7b-chat/revision/main HTTP/1.1" 200 2270
2023-12-15 04:45:29,325 filelock [DEBUG] - Attempting to release lock 140026768499296 on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:45:29,325 filelock [DEBUG] - Lock 140026768499296 released on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
INFO 12-15 04:45:32 llm_engine.py:223] # GPU blocks: 881, # CPU blocks: 512
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████| 2000/2000 [04:24<00:00,  7.56it/s]
Throughput: 7.56 requests/s, 1936.53 tokens/s
$ python benchmarks/benchmark_throughput.py --input-len 128 --output-len 128 --num-prompts 1000 --warm-up-prompts 1000 --model h2oai/h2ogpt-4096-llama2-7b-chat --compile-model True
Namespace(backend='vllm', dataset=None, input_len=128, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=True, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto')
INFO 12-15 04:35:47 llm_engine.py:73] Initializing an LLM engine with config: model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, compile_model=True, seed=0)
INFO 12-15 04:35:47 tokenizer.py:32] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_validation.py:114: UserWarning: WARNING: failed to get cudart_version from onnxruntime build info.
  warnings.warn("WARNING: failed to get cudart_version from onnxruntime build info.")
2023-12-15 04:35:50,578 filelock [DEBUG] - Attempting to acquire lock 139709952418400 on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:35:50,578 filelock [DEBUG] - Lock 139709952418400 acquired on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:35:50,689 urllib3.connectionpool [DEBUG] - https://huggingface.co:443 "GET /api/models/h2oai/h2ogpt-4096-llama2-7b-chat/revision/main HTTP/1.1" 200 2270
2023-12-15 04:35:50,692 filelock [DEBUG] - Attempting to release lock 139709952418400 on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
2023-12-15 04:35:50,692 filelock [DEBUG] - Lock 139709952418400 released on /tmp/h2oai-h2ogpt-4096-llama2-7b-chat.lock
INFO 12-15 04:36:08 llm_engine.py:223] # GPU blocks: 876, # CPU blocks: 512
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████| 2000/2000 [04:27<00:00,  7.49it/s]
Throughput: 7.49 requests/s, 1916.22 tokens/s
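In short, with --compile-model=True, opt-125m drops from 91.43 to 85.61 requests/s (23406.65 to 21915.10 tokens/s) and llama2-7b drops from 7.56 to 7.49 requests/s (1936.53 to 1916.22 tokens/s).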

This PR can be considered a first step toward using torch.compile for further improvements.

BTW, the onnxrt backend returns:

onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Type Error: Type (seq(tensor(float16))) of output arg (_val_9) of node (_inline_aten_split_with_sizesn0) does not match expected type (seq(tensor(float))).
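For reference, selecting that backend is just a matter of passing it to torch.compile; the sketch below uses a hypothetical toy module (the error above came from compiling the actual vLLM model this way, not from this snippet), and it assumes onnxruntime is installed.

    import torch
    import torch.nn as nn

    # Hypothetical stand-in; the real failure happened with the vLLM model in fp16.
    toy = nn.Sequential(nn.Linear(64, 64), nn.GELU()).half().cuda()

    # Route compilation through ONNX Runtime instead of the default Inductor backend.
    compiled = torch.compile(toy, backend="onnxrt")
    out = compiled(torch.randn(2, 64, dtype=torch.float16, device="cuda"))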

sh1ng changed the title from "Compile model with torch.compile, unfortunatly without performance improvments" to "Compiled model with torch.compile, unfortunately without performance improvements" on Dec 15, 2023
sh1ng force-pushed the try-torch-compiler branch from 2bf9d5c to 179a630 on December 21, 2023 13:07
sh1ng (Contributor, Author) commented Dec 21, 2023

$ python benchmarks/benchmark_throughput.py --output-len 128 --num-prompts 1000 --warm-up-prompts 100 --model h2oai/h2ogpt-4096-llama2-7b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json
Namespace(backend='vllm', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', input_len=None, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=False, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=False)
Throughput: 4.77 requests/s, 1873.85 tokens/s
$ python benchmarks/benchmark_throughput.py --output-len 128 --num-prompts 1000 --warm-up-prompts 100 --model h2oai/h2ogpt-4096-llama2-7b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json --enforce-eager
Namespace(backend='vllm', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', input_len=None, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=False, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=True)
Throughput: 4.65 requests/s, 1827.69 tokens/s
$ python benchmarks/benchmark_throughput.py --output-len 128 --num-prompts 1000 --warm-up-prompts 100 --model h2oai/h2ogpt-4096-llama2-7b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json --enforce-eager --compile-model=True
Namespace(backend='vllm', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', input_len=None, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=True, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=True)
Throughput: 4.53 requests/s, 1778.97 tokens/s
$ python benchmarks/benchmark_throughput.py --output-len 128 --num-prompts 1000 --warm-up-prompts 100 --model h2oai/h2ogpt-4096-llama2-7b-chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json --compile-model=True
Namespace(backend='vllm', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', input_len=None, output_len=128, model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', quantization=None, tensor_parallel_size=1, compile_model=True, n=1, use_beam_search=False, num_prompts=1000, warm_up_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=False)
Throughput: 4.67 requests/s, 1835.51 tokens/s
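To summarize the four ShareGPT runs above (llama2-7b-chat, output-len 128):

    enforce_eager=False, compile_model=False: 4.77 req/s, 1873.85 tok/s
    enforce_eager=True,  compile_model=False: 4.65 req/s, 1827.69 tok/s
    enforce_eager=True,  compile_model=True:  4.53 req/s, 1778.97 tok/s
    enforce_eager=False, compile_model=True:  4.67 req/s, 1835.51 tok/s

In both the eager and non-eager configurations, compile_model=True slightly reduces throughput.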

UIHCRITT commented Dec 22, 2023

Using your code, I ran vicuna-7b on a single L40 with torch 2.1.0+cu121 and vllm 0.2.2. I found:

(float + compile, bs=1, code from gpt-fast): 50.07 tokens/sec
(float + vllm, bs=1): 46.24 tokens/sec
(float + vllm + compile): 42.71 tokens/sec

So it also seems that torch.compile brings no performance improvement here.
On the other hand, I tried wrapping the model forward in torch.compile() directly, instead of using @torch.compiler.disable, in execute_model of vllm/worker/worker.py.

Before:

    output = self.model(
        input_ids=input_tokens,
        positions=input_positions,
        kv_caches=self.gpu_cache,
        input_metadata=input_metadata,
        cache_events=cache_events,
    )

After:

    def _model_forward(model, input_ids, positions, kv_caches, input_metadata, cache_events):
        return model(input_ids, positions, kv_caches, input_metadata, cache_events)

    model_forward = torch.compile(_model_forward, mode="reduce-overhead", fullgraph=True)
    output = model_forward(
        model=self.model,
        input_ids=input_tokens,
        positions=input_positions,
        kv_caches=self.gpu_cache,
        input_metadata=input_metadata,
        cache_events=cache_events,
    )

When I run this code, I get NotImplementedError: ProcessGroupVariable().
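That error suggests Dynamo hit a torch.distributed process-group operation it cannot trace under fullgraph=True. A possible workaround (untested here, just a sketch) is to allow graph breaks so those calls fall back to eager mode:

    # Allow Dynamo to break the graph around untraceable ops
    # (such as process-group collectives) instead of requiring one full graph.
    model_forward = torch.compile(_model_forward, mode="reduce-overhead", fullgraph=False)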

Lvjinhong commented:
For the latest version, v0.2.7, is there any meaningful acceleration from the compiler?

sh1ng force-pushed the try-torch-compiler branch from 45ee43e to 2637c51 on February 28, 2024 20:41
sh1ng force-pushed the try-torch-compiler branch from 2637c51 to 73f0f1a on February 28, 2024 20:53
sh1ng mentioned this pull request on Mar 1, 2024
github-actions bot commented:

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

github-actions bot added the "stale" (Over 90 days of inactivity) label on Oct 30, 2024
mergify bot commented Oct 30, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @sh1ng please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

github-actions bot added the "unstale" (Received activity after being labelled stale) label and removed the "stale" (Over 90 days of inactivity) label on Nov 2, 2024
hmellor closed this on Feb 17, 2025
WeNeedMoreCode pushed a commit to WeNeedMoreCode/vllm that referenced this pull request Dec 15, 2025
What this PR does / why we need it?
Test vllm_ascend/envs.py, which contains the environment variable definitions.

Does this PR introduce any user-facing change?
N/A

How was this patch tested?
CI passed with the newly added test.

vLLM version: v0.10.0
vLLM main:
vllm-project@9532a6d

- vLLM version: v0.10.0
- vLLM main:
vllm-project@b4e081c

---------

Signed-off-by: chengyuan <[email protected]>
Co-authored-by: chengyuan <[email protected]>
wz1qqx pushed a commit to wz1qqx/vllm that referenced this pull request Dec 29, 2025
…or Instance-Worker Management (vllm-project#2131)

* Refactor: Replace Complex Mappings with Hierarchical Tree Structure for Instance-Worker Management

Signed-off-by: baoloongmao <[email protected]>

* Fix gemini comment

Signed-off-by: baoloongmao <[email protected]>

* Add todo comment

Signed-off-by: baoloongmao <[email protected]>

---------

Signed-off-by: baoloongmao <[email protected]>

Labels: frontend, needs-rebase, unstale
