
feat: add cuda memory fraction #659

Merged
OlivierDehaene merged 2 commits into main from feat/memory_fraction on Jul 24, 2023

Conversation

@OlivierDehaene (Contributor) commented Jul 20, 2023

Closes #673
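For context, the flag maps onto PyTorch's per-process memory cap. A minimal sketch of the mechanism (the helper name and deferred import are my own illustration, not TGI's actual code): PyTorch's caching allocator will raise `OutOfMemoryError` once this process exceeds `fraction * total_capacity` on a device, even if the GPU still has free memory.

```python
def apply_memory_fraction(fraction: float) -> None:
    """Cap this process's CUDA allocations to `fraction` of each visible GPU.

    Illustrative sketch of what --cuda-memory-fraction does under the hood,
    via torch.cuda.set_per_process_memory_fraction.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Deferred import so the validation path works on CPU-only machines.
    import torch

    if torch.cuda.is_available():
        for device in range(torch.cuda.device_count()):
            # Allocations beyond fraction * capacity on this device will
            # now raise torch.cuda.OutOfMemoryError.
            torch.cuda.set_per_process_memory_fraction(fraction, device)
```

Note the cap is enforced per process, which is why each tensor-parallel shard gets its own budget on its own GPU.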

@shayan1897 commented

I'm not sure this is working as expected: I set the memory fraction to 0.5, but it still tries to allocate the same amount of memory.
@OlivierDehaene (Contributor, Author) commented
```shell
docker run -v /data:/data -p 3000:80 --gpus all --pull always --shm-size 10g ghcr.io/huggingface/text-generation-inference:sha-37df6df --model-id meta-llama/Llama-2-13b-hf --num-shard 4 --cuda-memory-fraction 0.5
```

correctly distributes the model across 4 GPUs, using half of the memory on each.

@shayan1897 commented Jul 25, 2023

> docker run -v /data:/data -p 3000:80 --gpus all --pull always --shm-size 10g ghcr.io/huggingface/text-generation-inference:sha-37df6df --model-id meta-llama/Llama-2-13b-hf --num-shard 4 --cuda-memory-fraction 0.5
>
> correctly distributes the model across 4 GPUs, using half of the memory on each.

I'm running this on SageMaker with an ml.g5.48xlarge instance, which has A10G GPUs and 768 GiB of memory; the model is meta-llama/Llama-2-70b-chat-hf.

Here is the log from running it with a 0.5 fraction:

```
2023-07-25T0626.205291Z INFO text_generation_launcher: Args { model_id: "/opt/ml/model", revision: None, validation_workers: 2, sharded: None, num_shard: Some(4), quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "container-0.local", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 0.5, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 136, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 185, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 65, in __init__
    model = FlashLlamaForCausalLM(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 456, in __init__
    self.model = FlashLlamaModel(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 394, in __init__
    [
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 395, in <listcomp>
    FlashLlamaLayer(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 334, in __init__
    self.mlp = LlamaMLP(prefix=f"{prefix}.mlp", config=config, weights=weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 304, in __init__
    self.gate_up_proj = TensorParallelColumnLinear.load_multi(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 264, in load_multi
    weight = weights.get_multi_weights_col(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 134, in get_multi_weights_col
    w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 134, in <listcomp>
    w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 107, in get_sharded
    return self.get_partial_sharded(tensor_name, dim)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 95, in get_partial_sharded
    tensor = tensor.to(device=self.device)

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 2; 22.20 GiB total capacity; 10.95 GiB already allocated; 10.13 GiB free; 11.10 GiB allowed; 11.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
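Incidentally, the numbers in that OOM message show the 0.5 cap is actually being enforced: the "allowed" budget is 0.5 × 22.20 GiB = 11.10 GiB, and reserving another 112 MiB on top of the 11.08 GiB already reserved would exceed it, which is why the allocator raises despite the GPU reporting 10.13 GiB free. A quick check with the values copied from the log:

```python
# Values copied from the OOM message above (GPU 2).
total_gib = 22.20            # total capacity
fraction = 0.5               # --cuda-memory-fraction
allowed_gib = 11.10          # "allowed" budget reported by PyTorch
reserved_gib = 11.08         # "reserved in total by PyTorch"
request_gib = 112.0 / 1024   # the failing 112.00 MiB allocation

# The budget is exactly fraction * capacity ...
assert abs(total_gib * fraction - allowed_gib) < 0.01
# ... and another 112 MiB would push reserved memory past it, so the
# allocator raises even though the GPU itself still has memory free.
assert reserved_gib + request_gib > allowed_gib
```

So the cap works; the problem is that half of an A10G is simply not enough for this model at num_shard=4.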

@OlivierDehaene (Contributor, Author) commented
Well, yes, this argument is not magic. Your instance is too small to fit llama-2-70b-chat-hf: you need at least 2× A100 80GB.

@shayan1897 commented

> Well, yes, this argument is not magic. Your instance is too small to fit llama-2-70b-chat-hf: you need at least 2× A100 80GB.

That's the biggest g5 instance on AWS (g5.48xlarge): 8 GPUs and 192 GB of GPU memory. The memory allocation stays the same with both 1.0 and 0.5 ("Tried to allocate 112.00 MiB" in both cases). I also halved max-batch-prefill-tokens and max-batch-total-tokens, but the same thing happened. Could it be something else, then?

@OlivierDehaene (Contributor, Author) commented
can you try num_shard=8?

@shayan1897 commented

> can you try num_shard=8?

That fixed it, thanks a lot @OlivierDehaene
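A rough sizing argument shows why 8 shards fit where 4 did not (back-of-envelope only: it assumes fp16 weights and an even tensor-parallel split, ignores KV cache and activation headroom, and note that --cuda-memory-fraction shrinks each budget further):

```python
# Approximate weight footprint of meta-llama/Llama-2-70b-chat-hf in fp16.
params = 70e9
bytes_per_param = 2  # fp16
weights_gib = params * bytes_per_param / 2**30  # ~130.4 GiB

# Usable capacity per A10G as reported in the OOM message above.
gpu_capacity_gib = 22.20

def per_shard_gib(num_shard: int) -> float:
    # Tensor parallelism splits the weights roughly evenly across shards.
    return weights_gib / num_shard

assert per_shard_gib(4) > gpu_capacity_gib  # ~32.6 GiB per GPU: cannot fit
assert per_shard_gib(8) < gpu_capacity_gib  # ~16.3 GiB per GPU: fits
```

With 4 shards the weights alone exceed each A10G, so no memory-fraction setting could have helped; spreading them over all 8 GPUs is what made the difference.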



Development

Successfully merging this pull request may close these issues.

Flash-attn model taking up all available GPU space.
