
feat: add cuda memory fraction #659

Merged
OlivierDehaene merged 2 commits into main from feat/memory_fraction on Jul 24, 2023

Conversation

@OlivierDehaene (Contributor) commented Jul 20, 2023

Closes #673
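For context, the flag maps onto PyTorch's per-process memory cap. A minimal sketch of the mechanism (the helper name and deferred import are my own illustration, not TGI's actual code): PyTorch's caching allocator will raise `OutOfMemoryError` once this process exceeds `fraction * total_capacity` on a device, even if the GPU still has free memory.

```python
def apply_memory_fraction(fraction: float) -> None:
    """Cap this process's CUDA allocations to `fraction` of each visible GPU.

    Illustrative sketch of what --cuda-memory-fraction does under the hood,
    via torch.cuda.set_per_process_memory_fraction.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Deferred import so the validation path works on CPU-only machines.
    import torch

    if torch.cuda.is_available():
        for device in range(torch.cuda.device_count()):
            # Allocations beyond fraction * capacity on this device will
            # now raise torch.cuda.OutOfMemoryError.
            torch.cuda.set_per_process_memory_fraction(fraction, device)
```

Note the cap is enforced per process, which is why each tensor-parallel shard gets its own budget on its own GPU.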

@shayan1897 commented

I'm not sure this is working as expected: I set the memory fraction to 0.5, but it still tries to allocate the same amount of memory.
@OlivierDehaene (Contributor, Author) commented
```shell
docker run -v /data:/data -p 3000:80 --gpus all --pull always --shm-size 10g ghcr.io/huggingface/text-generation-inference:sha-37df6df --model-id meta-llama/Llama-2-13b-hf --num-shard 4 --cuda-memory-fraction 0.5
```

correctly distributes the model across 4 GPUs, using half of the memory on each.

@shayan1897 commented Jul 25, 2023

> docker run -v /data:/data -p 3000:80 --gpus all --pull always --shm-size 10g ghcr.io/huggingface/text-generation-inference:sha-37df6df --model-id meta-llama/Llama-2-13b-hf --num-shard 4 --cuda-memory-fraction 0.5
>
> correctly distributes the model across 4 GPUs, using half of the memory on each.

I'm running this on SageMaker with an ml.g5.48xlarge instance, which has A10G GPUs and 768 GiB of memory; the model is meta-llama/Llama-2-70b-chat-hf.

Here is the log from running it with a 0.5 fraction:

```
2023-07-25T0626.205291Z INFO text_generation_launcher: Args { model_id: "/opt/ml/model", revision: None, validation_workers: 2, sharded: None, num_shard: Some(4), quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "container-0.local", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 0.5, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 136, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 185, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 65, in __init__
    model = FlashLlamaForCausalLM(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 456, in __init__
    self.model = FlashLlamaModel(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 394, in __init__
    [
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 395, in <listcomp>
    FlashLlamaLayer(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 334, in __init__
    self.mlp = LlamaMLP(prefix=f"{prefix}.mlp", config=config, weights=weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 304, in __init__
    self.gate_up_proj = TensorParallelColumnLinear.load_multi(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 264, in load_multi
    weight = weights.get_multi_weights_col(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 134, in get_multi_weights_col
    w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 134, in <listcomp>
    w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 107, in get_sharded
    return self.get_partial_sharded(tensor_name, dim)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 95, in get_partial_sharded
    tensor = tensor.to(device=self.device)

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 2; 22.20 GiB total capacity; 10.95 GiB already allocated; 10.13 GiB free; 11.10 GiB allowed; 11.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
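Incidentally, the numbers in that OOM message show the 0.5 cap is actually being enforced: the "allowed" budget is 0.5 × 22.20 GiB = 11.10 GiB, and reserving another 112 MiB on top of the 11.08 GiB already reserved would exceed it, which is why the allocator raises despite the GPU reporting 10.13 GiB free. A quick check with the values copied from the log:

```python
# Values copied from the OOM message above (GPU 2).
total_gib = 22.20            # total capacity
fraction = 0.5               # --cuda-memory-fraction
allowed_gib = 11.10          # "allowed" budget reported by PyTorch
reserved_gib = 11.08         # "reserved in total by PyTorch"
request_gib = 112.0 / 1024   # the failing 112.00 MiB allocation

# The budget is exactly fraction * capacity ...
assert abs(total_gib * fraction - allowed_gib) < 0.01
# ... and another 112 MiB would push reserved memory past it, so the
# allocator raises even though the GPU itself still has memory free.
assert reserved_gib + request_gib > allowed_gib
```

So the cap works; the problem is that half of an A10G is simply not enough for this model at num_shard=4.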

@OlivierDehaene (Contributor, Author) commented
Well, yes, this argument is not magic. Your instance is too small to fit llama-2-70b-chat-hf: you need at least 2× A100 80GB.

@shayan1897 commented

> Well, yes, this argument is not magic. Your instance is too small to fit llama-2-70b-chat-hf: you need at least 2× A100 80GB.

That's the biggest g5 instance on AWS (g5.48xlarge): 8 GPUs and 192 GB of GPU memory. The memory allocation stays the same with both 1.0 and 0.5 ("Tried to allocate 112.00 MiB" in both cases). I also halved max-batch-prefill-tokens and max-batch-total-tokens, but the same thing happened. Could it be something else, then?

@OlivierDehaene (Contributor, Author) commented
can you try num_shard=8?

@shayan1897 commented

> can you try num_shard=8?

That fixed it, thanks a lot @OlivierDehaene
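A rough sizing argument shows why 8 shards fit where 4 did not (back-of-envelope only: it assumes fp16 weights and an even tensor-parallel split, ignores KV cache and activation headroom, and note that --cuda-memory-fraction shrinks each budget further):

```python
# Approximate weight footprint of meta-llama/Llama-2-70b-chat-hf in fp16.
params = 70e9
bytes_per_param = 2  # fp16
weights_gib = params * bytes_per_param / 2**30  # ~130.4 GiB

# Usable capacity per A10G as reported in the OOM message above.
gpu_capacity_gib = 22.20

def per_shard_gib(num_shard: int) -> float:
    # Tensor parallelism splits the weights roughly evenly across shards.
    return weights_gib / num_shard

assert per_shard_gib(4) > gpu_capacity_gib  # ~32.6 GiB per GPU: cannot fit
assert per_shard_gib(8) < gpu_capacity_gib  # ~16.3 GiB per GPU: fits
```

With 4 shards the weights alone exceed each A10G, so no memory-fraction setting could have helped; spreading them over all 8 GPUs is what made the difference.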



Development

Successfully merging this pull request may close these issues.

Flash-attn model taking up all available GPU space.
