Conversation
I'm not sure if this is working as expected. I assigned 0.5 for the memory fraction, but it still tries to allocate the same memory size.

I expected this command to correctly distribute the model across 4 GPUs, using half of the GPU memory on each:

```shell
docker run -v /data:/data -p 3000:80 --gpus all --pull always --shm-size 10g \
  ghcr.io/huggingface/text-generation-inference:sha-37df6df \
  --model-id meta-llama/Llama-2-13b-hf --num-shard 4 --cuda-memory-fraction 0.5
```
I'm running this on SageMaker. Here is the log when running it with a 0.5 fraction:

```
2023-07-25T0626.205291Z INFO text_generation_launcher: Args { model_id: "/opt/ml/model", revision: None, validation_workers: 2, sharded: None, num_shard: Some(4), quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "container-0.local", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 0.5, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 136, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 185, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 65, in __init__
    model = FlashLlamaForCausalLM(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 456, in __init__
    self.model = FlashLlamaModel(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 394, in __init__
    [
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 395, in <listcomp>
    FlashLlamaLayer(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 334, in __init__
    self.mlp = LlamaMLP(prefix=f"{prefix}.mlp", config=config, weights=weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 304, in __init__
    self.gate_up_proj = TensorParallelColumnLinear.load_multi(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 264, in load_multi
    weight = weights.get_multi_weights_col(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 134, in get_multi_weights_col
    w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 134, in <listcomp>
    w = [self.get_sharded(f"{p}.weight", dim=0) for p in prefixes]
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 107, in get_sharded
    return self.get_partial_sharded(tensor_name, dim)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 95, in get_partial_sharded
    tensor = tensor.to(device=self.device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 2; 22.20 GiB total capacity; 10.95 GiB already allocated; 10.13 GiB free; 11.10 GiB allowed; 11.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
Well, yes, this argument is not magic. Your instance is too small to fit llama-2-70b-chat-hf: you need at least 2x A100 80GB.
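For reference, a minimal sketch of how a fraction flag like `--cuda-memory-fraction` is typically enforced, assuming it is wired through to PyTorch's per-process cap (the helper name here is hypothetical, not TGI's actual code):

```python
def apply_memory_fraction(fraction: float) -> None:
    """Hypothetical helper: cap each visible GPU's PyTorch allocation pool
    at `fraction` of its total capacity. Allocations past the cap raise
    torch.cuda.OutOfMemoryError even if the physical GPU has free memory.
    """
    import torch  # imported lazily so the sketch is inert without PyTorch

    if torch.cuda.is_available():
        for device_id in range(torch.cuda.device_count()):
            # Caps the allocator's budget; does NOT shrink the model,
            # so per-tensor allocation sizes stay the same.
            torch.cuda.set_per_process_memory_fraction(fraction, device_id)
```

Note that this caps the allocator's budget only; the weight tensors the server loads are the same size regardless of the fraction.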
That's the biggest g5 instance on AWS (g5.48xlarge): 8 GPUs and 192 GB of GPU memory. The memory allocation remains the same with both 1.0 and 0.5: `Tried to allocate 112.00 MiB`, the same memory size.
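As a back-of-envelope check (my reading of the log above, not the server's code): the fraction caps the *allowed* pool per GPU, while the size of each failing weight-shard allocation is fixed by the model and shard count, which is why 112.00 MiB shows up under both fractions:

```python
# Numbers taken from the OOM message in the log above (GPU 2).
# Assumption: --cuda-memory-fraction caps PyTorch's per-process pool at
# fraction * total capacity; it does not shrink the tensors themselves.
total_capacity_gib = 22.20  # "22.20 GiB total capacity"
fraction = 0.5

allowed_gib = total_capacity_gib * fraction
print(f"allowed pool: {allowed_gib:.2f} GiB")  # → "allowed pool: 11.10 GiB"

# The failing allocation is a weight shard, so its size depends on the
# model and --num-shard, not on the fraction:
print("tried to allocate: 112.00 MiB, with fraction 0.5 and 1.0 alike")
```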
Can you try
That fixed it, thanks a lot @OlivierDehaene |
Close #673