[lora] enforce_eager=true slows down generation time dramatically with LoRA #665
Conversation
Bandaid fix to ensure enforce_eager=false always for LoRA runs
/gemini review
Code Review
This pull request introduces a workaround for a known performance issue in vLLM where using LoRA with enforce_eager=true causes significant slowdowns. The change automatically disables enforce_eager for LoRA runs when using the vLLM backend and issues a warning, which is a sensible approach. The implementation is clean and also correctly plumbs through the new fully_sharded_loras configuration option. My feedback includes a minor suggestion to improve the robustness of the backend check.
```python
engine_kwargs["fully_sharded_loras"] = cfg.generator.fully_sharded_loras

# TODO(devpatel): Bandaid solution, replace this once we have a better solution
# for LoRA performance degradation on the vLLM side
if cfg.generator.enforce_eager and cfg.generator.backend == "vllm":
```
For improved robustness, it's a good practice to make the backend check case-insensitive. The configuration value for backend could potentially be provided with different casings (e.g., 'vllm', 'VLLM'). Using .lower() will ensure this check works as expected in all cases.
```diff
-if cfg.generator.enforce_eager and cfg.generator.backend == "vllm":
+if cfg.generator.enforce_eager and cfg.generator.backend.lower() == "vllm":
```
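Putting the suggestion together with the warning-and-override behavior described in the PR, the full guard might look like the sketch below. The helper name, config shape, and logger setup are assumptions for illustration, not the PR's actual code:

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)

def apply_lora_eager_workaround(cfg, engine_kwargs):
    """Bandaid: LoRA with enforce_eager=true is dramatically slower on vLLM,
    so force eager mode off for vLLM LoRA runs and warn the user."""
    # Case-insensitive backend check, per the review suggestion.
    if cfg.generator.enforce_eager and cfg.generator.backend.lower() == "vllm":
        logger.warning(
            "enforce_eager=true degrades LoRA generation performance with the "
            "vLLM backend; overriding to enforce_eager=false for this run."
        )
        engine_kwargs["enforce_eager"] = False
    return engine_kwargs

# Example: a config with enforce_eager=true and backend spelled "VLLM"
# still triggers the override thanks to .lower().
cfg = SimpleNamespace(generator=SimpleNamespace(enforce_eager=True, backend="VLLM"))
print(apply_lora_eager_workaround(cfg, {}))
```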
[lora] enforce_eager=true slows down generation time dramatically with LoRA (NovaSky-AI#665)
Unfortunately, this appears to be a widely reported vLLM issue that has not yet been addressed. I've added some additional vLLM configuration flags (fully sharded LoRA) and double-checked that max_lora_rank equals the input LoRA rank, which was also reported as a potential cause.
For now, I've implemented a bandaid solution: we always set enforce_eager=false for LoRA runs and emit a warning, preventing slowdowns in all training runs. This is in line with vLLM's suggested fixes for the generator.
See vllm-project/vllm#13204 and vllm-project/vllm#9452
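The max_lora_rank sanity check mentioned in the description can be sketched as follows; the function and argument names are illustrative, not part of vLLM's API:

```python
def check_lora_rank(max_lora_rank: int, adapter_rank: int) -> None:
    """Fail fast if the engine's max_lora_rank does not match the adapter rank.

    A mismatch between max_lora_rank and the input LoRA rank has been reported
    as a potential cause of LoRA generation slowdowns in vLLM.
    """
    if max_lora_rank != adapter_rank:
        raise ValueError(
            f"max_lora_rank ({max_lora_rank}) should equal the input LoRA "
            f"rank ({adapter_rank}); mismatches have been linked to LoRA "
            f"performance degradation."
        )

# Example: ranks agree, so no exception is raised.
check_lora_rank(16, 16)
```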