**vllm/model_executor/offloader/prefetch.py** (12 changes: 9 additions & 3 deletions)
```diff
@@ -18,6 +18,7 @@
 import torch.nn as nn
 
 # Import prefetch_ops to register custom ops at module load time
+import vllm.envs as envs
 import vllm.model_executor.offloader.prefetch_ops  # noqa: F401
 from vllm.logger import init_logger
 from vllm.model_executor.offloader.base import BaseOffloader
```
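For context on the new `envs` import: vLLM exposes environment flags through `vllm.envs`, which typically registers each flag as a lazily evaluated lookup in `vllm/envs.py`. The actual entry for this flag is not shown in the diff, so the default value and parsing in this sketch are assumptions:

```python
# Sketch of how such a flag is commonly registered in vllm/envs.py.
# The exact entry is not shown in this diff, so the default and parsing
# below are assumptions, not the actual vLLM source.
import os

environment_variables = {
    "VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY":
        lambda: bool(int(os.getenv(
            "VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY", "0"))),
}

if __name__ == "__main__":
    # Flags are resolved lazily, at lookup time rather than import time.
    print(environment_variables["VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY"]())
```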
```diff
@@ -528,7 +529,9 @@ def start_onload_to_static(self):
         gpu_buffer = offloader._gpu_buffer
         assert cpu_storage is not None, "CPU storage not initialized"
         assert gpu_buffer is not None, "GPU buffer not assigned"
-        assert not is_pin_memory_available() or cpu_storage.is_pinned(), (
+        assert (envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY
+                or not is_pin_memory_available()
+                or cpu_storage.is_pinned()), (
             f"CPU storage for {name} is not pinned! "
             "non_blocking=True H2D copy from non-pinned memory "
             "causes stream synchronization that breaks "
```
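The assertion guards a real performance trap: copying from pageable (non-pinned) CPU memory with `non_blocking=True` forces an internal synchronization, which defeats the stream overlap the prefetcher relies on. A standalone sketch contrasting the two cases (illustrative only, not part of the diff):

```python
# Standalone sketch, not part of the PR: contrasts pinned vs. pageable
# source tensors for an H2D copy with non_blocking=True.
import torch

if torch.cuda.is_available():
    pageable = torch.empty(1 << 20)                 # ordinary CPU tensor
    pinned = torch.empty(1 << 20, pin_memory=True)  # page-locked CPU tensor

    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        # Copying from pageable memory may synchronize internally,
        # stalling the stream the prefetcher is trying to overlap with.
        a = pageable.to("cuda", non_blocking=True)
        # Copying from pinned memory is a true asynchronous DMA transfer.
        b = pinned.to("cuda", non_blocking=True)
    stream.synchronize()
```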
```diff
@@ -629,7 +632,8 @@ def _offload_to_cpu_internal(self):
         original GPU tensor is garbage collected.
         """
         param = self._param
-        pin_memory = is_pin_memory_available()
+        pin_memory = (is_pin_memory_available()
+                      and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY)
```
**Contributor (gemini-code-assist)** commented (severity: high):
The logic that decides whether to use pinned memory, `is_pin_memory_available() and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY`, is now duplicated in three places within this file. This creates a maintenance risk: future changes to this logic might not be applied consistently across all locations, potentially reintroducing bugs.

To improve maintainability, I recommend extracting this logic into a single, module-level helper function. For example:

```python
def _should_pin_memory() -> bool:
    """Returns True if memory should be pinned for offloading."""
    return (is_pin_memory_available()
            and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY)
```

This function can then be used in all three modified locations:

1. In `_CpuParamOffloader._offload_to_cpu_internal` (this location):
   ```python
   pin_memory = _should_pin_memory()
   ```
2. In `_CpuParamOffloader._update_cpu_storage_from_param`:
   ```python
   if _should_pin_memory() and not param.data.is_pinned():
       ...
   ```
3. In `PrefetchOffloader.start_onload_to_static`:
   ```python
   assert not _should_pin_memory() or cpu_storage.is_pinned(), (
       ...
   )
   ```

This refactoring will make the code cleaner, less error-prone, and easier to understand and modify in the future.

**Contributor (he-yufeng)** replied:

@gemini-code-assist This is not just limited to `prefetch.py`. You need to take a look at all files in `vllm/model_executor/offloader`.

**Contributor (gemini-code-assist)** replied:

You are absolutely correct, @he-yufeng. My apologies for the oversight. I will expand the scope of the refactoring to include all relevant files within the `vllm/model_executor/offloader` directory to ensure consistency in handling the `VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY` environment variable.

**Contributor (Author)** replied:

Good point, thanks for the catch. I went through every file in `vllm/model_executor/offloader/`:

- `base.py`: added a shared `should_pin_memory()` helper that consolidates the `is_pin_memory_available() and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY` check (see the sketch after this list)
- `prefetch.py`: all three call sites now use the helper
- `uva.py`: switched to the same helper (it was already correct, just DRY'd it up)
- `prefetch_ops.py`: no pin-memory usage, nothing to change
- `__init__.py`: just exports; added `should_pin_memory`

Also noted there is one more instance in `model_loader/utils.py` (outside the offloader directory) that could be migrated to the helper in a follow-up if desired.
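A rough sketch of what the consolidated helper described above might look like; the import path of `is_pin_memory_available` and its exact placement in `base.py` are assumptions based on this thread, not the merged code:

```python
# Hypothetical consolidation in vllm/model_executor/offloader/base.py,
# following the names discussed in this thread. The import path of
# is_pin_memory_available is an assumption.
import vllm.envs as envs
from vllm.utils import is_pin_memory_available


def should_pin_memory() -> bool:
    """Single source of truth for the pin-memory decision."""
    return (is_pin_memory_available()
            and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY)


# Call sites then collapse to, for example:
#   pin_memory = should_pin_memory()
#   assert not should_pin_memory() or cpu_storage.is_pinned(), (...)
```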


```diff
 
         # Create pinned CPU storage and copy current GPU data
         self._cpu_storage = torch.empty_strided(
```
```diff
@@ -666,7 +670,9 @@ def _update_cpu_storage_from_param(self) -> None:
         param = self._param
 
         if param.data.device.type == "cpu":
-            if is_pin_memory_available() and not param.data.is_pinned():
+            if (is_pin_memory_available()
+                    and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY
+                    and not param.data.is_pinned()):
                 pinned = torch.empty_strided(
                     size=param.data.size(),
                     stride=param.data.stride(),
```
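The branch above handles a CPU parameter that arrived unpinned by copying it into freshly allocated pinned storage with the same layout. The underlying pattern can be shown standalone (hypothetical helper name, not the vLLM code):

```python
# Standalone sketch of the "repin" pattern used above: allocate pinned CPU
# storage with the same layout, then copy. The helper name is hypothetical.
import torch


def pinned_clone(t: torch.Tensor) -> torch.Tensor:
    """Return a pinned copy of a CPU tensor, preserving size and stride."""
    pinned = torch.empty_strided(
        size=t.size(),
        stride=t.stride(),
        dtype=t.dtype,
        device="cpu",
        pin_memory=torch.cuda.is_available(),  # pinning requires a CUDA runtime
    )
    pinned.copy_(t)
    return pinned


if __name__ == "__main__":
    x = torch.randn(4, 8).t()  # non-contiguous, to show stride preservation
    y = pinned_clone(x)
    assert y.stride() == x.stride()
```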