[Bugfix] Respect VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY in prefetch offloader#37699

Merged
vllm-bot merged 3 commits into vllm-project:main from he-yufeng:fix/prefetch-offload-disable-pin-memory
Apr 15, 2026

Conversation

Contributor

@he-yufeng he-yufeng commented Mar 20, 2026

Purpose

Fixes #37672

The prefetch CPU offloader (--offload-backend prefetch) ignores VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY and unconditionally allocates pinned memory. On unified-memory systems like GH200, pinned memory consumes GPU memory, causing OOM during model loading even when the env var is set to 1.

The UVA offloader already handles this correctly (added in #32993). This PR applies the same pattern to the prefetch backend and consolidates the duplicated check into a shared helper.

Changes

  1. base.py — new should_pin_memory() helper that combines is_pin_memory_available() with the env var check, eliminating duplication across files (a sketch follows below)

  2. prefetch.py — three places in _CpuParamOffloader now use the shared helper:

    • _offload_to_cpu_internal() — pin_memory flag when allocating initial CPU storage
    • _update_cpu_storage_from_param() — skips re-pinning when the env var is set
    • start_onload_to_static() assertion — no longer fires when pinning is intentionally disabled
  3. uva.py — self.pin_memory assignment now uses the same shared helper (functionally identical, just DRY)

  4. __init__.py — exports should_pin_memory for external use

Other files in vllm/model_executor/offloader/ (prefetch_ops.py) don't deal with pin memory at all, so no changes needed.
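For reference, a minimal sketch of the new helper described in item 1 (the exact docstring and import paths are assumptions; see base.py in the diff for the real thing):

import vllm.envs as envs
from vllm.utils import is_pin_memory_available  # import path assumed

def should_pin_memory() -> bool:
    """Whether CPU offload buffers should be allocated as pinned memory."""
    return (is_pin_memory_available()
            and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY)

With this in place, the uva.py assignment from item 3 reduces to self.pin_memory = should_pin_memory().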

Test Plan

  • Existing tests/basic_correctness/test_cpu_offload.py covers VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY for the UVA backend
  • The env var handling is identical between UVA and prefetch — both just gate pin_memory on the same boolean via the shared helper (see the smoke check below)
  • Verified no other references to the raw pattern remain in the offloader directory
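As an additional smoke check (hypothetical snippet, not part of the PR's test plan; assumes the should_pin_memory export from __init__.py described above):

import os
os.environ["VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY"] = "1"  # set before vLLM reads it

from vllm.model_executor.offloader import should_pin_memory

# With pinning disabled via the env var, the shared gate must report False,
# so neither the UVA nor the prefetch backend requests pinned CPU memory.
assert not should_pin_memory()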

[Bugfix] Respect VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY in prefetch offloader

The prefetch CPU offloader unconditionally allocates pinned memory,
ignoring the VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY env var. On
unified-memory systems like GH200 where pinned memory eats into the
GPU memory pool, this causes OOM during model loading.

The UVA offloader already checks this env var (added in vllm-project#32993). Apply
the same check to the three places in the prefetch backend that
allocate or assert pinned memory:
- _offload_to_cpu_internal(): initial CPU storage allocation
- _update_cpu_storage_from_param(): re-pinning after weight processing
- start_onload_to_static(): pinned memory assertion

Fixes vllm-project#37672

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
@mergify mergify Bot added the bug Something isn't working label Mar 20, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request correctly extends the handling of the VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY environment variable to the prefetch offloader, aligning its behavior with the UVA offloader. The changes prevent unconditional pinned memory allocation, which can cause out-of-memory errors on unified-memory systems. While the logic is correct, I've identified a code duplication issue across three locations. I've added a high-severity comment suggesting a refactoring to a shared helper function to improve maintainability and prevent future bugs.

Comment on lines +635 to +636
pin_memory = (is_pin_memory_available()
              and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY)
Contributor


high

The logic to decide whether to use pinned memory, is_pin_memory_available() and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY, is now duplicated in three places within this file. This creates a maintenance risk, as future changes to this logic might not be applied consistently across all locations, potentially re-introducing bugs.

To improve maintainability, I recommend extracting this logic into a single, module-level helper function.

For example:

def _should_pin_memory() -> bool:
    """Returns True if memory should be pinned for offloading."""
    return (is_pin_memory_available()
            and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY)

This function can then be used in all three modified locations:

  1. In _CpuParamOffloader._offload_to_cpu_internal (this location):
    pin_memory = _should_pin_memory()
  2. In _CpuParamOffloader._update_cpu_storage_from_param:
    if _should_pin_memory() and not param.data.is_pinned():
        ...
  3. In PrefetchOffloader.start_onload_to_static:
    assert not _should_pin_memory() or cpu_storage.is_pinned(), (
        ...
    )

This refactoring will make the code cleaner, less error-prone, and easier to understand and modify in the future.

@he-yufeng
Contributor

@gemini-code-assist This is not just limited to prefetch.py. You need to take a look at all files in vllm/model_executor/offloader.

@gemini-code-assist
Contributor

You are absolutely correct, @he-yufeng. My apologies for the oversight. I will expand the scope of the refactoring to include all relevant files within the vllm/model_executor/offloader directory to ensure consistency in handling the VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY environment variable.

@he-yufeng
Contributor Author

Good point, thanks for the catch. I went through every file in vllm/model_executor/offloader/:

  • base.py — added a shared should_pin_memory() helper that consolidates the is_pin_memory_available() and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY check
  • prefetch.py — all 3 call sites now use the helper
  • uva.py — switched to the same helper (was already correct, just DRY'd it up)
  • prefetch_ops.py — no pin memory usage, nothing to change
  • __init__.py — just exports, added should_pin_memory

Also noted there's one more instance in model_loader/utils.py (outside the offloader dir) that could be migrated to the helper in a follow-up if desired.

@ehfd
Contributor

ehfd commented Mar 21, 2026

@gemini-code-assist Change the above review to not just be limited to vllm/model_executor/offloader/prefetch.py. You need to take a look at all files in vllm/model_executor/offloader.

@gemini-code-assist
Contributor

Thanks for the clarification, @ehfd. I've reviewed the changes in vllm/model_executor/offloader/prefetch.py. However, based on the provided diff, vllm/model_executor/offloader/prefetch.py is the only file within the vllm/model_executor/offloader directory that has been modified in this pull request. Therefore, my review is limited to the changes in this specific file.

@ehfd
Contributor

ehfd commented Mar 21, 2026

@gemini-code-assist Then, the solution is to retain the current implementation, since it's exactly the same approach that works in vllm/model_executor/offloader/uva.py.

@gemini-code-assist
Contributor

I have noted your comment regarding the implementation's consistency with vllm/model_executor/offloader/uva.py.

@PythonicMind2

Thanks for the fix!
The disable-pin-memory env var should definitely be respected in the prefetch path too, not just in the main offloading path. The intent is clear and the bug is real.

prefetch_offloader.py, line 84 concern
The variable name pin is too terse. Since this is a negation of a 'disable' flag, it's easy to get the logic backwards on a quick read. Suggest:

use_pin_memory = not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY
return torch.empty(size, dtype=torch.uint8, pin_memory=use_pin_memory)

Also, is it safe to read envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY on every call, or should we cache this at __init__ time to avoid overhead in tight loops?
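If caching is preferred, a hypothetical sketch (the class name comes from the PR description; the constructor signature, attribute, and helper method are assumptions):

import torch
import vllm.envs as envs

class _CpuParamOffloader:
    def __init__(self) -> None:
        # Read the env var once at construction; avoids re-reading
        # os.environ on every allocation in tight loops.
        self.use_pin_memory = not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY

    def _alloc_cpu_buffer(self, nbytes: int) -> torch.Tensor:
        # Hypothetical allocation helper showing the cached flag in use.
        return torch.empty(nbytes, dtype=torch.uint8,
                           pin_memory=self.use_pin_memory)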

tests/worker/test_prefetch_offloader.py concern
No test added. This bug was silently ignoring the env var for the prefetch path. Please add a parametrized test covering both VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY=1 and =0, asserting buf.is_pinned() accordingly.
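A hedged illustration of such a test (file placement, fixture usage, and a CUDA host where pinning is actually available are all assumptions):

import pytest
import torch

@pytest.mark.parametrize("disable", ["0", "1"])
def test_prefetch_pin_memory_env(monkeypatch: pytest.MonkeyPatch, disable: str):
    monkeypatch.setenv("VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY", disable)
    from vllm.model_executor.offloader import should_pin_memory
    buf = torch.empty(64, dtype=torch.uint8, pin_memory=should_pin_memory())
    # Pinning is expected only when the var is unset/0 and the host supports it.
    expect_pinned = disable == "0" and torch.cuda.is_available()
    assert buf.is_pinned() == expect_pinned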

prefetch_offloader.py, line 12 nit
The new envs import is at the bottom of the import block. Please keep it sorted with the other vllm imports alphabetically.

Once the test is added and the naming is cleaned up, this looks good to go. Nice catch on the prefetch path being overlooked.

Extract the duplicated `is_pin_memory_available() and not
envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY` pattern into a shared
`should_pin_memory()` helper in base.py, used by both prefetch.py and
uva.py. This addresses the review feedback to check all files in
vllm/model_executor/offloader/, not just prefetch.py.

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
@he-yufeng he-yufeng force-pushed the fix/prefetch-offload-disable-pin-memory branch from 3d48a5b to 355aea9 on March 21, 2026 15:12
@he-yufeng
Contributor Author

Ran ruff check and ruff format on the offloader directory — all passing now. Thanks for the heads up.

@he-yufeng
Contributor Author

ping on this — GH200 users are still hitting the OOM.

@ehfd
Contributor

ehfd commented Apr 13, 2026

It is definitely needed. Otherwise GH200 doesn't work properly with unified memory at all.

@DarkLight1337 DarkLight1337 requested review from Isotr0py and mgoin April 13, 2026 15:31
@Isotr0py Isotr0py added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 14, 2026
@Isotr0py Isotr0py enabled auto-merge (squash) April 14, 2026 16:38
@vllm-bot vllm-bot merged commit bcc2306 into vllm-project:main Apr 15, 2026
51 of 52 checks passed
zxd1997066 pushed a commit to zxd1997066/vllm that referenced this pull request Apr 15, 2026
[Bugfix] Respect VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY in prefetch offloader (vllm-project#37699)

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: zengxian <xiangdong.zeng@intel.com>
whk-lab pushed a commit to whk-lab/vllm that referenced this pull request Apr 23, 2026
[Bugfix] Respect VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY in prefetch offloader (vllm-project#37699)

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026
[Bugfix] Respect VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY in prefetch offloader (vllm-project#37699)

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>

Labels

bug (Something isn't working), ready (ONLY add when PR is ready to merge/full CI is needed)

Development

Successfully merging this pull request may close these issues.

[Bug]: Prefetch CPU offload OOMs; VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY must be implemented

6 participants