CPU Offloading w/ FSDP - gradient accumulation is potentially broken #414

@JamesKunstle

Description

From the FSDP docs:
"FSDP currently does not support gradient accumulation outside no_sync() when using CPU offloading. This is because FSDP uses the newly-reduced gradient instead of accumulating with any existing gradient, which can lead to incorrect results."

https://pytorch.org/docs/stable/fsdp.html
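
In practice, this means that when CPU offloading is enabled, gradient accumulation has to happen inside `no_sync()` so that FSDP only reduce-scatters on the final micro-batch instead of overwriting the locally accumulated gradient each step. Below is a minimal sketch of that pattern; the model, optimizer, batch data, and `accum_steps` value are illustrative placeholders (not from this repo), and it assumes a process group already initialized via `torchrun` with a CUDA device available.

```python
import contextlib

import torch
import torch.nn as nn
from torch.distributed.fsdp import CPUOffload
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Illustrative model wrapped with FSDP + CPU offloading of parameters.
model = FSDP(
    nn.Linear(1024, 1024).cuda(),
    cpu_offload=CPUOffload(offload_params=True),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

accum_steps = 4  # hypothetical gradient-accumulation factor
batches = [torch.randn(8, 1024, device="cuda") for _ in range(16)]  # synthetic data

for step, inputs in enumerate(batches):
    sync_step = (step + 1) % accum_steps == 0
    # Accumulate locally inside no_sync(); only the final micro-batch triggers
    # the reduce-scatter, so FSDP never replaces an existing (offloaded)
    # gradient with a freshly reduced one mid-accumulation.
    ctx = contextlib.nullcontext() if sync_step else model.no_sync()
    with ctx:
        loss = model(inputs).pow(2).mean() / accum_steps
        loss.backward()
    if sync_step:
        optimizer.step()
        optimizer.zero_grad()
```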
