[group offloading] avoid unnecessary moving out to speed up inference #12910
gameofdimension wants to merge 1 commit into huggingface:main from
Conversation
Refactor offloading logic to simplify memory management.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Pinging to remove stale.
Hi @gameofdimension, thanks for putting this together. I believe this change could lead to a big spike in CPU RAM usage, right? Would you mind benchmarking the change to get an idea of throughput (iterations/second), GPU VRAM, and CPU RAM usage?
@gameofdimension Just curious, why not enable similar behaviour through CUDA streams? The expectation is already set there that there will be a trade-off between CPU memory and speed. Is there a specific case where you don't want to use streams?
@DN6 IMHO, based on the observed latency differences, I proposed this adjustment for consideration.
Wouldn't that approach also disadvantage XPU users? |
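For reference, a minimal sketch of the streams path mentioned above, assuming diffusers' `enable_group_offload` helper on the transformer (argument names may differ between diffusers versions; the prompt and step count are placeholders):

```python
import torch
from diffusers import DiffusionPipeline

# Sketch only: group offloading with a CUDA stream, so host-to-device
# transfers of the next group overlap with compute on the current one.
pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)
pipe.transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,  # prefetch the next group on a side stream
)
image = pipe("a cat sitting on a window sill", num_inference_steps=30).images[0]
```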
Explicitly moving weights back to the CPU after computation is unnecessary; we can avoid it just as in the `use_stream=True` case. Since device-to-host copying is expensive, this change significantly improves inference speed when `use_stream=False`.
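As an illustration of the idea (a toy sketch, not the actual diffusers group-offloading hooks): the pre-forward hook still copies weights to the accelerator, but the post-forward hook simply points the parameters back at their existing CPU tensors instead of issuing a device-to-host copy, since inference never modifies the weights.

```python
import torch
from torch import nn

def attach_offload_hooks(module: nn.Module, device: torch.device) -> None:
    # Keep a reference to each parameter's original CPU tensor.
    cpu_tensors = {name: p.data for name, p in module.named_parameters()}

    def pre_forward(mod, args):
        # Host -> device copy right before the module runs.
        for name, p in mod.named_parameters():
            p.data = cpu_tensors[name].to(device, non_blocking=True)

    def post_forward(mod, args, output):
        # Re-point parameters at the untouched CPU tensors; no
        # device -> host copy is needed.
        for name, p in mod.named_parameters():
            p.data = cpu_tensors[name]

    module.register_forward_pre_hook(pre_forward)
    module.register_forward_hook(post_forward)
```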
Improvement
device: A100 40G
model: Qwen/Qwen-Image
test code
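The linked test code is not reproduced above; a hypothetical sketch of how the reported numbers (iterations/second, peak VRAM, CPU RSS) could be collected, assuming `pipe` is set up as in the earlier snippet:

```python
import time

import psutil
import torch

steps = 30
torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
_ = pipe("a photo of an astronaut riding a horse", num_inference_steps=steps)
elapsed = time.perf_counter() - start

print(f"throughput : {steps / elapsed:.2f} it/s")
print(f"peak VRAM  : {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
print(f"CPU RSS    : {psutil.Process().memory_info().rss / 1024**3:.2f} GiB")
```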
What does this PR do?
Fixes # (issue)
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.