Support FLUX nf4 & pf8 for GPUs with 6GB/8GB VRAM (method and checkpoints by lllyasviel) #9149
Comments
I think if NF4 is much better than FP8, maybe we can make it usable for all models (including SDXL).
@chuck-ma NF4 is better in some cases and worse in others compared to FP8 (it's hard to tell when in advance). NF4 is essentially FP4 (4 bits per weight) with some additional data/changes that calibrate it closer to the original model than plain FP4. Here, that's done by mixing precision: higher-precision bf16 is kept where it matters, and lower-precision 4-bit is used where it matters a lot less. FP8 is essentially casting down to 8 bits per weight without any sort of calibration, so it's faster to produce but larger, and it behaves differently from NF4. In addition, there's no real reason other models shouldn't support this, since it doesn't rely on any FLUX-specific quirk (FLUX just motivated the development of quantization techniques like this for image models).
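To make the distinction concrete, here is a minimal sketch (not from the thread) comparing a plain FP8 cast with bitsandbytes NF4 quantization on a single weight tensor. It assumes torch >= 2.1 (for `float8_e4m3fn`) and bitsandbytes with a CUDA device available; the tensor shape is arbitrary.

```python
# Sketch: plain FP8 downcast vs. bitsandbytes NF4 quantization of one tensor.
# Assumes torch >= 2.1 and bitsandbytes installed, with CUDA available.
import torch
import bitsandbytes.functional as bnbF

w = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda")

# FP8: a straight cast to 8 bits per weight, no calibration data kept.
w_fp8 = w.to(torch.float8_e4m3fn)
fp8_err = (w - w_fp8.to(torch.bfloat16)).abs().mean()

# NF4: 4 bits per weight plus per-block quantization state used at
# dequantization time to reconstruct values closer to the original.
w_nf4, quant_state = bnbF.quantize_nf4(w)
w_nf4_deq = bnbF.dequantize_nf4(w_nf4, quant_state)
nf4_err = (w - w_nf4_deq).abs().mean()

print(f"mean abs error  fp8: {fp8_err.item():.5f}  nf4: {nf4_err.item():.5f}")
```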
It seems I can't get it to work with diffusers. Do you have any very simple example code to make it work? Many thanks.
@Swarzox This is an issue, not a PR. The code to make this work in diffusers has not been contributed or created yet.
Both NF4 and llm.int8() can be done ad-hoc with some code changes. Serialization and direct loading support will be done through the plan proposed in #9174.

Directly loading the said checkpoint can lead to some problematic results for the reasons explained in #9165 (comment). If you want to obtain the text encoders and VAE from that checkpoint, you can use the snippet from #9165 (comment) and then use something like #9177 so that computations run in a higher-precision data type while the params are kept in a lower-precision data type such as FP8.

You can also do a direct llm.int8() or NF4-style loading of the bulky T5-XXL and use it within a diffusers pipeline. See: https://gist.github.com/sayakpaul/82acb5976509851f2db1a83456e504f1

There are many options to run things in a memory-efficient way. So, with a programmatic approach, we let you choose what's best for you :)
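A minimal sketch of that last option, roughly in the spirit of the linked gist: load only the bulky T5-XXL text encoder in NF4 via bitsandbytes and pass it into a diffusers FluxPipeline. The model id, prompt, and generation settings here are illustrative assumptions, not values from the thread.

```python
# Sketch: NF4-quantize FLUX's T5-XXL text encoder and use it in FluxPipeline.
# Assumes diffusers, transformers, bitsandbytes, and accelerate are installed.
import torch
from transformers import T5EncoderModel, BitsAndBytesConfig
from diffusers import FluxPipeline

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Quantize only the text encoder; the rest of the pipeline stays in bf16.
text_encoder_2 = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # illustrative model id
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=text_encoder_2,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # keep idle components on CPU to reduce VRAM

image = pipe(
    "a photo of a cat wearing a tiny wizard hat",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("flux_nf4_t5.png")
```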
How to use this with diffusers? Any code?
I can't be sure of what you are doing, but I get 9 GiB with CPU offload and ~18 GiB without. This may also get even better when #9174 is complete/has a working demo.
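For anyone who wants to check the numbers on their own setup, here is a minimal sketch for measuring peak VRAM, assuming a `pipe` like the FluxPipeline constructed in the earlier snippet; the prompt and step count are placeholders.

```python
# Sketch: measure peak VRAM of an already-constructed diffusers pipeline.
# Assumes `pipe` is a FluxPipeline (or similar) loaded as in the snippet above.
import torch

torch.cuda.reset_peak_memory_stats()

# Compare the two modes by toggling which line you use:
pipe.enable_model_cpu_offload()   # offloaded run
# pipe.to("cuda")                 # everything resident on the GPU

_ = pipe("a quick test prompt", num_inference_steps=4).images[0]

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak VRAM: {peak_gib:.1f} GiB")
```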
I am very much looking forward to it, hoping for an extremely simple implementation with just a few lines of code.
Hey @Ednaordinary, I saw your comment on the other thread about the error you were encountering and wanted to know how/if you resolved it. I am talking about: 'Error in FluxImageGenerator initialization: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.' I am stuck on the same problem right now. Thank you!
NF4 support was removed from ComfyUI in its latest update.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
The flexibility of a programmatic approach is exactly my point. To illustrate, consider quantization strategies like llm.int8() or NF4, or the ones provided elsewhere. https://x.com/RisingSayak/status/1836679359521820704 gives a visual of how this might look.
lllyasviel/stable-diffusion-webui-forge#981