
Commit d8f6063

Merge branch 'main' into cog-tests

2 parents f7405f2 + f2be8bd

File tree

9 files changed: +490 −611 lines changed


.github/workflows/claude_review.yml — 3 additions & 0 deletions

```diff
@@ -32,6 +32,9 @@ jobs:
       )
     runs-on: ubuntu-latest
     steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 1
       - uses: anthropics/claude-code-action@v1
         with:
           anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
```

docs/source/en/api/pipelines/cogvideox.md — 3 additions & 4 deletions

````diff
@@ -41,16 +41,15 @@ The quantized CogVideoX 5B model below requires ~16GB of VRAM.
 
 ```py
 import torch
-from diffusers import CogVideoXPipeline, AutoModel
+from diffusers import CogVideoXPipeline, AutoModel, TorchAoConfig
 from diffusers.quantizers import PipelineQuantizationConfig
 from diffusers.hooks import apply_group_offloading
 from diffusers.utils import export_to_video
+from torchao.quantization import Int8WeightOnlyConfig
 
 # quantize weights to int8 with torchao
 pipeline_quant_config = PipelineQuantizationConfig(
-    quant_backend="torchao",
-    quant_kwargs={"quant_type": "int8wo"},
-    components_to_quantize="transformer"
+    quant_mapping={"transformer": TorchAoConfig(Int8WeightOnlyConfig())}
 )
 
 # fp8 layerwise weight-casting
````

docs/source/en/api/pipelines/ltx2.md — 151 additions & 15 deletions

````diff
@@ -18,7 +18,7 @@
 <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
 </div>
 
-LTX-2 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.
+[LTX-2](https://hf.co/papers/2601.03233) is a DiT-based foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.
 
 You can find all the original LTX-Video checkpoints under the [Lightricks](https://huggingface.co/Lightricks) organization.
 
@@ -293,6 +293,7 @@ import torch
 from diffusers import LTX2ConditionPipeline
 from diffusers.pipelines.ltx2.pipeline_ltx2_condition import LTX2VideoCondition
 from diffusers.pipelines.ltx2.export_utils import encode_video
+from diffusers.pipelines.ltx2.utils import DEFAULT_NEGATIVE_PROMPT
 from diffusers.utils import load_image, load_video
 
 device = "cuda"
@@ -315,19 +316,6 @@ prompt = (
     "landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the "
     "solitude and beauty of a winter drive through a mountainous region."
 )
-negative_prompt = (
-    "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, "
-    "grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, "
-    "deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, "
-    "wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of "
-    "field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent "
-    "lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny "
-    "valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, "
-    "mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, "
-    "off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward "
-    "pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, "
-    "inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
-)
 
 cond_video = load_video(
     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4"
@@ -343,7 +331,7 @@ frame_rate = 24.0
 video, audio = pipe(
     conditions=conditions,
     prompt=prompt,
-    negative_prompt=negative_prompt,
+    negative_prompt=DEFAULT_NEGATIVE_PROMPT,
     width=width,
     height=height,
     num_frames=121,
@@ -366,6 +354,154 @@ encode_video(
 
 Because the conditioning is done via latent frames, the 8 data space frames corresponding to the specified latent frame for an image condition will tend to be static.
 
+## Multimodal Guidance
+
+LTX-2.X pipelines support multimodal guidance. It is composed of three terms, all using a CFG-style update rule:
+
+1. Classifier-Free Guidance (CFG): standard [CFG](https://huggingface.co/papers/2207.12598) where the perturbed ("weaker") output is generated using the negative prompt.
+2. Spatio-Temporal Guidance (STG): [STG](https://huggingface.co/papers/2411.18664) guides away from a perturbed output created by short-cutting self-attention operations and substituting in the attention values instead. The idea is that this creates sharper videos and better spatiotemporal consistency.
+3. Modality Isolation Guidance: guides away from a perturbed output created by disabling cross-modality (audio-to-video and video-to-audio) cross-attention. This guidance is more specific to [LTX-2.X](https://huggingface.co/papers/2601.03233) models, with the idea that it produces better consistency between the generated audio and video.
+
+These are controlled by the `guidance_scale`, `stg_scale`, and `modality_scale` arguments and can be set separately for video and audio. Additionally, for STG the transformer block indices where self-attention is skipped need to be specified via the `spatio_temporal_guidance_blocks` argument. The LTX-2.X pipelines also support [guidance rescaling](https://huggingface.co/papers/2305.08891) to help reduce over-exposure, which can be a problem when the guidance scales are set to high values.
+
+```py
+import torch
+from diffusers import LTX2ImageToVideoPipeline
+from diffusers.pipelines.ltx2.export_utils import encode_video
+from diffusers.pipelines.ltx2.utils import DEFAULT_NEGATIVE_PROMPT
+from diffusers.utils import load_image
+
+device = "cuda"
+width = 768
+height = 512
+random_seed = 42
+frame_rate = 24.0
+generator = torch.Generator(device).manual_seed(random_seed)
+model_path = "dg845/LTX-2.3-Diffusers"
+
+pipe = LTX2ImageToVideoPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
+pipe.enable_sequential_cpu_offload(device=device)
+pipe.vae.enable_tiling()
+
+prompt = (
+    "An astronaut hatches from a fragile egg on the surface of the Moon, the shell cracking and peeling apart in "
+    "gentle low-gravity motion. Fine lunar dust lifts and drifts outward with each movement, floating in slow arcs "
+    "before settling back onto the ground. The astronaut pushes free in a deliberate, weightless motion, small "
+    "fragments of the egg tumbling and spinning through the air. In the background, the deep darkness of space subtly "
+    "shifts as stars glide with the camera's movement, emphasizing vast depth and scale. The camera performs a "
+    "smooth, cinematic slow push-in, with natural parallax between the foreground dust, the astronaut, and the "
+    "distant starfield. Ultra-realistic detail, physically accurate low-gravity motion, cinematic lighting, and a "
+    "breath-taking, movie-like shot."
+)
+
+image = load_image(
+    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg",
+)
+
+video, audio = pipe(
+    image=image,
+    prompt=prompt,
+    negative_prompt=DEFAULT_NEGATIVE_PROMPT,
+    width=width,
+    height=height,
+    num_frames=121,
+    frame_rate=frame_rate,
+    num_inference_steps=30,
+    guidance_scale=3.0,  # Recommended LTX-2.3 guidance parameters
+    stg_scale=1.0,  # Note that 0.0 (not 1.0) means that STG is disabled (all other guidance is disabled at 1.0)
+    modality_scale=3.0,
+    guidance_rescale=0.7,
+    audio_guidance_scale=7.0,  # Note that a higher CFG guidance scale is recommended for audio
+    audio_stg_scale=1.0,
+    audio_modality_scale=3.0,
+    audio_guidance_rescale=0.7,
+    spatio_temporal_guidance_blocks=[28],
+    use_cross_timestep=True,
+    generator=generator,
+    output_type="np",
+    return_dict=False,
+)
+
+encode_video(
+    video[0],
+    fps=frame_rate,
+    audio=audio[0].float().cpu(),
+    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
+    output_path="ltx2_3_i2v_stage_1.mp4",
+)
+```
+
+## Prompt Enhancement
+
+The LTX-2.X models are sensitive to prompting style. Refer to the [official prompting guide](https://ltx.io/model/model-blog/prompting-guide-for-ltx-2) for recommendations on how to write a good prompt. Prompt enhancement, where the supplied prompts are rewritten by the pipeline's text encoder (by default a [Gemma 3](https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-unquantized) model) given a system prompt, can also improve sample quality. The optional `processor` pipeline component needs to be present to use prompt enhancement. Enable prompt enhancement by supplying a `system_prompt` argument:
+
+```py
+import torch
+from transformers import Gemma3Processor
+from diffusers import LTX2Pipeline
+from diffusers.pipelines.ltx2.export_utils import encode_video
+from diffusers.pipelines.ltx2.utils import DEFAULT_NEGATIVE_PROMPT, T2V_DEFAULT_SYSTEM_PROMPT
+
+device = "cuda"
+width = 768
+height = 512
+random_seed = 42
+frame_rate = 24.0
+generator = torch.Generator(device).manual_seed(random_seed)
+model_path = "dg845/LTX-2.3-Diffusers"
+
+pipe = LTX2Pipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
+pipe.enable_model_cpu_offload(device=device)
+pipe.vae.enable_tiling()
+if getattr(pipe, "processor", None) is None:
+    processor = Gemma3Processor.from_pretrained("google/gemma-3-12b-it-qat-q4_0-unquantized")
+    pipe.processor = processor
+
+prompt = (
+    "An astronaut hatches from a fragile egg on the surface of the Moon, the shell cracking and peeling apart in "
+    "gentle low-gravity motion. Fine lunar dust lifts and drifts outward with each movement, floating in slow arcs "
+    "before settling back onto the ground. The astronaut pushes free in a deliberate, weightless motion, small "
+    "fragments of the egg tumbling and spinning through the air. In the background, the deep darkness of space subtly "
+    "shifts as stars glide with the camera's movement, emphasizing vast depth and scale. The camera performs a "
+    "smooth, cinematic slow push-in, with natural parallax between the foreground dust, the astronaut, and the "
+    "distant starfield. Ultra-realistic detail, physically accurate low-gravity motion, cinematic lighting, and a "
+    "breath-taking, movie-like shot."
+)
+
+video, audio = pipe(
+    prompt=prompt,
+    negative_prompt=DEFAULT_NEGATIVE_PROMPT,
+    width=width,
+    height=height,
+    num_frames=121,
+    frame_rate=frame_rate,
+    num_inference_steps=30,
+    guidance_scale=3.0,
+    stg_scale=1.0,
+    modality_scale=3.0,
+    guidance_rescale=0.7,
+    audio_guidance_scale=7.0,
+    audio_stg_scale=1.0,
+    audio_modality_scale=3.0,
+    audio_guidance_rescale=0.7,
+    spatio_temporal_guidance_blocks=[28],
+    use_cross_timestep=True,
+    system_prompt=T2V_DEFAULT_SYSTEM_PROMPT,
+    generator=generator,
+    output_type="np",
+    return_dict=False,
+)
+
+encode_video(
+    video[0],
+    fps=frame_rate,
+    audio=audio[0].float().cpu(),
+    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
+    output_path="ltx2_3_t2v_stage_1.mp4",
+)
+```
+
 ## LTX2Pipeline
 
 [[autodoc]] LTX2Pipeline
````
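The three guidance terms added in the Multimodal Guidance section all share a CFG-style update rule. As a rough illustration of how such terms can combine with guidance rescaling, here is a pure-Python sketch; the `combined_guidance` helper and its exact combination are hypothetical simplifications over lists of floats, not the diffusers implementation. It only assumes, per the comments in the diff, that `guidance_scale` and `modality_scale` are neutral at 1.0 and `stg_scale` at 0.0.

```python
from statistics import pstdev

def combined_guidance(cond, uncond, stg_perturbed, modality_perturbed,
                      guidance_scale, stg_scale, modality_scale, guidance_rescale):
    # Each CFG-style term pushes the prediction away from a "weaker" output.
    # guidance_scale and modality_scale are neutral at 1.0; stg_scale at 0.0.
    pred = [
        c
        + (guidance_scale - 1.0) * (c - u)          # classifier-free guidance
        + stg_scale * (c - s)                       # spatio-temporal guidance
        + (modality_scale - 1.0) * (c - m)          # modality isolation guidance
        for c, u, s, m in zip(cond, uncond, stg_perturbed, modality_perturbed)
    ]
    # Guidance rescaling: restore the conditional prediction's std to counter
    # over-exposure, then interpolate between rescaled and raw predictions.
    factor = pstdev(cond) / pstdev(pred)
    return [guidance_rescale * p * factor + (1.0 - guidance_rescale) * p for p in pred]
```

With all scales at their neutral values the helper returns the conditional prediction unchanged, which matches the "disabled" semantics noted in the example's comments.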

docs/source/en/quantization/torchao.md — 13 additions & 31 deletions

````diff
@@ -29,24 +29,7 @@ from diffusers import DiffusionPipeline, PipelineQuantizationConfig, TorchAoConf
 from torchao.quantization import Int8WeightOnlyConfig
 
 pipeline_quant_config = PipelineQuantizationConfig(
-    quant_mapping={"transformer": TorchAoConfig(Int8WeightOnlyConfig(group_size=128)))}
-)
-pipeline = DiffusionPipeline.from_pretrained(
-    "black-forest-labs/FLUX.1-dev",
-    quantization_config=pipeline_quant_config,
-    torch_dtype=torch.bfloat16,
-    device_map="cuda"
-)
-```
-
-For simple use cases, you could also provide a string identifier in [`TorchAo`] as shown below.
-
-```py
-import torch
-from diffusers import DiffusionPipeline, PipelineQuantizationConfig, TorchAoConfig
-
-pipeline_quant_config = PipelineQuantizationConfig(
-    quant_mapping={"transformer": TorchAoConfig("int8wo")}
+    quant_mapping={"transformer": TorchAoConfig(Int8WeightOnlyConfig(group_size=128, version=2))}
 )
 pipeline = DiffusionPipeline.from_pretrained(
     "black-forest-labs/FLUX.1-dev",
@@ -91,18 +74,15 @@ Weight-only quantization stores the model weights in a specific low-bit data typ
 
 Dynamic activation quantization stores the model weights in a low-bit dtype, while also quantizing the activations on-the-fly to save additional memory. This lowers the memory requirements from model weights, while also lowering the memory overhead from activation computations. However, this may come at a quality tradeoff at times, so it is recommended to test different models thoroughly.
 
-The quantization methods supported are as follows:
+Refer to the [official torchao documentation](https://docs.pytorch.org/ao/stable/index.html) for a better understanding of the available quantization methods. An exhaustive list of configuration options is available [here](https://docs.pytorch.org/ao/main/workflows/inference.html#inference-workflows).
 
-| **Category** | **Full Function Names** | **Shorthands** |
-|--------------|-------------------------|----------------|
-| **Integer quantization** | `int4_weight_only`, `int8_dynamic_activation_int4_weight`, `int8_weight_only`, `int8_dynamic_activation_int8_weight` | `int4wo`, `int4dq`, `int8wo`, `int8dq` |
-| **Floating point 8-bit quantization** | `float8_weight_only`, `float8_dynamic_activation_float8_weight`, `float8_static_activation_float8_weight` | `float8wo`, `float8wo_e5m2`, `float8wo_e4m3`, `float8dq`, `float8dq_e4m3`, `float8dq_e4m3_tensor`, `float8dq_e4m3_row` |
-| **Floating point X-bit quantization** | `fpx_weight_only` | `fpX_eAwB` where `X` is the number of bits (1-7), `A` is exponent bits, and `B` is mantissa bits. Constraint: `X == A + B + 1` |
-| **Unsigned Integer quantization** | `uintx_weight_only` | `uint1wo`, `uint2wo`, `uint3wo`, `uint4wo`, `uint5wo`, `uint6wo`, `uint7wo` |
+Some popular quantization configurations are as follows:
 
-Some quantization methods are aliases (for example, `int8wo` is the commonly used shorthand for `int8_weight_only`). This allows using the quantization methods described in the torchao docs as-is, while also making it convenient to remember their shorthand notations.
-
-Refer to the [official torchao documentation](https://docs.pytorch.org/ao/stable/index.html) for a better understanding of the available quantization methods and the exhaustive list of configuration options available.
+| **Category** | **Configuration Classes** |
+|---|---|
+| **Integer quantization** | [`Int4WeightOnlyConfig`](https://docs.pytorch.org/ao/stable/api_reference/generated/torchao.quantization.Int4WeightOnlyConfig.html), [`Int8WeightOnlyConfig`](https://docs.pytorch.org/ao/stable/api_reference/generated/torchao.quantization.Int8WeightOnlyConfig.html), [`Int8DynamicActivationInt8WeightConfig`](https://docs.pytorch.org/ao/stable/api_reference/generated/torchao.quantization.Int8DynamicActivationInt8WeightConfig.html) |
+| **Floating point 8-bit quantization** | [`Float8WeightOnlyConfig`](https://docs.pytorch.org/ao/stable/api_reference/generated/torchao.quantization.Float8WeightOnlyConfig.html), [`Float8DynamicActivationFloat8WeightConfig`](https://docs.pytorch.org/ao/stable/api_reference/generated/torchao.quantization.Float8DynamicActivationFloat8WeightConfig.html) |
+| **Unsigned integer quantization** | [`IntxWeightOnlyConfig`](https://docs.pytorch.org/ao/stable/api_reference/generated/torchao.quantization.IntxWeightOnlyConfig.html) |
 
 ## Serializing and Deserializing quantized models
 
@@ -111,8 +91,9 @@ To serialize a quantized model in a given dtype, first load the model with the d
 ```python
 import torch
 from diffusers import AutoModel, TorchAoConfig
+from torchao.quantization import Int8WeightOnlyConfig
 
-quantization_config = TorchAoConfig("int8wo")
+quantization_config = TorchAoConfig(Int8WeightOnlyConfig())
 transformer = AutoModel.from_pretrained(
     "black-forest-labs/Flux.1-Dev",
     subfolder="transformer",
@@ -137,18 +118,19 @@ image = pipe(prompt, num_inference_steps=30, guidance_scale=7.0).images[0]
 image.save("output.png")
 ```
 
-If you are using `torch<=2.6.0`, some quantization methods, such as `uint4wo`, cannot be loaded directly and may result in an `UnpicklingError` when trying to load the models, but work as expected when saving them. In order to work around this, one can load the state dict manually into the model. Note, however, that this requires using `weights_only=False` in `torch.load`, so it should be run only if the weights were obtained from a trustable source.
+If you are using `torch<=2.6.0`, some quantization methods, such as `uint4` weight-only, cannot be loaded directly and may result in an `UnpicklingError` when trying to load the models, but work as expected when saving them. To work around this, load the state dict manually into the model. Note, however, that this requires using `weights_only=False` in `torch.load`, so it should be done only if the weights were obtained from a trusted source.
 
 ```python
 import torch
 from accelerate import init_empty_weights
 from diffusers import FluxPipeline, AutoModel, TorchAoConfig
+from torchao.quantization import IntxWeightOnlyConfig
 
 # Serialize the model
 transformer = AutoModel.from_pretrained(
     "black-forest-labs/Flux.1-Dev",
     subfolder="transformer",
-    quantization_config=TorchAoConfig("uint4wo"),
+    quantization_config=TorchAoConfig(IntxWeightOnlyConfig(dtype=torch.uint4)),
     torch_dtype=torch.bfloat16,
 )
 transformer.save_pretrained("/path/to/flux_uint4wo", safe_serialization=False, max_shard_size="50GB")
````
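This commit replaces string shorthands like `"int8wo"` with torchao configuration classes such as `Int8WeightOnlyConfig`. Conceptually, int8 weight-only quantization stores each weight row as int8 values plus a per-row scale. Below is a minimal pure-Python sketch of that round-trip; it is illustrative only (the helper names are hypothetical), and torchao's actual kernels operate on packed tensor subclasses rather than Python lists.

```python
def quantize_int8_weight_only(rows):
    """Per-row symmetric int8 quantization of a weight matrix (list of rows)."""
    quantized, scales = [], []
    for row in rows:
        # symmetric scale so that the largest-magnitude weight maps to 127
        scale = max(abs(w) for w in row) / 127.0
        quantized.append([max(-128, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return quantized, scales

def dequantize(quantized, scales):
    """Recover approximate float weights from int8 values and per-row scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]

# round-trip a tiny weight matrix
w = [[0.3, -1.2, 0.7], [2.0, -0.5, 1.1]]
q, scales = quantize_int8_weight_only(w)
w_hat = dequantize(q, scales)
```

The reconstruction error per element is bounded by half the row's scale, which is the usual accuracy argument for weight-only schemes.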
