
Commit 6282506

Add video img2img (#3900)
* Add image to image video
* Improve
* better naming
* make fix copies
* add docs
* finish tests
* trigger tests
* make style
* correct
* finish
* Fix more
* make style
* finish
1 parent 5439e91 commit 6282506

File tree: 10 files changed (+1058, -5 lines)

docs/source/en/api/pipelines/text_to_video.mdx

Lines changed: 63 additions & 0 deletions
@@ -37,9 +37,12 @@ Resources:
 | Pipeline | Tasks | Demo
 |---|---|:---:|
 | [TextToVideoSDPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py) | *Text-to-Video Generation* | [🤗 Spaces](https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis)
+| [VideoToVideoSDPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth_img2img.py) | *Text-Guided Video-to-Video Generation* | [(TODO)🤗 Spaces]()
 
 ## Usage example
 
+### `text-to-video-ms-1.7b`
+
 Let's start by generating a short video with the default length of 16 frames (2s at 8 fps):
 
 ```python
@@ -119,12 +122,72 @@ Here are some sample outputs:
 </tr>
 </table>
 
+### `cerspense/zeroscope_v2_576w` & `cerspense/zeroscope_v2_XL`
+
+The Zeroscope checkpoints are watermark-free models that have been trained on specific resolutions such as `576x320` and `1024x576`.
+One should first generate a video with the lower-resolution checkpoint [`cerspense/zeroscope_v2_576w`](https://huggingface.co/cerspense/zeroscope_v2_576w) and [`TextToVideoSDPipeline`],
+which can then be upscaled using [`VideoToVideoSDPipeline`] and [`cerspense/zeroscope_v2_XL`](https://huggingface.co/cerspense/zeroscope_v2_XL).
+
+```py
+import torch
+from diffusers import DiffusionPipeline
+from diffusers.utils import export_to_video
+
+pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
+pipe.enable_model_cpu_offload()
+
+# memory optimization
+pipe.enable_vae_slicing()
+
+prompt = "Darth Vader surfing a wave"
+video_frames = pipe(prompt, num_frames=24).frames
+video_path = export_to_video(video_frames)
+video_path
+```
+
+Now the video can be upscaled:
+
+```py
+from PIL import Image
+from diffusers import DPMSolverMultistepScheduler
+
+pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_XL", torch_dtype=torch.float16)
+pipe.vae.enable_slicing()
+pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
+pipe.enable_model_cpu_offload()
+
+# resize the frames to the resolution the XL checkpoint was trained on
+video = [Image.fromarray(frame).resize((1024, 576)) for frame in video_frames]
+
+video_frames = pipe(prompt, video=video, strength=0.6).frames
+video_path = export_to_video(video_frames)
+video_path
+```
+
+Here are some sample outputs:
+
+<table>
+    <tr>
+        <td><center>
+        Darth Vader surfing in waves.
+        <br>
+        <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/darthvader_cerpense.gif"
+            alt="Darth Vader surfing in waves."
+            style="width: 576px;" />
+        </center></td>
+    </tr>
+</table>
+
 ## Available checkpoints
 
 * [damo-vilab/text-to-video-ms-1.7b](https://huggingface.co/damo-vilab/text-to-video-ms-1.7b/)
 * [damo-vilab/text-to-video-ms-1.7b-legacy](https://huggingface.co/damo-vilab/text-to-video-ms-1.7b-legacy)
+* [cerspense/zeroscope_v2_576w](https://huggingface.co/cerspense/zeroscope_v2_576w)
+* [cerspense/zeroscope_v2_XL](https://huggingface.co/cerspense/zeroscope_v2_XL)
 
 ## TextToVideoSDPipeline
 [[autodoc]] TextToVideoSDPipeline
 	- all
 	- __call__
+
+## VideoToVideoSDPipeline
+[[autodoc]] VideoToVideoSDPipeline
+	- all
+	- __call__

src/diffusers/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -173,6 +173,7 @@
     VersatileDiffusionImageVariationPipeline,
     VersatileDiffusionPipeline,
     VersatileDiffusionTextToImagePipeline,
+    VideoToVideoSDPipeline,
     VQDiffusionPipeline,
 )
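With this re-export, the new pipeline resolves from the package root like every other pipeline. A minimal smoke test of the public API (a sketch, assuming `torch` and `transformers` are installed and that loading the Zeroscope upscaler checkpoint through the concrete class works as it does for other pipelines):

```py
import torch

from diffusers import VideoToVideoSDPipeline

# VideoToVideoSDPipeline is now importable from the top-level namespace
pipe = VideoToVideoSDPipeline.from_pretrained(
    "cerspense/zeroscope_v2_XL", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()
```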

src/diffusers/models/autoencoder_kl.py

Lines changed: 6 additions & 1 deletion
@@ -229,7 +229,12 @@ def encode(self, x: torch.FloatTensor, return_dict: bool = True) -> AutoencoderK
         if self.use_tiling and (x.shape[-1] > self.tile_sample_min_size or x.shape[-2] > self.tile_sample_min_size):
             return self.tiled_encode(x, return_dict=return_dict)
 
-        h = self.encoder(x)
+        if self.use_slicing and x.shape[0] > 1:
+            encoded_slices = [self.encoder(x_slice) for x_slice in x.split(1)]
+            h = torch.cat(encoded_slices)
+        else:
+            h = self.encoder(x)
+
         moments = self.quant_conv(h)
         posterior = DiagonalGaussianDistribution(moments)
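This gives `encode` the same sliced path that `decode` already has: with slicing enabled and a batch larger than one, samples go through the encoder one at a time and are concatenated, trading a little speed for a much lower peak memory. That matters for video, where a whole clip reaches the VAE as a batch of frames. A small sketch of the effect (shapes are illustrative):

```py
import torch

from diffusers import AutoencoderKL

# load just the VAE from the lower-resolution Zeroscope checkpoint
vae = AutoencoderKL.from_pretrained("cerspense/zeroscope_v2_576w", subfolder="vae")
vae.enable_slicing()  # encode()/decode() now process the batch one sample at a time

frames = torch.randn(24, 3, 320, 576)  # a 24-frame clip, flattened to a batch of images
with torch.no_grad():
    latents = vae.encode(frames).latent_dist.sample()
print(latents.shape)  # torch.Size([24, 4, 40, 72]) — spatial dims reduced 8x
```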

src/diffusers/pipelines/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -89,7 +89,7 @@
     StableUnCLIPPipeline,
 )
 from .stable_diffusion_safe import StableDiffusionPipelineSafe
-from .text_to_video_synthesis import TextToVideoSDPipeline, TextToVideoZeroPipeline
+from .text_to_video_synthesis import TextToVideoSDPipeline, TextToVideoZeroPipeline, VideoToVideoSDPipeline
 from .unclip import UnCLIPImageVariationPipeline, UnCLIPPipeline
 from .unidiffuser import ImageTextPipelineOutput, UniDiffuserModel, UniDiffuserPipeline, UniDiffuserTextDecoder
 from .versatile_diffusion import (

src/diffusers/pipelines/text_to_video_synthesis/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -28,5 +28,6 @@ class TextToVideoSDPipelineOutput(BaseOutput):
 except OptionalDependencyNotAvailable:
     from ...utils.dummy_torch_and_transformers_objects import *  # noqa F403
 else:
-    from .pipeline_text_to_video_synth import TextToVideoSDPipeline  # noqa: F401
+    from .pipeline_text_to_video_synth import TextToVideoSDPipeline
+    from .pipeline_text_to_video_synth_img2img import VideoToVideoSDPipeline  # noqa: F401
     from .pipeline_text_to_video_zero import TextToVideoZeroPipeline

src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py

Lines changed: 3 additions & 0 deletions
@@ -672,6 +672,9 @@ def __call__(
                     if callback is not None and i % callback_steps == 0:
                         callback(i, t, latents)
 
+        if output_type == "latent":
+            return TextToVideoSDPipelineOutput(frames=latents)
+
         video_tensor = self.decode_latents(latents)
 
         if output_type == "pt":
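Returning early with the undecoded latents lets callers skip the VAE decode entirely, e.g. to reuse the denoised latents downstream. A minimal sketch of the new behavior (assuming `pipe` is a loaded `TextToVideoSDPipeline`; the shape comment is illustrative):

```py
# request raw latents instead of decoded frames
latents = pipe("Darth Vader surfing a wave", num_frames=16, output_type="latent").frames

# undecoded latents: (batch, channels, num_frames, height // 8, width // 8)
print(latents.shape)
```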
