
Commit e3a2c7f

sayakpaul and pcuenca authored
[Docs] Include more information in the "controlling generation" doc (#2434)
* edit controlling generation doc.
* add: demo link to pix2pix zero docs.
* refactor panorama a bit.
* Apply suggestions from code review
  Co-authored-by: Pedro Cuenca <[email protected]>
* fix: typo.

---------

Co-authored-by: Pedro Cuenca <[email protected]>
1 parent 1586186 commit e3a2c7f

File tree

2 files changed: +64 additions, -48 deletions


docs/source/en/api/pipelines/stable_diffusion/pix2pix_zero.mdx

Lines changed: 4 additions & 2 deletions
@@ -25,6 +25,7 @@ Resources:
 * [Project Page](https://pix2pixzero.github.io/).
 * [Paper](https://arxiv.org/abs/2302.03027).
 * [Original Code](https://github.com/pix2pixzero/pix2pix-zero).
+* [Demo](https://huggingface.co/spaces/pix2pix-zero-library/pix2pix-zero-demo).

 ## Tips

@@ -41,12 +42,13 @@ the above example, a valid input prompt would be: "a high resolution painting of
 * Change the input prompt to include "dog".
 * To learn more about how the source and target embeddings are generated, refer to the [original
 paper](https://arxiv.org/abs/2302.03027). Below, we also provide some directions on how to generate the embeddings.
+* Note that the quality of the outputs generated with this pipeline is dependent on how good the `source_embeds` and `target_embeds` are. Please, refer to [this discussion](#generating-source-and-target-embeddings) for some suggestions on the topic.

 ## Available Pipelines:

 | Pipeline | Tasks | Demo
 |---|---|:---:|
-| [StableDiffusionPix2PixZeroPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_pix2pix_zero.py) | *Text-Based Image Editing* | [🤗 Space] (soon) |
+| [StableDiffusionPix2PixZeroPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_pix2pix_zero.py) | *Text-Based Image Editing* | [🤗 Space](https://huggingface.co/spaces/pix2pix-zero-library/pix2pix-zero-demo) |

 <!-- TODO: add Colab -->

@@ -74,7 +76,7 @@ pipeline = StableDiffusionPix2PixZeroPipeline.from_pretrained(
 pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
 pipeline.to("cuda")

-prompt = "a high resolution painting of a cat in the style of van gough"
+prompt = "a high resolution painting of a cat in the style of van gogh"
 src_embs_url = "https://github.com/pix2pixzero/pix2pix-zero/raw/main/assets/embeddings_sd_1.4/cat.pt"
 target_embs_url = "https://github.com/pix2pixzero/pix2pix-zero/raw/main/assets/embeddings_sd_1.4/dog.pt"
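For context, the hunk above only touches the setup half of the usage snippet. Below is a minimal sketch of how the call is typically completed, assuming the SD 1.4 checkpoint (the embeddings referenced above were computed for it) and a small hypothetical `download` helper; `source_embeds`, `target_embeds`, and `cross_attention_guidance_amount` are the pipeline's documented call arguments:

```python
import requests
import torch
from diffusers import DDIMScheduler, StableDiffusionPix2PixZeroPipeline

# Assumed checkpoint: the concept embeddings above were computed for SD 1.4.
pipeline = StableDiffusionPix2PixZeroPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.to("cuda")


def download(url, path):
    # Hypothetical helper: fetch the pre-computed concept embeddings.
    with open(path, "wb") as f:
        f.write(requests.get(url).content)
    return path


src_embs_url = "https://github.com/pix2pixzero/pix2pix-zero/raw/main/assets/embeddings_sd_1.4/cat.pt"
target_embs_url = "https://github.com/pix2pixzero/pix2pix-zero/raw/main/assets/embeddings_sd_1.4/dog.pt"
source_embeds = torch.load(download(src_embs_url, "cat.pt"))
target_embeds = torch.load(download(target_embs_url, "dog.pt"))

prompt = "a high resolution painting of a cat in the style of van gogh"

# Translate the "cat" concept to "dog" while preserving the overall composition.
image = pipeline(
    prompt,
    source_embeds=source_embeds,
    target_embeds=target_embeds,
    num_inference_steps=50,
    cross_attention_guidance_amount=0.15,
).images[0]
image.save("painting_of_a_dog.png")
```

No weights are updated by this call; only the intermediate latents are optimized, which is why the quality of the source and target embeddings matters so much.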

docs/source/en/using-diffusers/controlling_generation.mdx

Lines changed: 60 additions & 46 deletions
@@ -1,4 +1,4 @@
-<!--Copyright 2022 The HuggingFace Team. All rights reserved.
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.

 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
@@ -27,108 +27,122 @@ Depending on the use case, one should choose a technique accordingly. In many ca
 Unless otherwise mentioned, these are techniques that work with existing models and don't require their own weights.

 1. [Instruct Pix2Pix](#instruct-pix2pix)
-2. [Pix2Pix 0](#pix2pixzero)
-3. [Attend and excite](#attend-and-excite)
-4. [Semantic guidance](#semantic-guidance)
-5. [Self attention guidance](#self-attention-guidance)
-6. [Depth2image](#depth2image)
-7. [DreamBooth](#dreambooth)
-8. [Textual Inversion](#textual-inversion)
-10. [MultiDiffusion Panorama](#panorama)
+2. [Pix2Pix Zero](#pix2pixzero)
+3. [Attend and Excite](#attend-and-excite)
+4. [Semantic Guidance](#semantic-guidance)
+5. [Self-attention Guidance](#self-attention-guidance)
+6. [Depth2Image](#depth2image)
+7. [MultiDiffusion Panorama](#multidiffusion-panorama)
+8. [DreamBooth](#dreambooth)
+9. [Textual Inversion](#textual-inversion)

-## Instruct pix2pix
+## Instruct Pix2Pix

-[Paper](https://github.com/timothybrooks/instruct-pix2pix)
+[Paper](https://arxiv.org/abs/2211.09800)

-[Pix2Pix](../api/pipelines/stable_diffusion/pix2pix) is fine-tuned from stable diffusion to support editing input images. It takes as input an image with a prompt describing an edit, and it outputs the edited image.
-Pix2Pix has been trained to work explicitely well with instructGPT-like prompts.
+[Instruct Pix2Pix](../api/pipelines/stable_diffusion/pix2pix) is fine-tuned from stable diffusion to support editing input images. It takes as inputs an image and a prompt describing an edit, and it outputs the edited image.
+Instruct Pix2Pix has been explicitly trained to work well with [InstructGPT](https://openai.com/blog/instruction-following/)-like prompts.

 See [here](../api/pipelines/stable_diffusion/pix2pix) for more information on how to use it.
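As a quick illustration of the "image plus edit instruction in, edited image out" interface described above, here is a minimal sketch; the `timbrooks/instruct-pix2pix` checkpoint, the local input path, and the concrete parameter values are assumptions rather than part of this diff:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

# Assumed checkpoint; any InstructPix2Pix-style checkpoint works the same way.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("input.png").convert("RGB")  # placeholder: any RGB image to edit

# The prompt is an edit instruction, not a description of the final image.
edited = pipe(
    "turn the sky into a starry night",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,  # how strongly to stay close to the input image
).images[0]
edited.save("edited.png")
```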

-## Pix2PixZero
+## Pix2Pix Zero

-[Paper](https://pix2pixzero.github.io/)
+[Paper](https://arxiv.org/abs/2302.03027)

-[Pix2Pix-zero](../api/pipelines/stable_diffusion/pix2pix_zero) allows modifying an image from one concept to another while preserving general image semantics.
+[Pix2Pix Zero](../api/pipelines/stable_diffusion/pix2pix_zero) allows modifying an image so that one concept or subject is translated to another one while preserving general image semantics.

 The denoising process is guided from one conceptual embedding towards another conceptual embedding. The intermediate latents are optimized during the denoising process to push the attention maps towards reference attention maps. The reference attention maps are from the denoising process of the input image and are used to encourage semantic preservation.

-Pix2PixZero can be used both to edit synthetic images as well as real images.
-- To edit synthetic images, one first generates on image given a caption.
-Next, for a concept of the caption that shall be edited as well as the new target concept one generates image captions (e.g. with a model like [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5)). Then, "mean" prompt embeddings for both the source and target concepts are created via the text encoder. Finally, the pix2pix-zero algorithm is used to edit the synthetic image.
-- To edit a real image, one first generates an image caption using a model like [Blip](https://huggingface.co/docs/transformers/model_doc/blip). Then one applies ddim inversion on the prompt and image to generate "inverse" latents. Similar to before, "mean" prompt embeddings for both source and target concepts are created and finally the pix2pix-zero algorithm in combination with the "inverse" latents is used to edit the image.
+Pix2Pix Zero can be used both to edit synthetic images as well as real images.
+- To edit synthetic images, one first generates an image given a caption.
+Next, we generate image captions for the concept that shall be edited and for the new target concept. We can use a model like [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) for this purpose. Then, "mean" prompt embeddings for both the source and target concepts are created via the text encoder. Finally, the pix2pix-zero algorithm is used to edit the synthetic image.
+- To edit a real image, one first generates an image caption using a model like [BLIP](https://huggingface.co/docs/transformers/model_doc/blip). Then one applies ddim inversion on the prompt and image to generate "inverse" latents. Similar to before, "mean" prompt embeddings for both source and target concepts are created and finally the pix2pix-zero algorithm in combination with the "inverse" latents is used to edit the image.

 <Tip>

-Pix2PixZero is the first model that allows "0-shot" image editing. This means that the model
+Pix2Pix Zero is the first model that allows "zero-shot" image editing. This means that the model
 can edit an image in less than a minute on a consumer GPU as shown [here](../api/pipelines/stable_diffusion/pix2pix_zero#usage-example)

 </Tip>

+As mentioned above, Pix2Pix Zero includes optimizing the latents (and not any of the UNet, VAE, or the text encoder) to steer the generation toward a specific concept. This means that the overall
+pipeline might require more memory than a standard [StableDiffusionPipeline](../api/pipelines/stable_diffusion/text2img).
+
 See [here](../api/pipelines/stable_diffusion/pix2pix_zero) for more information on how to use it.
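The real-image branch described above (caption, DDIM inversion, then editing with the inverted latents) can be sketched roughly as follows; the checkpoint, file paths, and the hand-written caption are assumptions, and the `invert()` call with `DDIMInverseScheduler` follows the pipeline's documented real-image example:

```python
import torch
from PIL import Image
from diffusers import DDIMInverseScheduler, DDIMScheduler, StableDiffusionPix2PixZeroPipeline

pipeline = StableDiffusionPix2PixZeroPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16, safety_checker=None
)
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
pipeline.to("cuda")

# A caption for the real input image (could also come from a captioning model like BLIP).
caption = "a photo of a cat sitting on a bench"
raw_image = Image.open("cat.png").convert("RGB").resize((512, 512))  # placeholder path

# DDIM inversion: recover "inverse" latents that reproduce the input image.
inv_latents = pipeline.invert(caption, image=raw_image).latents

# Pre-computed concept embeddings, e.g. obtained as in the earlier sketch.
source_embeds = torch.load("cat.pt")
target_embeds = torch.load("dog.pt")

image = pipeline(
    caption,
    source_embeds=source_embeds,
    target_embeds=target_embeds,
    latents=inv_latents,
    negative_prompt=caption,
    num_inference_steps=50,
    cross_attention_guidance_amount=0.15,
).images[0]
image.save("edited_cat.png")
```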

-## Attend and excite
+## Attend and Excite
+
+[Paper](https://arxiv.org/abs/2301.13826)

-[Paper](https://attendandexcite.github.io/Attend-and-Excite/)
+[Attend and Excite](../api/pipelines/stable_diffusion/attend_and_excite) allows subjects in the prompt to be faithfully represented in the final image.

-[Attend and excite](../api/pipelines/stable_diffusion/attend_and_excite) allows subjects in the prompt to be faithfully represented in the final image.
+A set of token indices are given as input, corresponding to the subjects in the prompt that need to be present in the image. During denoising, each token index is guaranteed to have a minimum attention threshold for at least one patch of the image. The intermediate latents are iteratively optimized during the denoising process to strengthen the attention of the most neglected subject token until the attention threshold is passed for all subject tokens.

-A set of token indices are given as input, corresponding to the subjects in the prompt that need to be present in the image. During denoising, each token index is insured to have above a minimum attention threshold for at least one patch of the image. The intermediate latents are iteratively optimized during the denoising process to strengthen the attention of the most neglected subject token until the attention threshold is passed for all subject tokens.
+Like Pix2Pix Zero, Attend and Excite also involves a mini optimization loop (leaving the pre-trained weights untouched) in its pipeline and can require more memory than the usual `StableDiffusionPipeline`.

 See [here](../api/pipelines/stable_diffusion/attend_and_excite) for more information on how to use it.
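A minimal sketch of the interface described above, where the caller passes the indices of the subject tokens; the checkpoint, prompt, and indices are assumptions:

```python
import torch
from diffusers import StableDiffusionAttendAndExcitePipeline

# Assumed checkpoint; the technique works on top of regular Stable Diffusion weights.
pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a cat and a frog"
# Indices of "cat" (2) and "frog" (5) in the tokenized prompt; both subjects are
# pushed to receive sufficient cross-attention during denoising.
image = pipe(
    prompt,
    token_indices=[2, 5],
    guidance_scale=7.5,
    num_inference_steps=50,
    max_iter_to_alter=25,  # number of denoising steps during which latents are optimized
).images[0]
image.save("cat_and_frog.png")
```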

-## Semantic guidance
+## Semantic Guidance (SEGA)

 [Paper](https://arxiv.org/abs/2301.12247)

 SEGA allows applying or removing one or more concepts from an image. The strength of the concept can also be controlled. I.e. the smile concept can be used to incrementally increase or decrease the smile of a portrait.

 Similar to how classifier free guidance provides guidance via empty prompt inputs, SEGA provides guidance on conceptual prompts. Multiple of these conceptual prompts can be applied simultaneously. Each conceptual prompt can either add or remove their concept depending on if the guidance is applied positively or negatively.

+Unlike Pix2Pix Zero or Attend and Excite, SEGA directly interacts with the diffusion process instead of performing any explicit gradient-based optimization.
+
 See [here](../api/pipelines/semantic_stable_diffusion) for more information on how to use it.
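A minimal sketch of applying a single concept edit with `SemanticStableDiffusionPipeline`; the checkpoint and concrete values are assumptions, while the `editing_prompt`-style arguments follow the pipeline's documented interface:

```python
import torch
from diffusers import SemanticStableDiffusionPipeline

# Assumed checkpoint; SEGA reuses standard Stable Diffusion weights.
pipe = SemanticStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

out = pipe(
    prompt="a photo of the face of a woman",
    num_inference_steps=50,
    guidance_scale=7,
    editing_prompt=["smiling, smile"],   # concept to steer towards
    reverse_editing_direction=[False],   # False adds the concept, True removes it
    edit_guidance_scale=[5],             # strength of the concept edit
    edit_warmup_steps=[10],              # steps before the edit guidance kicks in
    edit_threshold=[0.99],
)
out.images[0].save("smiling_portrait.png")
```

Because each of these arguments is a list, several concepts can be composed in one call, each with its own direction and strength.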

-## Self attention guidance
+## Self-attention Guidance (SAG)

 [Paper](https://arxiv.org/abs/2210.00939)

-[Self attention guidance](../api/pipelines/stable_diffusion/self_attention_guidance) improves the general quality of images.
+[Self-attention Guidance](../api/pipelines/stable_diffusion/self_attention_guidance) improves the general quality of images.

-SAG provides guidance from predictions not conditioned on high frequency details to fully conditioned images. The high frequency details are extracted out of the UNet self-attention maps.
+SAG provides guidance from predictions not conditioned on high-frequency details to fully conditioned images. The high frequency details are extracted out of the UNet self-attention maps.

 See [here](../api/pipelines/stable_diffusion/self_attention_guidance) for more information on how to use it.
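Since SAG is purely a sampling-time technique, using it amounts to swapping in the SAG pipeline and setting one extra knob. A minimal sketch, with the checkpoint and values as assumptions:

```python
import torch
from diffusers import StableDiffusionSAGPipeline

# Assumed checkpoint; SAG works on top of ordinary Stable Diffusion weights.
pipe = StableDiffusionSAGPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# sag_scale controls how strongly the self-attention-based guidance is applied;
# setting it to 0.0 falls back to plain classifier-free guidance.
image = pipe(
    "a photo of an astronaut riding a horse on mars",
    guidance_scale=7.5,
    sag_scale=0.75,
).images[0]
image.save("astronaut_sag.png")
```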

-## Depth2image
+## Depth2Image

-[Paper](https://huggingface.co/stabilityai/stable-diffusion-2-depth)
+[Project](https://huggingface.co/stabilityai/stable-diffusion-2-depth)

-[Depth2image](../pipelines/stable_diffusion_2#depthtoimage) is fine-tuned from stable diffusion to better preserve semantics for text guided image variation.
+[Depth2Image](../pipelines/stable_diffusion_2#depthtoimage) is fine-tuned from Stable Diffusion to better preserve semantics for text guided image variation.

 It conditions on a monocular depth estimate of the original image.

-
 See [here](../api/pipelines/stable_diffusion_2#depthtoimage) for more information on how to use it.
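A minimal sketch of depth-conditioned variation; the input image path, prompt, and `strength` value are assumptions:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("room.png").convert("RGB")  # placeholder input image

# A depth map is estimated from init_image internally and used as conditioning,
# so the scene layout is preserved while the prompt changes its content.
image = pipe(
    prompt="a cozy library with wooden shelves",
    image=init_image,
    negative_prompt="blurry, deformed",
    strength=0.7,  # how much of the original image content to override
).images[0]
image.save("library.png")
```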

-### Fine-tuning methods
+<Tip>
+
+An important distinction between methods like InstructPix2Pix and Pix2Pix Zero is that the former
+involves fine-tuning the pre-trained weights while the latter does not. This means that you can
+apply Pix2Pix Zero to any of the available Stable Diffusion models.
+
+</Tip>
+
+## MultiDiffusion Panorama
+
+[Paper](https://arxiv.org/abs/2302.08113)

-In addition to pre-trained models, diffusers has training scripts for fine-tuning models on user provided data.
+MultiDiffusion defines a new generation process over a pre-trained diffusion model. This process binds together multiple diffusion generation methods that can be readily applied to generate high quality and diverse images. Results adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes.
+[MultiDiffusion Panorama](../api/pipelines/stable_diffusion/panorama) allows to generate high-quality images at arbitrary aspect ratios (e.g., panoramas).
+
+See [here](../api/pipelines/stable_diffusion/panorama) for more information on how to use it to generate panoramic images.
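A minimal sketch of generating a panorama with the pipeline referenced above, following its DDIM-based setup; the checkpoint and the 512x2048 canvas are assumptions:

```python
import torch
from diffusers import DDIMScheduler, StableDiffusionPanoramaPipeline

model_id = "stabilityai/stable-diffusion-2-base"  # assumed checkpoint
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPanoramaPipeline.from_pretrained(
    model_id, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

# Requesting a wide canvas: the pipeline fuses overlapping diffusion windows
# into one coherent panorama instead of generating a single 512x512 crop.
image = pipe(
    "a photo of the dolomites",
    height=512,
    width=2048,
).images[0]
image.save("dolomites_panorama.png")
```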

-## DreamBooth
+## Fine-tuning your own models
+
+In addition to pre-trained models, Diffusers has training scripts for fine-tuning models on user-provided data.
+
+### DreamBooth

 [DreamBooth](../training/dreambooth) fine-tunes a model to teach it about a new subject. I.e. a few pictures of a person can be used to generate images of that person in different styles.

 See [here](../training/dreambooth) for more information on how to use it.

-## Textual Inversion
+### Textual Inversion

 [Textual Inversion](../training/text_inversion) fine-tunes a model to teach it about a new concept. I.e. a few pictures of a style of artwork can be used to generate images in that style.

 See [here](../training/text_inversion) for more information on how to use it.
-
-## MultiDiffusion Panorama
-
-[Paper](https://multidiffusion.github.io/)
-[Demo](https://huggingface.co/spaces/weizmannscience/MultiDiffusion)
-MultiDiffusion defines a new generation process over a pre-trained diffusion model. This process binds together multiple diffusion generation processes can be readily applied to generate high quality and diverse images that adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes.
-[MultiDiffusion Panorama](../api/pipelines/stable_diffusion/panorama) allows to generate high-quality images at arbitrary aspect ratios (e.g., panoramas).
-
-See [here](../api/pipelines/stable_diffusion/panorama) for more information on how to use it to generate panoramic images.
