Hi @songhan @lmxyy @ctlllll, thanks for such a great method. I tested it on several GPUs, including the 4090 and L40, and it worked well. On the H100 SXM, however, running with 2 GPUs is slower than running with only 1 GPU.
The test results are here:
Do you have any ideas about this? The code I used is below.
import time

import torch
from distrifuser.pipelines import DistriSDXLPipeline
from distrifuser.utils import DistriConfig

distri_config = DistriConfig(height=1024, width=1024, warmup_steps=1, use_cuda_graph=False, split_batch=False)
pipeline = DistriSDXLPipeline.from_pretrained(
    distri_config=distri_config,
    pretrained_model_name_or_path="Lykon/dreamshaper-xl-v2-turbo",
    variant="fp16",
    use_safetensors=True,
)
#pipeline.set_progress_bar_config(disable=distri_config.rank != 0)

start_time = time.time()
image = pipeline(
    prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
    generator=torch.Generator(device="cuda").manual_seed(233),
    num_inference_steps=8,
    guidance_scale=2.0,
).images[0]
end_time = time.time()
print("Time taken:", end_time - start_time)
if distri_config.rank == 0:
    image.save("astronaut.png")
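
The script above times a single pipeline call, which may also include one-time setup costs, so the 1-GPU and 2-GPU numbers might be worth re-checking with a warm-up run and several timed repetitions. Below is a minimal sketch of such a measurement. The `benchmark` helper and its parameters are hypothetical and not part of distrifuser; on a GPU one would presumably pass `torch.cuda.synchronize` as `sync` and wrap the `pipeline(...)` call in `fn`.

```python
import time
from statistics import median
from typing import Callable, List, Optional


def benchmark(
    fn: Callable[[], object],
    warmup_runs: int = 1,
    timed_runs: int = 3,
    sync: Optional[Callable[[], None]] = None,
) -> List[float]:
    """Return the wall-clock duration in seconds of each timed call to fn.

    Hypothetical helper for illustration only. `sync` is an optional
    callable run before and after each timed call (for example
    torch.cuda.synchronize) so that pending asynchronous work is
    included in the measurement.
    """
    # Untimed warm-up calls absorb one-time initialization costs.
    for _ in range(warmup_runs):
        fn()

    durations = []
    for _ in range(timed_runs):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Stand-in CPU workload; replace with a call to the real pipeline.
    times = benchmark(lambda: sum(range(100_000)), warmup_runs=1, timed_runs=5)
    print("Median time:", median(times))
```

Reporting the median over several runs rather than a single measurement should make the 1-GPU versus 2-GPU comparison less sensitive to outliers.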
