Inference on 2x H100 GPUs is slower than on 1 H100 GPU

Hi @songhan @lmxyy @ctlllll, thanks for such a great method. I tested it on several GPUs, including the 4090 and L40, and it worked well. On H100 SXM, however, inference with 2 GPUs is slower than with only 1 GPU.
The test results are below:

![Image](https://github.com/user-attachments/assets/fd5d2ae8-ec33-408e-b014-098be17f721a)

Do you have any idea what could cause this? The code I used is below.

```python
import time

import torch

from distrifuser.pipelines import DistriSDXLPipeline
from distrifuser.utils import DistriConfig

# Distributed settings used for this test.
distri_config = DistriConfig(
    height=1024,
    width=1024,
    warmup_steps=1,
    use_cuda_graph=False,
    split_batch=False,
)
pipeline = DistriSDXLPipeline.from_pretrained(
    distri_config=distri_config,
    pretrained_model_name_or_path="Lykon/dreamshaper-xl-v2-turbo",
    variant="fp16",
    use_safetensors=True,
)

# pipeline.set_progress_bar_config(disable=distri_config.rank != 0)

# Time a single end-to-end pipeline call (every rank prints its own time).
start_time = time.time()
image = pipeline(
    prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
    generator=torch.Generator(device="cuda").manual_seed(233),
    num_inference_steps=8,
    guidance_scale=2.0,
).images[0]
end_time = time.time()
print("Time taken:", end_time - start_time)

# Only rank 0 saves the image.
if distri_config.rank == 0:
    image.save("astronaut.png")
```
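
The script above times a single pipeline call, so the number may include one-time startup costs from the first run. To rule that out, here is a minimal sketch of a more careful measurement. `time_call` is a hypothetical helper (not part of distrifuser) that does untimed warmup runs, then reports the median of several timed runs, with an optional `sync` callable invoked before each clock read.

```python
import statistics
import time


def time_call(fn, warmup=1, repeats=3, sync=None):
    """Return the median wall-clock seconds of fn() over `repeats` timed runs.

    `sync` is an optional zero-argument callable invoked before each clock
    read. On a GPU this could be torch.cuda.synchronize (an assumption for
    this sketch); plain CPU code does not need it.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    sync = sync or (lambda: None)

    # Untimed runs absorb one-time setup costs.
    for _ in range(warmup):
        fn()

    timings = []
    for _ in range(repeats):
        sync()
        start = time.perf_counter()
        fn()
        sync()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a GPU.
    elapsed = time_call(lambda: sum(range(100_000)), warmup=1, repeats=5)
    print(f"median time: {elapsed:.6f} s")
```

With the script above, this could be used as `time_call(lambda: pipeline(...), sync=torch.cuda.synchronize)`, so that the 1-GPU and 2-GPU numbers are compared without first-run overhead.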
