
Recent Qwen2VL merge request (#35837) breaks compatibility with DeepSpeed #36187

Description

@ArdalanM

The recent merge request (#35837) works with plain accelerate but breaks under DeepSpeed (both with and without a deepspeed config):

  • distributed_type: MULTI_GPU (works)
  • distributed_type: DEEPSPEED (no longer works)

To be more precise, the issue lies in this section: https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py#L200

    if position_embeddings is None:
        emb = torch.cat((rotary_pos_emb, rotary_pos_emb), dim=-1)
        cos = emb.cos().float()
        sin = emb.sin().float()
    else:
        cos, sin = position_embeddings
    q, k = apply_rotary_pos_emb_flashatt(q.unsqueeze(0), k.unsqueeze(0), cos, sin)

In the else branch (cos, sin = position_embeddings), the embeddings are never cast to float, so their dtype varies with the DeepSpeed and mixed_precision configuration. Only the internally computed branch gets the .float() cast.
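One possible fix would be to move the cast after the branch, so precomputed embeddings get the same treatment as the internally computed ones. A minimal sketch (assuming the missing cast is the only problem, not the maintainers' actual patch):

    if position_embeddings is None:
        emb = torch.cat((rotary_pos_emb, rotary_pos_emb), dim=-1)
        cos = emb.cos()
        sin = emb.sin()
    else:
        cos, sin = position_embeddings
    # Cast in both branches so bf16/fp16 embeddings produced under DeepSpeed
    # mixed precision never reach the flash-attention rotary kernel.
    cos = cos.float()
    sin = sin.float()
    q, k = apply_rotary_pos_emb_flashatt(q.unsqueeze(0), k.unsqueeze(0), cos, sin)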

This accelerate config works:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
main_training_function: main
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
mixed_precision: bf16

This accelerate config no longer works:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 3
downcast_bf16: 'no'
enable_cpu_affinity: false
main_training_function: main
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
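To confirm the mismatch, one can log the dtype that actually reaches each vision attention layer under the two configs. A minimal sketch, assuming the prepared model exposes the vision blocks as model.visual.blocks (the variable `model` and the hook below are illustrative, not from the report):

    def log_rope_dtypes(module, args, kwargs):
        # position_embeddings is passed to the attention layer as a kwarg.
        position_embeddings = kwargs.get("position_embeddings")
        if position_embeddings is not None:
            cos, sin = position_embeddings
            print(f"{module.__class__.__name__}: cos={cos.dtype}, sin={sin.dtype}")

    # `model` is the Qwen2.5-VL model after accelerator.prepare(...).
    for block in model.visual.blocks:
        block.attn.register_forward_pre_hook(log_rope_dtypes, with_kwargs=True)

Under the MULTI_GPU config the embeddings stay float32; under the DEEPSPEED config they would show up as bf16/fp16, which is what then hits apply_rotary_pos_emb_flashatt uncast.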
