
Qwen3.5 FSDP SFT Bug #5944

@feirz-tech

Description

System Info

----------Python Info----------
Version : 3.12.12
Compiler : GCC 11.4.0
Build : ('main', 'Nov 28 2025 11:02:15')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 26.0.1
Directory : /usr/local/lib/python3.12/site-packages/pip
vllm : 0.17.1
sglang : not found.
ray : 2.54.0
torch : 2.10.0
----------verl Info-----------
Version : 0.8.0.dev
Directory : /mnt/workspace/Smartflow/verl-main/verl
Commit Hash : e997236
----------Platform Info----------
Platform : Linux-5.10.134-013.5.kangaroo.al8.x86_64-x86_64-with-glibc2.35
system : Linux
node : dsw-404383-7464648d7f-7mcdr
release : 5.10.134-013.5.kangaroo.al8.x86_64
version : #1 SMP Fri Feb 27 08:07:36 UTC 2026
----------Environment----------
CUDA Runtime : 12.8
CUDA Compiler : Cuda compilation tools, release 12.9, V12.9.86

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Could someone please take a look? Running Qwen3.5 SFT with FSDP2 hits the bug below.
My config:
torchrun \
  --nproc_per_node=8 \
  --nnodes=1 \
  -m verl.trainer.sft_trainer \
  data.train_files=${TRAIN_FILES} \
  data.val_files=${VAL_FILES} \
  data.messages_key=messages \
  data.max_length=${TRAIN_MAX_LEN} \
  data.max_token_len_per_gpu=${TRAIN_MAX_LEN} \
  data.micro_batch_size_per_gpu=${MICRO_BATCH_SIZE_PER_GPU} \
  data.train_batch_size=${TRAIN_BATCH_SIZE} \
  data.num_workers=8 \
  model.path=${MODEL_PATH} \
  model.enable_gradient_checkpointing=True \
  model.use_liger=False \
  trainer.total_epochs=${EPOCH} \
  trainer.save_freq=${SAVE_FREQ} \
  +trainer.val_freq=-1 \
  trainer.default_hdfs_dir=null \
  trainer.default_local_dir=${OUTPUT_DIR} \
  trainer.project_name=${PROJECT_NAME} \
  trainer.experiment_name=${EXPERIMENT_NAME} \
  trainer.logger='["console","wandb"]' \
  optim.lr=${LR} \
  optim.lr_warmup_steps_ratio=${WARMUP_RATIO} \
  optim.weight_decay=0.01 \
  data.ignore_input_ids_mismatch=True \
  optim.lr_scheduler_type=cosine \
  optim.betas='[0.9,0.999]' \
  optim.clip_grad=1.0 \
  optim.min_lr_ratio=0 \
  model=hf_model \
  engine.strategy=fsdp2 \
  engine.fsdp_size=8 \
  engine.model_dtype=bfloat16 \
  +engine.wrap_policy.transformer_layer_cls_to_wrap='[Qwen3_5DecoderLayer,Qwen3_5VisionBlock]' \
  2>&1 | tee debug.log
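
For reference, this is how I understand the +engine.wrap_policy.transformer_layer_cls_to_wrap override to behave under FSDP2: each listed layer class gets its own shard group, then the root model is sharded. This is a minimal sketch based on PyTorch's fully_shard API, not verl's actual wrapping code, and it assumes the process group is already initialized (e.g. under torchrun):

# Minimal sketch of per-layer FSDP2 wrapping, assuming the wrap-policy override is
# resolved to module class names and applied via torch's fully_shard API.
# Assumption for illustration only, not verl's actual implementation.
import torch
from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy


def shard_transformer_blocks(model: torch.nn.Module, layer_cls_names: list[str]) -> torch.nn.Module:
    # Mirrors engine.model_dtype=bfloat16 from the command above.
    mp_policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16)
    # Give each decoder / vision block its own shard group ...
    for module in model.modules():
        if type(module).__name__ in layer_cls_names:
            fully_shard(module, mp_policy=mp_policy)
    # ... then shard the root so the remaining parameters form the last group.
    fully_shard(model, mp_policy=mp_policy)
    return model


# Hypothetical usage matching the override in the command above:
# shard_transformer_blocks(model, ["Qwen3_5DecoderLayer", "Qwen3_5VisionBlock"])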
Error traceback:
[rank6]: File "/mnt/work/Smartflow/verl-main/verl/models/transformers/monkey_patch.py", line 146, in _ulysses_flash_attention_forward
[rank6]: attn_output = _flash_attention_forward(
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/transformers/modeling_flash_attention_utils.py", line 684, in _flash_attention_forward
[rank6]: out = flash_varlen_fn(
[rank6]: ^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py", line 1443, in flash_attn_varlen_func
[rank6]: return FlashAttnVarlenFunc.apply(
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/autograd/function.py", line 583, in apply
[rank6]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py", line 925, in forward
[rank6]: out_padded, softmax_lse, S_dmask, rng_state = _wrapped_flash_attn_varlen_forward(
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_ops.py", line 1209, in call
[rank6]: return self._op(*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_library/autograd.py", line 112, in autograd_impl
[rank6]: result = forward_no_grad(*args, Metadata(keyset, keyword_only_args))
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_library/autograd.py", line 41, in forward_no_grad
[rank6]: result = op.redispatch(keyset & _C._after_autograd_keyset, *args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_ops.py", line 826, in redispatch
[rank6]: return self._handle.redispatch_boxed(keyset, *args, **kwargs) # type: ignore[return-value]
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_library/custom_ops.py", line 347, in backend_impl
[rank6]: result = self._backend_fns[device_type](*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_compile.py", line 54, in inner
[rank6]: return disable_fn(*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1181, in _fn
[rank6]: return fn(*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_library/custom_ops.py", line 382, in wrapped_fn
[rank6]: return fn(*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py", line 165, in _flash_attn_varlen_forward
[rank6]: out, softmax_lse, S_dmask, rng_state = flash_attn_gpu.varlen_fwd(
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: torch.AcceleratorError: CUDA error: an illegal memory access was encountered
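
Since CUDA kernels launch asynchronously, the illegal memory access may originate in an earlier kernel than flash_attn_gpu.varlen_fwd; re-running with CUDA_LAUNCH_BLOCKING=1 should give a more accurate stack. To isolate the attention kernel itself, I can also run a standalone call to flash_attn_varlen_func outside verl/FSDP; the head counts and dims below are placeholders, not Qwen3.5's real configuration:

# Standalone sanity check for the failing kernel (flash_attn_varlen_func), run
# outside verl/FSDP to see whether the illegal memory access reproduces on this
# GPU/driver combination with the same dtype.
import torch
from flash_attn import flash_attn_varlen_func


def varlen_smoke_test(total_tokens: int = 4096, n_heads: int = 32, n_kv_heads: int = 8, head_dim: int = 128):
    q = torch.randn(total_tokens, n_heads, head_dim, dtype=torch.bfloat16, device="cuda")
    k = torch.randn(total_tokens, n_kv_heads, head_dim, dtype=torch.bfloat16, device="cuda")
    v = torch.randn_like(k)
    # Two packed sequences of equal length; cu_seqlens must be int32 and start at 0.
    cu_seqlens = torch.tensor([0, total_tokens // 2, total_tokens], dtype=torch.int32, device="cuda")
    max_seqlen = total_tokens // 2
    out = flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
        causal=True,
    )
    torch.cuda.synchronize()  # force any kernel-level error to surface here
    return out.shape


if __name__ == "__main__":
    print(varlen_smoke_test())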

Expected behavior

Training completes normally.
