System Info
----------Python Info----------
Version : 3.12.12
Compiler : GCC 11.4.0
Build : ('main', 'Nov 28 2025 11:02:15')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 26.0.1
Directory : /usr/local/lib/python3.12/site-packages/pip
vllm : 0.17.1
sglang : not found.
ray : 2.54.0
torch : 2.10.0
----------verl Info-----------
Version : 0.8.0.dev
Directory : /mnt/workspace/Smartflow/verl-main/verl
Commit Hash : e997236
----------Platform Info----------
Platform : Linux-5.10.134-013.5.kangaroo.al8.x86_64-x86_64-with-glibc2.35
system : Linux
node : dsw-404383-7464648d7f-7mcdr
release : 5.10.134-013.5.kangaroo.al8.x86_64
version : #1 SMP Fri Feb 27 08:07:36 UTC 2026
----------Environment----------
CUDA Runtime : 12.8
CUDA Compiler : Cuda compilation tools, release 12.9, V12.9.86
Reproduction
Could someone help take a look? Running SFT on qwen3.5 with FSDP2 hits the bug below.
My config:
torchrun \
    --nproc_per_node=8 \
    --nnodes=1 \
    -m verl.trainer.sft_trainer \
    data.train_files=${TRAIN_FILES} \
    data.val_files=${VAL_FILES} \
    data.messages_key=messages \
    data.max_length=${TRAIN_MAX_LEN} \
    data.max_token_len_per_gpu=${TRAIN_MAX_LEN} \
    data.micro_batch_size_per_gpu=${MICRO_BATCH_SIZE_PER_GPU} \
    data.train_batch_size=${TRAIN_BATCH_SIZE} \
    data.num_workers=8 \
    model.path=${MODEL_PATH} \
    model.enable_gradient_checkpointing=True \
    model.use_liger=False \
    trainer.total_epochs=${EPOCH} \
    trainer.save_freq=${SAVE_FREQ} \
    +trainer.val_freq=-1 \
    trainer.default_hdfs_dir=null \
    trainer.default_local_dir=${OUTPUT_DIR} \
    trainer.project_name=${PROJECT_NAME} \
    trainer.experiment_name=${EXPERIMENT_NAME} \
    trainer.logger='["console","wandb"]' \
    optim.lr=${LR} \
    optim.lr_warmup_steps_ratio=${WARMUP_RATIO} \
    optim.weight_decay=0.01 \
    data.ignore_input_ids_mismatch=True \
    optim.lr_scheduler_type=cosine \
    optim.betas='[0.9,0.999]' \
    optim.clip_grad=1.0 \
    optim.min_lr_ratio=0 \
    model=hf_model \
    engine.strategy=fsdp2 \
    engine.fsdp_size=8 \
    engine.model_dtype=bfloat16 \
    +engine.wrap_policy.transformer_layer_cls_to_wrap='[Qwen3_5DecoderLayer,Qwen3_5VisionBlock]' \
    2>&1 | tee debug.log
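One side note on the last override: transformer_layer_cls_to_wrap takes bare class names, so it is worth confirming that Qwen3_5DecoderLayer and Qwen3_5VisionBlock actually exist in the installed transformers build. A minimal sketch of such a check (the module paths below are assumptions, adjust them to wherever the qwen3.5 modeling code lives in your transformers version):

import importlib

# Hypothetical locations of the qwen3.5 modeling code -- assumptions, adjust as needed.
CANDIDATE_MODULES = [
    "transformers.models.qwen3_5.modeling_qwen3_5",
    "transformers.models.qwen3_5_vl.modeling_qwen3_5_vl",
]

for cls_name in ["Qwen3_5DecoderLayer", "Qwen3_5VisionBlock"]:
    hits = []
    for mod_path in CANDIDATE_MODULES:
        try:
            mod = importlib.import_module(mod_path)
        except ImportError:
            continue  # module path guess was wrong; try the next one
        if hasattr(mod, cls_name):
            hits.append(mod_path)
    status = hits[0] if hits else "NOT FOUND -- check wrap_policy spelling"
    print(f"{cls_name}: {status}")

A misspelled class name would leave those layers unwrapped by FSDP2, which changes memory behavior but should not by itself cause an illegal memory access; this check just rules it out.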
Bug traceback:
[rank6]: File "/mnt/work/Smartflow/verl-main/verl/models/transformers/monkey_patch.py", line 146, in _ulysses_flash_attention_forward
[rank6]: attn_output = _flash_attention_forward(
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/transformers/modeling_flash_attention_utils.py", line 684, in _flash_attention_forward
[rank6]: out = flash_varlen_fn(
[rank6]: ^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py", line 1443, in flash_attn_varlen_func
[rank6]: return FlashAttnVarlenFunc.apply(
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/autograd/function.py", line 583, in apply
[rank6]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py", line 925, in forward
[rank6]: out_padded, softmax_lse, S_dmask, rng_state = _wrapped_flash_attn_varlen_forward(
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_ops.py", line 1209, in call
[rank6]: return self._op(*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_library/autograd.py", line 112, in autograd_impl
[rank6]: result = forward_no_grad(*args, Metadata(keyset, keyword_only_args))
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_library/autograd.py", line 41, in forward_no_grad
[rank6]: result = op.redispatch(keyset & _C._after_autograd_keyset, *args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_ops.py", line 826, in redispatch
[rank6]: return self._handle.redispatch_boxed(keyset, *args, **kwargs) # type: ignore[return-value]
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_library/custom_ops.py", line 347, in backend_impl
[rank6]: result = self._backend_fns[device_type](*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_compile.py", line 54, in inner
[rank6]: return disable_fn(*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1181, in _fn
[rank6]: return fn(*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_library/custom_ops.py", line 382, in wrapped_fn
[rank6]: return fn(*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py", line 165, in _flash_attn_varlen_forward
[rank6]: out, softmax_lse, S_dmask, rng_state = flash_attn_gpu.varlen_fwd(
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: torch.AcceleratorError: CUDA error: an illegal memory access was encountered
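Since the crash surfaces inside flash_attn_gpu.varlen_fwd and CUDA reports errors asynchronously, the frame above may not be the op that actually faulted. A debugging sketch (shapes are arbitrary placeholders, not taken from the failing run): force synchronous launches, then call the varlen kernel in isolation in bf16. If the standalone call also faults, the flash-attn build / GPU combination is the suspect; if it passes, attention shifts to the tensors produced by verl's _ulysses_flash_attention_forward patch.

import os
# CUDA_LAUNCH_BLOCKING must be set before CUDA is initialized, hence before
# any torch CUDA work: synchronous launches make the traceback point at the
# kernel that actually faults.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
from flash_attn import flash_attn_varlen_func

# Two packed sequences (lengths 1024 and 512) in bf16, the dtype from the
# failing config. These sizes are arbitrary placeholders.
nheads, headdim = 16, 128
cu_seqlens = torch.tensor([0, 1024, 1536], dtype=torch.int32, device="cuda")
total = int(cu_seqlens[-1].item())
q = torch.randn(total, nheads, headdim, dtype=torch.bfloat16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=1024, max_seqlen_k=1024,
    causal=True,
)
torch.cuda.synchronize()  # surface any async error here
print("varlen_fwd OK:", tuple(out.shape))  # expected: (1536, 16, 128)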
Expected behavior
Training runs normally.