System Info
----------Python Info----------
Version : 3.12.12
Compiler : GCC 11.4.0
Build : ('main', 'Nov 28 2025 11:02:15')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 26.0.1
Directory : /usr/local/lib/python3.12/site-packages/pip
vllm : 0.17.1
sglang : not found.
ray : 2.54.0
torch : 2.10.0
----------verl Info-----------
Version : 0.8.0.dev
Directory : /mnt/workspace/Smartflow/verl-main/verl
Commit Hash : e997236
----------Platform Info----------
Platform : Linux-5.10.134-013.5.kangaroo.al8.x86_64-x86_64-with-glibc2.35
system : Linux
node : dsw-404383-7464648d7f-7mcdr
release : 5.10.134-013.5.kangaroo.al8.x86_64
version : #1 SMP Fri Feb 27 08:07:36 UTC 2026
----------Environment----------
CUDA Runtime : 12.8
CUDA Compiler : Cuda compilation tools, release 12.9, V12.9.86
Reproduction
Could someone help take a look? Running SFT on qwen3.5 with FSDP2 hits the bug below.
My config:
torchrun \
    --nproc_per_node=8 \
    --nnodes=1 \
    -m verl.trainer.sft_trainer \
    data.train_files=${TRAIN_FILES} \
    data.val_files=${VAL_FILES} \
    data.messages_key=messages \
    data.max_length=${TRAIN_MAX_LEN} \
    data.max_token_len_per_gpu=${TRAIN_MAX_LEN} \
    data.micro_batch_size_per_gpu=${MICRO_BATCH_SIZE_PER_GPU} \
    data.train_batch_size=${TRAIN_BATCH_SIZE} \
    data.num_workers=8 \
    model.path=${MODEL_PATH} \
    model.enable_gradient_checkpointing=True \
    model.use_liger=False \
    trainer.total_epochs=${EPOCH} \
    trainer.save_freq=${SAVE_FREQ} \
    +trainer.val_freq=-1 \
    trainer.default_hdfs_dir=null \
    trainer.default_local_dir=${OUTPUT_DIR} \
    trainer.project_name=${PROJECT_NAME} \
    trainer.experiment_name=${EXPERIMENT_NAME} \
    trainer.logger='["console","wandb"]' \
    optim.lr=${LR} \
    optim.lr_warmup_steps_ratio=${WARMUP_RATIO} \
    optim.weight_decay=0.01 \
    data.ignore_input_ids_mismatch=True \
    optim.lr_scheduler_type=cosine \
    optim.betas='[0.9,0.999]' \
    optim.clip_grad=1.0 \
    optim.min_lr_ratio=0 \
    model=hf_model \
    engine.strategy=fsdp2 \
    engine.fsdp_size=8 \
    engine.model_dtype=bfloat16 \
    +engine.wrap_policy.transformer_layer_cls_to_wrap='[Qwen3_5DecoderLayer,Qwen3_5VisionBlock]' \
    2>&1 | tee debug.log
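One side note on the last override: transformer_layer_cls_to_wrap takes bare class names, so it is worth confirming that Qwen3_5DecoderLayer and Qwen3_5VisionBlock actually exist in the installed transformers build. A minimal sketch of such a check (the module paths below are assumptions, adjust them to wherever the qwen3.5 modeling code lives in your transformers version):

import importlib

# Hypothetical locations of the qwen3.5 modeling code -- assumptions, adjust as needed.
CANDIDATE_MODULES = [
    "transformers.models.qwen3_5.modeling_qwen3_5",
    "transformers.models.qwen3_5_vl.modeling_qwen3_5_vl",
]

for cls_name in ["Qwen3_5DecoderLayer", "Qwen3_5VisionBlock"]:
    hits = []
    for mod_path in CANDIDATE_MODULES:
        try:
            mod = importlib.import_module(mod_path)
        except ImportError:
            continue  # module path guess was wrong; try the next one
        if hasattr(mod, cls_name):
            hits.append(mod_path)
    status = hits[0] if hits else "NOT FOUND -- check wrap_policy spelling"
    print(f"{cls_name}: {status}")

A misspelled class name would leave those layers unwrapped by FSDP2, which changes memory behavior but should not by itself cause an illegal memory access; this check just rules it out.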
Bug traceback:
[rank6]: File "/mnt/work/Smartflow/verl-main/verl/models/transformers/monkey_patch.py", line 146, in _ulysses_flash_attention_forward
[rank6]: attn_output = _flash_attention_forward(
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/transformers/modeling_flash_attention_utils.py", line 684, in _flash_attention_forward
[rank6]: out = flash_varlen_fn(
[rank6]: ^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py", line 1443, in flash_attn_varlen_func
[rank6]: return FlashAttnVarlenFunc.apply(
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/autograd/function.py", line 583, in apply
[rank6]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py", line 925, in forward
[rank6]: out_padded, softmax_lse, S_dmask, rng_state = _wrapped_flash_attn_varlen_forward(
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_ops.py", line 1209, in call
[rank6]: return self._op(*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_library/autograd.py", line 112, in autograd_impl
[rank6]: result = forward_no_grad(*args, Metadata(keyset, keyword_only_args))
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_library/autograd.py", line 41, in forward_no_grad
[rank6]: result = op.redispatch(keyset & _C._after_autograd_keyset, *args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_ops.py", line 826, in redispatch
[rank6]: return self._handle.redispatch_boxed(keyset, *args, **kwargs) # type: ignore[return-value]
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_library/custom_ops.py", line 347, in backend_impl
[rank6]: result = self._backend_fns[device_type](*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_compile.py", line 54, in inner
[rank6]: return disable_fn(*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1181, in _fn
[rank6]: return fn(*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/torch/_library/custom_ops.py", line 382, in wrapped_fn
[rank6]: return fn(*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^
[rank6]: File "/usr/local/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py", line 165, in _flash_attn_varlen_forward
[rank6]: out, softmax_lse, S_dmask, rng_state = flash_attn_gpu.varlen_fwd(
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: torch.AcceleratorError: CUDA error: an illegal memory access was encountered
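Since the crash surfaces inside flash_attn_gpu.varlen_fwd and CUDA reports errors asynchronously, the frame above may not be the op that actually faulted. A debugging sketch (shapes are arbitrary placeholders, not taken from the failing run): force synchronous launches, then call the varlen kernel in isolation in bf16. If the standalone call also faults, the flash-attn build / GPU combination is the suspect; if it passes, attention shifts to the tensors produced by verl's _ulysses_flash_attention_forward patch.

import os
# CUDA_LAUNCH_BLOCKING must be set before CUDA is initialized, hence before
# any torch CUDA work: synchronous launches make the traceback point at the
# kernel that actually faults.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
from flash_attn import flash_attn_varlen_func

# Two packed sequences (lengths 1024 and 512) in bf16, the dtype from the
# failing config. These sizes are arbitrary placeholders.
nheads, headdim = 16, 128
cu_seqlens = torch.tensor([0, 1024, 1536], dtype=torch.int32, device="cuda")
total = int(cu_seqlens[-1].item())
q = torch.randn(total, nheads, headdim, dtype=torch.bfloat16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=1024, max_seqlen_k=1024,
    causal=True,
)
torch.cuda.synchronize()  # surface any async error here
print("varlen_fwd OK:", tuple(out.shape))  # expected: (1536, 16, 128)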
Expected behavior
Training runs normally.