
Conversation

@RobotSail
Member

With Qwen3, there's an edge case that can break the unmask/mask logic during data processing.

Root Cause: The error occurs specifically when using the Qwen/Qwen3-32B tokenizer, not with Qwen/Qwen2.5-32B-Instruct. The problematic sample contains multiple tags in the assistant's response.

Issue Location: The error occurs in data_process.py:555 in the unmask_messages function, where it encounters an <|UNMASK_END|> token while not in an unmasking state.
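
To make the failure concrete, here is a minimal sketch of the kind of state machine the description implies. It is not the actual unmask_messages code, and the function name unmask_flags is made up for illustration; only the marker strings come from this description.

```python
UNMASK_BEGIN = "<|UNMASK_BEGIN|>"
UNMASK_END = "<|UNMASK_END|>"


def unmask_flags(tokens: list[str]) -> list[bool]:
    """For every non-marker token, record whether it should contribute to the loss."""
    flags = []
    unmasking = False
    for idx, tok in enumerate(tokens):
        if tok == UNMASK_BEGIN:
            unmasking = True
        elif tok == UNMASK_END:
            if not unmasking:
                # Failure mode described above: an end marker is seen while the
                # walker is not in an unmasking state.
                raise ValueError(
                    f"encountered {UNMASK_END} at position {idx} while not unmasking"
                )
            unmasking = False
        else:
            flags.append(unmasking)
    return flags
```

If a chat template drops or reorders an <|UNMASK_BEGIN|> marker, the matching <|UNMASK_END|> arrives while the walker is not unmasking and the error is raised, which is the state reported at data_process.py:555.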

Key Findings:

  1. Model-specific issue: The sample processes fine with Qwen/Qwen2.5-32B-Instruct but fails with Qwen/Qwen3-32B
  2. Chat template differences: Different models have different chat templates that may tokenize the unmask tokens differently (see the diagnostic sketch after this list)
  3. Token ordering: The issue suggests that the unmask tokens are getting reordered or processed incorrectly by the Qwen3 chat template
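
Finding 2 can be checked directly. The sketch below is a diagnostic rather than anything from this PR; it assumes the Hugging Face transformers library, access to both checkpoints, and that the unmask markers are plain strings that may or may not be registered as special tokens in each tokenizer.

```python
# Check whether each tokenizer treats the unmask markers atomically
# or splits them into multiple pieces.
from transformers import AutoTokenizer

MARKERS = ["<|UNMASK_BEGIN|>", "<|UNMASK_END|>"]

for model_id in ("Qwen/Qwen2.5-32B-Instruct", "Qwen/Qwen3-32B"):
    tok = AutoTokenizer.from_pretrained(model_id)
    for marker in MARKERS:
        pieces = tok.tokenize(marker)
        ids = tok.encode(marker, add_special_tokens=False)
        print(f"{model_id}: {marker!r} -> {len(pieces)} piece(s), ids={ids}")
```

A marker that maps to a single id can be matched as one token during masking; a marker that splits into several pieces cannot, so begin/end detection then depends on how each model's tokenizer and chat template happen to handle it.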

The Problem:
The Qwen/Qwen3-32B model's chat template processes the <|UNMASK_BEGIN|> and <|UNMASK_END|> tokens in a way that causes them to appear out of order or in an unexpected state, so the algorithm encounters an <|UNMASK_END|> token when it is not actively unmasking.

This is likely due to differences in how the chat templates of Qwen2.5 vs Qwen3 handle special tokens, particularly when there are multiple special tokens or complex content like the tags present in the assistant's response.
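
One way to see whether the Qwen3 template actually reorders or drops markers is to render the same conversation through both chat templates and check marker ordering in the rendered text. This is a diagnostic sketch, not the fix in this PR; the placement of the markers inside the message content and the sample messages themselves are assumptions made for illustration.

```python
from transformers import AutoTokenizer

BEGIN, END = "<|UNMASK_BEGIN|>", "<|UNMASK_END|>"

# Stand-in conversation; the real problematic sample is not reproduced here.
messages = [
    {"role": "user", "content": f"{BEGIN}What is 2 + 2?{END}"},
    {"role": "assistant", "content": f"{BEGIN}2 + 2 = 4.{END}"},
]


def markers_balanced(text: str) -> bool:
    """True if every END marker in `text` is preceded by an unmatched BEGIN."""
    depth = 0
    i = 0
    while i < len(text):
        if text.startswith(BEGIN, i):
            depth += 1
            i += len(BEGIN)
        elif text.startswith(END, i):
            if depth == 0:
                return False  # end marker with no open begin marker
            depth -= 1
            i += len(END)
        else:
            i += 1
    return depth == 0


for model_id in ("Qwen/Qwen2.5-32B-Instruct", "Qwen/Qwen3-32B"):
    tok = AutoTokenizer.from_pretrained(model_id)
    rendered = tok.apply_chat_template(messages, tokenize=False)
    print(model_id, "markers balanced:", markers_balanced(rendered))
```

If the Qwen3 rendering comes back unbalanced while the Qwen2.5 rendering does not, that reproduces the state the unmask walker trips over.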

Signed-off-by: Oleg S [email protected]

@mergify mergify bot added the ci-failure label Jun 23, 2025
@RobotSail RobotSail force-pushed the fix-data-processing branch from d9a21d8 to 79c2f2e Compare June 24, 2025 04:04
@mergify mergify bot added documentation Improvements or additions to documentation testing Relates to testing labels Jun 24, 2025
@RobotSail RobotSail force-pushed the fix-data-processing branch 7 times, most recently from 765eb34 to 414c296 Compare June 27, 2025 05:17
@cdoern
Contributor

cdoern commented Jun 27, 2025

@RobotSail, we are working on CI. In the meantime, please just run the Large e2e job manually on this PR when it's ready for review.

@github-actions

E2E (NVIDIA L40S x4) (python 3.11) workflow launched on this PR: View run

@github-actions

e2e workflow succeeded on this PR: View run, congrats!

@Maxusmusti Maxusmusti left a comment
Collaborator

LGTM

@mergify mergify bot added the one-approval label Jun 30, 2025
@cdoern
Contributor

cdoern commented Jun 30, 2025

@Mergifyio rebase

@mergify
Contributor

mergify bot commented Jun 30, 2025

rebase

✅ Branch has been successfully rebased

@cdoern cdoern force-pushed the fix-data-processing branch from 968ba35 to a05d1e4 Compare June 30, 2025 19:04
@mergify mergify bot merged commit 8e7f2f9 into instructlab:main Jun 30, 2025
16 checks passed