
Conversation

@RobotSail
Member

With Qwen3, there's an edge case that can break the unmask/mask logic during data processing.

Root Cause: The error occurs specifically when using the Qwen/Qwen3-32B tokenizer, not with Qwen/Qwen2.5-32B-Instruct. The problematic sample contains multiple tags in the assistant's response.

Issue Location: The error occurs in data_process.py:555 in the unmask_messages function, where it encounters an <|UNMASK_END|> token while not in an unmasking state.
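
To make the failure concrete, here is a minimal sketch of the kind of state machine the description implies. It is not the actual unmask_messages code, and the function name unmask_flags is made up for illustration; only the marker strings come from this description.

```python
UNMASK_BEGIN = "<|UNMASK_BEGIN|>"
UNMASK_END = "<|UNMASK_END|>"


def unmask_flags(tokens: list[str]) -> list[bool]:
    """For every non-marker token, record whether it should contribute to the loss."""
    flags = []
    unmasking = False
    for idx, tok in enumerate(tokens):
        if tok == UNMASK_BEGIN:
            unmasking = True
        elif tok == UNMASK_END:
            if not unmasking:
                # Failure mode described above: an end marker is seen while the
                # walker is not in an unmasking state.
                raise ValueError(
                    f"encountered {UNMASK_END} at position {idx} while not unmasking"
                )
            unmasking = False
        else:
            flags.append(unmasking)
    return flags
```

If a chat template drops or reorders an <|UNMASK_BEGIN|> marker, the matching <|UNMASK_END|> arrives while the walker is not unmasking and the error is raised, which is the state reported at data_process.py:555.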

Key Findings:

  1. Model-specific issue: The sample processes fine with Qwen/Qwen2.5-32B-Instruct but fails with Qwen/Qwen3-32B
  2. Chat template differences: Different models have different chat templates that may tokenize the unmask tokens differently (see the diagnostic sketch after this list)
  3. Token ordering: The issue suggests that the unmask tokens are getting reordered or processed incorrectly by the Qwen3 chat template
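
Finding 2 can be checked directly. The sketch below is a diagnostic rather than anything from this PR; it assumes the Hugging Face transformers library, access to both checkpoints, and that the unmask markers are plain strings that may or may not be registered as special tokens in each tokenizer.

```python
# Check whether each tokenizer treats the unmask markers atomically
# or splits them into multiple pieces.
from transformers import AutoTokenizer

MARKERS = ["<|UNMASK_BEGIN|>", "<|UNMASK_END|>"]

for model_id in ("Qwen/Qwen2.5-32B-Instruct", "Qwen/Qwen3-32B"):
    tok = AutoTokenizer.from_pretrained(model_id)
    for marker in MARKERS:
        pieces = tok.tokenize(marker)
        ids = tok.encode(marker, add_special_tokens=False)
        print(f"{model_id}: {marker!r} -> {len(pieces)} piece(s), ids={ids}")
```

A marker that maps to a single id can be matched as one token during masking; a marker that splits into several pieces cannot, so begin/end detection then depends on how each model's tokenizer and chat template happen to handle it.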

The Problem:
The Qwen/Qwen3-32B model's chat template processes the <|UNMASK_BEGIN|> and <|UNMASK_END|> tokens in a way that causes them to appear out of order or in an unexpected state, so the algorithm encounters an <|UNMASK_END|> token when it is not actively unmasking.

This is likely due to differences in how the chat templates of Qwen2.5 vs Qwen3 handle special tokens, particularly when there are multiple special tokens or complex content like the tags present in the assistant's response.
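
One way to see whether the Qwen3 template actually reorders or drops markers is to render the same conversation through both chat templates and check marker ordering in the rendered text. This is a diagnostic sketch, not the fix in this PR; the placement of the markers inside the message content and the sample messages themselves are assumptions made for illustration.

```python
from transformers import AutoTokenizer

BEGIN, END = "<|UNMASK_BEGIN|>", "<|UNMASK_END|>"

# Stand-in conversation; the real problematic sample is not reproduced here.
messages = [
    {"role": "user", "content": f"{BEGIN}What is 2 + 2?{END}"},
    {"role": "assistant", "content": f"{BEGIN}2 + 2 = 4.{END}"},
]


def markers_balanced(text: str) -> bool:
    """True if every END marker in `text` is preceded by an unmatched BEGIN."""
    depth = 0
    i = 0
    while i < len(text):
        if text.startswith(BEGIN, i):
            depth += 1
            i += len(BEGIN)
        elif text.startswith(END, i):
            if depth == 0:
                return False  # end marker with no open begin marker
            depth -= 1
            i += len(END)
        else:
            i += 1
    return depth == 0


for model_id in ("Qwen/Qwen2.5-32B-Instruct", "Qwen/Qwen3-32B"):
    tok = AutoTokenizer.from_pretrained(model_id)
    rendered = tok.apply_chat_template(messages, tokenize=False)
    print(model_id, "markers balanced:", markers_balanced(rendered))
```

If the Qwen3 rendering comes back unbalanced while the Qwen2.5 rendering does not, that reproduces the state the unmask walker trips over.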

Signed-off-by: Oleg S [email protected]

@mergify mergify bot added the ci-failure label Jun 23, 2025
@RobotSail RobotSail force-pushed the fix-data-processing branch from d9a21d8 to 79c2f2e Compare June 24, 2025 04:04
@mergify mergify bot added documentation Improvements or additions to documentation testing Relates to testing labels Jun 24, 2025
@RobotSail RobotSail force-pushed the fix-data-processing branch 7 times, most recently from 765eb34 to 414c296 Compare June 27, 2025 05:17
@cdoern
Contributor

cdoern commented Jun 27, 2025

@RobotSail, we are working on CI. In the meantime, please just run the Large e2e job manually on this PR when it's ready for review.

@github-actions

E2E (NVIDIA L40S x4) (python 3.11) workflow launched on this PR: View run

@github-actions

e2e workflow succeeded on this PR: View run, congrats!

@Maxusmusti Maxusmusti left a comment
Collaborator

LGTM

@mergify mergify bot added the one-approval label Jun 30, 2025
@cdoern
Contributor

cdoern commented Jun 30, 2025

@Mergifyio rebase

@mergify
Contributor

mergify bot commented Jun 30, 2025

rebase

✅ Branch has been successfully rebased

@cdoern cdoern force-pushed the fix-data-processing branch from 968ba35 to a05d1e4 Compare June 30, 2025 19:04
@mergify mergify bot merged commit 8e7f2f9 into instructlab:main Jun 30, 2025
16 checks passed