Warn instead of raise when user-provided data_files yields a subset#8215

Open
adityasingh2400 wants to merge 2 commits into huggingface:main from adityasingh2400:fix-non-matching-splits-with-explicit-data-files-7867

Conversation

@adityasingh2400

Contributor

Fixes #7867.

NonMatchingSplitsSizesError currently fires whenever the loaded split size differs from the expected size, including when the user explicitly passed data_files for a known subset of the dataset. The only user-side workaround is verification_mode='no_checks', which silences ALL checks rather than just the split-size one.

This PR downgrades the split-size mismatch to a UserWarning when data_files was explicitly provided by the caller. Other mismatch paths (corrupted download, wrong size for full download) remain hard errors.
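The behaviour described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the library's actual code: the function name `check_split_sizes`, the `user_provided_data_files` flag, and the locally defined exception class are all assumptions made for the example, and split sizes are represented as plain `name -> num_examples` dicts.

```python
import warnings


class NonMatchingSplitsSizesError(ValueError):
    """Local stand-in for the error named in this PR (hypothetical definition)."""


def check_split_sizes(expected, recorded, user_provided_data_files=False):
    """Compare expected vs. recorded example counts per split.

    `expected` and `recorded` map split name -> number of examples.
    `user_provided_data_files` is a hypothetical flag meaning the caller
    passed data_files explicitly.
    """
    mismatched = {
        name: (expected[name], recorded[name])
        for name in expected
        if name in recorded and expected[name] != recorded[name]
    }
    if not mismatched:
        return
    message = f"Split sizes differ (expected, recorded): {mismatched}"
    if user_provided_data_files:
        # A subset was requested on purpose: warn instead of failing.
        warnings.warn(message, UserWarning)
        return
    # Full download with the wrong size: still a hard error.
    raise NonMatchingSplitsSizesError(message)
```

With this shape, a caller loading a deliberate subset sees a warning, while a size mismatch on a full download still raises.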

@lhoestq

lhoestq commented Jun 5, 2026

Member

I think we can simply not raise or warn at all when the user provides data_files. In that case the number of examples is expected to differ.

@adityasingh2400

Contributor Author

Makes sense, done in 2162efc. verify_splits now returns early when the user passed data_files, so there is no raise and no warning at all in that case. The early return also covers the split name checks, since custom data_files can define a different set of splits. I removed the warning path and updated the tests accordingly.
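The early-return approach described in this comment might look roughly like the sketch below. It is an illustration under stated assumptions, not the code from the commit: the name `verify_splits_sketch`, the `user_provided_data_files` parameter, and the `SplitsVerificationError` class are hypothetical, and splits are modelled as plain `name -> num_examples` dicts.

```python
class SplitsVerificationError(ValueError):
    """Hypothetical error type used only in this sketch."""


def verify_splits_sketch(expected, recorded, user_provided_data_files=False):
    """Verify split names and sizes, unless the caller passed data_files.

    `expected` and `recorded` map split name -> number of examples.
    """
    if user_provided_data_files:
        # Custom data_files may define different splits and sizes,
        # so skip both the name check and the size check: no raise, no warning.
        return
    if set(expected) != set(recorded):
        raise SplitsVerificationError(
            f"Split names differ: expected {sorted(expected)}, got {sorted(recorded)}"
        )
    bad_sizes = {
        name: (expected[name], recorded[name])
        for name in expected
        if expected[name] != recorded[name]
    }
    if bad_sizes:
        raise SplitsVerificationError(
            f"Split sizes differ (expected, recorded): {bad_sizes}"
        )
```

Placing the guard first means a single condition covers both the split-name and split-size checks, which matches the reasoning given in the comment.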

Development

Successfully merging this pull request may close these issues.

NonMatchingSplitsSizesError when loading partial dataset files

2 participants