Skip to content

Conversation

@JortBergfeld
Copy link
Contributor

This pull request introduces significant improvements to category handling and validation in the experimental Datumaro codebase. The main focus is on enforcing that fields referencing categories (such as labels and masks) use appropriate unsigned integer types, validating that data values correspond to defined categories, and improving the structure and typing of category classes. It also introduces stricter checks and error messages to help catch schema and data mismatches early.

Category and Field Type Enforcement:

  • Added BaseLabelCategories as a base class for label categories, and updated LabelCategories and HierarchicalLabelCategories to inherit from it, improving type safety and organization. (src/datumaro/experimental/categories.py) [1] [2]
  • Enforced that LabelField and mask-related fields must use unsigned integer types (e.g., pl.UInt8, pl.UInt32), and added runtime checks to raise errors if incorrect types are used. (src/datumaro/experimental/fields/annotations.py, src/datumaro/experimental/fields/masks.py) [1] [2] [3] [4] [5]

Category Validation Logic:

  • Implemented validation in Dataset to ensure that all values in fields with associated categories do not exceed the number of defined categories, raising informative errors otherwise. (src/datumaro/experimental/dataset.py) [1] [2]
  • Added get_expected_categories_type methods to fields, and updated Schema to check that fields requiring categories have the correct type of categories attached, raising errors if not. (src/datumaro/experimental/fields/base.py, src/datumaro/experimental/fields/annotations.py, src/datumaro/experimental/fields/masks.py, src/datumaro/experimental/schema.py) [1] [2] [3] [4] [5] [6] [7]

Field and Data Type Updates:

  • Updated sample field definitions (e.g., in CocoSample and tests) to use unsigned integer types for label-related fields. (src/datumaro/experimental/data_formats/coco/sample.py, tests/integration/experimental/test_export_import.py) [1] [2] [3]
  • Changed default label_field dtype to pl.UInt8 for consistency and safety. (src/datumaro/experimental/fields/annotations.py)

Mask Category Improvements:

  • Improved MaskCategories with proper __getitem__, __iter__, __len__, and __hash__ implementations, and ensured correct construction and usage in mask converters. (src/datumaro/experimental/categories.py, src/datumaro/experimental/converters/mask_converters.py) [1] [2] [3] [4]

These changes together provide better type safety, clearer error messages, and stricter validation for category-based fields, reducing the risk of subtle bugs and making the codebase more robust.

Checklist

  • I have added tests to cover my changes or documented any manual tests.
  • I have updated the documentation accordingly

@JortBergfeld JortBergfeld self-assigned this Dec 24, 2025
@JortBergfeld JortBergfeld requested a review from a team as a code owner December 24, 2025 14:21
Copilot AI review requested due to automatic review settings December 24, 2025 14:21
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces comprehensive category validation and type safety improvements to the Datumaro experimental codebase. The changes enforce that fields referencing categories (such as labels and masks) use unsigned integer types, validate that data values correspond to defined categories, and improve category class structure with proper base classes and interface implementations.

Key changes:

  • Enforced unsigned integer types for label and mask fields with runtime validation
  • Added BaseLabelCategories base class for improved type hierarchy
  • Implemented category validation in Dataset to ensure data values don't exceed defined categories

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/datumaro/experimental/categories.py Added BaseLabelCategories base class, improved MaskCategories with proper __getitem__, __iter__, __len__, and __hash__ implementations
src/datumaro/experimental/fields/annotations.py Added type validation for LabelField to enforce unsigned integer dtypes, changed default dtype to pl.UInt8(), added get_expected_categories_type() method
src/datumaro/experimental/fields/masks.py Added type validation for all mask field classes to enforce unsigned integer or Boolean dtypes, added get_expected_categories_type() methods
src/datumaro/experimental/fields/base.py Added get_expected_categories_type() method to base Field class
src/datumaro/experimental/schema.py Added get_fields_with_categories() method to validate and retrieve fields requiring categories
src/datumaro/experimental/dataset.py Added validate_fields_with_categories() method and integrated validation in append() to check data values against category bounds
src/datumaro/experimental/data_formats/coco/sample.py Updated CocoSample label fields to use pl.UInt32() dtype
src/datumaro/experimental/converters/mask_converters.py Updated MaskCategories construction to use Colormap wrapper
tests/unit/experimental/test_schema.py Added tests for category validation in schema
tests/unit/experimental/test_dataset.py Added tests for field validation with categories on append and validate operations
tests/unit/experimental/test_legacy.py Updated test samples to use unsigned integer dtypes and added required categories to datasets
tests/unit/experimental/test_converters.py Updated all test label fields to use unsigned integer dtypes
tests/unit/experimental/data_formats/coco/test_coco_io_unit.py Added missing caption_group_ids categories to test dataset
tests/integration/experimental/test_export_import.py Updated test sample to use pl.UInt16() dtype

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@JortBergfeld
Copy link
Contributor Author

@copilot open a new pull request to apply changes based on the comments in this thread

Copy link
Contributor

Copilot AI commented Dec 29, 2025

@JortBergfeld I've opened a new pull request, #1995, to work on those changes. Once the pull request is ready, I'll request review from you.

JortBergfeld added a commit that referenced this pull request Dec 31, 2025
…and method naming (#1995)

Addresses review feedback on PR #1994 regarding type annotation
mismatches, error message clarity, docstring specificity, and method
naming.

**Type Hint Corrections:**
- Fixed numpy dtype annotations in test samples: `np.dtype[np.int32]` →
`np.dtype[np.uint32]` to match polars `UInt32` field types

**Error Message Improvements:**
- Enhanced mask field dtype validation errors across 4 field classes
(MaskField, InstanceMaskField, InstanceMaskCallableField,
MaskCallableField) to provide better context about integer values
representing category indices
- Added missing space in schema validation error message for better
readability

**Method Rename:**
- Renamed `get_fields_with_categories()` to
`get_fields_with_required_categories()` in Schema class to better
reflect that the method only returns fields that strictly require
categories
- Updated all references in dataset.py and test_schema.py

**Docstring Updates:**
- Generalized `Categories` base class method docstrings from
label-specific to category-agnostic terminology, reflecting that all
category types implement these methods
- Fixed grammar in Schema method docstring

### Checklist
- [x] I have added tests to cover my changes or documented any manual
tests.
- [ ] I have updated the
[documentation](https://github.com/open-edge-platform/datumaro/tree/develop/docs)
accordingly

<!-- START COPILOT CODING AGENT TIPS -->
---

💡 You can make Copilot smarter by setting up custom instructions,
customizing its development environment and configuring Model Context
Protocol (MCP) servers. Learn more [Copilot coding agent
tips](https://gh.io/copilot-coding-agent-tips) in the docs.

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: JortBergfeld <[email protected]>
Copy link
Contributor

@leoll2 leoll2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

JortBergfeld and others added 9 commits January 5, 2026 15:52
Also introduces a BaseLabelCategories class to distinguish label
categories from other categories classes.

Signed-off-by: Jort Bergfeld <[email protected]>
Co-authored-by: Leonardo Lai <[email protected]>
Signed-off-by: Jort Bergfeld <[email protected]>
…and method naming (#1995)

Addresses review feedback on PR #1994 regarding type annotation
mismatches, error message clarity, docstring specificity, and method
naming.

**Type Hint Corrections:**
- Fixed numpy dtype annotations in test samples: `np.dtype[np.int32]` →
`np.dtype[np.uint32]` to match polars `UInt32` field types

**Error Message Improvements:**
- Enhanced mask field dtype validation errors across 4 field classes
(MaskField, InstanceMaskField, InstanceMaskCallableField,
MaskCallableField) to provide better context about integer values
representing category indices
- Added missing space in schema validation error message for better
readability

**Method Rename:**
- Renamed `get_fields_with_categories()` to
`get_fields_with_required_categories()` in Schema class to better
reflect that the method only returns fields that strictly require
categories
- Updated all references in dataset.py and test_schema.py

**Docstring Updates:**
- Generalized `Categories` base class method docstrings from
label-specific to category-agnostic terminology, reflecting that all
category types implement these methods
- Fixed grammar in Schema method docstring

### Checklist
- [x] I have added tests to cover my changes or documented any manual
tests.
- [ ] I have updated the
[documentation](https://github.com/open-edge-platform/datumaro/tree/develop/docs)
accordingly

<!-- START COPILOT CODING AGENT TIPS -->
---

💡 You can make Copilot smarter by setting up custom instructions,
customizing its development environment and configuring Model Context
Protocol (MCP) servers. Learn more [Copilot coding agent
tips](https://gh.io/copilot-coding-agent-tips) in the docs.

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: JortBergfeld <[email protected]>
@JortBergfeld JortBergfeld force-pushed the jort/categories_validation branch from a876bf1 to 7bb2e3d Compare January 5, 2026 14:59
Signed-off-by: Jort Bergfeld <[email protected]>
@codecov-commenter
Copy link

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 78.40909% with 19 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/datumaro/experimental/fields/masks.py 64.00% 5 Missing and 4 partials ⚠️
src/datumaro/experimental/export_import.py 45.45% 5 Missing and 1 partial ⚠️
src/datumaro/experimental/categories.py 86.66% 2 Missing ⚠️
src/datumaro/experimental/fields/annotations.py 71.42% 1 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants