-
Notifications
You must be signed in to change notification settings - Fork 155
Add categories validation #1994
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces comprehensive category validation and type safety improvements to the Datumaro experimental codebase. The changes enforce that fields referencing categories (such as labels and masks) use unsigned integer types, validate that data values correspond to defined categories, and improve category class structure with proper base classes and interface implementations.
Key changes:
- Enforced unsigned integer types for label and mask fields with runtime validation
- Added
BaseLabelCategoriesbase class for improved type hierarchy - Implemented category validation in
Datasetto ensure data values don't exceed defined categories
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/datumaro/experimental/categories.py | Added BaseLabelCategories base class, improved MaskCategories with proper __getitem__, __iter__, __len__, and __hash__ implementations |
| src/datumaro/experimental/fields/annotations.py | Added type validation for LabelField to enforce unsigned integer dtypes, changed default dtype to pl.UInt8(), added get_expected_categories_type() method |
| src/datumaro/experimental/fields/masks.py | Added type validation for all mask field classes to enforce unsigned integer or Boolean dtypes, added get_expected_categories_type() methods |
| src/datumaro/experimental/fields/base.py | Added get_expected_categories_type() method to base Field class |
| src/datumaro/experimental/schema.py | Added get_fields_with_categories() method to validate and retrieve fields requiring categories |
| src/datumaro/experimental/dataset.py | Added validate_fields_with_categories() method and integrated validation in append() to check data values against category bounds |
| src/datumaro/experimental/data_formats/coco/sample.py | Updated CocoSample label fields to use pl.UInt32() dtype |
| src/datumaro/experimental/converters/mask_converters.py | Updated MaskCategories construction to use Colormap wrapper |
| tests/unit/experimental/test_schema.py | Added tests for category validation in schema |
| tests/unit/experimental/test_dataset.py | Added tests for field validation with categories on append and validate operations |
| tests/unit/experimental/test_legacy.py | Updated test samples to use unsigned integer dtypes and added required categories to datasets |
| tests/unit/experimental/test_converters.py | Updated all test label fields to use unsigned integer dtypes |
| tests/unit/experimental/data_formats/coco/test_coco_io_unit.py | Added missing caption_group_ids categories to test dataset |
| tests/integration/experimental/test_export_import.py | Updated test sample to use pl.UInt16() dtype |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@copilot open a new pull request to apply changes based on the comments in this thread |
|
@JortBergfeld I've opened a new pull request, #1995, to work on those changes. Once the pull request is ready, I'll request review from you. |
…and method naming (#1995) Addresses review feedback on PR #1994 regarding type annotation mismatches, error message clarity, docstring specificity, and method naming. **Type Hint Corrections:** - Fixed numpy dtype annotations in test samples: `np.dtype[np.int32]` → `np.dtype[np.uint32]` to match polars `UInt32` field types **Error Message Improvements:** - Enhanced mask field dtype validation errors across 4 field classes (MaskField, InstanceMaskField, InstanceMaskCallableField, MaskCallableField) to provide better context about integer values representing category indices - Added missing space in schema validation error message for better readability **Method Rename:** - Renamed `get_fields_with_categories()` to `get_fields_with_required_categories()` in Schema class to better reflect that the method only returns fields that strictly require categories - Updated all references in dataset.py and test_schema.py **Docstring Updates:** - Generalized `Categories` base class method docstrings from label-specific to category-agnostic terminology, reflecting that all category types implement these methods - Fixed grammar in Schema method docstring ### Checklist - [x] I have added tests to cover my changes or documented any manual tests. - [ ] I have updated the [documentation](https://github.com/open-edge-platform/datumaro/tree/develop/docs) accordingly <!-- START COPILOT CODING AGENT TIPS --> --- 💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more [Copilot coding agent tips](https://gh.io/copilot-coding-agent-tips) in the docs. --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: JortBergfeld <[email protected]>
leoll2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Also introduces a BaseLabelCategories class to distinguish label categories from other categories classes. Signed-off-by: Jort Bergfeld <[email protected]>
Signed-off-by: Jort Bergfeld <[email protected]>
Signed-off-by: Jort Bergfeld <[email protected]>
Signed-off-by: Jort Bergfeld <[email protected]>
Signed-off-by: Jort Bergfeld <[email protected]>
Co-authored-by: Leonardo Lai <[email protected]> Signed-off-by: Jort Bergfeld <[email protected]>
…and method naming (#1995) Addresses review feedback on PR #1994 regarding type annotation mismatches, error message clarity, docstring specificity, and method naming. **Type Hint Corrections:** - Fixed numpy dtype annotations in test samples: `np.dtype[np.int32]` → `np.dtype[np.uint32]` to match polars `UInt32` field types **Error Message Improvements:** - Enhanced mask field dtype validation errors across 4 field classes (MaskField, InstanceMaskField, InstanceMaskCallableField, MaskCallableField) to provide better context about integer values representing category indices - Added missing space in schema validation error message for better readability **Method Rename:** - Renamed `get_fields_with_categories()` to `get_fields_with_required_categories()` in Schema class to better reflect that the method only returns fields that strictly require categories - Updated all references in dataset.py and test_schema.py **Docstring Updates:** - Generalized `Categories` base class method docstrings from label-specific to category-agnostic terminology, reflecting that all category types implement these methods - Fixed grammar in Schema method docstring ### Checklist - [x] I have added tests to cover my changes or documented any manual tests. - [ ] I have updated the [documentation](https://github.com/open-edge-platform/datumaro/tree/develop/docs) accordingly <!-- START COPILOT CODING AGENT TIPS --> --- 💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more [Copilot coding agent tips](https://gh.io/copilot-coding-agent-tips) in the docs. --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: JortBergfeld <[email protected]>
Signed-off-by: Jort Bergfeld <[email protected]>
Signed-off-by: Jort Bergfeld <[email protected]>
a876bf1 to
7bb2e3d
Compare
Signed-off-by: Jort Bergfeld <[email protected]>
|
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Signed-off-by: Jort Bergfeld <[email protected]>
This pull request introduces significant improvements to category handling and validation in the experimental Datumaro codebase. The main focus is on enforcing that fields referencing categories (such as labels and masks) use appropriate unsigned integer types, validating that data values correspond to defined categories, and improving the structure and typing of category classes. It also introduces stricter checks and error messages to help catch schema and data mismatches early.
Category and Field Type Enforcement:
BaseLabelCategoriesas a base class for label categories, and updatedLabelCategoriesandHierarchicalLabelCategoriesto inherit from it, improving type safety and organization. (src/datumaro/experimental/categories.py) [1] [2]LabelFieldand mask-related fields must use unsigned integer types (e.g.,pl.UInt8,pl.UInt32), and added runtime checks to raise errors if incorrect types are used. (src/datumaro/experimental/fields/annotations.py,src/datumaro/experimental/fields/masks.py) [1] [2] [3] [4] [5]Category Validation Logic:
Datasetto ensure that all values in fields with associated categories do not exceed the number of defined categories, raising informative errors otherwise. (src/datumaro/experimental/dataset.py) [1] [2]get_expected_categories_typemethods to fields, and updatedSchemato check that fields requiring categories have the correct type of categories attached, raising errors if not. (src/datumaro/experimental/fields/base.py,src/datumaro/experimental/fields/annotations.py,src/datumaro/experimental/fields/masks.py,src/datumaro/experimental/schema.py) [1] [2] [3] [4] [5] [6] [7]Field and Data Type Updates:
CocoSampleand tests) to use unsigned integer types for label-related fields. (src/datumaro/experimental/data_formats/coco/sample.py,tests/integration/experimental/test_export_import.py) [1] [2] [3]label_fielddtype topl.UInt8for consistency and safety. (src/datumaro/experimental/fields/annotations.py)Mask Category Improvements:
MaskCategorieswith proper__getitem__,__iter__,__len__, and__hash__implementations, and ensured correct construction and usage in mask converters. (src/datumaro/experimental/categories.py,src/datumaro/experimental/converters/mask_converters.py) [1] [2] [3] [4]These changes together provide better type safety, clearer error messages, and stricter validation for category-based fields, reducing the risk of subtle bugs and making the codebase more robust.
Checklist