Contributor

@AlbertvanHouten commented Aug 20, 2025

Summary

This pull request adds support for Python Union types to the experimental Datumaro type registry and dataset schema inference. A value annotated with a union (either typing.Union or the modern A | B syntax) is converted by trying the member types in order, falling back to the next member if a conversion fails; new tests cover this behavior. The changes also improve image type conversion and schema inference for datasets.

Type registry and conversion improvements

  • Added support for Union types in the type registry: both typing.Union and the Python 3.10+ A | B syntax are now handled, and conversion falls back to the next member type if the first conversion fails. This includes updated logic in from_polars_data and new tests for ordering, error handling, and fallback behavior.
  • Added comprehensive tests for type registry conversions, including basic types, union types, error cases, ordering, and converter functionality for numpy and torch tensors.

Dataset and schema inference enhancements

  • Improved schema inference in Dataset to resolve string annotations to actual type objects, which covers modules that use from __future__ import annotations, and added handling for Union types that preserves the original annotation.
  • Updated type variable definitions and method signatures in dataset.py for clarity and correctness, and removed unnecessary imports.

API and import improvements

  • Updated the experimental module’s public API to expose new converters, dataset classes, fields, schema types, and registry functions.

Test coverage

  • Added targeted tests for union type handling in dataset samples, ensuring both modern and legacy union syntax are supported.

These changes significantly improve the flexibility and reliability of type conversion and schema inference in Datumaro’s experimental pipeline.

How to test

Checklist

  • I have added unit tests to cover my changes.
  • I have added integration tests to cover my changes.
  • I have added the description of my changes into CHANGELOG.
  • I have updated the documentation accordingly.

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below).
# Copyright (C) 2025 Intel Corporation
#
# SPDX-License-Identifier: MIT

Copilot AI left a comment

Pull Request Overview

This pull request introduces comprehensive support for Python Union types in Datumaro's experimental type registry and dataset schema inference. It enables seamless conversion between multiple candidate types using both modern (A | B) and legacy (typing.Union) syntax, with fallback logic when conversions fail.

Key changes include:

  • Union type support in the type registry with fallback behavior
  • Enhanced schema inference for string annotations and Union types
  • New image type conversion capabilities
  • Expanded API with additional fields and converters

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| tests/unit/experimental/test_type_registry.py | New comprehensive test suite for type registry functionality |
| tests/unit/experimental/test_dataset.py | Added Union type handling tests for dataset samples |
| src/datumaro/experimental/type_registry.py | Core Union type support and image conversion functionality |
| src/datumaro/experimental/legacy.py | Minor type annotation fix |
| src/datumaro/experimental/fields.py | New annotation field classes and helper functions |
| src/datumaro/experimental/dataset.py | Enhanced schema inference and type resolution |
| src/datumaro/experimental/converters.py | New ImageTypeConverter for image format transformations |
| src/datumaro/experimental/__init__.py | Updated public API exports |

@gdlg left a comment

Thanks, Albert! This is a first round of review; I still need to review the Union type part.

@AlbertvanHouten marked this pull request as ready for review August 21, 2025 08:49

- DType = TypeVar("DType", bound=Sample, default=Sample)
- DTargetType = TypeVar("DTargetType", bound=Sample, default=Sample)
+ DType = TypeVar("DType")
+ DTargetType = TypeVar("DTargetType")

Add a comment in the code explaining why the bound and default were removed. Mention the issue with Dataset(schema) too.

if isinstance(value, np.ndarray):
    value_list = value.tolist()
elif isinstance(value, list):
    value_list = value

Convert to NumPy instead of to a list. You can use the to_numpy function from type_registry.py.

return {name: pl.Series(name, [value_list], dtype=pl.List(self.dtype))}

# Handle single integer value
elif isinstance(value, (int, np.integer)):

Suggested change
- elif isinstance(value, (int, np.integer)):
+ else:

In this case, you can assume that we are working with a single label. No need to check the input type.

if target_type == np.ndarray or target_type is np.ndarray:
    return np.array(data, dtype=np.int64)
elif target_type is list:
    return data

Same here: use from_polars_data from type_registry.py.

@gdlg left a comment

LGTM, thanks Albert!

@AlbertvanHouten merged commit 6417998 into develop Aug 22, 2025
15 checks passed
@gdlg deleted the albert/otx-integration branch August 28, 2025 09:55
@leoll2 mentioned this pull request Nov 20, 2025
6 tasks