[Data] Add serialization framework for preprocessors#58321

Merged
raulchen merged 7 commits into master from cem/save_load_1
Nov 14, 2025

Conversation


@cem-anyscale cem-anyscale commented Oct 30, 2025

Description

This commit introduces a new serialization system for Ray Data preprocessors that improves maintainability, extensibility, and backward compatibility.

Key changes:

  1. New serialization infrastructure:

    • Add serialization_handlers.py with factory pattern for format handling
    • Implement CloudPickleSerializationHandler (primary format)
    • Support legacy PickleSerializationHandler for backward compatibility
    • Add format auto-detection via magic bytes (CPKL:)
  2. New preprocessor base class:

    • Add SerializablePreprocessorBase abstract class
    • Define serialization interface via abstract methods:
      • _get_serializable_fields() / _set_serializable_fields()
      • _get_stats() / _set_stats()
    • Mark serialize() and deserialize() as @Final to prevent overrides
  3. Preprocessor registration system:

    • Add version_support.py with @SerializablePreprocessor decorator
    • Enable versioned serialization with stable identifiers
    • Support class registration and lookup
    • Add UnknownPreprocessorError for missing types
  4. Migrate preprocessors to new framework:

    • SimpleImputer
    • OrdinalEncoder
    • OneHotEncoder
    • MultiHotEncoder
    • LabelEncoder
    • Categorizer
    • StandardScaler
    • MinMaxScaler
    • MaxAbsScaler
    • RobustScaler
  5. Enhanced Preprocessor base class:

    • Add get_input_columns() and get_output_columns() methods (for future use)
    • Add has_stats() (for future use)
    • Add type hints to __getstate__() and __setstate__()
  6. Backward compatibility improvements to Concatenator for existing functionality:

    • Add __setstate__ override in Concatenator for flatten field
    • Handle missing fields gracefully during deserialization

The new architecture makes it easier to:

  • Add new serialization formats without modifying core logic
  • Maintain backward compatibility with existing serialized data
  • Handle version migrations for preprocessor schemas
  • Register new preprocessors with stable identifiers
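The format auto-detection described in item 1 can be illustrated with a minimal self-contained sketch. The "CPKL:" magic prefix comes from the PR description; stdlib pickle stands in for cloudpickle here, so this shows only the dispatch logic, not Ray's actual implementation:

```python
import base64
import pickle

MAGIC = b"CPKL:"  # magic-byte prefix named in the PR description

def serialize_new(obj) -> bytes:
    # New primary format: magic bytes followed by the pickled payload.
    return MAGIC + pickle.dumps(obj)

def serialize_legacy(obj) -> str:
    # Legacy format: base64-encoded pickle, returned as str.
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")

def deserialize(data):
    # Detect the format from the payload itself, not from a parameter.
    if isinstance(data, bytes) and data.startswith(MAGIC):
        return pickle.loads(data[len(MAGIC):])
    # Anything else is treated as the legacy str-based format.
    return pickle.loads(base64.b64decode(data))
```

Because the format is encoded in the payload, callers never need to know which serializer produced a blob, which is what makes the legacy fallback transparent.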

@cem-anyscale cem-anyscale requested a review from a team as a code owner October 30, 2025 18:14
cloudpickle_result = cloudpickle_deserialized.transform_batch(test_df.copy())
pickle_result = pickle_deserialized.transform_batch(test_df.copy())

pd.testing.assert_frame_equal(cloudpickle_result, pickle_result)

Bug: Test Fails Due to Hardcoded Serialization

The test_encoder_serialization_formats() method incorrectly claims to test different serialization formats. The SerializablePreprocessorBase.serialize() method is hardcoded to CloudPickle, causing the test to compare two identical CloudPickle outputs. Additionally, the serialize() docstring incorrectly lists an output_format parameter.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a robust new serialization framework for Ray Data preprocessors, which is a significant improvement for maintainability, extensibility, and backward compatibility. The changes are well-structured, introducing a new SerializablePreprocessorBase class, a factory pattern for handling different serialization formats, and a versioned registration system for preprocessors. The migration of existing preprocessors to this new framework is well-executed. The code quality is high, and the new functionality is accompanied by a comprehensive suite of tests. My review focuses on a few minor documentation inconsistencies and a potentially misleading test case.

Comment on lines 619 to 621
Args:
    output_format: The serialization format to use


medium

The docstring for serialize mentions an output_format argument, but this argument is not present in the method signature. This can be confusing for developers using this API. Please remove the output_format from the Args section of the docstring.

Comment on lines 146 to 148
Args:



medium

The docstring for get_handler is incomplete. The Args section is empty. Please document the parameters format_identifier, data, and **kwargs to improve clarity for developers who might use or extend this factory.
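A completed Args section might read as follows. The handler class, registry, and default behavior below are assumptions made for illustration; only the parameter names come from the review comment:

```python
class CloudPickleHandler:
    """Stand-in handler; the real one in the PR wraps cloudpickle."""

_HANDLERS = {"CPKL:": CloudPickleHandler}

def get_handler(format_identifier: str, data=None, **kwargs):
    """Return a serialization handler instance for the given format.

    Args:
        format_identifier: Stable identifier of the serialization format,
            e.g. the "CPKL:" magic prefix for the cloudpickle handler.
        data: Optional raw payload, for callers that auto-detect the
            format from magic bytes before dispatching here.
        **kwargs: Extra options forwarded to the handler's constructor.
    """
    try:
        handler_cls = _HANDLERS[format_identifier]
    except KeyError:
        raise ValueError(
            f"Unknown serialization format: {format_identifier!r}"
        )
    return handler_cls(**kwargs)
```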

Comment on lines +952 to +978
def test_encoder_serialization_formats(self):
    """Test that encoders work with different serialization formats."""
    encoder = OrdinalEncoder(columns=["category"])
    dataset = ray.data.from_pandas(self.categorical_df)
    fitted_encoder = encoder.fit(dataset)

    # Test CloudPickle format (default)
    cloudpickle_serialized = fitted_encoder.serialize()
    assert isinstance(cloudpickle_serialized, bytes)

    # Test Pickle format (legacy)
    pickle_serialized = fitted_encoder.serialize()
    assert isinstance(pickle_serialized, bytes)

    # Both should deserialize to equivalent objects
    cloudpickle_deserialized = SerializablePreprocessor.deserialize(
        cloudpickle_serialized
    )
    pickle_deserialized = SerializablePreprocessor.deserialize(pickle_serialized)

    # Test functional equivalence
    test_df = pd.DataFrame({"category": ["A", "B"]})

    cloudpickle_result = cloudpickle_deserialized.transform_batch(test_df.copy())
    pickle_result = pickle_deserialized.transform_batch(test_df.copy())

    pd.testing.assert_frame_equal(cloudpickle_result, pickle_result)

medium

This test test_encoder_serialization_formats seems to have a misleading comment and implementation. It claims to test the legacy Pickle format, but it calls fitted_encoder.serialize(), which for a SerializablePreprocessorBase subclass will always use the new CloudPickle-based serialization. The old pickle-based serialization produced a str, while the new one produces bytes.

The test is effectively re-running the CloudPickle serialization test. To properly test backward compatibility for deserialization of the legacy pickle format, you would need to either manually create a legacy-formatted serialized string or have a pre-serialized object from an older version.

Given that this test doesn't check what it claims to, it would be best to either fix it to correctly test legacy format deserialization or remove it to avoid confusion.
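The reviewer's suggestion can be sketched by building a legacy-format payload by hand (base64-encoded pickle, as the old Preprocessor.serialize() produced), so that deserialization of the old format is actually exercised. `DummyPreprocessor` and the helper names are stand-ins, not Ray APIs:

```python
import base64
import pickle

class DummyPreprocessor:
    """Minimal stand-in for a fitted preprocessor."""
    def __init__(self, columns):
        self.columns = columns

def make_legacy_payload(obj) -> str:
    # Reproduce the legacy wire format: base64-encoded pickle as str.
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")

def legacy_deserialize(serialized: str):
    # The legacy decode path the backward-compat test should hit.
    return pickle.loads(base64.b64decode(serialized))

legacy = make_legacy_payload(DummyPreprocessor(["category"]))
restored = legacy_deserialize(legacy)
```

A test written this way never calls the new serialize(), so it cannot silently degrade into comparing two CloudPickle outputs the way the current test does.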

@ray-gardener ray-gardener bot added the data Ray Data-related issues label Oct 30, 2025
@cem-anyscale cem-anyscale requested a review from a team as a code owner November 13, 2025 18:26
@raulchen raulchen enabled auto-merge (squash) November 13, 2025 20:00
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Nov 13, 2025

@dstrodtman dstrodtman left a comment


stamp

@github-actions github-actions bot disabled auto-merge November 13, 2025 21:25
self.fill_value = fields.get("fill_value")

if self.strategy == "constant":
    self._is_fittable = False

Bug: Serialization Skips Crucial Imputer Validation

The _set_serializable_fields method doesn't validate that fill_value is not None when strategy is "constant", unlike the constructor which raises a ValueError for this invalid combination. Deserializing a SimpleImputer with strategy="constant" and fill_value=None creates an invalid object that will fail during transformation with a confusing error message.
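The suggested fix can be sketched as mirroring the constructor's validation inside the field setter, so the invalid strategy="constant" / fill_value=None pair fails fast at deserialization rather than deep inside transform. The class below is a hypothetical stand-in, not Ray's SimpleImputer:

```python
class SimpleImputerSketch:
    def _set_serializable_fields(self, fields: dict) -> None:
        strategy = fields.get("strategy")
        fill_value = fields.get("fill_value")
        # Mirror the constructor's check so a deserialized object can
        # never reach transform in an invalid state.
        if strategy == "constant" and fill_value is None:
            raise ValueError(
                '`fill_value` must not be None when strategy is "constant".'
            )
        self.strategy = strategy
        self.fill_value = fill_value
        if strategy == "constant":
            self._is_fittable = False
```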


@cem-anyscale cem-anyscale requested a review from a team as a code owner November 13, 2025 23:36
def deserialize(serialized: str) -> "Preprocessor":
    """Load the original preprocessor serialized via `self.serialize()`."""
    return pickle.loads(base64.b64decode(serialized))


Bug: Serialize: Unexpected Return Type

The Preprocessor.serialize() method's docstring and return type annotation claim it returns str, but SerializablePreprocessorBase.serialize() (which overrides this method for new preprocessors) returns Union[str, bytes] and actually returns bytes for CloudPickle format. This creates a breaking change where code expecting a string from serialize() will receive bytes instead, potentially causing type errors or incorrect behavior in downstream code that assumes string output.


@raulchen raulchen merged commit f6490dd into master Nov 14, 2025
6 checks passed
@raulchen raulchen deleted the cem/save_load_1 branch November 14, 2025 17:50
ArturNiederfahrenhorst pushed a commit to ArturNiederfahrenhorst/ray that referenced this pull request Nov 16, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026

Labels

data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests)


3 participants