[Data] Add serialization framework for preprocessors#58321

Merged
raulchen merged 7 commits into master from cem/save_load_1
Nov 14, 2025

Conversation


@cem-anyscale cem-anyscale commented Oct 30, 2025

Description

This commit introduces a new serialization system for Ray Data preprocessors that improves maintainability, extensibility, and backward compatibility.

Key changes:

  1. New serialization infrastructure:

    • Add serialization_handlers.py with factory pattern for format handling
    • Implement CloudPickleSerializationHandler (primary format)
    • Support legacy PickleSerializationHandler for backward compatibility
    • Add format auto-detection via magic bytes (CPKL:)
  2. New preprocessor base class:

    • Add SerializablePreprocessorBase abstract class
    • Define serialization interface via abstract methods:
      • _get_serializable_fields() / _set_serializable_fields()
      • _get_stats() / _set_stats()
    • Mark serialize() and deserialize() as @Final to prevent overrides
  3. Preprocessor registration system:

    • Add version_support.py with @SerializablePreprocessor decorator
    • Enable versioned serialization with stable identifiers
    • Support class registration and lookup
    • Add UnknownPreprocessorError for missing types
  4. Migrate preprocessors to new framework:

    • SimpleImputer
    • OrdinalEncoder
    • OneHotEncoder
    • MultiHotEncoder
    • LabelEncoder
    • Categorizer
    • StandardScaler
    • MinMaxScaler
    • MaxAbsScaler
    • RobustScaler
  5. Enhanced Preprocessor base class:

    • Add get_input_columns() and get_output_columns() methods (for future use)
    • Add has_stats() (for future use)
    • Add type hints to __getstate__() and __setstate__()
  6. Backward compatibility improvements to Concatenator for existing functionality:

    • Add __setstate__ override in Concatenator for flatten field
    • Handle missing fields gracefully during deserialization

The new architecture makes it easier to:

  • Add new serialization formats without modifying core logic
  • Maintain backward compatibility with existing serialized data
  • Handle version migrations for preprocessor schemas
  • Register new preprocessors with stable identifiers
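The format auto-detection described in item 1 can be illustrated with a minimal self-contained sketch. The "CPKL:" magic prefix comes from the PR description; stdlib pickle stands in for cloudpickle here, so this shows only the dispatch logic, not Ray's actual implementation:

```python
import base64
import pickle

MAGIC = b"CPKL:"  # magic-byte prefix named in the PR description

def serialize_new(obj) -> bytes:
    # New primary format: magic bytes followed by the pickled payload.
    return MAGIC + pickle.dumps(obj)

def serialize_legacy(obj) -> str:
    # Legacy format: base64-encoded pickle, returned as str.
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")

def deserialize(data):
    # Detect the format from the payload itself, not from a parameter.
    if isinstance(data, bytes) and data.startswith(MAGIC):
        return pickle.loads(data[len(MAGIC):])
    # Anything else is treated as the legacy str-based format.
    return pickle.loads(base64.b64decode(data))
```

Because the format is encoded in the payload, callers never need to know which serializer produced a blob, which is what makes the legacy fallback transparent.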

@cem-anyscale cem-anyscale requested a review from a team as a code owner October 30, 2025 18:14
cloudpickle_result = cloudpickle_deserialized.transform_batch(test_df.copy())
pickle_result = pickle_deserialized.transform_batch(test_df.copy())

pd.testing.assert_frame_equal(cloudpickle_result, pickle_result)

Bug: Test Fails Due to Hardcoded Serialization

The test_encoder_serialization_formats() method incorrectly claims to test different serialization formats. The SerializablePreprocessorBase.serialize() method is hardcoded to CloudPickle, causing the test to compare two identical CloudPickle outputs. Additionally, the serialize() docstring incorrectly lists an output_format parameter.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a robust new serialization framework for Ray Data preprocessors, which is a significant improvement for maintainability, extensibility, and backward compatibility. The changes are well-structured, introducing a new SerializablePreprocessorBase class, a factory pattern for handling different serialization formats, and a versioned registration system for preprocessors. The migration of existing preprocessors to this new framework is well-executed. The code quality is high, and the new functionality is accompanied by a comprehensive suite of tests. My review focuses on a few minor documentation inconsistencies and a potentially misleading test case.

Comment on lines 619 to 621
Args:
    output_format: The serialization format to use


medium

The docstring for serialize mentions an output_format argument, but this argument is not present in the method signature. This can be confusing for developers using this API. Please remove the output_format from the Args section of the docstring.

Comment on lines 146 to 148
Args:



medium

The docstring for get_handler is incomplete. The Args section is empty. Please document the parameters format_identifier, data, and **kwargs to improve clarity for developers who might use or extend this factory.
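A completed Args section might read as follows. The handler class, registry, and default behavior below are assumptions made for illustration; only the parameter names come from the review comment:

```python
class CloudPickleHandler:
    """Stand-in handler; the real one in the PR wraps cloudpickle."""

_HANDLERS = {"CPKL:": CloudPickleHandler}

def get_handler(format_identifier: str, data=None, **kwargs):
    """Return a serialization handler instance for the given format.

    Args:
        format_identifier: Stable identifier of the serialization format,
            e.g. the "CPKL:" magic prefix for the cloudpickle handler.
        data: Optional raw payload, for callers that auto-detect the
            format from magic bytes before dispatching here.
        **kwargs: Extra options forwarded to the handler's constructor.
    """
    try:
        handler_cls = _HANDLERS[format_identifier]
    except KeyError:
        raise ValueError(
            f"Unknown serialization format: {format_identifier!r}"
        )
    return handler_cls(**kwargs)
```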

Comment on lines +952 to +978
def test_encoder_serialization_formats(self):
    """Test that encoders work with different serialization formats."""
    encoder = OrdinalEncoder(columns=["category"])
    dataset = ray.data.from_pandas(self.categorical_df)
    fitted_encoder = encoder.fit(dataset)

    # Test CloudPickle format (default)
    cloudpickle_serialized = fitted_encoder.serialize()
    assert isinstance(cloudpickle_serialized, bytes)

    # Test Pickle format (legacy)
    pickle_serialized = fitted_encoder.serialize()
    assert isinstance(pickle_serialized, bytes)

    # Both should deserialize to equivalent objects
    cloudpickle_deserialized = SerializablePreprocessor.deserialize(
        cloudpickle_serialized
    )
    pickle_deserialized = SerializablePreprocessor.deserialize(pickle_serialized)

    # Test functional equivalence
    test_df = pd.DataFrame({"category": ["A", "B"]})

    cloudpickle_result = cloudpickle_deserialized.transform_batch(test_df.copy())
    pickle_result = pickle_deserialized.transform_batch(test_df.copy())

    pd.testing.assert_frame_equal(cloudpickle_result, pickle_result)

medium

This test test_encoder_serialization_formats seems to have a misleading comment and implementation. It claims to test the legacy Pickle format, but it calls fitted_encoder.serialize(), which for a SerializablePreprocessorBase subclass will always use the new CloudPickle-based serialization. The old pickle-based serialization produced a str, while the new one produces bytes.

The test is effectively re-running the CloudPickle serialization test. To properly test backward compatibility for deserialization of the legacy pickle format, you would need to either manually create a legacy-formatted serialized string or have a pre-serialized object from an older version.

Given that this test doesn't check what it claims to, it would be best to either fix it to correctly test legacy format deserialization or remove it to avoid confusion.
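The reviewer's suggestion can be sketched by building a legacy-format payload by hand (base64-encoded pickle, as the old Preprocessor.serialize() produced), so that deserialization of the old format is actually exercised. `DummyPreprocessor` and the helper names are stand-ins, not Ray APIs:

```python
import base64
import pickle

class DummyPreprocessor:
    """Minimal stand-in for a fitted preprocessor."""
    def __init__(self, columns):
        self.columns = columns

def make_legacy_payload(obj) -> str:
    # Reproduce the legacy wire format: base64-encoded pickle as str.
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")

def legacy_deserialize(serialized: str):
    # The legacy decode path the backward-compat test should hit.
    return pickle.loads(base64.b64decode(serialized))

legacy = make_legacy_payload(DummyPreprocessor(["category"]))
restored = legacy_deserialize(legacy)
```

A test written this way never calls the new serialize(), so it cannot silently degrade into comparing two CloudPickle outputs the way the current test does.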

@ray-gardener ray-gardener bot added the data Ray Data-related issues label Oct 30, 2025
@cem-anyscale cem-anyscale requested a review from a team as a code owner November 13, 2025 18:26
@raulchen raulchen enabled auto-merge (squash) November 13, 2025 20:00
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Nov 13, 2025

@dstrodtman dstrodtman left a comment


stamp

@github-actions github-actions bot disabled auto-merge November 13, 2025 21:25
self.fill_value = fields.get("fill_value")

if self.strategy == "constant":
    self._is_fittable = False

Bug: Serialization Skips Crucial Imputer Validation

The _set_serializable_fields method doesn't validate that fill_value is not None when strategy is "constant", unlike the constructor which raises a ValueError for this invalid combination. Deserializing a SimpleImputer with strategy="constant" and fill_value=None creates an invalid object that will fail during transformation with a confusing error message.
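The suggested fix can be sketched as mirroring the constructor's validation inside the field setter, so the invalid strategy="constant" / fill_value=None pair fails fast at deserialization rather than deep inside transform. The class below is a hypothetical stand-in, not Ray's SimpleImputer:

```python
class SimpleImputerSketch:
    def _set_serializable_fields(self, fields: dict) -> None:
        strategy = fields.get("strategy")
        fill_value = fields.get("fill_value")
        # Mirror the constructor's check so a deserialized object can
        # never reach transform in an invalid state.
        if strategy == "constant" and fill_value is None:
            raise ValueError(
                '`fill_value` must not be None when strategy is "constant".'
            )
        self.strategy = strategy
        self.fill_value = fill_value
        if strategy == "constant":
            self._is_fittable = False
```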


@cem-anyscale cem-anyscale requested a review from a team as a code owner November 13, 2025 23:36
def deserialize(serialized: str) -> "Preprocessor":
    """Load the original preprocessor serialized via `self.serialize()`."""
    return pickle.loads(base64.b64decode(serialized))


Bug: Serialize: Unexpected Return Type

The Preprocessor.serialize() method's docstring and return type annotation claim it returns str, but SerializablePreprocessorBase.serialize() (which overrides this method for new preprocessors) returns Union[str, bytes] and actually returns bytes for CloudPickle format. This creates a breaking change where code expecting a string from serialize() will receive bytes instead, potentially causing type errors or incorrect behavior in downstream code that assumes string output.


@raulchen raulchen merged commit f6490dd into master Nov 14, 2025
6 checks passed
@raulchen raulchen deleted the cem/save_load_1 branch November 14, 2025 17:50
ArturNiederfahrenhorst pushed a commit to ArturNiederfahrenhorst/ray that referenced this pull request Nov 16, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026

Labels

data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests)


3 participants