[Data] Add serialization framework for preprocessors#58321
Conversation
cloudpickle_result = cloudpickle_deserialized.transform_batch(test_df.copy())
pickle_result = pickle_deserialized.transform_batch(test_df.copy())

pd.testing.assert_frame_equal(cloudpickle_result, pickle_result)
Bug: Test Compares Two Identical CloudPickle Outputs
The test_encoder_serialization_formats() method claims to test different serialization formats, but SerializablePreprocessorBase.serialize() is hardcoded to CloudPickle, so the test compares two identical CloudPickle outputs. Additionally, the serialize() docstring incorrectly lists an output_format parameter.
Code Review
This pull request introduces a robust new serialization framework for Ray Data preprocessors, which is a significant improvement for maintainability, extensibility, and backward compatibility. The changes are well-structured, introducing a new SerializablePreprocessorBase class, a factory pattern for handling different serialization formats, and a versioned registration system for preprocessors. The migration of existing preprocessors to this new framework is well-executed. The code quality is high, and the new functionality is accompanied by a comprehensive suite of tests. My review focuses on a few minor documentation inconsistencies and a potentially misleading test case.
python/ray/data/preprocessor.py
Outdated
Args:
    output_format: The serialization format to use
def test_encoder_serialization_formats(self):
    """Test that encoders work with different serialization formats."""
    encoder = OrdinalEncoder(columns=["category"])
    dataset = ray.data.from_pandas(self.categorical_df)
    fitted_encoder = encoder.fit(dataset)

    # Test CloudPickle format (default)
    cloudpickle_serialized = fitted_encoder.serialize()
    assert isinstance(cloudpickle_serialized, bytes)

    # Test Pickle format (legacy)
    pickle_serialized = fitted_encoder.serialize()
    assert isinstance(pickle_serialized, bytes)

    # Both should deserialize to equivalent objects
    cloudpickle_deserialized = SerializablePreprocessor.deserialize(
        cloudpickle_serialized
    )
    pickle_deserialized = SerializablePreprocessor.deserialize(pickle_serialized)

    # Test functional equivalence
    test_df = pd.DataFrame({"category": ["A", "B"]})

    cloudpickle_result = cloudpickle_deserialized.transform_batch(test_df.copy())
    pickle_result = pickle_deserialized.transform_batch(test_df.copy())

    pd.testing.assert_frame_equal(cloudpickle_result, pickle_result)
This test test_encoder_serialization_formats seems to have a misleading comment and implementation. It claims to test the legacy Pickle format, but it calls fitted_encoder.serialize(), which for a SerializablePreprocessorBase subclass will always use the new CloudPickle-based serialization. The old pickle-based serialization produced a str, while the new one produces bytes.
The test is effectively re-running the CloudPickle serialization test. To properly test backward compatibility for deserialization of the legacy pickle format, you would need to either manually create a legacy-formatted serialized string or have a pre-serialized object from an older version.
Given that this test doesn't check what it claims to, it would be best to either fix it to correctly test legacy format deserialization or remove it to avoid confusion.
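One way to exercise the legacy deserialization path is to build a legacy-format payload by hand. The sketch below does this with a hypothetical stand-in class (LegacyPreprocessor is illustrative, not part of Ray), producing the base64-encoded pickle str that the old Preprocessor.serialize() emitted:

```python
import base64
import pickle


class LegacyPreprocessor:
    """Hypothetical stand-in for a preprocessor serialized by an old version."""

    def __init__(self, mapping):
        self.mapping = mapping


# Build a legacy payload by hand: base64-encoded pickle carried as a str,
# which is what the old Preprocessor.serialize() produced.
legacy_obj = LegacyPreprocessor({"A": 0, "B": 1})
legacy_payload = base64.b64encode(pickle.dumps(legacy_obj)).decode("utf-8")

# The legacy format is a str, not bytes; a backward-compatibility test
# should feed this str into the new deserialize() entry point and check
# the restored object, rather than calling serialize() twice.
assert isinstance(legacy_payload, str)
restored = pickle.loads(base64.b64decode(legacy_payload))
assert restored.mapping == {"A": 0, "B": 1}
```

A fixture built this way pins the on-the-wire legacy format, so the test keeps failing loudly if deserialize() ever stops accepting it.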
This commit introduces a new serialization system for Ray Data preprocessors
that improves maintainability, extensibility, and backward compatibility.
Key changes:
1. New serialization infrastructure:
- Add serialization_handlers.py with factory pattern for format handling
- Implement CloudPickleSerializationHandler (primary format)
- Support legacy PickleSerializationHandler for backward compatibility
- Add format auto-detection via magic bytes (CPKL:)
2. New preprocessor base class:
- Add SerializablePreprocessorBase abstract class
- Define serialization interface via abstract methods:
* _get_serializable_fields() / _set_serializable_fields()
* _get_stats() / _set_stats()
- Mark serialize() and deserialize() as @Final to prevent overrides
3. Preprocessor registration system:
- Add version_support.py with @SerializablePreprocessor decorator
- Enable versioned serialization with stable identifiers
- Support class registration and lookup
- Add UnknownPreprocessorError for missing types
4. Migrate preprocessors to new framework:
- SimpleImputer
- OrdinalEncoder
- OneHotEncoder
- MultiHotEncoder
- LabelEncoder
- Categorizer
- StandardScaler
- MinMaxScaler
- MaxAbsScaler
- RobustScaler
5. Enhanced Preprocessor base class:
- Add get_input_columns() and get_output_columns() methods (for future use)
- Add has_stats() (for future use)
- Add type hints to __getstate__() and __setstate__()
6. Backward-compatibility improvements to the existing Concatenator:
- Add __setstate__ override in Concatenator for flatten field
- Handle missing fields gracefully during deserialization
The new architecture makes it easier to:
- Add new serialization formats without modifying core logic
- Maintain backward compatibility with existing serialized data
- Handle version migrations for preprocessor schemas
- Register new preprocessors with stable identifiers
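As a rough sketch of the handler factory and magic-byte auto-detection described above (the class names follow the commit message, but the get_handler helper is illustrative, and stdlib pickle stands in for cloudpickle to keep the sketch dependency-free):

```python
import base64
import pickle

CLOUDPICKLE_MAGIC = b"CPKL:"  # assumed prefix, per the commit message


class CloudPickleSerializationHandler:
    """Primary format: magic-prefixed binary payload."""

    def serialize(self, obj) -> bytes:
        return CLOUDPICKLE_MAGIC + pickle.dumps(obj)

    def deserialize(self, payload: bytes):
        return pickle.loads(payload[len(CLOUDPICKLE_MAGIC):])


class PickleSerializationHandler:
    """Legacy format: base64-encoded pickle carried as a str."""

    def deserialize(self, payload: str):
        return pickle.loads(base64.b64decode(payload))


def get_handler(payload):
    # Auto-detection via magic bytes: new payloads are bytes prefixed
    # with CPKL:; anything else falls back to the legacy pickle handler.
    if isinstance(payload, bytes) and payload.startswith(CLOUDPICKLE_MAGIC):
        return CloudPickleSerializationHandler()
    return PickleSerializationHandler()
```

With this shape, adding a third format means registering one more handler and extending the detection branch, without touching the preprocessors themselves.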
Signed-off-by: cem <cem@anyscale.com>
self.fill_value = fields.get("fill_value")

if self.strategy == "constant":
    self._is_fittable = False
Bug: Serialization Skips Crucial Imputer Validation
The _set_serializable_fields method doesn't validate that fill_value is not None when strategy is "constant", unlike the constructor which raises a ValueError for this invalid combination. Deserializing a SimpleImputer with strategy="constant" and fill_value=None creates an invalid object that will fail during transformation with a confusing error message.
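A minimal sketch of the suggested fix, mirroring the constructor's check inside the deserialization hook (SimpleImputerSketch and the exact error message are illustrative, not the PR's code):

```python
class SimpleImputerSketch:
    """Illustrative stand-in showing validation inside the
    deserialization hook, mirroring the constructor's check."""

    def _set_serializable_fields(self, fields: dict) -> None:
        self.columns = fields.get("columns")
        self.strategy = fields.get("strategy")
        self.fill_value = fields.get("fill_value")

        if self.strategy == "constant":
            # Reject the invalid combination at deserialization time,
            # just as the constructor does, instead of failing later
            # with a confusing transform-time error.
            if self.fill_value is None:
                raise ValueError(
                    "`fill_value` must be set when strategy is 'constant'."
                )
            self._is_fittable = False
```
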
def deserialize(serialized: str) -> "Preprocessor":
    """Load the original preprocessor serialized via `self.serialize()`."""
    return pickle.loads(base64.b64decode(serialized))
Bug: serialize() Returns an Unexpected Type
The Preprocessor.serialize() method's docstring and return type annotation claim it returns str, but SerializablePreprocessorBase.serialize() (which overrides this method for new preprocessors) is annotated Union[str, bytes] and actually returns bytes for the CloudPickle format. This is a breaking change: downstream code that expects a string from serialize() will receive bytes instead, causing type errors or incorrect behavior.
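To illustrate the breakage, consider a hypothetical downstream consumer (store_preprocessor is not real Ray code, and the payload strings are made up) that embeds the serialized payload in JSON; this works for the legacy str format but fails once serialize() returns bytes:

```python
import json


def store_preprocessor(serialized):
    """Hypothetical consumer written against the old str contract."""
    return json.dumps({"preprocessor": serialized})


# A legacy str payload embeds cleanly.
assert store_preprocessor("gASV...").startswith("{")

# The new bytes payload breaks the same caller.
try:
    store_preprocessor(b"CPKL:\x80\x05")
    broke = False
except TypeError:  # bytes is not JSON serializable
    broke = True
assert broke
```
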
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>