Commit f8db43c
[Data] Add serialization framework for preprocessors (ray-project#58321)
## Description
This commit introduces a new serialization system for Ray Data
preprocessors that improves maintainability, extensibility, and backward
compatibility.
Key changes:
1. New serialization infrastructure:
- Add serialization_handlers.py with factory pattern for format handling
- Implement CloudPickleSerializationHandler (primary format)
- Support legacy PickleSerializationHandler for backward compatibility
- Add format auto-detection via magic bytes (CPKL:)
2. New preprocessor base class:
- Add SerializablePreprocessorBase abstract class
- Define serialization interface via abstract methods:
* _get_serializable_fields() / _set_serializable_fields()
* _get_stats() / _set_stats()
- Mark serialize() and deserialize() as @final to prevent overrides
3. Preprocessor registration system:
- Add version_support.py with @SerializablePreprocessor decorator
- Enable versioned serialization with stable identifiers
- Support class registration and lookup
- Add UnknownPreprocessorError for missing types
4. Migrate preprocessors to new framework:
- SimpleImputer
- OrdinalEncoder
- OneHotEncoder
- MultiHotEncoder
- LabelEncoder
- Categorizer
- StandardScaler
- MinMaxScaler
- MaxAbsScaler
- RobustScaler
5. Enhanced Preprocessor base class:
- Add get_input_columns() and get_output_columns() methods (for future
use)
- Add has_stats() (for future use)
- Add type hints to __getstate__() and __setstate__()
6. Backward compatibility improvements to Concatenator for existing
functionality:
- Add __setstate__ override in Concatenator for flatten field
- Handle missing fields gracefully during deserialization
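The interface described in item 2 can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `ToySerializableBase` and `ToyScaler` are hypothetical names, the payload layout (a dict with `fields` and `stats` keys) is an assumption, `deserialize` is assumed to be a classmethod, and stdlib `pickle` stands in for the real handlers. Only the four abstract method names and the use of `@final` come from the commit message.

```python
import pickle
from abc import ABC, abstractmethod
from typing import Any, Dict, final


class ToySerializableBase(ABC):
    """Hypothetical stand-in for SerializablePreprocessorBase."""

    @abstractmethod
    def _get_serializable_fields(self) -> Dict[str, Any]:
        """Return the configuration needed to rebuild the object."""

    @abstractmethod
    def _set_serializable_fields(self, fields: Dict[str, Any]) -> None:
        """Restore configuration from a decoded payload."""

    @abstractmethod
    def _get_stats(self) -> Dict[str, Any]:
        """Return fitted statistics."""

    @abstractmethod
    def _set_stats(self, stats: Dict[str, Any]) -> None:
        """Restore fitted statistics from a decoded payload."""

    @final  # enforced by type checkers only, not at runtime
    def serialize(self) -> bytes:
        # Assumed payload layout; the real format is not shown in the commit.
        payload = {
            "fields": self._get_serializable_fields(),
            "stats": self._get_stats(),
        }
        return pickle.dumps(payload)

    @classmethod
    @final
    def deserialize(cls, data: bytes) -> "ToySerializableBase":
        payload = pickle.loads(data)
        obj = cls.__new__(cls)  # skip __init__; state comes from the payload
        obj._set_serializable_fields(payload["fields"])
        obj._set_stats(payload["stats"])
        return obj


class ToyScaler(ToySerializableBase):
    """Hypothetical subclass that implements only the four hooks."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.stats_ = {}

    def _get_serializable_fields(self):
        return {"columns": self.columns}

    def _set_serializable_fields(self, fields):
        self.columns = fields["columns"]

    def _get_stats(self):
        return dict(self.stats_)

    def _set_stats(self, stats):
        self.stats_ = dict(stats)
```

In this arrangement a subclass never touches the wire format: it supplies and consumes plain dicts, and the base class decides how they are encoded.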
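The Concatenator change in item 6 follows a common pattern for tolerating state saved before a field existed. The sketch below is an assumption-laden illustration: `ToyConcatenator` is a hypothetical class, and the default of `False` for `flatten` is chosen here for the example rather than taken from the commit.

```python
class ToyConcatenator:
    """Hypothetical class showing a __setstate__ that fills in a newer field."""

    def __init__(self, columns, flatten=False):
        self.columns = list(columns)
        self.flatten = flatten

    def __setstate__(self, state):
        # State saved before `flatten` was introduced has no such key.
        # Copy the dict so the caller's state is not mutated, then default it.
        state = dict(state)
        state.setdefault("flatten", False)  # assumed default
        self.__dict__.update(state)


def restore(cls, state):
    """Mimic what pickle does on load: allocate without __init__, then set state."""
    obj = cls.__new__(cls)
    obj.__setstate__(state)
    return obj


# State as an older version might have saved it, with no `flatten` key.
legacy = restore(ToyConcatenator, {"columns": ["a", "b"]})
# State saved by the current version keeps its explicit value.
current = restore(ToyConcatenator, {"columns": ["a"], "flatten": True})
```

Using `setdefault` keeps the override a no-op for current payloads, so only legacy state takes the fallback path.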
The new architecture makes it easier to:
- Add new serialization formats without modifying core logic
- Maintain backward compatibility with existing serialized data
- Handle version migrations for preprocessor schemas
- Register new preprocessors with stable identifiers
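To illustrate how format handlers and magic-byte detection (item 1) let new formats be added without touching core logic, here is a minimal sketch. All class and function names are hypothetical, and stdlib `pickle` stands in for cloudpickle so the example has no third-party dependency. Only the `CPKL:` prefix and the idea of a legacy pickle fallback come from the commit message.

```python
import pickle
from abc import ABC, abstractmethod
from typing import Any


class ToyHandler(ABC):
    """Hypothetical handler interface: one subclass per on-disk format."""

    MAGIC: bytes = b""

    @abstractmethod
    def dumps(self, obj: Any) -> bytes:
        """Encode obj into this handler's format."""

    @abstractmethod
    def loads(self, data: bytes) -> Any:
        """Decode bytes previously produced by dumps."""


class ToyPrefixedHandler(ToyHandler):
    """Stand-in for the primary format: payload tagged with a magic prefix."""

    MAGIC = b"CPKL:"

    def dumps(self, obj):
        return self.MAGIC + pickle.dumps(obj)

    def loads(self, data):
        return pickle.loads(data[len(self.MAGIC):])


class ToyLegacyHandler(ToyHandler):
    """Stand-in for the legacy format: raw pickle bytes with no prefix."""

    def dumps(self, obj):
        return pickle.dumps(obj)

    def loads(self, data):
        return pickle.loads(data)


def get_handler(data: bytes) -> ToyHandler:
    """Pick a handler by inspecting the leading bytes of the payload."""
    if data.startswith(ToyPrefixedHandler.MAGIC):
        return ToyPrefixedHandler()
    # Anything without a known prefix is treated as legacy pickle.
    return ToyLegacyHandler()


payload = {"columns": ["a", "b"]}
new_bytes = ToyPrefixedHandler().dumps(payload)
old_bytes = ToyLegacyHandler().dumps(payload)
```

Callers only ever go through `get_handler(data).loads(data)`, so supporting another format in this sketch means adding a handler class and one more prefix check.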
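Registration under a stable identifier (item 3) can be pictured as a decorator that records each class in a lookup table. The commit message does not show the decorator's signature, so the `identifier` and `version` parameters, the `serializable_preprocessor` and `lookup` names, and the class attributes below are all assumptions made for illustration. The exception here is a local stand-in that merely shares a name with the one the commit adds.

```python
from typing import Callable, Dict, Type

# Hypothetical registry mapping a stable identifier to a class.
_REGISTRY: Dict[str, Type] = {}


class UnknownPreprocessorError(Exception):
    """Local stand-in: raised when an identifier has no registered class."""


def serializable_preprocessor(identifier: str, version: int) -> Callable[[Type], Type]:
    """Hypothetical class decorator; parameters are assumed, not from the commit."""

    def decorator(cls: Type) -> Type:
        cls.IDENTIFIER = identifier
        cls.VERSION = version
        _REGISTRY[identifier] = cls
        return cls

    return decorator


def lookup(identifier: str) -> Type:
    """Resolve an identifier read from serialized data back to a class."""
    try:
        return _REGISTRY[identifier]
    except KeyError:
        raise UnknownPreprocessorError(
            f"no preprocessor registered as {identifier!r}"
        ) from None


@serializable_preprocessor(identifier="toy.standard_scaler", version=1)
class ToyStandardScaler:
    """Hypothetical preprocessor registered under a stable name."""
```

Because serialized data would carry the identifier rather than a module path, a class in this sketch can be renamed or moved without invalidating existing payloads, as long as the identifier stays the same.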
---------
Signed-off-by: cem <cem@anyscale.com>
1 parent d50a275 commit f8db43c
File tree
14 files changed, +1886 -28 lines changed
- doc/source
- train/user-guides
- python/ray/data
- preprocessors
- tests/preprocessors
Lines changed: 9 additions & 2 deletions