-
Notifications
You must be signed in to change notification settings - Fork 7.3k
[Data] ArrowInvalid error when you backfill missing fields from map tasks #60643
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 4 commits
4dccb02
861d061
0a741ae
a65c8d8
8c2cda3
773cd0d
8bdfa20
dd8a662
6a35990
ec2a812
4c52650
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -798,6 +798,35 @@ async def __call__(self, batch): | |
| assert len(output) == len(expected_output), (len(output), len(expected_output)) | ||
|
|
||
|
|
||
| def test_map_batches_struct_field_type_divergence(shutdown_only): | ||
| """Test map_batches with struct fields that have diverging primitive types.""" | ||
|
|
||
| def generator_fn(batch): | ||
| for i, row_id in enumerate(batch["id"]): | ||
| if i % 2 == 0: | ||
| # Yield struct with fields (a: int64, b: string) | ||
| yield {"data": [{"a": 1, "b": "hello"}]} | ||
| else: | ||
| # Yield struct with fields (a: float64, c: int32) | ||
| # Field 'a' has different type, field 'b' missing, field 'c' new | ||
| yield {"data": [{"a": 1.5, "c": 100}]} | ||
|
|
||
| ds = ray.data.range(4, override_num_blocks=1) | ||
| ds = ds.map_batches(generator_fn, batch_size=4) | ||
| result = ds.materialize() | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: I think this |
||
|
|
||
| rows = result.take_all() | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Redundant
|
||
| assert len(rows) == 4 | ||
|
|
||
| # Rows 0 and 2 should have int cast to float, with c=None | ||
| assert rows[0]["data"] == {"a": 1.0, "b": "hello", "c": None} | ||
| assert rows[2]["data"] == {"a": 1.0, "b": "hello", "c": None} | ||
|
|
||
| # Rows 1 and 3 should have float a, with b=None | ||
| assert rows[1]["data"] == {"a": 1.5, "b": None, "c": 100} | ||
| assert rows[3]["data"] == {"a": 1.5, "b": None, "c": 100} | ||
|
||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| import sys | ||
|
|
||
|
|
||


There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to this E2E test, I think we should add a unit test for this bug as well (maybe at the
concatfunction layer of abstraction). Unit tests are not only much faster to run, but also serve as documentationThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added unit test in: 6a35990