[Data] ArrowInvalid error when you backfill missing fields from map tasks#60643
[Data] ArrowInvalid error when you backfill missing fields from map tasks#60643bveeramani merged 11 commits intoray-project:masterfrom
Conversation
Signed-off-by: machichima <nary12321@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request addresses an ArrowInvalid error that occurs when backfilling missing fields in struct columns, particularly from map tasks. The issue arises from type mismatches between struct fields in different blocks (e.g., int64 vs. float64) that are not handled after schema unification. The proposed change correctly identifies these type discrepancies and explicitly casts the array to the unified field type. The implementation is clean, includes robust error handling with an informative ValueError, and effectively resolves the bug. The changes look good to me.
Signed-off-by: machichima <nary12321@gmail.com>
|
@bveeramani PTAL, thank you! |
Signed-off-by: machichima <nary12321@gmail.com>
|
@machichima can you please take a look at build failures? |
|
|
||
| ds = ray.data.range(4, override_num_blocks=1) | ||
| ds = ds.map_batches(generator_fn, batch_size=4) | ||
| result = ds.materialize() |
There was a problem hiding this comment.
Nit: I think this materialize is redundant. take_all() already materializes the dataset
| # Rows 0 and 2 should have int cast to float, with c=None | ||
| assert rows[0]["data"] == {"a": 1.0, "b": "hello", "c": None} | ||
| assert rows[2]["data"] == {"a": 1.0, "b": "hello", "c": None} | ||
|
|
||
| # Rows 1 and 3 should have float a, with b=None | ||
| assert rows[1]["data"] == {"a": 1.5, "b": None, "c": 100} | ||
| assert rows[3]["data"] == {"a": 1.5, "b": None, "c": 100} |
There was a problem hiding this comment.
Though I don't think this is possible given the current Ray Data implementation, this row ordering isn't guaranteed by the interface. This has historically been the cause of a lot of Ray Data's flaky tests.
Could you refactor this test so that it doesn't depend on a particular test ordering?
| assert len(output) == len(expected_output), (len(output), len(expected_output)) | ||
|
|
||
|
|
||
| def test_map_batches_struct_field_type_divergence(shutdown_only): |
There was a problem hiding this comment.
In addition to this E2E test, I think we should add a unit test for this bug as well (maybe at the concat function layer of abstraction). Unit tests are not only much faster to run, but also serve as documentation
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
|
@machichima looks like there's a minor import error. Would you mind fixing it? Think we should be good to land after |
Signed-off-by: machichima <nary12321@gmail.com>
| ds = ds.map_batches(generator_fn, batch_size=4) | ||
| result = ds.materialize() | ||
|
|
||
| rows = result.take_all() |
There was a problem hiding this comment.
Redundant materialize() call before take_all()
Low Severity
The materialize() call on line 824 is redundant because take_all() on line 826 already materializes the dataset. This was noted in the PR review comments. The extra call doesn't cause incorrect behavior but adds unnecessary overhead and clutters the test code.
Done! |
…asks (ray-project#60643) ## Description Try type casting if struct field types mismatch when backfilling missing fields ## Related issues Closes ray-project#60628 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: machichima <nary12321@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: tiennguyentony <46289799+tiennguyentony@users.noreply.github.com>
…asks (ray-project#60643) ## Description Try type casting if struct field types mismatch when backfilling missing fields ## Related issues Closes ray-project#60628 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: machichima <nary12321@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: tiennguyentony <46289799+tiennguyentony@users.noreply.github.com>
…asks (ray-project#60643) ## Description Try type casting if struct field types mismatch when backfilling missing fields ## Related issues Closes ray-project#60628 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: machichima <nary12321@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
…asks (#60643) ## Description Try type casting if struct field types mismatch when backfilling missing fields ## Related issues Closes #60628 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: machichima <nary12321@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…asks (#60643) ## Description Try type casting if struct field types mismatch when backfilling missing fields ## Related issues Closes #60628 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: machichima <nary12321@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
…asks (ray-project#60643) ## Description Try type casting if struct field types mismatch when backfilling missing fields ## Related issues Closes ray-project#60628 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: machichima <nary12321@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: Adel Nour <ans9868@nyu.edu>
…asks (ray-project#60643) ## Description Try type casting if struct field types mismatch when backfilling missing fields ## Related issues Closes ray-project#60628 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: machichima <nary12321@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
…asks (ray-project#60643) ## Description Try type casting if struct field types mismatch when backfilling missing fields ## Related issues Closes ray-project#60628 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: machichima <nary12321@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
…asks (ray-project#60643) ## Description Try type casting if struct field types mismatch when backfilling missing fields ## Related issues Closes ray-project#60628 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: machichima <nary12321@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>


Description
Try type casting if struct field types mismatch when backfilling missing fields
Related issues
Closes #60628
Additional information