[Data] ArrowInvalid error when you backfill missing fields from map tasks #60643

Merged
bveeramani merged 11 commits into ray-project:master from machichima:60628-arrow-invalid-err
Feb 4, 2026

Conversation

@machichima
Contributor

Description

When backfilling missing fields, try casting a struct field to the unified field type if the field types do not match.

Related issues

Closes #60628

Signed-off-by: machichima <nary12321@gmail.com>
@machichima machichima requested a review from a team as a code owner February 1, 2026 11:07
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses an ArrowInvalid error that occurs when backfilling missing fields in struct columns, particularly from map tasks. The issue arises from type mismatches between struct fields in different blocks (e.g., int64 vs. float64) that are not handled after schema unification. The proposed change correctly identifies these type discrepancies and explicitly casts the array to the unified field type. The implementation is clean, includes robust error handling with an informative ValueError, and effectively resolves the bug. The changes look good to me.

Signed-off-by: machichima <nary12321@gmail.com>
@machichima
Contributor Author

@bveeramani PTAL, thank you!

@ray-gardener ray-gardener bot added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels Feb 1, 2026
Signed-off-by: machichima <nary12321@gmail.com>
@alexeykudinkin alexeykudinkin added the go (add ONLY when ready to merge, run all tests) label Feb 2, 2026
@alexeykudinkin
Contributor

@machichima can you please take a look at build failures?


ds = ray.data.range(4, override_num_blocks=1)
ds = ds.map_batches(generator_fn, batch_size=4)
result = ds.materialize()
Member

Nit: I think this materialize is redundant. take_all() already materializes the dataset

Comment on lines 821 to 827
# Rows 0 and 2 should have int cast to float, with c=None
assert rows[0]["data"] == {"a": 1.0, "b": "hello", "c": None}
assert rows[2]["data"] == {"a": 1.0, "b": "hello", "c": None}

# Rows 1 and 3 should have float a, with b=None
assert rows[1]["data"] == {"a": 1.5, "b": None, "c": 100}
assert rows[3]["data"] == {"a": 1.5, "b": None, "c": 100}
Member

Though I don't think a different ordering is possible given the current Ray Data implementation, this row ordering isn't guaranteed by the interface. Relying on row order has historically been the cause of a lot of Ray Data's flaky tests.

Could you refactor this test so that it doesn't depend on a particular row ordering?
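One way to make such assertions independent of row order is to compare the rows as multisets. The sketch below is only an illustration and is not the change made in the PR; `freeze` and `assert_rows_equal_unordered` are hypothetical helper names, and the sample rows are written by hand to mirror the values in the assertions above.

```python
from collections import Counter


def freeze(value):
    """Recursively turn dicts and lists into hashable values."""
    if isinstance(value, dict):
        return frozenset((key, freeze(val)) for key, val in value.items())
    if isinstance(value, list):
        return tuple(freeze(val) for val in value)
    return value


def assert_rows_equal_unordered(actual, expected):
    """Assert that two row lists hold the same rows, ignoring order.

    Note: like plain `==` on dicts, this treats 1 and 1.0 as equal.
    """
    assert Counter(map(freeze, actual)) == Counter(map(freeze, expected))


# Hand-written rows in an arbitrary order.
rows = [
    {"data": {"a": 1.5, "b": None, "c": 100}},
    {"data": {"a": 1.0, "b": "hello", "c": None}},
    {"data": {"a": 1.5, "b": None, "c": 100}},
    {"data": {"a": 1.0, "b": "hello", "c": None}},
]
expected = (
    [{"data": {"a": 1.0, "b": "hello", "c": None}}] * 2
    + [{"data": {"a": 1.5, "b": None, "c": 100}}] * 2
)

assert_rows_equal_unordered(rows, expected)
```

Using `Counter` rather than a set keeps duplicate rows significant, so a test still fails if one kind of row appears too few or too many times.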

Contributor Author

Updated in 773cd0d

assert len(output) == len(expected_output), (len(output), len(expected_output))


def test_map_batches_struct_field_type_divergence(shutdown_only):
Member

In addition to this E2E test, I think we should add a unit test for this bug as well (maybe at the concat function layer of abstraction). Unit tests are not only much faster to run but also serve as documentation.

Contributor Author

Added unit test in: 6a35990

machichima and others added 4 commits February 3, 2026 20:06
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Member

@bveeramani bveeramani left a comment

Nice

@bveeramani
Member

@machichima looks like there's a minor import error. Would you mind fixing it? Think we should be good to land after

Signed-off-by: machichima <nary12321@gmail.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

ds = ds.map_batches(generator_fn, batch_size=4)
result = ds.materialize()

rows = result.take_all()

Redundant materialize() call before take_all()

Low Severity

The materialize() call on line 824 is redundant because take_all() on line 826 already materializes the dataset. This was noted in the PR review comments. The extra call doesn't cause incorrect behavior but adds unnecessary overhead and clutters the test code.

@machichima
Contributor Author

> @machichima looks like there's a minor import error. Would you mind fixing it? Think we should be good to land after

Done!

@bveeramani bveeramani enabled auto-merge (squash) February 4, 2026 00:52
@github-actions github-actions bot disabled auto-merge February 4, 2026 02:12
@bveeramani bveeramani merged commit 4cf7794 into ray-project:master Feb 4, 2026
6 checks passed
tiennguyentony pushed a commit to tiennguyentony/ray that referenced this pull request Feb 7, 2026
elliot-barn pushed a commit that referenced this pull request Feb 9, 2026
ans9868 pushed a commit to ans9868/ray that referenced this pull request Feb 18, 2026
Aydin-ab pushed a commit to kunling-anyscale/ray that referenced this pull request Feb 20, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026