Skip to content

bugfix: correct offsets when serializing a list of fixed sized list and non-zero start offset #7318

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Mar 24, 2025

Conversation

timsaucer
Copy link
Contributor

@timsaucer timsaucer commented Mar 21, 2025

Which issue does this PR close?

Rationale for this change

This bug occurs when these conditions are all met:

  • We have a generic list of fixed size list
  • The offset of the outer generic list is non-zero
  • We writing out (serializing) the arrays

What changes are included in this PR?

This PR checks for the case of a fixed sized list and adjusts the offsets to the child arrays during write.

Unit tests added.

Are there any user-facing changes?

None

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 21, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @timsaucer -- this looks great to me

I think it would be better with a test for nullable data as well (to make sure the nulls are sliced correctly) but in my opinion this PR is better than what is on main so it could be merged in as is as well

I had some small other comments, but nothing required in my opinion

@@ -3075,4 +3098,111 @@ mod tests {
assert_eq!(stream_bytes_written_on_flush, expected_stream_flushed_bytes);
assert_eq!(file_bytes_written_on_flush, expected_file_flushed_bytes);
}

#[test]
fn test_roundtrip_list_of_fixed_list() -> Result<(), ArrowError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified that this test fails without the code in this PR

assertion `left == right` failed
  left: RecordBatch { schema: Schema { fields: [Field { name: "points", data_type: List(Field { name: "item", data_type: FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 3), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [ListArray
[
  FixedSizeListArray<3>
[
  PrimitiveArray<Float32>
[
  10.0,
  11.0,
  12.0,
],
],
]], row_count: 1 }
 right: RecordBatch { schema: Schema { fields: [Field { name: "points", data_type: List(Field { name: "item", data_type: FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 3), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [ListArray
[
  FixedSizeListArray<3>
[
  PrimitiveArray<Float32>
[
  1.0,
  2.0,
  3.0,
],
],
]], row_count: 1 }

<Click to see difference>

thread 'writer::tests::test_roundtrip_list_of_fixed_list' panicked at arrow-ipc/src/writer.rs:3150:9:
assertion `left == right` failed
  left: RecordBatch { schema: Schema { fields: [Field { name: "points", data_type: List(Field { name: "item", data_type: FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 3), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [ListArray
[
  FixedSizeListArray<3>
[
  PrimitiveArray<Float32>
[
  10.0,
  11.0,
  12.0,
],
],
]], row_count: 1 }
 right: RecordBatch { schema: Schema { fields: [Field { name: "points", data_type: List(Field { name: "item", data_type: FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 3), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [ListArray
[
  FixedSizeListArray<3>
[
  PrimitiveArray<Float32>
[
  1.0,
  2.0,
  3.0,
],
],
]], row_count: 1 }
stack backtrace:
   0: rust_begin_unwind
             at /rustc/4eb161250e340c8f48f66e2b929ef4a5bed7c181/library/std/src/panicking.rs:692:5
   1: core::panicking::panic_fmt
             at /rustc/4eb161250e340c8f48f66e2b929ef4a5bed7c181/library/core/src/panicking.rs:75:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /Users/andrewlamb/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/panicking.rs:364:5
   4: arrow_ipc::writer::tests::test_subarray
             at ./src/writer.rs:3150:9
   5: arrow_ipc::writer::tests::test_roundtrip_list_of_fixed_list
             at ./src/writer.rs:3127:9
   6: arrow_ipc::writer::tests::test_roundtrip_list_of_fixed_list::{{closure}}
             at ./src/writer.rs:3083:47
   7: core::ops::function::FnOnce::call_once
             at /Users/andrewlamb/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
   8: core::ops::function::FnOnce::call_once
             at /rustc/4eb161250e340c8f48f66e2b929ef4a5bed7c181/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

@timsaucer
Copy link
Contributor Author

Thanks @alamb . I've added the requested changes. I appreciate the rapid response.

@timsaucer
Copy link
Contributor Author

Ah, I don't have merge approval for this repository since I'm not a committer on Arrow.

Comment on lines +3174 to +3183
let point = [Some(10.), None, None];
for p in point {
match p {
Some(p) => l2_builder.values().values().append_value(p),
None => l2_builder.values().values().append_null(),
}
}

l2_builder.values().append(true);
l2_builder.append(true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I wonder why this point [Some(10.), None, None] needs to be separately appended? Cannot it be in the above loop too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make this entry a different row of the outer list.

}

#[test]
fn test_roundtrip_fixed_list() -> Result<(), ArrowError> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add another test with null value too?

@viirya viirya merged commit 3b90fc9 into apache:main Mar 24, 2025
26 checks passed
@viirya
Copy link
Member

viirya commented Mar 24, 2025

Thanks @timsaucer @alamb

@emilk
Copy link
Contributor

emilk commented Mar 25, 2025

I would very much like to see this in a bugfix release 🙏

@alamb
Copy link
Contributor

alamb commented Mar 25, 2025

I would very much like to see this in a bugfix release 🙏

Thanks, i filed a ticket to track the idea. @timsaucer also was interested I think

mbrobbel pushed a commit to mbrobbel/arrow-rs that referenced this pull request Mar 25, 2025
…nd non-zero start offset (apache#7318)

* When serializing fixed length arrays, adjust the offsets for writing out

* Add unit test

* clippy warnings

* Add unit test for nulls

* Update unit test to account for which schema had nulls
mbrobbel pushed a commit to mbrobbel/arrow-rs that referenced this pull request Mar 25, 2025
…nd non-zero start offset (apache#7318)

* When serializing fixed length arrays, adjust the offsets for writing out

* Add unit test

* clippy warnings

* Add unit test for nulls

* Update unit test to account for which schema had nulls
timsaucer added a commit to timsaucer/arrow-rs that referenced this pull request Mar 25, 2025
…nd non-zero start offset (apache#7318)

* When serializing fixed length arrays, adjust the offsets for writing out

* Add unit test

* clippy warnings

* Add unit test for nulls

* Update unit test to account for which schema had nulls
alamb pushed a commit that referenced this pull request Mar 26, 2025
* bugfix: correct offsets when serializing a list of fixed sized list and non-zero start offset (#7318)

* When serializing fixed length arrays, adjust the offsets for writing out

* Add unit test

* clippy warnings

* Add unit test for nulls

* Update unit test to account for which schema had nulls

* Add missing type annotation (#7326)

* Update version

* Create changelog

---------

Co-authored-by: Tim Saucer <[email protected]>
alamb pushed a commit that referenced this pull request Mar 27, 2025
* bugfix: correct offsets when serializing a list of fixed sized list and non-zero start offset (#7318)

* When serializing fixed length arrays, adjust the offsets for writing out

* Add unit test

* clippy warnings

* Add unit test for nulls

* Update unit test to account for which schema had nulls

* Add missing type annotation (#7326)

* Update version

* Create changelog

---------

Co-authored-by: Matthijs Brobbel <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Round trip encoding of list of fixed list fails when offset is not zero
4 participants