Skip to content

fix: Re-implement some Parquet decode methods without copy_nonoverlapping #558

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Jun 13, 2024

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Jun 11, 2024

Which issue does this PR close?

Part of #557

Rationale for this change

Parquet decoding when converting between different integral types was using copy_nonoverlapping without meeting the precondition that both pointers were properly aligned.

What changes are included in this PR?

  • Rewrite without using copy_nonoverlapping
  • Added unit tests in Rust

How are these changes tested?

  • New tests in Rust
  • Existing tests in Scala

@andygrove andygrove changed the title fix: Re-implement int decode methods using safe code fix: Re-implement some Parquet decode methods using safe code Jun 11, 2024
@andygrove
Copy link
Member Author

Comparison of safe code vs using unsafe read/write aligned:

pub fn copy_i32_to_i16(src: &[u8], dst: &mut [u8], num: usize) {
    debug_assert!(src.len() >= num * 4, "Source slice is too small");
    debug_assert!(dst.len() >= num * 2, "Destination slice is too small");

    for i in 0..num {
        let i32_value =
            i32::from_le_bytes([src[i * 4], src[i * 4 + 1], src[i * 4 + 2], src[i * 4 + 3]]);

        // Downcast to i16, potentially losing data
        let i16_value = i32_value as i16;
        let i16_bytes = i16_value.to_le_bytes();

        dst[i * 2] = i16_bytes[0];
        dst[i * 2 + 1] = i16_bytes[1];
    }
}

pub fn copy_i32_to_i16_unsafe(src: &[u8], dst: &mut [u8], num: usize) {
    debug_assert!(src.len() >= num * 4, "Source slice is too small");
    debug_assert!(dst.len() >= num * 2, "Destination slice is too small");

    let src_ptr = src.as_ptr() as *const i32;
    let dst_ptr = dst.as_mut_ptr() as *mut i16;
    unsafe {
        for i in 0..num {
            dst_ptr
                .add(i)
                .write_unaligned(src_ptr.add(i).read_unaligned() as i16);
        }
    }
}
parquet_decode/decode_i32_to_i16_safe
                        time:   [839.29 ns 839.57 ns 839.93 ns]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
parquet_decode/decode_i32_to_i16_unsafe
                        time:   [68.269 ps 68.400 ps 68.593 ps]
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

I am going to reimplement with the unaligned approach and see if that has any safety issues

@parthchandra
Copy link
Contributor

Great benchmark Andy!
Is that really 839 nanoseconds versus 68 picoseconds or am I interpreting the units incorrectly?

@parthchandra
Copy link
Contributor

How would one even measure picoseconds. I expect that is a pseudo µ ? So micro not pico ?

@andygrove andygrove changed the title fix: Re-implement some Parquet decode methods using safe code fix: Re-implement some Parquet decode methods without copy_nonoverlapping Jun 11, 2024
@andygrove
Copy link
Member Author

Is that really 839 nanoseconds versus 68 picoseconds or am I interpreting the units incorrectly?

Yes, it really is. Here is the criterion code for displaying the time units.

pub fn time(ns: f64) -> String {
    if ns < 1.0 {
        format!("{:>6} ps", short(ns * 1e3))
    } else if ns < 10f64.powi(3) {
        format!("{:>6} ns", short(ns))
    } else if ns < 10f64.powi(6) {
        format!("{:>6} µs", short(ns / 1e3))
    } else if ns < 10f64.powi(9) {
        format!("{:>6} ms", short(ns / 1e6))
    } else {
        format!("{:>6} s", short(ns / 1e9))
    }
}

@andygrove
Copy link
Member Author

Is that really 839 nanoseconds versus 68 picoseconds or am I interpreting the units incorrectly?

I am questioning the earlier results now. Latest benchmark comparing safe version of decode_i32_to_u16 vs unsafe version of decode_i32_to_i16. Now seeing 67 ns not 68 ps. Curious.

parquet_decode/decode_i32_to_i16
                        time:   [66.709 ns 67.213 ns 67.711 ns]
                        change: [-2.4520% -1.0484% +0.3895%] (p = 0.15 > 0.05)
                        No change in performance detected.
parquet_decode/decode_i32_to_u16
                        time:   [589.31 ns 590.62 ns 592.21 ns]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

@andygrove andygrove marked this pull request as ready for review June 12, 2024 00:38
@parthchandra
Copy link
Contributor

Now seeing 67 ns not 68 ps. Curious

That makes more sense to me. It's hard to measure in picoseconds if the system clock only provides nanosecond granularity.

@andygrove
Copy link
Member Author

Once tests pass, I'll see if I can turn this back into macros to avoid so much boilerplate

@andygrove
Copy link
Member Author

That makes more sense to me. It's hard to measure in picoseconds if the system clock only provides nanosecond granularity.

That time was based on running ~75 million iterations and taking the average time

@andygrove
Copy link
Member Author

@parthchandra This is ready for review now

@@ -182,25 +182,6 @@ make_plain_dict_impl! { Int8Type, UInt8Type, Int16Type, UInt16Type, Int32Type, U
make_plain_dict_impl! { Int32DateType, Int64Type, FloatType, FLBAType }
make_plain_dict_impl! { DoubleType, Int64TimestampMillisType, Int64TimestampMicrosType }

impl PlainDecoding for Int32To64Type {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated macro is based on this impl and has replaced this impl

dst_offset += $type_size;
}
let dst_offset = dst.num_values * $type_width;
$copy_fn(&src.data[src.offset..], &mut dst_slice[dst_offset..], num);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that we are calling a macro-generated function here


// unsigned type require double the width and zeroes are written for the second half
// perhaps because they are implemented as the next size up signed type?
impl_plain_decoding_int!(UInt8Type, copy_i32_to_u8, 2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, why? I thought Int8 and UInt8 both are 8-bit width, no?

https://doc.rust-lang.org/book/ch03-02-data-types.html

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This confused me as well, but I am using the same type widths as the original code:

make_int_variant_impl!(Int8Type, i8, 1);
make_int_variant_impl!(UInt8Type, u8, 2);
make_int_variant_impl!(Int16Type, i16, 2);
make_int_variant_impl!(UInt16Type, u16, 4);
make_int_variant_impl!(UInt32Type, u32, 8);

This seems to be specific to the way Parquet represents unsigned types. I think Parquet only supports signed types so u8 becomes i16, u16 becomes i32, and so on

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. I forgot it. As Java doesn't support unsigned integer types, we need to use wider type to support it. For example, UInt8Array is read as Short, UInt32Array is read as Long, etc.

generate_cast_to_unsigned!(copy_i32_to_u32, i32, u32, 0_u32);

macro_rules! generate_cast_to_signed {
($name: ident, $src_type:ty, $dst_type:ty, $zero_value:expr) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems generate_cast_to_signed doesn't need zero_value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

…ain_decoding_int to the original name of make_int_variant_impl
@viirya
Copy link
Member

viirya commented Jun 13, 2024

I am questioning the earlier results now. Latest benchmark comparing safe version of decode_i32_to_u16 vs unsafe version of decode_i32_to_i16. Now seeing 67 ns not 68 ps. Curious.

So unsafe version is slower?

EDIT: I compared it with wrong item. It should compare with parquet_decode/decode_i32_to_i16_safe.

@viirya
Copy link
Member

viirya commented Jun 13, 2024

Parquet decoding when converting between different integral types was using copy_nonoverlapping without meeting the precondition that both pointers were properly aligned.

The newer Rust version requires alignments on copy_nonoverlapping call?

EDIT: I saw it now: https://doc.rust-lang.org/std/ptr/fn.copy_nonoverlapping.html

Behavior is undefined if any of the following conditions are violated:

  • Both src and dst must be properly aligned.

Comment on lines +463 to 471
unsafe {
for i in 0..num {
dst_ptr
.add(2 * i)
.write_unaligned(src_ptr.add(i).read_unaligned() as $dst_type);
// write zeroes
dst_ptr.add(2 * i + 1).write_unaligned($zero_value);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is any possible we make source and destination aligned? So we can still use unsafe copy_nonoverlapping?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would involve making an extra copy?

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For addressing the issue of copy_nonoverlapping with unaligned allocations, this looks good to me. Although I'm wondering if we can enhance alignment of these allocations.

@andygrove
Copy link
Member Author

The newer Rust version requires alignments on copy_nonoverlapping call?

Just to be clear, this was always the requirement, but Rust 1.78 added debug assertions to catch violations.

@andygrove
Copy link
Member Author

Thanks for the review @viirya

@andygrove andygrove merged commit 2fc45a2 into apache:main Jun 13, 2024
43 checks passed
@andygrove andygrove deleted the parquet-safe-decode branch June 13, 2024 21:38
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
…pping` (apache#558)

* Re-implement int decode methods using safe code

* fix

* fix tests

* more tests

* fix

* fix

* re-implement using unsafe read/write unaligned and add benchmark

* lint

* macros

* more macros

* combine macros

* replace another impl with the macro

* fix a regression

* remove zero_value arg from generate_cast_to_signed and rename impl_plain_decoding_int to the original name of make_int_variant_impl
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants