-
Notifications
You must be signed in to change notification settings - Fork 976
Improve performance of reading int8/int16 Parquet data #7055
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 15 commits
97418a7
6ca4aad
244e4b4
75adc3d
01d6c13
cf713ac
863e40f
21c0bb0
3d2b87f
023f7bd
db0823a
500a631
1df021a
dd9e262
6f583c4
b7a9167
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -29,8 +29,9 @@ use arrow_array::{ | |
TimestampNanosecondBufferBuilder, TimestampSecondBufferBuilder, | ||
}, | ||
ArrayRef, BooleanArray, Decimal128Array, Decimal256Array, Float32Array, Float64Array, | ||
Int32Array, Int64Array, TimestampMicrosecondArray, TimestampMillisecondArray, | ||
TimestampNanosecondArray, TimestampSecondArray, UInt32Array, UInt64Array, | ||
Int16Array, Int32Array, Int64Array, Int8Array, TimestampMicrosecondArray, | ||
TimestampMillisecondArray, TimestampNanosecondArray, TimestampSecondArray, UInt16Array, | ||
UInt32Array, UInt64Array, UInt8Array, | ||
}; | ||
use arrow_buffer::{i256, BooleanBuffer, Buffer}; | ||
use arrow_data::ArrayDataBuilder; | ||
|
@@ -261,6 +262,45 @@ where | |
// - date64: cast int32 to date32, then date32 to date64. | ||
// - decimal: cast int32 to decimal, int64 to decimal | ||
let array = match target_type { | ||
// Using `arrow_cast::cast` has been found to be very slow for converting | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. are there other conversions that can avoid the type cast? i saw this being expensive in some other benchmarks as well. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I briefly looked at others and it wasn't super obvious There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. what benchmark, btw? I am curious There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Several of the clickbench queries (not sure what data types, but it was spending like 20% of samples in casting during parquet reading). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll try to burrow down into that code again early next week and see if there are any other obvious candidates. I did try signed->unsigned for 32 and 64 bit ints and there was no difference. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
FWIW many of the clickbench columns are Int16, as I found when working on #7470. I started running some benchmarks on a draft update to parquet in this PR (hopefully it will show some improvements) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Ahh, the reason for this is that I32/64->U32/64 is handled above (around L171). I would think anything that falls through and relies on It might be worth exploring enumerating all of the allowed Parquet physical to logical type mappings and account for them here and not rely on There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
// INT32 physical type to lower bitwidth logical types. Since rust casts | ||
// are infallible, instead use `unary` which is much faster (by up to 40%). | ||
// One consequence of this approach is that some malformed integer columns | ||
// will return (an arguably correct) result rather than null. | ||
// See https://github.com/apache/arrow-rs/issues/7040 for a discussion of this | ||
// issue. | ||
ArrowType::UInt8 if *(array.data_type()) == ArrowType::Int32 => { | ||
let array = array | ||
.as_any() | ||
.downcast_ref::<Int32Array>() | ||
.unwrap() | ||
.unary(|i| i as u8) as UInt8Array; | ||
Arc::new(array) as ArrayRef | ||
} | ||
ArrowType::Int8 if *(array.data_type()) == ArrowType::Int32 => { | ||
let array = array | ||
.as_any() | ||
.downcast_ref::<Int32Array>() | ||
.unwrap() | ||
.unary(|i| i as i8) as Int8Array; | ||
Arc::new(array) as ArrayRef | ||
} | ||
ArrowType::UInt16 if *(array.data_type()) == ArrowType::Int32 => { | ||
let array = array | ||
.as_any() | ||
.downcast_ref::<Int32Array>() | ||
.unwrap() | ||
.unary(|i| i as u16) as UInt16Array; | ||
Arc::new(array) as ArrayRef | ||
} | ||
ArrowType::Int16 if *(array.data_type()) == ArrowType::Int32 => { | ||
let array = array | ||
.as_any() | ||
.downcast_ref::<Int32Array>() | ||
.unwrap() | ||
.unary(|i| i as i16) as Int16Array; | ||
Arc::new(array) as ArrayRef | ||
} | ||
ArrowType::Date64 if *(array.data_type()) == ArrowType::Int32 => { | ||
// this is cheap as it internally reinterprets the data | ||
let a = arrow_cast::cast(&array, &ArrowType::Date32)?; | ||
|
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my own convenience I pulled the benchmark into its own PR to make it easier to compare this branch to main:
arrow_reader
benchmarks for {u}int{8,16} columns #7484