Use IntoIter trait for write_batch/write_mini_batch #43

Closed
@alamb

Description

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-5153

Writing data to a parquet file requires a lot of copying and intermediate Vec creation. Take a record struct like:

```rust
struct MyData {
    name: String,
    address: Option<String>,
}
```

Over the course of working with sets of this data, you'll have the bulk data Vec, the names column in a Vec<&String>, and the address column in a Vec<Option<&String>>. This puts extra memory pressure on the system: at a minimum, we have to allocate a Vec the same size as the bulk data even if we are only using references.

What I'm proposing is to use an IntoIterator style. This maintains backward compatibility, since a slice automatically implements IntoIterator. ColumnWriterImpl#write_batch would go from `values: &[T::T]` to `values: impl IntoIterator<Item = T::T>`. Then you can do things like:

```rust
write_batch(bulk.iter().map(|x| &x.name), None, None);
write_batch(
    bulk.iter().map(|x| x.address.as_ref()),
    Some(bulk.iter().map(|x| x.address.is_some())),
    None,
);
```

and you can see there's no need for an intermediate Vec, so no short-term allocations to write out the data.
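A rough sketch of what such a signature could look like (this is hypothetical, not the actual parquet crate API; the body just consumes the values as a stand-in for the real write logic):

```rust
// Accepting any IntoIterator means both slices (backward compatible)
// and lazy iterator adaptors work, with no intermediate Vec.
fn write_batch<T, I>(values: I) -> usize
where
    I: IntoIterator<Item = T>,
{
    // Stand-in for the real write logic: just count the values consumed.
    values.into_iter().count()
}

fn main() {
    let data = vec![1i32, 2, 3];

    // Backward compatible: a slice/Vec reference implements IntoIterator.
    assert_eq!(write_batch(&data), 3);

    // No intermediate Vec: the map adaptor is consumed lazily.
    assert_eq!(write_batch(data.iter().map(|x| x * 2)), 3);
}
```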

I am writing data with many columns and I think this would really help to speed things up.

Labels: parquet (Changes to the parquet crate)