
Allow reading more than 32k Parquet row groups #10149

Open: etseidl wants to merge 4 commits into apache:main from etseidl:i32_rowgroup_ordinal

etseidl commented Jun 16, 2026

Which issue does this PR close?

Rationale for this change

The parquet crate will error if more than 32767 row groups are present in the file. This is a limit imposed on write when encryption is in use, but there is no other limit on the number of row groups beyond that imposed by the Thrift compact protocol.

What changes are included in this PR?

This changes the ordinal field of the RowGroupMetaData from an i16 to i32. This allows reading up to the maximum number of row groups allowed by Thrift. On write, the ordinal on the RowGroup will not be written if more than 32k row groups are present.
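The write-side rule described above can be sketched roughly as follows. This is a standalone illustration, not the code in this PR: `ordinals_for_write` is a hypothetical helper, and the exact threshold (every index must fit in an `i16`) is an assumption based on the description.

```rust
/// Hypothetical helper (not part of the parquet crate): decide which ordinal,
/// if any, would be written for each row group in a file.
///
/// Assumption: the on-disk `ordinal` field is an i16, so ordinals are written
/// only when every row group index fits in an i16. Otherwise none are written.
fn ordinals_for_write(num_row_groups: usize) -> Vec<Option<i16>> {
    // Indexes run from 0 to num_row_groups - 1, so the largest count whose
    // indexes all fit in an i16 is i16::MAX + 1.
    let all_fit = num_row_groups <= i16::MAX as usize + 1;
    (0..num_row_groups)
        .map(|idx| {
            if all_fit {
                // Cannot fail here, but try_from keeps the conversion checked.
                i16::try_from(idx).ok()
            } else {
                None
            }
        })
        .collect()
}
```

With this rule, a file with 3 row groups gets ordinals 0, 1, and 2, while a file with 40,000 row groups gets no ordinals at all.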

Are these changes tested?

Yes

Are there any user-facing changes?

Yes, RowGroupMetaData::ordinal now returns Option<i32> and RowGroupMetaDataBuilder::set_ordinal takes Option<i32>.
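For callers affected by the type change, the conversions involved might look like the sketch below. The function names `widen_ordinal` and `narrow_ordinal` are hypothetical and only illustrate moving between the old `Option<i16>` and the new `Option<i32>`; they are not part of the parquet crate.

```rust
/// Hypothetical: convert an ordinal held as the old Option<i16> into the
/// Option<i32> that the changed API uses. Widening never loses information.
fn widen_ordinal(old: Option<i16>) -> Option<i32> {
    old.map(i32::from)
}

/// Hypothetical: convert a new-style Option<i32> ordinal back to Option<i16>
/// for code that still needs the narrow type. Values that do not fit in an
/// i16 become None rather than being truncated.
fn narrow_ordinal(new: Option<i32>) -> Option<i16> {
    new.and_then(|v| i16::try_from(v).ok())
}
```

Widening is lossless, so code that passes ordinals into `set_ordinal` only needs the first conversion. Code that reads ordinals and still expects an `i16` has to decide what to do when the value is out of range; returning `None` is one option.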

etseidl added the api-change and next-major-release labels Jun 16, 2026
github-actions bot added the parquet label Jun 16, 2026
etseidl commented Jun 16, 2026

If this is needed sooner, I can revert the public changes and add an i32 accessor for use by the row numbering.

alamb left a comment

This looks good to me @etseidl -- thank you

I think we could potentially reduce the change required in this PR (and not have to change any APIs) by simply not populating the ordinal field for row groups whose indexes are too large, and keeping the i16 return type.

What do you think?
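As a rough illustration of this suggestion (an assumption about how it could work, not code from the PR), a per-row-group version would populate the ordinal only when the index fits in an `i16`. The name `ordinal_for_index` is hypothetical.

```rust
/// Hypothetical sketch of the per-row-group alternative: keep the i16 type
/// and leave the ordinal unset for any index that does not fit.
fn ordinal_for_index(row_group_index: usize) -> Option<i16> {
    // try_from fails for indexes above i16::MAX; ok() turns that into None
    // instead of an error.
    i16::try_from(row_group_index).ok()
}
```

Under this approach, early row groups keep their ordinals and only the ones past `i16::MAX` go without, in contrast to dropping the ordinal for every row group in a large file.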

let ordinal = self.row_group_index;

let ordinal: i16 = ordinal.try_into().map_err(|_| {
// Thrift cannot encode lists with more than 2B elements

I am not sure what "2B elements" refers to.

self.write_path_in_schema
}

/// Control the writing of the `ordinal` element of the `RowGroup` struct.

Should we add some context here about why one would disable writing the optional ordinal field? Namely, because it can't be set for row group indexes greater than i16::MAX?

let mut writer = SerializedFileWriter::new(&file, schema, props).unwrap();

// Create 32k + 1 empty rowgroups. No row group ordinals should be written (but we can't
// test for that).

and arguably it is an implementation detail -- just testing writing 32k row groups is enough

#[allow(unused_assignments)]
fn write_thrift<W: Write>(&self, writer: &mut ThriftCompactOutputProtocol<W>) -> Result<()> {
writer.set_write_path_in_schema(self.write_path_in_schema);
// only write ordinal if all values will fit in an i16

Would it be possible to write the ordinal for the first 2^15-1 (32767) row groups, and then just not write it for later row groups? I think you could then avoid changing the thrift writer and adding set_write_row_group_ordinal.

Development

Successfully merging this pull request may close these issues.

parquet reader fails to read files with more than 32767 row groups when RowGroup.ordinal is absent
