@anton-kutuzov
Contributor

@anton-kutuzov anton-kutuzov commented Apr 20, 2025

Description

We have tables in the dds layer that look like this:

CREATE TABLE dwh.dds.some_table (
   id1 bigint,
   flag boolean,
   id2 bigint
)
WITH (
   external_location = 's3a://some_path',
   format = 'ORC'
)

where id1 is unique and id2 has only two values. The data file is 70 MB and contains more than 7,000,000,000 rows.
When we try to read the data, the following error occurs:

TrinoExternalError(type=EXTERNAL, name=HIVE_CANNOT_OPEN_SPLIT, message="Error opening Hive split s3a://some_path/some_table.orc (offset=0, length=71782995): integer overflow", query_id= 20250417_090308_49505_msr65)

This happens because OrcProto stores the stripe row count as a long, while the Trino code uses an int.
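
To illustrate the failure mode, here is a minimal sketch (not the actual Trino code) of what happens when a long row count is narrowed to an int with `Math.toIntExact`, the call that appears at the top of the stacktrace. The class and method names `StripeRowCountOverflow`, `narrowRowCount`, and `overflows` are hypothetical and exist only for this example.

```java
public class StripeRowCountOverflow {
    // Hypothetical stand-in for code that narrows a long row count to an int.
    // Math.toIntExact throws ArithmeticException("integer overflow") when the
    // value does not fit in the int range.
    public static int narrowRowCount(long numberOfRows) {
        return Math.toIntExact(numberOfRows);
    }

    // Returns true when the given row count cannot be represented as an int.
    public static boolean overflows(long numberOfRows) {
        try {
            narrowRowCount(numberOfRows);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(overflows(Integer.MAX_VALUE)); // false: 2,147,483,647 still fits
        System.out.println(overflows(7_000_000_000L));    // true: exceeds the int range
    }
}
```

A row count on the order of 7,000,000,000 is more than three times `Integer.MAX_VALUE`, so any int-typed field along the read path would hit this exception.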

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Hive, Iceberg
* Fix query failure when reading ORC files with a large row count. ({issue}`25634`)

@cla-bot

cla-bot bot commented Apr 20, 2025

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Anton Kutuzov.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. Check if your git client is configured with an email to sign commits: git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@anton-kutuzov anton-kutuzov force-pushed the fix_issue_when_orc_file_has_more_then_max_int_rows branch from 0385952 to 6adcd9b Compare April 20, 2025 18:08
@cla-bot cla-bot bot added the cla-signed label Apr 20, 2025
@anton-kutuzov anton-kutuzov force-pushed the fix_issue_when_orc_file_has_more_then_max_int_rows branch from 6adcd9b to 8e9477e Compare April 21, 2025 05:26
@anton-kutuzov anton-kutuzov self-assigned this Apr 21, 2025
@anton-kutuzov anton-kutuzov requested a review from ebyhr April 21, 2025 06:24
Member

@raunaqmorarka raunaqmorarka left a comment

Can you show the full error stacktrace from the failed query?
I don't think Trino would write ORC files with such a large stripe row count, so I'd expect the code changes to be mostly in the reader classes.

@anton-kutuzov
Contributor Author

anton-kutuzov commented Apr 21, 2025

Can you show the full error stacktrace from the failed query ? I don't think Trino would write orc files with such large stripe row count, so I'd expect the code changes to be mostly in the reader classes.

Yes, of course. We write these files using other tools.
Here is the error stacktrace:

"type": "io.trino.spi.TrinoException",
  "message": "Error opening Hive split s3a://path/file.orc (offset=0, length=53002489): integer overflow",
  "cause": {
    "type": "java.lang.ArithmeticException",
    "message": "integer overflow",
    "suppressed": [],
    "stack": [
      "java.base/java.lang.Math.toIntExact(Math.java:1372)",
      "io.trino.orc.metadata.OrcMetadataReader.toStripeInformation(OrcMetadataReader.java:166)",
      "java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:215)",
      "java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1709)",
      "java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:570)",
      "java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:560)",
      "java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)",
      "java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:265)",
      "java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:727)",
      "io.trino.orc.metadata.OrcMetadataReader.toStripeInformation(OrcMetadataReader.java:160)",
      "io.trino.orc.metadata.OrcMetadataReader.readFooter(OrcMetadataReader.java:149)",
      "io.trino.orc.metadata.ExceptionWrappingMetadataReader.readFooter(ExceptionWrappingMetadataReader.java:74)",
      "io.trino.orc.OrcReader.<init>(OrcReader.java:205)",
      "io.trino.orc.OrcReader.createOrcReader(OrcReader.java:118)",
      "io.trino.orc.OrcReader.createOrcReader(OrcReader.java:95)",
      "io.trino.plugin.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:269)",
      "io.trino.plugin.hive.orc.OrcPageSourceFactory.createPageSource(OrcPageSourceFactory.java:194)",
      "io.trino.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:203)",
      "io.trino.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:138)",
      "io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:48)",
      "io.trino.split.PageSourceManager$PageSourceProviderInstance.createPageSource(PageSourceManager.java:79)",
      "io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:265)",
      "io.trino.operator.Driver.processInternal(Driver.java:403)",
      "io.trino.operator.Driver.lambda$process$8(Driver.java:306)",
      "io.trino.operator.Driver.tryWithLock(Driver.java:709)",
      "io.trino.operator.Driver.process(Driver.java:298)",
      "io.trino.operator.Driver.processForDuration(Driver.java:269)",
      "io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:890)",
      "io.trino.execution.executor.timesharing.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:187)",
      "io.trino.execution.executor.timesharing.TimeSharingTaskExecutor$TaskRunner.run(TimeSharingTaskExecutor.java:565)",
      "io.trino.$gen.Trino_461____20250416_210446_2.run(Unknown Source)",
      "java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)",
      "java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)",
      "java.base/java.lang.Thread.run(Thread.java:1575)"
    ],
    "errorCode": {
      "code": 65536,
      "name": "GENERIC_INTERNAL_ERROR",
      "type": "INTERNAL_ERROR",
      "fatal": false
    }
  }

The writer classes change only in the places where numberOfRows is used, because its type was changed from int to long to match OrcProto.
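
As a rough sketch of what that type change means for callers, a stripe metadata holder with a long row count could look like the class below. `StripeInfoSketch` is a hypothetical, simplified class written for this comment; it is not the real `StripeInformation` and omits its other fields.

```java
public final class StripeInfoSketch {
    // Row count kept as long so that stripes with more than
    // Integer.MAX_VALUE rows can be represented.
    private final long numberOfRows;

    public StripeInfoSketch(long numberOfRows) {
        if (numberOfRows < 0) {
            throw new IllegalArgumentException("numberOfRows is negative: " + numberOfRows);
        }
        this.numberOfRows = numberOfRows;
    }

    // Every caller that previously received an int now receives a long,
    // which is why signatures on the writer side change as well.
    public long getNumberOfRows() {
        return numberOfRows;
    }

    public static void main(String[] args) {
        StripeInfoSketch stripe = new StripeInfoSketch(7_000_000_000L);
        System.out.println(stripe.getNumberOfRows()); // 7000000000
    }
}
```

Once the accessor returns a long, any method that accepts the value has to widen its parameter too, even if it never sees a large count in practice.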

@anton-kutuzov anton-kutuzov force-pushed the fix_issue_when_orc_file_has_more_then_max_int_rows branch from 8e9477e to 2f4a113 Compare April 21, 2025 19:21
@anton-kutuzov
Contributor Author

@raunaqmorarka could you please merge the PR? Or are you waiting for something from me?

@raunaqmorarka raunaqmorarka force-pushed the fix_issue_when_orc_file_has_more_then_max_int_rows branch 2 times, most recently from 82dae36 to 9c49dbc Compare May 9, 2025 04:50
@raunaqmorarka raunaqmorarka force-pushed the fix_issue_when_orc_file_has_more_then_max_int_rows branch from 9c49dbc to 0a076bf Compare May 9, 2025 05:00
@raunaqmorarka raunaqmorarka requested a review from Copilot May 9, 2025 06:38

Copilot AI left a comment

Pull Request Overview

This PR fixes an issue with reading ORC files that have a very large stripe by updating the data types and method signatures from int to long where necessary to prevent integer overflow.

  • Update test coverage to verify large stripe row counts
  • Change field types and method parameters from int to long
  • Adjust calculations (e.g. ceil function) accordingly in multiple modules
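
For the ceil adjustment mentioned above, a ceiling division on long values might be sketched as follows. This is an illustration under assumptions, not the code in `StripeReader`: the class `LongCeilSketch`, the methods `ceil` and `rowGroupCount`, and the 10,000 rows-per-row-group figure are all chosen for the example.

```java
public class LongCeilSketch {
    // Ceiling division for a non-negative dividend and a positive divisor.
    // Written with a remainder check so that it cannot overflow, unlike
    // (dividend + divisor - 1) / divisor near Long.MAX_VALUE.
    public static long ceil(long dividend, long divisor) {
        if (dividend < 0 || divisor <= 0) {
            throw new IllegalArgumentException("expected dividend >= 0 and divisor > 0");
        }
        return dividend / divisor + (dividend % divisor == 0 ? 0 : 1);
    }

    // Hypothetical helper: how many row groups a stripe spans.
    public static long rowGroupCount(long stripeRows, long rowsInRowGroup) {
        return ceil(stripeRows, rowsInRowGroup);
    }

    public static void main(String[] args) {
        // 7,000,000,001 rows at an assumed 10,000 rows per row group
        System.out.println(rowGroupCount(7_000_000_001L, 10_000L)); // 700001
    }
}
```

With an int-typed version of the same helper, the stripe row count would already have overflowed before the division was reached.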

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| lib/trino-orc/src/test/java/io/trino/orc/metadata/TestOrcMetadataReader.java | Adds a test verifying correct handling of large row counts |
| lib/trino-orc/src/main/java/io/trino/orc/metadata/StripeInformation.java | Updates the numberOfRows field and related accessors from int to long |
| lib/trino-orc/src/main/java/io/trino/orc/metadata/OrcMetadataReader.java | Adapts conversion to use long values for stripe row counts |
| lib/trino-orc/src/main/java/io/trino/orc/StripeReader.java | Changes parameters and local variable types from int to long and revises the ceil function |
| lib/trino-orc/src/main/java/io/trino/orc/OrcWriterStats.java | Updates the signature for stripeRows to long |
| lib/trino-orc/src/main/java/io/trino/orc/OrcWriterFlushStats.java | Updates the signature for stripeRows to long |
| lib/trino-orc/src/main/java/io/trino/orc/OrcWriteValidation.java | Revises addStripe method to accept long values |
| lib/trino-orc/src/main/java/io/trino/orc/OrcRecordReader.java | Updates helper method validateWriteStripe to use long |

@raunaqmorarka raunaqmorarka merged commit 6e94019 into trinodb:master May 9, 2025
58 of 59 checks passed
@github-actions github-actions bot added this to the 476 milestone May 9, 2025