
Conversation

@razajafri (Collaborator) commented Sep 25, 2025

Fixes #13401
Fixes #13487

Description

Problem
Previously, GPU support for reading Delta Lake deletion vectors (DVs) was only available for the PERFILE Parquet reader. As a result, users of the multithreaded reader would experience CPU fallback and degraded performance when reading Delta tables with deletion vectors. This PR closes that gap, improving performance, consistency, and feature parity.

Key changes

  • Implements logic to read and process Delta deletion vectors within the multithreaded Parquet reader path (see the usage sketch after this list).
  • Ensures the results do not depend on the file ordering between Spark and the GPU, for robust integration.
  • When deletion vectors are present, the COALESCING reader falls back to the MULTITHREADED reader regardless of the config, after logging a warning.
  • Updates and renames tests to validate PERFILE, COALESCING, and MULTITHREADED reader support for DVs. The test modified is test_delta_deletion_vector_read.
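
As a usage illustration, here is a minimal sketch (not code from this PR) of the scenario this change accelerates. The table name dv_demo and the row counts are made up; the reader-type config comes from the benchmark setup below, and delta.enableDeletionVectors is Delta Lake's documented table property.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming the RAPIDS plugin and Delta Lake are on the
// classpath and configured as in the SPARK_CONF block later in this
// description. Names like dv_demo are illustrative only.
object DvReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dv-read-example")
      .config("spark.rapids.sql.format.parquet.reader.type", "MULTITHREADED")
      .getOrCreate()

    spark.sql("CREATE TABLE dv_demo (id BIGINT) USING delta " +
      "TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')")
    spark.sql("INSERT INTO dv_demo SELECT id FROM range(1000000)")
    spark.sql("DELETE FROM dv_demo WHERE id % 10 = 0") // writes deletion vectors

    // Before this PR, this scan fell back to the CPU with the multithreaded
    // reader; with this PR it stays on the GPU.
    spark.sql("SELECT count(*) FROM dv_demo").show()
    spark.stop()
  }
}
```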

Performance

| Percentage Deleted | Baseline (commit 5d88f6a) Average | DV MULTITHREADED Reader Average | Speedup |
| --- | --- | --- | --- |
| 5% | 38504.6 | 29869.6 | 1.29 |
| 10% | 37225.6 | 29824.2 | 1.25 |
| 25% | 37978.4 | 29465.6 | 1.29 |
| 50% | 36255.6 | 28331 | 1.28 |
| 75% | 35527.8 | 28369.8 | 1.25 |
| 50% (500 small files) | 35464 | 28947.4 | 1.23 |

Speedup is the baseline average divided by the DV MULTITHREADED reader average (e.g. 38504.6 / 29869.6 ≈ 1.29).

Baseline: commit id - 5d88f6a
Target: This PR
Dataset: TPC-DS (sf100_parquet)
Environment: Local
Spark Configs:

```bash
export SPARK_CONF=("--master" "local[16]"
                   "--conf" "spark.driver.maxResultSize=2GB"
                   "--conf" "spark.driver.memory=50G"
                   "--conf" "spark.executor.cores=16"
                   "--conf" "spark.executor.instances=1"
                   "--conf" "spark.executor.memory=16G"
                   "--conf" "spark.driver.maxResultSize=4gb"
                   "--conf" "spark.sql.files.maxPartitionBytes=2gb"
                   "--conf" "spark.sql.adaptive.enabled=true"
                   "--conf" "spark.plugins=com.nvidia.spark.SQLPlugin"
                   "--conf" "spark.rapids.memory.host.spillStorageSize=16G"
                   "--conf" "spark.rapids.memory.pinnedPool.size=8g"
                   "--conf" "spark.rapids.sql.concurrentGpuTasks=3"
                   "--conf" "spark.rapids.sql.explain=all"
                   "--conf" "spark.eventLog.enabled=true"
                   "--conf" "spark.eventLog.dir=/tmp/spark-events"
                   "--conf" "spark.sql.warehouse.dir=/home/rjafri/spark-warehouse"
                   "--conf" "spark.sql.legacy.createHiveTableByDefault=false"
                   "--conf" "spark.databricks.delta.deletionVectors.useMetadataRowIndex=false"
                   "--conf" "spark.rapids.sql.format.parquet.reader.type=MULTITHREADED"
                   "--packages" "io.delta:delta-spark_2.12:3.3.0"
                   "--conf" "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension"
                   "--conf" "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
                   "--conf" "spark.driver.extraClassPath=$SPARK_RAPIDS_PLUGIN_JAR:$NDS_LISTENER_JAR"
                   "--conf" "spark.executor.extraClassPath=$SPARK_RAPIDS_PLUGIN_JAR:$NDS_LISTENER_JAR")

Query:

```sql
-- start query 1
select *
from store_sales
ORDER BY ss_ticket_number DESC
LIMIT 1000000;
-- end query 1
```

Checklists

  - [ ] This PR has added documentation for new or modified features or behaviors.
  - [x] This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  - [x] Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

@razajafri changed the title Add Deletion Vector Read support to multithreaded Parquet reader → Add Deletion Vector Read Support to Multithreaded Parquet Reader Sep 25, 2025
@razajafri marked this pull request as ready for review September 25, 2025 03:07
```scala
    tablePath: Option[String]
) extends GpuParquetMultiFilePartitionReaderFactory(fileScan.conf, broadcastedConf,
  fileScan.relation.dataSchema, fileScan.requiredSchema, fileScan.readPartitionSchema,
  pushedFilters, fileScan.rapidsConf, fileScan.allMetrics, fileScan.queryUsesInputFile) {
```

We should enforce fileScan.queryUsesInputFile to true to avoid concatenating files.


This warrants a warning log as well. Also, please file an issue to support the concat.

@razajafri

build

@razajafri

All Delta Lake integration tests are passing

===================== 355 passed, 139 skipped, 64 warnings in 578.85s (0:09:38) =======================

@jihoonson (Collaborator) left a comment

Thanks @razajafri, looks good overall. I left some comments. Also, would you elaborate on what 50% (500 small files) means? Were there 500 files left after the delete? Was it only the case of 50% delete? What about other cases?

```scala
      indexVectorTuples += (isRowDeletedColumnOpt.get.index -> isRowDeletedVector)
      replaceVectors(batch, indexVectorTuples.toSeq: _*)
    } catch {
      case e: Exception => indexVectorTuples.foreach(item => item._2.close())
```

close() can throw an exception, and we'd want it to be reported as well.

Suggested change:

```diff
-case e: Exception => indexVectorTuples.foreach(item => item._2.close())
+case e: Exception =>
+  try {
+    indexVectorTuples.foreach(item => item._2.close())
+  } catch {
+    case t: Throwable =>
+      e.addSuppressed(t)
+  }
```

```python
    assert_gpu_and_cpu_are_equal_collect(read_parquet_sql(data_path))

fallback_readers_pre_353=["PERFILE", "MULTITHREADED", "COALESCING"]
fallback_readers_353_plus=["COALESCING"]
```
Why do we fall back for the COALESCING reader here? Based on this change, shouldn't it still run on the GPU, just with the multithreaded reader instead?

@razajafri replied:

Good catch, I should remove that.


According to your previous comment, this test seems to have passed with COALESCING fallback in your previous testing. Why did it pass? Can you make sure that this test passes after your last update?

@razajafri

> Thanks @razajafri, looks good overall. I left some comments. Also, would you elaborate on what 50% (500 small files) means? Were there 500 files left after the delete? Was it only the case of 50% delete? What about other cases?

This case was first created to test the onPrem/COALESCING reader, as I wanted to make sure we were concatenating files before decoding them. This dataset has 500 files (~39M each) and around 30 deletion vector files to handle a 50% deletion of rows. I chose it because it was a good middle ground for a quick test.

But since we aren't supporting coalescing in this PR, it still provides good insight into the performance with small(er) files.

```scala
    fileScan: GpuFileSourceScanExec): PartitionReaderFactory = {

  if (fileScan.rapidsConf.isParquetCoalesceFileReadEnabled) {
    logWarning("Coalescing is not supported when Deletion Vectors are enabled, " +
```
Better to point at the actual table properties, or at something the user can check in the docs.


```scala
  if (fileScan.rapidsConf.isParquetCoalesceFileReadEnabled) {
    logWarning("Coalescing is not supported when Deletion Vectors are enabled, " +
      "using the multi-threaded reader")
```
Tie it to our config and docs so the user can check it.
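
A minimal sketch of what tying the warning to the config could look like, assuming slf4j for logging; the config key string is the one used in this PR's benchmark setup, while the object name, helper, and exact wording are illustrative:

```scala
import org.slf4j.LoggerFactory

object CoalescingFallbackWarning {
  private val log = LoggerFactory.getLogger(getClass)

  // Warn with the concrete config key so the user knows which knob applies.
  def warn(): Unit =
    log.warn("Coalescing is not supported when Deletion Vectors are enabled; " +
      "falling back to the MULTITHREADED reader " +
      "(see spark.rapids.sql.format.parquet.reader.type)")
}
```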

Comment on lines 322 to 325
```scala
if (results.length > 1) {
  throw new IllegalArgumentException(
    s"There are more than one column with name=`$name` requested in the reader output")
}
```
There is syntactic sugar for this:

Suggested change:

```diff
-if (results.length > 1) {
-  throw new IllegalArgumentException(
-    s"There are more than one column with name=`$name` requested in the reader output")
-}
+require(results.length <= 1,
+  s"There are more than one column with name=`$name` requested in the reader output")
```

The user would understand this message better if it were phrased

`N columns found for name ..., expected number of columns 1`

or something like that.

@razajafri replied:

The message is copied from OSS Delta. I think we should keep it the same to be consistent.

Comment on lines 588 to 590
```scala
if (rowIndexColumnOpt.isDefined) {
  indexVectorTuples += (rowIndexColumnOpt.get.index -> rowIndexGpuCol.incRefCount())
}
```

Scala way:

```scala
rowIndexColumnOpt.foreach { rowIndexColumn =>
  indexVectorTuples += (rowIndexColumn.index -> rowIndexGpuCol.incRefCount())
}
```

```scala
      indexVectorTuples += (isRowDeletedColumnOpt.get.index -> isRowDeletedVector)
      replaceVectors(batch, indexVectorTuples.toSeq: _*)
    } catch {
      case e: Exception => indexVectorTuples.foreach(item => item._2.close())
```

If one of the close() calls throws, the rest will leak. We need to use one of the safeClose / closeOnExcept kind of helpers.
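
For context, a minimal sketch of the close-everything-and-keep-the-first-failure pattern being suggested; safeClose and closeOnExcept are the plugin's own helpers, and this stand-in does not reflect their actual API:

```scala
// Close every resource even if some close() calls throw; rethrow the first
// failure with the rest attached as suppressed exceptions.
object SafeCloseSketch {
  def safeCloseAll(closeables: Seq[AutoCloseable]): Unit = {
    var first: Throwable = null
    closeables.foreach { c =>
      try c.close()
      catch {
        case t: Throwable =>
          if (first == null) first = t else first.addSuppressed(t)
      }
    }
    if (first != null) throw first
  }
}
```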


Oh yes, this is a better way. I thought indexVectorTuples was an Option. Since it's a list, Gera's suggestion is better.

@razajafri

Thank you for reviewing @gerashegalov @jihoonson. I have addressed your concerns, PTAL again.

```python
    assert_gpu_and_cpu_are_equal_collect(read_parquet_sql(data_path))

fallback_readers_pre_353=["PERFILE", "MULTITHREADED", "COALESCING"]
fallback_readers_353_plus=["COALESCING"]
```
According to your previous comment, this test seems to have passed with COALESCING fallback in your previous testing. Why did it pass? Can you make sure that this test passes after your last update?

@razajafri

The test was passing earlier because FileSourceScanExec still falls back due to useMetadataRowIndex having a default value of true. Now it will be skipped for Spark 3.5.3+.

@jihoonson (Collaborator) left a comment

LGTM

@razajafri

build

@razajafri merged commit 01a5380 into NVIDIA:branch-25.10 Sep 26, 2025
60 checks passed
@razajafri deleted the SP-13402-multi-dv branch September 26, 2025 15:54
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Sep 30, 2025
…DIA#13491)

@sameerz added the feature request label Sep 30, 2025


Development

Successfully merging this pull request may close these issues:

- Use ColumnVector::sequence method to generate the RowIndices
- [FEA] Add basic Deletion Vector support for Multifile readers in Delta in Scala
