[KYUUBI #6943][1/2]HiveScan support dpp#7436
maomaodev wants to merge 2 commits into apache:master
Conversation
How can KSHC be faster than Vanilla Spark?
Thanks for the review! After investigation, KSHC's edge over vanilla Spark on TEXT-format Hive tables mainly comes from two partially overlapping factors, which together explain essentially all of the gap:
1. Split sizing. On the vanilla Spark side, with the Hadoop split defaults (minSize=1B, blockSize=128MB, numSplits=2), splitSize is only ~2MB, so each small file (1–4MB) is split into ~2 tasks and scheduling overhead dominates. On the KSHC side, with the file-scan defaults (maxPartitionBytes=128MB, openCostInBytes=4MB), each small file becomes at most one task.
2. KSHC reuses FileStatus via
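The two split-size formulas above can be sketched numerically. This is a hedged illustration, not Kyuubi or Spark code: the function names and the 4 MB sample file are hypothetical, and the formulas follow Hadoop's `FileInputFormat` and Spark's file-partition sizing as described in the defaults quoted above.

```python
# Illustrative sketch (not Kyuubi/Spark code) of the two split-size formulas
# discussed above. The file size and helper names are hypothetical.

MB = 1024 * 1024

def hadoop_split_size(total_size, num_splits=2, min_size=1, block_size=128 * MB):
    """Hadoop FileInputFormat-style sizing:
    splitSize = max(minSize, min(goalSize, blockSize))."""
    goal_size = total_size // max(1, num_splits)
    return max(min_size, min(goal_size, block_size))

def spark_max_split_bytes(total_bytes, num_files, parallelism=200,
                          max_partition_bytes=128 * MB, open_cost=4 * MB):
    """Spark file-scan-style sizing:
    min(maxPartitionBytes, max(openCostInBytes, bytesPerCore))."""
    bytes_per_core = (total_bytes + num_files * open_cost) // parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))

file_size = 4 * MB  # one small TEXT file

# Hadoop-split path: a 4 MB file gets a ~2 MB split size -> 2 tasks per file,
# so scheduling overhead dominates on many small files.
old = hadoop_split_size(file_size)
print(old // MB, "MB per split,", -(-file_size // old), "tasks")  # 2 MB per split, 2 tasks

# File-scan path: maxSplitBytes is at least openCostInBytes (4 MB), so the
# whole small file fits into a single split -> at most one task per file.
new = spark_max_split_bytes(file_size, num_files=1)
print("one task per file:", file_size <= new)  # one task per file: True
```

The contrast is what the answer above describes: the per-file goal size shrinks the Hadoop split to ~2 MB, while the file-scan floor of `openCostInBytes` keeps each small file in one task.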
Note on Orc/Parquet

On the Spark 3.3 CI failure
Why are the changes needed?
Part 1 of 2 to add KSHC support for dynamic partition pruning (DPP). See #6943.
- `HiveScan` for non-Parquet/ORC tables.
- `ParquetScan`/`ORCScan` for Parquet/ORC tables.

`spark.sql.kyuubi.hive.connector.read.runtimeFilter.enabled` is introduced by KSHC to control whether partition columns are exposed as runtime filter attributes, which is required for Spark DPP. The default value is true. To disable DPP on KSHC tables, set it to false.

How was this patch tested?
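For illustration, toggling the config described above in a SQL session might look like the following config fragment; only the config name and default come from this PR, the `SET` usage is the standard Spark SQL mechanism:

```sql
-- Disable DPP for KSHC tables for the current session (default is true)
SET spark.sql.kyuubi.hive.connector.read.runtimeFilter.enabled=false;
```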
KSHC Now provides a 7.25% (~200 s) speedup over KSHC Before, with no correctness regression.
DPP triggering was detected by matching `runtime partition filter` in the driver logs. On the DPP-hit subset, KSHC Now provides a 10.2% speedup over KSHC Before, noticeably larger than the overall 7.25%, indicating that the performance benefit mainly comes from queries where DPP is triggered.
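The log-based detection above can be sketched as a simple substring match. This is a hypothetical reconstruction of the benchmark tooling, not code from the PR; the function name and sample log line are invented for illustration.

```python
# Hypothetical sketch of detecting DPP hits by scanning driver logs for the
# "runtime partition filter" marker mentioned above. Not the actual harness.
import re

DPP_MARKER = re.compile(r"runtime partition filter", re.IGNORECASE)

def dpp_triggered(driver_log: str) -> bool:
    """Return True if any driver-log line mentions a runtime partition filter."""
    return any(DPP_MARKER.search(line) for line in driver_log.splitlines())

# Invented sample log lines for demonstration only.
sample_log = """\
INFO DAGScheduler: Job 3 finished
INFO HiveScan: applied runtime partition filter: part_col IN (...)
"""
print(dpp_triggered(sample_log))  # True
```

Queries whose logs match the marker form the "DPP-hit subset" compared in the benchmark numbers above.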
Was this patch authored or co-authored using generative AI tooling?
Partially assisted by Claude Code (Claude Opus 4.7) for unit tests, code-style fixes, and analysis of TPC-DS benchmark results. The core design and implementation are human-authored.