This repository was archived by the owner on May 9, 2024. It is now read-only.
Add TableStats to TableFragmentsInfo. #559
Merged
Chunk metadata is accessed at every work unit execution, and in many cases its stats are used. The main consumer of those stats, however, is range info requests for column references, and to compute a range we simply merge all chunk stats.
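To illustrate what "merging chunk stats" means for a range request, here is a minimal Python sketch. It is not the project's code: `ColumnStats`, its fields, and `merge_stats` are hypothetical names chosen for the illustration, and real chunk stats may carry more information.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical per-column stats: value range plus a null flag."""
    min_val: int
    max_val: int
    has_nulls: bool = False


def merge_stats(chunks: Iterable[ColumnStats]) -> Optional[ColumnStats]:
    """Fold per-chunk stats into a single range covering all chunks."""
    merged: Optional[ColumnStats] = None
    for stats in chunks:
        if merged is None:
            merged = stats
        else:
            merged = ColumnStats(
                min(merged.min_val, stats.min_val),
                max(merged.max_val, stats.max_val),
                merged.has_nulls or stats.has_nulls,
            )
    return merged  # None when there are no chunks at all


# Stats of one column across three chunks (made-up values).
chunk_stats = [ColumnStats(5, 20), ColumnStats(-3, 7, True), ColumnStats(10, 42)]
column_range = merge_stats(chunk_stats)
```

The merge touches every chunk's metadata, and the cost is repeated for each column whose range is requested.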
This patch introduces TableStats, which can be used to compute column range info. Table stats are initially computed from chunk stats, but in some cases we can avoid that by propagating the input table's stats to the resulting table. E.g. simple projection, sort, and shuffle do not modify table stats, so the stats can simply be assigned to the result. This way we can avoid computing chunk stats, whose only direct use is skipping fragments during filter execution.
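The propagation idea can be sketched as follows. This is an illustrative Python model under stated assumptions, not the actual implementation: `Table`, `compute_stats`, `project`, and `sort_by` are hypothetical, and stats are reduced to a per-column `(min, max)` pair.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Range = Tuple[int, int]


@dataclass
class Table:
    """Hypothetical in-memory table with optional per-column table stats."""
    columns: Dict[str, List[int]]
    table_stats: Dict[str, Range] = field(default_factory=dict)


def compute_stats(columns: Dict[str, List[int]]) -> Dict[str, Range]:
    """The expensive path: scan the data to get a (min, max) per column."""
    return {name: (min(vals), max(vals)) for name, vals in columns.items() if vals}


def project(table: Table, names: List[str]) -> Table:
    """Projection keeps column values as-is, so stats of kept columns carry over."""
    return Table(
        {n: table.columns[n] for n in names},
        {n: table.table_stats[n] for n in names if n in table.table_stats},
    )


def sort_by(table: Table, key: str) -> Table:
    """Sorting only reorders rows, so every column's range is unchanged."""
    order = sorted(range(len(table.columns[key])), key=lambda i: table.columns[key][i])
    cols = {n: [vals[i] for i in order] for n, vals in table.columns.items()}
    return Table(cols, dict(table.table_stats))


src = Table({"a": [3, 1, 2], "b": [10, 30, 20]})
src.table_stats = compute_stats(src.columns)  # computed once, on the input
result = sort_by(project(src, ["a"]), "a")    # no stats recomputation here
```

In this model neither `project` nor `sort_by` ever scans the data for stats. The result simply inherits what the input already knew.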
This PR only introduces the new stats and doesn't use them yet. It also adds checks to the ExpressionRange computation to make sure the new stats yield the same range as the existing ones. A follow-up PR will stop using chunk stats in range computations and propagate table stats through shuffling to avoid stats computation for partitioned data.
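The kind of consistency check described above might look roughly like the sketch below. It is written in Python for brevity and is only an assumption about the shape of such a check: `range_from_chunks` and `checked_column_range` are hypothetical helpers, not functions from this PR.

```python
from typing import List, Tuple

Range = Tuple[int, int]


def range_from_chunks(chunk_ranges: List[Range]) -> Range:
    """Existing path: merge the (min, max) pairs of all chunks."""
    return (
        min(lo for lo, _ in chunk_ranges),
        max(hi for _, hi in chunk_ranges),
    )


def checked_column_range(chunk_ranges: List[Range], table_range: Range) -> Range:
    """Return the chunk-derived range, failing loudly if table stats disagree."""
    expected = range_from_chunks(chunk_ranges)
    if table_range != expected:
        raise AssertionError(
            f"table stats range {table_range} != chunk stats range {expected}"
        )
    return expected
```

While both paths coexist, a check like this lets the old result stay authoritative and exposes any divergence in the new stats before they are relied on.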