POC: Parquet predicate results cache #7760
🤖: Benchmark completed
Nice, no regressions?
Seems like the coalesce batches work is paying off!
d236925
to
025d411
Compare
I have also been thinking about memory usage, as that will be a major factor in whether we can cache the predicate results. I annotated Q22 with code to calculate the memory usage of the cached results:
Q22 (some of the best performance gains):
// Q22: SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID") FROM hits WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;
Query {
name: "Q22",
filter_columns: vec!["Title", "URL", "SearchPhrase"],
projection_columns: vec!["SearchPhrase", "URL", "Title", "UserID"],
predicates: vec![
ClickBenchPredicate::like_Google(0),
ClickBenchPredicate::nlike_google(1),
ClickBenchPredicate::not_empty(2),
],
expected_row_count: 46,
},
For hits_1.parquet, the data sizes are:
(I totally used @XiangpengHao's https://parquet-viewer.xiangpeng.systems/ for this analysis.)
I will try to add some additional debugging / annotation code to see what the peak memory usage was (and whether a 1MB limit would get triggered for any query).
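As a rough illustration of the kind of accounting described above, here is a minimal sketch of tracking cached bytes against a fixed limit and recording the peak. Everything in it is an assumption for illustration: `CacheBudget`, `estimated_string_column_bytes`, and the sample values are hypothetical names and data, not code from this PR, and the size estimate assumes a plain string layout of i32 offsets plus value bytes with no validity bitmap.

```rust
/// Hypothetical helper: approximate size of a cached string column,
/// assuming (n + 1) i32 offsets plus the concatenated value bytes.
fn estimated_string_column_bytes(values: &[&str]) -> usize {
    let offsets = (values.len() + 1) * std::mem::size_of::<i32>();
    let data: usize = values.iter().map(|v| v.len()).sum();
    offsets + data
}

/// Hypothetical accounting for cached predicate results with a hard limit.
struct CacheBudget {
    limit_bytes: usize,
    used_bytes: usize,
    peak_bytes: usize,
}

impl CacheBudget {
    fn new(limit_bytes: usize) -> Self {
        Self { limit_bytes, used_bytes: 0, peak_bytes: 0 }
    }

    /// Returns true if the bytes fit under the limit and were reserved.
    /// On false, a caller would skip caching and decode the column again.
    fn try_reserve(&mut self, bytes: usize) -> bool {
        match self.used_bytes.checked_add(bytes) {
            Some(total) if total <= self.limit_bytes => {
                self.used_bytes = total;
                self.peak_bytes = self.peak_bytes.max(total);
                true
            }
            _ => false,
        }
    }

    /// Gives bytes back once a cached batch has been consumed.
    fn release(&mut self, bytes: usize) {
        self.used_bytes = self.used_bytes.saturating_sub(bytes);
    }
}

fn main() {
    // 1MB limit, matching the limit floated in the comment above.
    let mut budget = CacheBudget::new(1024 * 1024);
    let titles = ["Google Maps", "Google News"]; // made-up sample values
    let bytes = estimated_string_column_bytes(&titles);
    let cached = budget.try_reserve(bytes);
    println!("bytes={bytes} cached={cached} peak={}", budget.peak_bytes);
    budget.release(bytes);
}
```

Tracking the peak separately from the current usage is what would answer the "would 1MB ever get triggered" question: run each query, then compare `peak_bytes` with the limit.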
Seems reasonable to me. I need to think about memory handling a bit more carefully now.
Which issue does this PR close?
TODO:
coalesce
)coalesece
kernel (BatchCoalescer) #7761
Rationale for this change
I am working on not decoding predicate columns twice when evaluating filters in the reader.
In #7513 we prototyped several APIs that we have now started making real (like BatchCoalescer), so I made a new PR that uses those APIs and doesn't have hundreds of comments.
I am pleased with how it is looking now. As before, I don't really plan to merge this PR as is; I am using it as a design vehicle.
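To make the "decode once" idea concrete, here is a minimal sketch, not the implementation in this PR. It assumes a hypothetical `DecodedColumnCache` keyed by column name and uses plain `Vec<String>` columns in place of Arrow arrays. The filter pass decodes a column and caches it, and the projection pass reuses the cached copy instead of decoding again.

```rust
use std::collections::HashMap;

/// Hypothetical cache of decoded columns, keyed by column name.
struct DecodedColumnCache {
    columns: HashMap<String, Vec<String>>,
    /// Counts how many times a real decode was performed.
    decode_calls: usize,
}

impl DecodedColumnCache {
    fn new() -> Self {
        Self { columns: HashMap::new(), decode_calls: 0 }
    }

    /// Returns the cached column, running `decode` only on a cache miss.
    fn get_or_decode<F>(&mut self, name: &str, decode: F) -> &[String]
    where
        F: FnOnce() -> Vec<String>,
    {
        if !self.columns.contains_key(name) {
            self.decode_calls += 1;
            self.columns.insert(name.to_string(), decode());
        }
        &self.columns[name]
    }
}

fn main() {
    let mut cache = DecodedColumnCache::new();
    // Stand-in for decoding a Parquet column chunk (made-up values).
    let decode_title = || vec!["Google Maps".to_string(), "Weather".to_string()];

    // Filter pass: evaluate the predicate on the decoded column.
    let mask: Vec<bool> = cache
        .get_or_decode("Title", decode_title)
        .iter()
        .map(|t| t.contains("Google"))
        .collect();

    // Projection pass: the same column is needed again, served from the cache.
    let selected: Vec<String> = cache
        .get_or_decode("Title", decode_title)
        .iter()
        .zip(&mask)
        .filter(|(_, keep)| **keep)
        .map(|(t, _)| t.clone())
        .collect();

    println!("selected={selected:?} decode_calls={}", cache.decode_calls);
}
```

The trade-off this sketch hides is memory: keeping decoded columns alive between the two passes is what makes the cache size a concern for wide string columns.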
What changes are included in this PR?
Are there any user-facing changes?
Not yet,