Compute Limit bugfix #123
Merged
Conversation
colin-ho added a commit that referenced this pull request on Mar 20, 2025
colin-ho added a commit that referenced this pull request on Mar 20, 2025

Fix some typos and missing info in the S3 Tables docs.
colin-ho added a commit that referenced this pull request on Mar 21, 2025

Track imports on Scarf.
colin-ho added a commit that referenced this pull request on Mar 21, 2025
colin-ho added a commit that referenced this pull request on Mar 24, 2025

…ad of endswith (#4046)

Fix the runner switching test. Pass the assertion as long as `cannot set runner more than once` appears anywhere in the error message, rather than requiring the message to end with it, since the message may contain additional text.
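For illustration, a minimal sketch of what the relaxed assertion could look like (the test name, the runner-setting calls, and the exception type are assumptions; only the error substring comes from the description above):

```python
import pytest
import daft

def test_cannot_switch_runner():
    daft.context.set_runner_native()  # assumed first runner choice, for illustration
    with pytest.raises(Exception) as exc_info:
        daft.context.set_runner_ray()  # switching again should fail
    # Check for the substring anywhere in the message instead of using
    # endswith, so extra trailing context does not break the test.
    assert "cannot set runner more than once" in str(exc_info.value)
```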
kevinzwang added a commit that referenced this pull request on Mar 25, 2025

Clean up the PR template in various ways:
- Remove the summary section (I felt that there was a large amount of redundancy between the PR title, summary, and changes made sections)
- Remove JIRA mention in changes made since we do not use that
- Remove "All tests have passed" from the checklist since that is already a strict requirement for merging
- Add "(if applicable)" to some checklist items
colin-ho added a commit that referenced this pull request on Mar 25, 2025

Remove the Ray compatibility workflow, as it has been failing for a while.
colin-ho added a commit that referenced this pull request on Mar 28, 2025

Adds overwrite mode to Spark Connect for writing CSV and Parquet.

Co-authored-by: universalmind303 <[email protected]>
colin-ho added a commit that referenced this pull request on Mar 31, 2025

Adds docs in Core Concepts for cross-column expressions.
colin-ho added a commit that referenced this pull request on Mar 31, 2025

Add a section called "Numbering Rows" that explains how to add IDs to rows with a monotonically increasing ID.
colin-ho added a commit that referenced this pull request on Mar 31, 2025

Use the IO runtime for the task in the remote Parquet reader. Fixes a deadlock.
colin-ho added a commit that referenced this pull request on Mar 31, 2025

Disable scan task merging for WARCs.

Co-authored-by: Desmond Cheong <[email protected]>
desmondcheongzx pushed a commit that referenced this pull request on Apr 1, 2025

Quick refactor of some of the logic that previously lived in PushDownFilter into its own rule. It was originally implemented in PushDownFilter because it could have interactions with the other filter pushdown logic. However, we should try to keep PushDownFilter as simple as we can, just for pushing filter nodes through a plan. Thus, this moves the logic into a new rule and uses the repeated execution of the rule batch to deal with interactions between the two rules.

Related Issues: related to why we closed #4118.
desmondcheongzx added a commit that referenced this pull request on Apr 1, 2025

This PR adds an optimizer rule that pushes down join predicates. This will fix our TPC-H benchmarks, since Q13, which is technically a non-equi join, can be optimized into an equi-join.

Co-authored-by: desmondcheongzx <[email protected]>
kevinzwang added a commit that referenced this pull request on Apr 2, 2025
## Changes Made
Added an optimizer rule `PushDownAntiSemiJoin`, which pushes anti and semi joins through inner joins, projects, filters, sorts, and distincts.
I decided to do this instead of adding anti and semi joins to our brute force join reorderer because it is simpler and will largely produce the same results, especially with TPC-H.
More specifically, join graph creation makes some assumptions about the uniqueness of input column names that I'm not sure will hold for anti and semi joins, since their columns are not preserved in the output. I was not confident in my ability to modify the join reordering algorithm to properly fix this within a reasonable time. We will want to implement a more efficient reordering algorithm in the future anyway, so I do not want to spend a significant effort on patching the current one up.
### TPC-H Q18
#### Results
| Benchmark | Before | After | Improvement |
|--------|--------|--------|--------|
| Local (SF 100) | 56.79s | 10.38s | 5.47x |
| Distributed (SF 1000) | 285.4s | 66.12s | 4.32x |
#### Optimized Plan (Before)
```mermaid
flowchart TD
subgraph optimized["Optimized LogicalPlan"]
direction TB
optimizedLimit0["Limit: 100
Stats = { Approx num rows = 100, Approx size bytes = 1.56 KiB, Accumulated
selectivity = 0.00 }"]
optimizedSort1["Sort: Sort by = (col(o_totalprice), descending, nulls first), (col(o_orderdate),
ascending, nulls last)
Stats = { Approx num rows = 5,665,935, Approx size bytes = 86.46 MiB,
Accumulated selectivity = 0.12 }"]
optimizedAggregate2["Aggregation: sum(col(l_quantity))
Group by = col(c_name), col(c_custkey), col(o_orderkey), col(o_orderdate),
col(o_totalprice)
Output schema = c_name#Utf8, c_custkey#Int64, o_orderkey#Int64,
o_orderdate#Date, o_totalprice#Float64, l_quantity#Int64
Stats = { Approx num rows = 5,665,935, Approx size bytes = 86.46 MiB,
Accumulated selectivity = 0.12 }"]
optimizedProject3["Project: col(l_quantity), col(c_name), col(c_custkey), col(o_orderkey),
col(o_orderdate), col(o_totalprice)
Stats = { Approx num rows = 7,082,419, Approx size bytes = 109.76 MiB,
Accumulated selectivity = 0.14 }"]
optimizedJoin4["Join: Type = Semi
Strategy = Auto
On = col(left.o_orderkey#Int64) == col(right.l_orderkey#Int64)
Output schema = o_orderkey#Int64, c_custkey#Int64, c_name#Utf8,
o_totalprice#Float64, o_orderdate#Date, l_quantity#Int64
Stats = { Approx num rows = 7,082,419, Approx size bytes = 109.76 MiB,
Accumulated selectivity = 0.14 }"]
optimizedProject5["Project: col(o_orderkey), col(c_custkey), col(c_name), col(o_totalprice),
col(o_orderdate), col(l_quantity)
Stats = { Approx num rows = 46,594,852, Approx size bytes = 722.09 MiB,
Accumulated selectivity = 0.95 }"]
optimizedJoin6["Join: Type = Inner
Strategy = Auto
On = col(left.o_orderkey#Int64) == col(right.l_orderkey#Int64)
Output schema = c_custkey#Int64, c_name#Utf8, o_orderkey#Int64,
o_totalprice#Float64, o_orderdate#Date, l_orderkey#Int64, l_quantity#Int64
Stats = { Approx num rows = 46,594,852, Approx size bytes = 722.09 MiB,
Accumulated selectivity = 0.95 }"]
optimizedProject7["Project: col(c_custkey), col(c_name), col(o_orderkey), col(o_totalprice),
col(o_orderdate)
Stats = { Approx num rows = 18,273,257, Approx size bytes = 496.66 MiB,
Accumulated selectivity = 0.95 }"]
optimizedJoin8["Join: Type = Inner
Strategy = Auto
On = col(left.c_custkey#Int64) == col(right.o_custkey#Int64)
Output schema = c_custkey#Int64, c_name#Utf8, o_orderkey#Int64, o_custkey#Int64,
o_totalprice#Float64, o_orderdate#Date
Stats = { Approx num rows = 18,273,257, Approx size bytes = 496.66 MiB,
Accumulated selectivity = 0.95 }"]
optimizedProject9["Project: col(C_CUSTKEY) as c_custkey, col(C_NAME) as c_name
Stats = { Approx num rows = 3,308,528, Approx size bytes = 89.14 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource10["Num Scan Tasks = 32
File schema = C_CUSTKEY#Int64, C_NAME#Utf8, C_ADDRESS#Utf8, C_NATIONKEY#Int64,
C_PHONE#Utf8, C_ACCTBAL#Float64, C_MKTSEGMENT#Utf8, C_COMMENT#Utf8
Partitioning keys = []
Projection pushdown = [C_CUSTKEY, C_NAME]
Output schema = C_CUSTKEY#Int64, C_NAME#Utf8
Stats = { Approx num rows = 3,308,528, Approx size bytes = 89.14 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource10 --> optimizedProject9
optimizedProject9 --> optimizedJoin8
optimizedProject11["Project: col(O_ORDERKEY) as o_orderkey, col(O_CUSTKEY) as o_custkey,
col(O_TOTALPRICE) as o_totalprice, col(O_ORDERDATE) as o_orderdate
Stats = { Approx num rows = 18,273,257, Approx size bytes = 496.66 MiB,
Accumulated selectivity = 0.95 }"]
optimizedSource12["Num Scan Tasks = 32
File schema = O_ORDERKEY#Int64, O_CUSTKEY#Int64, O_ORDERSTATUS#Utf8,
O_TOTALPRICE#Float64, O_ORDERDATE#Date, O_ORDERPRIORITY#Utf8, O_CLERK#Utf8,
O_SHIPPRIORITY#Int64, O_COMMENT#Utf8
Partitioning keys = []
Projection pushdown = [O_ORDERKEY, O_CUSTKEY, O_TOTALPRICE, O_ORDERDATE]
Filter pushdown = not(is_null(col(O_ORDERKEY)))
Output schema = O_ORDERKEY#Int64, O_CUSTKEY#Int64, O_TOTALPRICE#Float64,
O_ORDERDATE#Date
Stats = { Approx num rows = 18,273,257, Approx size bytes = 496.66 MiB,
Accumulated selectivity = 0.95 }"]
optimizedSource12 --> optimizedProject11
optimizedProject11 --> optimizedJoin8
optimizedJoin8 --> optimizedProject7
optimizedProject7 --> optimizedJoin6
optimizedProject13["Project: col(L_ORDERKEY) as l_orderkey, col(L_QUANTITY) as l_quantity
Stats = { Approx num rows = 49,047,212, Approx size bytes = 760.10 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource14["Num Scan Tasks = 32
File schema = L_ORDERKEY#Int64, L_PARTKEY#Int64, L_SUPPKEY#Int64,
L_LINENUMBER#Int64, L_QUANTITY#Int64, L_EXTENDEDPRICE#Float64,
L_DISCOUNT#Float64, L_TAX#Float64, L_RETURNFLAG#Utf8, L_LINESTATUS#Utf8,
L_SHIPDATE#Date, L_COMMITDATE#Date, L_RECEIPTDATE#Date, L_SHIPINSTRUCT#Utf8,
L_SHIPMODE#Utf8, L_COMMENT#Utf8
Partitioning keys = []
Projection pushdown = [L_ORDERKEY, L_QUANTITY]
Output schema = L_ORDERKEY#Int64, L_QUANTITY#Int64
Stats = { Approx num rows = 49,047,212, Approx size bytes = 760.10 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource14 --> optimizedProject13
optimizedProject13 --> optimizedJoin6
optimizedJoin6 --> optimizedProject5
optimizedProject5 --> optimizedJoin4
optimizedProject15["Project: col(l_orderkey)
Stats = { Approx num rows = 7,455,177, Approx size bytes = 113.76 MiB,
Accumulated selectivity = 0.15 }"]
optimizedFilter16["Filter: col((l_quantity.local_sum() > Literal(Int64(300))))
Stats = { Approx num rows = 7,455,177, Approx size bytes = 113.76 MiB,
Accumulated selectivity = 0.15 }"]
optimizedProject17["Project: col(l_orderkey), col(l_quantity.local_sum()) > lit(300) as
(l_quantity.local_sum() > Literal(Int64(300)))
Stats = { Approx num rows = 37,275,881, Approx size bytes = 568.78 MiB,
Accumulated selectivity = 0.76 }"]
optimizedFilter18["Filter: not(is_null(col(l_orderkey)))
Stats = { Approx num rows = 37,275,881, Approx size bytes = 568.78 MiB,
Accumulated selectivity = 0.76 }"]
optimizedAggregate19["Aggregation: sum(col(l_quantity)) as l_quantity.local_sum()
Group by = col(l_orderkey)
Output schema = l_orderkey#Int64, l_quantity.local_sum()#Int64
Stats = { Approx num rows = 39,237,769, Approx size bytes = 598.72 MiB,
Accumulated selectivity = 0.80 }"]
optimizedProject20["Project: col(L_QUANTITY) as l_quantity, col(L_ORDERKEY) as l_orderkey
Stats = { Approx num rows = 49,047,212, Approx size bytes = 760.10 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource21["Num Scan Tasks = 32
File schema = L_ORDERKEY#Int64, L_PARTKEY#Int64, L_SUPPKEY#Int64,
L_LINENUMBER#Int64, L_QUANTITY#Int64, L_EXTENDEDPRICE#Float64,
L_DISCOUNT#Float64, L_TAX#Float64, L_RETURNFLAG#Utf8, L_LINESTATUS#Utf8,
L_SHIPDATE#Date, L_COMMITDATE#Date, L_RECEIPTDATE#Date, L_SHIPINSTRUCT#Utf8,
L_SHIPMODE#Utf8, L_COMMENT#Utf8
Partitioning keys = []
Projection pushdown = [L_QUANTITY, L_ORDERKEY]
Output schema = L_ORDERKEY#Int64, L_QUANTITY#Int64
Stats = { Approx num rows = 49,047,212, Approx size bytes = 760.10 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource21 --> optimizedProject20
optimizedProject20 --> optimizedAggregate19
optimizedAggregate19 --> optimizedFilter18
optimizedFilter18 --> optimizedProject17
optimizedProject17 --> optimizedFilter16
optimizedFilter16 --> optimizedProject15
optimizedProject15 --> optimizedJoin4
optimizedJoin4 --> optimizedProject3
optimizedProject3 --> optimizedAggregate2
optimizedAggregate2 --> optimizedSort1
optimizedSort1 --> optimizedLimit0
end
```
#### Optimized Plan (After)
```mermaid
flowchart TD
subgraph optimized["Optimized LogicalPlan"]
direction TB
optimizedLimit0["Limit: 100
Stats = { Approx num rows = 100, Approx size bytes = 1.56 KiB, Accumulated
selectivity = 0.00 }"]
optimizedSort1["Sort: Sort by = (col(o_totalprice), descending, nulls first), (col(o_orderdate),
ascending, nulls last)
Stats = { Approx num rows = 5,665,935, Approx size bytes = 86.46 MiB,
Accumulated selectivity = 0.12 }"]
optimizedAggregate2["Aggregation: sum(col(l_quantity))
Group by = col(c_name), col(c_custkey), col(o_orderkey), col(o_orderdate),
col(o_totalprice)
Output schema = c_name#Utf8, c_custkey#Int64, o_orderkey#Int64,
o_orderdate#Date, o_totalprice#Float64, l_quantity#Int64
Stats = { Approx num rows = 5,665,935, Approx size bytes = 86.46 MiB,
Accumulated selectivity = 0.12 }"]
optimizedProject3["Project: col(l_quantity), col(c_name), col(c_custkey), col(o_orderkey),
col(o_orderdate), col(o_totalprice)
Stats = { Approx num rows = 7,082,419, Approx size bytes = 109.76 MiB,
Accumulated selectivity = 0.14 }"]
optimizedProject4["Project: col(c_custkey), col(c_name), col(o_orderkey), col(o_totalprice),
col(o_orderdate), col(l_orderkey), col(l_quantity)
Stats = { Approx num rows = 7,082,419, Approx size bytes = 109.76 MiB,
Accumulated selectivity = 0.14 }"]
optimizedJoin5["Join: Type = Inner
Strategy = Auto
On = col(left.c_custkey#Int64) == col(right.o_custkey#Int64)
Output schema = c_custkey#Int64, c_name#Utf8, o_orderkey#Int64, o_custkey#Int64,
o_totalprice#Float64, o_orderdate#Date, l_orderkey#Int64, l_quantity#Int64
Stats = { Approx num rows = 7,082,419, Approx size bytes = 109.76 MiB,
Accumulated selectivity = 0.14 }"]
optimizedProject6["Project: col(C_CUSTKEY) as c_custkey, col(C_NAME) as c_name
Stats = { Approx num rows = 3,308,528, Approx size bytes = 89.14 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource7["Num Scan Tasks = 32
File schema = C_CUSTKEY#Int64, C_NAME#Utf8, C_ADDRESS#Utf8, C_NATIONKEY#Int64,
C_PHONE#Utf8, C_ACCTBAL#Float64, C_MKTSEGMENT#Utf8, C_COMMENT#Utf8
Partitioning keys = []
Projection pushdown = [C_CUSTKEY, C_NAME]
Output schema = C_CUSTKEY#Int64, C_NAME#Utf8
Stats = { Approx num rows = 3,308,528, Approx size bytes = 89.14 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource7 --> optimizedProject6
optimizedProject6 --> optimizedJoin5
optimizedJoin8["Join: Type = Inner
Strategy = Auto
On = col(left.o_orderkey#Int64) == col(right.l_orderkey#Int64)
Output schema = o_orderkey#Int64, o_custkey#Int64, o_totalprice#Float64,
o_orderdate#Date, l_orderkey#Int64, l_quantity#Int64
Stats = { Approx num rows = 7,082,419, Approx size bytes = 109.76 MiB,
Accumulated selectivity = 0.14 }"]
optimizedProject9["Project: col(O_ORDERKEY) as o_orderkey, col(O_CUSTKEY) as o_custkey,
col(O_TOTALPRICE) as o_totalprice, col(O_ORDERDATE) as o_orderdate
Stats = { Approx num rows = 7,082,419, Approx size bytes = 108.07 MiB,
Accumulated selectivity = 0.14 }"]
optimizedJoin10["Join: Type = Semi
Strategy = Auto
On = col(left.O_ORDERKEY#Int64) == col(right.l_orderkey#Int64)
Output schema = O_ORDERKEY#Int64, O_CUSTKEY#Int64, O_TOTALPRICE#Float64,
O_ORDERDATE#Date
Stats = { Approx num rows = 7,082,419, Approx size bytes = 108.07 MiB,
Accumulated selectivity = 0.14 }"]
optimizedSource11["Num Scan Tasks = 32
File schema = O_ORDERKEY#Int64, O_CUSTKEY#Int64, O_ORDERSTATUS#Utf8,
O_TOTALPRICE#Float64, O_ORDERDATE#Date, O_ORDERPRIORITY#Utf8, O_CLERK#Utf8,
O_SHIPPRIORITY#Int64, O_COMMENT#Utf8
Partitioning keys = []
Projection pushdown = [O_ORDERKEY, O_CUSTKEY, O_TOTALPRICE, O_ORDERDATE]
Filter pushdown = not(is_null(col(O_ORDERKEY)))
Output schema = O_ORDERKEY#Int64, O_CUSTKEY#Int64, O_TOTALPRICE#Float64,
O_ORDERDATE#Date
Stats = { Approx num rows = 18,273,257, Approx size bytes = 496.66 MiB,
Accumulated selectivity = 0.95 }"]
optimizedSource11 --> optimizedJoin10
optimizedProject12["Project: col(l_orderkey)
Stats = { Approx num rows = 7,455,177, Approx size bytes = 113.76 MiB,
Accumulated selectivity = 0.15 }"]
optimizedFilter13["Filter: col((l_quantity.local_sum() > Literal(Int64(300))))
Stats = { Approx num rows = 7,455,177, Approx size bytes = 113.76 MiB,
Accumulated selectivity = 0.15 }"]
optimizedProject14["Project: col(l_orderkey), col(l_quantity.local_sum()) > lit(300) as
(l_quantity.local_sum() > Literal(Int64(300)))
Stats = { Approx num rows = 37,275,881, Approx size bytes = 568.78 MiB,
Accumulated selectivity = 0.76 }"]
optimizedFilter15["Filter: not(is_null(col(l_orderkey)))
Stats = { Approx num rows = 37,275,881, Approx size bytes = 568.78 MiB,
Accumulated selectivity = 0.76 }"]
optimizedAggregate16["Aggregation: sum(col(l_quantity)) as l_quantity.local_sum()
Group by = col(l_orderkey)
Output schema = l_orderkey#Int64, l_quantity.local_sum()#Int64
Stats = { Approx num rows = 39,237,769, Approx size bytes = 598.72 MiB,
Accumulated selectivity = 0.80 }"]
optimizedProject17["Project: col(L_QUANTITY) as l_quantity, col(L_ORDERKEY) as l_orderkey
Stats = { Approx num rows = 49,047,212, Approx size bytes = 760.10 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource18["Num Scan Tasks = 32
File schema = L_ORDERKEY#Int64, L_PARTKEY#Int64, L_SUPPKEY#Int64,
L_LINENUMBER#Int64, L_QUANTITY#Int64, L_EXTENDEDPRICE#Float64,
L_DISCOUNT#Float64, L_TAX#Float64, L_RETURNFLAG#Utf8, L_LINESTATUS#Utf8,
L_SHIPDATE#Date, L_COMMITDATE#Date, L_RECEIPTDATE#Date, L_SHIPINSTRUCT#Utf8,
L_SHIPMODE#Utf8, L_COMMENT#Utf8
Partitioning keys = []
Projection pushdown = [L_QUANTITY, L_ORDERKEY]
Output schema = L_ORDERKEY#Int64, L_QUANTITY#Int64
Stats = { Approx num rows = 49,047,212, Approx size bytes = 760.10 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource18 --> optimizedProject17
optimizedProject17 --> optimizedAggregate16
optimizedAggregate16 --> optimizedFilter15
optimizedFilter15 --> optimizedProject14
optimizedProject14 --> optimizedFilter13
optimizedFilter13 --> optimizedProject12
optimizedProject12 --> optimizedJoin10
optimizedJoin10 --> optimizedProject9
optimizedProject9 --> optimizedJoin8
optimizedProject19["Project: col(L_ORDERKEY) as l_orderkey, col(L_QUANTITY) as l_quantity
Stats = { Approx num rows = 49,047,212, Approx size bytes = 760.10 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource20["Num Scan Tasks = 32
File schema = L_ORDERKEY#Int64, L_PARTKEY#Int64, L_SUPPKEY#Int64,
L_LINENUMBER#Int64, L_QUANTITY#Int64, L_EXTENDEDPRICE#Float64,
L_DISCOUNT#Float64, L_TAX#Float64, L_RETURNFLAG#Utf8, L_LINESTATUS#Utf8,
L_SHIPDATE#Date, L_COMMITDATE#Date, L_RECEIPTDATE#Date, L_SHIPINSTRUCT#Utf8,
L_SHIPMODE#Utf8, L_COMMENT#Utf8
Partitioning keys = []
Projection pushdown = [L_ORDERKEY, L_QUANTITY]
Output schema = L_ORDERKEY#Int64, L_QUANTITY#Int64
Stats = { Approx num rows = 49,047,212, Approx size bytes = 760.10 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource20 --> optimizedProject19
optimizedProject19 --> optimizedJoin8
optimizedJoin8 --> optimizedJoin5
optimizedJoin5 --> optimizedProject4
optimizedProject4 --> optimizedProject3
optimizedProject3 --> optimizedAggregate2
optimizedAggregate2 --> optimizedSort1
optimizedSort1 --> optimizedLimit0
end
```
colin-ho added a commit that referenced this pull request on Apr 3, 2025

Use the native runner for doctests. Modify code examples to add a sort where needed (e.g. for groupbys), because swordfish does not have ordering guarantees without an order by. Also, don't run the doctests for monotonically increasing id, because it requires partitioning.
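As a rough illustration of the kind of change made to the examples (a sketch, not one of the actual doctests), a groupby result gets an explicit sort so its output order is deterministic on the native runner:

```python
import daft

df = daft.from_pydict({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
# Without the sort, swordfish may return groups in any order, so the
# doctest output would be flaky; the sort pins the row order.
result = df.groupby("group").agg(daft.col("value").sum()).sort("group")
result.show()
```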
f4t4nt pushed a commit that referenced this pull request on Apr 3, 2025

Use the native runner for doctests. Modify code examples to add a sort where needed (e.g. for groupbys), because swordfish does not have ordering guarantees without an order by. Also, don't run the doctests for monotonically increasing id, because it requires partitioning.
colin-ho added a commit that referenced this pull request on Apr 4, 2025

Expose an async shuffle cache to Python, remove the use of a Python threadpool in the shuffle actor, and spawn partitioner tasks that connect directly to writer tasks.

### How it works

Flight shuffle works by first partitioning and saving all outputs of the map pipeline to disk in Arrow IPC stream format, then reading the files back and sending them to the reducer via Arrow Flight RPC. We use one actor per node to do the partitioning and saving. Each actor has a `ShuffleCache` in Rust which is responsible for partitioning and saving.

Previously, we used Python threads via the concurrent.futures threadpool executor to concurrently push partitions into the cache, which allowed partitioning to happen in parallel. But we shouldn't need a Python threadpool; we can use Tokio runtimes in Rust instead. This reduces reliance on Python, reduces the number of threads, and provides a smoother transition to using swordfish with the shuffle cache.

To do this, we spawn `num_cpu` partitioner tasks and `num_partition` writer tasks on a single-threaded runtime; the tasks then spawn compute on the compute runtime when needed. This is also how swordfish works. We then use https://docs.rs/pyo3-async-runtimes/latest/pyo3_async_runtimes/ to expose an async bridge to Python. Since Ray has async actors, this makes it easy to use async in Python as well.

### Results

| Question | Before (s) | After (s) | Difference (s) | % Change |
| -- | -- | -- | -- | -- |
| Q1 | 31.73 | 26.63 | -5.10 | -16.07% |
| Q2 | 36.40 | 31.26 | -5.14 | -14.12% |
| Q3 | 40.50 | 39.58 | -0.92 | -2.27% |
| Q4 | 27.58 | 22.90 | -4.68 | -16.97% |
| Q5 | 64.22 | 63.06 | -1.16 | -1.81% |
| Q6 | 9.77 | 9.55 | -0.22 | -2.25% |
| Q7 | 44.56 | 45.43 | 0.87 | 1.95% |
| Q8 | 96.06 | 96.61 | 0.55 | 0.57% |
| Q9 | 120.09 | 117.40 | -2.69 | -2.24% |
| Q10 | 36.68 | 37.49 | 0.81 | 2.21% |
| Q11 | 29.91 | 33.55 | 3.64 | 12.17% |
| Q12 | 25.21 | 25.51 | 0.30 | 1.19% |
| Q13 | 36.05 | 33.92 | -2.13 | -5.91% |
| Q14 | 17.68 | 17.12 | -0.56 | -3.17% |
| Q15 | 33.26 | 33.27 | 0.01 | 0.03% |
| Q16 | 19.11 | 20.84 | 1.73 | 9.05% |
| Q17 | 120.98 | 118.21 | -2.77 | -2.29% |
| Q18 | 64.95 | 62.89 | -2.06 | -3.17% |
| Q19 | 24.38 | 24.01 | -0.37 | -1.52% |
| Q20 | 51.75 | 53.28 | 1.53 | 2.96% |
| Q21 | 142.00 | 140.75 | -1.25 | -0.88% |
| Q22 | 13.32 | 13.63 | 0.31 | 2.32% |
| Total | 1086.19 | 1066.38 | -19.81 | -1.85% |
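A rough sketch of the Python side of this pattern, with hypothetical names (only the use of Ray async actors and a pyo3-async-runtimes bridge is taken from the description above):

```python
import ray

@ray.remote
class ShuffleCacheActorSketch:
    """Illustrative only: stands in for the per-node shuffle actor."""

    def __init__(self, cache):
        # `cache` stands in for the Rust ShuffleCache exposed to Python;
        # with pyo3-async-runtimes its methods would return awaitables.
        self._cache = cache

    async def push_partitions(self, micropartition):
        # Ray async actors can await the bridge directly, so no Python
        # threadpool is needed to push partitions into the cache concurrently.
        await self._cache.push_partitions(micropartition)
```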
universalmind303 added a commit that referenced this pull request on Apr 7, 2025
universalmind303 added a commit that referenced this pull request on Apr 8, 2025

Fix the TimeUnit repr format; the doctest was failing because it didn't match the repr.
colin-ho added a commit that referenced this pull request on Apr 9, 2025

Mypy wasn't running. This removes the `daft_dashboard` arg (which is fine because it doesn't exist), and mypy runs now.

Co-authored-by: R. Conner Howell <[email protected]>
rchowell pushed a commit that referenced this pull request on Apr 9, 2025

Expose an async shuffle cache to Python, remove the use of a Python threadpool in the shuffle actor, and spawn partitioner tasks that connect directly to writer tasks.
rchowell pushed a commit that referenced this pull request on Apr 9, 2025
rchowell pushed a commit that referenced this pull request on Apr 9, 2025

Fix the TimeUnit repr format; the doctest was failing because it didn't match the repr.
rchowell added a commit that referenced this pull request on Apr 9, 2025

Mypy wasn't running. This removes the `daft_dashboard` arg (which is fine because it doesn't exist), and mypy runs now.

Co-authored-by: R. Conner Howell <[email protected]>
colin-ho added a commit that referenced this pull request on Apr 10, 2025

Replace the Python runner with the native runner in the bug report.
colin-ho pushed a commit that referenced this pull request on Dec 15, 2025

Support splitting and merging JSONL/NDJSON files; only JSONL and NDJSON files are supported for now. Unsupported: 1. JSON files. 2. Compressed files are not supported yet.

Related Issues: #5615

Co-authored-by: cancai <[email protected]>
kevinzwang added a commit that referenced this pull request on Dec 15, 2025

Add an option to ignore deletion vectors on a Delta Lake table read instead of failing.
colin-ho added a commit that referenced this pull request on Dec 15, 2025

Before this, if there were empty JSON/JSONL files in a path, `read_json` would fail. This adds a new parameter named `skip_empty_files`: when there are empty JSON/JSONL files, users can decide whether to skip the empty files and continue execution.

Related Issues: #5646

Co-authored-by: cancai <[email protected]>
Co-authored-by: Colin Ho <[email protected]>
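A hedged usage sketch (the exact call shape and path are assumptions for illustration; only the `skip_empty_files` parameter name comes from the description):

```python
import daft

# Skip empty JSON/JSONL files instead of failing the whole read.
df = daft.read_json("data/*.jsonl", skip_empty_files=True)
df.show()
```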
universalmind303 added a commit that referenced this pull request on Dec 16, 2025

Adds a `values` function on primitive arrays to get a `ScalarBuffer`. This allows us to iterate over only the values.
colin-ho added a commit that referenced this pull request on Dec 16, 2025

Types the `**options` param in the model APIs using TypedDicts. Each model API now has its own specific TypedDict that indicates a subset of the possible options that can be passed, providing better type hints for users and their IDEs. For most of the model APIs, the options include `max_retries`, `on_error`, and `batch_size`, which are subsequently passed into the UDF. Provider-specific options can be subclassed; for example, `OpenAIPromptOptions` is a subclass of `PromptOptions` and has an additional field, `use_chat_completions`, which is specific to OpenAI prompting.
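A minimal sketch of what these TypedDicts might look like (the field types are assumptions; the option names and the subclassing relationship come from the description above):

```python
from typing import TypedDict

class PromptOptions(TypedDict, total=False):
    # Common options shared by most model APIs; passed through to the UDF.
    max_retries: int
    on_error: str          # assumed type, for illustration only
    batch_size: int

class OpenAIPromptOptions(PromptOptions, total=False):
    # OpenAI-specific option layered on top of the common ones.
    use_chat_completions: bool
```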
universalmind303 pushed a commit that referenced this pull request on Dec 16, 2025

## Overview

The following diagram describes the core workflow of the new add_columns feature. When a user calls write_lance(mode='merge'), the system checks for schema changes:

- No new columns: falls back to the existing append mode.
- New columns present: triggers the new row-level merge workflow, which groups data by fragment_id and efficiently merges new columns within each fragment using a join key (e.g., _rowaddr), ultimately committing only metadata changes.

(Workflow diagram image attached to the original PR.)

# Core Interface Changes

This optimization introduces new APIs and extends existing interfaces. The core changes are as follows:

## 1. DataFrame.write_lance()

The write_lance method adds a merge mode to implement efficient schema evolution.

- Mode: `mode: Literal["create", "append", "overwrite", "merge"] = "create"`
- New merge mode behavior:
  1. Dataset does not exist: behaves like create mode.
  2. Dataset exists but no new columns: behaves like append mode.
  3. Dataset exists with new columns: triggers the row-level merge workflow. In this mode, the DataFrame must contain fragment_id and a join key (defaults to _rowaddr, or as specified by left_on/right_on).
- UT example (tests/io/lancedb/test_lance_merge_evolution.py):

```
# Read existing data, including fragment_id and _rowaddr
df_loaded = daft.read_lance(
    lance_dataset_path,
    default_scan_options={"with_row_address": True},
    include_fragment_id=True,
)

# Derive a new column
df_with_new_col = df_loaded.with_column("double_lat", daft.col("lat") * 2)

# Write with merge mode; Daft handles column merging automatically
df_with_new_col.write_lance(lance_dataset_path, mode="merge")
```

## 2. daft.io.lance.read_lance()

read_lance adds the include_fragment_id parameter to expose each fragment's ID as a new column (fragment_id) to the user upon reading. This is a key prerequisite for row-level merging.

- New parameter: `include_fragment_id: Optional[bool] = None`
- UT example (tests/io/lancedb/test_lance_merge_evolution.py):

```
df = daft.read_lance(
    lance_dataset_path,
    default_scan_options={"with_row_address": True},
    include_fragment_id=True,
)
assert "fragment_id" in df.column_names
```

## 3. daft.io.lance.merge_columns_df()

This is a new low-level API that provides a more flexible, DataFrame-based row-level column merge capability. write_lance(mode='merge') internally wraps this function.

- Core functionality: receives a DataFrame containing fragment_id, a join key, and new columns, and merges it efficiently into an existing Lance dataset.
- UT example (tests/io/lancedb/test_lance_merge_evolution.py):

```
# Prepare a DataFrame with only fragment_id, the join key, and the new columns
df_subset = df_loaded.with_column("double_lat", daft.col("lat") * 2) \
    .select("_rowaddr", "fragment_id", "double_lat")

# Call the low-level API to perform the merge
daft.io.lance.merge_columns_df(
    df_subset,
    lance_dataset_path,
    read_columns=["_rowaddr", "double_lat"],
)
```

## 4. Interface Comparison

Both interfaces support column merging, but their purpose and usage differ:

- Interface 1: DataFrame.write_lance(mode="merge") is a writer-side merge; it automatically detects new columns, requires fragment_id and a join key, and returns write statistics metadata.
- Interface 2: daft.io.lance.merge_columns_df() is a DataFrame-based row-level merge; it explicitly takes a DataFrame with only fragment_id, the join key, and the new columns, returns None, and allows finer control via read_columns and batch_size.

Usage recommendation:

- For end-to-end write pipelines with more intuitive semantics, prefer write_lance(mode="merge").
- For complex DataFrame transformations or custom column reading before writing, use merge_columns_df().

# Deprecation Plan

With the introduction of the new write_lance(mode='merge') and daft.io.lance.merge_columns_df, the old daft.io.lance.merge_columns interface has been marked as deprecated.

- Reason for deprecation: the old interface was based on a fragment-level transformation function, which did not support DataFrames that had undergone complex operations (like joins, groupbys, etc.), limiting its flexibility and use cases.
- Future plan: daft.io.lance.merge_columns will be removed in a future release. All users are advised to migrate to the new write_lance(mode='merge') interface, which provides a more powerful and user-friendly column extension capability.
colin-ho added a commit that referenced this pull request on Dec 17, 2025

Adds a `RetryAfterException` that can be raised from a UDF, which supplies the UDF retry mechanism with a delay to sleep. This allows us to respect the retry-after values provided in headers from 429 and 503 errors from clients, such as our OpenAI and Google clients in the model APIs. By default, all of our model APIs should already be configured with max_retries = 3, and our exponential backoff has an initial delay of 100ms. It might be worth exploring in the future how to more easily expose these configurations to users.
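A hedged sketch of how a UDF might use this (the import path and the constructor signature of `RetryAfterException` are assumptions; the provider error class and the UDF body are stand-ins):

```python
import daft
from daft import RetryAfterException  # assumed import path, for illustration

class ProviderRateLimit(Exception):
    """Stand-in for a client 429/503 error that carries a Retry-After value."""
    def __init__(self, retry_after: float):
        super().__init__("rate limited")
        self.retry_after = retry_after

@daft.udf(return_dtype=daft.DataType.string())
def call_model(prompts):
    try:
        return ["response" for _ in prompts.to_pylist()]  # placeholder for the real provider call
    except ProviderRateLimit as e:
        # Forward the server-provided delay so the UDF retry mechanism
        # sleeps for that long before the next attempt.
        raise RetryAfterException(delay=e.retry_after) from e
```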
colin-ho added a commit that referenced this pull request on Dec 17, 2025

Add support for `Series[start:end]`.

Related Issues: #4771

Co-authored-by: wangzheyan <[email protected]>
Co-authored-by: Colin Ho <[email protected]>
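A small sketch of the new syntax (assuming standard Python slice semantics):

```python
import daft

s = daft.Series.from_pylist([1, 2, 3, 4, 5])
# Series[start:end] selects a contiguous range of rows.
print(s[1:4].to_pylist())  # expected: [2, 3, 4]
```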
universalmind303 pushed a commit that referenced this pull request on Dec 17, 2025

Change "We <3 developers!" to "We ❤️ developers!". This is a rather interesting case: some customers have interpreted this sentence as "Daft has no more than 3 full-time developers", mistakenly thinking that Daft is a personal project and therefore hesitating to adopt it in production.

Signed-off-by: plotor <[email protected]>
universalmind303 pushed a commit that referenced this pull request on Dec 17, 2025

Overwrite files via native IO instead of the pyarrow filesystem, which makes it convenient to extend to new sources. The logic is kept the same as the Python overwrite_files code, even though there might be an overwrite issue as mentioned in #5700; a separate PR should handle that if it's worthwhile.
kevinzwang pushed a commit that referenced this pull request on Dec 17, 2025

### Extended SQL LIKE Pattern Matching in MemoryCatalog

- Upgrade `sqlparser` to extract namespace from query
- Translate SQL LIKE patterns to regex (`src/daft-catalog/src/pattern.rs`)
- Supports standard SQL LIKE wildcards: `%` (zero or more), `_` (exactly one), `\` (escape)
- Updated documentation on `SHOW TABLES` syntax and pattern behavior

### Testing

- Added pattern matching tests to `test_sql_show_tables.py` and in `src/daft-catalog/pattern.rs`
- `cargo test -p daft-catalog pattern::`

Closes #4461
Closes #4007
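The actual translation lives in Rust (`src/daft-catalog/src/pattern.rs`); as a rough illustration of the same wildcard handling, a Python sketch might look like this (not the actual implementation):

```python
import re

def like_to_regex(pattern: str) -> str:
    """Sketch: translate a SQL LIKE pattern to an anchored regex.
    `%` matches zero or more characters, `_` matches exactly one,
    and a backslash escapes the next character."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            out.append(".*")
        elif ch == "_":
            out.append(".")
        else:
            out.append(re.escape(ch))
        i += 1
    return "^" + "".join(out) + "$"

# Example: match table names starting with a literal "tbl_" prefix.
assert re.match(like_to_regex(r"tbl\_%"), "tbl_users")
```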
colin-ho added a commit that referenced this pull request on Dec 18, 2025

Set the default ImageMode for decode_image to RGB. Currently, the behavior is to infer the image mode per image, which means we could have 8-bit and 16-bit images in the same column; this errors because the underlying datatypes (uint8 vs uint16) are different. The solution is to force decoding into a specific image mode, and this PR simply makes RGB the default.
colin-ho pushed a commit that referenced this pull request on Dec 18, 2025
universalmind303 added a commit that referenced this pull request on Dec 18, 2025

Gets field and dtype roundtrips working. Note: since arrow-rs uses fields to store extension data, converting a datatype with extensions to Arrow will fail by design; instead, you must operate on the field to preserve extension metadata. The following conversions are possible:
- DaftField -> ArrowField
- ArrowField -> DaftField
- DaftDataType -> ArrowDataType, where DaftDataType != Extension
- ArrowDataType -> DaftDataType
- ArrowField -> DaftDataType

All of these are fallible operations.

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
colin-ho added a commit that referenced this pull request on Dec 18, 2025

This was failing in https://github.com/Eventual-Inc/Daft/actions/runs/20347885747/job/58464760614 because of the failed assertion. I believe this is due to https://github.com/Eventual-Inc/Daft/pull/5676/changes#diff-a77514cfe53156c701c4b3b7426dc9d2c0d257deb52dae53fba976f51fcf8daf, which changed the way the buffer gets popped to be one by one instead of as many as possible, so we don't need the assertion anymore.
colin-ho pushed a commit that referenced this pull request on Dec 18, 2025

1. Support configuring the log format and date format.
2. Enable all loggers by default.
3. Modify the documentation for easier setup.
kevinzwang pushed a commit that referenced this pull request on Dec 19, 2025

1. Implement TosMultipartWriter.
2. Support the native writer via TOS.
colin-ho pushed a commit that referenced this pull request on Dec 19, 2025

#5855) Namespace the `RemoteFlotillaRunner` Ray actor name per Ray job / process to avoid cross-client plan ID collisions when multiple Daft clients connect to the same Ray cluster.

- Introduce a cached suffix used to namespace the flotilla plan runner actor name per Ray job / process.
- Derive the suffix from the Ray runtime context `job_id` when available.
- If no job identifier can be determined, fall back to a per-process UUID so the actor name remains unique across different Python processes.
- Update `FlotillaRunner` to construct the Ray actor with the new per-job actor name while keeping the `namespace="daft"` and `get_if_exists=True` semantics unchanged.
- Add comments explaining the root cause: `DistributedPhysicalPlan` / `QueryIdx` uses a per-process counter, so plan IDs like `"0"` can collide across different clients if they share the same named actor instance.
- Add a lightweight unit test that verifies the flotilla actor name is derived from the base name plus a job-specific suffix and remains stable within a process.

Related Issues: #5856
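An illustrative sketch of the suffix derivation described above (names are illustrative, not the actual implementation):

```python
import functools
import uuid

import ray

@functools.cache
def _runner_name_suffix() -> str:
    """Per-job suffix for the flotilla runner actor name; falls back to a
    per-process UUID when no Ray job id can be determined."""
    try:
        job_id = ray.get_runtime_context().get_job_id()
        if job_id:
            return job_id
    except Exception:
        pass
    return uuid.uuid4().hex

# Different Ray jobs (or processes) now get distinct named actors.
actor_name = f"RemoteFlotillaRunner-{_runner_name_suffix()}"
```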
universalmind303 added a commit that referenced this pull request on Dec 19, 2025

Adds the following methods to most array implementations:

```rs
// convert to type erased ArrayRef
// all arrays implement this
fn to_arrow(&self) -> DaftResult<ArrayRef>;

// convert to a concrete type which varies for each array type
// such as DataArray<Int8Type> -> arrow::array::Int8Array
// most arrays implement this, but not all of them.
// this is also true for the original `as_arrow` (now `as_arrow2`) that this is intended to replace.
fn as_arrow(&self) -> DaftResult<Self::ArrowOutput>
```

and `from` methods:

```rs
// this is also implemented for all arrays.
fn from_arrow<F: Into<Arc<Field>>>(field: F, array: ArrayRef) -> DaftResult<Self>;
```

Note for reviewers: the bulk of the changes is in the following:
- `src/daft-core/src/array/ops/from_arrow.rs`
- `src/daft-core/src/datatypes/logical.rs`
- `src/daft-core/src/array/*`

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
universalmind303 pushed a commit that referenced this pull request on Dec 22, 2025

… guide (#5865)

Add instructions on the basic usage of the Daft Dashboard to the development guide, to reduce the learning cost for developers. The Daft Dashboard is currently in a stage of rapid development; a dedicated article or page will introduce its functions and usage in detail later, but it is more appropriate to include it in the development guide at this stage.

Signed-off-by: plotor <[email protected]>
colin-ho pushed a commit that referenced this pull request on Dec 22, 2025

…5864)

The integration test TestIcebergCountPushdown.test_count_pushdown_with_delete_files was failing for the test_overlapping_deletes table because it incorrectly enabled count pushdown. The root cause was that the initial Spark write created multiple small data files. Subsequent DELETE operations were optimized by Iceberg to mark entire data files as removed instead of generating position/equality delete files. As a result, Daft's _has_delete_files() check did not find any delete files and incorrectly allowed the count pushdown optimization.

This PR fixes the test by adding coalesce(1) to the Spark DataFrame before writing the initial data for the test_overlapping_deletes table. This ensures the data is written to a single Parquet file, forcing subsequent DELETE operations to generate actual delete files. This aligns the test's behavior with its intent, correctly disabling count pushdown when delete files are present.

Related Issues: #5863

Co-authored-by: root <root@bytedance>
Co-authored-by: huleilei <huleilei@bytedance>
srilman pushed a commit that referenced this pull request on Dec 28, 2025

Bump `mypy` to 1.19.1 and `ruff` to 0.14.10 and resolve pre-commit errors. Rationale: my local `ruff` version is much higher than the one in the project pre-commit config, so I am getting a lot of linter warnings in my IDE. It is also about time to upgrade the lints to utilize newer Python language features.