
Conversation

@jaychia
Contributor

@jaychia jaychia commented Aug 26, 2022

No description provided.

@jaychia jaychia merged commit 2daf382 into main Aug 26, 2022
@jaychia jaychia deleted the jay/compute-limit-bug-repro branch August 26, 2022 08:14
colin-ho added a commit that referenced this pull request Mar 20, 2025
## Summary

<!-- Provide a short summary of the changes in this PR. -->

## Related Issues

<!-- Link to related GitHub issues or JIRA tickets, e.g., "Closes #123"
-->

## Changes Made

<!-- Describe what changes were made and why. Include implementation
details if necessary. -->

## Checklist

- [ ] All tests have passed
- [ ] Documented in API Docs
- [ ] Documented in User Guide
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag
[@ccmao1130](https://github.com/ccmao1130) for docs review)
colin-ho added a commit that referenced this pull request Mar 20, 2025
## Summary

Fix some typos and missing info in the S3 Tables docs.

## Related Issues

<!-- Link to related GitHub issues or JIRA tickets, e.g., "Closes #123"
-->

## Changes Made

<!-- Describe what changes were made and why. Include implementation
details if necessary. -->

## Checklist

- [ ] All tests have passed
- [ ] Documented in API Docs
- [ ] Documented in User Guide
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag
[@ccmao1130](https://github.com/ccmao1130) for docs review)
colin-ho added a commit that referenced this pull request Mar 21, 2025
## Summary

Track imports on scarf

## Related Issues

<!-- Link to related GitHub issues or JIRA tickets, e.g., "Closes #123"
-->

## Changes Made

<!-- Describe what changes were made and why. Include implementation
details if necessary. -->

## Checklist

- [ ] All tests have passed
- [ ] Documented in API Docs
- [ ] Documented in User Guide
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag
[@ccmao1130](https://github.com/ccmao1130) for docs review)
colin-ho added a commit that referenced this pull request Mar 21, 2025
## Summary

<!-- Provide a short summary of the changes in this PR. -->

## Related Issues

<!-- Link to related GitHub issues or JIRA tickets, e.g., "Closes #123"
-->

## Changes Made

<!-- Describe what changes were made and why. Include implementation
details if necessary. -->

## Checklist

- [ ] All tests have passed
- [ ] Documented in API Docs
- [ ] Documented in User Guide
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag
[@ccmao1130](https://github.com/ccmao1130) for docs review)
colin-ho added a commit that referenced this pull request Mar 24, 2025
…ad of endswith (#4046)

## Summary

Fix runner switching test.

## Related Issues

<!-- Link to related GitHub issues or JIRA tickets, e.g., "Closes #123"
-->

## Changes Made

Pass the assertion as long as `cannot set runner more than once` appears anywhere in the error message, instead of requiring the message to end with it, since the message may contain other text.
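A minimal sketch of what such a test could look like, assuming the `daft.context` runner-setting APIs and using a plain substring check (this is an illustration, not the actual test from this PR):

```python
# Hedged sketch: switching runners twice should raise, and we only check for
# the key substring rather than the full message.
import pytest
import daft

def test_cannot_switch_runner():
    daft.context.set_runner_native()
    with pytest.raises(Exception) as exc_info:
        daft.context.set_runner_ray()
    assert "cannot set runner more than once" in str(exc_info.value)
```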

## Checklist

- [ ] All tests have passed
- [ ] Documented in API Docs
- [ ] Documented in User Guide
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag
[@ccmao1130](https://github.com/ccmao1130) for docs review)
kevinzwang added a commit that referenced this pull request Mar 25, 2025
## Summary

Clean up the PR template in various ways

## Related Issues

<!-- Link to related GitHub issues or JIRA tickets, e.g., "Closes #123"
-->

## Changes Made

<!-- Describe what changes were made and why. Include implementation
details if necessary. -->
- remove summary section (I felt that there was a large amount of
redundancy between the PR title, summary, and changes made sections)
- remove JIRA mention in changes made since we do not use that
- remove "All tests have passed" from checklist since that is already a
strict requirement for merging
- add "(if applicable)" to some checklist items

## Checklist

- [x] All tests have passed
- [x] Documented in API Docs
- [x] Documented in User Guide
- [x] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [x] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)
colin-ho added a commit that referenced this pull request Mar 25, 2025
## Summary

Remove ray compatibility workflow as it has been failing for a while.

## Related Issues

<!-- Link to related GitHub issues or JIRA tickets, e.g., "Closes #123"
-->

## Changes Made

<!-- Describe what changes were made and why. Include implementation
details if necessary. -->

## Checklist

- [ ] All tests have passed
- [ ] Documented in API Docs
- [ ] Documented in User Guide
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)
colin-ho added a commit that referenced this pull request Mar 28, 2025
## Changes Made

Adds overwrite mode to Spark Connect for writing CSV and Parquet.
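For context, a hedged sketch of the client-side usage this enables, assuming a Daft Spark Connect server is already running at the (made-up) address below:

```python
# Standard Spark Connect client calls; only the overwrite mode is new on the Daft side.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df.write.mode("overwrite").parquet("/tmp/out_parquet")
df.write.mode("overwrite").csv("/tmp/out_csv")
```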

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)

---------

Co-authored-by: universalmind303 <[email protected]>
colin-ho added a commit that referenced this pull request Mar 31, 2025
## Changes Made

Adds docs in Core Concepts for cross-column expressions.
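A quick illustration of the kind of cross-column expression the new docs section covers (example data made up):

```python
# Combine two columns in a single expression.
import daft
from daft import col

df = daft.from_pydict({"a": [1, 2, 3], "b": [10, 20, 30]})
df = df.with_column("a_plus_b", col("a") + col("b"))
df.show()
```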

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)
colin-ho added a commit that referenced this pull request Mar 31, 2025
## Changes Made

Add a section called "Numbering Rows" that explains how to add IDs to rows with a monotonically increasing ID.
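A hedged sketch of the pattern the new section documents, assuming the `monotonically_increasing_id` function lives in `daft.functions`:

```python
# Assign a monotonically increasing ID to each row.
import daft
from daft.functions import monotonically_increasing_id  # assumed import path

df = daft.from_pydict({"fruit": ["apple", "banana", "cherry"]})
df = df.with_column("id", monotonically_increasing_id())
df.show()
```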

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)
colin-ho added a commit that referenced this pull request Mar 31, 2025
## Changes Made

Use the IO runtime for the task in the remote Parquet reader. Fixes a deadlock.

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)
colin-ho added a commit that referenced this pull request Mar 31, 2025
## Changes Made

Disable scan task merging for WARC files.

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)

---------

Co-authored-by: Desmond Cheong <[email protected]>
desmondcheongzx pushed a commit that referenced this pull request Apr 1, 2025
## Changes Made

Quick refactor of some of the logic that previously lived in
PushDownFilter into its own rule.

I previously implemented it in PushDownFilter because it could have
interactions with the other filter pushdown logic. However, we should
try to keep PushDownFilter as simple as we can, just for pushing filter
nodes through a plan. Thus, I am moving this logic into a new rule and
using the repeated execution of the rule batch to deal with interactions
between the two rules.

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

Related to why we closed #4118

## Checklist

- [x] Documented in API Docs (if applicable)
- [x] Documented in User Guide (if applicable)
- [x] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [x] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)
desmondcheongzx added a commit that referenced this pull request Apr 1, 2025
## Changes Made

This PR adds an optimizer rule that pushes down join predicates. This
will fix our TPC-H benchmarks, since Q13, which is technically a
non-equi join, can be optimized into an equi-join.

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [x] Documented in API Docs (if applicable)
- [x] Documented in User Guide (if applicable)
- [x] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [x] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)

Co-authored-by: desmondcheongzx <[email protected]>
kevinzwang added a commit that referenced this pull request Apr 2, 2025
## Changes Made

Added an optimizer rule `PushDownAntiSemiJoin`, which pushes anti and semi joins through inner joins, projects, filters, sorts, and distincts.

I decided to do this instead of adding anti and semi joins to our brute force join reorderer because it is simpler and will largely produce the same results, especially with TPC-H. 

More specifically, join graph creation makes some assumptions about the uniqueness of input column names that I'm not sure will hold for anti and semi joins, since their columns are not preserved in the output. I was not confident in my ability to modify the join reordering algorithm to properly fix this within a reasonable time. We will want to implement a more efficient reordering algorithm in the future anyway, so I do not want to spend a significant effort on patching the current one up.
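As a rough illustration (not taken from the PR itself), this is the query shape the rule targets, written with Daft's DataFrame API; the tables and columns are made up, and the idea is that the optimizer can push the semi join below the inner join so the filtering happens earlier:

```python
import daft

orders = daft.from_pydict({"o_orderkey": [1, 2, 3], "o_custkey": [10, 20, 30]})
customers = daft.from_pydict({"c_custkey": [10, 20], "c_name": ["alice", "bob"]})
qualifying = daft.from_pydict({"l_orderkey": [1, 3]})

# Inner join followed by a semi join; with the new rule the semi join can be
# evaluated against `orders` before the inner join, shrinking the join input.
result = (
    orders.join(customers, left_on="o_custkey", right_on="c_custkey")
    .join(qualifying, left_on="o_orderkey", right_on="l_orderkey", how="semi")
)
result.show()
```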

### TPC-H Q18 
#### Results
| Benchmark | Before | After | Improvement |
|--------|--------|--------|--------|
| Local (SF 100) | 56.79s | 10.38s | 5.47x |
| Distributed (SF 1000) | 285.4s | 66.12s | 4.32x |

#### Optimized Plan (Before)
```mermaid
flowchart TD
subgraph optimized["Optimized LogicalPlan"]
direction TB
optimizedLimit0["Limit: 100
Stats = { Approx num rows = 100, Approx size bytes = 1.56 KiB, Accumulated
selectivity = 0.00 }"]
optimizedSort1["Sort: Sort by = (col(o_totalprice), descending, nulls first), (col(o_orderdate),
ascending, nulls last)
Stats = { Approx num rows = 5,665,935, Approx size bytes = 86.46 MiB,
Accumulated selectivity = 0.12 }"]
optimizedAggregate2["Aggregation: sum(col(l_quantity))
Group by = col(c_name), col(c_custkey), col(o_orderkey), col(o_orderdate),
col(o_totalprice)
Output schema = c_name#Utf8, c_custkey#Int64, o_orderkey#Int64,
o_orderdate#Date, o_totalprice#Float64, l_quantity#Int64
Stats = { Approx num rows = 5,665,935, Approx size bytes = 86.46 MiB,
Accumulated selectivity = 0.12 }"]
optimizedProject3["Project: col(l_quantity), col(c_name), col(c_custkey), col(o_orderkey),
col(o_orderdate), col(o_totalprice)
Stats = { Approx num rows = 7,082,419, Approx size bytes = 109.76 MiB,
Accumulated selectivity = 0.14 }"]
optimizedJoin4["Join: Type = Semi
Strategy = Auto
On = col(left.o_orderkey#Int64) == col(right.l_orderkey#Int64)
Output schema = o_orderkey#Int64, c_custkey#Int64, c_name#Utf8,
o_totalprice#Float64, o_orderdate#Date, l_quantity#Int64
Stats = { Approx num rows = 7,082,419, Approx size bytes = 109.76 MiB,
Accumulated selectivity = 0.14 }"]
optimizedProject5["Project: col(o_orderkey), col(c_custkey), col(c_name), col(o_totalprice),
col(o_orderdate), col(l_quantity)
Stats = { Approx num rows = 46,594,852, Approx size bytes = 722.09 MiB,
Accumulated selectivity = 0.95 }"]
optimizedJoin6["Join: Type = Inner
Strategy = Auto
On = col(left.o_orderkey#Int64) == col(right.l_orderkey#Int64)
Output schema = c_custkey#Int64, c_name#Utf8, o_orderkey#Int64,
o_totalprice#Float64, o_orderdate#Date, l_orderkey#Int64, l_quantity#Int64
Stats = { Approx num rows = 46,594,852, Approx size bytes = 722.09 MiB,
Accumulated selectivity = 0.95 }"]
optimizedProject7["Project: col(c_custkey), col(c_name), col(o_orderkey), col(o_totalprice),
col(o_orderdate)
Stats = { Approx num rows = 18,273,257, Approx size bytes = 496.66 MiB,
Accumulated selectivity = 0.95 }"]
optimizedJoin8["Join: Type = Inner
Strategy = Auto
On = col(left.c_custkey#Int64) == col(right.o_custkey#Int64)
Output schema = c_custkey#Int64, c_name#Utf8, o_orderkey#Int64, o_custkey#Int64,
o_totalprice#Float64, o_orderdate#Date
Stats = { Approx num rows = 18,273,257, Approx size bytes = 496.66 MiB,
Accumulated selectivity = 0.95 }"]
optimizedProject9["Project: col(C_CUSTKEY) as c_custkey, col(C_NAME) as c_name
Stats = { Approx num rows = 3,308,528, Approx size bytes = 89.14 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource10["Num Scan Tasks = 32
File schema = C_CUSTKEY#Int64, C_NAME#Utf8, C_ADDRESS#Utf8, C_NATIONKEY#Int64,
C_PHONE#Utf8, C_ACCTBAL#Float64, C_MKTSEGMENT#Utf8, C_COMMENT#Utf8
Partitioning keys = []
Projection pushdown = [C_CUSTKEY, C_NAME]
Output schema = C_CUSTKEY#Int64, C_NAME#Utf8
Stats = { Approx num rows = 3,308,528, Approx size bytes = 89.14 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource10 --> optimizedProject9
optimizedProject9 --> optimizedJoin8
optimizedProject11["Project: col(O_ORDERKEY) as o_orderkey, col(O_CUSTKEY) as o_custkey,
col(O_TOTALPRICE) as o_totalprice, col(O_ORDERDATE) as o_orderdate
Stats = { Approx num rows = 18,273,257, Approx size bytes = 496.66 MiB,
Accumulated selectivity = 0.95 }"]
optimizedSource12["Num Scan Tasks = 32
File schema = O_ORDERKEY#Int64, O_CUSTKEY#Int64, O_ORDERSTATUS#Utf8,
O_TOTALPRICE#Float64, O_ORDERDATE#Date, O_ORDERPRIORITY#Utf8, O_CLERK#Utf8,
O_SHIPPRIORITY#Int64, O_COMMENT#Utf8
Partitioning keys = []
Projection pushdown = [O_ORDERKEY, O_CUSTKEY, O_TOTALPRICE, O_ORDERDATE]
Filter pushdown = not(is_null(col(O_ORDERKEY)))
Output schema = O_ORDERKEY#Int64, O_CUSTKEY#Int64, O_TOTALPRICE#Float64,
O_ORDERDATE#Date
Stats = { Approx num rows = 18,273,257, Approx size bytes = 496.66 MiB,
Accumulated selectivity = 0.95 }"]
optimizedSource12 --> optimizedProject11
optimizedProject11 --> optimizedJoin8
optimizedJoin8 --> optimizedProject7
optimizedProject7 --> optimizedJoin6
optimizedProject13["Project: col(L_ORDERKEY) as l_orderkey, col(L_QUANTITY) as l_quantity
Stats = { Approx num rows = 49,047,212, Approx size bytes = 760.10 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource14["Num Scan Tasks = 32
File schema = L_ORDERKEY#Int64, L_PARTKEY#Int64, L_SUPPKEY#Int64,
L_LINENUMBER#Int64, L_QUANTITY#Int64, L_EXTENDEDPRICE#Float64,
L_DISCOUNT#Float64, L_TAX#Float64, L_RETURNFLAG#Utf8, L_LINESTATUS#Utf8,
L_SHIPDATE#Date, L_COMMITDATE#Date, L_RECEIPTDATE#Date, L_SHIPINSTRUCT#Utf8,
L_SHIPMODE#Utf8, L_COMMENT#Utf8
Partitioning keys = []
Projection pushdown = [L_ORDERKEY, L_QUANTITY]
Output schema = L_ORDERKEY#Int64, L_QUANTITY#Int64
Stats = { Approx num rows = 49,047,212, Approx size bytes = 760.10 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource14 --> optimizedProject13
optimizedProject13 --> optimizedJoin6
optimizedJoin6 --> optimizedProject5
optimizedProject5 --> optimizedJoin4
optimizedProject15["Project: col(l_orderkey)
Stats = { Approx num rows = 7,455,177, Approx size bytes = 113.76 MiB,
Accumulated selectivity = 0.15 }"]
optimizedFilter16["Filter: col((l_quantity.local_sum() > Literal(Int64(300))))
Stats = { Approx num rows = 7,455,177, Approx size bytes = 113.76 MiB,
Accumulated selectivity = 0.15 }"]
optimizedProject17["Project: col(l_orderkey), col(l_quantity.local_sum()) > lit(300) as
(l_quantity.local_sum() > Literal(Int64(300)))
Stats = { Approx num rows = 37,275,881, Approx size bytes = 568.78 MiB,
Accumulated selectivity = 0.76 }"]
optimizedFilter18["Filter: not(is_null(col(l_orderkey)))
Stats = { Approx num rows = 37,275,881, Approx size bytes = 568.78 MiB,
Accumulated selectivity = 0.76 }"]
optimizedAggregate19["Aggregation: sum(col(l_quantity)) as l_quantity.local_sum()
Group by = col(l_orderkey)
Output schema = l_orderkey#Int64, l_quantity.local_sum()#Int64
Stats = { Approx num rows = 39,237,769, Approx size bytes = 598.72 MiB,
Accumulated selectivity = 0.80 }"]
optimizedProject20["Project: col(L_QUANTITY) as l_quantity, col(L_ORDERKEY) as l_orderkey
Stats = { Approx num rows = 49,047,212, Approx size bytes = 760.10 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource21["Num Scan Tasks = 32
File schema = L_ORDERKEY#Int64, L_PARTKEY#Int64, L_SUPPKEY#Int64,
L_LINENUMBER#Int64, L_QUANTITY#Int64, L_EXTENDEDPRICE#Float64,
L_DISCOUNT#Float64, L_TAX#Float64, L_RETURNFLAG#Utf8, L_LINESTATUS#Utf8,
L_SHIPDATE#Date, L_COMMITDATE#Date, L_RECEIPTDATE#Date, L_SHIPINSTRUCT#Utf8,
L_SHIPMODE#Utf8, L_COMMENT#Utf8
Partitioning keys = []
Projection pushdown = [L_QUANTITY, L_ORDERKEY]
Output schema = L_ORDERKEY#Int64, L_QUANTITY#Int64
Stats = { Approx num rows = 49,047,212, Approx size bytes = 760.10 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource21 --> optimizedProject20
optimizedProject20 --> optimizedAggregate19
optimizedAggregate19 --> optimizedFilter18
optimizedFilter18 --> optimizedProject17
optimizedProject17 --> optimizedFilter16
optimizedFilter16 --> optimizedProject15
optimizedProject15 --> optimizedJoin4
optimizedJoin4 --> optimizedProject3
optimizedProject3 --> optimizedAggregate2
optimizedAggregate2 --> optimizedSort1
optimizedSort1 --> optimizedLimit0
end
```

#### Optimized Plan (After)

```mermaid
flowchart TD
subgraph optimized["Optimized LogicalPlan"]
direction TB
optimizedLimit0["Limit: 100
Stats = { Approx num rows = 100, Approx size bytes = 1.56 KiB, Accumulated
selectivity = 0.00 }"]
optimizedSort1["Sort: Sort by = (col(o_totalprice), descending, nulls first), (col(o_orderdate),
ascending, nulls last)
Stats = { Approx num rows = 5,665,935, Approx size bytes = 86.46 MiB,
Accumulated selectivity = 0.12 }"]
optimizedAggregate2["Aggregation: sum(col(l_quantity))
Group by = col(c_name), col(c_custkey), col(o_orderkey), col(o_orderdate),
col(o_totalprice)
Output schema = c_name#Utf8, c_custkey#Int64, o_orderkey#Int64,
o_orderdate#Date, o_totalprice#Float64, l_quantity#Int64
Stats = { Approx num rows = 5,665,935, Approx size bytes = 86.46 MiB,
Accumulated selectivity = 0.12 }"]
optimizedProject3["Project: col(l_quantity), col(c_name), col(c_custkey), col(o_orderkey),
col(o_orderdate), col(o_totalprice)
Stats = { Approx num rows = 7,082,419, Approx size bytes = 109.76 MiB,
Accumulated selectivity = 0.14 }"]
optimizedProject4["Project: col(c_custkey), col(c_name), col(o_orderkey), col(o_totalprice),
col(o_orderdate), col(l_orderkey), col(l_quantity)
Stats = { Approx num rows = 7,082,419, Approx size bytes = 109.76 MiB,
Accumulated selectivity = 0.14 }"]
optimizedJoin5["Join: Type = Inner
Strategy = Auto
On = col(left.c_custkey#Int64) == col(right.o_custkey#Int64)
Output schema = c_custkey#Int64, c_name#Utf8, o_orderkey#Int64, o_custkey#Int64,
o_totalprice#Float64, o_orderdate#Date, l_orderkey#Int64, l_quantity#Int64
Stats = { Approx num rows = 7,082,419, Approx size bytes = 109.76 MiB,
Accumulated selectivity = 0.14 }"]
optimizedProject6["Project: col(C_CUSTKEY) as c_custkey, col(C_NAME) as c_name
Stats = { Approx num rows = 3,308,528, Approx size bytes = 89.14 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource7["Num Scan Tasks = 32
File schema = C_CUSTKEY#Int64, C_NAME#Utf8, C_ADDRESS#Utf8, C_NATIONKEY#Int64,
C_PHONE#Utf8, C_ACCTBAL#Float64, C_MKTSEGMENT#Utf8, C_COMMENT#Utf8
Partitioning keys = []
Projection pushdown = [C_CUSTKEY, C_NAME]
Output schema = C_CUSTKEY#Int64, C_NAME#Utf8
Stats = { Approx num rows = 3,308,528, Approx size bytes = 89.14 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource7 --> optimizedProject6
optimizedProject6 --> optimizedJoin5
optimizedJoin8["Join: Type = Inner
Strategy = Auto
On = col(left.o_orderkey#Int64) == col(right.l_orderkey#Int64)
Output schema = o_orderkey#Int64, o_custkey#Int64, o_totalprice#Float64,
o_orderdate#Date, l_orderkey#Int64, l_quantity#Int64
Stats = { Approx num rows = 7,082,419, Approx size bytes = 109.76 MiB,
Accumulated selectivity = 0.14 }"]
optimizedProject9["Project: col(O_ORDERKEY) as o_orderkey, col(O_CUSTKEY) as o_custkey,
col(O_TOTALPRICE) as o_totalprice, col(O_ORDERDATE) as o_orderdate
Stats = { Approx num rows = 7,082,419, Approx size bytes = 108.07 MiB,
Accumulated selectivity = 0.14 }"]
optimizedJoin10["Join: Type = Semi
Strategy = Auto
On = col(left.O_ORDERKEY#Int64) == col(right.l_orderkey#Int64)
Output schema = O_ORDERKEY#Int64, O_CUSTKEY#Int64, O_TOTALPRICE#Float64,
O_ORDERDATE#Date
Stats = { Approx num rows = 7,082,419, Approx size bytes = 108.07 MiB,
Accumulated selectivity = 0.14 }"]
optimizedSource11["Num Scan Tasks = 32
File schema = O_ORDERKEY#Int64, O_CUSTKEY#Int64, O_ORDERSTATUS#Utf8,
O_TOTALPRICE#Float64, O_ORDERDATE#Date, O_ORDERPRIORITY#Utf8, O_CLERK#Utf8,
O_SHIPPRIORITY#Int64, O_COMMENT#Utf8
Partitioning keys = []
Projection pushdown = [O_ORDERKEY, O_CUSTKEY, O_TOTALPRICE, O_ORDERDATE]
Filter pushdown = not(is_null(col(O_ORDERKEY)))
Output schema = O_ORDERKEY#Int64, O_CUSTKEY#Int64, O_TOTALPRICE#Float64,
O_ORDERDATE#Date
Stats = { Approx num rows = 18,273,257, Approx size bytes = 496.66 MiB,
Accumulated selectivity = 0.95 }"]
optimizedSource11 --> optimizedJoin10
optimizedProject12["Project: col(l_orderkey)
Stats = { Approx num rows = 7,455,177, Approx size bytes = 113.76 MiB,
Accumulated selectivity = 0.15 }"]
optimizedFilter13["Filter: col((l_quantity.local_sum() > Literal(Int64(300))))
Stats = { Approx num rows = 7,455,177, Approx size bytes = 113.76 MiB,
Accumulated selectivity = 0.15 }"]
optimizedProject14["Project: col(l_orderkey), col(l_quantity.local_sum()) > lit(300) as
(l_quantity.local_sum() > Literal(Int64(300)))
Stats = { Approx num rows = 37,275,881, Approx size bytes = 568.78 MiB,
Accumulated selectivity = 0.76 }"]
optimizedFilter15["Filter: not(is_null(col(l_orderkey)))
Stats = { Approx num rows = 37,275,881, Approx size bytes = 568.78 MiB,
Accumulated selectivity = 0.76 }"]
optimizedAggregate16["Aggregation: sum(col(l_quantity)) as l_quantity.local_sum()
Group by = col(l_orderkey)
Output schema = l_orderkey#Int64, l_quantity.local_sum()#Int64
Stats = { Approx num rows = 39,237,769, Approx size bytes = 598.72 MiB,
Accumulated selectivity = 0.80 }"]
optimizedProject17["Project: col(L_QUANTITY) as l_quantity, col(L_ORDERKEY) as l_orderkey
Stats = { Approx num rows = 49,047,212, Approx size bytes = 760.10 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource18["Num Scan Tasks = 32
File schema = L_ORDERKEY#Int64, L_PARTKEY#Int64, L_SUPPKEY#Int64,
L_LINENUMBER#Int64, L_QUANTITY#Int64, L_EXTENDEDPRICE#Float64,
L_DISCOUNT#Float64, L_TAX#Float64, L_RETURNFLAG#Utf8, L_LINESTATUS#Utf8,
L_SHIPDATE#Date, L_COMMITDATE#Date, L_RECEIPTDATE#Date, L_SHIPINSTRUCT#Utf8,
L_SHIPMODE#Utf8, L_COMMENT#Utf8
Partitioning keys = []
Projection pushdown = [L_QUANTITY, L_ORDERKEY]
Output schema = L_ORDERKEY#Int64, L_QUANTITY#Int64
Stats = { Approx num rows = 49,047,212, Approx size bytes = 760.10 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource18 --> optimizedProject17
optimizedProject17 --> optimizedAggregate16
optimizedAggregate16 --> optimizedFilter15
optimizedFilter15 --> optimizedProject14
optimizedProject14 --> optimizedFilter13
optimizedFilter13 --> optimizedProject12
optimizedProject12 --> optimizedJoin10
optimizedJoin10 --> optimizedProject9
optimizedProject9 --> optimizedJoin8
optimizedProject19["Project: col(L_ORDERKEY) as l_orderkey, col(L_QUANTITY) as l_quantity
Stats = { Approx num rows = 49,047,212, Approx size bytes = 760.10 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource20["Num Scan Tasks = 32
File schema = L_ORDERKEY#Int64, L_PARTKEY#Int64, L_SUPPKEY#Int64,
L_LINENUMBER#Int64, L_QUANTITY#Int64, L_EXTENDEDPRICE#Float64,
L_DISCOUNT#Float64, L_TAX#Float64, L_RETURNFLAG#Utf8, L_LINESTATUS#Utf8,
L_SHIPDATE#Date, L_COMMITDATE#Date, L_RECEIPTDATE#Date, L_SHIPINSTRUCT#Utf8,
L_SHIPMODE#Utf8, L_COMMENT#Utf8
Partitioning keys = []
Projection pushdown = [L_ORDERKEY, L_QUANTITY]
Output schema = L_ORDERKEY#Int64, L_QUANTITY#Int64
Stats = { Approx num rows = 49,047,212, Approx size bytes = 760.10 MiB,
Accumulated selectivity = 1.00 }"]
optimizedSource20 --> optimizedProject19
optimizedProject19 --> optimizedJoin8
optimizedJoin8 --> optimizedJoin5
optimizedJoin5 --> optimizedProject4
optimizedProject4 --> optimizedProject3
optimizedProject3 --> optimizedAggregate2
optimizedAggregate2 --> optimizedSort1
optimizedSort1 --> optimizedLimit0
end
```
## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [x] Documented in API Docs (if applicable)
- [x] Documented in User Guide (if applicable)
- [x] If adding a new documentation page, doc is added to `docs/mkdocs.yml` navigation
- [x] Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
colin-ho added a commit that referenced this pull request Apr 3, 2025
## Changes Made

Use the native runner for doctests. Modify code examples to add a sort where needed (e.g. groupbys), because Swordfish does not provide ordering guarantees without an order-by.

Also, don't run the doctests for monotonically increasing ID, because it requires partitioning.
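A small sketch of the kind of doctest adjustment described above (example data made up): without the trailing sort, the groupby output order could change from run to run on the native runner.

```python
import daft

df = daft.from_pydict({"key": ["a", "b", "a"], "val": [1, 2, 3]})
# Sorting by the group key makes the doctest output deterministic.
df.groupby("key").agg(daft.col("val").sum()).sort("key").show()
```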

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)
f4t4nt pushed a commit that referenced this pull request Apr 3, 2025
## Changes Made

Use the native runner for doctests. Modify code examples to add a sort where needed (e.g. groupbys), because Swordfish does not provide ordering guarantees without an order-by.

Also, don't run the doctests for monotonically increasing ID, because it requires partitioning.

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)
colin-ho added a commit that referenced this pull request Apr 4, 2025
## Changes Made

Expose an async shuffle cache to python, remove use of python threadpool
in the shuffle actor, spawn partitioner tasks that connect directly to
writer tasks.

### How it works:

Flight shuffle works by first partitioning and saving all outputs of the map pipeline to disk in Arrow IPC stream format, then reading the files back and sending them to the reducer via Arrow Flight RPC.

We use an actor per node to do the partitioning and saving. Each actor has a `ShuffleCache` in Rust which is responsible for partitioning and saving. Previously, we used Python threads via Python's concurrent threadpool executor to push partitions into the cache concurrently, which let us partition in parallel. But we shouldn't need a Python threadpool; we can use the Tokio runtimes in Rust to do this instead. This reduces reliance on Python, reduces the number of threads, and will provide a smoother transition to using Swordfish with the shuffle cache.

To do this, we spawn `num_cpu` partitioner tasks and `num_partition` writer tasks on a single-threaded runtime; the tasks then spawn compute on the compute runtime when needed. This is also how Swordfish works. We then use https://docs.rs/pyo3-async-runtimes/latest/pyo3_async_runtimes/ to expose an async bridge to Python. Since Ray has async actors, this makes it easy to use async in Python as well.
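A hedged sketch of the Ray async-actor shape this describes; the real `ShuffleCache` is a Rust object exposed through pyo3-async-runtimes, so it is stubbed here with a plain asyncio coroutine and all names are illustrative:

```python
import asyncio
import ray

class FakeShuffleCache:
    async def push_partitions(self, partition):
        # Stand-in for the async Rust partitioning + Arrow IPC file writes.
        await asyncio.sleep(0)

@ray.remote
class ShuffleActor:
    def __init__(self):
        self.cache = FakeShuffleCache()

    async def push_partitions(self, partition):
        # Ray async actors can await the bridge directly, so no Python
        # threadpool executor is needed to get concurrency.
        await self.cache.push_partitions(partition)
```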

### Results:

Question | Before (s) | After (s) | Difference (s) | % Change
-- | -- | -- | -- | --
Q1 | 31.73 | 26.63 | -5.10 | -16.07%
Q2 | 36.40 | 31.26 | -5.14 | -14.12%
Q3 | 40.50 | 39.58 | -0.92 | -2.27%
Q4 | 27.58 | 22.90 | -4.68 | -16.97%
Q5 | 64.22 | 63.06 | -1.16 | -1.81%
Q6 | 9.77 | 9.55 | -0.22 | -2.25%
Q7 | 44.56 | 45.43 | 0.87 | 1.95%
Q8 | 96.06 | 96.61 | 0.55 | 0.57%
Q9 | 120.09 | 117.40 | -2.69 | -2.24%
Q10 | 36.68 | 37.49 | 0.81 | 2.21%
Q11 | 29.91 | 33.55 | 3.64 | 12.17%
Q12 | 25.21 | 25.51 | 0.30 | 1.19%
Q13 | 36.05 | 33.92 | -2.13 | -5.91%
Q14 | 17.68 | 17.12 | -0.56 | -3.17%
Q15 | 33.26 | 33.27 | 0.01 | 0.03%
Q16 | 19.11 | 20.84 | 1.73 | 9.05%
Q17 | 120.98 | 118.21 | -2.77 | -2.29%
Q18 | 64.95 | 62.89 | -2.06 | -3.17%
Q19 | 24.38 | 24.01 | -0.37 | -1.52%
Q20 | 51.75 | 53.28 | 1.53 | 2.96%
Q21 | 142.00 | 140.75 | -1.25 | -0.88%
Q22 | 13.32 | 13.63 | 0.31 | 2.32%
Total | 1086.19 | 1066.38 | -19.81 | -1.85%

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)
universalmind303 added a commit that referenced this pull request Apr 7, 2025
## Changes Made

<!-- Describe what changes were made and why. Include implementation
details if necessary. -->

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)
universalmind303 added a commit that referenced this pull request Apr 8, 2025
Fix the TimeUnit repr format. The doctest was failing because it didn't match the repr.


## Changes Made

<!-- Describe what changes were made and why. Include implementation
details if necessary. -->

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)
colin-ho added a commit that referenced this pull request Apr 9, 2025
## Changes Made

Mypy wasn't running. This removes the `daft_dashboard` arg (which is fine because it doesn't exist), and mypy runs now.

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)

---------

Co-authored-by: R. Conner Howell <[email protected]>
rchowell pushed a commit that referenced this pull request Apr 9, 2025
## Changes Made

Expose an async shuffle cache to python, remove use of python threadpool
in the shuffle actor, spawn partitioner tasks that connect directly to
writer tasks.

### How it works:

Flight shuffle works by first partitioning and saving all outputs of the map pipeline to disk in Arrow IPC stream format, then reading the files back and sending them to the reducer via Arrow Flight RPC.

We use an actor per node to do the partitioning and saving. Each actor has a `ShuffleCache` in Rust which is responsible for partitioning and saving. Previously, we used Python threads via Python's concurrent threadpool executor to push partitions into the cache concurrently, which let us partition in parallel. But we shouldn't need a Python threadpool; we can use the Tokio runtimes in Rust to do this instead. This reduces reliance on Python, reduces the number of threads, and will provide a smoother transition to using Swordfish with the shuffle cache.

To do this, we spawn `num_cpu` partitioner tasks and `num_partition` writer tasks on a single-threaded runtime; the tasks then spawn compute on the compute runtime when needed. This is also how Swordfish works. We then use https://docs.rs/pyo3-async-runtimes/latest/pyo3_async_runtimes/ to expose an async bridge to Python. Since Ray has async actors, this makes it easy to use async in Python as well.

### Results:

Question | Before (s) | After (s) | Difference (s) | % Change
-- | -- | -- | -- | --
Q1 | 31.73 | 26.63 | -5.10 | -16.07%
Q2 | 36.40 | 31.26 | -5.14 | -14.12%
Q3 | 40.50 | 39.58 | -0.92 | -2.27%
Q4 | 27.58 | 22.90 | -4.68 | -16.97%
Q5 | 64.22 | 63.06 | -1.16 | -1.81%
Q6 | 9.77 | 9.55 | -0.22 | -2.25%
Q7 | 44.56 | 45.43 | 0.87 | 1.95%
Q8 | 96.06 | 96.61 | 0.55 | 0.57%
Q9 | 120.09 | 117.40 | -2.69 | -2.24%
Q10 | 36.68 | 37.49 | 0.81 | 2.21%
Q11 | 29.91 | 33.55 | 3.64 | 12.17%
Q12 | 25.21 | 25.51 | 0.30 | 1.19%
Q13 | 36.05 | 33.92 | -2.13 | -5.91%
Q14 | 17.68 | 17.12 | -0.56 | -3.17%
Q15 | 33.26 | 33.27 | 0.01 | 0.03%
Q16 | 19.11 | 20.84 | 1.73 | 9.05%
Q17 | 120.98 | 118.21 | -2.77 | -2.29%
Q18 | 64.95 | 62.89 | -2.06 | -3.17%
Q19 | 24.38 | 24.01 | -0.37 | -1.52%
Q20 | 51.75 | 53.28 | 1.53 | 2.96%
Q21 | 142.00 | 140.75 | -1.25 | -0.88%
Q22 | 13.32 | 13.63 | 0.31 | 2.32%
Total | 1086.19 | 1066.38 | -19.81 | -1.85%

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)
rchowell pushed a commit that referenced this pull request Apr 9, 2025
## Changes Made

<!-- Describe what changes were made and why. Include implementation
details if necessary. -->

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)
rchowell pushed a commit that referenced this pull request Apr 9, 2025
Fix the TimeUnit repr format. The doctest was failing because it didn't match the repr.


## Changes Made

<!-- Describe what changes were made and why. Include implementation
details if necessary. -->

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)
rchowell added a commit that referenced this pull request Apr 9, 2025
## Changes Made

Mypy wasn't running. This removes the `daft_dashboard` arg (which is fine because it doesn't exist), and mypy runs now.

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)

---------

Co-authored-by: R. Conner Howell <[email protected]>
colin-ho added a commit that referenced this pull request Apr 10, 2025
## Changes Made

Replace the Python runner with the native runner in the bug report.

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)
colin-ho pushed a commit that referenced this pull request Dec 15, 2025
## Changes Made

Support splitting and merging JSONL/NDJSON files. Currently only JSONL and NDJSON files are supported.

Unsupported:
1. JSON files.
2. Compressed files are not supported yet.

<!-- Describe what changes were made and why. Include implementation
details if necessary. -->

## Related Issues
#5615
<!-- Link to related GitHub issues, e.g., "Closes #123" -->

Co-authored-by: cancai <[email protected]>
kevinzwang added a commit that referenced this pull request Dec 15, 2025
## Changes Made

Add an option to ignore deletion vectors on a Delta Lake table read instead of failing.
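A hedged usage sketch; the actual parameter name is not given in this description, so `ignore_deletion_vectors` below is hypothetical:

```python
import daft

# Read a Delta Lake table, skipping deletion vectors rather than erroring.
# NOTE: the keyword argument name is assumed for illustration only.
df = daft.read_deltalake("s3://bucket/path/to/table", ignore_deletion_vectors=True)
df.show()
```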

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->
colin-ho added a commit that referenced this pull request Dec 15, 2025
## Changes Made
Before this, if there were empty JSON/JSONL files in a path, `read_json` would fail.

Now, a new parameter named `skip_empty_files` lets users decide whether to skip the empty files and continue execution.
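A hedged usage sketch of the new parameter (exact signature details assumed):

```python
import daft

# Skip any empty JSON/JSONL files under the glob instead of erroring out.
df = daft.read_json("data/*.jsonl", skip_empty_files=True)
df.show()
```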

<!-- Describe what changes were made and why. Include implementation
details if necessary. -->

## Related Issues
#5646
<!-- Link to related GitHub issues, e.g., "Closes #123" -->

---------

Co-authored-by: cancai <[email protected]>
Co-authored-by: Colin Ho <[email protected]>
universalmind303 added a commit that referenced this pull request Dec 16, 2025
## Changes Made

adds a `values` function on primitive arrays to get a `ScalarBuffer`.
This allows us to iterate over only the values.

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->
colin-ho added a commit that referenced this pull request Dec 16, 2025
## Changes Made

Types the `**options` param in the model APIs using TypedDicts. Each model API now has its own specific TypedDict that indicates a subset of the possible options that can be passed, providing better type hints for users and their IDEs.

For most of the model APIs, the options include `max_retries`, `on_error`, and `batch_size`, which are subsequently passed into the UDF.

Provider-specific options can be subclassed; for example, `OpenAIPromptOptions` is a subclass of `PromptOptions` and has an additional field, `use_chat_completions`, which is specific to OpenAI prompting.
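A minimal sketch of the TypedDict pattern described above; the field types and allowed values are assumed from the option names:

```python
from typing import Literal, TypedDict

class PromptOptions(TypedDict, total=False):
    max_retries: int
    on_error: Literal["raise", "null"]  # assumed values, for illustration
    batch_size: int

class OpenAIPromptOptions(PromptOptions, total=False):
    use_chat_completions: bool
```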

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->
universalmind303 pushed a commit that referenced this pull request Dec 16, 2025
## Changes Made
## Overview
The following diagram describes the core workflow of the new add_columns
feature. When a user calls write_lance(mode='merge'), the system checks
for schema changes:
- No new columns: Falls back to the existing append mode.
- New columns present: Triggers the new row-level merge workflow, which
groups data by fragment_id and efficiently merges new columns within
each fragment using a join key (e.g., _rowaddr), ultimately committing
only metadata changes.

<img width="1016" height="2786"
alt="whiteboard_exported_image-已去除(lightpdf cn)"
src="https://github.com/user-attachments/assets/c7f2a3c6-ac3b-419e-bb61-7d5e540d302e"
/>

# Core Interface Changes
This optimization introduces new APIs and extends existing interfaces.
The core changes are as follows:
## 1. DataFrame.write_lance()
The write_lance method adds a merge mode to implement efficient schema
evolution.
- Mode: mode: Literal["create", "append", "overwrite", "merge"] =
"create"
- New merge mode behavior:
  1. Dataset does not exist: Behaves like create mode.
  2. Dataset exists but no new columns: Behaves like append mode.
3. Dataset exists with new columns: Triggers the row-level merge
workflow. In this mode, the DataFrame must contain fragment_id and a
join key (defaults to _rowaddr or as specified by left_on/right_on).
- UT Example (tests/io/lancedb/test_lance_merge_evolution.py):
```
# Read existing data, including fragment_id and _rowaddr
df_loaded = daft.read_lance(
    lance_dataset_path, 
    default_scan_options={"with_row_address": True}, 
    include_fragment_id=True
)
# Derive a new column
df_with_new_col = df_loaded.with_column("double_lat", daft.col("lat") * 2)
# Write with merge mode; Daft handles column merging automatically
df_with_new_col.write_lance(lance_dataset_path, mode="merge")

```

## 2. daft.io.lance.read_lance()
read_lance adds the include_fragment_id parameter to expose each
fragment's ID as a new column (fragment_id) to the user upon reading.
This is a key prerequisite for row-level merging.
- New Parameter: include_fragment_id: Optional[bool] = None
- UT Example (tests/io/lancedb/test_lance_merge_evolution.py):
```
df = daft.read_lance(
    lance_dataset_path, 
    default_scan_options={"with_row_address": True}, 
    include_fragment_id=True
)
assert "fragment_id" in df.column_names
```

## 3. daft.io.lance.merge_columns_df()
This is a new low-level API that provides a more flexible,
DataFrame-based row-level column merge capability.
write_lance(mode='merge') internally wraps this function.
- Core Functionality: Receives a DataFrame containing fragment_id, a
join key, and new columns, and merges it efficiently into an existing
Lance dataset.
- UT Example (tests/io/lancedb/test_lance_merge_evolution.py):
```
# Prepare a DataFrame with only fragment_id, join key, and new columns
df_subset = df_loaded.with_column("double_lat", daft.col("lat") * 2)\
                     .select("_rowaddr", "fragment_id", "double_lat")

# Call the low-level API to perform the merge
daft.io.lance.merge_columns_df(
    df_subset,
    lance_dataset_path,
    read_columns=["_rowaddr", "double_lat"],
)

```
## 4. Interface Comparison
Both interfaces support column merging, but their purpose and usage
differ:
- Interface 1: DataFrame.write_lance(mode="merge") — A writer-side
merge; automatically detects new columns; requires fragment_id and a
join key; returns write statistics metadata.
- Interface 2: daft.io.lance.merge_columns_df() — A DataFrame-based
row-level merge; explicitly takes a DataFrame with only fragment_id,
join key, and new columns; returns None; allows for finer control via
read_columns and batch_size.
Usage Recommendation:
- For end-to-end write pipelines with more intuitive semantics, prefer
write_lance(mode="merge").
- For complex DataFrame transformations or custom column reading before
writing, use merge_columns_df().

# Deprecation Plan
With the introduction of the new `write_lance(mode='merge')` and
`daft.io.lance.merge_columns_df`, the old `daft.io.lance.merge_columns`
interface has been marked as deprecated.
- Reason for Deprecation: The old interface was based on a
fragment-level transformation function, which did not support DataFrames
that had undergone complex operations (like joins, groupbys, etc.),
limiting its flexibility and use cases.
- Future Plan: `daft.io.lance.merge_columns` will be removed in a future
release. All users are advised to migrate to the new
`write_lance(mode='merge')` interface, which provides a more powerful and
user-friendly column extension capability.

<!-- Describe what changes were made and why. Include implementation
details if necessary. -->

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->
colin-ho added a commit that referenced this pull request Dec 17, 2025
## Changes Made

Adds a `RetryAfterException` that can be raised from a UDF, which will
supply the UDF retry mechanism with a delay to sleep.

This allows us to respect the retry-afters provided in headers from 429
and 503 errors from clients, such as our openai and google clients in
the model apis.

By default, all of our model apis should already be configured with
max_retries = 3, and our exponential backoff has an initial delay of
100ms. It might be worth exploring in the future how we may want to more
easily expose these configurations to users.

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->
colin-ho added a commit that referenced this pull request Dec 17, 2025
## Changes Made
Add support for `Series[start:end]` slicing.
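A hedged sketch of the new slicing behavior, using the `Series.from_pylist` constructor:

```python
from daft.series import Series

s = Series.from_pylist([1, 2, 3, 4, 5])
# Slicing with start:end returns the selected range of the Series.
print(s[1:4].to_pylist())  # [2, 3, 4]
```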

<!-- Describe what changes were made and why. Include implementation
details if necessary. -->

## Related Issues
#4771 

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

Co-authored-by: wangzheyan <[email protected]>
Co-authored-by: Colin Ho <[email protected]>
universalmind303 pushed a commit that referenced this pull request Dec 17, 2025
## Changes Made

<!-- Describe what changes were made and why. Include implementation
details if necessary. -->

Change "We <3 developers!" to "We ❤️ developers!". I know this is a
rather interesting expression, but some customers have interpreted this
sentence as "Daft has no more than 3 full-time developers", mistakenly
thinking that Daft is a personal project and thus being hesitant to
adopt it in production easily.

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

Signed-off-by: plotor <[email protected]>
universalmind303 pushed a commit that referenced this pull request Dec 17, 2025
## Changes Made

Try to overwrite files via native IO instead of PyArrow FS; this makes it
convenient to extend to new sources.

This keeps the same logic as the Python `overwrite_files` code, even though
there might be an overwrite issue, as mentioned in
#5700; a separate PR should handle it if it proves worthwhile.

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->
kevinzwang pushed a commit that referenced this pull request Dec 17, 2025
## Changes Made

<!-- Describe what changes were made and why. Include implementation
details if necessary. -->

### Extended SQL LIKE Pattern Matching in MemoryCatalog
-  Upgrade `sqlparser` to extract namespace from query
- Translate SQL LIKE patterns to regex
(`src/daft-catalog/src/pattern.rs`)
- Supports standard SQL LIKE wildcards: `%` (zero or more), `_` (exactly
one), `\` (escape)
- Updated documentation on `SHOW TABLES` syntax and pattern behavior

### Testing
- Added pattern matching tests to `test_sql_show_tables.py` and in
`src/daft-catalog/pattern.rs`
- `cargo test -p daft-catalog pattern::`
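For reference, an illustrative Python equivalent of the LIKE-to-regex translation implemented in `src/daft-catalog/src/pattern.rs` (the real code is Rust; this only mirrors the wildcard rules listed above):

```python
import re

def like_to_regex(pattern: str) -> str:
    out, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))  # escaped char is matched literally
            i += 2
            continue
        if ch == "%":
            out.append(".*")   # zero or more characters
        elif ch == "_":
            out.append(".")    # exactly one character
        else:
            out.append(re.escape(ch))
        i += 1
    return "^" + "".join(out) + "$"

assert re.match(like_to_regex(r"events\_202%"), "events_2024")
assert not re.match(like_to_regex(r"events\_202%"), "eventsX2024")
```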

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->
Closes #4461
Closes #4007 

## Checklist

- [x] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly
colin-ho added a commit that referenced this pull request Dec 18, 2025
## Changes Made

Set the default ImageMode for decode_image to RGB. Currently, the
behavior is to infer the image mode at a per-image level, which means we
could have 8-bit and 16-bit images in the same column, which will error
because the underlying datatypes (uint8 vs uint16) are different.

The solution to this is to force decode into a specific image mode. This
PR simply makes that the default.
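A hedged sketch of forcing a uniform decode mode with the expression API; the exact parameter spelling (`mode=`) and enum location are assumed, and the paths are made up:

```python
import daft
from daft import col

df = daft.from_pydict({"path": ["images/a.png", "images/b.png"]})  # made-up paths
df = df.with_column("bytes", col("path").url.download())
# Decoding everything to RGB avoids mixing uint8 and uint16 images in one column.
df = df.with_column("image", col("bytes").image.decode(mode=daft.ImageMode.RGB))
```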

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->
colin-ho pushed a commit that referenced this pull request Dec 18, 2025
## Changes Made

Since we now support overwriting files and writing empty files via native IO
(from #5728 and #5682), these two methods are no longer needed.

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->
universalmind303 added a commit that referenced this pull request Dec 18, 2025
## Changes Made

gets field and dtype roundtrips working.

Note, since arrow-rs uses fields to store extension data, converting a
datatype with extensions to arrow will fail by design. Instead you must
operate on the field to preserve extension metadata.

the following conversions are possible.
- DaftField -> ArrowField
- ArrowField -> DaftField 
- DaftDataType -> ArrowDataType where DaftDataType != Extension
- ArrowDataType -> DaftDataType
- ArrowField -> DaftDataType

All of these are fallible operations. 


## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
colin-ho added a commit that referenced this pull request Dec 18, 2025
## Changes Made

This was failing in
https://github.com/Eventual-Inc/Daft/actions/runs/20347885747/job/58464760614
because of the failed assertion.

I believe this is due to
https://github.com/Eventual-Inc/Daft/pull/5676/changes#diff-a77514cfe53156c701c4b3b7426dc9d2c0d257deb52dae53fba976f51fcf8daf,
which changed the buffer to be popped one item at a time instead of as many as
possible, so the assertion is no longer needed.

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->
colin-ho pushed a commit that referenced this pull request Dec 18, 2025
#5845)

## Changes Made

The value of `Estimated Scan Bytes` comes from the on-disk file size. If there
are pushdowns on the source, this is confusing, especially when analyzing the
plan and trying to optimize a Daft job.

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->
colin-ho pushed a commit that referenced this pull request Dec 18, 2025
## Changes Made

1. Support configuring the log format and date format.
2. Enable all loggers by default.
3. Modify the documentation for easier setup.

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->
kevinzwang pushed a commit that referenced this pull request Dec 19, 2025
## Changes Made

1. Implement `TosMultipartWriter`.
2. Support the native writer via TOS.

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->
colin-ho pushed a commit that referenced this pull request Dec 19, 2025
#5855)

Namespace the `RemoteFlotillaRunner` Ray actor name per Ray job /
process to avoid cross-client plan ID collisions when multiple Daft
clients connect to the same Ray cluster.

- Introduce a cached suffix used to namespace the flotilla plan runner
actor name per Ray job / process.
- Derive the suffix from the Ray runtime context `job_id` when
available.
- If no job identifier can be determined, fall back to a per-process
UUID so the actor name remains unique across different Python processes.
- Update `FlotillaRunner` to construct the Ray actor with the new
per-job actor name while keeping the `namespace="daft"` and
`get_if_exists=True` semantics unchanged.
- Add comments explaining the root cause: `DistributedPhysicalPlan` /
`QueryIdx` uses a per-process counter, so plan IDs like `"0"` can
collide across different clients if they share the same named actor
instance.
- Add a lightweight unit test that verifies the flotilla actor name is
derived from the base name plus a job-specific suffix and remains stable
within a process.
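A hedged sketch of the naming scheme the bullets above describe; the helper name and exact fallback are illustrative, not the actual implementation:

```python
import uuid
import ray

def flotilla_actor_name(base: str = "RemoteFlotillaRunner") -> str:
    try:
        # Prefer the Ray job ID so every client/job gets its own actor name.
        job_id = ray.get_runtime_context().get_job_id()
    except Exception:
        job_id = None
    # Fall back to a per-process UUID when no job ID is available.
    suffix = job_id or uuid.uuid4().hex
    return f"{base}-{suffix}"
```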

## Changes Made

<!-- Describe what changes were made and why. Include implementation
details if necessary. -->

## Related Issues
#5856
<!-- Link to related GitHub issues, e.g., "Closes #123" -->
universalmind303 added a commit that referenced this pull request Dec 19, 2025
## Changes Made

adds the following methods to most array implementations
```rs
// convert to type erased ArrayRef
// all arrays implement this
fn to_arrow(&self) -> DaftResult<ArrayRef>;
// convert to a concrete type which varies for each array type
// such as DataArray<Int8Type> -> arrow::array::Int8Array
// most arrays implement this, but not all of them. 
// this is also true for the original `as_arrow` (now `as_arrow2`) that this is intended to replace.
fn as_arrow(&self) -> DaftResult<Self::ArrowOutput>
```

and `from` methods.

```rs
// this is also implemented for all arrays. 
fn from_arrow<F: Into<Arc<Field>>>(field: F, array: ArrayRef) -> DaftResult<Self>;
```

## Note for reviewers
the bulk of the changes is in the following
- `src/daft-core/src/array/ops/from_arrow.rs`
- `src/daft-core/src/datatypes/logical.rs`
- `src/daft-core/src/array/*`






## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
universalmind303 pushed a commit that referenced this pull request Dec 22, 2025
… guide (#5865)

## Changes Made

<!-- Describe what changes were made and why. Include implementation
details if necessary. -->

Add instructions on the basic usage of Daft Dashboard in the development
guide to reduce the learning cost for developers.

Daft Dashboard is currently in the rapid development stage. There will
be a dedicated article or page to introduce its functions and usage in
detail later, but it is more appropriate to include it in the
development guide at this stage.

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->

Signed-off-by: plotor <[email protected]>
colin-ho pushed a commit that referenced this pull request Dec 22, 2025
…5864)

## Changes Made
The integration test
`TestIcebergCountPushdown.test_count_pushdown_with_delete_files` was failing
for the `test_overlapping_deletes` table because it incorrectly enabled count
pushdown.

The root cause was that the initial Spark write created multiple small data
files. Subsequent DELETE operations were optimized by Iceberg to mark entire
data files as removed instead of generating position/equality delete files. As
a result, Daft's `_has_delete_files()` check did not find any delete files and
incorrectly allowed the count pushdown optimization.

This PR fixes the test by adding `coalesce(1)` to the Spark DataFrame before
writing the initial data for the `test_overlapping_deletes` table. This ensures
the data is written to a single Parquet file, forcing subsequent DELETE
operations to generate actual delete files. This aligns the test's behavior
with its intent, correctly disabling count pushdown when delete files are
present.

## Related Issues
#5863
<!-- Link to related GitHub issues, e.g., "Closes #123" -->

---------

Co-authored-by: root <root@bytedance>
Co-authored-by: huleilei <huleilei@bytedance>
srilman pushed a commit that referenced this pull request Dec 28, 2025
## Changes Made

Bump `mypy` to 1.19.1 and `ruff` to 0.14.10 and resolve pre-commit
errors.

Rationale: my local `ruff` version is much higher than that in the
project `pre-commit`, so I am getting a lot of linter warnings in my
IDE. It is also about time to upgrade lints to utilize newer Python
language features.

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes #123" -->