-
Notifications
You must be signed in to change notification settings - Fork 356
Fixed the limit bug and added test for count() method and documentation for count() #2423
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Overall looks good to me! left some comments
tests/table/test_count.py
Outdated
|
||
def test_count_basic(): | ||
# Create a mock table with the necessary attributes | ||
table = Mock(spec=DataScan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: We should call this variable scan
rather than table
since we are mocking a DataScan
object
tests/table/test_count.py
Outdated
|
||
def test_count_empty(): | ||
# Create a mock table with the necessary attributes | ||
table = Mock(spec=DataScan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here to rename to scan
tests/table/test_count.py
Outdated
|
||
def test_count_large(): | ||
# Create a mock table with the necessary attributes | ||
table = Mock(spec=DataScan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and here
|
||
Count all rows in a table: | ||
|
||
```python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be worth mentioning as a note that we could get the total count of a table from snapshot properties doing this:
table.current_snapshot().summary.additional_properties["total-records"]
so users can avoid doing a full table scan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the comments, I will work on them 😊
from pyiceberg.expressions import AlwaysTrue | ||
|
||
|
||
class DummyFile: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could write real data files and use that for testing wdyt?
Here are some fixtures we could use to get a FileScanTask
with a file with some rows in it: example
Maybe we can also add some more fixtures to get FileScanTasks for empty files and large ones
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, it will be a good addition actually.
mkdocs/docs/recipe-count.md
Outdated
# Count rows with population > 1,000,000 | ||
large_cities = table.scan().filter(GreaterThan("population", 1000000)).count() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using the SQL like expressions is easier to read:
# Count rows with population > 1,000,000 | |
large_cities = table.scan().filter(GreaterThan("population", 1000000)).count() | |
large_cities = table.scan().filter("population > 1000000").count() |
Thank you @kaustuvnandy for adding this documentation and test cases. Great First PR! 😄 |
Fix DataScan.count() limit parameter and add comprehensive unit tests
Rationale for this change
This PR fixes Issue #2121 where the
count()
method in PyIceberg'sDataScan
class was not respecting thelimit
parameter, causing scans to process more data than necessary. Additionally, it introduces comprehensive unit tests to ensure reliable row counting functionality across different scenarios.The changes address:
limit
parameter with early terminationThese changes improve performance, fix incorrect behavior, and provide confidence in the count operation's correctness, which is essential for data validation and analytics workflows.
Implementation Details
Core Bug Fix (pyiceberg/table/init.py)
Fixed the
DataScan.count()
method to properly handle thelimit
parameter:Key improvements:
ArrowScan
instancesComprehensive Test Coverage
The tests use mocking to simulate different table states and file planning scenarios, plus integration tests for end-to-end validation:
Test Coverage:
test_count_basic()
: Validates counting with a single file task containing 42 recordstest_count_empty()
: Ensures proper handling of empty tables (0 records)test_count_large()
: Tests aggregation across multiple file tasks (1M+ records)test_count_with_limit_mock()
: NEW - Validates limit parameter with early termination using mockstest_datascan_count_respects_limit()
: NEW - Integration test verifying limit behavior with real table operationsEnhanced Documentation (mkdocs/docs/recipe-count.md)
Added comprehensive documentation featuring:
"population > 1000000"
instead ofGreaterThan("population", 1000000)
)Are these changes tested?
Yes, this PR adds comprehensive unit tests with both mocked dependencies and integration tests. All 5 tests pass, validating:
Are there any user-facing changes?
Yes - This includes a bug fix that changes user-visible behavior:
Fixed Behavior
table.scan().limit(N).count()
ignored the limit and counted all rowstable.scan().limit(N).count()
properly stops counting at N rowsPerformance Improvements
Enhanced Documentation
Backward Compatibility
count()
without limit continues to work unchanged