
Conversation


@kaustuvnandy kaustuvnandy commented Sep 3, 2025

Fix DataScan.count() limit parameter and add comprehensive unit tests

Rationale for this change

This PR fixes Issue #2121 where the count() method in PyIceberg's DataScan class was not respecting the limit parameter, causing scans to process more data than necessary. Additionally, it introduces comprehensive unit tests to ensure reliable row counting functionality across different scenarios.

The changes address:

  • Bug fix: DataScan.count() now properly respects the limit parameter with early termination
  • Performance improvement: Stops processing additional files once the count limit is satisfied
  • Test coverage: Validates the method's behavior when counting rows in tables with data, handling empty tables, and processing large datasets
  • Documentation: Comprehensive recipe documentation with SQL-like expressions and best practices

These changes improve performance, fix incorrect behavior, and provide confidence in the count operation's correctness, which is essential for data validation and analytics workflows.

Implementation Details

Core Bug Fix (`pyiceberg/table/__init__.py`)

Fixed the DataScan.count() method to properly handle the limit parameter:

```python
# Added proper limit handling with early termination
if self.limit is not None and res >= self.limit:
    break  # Stop processing more tasks when limit reached

# Pass remaining limit to ArrowScan operations
arrow_scan = ArrowScan(..., limit=(self.limit - res) if self.limit else None)
```

Key improvements:

  • Early termination: Stops processing tasks once limit is reached
  • Limit propagation: Passes remaining limit to ArrowScan instances
  • Performance optimization: Reduces unnecessary I/O operations on large tables
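The mechanics above can be sketched as a small self-contained loop (the task and field names here are stand-ins for illustration, not PyIceberg's actual internals):

```python
# Hypothetical sketch of count() with early termination.
# `tasks` stands in for planned file tasks, each with a known record count.
def count_with_limit(tasks, limit=None):
    """Accumulate per-task record counts, stopping once `limit` is reached."""
    res = 0
    for task in tasks:
        if limit is not None and res >= limit:
            break  # early termination: no need to open more files
        task_count = task["record_count"]
        if limit is not None:
            # Each task contributes at most the remaining budget,
            # mirroring the remaining limit passed to ArrowScan.
            task_count = min(task_count, limit - res)
        res += task_count
    return res
```

For example, `count_with_limit([{"record_count": 30}, {"record_count": 30}], limit=40)` stops mid-way through the second task and returns 40.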

Comprehensive Test Coverage

The tests use mocking to simulate different table states and file planning scenarios, plus integration tests for end-to-end validation:

Test Coverage:

  • test_count_basic(): Validates counting with a single file task containing 42 records
  • test_count_empty(): Ensures proper handling of empty tables (0 records)
  • test_count_large(): Tests aggregation across multiple file tasks (1M+ records)
  • test_count_with_limit_mock(): NEW - Validates limit parameter with early termination using mocks
  • test_datascan_count_respects_limit(): NEW - Integration test verifying limit behavior with real table operations
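A rough sketch of what the mock-based limit test might look like; the scan attributes (`limit`, `plan_files`, `record_count`) are stand-ins for illustration rather than PyIceberg's exact API:

```python
from unittest.mock import Mock

def test_count_with_limit_mock():
    # Stand-in scan object; the real test would use Mock(spec=DataScan).
    scan = Mock()
    scan.limit = 50
    # Three file tasks of 40 records each; with limit=50 only two should be read.
    scan.plan_files.return_value = [Mock(record_count=40) for _ in range(3)]

    res = 0
    files_read = 0
    for task in scan.plan_files():
        if scan.limit is not None and res >= scan.limit:
            break  # early termination under test
        files_read += 1
        res += min(task.record_count, scan.limit - res)

    assert res == 50
    assert files_read == 2  # the third file is never processed
```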

Enhanced Documentation (mkdocs/docs/recipe-count.md)

Added comprehensive documentation featuring:

  • SQL-like expressions: Simplified filter syntax ("population > 1000000" instead of GreaterThan("population", 1000000))
  • Limit functionality: Examples and performance benefits of using count with limits
  • Best practices: When to use limits, existence checks, and monitoring use cases
  • Performance optimization: Tips for using snapshot properties for fastest total counts
  • Real-world examples: Practical scenarios for data validation and analytics

Are these changes tested?

Yes, this PR adds comprehensive unit tests with both mocked dependencies and integration tests. All 5 tests pass, validating:

  • The bug fix for limit parameter handling
  • Existing count functionality across multiple scenarios
  • Edge cases like empty tables and large datasets
  • Integration with real table operations

Are there any user-facing changes?

Yes - This includes a bug fix that changes user-visible behavior:

Fixed Behavior

  • Before: table.scan().limit(N).count() ignored the limit and counted all rows
  • After: table.scan().limit(N).count() properly stops counting at N rows

Performance Improvements

  • Limited counting operations are now significantly faster on large tables
  • Early termination reduces unnecessary file processing and I/O operations
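The I/O saving can be illustrated with a small self-contained simulation (file sizes here are made up) comparing how many files each strategy opens:

```python
def files_touched(file_record_counts, limit=None):
    """Return (count, files_opened) when counting with optional early termination."""
    total, opened = 0, 0
    for n in file_record_counts:
        if limit is not None and total >= limit:
            break  # remaining files are never opened
        opened += 1
        total += n if limit is None else min(n, limit - total)
    return total, opened

# A table of 1000 files with 10,000 rows each:
files = [10_000] * 1000
full = files_touched(files)               # opens all 1000 files
limited = files_touched(files, limit=25_000)  # opens only 3 files
```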

Enhanced Documentation

  • Comprehensive count recipe with SQL-like expressions for better readability
  • Performance optimization guidance and best practices
  • Real-world examples for data validation and monitoring workflows

Backward Compatibility

  • Existing code using count() without limit continues to work unchanged
  • No breaking changes to the existing API

Contributor

@gabeiglio gabeiglio left a comment


Thanks for the PR! Overall looks good to me! left some comments


```python
def test_count_basic():
    # Create a mock table with the necessary attributes
    table = Mock(spec=DataScan)
```
Contributor


nit: We should call this variable scan rather than table since we are mocking a DataScan object


```python
def test_count_empty():
    # Create a mock table with the necessary attributes
    table = Mock(spec=DataScan)
```
Contributor


same here to rename to scan


```python
def test_count_large():
    # Create a mock table with the necessary attributes
    table = Mock(spec=DataScan)
```
Contributor


and here


Count all rows in a table:

Contributor


It could be worth mentioning as a note that we could get the total count of a table from the snapshot properties:
`table.current_snapshot().summary.additional_properties["total-records"]`

so users can avoid doing a full table scan

Author


Thank you for the comments, I will work on them 😊

```python
from pyiceberg.expressions import AlwaysTrue


class DummyFile:
```
Contributor


I think we could write real data files and use that for testing wdyt?

Here are some fixtures we could use to get a FileScanTask with a file with some rows in it: example

Maybe we can also add some more fixtures to get FileScanTasks for empty files and large ones

Author


Yep, it will be a good addition actually.

Comment on lines 40 to 41
```python
# Count rows with population > 1,000,000
large_cities = table.scan().filter(GreaterThan("population", 1000000)).count()
```
Contributor

@Fokko Fokko Sep 5, 2025


I think using the SQL like expressions is easier to read:

Suggested change:

```diff
-# Count rows with population > 1,000,000
-large_cities = table.scan().filter(GreaterThan("population", 1000000)).count()
+large_cities = table.scan().filter("population > 1000000").count()
```

@kaustuvnandy kaustuvnandy changed the title from "Added test for count() method and documentation for count()" to "Fixed the limit bug and added test for count() method and documentation for count()" on Sep 5, 2025
@tushar-choudhary-tc

tushar-choudhary-tc commented Sep 8, 2025

Thank you @kaustuvnandy for adding this documentation and test cases. Great First PR! 😄
