
Conversation


@kaustuvnandy kaustuvnandy commented Sep 3, 2025

Fix DataScan.count() limit parameter and add comprehensive unit tests

Rationale for this change

This PR fixes Issue #2121 where the count() method in PyIceberg's DataScan class was not respecting the limit parameter, causing scans to process more data than necessary. Additionally, it introduces comprehensive unit tests to ensure reliable row counting functionality across different scenarios.

The changes address:

  • Bug fix: DataScan.count() now properly respects the limit parameter with early termination
  • Performance improvement: Stops processing additional files once the count limit is satisfied
  • Test coverage: Validates the method's behavior when counting rows in tables with data, handling empty tables, and processing large datasets
  • Documentation: Comprehensive recipe documentation with SQL-like expressions and best practices

These changes improve performance, fix incorrect behavior, and provide confidence in the count operation's correctness, which is essential for data validation and analytics workflows.

Implementation Details

Core Bug Fix (`pyiceberg/table/__init__.py`)

Fixed the DataScan.count() method to properly handle the limit parameter:

```python
# Added proper limit handling with early termination
if self.limit is not None and res >= self.limit:
    break  # Stop processing more tasks when limit reached

# Pass remaining limit to ArrowScan operations
arrow_scan = ArrowScan(..., limit=(self.limit - res) if self.limit else None)
```

Key improvements:

  • Early termination: Stops processing tasks once limit is reached
  • Limit propagation: Passes remaining limit to ArrowScan instances
  • Performance optimization: Reduces unnecessary I/O operations on large tables
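The mechanics above can be sketched as a small self-contained loop (the task and field names here are stand-ins for illustration, not PyIceberg's actual internals):

```python
# Hypothetical sketch of count() with early termination.
# `tasks` stands in for planned file tasks, each with a known record count.
def count_with_limit(tasks, limit=None):
    """Accumulate per-task record counts, stopping once `limit` is reached."""
    res = 0
    for task in tasks:
        if limit is not None and res >= limit:
            break  # early termination: no need to open more files
        task_count = task["record_count"]
        if limit is not None:
            # Each task contributes at most the remaining budget,
            # mirroring the remaining limit passed to ArrowScan.
            task_count = min(task_count, limit - res)
        res += task_count
    return res
```

For example, `count_with_limit([{"record_count": 30}, {"record_count": 30}], limit=40)` stops mid-way through the second task and returns 40.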

Comprehensive Test Coverage

The tests use mocking to simulate different table states and file planning scenarios, plus integration tests for end-to-end validation:

Test Coverage:

  • test_count_basic(): Validates counting with a single file task containing 42 records
  • test_count_empty(): Ensures proper handling of empty tables (0 records)
  • test_count_large(): Tests aggregation across multiple file tasks (1M+ records)
  • test_count_with_limit_mock(): NEW - Validates limit parameter with early termination using mocks
  • test_datascan_count_respects_limit(): NEW - Integration test verifying limit behavior with real table operations
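A rough sketch of what the mock-based limit test might look like; the scan attributes (`limit`, `plan_files`, `record_count`) are stand-ins for illustration rather than PyIceberg's exact API:

```python
from unittest.mock import Mock

def test_count_with_limit_mock():
    # Stand-in scan object; the real test would use Mock(spec=DataScan).
    scan = Mock()
    scan.limit = 50
    # Three file tasks of 40 records each; with limit=50 only two should be read.
    scan.plan_files.return_value = [Mock(record_count=40) for _ in range(3)]

    res = 0
    files_read = 0
    for task in scan.plan_files():
        if scan.limit is not None and res >= scan.limit:
            break  # early termination under test
        files_read += 1
        res += min(task.record_count, scan.limit - res)

    assert res == 50
    assert files_read == 2  # the third file is never processed
```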

Enhanced Documentation (mkdocs/docs/recipe-count.md)

Added comprehensive documentation featuring:

  • SQL-like expressions: Simplified filter syntax ("population > 1000000" instead of GreaterThan("population", 1000000))
  • Limit functionality: Examples and performance benefits of using count with limits
  • Best practices: When to use limits, existence checks, and monitoring use cases
  • Performance optimization: Tips for using snapshot properties for fastest total counts
  • Real-world examples: Practical scenarios for data validation and analytics

Are these changes tested?

Yes, this PR adds comprehensive unit tests with both mocked dependencies and integration tests. All 5 tests pass, validating:

  • The bug fix for limit parameter handling
  • Existing count functionality across multiple scenarios
  • Edge cases like empty tables and large datasets
  • Integration with real table operations

Are there any user-facing changes?

Yes - This includes a bug fix that changes user-visible behavior:

Fixed Behavior

  • Before: table.scan().limit(N).count() ignored the limit and counted all rows
  • After: table.scan().limit(N).count() properly stops counting at N rows

Performance Improvements

  • Limited counting operations are now significantly faster on large tables
  • Early termination reduces unnecessary file processing and I/O operations
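The I/O saving can be illustrated with a small self-contained simulation (file sizes here are made up) comparing how many files each strategy opens:

```python
def files_touched(file_record_counts, limit=None):
    """Return (count, files_opened) when counting with optional early termination."""
    total, opened = 0, 0
    for n in file_record_counts:
        if limit is not None and total >= limit:
            break  # remaining files are never opened
        opened += 1
        total += n if limit is None else min(n, limit - total)
    return total, opened

# A table of 1000 files with 10,000 rows each:
files = [10_000] * 1000
full = files_touched(files)               # opens all 1000 files
limited = files_touched(files, limit=25_000)  # opens only 3 files
```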

Enhanced Documentation

  • Comprehensive count recipe with SQL-like expressions for better readability
  • Performance optimization guidance and best practices
  • Real-world examples for data validation and monitoring workflows

Backward Compatibility

  • Existing code using count() without limit continues to work unchanged
  • No breaking changes to the existing API

Contributor

@gabeiglio gabeiglio left a comment


Thanks for the PR! Overall looks good to me! left some comments


```python
def test_count_basic():
    # Create a mock table with the necessary attributes
    table = Mock(spec=DataScan)
```
Contributor


nit: We should call this variable scan rather than table since we are mocking a DataScan object


```python
def test_count_empty():
    # Create a mock table with the necessary attributes
    table = Mock(spec=DataScan)
```
Contributor


same here to rename to scan


```python
def test_count_large():
    # Create a mock table with the necessary attributes
    table = Mock(spec=DataScan)
```
Contributor


and here


Count all rows in a table:

Contributor


It could be worth mentioning as a note that we could get the total count of a table from the snapshot properties:
`table.current_snapshot().summary.additional_properties["total-records"]`

so users can avoid doing a full table scan

Author


Thank you for the comments, I will work on them 😊

```python
from pyiceberg.expressions import AlwaysTrue


class DummyFile:
```
Contributor


I think we could write real data files and use that for testing wdyt?

Here are some fixtures we could use to get a FileScanTask with a file with some rows in it: example

Maybe we can also add some more fixtures to get FileScanTasks for empty files and large ones

Author


Yep, it will be a good addition actually.

Comment on lines 40 to 41
```python
# Count rows with population > 1,000,000
large_cities = table.scan().filter(GreaterThan("population", 1000000)).count()
```
Contributor

@Fokko Fokko Sep 5, 2025


I think using the SQL like expressions is easier to read:

Suggested change:

```diff
-# Count rows with population > 1,000,000
-large_cities = table.scan().filter(GreaterThan("population", 1000000)).count()
+large_cities = table.scan().filter("population > 1000000").count()
```

@kaustuvnandy kaustuvnandy changed the title from "Added test for count() method and documentation for count()" to "Fixed the limit bug and added test for count() method and documentation for count()" on Sep 5, 2025
@tushar-choudhary-tc

tushar-choudhary-tc commented Sep 8, 2025

Thank you @kaustuvnandy for adding this documentation and test cases. Great First PR! 😄
