Enhance partition_by to support strings #1191

dmpetrov · 2025-06-28T21:48:50Z

This pull request contains changes generated by Cursor background composer.

Summary by Sourcery

Enable string-based notation for partition_by in the agg method by updating type definitions, converting string column names to Column objects, and adding targeted tests. Also simplify the dataset chunking logic in query application.

New Features:

- Allow simple string and sequence of strings notation for the partition_by parameter in the agg method

Enhancements:

- Extend the PartitionByType union to include str and sequences of str, Function, or ColumnElement
- Refactor dataset chunking in apply_steps to use a dedicated chunk method instead of manual slicing and filtering

Tests:

- Add unit tests verifying string and string-sequence support for partition_by in agg

Co-authored-by: dmitry <[email protected]>

sourcery-ai · 2025-06-28T21:48:54Z

Reviewer's Guide

Enable string-based partition_by in agg() by extending its type, adding string-to-Column conversion logic, simplifying query chunking, and introducing corresponding unit tests.

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Extended PartitionByType to include strings | Added `str` to the union of accepted types; allowed sequences containing `str` alongside existing types | `src/datachain/query/dataset.py` |
| Implemented string-to-Column conversion in agg() | Wrapped single string or Function into a list for uniform processing; converted each string element to a Column via ColumnMeta and schema lookup; handled existing Function and ColumnElement types transparently; used processed_partition_by when generating the UDF query | `src/datachain/lib/dc/datachain.py` |
| Replaced manual chunk filtering with query.chunk helper | Removed manual step limiting, filter injection, and step reordering; replaced with a single call to `query.chunk(index, total)` | `src/datachain/query/dataset.py` |
| Added unit tests for string-based partition_by | Created `test_agg_partition_by_string_notation` to verify single-string support; created `test_agg_partition_by_string_sequence` to verify multi-string support | `tests/unit/lib/test_datachain.py` |
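
The conversion described in the second row can be sketched roughly as follows. This is a minimal, library-free illustration, not the code from the PR: `Column` here is a hypothetical stand-in dataclass, `normalize_partition_by` is an invented name, the schema is a plain dict, and the mapping of nested names such as `file.path` to `file__path` is an assumption about what `ColumnMeta.to_db_name` does.

```python
from collections.abc import Sequence
from dataclasses import dataclass
from typing import Union

DELIMITER = "__"  # assumption: "file.path" is stored as "file__path"


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for the real Column class."""

    name: str
    type_: type


PartitionBy = Union[str, Column, Sequence[Union[str, Column]]]


def normalize_partition_by(
    partition_by: PartitionBy, schema: dict[str, type]
) -> list[Column]:
    """Return a flat list of Column objects for any accepted input shape."""
    # A bare str is itself a Sequence, so single values must be wrapped
    # before iterating, otherwise "key" would be split into characters.
    if isinstance(partition_by, (str, Column)):
        items = [partition_by]
    else:
        items = list(partition_by)

    columns = []
    for item in items:
        if isinstance(item, str):
            db_name = item.replace(".", DELIMITER)
            columns.append(Column(db_name, schema[db_name]))
        else:
            # Already a column-like object: pass it through unchanged.
            columns.append(item)
    return columns
```

With a schema such as `{"key": str, "file__path": str}`, both `normalize_partition_by("key", schema)` and a mixed list like `["key", Column("val", int), "file.path"]` produce a uniform list of columns, which is the property the later steps rely on.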



sourcery-ai

Hey @dmpetrov - I've reviewed your changes - here's some feedback:

- Extract the string-to-Column conversion in agg (and group_by) into a shared helper to reduce duplication and simplify maintenance.
- Add tests covering mixed-type partition_by sequences (strings, ColumnElements, Functions) and nested field names (e.g. 'file.path') to exercise the full PartitionByType union.
- Verify that replacing manual step manipulation with query.chunk in apply_steps preserves the original step ordering and filtering behavior in all cases.
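
For the third point, one property worth checking is that the chunks are disjoint and together cover every row. The sketch below shows that check on plain dicts. It assumes modulo-based selection on a row id; the conversation does not show how `query.chunk(index, total)` actually selects rows, and the `chunk` function and `sys_id` field here are illustrative only.

```python
def chunk(rows: list[dict], index: int, total: int) -> list[dict]:
    """Keep the rows that fall into bucket `index` out of `total` buckets."""
    if total <= 0 or not 0 <= index < total:
        raise ValueError(f"invalid chunk {index} of {total}")
    # Assumption: rows are assigned to buckets by id modulo total.
    return [row for row in rows if row["sys_id"] % total == index]


def chunks_partition_rows(rows: list[dict], total: int) -> bool:
    """True if the chunks are disjoint and their union equals the input."""
    seen = []
    for index in range(total):
        seen.extend(row["sys_id"] for row in chunk(rows, index, total))
    # No id appears twice, and no id is missing.
    return len(seen) == len(set(seen)) and sorted(seen) == sorted(
        row["sys_id"] for row in rows
    )
```

A comparable check against the real query objects would compare the rows returned before and after the refactoring for each `(index, total)` pair.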

Prompt for AI Agents

Please address the comments from this code review:
## Overall Comments
- Extract the string-to-Column conversion in agg (and group_by) into a shared helper to reduce duplication and simplify maintenance.
- Add tests covering mixed-type partition_by sequences (strings, ColumnElements, Functions) and nested field names (e.g. 'file.path') to exercise the full PartitionByType union.
- Verify that replacing manual step manipulation with query.chunk in apply_steps preserves the original step ordering and filtering behavior in all cases.

## Individual Comments

### Comment 1
<location> `src/datachain/lib/dc/datachain.py:823` </location>
<code_context>
+2. **Adding String Conversion Logic**: Modified the `agg` method to convert strings to `Column` objects before passing them to the underlying UDF steps, similar to how `group_by` handles strings:
+   ```python
+   # Convert string partition_by parameters to Column objects
+   if isinstance(col, str):
+       col_db_name = ColumnMeta.to_db_name(col)
+       col_type = self.signals_schema.get_column_type(col_db_name)
</code_context>

<issue_to_address>
Potential risk if column name is not present in signals_schema.

If get_column_type is called with a non-existent column, it may raise an exception. Please ensure this case is handled or provide a clear error message.
</issue_to_address>

### Comment 2
<location> `tests/unit/lib/test_datachain.py:3457` </location>
<code_context>
         )
+
+
+def test_agg_partition_by_string_notation(test_session):
+    """Test that agg method supports string notation for partition_by."""
+    class _ImageGroup(BaseModel):
+        name: str
+        size: int
+
+    def func(key, val) -> Iterator[tuple[File, _ImageGroup]]:
+        n = "-".join(key)
+        v = sum(val)
+        yield File(path=n), _ImageGroup(name=n, size=v)
+
+    keys = ["n1", "n2", "n1"]
+    values = [1, 5, 9]
+    
+    # Test using string notation (NEW functionality)
+    ds = dc.read_values(key=keys, val=values, session=test_session).agg(
+        x=func, partition_by="key"  # String notation instead of C("key")
+    )
+
+    assert ds.order_by("x_1.name").to_values("x_1.name") == ["n1-n1", "n2"]
+    assert ds.order_by("x_1.size").to_values("x_1.size") == [5, 10]
+
+
</code_context>

<issue_to_address>
Missing test for invalid string column in partition_by.

Please add a test where partition_by is set to a non-existent column name to verify proper error handling.
</issue_to_address>


sourcery-ai · 2025-06-28T21:49:48Z

src/datachain/lib/dc/datachain.py

+                if isinstance(col, str):
+                    col_db_name = ColumnMeta.to_db_name(col)
+                    col_type = self.signals_schema.get_column_type(col_db_name)
+                    column = Column(col_db_name, python_to_sql(col_type))
+                    processed_partition_columns.append(column)


issue (bug_risk): Potential risk if column name is not present in signals_schema.

If get_column_type is called with a non-existent column, it may raise an exception. Please ensure this case is handled or provide a clear error message.
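
One way to address this is to catch the lookup failure and re-raise it with a message that names the offending column and lists the valid ones. The sketch below uses a plain dict as the schema; `resolve_column_type` is a hypothetical helper, and which exception the real `get_column_type` raises for a missing column is not stated in this thread.

```python
def resolve_column_type(name: str, schema: dict[str, type]) -> type:
    """Look up a partition_by column, failing with a descriptive error."""
    # Assumption: nested names like "file.path" are stored as "file__path".
    db_name = name.replace(".", "__")
    try:
        return schema[db_name]
    except KeyError:
        available = ", ".join(sorted(schema))
        raise ValueError(
            f"partition_by column {name!r} not found in schema; "
            f"available columns: {available}"
        ) from None
```

Raising a single well-described error at this point also gives the suggested negative test something stable to assert on.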

sourcery-ai · 2025-06-28T21:49:48Z

tests/unit/lib/test_datachain.py

+def test_agg_partition_by_string_notation(test_session):
+    """Test that agg method supports string notation for partition_by."""
+    class _ImageGroup(BaseModel):
+        name: str
+        size: int
+
+    def func(key, val) -> Iterator[tuple[File, _ImageGroup]]:
+        n = "-".join(key)
+        v = sum(val)
+        yield File(path=n), _ImageGroup(name=n, size=v)


suggestion (testing): Missing test for invalid string column in partition_by.

Please add a test where partition_by is set to a non-existent column name to verify proper error handling.

cloudflare-workers-and-pages · 2025-06-28T21:50:12Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`c9028b5`
Status:	✅ Deploy successful!
Preview URL:	https://5ecb09d7.datachain-documentation.pages.dev
Branch Preview URL:	https://cursor-enhance-partition-by.datachain-documentation.pages.dev


codecov · 2025-07-04T06:39:43Z

Codecov Report

Attention: Patch coverage is 94.44444% with 1 line in your changes missing coverage. Please review.

Project coverage is 88.72%. Comparing base (dbd2c65) to head (c9028b5).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/datachain/lib/dc/datachain.py | 94.44% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1191   +/-   ##
=======================================
  Coverage   88.71%   88.72%           
=======================================
  Files         152      152           
  Lines       13557    13575   +18     
  Branches     1884     1889    +5     
=======================================
+ Hits        12027    12044   +17     
  Misses       1088     1088           
- Partials      442      443    +1

| Flag | Coverage Δ | |
| --- | --- | --- |
| datachain | `88.65% <94.44%> (+<0.01%)` | ⬆️ |


| Files with missing lines | Coverage Δ | |
| --- | --- | --- |
| src/datachain/query/dataset.py | `93.60% <ø> (ø)` | |
| src/datachain/lib/dc/datachain.py | `90.08% <94.44%> (+0.11%)` | ⬆️ |


Add string support for partition_by in agg method

f403154

Co-authored-by: dmitry <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

42344cf

for more information, see https://pre-commit.ci

sourcery-ai bot reviewed Jun 28, 2025


dmpetrov added 4 commits July 3, 2025 22:45

fix dc unit tests

8d56625

fix linter

1c4e0e3

revert unnecessary change

ccef36b

Merge branch 'main' into cursor/enhance-partition-by-to-support-strin…

b77a6d8

…gs-52d0

rm agent stuff

c9028b5

amritghimire approved these changes Jul 4, 2025


dmpetrov merged commit 396d8a9 into main Jul 4, 2025
57 of 59 checks passed

dmpetrov deleted the cursor/enhance-partition-by-to-support-strings-52d0 branch July 4, 2025 08:13
