-
Notifications
You must be signed in to change notification settings - Fork 115
Enhance partition_by to support strings #1191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhance partition_by to support strings #1191
Conversation
Co-authored-by: dmitry <[email protected]>
Reviewer's GuideEnable string-based partition_by in agg() by extending its type, adding string-to-Column conversion logic, simplifying query chunking, and introducing corresponding unit tests. File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @dmpetrov - I've reviewed your changes - here's some feedback:
- Extract the string-to-Column conversion in agg (and group_by) into a shared helper to reduce duplication and simplify maintenance.
- Add tests covering mixed-type partition_by sequences (strings, ColumnElements, Functions) and nested field names (e.g. 'file.path') to exercise the full PartitionByType union.
- Verify that replacing manual step manipulation with query.chunk in apply_steps preserves the original step ordering and filtering behavior in all cases.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Extract the string-to-Column conversion in agg (and group_by) into a shared helper to reduce duplication and simplify maintenance.
- Add tests covering mixed-type partition_by sequences (strings, ColumnElements, Functions) and nested field names (e.g. 'file.path') to exercise the full PartitionByType union.
- Verify that replacing manual step manipulation with query.chunk in apply_steps preserves the original step ordering and filtering behavior in all cases.
## Individual Comments
### Comment 1
<location> `src/datachain/lib/dc/datachain.py:823` </location>
<code_context>
+2. **Adding String Conversion Logic**: Modified the `agg` method to convert strings to `Column` objects before passing them to the underlying UDF steps, similar to how `group_by` handles strings:
+ ```python
+ # Convert string partition_by parameters to Column objects
+ if isinstance(col, str):
+ col_db_name = ColumnMeta.to_db_name(col)
+ col_type = self.signals_schema.get_column_type(col_db_name)
</code_context>
<issue_to_address>
Potential risk if column name is not present in signals_schema.
If get_column_type is called with a non-existent column, it may raise an exception. Please ensure this case is handled or provide a clear error message.
</issue_to_address>
### Comment 2
<location> `tests/unit/lib/test_datachain.py:3457` </location>
<code_context>
)
+
+
+def test_agg_partition_by_string_notation(test_session):
+ """Test that agg method supports string notation for partition_by."""
+ class _ImageGroup(BaseModel):
+ name: str
+ size: int
+
+ def func(key, val) -> Iterator[tuple[File, _ImageGroup]]:
+ n = "-".join(key)
+ v = sum(val)
+ yield File(path=n), _ImageGroup(name=n, size=v)
+
+ keys = ["n1", "n2", "n1"]
+ values = [1, 5, 9]
+
+ # Test using string notation (NEW functionality)
+ ds = dc.read_values(key=keys, val=values, session=test_session).agg(
+ x=func, partition_by="key" # String notation instead of C("key")
+ )
+
+ assert ds.order_by("x_1.name").to_values("x_1.name") == ["n1-n1", "n2"]
+ assert ds.order_by("x_1.size").to_values("x_1.size") == [5, 10]
+
+
</code_context>
<issue_to_address>
Missing test for invalid string column in partition_by.
Please add a test where partition_by is set to a non-existent column name to verify proper error handling.
</issue_to_address>
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
if isinstance(col, str): | ||
col_db_name = ColumnMeta.to_db_name(col) | ||
col_type = self.signals_schema.get_column_type(col_db_name) | ||
column = Column(col_db_name, python_to_sql(col_type)) | ||
processed_partition_columns.append(column) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (bug_risk): Potential risk if column name is not present in signals_schema.
If get_column_type is called with a non-existent column, it may raise an exception. Please ensure this case is handled or provide a clear error message.
tests/unit/lib/test_datachain.py
Outdated
def test_agg_partition_by_string_notation(test_session): | ||
"""Test that agg method supports string notation for partition_by.""" | ||
class _ImageGroup(BaseModel): | ||
name: str | ||
size: int | ||
|
||
def func(key, val) -> Iterator[tuple[File, _ImageGroup]]: | ||
n = "-".join(key) | ||
v = sum(val) | ||
yield File(path=n), _ImageGroup(name=n, size=v) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (testing): Missing test for invalid string column in partition_by.
Please add a test where partition_by is set to a non-existent column name to verify proper error handling.
Deploying datachain-documentation with
|
Latest commit: |
c9028b5
|
Status: | ✅ Deploy successful! |
Preview URL: | https://5ecb09d7.datachain-documentation.pages.dev |
Branch Preview URL: | https://cursor-enhance-partition-by.datachain-documentation.pages.dev |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1191 +/- ##
=======================================
Coverage 88.71% 88.72%
=======================================
Files 152 152
Lines 13557 13575 +18
Branches 1884 1889 +5
=======================================
+ Hits 12027 12044 +17
Misses 1088 1088
- Partials 442 443 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
This pull request contains changes generated by Cursor background composer.
Summary by Sourcery
Enable string-based notation for
partition_by
in theagg
method by updating type definitions, converting string column names toColumn
objects, and adding targeted tests; while also simplifying the dataset chunking logic in query application.New Features:
partition_by
parameter in theagg
methodEnhancements:
PartitionByType
union to includestr
and sequences ofstr
,Function
, orColumnElement
apply_steps
to use a dedicatedchunk
method instead of manual slicing and filteringTests:
partition_by
inagg