[Data] Add namespaced expressions that expose pyarrow functions#58465
[Data] Add namespaced expressions that expose pyarrow functions#58465richardliaw merged 21 commits intoray-project:masterfrom
Conversation
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a powerful new feature by exposing PyArrow compute functions through namespaced expressions (.str, .list, .struct). The implementation is well-structured, using dynamic method generation from a configuration, which is a great pattern for extensibility. The addition of a .pyi stub file is excellent for static analysis and IDE support, and the new tests are comprehensive.
My main feedback is a medium-severity issue regarding the placement of pyarrow.compute imports in the manually defined namespace methods. Moving these imports inside the UDF wrappers will improve robustness by preventing potential serialization issues. I've left comments on all affected methods with suggestions.
Signed-off-by: Goutam <goutam@anyscale.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a powerful and well-designed feature for namespaced expressions, exposing a wide range of pyarrow compute functions for list, str, and struct types. The use of dynamic method generation via configuration dictionaries is clean and extensible, and the inclusion of a .pyi stub file for type hinting is excellent for developer experience and static analysis. The accompanying tests are comprehensive and well-structured. I have a few suggestions to improve type hint correctness and simplify some of the implementations.
There was a problem hiding this comment.
Bug: Alias Expressions: Incorrect Rename State
The AliasExpr.alias method incorrectly preserves the _is_rename flag from the original expression when creating a new alias. When .alias() is called, it should always create an alias expression with _is_rename=False, regardless of whether the underlying expression was a rename. Preserving _is_rename=True causes the new alias to be incorrectly treated as a rename operation, which affects logical plan optimization and projection pushdown.
python/ray/data/expressions.py#L1306-L1311
ray/python/ray/data/expressions.py
Lines 1306 to 1311 in c39c65b
Signed-off-by: Goutam <goutam@anyscale.com>
There was a problem hiding this comment.
Bug: Alias method mixes up rename and alias.
The AliasExpr.alias() method incorrectly preserves the _is_rename flag from the original expression when creating a new alias. When .alias() is explicitly called, it creates an alias operation (not a rename), so _is_rename should always be False in the returned AliasExpr, regardless of the original expression's _is_rename value. This causes incorrect semantics when chaining operations like col("x")._rename("y").alias("z").
python/ray/data/expressions.py#L1152-L1157
ray/python/ray/data/expressions.py
Lines 1152 to 1157 in 3b5f1a4
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a powerful and intuitive namespaced expression API (.str, .list, .struct) for ray.data.Dataset, mirroring pandas functionality. The implementation leverages pyarrow.compute functions wrapped in a new pyarrow_udf decorator, which is a clever way to quickly expand the API surface. The changes are well-tested and documented.
My main feedback is on improving schema propagation. Currently, several methods default to an object return type, which limits the optimizer's ability to reason about data types. I've left a specific comment on how this could be improved. I also have a suggestion to reduce boilerplate code in the _StringNamespace for better maintainability. Overall, this is a great addition to Ray Data.
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
bveeramani
left a comment
There was a problem hiding this comment.
In meeting -- will finish review after
| def assert_df_equal(result: pd.DataFrame, expected: pd.DataFrame): | ||
| """Assert dataframes are equal, ignoring dtype differences.""" | ||
| pd.testing.assert_frame_equal(result, expected, check_dtype=False) |
There was a problem hiding this comment.
Here and elsewhere -- this functions makes assumptions about the output ordering, so the tests might fail unexpectedly if tasks finish out of order. Consider using rows_same instead
There was a problem hiding this comment.
So rows_same sorts the df but pandas actual.sort_values(sorted(actual.columns)).reset_index(drop=True) fails on unhashable types like list and dict. Also nothing here should be order dependent.
There was a problem hiding this comment.
Also nothing here should be order dependent.
I don't think that's true.
ds = ray.data.from_items([{"val": "hello"}, {"val": "world"}])
result = ds.with_column("rev", col("val").str.reverse()).to_pandas()
expected = pd.DataFrame({"val": ["hello", "world"], "rev": ["olleh", "dlrow"]})
assert_df_equal(result, expected)For example, this dataset starts with two blocks and launches two tasks. If the second task finishes in an earlier scheduling loop than the first task, then the result will look like this:
>>> result
val rev
0 world dlrow
1 hello ollehI don't think it'll be likely, but many of our tests are flaky because of this exact sort of behaviour, so I'd prefer not to make assumptions
There was a problem hiding this comment.
Okay to merge as-is to avoid blocking you. I can make rows_same more robust as a follow up
There was a problem hiding this comment.
Sounds good
Signed-off-by: Goutam <goutam@anyscale.com>
There was a problem hiding this comment.
Bug: Self-comparison breaks structural equality.
The AliasExpr.structurally_equals method compares self._is_rename == self._is_rename instead of comparing with the other object. This always returns True and prevents proper structural equality checking when the _is_rename flags differ between two AliasExpr instances.
python/ray/data/expressions.py#L925-L932
ray/python/ray/data/expressions.py
Lines 925 to 932 in 5a5e6b4
There was a problem hiding this comment.
Bug: Incorrect Alias Structural Equality
In AliasExpr.structurally_equals, the comparison self._is_rename == self._is_rename always evaluates to True. This should compare self._is_rename to other._is_rename to correctly check structural equality. The bug causes two AliasExpr instances with different _is_rename values to incorrectly be considered structurally equal.
python/ray/data/expressions.py#L925-L932
ray/python/ray/data/expressions.py
Lines 925 to 932 in 27a34f6
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
There was a problem hiding this comment.
Bug: Self-Comparison Breaks Object Identity
AliasExpr.structurally_equals compares self._is_rename with itself instead of comparing it with other._is_rename. This causes two AliasExpr objects with different _is_rename values to incorrectly be deemed structurally equal, breaking equality semantics.
python/ray/data/expressions.py#L913-L922
ray/python/ray/data/expressions.py
Lines 913 to 922 in 564880c
bveeramani
left a comment
There was a problem hiding this comment.
LGTM pending using testcode and removing from_items from the list of dataset
doc/source/data/api/expressions.rst
Outdated
|
|
||
| The following example shows how to use the string namespace to transform text columns: | ||
|
|
||
| .. code-block:: python |
There was a problem hiding this comment.
Can you use testcode in this doc? We've had problems in the past with our code snippets breaking over time, and using testcode prevents that
Signed-off-by: Goutam <goutam@anyscale.com>
…project#58465) Signed-off-by: Aydin Abiar <aydin@anyscale.com>
…project#58465) Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
…project#58465) Signed-off-by: Future-Outlier <eric901201@gmail.com>
…project#58465) Signed-off-by: peterxcli <peterxcli@gmail.com>
Description
Adds support to expose pyarrow compute functions to expressions to make
with_columntransforms more powerful.Related issues
Closes #57668
Additional information