Skip to content

[Data] Add namespaced expressions that expose pyarrow functions#58465

Merged
richardliaw merged 21 commits intoray-project:masterfrom
goutamvenkat-anyscale:goutam/pyarrow_expr
Nov 16, 2025
Merged

[Data] Add namespaced expressions that expose pyarrow functions#58465
richardliaw merged 21 commits intoray-project:masterfrom
goutamvenkat-anyscale:goutam/pyarrow_expr

Conversation

@goutamvenkat-anyscale
Copy link
Contributor

Description

Adds support to expose pyarrow compute functions to expressions to make with_column transforms more powerful.

Related issues

Closes #57668

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner November 7, 2025 23:59
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a powerful new feature by exposing PyArrow compute functions through namespaced expressions (.str, .list, .struct). The implementation is well-structured, using dynamic method generation from a configuration, which is a great pattern for extensibility. The addition of a .pyi stub file is excellent for static analysis and IDE support, and the new tests are comprehensive.

My main feedback is a medium-severity issue regarding the placement of pyarrow.compute imports in the manually defined namespace methods. Moving these imports inside the UDF wrappers will improve robustness by preventing potential serialization issues. I've left comments on all affected methods with suggestions.

@ray-gardener ray-gardener bot added the data Ray Data-related issues label Nov 8, 2025
Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale
Copy link
Contributor Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a powerful and well-designed feature for namespaced expressions, exposing a wide range of pyarrow compute functions for list, str, and struct types. The use of dynamic method generation via configuration dictionaries is clean and extensible, and the inclusion of a .pyi stub file for type hinting is excellent for developer experience and static analysis. The accompanying tests are comprehensive and well-structured. I have a few suggestions to improve type hint correctness and simplify some of the implementations.

Signed-off-by: Goutam <goutam@anyscale.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Alias Expressions: Incorrect Rename State

The AliasExpr.alias method incorrectly preserves the _is_rename flag from the original expression when creating a new alias. When .alias() is called, it should always create an alias expression with _is_rename=False, regardless of whether the underlying expression was a rename. Preserving _is_rename=True causes the new alias to be incorrectly treated as a rename operation, which affects logical plan optimization and projection pushdown.

python/ray/data/expressions.py#L1306-L1311

return self._name
def alias(self, name: str) -> "Expr":
# Always unalias before creating new one
return AliasExpr(
self.expr.data_type, self.expr, _name=name, _is_rename=self._is_rename

Fix in Cursor Fix in Web


Signed-off-by: Goutam <goutam@anyscale.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Alias method mixes up rename and alias.

The AliasExpr.alias() method incorrectly preserves the _is_rename flag from the original expression when creating a new alias. When .alias() is explicitly called, it creates an alias operation (not a rename), so _is_rename should always be False in the returned AliasExpr, regardless of the original expression's _is_rename value. This causes incorrect semantics when chaining operations like col("x")._rename("y").alias("z").

python/ray/data/expressions.py#L1152-L1157

function_name: Optional name for the function (for debugging)
Example:
>>> from ray.data.expressions import col, udf
>>> import pyarrow as pa
>>> import pyarrow.compute as pc

Fix in Cursor Fix in Web


Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner November 8, 2025 02:38
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale added the go add ONLY when ready to merge, run all tests label Nov 10, 2025
@goutamvenkat-anyscale
Copy link
Contributor Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a powerful and intuitive namespaced expression API (.str, .list, .struct) for ray.data.Dataset, mirroring pandas functionality. The implementation leverages pyarrow.compute functions wrapped in a new pyarrow_udf decorator, which is a clever way to quickly expand the API surface. The changes are well-tested and documented.

My main feedback is on improving schema propagation. Currently, several methods default to an object return type, which limits the optimizer's ability to reason about data types. I've left a specific comment on how this could be improved. I also have a suggestion to reduce boilerplate code in the _StringNamespace for better maintainability. Overall, this is a great addition to Ray Data.

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Copy link
Contributor

@srinathk10 srinathk10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@goutamvenkat-anyscale goutamvenkat-anyscale changed the title Add namespaced expressions that expose pyarrow functions [Data] Add namespaced expressions that expose pyarrow functions Nov 11, 2025
Copy link
Member

@bveeramani bveeramani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In meeting -- will finish review after

Comment on lines +17 to +19
def assert_df_equal(result: pd.DataFrame, expected: pd.DataFrame):
"""Assert dataframes are equal, ignoring dtype differences."""
pd.testing.assert_frame_equal(result, expected, check_dtype=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and elsewhere -- this functions makes assumptions about the output ordering, so the tests might fail unexpectedly if tasks finish out of order. Consider using rows_same instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So rows_same sorts the df but pandas actual.sort_values(sorted(actual.columns)).reset_index(drop=True) fails on unhashable types like list and dict. Also nothing here should be order dependent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also nothing here should be order dependent.

I don't think that's true.

ds = ray.data.from_items([{"val": "hello"}, {"val": "world"}])
result = ds.with_column("rev", col("val").str.reverse()).to_pandas()
expected = pd.DataFrame({"val": ["hello", "world"], "rev": ["olleh", "dlrow"]})
assert_df_equal(result, expected)

For example, this dataset starts with two blocks and launches two tasks. If the second task finishes in an earlier scheduling loop than the first task, then the result will look like this:

>>> result
     val    rev
0  world  dlrow
1  hello  olleh

I don't think it'll be likely, but many of our tests are flaky because of this exact sort of behaviour, so I'd prefer not to make assumptions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay to merge as-is to avoid blocking you. I can make rows_same more robust as a follow up

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good

Signed-off-by: Goutam <goutam@anyscale.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Self-comparison breaks structural equality.

The AliasExpr.structurally_equals method compares self._is_rename == self._is_rename instead of comparing with the other object. This always returns True and prevents proper structural equality checking when the _is_rename flags differ between two AliasExpr instances.

python/ray/data/expressions.py#L925-L932

def structurally_equals(self, other: Any) -> bool:
return (
isinstance(other, AliasExpr)
and self.expr.structurally_equals(other.expr)
and self.name == other.name
and self._is_rename == self._is_rename
)

Fix in Cursor Fix in Web


Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Incorrect Alias Structural Equality

In AliasExpr.structurally_equals, the comparison self._is_rename == self._is_rename always evaluates to True. This should compare self._is_rename to other._is_rename to correctly check structural equality. The bug causes two AliasExpr instances with different _is_rename values to incorrectly be considered structurally equal.

python/ray/data/expressions.py#L925-L932

def structurally_equals(self, other: Any) -> bool:
return (
isinstance(other, AliasExpr)
and self.expr.structurally_equals(other.expr)
and self.name == other.name
and self._is_rename == self._is_rename
)

Fix in Cursor Fix in Web


Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Self-Comparison Breaks Object Identity

AliasExpr.structurally_equals compares self._is_rename with itself instead of comparing it with other._is_rename. This causes two AliasExpr objects with different _is_rename values to incorrectly be deemed structurally equal, breaking equality semantics.

python/ray/data/expressions.py#L913-L922

all existing columns should be preserved at this position in the output.
It's typically used internally by operations like with_column() and
rename_columns() to maintain existing columns.
Example:
When with_column("new_col", expr) is called, it creates:
Project(exprs=[star(), expr.alias("new_col")])
This means: keep all existing columns, then add/overwrite "new_col"
"""

Fix in Cursor Fix in Web


Copy link
Member

@bveeramani bveeramani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending using testcode and removing from_items from the list of dataset


The following example shows how to use the string namespace to transform text columns:

.. code-block:: python
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use testcode in this doc? We've had problems in the past with our code snippets breaking over time, and using testcode prevents that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Signed-off-by: Goutam <goutam@anyscale.com>
@richardliaw richardliaw merged commit 7498739 into ray-project:master Nov 16, 2025
6 checks passed
@goutamvenkat-anyscale goutamvenkat-anyscale deleted the goutam/pyarrow_expr branch November 16, 2025 05:19
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
…project#58465)

Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Ray Data] Enhance the expression for getting a field or slice

6 participants