[Data] Add namespaced expressions that expose pyarrow functions by goutamvenkat-anyscale · Pull Request #58465 · ray-project/ray

goutamvenkat-anyscale · 2025-11-07T23:59:10Z

Description

Exposes pyarrow compute functions through expressions so that with_column transforms become more powerful.

Related issues

Closes #57668

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: Goutam <goutam@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces a powerful new feature by exposing PyArrow compute functions through namespaced expressions (.str, .list, .struct). The implementation is well-structured, using dynamic method generation from a configuration, which is a great pattern for extensibility. The addition of a .pyi stub file is excellent for static analysis and IDE support, and the new tests are comprehensive.

My main feedback is a medium-severity issue regarding the placement of pyarrow.compute imports in the manually defined namespace methods. Moving these imports inside the UDF wrappers will improve robustness by preventing potential serialization issues. I've left comments on all affected methods with suggestions.

python/ray/data/expressions.py

Signed-off-by: Goutam <goutam@anyscale.com>

goutamvenkat-anyscale · 2025-11-08T01:38:38Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a powerful and well-designed feature for namespaced expressions, exposing a wide range of pyarrow compute functions for list, str, and struct types. The use of dynamic method generation via configuration dictionaries is clean and extensible, and the inclusion of a .pyi stub file for type hinting is excellent for developer experience and static analysis. The accompanying tests are comprehensive and well-structured. I have a few suggestions to improve type hint correctness and simplify some of the implementations.
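The "dynamic method generation via configuration dictionaries" pattern mentioned in the review can be sketched roughly as follows. This is a toy illustration only: `StringNamespace`, `_STRING_METHODS`, and `_make_method` are hypothetical names, and plain Python `str` functions stand in for the pyarrow compute functions the PR wraps.

```python
from typing import Callable, Dict, List

# Hypothetical configuration: method name -> implementation.
# The PR maps names to pyarrow.compute functions; str functions stand in here.
_STRING_METHODS: Dict[str, Callable[[str], str]] = {
    "upper": str.upper,
    "lower": str.lower,
    "reverse": lambda s: s[::-1],
}


class StringNamespace:
    """Toy namespace over a list of strings (not Ray's _StringNamespace)."""

    def __init__(self, values: List[str]):
        self._values = list(values)


def _make_method(func: Callable[[str], str]):
    # A factory gives each generated method its own `func`, avoiding the
    # late-binding pitfall of defining closures directly inside the loop.
    def method(self) -> List[str]:
        return [func(v) for v in self._values]

    return method


# Attach one method per configuration entry.
for _name, _func in _STRING_METHODS.items():
    setattr(StringNamespace, _name, _make_method(_func))

reversed_values = StringNamespace(["hello", "world"]).reverse()
```

Adding a new method then only requires a new dictionary entry, which is the extensibility benefit the review points to. A stub file is still needed for static analysis, since type checkers cannot see methods attached with `setattr`.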

python/ray/data/expressions.py

Signed-off-by: Goutam <goutam@anyscale.com>

cursor

Bug: Alias Expressions: Incorrect Rename State

The AliasExpr.alias method incorrectly preserves the _is_rename flag from the original expression when creating a new alias. When .alias() is called, it should always create an alias expression with _is_rename=False, regardless of whether the underlying expression was a rename. Preserving _is_rename=True causes the new alias to be incorrectly treated as a rename operation, which affects logical plan optimization and projection pushdown.

python/ray/data/expressions.py#L1306-L1311

ray/python/ray/data/expressions.py

Lines 1306 to 1311 in c39c65b

        return self._name

    def alias(self, name: str) -> "Expr":
        # Always unalias before creating new one
        return AliasExpr(
            self.expr.data_type, self.expr, _name=name, _is_rename=self._is_rename
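The change the bot suggests can be illustrated with a minimal sketch. The classes below are simplified stand-ins, not Ray's actual `AliasExpr` (the `data_type` argument is omitted, and `Col` is a hypothetical placeholder for a column expression); whether the PR adopted this exact fix is not stated in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Col:
    """Hypothetical stand-in for a column reference expression."""

    name: str


@dataclass(frozen=True)
class AliasExpr:
    """Simplified stand-in for an alias expression."""

    expr: Col
    _name: str
    _is_rename: bool = False

    def alias(self, name: str) -> "AliasExpr":
        # Re-wrap the underlying expression and reset the flag:
        # an explicit .alias() call is treated as an alias, never a rename.
        return AliasExpr(self.expr, _name=name, _is_rename=False)


renamed = AliasExpr(Col("x"), _name="y", _is_rename=True)
realiased = renamed.alias("z")
```

Under this sketch, `realiased` wraps the original column `x` under the name `z` with the rename flag cleared, while `renamed` itself is left unchanged.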

Signed-off-by: Goutam <goutam@anyscale.com>

cursor

Bug: Alias method mixes up rename and alias.

The AliasExpr.alias() method incorrectly preserves the _is_rename flag from the original expression when creating a new alias. When .alias() is explicitly called, it creates an alias operation (not a rename), so _is_rename should always be False in the returned AliasExpr, regardless of the original expression's _is_rename value. This causes incorrect semantics when chaining operations like col("x")._rename("y").alias("z").

python/ray/data/expressions.py#L1152-L1157

ray/python/ray/data/expressions.py

Lines 1152 to 1157 in 3b5f1a4

            function_name: Optional name for the function (for debugging)

        Example:
            >>> from ray.data.expressions import col, udf
            >>> import pyarrow as pa
            >>> import pyarrow.compute as pc

Signed-off-by: Goutam <goutam@anyscale.com>

python/ray/data/expressions.py

Signed-off-by: Goutam <goutam@anyscale.com>

goutamvenkat-anyscale · 2025-11-10T08:14:23Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a powerful and intuitive namespaced expression API (.str, .list, .struct) for ray.data.Dataset, mirroring pandas functionality. The implementation leverages pyarrow.compute functions wrapped in a new pyarrow_udf decorator, which is a clever way to quickly expand the API surface. The changes are well-tested and documented.

My main feedback is on improving schema propagation. Currently, several methods default to an object return type, which limits the optimizer's ability to reason about data types. I've left a specific comment on how this could be improved. I also have a suggestion to reduce boilerplate code in the _StringNamespace for better maintainability. Overall, this is a great addition to Ray Data.

python/ray/data/expressions.py

Signed-off-by: Goutam <goutam@anyscale.com>

python/ray/data/expressions.py

Signed-off-by: Goutam <goutam@anyscale.com>

srinathk10

LGTM

doc/source/data/api/expressions.rst

python/ray/data/expressions.py

bveeramani

In meeting -- will finish review after

doc/source/conf.py

python/ray/data/tests/test_namespace_expressions.py

bveeramani · 2025-11-14T22:04:34Z

python/ray/data/tests/test_namespace_expressions.py

+def assert_df_equal(result: pd.DataFrame, expected: pd.DataFrame):
+    """Assert dataframes are equal, ignoring dtype differences."""
+    pd.testing.assert_frame_equal(result, expected, check_dtype=False)


Here and elsewhere -- this function makes assumptions about the output ordering, so the tests might fail unexpectedly if tasks finish out of order. Consider using rows_same instead.

So rows_same sorts the df but pandas actual.sort_values(sorted(actual.columns)).reset_index(drop=True) fails on unhashable types like list and dict. Also nothing here should be order dependent.

Also nothing here should be order dependent.

I don't think that's true.

    ds = ray.data.from_items([{"val": "hello"}, {"val": "world"}])
    result = ds.with_column("rev", col("val").str.reverse()).to_pandas()
    expected = pd.DataFrame({"val": ["hello", "world"], "rev": ["olleh", "dlrow"]})
    assert_df_equal(result, expected)

For example, this dataset starts with two blocks and launches two tasks. If the second task finishes in an earlier scheduling loop than the first task, then the result will look like this:

    >>> result
         val    rev
    0  world  dlrow
    1  hello  olleh

I don't think it'll be likely, but many of our tests are flaky because of this exact sort of behaviour, so I'd prefer not to make assumptions

Okay to merge as-is to avoid blocking you. I can make rows_same more robust as a follow up

Sounds good
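One possible shape for an order-insensitive comparison that also tolerates unhashable cell values (lists, dicts) is sketched below. This is an assumption-laden illustration, not the `rows_same` helper discussed above: `rows_equal_unordered` and `_canonical` are hypothetical names, and rows are plain dictionaries rather than DataFrames.

```python
import json
from collections import Counter
from typing import Any, Dict, List


def _canonical(row: Dict[str, Any]) -> str:
    # Serialize each row to a string so that list and dict values become
    # hashable; sort_keys makes the key order inside dicts irrelevant.
    # default=str is a fallback for values json cannot encode directly.
    return json.dumps(row, sort_keys=True, default=str)


def rows_equal_unordered(
    actual: List[Dict[str, Any]], expected: List[Dict[str, Any]]
) -> bool:
    # Compare as multisets: row order is ignored, duplicate rows still count.
    return Counter(map(_canonical, actual)) == Counter(map(_canonical, expected))


swapped = [
    {"val": "world", "rev": "dlrow"},
    {"val": "hello", "rev": "olleh"},
]
expected_rows = [
    {"val": "hello", "rev": "olleh"},
    {"val": "world", "rev": "dlrow"},
]
same = rows_equal_unordered(swapped, expected_rows)
```

With pandas, rows in this form could be obtained through `DataFrame.to_dict("records")`. Element order inside list values is still compared, which matches the semantics of list columns.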

python/ray/data/tests/test_namespace_expressions.py

Signed-off-by: Goutam <goutam@anyscale.com>

cursor

Bug: Self-comparison breaks structural equality.

The AliasExpr.structurally_equals method compares self._is_rename == self._is_rename instead of comparing with the other object. This always returns True and prevents proper structural equality checking when the _is_rename flags differ between two AliasExpr instances.

python/ray/data/expressions.py#L925-L932

ray/python/ray/data/expressions.py

Lines 925 to 932 in 5a5e6b4

    def structurally_equals(self, other: Any) -> bool:
        return (
            isinstance(other, AliasExpr)
            and self.expr.structurally_equals(other.expr)
            and self.name == other.name
            and self._is_rename == self._is_rename
        )
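The fix the bot describes amounts to comparing against `other._is_rename`. A self-contained sketch, using simplified stand-in classes rather than Ray's actual implementations (`Col` and the constructor signatures are hypothetical):

```python
from typing import Any


class Col:
    """Hypothetical stand-in for a column reference expression."""

    def __init__(self, name: str):
        self.name = name

    def structurally_equals(self, other: Any) -> bool:
        return isinstance(other, Col) and self.name == other.name


class AliasExpr:
    """Simplified stand-in for an alias expression."""

    def __init__(self, expr: Col, name: str, is_rename: bool = False):
        self.expr = expr
        self.name = name
        self._is_rename = is_rename

    def structurally_equals(self, other: Any) -> bool:
        return (
            isinstance(other, AliasExpr)
            and self.expr.structurally_equals(other.expr)
            and self.name == other.name
            # Compare against `other`, not `self`, so differing flags are detected.
            and self._is_rename == other._is_rename
        )


alias = AliasExpr(Col("x"), "y", is_rename=False)
rename = AliasExpr(Col("x"), "y", is_rename=True)
differs = not alias.structurally_equals(rename)
```

A regression test for this kind of bug needs two instances that differ only in `_is_rename`; tests that compare an expression with an identical copy pass with or without the fix.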

Signed-off-by: Goutam <goutam@anyscale.com>

cursor

Bug: Incorrect Alias Structural Equality

In AliasExpr.structurally_equals, the comparison self._is_rename == self._is_rename always evaluates to True. This should compare self._is_rename to other._is_rename to correctly check structural equality. The bug causes two AliasExpr instances with different _is_rename values to incorrectly be considered structurally equal.

python/ray/data/expressions.py#L925-L932

ray/python/ray/data/expressions.py

Lines 925 to 932 in 27a34f6

    def structurally_equals(self, other: Any) -> bool:
        return (
            isinstance(other, AliasExpr)
            and self.expr.structurally_equals(other.expr)
            and self.name == other.name
            and self._is_rename == self._is_rename
        )

Signed-off-by: Goutam <goutam@anyscale.com>

cursor

Bug: Self-Comparison Breaks Object Identity

AliasExpr.structurally_equals compares self._is_rename with itself instead of comparing it with other._is_rename. This causes two AliasExpr objects with different _is_rename values to incorrectly be deemed structurally equal, breaking equality semantics.

python/ray/data/expressions.py#L913-L922

ray/python/ray/data/expressions.py

Lines 913 to 922 in 564880c

        all existing columns should be preserved at this position in the output.
        It's typically used internally by operations like with_column() and
        rename_columns() to maintain existing columns.

        Example:
            When with_column("new_col", expr) is called, it creates:
            Project(exprs=[star(), expr.alias("new_col")])

            This means: keep all existing columns, then add/overwrite "new_col"
        """

bveeramani

LGTM pending using testcode and removing from_items from the list of dataset

bveeramani · 2025-11-15T04:47:41Z

doc/source/data/api/expressions.rst

+
+The following example shows how to use the string namespace to transform text columns:
+
+.. code-block:: python


Can you use testcode in this doc? We've had problems in the past with our code snippets breaking over time, and using testcode prevents that

python/ray/data/tests/test_namespace_expressions.py

Signed-off-by: Goutam <goutam@anyscale.com>

goutamvenkat-anyscale added 3 commits November 7, 2025 15:30

[Data] - Pyarrow Functions as Expressions

2a721de

Signed-off-by: Goutam <goutam@anyscale.com>

Merge branch 'master' into goutam/pyarrow_expr

d6a6229

Add .pyi file

e202479

Signed-off-by: Goutam <goutam@anyscale.com>

goutamvenkat-anyscale requested a review from a team as a code owner November 7, 2025 23:59

gemini-code-assist bot reviewed Nov 8, 2025

ray-gardener bot added the data Ray Data-related issues label Nov 8, 2025

Some doc failures

9665ae3

Signed-off-by: Goutam <goutam@anyscale.com>

gemini-code-assist bot reviewed Nov 8, 2025

Docs

c39c65b

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Nov 8, 2025

Fix typing

3b5f1a4

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Nov 8, 2025

goutamvenkat-anyscale added 2 commits November 7, 2025 18:17

Vale linter

dab16b2

Signed-off-by: Goutam <goutam@anyscale.com>

One more try

336d882

Signed-off-by: Goutam <goutam@anyscale.com>

goutamvenkat-anyscale requested a review from a team as a code owner November 8, 2025 02:38

cursor bot reviewed Nov 8, 2025

idk

b1108a3

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Nov 8, 2025

Use pyarrow_udf

77e8753

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Nov 10, 2025

Fix build

51b51f6

Signed-off-by: Goutam <goutam@anyscale.com>

goutamvenkat-anyscale added the go add ONLY when ready to merge, run all tests label Nov 10, 2025

gemini-code-assist bot reviewed Nov 10, 2025

goutamvenkat-anyscale added 2 commits November 10, 2025 00:40

Clean up

5056a88

Signed-off-by: Goutam <goutam@anyscale.com>

Doctest

c55beed

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Nov 10, 2025

goutamvenkat-anyscale added 2 commits November 10, 2025 11:28

Remove example

80864ab

Signed-off-by: Goutam <goutam@anyscale.com>

Falsy error

1d90262

Signed-off-by: Goutam <goutam@anyscale.com>

srinathk10 approved these changes Nov 11, 2025

goutamvenkat-anyscale changed the title ~~Add namespaced expressions that expose pyarrow functions~~ [Data] Add namespaced expressions that expose pyarrow functions Nov 11, 2025

raulchen approved these changes Nov 14, 2025

bveeramani reviewed Nov 14, 2025

Address comments

5a5e6b4

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Nov 15, 2025

raulchen approved these changes Nov 15, 2025

goutamvenkat-anyscale added 2 commits November 14, 2025 17:19

Doclint

1f0e4f5

Signed-off-by: Goutam <goutam@anyscale.com>

Merge branch 'master' into goutam/pyarrow_expr

27a34f6

cursor bot reviewed Nov 15, 2025

goutamvenkat-anyscale added 2 commits November 14, 2025 17:21

Remove uunused func

e17b19f

Signed-off-by: Goutam <goutam@anyscale.com>

Make sphinx happy

564880c

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Nov 15, 2025

bveeramani approved these changes Nov 15, 2025

richardliaw mentioned this pull request Nov 15, 2025

Ray Data Q4 Roadmap + Wishlist #58665

Open

Address comments

d5b56c0

Signed-off-by: Goutam <goutam@anyscale.com>

richardliaw merged commit 7498739 into ray-project:master Nov 16, 2025
6 checks passed

goutamvenkat-anyscale deleted the goutam/pyarrow_expr branch November 16, 2025 05:19

goutamvenkat-anyscale mentioned this pull request Nov 16, 2025

[data] [docs] Adding unstructured data templates from ray summit 2025 #57063

Merged

8 tasks

Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025

[Data] Add namespaced expressions that expose pyarrow functions (ray-…

6a4ae22

…project#58465) Signed-off-by: Aydin Abiar <aydin@anyscale.com>

ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025

[Data] Add namespaced expressions that expose pyarrow functions (ray-…

c18cd51

…project#58465) Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>

SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025

[Data] Add namespaced expressions that expose pyarrow functions (ray-…

4c9c1e5

…project#58465)

Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025

[Data] Add namespaced expressions that expose pyarrow functions (ray-…

365f9eb

…project#58465) Signed-off-by: Future-Outlier <eric901201@gmail.com>

myandpr mentioned this pull request Dec 18, 2025

[Data] Compute Expression-str Padding #59552

Merged

peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026

[Data] Add namespaced expressions that expose pyarrow functions (ray-…

9ac19c6

…project#58465) Signed-off-by: peterxcli <peterxcli@gmail.com>
