[Data] Compute Expression-str Padding #59552

Merged: richardliaw merged 3 commits into ray-project:master from myandpr:compute-expression-str-padding on Dec 23, 2025

@myandpr
Member

@myandpr myandpr commented Dec 18, 2025

Description

Adds the str padding operations (lpad, rpad) to the expression string namespace.

Example test:

import ray
from ray.data import from_items
from ray.data.expressions import col

ray.init(include_dashboard=False, ignore_reinit_error=True)

ds = from_items([
    {"x": "ray"},
    {"x": "data"},
    {"x": "expr"},
])

for row in ds.iter_rows():
    print(row)

x_expr = col("x")

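# lpad: check the method exists, then add a left-padded column
assert hasattr(x_expr.str, "lpad")
ds = ds.with_column("x_lpad", x_expr.str.lpad(6, "_"))

# rpad: check the method exists, then add a right-padded column
assert hasattr(x_expr.str, "rpad")
ds = ds.with_column("x_rpad", x_expr.str.rpad(6, "_"))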

for row in ds.iter_rows():
    print(row)

Related issues

Related to #58674

Related PR: [Data] Add namespaced expressions that expose pyarrow functions (#58465)

Additional information

Signed-off-by: yaommen <myanstu@163.com>
@myandpr myandpr requested a review from a team as a code owner December 18, 2025 17:17
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces lpad and rpad string expression operations, which are valuable additions for data manipulation. The implementation is clean, correct, and consistent with the existing patterns in the string namespace. The accompanying tests verify the basic functionality. I have a couple of suggestions to improve the docstrings to make the behavior more explicit for users.

def lpad(
    self, width: int, padding: str = " ", *args: Any, **kwargs: Any
) -> "UDFExpr":
    """Left-pad strings up to ``width`` using ``padding``."""
Contributor

medium

The current docstring is a bit ambiguous about what happens when a string is longer than width. The underlying pyarrow.compute.utf8_lpad function truncates the string from the right in this case. It would be beneficial to clarify this behavior in the docstring to avoid surprises for users, especially those familiar with pandas' string methods which do not truncate.

Suggested change
- """Left-pad strings up to ``width`` using ``padding``."""
+ """Left-pad strings to a fixed ``width``. If a string is longer than ``width``, it is truncated from the right."""

Contributor

+1

Member Author

Done.

Contributor

Actually the description is not quite correct.

For lpad: "Right-align strings by padding with a given character while respecting `width`"
For rpad: "Left-align strings by padding with a given character while respecting `width`"

Basing this on PyArrow's docs: https://arrow.apache.org/docs/python/generated/pyarrow.compute.utf8_lpad.html

Member Author

You’re absolutely right — my original description was incorrect and didn’t match the PyArrow semantics.

To make sure, I re-read the PyArrow docs and also verified the behavior with a quick local test:

>>> import pyarrow.compute as pc
>>> result1 = pc.utf8_lpad("overflow", 5, "-")
>>> print(f"'overflow' str length is greater than 5: `{result1}`") 
'overflow' str length is greater than 5: `overflow`
>>> result2 = pc.utf8_lpad("overflow", 10, "-")
>>> print(f"'overflow' str length is less than 10: `{result2}`")
'overflow' str length is less than 10: `--overflow`

This confirms that padding is only applied when the string length is less than width; if the string is longer than width, it is returned unchanged (no truncation).

I’ve updated the description accordingly and aligned it with the official wording/semantics:
• lpad: right-align strings by prepending the padding character while respecting width
• rpad: left-align strings by appending the padding character while respecting width
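
The semantics in the two bullets above can be modelled in plain Python with str.rjust and str.ljust, which also pad only when the string is shorter than the requested width and never truncate. This is only a reference sketch for reasoning about expected values: the function names lpad_model and rpad_model are hypothetical, and it assumes the width is counted in Unicode code points, as Python's len does.

```python
def lpad_model(s: str, width: int, padding: str = " ") -> str:
    """Hypothetical model of lpad: right-align s by prepending padding."""
    # str.rjust returns s unchanged when len(s) >= width (no truncation).
    return s.rjust(width, padding)


def rpad_model(s: str, width: int, padding: str = " ") -> str:
    """Hypothetical model of rpad: left-align s by appending padding."""
    # str.ljust returns s unchanged when len(s) >= width (no truncation).
    return s.ljust(width, padding)


# Mirrors the local test above: longer than width is returned as-is.
print(lpad_model("overflow", 5, "-"))   # overflow
print(lpad_model("overflow", 10, "-"))  # --overflow
print(rpad_model("overflow", 10, "-"))  # overflow--
```

Like str.rjust and str.ljust, this model only accepts a single-character padding argument.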

Thanks again for pointing this out and for the careful review.

def rpad(
    self, width: int, padding: str = " ", *args: Any, **kwargs: Any
) -> "UDFExpr":
    """Right-pad strings up to ``width`` using ``padding``."""
Contributor

medium

Similar to lpad, the docstring for rpad could be more explicit about the truncation behavior for strings longer than width. The pyarrow.compute.utf8_rpad function also truncates from the right.

Suggested change
- """Right-pad strings up to ``width`` using ``padding``."""
+ """Right-pad strings to a fixed ``width``. If a string is longer than ``width``, it is truncated from the right."""

Contributor

+1

Member Author

Same as lpad, this has been updated as well.

Signed-off-by: yaommen <myanstu@163.com>
@ray-gardener ray-gardener bot added the labels data (Ray Data-related issues) and community-contribution (Contributed by the community) on Dec 18, 2025
Signed-off-by: yaommen <myanstu@163.com>
@goutamvenkat-anyscale goutamvenkat-anyscale added the label go (add ONLY when ready to merge, run all tests) on Dec 23, 2025
Contributor

@goutamvenkat-anyscale goutamvenkat-anyscale left a comment

lgtm! Thanks

@richardliaw richardliaw enabled auto-merge (squash) December 23, 2025 19:16
@richardliaw richardliaw merged commit 10869d5 into ray-project:master Dec 23, 2025
8 checks passed
seanlaii pushed a commit to seanlaii/ray that referenced this pull request Dec 23, 2025
Signed-off-by: seanlaii <qazwsx0939059006@gmail.com>
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
Signed-off-by: lee1258561 <lee1258561@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
Signed-off-by: peterxcli <peterxcli@gmail.com>