[Data] Compute Expression-str Padding#59552
Conversation
Signed-off-by: yaommen <myanstu@163.com>
There was a problem hiding this comment.
Code Review
This pull request introduces lpad and rpad string expression operations, which are valuable additions for data manipulation. The implementation is clean, correct, and consistent with the existing patterns in the string namespace. The accompanying tests verify the basic functionality. I have a couple of suggestions to improve the docstrings to make the behavior more explicit for users.
| def lpad( | ||
| self, width: int, padding: str = " ", *args: Any, **kwargs: Any | ||
| ) -> "UDFExpr": | ||
| """Left-pad strings up to ``width`` using ``padding``.""" |
There was a problem hiding this comment.
The current docstring is a bit ambiguous about what happens when a string is longer than width. The underlying pyarrow.compute.utf8_lpad function truncates the string from the right in this case. It would be beneficial to clarify this behavior in the docstring to avoid surprises for users, especially those familiar with pandas' string methods which do not truncate.
| """Left-pad strings up to ``width`` using ``padding``.""" | |
| """Left-pad strings to a fixed ``width``. If a string is longer than ``width``, it is truncated from the right.""" |
There was a problem hiding this comment.
Actually the description is not quite correct.
for lpad: Right-align strings by padding with a given character while respecting width for rpad: `Left-align strings by padding with a given character while respecting `width
Basing this off of https://arrow.apache.org/docs/python/generated/pyarrow.compute.utf8_lpad.html pyarrow's docs
There was a problem hiding this comment.
You’re absolutely right — my original description was incorrect and didn’t match the PyArrow semantics.
To make sure, I re-read the PyArrow docs and also verified the behavior with a quick local test:
>>> import pyarrow.compute as pc
>>> result1 = pc.utf8_lpad("overflow", 5, "-")
>>> print(f"'overflow' str length is greater than 5: `{result1}`")
'overflow' str length is greater than 5: `overflow`
>>> result2 = pc.utf8_lpad("overflow", 10, "-")
>>> print(f"'overflow' str length is less than 10: `{result2}`")
'overflow' str length is less than 10: `--overflow`This confirms that padding is only applied when the string length is less than width; if the string is longer than width, it is returned unchanged (no truncation).
I’ve updated the description accordingly and aligned it with the official wording/semantics:
• lpad: right-align strings by prepending the padding character while respecting width
• rpad: left-align strings by appending the padding character while respecting width
Thanks again for pointing this out and for the careful review.
| def rpad( | ||
| self, width: int, padding: str = " ", *args: Any, **kwargs: Any | ||
| ) -> "UDFExpr": | ||
| """Right-pad strings up to ``width`` using ``padding``.""" |
There was a problem hiding this comment.
Similar to lpad, the docstring for rpad could be more explicit about the truncation behavior for strings longer than width. The pyarrow.compute.utf8_rpad function also truncates from the right.
| """Right-pad strings up to ``width`` using ``padding``.""" | |
| """Right-pad strings to a fixed ``width``. If a string is longer than ``width``, it is truncated from the right.""" |
There was a problem hiding this comment.
Same as lpad, this has been updated as well.
Signed-off-by: yaommen <myanstu@163.com>
Signed-off-by: yaommen <myanstu@163.com>
goutamvenkat-anyscale
left a comment
There was a problem hiding this comment.
lgtm! Thanks
Signed-off-by: seanlaii <qazwsx0939059006@gmail.com>
Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
Signed-off-by: lee1258561 <lee1258561@gmail.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Description
Completing the str Padding operations (lpad, rpad)
test for example:
Related issues
Related to #58674
Related PR: [Data] Add namespaced expressions that expose pyarrow functions (#58465)
Additional information