[Python] pyarrow.compute.utf8_center disagrees with str.center when number of needed padding characters is odd #15053
Comments
So this diff makes things consistent with Python stdlib, but the comment notes that this might be intentional?

diff --git a/cpp/src/arrow/compute/kernels/scalar_string_utf8.cc b/cpp/src/arrow/compute/kernels/scalar_string_utf8.cc
index fb197e13a..6c4d88ba7 100644
--- a/cpp/src/arrow/compute/kernels/scalar_string_utf8.cc
+++ b/cpp/src/arrow/compute/kernels/scalar_string_utf8.cc
@@ -930,7 +930,7 @@ struct Utf8PadTransform : public StringTransformBase {
     int64_t right = 0;
     if (PadLeft && PadRight) {
       // If odd number of spaces, put the extra space on the right
-      left = spaces / 2;
+      left = (spaces / 2) + 1;
       right = spaces - left;
     } else if (PadLeft) {
       left = spaces;
…right alignment on odd number of padding (#41449)

### Rationale for this change

See the issue #15053 for some more context, but in summary: for the "center" padding, and the number of characters that are being added, one needs to decide whether to add one more character on the left or right. Our implementation (somewhat randomly, I think) decided to put the extra space on the right. The Python standard library however, puts the extra space on the left. And for the usage of pyarrow as a string compute engine in the pandas project, we would like to have the option to have consistent behaviour with Python.

### What changes are included in this PR?

Add an option `align_left_on_odd_padding` to `PadOptions` that controls where the extra space is put. This keyword is quite ugly, but I am not sure what other solution there is if we want to give pyarrow users this option (also happy to hear other argument name options)

### Are these changes tested?

Yes

### Are there any user-facing changes?

No

* GitHub Issue: #15053

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
Issue resolved by pull request 41449
I think this might still not match the cpython behavior. Based on the OP and pandas-dev/pandas#59624 (the pandas test test_center_ljust_rjust_fillchar)
So far so good. But when the input string has an even number of characters:
IIUC with lean_left_on_odd_padding=True the number of spaces to put on right/left is determined here as
By contrast in cpython it is determined here as
With another dose of "IIUC", it looks like the
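For reference, the two splitting rules being compared here can be modelled in plain Python. This is only a sketch under stated assumptions, not code from either project: `arrow_split` mirrors the `left = spaces / 2; right = spaces - left` logic from the diff quoted earlier in this thread and assumes that `lean_left_on_odd_padding=False` is the mirror image of it; `cpython_split` follows my reading of `str.center` in CPython, `left = marg / 2 + (marg & width & 1)`. Both helper names are hypothetical.

```python
def arrow_split(spaces: int, lean_left: bool = True) -> tuple[int, int]:
    """Model of the Arrow center-padding split (an assumption, see above)."""
    if lean_left:
        # Extra padding character goes on the right; the data "leans left".
        left = spaces // 2
        right = spaces - left
    else:
        # Assumed mirror image: extra padding character goes on the left.
        right = spaces // 2
        left = spaces - right
    return left, right


def cpython_split(length: int, width: int) -> tuple[int, int]:
    """Model of how str.center splits the padding."""
    marg = max(width - length, 0)
    # The extra character goes on the left only when both the margin
    # and the target width are odd.
    left = marg // 2 + (marg & width & 1)
    return left, marg - left


# Sanity-check the CPython model against the real str.center.
for length in range(8):
    for width in range(12):
        s = "a" * length
        left, right = cpython_split(length, width)
        assert s.center(width, "X") == "X" * left + s + "X" * right
```

The point of the comparison: the Arrow-style split depends only on the number of padding characters, whereas the `str.center` split also depends on the parity of `width`, so no single fixed value of the flag can match it for all inputs.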
The default for the argument (keeping existing behaviour) is
It's certainly not a very clear argument name (it's the original data that "leans left", not the new padding characters), but when setting it to False, do the pandas tests then pass?
(Updated my previous comment, for which the

Using lean_left_on_odd_padding=False fixes the b4 case, but breaks the a1 case (see pandas-dev/pandas#59624 (comment)).
Hmm, ok so it seems, based on the single example above, we missed how Python actually works .. It doesn't just align on the right side (e.g

One lucky part might be that this addition in pyarrow might not completely miss the point, because I think we can mimic that behaviour in pandas exactly (if we want), by doing something like:

if width % 2 == 0:
    # width is even
    lean_left = True
else:
    lean_left = False
pc.utf8_center(arr, width, padding=fillchar, lean_left_on_odd_padding=lean_left)
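The width-parity rule suggested above can be checked without pyarrow. The following is a pure-Python sketch, not the pandas or Arrow implementation: `center_via_lean_left` is a hypothetical helper that assumes `lean_left=True` puts the extra padding character on the right and `lean_left=False` puts it on the left, then compares the result against `str.center` over a range of lengths and widths.

```python
def center_via_lean_left(s: str, width: int, fillchar: str = " ") -> str:
    """Emulate centering with a lean-left flag chosen from the width parity."""
    spaces = max(width - len(s), 0)
    lean_left = width % 2 == 0  # the rule proposed in the comment above
    if lean_left:
        left = spaces // 2           # extra character ends up on the right
    else:
        left = spaces - spaces // 2  # extra character ends up on the left
    right = spaces - left
    return fillchar * left + s + fillchar * right


# Exhaustive check over small inputs: the emulation agrees with str.center.
for n in range(10):
    for width in range(15):
        s = "a" * n
        assert center_via_lean_left(s, width, "X") == s.center(width, "X")
```

If the assumption about what the flag does holds, choosing it from the parity of `width` reproduces the standard library result for both odd and even input lengths.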
Describe the bug, including details regarding any error messages, version, and platform.
I suppose in theory it's arbitrary where the two `XX` are added (front or back) to center the string, but I would expect to match the standard library behavior for consistency.
Component(s)
Python