Skip to content

[Python] pyarrow.compute.utf8_center disagrees with str.center when number of needed padding characters is odd #15053

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mroeschke opened this issue Dec 20, 2022 · 6 comments

Comments

@mroeschke
Copy link
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

In [1]: import pyarrow as pa; import pyarrow.compute as pc

In [2]: pa.__version__
Out[2]: '10.0.1'

In [3]: s = "bb"

In [4]: s.center(5, "X")
Out[4]: 'XXbbX'

In [5]: pc.utf8_center(pa.scalar(s), 5, padding="X")
Out[5]: <pyarrow.StringScalar: 'XbbXX'>

I suppose in theory it's arbitrary where the two XX are added (front or back) to center the string, but I would expect to match the standard library behavior for consistency

Component(s)

Python

@mroeschke
Copy link
Contributor Author

So this diff makes things consistent with Python stdlib, but the comment notes that this might be intentional?

diff --git a/cpp/src/arrow/compute/kernels/scalar_string_utf8.cc b/cpp/src/arrow/compute/kernels/scalar_string_utf8.cc
index fb197e13a..6c4d88ba7 100644
--- a/cpp/src/arrow/compute/kernels/scalar_string_utf8.cc
+++ b/cpp/src/arrow/compute/kernels/scalar_string_utf8.cc
@@ -930,7 +930,7 @@ struct Utf8PadTransform : public StringTransformBase {
     int64_t right = 0;
     if (PadLeft && PadRight) {
       // If odd number of spaces, put the extra space on the right
-      left = spaces / 2;
+      left = (spaces / 2) + 1;
       right = spaces - left;
     } else if (PadLeft) {
       left = spaces;

jorisvandenbossche added a commit that referenced this issue Jun 26, 2024
…right alignment on odd number of padding (#41449)

### Rationale for this change

See the issue #15053 for some more context, but in summary: for the "center" padding, and the number of characters that are being added, one needs to decide whether to add one more character on the left or right. Our implementation (somewhat randomly, I think) decided to put the extra space on the right. 
The Python standard library however, puts the extra space on the left. And for the usage of pyarrow as a string compute engine in the pandas project, we would like to have the option to have consistent behaviour with Python.

### What changes are included in this PR?

Add an option `align_left_on_odd_padding` to `PadOptions` that controls where the extra space is put. This keyword is quite ugly, but I am not sure what other solution there is if we want to give pyarrow users this option (also happy to hear other argument name options)

### Are these changes tested?

Yes

### Are there any user-facing changes?
No
* GitHub Issue: #15053

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
@jorisvandenbossche
Copy link
Member

Issue resolved by pull request 41449
#41449

@jorisvandenbossche jorisvandenbossche added this to the 17.0.0 milestone Jun 26, 2024
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Jul 9, 2024
… left/right alignment on odd number of padding (apache#41449)

### Rationale for this change

See the issue apache#15053 for some more context, but in summary: for the "center" padding, and the number of characters that are being added, one needs to decide whether to add one more character on the left or right. Our implementation (somewhat randomly, I think) decided to put the extra space on the right. 
The Python standard library however, puts the extra space on the left. And for the usage of pyarrow as a string compute engine in the pandas project, we would like to have the option to have consistent behaviour with Python.

### What changes are included in this PR?

Add an option `align_left_on_odd_padding` to `PadOptions` that controls where the extra space is put. This keyword is quite ugly, but I am not sure what other solution there is if we want to give pyarrow users this option (also happy to hear other argument name options)

### Are these changes tested?

Yes

### Are there any user-facing changes?
No
* GitHub Issue: apache#15053

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
@jbrockmendel
Copy link

jbrockmendel commented Aug 28, 2024

I think this might still not match the cpython behavior. Based on the OP and pandas-dev/pandas#59624 (the pandas test test_center_ljust_rjust_fillchar)

In [19]: a1 = "a"

In [20]: a1.center(4, "X")
Out[20]: 'XaXX'

In [21]: pc.utf8_center(pa.scalar(a1), 4, padding="X", lean_left_on_odd_padding=True)
Out[21]: <pyarrow.StringScalar: 'XXaX'>

So far so good. But when the input string has an even number of characters:

In [22]: b4 = "bbbb"

In [23]: b4.center(5, "X")
Out[23]: 'Xbbbb'

In [24]: pc.utf8_center(pa.scalar(b4), 5, padding="X", lean_left_on_odd_padding=True)
Out[24]: <pyarrow.StringScalar: 'bbbbX'>

IIUC with lean_left_on_odd_padding=True the number of spaces to put on right/left is determined here as

left = spaces / 2;
right = spaces - left;

By contrast in cpython it is determined here as

marg = width - STRINGLIB_LEN(self)
left = marg / 2 + (marg & width & 1);

With another dose of "IIUC", it looks like the marg &width & 1 term is missing from the pyarrow version.

@jorisvandenbossche
Copy link
Member

The default for the argument (keeping existing behaviour) is True, so by setting it to False I get:

In [42]: pc.utf8_center(pa.scalar(b4), 5, padding="X", lean_left_on_odd_padding=False)
Out[42]: <pyarrow.StringScalar: 'Xbbbb'>

It's certainly not a very clear argument name .. (it's the original data that "leans left", and not the new padding characters), but when setting it to False, do the pandas tests then pass?

@jbrockmendel
Copy link

(Updated my previous comment, for which the a1 example should have been using 4 spaces instead of 5.)

Using lean_left_on_odd_padding=False fixes the b4 case, but breaks the a1 case, (see pandas-dev/pandas#59624 (comment)).

@jorisvandenbossche
Copy link
Member

Hmm, ok so it seems, based on the single example above, we missed how Python actually works .. It doesn't just align on the right side (e.g 'XXbbX'), but so apparently they align on the left or right depending on whether the total width is odd or even .. (maybe a bit similar like round-to-even logic?)

One lucky part might be that this addition in pyarrow might not completely miss the point, because I think we can mimic that behaviour in pandas exactly (if we want), by doing something like:

if width % 2 == 0:
    # width is even
    lean_left = True
else:
    lean_left = False

pc.utf8_center(arr, width, padding=fillchar, lean_left_on_odd_padding=lean_left)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants