apply_along_axis cuts strings #8352
Comments
(found this via #8363) @lukovnikov - the code fails because it was written with numerical arrays in mind, for which the computation on any part of an array can be expected to return the same type of output as any other part. I should note more generally that numpy arrays are not particularly good or efficient at strings, and unless you have a very complicated array, my guess is that you would be much better off just working with lists and python functions, especially since you already are using the python string functions to do the concatenating. |
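A minimal sketch of the list-based approach suggested above. The `rows` data is hypothetical and not taken from the original report; it only illustrates that plain Python strings have no fixed width, so nothing gets cut:

```python
# Hypothetical data: each inner list is one "row" of strings to concatenate.
rows = [["a", "b"], ["cc", "dd"]]

# Plain Python strings have no fixed width, so each result keeps its full length.
joined = ["".join(row) for row in rows]

print(joined)
```

For small or irregular string data this avoids the fixed-width string dtype altogether.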
@mhvk : I disagree completely with your statement. This is a bug, and it should be patched unless you can come up with a more convincing argument than what you have provided. |
@gfyoung - I'm not saying one should not try to solve the bug (though I think it is obvious any solution better not cause a huge performance regression for more typical), just explaining why the bug exists and suggesting that for strings one really is better off not using |
@mhvk : Fair enough, though your response came across as if this wasn't really a concern of |
I'm assuming that that's a deliberately contrived example, because you shouldn't be using
|
@eric-wieser your comment above helped me solve a problem I've been having for a few months with numpy and string operations. 👍
Thanks |
Perhaps we should just add a |
I agree you should add that argument, if there are performance differences between the masked array version and the regular version. I just did a small test (not sure if it means anything); the timings are listed below.

My use case: I'm storing parsed NLP data (strings) in numpy arrays and trying to get rid of all

EDIT: For anyone else experiencing this issue with numpy string operations:

Detail on my string operations so you can see if it applies to you: My numpy array (set as nump in the code below) has a shape of (26,1) and each element in the array is an information extraction from a sentence in a document. Each information extraction is a list of key/value pairs, and I am extracting the key/value pairs that represent NLP triples (subject, relation, object).

The original call:

%timeit -n 500 np.apply_along_axis(lambda x: ("{} {} {}".format(x[0]['subject'],x[0]['relation'],x[0]['object'])),1,np.array(nump[0][0])[:,np.newaxis])

My speed was 116 µs ± 5.42 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)

Solution: Using @eric-wieser's comment above, I used the following to fix my problem and maintain speeds near the original:

%timeit -n 500 np.apply_along_axis(lambda x: np.array("{} {} {}".format(x[0]['subject'],x[0]['relation'],x[0]['object']),dtype='S255'),1,np.array(nump[0][0])[:,np.newaxis])

My speed was 152 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)

The masked array approach works without changing your code, but is slower.

%timeit -n 500 np.ma.apply_along_axis(lambda x: ("{} {} {}".format(x[0]['subject'],x[0]['relation'],x[0]['object'])),1,np.array(nump[0][0])[:,np.newaxis])

My speed was 925 µs ± 33.6 µs per loop (mean ± std. dev. of 7 runs, 500 loops each) |
Note that |
Ah shucks; i was doing numpy because I always thought it led to faster speeds. Sheesh. |
fml |
To avoid the cut when joining string with np.apply_along_axis:
the result is
|
@cerlymarco, how does the approach you propose compare to @linwoodc3's solution in terms of speed? |
Unfortunately this isn't possible without breaking someone. The current signature is:

    def apply_along_axis(func1d, axis, arr, *args, **kwargs):

Today, users can call it as both:

    def f1(x):
        return x
    np.apply_along_axis(f1, 0, my_arr)

    def f2(x, *, dtype):
        return x.astype(dtype)
    np.apply_along_axis(f2, 0, my_arr, dtype=int)

If we make a
What we could do is:
|
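To make the compatibility concern above concrete, here is a small runnable sketch. The array `my_arr` is hypothetical; the point is only that extra keyword arguments are currently forwarded to `func1d`, so a `dtype=` keyword already has a meaning for callers like this one:

```python
import numpy as np

# Hypothetical input; any 2-D float array would do.
my_arr = np.array([[1.5, 2.5], [3.5, 4.5]])

def f2(x, *, dtype):
    # Receives `dtype` because apply_along_axis forwards **kwargs to func1d.
    return x.astype(dtype)

# Here `dtype=int` is consumed by f2, not by apply_along_axis itself.
out = np.apply_along_axis(f2, 0, my_arr, dtype=int)
print(out)
```

If `apply_along_axis` started interpreting `dtype` itself, a call like this would no longer pass the keyword through to `f2`.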
Just a comment that I was trying to add a prefix and suffix to a filename using Numpy on a Pandas Series and ran into the same problem (I think); i.e., that the second string was cut to the same length as the first.
|
I'm trying to concatenate all elements of a row into a string as follows:
b is
However, the result of the line is:
It looks like np.apply_along_axis is cutting the second string to be of the same length as the first one. If I put a longer sequence first, the result is correct:
So I'm guessing this is a bug?
Summary 2019-04-30 by @seberg
np.apply_along_axis infers the output dtype from the first pass. This can be worked around, for example, by having the function return an array of the correct type.
Actions:
np.apply_along_axis could/should get a dtype kwarg (or similar, compare also np.vectorize).
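The behaviour described in the summary can be sketched as follows. The array `b`, the helper names, and the chosen width `U16` are all hypothetical stand-ins (the original snippet is not preserved here). The first call shows the cut; the second applies the workaround of returning an array with an explicit, sufficiently wide dtype:

```python
import numpy as np

# Hypothetical data: the first row joins to a shorter string than the second.
b = np.array([["a", "b"], ["cc", "dd"]])

def join_row(row):
    # Returns a plain str; the output dtype is inferred from the first result.
    return "".join(row)

def join_row_wide(row):
    # Workaround: return a 0-d array with an explicit wide dtype (U16 is an
    # arbitrary assumption; it must be at least as wide as the longest result).
    return np.array("".join(row), dtype="U16")

truncated = np.apply_along_axis(join_row, 1, b)
fixed = np.apply_along_axis(join_row_wide, 1, b)

print(truncated)  # second entry is cut to the width of the first
print(fixed)      # both entries keep their full length
```

Returning `dtype=object` instead of a fixed width would be another option when the maximum length is not known in advance, at the cost of losing the compact string storage.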