apply_along_axis cuts strings #8352

lukovnikov · 2016-12-07T17:02:53Z

I'm trying to concatenate all elements of a row into a string as follows:

np.apply_along_axis(lambda x: " ".join(map(str, x)), 1, b)

b is

[[111,111,0,0,0], [111,111,111,111,111]]

However, the result of the line is:

['111 111 0 0 0', '111 111 111 1']

It looks like np.apply_along_axis is cutting the second string to be of the same length as the first one. If I put a longer sequence first, the result is correct:

['111 111 111 111 111', '111 111 0 0 0']

So I'm guessing this is a bug?

Summary 2019-04-30 by @seberg

np.apply_along_axis infers the output dtype from the first pass. This can be worked around, for example, by having the function return an array of the correct type.
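A minimal sketch of the dtype inference described in this summary, reusing the array from the original report. The helper name join_row and the variables short_first and long_first are illustrative only. The output dtype is taken from the first row's result, so row order decides how wide the strings are allowed to be:

```python
import numpy as np

b = np.array([[111, 111, 0, 0, 0], [111, 111, 111, 111, 111]])

def join_row(x):
    # Join one row into a single space-separated string.
    return " ".join(map(str, x))

# Short row first: the output buffer gets the width of '111 111 0 0 0' (13 chars),
# so the longer second result does not fit (the report shows '111 111 111 1').
short_first = np.apply_along_axis(join_row, 1, b)

# Long row first: the buffer is wide enough for both results.
long_first = np.apply_along_axis(join_row, 1, b[::-1])

print(short_first.dtype, short_first)
print(long_first.dtype, long_first)
```

Comparing the two dtypes makes the cause visible without reading the NumPy source.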

Actions:

np.apply_along_axis could/should get a dtype kwarg (or similar, compare also np.vectorize).

mhvk · 2016-12-13T16:42:29Z

(found this via #8363) @lukovnikov - the code fails because it was written with numerical arrays in mind, for which the computation on any part of an array can be expected to return the same type of output as any other part. I should note more generally that numpy arrays are not particularly good or efficient at strings, and unless you have a very complicated array, my guess is that you would be much better off just working with lists and python functions, especially since you already are using the python string functions to do the concatenating.

gfyoung · 2016-12-13T16:50:01Z

@mhvk : I disagree completely with your statement. numpy might be written for numerical computations, but that doesn't mean we have to omit functionality with str arrays. After all, that is why we have specific dtypes for strings unlike libraries like pandas.

This is a bug, and it should be patched unless you can come up with a more convincing argument than what you have provided.

mhvk · 2016-12-13T17:27:07Z

@gfyoung - I'm not saying one should not try to solve the bug (though I think it is obvious any solution had better not cause a huge performance regression for more typical use cases), just explaining why the bug exists and suggesting that for strings one really is better off not using ndarray. Anyway, those are my 2¢.

gfyoung · 2016-12-13T17:28:11Z

@mhvk : Fair enough, though your response came across as if this wasn't really a concern of numpy. That is why I wanted to come down strongly to emphasize that this is something we should be trying to patch.

yvan · 2017-05-27T15:55:51Z

i had the same issue:

you can see it crops the 'g' of jpg by simply referencing it. i figured it has something to do with the shape change. ended up using lists + map instead.

eric-wieser · 2017-05-27T16:37:13Z

I'm assuming that that's a deliberately contrived example, because you shouldn't be using apply_along_axis for simple indexing like that.

np.ma.apply_along_axis will work correctly here (for now - see #8511). Another option (crashed until 1.13) is a manual cast to dtype object:

np.apply_along_axis(lambda x: np.array(x[0], object), 1, fnames)
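As a rough illustration of the object cast above: the original fnames array is not shown in the thread, so the one below is a hypothetical stand-in. Returning a 0-d object array from the function makes the inferred output dtype object, which has no fixed width. This assumes NumPy 1.13 or later, as noted in the comment.

```python
import numpy as np

# Hypothetical stand-in for the `fnames` array; the real one is not in the thread.
fnames = np.array([["a.jpg", "x"], ["bb.jpeg", "y"]])

# Wrapping the result in a 0-d object array keeps each string at full length.
first_as_object = np.apply_along_axis(lambda x: np.array(x[0], object), 1, fnames)

# For plain indexing like this, slicing does the same job without a Python loop.
first_by_slicing = fnames[:, 0]

print(first_as_object)
print(first_by_slicing)
```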

linwoodc3 · 2017-12-29T13:54:23Z

@eric-wieser your comment above helped me solve a problem I've been having for a few months with numpy and string operations. 👍

Thanks

eric-wieser · 2017-12-29T14:00:10Z

Perhaps we should just add a dtype= argument to apply_along_axis

linwoodc3 · 2017-12-29T15:06:09Z

@eric-wieser

I agree you should add that argument, if there are performance differences between the mask array version and the regular version. I just did a small test (not sure if it means anything) and here is a pic of the results:

My use case: I'm storing parsed NLP data (strings) in numpy arrays and trying to get rid of all for loops and if-else clauses; I apply some text analytics functions using the apply_along_axis. Any speed benefit would be awesome as each information extraction could have 10-20 variations (from a document that could have tens to hundreds of information extractions which comes from a corpus of thousands of documents...per day).

EDIT

For anyone else experiencing this issue with numpy string operations:
I just explicitly set the dtype in the normal apply_along_axis instead of using the masked array approach, which is slower than the normal apply_along_axis.

Detail on my string operations so you can see if it applies to you: my numpy array (set as nump in the code below) has a shape of (26,1), and each element in the array is an information extraction from a sentence in a document. Each information extraction is a list of key/value pairs, and I am extracting the key/value pairs that represent NLP triples (subject, relation, object). The lambda function is passed over the array to combine each triple into a single sentence, which I then test for Flesch-Kincaid reading ease since these are computer-generated sentences. The sentences were being truncated to a length determined by the inferred dtype. The original code that was truncating the strings was:

%timeit -n 500 np.apply_along_axis(lambda x: ("{} {} {}".format(x[0]['subject'],x[0]['relation'],x[0]['object'])),1,np.array(nump[0][0])[:,np.newaxis])

My speed was 116 µs ± 5.42 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)

Solution

Using @eric-wieser's comment above, I used the following to fix my problem and keep speeds near the original:

Explicitly passing in the dtype argument

%timeit -n 500 np.apply_along_axis(lambda x: np.array("{} {} {}".format(x[0]['subject'],x[0]['relation'],x[0]['object']),dtype='S255'),1,np.array(nump[0][0])[:,np.newaxis])

My speed was 152 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)

The masked array approach works without changing your code, but is slower.

Masked array with no dtype argument

%timeit -n 500 np.ma.apply_along_axis(lambda x: ("{} {} {}".format(x[0]['subject'],x[0]['relation'],x[0]['object'])),1,np.array(nump[0][0])[:,np.newaxis])

My speed was 925 µs ± 33.6 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)
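A self-contained sketch of the explicit fixed-width approach above. The nump data is not shown in the thread, so triples is a hypothetical stand-in, and to_sentence and its width parameter are illustrative names. This version uses a unicode dtype ("U255") instead of the bytes dtype 'S255', on the assumption that the text may not be pure ASCII. Results longer than the chosen width would still be cut.

```python
import numpy as np

# Hypothetical stand-in for the parsed triples; the real data is not in the thread.
triples = np.array([
    {"subject": "cat", "relation": "sat on", "object": "mat"},
    {"subject": "the committee", "relation": "postponed",
     "object": "the annual budget review"},
], dtype=object)[:, np.newaxis]

def to_sentence(row, width=255):
    # Build the sentence, then fix the output width so that the first
    # result does not decide how long later results may be.
    text = "{} {} {}".format(row[0]["subject"], row[0]["relation"], row[0]["object"])
    return np.array(text, dtype="U{}".format(width))

sentences = np.apply_along_axis(to_sentence, 1, triples)
print(sentences.dtype, sentences)
```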

eric-wieser · 2017-12-29T16:43:11Z

and trying to get rid of all for loops

Note that apply_along_axis is just a python for loop, with a little help in allocating the output array for you - it's not unlikely that it's slower than the loop it replaces
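For comparison, a plain-Python version of the row join from the original report. It is only a sketch, but it avoids the dtype inference entirely because the result is an ordinary list of str:

```python
b = [[111, 111, 0, 0, 0], [111, 111, 111, 111, 111]]

# One string per row; no output array is allocated, so nothing can be truncated.
joined = [" ".join(map(str, row)) for row in b]

print(joined)  # ['111 111 0 0 0', '111 111 111 111 111']
```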

linwoodc3 · 2017-12-29T17:25:13Z

Ah shucks; i was doing numpy because I always thought it led to faster speeds. Sheesh.

Divye02 · 2018-02-19T05:21:50Z

fml

cerlymarco · 2019-04-30T22:11:09Z

To avoid the cut when joining strings with np.apply_along_axis:

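import numpy as np

a = np.array(['sssssssss', 'ffffffffffffff'])
b = np.array(['cccccccccccc', 'iiiiiiiiiiiiiiiii'])

def join_txt(text):
    # Return a 0-d object array so the inferred output dtype is object, not a fixed-width string
    return np.asarray(" ".join(text), dtype=object)

np.apply_along_axis(join_txt, 0, [a, b])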

the result is

array(['sssssssss cccccccccccc', 'ffffffffffffff iiiiiiiiiiiiiiiii'], dtype=object)

alexcoca · 2019-12-04T12:14:19Z

@cerlymarco, how does the approach you propose compare to @linwoodc3's solution in terms of speed?

eric-wieser · 2019-12-04T12:28:34Z

Perhaps we should just add a dtype= argument to apply_along_axis

Unfortunately this isn't possible without breaking someone. The current signature is:

def apply_along_axis(func1d, axis, arr, *args, **kwargs):

Today, users can call it as both:

def f1(x):
    return x
np.apply_along_axis(f1, 0, my_arr)

def f2(x, *, dtype):
    return x.astype(dtype)
np.apply_along_axis(f2, 0, my_arr, dtype=int)

If we make apply_along_axis take a dtype argument and not pass it on to f, then f2 will fail. If we make it take a dtype argument and pass it on to f, then f1 will fail.

What we could do is:

Emit a FutureWarning if 'dtype' in kwargs telling people to rename their arguments
Wait 2 years
Break any users still using something like f2 above
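Until something like that happens, one possible user-side workaround is a small wrapper that casts each result before NumPy infers the output dtype. The sketch below is an assumption, not a NumPy API: apply_along_axis_as and its out_dtype parameter are hypothetical names. Because out_dtype is a separate positional parameter, a dtype keyword is still forwarded to functions like f2 above.

```python
import numpy as np

def apply_along_axis_as(func1d, axis, arr, out_dtype, *args, **kwargs):
    # Hypothetical helper, not part of NumPy: cast every result to out_dtype
    # so that the first pass fixes the output dtype to the requested one.
    def cast_result(x, *f_args, **f_kwargs):
        return np.asarray(func1d(x, *f_args, **f_kwargs), dtype=out_dtype)
    return np.apply_along_axis(cast_result, axis, arr, *args, **kwargs)

def f2(x, *, dtype):
    return x.astype(dtype)

b = np.array([[111, 111, 0, 0, 0], [111, 111, 111, 111, 111]])
joined = apply_along_axis_as(lambda x: " ".join(map(str, x)), 1, b, object)

# The dtype keyword still reaches f2; out_dtype only controls the final cast.
my_arr = np.array([[1.7, 2.2], [3.9, 4.1]])
forwarded = apply_along_axis_as(f2, 0, my_arr, float, dtype=int)

print(joined)
print(forwarded)
```

A function that itself expects a keyword named out_dtype would still clash with this helper, so the name is a trade-off rather than a full fix.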

jcpayne · 2020-01-19T03:07:06Z

Just a comment that I was trying to add a prefix and suffix to a filename using Numpy on a Pandas Series and ran into the same problem (I think); i.e., that the second string was cut to the same length as the first.

filenames = Series(['S1/C03/C03_R1/S1_C03_R1_PICT0239.JPG','S1/C03/C03_R1/S1_C03_R1_PICT0239.JPG'])
prefix = 'somepath'
np.char.add(prefix, filenames.astype(str))

array(['somepathS1/C03/C', 'somepathS1/C03/C'], dtype='<U16')

gfyoung mentioned this issue Dec 9, 2016

BUG: Get common dtype in apply_along_axis #8363

Closed

eric-wieser mentioned this issue Feb 13, 2017

numpy apply_along_axis drops values after casting incorrectly #5193

Open

eric-wieser mentioned this issue Feb 20, 2017

MAINT: make np.ma.apply_along_axis consistent with np.apply_along_axis #8511

Closed

rcomer mentioned this issue Oct 6, 2018

cube.aggregated_by and multidimensional auxcoords SciTools/iris#3174

Merged

matsen mentioned this issue Jan 18, 2019

apply_along_axis is trimming strings to be the same length matsengrp/vampire#82

Closed

seberg added 01 - Enhancement component: numpy.lib defunct — difficulty: Intermediate labels May 1, 2019

eric-wieser mentioned this issue Jul 8, 2020

Concatenation issue during array transformation #16785

Closed

simonjayhawkins mentioned this issue Sep 24, 2020

BUG:Pandas 1.0.3 → 1.1.1 behavior change on DataFrame.apply() whith raw option and func returning string pandas-dev/pandas#35940

Closed

Talmaj mentioned this issue May 6, 2021

Speed up save_obj function, add option of saving normals. facebookresearch/pytorch3d#667

Closed

seisman mentioned this issue Oct 8, 2023

clib: Fix the bug when passing multiple columns of strings with variable lengths to the GMT C API GenericMappingTools/pygmt#2719

Merged
