ENH: add default value to str.extract #38001

erfannariman · 2020-11-22T17:15:55Z

Is your feature request related to a problem?

In some cases we can set a default value for the non-matches of str.extract with Series.fillna or DataFrame.fillna.

But there are cases where the data already contains NaN values before str.extract is applied. Running fillna on the result then fills both kinds, so we can no longer distinguish the NaN values that were in the original data from the NaN values that str.extract produced for non-matches.
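A minimal sketch of the problem with the current API, using the same sample values as the example further down in this issue. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Same sample data as in the example in this issue
s = pd.Series(['a84', 'abcd', '99string', np.nan])

# 'abcd' has no digits, so it becomes NaN; the original NaN stays NaN
extracted = s.str.extract(r'(\d+)', expand=False)

# fillna cannot tell the two kinds of NaN apart and fills both
filled = extracted.fillna('missing')
print(filled.tolist())
```

After fillna, rows 1 and 3 hold the same value even though only row 1 was a real non-match.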

Describe the solution you'd like

Set a default value which has to be a string, to indicate the non matches of the regex pattern.

API breaking implications

None I think

from pandas import DataFrame
import numpy as np

df = DataFrame({'A': ['a84', 'abcd', '99string', np.nan]})
result = df['A'].str.extract(r'(\d+)', expand=False, default='missing')
print(df, '\n')
print(result)

          A
0       a84
1      abcd
2  99string
3       NaN 

0         84
1    missing    # <--- the value which did not match
2         99
3        NaN    # <--- the NaN already present in the data prior to str.extract
Name: A, dtype: object

erfannariman · 2020-11-22T17:18:12Z

take

jreback · 2020-11-26T16:52:54Z

does this functionality match anything in the standard library? why is this useful?

erfannariman · 2020-11-26T23:04:17Z

does this functionality match anything in the standard library?

Not that I can think of, it would require three extra steps to achieve this:

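import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a84', 'abcd', '99string', np.nan]})

mask1 = df['A'].notna()  # rows that had a value before extraction
result = df['A'].str.extract(r'(\d+)', expand=False)
mask2 = result.isna()  # rows with no match, or NaN in the input

# note: np.where returns a NumPy array, not a Series
result = np.where(mask1 & mask2, 'missing', result)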
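The same steps could be wrapped in a small helper that keeps the result as a Series. This is only a sketch: `extract_with_default` is a hypothetical name and not part of pandas, and it assumes a pattern with a single capture group.

```python
import numpy as np
import pandas as pd


def extract_with_default(s, pat, default):
    """Hypothetical helper (not a pandas API): str.extract with a fill
    value for non-matches, assuming `pat` has one capture group."""
    extracted = s.str.extract(pat, expand=False)
    # Non-match: the input had a value, but extraction produced NaN
    no_match = s.notna() & extracted.isna()
    # Series.mask replaces values where the condition is True
    return extracted.mask(no_match, default)


s = pd.Series(['a84', 'abcd', '99string', np.nan], name='A')
result = extract_with_default(s, r'(\d+)', 'missing')
print(result)
```

Using Series.mask instead of np.where keeps the index and name of the original Series.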

why is this useful?

Because right now it is not possible to distinguish between NaN values that were already present before using str.extract and NaN values that str.extract returns for non-matches.

@jreback

simonjayhawkins · 2021-05-10T09:29:06Z

@erfannariman some string accessor methods have a na parameter, e.g. startswith and match, maybe this is more appropriate.
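For reference, a short sketch of how the existing `na` parameter behaves on startswith and match. Note that it sets the result for missing input values, not for non-matches, so it would not map one-to-one onto the request above.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a84', 'abcd', '99string', np.nan])

# Without na, the missing element stays missing in the result;
# with na=False it is reported as False instead
starts = s.str.startswith('a', na=False)
matches = s.str.match(r'\d+', na=False)

print(starts.tolist())
print(matches.tolist())
```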

simonjayhawkins · 2021-05-10T09:30:53Z

Set a default value which has to be a string, to indicate the non matches of the regex pattern.

why would it need to be restricted to a string? (for object dtypes)

erfannariman added the Enhancement and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Nov 22, 2020

github-actions bot assigned erfannariman Nov 22, 2020

erfannariman mentioned this issue Nov 22, 2020

ENH: str extract with default value #38003

Closed

5 tasks

jreback added the Strings (String extension data type and string data) label and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label Nov 26, 2020

simonjayhawkins added this to the Contributions Welcome milestone May 10, 2021

simonjayhawkins added the API - Consistency (Internal Consistency of API/Behavior) label May 10, 2021

mroeschke added the Needs Discussion (Requires discussion from core team before further action) label Aug 14, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
