ENH: add default value to str.extract #38001

Open
erfannariman opened this issue Nov 22, 2020 · 5 comments
Labels: API - Consistency, Enhancement, Needs Discussion, Strings

@erfannariman
Member

erfannariman commented Nov 22, 2020

Is your feature request related to a problem?

In some cases, we can set a default value for the non-matches of str.extract with Series.fillna or DataFrame.fillna.

However, the data may already contain NaN values before str.extract is applied. Running fillna on the result then fills both kinds, so we can no longer distinguish the NaN values that were in the original data from the NaN values that str.extract produced for non-matches.
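To make the problem concrete, here is a minimal sketch using the same sample values as the example further down (the variable names are only illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(['a84', 'abcd', '99string', np.nan])

# 'abcd' has no digits, so str.extract returns NaN for it;
# the last row is NaN because the input itself was missing
extracted = s.str.extract(r'(\d+)', expand=False)

# fillna cannot tell those two cases apart and fills both
filled = extracted.fillna('missing')
print(filled.tolist())  # ['84', 'missing', '99', 'missing']
```

Row 1 (no match) and row 3 (missing input) end up with the same marker.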

Describe the solution you'd like

Set a default value which has to be a string, to indicate the non matches of the regex pattern.

API breaking implications

None, I think.

from pandas import DataFrame
import numpy as np

df = DataFrame({'A': ['a84', 'abcd', '99string', np.nan]})
# Proposed API: the default keyword does not exist in str.extract today
result = df['A'].str.extract(r'(\d+)', expand=False, default='missing')
print(df, '\n')
print(result)

          A
0       a84
1      abcd
2  99string
3       NaN 

0         84
1    missing    # <--- the value which did not match
2         99
3        NaN    # <--- the NaN already present in the data prior to str.extract
Name: A, dtype: object
erfannariman added the Enhancement and Needs Triage labels Nov 22, 2020
@erfannariman
Member Author

take

jreback added the Strings label and removed the Needs Triage label Nov 26, 2020
@jreback
Contributor

jreback commented Nov 26, 2020

Does this functionality match anything in the standard library? Why is this useful?

@erfannariman
Member Author

erfannariman commented Nov 26, 2020

> Does this functionality match anything in the standard library?

Not that I can think of. Achieving this currently requires three extra steps:

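import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a84', 'abcd', '99string', np.nan]})

mask1 = df['A'].notna()  # rows that had a value before str.extract
result = df['A'].str.extract(r'(\d+)', expand=False)
mask2 = result.isna()  # rows that are NaN after str.extract

# Note: np.where returns a NumPy array, not a Series
result = np.where(mask1 & mask2, 'missing', result)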
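The steps above could be wrapped in a small helper. This is only a sketch: extract_with_default is a hypothetical name, not a pandas function, and it assumes the pattern has exactly one capture group so that expand=False returns a Series.

```python
import numpy as np
import pandas as pd


def extract_with_default(s, pat, default):
    """Hypothetical helper (not a pandas API): mark non-matches with default."""
    extracted = s.str.extract(pat, expand=False)
    # A non-match is a row that had a value but for which the pattern found nothing
    no_match = s.notna() & extracted.isna()
    # Series.mask keeps the index, whereas np.where would return a plain array
    return extracted.mask(no_match, default)


s = pd.Series(['a84', 'abcd', '99string', np.nan], name='A')
result = extract_with_default(s, r'(\d+)', 'missing')
print(result.tolist())  # ['84', 'missing', '99', nan]
```

Rows that were missing in the input stay NaN, so the two cases remain distinguishable.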

> Why is this useful?

Because right now the result alone does not let you distinguish values that were already NaN before str.extract from NaN values that str.extract returned for non-matches.

@jreback

@simonjayhawkins
Member

@erfannariman some string accessor methods have an na parameter, e.g. startswith and match. Maybe that would be more appropriate.
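For reference, a short sketch of how the existing na parameter behaves on startswith and match. As far as I can tell, na sets the value reported for missing input, which is the opposite case from marking non-matches, so the semantics would need discussion before reusing the name.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a84', 'abcd', '99string', np.nan])

# na=False: the missing input in the last row is reported as False
starts_with_a = s.str.startswith('a', na=False)
starts_with_digits = s.str.match(r'\d+', na=False)

print(starts_with_a.tolist())       # [True, True, False, False]
print(starts_with_digits.tolist())  # [False, False, True, False]
```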

simonjayhawkins added this to the Contributions Welcome milestone May 10, 2021
@simonjayhawkins
Member

simonjayhawkins commented May 10, 2021

> Set a default value which has to be a string, to indicate the non matches of the regex pattern.

Why would it need to be restricted to a string (for object dtypes)?
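As an illustration of that point, an object-dtype Series can hold a non-string marker today. The sketch below uses Series.mask with -1 as an arbitrary sentinel; nothing here is a proposed API.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a84', 'abcd', '99string', np.nan], dtype=object)

extracted = s.str.extract(r'(\d+)', expand=False)
no_match = s.notna() & extracted.isna()

# Cast to object explicitly so that mixing strings with an int sentinel is allowed
result = extracted.astype(object).mask(no_match, -1)
print(result.tolist())  # ['84', -1, '99', nan]
```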

simonjayhawkins added the API - Consistency label May 10, 2021
mroeschke added the Needs Discussion label Aug 14, 2021
mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

4 participants