Skip to content

BUG: Series.replace converts np.nan into pd.NaT implicitly #48034

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
3 tasks done
ng-henry opened this issue Aug 11, 2022 · 8 comments
Open
3 tasks done

BUG: Series.replace converts np.nan into pd.NaT implicitly #48034

ng-henry opened this issue Aug 11, 2022 · 8 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Regression Functionality that used to work in a prior pandas version replace replace method

Comments

@ng-henry
Copy link

ng-henry commented Aug 11, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

s1 = pd.Series([pd.Timestamp(1000223), np.nan], dtype = object)

a1 = s1.replace("abcdef", pd.NaT, regex=True)
a2 = s1.replace("abcdef", pd.NaT, regex=False)

# Why?
assert a1[1] is pd.NaT # Here np.nan is implicitly converted to pd.NaT
assert a2[1] is np.nan # Here np.nan is NOT implicitly converted

Issue Description

np.nan are implicitly converted to pd.NaT when all doing:

  • Series.replace with regex = True flag
  • all values are either pd.Timestamp or np.nan

Expected Behavior

This behaviour is inconsistent. The documentation of Series.replace suggests that the function only replaces strings matching regex with the to_replace value. I did not expect for it to implicitly coerce values.

Expected behaviour is like the a2 case in the example: np.nan keeps being np.nan and does not get converted.

Where it happens

In L763-764 blocks.py, the block variable in L763 has np.nan. But after running block.convert(...), the result has pd.NaT. I believe the coercion happens specifically in L490 blocks.py:

attempt to coerce any object types to better types return a copy
of the block (if copy = True) by definition we are not an ObjectBlock
here!

Installed Versions

INSTALLED VERSIONS

commit : 4bfe3d0
python : 3.9.13.final.0
python-bits : 64
OS : Darwin
OS-release : 21.5.0
Version : Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:29 PDT 2022; root:xnu-8020.121.3~4/RELEASE_ARM64_T8101
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8

pandas : 1.4.2
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
pip : 22.0.4
setuptools : 62.1.0
Cython : None
pytest : 7.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.3
jinja2 : 3.0.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : 0.8.1
fsspec : 2022.02.0
gcsfs : None
markupsafe : 2.1.1
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 6.0.1
pyreadstat : None
pyxlsb : 1.0.9
s3fs : 0.4.2
scipy : None
snappy : None
sqlalchemy : 1.4.36
tables : None
tabulate : 0.8.9
xarray : None
xlrd : 2.0.1
xlwt : None
zstandard : None

@ng-henry ng-henry added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 11, 2022
@phofl
Copy link
Member

phofl commented Aug 19, 2022

Hi, thanks for your report.

sorry, tested on 1.3.5 initially. Yes you are correct, this should not convert. Labelling as a regression, since this worked in 1.3.5

@phofl phofl added Usage Question Needs Info Clarification about behavior needed to assess issue replace replace method Dtype Conversions Unexpected or buggy dtype conversions Regression Functionality that used to work in a prior pandas version and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member Usage Question Needs Info Clarification about behavior needed to assess issue labels Aug 19, 2022
@phofl phofl added this to the 1.4.4 milestone Aug 19, 2022
@phofl
Copy link
Member

phofl commented Aug 19, 2022

This is a general issue, regex=False also replaces NaN with NaT in specific cases:

s1 = pd.Series([pd.Timedelta("PT1H"), np.nan, 1], dtype = object)
a2 = s1.replace(1, pd.Timedelta("PT1H"), regex=False)

returns

0   0 days 01:00:00
1               NaT
2   0 days 01:00:00
dtype: timedelta64[ns]

@ng-henry
Copy link
Author

Yes, it seems that pd.replace will attempt to cast columns to their "best possible type." Since all values of a column are either datetimes or nans, it converts the whole column to datetime.

I wonder if we can disable this behaviour for nans? Don't treat nan's as datetimes therefore don't convert nan's to NaT.

@CloseChoice
Copy link
Member

CloseChoice commented Aug 21, 2022

I looked into this a bit and found, that we change the whole array when we use replace_regex. For the reproducible example of @ng-henry the conversion happens exactly in here. When we use replace on s1 = pd.Series([pd.Timestamp(1000223), np.nan], dtype = object), we end up with seen.nan_ = True and seen.datetime_ = True.
This results effectively in returning
DatetimeIndex([Timestamp('1970-01-01 00:00:00.001000223'), np.nan])._data._ndarray which ends up in DatetimeIndex(['1970-01-01 00:00:00.001000223', 'NaT'], dtype='datetime64[ns]', freq=None).
So we totally neglect that np.nan was seen during looping over the array. I guess we must change the total casting procedure to fix this.

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Aug 25, 2022
@simonjayhawkins
Copy link
Member

sorry, tested on 1.3.5 initially. Yes you are correct, this should not convert. Labelling as a regression, since this worked in 1.3.5

first bad commit: [b1a2f48] BUG: inconsistency in dtype of replace() (#44897)

The change looks intentionally but agree with OP that would expect the code sample to be a no-op.

@simonjayhawkins
Copy link
Member

#46393 (comment) maybe related.

@simonjayhawkins
Copy link
Member

The change that caused the regression was in _replace_regex in pandas/core/internals/blocks.py

Now there is a convert parameter in the function signature that is unused.

I assume the last line should have been changed to return block.convert(numeric=False, copy=False) if convert else [block] in #44897

now that won't fix this issue since there does not appear to be any logic to determine if the op is a no-op based on the regex not matching any values. There is code dotted around that short circuits if some other conditions determine that the op is a no-op.

But also stepping back, whether the "fix" to be consistent with regex=False was the right thing to do considering #48034 (comment)

@simonjayhawkins
Copy link
Member

removing from 1.4.x milestone as I think needs further discussion.

@simonjayhawkins simonjayhawkins removed this from the 1.4.4 milestone Aug 30, 2022
@jorisvandenbossche jorisvandenbossche added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Nov 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Regression Functionality that used to work in a prior pandas version replace replace method
Projects
None yet
Development

No branches or pull requests

5 participants