BUG: read_csv() silently ignores out-of-range integers #55232

Open
2 of 3 tasks
davidchall opened this issue Sep 21, 2023 · 5 comments
Labels
Bug IO CSV read_csv, to_csv Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@davidchall
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from io import StringIO
import pandas as pd

# raises exception: cannot safely cast non-equivalent int32 to uint8
pd.Series([-1, 257], dtype="UInt8")

# no exception raised
data = StringIO("x\n-1\n257")
df = pd.read_csv(data, dtype={"x": "UInt8"})

# unexpected wraparound behavior: -1 -> 255, 257 -> 1
df.x

Issue Description

The read_csv() function no longer raises an exception when it encounters an integer that is out of range for the requested dtype. Instead, the value silently wraps around.

Expected Behavior

On pandas 1.5.3, pd.read_csv() raises a "cannot cast" exception, which is similar to how this scenario is handled by the pd.Series() constructor. I expect pandas 2.1.1 to continue this behavior.
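
Until the regression is resolved, a caller could guard against the wraparound by reading the column at the default integer width and checking the range before narrowing it. The following is a minimal sketch of that idea, not something proposed in this thread. The helper name read_uint8_checked is hypothetical, and it assumes the column contains no missing values.

```python
from io import StringIO

import numpy as np
import pandas as pd


def read_uint8_checked(csv_text: str, column: str) -> pd.DataFrame:
    """Hypothetical helper: read a CSV and narrow one column to UInt8,
    raising instead of wrapping when a value does not fit."""
    # Let pandas infer a wide integer dtype first (int64 for small integers).
    df = pd.read_csv(StringIO(csv_text))

    # Compare every value against the bounds of the target dtype.
    bounds = np.iinfo(np.uint8)
    in_range = df[column].between(bounds.min, bounds.max)
    if not in_range.all():
        bad = df.loc[~in_range, column].tolist()
        raise ValueError(f"values out of range for UInt8: {bad}")

    # Every value fits, so narrowing cannot change any of them.
    return df.astype({column: "UInt8"})
```

Calling read_uint8_checked("x\n-1\n257", "x") raises ValueError, while in-range input such as "x\n0\n255" comes back as a UInt8 column.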

Installed Versions

python : 3.11.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22621

pandas : 2.1.1
numpy : 1.26.0

@davidchall davidchall added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 21, 2023
@paulreece
Contributor

paulreece commented Sep 21, 2023

I can confirm that the read_csv method no longer raises an exception on the main development branch. However, I received different output than noted for df.x:

>>> data = StringIO("x\n-1\n257")
>>> df = pd.read_csv(data, dtype={"x": "UInt8"})
>>> df.x
0    255
1      1
Name: x, dtype: UInt8

@davidchall
Contributor Author

Hi @paulreece - thanks for your quick reply. Your output is consistent with mine (sorry if I was unclear). The -1 input becomes 255 output and the 257 input becomes 1 output.
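
The mapping described above is what an unchecked narrowing cast to an 8-bit unsigned integer produces: each value is reduced modulo 2**8. A small NumPy sketch, independent of read_csv and added here only to illustrate the arithmetic:

```python
import numpy as np

values = np.array([-1, 257], dtype=np.int64)

# An unchecked cast keeps only the low 8 bits of each value.
wrapped = values.astype(np.uint8)

# The same result expressed as arithmetic modulo 2**8.
modulo = [int(v) % 256 for v in values]

print(wrapped.tolist())  # [255, 1]
print(modulo)            # [255, 1]
```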

@rhshadrach rhshadrach added the Regression Functionality that used to work in a prior pandas version label Sep 22, 2023
@rhshadrach rhshadrach added this to the 2.1.2 milestone Sep 22, 2023
@rhshadrach
Member

I didn't see any notes on this in the whatsnew for 2.0.0 or 2.1.0. A git bisect should be run to determine where this behavior changed.

@lithomas1 lithomas1 modified the milestones: 2.1.2, 2.1.3 Oct 26, 2023
@lithomas1
Member

So I think the issue is that _from_sequence_of_strings basically does an astype to the specified dtype without requesting safe casting, here:

values = values.astype(dtype.numpy_dtype, copy=False)

The old code called _safe_cast on IntegerArray which raised a TypeError.
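
The difference between a plain astype and a checked cast can be sketched with NumPy alone. This illustrates the general round-trip idea (cast, then confirm that no value changed) and is not the pandas implementation; safe_cast_sketch is a hypothetical name.

```python
import numpy as np


def safe_cast_sketch(values: np.ndarray, dtype) -> np.ndarray:
    """Hypothetical helper: cast, but refuse if any value would change."""
    try:
        # Succeeds only when the target dtype can represent every value
        # of the source dtype (for example int8 -> int64).
        return values.astype(dtype, casting="safe", copy=False)
    except TypeError:
        # Otherwise do the unchecked cast and verify nothing was altered.
        casted = values.astype(dtype, copy=False)
        if (casted == values).all():
            return casted
        raise TypeError(
            f"cannot safely cast non-equivalent {values.dtype} to {np.dtype(dtype)}"
        )
```

With this sketch, safe_cast_sketch(np.array([0, 255]), np.uint8) returns a uint8 array, whereas safe_cast_sketch(np.array([-1, 257]), np.uint8) raises TypeError rather than returning the wrapped values.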

For example, this call:

pd.UInt8Dtype().construct_array_type()._from_sequence_of_strings(['-1', '257'], dtype=pd.UInt8Dtype())

cc @jbrockmendel

@lithomas1 lithomas1 added IO CSV read_csv, to_csv and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 8, 2023
@jbrockmendel
Member

I think we would need something like #45588 for this, but AFAIK there hasn't been any movement on that recently.

@jorisvandenbossche jorisvandenbossche modified the milestones: 2.1.3, 2.1.4 Nov 13, 2023
@lithomas1 lithomas1 modified the milestones: 2.1.4, 2.2 Dec 8, 2023
@lithomas1 lithomas1 modified the milestones: 2.2, 2.2.1 Jan 20, 2024
@lithomas1 lithomas1 modified the milestones: 2.2.1, 2.2.2 Feb 23, 2024
@lithomas1 lithomas1 modified the milestones: 2.2.2, 2.2.3 Apr 10, 2024
6 participants