BUG: using dtype='int64' argument of Series causes ValueError: values cannot be losslessly cast to int64 for integer strings #45017

Closed
wants to merge 15 commits into from
15 changes: 12 additions & 3 deletions pandas/core/dtypes/cast.py
@@ -1811,8 +1811,17 @@ def maybe_cast_to_integer_array(
     # doesn't handle `uint64` correctly.
     arr = np.asarray(arr)

-    if is_unsigned_integer_dtype(dtype) and (arr < 0).any():
-        raise OverflowError("Trying to coerce negative values to unsigned integers")
+    if is_unsigned_integer_dtype(dtype):
+        try:
+            if (arr < 0).any():
Contributor

Why is this a try/except?

Contributor Author

The input arr is ["0", "1", "2"], which raises a TypeError in the check (arr < 0).any(). The values can still be cast to uint, so in that case I check (casted < 0).any() instead, and the cast succeeds.

Contributor

This doesn't make sense. Why does this check matter? Is casted valid?

Where is the test that checks the overflow?

Strive for minimal code.

Contributor Author

result = Series(["0", "1", "2"], dtype=uint8) should give a valid cast, and casted is valid, but (arr < 0).any() raises a TypeError.

Contributor Author

The point is that NumPy does not support comparing string arrays with integers, which is what leads to the TypeError.
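
The failure described in this thread can be reproduced outside pandas. The following is a minimal sketch, assuming only that NumPy is installed; the name comparison_failed is illustrative and does not appear in the PR.

```python
import numpy as np

# The array NumPy builds from the list of numeric strings discussed above.
arr = np.asarray(["0", "1", "2"])

try:
    # Ordering a string array against an integer is not supported.
    (arr < 0).any()
    comparison_failed = False
except TypeError:
    comparison_failed = True

print(comparison_failed)
```

The exact exception message differs between NumPy versions, so the sketch only relies on the exception being a TypeError (or a subclass of it).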

+                raise OverflowError(
+                    "Trying to coerce negative values to unsigned integers"
+                )
+        except TypeError:
Member

Can you add a comment here about which cases get here?

+            if (casted < 0).any():
+                raise OverflowError(
Contributor

Where is this hit in tests?

Contributor Author

result = Series(["0", "1", "2"], dtype=uint8) hits this, for every uint dtype in any_int_dtype.

Member

I don't understand this. If I pass ["0", "1", "2"] to np.array on L2046 with dtype=np.uint8, I get back a uint8 ndarray that shouldn't raise here.

+                    "Trying to coerce negative values to unsigned integers"
+                )
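
The Member's observation about the cast can be checked in isolation. This is a minimal sketch, assuming NumPy parses numeric strings when an integer dtype is requested; has_negative is an illustrative name, not something from the PR.

```python
import numpy as np

# Build the array directly with an unsigned dtype, as in the comment above.
casted = np.array(["0", "1", "2"], dtype=np.uint8)

# An unsigned array cannot hold negative values, so this check is False.
has_negative = bool((casted < 0).any())

print(casted.dtype, casted.tolist(), has_negative)
```

Since the dtype of casted is unsigned, a (casted < 0).any() check on it should never be true, which appears to be the concern behind the question of where the branch is hit in tests.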

     if is_float_dtype(arr.dtype):
         if not np.isfinite(arr).all():
@@ -1823,7 +1832,7 @@ def maybe_cast_to_integer_array(
     if is_object_dtype(arr.dtype):
         raise ValueError("Trying to coerce float values to integers")

-    if casted.dtype < arr.dtype:
+    if casted.dtype < arr.dtype or is_string_dtype(arr.dtype):
         # GH#41734 e.g. [1, 200, 923442] and dtype="int8" -> overflows
         warnings.warn(
             f"Values are too large to be losslessly cast to {dtype}. "
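
The GH#41734 case named in the code comment can be illustrated with plain NumPy. This is a sketch under the assumption that the input is a 64-bit integer array; the names narrower and lossless are illustrative and are not taken from pandas.

```python
import numpy as np

arr = np.array([1, 200, 923442], dtype="int64")
casted = arr.astype("int8")  # values outside the int8 range wrap around

# dtype ordering: int8 can be safely cast to int64, so it compares as smaller.
narrower = bool(casted.dtype < arr.dtype)

# The round trip is not lossless because 200 and 923442 do not fit in int8.
lossless = bool(np.array_equal(arr, casted))

print(casted.tolist(), narrower, lossless)  # [1, -56, 50] True False
```

A narrower target dtype combined with values that do not round-trip is the situation the warning in the diff is meant to report.
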
14 changes: 14 additions & 0 deletions pandas/tests/series/test_constructors.py
@@ -1895,6 +1895,20 @@ def test_constructor_bool_dtype_missing_values(self):
         expected = Series(True, index=[0], dtype="bool")
         tm.assert_series_equal(result, expected)

+    def test_constructor_int64_dtype(self, any_int_dtype):
Contributor

No, please just use the fixture itself, i.e. no parametrize.

Contributor Author

This is causing an AssertionError.

Contributor Author

The previous code segment is what leads to this issue; with only int64 there is no issue.

Contributor

you need to match the expected value as well

Contributor Author

@jreback I think I have covered everything?

Contributor

@shubham11941140 you are not using the fixtures; please do so.

Contributor

Just remove the parametrize completely.

Contributor Author

The uint dtypes (uint8, uint16, uint32, uint64) are failing due to the internal code implementation. Do I fix this?

Contributor Author

@jreback removed the parametrization; it should be ready now.

+        # GH-44923
+        result = Series(["0", "1", "2"], dtype=any_int_dtype)
+        expected = Series([0, 1, 2], dtype=any_int_dtype)
+        tm.assert_series_equal(result, expected)
+
+    def test_constructor_float64_dtype(self, any_float_dtype):
+        # GH-44923
+        if any_float_dtype in ["Float32", "Float64"]:
Contributor

is there an issue for this?

Contributor Author

The Float dtypes (Float32, Float64) are failing because strings cannot be implicitly cast to them. Since the implicit cast fails, I am xfailing them.

"u can xfail the failing ones that are tricky to fix eg Float but the others should work"

This is the one you mentioned above.

Member

I think these now work in main: #45424

+            pytest.xfail(reason="Cannot be casted to FloatDtype Series")
+        result = Series(["-1", "0", "1", "2"], dtype=any_float_dtype)
+        expected = Series([-1.0, 0.0, 1.0, 2.0], dtype=any_float_dtype)
+        tm.assert_series_equal(result, expected)
+
     @pytest.mark.filterwarnings(
         "ignore:elementwise comparison failed:DeprecationWarning"
     )