BUG: convert_dtypes incorrectly converts byte strings to strings in 1.3+ #43199
Hello @shubham11941140! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-10-17 02:14:48 UTC |
this needs some scrutiny
the check is way too early
needs scrutiny in the sense |
Could you please help me complete my build so I can move on to testing? |
The code segment passes the testcase in pytest successfully. Please review. |
As Jeff said, you are checking too early. You have to look into the functions called here to determine where the cast takes place and fix it nearby.
I don't understand which other files the casts take place in, or how to fix them. |
cast.py is the only place where the actual cast and the dtype inference (inferred_dtype) take place. series.py and generic.py only call functions with the necessary arguments, so no other change is required there. |
This is handled a few lines below in the if conditions, you have to adjust there |
A few lines below the if condition in which file? |
Same file. You have to check where bytes is cast to string |
Do you mean to say that in cast.py, there is a place where bytes is cast to string? |
Yes. Please step through with the debugger to see what takes place |
Please let me know if any more changes are required. |
Thanks @shubham11941140 for working on this. As I have commented above, I am happy to not fix the reported "regression" and not revert to 1.2.5 behavior (i.e. do nothing and close this PR). That is only my opinion, and I previously suggested that we wait for comments from others. I will bring this up in today's dev meeting. |
@shubham11941140 we discussed this at the dev meeting this week and the consensus was to revert to 1.2.5 behavior. i.e. A column with object dtype containing bytes objects should not be changed by convert_dtypes. so this PR currently converts the column to
We are planning to release 1.3.4 tomorrow, so changing milestone here to 1.3.5 |
If I keep the dtype object then it essentially means doing no change in the PR, so how will I solve the bug for which I opened the PR? |
from #43183 (comment)
using the following code sample
1.2.5 gives
master gives
but this PR is currently producing
we want the same output as 1.2.5 |
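The code sample and outputs referenced above are not reproduced in this thread. As a rough stand-in, a check for the agreed behavior (an object column holding bytes is left untouched by convert_dtypes) might look like the sketch below. The column name and values are made up for illustration and are not the sample from the linked comment.

```python
import pandas as pd

# Hypothetical data: an object-dtype column that holds bytes objects.
df = pd.DataFrame({"A": [b"abc", b"def"]})

result = df.convert_dtypes()

# Agreed behavior (1.2.5, and releases that include the fix):
# the column keeps object dtype and the values stay bytes.
print(result.dtypes)
print(result["A"].tolist())
```

On an affected 1.3.x release the dtype check would be expected to fail, since the regression turned the column into a string column.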
I am getting the exact output as you mentioned above. It follows the exact same behaviour as 1.2.5 |
@jbrockmendel from your comment #43183 (comment), is this the fix you were expecting? |
doc/source/whatsnew/v1.3.4.rst
Outdated
@@ -14,6 +14,8 @@ including other versions of pandas.

Fixed regressions
~~~~~~~~~~~~~~~~~

can you remove this whitespace.
Done
@simonjayhawkins if ok here this could go in 1.3.4 |
sure on green (and a thumbs up from @jbrockmendel) |
test failure probably unrelated.
|
pandas/core/dtypes/cast.py
Outdated
@@ -1426,6 +1426,8 @@ def convert_dtypes(
    if is_string_dtype(inferred_dtype):
        if not convert_string:
            return input_array.dtype
        elif inferred_dtype == "bytes":
            return pandas_dtype("object")
@jbrockmendel is this the same as returning input_array.dtype?
If we are not changing is_string_dtype(inferred_dtype) to return False, then maybe we should do
if not convert_string or inferred_dtype == "bytes":
    return input_array.dtype
else:
    ...
instead?
I've still not looked at the source of the regression to know what the correct fix is.
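To compare the two shapes of the fix discussed here, a simplified stand-alone model can help. This is only a sketch: it uses plain strings in place of real dtypes, treats "string" and "bytes" as the inferred kinds for which is_string_dtype(inferred_dtype) is true, and the function names are hypothetical. It is not the pandas source.

```python
def choose_dtype_elif(inferred_dtype, input_dtype, convert_string=True):
    """Model of the patch as first written: a separate elif for bytes."""
    if inferred_dtype in ("string", "bytes"):  # stand-in for is_string_dtype
        if not convert_string:
            return input_dtype
        elif inferred_dtype == "bytes":
            return "object"
        return "string"
    return input_dtype


def choose_dtype_combined(inferred_dtype, input_dtype, convert_string=True):
    """Model of the suggested form: bytes shares the 'keep input dtype' branch."""
    if inferred_dtype in ("string", "bytes"):  # stand-in for is_string_dtype
        if not convert_string or inferred_dtype == "bytes":
            return input_dtype
        return "string"
    return input_dtype


# For an object-dtype input, both forms leave a bytes column as object.
print(choose_dtype_elif("bytes", "object"))
print(choose_dtype_combined("bytes", "object"))
```

In this model the two forms agree whenever the input dtype is object, which is the case under discussion. The combined condition simply avoids hard-coding "object" and keeps whatever dtype the input already had.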
Made the change.
The incorrect conversion to string no longer happens: the values remain bytes objects, so decoding them still gives a string. The column is no longer converted to string, because its dtype stays object rather than string.
test failure probably unrelated
|
Thanks @shubham11941140 |
…s byte strings to strings in 1.3+
…ings to strings in 1.3+ (#44066)
Added extra try except block to identify the correct type to solve the problem.