Skip to content

BUG: convert_dtypes incorrectly converts byte strings to strings in 1.3+ #43199

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 34 commits into from
Oct 17, 2021

Conversation

shubham11941140
Copy link
Contributor

@shubham11941140 shubham11941140 commented Aug 24, 2021

Added extra try except block to identify the correct type to solve the problem.

@pep8speaks
Copy link

pep8speaks commented Aug 24, 2021

Hello @shubham11941140! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-10-17 02:14:48 UTC

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs some scrutiny
the check is way too early

@shubham11941140
Copy link
Contributor Author

needs scrutiny in the sense
Actually the problem is I am unable to build my pytest as I am getting E
ModuleNotFoundError: No module named 'pandas._libs.interval' error
Please help me to solve this, then I can run local tests.

@shubham11941140
Copy link
Contributor Author

Please can you help me to compete my build so I can move to the testing.

@shubham11941140
Copy link
Contributor Author

The code segment passes the testcase in pytest successfully. Please review.

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As jeff said, you are checking too early. You have to look into the functions called here to determine where the cast takes place and fix it nearby

@shubham11941140
Copy link
Contributor Author

I am unable to understand that in which other files are the casts taking place and to fix them.

@shubham11941140
Copy link
Contributor Author

cast.py is the only place where the actual cast takes place and the inferred_dtype takes place, in series.py and generic.py only functions are called with the necessary arguments where no other change is required.

@phofl
Copy link
Member

phofl commented Aug 25, 2021

This is handled a few lines below in the if conditions, you have to adjust there

@shubham11941140
Copy link
Contributor Author

Few lines below the if condition in which particular file?

@phofl
Copy link
Member

phofl commented Aug 25, 2021

Same file.

You have to check where bytes is cast to string

@shubham11941140
Copy link
Contributor Author

Do you mean to say that in cast.py, there is a place where bytes is cast to string?

@phofl
Copy link
Member

phofl commented Aug 25, 2021

Yes. Please step through with the debugger to see what takes place

@simonjayhawkins simonjayhawkins added Dtype Conversions Unexpected or buggy dtype conversions Strings String extension data type and string data labels Aug 25, 2021
@simonjayhawkins simonjayhawkins added this to the 1.3.3 milestone Aug 25, 2021
@simonjayhawkins simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label Aug 25, 2021
@shubham11941140 shubham11941140 requested a review from phofl August 29, 2021 07:18
@shubham11941140
Copy link
Contributor Author

Please inform if any more changes are required.

@simonjayhawkins
Copy link
Member

Please, mentors, just give me a method on how to proceed ahead as it has been about half a month.

Thanks @shubham11941140 for working on this.

As I have commented above, I am happy to not fix the reported "regression" and not revert to 1.2.5 behavior (i.e. do nothing and close this PR)

That is only my opinion and previously suggested that we wait for comments from others. I will bring this up in today's dev meeting.

@simonjayhawkins
Copy link
Member

@shubham11941140 we discussed this at the dev meeting this week and the consensus was to revert to 1.2.5 behavior. i.e. A column with object dtype containing bytes objects should not be changed by convert_dtypes.

so this PR currently converts the column to |S13 dtype whereas convert _dtypes should be a no-op and the dtype should remain as object

E       Attribute "dtype" are different
E       [left]:  |S13
E       [right]: object

We are planning to release 1.3.4 tomorrow, so changing milestone here to 1.3.5

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.4, 1.3.5 Oct 16, 2021
@shubham11941140
Copy link
Contributor Author

If I keep the dtype object then it essentially means doing no change in the PR, so how will I solve the bug for which I opened the PR?

@simonjayhawkins
Copy link
Member

from #43183 (comment)

The issue is that converted_dtypes() will convert the data column values to the string b'binary-data', almost like it has had str(val) called on it. On 1.2.5 it is left correctly as a byte array.

using the following code sample

import pandas as pd

print(pd.__version__)
byte_str = b"binary-string"
df = pd.DataFrame(
    data={
        "data": byte_str,
    },
    index=[0],
)
result = df.convert_dtypes()
print(result)
print(result.dtypes)
print(type(result.data[0]))

1.2.5 gives

1.2.5
               data
0  b'binary-string'
data    object
dtype: object
<class 'bytes'>

master gives

1.4.0.dev0+894.gdca6901d45
               data
0  b'binary-string'
data    string
dtype: object
<class 'str'>

but this PR is currently producing

1.4.0.dev0+834.g5f2933d534
               data
0  b'binary-string'
data    |S13
dtype: object
<class 'numpy.bytes_'>

If I keep the dtype object then it essentially means doing no change in the PR, so how will I solve the bug for which I opened the PR?

we want the same output as 1.2.5

@shubham11941140
Copy link
Contributor Author

I am getting the exact output as you mentioned above. It follows the exact same behaviour as 1.2.5

@simonjayhawkins
Copy link
Member

@jbrockmendel from your comment #43183 (comment), is this the fix you were expecting?

@@ -14,6 +14,8 @@ including other versions of pandas.

Fixed regressions
~~~~~~~~~~~~~~~~~

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you remove this whitespace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@jreback jreback modified the milestones: 1.3.5, 1.3.4 Oct 16, 2021
@jreback
Copy link
Contributor

jreback commented Oct 16, 2021

@simonjayhawkins if ok here this could go in 1.3.4

@simonjayhawkins
Copy link
Member

@simonjayhawkins if ok here this could go in 1.3.4

sure on green (and a thumbs up from @jbrockmendel)

@simonjayhawkins
Copy link
Member

test failure probably unrelated.

=========================== short test summary info ============================
FAILED pandas/tests/io/test_gcs.py::test_to_csv_compression_encoding_gcs[zip-cp1251]
= 1 failed, 174784 passed, 5034 skipped, 1220 xfailed, 6 xpassed, 24 warnings in 2530.08s (0:42:10) =

@@ -1426,6 +1426,8 @@ def convert_dtypes(
if is_string_dtype(inferred_dtype):
if not convert_string:
return input_array.dtype
elif inferred_dtype == "bytes":
return pandas_dtype("object")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel is this the same as returning input_array.dtype?

If we are not changing is_string_dtype(inferred_dtype) to return False, then maybe we should do

            if not convert_string or inferred_dtype == "bytes":
                return input_array.dtype
            else:
                ...

instead?

I've still not looked at the source of the regression to know what the correct fix is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the change.

The incorrect fix to the string is stopped as the class continues to remain bytes so decoding will give string. It will not convert it to string anymore as the dtype will be object and not string.

@simonjayhawkins
Copy link
Member

test failure probably unrelated

=========================== short test summary info ===========================
FAILED pandas/tests/io/formats/style/test_style.py::TestStyler::test_applymap_subset_multiindex[slice_4]
= 1 failed, 55488 passed, 1752 skipped, 462 xfailed, 8 xpassed, 41 warnings in 603.29s (0:10:03) =

@simonjayhawkins simonjayhawkins merged commit 3eeef2a into pandas-dev:master Oct 17, 2021
@simonjayhawkins
Copy link
Member

Thanks @shubham11941140

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Regression Functionality that used to work in a prior pandas version Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: convert_dtypes incorrectly converts byte strings to strings in 1.3+
6 participants