BUG: convert_dtypes incorrectly converts byte strings to strings in 1.3+ #43199

shubham11941140 · 2021-08-24T13:42:46Z

closes BUG: convert_dtypes incorrectly converts byte strings to strings in 1.3+ #43183
tests added / passed in the test_convert_dtypes.py file
Ensure all linting tests pass, see here for how to run them
whatsnew entry

Added extra try except block to identify the correct type to solve the problem.

pep8speaks · 2021-08-24T13:42:49Z

Hello @shubham11941140! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-10-17 02:14:48 UTC

jreback

this needs some scrutiny
the check is way too early

shubham11941140 · 2021-08-24T13:50:28Z

In what sense does this need scrutiny?
Actually, my problem is that I cannot run pytest locally, because I get this error:
E ModuleNotFoundError: No module named 'pandas._libs.interval'
Please help me solve this so that I can run the tests locally.
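An error like the one above usually suggests that a compiled extension module cannot be imported from the checkout. That is an assumption about the cause; the thread does not confirm it. A minimal sketch of a quick check follows, where `can_import` is a hypothetical helper and not part of pandas:

```python
import importlib


def can_import(module_name: str) -> bool:
    """Return True if the module imports cleanly, False on ImportError."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # compiled extension ends up here as well.
        return False
    return True


# Report on the package and on the extension module named in the error.
for name in ("pandas", "pandas._libs.interval"):
    print(name, "importable:", can_import(name))
```

If the second line reports False while the pandas sources are present, the extension modules most likely still need to be built for that environment.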

shubham11941140 · 2021-08-24T17:24:12Z

Could you please help me complete my build so that I can move on to testing?

shubham11941140 · 2021-08-25T06:43:36Z

The code segment passes the testcase in pytest successfully. Please review.

phofl

As Jeff said, you are checking too early. You have to look into the functions called here to determine where the cast takes place, and fix it near there.

shubham11941140 · 2021-08-25T13:31:48Z

I do not understand in which other files the casts take place, or how to fix them.

shubham11941140 · 2021-08-25T13:47:03Z

cast.py is the only place where the actual cast and the dtype inference (inferred_dtype) take place. series.py and generic.py only call functions with the necessary arguments, so no other change is required there.

phofl · 2021-08-25T13:50:43Z

This is handled a few lines below, in the if conditions. You have to adjust it there.

shubham11941140 · 2021-08-25T13:51:37Z

A few lines below the if condition in which particular file?

phofl · 2021-08-25T13:52:35Z

Same file.

You have to check where bytes is cast to string

shubham11941140 · 2021-08-25T13:53:49Z

Do you mean to say that in cast.py, there is a place where bytes is cast to string?

phofl · 2021-08-25T13:54:20Z

Yes. Please step through with the debugger to see what takes place

shubham11941140 · 2021-08-29T07:18:38Z

Please let me know if any more changes are required.

simonjayhawkins · 2021-10-13T09:43:13Z

Please, mentors, just give me a method on how to proceed ahead as it has been about half a month.

Thanks @shubham11941140 for working on this.

As I have commented above, I am happy to not fix the reported "regression" and not revert to 1.2.5 behavior (i.e. do nothing and close this PR)

That is only my opinion, and I previously suggested that we wait for comments from others. I will bring this up in today's dev meeting.

simonjayhawkins · 2021-10-16T17:30:11Z

@shubham11941140 we discussed this at the dev meeting this week and the consensus was to revert to 1.2.5 behavior. i.e. A column with object dtype containing bytes objects should not be changed by convert_dtypes.

So this PR currently converts the column to |S13 dtype, whereas convert_dtypes should be a no-op and the dtype should remain object.

E       Attribute "dtype" are different
E       [left]:  |S13
E       [right]: object

We are planning to release 1.3.4 tomorrow, so changing milestone here to 1.3.5

shubham11941140 · 2021-10-16T17:34:15Z

If I keep the dtype object then it essentially means doing no change in the PR, so how will I solve the bug for which I opened the PR?

simonjayhawkins · 2021-10-16T17:51:32Z

from #43183 (comment)

The issue is that convert_dtypes() will convert the data column values to the string b'binary-data', almost like it has had str(val) called on it. On 1.2.5 it is left correctly as a byte array.

using the following code sample

import pandas as pd

print(pd.__version__)
byte_str = b"binary-string"
df = pd.DataFrame(
    data={
        "data": byte_str,
    },
    index=[0],
)
result = df.convert_dtypes()
print(result)
print(result.dtypes)
print(type(result.data[0]))

1.2.5 gives

1.2.5
               data
0  b'binary-string'
data    object
dtype: object
<class 'bytes'>

master gives

1.4.0.dev0+894.gdca6901d45
               data
0  b'binary-string'
data    string
dtype: object
<class 'str'>

but this PR is currently producing

1.4.0.dev0+834.g5f2933d534
               data
0  b'binary-string'
data    |S13
dtype: object
<class 'numpy.bytes_'>

If I keep the dtype object then it essentially means doing no change in the PR, so how will I solve the bug for which I opened the PR?

we want the same output as 1.2.5
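The "almost like it has had str(val) called on it" remark quoted above can be illustrated in plain Python, without pandas. This is only a sketch of the str-versus-bytes distinction; it does not reproduce what convert_dtypes does internally.

```python
# Plain-Python illustration; no pandas involved.
byte_str = b"binary-string"

# str() on a bytes object returns its repr, including the b'...' wrapper.
as_text = str(byte_str)

# Decoding is the explicit way to turn bytes into text.
decoded = byte_str.decode("utf-8")

# The first two print the same characters but have different types.
print(type(byte_str).__name__, byte_str)
print(type(as_text).__name__, as_text)
print(type(decoded).__name__, decoded)
```

This is why the printed frames above look identical across versions even though the element type changes from bytes to str.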

shubham11941140 · 2021-10-16T18:13:29Z

I am getting the exact output as you mentioned above. It follows the exact same behaviour as 1.2.5

simonjayhawkins · 2021-10-16T18:20:33Z

@jbrockmendel from your comment #43183 (comment), is this the fix you were expecting?

simonjayhawkins · 2021-10-16T18:22:07Z

doc/source/whatsnew/v1.3.4.rst

@@ -14,6 +14,8 @@ including other versions of pandas.

 Fixed regressions
 ~~~~~~~~~~~~~~~~~
+


Can you remove this whitespace?

jreback · 2021-10-16T18:27:42Z

@simonjayhawkins if ok here this could go in 1.3.4

simonjayhawkins · 2021-10-16T18:50:06Z

@simonjayhawkins if ok here this could go in 1.3.4

sure on green (and a thumbs up from @jbrockmendel)

simonjayhawkins · 2021-10-16T20:41:15Z

test failure probably unrelated.

=========================== short test summary info ============================
FAILED pandas/tests/io/test_gcs.py::test_to_csv_compression_encoding_gcs[zip-cp1251]
= 1 failed, 174784 passed, 5034 skipped, 1220 xfailed, 6 xpassed, 24 warnings in 2530.08s (0:42:10) =

simonjayhawkins · 2021-10-16T21:00:22Z

pandas/core/dtypes/cast.py

@@ -1426,6 +1426,8 @@ def convert_dtypes(
        if is_string_dtype(inferred_dtype):
            if not convert_string:
                return input_array.dtype
+            elif inferred_dtype == "bytes":
+                return pandas_dtype("object")


@jbrockmendel is this the same as returning input_array.dtype?

If we are not changing is_string_dtype(inferred_dtype) to return False, then maybe we should do

        if not convert_string or inferred_dtype == "bytes":
            return input_array.dtype
        else:
            ...

instead?

I've still not looked at the source of the regression to know what the correct fix is.

Made the change.

The incorrect conversion to string no longer happens: the values remain bytes objects, so decoding them still gives a string. convert_dtypes will not convert the column to string anymore, because the dtype will be object and not string.
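The question of whether the separate elif is the same as returning input_array.dtype can be explored with a toy model in plain Python. This is a simplified stand-in, not the pandas source. The function names, the use of plain strings for dtypes, and the `STRING_LIKE` set are hypothetical. It also assumes that the inferred label for a bytes column is "bytes" and that this label is treated as string-like.

```python
# Hypothetical: inferred labels that the string-like branch would accept.
STRING_LIKE = {"string", "bytes"}


def pick_dtype_elif(inferred, input_dtype, convert_string=True):
    """Variant from the diff: a separate elif that hard-codes object."""
    if inferred in STRING_LIKE:
        if not convert_string:
            return input_dtype
        elif inferred == "bytes":
            return "object"
        return "string"
    return input_dtype


def pick_dtype_combined(inferred, input_dtype, convert_string=True):
    """Suggested variant: bytes falls through to the input dtype."""
    if inferred in STRING_LIKE:
        if not convert_string or inferred == "bytes":
            return input_dtype
        return "string"
    return input_dtype


# Object input: both variants leave the column untouched.
print(pick_dtype_elif("bytes", "object"), pick_dtype_combined("bytes", "object"))
# Non-object input: only the combined variant is a true no-op.
print(pick_dtype_elif("bytes", "S13"), pick_dtype_combined("bytes", "S13"))
```

In this toy model the two variants agree whenever the input dtype is already object, and differ only when the input carries some other dtype, in which case the combined condition keeps it unchanged.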

simonjayhawkins · 2021-10-17T11:03:48Z

test failure probably unrelated

=========================== short test summary info ===========================
FAILED pandas/tests/io/formats/style/test_style.py::TestStyler::test_applymap_subset_multiindex[slice_4]
= 1 failed, 55488 passed, 1752 skipped, 462 xfailed, 8 xpassed, 41 warnings in 603.29s (0:10:03) =

simonjayhawkins · 2021-10-17T11:04:22Z

Thanks @shubham11941140

Implemented the byte_string

b534f56

shubham11941140 added 2 commits August 24, 2021 19:15

Update test_convert_dtypes.py

23583fc

Update test_convert_dtypes.py

96338fa

jreback requested changes Aug 24, 2021

Update cast.py

ad4189f

Passes the tests

e915bc0

shubham11941140 requested a review from jreback August 25, 2021 06:44

shubham11941140 added 2 commits August 25, 2021 12:36

Removed PEP 8 issues

6d0a497

Update cast.py

5ab2d79

phofl requested changes Aug 25, 2021

simonjayhawkins added the Dtype Conversions (Unexpected or buggy dtype conversions) and Strings (String extension data type and string data) labels Aug 25, 2021

simonjayhawkins added this to the 1.3.3 milestone Aug 25, 2021

simonjayhawkins added the Regression (Functionality that used to work in a prior pandas version) label Aug 25, 2021

Removed Try Except Block

7921c6f

shubham11941140 requested a review from phofl August 29, 2021 07:18

shubham11941140 added 2 commits August 29, 2021 20:27

pre-commit changes added

3e7a91f

mypy static error solved

b5b0e27

simonjayhawkins modified the milestones: 1.3.4, 1.3.5 Oct 16, 2021

shubham11941140 added 2 commits October 16, 2021 23:41

Follows 1.2.5 behaviour

bfb1242

Merge branch 'b2' of https://github.com/shubham11941140/pandas into b2

37ee298

simonjayhawkins requested a review from jbrockmendel October 16, 2021 18:19

simonjayhawkins reviewed Oct 16, 2021

Removed Whitespace

77ee435

jreback approved these changes Oct 16, 2021

jreback modified the milestones: 1.3.5, 1.3.4 Oct 16, 2021

shubham11941140 requested a review from simonjayhawkins October 16, 2021 18:29

simonjayhawkins reviewed Oct 16, 2021

Removed elif

e70f68b

shubham11941140 requested a review from simonjayhawkins October 17, 2021 02:16

simonjayhawkins merged commit 3eeef2a into pandas-dev:master Oct 17, 2021

meeseeksmachine mentioned this pull request Oct 17, 2021

Backport PR #43199 on branch 1.3.x (BUG: convert_dtypes incorrectly converts byte strings to strings in 1.3+) #44066

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Oct 17, 2021

Backport PR pandas-dev#43199: BUG: convert_dtypes incorrectly convert…

c74b60e

…s byte strings to strings in 1.3+

jreback pushed a commit that referenced this pull request Oct 17, 2021

Backport PR #43199: BUG: convert_dtypes incorrectly converts byte str…

3dd4974

…ings to strings in 1.3+ (#44066)

shubham11941140 deleted the b2 branch October 17, 2021 16:56
