BUG: read_hdf bad filtering in case of categorical string columns #39351


@nofarm3 nofarm3 commented Jan 23, 2021

This bug occurs when filtering on a categorical string column with a value that doesn't exist in the column.
Instead of returning an empty dataframe, we get some records.
It happens because the code uses numpy's searchsorted(v, side="left"), which returns the index where the value should be inserted to maintain order. It does not return 0 when the value is missing, as the code assumed.
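
To illustrate the behavior described above, here is a minimal sketch using a small made-up array of sorted category values (the `categories` array and the lookup values are assumptions for illustration, not taken from the pandas code). It shows that a missing value yields an insertion point that is indistinguishable from a valid position:

```python
import numpy as np

# Hypothetical sorted category values standing in for the column's categories.
categories = np.array(["a", "b", "c"])

# A value that exists: searchsorted returns the index of the match.
present = int(np.searchsorted(categories, "b", side="left"))

# Values that do not exist: searchsorted returns where they would be inserted,
# which looks like an ordinary index rather than a "not found" marker.
missing_mid = int(np.searchsorted(categories, "bb", side="left"))
missing_end = int(np.searchsorted(categories, "z", side="left"))

print(present, missing_mid, missing_end)
```

Here the missing value "bb" maps to the same kind of index as an existing category, so a filter built from that index can match real rows.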

I changed it to first check for the value and to use searchsorted only if the value exists. I also added a test for this specific use case.
I think that in the long run we should maybe refactor this area of the code, since one function covers multiple use cases, which makes it more complex to test.
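
The "check first, then searchsorted" idea can be sketched roughly as follows. This is not the actual patch: `lookup_code` is a hypothetical helper, and returning -1 for a missing value is an assumption chosen for illustration.

```python
import numpy as np


def lookup_code(categories, value):
    """Return the position of `value` in sorted `categories`, or -1 if absent.

    Hypothetical helper: the -1 sentinel is an assumption, not the PR's code.
    """
    # Check membership first, so a missing value never reaches searchsorted.
    if not bool(np.any(categories == value)):
        return -1
    # The value exists, so the left insertion point is its actual index.
    return int(np.searchsorted(categories, value, side="left"))


# Hypothetical sorted category values.
categories = np.array(["a", "b", "c"])
found = lookup_code(categories, "b")
not_found = lookup_code(categories, "bb")
```

With a sentinel that matches no row, a query on a nonexistent value can produce an empty result instead of rows belonging to a neighboring category.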

@nofarm3 nofarm3 marked this pull request as draft January 23, 2021 18:06

@jreback jreback left a comment

need a test pls

@jreback jreback added Categorical Categorical Data Type IO HDF5 read_hdf, HDFStore labels Jan 24, 2021

nofarm3 commented Jan 25, 2021

yes, ofc.
I used unittest to write it and then realized that I'm not allowed to use it, so I need to convert my tests to pytest with monkeypatch.

pep8speaks commented Jan 26, 2021

Hello @nofarm3! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-01-26 19:36:42 UTC

@nofarm3 nofarm3 closed this Jan 26, 2021
@nofarm3 nofarm3 deleted the read-hdf-returns-unexpected-values-for-categorical branch January 26, 2021 20:57
@nofarm3 nofarm3 restored the read-hdf-returns-unexpected-values-for-categorical branch January 26, 2021 20:58
Development

Successfully merging this pull request may close these issues.

BUG: queries on categorical string columns in read_hdf return unexpected results