duplicated removing items that are not duplicates in 0.18 #14443
Comments
show what you are actually calling as well as |
Here are the lines of code that I run:
and here is the result of
I have been looking further into this and stepping through the pandas code, and I found that (predictably) it was my own mistake. There were duplicates; I was just not seeing them. You can close this ticket.
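For anyone who hits the same confusion, here is a minimal sketch of one way to inspect flagged rows side by side. The data and the column names (`name`, `city`, `price`, `note`) are made up for illustration and are not from the original DataFrame. Passing `keep=False` marks every member of a duplicate group rather than only the later occurrences, and sorting on the compared columns puts matching rows next to each other.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 agree on the compared columns
# but differ in "note", which is easy to misread as "not duplicates".
df = pd.DataFrame({
    "name": ["a", "b", "a", "c"],
    "city": ["x", "y", "x", "z"],
    "price": [1.5, 2.0, 1.5, 3.0],
    "note": ["first", "other", "second", "third"],
})

cols = ["name", "city", "price"]  # columns being compared

# keep=False flags all rows in each duplicate group, not just the repeats.
mask = df.duplicated(subset=cols, keep=False)

# Sort on the compared columns so members of a group sit together.
pairs = df[mask].sort_values(cols)
print(pairs)
```

Looking only at `pairs[cols]` shows whether the flagged rows really are identical on the compared columns.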
I wish I could say more about this, but it involves proprietary data. Basically, I have a DataFrame of 10 columns and around 7000 rows. When I call "duplicated" on 4 of the columns (3 hold strings, 1 holds a float), I get around 10 items flagged as duplicates (5 pairs) when, in fact, they are different.
This is sample text of what is being flagged:
I know that duplicates are checked through hashing, but is there some way to, I don't know, compare a checksum or a more robust measurement to ensure that only duplicates are flagged? What is the chance that two items could be hashed to the same value?
By adding more of the fields to check the data I am able to prevent false duplicate flags.
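A minimal sketch of why adding fields changes the result, using made-up data (the column names `name`, `city`, `code`, `price`, and `note` are hypothetical). `duplicated` compares only the columns listed in `subset`, so rows that agree on those columns are flagged even when they differ in a column that was left out.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 match on four columns but differ in "note".
df = pd.DataFrame({
    "name": ["a", "a", "b"],
    "city": ["x", "x", "y"],
    "code": ["p", "p", "q"],
    "price": [1.5, 1.5, 2.0],
    "note": ["first", "second", "third"],
})

four_cols = ["name", "city", "code", "price"]

# Only the four listed columns are compared: row 1 repeats row 0.
flag_subset = df.duplicated(subset=four_cols)

# Including "note" in the comparison makes every row distinct.
flag_all = df.duplicated(subset=four_cols + ["note"])

print(flag_subset.tolist())  # [False, True, False]
print(flag_all.tolist())     # [False, False, False]
```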
I am using Pandas 0.18 from WinPython-64bit-3.4.4.2Qt5.