
ENH: df.duplicated() default behavior should default to tagging all duplicates as True #65320

@griffinjmbur

Description


Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Hi, folks. I don't complain much about excellent free, open-source software, and I'm reasonably proficient with pandas: I have about three years of solid use under my belt, and before that about six years with Stata, of which I am (was) a maven. pandas is generally much better... BUT...

df.duplicated() has what seems to me to be very confusing default behavior. In Stata, for example, one might use the equivalent duplicate-tagging command to locate rows that are duplicates on a particular column and then hand-inspect some of them, to see whether they are data-entry errors or whether, by contrast, one has missed a nuance and should actually expect a small number of duplicates on that column. This might sound sloppy, but when the codebook/docs for the data are poor, it is a good way to figure out precisely what it means for a row to be a duplicate on some subset of columns.

For that reason, I propose that the behavior of df.duplicated(keep=False) should be the default. I just spent around 30 minutes trying to figure out whether the problem was with my code or with the DataFrame, because I was trying to locate rows that are duplicates on a column that should in theory have a few duplicates but not many, and I could not actually see any duplicates after running the following.

df["dupe"] = df.duplicated(subset=["year", "id"])
df.groupby("year").apply(lambda g: g["dupe"].mean())  # share of tagged rows per year
df.loc[df["dupe"]].apply(foo)  # inspect the tagged rows

Of course, I eventually realized that this is because the default behavior is not to tag the first occurrence of a duplicate as a duplicate; in my case, since the multiplicity of a duplicate is almost always two, that meant that subsetting the data to the tagged rows actually omitted each duplicate's partner from view.
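To make the tagging difference concrete, here is a minimal sketch with a hypothetical toy frame (the column names `year`, `id`, `x` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame: the first two rows share (year, id).
df = pd.DataFrame({"year": [2020, 2020, 2021],
                   "id":   [1, 1, 2],
                   "x":    [5, 6, 7]})

# Default (keep="first"): the first occurrence is NOT tagged.
print(df.duplicated(subset=["year", "id"]).tolist())
# -> [False, True, False]

# keep=False: every member of a duplicate group is tagged,
# so subsetting shows both partners of each pair.
print(df.duplicated(subset=["year", "id"], keep=False).tolist())
# -> [True, True, False]
```

With the default, selecting the tagged rows returns only the second row of each pair, which is exactly why the duplicates seemed to vanish.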

I did skim the docs, obviously inefficiently at first, but the wording itself seems extremely confusing. I did not realize that keep=False would solve my problem, because the issue is not really about keep/drop behavior but about tagging behavior. Renaming the option to tag or something similar would probably lead to less confusion and make it more obvious that the default behavior is odd.

On that note, another reason to change this is that there is no particular reason to expect the first duplicate to be the one worth keeping (at least without more context on the DataFrame; certainly this is not true in general). Keeping the first might make sense when rows are duplicated on all columns, but if the subset is anything less than the full set of columns, the current default can lead to very unexpected results. And if someone does simply have outright duplicate records, it costs no significant time or memory (AFAICT) to go back and set keep="first" once it has been confirmed that any record might as well be kept.
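The go-back step described above can be as simple as the following sketch (again with made-up column names, assuming the pairs were confirmed to be genuine full duplicates):

```python
import pandas as pd

df = pd.DataFrame({"year": [2020, 2020, 2021],
                   "id":   [1, 1, 2],
                   "x":    [5, 5, 7]})

# After hand-inspecting with keep=False and confirming the pairs are
# true duplicates, keep an arbitrary (here: the first) record per pair.
deduped = df.drop_duplicates(subset=["year", "id"], keep="first")
print(len(deduped))
# -> 2
```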

Feature Description

I'm not a backend dev, merely a fairly tech-savvy sociologist, but I think the technical side of this would be very easy.

Alternative Solutions

People could read the docs very carefully before seeking other solutions.

Additional Context

No response
