Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
Hi, folks. I don't complain much about excellent open-source software, and I'm reasonably proficient with pandas: I have about three years of solid use under my belt, and before that about six years of Stata, where I am (was) a maven. Pandas is generally much better... BUT...
df.duplicated() has what seems to me to be a very confusing default. In Stata, for example, one might use the equivalent duplicate-tagging command to locate observations that are duplicates on a particular column and then hand-inspect some of those rows, to see whether they are data-entry errors or whether, by contrast, one has missed a nuance and should actually expect a small number of duplicates on that column. This might sound sloppy, but when the codebook/docs for a DataFrame are bad, it is a good way to figure out precisely what it means for a row to be a duplicate on some subset of columns.
For that reason, I propose that df.duplicated(keep=False) be the default behavior. I just spent around 30 minutes hunting for a bug in my code (or possibly in the DataFrame) because I was trying to locate people who are duplicates on a column that should, in theory, have a few duplicates but not many, and I could not see any duplicates at all after running the following.
```python
df["dupe"] = df.duplicated(subset=["year", "id"])
df.groupby("year").apply(lambda g: g["dupe"].mean())  # share of flagged rows per year
df.loc[df["dupe"]].apply(foo)
```
Of course, I eventually realized that this is because the default behavior is to not tag the first occurrence of a duplicate as a duplicate; in my case, since the multiplicity of a duplicate is almost always two, subsetting the data to only the flagged rows was actually hiding half of every duplicated pair from view.
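To illustrate (a minimal sketch; the column names here are invented for the example), with pairs of duplicates the default flags only the second member of each pair, while keep=False flags both:

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2020, 2020, 2021, 2021],
    "id":   [1, 1, 2, 3],  # id 1 appears twice in 2020
})

# Default (keep="first"): the first occurrence is NOT flagged,
# so each duplicated pair contributes only one True.
default_flags = df.duplicated(subset=["year", "id"])
print(default_flags.tolist())  # [False, True, False, False]

# keep=False: every member of a duplicated group is flagged,
# so filtering on the flag shows the full pairs for hand inspection.
all_flags = df.duplicated(subset=["year", "id"], keep=False)
print(all_flags.tolist())  # [True, True, False, False]
```

Filtering with the default flags would show only one row of the pair, which is exactly the surprise described above.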
I did skim the docs, obviously inefficiently at first, but the wording itself seems extremely confusing. I did not realize that keep=False would solve my problem, because this is not really about keep/drop behavior but about tagging behavior. Renaming the option to something like tag would probably lead to less confusion and make it more obvious that the default behavior is odd.
On that note, another reason to change this, I think, is that there is no particular reason to expect the first duplicate to be the one worth keeping (at least without more context on the DataFrame; it is certainly not true in general). This might make sense when a row is a duplicate on all columns, but if the subset is anything less than the full set of columns, this default can lead to very unexpected results. And if someone really does have outright duplicate records, it costs no significant time or memory (AFAICT) to go back and set keep="first" once it has been confirmed that any record will do.
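To make the arbitrariness concrete (a toy sketch; the score column is invented for illustration): when deduplicating on a subset, keep="first" silently decides which of two conflicting records survives based on nothing but row order:

```python
import pandas as pd

df = pd.DataFrame({
    "id":    [1, 1],
    "score": [10, 99],  # same id, conflicting data
})

# keep="first" keeps score=10 and drops score=99; nothing in the
# data says the first row is the right one to keep.
deduped = df.drop_duplicates(subset=["id"], keep="first")
print(deduped["score"].tolist())  # [10]

# Reversing the row order keeps the other record instead.
deduped_rev = df.iloc[::-1].drop_duplicates(subset=["id"], keep="first")
print(deduped_rev["score"].tolist())  # [99]
```

Which record is "first" depends entirely on how the data happened to be sorted, which is why tagging all members with keep=False for inspection is the safer starting point.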
Feature Description
I'm not a backend dev, merely a fairly tech-savvy sociologist, but I think the technical side of this would be very easy.
Alternative Solutions
People could read the docs very carefully before seeking other solutions.
Additional Context
No response