BUG: DataFrameGroupBy.sum() drops column names when applied to an empty dataframe #46375

Closed
2 of 3 tasks
eugene57 opened this issue Mar 15, 2022 · 9 comments
Labels: Bug; Groupby; Reduction Operations (sum, mean, min, max, etc.); Regression (functionality that used to work in a prior pandas version)

@eugene57
Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df = pd.DataFrame(columns=['a', 'b', 'c'])
print(df.groupby('a', as_index=False).sum())

Issue Description

Only first column (groupby key) is preserved:

Empty DataFrame
Columns: [a]
Index: []

Expected Behavior

All columns of original dataframe should be preserved:

Empty DataFrame
Columns: [a, b, c]
Index: []
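
To see which columns go missing on a given installation, one option is to compare the result's columns against the input's. This is only a minimal sketch: the `missing` variable is an illustrative name, not part of pandas, and what it prints depends on the installed pandas version.

```python
import pandas as pd

df = pd.DataFrame(columns=["a", "b", "c"])
result = df.groupby("a", as_index=False).sum()

# Columns present in the input but absent from the aggregated result.
missing = [col for col in df.columns if col not in result.columns]
print(f"pandas {pd.__version__}: missing columns = {missing}")
```

An empty list means every input column survived the aggregation.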

Installed Versions

INSTALLED VERSIONS
------------------
commit : 67a3d42
python : 3.7.9.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-693.5.2.el7.x86_64
Version : #1 SMP Fri Oct 20 20:32:50 UTC 2017
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.4
numpy : 1.19.4
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 47.3.1.post20210215
Cython : 0.29.21
pytest : 5.4.3
hypothesis : 5.30.0
sphinx : 3.0.3
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.2
html5lib : 1.1
pymysql : 0.10.1
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.17.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.3
fastparquet : None
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : 2.0.0
pytables : None
pyxlsb : 1.0.9
s3fs : None
scipy : 1.5.4
sqlalchemy : 1.3.20
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.51.2


I also checked that the issue is still present in pandas 1.4.1.
eugene57 added the Bug and Needs Triage labels on Mar 15, 2022
@ryansdowning
ryansdowning commented Mar 15, 2022

Can confirm this issue exists on main. The issue also arises with DataFrameGroupBy.mean(), .sum(), .median(), but interestingly does not occur with .apply(), .min(), .max(), .all(), or .any(). I am looking further into this and will report back if I find anything.

@ryansdowning
It looks like this has something to do with the numeric_only flag in pandas.core.groupby.generic.DataFrameGroupBy._agg_general. By default, the methods that do not have this issue (min, max) use numeric_only=False, while the problematic methods (sum, mean, median) use lib.no_default as the default numeric_only argument. Here are some interesting things you can test to see what's going on:

Note: I left as_index=False off the groupby because it does not make a difference here.

In [1]: import pandas as pd, numpy as np

In [2]: df = pd.DataFrame(columns=['a', 'b', 'c'])

In [3]: df.groupby('a').sum()
Out[3]: 
Empty DataFrame
Columns: []
Index: []

In [4]: df.groupby('a').min()
Out[4]: 
Empty DataFrame
Columns: [b, c]
Index: []

In [5]: df.groupby('a').min(numeric_only=True)
Out[5]: 
Empty DataFrame
Columns: []
Index: []

In [6]: df.groupby('a').sum(numeric_only=False)
Out[6]: 
Empty DataFrame
Columns: [b, c]
Index: []

I am not sure if the default behavior of these methods is intended to be different, so I'll leave it to the maintainers to direct closing this issue.
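
The comparison above can also be run as a loop that always passes numeric_only explicitly, so the method's default does not come into play. This is a sketch under that assumption; the `surviving` dictionary is an illustrative name, and only sum and min are covered because those are the two methods shown with both flag values in the session above.

```python
import pandas as pd

df = pd.DataFrame(columns=["a", "b", "c"])
grouped = df.groupby("a")

# Map (method name, numeric_only flag) to the columns left in the result.
surviving = {}
for name in ("sum", "min"):
    for flag in (True, False):
        result = getattr(grouped, name)(numeric_only=flag)
        surviving[(name, flag)] = list(result.columns)

for key, columns in surviving.items():
    print(key, columns)
```

With the flag spelled out, sum and min should report the same columns for the same flag value, which points at the differing defaults rather than at the reductions themselves.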

mroeschke added the Groupby and Reduction Operations labels and removed the Needs Triage label on Mar 17, 2022
@dospix
Contributor

dospix commented Apr 10, 2022

take

dospix removed their assignment on Apr 16, 2022
@hhheidi
hhheidi commented Apr 20, 2022

take

@hhheidi
hhheidi commented Apr 22, 2022

@ryansdowning's examples are very helpful! I definitely agree that this is caused by numeric_only=True. I also found that this issue isn't unique to empty dataframes. Here:

In [0]: df = pd.DataFrame(
   ...:     {
   ...:         "a": [0, 0, 1, 1],
   ...:         "b": [1, "x", 2, "y"],
   ...:         "c": [1, 1, 2, 2],
   ...:     }
   ...: )
Out [0]:
| a | b | c
  0 | 1 | 1
  0 | x | 1
  1 | 2 | 2
  1 | y | 2

numeric_only=False performs as expected, as does numeric_only=None:

In [1]: df.groupby('a').first(numeric_only=False)
Out [1]:
  | b | c
a _______
0   1 | 1
1   2 | 2

In [2]: df.groupby('a').first(numeric_only=None)
Out [2]:
  | b | c
a _______
0   1 | 1
1   2 | 2

while numeric_only=True drops a column:

In [3]: df.groupby('a').first(numeric_only=True)
Out [3]:
  | c
a ____
0   1
1   2

It looks like any columns whose elements aren't strictly numeric are getting pruned.

Edit: actually, I'm not sure if that's the correct behavior for numeric_only=False, or if it's supposed to raise an exception.
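
The pruning described in this comment can be reproduced as a standalone script. The sketch below reuses the same frame and adds a `select_dtypes` call to list the columns pandas does not treat as numeric; `non_numeric`, `kept`, and `full` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [0, 0, 1, 1],
        "b": [1, "x", 2, "y"],  # mixed ints and strings, stored as object dtype
        "c": [1, 1, 2, 2],
    }
)

# Columns that pandas does not classify as numeric.
non_numeric = list(df.select_dtypes(exclude="number").columns)

kept = df.groupby("a").first(numeric_only=True)
full = df.groupby("a").first(numeric_only=False)

print("non-numeric columns:", non_numeric)
print("numeric_only=True  ->", list(kept.columns))
print("numeric_only=False ->", list(full.columns))
```

Column b is dropped as a whole because its dtype is object, even though some of its individual values are integers.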

simonjayhawkins added the Regression label on Jun 10, 2022
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue on Jun 10, 2022
@simonjayhawkins
Member

Expected Behavior

All columns of original dataframe should be preserved:

Empty DataFrame
Columns: [a, b, c]
Index: []

This was the result in pandas 1.2.5.

first bad commit: [6b94e24] BUG: DataFrameGroupBy with numeric_only and empty non-numeric data (#41706)

@jbrockmendel

@jbrockmendel
Member

Will this be fixed automatically in 2.0 when the numeric_only default/behavior changes?

@jbrockmendel
Member

@rhshadrach is this closed by the numeric_only deprecation?

@rhshadrach
Member

Yes - closing.
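
For code that has to run on a release where the columns are still dropped, one possible workaround (not proposed anywhere in this thread, so treat it as an assumption) is to reindex the result against the input's columns. The `restored` name is illustrative.

```python
import pandas as pd

df = pd.DataFrame(columns=["a", "b", "c"])
result = df.groupby("a", as_index=False).sum()

# Re-add any input columns the aggregation dropped; this changes nothing
# when the result already contains all of them.
restored = result.reindex(columns=df.columns)
print(list(restored.columns))
```

This is only appropriate for the empty-frame case reported here. On a non-empty frame, reindexing would fill the re-added columns with missing values rather than aggregated ones, and passing numeric_only=False explicitly, as in the earlier comment, is the more direct option.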
