BUG: DataFrameGroupBy.sum() drops column names when applied to an empty dataframe #46375

eugene57 · 2022-03-15T13:45:48Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df = pd.DataFrame(columns=['a', 'b', 'c'])
print(df.groupby('a', as_index=False).sum())

Issue Description

Only the first column (the groupby key) is preserved:

Empty DataFrame
Columns: [a]
Index: []

Expected Behavior

All columns of the original dataframe should be preserved:

Empty DataFrame
Columns: [a, b, c]
Index: []

Installed Versions

INSTALLED VERSIONS
------------------
commit : 67a3d42
python : 3.7.9.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-693.5.2.el7.x86_64
Version : #1 SMP Fri Oct 20 20:32:50 UTC 2017
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.4
numpy : 1.19.4
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 47.3.1.post20210215
Cython : 0.29.21
pytest : 5.4.3
hypothesis : 5.30.0
sphinx : 3.0.3
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.2
html5lib : 1.1
pymysql : 0.10.1
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.17.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.3
fastparquet : None
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : 2.0.0
pytables : None
pyxlsb : 1.0.9
s3fs : None
scipy : 1.5.4
sqlalchemy : 1.3.20
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.51.2


I also checked that the issue is still present in pandas 1.4.1.


ryansdowning · 2022-03-15T20:31:37Z

Can confirm this issue exists on main. The issue also arises with DataFrameGroupBy.mean(), .sum(), .median(), but interestingly does not occur with .apply(), .min(), .max(), .all(), or .any(). I am looking further into this and will report back if I find anything.

ryansdowning · 2022-03-16T00:08:45Z

It looks like this has something to do with the numeric_only flag in pandas.core.groupby.generic.DataFrameGroupBy._agg_general. The methods mentioned above that do not have this issue (min, max) use numeric_only=False by default. The problematic methods (sum, mean, median) use lib.no_default as the default numeric_only argument. Here are some interesting things you can test to see what's going on:

Note: I left as_index=False off the groupby because it does not make a difference here.

In [1]: import pandas as pd, numpy as np

In [2]: df = pd.DataFrame(columns=['a', 'b', 'c'])

In [3]: df.groupby('a').sum()
Out[3]: 
Empty DataFrame
Columns: []
Index: []

In [4]: df.groupby('a').min()
Out[4]: 
Empty DataFrame
Columns: [b, c]
Index: []

In [5]: df.groupby('a').min(numeric_only=True)
Out[5]: 
Empty DataFrame
Columns: []
Index: []

In [6]: df.groupby('a').sum(numeric_only=False)
Out[6]: 
Empty DataFrame
Columns: [b, c]
Index: []

I am not sure if the default behavior of these methods is intended to be different, so I'll leave it to the maintainers to direct closing this issue.
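The comparison above can be condensed into a small script that passes numeric_only explicitly, so the outcome does not depend on each method's version-specific default. This is only a sketch: the helper name result_columns is made up for illustration, and the behavior when numeric_only is omitted differs between pandas versions.

```python
import pandas as pd

# Empty frame: with no data, every column has object dtype.
df = pd.DataFrame(columns=["a", "b", "c"])


def result_columns(method, numeric_only):
    """Return the column labels produced by a groupby reduction.

    `method` is the name of a DataFrameGroupBy reduction such as "sum"
    or "min"; `numeric_only` is forwarded to it unchanged.
    """
    grouped = df.groupby("a")
    reduced = getattr(grouped, method)(numeric_only=numeric_only)
    return list(reduced.columns)


for method in ("sum", "min"):
    for flag in (True, False):
        print(method, flag, result_columns(method, flag))
```

With the flag spelled out, sum and min behave the same way: the object-dtype columns b and c survive only when numeric_only=False.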

dospix · 2022-04-10T12:19:54Z

take

hhheidi · 2022-04-20T22:01:49Z

take

hhheidi · 2022-04-22T01:37:16Z

@ryansdowning's examples are very helpful! I definitely agree that this is caused by numeric_only=True. I also found that this issue isn't unique to empty dataframes. Here:

In [0]: df = pd.DataFrame(
            {
                "a": [0, 0, 1, 1],
                "b": [1, "x", 2, "y"],
                "c": [1, 1, 2, 2],
            }
        )
Out [0]:
| a | b | c
  0 | 1 | 1
  0 | x | 1
  1 | 2 | 2
  1 | y | 2

numeric_only=False performs as expected, as does numeric_only=None:

In [1]: df.groupby('a').first(numeric_only=False)
Out [1]:
  | b | c
a _______
0   1 | 1
1   2 | 2

In [2]: df.groupby('a').first(numeric_only=None)
Out [2]:
  | b | c
a _______
0   1 | 1
1   2 | 2

while numeric_only=True drops a column:

In [3]: df.groupby('a').first(numeric_only=True)
Out [3]:
  | c
a ____
0   1
1   2

It looks like any columns whose elements aren't strictly numeric are getting pruned.

Edit: actually, I'm not sure if that's the correct behavior for numeric_only=False, or if it's supposed to raise an exception.
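A short sketch of why column b is the one that disappears: as far as I can tell, numeric_only selects columns by dtype rather than by inspecting individual elements, and a column mixing integers and strings is stored with object dtype. The variable names below (numeric_cols, kept_true, kept_false) are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [0, 0, 1, 1],
        "b": [1, "x", 2, "y"],  # mixed ints and strings -> object dtype
        "c": [1, 1, 2, 2],
    }
)

# Columns that pandas treats as numeric, judged by dtype.
numeric_cols = list(df.select_dtypes(include="number").columns)

# Columns kept by the reduction under each explicit setting.
kept_true = list(df.groupby("a").first(numeric_only=True).columns)
kept_false = list(df.groupby("a").first(numeric_only=False).columns)

print(df.dtypes)
print(numeric_cols, kept_true, kept_false)
```

Column b is absent from numeric_cols even though half of its values are integers, which matches it being dropped under numeric_only=True.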

simonjayhawkins · 2022-06-10T11:51:59Z

Expected Behavior

All columns of original dataframe should be preserved:
Empty DataFrame
Columns: [a, b, c]
Index: []

This was the result in pandas-1.2.5.

first bad commit: [6b94e24] BUG: DataFrameGroupBy with numeric_only and empty non-numeric data (#41706)

@jbrockmendel

jbrockmendel · 2022-07-11T17:37:06Z

Will this be fixed automatically in 2.0 when the numeric_only default/behavior changes?

jbrockmendel · 2023-02-02T00:16:50Z

@rhshadrach is this closed by the numeric_only deprecation?

rhshadrach · 2023-02-02T00:32:04Z

Yes - closing.
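For anyone who wants to check this locally, here is a minimal sketch based on the original reproducer. It passes numeric_only=False explicitly, which is the default for groupby reductions from pandas 2.0 onward, so the same call can be compared across versions. The printed result is left for the reader to confirm on their own installation.

```python
import pandas as pd

# Original reproducer: an empty frame with three object-dtype columns.
df = pd.DataFrame(columns=["a", "b", "c"])

# Pass numeric_only explicitly so the result does not depend on the
# version-specific default of sum().
result = df.groupby("a", as_index=False).sum(numeric_only=False)

print(pd.__version__)
print(list(result.columns))
```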

eugene57 added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels Mar 15, 2022

mroeschke added the Groupby and Reduction Operations (sum, mean, min, max, etc.) labels and removed the Needs Triage label Mar 17, 2022

github-actions bot assigned dospix Apr 10, 2022

dospix removed their assignment Apr 16, 2022

github-actions bot assigned hhheidi Apr 20, 2022

hhheidi mentioned this issue Apr 22, 2022

stopped numeric_only=True from dropping columns #46830

Closed

1 task

This was referenced May 30, 2022

BUG Fix: DataFrameGroupBy.sum() drops column names when applied to an empty dataframe #47174

Closed

Data frame groupby sum drop weikhor/pandas#1

Closed

simonjayhawkins added the Regression (functionality that used to work in a prior pandas version) label Jun 10, 2022

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jun 10, 2022

code sample for pandas-dev#46375

5117fd1

rhshadrach closed this as completed Feb 2, 2023
