
BUG: Memory consumption #53793


Closed
psulski opened this issue Jun 22, 2023 · 4 comments
Labels

  • Bug
  • Closing Candidate (May be closeable, needs more eyeballs)
  • Groupby
  • Needs Info (Clarification about behavior needed to assess issue)
  • Performance (Memory or execution speed performance)

Comments


psulski commented Jun 22, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

df_groubyA = df.groupby(['a_lA', 'multiA'], group_keys=True)
df_groubyB = df.groupby(['a_lB', 'multiB'], group_keys=True)

lA = df_groubyA.groups.keys()
lB = df_groubyB.groups.keys()

Issue Description

The lines above consume memory and do not release it, even if I del lA, lB, df, df_groubyA, and df_groubyB.
I also call gc.collect() after the delete.
These lines run in a loop; df holds data read from a file.
I read the files one by one and perform some operations on each DataFrame, but after 3-5 files RAM is exhausted.
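
For reference, a rough sketch of the loop being described (the file names, the reader, and the intermediate operations are assumptions, not the reporter's actual code):

import gc

import pandas as pd

# Hypothetical input files; the report only says "data from file".
paths = ["part_0.csv", "part_1.csv", "part_2.csv"]

for path in paths:
    df = pd.read_csv(path)  # df - data from file

    df_groubyA = df.groupby(['a_lA', 'multiA'], group_keys=True)
    df_groubyB = df.groupby(['a_lB', 'multiB'], group_keys=True)

    lA = df_groubyA.groups.keys()
    lB = df_groubyB.groups.keys()

    # ... some operations on the DataFrame ...

    # Everything is deleted and gc.collect() is called,
    # yet RAM usage reportedly keeps growing.
    del lA, lB, df, df_groubyA, df_groubyB
    gc.collect()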

Expected Behavior

RAM should be released after the objects are deleted.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 87cfe4e
python : 3.8.0.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-150-generic
Version : #167~18.04.1-Ubuntu SMP Wed May 24 00:51:42 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : pl_PL.UTF-8
LOCALE : pl_PL.UTF-8

pandas : 1.5.0
numpy : 1.23.5
pytz : 2022.7
dateutil : 2.8.2
setuptools : 67.1.0
pip : 23.0
Cython : 0.29.32
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.11.0
gcsfs : None
matplotlib : 3.6.3
numba : 0.56.4
numexpr : 2.8.4
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.0
snappy : None
sqlalchemy : None
tables : 3.8.0
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : 2022.7

@psulski added the Bug and Needs Triage labels on Jun 22, 2023
@harshvardhaniimi

Can you share more about the structure of your loop? Does deleting df free memory? Maybe that can be made more efficient. Loops and pandas don't play well with each other.

Also, something else to note:

When you call groupby, pandas creates a new DataFrameGroupBy object, which is a subsetting of the original DataFrame. This DataFrameGroupBy object keeps a reference to the entire original DataFrame. This is why deleting the GroupBy object does not immediately release all of the memory, even after gc.collect(), as there might still be references held to these objects.

Pandas does this because it uses a "lazy" computation model for groupby operations. When you call df.groupby(), no actual grouping computation happens; the DataFrameGroupBy object simply holds all the information needed to apply an operation to each of the groups. The actual computation happens only when you call a function like sum() or mean() on the DataFrameGroupBy object.
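
A small sketch illustrating this point (the .obj attribute used below is an internal detail of DataFrameGroupBy, shown here only to make the back-reference visible):

import numpy as np
import pandas as pd

df = pd.DataFrame({"key": np.random.randint(0, 10, 1_000), "val": np.arange(1_000)})

# groupby() does no grouping work yet; it returns a lazy DataFrameGroupBy
# wrapper that keeps a reference back to the original DataFrame.
gb = df.groupby("key", group_keys=True)
print(gb.obj is df)  # True: the GroupBy object still references df

# The actual computation happens only when an aggregation is requested.
print(gb["val"].sum().head())

# As long as any live object still references df (here, gb), deleting df
# alone cannot release the underlying memory.
del df
print(gb.obj.shape)  # the data is still alive through gb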

@rhshadrach
Member

@psulski: Can you provide a fully reproducible example? We need an example df to run. Will any df work to reproduce (with the proper columns)?

@rhshadrach added the Needs Info, Groupby, and Performance labels and removed the Needs Triage label on Jun 22, 2023
@topper-123 added the Closing Candidate label on Jun 27, 2023
@topper-123
Contributor

The issue will be closed unless we get a reproducible example, sorry.

@rhshadrach
Member

Closing since there is no reproducible example. I also tried the code below - memory usage went back down to its original level between iterations.

import numpy as np
import pandas as pd

size = 100000000

for e in range(2):
    df = pd.DataFrame(
        {
            'a_lA': np.random.randint(0, 100, size),
            'multiA': np.random.randint(0, 100, size),
            'a_lB': np.random.randint(0, 100, size),
            'multiB': np.random.randint(0, 100, size),
            'a': np.random.randint(0, 100, size),
            'b': np.random.randint(0, 100, size),
            'c': np.random.randint(0, 100, size),
        }
    )

    df_groubyA = df.groupby(['a_lA', 'multiA'], group_keys=True)
    df_groubyB = df.groupby(['a_lB', 'multiB'], group_keys=True)

    lA = df_groubyA.groups.keys()
    lB = df_groubyB.groups.keys()
    print(f'Finished {e}')
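
Not part of the original comment, but one way to check this kind of claim is to watch the process RSS between iterations; a minimal sketch assuming psutil is installed (a smaller size is used so it runs quickly):

import os

import numpy as np
import pandas as pd
import psutil  # assumption: psutil is available for RSS measurement

proc = psutil.Process(os.getpid())
size = 10_000_000

for e in range(2):
    df = pd.DataFrame({
        'a_lA': np.random.randint(0, 100, size),
        'multiA': np.random.randint(0, 100, size),
    })
    lA = df.groupby(['a_lA', 'multiA'], group_keys=True).groups.keys()
    del lA, df
    # RSS should return to roughly the same level on each pass if memory is released.
    print(f'Iteration {e}: RSS = {proc.memory_info().rss / 1e6:.0f} MB')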
