
BUG: Memory consumption #53793


Closed
psulski opened this issue Jun 22, 2023 · 4 comments
Labels

  • Bug
  • Closing Candidate (May be closeable, needs more eyeballs)
  • Groupby
  • Needs Info (Clarification about behavior needed to assess issue)
  • Performance (Memory or execution speed performance)

Comments


psulski commented Jun 22, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

df_groubyA = df.groupby(['a_lA', 'multiA'], group_keys=True)
df_groubyB = df.groupby(['a_lB', 'multiB'], group_keys=True)

lA = df_groubyA.groups.keys()
lB = df_groubyB.groups.keys()

Issue Description

The lines above consume memory and do not release it, even if I del lA, lB, df, df_groubyA, and df_groubyB.
I also call gc.collect() after the delete.
These lines run in a loop; df holds data read from a file.
I read the files one by one and perform some operations on each DataFrame, but after 3-5 files RAM is exhausted.
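
For reference, a rough sketch of the loop being described (the file names, the reader, and the intermediate operations are assumptions, not the reporter's actual code):

import gc

import pandas as pd

# Hypothetical input files; the report only says "data from file".
paths = ["part_0.csv", "part_1.csv", "part_2.csv"]

for path in paths:
    df = pd.read_csv(path)  # df - data from file

    df_groubyA = df.groupby(['a_lA', 'multiA'], group_keys=True)
    df_groubyB = df.groupby(['a_lB', 'multiB'], group_keys=True)

    lA = df_groubyA.groups.keys()
    lB = df_groubyB.groups.keys()

    # ... some operations on the DataFrame ...

    # Everything is deleted and gc.collect() is called,
    # yet RAM usage reportedly keeps growing.
    del lA, lB, df, df_groubyA, df_groubyB
    gc.collect()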

Expected Behavior

RAM should be released after the objects are deleted.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 87cfe4e
python : 3.8.0.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-150-generic
Version : #167~18.04.1-Ubuntu SMP Wed May 24 00:51:42 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : pl_PL.UTF-8
LOCALE : pl_PL.UTF-8

pandas : 1.5.0
numpy : 1.23.5
pytz : 2022.7
dateutil : 2.8.2
setuptools : 67.1.0
pip : 23.0
Cython : 0.29.32
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.11.0
gcsfs : None
matplotlib : 3.6.3
numba : 0.56.4
numexpr : 2.8.4
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.0
snappy : None
sqlalchemy : None
tables : 3.8.0
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : 2022.7

@psulski added the Bug and Needs Triage labels on Jun 22, 2023
@harshvardhaniimi

Can you share more about the structure of your loop? Does deleting df free memory? Maybe that can be made more efficient. Loops and pandas don't play well with each other.

Also, something else to note:

When you call groupby, pandas creates a new DataFrameGroupBy object, which is a subsetting of the original DataFrame. This DataFrameGroupBy object keeps a reference to the entire original DataFrame. This is why deleting the GroupBy object does not immediately release all of the memory, even after gc.collect(), as there might still be references held to these objects.

Pandas does this because it uses a "lazy" computation model for groupby operations. When you call df.groupby(), no actual grouping computation happens; the DataFrameGroupBy object simply holds all the information needed to apply an operation to each of the groups. The actual computation happens only when you call a function like sum() or mean() on the DataFrameGroupBy object.
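
A small sketch illustrating this point (the .obj attribute used below is an internal detail of DataFrameGroupBy, shown here only to make the back-reference visible):

import numpy as np
import pandas as pd

df = pd.DataFrame({"key": np.random.randint(0, 10, 1_000), "val": np.arange(1_000)})

# groupby() does no grouping work yet; it returns a lazy DataFrameGroupBy
# wrapper that keeps a reference back to the original DataFrame.
gb = df.groupby("key", group_keys=True)
print(gb.obj is df)  # True: the GroupBy object still references df

# The actual computation happens only when an aggregation is requested.
print(gb["val"].sum().head())

# As long as any live object still references df (here, gb), deleting df
# alone cannot release the underlying memory.
del df
print(gb.obj.shape)  # the data is still alive through gb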

@rhshadrach
Member

@psulski: Can you provide a fully reproducible example? We need an example df to run. Will any df work to reproduce (with the proper columns)?

@rhshadrach added the Needs Info, Groupby, and Performance labels and removed the Needs Triage label on Jun 22, 2023
@topper-123 added the Closing Candidate label on Jun 27, 2023
@topper-123
Contributor

The issue will be closed unless we get a reproducible example, sorry.

@rhshadrach
Member

Closing since there is no reproducible example. I also tried the code below - memory usage went back down to its original level between iterations.

import numpy as np
import pandas as pd

size = 100000000

for e in range(2):
    df = pd.DataFrame(
        {
            'a_lA': np.random.randint(0, 100, size),
            'multiA': np.random.randint(0, 100, size),
            'a_lB': np.random.randint(0, 100, size),
            'multiB': np.random.randint(0, 100, size),
            'a': np.random.randint(0, 100, size),
            'b': np.random.randint(0, 100, size),
            'c': np.random.randint(0, 100, size),
        }
    )

    df_groubyA = df.groupby(['a_lA', 'multiA'], group_keys=True)
    df_groubyB = df.groupby(['a_lB', 'multiB'], group_keys=True)

    lA = df_groubyA.groups.keys()
    lB = df_groubyB.groups.keys()
    print(f'Finished {e}')
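
Not part of the original comment, but one way to check this kind of claim is to watch the process RSS between iterations; a minimal sketch assuming psutil is installed (a smaller size is used so it runs quickly):

import os

import numpy as np
import pandas as pd
import psutil  # assumption: psutil is available for RSS measurement

proc = psutil.Process(os.getpid())
size = 10_000_000

for e in range(2):
    df = pd.DataFrame({
        'a_lA': np.random.randint(0, 100, size),
        'multiA': np.random.randint(0, 100, size),
    })
    lA = df.groupby(['a_lA', 'multiA'], group_keys=True).groups.keys()
    del lA, df
    # RSS should return to roughly the same level on each pass if memory is released.
    print(f'Iteration {e}: RSS = {proc.memory_info().rss / 1e6:.0f} MB')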
