Memory stays around after pickle cycle #43156

Open · mrocklin opened this issue Aug 21, 2021 · 9 comments
Labels: IO Pickle (read_pickle, to_pickle), Performance (Memory or execution speed performance)

Comments

@mrocklin
Contributor

Hi Folks,

Related to #43155, I'm running into memory issues when pickling many small pandas DataFrames. The following script creates a DataFrame, splits it into many small groups, pickles each group, and unpickles them again. It then deletes all of the objects, but a significant amount of memory is still held by the process. Here is the script, followed by the output of memory_profiler:

import numpy as np
import pandas as pd
import pickle


@profile
def test():
    df = pd.DataFrame(np.random.random((20000, 1000)))
    df["partitions"] = (df[0] * 10000).astype(int)
    _, groups = zip(*df.groupby("partitions"))
    del df

    groups = [pickle.dumps(group) for group in groups]
    groups = [pickle.loads(group) for group in groups]

    del groups


if __name__ == "__main__":
    test()
python -m memory_profiler memory_issue.py
Filename: memory_issue.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     7   76.574 MiB   76.574 MiB           1   @profile
     8                                         def test():
     9  229.445 MiB  152.871 MiB           1       df = pd.DataFrame(np.random.random((20000, 1000)))
    10  230.738 MiB    1.293 MiB           1       df["partitions"] = (df[0] * 10000).astype(int)
    11  398.453 MiB  167.715 MiB           1       _, groups = zip(*df.groupby("partitions"))
    12  245.633 MiB -152.820 MiB           1       del df
    13                                         
    14  445.688 MiB   47.273 MiB        8631       groups = [pickle.dumps(group) for group in groups]
    15  712.285 MiB  266.598 MiB        8631       groups = [pickle.loads(group) for group in groups]
    16                                         
    17  557.488 MiB -154.797 MiB           1       del groups

As you can see, we start at roughly 75 MiB in memory and end at roughly 560 MiB, despite all relevant objects having been deleted. The leftover memory grows with the number of groups (scale the 10000 factor to move the leak up or down). Any help or pointers on how to track this down would be welcome.

@mrocklin mrocklin added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 21, 2021
@jbrockmendel
Member

Another data point: running your script on macOS, I see a lot more memory released at the end:

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     7   68.484 MiB   68.484 MiB           1   @profile
     8                                         def test():
     9  221.121 MiB  152.637 MiB           1       df = pd.DataFrame(np.random.random((20000, 1000)))
    10  221.828 MiB    0.707 MiB           1       df["partitions"] = (df[0] * 10000).astype(int)
    11  395.141 MiB  173.312 MiB           1       _, groups = zip(*df.groupby("partitions"))
    12  242.551 MiB -152.590 MiB           1       del df
    13                                         
    14  499.613 MiB  104.137 MiB        8684       groups = [pickle.dumps(group) for group in groups]
    15  915.664 MiB  284.641 MiB        8684       groups = [pickle.loads(group) for group in groups]
    16                                         
    17  286.395 MiB -629.270 MiB           1       del groups

Also, if I add a gc.collect() after del groups, I get another ~40 MiB back.

@mrocklin
Contributor Author

Same result with gc.collect():

Filename: memory_issue.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     8   98.180 MiB   98.180 MiB           1   @profile
     9                                         def test():
    10  250.863 MiB  152.684 MiB           1       df = pd.DataFrame(np.random.random((20000, 1000)))
    11  252.039 MiB    1.176 MiB           1       df["partitions"] = (df[0] * 10000).astype(int)
    12  420.848 MiB  168.809 MiB           1       _, groups = zip(*df.groupby("partitions"))
    13  267.980 MiB -152.867 MiB           1       del df
    14                                         
    15  468.211 MiB   47.391 MiB        8643       groups = [pickle.dumps(group) for group in groups]
    16  738.316 MiB  270.105 MiB        8643       groups = [pickle.loads(group) for group in groups]
    17                                         
    18  579.688 MiB -158.629 MiB           1       del groups
    19  528.438 MiB  -51.250 MiB           1       gc.collect()

@jbrockmendel
Member

Going through gc.get_objects() I don't see any big objects left behind.
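
For reference, a minimal sketch of that kind of scan (shallow sizes via sys.getsizeof only, so the contents of containers are not counted):

import gc
import sys


def largest_live_objects(top=10):
    # Report the types and shallow sizes of the biggest objects the GC tracks.
    gc.collect()

    def safe_size(obj):
        try:
            return sys.getsizeof(obj)
        except TypeError:
            return 0

    for obj in sorted(gc.get_objects(), key=safe_size, reverse=True)[:top]:
        print(type(obj), safe_size(obj))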

@gjoseph92

FYI, you should probably run this with the allocator encouraged to release pages back to the OS:

on Linux: MALLOC_TRIM_THRESHOLD_=0 python -m memory_profiler memory_issue.py
on macOS: DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python -m memory_profiler memory_issue.py

memory_profiler just tracks the RSS of the process, nothing fancier, so it's possible the memory has already been freed as far as pandas can free it, just not fully returned to the OS.
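
Another way to test the allocator hypothesis from inside the process is to ask glibc to trim its free arenas directly. A minimal sketch, Linux/glibc only (malloc_trim is not available on macOS):

import ctypes


def trim_memory():
    # Ask glibc's malloc to return free arena pages to the OS.
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

If RSS drops after calling trim_memory() at the end of test(), the remaining memory was being held by the allocator rather than by live objects.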

@mrocklin
Contributor Author

mrocklin commented Aug 24, 2021 via email

@mrocklin
Contributor Author

Yes, same result on my Linux/Ubuntu machine running mambaforge.

@jbrockmendel
Member

Is there a viable non-pickle alternative?

When I change pickle.dumps(group) to pickle.dumps(group.values) to pickle the underlying ndarrays, I end up with 50-60 MiB less than I do with pickling the DataFrames (and the gc.collect() no longer reclaims anything), but that's still 2-3 times the original footprint.
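
For concreteness, that variant looks roughly like this (a sketch only; the index and columns are dropped, so the DataFrames are not fully reconstructed):

arrays = [pickle.dumps(group.values) for group in groups]
arrays = [pickle.loads(buf) for buf in arrays]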

@simonjayhawkins simonjayhawkins added IO Pickle read_pickle, to_pickle Performance Memory or execution speed performance and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 30, 2021
@mroeschke mroeschke removed the Bug label Aug 25, 2024
@ademhilmibozkurt

Check out this response from Stack Overflow.

@ademhilmibozkurt

I tried libraries other than pickle: dill and joblib.

pickle library results

import pandas as pd
import numpy as np
import pickle
import dill
from io import BytesIO
import joblib
from memory_profiler import profile

@profile
def test1():
    df = pd.DataFrame(np.random.random((20000, 1000)))
    df["partitions"] = (df[0] * 10000).astype(int)
    _, groups = zip(*df.groupby("partitions"))
    del df
    
    groups = [pickle.dumps(group) for group in groups]
    groups = [pickle.loads(group) for group in groups]
    del groups
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     9    167.2 MiB    167.2 MiB           1   @profile
    10                                         def test1():
    11    319.8 MiB    152.6 MiB           1       df = pd.DataFrame(np.random.random((20000, 1000)))
    12    320.3 MiB      0.4 MiB           1       df["partitions"] = (df[0] * 10000).astype(int)
    13    633.2 MiB    313.0 MiB           1       _, groups = zip(*df.groupby("partitions"))
    14    480.7 MiB   -152.6 MiB           1       del df
    15                                             
    16    673.2 MiB   -103.4 MiB        8631       groups = [pickle.dumps(group) for group in groups]
    17    812.1 MiB    -42.9 MiB        8631       groups = [pickle.loads(group) for group in groups]
    18    248.5 MiB   -563.6 MiB           1       del groups

I also tried a different approach, pickling the whole tuple of groups at once:

groups = pickle.dumps(groups)
groups = pickle.loads(groups)
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     9    167.4 MiB    167.4 MiB           1   @profile
    10                                         def test1():
    11    320.1 MiB    152.7 MiB           1       df = pd.DataFrame(np.random.random((20000, 1000)))
    12    320.6 MiB      0.4 MiB           1       df["partitions"] = (df[0] * 10000).astype(int)
    13    634.2 MiB    313.6 MiB           1       _, groups = zip(*df.groupby("partitions"))
    14    481.6 MiB   -152.6 MiB           1       del df
    15                                             
    16    354.1 MiB   -127.4 MiB           1       groups = pickle.dumps(groups)
    17    363.4 MiB      9.2 MiB           1       groups = pickle.loads(groups)
    18    197.6 MiB   -165.7 MiB           1       del groups

joblib library results

@profile
def test3():
    df = pd.DataFrame(np.random.random((20000, 1000)))
    df["partitions"] = (df[0] * 10000).astype(int)
    _, groups = zip(*df.groupby("partitions"))
    del df
    
    bytes_container = BytesIO()
    groups = joblib.dump(groups, bytes_container)  # serialize the groups into the in-memory buffer
    bytes_container.seek(0)
    groups = bytes_container.read()  # note: reads the raw bytes back; joblib.load would be needed to deserialize
    del groups
Line #   Mem usage    Increment  Occurrences   Line Contents
=============================================================
    31    167.7 MiB    167.7 MiB           1   @profile
    32                                         def test3():
    33    320.3 MiB    152.6 MiB           1       df = pd.DataFrame(np.random.random((20000, 1000)))
    34    320.8 MiB      0.5 MiB           1       df["partitions"] = (df[0] * 10000).astype(int)
    35    634.2 MiB    313.4 MiB           1       _, groups = zip(*df.groupby("partitions"))
    36    481.6 MiB   -152.6 MiB           1       del df
    37                                             
    38    481.6 MiB      0.0 MiB           1       bytes_container = BytesIO()
    39    352.5 MiB   -129.1 MiB           1       groups = joblib.dump(groups, bytes_container)
    40    352.5 MiB      0.0 MiB           1       bytes_container.seek(0)
    41    507.4 MiB    154.9 MiB           1       groups = bytes_container.read()
    42    352.5 MiB   -154.9 MiB           1       del groups

dill library results

@profile
def test2():
    df = pd.DataFrame(np.random.random((20000, 1000)))
    df["partitions"] = (df[0] * 10000).astype(int)
    _, groups = zip(*df.groupby("partitions"))
    del df

    groups = dill.dumps(groups)
    groups = dill.loads(groups)
    del groups
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    20    167.2 MiB    167.2 MiB           1   @profile
    21                                         def test2():
    22    319.8 MiB    152.6 MiB           1       df = pd.DataFrame(np.random.random((20000, 1000)))
    23    320.2 MiB      0.4 MiB           1       df["partitions"] = (df[0] * 10000).astype(int)
    24    632.8 MiB    312.6 MiB           1       _, groups = zip(*df.groupby("partitions"))
    25    480.3 MiB   -152.6 MiB           1       del df
    26                                         
    27    358.0 MiB   -122.2 MiB           1       groups = dill.dumps(groups)
    28    369.6 MiB     11.6 MiB           1       groups = dill.loads(groups)
    29    207.1 MiB   -162.6 MiB           1       del groups

I think this is not about which library we use. Python (and the memory allocator underneath it) keeps extra memory around for processing, and gc.collect() is not a reliable way to return that memory to the OS. Please correct me if I took the wrong approach.
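
One way to separate Python-level allocations from allocator behaviour is tracemalloc, which counts only memory requested through Python's allocator. A minimal sketch, assuming it runs in the same script as test1 above:

import tracemalloc

tracemalloc.start()
test1()
current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 2**20:.1f} MiB, peak: {peak / 2**20:.1f} MiB")
tracemalloc.stop()

If tracemalloc reports a small current figure while the RSS stays high, the leftover memory is being held by the allocator rather than by live Python objects.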
