Memory leak when loading HDFStore keys in 'table' format #22082

@federicofontana

Description

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
import psutil
import os


def get_ram_usage_pct():
    ram_usage_pct = psutil.Process(os.getpid()).memory_percent()
    return "{0:.2f}%".format(ram_usage_pct)


def get_hdfstore_ram_usage_table(**kwargs):
    store = pd.HDFStore('store.h5', mode='w')
    df = pd.DataFrame(np.random.normal(size=(int(1e7), 10)))
    store.put('df', df, **kwargs)
    memory_usage_pct = list()
    for i in range(8):
        memory_usage_pct.append(get_ram_usage_pct())
        data = store.select('df')
    store.close()
    return memory_usage_pct


# Output. Run the script from the terminal (i.e. not from a Jupyter notebook).
print(get_hdfstore_ram_usage_table(format='fixed', index=False, append=False))
print(get_hdfstore_ram_usage_table(format='table', index=False, append=True))

# Console output (script run from the terminal, i.e. not from a Jupyter notebook).
['2.76%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%']
['2.77%', '5.35%', '7.92%', '10.50%', '13.07%', '15.64%', '18.21%', '20.78%']
Process finished with exit code 0

Problem description

The results above show that when repeatedly accessing the same pandas.HDFStore key with format='fixed', RAM usage does not grow over time. However, with format='table' (i.e. appendable), RAM usage grows linearly with the number of calls! The behaviour is the same when accessing the data using HDFStore.get.

The current behaviour is a problem because accessing 1GB of data from the store 100 times requires ~100GB of RAM rather than ~1GB. A possible workaround was indicated in issue #5329. The workaround works, but it makes the code >10% slower due to gc.collect() and is very ugly: the store has to be closed and reopened on every access, and gc.collect() slows down the runtime.

I've gone through the source code, and the memory leak appears to occur in the AppendableFrameTable class in pandas/io/pytables.py.

# Workaround: gc.collect() + open AND close the store on every access.
import gc


def get_hdfstore_ram_usage_table(**kwargs):
    store = pd.HDFStore('store.h5', mode='w')
    df = pd.DataFrame(np.random.normal(size=(int(1e7), 10)))
    store.put('df', df, **kwargs)
    store.close()
    memory_usage_pct = list()
    for i in range(8):
        store = pd.HDFStore('store.h5', mode='r')
        memory_usage_pct.append(get_ram_usage_pct())
        data = store.select('df')
        gc.collect()
        store.close()
    return memory_usage_pct


print(get_hdfstore_ram_usage_table(format='fixed', index=False, append=False))
print(get_hdfstore_ram_usage_table(format='table', index=False, append=True))

# Console output (script run from the terminal, i.e. not from a Jupyter notebook).
['2.76%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%']
['2.76%', '5.34%', '5.34%', '5.34%', '5.34%', '5.35%', '5.34%', '5.35%']
Process finished with exit code 0
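If the workaround is needed in many places, the open/close + gc.collect() dance can be wrapped in a context manager so it lives in one spot. A sketch (fresh_hdfstore is a made-up helper name, not a pandas API):

```python
import gc
from contextlib import contextmanager

import pandas as pd


@contextmanager
def fresh_hdfstore(path, mode='r'):
    """Open an HDFStore, yield it, then close it and run gc.collect().

    Encapsulates the open/close + gc.collect() workaround so calling
    code does not have to repeat it on every read.
    """
    store = pd.HDFStore(path, mode=mode)
    try:
        yield store
    finally:
        store.close()
        gc.collect()
```

Each read then becomes `with fresh_hdfstore('store.h5') as store: data = store.select('df')`, and the close plus collector pass are guaranteed even if the read raises.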

References

issue #5329
issue #16740
Pandas HDFStore unload dataframe from memory
Memory leak when using hdf in table format?

Question
Is there any plan to fix this serious memory leak? gc.collect() is very low-level IMO, and if the solution is to call gc.collect() plus open/close the store on every access, then maybe this logic should be implemented directly in pandas. @jreback
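In the meantime, pd.read_hdf already does part of this: when given a path rather than an open store, it opens the file, reads the key, and closes its handle before returning. Whether that alone releases the extra references held in 'table' format has not been verified here, but it avoids keeping a long-lived store object around:

```python
import numpy as np
import pandas as pd

# Write a small table-format store, then read it back with read_hdf.
# read_hdf, given a path instead of an open HDFStore, manages its own
# short-lived file handle per call.
df = pd.DataFrame(np.random.normal(size=(1000, 10)))
df.to_hdf('store.h5', key='df', format='table')
data = pd.read_hdf('store.h5', 'df')
```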

Expected Output

['2.76%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%']
['2.76%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%']

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.3
pytest: 3.6.3
pip: 10.0.1
setuptools: 39.0.1
Cython: 0.28.3
numpy: 1.14.2
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


Labels

IO HDF5 (read_hdf, HDFStore), Performance (memory or execution speed)
