Description
Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd
import psutil
import os

def get_ram_usage_pct():
    ram_usage_pct = psutil.Process(os.getpid()).memory_percent()
    return "{0:.2f}%".format(ram_usage_pct)

def get_hdfstore_ram_usage_table(**kwargs):
    store = pd.HDFStore('store.h5', mode='w')
    df = pd.DataFrame(np.random.normal(size=(int(1e7), 10)))
    store.put('df', df, **kwargs)
    memory_usage_pct = list()
    for i in range(8):
        memory_usage_pct.append(get_ram_usage_pct())
        data = store.select('df')
    store.close()
    return memory_usage_pct

# Output. Script run from the terminal (i.e. no jupyter notebook).
print(get_hdfstore_ram_usage_table(format='fixed', index=False, append=False))
print(get_hdfstore_ram_usage_table(format='table', index=False, append=True))
# Console. Script run from the terminal (i.e. no jupyter notebook).
['2.76%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%']
['2.77%', '5.35%', '7.92%', '10.50%', '13.07%', '15.64%', '18.21%', '20.78%']
Process finished with exit code 0

Problem description
The results above show that when repeatedly accessing the same pandas.HDFStore key with format='fixed', RAM usage does not grow over time. However, with format='table' (i.e. appendable), RAM usage grows linearly with the number of calls! The behaviour is the same when the data is accessed via HDFStore.get.
The current behaviour is a problem because reading 1GB of data from the store 100 times requires ~100GB of RAM rather than ~1GB. A possible workaround was indicated in issue #5329. The workaround works, but it is ugly (the store must be closed and reopened on every read) and the explicit gc.collect() makes the code more than 10% slower.
Going through the source code, the leak appears to originate in the class AppendableFrameTable in pandas' pytables.py.
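The fact that the workaround below needs an explicit gc.collect() suggests the leaked read buffers end up in reference cycles, which CPython's reference counting alone never frees; only the cyclic garbage collector reclaims them. A minimal stdlib-only illustration of that mechanism (not pandas code, just the general principle):

```python
import gc

class Node:
    """Toy object that can participate in a reference cycle."""
    def __init__(self):
        self.ref = None

def make_cycle():
    # a <-> b form a cycle, so their refcounts never drop to zero
    # even after both local names go out of scope.
    a, b = Node(), Node()
    a.ref, b.ref = b, a

gc.disable()          # mimic the window between automatic collection passes
make_cycle()
unreachable = gc.collect()  # returns the number of unreachable objects found
gc.enable()
print(unreachable)    # > 0: the cycle was reclaimed only by the collector
```

If the per-read state in AppendableFrameTable is held in such cycles, that would explain why memory is released only after an explicit collection.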
# Workaround: gc.collect + open AND close the store every time.
import gc

def get_hdfstore_ram_usage_table(**kwargs):
    store = pd.HDFStore('store.h5', mode='w')
    df = pd.DataFrame(np.random.normal(size=(int(1e7), 10)))
    store.put('df', df, **kwargs)
    store.close()
    memory_usage_pct = list()
    for i in range(8):
        store = pd.HDFStore('store.h5', mode='r')
        memory_usage_pct.append(get_ram_usage_pct())
        data = store.select('df')
        gc.collect()
        store.close()
    return memory_usage_pct

print(get_hdfstore_ram_usage_table(format='fixed', index=False, append=False))
print(get_hdfstore_ram_usage_table(format='table', index=False, append=True))
# Console. Script run from the terminal (i.e. no jupyter notebook).
['2.76%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%']
['2.76%', '5.34%', '5.34%', '5.34%', '5.34%', '5.35%', '5.34%', '5.35%']
Process finished with exit code 0

References
issue #5329
issue #16740
Pandas HDFStore unload dataframe from memory
Memory leak when using hdf in table format?
Question
Is there any plan to fix this serious memory leak? gc.collect() is very low level IMO, and if the solution is gc.collect() plus opening/closing the store on every read, then maybe this logic should be implemented directly in pandas. @jreback
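For completeness, one way the open/close-per-read logic can already be kept out of user code is pandas.read_hdf, which opens and closes the store itself when given a file path instead of an open HDFStore. This is only a sketch and has not been verified against the leak reported here; the explicit gc.collect() may still be needed:

```python
import gc
import pandas as pd

def read_df_fresh(path='store.h5', key='df'):
    # read_hdf opens and closes the HDFStore internally when passed a path,
    # so no long-lived store object survives between reads.
    data = pd.read_hdf(path, key)
    gc.collect()  # likely still required if the leaked buffers sit in cycles
    return data
```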
Expected Output
['2.76%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%']
['2.76%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%', '5.33%']