Large Monotonic Index Objects Always Allocate Hash Tables on get_loc #14266
Comments
xref #12272 for the general use case (frequency-based index) makes this vastly more memory efficient (IOW it already knows
Unfortunately, I don't think a range-based implementation will work for Zipline's specific use-case, since there are lots of irregularities to real-world trading schedules, but a datetime-based range index definitely seems useful for folks with nicely regular data.
yeah I suspect what you actually want is a sparse range (IOW range-based, with a mask, which is much more memory efficient)
#13594 does not seem to matter here (though maybe it did originally)
so this fixes the issue (it breaks some other things, but those just need a trivial IIRC quite some time ago we changed the is_monotonic check. It used to also compute uniqueness when it actually is unique, so this is a necessary but not sufficient condition. However, we are not using that information and are recomputing it (and constructing the mapping, which is memory heavy). This uses the check where possible (and does the re-initialization if needed). It just needs a little fixup and I think it will solve the issue.
@jreback thanks for the quick response. Does your branch also plan to update the other code paths that can trigger hash table creation? At a glance, it looks like
would need some other test cases this by definition IS special cased
@ssanderson by-definition
I'm not sure if you're saying that
and the DTE will avoid populating for the cases we are talking about. For non-large indexes (e.g. < cutoff) it has always used the existing logic (which does populate). Whether that should be profiled is a separate / independent question.
Right. My original question was whether we should make the other engines have the same behavior as the datetime engine. It seemed odd to me the different index types would want to make different choices about whether to populate the hash table, but it's possible that that's by design for reasons I missed?
so it IS possible for int64 index as Benchmarks on this behavior would be welcome though to make a decision. Profiling is key here.
I'm not sure what you mean when you say that As I understand it, I certainly could see the argument that such a change to
Fair enough. Another thing to think about here is that there are workloads where the user might accept a slower searchsorted-based index in exchange for memory savings. In our production Zipline deployments, for example, our bottleneck is almost always RAM, not CPU, so we'd likely be willing to take a performance hit in exchange for an extra couple hundred MB. The design here is tricky though b/c different users and use-cases will be willing to make different tradeoffs here.
let me clarify, yes I agree
Cool, sounds like we're in agreement. If I find the time to work on it, would you be interested in a separate PR that extends the no-hash-table algorithms in some of the cases outlined above? I can't promise for sure that I'll be able to devote a ton of time to it, but zipline leans pretty heavily on DatetimeIndex, so optimizations here can be pretty big wins for us.
yes I think that would be great. Please create an issue in any event.
Opened as #14273.
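The time-versus-memory trade-off raised in the comments above (a searchsorted-style lookup instead of a hash table) can be illustrated with a minimal sketch. This is not pandas code: `get_loc_sorted` is a hypothetical helper, the data is made up, and the standard library's `bisect` stands in for a searchsorted call. It assumes the values are sorted ascending and unique.

```python
from bisect import bisect_left


def get_loc_sorted(values, key):
    """Return the position of `key` in a sorted, unique sequence.

    Hypothetical helper for illustration only (not part of pandas).
    Binary search costs O(log n) per lookup but allocates nothing,
    whereas a hash table gives O(1) lookups at O(n) extra memory.
    """
    pos = bisect_left(values, key)
    if pos == len(values) or values[pos] != key:
        raise KeyError(key)
    return pos


# Illustrative data: one integer "timestamp" per minute, in seconds.
minutes = list(range(0, 600000, 60))

print(get_loc_sorted(minutes, 120))
```

Whether the slower lookup is acceptable depends on the workload, which is the point made in the thread: a RAM-bound deployment may prefer it, a CPU-bound one may not.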
Historically, large monotonically-increasing Index objects would attempt to avoid creating a large hash table on get_loc calls. In service of this goal, many IndexEngine methods have guards like the one in DatetimeEngine.get_loc:
Since at least 5eecdb2, self.is_unique has been implemented as a property that would force a hash table to be created unless the index had already been marked as unique. Until #10199, the is_monotonic_increasing property would perform a check that would sometimes set self.unique to True, which would prevent the large hash table allocation. After the commit linked above, however, the only code path that ever sets IndexEngine.unique is in IndexEngine.initialize, which unconditionally creates a hash table before setting the unique flag.
Code Sample, a copy-pastable example if possible
Output (Old Pandas):
Output (Pandas 0.18.1)
For some context, I found this after the internal Jenkins build for Zipline (which makes heavy use of large minutely DatetimeIndexes to represent trading calendars) started failing with memory errors after merging quantopian/zipline#1339.
Assuming that the memory-saving behavior of older pandas is still desired, I think the right immediate fix for this is to change IndexEngine._do_unique_check to actually do a uniqueness check instead of just forcing a hash table creation. Reading through the code, however, there are a bunch of ways that large Indexes could still hit code paths that trigger hash table allocations. For example, DatetimeEngine.__contains__ guards against self.over_size_threshold, but none of the other IndexEngine subclasses do. A more significant refactor is probably needed to provide a meaningful guarantee that indices don't consume too much memory.
output of pd.show_versions()
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 4.2.0-16-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.16.1
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 5.1.0
sphinx: 1.3.4
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: 1.0.0
tables: None
numexpr: 2.4.6
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.8
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
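As a rough illustration of the behavior described in this issue, the toy model below contrasts a uniqueness property that forces a hash table into existence with one that scans sorted data instead. It is a sketch under stated assumptions, not the pandas implementation: `ToyEngine` and all of its attribute names are hypothetical, and the values are assumed to be sorted ascending.

```python
class ToyEngine:
    """Toy stand-in for an index engine (hypothetical, not pandas internals)."""

    def __init__(self, values):
        self.values = list(values)  # assumed sorted ascending
        self.mapping = None         # hash table, built lazily
        self._unique = None         # None means "not yet known"

    def _build_mapping(self):
        # The memory-heavy step: one dict entry per distinct element.
        self.mapping = {v: i for i, v in enumerate(self.values)}
        self._unique = len(self.mapping) == len(self.values)

    @property
    def is_unique_via_hash_table(self):
        # Models the reported behavior: asking about uniqueness
        # builds the hash table as a side effect.
        if self._unique is None:
            self._build_mapping()
        return self._unique

    @property
    def is_unique_via_scan(self):
        # Models the suggested fix: in sorted data any duplicates are
        # adjacent, so one pass answers the question with no hash table.
        if self._unique is None:
            vals = self.values
            self._unique = all(
                vals[i] != vals[i + 1] for i in range(len(vals) - 1)
            )
        return self._unique


scanned = ToyEngine(range(10))
print(scanned.is_unique_via_scan, scanned.mapping is None)

hashed = ToyEngine(range(10))
print(hashed.is_unique_via_hash_table, hashed.mapping is None)
```

In the scan variant the mapping is never allocated, so a guard that first asks about uniqueness no longer defeats its own purpose.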