Large Monotonic Index Objects Always Allocate Hash Tables on get_loc #14266

Closed
ssanderson opened this issue Sep 21, 2016 · 18 comments
Labels: Indexing, Performance

ssanderson (Contributor) commented Sep 21, 2016

Historically, large monotonically-increasing Index objects would attempt to avoid creating a large hash table on get_loc calls. In service of this goal, many IndexEngine methods have guards like the one in DatetimeEngine.get_loc:

        if self.over_size_threshold and self.is_monotonic_increasing:
            if not self.is_unique:
                val = _to_i8(val)
                return self._get_loc_duplicates(val)
            # Do lookup using `searchsorted`
        # Do lookup using hash table.

Since at least 5eecdb2, self.is_unique has been implemented as a property that forces a hash table to be created unless the index has already been marked as unique. Until #10199, the is_monotonic_increasing property performed a check that would sometimes set self.unique to True, which prevented the large hash table allocation. After that commit, however, the only code path that ever sets IndexEngine.unique is IndexEngine.initialize, which unconditionally creates a hash table before setting the unique flag.
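For illustration, here is a minimal Python paraphrase of that pattern (the real engine is Cython, and the names here are simplified):

class IndexEngineSketch(object):
    """Simplified stand-in for the Cython IndexEngine."""

    def __init__(self, values):
        self.values = values
        self.unique = False
        self.initialized = False
        self.mapping = None

    @property
    def is_unique(self):
        if not self.initialized:
            # The only path that ever sets `unique` is initialize()...
            self.initialize()
        return self.unique

    def initialize(self):
        # ...and it unconditionally builds the O(n)-memory hash table
        # before the unique flag is ever set.
        self.mapping = {v: i for i, v in enumerate(self.values)}
        self.unique = len(self.mapping) == len(self.values)
        self.initialized = True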

Code Sample (a copy-pastable example)

import os
import humanize
import psutil
import pandas as pd


def get_mem_usage():
    pid = os.getpid()
    proc = psutil.Process(pid)
    return humanize.naturalsize(proc.memory_full_info().uss)

print("Pandas Version: " + pd.__version__)
print("Before Index Creation: " + get_mem_usage())

# The cutoff for allocating a hash table inside the index is supposed to be
# 1,000,000 entries. This index is about 10x larger.
data = pd.date_range('1990', '2016', freq='min')

print("After Index Creation: " + get_mem_usage())

# Trigger a hash of the index's contents.
data.get_loc(data[5])

print("After get_loc() call: " + get_mem_usage())

Output (Old Pandas):

$ python repro.py
Pandas Version: 0.16.1
Before Index Creation: 36.8 MB
After Index Creation: 146.5 MB
After get_loc() call: 146.6 MB

Output (Pandas 0.18.1):

$ python repro.py
Pandas Version: 0.18.1
Before Index Creation: 47.6 MB
After Index Creation: 157.4 MB
After get_loc() call: 698.7 MB

For some context, I found this after the internal Jenkins build for Zipline (which makes heavy use of large minutely DatetimeIndexes to represent trading calendars) started failing with memory errors after merging quantopian/zipline#1339.

Assuming the memory-saving behavior of older pandas is still desired, I think the right immediate fix is to change IndexEngine._do_unique_check to perform an actual uniqueness check instead of just forcing a hash table creation (a sketch of what that could look like is below). Reading through the code, however, there are several other ways that large Indexes could still hit code paths that trigger hash table allocations. For example, DatetimeEngine.__contains__ guards against self.over_size_threshold, but none of the other IndexEngine subclasses do. A more significant refactor is probably needed to provide a meaningful guarantee that indexes don't consume too much memory.
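Note that for a monotonically increasing array, uniqueness can be determined without any hash table, since such an array is unique exactly when consecutive elements are strictly increasing. A minimal sketch of the idea (hypothetical helper, not the actual pandas code):

import numpy as np

def is_unique_monotonic_increasing(values):
    # A monotonically increasing array is unique iff every consecutive
    # pair is strictly increasing: one vectorized O(n) pass, no hash table.
    arr = np.asarray(values)
    if len(arr) < 2:
        return True
    return bool((arr[1:] > arr[:-1]).all())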

Output of pd.show_versions():

In [4]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 4.2.0-16-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.1
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 5.1.0
sphinx: 1.3.4
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: 1.0.0
tables: None
numexpr: 2.4.6
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.8
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)

jreback (Contributor) commented Sep 21, 2016

xref #12272 for the general use case: a frequency-based index makes this vastly more memory efficient (IOW, it already knows is_unique, is_monotonic_increasing, etc.)

ssanderson (Contributor, Author) commented Sep 21, 2016

Unfortunately, I don't think a range-based implementation will work for Zipline's specific use case, since real-world trading schedules have lots of irregularities, but a datetime-based range index definitely seems useful for folks with nicely regular data.

jreback (Contributor) commented Sep 21, 2016

yeah, I suspect what you actually want is a sparse range (IOW, range-based with a mask), which is much more memory efficient
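To illustrate the idea (purely hypothetical, not an existing pandas structure): a trading calendar could be stored as a regular minute grid plus a boolean mask, materializing the irregular timestamps only on demand:

import pandas as pd

# Hypothetical "sparse range" calendar: a regular minute grid plus a
# boolean mask marking the real trading minutes. The mask costs one byte
# per grid slot (one bit if packed) versus eight bytes per stored
# timestamp.
grid = pd.date_range('2016-01-04', '2016-01-08', freq='min')
mask = (grid.hour >= 9) & (grid.hour < 16)  # stand-in for real session rules
trading_minutes = grid[mask]  # materialize only when actually needed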

jreback added the Indexing and Performance labels Sep 21, 2016
jreback (Contributor) commented Sep 21, 2016

#13594 does not seem to matter here (though maybe it did originally)

jreback (Contributor) commented Sep 21, 2016

so this fixes the issue (it breaks some other things, but those just need a trivial _ensure_mapping to fix up).

IIRC, quite some time ago we changed the is_monotonic check. It used to also compute uniqueness when the index actually is unique, so that check is necessary but not sufficient.

However, we are not using that information and are recomputing it (and constructing the mapping, which is memory heavy).

This uses the check where possible (and does the re-initialization if needed).

Just needs a little fixup, and I think it will solve the issue.

ssanderson (Contributor, Author) commented Sep 21, 2016

@jreback thanks for the quick response. Does your branch also plan to update the other code paths that can trigger hash table creation? At a glance, it looks like get_indexer does this, as well as __contains__ for everything except DatetimeEngine, which has custom special-case logic.

jreback (Contributor) commented Sep 21, 2016

would need some other test cases

this by definition IS special-cased

jreback (Contributor) commented Sep 21, 2016

@ssanderson by definition get_indexer MUST populate the hash table, so for __contains__ this also seems reasonable (e.g. even if it's unique, how do you find the indexer?)

ssanderson (Contributor, Author) commented Sep 21, 2016

so for __contains__ this also seems reasonable

I'm not sure whether you're saying that __contains__ should always populate the table, or that it should avoid populating when possible. To be clear, DatetimeEngine already implements __contains__ without a hash table by doing a searchsorted and then comparing the element at the located position with the value whose containment status was requested.
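In other words, something along these lines (a hypothetical sketch, not the actual Cython implementation):

import numpy as np

def contains_sorted(values, val):
    # Hash-free membership test for a sorted array: binary-search for the
    # insertion point, then check whether the element actually stored
    # there is the value we were asked about. O(log n), no table.
    loc = np.searchsorted(values, val, side='left')
    return bool(loc < len(values) and values[loc] == val)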

jreback (Contributor) commented Sep 21, 2016

and the DTE will avoid populating for the cases we are talking about.

For non-large indexes (e.g. < cutoff) it has always used the existing logic (which does populate). Whether that should change is a separate/independent question that would need profiling.

ssanderson (Contributor, Author) commented Sep 21, 2016

and the DTE will avoid populating for the cases we are talking.

Right. My original question was whether we should make the other engines behave the same way as the datetime engine. It seemed odd to me that different index types would make different choices about whether to populate the hash table, but it's possible that's by design for reasons I missed?

jreback (Contributor) commented Sep 21, 2016

so it IS possible for an int64 index, as .searchsorted would be faster. I don't think this should be the default for Index, as I suspect the hashtable impl is much faster than .searchsorted. But that could/should be another issue.

Benchmarks on this behavior would be welcome, though, to make a decision. Profiling is key here.
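A minimal sketch of such a benchmark (hypothetical setup; sizes and timings will vary by machine and pandas version):

import timeit

import numpy as np
import pandas as pd

idx = pd.Index(np.arange(2000000, dtype='int64'))
target = idx[1234567]

# Hash-table path: the first get_loc call populates the table, after
# which each lookup is O(1).
hash_time = timeit.timeit(lambda: idx.get_loc(target), number=10000)

# searchsorted path: O(log n) per lookup, no table allocation.
sorted_time = timeit.timeit(lambda: idx.values.searchsorted(target), number=10000)

print("get_loc:      %.4fs" % hash_time)
print("searchsorted: %.4fs" % sorted_time)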

ssanderson (Contributor, Author) commented Sep 21, 2016

by-definition get_indexer MUST populate the hash table

I'm not sure what you mean when you say that get_indexer "by definition" has to populate the hash table.

As I understand it, get_indexer is essentially a vectorized version of get_loc, so assuming we can implement get_loc without a hash table (which we've already done for the special case of a monotonic index), in the worst case we could implement get_indexer as a for-loop that calls get_loc. In practice, you could probably do much better than the naive for-loop, especially if the target indexer is also monotonic (rough sketch at the end of this comment).

I certainly could see the argument that such a change to get_indexer is large enough that it deserves to be a separate PR/discussion.
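For concreteness, a hash-free get_indexer for the sorted-and-unique case might look something like this (hypothetical helper, not proposed pandas internals):

import numpy as np

def get_indexer_sorted_unique(index_values, targets):
    # Vectorized, hash-free get_indexer for a sorted, unique index:
    # binary-search all targets at once, then mark misses with -1.
    targets = np.asarray(targets)
    locs = np.searchsorted(index_values, targets, side='left')
    # Insertion points past the end can't be matches; clamp them so the
    # element-wise comparison below stays in bounds.
    in_bounds = locs < len(index_values)
    safe_locs = np.where(in_bounds, locs, 0)
    matched = in_bounds & (index_values[safe_locs] == targets)
    return np.where(matched, locs, -1)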

ssanderson (Contributor, Author) commented Sep 21, 2016

so it IS possible for an int64 index, as searchsorted would be faster. I don't think this should be the default for Index, as I suspect the hashtable impl is much faster than searchsorted. But that could/should be another issue.

Fair enough. Another thing to consider here is that there are workloads where the user might accept a slower searchsorted-based index in exchange for memory savings. In our production Zipline deployments, for example, the bottleneck is almost always RAM, not CPU, so we'd likely be willing to take a performance hit in exchange for saving an extra couple hundred MB. The design is tricky, though, b/c different users and use cases will want different tradeoffs.

jreback (Contributor) commented Sep 21, 2016

let me clarify: yes, I agree .get_indexer could get the same treatment, though it doesn't ATM. So, separate issue for that. I wasn't attempting to change anything beyond what already implements the large-sorted-unique behavior (which is a useful special case)

ssanderson (Contributor, Author) commented Sep 21, 2016

Cool, sounds like we're in agreement. If I find the time to work on it, would you be interested in a separate PR that extends the no-hash-table algorithms to some of the cases outlined above? I can't promise I'll be able to devote a ton of time to it, but Zipline leans pretty heavily on DatetimeIndex, so optimizations here can be big wins for us.

jreback (Contributor) commented Sep 21, 2016

yes I think that would be great. Please create an issue in any event.

ssanderson (Contributor, Author) commented Sep 21, 2016

Opened as #14273.

jreback added a commit to jreback/pandas that referenced this issue Sep 21, 2016