Large Monotonic Index Objects Always Allocate Hash Tables on get_loc #14266

Closed
ssanderson opened this issue Sep 21, 2016 · 18 comments
Labels: Indexing, Performance

ssanderson (Contributor) commented Sep 21, 2016

Historically, large monotonically-increasing Index objects would attempt to avoid creating a large hash table on get_loc calls. In service of this goal, many IndexEngine methods have guards like the one in DatetimeEngine.get_loc:

        if self.over_size_threshold and self.is_monotonic_increasing:
            if not self.is_unique:
                val = _to_i8(val)
                return self._get_loc_duplicates(val)
            # Do lookup using `searchsorted`
        # Do lookup using hash table.

Since at least 5eecdb2, self.is_unique has been implemented as a property that forces a hash table to be created unless the index has already been marked as unique. Until #10199, the is_monotonic_increasing property performed a check that would sometimes set self.unique to True, which prevented the large hash table allocation. After that commit, however, the only code path that ever sets IndexEngine.unique is IndexEngine.initialize, which unconditionally creates a hash table before setting the unique flag.
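For illustration, here is a minimal Python paraphrase of that pattern (the real engine is Cython, and the names here are simplified):

class IndexEngineSketch(object):
    """Simplified stand-in for the Cython IndexEngine."""

    def __init__(self, values):
        self.values = values
        self.unique = False
        self.initialized = False
        self.mapping = None

    @property
    def is_unique(self):
        if not self.initialized:
            # The only path that ever sets `unique` is initialize()...
            self.initialize()
        return self.unique

    def initialize(self):
        # ...and it unconditionally builds the O(n)-memory hash table
        # before the unique flag is ever set.
        self.mapping = {v: i for i, v in enumerate(self.values)}
        self.unique = len(self.mapping) == len(self.values)
        self.initialized = True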

Code Sample (a copy-pastable example)

import os
import humanize
import psutil
import pandas as pd


def get_mem_usage():
    pid = os.getpid()
    proc = psutil.Process(pid)
    return humanize.naturalsize(proc.memory_full_info().uss)

print("Pandas Version: " + pd.__version__)
print("Before Index Creation: " + get_mem_usage())

# The cutoff for allocating a hash table inside the index is supposed to be
# 1,000,000 entries. This index is about 10x larger.
data = pd.date_range('1990', '2016', freq='min')

print("After Index Creation: " + get_mem_usage())

# Trigger a hash of the index's contents.
data.get_loc(data[5])

print("After get_loc() call: " + get_mem_usage())

Output (Old Pandas):

$ python repro.py
Pandas Version: 0.16.1
Before Index Creation: 36.8 MB
After Index Creation: 146.5 MB
After get_loc() call: 146.6 MB

Output (Pandas 0.18.1):

$ python repro.py
Pandas Version: 0.18.1
Before Index Creation: 47.6 MB
After Index Creation: 157.4 MB
After get_loc() call: 698.7 MB

For some context, I found this after the internal Jenkins build for Zipline (which makes heavy use of large minutely DatetimeIndexes to represent trading calendars) started failing with memory errors after merging quantopian/zipline#1339.

Assuming the memory-saving behavior of older pandas is still desired, I think the right immediate fix is to change IndexEngine._do_unique_check to perform an actual uniqueness check instead of just forcing a hash table creation (a sketch of what that could look like is below). Reading through the code, however, there are several other ways that large Indexes could still hit code paths that trigger hash table allocations. For example, DatetimeEngine.__contains__ guards against self.over_size_threshold, but none of the other IndexEngine subclasses do. A more significant refactor is probably needed to provide a meaningful guarantee that indexes don't consume too much memory.
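Note that for a monotonically increasing array, uniqueness can be determined without any hash table, since such an array is unique exactly when consecutive elements are strictly increasing. A minimal sketch of the idea (hypothetical helper, not the actual pandas code):

import numpy as np

def is_unique_monotonic_increasing(values):
    # A monotonically increasing array is unique iff every consecutive
    # pair is strictly increasing: one vectorized O(n) pass, no hash table.
    arr = np.asarray(values)
    if len(arr) < 2:
        return True
    return bool((arr[1:] > arr[:-1]).all())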

Output of pd.show_versions():

In [4]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 4.2.0-16-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.1
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 5.1.0
sphinx: 1.3.4
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: 1.0.0
tables: None
numexpr: 2.4.6
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.8
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)

jreback (Contributor) commented Sep 21, 2016

xref #12272 for the general use case: a frequency-based index makes this vastly more memory efficient (IOW, it already knows is_unique, is_monotonic_increasing, etc.)

ssanderson (Contributor, Author) commented Sep 21, 2016

Unfortunately, I don't think a range-based implementation will work for Zipline's specific use case, since real-world trading schedules have lots of irregularities, but a datetime-based range index definitely seems useful for folks with nicely regular data.

jreback (Contributor) commented Sep 21, 2016

yeah, I suspect what you actually want is a sparse range (IOW, range-based with a mask), which is much more memory efficient
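To illustrate the idea (purely hypothetical, not an existing pandas structure): a trading calendar could be stored as a regular minute grid plus a boolean mask, materializing the irregular timestamps only on demand:

import pandas as pd

# Hypothetical "sparse range" calendar: a regular minute grid plus a
# boolean mask marking the real trading minutes. The mask costs one byte
# per grid slot (one bit if packed) versus eight bytes per stored
# timestamp.
grid = pd.date_range('2016-01-04', '2016-01-08', freq='min')
mask = (grid.hour >= 9) & (grid.hour < 16)  # stand-in for real session rules
trading_minutes = grid[mask]  # materialize only when actually needed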

jreback added the Indexing and Performance labels Sep 21, 2016
jreback (Contributor) commented Sep 21, 2016

#13594 does not seem to matter here (though maybe it did originally)

jreback (Contributor) commented Sep 21, 2016

so this fixes the issue (it breaks some other things, but those just need a trivial _ensure_mapping to fix up).

IIRC, quite some time ago we changed the is_monotonic check. It used to also compute uniqueness when the index actually is unique, so that check is necessary but not sufficient.

However, we are not using that information and are recomputing it (and constructing the mapping, which is memory heavy).

This uses the check where possible (and does the re-initialization if needed).

Just needs a little fixup, and I think it will solve the issue.

ssanderson (Contributor, Author) commented Sep 21, 2016

@jreback thanks for the quick response. Does your branch also plan to update the other code paths that can trigger hash table creation? At a glance, it looks like get_indexer does this, as well as __contains__ for everything except DatetimeEngine, which has custom special-case logic.

jreback (Contributor) commented Sep 21, 2016

would need some other test cases

this by definition IS special-cased

jreback (Contributor) commented Sep 21, 2016

@ssanderson by definition get_indexer MUST populate the hash table, so for __contains__ this also seems reasonable (e.g. even if it's unique, how do you find the indexer?)

ssanderson (Contributor, Author) commented Sep 21, 2016

so for __contains__ this also seems reasonable

I'm not sure whether you're saying that __contains__ should always populate the table, or that it should avoid populating when possible. To be clear, DatetimeEngine already implements __contains__ without a hash table by doing a searchsorted and then comparing the element at the located position with the value whose containment status was requested.
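In other words, something along these lines (a hypothetical sketch, not the actual Cython implementation):

import numpy as np

def contains_sorted(values, val):
    # Hash-free membership test for a sorted array: binary-search for the
    # insertion point, then check whether the element actually stored
    # there is the value we were asked about. O(log n), no table.
    loc = np.searchsorted(values, val, side='left')
    return bool(loc < len(values) and values[loc] == val)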

jreback (Contributor) commented Sep 21, 2016

and the DTE will avoid populating for the cases we are talking about.

For non-large indexes (e.g. < cutoff) it has always used the existing logic (which does populate). Whether that should change is a separate/independent question that would need profiling.

ssanderson (Contributor, Author) commented Sep 21, 2016

and the DTE will avoid populating for the cases we are talking.

Right. My original question was whether we should make the other engines behave the same way as the datetime engine. It seemed odd to me that different index types would make different choices about whether to populate the hash table, but it's possible that's by design for reasons I missed?

jreback (Contributor) commented Sep 21, 2016

so it IS possible for an int64 index, as .searchsorted would be faster. I don't think this should be the default for Index, as I suspect the hashtable impl is much faster than .searchsorted. But that could/should be another issue.

Benchmarks on this behavior would be welcome, though, to make a decision. Profiling is key here.
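A minimal sketch of such a benchmark (hypothetical setup; sizes and timings will vary by machine and pandas version):

import timeit

import numpy as np
import pandas as pd

idx = pd.Index(np.arange(2000000, dtype='int64'))
target = idx[1234567]

# Hash-table path: the first get_loc call populates the table, after
# which each lookup is O(1).
hash_time = timeit.timeit(lambda: idx.get_loc(target), number=10000)

# searchsorted path: O(log n) per lookup, no table allocation.
sorted_time = timeit.timeit(lambda: idx.values.searchsorted(target), number=10000)

print("get_loc:      %.4fs" % hash_time)
print("searchsorted: %.4fs" % sorted_time)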

ssanderson (Contributor, Author) commented Sep 21, 2016

by-definition get_indexer MUST populate the hash table

I'm not sure what you mean when you say that get_indexer "by definition" has to populate the hash table.

As I understand it, get_indexer is essentially a vectorized version of get_loc, so assuming we can implement get_loc without a hash table (which we've already done for the special case of a monotonic index), in the worst case we could implement get_indexer as a for-loop that calls get_loc. In practice, you could probably do much better than the naive for-loop, especially if the target indexer is also monotonic (rough sketch at the end of this comment).

I certainly could see the argument that such a change to get_indexer is large enough that it deserves to be a separate PR/discussion.
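For concreteness, a hash-free get_indexer for the sorted-and-unique case might look something like this (hypothetical helper, not proposed pandas internals):

import numpy as np

def get_indexer_sorted_unique(index_values, targets):
    # Vectorized, hash-free get_indexer for a sorted, unique index:
    # binary-search all targets at once, then mark misses with -1.
    targets = np.asarray(targets)
    locs = np.searchsorted(index_values, targets, side='left')
    # Insertion points past the end can't be matches; clamp them so the
    # element-wise comparison below stays in bounds.
    in_bounds = locs < len(index_values)
    safe_locs = np.where(in_bounds, locs, 0)
    matched = in_bounds & (index_values[safe_locs] == targets)
    return np.where(matched, locs, -1)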

ssanderson (Contributor, Author) commented Sep 21, 2016

so it IS possible for an int64 index, as searchsorted would be faster. I don't think this should be the default for Index, as I suspect the hashtable impl is much faster than searchsorted. But that could/should be another issue.

Fair enough. Another thing to consider here is that there are workloads where the user might accept a slower searchsorted-based index in exchange for memory savings. In our production Zipline deployments, for example, the bottleneck is almost always RAM, not CPU, so we'd likely be willing to take a performance hit in exchange for saving an extra couple hundred MB. The design is tricky, though, b/c different users and use cases will want different tradeoffs.

jreback (Contributor) commented Sep 21, 2016

let me clarify: yes, I agree .get_indexer could get the same treatment, though it doesn't ATM. So, separate issue for that. I wasn't attempting to change anything beyond what already implements the large-sorted-unique behavior (which is a useful special case)

ssanderson (Contributor, Author) commented Sep 21, 2016

Cool, sounds like we're in agreement. If I find the time to work on it, would you be interested in a separate PR that extends the no-hash-table algorithms to some of the cases outlined above? I can't promise I'll be able to devote a ton of time to it, but Zipline leans pretty heavily on DatetimeIndex, so optimizations here can be big wins for us.

jreback (Contributor) commented Sep 21, 2016

yes I think that would be great. Please create an issue in any event.

ssanderson (Contributor, Author) commented Sep 21, 2016

Opened as #14273.

jreback added a commit to jreback/pandas that referenced this issue Sep 21, 2016