Indexing Code Roundup #32135

jbrockmendel · 2020-02-20T19:13:31Z

xref #9595 for an overview the the indexing API, this is focused on the state of the code.

Needs Eyeballs

Tests are scattered:
- tests/indexing/
- tests/(frame|series)/indexing/
- tests/indexes/*/indexing
- Moreover, many of the indexing tests are leftover from ix, and it isn't clear whether they are testing anything meaningful anymore. Organizing these tests and getting a handle on exactly what we are testing is going to be a marathon (cc @MomIsBestFriend)
- Many of our tests use the indices fixture or tm.makeFooIndex, which leaves many cases un-covered, off the top of my head:
  - above/below _libs.index._SIZE_CUTOFF
  - monotonic increasing/decreasing/non-monotonic
  - has NAs or not
  - near implementation bounds
  - name attributes
  - readonly or not
  - view or not
Ditto for indexing Benchmarks
GH Issue Tracker
- The "Indexing" label is used pretty loosely. A thorough pass through the tracker to get a handle on what issues are about __setitem__, __getitem__, loc, iloc, at, and iat would be worthwhile.

In _libs.index we define _bin_search as an alternative to ndarray.searchsorted. AFAICT this is more performant than ndarray.searchsorted for object dtype, but not for other dtypes. If this is correct, then we should override the non-object IndexEngine subclasses to use searchsorted (we do this for DatetimeEngine)
The IndexEngine methods get_indexer, get_pad_indexer, get_backfill_indexer don't need to be in cython, can go back on the Index classes.
- Having them in cython makes it harder to tell that the PeriodEngine versions are never called
We could avoid some casting if we had HashTable classes for itemsizes smaller than 64 bits
- I expect this would also let us avoid some casting in core.algorithms.

Some of these increase code complexity, so it isn't obvious whether they are worthwhile:

Separate Loc/iLoc/At/iAt classes for 1D vs 2D could allow some optimizations
iloc input validations could be removed, letting numpy's exception messages surface instead
cached check (or just separate subclass?) for if a object-dtype Index contains any tuples. If we can rule out tuples, a bunch of loc and __getitem__ code can be simplified.

DatetimeIndex, TimedeltaIndex, and PeriodIndex get_loc could be refactored to move most of the action into _maybe_cast_indexer. I think that the resulting _maybe_cast_indexer methods could then be shared with other DTA/TDA/PA methods, in particular the casting done in comparison methods.

The text was updated successfully, but these errors were encountered:

jbrockmendel · 2023-02-26T18:59:46Z

I think this has been taken as far as it can for the time being. Closing.

jbrockmendel added the Indexing Related to indexing on series/frames, not to indexes themselves label Feb 22, 2020

mroeschke added the Refactor Internal refactoring of code label Jul 29, 2021

jbrockmendel closed this as completed Feb 26, 2023