Skip to content

Indexing Code Roundup #32135

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jbrockmendel opened this issue Feb 20, 2020 · 1 comment
Closed

Indexing Code Roundup #32135

jbrockmendel opened this issue Feb 20, 2020 · 1 comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Refactor Internal refactoring of code

Comments

@jbrockmendel
Copy link
Member

jbrockmendel commented Feb 20, 2020

xref #9595 for an overview the the indexing API, this is focused on the state of the code.

Needs Eyeballs

  • Tests are scattered:

    • tests/indexing/
    • tests/(frame|series)/indexing/
    • tests/indexes/*/indexing
    • Moreover, many of the indexing tests are leftover from ix, and it isn't clear whether they are testing anything meaningful anymore. Organizing these tests and getting a handle on exactly what we are testing is going to be a marathon (cc @MomIsBestFriend)
    • Many of our tests use the indices fixture or tm.makeFooIndex, which leaves many cases un-covered, off the top of my head:
      • above/below _libs.index._SIZE_CUTOFF
      • monotonic increasing/decreasing/non-monotonic
      • has NAs or not
      • near implementation bounds
      • name attributes
      • readonly or not
      • view or not
  • Ditto for indexing Benchmarks

  • GH Issue Tracker

    • The "Indexing" label is used pretty loosely. A thorough pass through the tracker to get a handle on what issues are about __setitem__, __getitem__, loc, iloc, at, and iat would be worthwhile.

_libs

  • In _libs.index we define _bin_search as an alternative to ndarray.searchsorted. AFAICT this is more performant than ndarray.searchsorted for object dtype, but not for other dtypes. If this is correct, then we should override the non-object IndexEngine subclasses to use searchsorted (we do this for DatetimeEngine)
  • The IndexEngine methods get_indexer, get_pad_indexer, get_backfill_indexer don't need to be in cython, can go back on the Index classes.
    • Having them in cython makes it harder to tell that the PeriodEngine versions are never called
  • We could avoid some casting if we had HashTable classes for itemsizes smaller than 64 bits
    • I expect this would also let us avoid some casting in core.algorithms.

Potential Optimizations

Some of these increase code complexity, so it isn't obvious whether they are worthwhile:

  • Separate Loc/iLoc/At/iAt classes for 1D vs 2D could allow some optimizations
  • iloc input validations could be removed, letting numpy's exception messages surface instead
  • cached check (or just separate subclass?) for if a object-dtype Index contains any tuples. If we can rule out tuples, a bunch of loc and __getitem__ code can be simplified.

Potential Refactors

  • DatetimeIndex, TimedeltaIndex, and PeriodIndex get_loc could be refactored to move most of the action into _maybe_cast_indexer. I think that the resulting _maybe_cast_indexer methods could then be shared with other DTA/TDA/PA methods, in particular the casting done in comparison methods.
@jbrockmendel jbrockmendel added the Indexing Related to indexing on series/frames, not to indexes themselves label Feb 22, 2020
@mroeschke mroeschke added the Refactor Internal refactoring of code label Jul 29, 2021
@jbrockmendel
Copy link
Member Author

I think this has been taken as far as it can for the time being. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Refactor Internal refactoring of code
Projects
None yet
Development

No branches or pull requests

2 participants