xref #9595 for an overview of the indexing API; this issue is focused on the state of the code.
Needs Eyeballs
Tests are scattered:
tests/indexing/
tests/(frame|series)/indexing/
tests/indexes/*/indexing
Moreover, many of the indexing tests are leftover from ix, and it isn't clear whether they are testing anything meaningful anymore. Organizing these tests and getting a handle on exactly what we are testing is going to be a marathon (cc @MomIsBestFriend)
Many of our tests use the indices fixture or tm.makeFooIndex, which leaves many cases uncovered. Off the top of my head:
above/below _libs.index._SIZE_CUTOFF
monotonic increasing/decreasing/non-monotonic
has NAs or not
near implementation bounds
name attributes
readonly or not
view or not
Ditto for the indexing benchmarks.
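To get a feel for how large the uncovered space is, here is a rough stdlib-only sketch that enumerates combinations of the cases listed above. The axis names and values (`CASE_AXES`, `iter_cases`) are hypothetical labels for illustration, not existing pandas fixtures, and "near implementation bounds" is left out because it depends on dtype.

```python
from itertools import product

# Hypothetical axes, one per bullet in the list above.
# These are labels only; building the actual Index objects is left out.
CASE_AXES = {
    "size": ["below_cutoff", "above_cutoff"],  # relative to _SIZE_CUTOFF
    "order": ["increasing", "decreasing", "non_monotonic"],
    "has_na": [False, True],
    "named": [False, True],
    "readonly": [False, True],
    "is_view": [False, True],
}


def iter_cases(axes=CASE_AXES):
    """Yield one dict per combination of axis values."""
    names = list(axes)
    for values in product(*(axes[name] for name in names)):
        yield dict(zip(names, values))


cases = list(iter_cases())
print(len(cases))  # 2 * 3 * 2 * 2 * 2 * 2 combinations
```

Each dict could in principle feed a parametrized fixture that builds the corresponding Index; whether the full cross product is worth running, as opposed to a sampled subset, is an open question.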
GH Issue Tracker
The "Indexing" label is used pretty loosely. A thorough pass through the tracker to get a handle on what issues are about __setitem__, __getitem__, loc, iloc, at, and iat would be worthwhile.
_libs
In _libs.index we define _bin_search as an alternative to ndarray.searchsorted. AFAICT this is more performant than ndarray.searchsorted for object dtype, but not for other dtypes. If this is correct, then we should override the non-object IndexEngine subclasses to use searchsorted (we do this for DatetimeEngine)
The IndexEngine methods get_indexer, get_pad_indexer, and get_backfill_indexer don't need to be in cython; they can go back on the Index classes.
Having them in cython makes it harder to tell that the PeriodEngine versions are never called
We could avoid some casting if we had HashTable classes for itemsizes smaller than 64 bits
I expect this would also let us avoid some casting in core.algorithms.
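For reference on the _bin_search point above, a minimal pure-Python sketch of a leftmost binary search, checked against the stdlib `bisect.bisect_left`. This assumes _bin_search has left-side insertion semantics like the default of ndarray.searchsorted; `bin_search_left` is a hypothetical name and this is not the cython implementation.

```python
from bisect import bisect_left


def bin_search_left(values, val):
    """Return the leftmost position where val could be inserted
    into the sorted sequence `values` while keeping it sorted."""
    lo, hi = 0, len(values)
    while lo < hi:
        mid = (lo + hi) // 2
        if values[mid] < val:
            lo = mid + 1  # val belongs strictly to the right of mid
        else:
            hi = mid  # val belongs at mid or to its left
    return lo


# Object-like (string) keys, including a duplicate and missing values.
values = ["a", "b", "b", "d"]
for key in ["", "a", "b", "c", "e"]:
    assert bin_search_left(values, key) == bisect_left(values, key)
```

A check like this only establishes equivalence of results; the performance claim for object vs. non-object dtypes would still need a benchmark on the real engines.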
Potential Optimizations
Some of these increase code complexity, so it isn't obvious whether they are worthwhile:
Separate Loc/iLoc/At/iAt classes for 1D vs 2D could allow some optimizations
iloc input validations could be removed, letting numpy's exception messages surface instead
A cached check (or just a separate subclass?) for whether an object-dtype Index contains any tuples. If we can rule out tuples, a bunch of loc and __getitem__ code can be simplified.
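A rough sketch of what the cached tuple check could look like, using only the stdlib. `ObjectIndexSketch` and `_contains_tuples` are hypothetical names, and the class is a toy stand-in rather than pandas' Index; the point is only that an immutable container can compute the answer once and then short-circuit tuple keys.

```python
from functools import cached_property


class ObjectIndexSketch:
    """Hypothetical stand-in for an immutable object-dtype Index."""

    def __init__(self, values):
        # Stored as a tuple so the cached result cannot go stale.
        self._values = tuple(values)

    @cached_property
    def _contains_tuples(self):
        # Computed on first access, then stored on the instance.
        return any(isinstance(v, tuple) for v in self._values)

    def get_loc(self, key):
        # Fast path: a tuple key cannot match if no element is a tuple.
        if isinstance(key, tuple) and not self._contains_tuples:
            raise KeyError(key)
        try:
            return self._values.index(key)
        except ValueError:
            raise KeyError(key) from None


plain = ObjectIndexSketch(["a", "b"])
mixed = ObjectIndexSketch(["a", ("x", 1)])
print(plain.get_loc("b"), mixed.get_loc(("x", 1)))
```

The separate-subclass alternative would move the same decision to construction time instead of first access.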
Potential Refactors
DatetimeIndex, TimedeltaIndex, and PeriodIndex get_loc could be refactored to move most of the action into _maybe_cast_indexer. I think that the resulting _maybe_cast_indexer methods could then be shared with other DTA/TDA/PA methods, in particular the casting done in comparison methods.