Skip to content

Performance regression in TimedeltaIndexing.time_get_loc #34510

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jorisvandenbossche opened this issue Jun 1, 2020 · 2 comments · Fixed by #34734
Closed

Performance regression in TimedeltaIndexing.time_get_loc #34510

jorisvandenbossche opened this issue Jun 1, 2020 · 2 comments · Fixed by #34734
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version Timedelta Timedelta data type
Milestone

Comments

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jun 1, 2020

Setup:

index = pd.timedelta_range(start="1985", periods=1000, freq="D")
timedelta = index[500]

%timeit index.get_loc(timedelta)
# master
7.39 µs ± 78.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# 1.0.3
2.82 µs ± 13.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

See https://pandas.pydata.org/speed/pandas/#timedelta.TimedeltaIndexing.time_get_loc?commits=9929fca8-62c7dd3e

Happened somewhere between May 11 and May 27 (the gap in the benchmarks)

@jorisvandenbossche jorisvandenbossche added Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version labels Jun 1, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Jun 1, 2020
@jorisvandenbossche jorisvandenbossche added Indexing Related to indexing on series/frames, not to indexes themselves Timedelta Timedelta data type labels Jun 1, 2020
@jbrockmendel
Copy link
Member

Looking at %prun results, about 40% of the change is accounted for by _validate_scalar

@TomAugspurger
Copy link
Contributor

Yeah that's the place we should focus.

   211                                                   """
   212         1          3.0      3.0      3.8          if not is_scalar(key):
   213                                                       raise InvalidIndexError(key)
   214
   215         1         31.0     31.0     39.7          msg = str(key)
   216         1          1.0      1.0      1.3          try:
   217         1         21.0     21.0     26.9              key = self._data._validate_scalar(key, msg, cast_str=True)
   218                                                   except TypeError as err:
   219                                                       raise KeyError(key) from err
   220
   221         1         22.0     22.0     28.2          return Index.get_loc(self, key, method, tolerance)

Two proposals:

  1. Avoid casting the timedelta to a str until after we've caught the exception. For the happy path of valid data, this cuts out the 40% of time spent on str(Timedelta)
  2. Optimize TimedeltaArray._validate_scalar.
Function: _validate_scalar at line 779

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   779                                               def _validate_scalar(self, value, msg: str, cast_str: bool = False):
   780                                                   """
   781                                                   Validate that the input value can be cast to our scalar_type.
   782
   783                                                   Parameters
   784                                                   ----------
   785                                                   value : object
   786                                                   msg : str
   787                                                       Message to raise in TypeError on invalid input.
   788                                                   cast_str : bool, default False
   789                                                       Whether to try to parse string input to scalar_type.
   790
   791                                                   Returns
   792                                                   -------
   793                                                   self._scalar_type or NaT
   794                                                   """
   795         1          2.0      2.0     10.0          if cast_str and isinstance(value, str):
   796                                                       # NB: Careful about tzawareness
   797                                                       try:
   798                                                           value = self._scalar_from_string(value)
   799                                                       except ValueError as err:
   800                                                           raise TypeError(msg) from err
   801
   802         1         14.0     14.0     70.0          elif is_valid_nat_for_dtype(value, self.dtype):
   803                                                       # GH#18295
   804                                                       value = NaT
   805
   806         1          1.0      1.0      5.0          elif isinstance(value, self._recognized_scalars):
   807         1          3.0      3.0     15.0              value = self._scalar_type(value)  # type: ignore
   808
   809                                                   else:
   810                                                       raise TypeError(msg)
   811
   812         1          0.0      0.0      0.0          return value

maybe is_valid_nat_for_dtype can be faster.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version Timedelta Timedelta data type
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants