Skip to content

PERF/BUG: improve factorize for datetimetz #13750

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 24, 2016

Conversation

sinhrks
Copy link
Member

@sinhrks sinhrks commented Jul 22, 2016

  • tests added / passed
  • passes git diff upstream/master | flake8 --diff
  • whatsnew entry

because factorize internally localize datetimetz, it raises when data contains DST boundary.

dti = pd.date_range('2016-11-06', freq='H', periods=5, tz='US/Eastern')
dti.factorize()
# AmbiguousTimeError: Cannot infer dst time from Timestamp('2016-11-06 01:00:00'), try using the 'ambiguous' argument

Skipped this localization to fix, also it improves perf.

dti = pd.date_range('2011-01-01', freq='H', periods=1000000, tz='Asia/Tokyo')
%timeit dti.factorize()
# on current master
#1 loop, best of 3: 475 ms per loop

# after this PR
#1 loop, best of 3: 262 ms per loop

asv:

   before     after       ratio
  [bb6b5e54] [a2c3370a]
-   22.46ms     9.97ms      0.44  timeseries.datetime_algorithm.time_dti_tz_factorize
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@sinhrks sinhrks added Bug Performance Memory or execution speed performance Timezones Timezone data dtype labels Jul 22, 2016
@sinhrks sinhrks added this to the 0.19.0 milestone Jul 22, 2016
@codecov-io
Copy link

Current coverage is 84.56% (diff: 100%)

Merging #13750 into master will not change coverage

@@             master     #13750   diff @@
==========================================
  Files           141        141          
  Lines         51196      51196          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          43296      43296          
  Misses         7900       7900          
  Partials          0          0          

Powered by Codecov. Last update bb6b5e5...a2c3370

@jreback
Copy link
Contributor

jreback commented Jul 22, 2016

lgtm.

I think this may solve some of these Ambiguous time issues: https://github.com/pydata/pandas/search?q=cannot+infer+dst&type=Issues&utf8=%E2%9C%93

maybe

@jreback jreback merged commit 42cc66d into pandas-dev:master Jul 24, 2016
@jreback
Copy link
Contributor

jreback commented Jul 24, 2016

thanks! see my comment above if you have some time.

@sinhrks
Copy link
Member Author

sinhrks commented Jul 24, 2016

yeah, i think we can remove most of tz_localize(None) logic from core.

@sinhrks sinhrks deleted the perf_factorize branch July 24, 2016 16:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Performance Memory or execution speed performance Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants