PERF: making DatetimeIndex.date more performant #18058

Closed
jreback opened this issue Nov 1, 2017 · 2 comments · Fixed by #18163
Labels
Datetime (Datetime data dtype), Performance (Memory or execution speed performance)
Milestone

Comments

@jreback
Contributor

jreback commented Nov 1, 2017

We can substantially speed up DatetimeIndex.date with a small tweak in the code (from Stack Overflow):

In [44]: rng = pd.date_range('2000-04-03', periods=200000, freq='2H')

In [45]: %timeit rng.date
480 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [46]: %timeit rng.normalize().to_pydatetime()
94.7 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [47]: rng.normalize().to_pydatetime()
Out[47]: 
array([datetime.datetime(2000, 4, 3, 0, 0),
       datetime.datetime(2000, 4, 3, 0, 0),
       datetime.datetime(2000, 4, 3, 0, 0), ...,
       datetime.datetime(2045, 11, 19, 0, 0),
       datetime.datetime(2045, 11, 19, 0, 0),
       datetime.datetime(2045, 11, 19, 0, 0)], dtype=object)

In [48]: rng.date
Out[48]: 
array([datetime.date(2000, 4, 3), datetime.date(2000, 4, 3),
       datetime.date(2000, 4, 3), ..., datetime.date(2045, 11, 19),
       datetime.date(2045, 11, 19), datetime.date(2045, 11, 19)], dtype=object)

So [47] and [48] are almost the same; the only difference is that [47] holds datetime objects and [48] holds date objects.
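To make the distinction concrete, here is a minimal standard-library sketch (no pandas involved). The names `midnight` and `day` are illustrative only and stand in for single elements of the two arrays above.

```python
from datetime import date, datetime

midnight = datetime(2000, 4, 3, 0, 0)  # like an element of Out[47]
day = date(2000, 4, 3)                 # like an element of Out[48]

# datetime is a subclass of date, but a datetime never compares equal
# to a plain date, even when its time component is midnight.
print(isinstance(midnight, date))
print(midnight == day)

# Dropping the (zero) time component recovers the date exactly.
print(midnight.date() == day)
```

So the normalized datetimes carry the right calendar information; they are just the wrong type for `.date`.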

If we allowed ints_to_pydatetime to create date objects (it just needs a simple function pointer), around https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/tslib.pyx#L140, then this would work. In other words:

@property
def date(self):
    return self.normalize().to_pydate()

where .to_pydate() is basically .to_pydatetime() with an additional argument, say kind='date', which ints_to_pydatetime would handle by creating date rather than datetime objects.

This bypasses the Python-level iteration, which creates many intermediate Python objects.
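The "function pointer" idea can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the pandas implementation: `ints_to_py`, `_create_date`, `_create_datetime`, `EPOCH` and `NAT` are hypothetical names, missing values are mapped to `None` rather than `NaT`, and nanoseconds are truncated to microseconds. The point is that the constructor is chosen once, before the loop, so the loop body is identical for both kinds.

```python
from datetime import date, datetime, timedelta

EPOCH = datetime(1970, 1, 1)
NAT = -2**63  # hypothetical stand-in for NPY_NAT; mapped to None below


def _create_datetime(dt):
    # Hypothetical helper: keep the full datetime.
    return dt


def _create_date(dt):
    # Hypothetical helper: build a date from the year/month/day fields only.
    return date(dt.year, dt.month, dt.day)


def ints_to_py(values, kind="datetime"):
    """Convert nanosecond epoch integers to datetime or date objects."""
    # Select the constructor once, outside the loop.
    if kind == "date":
        create = _create_date
    elif kind == "datetime":
        create = _create_datetime
    else:
        raise ValueError(f"unsupported kind: {kind!r}")

    result = []
    for value in values:
        if value == NAT:
            result.append(None)
        else:
            # Sub-microsecond precision is truncated in this sketch.
            dt = EPOCH + timedelta(microseconds=value // 1000)
            result.append(create(dt))
    return result


HOUR_NS = 3_600 * 1_000_000_000
stamps = [0, 25 * HOUR_NS, NAT]
print(ints_to_py(stamps, kind="date"))
print(ints_to_py(stamps))
```

A pure-Python loop like this is not faster in itself; in the proposal the gain would come from running the same loop in Cython and building each date directly from the integer value.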

@jreback jreback added Difficulty Intermediate Performance Memory or execution speed performance Datetime Datetime data dtype labels Nov 1, 2017
@jreback jreback added this to the Next Major Release milestone Nov 1, 2017
@jreback jreback changed the title PERF: makeing DatetimeIndex.date more performant PERF: making DatetimeIndex.date more performant Nov 1, 2017
@jamestran201

Here are the changes made following the steps outlined above. In datetimes.py:

@property
def date(self):
    """
    Returns numpy array of python datetime.date objects (namely, the date
    part of Timestamps without timezone information).
    """
    return self._maybe_mask_results(libalgos.arrmap_object(
        self.asobject.values, lambda x: x.date()))

@property
def date_new(self):
    return self.normalize().to_pydate()

def to_pydate(self):
    return libts.ints_to_pydatetime(self.asi8, kind="date")

In tslib.pyx, I added a function that creates the date objects, a check that selects which creation function to use, and a condition so that tz is considered only if kind is not "date":

cdef inline object create_date_from_ts(
        int64_t value, pandas_datetimestruct dts,
        object tz, object freq):
    """ convenience routine to construct a datetime.date from its parts """
    return date(dts.year, dts.month, dts.day)

def ints_to_pydatetime(ndarray[int64_t] arr, tz=None, freq=None, box=False, kind="datetime"):
    # convert an i8 repr to an ndarray of datetimes or Timestamp (if box ==
    # True)

    cdef:
        Py_ssize_t i, n = len(arr)
        ndarray[int64_t] trans, deltas
        pandas_datetimestruct dts
        object dt
        int64_t value
        ndarray[object] result = np.empty(n, dtype=object)
        object (*func_create)(int64_t, pandas_datetimestruct, object, object)

    if kind == "date":
        func_create = create_date_from_ts
    else:
        if box and is_string_object(freq):
            from pandas.tseries.frequencies import to_offset
            freq = to_offset(freq)

        if box:
            func_create = create_timestamp_from_ts
        else:
            func_create = create_datetime_from_ts

    if tz is not None and kind != "date":
        if is_utc(tz):
            for i in range(n):
                value = arr[i]
                if value == NPY_NAT:
                    result[i] = NaT
                else:
                    dt64_to_dtstruct(value, &dts)
                    result[i] = func_create(value, dts, tz, freq)
        elif is_tzlocal(tz) or is_fixed_offset(tz):
            for i in range(n):
                value = arr[i]
                if value == NPY_NAT:
                    result[i] = NaT
                else:
                    dt64_to_dtstruct(value, &dts)
                    dt = create_datetime_from_ts(value, dts, tz, freq)
                    dt = dt + tz.utcoffset(dt)
                    if box:
                        dt = Timestamp(dt)
                    result[i] = dt
        else:
            trans, deltas, typ = get_dst_info(tz)

            for i in range(n):

                value = arr[i]
                if value == NPY_NAT:
                    result[i] = NaT
                else:

                    # Adjust datetime64 timestamp, recompute datetimestruct
                    pos = trans.searchsorted(value, side='right') - 1
                    if treat_tz_as_pytz(tz):
                        # find right representation of dst etc in pytz timezone
                        new_tz = tz._tzinfos[tz._transition_info[pos]]
                    else:
                        # no zone-name change for dateutil tzs - dst etc
                        # represented in single object.
                        new_tz = tz

                    dt64_to_dtstruct(value + deltas[pos], &dts)
                    result[i] = func_create(value, dts, new_tz, freq)
    else:
        for i in range(n):

            value = arr[i]
            if value == NPY_NAT:
                result[i] = NaT
            else:
                dt64_to_dtstruct(value, &dts)
                result[i] = func_create(value, dts, None, freq)

    return result

A quick comparison using timeit:

In [1]: import pandas as pd

In [2]: rng = pd.date_range('2000-04-03', periods=200000, freq='2H')

In [3]: %timeit rng.date
555 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit rng.date_new
90.4 ms ± 5.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit rng.normalize().to_pydatetime()
121 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: rng.date
Out[6]:
array([datetime.date(2000, 4, 3), datetime.date(2000, 4, 3),
       datetime.date(2000, 4, 3), ..., datetime.date(2045, 11, 19),
       datetime.date(2045, 11, 19), datetime.date(2045, 11, 19)], dtype=object)

In [7]: rng.date_new
Out[7]:
array([datetime.date(2000, 4, 3), datetime.date(2000, 4, 3),
       datetime.date(2000, 4, 3), ..., datetime.date(2045, 11, 19),
       datetime.date(2045, 11, 19), datetime.date(2045, 11, 19)], dtype=object)

In [8]: rng.normalize().to_pydatetime()
Out[8]:
array([datetime.datetime(2000, 4, 3, 0, 0),
       datetime.datetime(2000, 4, 3, 0, 0),
       datetime.datetime(2000, 4, 3, 0, 0), ...,
       datetime.datetime(2045, 11, 19, 0, 0),
       datetime.datetime(2045, 11, 19, 0, 0),
       datetime.datetime(2045, 11, 19, 0, 0)], dtype=object)

@jreback
Contributor Author

jreback commented Nov 7, 2017

If you put this in a PR, we can have a look.
