PERF: Implement PeriodArray._unique #23586

TomAugspurger · 2018-11-08T22:45:34Z

Avoid an astype(object).

diff --git a/pandas/core/arrays/period.py b/pandas/core/arrays/period.py
index 5a75f2706..12191ad89 100644
--- a/pandas/core/arrays/period.py
+++ b/pandas/core/arrays/period.py
@@ -216,6 +216,10 @@ class PeriodArray(dtl.DatetimeLikeArrayMixin, ExtensionArray):
         ordinals = libperiod.extract_ordinals(periods, freq)
         return cls(ordinals, freq=freq)
 
+    def unique(self):
+        from pandas.core.algorithms import unique
+        return type(self)(unique(self.asi8), self.freq)
+
     def _values_for_factorize(self):
         return self.asi8, iNaT

should work.
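A minimal sketch of the idea behind the diff, using only public pandas entry points rather than the internal method itself. The sample data and variable names are illustrative assumptions; the point is that deduplicating the int64 ordinals and rebuilding the array gives the same periods as the object round-trip.

```python
import pandas as pd

# Illustrative data (assumed): four monthly periods, each repeated three times.
arr = pd.period_range("2018-01", periods=4, freq="M").repeat(3).array

# Object path: materialize Period objects, then deduplicate them.
via_object = pd.unique(arr.astype(object))

# Ordinal path: deduplicate the underlying int64 ordinals, then rebuild
# a PeriodArray with the same dtype (and therefore the same freq).
unique_ordinals = pd.unique(arr.asi8)
via_ordinals = pd.arrays.PeriodArray(unique_ordinals, dtype=arr.dtype)

print(list(via_ordinals) == list(via_object))
```

Both paths should agree on the result; the ordinal path simply skips creating one Period object per element.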

jbrockmendel · 2018-11-08T23:50:06Z

This would also work for DatetimeArray/TimedeltaArray if put into DatetimeLikeArrayMixin and the last line were changed to type(self)(unique(self.asi8), dtype=self.dtype). Though then in DatetimeIndex and TimedeltaIndex we would need to override (for now) with unique = Index.unique.

jorisvandenbossche · 2018-11-09T12:56:05Z

@TomAugspurger do you remember why we don't have a base unique implementation based on _values_for_factorize? (Of course, in principle factorize does a bit too much, but it should still be faster than the object roundtrip.)
E.g. in the case of PeriodIndex this would already have worked, since factorizing does not go through object dtype (which is not to say that, specifically for a builtin PeriodArray, we shouldn't do a direct unique as you show above, of course).

TomAugspurger · 2018-11-09T13:00:29Z

Hmm, can we say in general whether factorizing or converting to object is more expensive?

jorisvandenbossche · 2018-11-09T13:15:30Z

Probably not in general, but my feeling is that the overhead of keeping track of the codes in factorize is likely smaller than the overhead of working in object mode.

For integers:

In [26]: a = np.random.randint(100, size=1000000)

In [28]: %timeit pd.unique(a)
3.75 ms ± 29.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [29]: %timeit pd.factorize(a)
8.99 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [30]: a2 = np.array(a, dtype=object)

In [31]: %timeit pd.unique(a2)
25.6 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

(But this also clearly shows that for PeriodArray it is worth explicitly using unique instead of factorize.)
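For anyone who wants to repeat the comparison on Period data outside IPython, here is a rough sketch using the standard-library timeit module. The array size, the "D" frequency and the `bench` helper are assumptions made for illustration; absolute numbers will vary by machine and pandas version, so no timings are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed setup: 100,000 daily periods drawn from 100 distinct ordinals.
rng = np.random.default_rng(0)
ordinals = rng.integers(0, 100, size=100_000)
arr = pd.arrays.PeriodArray(ordinals, dtype=pd.PeriodDtype("D"))

# Precompute the object array so the conversion cost is excluded,
# mirroring how a2 is built before timing in the session above.
as_object = arr.astype(object)


def bench(fn, number=5):
    """Hypothetical helper: best average seconds per call over 3 repeats."""
    return min(timeit.repeat(fn, number=number, repeat=3)) / number


timings = {
    "unique(asi8)": bench(lambda: pd.unique(arr.asi8)),
    "factorize(asi8)": bench(lambda: pd.factorize(arr.asi8)),
    "unique(object)": bench(lambda: pd.unique(as_object)),
}

for name, seconds in timings.items():
    print(f"{name:>16}: {seconds * 1e3:.3f} ms")
```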

jorisvandenbossche · 2018-11-09T13:17:07Z

Actually, can't we do something like this as default:

+    def unique(self):
+        from pandas.core.algorithms import unique
+        values, _ = self._values_for_factorize()
+        return self._from_factorized(unique(values), self)

TomAugspurger · 2018-11-09T13:19:31Z

Yeah, I think so... I don't think we make any claims about the order of the result (but I think right now unique and the default factorize will preserve it).
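To make the proposed default concrete, here is a small self-contained sketch. `ToyArray` is a hypothetical stand-in, not a pandas class and not a real ExtensionArray subclass; it only mimics the two hooks named above. The hook is called and its (values, na_value) tuple is unpacked before being passed to unique.

```python
import numpy as np
import pandas as pd


class ToyArray:
    """Hypothetical stand-in that mimics two ExtensionArray hooks."""

    def __init__(self, values):
        self._data = np.asarray(values, dtype="int64")

    def _values_for_factorize(self):
        # Returns (values, na_value), like the ExtensionArray hook.
        # The -1 sentinel is an arbitrary choice for this toy class.
        return self._data, -1

    @classmethod
    def _from_factorized(cls, values, original):
        # Rebuild an array of the same type from the deduplicated values.
        return cls(values)

    def unique(self):
        # Sketch of the generic default discussed in this thread.
        values, _ = self._values_for_factorize()
        return self._from_factorized(pd.unique(values), self)


arr = ToyArray([3, 1, 3, 2, 1])
result = arr.unique()._data.tolist()
print(result)  # [3, 1, 2]
```

In this sketch the result follows order of first appearance because pd.unique behaves that way for integer input; whether the ExtensionArray interface should promise that ordering is a separate question, as noted above.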

TomAugspurger added the Performance (memory or execution speed performance) label Nov 8, 2018

TomAugspurger added this to the 0.24.0 milestone Nov 8, 2018

TomAugspurger added the Period (Period data type), Effort Low, and good first issue labels Nov 8, 2018

TomAugspurger mentioned this issue Nov 12, 2018

Implement _most_ of the EA interface for DTA/TDA #23643

Merged

jreback closed this as completed in #23643 Nov 14, 2018
