Skip to content

PERF: Implement PeriodArray._unique #23586

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TomAugspurger opened this issue Nov 8, 2018 · 6 comments
Closed

PERF: Implement PeriodArray._unique #23586

TomAugspurger opened this issue Nov 8, 2018 · 6 comments
Labels
good first issue Performance Memory or execution speed performance Period Period data type
Milestone

Comments

@TomAugspurger
Copy link
Contributor

Avoid an astype(object).

diff --git a/pandas/core/arrays/period.py b/pandas/core/arrays/period.py
index 5a75f2706..12191ad89 100644
--- a/pandas/core/arrays/period.py
+++ b/pandas/core/arrays/period.py
@@ -216,6 +216,10 @@ class PeriodArray(dtl.DatetimeLikeArrayMixin, ExtensionArray):
         ordinals = libperiod.extract_ordinals(periods, freq)
         return cls(ordinals, freq=freq)
 
+    def unique(self):
+        from pandas.core.algorithms import unique
+        return type(self)(unique(self.asi8), self.freq)
+
     def _values_for_factorize(self):
         return self.asi8, iNaT

should work.

@TomAugspurger TomAugspurger added the Performance Memory or execution speed performance label Nov 8, 2018
@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Nov 8, 2018
@jbrockmendel
Copy link
Member

This would also work for DatetimeArray/TimedeltaArray if put into DatetimelikeArrayMixin and the last line were changed to type(self)(unique(self.asi8), dtype=self.dtype). Though then in DatetimeIndex and TimedeltaIndex we would need to override (for now) with unique = Index.unique

@jorisvandenbossche
Copy link
Member

@TomAugspurger do you remember why we don't have a base unique implementation based on _values_for_factorize? (of course in principle factorize does a bit too much, but still should be faster than object roundtrip)
Eg in case of PeriodIndex this would already have worked, since factorizing does not goes through object dtype (which is not to say that specifically for a builtin PeriodArray, we shouldn't do a direct unique as you show above of course)

@TomAugspurger
Copy link
Contributor Author

Hmm, can we say in general whether factorizing or converting to object is more expensive?

@jorisvandenbossche
Copy link
Member

Probably not in general, but my feeling is that it is likely the overhead of keeping track of the codes in factorize will give less overhead compared to doing it in object mode.

For integers:

In [26]: a = np.random.randint(100, size=1000000)

In [28]: %timeit pd.unique(a)
3.75 ms ± 29.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [29]: %timeit pd.factorize(a)
8.99 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [30]: a2 = np.array(a, dtype=object)

In [31]: %timeit pd.unique(a2)
25.6 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

(but so this also clearly shows that for PeriodArray it is worth to explicitly use unique instead of factorize)

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Nov 9, 2018

Actually, can't we do something like this as default:

+    def unique(self):
+        from pandas.core.algorithms import unique
+        return self._from_factorized(unique(self._values_for_factorize), self)

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Nov 9, 2018

Yeah, I think so... I don't think we make any claims about the order of the result (but I think right now unique and the default factorize will preserve it).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Performance Memory or execution speed performance Period Period data type
Projects
None yet
Development

No branches or pull requests

3 participants