Implement Series.plot.kde #767
Codecov Report
@@ Coverage Diff @@
## master #767 +/- ##
==========================================
- Coverage 93.9% 93.83% -0.07%
==========================================
Files 32 32
Lines 5691 5744 +53
==========================================
+ Hits 5344 5390 +46
- Misses 347 354 +7
Let me merge this to move forward.
    pax = pdf['a'].plot('kde', bw_method=0.3)
    kax = kdf['a'].plot('kde', bw_method=0.3)
    self.compare_plots(pax, kax)
@HyukjinKwon Why does this still work even though the values are slightly different?
This PR implements kde in DataFrame, reusing the Series implementation from #767. Like the Series kde plot, DataFrame's also uses MLlib's KernelDensity API to calculate the KDE, so the values are slightly different from pandas but seem good enough: <img width="884" alt="Screenshot 2019-09-17 2.24.34 PM" src="https://user-images.githubusercontent.com/44108233/65013645-e3933100-d956-11e9-9166-be6d534046bd.png"> <img width="882" alt="Screenshot 2019-09-17 2.25.18 PM" src="https://user-images.githubusercontent.com/44108233/65013684-ff96d280-d956-11e9-8714-6c860c4e2c13.png">

kde is also an alias of 'density', so you get exactly the same result when you use 'density' rather than 'kde', as shown below: <img width="886" alt="Screenshot 2019-09-17 2.27.30 PM" src="https://user-images.githubusercontent.com/44108233/65013769-66b48700-d957-11e9-84f0-5b5989ed2d49.png"> <img width="882" alt="Screenshot 2019-09-17 2.27.09 PM" src="https://user-images.githubusercontent.com/44108233/65013770-67e5b400-d957-11e9-875c-0fd73b32fa5f.png">

**Multiple columns examples:** <img width="819" alt="Screenshot 2019-09-17 2.55.35 PM" src="https://user-images.githubusercontent.com/44108233/65015007-45ee3080-d95b-11e9-9c31-a4b85631e404.png">

and plotting each one separately with Series.plot.kde looks the same, as shown below: <img width="739" alt="Screenshot 2019-09-17 2.57.57 PM" src="https://user-images.githubusercontent.com/44108233/65015172-b9903d80-d95b-11e9-8245-8b47190c38b6.png">

This PR takes over #710
Unlike other plots such as line or area, KDE needs to calculate its values via Spark so that they can be computed in a distributed manner.
This PR uses MLlib's KernelDensity API to calculate KDE. Since Spark only supports a scalar bandwidth, unlike SciPy, which pandas uses, Koalas currently supports only a fixed scalar bandwidth.
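For readers unfamiliar with the term, here is a minimal pure-Python sketch of what a Gaussian KDE with one fixed scalar bandwidth computes. It is not the MLlib or Koalas code: the function name `gaussian_kde_fixed_bandwidth` and the sample values are made up for illustration, and the sketch assumes the bandwidth is used directly as the standard deviation of each Gaussian kernel. SciPy, as far as I understand, instead scales a scalar `bw_method` by the spread of the data, which would be one reason the curves do not match pandas exactly.

```python
import math


def gaussian_kde_fixed_bandwidth(sample, points, bandwidth):
    """Evaluate a Gaussian KDE at each point using one scalar bandwidth.

    Hypothetical helper for illustration only; `bandwidth` is assumed to be
    the standard deviation of every Gaussian kernel.
    """
    if not sample:
        raise ValueError("sample must not be empty")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    n = len(sample)
    # Normalisation shared by all kernels: 1 / (n * h * sqrt(2 * pi)).
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample)
        for x in points
    ]


# Made-up sample and an evenly spaced evaluation grid from -5.0 to 15.0.
sample = [1.0, 2.0, 2.5, 4.0, 7.0]
step = 0.1
grid = [i * step for i in range(-50, 151)]
density = gaussian_kde_fixed_bandwidth(sample, grid, bandwidth=0.3)

# A density should integrate to roughly 1 over a grid that covers the data.
area = sum(density) * step
```

In a distributed setting the inner sum over `sample` is the part that has to run on the cluster, since the sample is the full column, while `points` is only the small grid that gets plotted.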
The implementation differs from pandas', so the values are slightly different, but they seem good enough.