@HyukjinKwon HyukjinKwon commented Sep 10, 2019

This PR takes over #710

Unlike other plots such as line or area, KDE needs to calculate its values via Spark so that they can be computed in a distributed manner.

This PR uses MLlib's KernelDensity API to calculate KDE. Since Spark only supports a scalar bandwidth, unlike SciPy, which pandas uses, Koalas currently supports only a fixed scalar bandwidth.
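For reference, here is a minimal pure-Python sketch of what a fixed-bandwidth Gaussian KDE computes. It is not the MLlib code: `gaussian_kde_fixed` is a hypothetical helper written for illustration, and it assumes the bandwidth is used directly as the standard deviation of the Gaussian kernel, which is how MLlib documents its bandwidth parameter.

```python
import math


def gaussian_kde_fixed(sample, points, bandwidth):
    """Evaluate a Gaussian KDE at each of `points`.

    `bandwidth` is treated as the kernel's standard deviation
    (an assumption matching MLlib's documented meaning).
    """
    # Normalization so that the estimated density integrates to 1.
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
        for x in points
    ]


sample = [1, 2, 2.5, 3, 3.5, 4, 5]
points = [i * 0.5 for i in range(13)]  # 0.0, 0.5, ..., 6.0
density = gaussian_kde_fixed(sample, points, bandwidth=0.3)
```

Each evaluation point only needs a sum over the sample, which is why this kind of estimate can be computed over distributed data.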

The implementation is different, so the values are slightly different, but the results seem good enough:

import pandas as pd
pd.Series([1, 2, 2.5, 3, 3.5, 4, 5]).plot.kde(bw_method=0.3).figure.savefig("image.png")

image

import databricks.koalas as ks
ks.Series([1, 2, 2.5, 3, 3.5, 4, 5]).plot.kde(bw_method=0.3).figure.savefig("image.png")

image

import pandas as pd
pd.Series([1, 2, 2.5, 3, 3.5, 4, 5]).plot.kde(bw_method=3.0).figure.savefig("image.png")

image

import databricks.koalas as ks
ks.Series([1, 2, 2.5, 3, 3.5, 4, 5]).plot.kde(bw_method=3.0).figure.savefig("image.png")

image
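One likely source of the difference, sketched under assumptions: SciPy's `gaussian_kde` treats a scalar `bw_method` as a factor that multiplies the sample standard deviation, whereas MLlib treats its bandwidth as the kernel standard deviation itself. The snippet below assumes Koalas passes the scalar straight through to MLlib, which this conversation does not confirm, and the variable names are illustrative only.

```python
import statistics

sample = [1, 2, 2.5, 3, 3.5, 4, 5]
bw_method = 0.3

# SciPy-style: the scalar is a factor applied to the sample
# standard deviation (ddof=1).
scipy_kernel_std = bw_method * statistics.stdev(sample)

# MLlib-style (assumed pass-through): the scalar is the kernel
# standard deviation itself.
mllib_kernel_std = bw_method

print(f"SciPy-style kernel std: {scipy_kernel_std:.4f}")
print(f"MLlib-style kernel std: {mllib_kernel_std:.4f}")
```

Under these assumptions the pandas curve for the same `bw_method` would be somewhat smoother for this sample, since its effective kernel width is larger.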

@HyukjinKwon HyukjinKwon changed the title from "Implement kde for Series" to "[WIP] Implement kde for Series" Sep 10, 2019
codecov-io commented Sep 10, 2019

Codecov Report

Merging #767 into master will decrease coverage by 0.06%.
The diff coverage is 87.27%.

@@            Coverage Diff             @@
##           master     #767      +/-   ##
==========================================
- Coverage    93.9%   93.83%   -0.07%     
==========================================
  Files          32       32              
  Lines        5691     5744      +53     
==========================================
+ Hits         5344     5390      +46     
- Misses        347      354       +7
Impacted Files Coverage Δ
databricks/koalas/plot.py 93.97% <87.27%> (-1.06%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@HyukjinKwon HyukjinKwon changed the title from "[WIP] Implement kde for Series" to "Implement Series.plot.kde" Sep 10, 2019
@softagram-bot

Softagram Impact Report for pull/767 (head commit: 12bf276)

@HyukjinKwon HyukjinKwon requested a review from dvgodoy September 10, 2019 10:11
@HyukjinKwon
Member Author

Let me merge this to move forward.

@HyukjinKwon HyukjinKwon merged commit a1125f9 into databricks:master Sep 12, 2019

pax = pdf['a'].plot('kde', bw_method=0.3)
kax = kdf['a'].plot('kde', bw_method=0.3)
self.compare_plots(pax, kax)

@ueshin ueshin Sep 25, 2019

@HyukjinKwon Why does this still work even though the values are slightly different?

HyukjinKwon pushed a commit that referenced this pull request Sep 28, 2019
This PR implements kde for DataFrame. It reuses the Series implementation from #767.

As with the Series kde plot, the DataFrame version also uses MLlib's KernelDensity API to calculate KDE, so the result is slightly different from pandas but seems good enough as well:

<img width="884" alt="Screenshot 2019-09-17 2.24.34 PM" src="https://user-images.githubusercontent.com/44108233/65013645-e3933100-d956-11e9-9166-be6d534046bd.png">
<img width="882" alt="Screenshot 2019-09-17 2.25.18 PM" src="https://user-images.githubusercontent.com/44108233/65013684-ff96d280-d956-11e9-8714-6c860c4e2c13.png">

Also, since kde is an alias of 'density', you get exactly the same result when you use 'density' rather than 'kde', as shown below:

<img width="886" alt="Screenshot 2019-09-17 2.27.30 PM" src="https://user-images.githubusercontent.com/44108233/65013769-66b48700-d957-11e9-84f0-5b5989ed2d49.png">
<img width="882" alt="Screenshot 2019-09-17 2.27.09 PM" src="https://user-images.githubusercontent.com/44108233/65013770-67e5b400-d957-11e9-875c-0fd73b32fa5f.png">

**Multiple-column examples:**

<img width="819" alt="Screenshot 2019-09-17 2.55.35 PM" src="https://user-images.githubusercontent.com/44108233/65015007-45ee3080-d95b-11e9-9c31-a4b85631e404.png">

and each one plotted individually with Series.plot.kde looks the same, as shown below:

<img width="739" alt="Screenshot 2019-09-17 2.57.57 PM" src="https://user-images.githubusercontent.com/44108233/65015172-b9903d80-d95b-11e9-8245-8b47190c38b6.png">
@HyukjinKwon HyukjinKwon deleted the impl_series_kde branch November 6, 2019 02:23