DataFrameGroupBy.describe #1168
Conversation
Hmm, it seems the result is not the same as pandas?

```python
>>> kdf = ks.DataFrame({'a': [1, 1, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> pdf = kdf.to_pandas()
>>> kdf.groupby('a').describe()
      b                            c                         b            c
  count mean       std min max count mean       std min max 25% 50% 75% 25% 50% 75%
a
1     2  4.5  0.707107   4   5     2  7.5  0.707107   7   8   4   4   5   7   7   8
3     1  6.0       NaN   6   6     1  9.0       NaN   9   9   6   6   6   9   9   9
>>> pdf.groupby('a').describe()
      b                                           c
  count mean       std  min   25%  50%   75%  max count mean       std  min   25%  50%   75%  max
a
1   2.0  4.5  0.707107  4.0  4.25  4.5  4.75  5.0   2.0  7.5  0.707107  7.0  7.25  7.5  7.75  8.0
3   1.0  6.0       NaN  6.0  6.00  6.0  6.00  6.0   1.0  9.0       NaN  9.0  9.00  9.0  9.00  9.0
```

Since one of the main purposes of Koalas is to make our API work the same as pandas as much as possible, this part should be made clear first of all. So I think we'd better add a unit test:

```python
def test_describe(self):
    kdf = ks.DataFrame({'a': [1, 1, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
    pdf = kdf.to_pandas()
    self.assert_eq(kdf.groupby('a').describe().sort_index(),
                   pdf.groupby('a').describe().sort_index())
```

And thanks for the contribution, let's make it work together 👍
databricks/koalas/groupby.py (Outdated)

```python
# Split "quartiles" columns into first, second, and third quartiles.
for label, content in kdf.iteritems():
    if label[1] == "quartiles":
        exploded = ks.DataFrame(content.tolist())
```
This line seems a little dangerous since it can potentially cause memory issues such as OOM (because `tolist()` loads all the data into the single driver's memory). So I think maybe we can use `content.to_frame()`, or we should find another way. Or we can simply not support quartiles for now, given the memory issue above, with a proper notice in the docs.
Yeah, we can just get items from the "quartiles" column.
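In pandas terms, getting an item out of a list-valued column (the analogue of Spark's `Column.getItem(i)`) can be sketched like this; the column values here are made up for illustration:

```python
import pandas as pd

# Hypothetical "quartiles" column: each row holds [25%, 50%, 75%] as a list.
quartiles = pd.Series([[4.0, 4.5, 4.75], [6.0, 6.0, 6.0]])

# .str[i] extracts element i from each row's list, column-wise,
# much like Spark's Column.getItem(i) does on an array column.
q50 = quartiles.str[1]
print(q50.tolist())  # [4.5, 6.0]
```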
Hi Deepyaman! Thanks for your continued efforts here.
Basically, for handling a Koalas DataFrame efficiently, we usually use the internal Spark DataFrame (`sdf` for short; you can get it via `kdf._sdf` or `kdf._internal.sdf`) rather than the Koalas API directly.
I made another implementation using `sdf` for you below.
(I can't say this is perfect, good-quality code since it's very roughly implemented and not tested enough, but maybe it will help you understand Koalas' internal processing, even just a bit.)
```python
def describe(self):
    kdf = self.agg(["count", "mean", "std", "min", "max", "quartiles"]).reset_index()
    formatted_percentiles = ["25%", "50%", "75%"]
    sdf = kdf._sdf
    group_key_names = [groupkey.name for groupkey in self._groupkeys]
    quartiles_columns = []
    for data_column in self._kdf._internal.data_columns:
        if data_column not in group_key_names:
            quartiles_columns.append((data_column, 'quartiles'))
    # `quartiles_columns` here looks like
    # [('b', 'quartiles'), ('c', 'quartiles')]

    # Add columns (b, 25%), (b, 50%), ..., (c, 50%), (c, 75%) to `sdf`.
    for col_name, quartiles_column in quartiles_columns:
        for i, percentile in enumerate(formatted_percentiles):
            sdf = sdf.withColumn(
                name_like_string((col_name, percentile)),
                F.col(name_like_string((col_name, quartiles_column))).getItem(i))
    # So `sdf` here looks like
    # +-----------------+-----+ ... +-----------------+--------+--------+--------+--------+--------+--------+
    # |__index_level_0__|(a, )| ... |__natural_order__|(b, 25%)|(b, 50%)|(b, 75%)|(c, 25%)|(c, 50%)|(c, 75%)|
    # +-----------------+-----+ ... +-----------------+--------+--------+--------+--------+--------+--------+
    # |                0|    1| ... |     592705486848|       4|       4|       5|       7|       7|       8|
    # |                1|    3| ... |     919123001344|       6|       6|       6|       9|       9|       9|
    # +-----------------+-----+ ... +-----------------+--------+--------+--------+--------+--------+--------+

    # Build the column list we want to select from `sdf`.
    columns = []
    for col_name, _ in quartiles_columns:
        for func in ["count", "mean", "std", "min", "25%", "50%", "75%", "max"]:
            columns.append((col_name, func))
    name_like_string_columns = [name_like_string(col) for col in columns]
    internal = _InternalFrame(
        sdf=sdf.select(*self._kdf._internal.index_columns, *name_like_string_columns),
        index_map=self._kdf._internal.index_map)
    idx = pd.MultiIndex.from_tuples(columns)
    # `idx` here looks like
    # MultiIndex([('b', 'count'), ('b', 'mean'), ('b', 'std'), ('b', 'min'),
    #             ('b', '25%'), ('b', '50%'), ('b', '75%'), ('b', 'max'),
    #             ('c', 'count'), ('c', 'mean'), ('c', 'std'), ('c', 'min'),
    #             ('c', '25%'), ('c', '50%'), ('c', '75%'), ('c', 'max')])
    result = DataFrame(internal)
    result.columns = idx
    return result.astype("float64")
```

Also, the current implementation seems to invoke jobs many times, like below:
```python
>>> kdf.groupby('a').describe()
2020-01-13 17:08:31 WARN WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
... (the same warning repeated seven times in total) ...
      b                                         c
  count mean       std  min  25%  50%  75%  max count mean       std  min  25%  50%  75%  max
a
1   2.0  4.5  0.707107  4.0  4.0  4.0  5.0  5.0   2.0  7.5  0.707107  7.0  7.0  7.0  8.0  8.0
3   1.0  6.0       NaN  6.0  6.0  6.0  6.0  6.0   1.0  9.0       NaN  9.0  9.0  9.0  9.0  9.0
```

We can reduce them by handling the internal frame properly, like below:
```python
>>> kdf.groupby('a').describe()
2020-01-13 17:33:00 WARN WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
      b                                         c
  count mean       std  min  25%  50%  75%  max count mean       std  min  25%  50%  75%  max
0   2.0  4.5  0.707107  4.0  4.0  4.0  5.0  5.0   2.0  7.5  0.707107  7.0  7.0  7.0  8.0  8.0
1   1.0  6.0       NaN  6.0  6.0  6.0  6.0  6.0   1.0  9.0       NaN  9.0  9.0  9.0  9.0  9.0
```
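For reference, the `name_like_string` helper used in the implementation sketch above flattens a tuple column label into a single string column name. A rough stand-in (a simplified illustration, not Koalas' exact implementation) behaves like this:

```python
def name_like_string(name):
    # Tuples such as ('b', '25%') become "(b, 25%)"; plain strings pass through.
    if isinstance(name, tuple):
        return "({})".format(", ".join(str(n) for n in name))
    return str(name)

print(name_like_string(("b", "25%")))  # (b, 25%)
print(name_like_string("a"))           # a
```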
@itholic Thank you for the feedback. I'll try to rewrite it following your suggestions above and get back to you.
@deepyaman My pleasure :) Hope it helps you!
Good catch. I assume you're referring to the cases where the values are different, more so than the differences in decimal representation or column order. Under the hood, pandas interpolates between data points when computing quantiles. That being said, I don't think you can even guarantee that the implementation would match the pandas result. Would it make sense to reimplement some sort of linear interpolation, or is documenting that the quantiles aren't interpolated sufficient? I'm not aware of any sort of interpolated percentile calculation in Spark, but I could be wrong. Happy to add a test tomorrow or the day after, probably based on what you think the expected (interpolation) behavior should be?
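For context, pandas' default quantile interpolation is linear, which is why the pandas output above shows 4.25 as the 25% of the two values `[4, 5]`. A quick NumPy sketch of the difference (assumes NumPy 1.22+ for the `method` parameter):

```python
import numpy as np

data = [4, 5]  # column 'b' for the group where a == 1

# Linear interpolation (pandas' default): 4 + 0.25 * (5 - 4) = 4.25
linear = np.quantile(data, 0.25)

# A non-interpolating method returns an actual data point instead.
lower = np.quantile(data, 0.25, method="lower")

print(linear, lower)  # 4.25 4.0
```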
Speaking of keeping the APIs as similar as possible, do you think the fact that I've added a separate `quartiles` aggregation is a problem? If not, I could use your help in determining how else to do it. My first thought would be to create a very similar function.
Thanks for the explanation :) Ah, sorry, I just meant that the 'shape' of the result doesn't look the same as pandas. For example,

```python
>>> kdf.groupby('a').describe()
      b                            c                         b            c
  count mean       std min max count mean       std min max 25% 50% 75% 25% 50% 75%
a
1     2  4.5  0.707107   4   5     2  7.5  0.707107   7   8   4   4   5   7   7   8
3     1  6.0       NaN   6   6     1  9.0       NaN   9   9   6   6   6   9   9   9
```

whereas for pandas it looks like below:

```python
>>> pdf.groupby('a').describe()
      b                                           c
  count mean       std  min   25%  50%   75%  max count mean       std  min   25%  50%   75%  max
a
1   2.0  4.5  0.707107  4.0  4.25  4.5  4.75  5.0   2.0  7.5  0.707107  7.0  7.25  7.5  7.75  8.0
3   1.0  6.0       NaN  6.0  6.00  6.0  6.00  6.0   1.0  9.0       NaN  9.0  9.00  9.0  9.00  9.0
```
I think the idea behind the current implementation is fine, and maybe it is better to just add a note to the docstring about why the result is slightly different from pandas, as you commented (to handle larger data).
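As a sketch of what such a docstring note could look like (the wording here is illustrative, not the actual Koalas docstring):

```python
def describe(self):
    """Generate descriptive statistics for each group.

    Notes
    -----
    Unlike pandas, the percentiles here are computed without
    interpolation, so the ``25%``/``50%``/``75%`` values are actual
    data points rather than interpolated values. Results for small
    groups can therefore differ slightly from pandas.
    """

print("Notes" in describe.__doc__)  # True
```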
This is a good point. We already have some similar functionality (not 100% the same as pandas for several reasons). Although it seems okay to me if there is a proper note in its docs, maybe we'd better discuss it with other maintainers who have more insight into these kinds of functionalities. (Technically, I'm not even one of the maintainers of this repository 😅)
Got it. I'm aware and can fix that relatively easily. I included it on the list of TODOs for this PR in the initial message; "Reorder percentiles" probably wasn't the best description. Anyway, just wanted to see if there were more fundamental issues first. :)
Codecov Report

```
@@          Coverage Diff           @@
##          master    #1168   +/-  ##
=====================================
  Coverage       ?   95.24%
=====================================
  Files          ?       35
  Lines          ?     7124
  Branches       ?        0
=====================================
  Hits           ?     6785
  Misses         ?      339
  Partials       ?        0
```

Continue to review the full report at Codecov.
Softagram Impact Report for pull/1168 (head commit: 707f1eb)
```python
        for i, x in enumerate(formatted_percentiles)
    }
)
kdf = kdf.drop(label).join(exploded)
```
`join` is expensive; we should avoid it. Actually, I like @itholic's suggestion for that reason.
I am going to just merge. @itholic, can you make a PR to address the comments you pointed out?

We should also fix the docstring.

Thanks @deepyaman for finding this issue and working on this.

@HyukjinKwon Okay, I'll make a PR soon.

@itholic @HyukjinKwon Sorry, I don't get much time to work on these things during the week. I didn't see a PR with the docstring/code updates yet; I started by adding the docstring, if that's OK (#1202). I'll add in @itholic's code shortly, too. Please feel free to ignore this if it's already done or in progress; I'm doing this partly for my own learning as well.

Don't worry, and keep going :) I also didn't check more on this since last week was crazily busy 😅 I'll check #1202 soon, and thanks for the contribution!!


Closes #1166
Manual test:
TODO: