DataFrameGroupBy.describe #1168
Conversation
Hmm, it seems the result is not the same as pandas?

```python
>>> kdf = ks.DataFrame({'a': [1, 1, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> pdf = kdf.to_pandas()
>>> kdf.groupby('a').describe()
      b                            c                         b            c
  count mean       std min max count mean       std min max 25% 50% 75% 25% 50% 75%
a
1     2  4.5  0.707107   4   5     2  7.5  0.707107   7   8   4   4   5   7   7   8
3     1  6.0       NaN   6   6     1  9.0       NaN   9   9   6   6   6   9   9   9
>>> pdf.groupby('a').describe()
      b                                           c
  count mean       std  min   25%  50%   75%  max count mean       std  min   25%  50%   75%  max
a
1   2.0  4.5  0.707107  4.0  4.25  4.5  4.75  5.0   2.0  7.5  0.707107  7.0  7.25  7.5  7.75  8.0
3   1.0  6.0       NaN  6.0  6.00  6.0  6.00  6.0   1.0  9.0       NaN  9.0  9.00  9.0  9.00  9.0
```

Since one of the main purposes of Koalas is to make our API work the same as pandas as much as possible, this part should be made clear first of all. So I think we'd better add a unit test:

```python
def test_describe(self):
    kdf = ks.DataFrame({'a': [1, 1, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
    pdf = kdf.to_pandas()
    self.assert_eq(kdf.groupby('a').describe().sort_index(),
                   pdf.groupby('a').describe().sort_index())
```

And thanks for the contribution, let's make it work together 👍
databricks/koalas/groupby.py (Outdated)

```python
# Split "quartiles" columns into first, second, and third quartiles.
for label, content in kdf.iteritems():
    if label[1] == "quartiles":
        exploded = ks.DataFrame(content.tolist())
```
This line seems a little dangerous since it can potentially cause memory issues such as OOM (because `tolist()` loads all the data into the single driver's memory). So I think maybe we can use `content.to_frame()`, or we should find another way. Or we can simply not support quartiles for now, given the memory issue above, with a proper notice in the docs.
Yeah, we can just get items from the "quartiles" column.
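In pandas terms, getting an item out of a list-valued column (the analogue of Spark's `Column.getItem(i)`) can be sketched like this; the column values here are made up for illustration:

```python
import pandas as pd

# Hypothetical "quartiles" column: each row holds [25%, 50%, 75%] as a list.
quartiles = pd.Series([[4.0, 4.5, 4.75], [6.0, 6.0, 6.0]])

# .str[i] extracts element i from each row's list, column-wise,
# much like Spark's Column.getItem(i) does on an array column.
q50 = quartiles.str[1]
print(q50.tolist())  # [4.5, 6.0]
```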
Hi Deepyaman! Thanks for your continued efforts here.
Basically, for handling a Koalas DataFrame efficiently, we usually use the internal Spark DataFrame (`sdf` for short; you can get it via `kdf._sdf` or `kdf._internal.sdf`) rather than the Koalas API directly.
I made another implementation using `sdf` for you below.
(I can't say this is perfect, good-quality code since it's very roughly implemented and not tested enough, but maybe it will help you understand Koalas' internal processing, even just a bit.)
```python
def describe(self):
    kdf = self.agg(["count", "mean", "std", "min", "max", "quartiles"]).reset_index()
    formatted_percentiles = ["25%", "50%", "75%"]
    sdf = kdf._sdf
    group_key_names = [groupkey.name for groupkey in self._groupkeys]
    quartiles_columns = []
    for data_column in self._kdf._internal.data_columns:
        if data_column not in group_key_names:
            quartiles_columns.append((data_column, 'quartiles'))
    # `quartiles_columns` here looks like
    # [('b', 'quartiles'), ('c', 'quartiles')]

    # Add columns (b, 25%), (b, 50%), ..., (c, 50%), (c, 75%) to `sdf`.
    for col_name, quartiles_column in quartiles_columns:
        for i, percentile in enumerate(formatted_percentiles):
            sdf = sdf.withColumn(
                name_like_string((col_name, percentile)),
                F.col(name_like_string((col_name, quartiles_column))).getItem(i))
    # So `sdf` here looks like
    # +-----------------+-----+ ... +-----------------+--------+--------+--------+--------+--------+--------+
    # |__index_level_0__|(a, )| ... |__natural_order__|(b, 25%)|(b, 50%)|(b, 75%)|(c, 25%)|(c, 50%)|(c, 75%)|
    # +-----------------+-----+ ... +-----------------+--------+--------+--------+--------+--------+--------+
    # |                0|    1| ... |     592705486848|       4|       4|       5|       7|       7|       8|
    # |                1|    3| ... |     919123001344|       6|       6|       6|       9|       9|       9|
    # +-----------------+-----+ ... +-----------------+--------+--------+--------+--------+--------+--------+

    # Build the column list we want to select from `sdf`.
    columns = []
    for col_name, _ in quartiles_columns:
        for func in ["count", "mean", "std", "min", "25%", "50%", "75%", "max"]:
            columns.append((col_name, func))
    name_like_string_columns = [name_like_string(col) for col in columns]
    internal = _InternalFrame(
        sdf=sdf.select(*self._kdf._internal.index_columns, *name_like_string_columns),
        index_map=self._kdf._internal.index_map)
    idx = pd.MultiIndex.from_tuples(columns)
    # `idx` here looks like
    # MultiIndex([('b', 'count'), ('b', 'mean'), ('b', 'std'), ('b', 'min'),
    #             ('b', '25%'), ('b', '50%'), ('b', '75%'), ('b', 'max'),
    #             ('c', 'count'), ('c', 'mean'), ('c', 'std'), ('c', 'min'),
    #             ('c', '25%'), ('c', '50%'), ('c', '75%'), ('c', 'max')])
    result = DataFrame(internal)
    result.columns = idx
    return result.astype("float64")
```

Also, the current implementation seems to invoke jobs many times, like below:
```python
>>> kdf.groupby('a').describe()
2020-01-13 17:08:31 WARN WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
... (the same warning repeated seven times in total) ...
      b                                         c
  count mean       std  min  25%  50%  75%  max count mean       std  min  25%  50%  75%  max
a
1   2.0  4.5  0.707107  4.0  4.0  4.0  5.0  5.0   2.0  7.5  0.707107  7.0  7.0  7.0  8.0  8.0
3   1.0  6.0       NaN  6.0  6.0  6.0  6.0  6.0   1.0  9.0       NaN  9.0  9.0  9.0  9.0  9.0
```

We can reduce them by handling the internal frame properly, like below:
```python
>>> kdf.groupby('a').describe()
2020-01-13 17:33:00 WARN WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
      b                                         c
  count mean       std  min  25%  50%  75%  max count mean       std  min  25%  50%  75%  max
0   2.0  4.5  0.707107  4.0  4.0  4.0  5.0  5.0   2.0  7.5  0.707107  7.0  7.0  7.0  8.0  8.0
1   1.0  6.0       NaN  6.0  6.0  6.0  6.0  6.0   1.0  9.0       NaN  9.0  9.0  9.0  9.0  9.0
```
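For reference, the `name_like_string` helper used in the implementation sketch above flattens a tuple column label into a single string column name. A rough stand-in (a simplified illustration, not Koalas' exact implementation) behaves like this:

```python
def name_like_string(name):
    # Tuples such as ('b', '25%') become "(b, 25%)"; plain strings pass through.
    if isinstance(name, tuple):
        return "({})".format(", ".join(str(n) for n in name))
    return str(name)

print(name_like_string(("b", "25%")))  # (b, 25%)
print(name_like_string("a"))           # a
```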
@itholic Thank you for the feedback. I'll try to rewrite it following your suggestions above and get back to you.
@deepyaman My pleasure :) Hope it helps you!
Good catch. I assume you're referring to the cases where the values are different, more so than the differences in decimal representation or column order. Under the hood, pandas interpolates between data points when computing quantiles. That being said, I don't think you can even guarantee that the implementation would match the pandas result. Would it make sense to reimplement some sort of linear interpolation, or is documenting that the quantiles aren't interpolated sufficient? I'm not aware of any sort of interpolated percentile calculation in Spark, but I could be wrong. Happy to add a test tomorrow or the day after, probably based on what you think the expected (interpolation) behavior should be?
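For context, pandas' default quantile interpolation is linear, which is why the pandas output above shows 4.25 as the 25% of the two values `[4, 5]`. A quick NumPy sketch of the difference (assumes NumPy 1.22+ for the `method` parameter):

```python
import numpy as np

data = [4, 5]  # column 'b' for the group where a == 1

# Linear interpolation (pandas' default): 4 + 0.25 * (5 - 4) = 4.25
linear = np.quantile(data, 0.25)

# A non-interpolating method returns an actual data point instead.
lower = np.quantile(data, 0.25, method="lower")

print(linear, lower)  # 4.25 4.0
```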
Speaking of keeping the APIs as similar as possible, do you think the fact that I've added a separate `quartiles` aggregation is a problem? If not, I could use your help in determining how else to do it. My first thought would be to create a very similar function.
Thanks for the explanation :) Ah, sorry, I just meant that the 'shape' of the result doesn't look the same as pandas. For example,

```python
>>> kdf.groupby('a').describe()
      b                            c                         b            c
  count mean       std min max count mean       std min max 25% 50% 75% 25% 50% 75%
a
1     2  4.5  0.707107   4   5     2  7.5  0.707107   7   8   4   4   5   7   7   8
3     1  6.0       NaN   6   6     1  9.0       NaN   9   9   6   6   6   9   9   9
```

whereas for pandas it looks like below:

```python
>>> pdf.groupby('a').describe()
      b                                           c
  count mean       std  min   25%  50%   75%  max count mean       std  min   25%  50%   75%  max
a
1   2.0  4.5  0.707107  4.0  4.25  4.5  4.75  5.0   2.0  7.5  0.707107  7.0  7.25  7.5  7.75  8.0
3   1.0  6.0       NaN  6.0  6.00  6.0  6.00  6.0   1.0  9.0       NaN  9.0  9.00  9.0  9.00  9.0
```
I think the idea behind the current implementation is fine, and maybe it is better to just add a note to the docstring about why the result is slightly different from pandas, as you commented (to handle larger data).
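As a sketch of what such a docstring note could look like (the wording here is illustrative, not the actual Koalas docstring):

```python
def describe(self):
    """Generate descriptive statistics for each group.

    Notes
    -----
    Unlike pandas, the percentiles here are computed without
    interpolation, so the ``25%``/``50%``/``75%`` values are actual
    data points rather than interpolated values. Results for small
    groups can therefore differ slightly from pandas.
    """

print("Notes" in describe.__doc__)  # True
```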
This is a good point. We already have some similar functionality (not 100% the same as pandas for several reasons). Although it seems okay to me if there is a proper note in its docs, maybe we'd better discuss it with other maintainers who have more insight into these kinds of functionalities. (Technically, I'm not even one of the maintainers of this repository 😅)
Got it. I'm aware and can fix that relatively easily. I included it on the list of TODOs for this PR in the initial message; "Reorder percentiles" probably wasn't the best description. Anyway, just wanted to see if there were more fundamental issues first. :)
Codecov Report

```
@@          Coverage Diff           @@
##          master    #1168   +/-  ##
=====================================
  Coverage       ?   95.24%
=====================================
  Files          ?       35
  Lines          ?     7124
  Branches       ?        0
=====================================
  Hits           ?     6785
  Misses         ?      339
  Partials       ?        0
```

Continue to review the full report at Codecov.
Softagram Impact Report for pull/1168 (head commit: 707f1eb)
```python
        for i, x in enumerate(formatted_percentiles)
    }
)
kdf = kdf.drop(label).join(exploded)
```
`join` is expensive; we should avoid it. Actually, I like @itholic's suggestion for that reason.
I am going to just merge. @itholic, can you make a PR to address the comments you pointed out?

We should also fix the docstring.

Thanks @deepyaman for finding this issue and working on this.

@HyukjinKwon Okay, I'll make a PR soon.

@itholic @HyukjinKwon Sorry, I don't get much time to work on these things during the week. I didn't see a PR with the docstring/code updates yet; I started by adding the docstring, if that's OK (#1202). I'll add in @itholic's code shortly, too. Please feel free to ignore this if it's already done or in progress; I'm doing this partly for my own learning as well.

Don't worry, and keep going :) I also didn't check more on this since last week was crazily busy 😅 I'll check #1202 soon, and thanks for the contribution!!


Closes #1166
Manual test:
TODO: