
Conversation

@deepyaman (Contributor) commented Jan 4, 2020

Closes #1166

Manual test:

>>> import databricks.koalas as ks
>>> df = ks.DataFrame({'a': [1, 1, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
20/01/04 18:16:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
>>> df.groupby('a').describe()
20/01/04 18:17:09 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
20/01/04 18:17:10 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
20/01/04 18:17:11 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
      b                                        c                                   
  count mean       std min 25% 50% 75% max count mean       std min 25% 50% 75% max
a                                                                                  
1     2  4.5  0.707107   4   4   4   5   5     2  7.5  0.707107   7   7   7   8   8
3     1  6.0       NaN   6   6   6   6   6     1  9.0       NaN   9   9   9   9   9

TODO:

  • Fix formatting (line length over 100, etc.)
  • Add docstring
  • Add unit tests
  • Reorder percentiles

@itholic (Contributor) commented Jan 5, 2020

Hmm, it seems that the result is not the same as in pandas?

>>> kdf = ks.DataFrame({'a': [1, 1, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> pdf = kdf.to_pandas()

>>> kdf.groupby('a').describe()
      b                            c                          b           c
  count mean       std min max count mean       std min max 25% 50% 75% 25% 50% 75%
a
1     2  4.5  0.707107   4   5     2  7.5  0.707107   7   8   4   4   5   7   7   8
3     1  6.0       NaN   6   6     1  9.0       NaN   9   9   6   6   6   9   9   9

>>> pdf.groupby('a').describe()
      b                                               c
  count mean       std  min   25%  50%   75%  max count mean       std  min   25%  50%   75%  max
a
1   2.0  4.5  0.707107  4.0  4.25  4.5  4.75  5.0   2.0  7.5  0.707107  7.0  7.25  7.5  7.75  8.0
3   1.0  6.0       NaN  6.0  6.00  6.0  6.00  6.0   1.0  9.0       NaN  9.0  9.00  9.0  9.00  9.0

Since one of the main purposes of Koalas is to make our API work the same as pandas as much as possible, this part should be clarified first.

So first of all, I think we'd better add a unit test to tests/test_groupby.py, even just a very simple case like the one below.

def test_describe(self):
    kdf = ks.DataFrame({'a': [1, 1, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
    pdf = kdf.to_pandas()

    self.assert_eq(kdf.groupby('a').describe().sort_index(),
                   pdf.groupby('a').describe().sort_index())

And thanks for the contribution; let's make it work together 👍

# Split "quartiles" columns into first, second, and third quartiles.
for label, content in kdf.iteritems():
if label[1] == "quartiles":
exploded = ks.DataFrame(content.tolist())
@itholic (Contributor) Jan 5, 2020

This line seems a little dangerous, since it can potentially cause a memory issue such as an OOM (tolist() loads all the data into the single driver's memory).

So I think maybe we can use content.to_frame(), or we should find another way.

Or we can simply not support quartiles for now, given the memory issue described above, with a proper notice in the docs.
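To illustrate the difference, a minimal sketch (here content stands for a Koalas Series, as in the hunk above):

# Collects every value of the Series into the driver's memory first;
# on a large dataset this can OOM the driver:
exploded = ks.DataFrame(content.tolist())

# Stays distributed as a Spark-backed frame, with no driver-side collect:
exploded = content.to_frame()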

Collaborator

Yeah, we can just get items from the "quartiles" column.
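A minimal sketch of that idea (column names simplified; getItem is a standard PySpark Column method):

from pyspark.sql import functions as F

# Select the i-th element of the array-typed "quartiles" column directly
# on the Spark DataFrame, avoiding any join or driver-side collect:
sdf = sdf.withColumn("25%", F.col("quartiles").getItem(0))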

Contributor Author

@itholic @ueshin Hey! Sorry it's taking me a while to make the requested changes. I figured out a way to refactor this using _column_op; will it work? I'm still using to_numpy, though, only because Koalas doesn't allow constructing a DataFrame from a column-Series mapping yet.

@itholic (Contributor) Jan 13, 2020

@deepyaman

Hi Deepyaman! Thanks for your continued efforts here.

Basically, to handle a Koalas DataFrame efficiently, we usually use the internal Spark DataFrame (sdf for short, which you can get via kdf._sdf or kdf._internal.sdf) rather than the Koalas API directly.

I wrote another implementation using sdf for you below. (I can't say this is perfect, high-quality code, since it's very roughly implemented and not well tested, but maybe it will help you understand Koalas' internal processing, even just a bit.)

    def describe(self):
        kdf = self.agg(["count", "mean", "std", "min", "max", "quartiles"]).reset_index()
        formatted_percentiles = ["25%", "50%", "75%"]
        sdf = kdf._sdf
        group_key_names = [groupkey.name for groupkey in self._groupkeys]

        quartiles_columns = []
        for data_column in self._kdf._internal.data_columns:
            if data_column not in group_key_names:
                quartiles_columns.append((data_column, 'quartiles'))
        # `quartiles_columns` here looks like the below
        # [('b', 'quartiles'), ('c', 'quartiles')]

        # add columns (b, 25%), (b, 50%) ... (c, 50%), (c, 75%) to `sdf`
        for col_name, quartiles_column in quartiles_columns:
            for i, percentile in enumerate(formatted_percentiles):
                sdf = sdf.withColumn(
                    name_like_string((col_name, percentile)),
                    F.col(name_like_string((col_name, quartiles_column))).getItem(i))
        # so, `sdf` here looks like the below
        # +-----------------+-----+ ... +-----------------+--------+--------+--------+--------+--------+--------+
        # |__index_level_0__|(a, )| ... |__natural_order__|(b, 25%)|(b, 50%)|(b, 75%)|(c, 25%)|(c, 50%)|(c, 75%)|
        # +-----------------+-----+ ... +-----------------+--------+--------+--------+--------+--------+--------+
        # |                0|    1| ... |     592705486848|       4|       4|       5|       7|       7|       8|
        # |                1|    3| ... |     919123001344|       6|       6|       6|       9|       9|       9|
        # +-----------------+-----+ ... +-----------------+--------+--------+--------+--------+--------+--------+

        # make the column list what we want to select from `sdf`
        columns = []
        for col_name, _ in quartiles_columns:
            for func in ["count", "mean", "std", "min", "25%", "50%", "75%", "max"]:
                columns.append((col_name, func))

        name_like_string_columns = [name_like_string(col) for col in columns]
        internal = _InternalFrame(
            sdf=sdf.select(*self._kdf._internal.index_columns, *name_like_string_columns),
            index_map=self._kdf._internal.index_map)

        idx = pd.MultiIndex.from_tuples(columns)
        # `idx` here looks like the below
        # MultiIndex([('b', 'count'),
        #             ('b',  'mean'),
        #             ('b',   'std'),
        #             ('b',   'min'),
        #             ('b',   '25%'),
        #             ('b',   '50%'),
        #             ('b',   '75%'),
        #             ('b',   'max'),
        #             ('c', 'count'),
        #             ('c',  'mean'),
        #             ('c',   'std'),
        #             ('c',   'min'),
        #             ('c',   '25%'),
        #             ('c',   '50%'),
        #             ('c',   '75%'),
        #             ('c',   'max')],
        #            )

        result = DataFrame(internal)
        result.columns = idx

        return result.astype("float64")

And the current implementation seems to invoke jobs many times, like below:

>>> kdf.groupby('a').describe()
2020-01-13 17:08:31 WARN  WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
2020-01-13 17:08:31 WARN  WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
2020-01-13 17:08:32 WARN  WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
2020-01-13 17:08:32 WARN  WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
2020-01-13 17:08:32 WARN  WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
2020-01-13 17:08:33 WARN  WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
2020-01-13 17:08:33 WARN  WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
      b                                             c
  count mean       std  min  25%  50%  75%  max count mean       std  min  25%  50%  75%  max
a
1   2.0  4.5  0.707107  4.0  4.0  4.0  5.0  5.0   2.0  7.5  0.707107  7.0  7.0  7.0  8.0  8.0
3   1.0  6.0       NaN  6.0  6.0  6.0  6.0  6.0   1.0  9.0       NaN  9.0  9.0  9.0  9.0  9.0

We can reduce them by handling the internal frame properly, like below:

>>> kdf.groupby('a').describe()
2020-01-13 17:33:00 WARN  WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
      b                                             c
  count mean       std  min  25%  50%  75%  max count mean       std  min  25%  50%  75%  max
0   2.0  4.5  0.707107  4.0  4.0  4.0  5.0  5.0   2.0  7.5  0.707107  7.0  7.0  7.0  8.0  8.0
1   1.0  6.0       NaN  6.0  6.0  6.0  6.0  6.0   1.0  9.0       NaN  9.0  9.0  9.0  9.0  9.0

Contributor Author

@itholic Thank you for the feedback. I'll try to rewrite it following your suggestions above and get back to you.

Contributor

@deepyaman My pleasure :) Hope it helps you!

@deepyaman (Contributor Author)

Hmm, it seems that the result is not the same as in pandas?

Good catch. I assume you're referring to the cases where the values are different, more so than the differences in decimal representation or column order.

Under the hood, pandas describe uses quantile, letting interpolation default to 'linear'. You'd actually get the same quartiles if you specified 'nearest' instead:

>>> pdf.groupby('a').quantile([0.25, 0.5, 0.75], interpolation='nearest')
        b  c
a           
1 0.25  4  7
  0.50  4  7
  0.75  5  8
3 0.25  6  9
  0.50  6  9
  0.75  6  9

That being said, I don't think you can guarantee that the implementation would match the pandas result in all cases, even specifying interpolation, since I used percentile_approx to handle larger data.

Would it make sense to reimplement some sort of linear interpolation, or is documenting that the quantiles aren't interpolated sufficient? I'm not aware of any sort of interpolated percentile calculation in Spark, but I could be wrong.
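For reference, a minimal sketch of the aggregation involved (column names follow the toy frame above; sdf is the Spark DataFrame backing the Koalas frame, e.g. kdf._sdf; percentile_approx is a built-in Spark SQL function):

from pyspark.sql import functions as F

# Approximate quartiles of column `b` per group, returned as an array column:
sdf.groupBy('a').agg(
    F.expr('percentile_approx(b, array(0.25, 0.5, 0.75))').alias('quartiles')
).show()

Being approximate by design, it isn't guaranteed to match pandas' interpolated quantiles.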

Happy to add a test tomorrow or the day after, probably based on what we think the expected (interpolation) behavior should be?

@deepyaman (Contributor Author)

Since one of the main purposes of Koalas is to make our API work the same as pandas as much as possible, this part should be clarified first.

Speaking of keeping the APIs as similar as possible, do you think the fact that I've added a "quartiles" aggregation function that doesn't exist in pandas is okay? That's actually something I was concerned about.

If not, I could use your help in determining how else to do it. My first thought would be to create a function very similar to _spark_groupby, but with support for that "quartiles" aggregation function. Then I could implement DataFrameGroupBy.quantile (calling the new function), and use the new quantile in describe.

@itholic (Contributor) commented Jan 5, 2020

@deepyaman

Thanks for the explanation :)

Ah, sorry, I just meant that the 'shape' of the result doesn't look the same as in pandas.

For example, DataFrameGroupBy.describe for Koalas looks like the output below, where columns b and c each appear twice:

>>> kdf.groupby('a').describe()
      b                            c                          b           c
  count mean       std min max count mean       std min max 25% 50% 75% 25% 50% 75%
a
1     2  4.5  0.707107   4   5     2  7.5  0.707107   7   8   4   4   5   7   7   8
3     1  6.0       NaN   6   6     1  9.0       NaN   9   9   6   6   6   9   9   9

whereas for pandas it looks like the output below, where columns b and c appear only once:

>>> pdf.groupby('a').describe()
      b                                               c
  count mean       std  min   25%  50%   75%  max count mean       std  min   25%  50%   75%  max
a
1   2.0  4.5  0.707107  4.0  4.25  4.5  4.75  5.0   2.0  7.5  0.707107  7.0  7.25  7.5  7.75  8.0
3   1.0  6.0       NaN  6.0  6.00  6.0  6.00  6.0   1.0  9.0       NaN  9.0  9.00  9.0  9.00  9.0

Happy to add a test tomorrow or the day after, probably based on what we think the expected (interpolation) behavior should be?

I think the current implementation's use of percentile_approx looks good.

And maybe it's better to just add a note to the docstring explaining why the result is slightly different from pandas, as you commented (to handle larger data).
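For example, the note could look something like this (the wording is only a suggestion):

def describe(self):
    """
    Generate descriptive statistics that summarize each group.

    .. note:: Unlike pandas, the percentiles in Koalas are computed
        approximately using Spark's percentile_approx in order to handle
        larger data, so the results can differ slightly from pandas.
    """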

@itholic (Contributor) commented Jan 5, 2020

@deepyaman

Speaking of keeping the APIs as similar as possible, do you think the fact that I've added a "quartiles" aggregation function that doesn't exist in pandas is okay? That's actually something I was concerned about.

This is a good point.

We already have some functionality like this (not 100% the same as pandas, for several reasons; for example Series.quantile, as you said).

It seems okay to me if there is a proper note in its docs, like Series.quantile has:

[Screenshot: the note in the Series.quantile docstring]

Maybe we'd better discuss it with other maintainers who have more insight into this kind of functionality. (Technically, I'm not even one of the maintainers of this repository 😅)

cc @ueshin @HyukjinKwon

@deepyaman (Contributor Author)

Ah, sorry, I just meant that the 'shape' of the result doesn't look the same as in pandas.

Got it. I'm aware of that and can fix it relatively easily; I included it in the list of TODOs for this PR in the initial message:

TODO:

  • Fix formatting (line length over 100, etc.)
  • Add docstring
  • Add unit tests
  • Reorder percentiles

"Reorder percentiles" probably wasn't the best description. Anyway, just wanted to see if there were more fundamental issues first. :)

@codecov-io commented Jan 10, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@72e810f).
The diff coverage is 97.53%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #1168   +/-   ##
=========================================
  Coverage          ?   95.24%           
=========================================
  Files             ?       35           
  Lines             ?     7124           
  Branches          ?        0           
=========================================
  Hits              ?     6785           
  Misses            ?      339           
  Partials          ?        0
Impacted Files Coverage Δ
databricks/koalas/missing/indexes.py 100% <ø> (ø)
databricks/koalas/series.py 96.46% <100%> (ø)
databricks/koalas/base.py 96.39% <100%> (ø)
databricks/koalas/indexing.py 95.96% <100%> (ø)
databricks/koalas/groupby.py 91.9% <100%> (ø)
databricks/koalas/namespace.py 87.57% <100%> (ø)
databricks/koalas/window.py 97.09% <100%> (ø)
databricks/koalas/version.py 100% <100%> (ø)
databricks/koalas/indexes.py 96.6% <100%> (ø)
databricks/koalas/frame.py 97.01% <100%> (ø)
... and 2 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 72e810f...707f1eb.

@softagram-bot commented

Softagram Impact Report for pull/1168 (head commit: 707f1eb)

⚠️ Copy paste found

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 896, 922:

        kdf = ks.from_pandas(pdf)

        self.assert_eq(pdf.groupby('a').head(2).sort_index(),
                       kdf.groupby('a').head(2...(truncated 690 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 767, 791:

        kdf = ks.from_pandas(pdf)
        self.assert_eq(kdf.groupby("b").transform(lambda x: x + 1).sort_index(),
   ...(truncated 421 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 774, 800:


        # multi-index columns
        columns = pd.MultiIndex.from_tuples([('x', 'a'), ('x', 'b'), ('y', 'c')])
   ...(truncated 485 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 556, 571:

        pdf = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 3, 3] * 3,
                            'b': [1, 2, 2, 2, 3, 3, 3, 4, 4] * 3,
                            'c': [1, 2...(truncated 266 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 36, 112:

        pdf = pd.DataFrame({'a': [1, 2, 6, 4, 4, 6, 4, 3, 7],
                            'b': [4, 2, 7, 3, 3, 1, 1, 1, 2],
                            'c': [4, 2, 7, 3, No...(truncated 165 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 613, 642:

        pdf = pd.DataFrame({'A': [1, 1, 2, 2] * 3,
                            'B': [2, 4, None, 3] * 3,
                            'C': [None, None, None, 1] * 3,
         ...(truncated 266 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 400, 426, 452, 478, 504, 530:

        pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6] * 3,
                            'b': [1, 1, 2, 3, 5, 8] * 3,
                            'c': [1, 4, 9, 16, 25, 36] * 3},
...(truncated 151 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 734, 751:

        kdf = ks.from_pandas(pdf)

        self.assert_eq(
            kdf.groupby('car_id').apply(lambda _: pd.DataFrame({"column": [0.0]})).sort_index(),
            pdf.groupby('car_id')...(truncated 216 chars)

ℹ️ test_groupby.py: Copy paste fragment inside the same file on lines 842, 867:

        pdf = pd.DataFrame({'a': [1, 1, 2, 2, 3] * 3,
                            'b': [1, 2, 3, 4, 5] * 3,
                            'c': [5, 4, 3, 2, 1] * 3},
          ...(truncated 145 chars)

ℹ️ groupby.py: Copy paste fragment inside the same file on lines 1163, 1234:

            window = Window.partitionBy(groupkey_cols) \
                .orderBy(order_column, NATURAL_ORDER_COLUMN_NAME)
    ...(truncated 1017 chars)

ℹ️ groupby.py: Copy paste fragment inside the same file on lines 2097, 2144:

        sdf = sdf.withColumn('rank', F.row_number().over(window)).filter(F.col('rank') <= n)
        internal = _InternalFrame(sdf=...(truncated 511 chars)

Now that you are on the file, it would be easier to pay back some tech debt.

⭐ Change Overview

[Image: the changed files, dependency changes, and their impact]

📄 Full report


                for i, x in enumerate(formatted_percentiles)
            }
        )
        kdf = kdf.drop(label).join(exploded)
Member

join is expensive; we should avoid it. Actually, I like @itholic's suggestion in that regard.

@HyukjinKwon (Member)

I am going to just merge. @itholic can you make a PR to address the comments you pointed out?

@HyukjinKwon (Member)

We should also fix the docstring.

@HyukjinKwon HyukjinKwon merged commit 7aae45e into databricks:master Jan 16, 2020
@HyukjinKwon (Member)

Thanks @deepyaman for finding this issue and working on this.

@itholic (Contributor) commented Jan 16, 2020

@HyukjinKwon Okay, I'll make a PR soon.
@deepyaman Thank you for your efforts!

@deepyaman deepyaman deleted the dataframegroupby-describe branch January 18, 2020 21:53
@deepyaman (Contributor Author) commented Jan 18, 2020

@itholic @HyukjinKwon Sorry, I don't get much time to work on these things during the week. I didn't see a PR with the docstring/code updates yet; I started by adding the docstring, if that's OK (#1202). I'll add in @itholic's code shortly, too.

Please feel free to ignore if it's already done or in progress; I'm doing this partly for my own learning, as well.

@itholic (Contributor) commented Jan 20, 2020

@deepyaman

Don't worry, and keep going :)

I also didn't check on this further, since last week was crazily busy 😅

I'll check #1202 soon. Thanks for the contribution!!
