Add a hidden column `__natural_order__`. #1146

ueshin · 2019-12-19T23:39:16Z

Adding a hidden column __natural_order__ for ordering the rows as it is, especially for window-like operations like cumxxx.

codecov-io · 2019-12-20T01:18:15Z

Codecov Report

Merging #1146 into master will decrease coverage by <.01%.
The diff coverage is 98.07%.

@@            Coverage Diff             @@
##           master    #1146      +/-   ##
==========================================
- Coverage   95.22%   95.22%   -0.01%     
==========================================
  Files          35       35              
  Lines        7062     7055       -7     
==========================================
- Hits         6725     6718       -7     
  Misses        337      337

Impacted Files	Coverage Δ
databricks/koalas/series.py	`96.46% <100%> (-0.03%)`	⬇️
databricks/koalas/indexing.py	`94.22% <100%> (ø)`	⬆️
databricks/koalas/generic.py	`96.18% <100%> (ø)`	⬆️
databricks/koalas/groupby.py	`91.6% <100%> (-0.02%)`	⬇️
databricks/koalas/frame.py	`97.02% <100%> (-0.02%)`	⬇️
databricks/koalas/internal.py	`96.5% <85.71%> (+0.07%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0865228...3d48413. Read the comment docs.

softagram-bot · 2019-12-27T17:50:19Z

Softagram Impact Report for pull/1146 (head commit: `3d48413`)

⚠️ Copy paste found

ℹ️ frame.py: Copy paste fragment on line 5792 shared with ../namespace.py:

              on: Union[str, List[str], Tuple[str, ...], List[Tuple[str, ...]]] = None,
              left_on: Union[str, List[str], Tuple[s...(truncated 273 chars)

ℹ️ frame.py: Copy paste fragment inside the same file on lines 7294, 7377:


        # TODO: there is a similar logic to transpose in, for instance,
        #  DataFrame.any, Series.quantile. Maybe ...(truncated 1065 chars)

ℹ️ frame.py: Copy paste fragment inside the same file on lines 4913, 4935:

            sdf = self._sdf.select(
                self._internal.index_scols +
                [self._internal.scol_for(idx...(truncated 505 chars)

ℹ️ groupby.py: Copy paste fragment inside the same file on lines 1153, 1224:

            window = Window.partitionBy(groupkey_cols) \
                .orderBy(order_column, NATURAL_ORDER_COLUMN_NAME)
    ...(truncated 1017 chars)

ℹ️ groupby.py: Copy paste fragment inside the same file on lines 2058, 2104:

        sdf = sdf.withColumn('rank', F.row_number().over(window)).filter(F.col('rank') <= n)
        internal = _...(truncated 497 chars)

Now that you are on the file, it would be easier to pay back some tech. debt.

⭐ Change Overview

(Open in Softagram Desktop for full details)

⭐ Details of Dependency Changes

(Open in Softagram Desktop for full details)

📄 Full report

Permalink: Full report for pull/1146

Impact Report explained. Give feedback on this report to [email protected]

HyukjinKwon

My concern is that it harms readability in the codes ... Now hidden columns are going around in the Spark DataFrame. But I guess we have no choice to deal with the natural order ..

I am merging this for now but I really think we should improve the readability .. at least we can have, for instance, sdf_without_hidden_columns property or another layer or util to control such things ..

HyukjinKwon · 2019-12-27T02:07:50Z

databricks/koalas/internal.py


+NATURAL_ORDER_COLUMN_NAME = '__natural_order__'
+
+HIDDEN_COLUMNS = set([NATURAL_ORDER_COLUMN_NAME])


no big deal but we don't we just use a list to keep the order? I don't think it's likely to have a duplicated columns if that was the concern.

HyukjinKwon · 2019-12-30T01:42:41Z

databricks/koalas/internal.py


+            return_schema = StructType(
+                [StructField(SPARK_INDEX_NAME_FORMAT(0), LongType())] + list(sdf.schema))
+            columns = [f.name for f in return_schema]


It think StructType has names property for this line.

HyukjinKwon · 2019-12-30T02:05:16Z

databricks/koalas/frame.py


        sdf = self._sdf.select(
-            self._internal.index_scols + [c._scol for c in applied])
+            self._internal.index_scols + [c._scol for c in applied] + list(HIDDEN_COLUMNS))


Do we need to select the hidden columns here explicitly? I suspect it was selected since they are map-like operations (?). Maybe we should just don't do this in all code base if I am not mistaken ...

Actually in most cases we can remove, but in some cases we can't, e.g., right after _cum, and window-like functions after #1151 is merged.

HyukjinKwon · 2019-12-30T02:05:26Z

I am merging this for now. I will think about how to clean up too ... we can address the comments together later.

ueshin added 3 commits December 19, 2019 15:24

Add a hidden column for row-id.

4e78541

Preserve the row-id columns.

5ae22fa

Rename.

4068d2b

ueshin requested a review from HyukjinKwon December 19, 2019 23:39

ueshin changed the title ~~Add a hidden column __natural_order__.~~ [WIP] Add a hidden column __natural_order__. Dec 20, 2019

ueshin added 2 commits December 19, 2019 16:40

Fix.

83b6220

Fix?

2538cab

Fix.

f03682e

ueshin changed the title ~~[WIP] Add a hidden column __natural_order__.~~ Add a hidden column __natural_order__. Dec 20, 2019

ueshin added 5 commits December 20, 2019 10:59

Merge branch 'master' into natural_order

153102c

Fix GroupBby.idxmax/idxmin.

0adaaa2

Merge branch 'master' into natural_order

e9047fc

Merge branch 'master' into natural_order

53634dc

Merge branch 'master' into natural_order

3d48413

HyukjinKwon reviewed Dec 30, 2019

View reviewed changes

HyukjinKwon merged commit b3babf9 into databricks:master Dec 30, 2019

ueshin deleted the natural_order branch December 30, 2019 18:45

itholic mentioned this pull request Aug 9, 2021

Does spark DataFrame.limit guarantees any order? #1064

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add a hidden column `__natural_order__`. #1146

Add a hidden column `__natural_order__`. #1146

Uh oh!

ueshin commented Dec 19, 2019 •

edited

Loading

Uh oh!

codecov-io commented Dec 20, 2019 •

edited

Loading

Uh oh!

softagram-bot commented Dec 27, 2019

Uh oh!

HyukjinKwon left a comment

Uh oh!

HyukjinKwon Dec 27, 2019

Uh oh!

HyukjinKwon Dec 30, 2019

Uh oh!

HyukjinKwon Dec 30, 2019

Uh oh!

ueshin Dec 30, 2019

Uh oh!

HyukjinKwon commented Dec 30, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants


		NATURAL_ORDER_COLUMN_NAME = '__natural_order__'

		HIDDEN_COLUMNS = set([NATURAL_ORDER_COLUMN_NAME])

Add a hidden column __natural_order__. #1146

Add a hidden column __natural_order__. #1146

Uh oh!

Conversation

ueshin commented Dec 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-io commented Dec 20, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

softagram-bot commented Dec 27, 2019

Softagram Impact Report for pull/1146 (head commit: 3d48413)

⚠️ Copy paste found

⭐ Change Overview

⭐ Details of Dependency Changes

📄 Full report

Uh oh!

HyukjinKwon left a comment

Choose a reason for hiding this comment

Uh oh!

HyukjinKwon Dec 27, 2019

Choose a reason for hiding this comment

Uh oh!

HyukjinKwon Dec 30, 2019

Choose a reason for hiding this comment

Uh oh!

HyukjinKwon Dec 30, 2019

Choose a reason for hiding this comment

Uh oh!

ueshin Dec 30, 2019

Choose a reason for hiding this comment

Uh oh!

HyukjinKwon commented Dec 30, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Add a hidden column `__natural_order__`. #1146

Add a hidden column `__natural_order__`. #1146

ueshin commented Dec 19, 2019 •

edited

Loading

codecov-io commented Dec 20, 2019 •

edited

Loading

Softagram Impact Report for pull/1146 (head commit: `3d48413`)