Skip to content

Conversation

@ueshin
Copy link
Collaborator

@ueshin ueshin commented Dec 19, 2019

Adding a hidden column __natural_order__ for ordering the rows as it is, especially for window-like operations like cumxxx.

@ueshin ueshin requested a review from HyukjinKwon December 19, 2019 23:39
@ueshin ueshin changed the title Add a hidden column __natural_order__. [WIP] Add a hidden column __natural_order__. Dec 20, 2019
@codecov-io
Copy link

codecov-io commented Dec 20, 2019

Codecov Report

Merging #1146 into master will decrease coverage by <.01%.
The diff coverage is 98.07%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1146      +/-   ##
==========================================
- Coverage   95.22%   95.22%   -0.01%     
==========================================
  Files          35       35              
  Lines        7062     7055       -7     
==========================================
- Hits         6725     6718       -7     
  Misses        337      337
Impacted Files Coverage Δ
databricks/koalas/series.py 96.46% <100%> (-0.03%) ⬇️
databricks/koalas/indexing.py 94.22% <100%> (ø) ⬆️
databricks/koalas/generic.py 96.18% <100%> (ø) ⬆️
databricks/koalas/groupby.py 91.6% <100%> (-0.02%) ⬇️
databricks/koalas/frame.py 97.02% <100%> (-0.02%) ⬇️
databricks/koalas/internal.py 96.5% <85.71%> (+0.07%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0865228...3d48413. Read the comment docs.

@ueshin ueshin changed the title [WIP] Add a hidden column __natural_order__. Add a hidden column __natural_order__. Dec 20, 2019
@softagram-bot
Copy link

Softagram Impact Report for pull/1146 (head commit: 3d48413)

⚠️ Copy paste found

ℹ️ frame.py: Copy paste fragment on line 5792 shared with ../namespace.py:

              on: Union[str, List[str], Tuple[str, ...], List[Tuple[str, ...]]] = None,
              left_on: Union[str, List[str], Tuple[s...(truncated 273 chars)

ℹ️ frame.py: Copy paste fragment inside the same file on lines 7294, 7377:


        # TODO: there is a similar logic to transpose in, for instance,
        #  DataFrame.any, Series.quantile. Maybe ...(truncated 1065 chars)

ℹ️ frame.py: Copy paste fragment inside the same file on lines 4913, 4935:

            sdf = self._sdf.select(
                self._internal.index_scols +
                [self._internal.scol_for(idx...(truncated 505 chars)

ℹ️ groupby.py: Copy paste fragment inside the same file on lines 1153, 1224:

            window = Window.partitionBy(groupkey_cols) \
                .orderBy(order_column, NATURAL_ORDER_COLUMN_NAME)
    ...(truncated 1017 chars)

ℹ️ groupby.py: Copy paste fragment inside the same file on lines 2058, 2104:

        sdf = sdf.withColumn('rank', F.row_number().over(window)).filter(F.col('rank') <= n)
        internal = _...(truncated 497 chars)

Now that you are on the file, it would be easier to pay back some tech. debt.

⭐ Change Overview

Showing the changed files, dependency changes and the impact - click for full size
(Open in Softagram Desktop for full details)

⭐ Details of Dependency Changes

details of dependency changes - click for full size
(Open in Softagram Desktop for full details)

📄 Full report

Impact Report explained. Give feedback on this report to [email protected]

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that it harms readability in the codes ... Now hidden columns are going around in the Spark DataFrame. But I guess we have no choice to deal with the natural order ..

I am merging this for now but I really think we should improve the readability .. at least we can have, for instance, sdf_without_hidden_columns property or another layer or util to control such things ..


NATURAL_ORDER_COLUMN_NAME = '__natural_order__'

HIDDEN_COLUMNS = set([NATURAL_ORDER_COLUMN_NAME])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no big deal but we don't we just use a list to keep the order? I don't think it's likely to have a duplicated columns if that was the concern.


return_schema = StructType(
[StructField(SPARK_INDEX_NAME_FORMAT(0), LongType())] + list(sdf.schema))
columns = [f.name for f in return_schema]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It think StructType has names property for this line.


sdf = self._sdf.select(
self._internal.index_scols + [c._scol for c in applied])
self._internal.index_scols + [c._scol for c in applied] + list(HIDDEN_COLUMNS))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to select the hidden columns here explicitly? I suspect it was selected since they are map-like operations (?). Maybe we should just don't do this in all code base if I am not mistaken ...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually in most cases we can remove, but in some cases we can't, e.g., right after _cum, and window-like functions after #1151 is merged.

@HyukjinKwon
Copy link
Member

I am merging this for now. I will think about how to clean up too ... we can address the comments together later.

@HyukjinKwon HyukjinKwon merged commit b3babf9 into databricks:master Dec 30, 2019
@ueshin ueshin deleted the natural_order branch December 30, 2019 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants