@ueshin
Collaborator

@ueshin ueshin commented Feb 14, 2020

This is a follow-up to #1273.
Spark column names are not always the same as their column labels.
This PR renames the data columns before filtering so that the column names are as expected.
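
The idea can be sketched with plain Python standing in for the Spark DataFrame. This is a minimal illustration, not the code in this PR: `rename_to_labels` and `filter_rows` are hypothetical helpers, and rows are modelled as dicts keyed by Spark column name.

```python
# Hypothetical stand-in for a Spark DataFrame: each row is a dict keyed by
# the Spark column name, which may differ from the column label.
rows = [
    {"(x, a)": 1, "(x, b)": 4},
    {"(x, a)": 2, "(x, b)": 5},
    {"(x, a)": 3, "(x, b)": 6},
]

# Spark column names and the matching single-level column labels.
data_columns = ["(x, a)", "(x, b)"]
column_labels = [("a",), ("b",)]


def rename_to_labels(rows, data_columns, column_labels):
    """Return rows re-keyed by the first level of each column label."""
    mapping = {col: label[0] for col, label in zip(data_columns, column_labels)}
    return [{mapping[col]: value for col, value in row.items()} for row in rows]


def filter_rows(rows, predicate):
    """Keep the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


renamed = rename_to_labels(rows, data_columns, column_labels)
# After renaming, a filter can refer to the column by its label "a".
result = filter_rows(renamed, lambda row: row["a"] > 1)
```

Without the rename step, a predicate written against `"a"` would find no such key, because the rows are still keyed by `"(x, a)"`.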

@ueshin ueshin requested a review from HyukjinKwon February 14, 2020 22:07
@codecov-io

codecov-io commented Feb 14, 2020

Codecov Report

Merging #1283 into master will increase coverage by <.01%.
The diff coverage is 97.33%.


@@            Coverage Diff             @@
##           master    #1283      +/-   ##
==========================================
+ Coverage   95.11%   95.11%   +<.01%     
==========================================
  Files          34       34              
  Lines        7220     7272      +52     
==========================================
+ Hits         6867     6917      +50     
- Misses        353      355       +2
Impacted Files                   Coverage Δ
databricks/koalas/series.py      96.39% <100%> (+0.02%) ⬆️
databricks/koalas/indexing.py    95.96% <100%> (ø) ⬆️
databricks/koalas/groupby.py     91.43% <100%> (ø) ⬆️
databricks/koalas/utils.py       95.45% <100%> (+0.1%) ⬆️
databricks/koalas/plot.py        94.28% <100%> (ø) ⬆️
databricks/koalas/indexes.py     95.9% <100%> (ø) ⬆️
databricks/koalas/internal.py    96.07% <100%> (+0.08%) ⬆️
databricks/koalas/frame.py       96.51% <96%> (-0.05%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e4c87b7...24dfef9.


sdf = self._sdf.filter(expr)
internal = self._internal.copy(sdf=sdf)
data_columns = [label[0] for label in self._internal.column_labels]
Collaborator Author

If we support multi-index columns later, we will need to rename them to fit pandas' requirements.
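
One possible reason `label[0]` would not be enough for multi-level columns is that two labels can share the same first level, so the derived names collide. The sketch below uses hypothetical labels, and joining the levels with an underscore is only an assumed, illustrative naming scheme, not something this PR proposes.

```python
# Hypothetical two-level column labels sharing the first level "x".
column_labels = [("x", "a"), ("x", "b"), ("y", "c")]

# Taking only the first level, as in the line under review, yields duplicates.
first_level_names = [label[0] for label in column_labels]
has_duplicates = len(set(first_level_names)) < len(first_level_names)

# One illustrative (assumed) alternative: join all levels into a single name.
flattened_names = ["_".join(label) for label in column_labels]
```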

@ueshin
Collaborator Author

ueshin commented Feb 18, 2020

also cc @itholic

@itholic
Contributor

itholic commented Feb 19, 2020

@ueshin

Thanks for cc'ing me!

Spark column names are not always the same as their column labels.

Does it mean that there are cases where data_columns and column_labels are different?

@ueshin
Collaborator Author

ueshin commented Feb 19, 2020

yes, I guess you've seen such cases several times?

@itholic
Contributor

itholic commented Feb 19, 2020

@ueshin

Yeah, I think I have seen that before, so I'm trying to reproduce such a case, but I haven't managed to yet (even after renaming a column):

>>> df
     name   class  max_speed
0  falcon    bird      389.0
2  parrot    bird       24.0
3    lion  mammal       80.5
1  monkey  mammal        NaN

>>> df._internal.data_columns
['name', 'class', 'max_speed']
>>> df._internal.column_labels
[('name',), ('class',), ('max_speed',)]

>>> df.rename(columns={'name': 'renamed'}, inplace=True)
>>> df
  renamed   class  max_speed
0  falcon    bird      389.0
2  parrot    bird       24.0
3    lion  mammal       80.5
1  monkey  mammal        NaN

>>> df._internal.data_columns
['renamed', 'class', 'max_speed']
>>> df._internal.column_labels
[('renamed',), ('class',), ('max_speed',)]

Could you show me a simple example when you are available?

@itholic
Contributor

itholic commented Feb 19, 2020

Anyway, LGTM if such cases can happen!

@ueshin
Collaborator Author

ueshin commented Feb 19, 2020

e.g.,:

>>> kdf = ks.DataFrame({('x','a'): [1,2,3], ('x','b'): [4,5,6], ('y','c'): [7,8,9]})
>>> kdf['x']
   a  b
0  1  4
1  2  5
2  3  6
>>> kdf['x'].query('a > 1')
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: "cannot resolve '`a`' given input columns: [__index_level_0__, (x, a), (x, b), __natural_order__]; ...
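
The failure can be modelled without Spark. In this rough sketch the column names are copied from the error message above, while `resolve` and `UnresolvedColumnError` are hypothetical stand-ins for Spark's name resolution and `AnalysisException`. The rename step is an assumption about the approach, not the actual code in this PR.

```python
class UnresolvedColumnError(Exception):
    """Hypothetical error, loosely modelled on Spark's AnalysisException."""


def resolve(name, input_columns):
    """Return name if it is among input_columns, otherwise raise."""
    if name not in input_columns:
        raise UnresolvedColumnError(
            "cannot resolve '`{}`' given input columns: [{}]".format(
                name, ", ".join(input_columns)
            )
        )
    return name


# Spark column names taken from the error message above.
input_columns = ["__index_level_0__", "(x, a)", "(x, b)", "__natural_order__"]
# Column labels of kdf['x'] as displayed above: a and b.
column_labels = [("a",), ("b",)]

try:
    resolve("a", input_columns)
    resolved_before = True
except UnresolvedColumnError:
    resolved_before = False

# Assumed rename step: map each data column to the first level of its label.
data_columns = ["(x, a)", "(x, b)"]
mapping = {col: label[0] for col, label in zip(data_columns, column_labels)}
renamed_columns = [mapping.get(col, col) for col in input_columns]
resolved_after = resolve("a", renamed_columns)
```

The label `a` only becomes resolvable once the data columns carry the label names; the index and ordering columns are left untouched.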

@ueshin
Collaborator Author

ueshin commented Feb 19, 2020

Thanks! I'll merge this now. Please feel free to leave comments if you have any.

@ueshin ueshin merged commit a45e484 into databricks:master Feb 19, 2020
@ueshin ueshin deleted the query branch February 19, 2020 22:43
@itholic
Contributor

itholic commented Feb 19, 2020

Thanks !!
