Rename data columns prior to filter to make sure the column names are as expected. #1283
Codecov Report
@@            Coverage Diff            @@
##           master   #1283      +/-   ##
=========================================
+ Coverage   95.11%   95.11%   +<.01%
=========================================
  Files          34       34
  Lines        7220     7272      +52
=========================================
+ Hits         6867     6917      +50
- Misses        353      355       +2
sdf = self._sdf.filter(expr)
internal = self._internal.copy(sdf=sdf)
data_columns = [label[0] for label in self._internal.column_labels]
If we support multi-index columns later, we will need to rename them to fit pandas' requirements.
Also cc @itholic
Thanks for cc'ing me!
Does it mean that there are cases where
Yes, I guess you've seen such cases several times?
Yeah, I think I have seen that before, so I'm trying to reproduce such a case, but haven't managed to yet (even after renaming a column):
>>> df
     name   class  max_speed
0  falcon    bird      389.0
2  parrot    bird       24.0
3    lion  mammal       80.5
1  monkey  mammal        NaN
>>> df._internal.data_columns
['name', 'class', 'max_speed']
>>> df._internal.column_labels
[('name',), ('class',), ('max_speed',)]
>>> df.rename(columns={'name': 'renamed'}, inplace=True)
>>> df
  renamed   class  max_speed
0  falcon    bird      389.0
2  parrot    bird       24.0
3    lion  mammal       80.5
1  monkey  mammal        NaN
>>> df._internal.data_columns
['renamed', 'class', 'max_speed']
>>> df._internal.column_labels
[('renamed',), ('class',), ('max_speed',)]
Could you show me a simple example when you are available?
Anyway, LGTM if such cases can happen!
E.g.:
>>> kdf = ks.DataFrame({('x','a'): [1,2,3], ('x','b'): [4,5,6], ('y','c'): [7,8,9]})
>>> kdf['x']
   a  b
0  1  4
1  2  5
2  3  6
>>> kdf['x'].query('a > 1')
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: "cannot resolve '`a`' given input columns: [__index_level_0__, (x, a), (x, b), __natural_order__]; ...
Thanks! I'll merge this now. Please feel free to leave comments, if any.
Thanks!!
This is a follow-up of #1273.
The Spark column names are not always the same as their column labels.
This PR renames the data columns prior to filtering to make sure the column names are as expected.
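The idea can be illustrated with a small plain-Python model. This is only a sketch and not the Koalas implementation: the rows are ordinary dicts standing in for a Spark DataFrame, `rename_then_filter` is a hypothetical helper, and the column names `(x, a)` and `(x, b)` are taken from the error message quoted in the conversation.

```python
# Plain-Python model (no Spark): rows are keyed by the internal Spark column names.
rows = [
    {"(x, a)": 1, "(x, b)": 4},
    {"(x, a)": 2, "(x, b)": 5},
    {"(x, a)": 3, "(x, b)": 6},
]
data_columns = ["(x, a)", "(x, b)"]   # internal names, as in the error message
column_labels = [("a",), ("b",)]      # labels the user sees after kdf['x']


def rename_then_filter(rows, data_columns, column_labels, predicate):
    """Hypothetical helper: rename columns to their labels, then filter."""
    # Same expression as in the diff: take the first level of each label.
    new_names = [label[0] for label in column_labels]
    mapping = dict(zip(data_columns, new_names))
    renamed = [{mapping.get(k, k): v for k, v in row.items()} for row in rows]
    return [row for row in renamed if predicate(row)]


# The predicate refers to the label 'a', which exists only after the rename.
result = rename_then_filter(rows, data_columns, column_labels, lambda r: r["a"] > 1)
print(result)  # [{'a': 2, 'b': 5}, {'a': 3, 'b': 6}]
```

Without the rename step, the predicate would look up `a` in rows keyed by `(x, a)` and fail, which mirrors the `AnalysisException` shown above.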