Skip to content

Conversation

@sllynn
Copy link
Contributor

@sllynn sllynn commented Aug 3, 2020

This PR attempts to address an issue I see with column selection of Koalas DataFrames. Pandas allows use of a Pandas Index object for selecting columns, e.g.

column_mask = entity.df.columns.isin([variable.id for variable in entity.variables])
index_cols = entity.df.columns[column_mask] 
dtypes = entity.df[index_cols].dtypes.astype(str).to_dict() 

will work fine in Pandas but fail with Koalas as the type check in the DataFrame's getitem method does not include the Pandas Index type.

If I feed a Panads Index to the Koalas DataFrame.loc method like entity.df.loc[:, index_cols] it works.

I propose expanding the list of allowable types when getting an item from the Frame class, e.g.
if isinstance(key, (str, tuple, list)): -> if isinstance(key, (str, tuple, list, pd.Index)): to bring it (more) in line with pandas.

I've added to the test_dataframe test to reflect but could break this out if required.

FYI, this is a blocker to running a basic featuretools demo with the proposed Koalas backend.

@sllynn sllynn requested a review from ueshin August 3, 2020 17:06
This PR proposes to allow assigning an index as a column:

```python
>>> kdf = koalas.range(3)
>>> kdf["col"] = kdf.index
>>> kdf
```

```
   id  col
0   0    0
1   1    1
2   2    2
```

Note that this is rather a bandaid fix. If we have a change in the index, it doesn't currently work:

```
>>> kdf["col"] = kdf.index + 1
>>> kdf
```
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../koalas/databricks/koalas/frame.py", line 9979, in __repr__
    pdf = self._get_or_create_repr_pandas_cache(max_display_count)
  File "/.../koalas/databricks/koalas/frame.py", line 9971, in _get_or_create_repr_pandas_cache
    self._repr_pandas_cache = {n: self.head(n + 1)._to_internal_pandas()}
  File "/.../koalas/databricks/koalas/frame.py", line 4985, in head
    sdf = self._internal.resolved_copy.spark_frame
  File "/.../koalas/databricks/koalas/utils.py", line 477, in wrapped_lazy_property
    setattr(self, attr_name, fn(self))
  File "/.../koalas/databricks/koalas/internal.py", line 789, in resolved_copy
    sdf = self.spark_frame.select(self.spark_columns + list(HIDDEN_COLUMNS))
  File "/.../python/pyspark/sql/dataframe.py", line 1401, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Resolved attribute(s) __index_level_0__#67 missing from __index_level_0__#36,id#33L,__natural_order__#41L in operator !Project [__index_level_0__#36, id#33L, __index_level_0__#67 AS col#73, __natural_order__#41L]. Attribute(s) with the same name appear in the operation: __index_level_0__. Please check if the right attribute(s) are used.;;
!Project [__index_level_0__#36, id#33L, __index_level_0__#67 AS col#73, __natural_order__#41L]
+- Project [__index_level_0__#36, id#33L, monotonically_increasing_id() AS __natural_order__#41L]
```

Looks we should fix https://github.com/databricks/koalas/blob/master/databricks/koalas/indexes.py#L139-L144 and maybe think about changing it back to don't copy its internal.

Resolves #1690
if key is None:
raise KeyError("none key")
if isinstance(key, (str, tuple, list)):
if isinstance(key, (str, tuple, list, pd.Index)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think it's okay to just change to if isinstance(key, (str)) or is_list_like(key): and key = list(key) if is_list_like(key) else key for simplicity for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think calling list(key) when key is a koalas series will generate a databricks.koalas.exceptions.PandasNotImplementedError. So something like kdf[kdf["b"] > 2] will fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can switch the if-else conditions. But okay, let's just keep it as is for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, makes sense. I've switched the conditions.

@HyukjinKwon
Copy link
Member

Let me leave it to @ueshin

itholic and others added 6 commits August 4, 2020 14:57
Implemented `Series.product` (`Series.prod` as an alias).

*Note that since It is implemented by using `exp(sum(log(...)))` trick, it only works for positive numbers.

```python
>>> ks.Series([]).prod()
1.0
>>> ks.Series([1, 2, 3, 4, 5]).prod()
120
>>> ks.Series([1, np.nan, 3, np.nan, 5]).prod()
15.0
```
)

If a parquet file is stored by pandas, there is a metadata to describe index columns.
We can read it and use as `index_col` if it's not specified.
Resolves #1645.
@sllynn sllynn closed this Aug 4, 2020
@sllynn sllynn deleted the getitem-pandas-index branch August 4, 2020 10:47
@sllynn sllynn restored the getitem-pandas-index branch August 4, 2020 10:54
@sllynn sllynn deleted the getitem-pandas-index branch August 4, 2020 10:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants