expanded allowable types for kdf indexing to include pandas Index object #1697

sllynn · 2020-08-03T17:06:09Z

This PR attempts to address an issue I see with column selection of Koalas DataFrames. Pandas allows use of a Pandas Index object for selecting columns, e.g.

column_mask = entity.df.columns.isin([variable.id for variable in entity.variables])
index_cols = entity.df.columns[column_mask] 
dtypes = entity.df[index_cols].dtypes.astype(str).to_dict()

will work fine in Pandas but fail with Koalas as the type check in the DataFrame's getitem method does not include the Pandas Index type.

If I feed a Panads Index to the Koalas DataFrame.loc method like entity.df.loc[:, index_cols] it works.

I propose expanding the list of allowable types when getting an item from the Frame class, e.g.
if isinstance(key, (str, tuple, list)): -> if isinstance(key, (str, tuple, list, pd.Index)): to bring it (more) in line with pandas.

I've added to the test_dataframe test to reflect but could break this out if required.

FYI, this is a blocker to running a basic featuretools demo with the proposed Koalas backend.

This PR proposes to allow assigning an index as a column: ```python >>> kdf = koalas.range(3) >>> kdf["col"] = kdf.index >>> kdf ``` ``` id col 0 0 0 1 1 1 2 2 2 ``` Note that this is rather a bandaid fix. If we have a change in the index, it doesn't currently work: ``` >>> kdf["col"] = kdf.index + 1 >>> kdf ``` ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../koalas/databricks/koalas/frame.py", line 9979, in __repr__ pdf = self._get_or_create_repr_pandas_cache(max_display_count) File "/.../koalas/databricks/koalas/frame.py", line 9971, in _get_or_create_repr_pandas_cache self._repr_pandas_cache = {n: self.head(n + 1)._to_internal_pandas()} File "/.../koalas/databricks/koalas/frame.py", line 4985, in head sdf = self._internal.resolved_copy.spark_frame File "/.../koalas/databricks/koalas/utils.py", line 477, in wrapped_lazy_property setattr(self, attr_name, fn(self)) File "/.../koalas/databricks/koalas/internal.py", line 789, in resolved_copy sdf = self.spark_frame.select(self.spark_columns + list(HIDDEN_COLUMNS)) File "/.../python/pyspark/sql/dataframe.py", line 1401, in select jdf = self._jdf.select(self._jcols(*cols)) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../spark/python/pyspark/sql/utils.py", line 117, in deco raise converted from None pyspark.sql.utils.AnalysisException: Resolved attribute(s) __index_level_0__#67 missing from __index_level_0__#36,id#33L,__natural_order__#41L in operator !Project [__index_level_0__#36, id#33L, __index_level_0__#67 AS col#73, __natural_order__#41L]. Attribute(s) with the same name appear in the operation: __index_level_0__. Please check if the right attribute(s) are used.;; !Project [__index_level_0__#36, id#33L, __index_level_0__#67 AS col#73, __natural_order__#41L] +- Project [__index_level_0__#36, id#33L, monotonically_increasing_id() AS __natural_order__#41L] ``` Looks we should fix https://github.com/databricks/koalas/blob/master/databricks/koalas/indexes.py#L139-L144 and maybe think about changing it back to don't copy its internal. Resolves #1690

HyukjinKwon · 2020-08-04T02:16:49Z

databricks/koalas/frame.py

        if key is None:
            raise KeyError("none key")
-        if isinstance(key, (str, tuple, list)):
+        if isinstance(key, (str, tuple, list, pd.Index)):


Actually, I think it's okay to just change to if isinstance(key, (str)) or is_list_like(key): and key = list(key) if is_list_like(key) else key for simplicity for now.

I think calling list(key) when key is a koalas series will generate a databricks.koalas.exceptions.PandasNotImplementedError. So something like kdf[kdf["b"] > 2] will fail.

We can switch the if-else conditions. But okay, let's just keep it as is for now.

Of course, makes sense. I've switched the conditions.

HyukjinKwon · 2020-08-04T02:21:41Z

Let me leave it to @ueshin

Implemented `Series.product` (`Series.prod` as an alias). *Note that since It is implemented by using `exp(sum(log(...)))` trick, it only works for positive numbers. ```python >>> ks.Series([]).prod() 1.0 >>> ks.Series([1, 2, 3, 4, 5]).prod() 120 >>> ks.Series([1, np.nan, 3, np.nan, 5]).prod() 15.0 ```

) If a parquet file is stored by pandas, there is a metadata to describe index columns. We can read it and use as `index_col` if it's not specified. Resolves #1645.

… with pandas Series, reordered conditions

…getitem-pandas-index rebase

expanded allowable types for kdf indexing to include pandas Index object

aef696a

sllynn requested a review from ueshin August 3, 2020 17:06

HyukjinKwon approved these changes Aug 4, 2020

View reviewed changes

HyukjinKwon reviewed Aug 4, 2020

View reviewed changes

itholic and others added 6 commits August 4, 2020 14:57

Try to read pandas metadata in read_parquet if index_col is None. (#1695

66df1b0

) If a parquet file is stored by pandas, there is a metadata to describe index columns. We can read it and use as `index_col` if it's not specified. Resolves #1645.

add is_like_like condition and list conversion for dataframe indexing…

616bcf8

… with pandas Series, reordered conditions

expanded allowable types for kdf indexing to include pandas Index object

f9ff9ae

add is_like_like condition and list conversion for dataframe indexing…

00ba5ad

… with pandas Series, reordered conditions

Merge branch 'getitem-pandas-index' of github.com:sllynn/koalas into …

2d7cbfa

…getitem-pandas-index rebase

sllynn closed this Aug 4, 2020

sllynn deleted the getitem-pandas-index branch August 4, 2020 10:47

sllynn restored the getitem-pandas-index branch August 4, 2020 10:54

sllynn deleted the getitem-pandas-index branch August 4, 2020 10:56

sllynn mentioned this pull request Aug 4, 2020

include pandas Index object in dataframe indexing options #1698

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

expanded allowable types for kdf indexing to include pandas Index object #1697

expanded allowable types for kdf indexing to include pandas Index object #1697

Uh oh!

sllynn commented Aug 3, 2020 •

edited by HyukjinKwon

Loading

Uh oh!

HyukjinKwon Aug 4, 2020

Uh oh!

sllynn Aug 4, 2020

Uh oh!

HyukjinKwon Aug 4, 2020

Uh oh!

sllynn Aug 4, 2020

Uh oh!

HyukjinKwon commented Aug 4, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

expanded allowable types for kdf indexing to include pandas Index object #1697

expanded allowable types for kdf indexing to include pandas Index object #1697

Uh oh!

Conversation

sllynn commented Aug 3, 2020 • edited by HyukjinKwon Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HyukjinKwon Aug 4, 2020

Choose a reason for hiding this comment

Uh oh!

sllynn Aug 4, 2020

Choose a reason for hiding this comment

Uh oh!

HyukjinKwon Aug 4, 2020

Choose a reason for hiding this comment

Uh oh!

sllynn Aug 4, 2020

Choose a reason for hiding this comment

Uh oh!

HyukjinKwon commented Aug 4, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

sllynn commented Aug 3, 2020 •

edited by HyukjinKwon

Loading