-
Notifications
You must be signed in to change notification settings - Fork 367
expanded allowable types for kdf indexing to include pandas Index object #1697
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This PR proposes to allow assigning an index as a column:
```python
>>> kdf = koalas.range(3)
>>> kdf["col"] = kdf.index
>>> kdf
```
```
id col
0 0 0
1 1 1
2 2 2
```
Note that this is rather a bandaid fix. If we have a change in the index, it doesn't currently work:
```
>>> kdf["col"] = kdf.index + 1
>>> kdf
```
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../koalas/databricks/koalas/frame.py", line 9979, in __repr__
pdf = self._get_or_create_repr_pandas_cache(max_display_count)
File "/.../koalas/databricks/koalas/frame.py", line 9971, in _get_or_create_repr_pandas_cache
self._repr_pandas_cache = {n: self.head(n + 1)._to_internal_pandas()}
File "/.../koalas/databricks/koalas/frame.py", line 4985, in head
sdf = self._internal.resolved_copy.spark_frame
File "/.../koalas/databricks/koalas/utils.py", line 477, in wrapped_lazy_property
setattr(self, attr_name, fn(self))
File "/.../koalas/databricks/koalas/internal.py", line 789, in resolved_copy
sdf = self.spark_frame.select(self.spark_columns + list(HIDDEN_COLUMNS))
File "/.../python/pyspark/sql/dataframe.py", line 1401, in select
jdf = self._jdf.select(self._jcols(*cols))
File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/.../spark/python/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.AnalysisException: Resolved attribute(s) __index_level_0__#67 missing from __index_level_0__#36,id#33L,__natural_order__#41L in operator !Project [__index_level_0__#36, id#33L, __index_level_0__#67 AS col#73, __natural_order__#41L]. Attribute(s) with the same name appear in the operation: __index_level_0__. Please check if the right attribute(s) are used.;;
!Project [__index_level_0__#36, id#33L, __index_level_0__#67 AS col#73, __natural_order__#41L]
+- Project [__index_level_0__#36, id#33L, monotonically_increasing_id() AS __natural_order__#41L]
```
Looks we should fix https://github.com/databricks/koalas/blob/master/databricks/koalas/indexes.py#L139-L144 and maybe think about changing it back to don't copy its internal.
Resolves #1690
databricks/koalas/frame.py
Outdated
| if key is None: | ||
| raise KeyError("none key") | ||
| if isinstance(key, (str, tuple, list)): | ||
| if isinstance(key, (str, tuple, list, pd.Index)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think it's okay to just change to if isinstance(key, (str)) or is_list_like(key): and key = list(key) if is_list_like(key) else key for simplicity for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think calling list(key) when key is a koalas series will generate a databricks.koalas.exceptions.PandasNotImplementedError. So something like kdf[kdf["b"] > 2] will fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can switch the if-else conditions. But okay, let's just keep it as is for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course, makes sense. I've switched the conditions.
|
Let me leave it to @ueshin |
Implemented `Series.product` (`Series.prod` as an alias). *Note that since It is implemented by using `exp(sum(log(...)))` trick, it only works for positive numbers. ```python >>> ks.Series([]).prod() 1.0 >>> ks.Series([1, 2, 3, 4, 5]).prod() 120 >>> ks.Series([1, np.nan, 3, np.nan, 5]).prod() 15.0 ```
) If a parquet file is stored by pandas, there is a metadata to describe index columns. We can read it and use as `index_col` if it's not specified. Resolves #1645.
… with pandas Series, reordered conditions
… with pandas Series, reordered conditions
…getitem-pandas-index rebase
This PR attempts to address an issue I see with column selection of Koalas DataFrames. Pandas allows use of a Pandas Index object for selecting columns, e.g.
will work fine in Pandas but fail with Koalas as the type check in the DataFrame's getitem method does not include the Pandas Index type.
If I feed a Panads Index to the Koalas DataFrame.loc method like
entity.df.loc[:, index_cols]it works.I propose expanding the list of allowable types when getting an item from the Frame class, e.g.
if isinstance(key, (str, tuple, list)):->if isinstance(key, (str, tuple, list, pd.Index)):to bring it (more) in line with pandas.I've added to the test_dataframe test to reflect but could break this out if required.
FYI, this is a blocker to running a basic featuretools demo with the proposed Koalas backend.