
Conversation

@rxin (Contributor) commented May 7, 2019

Created a design principles section in README.md, and moved the old development guide into CONTRIBUTING.md.

Also added a new FAQ question: Is it Koalas or koalas?

@rxin marked this pull request as ready for review May 7, 2019 22:53
@rxin requested review from HyukjinKwon, thunterdb and ueshin May 7, 2019 22:54
@codecov-io commented May 7, 2019

Codecov Report

Merging #255 into master will increase coverage by 0.11%.
The diff coverage is n/a.


```
@@            Coverage Diff             @@
##           master     #255      +/-   ##
==========================================
+ Coverage   92.17%   92.29%   +0.11%
==========================================
  Files          35       35
  Lines        3158     3256      +98
==========================================
+ Hits         2911     3005      +94
- Misses        247      251       +4
```
| Impacted Files | Coverage Δ |
|----------------|------------|
| databricks/koalas/namespace.py | 92.4% <0%> (+2.47%) ⬆️ |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a3e5160...aae46f4.

There are 4 different classes of functions:

1. Functions that are only found in Spark (`select`, `selectExpr`). These functions should also be available in Koalas.
@HyukjinKwon (Member) commented:

I'm still not sure if we should allow those. `select` conflicts with pandas's (https://github.com/pandas-dev/pandas/blob/88062f75dbca929ec082295c936edd07cc912dbf/pandas/core/generic.py#L3653-L3669):

```python
>>> pd.DataFrame([{'a': 1}]).select
<bound method NDFrame.select of    a
0  1>
```

`selectExpr` raises the question of whether we should later allow, for instance, `filter` to accept string expressions as well, which conflicts with pandas's.

I was wondering if we could instead say:

> Spark APIs can be added only with strong reasons when they don't conflict with pandas.

@thunterdb (Contributor) commented:

There is indeed a `select` function, but this is fine because the expected input is different and it is deprecated; see the doc here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select.html#pandas.DataFrame.select

It does not conflict with pandas in the sense that we can add the full pandas behaviour later. For now, as we do with functions that are partially implemented, we can throw a `NotImplementedError` for the pandas arguments (see the sketch below).

Given the ubiquity of `select` in Spark, I believe we should add it.
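For illustration only, a minimal sketch of the `NotImplementedError` approach; `KoalasLikeDataFrame` and its `_sdf` attribute are hypothetical stand-ins, not the actual Koalas internals:

```python
# Hypothetical sketch (not the actual Koalas implementation): expose a
# Spark-style select() on a Koalas-like DataFrame while raising
# NotImplementedError for the deprecated pandas select(crit, axis) signature.

class KoalasLikeDataFrame:
    def __init__(self, spark_df):
        self._sdf = spark_df  # underlying PySpark DataFrame (assumed)

    def select(self, *cols, crit=None, axis=None):
        if crit is not None or axis is not None:
            # Reserve the deprecated pandas signature; it could still be
            # implemented later without breaking this API.
            raise NotImplementedError(
                "pandas-style select(crit, axis) is not implemented; "
                "pass Spark-style column names, e.g. df.select('a', 'b')"
            )
        # Delegate to the underlying Spark DataFrame and re-wrap the result.
        return KoalasLikeDataFrame(self._sdf.select(*cols))
```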

@HyukjinKwon (Member) commented:

I think at least we shouldn't pick `select` as an example in this doc. And, if you don't mind, I would stick to "Spark APIs can be added only with strong reasons when they don't conflict with pandas."

@ueshin (Collaborator) commented:

I agree with @HyukjinKwon; at least we shouldn't use `select` as an example here.
If we add the sentence "Spark APIs can be added ..." from @HyukjinKwon's comment, we can then use `select` as the example for that sentence.

To build documentation via Sphinx:

```bash
cd docs && make clean html
```
2. Functions that are found in Spark but that have a clear equivalent in pandas, e.g. `alias` and `rename`. These functions will be implemented as the alias of the pandas function, but should be marked that they are aliases of the same functions. They are provided so that existing users of PySpark can get the benefits of Koalas without having to adapt their code.
@HyukjinKwon (Member) commented May 8, 2019:

I think aliases like `alias` for `rename` are not necessary. Existing users of PySpark will be confused about which APIs are exposed and which are not, and it becomes unclear what the Koalas API expects. This can be worked around via `to_spark`; we can expose another API that truncates Spark's index.

PySpark doesn't have an index, so when PySpark users write `to_spark(index='truncate').alias(...)`, they won't get confused.

@thunterdb (Contributor) commented:

These functions are used everywhere in PySpark user code, so I would hate to have to rename everything to use Koalas. Also, `kdf.to_spark(index='truncate').alias(...).to_koalas()` is way too long to be useful (see the comparison sketch below).
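For illustration, the two options under debate side by side. Both `alias` on a Koalas DataFrame and the `index='truncate'` flag are hypothetical proposals from this thread, not implemented APIs, so the calls are shown commented out:

```python
import databricks.koalas as ks

kdf = ks.DataFrame({'a': [1, 2, 3]})

# Option 1 (proposed): expose Spark's alias() directly on Koalas.
# result = kdf.alias('x')  # hypothetical, not implemented

# Option 2 (workaround): round-trip through PySpark explicitly,
# with index='truncate' dropping the Koalas index on the way out.
# result = kdf.to_spark(index='truncate').alias('x').to_koalas()  # hypothetical flag
```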

@HyukjinKwon (Member) commented May 8, 2019:

I think we should have a clear fence between Koalas and PySpark. PySpark users will come over to use something different.

My impression is that we don't need to add index support on the PySpark side (and PySpark users are arguably not used to indexes).

In addition, mixing both sides' APIs will bring dev overhead and confuse users, with little added value. If we simply delegate via `to_spark(...)`, it adds no overhead or confusion but still works as a PySpark-friendly escape hatch.

@rxin (Contributor, Author) commented:

I changed the wording to say:


 1. Functions that are found in both Spark and pandas under the same name (`count`, `dtypes`, `head`). The return type is the same as in pandas (and not Spark's).
    
 2. Functions that are found in Spark but that have a clear equivalent in pandas, e.g. `alias` and `rename`. These functions will be implemented as the alias of the pandas function, but should be marked that they are aliases of the same functions. They are provided so that existing users of PySpark can get the benefits of Koalas without having to adapt their code.
 
 3. Functions that are only found in pandas. When these functions are appropriate for distributed datasets, they should become available in Koalas.
 
 4. Functions that are only found in Spark that are essential to controlling the distributed nature of the computations, e.g. `cache`. These functions should be available in Koalas.

We are still debating whether data transformation functions only available in Spark should be added to Koalas, e.g. `select`. We would love to hear your feedback on that.
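To make the four classes concrete, a sketch of what each might look like from user code, assuming a Koalas DataFrame and that the functions land as described; this is illustrative only, and the PR does not confirm these exact signatures:

```python
import databricks.koalas as ks

kdf = ks.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

kdf.head(2)                     # class 1: same name in Spark and pandas;
                                #   returns a pandas-style result, not Spark's
kdf.rename(columns={'a': 'x'})  # class 2: pandas spelling; a Spark-style
                                #   `alias` would be an alias of it (hypothetical)
kdf.a.value_counts()            # class 3: pandas-only, still meaningful
                                #   on distributed data
kdf.cache()                     # class 4: Spark-only, controls distributed
                                #   execution (assumed available as described)
```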


### Is it Koalas or koalas?

It's Koalas. Unlike pandas, we use upper case here.
A Contributor commented:

Thanks :)

@thunterdb (Contributor) left a comment:

Just a few comments.

In general, we should encourage having a subset of the Spark functions exposed in the API:

- the conversion back and forth is not straightforward because of indexing
- we want to encourage existing PySpark users to try out this API too
- to prevent conflicts, we can scope the content in a `.spark` accessor anyway (a sketch of such an accessor follows below)
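A minimal sketch of the accessor idea; the `spark` property, `SparkAccessor`, and the internal `_sdf`/`_with_new_sdf` helpers are all hypothetical names, not the actual Koalas code:

```python
# Hypothetical sketch: scope Spark-only methods under a `.spark` accessor
# so their names never collide with pandas methods on the DataFrame itself.

class SparkAccessor:
    def __init__(self, kdf):
        self._kdf = kdf

    def select(self, *cols):
        # Delegate to the wrapped PySpark DataFrame and re-wrap the result.
        return self._kdf._with_new_sdf(self._kdf._sdf.select(*cols))


class KoalasLikeDataFrame:
    def __init__(self, spark_df):
        self._sdf = spark_df  # underlying PySpark DataFrame

    def _with_new_sdf(self, sdf):
        return KoalasLikeDataFrame(sdf)

    @property
    def spark(self):
        # Usage: kdf.spark.select('a', 'b') instead of kdf.select(...)
        return SparkAccessor(self)
```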

@rxin merged commit f6abf57 into databricks:master May 8, 2019
@HyukjinKwon (Member) commented:
LGTM. Thanks, @rxin and all.

@rxin (Contributor, Author) commented May 8, 2019:

I made a mistake: the merge didn't include my final commit, so I had to push that commit directly to master.
