
Conversation

@rxin (Contributor) commented May 7, 2019

Created a design principles section in README.md, and moved the old development guide into CONTRIBUTING.md.

Also added a new FAQ question: Is it Koalas or koalas?

@rxin marked this pull request as ready for review May 7, 2019 22:53
@rxin requested review from HyukjinKwon, thunterdb and ueshin May 7, 2019 22:54
@codecov-io commented May 7, 2019

Codecov Report

Merging #255 into master will increase coverage by 0.11%.
The diff coverage is n/a.


```
@@            Coverage Diff             @@
##           master     #255      +/-   ##
==========================================
+ Coverage   92.17%   92.29%   +0.11%
==========================================
  Files          35       35
  Lines        3158     3256      +98
==========================================
+ Hits         2911     3005      +94
- Misses        247      251       +4
```
| Impacted Files | Coverage Δ |
|----------------|------------|
| databricks/koalas/namespace.py | 92.4% <0%> (+2.47%) ⬆️ |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a3e5160...aae46f4.

There are 4 different classes of functions:

1. Functions that are only found in Spark (`select`, `selectExpr`). These functions should also be available in Koalas.
@HyukjinKwon (Member) commented:

I'm still not sure if we should allow those. `select` conflicts with pandas's (https://github.com/pandas-dev/pandas/blob/88062f75dbca929ec082295c936edd07cc912dbf/pandas/core/generic.py#L3653-L3669):

```python
>>> pd.DataFrame([{'a': 1}]).select
<bound method NDFrame.select of    a
0  1>
```

`selectExpr` raises the question of whether we should later allow, for instance, `filter` to accept string expressions as well, which conflicts with pandas's.

I was wondering if we could instead say:

> Spark APIs can be added only with strong reasons when they don't conflict with pandas.

@thunterdb (Contributor) commented:

There is indeed a `select` function, but this is fine because the expected input is different and it is deprecated; see the doc here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select.html#pandas.DataFrame.select

It does not conflict with pandas in the sense that we can add the full pandas behaviour later. For now, as we do with functions that are partially implemented, we can throw a `NotImplementedError` for the pandas arguments (see the sketch below).

Given the ubiquity of `select` in Spark, I believe we should add it.
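For illustration only, a minimal sketch of the `NotImplementedError` approach; `KoalasLikeDataFrame` and its `_sdf` attribute are hypothetical stand-ins, not the actual Koalas internals:

```python
# Hypothetical sketch (not the actual Koalas implementation): expose a
# Spark-style select() on a Koalas-like DataFrame while raising
# NotImplementedError for the deprecated pandas select(crit, axis) signature.

class KoalasLikeDataFrame:
    def __init__(self, spark_df):
        self._sdf = spark_df  # underlying PySpark DataFrame (assumed)

    def select(self, *cols, crit=None, axis=None):
        if crit is not None or axis is not None:
            # Reserve the deprecated pandas signature; it could still be
            # implemented later without breaking this API.
            raise NotImplementedError(
                "pandas-style select(crit, axis) is not implemented; "
                "pass Spark-style column names, e.g. df.select('a', 'b')"
            )
        # Delegate to the underlying Spark DataFrame and re-wrap the result.
        return KoalasLikeDataFrame(self._sdf.select(*cols))
```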

@HyukjinKwon (Member) commented:

I think at least we shouldn't pick `select` as an example in this doc. And, if you don't mind, I would stick to "Spark APIs can be added only with strong reasons when they don't conflict with pandas."

@ueshin (Collaborator) commented:

I agree with @HyukjinKwon; at least we shouldn't use `select` as an example here.
If we add the sentence "Spark APIs can be added ..." from @HyukjinKwon's comment, we can then use `select` as the example for that sentence.

To build documentation via Sphinx:

```bash
cd docs && make clean html
```
2. Functions that are found in Spark but that have a clear equivalent in pandas, e.g. `alias` and `rename`. These functions will be implemented as the alias of the pandas function, but should be marked that they are aliases of the same functions. They are provided so that existing users of PySpark can get the benefits of Koalas without having to adapt their code.
@HyukjinKwon (Member) commented May 8, 2019:

I think aliases like `alias` for `rename` are not necessary. Existing users of PySpark will be confused about which APIs are exposed and which are not, and it becomes unclear what the Koalas API expects. This can be worked around via `to_spark`; we can expose another API that truncates Spark's index.

PySpark doesn't have an index, so when PySpark users write `to_spark(index='truncate').alias(...)`, they won't get confused.

@thunterdb (Contributor) commented:

These functions are used everywhere in PySpark user code, so I would hate to have to rename everything to use Koalas. Also, `kdf.to_spark(index='truncate').alias(...).to_koalas()` is way too long to be useful (see the comparison sketch below).
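For illustration, the two options under debate side by side. Both `alias` on a Koalas DataFrame and the `index='truncate'` flag are hypothetical proposals from this thread, not implemented APIs, so the calls are shown commented out:

```python
import databricks.koalas as ks

kdf = ks.DataFrame({'a': [1, 2, 3]})

# Option 1 (proposed): expose Spark's alias() directly on Koalas.
# result = kdf.alias('x')  # hypothetical, not implemented

# Option 2 (workaround): round-trip through PySpark explicitly,
# with index='truncate' dropping the Koalas index on the way out.
# result = kdf.to_spark(index='truncate').alias('x').to_koalas()  # hypothetical flag
```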

@HyukjinKwon (Member) commented May 8, 2019:

I think we should have a clear fence between Koalas and PySpark. PySpark users will come over to use something different.

My impression is that we don't need to add index support on the PySpark side (and PySpark users are arguably not used to indexes).

In addition, mixing both sides' APIs will bring dev overhead and confuse users, with little added value. If we simply delegate via `to_spark(...)`, it adds no overhead or confusion but still works as a PySpark-friendly escape hatch.

@rxin (Contributor, Author) commented:

I changed the wording to say:


 1. Functions that are found in both Spark and pandas under the same name (`count`, `dtypes`, `head`). The return type is the same as in pandas (and not Spark's).
    
 2. Functions that are found in Spark but that have a clear equivalent in pandas, e.g. `alias` and `rename`. These functions will be implemented as the alias of the pandas function, but should be marked that they are aliases of the same functions. They are provided so that existing users of PySpark can get the benefits of Koalas without having to adapt their code.
 
 3. Functions that are only found in pandas. When these functions are appropriate for distributed datasets, they should become available in Koalas.
 
 4. Functions that are only found in Spark that are essential to controlling the distributed nature of the computations, e.g. `cache`. These functions should be available in Koalas.

We are still debating whether data transformation functions only available in Spark should be added to Koalas, e.g. `select`. We would love to hear your feedback on that.
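To make the four classes concrete, a sketch of what each might look like from user code, assuming a Koalas DataFrame and that the functions land as described; this is illustrative only, and the PR does not confirm these exact signatures:

```python
import databricks.koalas as ks

kdf = ks.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

kdf.head(2)                     # class 1: same name in Spark and pandas;
                                #   returns a pandas-style result, not Spark's
kdf.rename(columns={'a': 'x'})  # class 2: pandas spelling; a Spark-style
                                #   `alias` would be an alias of it (hypothetical)
kdf.a.value_counts()            # class 3: pandas-only, still meaningful
                                #   on distributed data
kdf.cache()                     # class 4: Spark-only, controls distributed
                                #   execution (assumed available as described)
```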


### Is it Koalas or koalas?

It's Koalas. Unlike pandas, we use upper case here.
A Contributor commented:

Thanks :)

@thunterdb (Contributor) left a comment:

Just a few comments.

In general, we should encourage having a subset of the Spark functions exposed in the API:

- the conversion back and forth is not straightforward because of indexing
- we want to encourage existing PySpark users to try out this API too
- to prevent conflicts, we can scope the content in a `.spark` accessor anyway (a sketch of such an accessor follows below)
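A minimal sketch of the accessor idea; the `spark` property, `SparkAccessor`, and the internal `_sdf`/`_with_new_sdf` helpers are all hypothetical names, not the actual Koalas code:

```python
# Hypothetical sketch: scope Spark-only methods under a `.spark` accessor
# so their names never collide with pandas methods on the DataFrame itself.

class SparkAccessor:
    def __init__(self, kdf):
        self._kdf = kdf

    def select(self, *cols):
        # Delegate to the wrapped PySpark DataFrame and re-wrap the result.
        return self._kdf._with_new_sdf(self._kdf._sdf.select(*cols))


class KoalasLikeDataFrame:
    def __init__(self, spark_df):
        self._sdf = spark_df  # underlying PySpark DataFrame

    def _with_new_sdf(self, sdf):
        return KoalasLikeDataFrame(sdf)

    @property
    def spark(self):
        # Usage: kdf.spark.select('a', 'b') instead of kdf.select(...)
        return SparkAccessor(self)
```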

@rxin merged commit f6abf57 into databricks:master May 8, 2019
@HyukjinKwon (Member) commented:
LGTM. Thanks, @rxin and all.

@rxin (Contributor, Author) commented May 8, 2019:

I made a mistake: the merge didn't include my final commit, so I had to push that commit directly to master.
