Design principles #255
Conversation
Codecov Report
Coverage diff against master:

|          | master | #255   | +/-    |
|----------|--------|--------|--------|
| Coverage | 92.17% | 92.29% | +0.11% |
| Files    | 35     | 35     |        |
| Lines    | 3158   | 3256   | +98    |
| Hits     | 2911   | 3005   | +94    |
| Misses   | 247    | 251    | +4     |
Continue to review full report at Codecov.
> There are 4 different classes of functions:
> 1. Functions that are only found in Spark (`select`, `selectExpr`). These functions should also be available in Koalas.
I still am not sure if we should allow those. `select` conflicts with pandas's (https://github.com/pandas-dev/pandas/blob/88062f75dbca929ec082295c936edd07cc912dbf/pandas/core/generic.py#L3653-L3669):

    >>> pd.DataFrame([{'a': 1}]).select
    <bound method NDFrame.select of    a
    0  1>

`selectExpr` means we might later face the question of whether, for instance, `filter` should accept string expressions as well, which also conflicts with pandas's.
I was wondering if we could instead say:

> Spark APIs can be added only with strong reasons when they don't conflict with pandas.
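To make the conflict concrete, here is a small illustrative sketch (plain pandas plus made-up data; the PySpark behaviour is described in comments for comparison) of how differently `filter` behaves in the two libraries:

```python
import pandas as pd

pdf = pd.DataFrame({'a': [1, 2], 'ab': [3, 4]})

# In pandas, filter() selects *labels* (columns by default), never rows:
pdf.filter(items=['a'])        # keep column 'a'
pdf.filter(like='a', axis=1)   # keep columns whose names contain 'a'

# In PySpark, by contrast, DataFrame.filter("a > 1") filters *rows* with a
# SQL expression string, so letting a Koalas `filter` accept string
# expressions would silently change meaning for pandas users.
```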
There is indeed a `select` function, but this is fine because the expected input is different and it is deprecated; see the doc here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select.html#pandas.DataFrame.select
It does not conflict with pandas in the sense that we can add the full pandas behaviour later. For now, as we do with functions that are partially implemented, we throw a `NotImplementedError` for the pandas arguments.
Given the ubiquity of `select` in Spark, I believe we should add it.
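As a concrete illustration of that approach, here is a minimal sketch (the class and attribute names are hypothetical, not the actual Koalas implementation) of a `select` that supports the Spark-style call while raising `NotImplementedError` for the deprecated pandas arguments:

```python
class DataFrame:
    """Hypothetical Koalas-like wrapper around a Spark DataFrame."""

    def __init__(self, sdf):
        self._sdf = sdf  # the underlying Spark DataFrame

    def select(self, *cols, crit=None, axis=None):
        # pandas.DataFrame.select(crit, axis=0) is deprecated; reserve its
        # arguments and fail loudly until they are actually supported.
        if crit is not None or axis is not None:
            raise NotImplementedError(
                "pandas-style select(crit, axis) is not implemented; "
                "use loc or filter instead")
        # Spark-style projection by column names or Column expressions.
        return DataFrame(self._sdf.select(*cols))
```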
I think at least we shouldn't pick `select` as an example in this doc. And, if you don't mind, I would stick to "Spark APIs can be added only with strong reasons when they don't conflict with pandas."
I agree with @HyukjinKwon; at least we shouldn't use `select` as an example here.
If we add the sentence "Spark APIs can be added ..." as @HyukjinKwon suggests, we can then use `select` as the example for that sentence.
> To build documentation via Sphinx:
> ```bash
> cd docs && make clean html
> ```
> 2. Functions that are found in Spark but that have a clear equivalent in pandas, e.g. `alias` and `rename`. These functions will be implemented as aliases of the pandas functions, but should be marked as such. They are provided so that existing users of PySpark can get the benefits of Koalas without having to adapt their code.
I think `alias` and `rename` are not necessary. Existing users of PySpark will be confused about which APIs are exposed and which are not, and I was thinking it's unclear what the Koalas API expects. This can be worked around via `to_spark`; we can expose another option that truncates Spark's index.
PySpark doesn't have an index, so when PySpark users use `to_spark(index='truncate').alias(...)`, they won't get confused.
These functions are used everywhere in PySpark user code, so I would hate to have to rename everything to use Koalas. Also, `kdf.to_spark(index='truncate').alias(...).to_koalas()` is way too long to be useful.
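For concreteness, the two styles being compared look roughly like this (`kdf` stands for a Koalas DataFrame, and `index='truncate'` is the option proposed above, not an existing argument):

```python
# Explicit round trip through PySpark, as proposed:
sdf = kdf.to_spark(index='truncate')  # Koalas -> Spark, index columns dropped
sdf = sdf.alias('t')                  # use the native PySpark API
kdf2 = sdf.to_koalas()                # back to Koalas

# Versus exposing the Spark method directly on the Koalas DataFrame:
kdf2 = kdf.alias('t')
```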
I think we should have a clear fence between Koalas and PySpark. PySpark users will come over to use something different.
My impression is that we don't need to add index support on the PySpark side (and PySpark users are arguably not used to indexes).
In addition, mixing both sides' APIs will bring development overhead and confusion to users, with little added value. If we simply delegate via `to_spark(...)`, it adds no overhead or confusion, but still works as a PySpark-friendly workaround.
I changed the wording to say:

> 1. Functions that are found in both Spark and pandas under the same name (`count`, `dtypes`, `head`). The return value is the same as the return type in pandas (and not Spark's).
> 2. Functions that are found in Spark but that have a clear equivalent in pandas, e.g. `alias` and `rename`. These functions will be implemented as aliases of the pandas functions, but should be marked as such. They are provided so that existing users of PySpark can get the benefits of Koalas without having to adapt their code.
> 3. Functions that are only found in pandas. When these functions are appropriate for distributed datasets, they should become available in Koalas.
> 4. Functions that are only found in Spark that are essential to controlling the distributed nature of the computations, e.g. `cache`. These functions should be available in Koalas.
>
> We are still debating whether data transformation functions only available in Spark should be added to Koalas, e.g. `select`. We would love to hear your feedback on that.
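For class 2, the intended pattern could look like this minimal sketch (illustrative names only, not the actual Koalas code): the pandas-named method carries the implementation, and the Spark-named method is a documented alias of it:

```python
class Series:
    """Hypothetical Koalas-like Series."""

    def rename(self, name):
        """pandas-style rename; the canonical implementation lives here."""
        # ... actual renaming logic would go here ...
        return self

    # Provided for PySpark users; documented as an alias of rename().
    alias = rename
```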
> ### Is it Koalas or koalas?
>
> It's Koalas. Unlike pandas, we use upper case here.
Thanks :)
thunterdb left a comment:
Just a few comments.
In general, we should encourage having a subset of the Spark functions exposed in the API:
- the conversion back and forth is not straightforward because of indexing
- we want to encourage existing PySpark users to try out this API too
- to prevent conflicts, we can scope the content in a `.spark` accessor anyway
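A rough sketch of that accessor idea (hypothetical names, not actual Koalas code): Spark-specific methods live under a dedicated namespace, so they can never shadow a pandas method:

```python
class SparkAccessor:
    """Namespace for Spark-only operations on a Koalas-like DataFrame."""

    def __init__(self, kdf):
        self._kdf = kdf

    def cache(self):
        # Delegate to the underlying Spark DataFrame.
        self._kdf._sdf.cache()
        return self._kdf


class DataFrame:
    def __init__(self, sdf):
        self._sdf = sdf

    @property
    def spark(self):
        return SparkAccessor(self)

# Usage would be kdf.spark.cache() rather than kdf.cache(), so Spark-only
# names never collide with pandas ones.
```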
LGTM. Thanks, @rxin and all.

I made a mistake: the merge didn't include my final commit. I had to push that commit directly to master.
Created a design principles section in README.md, and moved the old development guide into CONTRIBUTING.md.
Also added a new FAQ question: Is it Koalas or koalas?