Contrib guide update #246
Conversation
This is mostly a cleanup and doesn't change the gist of it, since it was already pretty well written. @thunterdb can you take a look? Thanks.
Codecov Report
@@ Coverage Diff @@
## master #246 +/- ##
==========================================
+ Coverage 91.59% 91.62% +0.03%
==========================================
Files 35 35
Lines 3022 3022
==========================================
+ Hits 2768 2769 +1
+ Misses 254 253 -1
Continue to review full report at Codecov.
thunterdb left a comment
@rxin thanks, just a few comments. Feel free to correct the typos and merge. I think this should be published in the guide, but that can be done at a later step.
| @@ -1,112 +1,81 @@
| # Contributing to Koalas - design and principles
Should this go into the docs and get published? It contains a lot of useful info for users too.
Yeah, let's work on that next. Maybe we can even put everything in the README.
CONTRIBUTING.md
Outdated
| In particular, it answers the questions:
| - what is in the scope of the Koalas project? What should go into PySpark or Pandas instead?
| - What is expected for code contributions
| - What is in the scope of the Koalas project? What should go into PySpark or Pandas instead?
Pandas -> pandas :)
CONTRIBUTING.md
Outdated
| Functions that present algorithms specific to distributed datasets
| (approx quantiles for example)
| The workaround is to force the materialization of the pandas DataFrame, either by calling:
| - `.to_pandas()` (koalas only)
: returns a pandas DataFrame, koalas only
CONTRIBUTING.md
Outdated
| (approx quantiles for example)
| The workaround is to force the materialization of the pandas DataFrame, either by calling:
| - `.to_pandas()` (koalas only)
| - `.to_numpy()` (works with both pandas and koalas)
returns a numpy array, works with both pandas and Koalas. + put a link to the docs of each method.
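A minimal sketch of the workaround under discussion, assuming the `databricks.koalas` import path used by the project at the time; the DataFrame contents are illustrative only:

```python
import databricks.koalas as ks

kdf = ks.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# Force materialization on the driver before handing the data to code
# that expects an in-memory structure.
pdf = kdf.to_pandas()   # returns a pandas DataFrame (Koalas only)
arr = kdf.to_numpy()    # returns a NumPy array (works with both pandas and Koalas)
```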
CONTRIBUTING.md
Outdated
| - DataFrame.values
| - `DataFrame.__iter__` and the array protocol `__array__`
|
| 3. *Low-level multidimensional arrays*: Other frameworks like Dask or Modin have a low-level block representation of a multidimensional array that Spark lacks. Until such a representation is available, these functions should not be considered.
Low-level functions for multidimensional arrays:
This includes, for example, all the array representations in pandas.array.
| TODO: Koalas methods for reading and writing should work for both local and distributed files.
| ### Spark functions that should be included in Koalas
|
| - pyspark.range
Since they block access to the data fields, I am thinking that all the functions specific to Spark (like physical layout, caching, etc.) should be put under a spark accessor: k_df.spark.cache() returns a Koalas DataFrame that has been cached. What do you think?
we probably need to try a few and see ...
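A hypothetical sketch of the accessor idea above; the `SparkAccessor` class and its wiring are not part of the Koalas API here, only an illustration of how `k_df.spark.cache()` could hand back a cached Koalas DataFrame while keeping Spark-specific operations out of the pandas-like namespace:

```python
import databricks.koalas as ks


class SparkAccessor:
    """Illustrative namespace for Spark-specific operations
    (physical layout, caching, ...) on a Koalas DataFrame."""

    def __init__(self, kdf):
        self._kdf = kdf

    def cache(self):
        # Cache the underlying Spark DataFrame and wrap it back into a
        # Koalas DataFrame, so callers stay in the pandas-like API.
        sdf = self._kdf.to_spark().cache()
        return ks.DataFrame(sdf)


kdf = ks.DataFrame({"x": range(10)})
# Under the proposal this would read kdf.spark.cache(); here we call the
# illustrative accessor directly.
cached = SparkAccessor(kdf).cache()
```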