Contributor

@rxin rxin commented May 6, 2019

No description provided.

Contributor Author

rxin commented May 6, 2019

This is mostly a clean up and doesn't change the gist of it, since it was already pretty well written. @thunterdb can you take a look? Thanks.

@rxin rxin requested a review from thunterdb May 6, 2019 22:35

codecov-io commented May 6, 2019

Codecov Report

Merging #246 into master will increase coverage by 0.03%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #246      +/-   ##
==========================================
+ Coverage   91.59%   91.62%   +0.03%     
==========================================
  Files          35       35              
  Lines        3022     3022              
==========================================
+ Hits         2768     2769       +1     
+ Misses        254      253       -1
Impacted Files Coverage Δ
databricks/koalas/__init__.py 92.59% <0%> (+3.7%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 19a1536...29de231.

Contributor

@thunterdb thunterdb left a comment

@rxin thanks, just a few comments. Feel free to correct the typos and merge. I think that this should be published into the guide, but that can be done at a later step.

@@ -1,112 +1,81 @@
# Contributing to Koalas - design and principles
Contributor

Should this go into the docs and get published? It contains a lot of useful info for users too.

Contributor Author

yea let's work on that next. maybe we can even put everything in readme.

CONTRIBUTING.md Outdated
In particular, it answers the questions:
- what is in the scope of the Koalas project? What should go into PySpark or Pandas instead?
- What is expected for code contributions
- What is in the scope of the Koalas project? What should go into PySpark or Pandas instead?
Contributor

Pandas -> pandas :)

CONTRIBUTING.md Outdated
Functions that present algorithms specific to distributed datasets
(approx quantiles for example)
The workaround is to force the materialization of the pandas DataFrame, either by calling:
- `.to_pandas()` (koalas only)
Contributor

: returns a pandas DataFrame, koalas only

CONTRIBUTING.md Outdated
(approx quantiles for example)
The workaround is to force the materialization of the pandas DataFrame, either by calling:
- `.to_pandas()` (koalas only)
- `.to_numpy()` (works with both pandas and koalas)
Contributor

returns a numpy array, works with both pandas and Koalas. + put a link to the docs of each method.
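
A minimal sketch of the materialization workaround discussed in this thread, assuming only what the diff states: `.to_pandas()` exists on Koalas DataFrames but not on pandas ones. The helper name `materialize` and the `FakeKoalasFrame` stand-in are hypothetical and exist only so the sketch runs without Spark; a plain dict stands in for a pandas DataFrame.

```python
def materialize(df):
    """Return the local (pandas-side) object for a pandas or Koalas DataFrame.

    Hypothetical helper: it relies only on Koalas exposing `to_pandas()`
    while plain pandas DataFrames do not.
    """
    to_pandas = getattr(df, "to_pandas", None)
    return to_pandas() if callable(to_pandas) else df


class FakeKoalasFrame:
    """Stand-in for a Koalas DataFrame so the sketch runs without Spark."""

    def __init__(self, local):
        self._local = local

    def to_pandas(self):
        # Real Koalas would collect the distributed data to the driver here.
        return self._local


local = {"a": [1, 2, 3]}  # stand-in for a pandas DataFrame
print(materialize(FakeKoalasFrame(local)) is local)
print(materialize(local) is local)
```

Code written against `materialize` stays the same whether it receives a pandas or a Koalas object, which is the point of listing both `.to_pandas()` and `.to_numpy()` as workarounds.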

CONTRIBUTING.md Outdated
- DataFrame.values
- `DataFrame.__iter__` and the array protocol `__array__`

3. *Low-level multidimensional arrays*: Other frameworks like Dask or Modin have a low-level block representation of a multidimensional array that Spark lacks. Until such a representation is available, these functions should not be considered.
Contributor

Low-level functions for multidimensional arrays:

Contributor

This includes for example all the array representations in pandas.array

TODO: Koalas methods for reading and writing should work for both local and distributed files.
### Spark functions that should be included in Koalas

- pyspark.range
Contributor

Since they block access to data fields, I am thinking that all the functions specific to Spark (physical layout, caching, etc.) should be put under a `spark` accessor: `k_df.spark.cache()` would return a Koalas DataFrame that has been cached. What do you think?

Contributor Author

we probably need to try a few and see ...
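
For illustration, the accessor idea proposed in this thread could look roughly like the sketch below. It is not the Koalas implementation: `KoalasLikeFrame`, `SparkAccessor`, and the `cached` flag are hypothetical names, and no Spark call is made. It only shows how Spark-specific operations can be grouped under a `spark` attribute so they do not occupy names on the DataFrame itself.

```python
class SparkAccessor:
    """Hypothetical namespace grouping Spark-specific operations."""

    def __init__(self, frame):
        self._frame = frame

    def cache(self):
        # A real implementation would cache the underlying Spark DataFrame;
        # here we only return a new frame flagged as cached.
        return KoalasLikeFrame(self._frame.data, cached=True)


class KoalasLikeFrame:
    """Hypothetical stand-in for a Koalas DataFrame."""

    def __init__(self, data, cached=False):
        self.data = data
        self.cached = cached

    @property
    def spark(self):
        # Spark-only functionality lives behind this single attribute.
        return SparkAccessor(self)


k_df = KoalasLikeFrame({"a": [1, 2]})
cached_df = k_df.spark.cache()
print(cached_df.cached, k_df.cached)
```

With this layout only the single name `spark` is reserved on the frame, so a column called `cache` would not collide with the caching method.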

@rxin rxin merged commit a3e5160 into databricks:master May 7, 2019