Member

@HyukjinKwon HyukjinKwon commented May 22, 2020

This PR proposes adding a spark namespace to DataFrame, Series, Index and MultiIndex.
Spark-related features are placed under this namespace.

  • (Series|Index|MultiIndex).spark_type -> (Series|Index|MultiIndex).spark.data_type
    spark_type is deprecated

  • (Series|Index|MultiIndex).spark_column -> (Series|Index|MultiIndex).spark.column
    spark_column is deprecated

  • New API (Series|Index).spark.transform

    >>> import databricks.koalas as ks
    >>> import pyspark.sql.functions as F
    >>> kss = ks.Series(["example"])
    >>> kss.spark.transform(lambda s: F.trim(F.upper(s)))
    0    EXAMPLE
    Name: 0, dtype: object

    I intentionally named it transform because the output must have the same length as the input.

  • DataFrame.spark_schema -> DataFrame.spark.schema
    DataFrame.spark_schema is deprecated

  • DataFrame.print_schema -> DataFrame.spark.print_schema
    DataFrame.print_schema is deprecated

  • DataFrame.to_spark -> DataFrame.spark.frame
    DataFrame.to_spark is NOT deprecated, to keep the symmetry between to_koalas <> to_spark. It is just an alias of DataFrame.spark.frame.

  • DataFrame.cache -> DataFrame.spark.cache
    DataFrame.cache is deprecated

  • DataFrame.persist -> DataFrame.spark.persist
    DataFrame.persist is deprecated

  • DataFrame.hint -> DataFrame.spark.hint
    DataFrame.hint is deprecated

  • DataFrame.unpersist -> DataFrame.spark.unpersist
    DataFrame.unpersist is deprecated

  • DataFrame.storage_level -> DataFrame.spark.storage_level
    DataFrame.storage_level is deprecated

  • DataFrame.to_table -> DataFrame.spark.to_table
    DataFrame.to_table is NOT deprecated, to keep the symmetry between ks.read_table <> to_table. It is just an alias of DataFrame.spark.to_table. It is also similar to DataFrame.to_parquet, DataFrame.to_csv, etc.

  • DataFrame.to_spark_io -> DataFrame.spark.to_spark_io
    DataFrame.to_spark_io is NOT deprecated, to keep the symmetry between ks.read_spark_io <> to_spark_io. It is just an alias of DataFrame.spark.to_spark_io. It is also similar to DataFrame.to_parquet, DataFrame.to_csv, etc.

  • DataFrame.explain -> DataFrame.spark.explain
    DataFrame.explain is deprecated
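
The list above distinguishes two kinds of old entry points: deprecated methods and plain aliases that are kept. A minimal pure-Python sketch of that pattern follows. ToyFrame and SparkAccessor are hypothetical stand-ins, not the Koalas classes, and the choice of FutureWarning is an assumption rather than something this PR states.

```python
import warnings


class SparkAccessor:
    """Hypothetical stand-in for the `spark` namespace on a frame."""

    def __init__(self, frame):
        self._frame = frame

    def schema(self):
        # Illustrative only: return the column names held by the toy frame.
        return list(self._frame._columns)

    def frame(self):
        # Illustrative only: stands in for returning the underlying Spark DataFrame.
        return dict(self._frame._columns)


class ToyFrame:
    """Hypothetical frame used for illustration; not the Koalas DataFrame."""

    def __init__(self, columns):
        self._columns = columns

    @property
    def spark(self):
        # Namespaced entry point, analogous in spirit to DataFrame.spark.
        return SparkAccessor(self)

    def spark_schema(self):
        # Deprecated path: warn, then delegate to the namespaced method.
        # FutureWarning is an assumed category for this sketch.
        warnings.warn(
            "ToyFrame.spark_schema is deprecated; use ToyFrame.spark.schema instead.",
            FutureWarning,
            stacklevel=2,
        )
        return self.spark.schema()

    def to_spark(self):
        # Kept as a plain alias: delegates without emitting any warning.
        return self.spark.frame()
```

With this shape, callers of the deprecated method keep working but are nudged toward the namespace, while the retained alias stays silent.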

@HyukjinKwon HyukjinKwon requested a review from ueshin May 22, 2020 09:43
@HyukjinKwon
Member Author

The PR turned out to be too big 😓. I will split it next time.

        with self.assertRaisesRegex(ValueError, msg):
            kdf.truncate("C", "B", axis=1)

    def test_spark_schema(self):
Member Author

Here the tests are for deprecated methods.
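
As a rough illustration of what a test for a deprecated method might check, the sketch below asserts both that the old entry point warns and that it returns the same result as its replacement. The functions old_schema and new_schema are hypothetical, and the FutureWarning category is an assumption; this is not the test code from this PR.

```python
import unittest
import warnings


def new_schema():
    # Hypothetical replacement API.
    return ["a", "b"]


def old_schema():
    # Hypothetical deprecated wrapper: warns, then delegates to the replacement.
    warnings.warn(
        "old_schema is deprecated; use new_schema instead.",
        FutureWarning,
        stacklevel=2,
    )
    return new_schema()


class DeprecatedSchemaTest(unittest.TestCase):
    def test_old_schema_warns_and_matches(self):
        # The deprecated call should emit the warning...
        with self.assertWarns(FutureWarning):
            result = old_schema()
        # ...and still return exactly what the new API returns.
        self.assertEqual(result, new_schema())


def run_deprecation_tests():
    # Load and run the test case programmatically, returning the result object.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DeprecatedSchemaTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```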

@HyukjinKwon HyukjinKwon force-pushed the add-spark-namespace branch 2 times, most recently from 172fc98 to e1908ec May 22, 2020 11:02
codecov-commenter commented May 22, 2020

Codecov Report

Merging #1530 into master will increase coverage by 0.00%.
The diff coverage is 93.09%.

@@           Coverage Diff           @@
##           master    #1530   +/-   ##
=======================================
  Coverage   94.14%   94.14%           
=======================================
  Files          36       37    +1     
  Lines        8396     8487   +91     
=======================================
+ Hits         7904     7990   +86     
- Misses        492      497    +5     

Impacted Files                    Coverage Δ
databricks/koalas/generic.py      96.73% <ø> (ø)
databricks/koalas/internal.py     96.69% <0.00%> (ø)
databricks/koalas/namespace.py    86.09% <ø> (ø)
databricks/koalas/strings.py      82.14% <83.33%> (ø)
databricks/koalas/spark.py        89.47% <89.47%> (ø)
databricks/koalas/indexing.py     92.16% <92.30%> (ø)
databricks/koalas/base.py         97.28% <94.54%> (-0.74%) ⬇️
databricks/koalas/series.py       97.75% <95.34%> (+0.01%) ⬆️
databricks/koalas/datetimes.py    86.75% <100.00%> (ø)
databricks/koalas/frame.py        95.74% <100.00%> (+0.31%) ⬆️
... and 5 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 58c260c...289475e.

Collaborator

@ueshin ueshin left a comment

Great work!!
Basically LGTM, except for a few comments.

@HyukjinKwon HyukjinKwon force-pushed the add-spark-namespace branch from e1908ec to efd8d5e May 24, 2020 06:16
@HyukjinKwon HyukjinKwon force-pushed the add-spark-namespace branch from d9fb2c1 to 289475e May 24, 2020 06:36
@HyukjinKwon HyukjinKwon merged commit 26c0501 into databricks:master May 24, 2020
@HyukjinKwon
Member Author

Merged! Thanks @ueshin.

Contributor

itholic commented May 24, 2020

Gorgeous!

@HyukjinKwon HyukjinKwon deleted the add-spark-namespace branch September 11, 2020 07:50