Description
While I really like the idea behind @rxin's recent PR #256, it uses an (in my opinion) over-simplistic example: ks.sql("select * from range(10) where id > 7"). I believe the ability to query actual Koalas DataFrames through SQL could prove really valuable to many users. However, when trying to use ks.sql with a Koalas DataFrame, the following exception occurs:
>>> kdf = ks.DataFrame({'A': [1, 2, 3]})
>>> ks.sql("select * from kdf")
...
org.apache.spark.sql.AnalysisException: Table or view not found: kdf; line 1 pos 14
...

This is not surprising to someone with PySpark experience, who knows that kdf has to be registered as a temporary table before it can be used with Spark SQL. Unfortunately, (as I understand it) the target audience of the Koalas library should not be expected to be Spark experts. To get the above example working, the following workaround is needed, which requires dropping down to the lower-level (Py)Spark API and thus somewhat defeats the purpose of Koalas:
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> sc = SparkContext.getOrCreate()
>>> sql = SQLContext(sc)
>>> sql.registerDataFrameAsTable(kdf._sdf, "kdf")
>>> ks.sql("select * from kdf")
__index_level_0__ A
0 0 1
1 1 2
2 2 3
>>> # Optionally clean up by dropping the temporary table
>>> sql.dropTempTable("kdf")

Wouldn't it be much more convenient if this "temporary table magic" were instead handled by Koalas behind the scenes, or are there design objections to such an approach?