
Allow querying DataFrame directly in sql method #285

@floscha


While I really like the idea of @rxin's recent #256 PR, he uses an (in my opinion) over-simplistic example of ks.sql("select * from range(10) where id > 7"). I believe that the ability to query actual Koalas DataFrames through SQL can prove really valuable to many users. However, when trying to use ks.sql with a Koalas DataFrame, the following exception occurs:

>>> kdf = ks.DataFrame({'A': [1, 2, 3]})
>>> ks.sql("select * from kdf")
...
org.apache.spark.sql.AnalysisException: Table or view not found: kdf; line 1 pos 14
...

This is not surprising to someone with PySpark knowledge, who knows that kdf has to be registered as a temporary table before it can be used with Spark SQL. Unfortunately, (as I understand it) the target group of the Koalas library should not be expected to be Spark experts. To get the above example working, the following workaround is needed, which requires using the lower-level (Py)Spark API and thus somewhat defeats the purpose of Koalas.

>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> sc = SparkContext.getOrCreate()
>>> sql = SQLContext(sc)
>>> sql.registerDataFrameAsTable(kdf._sdf, "kdf")
>>> ks.sql("select * from kdf")
   __index_level_0__  A                                                         
0                  0  1
1                  1  2
2                  2  3
# Optionally clean-up by dropping the temporary table
>>> sql.dropTempTable("kdf")

Wouldn't it be much more convenient if this "temporary table magic" were instead handled by Koalas behind the scenes? Or are there design objections to such an approach?
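To make the idea concrete, here is a minimal stdlib-only sketch of how such magic could work: inspect the caller's namespace, find variables whose names appear in the query and are DataFrames, and register those as temporary views before running the query. This is purely illustrative; `_resolve_dataframes` and `sql` below are hypothetical names, and a fake DataFrame class stands in for ks.DataFrame so the snippet runs without Spark. In real Koalas code, the commented-out step would call `createOrReplaceTempView` on the underlying Spark DataFrame and drop the views afterwards.

```python
import inspect
import re


def _resolve_dataframes(query, df_type):
    """Map table names in `query` to DataFrame objects in the caller's scope.

    Looks two frames up (past the public sql() wrapper) and returns every
    variable whose name occurs as an identifier in the query and whose type
    matches df_type.
    """
    frame = inspect.currentframe().f_back.f_back  # frame of the sql() caller
    try:
        namespace = {**frame.f_globals, **frame.f_locals}
    finally:
        del frame  # break the reference cycle created by holding a frame
    identifiers = set(re.findall(r"[A-Za-z_]\w*", query))
    return {
        name: obj
        for name, obj in namespace.items()
        if name in identifiers and isinstance(obj, df_type)
    }


def sql(query, df_type):
    """Toy stand-in for ks.sql: resolve referenced DataFrames automatically.

    A real implementation would register each resolved DataFrame as a
    temporary view (e.g. sdf.createOrReplaceTempView(name)), execute the
    query via Spark SQL, and clean the views up again. Here we just return
    the names that would be registered, to show the resolution step.
    """
    tables = _resolve_dataframes(query, df_type)
    return sorted(tables)
```

With this in place, a user-facing call like `ks.sql("select * from kdf")` would find `kdf` in the calling scope on its own, with no explicit registration step.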
