[SPARK-28939][SQL][2.4] Propagate SQLConf for plans executed by toRdd #25734

mgaido91 · 2019-09-09T15:47:14Z

What changes were proposed in this pull request?

The PR proposes to create a custom `RDD` that propagates `SQLConf` also in cases not tracked by SQL execution, as happens when a `Dataset` is converted to an RDD using either `.rdd` or `.queryExecution.toRdd` and actions are then invoked on the returned RDD.

In this way, SQL configs take effect in these cases too, whereas previously they were ignored.
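
The capture-and-reinstall idea described above can be illustrated without Spark. The following is only a toy model under stated assumptions: `ToyConf`, `ConfCapturingTask` and `ConfPropagationDemo` are hypothetical names, a thread-local map stands in for the session configuration, and a single-thread executor stands in for the place where the plan is eventually computed. It is not the code of `SQLExecutionRDD`.

```scala
import java.util.concurrent.{Callable, Executors}

// Hypothetical stand-in for a session-scoped configuration; not Spark's SQLConf.
object ToyConf {
  private val defaults = Map("subexpressionElimination.enabled" -> "true")
  private val active = new ThreadLocal[Map[String, String]] {
    override def initialValue(): Map[String, String] = defaults
  }
  def get(key: String): String = active.get()(key)
  def snapshot: Map[String, String] = active.get()
  // Runs `body` with `conf` installed on the current thread, then restores the old value.
  def withConf[T](conf: Map[String, String])(body: => T): T = {
    val previous = active.get()
    active.set(conf)
    try body finally active.set(previous)
  }
}

// Hypothetical wrapper: captures the configuration when it is created and
// re-installs it around the wrapped computation, wherever that later runs.
final class ConfCapturingTask[T](compute: () => T) extends Callable[T] {
  private val captured = ToyConf.snapshot
  override def call(): T = ToyConf.withConf(captured)(compute())
}

object ConfPropagationDemo {
  // Returns (value seen by a plain task, value seen by a wrapped task).
  def run(): (String, String) = {
    val pool = Executors.newSingleThreadExecutor()
    try {
      ToyConf.withConf(Map("subexpressionElimination.enabled" -> "false")) {
        val read = () => ToyConf.get("subexpressionElimination.enabled")
        // Plain task: the worker thread only sees the defaults.
        val plain = pool.submit(new Callable[String] { def call(): String = read() }).get()
        // Wrapped task: the worker thread sees the configuration captured by the caller.
        val wrapped = pool.submit(new ConfCapturingTask(read)).get()
        (plain, wrapped)
      }
    } finally pool.shutdown()
  }
}
```

In this model the plain task plays the role of an action on an RDD obtained through `.rdd` before the patch (the caller's setting is lost), while the wrapped task plays the role of the same action after the patch.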

Why are the changes needed?

Without this patch, whenever `.rdd` or `.queryExecution.toRdd` is used, all SQL configs that have been set are ignored. A reproducer:

      withSQLConf(SQLConf.SUBEXPRESSION_ELIMINATION_ENABLED.key, "false") {
        val df = spark.range(2).selectExpr((0 to 5000).map(i => s"id as field_$i"): _*)
        df.createOrReplaceTempView("spark64kb")
        val data = spark.sql("select * from spark64kb limit 10")
        // Subexpression elimination is used here, even though it should have been disabled
        data.describe()
      }

Does this PR introduce any user-facing change?

When a user calls `.queryExecution.toRdd`, a `SQLExecutionRDD` is returned that wraps the RDD produced by executing the plan. When `.rdd` is used, an additional `SQLExecutionRDD` is present in the RDD hierarchy.

How was this patch tested?

added UT

mgaido91 · 2019-09-09T15:48:40Z

cc @cloud-fan @hvanhovell @maropu @viirya

I haven't yet added the flag requested by @hvanhovell because I have some doubts about it, as I expressed in the other PR. I'll add it if you think it is needed.

SparkQA · 2019-09-09T15:57:57Z

Test build #110354 has finished for PR 25734 at commit e9d22e1.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class SQLExecutionRDD(

SparkQA · 2019-09-09T20:23:04Z

Test build #110356 has finished for PR 25734 at commit d145b14.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-09-09T22:55:34Z

Hi, All.
Shall we hold off on this? The original PR seems to break the JDK11 build.

https://github.com/apache/spark/actions

dongjoon-hyun · 2019-09-09T23:17:39Z

#25738 is ready.

dongjoon-hyun · 2019-09-10T03:33:31Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecutionRDD.scala

+  private val sqlConfigs = conf.getAllConfs
+  private lazy val sqlConfExecutorSide = {
+    val props = new Properties()
+    props.putAll(sqlConfigs.asJava)


Hi, @mgaido91 .
Although Apache Spark branch-2.4 doesn't officially support JDK11, some downstreams support JDK11 on branch-2.4. Could you include #25738 to match the master branch?
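
The diff quoted in this comment fills a `java.util.Properties` by calling `putAll` on a converted Scala map. One way to avoid depending on how that call resolves across JDK versions is to copy the entries one at a time. The sketch below is only an illustration of that alternative; `PropsCopy` and `toProperties` are hypothetical names, and it is not claimed to be the exact change made in #25738.

```scala
import java.util.Properties

object PropsCopy {
  // Copies each entry with setProperty, so putAll is never called.
  def toProperties(confs: Map[String, String]): Properties = {
    val props = new Properties()
    confs.foreach { case (k, v) => props.setProperty(k, v) }
    props
  }
}
```

Used as `PropsCopy.toProperties(conf.getAllConfs)`, this would produce the same key/value content as the `putAll` variant for a `Map[String, String]`.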

mgaido91 · 2019-09-10T08:14:43Z

Hi all, I am going to update this once the follow-ups are done. Meanwhile, do you all agree on adding the flag proposed by @hvanhovell? If so, I'll add it when I update this PR. Thanks.

dongjoon-hyun · 2019-09-10T22:23:14Z

Yes. I'm +1 for @hvanhovell's advice on the flag. Please add documentation for that flag, too.
cc @gatorsmile

maropu · 2019-09-10T22:47:32Z

+1 for the flag to keep the legacy behaviour. Is the document @dongjoon-hyun suggested the migration guide? Yeah, we need to update that, too.

mgaido91 · 2019-09-11T08:22:27Z

ok thanks, but then shouldn't we add a note in the migration guide for 3.0 too?

SparkQA · 2019-09-11T13:03:15Z

Test build #110471 has finished for PR 25734 at commit 43bd021.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-09-12T04:28:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+    buildConf("spark.sql.legacy.rdd.applyConf")
+      .internal()
+      .doc("When false, SQL configurations are disregarded when operations on a RDD derived from" +
+        " a dataframe are executed. This is the (buggy) behavior up to 2.4.3.")


up to 2.4.4 ?

dongjoon-hyun · 2019-09-12T04:32:18Z

docs/sql-migration-guide-upgrade.md

+
+ - Starting from 2.4.5, SQL configurations are effective also when a Dataset is converted to an RDD and its
+   plan is executed due to action on the derived RDD. The previous buggy behavior can be restored setting
+   `spark.sql.legacy.rdd.applyConf` to `false`.


Hi, @hvanhovell, @cloud-fan and @gatorsmile. As @mgaido91 asked here, this PR will add this flag only to branch-2.4. In this case, is it okay for this config to be added and deprecated in 2.4.5 and removed in 3.0.0?

For me, we don't need to add this configuration to master. Did I understand correctly?

I don't think we need to add the flag to master. We are allowed to break behavior there.
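
For readers on branch-2.4 who rely on the old behavior, the flag named in the diff above could be set when the session is built. This is a minimal sketch that assumes Spark 2.4.5 or later on the classpath; the object name and application name are illustrative only, and how the flag interacts with RDDs created before it is changed is not covered here.

```scala
import org.apache.spark.sql.SparkSession

object LegacyRddConfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("legacy-rdd-conf")
      // Flag name taken from the diff above; "false" opts back into the
      // previous behavior, where SQL configs are not applied to derived RDDs.
      .config("spark.sql.legacy.rdd.applyConf", "false")
      .getOrCreate()

    // An action on an RDD derived from a Dataset, the case this PR is about.
    val rows = spark.range(2).rdd.count()
    println(s"rows = $rows")

    spark.stop()
  }
}
```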

hvanhovell · 2019-09-12T08:25:14Z

docs/sql-migration-guide-upgrade.md

+## Upgrading from Spark SQL 2.4 to 2.4.5
+
+ - Starting from 2.4.5, SQL configurations are effective also when a Dataset is converted to an RDD and its
+   plan is executed due to action on the derived RDD. The previous buggy behavior can be restored setting


I'd personally refrain from using the term buggy. Please explain what the previous behavior was.

sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecutionRDD.scala

SparkQA · 2019-09-12T11:40:52Z

Test build #110506 has finished for PR 25734 at commit a5eb604.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-09-12T14:43:24Z

Test build #110512 has finished for PR 25734 at commit 1b145e2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-11-10T18:31:55Z

Retest this please.

SparkQA · 2019-11-10T21:46:29Z

Test build #113545 has finished for PR 25734 at commit 1b145e2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

+1, LGTM. Merged to branch-2.4.
Sorry for being late, @mgaido91 .

cc @gatorsmile

Closes #25734 from mgaido91/SPARK-28939_2.4.

Authored-by: Marco Gaido <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

[SPARK-28939][SQL][BACKPORT-2.4] Propagate SQLConf for plans executed by toRdd

e9d22e1

fix

d145b14

dongjoon-hyun added the SQL label Sep 9, 2019

dongjoon-hyun reviewed Sep 10, 2019

dongjoon-hyun changed the title ~~[SPARK-28939][SQL][BACKPORT-2.4] Propagate SQLConf for plans executed…~~ [SPARK-28939][SQL][2.4] Propagate SQLConf for plans executed… Sep 10, 2019

dongjoon-hyun changed the title ~~[SPARK-28939][SQL][2.4] Propagate SQLConf for plans executed…~~ [SPARK-28939][SQL][2.4] Propagate SQLConf for plans executed by toRdd Sep 10, 2019

dongjoon-hyun mentioned this pull request Sep 10, 2019

[SPARK-28939][SQL][FOLLOWUP] Fix JDK11 compilation due to ambiguous reference #25738

Closed

introduce config

43bd021

dongjoon-hyun reviewed Sep 12, 2019

fix spark version, add deprecation

a5eb604

hvanhovell reviewed Sep 12, 2019

sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecutionRDD.scala

Update sql-migration-guide-upgrade.md

1b145e2

dongjoon-hyun approved these changes Nov 10, 2019

dongjoon-hyun closed this Nov 10, 2019
