
[SPARK-52065][SQL] Produce another plan tree with output columns (name, data type, nullability) in plan change logging #50852


Open — wants to merge 4 commits into `master`
Conversation

HeartSaVioR (Contributor)

What changes were proposed in this pull request?

We propose to add another tree string that focuses on the output columns of each plan node, including their names, data types, and nullability. It is shown in the plan change logger alongside the existing tree string of the plan.

For example, here is an excerpt from the plan change log:

=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates ===
!'Project [max(a#41) AS max(a)#2843]                                                                                                                                                                                                                                                            Aggregate [max(a#41) AS max(a)#2843]
 +- SubqueryAlias testdata3                                                                                                                                                                                                                                                                     +- SubqueryAlias testdata3
    +- View (`testData3`, [a#41, b#42])                                                                                                                                                                                                                                                            +- View (`testData3`, [a#41, b#42])
       +- SerializeFromObject [invoke(knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).a()) AS a#41, unwrapoption(IntegerType, invoke(knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).b())) AS b#42]         +- SerializeFromObject [invoke(knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).a()) AS a#41, unwrapoption(IntegerType, invoke(knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).b())) AS b#42]
          +- ExternalRDD [obj#38]                                                                                                                                                                                                                                                                        +- ExternalRDD [obj#38]

Output Information:
Aggregate <output=max(a)#2843[nullable=true]>
+- SubqueryAlias <output=a#41[nullable=false], b#42[nullable=true]>
   +- View <output=a#41[nullable=false], b#42[nullable=true]>
      +- SerializeFromObject <output=a#41[nullable=false], b#42[nullable=true]>
         +- ExternalRDD <output=obj#38[nullable=true]>
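As a rough illustration of the layout above, such an "Output Information" tree string could be rendered as sketched below. This is a minimal Python sketch for illustration only; `Node`, its fields, and the rendering code are stand-ins, not Spark's actual Catalyst implementation.

```python
# Illustrative sketch: render one line per plan node in the form
#   Name <output=attr[nullable=...], ...>
# with "+- " connectors indented by depth, mimicking the example above.
class Node:
    def __init__(self, name, output, children=()):
        self.name = name
        self.output = output            # list of (attribute, nullable) pairs
        self.children = list(children)

def output_info(node, depth=0):
    """Return the tree as a list of lines, one per node."""
    prefix = "" if depth == 0 else "   " * (depth - 1) + "+- "
    cols = ", ".join(f"{a}[nullable={str(n).lower()}]" for a, n in node.output)
    lines = [f"{prefix}{node.name} <output={cols}>"]
    for child in node.children:
        lines += output_info(child, depth + 1)
    return lines
```

For instance, a two-node tree of `View` over `ExternalRDD` would render the last two lines of the example above.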

In some cases it is not even feasible to evaluate the output of a node (e.g. a Project with a star expression). In that case we simply emit <output='Unresolved'>, since the failure is mostly due to UnresolvedException.

For example,

=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
!'Project [unresolvedalias('max(a#41))]                                                                                                                                                                                                                                                         'Project [unresolvedalias(max(a#41))]
 +- SubqueryAlias testdata3                                                                                                                                                                                                                                                                     +- SubqueryAlias testdata3
    +- View (`testData3`, [a#41, b#42])                                                                                                                                                                                                                                                            +- View (`testData3`, [a#41, b#42])
       +- SerializeFromObject [invoke(knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).a()) AS a#41, unwrapoption(IntegerType, invoke(knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).b())) AS b#42]         +- SerializeFromObject [invoke(knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).a()) AS a#41, unwrapoption(IntegerType, invoke(knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).b())) AS b#42]
          +- ExternalRDD [obj#38]                                                                                                                                                                                                                                                                        +- ExternalRDD [obj#38]

Output Information:
Project <output='Unresolved'>
+- SubqueryAlias <output=a#41[nullable=false], b#42[nullable=true]>
   +- View <output=a#41[nullable=false], b#42[nullable=true]>
      +- SerializeFromObject <output=a#41[nullable=false], b#42[nullable=true]>
         +- ExternalRDD <output=obj#38[nullable=true]>
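The fallback shown above amounts to guarding the output evaluation and substituting a placeholder on failure. The sketch below is illustrative Python; `UnresolvedException` and `UnresolvedProject` are hypothetical stand-ins, not the real Catalyst classes.

```python
# Illustrative sketch: fall back to 'Unresolved' when a node's output
# cannot be evaluated (e.g. a star expression not yet expanded).
class UnresolvedException(Exception):
    pass

def output_columns(node):
    """Format the node's output columns, or 'Unresolved' if evaluation fails."""
    try:
        return ", ".join(
            f"{a}[nullable={str(n).lower()}]" for a, n in node.output()
        )
    except UnresolvedException:
        return "'Unresolved'"

class UnresolvedProject:
    def output(self):
        # stand-in for an unresolved plan node raising on output access
        raise UnresolvedException("invalid call to output on unresolved plan")
```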

Why are the changes needed?

We recently ran into a very tricky issue (a nullability change broke a stateful operator) that required custom debug logging on top of the plan change logging. The debugging was hard because the existing tree string of the plan gives no visibility into the output columns, especially their nullability.

Ideally we wouldn't maintain two different tree strings and would instead fix the existing one, but the current tree string is often so long that we had to restrict the number of fields it shows. Hence we think it is better to produce a separate tree string for the output information.

Does this PR introduce any user-facing change?

Yes, but only for users who set the plan change logger's log level (via SQL config) to a level that is visible under their log4j2 configuration, so the change is effectively opt-in rather than opt-out.
(If we changed the existing tree string instead, it would affect many more places.)
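For reference, the opt-in path goes through the existing `spark.sql.planChangeLog.level` SQL config; a sketch of enabling it is below (the chosen level value and the log4j2 threshold are up to the user's setup):

```
# spark-defaults.conf (sketch): raise the plan change logger's level so its
# messages clear the log4j2 threshold; only then does the new tree appear.
spark.sql.planChangeLog.level=WARN
```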

How was this patch tested?

Modified an existing unit test to cover the change.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label May 10, 2025
@HeartSaVioR (Contributor, Author)

cc. @cloud-fan PTAL, thanks!
