
[SPARK-52065][SQL] Produce another plan tree with output columns (name, data type, nullability) in plan change logging #50852


Open — wants to merge 4 commits into `master`
Conversation

HeartSaVioR (Contributor)

What changes were proposed in this pull request?

We propose to add another tree string that focuses on the output columns of each plan node, including their names, data types, and nullability. It is shown in the plan change logger alongside the existing tree string of the plan.

For example, here is an excerpt from the plan change log:

=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates ===
!'Project [max(a#41) AS max(a)#2843]                                                                                                                                                                                                                                                            Aggregate [max(a#41) AS max(a)#2843]
 +- SubqueryAlias testdata3                                                                                                                                                                                                                                                                     +- SubqueryAlias testdata3
    +- View (`testData3`, [a#41, b#42])                                                                                                                                                                                                                                                            +- View (`testData3`, [a#41, b#42])
       +- SerializeFromObject [invoke(knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).a()) AS a#41, unwrapoption(IntegerType, invoke(knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).b())) AS b#42]         +- SerializeFromObject [invoke(knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).a()) AS a#41, unwrapoption(IntegerType, invoke(knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).b())) AS b#42]
          +- ExternalRDD [obj#38]                                                                                                                                                                                                                                                                        +- ExternalRDD [obj#38]

Output Information:
Aggregate <output=max(a)#2843[nullable=true]>
+- SubqueryAlias <output=a#41[nullable=false], b#42[nullable=true]>
   +- View <output=a#41[nullable=false], b#42[nullable=true]>
      +- SerializeFromObject <output=a#41[nullable=false], b#42[nullable=true]>
         +- ExternalRDD <output=obj#38[nullable=true]>
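As a rough illustration of the layout above, such an "Output Information" tree string could be rendered as sketched below. This is a minimal Python sketch for illustration only; `Node`, its fields, and the rendering code are stand-ins, not Spark's actual Catalyst implementation.

```python
# Illustrative sketch: render one line per plan node in the form
#   Name <output=attr[nullable=...], ...>
# with "+- " connectors indented by depth, mimicking the example above.
class Node:
    def __init__(self, name, output, children=()):
        self.name = name
        self.output = output            # list of (attribute, nullable) pairs
        self.children = list(children)

def output_info(node, depth=0):
    """Return the tree as a list of lines, one per node."""
    prefix = "" if depth == 0 else "   " * (depth - 1) + "+- "
    cols = ", ".join(f"{a}[nullable={str(n).lower()}]" for a, n in node.output)
    lines = [f"{prefix}{node.name} <output={cols}>"]
    for child in node.children:
        lines += output_info(child, depth + 1)
    return lines
```

For instance, a two-node tree of `View` over `ExternalRDD` would render the last two lines of the example above.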

In some cases it is not even feasible to evaluate the output of a node (e.g. a Project with a star expression). In that case we simply emit <output='Unresolved'>, since the failure is mostly due to UnresolvedException.

For example,

=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
!'Project [unresolvedalias('max(a#41))]                                                                                                                                                                                                                                                         'Project [unresolvedalias(max(a#41))]
 +- SubqueryAlias testdata3                                                                                                                                                                                                                                                                     +- SubqueryAlias testdata3
    +- View (`testData3`, [a#41, b#42])                                                                                                                                                                                                                                                            +- View (`testData3`, [a#41, b#42])
       +- SerializeFromObject [invoke(knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).a()) AS a#41, unwrapoption(IntegerType, invoke(knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).b())) AS b#42]         +- SerializeFromObject [invoke(knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).a()) AS a#41, unwrapoption(IntegerType, invoke(knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData3, true])).b())) AS b#42]
          +- ExternalRDD [obj#38]                                                                                                                                                                                                                                                                        +- ExternalRDD [obj#38]

Output Information:
Project <output='Unresolved'>
+- SubqueryAlias <output=a#41[nullable=false], b#42[nullable=true]>
   +- View <output=a#41[nullable=false], b#42[nullable=true]>
      +- SerializeFromObject <output=a#41[nullable=false], b#42[nullable=true]>
         +- ExternalRDD <output=obj#38[nullable=true]>
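The fallback shown above amounts to guarding the output evaluation and substituting a placeholder on failure. The sketch below is illustrative Python; `UnresolvedException` and `UnresolvedProject` are hypothetical stand-ins, not the real Catalyst classes.

```python
# Illustrative sketch: fall back to 'Unresolved' when a node's output
# cannot be evaluated (e.g. a star expression not yet expanded).
class UnresolvedException(Exception):
    pass

def output_columns(node):
    """Format the node's output columns, or 'Unresolved' if evaluation fails."""
    try:
        return ", ".join(
            f"{a}[nullable={str(n).lower()}]" for a, n in node.output()
        )
    except UnresolvedException:
        return "'Unresolved'"

class UnresolvedProject:
    def output(self):
        # stand-in for an unresolved plan node raising on output access
        raise UnresolvedException("invalid call to output on unresolved plan")
```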

Why are the changes needed?

We recently ran into a very tricky issue (a nullability change broke a stateful operator) that required custom debug logging on top of the plan change logging. The debugging was hard because the existing tree string of the plan gives no visibility into the output columns, especially their nullability.

Ideally we wouldn't maintain two different tree strings and would instead fix the existing one, but the current tree string is often so long that we had to restrict the number of fields it shows. Hence we think it is better to produce a separate tree string for the output information.

Does this PR introduce any user-facing change?

Yes, but only for users who set the plan change logger's log level (via SQL config) to a level that is visible under their log4j2 configuration, so the change is effectively opt-in rather than opt-out.
(If we changed the existing tree string instead, it would affect many more places.)
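For reference, the opt-in path goes through the existing `spark.sql.planChangeLog.level` SQL config; a sketch of enabling it is below (the chosen level value and the log4j2 threshold are up to the user's setup):

```
# spark-defaults.conf (sketch): raise the plan change logger's level so its
# messages clear the log4j2 threshold; only then does the new tree appear.
spark.sql.planChangeLog.level=WARN
```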

How was this patch tested?

Modified an existing unit test to cover the change.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label May 10, 2025
@HeartSaVioR (Contributor, Author)

cc. @cloud-fan PTAL, thanks!
