Fix PreprocessTableMerge to include new columns from WHEN MATCHED clauses

Delta's `MERGE INTO` fails when there are multiple `UPDATE` clauses and one is `UPDATE SET *` with schema evolution. Specifically, if `UPDATE SET *` is used to merge a source whose columns are a superset of the target's, and an additional `UPDATE SET` clause is present (which must operate on a target column), then the merge fails because some source-only columns cannot be resolved.

The example below fails:

```sql
SET spark.databricks.delta.schema.autoMerge.enabled=true;

-- tgt1 has columns: [a]
-- s has columns: [a, b]
WITH s(a, b) AS (SELECT * FROM VALUES (1, 's_b'))
MERGE INTO zvs.tgt1 t
USING s
ON t.a = s.a
WHEN MATCHED AND t.a < 1 THEN UPDATE SET t.a = 0
WHEN MATCHED THEN UPDATE SET *

-- output:
-- Error in SQL statement: AnalysisException: Resolved attribute(s) b#247210 missing from ...
```

This case seems to have been missed when implementing `processMatched` in `PreprocessTableMerge`. Specifically, other `WHEN MATCHED` clauses can introduce new columns that must be filled in with `null`. Currently, only `WHEN NOT MATCHED` clauses are considered.

This is best shown with the code flow in the example above:

- `processMatched` is mapped over (clause1 [SET t.a=0], clause2 [SET *])
- resolvedActions:
  - clause1 resolvedActions are `[a=0]`
  - clause2 resolvedActions are `[a=a, b=b]` => causes schema evolution: target now has schema `[a, b]`.

From here on we consider only clause1, since it is the one that causes the failure. clause2 matters only in that it triggers schema evolution, so that finalSchema is `[a, b]`. Since there are no `WHEN NOT MATCHED` clauses, there are no `newColumns`. The only `UpdateOperation` used in `generateUpdateExpressions` is therefore `[a=0]`. This means that `generateUpdateExpressions` is called with `targetCols` `[a, b]` and only `updateOp` `[a=0]`. Column `b` (not present in the target) is passed through, and an unresolvable attribute ends up in the final plan.

The fix is simply to consider new columns from other `WHEN MATCHED` clauses as well as from `WHEN NOT MATCHED` clauses.
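
The code flow described above can be illustrated with a small Python model. This is only a sketch under stated assumptions: it is not Delta's Scala implementation, and every name in it (`new_columns`, `update_expressions`, `process_matched`, the string expressions) is hypothetical and chosen to mirror the terms used in this message. It shows why collecting new columns only from `WHEN NOT MATCHED` clauses leaves column `b` unresolvable for clause1, and how also collecting them from `WHEN MATCHED` clauses fills it with `NULL`.

```python
# Simplified, hypothetical model of the flow described in this commit message.
# It does not reproduce Delta's actual PreprocessTableMerge code.

def new_columns(clauses, target_cols):
    """Columns assigned by any of the given clauses that the target lacks."""
    found = []
    for ops in clauses:
        for col in ops:
            if col not in target_cols and col not in found:
                found.append(col)
    return found


def update_expressions(final_schema, target_cols, update_ops):
    """One output expression per column of the evolved (final) schema."""
    exprs = []
    for col in final_schema:
        if col in update_ops:
            exprs.append(update_ops[col])   # explicitly assigned by the clause
        elif col in target_cols:
            exprs.append(f"t.{col}")        # keep the existing target value
        else:
            # Models the "Resolved attribute(s) ... missing" failure.
            raise LookupError(f"unresolvable attribute: {col}")
    return exprs


def process_matched(clause_ops, final_schema, target_cols, new_cols):
    """Pad the clause with NULL for new columns it does not assign."""
    ops = dict(clause_ops)
    for col in new_cols:
        ops.setdefault(col, "NULL")
    return update_expressions(final_schema, target_cols, ops)


target_cols = ["a"]
final_schema = ["a", "b"]                   # after schema evolution
matched = [{"a": "0"},                      # clause1: UPDATE SET t.a = 0
           {"a": "s.a", "b": "s.b"}]        # clause2: UPDATE SET *
not_matched = []                            # no WHEN NOT MATCHED clauses

# Before the fix: new columns come only from WHEN NOT MATCHED clauses.
cols_before_fix = new_columns(not_matched, target_cols)

# After the fix: WHEN MATCHED clauses contribute new columns as well.
cols_after_fix = new_columns(not_matched + matched, target_cols)
```

In this model, calling `process_matched(matched[0], final_schema, target_cols, cols_before_fix)` raises `LookupError` for `b`, while passing `cols_after_fix` yields an expression list in which `b` is set to `NULL`.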
Added a new unit test validating correct behavior with multiple `UPDATE` clauses where one is `UPDATE SET *`.

GitOrigin-RevId: f2a849c1fc5589a26512e0a7f1cc5adc8e8eb7f1