-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
feat: Adds MERGE INTO transform #109
feat: Adds MERGE INTO transform #109
Conversation
b7b996b
to
3d2ce14
Compare
This commit adds the ability to convert snowflakes [MERGE INTO](https://docs.snowflake.com/en/sql-reference/sql/merge) functionality into a functional equivalent implementation in duckdb. To do this we need to break apart the WHEN [NOT] MATCHED syntax into separate statements to be executed indepedently. This commit only adds the transform, there is more refactoring required in fakes.py in order to handle a transform that transforms a single expression into multiple expressions.
3d2ce14
to
c842900
Compare
Thanks for starting this! It's a tricky feature. I hope you don't mind but I've pushed 40ada94 as a test case to cover. It comes from the Snowflake MERGE docs. It provides good coverage with a mix of delete, update and insert branches with some conditionals: MERGE INTO t1 USING t2 ON t1.t1Key = t2.t2Key
WHEN MATCHED AND t2.marked = 1 THEN DELETE
WHEN MATCHED AND t2.isNewStatus = 1 THEN UPDATE SET val = t2.newVal, status = t2.newStatus
WHEN MATCHED THEN UPDATE SET val = t2.newVal
WHEN NOT MATCHED THEN INSERT (t1Key, val, status) VALUES (t2.t2Key, t2.newVal, t2.newStatus); It also demonstrate that MERGE needs to:
One way to solve these is to calculate the merge operations up front using the original table and store then as a MERGE_OP column in a temporary table. This ensures the branch conditions work with the state of the original table, not the table after it has already been modified by a DML statement run for a previous branch. So something like this for the test case above: CREATE OR REPLACE TEMPORARY TABLE merge_candidates AS
SELECT
t1.t1Key AS t1_t1Key,
t1.val AS t1_val,
t1.status AS t1_status,
t2.t2Key AS t2_t2Key,
t2.marked AS t2_marked,
t2.isNewStatus AS t2_isNewStatus,
t2.newVal AS t2_newVal,
t2.newStatus AS t2_newStatus,
CASE
WHEN t1.t1Key = t2.t2Key AND t2.marked = 1 THEN 1 -- DELETE
WHEN t1.t1Key = t2.t2Key and t2.isNewStatus = 1 THEN 2 -- UPDATE with conditional
WHEN t1.t1Key = t2.t2Key THEN 3 -- UPDATE the rest
WHEN t1.t1Key IS NULL THEN 4 -- INSERT remaining matches
ELSE 0 -- IGNORE
END AS MERGE_OP
FROM t1
FULL OUTER JOIN t2 ON t1.t1Key = t2.t2Key
WHERE MERGE_OP > 0; This can then be used to drive the DML statements for each branch, and provide the result set, like this: DELETE FROM t1
USING merge_candidates AS mc
WHERE t1.t1Key = mc.t1_t1key
AND mc.merge_op = 1;
UPDATE t1
SET val = mc.t2_newVal,
status = mc.t2_newStatus
FROM merge_candidates AS mc
WHERE t1.t1Key = mc.t1_t1Key
AND mc.merge_op = 2;
UPDATE t1
SET val = mc.t2_newVal
FROM merge_candidates AS mc
WHERE t1.t1Key = mc.t1_t1Key
AND mc.merge_op = 3;
INSERT INTO t1 (t1Key, val, status)
SELECT mc.t2_t2Key, mc.t2_newVal, mc.t2_newStatus
FROM merge_candidates AS mc
WHERE mc.merge_op = 4;
SELECT COUNT_IF(merge_op in (4)) AS "number of rows inserted", COUNT_IF(merge_op in (2,3)) AS "number of rows updated", COUNT_IF(merge_op in(1)) AS "number of rows deleted"
from merge_candidates; Let me know what you think, and if this makes sense! |
Thanks for the detailed feedback, seems merge is a lot more intricate than my usecase of upsert 😂 . I'm on holiday next week so it might be awhile before you see more progress on this. Thanks for the testcase. |
* main: chore(main): release 0.9.20 (tekumara#115) fix: $$ not considered a variable fix: concurrent connection write-write conflict refactor: extract test_connect.py feat: SHOW PRIMARY KEYS for table (tekumara#114) chore(main): release 0.9.19 (tekumara#113) feat: Implements basic snowflake session variables via SET/UNSET (tekumara#111) chore(deps-dev): bump pyright from 1.1.366 to 1.1.369 (tekumara#112) chore(main): release 0.9.18 (tekumara#110)
Got basic case for merge working with a merge2 function, needs to be cleaned up, deleting original merge, fix result statement.
Marked PR as WIP while I work through getting this functionality fully running. I've got the new test case running with a second attempt but I've got lots to clean up before its ready for review. I'll mention you when its ready for review again. |
* main: fix: Only set variables for SetItem expressions (tekumara#116)
* main: feat: alter table cluster by chore: cruft update chore: bump sqlglot 25.5.1
How did you go @jsibbison-square? Do you need any help with this? |
Hi @tekumara sorry I've run short on time at the moment. This was working as far as I was concerned but I haven't had time to refactor as its a bit messy. Happy for it to be merged as is and refactor one day or you can take over if you want the feature sooner. |
* main: (38 commits) feat(server): support time & timestamp types ci(server): description - cover more types feat(server): handle snowflake ProgrammingError refactor: use rowtype for sf metadata in arrow schema feat(server): support empty result set chore(main): release 0.9.24 (tekumara#128) fix: don't require pandas at import time chore(main): release 0.9.23 (tekumara#126) feat: support conn.is_closed() refactor: extract conn, cursor, pandas_tools refactor: describe_as_rowtype uses sf_type refactor: describe_as_rowtype refactor: extract describe_as_result_metadata feat(server): support cur.description feat(server): support bool, int, float types chore(main): release 0.9.22 (tekumara#124) wip(server): add /queries/v1/abort-request fix: fetchmany supports irregular sizes fix: column types for DESCRIBE feat: describe view information_schema.* ...
* main: chore(deps): update ruff requirement from ~=0.5.1 to ~=0.6.3 (tekumara#130) chore(deps-dev): bump pyright from 1.1.374 to 1.1.378 (tekumara#133)
@tekumara I've now got a passing build, isolated merge logic into its own files and added some documentation so I can hope to recall how it works in the future. Please review 🙏 |
This reverts commit dfa68e1. because it doesn't work when the alias is for a select expression eg: tekumara#24 (comment)
Thanks so much @jsibbison-square for this significant piece of work. I've pushed some refactoring to use pure functions rather than a class, as there's no permanent state to capture, and turned your comment documenting the sql output into a test. I'm also considering whether it's worth refactoring this to have a single The main thing blocking merging this though, is it doesn't yet solve for a |
Thanks again @jsibbison-square - I've merged this and will address the above in a follow-up PR. |
I did start with one table but then went with two, I don't recall exactly why sadly. Thanks for merging in its partial support form, we have a lot of usecases that are the most basic of merges so it'll be great to start using. |
Glad to hear you'll be using this soon. I've raised #136 |
…136) Supports MERGE with - multiple join columns, that may have the same name - when the source is a subquery To do this, I've refactored the way merge works to align with the [approach suggested here](#109 (comment)), ie: - use a single `merge_candidates` temporary table to map rows to mutation operations (delete, update, insert) - separate candidate table creation, mutation operations (delete, update, insert), and counts into separate functions, rather than co-mingling this functionality, for better readability - use join columns rather than rowids, and avoid using a transaction because duckdb doesn't support nested transactions so the current implementation will error if an existing transaction is active resolves #24
🤖 I have created a release *beep* *boop* --- ## [0.9.25](v0.9.24...v0.9.25) (2024-09-16) ### Features * Adds MERGE INTO transform ([#109](#109)) ([d5e14a7](d5e14a7)) * close duckdb connection ([223f8e2](223f8e2)) * **server:** handle snowflake ProgrammingError ([9455a43](9455a43)) * **server:** support empty result set ([b967b69](b967b69)) * **server:** support FAKESNOW_DB_PATH ([af79f77](af79f77)) * **server:** support time & timestamp types ([1606a3e](1606a3e)) * support MERGE with multiple join columns and source subqueries ([#136](#136)) ([9b5a7a0](9b5a7a0)), closes [#24](#24) ### Chores * **deps-dev:** bump pyright from 1.1.374 to 1.1.378 ([#133](#133)) ([593a420](593a420)) * **deps:** update ruff requirement from ~=0.5.1 to ~=0.6.3 ([#130](#130)) ([6b37d8b](6b37d8b)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: potatobot-prime[bot] <132267321+potatobot-prime[bot]@users.noreply.github.com>
This commit adds the ability to convert snowflakes MERGE INTO functionality into a functional equivalent implementation in duckdb. To do this we need to break apart the WHEN [NOT] MATCHED syntax into separate statements to be executed independently.
This commit only adds the transform, there is more refactoring required in
fakes.py
in order to handle a transform that transforms a single expression into multiple expressions so it cannot be used yet. I have a change baking for adding it intofakes
but wanted to separate PRs.Let me know if you need more tests or anything as its quite a complicated feature.