Self serve replication SQL API and server side support #220
...java-itest/src/test/java/com/linkedin/openhouse/javaclient/OpenHouseTableOperationsTest.java
Map<String, String> config = schedule.getConfig();
for (String key : config.keySet()) {
  try {
    CronExpression.parse(config.get(key));
The cron schedule will also need to be validated against these rules:
- The schedule must run at least once every N days (N still needs to be defined based on data)
- Schedules that run more often than once per hour will not be allowed
I will do this in a separate PR.
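A minimal sketch of how the two rules above could be checked, not the validation that landed in the follow-up PR. It assumes the schedule can be exposed as a "next fire time" function; if `CronExpression` in the diff is Spring's class, its `next` method could presumably fill that role. The class name `ScheduleGapCheck`, the sample count, and the 3-day value standing in for N are all hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of the OpenHouse codebase.
final class ScheduleGapCheck {

    /**
     * Samples consecutive fire times and returns true only if every gap
     * between them lies within [minGap, maxGap].
     * nextFire maps a time to the next fire time after it, or null if none.
     */
    static boolean gapsWithin(UnaryOperator<LocalDateTime> nextFire,
                              LocalDateTime start,
                              int samples,
                              Duration minGap,
                              Duration maxGap) {
        LocalDateTime prev = nextFire.apply(start);
        if (prev == null) {
            return false; // a schedule that never fires is rejected
        }
        for (int i = 0; i < samples; i++) {
            LocalDateTime next = nextFire.apply(prev);
            if (next == null) {
                break; // no further fire times to compare
            }
            Duration gap = Duration.between(prev, next);
            if (gap.compareTo(minGap) < 0 || gap.compareTo(maxGap) > 0) {
                return false;
            }
            prev = next;
        }
        return true;
    }

    public static void main(String[] args) {
        LocalDateTime start = LocalDateTime.of(2024, 1, 1, 0, 0);
        Duration min = Duration.ofHours(1); // "less than 1 hour not allowed"
        Duration max = Duration.ofDays(3);  // placeholder for the undefined N
        // Stand-ins for real cron schedules: every 12 hours, every 30 minutes.
        System.out.println(gapsWithin(t -> t.plusHours(12), start, 10, min, max));   // true
        System.out.println(gapsWithin(t -> t.plusMinutes(30), start, 10, min, max)); // false
    }
}
```

Sampling a fixed number of fire times is a heuristic: an irregular cron expression could still have a longer or shorter gap outside the sampled window.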
Can we split API change, SQL extension and server side implementation into different PRs?
Let's get on the same page and identify future steps for the API change first.
sure, closing this PR to split up 👍
## Summary

Branched off from #220, this PR contains only the scope for SQL API support for self serve replication.

The changes include SQL API support for adding replication configs to table policies within table properties. SQL API that is supported:

```
ALTER TABLE db.testTable SET POLICY (REPLICATION=({destination:'a', interval:12h}))
```

```
ALTER TABLE db.testTable SET POLICY (REPLICATION=({destination:'a'}))
```

where `interval` is the interval at which the replication job is run and `destination` is the destination cluster. `interval` is an optional parameter where users can define an interval from 12 to 72 as `12h/H`, `24h/H`, etc. If `interval` is not given, the replication schedule will be set up as daily (24h intervals).

We also allow a list input with multiple clusters to enable multi-cluster table replication.

```
ALTER TABLE db.testTable SET POLICY (REPLICATION=({destination:'a', interval:12H}, {destination:'aa', interval:12h}))
```

**Future Scope:** Add validations to check that the destination cluster != source cluster, and that the replication interval follows rules defined for data freshness and compliance.

Separate PR for server-side implementation: #227, which will contain validation for SQL string input and cron schedule.

## Changes

- [x] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [x] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

## Testing Done

- [x] Manually Tested on local docker setup. Please include commands ran, and their output.
- [x] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Added unit tests. Ran the following commands on local docker:

```
scala> spark.sql("alter table u_tableowner.test_table set policy (replication=({destination:'WAR'}))").show(false)
ANTLR Tool version 4.7.1 used for code generation does not match the current runtime version 4.8
ANTLR Tool version 4.7.1 used for code generation does not match the current runtime version 4.8
++
||
++
++
```

```
scala> spark.sql("alter table u_tableowner.test_table set policy (replication=({destination:'WAR', interval:12H}))").show(false)
++
||
++
++
```

```
scala> spark.sql("alter table u_tableowner.test_table set policy (replication=({interval:'12H'}))").show(false)
com.linkedin.openhouse.spark.sql.catalyst.parser.extensions.OpenhouseParseException: mismatched input 'interval' expecting {'.', 'SET'}; line 1 pos 62
```

```
scala> spark.sql("alter table u_tableowner.test_table set policy (replication=({destination:'A', interval:12d}))").show(false)
com.linkedin.openhouse.spark.sql.catalyst.parser.extensions.OpenhouseParseException: mismatched input '12d' expecting RETENTION_HOUR; line 1 pos 84
```

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [x] Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.
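The interval rules in the description above (optional, hour-based, `h` or `H` suffix, 24h default) can be illustrated with a small sketch. This is not the ANTLR grammar or visitor code from the PR; `ReplicationIntervalSketch` and `resolveHours` are hypothetical names, and enforcing the 12 to 72 bound at this layer is an assumption, since the description defers validation to #227.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not the parser shipped in the PR.
final class ReplicationIntervalSketch {

    // Digits followed by 'h' or 'H', e.g. "12h" or "24H".
    private static final Pattern HOURS = Pattern.compile("(\\d{1,3})[hH]");

    // The description states that a missing interval means a daily schedule.
    static final int DEFAULT_HOURS = 24;

    /** Returns the interval in hours; a null input selects the default. */
    static int resolveHours(String interval) {
        if (interval == null) {
            return DEFAULT_HOURS;
        }
        Matcher m = HOURS.matcher(interval);
        if (!m.matches()) {
            throw new IllegalArgumentException("Expected <N>h or <N>H, got: " + interval);
        }
        int hours = Integer.parseInt(m.group(1));
        // Assumption: the 12..72 range is checked here rather than server-side.
        if (hours < 12 || hours > 72) {
            throw new IllegalArgumentException("Interval must be between 12 and 72 hours: " + interval);
        }
        return hours;
    }

    public static void main(String[] args) {
        System.out.println(resolveHours(null));  // 24
        System.out.println(resolveHours("12h")); // 12
        System.out.println(resolveHours("72H")); // 72
    }
}
```

An input such as `12d` is rejected by the pattern, which loosely mirrors the `mismatched input '12d'` parse error in the test output above.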
## Summary

Branched off from #220, this PR adds the server side implementation for the self serve replication API. Separate PR for SQL level changes can be found here: #226.

This PR adds validations for the interval and destination cluster parameters and stores the replication config as part of table policies in table properties.

Validations on parameters:
- Destination cluster cannot be the same as the source cluster of the table.
- The interval parameter, if provided, must be in the format `<X>H` or `<X>D`, where the hourly input can be `12H` and daily inputs can be `1D` to `3D`.

This PR doesn't include the changes to generate the cron schedule from the interval input; those will be made in a separate PR.

## Changes

- [x] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

## Testing Done

- [x] Manually Tested on local docker setup. Please include commands ran, and their output.
- [x] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in production. Please explain.

Added unit testing. Tested with local docker server:

Successful POST to `http://localhost:8000/v1/databases/u_tableowner/tables` with parameters:

```
{
  "tableId": "test_table",
  "databaseId": "u_tableowner",
  "baseTableVersion": "INITIAL_VERSION",
  "clusterId": "LocalHadoopCluster",
  "schema": "{\"type\": \"struct\", \"fields\": [{\"id\": 1,\"required\": true,\"name\": \"id\",\"type\": \"string\"},{\"id\": 2,\"required\": true,\"name\": \"name\",\"type\": \"string\"},{\"id\": 3,\"required\": true,\"name\": \"ts\",\"type\": \"timestamp\"}]}",
  "tableProperties": {
    "key": "value"
  },
  "policies": {
    "sharingEnabled": "true",
    "replication": {
      "config": [
        {
          "destination": "LocalHadoopClusterA",
          "interval": "12H"
        }
      ]
    }
  }
}
```

Successful POST to `http://localhost:8000/v1/databases/u_tableowner/tables` with parameters:

```
{
  "tableId": "test_table",
  "databaseId": "u_tableowner",
  "baseTableVersion": "INITIAL_VERSION",
  "clusterId": "LocalHadoopCluster",
  "schema": "{\"type\": \"struct\", \"fields\": [{\"id\": 1,\"required\": true,\"name\": \"id\",\"type\": \"string\"},{\"id\": 2,\"required\": true,\"name\": \"name\",\"type\": \"string\"},{\"id\": 3,\"required\": true,\"name\": \"ts\",\"type\": \"timestamp\"}]}",
  "tableProperties": {
    "key": "value"
  },
  "policies": {
    "sharingEnabled": "true",
    "replication": {
      "config": [
        {
          "destination": "LocalHadoopClusterA",
          "interval": "1D"
        }
      ]
    }
  }
}
```

Using `interval: 24H` gives the following error:

```
{
  "status": "BAD_REQUEST",
  "error": "Bad Request",
  "message": " : Replication interval for the table LocalHadoopCluster.u_tableowner.test_table1 can either be 12 hours or daily for up to 3 days",
  "stacktrace": null,
  "cause": "Not Available"
}
```

Trying to set the destination cluster as the source cluster gives the following error:

```
{
  "status": "BAD_REQUEST",
  "error": "Bad Request",
  "message": " : Replication destination cluster for the table LocalHadoopCluster.u_tableowner.test_table1 must be different from the source cluster",
  "stacktrace": null,
  "cause": "Not Available"
}
```

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.
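The two server-side checks described above can be sketched as follows. This is an illustration under stated assumptions, not the validator added in the PR: the class and method names are hypothetical, the error strings are shortened versions of the messages in the test output, and both the case-insensitive matching and the "destination is required" rule are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration, not the OpenHouse server-side validator.
final class ReplicationConfigValidatorSketch {

    // Allowed intervals per the description: 12H, or 1D through 3D.
    // Assumption: lower-case suffixes are accepted as well.
    private static final Pattern INTERVAL =
        Pattern.compile("12H|[1-3]D", Pattern.CASE_INSENSITIVE);

    /** Returns a list of validation errors; an empty list means the config is valid. */
    static List<String> validate(String sourceCluster, String destination, String interval) {
        List<String> errors = new ArrayList<>();
        if (destination == null || destination.isEmpty()) {
            // Assumption: a missing destination is rejected.
            errors.add("Replication destination cluster is required");
        } else if (destination.equalsIgnoreCase(sourceCluster)) {
            errors.add("Replication destination cluster must be different from the source cluster");
        }
        // The interval is optional, so only a provided value is checked.
        if (interval != null && !INTERVAL.matcher(interval).matches()) {
            errors.add("Replication interval can either be 12 hours or daily for up to 3 days");
        }
        return errors;
    }

    public static void main(String[] args) {
        // Mirrors the manual tests above: 12H and 1D pass, 24H and a
        // destination equal to the source are rejected.
        System.out.println(validate("LocalHadoopCluster", "LocalHadoopClusterA", "12H"));
        System.out.println(validate("LocalHadoopCluster", "LocalHadoopClusterA", "24H"));
        System.out.println(validate("LocalHadoopCluster", "LocalHadoopCluster", "1D"));
    }
}
```

Collecting errors into a list rather than throwing on the first failure is a design choice of this sketch; the responses above each report a single message, so the real implementation may fail fast instead.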
Summary
This PR builds off of @rohitkum2506's PR #185 to add support for the SQL API for self serve replication.
Self serve replication allows users to add their own replication config to the table policies, specifying a destination cluster and a cron schedule for cross-cluster table replication. The replication job can then be triggered from the replication config stored in table properties, which removes the manual step of users opening a ticket to request table replication.
The scope of this PR is to add the SQL API and server-side support for adding replication configs to table policies within table properties. SQL API that is supported:
Future Scope:
Add validations that keep the replication schedule within an allowed range and consistent with the rules defined for data freshness and compliance.
Changes
For all the boxes checked, please include additional details of the changes made in this pull request.
Testing Done
For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.
Added unit tests.
Local docker testing:
This will store the replication config as part of table properties:
Re-running with:
Additional Information
For all the boxes checked, include additional details of the changes made in this pull request.