online replication checksum #1097
Comments
I think the best way to implement this is to use FLUSH TABLES WITH READ LOCK on the source mysql database and then stop replication at the correct GTID in the DM binlog player. It's a bit fiddly but you can use a process that goes something like this:
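The concrete steps are not spelled out in this comment, but the lock-and-capture flow it describes can be sketched roughly as follows. This is a hypothetical outline, not DM's actual API; the three callbacks stand in for running FTWRL, reading the executed GTID set, and unlocking on the source MySQL.

```python
# Hypothetical sketch of the FTWRL-based consistent-point capture described
# above. The parameters are injected callbacks (assumptions, not a real DM
# API), so the control flow can be followed without a live database.

def capture_consistent_point(lock_tables, read_gtid_executed, unlock_tables):
    """Hold FLUSH TABLES WITH READ LOCK just long enough to record the
    source GTID set; checksumming happens after the lock is released."""
    lock_tables()                        # FLUSH TABLES WITH READ LOCK
    try:
        gtid_set = read_gtid_executed()  # e.g. SELECT @@GLOBAL.gtid_executed
    finally:
        unlock_tables()                  # UNLOCK TABLES -- keep the outage brief
    return gtid_set
```

DM would then be told to pause replication once it reaches the recorded GTID set, and both sides can be checksummed at that consistent point.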
This is what Vitess does to implement checksumming during a shard split.
@tirsen we think your above proposal is really good! And we only have a few details to add:
Ah yes good point. Btw you will also need to have an option to use
Yeah, ideally we would increase the GC life time if necessary.
I think each shard needs to be checksummed separately? With a separate GTID/transaction timestamp, separate checksum process and separate checksumming connection.
Yes.
Yes. You need to run this during low traffic so that FTWRL completes for all tables. There will be a brief outage during FTWRL so that's not something you want during peak traffic.
Not sure what you mean here. :-) You mean that we can add a configurable
when MySQL-1-GTID-A == TiDB-TSO-A, there are rows in TiDB like:
but when MySQL-2-GTID-B == TiDB-TSO-B, there are rows in TiDB like:
it's a bit hard to check data consistency for both MySQL-1-GTID-A and MySQL-2-GTID-B by checksum, because there isn't a clear consistent point covering both, so we may need to block replication for MySQL-1-GTID-A until MySQL-2-GTID-B, with rows in TiDB like:
then we can calculate checksums for MySQL-1 and MySQL-2 separately and XOR them (or combine them by some other method) to compare with the checksum calculated in TiDB.
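The XOR-combination idea can be sketched as below. Names and the CRC32 per-row digest are illustrative assumptions; the key property is that XOR is order-independent, so per-shard checksums taken at different consistent points can be merged and compared against a checksum computed the same way over the merged TiDB table.

```python
# Sketch of combining per-shard checksums with XOR, as suggested above.
# Each shard is checksummed at its own consistent point (GTID/TSO pair).
import zlib

def row_checksum(row):
    # Deterministic per-row digest; CRC32 is used here only for illustration.
    return zlib.crc32(repr(row).encode())

def table_checksum(rows):
    acc = 0
    for row in rows:
        acc ^= row_checksum(row)
    return acc

def merged_checksum(shard_row_sets):
    # XOR of the per-shard checksums; equals the checksum of the union,
    # regardless of the order the shards were checksummed in.
    acc = 0
    for rows in shard_row_sets:
        acc ^= table_checksum(rows)
    return acc
```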
YES
It seems it can't.
Saw the above discussion on how to do the data checksum. Using FTWRL/LTWRL to get consistent snapshots is a very good scheme: if we need to migrate data from multiple shard tables into one TiDB table, then multiple consistent snapshots are required. But I have some concerns about FTWRL: it has a relatively large impact on the database, and some DBAs I know dare not use it on a master database. This is just a concern, but if data comparisons are frequent, then we need to pay attention to it.
In addition, my understanding of a real-time/online replication checksum is an incremental checksum, not a full-data checksum mechanism; the closest approach may be to choose only some chunks to checksum. How to choose chunks is a question worth considering. Based on the above ideas, I propose an optimistic data checksum scheme: assuming that a chunk will not be updated for a period of time (or later), we can verify it multiple times.
In this way, we may be able to avoid the need to lock tables and control the DM replication task.
I love this two-level checksum proposal. What's your opinion? @tirsen
If we don't lock tables, upstream and downstream may have some different data, but not all data are different. So if we split them into chunks with a good WHERE clause, chunks that contain only cold data could pass verification without locking tables in the "incremental" checksum. And if we switch to the "lock table" checksum and some chunks are still untouched during binlog replication, those chunks can be skipped to shorten the compare time.
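A hypothetical sketch of the two pieces mentioned here: splitting a table into chunks by primary-key range (producing the WHERE clauses), and filtering out chunks whose key range contains no recently-replicated key, so they can either pass the unlocked check or be skipped in the locked pass. All names are illustrative; a real implementation would derive the touched keys from binlog events.

```python
# Split a PK range [min_pk, max_pk] into fixed-size chunks, each with a
# WHERE clause suitable for a per-chunk checksum query.
def split_chunks(min_pk, max_pk, chunk_size):
    chunks = []
    lo = min_pk
    while lo <= max_pk:
        hi = min(lo + chunk_size - 1, max_pk)
        chunks.append((lo, hi, f"id >= {lo} AND id <= {hi}"))
        lo = hi + 1
    return chunks

def untouched_chunks(chunks, touched_keys):
    """Keep only chunks whose range contains no key seen in recent binlog
    events; these are the cold chunks that can be verified (or skipped)
    without locking."""
    return [c for c in chunks
            if not any(c[0] <= k <= c[1] for k in touched_keys)]
```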
Yeah, we only run these diffs against standby replicas, so FTWRL is not a problem. Doesn't chunk checksumming require a special updated_at column maintained by the application? That makes it less generally useful...
Oh I see, you just check it a few times until it succeeds? Yeah, that might work! That's really nice!
@tirsen sorry for the late reply! Yep, the optimistic data checksum scheme would try to check a chunk a few times until it succeeds. I have a project implementation question: do you want to implement it as a more general library, e.g. making upstream data fetching, chunk splitting, data checksum & comparison, and verification timing into interfaces?
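The decomposition asked about here could look roughly like the following. These interface names are hypothetical, not an existing DM API; they only illustrate separating the four concerns so each can be swapped independently (e.g. a locking vs. optimistic scheduler).

```python
# Hypothetical interface sketch for a general checksum library: upstream
# data fetching, chunk splitting, checksum/comparison, and verification
# timing as separate pluggable pieces.
from abc import ABC, abstractmethod

class ChunkSplitter(ABC):
    @abstractmethod
    def split(self, table):
        """Return an iterable of WHERE-clause chunks for `table`."""

class DataFetcher(ABC):
    @abstractmethod
    def fetch(self, table, chunk):
        """Return the rows of `table` matching one chunk."""

class Comparator(ABC):
    @abstractmethod
    def equal(self, upstream_rows, downstream_rows):
        """Checksum and compare one chunk's rows from both sides."""

class VerifyScheduler(ABC):
    @abstractmethod
    def should_retry(self, chunk, attempt):
        """Decide whether/when to re-check a chunk (optimistic timing)."""
```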
There are two ideas proposed above. If other people have no other new ideas, I think we can choose a plan. What do you think? @lance6716 @tirsen @csuzhangxc |
I think we need both of them for different scenarios, but we may choose to implement optimistic data checksum first. |
Optimistic data checksum sounds like it would certainly work very well for us.
Feature Request
Is your feature request related to a problem? Please describe:
Now we use sync-diff-inspector to check data between the upstream MySQL and the downstream TiDB, but it can only check data which will not be updated during the checking process (this may be whole MySQL instances, a database, a table, or part of the data in a table with range specified in the sync-diff-inspector config). Another similar issue is #688.
Describe the feature you'd like:
provides a complete online replication checksum feature.
NOTE: a solution which requires no extra data to be written would be better.
Describe alternatives you've considered:
Teachability, Documentation, Adoption, Migration Strategy: