Verification check fails due to index summary being rebuilt after the backup was taken #802

Closed
serban21 opened this issue Sep 11, 2024 · 2 comments · Fixed by #812


serban21 commented Sep 11, 2024

I'm using Cassandra 4.06 with Medusa 0.22.0. In production, on differential backups, medusa verify fails because the size and MD5 recorded in manifest.json for some Summary.db files do not match the actual size and MD5 of the S3 blob.

I investigated and found that the index summary was modified long after the SSTable was created:
Screenshot 2024-09-11 at 12 07 12

and the new version was uploaded to S3:
Screenshot 2024-09-11 at 15 55 41

This is normal Cassandra behavior, controlled by index_summary_resize_interval.

There was a Cassandra log entry about the index summary at almost the same timestamp:

INFO  [IndexSummaryManager:1] 2024-09-11 07:19:34,078 IndexSummaryRedistribution.java:83 - Redistributing index summaries

The most recent differential backup has the correct size and MD5 fingerprint. I guess a restore will work regardless, since the new summary is just a better version of the old one, but I didn't test it. Still, verify should not fail on a good backup.

Some ideas for fixing this:

  1. Do not check the MD5 and size of summary files, only their presence. This would be better than the current behavior, but any modification made to the archived summary file would go undetected. It should be fairly easy to implement.
  2. Update the manifest.json of all old differential backups: detect when a summary file is overwritten and go through all manifests. I don't like this, because any error could affect those backups.
  3. When verify hits such an error, and only for summary files, Medusa could go to the latest manifest and check whether the file is present there with a matching MD5 and size; if so, it should not report an error. If the file is not present, it could go backwards through the manifest files until it reaches one that has the file. Problems:
  • This will fail if the backups containing the new MD5 were deleted, but I think that's unlikely.
  • It could be argued that Medusa would then not restore exactly what was backed up. That could be a problem for certifications and audits, I guess. It's not a real problem, since the file is just a sampling of an index.
  4. When a summary file is overwritten, save the previous variant. I guess the MD5 could be added to the file name to identify it more easily, or the file could be put in a separate folder in data/. This would require changes in verify, restore and delete. The main advantage is that we would not need to look into other manifests, and the backup would keep an exact copy of the files as they were at the moment of the backup.

My preference is (3) or (4). (4) is the ideal solution, while (3) could be good enough. I can try to implement one of the ideas. (1) is just a short-term solution.
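To make idea (3) concrete, here is a minimal sketch of the backwards lookup. It is not Medusa code: the function names are hypothetical, and it assumes each manifest is a list of sections with an "objects" list whose entries carry "path", "MD5" and "size" keys, and that the same object path is reused across differential backups.

```python
from typing import List, Optional

SUMMARY_SUFFIX = "Summary.db"


def find_entry(manifest: List[dict], path: str) -> Optional[dict]:
    """Return the object recorded for `path` in one manifest, or None."""
    for section in manifest:
        for obj in section.get("objects", []):
            if obj["path"] == path:
                return obj
    return None


def summary_matches_newer_manifest(
    path: str,
    actual_md5: str,
    actual_size: int,
    manifests_newest_first: List[List[dict]],
) -> bool:
    """Decide whether a mismatching summary file can be accepted.

    Only summary files qualify. The manifests are walked from the newest
    backwards, and the first one that references the file decides: the
    stored blob is accepted if it matches that manifest's MD5 and size.
    """
    if not path.endswith(SUMMARY_SUFFIX):
        return False
    for manifest in manifests_newest_first:
        entry = find_entry(manifest, path)
        if entry is None:
            continue  # this backup does not reference the file; look further back
        return entry["MD5"] == actual_md5 and entry["size"] == actual_size
    return False  # no manifest references the file at all
```

A verify run would call this only after the regular comparison against the backup's own manifest has failed, so the common case pays no extra cost.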

┆Issue is synchronized with this Jira Story by Unito
┆Reviewer: Alexander Dejanovski
┆Fix Versions: 2024-10,2024-11
┆Issue Number: MED-95


rzvoncek commented Oct 8, 2024

Hello. I was finally able to look into this.

It took me by surprise that Summary.db turns out not to be immutable. I was able to force this behaviour by using really small indexing settings (index_summary_capacity and index_summary_resize_interval in cassandra.yaml). I've extended one integration test to do this and recalculate indices once every minute. Then I had to ALTER the indexing interval of the test table, which made the next indexing run rewrite the Summary.db file.

However, what I saw in my tests was that this only became a problem for the backups that were not the most recent one. The most recent one had a manifest with the correct metadata. This is not good enough though; we don't want to render all the old backups useless, particularly because the Summary.db file seems to be really unimportant. Cassandra writes it anew during bootstrap if it doesn't find it, and overwrites it if it finds a different one.

Because it's so unimportant, we decided to treat it in a way similar to the Statistics.db file and simply ignore it during verification. The difference, though, is that we made Medusa log a warning if it finds a mismatch in Summary.db. The verify command will not fail because of this.

Doing something more complicated, along the lines of updating manifests, might not be worth the effort or the extra complexity.
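The behaviour described above can be sketched roughly as follows. This is an illustration, not the actual change from the fix: `verify_objects`, the suffix constants and the `(md5, size)` tuples are all hypothetical names and shapes chosen for the example.

```python
import logging
from typing import Dict, List, Tuple

logger = logging.getLogger("verify_sketch")

# Assumption: components are recognised by the SSTable file-name suffix.
IGNORED_SUFFIXES = ("-Statistics.db",)  # content is not compared at all
WARN_ONLY_SUFFIXES = ("-Summary.db",)   # mismatch is logged, not reported

Fingerprint = Tuple[str, int]  # (md5, size)


def verify_objects(
    expected: Dict[str, Fingerprint],
    actual: Dict[str, Fingerprint],
) -> List[str]:
    """Compare manifest fingerprints with stored blobs and return the errors.

    A missing file is always an error. A content mismatch is an error too,
    except for Summary.db (warning only) and Statistics.db (ignored).
    """
    errors = []
    for path, exp in expected.items():
        act = actual.get(path)
        if act is None:
            errors.append(f"missing: {path}")
        elif path.endswith(IGNORED_SUFFIXES) or act == exp:
            continue
        elif path.endswith(WARN_ONLY_SUFFIXES):
            logger.warning("Summary.db differs from manifest: %s", path)
        else:
            errors.append(f"mismatch: {path}")
    return errors
```

With this shape, verify fails only when the returned list is non-empty, so a rewritten Summary.db leaves a trace in the log without failing an otherwise good backup.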


serban21 commented Oct 9, 2024

Hello. Thank you.

Yes, the last backup is always OK; only the older ones have this problem.
I agree that it's OK to just ignore the differences in Summary.db. In any case, the newer file should actually be better than the previous one. The only possible argument against it is that the old backups are then not immutable, which some people might not like. But I think it's a weak argument, since it's just a summary of an index.
