From 34781a45fa6436d7091b3033f8d7f49e55f626ce Mon Sep 17 00:00:00 2001
From: ChrisBAshton
Date: Mon, 3 Mar 2025 11:53:49 +0000
Subject: [PATCH] Add data migration to delete old link check reports

Whitehall introduced the link checker api report concept in Nov 2017:
03c47344a42243f96990559e33125f73edc4d00b

Link check reports are never deleted - even ones associated with
superseded editions - and editions can also be associated with
multiple link check reports. Consequently, at time of writing,
there are over 12 million reports in Whitehall, making refactoring
the modelling challenging.

This PR makes the case that keeping all of these historic link
check reports around is not valuable. The old reports are never
surfaced to users, and even if they were, how valuable is it to
be able to know that 'at the time' a given link was ok (or not)?

No, there is only really value in 'recent' reports. One could even
argue that a report generated yesterday isn't that useful, as what
was a good link yesterday may have become 'bad' overnight. Users
can trivially generate new link check reports by hitting the
relevant button in Whitehall's UI.

All that being said, we see little value in retaining any link
check reports generated before 2025. Even if we have logic that,
say, prevents publication of a document that is missing a link
check report, and we have an edge case where a document was
scheduled in 2024 for publication in 2025, that should be
mitigated by the fact that we're planning to kick off a batch of
new link check reports for all draft/published editions, as part
of https://trello.com/c/tmnht4P1/.

Before: 11787828 records
After: 446009 records (locally, from slightly stale data)
---
 ...303102938_remove_old_link_check_reports.rb | 29 +++++++++++++++++++
 1 file changed, 29 insertions(+)
 create mode 100644 db/data_migration/20250303102938_remove_old_link_check_reports.rb

diff --git a/db/data_migration/20250303102938_remove_old_link_check_reports.rb b/db/data_migration/20250303102938_remove_old_link_check_reports.rb
new file mode 100644
index 00000000000..a6f8602c37c
--- /dev/null
+++ b/db/data_migration/20250303102938_remove_old_link_check_reports.rb
@@ -0,0 +1,29 @@
+# Remove old link check reports (and their links in the
+# link_checker_api_report_links table). We have 12 million link
+# check reports in the database and it slows the migration job down
+# significantly. Keeping only reports from the beginning of the year
+# drops this to a ballpark of 50k reports.
+
+cut_off_point = Date.new(2025, 1, 1)
+batch_size = 10_000
+batch_count = 0
+
+loop do
+  # Fetch a batch of report IDs
+  report_ids = LinkCheckerApiReport.where("created_at < ?", cut_off_point)
+                                   .limit(batch_size)
+                                   .pluck(:id)
+
+  break if report_ids.empty? # Stop when no more records are left
+
+  # Delete associated links first to satisfy foreign key constraints
+  LinkCheckerApiReport::Link.where(link_checker_api_report_id: report_ids).delete_all
+
+  # Now delete reports
+  deleted_count = LinkCheckerApiReport.where(id: report_ids).delete_all
+
+  batch_count += 1
+  puts "Deleted batch #{batch_count} (#{deleted_count} reports and their links)"
+end
+
+puts "Completed deletion of old link check reports."
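
Reviewer note: the batching logic in the migration can be checked without a database using a minimal plain-Ruby sketch. This is not the migration itself. `Report`, `Link`, and `delete_old_reports` are hypothetical stand-ins for the ActiveRecord models, and in-memory arrays replace the tables. It only illustrates the loop shape: select a bounded batch of old IDs, delete the child links first, then the reports, and repeat until no old IDs remain.

```ruby
require "date"

# Hypothetical in-memory stand-ins for the two tables (not Whitehall's models).
Report = Struct.new(:id, :created_at)
Link = Struct.new(:id, :report_id)

# Deletes reports created before cut_off_point, batch_size at a time,
# removing each batch's links before the reports themselves.
# Returns the number of batches processed.
def delete_old_reports(reports, links, cut_off_point:, batch_size:)
  batch_count = 0

  loop do
    # Take at most batch_size IDs of reports older than the cut-off.
    report_ids = reports.select { |r| r.created_at < cut_off_point }
                        .first(batch_size)
                        .map(&:id)

    break if report_ids.empty? # Nothing old is left, so stop.

    # Children first, mirroring the foreign key ordering in the migration.
    links.reject! { |l| report_ids.include?(l.report_id) }
    reports.reject! { |r| report_ids.include?(r.id) }

    batch_count += 1
  end

  batch_count
end

# Five old reports, two recent ones, and one link per report.
reports = (1..5).map { |i| Report.new(i, Date.new(2024, 6, i)) } +
          (6..7).map { |i| Report.new(i, Date.new(2025, 2, i)) }
links = reports.map { |r| Link.new(r.id + 100, r.id) }

batches = delete_old_reports(reports, links,
                             cut_off_point: Date.new(2025, 1, 1),
                             batch_size: 2)
puts "Batches: #{batches}, reports left: #{reports.map(&:id)}, links left: #{links.map(&:report_id)}"
```

The loop terminates because each pass deletes exactly the rows it selected, so the same query eventually returns an empty batch. No offset is needed for the same reason.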