Add data migration to delete old link check reports
Whitehall introduced the link checker API report concept in Nov
2017: 03c4734

Link check reports are never deleted - even ones associated with
superseded editions - and editions can also be associated with
multiple link check reports.

Consequently, at the time of writing, there are over 12 million
reports in Whitehall, making refactoring the modelling challenging.

This PR makes the case that keeping all of these historic link
check reports around is not valuable. The old reports are never
surfaced to users, and even if they were, there is little value in
knowing that a given link was OK (or not) 'at the time'. The only
real value is in 'recent' reports. One could even argue that a
report generated yesterday isn't that useful, as what was a good
link yesterday may have become 'bad' overnight.

Users can trivially generate new link check reports by hitting the
relevant button in Whitehall's UI.

Accordingly, we see little value in retaining any link check
reports generated before 2025.

Even if we have logic that, say, prevents publication of a document
that is missing a link check report, and we have an edge case where
a document was scheduled in 2024 for publication in 2025, that
should be mitigated by the fact that we're planning to kick off a
batch of new link check reports for all draft/published editions,
as part of https://trello.com/c/tmnht4P1/.

Before: 11787828 records
After: 446009 records (locally, from slightly stale data)
ChrisBAshton committed Mar 3, 2025
1 parent 30e1f5d commit 34781a4
Showing 1 changed file with 29 additions and 0 deletions.
29 changes: 29 additions & 0 deletions db/data_migration/20250303102938_remove_old_link_check_reports.rb
@@ -0,0 +1,29 @@
# Remove old link check reports (and their links in the
# link_checker_api_report_links table). We have 12 million link
# check reports in the database and it slows the migration job down
# significantly. Keeping only reports from the beginning of the year
# drops this to a ballpark of 50k reports.

cut_off_point = Date.new(2025, 1, 1)
batch_size = 10_000
batch_count = 0

loop do
  # Fetch a batch of report IDs
  report_ids = LinkCheckerApiReport.where("created_at < ?", cut_off_point)
                                   .limit(batch_size)
                                   .pluck(:id)

  break if report_ids.empty? # Stop when no more records are left

  # Delete associated links first to satisfy foreign key constraints
  LinkCheckerApiReport::Link.where(link_checker_api_report_id: report_ids).delete_all

  # Now delete reports
  deleted_count = LinkCheckerApiReport.where(id: report_ids).delete_all

  batch_count += 1
  puts "Deleted batch #{batch_count} (#{deleted_count} reports and their links)"
end

puts "Completed deletion of old link check reports."
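The shape of the batch-delete loop above can be sketched in plain Ruby without a database. This is an illustrative stand-in only: the array, IDs, and variable names below are hypothetical and replace the `LinkCheckerApiReport` table purely to show how the loop terminates.

```ruby
# Hypothetical stand-in for the migration's batch loop: an in-memory
# array of fake IDs replaces the database table.
reports = (1..25).to_a # pretend IDs of old reports
batch_size = 10
batch_count = 0

loop do
  batch = reports.first(batch_size) # fetch a batch of IDs
  break if batch.empty?             # stop when no records are left

  reports -= batch                  # "delete" the batch
  batch_count += 1
end

puts "Deleted #{batch_count} batches" # 25 records in batches of 10 -> 3 batches
```

Deleting in fixed-size batches like this keeps each `delete_all` statement (and its transaction/locks) small, which matters when the starting table holds millions of rows.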
