Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

gazctl: shard pruning should consider all hints #350

Merged
merged 1 commit into from
Sep 28, 2023

Conversation

jgraettinger
Copy link
Contributor

@jgraettinger jgraettinger commented Sep 28, 2023

If a shard is in a persistent state where it immediately becomes FAILED after recovery, it's very possible for the oldest recovered backup hints to actually be newer than primary hints.

Then, if we prune without considering the primary hints, the shard is unable to recover from its log.

Fortunately this is easily remedied -- delete the primary hints while leaving backup recovery hints in place -- but better to not do it in the first place.


This change is Reviewable

If a shard is in a persistent state where it immediately becomes FAILED
after recovery, it's very possible for the oldest recovered backup hints
to actually be _newer_ than primary hints.

Then, if we prune without considering the primary hints, the shard is
unable to recover from its log.

Fortunately this is easily remedied -- delete the primary hints while
leaving backup recovery hints in place -- but better to not do it in the
first place.
Copy link
Contributor

@psFried psFried left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

if hints != nil && len(hints.LiveNodes) > 0 {
foldHintsIntoSegments(*hints, logSegmentSets)
} else {
skipRecoveryLogs[recoveryLog] = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not something that's changed here, but this just made me curious. If there's non-nil hints, but they don't have any LiveNodes, could that indicate that all the files have been deleted, and thus we should prune that journal still? Not proposing that we handle that corner case, but more just checking my understanding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that sounds conceptually right but I don't recall if hinted log replay can function if there are literally no LiveNodes. Possibly it handles it by jumping to the head of the log, or possibly it errors out because it wouldn't have an (offset, sequenceNo, checksum) tuple to resume reading from.

But if all data is also deleted, that can also be problematic because there must be written & persisted data that's old enough for the reader's offset to be eligible to jump to the log head.

(To my knowledge there's no actual scenario where LiveNodes is intentionally empty.)

@jgraettinger jgraettinger merged commit 2fb3920 into master Sep 28, 2023
@jgraettinger jgraettinger deleted the johnny/recovery-edge-cases branch September 28, 2023 21:30
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants