Phil/pruning paranoia #363

Merged 3 commits on Feb 6, 2024

@psFried psFried commented Feb 6, 2024

Rolls up a few minor improvements to `gazctl shards prune`. Individual commit messages have more details, but the main goals are to add more detailed audit logging and to continue pruning after failing to remove a fragment.

I've been using this script to compare the audit logs that these changes produce, to try to detect any journals or fragments that were removed in an earlier prune operation but not in a subsequent one. So far, I've not found any.
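The comparison script itself is not reproduced here. As a rough sketch of the kind of check it describes, assuming each audit log has already been reduced to a list of removed fragment names, a set difference is enough. The function name `removedOnlyInEarlier` and the sample names are hypothetical and not part of gazctl.

```go
package main

import (
	"fmt"
	"sort"
)

// removedOnlyInEarlier returns the names that appear in the earlier prune's
// audit log but not in the later one, in sorted order.
// Hypothetical helper for illustration only.
func removedOnlyInEarlier(earlier, later []string) []string {
	seen := make(map[string]struct{}, len(later))
	for _, name := range later {
		seen[name] = struct{}{}
	}
	var missing []string
	for _, name := range earlier {
		if _, ok := seen[name]; !ok {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Made-up fragment names standing in for entries parsed from two audit logs.
	earlier := []string{"journal-a/frag-1", "journal-a/frag-2", "journal-b/frag-1"}
	later := []string{"journal-b/frag-1", "journal-a/frag-1"}
	fmt.Println(removedOnlyInEarlier(earlier, later))
}
```

An empty result corresponds to the "not found any" outcome reported above.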


If someone accidentally runs `shards prune` with a selector that doesn't
include all the shards that use a forked recovery log, the prune may try to
remove the current fragment of a journal. This adds a check for that
condition so that we can provide visibility and stop the prune operation.

Attempting to prune an unpersisted fragment would fail anyway, but this
at least gives a clearer error message.
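A minimal sketch of what such a guard could look like, not the code from this PR. The `Fragment` struct is a simplified, hypothetical stand-in, and treating an empty `BackingStore` as "not yet persisted" is an assumption of the sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// Fragment is a simplified, hypothetical stand-in for a journal fragment.
type Fragment struct {
	Journal      string
	Begin, End   int64
	BackingStore string // Assumption: empty means the fragment is not yet persisted.
}

// ErrPruneCurrentFragment marks an attempt to prune an unpersisted fragment.
var ErrPruneCurrentFragment = errors.New("refusing to prune the current (unpersisted) fragment")

// checkPrunable returns a descriptive error if the fragment has not been
// persisted, so the caller can stop the prune with a clear message.
func checkPrunable(f Fragment) error {
	if f.BackingStore == "" {
		return fmt.Errorf("journal %s, offsets [%d, %d): %w",
			f.Journal, f.Begin, f.End, ErrPruneCurrentFragment)
	}
	return nil
}

func main() {
	frags := []Fragment{
		{Journal: "recovery/logs/shard-a", Begin: 0, End: 1024, BackingStore: "s3://example-bucket/"},
		{Journal: "recovery/logs/shard-a", Begin: 1024, End: 2048}, // still being written
	}
	for _, f := range frags {
		if err := checkPrunable(f); err != nil {
			fmt.Println("stopping prune:", err)
			return
		}
		fmt.Printf("would prune %s [%d, %d)\n", f.Journal, f.Begin, f.End)
	}
}
```

Wrapping a sentinel error keeps the message specific to the journal and offsets while still letting callers test for the condition with `errors.Is`.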

Adds a lot more information to the logs when fragments are deleted. The intent
is to run with JSON logs, to enable automated audits and analysis of prune
operations. Because so much more information is now logged, I also turned
down the log level on some of the other messages.

I also added in some additional warning logs for conditions that should never
be encountered.
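To illustrate why JSON logs suit automated audits, here is a sketch of one structured record per deleted fragment. The `AuditEntry` type and its field names are hypothetical and do not reflect the fields gazctl actually emits. The sketch uses the standard `encoding/json` package rather than whatever logging library the tool uses.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AuditEntry is a hypothetical record describing one deleted fragment.
// The field names are illustrative, not gazctl's real log fields.
type AuditEntry struct {
	Level   string `json:"level"`
	Msg     string `json:"msg"`
	Journal string `json:"journal"`
	Begin   int64  `json:"begin"`
	End     int64  `json:"end"`
	Store   string `json:"store"`
}

// auditLine renders the entry as a single JSON line that a script can parse.
func auditLine(e AuditEntry) (string, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	line, err := auditLine(AuditEntry{
		Level:   "info",
		Msg:     "deleted fragment",
		Journal: "recovery/logs/shard-a",
		Begin:   0,
		End:     1024,
		Store:   "s3://example-bucket/",
	})
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	fmt.Println(line)
}
```

Because every record has the same keys, two prune runs can be compared by parsing each line and diffing on journal and offsets.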

This allows recovery log pruning to continue after encountering an error
removing a fragment. We now operate Gazette clusters that use a variety of
different storage buckets, and it seems unavoidable that some of them might
have permissions misconfigured or return an error for some other reason. In
that case, we'll now log a warning and continue the prune operation. gazctl
will still exit non-zero if it has encountered any errors removing fragments,
to ensure it never fails silently.
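The continue-on-error behavior can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `pruneAll`, `exitCode`, `errDenied`, and the fragment names are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errDenied simulates a storage bucket with misconfigured permissions.
var errDenied = errors.New("permission denied")

// pruneAll tries to remove every fragment. A failure is logged as a warning
// and counted, and the loop moves on to the next fragment instead of aborting.
func pruneAll(fragments []string, remove func(string) error) (removed, failed int) {
	for _, f := range fragments {
		if err := remove(f); err != nil {
			fmt.Fprintf(os.Stderr, "warning: failed to remove fragment %q: %v\n", f, err)
			failed++
			continue
		}
		removed++
	}
	return removed, failed
}

// exitCode maps the failure count to a process exit status, so that a run
// with any failed removals never looks like a success.
func exitCode(failed int) int {
	if failed > 0 {
		return 1
	}
	return 0
}

func main() {
	remove := func(f string) error {
		if f == "bucket-b/frag-2" {
			return errDenied
		}
		return nil
	}
	removed, failed := pruneAll(
		[]string{"bucket-a/frag-1", "bucket-b/frag-2", "bucket-a/frag-3"}, remove)
	fmt.Printf("removed=%d failed=%d\n", removed, failed)
	// A real command would finish with os.Exit(exitCode(failed));
	// the sketch only prints the code.
	fmt.Println("exit code:", exitCode(failed))
}
```

Counting failures rather than returning on the first one lets the remaining fragments be pruned while still reporting the problem through the exit status.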
@psFried psFried requested a review from jgraettinger February 6, 2024 12:49

@jgraettinger jgraettinger left a comment

LGTM!

@psFried psFried merged commit 338b339 into master Feb 6, 2024
1 check passed
@psFried psFried deleted the phil/pruning-paranoia branch February 6, 2024 21:43