[sled-agent] Destroy orphaned datasets (PR 3/2) #8323

Merged
merged 4 commits into from
Jun 16, 2025

jgallagher
Contributor

Builds on #8302. This allows the config-reconciler to destroy durable datasets it believes are orphaned due to expunged Omicron zones. Destruction is disabled by default and can be enabled on a sled-by-sled basis via a new `omdb sled-agent chicken-switch destroy-orphans enable` subcommand (with related commands to get and disable the same).

We don't want to ship automatic dataset deletion before R17 (R16 should only ship "report orphaned datasets"), but we need to be able to turn it on for upgrade testing in the meantime. All of this chicken-switch stuff should be removable after R16, once we're comfortable enabling deletion in general.
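
As a rough illustration of the mechanism described above, a chicken switch of this kind can be modeled as a shared atomic flag that defaults to off. This is a minimal sketch, not the code from this PR: the `ChickenSwitch` type and its method names are hypothetical, and only the `Arc<AtomicBool>` representation and `Ordering::Relaxed` come from the diff under review.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical wrapper around a shared on/off flag (not a type from the PR).
/// `Default` yields `false`, matching "disabled by default".
#[derive(Clone, Default)]
struct ChickenSwitch(Arc<AtomicBool>);

impl ChickenSwitch {
    /// Turn the potentially destructive behavior on.
    fn enable(&self) {
        self.0.store(true, Ordering::Relaxed);
    }

    /// Turn it back off.
    fn disable(&self) {
        self.0.store(false, Ordering::Relaxed);
    }

    /// Read the current setting.
    fn get(&self) -> bool {
        self.0.load(Ordering::Relaxed)
    }
}

fn main() {
    // One handle for whoever services the toggle request, one for the
    // background task that consults the flag; both share the same bool.
    let api_handle = ChickenSwitch::default();
    let reconciler_handle = api_handle.clone();

    assert!(!reconciler_handle.get()); // off until someone enables it
    api_handle.enable();
    assert!(reconciler_handle.get());
    api_handle.disable();
    assert!(!reconciler_handle.get());
}
```

Cloning the wrapper clones the `Arc`, so every holder observes the same flag, and removing the switch later should only require deleting the wrapper and its call sites.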

Base automatically changed from john/sled-agent-config-reconciler-report-orphaned-datasets-inventory to main June 12, 2025 13:33
@jgallagher jgallagher force-pushed the john/sled-agent-destroy-orphaned-datasets branch from f3ac7a8 to fce9033 Compare June 12, 2025 13:36
/// control "chicken switches" (potentially-destructive sled-agent behavior
/// that can be toggled on or off via `omdb`)
#[clap(subcommand)]
ChickenSwitch(ChickenSwitchCommands),
Contributor

🐔

List,
enum ChickenSwitchCommands {
/// interact with the "destroy orphaned datasets" chicken switch
DestroyOrphans(DestroyOrphansArgs),
Contributor

I can't get over the name of this variant.

@@ -65,6 +66,7 @@ pub(crate) fn spawn<T: SledAgentFacilities>(
currently_managed_zpools_tx: watch::Sender<Arc<CurrentlyManagedZpools>>,
external_disks_tx: watch::Sender<HashSet<Disk>>,
raw_disks_rx: RawDisksReceiver,
destroy_orphans: Arc<AtomicBool>,
Contributor

I've said it before. All you need is an AtomicBool!

.datasets_report_orphans(
datasets.clone(),
currently_managed_zpools,
self.destroy_orphans.load(Ordering::Relaxed),
Contributor

Yay, proper usage of ordering!
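
The snippet above reads the flag once and passes the resulting boolean into the orphan-handling step. A minimal sketch of how such a boolean might choose between destroying and merely reporting is shown below. `OrphanDisposition` and `handle_orphans` are illustrative assumptions, not the reconciler's actual types. The reason text is borrowed from the inventory output later in this thread. `Ordering::Relaxed` is enough in this sketch because the flag is read on its own and does not guard any other shared memory.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical outcome for one orphaned dataset.
#[derive(Debug, PartialEq)]
enum OrphanDisposition {
    Destroyed,
    Reported { reason: &'static str },
}

/// Hypothetical helper: decide what happens to each orphan given the
/// value of the switch at the time of the call.
fn handle_orphans(
    orphans: &[&str],
    destroy_orphans: bool,
) -> Vec<(String, OrphanDisposition)> {
    orphans
        .iter()
        .map(|name| {
            let disposition = if destroy_orphans {
                // A real implementation would destroy the dataset here;
                // this sketch only records the decision.
                OrphanDisposition::Destroyed
            } else {
                OrphanDisposition::Reported {
                    reason: "`destroy_orphans` chicken switch is off",
                }
            };
            (name.to_string(), disposition)
        })
        .collect()
}

fn main() {
    let destroy_orphans = Arc::new(AtomicBool::new(false));
    let orphans = ["pool/crypt/internal_dns"]; // placeholder dataset name

    // Switch off: the orphan is only reported.
    let flag = destroy_orphans.load(Ordering::Relaxed);
    println!("{:?}", handle_orphans(&orphans, flag));

    // Switch on: the next pass would destroy it.
    destroy_orphans.store(true, Ordering::Relaxed);
    let flag = destroy_orphans.load(Ordering::Relaxed);
    println!("{:?}", handle_orphans(&orphans, flag));
}
```

Reading the flag once per pass means a toggle takes effect on the next pass rather than partway through one.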

@jgallagher
Contributor Author

Tested this on london during an upgrade today: enabling the switch seems to have worked, and orphans were correctly destroyed.

One case that would have failed without this is this internal DNS dataset, which we expunged and replaced on the same zpool:

*   oxp_204f2891-299e-4c76-8ea0-a40bfd307c45/crypt/internal_dns                                                 c843fe79-82b0-4598-8286-48447b0a49a1   - in service   none      none          off
     └─                                                                                                                                                + expunged
+   oxp_204f2891-299e-4c76-8ea0-a40bfd307c45/crypt/internal_dns                                                 ea8a9f32-c8fd-45d4-a5a2-fdbc580bf4c5   in service     none      none          off

Looking at the zpool history, we see when RSS created the initial dataset, and when sled-agent destroyed it and replaced it with the new one:

# Initial dataset created during RSS (ID c843fe79-82b0-4598-8286-48447b0a49a1 matches now-expunged dataset)
1986-12-28.00:11:04 zfs create -o zoned=on -o canmount=on -o mountpoint=/data oxp_204f2891-299e-4c76-8ea0-a40bfd307c45/crypt/internal_dns
1986-12-28.00:11:05 zfs set quota=none reservation=none compression=off oxide:uuid=c843fe79-82b0-4598-8286-48447b0a49a1 oxp_204f2891-299e-4c76-8ea0-a40bfd307c45/crypt/internal_dns

# sled-agent destroyed it
2025-06-13.15:17:28 zfs destroy -r oxp_204f2891-299e-4c76-8ea0-a40bfd307c45/crypt/internal_dns

# New dataset (ID ea8a9f32-c8fd-45d4-a5a2-fdbc580bf4c5 matches now-added dataset)
2025-06-13.15:17:32 zfs create -o zoned=on -o canmount=on -o mountpoint=/data oxp_204f2891-299e-4c76-8ea0-a40bfd307c45/crypt/internal_dns
2025-06-13.15:17:32 zfs set quota=none reservation=none compression=off oxide:uuid=ea8a9f32-c8fd-45d4-a5a2-fdbc580bf4c5 oxp_204f2891-299e-4c76-8ea0-a40bfd307c45/crypt/internal_dns
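
The history above shows why the dataset name alone cannot tell you whether a dataset is current: the old and new datasets share a name and differ only in their `oxide:uuid` property. A minimal sketch of that comparison follows, assuming a hypothetical `plan_dataset` helper and `Action` enum (neither is from the PR). In the real system the destroy step would also depend on the chicken switch being enabled.

```rust
/// Hypothetical plan for a single dataset name.
#[derive(Debug, PartialEq)]
enum Action {
    /// On-disk ID matches the desired ID; nothing to do.
    Keep,
    /// Nothing on disk under this name; create it.
    Create,
    /// Something with the same name but a different ID is on disk;
    /// it must be destroyed before the desired dataset can be created.
    DestroyThenCreate,
}

/// Compare the `oxide:uuid` found on disk (if any) with the ID the
/// current configuration wants for the same dataset name.
fn plan_dataset(on_disk_id: Option<&str>, desired_id: &str) -> Action {
    match on_disk_id {
        None => Action::Create,
        Some(id) if id == desired_id => Action::Keep,
        Some(_) => Action::DestroyThenCreate,
    }
}

fn main() {
    // IDs taken from the zpool history above.
    let expunged = "c843fe79-82b0-4598-8286-48447b0a49a1";
    let added = "ea8a9f32-c8fd-45d4-a5a2-fdbc580bf4c5";

    // Before the upgrade step: old dataset on disk, new one desired.
    println!("{:?}", plan_dataset(Some(expunged), added));
    // After the replacement: on-disk ID matches.
    println!("{:?}", plan_dataset(Some(added), added));
}
```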

@jgallagher
Contributor Author

Also tested leaving this switch off, expunging a zone, and confirming that datasets are not deleted. On dublin, after manually expunging an internal DNS zone and its dataset, inventory reports the dataset as orphaned:

ORPHANED DATASETS

sled 547aeb16-ca92-4745-87a4-b82fc78bffcd (serial BRM27230037)
    1 orphaned dataset(s):
        oxp_25a52511-3c7c-43d9-80ac-5cd75336e14f/crypt/internal_dns
            reason: `destroy_orphans` chicken switch is off (full dataset destruction is tracked by omicron#6177)
            dataset ID: 66bbac8e-ff23-49b0-bbc0-de29f159e5cf
            mounted: false
            available: 3018691280 KiB
            used: 268 KiB

sled a9cceeb5-5b9c-4a1d-af03-8cb2c59aa904 (serial BRM23230010)
    no orphaned datasets

sled bcee3c8a-2d68-41eb-8b69-25cadeaf18d4 (serial BRM23230018)
    no orphaned datasets

sled c7e380dd-8ad9-4235-b986-d5268fe2b62b (serial BRM42220026)
    no orphaned datasets

and the dataset still exists on disk as expected:

BRM27230037 # zfs list oxp_25a52511-3c7c-43d9-80ac-5cd75336e14f/crypt/internal_dns
NAME                                                          USED  AVAIL     REFER  MOUNTPOINT
oxp_25a52511-3c7c-43d9-80ac-5cd75336e14f/crypt/internal_dns   268K  2.81T      268K  /data
BRM27230037 # zfs get oxide:uuid oxp_25a52511-3c7c-43d9-80ac-5cd75336e14f/crypt/internal_dns
NAME                                                         PROPERTY    VALUE                                 SOURCE
oxp_25a52511-3c7c-43d9-80ac-5cd75336e14f/crypt/internal_dns  oxide:uuid  66bbac8e-ff23-49b0-bbc0-de29f159e5cf  local
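
As a quick cross-check, the inventory report and `zfs list` agree on available space once units are accounted for: the report prints KiB, while `zfs list` prints a rounded figure in binary units. A small sketch of the conversion (the `kib_to_tib` helper is only for illustration):

```rust
/// Convert KiB to TiB (1 TiB = 1024^3 KiB).
fn kib_to_tib(kib: u64) -> f64 {
    kib as f64 / (1024.0 * 1024.0 * 1024.0)
}

fn main() {
    // "available: 3018691280 KiB" from the inventory report above.
    let available_kib: u64 = 3_018_691_280;
    println!("{:.2} TiB", kib_to_tib(available_kib));
}
```

This comes to roughly 2.81 TiB, consistent with the `2.81T` shown in the AVAIL column.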

@jgallagher jgallagher merged commit 5fe3cc7 into main Jun 16, 2025
17 checks passed
@jgallagher jgallagher deleted the john/sled-agent-destroy-orphaned-datasets branch June 16, 2025 13:49