[12/n] [sled-agent] don't start install dataset zones on zone manifest error #8237

@sunshowers sunshowers commented May 29, 2025

For mupdate overrides, in order to be safe, we must know that the data stored in the JSON is consistent with the images stored on disk. We currently apply this logic to install dataset hashes, and will apply it to artifact hashes in the future.
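The consistency requirement above can be illustrated with a minimal sketch. This is not the code from this PR: the types `Expected`, `Observed`, and `ZoneStatus` and the function `check_manifest` are hypothetical names, and the sketch assumes the sizes and hashes of the on-disk images have already been computed elsewhere.

```rust
use std::collections::BTreeMap;

/// What the JSON manifest claims about one zone image (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Expected {
    pub size: u64,
    pub hash: String, // hex digest recorded in the manifest
}

/// What was measured for the image file on disk (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Observed {
    pub size: u64,
    pub hash: String, // hex digest computed from the file contents
}

/// Per-zone outcome of comparing the manifest against the disk.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ZoneStatus {
    Ok,
    Missing,
    SizeMismatch { expected: u64, actual: u64 },
    HashMismatch { expected: String, actual: String },
}

/// Compare every manifest entry against what was observed on disk.
/// The result is keyed by file name, so errors stay per-zone.
pub fn check_manifest(
    manifest: &BTreeMap<String, Expected>,
    on_disk: &BTreeMap<String, Observed>,
) -> BTreeMap<String, ZoneStatus> {
    manifest
        .iter()
        .map(|(name, exp)| {
            let status = match on_disk.get(name) {
                None => ZoneStatus::Missing,
                Some(obs) if obs.size != exp.size => ZoneStatus::SizeMismatch {
                    expected: exp.size,
                    actual: obs.size,
                },
                Some(obs) if obs.hash != exp.hash => ZoneStatus::HashMismatch {
                    expected: exp.hash.clone(),
                    actual: obs.hash.clone(),
                },
                Some(_) => ZoneStatus::Ok,
            };
            (name.clone(), status)
        })
        .collect()
}
```

Checking the size before the hash is only an ordering choice in this sketch; either mismatch marks that zone's image as inconsistent with the manifest.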

TODO before landing:

  • Testing on a racklette
  • I think source_resolver.rs is missing a couple of tests for error scenarios.

Questions:

  • Should we apply this logic to more critical zones like the switch zone? Those will presumably always be part of the ramdisk. Update: we now unconditionally succeed for ramdisk zones, and in particular for the switch zone.
  • How does this interact with the config-reconciler work? On error, it seems to me the reconciler would be permanently blocked. Note: we now decide errors per-zone.
  • We ask for the boot zpool to be passed in but also cache it as part of constructing the info... how do we reconcile the relative presence or absence of this info?
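
The per-zone behavior described in these questions can be sketched as follows. This is an illustration under stated assumptions, not the resolver from this PR: `ImageLocation`, `ResolveError`, and `resolve_image_dir` are hypothetical names, and the manifest results are modeled as a plain map from zone name to `Result`. Only the `/opt/oxide` ramdisk path comes from the code comments quoted later in the review.

```rust
use std::collections::BTreeMap;
use std::path::{Path, PathBuf};

/// Where a zone's image is expected to live (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ImageLocation {
    /// Shipped with the host OS image; never placed in the install dataset.
    Ramdisk,
    /// Placed in the install dataset and tracked by the zone manifest.
    InstallDataset,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ResolveError {
    NotInManifest { zone: String },
    ManifestEntryFailed { zone: String, reason: String },
}

/// Directory searched for ramdisk zones.
pub const RAMDISK_IMAGE_DIR: &str = "/opt/oxide";

/// Decide, for a single zone, which directory its image may be loaded from.
/// Ramdisk zones always succeed; install dataset zones succeed only if the
/// manifest check for that specific zone succeeded.
pub fn resolve_image_dir(
    zone: &str,
    location: ImageLocation,
    manifest_results: &BTreeMap<String, Result<(), String>>,
    install_dir: &Path,
) -> Result<PathBuf, ResolveError> {
    match location {
        ImageLocation::Ramdisk => Ok(PathBuf::from(RAMDISK_IMAGE_DIR)),
        ImageLocation::InstallDataset => match manifest_results.get(zone) {
            Some(Ok(())) => Ok(install_dir.to_path_buf()),
            Some(Err(reason)) => Err(ResolveError::ManifestEntryFailed {
                zone: zone.to_string(),
                reason: reason.clone(),
            }),
            None => Err(ResolveError::NotInManifest { zone: zone.to_string() }),
        },
    }
}
```

Because the decision is made per zone, one bad manifest entry blocks only the zone it describes rather than every install dataset zone.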

Depends on:

Created using spr 1.3.6-beta.1

[skip ci]
@sunshowers sunshowers requested a review from jgallagher May 29, 2025 07:53
@sunshowers sunshowers marked this pull request as draft May 29, 2025 18:55
sunshowers commented May 29, 2025

We need to figure out what to do if the mupdate override is not in place (i.e. if it was removed by sled-agent). Presumably the install dataset will still be around in that case.

@sunshowers sunshowers changed the title [12/n] [sled-agent] don't start install dataset zones on mupdate override error [12/n] [sled-agent] don't start install dataset zones on zone manifest error Jun 6, 2025
sunshowers commented Jun 6, 2025

> We need to figure out what to do if the mupdate override is not in place (i.e. if it was removed by sled-agent). Presumably the install dataset will still be around in that case.

I've split up the zone manifest into its own file in #8190.

@sunshowers sunshowers marked this pull request as ready for review June 6, 2025 23:33
@sunshowers sunshowers requested a review from andrewjstone June 6, 2025 23:49
sunshowers commented Jun 9, 2025

From madrid:

00:00:49.211Z INFO SledAgent (ZoneImageSourceResolver): found zone manifest for boot disk
    boot_disk_path = /pool/int/e01f2b02-d921-45d3-9462-c811ad8779c5/install/zones.json
    boot_zpool = oxi_e01f2b02-d921-45d3-9462-c811ad8779c5
    data =
      clickhouse.tar.gz: ok (306292018 bytes, 3172061ff19040893018825f52fa2b453fe31f263ba84aaeaf849f272cf6d5a9)
      clickhouse_keeper.tar.gz: ok (291227523 bytes, 75a373659cb8e6bd716f01086b4d0585f02d2f13669d048e7b383c5e5b255a97)
      clickhouse_server.tar.gz: ok (306289602 bytes, db92814a437840a673ba67b5dc9f88db3e3dae907a4d5713ff0d39a0d7797d65)
      cockroachdb.tar.gz: ok (148732359 bytes, 1cacd2f2c6eff958d17d6be0cb6fe3f1edf1478821e63e528ba5f5216a044b5f)
      crucible.tar.gz: ok (45869552 bytes, c66e6a3b30bdeffa92006969d2f1d4b95bfff720d695347fd1367daf4e019f0e)
      crucible_pantry.tar.gz: ok (36536826 bytes, ceaaba17088a41e842245958eddc67e1335e6bb27ec168ff113d7ffed638b69c)
      external_dns.tar.gz: ok (48730446 bytes, 69884f2db26e684f09e9e9b07ec968b1df6e887e2b93c409505b60d5032d440f)
      internal_dns.tar.gz: ok (48729734 bytes, 95491681a43338842bebce991de554004c2bbf071527bbb63cbcf9bf9a000a9d)
      nexus.tar.gz: ok (119247846 bytes, f17628abaa1531b554002498fc376c437dfc04695d6f11d2c16d348c151067a5)
      ntp.tar.gz: ok (15482678 bytes, b1745865f3e7f3611e1337eada1d689fa2b0cdd9ac6a7a8344498a87c4be9591)
      oximeter.tar.gz: ok (59604394 bytes, 66780eded59644714837f35540c24aaa0ba860564928fc79fc77542562147ef8)
      probe.tar.gz: ok (2830183 bytes, 4793e66d533f90a30b07df019d5a12c669154948ded83c6830651525739bfb73)
    file = sled-agent/zone-images/src/zone_manifest.rs:124
00:00:49.211Z INFO SledAgent (ZoneImageSourceResolver): found zone manifest for non-boot disk
    boot_disk_path = /pool/int/e01f2b02-d921-45d3-9462-c811ad8779c5/install/zones.json
    boot_zpool = oxi_e01f2b02-d921-45d3-9462-c811ad8779c5
    data =
      clickhouse.tar.gz: ok (306292018 bytes, 3172061ff19040893018825f52fa2b453fe31f263ba84aaeaf849f272cf6d5a9)
      clickhouse_keeper.tar.gz: ok (291227523 bytes, 75a373659cb8e6bd716f01086b4d0585f02d2f13669d048e7b383c5e5b255a97)
      clickhouse_server.tar.gz: ok (306289602 bytes, db92814a437840a673ba67b5dc9f88db3e3dae907a4d5713ff0d39a0d7797d65)
      cockroachdb.tar.gz: ok (148732359 bytes, 1cacd2f2c6eff958d17d6be0cb6fe3f1edf1478821e63e528ba5f5216a044b5f)
      crucible.tar.gz: ok (45869552 bytes, c66e6a3b30bdeffa92006969d2f1d4b95bfff720d695347fd1367daf4e019f0e)
      crucible_pantry.tar.gz: ok (36536826 bytes, ceaaba17088a41e842245958eddc67e1335e6bb27ec168ff113d7ffed638b69c)
      external_dns.tar.gz: ok (48730446 bytes, 69884f2db26e684f09e9e9b07ec968b1df6e887e2b93c409505b60d5032d440f)
      internal_dns.tar.gz: ok (48729734 bytes, 95491681a43338842bebce991de554004c2bbf071527bbb63cbcf9bf9a000a9d)
      nexus.tar.gz: ok (119247846 bytes, f17628abaa1531b554002498fc376c437dfc04695d6f11d2c16d348c151067a5)
      ntp.tar.gz: ok (15482678 bytes, b1745865f3e7f3611e1337eada1d689fa2b0cdd9ac6a7a8344498a87c4be9591)
      oximeter.tar.gz: ok (59604394 bytes, 66780eded59644714837f35540c24aaa0ba860564928fc79fc77542562147ef8)
      probe.tar.gz: ok (2830183 bytes, 4793e66d533f90a30b07df019d5a12c669154948ded83c6830651525739bfb73)
    file = sled-agent/zone-images/src/zone_manifest.rs:313
    non_boot_path = /pool/int/f32c2d06-69d7-4df8-861a-23cb5684ca08/install/zones.json
    non_boot_zpool = oxi_f32c2d06-69d7-4df8-861a-23cb5684ca08
00:00:49.212Z INFO SledAgent (ZoneImageSourceResolver): found mupdate override for boot disk
    boot_disk_path = /pool/int/e01f2b02-d921-45d3-9462-c811ad8779c5/install/mupdate-override.json
    boot_zpool = oxi_e01f2b02-d921-45d3-9462-c811ad8779c5
    data = MupdateOverrideInfo { mupdate_uuid: 6fe38986-4cb2-4129-a240-fa40db0702f7 (mupdate_override), hash_ids: {ArtifactHashId { kind: ArtifactKind("control_plane"), hash: ArtifactHash("a3ac1e0e7756f7d4a558a48a7d1154ebba1565625ccee2d5434641d5dc2904d4") }, ArtifactHashId { kind: ArtifactKind("host_phase_2"), hash: ArtifactHash("030fb7e64e7897cbaa57f14fb78195b073ea862a43a81801e34130c21318b80e") }} }
    file = sled-agent/zone-images/src/mupdate_override.rs:106
00:00:49.212Z INFO SledAgent (ZoneImageSourceResolver): mupdate override for non-boot disk matches boot disk (present)
    boot_disk_path = /pool/int/e01f2b02-d921-45d3-9462-c811ad8779c5/install/mupdate-override.json
    boot_zpool = oxi_e01f2b02-d921-45d3-9462-c811ad8779c5
    file = sled-agent/zone-images/src/mupdate_override.rs:199
    path = /pool/int/f32c2d06-69d7-4df8-861a-23cb5684ca08/install/mupdate-override.json
    zpool_name = oxi_f32c2d06-69d7-4df8-861a-23cb5684ca08

sunshowers added a commit that referenced this pull request Jun 9, 2025
…8155)

Part of RFD 556. In upcoming work, sled-agent will check these hashes at
boot time, and mark an error if there's a mismatch.

The stack was [tested on a
racklette](#8237 (comment)).

//
// Some zones are distributed from the host OS image and are never
// placed in the install dataset; the Ramdisk enum variant more
// accurately reflects that we are only search `/opt/oxide` for those
Contributor

Suggested change:
- // accurately reflects that we are only search `/opt/oxide` for those
+ // accurately reflects that we are only searching `/opt/oxide` for those

// Any zones not part of the RAM disk are managed via the
// zone manifest.
//
// XXX: we ask for the boot zpool to be passed in here. But
Contributor

Do we actually need to pass in the boot_zpool here? Can we remove it and use the cached version?

Contributor

Hah, I was going to ask a question in the other direction - could we loop over all the current internal disks, and append paths for any disk that is (a) present and (b) has a successful zone manifest result?
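
The alternative raised in this comment can be sketched roughly as below. It is only an illustration of the suggestion, not something implemented in this PR: `InternalDisk` and `install_search_paths` are hypothetical names, and the two conditions are reduced to booleans.

```rust
use std::path::PathBuf;

/// One internal disk as seen by the resolver (hypothetical type).
#[derive(Debug, Clone)]
pub struct InternalDisk {
    /// Install dataset directory on this disk.
    pub install_dir: PathBuf,
    /// (a) Whether the disk is currently present.
    pub present: bool,
    /// (b) Whether the zone manifest on this disk was read and validated.
    pub manifest_ok: bool,
}

/// Collect search paths from every internal disk that is present and has a
/// successful zone manifest result, preserving the input order.
pub fn install_search_paths(disks: &[InternalDisk]) -> Vec<PathBuf> {
    disks
        .iter()
        .filter(|disk| disk.present && disk.manifest_ok)
        .map(|disk| disk.install_dir.clone())
        .collect()
}
```

With this shape the caller would no longer pass in a boot zpool at all; whether that is desirable is the open question in this thread.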

Contributor Author

> Do we actually need to pass in the boot_zpool here? Can we remove it and use the cached version?

We could, yeah.

> Hah, I was going to ask a question in the other direction - could we loop over all the current internal disks, and append paths for any disk that is (a) present and (b) has a successful zone manifest result?

We could also do this but it's a bit more complex.

Contributor Author

Would love to discuss this in a followup to avoid blocking the rest of the PR set from landing.

Comment on lines 202 to 203
// TODO: implement mupdate override here. This will return an
// error if the override isn't found.
Contributor

I don't quite follow this TODO - if there's no mupdate override, doesn't that mean we want to run the artifacts by hash as requested?

Contributor Author

Yeah I got this wrong. I meant to say that this would return an error if there is an issue retrieving the override -- but it's not worth saying this in this comment. Fixed.
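
The distinction settled in this exchange (no override means run as requested; a failure to retrieve the override is an error) can be sketched as a three-way match. The names `ZoneImageSource`, `OverrideReadError`, and `effective_source` are hypothetical. The branch for an override that is present is an assumption made for illustration, since the TODO under review had not implemented it yet.

```rust
/// Where a zone image should be loaded from (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ZoneImageSource {
    InstallDataset,
    Artifact { hash: String },
}

/// Failure to retrieve the mupdate override (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OverrideReadError(pub String);

/// Decide the effective image source given the outcome of the override lookup.
///
/// - `Err(_)`: there was an issue retrieving the override, so report an error.
/// - `Ok(None)`: no override is in place, so run the source as requested.
/// - `Ok(Some(_))`: an override is in place. ASSUMPTION: fall back to the
///   install dataset; this thread does not specify the behavior.
pub fn effective_source(
    requested: ZoneImageSource,
    override_lookup: Result<Option<String>, OverrideReadError>,
) -> Result<ZoneImageSource, OverrideReadError> {
    match override_lookup {
        Err(e) => Err(e),
        Ok(None) => Ok(requested),
        Ok(Some(_override_id)) => Ok(ZoneImageSource::InstallDataset),
    }
}
```

Modeling the lookup as `Result<Option<_>, _>` keeps "absent" and "could not be read" as separate cases, which is the confusion the original TODO comment introduced.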

@sunshowers sunshowers changed the base branch from sunshowers/spr/main.12n-sled-agent-dont-start-install-dataset-zones-on-mupdate-override-error to main June 19, 2025 00:14
@sunshowers sunshowers enabled auto-merge (squash) June 19, 2025 00:15
@sunshowers sunshowers merged commit b3656a8 into main Jun 19, 2025
17 checks passed
@sunshowers sunshowers deleted the sunshowers/spr/12n-sled-agent-dont-start-install-dataset-zones-on-mupdate-override-error branch June 19, 2025 01:46
sunshowers added a commit that referenced this pull request Jun 19, 2025
Add these two new zpool kinds that have stricter semantics, and allow
upcasts from them to `ZpoolKind`.

Depends on #8237.