RFD 457
Control plane sled lifecycle

Terminology

Individual physical servers in an Oxide rack are called sleds. Generally, these can be uniquely identified by a tuple of (Oxide part number, serial number).

The control plane also keeps track of "sleds". These are identified by UUIDs generated when the sled is added to the control plane.

These are not quite the same thing, in that:

  • A physical sled might be known to the control plane (having been discovered via the inventory process) without there being an associated control plane sled for it. This happens during the "add sled" process, after a physical sled is powered on and reachable from Management Gateway Service but before the sled is adopted by the control plane.[2]

  • The control plane’s "sleds" need not be associated with physical Oxide sleds: they can be PCs (running Oxide control software, like a physical sled does) or they can be totally simulated.

  • It is at least conceivable that a physical sled is initially associated with one control plane sled (i.e., has one UUID), then is retired from the control plane, and then is re-added to the system as a "new" sled. In that case, the UUID would be different. To the control plane, it’s a wholly different sled.

This RFD uses the terms physical sled and sled for these two concepts (or control plane sled for the latter, when we want to be pedantic).

There’s a similar relationship with physical disks and what the control plane calls a physical_disk (which we might think of as a control plane disk). The terminology is even more confusing because there are also user-provisioned [Crucible] disks.

We use a few terms to describe the lifecycle of sleds, disks, and other hardware or software components:

  • In service means that the component is fully available for use by the control plane and customer workloads (whatever that means for a given component). A sled that’s in service can be used to run customer VMs and store customer data.

  • Quiesced means that the component is present, the control plane knows about it and can manage it (e.g., upgrade it, detect and report faults), but its function is not currently available to the control plane or customer workloads. For example, a quiesced sled cannot run customer VMs. A quiesced disk cannot be used to read or write customer data or control plane data. A quiesced component can generally be brought back into service.

  • For the purposes of this RFD, failed means that a component is not reachable from the control plane for whatever reason. A sled that’s powered off, removed, rebooting, etc. would be considered failed until it either comes back or an operator marks it "expunged".

  • Expunged means that the component has been permanently removed.

  • Graceful removal is the process by which an in-service component is removed from the system without impact to overall system availability, durability, etc. You can think of this like: first the component is quiesced, then more drastic actions may be taken (e.g., to replicate its data to other sleds) to minimize impact when the component is expunged, and finally the component is expunged.

We’ll explain these in more detail below.

Basic lifecycle for sleds and disks

Oxide systems must support adding, permanently removing, and replacing both physical sleds and disks. It’s probably obvious why but it’s worth being explicit:

  • A new physical sled or disk could be added to a system to expand total system capacity (e.g., as part of going from a half-rack to a full-rack) or as part of sled or disk replacement.

  • A physical sled or disk could be removed from a system because it’s being replaced.

  • A physical sled or disk could be replaced for a few reasons:

    • at Oxide’s request in order to root-cause a failure that cannot be diagnosed remotely

    • because some failure prevents the sled or disk from operating normally and cannot be repaired remotely (e.g., failure of a non-field-replaceable hardware component)

    • to change to a different class of hardware (i.e., upgrade)

An operator may also want to temporarily remove a sled or disk from a system in order to perform some kind of physical maintenance, like replacing a DIMM or fan (for a sled). In this case, they may want to live-migrate instances and maybe quiesce control plane components (that depend on the sled or disk) in order to minimize API disruption, but not replicate all of its data nor generate a new sled or disk id (which might impact metrics and other reporting). Such a mechanism is outside the scope of this RFD.[3]

Adding a sled

When we talk about adding a sled, we expect the process looks like this:

  • The operator inserts a physical sled into a rack.

  • The control plane automatically learns about the new sled, including its part number and serial number.

  • The API, CLI, and console present the new sled to the operator.

  • The operator explicitly chooses to adopt the sled.

  • The sled adoption process automatically takes all steps necessary so that:

    • the new sled can be managed by the control plane like all the other sleds (i.e., can perform software upgrades, respond to faults, etc.)

    • customer resources (instances and storage) can be provisioned on the new sled

[rfd417] describes most of this process in more detail, though the part about being able to use the new sled for customer workloads is covered here.

For the purposes of this RFD, we will assume that a newly added sled is already running the same software images as the rest of the sleds in the system. In practice, we’re likely to achieve this by MUPdating the new sled prior to adding it to the system. In the future, the control plane could do this process automatically ("Nexus-driven sled recovery").

Removing a sled

As mentioned under Terminology, there are two different things that may be meant by removing a sled.

Graceful sled removal means that the sled is still functioning but the operator intends to remove it. The control plane works to migrate functionality elsewhere while minimizing disruption to both customer workloads and the control plane. Graceful removal might be used when upgrading hardware or to replace a sled that’s basically working (e.g., to debug it).

Expunging a sled means that the operator is telling the control plane that the sled is gone and will not return to service. Expunging a sled is needed if for any reason a sled just can’t function any more, as when it’s suffered a catastrophic hardware failure.

A critical difference between these two is that with graceful sled removal, the system works to ensure that the standard level of fault tolerance is maintained. With expungement, a failure has already compromised fault tolerance and the system is working to restore it. Concretely: suppose it takes 3 CockroachDB nodes for the system to function. The control plane might require that there’s always at least 5 working CockroachDB nodes so that the overall system can survive any 2 failures. With graceful removal of a sled with a CockroachDB node, a 6th node is provisioned and one of the original 5 is decommissioned. Thus, the system always has its minimum 5 nodes and can survive 2 failures. With expungement of a sled with a CockroachDB node, the system has only 4 nodes and works to restore the 5-node requirement. In the meantime, the system is up (because of the extra margin built into the 5-node policy), but it could only survive one additional failure instead of the usual two.
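To make the arithmetic concrete, here is a minimal sketch (the function name and the 5-node/3-node figures are illustrative, taken from the CockroachDB example above):

```rust
/// Illustrative only: how many additional node failures a replicated service
/// can absorb, given how many nodes it needs in order to keep functioning.
fn failures_tolerated(nodes_present: usize, nodes_required: usize) -> usize {
    nodes_present.saturating_sub(nodes_required)
}

fn main() {
    // Graceful removal: a 6th node is provisioned before one of the original
    // 5 is decommissioned, so the count never drops below 5 and the system
    // can still survive 2 failures.
    assert_eq!(failures_tolerated(5, 3), 2);

    // Expungement: the system is down to 4 nodes until redundancy is
    // restored, so it can only survive 1 additional failure.
    assert_eq!(failures_tolerated(4, 3), 1);
}
```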

Both of these are operator-driven. The control plane never autonomously decides to consider a sled permanently failed or removed.[4]

The following summarizes the differences, action by action:

Provisionability (applies to both graceful removal and expungement)

Immediately stop using the sled for:

  • customer Instance starts (provisions)

  • target for customer Instance live migrations

  • new Omicron zones (during blueprint creation)

  • region allocation for Crucible (disks, snapshots, etc.)

Trust quorum

  • Graceful removal: Sled is removed from trust quorum after the process completes.

  • Expunge: Sled is removed from trust quorum immediately.

Sled power state

  • Graceful removal: No change.

  • Expunge: Make sure the sled is off.

Existing customer Instances

  • Graceful removal: Kick off live migration. Wait for completion.

  • Expunge: Clean up instances and potentially start them elsewhere. Do not wait for completion.

Existing Crucible regions

  • Graceful removal: Kick off replacement based on other copies. Wait for completion.

  • Expunge: Kick off replacement based on other copies. Do not wait for completion.

Existing Omicron zones

  • Graceful removal: Remove zones from service. (See [rfd459].) Provision replacements elsewhere. Bring those into service. Wait for all this to finish. Remove existing Omicron zones.

  • Expunge: Provision replacements elsewhere and bring them into service. Do not wait for completion. Apply component-specific "expunge" actions (see [rfd459]).

Adding a disk

Like sleds, disks may be added to the system after initialization. We expect a process like:

  • The operator inserts a physical disk into a physical sled within the rack.

  • The control plane automatically learns about the new disk, including its part and serial number.

  • The API, CLI, and console present the new disk to the operator.

  • The operator explicitly chooses to adopt the disk.

    • This may be a configuration option, to "automatically approve" new disks.

  • The disk adoption process then automatically manages the disk (e.g., can perform firmware upgrades) and can provision resources to that disk.

Removing a disk

Like sleds, disks may be gracefully removed or directly expunged. Many of the actions above apply to removing a disk as well:

  • Any Omicron zone with its persistent storage on that disk will need to be provisioned elsewhere. This would generally be handled by Nexus and rely on application-level recovery to restore whatever data was lost. For example, if this happened to a CockroachDB node, Nexus would deploy another CockroachDB instance (anywhere in the rack, not necessarily on the same sled) and CockroachDB would take care of incorporating the new node and restoring the right number of copies of whatever data was on the failed node. Similarly for DNS, Nexus would provision a new DNS server and the usual DNS propagation mechanism would ensure that it was configured properly.

  • Any Omicron zone with its transient storage (namely, its root filesystem) on that disk will need to be forcibly shut down and presumably started again with its root filesystem on a different disk. This too will be done by Nexus.

  • Any Crucible regions on the affected disk need to be replaced.

We may implement the corresponding parts of sled removal in terms of "first, gracefully remove / expunge the disks in this sled."

The following summarizes the differences, action by action:

Provisionability (applies to both graceful removal and expungement)

Immediately stop using the disk for:

  • target for new "transient zone filesystem" datasets

    • NOTE: This is currently managed autonomously by the Sled Agent - perhaps it should be controlled more explicitly by Nexus?

  • target for new datasets

  • target for new Crucible regions

  • target for internal sled-local storage (e.g., logs, zone bundles, etc)

Disk Usage By Sled Agent (applies to both graceful removal and expungement)

Coordinate with the Sled Agent to ensure that the disk, as well as the corresponding zpool, is not used locally.

Existing Crucible regions

  • Graceful removal: Kick off replacement based on other copies. Wait for completion.

  • Expunge: Kick off replacement based on other copies. Do not wait for completion.

Existing Omicron zones

  • Graceful removal: Remove zones from service. (See [rfd459].) Provision replacements elsewhere. Bring those into service. Wait for all this to finish. Remove existing Omicron zones.

  • Expunge: Provision replacements elsewhere and bring them into service. Do not wait for completion. Apply component-specific "expunge" actions (see [rfd459]).

Determinations

The key determinations here are the policy and state associated with both sleds and disks: what they mean and how they’re stored. This section also describes a bunch of flows explaining how we expect these fields to change. But the details of those flows may change as the implementation evolves and they may not match this RFD exactly.

Sleds

Sled policy stored in the sled table

In the sled table, we store a sled_policy column with several values:

  • in_service: normal operation. The sled may be used to run customer VMs, to run control plane components, to store customer data (Crucible volumes), or to store control plane data (CockroachDB data, Crucible data, log files, TUF repos, support bundles, etc.)

  • no_provision: The sled may be running customer or control plane workloads, but may not be used for new allocations. No new customer VMs, control plane components, customer data, nor control plane data may be allocated here. This is kind of a weird policy because workloads may still be running on the sled so it’s not truly quarantined. But it’s simple, quick, reliable, and still potentially useful. This policy is currently implemented by the sled.provision_state enum value non-provisionable.

  • [quiesce: Quiesce the sled so that it can be temporarily powered off for maintenance. For quiesce, "easy" steps are taken to minimize impact (like live-migrating VMs to other sleds). Drastic steps (like re-replicating all Crucible data) are not. In terms of availability, quiesced sleds are like failed sleds. If the system is configured to survive two simultaneous sled failures, a quiesced sled would count as one of those failures. This policy is out of scope for this RFD.]

  • graceful_remove: the operator has requested graceful removal of the sled. It is first quiesced, and then more drastic steps are taken to completely eliminate any availability or durability dependency on this sled. For example, all CockroachDB and Crucible data on this sled would be replicated elsewhere.

  • expunge: the operator has instructed us to treat the sled as removed

The sled policy always reflects the operator’s intent, not the actual state of the sled. Only certain transitions are allowed:

[Figure: Operator policy for a sled (allowed transitions)]

For the following purposes, allocation is only allowed when the sled_policy is in_service:

  • provisioning any control plane zone other than NTP or Crucible

  • provisioning any customer instance

  • provisioning any Crucible regions

Also in the sled table, we store a sled_state column with several values:

  • initializing: the sled is being added to the control plane but is not yet ready to take on any workloads

  • in_service: normal operation

  • [quiesced: The sled has been quiesced as defined above. As above, this state is out of scope for this RFD. It’s included to show how it might fit in here.[6]]

  • decommissioned: the sled policy was changed to graceful_remove or expunge and there are now no workloads (customer or control plane) on the sled any more. It may be powered off or removed without impact, if it wasn’t already.

[Figure: Sled states]
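As a rough sketch, the two columns and the allocation rule above might look like the following (the type and function names are assumptions here, not the real Omicron definitions):

```rust
/// Sketch of the operator-controlled sled_policy column described above.
#[derive(Clone, Copy, PartialEq, Eq)]
#[allow(dead_code)]
enum SledPolicy {
    InService,
    NoProvision,
    // Quiesce is out of scope for this RFD.
    GracefulRemove,
    Expunge,
}

/// Sketch of the control-plane-managed sled_state column described above.
#[derive(Clone, Copy, PartialEq, Eq)]
#[allow(dead_code)]
enum SledState {
    Initializing,
    InService,
    // Quiesced is out of scope for this RFD.
    Decommissioned,
}

/// Provisioning control plane zones other than NTP or Crucible, customer
/// instances, and Crucible regions is only allowed when the policy is
/// in_service.
fn allocation_allowed(policy: SledPolicy) -> bool {
    policy == SledPolicy::InService
}

fn main() {
    assert!(allocation_allowed(SledPolicy::InService));
    assert!(!allocation_allowed(SledPolicy::NoProvision));
    assert!(!allocation_allowed(SledPolicy::Expunge));
}
```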

Blueprint changes

Note
This section describes the changes we expect to make here only to flesh out the plan. The details may change as the implementation is built and this section need not be kept up to date.

The changes may not be literally as described here, but this is intended to describe the information that will be added to the blueprint.

  • New field: sled_state: BTreeMap<SledId, SledState>, where SledState is an enum with the variants Initializing, InService, [Quiesced,] and Decommissioned described above. The planner decides what state the sled should have and indicates that in the blueprint. The executor sets the state accordingly. More on this below.

  • New field: nexus_zones_expunged: BTreeSet<Uuid> to track Nexus zones that have been decommissioned. This is used to reassign sagas.
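A minimal sketch of these blueprint additions, using the field names above (the surrounding Blueprint struct is heavily elided, SledId is a stand-in for the real typed id, and the uuid crate supplies Uuid):

```rust
use std::collections::{BTreeMap, BTreeSet};
use uuid::Uuid;

/// Stand-in for the real typed sled id.
type SledId = Uuid;

/// See the sled_state column described earlier.
#[allow(dead_code)]
enum SledState {
    Initializing,
    InService,
    // Quiesced is out of scope for this RFD.
    Decommissioned,
}

/// Heavily elided sketch of a blueprint carrying the two new fields.
struct Blueprint {
    /// The state the planner has decided each sled should be in; the
    /// executor updates the sled table to match.
    sled_state: BTreeMap<SledId, SledState>,
    /// Nexus zones that have been expunged, so that their sagas can be
    /// reassigned during execution.
    nexus_zones_expunged: BTreeSet<Uuid>,
}

fn main() {
    let bp = Blueprint {
        sled_state: BTreeMap::new(),
        nexus_zones_expunged: BTreeSet::new(),
    };
    assert!(bp.sled_state.is_empty());
    assert!(bp.nexus_zones_expunged.is_empty());
}
```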

"Add sled" flow

Note
This section describes the changes we expect to make here only to flesh out the plan. The details may change as the implementation is built and this section need not be kept up to date.
  • [rfd417] "add sled flow":

    • Operator inserts the new sled.

    • As mentioned above, we assume that the operator is currently responsible for ensuring the sled is running compatible software. In practice, they will likely use Wicket to MUPdate the sled to the same software as the rest of the rack. In the future, this step will be automated.

    • The inventory process finds the new sled and stores basic information about it.

    • Operator uses the web console, CLI, or API to:

      • list uninitialized sleds (part numbers and serial numbers)

      • find the new sled (by part number / serial number)

      • invoke the "add sled" API for that sled

    • The system runs through the "add sled" process:

      • Nexus starts the Sled Agent on that sled (by telling a different Sled Agent to do so).

      • The sled is added to the trust quorum.

      • Sled Agent starts up and creates zpools on locally-attached storage devices, etc.

      • Sled Agent notifies Nexus about itself, its zpools, etc.

      • Nexus creates the control plane sled record in the sled table, having sled_policy = 'in_service' and sled_state = 'initializing'.

  • The blueprint planner is activated.[7]

    • The blueprint planner sees a new sled and generates a blueprint that, relative to the previous blueprint, just deploys an "internal NTP" zone on the new sled.

  • The new blueprint is made the current target.[7]

    • Execution of the new blueprint causes the internal NTP zone to be provisioned.

  • The inventory system finds the new NTP zone.[7]

  • The blueprint planner is activated again[7] and generates a new blueprint.

    • Because there’s now an NTP zone, the sled’s state in the blueprint is made InService.[11]

    • The blueprint planner sees the sled lacking Crucible zones and adds them to the new Blueprint.

  • As above, the new blueprint is made the target and then executed, deploying the zones and updating the state in the sled table.
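Putting the planner’s part of this flow together, a hedged sketch of the decisions described above (the types and function names are assumptions, not the real planner API):

```rust
// Sketch only: the two planning passes described above, not the real planner.

struct SledInfo {
    policy_in_service: bool,
    has_ntp_zone_in_inventory: bool,
    has_crucible_zones: bool,
}

#[derive(Debug)]
enum PlanStep {
    AddInternalNtpZone,
    MarkSledInService,
    AddCrucibleZones,
}

fn plan_for_new_sled(sled: &SledInfo) -> Vec<PlanStep> {
    let mut steps = Vec::new();
    if !sled.policy_in_service {
        return steps;
    }
    if !sled.has_ntp_zone_in_inventory {
        // First pass: relative to the previous blueprint, the only change is
        // to deploy an internal NTP zone on the new sled.
        steps.push(PlanStep::AddInternalNtpZone);
    } else {
        // Later pass: inventory shows the NTP zone, so the sled can be marked
        // InService and Crucible zones can be added for its zpools.
        steps.push(PlanStep::MarkSledInService);
        if !sled.has_crucible_zones {
            steps.push(PlanStep::AddCrucibleZones);
        }
    }
    steps
}

fn main() {
    let new_sled = SledInfo {
        policy_in_service: true,
        has_ntp_zone_in_inventory: false,
        has_crucible_zones: false,
    };
    assert!(matches!(
        plan_for_new_sled(&new_sled).as_slice(),
        [PlanStep::AddInternalNtpZone]
    ));
}
```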

In the end, we have:

  • a new control plane sled for the new physical sled (i.e., a row in the sled table with policy in_service and state in_service)

  • it’s running sled agent, internal NTP, and Crucible

  • it’s able to run customer Instances and Crucible regions

  • the control plane knows to allocate Instances and regions on it

"Expunge sled" flow

Note
This section describes the changes we expect to make here only to flesh out the plan. The details may change as the implementation is built and this section need not be kept up to date.
  • (optional) Operator physically removes the sled.

  • Operator uses the web console, CLI, or API to invoke the "expunge sled" API call

    • Synchronously, this call sets the sled’s sled_policy to expunge

  • The blueprint planner is activated.[7]

    • The blueprint planner generates a blueprint reflecting:

      • any Omicron zones that were on this sled are removed altogether,

      • any expungement actions needed for these zones per RFD 459. For example, any Nexus zones that are removed in this way have their ids added to nexus_zones_expunged: BTreeSet<Uuid> in the blueprint so that their sagas can be reassigned during execution.

      • any other changes implied by other policy — for example, if there should be 3 Nexus zones, and this removal step removes one, then the planner may create a third one.

  • The new blueprint is made the target.[7]

  • The blueprint executor:

    • lists physical disks on "expunged" sleds and marks them expunged (sets the physical_disk’s policy column to "expunged", per the mechanism introduced in oxidecomputer/omicron#3526). On conflict, do nothing — another executor got here first. This triggers replacement of any Crucible regions (details of which are outside the scope of this RFD).

    • lists instances on the sled and marks them failed in some way. (This should look similar to the above, but the mechanism doesn’t exist yet. See oxidecomputer/omicron#4872.)

    • if the sled being drained has all of its disks in state decommissioned and is no longer running any instances, then the sled’s state is updated to decommissioned

    • performs the other usual execution steps. For example, the removal of zones would result in a new internal DNS generation being created at this point.
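As a sketch of the decommissioning check in the third step above (the names here are assumptions; the real executor works against the database and the sled agents):

```rust
// Sketch only: the condition under which the executor may mark a drained
// sled decommissioned.

#[derive(Clone, Copy, PartialEq, Eq)]
enum DiskState {
    InService,
    Decommissioned,
}

struct SledView {
    disk_states: Vec<DiskState>,
    running_instance_count: usize,
}

/// All of the sled's disks are decommissioned and it is no longer running
/// any instances.
fn can_decommission_sled(sled: &SledView) -> bool {
    sled.disk_states.iter().all(|s| *s == DiskState::Decommissioned)
        && sled.running_instance_count == 0
}

fn main() {
    let drained = SledView {
        disk_states: vec![DiskState::Decommissioned, DiskState::Decommissioned],
        running_instance_count: 0,
    };
    assert!(can_decommission_sled(&drained));

    let still_busy = SledView {
        disk_states: vec![DiskState::Decommissioned, DiskState::InService],
        running_instance_count: 1,
    };
    assert!(!can_decommission_sled(&still_busy));
}
```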

We also want to take steps related to sled power; see "Ensuring a sled is really off" below. The exact timing and mechanism for this remain to be determined.

This process assumes two other processes already exist that aren’t described in this RFD:

  • a process that kicks off Crucible region replacement for all Crucible regions on all disks with policy expunge

  • a process that kicks off cleanup and possible next steps for instances that have been marked failed above (see oxidecomputer/omicron#4872)

From the blueprint planner/executor perspective, all that’s needed from these is a way to kick off these activities and to know when they’ve completed.

"Gracefully remove sled" flow

Details TBD and largely outside the scope of this RFD. This presumably starts similar to the "expunge sled" flow: the operator invokes an API to set the policy to graceful_remove. We then go through a sequence of blueprint generation and execution that provisions replacement components elsewhere, uses the quiesce operations (see [rfd459]) to remove components from service, and then removes the old components altogether and ultimately marks the sled decommissioned.

We may additionally want to take steps to explicitly wipe data from disks using the NVMe device’s secure-erase or other functionality.

Ensuring a sled is really off

For sled expungement, two things are critical:

  1. We must be sure that all customer instances are really off before we start running them somewhere else.

  2. We want to avoid parts of the control plane coming up and taking action, thinking they’re part of the real control plane. The specific problems here have not been characterized, but examples may include:

    • CockroachDB nodes coming up and rejoining the cluster and receiving replicated data.

    • Internal DNS zones coming up and the local system advertising their address, which is actually supposed to be on some other sled.

    • External DNS or Nexus zones coming up and the system configuring their external addresses to route to this zone, when it should instead be routing to some other zone that we’ve used that external IP for.

    • Nexus zones coming up and running blueprint planning and execution or otherwise taking actions to modify the system.

There are various tools at our disposal:

  • Through Ignition, we can control power to the cubby that the sled is plugged into. This would allow us to reliably ensure that a sled powers off at least temporarily, regardless of whether it’s working. This cannot ensure that it powers off indefinitely because the sled could be removed and accidentally inserted into a different cubby. Further, if we left the cubby powered off to try to keep the sled off, we wouldn’t know when it was safe to turn it back on without the operator telling us — and they could get that wrong anyway.

  • Through the service processor, we can put the sled into state A2, meaning the host is off. However, this would not work for a sled that was already removed and the system couldn’t tell if the sled was removed or experienced a transient error. Further, if the sled is unplugged and reinserted, it could go back to A0 (host on).

  • While we cannot do this today, in the future, the trust quorum mechanism could ensure that a sled that’s not supposed to be part of the cluster cannot bring the control plane online.

The Ignition-based power-cycle should work reliably to ensure customer instances are really off.

The trust quorum mechanism is likely required to ensure that the control plane does not start back up again. In the shorter term, various mitigations are possible:

  • put the sled into A2 whenever and wherever we see it (i.e., have Nexus monitor inventory reported by MGS and if we ever find a decommissioned sled in a power state other than A2, send it to A2)

  • in sled agent, on startup, ask Nexus if it’s decommissioned and send itself to A2 if so. This request may fail and the sled agent must continue startup in that case in order to support cold boot.
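A minimal sketch of the second mitigation (the sled agent’s startup check); the actual request to Nexus and its error type are elided here:

```rust
// Sketch only: the sled agent asks Nexus at startup whether this sled has
// been decommissioned; the real request and error type are elided.

struct RequestError;

enum PowerAction {
    /// Ask the SP to put the host into A2 (host off).
    GoToA2,
    /// Continue normal startup.
    ContinueStartup,
}

fn startup_decision(decommissioned: Result<bool, RequestError>) -> PowerAction {
    match decommissioned {
        // Nexus says this sled has been decommissioned: power the host off.
        Ok(true) => PowerAction::GoToA2,
        // Nexus says we're fine, or the request failed.  The request can fail
        // during cold boot (Nexus may not be up yet), so the sled agent must
        // continue startup in that case.
        Ok(false) | Err(_) => PowerAction::ContinueStartup,
    }
}

fn main() {
    assert!(matches!(startup_decision(Ok(true)), PowerAction::GoToA2));
    assert!(matches!(
        startup_decision(Err(RequestError)),
        PowerAction::ContinueStartup
    ));
}
```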

Neither of these would handle the rack-cold-start case, but could limit damage and probably be effective for most other cases.

Another idea floated is to ask the operator during sled expungement if the sled is still plugged in and believed to be working. If so, do not proceed unless we can successfully wipe it ourselves. This helps in the case where they say "yes", but if they say "no" and get it wrong, we have all the same issues.

Disks

Disk policy stored in the physical_disk table

In the physical_disk table, we store a policy column representing the operator’s intent for the disk. This column may be set by an operator. The values here match the corresponding sled policies:

  • in_service: normal operation, in which the disk may be used for allocations from the control plane. This includes durable datasets, transient zone root filesystems, control plane data, and user Crucible data.

  • graceful_remove: the operator has requested graceful removal of the disk

  • expunge: the operator has flagged the physical disk for permanent removal

Note that there is no option for "disk exists, but has not been added to the control plane yet". The physical_disk table stores information about physical disks that are known to be in-use by the control plane, and does not necessarily store information about disks which have not yet been approved by an operator — that information would come from a Sled Agent itself reporting inventory.

Additionally, we store a column in the physical_disk table recording the disk’s known state. This column may be set by Nexus. Its values similarly match the corresponding sled states:

  • in_service: the disk is operating normally

  • decommissioned: the disk policy was changed to graceful_remove or expunge and there are now no workloads (customer or control plane) using the disk any more. It may be removed without impact, if it wasn’t already.

We do not specify "quiesce" yet.
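Sketched as enums (names assumed, as with the sled sketch earlier), the disk columns parallel the sled ones but omit no_provision, initializing, and quiesce:

```rust
/// Sketch of the operator-controlled policy column on physical_disk.
#[allow(dead_code)]
enum DiskPolicy {
    InService,
    GracefulRemove,
    Expunge,
}

/// Sketch of the Nexus-managed state column on physical_disk.
#[allow(dead_code)]
enum DiskState {
    InService,
    Decommissioned,
}

fn main() {
    // Expunge is a one-way policy change; decommissioned is the terminal state.
    let _ = (DiskPolicy::Expunge, DiskState::Decommissioned);
}
```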

[Figure: Operator policy for a disk]
[Figure: Internal state for a disk]
  • When a disk is added to the system, it exists in the inventory, but no entry in the physical_disk table is created. Once the disk is explicitly approved by an operator (or implicitly approved, alongside a new sled), a new entry is created as "in service".

  • At any time, an operator may change the policy of a disk. Expungement is a one-way change that cannot be undone.

  • If the disk policy becomes graceful_remove or expunge, Nexus prepares the system to stop using the disk. This implies:

    • Cooperating with Crucible to create replacements for any Crucible regions stored on the disk.

    • If the Sled containing the disk is not also expunged:

      • Telling the Sled Agent to stop any zones that might be using the disk.

      • Telling the Sled Agent to stop using the disk for internal storage (e.g., logs).

    • If the Sled is itself expunged, no requests to it should be made

"Add disk" flow

Note
This section describes the changes we expect to make here only to flesh out the plan. The details may change as the implementation is built and this section need not be kept up to date.
  • Disk addition

    • Operator inserts the new disk.

    • The inventory process finds the new disk and stores basic information about it.

    • Operator updates the disk to ensure it’s running a sufficiently up-to-date firmware revision.

    • Operator uses the web console, CLI, or API to:

      • list uninitialized disks (part numbers and serial numbers)

      • find the new disk (by part number / serial number)

      • invoke the "add disk" API for that disk

    • The system runs through the "add disk" process:

      • Nexus contacts the Sled Agent, and identifies that the disk should be used.

      • Sled Agent formats zpools on the device, and notifies Nexus that the disk is in-service.

      • Nexus creates the control plane physical disk record (in the physical_disk table)

  • The blueprint planner is activated.[7]

    • The blueprint planner sees a new zpool and generates a blueprint that, relative to the previous blueprint, provisions a Crucible zone to this zpool on its sled.

    • Execution of the new blueprint occurs.

    • The blueprint executor causes the Crucible zone to be provisioned.

In the end, we have:

  • a new physical disk (entry in the physical_disk table) for the new physical disk

  • it’s running a Crucible zone managing a dataset on the disk. This zone is ready to receive requests for new Crucible regions, and Nexus uses this Crucible zone as a target for allocations.

  • additionally, Nexus knows about the zpool on this disk more generally, and can ask the sled to place transient zone root filesystems here when any zones are requested on this particular sled.

"Expunge disk" flow

Note
This section describes the changes we expect to make here only to flesh out the plan. The details may change as the implementation is built and this section need not be kept up to date.
  • (optional) Operator physically removes the disk.

  • Operator uses the web console, CLI, or API to invoke the "expunge disk" API call

    • Synchronously, this call sets the disk’s policy to expunge

      • Setting this policy should prevent future allocations from arriving at this disk.

      • Setting this policy should start the region replacement process.

  • The blueprint planner is activated.[7]

    • The blueprint planner generates a blueprint reflecting that this disk is no longer in-use. This means: all zones which previously used this physical disk (for durable storage, or for zone filesystems) should be removed from the blueprint.

    • (Eventually) The planner should restore redundancy by re-assigning zones elsewhere (e.g., CockroachDB nodes).

  • The new blueprint is made the target. (See above.)

  • Execution of the new blueprint occurs.

    • This sends a request to Sled Agents to use the new set of zones in the blueprint, which should no longer be using the expunged disk.

    • Sends a request to the sled, asking it to stop using the physical disk for other uses of internal storage.

Although we have the ability to remove power to physical disk slots (temporarily - this power-off would only last until the next reboot), it’s not necessarily the right call. Even decommissioned disks should be visible in the inventory system if they are still attached to a functional sled, and they may be instantiated as a new "control plane object" (via operator approval, and creation of a new physical_disk row in the database) if requested.

In lieu of slot power-off, we must contact the Sled Agent on disk expungement to cooperatively ask it to stop using that device.

  • If we cannot contact the Sled Agent for a Sled that is not expunged, we cannot disable the disk. It would be inaccurate to describe a disk as "expunged" if the Sled may continue using the device.

  • If we cannot contact the Sled Agent for a Sled that is expunged, we can skip this step.

This process can be implemented in a retry loop:

  • Loop:

    • If the Sled is expunged, we can set the disk state to decommissioned.

    • Otherwise, contact the Sled and ask it to stop using all resources allocated on that disk.

      • On success: we can set the disk state to decommissioned.

      • On failure: retry this whole loop.
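A minimal sketch of that retry loop (the sled agent request and the database update are hypothetical stand-ins):

```rust
// Sketch only: the retry loop for moving an expunged disk's state to
// decommissioned.  Both helper functions are hypothetical stand-ins.

use std::{thread, time::Duration};

/// Hypothetical: ask the Sled Agent to stop using all resources allocated
/// on this disk.
fn ask_sled_to_release_disk() -> Result<(), ()> {
    Ok(())
}

/// Hypothetical: update the physical_disk row's state column.
fn set_disk_state_decommissioned() {}

fn decommission_disk(sled_is_expunged: impl Fn() -> bool) {
    loop {
        // If the sled itself is expunged, skip the request to it entirely.
        if sled_is_expunged() || ask_sled_to_release_disk().is_ok() {
            // Either way, the disk state may now be set to decommissioned.
            set_disk_state_decommissioned();
            return;
        }
        // On failure, retry the whole loop.
        thread::sleep(Duration::from_secs(5));
    }
}

fn main() {
    decommission_disk(|| false);
}
```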

"Gracefully remove disk" flow

Details TBD, but it presumably starts similar to the "expunge disk" flow: the operator invokes an API to gracefully remove the disk. We then go through a sequence of blueprint generation and execution that provisions replacement components elsewhere, uses the quiesce operations (see [rfd459]) to remove components from service, and then removes the old components altogether and ultimately marks the disk state as decommissioned.

"Oops, I accidentally unplugged the wrong disk" flow

From the perspective of the control plane at large, if a U.2 is temporarily detached from and re-attached to a sled, the policy and state for that disk do not change. Effectively: Nexus will not ask the Sled Agent to alter any behavior, and it will continue using that physical disk as a target for provisions. These requests may fail (or lead to later failures) if the disk is still physically removed from the system at a later point in time.

The Sled Agent should be responsible for identifying this case and "re-establishing service" for a disk that has been removed and re-attached to the system. In the limit, this means that the Sled Agent should detect the removal, tear down resources which were using the disk (e.g., zones), and re-start them when the disk re-appears. If this condition is detected, we probably also want to notify the operator through a notification system, as it implies a risk of loss of availability.
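A hedged sketch of how the Sled Agent side of this might look (the event source, the teardown/restart helpers, and the notification call are all assumptions):

```rust
// Sketch only: how the Sled Agent might react to a U.2 disappearing and
// reappearing.  The event source and the helpers below are hypothetical.

enum DiskEvent {
    Removed,
    Reattached,
}

fn handle_disk_event(event: DiskEvent) {
    match event {
        DiskEvent::Removed => {
            // Tear down resources that were using the disk (e.g., zones whose
            // durable or transient storage lived on it).
            tear_down_local_users_of_disk();
            // This implies a risk of loss of availability, so tell the
            // operator about it.
            notify_operator("disk removed unexpectedly");
        }
        DiskEvent::Reattached => {
            // Re-establish service: restart what was torn down above.
            restart_local_users_of_disk();
        }
    }
}

fn tear_down_local_users_of_disk() {}
fn restart_local_users_of_disk() {}
fn notify_operator(_msg: &str) {}

fn main() {
    handle_disk_event(DiskEvent::Removed);
    handle_disk_event(DiskEvent::Reattached);
}
```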

Priorities

Based on customer need, the immediate plan is to implement sled expungement for sleds running Crucible and Nexus instances, plus "sled add" for sleds that need to run Crucible and Nexus (plus customer workloads). Graceful removal and support for add/expunge of other kinds of components will happen later (but are still in scope for this RFD). See [rfd459] for more on what’s involved in add/quiesce/expunge of other control plane components.

Open questions

Footnotes
  • 2

    This also happens in environments like dogfood and other test environments where some sleds that are physically cabled to the rest of the system are left out of initial setup for whatever reason.

  • 3

    If a customer needs to do this and we have not implemented "temporary graceful removal", they’d have two options: (1) just power off the sled, perform the maintenance, and power it back on. This will of course be disruptive to VMs running on that sled and potentially some API operations, but it will preserve the data and the sled’s id. Or (2) perform a graceful removal of the sled and then re-add it to the system after maintenance is complete. This would have little runtime impact, but it would result in re-replicating the sled’s data and generating a new id, which means the sled could be different for metrics/reporting purposes. Obviously neither of these is as good as having a first-class temporary drain feature.

  • 4

    This would of course be possible, but such automation is risky. This major EBS outage is a striking example of how dangerous it can be for the system to incorrectly classify a transient issue as permanent and attempt to take drastic corrective action (like replicating a whole volume’s worth of data). We expect that sled failure severe enough (and long-lasting enough) to warrant replacing the Crucible regions on that sled should be rare enough that operator engagement is a worthwhile requirement to avoid the system choosing to do this when it really shouldn’t.

  • 6

    You’d like this state to mean "the sled is quiesced". Is that possible? What if an operator changes the policy back to in_service? What if we then provision stuff on it? Could that happen before the state was changed to in_service? What if the policy was then changed to quiesce? Now the state might say quiesced but we didn’t remove whatever was added during that window it was in-service. Maybe the answer is to set the state to in_service whenever the policy is successfully changed to in_service?

  • 7

    Items marked with this footnote are currently driven by manual operator action (e.g., invoking an internal API to generate a new blueprint). The intent is that once we’re satisfied with the behavior of the system, we will have it automatically generate blueprints, set them as the current target, etc. So as of now, steps marked with this footnote may be manual, but we expect them to be automatic in the future.

  • 11

    Note that the clock is not necessarily sync’d yet. However, any check that the clock is in sync is subject to a time-of-check-to-time-of-use race anyway — the sled could reboot in the meantime and become unsynchronized again. Instead, we’ll mark the sled InService right away, assume the "unsynchronized" condition is transient, and allow the allocation process to retry as needed to work around this transient failure.
