RFD 459
Control plane component lifecycle

Some software components require special actions to be taken when new instances are deployed, gracefully removed, or expunged. This RFD describes the actions required for the components that make up the control plane.

Introduction

We use the same terminology as [rfd457]:

  • Components get added (deployed) when the planner decides to (e.g., when adding a sled).

  • Components may be gracefully removed (quiesced, then removed) when they’re currently functioning but the planner decides they’re no longer necessary or need to be moved (e.g., when removing a sled).

  • Components may be expunged when they are already not running and will never run again (e.g., because the operator has told us they have physically removed the sled and it will never return).

Zone dispositions

Once this RFD is implemented, each Omicron-managed zone will have a disposition stored in the blueprint: a Nexus-defined, desired end-state for the component[1]. The disposition may be:

  • In service: The component is fully operational and actively serving requests.

  • Quiesced: The component is present and may keep serving existing requests, but is not taking on new workloads or serving new requests.

  • Expunged: The component is not present. Like the expunged policy in [rfd457], this is a terminal disposition: once a component is expunged, it can never come back under the same identity.

Nexus will use blueprints and dispositions to implement a variety of operations:

  • graceful sled removal (deploy a new instance, quiesce the old one, remove the old one)

  • sled expungement (deploy a new instance and do whatever is required to expunge the old one)

  • eventually: component upgrade (deploy new instance, quiesce old one, remove old one)

  • eventually: component scale-up (add a new instance), when possible

  • eventually: component scale-down (quiesce and then remove an instance), when possible

Reconfigurator actions

We assume some actions apply to many zones when they’re added or removed:

  • All Omicron zones will have DNS entries added when brought into service.

  • All Omicron zones will have DNS entries removed when removed from service.

  • Many Omicron zones may need to have their metric producer registrations removed from Oximeter when they’re removed. (See omicron#5284).

  • All zones should have existing log files and dump files archived. (See omicron#3860 and omicron#4906).

Boundary services configuration is mostly out of scope here. (As of this writing, as discussed in omicron#4822, Nexus will include an [rfd373]-style reliable persistent workflow (RPW) to reconfigure boundary services as needed for whatever’s currently deployed. This will run separately from blueprint planning and execution.)

The summary below lists the component-specific actions required when instances are added to service, removed from service (quiesced), or expunged. More details follow in the per-component sections.

  • Clickhouse

    • Add to service: Rewrite and reload the XML configuration for all nodes.

    • Remove from service (quiesce): Stop writing to this node from oximeter. Flush any send queues, disable any receive queues, and wait for sends to complete. Stop the server. Perform the expunge steps.

    • Expunge: Rewrite and reload the XML configuration for all nodes. Wait for the replica to be marked inactive in the system.replicas table on a still-existing node, then drop the expunged replica on that node via SYSTEM DROP REPLICA.

  • Clickhouse Keeper

    • Add to service: Generate XML configuration for the new keeper node and start the new node with that configuration. Rewrite and reload the XML at all other keeper nodes. Rewrite and reload the XML configuration at all clickhouse server nodes to point to the latest keeper configuration.

    • Remove from service (quiesce): Rewrite and reload the XML configuration at all keeper and clickhouse server nodes, including the keeper node being removed. Wait for the reconfiguration to complete. Stop the dropped keeper node.

    • Expunge: Same as quiesce, minus the reconfiguration and stop at the expunged node itself. Note that you cannot reconfigure if only two keeper nodes remain and one is down; that requires disaster recovery.

  • CockroachDB

    • Add to service: No runtime action needed.

    • Remove from service (quiesce): Decommission the node and wait for that to complete.

    • Expunge: Decommission the node.

  • Crucible: Covered in [rfd457] as part of add/remove/expunge sled.

  • Crucible Pantry

    • Add to service: No runtime action needed.

    • Remove from service (quiesce): Issue deactivation requests for each activated volume. Wait for in-progress operations to finish.

    • Expunge: No runtime action needed? See also omicron#3763.

  • Customer instances: Covered in [rfd457] as part of add/remove/expunge sled.

  • External DNS

    • Add to service: No runtime action needed.

    • Remove from service (quiesce): No runtime action needed.

    • Expunge: The planner should deploy a replacement zone using the same external IP. The customer assumes this IP will be hosting a working nameserver.

  • Internal DNS

    • Add to service: No runtime action needed.

    • Remove from service (quiesce): No runtime action needed.

    • Expunge: No runtime action needed.

  • Nexus

    • Add to service: Update external DNS.

    • Remove from service (quiesce): Update external DNS. Quiesce like a typical HTTP service. Stop running new sagas.

    • Expunge: Update external DNS. Re-assign sagas to another instance.

  • Oximeter

    • Add to service: No runtime action needed.

    • Remove from service (quiesce): Same actions as expungement. Wait for all assignments to be removed.

    • Expunge: Remove this Oximeter from the list of candidates used when assigning a collector for a new producer. For each producer assigned to this Oximeter, find another Oximeter and re-assign the producer to it.

  • Boundary NTP

    • Add to service: No runtime action needed.

    • Remove from service (quiesce): No runtime action needed.

    • Expunge: No runtime action needed.

  • Internal NTP

    • Add to service: No runtime action needed.

    • Remove from service (quiesce): No runtime action needed.

    • Expunge: No runtime action needed.

CockroachDB

Being "in service" means that the node is part of the CockroachDB cluster, storing its share of data (range replicas), and handling requests. Fortunately, CockroachDB handles most of this for us.

Zone added: No action needed. The SMF start method for all CockroachDB instances looks up the current set of nodes in the cluster via internal DNS. It instructs the current instance to join a subset of those nodes. So when a new CockroachDB instance is deployed, it will find the existing nodes that way. CockroachDB takes care of distributing information about the new node to all running nodes (non-persistently). It also takes any database-level actions required for adding a new node (like moving copies of data to the new node). If any node restarts, the start method behavior mentioned above ensures that it finds enough existing nodes to complete this process. This includes the cold start case where all nodes are initially offline.
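
To make the join behavior concrete, here is a rough sketch of what the start method does, expressed in Rust rather than the actual SMF shell script. The flags, paths, and the idea of joining only a few peers are illustrative, not the real implementation:

    use std::process::Command;

    /// Conceptual sketch of the SMF start method: given the CockroachDB nodes
    /// currently listed in internal DNS (passed in as `peers` here), start
    /// this node and point it at a few of them. Flags and paths are
    /// illustrative.
    fn start_cockroach(listen_addr: &str, store: &str, peers: &[String]) -> std::io::Result<()> {
        // Joining a handful of peers is enough; CockroachDB gossips the full
        // cluster membership once this node connects.
        let join = peers.iter().take(3).cloned().collect::<Vec<_>>().join(",");
        Command::new("cockroach")
            .args([
                "start",
                "--insecure",
                "--listen-addr", listen_addr,
                "--store", store,
                "--join", join.as_str(),
            ])
            .spawn()?;
        Ok(())
    }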

Zone quiesced or expunged: Use the CockroachDB decommission process, which will evacuate data from the node and then remove it from the cluster. For quiesce, we’ll want to use the cockroach node CLI (or maybe the Cluster API?) to wait for this process to complete. For expungement, there is nothing to wait on at the node itself, since the zone is already gone; we only need the cluster to decommission it. Independent of any add/remove activities, the system should always monitor cluster status, including replication status, and raise an alarm when we find problems.
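
As a rough sketch of the quiesce-side execution step (the host handling, the use of insecure mode, and the error handling are placeholders, not the actual implementation):

    use std::process::Command;

    /// Sketch: ask the cluster to decommission `node_id`. With the CLI's
    /// default `--wait=all` behavior this blocks until the node's range
    /// replicas have been moved elsewhere, which covers the "wait for that to
    /// complete" part of quiesce.
    fn decommission_node(cockroach_host: &str, node_id: &str) -> Result<(), String> {
        let status = Command::new("cockroach")
            .args(["node", "decommission", node_id, "--insecure", "--host", cockroach_host])
            .status()
            .map_err(|e| e.to_string())?;
        if !status.success() {
            return Err(format!("decommission of node {node_id} failed: {status}"));
        }
        // A belt-and-braces check could follow here, e.g. confirming via
        // `cockroach node status --decommission` that the node's membership
        // reads "decommissioned" before removing the zone.
        Ok(())
    }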

External DNS

Being "in service" means that:

  • clients on the customer network know about this instance (discovery)

  • clients on the customer network can reach this instance

  • the instance is correctly serving DNS requests

Client discovery is fixed via DNS delegation (glue records) so there’s nothing we need to do or can do aside from ensuring that we always have DNS servers running on all the IPs in the glue records.

Per our assumptions above, once the zone exists, boundary services will be configured so that the zone is reachable from the customer network using its external IP.

Once the zone exists, Nexus will propagate the correct DNS configuration to it so that it can correctly serve DNS requests. This depends on omicron#4889.

Zone added: Summarizing the above, no extra actions are needed to bring it into service.

Zone quiesced: No actions needed. Ideally we’d tell clients to stop using this instance and wait for that to happen, but we don’t control those clients.

Zone expunged: No actions needed, but the planner should deploy a replacement zone using the same external IP because the customer assumes that IP will be hosting a working nameserver. omicron#4889 is also needed here.

Internal DNS

Being "in service" means that:

  • clients within the control plane know about this instance (discovery)

  • clients within the control plane can reach this instance

  • the instance is correctly serving DNS requests

Client discovery happens either using the fixed addresses at which we run internal DNS servers or (maybe, in the future) via the working internal DNS servers. Nothing special is needed here.

For routing to work, the network must be configured to put this DNS server’s IP on this sled. This is a little different than usual. Most control plane services use IPs within a sled-specific subnet and no special action is needed to route a new address on this subnet. The fixed internal DNS IPs are on individual subnets, so the new zone’s subnet needs to be mapped to this sled. This happens using a DDM advertisement. Both this and the assignment of the address to the zone happen automatically as part of the internal DNS zone setup process.

In terms of serving the right records, the behavior here is exactly the same as for [_external_dns] and similarly depends on omicron#4889.

Nexus

Being "in service" means that:

  • clients on the customer network know about this instance (discovery)

  • clients on the customer network can reach this instance

  • the instance is serving requests, running sagas, etc.

Customer clients discover Nexus instances via our own external DNS service. So we need to keep this updated when zones are added or removed from service.

Per our assumptions above, once the zone exists, boundary services will be configured so that the zone is reachable from the customer network using its external IP.

Nexus instances can immediately start serving requests. During quiesce, they need to be instructed to gracefully stop to minimize impact.

Zone added: The new Nexus zone’s external IP must be added to external DNS for all Silos' DNS names.

Zone quiesced:

  • The quiesced Nexus instance’s external IP should be removed from DNS so that clients are told to stop using it. (DNS clients at-large are notorious for not noticing stuff like this. When we’re replacing a Nexus instance by adding one and removing another, we might prefer to use the same IP to minimize impact.)

  • Nexus should be explicitly instructed to quiesce (see the sketch after this list). Ideally:

    • stop accepting new TCP connections (close listen socket)

    • close TCP connections the next time a request completes on them (stop using HTTP keep-alive)

    • time out idle TCP connections as usual

    • do not start any new sagas (generate 503 error instead)

  • Monitor Nexus until it has quiesced:

    • no outstanding external TCP connections

    • no in-progress sagas
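
A minimal sketch of the bookkeeping this implies inside Nexus. The type and method names here are invented for illustration; the real mechanism would hook into the HTTP server and the saga subsystem rather than live in a standalone struct:

    use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

    /// Hypothetical quiesce bookkeeping shared between the HTTP layer and the
    /// saga subsystem. (Illustrative only.)
    pub struct QuiesceState {
        quiescing: AtomicBool,
        sagas_in_progress: AtomicU64,
    }

    impl QuiesceState {
        pub fn new() -> QuiesceState {
            QuiesceState {
                quiescing: AtomicBool::new(false),
                sagas_in_progress: AtomicU64::new(0),
            }
        }

        /// Invoked when this Nexus is told to quiesce. At the same point the
        /// HTTP server would stop accepting new TCP connections and disable
        /// keep-alive on existing ones.
        pub fn start_quiesce(&self) {
            self.quiescing.store(true, Ordering::SeqCst);
        }

        /// Saga-start path: returns false while quiescing, which the caller
        /// maps to a 503 response instead of starting the saga.
        pub fn try_start_saga(&self) -> bool {
            if self.quiescing.load(Ordering::SeqCst) {
                return false;
            }
            self.sagas_in_progress.fetch_add(1, Ordering::SeqCst);
            true
        }

        /// Invoked when a saga completes (successfully or not).
        pub fn saga_finished(&self) {
            self.sagas_in_progress.fetch_sub(1, Ordering::SeqCst);
        }

        /// Polled by the executor; a real check would also confirm that there
        /// are no outstanding external TCP connections.
        pub fn is_quiesced(&self) -> bool {
            self.quiescing.load(Ordering::SeqCst)
                && self.sagas_in_progress.load(Ordering::SeqCst) == 0
        }
    }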

Zone expunged: We need to reassign any in-progress sagas to another Nexus instance so that they don’t get left in limbo (see the sketch after this list). Per [rfd289], the other instance must be running the same software version as the one being expunged.

  • Find another Nexus instance at the same version to take over in-progress sagas.

  • For any in-progress sagas owned by the Nexus being expunged, update the database to change the Saga Execution Coordinator (SEC) to the other Nexus instance.

  • Poke the other Nexus to tell it to adopt sagas assigned to it that it doesn’t already know about. (This will be the same saga recovery code path that Nexus uses on startup, just ignoring sagas that it’s already operating.)
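
Expressed as a sketch, the re-assignment is roughly one database update plus a recovery poke. The table and column names (saga, current_sec, adopt_generation, saga_state) and the use of tokio_postgres are assumptions for illustration, not the actual schema or datastore interface:

    /// Hand ownership of any unfinished sagas from an expunged Nexus
    /// (`old_sec`) to a surviving Nexus at the same version (`new_sec`).
    /// Table and column names are illustrative assumptions.
    async fn reassign_sagas(
        db: &tokio_postgres::Client,
        old_sec: &str,
        new_sec: &str,
    ) -> Result<u64, tokio_postgres::Error> {
        // Move every saga that has not yet finished to the new SEC, bumping
        // the adoption generation along the way.
        let reassigned = db
            .execute(
                "UPDATE saga \
                 SET current_sec = $1::uuid, adopt_generation = adopt_generation + 1 \
                 WHERE current_sec = $2::uuid AND saga_state != 'done'",
                &[&new_sec, &old_sec],
            )
            .await?;

        // The adopting Nexus then runs its normal saga recovery pass (the
        // same code path used at startup), skipping sagas it already knows
        // about, e.g.: poke_saga_recovery(new_sec).await;  // hypothetical call

        Ok(reassigned)
    }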

Oximeter

Being "in service" means that:

  • the instance is eligible to be assigned to collect metric data from producers

  • the instance periodically collects data from any assigned producers

Zone added: No action needed. Oximeter will notify Nexus on startup, causing it to become eligible for assignment to new producers.

Zone quiesced or expunged (a sketch follows this list):

  • Remove this Oximeter from the list of candidates used when assigning a collector for a new producer.

  • For each producer assigned to this Oximeter, find another Oximeter and re-assign the producer to it.

    • Carry out normal assignment using this other Oximeter.

    • If quiescing, notify the Oximeter being removed to remove the assignment. (There will be a period where multiple Oximeters are collecting from this producer. This should generally be fine, just higher time-density of data points during this period.)

  • If quiescing, wait for the Oximeter being removed to have zero assignments.
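
A sketch of that loop, with invented stand-ins for the real assignment interfaces (the function names and signatures below are placeholders, not actual Nexus or Oximeter APIs):

    use uuid::Uuid;

    /// Re-point every producer currently assigned to `retiring` at one of the
    /// remaining collectors, then (for quiesce) wait for the retiring
    /// Oximeter to drain. All names are illustrative stand-ins.
    async fn retire_oximeter(
        retiring: Uuid,
        candidates: &[Uuid],   // remaining in-service Oximeter instances
        producers: &[Uuid],    // producers currently assigned to `retiring`
        quiescing: bool,
    ) {
        for (i, producer) in producers.iter().enumerate() {
            // Spread the reassigned producers across the remaining collectors.
            let new_collector = candidates[i % candidates.len()];

            // Carry out the normal assignment against the new collector.
            assign_producer(new_collector, *producer).await;

            if quiescing {
                // Tell the retiring Oximeter to drop its assignment. Until it
                // does, both collectors may scrape this producer, which just
                // means a denser stream of data points for a while.
                unassign_producer(retiring, *producer).await;
            }
        }
        if quiescing {
            // Don't remove the zone until the retiring collector reports zero
            // assignments.
            wait_for_zero_assignments(retiring).await;
        }
    }

    // Placeholder stubs standing in for the real collector-management calls.
    async fn assign_producer(_collector: Uuid, _producer: Uuid) {}
    async fn unassign_producer(_collector: Uuid, _producer: Uuid) {}
    async fn wait_for_zero_assignments(_collector: Uuid) {}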

Boundary NTP

Being "in service" means that:

  • the instance can reach its upstream NTP servers on the external network

  • internal NTP servers know about this server and can reach it

Provided we implement omicron#4791, no actions are needed to add or remove these zones from service. The normal zone deployment process will establish external connectivity. As described in omicron#4791, internal NTP servers will notice the DNS update for the boundary NTP service’s DNS name.

Internal NTP

This service has no dependents. Being in service just means that it can reach its dependencies (the internal DNS and boundary NTP services), for which we don’t have to do anything special, and that it sets the local clock accordingly (which it does automatically).

Clickhouse

Being "in service" means that this node:

  • is running

  • is part of the configuration of all the other clickhouse servers

  • is capable of replicating table data between servers and routing distributed table requests

  • is configured to reach a keeper cluster so that it can write data to replicated and distributed tables

  • is represented in DNS so that oximeter can find it

Please see the tracking issue for full reconfiguration details.

Zone added: A new configuration must be generated and stored in a blueprint by Reconfigurator. During the Reconfigurator execution phase, this configuration must be pushed out to all clickhouse zones, each of which runs a dropshot server for this purpose. The new node must be started explicitly; the other nodes will automatically reload their configuration when the updated configuration files are written to disk inside their zones.

Zone quiesced or expunged: A new configuration for the cluster must be generated in a blueprint and distributed to all existing zones, which will reload it automatically. There are some other steps to take during quiesce to ensure that any data gets replicated appropriately. The first step is to remove this node from DNS and ensure that oximeter stops writing to it. The rest involve SYSTEM statements and system table queries.
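
For the expunge half, the SYSTEM work might look roughly like this sketch, which shells out to the clickhouse client from a surviving node. The replica naming and host handling are placeholders, and the liveness check assumes a ClickHouse version whose system.replicas table exposes the replica_is_active map; see the tracking issue for the real procedure:

    use std::process::Command;
    use std::{thread, time::Duration};

    /// Sketch of the expunge-side SYSTEM statements, run against a surviving
    /// clickhouse server. Illustrative only.
    fn drop_expunged_replica(surviving_host: &str, expunged_replica: &str) -> Result<(), String> {
        let run_query = |sql: &str| -> Result<String, String> {
            let out = Command::new("clickhouse")
                .args(["client", "--host", surviving_host, "--query", sql])
                .output()
                .map_err(|e| e.to_string())?;
            if !out.status.success() {
                return Err(String::from_utf8_lossy(&out.stderr).into_owned());
            }
            Ok(String::from_utf8_lossy(&out.stdout).trim().to_string())
        };

        // Wait for the expunged replica to be reported inactive for every
        // replicated table on this node.
        loop {
            let active = run_query(&format!(
                "SELECT max(replica_is_active['{expunged_replica}']) FROM system.replicas"
            ))?;
            if active == "0" || active.is_empty() {
                break;
            }
            thread::sleep(Duration::from_secs(5));
        }

        // Drop the expunged replica's metadata so the cluster stops waiting
        // on it.
        run_query(&format!("SYSTEM DROP REPLICA '{expunged_replica}'"))?;
        Ok(())
    }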

Clickhouse Keeper

Being "in service" means that this node:

  • is running

  • is part of the configuration of all the other keeper nodes and clickhouse servers

Please see the tracking issue for full reconfiguration details.

Zone added: A keeper node must be started with a new configuration pointing to all the others. The configuration of the other nodes must be updated to include this new keeper node. The configuration at all clickhouse server nodes must also be updated to include the new keeper node.

Zone quiesced or expunged: The configuration should be updated at all keeper and clickhouse server nodes, including the node being removed in the quiesce case. The node should be removed from DNS and, if this is a quiesce, stopped.
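
For reference, the membership that gets rewritten lives in each keeper's raft_configuration section of its XML configuration. The fragment below (embedded as a Rust constant purely for illustration) shows the shape; the server ids, addresses, and port are made up, with real values coming from the blueprint:

    /// Illustrative shape of the keeper membership fragment that gets
    /// rewritten on every add/remove. Ids, hostnames, and ports are
    /// placeholders.
    const EXAMPLE_RAFT_CONFIGURATION: &str = r#"
    <keeper_server>
      <raft_configuration>
        <server><id>1</id><hostname>fd00:1122:3344:101::e</hostname><port>9234</port></server>
        <server><id>2</id><hostname>fd00:1122:3344:102::e</hostname><port>9234</port></server>
        <!-- newly added keeper -->
        <server><id>3</id><hostname>fd00:1122:3344:103::e</hostname><port>9234</port></server>
      </raft_configuration>
    </keeper_server>
    "#;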

Crucible Pantry

Being "in service" means that:

  • Nexus can use this instance for the various Crucible operations that use the pantry (bulk imports, checking hashes, etc.).

Zone added: No action needed. Common process will ensure that the instance shows up in DNS and Nexus instances will find it and start using it.

Zone quiesced: Ideally, deactivation requests should be issued to this Pantry for each activated volume. This isn’t strictly necessary: if Nexus decides to use another Pantry for the volume, activating the volume there will unconditionally take over the previous Pantry’s activation because of the increased generation number in the volume. We may want to wait for in-progress operations to finish (e.g., bulk imports).

Zone expunged: Since Pantry instances are stateless, no action is needed here. Anything that was running on this Pantry may encounter failures and may need to be retried; Nexus should pick a different Pantry if this happens. See omicron#3763.

Footnotes
  • 1

    [rfd457] uses the term policy to indicate operator-defined desired goals, and state for the actual, current system state. Here we introduce a new term, disposition, to indicate that this isn’t quite policy (it is only indirectly operator-controlled) but still expresses a desired state, which may not match the current state.
