RFD 373
Reliable Persistent Workflows

Much of the control plane functionality ([rfd48]) involves not just taking a single discrete action, but rather monitoring the overall system state, comparing it to the intended state, and taking whatever actions are necessary to bring those into sync. For example, the software upgrade system is not just "update this component to version X", but rather "ensure that this component is currently running version X". One difference is that if for some reason after upgrade we discover that the component is running some other version, we should at least flag that as a problem and potentially update it again to version X. [rfd107] describes this as the difference between a one-shot workflow and a reliable persistent workflow (RPW). Reliable persistent workflows are likely to be useful for things like DNS propagation (see [rfd367]), software update, service rebalancing, overall load balancing of customer instances, distribution of networking configuration, and more.

This RFD seeks to describe:

  • the general idea of a reliable persistent workflow

  • a few possible patterns that might be used to implement RPWs, along with their tradeoffs

  • various bad patterns that might be used to implement RPWs so that we can determine not to use them

Following from this RFD, we hope to provide some common facilities in Nexus for defining RPWs using the good patterns.

Reliable persistent workflows (RPWs)

What are they?

Consider these pieces of control plane functionality:

  • Distribution of DNS data to internal and external DNS servers. Nexus receives external API requests that change the DNS data. The updated DNS data needs to get propagated to the corresponding DNS servers. For example, when a new public-facing Nexus instance is created, all the external DNS servers need to report it. ([rfd367])

  • Distribution of virtual networking configuration (e.g., VPC routes and firewall rules). Nexus receives external API requests to change the configuration of a VPC and must propagate this (asynchronously with the request) to Boundary Services and the Sled Agents that need to know about it. ([rfd63])

  • Software update. Nexus receives API requests to move the system from one System Version to another. It then makes an update plan describing a sequence of actions that will move the system to the new System Version. ([rfd365])

  • Service (re)balancing. By this, we mean ensuring that we have the expected number of instances of all control plane components. Some part of Nexus must be responsible for ensuring that if there are supposed to be five CockroachDB nodes, and a sled hosting one of those is permanently removed, then another one must be deployed. It also needs to support changing the size of the CockroachDB cluster. This applies to many other internal services, like Nexus and Internal DNS.

  • Instance load balancing. While probably not part of the MVP, eventually the control plane will need a facility that monitors the utilization of individual sleds and moves workloads to better balance load. The details are well beyond the scope of this RFD. However, this is a useful case to consider in part because unlike most of the other cases here, there’s no explicit action that changes the desired state of the world. Instead, the actual state (the level of load on all systems) unexpectedly drifts from what we intended.

What’s common among these examples?

  • There’s intended state in the control plane database (e.g., the configured firewall rules or DNS data).

  • There’s runtime state in one or more running programs (e.g., DNS servers, Sled Agents, OPTE).

  • Changes to the runtime state usually happen as a result of explicit operations inside Nexus (e.g., a request to change the networking configuration).

    • In some cases, the runtime state can drift as a result of explicit action that is hard to prevent (e.g., an operator replaces one SSD with another, and now the set of deployed SSD firmware has changed and the update system needs to kick in and apply an update).

    • In some cases, the runtime state can drift without the control plane’s involvement, or as a result of a bug (e.g., if a sled reboots, it may no longer have any of its VPC networking configuration[1])

  • In all cases, these changes activate a subsystem in Nexus, causing it to take actions to bring the system to its intended state. These actions generally involve making requests to the services where the runtime state lives. We’ll call these services targets. Of course, these requests can fail. Nexus needs to continue trying. It should not come to rest (or become at-rest) until all necessary changes have been propagated. Note that not all RPWs have targets: "update propagation" might mean something like "provisioning some VMs".

  • It’s possible for the intended state or runtime state to continue changing while Nexus is trying to bring them into sync! If that happens, often we want Nexus to adjust its plan rather than to keep trying to get to an intermediate state first.

graph LR;
	intended["Intended state (in CockroachDB)"]
	rpw[RPW]
	runtime1["Runtime state (Target 1)"]
	runtime2["Runtime state (Target ...)"]
	runtime3["Runtime state (Target N)"]

	intended --> rpw

	rpw -- "propagate" --> runtime1
	rpw -- "propagate" --> runtime2
	rpw -- "propagate" --> runtime3

With this phrasing, other examples come to mind:

  • Fault management. Here, the runtime state of the world changes unexpectedly and Nexus needs to take some action to bring the system as close to its intended state as possible. It needs to continue trying to do this until it succeeds. (broadly)

  • Crucible: removing read-only parents by scrubbing blocks. ([rfd177])

We call these behaviors reliable persistent workflows (RPWs).[2] We’ve also seen this called the "reconciliation pattern" or a "controller" (a term that’s unfortunately pretty overloaded in software).

Note
Do we like the term "reconciler" better than "reliable persistent workflow"?

Unfortunately, there’s no single right way to implement RPWs. There are a few patterns that work, and there are many patterns that seem like they might work but either don’t solve the problem or create new, harder problems.

Not distributed sagas

The examples of RPWs above are pretty different from the use case for distributed sagas (provided by [steno] to implement [rfd107]'s "one-shot workflows"). With a saga, we define the whole directed acyclic graph (DAG) of actions that will achieve some specific goal like "provision an Instance". Sagas have several properties that make them unsuitable for what we’re trying to do:

  • Sagas have a defined beginning and end, but we need something that’s always running.

  • Sagas unwind (run undo actions) on failure, but we never want RPWs to fail. We want them to continue trying to bring the intended state and runtime state into sync.

  • The saga graph is immutable once the saga starts executing, but we need something that can adapt to changes in the intended state or runtime state that happen while propagating earlier changes.

Fundamentally, the goal of sagas is to decompose complex one-shot sequences into small pieces so that:

  1. The overall saga can be resumed when the SEC crashes. This is important because new sagas are created at runtime (e.g., when a user provisions an instance) and we never want to just forget about these operations. This isn’t a problem with RPWs because when Nexus starts up, we know all of the RPWs that we need to do (software update, fault management, DNS propagation, etc.). We don’t expect to add new RPWs at runtime, so we don’t have to worry about forgetting about them.

  2. The overall saga can be reliably unwound on failure. Again, we don’t want this with RPWs.

  3. The saga’s actions can be scheduled (and parallelized) based on dependencies between the actions. This is a lesser concern than the first two. It’s nice to have, but as long as we don’t have to worry about crash-recovery (see #1), we can use ordinary programming abstractions.

Sagas can potentially be part of the implementation of RPWs, but this too is tricky. More on this under "Pitfalls" below.

Assumptions

We take the following as given:

  • To the extent that executing RPWs requires coordination, Nexus will drive that coordination (per the principles in [RFD 61]).

  • To the extent that executing RPWs requires storing persistent state, that state will be stored in CockroachDB (per the principles in [RFD 61]).

  • There may be multiple Nexus instances at any given time, jointly responsible for executing RPWs.

  • Nexus instances may be deployed and undeployed at pretty much any time.[3]

  • Nexus instances may unexpectedly disappear for short periods (because of a Nexus crash, sled crash, or transient networking issue) at any time and then reappear again at any time.

  • Nexus instances may unexpectedly disappear forever, but this is likely to be rare.

  • Separate from this RFD, we will eventually implement SEC failover (see omicron#2317), so that we can assume that any saga will eventually finish executing even if the Nexus that’s executing it disappears forever.

Constraints[4]

  1. If a change is made to the intended state, then for each target, as long as the target is eventually available, that target will eventually learn about the update. (Or: we should never find a system at rest where all targets are online, yet one of them doesn’t know about some change.)

  2. If multiple changes are made to the intended state, Nexus defines a total ordering among them (usually provided by the order that these changes hit CockroachDB). For each target, as long as the target is eventually available, it should eventually converge to the same final state as all the other targets. (Or: we should never find a system at rest where all targets are online, but report different runtime state [because multiple changes were propagated in different orders to different targets].)

  3. If one target is offline, that should not block updates to other targets. (For example, if the sled hosting one of the external DNS servers is offline for a few hours for a hardware replacement, that shouldn’t affect our ability to propagate DNS changes to the other DNS servers.)

Building block: background tasks in Nexus

Let’s say:

  • We statically define a list of background tasks in Nexus (one for each RPW).

  • Each RPW task sits in an infinite loop sleeping until the RPW is activated.

  • An RPW can be activated by either a periodic timeout (e.g., every hour) or an explicit request from some other part of Nexus. (We envision that API requests could send a message on a channel to activate an RPW.)

  • Activation of the RPW causes it to:

    • Fetch its intended state from CockroachDB

    • Attempt to propagate this state to one or more targets

    • Report the result as one of: no changes needed, changes needed and fully propagated, changes needed and not fully propagated

Example 1. Example activation
sequenceDiagram
    participant Nexus
    participant CockroachDB
    participant Target1
    participant TargetN

    note over Nexus: Activate RPW (trigger: startup, timeout, API request)

    Nexus ->> CockroachDB: list targets
    activate CockroachDB
    CockroachDB -->> Nexus: list of targets
    deactivate CockroachDB

    Nexus ->> CockroachDB: fetch intended state
    activate CockroachDB
    CockroachDB -->> Nexus: state generation N
    deactivate CockroachDB

    Nexus ->> Target1: update to generation N
    activate Target1
    Note right of Target1: ... (more targets)
    Nexus ->> TargetN: update to generation N
    activate TargetN
    Target1 -->> Nexus: ok
    deactivate Target1
    TargetN -->> Nexus: ok
    deactivate TargetN

The implementation of fetching and propagating state is left to each RPW. There are several possible patterns here described below.

The framework in Nexus can keep track of activations and whether an activation has been fully propagated or not. It can also take care of re-invoking the RPW (presumably with some delay) after a partial success.
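
To make this concrete, here is a minimal sketch of what such a background task could look like, assuming a tokio runtime inside Nexus. All of the names here (run_rpw_task, ActivationResult, activate_once) are hypothetical, not existing Omicron APIs; the point is only the shape: sleep until a periodic tick or an explicit activation arrives on a watch channel, then perform one full activation.

use std::time::Duration;
use tokio::sync::watch;
use tokio::time;

/// Hypothetical summary of one activation, as described above.
#[derive(Debug)]
enum ActivationResult {
    NoChangesNeeded,
    ChangesFullyPropagated,
    ChangesPartiallyPropagated,
}

async fn run_rpw_task(mut activation_rx: watch::Receiver<()>, period: Duration) {
    let mut ticker = time::interval(period);
    loop {
        // Sleep until the periodic timeout fires or some other part of Nexus
        // (e.g., an API request handler) explicitly activates this RPW by
        // writing to the watch channel.
        tokio::select! {
            _ = ticker.tick() => (),
            _ = activation_rx.changed() => (),
        }

        // One activation: fetch the intended state, attempt to propagate it,
        // and report the result.  The framework could record this result and
        // re-activate the task (after a delay) on a partial success.
        let result = activate_once().await;
        eprintln!("RPW activation finished: {:?}", result);
    }
}

async fn activate_once() -> ActivationResult {
    // 1. fetch intended state from CockroachDB
    // 2. attempt to propagate it to each target
    // 3. report what happened
    ActivationResult::NoChangesNeeded // placeholder
}

An API request handler that changes the intended state would then activate the RPW with something like activation_tx.send(()) on the corresponding watch sender.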

It’s possible for the RPW implementation to provide detailed information to the framework so that the framework could report, for example:

  • RPW: "dns":

    • dns "external": generation 13 (created 2023-03-01T19:37:24Z for "new nexus provisioned") propagated to 3 of 5 DNS servers as of 2023-03-01T19:38:00Z

    • dns "external": generation 12 (created 2023-02-01T09:53:00Z for "new nexus provisioned") propagated to all 5 DNS servers as of 2023-02-01T09:53:05Z

  • RPW: "vpc_config":

    • vpc 123: generation 12 (created 2023-03-01T19:37:24Z for "firewall rules update (req_id \"abc\")") propagated to 17 of 32 sleds and 1 of 2 switches as of 2023-03-01T19:37:25Z

    • vpc 345: generation 34 (created 2023-03-01T19:37:24Z for "route update (req_id \"def\")") propagated to all 32 sleds and 2 switches as of 2023-03-01T19:38:12Z

This would be very valuable for debugging.[5]

Useful patterns

Periodic RPW activation

All RPWs are activated periodically, even if there’s no reason to think they need to be. This allows the system to repair brokenness resulting from bugs where an RPW is not activated when it should be, or from cases where Nexus crashes before the RPW finishes propagating an update.

Full vs. incremental updates

There are several points in implementing an RPW where we can choose an "incremental" or "full" (or "holistic") approach:

  • When activating the RPW, it could be told exactly what’s believed to be changed (incremental) or expected to look at all of the current state (full)

  • When determining whether a target needs to be updated, the RPW could look at only what it thinks has changed (incremental) or the full intended state and runtime states. Generally a full approach is worthwhile here: when the RPW is activated, the first step should be "read the full state of the world and (idempotently) determine what to do next".

  • When sending an update to a target, the RPW could send a delta from its previous state (incremental) or the complete new state (full).

These are separate, related choices. The basic tradeoff is that an incremental approach may be more efficient or scale better, but a full approach is usually easier to make correct. With an incremental approach, it’s easy for one bug to cause a system to be incorrect forever. By contrast, a full approach can often repair brokenness introduced by earlier bugs. A full approach may also have the benefit of the constant work pattern: that the amount of work that the system does does not change drastically under different load conditions.

[_generation_numbers] are a useful pattern for getting the benefits of the "full" approach efficiently and scalably.

Generation numbers

It’s useful to group a bunch of "intended state" updates together and assign them a generation number. This is just a unique, monotonically increasing integer. Writing the intended state to CockroachDB already implies a total order on such changes. Generation numbers give us a handle we can use to identify a checkpoint in the neverending sequence of changes.

Using DNS as an example based on [rfd367]: each change to the DNS data is encapsulated into a single generation number. When activating the DNS propagation RPW, Nexus specifies the generation number that was just created. This is essentially a full update (see [_full_vs_incremental_updates]) because the generation number identifies the complete state of the DNS data, not just the change that was made in that generation. (Contrast with the approach where Nexus activates the DNS propagation RPW with a specific list of DNS names that were changed and then the RPW propagates just that.)

Targets (DNS servers) also report their state in terms of the generation number. This allows the RPW to do a full-state comparison by just comparing the generation numbers.

When sending an update to a target, the RPW can choose to send the full contents of a generation or just the delta between two generations. As mentioned above, the former is easier to make correct and also repairs any previous corruption. The latter is considerably more scalable in the long run. (If it comes to it, we can also do both: incrementals for most changes, and an occasional full audit.)
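
To illustrate how generation numbers make updates idempotent and safe to receive more than once (or from more than one sender), here is a hedged sketch of how a target might handle a full "update to generation N" request. The types are hypothetical, not any real DNS server or Sled Agent API.

use std::sync::Mutex;

/// Hypothetical versioned configuration held by a target.
struct VersionedConfig {
    generation: u64,
    // ... the data for this generation ...
}

struct TargetState {
    current: Mutex<VersionedConfig>,
}

enum UpdateOutcome {
    Applied,
    AlreadyAtOrBeyond(u64),
}

impl TargetState {
    // Accept a full update only if it is strictly newer than what we already
    // have.  The mutex serializes concurrent requests (e.g., from multiple
    // Nexus instances), and stale or duplicate updates become no-ops, so the
    // target converges on the highest generation it has seen.
    fn apply_full_update(&self, new: VersionedConfig) -> UpdateOutcome {
        let mut cur = self.current.lock().unwrap();
        if new.generation <= cur.generation {
            return UpdateOutcome::AlreadyAtOrBeyond(cur.generation);
        }
        *cur = new;
        UpdateOutcome::Applied
    }
}

An incremental update would additionally check that the target is currently at the expected base generation and fail otherwise (as [rfd367] describes for the DNS servers).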

How to propagate updates to targets

This is one of the more complex pieces of RPWs. There are several different ways to think about propagating updates from Nexus to various targets. Let’s take a look at a few approaches.

Target-driven

Target-driven propagation means that the body of the RPW lives inside each target rather than Nexus. The target would ask Nexus what its state should be on startup, periodically, or in response to a request from Nexus suggesting that the target update its state.

Example 2. Target-driven update
sequenceDiagram
    participant Target
    participant Nexus
    participant CockroachDB

    note over Target: trigger: startup, timeout, or explicit request

    Target ->> Nexus: which generation should I have?
    activate Nexus
    Nexus ->> CockroachDB: fetch intended state
    activate CockroachDB
    CockroachDB -->> Nexus: generation N
    deactivate CockroachDB
    Nexus -->> Target: generation N (includes contents)
    deactivate Nexus

    note over Target: apply generation N

In order for Nexus to proactively request that the target update its state, Nexus probably still needs to maintain a simpler version of the RPW that can be activated on-demand to make that request to each target.[6] By simpler, we mean that the "update" step just notifies the target that it should request an update.

Example 3. Event-driven activation with target-driven updates
sequenceDiagram
    participant External client
    participant Nexus
    participant CockroachDB
    participant Target

    note over Nexus: API request to change configuration

    External client ->> Nexus: reconfigure
    activate Nexus
    Nexus ->> CockroachDB: create new generation N for config
    activate CockroachDB
    CockroachDB -->> Nexus: ok
    deactivate CockroachDB
    Nexus -->> External client: ok
    deactivate Nexus

    note over Nexus: Activate RPW (Nexus side)
    activate Nexus
    Nexus ->> CockroachDB: list targets
    activate CockroachDB
    CockroachDB -->> Nexus: list of targets
    deactivate CockroachDB

    Nexus ->> CockroachDB: fetch intended state
    activate CockroachDB
    CockroachDB -->> Nexus: state generation N
    deactivate CockroachDB

    Nexus ->> Target: you need to update
    activate Target
    Target -->> Nexus: ok

    deactivate Nexus
    note over Nexus: End of RPW (Nexus side)

    note over Target: Activate RPW (Target side)
    activate Target

    Target ->> Nexus: which generation should I have?
    activate Nexus
    Nexus ->> CockroachDB: fetch intended state
    activate CockroachDB
    CockroachDB -->> Nexus: generation N
    deactivate CockroachDB
    Nexus -->> Target: generation N (includes contents)
    deactivate Nexus

    deactivate Target
    note over Target: apply generation N

This approach works well when the set of targets and updates is very well-defined. For example, this could work well for each Sled Agent to keep track of what customer Instances it’s running, its VPC configuration, and maybe even which control plane zones it’s running. It could work for the DNS servers, too.
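
As a rough sketch of what the target side of this pattern could look like (the NexusClient type and its desired_config method are hypothetical stand-ins, not real APIs):

use std::time::Duration;
use tokio::time;

/// Hypothetical client for asking Nexus what our configuration should be.
struct NexusClient;
struct Config {
    generation: u64,
    // ... the configuration contents ...
}

impl NexusClient {
    async fn desired_config(&self) -> Result<Config, String> {
        unimplemented!("sketch only")
    }
}

async fn target_config_loop(nexus: NexusClient, mut current_generation: u64) {
    let mut ticker = time::interval(Duration::from_secs(60));
    loop {
        // Poll at startup and periodically.  A "you should update" request
        // from Nexus could also wake this loop early (not shown here).
        ticker.tick().await;
        match nexus.desired_config().await {
            Ok(config) if config.generation > current_generation => {
                // apply the new configuration locally, then record it
                current_generation = config.generation;
            }
            Ok(_) => (),  // already up to date
            Err(_) => (), // transient failure; try again on the next tick
        }
    }
}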

Advantages of this approach:

  • There’s little need for coordination or tracking. Nexus still needs to keep track of which targets exist for the event-triggered update case.

  • There’s little duplication of work by multiple components that gets thrown away when only one of them "wins".

Problems with this approach:

  • It only really works when there’s a discrete target to run the RPW on. It wouldn’t work well for service rebalancing because that activity requires provisioning instances, which requires allocating resources like IP addresses using CockroachDB. The closest thing to a target here is the assigned sled, but that’s not assigned until we start doing some of the work! This problem can often be addressed by splitting this work into separate phases: one which makes all the choices (and records them) and one which takes the decided-upon actions.

  • It takes more work to report on the status of propagation. With Nexus-driven propagation, Nexus can easily track everything it’s done for each target. With target-driven propagation, to make status information available, something (presumably Nexus) would have to collect it from all the targets.

See also the [constant-work] design pattern.

Nexus-driven, no mutual exclusion

Generally, Nexus-driven propagation means that Nexus runs the body of the RPW. It maintains a background task that periodically compares intended state vs. runtime state and then reaches out to targets to update them.

By "no mutual exclusion", we mean that there’s no mechanism used to ensure that only one Nexus instance activates an RPW at a time. It follows that it must be okay if multiple Nexus instances concurrently try to update various targets. This works well in the same cases where target-driven propagation works. In a sense, it’s very nearly the same thing: maybe the only difference is whether the target or Nexus runs the periodic loop.

Advantages of this approach:

  • Like with [_target_driven], there’s little need for coordination or tracking.

  • If Nexus pushes the updated state directly to the target (rather than asking the target to call back to Nexus for the data), then there’s no need for the target to know about Nexus at all. This is nice in the case of DNS servers, which are otherwise pretty Omicron-agnostic; keeping them that way is useful for testability and understandability.

Disadvantages of this approach:

  • It only works when it’s okay for multiple Nexus instances to run the same update concurrently. This is likely fine for things like propagating an explicit set of state from Nexus to a specific, known component (the target). It wouldn’t work for service rebalancing — see [_nexus_driven_with_mutual_exclusion].

  • Since there’s no mutual exclusion, multiple Nexus instances might do duplicate work to fetch the intended state and compute updates. All but one will be thrown away when one of them "wins". This can be mitigated by staggering their periodic activation times such that on balance, the work is done at the desired frequency, but it’s unlikely they will wind up trying to do it at the same time.

Example 4. Concurrent RPW activation (two Nexuses)
sequenceDiagram
    participant Nexus1
    participant CockroachDB

    participant Nexus2
    participant Target

    note over Nexus1: Activate RPW
    Nexus1 ->> CockroachDB: list targets
    activate CockroachDB
    CockroachDB -->> Nexus1: list of targets
    deactivate CockroachDB
    Nexus1 ->> CockroachDB: fetch intended state
    activate CockroachDB
    CockroachDB -->> Nexus1: state generation N
    deactivate CockroachDB

    note over Nexus2: Activate RPW
    Nexus2 ->> CockroachDB: list targets
    activate CockroachDB
    CockroachDB -->> Nexus2: list of targets
    deactivate CockroachDB
    Nexus2 ->> CockroachDB: fetch intended state
    activate CockroachDB
    CockroachDB -->> Nexus2: state generation N
    deactivate CockroachDB

    Nexus1 ->> Target: update to N
    activate Target
    Target -->> Nexus1: ok
    deactivate Target
    note over Target: apply generation N

    Nexus2 ->> Target: update to N
    activate Target
    Target -->> Nexus2: already at N
    deactivate Target

    note over Nexus1: End of RPW
    note over Nexus2: End of RPW

Nexus-driven, with mutual exclusion

This is Nexus-driven (see above), but where it’s essential that only one Nexus even attempt to perform the update step. The canonical case is service rebalancing. Suppose two Nexus instances decided "expand the CockroachDB cluster by 2 nodes". We really need mutual exclusion here to prevent them from expanding the cluster by 4 nodes.

We really want to avoid these where possible — see [_challenges_with_mutual_exclusion]. But they do exist.

How would we implement this? One approach is to:

  • define a table of these "locks" in CockroachDB

  • wrap operations that are supposed to be protected by the "lock" in a saga:

    • The first action "takes the lock" by storing its id into the "locks" table. The undo action drops the lock by clearing that field (conditional on it still having the same value).

    • The last action "releases the lock" by clearing that field (conditional on it still having the same value).

Here, we leverage CockroachDB to provide the strong consistency / distributed consensus about who owns the lock. By wrapping this in a saga, and assuming we implement SEC failover (see [_assumptions]), we can assume that anybody who’s taken that lock will eventually finish what they’re doing and release the lock.
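
A rough sketch of the lock-taking step follows, assuming a hypothetical rpw_lock table with name and owner_id columns, using tokio-postgres against CockroachDB. In the saga-based scheme above, try_take_lock would back the first action and release_lock would back both the undo action and the final action.

use tokio_postgres::Client;

/// "Take the lock" by claiming the row only if nobody currently owns it.
/// The affected-row count tells us whether we won the claim.
async fn try_take_lock(
    db: &Client,
    lock_name: &str,
    owner_id: &str,
) -> Result<bool, tokio_postgres::Error> {
    let rows = db
        .execute(
            "UPDATE rpw_lock SET owner_id = $2 \
             WHERE name = $1 AND owner_id IS NULL",
            &[&lock_name, &owner_id],
        )
        .await?;
    Ok(rows == 1)
}

/// Release the lock, conditional on still owning it.
async fn release_lock(
    db: &Client,
    lock_name: &str,
    owner_id: &str,
) -> Result<bool, tokio_postgres::Error> {
    let rows = db
        .execute(
            "UPDATE rpw_lock SET owner_id = NULL \
             WHERE name = $1 AND owner_id = $2",
            &[&lock_name, &owner_id],
        )
        .await?;
    Ok(rows == 1)
}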

Advantages:

  • Works in all cases.

Disadvantages:

  • Mutual exclusion is complicated to implement and has lots of things that can go wrong at runtime.

  • Liveness in the face of Nexus failures relies on having implemented SEC failover (which we haven’t done yet, and even once done, incurs some operational risk).

Potential examples

DNS (both internal and external)

See [rfd367]. To summarize: DNS data will be stored in CockroachDB. Each name has a row that includes the generation in which it was introduced and (optionally) removed. The targets are DNS servers that will accept one of two kinds of requests: (1) full update to generation X, or (2) incremental update from generation X to generation Y, which should fail if the server is not currently at generation X. The RPW body is fairly straightforward: fetch the latest generation and for each target, either send it a full update to that generation or else fetch its generation, compute a delta, and send it an incremental update.
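
A minimal sketch of that RPW body is below. All of these types (Datastore, DnsConfig, DnsServerClient) are hypothetical stand-ins for the real datastore and DNS server client, and for brevity this version always sends a full update rather than computing a delta.

/// Hypothetical stand-ins for the datastore and the DNS server client.
struct Datastore;
struct DnsConfig {
    generation: u64,
    // ... the full DNS data for this generation ...
}
struct DnsServerClient;

impl Datastore {
    async fn dns_config_latest(&self) -> Result<DnsConfig, String> {
        unimplemented!("sketch only")
    }
    async fn dns_servers(&self) -> Result<Vec<DnsServerClient>, String> {
        unimplemented!("sketch only")
    }
}

impl DnsServerClient {
    async fn current_generation(&self) -> Result<u64, String> {
        unimplemented!("sketch only")
    }
    async fn apply_full_config(&self, config: &DnsConfig) -> Result<(), String> {
        unimplemented!("sketch only")
    }
}

/// One activation: fetch the latest generation and try to bring every DNS
/// server up to it.  A failure on one server doesn't block the others; we
/// just report whether everything was fully propagated.
async fn propagate_dns(db: &Datastore) -> Result<bool, String> {
    let latest = db.dns_config_latest().await?;
    let servers = db.dns_servers().await?;
    let mut fully_propagated = true;
    for server in &servers {
        match server.current_generation().await {
            // Already at (or beyond) the latest generation: nothing to do.
            Ok(server_gen) if server_gen >= latest.generation => continue,
            // Otherwise (including if we couldn't even read its generation),
            // push the full latest configuration.
            _ => {
                if server.apply_full_config(&latest).await.is_err() {
                    fully_propagated = false;
                }
            }
        }
    }
    Ok(fully_propagated)
}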

Distributing VPC (virtual networking) configuration

Warning
This section might be way off! dap isn’t super familiar with the underlying pieces here.

For background, see [rfd63]. Here, we’re talking about the distribution of VPC configuration. This configuration includes firewall rules, routes, and likely other data.

The "intended state" comprises all the VPC-related configuration that we store in CockroachDB today, including the firewall rules and routes. This information should probably be versioned with a generation number per-VPC.

We may want to think about each VPC’s configuration as an independently-executing RPW to ensure that problems propagating one of them don’t affect the others (and because it’s easier to implement it and report it on a per-VPC basis).[7]

We may want to further split the distribution of networking configuration for a single VPC into two RPWs: one for sleds and one for boundary services, just so that each of these has a uniform set of targets.

One set of targets are the Sled Agents. The "runtime state" of each Sled Agent is comprised of whatever VPC configuration it needs to store, plus whatever is stored in OPTE. This RPW could be target-driven or Nexus-driven without mutual exclusion. Either Sled Agent could ask Nexus periodically for updated configuration for each VPC that it knows about, or Nexus could periodically ping Sled Agent and suggest that it update its config for a particular VPC. Either way, Nexus could poke Sled Agent to fetch updated config (or provide updated config) when something changes.

The other set of targets are instances of Dendrite, which manage Tofino switch configuration. Their runtime state is comprised of whatever they need to store locally, plus whatever has been programmed into the switch. The update process can work largely the same way, with Dendrite accepting updates in the form of a new versioned configuration. There may be some preference for Nexus-driven updates here if we don’t want Dendrite to have to know about Nexus.

List of customer Instances on a sled

We could think of "the list of customer Instances on a sled and their configuration" as an instance of RPWs. Indeed, the original design basically worked this way: on startup, Sled Agents report that they’re blank and Nexus is expected to invoke instance_ensure() to establish the current desired configuration. There’s no process for periodically ensuring that this configuration remains up-to-date (i.e., that all Sled Agents are running the Instances that we expect they are).

Today, when a change is made to intended state (e.g., an incoming request to halt an Instance), it goes to the Sled Agent first, Sled Agent applies the change, then Sled Agent reports the change in actual state to Nexus, which writes them to CockroachDB. We do not store separate intended vs. actual states. Essentially, the intended state is transient in Nexus and Sled Agent while it’s being applied.

One reason for this is that for Instance runtime state, Nexus and CockroachDB aren’t the only source of truth about the intended state. Customers can halt an Instance from within the Instance, without going through Nexus. That would reflect a change to both the actual and intended states. With the current approach, this works (at least architecturally) because Sled Agent can see the halt and propagate the state change back to Nexus and CockroachDB like any other state change.

One could imagine changing the implementation to work like an RPW:

  • intended state: list of Instances (in CockroachDB) assigned to a sled and their states

  • targets: Sled Agents

  • runtime state: Sled Agent state + running Propolis instances

It makes sense to preserve the current behavior that if a sled reboots, we do not assume the Instances that were running there should be re-established there, but instead they would be re-provisioned, by which we mean that they would go through the process anew of allocating runtime resources (possibly on a different Sled) and being started on some sled. We could still implement an RPW that periodically and on-demand propagates requested changes from CockroachDB to Sled Agent. This would be a substantial difference from how the Instance boot/halt/reboot paths work today. But "boot" likely already needs to be changed substantially (it shouldn’t assume the Instance is tied to a specific sled, even when the Instance is off). It would make this subsystem look more like the rest of the system and potentially be more rigorous.[8]

Another consequence of this change is that the instance-provision saga would no longer be responsible for booting the VM. Instead, instance-provision would do all the work to allocate resources and record those in CockroachDB, then activate an RPW that would actually propagate the state (i.e., "this instance should be running and should look like this") to the corresponding sled (and anything else that needs to know, like Crucible).

TBD would be how to deal with changes to intended and actual state that are initiated by the VM itself (mentioned above), and how we reconcile conflicting changes from these two different sources of truth. The answer may involve being more specific about which parts of an Instance’s state (e.g., location vs. is-it-running) are owned by whom.

Note
dap thinks this sounds like a good idea but I’d like feedback! I’m pretty far from the implementation of these pieces now and might be way off. Even if it seems like a good idea, it could be low priority if it doesn’t fix any immediate problem.

List of control plane components on a sled

Similarly, we can think of the list of control plane components on a sled (like CockroachDB, Nexus, Internal DNS servers, External DNS servers, etc.) as an RPW.

  • intended state: the list of control plane components assigned to each sled (recorded in CockroachDB by the service rebalancer)

  • targets: sled agents

  • runtime state: the actual set of components deployed

The RPW would look similar to the one for Instances above. The update process would involve deploying, undeploying, or reconfiguring these components.

Pitfalls

These are tempting approaches for implementing RPWs that either don’t work or create their own new (hard) problems.

Why not update targets in-line with API requests

The most obvious approach for implementing RPWs is to update the targets inline with any API request (or other operation) that changes the intended state. This doesn’t work because in the event of any failure, we wind up violating one or more of our [_constraints].

We assume here that Nexus would update the intended state in CockroachDB before updating the runtime state of any targets. To do otherwise would violate our constraint about there being a total order on updates, not to mention public API expectations around durability and strong consistency.

If Nexus crashes before updating the targets, it violates the constraint about all targets learning about an update.

If two requests make different changes concurrently, even though CockroachDB will serialize them, there’s nothing to prevent some targets from hearing about the updates in the wrong order. Depending on the implementation, that could cause two targets to converge on a different final state, violating another constraint.

If Nexus fails to update one target (out of many), what would it do? If it completes the request successfully, it violates the constraint about all targets learning about an update. If it fails the request (or even just retries it, inducing additional latency), then it violates the constraint about one offline target blocking updates.

Why not create a saga in-line with API requests to propagate the update

Consider implementing the distribution of VPC firewall rules. Suppose that the desired set of firewall rules is stored in CockroachDB and that changes to this set get propagated to some number of sleds.[9]

We could say that when a change is made, Nexus kicks off a saga to propagate the new set of rules. If a sled is offline, the saga will keep trying until the sled does come back online. This has several issues:

  • What if another change comes in while a sled remains down? Now we have two sagas attempting to propagate two different changes to the same sleds. What if they get propagated in a different order? Will the sleds necessarily wind up with the same end state?

  • What if many changes come in while the sled is down? Now there’s an unbounded queue of work piling up and a [thundering-herd] awaiting the sled when it returns.

  • What if the sled never comes back online? What if an operator marks it "removed [destroyed in a fire]"? The saga needs to learn this and stop trying. How would it do that? Does it query the database for any change that might invalidate what it’s trying to do — every time it retries?

We could say that when a change is made, if a saga already exists to propagate the rules, then we somehow update that saga to propagate the latest set of rules. There’s not a great way to do this. Recall that saga DAGs are immutable once the saga starts running. You could imagine a saga action whose behavior is dynamic based on the database state, but this gets ugly quickly. It’s fighting the saga framework (which is oriented around actions being small, atomic, idempotent, undoable steps).

We could say that when a change is made, Nexus kicks off a saga to propagate the rules if one is not already running, and that this saga simply does not retry failures. There would need to be some separate mechanism to update those sleds later. On the plus side, this avoids the unbounded queueing and thundering herd and related problems. But it’s not clear why we’d use a saga here, since it wouldn’t have any undo actions.

There are cases where a saga might be part of an RPW. Consider the service rebalancer. Imagine we want to deploy two more CockroachDB nodes. The RPW could see: "I need 7 nodes, but I only have 5. I need to kick off two provisions." It could kick off a saga that deploys the two nodes. It should use a saga here because each of those deployments individually comprises a bunch of steps that we do want to unwind on failure! This case seems to be the exception rather than the rule when it comes to RPWs.

Another use case of sagas is for RPWs that require mutual exclusion, mentioned above.

Why not queues

One approach to implementing RPWs is that when we make a change to intended state, queue a request to propagate that change to each target. The queue could be stored in the database, written transactionally with the change to intended state so that it can’t be missed. Or one might do this only after trying to [_why_not_update_targets_in_line_with_api_requests]. On failure to update a target, we might enqueue a request to propagate the change to that target.

Queues that are used like this have a bunch of problems. If the queue is bounded, you need some fallback mechanism to ensure correctness when the queue is full. In that case, you could just skip the queue altogether. But unbounded queues can turn transient issues into cascading failures that require human intervention to repair.[10]

Instead of queueing up requests, we activate an RPW by sending a message on a (Rust) channel. A key difference is that any number of messages can be collapsed into one, since the RPW only needs to know if it’s been activated or not. It doesn’t need to know how many times it’s been activated or by what. (See [_full_vs_incremental_updates].) A watch channel is a good fit for this.
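
As a tiny illustration of that collapsing behavior (and the same mechanism the background-task sketch above relies on), using tokio's watch channel:

use tokio::sync::watch;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = watch::channel(());

    // Three separate events each activate the RPW...
    tx.send(()).unwrap();
    tx.send(()).unwrap();
    tx.send(()).unwrap();

    // ...but the receiver only observes that "something changed": the three
    // notifications collapse into a single pending activation, and there is
    // no queue that can grow without bound.
    rx.changed().await.unwrap();
    assert!(!rx.has_changed().unwrap());
}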

Challenges with mutual exclusion

An intuitive way to implement RPW behaviors is to say that each job will be owned at any given time by one Nexus instance. It will be the only instance trying to do that thing. Maybe the Nexus instances negotiate that somehow ahead of time, or maybe each one tries to take a lock before activating any RPW.

The problem is that the [cap-theorem] constrains the properties of a distributed lock that we’d like to remain available when the holder goes out to lunch. Systems like [chubby] use time-based leases, such that if a node holding a lock is unresponsive for too long, it loses its lease (lock). But this cannot be enforced without a hard real-time system. Imagine the case where a node obtains a lease, becomes partitioned, then attempts to issue a request to a service that’s only supposed to be used with the lease held. The request is in the local networking stack socket buffer. The lease expires and is given to some other node. Then the partition is removed and the request is sent — mutual exclusion is violated. This particular example might seem contrived, but the observation is that if mutual exclusion relies on the lock holder to enforce its own lease time, then it implicitly assumes something about the liveness of that system. That assumption is often violated in practice as a result of partitions, pathological system behavior, debugging activities, etc.

"Taking a lock" in a distributed sense is fraught. Instead, one approach is to take whatever operation would be protected by that lock and make it safe to be called by multiple nodes (the would-be lock holders). Take the DNS propagation RPW. We’d like to put the whole RPW under a mutex. Instead, multiple Nexus instances might duplicate the work of calculating what updates are needed. All that really needs to be protected is the update to each DNS server. But we don’t need to enforce that at Nexus. The DNS servers themselves can serialize the requests that they receive. Another approach is to invert control as described above: have the target be responsible for requesting updated state, possibly in response to a prod from Nexus. There’s no silver bullet for translating mutual exclusion into something else, but there are several patterns that work well in different conditions.

See also: [kleppmann].

Related topics

Correctness and resilience to bugs

We’ve essentially defined correctness by our [_constraints]. The system is broken if a particular RPW comes to rest with some online target not having the latest update. There are a few ways this tends to happen in similar systems:

  • an RPW was not activated when it should have been ("dropped notification" or "missed wakeup"). This is mitigated by [_periodic_rpw_activation].

  • an update was not propagated to some target, either because it was offline at the wrong time (another form of dropped notification, also mitigated by [_periodic_rpw_activation]) or because the RPW incorrectly computed which targets required an update (which can be mitigated using full updates).

  • an update’s contents were incorrectly propagated to some target (e.g., because Nexus incorrectly computed an incremental update). This can be mitigated by full updates.

Each of these represents a bug. But unlike many bugs that only affect the current runtime state of a program, these bugs potentially cause the system to be wrong forever.

Debuggability

We propose several patterns to reduce the risk that the system becomes permanently broken. It can still happen, of course. And bugs can easily cause the system to be broken until the next activation of an RPW. This can noticeably degrade the customer experience. Imagine we forget to update [rfd21] Instance DNS at provision-time and that’s only fixed by our periodic every-hour reactivation.

We should try to identify activations that we expect will trigger no actions and record an error if we find that they did trigger actions. This should be reported to Oxide support so that we can root-cause and fix the underlying bug. We can’t catch all of these. But we can reliably identify cases where the actual state of the world shouldn’t change behind Nexus’s back, the intended state of the world hasn’t changed since the last successful activation, yet the activation produced next steps to take.

If we get reports of these bugs, what information will we want? Usually we want to know which code path failed to trigger an update. To that end, we’d love to know:

  • what action was unexpectedly taken? (e.g., which DNS names unexpectedly needed to be propagated? to what DNS servers?)

  • which API requests (or other Nexus operations) triggered the underlying change? We could store an API request_id with each DNS change. That would allow us to correlate a failed update with the Nexus log entry for the request that failed to do the update. We could see what that operation was trying to do, if there were any errors, etc. Maybe we could even store a comment with each change like "created instance 123…".

Of course, we could also run into situations where the final state on some target didn’t match what we expect, either because it was different from other targets or different than the intended state. To root-cause these, it would be useful to keep logs of all the work done by RPWs: what actions were computed and taken on which targets. See [_building_block_background_tasks_in_nexus] for an example of what we could keep.

Monitoring

Besides the obvious correctness issues discussed above, operational issues with systems like this include cases where propagation has stopped for some reason. By construction, when the system is seeing changes to the intended state, there’s always a gap between the intended and runtime states. The operational issue revolves around how large this gap is, usually measured in time. e.g., the DNS servers are often going to be a few seconds out-of-date. We probably want to raise an alarm if they become a few minutes out of date.

With the right abstraction in Nexus, it should be possible for Nexus to provide basic metrics about all RPWs, including:

  • Is an activation in-progress? If so, why (explicit request or periodic)? What time did it start?

  • Last activation: when did it start? why did it start? did it identify changes? did it propagate them?

  • Time (and generation number) of last successful propagation to all targets

  • For each target: time (and generation) of the last successful update and the last update attempt (successful or otherwise)

This would enable the system to raise alarms if:

  • an RPW hasn’t been activated for longer than we expect (based on its periodic frequency)

  • an RPW hasn’t been able to propagate to a particular target for longer than we expect

It would also enable the system to affirmatively report how up-to-date some target or subsystem is (instead of expecting people to trust that no alarms means no problems). This follows the principle that software should be able to exonerate itself when it’s not the problem.

This data would also enable Oxide support to answer a lot of questions we might have if we were looking into those alarms, like: did we try to update that target?

Determinations

This RFD was primarily about describing the problem space and various patterns that do and don’t work well. The determinations are kind of wishy-washy.

  • Where we have RPW-like behavior (e.g., an API request that needs to update some number of other services), let’s consider making it a first-class RPW so we don’t run into the [_pitfalls] and fail to meet the [_constraints].

  • Let’s implement one or more of these RPWs (starting with [rfd367]) with an eye towards generalizability for these other use cases. But let’s also not overthink it: assume we’re going to need to iterate on the abstraction as we build more of these.

  • Let’s avoid distributed mutual exclusion where possible. Prefer flows where we distribute versioned configuration to discrete targets and let those targets synchronize incoming requests.

Open Questions

  • This is a good time to reconsider the name "reliable persistent workflow" because it doesn’t exist anywhere outside two RFDs yet. Are we happy with it? What about "reconciler"?

  • Will it be a problem to have Nexuses duplicating a bunch of work?

  • Is there some way to phrase the service rebalancer case to avoid needing mutual exclusion? We’ll likely have a similar challenge with the load balancing case.

  • Does this approach work for networking configuration?

  • Does this approach work for the list of customer Instances on a sled?

Questions best figured out through implementation experience:

  • We said some RPWs have "targets" and some don’t. How do we phrase this? Is one just a more specialized form of a more general thing? To what extent do we bake this into the framework?

  • Is it useful and important that the framework actually grok a discrete "activation"? This is useful if you want to be able to say: this particular change has (or hasn’t) been propagated somewhere. But it’s simpler to just say that we activate the RPW at any time, and we don’t really keep track of when it says it’s "done" — we just keep activating it until it says there’s nothing else to do. (This might correspond to any number of event propagations.) Maybe the answer is to decouple these: the "activation" is a one-shot thing, and we separately track which generation has been propagated.

Footnotes
  • 1

    This particular example could be avoided with different design choices. But as we’ll discuss, some approaches to solving this problem deal naturally with this case. That sometimes enables simpler design choices like not persisting state.

  • 2

    You could think of virtually all of Omicron’s jobs this way, but it’s not necessarily useful to model everything this way. Even ordinary provisioning of customer Instances could be viewed as a change to the intended state of the world, if that state comprises "the set of all Instances that are supposed to be running, the set of all IP addresses allocated, …​". But it’s not that useful to think of a provision as "converge to a state that has this instance, and three IPs allocated from these subnets, etc." Also, if we fail to provision an Instance immediately, we expect that most users expect to see the failure, along with an error message, not wait indefinitely for us to eventually succeed. That’s pretty different than most of the other cases above, where we always want to retry indefinitely in the event of failure.

  • 3

    We hedge with "pretty much" here because Nexus instance provisioning and deprovisioning are controlled by the control plane. If needed, we could create critical sections during which they’re not allowed. This seems likely to be brittle.

  • 4

    These are largely generalizations from RFD 367.

  • 5

    It’s very tempting to expose this information over an API for operators. But remember that this is in-memory state that will be different for every Nexus. We don’t really have other operator APIs whose output depends on which Nexus you hit, or that gets lost when Nexus restarts. We could persist some of this to a propagation_log in CockroachDB and serve from that, but that requires stabilizing the abstraction, the data format, etc. so we propose deferring that for now. It’s also useful to keep in mind that the vast majority of activations will be no-ops because there’s nothing to do or because another Nexus got there first.

  • 6

    It might also be reasonable for some use cases to not bother with the proactive notification, provided that the polling interval is frequent enough to meet whatever latency requirements we have for propagating a change. That would eliminate a bunch of code in Nexus. Such systems are described in detail under the "Constant work design pattern" reference below. It’s worth noting that such systems tend not to scale down well. As that article emphasizes, the work they do is constant, regardless of scale. That means an idle system can wind up using quite a lot of resources, constantly discovering that there’s no work to be done. This can be a problem on development and test machines.

  • 7

    In practice, these might share resources (like a tokio task) to avoid a large number of these starving other kinds of RPWs.

  • 8

    I say this as the person who pushed us towards the current approach.

  • 9

    We’ll ignore the switches for this example.

  • 10

    Even in good times, having no bound on queue length generally means there’s no backpressure. That means there’s nothing to ensure that items are being removed from the queue as fast as they’re being added. If that ratio gets flipped, queues quickly grow very large, consuming lots of memory or disk space. Large queues can stress database and protocol limits and trigger request/query timeouts (which in turn often causes failure to remove items from the queue). When things start being repaired and items are being cleared from the queue again, there can be hours' or days' worth of work queued up, much of which is wasted because it’s been obviated by more recent items. Components can be starved of CPU or other resources because they (or neighbors) are furiously trying to process all these (now irrelevant) events. Sound like a horror show? It is.
