RFD 289
Steno Upgrade

The control plane leans heavily on distributed sagas via [Steno] to carry out non-trivial changes to the system. Steno breaks up complex distributed flows into small, idempotent actions with associated undo actions. This enables the framework (Steno) to take care of unwinding in the event of failure rather than having developers write (and test) ad hoc recovery code for complex operations. To achieve this, Steno persistently stores internal state for these operations. This presents a problem for upgrade: what if the software has changed between when this state was written and when it’s read? We propose dealing with this by not dealing with it: first, we constrain the system so that this isn’t necessary. Then we create a mechanism to prevent Nexus from trying to run a saga from a different version.

Note
This RFD is more of a directional sketch than a concrete implementation proposal.

Determinations

For a given saga, execution is only ever carried out by one version of Nexus. Failover may still cause execution to move between different instances of Nexus.

There are two pieces to this: draining sagas during upgrade and having Nexus recover only sagas that were created with the same version as itself.

Draining sagas during upgrade

We essentially propose that Nexus be upgraded via a typical blue-green deployment (canarying optional) where sagas are drained before an old Nexus instance is shut down. This is analogous to the way many systems drain client requests and client connections before shutdown. That means:

  • We deploy a new fleet of Nexus instances before removing the old fleet.

  • For an individual instance of the old version, we:

    • first remove it from service of the public API and disable creation of new sagas, then

    • wait for all of its sagas to complete, then

    • remove the instance.
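
To make the drain step concrete, here is a minimal sketch of what this sequence might look like from the upgrade system’s point of view. The `NexusInstance` interface is hypothetical; Nexus exposes no such quiesce or monitoring operations today (that’s one of the open questions below):

```rust
use std::time::Duration;

/// Hypothetical handle to one old-version Nexus instance, as seen by the
/// upgrade system. None of these operations exists today.
trait NexusInstance {
    /// Remove the instance from public API service and reject new sagas.
    fn quiesce(&self);
    /// How many sagas this instance is still executing.
    fn sagas_in_flight(&self) -> usize;
    /// Tear the instance down.
    fn remove(self);
}

/// Drain and remove a single old-version Nexus instance.
fn drain_and_remove<N: NexusInstance>(nexus: N) {
    // First: stop new work (public API traffic and saga creation).
    nexus.quiesce();
    // Then: wait for all in-flight sagas to complete (or unwind).
    while nexus.sagas_in_flight() > 0 {
        std::thread::sleep(Duration::from_secs(5));
    }
    // Only now is it safe to remove the instance.
    nexus.remove();
}
```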

Nexus only executes sagas at the same version

We propose that when a saga is created, it’s stored with a version number or signature (details TBD — see considerations under [_more_about_saga_versions_or_signatures]). Each Nexus instance has a matching version number or signature. During recovery or failover, Nexus only looks for sagas having the same version number or signature. It will neither recover nor take over sagas with different versions or signatures.
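
A sketch of what the recovery-time filter might look like, assuming a hypothetical persisted saga record that carries its creator’s version or signature (the real storage details are an open question below):

```rust
/// Hypothetical persisted saga record; the real schema is TBD.
struct SagaRecord {
    saga_id: String,
    /// Version number or signature of the Nexus that created this saga.
    creator_signature: String,
    /// Whether the saga has already finished (successfully or unwound).
    done: bool,
}

/// During recovery or failover, a Nexus instance considers only sagas
/// whose recorded signature matches its own; all others are left alone.
fn sagas_to_recover<'a>(
    all_sagas: &'a [SagaRecord],
    my_signature: &str,
) -> Vec<&'a SagaRecord> {
    all_sagas
        .iter()
        .filter(|s| !s.done && s.creator_signature == my_signature)
        .collect()
}
```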

Failover

There is no plan of record for Saga Execution Coordinator (SEC) failover. We assume that when we build that system, it will include mechanisms for determining that failover should happen (fault detection), transferring control over a saga from one SEC to another, and preventing a SEC that was erroneously presumed dead from concurrently executing the same saga.

This proposal adds the additional constraint that failover can only happen between instances of Nexus at the same version. This puts another burden on the upgrade system: for some N, the upgrade system must keep at least N instances of an old version of Nexus around until all instances at that version are fully drained. We choose N and M based on how many permanent failures we want the system to survive (M) and how many Nexus instances are required to serve load (N-M). For example, if three instances are needed to serve load and we want to survive one permanent failure during the drain, then M = 1 and N = 4. Note that the upgrade system already has to ensure a similar constraint, which is that we have at least N-M instances of Nexus running in the first place.

Open Questions

  • There remain a lot of implementation details to work through, like:

    • What interface can Nexus provide to the upgrade system for draining (both starting to drain and monitoring when it’s finished)?

    • What will the version number / signature of a saga look like? How do we make sure it gets updated when it needs to be (and ideally not when it doesn’t need to be)? See [_more_about_saga_versions_or_signatures] for discussion.

    • How exactly will we store the signature of a saga in the database? How exactly will Nexus filter based on that?

  • What happens if we do manage to create orphan sagas (by removing all Nexus instances able to complete them)? It’s not clear how we could automatically recover from such a condition. We may want a way to mark them with a special state that we pay close attention to for phone home / support, and then provide operations to have them abandoned (removed without taking any actions or undo actions) or adopted by a newer version of Nexus (if support determines that that’s safe) or something. A safer remedy will probably be to have the upgrade mechanism deploy a Nexus instance at the correct version to finish or unwind the saga.

Proposal details

Crash course on Steno

Steno is an implementation of the distributed saga pattern. See the [Steno] repository for details and references.

The basic idea of distributed sagas is that you define (in code) a directed acyclic graph (DAG) of nodes, each representing some action that you want to take. Edges of the graph represent dependencies between these actions. Each action is idempotent and reversible with an undo action. There’s a Saga Execution Coordinator (SEC) that executes the graph such that:

  • Actions are executed in dependency order (no action is started until its dependencies are satisfied).

  • If there’s a failure at any point, any actions that have ever started are undone in reverse-dependency order (no undo action is started until its dependents have been undone).

  • Eventually, all actions will complete successfully or any actions that were ever started will be undone.

In Omicron, each Nexus instance is an SEC. The persistent saga log (which is used to survive transient failures) is stored in CockroachDB. When Nexus crashes, it uses the information in CockroachDB to figure out which sagas it’s supposed to be running and how to pick up where it left off. (This is where it’s critical that individual actions be idempotent.) See [rfd107] for more on the motivation for sagas and see [rfd192] for guidance on the conditions when it makes sense for us to use sagas.

Surprisingly, the canonical talk on distributed sagas does not address how data may be shared between nodes in the graph. Steno achieves this by saying that each node produces exactly one output object. The output is immutable and recorded persistently in the saga log. Any action can look up the output of any action upon which it depends (directly or indirectly) based on the unique name of the ancestor node.

Here’s a hypothetical DAG for an "instance-provision" saga. This is taken from a Steno demo, not the current implementation of instance provisioning:[1]

[Figure: DAG for a hypothetical instance-provision saga]

To look at a few examples:

  • An early node called "instance_id" generates the unique id for the new Instance. Subsequent nodes can look this up by the name "instance_id" because that’s the name of this node.

  • A subsaga called "server-alloc" determines on what server the new Instance will run. The output is called "server_alloc" and presumably would include the server id or some other identifier.

  • A node called "volume_id" creates a volume (disk) and emits the new volume’s id.

  • This isn’t visible in the DAG, but we can imagine a later node like "volume_attach" might use all these pieces of data in order to attach the Volume to the Instance by making a request to the Server where the Instance was deployed.
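
To make the data flow concrete, here is an illustrative sketch of the lookup described above. This is not Steno’s actual API; `SagaContext` and `lookup` are hypothetical stand-ins for the real machinery, which records each node’s output in the saga log under the node’s unique name:

```rust
use std::collections::HashMap;

/// Illustrative stand-in for the saga's runtime state: each completed
/// node's output, serialized as JSON and keyed by the node's unique name.
struct SagaContext {
    outputs: HashMap<String, serde_json::Value>,
}

impl SagaContext {
    /// Look up the output of an ancestor node by its unique name.
    fn lookup<T: serde::de::DeserializeOwned>(&self, node: &str) -> Option<T> {
        self.outputs
            .get(node)
            .and_then(|v| serde_json::from_value(v.clone()).ok())
    }
}

/// A later action like the hypothetical "volume_attach" node pulls the
/// data it needs from its ancestors' recorded outputs.
fn volume_attach(ctx: &SagaContext) {
    let instance_id: String = ctx.lookup("instance_id").expect("missing output");
    let server: serde_json::Value = ctx.lookup("server_alloc").expect("missing output");
    let volume_id: String = ctx.lookup("volume_id").expect("missing output");
    // ... make a request to `server` to attach `volume_id` to `instance_id` ...
    let _ = (instance_id, server, volume_id);
}
```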

In terms of recovery: as we said, Nexus stores the saga log in the database, including the outputs from each action. On startup, Nexus recovers any sagas that it was running. It reads each saga log and reconstitutes the runtime state. This is fairly straightforward with some assumptions:

  • that Nexus can know the shape of the DAG (either because that’s part of the serialized form or because there’s some identifier for the DAG that maps to a canned DAG that Nexus knows about),

  • that Nexus can associate each action in the DAG with Rust code that implements it, and

  • that the Rust code for the action groks the outputs that it depends on (that were written to the saga log).
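
The second assumption might look roughly like the registry sketched below (the names and signatures here are hypothetical, not Steno’s):

```rust
use std::collections::HashMap;

/// Hypothetical signature for an action's implementation.
type ActionFn = fn() -> Result<(), String>;

/// Every action name that can appear in a serialized DAG maps to the
/// Rust code implementing it.
struct ActionRegistry {
    actions: HashMap<&'static str, ActionFn>,
}

impl ActionRegistry {
    /// During recovery, every node in the DAG must resolve to known code;
    /// an unknown name means this Nexus cannot run the saga.
    fn resolve(&self, action_name: &str) -> Result<ActionFn, String> {
        self.actions
            .get(action_name)
            .copied()
            .ok_or_else(|| format!("unknown action: {action_name}"))
    }
}
```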

The crux of the problem

This gets to the heart of the upgrade problem. Distributed sagas encourage you to break up a complex operation into many small actions.[2] These small actions are likely to wind up tightly coupled.[3] Since Steno persistently stores the output of each action, it’s essentially storing private implementation details that we’ve just said are likely to change as the software evolves.

To make it concrete: suppose Nexus evolves from the example above in a way that breaks up the volume_attach action into two separate actions, say volume_setup and volume_attach. And suppose the new version of Nexus finds and recovers a saga that was created on the previous version. How could this possibly work? You’ve got a serialized DAG that’s different from the one that was used before, and the actions in that DAG don’t even correspond to discrete actions in the new version. Even if they did, how could we have confidence that the new versions of these actions correctly grok the outputs that were written on a previous version? Suppose now we add a new field to the output of instance_create, called volume_setup_params, which the volume_setup step uses. That field wouldn’t be there in the recovery case because the instance_create step ran on an older version. This is very brittle.

Solving this problem at upgrade-time

With the constraint proposed above that a given saga is only ever run by a given version of Nexus, the assumptions listed above about saga recovery are easy to satisfy. Actions can be associated with Rust code by any unique identifier (e.g., a string name), and they’ll always be able to grok the outputs of actions they depend on because those outputs were written by the same version of the code that’s reading them.

There are some downsides to this approach. But it appears far easier to implement and more robust than the alternatives.

Edge case: but Nexus upgrade is a saga!

The process of rolling out a new version of Nexus is likely to itself be managed as a saga. Would that cause a problem? Such a saga would be created (and so would have to be finished) by an old-version Nexus instance, but the upgrade can’t be completed until all old instances are removed, including the one running the saga. This seems addressable if we split the upgrade into two sagas: one that deploys new instances (possibly canarying them), decides at some point to commit to the new version, and triggers one of the new instances to create a second saga that finishes the update. Rollback could work similarly in the other direction.

More about saga versions or signatures

We’ve said that we want to ensure that Nexus only runs sagas created from the same version of Nexus. What exactly does that mean and how do we implement it? Let’s start with the two extremes.

Suppose you start with Nexus version V1. You make a tiny change to Nexus that’s unrelated to any sagas. Call this new Nexus version V2. In practice, V2 is perfectly capable of running sagas created from V1.

Now suppose you change the DAG structure for a particular saga S1 (e.g., split one action into two different actions). Call this new Nexus version V3. Version V3 cannot run instances of saga S1 that were created in Nexus version V2 or earlier. Version V3 could run instances of some other saga S2 created under V2 whose implementation didn’t change between V2 and V3.

In general, how do we know if version Vi can run saga S that was created from version Vi-1? It’s helpful to keep in mind that this proposal works fine if we just declare that the answer is "never". We could implement that behavior by saying that a Nexus version is represented by the git SHA from which it was built or a checksum of the binary or something like that. But we might do better if we allow a little more flexibility. So what are the real constraints?

If the types of the inputs used by any action in the saga have changed, the new version is very likely incompatible. Now, all inputs are either (1) the saga’s parameters, or (2) the output from a previous node. So we could say that if the saga’s parameter type or any node’s output type changes, then the saga has changed incompatibly. This could be achieved by requiring both of these Rust types to impl JsonSchema, serializing these schemas in a deterministic way, and using that (or a checksum of it, maybe in some normalized form) as a signature for the saga. Then Nexus could be willing to work on any saga whose signatures it recognizes.
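
Here is a minimal sketch of that idea, assuming the schemars, serde, and sha2 crates and a hypothetical parameter type. The real scheme (normalization, which types participate, how the signature is stored) is TBD:

```rust
use schemars::{schema_for, JsonSchema};
use serde::{Deserialize, Serialize};
use sha2::{Digest, Sha256};

/// Hypothetical saga parameter type; in practice every saga's parameter
/// type and every node's output type would get a signature like this.
#[derive(Serialize, Deserialize, JsonSchema)]
struct InstanceProvisionParams {
    project_id: String,
    ncpus: u16,
}

/// Derive a signature from a type's JSON schema. This assumes the schema
/// serializes deterministically (schemars' default map type is ordered,
/// but a real implementation should normalize explicitly).
fn type_signature<T: JsonSchema>() -> String {
    let schema = schema_for!(T);
    let serialized = serde_json::to_string(&schema).expect("schema serializes");
    format!("{:x}", Sha256::digest(serialized.as_bytes()))
}

fn main() {
    // A saga's overall signature might combine the parameter type's
    // signature with each action's output-type signature.
    println!("{}", type_signature::<InstanceProvisionParams>());
}
```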

Note that the shape of the DAGs would not be part of the signature — even for a given Nexus version and a given type of saga (e.g., "instance-provision"), there can be a very large number of valid DAGs. There could also be a number of valid signatures if, say, some action only sometimes appears in the DAG (so its type only sometimes appears in the schema that’s used to generate the signature). What we really need is to create signatures for the parameters and every action’s output, in some deterministic order, and Nexus needs to check that each of these represents something it knows about. This is a rather complicated check — it’s not clear how to operationalize this during failover and recovery. Maybe during failover and recovery, Nexus could evaluate this and store in-memory a list of the sagas that it cannot operate on?
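
In that world, the check Nexus would have to make during recovery or failover might look like this (all names here are hypothetical):

```rust
use std::collections::HashSet;

/// Hypothetical record of a saga's signatures: one for the parameter
/// type plus one per node output, in some deterministic order.
struct SagaSignatures {
    params: String,
    node_outputs: Vec<String>,
}

/// A Nexus instance can operate on a saga only if every recorded
/// signature is one it recognizes. `known` would be built at startup
/// from all the types this binary's actions can produce.
fn can_operate(known: &HashSet<String>, saga: &SagaSignatures) -> bool {
    known.contains(&saga.params)
        && saga.node_outputs.iter().all(|sig| known.contains(sig))
}
```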

Finally, it’s also possible for the DAG and all actions' types to be unchanged and for the behavior of the saga to have changed incompatibly (e.g., if some critical step is moved from action A2 to action A1 — recovering a saga that was suspended at the wrong time could result in that step being missed altogether). It would be nice if we could create a sort of signature for the code in each action as well. This might generate false positives (cases where we declared something incompatible that wasn’t), but it would still be a huge improvement over saying that every two Nexus versions are incompatible, because the vast majority of Nexus changes presumably won’t include changes to most actions.

Downsides to this approach

If there’s a bug in a saga, it can never be fixed for in-flight sagas. For example, we can fix the bug in the instance-provision saga, but if there are existing in-flight sagas, they will never run on the newer Nexus version, so they will still have the bug.

If we attempt to roll out a Nexus upgrade that we then want to roll back, it’s possible that we’ve created some sagas with the new version that can never be finished. This is similar to the upgrade case: assuming the new Nexus version is not totally disastrous, the upgrade mechanism could wait for the new instances to drain their sagas before removing them.

Both of these (as well as other cases) suggest that it will need to be possible to do something with sagas that cannot complete because they’re associated with Nexus versions that for whatever reason cannot finish them (e.g., a bug). We’ll need a way to abandon these, explicitly have new versions adopt them (only when support determines that that’s okay), or otherwise clean them up.

In local development environments, which may not be managed by the normal upgrade mechanism, updating Nexus could result in abandoning sagas. It’s unclear how big a problem this would be. You’d have to have in-flight sagas when you restart Nexus and have changed the implementation of those sagas in the new version of Nexus. This doesn’t seem all that common. If it were, we could have Nexus handle SIGINT more gracefully by draining sagas. Are there other, non-development use cases where the upgrade mechanism wouldn’t be available to ensure saga draining?

Alternatives considered

The proximate problem results from the combination of Steno persisting the output of each action and the actions being tightly coupled. One alternative would be that instead of Steno automatically storing the output of an action, it could provide explicit interfaces for actions to store arbitrary key-value pairs and fetch key-value pairs that had been stored by other actions. The burden would fall on saga actions themselves to define stable interfaces for whatever data they need to handle recovery and to provide whatever support they want for older versions of the same saga. This would be a pretty low-level facility that’s not just easy to use wrong — it’s very hard to use correctly. It’s also hard to test without having older versions of the software around. Besides this, there remain lots of unresolved semantic questions, like "what happens to key-value pairs stored by an action if the action fails?" and "are key-value pairs visible to other actions while an action is running?". Depending on the choices made, this may be much harder for Steno to implement correctly than what it does today — all just to push the hard part of the problem onto every consumer of Steno. And we haven’t said how these consumers could actually do this.
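
To make this alternative concrete, the lower-level facility might look something like the trait below. Steno provides nothing like this; it is shown only to illustrate the alternative and its unresolved semantics:

```rust
/// Hypothetical key-value interface that Steno could expose to actions
/// in place of automatically persisting each action's output.
trait SagaKvStore {
    /// Store a value under a key. Unresolved: what happens to pairs
    /// stored by an action if that action later fails and is undone?
    fn put(&mut self, key: &str, value: serde_json::Value);

    /// Fetch a value stored by another action. Unresolved: are pairs
    /// visible to other actions while the writing action is running?
    fn get(&self, key: &str) -> Option<serde_json::Value>;
}
```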

It’s tempting to reach for something like [stripe-api-versioning]. We might apply this to sagas by saying that sagas could associate version numbers with their input and output types. They could define translators from old-version inputs to new-version inputs. That might handle changes in these intermediate structs, but what about the shape of the DAG? Maybe sagas could also associate versions with the DAGs and provide translators for those? (That’s harder than it sounds, since a saga like "instance-create" may have many different DAGs depending on the parameters. So a translator would have to iterate the DAG looking for node sequences that are known to have changed and replace them with the updated sequence.) This all seems hard to get right: you’d need some mechanism to know reliably when you’re making a breaking change to any of these, bump the version, keep both versions around, and provide explicit support for every version whose persistent state you want to support. And what about downgrade (i.e., an older Nexus recovering a saga from a newer one)?

Another idea that’s been discussed is to compile the Rust code that makes up the saga actions to WebAssembly (WASM) and store the resulting artifact persistently. During normal execution, you set up a WASM runtime environment and execute the compiled code. On recovery, you’d read the WASM artifact from persistent storage, reconstitute the WASM runtime environment, and resume execution. This has nice properties: like the proposed approach, there’s no opportunity to get out of sync between implementation and persisted state. Additionally, the runtime environment would necessarily be tightly-defined. But there are lots of uncertain details with this approach. We have not tried to flesh it out.

Security Considerations

This approach likely means that older versions of Nexus will be running a little longer than they otherwise might have. Based on current saga implementations, this might be by a number of minutes. With future sagas this could be on the order of hours or days, depending on choices we make. If those older versions of Nexus have security vulnerabilities, then they will remain exposed for longer than they otherwise would have been. This is mitigated because those instances won’t be in service for the public API (which significantly reduces their exposure). It could be further mitigated by putting bounds on how long sagas are generally expected (or even allowed) to run or by adding knobs for operators to abandon such sagas in order to be able to fully decommission vulnerable Nexus instances.

Note that the alternative (support for resuming execution of sagas across multiple versions of Nexus) would likely result in non-trivial security issues as well.


Footnotes
  1. This image is from a draft version of a major update to Steno. The source file and rendered image are both in this repo.
  2. That’s because the actions have to be idempotent, and once you have a sequence of idempotent operations, the whole point is to lean on Steno to make the sequence as a whole idempotent.
  3. If you add some new feature to the saga, it may result in data flowing through the whole saga. That results in changes to the inputs and outputs of many different actions. Or imagine that you break up one action into two. Then the shape of the DAG changes.