419 - Only YOU can prevent unwinding sagas / RFD / Oxide

RFD

419

Authors

Updated

[rfd107] makes the determination that distributed sagas will be used for executing long-running tasks, especially those that require talking over the network to other daemons. But the code in saga nodes can be tricky to get "right", where "right" in this case is in the context of what is required for a distributed saga to be correct. This RFD is an attempt to cover that: how should one write correct saga nodes, and what are some common pitfalls and sharp edges when doing this?

The title of this RFD is a reference to Smokey the Bear, but also an unfortunate reminder that it’s up to the code author in a lot of cases to get this stuff right. Rust doesn’t have a trait for "idempotent", "atomic", or any of the concepts described here. It’s up to the author and those reviewing the code to catch these issues.

Background

[rfd107] introduces the [distributed_saga] as our preferred method of executing long-running complicated tasks. The idea is to structure that complicated task as a graph of saga nodes. Each node has an action to be executed as part of traversing the graph "forward", and a corresponding compensating action to be executed as part of traversing the graph "backwards" (also known as unwinding). Unwinding occurs when an error is seen as part of a node’s action’s execution. Here’s an example of a two node saga executing both actions successfully:

The graph structure is important as nodes can execute concurrently if the saga executor operates concurrently:

If a saga node’s action returns an error, then compensating actions for previous nodes are executed:

Each saga nodes’s output is serialized to JSON and recorded in the saga store, and can be referenced by other "downstream" saga nodes.

A saga will either "apply" each forward action successfully, or be unwound in a way that brings the system back to the state it was in before the saga execution started. In this way sagas are analogous to a database transaction.

The saga "store" is the persistant storage for the saga itself, details about the execution of the saga, the parameters for the saga, and the output of each saga node.

It is the responsibility of a saga "executor" to execute the actions of a saga, updating the saga store as necessary. Please see https://github.com/oxidecomputer/steno for the implementation that Oxide software uses.

Properties of a correct saga node

Compensation

The compensating action for a saga node should return the system to the state it was in before the saga node’s forward action ran, as best it can. If it cannot, then it should return to a state consistent with having never run at all.

Ideally, if a saga node’s forward action changes the system’s state from S to S', then the compensating action would change this from S' to S. This property ensures that the saga itself either fully applies successfully or returns the system to the state it was in prior to the beginning of the saga. However, there are cases where this can’t happen, and that’s ok!

One example is in the disk delete saga: one of the nodes will set the disk’s state to Destroyed, but if the saga unwinds that node’s compensating action will set the disk’s state to Faulted, not what it was previously: other saga nodes in the disk delete saga will have performed actions that cannot be undone and will make the disk unusable (for example deleting associated Crucible resources).

This satisfies another one of our non-ideal-saga-undo-node goals: return the system to a state where the saga can be retriggered, either by the user or some automation. In this case the user can delete a disk in state Faulted, and the whole disk delete saga can be tried again, hopefully running to completion. See https://github.com/oxidecomputer/omicron/pull/4547 for this case.

Idempotent

If the saga executor applies a node’s forward action but crashes before it can record that it did so in the saga store, it will re-execute that forward action when restarted.

Given the same inputs (the saga parameters, other saga node’s output), a saga node’s forward action should be idempotent: if a saga node’s forward action changes a system’s state from S to S', then a re-execution must either detect that the system is already at S' and do nothing, or support changing a system’s state from S' to S' (a no-op).

It should be noted that the case of the executor crashing like this is relatively rare, but idempotence is still required for both re-execution after a crash and, more commonly, for correct behaviour when multiple executions of the same saga (or different sagas!) race and interleave execution of their respective nodes. This is discussed in detail later in [_concurrently_executing_sagas_are_bad].

Atomic

A saga node’s unwind action is not performed if the forward action fails. This is one of the most common sharp edges!. In order to fulfill the requirement that unwinding the saga returns the system to the state it was in prior to the beginning of executing the saga (or at least consistent with that state), a node’s forward action must be atomic: said another way, S → S' must be atomic.

A saga node’s forward action that applies many state changes is problematic: any of those requests for a state change could return an error, or the executor could crash at any point. For example, if a forward action transitions the system from S → S¹ → S² → … → S^N, and one of those fails, then the saga node will have transitioned from S → Sⁿ, and this state change will remain when the saga unwinds.

Known result

Upon completing a saga node’s forward action, S to S' should be known to have occurred. An example where this is not known is if a request to an external daemon timed out. That request should be re-executed until the result of that request is known. retry_until_known_result in the Omicron repo was built for exactly this.

This relates to the atomic property as well: if S → S' might have occurred, should the node’s unwind action be performed?

Sharp edges

“Boldface” is a pilot term, a magic word to describe the procedures that could, in a crisis, save your life. We say that “boldface is written in blood” because often it’s created in response to an accident investigation. It highlights the series of steps that should have been taken to avoid a fatal crash, but weren’t.

— Chris HadfieldAn Astronaut's Guide to Life on Earth

Avoid using names in saga parameters

Use a resource’s ID in a saga’s parameters instead of the resource’s name: using the name can introduce a TOCTOU bug where saga was constructed to work on a resource with name AAAA and id 1, but instead at the time of execution, which could be delayed, works on a resource with name AAAA but id 9.

Performing a dynamic number of state changes

There are cases where a saga will have to perform a dynamic number of steps in the course of its execution, and that number cannot be known before the saga is constructed.

when there’s a bounded number of steps

Some state changes seem like they require a dynamic number of steps but may not. Attaching NICs to an instance is a good example of this: a user is asking for a variable number of NICs to be attached to their instance. The pattern that was invented was to have a saga node for each potential NIC (this works because there is a maximum number of NICs that can be attached), and have each saga node check if the user specified a NIC for that slot. For example, if there is a maximum of 8 NICs allowed, then there need to be 8 forward nodes and 8 compensating action nodes:

// Here's the functions that actually do the work
fn attach_nic(slot: Slot, ..) -> Result<..., ActionError> {
    ...
}

fn undo_attach_nic(slot: Slot, ..) -> Result<...> {
    ...
}

// where the saga nodes have a conditional check first
fn attach_nic_0(..) -> Result<..., ActionError> {
    if !nic_required(Slot(0), ..) {
        return Ok(());
    }

    attach_nic(0)
}

fn undo_attach_nic_0(..) -> Result<...> {
    if !nic_required(Slot(0), ..) {
        return Ok(());
    }

    undo_attach_nic(0)
}

fn attach_nic_1(..) -> Result<..., ActionError> {
    if !nic_required(Slot(1), ..) {
        return Ok(());
    }

    attach_nic(1)
}

fn undo_attach_nic_1(..) -> Result<...> {
    if !nic_required(Slot(1), ..) {
        return Ok(());
    }

    undo_attach_nic(1)
}

...


fn attach_nic_7(..) -> Result<..., ActionError> {
    if !nic_required(Slot(7), ..) {
        return Ok(());
    }

    attach_nic(7)
}

fn undo_attach_nic_7(..) -> Result<...> {
    if !nic_required(Slot(7), ..) {
        return Ok(());
    }

    undo_attach_nic(7)
}

It’s important to note that this was invented for an older version of Steno that required each saga DAG to be fixed for every run of a particular saga. This is no longer true but the pattern is still useful.

when a node produces an arbitrary number of outputs that another node acts on

One example where a saga node can produce an arbitrary number of outputs, each of which requires some further action(s), is provisioning a number of Crucible regions for a disk.

In the happy path a disk is backed by only three regions, each the size of the disk itself. Due to allocation pressure, a virtual disk’s regions could be striped across multiple physical disks. The user may also specify a virtual disk that is larger than one of the physical disks, making multiple regions necessary.

This ends up looking like:

fn allocate_regions(..) -> Result<Vec<Region>, ActionError> {
    ...

    Ok(regions)
}

fn provision_regions(..) -> Result<..., ActionError> {
    ...
    let regions: Vec<Region> = get_output_of_node("allocate_regions");

    for region in &regions {
        // some request is made
    }
    ...
}

One good rule of thumb: the presence of a for loop in a saga node’s forward action is almost always a giveaway for a series of partial state changes! This violates the property of atomicity.

Importantly, the unwind for that node would look like:

fn provision_regions_undo(..) -> Result<..., Error> {
    ...
    let regions: Vec<Region> = get_output_of_node("allocate_regions");

    for region in &regions {
        // some compensating request is made
    }
    ...
}

But remember one of the most common sharp edges: unwind is not executed for a node that fails! Any provisioned region would remain provisioned.

A pattern used to deal with this is to place the compensating action in a previous saga node, and use a noop as the forward action. So, split a saga node from:

If A1 fails, then U1 will be executed:

Another pattern that can be applied here is to have one saga trigger another saga: the first "parent" saga can perform the operations that produce an arbitrary number of outputs, and/or can query whether or not steps are required in the first place, and then as a last set of actions use that information to build a second "child" saga using the known amount of nodes required, start that saga, and wait for it to finish.

The second saga’s actions become easier to write, but the author is required to consider that the parent saga can crash and rerun the "trigger saga" forward action. This looks like:

parent saga P1 constructs and starts child saga C1, then waits for C1 to finish
the saga executor crashes, restarts the parent saga (P1') and the triggered child saga (C1')
P1' constructs and starts child saga C2, which potentially interleaves with C1'

Depending on the design of the child saga, this either requires that the child saga can run concurrently with another copy of itself, or that C2 somehow recognize that C1' exists and bail accordingly.

This also requires the parent saga to essentially be "read-only", as child saga C2 or the rerun parent saga P1' could unwind while C1' continues executing. It’s important that the parent saga’s unwind actions and the child saga’s forward actions be able to run concurrently.

What to do about saga nodes that have no unwind?

Particularly for sagas that delete resources, forward actions may have no compensating action. One example of this is deleting a disk, which will delete provisioned regions. If a forward action has no compensating action, then any failure that causes an unwind will not actually unwind the system state to where it was prior to the saga executing.

One case to look out for: mixing saga nodes that do not have undo steps with those that do. https://github.com/oxidecomputer/omicron/pull/4547 describes a case where the first node of a saga did not have an undo step, the second node did, and the third (and subsequent) nodes did not. This created a situation where the second node’s forward actions were undone, but they could never be redone because the first node’s forward action removed the user’s ability to trigger the saga.

From this: make sure that the whole saga can be re-executed even if it unwinds at any point, and that the unwind process leaves the system in a state where the same saga can be retriggered.

What do you mean it’s not straight line code?!

When writing a saga, the author must structure a long process as a series of saga nodes, and in order to satisfy the properties of a correct node the actual functions themselves tend to do one "thing", such as making an external call, or updating a database record. To make a saga node do many things is itself a sharp edge (for example, see [_performing_a_dynamic_number_of_state_changes]) and should be avoided.

These small saga nodes are then executed by a saga executor. One sharp edge that can be pretty bad is that there is an indefinite amount of time between the execution of each saga node. Even if nothing else is happening in the system, there’s time required to durably record the progress of the saga and the output of each saga node. Problem is there is often a lot else going on, including the triggering of other sagas. An executor is free to execute a multitude of saga nodes in any order (as long as the DAG edges for each individual saga are respected) and this leads to saga nodes (and other code!) interleaving in arbitrary permutations.

One way this manifests: any "observation" made in a previous saga node’s execution may no longer be true in a subsequent saga node. There could be a very long time between node executions, where the worst case might look like an executor crashing and being down for days while a problem is debugged and fixed.

Intermediate state changes are visible

Sagas are not transactions: changes made during the execution of a saga are visible by the rest of the system.

Say for example a saga adds a resource with a state that other sagas which operate on that type of resource can be triggered from. More concretely, imagine a Foo resource can be in states Start, Running, and Deleted, and there’s a "foo create" and "foo delete" saga. If the "foo create" saga inserts the new Foo resource’s record into a database with state Start, this is a visible intermediate state. If the "foo delete" saga is allowed to operate on a Foo that is in any state, and is triggered in the middle of the Foo create saga, then the two sagas will interleave and race. If the "foo delete" saga is only allowed to operate on a Foo that is in state Running, and the last action of the "foo create" saga is to set the resource’s state to Running, then it’s not possible for the two sagas to overlap execution.

It’s important to define the states of a resource that a saga (or sagas) can be triggered from. For a concrete example of this, see https://github.com/oxidecomputer/omicron/pull/3947.

Concurrently executing sagas are bad

Sagas are triggered as a result of user input, and there’s a few cases where sagas can conflict with each other. It’s important when writing a saga to consider these potential conflicts.

Without ensuring that only one saga is executing for a particular resource at a time, execution of the saga nodes for multiple sagas can interleave in any permutation, and can potentially cause problems.

Multiple creation sagas

Imagine a user triggers multiple resource creations, say multiple snapshot creations. This can easily happen if a user accidentally double clicks a UI button, writes incorrect code that uses our API, or really any way: the fact that it can happen means it will, eventually.

Unless these sagas are completely distinct they will end up conflicting with each other, which in the case of a snapshot creation is a problem: multiple sagas will be created with the same input parameters that will attempt to take snapshots of the same disk.

Say that a saga named Alice and a saga named Bob are both started for a snapshot of a disk. One way to block multiple creation sagas from executing for the same resource is to have one of the earlier forward actions' state changes apply successfully for Alice but create an error for Bob. The fact that snapshot names have to be unique for a project is a helpful constraint here: using the snapshot creation example, ssc_create_snapshot_record will succeed for Alice but fail for Bob. Bob will unwind and return an error to the user, where Alice will (hopefully) succeed. As long as the saga nodes leading up to this forward action’s state change can be unwound successfully, this is one approach to take.

Multiple deletion sagas

Deletion sagas are trickier. Again, multiple identical sagas will be created from user requests, but this time there may not be a forward action state change that will succeed for Alice that fails for Bob. Imagine that the first saga node sets a snapshot’s state to Deleted. This could apply successfully from Alice and Bob unless there’s an additional disambiguation for which saga is requesting the state change.

With the snapshot creation saga, the two sagas are disambiguated by the random UUID that is generated near the beginning for the snapshot ID, even though the saga parameters are / could be the exact same. This is how ssc_create_snapshot_record can succeed for Alice and fail for Bob: two snapshots with different IDs would have been inserted if not for the database conflict from the duplicate in Name column.

If the saga nodes are idempotent, then the whole saga itself should be able to rerun with the same input and still succeed. Disambiguation of these two sagas may not be necessary, and if required may prompt a rethinking of whether or not saga nodes that seem to require disambiguation are in fact idempotent!

Creation and deletion sagas

Triggering a resource creation saga along with a resource deletion saga can also cause a conflict. One saga will be making requests to create a resource’s required components while another might be making delete requests for those exact components. If the creation saga starts unwinding, it will quickly find that the components it thinks it created are no longer there.

One pattern that can help is to define the states from which actions can be taken on a resource. Imagine a resource creation saga (Alice) creates a resource in state Creating, does a bunch of work to actually create that resource, then as one of the final forward actions transitions that resource into a state of Ready.

If the resource deletion saga (Bob) is allowed to start when the resource is in state Creating, then the execution of Alice’s saga nodes will conflict with Bob’s. If Bob is only allowed to start if the resource is in the state of Ready, or some list of states that does not include Creating, then it cannot conflict with Alice.

Watch out for PENGUINS THROUGH TIME

Again: XTS works like ECB. It’s deterministic. If you’re looking for the penguins, they’re there, but you have to look for them across time instead of space: successive writes to the same sector-block location will repeat, but encryptions of the same plaintext at different locations will be randomized.

— https://sockpuppet.org/blog/2014/04/30/you-dont-want-xts/

"Electronic Codebook" [ecb] is an encryption mode that lacks diffusion, resulting in a ciphertext that does not hide data patterns. The "penguin" referred to above in the article about XTS refers to a penguin bitmap encrypted with ECB where the output still looks like the input. The article cautions readers to watch out for penguins across time with XTS, and this author would like to popularize the term "penguins across time", so here we are.

Backing up to the property of idempotency, which was defined as

Given the same inputs (the saga parameters, other saga node’s output), a saga node’s forward action should be idempotent.

A saga node’s author has the responsibility to think about multiple copies of a saga running with the same inputs. Imagine Alice and Bob are resource deletion sagas that are triggered and created at the same time. Alice’s execution may interleave with Bob’s, or entirely occur before or after Bob. It’s the saga node author’s responsibility to consider these two cases:

a saga node’s execution is interleaved with the same node in another saga
a saga node is being re-executed (in the context of the same saga)

The state change by this node must idempotently succeed for Alice and Alice' (a re-execution of that saga node) and for Bob’s version of that node. This requirement extends to the undo step, if the saga node has one.

An additional penguin through time to watch out for is an identical saga that executes sometime later. Again imagine Alice and Bob are resource deletion sagas that are triggered and created at the same time, but Alice completes before Bob even begins executing. It is the responsibility of saga authors to consider this case: although each saga node that deletes a resource should do so idempotently anyway, the whole saga must be idempotent to run.

Note that this may only be possible if resources are soft-deleted: soft-deletion is an idempotent state change, but hard-deletion cannot be disambiguated from the case where the resource never existed in the first place.

For example: let’s say that a saga node has a forward action of changing the resource’s state from X (where X is in the list of states Nexus is allowed to delete from) to Deleted, which also involves setting a non-null value for time_deleted to mark this record is soft-deleted.

For this forward action to be idempotent with a re-execution of this saga node, Deleted must be in that list of states Nexus is allowed to delete from, or there must be some way of checking that the delete has occurred already and skip doing it.

crate::app::sagas::test_helpers::actions_succeed_idempotently is very good at finding some of these issues.

Implication for daemons used in sagas

Idempotency and Known results

The idempotency requirement for saga nodes must extend to endpoints used by those saga nodes: they should be idempotent for equivalent input. This is especially important when combined with the requirement that a saga node’s state change have a known result: this could generate many equivalent requests to daemon endpoints.

Also related to the known result requirement: a daemon that returns an error code for a request should not have changed any state. S → S' should eventually return a 2XX response when combined with a saga node action that retries until success is known.

Nested paths

Imagine a daemon that is used by a saga. The example given in Caitie McCaffrey’s presentation [distributed_sagas] is booking a trip and all the dependent things required for that trip.

Imagine that the trip creation saga makes the following requests:

POST http://something/trips/123

POST http://something/trips/123/plane/abc

POST http://something/trips/123/car/def

POST http://something/trips/123/hotel/ghi

Unwinding the trip creation saga will involve compensating requests in the reverse order:

DELETE http://something/trips/123/hotel/ghi

DELETE http://something/trips/123/car/def

DELETE http://something/trips/123/plane/abc

DELETE http://something/trips/123

These would also be the list of requests required to cancel or otherwise delete a trip.

Let’s imagine one of the penguins through time talked about in a previous section: multiple identical delete sagas where the first (Alice) executes completely before the second (Bob) even starts.

Alice would start and then would execute:

DELETE http://something/trips/123/hotel/ghi

DELETE http://something/trips/123/car/def

DELETE http://something/trips/123/plane/abc

DELETE http://something/trips/123

and then Alice would be done. Bob would then start and attempt to execute:

DELETE http://something/trips/123/hotel/ghi

If this was any other daemon, it may be appropriate to return a 404 as a response to this DELETE: trip 123 was previously deleted so DELETE http://something/trips/123/hotel/ghi is a bad request.

It important that, in the context of being contacted by a saga node, that this endpoint instead return 204 No Content: the saga node’s job is to ensure that the resources are deleted, and it’s the daemon’s job to respond idempotently to identical requests. Because this RFD has previously stated that the whole saga must be idempotent, then this requirement extends to nested paths.

From this, daemons that are contacted by sagas must themselves also soft-delete resources so that requests for these resources can return the same results. This must extend to nested paths as well: if trip 123 was soft deleted, then all sub-resources must first have been soft-deleted, and subsequent delete requests should return 204. See https://github.com/oxidecomputer/crucible/pull/1004 for a real world example of this.

Determinations

Knowing the boldface improves your odds, but it’s no guarantee. You can be the best driver in the world with the safest car in the world, but if a semi comes through a stop sign and plows into you, none of that will matter.

But in a real crisis, what other hope have you got? The more you know and the keener your sense of operation awareness, the better equipped you are to fight against a bad outcome, right to the very end.