RFD 107
Workflows Engine

Abstract

This RFD discusses "workflows", a mechanism to coordinate complex, potentially long-running collections of tasks. The goal is to describe a component of Nexus that—​combined with the API layer and datastore—​would encompass most if not all of the functionality of the control plane. Discrete operations such as firmware upgrade, provisioning a VM instance, or modifying a network configuration would be encoded and executed as workflows. The goals, described in more detail below, are to provide a central facility for encoding such operations to simplify the work of the implementer, provide visibility to the user, and bring consistent auditing, debugging, etc.

Note that part of this RFD is also to determine the abstractions best suited to our implementation of workflows. While most industry examples express these as a DAG of steps, we discuss below the potential benefits of additionally or alternatively expressing the desired end state.

It will likely be useful to have some example administrator or user interactions in mind.

  • VM provisioning — serialized and parallel operations to storage, network, and hypervisor subsystems (to scratch the surface)

  • SSD firmware updates — coordinated, progressive update for every SSD in the rack or fleet with particular care taken to make sure data integrity is never diminished (e.g. what do we do if a mirror is in a degraded state already?). Consider the case also where a server is offline (transiently? permanently?) and the SSDs in that system are known, but not accessible. Or consider what happens when a new SSD (with older firmware) is added during or after the execution of the workflow.

  • VM migration — sequential coordination to plumb up all the necessary resources on the target machine, streaming state from the current home of the VM to its new home, pausing the VM, doing all network cutover, resuming the VM in its new home, etc.

  • Replacing a faulty server — a mix of automated steps and directed human intervention. Whether this is done by customers or our own personnel, it will involve some careful execution and checking. One could imagine the need for reminders, escalations, and very long-running workflows.

Introduction

The control plane data layer will contain some representation of the “model”, the desired end state. This includes every piece of configuration, from the versions of every software component in the system (including, presumably, Nexus itself) to user-created entities such as users, disks, VM instances, vNICs, etc. Note that some of these will be completely under the control of the user (e.g. an IAM group); some will be invisible to the user, but indirectly influenced by what they configure (e.g. firewall permissions to enable storage components to communicate); some will be user visible, but outside of their control through software (e.g. the topology of servers, fans, disks).

The final arbiter of truth will be the systems directly responsible for the entities. For completely synthetic entities that’s just the database; for others it’s the responsible system. For example, an SSD will know its firmware revision, the host OS (Helios) knows the running VMs, etc. Nexus will likely require a cache of all those values—​it will not be sensible to query all SSDs for their firmware revision every time an administrator visits a screen that contains that information (and this is still less tenable on a multi-rack configuration).

User actions will create a delta between the current and desired states. We use the term “workflow” to describe a managed collection of steps required to transform some subset of the system from its current state to the desired state. Workflows may be monitored by users to track the progress of changes they’ve requested. We may consider a goal of using this facility for every action we need the control plane to take—​software updates, VM provisioning, data backup, etc. Achieving that goal will require workflows to be simple to author while providing significant benefits to the author and user.
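
As a rough illustration of that framing (the types and names here are hypothetical, not a proposed design), reconciling the desired state against the most recently observed state amounts to computing a delta and emitting the steps a workflow would need to execute to close it:

```rust
use std::collections::BTreeMap;

/// Hypothetical desired/observed state for one class of entity: SSD firmware.
#[derive(Debug, Clone, PartialEq)]
struct SsdState {
    firmware_version: String,
}

/// A single step a workflow would carry out to close the gap.
#[derive(Debug)]
enum Step {
    UpdateFirmware { ssd_id: String, to: String },
}

/// Compute the delta between what was requested and what the system last
/// reported; the resulting steps are the raw material for a workflow.
fn plan(
    desired: &BTreeMap<String, SsdState>,
    observed: &BTreeMap<String, SsdState>,
) -> Vec<Step> {
    desired
        .iter()
        .filter_map(|(id, want)| match observed.get(id) {
            Some(have) if have == want => None,
            _ => Some(Step::UpdateFirmware {
                ssd_id: id.clone(),
                to: want.firmware_version.clone(),
            }),
        })
        .collect()
}

fn main() {
    let desired = BTreeMap::from([(
        "ssd0".to_string(),
        SsdState { firmware_version: "2.1".to_string() },
    )]);
    let observed = BTreeMap::from([(
        "ssd0".to_string(),
        SsdState { firmware_version: "2.0".to_string() },
    )]);
    println!("{:?}", plan(&desired, &observed));
}
```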

For those familiar with Terraform, the model / desired state is expressed in the .tf files (in HCL), the truth is contained in the targeted systems (e.g. AWS, GCP), and the workflow is the DAG of resource-specific actions that Terraform executes. The terraform.tfstate files are roughly equivalent to the cache of state that Nexus would maintain.

Important
Do we think of the workflow purely as the means to transition from the current state to the desired one or would we include the activities needed to update the target state as part of the workflow? For an SSD firmware update for example, would we update the target version, make a plan to transform the system, and kick off a workflow to execute that plan? Or would a step in the workflow be to make the persistent change to the target state?

Existing Workflow Engines

We conducted a non-exhaustive survey of existing, open source workflow engines harvested from a subset of those in this list. We bowed to the wisdom of the crowd by pruning those with fewer than 1K GitHub stars.

Important
If there are other interesting workflow engines to consider as candidates for our own use, as sources of inspiration, or as relevant competitors, those would be extremely valuable points to discuss.

The evaluation of the systems below was cursory, limited to documentation and examples; we did not put these systems into use or dig in exhaustively (though it was exhausting). Our goals were as follows:

  • Find systems we might use in whole or part

  • Identify key distinctions between workflow systems

  • Learn any important lessons, breakthroughs, or science in this domain

  • Identify common nomenclature

The notes below on each system are a subset of the notes taken; please comment on this RFD for more information on any of the systems listed (or systems not listed).

Apache Airflow

“Airflow is commonly used to process data, but has the opinion that tasks should ideally be idempotent, and should not pass large quantities of data from one task to the next”

"Pipelines" and "Tasks" are specified in Python. These are uploaded to the framework with a pretty nice GUI to visualize logic, in-flight processes, etc.

Argo Workflows

Argo is built around use with Kubernetes. Also: “Argo Workflows puts a cloud-scale supercomputer at your fingertips!” 🤮

Cadence

See this talk on YouTube (remember to like, comment, and subscribe)

Developed by Uber; used for a variety of workflows that seem more oriented towards longer running, human-in-the-loop style workflows (quite sensibly for their business). "Activities" are written in Java or Go (maybe other languages as well), based on futures. Discrete actions are called "Tasks". Activities have methods for specifying the collection of steps, receiving events, and querying the status. Cadence has a pretty nice-looking unit test framework with the ability to mock actions and timers that warrants further investigation when we start to concretely consider testing.

Conductor

From Netflix. Similar in concept to Cadence with the biggest distinction being that they have a JSON-based DSL to specify the workflows.

Prefect

"Flows" and "tasks" written in Python, oriented for data engineering. It’s an open core project, directing people to the Prefect Cloud. Their focus is data engineering workflows, ETL stuff, with integrations into various data repositories it seems

Oozie

Hadoop workflow scheduler; nothing too relevant for us here.

n8n

Visual integration engine; not relevant to our use case.

Wexflow

Visual composition tool with some Blockly/Scratch-style coding, focused on .NET environments.

Zeebe

"Workflows" are built in a visual editor (BPMN) or YAML; discrete actions are called "tasks" (specifically, "service tasks"). Their main focus is on human-in-the-loop workflows (e.g. business process management). RocksDB is their state repository.

Types of Workflows

It may be useful to categorize workflow use cases into a few different groups:

  1. "One-shot workflows" for actions like provisioning a VM that are expected to succeed most of the time and should always complete quickly, regardless of success or failure. These workflows generally shouldn’t retry internal operations, or at least not very much. Examples: boot/halt/reboot a VM or server, allocate/attach/detach/snapshot a disk, etc.

  2. "Reliable one-shot workflows" for actions where we want the system to autonomously finish the task regardless of how long it takes. By "autonomously", we mean that it’s not just blindly retrying a low-level action, but it’s proactive about checking and rechecking the state of the system to make sure things haven’t flown off the rails. If it encounters transient failures, it should retry with randomized, capped exponential backoff (a sketch of this backoff appears after this list). If it encounters persistent failures (by which we mean a failure that we know requires operator intervention, not just a transient failure that’s happening a lot), it should notify an operator and probably stop. If it encounters unknown or bad state, it should definitely notify an operator and stop. We would use this for executing many background tasks in the system, including live migration, unplanned non-live migration, phone home activities, backups, reports, database migrations, etc. We may use this to apply user changes that can’t be done atomically the way provisioning a VM can, like most changes to networking or firewall configuration.

  3. "Reliable persistent workflows" for reliable workflows that additionally define conditions for reactivating them. Most software upgrades might fall into this category, at every level of the stack: if you want to update from version X to version Y, we first do the upgrade (similar to a reliable one-shot workflow). Then if we later discover a component at version X, we run this workflow again to go fix it. Other examples might include changes to software configuration and storage rebalancing.
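
For the second category, the randomized, capped exponential backoff described above might look something like the following sketch (purely illustrative; a clock-based stand-in is used in place of a real random number generator):

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Delay before the `attempt`-th retry: exponential backoff capped at `max`,
/// with "full jitter" so that many stalled workflows don't retry in lockstep.
fn backoff(attempt: u32, base: Duration, max: Duration) -> Duration {
    // 2^attempt * base, saturating rather than overflowing.
    let exp = base.saturating_mul(1u32.checked_shl(attempt).unwrap_or(u32::MAX));
    let capped = exp.min(max);

    // Stand-in jitter source for the sketch; a real implementation would use
    // a proper RNG.  Pick a value in [0, capped].
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .subsec_nanos() as u128;
    Duration::from_nanos((nanos % (capped.as_nanos() + 1)) as u64)
}

fn main() {
    for attempt in 0..8 {
        println!(
            "attempt {attempt}: sleep {:?}",
            backoff(attempt, Duration::from_millis(100), Duration::from_secs(30))
        );
    }
}
```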

A major benefit of first-classing workflows is that we can build tooling to show operators, support personnel, and users what workflows have finished, are currently running and making progress, are paused because they’ve encountered some unexpected state and require operator intervention, are running but haven’t made forward progress in a while, etc. This observability (including recent errors from stuck workflows, and dealing with cases where there are far too many errors for a person to look through) is tremendously valuable for understanding the state of the system and fixing it when things go wrong.

Key Design Considerations

Based on the systems evaluated in the previous section, we’ve identified a number of design considerations to bear in mind.

Language

Important
How will we specify workflows? The choices are either an existing language such as Java, Python, Go, or—​likely in our case—​Rust, or a custom language typically specified in JSON or YAML.

Using a language such as Rust has several advantages:

  • Easy to write workflows without learning some new language—​this may be fairly important for us as we want to make it easy to work vertically through the stack and don’t want to require a tremendous amount of domain specific learning in Nexus

  • Easy to unit test / mock: it’s just code

  • Potentially easier error handling and complex logic

Using a DSL has different advantages:

  • In general, the system can infer more about the structure of the workflow

  • Dependencies are explicit rather than processed at runtime

  • The inputs and outputs of each step are potentially more explicit
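
To make the comparison concrete, here is a rough sketch of what the "just code" option might look like in Rust (hypothetical names throughout; not a proposed API). The trade-off is visible: control flow and error handling use the language's own facilities, but the engine cannot see the step structure or the dependencies ahead of time.

```rust
struct Context {
    instance_id: String,
}

fn allocate_ip(ctx: &Context) -> Result<String, String> {
    Ok(format!("10.0.0.5 (for {})", ctx.instance_id))
}

fn reserve_storage(ctx: &Context) -> Result<String, String> {
    Ok(format!("volume (for {})", ctx.instance_id))
}

fn boot_instance(ctx: &Context, ip: &str, volume: &str) -> Result<(), String> {
    println!("booting {} with {ip} and {volume}", ctx.instance_id);
    Ok(())
}

/// The workflow is just a function: the dependency structure exists only in
/// how the author sequenced the calls.
fn provision(ctx: &Context) -> Result<(), String> {
    let ip = allocate_ip(ctx)?;         // these two steps don't depend on each
    let volume = reserve_storage(ctx)?; // other and could run concurrently
    boot_instance(ctx, &ip, &volume)
}

fn main() {
    let ctx = Context { instance_id: "inst-123".to_string() };
    provision(&ctx).unwrap();
}
```

A DSL version of the same workflow would instead declare the three steps and the edges between them, letting the system infer that the first two could run in parallel, at the cost of pushing conditionals and failure handling into the DSL.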

We need to be able to express not only complex logic, but also complex failure handling. If we want to cover long-running, human-in-the-loop types of automation, we may also need to express ways to pester users and escalate if they don’t respond.

Execution Semantics

Important
What part of the system ensures consensus, exactly-once semantics, and rollback from failure?

The database will store state that pertains to in-flight workflows. When a task or step becomes runnable, we will need some sort of notification mechanism to record that fact and a dispatch mechanism to run the next part of the workflow. This could, for example, employ a pub/sub mechanism in the database (if one exists), or require us to build something bespoke.
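
As a sketch of what the dispatch side might look like with a simple polling model (in-memory stand-ins for the database tables, hypothetical names, and no attempt at consensus or exactly-once semantics):

```rust
use std::thread;
use std::time::Duration;

/// Persistent state of a step, as it might be recorded in the database.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum StepState {
    Blocked,  // waiting on a dependency
    Runnable, // all dependencies satisfied; ready to dispatch
    Running,
    Done,
    Failed,
}

struct StepRecord {
    workflow_id: u64,
    step_name: &'static str,
    state: StepState,
}

fn main() {
    // In-memory stand-in for rows in the workflow tables.
    let mut steps = vec![
        StepRecord { workflow_id: 1, step_name: "allocate-ip", state: StepState::Runnable },
        StepRecord { workflow_id: 1, step_name: "boot-instance", state: StepState::Blocked },
    ];

    // Dispatch loop: find a runnable step, mark it running, execute it, record
    // completion, and unblock dependents.  A real version would make each
    // transition a database transaction so that a crashed dispatcher's work
    // can be detected and resumed.
    loop {
        let Some(idx) = steps.iter().position(|s| s.state == StepState::Runnable) else {
            if steps.iter().all(|s| s.state == StepState::Done) {
                break;
            }
            thread::sleep(Duration::from_millis(100));
            continue;
        };

        steps[idx].state = StepState::Running;
        println!("running {} (workflow {})", steps[idx].step_name, steps[idx].workflow_id);
        steps[idx].state = StepState::Done;

        // Naive dependency handling for the sketch: unblock everything once
        // the prior step finishes.
        for s in steps.iter_mut() {
            if s.state == StepState::Blocked {
                s.state = StepState::Runnable;
            }
        }
    }
}
```

The sleep-and-retry branch is exactly where a database notification mechanism would slot in, waking the dispatcher only when a step's state actually changes.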

If a running task is terminated unexpectedly, say, because the system it’s running on reboots, we will need to detect that it’s failed and figure out a remediation plan. Similarly, we may want timeouts on tasks (despite the famous challenges of setting appropriate timeout lengths) that trigger different forks of the workflow. For example, if we go update an SSD firmware and the drive never comes back after resetting it, this might trigger a timeout handler to notify an operator to ask them to intervene; it would not terminate or fail the workflow of which this was a single step.
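
One way to express such a timeout is as a per-step deadline whose expiration routes the workflow down a "notify an operator" branch rather than failing it. A rough sketch using tokio (hypothetical step and outcome names; this assumes the tokio crate as a dependency):

```rust
use std::time::Duration;
use tokio::time::timeout;

// Requires the tokio crate (e.g. with the "time", "macros", and
// "rt-multi-thread" features).

#[derive(Debug)]
enum StepOutcome {
    Completed,
    /// The step did not finish in time; the workflow takes a different fork
    /// (e.g. page an operator) instead of unwinding or failing outright.
    Stalled { waited: Duration },
}

/// Stand-in for "reset the SSD and wait for it to reappear"; in the
/// pathological case this never completes.
async fn wait_for_ssd_to_return() {
    tokio::time::sleep(Duration::from_secs(3600)).await;
}

async fn reset_and_wait_step(deadline: Duration) -> StepOutcome {
    match timeout(deadline, wait_for_ssd_to_return()).await {
        Ok(()) => StepOutcome::Completed,
        Err(_elapsed) => StepOutcome::Stalled { waited: deadline },
    }
}

#[tokio::main]
async fn main() {
    match reset_and_wait_step(Duration::from_secs(2)).await {
        StepOutcome::Completed => println!("SSD came back; continue the workflow"),
        StepOutcome::Stalled { waited } => {
            // The workflow itself is still alive; this branch would record a
            // "needs operator attention" event and pause this step.
            println!("SSD did not return within {waited:?}; notifying an operator");
        }
    }
}
```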

Deployment and Authoring

Important
Is the workflow service effectively stand-alone with workflows and actions sent to it by consumers, or are the workflows and actions also part of the control plane?

Most of the systems we looked at are intended at some level to solve the organizational challenge of trying to coordinate activities that need to consume functionality from disparate services owned by disparate groups. The workflow framework acts as a hub, mostly unaware of the actual workflows being executed. Those types of organizational challenges seem (mercifully) far off. As such, we can probably have our workflows more tightly integrated with the engine itself. That is to say, we can co-develop services that handle low-level components, the workflows that coordinate them, and the workflow engine itself.

Recovery From Failure

Important
Should recovery from workflow failures be automatic or explicit in the encoding of the workflow?

In the case of an unrecoverable failure in the middle of a multi-step workflow, the system should have mechanisms to unwind previous steps to return the system to its original state. One could think of workflows, effectively, as transactions over innately non-transactional, distributed systems.

Caitie McCaffrey’s work on Distributed Sagas may be a source of inspiration. Sagas are a database research concept from the '80s describing a specific type of long-lived transaction that can be interleaved with other transactions and can be reverted (i.e. all-or-nothing). Distributed Sagas apply that concept to distributed systems, where transactional semantics are even more elusive. In short, each step comes with a procedure to apply the intended action as well as one to revert it. For example, creating and then destroying a guest VM are a symmetrical pair; it gets trickier for operations with observable side effects such as powering off a server.
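
To make that concrete, here is a minimal, synchronous sketch of the forward/compensating pairing (illustrative only; this is neither a proposed API nor how any particular library implements it, and it elides persistence, idempotency, and retries):

```rust
/// One step of a saga: a forward action plus a compensating action that
/// undoes it if a later step fails.
struct SagaStep {
    name: &'static str,
    action: fn() -> Result<(), String>,
    compensate: fn(),
}

/// Run steps in order; on the first failure, run the compensations of the
/// steps that already completed, in reverse order.
fn run_saga(steps: &[SagaStep]) -> Result<(), String> {
    let mut completed: Vec<&SagaStep> = Vec::new();
    for step in steps {
        match (step.action)() {
            Ok(()) => completed.push(step),
            Err(e) => {
                for done in completed.iter().rev() {
                    println!("compensating: {}", done.name);
                    (done.compensate)();
                }
                return Err(format!("saga failed at {}: {e}", step.name));
            }
        }
    }
    Ok(())
}

fn main() {
    let steps = [
        SagaStep {
            name: "create-vm",
            action: || {
                println!("create VM");
                Ok(())
            },
            compensate: || println!("destroy VM"),
        },
        SagaStep {
            name: "attach-disk",
            action: || Err("disk allocation failed".to_string()),
            compensate: || println!("detach disk"),
        },
    ];
    println!("{:?}", run_saga(&steps));
}
```

Running the example executes "create-vm", fails at "attach-disk", and then runs the compensation for "create-vm", giving the all-or-nothing behavior described above.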

We could either have the specification of each step include a mechanism for inverting it should a subsequent step fail, or we could leave it to the workflow author to account for unwinding every failure state. Neither approach is free from downsides. The former invites the obvious recursion of what to do should the recovery from failure itself fail. The latter places a higher burden on every workflow author…​ and may suffer from the same pathology as the former.

Human-in-the-loop or Machine-only

Important
Should the workflow facility include workflows that involve some human interaction or should they be exclusively automatic?

Many workflows will involve only software components of our system. VM provisioning is an example of this type of operation: the value of the workflow is to coordinate the activity of many, disparate components; to sequence the steps; to provide visibility, logging, recovery, etc. Even (relatively) long-running workflows might be fully automatic. Consider VM migration or SSD firmware upgrade as two examples that may easily require many minutes (recall that a given SSD may contain small parts of many virtual disks so taking two of these offline simultaneously could incur greater-than-desired risk).

There are other operations on the Oxide rack or collection of racks that will involve both automatic steps and steps that require a human to intervene or some other external activity (e.g. an API call). A canonical case is that of component (e.g. an SSD) replacement, but there are many others such as approval workflows (email to a manager to verify that a user can provision an nth VM), limit increases, or various recovery paths from failures.

To further complicate the matter, this distinction may not be particularly crisp. For example, assume that every block of a virtual disk is replicated in three locations. If the SSD containing part of that virtual disk is down, would we upgrade the firmware on the other SSDs, temporarily putting data in a single location? Assuming the answer is "no", this condition would prevent the completion of the SSD firmware upgrade and would likely require human intervention to proceed, assuming that we would want an operator to declare a disk unrecoverable before we started reconstructing a new replica.

Important
If we want a workflow system that can accommodate human-in-the-loop operations, should this be the same system, a distinct one, or something complementary?

User-visible Service

Important
Should we plan for the possibility of our workflow service being directly accessible by our users?

Our users will already indirectly interact with our workflow service: they will initiate operations, monitor the status, receive alerts, etc. Today AWS has at least two workflow-style services: Simple Workflow Service (SWF) and Step Functions. It is unclear if AWS uses these internally or if they’re purely customer facing (this would be an interesting data point if anyone has contacts at AWS who could answer this). (Note that Amazon.com has their own custom workflow service called "Herd", but there does not appear to be any public information about it other than the name.)

The short-term answer here is certainly "no". The longer-term answer may also be "no", but there is value in having this be an explicit non-goal (or explicit goal should this be deemed important, customer-facing functionality).

Determinations

  • We’re pursuing distributed sagas for control plane operations where they’re appropriate (namely: a statically-defined set of steps, possibly with dependencies between them, that should either fully complete or else be unwound). This covers "one-shot workflows" and "reliable one-shot workflows" above.

  • We’re targeting internal-use-only for now (not a customer-facing service).

  • We’re deferring for now work on "reliable persistent workflows", mostly because we’re just not there yet. A path here may be clearer by the time we need them because we’ll have more experience with sagas for one-shot workflows.

A prototype saga implementation exists in Nexus today and is used for provisioning new Instances. Next steps on this path are to gain more experience building the control plane in terms of these sagas. For more details on sagas and our implementation, see Steno.