RFD 48
Control Plane Requirements

Introduction

In software systems the terms data plane and control plane are often used to refer to the parts of the system that directly provide resources to users (the data plane) and the parts that support the configuration, control, monitoring, and operation of the system (the control plane). Within the Oxide rack, we can say that the data plane comprises those parts that provide CPU resources (including both the host CPU and hypervisor software), storage resources, and network resources. The control plane provides the APIs through which users provision, configure, and monitor these resources and the mechanisms through which these APIs are implemented. Also part of the control plane are the APIs and facilities through which operators manage the system itself, including fault management, alerting, software updates for various components of the system, and so on.

A solid control plane implementation provides several major customer benefits:

  • the product’s primary deliverable: self-service, API-provisioned cloud infrastructure;

  • better utilization (and so improved value / reduced cost) compared with other systems as a result of a unified hardware/software view of infrastructure and proactive workload rebalancing as needed;

  • better operational experience (for reduced operational cost) compared with other systems, also due to tight integration between software and hardware, which allows the system to reliably manage hardware failures, hardware lifecycle, and firmware upgrades; and

  • API-driven reporting and control over the system for integration with existing tools and systems the customer may already be using.

Several RFDs are closely related:

  • RFD 48 (this RFD) describes requirements, goals, and open questions related to the Oxide Rack control plane.

  • [rfd53] describes requirements and evaluation methodology for the control plane data storage system.

  • [rfd61] describes a proposed architecture and implementation details.

  • [rfd55] describes user alerts provided by the control plane.

  • [rfd10] describes a prototype implementation of the Oxide API.

  • [rfd14] describes the Oxide Rack product.

Important
This RFD uses blocks like this one to call out key design choices to make it easier to see where early feedback is most valuable.

Major open questions (feedback requested!)

Feedback is requested on any and all of the content in this RFD. It may be useful to focus on the callout blocks, since those highlight particular design choices we’re making now. Particularly big open questions include:

What will the control plane data storage subsystem look like? See below for a discussion of some options and [rfd53] for more details. Papers or other public documents about competitive systems would be helpful!

What’s the role of the control plane in bringing services online from a cold start, particularly when it comes to authenticating individual servers, unlocking local disk encryption keys (for the OS), unlocking network disk encryption keys (for Instances), and unlocking any other cluster secrets (what are these?)?

Other product questions

Automatically migrating storage resources in the case of apparent server failure: See the discussion in [_automatic_migration_of_resources], which describes several very different approaches. Which of these are customers likely to prefer?

Extra capacity vs. reduced availability: several sections below implicitly assume operators would prefer to keep extra capacity around (unused) rather than accept service outages. For example, the sections on control plane software update assume that operators have capacity available for live-migrating Instances to avoid an outage while a server is rebooted. Some operators might prefer to take the outage instead of having excess capacity. Are there customer requirements in this area? Can/should we encourage customers one way or the other? Should we provide a policy knob?

Control Plane Requirements

Functions of the control plane

Broadly, the control plane must provide:

  • an externally-facing API endpoint described in [rfd4] through which users can provision elastic infrastructure backed by the rack. This includes APIs for compute Instances, storage, and networking, as well as supporting resources like organizations, users, groups, ssh keys, tags, and so on. This API may be used by developers directly as well as by the developer console backend. See [rfd30].

  • an externally-facing API endpoint for all operator functions. This is a long list, including configuration and management of hardware and software components and monitoring.

  • implementation of lifecycle activities, like initial rack setup; adding, removing, or replacing servers or other components; and the like.

  • facilities for remote support by Oxide, including secure access to crash dumps, core files, log files, and system consoles.

Some of the requirements (e.g., control over host power and console access for remote support) may have significant implications for both rack and control plane design, so it’s worth enumerating these in some detail.

Developer API

Providing elastic infrastructure is the purpose of the product. [rfd4] describes the API for this in detail. This API may be accessed directly, or using higher-level tooling (like Terraform), or via the Developer Console.

The API provides endpoints for:

  • Instances: CRUD, boot/halt/reboot, access console, reconfigure/resize, etc.

  • images (for compute Instances): CRUD

  • disks (network block storage): CRUD, backup?/restore?, snapshot, clone, detach/attach

  • networking resources: CRUD of many different resources (VPCs, VPC Subnets, Routers, NAT devices, VPNs, etc.). See [rfd21].

  • auditing of administrative operations

There are also non-infrastructure resources used to organize and manage everything: projects, organizations, teams, users, ssh keys.
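
To make the shape of this surface concrete, here is a minimal sketch of the kinds of operations involved. It is not the actual [rfd4] definition; the trait and method names are hypothetical and only illustrate the CRUD-plus-actions pattern described above.

    // A hypothetical, simplified slice of the developer API surface. Real
    // operations take richer request/response types and return more structured
    // errors; this only illustrates the resource/action pattern.
    trait DeveloperApi {
        // Instances: CRUD plus lifecycle actions
        fn instance_create(&self, project: &str, name: &str, ncpus: u32) -> Result<(), String>;
        fn instance_reboot(&self, project: &str, name: &str) -> Result<(), String>;
        fn instance_delete(&self, project: &str, name: &str) -> Result<(), String>;

        // Disks: CRUD plus attach/detach and snapshot
        fn disk_create(&self, project: &str, name: &str, size_bytes: u64) -> Result<(), String>;
        fn disk_attach(&self, project: &str, disk: &str, instance: &str) -> Result<(), String>;
        fn disk_snapshot(&self, project: &str, disk: &str, snapshot: &str) -> Result<(), String>;

        // Networking: CRUD of VPCs, VPC Subnets, etc. (see [rfd21])
        fn vpc_create(&self, project: &str, name: &str) -> Result<(), String>;
    }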

Operator functions for system management

Generally, the functions here will be accessible via an API. Operators may use the API directly, higher-level tooling, or an Operator Console.

These functions include:

  • basic configuration: includes NTP, DNS, etc; resource limits; system policies

  • rack-level trust: the control plane is assumed to play some role in attestation, but that is not yet understood (or written down)

  • hardware visualization: enumeration of racks, servers, other rack hardware (like power shelves, switches, etc.), and other server hardware (including the system board, root of trust, CPUs, memory, disks, and any expansion cards).

  • hardware location LEDs: the system will support lighting the location LED on components that have them, including many of the above

  • hardware fault reporting: collection and reporting of devices believed to be faulty and information known about the problems

  • hardware fault management: automated response to known issues

  • software update, which includes:

    • general support (schedules, policies, etc.)

    • updates for stateless control plane components

    • updates for control plane data storage

    • updates for the host OS (assumed to require live migration followed by reboot)

    • updates for the hypervisor (assumed to require live migration and possibly a reboot, assuming the hypervisor implementation is coupled to the host OS)

    • updates for host bootstrap software (assumed to use live migration followed by reboot to immediately verify the new software)

    • updates for root-of-trust software (assumed to require live migration followed by reboot because the RoT upgrade will reset measurements maintained by the device)

    • updates for SP software (assumed to require only a reboot of the SP)

    • updates for device firmware (may require live migration followed by reboot)

  • alerting (asynchronous notification of various events of interest)

  • control plane backup and restore

  • auditing of administrative operations

The above points about software updates assume that live migration is cheap and reliable and so err on the side of using that for safety even in cases where it might not strictly be necessary (e.g., device firmware updates). We can revise these decisions as new understanding dictates.

Operator functions for rack lifecycle

The system needs either fully automated or API-driven facilities for:

  • initial rack setup

  • adding a compute server

  • removing a compute server

  • replacing a compute server

  • replacing a switch

  • adding/removing/replacing other rack components

  • adding/removing/replacing other server components

System support

The system needs either fully automated or API-driven facilities for support by Oxide, including:

  • auditing of administrative operations

  • collection of crash dumps and core files

  • collection of log files

  • access to server consoles

  • access to server power controls, including NMI ("diagnostic reboot")

  • support bundles

Metrics

The system must provide real-time and long-term metrics, including metrics provided by the hardware, host OS, service processor, root of trust, hypervisor, storage data path, and networking data path.

We may also want to support higher-level functions like threshold-based alerts, more sophisticated automated diagnosis and repair, and capacity planning features.

We may also want to integrate with customers' existing metric systems (e.g., by providing a Prometheus endpoint).

Non-API-driven functions

The functions mentioned above have major components that are not driven by operator APIs:

  • fault reporting (the system must collect this information all the time)

  • fault management (the system may take action automatically in response to failures)

  • software update policies (as directed, the system keeps components up to date)

  • system support (collection of crash dumps, core files, and log files)

  • metric collection

  • automated workload balancing to maintain target levels of utilization (see below)

Additionally, the control plane will likely be involved in some way in the boot process for compute servers (e.g., to provide configuration, credentials, a kernel, etc.). Details are TBD.

The control plane plays an important role in managing overall system utilization. Increased utilization provides direct value to the customer, but some headroom is necessary to maintain quality of service. Since many Instances use significantly fewer resources than they’re configured with, operators may want to support overprovisioning of machine resources. In that case, the control plane should monitor each server’s actual utilization and take that into account not just any time an Instance is provisioned or rescheduled, but also in proactively terminating spot instances or moving standard instances using live migration. We’ll likely want to provide controls to operators for target utilization for each class of service (e.g., standard instances get at most 50% of a machine, and machines should not be filled past 80%).
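
As a rough illustration of the kind of policy knob this implies, the sketch below checks a placement against per-class utilization targets. The types are hypothetical, and the 50%/80% figures are just the examples from the text, not a proposed interface.

    // A minimal sketch of a placement check against operator-configured
    // utilization targets.
    struct UtilizationPolicy {
        max_share_per_standard_instance: f64, // e.g., 0.50 of a machine
        max_fill_per_server: f64,             // e.g., 0.80 of a machine
    }

    struct ServerUtilization {
        cpus_total: f64,
        cpus_in_use: f64, // measured utilization, not just configured
    }

    fn can_place_standard_instance(
        policy: &UtilizationPolicy,
        server: &ServerUtilization,
        instance_cpus: f64,
    ) -> bool {
        let instance_share = instance_cpus / server.cpus_total;
        let fill_after = (server.cpus_in_use + instance_cpus) / server.cpus_total;
        instance_share <= policy.max_share_per_standard_instance
            && fill_after <= policy.max_fill_per_server
    }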

Automated migration of resources

Under various conditions, the system may migrate resources, by which we mean stopping service of a resource (e.g., a compute Instance, a storage resource, or a networking resource) on one compute server and directing another compute server to begin serving it.

For Instances, there are at least two types of migration:

  • planned, live migration, where an end user may not even notice that the Instance has been moved because memory and execution state are preserved

  • uncoordinated, non-live migration, which an end user might experience as a reboot, since memory and execution state necessarily cannot be preserved.

Live migration is a very useful tool for the system internally because it enables the system to update any software component on a compute server with virtually no end-user-visible impact. This might be used for updating the host operating system, host bootstrap software, device firmware, and so on. It can also be used any time an operator just wants to power off or reboot a server, or even for balancing capacity across servers within the rack.

Important
The control plane must support planned live migration of Instances. Besides being expected by customers, this facility significantly reduces the impact and risk of various control plane activities, including software update and resource rebalancing.

Uncoordinated, non-live migration of Instances is potentially desirable for customers, too. Operators will likely want this when responding to a permanent server failure in order to move the workload to other working servers. AWS appears to support this as an opt-in behavior, though details remain to be understood. The extent to which we would want to apply uncoordinated, non-live migration of Instances automatically in response to failures is a more open question, but that question is separate from whether the underlying mechanism exists, and we believe this is lower priority than much of the rest of the control plane work.

Important
The system must support uncoordinated, non-live migration of Instances and storage resources. The specific circumstances under which we would automatically use unplanned, non-live migration for Instances are to-be-determined. We will prioritize this work as we better understand the market need and gain operational experience with the underlying mechanism.

Non-functional requirements of the control plane

Availability. Availability of the control plane refers to the property that requests to provision resources succeed when the underlying resources are available within the rack and requests to reconfigure or monitor resources succeed as long as they are well-formed. Unavailability refers to request failure due to hardware or software failure.

Important
Generally, the control plane is expected to remain available in the face of any single hardware or software failure, including transient failures of individual compute servers, power shelves, switches, rack controllers, or the like.

Durability. Along the same lines, resources created in the control plane are expected to be durable unless otherwise specified. That is, if the whole rack is powered off and on again ("cold start"), the system should converge to a point where all Instances, disks, and networking resources that were running before the power outage are available as they were from the user’s perspective before the event. Similarly, if a compute server is lost (either through graceful decommissioning or otherwise), it should be possible to resume service of resources that were running on that server (e.g., Instances, disks) on other servers in the rack. There may be additional constraints on how many servers can fail permanently before data is lost, but in no case should it be possible to permanently lose an Instance, disk, or other resource after the permanent failure of a single compute server.

Important
Resources created by users should generally survive permanent failure of any single hardware or software component.

Consistency. Generally, users can expect strong consistency for resources within some namespace. The bounds of the namespace for a particular resource may vary as described in [rfd24]. For example, if a user creates an Instance, another user with appropriate permissions should immediately see that Instance. In terms of CAP, the system is generally CP, with an emphasis on avoiding partitions through reliable software and hardware.

Important
The API namespace is generally expected to provide strong consistency.

Scalability and performance. The API is designed with a scheme for naming and pagination that supports operating on arbitrarily large collections, so in principle it’s expected to support arbitrary numbers of most resources. In practice, the system is intended to support on the order of 100 servers in a rack and 10,000 VMs in a rack. While these numbers are unlikely to change drastically in the future, the long-term goal of providing a single view over multiple racks means the system will need to support much larger numbers of servers and other resources. To avoid catastrophic degradation in performance (to the point of unavailability) as the system is scaled, aggressive limits will be imposed on the numbers of most resources. Operators may choose to raise these limits but will be advised to test the system’s performance at the new scale.

Important
The API should support arbitrarily large systems. The system itself should be clear about its target scale and avoid catastrophic degradation due to users consuming too many resources.
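
For reference, a name-based pagination scheme of the sort mentioned above might look roughly like the following sketch (hypothetical types, not the actual API’s scheme): each request names the last item already seen, so clients can walk arbitrarily large collections without the server holding per-client cursor state.

    // A minimal sketch of name-based pagination over a sorted collection.
    struct PageParams {
        starting_after: Option<String>, // name of the last item on the prior page
        limit: usize,                   // maximum number of items to return
    }

    fn list_page<'a>(sorted_names: &'a [String], page: &PageParams) -> &'a [String] {
        let start = match &page.starting_after {
            // Names are unique and sorted, so resume just past the marker.
            Some(marker) => sorted_names.partition_point(|n| n <= marker),
            None => 0,
        };
        let end = (start + page.limit).min(sorted_names.len());
        &sorted_names[start..end]
    }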

Security. [rfd6] discusses control plane security in great detail.

Supportability and debuggability. Effective customer support includes rapidly diagnosing issues and releasing fixes with low-risk updates. To achieve this, all the software in the rack, including the control plane, must be built with supportability in mind: we must be able to collect enough information about failures to diagnose them from their first occurrence in the field as much as possible, and we must be able to update software with low risk to the system. Details will be covered in an RFD to-be-named-later.

Multi-rack control plane

Our immediate goal is to build a control plane suitable for a single-rack deployment using a design that can be extended for multiple racks, AZs, and regions. [rfd24] discusses what it means to extend various API abstractions beyond a single rack. This is mostly beyond the scope of this RFD, but it’s worth mentioning a few key points here.

A key result of [rfd24] is that users (both developers and operators) will likely want to interact with multiple Oxide racks as a unit, whether that’s at the level of an availability zone, region, or fleet. To achieve this, we’ll eventually need to build API endpoints that aggregate resources provided by individual racks (e.g., "region API servers"), an interface for those endpoints to manage components within racks (either directly or via a rack-level API), and potentially other components as well. We could spend considerable time fleshing out what this looks like and even implement v1 of the rack using the same components that we’ll use to extend the control plane beyond one rack. However, we’ll defer as much of that work as possible, as it’s not strictly necessary for v1 and may well be invalidated by our experience building the single-rack control plane.

Important
To keep our options as open as possible, we’ll have the system expose our best approximation of a region-wide API (to avoid as much breakage as possible when we do extend the control plane to multiple racks), but the implementation will take full advantage of being a single-rack system. We will not attempt to build internal abstractions or APIs as though there were multiple racks behind the region API.

Because we expect that customers will want to be able to list Instances for multiple racks, we can infer that in a multi-rack system, the data associated with a given rack’s control plane must be replicated beyond that rack. Otherwise, an attempt to list Instances in the region when a single rack is offline would either fail (violating our availability requirement) or return incomplete results (violating our consistency requirement). This combination of requirements eliminates the approach of using a simple federation of autonomous racks.

There’s more on multi-rack support in the proposed architecture below.

Software updates in the control plane

As described above, the control plane must support automatic update of all of the Oxide-supported software components in the system, from the control plane itself to device firmware. For the control plane software, update mechanisms should use well-known best practices to maintain availability and minimize the risk associated with software updates:

  • Infrastructure should generally be immutable, which is to say that we prefer to provision new components (and deprovision older ones) rather than update the software delivered within a particular component. The more we can leverage our own Instances, Disks, and Images APIs, the better.

  • Updates should generally be rolled out gradually. A typical deployment should provision new components, bring them into service, validate that they’re working as expected, and then deprovision older ones. This process can be shrink-to-fit based on the complexity, scale, and risk tolerance of the component.

  • To the extent possible, rollback should be an option at least for some period after an upgrade. This too can be shrink-to-fit based on component — for systems of heterogeneous components, each having policies around compatible versions, automated rollback can grow quite complicated.

Ideally, the software update process would be sufficiently reliable and non-impacting that operators could opt to have the system do it automatically without any input from them. In practice, operators are likely to want to control when updates happen to mitigate risk.
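
A minimal sketch of the gradual, provision-then-deprovision pattern described above (all hooks hypothetical): bring a new component into service, validate it, and only then retire an old one, halting if validation fails so that rollback stays simple.

    // Hypothetical rolling-update loop: the provision/validate/deprovision
    // hooks stand in for whatever mechanism deploys a given component.
    fn rolling_update<T>(
        old_components: Vec<T>,
        provision_new: impl Fn() -> T,
        validate: impl Fn(&T) -> bool,
        deprovision: impl Fn(T),
    ) -> Result<(), String> {
        for old in old_components {
            let candidate = provision_new();
            if !validate(&candidate) {
                // Leave the old component in service; tear down the candidate
                // and let a human (or policy) decide whether to roll back.
                deprovision(candidate);
                return Err("validation failed; update halted".to_string());
            }
            deprovision(old);
        }
        Ok(())
    }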

Software updates and internal interfaces

The software update mechanism must be aware of interfaces between loosely coupled components (i.e., components that are not upgraded as a unit) so that only compatible combinations of components are deployed. We should define a policy for each of these interfaces specifying what combinations of versions are supported (e.g., versions N-1, N, and N+1 in both directions).
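
A minimal sketch of such a policy check, assuming versions can be compared as simple integers (real interfaces may need richer version descriptions):

    // Allow deployment only if every peer across a loosely coupled interface
    // is within one version of the candidate (the N-1/N/N+1 policy above).
    fn versions_compatible(candidate: u64, peer: u64) -> bool {
        candidate.abs_diff(peer) <= 1
    }

    fn deployment_allowed(candidate: u64, peers: &[u64]) -> bool {
        peers.iter().all(|&peer| versions_compatible(candidate, peer))
    }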

Data storage

In designing the data storage system, it’s important to consider what data we have to store, at what scale (in terms of total data as well as read and write throughput), what types of writes will be made (e.g., are transactions necessary?), and what consistency guarantees we require. Many of these are described under [_control_plane_requirements] above.

Canonical state, consistency

For reference, here’s a review of several resources defined in the API, per [rfd4]:

  • Projects

  • Instances (VMs)

  • Disks (network block storage devices)

  • Networking: VPC, Subnets, etc. (see [rfd21])

  • Organizations, Teams, Users, SSH Keys

For many API resources, there are necessarily multiple copies of the state. There is almost always some state in the data plane, and there almost certainly needs to be state stored elsewhere in the control plane as well. To take a few examples:

  • Instances have a property specifying the number of CPUs allocated to the Instance. The data plane obviously has a copy of this state (explicitly or not) in order to implement the expected behavior. In order to survive transient failure of the compute server, this state must also be persisted somewhere; and in order to survive permanent failure of the server (and its storage subsystem), there must be another copy outside the server.

  • In practice, we’ll likely want to be able to list Instances (and paginate through this arbitrarily large list) without contacting every compute server, since that would have a significant impact on latency and availability (and actually violates our availability requirement above). That means at least some Instance metadata would exist in some other highly available database intended for querying in this way.

  • Although the networking APIs are still TBD, we can expect that there will exist a need to configure VPC Subnets with corresponding IP subnets, with properties like IP prefixes and subnet masks. These necessarily exist in the data plane — across many compute servers — to implement each VNIC with an address on the VPC Subnet. The control plane would also need to record which IP subnets (and addresses) are in use in order to allocate new ones.

For each piece of state, we can ask several key questions to evaluate consistency, performance, and availability:

  • Where does the authoritative state associated with each resource live?

  • What other copies of the state exist? How are these copies kept in sync?

  • What components are required in order to be able to modify this state?

  • Will the system accept requests to modify state when it cannot execute the modification right away?

It’s worth keeping in mind that we can have different answers to these questions for different resources, and we can even have different answers for different data associated with a single resource.

Where is the authoritative state and how is it propagated?

Consider these properties of an Instance:

  • an id (immutable, read-only uuid assigned by the system),

  • a name (highly constrained unique string),

  • a description (free-form text)

  • a number of CPUs allocated

The id and name are important for being able to locate the rest of the metadata about an Instance. Names are also used to page through all of the Instances on the system. We almost certainly want these in a single database from which we can easily find a particular Instance or list a group of them having adjacent names, though this doesn’t mean the authoritative state has to be in that database.
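
As a concrete reference point, these properties might be modeled roughly as follows (field names and types hypothetical); the comments note where the discussion suggests each piece of state needs to exist.

    struct InstanceRecord {
        id: String,          // immutable, assigned by the system; database
        name: String,        // unique, constrained; indexed for lookup and pagination
        description: String, // free-form text; database only
        ncpus: u32,          // also exists (explicitly or not) in the data plane
    }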

The number-of-CPUs property is more interesting, so we’ll use that example to explore different approaches.

One possible principle would be to say that the running state (i.e., the state in the data plane) is the authoritative state of the system and everything in the control plane is essentially a cache. That means that if the user requested a change to the number of CPUs, we must change the actual running number of CPUs (i.e., the data plane state) synchronously with that request and the request can only complete successfully if we successfully update the data plane. Joyent’s Triton uses this approach, quite successfully, for many properties of VMs, and it has the advantage that changes made outside the control plane for whatever reason (intentionally or otherwise) can be made to work. However, this approach falls apart for any kind of distributed resource like most networking resources — see the example above about IP subnets.

Another possible principle would be to say that the control plane’s state is authoritative. Once changes are made there, they are propagated to the data plane. The question then becomes: are those changes propagated synchronously (such that failure to propagate them results in failing to make the change) or asynchronously? If synchronous, latency and availability may be slightly worse, since more systems must participate in order to complete a request successfully.

The major downsides of asynchronous propagation are that it allows the system to accumulate debt (in the form of future work), possibly without bound; and it introduces a divergence between the database state and the actual runtime state. What happens if the user requests more CPUs and the compute server is down? What state do they see in the API? What if they want to make some other change (or even another change to the number of CPUs) — is that based on the previous runtime state or the last state committed to the database?

If done carefully, asynchronous propagation has a major upside: it allows users to record their intent, putting the burden on the system to execute the requested operation and retry as needed to handle transient failures. The consistency questions can be resolved by first-classing this model: the API would expose a current state and an intended state. After all, users ultimately want to establish a particular state of the system, and the mechanics of the API operations required to do that (and handling the possible failures and the associated unwinding of changes) can be a massive effort. A model that lets users express intent and leaves it to the system to make that intent reality can be a substantial improvement in user experience, as demonstrated by the success of tools like Terraform. In practice, though, building such a system is a considerable amount of work and makes for a more complex API and a more complex implementation. See the appendix below for details.

Important
For now, we will adopt the guiding principle that the control plane’s state is authoritative. As much as possible, we’ll synchronously propagate changes to the data plane. We’ll defer the idea of letting users express intended changes that cannot be immediately applied to the system. (In the future it may make sense to build such an API later on top of the one we propose here, and in the meantime users will be able to use client-side tooling like Terraform to effect the same result.) Put differently, the expectation is that when a user makes a change via the API and that API request completes successfully, the change has taken effect. Some such operations will be asynchronous, and [rfd4] describes asynchronous operations as a first-class resource. In those cases, we mean that when the asynchronous operation that was started by the API request completes, the requested change has taken effect.
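
A minimal sketch of this principle, with hypothetical traits standing in for the database and the compute server (error handling and unwinding on partial failure are glossed over here):

    trait ControlPlaneDb {
        fn record_ncpus(&self, instance_id: &str, ncpus: u32) -> Result<(), String>;
    }

    trait ComputeServer {
        fn apply_ncpus(&self, instance_id: &str, ncpus: u32) -> Result<(), String>;
    }

    fn set_instance_ncpus(
        db: &dyn ControlPlaneDb,
        server: &dyn ComputeServer,
        instance_id: &str,
        ncpus: u32,
    ) -> Result<(), String> {
        // Commit the authoritative state...
        db.record_ncpus(instance_id, ncpus)?;
        // ...then propagate synchronously: the API request (or the asynchronous
        // operation it started) only reports success once the data plane has
        // actually applied the change.
        server.apply_ncpus(instance_id, ncpus)
    }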

Some resources don’t correspond to infrastructure resources at all (like organizations, teams, users, ssh keys, etc.). The obvious choice is for these to live in the database, although some of these may also be associated with resources in foreign databases (e.g., LDAP) and we’ll need to better understand what’s authoritative and how differences are resolved.

Thinking about SSH keys as another example: in this case, a single API operation (adding an SSH key to a user’s account) may require propagation to many different compute servers, depending on how these keys are used. In that case, we likely would want to say that the keys are propagated asynchronously and establish a few guarantees: that the single copy in the control plane is authoritative and that propagation to the rest of the system will eventually complete. We’re essentially using asynchronous propagation here.

Important
The above principle around synchronous propagation is our bias for most resources, but it may make sense to use asynchronous propagation for some resources. In those cases, we’ll want to be clear about the semantics of multiple sequential changes and we’ll want to be very careful about the amount of asynchronous work that the system can queue up.
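
For the SSH key example, asynchronous propagation might look roughly like the sketch below (hypothetical types): the control plane’s copy is authoritative, and a work queue (bounded by the number of servers) tracks which compute servers still need the update so that propagation eventually completes despite transient failures.

    use std::collections::VecDeque;

    struct KeyPropagation {
        authoritative_keys: Vec<String>,   // the control plane's copy
        pending_servers: VecDeque<String>, // servers not yet updated
    }

    impl KeyPropagation {
        fn step(&mut self, push_to_server: impl Fn(&str, &[String]) -> bool) {
            if let Some(server) = self.pending_servers.pop_front() {
                if !push_to_server(server.as_str(), self.authoritative_keys.as_slice()) {
                    // Transient failure: requeue and retry later.
                    self.pending_servers.push_back(server);
                }
            }
        }
    }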

Dealing with divergence

Whatever choice we make to ensure consistency of the control plane’s data, it’s conceivable that state could diverge. This can happen for a number of reasons:

  • a software bug

  • an unexpected flow of data between underlying components (e.g., a hypervisor reconfiguring an OS resource under some conditions that we did not account for) — which may also represent a bug

  • deliberate action by a person using an unsupported interface. We do not intend to provide a POSIX shell or other access to system internals to customers, but it’s likely that such interfaces would need to exist for both support and development. While dangerous, such mechanisms can be invaluable both in mitigating an ongoing incident and in testing the system’s behavior in cases that are otherwise hard to replicate.

The appropriate way to address divergence likely depends on the specific case: what data diverged, how, and why? Implementation and operational experience will likely inform policies here.

Where possible, we can prefer an approach that first-classes the idea that configuration divergence could happen. For example, imagine an update mechanism that functions by explicitly updating both a database and a cache. This has the property that if something else updates the database, the cache may be left stale indefinitely. Instead, the mechanism could work even in the normal case by updating only the database; the cache could then either be notified automatically or periodically poll for changes it hasn’t yet seen. In this case, if the database were updated by some unexpected component, the system would find out quickly.
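
A minimal sketch of that alternative (hypothetical types): rather than every writer updating both the database and the cache, the cache reconciles against the database, here by polling a generation number, so changes from unexpected writers are still noticed.

    trait Database {
        fn generation(&self) -> u64; // incremented on every committed change
        fn read_config(&self) -> String;
    }

    struct Cache {
        last_seen_generation: u64,
        config: String,
    }

    impl Cache {
        fn poll(&mut self, db: &dyn Database) {
            let current = db.generation();
            if current != self.last_seen_generation {
                // Something changed the database, whether or not it told us.
                self.config = db.read_config();
                self.last_seen_generation = current;
            }
        }
    }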

We can and should be rigorous in dealing with unexpected divergence — recording, debugging, and if necessary fixing cases that we didn’t expect.

Possible data storage implementations

Having concluded that the control plane will need some highly-available distributed database, there are a few obvious options:

  1. A distributed database along the lines of FoundationDB, CockroachDB, or the like. In theory, a system like this can solve a lot of the distributed systems problems for us, including maintaining data integrity, replicating data, ensuring consistency according to our requirements, etc. In practice, experience suggests that operating such systems is far from hands-off. We would want to explore and stress-test a number of such systems before committing to one, and we’d likely want to invest heavily in developing expertise in operating the system at scale and under adverse conditions (i.e., in the face of failures).

  2. A traditional database along the lines of PostgreSQL with some form of replication built in. This is similar to option 1, and all of the above caveats apply — we’ll likely need to invest heavily to properly operationalize it. On the plus side, we have strong operational experience with PostgreSQL already, but the flip side is that that experience has uncovered a number of disappointing properties around availability and consistency of performance at scale. We’ll want to invest just as heavily in stress testing at scale and under adverse conditions.

  3. Execute a distributed consensus algorithm (e.g., Paxos or Raft) across a number of components each using a much simpler local storage engine (e.g., SQLite or RocksDB). Note that this is essentially what systems like CockroachDB do, so this option essentially refers to building such a system ourselves. We’d presumably want to evaluate existing systems first before choosing this approach.

Other options are possible as well.

Replication as a primary feature

Experience from various team members with a number of storage systems has demonstrated the value of systems that support logical replication as a first-class operation. When it’s possible to replicate chunks of the namespace to newly-created nodes (even if it takes a long time), a number of tasks are made much simpler:

  • updating database software on any of the database nodes

  • moving database nodes to different physical machines

  • updating database schema (this may require some support in the replication mechanism)

  • changing data storage properties of the database nodes (e.g., filesystem record size or compression)

  • creating non-production copies of the database for offline analysis or testing

A carefully designed replication mechanism can make all of these use cases possible with minimal downtime and low risk, preserving the option to roll back right up until a final switchover.

Appendix: Automatic migration of resources

Under [_control_plane_requirements] above, we said that the system will support both planned, live migration and unplanned, non-live (disruptive) migration. We explicitly deferred the question of when we would automatically migrate instances disruptively. This section discusses some of the tradeoffs in more detail.

There are considerable risks associated with automated Instance migration in response to a perceived failure of the host system:

  • It’s very difficult to determine reliably whether a system has failed (to the extent that it’s even well-defined). That said, the tight coupling of hardware and software in the rack may enable us to identify certain failures reliably enough (via the connection between service processors and rack controllers) and higher-level liveness checks could be used to identify other classes of failure.

  • When a system has not failed but has merely been partitioned from whatever component(s) are doing liveness checks, if care is not taken, two copies of the same Instance could wind up running, a serious violation of user expectations. Again, we may be able to use the interface between service processors and rack controllers to avoid this problem in most cases (e.g., by powering off the server and providing a safe procedure for bringing it back online). There would still be cases we couldn’t handle, but we could avoid taking action in those cases to be safe.

  • Even when detection is reliable and the system can reliably ensure that any previous Instance is not running, a lot can go wrong:

    • If the underlying failure was caused by the Instance itself, this behavior can propagate the fatal failure until the entire rack becomes non-functional. While this may sound exotic, Oxide team members have shared multiple experiences of behavior exactly like this, including workloads that trigger kernel hard-hangs and workloads that trigger CPU VR resonance (resulting in power cycles). This can potentially be mitigated by limiting the number of times an Instance can be migrated in this way within a short period of time.

    • If the rack itself is close to full utilization, the distribution of resources from one server on the remaining servers could saturate them, resulting in a degradation for all workloads. Again, Oxide team members have observed similar behavior in past experience (e.g., distributed storage systems filling up, forcing a heavy workload to be concentrated onto fewer and fewer storage servers, browning them out). Strong resource isolation and avoiding overprovisioning could mitigate this issue, at the cost of not being able to recover even from a single server failure if the rack is too full.

    • There are many stories of catastrophic outages resulting from overly aggressive automatic recovery processes. Perhaps most famous is AWS’s multi-day outage of its Elastic Block Store. This event was triggered initially by a network disruption that caused a large number of storage replicas to disappear, resulting in an attempted recovery of those replicas, which browned out the network and created a feedback loop.

These failures are not extremely likely, but neither are they all that rare, and the system’s behavior is likely to determine whether a customer’s bad day is made much worse. Risks aside, the more that we can safely automate recovery from both transient and permanent failures of compute servers, the more value we’re delivering to operators.

For storage resources, many of the same considerations apply; however, the recovery time and rack resource utilization (particularly network resource utilization) are likely to be quite significant because of the data transfer involved. There are several ways to recover from permanent loss of storage resources:

  1. Allocate new storage resources elsewhere and replicate using other available copies. This approach can begin right away and restore redundancy without any human intervention, but it requires that additional capacity is kept available for such cases. It could also take a long time and make heavy use of rack resources (particularly network bandwidth) since a lot of data may need to be copied.

  2. If the storage devices are functional, the operator could repair the failed server, bring it back into service, and the system could continue using it. This approach avoids the need for excess capacity and may significantly reduce time to recovery if people are available to repair the server. However, time to recovery depends on how quickly people can be notified and brought in to make the repair.

  3. If the storage devices are physically intact, the operator could physically swap the storage devices into a new chassis and add the new compute server to the system. This is largely the same as option 2, with the same pros and cons, except that the system would likely see this as a new compute server and so a special operation would have to be provided to support this. Besides that, there may be increased risk of human error with this procedure.

  4. Whether or not the storage devices are functional, the operator could repair the failed server and bring it back into service using an operation that ignores existing data on the server, allocates new storage resources on the formerly-failed server, and replicates to this server from other available copies. This mixes the pros and cons of the above options: no excess capacity is needed to handle this failure, but time to recovery depends on human intervention as well as the time required to replicate the data.

It’s possible to support all of these, configured via a policy knob. They’re not necessarily mutually exclusive, either. Only option (1) can be done fully automatically, but it doesn’t have to be automatic: we could provide a knob to turn off the automated recovery and just notify an operator when this happens. When the operator logs in to view the problem, they could be presented with all of the above options, including option 1. (We may have to fall back to this position anyway if capacity is not available.)
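
One way to capture this as a policy knob (names hypothetical, purely illustrative):

    // Operator policy for responding to apparent permanent loss of a server's
    // storage resources. Only automatic re-replication can proceed without a
    // human; the alternative is to notify an operator and present the options
    // above (including option 1) when they respond.
    enum StorageRecoveryPolicy {
        AutoReplicateToNewServers, // option 1: requires spare capacity
        NotifyOperatorAndWait,     // options 1-4 chosen interactively
    }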

For both compute and storage resources, assuming we do want to support automatic migration in response to perceived permanent failure of a compute server, we’ll want to at least consider:

  • documenting clearly how we detect such failures, what actions we take, and the expected impact on end user workloads

  • providing configuration knobs to control how liveness is determined (possibly at several levels of the stack) and how long the system may appear dead before we take action

  • making sure we understand how to avoid the split-brain problem

Appendix: Holistic API for managing resources

The API proposed in [rfd4] describes basic CRUD for individual resources like Instances, Disks, and various networking resources. This is a typical low-level API provided for cloud infrastructure and it makes sense that we’d expose this as a primary interface.

When deploying applications on such infrastructure, users often have simple-sounding use cases that require a potentially large number of operations. For example, just to upgrade a fleet of 10 Instances from one image to another using a rolling blue-green deployment might involve deploying 10 new instances and removing 10 old instances (potentially in a staged fashion). Or a user might want to stamp out instances of an application that involves several Instances of a database behind a load balancer plus an application behind another load balancer.

It’s appealing to enable users to think not about individual resources, but about a collection of resources (possibly equivalent to Projects in our API), where the desired state can be expressed declaratively (e.g., in a file, which can be version-controlled) and a single idempotent operation can be provided that takes whatever steps are necessary to make the state of the world match what’s been requested. This operation could fail for transient reasons, and it could be rerun until the system converges. Terraform is a client-side tool popular for this reason.

Important
It’s likely that we will want this sort of facility for our own control plane software, if for nothing else then to manage rolling upgrades of our own software. It would be nice to avoid reinventing the wheel here. But if we’re going to have to build it anyway, it would be nice to build it in such a way that customers could use it, too. However, it’s not a product requirement for v1.0.

It’s worth explaining here how we could implement such an API atop the API provided by RFD 4. We could provide a first-class notion of a DeploymentConfig (for lack of a better term). The DeploymentConfig object includes projects, instances, disks, and networking objects. The running state of the system can be serialized as a DeploymentConfig. A DeploymentConfig can also be constructed manually or programmatically. The primary operation provided by the API is commit(deployment_config). This API call registers the caller’s desire that the world should look like the specified deployment config. The system creates a plan that would involve removing some objects, adding new objects, and potentially reconfiguring others. It would then perform some of those operations, re-evaluate the current state of the world, and repeat. One approach is to say that this goes on forever. Even after reaching the desired end state, the system periodically re-evaluates reality and makes it match the desired config. It’s not clear if this is preferred to having a discrete end point after which the system stops comparing reality to the desired config.
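
A minimal sketch of the core of commit(deployment_config), with resources reduced to bare identifiers (hypothetical types; a real plan would also reconfigure objects that exist in both sets but differ, and would then re-read the observed state and repeat):

    use std::collections::BTreeSet;

    type DeploymentConfig = BTreeSet<String>; // resource identifiers, simplified

    fn reconcile_once(
        desired: &DeploymentConfig,
        observed: &DeploymentConfig,
        mut create: impl FnMut(&str),
        mut destroy: impl FnMut(&str),
    ) {
        for resource in desired.difference(observed) {
            create(resource.as_str()); // in the config but not yet in reality
        }
        for resource in observed.difference(desired) {
            destroy(resource.as_str()); // in reality but no longer in the config
        }
    }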

The system must not allow a DeploymentConfig to be committed that overlaps in scope with a previous config unless the intent is for the new config to replace the old one. It’s unclear how that’s communicated — maybe the config is scoped to a project, and each project can have at most one committed config. This eliminates the questions about dueling actors using the higher-level API — or rather, it forces end users to deal with it. That seems an appropriate choice here, since if two users are concurrently making changes to overlapping parts of the system, they should have to coordinate to merge their intents and express it unambiguously to the system. This is similar to the git workflow — and like that workflow, we could provide tools to automate merges in easy cases.

What about dueling actors using a mix of the lower-level PCP API and the higher-level API? We could imagine a few options. In all of these, the question is what happens when a user uses the lower-level API to modify objects within the scope of a higher-level DeploymentConfig?

  1. We disallow this. That is, we detect a low-level operation that overlaps in scope with a higher-level DeploymentConfig and fail with a message saying that these resources can only be modified through the deployment config mechanism or by uncommitting the deployment config.

  2. We allow this. When the higher-level system sees changes are being made behind its back, it immediately stops, notifies the user, and awaits further instruction. (You can imagine at this point that the two users coordinate offline, decide how to proceed, and then either re-enable the deployment config or not.)

  3. We allow this. We don’t explicitly try to do anything in response to it. Most likely the higher-level system will quickly undo changes made via the lower-level API, which might be surprising to the user, but is consistent with the previous request to commit the deployment config.

(1) seems preferred, but it’s possible that any of these would be acceptable.

It remains to be seen exactly how to model networking resources this way, particularly for resources that may span projects.