Introduction
[rfd48] discusses requirements for the Oxide rack control plane. These requirements include a highly-available, horizontally scalable system for persistently storing control plane data. This RFD expands on those requirements to better articulate what’s needed, what existing components we might consider evaluating for this purpose, and how to evaluate them.
Requirements
Basic requirements
The control plane needs to store information about most of the objects that it manages, including:
Instances (see [rfd4]), including where the Instance is running, its current runtime state, policies attached to the Instance, and the like
(Virtual) disks (see [rfd4]), plus any internal objects used to represent the disk (like replicated chunks)
Networking resources (see [rfd21]), likely to include IP subnets, available addresses, allocated addresses, VPC Subnets, VPCs, Gateways, NAT configurations, and much more
Servers, including all of the Oxide-managed hardware and software components
Organizations, teams, users, and SSH keys (see [rfd4])
Requirements:
durable read/write/delete of individual objects
read of a large set of objects using pagination with the same guarantees as the API (e.g., that paging through a large set will enumerate at least everything that was in the set before, during, and after the operation)
optimistic concurrency control similar to HTTP conditional requests (see the sketch below)
Possible requirements and nice-to-haves:
ACID or ACID-like transactions. It’s not yet clear if this is strictly required, and scalability requires that transactions be bounded in scope, but this would be a nice feature.
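To make the first three requirements concrete, here is a minimal sketch (in Rust, the likely implementation language for the control plane) of the kind of storage interface we would build atop whatever system we choose. Nothing here is a committed interface: the names (ControlPlaneStore, Generation, PageToken) are illustrative, and the sketch assumes a per-object generation number plays the role of an HTTP ETag for conditional requests.

```rust
/// Opaque token indicating where a paginated scan should resume.
pub struct PageToken(pub String);

/// Monotonically increasing generation number, bumped on every write.
#[derive(Clone, Copy, PartialEq)]
pub struct Generation(pub u64);

pub struct Instance {
    pub id: String,
    pub generation: Generation,
    // ... runtime state, attached policies, etc.
}

pub enum UpdateError {
    NotFound,
    /// The stored generation no longer matches what the caller read,
    /// analogous to HTTP 412 Precondition Failed on an If-Match request.
    ConcurrentModification { current: Generation },
}

pub trait ControlPlaneStore {
    /// Durable read of a single object by id.
    fn fetch_instance(&self, id: &str) -> Option<Instance>;

    /// Conditional write: applied only if the stored generation still
    /// equals `expected`; otherwise the caller must re-read and retry.
    fn update_instance(
        &self,
        id: &str,
        expected: Generation,
        updated: Instance,
    ) -> Result<(), UpdateError>;

    /// Paginated scan: repeated calls with the returned token must
    /// enumerate at least everything that was in the set before, during,
    /// and after the scan.
    fn list_instances(
        &self,
        token: Option<PageToken>,
        limit: usize,
    ) -> (Vec<Instance>, Option<PageToken>);
}
```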
Non-functional requirements
Requirements:
Consistency: Much of the API namespace provides strong consistency within a particular scope (e.g., within a datacenter or region). See [rfd48] for more.
Availability: the system should remain fully available (i.e., for reads and writes) in the face of a bounded number of software or hardware failures.
Relatedly, the system should be able to run with zero planned downtime for both software upgrades and maintenance activities.
Hands-off operation. Since Oxide customers, not Oxide staff, will be primarily responsible for the operation of an Oxide rack, the data storage system will be an implementation detail from their perspective, and our primary means of interacting with deployed systems will likely be through support cases. As a result, the data storage system must be able to run indefinitely without human intervention, assuming no scale-out or fault-recovery activities are needed. We must be able to fully automate any scale-out and fault-recovery activities, including detection and response.
Supportability. When things do go wrong (which includes crashes, unavailability, request failures, or acute performance problems), it must be possible to quickly diagnose and mitigate the problem. The system must provide clear diagnostics, as well as useful metrics about its runtime behavior, both when it’s working and when it’s degraded.
Security. The system must support the product requirements in [rfd6]. Both clients and servers should be mutually authenticated. Data must be encrypted at rest with keys managed by the control plane. Note that this doesn’t necessarily need to be implemented in the data storage system, if we can provide this facility in the infrastructure (e.g., the filesystem) and the database can run atop that.
Open-source: We will need to be able to support the system ourselves.
Cost-efficiency: The storage and other resources used by this system reduce the resources available for customer workloads, so the smaller the footprint the better.
Logical replication as a primary feature. Logical replication of chunks of the namespace as a first-class operation vastly simplifies a number of key tasks, including updating database software on any of the database nodes, moving database nodes to different physical machines, splitting shards, updating database schema, changing data storage properties of the database nodes (e.g., filesystem record size or compression), and creating non-production copies of the database for offline analysis or testing. A carefully designed replication mechanism can make all of these use cases possible with minimal downtime and low risk, preserving the option to roll back right up until a final switchover.
Schema migration (or similar operations): We’ll need to understand how changes to the data representation in the database can be applied online (without downtime, ideally with a rollback option). This might include creating or removing indexes, changing an actual table schema, or other database operations that behave similarly (like some forms of compaction).
Possible requirements and nice-to-haves:
Independently stress-tested, as via Jepsen
Community and commercial support options
Explicit non-requirements:
Based on past experience, the reputation of a system or stories about it being used by other organizations are weak data points. We will want to independently verify any properties we care about.
Size requirements
The system must support horizontal scalability — deploying more instances to increase the capacity of the system nearly linearly. Most systems support this on some level, even if it’s just application-level sharding of otherwise uncoordinated instances. The application’s choices around sharding play a major role in determining horizontal scalability.
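For illustration, the crudest form of application-level sharding looks something like the sketch below: the application hashes each object's key to pick one of N otherwise uncoordinated database instances. The sketch is hypothetical; it's here only to show why resharding (changing N), cross-shard operations, and hot keys become the application's problem under this approach.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Pick a shard for a key by hashing it. All the hard problems (rebalancing
/// when num_shards changes, cross-shard transactions, hot keys) are left to
/// the application.
fn shard_for(key: &str, num_shards: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish() % num_shards
}

fn main() {
    // e.g., route an Instance's records to one of four database instances
    println!("instance-1234 -> shard {}", shard_for("instance-1234", 4));
}
```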
Based on the requirements in [rfd48], we can expect a single rack to need to store:
order of 10,000 Instances, from which we can infer order of 10,000 virtual disks, VNICs, and IP addresses
a much smaller number of VPCs, IP subnets, and other networking resources
order of 100 servers, each with order of 100 hardware or software components to be tracked
Each of these objects is likely to be on the order of a few kilobytes.
Based on the number of objects or bytes stored, the single-rack case is not a particularly demanding dataset. Providing a single namespace for a system of 1,000 racks might require hosting 10 million Instances. Even then, based on these estimates, that’s potentially only on the order of 100 GiB of storage.
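As a back-of-envelope check on that figure (the per-Instance object count and per-object size below are rough assumptions, not measurements):

```rust
fn main() {
    let instances: u64 = 1_000 * 10_000;  // 1,000 racks x 10,000 Instances each
    let objects_per_instance: u64 = 4;    // assume Instance + disk + VNIC + IP records
    let bytes_per_object: u64 = 3 * 1024; // assume ~3 KiB per object
    let total_bytes = instances * objects_per_instance * bytes_per_object;
    // Prints "~114 GiB" -- i.e., on the order of 100 GiB.
    println!("~{} GiB", total_bytes / (1 << 30));
}
```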
Performance requirements
While storage space and total I/O bandwidth (in bytes per second) will not likely be limiters for scale, large numbers of small operations may well become bottlenecks for an individual instance. Still, the requirements here are likely to be considerably lighter than for a data path service.
Throughput
We’d like to estimate the rate of database reads and writes required for our use case, since this can have a major impact on sizing and efficiency. The database request rate is closely tied to the external API request rate. Unfortunately, it’s very hard to estimate the API request rate in a deployed system because use cases vary so widely. Here are a few examples:
Use case | API requests per Instance | API requests per Rack per second |
---|---|---|
An application that’s completely redeployed every day | 10 requests per Instance per day | 1.2 rps |
An application that’s completely redeployed every hour | 10 requests per Instance per hour | 30 rps |
A CI/CD system with average Instance lifetime of 1 minute | 10 requests per Instance per minute | 1670 rps |
These are averages distributed over time. Even with the lightest workloads, the load could be concentrated into a much shorter window. The case of a 10,000-Instance application redeployed once per day, which is only 1.2 requests per second if the redeploy is spread over the whole day, jumps to about 1700 requests per second if the customer wants to redeploy it in a minute.
These examples assume an average of 10 requests per Instance over its lifetime, counting requests to create and destroy the Instance, add a VNIC, attach a disk, and check its status a few times. It may be more realistic to assume a fixed number of write operations per Instance over its lifetime (for create, destroy, attach/detach of disks, and the like), plus some number of reads per Instance per unit time that it exists (based on how frequently users list their VMs).
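For reference, the rates in the table and the burst figure above fall out of the same arithmetic, assuming roughly 10 API requests per Instance and 10,000 Instances per rack:

```rust
fn main() {
    // 10,000 Instances per rack x ~10 API requests per Instance lifetime
    let total_requests: f64 = 10_000.0 * 10.0;

    println!("redeploy spread over a day:   {:.1} rps", total_requests / 86_400.0); // ~1.2
    println!("redeploy spread over an hour: {:.1} rps", total_requests / 3_600.0);  // ~28
    println!("redeploy in one minute (or 1-minute Instance lifetimes): {:.0} rps",
        total_requests / 60.0); // ~1,667
}
```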
Latency
A good customer experience requires that external API requests complete quickly, on the order of hundreds of milliseconds or better. As external API requests may involve a few database accesses, each of these needs to complete on the order of tens of milliseconds. This is not hard for small datasets, but can be difficult to maintain as the system grows.
"Build vs. buy"
We might consider building our own data storage system. As a straw man, each instance of the control plane could have local storage. There may be robust implementations of Raft (or other consensus algorithms) in Rust that we could incorporate as a library into the control plane, putting the control plane state into the replicated state machine. Given the list of requirements above, such an effort is likely quite expensive and not substantially simpler than the problem that many existing distributed database technologies seek to solve. Our requirements may be simpler around the operations and features that we need to support (particularly in querying data), but not so when it comes to consistency, availability, operability, and data integrity.
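To make the straw man concrete, the sketch below shows roughly what "putting the control plane state into the replicated state machine" would mean. It is purely illustrative: no particular Raft crate is assumed, and the command set is a stand-in.

```rust
use std::collections::BTreeMap;

/// Commands that would be appended to the replicated (e.g., Raft) log.
enum Command {
    CreateInstance { id: String, spec: Vec<u8> },
    DestroyInstance { id: String },
}

/// The state machine that every control plane node applies in log order.
#[derive(Default)]
struct ControlPlaneState {
    instances: BTreeMap<String, Vec<u8>>,
}

impl ControlPlaneState {
    /// Applying a committed command is deterministic, so all replicas that
    /// apply the same log converge on the same state.
    fn apply(&mut self, cmd: Command) {
        match cmd {
            Command::CreateInstance { id, spec } => {
                self.instances.insert(id, spec);
            }
            Command::DestroyInstance { id } => {
                self.instances.remove(&id);
            }
        }
    }
}

fn main() {
    let mut state = ControlPlaneState::default();
    state.apply(Command::CreateInstance { id: "i-1".into(), spec: vec![] });
    state.apply(Command::DestroyInstance { id: "i-1".into() });
    println!("{} instances", state.instances.len());
}
```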
We do not rule out building our own system, but the considerable cost of doing so demands that we first investigate existing technologies in this space.
Proposed evaluation method
We can think of evaluation of existing technologies as a funnel where we start with a long list of potential systems and winnow it down, first based on quick tests (e.g., reading docs to understand the design in order to rule out obvious non-candidates) and later based on more costly investigation (e.g., building a workload generator and doing serious load testing). It might look like this:
Brainstorm candidate technologies.
Do basic research on each technology to better understand its design center.
Look for public write-ups of people’s experiences with it, particularly post mortems. Success stories are somewhat interesting, but failures (with analysis) are often more informative. Jepsen is a particularly good source for this.
Do deeper research on each technology to better understand whether it can be operated hands-off, how it can be scaled out, how it can be updated with minimal downtime, how it handles individual instance failure, and how it handles schema migrations.
Read the operator documentation, particularly around design center, automation, scalability, and failure modes.
Search the issue tracker to better understand what issues people have had and what the team’s response has been.
Look at language support for the system, particularly for Rust.
Look at tools available to understand the system’s behavior, both when it’s working and when it’s not. What metrics are available? Can we get them programmatically?
Reach out to community members we know for their firsthand input.
Deploy the system and play around with it.
Can we get it to run, pass its own tests, and do basic operations quickly?
Do some basic fault testing by hand to see how it works.
Do some very basic benchmarking to see how that works.
Heavy testing. To be confident about the system’s behavior, it’s essential that we verify the system’s availability and performance under heavy workloads with large database sizes, and potentially while inducing failures. It’s not enough to verify these properties independently (low latency, basic fault testing, heavy read/write workloads, and large database sizes), as the confluence of these factors in deployed systems induces new issues.
Set up an HA config in a more permanent cluster.
Run a workload generator to fill the cluster while also performing reads, updates, and deletes (see the sketch after this list).
Measure the performance of the system as it fills up.
Induce failures and measure the impact on availability and performance.
Continue with this for an extended period — at least 48 hours, and ideally as long as several weeks. We may not want to block progress in other areas on several different multi-week stress tests, but we probably do want to complete these tests (maybe after cautiously deciding to go with a particular technology) so that we can identify problems early in our experience with the technology.
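As a sketch of the workload generator mentioned above (hypothetical; the real harness would be built against each candidate system's client library), something along these lines would fill the database while mixing in reads, updates, and deletes and flagging slow operations:

```rust
use std::time::Instant;

/// Minimal interface a real harness would implement against each candidate
/// database's client library.
pub trait Db {
    fn insert(&self, key: u64, value: &[u8]);
    fn read(&self, key: u64) -> Option<Vec<u8>>;
    fn update(&self, key: u64, value: &[u8]);
    fn delete(&self, key: u64);
}

/// Fill the database while mixing in reads, updates, and deletes, and flag
/// individual slow operations. A real harness would also aggregate latency
/// histograms over time so we can watch performance as the dataset grows.
pub fn run_workload(db: &dyn Db, total_ops: u64) {
    let value = vec![0u8; 3 * 1024]; // ~3 KiB objects, per the size estimates
    let mut next_key = 0u64;

    for op in 0..total_ops {
        let start = Instant::now();
        match op % 10 {
            // Skew toward inserts so the dataset keeps growing over the run.
            0..=5 => {
                db.insert(next_key, &value);
                next_key += 1;
            }
            6..=7 => {
                if next_key > 0 {
                    let _ = db.read(op % next_key);
                }
            }
            8 => {
                if next_key > 0 {
                    db.update(op % next_key, &value);
                }
            }
            _ => {
                if next_key > 0 {
                    db.delete(op % next_key);
                }
            }
        }
        let elapsed = start.elapsed();
        if elapsed.as_millis() > 100 {
            println!("slow operation #{}: {:?}", op, elapsed);
        }
    }
}
```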
Suggested technologies
Our primary focus is on NewSQL systems, which are generally Spanner-inspired distributed databases supporting SQL and distributed ACID transactions. These also tend to be "cloud-native" (i.e., designed for hands-off operation, rolling upgrades, and online schema migration). We intend to evaluate CockroachDB, Yugabyte, TiKV/TiDB, and VoltDB.
We’re explicitly not considering:
Traditional RDBMS systems (e.g., MySQL, PostgreSQL). Note that our size and performance requirements may well be within the capacity of a single-system RDBMS deployment, and these provide strong consistency guarantees, many convenient ACID features, and (often) mature documentation and tooling. However, availability and durability will demand that data be replicated (and likely synchronously — we cannot accept losing user control plane requests because of a failure as prosaic as an unexpected server reboot). Certainly once we move beyond the single-rack system, managing an HA, horizontally scalable fleet using these technologies feels quite complex.
NoSQL systems (e.g., Cassandra, Riak). The challenges of building relational, transactional applications atop these systems are well known. These systems also generally predate the modern emphasis on hands-off operation, which is critical for supporting a system that we will not be operating directly.
FoundationDB. We’ve only taken a cursory look, but as the name implies, it appears to be more of a foundation for building a custom storage system than a full-featured system. For example, it only recently gained support for something resembling secondary indexes, and that’s in a Java library and doesn’t support SQL. Outside of that, it seems the expectation is that you’d implement indexes atop what’s there. See this comment, too. This may not be a bad approach, and may merit further investigation, but it seems this would be signing up for significantly more work for our 1.0 product.
Systems traditionally used for low-volume strongly-consistent data like configuration and service discovery (e.g., ZooKeeper, Etcd, Consul). These systems do not appear to be designed to be part of a high-volume data path; while availability is important, the expectation is not necessarily that if they go down, the whole system is down. Further, it’s not clear they’re intended for horizontal scalability, as they replicate the entire dataset to all nodes in the cluster.
References
[RFD 4] RFD 4 User Facing API Design
[RFD 6] RFD 6 Threat Model
[RFD 21] RFD 21 User Networking API
[RFD 48] RFD 48 Control Plane Architecture
[Joyent RFD 116] Joyent RFD 116