RFD 110
CockroachDB for the control plane database

The Oxide system’s control plane requires a low-touch, highly-available, horizontally-scalable, strongly-consistent data store for a variety of data, including user configuration, hardware inventory, and runtime state ([RFD 48]). The specific requirements around scale, performance, and other functional and non-functional properties are discussed in [RFD 53]. A number of modern database systems (generally grouped under the term "NewSQL") seek to meet these requirements. This document describes our initial experiences with CockroachDB.

Note to readers

To make reviewing this RFD more manageable, the detailed test results are separated into two documents: Testing Details and Testing Details, 2020-11 edition.

The assumption is that most readers don’t intend to review these. (If you do want to review these, please do — feedback is welcome!)

The primary goals for this RFD are to determine:

  1. Are there warning signs from the data contained here that we should really dig into before proceeding with CockroachDB?

  2. Is there more testing we should do right now before proceeding with CockroachDB?

See the next section on Goals for more specific criteria around these questions.

If you only want to see the top-level conclusions, read through the Goals and "Executive" summary of results sections.

If you want to weigh in on the questions above, you’ll probably want to read this whole document. This will include why we focused on CockroachDB and some basic concepts for evaluating CockroachDB and our test results, plus the summary of tests, caveats, and areas of technical risk. You probably don’t need to look at the detailed test results (but again, please do if you’re interested).

If you skip the detailed test results, you might still be interested to see the "Online expansion and contraction" test in Testing Details, 2020-11 edition. There are some pretty graphs! And they demonstrate a lot of important CockroachDB behavior.

Goals

Our goal is to select a technology to use as a database for the control plane. Naturally, we’d like to select an existing system that meets our requirements rather than build one from scratch. The underlying technical problems are quite complex. Even if the best candidate falls short, we may still be better off adapting it (and potentially our usage of it) rather than starting from scratch.

Secondarily, we want to uncover particular areas of focus for future testing and development to mitigate major problems found with whatever technology we choose.

Importantly, we are not seeking to completely de-risk the choice of database, just to gain enough confidence to justify moving forward with it.

There is little here about absolute performance, as our specific requirements in this domain are not particularly hard. (See [RFD 53].)

"Executive" summary of results

(probably not brief enough to be called "executive")

We started as described in [RFD 53] by surveying popular technologies, primarily CockroachDB, Yugabyte, and TiKV/TiDB. Based on documentation and user experience reports, we chose CockroachDB as the most promising candidate for more serious testing. See [Notes on NewSQL Databases] for raw notes from this initial survey, and see Why CockroachDB below for why we started with CockroachDB.

We tested basic operation, online expansion and contraction, and several fault scenarios. Through all this, we found lots to like about CockroachDB. Some of these might seem like low bars (e.g., no data loss), but they’re nevertheless critical and so worth mentioning:

  • We did not observe any data loss.

  • We did not observe inconsistency (in the CAP sense), though we were not testing for this directly.

  • We did not observe any unexpected crashes. There’s one category of issue where CockroachDB crashes by design (loss of clock synchronization) that we expect can be managed.

  • CockroachDB never required operator intervention to re-establish availability after failures. Nodes quickly and reliably rejoined the cluster and began serving traffic.

  • Despite our inducing various failures both transient and extended, the cluster always converged to a state of sufficient replication (as measured by its own metrics) without operator intervention.

  • Broadly, the system worked as documented:

    • We did a lot of tests that involved shutting down nodes temporarily and the system was barely affected.

    • We did a lot of tests that involved moving data (e.g., shutting down nodes for an extended period, adding new nodes, or removing old ones). The system’s metrics clearly showed the underlying conditions (e.g., under-replicated ranges) and the actions being taken (e.g., creating new replicas). The behavior matched the disk and network activity observed elsewhere in the system.

    • We did several expansion tests, after which we saw resource utilization balance across the fleet (suggesting horizontal scalability, though we did not test this directly).

  • The documentation has generally been good: it adequately describes the system’s setup and operation. We did have some open questions not answered by the docs.

  • Tooling has generally been good. There are built-in tools for managing the cluster (e.g., decommissioning nodes), running a variety of workloads, etc. The built-in web Admin UI gives good overall situational awareness.

  • The system provides good coverage of metrics about its operation, plus other observability tools (e.g., query analysis, distribution of data across nodes, various reports about problem nodes or ranges, etc.). Many of these are fairly unpolished (e.g., the report about hot ranges is just a JSON endpoint).

  • We tested some basic reconfiguration like adjusting the GC period, and that worked as expected.

There’s also some disappointment:

  • On small instance types, both expansion and contraction of the cluster resulted in brief periods (2-3 minutes) where almost no requests completed at all and longer periods (20-30 minutes) where average and p95 latency were significantly increased (2x-3x). Things were much better on systems with more headroom, but tail latency still increased significantly.

  • Online schema changes worked and did not result in client errors, but did result in significantly increased latency, even with pretty large instance types and plenty of CPU and I/O headroom to start with.

  • There can be considerable variance in performance (10% to 100%) depending on which node receives a particular query. This is particularly true for workloads with a working set in memory. Since the cluster automatically splits ranges, merges ranges, and moves leaseholders around, performance can be inconsistent over time even for a uniform workload. This automatic adaptation is probably good overall, but capacity planning and SLO management could be more difficult as a result.

  • In the few cases we wanted them, there didn’t seem to be a lot of operational controls. For example, we observed one node being particularly hot. It did not seem that there was a way to tell CockroachDB to move some of its leases to other nodes. (CockroachDB does try to be completely hands-off, and a fair objection to such a control is that if the system is managing leases, it may well decide to move those leases back right after the operator moves them. Really, you don’t want to have to manually configure this.)

  • The primary risk identified from user experience reports is that the built-in (non-Enterprise) backup/restore option is not suitable for many production clusters. We did not dig into this and it’s probably one of the biggest open risks.

  • Most of the technical content that we found on the web comes directly from CockroachDB. It’s great that they have so much useful documentation, but is it worrisome that there isn’t more from a large, active user base? Would we be too dependent on the company? (It’s not clear any of the other NewSQL systems are any better in this regard.)

  • The licensing story for CockroachDB is a mess. There’s a free "CockroachDB Core" version and a "CockroachDB Enterprise" version with some paid, licensed features. The free Core roughly corresponds with code licensed under BSL, which resembles a FOSS license but restricts the user from competing directly with Cockroach Labs. The paid version roughly corresponds with code licensed under CCL. BSL code converts to Apache 2 three years after each release. There are a lot of problems with all this:

    • The BSL does not provide a patent grant. (Apache 2 does.)

    • It’s not super clear what comprises Core. The CCL implies that Core is everything on GitHub, but that includes all the CCL-licensed code. It also refers the reader to an FAQ that says that Core is licensed under BSL. However, they recently announced that distributed backup and restore are part of Core, and it turns out they’re still under CCL, not BSL. To make things more complicated, there are also "CCL (Paid)" and "CCL (Free)" features — i.e., some free stuff is under CCL. Is that part of Core or what?

    • CCL’s opening sentences appear to claim broader applicability than just code licensed under CCL.

    • CCL includes a patent grant, but it’s revocable and not perpetual.

    • CCL appears to allow Cockroach Labs to change the pricing of any CCL-covered feature at any time.

    To mitigate all this, we’re intending to stick with the OSS build, which includes no CCL code. This carries some risk, since it does not appear to be widely used. It also means potentially giving up features like distributed backup that are intended to be free, but not licensed accordingly.

Again, there are many functional areas, fault conditions, and stress scenarios that we did not test.

Important
Our conclusion is that CockroachDB is solid enough to continue moving forward with and it’s not worth spending comparable time right now evaluating other options.
Important
For readers: are there any results here that warrant more concern than we’ve expressed? Are there other tests we should run now before proceeding with CockroachDB?

Why CockroachDB

For more context on why we started with NewSQL options, see the [RFD 53] section on "Suggested technologies".

Why did we start with CockroachDB over the other NewSQL options? Most of the NewSQL family of databases have similar properties:

  • architecturally based on Google’s Spanner

  • SQL-like interface

  • strong consistency (in the CAP sense)

  • horizontal scalability, including expansion without downtime

  • reasonably tight dependency on synchronized clocks

  • support for mutual TLS authentication of both clients and other server nodes

It seems fairly likely that any of the big options would work for us. It also seems reasonably likely that any one of them might have some major issue that we won’t discover until we’re pretty far down the path of using it.

For us, the most appealing, differentiated things about CockroachDB are:

  • It has a strong focus on hands-off operation. Initial setup is a good example of this. There’s only one component to deploy, and you just need to point it at enough other instances to find the cluster. By contrast, with TiDB, there are several components to deploy, which means independently monitoring their availability and utilization and independently scaling them out. The documented options for TiDB deployment include Kubernetes, Ansible, and TiUP, the last of which appears to be a full-fledged package manager and cluster management tool.

  • It has a glowing [Jepsen report on CockroachDB]. The reports for Yugabyte and TiDB showed some serious issues, including several operational ones. It’s important to remember that these reports are from mid to late 2019 and the serious issues have likely been addressed since. Relatedly, Yugabyte’s public blog post claimed (and as of September 2020 still claims) to have passed Jepsen, a claim so misleading that the Jepsen report added a note at the top saying that’s not true.

  • It’s range-sharded, meaning that keys are sorted rather than hashed. This is critical for enabling pagination in large collections. CockroachDB discusses this and other issues in a blog post (obviously a biased source, but the technical details appear accurate). By contrast, Yugabyte is primarily hash-sharded. (Yugabyte supports range sharding but our notes show that as of May it appeared to lack active rebalancing for them. This functionality appears to be supported in beta now.)

Yugabyte is completely open-source (as opposed to CockroachDB, which is under the Business Source License). It also directly uses the PostgreSQL query execution engine, so it supports more PostgreSQL functionality out-of-the-box. In the above-linked post, CockroachDB claims this makes it harder for Yugabyte to distribute query execution, but we did not dig into this claim.

TiDB is also open-source, and the company behind it, PingCAP, has written a lot about its use of Rust (although only parts of TiDB are in Rust). TiDB emphasizes MySQL compatibility rather than PostgreSQL.

Background on choice of technology

Our consideration of the control plane database is influenced by the experience of several Oxiders in building, operating, and scaling Joyent’s Manta service. Manta is an S3-like system built for both high availability and horizontal scalability. The system achieves these properties primarily with a sharded fleet of PostgreSQL databases that use replication and automated failover. This architecture resembles other systems designed around 2012. This can be made to work, but requires significant engineering investment in the automated failover machinery and associated cluster state management as well as the process for moving data between shards in order to expand the fleet. PostgreSQL operates as a single-system database — any appearance to the contrary (including high availability of a single shard plus the idea of multiple shards) is synthesized by this surrounding software.

Joyent ran into massive problems keeping these database shards up, particularly under a write-heavy workload, for a variety of reasons that might be applicable to other technologies as well. Most of these were cases where PostgreSQL accumulates a sort of "debt" in the form of work that has to be completed at some time in the future. This includes transaction wraparound vacuum, regular autovacuum, synchronous replication "apply" lag, and the lag between checkpointed data and what’s been written to the database. For details on one such case, see [Manta-Outage-Postmortem]. In every one of these cases, the debt accrues slowly, potentially for months or years. Then a balloon payment becomes suddenly due. At this point the database can be completely offline for days, with few options for mitigation. Anecdotal reports from engineers working with other database systems suggest that many database systems have similar behaviors where they suddenly start a background operation during which availability or performance is severely compromised. Naturally, those of us who worked on Manta are eager to identify such issues as soon as possible.

Other problems resulted from PostgreSQL having little understanding that it’s part of a distributed system. The system accumulates the aforementioned apply lag and normally pays the price only when a failover happens. However, a transient failure of the replication TCP connection also triggers this pay-down — not because it intrinsically has to, but because the replication code was built atop a mechanism originally intended for single-system database crash recovery. It was not reconsidered when it became load-bearing for a distributed system. Such issues push us towards systems designed with the failure modes of distributed systems in mind.

Finally, we found during this experience that PostgreSQL seems oriented towards use cases with periods of low activity, not a 24/7/365 duty cycle. The recommended approaches for managing many of these problems are to pay down the debts during idle times, which don’t exist for a 24/7 cloud service. This is another reason to prefer a system designed for the cloud.

The modern family of NewSQL technologies is compelling because these systems facilitate applications built with relational data and transactions while also being low-touch, minimal-configuration, highly available, and fault tolerant. For more context on other options considered, see the [RFD 53] section on "Suggested technologies".

CockroachDB basics

It’s important to understand some fundamentals about CockroachDB just to know how to test it, let alone evaluate it in detail.

Architecture

CockroachDB exposes a SQL interface using the PostgreSQL wire protocol and consumers typically use a regular PostgreSQL client. SQL queries are served by whatever node the client sends the request to, which is called the gateway node. The expectation is that clients load-balance requests across nodes in the cluster or that the cluster is deployed behind a load balancer like haproxy or ELB.
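As a concrete sketch (not taken from the CockroachDB docs; the hostname, user, database, and table below are hypothetical, and 26257 is CockroachDB’s default SQL port), a client using the rust-postgres crate simply connects to whichever node it treats as its gateway and issues ordinary SQL:

    use postgres::{Client, NoTls};

    fn main() -> Result<(), postgres::Error> {
        // Connect to one (arbitrary) cluster node, which acts as the gateway
        // for this session.  The hostname is hypothetical; 26257 is the
        // default CockroachDB SQL port.
        let mut client = Client::connect(
            "host=cockroach-node1.example.com port=26257 user=root dbname=defaultdb",
            NoTls,
        )?;

        // The gateway plans the query and forwards key-value operations to the
        // relevant leaseholders and Raft leaders; the client just sees SQL.
        for row in client.query("SELECT k, v FROM kv LIMIT 10", &[])? {
            let k: i64 = row.get("k");
            let v: Vec<u8> = row.get("v");
            println!("k={} ({} bytes)", k, v.len());
        }
        Ok(())
    }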

Internally, all data is maintained in a sorted key-value store. The entire key space is sorted and divided into Ranges, primarily based on size (512 MiB by default). Each Range has some number of Replicas corresponding to the configured replication factor. Ranges are split automatically based on size and load, and they can also be split by user request. They can also be merged automatically based on size.

For writes, there’s an independent instance of the Raft consensus algorithm for each Range. Generally, the members of this Range-specific Raft cluster are the nodes that hold Replicas for that Range. Writes are directed to the Raft leader for the Range, which runs each write request through Raft to ensure strong consistency.

Reads do not go through Raft: instead, there’s a leaseholder for the Range. This is one of the nodes with a Replica for this Range, and it’s almost always the same node as the Raft leader. All reads for a Range are directed to the leaseholder, which can generally serve the request from its own copy. In cases where strong consistency might be violated (when the leaseholder is not the Raft leader), reads are sometimes delayed in a way that ensures correctness.

To summarize: the gateway node turns the request into key-value operations that are distributed to other nodes: the Raft leader (for writes) or leaseholder (for reads) for the Range associated with each key. There’s more involved when a transaction involves multiple Ranges, but we can largely ignore that for our purposes here. For more, see "Reads and Writes in CockroachDB" and "Life of a Distributed Transaction".

Fault tolerance

Transient failures of individual nodes do not significantly affect reads or writes. Based on what we said above, we’d expect that:

  • For any Range where the failed node is not the Raft leader, writes would be largely unaffected, since the Raft cluster can quickly achieve consensus with the remaining nodes.

  • For any Range where the failed node is not the leaseholder, reads would be unaffected, since only the leaseholder is used for reads.

  • For a Range where the failed node is the Raft leader or leaseholder, write or read requests (respectively) would be unavailable. However, no data needs to be copied for the leader or leaseholder role to move to one of the other Replicas. (Again, we’re talking about transient failures.)

CockroachDB declares a node dead if it hasn’t heartbeated to the cluster for 5 minutes. When that happens, the Ranges that had Replicas on that node will be declared under-replicated. The cluster picks new nodes to host replacement Replicas, and data is copied from the nodes that are still available. This can have a notable performance impact while data is flying around.
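The 5-minute threshold corresponds to the server.time_until_store_dead cluster setting. We did not test changing it, but as a sketch (the 10-minute value is arbitrary), it can be adjusted from any SQL session:

    use postgres::Client;

    fn raise_dead_timeout(client: &mut Client) -> Result<(), postgres::Error> {
        // Wait 10 minutes (instead of the default 5) before declaring a node
        // dead and re-replicating its Ranges.  Value chosen for illustration.
        client.batch_execute("SET CLUSTER SETTING server.time_until_store_dead = '10m0s'")
    }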

Surviving multiple failures

It’s critical to understand that the number of nodes in the cluster is not the same as the replication factor. Suppose you have a cluster of 7 nodes configured with replication factor 3 (the default). With 7 nodes, you might think that you could maintain availability even while losing two nodes. That’s wrong: consider the Ranges that have Replicas on both of those nodes. (With enough Ranges in the system, it’s likely that some will have a replica on each of the two failed nodes.) Those Ranges only have one Replica available, which is not enough for consensus. Such Ranges will be unavailable.

It’s important to remember that the replication factor determines how many failures you can survive. Adding cluster nodes alone only increases capacity (in terms of storage and performance), not availability.
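The replication factor is controlled through zone configurations rather than by the number of nodes. As a sketch (we did not run exactly this statement, but CONFIGURE ZONE is the documented mechanism), raising the default replication factor to 5 lets each Range lose two Replicas and still retain a Raft majority:

    use postgres::Client;

    fn raise_replication_factor(client: &mut Client) -> Result<(), postgres::Error> {
        // With num_replicas = 5, a Range still has a majority (3 of 5) after
        // losing two Replicas.
        client.batch_execute("ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5")
        // The GC period discussed elsewhere in this document is adjusted the
        // same way, e.g.:
        //   ALTER RANGE default CONFIGURE ZONE USING gc.ttlseconds = 600;
    }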

Client behavior and retries

As mentioned above, CockroachDB uses the PostgreSQL wire protocol so that you can use a standard PostgreSQL client. Cockroach Labs provides "beta" level support for rust-postgres and the team appears to have contributed improvements to that crate.

Under some conditions, in order to maintain strong consistency when multiple transactions modify the same data, CockroachDB aborts a transaction with a retryable error. In many cases, CockroachDB automatically retries the transaction. In the remaining cases, it’s up to the client to do so when it receives the appropriate error code. According to the docs, some client libraries automatically handle these cases, and even if not, it’s fairly straightforward: you just issue a ROLLBACK and try again. For more, see the documentation on transaction retries. Server-side retries are automatic as long as the statements are issued to CockroachDB as a batch and the results are small enough that they’re buffered rather than streamed. These conditions are under the client’s control.
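A minimal sketch of that client-side retry loop using rust-postgres follows; the accounts table and statements are hypothetical, and the check against SQLSTATE 40001 ("serialization failure") is the retryable-error class described in CockroachDB’s docs. In practice we’d also bound the number of retries and add backoff.

    use postgres::{error::SqlState, Client, Transaction};

    // The work done inside the transaction (hypothetical statements).
    fn run_transfer(
        txn: &mut Transaction<'_>,
        from: i64,
        to: i64,
        amt: i64,
    ) -> Result<(), postgres::Error> {
        txn.execute(
            "UPDATE accounts SET balance = balance - $1 WHERE id = $2",
            &[&amt, &from],
        )?;
        txn.execute(
            "UPDATE accounts SET balance = balance + $1 WHERE id = $2",
            &[&amt, &to],
        )?;
        Ok(())
    }

    fn transfer(client: &mut Client, from: i64, to: i64, amt: i64) -> Result<(), postgres::Error> {
        loop {
            let mut txn = client.transaction()?;
            match run_transfer(&mut txn, from, to, amt) {
                // Success: commit and we're done.
                Ok(()) => return txn.commit(),
                // 40001 is the retryable class: roll back and try again.
                Err(e) if e.code() == Some(&SqlState::T_R_SERIALIZATION_FAILURE) => {
                    txn.rollback()?;
                }
                // Anything else is a real error.
                Err(e) => return Err(e),
            }
        }
    }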

Automatic background activities

CockroachDB automatically does a few things that can have a significant impact on performance:

  • splitting Ranges based on size and load

  • merging Ranges based on size

  • rebalancing Replicas and leaseholders across nodes

  • garbage-collecting old versions of data after the configured GC period

These can dramatically impact performance. In particular, load-based splitting can split a busy Range into two less-busy Ranges. If a different node becomes the new Range’s leaseholder, then the original busy load can be successfully split across two nodes.
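Relatedly, Ranges can also be split by explicit request (mentioned under Architecture above), which can be useful for spreading an anticipated hot spot before load arrives. A sketch, with a hypothetical table and split key:

    use postgres::Client;

    fn presplit(client: &mut Client) -> Result<(), postgres::Error> {
        // Manually split the Range containing key 1000000 of the hypothetical
        // kv table into two Ranges at that key.
        client.batch_execute("ALTER TABLE kv SPLIT AT VALUES (1000000)")
    }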

Implications for testing

CockroachDB’s assumption that clients will distribute load evenly to available cluster nodes (which is generally a fine approach) complicates our testing. If fault testing includes a load balancer, it would be easy to end up testing the behavior of that load balancer and not the cluster itself. If we leave out the load balancer, then each client is directed at a particular cluster node, and that client will see failures whenever that node is offline. We need to discount those failures if we’re only trying to assess the cluster’s behavior. (In principle, we do care about the load balancer and client-side behavior as they relate to availability, but in practice, we have good reason to believe we can build this ourselves as long as the server behaves reasonably. So we want to test the server’s behavior now rather than build a perfect client first.)

Performance testing is affected by the way requests are distributed from gateway nodes. Consider a 3-node cluster where clients are distributing requests evenly to all three nodes, but where the workload is concentrated on one Range. In this case, we’d expect the Raft leader and leaseholder for the active Range to have notably lower latency (by at least one internal network round-trip) and higher throughput — and this is what we observed. We need to be aware of this when looking at how performance changes over time, since it’s possible we’re just observing a workload shift.

When the cluster decides to split Ranges or move leaseholders, overall latency and throughput can suddenly change significantly, even though nothing is wrong. If that happens during fault testing, care must be taken not to assume that the fault caused the change in performance. We’d expect this effect to be small when the number of Ranges is high enough that any one split or leaseholder move is a small fraction of the overall load.

Summary of the tests

Online expansion: while pointing one load generator at each node in a 3-node cluster, increase the cluster gradually to 6 nodes and observe latency, throughput, and error rate. We were not looking for improved latency or throughput — that winds up being complicated by various other factors and we decided that was better for a separate horizontal scalability test — but just to know that latency and error rate were not significantly impacted. On small nodes running a moderately heavy workload, the cluster did stop serving requests for a few minutes and then performed poorly for the next 20-30 minutes while data was moved around. On larger nodes (with much more CPU headroom) and with a lighter workload, p95 latency still doubled, but throughput was not so affected and the cluster continued serving queries the whole time.

Online contraction: similar to online expansion. In this case, we started with one load generator pointed at each of the first three nodes in a 5- or 6-node cluster, then gradually decommissioned nodes and observed latency, throughput, and error rate. The results were similar to expansion.

Long-running workload: we ran one workload for 240 hours (over 9 days) to look for any major degradation. Overall, this went well, though there were occasional brief spikes in latency and comparable degradation in throughput.

Schema changes: we performed a few types of schema changes while running a light workload. These completed successfully and resulted in no client errors, but latency was significantly degraded for the changes that required lots of I/O. p95 jumped from about 3ms to 100ms.

Backup and restore: we performed a backup of a 100 GiB database using cockroach dump. This took about 33 minutes for about 81 GiB of logical data in 43 million records. We did not run this under any load. We did not test distributed backup after discovering that it’s not covered by the BSL. We restored this database using cockroach sql to execute the SQL command file (which is the backup format for cockroach dump). This took just over 6 hours (with lots of idle resources — it appears to run largely single-threaded). This produced the expected number of records in the newly-created table. This approach obviously isn’t fast, but it did work.

Rolling upgrade: we performed a rolling upgrade of a cluster from v20.2.0 RC3 to v20.2.0 RC4 under a light load. This showed no impact beyond what we’d expect (failures and high latency for requests to nodes being upgraded).

We also ran several kinds of fault testing:

  • Send SIGKILL to instances of CockroachDB. This had virtually no effect on the cluster. The killed node was serving requests again in single-digit seconds. Only in-flight requests seemed to be affected.

  • Reboot the OS on the system hosting one node. This had virtually no effect on the cluster. This node was back up and serving requests within 90 seconds, nearly all of that being OS reboot time. Only in-flight requests seemed to be affected.

  • Initiate an OS panic on the system hosting one node. This is similar to a reboot, but behaves more like a network partition, since TCP connections are not gracefully torn down. This looked nearly the same as an OS reboot except that it took a little longer for the OS to come back up.

  • Transient single-node partition: use firewall rules to introduce a partition around a cluster node for less than the default 5-minute timeout for declaring a node "dead". There were some oddities around the client-side reporting (see Open technical questions), but the overall impact was minor. There were no errors, and while latency rose, the increase was smaller than the ambient fluctuations over the previous 30 minutes. Queries per second dropped across the cluster, and all nodes' CPU usage and disk throughput went down. This is probably because one load generator was off, not because one node was down.

  • Extended single-node partition: use firewall rules to introduce a partition around a cluster node for longer than the default 5-minute timeout. We saw similar oddities around client-side latency reporting, but the overall impact was minor. There were some multi-second latency outliers on a number of nodes, but they were mostly beyond p99.

We also created a "debug zip" using cockroach debug zip. This is a primary means of collecting data for post hoc analysis, and it includes:

  • settings.json: list of all cluster config settings

  • events.json: appears to be an event log (e.g., "decommissioning")

  • rangelog.json: appears to be a log of changes to ranges

  • nodes.json: includes detailed information about each node, including metrics, uptime, build information, etc.

  • running queries and transactions of various kinds, including internal ones

  • for each node: lots of internal details, including stacks.txt, command-line arguments, environment variables, log files

  • for each node, for each range: information about the range, including byte counts, replicas, etc.

  • schema details

  • report on "problem ranges"

On illumos, we’re missing each node’s "threads.txt", which is "only available on Linux/Glibc". There are a bunch of pprof files that we didn’t look into. These appear to contain heap and CPU profiling data, but it’s not clear if they worked.

See the separate Testing Details and Testing Details, 2020-11 edition documents for details.

Evaluation

In most areas, our testing has gone well. In a few areas, latency for user queries suffered significantly when background work was going on. A key question for us is: are we okay with the latency increase during operations like expansion, contraction, and schema upgrade?

Any system we use (or build) is going to involve tradeoffs. At this time, we don’t have particular reason to believe that other candidate systems will do better here. If our past experience with PostgreSQL is any indication, not all widely-deployed databases have strong support for isolating workloads or strictly balancing priorities of different workloads. And there are many dimensions in which other systems could behave much worse, particularly in all the areas above where we said that we were happy with CockroachDB (like being hands-off and reliably converging after failures). Most of these areas are far more critical to day-to-day performance and availability than the operations we’re concerned about here. Ideally we might investigate all of these other systems, but unfortunately even gathering this data is pretty expensive.

We could also continue to spend time investigating these behaviors. It may be informative to deploy a cluster on systems with a second SSD to be used as a slog device, potentially improving the latency of synchronous writes when the main pool device is being used heavily for streaming reads and writes. Again, the time required for such testing is non-trivial.

Our current thinking is that we’ll proceed with CockroachDB despite these issues, investigating and mitigating them as needed.

Caveats and limitations on these results

We wound up doing a lot of ad hoc testing, sometimes in response to unexpected issues with a given test. While we tried to control variables, it’s possible that some results are path-dependent. For example, our long-running workload test was run on the same cluster that had been expanded and contracted again at least once, and it’s possible it would have different performance characteristics than one that had not gone through that process. Relatedly, although we were reasonably careful with data collection, a more fully-automated process that also collected data regularly from the load generators would reduce the possibility of problems we missed.

We did not end up directly verifying horizontal scalability (that is, in a controlled test). We saw it in practice during expansion and contraction activities, but we didn’t scale up or down the workload to really prove it.

We also did not test overload conditions, which may be worth revisiting, particularly since we ran into some early problems on systems with insufficient IOPS capacity.

We used a pretty limited number of workloads: primarily the "kv" (key-value) workload that ships with the cockroach workload tool. This was sufficient to exercise reads and writes, with some control over the size of writes and the fraction of read requests. We also used the same tool to populate our large databases. Results could be very different for data that looks very different, as might happen with larger payloads, more varying payload size, less well-distributed keys, use of secondary indexes, etc.

We only ran tests on AWS. Many of the tests used fairly small instance types. All testing was on illumos. Early tests used a beta version of CockroachDB using PebbleDB; later tests used the v20.2.0 release (also using PebbleDB).

We did not do any significant performance work like tuning the filesystem or networking stack or CockroachDB itself. It’s possible we could see improvements in absolute performance from that work.

There are lots of tests that we considered, but did not try out:

  • Overload.

  • Backup/restore under load.

  • Horizontal scalability in a controlled experiment. We saw this in practice during expansion and contraction, but we didn’t scale up or down the workload to really prove it.

  • Asymmetric network partitions (or even any partitions involving more than one node).

  • System hangs (e.g., pstop).

  • Running the clock backwards.

  • ZFS snapshot rollback on one or more nodes.

  • Recovery when one Replica has been offline for an extended period and lots of data has been written to the Range when it comes back.

  • Any sort of storage GC stress-testing (e.g., deleting a very large amount of data in a short period and seeing the impact when it gets collected later).

  • Any sort of testing of haproxy as a load balancer.

Some of these may be worth digging deeper into. Others may be obviated by other choices we make. For example, we may want to build a smarter client-side load balancer and not use haproxy.

Areas of technical risk

These correspond with areas that we didn’t test, described above. Here we explain the big ones.

Each area below is listed with our rough estimate of its likelihood and impact.

Overload (likelihood: moderate, impact: moderate)

Many databases behave poorly when overloaded (i.e., latency climbs non-linearly rather than prioritizing and shedding load). It’s likely that we’ll find ourselves in such conditions at some point, and it would be best if the database did not contribute to a cascading failure in that case. If we find that CockroachDB doesn’t handle this well, we may need more aggressive client-side measures (e.g., circuit breakers) or a proxy between clients and the database. These measures aren’t so hard to build in up front, but they can be hard to bolt on after the fact if clients aren’t expecting errors for these conditions.

Backup/restore (likelihood: moderate, impact: moderate)

Prior to v20.2.0, users reported that what’s supported in the non-Enterprise CockroachDB is not suitable for production clusters. We did test cockroach dump, and it worked, but it’s certainly a coarse mechanism. In v20.2.0, Cockroach made distributed backup and restore part of Core, but it wasn’t relicensed accordingly (see above) and our plan of record is to stick with BSL code, so we will not use it. It’s probably not valid to take ZFS snapshots and replicate them, as they couldn’t be coordinated across the cluster without downtime. It’s possible that we’ll need to implement our own backup/restore system. On the other hand, while this is not a small project, it seems bounded in scope, particularly if we allow the backup to not represent a single point in time.

Online schema changes (likelihood: low-moderate, impact: moderate)

This is supposed to work, and we successfully completed some without errors, but performance was significantly impacted. In the worst case, we may have to build application-level awareness of these changes, which people have been doing for a long time with traditional RDBMSs.

Rolling upgrade (likelihood: low, impact: moderate)

This is supposed to work, and we successfully tested it, but it will likely be somewhat complex to operationalize, and each release potentially brings new breakage. On the other hand, their basic approach is pretty reasonable.

Horizontal scalability (likelihood: low, impact: moderate)

Horizontal scalability is a very fundamental part of the system here and everything we know about the design suggests that it will work. Our non-controlled tests show it in action.

Inconsistent performance due to debt (likelihood: moderate, impact: low-moderate)

Most database systems have background activities (like storage GC) that build up and can affect performance. That CockroachDB partitions data into relatively small ranges (512 MiB by default) may mitigate how much of the database can be in such a state at once. We can run lots of tests to smoke out these issues, but only running workloads comparable to production for very extended periods can give us high confidence here.

Client functionality and reliability (likelihood: moderate, impact: low-moderate)

Good performance and availability requires robust and fully-functional client implementations, where our choice of language (Rust) may not have seen a lot of focus. On the plus side, CockroachDB speaks the PostgreSQL wire protocol, so we can likely benefit from strong interest there, and CockroachDB supports rust-postgres as "beta".

It seems pretty likely that we’ll want to build our own client-side load balancing system similar to Joyent’s cueball. (A Rust implementation of cueball does exist already, and there’s also r2d2.)

Instability due to lack of clock sync (likelihood: low, impact: low)

A CockroachDB node crashes when its clock offset is more than 500ms from the cluster mean. This was initially a major challenge on AWS, but use of chrony and NTP has easily kept clocks in sync within 1ms over a weeklong test.

Lack of logical replication escape hatch (likelihood: unknown, impact: high)

[RFD 53] talks about "logical replication as a primary feature" because when a system is capable of replicating chunks of the namespace elsewhere, many difficult problems become much simpler. This includes moving databases between machines, reconfiguring storage, offline analysis, testing, etc. It’s unclear whether CockroachDB has a mechanism like this; "changefeed" is probably the most promising area to explore. However, the first-class replication that CockroachDB does provide supports many of these use cases. For example, if we wanted to change the filesystem record size, we could bring up a fleet of nodes with the new filesystem configuration and decommission the old ones. The question is whether there are important use cases where the built-in replication isn’t enough. One example might be constructing a whole second copy of the cluster for testing purposes.

Security: client and server authentication using TLS (likelihood: low, impact: unknown)

[RFD 48] and [RFD 53] specify that database instances should mutually authenticate each other as well as clients. CockroachDB does support this using TLS, but we did not test it.

In most of these, we could mitigate the risks with more testing, though the work required is often substantial.

Open technical questions

Can we see how much data is to-be-GC’d (i.e., not visible right now, but being preserved because the ttl period hasn’t expired)?

Can we throttle schema change operations so that they don’t affect latency so much?

Is it expected that we’d see such significant impacts to latency when adding or removing nodes?

Is it possible to split a cluster (e.g., to create a secondary copy for other purposes, like backup)? You could almost do this by deploying 2x the nodes and temporarily doubling the replication factor. This would result in something that feels like it could be split into two clusters. However, the actual split would probably need to be coordinated via Raft: one side would necessarily wind up in a minority and there would need to be an explicit step to have it elect a new majority.

What do all the metrics mean? Many of them aren’t well documented. Some are named confusingly. For example: what are range "adds" and "removes"? They don’t seem to correlate with when a range is created. They seem to correlate with when a replica is moved — so maybe that reflects a new replica being created and an old one removed? But the stat is definitely named with "range", not "replica".

Is it possible to manually rebalance the set of replicas or leaseholders on a node?

Previously open questions

Has any work been done on ideal block size? ZFS performance? Use of ZIL/slog?

Essentially no. Cockroach Labs has almost no experience with ZFS. They report that RocksDB (and so PebbleDB) natively writes 32 KiB blocks but compresses them before they hit the filesystem.

In cases where the system has seemed totally stuck (no requests completing), we seem to see latencies reported around 10.2 seconds, 109 seconds, and 0 errors. We saw this from cockroach workload run kv, even in the extreme case where the gateway node that command was pointed at was partitioned via a firewall rule for two whole minutes. In almost all cases, we’ve never seen the p99 exceed 10.2 seconds, even when throughput went to zero for a few minutes (e.g., when expanding the cluster). We also saw 10s heartbeat latency for a node that was partitioned, although most of the data points were incredibly stable at 4.55s. What gives? Are these special timeout values? Why do we see 0 errors in many of these cases?

Some of these cases were user error. When we have more data, we can turn this into a more precise question.

Takeaways going forward

TBD: determine how many nodes we want to have in a single rack. (We could choose a few large nodes or many small nodes or anything in between.) This depends partly on whether we want to optimize for throughput or resiliency.

TBD: determine the appropriate replication factor for control plane data. We may want more than 3 for availability or durability reasons.

TBD: how problematic is the latency impact during schema change and what mitigations are available?

TBD: relatedly, can we use a slog device for the storage pool(s) on which the CockroachDB databases will be deployed? Would we want to for reasons like this?

We will want to carefully operationalize rolling upgrade.

We will likely want to make sure the normal shutdown process correctly drains each node of traffic. In our testing, the default 60-second timeout was not long enough.

We will need to determine an approach for load balancing. Past experience with Joyent’s Cueball and PostgreSQL resulted in a system that balanced load very well among multiple instances and was highly resilient — and resilient at scale.

We will need to determine the backup requirements for control plane data and figure out how to meet them with the tools available to us.

By electing to stick with the OSS build target, we’re signing up to own some amount of maintenance of that target.

We may want to implement a few pieces that are currently stubbed out on illumos:

  • gosigar implementations that CockroachDB uses to collect CPU and memory metrics

  • the thread dumps present in cockroach debug zip file contents (see above)

Other lessons learned

Unrelated to CockroachDB, as part of this work, we also learned a bunch about AWS, largely related to I/O performance.

The typical baseline EBS volume is "gp2" class, a general-purpose SSD-based network volume. We initially used these volumes for testing because they’re fairly cheap and we weren’t intending to measure absolute performance. gp2 volumes provide a certain number of IOPS depending mostly on the volume’s size. What’s tricky, though, is that they also support bursting way above their baseline performance, and worse (for our use case), they start with a significant I/O "credit", ostensibly to speed up boot (which might use more I/O than steady-state). As a result, they can run significantly faster for the first several hours than they will after that. It took some time for us to track this down as the cause of suddenly-dropping database performance.

To avoid bursting, we switched to more expensive "io1" class volumes, which provide more consistent performance at whatever level you specify. We also did some testing using EC2 instance types with directly-attached NVMe storage ("i3" instance types). Those are nominally cheaper, but all data is lost when an instance is shut down, so instances need to remain running 24/7 for as long as the cluster might ever be used, which winds up being more expensive for this sort of testing.

References

There are many links in the text above (that are not included here) to official CockroachDB and AWS documentation.