RFD 79
Rust approaches to concurrency in the control plane

This RFD discusses the choice of whether to use async/await in control plane software today. This is not intended to constrain any other work or future work, particularly as the considerations below change.

Background

Dropshot and the control plane prototype make heavy use of async/await. In retrospect, this is pretty arbitrary. The general excitement about async/await in Fall 2019 / early 2020, combined with the fact that most existing HTTP frameworks use async/await, led some of us to think that async/await was the preferred approach to concurrency in new Rust code. @cbiffle explained that from Rust’s perspective, both async/await and synchronous approaches are fully supported, with neither being particularly preferred, but by that point we had some inertia to stick with what we were already using. Since the code base will be growing significantly, and converting to or from an async/await approach will be costly in proportion to the code base size, it’s worth reconsidering this from first principles now.

We primarily consider two different approaches:

  • an event-based system built with async/await, or

  • using synchronous functions and a thread pool, which is often just called "threaded" (although obviously both approaches use threads and often a thread pool). We call this a "synchronous threaded approach" below to be precise.

For other related questions, see the [_related_choices] appendix.

Executive summary

To simplify (oversimplify, really) the tradeoffs between these two approaches:

  • An async/await-based system is likely to take a bit longer to build (because of the learning curve and rough edges). It may be cheaper to operate (because of the reduced memory usage). To be supportable, we may have to make a bigger investment in debuggability. (Absent that, it could be much more expensive to support.)

  • A synchronous, threaded system is likely to be a bit faster to build (ignoring the non-trivial cost of converting what we’ve already built), use more memory (and so cost more to operate), and be a bit easier to debug. However, it may be much harder to maintain high availability in the face of pathological cases.

This decision would be quite costly to revisit later, so we can’t build a synchronous system and easily switch to async/await if we find the memory usage to be a problem. (Some ways to mitigate this are discussed below, but they’re not great.)

We propose sticking with an event-based approach using async/await, provided that we’re able to invest as needed in debuggability. At this time, this is not a strong preference and more compelling arguments are welcome!

Considerations in detail

Memory usage

In the limit, the primary advantage of an event-based system over a synchronous threaded approach is reduced memory utilization because tasks are presumed cheaper than threads. For N concurrent operations on a system with C logical CPUs, memory usage for the event-based approach is dominated by N times the memory usage per task (though there are still C threads). For the threaded approach, memory usage is dominated by N times the memory usage per thread. Assuming N >> C (because most operations block on downstream network dependencies), and the size of a task is much smaller than a thread stack, the event-based system uses less memory. For reference, Rust thread stacks allocate 2 MiB by default. It’s difficult to quantify the memory required for comparable tasks without an implementation, but it seems reasonable to estimate that it could be an order of magnitude less (i.e., 200 KiB per task). We ignore the question of whether this memory is allocated up front or not, since both approaches can support either choice. The distribution of memory used per operation could also present a challenge for synchronous threaded systems.
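For illustration only, a minimal sketch of how a synchronous threaded system might trim the per-thread reservation using std::thread::Builder; the 256 KiB figure here is an arbitrary example, not a recommendation:

    use std::thread;

    fn main() {
        // std::thread gives each new thread a 2 MiB stack by default (unless
        // overridden by RUST_MIN_STACK). A synchronous threaded system could
        // shrink this, but picking a safe smaller size requires knowing the
        // deepest call chain any operation might hit, which is part of why
        // per-thread memory is hard to reduce.
        let handle = thread::Builder::new()
            .name("request-worker".into())
            .stack_size(256 * 1024) // 256 KiB, purely illustrative
            .spawn(|| {
                // ... handle one operation synchronously ...
            })
            .expect("failed to spawn worker thread");

        handle.join().unwrap();
    }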

Thread pool sizing

In a typical event-based system, a pool of threads is used for all on-CPU operations, often including non-blocking system calls. These threads are often also used for filesystem operations, which may block. With this approach, a common heuristic is to match the thread pool size to the logical CPU count, which ignores any time blocked on the filesystem. This is probably fine for many use-cases, including the Oxide control plane.[1] Provided there’s no major source of unexpected blocking in these threads, thread pool sizing is pretty easy and doesn’t have significant risks. (Note that other choices are possible here, including separate thread pools for various kinds of tasks.)
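To make the heuristic concrete, here’s a minimal sketch (using tokio for illustration) of building a multi-threaded runtime whose worker pool is sized to the logical CPU count; this also happens to be tokio’s default, so the explicit call only makes the choice visible:

    use tokio::runtime::Builder;

    fn main() {
        // Match the executor's worker pool to the logical CPU count, per the
        // heuristic above. (This is also tokio's default.)
        let cpus = std::thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1);

        let runtime = Builder::new_multi_thread()
            .worker_threads(cpus)
            .enable_all()
            .build()
            .expect("failed to build runtime");

        runtime.block_on(async {
            // ... accept and serve requests here ...
        });
    }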

In a synchronous threaded system, threads are generally assigned to an operation for the duration of the operation, including any blocking network requests. In this case, the ideal size for a given workload depends on the fraction of wall time that each request will spend blocked — the more blocking, the more operation concurrency is possible for a given level of CPU utilization, and the more threads are needed to achieve that. Unfortunately, the average fraction of time spent blocked can change with the client workload itself or because of dynamic properties of the system (e.g., if some network dependency suddenly takes twice as long to service requests). The cost of a wrong choice can be significant: if the thread pool is too small, requests may fail simply because there are no threads available.
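To see how sensitive this is, here’s a rough sketch of the arithmetic; the 16-CPU figure and blocked fractions are assumptions chosen only for illustration:

    fn main() {
        // To keep C CPUs busy when each request spends a fraction `blocked` of
        // its wall time off-CPU, you need roughly C / (1 - blocked) threads.
        let cpus = 16.0_f64;
        for blocked in [0.90, 0.95] {
            let threads = cpus / (1.0 - blocked);
            println!("blocked {:.0}% of the time -> ~{:.0} threads",
                blocked * 100.0, threads);
        }
        // 90% blocked needs ~160 threads; if a slow dependency pushes that to
        // 95%, the "right" pool size doubles to ~320 -- the tuning problem
        // described above.
    }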

Behavior in pathological cases

It’s easy to envision a happy path with both approaches on a system with C logical CPUs and N concurrent operations keeping the CPUs about 70% busy overall. The event-based system has C threads able to keep the CPUs (mostly) busy, there are N total tasks, and there are N - C tasks not running (mostly blocked on I/O). The synchronous threaded system has at least N threads, up to C of which are on-CPU at any given time. Maybe the event-based system is using less memory, but otherwise both systems function well.

What happens when things go wrong?

Case 1: modest increase in client workload

As a warm-up, suppose that the incoming client workload increases by 50%, pushing the system just slightly beyond CPU saturation. In the event-based system, we’d now expect 1.5N tasks on the same C threads. The threads will saturate the CPUs, and we’ll see tasks queueing up even when not blocked on I/O because they’re waiting for threads to become available. Latency will increase somewhat.

In the synchronous threaded system, assuming we haven’t exceeded a cap on threads, we’d expect 1.5N threads. There’s still queueing, but it’s probably happening in the kernel’s thread scheduler. Again, latency will increase somewhat. This still looks similar to the event-based system.

Case 2: massive increase in concurrency

Let’s go back to the original state (N tasks) and suppose that instead of a modest increase in client concurrency, we see a massive increase. To make things more interesting, let’s also assume that the blocked time for each request also increases so that we don’t run into the CPU saturation issue from case 1. (If we don’t assume this, then the CPU saturation and resulting latency from case 1 will likely be catastrophic before we reach the problem we’re about to discuss.)

Lest this case seem contrived: this is exactly what we might see if request latency from a downstream dependency (e.g., a database) suddenly increased by 2x or more, which is a very realistic problem. In this case, the blocking time per request skyrockets, and the increased latency per request can cause overall concurrency to increase significantly (depending on the client behavior).

The event-based system ought to handle this well until we run out of memory for new tasks, at which point incoming requests may have to be rejected immediately. The threaded system ought to handle this well until we run out of available threads, at which point we again need to reject incoming requests. There are two differences here: first, because of the memory usage differences between these approaches, the breaking point will likely be much higher for the event-based system than the synchronous threaded system. Second, the limit on the event-based system is potentially fuzzier, in that it could rise and fall based on available memory. That might not be desirable, though, and we may want to implement a concurrency cap anyway.

Mitigations for pathological behavior

In the end, it may not matter which concurrency model we pick because we’re going to need mechanisms to maintain availability in pathological cases. These might include:

  • a cap on overall concurrency, above which new requests are rejected immediately

  • a cap on total clients connected, above which new clients are rejected immediately

  • throttles on both overall concurrency and total clients, introducing latency as we approach these limits in order to push back on clients

  • timeouts on all outgoing network requests (to avoid exhausting concurrency slots due to a poorly functioning dependency)

  • caps or throttles on concurrent requests of a particular type, or blocked on a particular dependency, so that pathological behavior of some requests doesn’t starve all others

  • client-side circuit breakers for outgoing requests (to improve success rate or latency when some fraction of dependencies are functioning poorly)

There are many techniques for maintaining performance and availability when dependencies fail. What’s relevant here is that without such measures, both event-based and synchronous threaded systems can fall apart in pathological cases. In practice, since thread pool sizing is so much more difficult with synchronous threaded systems, maintaining availability in pathological conditions feels harder in those systems.
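As one illustration of the caps and timeouts listed above, here’s a minimal sketch using tokio primitives; the limit, timeout value, and call_dependency function are hypothetical placeholders, not a proposed design:

    use std::sync::Arc;
    use std::time::Duration;
    use tokio::sync::Semaphore;
    use tokio::time::timeout;

    // Hypothetical downstream call, used only for illustration.
    async fn call_dependency() -> Result<String, String> {
        Ok("response".to_string())
    }

    #[tokio::main]
    async fn main() {
        // Cap overall concurrency: beyond this, new requests are rejected
        // immediately rather than piling up.
        let limit = Arc::new(Semaphore::new(1024));

        let permit = match limit.clone().try_acquire_owned() {
            Ok(permit) => permit,
            Err(_) => {
                eprintln!("overloaded: rejecting request");
                return;
            }
        };

        // Bound the time spent blocked on the dependency so that a misbehaving
        // downstream service can't pin this concurrency slot indefinitely.
        let result = timeout(Duration::from_secs(30), call_dependency()).await;
        match result {
            Ok(Ok(body)) => println!("dependency returned: {}", body),
            Ok(Err(e)) => eprintln!("dependency failed: {}", e),
            Err(_) => eprintln!("dependency timed out"),
        }

        drop(permit); // release the concurrency slot
    }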

Programming model

As interfaces, the two approaches are in some sense equivalent: you can wrap an async interface with a sync one using something like block_on. See also "Single-threaded execution" in the executor docs. You can wrap a sync function with an async one by having a separate thread pool in which to run the synchronous function. The async wrapper can execute the sync function by sending a message to the pool via a channel and waiting to receive a message back when the function completes.
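For illustration, here’s a minimal sketch of both wrapping directions using tokio; the fetch_config and compute_checksum functions are invented for the example, and spawn_blocking is tokio’s packaged form of the separate-thread-pool-plus-channel pattern described above:

    use tokio::runtime::Runtime;

    // A hypothetical async operation.
    async fn fetch_config() -> String {
        "config".to_string()
    }

    // A hypothetical blocking, synchronous operation.
    fn compute_checksum(data: &[u8]) -> u64 {
        data.iter().map(|&b| b as u64).sum()
    }

    // Wrapping async with sync: drive the future to completion on this thread.
    fn fetch_config_sync(rt: &Runtime) -> String {
        rt.block_on(fetch_config())
    }

    // Wrapping sync with async: hand the blocking work to a separate thread
    // pool and await the result (which is what spawn_blocking does internally).
    async fn compute_checksum_async(data: Vec<u8>) -> u64 {
        tokio::task::spawn_blocking(move || compute_checksum(&data))
            .await
            .expect("blocking task panicked")
    }

    fn main() {
        let rt = Runtime::new().expect("failed to build runtime");
        println!("{}", fetch_config_sync(&rt));
        let sum = rt.block_on(compute_checksum_async(vec![1, 2, 3]));
        println!("{}", sum);
    }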

Synchronous threaded systems are sometimes preferred because the code appears sequential (and so straightforward), but people also find them difficult to build correctly because of the need for locks and the possibilities of data races, deadlocks, livelocks, and other issues. Event-based systems are often believed to be highly scalable, but difficult to program and to understand because logic is scattered all over and additional code is required to encapsulate task state.

Interestingly, Rust alleviates many of the challenges with both approaches. The borrowing system makes it possible to identify data races at least at run time, if not at compile time. The properties of Rust mutexes eliminate some classes of bugs, like code paths that fail to unlock a mutex. The implementation of async/await makes event-oriented code as easy to read as sequential code, without the need for explicit marshalling of task state.

That said, it’s accepted that async/await in Rust has a steeper, longer learning curve: using async/await requires understanding how the compiler decomposes a function into a combination of structs and closures, which affects the required lifetimes of arguments and even local bindings. It also has rough edges: compiler diagnostics continue to improve, but even in early 2020 we ran into error messages that were absurdly voluminous (that one’s been fixed), misleading error messages (somewhat improved), and suggestions that produced invalid syntax. "Workarounds to know and love" covers a number of common issues that people run into (some of which have since been addressed).
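One concrete example of that learning curve: holding a std::sync::Mutex guard across an .await makes the resulting future non-Send, so it can’t be handed to a multi-threaded executor’s spawn. A minimal sketch of the usual workaround (the names here are invented for illustration) scopes the guard so it’s dropped before awaiting:

    use std::sync::{Arc, Mutex};

    async fn tick() {
        tokio::task::yield_now().await;
    }

    async fn bump(counter: Arc<Mutex<u64>>) {
        // Take the lock in a narrow scope so the guard is dropped before the
        // .await below; holding it across the await would make this future
        // non-Send and unusable with tokio::spawn.
        let value = {
            let mut guard = counter.lock().unwrap();
            *guard += 1;
            *guard
        };

        tick().await;
        println!("counter is now {}", value);
    }

    #[tokio::main]
    async fn main() {
        let counter = Arc::new(Mutex::new(0));
        tokio::spawn(bump(counter.clone())).await.unwrap();
    }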

Debuggability

Debuggability often gets less attention than these other considerations, but it’s critical for Oxide. The control plane will be deployed at customer sites to which we won’t have direct access except through whatever support channels we build into the product. Those channels generally require customer involvement, which makes them high-overhead and expensive, because a customer is already annoyed enough to have opened a support case. It’s essential that we’re able to fully understand software problems with as little customer involvement as possible and as quickly as possible. Since each occurrence at each customer site costs both us and the customer, it’s also essential that we do so in as few iterations as possible.

Within a program, we see many kinds of problems:

  • fatal failures (crashes, programmer errors, assertion failures, panics, etc.)

  • deadlocks, usually associated with threads, but which have analogs in event-based systems as well

  • missed wakeups: one task or thread is blocked on a condition that will never occur, usually because it’s already been satisfied but that wasn’t propagated to the task or thread

  • other non-fatal failures. This is a broad category that could include livelocks, infinite loops, or returning the wrong response for a request.

To understand these problems, we ask questions like:

  • What tasks are currently blocked? What is each one blocked on? This gets complicated: e.g., what’s the blocking chain behind this deadlock?

  • What code paths are we hitting? Or: are we hitting a particular code path? Or: what was the state of X when we hit code path Y?

  • What (causally) led the program to a particular code path?

For the synchronous threaded system, techniques to answer these questions are well-established using postmortem debugging (e.g., core files), dynamic tracing (e.g., tracing particular code paths or profiling code), and stack traces or other data extracted through these techniques or others. It’s generally fairly easy to understand deadlocks and blocking chains from a core file of a threaded system (though it may be hard to understand how the program got there!). Stack traces go a long way towards establishing causality.

Fortunately, the Rust implementation of async/await supports "long stack traces" — that is, when in the middle of an asynchronous operation, the stack includes context about the other async operations that spawned it. However, it doesn’t appear possible to easily answer questions about what’s blocked or why, postmortem or otherwise. On the plus side, the runtime state of tasks is managed by an executor, which is not built into the Rust runtime but rather provided by a library like tokio. It may not be very hard to modify a mainstream executor or build our own that prioritizes debuggability. Other facilities like the tracing package are also available to aid debuggability.
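As one example of what’s available, here’s a minimal sketch using the tracing and tracing-subscriber crates to attach structured context to an async operation; the function and argument names are invented for illustration:

    use tracing::{info, instrument};

    // Each call gets a span carrying its arguments, so log events emitted
    // while this operation runs (including across .await points) can be
    // correlated with it.
    #[instrument]
    async fn provision_instance(project_id: u64) {
        info!("allocating resources");
        tokio::task::yield_now().await;
        info!("done");
    }

    #[tokio::main]
    async fn main() {
        // Emit spans and events to standard output; a real system might ship
        // them somewhere more durable.
        tracing_subscriber::fmt::init();
        provision_instance(123).await;
    }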

See also [_use_of_channels] below.

Analysis

We can summarize the considerations:

  Consideration                      Advantage
  Memory usage                       Event-based (async/await)
  Thread pool sizing                 Event-based (async/await)
  Behavior in pathological cases     Event-based (async/await) (slight edge)
  Programming model / ease of use    Synchronous threaded
  Debuggability                      Synchronous threaded

We can also think about this in terms of risks:

Using async/await for event-based system
Risk: The team has a hard time understanding failures, particularly from customer systems, because the runtime provides poor visibility into task state and causality
Likelihood: Moderate
Severity: High
Comments: Moderate-cost mitigations:

  • build or incorporate a custom executor that makes it easy to see task state in a postmortem debugger

  • develop or incorporate dynamic tracing support

  • incorporate metrics and collection system

  • incorporate OpenTracing-style logs and collection system

Risk: Engineers new to the team take longer to come up to speed than with a synchronous system
Likelihood: High
Severity: Low
Comments: Moderate-cost mitigations:

  • invest in documentation and clear code structure

  • develop safe, ergonomic patterns and abstractions

Using a synchronous threaded system
Risk: Memory usage becomes a major problem
Likelihood: Moderate
Severity: Low
Comments: Relative to something like google.com or an S3-like endpoint, the control plane is not likely to be very high-volume. This could change. If this becomes an issue, it will boil down to product cost: more memory usage means we’ll need more resources for the control plane. This is bad, but it’s not likely to be catastrophic. There wouldn’t be much we could do about it aside from rewriting key portions to be event-based instead.

Risk: We spend a lot of time tuning a thread pool or building our own thread pool
Likelihood: Moderate
Severity: Moderate
Comments: It’s possible that out-of-the-box thread pool implementations will suffice and that basic tuning will be straightforward enough, particularly if we’re willing to sacrifice some memory to avoid spending a lot of time on this problem. If this does become a problem, we could wind up spending a lot of time iterating on it.

Risk: The system has low availability because of the various cases where threads can be blocked much longer than expected, potentially even indefinitely
Likelihood: Low (but somewhat unknown)
Severity: High
Comments: This is a problem for the event-based system, too, but there it would behave more like a memory leak, whereas in the synchronous system we leak a more critical resource. We could wind up playing whack-a-mole trying to find all the cases where threads are blocked for a long time. (It’s tempting to apply an overall timeout to every operation, but operations often have greatly varying expected times (consider an API endpoint that downloads potentially large files), making any useful timeout difficult to implement.)

Imagine the two biggest risks as two bogey monsters:

  1. We receive a high-priority customer support ticket for an Oxide service that’s not responding to requests. We examine a core file and discover that all 2048 threads in our synchronous thread pool are blocked on one particular dependency that’s not responding to requests. We forgot to include a timeout on these requests. (Even worse: this is the third such incident, each for a different dependency we didn’t expect to be such a problem.)

  2. We receive a high-priority customer support ticket for a hung request to an Oxide server. We examine a core file and have no idea what’s blocked on what — all of the async/await tasks are idle waiting for something, but we can’t see what.

The first is the peril of synchronous thread pool sizing in a pathological case. It highlights that we have to identify every case where threads could block indefinitely in order to avoid problems like this. The second highlights the problem posed by poor debuggability. Both of these are real problems — and both are also mitigatable.

It’s also possible that we could mitigate the risk of choosing poorly by separating the implementation into several different concurrent systems. For example, we could have separate systems for handling HTTP requests, working with the database, making requests to other parts of the control plane, etc. Each could use its own event-based system or synchronous thread pool. This would be more work up front, particularly in defining the interfaces between these systems. It also makes the thread pool sizing problem harder, since we’d have several different thread pools to size.

See [_executive_summary] for our conclusions.

Related choices

Event-based system without async/await

As mentioned in the introduction, this RFD considers an event-based system using async/await and a synchronous, threaded system. We could also consider an event-based system without async/await using explicit state tracking. We leave this option out because compared with async/await:

  • The learning curve issues apply whether using async/await or not — that is, someone new to the system would have to learn either async/await or else the patterns and infrastructure we create for tracking state explicitly. The lifetime considerations apply to both.

  • Debugging the ad hoc system would be just as challenging.

  • With the community behind async/await, we could benefit from tooling and debuggability investments made by others.

  • It’s been observed that Rust’s borrowing rules make it awkward to build async interfaces that use borrowing.

Use of channels

There’s also a question of whether to use shared state (e.g., mutexes around data) or channels. The ergonomics of message passing in Rust have made channels a popular way to communicate between multiple contexts, and channels can be used in both event-based and synchronous threaded systems. We could discuss this question as well, but more implementation experience may be needed to better understand the tradeoffs.

One thing we can say is that while message passing can avoid many classes of concurrency issues, a side effect is that it decouples cause from effect. That is, if context A (whether a task or thread) sends a message to context B to do something, then once the message reaches context B, any information (e.g., a stack trace) from context A is generally lost. This can make debugging much more difficult. For an example, see this small case study.
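A minimal sketch of that decoupling, with a hypothetical Request type: by the time the worker processes the message, the sender’s stack is gone, so nothing ties the effect back to its cause:

    use std::sync::mpsc;
    use std::thread;

    struct Request {
        id: u64,
    }

    fn main() {
        let (tx, rx) = mpsc::channel::<Request>();

        let worker = thread::spawn(move || {
            // Context B: all we have is the message itself; the sender's
            // stack trace is long gone by the time we process it here.
            for req in rx {
                println!("processing request {}", req.id);
            }
        });

        // Context A: the "cause" lives here, but it is not attached to the
        // message in any way that a debugger would see later.
        tx.send(Request { id: 1 }).unwrap();
        drop(tx); // close the channel so the worker exits

        worker.join().unwrap();
    }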

Recommended reading

Most useful:

Other potentially useful background:

Footnotes
  • 1

    Filesystem I/O is likely to block primarily for random reads and synchronous writes. (Streaming reads can be effectively prefetched. Asynchronous writes can be buffered and written in the background unless an fsync or similar operation makes them synchronous.)
