It is a truth universally acknowledged, that a system in possession of an input queue, must be in want of a backpressure implementation.
As we have improved the performance of the Crucible Upstairs, an issue has emerged: it’s possible to submit BlockOp jobs much faster than they can be handled. This is exacerbated by the "fast ack" optimization: according to our contract, we can acknowledge writes to the caller immediately, without waiting for them to actually complete. The next flush or read operation, however, will have to wait for all of those writes to actually finish, meaning it will take a surprisingly long time. (See crucible#902 for an example of this behavior leading to the host reporting disk errors!)
During early development, there were many sources of Accidental Backpressure in the system, e.g. checking every previous job when adding a new job. We have been gradually removing these sources of accidental backpressure as we optimize the upstairs implementation.
With those accidental sources removed, the system still needs to be kept well-behaved, so we introduced intentional backpressure in crucible#916 and refined it further in crucible#990 and crucible#1047.
However, it’s still not perfect: especially on unencrypted volumes (which lack the unintentional backpressure of encryption), the right pattern of large block writes can lead to pile-ups in various queues.
This document explains the various places where backpressure and queue limits are implemented, explains limitations of the current implementation, and suggests a path forward.
Queues in the Crucible Upstairs
The diagram below summarizes major queues and limits in the system:
flowchart TD
G["Guest::write(..)"] --> BP{{Backpressure}}
BP --> R(Guest::reqs)
R -.-> C["Guest::consume_req(..)"]
C -->|IOP / BW limits| P["process_new_io"]
P -->D(Downstairs::ds_new)
D -->|Clamp to MAX_ACTIVE_COUNT| A(In-progress jobs)
A --> TLW{{Total live work}}
D --> TLW
TLW -->|Job count > IO_OUTSTANDING_MAX| L["gone_too_long fault"]
TLW -.->|Bytes in flight| BP
TLW -.->|Job count| BP
Backpressure implementation
The current backpressure implementation is based on two values:
Total write bytes in flight
Live job count (i.e. the number of new and in-progress jobs)
Backpressure is calculated as the maximum of two computed values:
Backpressure due to write bytes, which begins at 1 GiB in-flight and has a quadratic scale such that there’s a delay of 10 ms at 2 GiB in-flight.
Backpressure due to live job count, which starts at IO_OUTSTANDING_MAX / 2 and has a quadratic scale with a maximum delay of 5 ms at IO_OUTSTANDING_MAX.
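Sketched in code, using the numbers quoted above, the calculation has roughly the following shape. The constants, names, and structure here are illustrative only, not the actual Upstairs implementation:

use std::time::Duration;

// Illustrative constants, taken from the description above.
const IO_OUTSTANDING_MAX: u64 = 57_000;
const BYTES_START: u64 = 1 << 30; // 1 GiB: bytes backpressure begins here
const BYTES_MAX: u64 = 2 << 30; // 2 GiB: 10 ms of delay at this point
const BYTES_MAX_DELAY_US: f64 = 10_000.0;
const JOBS_START: u64 = IO_OUTSTANDING_MAX / 2;
const JOBS_MAX_DELAY_US: f64 = 5_000.0;

// Quadratic ramp: zero until `start`, then the square of how far past `start`
// we are, reaching `max_delay_us` at `max` (and continuing to grow past it
// unless the caller clamps the result).
fn quadratic_delay_us(value: u64, start: u64, max: u64, max_delay_us: f64) -> f64 {
    if value <= start {
        return 0.0;
    }
    let frac = (value - start) as f64 / (max - start) as f64;
    frac * frac * max_delay_us
}

// The overall backpressure delay is the maximum of the two curves.
fn backpressure_delay(bytes_in_flight: u64, live_jobs: u64) -> Duration {
    // The bytes-in-flight delay is not clamped; see "Problematic
    // bytes-in-flight delay" below.
    let bytes =
        quadratic_delay_us(bytes_in_flight, BYTES_START, BYTES_MAX, BYTES_MAX_DELAY_US);
    // The job-count delay maxes out at 5 ms.
    let jobs =
        quadratic_delay_us(live_jobs, JOBS_START, IO_OUTSTANDING_MAX, JOBS_MAX_DELAY_US)
            .min(JOBS_MAX_DELAY_US);
    Duration::from_micros(bytes.max(jobs) as u64)
}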
Using the maximum of these two values makes backpressure robust against both small and large write jobs. Comparing active jobs at a variety of write sizes, we see the following:
For small writes, we’re limited by total jobs; the black line indicates where backpressure kicks in at IO_OUTSTANDING_MAX / 2. Large writes stabilize at fewer outstanding jobs, presumably due to the bytes-in-flight backpressure.
The backpressure is implemented as an artificial delay before returning when performing a write operation (only); reads and flushes actually wait for the operation to complete, so they don’t need artificial backpressure.
There’s one subtlety here: once backpressure kicks in, the writer must hold a mutex lock during the backpressure delay. Otherwise, arbitrarily many writers can send writes simultaneously; even though each one is delayed by backpressure, it would still be possible to overwhelm the system. We saw this in practice: Propolis uses an 8-task worker pool to send jobs to Crucible, so our initial backpressure implementation didn’t slow things down fast enough.
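As a sketch of that pattern (using tokio, with hypothetical types and names; the real Guest code is organized differently):

use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Mutex;

// Hypothetical stand-in for the guest-facing submission state.
struct GuestIo {
    reqs: Vec<Vec<u8>>, // pending write payloads (stand-in for Guest::reqs)
}

impl GuestIo {
    // Stand-in for the backpressure calculation described above.
    fn backpressure_delay(&self) -> Duration {
        Duration::from_micros(100 * self.reqs.len() as u64)
    }
}

async fn submit_write(io: &Arc<Mutex<GuestIo>>, data: Vec<u8>) {
    // Take the lock *before* sleeping, and hold it across the delay. This
    // serializes writers: with N concurrent submitters (e.g. Propolis’s
    // 8-task worker pool), the aggregate submission rate is one write per
    // delay period rather than N writes per delay period.
    let mut guard = io.lock().await;
    let delay = guard.backpressure_delay();
    if !delay.is_zero() {
        tokio::time::sleep(delay).await;
    }
    // Enqueue the write; per the fast-ack contract it can be acknowledged to
    // the caller immediately, without waiting for it to complete.
    guard.reqs.push(data);
}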
Mental model for throughput backpressure
The current backpressure implementation can be thought of as "throughput backpressure": given an infinite stream of incoming writes, the system stabilizes at a point where the backpressure delay matches the true time to complete each write.
In other words, the system stabilizes in a "one-in, one-out" state: each new write is delayed for long enough that the oldest write in the queue completes as it’s added to the queue. If the backpressure delay is too short, then the queue grows (and the delay increases); vice versa if the delay is too long.
In the diagram below, the Upstairs ⟷ Downstairs operation (including backpressure) matches the Upstairs ⟷ Disk time.
sequenceDiagram
Guest ->> Upstairs: backpressure
Upstairs ->> Guest: immediate ack
Upstairs ->> Downstairs: network
Downstairs ->> Disk: filesystem
Disk ->> Downstairs: filesystem
Downstairs ->> Upstairs: network
Deficiencies of the current system
IOP and bandwidth limits do not cause backpressure
In the current implementation, IOP and bandwidth limits are implemented by limiting the rate at which Guest::consume_req can be called (by up_listen). Those limits are not connected to the backpressure system, meaning jobs simply accumulate in the Upstairs (in Guest::reqs), and the guest could use unbounded amounts of RAM.
This is mitigated by the fact that we never enable IOP or bandwidth limiting.
Recommendation: Remove the current IOP and bandwidth limit implementation; reimplement it as part of the backpressure system further down the road.
What is MAX_ACTIVE_COUNT doing?
Right now, we limit in-progress jobs (i.e. jobs that are heading through the network to the Downstairs and back) to 2600 on a per-client basis. New jobs above that threshold stay in ds_new for that client; they’re still counted as live work for the purposes of other backpressure. The MAX_ACTIVE_COUNT limit is a hard limit and isn’t connected to the rest of the backpressure system.
This is a little weird but isn’t obviously misbehaving. The implementation does what it’s intended to do, and is compatible with the rest of the backpressure system; jobs in ds_new still count for backpressure.
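A sketch of the per-client flow (the field and function names here are hypothetical, not the actual Downstairs structures):

use std::collections::VecDeque;

const MAX_ACTIVE_COUNT: usize = 2600;

// Hypothetical per-client job bookkeeping.
struct ClientJobs {
    ds_new: VecDeque<u64>, // job IDs not yet sent to this Downstairs
    in_progress: Vec<u64>, // job IDs on the wire or being processed
}

impl ClientJobs {
    // Move jobs from `ds_new` to the in-progress set, clamped to
    // MAX_ACTIVE_COUNT. Jobs left behind in `ds_new` still count as live
    // work for the backpressure calculation.
    fn send_new_work(&mut self) {
        while self.in_progress.len() < MAX_ACTIVE_COUNT {
            let Some(job_id) = self.ds_new.pop_front() else { break };
            // ...actually transmit the job to the Downstairs here...
            self.in_progress.push(job_id);
        }
    }

    // Live job count as seen by backpressure: both queues count.
    fn live_jobs(&self) -> usize {
        self.ds_new.len() + self.in_progress.len()
    }
}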
Problematic bytes-in-flight delay
There’s a subtle failure mode in our current bytes-in-flight backpressure limitation.
First, remember that we use "total live jobs" as our indication that a Downstairs has stopped responding, marking the Downstairs as faulted when total live jobs reaches IO_OUTSTANDING_MAX (by default 57000).
If we send many small writes to a Downstairs that is no longer responding, backpressure will be determined by the "live jobs" limit, which maxes out at 5 ms. The total delay with our current parameters will be 47 seconds, which does not seem unreasonable.
However, if we send large writes, the backpressure delay will be determined by the bytes-in-flight limitation, which is not clamped. As we approach 57000 jobs, the delay due to bytes-in-flight will become enormous; in practice, we’ll never hit the 57000-job fault threshold (a quick Python calculation tells me it would take 107 days).
Recommendation: We should also fault at a particular level of bytes-in-flight. Doing so would create a pleasing symmetry between our two sources of backpressure: both sources could kick in at some fraction of the fault threshold, and scale up (quadratically) to a particular maximum delay before faulting.
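A sketch of that symmetry, in which every parameter is a hypothetical tunable rather than a value we have chosen:

use std::time::Duration;

// Each backpressure source starts at some fraction of its fault threshold and
// ramps quadratically to a maximum delay at the threshold itself.
struct BackpressureSource {
    fault_threshold: u64, // fault the Downstairs above this value
    start_fraction: f64,  // backpressure begins at this fraction of the threshold
    max_delay: Duration,  // delay reached at the fault threshold
}

impl BackpressureSource {
    fn delay(&self, value: u64) -> Duration {
        let start = (self.fault_threshold as f64 * self.start_fraction) as u64;
        if value <= start {
            return Duration::ZERO;
        }
        let frac = (value - start) as f64 / (self.fault_threshold - start) as f64;
        self.max_delay.mul_f64((frac * frac).min(1.0))
    }

    fn faulted(&self, value: u64) -> bool {
        value > self.fault_threshold
    }
}

// One instance each for live jobs and bytes in flight; the overall delay is
// the maximum of the two, and either source can trip the fault.

Clamping both curves at a finite maximum delay before the fault threshold would also avoid the runaway bytes-in-flight delay described above.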
Whither latency?
This model works fine for writes, which are delayed due to backpressure but then return immediately. However, it’s problematic for other operations: for example, a flush will have to wait for every single preceding write operation to complete, so write × N, flush can be slow.
We may also want to exert backpressure to limit maximum latency.
It’s not obvious to me how we could do this cleanly. Some operations will necessarily take longer than others (e.g. the hypothetical 3 GiB write above), and we don’t directly control latency.
We could try adding an additional backpressure source based on the latency of periodic operations (e.g. the periodic ping). However, this seems tricky: the pings only occur every 5 seconds, so the control loop would have to be very slow to be stable. As the ZFS write throttle blog post notes, measuring throughput is generally problematic!
A more reasonable option may be to tune the existing backpressure limits such that our max latency is never particularly long. For example, we could begin exerting backpressure at 100 MiB of data in-flight instead of 1 GiB. We implemented such a change in crucible#1047, changing the jobs-in-flight backpressure to start at 5700 jobs in flight (down from 28500), which improved system robustness.
Aside: counter-intuitive latency behavior
Speeding up the Upstairs can actually increase latency when we’re sending many writes followed by a read or flush (as described in the section above).
For a write job, there are three times to consider. The first two are:
T_ack, how long it takes to ack. This is "as fast as possible", limited only by data structures in the Upstairs.
T_done, how long the write actually takes to finish.
The Guest is limited by T_ack as to how fast it can submit writes.
If T_ack >= T_done, then the queue length will remain at (roughly) 0, because jobs are completed at least as quickly as they are submitted.
If T_ack < T_done, then the queue length will increase internally. At a certain point, backpressure will kick in, delaying guest requests to T_ack + T_bp (our third time to consider). The queue length will stabilize at a length such that T_ack + T_bp == T_done.
The total latency to actually complete a job is given by T_done × the number of writes in the queue. This is invisible to writes (because they ack right away), but a subsequent read (or flush) will see this true latency.
If T_ack gets smaller (because we are processing jobs faster), then T_bp must become larger to compensate, so the queue length will stabilize at a longer length. This means that the latency for a subsequent flush or read will increase.
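A toy calculation makes the effect concrete. All numbers below are invented for illustration, the backpressure curve is simplified to a linear ramp, and the flush latency uses the rough estimate above (T_done × queue length):

fn main() {
    let t_done = 1000.0; // µs: true time for a write to complete
    let k = 0.05; // µs of backpressure delay per queued job (invented)

    for t_ack in [500.0_f64, 100.0] {
        // Steady state: T_ack + T_bp == T_done, so T_bp = T_done - T_ack.
        let t_bp = t_done - t_ack;
        // Invert the (simplified, linear) backpressure curve for queue length.
        let queue_len = t_bp / k;
        // Latency seen by a subsequent flush, per the estimate above.
        let flush_latency_ms = t_done * queue_len / 1000.0;
        println!(
            "T_ack = {t_ack} µs -> queue ≈ {queue_len:.0} jobs, \
             flush waits ≈ {flush_latency_ms:.0} ms"
        );
    }
}

With these made-up numbers, speeding up acks from 500 µs to 100 µs grows the steady-state queue from roughly 10,000 to 18,000 jobs, so a subsequent flush waits nearly twice as long.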
Determinations
Add "too many bytes in flight" as a failure condition, analogous to the existing "too many jobs in flight" failure condition. This would improve failure detection for large-write workloads if a Downstairs goes away.
Tune the start points and initial slope of the backpressure curves to keep worst-case latency to a reasonable level. We’ve already done this for jobs-in-flight, but should do the same for bytes-in-flight.
Consider removing (and eventually reimplementing) IOP and bandwidth limiting on the Guest: they’re not used at the moment (outside of unit tests), and should be implemented in a way that feeds into the backpressure system.
Open Questions
What is the correct shape for the backpressure curve?
Right now, the implementation is quadratic; the ZFS write throttle (which is very relevant prior art!) has the form delay = (dirty - min_dirty) / (max_dirty - dirty).
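For comparison, the two shapes can be written side by side; the parameters below are arbitrary, and the ZFS-style formula is just the general form quoted above multiplied by a time-scale constant:

// Quadratic ramp (the current approach): zero at `start`, `max_delay` at `max`.
fn quadratic_delay(x: f64, start: f64, max: f64, max_delay: f64) -> f64 {
    if x <= start {
        0.0
    } else {
        let frac = (x - start) / (max - start);
        frac * frac * max_delay
    }
}

// ZFS-write-throttle-style curve (defined for dirty < max_dirty): the delay
// grows without bound as `dirty` approaches `max_dirty`.
fn zfs_style_delay(dirty: f64, min_dirty: f64, max_dirty: f64, scale: f64) -> f64 {
    if dirty <= min_dirty {
        0.0
    } else {
        scale * (dirty - min_dirty) / (max_dirty - dirty)
    }
}

The structural difference is the asymptote: the ZFS-style curve is designed so the limit is effectively never exceeded, while a quadratic curve reaches a finite delay at its nominal maximum and (unless clamped) simply keeps growing past it.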
What other sources of backpressure do we want?
We already discussed implementing IOP and bandwidth limits as backpressure. Are there other resources that we should limit to ensure quality-of-service across many VMs in the rack? For example, we could use backpressure to maintain a maximum level of RAM consumption by the Upstairs (although this specific example may already be approximated by bytes-in-flight).
Security Considerations
This RFD does not open any new security considerations, but reminds us that the rack exists in a world of bounded resources.
A robust and well-tuned backpressure implementation is necessary to ensure availability and quality of service. Without backpressure, write-heavy guests could consume disproportionate amounts of RAM on their physical sled, because the upstairs would end up buffering arbitrarily large numbers of writes.