This document describes the implementation of the "Northern Mux" design from RFD 0060, a virtual storage service named Crucible.
Crucible overview
Referencing section 9.2 of RFD 60, the key points are copied here:
The virtual disk volume in North is statically divided into LBA ranges (e.g. in 1GB chunks)
Each range has 3 associated instances of South
North and South communicate over a network protocol for both block data and related metadata
North handles a write by checking its LBA and forwarding it to the appropriate South instances
The initial implementation of the Crucible crate provides (among other things):
an implementation of "South", called Downstairs, that manages storing blocks on actual storage
an implementation of "North", called Upstairs, that communicates with one set of three Downstairs
a library for interacting with the Upstairs, called Guest
a Downstairs region and process management service called the Agent
Propolis is the service that manages virtual machines, and will also implement the device emulation that those virtual machines interact with. By including the Crucible crate, Propolis will use the Guest to send the standard read, write, and flush commands to the Upstairs, which will then fulfill those requests by communicating with the Downstairs:
```
┌────────────────────────────────────────────────────────┐
│ Sled                                                   │
│                                                        │
│   Guest VM ── Virtio device                            │
│       │                                                │
│   Propolis (pci-crucible-block)                        │
│       │                                                │
│   Crucible Upstairs (part of the Propolis image)       │
│   The volume address space (LBA) is broken into        │
│   region sets:                                         │
│          ┌───────┬───────┬-----┬───────┐               │
│          └───┬───┴───┬───┴-----┴───┬───┘               │
└──────────────│───────│─────────────│───────────────────┘
               │       │             │
  -------------│-------│-------------│-------  Network
               │       │             │
    ┌──────────┴───┐ ┌─┴────────────┐ ┌──────┴───────┐
    │   Crucible   │ │   Crucible   │ │   Crucible   │    Sleds
    │  Downstairs  │ │  Downstairs  │ │  Downstairs  │
    └──────────────┘ └──────────────┘ └──────────────┘

  All three Downstairs make up a Region Set.
  Each Downstairs is responsible for a region of the LBA.
```
Guest + Upstairs
The Guest takes IO from Propolis and communicates it to the Upstairs, which sends requests to the Downstairs over the network, waits for responses, and reports the results back to Propolis.
The Guest and Upstairs are meant to keep as little state as possible, and require no additional storage apart from the data structures in RAM.
Downstairs
The downstairs stores blocks on physical storage. It is intentionally meant to perform as little of its own logic as it can get away with - all logic should be performed by the Upstairs.
A Downstairs will answer requests from any Upstairs that has resources allocated to the SSD this Downstairs operates on.
Regions and Extents
The region of data that a Downstairs is responsible for is broken into what we call extents. A Downstairs stores guest data in these extents (files) based on the offset into the LBA of the virtual volume:
```
              |  --- Block address for a region ---  |
              ┌─────────┬─────────┬─────────┬─────────┐
Guest data:   │ABCDOXIDE│0x1de478 │123456890│ZYXWVUTSR│
              └─────────┴─────────┴─────────┴─────────┘
                   │         │         │         │
                   │         │         │         │
              ┌─────────┐    │         │         │
Extent File 0 │ABCDOXIDE│    │         │         │
              └─────────┘    │         │         │
                        ┌─────────┐    │         │
          Extent File 1 │0x1de478 │    │         │
                        └─────────┘    │         │
                                  ┌─────────┐    │
                    Extent File 2 │123456890│    │
                                  └─────────┘    │
                                            ┌─────────┐
                              Extent File 3 │ZYXWVUTSR│
                                            └─────────┘
```
(Note that the Downstairs does not use a log structured data structure - blocks are stored in files.)
A downstairs region is a collection of extents that together cover the LBA range the downstairs is responsible for. An extent is a file that stores multiple contiguous blocks of user data. We store metadata for each extent in a database file that is in the same directory as the extent files.
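To make the extent addressing concrete, here is a minimal sketch (in Rust) of how a block offset maps to an extent file and an offset inside it. The type and field names are illustrative assumptions, not the crate's actual API; the 4 KiB and 64 MiB figures below simply reuse the defaults discussed in this document.

```rust
/// Hypothetical illustration of extent addressing within a region.
struct RegionLayout {
    /// Number of blocks stored in each extent file.
    blocks_per_extent: u64,
}

impl RegionLayout {
    /// Returns (extent file index, block offset within that extent).
    fn locate(&self, block: u64) -> (u64, u64) {
        (block / self.blocks_per_extent, block % self.blocks_per_extent)
    }
}

fn main() {
    // Assuming 4 KiB blocks and 64 MiB extent files: 16384 blocks per extent.
    let layout = RegionLayout { blocks_per_extent: 16384 };

    // Block 20000 of the region lands in extent file 1, at block 3616
    // within that file.
    assert_eq!(layout.locate(20000), (1, 3616));
}
```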
For each extent, we store the following in a SQLite database that lives alongside the data:
A generation number
A flush number
A dirty bit
These three values together allow the Upstairs to determine the latest data when comparing three Downstairs on startup:
The generation number is incremented every time an upstairs is attached to an instance and should only ever go up.
The flush number is an ever increasing value that is incremented every time an extent is flushed.
The dirty bit is set before an extent is modified. When an extent does a flush, it writes out the generation number and flush number and clears the dirty bit.
For each block, we store the following in that same SQLite database:
An integrity hash
Optionally, an encryption context (more on this later)
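Sketched below is one way to picture this metadata as Rust types. These definitions are illustrative only (the names and exact layout are assumptions, not the crate's real types), but they capture what is stored per extent and per block.

```rust
/// Per-extent metadata, compared across the three Downstairs at startup
/// to decide which copy of an extent holds the latest data.
struct ExtentMeta {
    gen_number: u64,   // bumped each time an Upstairs attaches; only ever goes up
    flush_number: u64, // bumped every time this extent is flushed
    dirty: bool,       // set before the extent is modified, cleared on flush
}

/// Per-block metadata stored alongside the block data.
struct BlockContext {
    hash: u64,                                // integrity hash of the stored block
    encryption_context: Option<(Nonce, Tag)>, // present only for encrypted volumes
}

// Placeholder aliases for the AES-GCM-SIV nonce (96 bits) and tag (128 bits).
type Nonce = [u8; 12];
type Tag = [u8; 16];
```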
Based on a first pass at balancing performance, number of files, and memory usage, we selected 64 MiB as the default size for an extent file. This number is subject to change.
Agent
Each of the ten U.2 SSDs in a compute sled will be formatted as a single-device zpool with no redundancy.
A Crucible zone will be created for each SSD.
Each Crucible zone will have a delegated ZFS dataset (file system) within the pool on the SSDs where it will create and store regions:
```
# zfs list -o name
NAME
data
data/crucible
data/crucible/regions
data/crucible/regions/0d052762-10dd-4481-9a1a-1d36b9799f51
data/crucible/regions/2f8d3bd6-bd47-4ae9-8801-c75e81e7c9d7
data/crucible/regions/6dec576f-0655-4f90-afd6-93ce0f5aacb9
```
Inside a Crucible zone you will have a single Crucible agent and one or more Downstairs processes:
```
# svcs | grep oxide | awk '{ print $NF }'
svc:/oxide/crucible/agent:default
svc:/oxide/crucible/downstairs:downstairs-0d052762-10dd-4481-9a1a-1d36b9799f51
svc:/oxide/crucible/downstairs:downstairs-6dec576f-0655-4f90-afd6-93ce0f5aacb9
svc:/oxide/crucible/downstairs:downstairs-2f8d3bd6-bd47-4ae9-8801-c75e81e7c9d7
```
The Crucible agent is responsible for listening for requests from Nexus. When a request arrives, the agent will create the Downstairs region and either start a new Downstairs process for this region, or update a running Downstairs process to answer requests for this new region.
The exact balance of how many Downstairs processes are started versus reused is still TBD, pending evaluation of performance, fault tolerance, and resource requirements.
Volume abstraction
Most of the time, Guest VMs will interact with Volumes. Volumes present the same interface as the Upstairs (read, write, flush, etc) but where the blocks are read from and written to is different.
Volumes are composed of a list of subvolumes, along with an optional read only parent. The subvolumes are concatenated together end to end, while the read only parent always underlays the subvolumes, starting at the beginning of the Volume.
For example, here is a Volume with four subvolumes and one read only parent:
```
Volume:     |-----------------------------------|

Subvolume:  |--------|
Subvolume:           |--------|
Subvolume:                    |--------|
Subvolume:                             |--------|

Parent:     |----------------------|
```
The typical use of the read only parent will be to construct disks with an image underlay - creating a new disk out of an Ubuntu image for example. The term read only parent is used here but could easily be substituted with "ubuntu image". See [_relationship_between_volumes_and_api_objects] for more.
The Volume layer will take care of translating requests in the LBA space to the appropriate subvolume or optional read only parent (more on this later). Without a read only parent, reads and writes to the Volume will be directed to the appropriate subvolume:
```
Volume:     |-----AAAAAAA---------BB------------|

Subvolume:  |-----AAA|
Subvolume:           |AAAA----|
Subvolume:                    |---BB---|
Subvolume:                             |--------|
```
The example read or write range denoted by A interacts with two subvolumes, while the read or write range denoted by B only interacts with one.
If a read only parent is present, reads that occur to the LBA range of the read-only parent will read from both that parent and the subvolumes:
```
Volume:     |-----AAAAAAA-----------------------|

Subvolume:  |-----AAA|
Subvolume:           |AAAA----|
Subvolume:                    |--------|
Subvolume:                             |--------|

Parent:     |-----AAAAAAA----------|
```
Crucible stores per-block integrity hashes, and those are used here to determine which block (from the read only parent or from the appropriate subvolume) to serve:
If a subvolume’s block does not have an integrity hash, serve the read only parent’s block - the Guest has not written to this block yet:
```
Volume:     |-----1111111-----------------------|

Subvolume:  |--------|
Subvolume:           |--------|
Subvolume:                    |--------|
Subvolume:                             |--------|

Parent:     |-----1111111----------|
```
If a subvolume’s block has an integrity hash, serve that - the Guest has written something to the Volume there.
```
Volume:     |-----1122211-----------------------|

Subvolume:  |-------2|
Subvolume:           |22------|
Subvolume:                    |--------|
Subvolume:                             |--------|

Parent:     |-----1111111----------|
```
Writes will never be sent to the read only parent, only subvolumes. In this way, writes to subvolumes overlay blocks from the read only parent.
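A minimal sketch of this per-block decision, assuming a hypothetical `BlockRead` type that carries the data and the optional integrity hash returned by a read:

```rust
/// Result of reading one block from a subvolume or the read only parent.
/// Hypothetical type for illustration only.
struct BlockRead {
    data: Vec<u8>,
    hash: Option<u64>, // None if this block has never been written
}

/// Decide which copy of a block to serve. Writes never reach the read only
/// parent, so a present hash means the Guest has written this block and the
/// subvolume copy wins.
fn merge_block(subvolume: BlockRead, parent: Option<BlockRead>) -> Vec<u8> {
    match (subvolume.hash, parent) {
        // The Guest has written here: always serve the subvolume copy.
        (Some(_), _) => subvolume.data,
        // Never written, and the parent covers this LBA: serve the parent.
        (None, Some(p)) => p.data,
        // Never written and outside the parent's range: serve the
        // (unwritten) subvolume block as-is.
        (None, None) => subvolume.data,
    }
}
```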
Reading from two locations like this incurs a read latency penalty as long as the read only parent is present. The intention is that a background task will be reading blocks from the read-only parent and writing those to the volume (creating an integrity hash in the appropriate subvolume), and once the read only parent’s whole LBA range has been copied the read only parent will be removed. See [_block_migration_task] for more information.
Relationship between Volumes and API Objects
As previously stated, the typical use of the read only parent will be to construct disks with an image underlay:
```
Volume:     |-----------------------------------|

Subvolume:  |-----------------|
Subvolume:                    |-----------------|

Parent:     |--- ubuntu image ---|
```
Disks can also be constructed with a snapshot underlay:
```
Volume:     |-----------------------------------|

Subvolume:  |-----------------|
Subvolume:                    |-----------------|

Parent:     |------ disk snapshot ------|
```
Importantly, volumes can be layered. A user can construct a disk with an image underlay, write to it, save that as a snapshot, and then construct a disk using that snapshot as an underlay:
```
Volume B:              |---------------------------------------|

Subvolume:             |---------|
Subvolume:                       |--------|
Subvolume:                                |--------------------|

Parent:                |------ snapshot of volume A -------|

Snapshot of Volume A:  |-----------------------------------|

Subvolume snapshot:    |-----------------|
Subvolume snapshot:                      |-----------------|

Parent:                |--- ubuntu image ---|
```
Reads may source from the ubuntu image, one of the snapshot of volume A’s subvolumes, or one of volume B’s subvolumes. Writes always land at the uppermost layer’s subvolumes (writes are never sent to the read only parent).
The contract that the volume abstraction must maintain is that any operation will not change the contents of a block from the perspective of the Guest OS. Another way of stating this is that only the Guest OS will modify blocks.
Volume Construction Requests
Nexus will create, store, and send Volume Construction Requests, each of which is a JSON serializable enum describing the volume topology to create. All the receiving end (Propolis, etc.) has to do is call Volume::construct with the construction request in order to obtain a Volume object that can then be used normally.
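As a rough, hypothetical picture of what such a request might look like (the real enum lives in the Crucible crate and carries more detail, such as block size, target addresses, and encryption material), and how the receiving end would use it:

```rust
// Simplified sketch only; not the crate's actual definition.
#[derive(serde::Serialize, serde::Deserialize)]
enum ConstructionRequest {
    Volume {
        id: uuid::Uuid,
        sub_volumes: Vec<ConstructionRequest>,
        read_only_parent: Option<Box<ConstructionRequest>>,
    },
    RegionSet {
        // The three Downstairs that mirror this region set.
        targets: Vec<std::net::SocketAddr>,
        generation: u64,
    },
}

// The receiving end deserializes the JSON it was handed and constructs the
// Volume, roughly: let volume = Volume::construct(request, ...)?;
```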
Block migration task
Block migration (also known as the Scrubber) is responsible for copying all the data from the read only parent to the underlying read write subvolume(s). The scrubber must allow writes to the subvolume and must not overwrite them with data from the read only parent.
Some assumptions/requirements:
The scrubber “lives” in the volume layer.
Data arrives at the volume layer unencrypted; all the encryption/decryption happens in the guest/upstairs layers (below the volume layer).
The scrubber only has to send data from RO to the subvolume(s), never to propolis or out to a guest.
The existence of a block's checksum is used to determine whether that block has been written or not.
In the Volume Layer:
Scrubber reads data from the RO side (starts with block 0).
Scrubber constructs a WriteUnwritten IO (looks very similar to a write).
Scrubber sends this WriteUnwritten to the subvolume.
On ACK, the scrubber moves on to the next block (keeps track of where it is).
Scrubber will send a flush when it reaches the end of an extent, and should only update the flush point when (all??) downstairs have returned ACKs. Do we need to have three? Can it still work if two are present?
Regular IOs can continue. If an IO is below the scrub point, it can go directly to the subvolume. If an IO is above the scrub point, the usual path that exists now will be taken of reading both the RO parent and the subvolume (a sketch of the scrub loop follows).
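A minimal sketch of that loop, assuming hypothetical helper methods on the volume (`read_parent_block`, `write_unwritten`, `flush`, and scrub-point bookkeeping) and one block per step; a real task would batch IO and track extent boundaries more carefully:

```rust
/// Hypothetical scrubber loop in the volume layer. `Volume` and
/// `CrucibleError` are placeholders for illustration.
async fn scrub(volume: &mut Volume) -> Result<(), CrucibleError> {
    let total_blocks = volume.read_only_parent_block_count();
    for block in 0..total_blocks {
        // Read the block from the read only parent.
        let data = volume.read_parent_block(block).await?;

        // WriteUnwritten: only lands if the guest has never written here.
        volume.write_unwritten(block, data).await?;

        // Remember how far we have scrubbed so regular IO below this point
        // can skip the read only parent entirely.
        volume.set_scrub_point(block);

        // Flush at extent boundaries so the scrub progress is durable.
        if volume.is_extent_boundary(block) {
            volume.flush().await?;
        }
    }
    Ok(())
}
```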
SubVolume:
Takes the WriteUnwritten op, sends it to each downstairs.
Treated a little differently (maybe?), requires an ACK from all three downstairs?
Downstairs:
Receives WriteUnwritten.
Tries to read the block in question; if that block is unwritten (has no hash), it writes the data to the block. If the block already has data, no action is taken. Each block is checked independently.
Return WriteUnwrittenACK back to upstairs.
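A sketch of the per-block check on the Downstairs side, again using hypothetical helpers for reading block metadata and writing block data:

```rust
/// Handle one block of a WriteUnwritten request. `Extent`, `BlockContext`,
/// and `CrucibleError` are placeholders for illustration.
fn handle_write_unwritten(
    extent: &mut Extent,
    block: u64,
    data: &[u8],
    new_context: BlockContext,
) -> Result<(), CrucibleError> {
    // A block with no stored context (no hash) has never been written.
    if extent.block_context(block)?.is_none() {
        extent.write_block(block, data, new_context)?;
    }
    // If the block already has data, take no action: the guest's write wins.
    // The caller ACKs each block independently once all blocks are checked.
    Ok(())
}
```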
Interfaces
Nexus ↔ Crucible Agent
This interface is used by Nexus to allocate storage resources and assign them to a VM.
Workflow for creating a Virtual Volume:
User selects a volume size.
The requested volume size LBA is divided into segments and each segment is assigned (by Nexus) to three downstairs that will provide a 3-way mirror of the data for that segment. If the volume size is small enough, there may only be a single segment.
```
        │        Volume LBA           │
        │   broken into Region Sets   │
        ┌───────┬───────┬-----┬───────┐
        └───┬───┴───┬───┴-----┴───┬───┘
            │       │             │
            │       │             └── Additional downstairs
            │       │                 if needed, depends on
            │       │                 the size of the volume
            │       │
   ┌────────┴─────┐ │                 Each Region Set is
   │   Crucible   │ │                 provided by a triplet
   │  Downstairs  │ │                 of downstairs
   └──────────────┘ │
            ┌───────┴──────┐
            │   Crucible   │
            │  Downstairs  │
            └──────────────┘
```
Nexus sends each Crucible Agent the size of the segment along with a UUID describing the overall storage volume and a UUID for this specific downstairs.
Each Crucible Agent will create the "Region" on the ZFS file system the zone lives on.
Once the region is created, the Crucible Agent either starts a new downstairs process or selects a running downstairs to be responsible for the new region.
Finally, the downstairs port number and any other needed info is reported back to Nexus.
Nexus then passes the list of downstairs IPs, port numbers, UUIDs, and any certificates or keys to Propolis.
Propolis ↔ Upstairs
Propolis will use the Guest to communicate with an Upstairs. Each Guest will implement basic read, write, and flush functionality, plus some Crucible specific functions.
Importantly, Propolis will handle all device emulation. Users will select NVMe, virtio, or some other device whose block requests will be fulfilled by Crucible, and Crucible will be none the wiser.
The actual work (encryption, network I/O) will occur on another set of threads Upstairs manages (e.g., a tokio runtime).
Completion will be signaled through a callback provided to the Upstairs by Propolis at setup time.
Rate limiting can be done in both Propolis and the Upstairs. The Upstairs has more information about its internals than is available to Propolis (e.g., whether one of the three downstairs is having trouble keeping up), but it would also make sense to have a limit at Propolis. Some consideration was given to having the network switch enforce bandwidth limits for storage traffic, but that may be too little too late, though it could prevent a bad actor from mounting a DOS attack on the network backplane.
Crucible details
Activation
When an Upstairs connects to a Downstairs, part of the negotiation that occurs will involve a process called "Activation". The Upstairs requests to be Active, and sends along a few parameters:
the Upstairs' UUID
a unique session UUID
the Upstairs' generation (shortened to gen) number
a boolean where true is requesting a read-only activation, and false is requesting a read-write activation.
a boolean indicating whether the Upstairs expects the Downstairs to have its encrypted flag set to true or to false.
Between different connections, the Upstairs UUID remains the same, and the session UUID changes. This allows for future optimizations, such as handling different cases:
a connection with the same Upstairs UUID, session UUID, and gen number may be a network blip.
a connection with the same Upstairs UUID and gen number but different session UUID may result from the Upstairs crashing and restarting.
A Downstairs can be started in either read-write or read-only mode. A Downstairs launched in read-only mode will return an error for any operation that modifies on-disk state - typically, a Downstairs that is launched to serve data from a ZFS snapshot directory will be in read-only mode, and regular Downstairs will be in read-write mode.
If there is a mismatch between the Upstairs' requested mode (read-only or read-write) and the Downstairs' mode, the activation is rejected by the Downstairs. Similarly, if there is a mismatch in the encryption expectation, the activation will be rejected.
Each Upstairs must use a unique UUID, as this disambiguates the job IDs that arrive at the Downstairs. Generation number will be used to deactivate old connections so that at most one Upstairs with a certain UUID is connected and active.
An Upstairs activation will be accepted if there are no existing sessions with the same Upstairs UUID. If there is an existing session with the same Upstairs UUID, the Upstairs activation will be accepted if the gen number is higher than the current session’s gen number. In this case the current session will be forcibly deactivated by the Downstairs. This is intentionally a takeover, as opposed to more complicated procedures like leader election.
A read-only Downstairs supports multiple active read-only Upstairs connecting to it. Any Upstairs that tries to issue an operation that would change on-disk state will see an error. It will not support any read-write Upstairs connections.
A read-write Downstairs only supports one active read-write Upstairs connecting to it. It will not support any read-only Upstairs connections.
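The acceptance rules above could be summarized with a sketch like the following; the types and the surrounding session bookkeeping are hypothetical, and a real implementation also has to handle the multiple-read-only-sessions case:

```rust
/// Parameters an Upstairs sends when requesting activation (sketch).
struct ActivationRequest {
    upstairs_id: uuid::Uuid,
    session_id: uuid::Uuid,
    gen: u64,
    read_only: bool,
    expect_encrypted: bool,
}

/// Decide whether a read-write Downstairs should accept an activation.
/// `existing_gen` is the gen number of an already-active session with the
/// same Upstairs UUID, if any.
fn accept_activation(
    ds_read_only: bool,
    ds_encrypted: bool,
    existing_gen: Option<u64>,
    req: &ActivationRequest,
) -> bool {
    // Reject mode and encryption mismatches outright.
    if req.read_only != ds_read_only || req.expect_encrypted != ds_encrypted {
        return false;
    }
    match existing_gen {
        // No competing session: accept.
        None => true,
        // A higher generation takes over; the Downstairs forcibly
        // deactivates the current session.
        Some(current) => req.gen > current,
    }
}
```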
Supporting Snapshots
When a snapshot is taken, Nexus will tell the appropriate Crucible Agent to launch a corresponding read-only Downstairs serving out of the ZFS snapshot directory. In order to support creating many disks "out of" those snapshots (see [_relationship_between_volumes_and_api_objects]), any number of volumes can be constructed using those read-only Downstairs as the read-only parent.
Each one of those connections to the read-only Downstairs must use a unique Upstairs UUID. Nexus is required to change these IDs when copying the volume construction request.
Supporting Live Migration
During Live Migration, Nexus will send the destination VM the same volume construction requests that the source VM received, except that the generation numbers will be increased by one.
This will cause each of the destination VM’s Upstairs' activations to forcibly terminate the source VM’s Upstairs' connections.
Supporting Hot Plug
Removing a disk, then attaching a disk will cause the volume’s generation numbers to be increased by one.
End to End data integrity
End to end data integrity protects from corruption between data in a buffer in a guest VM, and the data on physical storage. ZFS will provide us with protection for corruption of data at rest but we need to consider corruption of data after Crucible accepts it and before it lands on disk. Corruption can occur at any point: in Crucible memory, as it goes over the network, in the NICs, etc.
Note: Crucible can only protect data once it lands in Crucible's buffers; to really offer E2E protection, the OS running in the VM will have to do it itself.
Three mirrors give Crucible the ability to compare copies for corruption that happens after data leaves an upstairs, and per-block integrity hashes allow for determining whether corruption has occurred. Hashes are computed post-encryption by the Upstairs and are part of the protocol messages that are sent to the Downstairs:
Write messages have a spot for the Upstairs to send an integrity hash along with the data
Read messages have a spot for the Downstairs to return integrity hashes along with the data
As mentioned above, ZFS itself detects and corrects corruption of data at rest on the underlying hardware.
xxHash was chosen as the hash algorithm:
xxHash is an Extremely fast Hash algorithm, running at RAM speed limits.
The hash digest is a simple u64 value meaning the protocol messages are not made unnecessarily large by including these digests.
Encryption of data
We plan to encrypt data when it arrives in the Upstairs and decrypt it before it leaves the Upstairs - users of the Guest will be unaware.
Data in transit between Upstairs and Downstairs will occur over TLS.
By encrypting before mirroring, reconstruction between mirrors will be simpler as the encrypted data should be the same on each mirror, and the keys will not be required to reconstruct.
Keys for data encryption will be provided from Nexus (it’s part of the material that an Upstairs will be initialized with).
AES-GCM-SIV-256 was chosen as the algorithm:
AES-GCM-SIV is a mode of operation for the Advanced Encryption Standard which provides similar performance to Galois/Counter Mode as well as misuse resistance in the event of the reuse of a cryptographic nonce. AES-GCM-SIV is designed to preserve both privacy and integrity even if nonces are repeated.
The nonce and tag associated with AES-GCM-SIV is referred to as an "encryption context" for a block, and is stored in the SQLite database for an extent.
Authenticated encryption was chosen to mitigate the scenario when an attacker has access to extents Downstairs - with unauthenticated encryption, that attacker can modify those blocks and the Upstairs will decrypt them, producing garbage data or something worse. With authenticated encryption, the Upstairs will be aware of this tampering, and will report reads of those blocks as an error instead of serving garbage data.
Integrity hashes for blocks with encryption contexts will include those encryption contexts as part of the input to the hash algorithm. This is to ensure end to end integrity of the encryption context as well.
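For illustration, a hash over an encrypted block and its encryption context might look like the sketch below (the choice of the `twox-hash` crate and a zero seed are assumptions for the example, not a statement about the actual implementation):

```rust
use std::hash::Hasher;
use twox_hash::XxHash64;

/// Hash the encrypted block data together with its nonce and tag so that
/// corruption of either the data or the encryption context is detectable.
fn integrity_hash(encrypted_data: &[u8], nonce: &[u8], tag: &[u8]) -> u64 {
    let mut hasher = XxHash64::with_seed(0);
    hasher.write(encrypted_data);
    hasher.write(nonce);
    hasher.write(tag);
    hasher.finish()
}
```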
The performance of AES-GCM-SIV is mostly the same as AES-XTS - as Josh pointed out, they’re both AES. Introducing per-block metadata storage incurred a small performance degradation, but this was seen with unencrypted blocks requiring integrity hash storage as well.
Key rotation will require rewriting of all data. If Volume 1 exists with key A, rekeying can be achieved by constructing a new Volume 2 with key B that uses Volume 1 as a read only parent:
```
Volume:     |-------------------------|

SubVolume:  |-------------------------|   <- Guest with key B

Parent:     |-------------------------|   <- Guest with key A
```
The migration task that copies blocks from the read only parent to the Volume will then perform the re-encryption. In Propolis, Volume 2 can be swapped for Volume 1 at runtime with no impact to the VM.
Snapshots
Snapshots will be coordinated from the Upstairs with a request sent to all downstairs using the existing dependency mechanism. This will guarantee that all downstairs take the snapshot at the same time. ZFS will be used on the region file system to create the snapshot.
The storage space required for a snapshot will be the same as the space of the original volume.
The data present in the appropriate ZFS snapshot directory will be served by a new Downstairs, and will be copied behind the scenes to a new set of three Downstairs (by some migration task) until the new set has a full copy of all the data. At that point the ZFS snapshot can be destroyed.
The new set of Downstairs should use a separate encryption key.
Images and Clones
If a user has an image that references a snapshot of an Ubuntu root disk that they would like to create a new VM out of, they can create a volume using that snapshot as a read only parent:
```
Volume:     |--------------------------------------------|

SubVolume:  |--------------------------------------------|

Parent:     |-------------------------|
```
The read only parent’s size will be strictly less than or equal to the Volume’s size.
Tokens for Class of Service
A leaky bucket (used as a meter) has been implemented to limit IOPS and bytes per second. Each Upstairs can be configured with optional IOPS and bytes/s limits, and will throttle appropriately.
Rate limiting can exist at several levels: at the downstairs, at the Upstairs, and Propolis itself can control flow.
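For illustration, a leaky bucket used as a meter boils down to something like the sketch below; this is not the implementation Crucible uses, just the shape of the idea:

```rust
use std::time::Instant;

/// A simple leaky-bucket meter: tokens drain as IO is issued and refill
/// at a fixed rate (tokens can be IOPS or bytes, one bucket per limit).
struct LeakyBucket {
    capacity: u64,
    available: u64,
    refill_per_sec: u64,
    last_refill: Instant,
}

impl LeakyBucket {
    fn new(per_second: u64) -> Self {
        LeakyBucket {
            capacity: per_second,
            available: per_second,
            refill_per_sec: per_second,
            last_refill: Instant::now(),
        }
    }

    /// Try to consume `tokens`; returns false if the caller must wait.
    fn try_consume(&mut self, tokens: u64) -> bool {
        // Refill based on elapsed time, capped at the bucket capacity.
        let refill =
            (self.last_refill.elapsed().as_secs_f64() * self.refill_per_sec as f64) as u64;
        if refill > 0 {
            self.available = (self.available + refill).min(self.capacity);
            self.last_refill = Instant::now();
        }
        if self.available >= tokens {
            self.available -= tokens;
            true
        } else {
            false
        }
    }
}
```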
High bandwidth low latency data transfer
We need to select a good high bandwidth, low latency data transfer mechanism in Rust. With at least a 3x multiplier for writes, we should be sure we are efficiently sending data over the wire.
Upstairs ↔ Downstairs authentication
How the Upstairs and Downstairs authenticate each other is important to prevent unauthorized connections and interception.
Using TLS and certificates is the first choice for over-the-wire encryption. As the data itself will be encrypted, this mainly protects the metadata. TLS alone does not perform authentication, though: mutually authenticated TLS using client and server certificates could be selected, but note that this only authenticates the stream between the Upstairs and Downstairs, not the requests themselves. mTLS also has a strong dependency on correct and synchronized clocks.
Authenticating each protocol message also incurs overhead and it’s important to select something lightweight.
Keys and/or certificates will always come from Nexus.
Crucible Crash Consistency
The generation number, flush number, and dirty bit (as described in the Extents section) were designed such that a valid block will always exist: if writes are interrupted at any point in the write process, either the old or the new block will be readable.
Encryption Problem statement
Using AES-GCM-SIV for Crucible’s authenticated encryption means that a nonce and tag for a block has to be stored alongside the encrypted data, and writing these three parts out separately introduces the potential for a crash to cause the on-disk state to be unreadable - if all three of the nonce + tag + data are not atomically written to disk, decryption will fail.
| Before atomic write | After atomic write |
|---------------------|--------------------|
| dirty bit = false   | dirty bit = true   |
| nonce and tag       | nonce’ and tag’    |
| data                | data’              |
The combinations of nonce, tag, and data that are valid are:
nonce + tag + data = old block
nonce’ + tag’ + data’ = new block
Any other combination will not decrypt. If the Guest has not issued a flush, Crucible should guarantee that an Upstairs read after an interrupted Downstairs write will either return to the guest the data from the old block or the new block, and that any write interruption should not cause the data to become unreadable.
If the Guest has issued a flush, and Crucible has returned successfully from that flush, it should never serve the old block.
The above table isn’t the whole picture, because those writes are not atomic. Operating systems do not have to commit data to a disk after every write; the only guarantee is that data will be on a disk after an fsync. In the mode that we’ve configured it to use, SQLite will perform an fsync of the metadata db after every transaction. This leads to a scenario where, between extent fsyncs, writes to the extent files may or may not be on disk:
| fsync                | write                              | write                              | fsync                |
|----------------------|------------------------------------|------------------------------------|----------------------|
| Nonce and tag v1     | Nonce and tag v2                   | Nonce and tag v3                   | Nonce and tag v3     |
| Data v1              | Data v2                            | Data v3                            | Data v3              |
| extent data on disk* | extent data in ram, maybe on disk? | extent data in ram, maybe on disk? | extent data on disk* |
*: unless disk firmware is lying or has a bug or is “consumer” grade or ?
Between fsyncs above, the data on disk could be any combination of v1, v2, or v3. The only guarantee provided by the OS that we can rely on is that returning ok from fsync means data has been durably written to disk.
Interruptions can occur at any point during the write, and at any point before fsync returns ok. Depending on the fsync behaviour of SQLite and the fsync behaviour of the downstairs, different issues may arise. Because SQLite fsyncs after every operation but extents are only fsynced after receiving a flush command from the Upstairs, a crash could cause data to fail decryption due to skew between the encryption context on disk and the data on disk:
|                                      | Stored in RAM: Nonce and tag | Stored in RAM: Extent Data | Stored on Disk: Nonce and tag | Stored on Disk: Extent Data |
|--------------------------------------|------------------------------|----------------------------|-------------------------------|-----------------------------|
| Base state                           | v1                           | v1                         | v1                            | v1                          |
| After write (including SQLite fsync) | v2                           | v2                         | v2                            | v1                          |
| Fsync of extent                      | v2                           | v2                         | v2                            | v2                          |
This sort of error would go unnoticed until there was a power loss because the state in RAM is always ok.
I’m going to use “interrupted write” and “crash consistent” interchangeably here. Also, as before, “encryption context” means nonce and tag for a certain block.
Constraints
We cannot be crash-consistent if we do not write out all of the nonce and tag and data atomically. We currently use extent based storage for the data and SQLite for the nonce and tag. This means we cannot overwrite any of those parts individually, and because we have separate storage for the data and encryption context we cannot overwrite these together atomically.
This means we have to safely append instead of overwriting.
On-disk data has to decrypt successfully at all times, or we lose data.
Rejected solution ideas
Compressed representation
It’s important that when extent blocks are written to disk, they only take up the block size. ZFS uses transaction groups to be crash consistent and we do not want our Crucible blocks to span multiple physical disk blocks.
If it is possible to pack the nonce + tag + data bytes together and have this take up less than or equal to the block size, this can be written to the disk atomically. The problem here is that encrypted data does not compress - encryption defeats the statistical properties that make most (if not all) compression schemes work. This means compression downstairs won’t save bytes.
We also can’t compress upstairs before we encrypt: the guest could be using encrypted storage as well! Even if the guest isn’t using encrypted storage, I also hesitate to request compression before encryption because I’m not a cryptographer, and there are at least two attacks (https://en.wikipedia.org/wiki/CRIME and https://en.wikipedia.org/wiki/BREACH) that have been invented / discovered that make me wary.
Journalling
Given the constraint that we cannot overwrite both the extent storage and the encryption context together atomically, one idea is to write somewhere else, and teach the downstairs to return data from either the old or new location. The problem is that at some point this incurs a cleanup penalty, otherwise the disk will be littered with these files. There’s also the problem that this balloons the number of open files that have to be supported.
Solution: supporting multiple encryption contexts
When performing a write, perform the following steps:
Set the dirty bit if it is not set.
Insert a new encryption context row
Write out the new extent data
Only delete old encryption context rows after extent fsync
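A sketch of how these steps might look in the Downstairs, assuming a hypothetical `block_context` table and using `rusqlite` plus positioned file writes; the real schema and code differ, this only illustrates the ordering:

```rust
use std::fs::File;
use std::os::unix::fs::FileExt;
use rusqlite::{params, Connection};

/// Write one block: insert a new encryption context row, then write the
/// data. Old context rows are kept so that whichever version of the data
/// actually reached the disk can still be decrypted after a crash.
fn write_block(
    db: &Connection,
    extent_file: &File,
    block: u64,
    block_size: u64,
    data: &[u8],
    nonce: &[u8],
    tag: &[u8],
    hash: i64,
) -> rusqlite::Result<()> {
    // (The dirty bit is set elsewhere, before the extent is modified.)

    // Insert a *new* encryption context row for this block.
    db.execute(
        "INSERT INTO block_context (block, nonce, tag, hash) VALUES (?1, ?2, ?3, ?4)",
        params![block as i64, nonce, tag, hash],
    )?;

    // Write the new data into the extent file (not durable until fsync).
    extent_file
        .write_all_at(data, block * block_size)
        .expect("extent write failed");

    Ok(())
}

/// Only after a flush has fsynced the extent file is it safe to delete
/// every context row for a block except the most recent one.
fn trim_contexts(db: &Connection, block: u64) -> rusqlite::Result<()> {
    db.execute(
        "DELETE FROM block_context
         WHERE block = ?1
           AND rowid != (SELECT MAX(rowid) FROM block_context WHERE block = ?1)",
        params![block as i64],
    )?;
    Ok(())
}
```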
The only guarantee presented by the OS is that fsync needs to be called to ensure persistence:
The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk.
Ensuring crash consistency means accounting for extent and SQLite fsync behaviour. Writes to the extent cannot be considered durable until fsync is called and returns successfully. We have to assume that any write to the extent between subsequent fsyncs may have made its way to disk, and from this assumption we have to store all encryption context rows between fsyncs.
For SQLite, we’re using write-ahead logging with the synchronous PRAGMA set to FULL. https://www.sqlite.org/wal.html states:
Writers sync the WAL on every transaction commit if PRAGMA synchronous is set to FULL but omit this sync if PRAGMA synchronous is set to NORMAL.
When synchronous is FULL (2), the SQLite database engine will use the xSync method of the VFS to ensure that all content is safely written to the disk surface prior to continuing.
By default, PRAGMA synchronous seems to be 2, which is FULL. This is probably the “lowest” setting we should consider, given that the documentation states that NORMAL isn’t durable.
In the happy path, reads can return the extent data and the most recent encryption context. These will both come out of RAM. When a crash or power loss occurs, the downstairs will boot back up and have to contend with an unknown file state. The Upstairs will have to receive every encryption context and the data, and figure out which encryption context matches what is on disk.
Reconciliation
This is the process by which the three mirrors that together provide the virtual disk will, on attachment, come to consensus on the data stored on each mirror. At the end of this process all three mirrors will contain the same data. The process is designed such that an interruption at any point will not change the final outcome, and eventual consistency will be reached. Depending on where the interruption happens, the repair may start over, or continue from where it left off.
The Upstairs drives the reconciliation process.
When all three downstairs for a region set have connected, the upstairs collects and compares the generation number, the flush number, and the dirty bit on all three downstairs with each other. If there is a mismatch, then there needs to be a repair. If any region has a dirty bit set, then it needs repair.
When a repair is needed, the Upstairs will select a source extent and one or more destination extents. To determine a source, apply the steps below until only one extent remains, and use that extent as the source (a sketch of this selection follows the list):
Pick the extent(s) with the highest generation number.
Then pick the extent(s) with the highest flush number.
Then pick the extent(s) with the dirty bit set.
If you still have more than 1 extent, then pick one at random.
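A sketch of that selection, operating on the per-extent metadata described in the Extents section (the types here are illustrative, not the crate's own):

```rust
/// Per-extent metadata from one Downstairs (same shape as described in the
/// Extents section).
struct ExtentMeta {
    gen_number: u64,
    flush_number: u64,
    dirty: bool,
}

/// Pick the repair source among the copies of one extent, where `copies`
/// pairs a downstairs index with that downstairs' metadata for the extent.
fn pick_source(copies: &[(usize, ExtentMeta)]) -> usize {
    let mut candidates: Vec<&(usize, ExtentMeta)> = copies.iter().collect();

    // 1. Keep the copies with the highest generation number.
    let max_gen = candidates.iter().map(|(_, m)| m.gen_number).max().unwrap();
    candidates.retain(|(_, m)| m.gen_number == max_gen);

    // 2. Then keep the copies with the highest flush number.
    let max_flush = candidates.iter().map(|(_, m)| m.flush_number).max().unwrap();
    candidates.retain(|(_, m)| m.flush_number == max_flush);

    // 3. Then prefer a copy with the dirty bit set, if any.
    if candidates.iter().any(|(_, m)| m.dirty) {
        candidates.retain(|(_, m)| m.dirty);
    }

    // 4. Any remaining candidate works; take the first (i.e. "pick one").
    candidates[0].0
}
```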
After choosing a source extent, you only need to repair the other extents if their generation or flush is different than the chosen source extent, or if the dirty bit is set on an extent.
If using an extent with a dirty bit set as the source, then that extent must also get the dirty bit cleared during the recovery process.
After compiling a list of extents that need repair, (source, destination(s)), the upstairs then converts this list to a sequence of repair commands. Each command is issued and the result is waited on before continuing to the next command. After an extent has been repaired, a restart of the repair process will not require this extent to be repaired a second time.
Repairing an extent
Repairing an extent is coordinated by the upstairs, but the actual transfer of data happens downstairs to downstairs. Each downstairs runs a repair server and listens for repair requests from other downstairs. A downstairs with an extent that needs repair will be given the address:port of another downstairs that contains a good copy of that extent.
For example, suppose Downstairs_0 has extent 4 that needs repair. The Upstairs will tell Downstairs_0 to get extent 4 from Downstairs_2. Downstairs_0 will then contact Downstairs_2 and request the file list, then each file required to repair extent 4. Once Downstairs_0 has all the files, it will replace the local copy with the new versions.
Stats
Stats should be gathered by the storage layer and made available to the operator.
Stat collection should be considered from various points of view:
* An individual VM
* A sled
* The chassis
Stats to collect:
* R/W/Flush count at upstairs level.
* R/W/Flush count at each downstairs.
* Stats from ZFS file system below the downstairs
* Startup stats
* extents “fixed”?
Determinations
After further testing, we stopped using SQLite in favor of adding the metadata to the end of each extent file. The information stored is still the same.