RFD 238
Trust Quorum and Rack Unlock

Motivation

It’s important to protect sensitive data on an Oxide rack from both casual theft and casual physical attacks inside the data center. Specifically, in scope for this RFD, we want to protect against the following two attacks:

  1. An attacker steals one or more U.2 flash drives from one or more server sleds, and is able to read any sensitive Oxide or customer specific data outside of the Oxide rack within which the drives originated.

  2. An attacker steals a subset of server sleds, smaller than some threshold, K, and is able to use them with or without a rack to read data on any of the drives in those servers, or drives stolen from other servers from the same rack.

On top of preventing the above two attacks, we want to provide confidentiality of sensitive data at rest with a minimum of inconvenience for legitimate users of an Oxide rack. Specifically, we do not want to require passwords (or any sort of human intervention) for decrypting drives when the individual server sleds start up, or when cold booting a rack. This latter point is especially useful in datacenters that are not staffed with operators 24/7, because before the rack is unlocked, there is no network access available to the outside world. Therefore any password would need to be entered locally via a direct attachment to a technician port, making downtime longer than necessary during brief power outages.

To prevent the two attacks described above with minimum friction for rack owners, we utilize cooperation of servers inside a rack to securely reconstruct a shared secret that can be used to unlock the rack. By unlocking a rack, we mean deriving enough keys from the shared secret necessary to decrypt the ZFS datasets containing the control plane database and other operational data residing on the U.2 drives on each server. Keys for other system and user data reside in the database and can be accessed after the database is decrypted and the full control plane is running.

Each sled has only a share, or piece, of the secret. When at least a threshold of K shares are retrieved by an individual sled from other sleds, the secret can be reconstructed and the ZFS datasets decrypted. By ensuring that each sled can individually retrieve enough key shares to reach the threshold and decrypt its local storage, we guarantee an unlock of the entire rack.

Importantly, the implementation of the trust quorum mechanism also provides for runtime gating of control plane software and underlay network access via remote attestation. In order for a sled to unlock its local storage it must establish a [sprockets] connection to at least K-1 other sleds and retrieve key shares from those sleds over those sprockets connections. A sprockets connection is only established if a mutual TLS connection can be established utilizing PKI baked into the RoT of a sled that ties back to a root manufacturing key, and a remote attestation is performed that verifies expected software is running on the remote machine. If certificate validation or remote attestation fails at either end of the connection handshake then shares will not be exchanged. If a sled cannot establish enough connections to retrieve shares and recompute the rack secret it will fail to decrypt its local storage. This prevents any control plane zones from launching on that sled and prevents the sled from configuring and accessing the underlay network where control plane services interact.

The remainder of this RFD describes how we manage to create and utilize the trust quorum in a secure fashion.

Overview

Creating a Shared Secret

We utilize Shamir Secret Sharing over a finite field, GF(256), with the same irreducible polynomial chosen by AES: x^8 + x^4 + x^3 + x + 1. We use GF(256) for its efficiency and constant time implementation, as well as the fact that randomly generated values are uniformly distributed. This uniform distribution matches the distribution we desire for the encryption keys we are deriving from this shared secret.

Note
Our low-rent trust quorum (lrtq, v0) implementation used a more complicated construction: Curve25519 wrapped with Ristretto to form a prime-order group suitable for secret sharing. There isn't really a good reason for the extra complexity. The choice was made primarily because we use Curve25519 elsewhere in our system and thought it would make sense to stick with it for secret sharing as well.

While the underlying implementation of secret sharing may be somewhat complex, using such a scheme is easy. We split a secret into N total shares and distribute a unique share to each participant in the secret sharing scheme. A threshold of K shares is needed to reconstruct the secret. We call the entity that creates and splits the shared secret the dealer, and the owner of a share a player or shareholder. Retrieval and combination of shares are performed by a player.

A dealer creates a secret and splits it into N=3 total shares.

A player combines K=2 shares to recreate the secret.
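The dealer/player mechanics above can be sketched directly over GF(256). This is an illustrative toy, not the production implementation: the polynomial coefficient is hard-coded for reproducibility where a real dealer must draw coefficients from a CSPRNG, the bitwise field multiplication shown here is not the constant-time construction the real code requires, and a multi-byte secret is handled by applying the same scheme independently to each byte.

```rust
// Toy Shamir split/combine over GF(256) with the AES polynomial
// x^8 + x^4 + x^3 + x + 1 (0x11B). Addition and subtraction in this
// field are both XOR.

/// Multiply in GF(256), reducing by the AES polynomial.
fn gf_mul(mut a: u8, mut b: u8) -> u8 {
    let mut p = 0u8;
    for _ in 0..8 {
        if b & 1 != 0 {
            p ^= a;
        }
        let carry = a & 0x80;
        a <<= 1;
        if carry != 0 {
            a ^= 0x1b; // reduce modulo x^8 + x^4 + x^3 + x + 1
        }
        b >>= 1;
    }
    p
}

/// Invert via Fermat's little theorem: a^(2^8 - 2) = a^254.
fn gf_inv(a: u8) -> u8 {
    assert!(a != 0);
    let (mut result, mut base, mut e) = (1u8, a, 254u8);
    while e > 0 {
        if e & 1 != 0 {
            result = gf_mul(result, base);
        }
        base = gf_mul(base, base);
        e >>= 1;
    }
    result
}

/// Split one secret byte into n shares. Share i is (x, P(x)) where
/// P(0) = secret and deg(P) = K - 1 (coeffs holds the K - 1 non-constant
/// coefficients, which must be random in a real dealer).
fn split(secret: u8, coeffs: &[u8], n: u8) -> Vec<(u8, u8)> {
    (1..=n)
        .map(|x| {
            // Horner evaluation of secret + c1*x + ... + c_{K-1}*x^{K-1}.
            let mut y = 0u8;
            for &c in coeffs.iter().rev() {
                y = gf_mul(y, x) ^ c;
            }
            (x, gf_mul(y, x) ^ secret)
        })
        .collect()
}

/// Lagrange interpolation at x = 0 over any K distinct shares.
fn combine(shares: &[(u8, u8)]) -> u8 {
    let mut secret = 0u8;
    for (i, &(xi, yi)) in shares.iter().enumerate() {
        let mut li = 1u8;
        for (j, &(xj, _)) in shares.iter().enumerate() {
            if i != j {
                li = gf_mul(li, gf_mul(xj, gf_inv(xi ^ xj)));
            }
        }
        secret ^= gf_mul(li, yi);
    }
    secret
}

fn main() {
    let secret = 0x42;
    // K = 2, so one non-constant coefficient (fixed here for the demo).
    let shares = split(secret, &[0x5a], 3);
    // Any 2 of the 3 shares recover the secret.
    assert_eq!(combine(&shares[0..2]), secret);
    assert_eq!(combine(&[shares[0], shares[2]]), secret);
    assert_eq!(combine(&shares[1..3]), secret);
}
```

Any K = 2 of the N = 3 shares reconstruct the secret, while a single share reveals nothing: every candidate secret byte is equally consistent with it, which is why the uniform distribution of GF(256) elements matters for key derivation.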

Key Share Distribution and Retrieval Requirements

Assume we have a dealer that can generate a shared secret, and split that secret into shares. The dealer needs to then know:

  1. Which players to distribute the shares to (identity)

  2. How to contact those players (discovery)

  3. How to trust those players are who they say they are (authentication)

  4. How to ensure those players are running the proper software (attestation)

  5. How to securely distribute the shares to the players (integrity and confidentiality)

Now assume we have a player who has received a single share and knows how to reconstruct the secret given some other shares. In addition to having the share, the player also needs to know:

  1. Which players to retrieve shares from (identity)

  2. How to contact those players (discovery)

  3. How to trust those players are who they say they are (authentication)

  4. How to ensure those players are running the proper software (attestation)

  5. How to securely retrieve the shares from the players (integrity and confidentiality)

  6. How to securely send a share to a trusted player that requests it (integrity and confidentiality)

For our solution to work, we need a number of system properties to be supported via specific mechanisms. The mapping below pairs each property with the mechanism that we have chosen.

  • Identity: RoT PlatformId and Trust Quorum (TLS Authentication) public key certificates

  • Discovery: Routing table lookup with well known boot prefix

  • Authentication: sprockets

  • Remote Attestation (Measurement + Appraisal): sprockets

  • Integrity: sprockets

  • Confidentiality: sprockets

  • Transport: sprockets over TCP

Sprockets

sprockets is a protocol for establishing secure communication sessions between two endpoints. As indicated in the prior section, sprockets provides many of the security guarantees of the trust quorum design. At its most basic, sprockets is a transport channel based on mutual TLS 1.3, over which attestation evidence (signed measurements of RoT, SP, and other software images) is sent and verified before the channel becomes usable by applications. Authentication for the TLS 1.3 connection is provided by our trust quorum certificates, which are generated at RoT boot time and certified by the PlatformId certificates that contain the device's unique identity. [rfd36] and [rfd303] contain more details about certificate chains baked into the RoT, while [rfd497] talks about application level attestation and sprockets.
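The phases a session moves through can be summarized as a small state machine. This is a hypothetical sketch of the session lifecycle described above, not the actual sprockets API; the names are illustrative.

```rust
// Hypothetical phases of a sprockets session, per the description above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SessionState {
    /// TCP connected; mutual TLS 1.3 handshake in progress, authenticated
    /// by the per-boot trust quorum certificates.
    TlsHandshake,
    /// TLS established; attestation evidence (signed measurements of the
    /// RoT, SP, and other software images) is exchanged and appraised.
    Attestation,
    /// Both peers appraised successfully; application data may flow.
    Established,
    /// Certificate validation or appraisal failed; no shares are exchanged.
    Rejected,
}

/// Advance the session by one phase; any failed step rejects the session.
fn advance(state: SessionState, step_ok: bool) -> SessionState {
    use SessionState::*;
    match (state, step_ok) {
        (TlsHandshake, true) => Attestation,
        (Attestation, true) => Established,
        (Established, _) => Established,
        (Rejected, _) => Rejected,
        (_, false) => Rejected,
    }
}

fn main() {
    let mut s = SessionState::TlsHandshake;
    s = advance(s, true); // cert chain ties back to the Oxide root
    s = advance(s, true); // measurements match expected software
    assert_eq!(s, SessionState::Established);
    // A failed appraisal at either end rejects the session.
    assert_eq!(advance(SessionState::Attestation, false), SessionState::Rejected);
}
```

The key property is that there is no path to `Established` that skips appraisal: shares are only ever exchanged over a channel that is both authenticated and attested.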

A Brief Note About Sled Identity

The full details about the keys on the RoT are available in [rfd36] and [rfd303]. This section is only a brief summary useful for understanding the keys required for trust quorum usage.

A sled has a unique identity based on a Unique Device Secret (UDS) stored in the RoT. The UDS is derived from a Physically Unclonable Function (PUF), which itself is a manifestation of natural randomness introduced in the silicon manufacturing process of the RoT. Via further derivation steps, an asymmetric key pair, known as the PlatformId, is created during manufacturing. The private key never leaves the RoT for security reasons. A certificate signing request (CSR) containing the corresponding public key is exported and signed by our intermediate manufacturing key provided by the online signing service [rfd343], and then injected back into the RoT. We can assume this PlatformId to be valid for the lifetime of the device, and trust that its public key certificate was signed only by Oxide. The PlatformId public key certificate contains in its subject common name the part and serial number used to uniquely identify a server sled. This identity information is exposed to the consumer upon establishment of a sprockets channel so that the remote endpoint can be identified by the trust quorum software.

The PlatformId private key is also used for certifying a Trust Quorum keypair that is created on each boot of the RoT. This keypair is used for TLS authentication in the TLS 1.3 portion of the sprockets session establishment. The private key never leaves the RoT; the TLS software offloads signing operations from the host to the RoT, first over the IPCC channel ([rfd316]) to the SP, and then from the SP over the SPI link to the RoT. Since the PlatformId is unique to the RoT and it certifies a unique trust quorum keypair, the trust quorum keypair is tied to the platform identity, such that the trust quorum key can be said to be authenticating a TLS connection on behalf of the given platform identity. Therefore, when a TLS session is established as part of sprockets, the part and serial number in the PlatformId unambiguously and securely identify the remote endpoint of the connection.

There are further keys that are used to sign measurements and complete establishment of a sprockets session, but these are not relevant to trust quorum and will not be discussed further. See [rfd36] and [rfd303] for details.

Rack Initialization

As described in [rfd57], when a customer gets a rack, it is essentially a blank slate. The customer must configure the rack by connecting a laptop to one of the technician ports on either switch and then sshing into our rack setup TUI, [wicket], running on the sled directly attached to the chosen switch via PCIe. [wicket] uses a service named wicketd to communicate with the Management Gateway Service (MGS) to retrieve information about sled position and identity. MGS polls the Service Processors (SP) for this information ([rfd210]). Importantly, for the context of this RFD, each SP will publish the BaseboardId ([rfd433]) of its sled, which can be used to uniquely identify it and is contained in the Subject CN field of the PlatformId public key certificate used to secure sprockets channels. In essence, the PlatformId certificate can be treated as securely identifying the BaseboardId. PlatformId and BaseboardId can therefore be used interchangeably in most discussions below.

At some point during rack initialization, the customer will view the sled information as presented by [wicket] and trigger the creation of the Rack Secret. Wicket, via wicketd, will instruct the bootstrap-agent running on its sled to generate the rack secret, split it into N shares, where N is the number of sleds in the rack, and distribute to every other bootstrap agent, over sprockets sessions, an individual share, the set of `BaseboardId`s representing the membership group of the rack, and some metadata.

The diagram in [rfd210] shows the network communication topology of the components discussed here. Below is a sequence diagram representing the order of operations among the entities.

In order to maintain a healthy rack during setup, we require that, for trust quorum initialization to succeed, all sleds have been able to recreate the rack secret and properly persist their shares, group info, and metadata to their local M.2 devices.

If for any reason rack secret generation fails, the system should be reset to its factory state and initialization should be tried again, perhaps leaving out of the trust quorum group any sleds having trouble. Currently, reset to factory state is performed via a manual clean slate operation.

Trust Quorum Group Membership

As described in [_rack_initialization], a set of unique sleds identified by their PlatformIds forms the trust quorum group membership of a rack. While the [sprockets] protocol allows any sled to verify that any other sled in a rack is authentic Oxide hardware running trusted software, it does not confer the ability for any sled inserted into a rack to retrieve key shares and recreate the rack secret. A trust quorum member will only distribute its key share to other trust quorum nodes if their PlatformId (as reported by the established sprockets session) is part of the group membership.

The rationale for this restriction is to prevent an attack where a modified Oxide sled is plugged into a rack solely to retrieve the rack secret. This capability would then allow the attacker to steal a subset of sleds or drives smaller than the trust quorum threshold and read the compromised data at their leisure, violating the intent of the trust quorum as a whole. We aim to prevent such casual physical attacks by relying on group membership, which makes hardware modification of sleds require extended physical access. An attacker would have to steal a sled already part of the trust quorum and actively being used in the rack, modify it, and then insert it back into the rack, or have the administrative permissions to add a sled to the trust quorum. This is a higher bar than being able to insert any modified sled into a rack.

As a consequence of trust quorum group membership, every sled added to a rack must be explicitly admitted to the cluster via the reconfiguration protocol described below. Furthermore, sleds must be explicitly removed from the cluster using the reconfiguration protocol. Decommissioning of a sled should completely wipe the M.2 storage devices before they are placed in a new sled or before the sled is placed in a new rack. Any sled that is currently part of one trust quorum group/cluster, as reflected in its membership group (with a unique ID) stored on its M.2 devices, will not be allowed to be added to an existing cluster. The section [_decommissioning_a_sled] expands upon this description.

Aside - Why Do We Rely on the Operator to Trigger Rack Secret Creation?

In order to split a shared secret, the entity doing the splitting (Bootstrap Agent connected to the Rack Switch in our case), must know how many shares, N, to split the secret into. There is no hard timing constraint upon when MGS will be guaranteed to have heard from all SPs in the rack. Therefore, there must be some arbiter that makes a strongly consistent, point-in-time decision, about which sleds are part of the rack, without the ability to just 'wait for all the sleds'. Since the Operator is already performing a sanity check of the inventory, it seems reasonable for them to also trigger the creation of the rack secret when the inventory that shows up in wicket is complete. Importantly, if for some reason there is a failure of one sled to report itself, a human is capable of recognizing that and taking corrective action. That human may also decide to continue with rack secret generation anyway and solve the problem later, or halt the initialization and request support ASAP.

Originally, there was an alternative idea to detect sled presence automatically using the Ignition subsystem described in [rfd142]. Ignition provides presence information to the sidecars, and is capable of determining which slots in the rack are filled by sleds. However, Ignition does not indicate the BaseboardId of any sled or provide any more advanced query mechanism. Furthermore, in most cases, there will be two Sidecars capable of simultaneously providing possibly different presence information, and preventing consensus on the overall group. Due to the determinism and consistency constraints, relying on presence information for trust quorum group membership was ruled out.

Choosing the Secret Reconstruction Threshold (K)

When splitting a secret, the number of shares required to reconstruct the secret, dubbed the Threshold (K), must be chosen in addition to the total number of shares the secret is split into, N.

There are two fundamental tradeoffs to make in choosing the threshold size, i.e. the number of key shares needed to reconstruct the secret that unlocks rack level information.

  1. Security - A larger threshold means that an attacker would have to steal more sleds to recover the rack level secret and unlock information on any stolen drives.

  2. Resilience - A smaller threshold tolerates more failures, as fewer sleds are required to unlock the rack.

The number of sleds in each customer rack may be different, and the desired tradeoffs around security and resilience may also be different for each customer. While it may be worthwhile to allow the customer to choose the number of sleds needed for the rack level threshold secret, it may be overly burdensome, as this type of key is not widely understood. We therefore recommend choosing a threshold size of K = N/2 + 1 for version 1 of the trust quorum.
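The v1 recommendation above is simple integer arithmetic; a minimal sketch (the function name is illustrative):

```rust
/// Recommended v1 threshold: a strict majority of the N sleds in the rack.
fn threshold(n: usize) -> usize {
    n / 2 + 1
}

fn main() {
    assert_eq!(threshold(32), 17); // fully populated rack
    assert_eq!(threshold(16), 9);  // half-populated rack
    assert_eq!(threshold(3), 2);   // minimal configuration
}
```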

Important

The chosen threshold, K, does not need to be a majority of the number of sleds in the rack, N. K should in most cases be chosen strictly with regard to the tradeoff between the properties of Security and Resilience described above. [_rack_membership_change_and_key_rotation] will explain in detail why there is no need for a constraint like K >= N/2 + 1.

There is, however, one corner case to protect against, in which it is beneficial to always maintain K as a majority of bays in the rack. If a number of foreign sleds are inserted into a rack, we can have 2 clusters independently unlock in the same rack! We can detect and log this, and it is a major error of the operator, but it’s not always possible to prevent. For instance, we can’t make K >= Number of bays in the rack/2 + 1 if a rack is only half populated! It’s also very difficult to maintain such an invariant as a rack grows or shrinks in population. For these reasons, we are leaning towards diagnosis and treating this as a disaster scenario. It should not be incredibly dangerous, as each subcluster in the same rack should not be able to interoperate due to different CAs and keys used for services. Our goal should be to prevent such issues as much as possible and allow easy remediation.

Warning
If there are not enough (K) Gimlets (key shares) to reconstruct the rack level secret, all information encrypted by that secret will be lost.

Rack Unlock

On cold boot of a rack, we have a forest of trust: each sled’s RoT is independent and provides a trust chain for SP, HBS, and Helios. That trust needs to be extended across the rack to recreate the shared secret at each sled and unlock the rack.

When an individual sled on an initialized rack boots, it must recover K-1 shares (it already has its own) from other sleds in order to derive a wrapper key to decrypt its disk encryption keys and unlock its disks. Each individual sled discovers its peers by looking in its local routing table, as all peers have unique /64 routes with a specific fdb0::/64 prefix. This is described fully in section 6.3.1 of [rfd63]. In order to know which sleds to trust, the membership set of PlatformIds stored on the local M.2 devices is utilized.

The Sled Agent (see [rfd61]), which contains the bootstrap agent, which in turn contains the trust quorum node software, is the first control plane process to run on Helios on each sled. It is the component responsible for coordinating across sleds to unlock the rack. Attestation data is exchanged among sled agents via sprockets over TCP and validated, allowing trust to be extended between each pair of server sleds. Secret shares are then exchanged pairwise over secure sprockets channels. When a sled agent has a threshold of shares, it reconstructs the rack secret, derives encryption keys from that rack secret, and unlocks its local storage. At this point the rest of the control plane software can boot, and the rack secret is securely erased from memory.
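The gather-until-threshold loop at the heart of unlock can be sketched as follows. Names and types here are hypothetical; shares are abstracted to bytes, and each peer fetch stands in for a share retrieval over a verified sprockets channel that may fail.

```rust
/// Starting from its own share, a sled gathers peer shares until it
/// holds K of them. `peers` yields Some(share) for a successful fetch
/// over a verified sprockets channel and None for an unreachable or
/// rejected peer. Returns None if the threshold cannot be met.
fn unlock(own_share: u8, k: usize, peers: impl Iterator<Item = Option<u8>>) -> Option<Vec<u8>> {
    let mut shares = vec![own_share];
    for fetched in peers {
        if shares.len() >= k {
            break; // threshold reached; stop contacting peers
        }
        if let Some(s) = fetched {
            shares.push(s);
        }
    }
    // The caller combines the shares, derives keys, and unlocks storage.
    (shares.len() >= k).then_some(shares)
}

fn main() {
    // K = 3: own share plus two successful peer fetches suffice, even
    // with one unreachable peer in between.
    assert!(unlock(1, 3, vec![Some(2), None, Some(3)].into_iter()).is_some());
    // Too few reachable peers: the sled stays locked.
    assert!(unlock(1, 3, vec![None, Some(2)].into_iter()).is_none());
}
```

A sled that stays locked launches no control plane zones and never joins the underlay network, which is exactly the runtime gating described in the Motivation.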

The diagram below shows the major components involved in forming a trust quorum and unlocking the rack.

        ┌─────────────────────────────────────────┐
        │Sled                                     │
        │     ┌─────┐           ┌────┐            │
        │     │ RoT │◀──SPI────▶│ SP │◀───┐       │
        │     └─────┘           └────┘    │       │
        │                                IPCC     │
        │                                 │       │
        │ M.2                             ▼       │
        │┌──────────────┐            ╔═════════╗  │     sprockets      ╔═════════╗
        ││ Share +      │            ║  Sled   ║  │                    ║  Sled   ║
        ││ TQ config    │            ║  Agent  ◀──┼───────┬───────────▶║  Agent  ║
        │└──────────────┘            ╚═════════╝  │       │            ╚═════════╝
        │                                         │       │
        │                U.2 Disks                │       │
        ├─────────────────────────────────────────┤       │
        │DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD│       │            ╔═════════╗
        │DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD│       │            ║  Sled   ║
        └─────────────────────────────────────────┘       ├───────────▶║  Agent  ║
                                                          │            ╚═════════╝
                                                          │
                                                          │
                                                          │            ╔═════════╗
                                                          │            ║  Sled   ║
                                                          └───────────▶║  Agent  ║
                                                                       ╚═════════╝

Trust Quorum Reconfiguration

In order to add or remove sleds from an existing rack, we need to be able to change the group membership of the rack cluster. This implies changing the rack secret and the corresponding share distribution such that newly added sleds can recompute the new rack secret, removed sleds cannot recompute the new rack secret, and sleds that are members of the old and new group can rotate their disk encryption keys to ones derived from the new rack secret. Reconfigurations must be able to complete under conditions of partial failure, which makes understanding, designing, and implementing reconfiguration protocols quite difficult.

To aid in understanding, we will attempt to give concrete examples when possible. We will first describe precisely what we mean by a configuration and what a transition from one configuration to another looks like once it is complete. We will then briefly describe our system model which defines what types of network behavior and failures our protocol can tolerate. We will then identify the precise safety properties that must be maintained as invariants to ensure that our reconfiguration protocol is correct. We then proceed to define our reconfiguration protocol, giving intuition as to why it satisfies our safety properties and maintains liveness while tolerating a specific enumeration of failures within our system model.

Note
Trust quorum reconfiguration lives outside of Reconfigurator as described in [rfd459]. Trust quorum reconfiguration is likely to replace the current add and expunge sled operations provided by omdb, and as such will update the PlanningInput to reconfigurator.

What is a Configuration?

A configuration consists of the following:

  • A unique software RackId fixed for the life of a particular rack and stored in CRDB.

  • A group of members of size N equivalent to the total number of shares for the rack secret. Each member has a unique PlatformId that serves as its identifier.

  • A strictly increasing monotonic sequence number, dubbed the Epoch

  • A threshold, K, needed to recreate the rack secret

  • A set of encrypted metadata, including prior rack secrets, to allow key rotations

Each member also has a unique share per epoch such that when K of them are combined, the rack secret can be reconstructed. Various keys can be derived from the shared secret, such as the disk encryption keys (ZFS wrapper keys) used to secure control plane data on local U.2 disks.
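The fields enumerated above can be sketched as a data type. Field names and types here are hypothetical illustrations, not the implementation's:

```rust
use std::collections::BTreeMap;

/// (part number, serial number) pair from the PlatformId certificate.
type PlatformId = (&'static str, &'static str);

struct Configuration {
    /// Fixed for the life of the rack; also stored in CRDB.
    rack_id: u128,
    /// Strictly increasing; consumed even by cancelled reconfigurations.
    epoch: u64,
    /// K: the number of shares needed to recreate the rack secret.
    threshold: u8,
    /// One unique share per member; N = members.len() total shares.
    members: BTreeMap<PlatformId, [u8; 32]>,
    /// Prior rack secrets, encrypted under a key derived from the
    /// current rack secret, enabling key rotation for lagging members.
    encrypted_metadata: Vec<u8>,
}

fn example_config() -> Configuration {
    let mut members = BTreeMap::new();
    for serial in ["serial-a", "serial-b", "serial-c"] {
        members.insert(("part-x", serial), [0u8; 32]);
    }
    Configuration {
        rack_id: 0x1234,
        epoch: 1,
        threshold: 2,
        members,
        encrypted_metadata: Vec::new(),
    }
}

fn main() {
    let cfg = example_config();
    // The threshold can never exceed the number of shares dealt.
    assert!(usize::from(cfg.threshold) <= cfg.members.len());
    println!("epoch {} with {} members", cfg.epoch, cfg.members.len());
}
```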

The problem of reconfiguration is to ensure a valid transition from the old group at epoch e to the new group at epoch e'.

A transformation of data similar to the following occurs at each player after a reconfiguration completes.

Differences in configuration at member a before and after a reconfiguration
Field                                        Old                  New

PlatformId                                   (part-a, serial-a)   (part-a, serial-a)
Epoch                                        1                    3
RackId                                       1234                 1234
Group Membership                             {a, b, c}            {a, b, d, e, f}
Threshold (K)                                2                    3
Share                                        share_1_for_a        share_3_for_a
Encrypted Disk Encryption Key for one disk   12xyz                tu345
Encrypted Metadata                           blob_1               blob_3

A node is a participating entity in the reconfiguration protocol and runs on a single sled.

We say that a node that has crashed or does not respond to messages is faulty. A node that responds to messages is non-faulty or healthy.

We say that a reconfiguration is complete, or has committed, when all non-faulty members of the new group are capable of unlocking a rack by using only the new configuration and shares. We say that a reconfiguration is pending when it has begun and is not yet complete. While a reconfiguration is pending, it may be cancelled. Cancellation consumes the epoch number of the cancelled attempt, meaning that the next reconfiguration will use the next epoch number in sequence.

Our goal is to ensure that reconfiguration is both safe and live. By safe we mean that nothing bad happens, and by live we mean that something good eventually happens.

System Model

Our system operates in a Partial Synchrony network model. Sleds may crash and restart. Members exhibit non-byzantine behavior. Specifically, all entities involved honestly participate in the protocol, and they do not equivocate such that they renege on a commitment or respond with two different answers to the same question.

Safety Properties

The protocol maintains the following safety properties at all times.

  • S1: New Secret Confidentiality - Members of the old group that are not also members of the new group must not be able to retrieve key shares for the new rack secret.

  • S2: Commit Completeness - If any member of the new group completes a reconfiguration, all non-faulty members of the new group eventually complete the reconfiguration, as long as it remains the current configuration and at least K sleds in the new configuration remain non-faulty.

  • S3: Key Rotation Guarantee - Any member of the old and new group should be able to rotate their encryption keys. In other words, they must be able to reconstruct the rack secrets for epoch e and e' once they have seen a Commit message for epoch e'.

An additional safety property that will be maintained is conditional on the number of sleds removed in a single reconfiguration. If fewer than K sleds are removed from the old group at epoch e, then the removed sleds will never be able to reconstruct the rack secret for epoch e. In almost all practical deployments this will be true, but we can guarantee it by wiping the M.2 devices prior to physical removal for any removed sleds that are healthy. If we want to re-use the sleds in another rack, we have to perform this wipe.

Liveness Properties

  • L1: Commit Availability - Nexus can commit a configuration after K + Z sleds have seen the Prepare for the new configuration, where K is the trust quorum threshold and Z is an availability parameter that allows up to Z sleds to become unhealthy simultaneously while still allowing the commit to eventually complete across all healthy sleds in the new group.

  • L2: Unlock Availability - Any node that is a member of both the old and new group can unlock its storage when there are K-1 members of the old or new group available. Any node that is a member of the new group can unlock its storage when K-1 members of the new group are available, regardless of whether those nodes have seen the commit message yet.

A short explanation about L2 is in order. The transition from the old to the new group is not atomic nor instantaneous. Even after a reconfiguration is committed, it is possible that both the old configuration and the new configuration are used to unlock different subsets of nodes because enough nodes that are members of both the old and new group have not yet seen the Commit message for the new group. This is OK, and actually desirable, as it ensures we never end up in an unrecoverable state if there is a rack power off event during Commit message propagation.
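The commit rule in L1 reduces to a simple inequality. A minimal statement of it (illustrative names, not the Nexus implementation):

```rust
/// Nexus may mark a configuration committed once K + Z members of the
/// new group have persisted their Prepare: K to make the rack secret
/// reconstructable at all, plus Z to tolerate that many simultaneous
/// failures afterwards.
fn can_commit(prepare_acks: usize, k: usize, z: usize) -> bool {
    prepare_acks >= k + z
}

fn main() {
    // Example: N = 32 sleds, K = 17, Z = 4, so commit needs 21 acks.
    assert!(!can_commit(20, 17, 4));
    assert!(can_commit(21, 17, 4));
}
```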

Protocol

Note
The reconfiguration protocol is formally specified in TLA+. This prose description should match the TLA+ as closely as possible; where they differ, the TLA+ specification takes precedence, and the implementation is guided by it.

Reconfigurations may only start while the control plane is up. An operator must explicitly request a reconfiguration and the resulting request is stored in CockroachDB to provide strong consistency guarantees. A background task in Nexus is responsible for driving the reconfiguration forward by communicating with sled-agents to trigger protocol operations by the trust quorum nodes running inside the sled-agents. Data is persisted to CockroachDB and the M.2 devices on each sled, and communication between trust quorum nodes allows the protocol to make progress in the case of partial failure or reboot of the entire rack.

Important
All communication between Nexus and Sled-Agent transits the underlay network via HTTP requests. All communication between trust quorum nodes goes over sprockets channels on the bootstrap network.

Prepare Phase

The following diagram shows a high level picture of the prepare phase of the protocol, as described below.

An operator sends a Reconfiguration request containing the rack id, new group membership, and epoch to Nexus, which attempts to persist it in CRDB. The epoch is used for mutual exclusion and must be exactly one greater than the highest epoch currently seen by the database. If there is a pending reconfiguration the request will fail.

A background task in Nexus takes the current reconfiguration and tries to "prepare" it. The goal of a Prepare is to ensure that K + Z nodes in the new group have learned about the new configuration. This is enough nodes to ensure that the rack can commit the new configuration with up to Z simultaneous node failures.

Nexus chooses any sled-agent in the current configuration as a coordinator and sends it a ReconfigureMsg on the underlay network containing the rack id, group membership, and epoch for the new configuration. The coordinator sled-agent hands this request to its local in-process trust quorum node which then performs the following operations to create a new configuration:

  1. Gathers enough shares to reconstruct the rack secret for the current configuration at epoch e.

  2. Generates a new rack secret for the new configuration at epoch e'.

  3. Splits the rack secret for epoch e' into N shares, one for each member of the new group, and associates each share with a PlatformId.

  4. Decrypts metadata for epoch e, using a key derived from the rack secret for epoch e. This metadata includes all prior rack secrets that may be required for key rotation of current members. Some members may be offline and not prepared or committed for epoch e. They need to be able to rotate keys derived from their known rack secret at some prior epoch to keys derived from the rack secret for the latest epoch, e'.

  5. Derives an encryption key from the rack secret for epoch e' used to encrypt metadata for e'.

  6. Adds the rack secret for epoch e to the decrypted metadata from epoch e, creating new metadata for e'.

  7. Encrypts the new metadata with a key derived from the rack secret for the new epoch, e'.
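Steps 4 through 7 chain metadata forward across epochs. The sketch below models that chain as plain data, with all encryption and key derivation elided; `Metadata` and `next_metadata` are hypothetical names, and in the real protocol the metadata is encrypted under a key derived from the rack secret for its epoch:

```rust
use std::collections::BTreeMap;

// Illustrative sketch only: models steps 4-7 as plain data, with the
// encryption/decryption elided. In the real protocol this structure is
// encrypted under a key derived from the rack secret for its epoch.

type Epoch = u64;
type RackSecret = [u8; 32];

/// Decrypted metadata for an epoch: every prior rack secret a stale
/// member might still need in order to rotate its disk keys forward.
#[derive(Clone, Default)]
struct Metadata {
    prior_secrets: BTreeMap<Epoch, RackSecret>,
}

/// Build metadata for the new epoch e' from epoch e's metadata: the old
/// metadata plus the old rack secret itself (step 6).
fn next_metadata(old: &Metadata, e: Epoch, old_secret: RackSecret) -> Metadata {
    let mut m = old.clone();
    m.prior_secrets.insert(e, old_secret);
    m
}

fn main() {
    let secret_e1: RackSecret = [1; 32];
    let secret_e2: RackSecret = [2; 32];

    // Reconfigure from epoch 1 to epoch 2, then from epoch 2 to epoch 4
    // (epoch 3 was consumed by a cancellation).
    let m2 = next_metadata(&Metadata::default(), 1, secret_e1);
    let m4 = next_metadata(&m2, 2, secret_e2);

    // A member stuck at epoch 1 or 2 can still find the secret it knows.
    assert!(m4.prior_secrets.contains_key(&1));
    assert!(m4.prior_secrets.contains_key(&2));
    println!("ok");
}
```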

At this point the coordinator has enough information to create a Prepare message for each node in the new group, including itself. The coordinator creates and persists the Prepare message for itself in a ledger on the M.2 devices, and then continuously sends each new member a Prepare message containing its unique key share until all members have acked the message or a Commit or Cancel message has been delivered from Nexus.


Each individual trust quorum node, upon seeing the Prepare message, persists it and then acks back to the coordinator if the new epoch, e', is the largest epoch that has been seen so far.
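The acceptance rule for a Prepare can be sketched as a small state check (illustrative names; re-acking a duplicate Prepare at the current epoch is harmless, since the coordinator retries until acked):

```rust
/// Sketch of the acceptance rule for a Prepare at a trust quorum node:
/// persist and ack only if the incoming epoch is the highest seen so far.
struct Node {
    highest_seen: u64,
}

impl Node {
    /// Returns true if the Prepare was persisted and should be acked.
    fn handle_prepare(&mut self, epoch: u64) -> bool {
        if epoch >= self.highest_seen {
            // Real implementation: write the Prepare (with its key share)
            // to the ledger on the M.2 devices before acking.
            self.highest_seen = epoch;
            true
        } else {
            // A stale Prepare from an older epoch is ignored.
            false
        }
    }
}

fn main() {
    let mut node = Node { highest_seen: 0 };
    assert!(node.handle_prepare(2)); // first Prepare for epoch 2: ack
    assert!(node.handle_prepare(2)); // duplicate: re-ack, coordinator retries
    assert!(!node.handle_prepare(1)); // stale epoch: ignore
    println!("ok");
}
```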

Nexus has a background task that continuously polls the coordinator to see if at least K+Z replicas have received and persisted the Prepare message for epoch e'. Nexus persists to CRDB which sleds have acked, so we can tell which nodes are having trouble via observation of the DB if necessary. Once at least K+Z nodes have acked, Nexus marks the new configuration Committed in CRDB, and moves on to the commit phase of the protocol.

If the coordinator has died or become unresponsive for some timeout period before Nexus has seen acknowledgements from K+Z replicas, Nexus marks the new configuration Cancelled, as it is possible that not enough unencrypted key shares have been distributed to their nodes for the rack secret to ever be reconstructed. In this case, the epoch e' is consumed and any new reconfiguration attempts will start at epoch e'+1, still treating the configuration at epoch e as current. In other words, e' is not necessarily equal to e + 1, as epochs can be skipped.
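The epoch bookkeeping implied here can be sketched as follows; the `EpochLog` type and its methods are hypothetical, standing in for the CRDB table of reconfiguration attempts:

```rust
// Sketch of Nexus-side epoch bookkeeping: a cancelled epoch is consumed,
// so the next reconfiguration attempt starts one past the highest epoch
// ever used, while the last committed epoch remains current.

#[derive(PartialEq)]
enum Outcome {
    Committed,
    Cancelled,
}

struct EpochLog {
    // (epoch, outcome) in ascending epoch order
    entries: Vec<(u64, Outcome)>,
}

impl EpochLog {
    fn highest(&self) -> u64 {
        self.entries.iter().map(|(e, _)| *e).max().unwrap_or(0)
    }

    /// Epoch the next reconfiguration attempt must use.
    fn next_epoch(&self) -> u64 {
        self.highest() + 1
    }

    /// The configuration that is still in effect.
    fn current_epoch(&self) -> Option<u64> {
        self.entries
            .iter()
            .rev()
            .find(|(_, o)| *o == Outcome::Committed)
            .map(|(e, _)| *e)
    }
}

fn main() {
    let log = EpochLog {
        entries: vec![(1, Outcome::Committed), (2, Outcome::Cancelled)],
    };
    // Epoch 2 was cancelled: it is consumed, and epoch 1 stays current.
    assert_eq!(log.next_epoch(), 3);
    assert_eq!(log.current_epoch(), Some(1));
    println!("ok");
}
```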

Commit Phase

Once Nexus has seen prepare acknowledgements from K + Z nodes as informed by the coordinator, and has persisted the configuration as committed in CRDB, Nexus will inform the nodes that the configuration for epoch e' has committed.

Delivery of commit information from Nexus to trust quorum nodes is diagrammed and explained below.

At this point, however, no trust quorum node for epoch e' knows that e' has committed. At least K+Z nodes have the Prepare message for epoch e' saved locally in a ledger, but they have not yet done anything with the information in those prepare messages other than persist them.

Nexus must inform the nodes that the commit for epoch e' has occurred. It does this by sending a Commit message containing the epoch e' to all nodes. If the receiving node has received a Prepare message for epoch e', then it marks the configuration committed in its ledger and acknowledges receipt to Nexus. Nexus records this node's acknowledgement in CRDB. At this point, the node is officially in epoch e' and does not participate in prior epochs by responding with shares or sending share requests for those prior epochs.

During the trust quorum reconfiguration background task tick, Nexus will attempt to send the Commit message to all nodes in e' that have not acknowledged it yet. Using the same rules as above, the node will acknowledge the commit. Importantly, some nodes may not have received the Prepare message for epoch e' yet. They will eventually learn of it when they restart, but we’d like all nodes in e' to prepare and commit as soon as possible. Unfortunately, Nexus does not have enough information to send Prepare messages itself, as key shares never leave the sled-agent. We therefore have two options: the nodes themselves can continually gossip their latest configuration, or Nexus can trigger the nodes to communicate. We choose the Nexus-driven path because of its lower overhead and because it makes updating Nexus and CRDB straightforward. In this case, Nexus sends a PrepareAndCommit message to the unacked nodes to instruct them to perform the operations required to prepare and commit at epoch e'. These operations differ depending upon the current state of the node that receives the PrepareAndCommit, as described below.

If the node that receives a PrepareAndCommit already has a Prepare message, then it marks the configuration committed and acks to Nexus. If it has already committed but Nexus did not receive the ack, it acks again. The more complicated case is when the node hasn’t actually received a Prepare for epoch e'. In this case, it doesn’t know the configuration for that epoch and it doesn’t have a unique key share. The PrepareAndCommit contains the membership set for e', and the first thing the node does is ask all nodes in e' for the configuration for e' along with their key shares in a GetConfigAndShare request. When it receives a configuration and K shares, it computes its own key share and persists it along with a commit marker to the ledger. It then acks back to Nexus. The node is now capable of recomputing the rack secret for epoch e' and rotating its encryption keys, which it can do at any time.
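The dispatch on node state described above can be sketched as a simple match; the state and action names here are illustrative, not the actual types in the implementation:

```rust
// Sketch of how a node might dispatch on its local state when it
// receives a PrepareAndCommit for epoch e'. Names are illustrative.

enum State {
    Committed,      // already committed e'; Nexus just missed the ack
    Prepared,       // has the Prepare (and key share) for e'
    MissingPrepare, // never saw a Prepare for e'
}

#[derive(Debug, PartialEq)]
enum Action {
    AckAgain,
    CommitAndAck,
    // Fetch the configuration and K shares from members of e' via
    // GetConfigAndShare, compute our own share, persist it with a
    // commit marker, then ack.
    FetchConfigAndShares,
}

fn on_prepare_and_commit(state: State) -> Action {
    match state {
        State::Committed => Action::AckAgain,
        State::Prepared => Action::CommitAndAck,
        State::MissingPrepare => Action::FetchConfigAndShares,
    }
}

fn main() {
    assert_eq!(on_prepare_and_commit(State::Committed), Action::AckAgain);
    assert_eq!(on_prepare_and_commit(State::Prepared), Action::CommitAndAck);
    assert_eq!(
        on_prepare_and_commit(State::MissingPrepare),
        Action::FetchConfigAndShares
    );
    println!("ok");
}
```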

While this Nexus-driven strategy for commitment is enough to complete a reconfiguration and inform all healthy members of epoch e' to commit, it is not quite robust enough for our needs. The current protocol does not tolerate rack reboots during commitment. It is possible that some members, but fewer than K, were notified of commitment in epoch e' before the rack rebooted. With fewer than K nodes committed at epoch e', those that have committed will not be able to retrieve enough shares from the new group to create disk encryption keys and boot into the control plane. There needs to be a way for nodes that have seen Prepare messages for epoch e' to commit while the control plane is down. The next section discusses this.

Peer Commit

When a coordinator is in the process of preparing a new configuration at epoch e', it continuously attempts to send Prepare messages to all nodes that have not yet acknowledged them until it is effectively told to "stop preparing" by receiving a Commit or Cancel message for epoch e' from Nexus. It is possible that a Commit is received before all N replicas in epoch e' have seen a Prepare message. While the coordinator could continue indefinitely to prepare, this is not a resilient mechanism, as that coordinator can always die. Since it also is not necessary, the coordinator stops sending Prepare messages when it Commits or Cancels a configuration.

Additionally, once Nexus has decided to Commit a reconfiguration it continuously tries to inform nodes about the commit until it learns that all nodes have committed. It is possible that the rack reboots before Nexus is able to inform all N nodes in epoch e' of the commit or cancellation.

We need a way for all healthy nodes in epoch e' to prepare and commit, regardless of whether the control plane is online or not. We enable these information transfers the same way we do regular key share distribution for a stable configuration at epoch e: nodes communicate directly with each other over sprockets channels. When a node committed at a later configuration receives a request from a node at an earlier configuration, the responding node will inform the requesting node that it must advance to the latest configuration with a CommitAdvance message. When a node that has seen a Prepare for an epoch e', but has not yet committed e', receives a request for its key share for epoch e', it will respond with the key share. The responding node treats the request for the key share as an implicit commit, although it’s not strictly necessary to write the commit to persistent state. The reason implicit commit is allowed is that the peers are expected to be non-Byzantine, and therefore a node requesting a key share for epoch e' always implies that e' has committed.

The following subsections describe scenarios where a peer commit may be necessary, and explain why in some cases it is not.

Unlocking in the old group

When a sled reboots and has its current committed configuration at epoch e, it will establish sprockets connections to all prospective peers and ask for key shares for epoch e. Nodes that are at epoch e will respond with the key share for e. If K-1 nodes reply with key shares for epoch e, then the sled will recompute the rack secret for epoch e and unlock its storage to boot the rest of the control plane. This will occur even if other nodes have already committed at epoch e', as long as responding nodes at epoch e' do not respond with a CommitAdvance before K-1 shares are collected from epoch e. When the underlay becomes available on the sled, Nexus will eventually inform it via the background task to commit epoch e'.

Node attempting unlock at e with a Prepare for e'

If a node receives a request for a share for epoch e while it’s at epoch e' it will respond with a CommitAdvance message for epoch e'. If the requesting node has not yet seen a quorum of shares at epoch e, and it has a prepare message for epoch e', it will drop all shares from epoch e and instead ask for shares for epoch e'. When it receives a quorum of shares for e' it will commit at e' and perform the operations necessary to rotate its disk keys and unlock the sled as described in [_zfs_key_rotation].
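The requesting side of this flow can be sketched as a share collector that restarts at a newer epoch whenever it sees a CommitAdvance; the types here are illustrative:

```rust
use std::collections::BTreeMap;

// Sketch of the unlocking node's share collection: gather shares for the
// epoch it believes is current, but restart collection at the newer
// epoch if any peer responds with a CommitAdvance.

enum Response {
    Share { epoch: u64, from: u8, share: Vec<u8> },
    CommitAdvance { epoch: u64 },
}

struct Collector {
    epoch: u64,
    shares: BTreeMap<u8, Vec<u8>>, // peer id -> key share
}

impl Collector {
    fn handle(&mut self, resp: Response) {
        match resp {
            Response::Share { epoch, from, share } if epoch == self.epoch => {
                self.shares.insert(from, share);
            }
            Response::Share { .. } => {} // share for another epoch: ignore
            Response::CommitAdvance { epoch } if epoch > self.epoch => {
                // A later configuration has committed: drop everything
                // collected so far and start over at the new epoch.
                self.epoch = epoch;
                self.shares.clear();
            }
            Response::CommitAdvance { .. } => {} // stale advance: ignore
        }
    }
}

fn main() {
    let mut c = Collector { epoch: 1, shares: BTreeMap::new() };
    c.handle(Response::Share { epoch: 1, from: 2, share: vec![0xaa] });
    c.handle(Response::CommitAdvance { epoch: 2 });
    // Old shares were dropped; collection restarts at epoch 2.
    assert_eq!(c.epoch, 2);
    assert!(c.shares.is_empty());
    println!("ok");
}
```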

Node attempting unlock at e without a Prepare for e'

If a share requesting node at epoch e receives a CommitAdvance message at epoch e' in response, but does not yet have a Prepare message for epoch e', it will need to learn the configuration for e', and compute its own key share for e'. The unlocking node follows the same message flow as a node that has received a PrepareAndCommit from Nexus for epoch e'.

Expunged nodes

If the requesting node at epoch e reaches a node at epoch e', and the requester is not in the new configuration at epoch e', the responder will reply with an Expunged message for epoch e' indicating that the requesting node is no longer part of the trust quorum and that it should stop asking for shares.
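The responder's decision across the scenarios in these subsections reduces to a small decision table. The sketch below uses illustrative names and collapses membership to a flat list of ids:

```rust
// Sketch of the responder side of peer unlock: what a node at committed
// epoch `ours` replies when `requester` asks for a key share for epoch
// `theirs`. Names and types are illustrative.

#[derive(Debug, PartialEq)]
enum Reply {
    Share,              // hand over our key share for the requested epoch
    CommitAdvance(u64), // a later epoch has committed; requester must catch up
    Expunged(u64),      // requester is no longer a trust quorum member
}

fn respond(ours: u64, theirs: u64, members: &[u8], requester: u8) -> Reply {
    if !members.contains(&requester) {
        // Not in our committed configuration: stop asking for shares.
        Reply::Expunged(ours)
    } else if theirs < ours {
        // Requester is behind: tell it to advance to our epoch.
        Reply::CommitAdvance(ours)
    } else {
        Reply::Share
    }
}

fn main() {
    let members = [1, 2, 3];
    assert_eq!(respond(2, 2, &members, 1), Reply::Share);
    assert_eq!(respond(2, 1, &members, 1), Reply::CommitAdvance(2));
    assert_eq!(respond(2, 1, &members, 9), Reply::Expunged(2));
    println!("ok");
}
```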

Protocol Nuances

Protocol Termination

Nexus is responsible for determining when a reconfiguration is "complete". Individual trust quorum nodes only have a local view of their own state, while Nexus has a global view.

Cancellation

Nexus initiates a reconfiguration by talking to a sled agent which acts as the coordinator for the reconfiguration. The coordinator creates a rack secret, splits it, and distributes key shares as part of Prepare messages to other nodes in the new epoch. The rack secret and key shares only live in memory at the coordinator. If the coordinator dies then it may not have distributed enough key shares to reconstruct the rack secret. At this point the new configuration at epoch e' must be abandoned.

It is also possible that not enough nodes in the new group are healthy enough to perform a reconfiguration, at which point Nexus will never trigger a commit.

To simplify things we treat both of these scenarios in the same manner. We set a timeout of, say, one minute, and if Nexus has not recorded enough successfully prepared nodes it cancels the reconfiguration by tagging the configuration for epoch e' as Cancelled in CRDB. The operator can inspect the state of the configuration and choose to try again with the same or a new group membership at epoch e' + 1.

While we could have the Nexus background task continuously attempt to inform each node about the cancellation, it is not strictly necessary, as Nexus will never attempt to commit the configuration at this point. Importantly, once a reconfiguration attempt is cancelled in CRDB a new reconfiguration can be undertaken immediately.

Commit

It’s important to note that as soon as Nexus marks a configuration Committed in CRDB, the reconfiguration is committed irrevocably. This is true even before Nexus has sent a Commit message to a single node. All healthy nodes will eventually commit.

Once Nexus has seen that all members of a configuration e' have committed, it has no more work to do in the background task and a new reconfiguration can take place if necessary.

On the other hand, it is possible that a sled dies before it has fully committed. The Nexus background task will indefinitely try to get this node to the committed state, which is now impossible. The node could also just be out of the rack for maintenance, in which case it will commit when it is put back in the rack. Our protocol allows a sled to become part of the trust quorum even if it is temporarily down, as long as K+Z sleds acknowledge Prepare messages. Even if another reconfiguration takes place, the offline node can still eventually commit in the latest epoch, because each new configuration includes encrypted metadata containing all the necessary rack secrets for any old configuration where a current member may be stuck. This allows stale nodes to catch up when they are back online, without the need to kick them from group membership during reconfiguration.

When Nexus has seen commit acknowledgements for every member of the latest configuration and persisted them to CRDB, it can instruct the nodes to discard old encrypted rack secrets. This is done by adding a parameter to each reconfiguration request from Nexus specifying the oldest rack secret that must be maintained. On the next prepare phase, any new encrypted metadata will only contain the rack secret for the currently committed epoch. Each node at this point will only need to rotate from epoch e to e+1 upon commit.

ZFS Key Rotation

We use ZFS for our filesystem, and all disk encryption revolves around ZFS dataset encryption. A specific crypt dataset exists on each U.2 drive in each sled. Child datasets inherit this encryption. ZFS encryption has an underlying encryption key which itself is encrypted via a wrapper key which allows the wrapper key to be rotated without having to re-encrypt all data. A separate wrapper key is derived from the rack secret for each trust quorum configuration epoch. The specifics around key derivation are described in [rfd301].

Crashes or reboots of a sled can occur at any time, even after the sled-agent has transitioned to a new epoch and has realized it must rotate its wrapper key for each disk. Because of this, we tag the dataset with an epoch as a ZFS property, so that we know which encryption key has been used. Upon reconfiguration we can rotate to our new key via zfs change-key and set the new epoch as a property in one command.
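The crash-safety check this epoch tag enables can be sketched as follows; the function name is hypothetical, and the actual zfs invocations are elided:

```rust
// Sketch of the crash-safety check described above: each crypt dataset
// is tagged with the epoch whose wrapper key encrypted it, so on boot a
// sled can compare that tag against the committed epoch and rotate only
// the datasets that still need it.

/// `dataset_epoch` is the epoch recorded as a ZFS property on the
/// dataset at the last key change; `committed_epoch` is the sled's
/// current committed trust quorum epoch.
fn needs_rotation(dataset_epoch: u64, committed_epoch: u64) -> bool {
    dataset_epoch < committed_epoch
}

fn main() {
    // Sled crashed after committing epoch 5 but before rotating one disk:
    // the stale tag tells us the rotation must be redone.
    assert!(needs_rotation(4, 5));
    // Already rotated: nothing to do.
    assert!(!needs_rotation(5, 5));
    println!("ok");
}
```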

Ledger Garbage Collection

Trust quorum reconfiguration writes files to the M.2 devices for the valid messages it receives. It may also summarize and write out state for a committed configuration in a different format. Over time, there may be a lot of these files taking up space. They will also carry along with them old key shares that could potentially be used by an old sled removed from trust quorum to reconstruct the rack secret and decrypt any data on that sled’s drives. While both of these issues are very minor, we should institute a mechanism for garbage collecting old configurations by deleting these files. A reasonable policy is probably to delete files older than the last two committed reconfigurations. We can make this user driven, or do it automatically.

Upgrading from LRTQ

The low-rent trust quorum (LRTQ) was deployed on customer racks as of initial product launch. [rfd388] has more details around what guarantees LRTQ provides and why we did it. For our purposes here, we note that all disk encryption keys derived from the rack secret using LRTQ are tagged with epoch 0, and the key derivation mechanism will remain the same across LRTQ and this implementation. The prime field with which we construct the rack secret has changed, however, and so we must maintain this information during our upgrade.

CRDB will maintain a table of trust quorum configurations, one for each attempted reconfiguration. There won’t be a configuration for epoch 0 in this table for systems on which we upgrade from LRTQ. However, we still want to know the members of the group for the LRTQ configuration, so that we can rotate into a real trust quorum configuration at epoch 1. By definition, the members of the existing LRTQ configuration are all the in-service sleds known to the reconfigurator via CRDB. As a first step, we can therefore run a schema migration that utilizes the BaseboardIds of the in-service sleds as the membership for the LRTQ configuration. We’ll ensure that we also tag the version of the protocol in the table for this epoch 0 configuration as version 0, which indicates LRTQ.

After we have our initial LRTQ configuration stored in CRDB we can use this membership to generate a new membership for epoch 1, save it in CRDB, and then trigger an upgrade reconfiguration. This is nearly identical to a regular reconfiguration except that the LRTQ rack secret will be computed differently, since it uses a different prime field.

Correctness of Reconfiguration

The following discussion gives intuitive explanations as to why the reconfiguration protocol satisfies the safety properties of our system and describes when and why the system maintains liveness.

As with most other distributed algorithms, the primary techniques used to maintain our safety and liveness guarantees are persistence and retries.

Above all, we must maintain our safety invariants during reconfiguration:

  • S1: New Secret Confidentiality

  • S2: Commit Completeness

  • S3: Key Rotation Guarantee

New Secret Confidentiality

When we remove a sled, X, from the trust quorum membership for epoch e', we must ensure that it is not able to reconstruct the rack secret for the new group of which it is not a member. To prevent this, we must ensure that:

  • X never gets a key share of its own for e', and K-1 sleds do not respond to it with their key shares for e'

  • K sleds do not respond to it with their key shares for e'

We prevent X from ever getting a key share by not generating one at the coordinator during a reconfiguration. The coordinator never sends a Prepare to X because it’s not in the new group. Additionally, X can only ask for shares from the old group since it doesn’t even know about the new group. Committed peers in the new group respond with an Expunged message and never a share, since they know that X is not in the new group given the membership information in each configuration.

Commit Completeness

Guaranteeing that all healthy members in the new group commit is more complicated than preventing a sled not in the new group from retrieving the rack secret. The latter, at the most basic level, is simply a membership check, and by definition doesn’t have any liveness restrictions. A system that is down is not going to distribute key shares! Commit completeness, however, requires both Nexus/sled-agent interaction as well as trust quorum peer interaction. It must also satisfy our liveness constraints.

First we must note that Nexus is the component that decides when to commit. It does so only after it has seen K+Z prepare acknowledgments when polling the coordinator. Furthermore, Nexus persists this commit decision to CRDB before it starts informing sled-agents to notify their internal trust quorum nodes about the commit. Therefore, Nexus is crash safe. If Nexus crashes before persisting, then it’s as if the commit never occurred. If it crashes afterwards, then it will pick up where it left off. If the control plane stays up, Nexus will end up committing all healthy nodes.

Unfortunately, the control plane may not always stay up. The worst case scenario would be when all nodes except those hosting CRDB saw the commit, and there weren’t enough nodes in the old group to unlock anymore. If we didn’t allow trust quorum nodes to inform each other about commits then the nodes hosting CRDB would be stranded, unable to get a quorum to unlock in the old group, and not knowing that the new group had committed. All the other nodes would boot, but the control plane would never come back online. However, in our protocol, members of the new group that had committed would see that the nodes hosting CRDB are in the membership of the new group and inform them of the commit as described in [_peer_commit].

Key Rotation Guarantee

The correctness of key rotation follows largely from that of commit completeness. Members of the new and old group always learn of the commit for the new group if a commit has occurred. Members can only commit if they have a configuration and key share for the new epoch. Configurations contain the encrypted rack secret for the old group, protected by the rack secret for the new group. Therefore, as long as there is unlock availability, these nodes will be able to reconstruct the new rack secret, decrypt the old rack secret, and derive old and new disk keys to enable key rotation.

Determinations

  • We are using Shamir secret sharing over GF(256).

  • We will choose a threshold size K=N/2 + 1 where N is the number of sleds in the rack, and K is the number of shares required to recreate the rack secret.

  • We are using sprockets to provide:

    • Authentication

    • Remote Attestation

    • Integrity

    • Confidentiality

  • We rely on the bootstrap network defined in [rfd63] for peer discovery.

  • We are relying on our own [gfss] crate for a Shamir secret sharing implementation.
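To make the secret-sharing determinations above concrete, here is a minimal, illustration-only Shamir implementation over GF(256) using the AES reduction polynomial (0x11b). The real implementation lives in the [gfss] crate; the deterministic PRNG here is emphatically not cryptographically secure and exists only so the example is self-contained:

```rust
// Illustration only -- the real implementation is the [gfss] crate.
// Shamir secret sharing over GF(256) with the AES reduction polynomial.

/// Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11b).
fn gf_mul(mut a: u8, mut b: u8) -> u8 {
    let mut p = 0u8;
    for _ in 0..8 {
        if b & 1 != 0 { p ^= a; }
        let carry = a & 0x80 != 0;
        a <<= 1;
        if carry { a ^= 0x1b; }
        b >>= 1;
    }
    p
}

/// Multiplicative inverse via a^254 (a^255 = 1 for any nonzero a).
fn gf_inv(a: u8) -> u8 {
    let (mut result, mut base, mut e) = (1u8, a, 254u32);
    while e > 0 {
        if e & 1 == 1 { result = gf_mul(result, base); }
        base = gf_mul(base, base);
        e >>= 1;
    }
    result
}

/// Split `secret` into n shares with threshold k. `rng` supplies the
/// random polynomial coefficients (a real implementation needs a CSPRNG).
fn split(secret: &[u8], k: usize, n: u8, rng: &mut impl FnMut() -> u8) -> Vec<(u8, Vec<u8>)> {
    let mut shares: Vec<(u8, Vec<u8>)> = (1..=n).map(|x| (x, Vec::new())).collect();
    for &byte in secret {
        // f(x) = byte + c1*x + ... + c_{k-1}*x^{k-1}
        let coeffs: Vec<u8> =
            std::iter::once(byte).chain((1..k).map(|_| rng())).collect();
        for (x, ys) in shares.iter_mut() {
            // Horner evaluation of f at this share's x coordinate
            let y = coeffs.iter().rev().fold(0u8, |acc, &c| gf_mul(acc, *x) ^ c);
            ys.push(y);
        }
    }
    shares
}

/// Reconstruct the secret from any k shares: Lagrange interpolation at 0.
fn combine(shares: &[(u8, Vec<u8>)]) -> Vec<u8> {
    let len = shares[0].1.len();
    (0..len)
        .map(|i| {
            shares.iter().fold(0u8, |acc, (xi, yi)| {
                // Lagrange basis coefficient for xi, evaluated at x = 0.
                let li = shares
                    .iter()
                    .filter(|(xj, _)| xj != xi)
                    .fold(1u8, |l, (xj, _)| gf_mul(l, gf_mul(*xj, gf_inv(xj ^ xi))));
                acc ^ gf_mul(yi[i], li)
            })
        })
        .collect()
}

fn main() {
    // Deterministic stand-in for a CSPRNG -- for the example only.
    let mut state = 0x2fu8;
    let mut rng = move || {
        state = state.wrapping_mul(97).wrapping_add(41);
        state
    };

    let secret = b"rack secret";
    let shares = split(secret, 3, 5, &mut rng);
    // Any 3 of the 5 shares reconstruct the secret.
    let subset = [shares[0].clone(), shares[2].clone(), shares[4].clone()];
    assert_eq!(combine(&subset), secret.to_vec());
    println!("ok");
}
```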

Open Questions

  • Do we actually want to allow offline sleds in epoch e' during a reconfiguration?

    • In other words: should K+Z == N ?

  • Sled agents are going to be interacting over Dropshot for reconfiguration. How do we trust nexus to give us a new configuration? Must we attest, or is it good enough to use our internal auth mechanism?

  • Do we automatically expunge sleds when they are removed from the active configuration?

Security Considerations

  • We should get our implementation of Shamir secret sharing ([gfss]) audited by a cryptographer.

  • We should get our trust quorum protocol analyzed by security / cryptography experts.