Background and Purpose
There are many different types of secrets inside an Oxide rack. At the
base of the system we have the DeviceId and Alias keys stored on the
RoT, used for platform identity and measurement signing for attestation
respectively [rfd36]. These keys, along with a third RoT-hosted keypair used
for authenticating ephemeral Diffie-Hellman agreement, allow sleds
to form secure sprockets sessions for application-layer messages [rfd238].
These sprockets sessions let each sled share information confidentially
in a point-to-point fashion, where the integrity of messages is
protected and the authenticity of the endpoint and the attestation of its
running software are guaranteed.
To provide rack-level security guarantees such that an attacker
cannot walk off with a subset of sleds or drives and recover any useful
information, we have designed a Trust Quorum [rfd238].
The primary protection mechanism behind the trust quorum is Shamir secret
sharing of a shared rack-level secret, which is used as a key-derivation
source for other keys that protect storage at rest. The rack secret is split
into N unique key shares by a dealer process and distributed over sprockets
sessions to each bootstrap agent, along with the unique platform identities of
the N trust quorum members. These platform identities are baked into the public
key certificates on the RoT to allow verification that an entity is who they say
they are. After distribution of this information, bootstrap agents can establish
sprockets connections to other bootstrap agents, verify membership in the group
(via the exchanged certs), and retrieve K-1 shares from other agents so
that they can reconstruct the rack secret from K shares. Importantly, without
obtaining K shares, no information about the rack secret can be learned.
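The K-of-N reconstruction property can be illustrated with a minimal sketch of Shamir secret sharing over a prime field. This is illustrative only: the prime, the small integer secret, and the function names are all hypothetical, and a real implementation would operate on 32-byte secrets with a constant-time library.

```python
# Toy K-of-N Shamir secret sharing over a prime field (illustrative only).
import random

PRIME = 2**127 - 1  # a Mersenne prime large enough for this demo

def split(secret, k, n):
    """Split `secret` into n shares; any k of them reconstruct it."""
    # Random polynomial of degree k-1 with constant term = secret.
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
        shares.append((x, y))
    return shares

def combine(shares):
    """Lagrange interpolation at x=0 recovers the constant term (the secret)."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

secret = 123456789
shares = split(secret, k=2, n=5)
assert combine(shares[:2]) == secret   # any 2 shares suffice
assert combine(shares[2:4]) == secret  # ...any 2, not a specific pair
```

Fewer than K shares leave the polynomial underdetermined, which is why no information about the secret leaks below the threshold.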
As described in [rfd238], individual shares are stored unencrypted on each
sled’s M.2 drives. An attacker would have to steal at least K of these drives
to reconstruct the rack secret, which is infeasible without significant time and
disruption during physical access. In the future we plan to "seal" these secrets
with the help of our root of trust (RoT) such that they are only decrypted on
sled boot. Sealing would mean that an attacker would have to steal K whole
sleds and be able to boot them to recover the rack secret. The weight of K
sleds makes this prohibitive for a casual attacker.
With this rack secret we have a shared secret that is only available when enough of the N sleds in the group are plugged into the same rack and trust each other enough to distribute shares to each other. From this shared secret, we can do things like derive or wrap individual encryption keys for each individual U.2 device, and independently derive rack level root certificates for internal services.
The purpose of this RFD is to identify:

- What data is protected by the rack secret?
- What is the temporal lifecycle of that data?
- What is the spatial locality of that data?
  - Is it local to a specific drive/sled/rack/cluster?
  - What physical or logical restrictions prevent the encrypted data from being moved?
- What is the key hierarchy used to protect that data?
  - What keys protect what data?
  - Which keys are derived from which secrets and keys?
  - Which keys are wrapped (encrypted by other keys)?
  - How are keys derived or wrapped (key schedule)?
- What happens in the case of key compromise?
These questions are critical to the security of our rack, and our determinations may evolve with our experience of the current solution and with more time to design and implement the hardening needed to expand our posture. For now, though, this RFD must deliver answers strong enough to let us protect our data at rest and move on to our other near-term security goals.
Leaf Peeping
Our rack-level key hierarchy starts at the root with the Rack Secret. Each member of the trust quorum can learn enough shares to recompute the rack secret. The rack secret itself is not a key, so we must derive keys from it via a key derivation function (KDF) for the various purposes required by our rack. Which keys we derive, which keys are derived from those keys, and which keys any of them may wrap is hard to determine without more information about what data is actually being protected by keys in our system. To help construct our key hierarchy, and fill in the internal nodes of the tree, we will start by listing the data to be protected within the rack, the unique keys that will protect that data, and the security guarantees we would like from those keys. After establishing the leaf keys we can work backwards to fill in the rest of the key hierarchy.
The below table shows the data we care about protecting by the rack secret key hierarchy for now. There are likely other things that need protecting, but they will almost certainly have similar requirements. And while we currently store Crucible encryption keys in CockroachDB, we anticipate having some sort of key management system in place post MVP that will house things like encryption keys, certs, and authentication tokens. We also do not currently have the internal certificate authority mechanism defined for managing how control plane services communicate over mutual TLS. We do know, however, that we will want such a system and therefore are being explicit here that we will at least have a root certificate and leaf certificates to protect. Whether or not we need intermediate certs is TBD.
| Data | Storage Type | Storage Device | Lifetime |
|---|---|---|---|
| Control Plane Data | CockroachDB | Some U.2 devices | Temporary until lifetime of rack |
| Control Plane Metrics | Clickhouse | Some U.2 devices | ? |
| Crucible Extents | Files | Most U.2 devices | Lifetime of a storage volume |
| Crucible volume encryption keys | CockroachDB | Some U.2 devices | Lifetime of a storage volume |
| User Authentication Tokens | CockroachDB | Some U.2 devices | Temporary / user-session TTL |
| Internal Service Leaf Certs | Files | All U.2 devices, although not all certs on all devices | Cert TTL |
| Internal CA Root Cert Private Key | File | All sleds? | Long enough to at least sign Leaf Cert CSRs, and then may be rederived for rotation of leaf certs |
| Internal CA Root Cert Public Key | File | One or more U.2 devices on all sleds | Until Cert rotation |
We can see that all stored data resides on U.2 devices, either in files or
in databases that end up in files. We know based on prior discussions that we do
not want to use hardware-backed full disk encryption (due to lack of trust in
the implementation, complexity of key management, and multiple possible vendors),
and we also know that we will have one zpool per U.2 device. We want to encrypt
almost the entire zpool via a root dataset using ZFS encryption, opting
out specific datasets, like Crucible's, that provide their own encryption. We can
therefore assume that all necessary customer and control plane data on each U.2
device is encrypted. To limit key compromise, such that if one disk is stolen
and has its key compromised not all other disks are compromised, we choose to
use an individual key per disk. We know that each disk contains a lot of data,
and that we do not want to re-encrypt that data when changing keys. Luckily for
us, ZFS gives us flexibility in how we generate and change these keys.
Key rotation is managed internally by the ZFS kernel module and changing the user’s key does not require re-encrypting the entire dataset.
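In ZFS terms, the key we control is the wrapping (user) key; the data encryption key stays fixed when it is changed. A sketch of what this looks like, with hypothetical dataset names and key paths, and assuming a system with ZFS native encryption:

```shell
# Illustrative only: dataset names and key file locations are hypothetical.

# Create an encrypted root dataset with a raw 32-byte wrapping key.
zfs create -o encryption=aes-256-gcm \
           -o keyformat=raw \
           -o keylocation=file:///run/keys/u2-old.key \
           oxp_example/crypt

# Later, rotate the wrapping key. ZFS re-wraps the internal data
# encryption key; the bulk data on disk is not re-encrypted.
zfs change-key -o keylocation=file:///run/keys/u2-new.key oxp_example/crypt
```

This is why rotating the rack secret (and thus the per-disk wrapping keys) does not force a re-encryption of each multi-terabyte drive.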
As we are encrypting all critical data on the U.2s using ZFS encryption (including Crucible keys), and those ZFS keys are not stored on those disks, we cover our data-at-rest protection requirements. An attacker who cannot reconstruct the rack secret cannot steal a subset of drives or sleds and retrieve any useful information. We now must figure out whether we want to derive or wrap those keys, and what the rest of the hierarchy looks like.
Building a Key Hierarchy from a Rack Secret
Assuming a static cluster, we now have a rack secret that can be computed from
K key shares and used to derive child keys. In the following example, K =
2, just so we can keep the diagrams small.
The rack secret protects every key in the system that is used after rack unlock. There are two mechanisms to get the next keys in the hierarchy:
- Key Derivation
- Key Wrapping (Key Encryption)
We already use key derivation to derive primary child keys from the rack secret, but it can also be used to derive keys from keys. Key wrapping is just taking a key and encrypting it. Each has its benefits and drawbacks.
Key derivation is nice because you never have to store the derived keys on disk; you can just regenerate them. The problem is that if the key a child is derived from changes, that child and any keys derived downstream of it change as well.
Key wrapping is nice because if the wrapper key is rotated, the downstream keys do not have to change. This is particularly useful for encrypting large amounts of data on storage. You don’t want to be forced to decrypt and then re-encrypt just because a parent key was rotated. The downside of key wrapping is that you now have to store the wrapped (encrypted) key on disk somewhere. If the same key is used in multiple places you have to replicate it.
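The contrast between the two mechanisms can be shown as a toy sketch. The function names are hypothetical, and XOR with the parent key stands in for a real key-wrap cipher such as AES-GCM (the Python stdlib has no AEAD); only the structural point matters here.

```python
# Toy contrast: derived keys are tied to their parent, wrapped keys are not.
import hashlib, hmac, os

def derive(parent, info):
    # Stand-in KDF: HMAC-SHA3-256 of a context string under the parent key.
    return hmac.new(parent, info, hashlib.sha3_256).digest()

parent_v1, parent_v2 = os.urandom(32), os.urandom(32)

# Derivation: rotating the parent changes every key derived from it.
assert derive(parent_v1, b"disk-0") != derive(parent_v2, b"disk-0")

# Wrapping: the child (data encryption) key is fixed; only its wrapping
# changes. XOR is a stand-in for a real authenticated key-wrap cipher.
dek = os.urandom(32)
wrapped_v1 = bytes(a ^ b for a, b in zip(dek, parent_v1))

# On rotation, unwrap with the old parent and re-wrap with the new one;
# data encrypted under `dek` needs no re-encryption.
unwrapped = bytes(a ^ b for a, b in zip(wrapped_v1, parent_v1))
wrapped_v2 = bytes(a ^ b for a, b in zip(unwrapped, parent_v2))
assert bytes(a ^ b for a, b in zip(wrapped_v2, parent_v2)) == dek
```

The cost of wrapping, as noted above, is that the wrapped blob must be stored somewhere, and replicated wherever the key is needed.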
For trust quorum, we have the particular problem that when nodes are added or removed we generate new shares. While it’s possible to maintain the same rack secret and key derivation with new shares, this is less secure over time, because any single malicious node that retrieved the rack secret at one time and saved it can recover any data on existing drives or any data produced in the future. While this is always an issue, we prefer to allow rotation of the rack secret in the case of a known compromise. We therefore always generate a new rack secret on every change of the trust quorum (reconfiguration), or other key-share rotation. This way, even if all existing data on the rack is compromised, at least no new data will be compromised once the compromised sled is removed.
What do we know right now about our security goals? We want to:

- Derive keys from the rack secret as much as possible to limit the need to store wrapped keys.
- Allow rotation of the rack secret so that a compromised rack secret can be mitigated.
- Use unique encryption keys per U.2 drive so that if one key is compromised, the others are not.
- Ensure that once a reconfiguration occurs, new (empty) sleds cannot access any at-rest data from old sleds that is not shared with them.
And what are our constraints given these goals? We must:

- Be able to change the ZFS wrapper keys per U.2 drive when the rack secret is rotated. This requires knowing the old and new wrapper keys at the same time.
- Allow for the fact that, given the distributed nature of key rotation, not all sleds will learn at the same time that a new reconfiguration has been committed and that it is time to change the wrapper key.
- Recognize that commitment of the new configuration, and hence the new rack secret, may occur after multiple "false starts", where a new reconfiguration is distributed to multiple sleds but not committed.
The first constraint is a given from our use of disk encryption in general. The latter two constraints come from the distributed nature of the reconfiguration problem and are elaborated upon in [rfd238]. The second constraint makes it impossible to distribute the new key share during the commitment, because after a sled commits to a new epoch it should only utilize the new rack secret, and thus it may have to request the new key shares from a sled that has not yet learned of the commit. There are other security reasons for not distributing key shares during commit that are further fleshed out in [rfd238]. Therefore, given a 2-phase commit protocol ([rfd238]), we must distribute the new shares in the prepare message. The third constraint makes it impossible to change the ZFS encryption wrapper keys immediately when they are learned, since the new rack secret from which those keys are derived may never be committed. Without a committed rack secret, it will be impossible to retrieve the shares necessary to recompute the secret and rederive the ZFS encryption wrapper keys.
Given these goals and constraints, we reiterate that any sleds that are members of both the old and the new group must have access to the old and new committed rack secrets at the same time. This is necessary to allow them to derive the ZFS encryption wrapper keys for each U.2 device so that they can change the keys. For simplicity of the reconfiguration protocol, and to limit the exposure of the old rack secret, we also only want to allow sleds to distribute key shares for the currently committed configuration. The requirements and desires above are in tension, and so we must get creative about how we handle this situation.
The most straightforward way to solve this predicament is via the dealer during
a reconfiguration. As described in [rfd238] we number each configuration with
a monotonically increasing epoch. At epoch 1, each sled gets a single share
and recomputes the rack secret to derive the original ZFS encryption wrapper
keys for the U.2 devices. When a reconfiguration occurs, the dealer retrieves
enough shares to recompute the rack secret for the current committed epoch (1
in our example). The dealer generates a new rack secret for epoch 2, and splits
it into key shares. The dealer also derives an old-rack-secret encryption
key associated with epoch 1 from the epoch 2 rack secret and encrypts the epoch
1 rack secret with the old-rack-secret key. It sends this encrypted secret
to sleds that are members of the new group along with the rest of the trust
quorum prepare message. Until the new configuration at epoch 2 is committed, the
encrypted epoch-1 rack secret cannot be decrypted, because no members will send
shares for epoch-2 required to recompute the epoch-2 rack secret and the derived
key necessary to decrypt the epoch 1 rack secret. If epoch 2 is never committed
this will remain the case.
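The dealer's handling of the old rack secret can be sketched as follows. The HKDF steps follow RFC 5869 over SHA3-256 as described later in this RFD, but the function names and the info string layout here are this sketch's assumptions, and XOR with an HKDF-derived key stands in for ChaCha20-Poly1305 (the stdlib has no AEAD; a real implementation must use an authenticated cipher).

```python
# Sketch: encrypting the epoch-1 rack secret under a key derived from epoch 2.
import hashlib, hmac, os, struct

def hkdf_extract(salt, ikm):
    return hmac.new(salt, ikm, hashlib.sha3_256).digest()

def hkdf_expand(prk, info, length=32):
    out, block, i = b"", b"", 1
    while len(out) < length:
        block = hmac.new(prk, block + info + bytes([i]), hashlib.sha3_256).digest()
        out += block
        i += 1
    return out[:length]

def old_secret_key(new_secret, salt, new_epoch, old_epoch):
    # Context string binding the key to the (new, old) epoch pair.
    info = (b"rack-secret" + struct.pack(">I", new_epoch)
            + b"-" + struct.pack(">I", old_epoch))
    return hkdf_expand(hkdf_extract(salt, new_secret), info)

# Dealer side: recompute the epoch-1 secret from shares, generate the
# epoch-2 secret and salt, then encrypt epoch 1 under an epoch-2-derived key.
epoch1_secret = os.urandom(32)
epoch2_secret, epoch2_salt = os.urandom(32), os.urandom(32)
key = old_secret_key(epoch2_secret, epoch2_salt, new_epoch=2, old_epoch=1)
encrypted_old = bytes(a ^ b for a, b in zip(epoch1_secret, key))  # stand-in cipher

# Sled side, after commit: recompute the epoch-2 secret from shares,
# rederive the same key, and recover the epoch-1 secret so the ZFS
# wrapper keys can be rotated.
recovered = bytes(a ^ b for a, b in zip(
    encrypted_old, old_secret_key(epoch2_secret, epoch2_salt, 2, 1)))
assert recovered == epoch1_secret
```

Because the decryption key only exists once the epoch-2 rack secret can be recomputed, the encrypted epoch-1 secret is inert on any sled until commit.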
As soon as a sled sees that a configuration has been committed for a new epoch,
it retrieves enough shares to unlock the new epoch rack secret, derives the
old-rack-secret encryption key, decrypts the old rack secret protected by
this key, derives the old and new U.2 encryption keys from the old and new
rack secret respectively and re-configures the ZFS encryption for each U.2
drive. Once the encryption keys have been changed for all the U.2 drives, the
encrypted rack secret for the old epoch is securely deleted, along with any
other encrypted rack secrets for the old epoch that were prepared but never
committed. There is some nuance here around failure modes, but that is not
relevant to this RFD and is further fleshed out in [rfd238].
Our key hierarchy has now been roughly described in prose, and it tolerates failures of sleds during reconfiguration, and allows the changing of encryption wrapper keys used for ZFS encryption on U.2 drives. It also deals with compromise of the rack secret and limits exposure of individual derived key compromise. With this in mind, we can now draw a diagram of the key-hierarchy.
Key Derivation
Similar to sprockets ([rfd256]), we use [hkdf] with sha3-256 as the hash
algorithm as our key derivation function. We also use chacha20poly1305 for
encrypting our rack secret during reconfiguration. ZFS encryption offers
limited cipher options, and so we choose the strongest one, AES-GCM-256.
Notably, both AES-GCM-256 and chacha20poly1305 use 32-byte (256-bit) keys,
and so we always derive 32-byte keys from HKDF.
HKDF is a two-step algorithm, and while it does not require a salt for the
first extract step, it strongly suggests one. The salt should be independent
of the key material, at least as long as the output of the hash function used,
and does not have to remain secret. While ideally the salt wouldn’t be
generated by one party, we already rely on the dealer to generate the rack
secret. For ease of implementation we will have the dealer also randomly
choose a 32-byte salt for each epoch and include it with the key share and
membership data sent to the bootstrap agents. Note that this salt is only used
in the key derivation for the old encrypted rack secrets, and not for the disk
encryption keys. This is somewhat of a historical artifact, and it’s not clear
that it provides much security benefit in either case. The reason for this is
that the HKDF-Expand for generating key material relevant to the previous rack
secrets is performed by the trust quorum code itself, while the disk key
generation is done in a separate module, key-manager, which only receives the
input key material (the rack secrets). There is no separate distribution of a
different salt from the dealer, and so we’d either end up sharing the salt or
adding a new one. In either case we’d need to plumb that through to the key
manager, which is extra work with no clear improvement in security.
After extracting the ikm (and possibly salt) into uniformly distributed
output key material using HKDF-Extract, individual keys may be derived from
this output using the HKDF-Expand part of the HKDF algorithm. To bind
derived keys to their usage and context, such that they cannot be
used for other purposes by other derivers of the same key, an info
parameter is provided to the HKDF-Expand function.
In our scenarios we have two types of keys to generate: storage
encryption keys and rack secret encryption keys. We want to contextualize
these so that one key isn’t mistakenly used as the other. We also want further
contextualization for individual drives and configuration epochs, so that
keys with the same purpose but different specific uses are not confused
with each other. We define the info strings we pass to HKDF below, where
+ is the concatenation operator.
- Rack Secret Info: `"rack-secret" + new_epoch + '-' + old_epoch`, where each of `new_epoch` and `old_epoch` is a 4-byte big-endian integer. `new_epoch` corresponds to the epoch where the encryption key is derived, and `old_epoch` is the epoch of the rack secret that is being encrypted.
- U.2 Drive Info: `"U.2-zfs-" + pci_vendor_id + drive_model + drive_serial_number`, where the tuple of `(pci_vendor_id, drive_model, drive_serial_number)` uniquely identifies a U.2 drive.
The derivation chart then looks like the following (using 2 drives instead of the 320 in a rack). The dealer creates a rack secret for epoch 2 and a salt used for encrypting rack secrets. From there we can derive the 2 storage keys and the rack secret wrapper key, which protects the rack secret from epoch 1 with the rack secret from epoch 2. While it’s possible for any bootstrap agent with access to the rack secrets for epochs 1 and 2 to perform the derivation and encryption, in practice the encryption is only done by the dealer; the other bootstrap agents only derive the key required to decrypt the old rack secret, since they don’t actually have a way to access both the old and new rack secrets simultaneously.
Determinations
- Each U.2 drive has a separate storage encryption key.
- We derive all storage keys from the current rack secret, utilizing the drives' serial numbers.
- Key shares can only be retrieved for the current epoch. While not explicitly specified here, as described in [rfd238], bootstrap agents may learn about commitment from other agents. This eliminates the problem of new agents that commit being stranded, unable to get enough shares to unlock their rack.
- We derive a rack secret encryption key from a new rack secret being prepared to protect the old (current) rack secret upon commit, and distribute this to each member of the new and old group during reconfiguration.
- Upon learning about commit of a new epoch, members of the new group retrieve shares for the new rack secret, recompute it, decrypt the old rack secret in the prepare message, derive the storage keys, reconfigure storage encryption, then securely delete the encrypted rack secret and the new and old rack secrets in memory.
External References
[RFD 36] Root of Trust and Attestation
[RFD 238] Trust Quorum and Rack Unlock