RFD 301
Has anybody seen my keys?: A key-hierarchy strategy for rack-level security

Background and Purpose

There are many different types of secrets inside an Oxide rack. At the base of the system we have the DeviceId and Alias keys, stored on the RoT and used for platform identity and measurement signing for attestation respectively [rfd36]. These keys, along with a third RoT-hosted keypair used for authenticating ephemeral Diffie-Hellman agreement, provide the ability for sleds to form secure sprockets sessions for application-layer messages [rfd238]. These sprockets sessions allow each sled to confidentially share information in a point-to-point fashion, where the integrity of messages is protected, and the authenticity of the endpoint and the attestation of its running software are guaranteed.

In order to provide the rack-level security guarantees such that an attacker cannot walk off with a subset of sleds or drives and recover any useful information, we have designed a Trust Quorum [rfd238].

The primary protection mechanism behind the trust quorum is Shamir secret sharing: a shared rack-level secret is used as a key-derivation source for other keys that protect storage at rest. The rack secret is split into N unique key shares by a dealer process and distributed over sprockets sessions to each bootstrap agent, along with the unique platform identities of the N trust quorum members. These platform identities are baked into the public key certificates on the RoT to allow verification that an entity is who it says it is. After distribution of this information, bootstrap agents can establish sprockets connections to other bootstrap agents, verify membership in the group (via the exchanged certs), and retrieve K-1 shares from other agents so that they can reconstruct the rack secret from K shares. Importantly, without obtaining K shares, no information about the rack secret can be learned.
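To make the K-of-N reconstruction property concrete, here is a minimal, illustrative-only Shamir split/combine over a prime field in Python. The field choice and share encoding here are arbitrary; the real trust quorum implementation ([rfd238]) differs in its field arithmetic and share formats.

```python
import secrets

# Toy field: the well-known Curve25519 prime; secrets must be < P.
P = 2**255 - 19

def split(secret: int, k: int, n: int) -> list[tuple[int, int]]:
    """Split `secret` into n shares, any k of which reconstruct it."""
    assert 0 <= secret < P and 1 <= k <= n
    # Random polynomial of degree k-1 with constant term = secret.
    coeffs = [secret] + [secrets.randbelow(P) for _ in range(k - 1)]
    def eval_at(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):
            acc = (acc * x + c) % P
        return acc
    return [(x, eval_at(x)) for x in range(1, n + 1)]

def combine(shares: list[tuple[int, int]]) -> int:
    """Lagrange-interpolate the polynomial at x=0 from k shares."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

rack_secret = secrets.randbelow(P)
shares = split(rack_secret, k=2, n=4)
assert combine(shares[:2]) == rack_secret        # any K shares suffice
assert combine([shares[0], shares[3]]) == rack_secret
```

With fewer than K shares, every candidate secret is equally consistent with the shares held, which is what makes the scheme information-theoretically hiding.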

As described in [rfd238], individual shares are stored unencrypted on each sled’s M.2 drives. An attacker would have to steal at least K of these drives to reconstruct the rack secret, which is infeasible without significant time and disruption during physical access. In the future we plan to "seal" these secrets with the help of our root of trust (RoT) such that they are only decrypted on sled boot. Sealing would mean that an attacker would have to steal K whole sleds and be able to boot them to recover the rack secret. The weight of K sleds makes this prohibitive for a casual attacker.

With this rack secret we have a shared secret that is only available when enough of the N sleds in the group are plugged into the same rack and trust each other enough to distribute shares to each other. From this shared secret, we can do things like derive or wrap individual encryption keys for each individual U.2 device, and independently derive rack level root certificates for internal services.

The purpose of this RFD is to identify:

  • What data is protected by the rack secret?

  • What is the temporal lifecycle of that data?

  • What is the spatial locality of that data?

    • Is it local to a specific drive/sled/rack/cluster?

    • What physical or logical restrictions prevent the encrypted data from being moved?

  • What is the key-hierarchy used to protect that data?

    • What keys protect what data?

    • Which keys are derived from which secrets and keys?

    • Which keys are wrapped (encrypted by other keys)?

    • How are keys derived or wrapped (Key Schedule)?

    • What happens in the case of key compromise?

These questions are critical to the security of our rack, and our determinations may evolve given our experience with the current solution and more time to think of and implement the solutions necessary to harden and expand our posture. For now, though, this RFD must deliver strong enough answers to allow us to protect our data at rest well and move on to our other security goals in the near future.

Leaf Peeping

Our rack-level key-hierarchy starts at the root with the Rack Secret. Each member of the trust quorum can learn enough shares to recompute the rack secret. The rack secret itself is not a key, and so we must derive keys from it via a key derivation function (KDF) for the various purposes required by our rack. Which keys we derive, which keys are derived from those keys, and which keys any of them may wrap is hard to determine without more information about what data is actually being protected by keys in our system. To help construct our key hierarchy and fill in the internal nodes of the tree, we will start by listing the data to be protected within the rack, the unique keys that will be used to protect that data, and the security guarantees we would like from those keys. After establishing the leaf keys we can work backwards to fill in the rest of the key hierarchy.

The table below shows the data we care about protecting with the rack secret key hierarchy for now. There are likely other things that need protecting, but they will almost certainly have similar requirements. And while we currently store Crucible encryption keys in CockroachDB, we anticipate having some sort of key management system in place post-MVP that will house things like encryption keys, certs, and authentication tokens. We also do not currently have an internal certificate authority mechanism defined for managing how control plane services communicate over mutual TLS. We do know, however, that we will want such a system, and are therefore being explicit here that we will at least have a root certificate and leaf certificates to protect. Whether or not we need intermediate certs is TBD.

| Data                              | Storage Type | Storage Device                                           | Lifetime                                                                                      |
|-----------------------------------|--------------|----------------------------------------------------------|-----------------------------------------------------------------------------------------------|
| Control Plane Data                | CockroachDB  | Some U.2 devices                                         | Temporary until lifetime of rack                                                              |
| Control Plane Metrics             | Clickhouse   | Some U.2 devices                                         | ?                                                                                             |
| Crucible Extents                  | Files        | Most U.2 devices                                         | Lifetime of a storage volume                                                                  |
| Crucible volume encryption keys   | CockroachDB  | Some U.2 devices                                         | Lifetime of a storage volume                                                                  |
| User Authentication Tokens        | CockroachDB  | Some U.2 devices                                         | Temporary / user-session TTL                                                                  |
| Internal Service Leaf Certs       | Files        | All U.2 devices, although not all certs on all devices   | Cert TTL                                                                                      |
| Internal CA Root Cert Private Key | File         | All sleds?                                               | Long enough to at least sign Leaf Cert CSRs; may then be rederived for rotation of leaf certs |
| Internal CA Root Cert Public Key  | File         | One or more U.2 devices on all sleds                     | Until cert rotation                                                                           |

We can see that all stored data resides on U.2 devices, either directly in files or in databases that ultimately store their data in files. We know based on prior discussions that we do not want to use hardware-backed full disk encryption (due to lack of trust in implementations, complexity of key management, and multiple possible vendors), and we also know that we will have one zpool per U.2 device. We want to encrypt almost the entire zpool via a root /crypt dataset using ZFS encryption, opting out specific datasets, like crucible, that provide their own encryption. We can therefore assume that all necessary customer and control plane data on each U.2 device is encrypted. To limit the blast radius of key compromise, such that if one disk is stolen and has its key compromised the other disks are not also compromised, we choose to use an individual key per disk. We know that each disk contains a lot of data, and that we do not want to re-encrypt that data when changing keys. Luckily for us, ZFS allows us flexibility in how we generate and change these keys.

"Key rotation is managed internally by the ZFS kernel module and changing the user’s key does not require re-encrypting the entire dataset." (zfs(8) manpage)

As we are encrypting all critical data on the U.2s using ZFS encryption (including crucible keys), and those ZFS keys are not stored on those disks, we cover our data-at-rest protection requirements. An attacker who cannot reconstruct the rack secret cannot steal a subset of drives or sleds and retrieve any useful information. We now must figure out whether we want to derive or wrap those keys, and what the rest of the hierarchy looks like.

Building a Key Hierarchy from a Rack Secret

Assuming a static cluster, we now have a rack secret that can be computed from K key shares and used to derive child keys. In the following example, K = 2, just so we can keep the diagrams small.

The rack secret protects every key in the system that is used after rack unlock. There are two mechanisms to get the next keys in the hierarchy:

  • Key Derivation

  • Key Wrapping (Key Encryption)

We already use key derivation to derive the primary child keys from the rack secret, but it can also be used to derive keys from other keys. Key wrapping is simply taking a key and encrypting it with another key. Each has its benefits and drawbacks.

Key derivation is nice because you never have to store the derived keys on disk; you can just regenerate them. The problem is that if a key changes, any keys derived from it will also change.

Key wrapping is nice because if the wrapper key is rotated, the downstream keys do not have to change. This is particularly useful for encrypting large amounts of data on storage. You don’t want to be forced to decrypt and then re-encrypt just because a parent key was rotated. The downside of key wrapping is that you now have to store the wrapped (encrypted) key on disk somewhere. If the same key is used in multiple places you have to replicate it.
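The two properties can be demonstrated with a small sketch. Note that `derive` uses HMAC-SHA3-256 as a stand-in PRF, and `wrap` is a toy, unauthenticated XOR construction purely for illustration; real wrapping would use an AEAD such as AES-GCM.

```python
import hmac, hashlib, secrets

def derive(parent: bytes, info: bytes) -> bytes:
    # HMAC-SHA3-256 as a PRF stands in for a full KDF here.
    return hmac.new(parent, info, hashlib.sha3_256).digest()

def wrap(kek: bytes, child: bytes) -> bytes:
    # Toy, UNAUTHENTICATED wrap (XOR with a derived pad), only to show
    # the rotation property; do not use this construction for real.
    nonce = secrets.token_bytes(32)
    pad = derive(kek, b"wrap" + nonce)
    return nonce + bytes(a ^ b for a, b in zip(child, pad))

def unwrap(kek: bytes, blob: bytes) -> bytes:
    nonce, ct = blob[:32], blob[32:]
    pad = derive(kek, b"wrap" + nonce)
    return bytes(a ^ b for a, b in zip(ct, pad))

old_parent, new_parent = secrets.token_bytes(32), secrets.token_bytes(32)

# Derivation: rotating the parent silently changes every derived child.
assert derive(old_parent, b"disk-1") != derive(new_parent, b"disk-1")

# Wrapping: rotating the KEK only requires re-wrapping; the child key,
# and therefore all data encrypted under it, is unchanged.
child = secrets.token_bytes(32)
blob_old = wrap(old_parent, child)
blob_new = wrap(new_parent, unwrap(old_parent, blob_old))
assert unwrap(new_parent, blob_new) == child
```

The trade-off falls out directly: derivation costs nothing to store but couples children to the parent's lifetime, while wrapping decouples them at the cost of persisting (and replicating) ciphertext.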

For trust quorum, we have the particular problem that when nodes are added or removed we generate new shares. While it’s possible to maintain the same rack secret and key derivation with new shares, this is less secure over time, because any single malicious node that retrieved the rack secret at one time and saved it can recover any data on existing drives or any data produced in the future. While this is always an issue, we prefer to allow rotation of the rack secret in the case of a known compromise. We therefore always generate a new rack secret on every change of the trust quorum (reconfiguration), or other key-share rotation. This way, even if all existing data on the rack is compromised, at least no new data will be compromised once the compromised sled is removed.

What do we know right now about our security goals? We want to:

  • Derive keys from the rack secret as much as possible to limit the need to store wrapped keys

  • Allow rotation of the rack secret so that a compromised rack secret can be mitigated

  • Use unique encryption keys per U.2 drive so that if one key is compromised, the others are not compromised.

  • Ensure that new (empty) sleds cannot access any at rest data from old sleds that is not shared with them once a reconfiguration occurs.

And what are our constraints given these goals? We must:

  1. Be able to change the ZFS wrapper keys per U.2 drive when the rack secret is rotated. This requires knowing the old and new wrapper key at the same time.

  2. Allow for the fact that not all sleds will know at the same time when a new reconfiguration has been committed, and when to change the wrapper key, given the distributed nature of key rotation.

  3. Recognize that commitment of the new configuration and hence new rack secret may occur after multiple "false starts", where a new reconfiguration is distributed to multiple sleds but not committed.

The first constraint is a given from our use of disk encryption in general. The latter two constraints come from the distributed nature of the reconfiguration problem and are elaborated upon in [rfd238]. The second constraint makes it impossible to distribute the new key share during the commitment, because after a sled commits to a new epoch it should only utilize the new rack secret, and thus it may have to request the new key shares from a sled that has not yet learned that the epoch has been committed. There are other security reasons for not distributing key shares during commit that are further fleshed out in [rfd238]. Therefore, given a 2-phase commit protocol ([rfd238]), we must distribute the new shares in the prepare message. The third constraint makes it impossible to change the ZFS encryption wrapper keys immediately when they are learned, since the new rack secret from which those keys are derived may never be committed. Without a committed rack secret, it will be impossible to retrieve the shares necessary to recompute the secret and rederive the ZFS encryption wrapper keys.

Given these goals and constraints, we re-iterate that any sleds that are members of both the old and the new group must have access to the old and new committed rack secrets at the same time. This is necessary to allow them to derive the ZFS encryption wrapper keys for each U.2 device so that they can change the keys. For simplicity of the reconfiguration protocol, and to limit the exposure of the old rack secret we also only want to allow sleds to distribute key shares for the currently committed configuration. The requirement and desires above are in tension, and so we must get creative about how we handle this situation.

The most straightforward way to solve this predicament is via the dealer during a reconfiguration. As described in [rfd238] we number each configuration with a monotonically increasing epoch. At epoch 1, each sled gets a single share and recomputes the rack secret to derive the original ZFS encryption wrapper keys for the U.2 devices. When a reconfiguration occurs, the dealer retrieves enough shares to recompute the rack secret for the current committed epoch (1 in our example). The dealer generates a new rack secret for epoch 2, and splits it into key shares. The dealer also derives an old-rack-secret encryption key associated with epoch 1 from the epoch 2 rack secret and encrypts the epoch 1 rack secret with the old-rack-secret key. It sends this encrypted secret to sleds that are members of the new group along with the rest of the trust quorum prepare message. Until the new configuration at epoch 2 is committed, the encrypted epoch-1 rack secret cannot be decrypted, because no members will send shares for epoch-2 required to recompute the epoch-2 rack secret and the derived key necessary to decrypt the epoch 1 rack secret. If epoch 2 is never committed this will remain the case.

As soon as a sled sees that a configuration has been committed for a new epoch, it retrieves enough shares to unlock the new epoch rack secret, derives the old-rack-secret encryption key, decrypts the old rack secret protected by this key, derives the old and new U.2 encryption keys from the old and new rack secret respectively and re-configures the ZFS encryption for each U.2 drive. Once the encryption keys have been changed for all the U.2 drives, the encrypted rack secret for the old epoch is securely deleted, along with any other encrypted rack secrets for the old epoch that were prepared but never committed. There is some nuance here around failure modes, but that is not relevant to this RFD and is further fleshed out in [rfd238].
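The flow in the last two paragraphs can be sketched end to end. Every primitive here is a stand-in (HMAC-SHA3-256 in place of the real KDF, an unauthenticated XOR pad in place of ChaCha20-Poly1305), the info-string layout follows the definitions given later in this RFD, and the drive identifier is a placeholder; the sketch only illustrates the ordering of operations.

```python
import hmac, hashlib, secrets

def kdf(key: bytes, info: bytes) -> bytes:
    return hmac.new(key, info, hashlib.sha3_256).digest()

def toy_seal(key: bytes, plaintext: bytes) -> bytes:
    # NOT a real AEAD; for illustration of the flow only.
    nonce = secrets.token_bytes(32)
    return nonce + bytes(a ^ b for a, b in zip(plaintext, kdf(key, nonce)))

def toy_open(key: bytes, blob: bytes) -> bytes:
    nonce, body = blob[:32], blob[32:]
    return bytes(a ^ b for a, b in zip(body, kdf(key, nonce)))

def rack_secret_info(new_epoch: int, old_epoch: int) -> bytes:
    return (b"rack-secret" + new_epoch.to_bytes(4, "big")
            + b"-" + old_epoch.to_bytes(4, "big"))

# --- Dealer, reconfiguring from epoch 1 to epoch 2 ---
rs1 = secrets.token_bytes(32)          # current committed rack secret
rs2 = secrets.token_bytes(32)          # new rack secret (then split into shares)
k_wrap = kdf(rs2, rack_secret_info(2, 1))
encrypted_rs1 = toy_seal(k_wrap, rs1)  # included in the prepare message

# --- A sled, after it sees epoch 2 committed and recomputes rs2 from shares ---
rs1_recovered = toy_open(kdf(rs2, rack_secret_info(2, 1)), encrypted_rs1)
assert rs1_recovered == rs1

# It can now derive old and new per-drive keys and re-key each U.2
# (e.g. via `zfs change-key`), then securely delete encrypted_rs1.
drive_info = b"U.2-zfs-" + b"<vendor+model+serial>"
old_key, new_key = kdf(rs1_recovered, drive_info), kdf(rs2, drive_info)
assert old_key != new_key
```

Until epoch 2 commits, nobody will serve epoch-2 shares, so `rs2` (and hence `k_wrap`) is unrecoverable and `encrypted_rs1` stays opaque, exactly as the prose above requires.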

Our key hierarchy has now been roughly described in prose, and it tolerates failures of sleds during reconfiguration, and allows the changing of encryption wrapper keys used for ZFS encryption on U.2 drives. It also deals with compromise of the rack secret and limits exposure of individual derived key compromise. With this in mind, we can now draw a diagram of the key-hierarchy.

Key Derivation

Note
We are ignoring any CA related certs or other keys here and focusing on storage encryption keys and Rack secret encryption keys. Those are most relevant to MVP. Similar techniques will be applied as needed to certs and other keys when their usage gets defined in other RFDs.

Similar to sprockets ([rfd256]), we use [hkdf] with sha3-256 as the hash algorithm as our key derivation function. We also use chacha20poly1305 for encrypting our rack secret during reconfiguration. ZFS encryption has a limited set of encryption options, so we choose the strongest one, which is AES-GCM-256. Notably, both AES-GCM-256 and chacha20poly1305 use 32-byte (256-bit) keys, and so we always derive 32-byte keys from HKDF.

HKDF is a two-step algorithm; while it does not require a salt for the first (extract) step, the specification strongly suggests one. The salt should be independent of the key material, at least as long as the output of the hash function used, and does not have to remain secret. While ideally the salt wouldn’t be generated by a single party, we already rely on the dealer to generate the rack secret. For ease of implementation we will have the dealer also randomly choose a 32-byte salt for each epoch and include it with the key share and membership data sent to the bootstrap agents. Note that this salt is only used in the key derivation for the old encrypted rack secrets, and not for the disk encryption keys. This is somewhat of a historical artifact, and it’s not clear that it provides much of a security guarantee in either case. The reason for this is that the HKDF-Expand for generating key material related to previous rack secrets is performed by the trust quorum code itself, while the disk key generation is done in a separate module, key-manager, which only receives the input key material (the rack secrets). There is no separate distribution of a different salt from the dealer, so we’d either end up sharing the salt or adding a new one. In either case we’d need to plumb that through to the key manager, which is extra work with no clear improvement in security.

After extracting the IKM (and optional salt) into uniformly distributed output key material using HKDF-Extract, individual keys may be derived from this output using the HKDF-Expand step of the HKDF algorithm. To bind derived keys to their usage and context, so that they cannot be used for other purposes by other derivers of the same key, an info parameter is provided to the HKDF-Expand function.
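A minimal HKDF (RFC 5869) over HMAC-SHA3-256, using only the Python standard library, sketches the extract-then-expand shape; the salt, IKM, and info strings shown are placeholders, and production code would use an audited implementation.

```python
import hmac, hashlib

HASH, HASH_LEN = hashlib.sha3_256, 32

def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869): PRK = HMAC-Hash(salt, IKM)."""
    return hmac.new(salt or b"\x00" * HASH_LEN, ikm, HASH).digest()

def hkdf_expand(prk: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF-Expand (RFC 5869): derive `length` bytes bound to `info`."""
    okm, block = b"", b""
    for i in range(1, -(-length // HASH_LEN) + 1):  # ceil(length / HASH_LEN)
        block = hmac.new(prk, block + info + bytes([i]), HASH).digest()
        okm += block
    return okm[:length]

# Same IKM, different info strings => independent-looking 32-byte keys.
prk = hkdf_extract(salt=b"\x01" * 32, ikm=b"example rack secret")
k1 = hkdf_expand(prk, b"rack-secret...", 32)
k2 = hkdf_expand(prk, b"U.2-zfs-...", 32)
assert len(k1) == 32 and k1 != k2
```

The info parameter is what gives us domain separation: two consumers holding the same PRK cannot accidentally derive each other's keys unless they also agree on the exact info bytes.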

In our scenarios we have two types of keys we are generating: storage encryption keys, and rack secret encryption keys. We want to contextualize these so that one key isn’t mistakenly used as the other. We also want further contextualization for individual drives and configuration epochs, so that keys with the same purpose but different specific use cases are not confused with each other. We define the info strings we pass to HKDF below, where + is the concatenation operator.

  • Rack Secret Info: "rack-secret" + new_epoch + '-' + old_epoch, where each of new_epoch and old_epoch is a 4-byte big-endian integer. new_epoch corresponds to the epoch in which the encryption key is derived, and old_epoch is the epoch of the rack secret that is being encrypted.

  • U.2 Drive Info: "U.2-zfs-" + pci_vendor_id + drive_model + drive_serial_number, where the tuple of (pci_vendor_id, drive_model, drive_serial_number) uniquely identifies a U.2 drive.
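A sketch of how these info strings might be encoded as bytes. The exact encoding of the string fields is an assumption on our part, and the vendor/model/serial values are hypothetical placeholders.

```python
import struct

def rack_secret_info(new_epoch: int, old_epoch: int) -> bytes:
    # "rack-secret" + new_epoch + '-' + old_epoch, with each epoch as a
    # 4-byte big-endian integer, per the definition above.
    return (b"rack-secret" + struct.pack(">I", new_epoch)
            + b"-" + struct.pack(">I", old_epoch))

def u2_drive_info(pci_vendor_id: str, drive_model: str, drive_serial: str) -> bytes:
    # "U.2-zfs-" + pci_vendor_id + drive_model + drive_serial_number;
    # UTF-8 encoding of the identifier fields is assumed.
    return b"U.2-zfs-" + (pci_vendor_id + drive_model + drive_serial).encode()

assert rack_secret_info(2, 1) == b"rack-secret\x00\x00\x00\x02-\x00\x00\x00\x01"
assert u2_drive_info("1b96", "SOME-MODEL", "SOME-SERIAL") == \
    b"U.2-zfs-1b96SOME-MODELSOME-SERIAL"
```

Fixed-width big-endian epochs keep the encoding unambiguous, so no two distinct (new_epoch, old_epoch) pairs can ever produce the same info bytes.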

The derivation chart for a rack then looks like the following (using 2 drives instead of 320). The dealer creates a rack secret and a salt for epoch 2, the latter used for encrypting rack secrets. From there we can derive the 2 storage keys and the rack-secret wrapper key, which protects the rack secret from epoch 1 with the rack secret from epoch 2. While it’s possible for any bootstrap agent with access to the rack secrets for epochs 1 and 2 to perform the derivation and encryption, in practice the encryption is only done by the dealer; the other bootstrap agents only derive the key required to decrypt the old rack secret, since they don’t have a way to access both the old and new rack secrets simultaneously.

Determinations

  • Each U.2 drive has a separate storage encryption key

  • We derive all storage keys from the current rack secret, utilizing each drive’s unique identifiers (PCI vendor ID, model, and serial number)

  • Rack key shares can only be retrieved for the current epoch. While not explicitly specified here (it is described in [rfd238]), bootstrap agents may learn about commitment from other agents. This eliminates the problem of agents in a newly committed configuration becoming stranded, unable to get enough shares to unlock their rack.

  • We derive a rack secret encryption key from a new rack secret being prepared to protect the old (current) rack secret upon commit and distribute this to each member of the new and old group during reconfiguration.

  • Upon learning about commit of a new epoch, members of the new group retrieve shares for the new rack secret, recompute it, decrypt the old rack secret in the prepare message, derive the storage keys, reconfigure storage encryption, then securely delete the encrypted rack secret and the new and old rack secrets in memory.