RFD 241
Holistic Boot

[rfd75] discusses in detail three boot strategies to be explored, while discounting a fourth, referred to in §3.12 as "a strategy of incremental booting". This previously discounted approach is henceforth referred to as Holistic Boot and forms our proposed strategy. In this document we refer to HBS ([rfd20]) as host software that executes prior to architecture-independent illumos code during the boot process, regardless of the number of separate programs performing these functions.

Note
This material along with that in [rfd215] supersedes [rfd37], which has been abandoned, and portions of [rfd20], [rfd27], and [rfd75]. The published RFDs have had appropriate notes added highlighting the superseded material, and [rfd27]'s contents have been brought in line with this proposal.

Determinations

  1. The host operating system will perform all necessary initialisation of the hardware resources assigned to it by [rfd88].

  2. The host OS kernel loaded and given control upon entry into the operational power state ("boot") will remain resident and in control of the host. There will be no handoff to a separate production kernel.

  3. All bootstrapping functions will be performed only once at boot and will be considered one-way. Reversibility of these operations while in the operating power state ("running") is neither expected nor required.

  4. As discussed in [rfd81] §1.6, no host that is running will boot again until it has passed through a power state that is guaranteed to reset all host-visible hardware to a known state (for Gimlet, A2 or lower).

    Important
    Exception: The host operating system may blow hard fuses in managed devices incorporating them, subject to interface stability commitments associated with software upgrade and rollback.
  5. Portions of the host OS will be located in a storage device that can be accessed directly by host software with little or no setup ("phase-1"); the remainder will be located on a standard SSD ("phase-2"). Neither device will be removable while the containing sled is operating. We refer to these as phases rather than the traditional stages because there is no handoff between them. For Gimlet, these locations are chip-down SPI-attached NOR flash and PCIe-attached non-hot-serviceable SSDs in M.2 sockets. Future sleds are expected to operate similarly, possibly utilising different peripherals and storage technologies while retaining these architectural characteristics.

  6. Every host will have two independent storage devices for phase-1 software and two independent storage devices for phase-2 software. A device of each kind is designated "A" and the other of its kind "B"; together the pair of unlike devices is referred to as boot storage unit A ("BSU A") and boot storage unit B ("BSU B"). The host OS software components within a BSU are assumed to be tightly bound to one another and must be upgraded as if they were successive blocks on a single storage device. Phase-1 OS components loaded from phase-1 device A will always load phase-2 components from phase-2 device A, and so on. A sketch of this pairing invariant appears after this list.

  7. For the sake of consistency, processor- or machine-specific initialisation that takes place prior to fetching the first instruction on the BSP is referred to as stage-0 (not phase-0). For Gimlet, stage-0 consists of software executing off-package (e.g., on the SP) and AMD-supplied firmware executing on in-package cores outside our control. Future designs may implement stage-0 differently or incorporate it into HBS as phase-0. Stage-0 is not part of HBS and is out of scope.

  8. Contrary to previous discussion in [rfd37], no explicit information will be passed into the host OS at the time control is transferred. Instead, all variable data will be retrieved by the host OS from the SP during the boot process.

  9. If and when a new board design precludes reliable discovery of the SP by HBS, an implementation-specific kernel will be introduced, binding SP discoverability to kernel implementation. See [rfd215] for background.
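
To make determination 6 concrete, the fragment below sketches the BSU pairing invariant in Rust. The types, slot numbers, and device identifiers are hypothetical illustrations, not real interfaces.

```rust
// Sketch of the BSU pairing invariant from determination 6: phase-1
// software loaded from device A must load phase-2 components only from
// device A, and likewise for B. All fields here are hypothetical.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Bsu {
    A,
    B,
}

struct BootStorageUnit {
    phase1_flash_slot: u8,    // SP-selected flash slot (hypothetical)
    phase2_ssd: &'static str, // fixed SSD identifier (hypothetical)
}

/// The pairing is total and fixed: selecting a BSU selects both devices.
fn boot_storage_unit(which: Bsu) -> BootStorageUnit {
    match which {
        Bsu::A => BootStorageUnit { phase1_flash_slot: 0, phase2_ssd: "ssd-a" },
        Bsu::B => BootStorageUnit { phase1_flash_slot: 1, phase2_ssd: "ssd-b" },
    }
}
```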

Architectural Features

Holistic Boot is a simplified hybrid of Heliosboot and tiny-HBS. The oxide machine-specific illumos logic, ancillary host system software, service processor software, and Gimlet hardware will implement this strategy. As its name suggests, the essential feature of Holistic Boot is that substantially all boot logic is contained in the single host OS image.

As in Heliosboot, that host OS image is based on illumos ([rfd26]). Unlike Heliosboot, there is no handoff to a second kernel, usually called the production kernel in LinuxBoot parlance. Holistic Boot may also be viewed as an even smaller version of tiny-HBS in which no separate PCIe (or USB, etc.) driver stack is implemented outside Helios. A simple matrix highlights the essential attributes of these four strategies.

Attribute                 | Tiny                | Giant       | Helios              | Holistic
--------------------------|---------------------|-------------|---------------------|--------------
Host OS storage location  | SSD                 | SSD         | SSD                 | SPI NOR + SSD
PCIe DXIO+hotplug setup   | Loader-reset-Helios | Loader only | Helios-reset-Helios | Helios only
HBS organisation          | Loader+Helios       | Loader only | Helios only         | Loader+Helios

Alternative Explorations and Additional Research

While we have not yet completed this aspect of our investigations described by [rfd75], we have very low confidence in the reversibility of the AMD firmware configuration required to access PCIe end devices safely. This includes DXIO engine configuration, NBIO PCIe strapping, and hotplug configuration. Even if firmware commands are nominally available to revert this configuration to its initial state, we have little reason to trust them for the same reasons we don’t trust the processor’s overall reset logic ([rfd81]). Additionally, it is likely that these paths are poorly tested, as the AGESA UEFI OS does not utilise them. While we are not strictly constrained to a choice between Giant-HBS and Holistic Boot by this situation, it provides a strong incentive to move in a direction that does not rely on reversible firmware behaviour or reliable internal reset. That we can go backward in time is a lie we wish to avoid telling software executing in our wake.

SD/eMMC ([rfd75] §6 Ib) was eliminated from the Gimlet design, primarily because SD card storage devices are universally unreliable and suitable eMMC devices are difficult to source. Instead, a second 32 MiB NOR flash device was added to the rev A design; this is carried over to rev B and improved per [rfd163] to provide two completely independent storage slots for both firmware and software under the SP’s exclusive control. While flash availability remains highly constrained, this could conceivably be upgraded to 64 MiB if necessary. See the following section on learnings around capacity requirements. We tentatively expect that the existing [rfd163] design will provide adequate storage to support Holistic Boot without any need for an SD/eMMC implementation. Note that SD/eMMC is no longer documented by AMD for Milan and has been removed from Genoa, so we are also avoiding a dead-end technology from AMD’s perspective.

Most of the remaining items from our [rfd75] §6 tactical plan have either been completed (Ia, IIa/b/c/d/e/f, IIIa/b/c) and have informed our selection of Holistic Boot or are obviated by it (Ib/d). According to plan, part IV tasks were largely deferred and most will never be completed.

A Cautionary Tale from LinuxBoot

The deployment of LinuxBoot at Google unearthed many intertwined, critical assumptions of implied state from one system to the next. While LinuxBoot’s model is quite different from anything we would want to deploy today, it is instructive to note the myriad pieces of state that are often assumed to be present or passed through various boot paths, and the problems that can occur as a result.

Before the transition to LinuxBoot, Google’s production kernels had been booted in "Legacy" mode, despite running on top of a UEFI firmware stack. This involved a key DXE driver module called the Compatibility Support Module (CSM). This module is responsible for presenting the state discovered and initialised during UEFI in a way that allows OSes that do not support UEFI to boot as if they were running on top of a legacy BIOS. Crucially, this means that no UEFI boot or runtime calls are made from the OS, no EFI virtual address remapping is done, and the OS looks in the Extended BIOS Data Area (EBDA) for its ACPI RSDP.

In swapping to LinuxBoot, two things happened: 1) LinuxBoot did not invoke the CSM to create legacy data structures such as the EBDA and to place the RSDP there; and 2) LinuxBoot itself booted in EFI mode, which crucially invoked UEFI’s SetVirtualAddressMap call in order to use EFI runtime services.

This latter effect unearthed a previously unknown bug in which one of the SMI handlers installed by firmware called out to functions in UEFI, which had been relocated. (Side note: even when operating correctly, this behaviour is a massive security hole, as calling out to UEFI code from SMM defeats the point of SMM memory protections in the first place.) To work around this problem while the SMI handler still existed, and since we never intended to call UEFI runtime services in LinuxBoot or the production kernel, we embedded noefi into the LinuxBoot kernel command line. This set off another set of problems.

In the migration from legacy boot to UEFI boot, the ACPI RSDP was moved from the EBDA into the UEFI system table. This meant that when LinuxBoot came up, it had access to the system table, which allowed it to find all the ACPI tables. But because we’d put noefi in the LinuxBoot kernel command line, the next kernel we kexeced booted in legacy mode, looked in the EBDA region for the RSDP, which didn’t exist there, and failed to find any devices attached to the system. We then erroneously thought that we could pass the RSDP to the production kernel through the acpi_rsdp command line parameter and our problems would be solved. We started rolling this solution out through our test batch in the fleet. Our problems were not solved.

In the event of a kernel panic, Google had embedded a crash kernel that would be kexeced by the production kernel. This kernel would come up with minimal features, but it would scrape the logs of the previous kernel and send a crashdump to a remote service for debugging purposes. This kernel also booted in legacy mode, but was not passed acpi_rsdp by the production kernel, and thus failed to boot with useful devices such as the network card. We’d broken crashdumps on all machines which carried LinuxBoot.

The point of the above story is that there is potentially a ton of state, assumptions, and implied responsibilities spread across firmware and the OS. Even when we pass all the correct state to the following stage, there’s no guarantee that the stage after that will still behave the same way. This is not the only story, but it’s probably the worst. We’ve had cases where we didn’t set a PCIe config space register in LinuxBoot, which subtly led to a performance regression in a higher-level system that was noticed only a couple of months later.

All this is to say: the ability of the x86 architecture to hide state is immense, and therefore the chance that we will leave state initialised in a way that adversely affects following stages in ways we cannot predict is high. There is thus an argument for minimising the number of stages that hand state from one to another in order to avoid these kinds of issues.

The Loader Stub and Phase-1 Software Discovery

Unfortunately, legacy x86 boot semantics necessitate a small loader stage responsible for ensuring that the BSP is properly configured (see [rfd215]); for loading into memory any initial bootstrap data, along with the kernel binary and the additional software components required to mount the rest of the operating system’s data stored on an SSD; and for transferring control to the kernel. Logically there is no reason this loader could not be glued onto the kernel in the same manner that i86pc kernels incorporate a similar stub called dboot. For i86pc, this approach is called "direct boot", which is a misleading historical artefact: it actually requires both a full-scale bootloader such as loader(8) or GRUB (see [rfd75] §3.8) and either a BIOS or UEFI OS. In our case, the loader is responsible for putting the BSP into 64-bit mode, loading the 64-bit ELF object containing the OS kernel into memory, and transferring control to it.
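
As a concrete illustration of the last of those steps, the fragment below sketches how a loader might validate the kernel’s ELF64 header and recover its entry point before jumping to it. This is a minimal sketch: a real loader must also walk the program headers and copy each PT_LOAD segment to its destination address, which is elided here.

```rust
/// Minimal sketch: validate a kernel image's ELF64 header and return
/// its entry point. Offsets follow the ELF64 specification; program
/// header processing and all real error handling are elided.
fn elf64_entry(image: &[u8]) -> Option<u64> {
    // e_ident must begin 0x7f 'E' 'L' 'F', with EI_CLASS == ELFCLASS64 (2)
    // and EI_DATA == ELFDATA2LSB (1), little-endian as on x86-64.
    if image.len() < 64 || &image[..4] != b"\x7fELF" || image[4] != 2 || image[5] != 1 {
        return None;
    }
    // e_entry is the 8-byte virtual address at offset 24 of the ELF header.
    let mut entry = [0u8; 8];
    entry.copy_from_slice(&image[24..32]);
    Some(u64::from_le_bytes(entry))
}
```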

We need to compress at least most of the phase-1 components to ensure we have enough space to reach phase-2. Ongoing experiments show that most of the network stack (possibly useful for recovery) and the kernel components necessary to access NVMe SSDs can fit into approximately 17 MiB when compressed with LZMA (see xz(1)); that includes several debugging components, such as mdb(1) and its dmods, that would not normally be delivered in a production phase-1 archive. This image consists of an approximately 87%-full 64 MiB UFS filesystem (roughly 56 MiB of data compressing to 17 MiB), so we should generally expect 3-4x compression ratios. Gimlet offers us 29 MiB of storage space after accommodating stage-0 components and AMD’s directories, so while we have some margin here, it’s also likely that the set of software required for phase-1 will grow over time and we will want to make the most of available compression algorithms. The current bootrd code in illumos krtld doesn’t know how to decompress anything; it assumes that loader(8) has decompressed the boot archive or any other modules provided in GZIP format. While lofi(7d) cannot be used to store phase-1 software in its current implementation, it does support on-the-fly decompression of read-only LZMA-compressed loopback filesystem images.
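
To make the budget concrete, here is a hedged sketch of the kind of build-time check we might run, assuming the xz2 crate (Rust bindings to liblzma) and a hypothetical phase1.ufs image file; the 29 MiB figure is Gimlet’s remaining flash budget from above.

```rust
// Hypothetical build-time check: compress the phase-1 UFS image with
// LZMA and verify it fits within Gimlet's 29 MiB phase-1 flash budget.
use std::fs::File;
use std::io::{self, Read};
use xz2::read::XzEncoder;

const PHASE1_BUDGET: u64 = 29 * 1024 * 1024;

fn main() -> io::Result<()> {
    let image = File::open("phase1.ufs")?; // hypothetical 64 MiB filesystem image
    // Preset 6 approximates xz(1)'s default; higher presets trade
    // compression time and memory for a smaller archive.
    let mut encoder = XzEncoder::new(image, 6);
    let mut compressed = 0u64;
    let mut buf = [0u8; 64 * 1024];
    loop {
        let n = encoder.read(&mut buf)?;
        if n == 0 {
            break;
        }
        compressed += n as u64;
    }
    println!("compressed size: {} MiB", compressed >> 20);
    assert!(compressed <= PHASE1_BUDGET, "phase-1 archive exceeds flash budget");
    Ok(())
}
```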

Similarly, nanobl-rs doesn’t know how to decompress anything today either, but it does know how to interpret ELF objects. Conversely, krtld needs the ELF object, which it expects to fetch from the earlyboot rootfs (currently bcpio, UFS, or HSFS, as well as bootfs, which is PC-specific), but we can’t execute from that. We’d like to avoid keeping separate copies of the kernel because it’s fairly large (around 620 KiB), which means that nanobl-rs, or whatever other loader we want to use, needs to be able to decompress the boot archive and interpret it well enough to find and load the kernel inside it. We may wish to revisit doing this the way it was done on OBP sun4 machines, as they also interpreted the host kernel directly from the same storage used to supply the rest of the OS; unlike OBP, we do not have, nor especially want, the ability to interpret UFS or other complex filesystems in the loader, so simple formats like tar or cpio seem preferable.
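
To illustrate how little code a simple format demands of the loader, here is a minimal sketch of locating a file inside an uncompressed cpio archive in the SVR4 "newc" format; the function name and the choice of newc are illustrative assumptions, not settled decisions, and sanity checking beyond the header magic is elided.

```rust
/// Parse an 8-character ASCII hex field from a newc cpio header.
fn hex(field: &[u8]) -> Option<usize> {
    usize::from_str_radix(core::str::from_utf8(field).ok()?, 16).ok()
}

/// Round up to the 4-byte alignment newc requires for names and data.
fn align4(n: usize) -> usize {
    (n + 3) & !3
}

/// Return the contents of `path` within a newc cpio `archive`, if present.
fn find_in_cpio<'a>(archive: &'a [u8], path: &str) -> Option<&'a [u8]> {
    let mut off = 0;
    // Each entry is a 110-byte ASCII header, a NUL-terminated name, then data.
    while off + 110 <= archive.len() {
        let hdr = &archive[off..off + 110];
        if &hdr[..6] != b"070701" {
            return None; // not a newc archive (or corrupt)
        }
        let filesize = hex(&hdr[54..62])?; // c_filesize
        let namesize = hex(&hdr[94..102])?; // c_namesize, includes the NUL
        let name = &archive[off + 110..off + 110 + namesize - 1];
        if name == b"TRAILER!!!" {
            return None; // end-of-archive marker
        }
        let data = align4(off + 110 + namesize);
        if name == path.as_bytes() {
            return Some(&archive[data..data + filesize]);
        }
        off = align4(data + filesize);
    }
    None
}
```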

There are at least two other options for compressing phase-1 software:

  1. MP0 firmware ("the PSP") supports packing zlib/deflate-compressed objects into host boot flash and is capable of decompressing them into host DRAM ([amd-psp-boot] §4.1.5). Using this functionality would allow us to compress the entire body of phase-1 software without the need for a decompression implementation of our own. The drawbacks are of course relying on firmware that we may not otherwise need and the poor space performance of deflate relative to LZMA, zstd, and other algorithms we might choose to employ ourselves (a rough size comparison is sketched after this list). Note that we do not know what, if any, limitations apply to the size of objects to be loaded by MP0.

  2. ELF objects can contain compressed sections for non-allocable data, which can reduce their size considerably ([bahrami15], [gabi] §4.1). Unfortunately there are several drawbacks. First, the restriction to non-allocable data makes this generally relevant only for debug data and notes, leaving program text and data uncompressed. Second, like the previous mechanism, the only algorithm specified is zlib/deflate. Third, the kind of data for which this mechanism was designed is not generally present in phase-1 software because substantially all phase-1 software is part of illumos, which replaces bloated DWARF or stabs debug data with the much more compact — and itself compressed — CTF. While this mechanism may be generally valuable for any Rust binaries we deliver (in system software or otherwise) and can be used independently of whatever solution we adopt, it is likely not a general solution. Rust-specific binary size problems are out of scope here. Finally, although this functionality has been standardised into the generic ABI, illumos doesn’t have a complete implementation yet (the Solaris implementation is proprietary and unavailable to us).
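
The deflate-versus-LZMA gap referenced in option 1 is straightforward to quantify for any candidate image. A minimal sketch, assuming the flate2 and xz2 crates (neither is an established choice here), that compresses the same bytes both ways and reports the sizes:

```rust
// Compress the same image with deflate (the only algorithm MP0 can
// unpack) and with LZMA/xz, to quantify the space cost of relying on
// MP0's decompressor. Crate choices (flate2, xz2) are assumptions.
use std::io::{self, Read};

fn drain(mut r: impl Read) -> io::Result<u64> {
    let mut total = 0u64;
    let mut buf = [0u8; 64 * 1024];
    loop {
        let n = r.read(&mut buf)?;
        if n == 0 {
            return Ok(total);
        }
        total += n as u64;
    }
}

fn compare(image: &[u8]) -> io::Result<()> {
    let deflate = flate2::read::ZlibEncoder::new(image, flate2::Compression::best());
    let xz = xz2::read::XzEncoder::new(image, 9);
    println!("deflate: {} bytes", drain(deflate)?);
    println!("xz:      {} bytes", drain(xz)?);
    Ok(())
}
```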

We leave this area open to further research and prototyping.

Measured Boot

As there is no handoff between phase-1 and phase-2 and all stage-0 and phase-1 components are contained in the same storage device along with their associated entry points, it is not necessary for the SP to measure these bodies of software independently; see [rfd36]. Similarly, because there is no handoff between HBS and the host OS, HBS does not need to measure phase-1 components such as the kernel. Only phase-2 components need to be measured, and their measurements transmitted to the SP for recording. This eliminates the need for a second body of code to perform and transmit measurements, as would be required in the other strategies. Similarly, it eliminates an obstacle to verified boot (see [rfd216]) in two ways:

  1. Because there is no need for HBS to measure the host OS kernel prior to transferring control to it, there would also be no need to obtain and enforce a verified boot policy at that point: the SP has already enforced whatever policy is in place.

  2. Assuming that sufficient OS components can be located in phase-1 storage to perform automated or semi-automated recovery from loss of phase-2 software integrity and that software to perform that recovery is actually implemented, a verified boot policy could be enforced at the transition from phase-1 to phase-2 if desired. At this point of enforcement, full communication with the SP is available for already-verified phase-1 software to obtain that policy.

Consistent with the overall architecture, measurement (and policy enforcement, if desired) of phase-2 and later software components will be the responsibility of the host OS.
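
For concreteness, a minimal sketch of that responsibility, assuming the sha2 crate and SHA-256 as the digest algorithm (the real algorithm and recording protocol are governed by [rfd36]); report_to_sp is a hypothetical stand-in for the host-SP channel:

```rust
// Sketch only: measure a phase-2 image and hand its digest to the SP.
// SHA-256 is assumed here for illustration; `report_to_sp` stands in
// for whatever host-SP recording interface is ultimately defined.
use sha2::{Digest, Sha256};

fn measure_phase2(image: &[u8], report_to_sp: impl FnOnce(&[u8])) {
    let digest = Sha256::digest(image);
    report_to_sp(digest.as_slice());
}
```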

Boot-Time Communication and Phase-2 Software Discovery

The host OS will obtain all variable information required during the boot process (e.g., manufacturer, model, serial number, selected BSU) by communicating with the SP directly. This eliminates any need for the SP to rewrite flash for the host OS’s benefit; unfortunately, we do not control the SMU, so rewriting the APCB is our only means of configuring it. It also eliminates the need to define an additional configuration mechanism and implement the SP software to manage it. Instead, the same host-SP communication channel already needed for other purposes can be used to obtain all this information.

This does require that the host OS installed in phase-1 storage be capable of finding the SP on its board model and revision. We have two options here:

  1. A more general version of AMD’s documented "board getting method", in which some fixed mechanism, such as GPIO or an accessible I2C EEPROM, supplies this bootstrapping information on every board; or,

  2. We use the implementation-specific kernel feature of illumos described in [rfd215], so that each set of compatible access methods binds to an implementation. Introduction of additional implementations is made only when a new product precludes reliably finding the SP.

The second of these is far more flexible and future-proof, and this is exactly the reason this mechanism exists and was used for decades at Sun. Note that the introduction of a new board model or revision does not necessarily require a new implementation. For example, Gimlet does not use eSPI at all; if a future board were to make the SP accessible using eSPI while leaving UART1 (which Gimlet uses for the SP) unused, that board may be able to share an implementation with Gimlet. Software would be capable of reliably discriminating between the two boards at runtime based on where it finds the SP. The details of these transitions are necessarily left to the future, but we have high confidence that the basic mechanism is adequate.

While [rfd37] initially intended a somewhat complex means for something akin to Giant-HBS to learn where to find host OS components, Holistic Boot’s use of permanently bound BSUs obviates the need for it. Phase-1 software will obtain the configured BSU (matching the phase-1 flash mux control state; see [rfd163] §1.4.1) and locate its phase-2 software components on a fixed device obtained from a static mapping from the tuple (board model, board revision, BSU index) onto the phase-2 storage device path.
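
A minimal sketch of what such a static mapping might look like; the table entries, revision handling, and device paths here are hypothetical placeholders, not Gimlet’s real topology:

```rust
// Sketch of the static (board model, board revision, BSU index) ->
// phase-2 device path mapping. All entries are hypothetical.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Bsu {
    A,
    B,
}

fn phase2_device(model: &str, _revision: u32, bsu: Bsu) -> Option<&'static str> {
    // A real table would also be keyed on revision where it matters;
    // these paths are placeholders, not actual device tree paths.
    match (model, bsu) {
        ("gimlet", Bsu::A) => Some("/devices/.../ssd@a"),
        ("gimlet", Bsu::B) => Some("/devices/.../ssd@b"),
        _ => None,
    }
}
```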

Open Questions

  • Should the loader be glued to the kernel such that MP0 is configured to load the kernel+loader from local storage, or should they remain separate such that MP0 loads only the loader and the loader in turn fetches the kernel either directly from phase-1 storage or from a region of DRAM loaded separately?

  • Do we need to work around any size limitations in MP0’s loader firmware? If we do, should we have the loader or kernel fetch from flash?

  • Should the loader or kernel be responsible for locating and interpreting the phase-1 OS components?

  • Should we bother keeping the "host never writes to phase-1 storage" premise from [rfd27]? See the discussion below.

  • Where is decompression of the boot archive performed, and can we avoid storing a second copy of the kernel in precious phase-1 storage?

  • Are we sure we want to ship nanobl-rs as the loader? Nothing else seems obviously suitable (see [rfd75]) but despite some improvements the code is still pretty bad. We should assess the relative costs of starting over vs. fixing what’s there.

Security Considerations

See [s-measured-boot] and the [rfd75] Security Considerations. The following changes have occurred relative to the risks discussed there:

Because of the tight binding between phase-1 and phase-2 software, the fact that the host OS already manages phase-2 storage, and the SP’s ability to measure and, optionally, restrict transfer of control to stage-0 firmware/configuration and phase-1 software, there is no longer any security value in prohibiting the host OS from writing to the entire BSU. It may be simpler for the upgrade process to be driven entirely by host software with respect to both storage devices. Attacks against the host OS that attempt to persist malicious code or configuration in stage-0/phase-1 storage would appear to succeed but would be detected and recovered from automatically by the SP at the next boot. The risk that such an attack will be attempted given a compromised host OS seems high, as this is a common vector for persistent threats in PCs, though more knowledgeable attackers familiar with our architecture will recognise the need for additional compromise of SP software to make such an attack effective. As implementing phase-1 software upgrade in host software does not eliminate the need for SP-driven configuration and recovery, it may have limited value; therefore, we leave this path unspecified at this time.

As a refinement, and as discussed in a previous section, the portion of HBS that may handle keying material and implement a measured and/or verified boot policy will be limited to the operating system kernel. There is no requirement for a separate loader stage to measure the kernel prior to control transfer.

The risks associated with multi-stage boot architectures do not apply to Holistic Boot, which is inherently single-stage.

External References