RFD 284
Loading the Host Operating System

Background

RFD 75 explores different options for booting, and we have chosen a holistic approach.

RFDs 215 and 241 explain how holistic boot works for the Oxide machine architecture. RFD 281 further describes the storage of the host operating system in persistent media, image discovery, and boot once the host OS kernel is loaded, and notes the remaining practical matter of loading the host operating system image into RAM and starting it running.

This document describes how the host OS image is loaded and activated.

Determinations

The component that is responsible for loading and starting the host operating system is the Pico Host Boot Loader, or phbl (pronounced "foible").

phbl supports loading a compressed CPIO archive for the phase1 host OS image. ZLIB is currently the only supported compression algorithm.
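To make "ZLIB-compressed" concrete at the byte level, the following is a minimal sketch of a check for the two-byte stream header defined by RFC 1950. The helper name `looks_like_zlib` is hypothetical and is not taken from phbl; it only illustrates how a loader could sanity-check an archive before attempting to inflate it.

```rust
/// Returns true if `data` begins with a plausible zlib (RFC 1950) header.
/// Hypothetical helper for illustration only; not part of phbl.
pub fn looks_like_zlib(data: &[u8]) -> bool {
    if data.len() < 2 {
        return false;
    }
    let (cmf, flg) = (data[0], data[1]);
    // The low nibble of CMF is the compression method; 8 means DEFLATE.
    let method_is_deflate = (cmf & 0x0f) == 8;
    // The high nibble of CMF is log2(window size) - 8; values above 7 are invalid.
    let window_ok = (cmf >> 4) <= 7;
    // CMF and FLG, read as a big-endian 16-bit value, must be a multiple of 31.
    let check_ok = ((u16::from(cmf) << 8) | u16::from(flg)) % 31 == 0;
    method_is_deflate && window_ok && check_ok
}
```

A header check of this kind only rejects obviously wrong input; it says nothing about the integrity of the compressed payload.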

phbl Implementation

phbl is a simple program: it is loaded from SPI flash by the PSP after DRAM training and is invoked from the reset vector of the BSC (BootStrap Core, [amd64]). It does the bare minimum required to bring the processor up into 64-bit mode, load the illumos kernel, and call the kernel’s ELF entry point; presumably a one-way journey. In this regard, it behaves much like single-threaded microcontroller code.

phbl passes a small amount of data to the kernel: the kernel entry is called with the physical address of the CPIO archive and the archive’s length as arguments. Further, it observes an extremely simple protocol for transferring ownership of the page tables that map the kernel image, and obeys a strict convention for the granularity of mappings that the kernel may rely on.
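The handoff described above can be sketched as follows. This is an illustration under stated assumptions, not phbl's actual code: the names `KernelEntry` and `start_kernel` are hypothetical, and the use of the C calling convention for the two arguments is assumed rather than taken from this document.

```rust
/// Assumed shape of the kernel entry point: it receives the physical
/// address of the CPIO archive and the archive's length in bytes.
/// The "C" ABI here is an assumption made for this sketch.
pub type KernelEntry = extern "C" fn(archive_pa: u64, archive_len: usize);

/// Calls the kernel entry point and treats any return as fatal.
/// Because this is a call and not a jump, a return address is pushed,
/// so the kernel could in principle come back; if it does, we panic.
pub fn start_kernel(entry: KernelEntry, archive_pa: u64, archive_len: usize) -> ! {
    entry(archive_pa, archive_len);
    panic!("host kernel returned to the loader");
}
```

In a real loader the `entry` value would be derived from the `e_entry` field of the kernel's ELF header; here it is simply a function pointer so that the sketch can be exercised with a stand-in function.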

The major steps in system initialization that phbl completes include:

  1. Initializing the processor: advancing the BSC through 16-bit real mode, 32-bit protected mode, and into 64-bit long mode with paging enabled

  2. Mapping and initializing the console UART (purely for its own output, for instance if an exception is encountered in early boot).

  3. Locating, reading, and decompressing the phase1 CPIO archive containing the host kernel image, relevant kernel modules, and other data. See RFD 281 for reference on the contents of the archive.

  4. Extracting the executable ELF image containing the host kernel from the archive read in (3), and loading it into RAM by copying the loadable segments from the binary image into physical memory and mapping them at their linked addresses

  5. Starting the host kernel by calling its ELF entry point

Virtual Memory Guarantees

phbl supports creating 1GiB, 2MiB, and 4KiB virtual memory mappings, and guarantees that a region of virtual memory will be mapped with the largest pages possible, given alignment and size constraints. For instance, a 4GiB region aligned on a 1GiB boundary will be mapped with four 1GiB pages, while a 4GiB region that starts on a 512MiB boundary that is not also a 1GiB boundary will be mapped with 256 2MiB pages, then three 1GiB pages, and then another 256 2MiB pages.
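The largest-page rule can be expressed as a short greedy loop. The sketch below is illustrative only: `plan_mappings` is a hypothetical name, and it merely computes which page sizes would be used for a region, without touching any page tables.

```rust
const KIB: u64 = 1 << 10;
const MIB: u64 = 1 << 20;
const GIB: u64 = 1 << 30;

/// Supported page sizes, from largest to smallest.
const PAGE_SIZES: [u64; 3] = [GIB, 2 * MIB, 4 * KIB];

/// Splits the region [start, start + len) into runs of (page_size, count),
/// always choosing the largest page that is aligned at the current address
/// and still fits in the remainder of the region.
/// `start` and `len` must be multiples of 4KiB.
pub fn plan_mappings(start: u64, len: u64) -> Vec<(u64, u64)> {
    assert!(start % (4 * KIB) == 0 && len % (4 * KIB) == 0);
    let end = start + len;
    let mut runs: Vec<(u64, u64)> = Vec::new();
    let mut addr = start;
    while addr < end {
        // A 4KiB page always qualifies because of the assertion above.
        let size = PAGE_SIZES
            .iter()
            .copied()
            .find(|&s| addr % s == 0 && end - addr >= s)
            .expect("a 4KiB page always fits");
        // Extend the previous run if it used the same page size.
        let extend = matches!(runs.last(), Some(&(s, _)) if s == size);
        if extend {
            runs.last_mut().unwrap().1 += 1;
        } else {
            runs.push((size, 1));
        }
        addr += size;
    }
    runs
}
```

Applied to the two examples in the text, a 4GiB region starting at 1GiB yields a single run of four 1GiB pages, and a 4GiB region starting at 512MiB yields 256 2MiB pages, three 1GiB pages, and 256 2MiB pages.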

The UART’s registers will be mapped uncached and write-through, while mappings to known RAM regions will be mapped cached and with the "writeback" memory type: that is, the PTE for the UART will have both the PCD and PWT bits set, while these will be cleared for mappings that point to RAM.

When mapping ELF segments for both the loader and host operating system, protection bits for each segment will be respected: if a segment is read-only, the 'R/W' bit in a corresponding PTE will be clear. Similarly, non-executable segments will have the NX bit set.
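As an illustration of the caching and protection rules above, the following sketch derives leaf PTE flag bits from ELF segment flags. The bit positions are the architectural ones for x86-64 PTEs and ELF program headers; the function names are hypothetical, and the assumption that the UART mapping is writable and non-executable is ours rather than something stated in this document.

```rust
// Architecturally defined x86-64 page table entry bits.
const PTE_P: u64 = 1 << 0; // present
const PTE_RW: u64 = 1 << 1; // writable ("R/W")
const PTE_PWT: u64 = 1 << 3; // page-level write-through
const PTE_PCD: u64 = 1 << 4; // page-level cache disable
const PTE_NX: u64 = 1 << 63; // no-execute

// ELF program header `p_flags` bits.
const PF_X: u32 = 1; // executable
const PF_W: u32 = 2; // writable

/// Flags for a leaf PTE that maps an ELF segment resident in RAM.
/// PCD and PWT stay clear so the page is cached with the writeback type.
pub fn ram_pte_flags(p_flags: u32) -> u64 {
    let mut bits = PTE_P;
    if (p_flags & PF_W) != 0 {
        bits |= PTE_RW; // read-only segments leave R/W clear
    }
    if (p_flags & PF_X) == 0 {
        bits |= PTE_NX; // non-executable segments get NX set
    }
    bits
}

/// Flags for a leaf PTE that maps the UART's MMIO registers: uncached,
/// with both PCD and PWT set. Writable and non-executable by assumption.
pub fn uart_pte_flags() -> u64 {
    PTE_P | PTE_RW | PTE_PWT | PTE_PCD | PTE_NX
}
```

For example, a text segment with flags `PF_R | PF_X` (value 5) produces a PTE with only the present bit set, while a data segment with `PF_R | PF_W` (value 6) is writable and non-executable.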

Modifications to System State

An explicit goal of phbl is to minimize changes to system state post-reset, leaving the host in as close to a pristine state as we can when the operating system takes over. Recalling that phbl runs only on the BSC, specific changes are limited:

  1. The interrupt-enable and direction indicator bits are cleared in the %FLAGS register, disabling interrupts and forcing repeated operations into ascending order. Note this is architecturally guaranteed on reset, but is so critical we repeat the operations explicitly from the reset vector.

  2. Caching is enabled

    • The CD and NW bits in %cr0 are cleared

    • The default MTRR type is set to writeback

  3. Protected mode (and then long mode) are enabled

    • A GDT is created with the following segments:

      1. A NULL segment (required)

      2. A 64-bit code segment (required for long mode)

      3. A 32-bit code segment (required for 32-bit mode)

      4. A 32-bit data segment (required for 32-bit mode)

    • The BSC’s GDTR is loaded with a reference to this GDT

    • The PE bit is set in %cr0 to enable protected mode

    • The segmentation registers are modified to reflect the state of the processor as it transitions from 16-bit real mode to 32-bit protected and then 64-bit long mode: data and stack segments refer to a 32-bit data segment when in 32-bit protected mode, and are cleared on entry to 64-bit mode.

  4. Paging is enabled with 4-level page tables

    • The PAE bit is set in %cr4

    • Page tables are created and %cr3 is loaded with the physical address of a PML4.

    • The LME and NX bits are set in the EFER MSR.

    • The PG and WP bits are set in %cr0

  5. The CPU is put into 64-bit long mode

    • Ultimately, this is the result of a "long" jump to a 64-bit code segment after the above steps.

    • %CS is set to point to a 64-bit code segment

    • %DS, %ES and %SS are cleared

  6. %rsp is loaded with a valid stack pointer owned by the loader and transferred to the host operating system.

  7. Some amount of RAM is allocated to the loader. This contains the loader image (text, data, etc) and space for the stack and page tables that are transferred to the host operating system.

  8. The UART is initialized

    • It is unconditionally set to 3Mbps, 8 data bits, no parity, one stop bit, with FIFOs enabled, by modifying the relevant device registers.

    • Data, mostly informational, is transmitted on the UART.

  9. An IDT is set up to catch exceptions in the loader, and the IDTR is loaded with a reference to that IDT. Exception handlers write to the UART.

  10. Some amount of RAM is used to map and load the host operating system kernel and CPIO archive; we guarantee that this is real RAM and is below the 4GiB mark. The host kernel is mapped at its linked addresses, referring to physical addresses that are themselves taken from the kernel executable image: contents of loadable ELF segments are copied into RAM at these addresses, as opposed to referring to the image in the CPIO archive. The CPIO archive is simply resident in physical memory and phbl passes the location and size of the archive to the host OS.

  11. The host kernel BSS is zeroed by clearing each segment’s full in-memory size, as specified in the ELF program header.
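The last two items, copying loadable segments and zeroing the BSS, can be sketched together. This is a simplified model and not phbl's implementation: the `Segment` struct and `load_segment` function are hypothetical, and a mutable byte slice stands in for the physical memory at the segment's load address.

```rust
/// The fields of an ELF64 program header that matter for loading.
pub struct Segment {
    pub offset: usize, // p_offset: where the segment's bytes start in the image
    pub filesz: usize, // p_filesz: number of bytes present in the image
    pub memsz: usize,  // p_memsz: number of bytes occupied in memory
}

/// Loads one segment from `image` into `dest`, which models the RAM at the
/// segment's physical address. The whole in-memory extent is cleared first,
/// so any bytes beyond `filesz` (the BSS) end up zeroed.
pub fn load_segment(
    image: &[u8],
    seg: &Segment,
    dest: &mut [u8],
) -> Result<(), &'static str> {
    if seg.filesz > seg.memsz {
        return Err("p_filesz exceeds p_memsz");
    }
    if dest.len() < seg.memsz {
        return Err("destination is too small for the segment");
    }
    let end = seg
        .offset
        .checked_add(seg.filesz)
        .ok_or("segment range overflows")?;
    let src = image
        .get(seg.offset..end)
        .ok_or("segment lies outside the image")?;
    // Clear memsz bytes, then copy the filesz bytes that the image provides.
    dest[..seg.memsz].fill(0);
    dest[..seg.filesz].copy_from_slice(src);
    Ok(())
}
```

Clearing the full extent before copying is slightly redundant for the file-backed part, but it keeps the logic simple and guarantees that no stale memory contents survive inside the segment.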

State on Entry to the Host Operating System

The following describes the state of hardware and virtual memory on entry to the host OS:

  • The BSC is in 64-bit long mode with paging enabled

  • All APs remain in their reset state

  • The IF and DF bits are cleared in %rflags

  • %cr0 has paging and protected modes enabled, WP set to enforce page-level write-protection, and CD and NW are cleared

  • %cr3 contains the physical address of a valid, properly aligned PML4 that is in RAM below 4GiB.

  • %cr4 has the PAE bit enabled

  • EFER has the LME and NX bits set

  • IDTR refers to a valid IDT

  • GDTR refers to a valid GDT

  • %cs refers to a valid 64-bit code segment

  • %ds, %es, and %ss are cleared

  • %fs and %gs are in their reset state

  • The IA32_MTRR_DEF_TYPE MSR is set to 0x06 ("writeback")

  • The loader itself is identity mapped, with permissions on its segments reflecting the segment’s type. For instance, the text segment is mapped read-only, while the data, BSS, and rodata segments are not executable

  • The UART MMIO registers are identity mapped, uncached

  • The host OS kernel image is mapped at its linked address, and its segments are copied to the physical addresses specified in the kernel ELF executable. Segment permissions from the ELF program headers are respected in the virtual mappings

  • The stack is identity mapped and is 32KiB long

  • The pages making up the page table radix tree are identity mapped and physically contiguous.

  • Leaf PTEs mapping the host operating system kernel at its linked addresses have bit 11 set, in accordance with the contract with the host OS as documented in RFD 215

  • The PSP has accessed the flash to load phbl, which contains the CPIO archive, from the BHD

  • The loader guarantees that spans of virtual memory are mapped using the largest applicable page size, given alignment and size constraints. ELF segments are unconditionally mapped with minimum 2MiB pages.

  • Contents of stack RAM are indeterminate

  • Contents of the general purpose registers are indeterminate

  • The stack pointer refers to a frame pointing to the loader’s main entry point

  • The loader calls the ELF entry point for the host OS (as opposed to making a jump), so the BSC pushes a valid instruction pointer onto the stack and the host OS could, potentially, return to the loader: should this ever happen, we immediately panic
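Several of the register-level guarantees listed above can be written down as a checkable predicate. The sketch below is a hedged illustration: the `EntryState` struct and `check_entry_state` function are hypothetical and operate on captured register values, not on live hardware. The bit positions are the architectural ones; note that the bit this document calls "NX" in EFER is named NXE in the architecture manuals.

```rust
const CR0_PE: u64 = 1 << 0; // protected mode enable
const CR0_WP: u64 = 1 << 16; // write protect
const CR0_NW: u64 = 1 << 29; // not write-through
const CR0_CD: u64 = 1 << 30; // cache disable
const CR0_PG: u64 = 1 << 31; // paging
const CR4_PAE: u64 = 1 << 5; // physical address extension
const EFER_LME: u64 = 1 << 8; // long mode enable
const EFER_NXE: u64 = 1 << 11; // no-execute enable
const RFLAGS_IF: u64 = 1 << 9; // interrupt flag
const RFLAGS_DF: u64 = 1 << 10; // direction flag
const MTRR_TYPE_WB: u64 = 0x06; // "writeback" memory type

/// A snapshot of the registers covered by the entry contract.
pub struct EntryState {
    pub cr0: u64,
    pub cr3: u64,
    pub cr4: u64,
    pub efer: u64,
    pub rflags: u64,
    pub mtrr_def_type: u64,
}

/// Returns Ok(()) if the snapshot satisfies the register-level guarantees.
pub fn check_entry_state(s: &EntryState) -> Result<(), &'static str> {
    let cr0_set = CR0_PE | CR0_WP | CR0_PG;
    if (s.cr0 & cr0_set) != cr0_set {
        return Err("cr0: PE, WP, and PG must all be set");
    }
    if (s.cr0 & (CR0_CD | CR0_NW)) != 0 {
        return Err("cr0: CD and NW must be clear");
    }
    // Mask off the low 12 bits to recover the PML4's physical address.
    if (s.cr3 & !0xfff) >= (1 << 32) {
        return Err("cr3: PML4 must be below 4GiB");
    }
    if (s.cr4 & CR4_PAE) == 0 {
        return Err("cr4: PAE must be set");
    }
    if (s.efer & (EFER_LME | EFER_NXE)) != (EFER_LME | EFER_NXE) {
        return Err("efer: LME and NXE must be set");
    }
    if (s.rflags & (RFLAGS_IF | RFLAGS_DF)) != 0 {
        return Err("rflags: IF and DF must be clear");
    }
    if (s.mtrr_def_type & 0xff) != MTRR_TYPE_WB {
        return Err("default MTRR type must be writeback");
    }
    Ok(())
}
```

A check of this shape could be useful in a test harness or emulator that inspects machine state at the moment of handoff; it does not cover the mapping-related guarantees, which require walking the page tables.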

Current Implementation Details Possibly Subject to Future Change

phbl does not access the host SPI flash to retrieve the initial boot archive; at present, the compressed archive is compiled into the phbl image as a byte array. However, it could be stored separately, and we believe that if we ever need to, we can change this with no backwards compatibility concerns.

The observation here is that we treat flash holistically: updating phbl requires rewriting all of flash, including a hypothetically separate boot archive. Similarly, if a separate host OS image distinct from the phbl image in flash were to be updated, we’d necessarily rewrite phbl in flash as well. In other words, updates of any component in flash are inherently tied to all others, so if we determine that it is advantageous to separate these at some point in the future, we can do so with no significant backwards compatibility concerns.

Open Questions

  • The path name of the kernel image is hard-coded. Should we parameterize it in some way instead?

Security Considerations

We assume that the service processor (SP) is running software validated by the root of trust (ROT), and that the SP will validate the host images in flash before allowing the PSP to begin starting the system; thus, phbl needs to do no runtime verification of the host operating system, deferring that to the SP.

Note that this leaves us open to a time-of-check, time-of-use race, but there’s no way for us to reasonably detect, let alone respond to, such an attack.

Refer to RFDs 75 and 216 for discussion of these issues.

Glossary of Terms and Abbreviations

AMD

Advanced Micro Devices. Company.

BIOS

Basic Input/Output System. Archaic name for firmware, commonly used in AMD documentation. Note that Oxide systems do not use a "BIOS" in the traditional sense.

BIOS Directory Table

A table in flash that describes where code and data required to boot (e.g., the BIOS) are located. Abbreviated "BHD". For example, the BHD points to the BIOS reset vector, so that the PSP may load it into RAM once DRAM has been trained and arrange for the BSC to begin execution in it. Ref [amdpsp].

BHD

See "BIOS Directory Table". Ref [amdpsp]

BSC

BootStrap Core. The first CPU core brought online when the x86 cores come out of reset. Executes the reset vector code. Responsible for loading and starting the host operating system and ultimately bringing the other CPUs online. Ref [amd64].

CD

Cache Disable. Bit in %cr0. If the CD bit is set, caching is disabled on that logical processor. Ref [sdm] and [amd64]

CPIO

A portmanteau of "cp" and "in and out": CoPy In and Out. A Unix file archiver program, and also the name of the archiver’s file format, as in "CPIO files" or "a CPIO archive".

CR

Control Register. Architecturally defined by x86 to control various aspects of the CPU’s operation. Control registers used by phbl include %cr0, %cr4 and %cr3.

DF

Direction flag. Bit in %FLAGS. If clear, string operations increment their source and destination registers; if set, source and destination are automatically decremented.

DRAM

Dynamic Random Access Memory. Readable and writeable memory where each bit is built from a single transistor and a capacitor. Capacitor leakage currents require each bit to be periodically refreshed, and "DRAM timing training" or just "DRAM training" is the process of determining the refresh interval. DRAM is somewhat slower than SRAM (Static RAM), but it is cheaper and denser. The vast majority of system RAM is DRAM.

EFER

Extended Feature Enable Register. An architecturally defined MSR that controls extended features, such as 64-bit mode and the validity of the NX bit in page table entries.

EFS

Embedded Firmware Structure (called the "Embedded Firmware Signature" in some AMD documentation; this appears to be an error). A data structure at a fixed address in the "offchip boot storage media" (e.g., flash) that "contains data required to boot the system." E.g., it points to the BIOS reset vector (phbl) and so forth.

ELF

Executable and Linkable Format. The object file format used by the host kernel image.

FLAGS

An architecturally defined x86 register that holds both flags resulting from arithmetic operations (for instance, a Z bit indicating that a recent arithmetic operation yielded a zero result, or that two values compared equal), as well as control operations (for instance, whether interrupt delivery is enabled on the CPU).

GDT

Global Descriptor Table. An architecturally defined, RAM-resident table of entries describing "segments". Required for 32-bit and 64-bit processor modes.

GDTR

Global Descriptor Table Register. A hardware register containing a base linear address and bound that point to the GDT.

GiB

Gibibyte. A power of two (2^30) as opposed to gigabyte, which is a power of ten (10^9).

IF

Interrupt Flag. A bit in the %FLAGS register indicating whether interrupt delivery is enabled on the CPU. If set, interrupt delivery is enabled; if clear, interrupt delivery is disabled.

IDT

Interrupt Descriptor Table. An architecturally defined, RAM-resident table of entries describing how to handle interrupts and exceptions. Indexed by interrupt/exception vector number.

IDTR

Interrupt Descriptor Table Register. A hardware register that contains a base linear address and bound that point to the IDT.

KiB

Kibibyte. A power of 2 (2^10) as opposed to kilobyte, which is a power of ten (10^3).

LME

Long Mode Enable. A bit in the EFER MSR. Puts the CPU into ia32e mode, allowing it to make a long-jump to 64-bit mode.

MiB

Mebibyte. A power of 2 (2^20) as opposed to megabyte, which is a power of ten (10^6).

MMIO

Memory-Mapped IO. Regions of address space that map device registers and are used for IO, as distinct from programmed IO.

MMU

Memory Management Unit. A per-CPU hardware device that uses RAM-based tables to relocate and map virtual addresses to physical addresses.

MSR

Model-specific register. A register in an integer-indexed namespace that provides programmatic access to a variety of system functions.

MTRR

Memory Type Range Register(s). A set of MSRs that describes the "memory type" of regions in the physical address space.

Host OS

Host Operating System. The helios image that manages the physical hardware.

PE

Protection Enable. Bit in %cr0. When set, the CPU is in 32- or 64-bit "protected mode" and memory references are made with respect to the segment descriptors in the GDT.

PG

Paging. Bit in %cr0. When set, paging is enabled and memory accesses on the logical processor are mediated by the page tables configured in the MMU.

PHBL

Pico Host Boot Loader. The software component in the Oxide architecture that is run from the reset vector and that loads and starts the host operating system.

PML4

Page Map Level 4. An architecturally defined level in the page table hierarchy. The root of the page table radix tree when using 4-level paging. The physical address of a valid PML4 must be in %cr3 for paging to work properly.

PSP

Platform Security Processor. An AMD-specific "secure execution context" embedded in the x86 SoC. Contains an ARM-based microprocessor on-die that is responsible for DRAM training, loading the reset vector from flash, etc.

NW

Not Write-through. Bit in %cr0. When set along with CD, memory cache-coherency is not maintained.

NX

Non-executable; sometimes "not-executable". A bit in a PTE that prevents the CPU from executing instructions on the page that the PTE maps.

PAE

Physical Address Extension. Bit in %cr4.

PTE

Page Table Entry. An architecturally defined, word-sized data structure in the paging radix tree. Note that architecture documentation ascribes a specific meaning to the term "PTE": an entry at the 4th level of the tree that maps a 4KiB page. We tend to use the term more generically, and use the specific term "leaf PTE" when referring to a leaf entry in the tree.

ROT

Root of Trust

Segmentation Registers

Architecturally defined x86 registers that are related to the segmented memory model. The segment registers phbl uses are %CS (code segment), %DS (data segment), %ES (extra segment, used by string operations), and %SS (stack segment). In 16-bit mode, these are unchanged from their reset values. In 32-bit mode, these become indices into the GDT, and %CS points to a 32-bit code segment while %DS, %ES and %SS point to a 32-bit data segment covering the first 4GiB of the physical address space. In 64-bit mode, %CS points to a 64-bit code segment while the rest are cleared. Architecturally there exist two more: %FS and %GS. These are unused and ignored in phbl.

SoC

System on a Chip.

SP

Service Processor

UART

Universal Asynchronous Receiver/Transmitter. A hardware device for asynchronous serial communication.

WP

Write Protection. Bit in %cr0. When set, the CPU honors page-level write protection bits in PTEs in kernel mode. When cleared, page-level read/write permissions are ignored in kernel mode.

External References

  • [amdpsp] AMD Platform Security Processor BIOS Architecture Design Guide for AMD Family 17h and 19h Processors. Advanced Micro Devices. Pub no 55758. Rev 1.11, August, 2020.

  • [amd64] AMD64 Technology: AMD64 Architecture Programmer’s Manual, Vol 2: System Programming. Advanced Micro Devices. Pub no 24593. Rev 3.38. November, 2021.

  • [sdm] Intel 64 and IA-32 Architectures Software Developer’s Manual, Vol 3A: System Programmer Guide, Part 1. Intel Corporation. June, 2021.