26 - Host Operating System & Hypervisor / RFD / Oxide

Authors

Joshua M. Clulow, Patrick Mooney, Robert Mustacchi

Note

This RFD was the subject of an Oxide and Friends episode, Helios.

Servers in the Oxide Rack will form the backbone of both the compute and storage services available to customers. The rack will be composed of a variety of hardware and software components, some of which represent resource-constrained or inflexible environments; e.g., the Service Processor (SP) or Hardware Root Of Trust (ROT). In contrast, the server CPU is a cornucopia: software update delivery is easy when compared to device firmware; rich postmortem debugging and live tracing or instrumentation facilities are feasible; the full power of modern programming platforms like Rust is available. This RFD will explore options for the software stack that runs atop the host CPU: the Operating System (OS) and the Virtual Machine Monitor (VMM).

Hypervisor Choices

When choosing an open source VMM, there are two primary choices:

KVM (on GNU/Linux)
bhyve (on illumos)

Not under consideration:

Xen: Large and complicated (by dom0) codebase, discarded for KVM by Amazon
VirtualBox: C++ codebase makes it unpalatable to work with
Emerging VMMs (OpenBSD’s vmm, etc): Haven’t been proven in production

In addition to the choice of a kernel VMM (the component which handles privileged VM context state, nested paging, etc) and its accompanying OS, attention should also be paid to the userspace portion of providing a VMM implementation. This component handles much of the device emulation state as well as tasks like live migration.
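To make the userspace half of that split concrete, here is a minimal sketch in Rust of how a userspace component might route guest port I/O accesses to emulated devices. Every name in it (PioDevice, PioBus, Latch) is hypothetical and is not taken from any existing VMM. A real implementation would receive these accesses as exits from the kernel VMM and would also handle MMIO, interrupts, and access widths other than one byte.

```rust
use std::collections::BTreeMap;

/// Hypothetical interface for a userspace-emulated device reached via
/// x86 port I/O. Offsets are relative to the device's base port.
pub trait PioDevice {
    fn read(&mut self, offset: u16) -> u8;
    fn write(&mut self, offset: u16, value: u8);
}

/// A toy device that remembers the last byte written to it.
#[derive(Default)]
pub struct Latch {
    value: u8,
}

impl PioDevice for Latch {
    fn read(&mut self, _offset: u16) -> u8 {
        self.value
    }
    fn write(&mut self, _offset: u16, value: u8) {
        self.value = value;
    }
}

/// Maps port ranges (base port -> length and device) to emulated devices.
#[derive(Default)]
pub struct PioBus {
    devices: BTreeMap<u16, (u16, Box<dyn PioDevice>)>,
}

impl PioBus {
    pub fn register(&mut self, base: u16, len: u16, dev: Box<dyn PioDevice>) {
        self.devices.insert(base, (len, dev));
    }

    /// Find the device whose range contains `port`, if any.
    fn lookup(&mut self, port: u16) -> Option<(u16, &mut Box<dyn PioDevice>)> {
        // The candidate is the last range starting at or below `port`.
        let (base, (len, dev)) = self.devices.range_mut(..=port).next_back()?;
        let offset = port - *base;
        if offset < *len {
            Some((offset, dev))
        } else {
            None
        }
    }

    /// Handle a guest `in` instruction; unclaimed ports read as 0xff.
    pub fn handle_in(&mut self, port: u16) -> u8 {
        match self.lookup(port) {
            Some((offset, dev)) => dev.read(offset),
            None => 0xff,
        }
    }

    /// Handle a guest `out` instruction; returns false if no device claimed it.
    pub fn handle_out(&mut self, port: u16, value: u8) -> bool {
        match self.lookup(port) {
            Some((offset, dev)) => {
                dev.write(offset, value);
                true
            }
            None => false,
        }
    }
}
```

Keeping the dispatch table and the devices behind a narrow trait like this is one way to keep each emulated device small and auditable, which matters given how often device emulation has been a source of bugs.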

KVM on GNU/Linux

Of the existing open source VMMs, KVM is the most feature rich. It includes support for features such as:

nested virtualisation
processor feature masking and emulation
VMM paravirt device emulation (Hyper-V time facilities, for example)
PCI-passthru
live migration

The most popular userspace component to complement KVM is QEMU. Built as a general-purpose emulator supporting a vast array of platforms and devices, QEMU is able to rely on KVM for the "hard" parts of x86 virtualisation while exposing emulated devices to the running guest. The wide collection of supported devices comes at a cost: QEMU is complex and often the subject of bugs affecting its reliability and security. For that reason, QEMU is outside our scope of consideration for a VMM userspace.

Google developed crosvm, a Rust-based userspace component using KVM, to allow for the use of other operating systems in the Chromebook environment. As part of their Firecracker project, Amazon adapted crosvm to produce a small and focused VMM userspace in Rust. With much of that now contained in rust-vmm, a series of Rust crates designed to perform emulation for a limited set of devices, several userspace VMMs have sprung up from that code lineage, including Cloud Hypervisor from Intel. While not all of the features we require are present in the rust-vmm crates, Firecracker, or Cloud Hypervisor, this code lineage could serve as a good base to build what we need, while preserving the desired level of safety and performance.

bhyve on illumos

bhyve began as a de novo VMM developed on FreeBSD. Unlike some of its predecessors, it was designed to require the HVM-accelerating features (VT-x, EPT) present in modern CPUs. This freed it from needing to support older and more onerous emulation logic such as shadow page tables. Between this and its lack of nested virtualisation support, the codebase is presently much smaller and more manageable than KVM.

Simpler codebase (due to requiring modern hardware features to function)
Most of its testing and production use has been on Intel CPUs
Limited PCI-passthru support (reportedly iffy on AMD)

Presently, bhyve utilizes a custom C-based userspace. While it is much smaller in scope than QEMU, the complexity of device emulation has caused it to be a source of bugs (security and otherwise) in the past. The rust-vmm crates do not presently have direct support for the bhyve kernel VMM, but its structure is not too different from KVM (for which the crates were designed), which makes porting them onto bhyve a plausible option if we were to want a Rust-based userspace on that platform as well.

Note: While FreeBSD is where bhyve originated, and it remains the upstream repository of record for it, we are not considering it as a potential OS choice. There is not a significant difference in functionality between the illumos and FreeBSD implementations, since pulling patches downstream has not been a significant burden. Conversely, the more advanced OS primitives in illumos have resulted in certain bugs being fixed only there, having been difficult to upstream to FreeBSD.

Goals and Exploration

It may be possible to consider the OS and the VMM in isolation, but in practice these decisions are at least somewhat related. As we are exploring the specific bodies of software at hand, we should consider in depth the goals that we have for our OS and VMM.

Rust as a First Class Citizen

Most popular operating systems today are written chiefly or entirely in the C programming language. While it is certainly possible to write safe C at some cost, there is mounting evidence to suggest that from time to time, some expense has been spared. While we are not setting out to proactively rewrite the operating system in Rust, we are interested in exploring new venues for the use of safe and powerful languages like Rust in systems programming.

Rust is seeing increasing use in the embedded space, and we at Oxide are working on our own microcontroller operating environment in the form of Hubris. It seems likely that Rust would be a good language for implementing system daemons and tools in a larger scale UNIX system, as well as system libraries that need to expose a C ABI. In a modular kernel, one can easily imagine writing a new NIC driver or packet filtering policy engine in Rust.

A little under fifty years ago, UNIX made the transition from being mostly assembly language to mostly C: a demonstrably higher level language with a commensurate increase in robustness, portability, and engineering productivity. The future is unwritten, but Rust presents an exciting prospect for the basis of the next fifty years of UNIX development.

Whichever platform we choose to build on, we are committing ourselves to some degree of maintenance work. That said, all choices we would consider here have an active upstream project with which we should try to maintain a relationship. As we explore the implementation of new and increasingly critical subsystems in a language other than C, we should consider our path to getting that work included in the upstream project. There are several key points in favour of illumos on the Rust front:

no opposition to kernel modules created and maintained outside of the core source tree or with a different licence; indeed, extending so far as to provide a stable API/ABI for many core facilities required for device drivers
unlike with Linux, there are few if any instances of conditional compilation used to change the shape of the kernel for different use cases; use of more complex C interaction tools like bindgen would not be required to maintain a set of crates that interact with kernel primitives
smaller community with a set of values relatively tightly aligned with those we have seen in the Rust community; should be an easy sell to produce new core OS components (e.g., new drivers, programs, and libraries) in Rust within the base system

There have also been early positive developments for Rust in the Linux kernel. In early July 2020, a Linux kernel mailing list thread on in-tree Rust support arranged a discussion at the upcoming Linux Plumbers Conference. That discussion occurred in August 2020, and was reportedly positive in nature. As of March 2021, work on a prototype for writing Linux drivers in Rust is happening in the linux-next tree.

Guest Facilities

What do we need to support in guests?

Operating Systems

We expect that customers will chiefly use Linux and Windows guests in an Oxide Rack. At least initially, only operating systems that support UEFI booting will be supported.

Emulated Devices

Guest workloads will expect some minimum set of emulated devices to be available in order to function properly. Beyond the normal x86 PC architecture devices (APIC, PIT, etc), we’ll need interfaces to block storage and network connectivity. The virtio device model has been popular and is (mostly) well supported by modern OSes. There are some questions around the quality of drivers available for Windows on that front.

For block storage, it would be advantageous to support API-driven hot insertion and hot removal of volumes from the guest. The PCI-centric device models (pci-virtio-block and NVMe) make this more challenging than something like virtio-scsi. It’s not clear to what extent such functionality is a requirement.

Most OSes utilize the x86 TSC for timekeeping, when it’s available. Windows is a notable standout here, in that it prefers the HPET. We may need to implement some portion of the Hyper-V paravirtualized device interface, so that Windows guests can use it (and the TSC) for time.

Under certain conditions, Linux guests are known to boot slowly as they wait for a slowly-filling entropy pool. It may be advantageous to provide a virtio-rng device for them to alleviate the dearth of boot time entropy.

Out-of-band Management

Most access to the guest operating system will likely be through idiomatic remote access systems provided by the OS itself; e.g., SSH for Linux, and Remote Desktop Protocol (RDP) for Windows. As with a physical server, in the event that a guest is malfunctioning the user may want out-of-band access to the (virtual) console of the guest.

Console access may also be important when using operating system install media (e.g., ISO images) to create a virtual machine template image from scratch. In addition to console operation, the ability to "insert" an ISO file into an emulated CD-ROM drive or a raw disk image into an emulated USB drive may also be needed for this use case.

To make this work we’d need to emulate a frame buffer, and a workstation keyboard and mouse. Our guest firmware package will need support for providing those things to the guest OS. The emulated frame buffer would need to be able to be connected from the server to the remote user via some protocol like VNC or SPICE.

Nested Virtualisation

Nested virtualisation is not considered an essential feature at this time. While it may be useful for CI/CD workloads, and can be convenient for hypervisor development, it is challenging to emulate the underlying interfaces with flawless fidelity. That fact, combined with its dreadful performance for more production-facing workloads, makes it less attractive as a focus for the first product.

Live Migration

Live migration of virtual machines (commonly known as vMotion in VMware), where the state of a running VM is moved between physical machines without downtime or any other significant perceived impact, is considered a necessary feature of our product. It allows operators to migrate workload within the greater system to balance capacity and mitigate the impact of planned maintenance. The cost of software updates or hardware maintenance on a node is greatly reduced if the instances can be drained from it first.

Security

Over the last 2+ years, a great deal of mistrust has been fostered over the ability (or lack thereof) of modern microprocessors to adequately isolate shared resources in the face of speculative execution and, more broadly, microarchitecture-focused attacks. Faced with these challenges, the OS must be flexible and provide the tools required (core-aware scheduling, vertical threads) in order to make a genuine attempt to mitigate those attacks.

Control Plane Support Facilities

Virtual machines are the most prominent customer-visible resource that the Oxide Rack will provide, but there are a number of critical internal software components that will form the basis for delivery of that service; e.g., highly available distributed block storage ([rfd29]), the user facing API ([rfd4], [rfd21]), and the control plane as a whole ([rfd48]). This additional software will almost certainly run on the same servers as the hypervisor, and thus facilities we might use to build it merit some consideration here. Facilities for support and debugging of that software are discussed in a later section.

Isolation and Sandboxing

Modern programming environments like Rust provide a high degree of memory safety, at least when used in their "safe" mode. As a result, a variety of common bugs in older software environments are generally avoided by construction. That said, bugs in the compiler or core libraries, or in "unsafe" code, can still result in exploitable defects. In addition, memory safety and even an advanced type system cannot completely preclude logic errors that exist in otherwise well-formed programs; e.g., a network service might incorrectly treat its input and allow a request to effect some unforeseen privilege escalation. As a result, we would like strong sandboxing facilities to allow us to balance the need for co-location of unrelated processes with the need for isolation between them.

As a minimum unit of isolation, the process model shared by all modern UNIX platforms affords us an isolated address space for a body of related code. Security critical components could conceivably be split into multiple separate processes which communicate through IPC mechanisms; e.g., a network server might handle incoming requests and forward instructions to a separate process for managing a virtual machine. Emulated guest requests for HTTP Metadata or DNS may be processed in yet another separate process. These processes should be able to opt out of certain privileged operations; e.g., the guest metadata service need not be able to fork new processes or read local files, while the virtual machine manager might not need to make any network requests at all. To make privilege separation between processes work well, we would need observable, robust IPC mechanisms with a reasonable performance profile.

Consider two different classes of workload: first, a customer virtual machine; and second, a storage server program that forms part of our network block storage system. Each of these represents a different, but well-defined domain of security, functionality, and life cycle. These domains would benefit from being executed within an isolated partition where visibility of other domains is limited or completely obscured. From a life cycle perspective, the customer domain might be ephemeral, brought into existence by the control plane only while the VM is running on a particular server. In contrast, the network storage software is responsible for owning some amount of data, and will persist between server reboots.

In addition to limiting cross-domain visibility, we will likely want to use quotas or caps to limit the blast radius of a runaway software component (e.g., one leaking memory) and provide more predictable performance for customer workloads. Classes of limit we may wish to consider include:

customer virtual machines pinned to a specific set of host CPUs
internal services limited to a specific subset of CPU cores, which likely does not overlap with those used for customer workloads
limits on memory resource usage
limits on file system space usage in any shared pool of storage (the network storage service may deal more directly with entire NVMe disks)
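As a rough illustration of how a control plane component might reason about such limits before placing workloads, the following Rust sketch checks that a set of domains uses disjoint CPU sets and does not overcommit host memory. The types (DomainLimits, PlanError) and the strict no-overlap, no-overcommit policy are assumptions made for the example; the text above only says that overlap is unlikely, not that it is forbidden.

```rust
use std::collections::BTreeSet;

/// Hypothetical description of the limits placed on one domain
/// (a customer VM or an internal service).
#[derive(Debug, Clone)]
pub struct DomainLimits {
    pub name: String,
    pub cpus: BTreeSet<u32>,
    pub memory_bytes: u64,
}

#[derive(Debug, PartialEq)]
pub enum PlanError {
    CpuOverlap { a: String, b: String, cpu: u32 },
    MemoryOvercommit { requested: u64, available: u64 },
}

/// Check a placement plan under two assumed policies: no host CPU is
/// shared between domains, and the memory caps fit within host memory.
pub fn validate(domains: &[DomainLimits], host_memory: u64) -> Result<(), PlanError> {
    for (i, a) in domains.iter().enumerate() {
        for b in &domains[i + 1..] {
            // Report the lowest-numbered CPU claimed by both domains.
            if let Some(cpu) = a.cpus.intersection(&b.cpus).next() {
                return Err(PlanError::CpuOverlap {
                    a: a.name.clone(),
                    b: b.name.clone(),
                    cpu: *cpu,
                });
            }
        }
    }
    let requested: u64 = domains.iter().map(|d| d.memory_bytes).sum();
    if requested > host_memory {
        return Err(PlanError::MemoryOvercommit {
            requested,
            available: host_memory,
        });
    }
    Ok(())
}
```

The operating system would still need to enforce these limits at runtime; a check like this only catches a bad plan before it is applied.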

Service Supervision

One level up from the process itself, the more abstract notion of a "service" may be composed of several related processes. A fault in any process may require a restart of all related processes. The supervisor will be responsible for restarting a software component that has failed, and may need to track and alarm on additional degraded states for software that appears to be failing persistently.

The supervisor should be able to track related processes, such as if a program creates a child process to handle some helper task or some set of requests. Child processes that form part of a unit should not be able to "escape" their relationship with other processes in a unit, even through acts like double forking.

The service supervisor will be another source of telemetry for the broader stack. Ideally it will report software faults with a similar set of structured information and through the same channels as faults in hardware devices.
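The restart and degraded-state tracking described in this section can be sketched as a small state machine. The Rust below is illustrative only: the state names, the sliding-window policy, and the Supervisor type are assumptions for the example, not a description of any existing supervisor. Time is passed in explicitly (as a Duration since an arbitrary start) so the logic can be exercised without a clock.

```rust
use std::collections::VecDeque;
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ServiceState {
    /// Running, with at most one recent failure.
    Online,
    /// Restarted more than once recently; worth alarming on.
    Degraded,
    /// Failing persistently; no further automatic restarts.
    Maintenance,
}

pub struct Supervisor {
    max_failures: usize,
    window: Duration,
    recent: VecDeque<Duration>,
    state: ServiceState,
}

impl Supervisor {
    pub fn new(max_failures: usize, window: Duration) -> Self {
        Supervisor {
            max_failures,
            window,
            recent: VecDeque::new(),
            state: ServiceState::Online,
        }
    }

    pub fn state(&self) -> ServiceState {
        self.state
    }

    /// Record that some process in the service failed at time `now`.
    /// Returns true if the whole service should be restarted.
    pub fn on_failure(&mut self, now: Duration) -> bool {
        if self.state == ServiceState::Maintenance {
            return false;
        }
        // Forget failures that have aged out of the window.
        while let Some(&oldest) = self.recent.front() {
            if now.saturating_sub(oldest) > self.window {
                self.recent.pop_front();
            } else {
                break;
            }
        }
        self.recent.push_back(now);
        if self.recent.len() > self.max_failures {
            self.state = ServiceState::Maintenance;
            return false;
        }
        self.state = if self.recent.len() > 1 {
            ServiceState::Degraded
        } else {
            ServiceState::Online
        };
        true
    }
}
```

Each state transition in a structure like this is a natural point at which to emit a structured telemetry event.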

Comparison: Service Management Facility (SMF) & systemd

The Service Management Facility (SMF) is responsible for the supervision of services under illumos. It was introduced with Solaris 10 in 2005. SMF provides a number of facilities for managed services:

Two levels of service object: the service (like a class), and the instance (a specific instantiation of a particular service, called default for the trivial case of a single-instance service).
A typed key-value store with transactions, allowing properties to be attached at the service or instance level; utilises a basic inheritance structure so that common properties can be defined at the service level.
Parallel start of services once dependent services are running, as well as configurable event-driven restart of related services as required.
A set of ergonomic commands for management and monitoring of services, svcadm(1M) and svcs(1), and to ease access to values from the property store in service startup scripts, svcprop(1).
A stable library interface, libscf(3LIB), for the creation, configuration, monitoring, and control of services.
Automatic capture of stdout and stderr output into rotated log files.
Integration with the Fault Management Architecture for reporting on software level faults.
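The two-level service and instance model, with properties inherited from the service level, can be illustrated with a toy data structure. This Rust sketch is only loosely modelled on SMF: the Service, Instance, and Value types are hypothetical, the property names are made up, and transactions are not modelled at all. It is not the libscf(3LIB) interface.

```rust
use std::collections::HashMap;

/// A small subset of typed property values, for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Astring(String),
    Count(u64),
    Boolean(bool),
}

/// A specific instantiation of a service, with its own overrides.
#[derive(Default)]
pub struct Instance {
    pub properties: HashMap<String, Value>,
}

/// The service level: common properties plus named instances.
#[derive(Default)]
pub struct Service {
    pub properties: HashMap<String, Value>,
    pub instances: HashMap<String, Instance>,
}

impl Service {
    /// Resolve a property for an instance: a value set on the instance
    /// wins; otherwise fall back to the value set on the service.
    pub fn effective(&self, instance: &str, prop: &str) -> Option<&Value> {
        let inst = self.instances.get(instance)?;
        inst.properties
            .get(prop)
            .or_else(|| self.properties.get(prop))
    }
}
```

With this shape, a single-instance service needs only a "default" instance with no properties of its own, and everything is read from the service level.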

SMF is built in large part upon process contracts from the contract subsystem (contract(4), libcontract(3LIB), uts/common/os/contract.c, etc). The classical UNIX process model presents a number of challenges for service management; e.g., the reparenting of processes when they outlive their parent, a behaviour often utilised as part of the startup sequence for long-running service processes. Process contracts provide a separate hierarchy for tracking the relationship between related processes in the system, providing a robust way to track whether a service is still running on a system. The subsystem also provides a variety of mechanisms for reporting on faults (e.g., fatal signals, core files, etc) and other life cycle events.

SMF integrates well with other parts of the system, including the privileges(5) subsystem. Service instances can be configured to start not only with specific credentials (users and groups) but with a specific subset of privileges. For example, a process could be allowed to bind a listen socket with a port number lower than 1024 — the classical "secure" port range reserved for the super user — without being granted any other elevations. Conversely, a process might be denied the privilege to create new files in the file system, without hampering its ability to report status into the SMF-managed log files.

Systemd was introduced in 2010, and in the ensuing decade has seen almost ubiquitous adoption by economically relevant Linux distributions. Ten years in, with a broadening sprawl of different programs under the one banner, it’s not clear that the systemd project has a single goal in mind — but the most critical functionality it provides is ostensibly similar to SMF: the management of services as a first class facility in the system.

Though systemd has seen broad adoption, this process has not been without controversy. The project leadership has at times handled criticism or egregious bugs with perhaps less empathy than might have been ideal. One source of consternation has been the extent to which the project provides (at times quite minimal) implementations of such a wide variety of functionality, without considering the potential downsides of their new and often heavily opinionated approach.

Instead of relying on an existing logging mechanism, or providing managed access to regular files on disk, systemd provides the journal subsystem. The journal is a binary log format that has at times been a bottleneck in production software. Journal files have been reported to become corrupt under some conditions. The journal daemon itself has been a source of relatively recent security issues, a problem that is less commonly seen with simpler direct-to-file logging.

There have been similar simplistic expansions into other parts of the system, such as time synchronisation, where a variety of what may either be mistakes or deliberate design choices have often resulted in a patently inflexible and somewhat inaccurate source of time. The project has also moved to supplant existing name resolution caching with a local DNS proxy that until relatively recently could not resolve A records with sufficient entries to require EDNS support, and the official project position appears to be that if you want anything but the absolute basics to work you should probably use something else.

Realistically, it seems a robust infrastructure product would likely end up using few if any of the components provided by the systemd project, despite there now being something like a hundred of them. Instead, more traditional components would need to be revived, or thoroughly bespoke software would need to be developed, in order to avoid the technological and political issues with this increasingly dominant force in the Linux ecosystem.

Software Deployment

It would help to have facilities in the operating system that enable robust deployment of the software artefacts that we layer on top. Some programs end up as a single broadly static binary, but others end up as a tree of files that must be deployed as a unit. A classical solution to the problem of deployment has been to use a software packaging system like dpkg/apt or rpm/yum. Management of those systems, especially at scale and with unattended operation, has proven challenging. Recent emerging approaches instead choose to build atomically deployable artefacts that comprise a prebuilt (and sometimes even immutable) file system image, containing all required and related files for a given service.

These modern approaches often make use of operating system facilities like containers that can serve both as a unit of isolation and as a unit of deployment. The deployable artefact is often much like an entire file system; e.g., we have heard that Google uses ext4 file system images mounted over an iSCSI transport from a central highly available storage service. Another previously implemented solution is to use ZFS send streams to reconstitute a child file system into an existing system pool.

As described earlier, at least some core internal workloads will need to live persistently on a particular server, and be started automatically as part of server boot. Such core services will need to exist in order to solve the chicken-and-egg problem of cold start of the rack.

Server Management Facilities

In order to make the system robust, and do so with a high level of automation, we will need visibility into the underlying hardware and its drivers. This should include both status information and errors. We intend to collect that information in a programmatic and structured manner for use elsewhere in the system. As such its producers will ideally have an accessible and consistent interface to emit such status and error information. It is likely that there will be consumers both in the OS kernel itself (for example, to retire a bad page of memory) as well as in user mode.

Component Inventory

It would be good to have a rich topology map within the server that covers all of the enumerable hardware, including:

sensors (temperature, etc)
indicators (LEDs)
serial numbers
firmware versions
physical locations (e.g., "Front Disk Slot 5")
logical attachment points (e.g., nodes in a SAS or NVMe fabric)

Ideally, this topology information would be concentrated in one consistent interface, rather than a collection of disjoint ones. Dependable resource IDs could be derived from that for use on the local system and to properly annotate information that is collected at the cluster level.
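One way to picture a single consistent topology interface is a tree of nodes that share a common property map, from which path-like resource IDs are derived. The Rust sketch below is hypothetical throughout: the Node type, the "kind=instance" path format, and the example serial number are invented for illustration and do not describe an existing library.

```rust
use std::collections::BTreeMap;

/// One enumerable piece of hardware, with arbitrary string properties
/// (serial number, firmware version, physical label, and so on).
pub struct Node {
    pub kind: String,
    pub instance: u32,
    pub properties: BTreeMap<String, String>,
    pub children: Vec<Node>,
}

impl Node {
    pub fn new(kind: &str, instance: u32) -> Self {
        Node {
            kind: kind.to_string(),
            instance,
            properties: BTreeMap::new(),
            children: Vec::new(),
        }
    }

    pub fn with_prop(mut self, key: &str, value: &str) -> Self {
        self.properties.insert(key.to_string(), value.to_string());
        self
    }

    pub fn with_child(mut self, child: Node) -> Self {
        self.children.push(child);
        self
    }

    /// Depth-first walk that records a path-like ID for every node.
    pub fn walk<'a>(&'a self, prefix: &str, out: &mut Vec<(String, &'a Node)>) {
        let id = format!("{}/{}={}", prefix, self.kind, self.instance);
        out.push((id.clone(), self));
        for child in &self.children {
            child.walk(&id, out);
        }
    }
}

/// Return the IDs of all nodes whose property `key` equals `value`.
pub fn find_by_property(root: &Node, key: &str, value: &str) -> Vec<String> {
    let mut all = Vec::new();
    root.walk("", &mut all);
    all.into_iter()
        .filter(|(_, node)| node.properties.get(key).map(String::as_str) == Some(value))
        .map(|(id, _)| id)
        .collect()
}
```

Because the ID is derived from position in the tree rather than from enumeration order, it stays stable across reboots as long as the hardware does not move, which is what makes it usable for annotating data collected at the cluster level.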

Fault Management

It would be good to have a system that can track telemetry that may itself not rise to the level of an actionable fault, but where emerging trends may (in aggregate) represent a fault; e.g.,

memory correctable errors (perhaps 1 per day is OK, but 10 per day is not)
memory uncorrectable errors
PCI-e errors
Machine Check Architecture (MCA)

As noted above, while some of these events will be immediately actionable for the host OS, it is important that they also be made available to systems higher up in the stack so that reporting and decision-making at the cluster level will be possible.

Comparison: illumos & Linux

The Fault Management Architecture (FMA) was introduced in Solaris 10 in 2005, and was subsequently inherited by illumos. It is a comprehensive system with several components, both in the kernel and in user mode, supporting the detection and reporting of both software and hardware faults. Events are correlated by diagnosis engines to produce a whole system view of fault conditions, with the potential for corrective action; e.g., activation of hot spare disks in a degraded ZFS pool, retirement of memory and CPUs, etc.

FMA provides kernel facilities for the management of faults in hardware devices; e.g.,

ddi_fm_acc_err_get(9F) and ddi_fm_dma_err_get(9F) for detecting errors reported by underlying hardware access (PCI, DMA, etc).
ddi_fm_service_impact(9F) and ddi_fm_ereport_post(9F) for reporting device errors to the fault management system for recording and potential diagnosis.

FMA stores informational telemetry (e.g., Ethernet link state changes) and hardware faults (e.g., correctable or uncorrectable DIMM errors) in a structured log. These log records are then processed by modules loaded by the fault management daemon (fmd(1M), FMDPRM). Some modules are diagnosis engines, and some serve other purposes such as forwarding telemetry records to remote hosts or fetching telemetry from subservient devices such as a service processor.

Coherent management of the health of the system requires a complete picture of the hardware topology. Modern computing systems are often composed of a complex tree or graph of buses and switches, with many interconnected devices cooperating to provide service. NVMe disks may be connected through a variety of PCI bridges, just as SAS and SATA disks may be attached via a complex topology of expanders or port multipliers. FMA provides a view of the hardware through libtopo, the topology library (a somewhat dated description appears in FMDPRM). The system is represented in different ways (e.g., logical topology and physical topology) via different trees or graphs of objects with a common property-based interface. Properties are exposed with serial and model numbers, and any other notable device-specific information such as link speed or disk capacity. Each device has a URL-like construct, the Fault Management Resource Identifier (FMRI) that uniquely identifies it, and is used in other parts of the stack as a way to concretely nominate particular devices.

FMA also provides a basis for the promotion of persistent soft faults into a hard fault using Soft Error Rate Discrimination (SERD). Broadly, one correctable memory error per week might be expected, but three per hour may be predictive of a fault. In addition to the general FM module interface, the Eversholt module and associated fault tree description language allow for the definition of various measurement engines and transformation of telemetry into diagnoses in a relatively declarative fashion.
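The idea behind SERD can be sketched as a sliding-window counter: individual soft errors are tolerated, but N of them within a time span T are promoted to a fault. The Rust below is a simplified illustration under stated assumptions, not the fmd implementation: the Serd type is hypothetical, events are assumed to arrive in time order, and "fires" is taken to mean "at least N events within the most recent span of length T".

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// A simplified soft-error-rate engine: fires when at least `n`
/// events have been recorded within a span of length `t`.
pub struct Serd {
    n: usize,
    t: Duration,
    events: VecDeque<Duration>,
}

impl Serd {
    pub fn new(n: usize, t: Duration) -> Self {
        Serd {
            n,
            t,
            events: VecDeque::new(),
        }
    }

    /// Record a soft error observed at time `when` (measured from an
    /// arbitrary epoch). Returns true if the engine fires.
    pub fn record(&mut self, when: Duration) -> bool {
        self.events.push_back(when);
        // Discard events that are more than `t` older than the newest.
        while let Some(&oldest) = self.events.front() {
            if when.saturating_sub(oldest) > self.t {
                self.events.pop_front();
            } else {
                break;
            }
        }
        self.events.len() >= self.n
    }
}
```

Using the figures from the text, an engine with N = 3 and T = one hour stays quiet for a DIMM producing one correctable error per week, but fires on the third error of a burst spread over twenty minutes.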

The Linux ecosystem has a variety of point solutions to particular subsets of what FMA provides. Because the kernel project does not generally ship any associated daemons or libraries, they can only provide the portions of fault management that can or must be kernel-resident. One such subsystem is Error Detection And Correction (EDAC), a framework chiefly tasked with reporting memory errors and retiring pages. As a kernel-only component, this module does not seek to perform more in-depth diagnosis that might require, say, a persistent long term store of error telemetry, or the kind of classification that is best performed in a daemon. The focus of EDAC appears relatively narrow, and it is not clear that there is a broader fault management effort underway. Instead, the general state of the art outside DIMM faults appears to be liberal use of dmesg and the occasional non-fatal (!) oops for reporting serious errors.

Beyond the kernel, there are other projects like mcelog that seem to provide some user mode recording of more complex error conditions. The project is capable of decoding and logging the machine check exceptions from which it gets its name, as well as some set of memory and PCI errors. It even claims to support a level of predictive failure analysis. Platform support is not extremely broad; e.g., the project FAQ suggests that users of AMD Zen-era hardware should just stick to EDAC at least for now. The author of the project gave a talk in 2009 about a laudable goal of unified error reporting, but it seems that EDAC may perhaps have emerged as the plan of record and stalled other efforts.

On the topology front, there is a mixture of sources that one would need to tie together to provide a complete picture of the system. Some data is present in textual files within /proc and /sys, some of which appear to be documented as stable and some of which are treated as private implementation details that could presumably change with any new kernel release. Others are commands with textual (sometimes parseable) output; e.g., lshw, hwinfo, lspci from pciutils, lsusb from usbutils, dmidecode, lsblk from util-linux, hdparm, mii-tool from net-tools, ethtool, etc.

The illumos approach to fault management is by no means perfect. It no doubt has bugs, and will require work to adapt to the continuing evolution of computer hardware and software over time. With that in mind, it appears substantially more coherent and complete than the Linux ecosystem. The benefit of co-designed kernel and user mode components in one coherent whole is perhaps thrown into its sharpest relief with fault management, an area that requires participation right up and down the software stack.

Software Engineering Facilities

The Oxide Rack will represent a core layer of the infrastructure of a variety of customers with diverse needs and expectations. The product will provide facilities in several broad categories:

General purpose virtual machines over which we will have little control, and at times little ability to accurately predict the exact nature of the workload in advance.
Storage services for guests, where reliability and consistent performance are critical.
A control plane and API service that users and operators use to interact with the product, with an expectation of predictable request latency and reliable operation under increasing load.

Our interactions with customers will include the need to fix critical issues promptly, and to make measured improvements to the performance of the system. In order to make this possible, we will need various facilities from the operating system and the hypervisor.

Debugging Metadata

Many software defects are difficult to reproduce, and inspection of the production system is required in order to determine the cause of the issue. In order to aid that exploration, debugging metadata must not be limited to special debug builds of the software. The production software must include enough metadata to enable runtime inspection of the state and execution of the system. Useful metadata for debugging can include:

Symbols for all functions and data objects in shared libraries, executable programs, and kernel modules; where inlining is prominent, as in Rust, we may need additional metadata beyond classical C/ELF symbols.
Robust stack walking capability sufficient for runtime sampling/profiling as well as inspection of the state of a hung or crashed process. This may be provided via a number of approaches; e.g., ensuring all binaries are compiled with correct use of the frame pointer register, or including some additional unwinding metadata.
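As a rough illustration of why consistent frame pointer use makes stack walking robust, the sketch below follows a chain of saved frame pointers through a simulated memory image. The addresses, the symbol table, and the helper names are all hypothetical; the layout assumed is the common x86-64 convention in which each frame stores the caller's frame pointer at [fp] and the return address at [fp + 8].

```python
"""Toy frame-pointer stack walk over a simulated memory image (hypothetical data)."""
import bisect

WORD = 8

# Simulated stack memory: address -> 64-bit value.
MEMORY = {
    0x7000: 0x7040, 0x7008: 0x401234,  # frame of leaf(), called from middle()
    0x7040: 0x7080, 0x7048: 0x401150,  # frame of middle(), called from main()
    0x7080: 0x0,    0x7088: 0x401020,  # frame of main(), called from _start
}

# Sorted (start address, name) pairs, as a symbol table would provide.
SYMBOLS = [(0x401000, "_start"), (0x401100, "main"), (0x401200, "middle")]


def symbolize(addr):
    """Map an address to 'name+offset' using the nearest preceding symbol."""
    starts = [start for start, _ in SYMBOLS]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return hex(addr)
    start, name = SYMBOLS[i]
    return f"{name}+{addr - start:#x}"


def walk_stack(fp, memory, max_frames=64):
    """Follow saved frame pointers, collecting one return address per frame."""
    frames = []
    while fp != 0 and len(frames) < max_frames:
        if fp not in memory or fp + WORD not in memory:
            break  # unreadable frame: stop rather than guess
        frames.append(memory[fp + WORD])
        next_fp = memory[fp]
        if next_fp != 0 and next_fp <= fp:
            break  # the stack grows down, so a caller's frame must be higher
        fp = next_fp
    return frames


trace = [symbolize(ra) for ra in walk_stack(0x7000, MEMORY)]
print(trace)
```

Without the symbols for every library in the chain, the same walk would still find return addresses but could not name them, which is the blind spot described in the text.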

It might seem like this is chiefly a concern at the application level, but our past experience has demonstrated how critical it is to extend this stance on complete visibility all the way down the stack. An application stack trace that stops at libc or some other system-provided library leaves us with a critical blind-spot in our exploration.

Static Instrumentation

During the design of the system, there will be some set of metrics that we would like to collect that we can determine in advance. Examples of this class of metrics include:

counts of VM exit reasons per guest instance
accumulated on-CPU time for each guest CPU
histogram with broad buckets for slow I/O operations, so that we can determine after the fact that a guest or a disk has seen unusually slow I/O operations
physical device statistics like packet, error, or interrupt counters
per-facility kernel memory allocator statistics
details about any processor power or clock scaling events

For high-frequency events, and especially in modern vastly multi-core systems, efficient collection and aggregation of standing metrics like these is a critical part of making the system both high-performing and observable.
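One common way to keep standing metrics cheap on many-core systems is to give each CPU its own counters and combine them only when a reader asks. The sketch below is a minimal illustration under stated assumptions: the class name, the power-of-two bucket layout, and the caller-supplied CPU index are all hypothetical, and a real kernel would use its own per-CPU data rather than Python lists.

```python
class ShardedHistogram:
    """Latency histogram with broad power-of-two buckets, one shard per CPU.

    Writers touch only their own CPU's shard, so the hot path needs no
    cross-CPU coordination; readers pay the cost of summing the shards.
    """

    def __init__(self, ncpus, nbuckets=32):
        self.nbuckets = nbuckets
        self.shards = [[0] * nbuckets for _ in range(ncpus)]

    def bucket_for(self, latency_us):
        """Bucket n covers [2**n, 2**(n+1)) microseconds; 0 and 1 share bucket 0."""
        index = max(int(latency_us), 1).bit_length() - 1
        return min(index, self.nbuckets - 1)

    def record(self, cpu, latency_us):
        """Hot path: a single increment in the calling CPU's shard."""
        self.shards[cpu][self.bucket_for(latency_us)] += 1

    def snapshot(self):
        """Cold path: sum each bucket across all shards."""
        return [sum(shard[b] for shard in self.shards) for b in range(self.nbuckets)]

    def count_at_least(self, threshold_us):
        """Operations in buckets whose lower bound is >= threshold_us."""
        snap = self.snapshot()
        return sum(n for b, n in enumerate(snap) if (1 << b) >= threshold_us)


hist = ShardedHistogram(ncpus=4)
hist.record(0, 3)
hist.record(1, 3)
hist.record(3, 1000)
hist.record(2, 2_000_000)
print(hist.count_at_least(512))
```

Broad buckets lose precision, but they let an operator determine after the fact that a guest or disk saw unusually slow I/O without storing every sample.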

It would be convenient for the system to present as much static instrumentation as possible in a consistent, discoverable framework. When static metrics are presented in a self-describing structure, it can help with two broad use cases:

discovery during interactive exploration by support engineers without needing to learn new tools for each new class of metric
library-based access by agents responsible for collecting and storing metrics over time, where the agent may have to discover and work against different system versions that do not include the latest set of possible metrics
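Both use cases can be sketched with a small self-describing record. The record layout, the metric names, and the two "system versions" below are invented for illustration and do not correspond to any particular statistics framework; the point is only that an agent can enumerate what exists and degrade gracefully when a metric is absent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stat:
    name: str   # hypothetical naming scheme, e.g. "vm.exits.hlt"
    kind: str   # "counter" (monotonic) or "gauge" (point in time)
    unit: str
    value: int


def describe(stats):
    """Discovery: list what the system offers without prior knowledge."""
    return sorted(f"{s.name} ({s.kind}, {s.unit})" for s in stats)


def collect(stats, wanted):
    """Collection: return the metrics that exist and name the ones that do not."""
    by_name = {s.name: s.value for s in stats}
    found = {name: by_name[name] for name in wanted if name in by_name}
    missing = [name for name in wanted if name not in by_name]
    return found, missing


# Two hypothetical system versions; the older one lacks a newer metric.
OLD_SYSTEM = [Stat("vm.exits.hlt", "counter", "events", 1200)]
NEW_SYSTEM = OLD_SYSTEM + [Stat("vm.oncpu", "counter", "nanoseconds", 987654321)]
WANTED = ["vm.exits.hlt", "vm.oncpu"]

print(describe(NEW_SYSTEM))
print(collect(OLD_SYSTEM, WANTED))
```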

Comparison: kstat & procfs/sysfs

Sun introduced a uniform kernel statistics interface with Solaris 8 in 2000, known as kstat. This subsystem provides a stable kernel interface, kstat(9S); a stable library interface, kstat(3KSTAT) and libkstat(3LIB); and a command line tool, kstat(1M). The library can be used from Rust with relative ease. The statistics themselves may or may not be stable, depending on the source of the data. Some statistics are common between drivers of the same device type class; e.g., block devices provide a common base set of statistics that are used by iostat(1M) to summarise I/O activity.

The Linux approach is somewhat less ruthlessly uniform. While much information is available, it has accumulated in a variety of forms and venues depending on subsystem age. Linux has had a basic /proc virtual file system, apparently modeled at least in part on Plan 9, since the 0.99 release in 1992. Rather than structured data, the kernel constructs string representations (except when it doesn’t; e.g., /proc/$pid/cmdline) that any consumer other than cat will need to parse to reconstitute the information.

Over time, procfs has accumulated other non-process related files that convey other information about the system, often in a new textual format per file; e.g., see /proc/cpuinfo or /proc/meminfo. At some point, this file system became a writeable control interface in addition to a statistics interface, though still requiring the consumer to produce a textual representation of control messages that the kernel will then parse. Further, under the /proc/sys tree are an array of files with a different structure that is roughly analogous to the BSD sysctl(8) interface.

Around Linux version 2.4 in 2001, sysfs was introduced at /sys and a partial migration of some of the non-process related files from /proc ensued. Some areas of this newer tree are at least documented as stable, though as of release candidates for version 5.8 in 2020 it continues to represent a mixture of statistics presentation and mutable control plane facilities. The virtual files in this new tree still generally require a round trip through a textual representation. It is not clear if the consolidation of statistics into a uniform structure can or will ever be completed, presumably at least in part due to use of the old /proc names in user tools that are developed and shipped apart from the kernel itself.
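To make the textual round trip concrete, the sketch below shows roughly what a consumer must do to recover numbers from /proc/meminfo-style output. The sample text uses the familiar "Name: value kB" layout with made-up values, and the function name is hypothetical. Each such file needs its own parser, because the layout differs from file to file.

```python
# Sample in the style of /proc/meminfo; the values are invented.
SAMPLE_MEMINFO = """\
MemTotal:       16315412 kB
MemFree:         8123456 kB
HugePages_Total:       0
"""


def parse_meminfo(text):
    """Turn 'Name:   value [kB]' lines back into a dict of integers.

    Values carrying a kB suffix are converted to bytes; bare values
    (plain counts) are returned unchanged.
    """
    result = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a 'Name: value' line
        fields = rest.split()
        if not fields:
            continue
        value = int(fields[0])
        if len(fields) > 1 and fields[1] == "kB":
            value *= 1024
        result[name.strip()] = value
    return result


info = parse_meminfo(SAMPLE_MEMINFO)
print(info["MemTotal"])
```

On a Linux system the same function could be fed the contents of /proc/meminfo itself; a consumer of /proc/cpuinfo would need different parsing logic again.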

Dynamic Instrumentation

Static metrics are valuable when engineers can determine interesting values to collect and store in advance of a problem occurring. Unfortunately, many of the most pernicious software defects are not found or fixed by inspecting existing metrics.

Complex defects are often not easily reproducible; their presentation may depend on ambient and difficult to catalogue background activity that is unique to a specific customer site at a specific time. In order to chase these defects to root cause and make a fix, we must be able to ask questions about the dynamic state of the system and iteratively refine those questions as we home in on the problem. New instrumentation should be able to be enabled on a running production system without updating or even restarting the software.

Human intervention in systems can lead to unforced errors, and even the most cautious engineers engaging in debugging exercises are not immune. Our instrumentation must be constructed to be safe to engage on a production system without fear of a crash, or data corruption, or serious and widely felt performance degradation. When not engaged, the machinery for instrumentation should have no impact on system operation or performance. In the limit, it should be safe to accidentally request the enabling of all possible system instrumentation without prolonged deleterious effect; it is better to report an error to the debugging engineer than to make the system unsafe.

Modern systems generally consist of a lot of software, performing operations at a dizzying rate. A naive firehose approach to system observability where post-processing is used to locate the subset of interesting events is unlikely to be broadly useful. In order to have minimal impact on system performance, and to answer very specific questions, the instrumentation system should be able to efficiently and safely filter and aggregate relevant events in-situ; that is, to make the decision to record the event, and which data to record, at the time of the event.
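The difference between a firehose and in-situ aggregation can be sketched in a few lines. The event shape, the class name, and the sample data below are hypothetical; the idea is that a predicate decides at the moment of the event whether anything is recorded, and only a small aggregate is retained rather than the events themselves.

```python
from collections import Counter


class Enabling:
    """Filter and aggregate events at the time they fire (hypothetical sketch)."""

    def __init__(self, predicate, key):
        self.predicate = predicate    # decides whether to record this event
        self.key = key                # decides which datum to aggregate on
        self.aggregation = Counter()  # the only state retained
        self.fired = 0

    def fire(self, event):
        self.fired += 1
        if self.predicate(event):
            self.aggregation[self.key(event)] += 1


# Invented events standing in for system call probe firings.
EVENTS = [
    {"pid": 42, "syscall": "read"},
    {"pid": 7, "syscall": "read"},
    {"pid": 42, "syscall": "write"},
    {"pid": 42, "syscall": "read"},
    {"pid": 7, "syscall": "close"},
]

by_syscall = Enabling(predicate=lambda e: e["pid"] == 42, key=lambda e: e["syscall"])
for event in EVENTS:
    by_syscall.fire(event)

print(dict(by_syscall.aggregation))
```

The retained state grows with the number of distinct keys, not with the number of events, which is what keeps this approach viable at high event rates.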

Our hypervisor and other critical services are likely to be implemented in a split between user processes and kernel modules. We will need instrumentation that can cleanly measure events that occur across more than one process, and which extend into the kernel.

In general, we expect to be able to gather at least the following information:

scheduling events; e.g., when a thread or process goes to sleep and why, or what caused a thread or process to wake up
timer-based profiling; e.g., to allow us to sample system state such as the kernel or user stack at regular intervals, possibly with other constraints such as limiting to a particular process
system call entry and return, both per-process and globally
kernel function entry and return, with guard rails that prevent the tracing of functions that are not safe to trace
user process function entry and return, and breakpoint-style tracing of specific instructions or specific offsets within a function
creation of new processes and threads, and termination (planned or unplanned) of existing processes and threads
latency distributions of operations, some of which may be defined as the composite of multiple system-level operations by the engineer

In some cases it can also be desirable to allow instrumentation to take limited mutating actions against the system, such as invoking a data collector program once a particular sequence of trace events has been detected. In order to aid in the reproduction of races in multi-threaded software, it can be helpful to be able to inject small delays (hundreds of microseconds) at specific moments in order to widen a suspected race window. Facilities that perturb the system may present a trade-off in safety, and it’s possible we might want to be able to restrict these separately from general instrumentation in customer environments.

Comparison: DTrace & eBPF

DTrace, as described in the 2004 paper, is a system for the dynamic instrumentation of production systems. A prominent goal in its construction is to be perfectly safe, and over more than a decade of use on a wide variety of production workloads it has proven to be sufficiently robust that engineers and operators need not worry when enabling it on a critical system. Many key design decisions stem from safety as a core goal; e.g., the instruction set for the tracing virtual machine allows no backwards branches, so infinite loops are impossible by construction.

Joyent hosted arbitrary customer workloads within zones, an isolation and virtualisation technology similar in principle to FreeBSD jails or Docker containers. DTrace was sufficiently safe that access could be granted to customers to instrument software running within their container, with only restricted visibility into global system behaviour. In addition to raw DTrace access, part of the Cloud Analytics product was built on top of DTrace instrumentation. This product was able to collect statistics both from probes that fired in user applications, and from the underlying operating system, aggregating them in a live graphical view. Finally, countless real production problems were solved by long-running DTrace enablings distributed throughout the fleet, waiting to log data about the events leading up to some critical fault, but without otherwise affecting the regular operation of the system.

In the more distant past, DTrace was a critical underpinning of the Analytics feature of the Fishworks appliance at Sun. Analytics enabled customers to drill down into fine detail while analysing the performance of the system, providing more abstract control over DTrace enablings and presenting an interactive graphical view of the resultant telemetry.

The Berkeley (née BSD) Packet Filter (BPF) was introduced in 1992, to provide a safe virtual machine that could be included in the kernel for selective packet capture. By allowing the filtering engine to run safely in the kernel, the performance overhead of copying every packet into a user address space for filtering could be avoided. It followed from similar approaches taken in earlier systems.

In 2014, an extended BPF (eBPF) was introduced to the mainline Linux kernel for a variety of uses. In contrast to many prior approaches, the eBPF virtual machine makes use of a just-in-time (JIT) compiler to convert eBPF instructions into native program text as they are loaded into the kernel. This choice appears to be the result of an attempt to build one system for two masters:

Adding new behaviours to the system, even in the data path, where performance is of paramount importance and programs must run to completion for system correctness even if they have an outsized impact on the rest of the system; e.g.,
- filtering, configuring, or redirecting socket connections
- classifying and shaping network traffic
- system call security policy, resource, and quota management in cgroups
- network encapsulation protocol implementation
Tracing and performance measurement of the system; e.g., by allowing eBPF programs to hook various trace points and events from the perf subsystem

The first use case, extending the data path, requires high performance at all costs. Without low latency operations, eBPF would not be an attractive target when implementing new network filtering or encapsulation facilities. The robustness and security of eBPF appear to depend fundamentally on a component called the "verifier", which scans the eBPF program upon load into the kernel. The verifier attempts to determine (before execution) whether an eBPF program will do anything unsafe, and seeks to ensure that it will terminate. There have been some serious vulnerabilities found in the verifier, and it is not clear to what extent it has been proven to work. Indeed, kernel/bpf/verifier.c is (according to cloc) eight thousand lines of non-comment C code running in the kernel. CVE-2020-8835 from earlier this year is one such example of a security issue in the verifier.

By contrast, DTrace has a more constrained programming model which has allowed a more readily verified implementation. A byte code interpreter is used, with security checks directly at the site of operations like loads or stores that allow a D program to impact or observe the rest of the system. The instruction format does not allow for backwards branches, so constraining the program length has a concomitant impact on execution time and thus impact on the broader system. Each action is limited in the amount of work it can perform; e.g., by caps on the number of bytes of string data that will be read or copied, and by the overall principal buffer size. Explicit choices have been made to favour limiting system impact — one could not implement a reliable auditing or accounting system in DTrace, as the system makes no guarantee that an enabling won’t be thrown overboard to preserve the correct execution of the system.
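The link between forward-only branches and bounded execution time can be shown with a toy stack machine. This is not DTrace's actual instruction set or verifier; the opcodes and function names are invented for illustration. Because every branch target must lie strictly after the branching instruction, the program counter only ever increases, so the number of executed instructions can never exceed the program length.

```python
def verify(program):
    """Reject unknown opcodes and any branch that does not go strictly forward."""
    for pc, (op, *args) in enumerate(program):
        if op in ("jmp", "jz"):
            target = args[0]
            if not (pc < target <= len(program)):
                raise ValueError(f"instruction {pc}: branch to {target} is not forward")
        elif op not in ("push", "add", "ret"):
            raise ValueError(f"instruction {pc}: unknown opcode {op!r}")


def run(program):
    """Execute a verified program; return (top of stack or None, steps taken)."""
    verify(program)
    stack, pc, steps = [], 0, 0
    while pc < len(program):
        op, *args = program[pc]
        steps += 1
        if op == "push":
            stack.append(args[0])
            pc += 1
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
            pc += 1
        elif op == "jmp":
            pc = args[0]
        elif op == "jz":  # pop; jump if the popped value is zero
            pc = args[0] if stack.pop() == 0 else pc + 1
        else:  # "ret"
            break
    return (stack[-1] if stack else None), steps


# A conditional: yields 2 when the tested value is zero, 1 otherwise.
PROGRAM = [("push", 0), ("jz", 4), ("push", 1), ("jmp", 5), ("push", 2), ("ret",)]
result, steps = run(PROGRAM)
print(result, steps, len(PROGRAM))
```

A real implementation would also need to check stack depth and operand validity; the sketch covers only the termination property. A program such as `[("push", 1), ("jmp", 0)]` is rejected by `verify` before it can run.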

In addition to issues of implementation complexity and verifier tractability, there is the small matter of binary size. The bpftrace tool, analogous on some level to dtrace, depends directly on the library form of BCC, Clang, and LLVM. This puts the directly linked text size (as visible via ldd) at around 160MB, which was more than half of the size of the entire SmartOS RAM disk. This doesn’t account for other parts of those toolchains that generally come along for the ride, or the debugging information that is often stripped from binaries in desktop Linux distributions. By contrast, dtrace and supporting libraries run to around 11MB total including CTF. In 2020, disks, memory, and network bandwidth are relatively cheap. That said, in contexts within the system where we choose to execute the OS from a pinned RAM image, program text size may still be an issue. Lighter debugging infrastructure is easier to include in more contexts without other trade-offs.

Finally, the tools in the eBPF ecosystem continue to present a number of opportunities for improvement with respect to user experience. A relatively easy task with DTrace is to trace all system calls being made on the system or by a particular process, and to then aggregate them by system call type, or obtain a distribution of the latency of each call, or some combination of those and other things. By contrast, on a modern Ubuntu system, repeated attempts to do the same basic thing resulted in hitting any number of walls; e.g.,

Probe names have not been selected to be terribly ergonomic. What would in DTrace be syscall::read:entry, with each colon-separated portion of the tuple available in a separate variable, appears in bpftrace as tracepoint:syscalls:sys_enter_read. There is only one variable, probe, which contains the whole string. Useful output appears likely to require post-processing for even simple tracing activities.
When trying to enable all tracepoint:syscalls:sys_* probes and count the distinct probe values, it becomes apparent that enabling probes at all results in sufficient risk to the system that you may enable at most 500 of them; following the instructions in the scary warning to increase the probe count results instead in tens of seconds of the system hanging and then pages of viscous and apparently diagnostic output from the eBPF verifier.
Upon further inspection, there is instead a probe site that fires for the entry of every different kind of system call, tracepoint:raw_syscalls:sys_enter, though it is difficult to actually use this effectively: the only way to identify the system call is then by its number, which though stable is also different per architecture (e.g., even 32- or 64-bit processes on x86 have different numbers).

Conversely, it is possible to cheaply enable tens of thousands of probe sites with DTrace. On a current system, there are 236 system calls and something like 34000 kernel function probe sites, not including a number of other probes. A simple enabling that counts the firings of all probes over a ten second run is responsive and easy:

# time dtrace -n '::: { @ = count(); }' -c 'sleep 10'
dtrace: description '::: ' matched 80872 probes
dtrace: pid 28326 has exited

         39354882

real    0m18.393s
user    0m0.564s
sys     0m8.115s

Postmortem Debugging

Defects are an inevitable part of shipping a software product. We should seek to delight our customers with a speedy resolution to reported issues, with as few repeated incidents as possible for each defect. As much as possible, we should aim to be able to fully debug an issue from the first reported incident. To enable us to do this, our system must record as much information as possible about the state of the system when the fault occurred.

To fully enable postmortem debugging, the system should provide rich facilities for the creation of core files in the face of serious faults. A core file will generally include the complete contents of process memory at the time of the fault, but it needs to include other information as well. A process has a number of open file descriptors that can represent a diverse set of resources beyond the process itself, and thus important context for debugging; e.g.,

Facility	Examples of Context
local domain sockets, pipes	which process or processes are on the remote end; when the pipe opened or connection started; send and receive buffer state
network sockets	local and remote address and port; when the connection started; send and receive buffer state
semaphores and other IPC	state of the IPC object
open files and directories	path at open time; file system and inode number; if we believe the file has been unlinked
kernel event ports, eventfd, timerfd	information about the state of the resource; e.g., associated object count, state of timers, etc
debuggers and other tracing tools	was the process under the control of a debugger when terminated?

The creation of a core file may have been discretionary; e.g., using a tool like gcore to create it on demand. If not discretionary, the system must record details about the cause of process termination; e.g., memory address and program counter for segmentation violations, or the source of a fatal signal if delivered from outside the process. We should also note the time, both in terms of observed wall clock time and monotonic seconds since boot. The more context we record, the more likely it is we will be able to debug the problem the first time it happens.

In addition to core files, we would like the system to be able to dump the entire contents of kernel memory, and optionally the memory of a specific selection of user processes, on panic of the system. Crash dumps provide a wealth of information and have been instrumental in debugging issues with kernel facilities such as device drivers and hypervisors. Crashes of an entire server are extremely disruptive, and are a class of fault that even advanced facilities like live migration are unable to paper over. We should seek to minimise the number of repeat incidents that take out an entire server.

While customers enjoy speedy resolution to the issues they report, they are presumably even happier with resolution of issues they did not have to discover themselves. For customers who are amenable to remote data collection by support organisations, we can use rich crash data collection facilities to provide an automated reporting service: core files and crash dumps can be reported to Oxide on creation for timely investigation.

As we engage with more customers and more deployed systems, managing the relationship between a specific core file and the software artefacts that produced it can be a challenge. It is important that core files include sufficient program text and metadata so that a debugger can make sense of them even if it is subsequently difficult to locate the exact input source code. Metadata may be in a format like DWARF, or some other summarised format such as CTF, depending on binary size constraints and the toolchain that produced the binary.

Comparison: illumos and Linux

Postmortem debugging is a practice that requires many facilities in the system to work together, and highlights the need for a coherent approach. On illumos systems, the base operating system has a number of properties that help here:

Binaries are never stripped of symbols for shipping, and frame pointers are always used to mark the current stack frame.
Compact C Type Format (CTF) debugging information appears in every kernel module and many user libraries and commands; this information is then embedded in any core files or crash dumps, so the debugger can always find the correct information without the need to correlate a core file with a separate catalogue of debugging information packages for shipped binaries.
The Modular Debugger (MDB) is an integral part of the operating system. It is available for live use on user programs, the kernel from user mode, and on the console while the system is paused (KMDB). The same tools can then be used on kernel crash dumps and user core files. New kernel facilities are encouraged to include first class support for MDB modules that aid in debugging; e.g., for zfs or i40e. Making software debuggable in this way needs to be baked in as part of the design; e.g., critical data structures must be reachable from the debugger, and ancillary information like event timestamps may be stored even though it goes unused except under the debugger at a later time.
Core files for user software are generated using the same algorithms and format, whether as a result of a fault (SIGABRT, SIGSEGV, etc) or at the discretion of the operator with gcore(1). On Linux systems, gcore comes from GDB, an external package with different maintainers. On illumos it is maintained alongside the same core file support in the base operating system.
Comprehensive use of note entries in core files allows them to contain information about the state of the process that is not visible from the process memory or register state; e.g., pfiles(1) (similar to lsof -p) works on core files by reading information about open files, sockets, pipes, etc, that were saved in NT_FDINFO notes in the core file. Linux by contrast makes relatively modest use of ancillary data here.

MDB first shipped with Solaris 8, in the year 2000, replacing the older ADB. It is the latest in a long line of postmortem debugging systems dating back to at least crash in System III. By contrast, the Linux postmortem debugging landscape is comparatively fragmented, and Linus himself is somewhat infamously opposed to debuggers.

KDB was integrated into Linux in late 2010 and appears to deal only with debugging of the live system (a la KMDB). It is somewhat less flexible in general: it provides only a basic set of commands for examining the state of loaded modules, the printk() buffer, the process list, etc. Though the ability to register additional KDB commands is present, it is used in just one place in the rest of the kernel to date. There does not appear to be a mechanism for composition or complex interactive use, such as piping the result of one KDB command into another.

For postmortem debugging, one is expected to use kdump and GDB. Reports from the field, such as from Dave Jones in 2014, suggest that this facility is perhaps not especially robust. To use kdump, another Linux kernel is loaded into memory with kexec(2), using a special reserved memory area that must be sized appropriately for the particular system. At least based on the kernel documentation, moving parts are somewhat abundant. A more classical dumping approach, diskdump, was floated in 2004 but did not appear to make it to integration.

The kdump documentation (admittedly mostly unchanged since 2005) suggests GDB will have trouble with 64-bit dumps from x86 systems; mailing list posts from 2014 suggest that this may still be the case. Red Hat maintains the crash utility as a body of software separate from the kernel, which at least offers a debugging module system. The kernel tree also contains various Python scripts for use with GDB, which appear to make use of whatever type metadata is available to GDB itself.

Co-development naturally leads to tight integration, and features like Postmortem DTrace have been enabled by treating the whole operating system as a cohesive whole. DTrace works by assembling telemetry from instrumentation in a per-CPU log-structured buffer, which is periodically switched out for the user mode tracing tools to copy out. If the system panics, the last active buffers are included (with all other kernel memory) in the resultant dump. The DTrace MDB module is able to lift those buffers out of the dump and use the DTrace libraries to produce output for any probe firings that occurred in the lead up to the crash. When a bug is difficult to reproduce, one can easily introduce additional instrumentation without needing to deploy new software or restart anything; if and when the fault recurs, that extra telemetry is readily available like a black box flight recorder.
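The buffer-switching scheme and its flight recorder property can be sketched as follows. This is a simplified model, not DTrace's implementation: the class and method names are hypothetical, records are plain Python objects, and a full buffer simply drops new records and counts the drop.

```python
class PerCpuTraceBuffers:
    """Per-CPU active/inactive buffer pairs with a consumer-driven switch."""

    def __init__(self, ncpus, capacity=1024):
        self.capacity = capacity
        self.active = [[] for _ in range(ncpus)]
        self.inactive = [[] for _ in range(ncpus)]
        self.drops = [0] * ncpus

    def record(self, cpu, rec):
        """Probe context: append to this CPU's active buffer, or drop if full."""
        if len(self.active[cpu]) >= self.capacity:
            self.drops[cpu] += 1  # favour system safety over completeness
            return
        self.active[cpu].append(rec)

    def switch(self, cpu):
        """Consumer: swap the buffers, then copy out and clear the old one."""
        self.active[cpu], self.inactive[cpu] = self.inactive[cpu], self.active[cpu]
        drained = list(self.inactive[cpu])
        self.inactive[cpu].clear()
        return drained

    def panic_dump(self):
        """What a crash dump would preserve: the not-yet-consumed active buffers."""
        return {cpu: list(buf) for cpu, buf in enumerate(self.active)}


bufs = PerCpuTraceBuffers(ncpus=2, capacity=2)
bufs.record(0, "a")
bufs.record(0, "b")
bufs.record(0, "overflow")  # dropped: buffer is full
consumed = bufs.switch(0)   # the tracing tool copies out "a" and "b"
bufs.record(0, "c")
bufs.record(1, "d")
dump = bufs.panic_dump()    # records that fired just before the "panic"
print(consumed, dump, bufs.drops)
```

The records in `dump` were never seen by the live consumer, yet they survive in the dump, which is what allows the lead up to a crash to be reconstructed afterwards.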

Third-Party Software

Oxide will produce a variety of software in-house, such as the control plane ([rfd48], [rfd61], et al) and the storage service ([rfd29], [rfd60], et al). Alongside the software we create and control ourselves, there will be software components that we merely adopt and integrate within our stack; e.g., we wish to ship CockroachDB ([rfd110]) rather than build our own fault tolerant database system.

In addition to application-level software, new hardware devices will often require a driver implemented at least partially within the operating system kernel.

We are by definition adopting and taking responsibility for any body of software that we ship in the product. Funding for development and maintenance of software can and does disappear, corporate stewards can merge or be acquired, and project structures can change significantly without much warning; e.g., RethinkDB, Riak, X.Org, MySQL, etc. We will hit bugs that others do not hit, and will likely need to fix them on our own timetable. Nonetheless, it is important to consider carefully how much work we are signing up for in choosing a platform.

Comparison: illumos and Linux

Linux is demonstrably popular, and as such many of the open source components that we may choose for integration into the product are developed and tested first against Linux interfaces. Other platforms are often an after-thought for project maintainers, if they are even considered at all. Little needs to be said here: Linux distributions generally ship with a package set that contains sufficiently current versions of relevant open source software, and other software is generally easy to build; manufacturers are often directly involved in providing in-tree drivers for their hardware.

Building the product on top of illumos will mean we at Oxide are committing to some amount of engineering and advocacy work in the future to ensure continued support from relevant upstream projects. While the illumos community is smaller than the Linux community, we would by no means be the only participants.

Our early control plane database prototyping efforts have been underway using software packaged by the OmniOS project, along with software we have built ourselves. These bodies of software were relatively easy to build with only minor patches, including:

CockroachDB
Prometheus
Grafana
Chrony (NTP)
ClickHouse

In addition to OmniOS, the pkgsrc project maintains cross-platform build recipes for a vast storehouse of open source software; some 20,000 packages are available for illumos systems. This project is a joint effort spear-headed by the NetBSD project.

The cornerstone of ensuring that modern software is easy to build is the continued availability of the runtimes and standard libraries that underpin that software. Support for illumos is in upstream stable versions, with regular CI/CD builds, of both the Rust and Go toolchains. Python, OpenJDK, and Node are also available. If a critical piece of software is not available natively and we cannot port it, it can likely be made to work within an emulation environment such as LX-branded zones or even bhyve.

With respect to drivers for hardware, it is difficult to predict how much work will be required before we have completed part selection. Some components will require custom driver work, such as the PCI-e interface for managing the switch. One notable example of a component where the illumos driver is actually maintained by the manufacturer themselves is Chelsio, who have had productive engagement with the illumos community in the past.

Measurement Of Success

For a foundational choice like the one this document describes, we should also consider the shape of a successful outcome. To succeed, most critically we need to ship a product on a specific time frame. That means applying the engineering talent we have amassed to work on a product for the market we believe exists. It means picking a stable foundation that allows us to build the product features we want, but also to support them once they are deployed in the field. It also requires us, where possible, to choose software that is already close to being what we need — where any improvements required to produce something we can ship are tractable and sustainable.

The API-driven virtual machine interface we have chosen to expose to the customer does not generally require them to comport with any internal technology choice we make, provided the product runs their workloads with an acceptable level of performance, reliability, and operability. This applies equally to components like the Service Processor ([rfd8]), where we intend to eschew IPMI and other industry standards; and to the Host Bootstrap Software ([rfd20]), where we intend to discard the UEFI stack in favour of something that more directly meets our needs. It applies as well to our choice of hypervisor and operating system.

With a relatively small team and a product that will form a critical infrastructure layer, it is important to focus on selecting and building tools that act as force multipliers for us; that allow us to find and correct software defects completely and promptly from as few observed incidents as possible.

Determinations

As of June 2021, considerable effort has been spent on exploring and reducing the risk of using illumos and bhyve in the product. We have continued to work with the Rust community on illumos-specific toolchain issues with positive results. Helios development environments are available to Oxide staff and are in use for various development efforts, in local and cloud virtual machines, and on physical lab hardware. We have been able to boot and use Helios in some form on various AMD Ethanol systems, as well as on the Tofino Barefoot switch sample system.

Our May 2021 milestone demonstration day featured several relevant components:

Helios, our illumos distribution, running on an AMD EPYC 7402P system in our lab environment
Propolis, our Rust-based userspace daemon for bhyve, booting an Alpine Linux guest image
Omicron, our Rust-based suite of control plane software, running natively on Helios
The CockroachDB and ClickHouse databases, running natively on Helios

When selecting an operating system in particular, there are literally thousands of dimensions on which choices can be evaluated; each dimension a new question that can be asked. Because we can only choose a single operating system, the options are effectively mutually exclusive — all of those thousands of questions ultimately have to go the same way. This document is not, and likely cannot be, an exhaustive survey of all facets of all of the choices in front of us. Other subsequent RFDs and project plans will cover specific work items in more detail as they arise.

With this in mind, and based on our experience thus far, we are moving forward with:

Helios, our illumos distribution, as the operating system for the host CPU in our servers
Propolis, our Rust-based userspace, and bhyve in the illumos kernel for guest workloads

Postscript

This RFD was originally written in 2021. While Propolis was open source during its development, Helios was open sourced in January 2024. The open sourcing of Helios prompted a discussion in [oxf-s4e4] that expanded on its history and rationale, along with reflections on its implementation in our shipping rack.

External References

[rfd4] Oxide Computer Co. RFD 4 User Facing API.
[rfd8] Oxide Computer Co. RFD 8 Service Processor (SP).
[rfd20] Oxide Computer Co. RFD 20 Host Bootstrap Software: Objectives.
[rfd21] Oxide Computer Co. RFD 21 User Networking API.
[rfd29] Oxide Computer Co. RFD 29 Storage Requirements.
[rfd48] Oxide Computer Co. RFD 48 Control Plane Requirements.
[rfd60] Oxide Computer Co. RFD 60 Storage Architecture Considerations.
[rfd61] Oxide Computer Co. RFD 61 Control Plane Architecture and Design.
[rfd110] Oxide Computer Co. RFD 110 CockroachDB for the control plane database.
[oxf-s4e4] Oxide Computer Co. Oxide and Friends, Season 4, Episode 4, Helios.
