RFD 88
Chassis Management Responsibility Allocation

Background

The Oxide server has a large number of sources of information about the state of the machine, ranging from discrete logic components with fault signaling capability to the thousands of MCA/MCAX registers providing minute detail about hardware faults detected by the host processor. It also has a similar array of control mechanisms for both electromechanical and silicon functionality. Each of these may be grouped into one or more conceptual functions, each of which in turn is part of a larger subsystem that provides services to the owner of the machine through the Oxide Control Plane or allows other Oxide software to better understand and control system state (possibly without the operator’s awareness).

As the designers of the machine, we have the opportunity to route the signals required for each of these functions to the microprocessor or microcontroller of our choosing, or even to employ fan-out buffers, muxes, or other logic as required to route them to multiple endpoints. The purpose of this document is to establish a collection of architectural principles for where data and control flows should terminate within each individual chassis. While the primary focus is on allocating these responsibilities between the host processor and service processor (SP) in Gimlets, the same principles are also applied within the Sidecar chassis and between the Sidecar and the Gimlet to which each is attached. Some additional Sidecar-specific discussion of environmental and inventory management may be found in [rfd58] §4; it is intended that it and other documents shall where appropriate define these data and control paths in greater detail in a manner consistent with the Determinations documented herein.

With certain obvious exceptions, nothing in this discussion is intended to be limited to the v1 product or the current generation of hardware; the same principles and design considerations will apply to future products. Therefore "Sidecar" and "rack-level Ethernet switch" may be used interchangeably and inclusively with respect to future designs; likewise "Gimlet" and "compute server". The Determinations herein apply to similar future products unless/until superseded in writing.

Principles

Single source of truth

Regardless of which subsystem is assigned an observation function, all other consumers of the observed data must obtain it through the assigned observer, never directly. This prevents inconsistent data (due to sampling skew, use of different interfaces, or other reasons) from being presented to an operator or used to drive an automated process. It also allows for any filtering, bounds-checking, time series creation, or other simple transformations to be done exactly once, reducing both computation and the risk of bugs.
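
To make the pattern concrete, here is a purely illustrative sketch (the scaling factor, limits, and names are invented for this example and are not taken from any real driver): a single owner is the only code that touches the sensor, every consumer goes through its accessor, and unit conversion and range checking happen exactly once.

[source,rust]
----
/// Illustrative only: a single owner of one temperature sensor. All other
/// consumers call `latest()` rather than touching the hardware themselves.
#[derive(Clone, Copy)]
pub struct Celsius(pub f32);

pub struct TempSensorOwner {
    last_good: Option<Celsius>,
}

impl TempSensorOwner {
    pub fn new() -> Self {
        Self { last_good: None }
    }

    /// Called from the owner's polling loop with a raw reading from the part.
    pub fn poll(&mut self, raw: i16) {
        // Hypothetical 0.125 °C/LSB scaling; conversion and bounds checking
        // are done here and nowhere else.
        let c = Celsius(f32::from(raw) * 0.125);
        if c.0 > -40.0 && c.0 < 125.0 {
            self.last_good = Some(c);
        }
    }

    /// The only path by which any consumer observes this sensor.
    pub fn latest(&self) -> Option<Celsius> {
        self.last_good
    }
}
----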

Single point of execution

This is the control analogue to the single source of truth. There are many possible points at which an operator, software, or hardware element may decide that an action should be taken or choose one from among a set of possible actions. Separately, there is a single place where any particular action is executed, regardless of where or how the decision to take that action originates. Thus, for any specific controlled mechanism, there exists a fan-in structure for conveying commands to the single point of execution. This allows for arbitration and hysteresis if a series of conflicting commands arrive in a short time, simplifies hardware, and reduces or eliminates the need for multiple (usually somewhat different) control functions in software.
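
A minimal sketch of such a fan-in follows; the types and the priority-based arbitration policy are hypothetical, and a real executor would also implement hysteresis and live wherever this document allocates the corresponding control.

[source,rust]
----
// Requests from several decision points converge on one executor, which is
// the only code that ever drives the control.
#[derive(Clone, Copy, PartialEq)]
pub enum Request {
    On,
    Off,
}

#[derive(Clone, Copy)]
pub enum Requester {
    ControlPlane = 0,
    HostSoftware = 1,
    ThermalProtection = 2, // highest priority; may force Off
}

pub struct Executor {
    pending: Vec<(Requester, Request)>,
}

impl Executor {
    pub fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// Any decision point may submit a request; none may act directly.
    pub fn submit(&mut self, who: Requester, req: Request) {
        self.pending.push((who, req));
    }

    /// Called periodically: the single point at which the control is driven.
    pub fn execute(&mut self, drive: &mut dyn FnMut(bool)) {
        // Trivial arbitration: the highest-priority requester this cycle wins.
        self.pending.sort_by_key(|&(who, _)| who as u8);
        if let Some(&(_, req)) = self.pending.last() {
            drive(req == Request::On);
        }
        self.pending.clear();
    }
}
----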

Inventory and health monitoring together

While communication channels between the host and SP exist and can be used to communicate chassis management data, our system is simplified by minimising the need for this. One way to achieve that is to allocate responsibility for monitoring managed device health and status to the same entity responsible for discovering that device and maintaining its attributes in an inventory. Disjoint inventories may be easily merged by higher-level software running on the host or in the control plane to present a comprehensive view of chassis state without the need to resolve conflicts or maintain a map of duplicate entities. Perhaps most importantly, the entity responsible for monitoring the state of a device will also be responsible for initiating recovery when that device causes or produces errors.
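
The merge itself is deliberately trivial, as the following sketch (with a hypothetical record type) illustrates: because each managed device is discovered by exactly one observer, the key sets are disjoint and no conflict resolution is needed.

[source,rust]
----
use std::collections::BTreeMap;

/// Hypothetical inventory record; real entries would carry FRUID/VPD fields.
#[derive(Clone)]
pub struct InventoryEntry {
    pub location: String, // e.g. a connector or reference designator
    pub ident: String,    // model and serial, if known
}

/// Merge the SP-owned and host-owned inventories for presentation. The key
/// sets are disjoint, so extending one map with the other is sufficient.
pub fn merge(
    sp: BTreeMap<String, InventoryEntry>,
    host: BTreeMap<String, InventoryEntry>,
) -> BTreeMap<String, InventoryEntry> {
    let mut all = sp;
    all.extend(host);
    all
}
----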

Complex behaviour should be self-hosted

Self-hosted environments offer richer, more mature observability tools that make it easier to understand large, complex bodies of software. They also, generally, provide more RAM — certainly so in our case. Larger bodies of software require more memory to run and are likely to have more bugs and bugs that are more difficult to find and fix. Therefore, our bias should be toward assigning responsibility for more complex features to the host and leaving the SP to provide those functions that are necessary for booting the machine, are required for safety even if the host is not functioning properly, have real-time constraints that the host cannot satisfy, or are needlessly wasteful of computational resources if run on the host.

On the flip side, the host processor has a limited collection of peripherals, and GPIO pins are in very short supply. It may be beneficial to allocate a function to the SP that requires these types of resources, especially if one of the host processor cores would be required to do substantial busy-waiting to perform the function. Although the host processor has a ridiculous amount of computational power, tying up multiple large cores to meet communication interface timing requirements is a poor use of it.

Sidecar/Gimlet/PSC parity

It is desirable that responsibilities be allocated in a similar manner between the host processor/host OS and the SP in servers and between the switch and the SP in Sidecars, and for both to have a large overlap with the functionality required of the power shelf controller (PSC). This parity or symmetry will allow for software reuse and keep processing and peripheral requirements similar so that the same SP can be used in all three applications. In a few cases, this may not be possible, as the Tofino2 ASIC is not a general-purpose processor and runs no operating system, and the PSC has no other major intelligence on board. In these cases, tasks performed by the host processor in servers may be allocated either to the Sidecar/PSC SP or to the PCIe-attached controlling server via sideband signalling or onboard Tofino2 functions accessible to the host via PCIe.

Allocations

Broadly speaking, we should allocate responsibility for environmental (thermal and power) control and monitoring to the SP via sideband interfaces, and responsibility for system inventory and device status monitoring to the host via in-band interfaces. Detailed allocation of tasks and routing of signals are discussed in the following sections.

Thermal Loop

The SP will have full responsibility for running a thermal loop, in which temperature sensors and possibly other data sources will be polled and fan speeds adjusted. In addition, the SP may toggle simple throttling mechanisms such as THERMTRIP_L and APML-based memory bandwidth capping that are made available by the major onboard ASICs. The specific objectives (ΔT, temperature cap, etc.) of this function are outside the scope of this document; however, at a minimum, the SP will be responsible for maintaining all onboard device junction temperatures below the manufacturer’s absolute maximum rating. If this cannot be achieved, the SP will be responsible for shutting down the server’s or Sidecar’s main power rails. It is anticipated that if analogous functionality is required of the PSC’s SP, it will be limited to the PSC’s own thermal zone. The PSC’s SP may or may not have the ability to control whether each individual rectifier is supplying power (see [rfd121]); however, thermal management of the rectifiers is an autonomous function of the rectifiers themselves, not the PSC.
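
The following is a deliberately simplified illustration of the loop's shape; the control law, thresholds, and types are invented for this sketch and are not intended to constrain the SP implementation.

[source,rust]
----
pub struct Device {
    pub temp_c: f32,         // latest junction temperature reading
    pub target_c: f32,       // temperature the loop tries to hold
    pub absolute_max_c: f32, // manufacturer's absolute maximum rating
}

pub enum ThermalAction {
    SetFanDuty(u8), // 0-100%
    ShutDownMainRails,
}

pub fn thermal_step(devices: &[Device]) -> ThermalAction {
    // Hard stop: if any junction has reached its absolute maximum rating,
    // give up on cooling and remove power.
    if devices.iter().any(|d| d.temp_c >= d.absolute_max_c) {
        return ThermalAction::ShutDownMainRails;
    }

    // Otherwise drive the fans from the worst-case excursion above target;
    // a real loop would use PID or similar, possibly per thermal zone.
    let worst = devices
        .iter()
        .map(|d| d.temp_c - d.target_c)
        .fold(f32::MIN, f32::max);
    let duty = (50.0 + worst * 5.0).clamp(20.0, 100.0);
    ThermalAction::SetFanDuty(duty as u8)
}
----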

The SP may choose to set thresholds for alert delivery in sensors that support this, based on the thermal specifications of the monitored device, to rely on integrated fixed-threshold mechanisms provided by manufacturers, and/or its own internal thresholding driven by polled sensor data. This is an implementation detail of SP software to be determined based on available margin, detection latency requirements, anticipated speed of thermal excursions, number of data sources to be polled, polling frequency, and hardware resources available to perform polling and interrupt-driven external threshold detection. It is not the intent of this document to constrain implementation of the SP’s thermal loop in any way, only to enumerate the set of data sources to be made available for its use and any limitations on that use driven by the necessity of sharing those sources.

This task is allocated to the SP because it may be required to prevent damage to hardware even when the host or switch is not functioning properly, and when a defect in the host OS prevents host-based software from making forward progress. The design of Hubris, as described by [rfd41], lends itself well to rapid automatic recovery when defects or hardware errors prevent the SP from functioning as designed; while the host OS is also designed to recover automatically from detectable errors, not all errors and defects are detectable and recovery often takes sufficient time for a major thermal excursion to occur before proper control could be reestablished. Finally, as there is no other device on the Sidecar board designed for this purpose, parity suggests that allocating this function to the SP will simplify the Sidecar design by eliminating some of the environmental management discrete logic discussed in [rfd58] §4.1.2.

As a result of this allocation, SMBus sideband interfaces for all PCIe end devices (including those that are soldered to the baseboard) will be routed exclusively to the SP for temperature sensor reading. All additional discrete temperature sensors, such as those required for inlet air, will be routed to the SP for its exclusive use. PWM fan controllers and fan tachometers will be routed exclusively to the SP, as this forms the control portion of the thermal loop. The SP will also be responsible for fan presence and fault detection, since the host will not have direct access to these devices.

Special Cases

The PCIe SMBus sideband interface for Sidecar will be routed to the attached host. As Sidecar is a separate chassis, there is no need for the server SP to read temperature sensors (or anything else) through this interface as there is for PCIe endpoints contained within the same (server) chassis. Host software responsible for controlling the switch is likely to need access to the Tofino2 ASIC through this interface. This should not be read to state that Sidecar temperature sensors should be routed to the attached host; they should be routed to the Sidecar’s local SP.

On servers, the host’s APML/SB-TSI SMBus-ish slave will be routed to the SP for functions that include sideband temperature monitoring of the host processor. Like DIMM SPD, this single bus segment hosts slave devices with very different functions; in addition to SB-TSI for temperature monitoring, the SB-RMI endpoint is also present, which is discussed in a later section; concisely, as both of these functions are logically allocated to the SP, routing the segment to the SP presents no problems. DIMM SPD is a different matter, and is discussed in detail in the next section.

Thermal Data Flow Diagram


DIMM SPD

Serial Presence Detect (SPD) is a feature of DIMMs described in [device-inventory] §1.2. The data stored in these devices is required by the memory controller (UMC) and associated phy microcontroller unit (PMU) to properly configure communication with the DIMMs during PSP boot. This requirement is architectural and imposed by AMD; it cannot be avoided in any reasonable way. There does not appear to be any means available to instruct the PSP to configure an I2C multi-master demux in order to access these devices; [amd-agesa] §4.1.7 describes a mechanism for using a single-master mux that allows the host processor to access multiple SPD banks, but describes no corresponding mechanism for multi-master sharing of SPD bus segments.

In general, we assume that the DIMMs we use will incorporate the TSE2004 type part. Without this temperature sensor, it would be simple and obviously correct to assign responsibility for and attachment of these devices to the host processor. With it, however, the situation becomes more challenging: the on-DIMM sensor provides a more accurate indication of DIMM temperature than a separate sensor placed on the system board next to the connector. The SP is responsible for the thermal loop; while a modern DDR4 RDIMM is unlikely to be the first managed device to approach its maximum allowed temperature, we still want these sensor readings available. This gives us a few options:

  1. Assign responsibility to the host, with one of the following sub-options:

    1. Implement a daemon to sample the sensors and convey the readings to the SP.

    2. Ignore the SPD-based sensors and drop discrete temperature sensors routed to the SP next to one or more DIMM connectors for use in thermal management.

    3. Do without DIMM temperature monitoring altogether.

  2. Drop an I2C switch (e.g., PCA9543) or generic mux (e.g., SN74CB3Q3257PW as in [amd-speedway]) controlled by the SP, allowing either the host or SP exclusive access to the SPD devices at any one time. The SP must set the switch to give the host control prior to booting the host and each time the host is reset. The SP could then reclaim control upon receipt of the first message from HBS, which is guaranteed to be sent after the PSP has finished reading the DIMM identity data for PMU programming.

    1. After this, the host has no access to the SPD nor the temperature sensor, if one exists.

    2. After this, the SP proxies requests for the SPD (and optionally the temperature sensor data), using the standard [rfd27] SP-host interface instead of the SMBus interface. This is essentially the opposite of (1)(a) above.

  3. Drop an I2C switch controlled by the host. This option is substantially hardware-identical to the previous one, but host software will be responsible for relinquishing control of the SPD bus to the SP and for reclaiming it prior to initiating any reset procedure. The SP will use the temperature sensor if it has access; otherwise it will treat the DIMMs as if they have SPD parts without this feature.

  4. Assign the SPD bus to the SP all the time, and implement an SPD proxy in the SP. Because hot-addition and hot-removal of DIMMs are never supported, the SP could simply read and cache the contents of the SPD EEPROMs (512 bytes each for a total maximum of 8 kB in the current server design) at power on. The host’s SPD SMBus master would be routed to an SMBus slave on the SP, and the SP would supply the contents to the PSP during boot. Emulation of the temperature sensor portion, which unlike the SPD returns mutable data, seems optional; the host doesn’t need it. A slightly clever mechanism will be required to emulate an I2C mux in software, as a maximum of 8 DIMMs per bus segment can be supported and PSP firmware supports no documented mechanism for reading SPD data from multiple segments other than via a standard switch compatible with the ubiquitous PCA9545A.

  5. Drop a PCA9641 or similar multi-master demux and allow the SP and host to share responsibility for these devices.

We’ll discuss the merits of each of these options in turn, then conclude with a survey of existing designs known to us.

Host Responsibility

Mechanism (1)(a) both requires significant additional software work and lacks the desired attribute of SP thermal loop autonomy. Logically we should compare it with (1)(c): because we must not assume host software is functioning properly, we cannot rely upon DIMM sensor readings, and the tradeoff reduces to writing additional host software in the hope of providing the data to the SP for opportunistic use. Option (1)(b) is much cleaner but likely to be even less accurate than the sensors built into the EEPROMs. Practically speaking we should probably consider all of these equivalent to (1)(c), which simplifies system design slightly at the cost of a somewhat less accurate (and less efficient) thermal subsystem.

SP-Controlled Switch

Mechanism (2) is problematic for host-directed reset of the host processor; we must either require the host to always request that the SP perform the reset (which may be desirable regardless), or we must ensure that the SP can reliably snoop every type of host processor reset and relinquish control of the SPD bus to the host immediately upon detecting one. One such mechanism may be having the SP relinquish the bus upon detection of a collision, which could also be used as a crude form of option (5). Because the host processor supports a wide range of reset types and AMD do not document the external observability of all of them, this type of crude arbitration may be the only choice. This problem space is related and very similar to that around the SP determining when to relinquish control of the host’s off-chip boot flash to the host processor, which is out of scope for this document; however, it may be considered advantageous to reuse some or all of the solution to that problem here if this mechanism is employed.

Host-Controlled Switch

Mechanism (3) is somewhat similar to (2) but is especially risky because the host may triple-fault or SYNC flood, causing a reset without any opportunity for the host to reclaim the bus from the SP for the following boot. For this reason, this seems strictly worse than option (2) and likely unworkable.

SP Proxy

Mechanism (4) feels like an ideal compromise, ensuring that data can always be made available where and when required. It also keeps the SP in control of the SPD EEPROMs, eliminating part of the attack surface should a compromised host attempt to rewrite them (see Security Considerations), and does not require special logic to handle host processor resets.

Important
Use of this mechanism requires that the AMD SMBus engine supports clock stretching, as proxying through the byte-oriented I2C engine in the H7 imposes at least one byte worth of added latency, in addition to any extra latency imposed by additional mux layers between the H7 and the SPD slave. AMD documentation doesn’t specifically state whether the SMBus engine supports stretching (in fact, this peripheral is not described at all), though there is some evidence that it does. As of October 2020, AMD have informed us that "[t]he SMBus controller supports clock stretching, but it does not support handling a violation of the SMBus defined clock low time-out (ie. Ttimeout > 35 ms as defined in http://smbus.org/specs/smbus20.pdf )" (sic).

This actually provides another reason to prefer this mechanism: the SP could detect an SPD slave entering this condition and take corrective action; a mux or direct connection to the SPD slave would presumably lock up the bus and hang the boot process. If this were to occur, there would likely be little or no observability, as the x86 cores are not executing at this point in the boot process and SPD reading is performed by firmware.

In this implementation, the host would be free to read (and write, if we believe this is needed — but only if allowed by the SP; see Security Considerations) the SPD EEPROM contents exactly as if it were permanently attached to them, while the SP would have permanent access to the temperature sensors and would even have DIMM inventory if we decided at some later time to add FRU inventory to the SP feature set (see next section). The drawbacks are consumption of an additional I2C slave on the SP and an additional SP software requirement for the proxy (though all of these options require some kind of software work somewhere). The STM32H7 offers up to 4 I2C interfaces, which can be configured as slaves with multiple addresses based on a simple bitmask mechanism; see [stm32h7x3-um] §50.4.8. It seems likely that this peripheral is suitable for this purpose.
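
To make the shape of option (4) concrete, the sketch below caches each DIMM's 512-byte SPD image once at power-on and answers subsequent host reads from that cache. The bus trait, the per-DIMM selection (standing in for PCA9545A-style segment selection), and all names are hypothetical rather than a real Hubris interface.

[source,rust]
----
pub const SPD_LEN: usize = 512;
pub const MAX_DIMMS: usize = 16;

pub trait SpdBus {
    /// Read one DIMM's SPD EEPROM contents over the SP-owned I2C segment.
    fn read_spd(&mut self, dimm: usize, out: &mut [u8; SPD_LEN]) -> Result<(), ()>;
}

pub struct SpdProxy {
    cache: [[u8; SPD_LEN]; MAX_DIMMS],
    present: [bool; MAX_DIMMS],
    selected: usize, // which DIMM the emulated switch currently points at
}

impl SpdProxy {
    pub fn new() -> Self {
        Self {
            cache: [[0; SPD_LEN]; MAX_DIMMS],
            present: [false; MAX_DIMMS],
            selected: 0,
        }
    }

    /// Run once at power-on, before the host is released from reset.
    pub fn populate(&mut self, bus: &mut dyn SpdBus) {
        for dimm in 0..MAX_DIMMS {
            self.present[dimm] = bus.read_spd(dimm, &mut self.cache[dimm]).is_ok();
        }
    }

    /// Host wrote to the emulated switch: record the new selection.
    pub fn select(&mut self, dimm: usize) {
        self.selected = dimm.min(MAX_DIMMS - 1);
    }

    /// Host read of `len` bytes at `offset` within the selected SPD image.
    pub fn host_read(&self, offset: usize, len: usize) -> Option<&[u8]> {
        if !self.present[self.selected] || offset + len > SPD_LEN {
            return None;
        }
        Some(&self.cache[self.selected][offset..offset + len])
    }
}
----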

DDR4 Vref(CA) Margining

There is one additional wrinkle we should consider here, although it is much more a matter of validation than chassis management: AMD’s reference designs include (and [amd-board-guide] §13.4.3 describes in general) a pair of LTC2629-1 DACs for margining Vref(CA) in the DIMM power subsystem. During design validation, the voltage level of this signal is swept across a range while observations are made of memory errors recorded by the UMCs in the host processor.

AMD provides two mechanisms for this test: a firmware mechanism that sends commands to the DACs if present (see [amd-agesa] §4.1.5, particularly APCB_TOKEN_CONFIG_MEM_MBIST_DATA_EYE et seq), and a hardware device that directly sets the voltage level via an onboard connector (see [amd-board-guide] §12.4.1). In both cases, an "HDT" (JTAG probe) is required to unlock certain registers in the UMC while the test is performed. Whether the DACs, connectors, both, or neither are designed into the board and populated, all AMD reference designs also incorporate the static voltage divider (see [amd-board-guide] §13.4.3), guaranteeing the correct reference level whenever the DACs are in their power-on state or depopulated. Margining is normally not done in the field, and most manufacturers almost certainly do not run this test in the factory after DVT. The AsRack EPYCD8 boards we have (which predate Rome) incorporate neither the DACs nor the AMD-specified connectors; we would assume that most if not all manufacturers would depopulate these components in production, especially because the DACs are quite expensive.

If present, this pair of quad DAC ICs (and no other devices) is installed on the SMBus shared with the SPD devices, upstream of any switch isolating DIMM sockets into groups of 8 or fewer to satisfy SPD addressing constraints. Although AMD does not state it explicitly, the fact that these are controlled by hidden core firmware makes their bus location, addressing, input reference voltage, and register layout an immutable interface that is effectively part of the AMD hardware. The DACs, then, are "User selected" ([amd-board-guide] §13.4.3 table 102) in the same way that one could obtain a Ford Model T in any colour the customer preferred, so long as that colour was black. A more detailed description of this undocumented interface derived from analysis of AMD reference designs may be found in [ddr4-power] §1.1.

The H7 has a built-in DAC peripheral ([stm32h7x3-um] §26), and we would suppose that it could emulate and subsume this functionality transparently without additional hardware. Such a design would cost less ($424 per rack) and free up baseboard space for other uses (approx. 60 mm²), but more compellingly would increase our degree of control; this function should never be used during normal operation, so the SP could lock it out unless given a specific command over a trusted datalink to indicate that a diagnostic or validation mode should be entered. If we choose to implement this option for SPD access, it seems very appealing to emulate the 2629s in the SP as well.

Although the SP could implement a completely 2629-compatible interface, a second problem makes this path unworkable: although the H7x3 has dual DACs and a fairly flexible mechanism for obtaining the voltage reference, the same reference is effectively always shared between the two DACs. For reasons discussed in much greater detail in [ddr4-power], this doesn’t work well with AMD’s SP3 DRAM interface design, which effectively requires separate DRAM power supplies on either side of the socket, each with its own Vref(CA) that must be margined independently. That document also provides a more detailed discussion of both AMD’s design and DDR4 requirements.

Although it is unfortunate that we cannot reuse this proxy for DRAM margining, the proxy is still an excellent option for SPD access.

Multi-Master Arbitration

Mechanism (5) is conceptually pleasant, but experience has taught that the PCA9641 is prone to lockups in which the bus is never released as it should be. Additionally, although the Milan SMBus controller includes a bit for indicating to software that a collision has been detected, we have no way of knowing (a) whether this functionality actually works or (b) whether the PSP firmware responsible for reading the SPD contents handles it correctly. Accordingly, this option seems to carry more risk than usual.

Implementation Survey

The [amd-ethanol] design takes approach (1)(c) (effectively, as this platform is intended to support commodity OSs that would not have the necessary functionality for proxying); no connection exists between the SPD subsystem and the BMC. The [amd-daytona] (see sheet 58) design takes approach (2), using a quad mux controlled by the BMC. The means by which the BMC decides to take control is not known; it presumably involves a private interface between the AMI BIOS and Insyde BMC firmware. It is also unknown whether the SPD temperature sensors (if installed) are used by the BMC’s thermal management subsystem. The AsRack EPYCD8 server board’s approach is not known in any detail; however, we do know that the SPD EEPROMs are not visible to the host OS, implying some kind of mux or switch; a TI SN74CBTLV3257 is present nearby and may be used for this purpose.

Data Flow Diagrams

DFDs are provided here for options (2) and (4) (SP-Controlled Switch and SP Proxy, respectively). Processes downstream of this subsystem are shown for reference but are representative only and out of scope for this document.

SPD Option 2 Data Flow Diagram

image::spd-2-dfd.svg[SPD Option 2 (SP-Controlled Switch) Data Flow Diagram,opts=interactive]

SPD Option 4 Data Flow Diagram

image::spd-4-dfd.svg[SPD Option 4 (SP Proxy) Data Flow Diagram,opts=interactive]

Power Control and Monitoring

With one exception, all power supervision, sequencing, and conversion logic is the responsibility of the SP. All PMBus interfaces from power converters, onboard PMICs, and related devices, along with separate fault signals and associated discrete or integrated voltage, current, and power sensors, will be routed to the SP (or to external logic under the SP’s control) for the purpose of monitoring and controlling power. The SP is also responsible for acting on excursions immediately, independent of fault diagnosis performed by the host or control plane software. That is not to imply that these errors needn’t be propagated to the control plane; they must be, so that accurate diagnosis of transient or persistent faults can be made. The SP is not responsible for this diagnosis, only for the immediate response; the latency associated with remote diagnosis is far too great to prevent damage when power rails fall out of spec.
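
The division of labour described here (immediate local response, with diagnosis deferred to the host or control plane) can be sketched as follows; the rail trait, limits, and reporting hook are placeholders rather than a real driver interface.

[source,rust]
----
pub struct Limits {
    pub v_min: f32,
    pub v_max: f32,
    pub i_max: f32,
}

pub struct Reading {
    pub volts: f32,
    pub amps: f32,
}

pub trait PmbusRail {
    fn read(&mut self) -> Reading;
    fn disable(&mut self);
    fn name(&self) -> &'static str;
}

pub fn check_rail(
    rail: &mut dyn PmbusRail,
    limits: &Limits,
    report: &mut dyn FnMut(&'static str, Reading, bool),
) {
    let r = rail.read();
    let fault =
        r.volts < limits.v_min || r.volts > limits.v_max || r.amps > limits.i_max;
    if fault {
        // Immediate, local response: the latency of remote diagnosis is far
        // too great to prevent damage from an out-of-spec rail.
        rail.disable();
    }
    // Either way, forward the sample and fault indication so the control
    // plane can distinguish transient from persistent faults.
    report(rail.name(), r, fault);
}
----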

The SP does not play a direct role in regulation of the host processor’s VDDCR_CPU and VDDCR_SOC rails. Conversion and regulation of these rails is directed by the host processor itself via the SVI 2.0 protocol. Although it may be possible in principle for the SP to snoop or interpose on this channel between the host processor and the power regulators, doing so seems unnecessarily challenging and risky with no commensurate benefit. Instead, the SP may monitor and in a limited way control this subsystem via a separate out-of-band PMBus mechanism. In normal operation, the SP will function only as an observer; AMD’s architecture requires that the power conversion subsystem powering these rails act in accordance with the SVI 2.0 specification and the control commands issued by the processor. The SP is, however, ultimately responsible for ensuring that these rails are brought up or taken down in the appropriate sequence by enabling or disabling the power converters as required, and for shutting down these rails in the event of certain faults or power excursions that may be signalled by the host processor and/or the power converters and regulators.

On the PSC, power control functionality includes only the PSC’s own power domains; it is expected that the rectifiers and other shelf logic contain crowbars or other autonomous mechanisms for maintaining the power output to the bus bars within spec and shutting down if that cannot be achieved. Power monitoring on the PSC includes all power domains within the shelf that are visible to it, including the shelf output.

Power Data Flow Diagram


FRU Inventory, Error and Performance Telemetry

While the sideband signal routing discussed thus far may allow the SP to form at least a partial FRU inventory, we can see from [device-inventory] §1.4 and §2.2.2-3 that we cannot rely on sideband mechanisms to collect this information. Moreover, considerable error telemetry exists that is inherently available only to the host, including PCIe link state and protocol-specific transaction status. Absent a global interposer of the type contemplated by [rfd42], which is beyond our capabilities, there is no way for an out-of-band observer such as the SP to obtain this state directly. If we wished to aggregate this state at the SP, it would have to be collected and proxied by system software on the host processor. As we assume a diagnostic and recovery model driven by host software (with parts likely distributed within the larger control plane) rather than by the SP, we can eliminate software complexity by not implementing such a proxy.

The host will use in-band mechanisms to obtain FRUID and VPD from PCIe endpoint devices and DIMMs (SPD as noted previously), and will be responsible for collecting inventory of and error and performance telemetry from these devices. The managed devices included here are: M.2 and U.2 SSDs, NIC, the Tofino2 switch, and any AIC slots we may include in future designs. The host is best positioned by far to collect this data, which includes standard PCIe link and device status, AER, and device-specific status accessible only in-band. We do not specify here whether the host, higher-level control plane software, or both will be responsible for diagnosing faults from this telemetry stream and directing or taking appropriate action in response; the portions of the data flow diagrams included here that are downstream of the chassis management subsystem are intended only for representative purposes.
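
As one example of telemetry that is inherently in-band, the sketch below walks a device's PCIe extended capability list to find the AER capability and read its error status registers; the config-space accessor is a placeholder for whatever the host OS provides.

[source,rust]
----
pub trait ConfigSpace {
    fn cfg_read32(&self, offset: u16) -> u32;
}

const AER_CAP_ID: u32 = 0x0001; // Advanced Error Reporting

pub struct AerStatus {
    pub uncorrectable: u32, // Uncorrectable Error Status (capability + 0x04)
    pub correctable: u32,   // Correctable Error Status (capability + 0x10)
}

pub fn read_aer(dev: &dyn ConfigSpace) -> Option<AerStatus> {
    // Extended capabilities start at config offset 0x100; each 32-bit header
    // is [15:0] capability ID, [19:16] version, [31:20] next offset.
    let mut off: u16 = 0x100;
    for _ in 0..64 {
        let hdr = dev.cfg_read32(off);
        if (hdr & 0xffff) == AER_CAP_ID {
            return Some(AerStatus {
                uncorrectable: dev.cfg_read32(off + 0x04),
                correctable: dev.cfg_read32(off + 0x10),
            });
        }
        off = ((hdr >> 20) & 0xfff) as u16;
        if off == 0 {
            break;
        }
    }
    None
}
----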

The host OS will be responsible for monitoring its own host processor (see [device-inventory] §1.1) using the large array of in-band mechanisms and for conveying this telemetry to the control plane. Should the host processor experience a catastrophic error, the SP may use the APML/SB-RMI interface to obtain error data and convey it to the control plane. As discussed previously, this interface is available on the same bus segment as the SB-TSI temperature monitoring endpoint, a function also assigned to the SP. This mechanism should be used only if the host processor is non-functional to avoid duplicate error reports.

The absence of a host processor is something of a special case: the onboard power sequencing logic (which may be implemented by the SP) should ensure that host power rails are not energised in the absence of a host processor, or in the presence of a mechanically but not electrically compatible host processor model. See [amd-ethanol] and [amd-daytona] for example implementations. Applying power to a Gimlet under these conditions indicates either a major fault or operator error, and this condition must be reported to the control plane by the SP.

FRUID Data Flow Diagram


Special Cases

There are at least two, possibly three, exceptions that require special discussion. Fans, as part of the thermal subsystem, are attached only to the SP. This means that the SP will have exclusive knowledge of both fan presence and fan health (in the form of tachometer readings and, conceptually, locked rotor indicators). Fans, as discussed in [device-inventory] §1.6, are somewhat unusual in that they are likely to be FRUs that lack built-in identification (but see [rfd132] §3.1.7 and §8). The SP will need to proxy fan presence and error (locked rotor) state to the host or control plane software for integration with diagnostic, alerting, and recovery functions; however, there may be no identity or location data associated with this telemetry if the fan or fan module does not have add-on FRUID storage.

The other exception is the Gemini subsystem itself. Although inventory is not meaningful (these devices are always present and their location is fixed), error and performance telemetry will be collected by the SP and conveyed to control plane software for diagnosis, alerting, and recovery.

Last but not least are the identity and location of a Gimlet or Sidecar itself. Each of these boards will have a permanently fixed FRUID ROM installed containing the model and serial number. As this information is normally immutable, this ROM will be attached to the SP, which will be responsible for reading its contents once at startup. Additionally, location data conveyed to the SP over the management Ethernet, aux network, or other means will be obtained at startup, and both identity and location will be conveyed to host software using the interface described in [rfd27] §1. The SP will also be responsible for conveying these data to control plane software so that they can be kept current even when the host is not operational. As this information is normally immutable while a Gimlet is installed or a Sidecar is receiving power, there is no need for periodic polling, making it simple and inexpensive for host software to obtain it from the SP during the boot process.

Note
By parity, we should route Sidecar’s FRUID to the Sidecar SP. It’s unclear whether a direct mechanism exists for system software on the PCIe-attached Gimlet to obtain this same data.

Operational Considerations

[rfd82] requires us to consider the customer’s need for capacity planning, which in turn requires awareness by the control plane software of not only the presence and identity of a server but also a general awareness of the resources it can provide. At a minimum, this means we must know the quantity and model of host processor(s) and the quantity of volatile and nonvolatile storage it contains. In general, it may also be useful to know the speed of its network links and perhaps something about the performance of its nonvolatile storage devices, though it’s likely reasonable to infer these from the server model.

As our cold start model will power every installed compute server at least once, and we generally will power on a newly-installed compute server in a functioning rack unless our power budget absolutely forbids it, the simplest way to address this requirement is to allow the host to obtain this information (which it will do as a matter of course during boot) and cache it in the control plane. Except for nonvolatile storage, this inventory is immutable while the server is installed in the rack. If the server is removed, the control plane will be informed of this through the presence subsystem and the server will be powered when it is later reinstalled, repeating this process. There are only two reasons a server would not remain powered on and fully operational:

  • It has a fault that prevents us from doing so

  • Operator-supplied policy forbids us from doing so

In these cases, the operator will be informed that the FRU (and resource) inventory for this server is either unavailable or cached and potentially out of date, along with the reason the server is not operational. Servers that are inoperable due to a fault do not provide capacity, so nothing is lost by not having an accurate resource inventory in that case. Operators can be made aware that they have capacity islands due to power capping or other policies, and a likely estimate of that capacity can be supplied from the cache along with information about how to unlock that capacity.

Additionally, [rfd82] requires us to consider how operational hardware problems are resolved. The caching mechanism just described can easily be made to include information about diagnosed faults in specific components within a server that is no longer operational. All of the same discussion points for capacity planning apply here. When a technician replaces a faulty component other than a nonvolatile storage device, the server must have been removed from the rack and upon reinsertion will be powered up and the inventory updated. At this time, the fault will be automatically cleared provided that the faulty device was properly replaced. Nonvolatile storage devices may be replaced while the server is operational and that replacement will be detected automatically. In either case, the control plane will be able to convey to operators an accurate indication of the server’s device inventory and the fault status of each device, even when the server is not installed or is powered off. Note that this is true regardless of the server’s exact power state.

Determinations

With the exception of DIMM SPD and remaining questions discussed in Open Questions below, the entire preceding document should be considered normative with respect to assignment of chassis management responsibilities:

  1. Responsibility allocations described here should be reflected in the lower-level design of our hardware and software; and

  2. Where responsibility for a specific chassis management function is not explicitly allocated, allocation should be made in accordance with the principles listed in Principles and with an eye toward parity or analogy with those discussed here.

The SP Proxy mechanism described in DIMM SPD should be explored in EVT1, as it simplifies software, reduces the number of required hardware components, and provides the best overall result. Should this prove unworkable, we will need to investigate the open questions around host processor reset handling and consider the SP-Controlled Switch instead, as discussed in DIMM SPD option (2).

Open Questions

  • How will the PCIe-attached host obtain information about Sidecar FRUs and the top-level Sidecar PCBAs? While that host can obtain this information from control-plane software, a more direct mechanism may be desirable.

  • Which classes of host processor resets are externally observable, and by what means? Answering this will require observation of reference platform behaviour, because AMD do not document it.

  • How will the host obtain detailed information about faults in or observed by PCIe power controllers? This information is directly available only to the SP, which can supply it to a hypothetical distributed diagnosis engine, but the recovery procedure must be performed locally and with low latency; it may optionally rely on the details for ideal behaviour. There are two basic options here:

    • Don’t do that; either don’t try to recover automatically from power faults at all, or attempt recovery in a way that doesn’t depend on details the host can’t obtain via the PCIe interface; or

    • Allow the host to fetch some abstracted version of detailed fault data (possibly as simple as a single bit indicating whether automated recovery should be attempted, though that would put the SP into a decision-making role it probably shouldn’t have) via the UART channel. Separating recovery out as an asynchronous process would probably be required, adding complexity to the PCIe device state machine.

Security Considerations

We consider the security aspects of this architecture from four perspectives:

  • requirements for integrity, confidentiality, and availability of telemetry (this is out of scope because this RFD does not define the services; however, it is incidentally necessary to assess these requirements in order to contemplate the broader set of considerations);

  • changes to the shape of the attack surface caused by assignment of responsibilities;

  • threats associated with the SPD proxy service described in DIMM SPD; and

  • vulnerabilities associated with the DDR4 margining functions (although this function is out of scope for this RFD and discussed only in passing, it necessarily shares an SMBus with services subject to these architectural guidelines).

Data Protection Requirements

The TBD services that fall under this architectural umbrella process mostly public data, such as identity information for DIMMs or SSDs. Other data falls into a gray area: power and thermal sensor data, error reports generated by hardware, fault diagnoses, and other device health data. This data is not by itself sensitive, though some customers may consider it so. The communication channels on Gimlet and Sidecar baseboards used for the collection of this data are not protected by cryptographically strong protocols, so this data can easily be spoofed, tampered with, and read by any attacker with intrusive physical access. Attack vectors requiring intrusive physical access are outside this product’s threat model; however, this method could be used to add hostile devices to the machine with access to these communication channels. This is discussed further in later sections; however, the lack of any truly sensitive data accessible in this manner makes this seem like a low risk possibility.

Of greater interest is high-resolution power dissipation data, which in rare instances has been used as a side channel for observing privileged processor state (see [cve-2020-8695]). Although most of our power monitoring infrastructure is not likely designed to provide sufficient resolution to enable this type of attack (in particular, power data is unlikely to be sampled by the SP at any greater frequency than 1 Hz), the PDN components used to supply the host processor are capable of advanced high-resolution monitoring that could conceivably be used for this purpose. Compromised SP software or a hostile I2C master on the same bus could conceivably be used to create an exfiltration path for other confidential data. This risk is not specific to the architecture this RFD proposes; no alternative allocation of responsibilities would mitigate it. However, this does suggest that owners of the hardware providing this service should assess this risk in greater detail. For our purposes, sampled power dissipation (and component voltage levels and current flows) is considered non-confidential. Likewise, thermal data, which makes a far less effective side channel and in any event is also likely to be sampled at low resolution, requires no special protection with respect to confidentiality.

The confidentiality requirements associated with error telemetry and device health are externally defined by the services that provide or process them. The service processor is not expected to process this data, most of which this RFD allocates to the host processor. Both the host processor and the SP are expected to employ datalinks to the control plane that provide strong protection of confidentiality and integrity; however, service owners must consider the risks associated with any type of error telemetry that may include customer data (e.g., cleartext sent to a self-encrypting storage device in a failed transaction). Local processing of this data on the host processor is out of scope.

Fault diagnoses are expected to be made either locally by system software running on the host or by control plane software. Although out of scope for this RFD, these diagnoses (current and historical) likely merit at least some protection. An attacker who can obtain these may be more easily able to introduce malicious hardware (for example, by sending a malicious replacement for a piece of hardware known to be faulty, or by impersonating a technician sent to replace a piece of faulty hardware). An attacker who can alter them could do the same, as well as causing various classes of denial of service attacks: against an individual machine, by incorrectly introducing a fault diagnosis of some critical component, or against an entire organisation by introducing numerous spurious diagnoses over a period of time, forcing operators to diagnose faults manually or replace large quantities of functional hardware. This type of attack could be economically damaging to both Oxide and the customer, depending upon its extent. Because these diagnoses are derived from a range of other telemetry, the ability to violate the integrity of these upstream sources could be used in a similar manner.

Availability is mostly out of scope here, because the part of the system that is of interest is inaccessible except by intrusive physical attacks that require removing the component from the rack with associated loss of power. The data that would normally be associated with these services do not exist and/or cannot be sampled in this condition, and control paths are unusable. The much greater loss of availability has already occurred. Altering the machine to prevent collection of data or use of control paths once it has been reinstalled is certainly possible, but if the objective is to cause denial of service this seems needlessly complicated. One exception may be anything that can cause a loss of thermal data, which could cause premature fan wear, overheating, and other misbehaviour. The risk of such attacks seems low, as does the impact, though an attacker could conceivably make the hot aisle hotter and cause a nuisance for on-site technicians. Loss of availability of power telemetry is only a nuisance; however, interference with signals required for sequencing or fault detection could lead to temporary or permanent failure of the hardware. None of these vulnerabilities are specific to the allocation of responsibilities discussed here.

Assignment of Responsibilities

The services that will eventually be defined elsewhere in accordance with the general architecture described here are primarily low level, with commodity hardware on one side and the control plane software on the other. Communication channels with the control plane in both local SP software and local host software, out of scope here, are assumed for our purposes to provide mutual authentication and integrity protection. In scope, then, is the fact that assigning an actor the responsibility for monitoring and controlling a particular data source or control point has two side effects that affect security:

  • the actor becomes a potential source of attacks against the assigned data sources, control points, and adjacent logic; and

  • those data sources, control points, and adjacent logic become potential sources of attacks against the designated actor.

Generally, the considerations here are not affected by which actor is assigned a particular chassis management responsibility. However, allocating a particular function to the SP or host processor does change somewhat the shape of the attack surface. Software running on the host processor is potentially vulnerable to guest escape attacks, initiated either by hostile users (e.g., a customer’s disgruntled employee) or by remotely-compromised guest software (a wide range of possible threat actors with arbitrary motives). Software running on the SP is not directly vulnerable to this threat; however, as our architecture assigns most of the low-level monitoring and control functions to the SP, it becomes exposed to a wider range of attacks initiated via the managed devices themselves. This type of attack generally relies on the ability to replace a managed device, or some component of one, with a hostile device. Such an attack may be carried out by a threat actor with invasive access to the machine and the opportunity to reinstall the component after tampering with it or by compromising the supply chain.

A slightly higher-risk vector involves the installation of an SSD containing a hostile VPD EEPROM and/or additional hostile I2C master(s). Although the impact of such an attack is limited by the segmentation of each VPD endpoint onto its own dedicated SMBus (see [device-inventory] §2.2), the opportunity still exists to confuse SP software with invalid data, lock up the bus and delay the SP’s ability to access temperature data, and possibly other I2C-based attacks on downstream infrastructure. If we instead allocated responsibility for SSD VPD reading exclusively to the host (whether in-band, out-of-band, or both), similar risks would exist with respect to the host processor’s SMBus peripheral and host software. Note that as the SP is not designed to protect the host from hostile PCIe endpoints, and the host necessarily accesses the SSDs using a PCIe link, the risks to the host associated with a hostile SSD are not specific to chassis management and cannot be mitigated by any architectural choice within the scope of this document. An interposer of the type proposed in [rfd42], likely incorporating additional logic and software, could mitigate these risks, but is beyond our ability to execute at this time. Such a device may be worth revisiting should capabilities increase or these risks be deemed excessive.

Host-Supplied Data

There is no means of protecting confidentiality or integrity of data pulled from the host by the SP over the SB-RMI/SB-TSI interface, so to continue a theme, an attacker with intrusive physical access could compromise the integrity and confidentiality of this data by the addition of malicious hardware. Of potentially greater interest is the ability of compromised software running on the host processor to use this mechanism to confuse the SP or perhaps create a covert exfiltration channel. However, [amd-milan-ppr] §8.6 does not describe any means of doing this from the x86 cores. As the PSP and potentially other hidden cores can read and modify at least some of these registers, malicious firmware could perform these actions. As this firmware by design can be supplied only by AMD, an attack of this type would require either inside access at AMD or an as yet unknown vulnerability in the hidden core hardware or firmware, plus an attacker choosing to target our SP software. Although it is difficult to assess the risk of these factors, the SB-RMI/SB-TSI interface seems like a low-risk vector as it is fairly simple.

SPD Proxy Service

This is the only service that is actually described by this RFD, so we should consider its security properties in greater detail. Generally speaking, the relevant aspects have already been discussed elsewhere. This service is required by the host to boot and for DRAM to function properly. DIMMs are subject to hostile replacement by a technician with intrusive physical access or through a supply chain attack, and could incorporate I2C devices designed to attack the SP, host, or other adjacent devices (most notably via the DDR4 margining DACs if installed; see the next section). The SPD temperature sensors are also used by the SP’s thermal loop, so compromised devices could mislead the SP into wasting energy, wearing out fans prematurely, overheating the adjacent hot aisle, etc. These attacks are likely very low risk; a hostile DIMM offers far more exciting opportunities to compromise a wide range of data and processes on the host. In such an attack, the temperature sensor could conceivably be used as a very low bandwidth exfiltration vector. Also like the rest of the telemetry sources, a bogus SPD target could attack software elsewhere in the system that consumes its data, including control plane software and even front-end software. The complexity of such multi-phase, indirect attacks (not to mention the far more exciting prospects available to an attacker capable of satisfying the prerequisite) would seem to render these vectors low risk.

Corresponding attacks against the SPD proxy by the service processor make little sense, but such attacks may be possible by compromised SP software. Once again, an attacker capable of compromising the SP software seriously enough to undertake such an attack would almost certainly prefer other, more exciting targets. A simplistic attack, disabling the proxy, would cause the host to fail to boot; an attacker would need to reset the host processor to trigger this, so there is no obvious reason such an attacker would not simply hold the host processor in reset instead. A more interesting attack might involve providing subtly wrong DIMM identification data to the memory controller, inducing random spurious errors, reboots, and fault diagnoses. As with other tampering or spoofing attacks against telemetry sources, this type of attack could vary in scope from a minor nuisance limited to a small number of components in a single chassis to a major organisational denial of service attack involving costly downtime and unnecessary hardware replacement across an entire fleet. This is a general risk associated with any fault management system, and is specific to neither the proxy service nor its implementation on the SP.

DDR4 Margining

We have several choices in how we approach DDR4 margining, which are not mutually exclusive. These choices are out of scope for this document and will not be made here; however, it is necessary to list them because some of them introduce potential attack vectors via components of the SPD proxy service discussed previously:

  • to perform margining during validation, or not;

  • to design our baseboard with the Naples-style test headers, or not;

  • to design our baseboard with the Rome/Milan-style DACs, or not;

  • to populate the headers (if designed) on production boards, or not;

  • to populate the DACs (if designed) on production boards, or not.

The LTC2629-1, as is customary among devices of this type, performs no authentication against the source of commands it receives via I2C. The address of each is also fixed by AMD’s firmware interface. By design, the DACs have the ability to source or sink enough current to completely override the static resistor network (if present) and set the corresponding Vref(CA) voltage level(s). Therefore, if the DACs are present, any master on this bus could trivially interfere with some or all of the host-DRAM interfaces, rendering memory inaccessible or causing errors at whatever rate the bus master chooses. At a minimum, therefore, the DACs constitute a denial-of-service vector if installed. The nature of the reference voltage mechanism makes permanent damage to hardware unlikely.

If we choose to perform margining during design validation (which seems only prudent), and we choose to use the firmware mechanism (currently AMD’s recommended path), we then face a tradeoff: populating the DACs adds substantial cost and introduces this vulnerability, while depopulating them costs us the ability to verify margin on boards in the field (or in the factory, without a special build for sampling). Should we leave the DACs populated, available attack vectors would include:

  • malicious SP software, which could be supplied by a hostile Oxide employee, possibly by a hostile customer employee, or any actor capable of mounting a software supply chain attack;

  • compromised SP software, which could arise at runtime due to attacks from the administrative plane (with or without authentication);

  • malicious SP hardware, which could be delivered through a supply chain attack or an intrusive physical attack with replacement of hardware (though anyone capable of the latter would have easier ways to achieve this goal);

  • malicious or compromised host firmware executing on hidden cores during early boot (an attacker capable of doing this can do far worse); or

  • a malicious DIMM containing an I2C master in addition to or in place of the standard SPD EEPROM.

While none of these attack vectors seem especially high risk, they may, along with the substantial added cost, tip the balance away from shipping production boards with DDR4 margining DACs populated.

Summary

There are a few recurring themes here:

  • Replacing a data source can be done with casual physical access to SSDs and invasive physical access otherwise, and allows an attacker opportunities to:

    • Confuse SP software

    • Confuse/attack other software, including the control plane

    • Spoof or tamper with error telemetry to cause incorrect diagnoses and costly downtime, unnecessary hardware replacements, etc.

    • Attempt to reprogram EEPROMs accessible on the same bus

    • Interfere with SP and/or host access to the device or adjacent devices

    • Cause thermal excursions, energy waste, and potentially hardware damage

    • Interfere with inventory (as distinct from error telemetry and metrics) information presented to operators, which may have its own security impact

  • A compromised host (most likely via guest escape, but the details aren’t important) can also generate a limited amount of false telemetry, and in some control plane architectures may also be able to produce false fault diagnoses.

  • DDR4 margining DACs pose special risks and deserve additional attention.

Most of these attack vectors seem to be fairly low-risk, and the attacks would be complex to carry out. Nearly all attacks one could mount against these services would have the effect of denial of service, either locally or by inducing a widespread rash of false inventory data and fault diagnoses, which could be quite costly for Oxide and probably the customer. In a few instances hardware could be permanently damaged, though the risk of that outcome attributable to the choices made here is very low. Plausible threat actors and motivations are limited mainly by the likely goal of denying service, and are otherwise nonspecific.

Mitigations specific to these services that may be worthwhile to explore include:

  • Not implementing VPD/SPD EEPROM programming functionality in the SP/host

  • Activating hardware-specific locking mechanisms on SPD/VPD EEPROMs under our control; these usually require invasive physical access to defeat

  • Informing operators whenever possible of any online or offline device replacements (which may not have been authorised)

Additional details are specific to the services governed but not described by this document and are necessarily out of scope.

External References