316 - HSS/SP communication protocol / RFD / Oxide

Authors

Cliff L. Biffle, John Gallagher, Keith M Wesolowski, Andy Fiddaman

The host system software (henceforth HSS, which includes HBS as described by [rfd20] and the host operating system described by [rfd26]) and the Service Processor (henceforth SP, described by [rfd8]) have dedicated async serial channels for communication. [rfd27] describes the Interprocessor Control Channel, but defers the details of the protocol and interface used by HSS and the SP on that channel. The purpose of this document is to describe those details.

The majority of uses of the HSS / SP control channel are effectively RPC calls from HSS to the SP. Both the HSS kernel and usermode applications (specifically sled-agent, described in [rfd61]) need to make requests of the SP.

Proposal

We propose using a protocol with the following properties, with a heavy eye toward simplicity:

The host only ever initiates requests by sending data on the channel, and the SP only ever replies to requests. If the SP needs to notify the host that an event has occurred or other data of interest is available, it must assert the out-of-band level-triggered interrupt signal SP_TO_SP3_INT_L via GPIO. When this occurs, the host will make a request to the SP to enumerate the set of pending events and then retrieve the associated data or clear these events by making additional requests.
The host kernel owns the channel, and exposes it to sled-agent via a device node with a call-oriented interface (i.e. the kernel handles conversion of requests and responses to and from their streaming forms).
Only one request may be outstanding at a time (i.e. no pipelining). This may define an implicit upper bound on how long an operation is allowed to take; e.g., we wouldn’t want the kernel to be unable to notify the SP of a panic if a sprockets operation was outstanding for an extended period of time. Overlong operations (of the known list, this is exclusively sprockets operations) could potentially be pipelined at the application level by breaking them up (e.g., "start signing X", "poll for completion of signing X" instead of a single "sign X" operation).

Encoding

We propose encoding the messages in a format compatible with the Rust [hubpack] crate, which is effectively a series of little-endian bytes for the values being encoded. This has the advantage that the SP can use hubpack to serialize and deserialize the messages, after handling the framing, and provides a simple and predictable message format. Encoded messages will be constrained to a maximum size (MAX_MESSAGE_SIZE), allowing fixed-size buffers to be used where appropriate, and messages will be structured so that they can be deserialized incrementally in order to reduce memory consumption in the SP. Most messages will be significantly shorter than this maximum size.

MAX_MESSAGE_SIZE is initially defined to be 4123 bytes (4KiB + header size + crc size + sizeof (u64)) which allows for a 4KiB payload along with an extra 64-bits of accompanying data in the data portion of the message. This is chosen so that the largest messages, such as a block of a recovery phase 2 image, can be a full 4KiB and still carry a small amount of associated data; the maximum message size may be amended in the light of further information and after prototyping. Note that the [_framing] overhead will mean a slightly larger buffer is required to accommodate a maximally-sized message in its entirety.
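The arithmetic behind this figure can be written down as constants. This is only a sketch: `MAX_MESSAGE_SIZE`, `HEADER_SIZE` and `CHECKSUM_SIZE` are names used in this document, while `PAYLOAD_SIZE`, `EXTRA_DATA_SIZE` and `MAX_DATA_SIZE` are illustrative names introduced here.

```rust
/// Largest payload carried in one message, e.g. a phase 2 image block (4 KiB).
/// Illustrative name, not taken from the protocol definition.
pub const PAYLOAD_SIZE: usize = 4096;
/// magic (u32) + version (u32) + sequence (u64) + command (u8).
pub const HEADER_SIZE: usize = 4 + 4 + 8 + 1;
/// Trailing checksum (u16).
pub const CHECKSUM_SIZE: usize = 2;
/// Extra accompanying data alongside a full payload: one u64. Illustrative name.
pub const EXTRA_DATA_SIZE: usize = core::mem::size_of::<u64>();

/// Maximum size of an encoded message, before framing overhead.
pub const MAX_MESSAGE_SIZE: usize =
    PAYLOAD_SIZE + HEADER_SIZE + CHECKSUM_SIZE + EXTRA_DATA_SIZE;

/// Upper bound on the length of the data field. Illustrative name.
pub const MAX_DATA_SIZE: usize = MAX_MESSAGE_SIZE - HEADER_SIZE - CHECKSUM_SIZE;

// Fail the build if the constants drift from the documented value.
const _: () = assert!(MAX_MESSAGE_SIZE == 4123);
```

Expressing the limit as a sum keeps the relationship between the parts visible if the maximum is amended after prototyping.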

Each encoded message will have the following form:

header[magic, version, sequence, command] | data | checksum

as shown in the following table.

Protocol Overview
Field	Size	Notes
magic	u32	0x1DE19CC
version	u32	The protocol version number. This will be incremented as necessary for changes in the command set or to the data associated with a command. Ideally, both the SP and the Host should be willing to accept and use a small number of previous protocol versions.
sequence	u64	Increments for each new HSS→SP message. SP uses request value when building a response, but also sets the top bit as an indicator that this is a reply message.
command	u8	In Rust, `command` is likely to be an enum and `hubpack` encodes enums as a u8; it should be sufficient, particularly since each direction has its own set of values.
data	[u8; n]	Varies based on command, n <= MAX_MESSAGE_SIZE - HEADER_SIZE (17) - CHECKSUM_SIZE (2)
checksum	u16	Fletcher-16
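As an illustration of the header layout in the table above, the following sketch encodes and decodes the four header fields by hand as little-endian bytes and derives a reply header from a request by setting the top bit of the sequence number. It does not call hubpack; the `Header` type, its methods and the `SEQ_REPLY` constant are hypothetical names chosen for this example.

```rust
pub const MAGIC: u32 = 0x01DE_19CC;
pub const HEADER_SIZE: usize = 17;
/// Top bit of the sequence number, set by the SP in replies. Hypothetical name.
pub const SEQ_REPLY: u64 = 1 << 63;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Header {
    pub magic: u32,
    pub version: u32,
    pub sequence: u64,
    pub command: u8,
}

impl Header {
    /// Lay the fields out in table order, each as little-endian bytes.
    pub fn to_bytes(&self) -> [u8; HEADER_SIZE] {
        let mut out = [0u8; HEADER_SIZE];
        out[0..4].copy_from_slice(&self.magic.to_le_bytes());
        out[4..8].copy_from_slice(&self.version.to_le_bytes());
        out[8..16].copy_from_slice(&self.sequence.to_le_bytes());
        out[16] = self.command;
        out
    }

    /// Inverse of `to_bytes`; performs no validation of the field values.
    pub fn from_bytes(buf: &[u8; HEADER_SIZE]) -> Self {
        let mut magic = [0u8; 4];
        let mut version = [0u8; 4];
        let mut sequence = [0u8; 8];
        magic.copy_from_slice(&buf[0..4]);
        version.copy_from_slice(&buf[4..8]);
        sequence.copy_from_slice(&buf[8..16]);
        Header {
            magic: u32::from_le_bytes(magic),
            version: u32::from_le_bytes(version),
            sequence: u64::from_le_bytes(sequence),
            command: buf[16],
        }
    }

    /// Header the SP would use when answering this request: same sequence
    /// value with the top bit set, and a command from the SP→HSS set.
    pub fn reply(&self, command: u8) -> Self {
        Header { sequence: self.sequence | SEQ_REPLY, command, ..*self }
    }
}
```

For a request with sequence `0x7c`, `reply` yields sequence `0x8000_0000_0000_007c`, whose little-endian encoding ends in `0x80`.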

The command field is deliberately placed directly before the associated data to allow data-carrying enums to be used in the Rust implementation, if desired, for at least some of the data block. This is also why the large byte streams are at the end of the data, allowing incremental consumption and lower memory use, particularly on the SP side. The command field is set to one of the values from the following tables, depending on the direction of the message (note that 0x0 is not used for a command value, to aid [_framing]):

HSS→SP Commands
Value (mnemonic)	Associated Data	Description
0x1 (HSSReboot)	-	Host requests reboot; no response is expected from the SP.
0x2 (HSSPowerOff)	-	Host requests power off; no response is expected from the SP.
0x3 (HSSBsu)	-	Host requests Boot Storage Unit (BSU) value (see [rfd241])
0x4 (HSSIdent)	-	Host requests identifying data (model, revision, serial)
0x5 (HSSMac)	-	Host requests MAC addresses
0x6 (HSSBootFail)	reason: u8, data: [u8; n]	Host boot failure; no response is expected from the SP. A reason code is sent followed by data providing more detail. Reasons are: * `1` → General failure. * `2` → Could not locate a phase 2 image. * `3` → Phase 2 protocol header problem. * `4` → Integrity failure. * `5` → Ramdisk problem. The SP should record this information and propagate it to the control plane. Typically the host will follow this command with an HSSReboot.
0x7 (HSSPanic)	cause: u16, data: [u8; n]	The host has panicked. The cause field gives a high-level overview of the cause; currently defined values are: * `0xca11` the panic() function was explicitly called; * `0xa9nn` a trap occurred. The trap number is identified by `nn`; * `0x5enn` fatal userland trap `nn` occurred; * `0xeb00` early boot panic; * `0xeb97` early boot prom panic; * `0xeba9` early boot trap occurred. The data following the cause provides more information about the panic including data such as the stack trace. This is considered opaque by the SP. The SP should record this information and propagate it to the control plane but not take action such as power cycling the host. The host will send an HSSReboot command when ready - it may, for example, spend some time writing a crash dump to disk.
0x8 (HSSStatus)	-	Host requests SP status register values. See [_the_sps_status_registers] for further details.
0x9 (HSSAckStart)	-	Host acknowledges SP task startup.
0xa (HSSAlert)	-	Host requests a pending alert message from the SP.
0xb (HSSRot)	TBD	Reserved for host RoT request. For reference, the RoT operations include `GetCertificates`, `AddHostMeasurements`, `GetMeasurements` and `SignTranscript`. See [rfd256].
0xc (HSSRotMeas)	TBD	Reserved for host measurements (see [rfd256])
0xd (HSSImageBlock)	hash: [u8; 32] offset: u64	Request data from a phase 2 image with the provided hash, starting at the requested byte offset. There is no length parameter; the SP should send as much data as it can (up to MAX_PACKET_SIZE). In practice this is limited by the amount of data that can fit in a UDP packet between MGS and the SP.
0xe (HSSKeyLookup)	key: u8 maxresponse: u16	Perform a lookup for the specified key. The size of the buffer which is available to receive the corresponding value is also provided as part of the request which allows the SP to fast fail if the buffer will not be large enough. `key` is a `u8` in order that it can be easily encoded from an enum by `hubpack`. Currently defined keys are: * `0` → ping (the SP will respond with "pong") [RO]; * `1` → Installinator Image ID [RO]; * `2` → Inventory status, as a tuple of `(count, version)`. The `version` is always 0 for now, but is included for forward-compatibility; the count is the limit for HSSGetInventoryData (below) [RO]; * `3` → `/etc/system` file content [RW, max 256 bytes]; * `4` → `/kernel/drv/dtrace.conf` content [RW, max 4096 bytes].
0xf (HSSGetInventoryData)	index: u32	Performs a lookup for the specified inventory item. The index must be less than the inventory count returned by HSSKeyLookup. For details on inventory, see [rfd360].
0x10 (HSSKeySet)	key: u8 data: [u8; n]	Set a value for the specified key. The key indices are the same as those listed above for HSSKeyLookup. Only the keys flagged as `RW` can have their values set via this command, and their lengths are limited to the sizes shown.

SP→HSS Commands
Value (mnemonic)	Associated Data	Description
0x1 (SPAck)	-	SP acknowledges request. Used when there is no data to send back. Host will check the sequence number to determine that the acknowledgement is for the message it last sent.
0x2 (SPDecodeFail)	reason: u8	SP could not decode request, reasons are: * `1` → COBS decode failure (the sequence will be set to all 1s) * `2` → CRC mismatch * `3` → Deserialization failure (the sequence will be set to all 1s) * `4` → Magic number mismatch * `5` → Protocol version unsupported (try a lower one if you can) * `6` → Sequence invalid (high bit set) * `7` → Data length incorrect
0x3 (SPBsu)	bsu: u8	SP provides BSU * `0x41` → BSU A * `0x42` → BSU B
0x4 (SPIdent)	model: [u8; 11] revision: u32, serial: [u8; 11]	SP provides identifying data. See [rfd219] for the format of serial numbers.
0x5 (SPMac)	base: [u8; 6] count: u16 stride: u8	SP provides MAC address data in the FRUID format proposed in [rfd320].
0x6 (SPStatus)	status: u64 startup_options: u64	SP provides status register values. See [_the_sps_status_registers] for further details.
0x7 (SPAlert)	action: u8, data: [u8; n]	SP provides a pending alert message in response to request. If there are no pending alerts, then action is `0` and there is no data. Currently alert messages with a non-zero `action` are displayed to the OS console and sent to the system log. In future, `action` may be extended to allow messages to be delivered to other endpoints — via a door to a userland application such as sled-agent, for example. Care must be taken to avoid losing any alert messages as a result of [retransmission]. Whenever the SP sends an SPAlert message to the host, it must retain a copy in a buffer and resend it if it subsequently receives a HSSAlert message with the same sequence number. The cached message can be discarded once an HSSAlert request is received with a different sequence number.
0x8 (SPRot)	TBD	Reserved for SP RoT Response
0x9 (SPImageBlock)	data: [u8; n]	Provide phase2 image block.
0xa (SPKeyLookup)	result: u8 data: [u8; n]	Provide a response to an attempted value-by-key lookup. `result` is one of: * `0` → Successful lookup, value follows; * `1` → Invalid key; * `2` → No value exists for requested key; * `3` → Buffer too small for key’s value.
0xb (SPInventoryData)	result: u8 name: [u8; 32] type: u8 data: [u8; n]	Provide a response to an attempted inventory data lookup. `result` is one of: * `0` → Successful lookup, more data follows; * `1` → Invalid index; * `2` → Communication with the target failed in a way that suggests device absence (e.g. I²C NACK); * `3` → Communication with the target failed in a different way. `name` is a unique identifier for the given inventory item, with trailing `0` bytes. The identifier is typically the designator of the relevant IC on the PCA; if the IC is on a daughterboard, it is a nested list of designators separated by `/`. `type` is a discriminator that specifies how `data` is to be decoded; `data` is a type-specific serialized `struct`. The values and encoding for `type + data` form a private interface between the SP and the host. For forward-compatibility, we should only append fields to a given `data` payload; the host can then ignore any trailing bytes that it doesn’t know about.
0xc (SPKeySet)	result: u8	Provide a response to an attempted key set operation. `result` is one of: * `0` → The value was stored successfully; * `1` → Invalid key; * `2` → Key is read-only; * `3` → Provided data too long for key.
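To make the interaction between the key table (HSSKeyLookup, HSSKeySet) and the SPKeySet result codes concrete, here is a minimal sketch of the check an SP might apply before storing a value. The key numbers, RO/RW flags, size limits and result codes come from the tables above; the function and type names are hypothetical, and the order in which the checks are applied is an assumption.

```rust
/// Result codes carried in the SPKeySet response. Variant names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum KeySetResult {
    Stored = 0,
    InvalidKey = 1,
    ReadOnly = 2,
    TooLong = 3,
}

/// Decide how the SP would answer an HSSKeySet request for `key` with `data`.
pub fn check_key_set(key: u8, data: &[u8]) -> KeySetResult {
    let max_len = match key {
        // ping, Installinator image ID and inventory status are read-only.
        0 | 1 | 2 => return KeySetResult::ReadOnly,
        // /etc/system content: RW, at most 256 bytes.
        3 => 256,
        // /kernel/drv/dtrace.conf content: RW, at most 4096 bytes.
        4 => 4096,
        _ => return KeySetResult::InvalidKey,
    };
    if data.len() > max_len {
        KeySetResult::TooLong
    } else {
        KeySetResult::Stored
    }
}
```

Because the enum is `repr(u8)`, `KeySetResult::TooLong as u8` gives the wire value `3` for the `result` field.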

Unused bytes in data fields, for example in the opaque RoT request and response structures, are set as 0xff to aid [_framing].

Example

One way in which this could be represented in Rust, suitable for serializing with hubpack, is:

enum SPToHSSCommand {
    _Unused,    // Skip 0x0
    // ... elided ...
    Ident { model: u8, rev: u8, serial: [u8; 11] },
    // ... elided ...
}

struct Response {
    magic:  u32,
    ver:    u32,
    seq:    u64,
    cmd:    SPToHSSCommand,
    check:  u16,
}

The following message:

Response {
    magic: 0x1de19cc,
    ver: 0x1,
    seq: 0x8000_0000_0000_007c,
    cmd: Ident {
        model: 0x81,
        rev: 0x1,
        serial: [ 0x42, 0x4d, 0x4e, 0x33, 0x34, 0x32, 0x32, 0x30, 0x30, 0x30, 0x31, ],
    },
    check: 0xbeef,
}

serializes as:

    0xcc, 0x19, 0xde, 0x01,
    0x1, 0x0, 0x0, 0x0,
    0x7c, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x80,
    0x4,
    0x81,
    0x1,
    0x42, 0x4d, 0x4e, 0x33, 0x34, 0x32, 0x32, 0x30, 0x30, 0x30, 0x31,
    0xef, 0xbe,

The SP’s status registers

The SP maintains a 64-bit status register in which the bits carry the following meanings:

Bit 0 - SRTaskRestart - The SP task has (re)started;
Bit 1 - SRAlertAvail - Alert message(s) available;

The SP also maintains a 64-bit startup options register in which the bits carry the following meanings:

Bit 0 - SORecovery - Request phase 2 image from the SP via IPCC;
Bit 1 - SOKBM - Enable KBM debugging;
Bit 2 - SOBootRD - Enable BootRD debugging;
Bit 3 - SOPROM - Enable PROM debugging;
Bit 4 - SOKMDB - Load KMDB;
Bit 5 - SOKMDBBoot - Load KMDB and enter it at boot.
Bit 6 - SORamdisk - No phase 2, mount ramdisk as root.
Bit 7 - SONetBoot - Network boot.
Bit 8 - SOVerbose - Enable verbose boot.

Whenever the status register is non-zero, the SP asserts the out-of-band level-triggered SP_TO_SP3_INT_L signal via GPIO. The startup options register does not affect the interrupt status.

When the SP host communication task starts, it initialises the status register to 1 (SRTaskRestart) indicating that the task has started. This could be at SP boot time or when the task has restarted due to a fault.

The Host can read the value of the status and startup options registers at any time via an RPC request, but does so usually in response to the SP_TO_SP3_INT_L signal being asserted. Based on the bits which are set, the host will issue one or more requests to consume the information from the SP.

If the host issues the HSSAckStart message, the SP will clear SRTaskRestart in its status register;
If the host issues the HSSAlert message, the SP will reply with an SPAlert message and, if no more messages are pending, will clear SRAlertAvail in its status register.

In all cases, if the status register becomes 0, the SP will de-assert the SP_TO_SP3_INT_L signal.
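The status register behaviour described in this section can be modelled in a few lines. The sketch below is not SP code: `SpStatus` and its methods are hypothetical, the alert queue is reduced to a counter, and the GPIO line is represented by a boolean derived from the register value.

```rust
pub const SR_TASK_RESTART: u64 = 1 << 0;
pub const SR_ALERT_AVAIL: u64 = 1 << 1;

/// Toy model of the SP side of the status register.
pub struct SpStatus {
    status: u64,
    pending_alerts: usize,
}

impl SpStatus {
    /// On task (re)start the register is initialised to SRTaskRestart.
    pub fn new() -> Self {
        SpStatus { status: SR_TASK_RESTART, pending_alerts: 0 }
    }

    /// The interrupt is asserted whenever the status register is non-zero.
    pub fn interrupt_asserted(&self) -> bool {
        self.status != 0
    }

    /// A new alert becomes available for the host.
    pub fn queue_alert(&mut self) {
        self.pending_alerts += 1;
        self.status |= SR_ALERT_AVAIL;
    }

    /// Host sent HSSAckStart: clear SRTaskRestart.
    pub fn on_ack_start(&mut self) {
        self.status &= !SR_TASK_RESTART;
    }

    /// Host sent HSSAlert: hand over one alert if any is pending, and clear
    /// SRAlertAvail once none remain. Returns whether an alert was delivered.
    pub fn on_alert_request(&mut self) -> bool {
        let delivered = self.pending_alerts > 0;
        if delivered {
            self.pending_alerts -= 1;
        }
        if self.pending_alerts == 0 {
            self.status &= !SR_ALERT_AVAIL;
        }
        delivered
    }
}
```

Deriving the interrupt state from the register, rather than tracking it separately, means the two cannot disagree.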

Framing and synchronisation

Encoded messages will be framed using Consistent Overhead Byte Stuffing [COBS], with 0x0 being used as the frame delimiter. This will increase the message size by at most 1 byte per 254 bytes of data, giving a known value for the maximum frame size (based on MAX_MESSAGE_SIZE) and allowing for the use of static buffers where appropriate.
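For illustration, a straightforward COBS encoder and the resulting frame size bound might look as follows. This is a hand-written sketch rather than the implementation used on either side (which would more likely come from an existing crate); `cobs_encode` and `max_frame_size` are names chosen for this example.

```rust
/// COBS-encode `input` and append the 0x0 frame delimiter.
pub fn cobs_encode(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(max_frame_size(input.len()));
    let mut code_idx = 0; // position of the code byte for the current run
    let mut code: u8 = 1; // distance from the code byte to the next zero
    out.push(0); // placeholder, patched when the run ends
    for &b in input {
        if b == 0 {
            // End of run: the code byte records where the zero was.
            out[code_idx] = code;
            code_idx = out.len();
            out.push(0);
            code = 1;
        } else {
            out.push(b);
            code += 1;
            if code == 0xff {
                // 254 non-zero bytes in a row: start a new run.
                out[code_idx] = code;
                code_idx = out.len();
                out.push(0);
                code = 1;
            }
        }
    }
    out[code_idx] = code;
    out.push(0); // frame delimiter
    out
}

/// Worst-case frame size for this encoder: one leading code byte, one more
/// per full run of 254 bytes, plus the delimiter.
pub fn max_frame_size(message_len: usize) -> usize {
    message_len + message_len / 254 + 1 + 1
}
```

With this encoder, a 4123-byte message containing no zero bytes occupies 4141 bytes on the wire including the delimiter, which is the size a static frame buffer would need under these assumptions.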

In order to optimise for COBS encode/decode speed, we will take some steps to reduce the number of 0x0 bytes in the original encoded data by:

Numbering commands starting from 0x1;
Setting unused bytes at the end of data blocks to 0xff.

Note

Prototyping work has shown that the amount of necessary padding is minimal since the data portion of the message can be implemented as variable length, while still retaining compatibility with [hubpack].

Using COBS to frame packets on the wire has a number of benefits. Either end is able to unambiguously identify the end of a packet, and either end can terminate a partially sent packet by just writing a frame terminator. There are, however, some situations that can cause the two ends of the channel to get out of sync, and the protocol implementation has to be able to deal with this.

First, it is conceivable that corruption could occur during transmission. This has not been seen in extensive testing on two separate servers (where gigabytes of data have been transferred over this channel) but it’s worth thinking through what would happen if it did. Assuming the corruption is within the body of the packet, then it may be detected by the checksum at the end of the frame; that checksum uses the Fletcher-16 algorithm, which is cheap for the SP to calculate. Assuming the checksum appears correct, then the magic and version fields will be checked and, on the Host side, the sequence number in the response packet will be checked against the expected value.
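As a reference point for the cost of this check, a Fletcher-16 computation is a two-accumulator loop. The sketch below assumes the common variant with both sums starting at zero and reduced modulo 255; this document does not state the initial values, the modulus, or exactly which bytes of the message are covered, so those are assumptions.

```rust
/// Fletcher-16 over `data`, assuming zero initial sums and modulo-255
/// arithmetic (not confirmed by this document).
pub fn fletcher16(data: &[u8]) -> u16 {
    let mut sum1: u16 = 0;
    let mut sum2: u16 = 0;
    for &b in data {
        // Both sums stay below 255 after each step, so u16 cannot overflow.
        sum1 = (sum1 + u16::from(b)) % 255;
        sum2 = (sum2 + sum1) % 255;
    }
    (sum2 << 8) | sum1
}
```

Under these assumptions the ASCII string `abcde` checksums to `0xC8F0`. A receiver would recompute the value over the received bytes and compare it with the trailing `u16`.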

If corruption is detected by the SP, then it will reply with a special message that indicates it is unable to decode the request, and the host will re-send the pending message. Similarly, if the host detects corruption in a reply, it will discard it and re-issue the request, unchanged.

A special case is if there is corruption in the frame terminator itself. Without anything being done to guard against this, the channel would become permanently wedged. Implementing a timeout here was considered but discarded as an option because there is no guaranteed response time for any message sent to the SP. Some messages are likely to take a while and the SP is not a hard real-time OS, so selecting an appropriate timeout value is difficult. The chosen solution is for each side to follow up a packet with periodic additional frame terminators, while waiting for a reply, possibly filling up the Tx FIFO. When read by the other side of the channel, this just appears as an empty frame and is discarded. The sending interval for these extra terminators is not critical - roughly one every 0.1s seems reasonable.
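On the receiving side, discarding these extra terminators amounts to ignoring empty frames when splitting the byte stream. A minimal sketch, assuming the receiver has accumulated bytes in a buffer (`split_frames` is a hypothetical helper, and the frames it returns are still COBS-encoded):

```rust
/// Split `stream` on 0x0 delimiters. Returns the complete, non-empty frames
/// and any trailing bytes that have not yet been terminated.
pub fn split_frames(stream: &[u8]) -> (Vec<&[u8]>, &[u8]) {
    let mut frames = Vec::new();
    let mut start = 0;
    for (i, &b) in stream.iter().enumerate() {
        if b == 0 {
            // An empty frame (back-to-back terminators) is silently dropped.
            if i > start {
                frames.push(&stream[start..i]);
            }
            start = i + 1;
        }
    }
    (frames, &stream[start..])
}
```

Repeated terminators after a packet therefore produce no additional frames, while a terminator that was corrupted in transit is eventually replaced by one of the follow-up terminators.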

Another way that synchronisation can be lost is if the SP task panics/restarts after the host has sent a command. In that case it will come back up without the command to process and the host will still be waiting. To address this, the SP maintains a 64-bit status register and, whenever it is non-zero, it asserts the out-of-band interrupt. Whenever the SP task starts or restarts, it sets a bit in that register to indicate that, which has the side effect of asserting the interrupt. The host notices this and gives up sending/waiting for the active command, and issues a new request to retrieve the status register. It then processes the bits which are set there, clearing them by retrieving data from the SP or sending commands to acknowledge the event. Once the register is clear (and the interrupt de-asserted), the original command is sent again.

The reverse situation can occur if the host panics between sending a request to the SP and processing the reply. In that case, the host will send a panic message and then, after a variable amount of time, a reboot request. This causes a situation where the SP is blocked writing a response to the host and the host is blocked writing the panic message to the SP. To prevent this deadlock, the SP must continue to read from the host even while it is sending a response. It will usually just see the empty bonus frame discussed earlier, but if it sees a new command then it must discard whatever it is busy sending and process the new message.

While it is not expected to be possible, it’s also worth considering what would happen if the host and the SP were to get out of step. That is, the host is transmitting requests and the SP is returning replies, but the SP reply is an old one, a response to a previous message. To handle this case, when an SP reply is valid in all aspects apart from the sequence number, the host should discard the reply and listen again, without re-sending.

Retransmissions

As may be apparent from what’s above, there are some cases when the host will have to automatically resend a message during a transaction. This can occur when:

the SP asserts its interrupt while the host is sending or waiting for a response;
the SP replies to a request with SPDecodeFail;
the host has read IPCC_MAX_PACKET_SIZE bytes from the SP without finding a frame terminator;
the host cannot decode the COBS frame received from the SP;
the decoded packet from the SP is shorter than IPCC_MIN_PACKET_SIZE;
the reply message checksum does not match;
the magic number in the reply message is incorrect;
the version number in the reply message is incorrect;
a request sequence number was found in the reply. This would indicate that the UART is looped back in some way and it’s not clear that it is possible to recover in this case.

When retransmitting a message for any reason, the Host will re-use the sequence number from the original request. Recovering from an SP reset means multiple commands back and forth, and so the original request will be put into a new packet, with a new sequence number, once resynchronisation is complete.
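The host-side handling of a decoded reply can be summarised as a small decision function. This is a sketch under several assumptions: the frame has already been COBS-decoded and met the minimum length, the "correct" version is taken to be the one the host sent, and the `Reply`, `Action` and `classify` names are hypothetical. The "listen again" case follows the out-of-step rule in the framing section above.

```rust
pub const MAGIC: u32 = 0x01DE_19CC;
pub const SEQ_REPLY: u64 = 1 << 63;
pub const SP_DECODE_FAIL: u8 = 0x2;

#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Accept,
    Retransmit,
    /// Discard the reply and keep listening without re-sending.
    ListenAgain,
}

/// Fields of a decoded reply that matter for validation.
#[derive(Debug, Clone, Copy)]
pub struct Reply {
    pub checksum_ok: bool,
    pub magic: u32,
    pub version: u32,
    pub sequence: u64,
    pub command: u8,
}

pub fn classify(reply: &Reply, sent_version: u32, request_seq: u64) -> Action {
    if !reply.checksum_ok || reply.magic != MAGIC || reply.version != sent_version {
        return Action::Retransmit;
    }
    if (reply.sequence & SEQ_REPLY) == 0 {
        // A request sequence number came back, e.g. a looped-back UART.
        return Action::Retransmit;
    }
    if reply.command == SP_DECODE_FAIL {
        // Checked before the sequence comparison because some decode
        // failures carry a sequence of all 1s.
        return Action::Retransmit;
    }
    if reply.sequence != (request_seq | SEQ_REPLY) {
        // Valid in every respect except the sequence: a stale reply.
        return Action::ListenAgain;
    }
    Action::Accept
}
```

In every `Retransmit` case the host would re-send the request with its original sequence number, as described above.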

Open Questions

Should RoT requests be opaque to the kernel, or a specific request type and a payload it just passes along?

Security Considerations

We have no means for either the host or the SP to authenticate the other across the UART channel; invasive physical access would allow an attacker to act as a man-in-the-middle and inject arbitrary requests/responses.

External References

[COBS] IEEE Consistent Overhead Byte Stuffing
[hubpack] hubpack: a predictable serde format
[rfd8] Oxide Computer Co. RFD 8: Service Processor (SP)
[rfd20] Oxide Computer Co. RFD 20: Host Bootstrap Software: Objectives
[rfd26] Oxide Computer Co. RFD 26: Host Operating System & Hypervisor
[rfd27] Oxide Computer Co. RFD 27: Interface: Service Processor / Host System Software
[rfd61] Oxide Computer Co. RFD 61: Control Plane Architecture and Design
[rfd219] Oxide Computer Co. RFD 219: Serial: Numbers, Not a Podcast
[rfd241] Oxide Computer Co. RFD 241: Holistic Boot
[rfd256] Oxide Computer Co. RFD 256: Remote Attestation and Trusted Transport Protocol
[rfd293] Oxide Computer Co. RFD 293: BETTER SLED THAN DEAD: A guide to sled resurrection from beyond the update window
[rfd320] Oxide Computer Co. RFD 320: Per-board MAC address allocation and tracking
[rfd360] Oxide Computer Co. RFD 360: Gimlet and Sidecar Topology Trees and Device Identification
