The host system software (henceforth HSS, which includes HBS as described by [rfd20] and the host operating system described by [rfd26]) and the Service Processor (henceforth SP, described by [rfd8]) have dedicated async serial channels for communication. [rfd27] describes the Interprocessor Control Channel, but defers the details of the protocol and interface used by HSS and the SP on that channel. The purpose of this document is to describe those details.
The majority of uses of the HSS / SP control channel are effectively RPC calls from HSS to the SP. Both the HSS kernel and usermode applications (specifically sled-agent, described in [rfd61]) need to make requests of the SP.
Proposal
We propose using a protocol with the following properties, with a heavy eye toward simplicity:
The host only ever initiates requests by sending data on the channel, and the SP only ever replies to requests. If the SP needs to notify the host that an event has occurred or that other data of interest is available, it must assert the out-of-band level-triggered interrupt signal SP_TO_SP3_INT_L via GPIO. When this occurs, the host will make a request to the SP to enumerate the set of pending events and then retrieve the associated data or clear these events by making additional requests.
The host kernel owns the channel, and exposes it to sled-agent via a device node with a call-oriented interface (i.e. the kernel handles conversion of requests and responses to and from their streaming forms).
Only one request may be outstanding at a time (i.e. no pipelining). This may define an implicit upper bound on how long an operation is allowed to take; e.g., we wouldn’t want the kernel to be unable to notify the SP of a panic if a sprockets operation was outstanding for an extended period of time. Overlong operations (of the known list, this is exclusively sprockets operations) could potentially be pipelined at the application level by breaking them up (e.g., "start signing X", "poll for completion of signing X" instead of a single "sign X" operation).
Encoding
We propose encoding the messages in a format compatible with the Rust [hubpack] crate, which is effectively a series of little-endian bytes for the values being encoded. This has the advantage that the SP can use hubpack to serialize and deserialize the messages, after handling the framing, and provides a simple and predictable message format. Encoded messages will be constrained to a maximum size (MAX_MESSAGE_SIZE), allowing fixed-size buffers to be used where appropriate, and messages will be structured so that they can be deserialized incrementally in order to reduce memory consumption in the SP. Most messages will be significantly shorter than this maximum size.
MAX_MESSAGE_SIZE is initially defined to be 4123 bytes (4KiB + header size + crc size + sizeof (u64)), which allows for a 4KiB payload along with an extra 64 bits of accompanying data in the data portion of the message. This is chosen so that the largest messages, such as a block of a recovery phase 2 image, can be a full 4KiB and still carry a small amount of associated data; the maximum message size may be amended in the light of further information and after prototyping. Note that the [_framing] overhead will mean a slightly larger buffer is required to accommodate a maximally-sized message in its entirety.
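The size arithmetic can be written out as constants. This is only a sketch: HEADER_SIZE and CHECKSUM_SIZE follow the field sizes given in the message layout table, while MAX_PAYLOAD_SIZE, EXTRA_DATA_SIZE and MAX_DATA_SIZE are hypothetical names introduced here for illustration.

```rust
/// Header fields: magic (u32), version (u32), sequence (u64), command (u8).
pub const HEADER_SIZE: usize = 4 + 4 + 8 + 1;

/// The checksum is carried as a u16.
pub const CHECKSUM_SIZE: usize = 2;

/// Hypothetical name: the largest bulk payload, e.g. one block of a
/// phase 2 image (4KiB).
pub const MAX_PAYLOAD_SIZE: usize = 4096;

/// Hypothetical name: room for a u64 of data accompanying the payload.
pub const EXTRA_DATA_SIZE: usize = core::mem::size_of::<u64>();

/// 4KiB + header size + crc size + sizeof (u64).
pub const MAX_MESSAGE_SIZE: usize =
    MAX_PAYLOAD_SIZE + HEADER_SIZE + CHECKSUM_SIZE + EXTRA_DATA_SIZE;

/// Hypothetical name: the upper bound on the length of the data field.
pub const MAX_DATA_SIZE: usize = MAX_MESSAGE_SIZE - HEADER_SIZE - CHECKSUM_SIZE;
```

Deriving the limit from its parts, rather than hard-coding 4123, keeps the constants consistent if the header layout or payload size is amended later.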
Each encoded message will have the following form:
header[magic, version, sequence, command] | data | checksum
as shown in the following table.
Field | Size | Notes |
---|---|---|
magic | u32 | 0x1DE19CC |
version | u32 | The protocol version number. This will be incremented as necessary for changes in the command set or to the data associated with a command. Ideally, both the SP and the Host should be willing to accept and use a small number of previous protocol versions. |
sequence | u64 | Increments for each new HSS→SP message. |
command | u8 | In rust, |
data | [u8; n] | Varies based on command, n <= MAX_MESSAGE_SIZE - HEADER_SIZE (17) - CHECKSUM_SIZE (2) |
checksum | u16 | Fletcher-16 |
The command field is deliberately placed directly before the associated data to allow data-carrying enums to be used in the Rust implementation, if desired, for at least some of the data block. This is also why the large byte streams are at the end of the data, allowing incremental consumption and lower memory use, particularly on the SP side. The command field is set to one of the values from the following tables, depending on the direction of the message (note that 0x0 is not used for a command value to aid [_framing]):
Value (mnemonic) | Associated Data | Description |
---|---|---|
0x1 (HSSReboot) | - | Host requests reboot; no response is expected from the SP. |
0x2 (HSSPowerOff) | - | Host requests power off; no response is expected from the SP. |
0x3 (HSSBsu) | - | Host requests Boot Storage Unit (BSU) value (see [rfd241]) |
0x4 (HSSIdent) | - | Host requests identifying data (model, revision, serial) |
0x5 (HSSMac) | - | Host requests MAC addresses |
0x6 (HSSBootFail) | reason: u8, | Host boot failure; no response is expected from the SP. A reason code is sent followed by data providing more detail. Reasons are: |
0x7 (HSSPanic) | cause: u16, | The host has panicked. The cause field gives a high level overview of the cause, currently defined values are: |
0x8 (HSSStatus) | - | Host requests SP status register values. |
0x9 (HSSAckStart) | - | Host acknowledges SP task startup. |
0xa (HSSAlert) | - | Host requests a pending alert message from the SP. |
0xb (HSSRot) | TBD | Reserved for host RoT Request |
0xc (HSSRotMeas) | TBD | Reserved for host measurements (see [rfd256]) |
0xd (HSSImageBlock) | hash: [u8; 32] | Request data from a phase 2 image with the provided hash, starting at the requested byte offset. There is no length parameter; the SP should send as much data as it can (up to MAX_PACKET_SIZE). In practice this is limited by the amount of data that can fit in a UDP packet between MGS and the SP. |
0xe (HSSKeyLookup) | key: u8 | Perform a lookup for the specified key. The size of the buffer which is available to receive the corresponding value is also provided as part of the request, which allows the SP to fast fail if the buffer will not be large enough. |
0xf (HSSGetInventoryData) | index: u32 | Performs a lookup for the specified inventory item. The index must be less than the inventory count returned by HSSKeyLookup. For details on inventory, see [rfd360]. |
0x10 (HSSKeySet) | key: u8 | Set a value for the specified key. The key indices are the same as those listed above for HSSKeyLookup. Only the keys flagged as |
Value (mnemonic) | Associated Data | Description |
---|---|---|
0x1 (SPAck) | - | SP acknowledges request. Used when there is no data to send back. Host will check the sequence number to determine that the acknowledgement is for the message it last sent. |
0x2 (SPDecodeFail) | reason: u8 | SP could not decode request, reasons are: |
0x3 (SPBsu) | bsu: u8 | SP provides BSU |
0x4 (SPIdent) | model: [u8; 11] | SP provides identifying data. |
0x5 (SPMac) | base: [u8; 6] | SP provides MAC address data in the FRUID format proposed in [rfd320]. |
0x6 (SPStatus) | status: u64 | SP provides status register values. |
0x7 (SPAlert) | action: u8, | SP provides a pending alert message in response to request. If there are no pending alerts, then action is |
0x8 (SPRot) | TBD | Reserved for SP RoT Response |
0x9 (SPImageBlock) | data: [u8; n] | Provide phase2 image block. |
0xa (SPKeyLookup) | result: u8 | Provide a response to an attempted value-by-key lookup. |
0xb (SPInventoryData) | result: u8 | Provide a response to an attempted inventory data lookup. |
0xc (SPKeySet) | result: u8 | Provide a response to an attempted key set operation. |
Unused bytes in data fields, for example in the opaque RoT request and response structures, are set to 0xff to aid [_framing].
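As an illustration of the host-to-SP command table, the command byte could be mapped to a plain enum with explicit discriminants. The type name HssCommand and the helpers from_u8 and expects_reply are hypothetical; only the numeric values and the "no response is expected" notes come from the table.

```rust
/// Host-to-SP command codes from the first table. 0x0 is never used.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum HssCommand {
    Reboot = 0x1,
    PowerOff = 0x2,
    Bsu = 0x3,
    Ident = 0x4,
    Mac = 0x5,
    BootFail = 0x6,
    Panic = 0x7,
    Status = 0x8,
    AckStart = 0x9,
    Alert = 0xa,
    Rot = 0xb,
    RotMeas = 0xc,
    ImageBlock = 0xd,
    KeyLookup = 0xe,
    GetInventoryData = 0xf,
    KeySet = 0x10,
}

impl HssCommand {
    /// Map a raw command byte to a known command, rejecting 0x0 and
    /// anything beyond the table.
    pub fn from_u8(v: u8) -> Option<Self> {
        use HssCommand::*;
        Some(match v {
            0x1 => Reboot,
            0x2 => PowerOff,
            0x3 => Bsu,
            0x4 => Ident,
            0x5 => Mac,
            0x6 => BootFail,
            0x7 => Panic,
            0x8 => Status,
            0x9 => AckStart,
            0xa => Alert,
            0xb => Rot,
            0xc => RotMeas,
            0xd => ImageBlock,
            0xe => KeyLookup,
            0xf => GetInventoryData,
            0x10 => KeySet,
            _ => return None,
        })
    }

    /// False for the commands the table marks as expecting no response.
    pub fn expects_reply(self) -> bool {
        !matches!(
            self,
            HssCommand::Reboot | HssCommand::PowerOff | HssCommand::BootFail
        )
    }
}
```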
Example
One way in which this could be represented in Rust, suitable for serializing with hubpack, is:
enum SPToHSSCommand {
    _Unused, // Skip 0x0
    // ... elided ...
    Ident { model: u8, rev: u8, serial: [u8; 11] },
    // ... elided ...
}
struct Response {
    magic: u32,
    ver: u32,
    seq: u64,
    cmd: SPToHSSCommand,
    check: u16,
}
The following message:
Response {
    magic: 0x1de19cc,
    ver: 0x1,
    seq: 0x8000_0000_0000_007c,
    cmd: Ident {
        model: 0x81,
        rev: 0x1,
        serial: [ 0x42, 0x4d, 0x4e, 0x33, 0x34, 0x32, 0x32, 0x30, 0x30, 0x30, 0x31, ],
    },
    check: 0xbeef,
}
serializes as:
0xcc, 0x19, 0xde, 0x01, 0x1, 0x0, 0x0, 0x0, 0x7c, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x80, 0x4, 0x81, 0x1, 0x42, 0x4d, 0x4e, 0x33, 0x34, 0x32, 0x32, 0x30, 0x30, 0x30, 0x31, 0xef, 0xbe,
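To make the byte layout concrete, the same bytes can be produced by hand with the standard library alone. This is a sketch that does not use hubpack: the IdentReply struct and encode_ident function are hypothetical, and the command byte 0x4 is taken from the byte dump above rather than from any library behaviour.

```rust
/// Hypothetical flattened view of the example `Response` carrying `Ident`.
pub struct IdentReply {
    pub magic: u32,
    pub ver: u32,
    pub seq: u64,
    pub model: u8,
    pub rev: u8,
    pub serial: [u8; 11],
    pub check: u16,
}

/// Command byte for `Ident`, as it appears in the example byte dump.
pub const IDENT_COMMAND: u8 = 0x4;

/// Write every field in declaration order, multi-byte values little-endian.
pub fn encode_ident(r: &IdentReply) -> Vec<u8> {
    let mut out = Vec::with_capacity(32);
    out.extend_from_slice(&r.magic.to_le_bytes());
    out.extend_from_slice(&r.ver.to_le_bytes());
    out.extend_from_slice(&r.seq.to_le_bytes());
    out.push(IDENT_COMMAND);
    out.push(r.model);
    out.push(r.rev);
    out.extend_from_slice(&r.serial);
    // 0xbeef in the example is a placeholder, not a computed checksum.
    out.extend_from_slice(&r.check.to_le_bytes());
    out
}
```

The result is 32 bytes: a 17-byte header, 13 bytes of data and a 2-byte checksum.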
The SP’s status registers
The SP maintains a 64-bit status register in which the bits carry the following meanings:
Bit 0 - SRTaskRestart - The SP task has (re)started;
Bit 1 - SRAlertAvail - Alert message(s) available.
The SP also maintains a 64-bit startup options register in which the bits carry the following meanings:
Bit 0 - SORecovery - Request phase 2 image from the SP via IPCC;
Bit 1 - SOKBM - Enable KBM debugging;
Bit 2 - SOBootRD - Enable BootRD debugging;
Bit 3 - SOPROM - Enable PROM debugging;
Bit 4 - SOKMDB - Load KMDB;
Bit 5 - SOKMDBBoot - Load KMDB and enter it at boot;
Bit 6 - SORamdisk - No phase 2, mount ramdisk as root;
Bit 7 - SONetBoot - Network boot;
Bit 8 - SOVerbose - Enable verbose boot.
Whenever the status register is non-zero, the SP asserts the out-of-band level-triggered SP_TO_SP3_INT_L signal via GPIO. The startup options register does not affect the interrupt status.
When the SP host communication task starts, it initialises the status register to 1 (SRTaskRestart), indicating that the task has started. This could be at SP boot time or when the task has restarted due to a fault.
The Host can read the value of the status and startup options registers at any time via an RPC request, but usually does so in response to the SP_TO_SP3_INT_L signal being asserted. Based on the bits which are set, the host will issue one or more requests to consume the information from the SP.
If the host issues the HSSAckStart message, the SP will clear SRTaskRestart in its status register;
If the host issues the HSSAlert message, the SP will reply with an SPAlert message and, if no more messages are pending, will clear SRAlertAvail in its status register.
In all cases, if the status register becomes 0, the SP will de-assert the SP_TO_SP3_INT_L signal.
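The relationship between the status register and the interrupt line can be modelled in a few lines. This is a minimal sketch of the behaviour described in this section, not SP code: the SpStatus type, its method names and the pending-alert counter are assumptions made for illustration.

```rust
pub const SR_TASK_RESTART: u64 = 1 << 0;
pub const SR_ALERT_AVAIL: u64 = 1 << 1;

/// Hypothetical model of the SP side of the status register.
pub struct SpStatus {
    status: u64,
    pending_alerts: usize,
}

impl SpStatus {
    /// On task (re)start the register is initialised to SRTaskRestart.
    pub fn new() -> Self {
        SpStatus { status: SR_TASK_RESTART, pending_alerts: 0 }
    }

    pub fn status(&self) -> u64 {
        self.status
    }

    /// The interrupt is asserted exactly when the register is non-zero.
    pub fn interrupt_asserted(&self) -> bool {
        self.status != 0
    }

    /// An alert becomes available for the host to collect.
    pub fn queue_alert(&mut self) {
        self.pending_alerts += 1;
        self.status |= SR_ALERT_AVAIL;
    }

    /// The host acknowledged task startup.
    pub fn handle_ack_start(&mut self) {
        self.status &= !SR_TASK_RESTART;
    }

    /// The host requested an alert. Returns true if one was delivered, and
    /// clears SRAlertAvail once nothing is left pending.
    pub fn handle_alert_request(&mut self) -> bool {
        let delivered = self.pending_alerts > 0;
        if delivered {
            self.pending_alerts -= 1;
        }
        if self.pending_alerts == 0 {
            self.status &= !SR_ALERT_AVAIL;
        }
        delivered
    }
}
```

Because the signal is level-triggered and derived from the register, the host can simply keep servicing bits until the line drops.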
Framing and synchronisation
Encoded messages will be framed using Consistent Overhead Byte Stuffing [COBS], with 0x0 being used as the frame delimiter. This will increase the message size by at most 1 byte per 254 bytes of data, giving a known value for the maximum frame size (based on MAX_MESSAGE_SIZE) and allowing for the use of static buffers where appropriate.
In order to optimise for COBS encode/decode speed, we will take some steps to reduce the number of 0x0 bytes in the original encoded data by:
Numbering commands starting from 0x1;
Setting unused bytes at the end of data blocks to 0xff.
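For reference, a small COBS encoder and the resulting bound on frame size might look like the following. This is an illustrative sketch, not the encoder used by either side; cobs_encode and max_frame_size are hypothetical names.

```rust
/// Worst-case size of a framed message: the message itself, one overhead
/// byte per (started) 254 bytes of data, and the 0x00 frame delimiter.
pub const fn max_frame_size(message_len: usize) -> usize {
    let blocks = (message_len + 253) / 254;
    // Even an empty message needs one code byte.
    let overhead = if blocks == 0 { 1 } else { blocks };
    message_len + overhead + 1
}

/// COBS-encode `input` and append the 0x00 frame delimiter.
pub fn cobs_encode(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(max_frame_size(input.len()));
    // Each run is a stretch of non-zero bytes between zeros in the input.
    let mut runs = input.split(|&b| b == 0).peekable();
    while let Some(run) = runs.next() {
        let zero_follows = runs.peek().is_some();
        if run.is_empty() {
            // A code of 1 means "no data bytes before the next zero".
            out.push(1);
            continue;
        }
        let mut chunks = run.chunks(254).peekable();
        while let Some(chunk) = chunks.next() {
            // Code byte: chunk length + 1. 0xff marks a full 254-byte
            // chunk that is NOT followed by an implied zero.
            out.push(chunk.len() as u8 + 1);
            out.extend_from_slice(chunk);
            if chunk.len() == 254 && chunks.peek().is_none() && zero_follows {
                // The zero after a full chunk needs its own code byte.
                out.push(1);
            }
        }
    }
    out.push(0); // frame delimiter
    out
}
```

With MAX_MESSAGE_SIZE at 4123 bytes, this bound gives a maximum frame of 4141 bytes including the delimiter. Data with few zero bytes, as encouraged by the two steps above, produces long chunks and therefore few code bytes.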
Using COBS to frame packets on the wire has a number of benefits. Either end is able to unambiguously identify the end of a packet, and either end can terminate a partially sent packet just by writing a frame terminator. There are, however, some situations that can cause the two ends of the channel to get out of sync, and the protocol implementation has to be able to deal with this.
First, it is conceivable that corruption could occur during transmission. This has not been seen in extensive testing on two separate servers (where gigabytes of data have been transferred over this channel) but it’s worth thinking through what would happen if it did. Assuming the corruption is within the body of the packet, then it may be detected by checksum at the end of the frame; that checksum is using the Fletcher-16 algorithm which is cheap for the SP to calculate. Assuming the checksum appears correct, then the magic and version fields will be checked and, on the Host side, the sequence number in the response packet will be checked against the expected value.
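A common formulation of Fletcher-16 is sketched below to show how little work the checksum requires. Two things are assumptions here rather than statements from this document: that the checksum covers every byte preceding it (header and data), and that it is stored little-endian like the other fields. The function names are hypothetical.

```rust
/// Fletcher-16 with both running sums reduced modulo 255.
pub fn fletcher16(data: &[u8]) -> u16 {
    let mut sum1: u16 = 0;
    let mut sum2: u16 = 0;
    for &b in data {
        sum1 = (sum1 + b as u16) % 255;
        sum2 = (sum2 + sum1) % 255;
    }
    (sum2 << 8) | sum1
}

/// Assumption: the last two bytes of a decoded message hold the checksum
/// (little-endian) of everything before them.
pub fn checksum_ok(message: &[u8]) -> bool {
    if message.len() < 2 {
        return false;
    }
    let (body, tail) = message.split_at(message.len() - 2);
    let stored = u16::from_le_bytes([tail[0], tail[1]]);
    fletcher16(body) == stored
}
```

One known property of this variant is that it cannot distinguish a 0x00 byte from a 0xff byte, since both are congruent modulo 255.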
If corruption is detected by the SP, then it will reply with a special message that indicates it is unable to decode the request, and the host will re-send the pending message. Similarly, if the host detects corruption in a reply, it will discard it and re-issue the request, unchanged.
A special case is if there is corruption in the frame terminator itself. Without anything being done to guard against this, the channel would become permanently wedged. Implementing a timeout here was considered but discarded as an option because there is no guaranteed response time for any message sent to the SP. Some messages are likely to take a while and the SP is not a hard real-time OS, so selecting an appropriate timeout value is difficult. The chosen solution is for each side to follow up a packet with periodic additional frame terminators while waiting for a reply, possibly filling up the Tx FIFO. When read by the other side of the channel, each of these just appears as an empty frame and is discarded. The sending interval for these extra terminators is not critical; roughly one every 0.1s seems reasonable.
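On the receiving side, discarding the extra terminators falls out naturally from splitting the byte stream on 0x0. The sketch below uses hypothetical names (FrameReader, Event) and a caller-supplied size limit; it returns still-encoded COBS frames and leaves decoding to a later step.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Event {
    /// A complete, non-empty (still COBS-encoded) frame.
    Frame(Vec<u8>),
    /// More than `max_len` bytes arrived without a frame terminator.
    Overflow,
}

/// Hypothetical byte-at-a-time frame splitter.
pub struct FrameReader {
    buf: Vec<u8>,
    max_len: usize,
}

impl FrameReader {
    pub fn new(max_len: usize) -> Self {
        FrameReader { buf: Vec::with_capacity(max_len), max_len }
    }

    /// Feed one received byte; returns an event when something completes.
    pub fn push(&mut self, byte: u8) -> Option<Event> {
        if byte == 0 {
            if self.buf.is_empty() {
                // An extra terminator sent while the peer waits: ignore it.
                return None;
            }
            return Some(Event::Frame(std::mem::take(&mut self.buf)));
        }
        if self.buf.len() == self.max_len {
            // Give up on this frame; what follows up to the next
            // terminator will fail later validation and be rejected.
            self.buf.clear();
            return Some(Event::Overflow);
        }
        self.buf.push(byte);
        None
    }
}
```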
Another way that synchronisation can be lost is if the SP task panics/restarts after the host has sent a command. In that case it will come back up without the command to process and the host will still be waiting. To address this, the SP maintains a 64-bit status register and, whenever it is non-zero, it asserts the out-of-band interrupt. Whenever the SP task starts or restarts, it sets a bit in that register to indicate that, which has the side effect of asserting the interrupt. The host notices this and gives up sending/waiting for the active command, and issues a new request to retrieve the status register. It then processes the bits which are set there, clearing them by retrieving data from the SP or sending commands to acknowledge the event. Once the register is clear (and the interrupt de-asserted), the original command is sent again.
The reverse situation can occur if the host panics between sending a request to the SP and processing the reply. In that case, the host will send a panic message and then, after a variable amount of time, a reboot request. This can lead to a situation where the SP is blocked writing a response to the host and the host is blocked writing the panic message to the SP. To prevent this deadlock, the SP must continue to read from the host even while it is sending a response. It will usually just see the empty bonus frame discussed earlier, but if it sees a new command then it must discard whatever it is busy sending and process the new message.
While it is not expected to be possible, it’s also worth considering what would happen if the host and the SP were to get out of step. That is, the host is transmitting requests and the SP is returning replies, but the SP reply is an old one, a response to a previous message. To handle this case, when an SP reply is valid in all aspects apart from the sequence number, the host should discard the reply and listen again, without re-sending.
Retransmissions
As may be apparent from what’s above, there are some cases when the host will have to automatically resend a message during a transaction. This can occur when:
the SP asserts its interrupt while the host is sending or waiting for a response;
the SP replies to a request with SPDecodeFail;
the host has read IPCC_MAX_PACKET_SIZE bytes from the SP without finding a frame terminator;
the host cannot decode the COBS frame received from the SP;
the decoded packet from the SP is shorter than IPCC_MIN_PACKET_SIZE;
the reply message checksum does not match;
the magic number in the reply message is incorrect;
the version number in the reply message is incorrect;
a request sequence number was found in the reply. This would indicate that the UART is looped back in some way and it’s not clear that it is possible to recover in this case.
When retransmitting a message for any reason, the Host will re-use the sequence number from the original request. Recovering from an SP reset means multiple commands back and forth, and so the original request will be put into a new packet, with a new sequence number, once resynchronisation is complete.
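The host's handling of a decoded reply can be summarised as a small decision function. This is a sketch under stated assumptions: the names are hypothetical, a single accepted protocol version is assumed, and the way a reply's sequence number is told apart from a request's (top bit set) is inferred solely from the example reply earlier in this document, whose sequence is 0x8000_0000_0000_007c.

```rust
pub const MAGIC: u32 = 0x1de19cc;
/// Assumption: the one protocol version this host accepts.
pub const VERSION: u32 = 1;
/// Assumption: replies echo the request sequence with the top bit set.
pub const SEQ_REPLY_BIT: u64 = 1 << 63;

#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    /// The reply answers the outstanding request.
    Accept,
    /// Re-send the request, re-using its sequence number.
    Resend,
    /// Discard the reply and keep listening, without re-sending.
    ListenAgain,
}

/// Hypothetical summary of a decoded reply.
#[derive(Debug, Clone, Copy)]
pub struct Reply {
    pub checksum_ok: bool,
    pub magic: u32,
    pub version: u32,
    pub sequence: u64,
    /// True when the command is SPDecodeFail.
    pub decode_fail: bool,
}

pub fn classify_reply(request_seq: u64, reply: &Reply) -> Action {
    if !reply.checksum_ok || reply.magic != MAGIC || reply.version != VERSION {
        return Action::Resend;
    }
    if (reply.sequence & SEQ_REPLY_BIT) == 0 {
        // Looks like a request sequence number, e.g. a looped-back UART.
        return Action::Resend;
    }
    if reply.decode_fail {
        return Action::Resend;
    }
    if reply.sequence != (request_seq | SEQ_REPLY_BIT) {
        // Valid in every respect except the sequence: a stale reply.
        return Action::ListenAgain;
    }
    Action::Accept
}
```

The remaining retransmission triggers in the list (interrupt assertion, missing terminator, COBS decode failure, short packet) occur before a reply reaches this point and so are not modelled here.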
Open Questions
Should RoT requests be opaque to the kernel, or a specific request type and a payload it just passes along?
Security Considerations
We have no means for either the host or the SP to authenticate the other across the UART channel; invasive physical access would allow an attacker to act as a man-in-the-middle and inject arbitrary requests/responses.
External References
[COBS] IEEE Consistent Overhead Byte Stuffing
[hubpack] hubpack: a predictable serde format
[rfd8] Oxide Computer Co. RFD 8: Service Processor (SP)
[rfd20] Oxide Computer Co. RFD 20: Host Bootstrap Software: Objectives
[rfd26] Oxide Computer Co. RFD 26: Host Operating System & Hypervisor
[rfd27] Oxide Computer Co. RFD 27: Interface: Service Processor / Host System Software
[rfd61] Oxide Computer Co. RFD 61: Control Plane Architecture and Design
[rfd219] Oxide Computer Co. RFD 219: Serial: Numbers, Not a Podcast
[rfd241] Oxide Computer Co. RFD 241: Holistic Boot
[rfd256] Oxide Computer Co. RFD 256: Remote Attestation and Trusted Transport Protocol
[rfd293] Oxide Computer Co. RFD 293: BETTER SLED THAN DEAD: A guide to sled resurrection from beyond the update window
[rfd320] Oxide Computer Co. RFD 320: Per-board MAC address allocation and tracking
[rfd360] Oxide Computer Co. RFD 360: Gimlet and Sidecar Topology Trees and Device Identification