RFD 250 Management Network Topology and Proprioception

This RFD describes several interrelated details of how the control plane interacts with SPs over the management network. The overview of the management network is covered by RFD 210; it describes the management network, the SPs attached to the network, the VSC7448 switch, and—in general terms—the Management Gateway Service (MGS), the component of the control plane that communicates over the management network.

There were, however, a few important topics left unresolved:

MGS needs to perform discovery of the SPs on the management network and needs to be able to direct messages to specific SPs.
MGS needs to be able to infer the rack location of sleds, switches, and PSCs (when we have more than a single PSC).
SPs may need to be able to infer their own location.

Principles and Assumptions

The Management Network is an L2 network

The management network is a connection between MGS and Hubris. MGS is a Rust program running in a Helios zone on a "Scrimlet", i.e. a Gimlet attached to a Sidecar via PCIe. Hubris is Oxide’s embedded OS, running on a bare-metal service processor (SP) on a Gimlet, Sidecar, PSC, or other future product.

MGS and Hubris communicate by sending and receiving packets. These packets are typically UDP over IPv6, but may also include ICMPv6 for neighbor discovery. Inside the management network, frames are forwarded by switches that operate at the L2 level, i.e. handling Ethernet frames based on SMAC and DMAC.

[img-basic] shows the fundamental building blocks of the management network.

A single link in the management network

Warning

The KSZ8463 numbers its ports 1 and 2, rather than 0 and 1.

All figures in this document focus on frame routing, so implementation details like PHYs and physical level protocols aren’t shown. For detailed design information about the rack switch, see RFD 58 and RFD 144.

Inter-SP Isolation

As noted in RFD 210, while MGS can communicate with any SP on the management network, SPs may only communicate with MGS and not with other SPs. The expectation is that the VSC7448 will enforce this constraint. This is, in part, to inhibit an exploit from spreading between SPs.

This should be enforced at the L2 layer, meaning even multicast or broadcast packets from one SP should not reach other SPs.

Per-Switch Isolation

Every device connected to the management network has two network ports, one connected to each of the two rack switches. These ports should be isolated at the L2 level, meaning that even multicast or broadcast messages from an SP should be directed to a specific network and only arrive at a single Sidecar.

[sidecar-gimlet] shows connections between multiple Sidecars and Gimlets. The isolated per-switch networks are shown in red and blue. Note that the VSC7448-to-Gimlet mapping is made up for this example and will not reflect the final mapping.

Connections between multiple Sidecars and Gimlets

The SP must be aware of which port a packet has arrived on; this is a building block for location detection.

Note

With both per-switch isolation and inter-SP isolation, each MGS-SP link should behave as though it is on its own isolated network.

VSC and Tofino Port Bijection

A single cable from the rack switch to a particular sled cubby contains a set of associated signals:

Data network
Management network
Ignition (RFD 142)

These signals exit the switch and enter the sled through single connectors, meaning it should be impossible to mix them up. As such, a port on Tofino (data network) has a one-to-one correspondence with a port on the VSC7448 (management network), and similarly, with a connection to the Ignition controller.

The PSC does not include data network signals, but has a one-to-one correspondence between VSC7448 port and Ignition controller.

Cabled Backplane Exigencies

The cabled backplane will be deterministically configured for a given product, i.e. any variation between racks would represent an error. This means there is a known mapping from the tuple [Sidecar location, VSC7448 port] to physical cubby location within the rack.

This mapping will be defined in RFD 126.

Warning

RFD 144 shows such a mapping in the Switch Port Map table, but this table is incorrect (as of 2022-03-22); the final mapping will depend on physical layout to maximize signal integrity on the high-speed data network.

Future-Proofing

While we don’t want to overengineer for a future we have yet to earn, we do want to consider subsequent products that fit into cubbies (a Genoa-based sled, an object storage sled, etc.), other rack configurations (with different numbers of sleds, or different cabling), and future rack switches.

VLANs for Network Isolation

To implement the desired isolation, we will use two different VLAN tagging schemes. The network can be broken into three sections, of which two are tagged:

Between MGS and the VSC7448, frames will be tagged with a VLAN indicating the target VSC7448 port. For example, 0x102 would map to port 2, 0x103 to port 3, etc.
Between the VSC7448 and the KSZ8463, frames are untagged.
Between the KSZ8463 and the SP, frames will be tagged with a VLAN indicating the target KSZ8463 port. In diagrams below, we’ll use 0x301/0x302 for KSZ8463 ports 1 and 2.

All VLANs are configured regardless of link partner activity.

Here is an example of a packet traveling through the management network:

VLAN tagging through the management network
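The tagging scheme above can be sketched in a few lines. This is a minimal illustration, not the MGS or Hubris implementation: the function names are hypothetical, and the only facts taken from this document are the 0x100 + port and 0x301/0x302 VLAN IDs. The tag layout (TPID 0x8100 followed by a 16-bit TCI) is standard 802.1Q.

```rust
/// 802.1Q Tag Protocol Identifier (standard value).
const TPID_8021Q: u16 = 0x8100;

/// VLAN ID used between MGS and the VSC7448 for a given VSC7448 port.
fn vsc7448_vid(port: u8) -> u16 {
    0x100 + u16::from(port)
}

/// VLAN ID used between the KSZ8463 and the SP for KSZ8463 port 1 or 2.
fn ksz8463_vid(port: u8) -> Option<u16> {
    match port {
        1 | 2 => Some(0x300 + u16::from(port)),
        _ => None, // port 3 faces the SP and has no VLAN of its own
    }
}

/// Insert an 802.1Q tag after the destination and source MAC addresses.
/// Returns `None` if the frame is too short or the VID is not usable.
fn add_vlan_tag(frame: &[u8], vid: u16) -> Option<Vec<u8>> {
    // VID 0 means "priority tag only" and 0xFFF is reserved.
    if frame.len() < 14 || vid == 0 || vid >= 0xFFF {
        return None;
    }
    let mut out = Vec::with_capacity(frame.len() + 4);
    out.extend_from_slice(&frame[..12]); // DMAC + SMAC
    out.extend_from_slice(&TPID_8021Q.to_be_bytes());
    out.extend_from_slice(&vid.to_be_bytes()); // PCP = 0, DEI = 0
    out.extend_from_slice(&frame[12..]); // original EtherType + payload
    Some(out)
}
```

A frame that MGS wants delivered to VSC7448 port 2 would be tagged with `vsc7448_vid(2)`, i.e. 0x102, before it leaves the host.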

VSC7448 Configuration

The downstream ports on the VSC7448 accept untagged traffic, apply a port-specific VLAN tag to incoming frames, and forward traffic exclusively to the upstream port.

The upstream port accepts frames with a single VLAN tag that matches one of the downstream ports, directs traffic to the appropriate downstream port, and strips the tag on egress to a downstream port.

In Cisco terminology, these are "access ports" and "trunk ports" respectively. However, the VSC7448 has no such notion in its configuration; it’s organized in terms of VLANs themselves. We will oblige it and configure 39 different VLANs:

32 for individual cubbies
2 for rack PSCs
2 for technician ports
1 for the on-board SP
1 for the peer Sidecar’s SP
1 default VLAN (for the Dark Lord on his dark throne)

The upstream port will be configured to drop frames without a VLAN tag; this means that only tagged frames will be forwarded. VLANs will be of the form 0x100 + port number, and will include the upstream port and a single downstream port. Because we do not use every port in the VSC7448 in order, VLAN ID assignment will not be contiguous; this has no real impact.
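As a rough model of that configuration, the sketch below builds a table in which each VLAN contains the upstream port and exactly one downstream port, then makes the forwarding decision for a frame arriving on the upstream port. The upstream port number (48) and the function names are assumptions made up for illustration; the real values live in the switch configuration.

```rust
use std::collections::BTreeMap;

/// Hypothetical upstream port number; not taken from this document.
const UPSTREAM_PORT: u8 = 48;

/// Build the VLAN table: VID -> [upstream port, downstream port].
/// The downstream port list need not be contiguous.
fn build_vlan_table(downstream_ports: &[u8]) -> BTreeMap<u16, [u8; 2]> {
    downstream_ports
        .iter()
        .map(|&port| (0x100 + u16::from(port), [UPSTREAM_PORT, port]))
        .collect()
}

/// Forwarding decision for a frame arriving on the upstream port.
/// Untagged frames (`None`) and unknown VIDs are dropped (`None`).
fn upstream_egress(table: &BTreeMap<u16, [u8; 2]>, vid: Option<u16>) -> Option<u8> {
    let members = table.get(&vid?)?;
    Some(members[1]) // the single downstream member of this VLAN
}
```

With downstream ports 2, 3, and 37 in use, the table holds VIDs 0x102, 0x103, and 0x125, which shows why the gaps in VLAN ID assignment are harmless.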

KSZ8463 Configuration

The KSZ8463 is configured similarly to the VSC7448, with a total of three ports:

Ports 1 and 2 go upstream to the VSC7448s on two different rack switches
Port 3 is connected to the SP

The upstream ports apply a VLAN tag upon ingress and strip it on egress. The downstream port only accepts tagged frames and sends them to one or the other upstream port.

To implement this behavior, the KSZ8463 is configured with three VLANs:

0x301 includes Ports 1 and 3
0x302 includes Ports 2 and 3
0x3FF is the default VLAN and contains no ports

Think of the combined KSZ8463 + SP network as equivalent to an SP with two MACs. The KSZ8463 is an implementation detail, necessary because our SP only has a single RMII interface. If a subsequent SP had two RMII ports, there would be no KSZ8463 and the SP would not need to use VLAN tagging.
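The three-port behavior can be written down as a small decision function. This is a sketch of the intended forwarding rules rather than register-level KSZ8463 configuration; the `Forward` type is hypothetical, and dropping anything that does not match the scheme is an assumption consistent with the isolation goals stated earlier.

```rust
#[derive(Debug, PartialEq)]
enum Forward {
    /// Deliver to the SP on port 3, carrying this VLAN tag.
    ToSp { vid: u16 },
    /// Strip the tag and send out of upstream port 1 or 2.
    ToUpstream { port: u8 },
    /// Frame does not fit the tagging scheme.
    Drop,
}

/// `vid` is `None` for an untagged frame, `Some(v)` for a tagged one.
fn ksz8463_forward(ingress_port: u8, vid: Option<u16>) -> Forward {
    match (ingress_port, vid) {
        // Upstream ports: untagged on the wire, tagged on ingress.
        (1, None) => Forward::ToSp { vid: 0x301 },
        (2, None) => Forward::ToSp { vid: 0x302 },
        // SP port: only tagged frames, routed by tag.
        (3, Some(0x301)) => Forward::ToUpstream { port: 1 },
        (3, Some(0x302)) => Forward::ToUpstream { port: 2 },
        // Everything else, including the default VLAN 0x3FF.
        _ => Forward::Drop,
    }
}
```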

Hubris

802.1Q is explicitly not supported in smoltcp, our chosen net stack.

Because our architecture requires tagged packets arriving at the SP, we have a few choices:

Modify smoltcp to support VLAN tags
Strip tags before sending packets to (per-VLAN) instances of smoltcp

The latter is probably easiest, and the STM32H7 supports VLAN tagging in its ethernet peripheral.
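To make the second option concrete, here is a software sketch of stripping the tag and choosing a per-VLAN stack instance. It assumes a raw Ethernet frame in a byte slice; in practice the STM32H7 peripheral may do the stripping in hardware, and the function names and the index convention (0 for 0x301, 1 for 0x302) are hypothetical.

```rust
/// Parse an 802.1Q-tagged frame and return the VID plus the frame
/// with the 4-byte tag removed. Returns `None` if the frame is
/// untagged or too short to hold a tag and an EtherType.
fn strip_vlan_tag(frame: &[u8]) -> Option<(u16, Vec<u8>)> {
    if frame.len() < 18 || frame[12] != 0x81 || frame[13] != 0x00 {
        return None;
    }
    let tci = u16::from_be_bytes([frame[14], frame[15]]);
    let vid = tci & 0x0FFF; // low 12 bits; PCP and DEI are ignored
    let mut out = Vec::with_capacity(frame.len() - 4);
    out.extend_from_slice(&frame[..12]); // DMAC + SMAC
    out.extend_from_slice(&frame[16..]); // inner EtherType + payload
    Some((vid, out))
}

/// Index of the per-VLAN network stack instance that should
/// receive the untagged frame.
fn stack_index(vid: u16) -> Option<usize> {
    match vid {
        0x301 => Some(0), // arrived via KSZ8463 port 1
        0x302 => Some(1), // arrived via KSZ8463 port 2
        _ => None,
    }
}
```

Keeping the VID around after stripping is what lets the SP know which port a packet arrived on.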

MGS

In the MGS ⇔ SP protocol, the management switch will be represented as a "component" which MGS can query. Unlike most components, MGS will do additional interpretation to infer the properties of this component so that it can perform discovery of other SPs on the management network. Note that this will include details of the cabled backplane that describe the mapping between each port on both Sidecars to all of the SPs.

IPv6 Addressing

MAC Address Selection

Every Oxide FRU contains a FRU ID EEPROM, described in RFD 148. RFD 165 specifies that every Gimlet needs to have a MAC address assigned to the T6, but does not firmly declare where it is stored; let’s assume that it is stored in the FRU EEPROM. RFD 174 specifies that Oxide should assign MAC addresses based on our OUI (A8:40:25), reserving part of the space for software virtual devices.

To simplify software, we should ensure that there are no MAC address collisions on the management network. We can use the plan described above, but assign a range of MAC addresses in the FRU EEPROM, e.g. A8:40:25:00:00:10 through A8:40:25:00:00:1F (giving the Gimlet 16 MAC addresses to choose from).

The management network on each connected device requires two MAC addresses for the two distinct ports.

It’s not noted in the RFDs above, but the T6 actually needs two MAC addresses, because it has two ports and two interfaces (one going to each Sidecar). This means that we need to solve the "assign a MAC address range" problem independent of the management network, so the plan above does not add any new technical complexity.

We can fine-tune the exact number of MAC addresses per Gimlet / Sidecar / PSC, but at a conservative 16 per unit, this allows us to build 27K racks before needing to purchase a second OUI.

IPv6 Address Selection

Given a MAC address, we can generate a link-local IPv6 address using the EUI-64 transform [RFC4291]. This is simpler than using DHCPv6, and these IP addresses only matter within a single rack, so there’s no need for anything fancy.

Since each port on the SP has its own MAC address, it will also have its own IPv6 address.
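As an illustration, the sketch below derives a link-local address from a MAC using the Modified EUI-64 format from RFC 4291 (flip the universal/local bit, insert FF:FE in the middle, prefix with fe80::/64). The `nth_mac` helper, which picks an address out of a per-device range, is a hypothetical convenience and not something defined by the RFDs cited above.

```rust
use std::net::Ipv6Addr;

/// Link-local IPv6 address from a 48-bit MAC (Modified EUI-64).
fn link_local_from_mac(mac: [u8; 6]) -> Ipv6Addr {
    Ipv6Addr::new(
        0xfe80,
        0,
        0,
        0,
        u16::from_be_bytes([mac[0] ^ 0x02, mac[1]]), // flip U/L bit
        u16::from_be_bytes([mac[2], 0xff]),
        u16::from_be_bytes([0xfe, mac[3]]),
        u16::from_be_bytes([mac[4], mac[5]]),
    )
}

/// Hypothetical helper: the `n`th MAC in a range starting at `base`.
/// Returns `None` rather than overflowing out of the OUI.
fn nth_mac(base: [u8; 6], n: u32) -> Option<[u8; 6]> {
    let low = u32::from_be_bytes([0, base[3], base[4], base[5]]);
    let next = low.checked_add(n)?;
    if next > 0x00FF_FFFF {
        return None;
    }
    let b = next.to_be_bytes();
    Some([base[0], base[1], base[2], b[1], b[2], b[3]])
}
```

For example, a base MAC of A8:40:25:00:00:10 yields fe80::aa40:25ff:fe00:10 for the first port, and the next MAC in the range yields fe80::aa40:25ff:fe00:11 for the second.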

Network Topology Discovery

Upon boot or loss of state, MGS and/or Hubris need to discover various network-level parameters to usefully operate:

IPv6 addresses
Link-level addresses
Location within the rack

All of this information can be determined with 1-2 idempotent requests to the right place, so state maintenance doesn’t require too much reasoning about the distributed system.

These activities may also occur on a regular cadence. For example, link-level addresses will be periodically refreshed as they time out of the NDP cache. We may choose to implement periodic checking of the other parameters as well.

SP IPv6 Address

To discover the IPv6 address of an SP on a given port, MGS will send a UDP message over a link with a given VLAN ID with an IPv6 multicast address, using our MGS ⇔ SP protocol. Because the VSC7448 has been configured with the VLAN awareness described above, that message will only be sent to the targeted SP. The response from the SP will include an IPv6 address that MGS may use for subsequent messages.

MGS IPv6 Address

Similarly, to learn the IPv6 address of a particular MGS instance, Hubris will send a multicast UDP message to a specific Sidecar link. MGS will reply with an IPv6 address that Hubris may use for subsequent messages to that Sidecar.

MAC Address

Once both sides have discovered IPv6 addresses, they can use NDP to discover link-layer (MAC) addresses. Each side will send Neighbor Solicitation messages and receive Neighbor Advertisement messages in reply. This is supported automatically in the Hubris and illumos netstacks.

Sidecar Location

MGS cannot infer whether it is connected to Sidecar A or B using exclusively Sidecar-local information: the internal Sidecar SP can’t tell, because the VSC7448 "local"/"peer" ports are connected to KSZ8463 ports 1/2, respectively, regardless of Sidecar position. There’s no location information from the cabled backplane. MGS can, however, infer the location deterministically by sending a query to a Gimlet SP.

If MGS is running, then at least one SP has booted (namely, the one on the Scrimlet that is running MGS). Consider the following diagram, with made-up VSC7448 port assignments.

Location detection

MGS can find its location by sending two messages to SPs:

To VSC7448 port 37, "Reply if you receive this on port 1"
To VSC7448 port 22, "Reply if you receive this on port 2"

Depending on which reply it receives, MGS can determine whether it is attached to Sidecar A or B.
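The decision itself is small enough to write out. In this sketch the two booleans record whether each probe drew a reply; it assumes, following the made-up port assignments in the example, that a reply to the port 37 probe implies Sidecar A and a reply to the port 22 probe implies Sidecar B. The type and function names are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum SidecarLocation {
    A,
    B,
}

#[derive(Debug, PartialEq)]
enum LocationError {
    /// Neither probe was answered; retry later.
    NoReply,
    /// Both probes were answered; the port mapping is ambiguous
    /// or the rack is miscabled.
    Ambiguous,
}

/// `port37_on_ksz1`: a reply arrived for the probe sent to VSC7448
/// port 37 ("reply if you receive this on port 1").
/// `port22_on_ksz2`: a reply arrived for the probe sent to VSC7448
/// port 22 ("reply if you receive this on port 2").
fn infer_sidecar(
    port37_on_ksz1: bool,
    port22_on_ksz2: bool,
) -> Result<SidecarLocation, LocationError> {
    match (port37_on_ksz1, port22_on_ksz2) {
        (true, false) => Ok(SidecarLocation::A),
        (false, true) => Ok(SidecarLocation::B),
        (true, true) => Err(LocationError::Ambiguous),
        (false, false) => Err(LocationError::NoReply),
    }
}
```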

For this to work, we must avoid port mappings that could produce ambiguity. In the example above, if Sidecar B’s port 37 went to KSZ8463 port 1 on a different Gimlet, we would receive ambiguous responses when trying to infer the location of Sidecar B, since both Scrimlet B and this other Gimlet would reply.

It would be particularly convenient if all cables from the top Sidecar position (A) connect to the left Gimlet connector (port 1 on the SP), while cables from the bottom Sidecar position (B) connect to the right Gimlet connector (port 2). Even if this specific constraint doesn’t hold, there are plenty of mappings that don’t produce ambiguity; in general, we want to avoid situations where a given VSC7448 port for both Sidecar positions is attached to the same KSZ8463 port.
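A candidate cable mapping can be checked against this rule mechanically. The sketch below takes, for each Sidecar position, a hypothetical map from VSC7448 port to the KSZ8463 port it lands on, and returns the VSC7448 ports that are safe to probe because the two positions disagree. The data layout is an assumption; the real mapping is whatever the cabled backplane definition specifies.

```rust
use std::collections::BTreeMap;

/// For each Sidecar position, `map[vsc_port] = ksz_port` (1 or 2).
/// Returns the VSC7448 ports whose KSZ8463 port differs between
/// positions A and B, in ascending order. Probing any of these
/// cannot produce the same answer from both positions.
fn distinguishing_ports(
    sidecar_a: &BTreeMap<u8, u8>,
    sidecar_b: &BTreeMap<u8, u8>,
) -> Vec<u8> {
    let mut out = Vec::new();
    for (&vsc_port, &ksz_a) in sidecar_a {
        if let Some(&ksz_b) = sidecar_b.get(&vsc_port) {
            if ksz_a != ksz_b {
                out.push(vsc_port);
            }
        }
    }
    out
}
```

In the convenient layout described above, where every cable from position A lands on port 1 and every cable from position B lands on port 2, every shared VSC7448 port is returned.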

Once MGS has determined the Sidecar location, there is a one-to-one mapping between ports on the VSC7448 and cubby numbers, as described above. The mapping can be stored in a configuration file shipped alongside MGS. In a future with multiple rack configurations (e.g. Oxide Rack 1, Oxide Networking Rack, Oxide Storage Rack), multiple configuration files can be included with MGS. Detecting rack configuration (to select the appropriate configuration file) is deferred to future work below.

MGS can also query additional SPs to confirm that there hasn’t been a miscabling (either at the factory or due to improper service of a rack switch). Complete cabling validation may require Nexus-level coordination, discussed below.

SP Location

If it is necessary or useful for SPs to know their own location, they can simply ask MGS. MGS will have already inferred the rack topology using the strategy described above, so a message arriving from VSC7448 port X can be replied to with "You’re in Cubby Y" (or an error message if MGS is not ready to reply).

This can be implemented as part of the MGS ⇔ SP UDP protocol.
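A sketch of how MGS might answer such a query follows. The reply type and its variants are hypothetical and are not part of any defined MGS ⇔ SP message format; the port-to-cubby map stands in for the configuration file described earlier, and `None` models an MGS that has not yet determined its Sidecar location.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
enum LocationReply {
    /// "You're in Cubby Y".
    Cubby(u8),
    /// MGS has not finished inferring the topology.
    NotReady,
    /// The request arrived on a port with no known cubby.
    UnknownPort,
}

/// `port_to_cubby` is `None` until MGS knows which Sidecar it is on.
fn location_reply(
    port_to_cubby: Option<&BTreeMap<u8, u8>>,
    vsc_port: u8,
) -> LocationReply {
    match port_to_cubby {
        None => LocationReply::NotReady,
        Some(map) => match map.get(&vsc_port) {
            Some(&cubby) => LocationReply::Cubby(cubby),
            None => LocationReply::UnknownPort,
        },
    }
}
```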

Control Plane Cabling Validation

Each MGS assumes a properly cabled backplane. While it can autonomously detect certain anomalies (e.g. if a cubby had its pair of cables swapped), there are many it cannot detect, such as if two cubbies had swapped their left cables. This might be particularly likely to arise in the case of servicing or replacing a Sidecar: while sleds can be removed without disturbing the cabled backplane, Sidecars are effectively part of the cabled backplane and need to be physically disconnected. If cables were reconnected incorrectly, MGS would not always be able to detect this error.

Nexus, the coordinating center of the control plane, would need to query both MGS instances, correlate their data, and look for inconsistencies. In the case of a repair activity we might assume that the extant Sidecar was properly cabled and that the repaired Sidecar was left in an incorrect state. We should require remediation before resuming use of the repaired Sidecar.
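One way to picture the correlation step: each MGS instance reports which device it believes sits in each cubby, and Nexus flags the cubbies where the two reports disagree. This is only a sketch under assumptions. The per-device identifier (a string here, e.g. a serial number read from the SP) and the report shape are hypothetical, since this document does not define what Nexus collects.

```rust
use std::collections::BTreeMap;

/// Each report maps cubby number -> device identity as seen through
/// one Sidecar. Returns, in ascending order, the cubbies that both
/// MGS instances report on but disagree about.
fn inconsistent_cubbies(
    from_mgs_a: &BTreeMap<u8, String>,
    from_mgs_b: &BTreeMap<u8, String>,
) -> Vec<u8> {
    let mut out = Vec::new();
    for (&cubby, id_a) in from_mgs_a {
        if let Some(id_b) = from_mgs_b.get(&cubby) {
            if id_a != id_b {
                out.push(cubby);
            }
        }
    }
    out
}
```

If two cubbies had their left cables swapped, one Sidecar would see the two devices in exchanged positions, and both cubbies would show up in the result.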

Considered Alternatives

Software-Free Sidecar Location Determination

The plan above relies on software to disambiguate between Sidecar locations. It’s reasonable to suggest a physical disambiguation method, but nothing obvious comes to mind.

One considered option is to use a different cable for Sidecar A, e.g. with an extra conductor that could be grounded at the end. This would add an extra SKU and (more importantly) is error-prone to implement, especially through rack servicing operations. Right now, the cables are identical between the two Sidecars, and we’d like to keep it that way.

Using Router Solicitation for IPv6 address discovery

IPv6 address discovery uses multicast packets to request the IPv6 address of the SP or MGS. The desired data looks a lot like Router Solicitation / Router Advertisement messages from NDP.

Using RS messages would require implementing RS handling in smoltcp and configuring the illumos netstack appropriately, but gives us the advantage of using a well-known protocol (e.g. for packet analyzers).

On the other hand, this is overloading the RS protocol in a way that may be unexpected. It seems safer to use our own protocol, without baked-in semantics.

No VLAN tags between KSZ8463 and SP

We can imagine a world where frames between the KSZ8463 and SP do not have VLAN tags attached to them. This would simplify the Hubris side of the netstack. In such a world, we can mostly preserve the isolated per-switch network behavior with the following KSZ8463 configuration:

Port 1’s default VID is 0x301, which is applied to untagged frames on ingress.
Port 2’s default VID is 0x302, which is applied to untagged frames on ingress.
Port 3’s default VID is 0x1. VLAN tags are not applied on ingress, and stripped on egress.
VLAN 0x301 contains ports 1 and 3.
VLAN 0x302 contains ports 2 and 3.
VLAN 0x1 (the default VLAN) contains ports 1, 2 and 3.

In this configuration, an untagged frame arriving on KSZ8463 port 1 is tagged with VID 0x301. The only other port on this VLAN is port 3, so it is forwarded to the SP, and the VLAN tag is stripped on egress.

This architecture relies more heavily on learning at the switch level: a frame from the SP to a particular Sidecar will be assigned to the default VLAN, then forwarded to port 1 or 2 depending on the destination MAC address, based on the KSZ8463’s internal (learned) MAC tables. In a notable difference from previous behavior, multicast and broadcast messages from the SP now go to both Sidecars.

In addition, because frames arriving at the SP do not have VLAN tags, it’s harder to perform Sidecar location: the SP does not know whether a particular frame arrived on KSZ8463 Port 1 or 2. However, it’s not impossible, through the power of nested VLAN tags:

Nested VLAN tags for Sidecar location

In this image, MGS sends a message to VSC7448 port 2 (outer VLAN tag 0x102) which is dropped by the KSZ8463 if it doesn’t arrive on port 1 (inner VLAN tag 0x301). MGS can use a similar strategy as before to determine its location by sending "Reply if you can hear this packet" requests to known ports.

Though the VSC7448 supports multiple VLAN tags, there are questions about the host software’s ability to handle this plan.

More importantly, this bakes the KSZ8463 into the network architecture, which doesn’t necessarily make sense: in a future product, we could use an SP that has two MACs of its own, instead of using an off-chip switch. To prepare for such a future, it’s best to keep the implementation details of the two-port architecture isolated, which is accomplished by only using the 0x301/2 VLAN tags between the KSZ8463 and the SP.

Non-Isolated Per-Switch Networks

We could also remove the VLANs between the KSZ8463 and SP entirely! This also simplifies the Hubris netstack, but is not a good idea; let’s unpack why.

First, Sidecar location determination becomes much harder: the workaround above fails, and we’d end up falling back to out-of-band solutions (e.g. having the Scrimlet communicate with its on-board SP over its serial link to confirm whether it has received a packet).

Second, multicast messages from MGS will be reflected by the KSZ8463 to the other instance of MGS. This is not inherently a problem, but is definitely weird and non-ideal.

Without VLANs at the KSZ8463, it will send multicast messages back upstream

Note that we still assume VLANs between MGS and the VSC7448, so multicast messages wouldn’t be reflected back downstream by Sidecar B, and there are no loops in the L2 network graph (so we wouldn’t need to run spanning tree).

Randomly generated MAC addresses

Every STM32 has a 96-bit unique ID baked into the silicon. The ID format is not documented for this particular chip, but is typically a mix of batch, X/Y location on the die, etc.

We could hash this UID and generate a MAC address from it. If we generate 72 management MAC addresses per rack (two MAC addresses each for 32 Gimlets, 2 Sidecars, 2 PSCs), the chances of a collision are 0.015% for a 24-bit hash (using the Oxide OUI as the top 24 bits).

This seems low, but it’s high enough that we’d need to plan for and detect collisions, which adds software implementation complexity.
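The 0.015% figure is a birthday-problem calculation and can be reproduced directly. The sketch below assumes the hash output is uniformly distributed over the 24-bit space, which is an idealization; the function name is ours.

```rust
/// Probability that at least two of `n` uniformly random values
/// drawn from a space of 2^`hash_bits` collide.
fn collision_probability(n: u32, hash_bits: u32) -> f64 {
    let space = 2f64.powi(hash_bits as i32);
    let mut p_unique = 1.0;
    for i in 0..n {
        // The (i+1)th value must avoid the i values already drawn.
        p_unique *= 1.0 - f64::from(i) / space;
    }
    1.0 - p_unique
}
```

With 72 addresses and 24 bits, `collision_probability(72, 24)` comes out to roughly 1.5e-4, matching the first-order estimate 72 × 71 / 2 / 2^24.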

Note

Using a random MAC address is reasonable for development and on boards which lack a FRU ID EEPROM.

Using the same MAC address

Due to our network isolation, it’s theoretically possible to use the same MAC address for every single SP, because messages are sent deterministically to a single port based on VLAN tags.

This would be very funny, and is likely to work at the low level, as both the VSC7448 and KSZ8463 include a per-VLAN FID in their MAC tables. However, it would probably confuse something else in the stack, and would make network debugging harder.

Determinations

The management network will use VLAN tagging to isolate every SP-Sidecar link at the L2 level. This requires software support in MGS and Hubris.
Packets are untagged as they traverse the backplane, then re-tagged at the network edges.
SP MAC addresses will be assigned based on the FRU ID EEPROM. Each SP will have two management network MAC addresses (one for each Sidecar).
SP IPv6 addresses will be generated based on the MAC addresses.
The Sidecar will determine its location in the rack by sending messages to attached SPs, which are on deterministic ports. Once it knows its location, the rack cable layout provides a known map between physical cubbies and VLAN tags. MGS can use this map to direct packets to specific SPs.

Open Questions

To learn or not to learn?

The VSC7448 and KSZ8463 both automatically perform learning to associate MAC addresses (combined with a per-VLAN FID) with specific ports. However, with our network architecture, only two ports (upstream / downstream) are ever associated with a given VLAN-tagged frame.

We could disable learning, which would cause the switches to "flood" incoming frames to all relevant ports; this would just be the VLAN-specific port that the frame didn’t arrive on.

The main downside is that we could no longer dump the MAC / learning tables for debugging switch behavior.

(As a third alternative, we could disable learning but have the SP program static MAC tables into the switches as it comes up. This doesn’t seem to have any obvious advantages.)
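To see why flooding costs nothing in this topology, consider the egress set for a flooded frame: every member of the frame's VLAN except the ingress port. With two-member VLANs that set always has exactly one element. The sketch below is a model of that rule, with a made-up upstream port number of 48.

```rust
/// Ports a frame is flooded to when learning is disabled: all
/// members of its VLAN other than the port it arrived on.
fn flood_ports(vlan_members: &[u8], ingress: u8) -> Vec<u8> {
    vlan_members
        .iter()
        .copied()
        .filter(|&port| port != ingress)
        .collect()
}
```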

Rack configuration detection

The strategies above allow a Sidecar to determine its location and the rack-level configuration of an Oxide Rack Version 1 (non-final name).

A future Oxide product is likely to reuse the Sidecar + Scrimlet but with an entirely different rack-level configuration, e.g. a networking rack with 8 Sidecar / Scrimlet pairs but no other Gimlets (this is a purely made-up configuration).

Our current strategy does not allow a Scrimlet to detect that it’s running in an alternate rack configuration.

Security Considerations

SP VLAN hopping

A malicious SP should not be able to craft packets that reach other SPs.

We can enforce this rule with careful configuration of the switches. In short, any packet that doesn’t obey the tagging scheme shown in [network-vlan] should be dropped; this includes missing VLAN tags, bonus VLAN tags, and incorrect VLAN tags.

Most importantly, the downstream ports on the VSC7448 should only accept untagged frames. This means that any SP trying something tricky will have its frames dropped on ingress.
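Expressed as a predicate over a raw frame, the downstream ingress rule looks roughly like this. The real check is done by the switch's ingress filtering rather than by software, so this is only a model of the policy. Treating both 0x8100 (802.1Q) and 0x88A8 (802.1ad) as "tagged" is an assumption about which TPIDs the switch should be configured to recognize.

```rust
/// Model of the downstream-port ingress policy: accept only frames
/// whose EtherType field is not a VLAN tag.
fn downstream_accepts(frame: &[u8]) -> bool {
    if frame.len() < 14 {
        return false; // too short to be an Ethernet frame
    }
    let ethertype = u16::from_be_bytes([frame[12], frame[13]]);
    // 0x8100: 802.1Q C-tag; 0x88A8: 802.1ad S-tag.
    !matches!(ethertype, 0x8100 | 0x88A8)
}
```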

There is one obvious weakness to this scheme: a malicious Sidecar SP may reconfigure the VSC7448, and can therefore send messages to any SP in the system.

External References

[rfd58] Oxide Computer Co. RFD 58: Rack Switch. 2021.
[rfd126] Oxide Computer Co. RFD 126: Cabled Backplane. 2022.
[rfd142] Oxide Computer Co. RFD 142: Ignition. 2021.
[rfd144] Oxide Computer Co. RFD 144: Sidecar Detailed Design. 2021.
[rfd148] Oxide Computer Co. RFD 148: FRUID ROMs. 2021.
[rfd165] Oxide Computer Co. RFD 165: Gimlet Manufacturing Programming. 2021.
[rfd174] Oxide Computer Co. RFD 174: Oxide Assigned Numbers Authority. 2021.
[rfd210] Oxide Computer Co. RFD 210: Omicron, service processors, and power shelf controllers. 2021.
[ksz8463] Microchip. KSZ8463 datasheet. 2018.
[RFC4291] Network Working Group. IP Version 6 Addressing Architecture. 2006.
[RFC4861] Network Working Group. Neighbor Discovery for IP version 6 (IPv6). 2007.
