RFD 58
Rack Switch
Note
This RFD was the subject of an Oxide and Friends episode, The Sidecar Switch

Introduction

This RFD is an attempt to capture the requirements and design space for what is referred to as "the rack switch". There has already been significant discussion about what this function may look like and as a result this document will be more colored than some other RFDs. Given that this function sits at the heart of an Oxide rack and quite literally connects all the major pieces together this document may color a bit outside the lines. Where possible an attempt is made to refer to details on adjacent RFDs.

Functional Requirements

The rack switch is to provide reliable connectivity among the compute nodes in an Oxide rack as well as connectivity between the rack, adjacent Oxide racks, the greater customer network and beyond. In this capacity the rack switch handles three big classes of networking traffic:

  • Control plane, all traffic originating from or destined for control endpoints used to manage and operate an Oxide rack

  • Block storage, all traffic used to provide block device services for virtual machines

  • Application, traffic originating from or destined for customer virtual machines running in an Oxide rack

Each of these classes of traffic presents different requirements, as discussed below:

Control Plane

Control Plane traffic is required to successfully boot and operate an Oxide rack. This traffic can be further subdivided into two classes, based on function and its endpoints:

  • Service Processor (SP) control and status, originating from or destined for SP endpoints

  • Host control and status, originating from and destined for control plane endpoints running on the host CPU

SP Control

The function of the SP control plane traffic is to allow for reliable low level control over network and compute node system board primitives and provide services (if necessary) needed to bootstrap the main network data plane. This traffic is traditionally referred to as out-of-band (OOB) management.

Given the limited processing capabilities of the SP when compared to the host CPU, this traffic has the following characteristics:

  • Ethernet frame based

  • Low volume

  • Primarily unicast, but uses broadcast for endpoint discovery

  • Strictly confined within a single rack/switch domain

In order to facilitate this traffic the rack switch is expected to provide:

  • An Ethernet broadcast domain, which can be established without relying on other (configuration) services or functional entities

  • Rudimentary rate-limiting and broadcast control, so as to protect the constrained resources of SP endpoints (a sketch of such a limiter follows at the end of this subsection)

  • The highest class of packet forwarding reliability

Given the critical nature of this traffic it is expected to be available first and disrupted last.
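
As an illustration of the rudimentary rate-limiting called for above, the sketch below shows a token-bucket style policer for broadcast frames headed to SP ports. This is purely illustrative: the real mechanism would be a policer in the switch data plane, and the rate and burst values are placeholders rather than requirements.

  use std::time::Instant;

  /// Illustrative token-bucket policer for broadcast traffic towards SP
  /// endpoints. The real mechanism would live in the switch data plane;
  /// the rate and burst values below are placeholders.
  struct BroadcastPolicer {
      tokens: f64,
      capacity: f64,  // burst size, in frames
      fill_rate: f64, // frames per second
      last_refill: Instant,
  }

  impl BroadcastPolicer {
      fn new(fill_rate: f64, capacity: f64) -> Self {
          Self { tokens: capacity, capacity, fill_rate, last_refill: Instant::now() }
      }

      /// Returns true if a broadcast frame may be forwarded to SP ports.
      fn allow(&mut self) -> bool {
          let now = Instant::now();
          let elapsed = now.duration_since(self.last_refill).as_secs_f64();
          self.last_refill = now;
          self.tokens = (self.tokens + elapsed * self.fill_rate).min(self.capacity);
          if self.tokens >= 1.0 {
              self.tokens -= 1.0;
              true
          } else {
              false
          }
      }
  }

  fn main() {
      // Placeholder policy: at most 100 broadcast frames/s with a burst of 20.
      let mut policer = BroadcastPolicer::new(100.0, 20.0);
      println!("first broadcast frame allowed: {}", policer.allow());
  }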

Host Control

All remaining control plane traffic which does not fit the SP Control traffic criteria is considered to belong in the Host Control Traffic class. This includes traffic for higher level control plane primitives such as data replication for control plane state, API services, larger volume telemetry, etc. This traffic may cross the boundary between Oxide racks and availability zones. As such it is expected to have the following characteristics:

  • Originates from and is destined for host CPU endpoints

  • Is carried using routed protocols

  • Represents a moderate amount of overall traffic volume

  • Does not require sophisticated switch features

In order to facilitate this traffic the rack switch is expected to provide:

  • One or more Ethernet broadcast domains separated from the SP Control domain

  • Basic L3 routing/forwarding facilities

  • Basic classification capabilities for QoS and telemetry. A possible implementation could use ECN bits in IP headers or priority bits in VLAN tags (see the sketch below).
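
To make this kind of classification concrete, the sketch below extracts the 802.1Q priority (PCP) bits and the IPv4 ECN bits from a raw Ethernet frame and maps them to a queue. The field offsets follow 802.1Q and IPv4; the queue mapping itself is a hypothetical placeholder and not a Tofino 2 configuration.

  /// Illustrative only: map a frame to a QoS queue using 802.1Q priority
  /// (PCP) bits and IPv4 ECN bits. The queue numbering is a placeholder.
  fn classify(frame: &[u8]) -> Option<u8> {
      // Ethernet header: dst(6) + src(6) + EtherType(2).
      if frame.len() < 18 {
          return None;
      }
      let ethertype = u16::from_be_bytes([frame[12], frame[13]]);
      if ethertype != 0x8100 {
          return None; // untagged: leave to a default queue
      }
      // 802.1Q TCI: PCP is the top 3 bits of the 16-bit tag control field.
      let tci = u16::from_be_bytes([frame[14], frame[15]]);
      let pcp = (tci >> 13) as u8;

      // The inner EtherType follows the VLAN tag; 0x0800 is IPv4.
      let inner = u16::from_be_bytes([frame[16], frame[17]]);
      let ecn = if inner == 0x0800 && frame.len() >= 20 {
          frame[19] & 0b11 // byte 1 of the IPv4 header holds DSCP/ECN
      } else {
          0
      };

      // Hypothetical policy: PCP selects the queue, congestion-marked (CE)
      // traffic is demoted one level.
      Some(if ecn == 0b11 { pcp.saturating_sub(1) } else { pcp })
  }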

The configuration and state management of Host Control networking facilities is likely to depend on services provided by the SP control plane.

Block Storage

The block storage traffic class comprises all traffic needed to provide block storage services for virtual machines running on compute nodes in the rack. This class of traffic has similar characteristics to the stage 2 (Host Control) traffic described above, with the exception of its volume: it is likely to represent a major portion of the overall traffic volume.

Similarly the rack switch is not expected to provide additional features or primitives, but this traffic should only be carried by a high-capacity data plane.

Application

The last major traffic class is application traffic, primarily as a result of customer workloads executing in virtual machines. This traffic will be diverse and is likely to need additional primitives provided by the rack switch. Examples may involve floating IP addresses, load balancing, packet filtering, etc. This traffic is expected to have the following characteristics:

  • Originates from or is destined for hosts/networks outside the rack

  • Is carried using protocols such as VxLAN, Geneve, etc.

  • Requires more advanced networking primitives

  • Represents the majority of overall traffic volume

In order to support this traffic, the rack switch is expected to provide:

  • Connectivity/interoperability with the customer network/wider internet

  • Manipulation of packets carried through protocols such as VxLAN, Geneve, or similar

  • Stateful primitives for load balancing, packet filtering

Reliability

Customers may be deploying a single Oxide rack. This makes the rack switch function a significant failure domain. It is therefore desirable to provide the rack switch function using redundant hardware, allowing for a reboot/restart of individual pieces.

Given the role of the switch in the Oxide rack and its ability to move (and even duplicate) traffic, similar concerns with respect to firmware and system integrity exist for the rack switch as they do for compute nodes. We need to control the boot process of the devices implementing the switch function in order to establish confidence in the integrity of both the hardware and their configuration.

This leads to the following additional requirements:

  • The rack switch hardware in an Oxide rack shall be implemented such that networking functions are provided without a single point of failure

  • Software and configuration data required for boot and operation by the switch hardware and stored on the rack switch system board can be attested and its integrity established

Design Considerations

The rack switch has been a long discussed component of the Oxide rack. Straw man requirements, informed by [rfd9], [rfd24] and implementation details such as those found in [rfd14] have already shaped our thinking. What follows is a summary of various design details as they exist in our collective understanding. These are by no means a concrete implementation, but they have been guiding us so far.

Topology

The rack switch function is implemented using two devices, built around merchant silicon. Each device is connected to each compute node in the rack using a 100GBASE-CR4 based link. It is assumed that both devices provide service under normal operation, providing for a total of 200Gb/s non-blocking aggregate bandwidth between compute nodes. In the case of planned maintenance or a failure of one of the switching devices, network connectivity is expected to continue, albeit at reduced capacity. Please note that each flow will be forwarded using a single path and as a result the bandwidth per flow will always be limited to 100Gb/s.
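
The single-path behavior can be illustrated with a hash-based path selection sketch. This is illustrative only: the actual selection happens in the host networking stack and/or switch ASIC, and the hash function and flow key used here are arbitrary.

  use std::collections::hash_map::DefaultHasher;
  use std::hash::{Hash, Hasher};

  /// Conventional 5-tuple identifying a flow; the real selection logic
  /// lives in the host/ASIC data path, this is just an illustration.
  #[derive(Hash)]
  struct FlowKey {
      src_ip: [u8; 4],
      dst_ip: [u8; 4],
      protocol: u8,
      src_port: u16,
      dst_port: u16,
  }

  /// Pick one of the two rack switches for a flow. Because every packet of
  /// a flow hashes to the same switch, a single flow never exceeds the
  /// 100Gb/s of one link even though 2x100Gb/s is available in aggregate.
  fn select_switch(flow: &FlowKey) -> usize {
      let mut h = DefaultHasher::new();
      flow.hash(&mut h);
      (h.finish() % 2) as usize
  }

  fn main() {
      let flow = FlowKey {
          src_ip: [172, 16, 0, 10],
          dst_ip: [172, 16, 0, 20],
          protocol: 6, // TCP
          src_port: 49152,
          dst_port: 443,
      };
      println!("flow pinned to rack switch {}", select_switch(&flow));
  }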

Chipsets

Various semiconductor manufacturers have chipsets in the market which may fit our port density, use case and requirements. Having said that, we would like to somewhat future-proof our investment given the time frame in which we expect to go to market. This means that designing with an ASIC which supports up to 64 ports of 200G, or 12.8Tb/s of bandwidth, has our preference. This will allow us to upgrade the bandwidth to our compute nodes from 2x100G to 2x200G in the future, without having to design a new rack switch. In addition, having additional ports available in the switch chassis beyond what is strictly necessary to support the number of compute nodes in the rack will provide us with flexibility in providing connectivity into the customer network as well as in combining multiple Oxide racks in a single cell and/or availability zone.

With this in mind we have considered the following chipsets:

Broadcom Tomahawk 3

The Tomahawk3 (TH3, [tomahawk3-brief]) is Broadcom’s current high performance, full-featured, fixed-function switch ASIC, aimed at data center networks. It supports up to 64x200G, or 12.8Tb/s of traffic. Its feature list is long, including support for the VxLAN and Geneve encapsulation protocols.

We have performed a napkin design exercise to determine if this ASIC can be implemented in the way we would like to, using an external PCI Express link to a compute node and with support for attestation of its firmware. As far as we have been able to determine this is possible, but Broadcom has not been willing to provide feedback on this design without us purchasing a substantial block of engineering support hours. The napkin can be found here.

While this chipset would certainly meet our needs, our experience interacting with Broadcom makes it clear that while they are happy to sell us silicon, they have little interest in supporting us. In no uncertain terms they have suggested we stick with the tried and true, working with one of their design/manufacturing partners. While this is not necessarily bad advice given what they know about us and where we are in the process, this attitude does not match the expectations we have for the relationships with our primary technology partners.

Intel (Barefoot Networks) Tofino 2

The Barefoot Networks (recently acquired by Intel) Tofino 2 (TF2, [tofino2-brief]) ASIC is a very different device from the TH3. Where the TH3 is fixed-function, the Tofino ASICs are programmable using the P4 programming language [p4-wikipedia]. This allows for switch functions/features to be determined by the integrator (us) rather than the ASIC designer. It should be evident that this is hugely appealing to us, as it will allow us to enable/deliver features in future releases of our software stack. We could even go so far as to tailor switch/data plane features to specific (classes of) customers, when appropriate. With support for 64x200G this would mean the switch may be used in multiple design generations of our racks while providing the ability to adapt to changing customer requirements in the future.

It should be noted that Broadcom provides a device with similar capabilities in the form of the Trident4 [trident4-brief]. It was suggested however that the TH3 would be better suited to our needs and we were consequently not provided with documentation.

Where Broadcom has not made much of an attempt to win our business, the opposite is true of the Barefoot Networks (BN) team. They went out of their way to pitch Tofino during early meetings with Intel, positively surprising us in the process. They have consistently been prompt to answer our questions, provide presentations and demos of Tofino’s design details and development environment, and have provided engineering support to help with a similar napkin design exercise to determine if Tofino can be made to work using an external PCI Express link to our compute node.

In addition to great support, BN has supplied us with the source code to pretty much everything we need in order to implement Tofino in our product. This includes the full driver stack, run time, compiler extensions to the open source P4 compiler and some working P4 application examples.

Conclusion

Tofino 2 is an appealing device for a next generation data center product like the Oxide rack. Its programmable nature means we can deliver functionality over time and tailor the data plane to the workloads of specific (classes of) customers through hardware acceleration of future networking applications. The port density and raw performance of the device provides for headroom, which will allow for reuse of the (majority of the) switch design in subsequent Oxide rack generations.

During the OCP Summit 2020 Intel showed Tofino to be an integral part of their data center offerings. Combined with the support provided by the BN team to date and the fact that we were given the source code to everything we may need, we feel confident we will be able to develop a meaningful partnership in the coming years. As such it is the plan of record to use Tofino 2 as the ASIC in our rack switch.

External PCI Express

Contemporary switch designs based on merchant silicon, such as the [wedge100-spec], are all implemented as a host CPU connected to the switching ASIC using PCI Express. Ignoring implementation details this makes these switches like a large, multi-port network interface, housed in a purpose built enclosure containing an otherwise ordinary PC Compatible system and x86 host CPU.

Given our design philosophy around firmware (as little as possible and only in places where it can be appropriately managed), the requirement to have confidence in the integrity of the device and its configuration, and the investments we are making in this area for our compute node, it is our desire to implement the switch ASIC as an external PCI Express device connected to a regular compute node in the rack. This makes it such that we can leverage all the work already done on system board management, firmware integrity and host boot software.

Out-of-band Management

An important point of discussion has been the primitives required to facilitate/implement the stage 1 control plane. Over the course of several sessions we have determined that a serial connection (RS-232, RS-422, RS-485) between SP and control plane functions is not the direction we would like to go. While initially appealing because of its relatively low complexity (although -422 and -485 are non-trivial), serial connectivity has very practical limitations with respect to cabling, raw bandwidth and part availability. By the time we have added the features required for control plane communication, it starts to resemble Ethernet.

When using Ethernet to implement the stage 1 control plane, we have two options: a dedicated network using a discrete Ethernet switch, or the Network Controller Sideband Interface (NC-SI, [nc-si-wikipedia]).

Dedicated Network

The dedicated network option is a tried and true method of implementing an out-of-band management network. This could be implemented by adding secondary commodity FE/GigE switch ASICs to the rack switch chassis, each of which is connected to every SP present in the rack. This would provide a redundant path for stage 1 control plane traffic.

Secondary Switch ASIC

Fortunately the switch ASICs for this class of devices are so well integrated that a chipset can be implemented at a modest component cost. While these chipsets generally contain a full CPU in order to allow them to be used as the basis for a workgroup switch, most/all can be booted/controlled by an external CPU using a SPI interface. This would make it possible to wire it into the I2C bus provided by the external PCI Express connection or have it controlled using a CPU core of our choosing. There are plenty of details to work out, but at least on paper this is feasible.

Examples of possible chipset candidates are:

Please note that the mentioned chipsets are really just a somewhat random selection of devices available. Each of these semiconductor companies has multiple offerings which fit our needs and there are other options available from companies such as Marvell.

Cabling

In addition to the switching devices this solution would require adding cabling to the rack. This can obviously be accomplished using Cat 5/6 twisted pair cabling, but this is less desirable for serviceability. At the time of writing we have significantly progressed down the path of blind mating Ethernet and using a discrete management network will require additional signal pairs.

Network Controller Sideband Interface

The alternative to a discrete out-of-band management network is the use of NC-SI. This method uses the NIC in each compute node and effectively turns it into a small switch, providing a Fast Ethernet-like interface through which the SP can connect to the rack data plane. A control protocol is provided as part of the standard for the SP to manage the upstream ports to the data plane and the forwarding filter, allowing it to send/receive traffic.

When implemented using a PCI Express Add-in Card (AIC), NC-SI becomes a somewhat awkward thing which needs to be channeled over SMBus. Fortunately the OCP Mezzanine specification provides a more sane method, allowing for a relatively straightforward RMII connection between the Ethernet MAC connected to the SP and the NIC. The appeal of this solution is that apart from the Mezzanine slot little infrastructure is required on the main board of the compute node and no additional hardware needs to be added to the rack to provide the stage 1 control plane network. This traffic would simply be carried over separate VLANs within the data plane.

While the promise of no additional hardware is appealing, NC-SI implementations currently in the market have proven to be fragile, riddled with bugs in both the NIC firmware and the BMC client stack. The protocol state machine is complex and vendors have struggled getting the implementation in their firmware to work correctly. This has caused a somewhat negative feedback cycle, with adoption being slow due to problems, but the implementations at the same time not getting the miles and priority required to make them reliable.

Given the criticality of this traffic, it is of paramount importance that we can get this to work reliably for NC-SI to be a viable option. As such the NIC vendor selection will have to focus on this capability, and to offset risk we are aiming to do the following:

  • Analyze the implementations of prospective vendors and request details of the test/validation they use to develop this feature

  • Negotiate access to the source of the NIC firmware, providing us with the ability to debug (and potentially fix) problems on either side of the link (optional, but desired)

  • Build a NC-SI evaluation rig, capable of exercising the more challenging aspects of the protocol state machine (a sketch of such a rig's bring-up sequence follows this list)

  • Develop the compute node to allow for either a discrete management network or the use of NC-SI, providing us with the option to delay this decision until after we have been able to perform our own evaluation
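
To make the evaluation rig slightly more concrete, the sketch below enumerates a bring-up sequence the rig would drive while injecting faults at each step. The command names follow the NC-SI specification, but opcodes, framing and the transport trait are deliberately omitted or hypothetical; this is not a decided test plan.

  /// Hypothetical transport for NC-SI commands (RMII/RBT or SMBus framing
  /// is abstracted away); the trait exists only for this sketch.
  trait NcsiTransport {
      fn command(&mut self, name: &str) -> Result<(), String>;
  }

  /// The state the rig had reached when a command failed, which is where
  /// fault injection (timeouts, malformed responses, surprise resets)
  /// becomes interesting.
  #[derive(Debug, PartialEq)]
  enum RigState {
      InitialState,
      PackageSelected,
      ChannelEnabled,
  }

  fn bring_up<T: NcsiTransport>(t: &mut T) -> Result<(), (RigState, String)> {
      t.command("Clear Initial State").map_err(|e| (RigState::InitialState, e))?;
      t.command("Select Package").map_err(|e| (RigState::InitialState, e))?;
      t.command("Enable Channel").map_err(|e| (RigState::PackageSelected, e))?;
      t.command("Get Link Status").map_err(|e| (RigState::ChannelEnabled, e))?;
      Ok(())
  }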

If done well, parts of the work for the evaluation rig can be reused in our control plane protocol stack.

Alternative Considered: RS-485

Note for future readers: this section was written much later than the surrounding sections. It is for this reason you will find additional design aspects creep in. We felt it necessary however to include this section for our future selves as this reflects our thinking during the design of this part of the system and to remind ourselves why we decided against the use of RS-485 or a similar serial protocol.

Building an Ethernet-compatible management network has turned out to be harder than we expected, due to:

  1. Lack of good SP SoC choices that have dual Ethernet MACs

  2. Even fewer parts that have gigabit MACs

  3. A desire to run the network over a twinax cabled backplane

  4. A desire to AC couple (or magnetically couple) the network

  5. The difficulty of fitting 2x the number of PHYs and ports in the switch just for management, particularly if those ports have magnetics

Given all of this, there are no off-the-shelf solutions that we could merely buy and deploy. If we’re having to assemble something from parts, why does it need to be Ethernet? Why can’t we do something apparently simpler, such as an RS485-style serial protocol, or even CAN? All our SP candidates have two or more UARTs and CAN MACs. We considered this — in many ways it’s very appealing. But there are a number of problems.

First, any communication protocol we designed would need to recreate much of Ethernet. We would need link-layer addressing, switching for peer-to-peer communications, error detection/correction, some amount of rate negotiation for upwards compatibility, provisions for flow control, etc. This is not a deal-breaker, but it is engineering effort, and by using an Ethernet MAC we get most of this for free. (Also, because we’re a tools company, you know we’d have to write our own logic analyzer plugins for our custom stuff; Ethernet analyzers already exist.)

Second, we can’t meet the above requirements by "just" deploying something like RS-485, because while RS-485 in an NRZ encoding is balanced in the differential sense, it is not inherently DC-balanced. Our UARTs only do NRZ, and cannot (for example) scramble or Manchester-encode their output. This means it’s hard to AC couple, requiring external isolator devices or more complex coupling circuits. We could balance the signal using an external scrambler or encoder chip, but there aren’t fantastic options for such chips.

Third, the data rates on RS-485 and CAN are both in the neighborhood of 10Mb/s before overhead is taken into account. This is 1/10 the throughput of our proposed 100Mb/s Ethernet and could become a problem for management traffic throughput (particularly on CAN, where the overhead is comparatively high).

Fourth, because we cannot buy a 36+ port switch chip for a protocol of our own devising, the connections at the switch could easily become a bottleneck, magnifying the bandwidth problem by sharing the 10Mb/s across all nodes. (CAN, in particular, expects to be a shared medium rather than a switched network. RS-485 can work either way, but we’d need to build it ourselves.) Whereas the switch side of the SGMII-based solution is relatively straightforward.

Finally, while the v1 SP has a single Fast Ethernet MAC (100Mb/s), other candidate parts such as the NXP RT1170 will become available over the next year. We have the possibility of switching to a v2 SP with redundant MACs and higher management throughput. But these parts' UARTs and CAN MACs are the same speed as our current choice, so by relying on them, we don’t get any improvement — whereas with the SGMII-based approach, we can potentially slot in a v2-based sled and have it magically get more bandwidth, thanks to auto-negotiation. And so an Ethernet-based management network seems like the least-worst option for a number of reasons, and has potential for future expansion that could be valuable within the next few years.

Conclusion

After several rounds of questions with the NIC vendors the team has taken a hard look at the responses. For most implementations we feel there are shortcomings in the way traffic is separated between the host CPU and attached SP. Furthermore the implementations require shared ownership and control of the NIC between host CPU and SP and we do not feel confident this can be properly resolved. In response we have chosen to implement a dedicated management network in the rack switch and use blind mating of the additional cabling to overcome the serviceability challenges. More details on this decision are available in [rfd89]. There are some details we would have to work out such as how the switch domains are joined together in order to allow traffic between host CPUs and SPs, but overall this solution is well understood and the individual components carry less inherent complexity.

Device Functions

The following section describes the major device functions required to facilitate the rack switching function provided by the management network and Tofino 2. Each of these functions requires design decisions to be made and these sections are an attempt to capture the required details.

Rack Switch Device Interface

Tofino 2 and the system board used to implement the switching ASICs, at a high level, function like a regular PCIe AIC. Where things differ with respect to most AICs is that the rack switch is housed external to the host CPU, with removable cabling between them. Because of this loose coupling between the host CPU and rack switch no assumptions should be made about their boot order and the rack switch should be implemented as a hot-plug capable device, including surprise arrival and removal.

Given the availability requirements for the rack switch, it is desirable for it to continue providing service even during short periods of the host CPU being unavailable. Furthermore re-attaching the host CPU should not interrupt or reset the switching services provided to the various traffic classes. This implies that the rack switch can autonomously handle critical chassis management functions, up to and including an emergency shutdown as a result of thermal runaway.

Required Signals for System Board Control

Assuming the rack switch is provided with standby power as soon as it is mounted in the rack and power is applied to the bus bar, the following minimum set of signals are required to control the different system board functions:

  • PRSNT#: indicating the cable is attached and the rack switch is present, allowing the host CPU to commence a hot-plug procedure

  • I2C/SMBus: communication with board control functions, possibly including functions required for chassis boot

  • PCIe Reference Clock

  • PCIe x4 RX/TX

  • PERST#: PCIe link activation/reset

Required Signals for Host CPU Hot-plug Support

In addition to the signals listed above, AMD Epyc host CPUs require some additional signals to enable hot-plug support. These signals, primarily used to implement hot-plug support for NVMe SSDs, are described in the PCI Express ExpressModule specification. Implementing the Rack Switch Device Interface using the same electrical specification would allow for a single hot-plug implementation on the compute node for these subsystems. This list of signals, used by the AMD host CPU to track and drive the power state of the device, is as follows:

  • PWREN#: indicating the PCIe module should be powered up

  • PWRFLT#: indicating a power fault on the PCIe module

For details on how these signals should be implemented to enable hotplug support, please refer to [rfd66] § 2.1 and [rfd129]. Please refer to [rfd95] § 6.2 for considerations on signal reference, isolation and back-driving.

PCIe ExpressModule Emulation

The PCI Express ExpressModule specification assumes module power is provided by the host system. It is for this reason the interface provides the controls listed above to enable, disable and monitor the module power state. The rack switch does not meet this assumption since it is powered independently from the compute node. Wiring these signals into the rack switch power control function and letting the AMD hot-plug machinery drive this directly may cause (part of) the power to the switch ASICs and interface modules to be reset on hot re-attach.

Since it is desired for the switching services to remain available during re-attach (claimed to be supported by TF2) some amount of emulation of these signals will be necessary in order to satisfy the power-on logic in the hot-plug infrastructure on the host CPU and allow the OS and device driver to take control and manage TF2 re-initialization. This emulation will be provided by connecting the PRSNT#, PERST#, PWREN#, PWRFLT# and SMBus signals to the system Service Processor (SP). Software on the SP will then initiate and manage the hotplug procedure with the host CPU, depending on overall system state.
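
A minimal sketch of the emulation policy the SP could implement is shown below. The types and the policy are hypothetical simplifications; the real sequencing will be governed by the ExpressModule specification, [rfd66] and overall system state.

  /// Sketch of SP-side emulation of the hot-plug signals listed above.
  /// Booleans mean "asserted" rather than electrical levels; the policy is
  /// a hypothetical simplification.
  #[derive(Debug, PartialEq)]
  struct SpOutputs {
      prsnt_asserted: bool,       // present the device to the host
      pwrflt_asserted: bool,      // report a power fault to the host
      release_tofino_reset: bool, // let the driver (re)initialize TF2
  }

  #[derive(Debug)]
  enum SwitchPower {
      Standby, // power sequencing not yet complete
      On,      // switching service available
      Fault,   // power fault detected by the SP
  }

  /// Decide what to present to the host given the actual state of the rack
  /// switch, which is powered independently of the compute node.
  fn emulate(host_pwren_asserted: bool, switch: &SwitchPower) -> SpOutputs {
      match switch {
          // The SP answers the host's power-enable request without actually
          // cycling switch power, keeping the data plane undisturbed across
          // a hot re-attach.
          SwitchPower::On => SpOutputs {
              prsnt_asserted: true,
              pwrflt_asserted: false,
              release_tofino_reset: host_pwren_asserted,
          },
          SwitchPower::Standby => SpOutputs {
              prsnt_asserted: true,
              pwrflt_asserted: false,
              release_tofino_reset: false,
          },
          SwitchPower::Fault => SpOutputs {
              prsnt_asserted: true,
              pwrflt_asserted: true,
              release_tofino_reset: false,
          },
      }
  }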

Ethernet Ports

Ethernet frames enter and exit a switch ASIC using Ethernet ports. The quantity and the location of the hardware implementing these ports has a significant impact on the design of the rack switch.

Port Groups

Backplane Ports

As per [rfd76] each of the 32 compute nodes will be connected to the rack switch using 100GBASE-CR4, with an option to upgrade to 200GBASE-CR4 in a future product. Each port/compute node is connected to the switch ASIC using DAC cables, implemented as a twinax based (custom) backplane instead of off-the-shelf DAC cables in QSFP28 form-factor. This cabled backplane will allow for blind mating the compute nodes and reduce the amount of cabling operators have to handle. Still under investigation is whether or not it is desirable/feasible to also blind mate the rack switch.

Network Ports

Mirroring the 32 ports used to connect compute nodes, it would be desirable to have up to 32 uplink ports available, allowing an Oxide rack to be connected to a high bandwidth networking fabric, like those deployed by the hyperscalers. These ports will be implemented using QSFP28 optical transceivers, allowing for both short range (100GBASE-SR, 100m/150m) and long range (100GBASE-LR4/ER4, 10km/40km) optical connections between racks or even directly between availability zones.

Management Ports

The final +1 port available on the TF2 ASIC, which unlike the other 64 ports is 1GigE capable, will be connected to the secondary ASIC providing the management data plane. With appropriate switching policies this will allow communication between host CPUs and SPs.

Customer Impact

Tofino 2 devices are available in SKUs with 32+1, 48+1 and 64+1 Ethernet ports. The price for these devices is primarily driven by the amount of (enabled) programmable resources, the number of packet processing pipelines (2, 4) and the number of stages within these pipelines (12, 20), and not the number of Ethernet ports. While most customers are unlikely to utilize all 32 network ports, the cost reduction for buying a device with 48 ports is not significant (early pricing indicates a few hundred dollars per device at most). Similarly the design and BOM cost of providing the extra ports on the front of the chassis should be low, while having these extra ports available will mean that some customers will be able to connect more racks together without requiring additional switches implementing a wider spine for the network fabric. An example:

A customer would be able to connect up to 32 racks in a single cell using a full mesh, requiring 31 inter-rack links and leaving one link available per rack switch as uplink to the customer network/the Internet. Tofino 2 will allow the use of 200GBASE-SR4 optical transceivers once these are available, providing each rack pair with 2 x 200Gb/s of peak bandwidth and 2 x 40/100/200Gb/s per rack to the network and/or Internet. At 32 compute nodes per rack this would be an oversubscription ratio of 1:16 when all rack switches are operational while at 24 compute nodes per rack this ratio is reduced to 1:12. Depending on inter-rack bandwidth requirements of storage and application traffic this ratio may be acceptable.
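
For reference, the ratios above follow directly from comparing a rack's node-facing bandwidth (each node at 2x100Gb/s, i.e. 200Gb/s) with the 2x200Gb/s available between a rack pair:

  \[
    \text{oversubscription} = \frac{N_{\text{nodes}} \times 200\,\text{Gb/s}}{2 \times 200\,\text{Gb/s}},
    \qquad
    \frac{32 \times 200}{400} = 16 \;(1{:}16),
    \qquad
    \frac{24 \times 200}{400} = 12 \;(1{:}12)
  \]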

For cells with fewer than 32 racks, the inter-rack bandwidth can be improved (half the number of racks means half as much oversubscription per rack pair), network cost can be reduced (because fewer transceivers are required) and with better network policies and better traffic steering capabilities asymmetrical links and network graphs of different shapes such as a grid or cube would be possible. All this can safely be explored at a later time.

If a customer desires more capacity than can be provided by the 32 rack cell described above without investing in additional networking infrastructure, racks could be grouped into smaller cells (8, 16, 24 racks each) with higher oversubscription ratios between cells. This would require software support in order to keep applications and traffic local to cells as much as possible. Alternatively or in addition a customer could choose to deploy one or more separate AZs.

For customers requiring larger cells or cells with more cross-sectional bandwidth, our current rack switch design and cabled backplane can be extended to form a new network centric rack type. The cabled backplane should in this case significantly cut down the per link cost, cabling and management burden of interconnecting these spine switches, while the 32 network ports per rack switch would allow for 256+ 200Gb/s fabric ports per rack. At full port utilization this would provide a currently staggering (let’s revisit this in 10 years) 51.2Tb/s+ of cross-sectional bandwidth per network rack, allowing for cells of 200+ racks and 6400+ compute nodes.

Board Layout Considerations

The Ball Grid Array (BGA) pinout of TF2 devices with 64 ports matches the backplane/network grouping, aiding board layout if ports can be allocated to both the front and back of the rack switch. For devices with 48 ports the pins are split 32/16 with one quadrant not connected, while 32 port SKUs have all 32 ports allocated on one side of the device.

Network Interface Module Management

The hardware used to handle Ethernet frames consists of two major parts; the ASIC, which provides serialization/deserialization (SerDes), queueing and packet processing functionality, and pluggable network interface modules. The latter provide an abstraction for the external cabling used to connect a switch or NIC to the network, allowing devices to be used with either short range Direct Attached Copper (DAC) cabling or longer range fiber optical cabling.

Introduction

The primary data plane of the rack switch consists of the Tofino 2 ASIC connected to pluggable transceivers using an interface popularly known as QSFP28 and specified in [sff8665]. Each rack switch is expected to be connected using both types of cabling; the compute nodes in the rack will be connected using DAC (albeit with a blind-mating twist), while connectivity to the broader network will be provided using fiber optic transceivers and cabling.

Optical transceivers implemented as QSFP28 modules are active devices which require some amount of active control. Not unlike other high speed interfaces, the electrical interface to these modules can be divided into a low-speed and a high-speed portion. The low speed portion consists of an I2C bus and several strap pins for presence, power state as well as status LEDs, while the high speed portion contains the eight differential pairs used to send/receive Ethernet frames. The full interface is specified through a combination of [sff8665], [sff8636], [sff8679].

Of particular note is that while the module implements a standard I2C physical layer signaling scheme, the address of the module is fixed and used as a control word to indicate a read or write operation. This means in practice that QSFP28 modules can not share an I2C bus and instead need a bus segment per module.
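
The sketch below shows what this means for software: before a module can be addressed, its bus segment has to be selected on the upstream I2C switch. The bus trait and the mux control-byte layout are hypothetical placeholders; the fixed module address (A0h, i.e. 0x50 in 7-bit form) comes from the SFF specifications referenced above.

  /// Minimal sketch: every QSFP28 module responds at the same fixed I2C
  /// address, so each module must sit behind its own bus segment. `I2cBus`
  /// is a hypothetical stand-in for whatever HAL ends up being used, and
  /// the mux control byte below is illustrative, not a specific part's
  /// register map.
  trait I2cBus {
      fn write(&mut self, addr: u8, bytes: &[u8]) -> Result<(), ()>;
      fn write_read(&mut self, addr: u8, wr: &[u8], rd: &mut [u8]) -> Result<(), ()>;
  }

  const QSFP_ADDR: u8 = 0x50; // fixed module address, shared by all modules

  fn read_module_byte(
      bus: &mut dyn I2cBus,
      mux_addr: u8, // address of the I2C switch for this port group
      channel: u8,  // downstream segment carrying the target module
      reg: u8,      // register offset within the module's memory map
  ) -> Result<u8, ()> {
      // Select the downstream segment; all other modules are isolated so
      // the shared 0x50 address is unambiguous on the bus.
      bus.write(mux_addr, &[1u8 << channel])?;
      let mut out = [0u8; 1];
      bus.write_read(QSFP_ADDR, &[reg], &mut out)?;
      // Deselect to leave the bus in a known state for the next access.
      bus.write(mux_addr, &[0u8])?;
      Ok(out[0])
  }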

[sff8636] specifies a fairly comprehensive register map containing the information needed to control a QSFP28 module. Of the information available, three pieces are of particular concern to us and will drive the design of the rack switch board:

Identification and Status

Apart from simple presence and strap pins, the module provides identifying information such as serial numbers as well as specifics such as its type, power class and device status. The identifying information will be needed by the control plane for inventory tracking, while specifics such as type and status will be needed during switch port and module initialization. With the exception of power state, the primary consumer of this data will be the host CPU/switch ASIC driver.

Power and Thermal Management

QSFP28 modules are powered by the system board into which they are inserted. In order to support the power requirements for the different link types, QSFP28 modules come in eight different power classes, with class 8 allowing for a maximum power consumption of 10W. Having up to 32 of these modules powered up in a relatively confined space requires active thermal management. In order to make this possible the standards provide a standardized means to read power consumption and temperature information from each module using the serial interface. This information is needed at a regular interval (straw-man; every 1s) in order to actively manage the thermal envelope of the rack switch chassis.
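
A sketch of the polling loop implied here is shown below. The sensor trait, the thresholds and the fan policy are all placeholders; only the 1s straw-man interval comes from the text above.

  use std::{thread, time::Duration};

  /// Hypothetical abstraction over whatever ends up reading a module's
  /// temperature (the SP via I2C switches, an FPGA bridge, etc.).
  trait ModuleSensors {
      /// Temperature in degrees Celsius, or None if the module is absent
      /// or unreadable.
      fn temperature_c(&mut self, port: usize) -> Option<f32>;
  }

  /// Straw-man control loop: poll every port once per second and derive a
  /// fan duty cycle from the hottest module. Thresholds are placeholders,
  /// not values from any module or chassis specification.
  fn thermal_loop<S: ModuleSensors>(
      sensors: &mut S,
      ports: usize,
      set_fan_pwm: &mut dyn FnMut(u8),
  ) {
      loop {
          let hottest = (0..ports)
              .filter_map(|p| sensors.temperature_c(p))
              .fold(f32::MIN, f32::max);
          let duty = match hottest {
              t if t >= 70.0 => 100, // emergency: full speed, consider shutdown
              t if t >= 55.0 => 80,
              t if t >= 40.0 => 50,
              _ => 30,
          };
          set_fan_pwm(duty);
          thread::sleep(Duration::from_secs(1));
      }
  }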

Link Management

An optical transceiver receiving incoming light will convert this into electrical signals according to its specification, generating corrupt frames if the signal to noise ratio of the incoming signal is not above a certain threshold. No assumptions are made by the transceiver about (possible) information embedded in the signals. In order to avoid sending these frames through the ASIC the transceiver monitors the incoming light levels, providing loss-of-signal (LOS) indicators/interrupts which need to be provided to the ASIC driver in order to enable/disable the corresponding switch port connected to the module. A sufficiently efficient control path needs to exist between the ASIC driver running on the host CPU and the optical transceivers in order to act on the LOS state appropriately.
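
The required control path can be reduced to something like the sketch below; the driver trait is a hypothetical stand-in for the Tofino 2 driver interface and only illustrates the direction of the dependency.

  /// Illustrative only: react to a transceiver LOS event by disabling the
  /// corresponding switch port so corrupt frames are not forwarded.
  trait AsicDriver {
      fn set_port_enabled(&mut self, port: usize, enabled: bool);
  }

  #[derive(Clone, Copy, PartialEq)]
  enum LinkEvent {
      LossOfSignal,
      SignalRestored,
  }

  fn handle_link_event<D: AsicDriver>(driver: &mut D, port: usize, event: LinkEvent) {
      // The latency of this path bounds how long garbage frames can enter
      // the ASIC after the optical signal drops below threshold.
      driver.set_port_enabled(port, event == LinkEvent::SignalRestored);
  }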

The challenge in managing the interface modules is that multiple entities require access to the information provided via the I2C interface. In particular, identification and status information is to be provided to the control plane, and is needed during network boot and module initialization. The power and thermal information is required at a regular interval by the board control function providing thermal management, while link state information is needed by the ASIC driver to manage the SerDes. The latter two functions may not be running on the same CPU, which may lead to contention in getting timely access to the information needed.

Implementation Survey

The implementation of the board infrastructure required to power and manage up to 64 interface modules can be done in one of several ways, each with different trade-offs. This section is an attempt to replicate the formula found in [rfd88] to enumerate and analyze these options in order to pick an implementation which meets requirements while adequately constraining complexity. For this discussion the principles outlined in [rfd88] § 2 very much apply and are recommended reading.

Without further ado, the following is a short list of possible options:

  1. Use the host CPU and ASIC driver for interface management by configuring the GPIO pins and PCIe-to-I2C bridge provided by Tofino 2 to create 8 I2C busses, use I2C switches (e.g., PCA9543) to further subdivide each bus (interface group) into a segment per interface and use GPIO expanders (e.g., PCA9554) to connect strap and interrupt pins.

  2. Add Gemini to the rack switch board, connecting the Service Processor (SP) to the SMBus in the PCIe link between host CPU and rack switch device. Alternatively the PCIe-to-I2C bridge provided by Tofino 2 can be connected as an alternate/additional I2C interface to the SP. Either/both of these I2C segments allow for direct communication between the host CPU and SP. Interface management functions are split between SP and the host CPU in one of the following ways:

    a. Use as many I2C busses as available on the SP and similar to (1) use one or more levels of switches (e.g., PCA9543) and GPIO expanders (e.g., PCA9554) to connect to each of the 64 interfaces including strap and interrupt pins. The host CPU and ASIC driver can query/receive LOS interrupts and counter data via the SP.

    b. Use an FPGA on the backplane and network side of the board connected to the I2C segment, strap and interrupt pins of each of the interface modules and connected to the SP using SPI. A SPI-to-QSFP28 bridge can then be implemented using soft logic, allowing for efficient parallel data access by the SP. Similar to (2)(a) the host CPU and driver can query/receive LOS interrupts and counter data via the SP.

    c. Similar to (2)(b), implement a SPI-to-QSFP28 bridge using FPGAs placed at the backplane and network side of the rack switch system board. But instead of allowing host CPU and ASIC driver access to interface module data via the SP, implement a PCIe x1 link in soft logic on both FPGAs and add a PCIe switch to the system board. This provides the host CPU and driver direct access to the Tofino 2 ASIC and LOS interrupts and counter data from both FPGAs via PCI Express.

Analysis

Option (1) carries the least amount of complexity in the hardware, but imposes limitations on other parts of the system. In particular, it provides no way for any logic on the system board to read interface module temperature data, making the host CPU and driver an active part in chassis thermal management. Given the thermal management reliability requirements and the latency of this data path this is not desirable. Of secondary concern is the use of I2C switches to provide each interface module with its own segment. These switches are prone to bus error states, requiring recovery logic in software. While seemingly sufficient at first glance, this option limits the way a rack switch can be managed.

The addition of Gemini to the rack switch as proposed in (2) significantly changes the system itself. It allows for board functions such as power sequencing and thermal management to be handled autonomously by the SP, but comes with the consequence of having Gemini be part of the boot sequence and having to maintain and manage what will most likely be a different build from the SP found on the system board of a compute node. With that said, having an SP on the system board will make for a more appropriate solution to managing the interface modules.

Option (2)(a) is similar to (1) in that it is an attempt to solve for the constraints using minimum hardware complexity. What makes (2)(a) strictly better compared to (1) is that the SP will assume control of the network interface modules, allowing it to regularly poll power and temperature data in addition to forwarding module presence, identification and status data to the control plane. This is in line with the principles found in [rfd88] § 2. Any data and LOS interrupts required by the host CPU and driver will need to be forwarded by the SP over the SMBus in the PCIe link, which may pose a bottleneck depending on what is required by the driver. Finally, the SP has a limited number of I2C peripherals and IO pins available, which means that one or more layers of I2C switches may be required to implement this option. This carries similar concerns over I2C bus errors as in (1).

Option (2)(b) is an attempt to address I2C bus reliability by replacing the I2C muxes with two FPGAs which each implement a SPI-to-QSFP28 bridge. With the SPI bus most likely being significantly higher bandwidth than the I2C segments, this will allow the SP to address up to 32 interface modules more or less in parallel, addressing possible concerns around throughput, data and LOS interrupt latency and bus reliability. Because a higher bandwidth path now exists between the network interface modules and the SP, it can prepare/pack data more efficiently when sending it to the host CPU and driver over the PCIe SMBus. This solution comes at the cost of adding two FPGAs to the system board and requires soft logic to be developed and maintained for the SPI-to-QSFP28 bridge. Placing the FPGAs on both sides of the system board should make routing of signals easier, since serial to parallel fan-out happens closer to the interface modules.

Finally option (2)(c) trades technical complexity for the flexibility of having interface module data independently available with low latency on both the SP and host CPU/driver, allowing management tasks to be allocated where they most naturally fit. This flexibility comes at the price of having to implement PCIe on the FPGA and adding a PCIe switch between the host CPU and components on the rack switch system board. This latter part is not necessarily bad as it may help hot-plug and hot re-attach by allowing the SP more control over the PCIe link. And depending on the placement of this switch, it may replace a re-timer since the switch will perform the same function. (2)(c) does pose some tension with respect to [rfd88] § 2, in particular the principle of "Single point of execution". The FPGA X-to-QSFP28 bridge would need to implement some amount of arbitration in order to manage requests from both the SP and host CPU/driver and some care must be taken in both the software running in the SP and the driver not to work against each other.

Dedicated Management Network

The initial high level concept for the dedicated management network assumed a low-complexity implementation using an unmanaged L2 switch ASIC and 100BASE-TX signaling between the rack switch and compute node. As was to be expected the details are more nuanced and this section is an attempt to capture the major design considerations, possible implementations and an analysis of these options.

Design Considerations

The following is a list of design dimensions which need to be considered:

Ports

The following ports are required:

  • 1x 10G (10GBASE-KX or SFI) for Tofino 2

  • 32x 100M or 1G for compute nodes

  • 2x 100M or 1G for PSC

  • 1x 100M or 1G BASE-T for operator console

  • 1x 100M or 1G for auxiliary power equipment (ATS, BBU)

  • 1x 100M/1G/10G between both rack switches (optional, depending on desired failure modes)

Transformers and AC coupling

The twisted pair BASE-T implementations of Ethernet require (magnetic) transformers in order to operate over twisted pair wiring. These transformers are absolutely the right thing to use when connecting systems over long distances, providing isolation between system boards, noise suppression and immunity and the higher drive currents required for these distances. These transformers however are bulky and fitting 32 of them on the rack switch system board will require an unacceptable amount of board space. Fortunately we do not need much of the functionality provided by these transformers since the backplane remains within the rack and consists of shielded twinax cabling, providing sufficient noise insulation. While we do need isolation between system boards because we cannot guarantee ground potential between them, our environment is better controlled than a building. This should make it possible for us to AC couple PHYs using capacitors. This is not standardized in any way however and while various IC vendors, including Microchip [an2190], claim this to be possible, no interoperability is guaranteed and the responsibility for compatibility between devices is put squarely on the system designer.

Switch Chipset

Despite an extensive search we have been unable to locate a single switch ASIC which provides the required ports listed above in a single chip solution. The switch ASICs available expose their available ports through a combination of builtin PHYs and serial MAC interfaces, requiring additional ICs to break out these various interface standards into either discrete SerDes interfaces or additional PHYs. The implementation survey includes several chipset combinations which can fulfill the required ports, while requiring a varying number of additional ICs to do so.

Service Processor

The Service Processor used throughout the larger system is limited to a single MAC with (R)MII. In order to connect the SP in each compute node and power shelf controller (PSC) with both rack switches in the rack, a small three port switch is used as a port expander. The small switch used in this role therefore needs to have at least two ports which are compatible with our backplane and the rack switch ASIC.

Implementation Survey

The requirement for a 10G port in order to connect to Tofino 2, in combination with the number of ports required for compute nodes, eliminates most switch ASICs readily available for the workgroup/industrial market. The following is the short list of implementation options:

  1. The Microchip VSC7449-02 can be configured to provide 12x QSGMII ports + 4x 10G SerDes ports + 1x SGMII NPI port. One of the 10G SerDes ports would be used to provide the interconnect with Tofino. Nine of the QSGMII ports would then be connected to 3x Microchip VSC8512-02, providing a total of 36 100BASE-TX ports. 32 of the copper ports would be connected to the backplane using capacitive coupling, two ports would be paired with magnetics to connect the PSC and operator console, one port could be used to connect the management network between both rack switches (optional) and finally a Microchip MAX24287 SGMII-to-MII media converter would be used to connect the SP on the rack switch board with the Node Processor Interface (NPI) SGMII port of the switch, providing connectivity and allowing for optional packet injection. The MIIM port provided by the switch system board SP can be used to configure the media converter and VSC8512 PHYs. If we have no intention of ever using the on-die CPU in the switch ASIC, the same MIIM chain can also be used to configure the ASIC. On the server and PSC sides a three port 100M/1G switch from the Microchip KSZ series could be used to connect their respective SP to both management switches.

  2. The Microchip VSC7444-02 can be configured to provide 24x 1G SGMII [sgmii] ports + 2x 10G SerDes ports + 1x SGMII NPI port. Two of these ASICs (dubbed A and B) are used in the rack switch chassis in order to provide the minimum required 34x SGMII ports. One of the 10G SerDes ports on A is used to provide the interconnect with Tofino. One 10G port on each ASIC is used as an interconnect between them and one 10G on B could be used to provide the optional interconnect with the management network in the adjacent rack switch chassis. 32 of the SGMII ports are connected to the backplane, while the remaining two are paired with SGMII capable PHYs and magnetics to connect the PSC and operator console. Finally a Microchip MAX24287 SGMII-to-MII media converter is used to connect the rack switch SP with the NPI SGMII port of A. The MIIM port provided by the switch system board SP can be used to configure the media converter. If we have no intention of ever using the on-die CPUs in the switch ASICs, the same MIIM chain can also be used to configure both switch ASICs, otherwise a SPI interface is to be used. On the server side four options exist to connect with the backplane:

    i. The NXP SJA1105R provides 4x MII ports + 1x SGMII. The server/PSC SP is connected to one of the (R)MII ports and a MAX24287 SGMII-to-MII media converter is used to turn one of the MII ports into a second SGMII port, allowing both SGMII ports to connect directly to the backplane. In the case of the PSC two ports can instead be configured as copper PHY providing BASE-T connectivity with the rack switch chassis.

    ii. The Microchip VSC7511 provides 4x SGMII ports + 1x SGMII NPI port. Two of these ports can be connected directly to the backplane and a MAX24287 SGMII-to-MII media converter is used to connect to the server/PSC SP using the NPI port. In the case of the PSC two ports can be configured as Cu PHY providing BASE-T connectivity with the rack switch chassis.

    iii. Of course no survey is complete without an option involving an FPGA; since the small switch used in conjunction with the SP is only acting as a port expander, a Lattice ECP5 with SerDes can be used to implement the required 1x MII and 2x SGMII interfaces. RTL IP for the required MACs and both interface types are available from Lattice for their devices and there is a working example for a media converter which we can extend for use with a second port.

    iv. A hybrid of sorts between (1) and (2) is possible as follows; instead of placing the PHYs in the rack switch, two SGMII connected PHYs (or a dual PHY device) with 100FX support are placed with the compute node and connected via the backplane. Both of the PHYs are connected to a small three port switch such as the KSZ8463FRL using 100FX SerDes. The server SP is in turn connected to the three port switch using RMII.

  3. As a variant of (2), the Microchip VSC7448-02 can be configured to provide 32x 1G SGMII ports + 2x 10G SerDes ports + 1x SGMII NPI port. The 32x SGMII ports are sufficient to connect the compute nodes while an additional Microchip VSC7511 connected to this NPI port will provide three additional SGMII ports + 1x SGMII NPI port; the latter is used to connect the rack switch SP. The three additional ports provided by the VSC7511 are sufficient to accommodate the PSC and technician ports.

  4. An option similar to (2) seems possible using the Marvell 98DX2538 chipset. From the product brief it is not apparent this device has any advantage over the options from Microchip and since their data sheets and parts are hidden behind a sales rep and NDA I have not considered this option in more detail.

Analysis

The two major implementation options are quite different. This section is an attempt to capture and analyze these differences on a number of dimensions:

Rack Switch

One of the major differences between the options is how they trade board space/complexity in the rack switch for a more involved implementation in the Compute Node. Option (1) requires a total of four relatively large 27x27mm BGA packages (switch + 3x QSGMII-to-Cu PHY) to implement the required minimum number of ports. Including all the support infrastructure such as power and clocks, this will require a sizable amount of board space. Option (2) is implemented using two 27x27mm BGA packages and while far from trivial, will be easier to accommodate on the system board. Finally, while (3) still consists of two discrete switches, one of the packages is smaller. Having said that, the difference between (2) and (3) is unlikely to be significant.

Compute Node

Where option (2) requires fewer components in the rack switch than (1), the opposite is true when considering the parts required in the compute node. (1) requires the fewest components and is the lowest complexity solution since low port count switch chips with an MII interface and copper PHYs are very common parts. And when removing the transformers this implementation becomes quite lean. The opposite is true for a low port count switch chip with both MII and SGMII ports. Unfortunately no such device seems to exist and in order to complete the implementation a parallel-to-serial media converter is required to provide either a second SGMII port (2)(i) or the MII interface to the SP (2)(ii). Ignoring the cost of implementing the required soft logic and bitstream attestation, (2)(iii) would allow for an appealing single chip implementation. This appeal would be compounded if the same FPGA were to provide additional connectivity or functionality to either the SP or host CPU. Finally, (2)(iv) provides a hybrid of sorts between (1) and (2), colocating a SGMII connected PHY on the server motherboard but using 100FX SerDes to connect to a low cost three port switch, avoiding the need for capacitive coupling or magnetics but still allowing for the use of low cost parts on the compute node board. Since this option does not rely on analog properties in the PHY, features such as auto-negotiation will work. Electrically this option looks like the rack switch is connected to the three port switch in the compute node using 100BASE-FX optical transceivers, without using actual transceivers. Future product versions could implement this differently, for example if an SP MCU becomes available with a dual 1G MAC and native SGMII.

Backplane

The use of transformers in the rack switch is unlikely to be possible given the required board space for these components. Option (1) would instead require a transformer-less or capacitively coupled implementation. This has certainly been done and Microchip provides some guidance [an2190], but there are some caveats; while all application notes mention 10BASE-T and 100BASE-TX, there is no mention of capacitive coupling in 1000BASE-T applications. It is unclear whether this has become less relevant in the age of GigE networking or if the difference in encoding/signaling between the latter two prohibits this from working. [snla0880a] from TI provides more details on auto-negotiation, signal shaping guidance and passing 802.3-20xx requirements. The use of SGMII [sgmii] on the other hand is far more straightforward. The standard was developed by Cisco for connecting optical PHYs and for use in backplanes, and given the DC-balanced encoding used it can be trivially AC coupled. In addition, since all signaling is encoded as frames over LVDS and no feature/protocol assumptions are made about the properties of the interconnect, features like auto-negotiation work as expected. Since we are aiming to design the backplane for multiple generations of our product, we may want to avoid the possible compatibility problems hiding in capacitively coupling copper PHYs. Given that industrial use of Ethernet will remain for the foreseeable future and GigE is likely to become more available on lower cost parts, SGMII feels like a safer bet.

Software

The software complexity to configure and manage the devices is difficult to argue one way or another. On the rack switch end, (1) requires initialization and setup of the switch and PHYs, while (2) requires initialization of two identical switch devices. On the compute node side, (1) is clearly less involved than any of the other options, requiring initialization of only a single device. The use of MIIM should make this easier in (2) and (3), as all devices can likely be chained together and addressed using a single bus. One of the reasons to prefer (2)(ii) over (2)(i) would be that some or all of the VSC family devices seem to be built using the same or very similar IP, possibly allowing some reuse in the drivers for these devices. Of note is (2)(iii) and the use of a bitstream to configure the FPGA. As pointed out during in-person discussion, this carries risk, as bricking this device with a faulty bitstream would leave the server SP unable to connect to the management network and attempt a recovery path. This means some kind of local recovery mechanism is to be devised in order for the SP to recover from such a failure using a last known good bitstream.
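As a minimal sketch of what such a recovery mechanism could look like (the types and names below are hypothetical and not tied to any real SP firmware interface), the SP could keep two bitstream images in local storage and only fall back to the last known good image when the candidate fails to configure or pass a basic health check:

```rust
// Hypothetical sketch of a last-known-good bitstream fallback on the SP.
// None of these types map to a real SP API; they only illustrate the shape
// of the recovery policy discussed above.

#[derive(Clone, Copy, PartialEq)]
enum Slot {
    A, // candidate (most recently written) bitstream
    B, // last known good bitstream
}

trait Fpga {
    /// Attempt to configure the FPGA from the given bitstream slot.
    fn load(&mut self, slot: Slot) -> Result<(), ()>;
    /// Basic post-configuration sanity check, e.g. is the management link up.
    fn health_check(&mut self) -> bool;
}

/// Try the candidate slot first and fall back to the last known good image,
/// so that a faulty bitstream cannot permanently sever the management network.
fn bring_up_fpga<F: Fpga>(fpga: &mut F) -> Result<Slot, ()> {
    for slot in [Slot::A, Slot::B] {
        if fpga.load(slot).is_ok() && fpga.health_check() {
            return Ok(slot);
        }
    }
    Err(()) // both images failed; leave the FPGA unconfigured and report the fault
}
```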

Conclusion

After discussion we have decided to adopt (2)(iv) as our PoR for the dedicated management network. The VSC7444-02 from Microchip, while more powerful than we need and arguably a bit pricey, is a solid choice for our use case and requires the least amount of PCB real estate in the rack switch. Given that it targets the industrial Ethernet market and is a relatively recent addition to the Microchip lineup, it should remain available for the five to ten year life of our rack switch product.

Furthermore we feel that the use of SGMII as a signaling standard in our backplane for this network is an appropriate choice when considering risk of implementation and availability of parts in the future.

Finally we opt for a combination of a Microchip KSZ8463FRL and a Microchip VSC8552 or VSC8562 to provide the required RMII interface to the SP and SGMII interfaces to the backplane for both the compute node and the PSC. A final selection between those two PHYs will depend on a requirement for IEEE 1588 support. The KSZ8463FRL and VSC8552/VSC8562 will be connected using their 100FX SerDes, steering us clear of having to use transformers or capacitively couple copper PHYs. This configuration is more involved (and expensive) than we would like, but since this implementation detail is limited to the compute node and PSC respectively, it can be improved in a future product if/when more SP-capable MCUs with dual GigE MACs and possibly native SGMII (fiber) support become available.

The use of all Microchip parts for this portion of our system is deliberate, with the intent that if problems were to arise we only require support from a single FAE team.

Thermal Management

Tofino 2 is a high power device and, combined with the power consumption of optical QSFP28 modules, adequate environmental control of the chassis is paramount for correct operation. As such, an onboard temperature monitoring and fan control function which can operate independently of the host CPU is desired. At a minimum the environmental control system consists of the following pieces:

  • TF2 on-die temperature diode, connected to an external temperature sensor with a programmable temperature trip point

  • Additional temperature sensors placed on the system board to measure inlet and exhaust air temperatures

  • Temperature sensors built in to the network facing QSFP28 transceivers (up to 32 sensors in the PoR design)

  • A fan controller which can be programmed and monitored by an external CPU, connected to as many dual rotor fans as required

  • A temperature monitoring function which ingests temperature data and programs fan curves through the fan controller (a minimal sketch of this control loop follows the list)
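The sketch below is purely illustrative: the sensor set matches the list above, but the limits, fan curve break points and duty cycles are placeholder values pending thermal modeling, and none of the types correspond to a real firmware interface.

```rust
// Minimal sketch of the temperature monitoring function described above.
// Sensor sources, limits and the fan curve are placeholders, not tuned values.

struct Reading {
    /// Measured temperature in degrees Celsius.
    temp_c: f32,
    /// Temperature at which this component should be considered at risk.
    limit_c: f32,
}

/// Map the smallest thermal margin across all sensors (Tofino 2 diode,
/// inlet/exhaust, QSFP28 modules) onto a fan duty cycle in percent.
fn fan_duty(readings: &[Reading]) -> u8 {
    let worst_margin = readings
        .iter()
        .map(|r| r.limit_c - r.temp_c)
        .fold(f32::INFINITY, f32::min);

    match worst_margin {
        m if m <= 0.0 => 100, // at or past a limit: full speed
        m if m < 10.0 => 80,
        m if m < 20.0 => 50,
        _ => 30,              // plenty of margin: keep acoustics/power down
    }
}
```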

Implementation

While the chassis and environmental conditions for the rack switch are somewhat different from the compute node, it is possible to reuse most of the design for the latter. This implies the use of the SP in order to implement the thermal control loop, but doing so is supported by the following:

  1. The SP or something equivalent will be needed to manage the QSFP28 network interface modules

  2. Reading the QSFP28 temperature sensor data does not seem to be supported by discrete controllers, forcing us to be able to run code locally on the rack switch system board instead of on the host CPU (a sketch of such a read follows below).
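For illustration, reading a module's internal temperature amounts to a small I2C transaction against the module management interface. The I2c trait below is a placeholder for whatever bus driver the SP ends up using, while the address, byte offsets and scaling follow SFF-8636 (lower page 00h, bytes 22/23, in units of 1/256 °C):

```rust
// Sketch of reading the internally monitored temperature of a QSFP28 module.
// The I2c trait is a placeholder; the register offsets and scaling follow the
// SFF-8636 management interface (lower page 00h, bytes 22/23, 1/256 °C per LSB).

const QSFP_I2C_ADDR: u8 = 0x50; // 7-bit module management address (8-bit 0xA0)
const TEMP_MSB_OFFSET: u8 = 22;

trait I2c {
    /// Read `buf.len()` bytes starting at `reg` from the device at `addr`.
    fn read(&mut self, addr: u8, reg: u8, buf: &mut [u8]) -> Result<(), ()>;
}

/// Return the module temperature in degrees Celsius.
fn qsfp28_temperature<B: I2c>(bus: &mut B) -> Result<f32, ()> {
    let mut raw = [0u8; 2];
    bus.read(QSFP_I2C_ADDR, TEMP_MSB_OFFSET, &mut raw)?;
    let value = i16::from_be_bytes(raw);
    Ok(f32::from(value) / 256.0)
}
```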

The fans and fan controller used in the compute node can most likely be reused in the rack switch chassis. The PoR fan is a 48V 80mm model, which fits in the 2OU available to the compute node. Alternatively, if feasible, a 92mm model may be possible, allowing for more efficient cooling at the cost of not reusing this part. Depending on cooling capacity we may need more than the three dual rotor fans used in the compute node, but if needed an additional fan controller can be added to accommodate these additional channels.

The Tofino 2 ASIC does not have features to allow more fine grained control over power consumption. Parts of the device (SerDes, redundant packet processing pipelines) may be put in a lower power state as a result of lower resource utilization, but it does not provide frequency throttling levers. In order to avoid device damage as a result of thermal runaway, it will be wise to add a discrete failsafe in the form of a temperature sensor with an alarm function connected to the on-die diode, which can be programmed during system boot with a temperature threshold near the max junction temperature, raising an alert signal if tripped. This signal would be used as an input to the ASIC power rail sequencing logic, providing a last resort, low latency emergency shutdown. With the fans not connected to these rails, this should allow the system to quickly regain positive environmental control in the chassis.
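A minimal sketch of arming that failsafe at boot might look like the following. The sensor trait and the exact trip point are assumptions (the real threshold would come from the Tofino 2 datasheet and thermal model); the point is that after this one-time programming step the alert output gates the ASIC power sequencer entirely in hardware, with no further software involvement:

```rust
// Hedged sketch of arming the discrete thermal failsafe at boot. The sensor
// driver below is hypothetical; after programming, the sensor's latched alert
// output feeds the ASIC power rail sequencing logic directly.

/// Placeholder trip point in degrees Celsius, a small guard band below the
/// maximum junction temperature. The real value comes from the datasheet.
const TRIP_POINT_C: i8 = 105;

trait RemoteDiodeSensor {
    /// Program the over-temperature threshold for the remote (on-die) diode.
    fn set_remote_trip_point(&mut self, celsius: i8) -> Result<(), ()>;
    /// Enable the alert output wired to the power sequencing logic.
    fn enable_alert_output(&mut self) -> Result<(), ()>;
}

fn arm_thermal_failsafe<S: RemoteDiodeSensor>(sensor: &mut S) -> Result<(), ()> {
    sensor.set_remote_trip_point(TRIP_POINT_C)?;
    sensor.enable_alert_output()
}
```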

Serviceability Design

[rfd132] details a number of serviceability considerations that impact Sidecar’s mechanical design. Unlike Gimlet, which allows for toolless removal of the system, the same is not true for the entire Sidecar chassis due to the complexity of the cabled backplane. As a result we instead focus on improving the serviceability of other components to help balance the current physical reality of the product. Below we look at a number of specific cases of serviceability in Sidecar.

Fans

One of the goals of using 3OU for each Sidecar was to enable the use of 80mm fans for cooling and to allow us to reuse the exact same fans that we’re currently slated to use in Gimlet, which are dual-rotor, counter-rotating 80x80x80mm fans ([rfd114]). As we believe fans are a component with a meaningful failure rate due to mechanical wear-out ([rfd132]) and servicing the Sidecar chassis directly is quite complicated, we opted to design the rear fans as a toolless CRU.

In other words, someone should be able to insert and remove the fans independently without having to power off or otherwise disrupt service to Sidecar, or perform any other actions (such as removing a backplane cable or a transceiver for some reason).

While hot-serviceable fans make it easier for a customer to perform the service and reduce its impact, this also means that there is a corresponding responsibility to finish the service action quickly. Missing one or two rotors of cooling may cause thermal degradation, the exact amount of which is still subject to thermal modeling. While Gimlet combines all fans into a single cold-serviceable unit ([rfd132]), because Sidecar’s fans are hot-serviceable they should also be independently serviceable to minimize the impact to Sidecar.

Front Panel

The front panel of Sidecar presents both a design challenge and an opportunity. The front panel is full of QSFP transceivers and fiber optic cables. If a Sidecar has to be serviced, this is one place where, absent us doing something, it is going to be challenging for customers to make sure that all the fibers and transceivers are returned to the same place they started.

Service LEDs

When it comes to networking, LEDs have historically been used to indicate whether a link is up, the speed at which the link is up, and whether or not there is activity on the link. Most existing switches have some number of LEDs per transceiver port to indicate this. We take a slightly different approach to LED management in an attempt to maintain actionability (see [rfd77] for the rationale).

As a result, we stick to using a single, monochrome LED for each individually serviced component in Sidecar. This LED is used to indicate component health and does not indicate activity. Explicitly, we want the following LEDs (a minimal sketch of this policy follows the list):

  • Front-facing system level service LED

  • A service LED for each QSFP port

  • A service LED for each independently serviced Fan

  • A service LED for any front-facing RJ45 technician ports
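As a rough illustration of this policy (the types are placeholders, not a committed firmware interface), every independently serviced component gets exactly one LED, and the only input to its state is whether the component needs service:

```rust
// Placeholder sketch of the service LED policy: one monochrome LED per
// independently serviced component, driven by health rather than activity.

/// The independently serviced components carrying a service LED.
enum ServiceLed {
    System,             // front-facing system level LED
    QsfpPort(u8),       // one per QSFP port
    Fan(u8),            // one per independently serviced fan
    TechnicianPort(u8), // front-facing RJ45 technician ports
}

/// The LED is lit if and only if the component needs service; link state and
/// traffic activity deliberately play no part in this decision.
fn service_led_lit(_led: &ServiceLed, needs_service: bool) -> bool {
    needs_service
}
```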

Diagrams

Sidecar Functional Diagram
Management Network Diagram
Sidecar Mechanical Concept Using Samtec FQSFP Flyover Cabling

External References