Background
The Oxide platform is built on multipath networks at the physical layer. The Oxide rack contains a pair of switches and each compute sled is connected to both switches in the rack. This means that for any packet going from one compute sled to another, both switches are viable nexthop destinations and a choice must be made as to which one to use. When racks are interconnected the nexthop choice must also be made at the switch level, e.g., which switch in an adjacent rack is the best nexthop for a given compute sled in that rack?
This situation is generally known as the multipath routing problem. A solution to this problem is a path from source to destination across a graph representation of a network. Routing protocols that implement solutions to this problem come in both distributed and centralized forms.
Distributed solutions such as the [BGP] protocol often take the form of what are called distance vector (DV) algorithms. Routers running a DV protocol announce destination prefixes to their peers along with a distance metric. Peers receiving DV announcements add a routing entry to their local tables for each prefix in the announcement with a nexthop set to the announcing router. If there are multiple paths the distance metric and a weight for that metric are combined to determine which path to use when making forwarding decisions. Weights are typically determined out of band by measuring networks and forming traffic matrices.
Centralized solutions such as the [OSPF] protocol often take the form of what are called link state (LS) algorithms. Routers running an LS protocol announce destination prefixes to their peers along with a graph representation of the links in the network they have direct observability of or have learned from others. Peers receiving LS announcements combine link state information from peers to form a complete graph representation of the network. When making routing decisions, each router runs a local shortest path first (SPF) algorithm such as Dijkstra or Bellman-Ford using a combination of link distance metrics and weights to determine the best path. Like the DV protocols, determining a traffic matrix to form a system of weights is typically an out-of-band activity.
Some routing protocols combine DV and LS approaches. Early on at Oxide, we were considering implementing [RIFT] as our primary intersled routing protocol alongside ECMP. RIFT stands for Routing In Fat Trees. As the name suggests, this is a routing protocol specific to so-called Fat Tree or Folded [Clos] type topologies. These topologies are hierarchical and have an explicit notion of directionality. Routers higher in the hierarchy are northbound, routers lower in the hierarchy are southbound, and routers in the same level of the hierarchy are east/west. RIFT combines LS and DV by having all northbound announcements be LS and all southbound announcements be DV. With this scheme every router has a complete view of the topology below it to make optimal routing decisions, while not having to worry about specific details of the broader network above.
What all of these protocols have in common is that they are dynamic routing protocols. This means that routers implementing these protocols are able to combine information from peer routers and operators as they are running to adapt to changing network conditions. Examples of adaptation include adding or removing routes in response to announcements from peers, removing routes due to a peer session failure or expiration, or updating link weights because an operator has provided a new traffic matrix.
All of the protocols described above also separate the notion of network structure determination from forwarding path decision making. Network structure as perceived by any router in a network is observed through announcements from peers describing network reachability and some notion of topological structure associated with that reachability information, whether it be a distance metric or a collection of links each with their own metrics. Forwarding path decisions are made based on distance or link weights from a traffic matrix that is supplied by an operator within the context of observed network structure. Simply put, network structure informs a router of what its options are for forwarding a packet to the next hop, and link/distance weights allow the router to determine the best choice from those options.
The mechanics of how network reachability and path weight information is propagated and processed in a routing protocol are determined in part by the timescale at which the protocol must observe and adapt to changing network conditions. The BGP and OSPF protocols were designed to support routing in the Internet (capital I). The network structure of the Internet is based on peering agreements between organizations that move at the speed of legal contracts. Path selection in the Internet is mostly an economic/business decision based on contracts that assign monetary costs to forwarding decisions. Organizations operating with a large Internet footprint engage in an activity called traffic engineering which boils down to measuring and predicting traffic flows to calculate traffic matrices that minimize (maximize) operating costs (revenues).
RIFT is a protocol designed specifically to operate in the data center context. Similar to BGP and OSPF, RIFT propagates network structure through announcements. Unlike BGP and OSPF, RIFT only supports specific topological structures. RIFT leverages that specificity to define a valley free routing (VFR) model that always prefers southbound routing based on centralized SPF computations. RIFT has its own built-in notion of distance metric calculation that can be combined with an external weighting mechanism to determine optimal paths.
Goals
Given the background on routing protocols above, let’s describe what our goals are for an intersled routing protocol for the Oxide platform. Note that the intersled routing protocol is not the only routing protocol we must implement for the product. We must also implement protocols such as BGP, OSPF and IS-IS that allow Oxide racks to be connected to upstream networks. But those protocols are a part of Boundary Services described in [RFD63] and are not the subject of this RFD; we are specifically concerned with intersled routing here.
Our goals for intersled routing are the following.
N-1 fault tolerance. Any link or switch can fail or be taken out of service without causing a communication outage.
Flexible topology construction. Our customers decide how they want to interconnect racks, not us.
Optimal packet-granularity load balancing at the speed of traffic dynamics.
Scale to hyperscale sized networks.
These goals are in rough order of priority. First and foremost our networks must be reliable. Our customers’ workloads must continue without the need for manual operator mitigation in the presence of link and switch faults. Here N-1 means that for any given Oxide topology, if we look along the path between two servers and count the critical components needed to make that path work from phy-to-phy, we denote that count as N. N-1 fault tolerance means we can lose any one of those components without an outage. So we can lose any phy, any transceiver, any cable, or even any switch. However, once we are in an N-1 state, that is considered a contingency state and there is risk that another failure could lead to an outage. Another failure could happen and we could find ourselves operating at N-2, but now the risk of the next failure creating an outage is even higher. This terminology comes from power grid operations.
Putting our customers in the driver’s seat for topology construction means that we cannot assume a particular topology structure. This rules out the use of protocols like RIFT. While we cannot guarantee that every conceivable topology will work, we should strive to give our customers the maximum possible freedom in deciding how they want to interconnect their Oxide racks.
Our customers are paying for 6.4 tbps of bisection bandwidth per rack with nodes that are microseconds apart in latency. We intend to give them every last ounce of that available bandwidth at snappy round trip times, independent of the characteristics of the traffic they run through Oxide-powered interconnects. That means instrumenting, observing and adapting to network dynamics well inside the timescale at which microcongestion starts to degrade effective capacity. Microcongestion is a short-lived congestion event that comes and goes within sub-millisecond intervals. These events, often called microbursts, have been shown to be responsible for the majority of data center packet loss and performance degradation. See the [DRILL] paper for a good survey of the research in this space. Optimizing the network at this speed means in-band instrumentation and continual distributed optimal path approximation. This also rules out RIFT-like approaches as centralized SPF computations can’t keep up.
Finally, our routing protocols should not impose scaling limits on the number of racks a customer can interconnect. As a distance-vector type protocol, ddm scales well in terms of the routing table size any particular node has to keep. The largest routing tables will be on rack switches that are connected to a large number of other rack switches. An upper bound on routing table size for rack switches is given in the P4 Considerations section. That discussion assumes racks that are directly connected sidecar-to-sidecar. As DDM evolves and Oxide starts to produce networking racks, we’ll need to consider intermediary switches, which implies the need for aggregation and route summarization. This should fit in naturally with DDM being a distance vector protocol. But at the point we start stretching across true intermediary routers, we need to graduate from pure distance vector to path vector (e.g. like BGP) to account for asymmetric topology conditions and to not aggregate in a way that would inhibit optimal delay calculation. Lots of fun and interesting challenges there, but for now this is an open question / future work.
At face value there is also concern for blowing up packet sizes, as extension headers add 4 bytes per hop on every packet. However, modern hyperscale topologies such as Facebook’s multi-plane folded clos [FbClos] and topologically similar networks from other hyperscalers have shown that even for extraordinarily large networks with hundreds of thousands of servers, the number of hops involved is relatively small, and very likely to fit within the 15-hop limit (described later) of a single IPv6 extension header.
Introduction
Delay driven multipath (ddm) is a routing protocol that aims to provide network-layer load balancing and fault tolerance inside flow round-trip-times on networks where multiple physical paths exist between hosts. Ddm takes inspiration from the micro load balancing ideas in [DRILL] and the notion of using delay as a leading indicator of congestion from [Swift]. We bring these ideas together in a routing protocol that has a distance-vector control plane for out-of-band route distribution, with an in-band IPv6 extension header based data plane for carrying destination-delay values between nodes in the network as they send packets. The result is a continuously evolving distributed approximation of a Dijkstra forest where link weights are destination delays, there is a root for every node in the network, and the radial fanout for each root node corresponds to the destination nodes the root is communicating with. A Dijkstra forest is an extension of the shortest-path tree that results from running Dijkstra’s Algorithm over a graph. In Dijkstra’s algorithm, if we select one node in the graph as the root, then the result is a tree that contains the shortest path from every other node in the graph to the root. A Dijkstra forest results from selecting each node in the graph independently as a root node, computing the shortest-path tree for that node, and calling the collection of all such trees a forest. In the case where all nodes in a network are communicating, the distributed graph formed among the nodes solves a time-varying form of the all pairs shortest path (APSP) problem continuously.
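To make the Dijkstra forest idea concrete, the following is a minimal Rust sketch (assumed graph representation, not part of any ddm implementation) that computes one shortest-path tree per root over a static weighted graph and collects the trees into a forest. Ddm never computes this structure directly; it continuously approximates it in a distributed fashion as measured delays change.

use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Adjacency list: graph[u] = list of (neighbor, link weight).
type Graph = Vec<Vec<(usize, u64)>>;

// Shortest-path tree for one root, encoded as a predecessor array:
// parent[v] is the next node on the shortest path from v toward the root.
fn shortest_path_tree(graph: &Graph, root: usize) -> Vec<Option<usize>> {
    let n = graph.len();
    let mut dist = vec![u64::MAX; n];
    let mut parent = vec![None; n];
    let mut heap = BinaryHeap::new();
    dist[root] = 0;
    heap.push(Reverse((0u64, root)));
    while let Some(Reverse((d, u))) = heap.pop() {
        if d > dist[u] {
            continue; // stale heap entry
        }
        for &(v, w) in &graph[u] {
            let nd = d + w;
            if nd < dist[v] {
                dist[v] = nd;
                parent[v] = Some(u);
                heap.push(Reverse((nd, v)));
            }
        }
    }
    parent
}

// A Dijkstra forest: one shortest-path tree for every node taken as a root.
fn dijkstra_forest(graph: &Graph) -> Vec<Vec<Option<usize>>> {
    (0..graph.len())
        .map(|root| shortest_path_tree(graph, root))
        .collect()
}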
In addition to tracking delay, ddm also tracks outstanding acknowledgement time. By incorporating this information into routing decisions we can not only load balance and avoid congestion, but also react to faults within the same sub-round-trip-time (RTT) timescale.
A notable aspect of ddm is that it exists purely at the internet layer (L3). Many of the multipath mechanisms in production today include transport layer information in routing decisions to achieve path affinity for connection flows. This is an explicit non-goal of ddm as we are trying to avoid the elephant flow issues inherent to these approaches. An elephant flow is a long lived high-volume flow. Because ECMP decisions are based on address/port hashing, elephant flows are pinned to a particular path independent of congestion considerations and cause head-of-line blocking issues for short-lived mice flows. A nice summary of these issues can be found in the TinyFlow paper [Tiny]. The motivating factor for transport layer flow affinity is ensuring packet order within flows. If consecutive packets within a flow traverse different paths on the way to their destination, they may arrive out of order if the latency difference across the paths is large enough. Out-of-order delivery can have a significant detrimental impact on performance if it occurs frequently. The first order objective of ddm is to optimize for latency, but we will show that this has the side effect of implicitly optimizing for ordering as a second order objective. Pragmatically speaking this can be phrased as: optimizing per-packet delay typically maintains ordering, except when delay measurements indicate that maintaining ordering comes at the cost of increased congestion.
Control Plane Protocol
The ddm control plane protocol (which is an entity unto itself, independent of the Oxide platform control plane) is responsible for neighboring router discovery and network reachability distribution through prefix announcements. It is only responsible for determining what routes exist; when multiple paths for a particular destination exist, the control plane protocol does not participate in that decision. It only creates the choice space for such decisions to be made. Ddm peers discover each other through IPv6 link-local multicast. This means peers must be on a common layer-2 broadcast domain. Ddm is an IPv6-only protocol. When a ddm router is started, it is given a set of interfaces it should attempt to discover peers on. The router will send out periodic solicitations on the configured interfaces and respond to solicitations with advertisements on those interfaces.
When a ddm router receives an advertisement in response to a solicitation, it starts a prefix exchange server, listening on the link-local source address the remote peer was discovered with. Prefix exchange messages between peers take place over http(s), optionally employing TLS client authentication. While the exchange server for an interface is running, periodic solicitations continue to be sent out at a configurable rate. Ddm defines a configurable threshold for the number of solicitations that may be missed before considering a peer expired. When a peer expires the exchange server started for that peer is torn down and any routes imported from that peer are removed.
Ddm defines two kinds of routers, server routers and transit routers. In the Oxide platform, Gimlets run server routers and Scrimlets run transit routers. The difference between server and transit routers is redistribution behavior. Any time a transit router receives an announcement, it redistributes that announcement to all of the other peers it has sessions with (not including the router from which the announcement came). Similarly, transit routers also redistribute any changes in network reachability due to peer expiration. Server routers do not redistribute any reachability information. They only send announcements for prefixes that they originate.
The ddm control plane may be viewed conceptually as a state machine instance per configured interface. This state machine is depicted in the diagram below.
The state machine begins in the solicit state. In this state peer discovery is performed. When an advertisement is received, the state machine transitions to the exchange state. From exchange, if an announce or withdraw event is fired in response to prefixes being added or removed through the local administration API, the state machine transitions to the push state where the announced or withdrawn prefixes are propagated to neighbors. The push state is also transitioned to upon receiving a pull request from a peer. A pull request is a peer asking for all the prefixes the router has. From exchange, if a sync event is fired in response to an admin API call for the local router to synchronize with all its peers, the pull state is entered where a pull request is sent to each active peer. Finally, from exchange, if a push event is fired in response to a remote peer pushing an update to this router, the update state is entered where the local routing table is updated according to the announced and withdrawn prefixes in the update message. Optionally, if the router is a transit router, the update is redistributed to active peers.
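As a rough sketch of that structure (hypothetical Rust types, not the maghemite implementation), the per-interface state machine and its transitions might be modeled as follows.

// Per-interface control plane states described above.
enum State {
    Solicit,  // sending solicitations, waiting for an advertisement
    Exchange, // peer discovered, push/pull handlers online
    Push,     // propagating announced/withdrawn prefixes to the peer
    Pull,     // requesting the full prefix set from each active peer
    Update,   // applying an update pushed by a peer to the local table
}

// Events that drive transitions between the states above.
enum Event {
    AdvertisementReceived,
    Announce,    // local admin API added prefixes
    Withdraw,    // local admin API removed prefixes
    PullRequest, // a peer asked for all of this router's prefixes
    Sync,        // local admin API asked this router to pull from all peers
    PeerPush,    // a peer pushed an update to this router
    PeerExpired,
}

// Transition function sketch. Each non-solicit state returns to exchange once
// its work completes; peer expiration drops back to solicit from anywhere.
fn next(state: State, event: Event) -> State {
    match (state, event) {
        (_, Event::PeerExpired) => State::Solicit,
        (State::Solicit, Event::AdvertisementReceived) => State::Exchange,
        (State::Exchange, Event::Announce | Event::Withdraw | Event::PullRequest) => State::Push,
        (State::Exchange, Event::Sync) => State::Pull,
        (State::Exchange, Event::PeerPush) => State::Update,
        (s, _) => s,
    }
}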
Once a router has entered the exchange state and the push/pull handlers are online, a synchronization process called the initial pull request is started. For this process, ddm routers asynchronously pull prefixes from a newly established peer as a one-shot operation (operators or control plane software can always initiate ad-hoc pull requests later). This allows the peers to synchronize any prefixes that may have existed prior to their peering. It’s important that this initial pull process does not block establishment of push/pull handlers or event handlers such as expiration timers. Establishing these handlers first ensures that requests and events cannot get lost during the initial synchronization. For example, if a router is performing an initial sync with a peer, and during that time the peer also sends a new announcement or withdraw, then if the exchange handlers are not running, the result of the pull will be stale and we will have lost the message required to update to the current state.
Discovery
The control plane discovery mechanisms are responsible for three primary things
Soliciting other routers through UDP/IPv6 link local multicast.
Sending out router advertisements in response to solicitations.
Continuously soliciting over link-local multicast at a configurable rate to keep sessions alive, and sending out notifications when peering arrangements expire due to not getting a solicitation response within a configurable time threshold.
Protocol
The general sequence of events is depicted in the following diagram.
*==========*                   *==========*
|  violin  |                   |  piano   |
*==========*                   *==========*
      |                              |
      |      solicit(ff02::dd)       |
      |----------------------------->|
      |      advertise(fe80::47)     |
      |<-----------------------------|
      |                              |
      |             ...              |
      |                              |
      |      solicit(ff02::dd)       |
      |----------------------------->|
      |      advertise(fe80::47)     |
      |<-----------------------------|
      |                              |
      |      solicit(ff02::dd)       |
      |----------------------------->|
      |      solicit(ff02::dd)       |
      |----------------------------->|
      |      solicit(ff02::dd)       |
      |----------------------------->|
      |                              |
 +----|                              |
 | expire                            |
 | piano                             |
 +--->|                              |
      |                              |
This shows violin sending a link-local multicast solicitation over the wire. That solicitation is received by piano and piano responds with an advertisement to violin’s link-local unicast address. From this point forward solicitations and responses continue. Each time violin gets a response from piano, it updates the last seen timestamp for piano. If at some point piano stops responding to solicitations and the last seen timestamp is older than the expiration threshold, violin will expire the session and send out a notification to the ddm state machine that started it. Violin will continue to send out solicitations in case piano comes back.
In the event that piano undergoes renumbering, i.e., its link-local unicast address changes, this will be detected by violin and an advertisement update will be sent to the ddm state machine through the notification channel provided to the discovery subsystem.
The DDM discovery multicast address is ff02::dd. Discovery packets are sent over UDP using port number 0xddd.
Packets
Discovery packets follow a very simple format
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    version    |S A r r r r r r|  router kind  | hostname len  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           hostname                            :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
:                             ....                              :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The first byte indicates the version. The only valid version at present is version 1. The second byte is a flags bitfield. The first position S indicates a solicitation. The second position A indicates an advertisement. All other positions are reserved for future use. The third byte indicates the kind of router. Current values are 0 for a server router and 1 for a transit router. The fourth byte is a hostname length, followed directly by a hostname of up to 255 bytes in length. At the moment, the hostname is just used for operator awareness and debugging purposes.
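As an illustration only (not the maghemite encoder), the format above could be encoded and decoded roughly as follows, assuming the S and A flags occupy the two leading bits of the flags byte as drawn.

// Flags bitfield, assuming S and A are the two most significant bits.
const FLAG_SOLICIT: u8 = 0b1000_0000;
const FLAG_ADVERTISE: u8 = 0b0100_0000;

struct Discovery {
    version: u8,      // currently 1
    flags: u8,        // FLAG_SOLICIT or FLAG_ADVERTISE
    router_kind: u8,  // 0 = server router, 1 = transit router
    hostname: String, // up to 255 bytes, informational only
}

impl Discovery {
    fn encode(&self) -> Vec<u8> {
        let name = self.hostname.as_bytes();
        let len = name.len().min(255);
        let mut buf = Vec::with_capacity(4 + len);
        buf.push(self.version);
        buf.push(self.flags);
        buf.push(self.router_kind);
        buf.push(len as u8);
        buf.extend_from_slice(&name[..len]);
        buf
    }

    fn decode(buf: &[u8]) -> Option<Self> {
        if buf.len() < 4 {
            return None;
        }
        let len = buf[3] as usize;
        if buf.len() < 4 + len {
            return None;
        }
        Some(Discovery {
            version: buf[0],
            flags: buf[1],
            router_kind: buf[2],
            hostname: String::from_utf8_lossy(&buf[4..4 + len]).into_owned(),
        })
    }
}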
Exchange
The ddm control plane exchange mechanisms are responsible for network reachability distribution. Ddm routers expose the following HTTP API endpoints. All objects are encoded as JSON in HTTP requests and responses.
Ddm routers may be optionally configured to use HTTPS as a transport and require client certificate authentication. The maghemite/ddm implementation currently calls gethostname(3c) to populate the hostname field described in the discovery packet format above.
Verb | Route | Purpose |
---|---|---|
PUT | | Push an update to this router. Takes the update message described below. |
GET | | Pull the prefixes and tunnel endpoints exported by this router. Returns the set of announced prefixes and tunnel endpoints described below. |
Key | Value Type | Description |
---|---|---|
underlay | | Updates to underlay routes. |
tunnel | | Updates to tunnel routes. |
Key | Value Type | Description |
---|---|---|
announce | | Network prefixes being announced by the sending router. |
withdraw | | Network prefixes being withdrawn by the sending router. |
Key | Value Type | Description |
---|---|---|
addr | | An IPv6 address prefix. |
len | | Length of the address prefix. |
Key | Value Type | Description |
---|---|---|
announce | | Tunnel endpoints being announced by the sending router. |
withdraw | | Tunnel endpoints being withdrawn by the sending router. |
Where the tunnel origin object has the following fields.
Key | Value Type | Description |
---|---|---|
overlay_prefix | | The destination prefix on the upstream network being announced or withdrawn. |
boundary_addr | | The address of the tunnel endpoint that acts as an egress for the overlay prefix. |
vni | | A 24-bit Geneve virtual network identifier placed in a 32-bit unsigned integer. |
metric | | A metric indicating a relative sense of priority for this tunnel origin. This is used as a basis for overlay route installation into OPTE virt-to-boundary tables. See [RFD404] for more details. |
Key | Value Type | Description |
---|---|---|
underlay | | The set of path vectors announced by a router. |
tunnel | | The set of tunnel endpoints announced by a router. |
Key | Value Type | Description |
---|---|---|
destination | | The destination prefix being announced. |
path | | A list of hops along the path to the destination. |
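Tying the objects above together, a rough Rust/serde sketch of these message shapes follows. The concrete field types, the use of vectors for the sets, and the assumption that overlay prefixes are IPv6 are all guesses based on the descriptions above rather than the exact maghemite definitions.

use serde::{Deserialize, Serialize};
use std::net::Ipv6Addr;

// An IPv6 address prefix (the addr/len object above).
#[derive(Serialize, Deserialize)]
struct Ipv6Prefix {
    addr: Ipv6Addr,
    len: u8,
}

// A tunnel origin (overlay_prefix/boundary_addr/vni/metric above).
#[derive(Serialize, Deserialize)]
struct TunnelOrigin {
    overlay_prefix: Ipv6Prefix,
    boundary_addr: Ipv6Addr,
    vni: u32,    // 24-bit Geneve VNI carried in a 32-bit unsigned integer
    metric: u64, // relative priority for overlay route installation
}

// Underlay portion of an update: prefixes announced and withdrawn.
#[derive(Serialize, Deserialize)]
struct UnderlayUpdate {
    announce: Vec<Ipv6Prefix>,
    withdraw: Vec<Ipv6Prefix>,
}

// Tunnel portion of an update: tunnel endpoints announced and withdrawn.
#[derive(Serialize, Deserialize)]
struct TunnelUpdate {
    announce: Vec<TunnelOrigin>,
    withdraw: Vec<TunnelOrigin>,
}

// The update message pushed between peers.
#[derive(Serialize, Deserialize)]
struct Update {
    underlay: Option<UnderlayUpdate>,
    tunnel: Option<TunnelUpdate>,
}

// One path vector returned from a pull (destination/path above).
#[derive(Serialize, Deserialize)]
struct PathVector {
    destination: Ipv6Prefix,
    path: Vec<String>, // hops along the path to the destination
}

// The pull response: path vectors and tunnel endpoints exported by a router.
#[derive(Serialize, Deserialize)]
struct PullResponse {
    underlay: Vec<PathVector>,
    tunnel: Vec<TunnelOrigin>,
}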
Push Request Processing
When a push request is received by a ddm router, for all router kinds the following happens.
Any withdraws are removed from the router database.
Any withdraws are removed from the underlying platform.
Any announcements are added to the router database.
Any announcements are added to the underlying routing platform.
If the router receiving the push request is a transit router, the following additional step happens.
The update received is redistributed to all active peers (except for the sending peer).
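A minimal sketch of these push-processing rules, considering only underlay prefixes for brevity. Db, Platform, push_to_peer and the other names here are hypothetical stand-ins, not the actual maghemite types.

use std::collections::HashSet;
use std::net::Ipv6Addr;

type Prefix = (Ipv6Addr, u8); // (addr, len)
type PeerId = u32;

#[derive(PartialEq)]
enum RouterKind { Server, Transit }

// Stand-in for the router database's imported route set.
#[derive(Default)]
struct Db { imported: HashSet<(PeerId, Prefix)> }

// Stand-in for the underlying forwarding platform (illumos kernel or dendrite).
trait Platform {
    fn add_route(&mut self, nexthop: PeerId, prefix: Prefix);
    fn remove_route(&mut self, prefix: Prefix);
}

struct UnderlayUpdate { announce: Vec<Prefix>, withdraw: Vec<Prefix> }

fn push_to_peer(_peer: PeerId, _update: &UnderlayUpdate) {
    // An HTTP push to the peer's exchange server would happen here.
}

fn handle_push(
    db: &mut Db,
    platform: &mut dyn Platform,
    kind: RouterKind,
    from: PeerId,
    peers: &[PeerId],
    update: &UnderlayUpdate,
) {
    // Withdraws come out of the router database and the underlying platform.
    for &prefix in &update.withdraw {
        db.imported.remove(&(from, prefix));
        platform.remove_route(prefix);
    }
    // Announcements go into the router database and the underlying platform.
    for &prefix in &update.announce {
        db.imported.insert((from, prefix));
        platform.add_route(from, prefix);
    }
    // Transit routers redistribute the update to every active peer except the
    // peer the update came from.
    if kind == RouterKind::Transit {
        for &peer in peers.iter().filter(|&&p| p != from) {
            push_to_peer(peer, update);
        }
    }
}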
Pull Request Processing
When a pull request is received by a ddm router, for all router kinds the following happens.
All prefixes originated by the receiving router are returned to the router sending the pull request.
If the router receiving the pull request is a transit router, the following additional step happens.
All prefixes imported by the receiving router are returned to the router sending the pull request.
Administration API
Ddm routers implement a local administration API. This is an HTTP(S) API that broader control plane software and administrative programs may use. The API has the following endpoints.
Verb | Route | Purpose |
---|---|---|
GET | | Get all the router’s peers. Returns a map of peer info objects (described below). |
GET | | Get all the router’s imported prefixes. Returns a set of prefixes. |
PUT | | Announce a set of prefixes. Takes a set of prefixes to announce. |
DELETE | | Withdraw a set of prefixes. Takes a set of prefixes to withdraw. |
PUT | | Send a pull request to all active peers and handle any responses by updating the internal route database and underlying platform. |
GET | | Get all the router’s imported tunnel endpoints. Returns a set of tunnel origin objects. |
PUT | | Announce a set of tunnel endpoints. Takes a set of tunnel origin objects to announce. |
DELETE | | Withdraw a set of tunnel endpoints. Takes a set of tunnel origin objects to withdraw. |
Key | Value Type | Description |
---|---|---|
status | | The status of the peer. Only active peers are included in prefix (re)distribution. This is an enum value defined below. |
addr | | The link-local address of the peer. |
host | | The hostname of the peer. This is not used in a functional capacity; it is simply used in logs for operator situational awareness and debugging sanity. |
kind | | An enum that indicates whether this is a transit or a server router. |
Value | Description |
---|---|
No Contact | There has been no contact with the peer. This happens when an interface has been configured and solicitations are being sent out, but at no point has a solicitation been responded to with an advertisement. |
Active | The peer has been solicited, responded with an advertisement, and has continued to respond to solicitations with advertisements within the expiration threshold. Each solicitation must be responded to within a configurable number of milliseconds. The expiration threshold is the number of times an advertisement is not received within the time allowed for a solicitation. |
Expired | The peer was active at one point in time, but has failed to respond to solicitations beyond the configured threshold. |
Value | Description |
---|---|
Server | This value indicates the router is a server router. The distinguishing characteristic of a server router is that it does not redistribute prefixes that are announced to it by other routers. It only announces the prefixes it originates. |
Transit | This value indicates the router is a transit router. The distinguishing characteristic of a transit router is that it redistributes prefixes it imports from other routers to all of its peers. Transit routers typically do not originate routes. |
Router Database
The router database has three top level elements.
A hash map of peers: HashMap<u32, PeerInfo>.
A set of routes imported from peers: HashSet<Route>.
A set of prefixes originated by the router: HashSet<Ipv6Prefix>.
This is an in-memory database that does not survive reboots. Ddm is designed to be used in conjunction with a broader control plane that manages persistent state.
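As a sketch, the whole database amounts to something like the following (the element types are placeholders based on the descriptions in this RFD, not the real maghemite definitions).

use std::collections::{HashMap, HashSet};
use std::net::Ipv6Addr;

// Placeholder element types; the real definitions live in maghemite.
struct PeerInfo {} // status, addr, host, kind as described in the admin API

#[derive(Hash, PartialEq, Eq)]
struct Ipv6Prefix { addr: Ipv6Addr, len: u8 }

#[derive(Hash, PartialEq, Eq)]
struct Route { prefix: Ipv6Prefix, nexthop: Ipv6Addr }

// The in-memory router database: peers, imported routes, originated prefixes.
#[derive(Default)]
struct Db {
    peers: HashMap<u32, PeerInfo>,
    imported: HashSet<Route>,
    originated: HashSet<Ipv6Prefix>,
}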
Underlying Platforms
Ddm currently supports two underlying packet forwarding platforms.
The illumos kernel.
Dendrite driven switches.
When a ddm router instance is started, the underlying platform is specified in the command line arguments. When routes are announced and withdrawn, they are added and removed from the underlying packet forwarding platform.
The illumos kernel implementation uses the operating system’s routing socket (AF_ROUTE). The Dendrite implementation uses the data plane daemon (dpd) API.
Data Plane Protocol
The ddm data plane implementation has four primary responsibilities.
Instrument layer-3 flows with egress timestamps carried in hop-by-hop extension headers.
Emit acknowledgements from destination nodes returning timestamps to their originators.
Measure and update destination delay for each acknowledgement packet.
Make nexthop routing decisions based on measured delays.
At a high level, what ddm is doing for instrumentation is building a stack of timestamps as packets flow from source to destination across a network.
When server nodes emit packets, they add a timestamp to an extension header on that packet. When an acknowledgement packet comes back with that timestamp in it, the sending node updates its local delay table with the following for the source address of the acknowledgement packet.
The instantaneous delay (delay).
The delay velocity (delay_v).
The delay acceleration (delay_a).
The last acknowledgement timestamp.
This information is used when determining where to send subsequent packets. For each outgoing packet the delay table entry for the destination address is consulted. When there are multiple paths, there are multiple matching table entries. For each entry returned a pick function is evaluated over the delay value, its derivatives and the time since the last acknowledgement.
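A sketch of that per-destination bookkeeping follows (hypothetical structures; the real state lives in the kernel ire entries or switch registers described later, and the derivative calculation shown is just one plausible formulation). Timestamps are in microseconds to match the 24-bit counters used on the wire.

use std::collections::HashMap;
use std::net::Ipv6Addr;

// Delay state, keyed by (destination, nexthop) so each candidate path for a
// destination has its own entry.
#[derive(Default, Clone, Copy)]
struct DelayEntry {
    delay: f64,    // instantaneous delay (µs)
    delay_v: f64,  // delay velocity (first derivative)
    delay_a: f64,  // delay acceleration (second derivative)
    last_ack: u64, // local time of the last acknowledgement (µs)
}

#[derive(Default)]
struct DelayTable {
    entries: HashMap<(Ipv6Addr, Ipv6Addr), DelayEntry>,
}

impl DelayTable {
    // Update the entry for `dst` via `nexthop` from an acknowledgement that
    // returned our timestamp `sent_at`, received at local time `now`.
    fn on_ack(&mut self, dst: Ipv6Addr, nexthop: Ipv6Addr, sent_at: u64, now: u64) {
        let e = self.entries.entry((dst, nexthop)).or_default();
        let delay = (now - sent_at) as f64;
        let dt = (now - e.last_ack).max(1) as f64;
        let delay_v = (delay - e.delay) / dt;
        e.delay_a = (delay_v - e.delay_v) / dt;
        e.delay_v = delay_v;
        e.delay = delay;
        e.last_ack = now;
    }
}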
The pick function p(…) is the defining logical component of ddm. It determines nexthop routing decisions based on delay information. Like transport protocols such as TCP, there may be multiple variants of the pick function. The pick function may use past, present and projected delay information. There may be different pick functions running in different parts of a ddm network. The following subsections describe pick functions that are either currently implemented or are under development.
Pick Functions
The parameters available to a pick function for each candidate nexthop are the following.
Parameter | Description |
---|---|
| instantaneous delay |
| local time of the last acknowledgement |
Pick functions may leverage local storage to maintain time varying quantities such as derivatives, averages, etc. The diagrams above show a pick function that considers delay velocity and acceleration.
Basic
The most basic pick function iterates over nexthop routes and selects the one with the lowest delay. This function can lead to pathological behaviors where mass simultaneous selection causes congestion on a particular path while unloading another, with the opposite decision being made in the next time frame. This can lead to oscillatory behavior between paths, which can cause severe performance degradation.
Probabilistic
This pick function does the following.
Calculates a sum of all nexthop delays.
Assigns the probability 1 - (delay/sum) to each nexthop.
Selects a nexthop using a random variable according to the nexthop probabilities.
What the distribution of the random variable should be is an interesting question. Implementations should make this a configurable parameter.
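A sketch of this selection, using a uniform random variable for concreteness (the distribution is exactly the configurable parameter mentioned above; rand01 is assumed to be supplied by the caller in [0, 1)).

// delays[i] is the measured delay for candidate nexthop i. Returns the index
// of the chosen nexthop.
fn pick_probabilistic(delays: &[f64], rand01: f64) -> usize {
    let sum: f64 = delays.iter().sum();
    if delays.len() < 2 || sum <= 0.0 {
        return 0; // zero or one useful candidate; nothing to weigh
    }
    // Weight each nexthop by 1 - (delay / sum). The weights sum to n - 1 for
    // n candidates, so normalize them into a probability distribution before
    // sampling.
    let weights: Vec<f64> = delays.iter().map(|d| 1.0 - d / sum).collect();
    let total: f64 = weights.iter().sum();
    let mut x = rand01 * total;
    for (i, w) in weights.iter().enumerate() {
        if x < *w {
            return i;
        }
        x -= w;
    }
    weights.len() - 1 // guard against floating point edge effects
}

The predictive scheme described next could reuse the same sampling step, evaluating the weights over projected rather than current delays.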
Predictive
This scheme combines the current delay with first and second derivatives to predict delay and then make either a basic or probabilistic selection based on the prediction rather than the current value.
IPv6 Extension Header
The ddm protocol embeds hop-by-hop timestamp information in IPv6 extension headers. The ddm extension header has a fixed 8-byte portion that is always present, followed by a variable sized list of elements. There may be between 0 and 15 elements in a single ddm extension header. DDM over greater than 15 hops is not currently supported. If the need arises the 15 element limit per ddm extension header will not change; rather, extension headers must be chained. This is to keep in line with the recommendations of RFC 6564 for IPv6 extension headers.
          0               0               1               2               3
          0               8               6               4               2
          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    0x00  |  Next Header  | Header Length |    Version    |A|  Reserved   |
          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    0x04  |     0.Id      |                  0.Timestamp                  |
          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    0x08  |     1.Id      |                  1.Timestamp                  |
          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
          |      ...      |                      ...                      |
          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
          |      ...      |                      ...                      |
          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 (N+1)<<2 |     N.Id      |                  N.Timestamp                  :
          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Fixed header fields have the following semantics:
Field | Description |
---|---|
Next Header | IANA IP protocol number of the next header. |
Header Length | Length of the ddm header and all elements in 8-octet units, not including the leading 8 octets. This follows convention established in RFC 6564. |
Version | Version of the ddm protocol. |
A | Acknowledgement bit. A value of 1 indicates this is an acknowledgement, 0 otherwise. |
Reserved | Reserved for future use. |
Element fields have the following semantics
Field | Description |
---|---|
Id | Identifier for the node that produced this element. |
Timestamp | Time this element was produced. This is an opaque 24-bit value that is only meaningful to the producer of the timestamp. Current reference implementations keep a microsecond counter from an arbitrary time in the past. |
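A sketch of parsing this header in Rust (illustrative only; the bit position of the A flag and the treatment of partially filled trailing 8-octet units are assumptions, since the format carries no explicit element count).

// One element: an 8-bit node id and a 24-bit opaque timestamp.
struct DdmElement {
    id: u8,
    timestamp: u32, // only the low 24 bits are significant
}

struct DdmHeader {
    next_header: u8,
    version: u8,
    ack: bool,
    elements: Vec<DdmElement>,
}

fn parse_ddm(buf: &[u8]) -> Option<DdmHeader> {
    if buf.len() < 8 {
        return None;
    }
    // Header Length counts 8-octet units beyond the leading 8 octets, so the
    // total size on the wire is (hdr_len + 1) * 8 bytes.
    let total = (buf[1] as usize + 1) * 8;
    if buf.len() < total {
        return None;
    }
    let next_header = buf[0];
    let version = buf[2];
    let ack = buf[3] & 0x80 != 0; // A bit assumed to lead the fourth byte
    // Per the diagram, element i sits at offset (i + 1) * 4. The format has
    // no explicit element count, so this sketch decodes every 4-byte slot; a
    // real implementation needs a convention for marking unused slots.
    let mut elements = Vec::new();
    let mut off = 4;
    while off + 4 <= total {
        elements.push(DdmElement {
            id: buf[off],
            timestamp: ((buf[off + 1] as u32) << 16)
                | ((buf[off + 2] as u32) << 8)
                | buf[off + 3] as u32,
        });
        off += 4;
    }
    Some(DdmHeader { next_header, version, ack, elements })
}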
Ordering Considerations
As was mentioned in the introduction, packet ordering can have a significant impact on performance. Ddm does not optimize routing decisions for ordering as a first order objective. However, maintaining ordering when it will not lead to congestion is a secondary goal.
To begin discussing ordering we first need to characterize the conditions under which packets may be delivered out of order. Then we’ll discuss how ddm nexthop selection logic can lead to those conditions and at what frequency.
The diagram below shows timing for two packets sent to the same destination. At time t_switch the nexthop for that destination is changed.
Packet reordering at the destination happens when the following is true. Note that this model only considers in-network reordering, where we define in-network to be between the phy of the sending NIC and the phy of the receiving NIC. Additional consideration needs to be made for processing times and delays that happen between the NIC and the OS netstack, as well as between the netstack and applications. The [Swift] paper has some good insight on these different timings and how they relate to congestion. For now this model serves as an analytical tool for reasoning about in-network reordering.
(1) t_tx2 + 𝚫flight2 < t_tx1 + 𝚫flight1
Let’s define K to be the change in packet flight time by making a nexthop routing change (K = 𝚫flight1 - 𝚫flight2). Then we can express 𝚫flight2 in terms of 𝚫flight1.

(2) 𝚫flight2 = 𝚫flight1 - K
We can also use 𝚫tx to express t_tx2 in terms of t_tx1.

(3) t_tx2 = t_tx1 + 𝚫tx
Substituting (2) and (3) into (1) and simplifying we land at the following as a reordering condition.
(4) 𝚫tx < K
This establishes a relationship between inter-packet timing and routing decisions that decrease latency. For any change that comes with a decrease in latency larger than inter-packet latency (at the time of the change), out-of-order delivery will occur.
To think about this more concretely, let’s consider what inter-packet timing looks like on a busy server. Assuming 200 byte packets we have the following.
Capacity | Time Between Packets |
---|---|
100 gbps | 16 nanoseconds |
10 gbps | 160 nanoseconds |
1 gbps | 1.60 microseconds |
At face value this is a bit daunting in the context of equation (4). In order to avoid out-of-order delivery even on a busy 10 gbps network we would need to be managing delay at nanosecond scales, which clearly is not possible given overall flight times will be at least in the 10s to 100s of microseconds. However, the goal of ddm is not to eliminate out-of-order delivery; it’s to establish an upper bound such that the overhead of reordering at the receiver is significantly outweighed by the gains of having an optimally balanced network that avoids congestion to a level approaching the aggregate physical capacity of the network.
A central concern of ddm is then to reduce reordering overhead. Reducing overhead actually falls out as a side effect of reacting to delay as quickly as possible. Sending packets down the path of least resistance increases load (delay) on the lower latency path and reduces load on the higher latency path thereby driving path latencies together for a balanced network. The analysis above shows that the size of an out-of-order delivery window is a function of the latency variation magnitude between paths. Because ddm optimizes to reduce latency variation magnitude between paths, it’s also optimizing to reduce reordering overhead.
Receive Side Reordering
Because a certain level of out-of-order delivery comes with the ddm territory, ddm implementations should reorder things at the receiver and deliver in-order streams to up-stack protocol handlers such as TCP.
Let’s now consider what is involved with receive side reordering (rsr).
Given a packet with sequence number s_a arriving at time t_0 and a packet with sequence number s_n arriving at t_1, if s_n != s_a+1 then s_n must be buffered until s_n-1 arrives. The difference in delay between paths bounds the number of packets that need to be buffered for any given event. Let’s assume that delay across paths can be kept within 10 ms and take a look at some numbers for different link capacities.
Capacity | Buffer Size Requirement |
---|---|
100 gbps | 125 MB |
10 gbps | 12.5 MB |
1 gbps | 1.25 MB |
(e.g. for 100 gbps: 100e9 bits per second ÷ 8 bits per byte ÷ 1e2 to scale from one second down to 10 ms = 125 MB).
Another consideration is packet loss. We cannot go off the rails if a packet fails to arrive, so something in the spirit of a [CoDel] scheme with a maximum sojourn time for buffering must be employed.
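As a sketch of what that could look like (hypothetical types; a real implementation would live in the receiving netstack), a per-flow reorder buffer with a maximum sojourn time might be structured as follows.

use std::collections::BTreeMap;
use std::time::{Duration, Instant};

// Buffers out-of-order packets and releases them either once they are in
// order or once the oldest gap has waited longer than max_sojourn.
struct ReorderBuffer {
    next_seq: u64,
    max_sojourn: Duration,
    held: BTreeMap<u64, (Instant, Vec<u8>)>,
}

impl ReorderBuffer {
    fn new(first_seq: u64, max_sojourn: Duration) -> Self {
        Self { next_seq: first_seq, max_sojourn, held: BTreeMap::new() }
    }

    // Accept a packet; return everything now deliverable up-stack, in order.
    fn push(&mut self, seq: u64, packet: Vec<u8>, now: Instant) -> Vec<Vec<u8>> {
        let mut out = Vec::new();
        if seq == self.next_seq {
            out.push(packet);
            self.next_seq += 1;
            self.drain_in_order(&mut out);
        } else if seq > self.next_seq {
            self.held.insert(seq, (now, packet));
        } // seq < next_seq: duplicate or already-skipped packet, drop it
        // Give up on gaps older than the sojourn limit rather than waiting
        // forever for a packet that may have been lost.
        loop {
            let expired = match self.held.iter().next() {
                Some((&oldest, &(arrived, _)))
                    if now.duration_since(arrived) >= self.max_sojourn => oldest,
                _ => break,
            };
            let (_, packet) = self.held.remove(&expired).unwrap();
            self.next_seq = expired + 1;
            out.push(packet);
            self.drain_in_order(&mut out);
        }
        out
    }

    // Release consecutively buffered packets starting at next_seq.
    fn drain_in_order(&mut self, out: &mut Vec<Vec<u8>>) {
        while let Some((_, p)) = self.held.remove(&self.next_seq) {
            out.push(p);
            self.next_seq += 1;
        }
    }
}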
illumos Considerations
The illumos implementation of ddm is a data plane implementation for server type routers. There are no current plans to implement a data plane for a transit type router on illumos.
As a server data plane implementation, the ddm kernel plumbing is responsible for the following.
Adding ddm extension headers to packets on egress.
Removing ddm extension headers from packets on ingress.
Maintaining delay tables based on ddm-ack packets.
Making nexthop routing decisions based on delay tables.
Parsing ddm packets correctly in snoop.
Receive side reordering.
Ddm is purely an IPv6 protocol, so for the most part only that portion of the network stack needs to be considered. The routing table in illumos is made up of internet routing entries (ire). These entries have been modified to include a delay value.
When a packet arrives at the IPv6 input handler function, it’s inspected to see if it’s a ddm-ack packet. If it is, the following happens.
The ire entry corresponding to the source address is looked up.
The timestamp value from the packet is extracted and subtracted from the current time to form a delay value.
The delay value in the ire is updated with the delay value calculated from the ddm-ack packet.
If the packet is not a ddm-ack, but is an IPv6 packet with a stack of ddm elements in the ddm IPv6 extension header, then an acknowledgement is immediately sent. The acknowledgement is an IPv6 packet with no payload containing the complete stack of ddm elements received.
When an IPv6 packet is sent, if it is not a link-local packet or a packet destined to the same host it’s being sent from, the ddm IPv6 extension header is added. The timestamp is computed by grabbing a local timestamp and finagling it into a 24-bit microsecond format. The gethrtime(9F) function is used for this to get a high-resolution timestamp from some arbitrary time in the past.
When there are multiple routes to a particular destination, the kernel groups these routes into ire buckets (irb). An irb contains a linked list of ires and a bit of metadata about the bucket. To support pseudorandom selection of routes within a bucket, a 64-bit random variable is added to the irb structure. Every time a selection within the bucket is made, the random variable member is moved to the next value according to the distribution function used by the pick function (see the pick functions section above). For pick functions that use randomness this variable is then used in selecting a nexthop route from the bucket.
Whether or not to use ddm is defined as an ip interface property. An administrator or control plane program can set the ddm property of an interface to on in order for that interface to implement the ddm packet mechanics and for routes associated with the ip lower layer (ill) of the interface to consider delay properties when selecting a nexthop route. When making routing decisions, if ddm is not enabled on any of the interfaces associated with the ill of the route candidates, the kernel proceeds with ECMP logic. This allows ddm to coexist with established routing mechanisms in the kernel.
As described above, there are several pick functions to choose from. The illumos implementation should support choosing among these. Initially this may be a compile time switch, however, making this tunable at runtime is desirable.
It’s worth noting that all of this adds a non-trivial amount of cost in the network processing pipeline. This cost needs to be measured and evaluated relative to the network speed gains realized. The OS should be taking as little compute as possible in the kernel, leaving as much as it can for applications. And that is a perfect segue to our next section!
Offloading
Today ddm is implemented in the kernel netstack. However, we should consider what offloading ddm onto the network interface card (NIC) application specific integrated circuit (ASIC) would look like.
P4 Considerations
The p4 implementation of ddm is a data plane implementation for transit type routers. There are no current plans to implement a data plane for a server type router in p4.
As a transit data plane implementation, the ddm p4 plumbing is responsible for the following.
Adding timestamp elements to ddm packets as they traverse the router.
Removing timestamp elements from ddm-ack packets as they traverse the router.
Maintaining delay tables based on ddm-ack packets.
Making nexthop routing decisions based on delay tables.
However, single rack deployments are a special case. In a single rack deployment there is no multipath decision to be made at either of the rack routers. Each rack router is singly-connected to each of the 32 sleds in a rack and that is it. So given a packet bound to a particular sled, there is only one option, the single port that is connected to that sled. Therefore, for the single rack case, all the ddm data plane code needs to do is forward ddm extension headers along.
Coming back to the multi-rack case, this implies a change in routing table structure. Instead of a single-path match / action routing table that looks like this.
dst -> (port, nexthop)
We need one that looks like this.
dst -> [(port, nexthop)]
This is logically what the illumos route bucket described in the previous section gives us. However in P4, there is no concept of a pointer or reference to create linked lists from and there is no concept of an array of objects (header stacks do not apply here). What is commonly done in P4 code in this type of situation is using multiple tables, sometimes referred to as nested tables, although they are not actually structurally nested.
Consider the following multi-rack setup.
The question is how do we create a table structure that allows us to build up multi-path routing tables for this topology. We start with a top level table. In this table and all the tables that follow, the match key is a destination address. In the top level table the matched value is a bitmap indicating what other tables to look in.
Some entries in this table might look like the following.
1:a -> 0b1000,
2:c -> 0b0110,
The first entry states that address 1:a can be found in the first table. The first table is reserved for directly attached hosts. It has the same basic structure as the table shown above for the single rack case.

The second entry states that the address 2:c is found in the second and third tables. The second and third tables have additional structure associated with them. They look like this.
dst -> (port, nexthop, delay)
The port is the port connected to a nexthop switch in another rack, or perhaps an intermediary switch when we start making dedicated networking racks. The delay value here is a special kind of P4 value called a direct register. In P4, registers are stateful values that can be written by the data plane. A direct register is attached to a table key. This allows the data plane to update table state dynamically. When the P4 code detects a ddm-ack packet, the direct register is accessed by the source address as a table key, and is then updated according to the top of stack delay element on the packet transiting the router. That element is popped off the stack as the packet transits to its next hop.
Note that in the bitmap returned from the top level table there were two tables identified. This is where the multipath choice comes in. We grab entries from both tables and then use the delay from each entry to decide which path to take.
An interesting question here is: if there are no arrays or lists, how do we access/enumerate these second level tables? For now, we only have 32 front facing ports which means we can only have 32 such tables, so just define them all and have static branching based on the bitmap. Gross? A bit. The same applies for updating delays. We have to look at what port a ddm-ack packet came from and perform a mapping from that port to the appropriate table.
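To make the lookup scheme concrete, here is a rough model of the two-level structure in ordinary Rust rather than P4 (hypothetical types; the switch has to express the same logic as statically defined per-port tables with direct registers, as described above). The bitmap follows the MSB-first convention of the example entries above, and table 0 stands in for the directly attached table.

use std::collections::HashMap;
use std::net::Ipv6Addr;

struct SecondLevelEntry {
    port: u16,
    nexthop: Ipv6Addr,
    delay: u32, // updated from ddm-ack packets, like a direct register
}

struct Tables {
    // Top level: destination -> bitmap of second-level tables to consult.
    top: HashMap<Ipv6Addr, u64>,
    // Second level: one table for directly attached hosts plus one per
    // front-facing port, each keyed by destination.
    second: Vec<HashMap<Ipv6Addr, SecondLevelEntry>>,
}

impl Tables {
    // Gather the candidate entries for a destination and pick the one with
    // the lowest delay (the basic pick function).
    fn lookup(&self, dst: Ipv6Addr) -> Option<&SecondLevelEntry> {
        let bitmap = *self.top.get(&dst)?;
        let n = self.second.len();
        (0..n)
            .filter(|i| bitmap & (1u64 << (n - 1 - i)) != 0)
            .filter_map(|i| self.second[i].get(&dst))
            .min_by_key(|e| e.delay)
    }
}

Swapping min_by_key for one of the other pick functions would not change the table structure, only the final selection step.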
And what about link bonding? Not sure. Just treat them as independent?
Even though there are numerous tables, they’re all small. There are 33 tables per switch (one local and 32 for the front facing ports), each with up to 32 entries, so we’re looking at a total of up to 1056 entries in secondary tables for a fully connected sidecar. The top level table can contain up to 1056 entries itself, one for each secondary table entry across all secondary tables. This brings us up to 2112 entries, which is still quite small relative to what router ASICs are expected to handle.
Determinations
Traditional routing techniques will not allow us to realize the potential the physical networks of the Oxide platform have to offer. Out-of-band traffic engineering has no hope of keeping up with microcongestion and other fast moving traffic dynamics typical of data center network traffic flows. Centralized link state protocols also have no hope of keeping up. We are building a distance-vector control plane with in-band network instrumentation that builds a distributed delay matrix in real-time as a means to load balance and provide fault tolerance inside flow round trip times.
Open Questions
Do we need explicit support for anycast?
How well is the strategy of chained small tables going to work on Tofino?
Offloading the server data plane to the NIC?
Route aggregation and summarization.
Impact of doubling packets with ddm-ack.
Security Considerations
Is TLS client authentication a good model for this?
Where is key data going to come from (independent of whether or not we use TLS)?
Do we need to build authentication mechanisms into the UDP/IPv6 multicast based discovery communications?
External References
[OSPF] OSPF Version 2
[RIFT] RIFT: Routing in Fat Trees
[DRILL] DRILL: Micro Load Balancing for Low-latency Data Center Networks
[Swift] Swift: Delay is Simple and Effective for Congestion Control in the Datacenter
[CoDel] Controlling Queue Delay
[FbClos] Facebook’s multi-plane folded clos
[RFD63] RFD 63: Network Architecture
[Dijkstra] Dijkstra’s Algorithm
[Tiny] TinyFlow
[RFD404] RFD 404: Boundary Tunnel Routing