RFD 404
Boundary Tunnel Routing

This brief RFD defines the boundary tunnel routing problem and proposes a solution.

Note
References to ddm (see [rfd347]) in this RFD refer to the ddm control plane protocol and not the data plane protocol.

Problem

When an external route is added to an Oxide switch, that switch becomes a gateway for the route’s destination prefix on an upstream network. The physical network that interconnects servers and switches in the Oxide platform is called the underlay network. The underlay network is decoupled from upstream networks through Geneve-based virtual network encapsulation, as shown in the diagram below. Everything above the switches is upstream; everything below is underlay. In this case the upstream networks are all IPv4, while the underlay network is exclusively IPv6. The prefixes above each switch represent destinations on the upstream network that the switch has routes for.

         Oxide Rack 0            . . .           Oxide Rack N
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓           ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ 0.0.0.0/0                  ┃           ┃┌───────────┐ ┌───────────┐┃
┃ 10.0.0.0/16    0.0.0.0/0   ┃           ┃│  Switch0  │ │  Switch1  │┃
┃ 172.30.0.0/24  10.0.0.0/24 ┃           ┃└───────────┘ └───────────┘┃
┃ ┌───────────┐ ┌───────────┐┃           ┃      │   ▲                ┃
┃ │  Switch0  │ │  Switch1  │┃           ┃      │ ┌─┴─────────────┐  ┃
┃ └───────────┘ └───────────┘┃           ┃      │ │outer          │  ┃
┃   fd00:10::1    fd00:11::1 ┃           ┃      │ │---------------│  ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛           ┃      │ │src: fd00:00::1│  ┃
                                         ┃      │ │dst: fd00:10::1│  ┃
                                         ┃      │ ├───────────────┤  ┃
                                         ┃      │ │inner          │  ┃
                                         ┃      │ │---------------│  ┃
                                         ┃      │ │src: 10.0.4.7  │  ┃
                                         ┃      │ │dst: 1.2.3.4   │  ┃
                                         ┃      │ └─▲─────────────┘  ┃
                                         ┃      │   │                ┃
                                         ┃┌───────────┐              ┃
                                         ┃│   Sled    │              ┃
                                         ┃└───────────┘              ┃
                                         ┃  fd00:00::1               ┃
                                         ┃                           ┃
                                         ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

When an instance or service sends a packet destined to an upstream network, the packet is encapsulated in an IPv6 packet on the underlay network. We refer to these encapsulated packets as belonging to an overlay network. As a further distinction, packets going to or from upstream networks on an overlay are described as tunneled. There are a number of problems to solve in order for this to work.

  1. How does the encapsulator (OPTE) know what underlay address to use in order to get packets to a switch that is a viable upstream network gateway for the destination of the tunneled packet?

  2. If there are multiple viable gateway switches, which one should be used?

  3. If underlay paths to a gateway come and go, or if upstream routes move around, how does that information propagate down to OPTE?

Solution

There is a simple solution that solves all three of these problems: use ddm to distribute tunneled routes. This is depicted in the diagram below. When the route 0.0.0.0/0 is added to Switch0 in Rack0, that switch sends out an advertisement to its ddm peers that the upstream destination 0.0.0.0/0 is available through the Rack0/Switch0 underlay address fd00:10::1. The diagram goes on to show the propagation of this advertisement to Rack1/Switch0 (we are assuming a multi-rack deployment, but this is just as valid in the single-rack case). The ddm instance on Rack1/Switch0 then redistributes the tunneled route to the servers attached to it.

When ddmd running on a server gets a tunneled advertisement, it adds a corresponding entry to OPTE in the virt-to-boundary (V2B) table. This table maps upstream network destinations to underlay addresses that are viable rack-egress points for those destinations. More on this table to follow. Tunneled routes will be distinct from regular routes in ddm: they’ll have a different message type and will be handled differently at server routers, but the distribution model and transit mechanics remain the same as for regular routes.

           Oxide Rack 0           . . .           Oxide Rack N
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓      ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃    0.0.0.0/0                  ┃      ┃     ┌───────────┐ ┌───────────┐┃
┃    10.0.0.0/16    0.0.0.0/0   ┃      ┃     │  Switch0  │ │  Switch1  │┃
┃    172.30.0.0/24  10.0.0.0/24 ┃      ┃ ┌───┴┬──────────┘ └───────────┘┃
┃    ┌───────────┐ ┌───────────┐┃  ┌────▶│ddmd│    │                    ┃
┃    │  Switch0  │ │  Switch1  │┃  │   ┃ └──┬─┘    │                    ┃
┃┌───┴┬──────────┘ └───────────┘┃  │   ┃    │      │                    ┃
┃│ddmd│fd00:10::1    fd00:11::1 ┃  │   ┃    │      │                    ┃
┃└──┬─┘                         ┃  │   ┃    │┌───────────┐              ┃
┗━━━│━━━━━━━━━━━━━━━━━━━━━━━━━━━┛  │   ┃    ││   Sled    │              ┃
    │         ┌───────────────┐    │   ┃ ┌──▼┴┬──────────┘              ┃
    │         │dst: 0.0.0.0/0 │    │   ┃ │ddmd│fd00:00::1               ┃
    └────────▶│gw:  fd00:10::1│────┘   ┃ └──┬─┘                         ┃
              └───────────────┘        ┃    │                           ┃
                                       ┃ ┌──▼─┐                         ┃
                                       ┃ │opte│                         ┃
                                       ┃ └──┬─┘                         ┃
                                       ┃    │                           ┃
                                       ┃ ┌──▼───────────────────┐       ┃
                                       ┃ │         V2B          │       ┃
                                       ┃ ├──────────────────────┤       ┃
                                       ┃ │0.0.0.0/0 → fd00:10::1│       ┃
                                       ┃ │...                   │       ┃
                                       ┃ │...                   │       ┃
                                       ┃ └──────────────────────┘       ┃
                                       ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

The ddm daemon already has the concept of multiple underlying platforms to install routes into: illumos and Dendrite. This work adds OPTE as a third underlying platform. The problems enumerated in the previous section are solved as follows:

  1. Tunneled route distribution through ddm.

  2. Each router acting as a tunnel endpoint allocates a randomized IPv6 unique local address (ULA). That ULA is then used as the advertised gateway. When a destination goes from being reachable through one gateway to being reachable through multiple gateways, distinct ULA advertisements are made from each gateway providing the tunneled prefix. Flow affinity is maintained by using ECMP hashing on the inner L3/L4 header fields (see the sketch after this list).

  3. This is intrinsically solved by how the ddm routing protocol works. Peer expiration and route movement through withdraw and re-advertise are already part of the core protocol.
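
As a rough illustration of the flow-affinity point in item 2, the sketch below hashes a packet’s inner L3/L4 fields to pick one of several advertised boundary ULAs, so every packet of a given flow egresses through the same gateway. The function shape and names here are assumptions for illustration, not OPTE’s actual data path.

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};
    use std::net::{IpAddr, Ipv6Addr};

    // Pick a boundary tunnel endpoint for a flow by hashing the inner
    // L3/L4 header fields, so all packets of a flow take the same path.
    // Illustrative sketch only.
    fn select_gateway(
        inner_src: IpAddr,
        inner_dst: IpAddr,
        src_port: u16,
        dst_port: u16,
        gateways: &[Ipv6Addr],
    ) -> Option<Ipv6Addr> {
        if gateways.is_empty() {
            return None;
        }
        let mut h = DefaultHasher::new();
        (inner_src, inner_dst, src_port, dst_port).hash(&mut h);
        Some(gateways[h.finish() as usize % gateways.len()])
    }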

The V2B table is an addition to OPTE. OPTE currently has a V2P table and a single boundary services address. This work keeps that distinction in place, as there is core logic built into OPTE around the differences between on-VPC traffic and external traffic; we just expand the single boundary services address into a V2P-like table.

This approach has the benefit of distributing actual upstream network routing information to OPTE boundary routing tables. This captures subtleties such as being able to do longest prefix matching across overlapping table entries. It will make tunneled networks behave much like "normally routed networks" because we’re actually implementing routing through a pretty standard path-vector routing protocol; it just happens to be tunneled. We also inherit some nice path-vector routing protocol capabilities like route aggregation, bulk advertise/withdraw, loop detection/mitigation, full path visibility, graceful restart, etc.
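
To make the longest-prefix-match point concrete, here is a minimal sketch of a V2B-style lookup over IPv4 overlay destinations: each entry maps an upstream prefix to the underlay address of a boundary switch, and the most specific matching entry wins. The entry shape and names are illustrative assumptions; OPTE’s actual table is richer than this.

    use std::net::{Ipv4Addr, Ipv6Addr};

    // One V2B-style entry: an upstream (overlay) prefix mapped to the
    // underlay address of a boundary switch that can egress it.
    struct V2bEntry {
        prefix: Ipv4Addr,
        prefix_len: u8,
        boundary_addr: Ipv6Addr,
    }

    // Longest-prefix match across possibly overlapping entries.
    fn lookup(table: &[V2bEntry], dst: Ipv4Addr) -> Option<&V2bEntry> {
        table
            .iter()
            .filter(|e| {
                let mask: u32 = if e.prefix_len == 0 {
                    0
                } else {
                    u32::MAX << (32 - e.prefix_len)
                };
                u32::from(dst) & mask == u32::from(e.prefix) & mask
            })
            .max_by_key(|e| e.prefix_len)
    }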

IPv6 Unique Local Addressing

When the Maghemite external network router (mgd) starts up on each rack switch, it checks whether it has any routes to upstream external networks. If it does, the switch it’s on must act as a boundary tunnel endpoint between the internal rack underlay network and external networks. In this case, the router generates a random ULA to use as a tunnel endpoint address. This generated ULA is stored on the file system and is reused across subsequent restarts. This is needed because the lifetime of the ULA advertisement across the network is tied to the lifetime of ddmd, not mgd. The mgd router uses the API of the ddm router running in the same switch zone to advertise the generated tunnel endpoint address on the underlay network, so if mgd restarts but ddmd does not, the underlay advertisement is still present. On a complete switch zone restart, or a sled restart, both mgd and ddmd go down and the ULA advertisement is withdrawn from the broader network, so we may as well generate a new one.
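
For reference, a boundary ULA of the kind described above could be generated and persisted roughly as follows: an fd00::/8 address with a random 40-bit global ID, reloaded from a file if one was previously written. This is a sketch under stated assumptions; the persistence path, helper name, and use of the rand crate are illustrative and not mgd’s actual implementation.

    use std::net::Ipv6Addr;
    use std::path::Path;

    // Generate (or reload) a boundary tunnel endpoint ULA. RFC 4193 ULAs
    // live in fd00::/8 with a random 40-bit global ID. Uses the `rand`
    // crate; names and persistence format are illustrative assumptions.
    fn boundary_ula(persist: &Path) -> std::io::Result<Ipv6Addr> {
        if let Ok(s) = std::fs::read_to_string(persist) {
            if let Ok(addr) = s.trim().parse::<Ipv6Addr>() {
                return Ok(addr); // reuse the ULA across mgd restarts
            }
        }
        let mut octets = [0u8; 16];
        octets[0] = 0xfd; // ULA prefix
        for b in &mut octets[1..6] {
            *b = rand::random::<u8>(); // 40-bit random global ID
        }
        octets[15] = 1; // arbitrary interface ID for the tunnel endpoint
        let addr = Ipv6Addr::from(octets);
        std::fs::write(persist, addr.to_string())?;
        Ok(addr)
    }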

A robustness gap here is if mgd fails in a way that does not withdraw the ULA or tunnel advertisements from the network while ddmd is still up. That would result in doomed packets being sent to a switch that is not viable for external egress. Perhaps ddmd should monitor the health of mgd and withdraw any tunnel advertisements and/or the boundary ULA if mgd is down or otherwise unhealthy.

Implementation in Mgd

mgd owns the external routing information base (RIB) for a particular switch. It’s where dynamic protocol implementations such as BGP and BFD live. It also provides an API for static route management. Because of this, it has visibility over the entire RIB and can manage the flow of tunnel advertisements onto the underlay network for the switch it is running on. It accomplishes this by advertising and withdrawing tunneled routes via the API of the local ddmd daemon, based on changes in the external network RIB. In fact, mgd is the only place where all the information needed to automatically manage boundary tunnel routing is present.
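
As a sketch of the reconciliation mgd performs against the RIB, the function below computes what to announce and what to withdraw through the local ddmd as set differences between the tunnel origins that should be advertised and those currently advertised. The generic signature is an assumption for illustration; the real mgd/ddmd API interaction is not shown.

    use std::collections::HashSet;
    use std::hash::Hash;

    // Given the tunnel origins the external RIB says should be advertised
    // and the set currently advertised to ddmd, compute the announce and
    // withdraw sets. Illustrative sketch only.
    fn reconcile<T: Clone + Eq + Hash>(
        desired: &HashSet<T>,
        advertised: &HashSet<T>,
    ) -> (Vec<T>, Vec<T>) {
        let announce: Vec<T> = desired.difference(advertised).cloned().collect();
        let withdraw: Vec<T> = advertised.difference(desired).cloned().collect();
        (announce, withdraw)
    }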

Rack Setup Considerations

This solution is also a good fit for initial route distribution during rack setup. The ddm daemons are among the initial systems that come up on each sled. The bootstrap agent gives them a prefix to advertise, and the bootstrap network is thereby constructed. By the time we get to RSS, the network of ddm daemons is ready and waiting to distribute the initial set of tunnel routes that will provide external connectivity to critical early setup services like NTP.

Tunnel Origin Metrics

Tunnel routing provides a mechanism for server routers to find an egress point for packets destined to upstream networks. However, some egress points may be better than others. Consider two rack switches running the [BFD] protocol. This protocol detects faults between nexthop routers. If a prefix 198.51.100.0/24 is available from switch0 and switch1, but only switch1 has BFD sessions that are up, then server routers should prefer sending packets destined to 198.51.100.0/24 to switch1.

To make this possible, a metric field is present in tunnel announcements. This allows originating routers to assign a quantitative measure to the announcements they make. When server routers receive multiple announcements for the same overlay prefix, they decide which advertisements to install in the OPTE virt-to-boundary tables based on the metrics in the announcements. Metrics deliberately abstract away BFD: we’ll find ourselves in a similar situation with rack switches running BGP, which has its own distinct metrics for upstream prefixes, and we do not want the details of every possible routing protocol or fault detector leaking into the ddm protocol.

The initial implementation will simply set the metric to 100 for routes that are considered fully functional, and 0 for routes that are degraded in some way, e.g., a BFD session for the route’s nexthop is down. As time moves on we’ll likely develop more sophisticated approaches that provide more nuanced insight into tunnel endpoint quality, for example accounting for how many nexthops the tunnel endpoint has for a given route, incorporating BGP distance metrics, etc.
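
To illustrate the selection step, the sketch below keeps only the best-metric announcements for a single overlay prefix; those are the boundary addresses that would be installed into the V2B table. Representing an announcement as an (address, metric) pair is an assumption for illustration.

    use std::net::Ipv6Addr;

    // Of the announcements received for one overlay prefix, keep only
    // those with the highest metric. Illustrative sketch only.
    fn best_origins(origins: &[(Ipv6Addr, u64)]) -> Vec<Ipv6Addr> {
        let best = origins.iter().map(|(_, m)| *m).max();
        origins
            .iter()
            .filter(|(_, m)| Some(*m) == best)
            .map(|(addr, _)| *addr)
            .collect()
    }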

Protocol Details

The following tables correspond to those in [rfd347] §4.2 Exchange. A tunnel update has the following format.

Tunnel Update Object

    Key       Value Type             Description
    --------  ---------------------  ------------------------------------------
    announce  HashSet<TunnelOrigin>  Tunnel endpoints being announced by the
                                     sending router.
    withdraw  HashSet<TunnelOrigin>  Tunnel endpoints being withdrawn by the
                                     sending router.

Where the tunnel origin object has the following fields.

Tunnel Origin Object

    Key             Value Type  Description
    --------------  ----------  ----------------------------------------------
    overlay_prefix  IpPrefix    The destination prefix on the upstream network
                                being announced or withdrawn.
    boundary_addr   Ipv6Addr    The address of the tunnel endpoint that acts
                                as an egress for the overlay prefix.
    vni             u32         A 24-bit Geneve virtual network identifier
                                placed in a 32-bit unsigned integer.
    metric          u64         A metric indicating a relative sense of
                                priority for this tunnel origin. This is used
                                as a basis for overlay route installation into
                                OPTE virt-to-boundary tables at server
                                routers, as described above.
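
For readers who prefer code, the two tables above roughly correspond to Rust shapes like the following. This is a sketch based on the listed field names and value types; the IpPrefix definition and the derives are assumptions rather than the authoritative ddm types.

    use std::collections::HashSet;
    use std::net::{Ipv4Addr, Ipv6Addr};

    // Assumed stand-in for ddm's IpPrefix type.
    #[derive(Clone, PartialEq, Eq, Hash)]
    enum IpPrefix {
        V4 { addr: Ipv4Addr, len: u8 },
        V6 { addr: Ipv6Addr, len: u8 },
    }

    #[derive(Clone, PartialEq, Eq, Hash)]
    struct TunnelOrigin {
        overlay_prefix: IpPrefix, // upstream destination being announced/withdrawn
        boundary_addr: Ipv6Addr,  // boundary ULA on the underlay
        vni: u32,                 // 24-bit Geneve VNI carried in a u32
        metric: u64,              // relative priority (see Tunnel Origin Metrics)
    }

    struct TunnelUpdate {
        announce: HashSet<TunnelOrigin>,
        withdraw: HashSet<TunnelOrigin>,
    }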

Alternatives

Implement Everything in the Control Plane

When looking at the approach above, a valid question is why not just have the control plane manage this entirely? Have an RPW send commands to sled agents that push routes down to the OPTE V2B tables.

With a pure control plane approach there is no in-band network disruption detection. Because the ddm daemons communicate over the same links they route over, actively keep sessions alive through a stream of solicitations and advertisements, and propagate peer expirations along a path, we get path failure detection and can automatically withdraw routes when they are detected to be non-viable. Network paths can fail in ways that are not immediately apparent to the control plane but are immediately detectable and actionable from an in-band routing daemon.

This is the same class of routing problems that ddm proper has to deal with. As the ddm control plane daemon becomes more sophisticated and employs techniques like damping and interactions with the ddm data plane protocol, the tunneled routing machinery will automatically benefit from these advancements.

Force Customers into HA with Symmetric Switch Configs

If we assume that switches in a rack are always symmetrically configured, then we just need one anycast address per rack. And in the single-rack case this simplifies back down to having a single boundary services address for OPTE to send packets to. In the multi-rack case we still need to track which upstream addresses correspond to which underlay switch addresses and propagate that information across a network of racks.

While this does dramatically simplify things for MVP, it seems overly constraining to force customers into symmetric switch configurations and it does not really simplify things that much for multi-rack. Unwinding this assumption also seems like more work and pain than just starting without it.

Is this an EVPN Trojan Horse?

While this does also introduce tunnel routing, or routing in and out of tunnels (RIOT) if you prefer - no, most definitely not. While we do carry an L2 frame in our Geneve encapsulated packets, all overlay network endpoints are strictly routed and do not share a broadcast domain with any other endpoints (/32 or /128 netmasks for IPv4 and IPv6 respectively). This means there is no broadcast, unknown-unicast, and multicast (BUM) traffic to contend with, and therefore we don’t need to deal with inner-frame replication in-network. This simplifies things dramatically, and what we are left with is simply a tunneled routing protocol at L3. In our case, what we’re implementing is closer to [LISP] than EVPN.

Applicability to the Rest of the Overlay

As I’m reading through the OPTE code for V2P to implement V2B, I’m realizing we can use these same mechanics for regular VPC/overlay route propagation. What’s more, if participation in a tunneled network control plane (in the routing protocol sense of the term) is keyed by VPC ID, then we get rid of the everyone-knows-about-everything problem, and propagation between peers only happens for the VPC IDs peers care about. If we dispense with the idea of a dedicated boundary services VNI and just use the VPC VNI and have boundary services act like a VNI router, then this same benefit can be had in the boundary services context.

It’s important to note that if this approach is considered, it needs to be done within the context of the overall information propagation architecture of the Oxide platform. Today the control plane is responsible for information propagation, including managing [CAP]. This approach would be a partial shift away from that model, letting an underlying protocol handle part of the information management problem. This would need to be considered very carefully.

This approach would also not be a general approach for propagating OPTE state. Things like firewall rules, NAT updates, etc., do not correspond to a routing information propagation model. OPTE manages a holistic view of the connectivity model for instances and services, and routing is only a part of that model.

Determinations

As described above, there are a number of benefits to using a routing protocol to distribute overlay routing information. The determination is that we will use this approach for boundary routing. Applicability to the rest of the overlay is going to be shelved for now. Today we are not distributing prefixes for VPC overlay routing, just single-address destinations. If we transition to prefix distribution, there may be value in moving to a routing-protocol-based approach.

Open Questions

  • How scalable will this be? Let’s try full Internet tables once we have BGP integrated and find out.

  • Not all OPTE instances need all boundary routes; how should we prune things on the way to the V2B tables?

Security Considerations

  • This could be a DoS vector, especially with a dynamic routing daemon driving things. We should consider route damping and other typical stability measures in our implementation.

  • We inherit the existing security considerations for ddm, but we now need to weigh those considerations in the context of external factors that can come into play through BGP or static route manipulation.