RFD 63
Network Architecture

The purpose of this RFD is to outline proposals for the network architecture used within one or more Oxide racks, how customers interface with an Oxide rack, considerations for how we scale the network and its related services, and how we can implement the primitives that we have proposed to our customers.

It is recommended that readers familiarize themselves with the following RFDs, which go into more detail about the user facing APIs and considerations for what we are implementing and why:

In addition, the following are useful background in terms of systems that have come before this that we are learning from.

Customer and Oxide Benefits and Goals

Goal 1: The network needs to feel snappy.

The network services and the resulting performance need to feel snappy to customers. This is as subjective as it is objective. Some customers will be doing things like spinning up instances and running tools like ping, iperf, and others to 'measure' performance, and we need to make sure that the resulting network architecture gives us a good foundation for meeting this goal.

Goal 2: No single point of failure.

The network architecture design should be such that the broader network architecture can handle the failure of a single component and still provide the same service, possibly in a degraded fashion (e.g. less available bandwidth). For example, the failure of a single switch should not mean a loss of service for the rack. The failure of a single server should not impact network services on other servers, except insofar as they were trying to talk to resources on the offline server.

A related aspect to this is that when a failure occurs, we should strive to minimize the impact to the customer where possible. For example, we should try to avoid causing spurious TCP RSTs in customer flows because a component has failed.

Goal 3: Network architecture and services should not impose a scalability limit as the customer’s deployment grows.

As our customers scale their deployment, we need to make sure that network services and our architecture scale as well. Other ways of saying this include:

  • Network services (e.g. routing, NAT, FW) should avoid having a single bottleneck, where possible.

  • We need to design services such that they can scale-out, in addition to or instead of scaling-up.

There are some obvious caveats that come along with this:

  • A single instance can’t go faster than the underlying server’s network capacity (though the instance may be subject to its own caps).

Goal 4: We need to meet our customer where they are.

We need to provide easy integration with our customer’s network (e.g. BGP). For example, we need to design things assuming that the customer has allocated a large amount of their IPv4 or IPv6 space already. If a customer uses a specific interior routing protocol, we need to assume that we will need to match them, within reason.

We cannot be all things to all customers, so we will need to pick and choose where appropriate. However, our designs should at least try to give us flexibility where prudent. In particular, in the short term, we may need to pick a limited number of integrations that we support to help us get to market.

Goal 5: Customer instances should be able to live on any server.

In traditional networks, instances and their addresses are tied to the physical network. For example, all instances in the same L3 network have to share a broadcast domain. To limit its scope, a broadcast domain or network is often constrained to a smaller footprint such as a single rack. This means that moving instances often requires changing the IP address.

Our design should allow for networking not to be the constraint on where a customer instance can be moved within the availability zone. There may be other constraints that limit an instance to a particular Rack or Cell (such as storage); however, the network should not constrain that.

Goal 6: Simplify customer network management experience.

Where possible, we should seek to simplify the customer’s network management experience. This comes in many forms. For example, some of this is about improving the serviceability and manageability experience. This can take the form of things such as leveraging LLDP, being able to more easily pinpoint and paint pictures of where traffic and saturation is in the network, providing tools for remediation, and more.

When it comes to things like upgrades and management of the actual configuration, we should design things such that they do not require 'traditional' network management whereby operators need to manually log into different components such as switches and servers and try and roll out configurations. This should be managed automatically. Common alerts and monitoring that teams may set up for the network should be easy to configure and set up automatically.

Many of our customers are in organizations where different teams are responsible for servers and for networking. Because of this, it is often the case that groups can end up in acrimonious or even antagonistic relationships whereby incident analysis is passed between different groups with blame asserted. Where possible, we should make it easy to indict or acquit aspects of the network in a neutral way so as to help the customer solve the thing that actually matters: understanding why an application is broken.

Goal 7: Where possible we should provide an evolutionary path to future services and approaches.

The only thing we know for certain right now is that whatever initial design we come up with will not last forever. Therefore, when making certain decisions we should consider what the evolutionary impact is and where possible try and design things such that we can adjust them with as little customer and operational impact as possible.

Disclaimer

The rest of this document goes into more detail about the proposed architecture, what we’re trying to accomplish, and why we made some of these decisions. This is not the end of this journey — there are still a lot of details that we will need to work through in subsequent documents. We will find issues with what’s proposed and it will not meet all of our needs over time; however, it will hopefully prove to be a reasonable balance between ambition, providing building blocks for the future, and something we can actually bring to market for our initial product.

While the only guarantee that we may well have is that this will miss the mark in a number of ways, I ask that folks run with this and hopefully can see a path towards something interesting ahead and help us shape ourselves to get there.

High-Level Overview

The following are the high-level aspects of the network design:

  • An IPv6 based physical network within the rack and between Oxide racks

    • A /64 per host and a /56 summary per rack

    • An interior routing protocol that is used between hosts and the ability to start with minimal coordination.

    • Equal-cost multi-pathing pushed all the way down to the host

    • Independent L2 Networks with VLANs for Service Processors, Power Shelf Controllers, and upstream ports that are on a physically distinct set of networks (that are joined together in the Sidecar Rack Switch)

  • Network virtualization leveraging Geneve encapsulation to implement VPCs

    • Allows us to take advantage of current NIC offloads

    • Geneve gives extensibility through header options

  • Oxide Packet Transformation Engine (OPTE) running on each host to provide scalable network services

    • Allows for flexible implementation of routing, NAT, Firewalling, DNS, and more

    • Gives a design that allows us to move around functionality between the host, NIC, and switch as we scale and evolve.

    • Provides a framework for customer introspection of application networks

  • Network Boundary Services that provide an interface to the customer network.

    • Utilizes BGP for interfacing with the customer network

    • Leverages the capabilities of the Tofino 2 switch ASIC, minimizing the need for additional networking hardware in small configurations

    • Together with OPTE, provides the foundation for load balancing and additional future services.

  • Virtio Net Devices exposed to guests

    • Guest configuration via traditional auto-configuration (DHCP, SLAAC)

  • Network services to ease management and auto-discover the network, including LLDP

The following series of images describe different views of the system:

Logical view
   /----------\                        IPv6/56 +-------+    +-------+ IPv6/56
   | Internet |                                | Oxide |    | Oxide |
   \----------/                                | Rack  |    | Rack  |
        ||             +----------+            +-------+    +-------+
   +----------+        |  Oxide   |                ^            ^
   | Customer |<======>| Network  |<===============+============+ . . . L3 Network
   | Networks |<======>| Boundary |<===============+============+       Fabrics
   +----------+        | Services |                v            v
                       +----------+            +-------+    +-------+
                                               | Oxide |    | Oxide |
                                               | Rack  |    | Rack  |
                                       IPv6/56 +-------+    +-------+ IPv6/56
Compute Node Networking Plane
    ^            +-----------------------------------------------------------
    |            | Server  +------+                              VM A
    | To switch  |         | LLDP |                         +---------------+
    + . . . one  |         | IGP  |               +------+  |    +--------+ |
    |            |         +------+   +------+    | cust |  |-+  | virtio | |
    |            |              ^     |      |<-->| vnic |->|o|--| nic    | |
    |            |    +------+  |     |      |    +------+  |-+  | net0   | |
    |            |-+  | host |  |     |      |              |    +--------+ |
    +-----------<|o|--| phys |<-+     |      |              +---------------+
                 |-+  | net0 |  |     |      |
                 |    +------+  |<--->| OPTE |                    VM B
                 |    +------+  |     |      |              +---------------+
                 |-+  | host |  |     |      |              |    +--------+ |
    +-----------<|o|--| phys |<-+     |      |    +------+  |-+  | virtio | |
    |            |-+  | net1 |  |     |      |<-->| cust |->|o|--| nic    | |
    |            |    +------+  |     |      |    | vnic |  |-+  | net0   | |
    |            |              v     |      |    +------+  |    +--------+ |
    | To switch  |      +---------+   +------+              +---------------+
    + . . . two  |      | Control |      |
    |            |      | Plane   |<-----+
    |            |      +---------+
    v            +------------------------------------------------------------

Mapping to User API Concepts

In RFD 21 User Networking API we introduce a number of high level concepts and API entities such as VPCs, Internet Gateways, VPC Routing, VPC Peering, VPC Firewalls, and more.

The fundamental idea is that each VPC is built using an overlay network with its own virtual network identifier using Geneve. The implementation of VPC Subnets, VPC Routing Tables, VPC Firewalls, Internet Gateways, VPC Peering, the instance’s DHCP and more is done for the most part by the [sect-opte], which is a programmable software component that allows us to run a series of transformations on packets and uses the [sect-onds] to know what the set of rules to apply is.

Ephemeral and Floating IPs, along with outbound NAT and Internet gateways, are built leveraging a similar virtual network and [sect-opte]. In this case, a dedicated virtual network is reserved for use with the Boundary Services which takes care of translating data between the encapsulated format and a form appropriate for the underlying network.

Finally, pieces like DNS are implemented by the broader network services with some assistance from [sect-opte].

What This Doesn’t Do

Notably, this network architecture purposefully doesn’t solve the problem of trust. It leaves the network, broadly speaking, as an entity that will forward most traffic, and leaves it to the entities that are generating and receiving that traffic to determine whether or not they should act upon it.

XXX There’s probably more

Example Packet Flows

Physical Flow

Presume the following example diagram of a pair of servers in the rack connected to each switch.

Address Key:
  site:12_00::/64
  ^^^^ ^^ ^^
    |  |  |
    |  |  +------ Prefix for the server (8 bits)
    |  +--------- Prefix for the rack (8 bits)
    +------------ Prefix for the AZ/Cell (48 bits, constant in example)


                 site:12_00::/64
                   +-------+
                +->| SERV0 |<--+
                |  +-------+   |
                |              |
            +-------+      +-------+
fe80::S0_NA | NIC A |      | NIC B | fe80::S0_NB
            +-------+      +-------+
                |              |
                v              v
          +----------+    +----------+
fe80::MP1 | Switch 1 |    | Switch 2 | fe80::MP2
          +----------+    +----------+
                ^              ^
                |              |
            +-------+      +-------+
fe80::S5_NA | NIC A |      | NIC B | fe80::S5_NB
            +-------+      +-------+
                |              |
                |  +-------+   |
                +->| SERV5 |<--+
                   +-------+
                 site:12_05::/64

Here, let’s walk through what happens if SERV5 needs to send traffic to SERV0. Traffic is being sent from one of the service IPs in site:12_05::/64 to site:12_00::/64. The rack identifier and server each have 8 bits in their part of the IPv6 address; the distinction between the two is indicated with the _ (underscore) character. See the address breakdown in the key above. It’s also worth noting that site here is an IPv6 block that’s only routable in the context of the Oxide environment itself. That is, we use [sect-bs] to reach the outside world; that way it doesn’t matter if our customers are using IPv6, IPv4, or both (and hopefully not neither!).

  1. SERV5 would look up the destination of site:12_00::/64 in its routing table. This would indicate two next hops of either Switch 1 or Switch 2. The system would then use delay driven multipath to determine which device to use (presuming both paths were up).

  2. Based on the decision of the routing protocol, the server would pick an interface to send the packet. It would then verify that it had a valid address for the link-local address of the corresponding switch in its neighbor table. If one was not valid, it would use the neighbor discovery protocol (NDP) to look that up.

  3. Presuming it chose to go out NIC A, the server would generate a packet whose destination MAC address is that of Switch 1 (the neighbor behind fe80::MP1), with an IP destination inside site:12_00::/64.

  4. Switch 1 would receive the packet and would evaluate its options. It would prefer to send it to the address of SERV0.NIC A. However, if that link was down, it would be unable to do much with the packet in this example (as there are no other routers available) and would have to drop it. If this were to have occurred, then SERV5 would have had a routing table update telling it not to send the packet to Switch 1.

  5. Presuming that the switch found that it could send to SERV0.NIC A, it would rewrite the destination MAC address to be that of fe80::S0_NA and the source to be that of fe80::MP1.

  6. SERV0 would receive the packet and deliver it locally based upon the IP address.
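
To make the routing decision in step 1 above a bit more concrete, the following is a minimal sketch (in Rust) of a server-side route lookup whose entry carries two next hops, one per switch, with the lowest-delay usable path chosen. The type and function names, the fe80::1/fe80::2 stand-ins for fe80::MP1/fe80::MP2, and the rendering of site:12_00::/64 as a concrete ULA prefix are all illustrative assumptions, not the actual routing implementation.

  use std::net::Ipv6Addr;

  /// A candidate next hop: the link-local address of a switch, the local
  /// interface used to reach it, and a recent delay measurement.
  struct NextHop {
      interface: &'static str,   // e.g. host phys net0 or net1
      switch_ll: Ipv6Addr,       // link-local address of the switch
      delay_micros: Option<u64>, // None if the path is currently down
  }

  /// A routing entry for a destination prefix, e.g. site:12_00::/64.
  struct Route {
      prefix: (Ipv6Addr, u8), // (network, prefix length)
      next_hops: Vec<NextHop>,
  }

  /// Pick the usable next hop with the lowest measured delay, a stand-in
  /// for the delay driven multipath decision described in step 1.
  fn pick_next_hop(route: &Route) -> Option<&NextHop> {
      route
          .next_hops
          .iter()
          .filter(|nh| nh.delay_micros.is_some())
          .min_by_key(|nh| nh.delay_micros.unwrap())
  }

  fn main() {
      // site:12_00::/64 rendered as a concrete ULA prefix (rack 0x12, server 0x00).
      let route = Route {
          prefix: ("fd00:1122:3344:1200::".parse().unwrap(), 64),
          next_hops: vec![
              NextHop { interface: "net0", switch_ll: "fe80::1".parse().unwrap(), delay_micros: Some(18) },
              NextHop { interface: "net1", switch_ll: "fe80::2".parse().unwrap(), delay_micros: Some(25) },
          ],
      };
      if let Some(nh) = pick_next_hop(&route) {
          println!("send via {} toward {} for {}/{}", nh.interface, nh.switch_ll, route.prefix.0, route.prefix.1);
      }
  }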

Rack to Rack

If we take the previous example, let’s extend it and say we had another rack in the same site:

                 site:ab_01::/64
                   +-------+
                +->| SERV1 |<--+
                |  +-------+   |
                |              |
            +-------+      +-------+
fe80::S1_NA | NIC A |      | NIC B | fe80::S1_NB
            +-------+      +-------+
                |              |
                v              v
          +----------+    +----------+
fe80::MP3 | Switch 3 |    | Switch 4 | fe80::MP4
          +----------+    +----------+
                ^              ^
                |              |
            +-------+      +-------+
fe80::S6_NA | NIC A |      | NIC B | fe80::S6_NB
            +-------+      +-------+
                |              |
                |  +-------+   |
                +->| SERV6 |<--+
                   +-------+
                 site:ab_06::/64

Switch connectivity:

          +----------+    +----------+
fe80::MP1 | Switch 1 |    | Switch 2 | fe80::MP2
          +----------+    +----------+
            ^      \        /      ^
            |      +\------+       |
            |      | +-----+       |
            v      v       v       v
          +----------+    +----------+
fe80::MP3 | Switch 3 |    | Switch 4 | fe80::MP4
          +----------+    +----------+


                 site:12_00::/64
                   +-------+
                +->| SERV0 |<--+
                |  +-------+   |
                |              |
            +-------+      +-------+
fe80::S0_NA | NIC A |      | NIC B | fe80::S0_NB
            +-------+      +-------+
                |              |
                v              v
          +----------+    +----------+
fe80::MP1 | Switch 1 |    | Switch 2 | fe80::MP2
          +----------+    +----------+
                ^              ^
                |              |
            +-------+      +-------+
fe80::S5_NA | NIC A |      | NIC B | fe80::S5_NB
            +-------+      +-------+
                |              |
                |  +-------+   |
                +->| SERV5 |<--+
                   +-------+
                 site:12_05::/64

Here let’s discuss what would happen if SERV5 in rack 12 was trying to send to SERV1 in rack ab. The steps below are somewhat abbreviated. The switches in each rack are fully connected. That is, each switch in rack 12 is connected to each switch in rack ab and vice versa, but the two switches within a rack are not connected to each other.

  1. SERV5 would look up the path to SERV1’s service address site:ab_01::/64. It would be told that the route to site:ab::/56 is via either Switch 1 or Switch 2 with an equal cost. Assuming it picked Switch 2, it would forward the packet there via fe80::MP2.

  2. Switch 2 would have two routing entries for site:ab::/56. Two approximately equal-cost entries (depending on the current delay) would send it to either Switch 3 or Switch 4. Let’s presume it forwarded the packet to Switch 4, which would happen via fe80::MP4.

  3. Switch 4 would find that it has a single path to site:ab_01::/64, which is directly via SERV1.NIC B. It would then forward the packet to SERV1.NIC B’s address fe80::S1_NB.

  4. The packet would be received by SERV1 on NIC B and it would deliver it to the application that has the address.

Internal DNS

Presume that an instance on the VPC Subnet represented by 192.168.1.0/24 is looking up the DNS name for another instance on its VPC Subnet named db, whether this is through someone running an explicit command like dig or host, or as a side effect of using a tool like ping, curl, or something else. The following steps would roughly happen:

  1. The DNS configuration in the instance would instruct the DNS tooling to use the reserved DNS services IP Address, 192.168.1.2. The system would create a DNS packet for 192.168.1.2.

  2. Because the instance is configured with an off-subnet gateway, the request would need to be routed to the network’s gateway, 192.168.1.1.

  3. The operating system kernel would need to check its ARP cache for that address. If we assume that it had expired, it would then issue a broadcast ARP packet asking about 192.168.1.1.

  4. That packet would enter OPTE, which would note that it is an ARP packet for the network’s gateway. It would synthesize a reply and inject it back to the guest.

  5. With the ARP data resolved, the guest would send the DNS packet to the gateway’s MAC address.

  6. As before, the OPTE would notice that this was something destined for the DNS services address, mark it for processing by the broader network services, and queue it for that work.

  7. Another agent on the system would inspect the DNS request and make the appropriate backend calls to the [sect-onds].

  8. Upon getting the response, the agent would generate a DNS reply, and give it to OPTE to inject back into the guest.

  9. The guest would receive the DNS reply and continue as normal.
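
As an illustration of steps 3 and 4, here is a rough sketch of how a packet engine could synthesize an ARP reply for the virtual gateway without the request ever leaving the host. The struct layout, the gateway MAC value, and the function name are hypothetical and are not the actual OPTE implementation.

  use std::net::Ipv4Addr;

  /// The fields of an ARP request/reply that matter for this example.
  #[derive(Debug)]
  struct ArpPacket {
      op: u16,                 // 1 = request, 2 = reply
      sender_mac: [u8; 6],
      sender_ip: Ipv4Addr,
      target_mac: [u8; 6],
      target_ip: Ipv4Addr,
  }

  /// A hypothetical, fixed MAC address presented to guests as "the gateway".
  const GATEWAY_MAC: [u8; 6] = [0xa8, 0x40, 0x25, 0xff, 0x77, 0x77];

  /// If the guest is asking about the virtual gateway, synthesize a reply
  /// locally; otherwise return None and let normal processing continue.
  fn synthesize_gateway_arp(req: &ArpPacket, gateway_ip: Ipv4Addr) -> Option<ArpPacket> {
      if req.op != 1 || req.target_ip != gateway_ip {
          return None;
      }
      Some(ArpPacket {
          op: 2,
          sender_mac: GATEWAY_MAC,
          sender_ip: gateway_ip,
          target_mac: req.sender_mac,
          target_ip: req.sender_ip,
      })
  }

  fn main() {
      let request = ArpPacket {
          op: 1,
          sender_mac: [0x02, 0x00, 0x00, 0x00, 0x00, 0x01],
          sender_ip: "192.168.1.5".parse().unwrap(),
          target_mac: [0; 6],
          target_ip: "192.168.1.1".parse().unwrap(),
      };
      let reply = synthesize_gateway_arp(&request, "192.168.1.1".parse().unwrap());
      println!("synthesized reply: {:?}", reply);
  }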

Instance to Instance

Assume that we have two instances, 'test' and 'web' that are on two different servers, 'A' and 'B'. These are part of the same VPC and project. If one were to try and connect to an HTTP server between the two, then here’s an example of what would happen. Let’s assume someone on 'test' ran something like curl http://web/index.html.

The first step would be DNS resolution. For that, see [flow-dns] above. Let’s assume that has been completed and we got that the IP address of 'web' was 10.0.0.88. With that in hand, the following steps might occur:

  1. The system would create a socket and attempt to initiate a TCP connection to the host at 10.0.0.88 on port 80.

  2. The OS kernel would realize that the packet needed to be routed to the network gateway at 10.0.0.1.

  3. The OS kernel would need to check its ARP cache for the gateway 10.0.0.1. Assume that it has that cached and we don’t need to look it up right now as it would have used it for the DNS flow.

  4. The OS kernel would send a packet out with the MAC address of the gateway and the IP address of 10.0.0.88.

  5. The packet would leave the guest and it would be received by OPTE.

  6. The OPTE would look at the packet and evaluate the following considerations:

    • Whether the VPC Routing rules allowed this communication.

    • Whether the VPC Firewall rules allowed this communication.

  7. Assuming the flow was allowed, it would apply the following transformations to the packet:

    • Rewrite the destination MAC address to be that of the actual VNIC of the instance.

    • Encapsulate the packet in a Geneve UDP packet, using the VPC’s virtual network ID.

    • Fill out outer Ethernet, IPv6, and UDP headers that indicate the actual server hosting the target instance.

  8. It would follow the existing physical flow and be sent out one of the server’s two NICs and routed through the DC to the recipient.

  9. The packet would be received by the other server’s instance of OPTE. Between the NIC and OPTE, the packet would be classified.

  10. OPTE would evaluate the packet and do some of the following:

    • Verify and allocate firewall state rules.

    • Decapsulate the packet.

    • Deliver it to the guest

  11. The guest kernel on 'web' would receive the TCP packet, begin processing it, and then use this same procedure in reverse.
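
To make the encapsulation step a bit more concrete, the sketch below assembles the 8-byte Geneve base header and the outer addressing for a packet headed to the server hosting 'web'. The UDP destination port (6081) and the transparent Ethernet bridging protocol type (0x6558) come from the Geneve specification (RFC 8926); the VNI value, underlay addresses, and type names are illustrative assumptions, and Geneve options (part of the reason for choosing Geneve) are omitted for brevity.

  use std::net::Ipv6Addr;

  /// Geneve uses UDP destination port 6081 (RFC 8926).
  const GENEVE_UDP_PORT: u16 = 6081;
  /// The inner payload is a full Ethernet frame ("transparent Ethernet bridging").
  const ETHERTYPE_TEB: u16 = 0x6558;

  /// Build the 8-byte base Geneve header for a given 24-bit virtual network id.
  /// Options (which Geneve allows for extensibility) are omitted in this sketch.
  fn geneve_header(vni: u32) -> [u8; 8] {
      assert!(vni < (1 << 24), "VNI is a 24-bit field");
      let mut hdr = [0u8; 8];
      hdr[0] = 0;                                           // version 0, no options
      hdr[1] = 0;                                           // no flags
      hdr[2..4].copy_from_slice(&ETHERTYPE_TEB.to_be_bytes());
      hdr[4..7].copy_from_slice(&vni.to_be_bytes()[1..4]);  // 24-bit VNI
      hdr[7] = 0;                                           // reserved
      hdr
  }

  /// The outer addressing an OPTE-like engine would need: which physical
  /// server (underlay /64) currently hosts the destination instance.
  struct OuterAddressing {
      src_underlay: Ipv6Addr,
      dst_underlay: Ipv6Addr,
      udp_dst_port: u16,
  }

  fn main() {
      let vni = 0x00_beef; // hypothetical VPC virtual network id
      let outer = OuterAddressing {
          src_underlay: "fd00:1122:3344:0105::1".parse().unwrap(),
          dst_underlay: "fd00:1122:3344:0112::1".parse().unwrap(),
          udp_dst_port: GENEVE_UDP_PORT,
      };
      let hdr = geneve_header(vni);
      println!("geneve header bytes: {:02x?}", hdr);
      println!("outer {} -> {} udp/{}", outer.src_underlay, outer.dst_underlay, outer.udp_dst_port);
  }

The 24-bit VNI field is what a VPC’s virtual network identifier would map onto, and the option space left out of this sketch is what gives us room to grow the header over time.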

NATing Out

A common sanity check that someone might do is pinging an external resource, e.g. ping 8.8.8.8. If we assume an instance in a default configuration, it will have an Internet Gateway configured that causes all traffic not in the VPC network to be subject to a NAT and reach the outside world. This flow leverages Boundary Services.

  1. The kernel creates an ICMP packet directed towards 8.8.8.8.

  2. That gets forwarded to the off-network gateway, as with the earlier examples.

  3. OPTE would receive the packet and evaluate the following:

    • Whether the VPC Routing Rules indicated that there was a configured Internet Gateway that could be used for this. The default VPC Routing Table would have this configured.

    • Whether the VPC firewall rules are configured to allow this outbound access. The default firewall rules would do so.

  4. OPTE would check the NAT state that this instance has used and see if it fits within its quota. If it does, then it would allocate a specific port for it and update the local table.

  5. OPTE would then do the following transformations on the packet:

    • Rewrite the source IP and port information to that allocated from the NAT.

    • Encapsulate the packet with the Boundary Services dedicated overlay network.

    • Generate the Outer Ethernet, IP, and UDP headers to point it at the Boundary Services.

  6. Follow the existing physical flow as it was sent to the Boundary Services.

  7. Boundary Services would receive the packet and do the following transformations:

    • Decapsulate the packet

    • Rewrite the destination MAC address based on its routing table and forwarding rules.

  8. The packet would be sent to the broader network. When a reply came back and was received by Boundary Services it would look up the destination IP address and port.

  9. Assuming the destination IP address and port were valid, it would transform the packet in the following way:

    • Encapsulate it with the boundary services network ID

    • Build the outer MAC, IP, and UDP information based on the logical address of the server that currently has the instance.

  10. When the instance of OPTE receives the packet it sees that it was something that was in its NAT state table and takes the following steps:

    • Decapsulate the packet

    • Rewrite the destination address and port from the external NAT address to the internal one.

  11. Deliver the rewritten packet to the guest, which will process the ICMP response and print the output normally.
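
The per-instance NAT bookkeeping in steps 4 and 5 could look roughly like the sketch below: a table mapping an internal flow to an allocated external port, with a fixed per-instance port quota. The quota size, the structure and field names, and the use of the ICMP identifier in place of a port are assumptions made for illustration.

  use std::collections::HashMap;
  use std::net::{Ipv4Addr, SocketAddrV4};

  /// A flow key from the instance's point of view.
  #[derive(Hash, PartialEq, Eq, Clone, Debug)]
  struct InternalFlow {
      proto: u8,         // e.g. 1 = ICMP, 6 = TCP, 17 = UDP
      src: SocketAddrV4, // instance address and port (or ICMP identifier)
  }

  /// Outbound NAT state for a single instance.
  struct NatState {
      external_ip: Ipv4Addr,
      // Ports this instance is allowed to use on the external address.
      available_ports: Vec<u16>,
      mappings: HashMap<InternalFlow, u16>,
  }

  impl NatState {
      /// Return the external port for a flow, allocating one if needed and
      /// if the instance still has quota; None means "drop or report error".
      fn external_port(&mut self, flow: &InternalFlow) -> Option<u16> {
          if let Some(port) = self.mappings.get(flow) {
              return Some(*port);
          }
          let port = self.available_ports.pop()?; // quota exhausted => None
          self.mappings.insert(flow.clone(), port);
          Some(port)
      }
  }

  fn main() {
      let mut nat = NatState {
          external_ip: "203.0.113.10".parse().unwrap(),
          available_ports: (49152..49152 + 64).collect(), // hypothetical 64-port quota
          mappings: HashMap::new(),
      };
      let flow = InternalFlow {
          proto: 1, // ICMP echo, keyed by identifier in place of a port
          src: "10.0.0.5:7".parse().unwrap(),
      };
      match nat.external_port(&flow) {
          Some(port) => println!("rewrite source to {}:{}", nat.external_ip, port),
          None => println!("NAT quota exhausted; drop"),
      }
  }

Pre-allocating the whole port range up front, rather than asking for ports one at a time, mirrors the batching idea called out from the Ananta paper later in this document.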

Floating and Ephemeral IP Inbound

In this example, presume that a user has set up a floating IP with the address 1.2.3.4 and assigned it to their instance through the API previously. Presume for the moment that there is an HTTP server running on the instance the floating IP is mapped to. In our customer’s broader network, Boundary Services will be advertised as the routing target via BGP. With that in mind, here are the approximate steps that would occur:

  1. Boundary Services receives an inbound TCP packet on port 80 for 1.2.3.4

  2. Boundary Services will look in its state table and determine whether it has a mapping for this IP address. If not, it will discard the packet. Given that it does in this case, it will transform the packet in the following way:

    • Encapsulate the packet with the boundary services network ID

    • Generate an outer packet destined for the compute node found in its mapping table.

  3. The packet will transit the physical network as described in [flow-phys] and be delivered to the instance of OPTE on that server.

  4. OPTE will check whether it has a flow for this packet or whether it matches its NAT rules (effectively this is a 1:1 NAT). If it does, it will then consider the following:

    • Whether the instance’s VPC firewall rules allow the packet.

  5. Upon determining that this flow is valid, OPTE will go through and make the following transformations:

    • Decapsulate the packet

    • Rewrite the destination IP address (but not port) and MAC address to be that of the instance.

  6. The packet will be delivered to the instance’s NIC and processed.

  7. If the server were to generate any replies in response to this, they would behave in a similar fashion to the first portion of [flow-nat].
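
As a sketch of the state table consulted in step 2, Boundary Services conceptually needs a mapping from each externally advertised address to the virtual network ID and the underlay address of the server currently hosting the instance. The structure and field names below, and the example values, are hypothetical.

  use std::collections::HashMap;
  use std::net::{Ipv4Addr, Ipv6Addr};

  /// Where an externally reachable address should be steered.
  struct InboundTarget {
      vni: u32,                // boundary services / VPC virtual network id
      sled_underlay: Ipv6Addr, // underlay address of the hosting server
      internal_ip: Ipv4Addr,   // instance address behind the 1:1 NAT
  }

  /// The state table consulted for inbound floating/ephemeral IP traffic.
  struct BoundaryMappings {
      by_external_ip: HashMap<Ipv4Addr, InboundTarget>,
  }

  impl BoundaryMappings {
      /// Either the encapsulation target for this external address, or None,
      /// in which case the packet is discarded.
      fn lookup(&self, external: Ipv4Addr) -> Option<&InboundTarget> {
          self.by_external_ip.get(&external)
      }
  }

  fn main() {
      let mut by_external_ip = HashMap::new();
      by_external_ip.insert(
          "1.2.3.4".parse().unwrap(),
          InboundTarget {
              vni: 0x00_0b57, // hypothetical
              sled_underlay: "fd00:1122:3344:0107::1".parse().unwrap(),
              internal_ip: "10.0.0.88".parse().unwrap(),
          },
      );
      let table = BoundaryMappings { by_external_ip };
      match table.lookup("1.2.3.4".parse().unwrap()) {
          Some(t) => println!("encapsulate toward {} (vni {:#x}) for {}", t.sled_underlay, t.vni, t.internal_ip),
          None => println!("no mapping; discard"),
      }
  }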

Learning from Others and the Past

As this is a complex subsystem, it’s important to pay attention to what others have done before us and to learn from systems that we have ourselves implemented directly, such as Joyent’s fabrics. In particular, we’re going to focus on a few areas and building blocks that have existed, including kernel facilities, and implementations that have been publicly documented in papers by Google and Microsoft.

VL2, Ananta, and VFP

Microsoft has a trio of networking papers that describe the design and evolution of their virtual networking features which are in the VL2, Ananta, and VFP papers.

The VL2 paper introduced (not for the first time) and popularized a scheme for what could be seen as overlay networks. It discusses the use of equal-cost multi-pathing and of limiting broadcast domain size by giving services a service address that is translated into the actual host that is running that service by a directory server.

This was used as a basis in Ananta where Microsoft put together a scheme for doing NAT and general non-terminating load balancing (that is load balancing that does not accept a connection and open a new one to a backend). This was brought to a head in the VFP paper, which describes the full network virtualization solution that Microsoft has.

There are a couple of highlights from their approaches and things that apply to us:

  • Ananta talks about the benefit of batching allocation requests for NAT. The general techniques here are useful and we may want to go so far as to pre-allocate all of the ports that we care about for an instance.

  • VFP makes a strong point for a 'connection-oriented' programmable pipeline. There’s a lot that makes sense about the connection-oriented focus. In particular, we’re building something where, from the customer’s perspective, it is about the actual individual connections due to things like firewalling and other stateful connection considerations. They spend a lot of time talking about the challenges of languages being designed this way and the ultimate benefit.

  • Based on the negative experiences in the VFP paper, it’s not clear that trying to adopt OpenFlow or Open vSwitch necessarily makes sense for us. On the other hand, Google does make successful use of both of them in their Andromeda paper. One thing that’s not clear is just how stock that implementation is and how much of the benefit of OpenFlow came from the existing extensive use that the paper mentions. It may be that because we’re starting from scratch, more of the issues that Microsoft brought up are likely to rear their heads for us.

  • A powerful thing that they did, variants of which come up in the Andromeda paper, is that they designed a system that can push specific flows to different places, such as a NIC. This is a useful insight, one which we’ll talk about in detail in the discussion of how we might design the [sect-opte].

  • Finally, many of the lessons learned in section 10.2 of the VFP paper are incredibly useful. It lets us know both what we can get away with and where we need to think critically about how we design the various aspects of the system. This ties into similar lessons that we’ve learned from the deployment of [past-fab].

Andromeda

Google’s Andromeda paper provides an interesting discussion of, and details about, the evolution of GCP’s network virtualization implementation. There is a lot of wisdom to draw from the evolution of Andromeda. Off the bat, there are similarities to what we see in [past-msft] and [past-fab]. There is a control plane that is distinct from the data plane. These two environments have different requirements, and while they’re integrated, the ways and means by which they scale and evolve are naturally different.

An important aspect that Andromeda focuses upon is agility. The ability to change out the software, transfer state, and do so in a small amount of time is needed not just for upgrades but also for live migration. An important takeaway is that whatever we design, we’re going to want to think carefully through how we can make iteration and the deployment path easy for customers. While deploying software on-premises is often more involved than in a managed environment, the more seamless we can make the experience, the more willing our customers will be to take and roll out upgrades.

An interesting consequence of this design is that it allows them to do migrations between arbitrary clusters, which is a useful ability.

One of the things that they highlight is the evolutionary path that led them to adopt what they call 'hoverboards'. A fundamental problem of a data plane that acts on flows and connections is how state is pushed out to it. The idea behind the 'hoverboard' is to spin up VMs in the broader datacenter that can handle this general processing, which keeps the primary data path simple. Then, based on specific usage, individual data flows are programmed back into the host.

An interesting thing they also point out is dedicated 'co-processor' threads which are there to do CPU-intensive packet operations, the most notable of which is encryption; apparently this only occurs when going out to the WAN or when transiting between clusters in GCP. One realization from that is that it suggests most traffic inside of a cluster is not encrypted.

The use of 'hoverboards' and the 'co-processors' is somewhat different from what we see in VFP from Microsoft. In VFP, hot-path flows were programmed into specialized NIC hardware, whereas here we have specialized flows actually being programmed in the host. Both approaches are useful and, in fact, the Andromeda paper goes through a number of different discussions of other techniques one could use and how they ended up with this one.

From my perspective, we don’t necessarily have the spare capacity at our initial small rack-level scale to dedicate specific VMs that we can provision based on demand. Both Microsoft and Google have evolved towards dedicated special-purpose hardware and towards leveraging more of the fleet; however, because we need to deliver an initial product and our initial scale will be small, we should learn from both of them that their initial, simpler designs did work at smaller scales.

A lot of the lessons that they’ve learned from this can help us as we try to look beyond the initial product and ask ourselves how we build environments for customers that scale beyond just a handful of racks. What does it mean when even just the thought of 100 racks is a drop in the bucket? Like with the [past-msft] work, having these papers is an incredibly useful resource to draw upon, which can’t be overstated.

Fabrics

In the beginning of 2014 and into delivery in early 2015, a scheme for doing network virtualization and providing a similar set of functionality was developed at Joyent, called 'fabrics'. While there is no canonical paper à la the items we’ve discussed above, the high-level view is that an overlay / underlay scheme was developed that leveraged VXLAN as the encapsulation protocol, along with a custom lookup scheme: a simple TCP protocol that went through a proxy to our production databases. A lot of the experiences (both positive and negative) are a major influence on the design proposed here and provide additional, useful background.

A major focus of the design was on reusing existing OS-level abstractions (such as virtual switches) and the creation of a new abstraction: the overlay device. The overlay device was a data link upon which additional virtual NICs could be created and took two specifications, an encapsulation plugin and a search plugin. The encapsulation plugin described how to transform the packet to put it on the wire. The search plugin operated in userland and when the kernel had an unknown destination, userland would be responsible for looking that up.

As described in the VL2 paper and alluded to in the Andromeda paper, there was a centralized database that had mappings. When things changed, an invalidation record would be written out for each compute node that needed to take an action, and each node would act on it. While there was an idea of pre-loading mappings, we always did so lazily. The one optimization that we had was that when we received an ARP or NDP request, we would not only inject the response, but also the mapping from the virtual MAC address to the compute node.

With those basics in mind, we can begin to talk about some of the things that worked well, some of the things that worked poorly, and which of those were due to implementation and which were due to more general architecture.

One thing that did make sense, and this is reflected through what we’ve seen written from Google and Microsoft, was the split between the data path and the control path. Effectively having the data path just do its thing (whether in-kernel or in userland) and having a separate entity push updates has been useful. This is a general observation folks have made and is part of the design of other open source tools like Open vSwitch. Fundamentally this split seems to be one that’s worth keeping.

That said, an important thing is keeping that actual database available and having the things that fetch from it be able to handle transient failures in a way that doesn’t unduly increase latency. This split is only useful insofar as the broader system is actually available and tolerant to those failures.

Another major lesson was the choice to focus solely on L2 networks. While what the customers ultimately cared about and got in the API were L3 networks that they could control the IP addresses over, the actual implementation focused on L2 addressing. This had a couple of interesting side effects over the years:

  • A result of this was that we had to emulate ARP and NDP. While not expensive in and of itself, this injected additional latency into the system and, when there were failures, could lead to harder-to-debug issues for the guests themselves, as there was often no good way for support staff to actually go through and understand the context of the guest and its ARP cache.

  • It made it substantially harder for us to go through and create some of the higher level services that we wanted in an HA fashion, in particular routing and NATs. When we started this, we naively just provisioned a single 128M instance for a given L3 network that ran NAT. Effectively, this didn’t give us a good foundation for building a scale-out NAT or Routing (incidentally firewalling was actually taken care of by our prior design choices and not a part of this).

With this in mind, we instead chose to focus on a design that really functions at L3 and provides us the flexibility that we need to implement the flows described earlier. The approach that Google has chosen by using an off-subnet route is actually very useful and one that we intend to use ourselves as discussed in RFD 21.

This gives us a few distinct advantages. Critically, the guest only ever has to have an ARP entry for the gateway, which will be fixed. There will still be lookups in the system, but they shouldn’t interfere with the ability of the guest to actually generate and send those packets.

Also, this construction basically means that we’re treating this as an L3 forwarding problem, which ties into what we’re actually trying to create: routing, NATing, and making decisions about reachability based on IP-level information.

One of the other interesting choices was how we tried to plug this into the kernel. There were a lot of benefits for us in terms of maintainability and getting things off the ground and leveraging things that already existed. At the time, the focus was on building different blocks that we could assemble into what we needed. While that was reasonable based on our business needs and constraints, it left some challenges. In particular, some aspects of performance were a little trickier due to some of the ways we plugged into the existing networking stack (effectively both as an application consuming and a driver producing).

Like Microsoft, Google, and certainly others have done, constructing a programmable engine that we can change on the fly or evolve and iterate on in a faster way is worthwhile. In addition, as we’ll talk about later on in [sect-opte], depending on how we build this, it’ll give us a lot of flexibility and the ability to more easily adapt and evolve the product.

A related lesson is just to be careful about where the abstractions are created and how you can see through them. While abstractions and layering are important and useful, eventually you also have to cut through them.

Physical Network Architecture

The physical network of an Oxide environment is a series of Layer 3 (L3) IPv6 networks that leverages equal-cost multi-path (ECMP) routing to provide multiple, redundant network paths. This L3, ECMP focus is actually used not only between racks, but also within the rack, between the server and the switches.

This high-level design was chosen to help with several of our primary goals. The choice to focus on per-rack summarized L3 addresses in part comes from [goal-3] and [goal-1]. The size of a broadcast domain inherently limits the scalability of the network. Many modern networks are being designed to take advantage of Clos topologies. This is well trodden ground and we’ve seen many folks write about it, including Google in its Jupiter Rising paper and Facebook in RFC 7938. In addition, being able to summarize the remote resources helps keep the size of routing tables in the switches down, helping with our ability to scale.

By leveraging multiple links and ECMP, that helps us satisfy both [goal-2] and [goal-3]. We push this down to the servers themselves in part for these goals, but also notably with [goal-6]. The popular alternative to this is to create a Layer 2 (L2) network per rack and leverage the link aggregation control protocol (LACP) IEEE standard. However, LACP’s origins were in first allowing for multiple links to a single device. This eventually grew to the ability to use LACP between a pair of switches; however, to do so the switches are required to be able to speak to each other on a custom protocol that can be used to do things like synchronize MAC address tables.

This is not a part of the IEEE standard; it has been a historical source of bugs across practically every switch vendor, has often been undebuggable, and for many consumers has been a large source of outages. Instead, we believe we shouldn’t try to lie to the broader system. While involving the server in the routing protocol will be a different source of things to debug, it should make the actual implementation of the network switches dramatically simpler and leverages something that the industry has built for some time, namely routers.

In addition, this design gives us some useful properties which minimize the amount of configuration that is required to bootstrap the network. The more we can keep the physical networking needs simple, the easier it should be to deliver a network that is both easier to manage and high-performance.

IPv6 Addressing

The physical network is built upon IPv6 and utilizes subnets to summarize collections of resources. Addresses are assigned from the IPv6 ULA range. For example:

  • An availability-zone is represented by an IPv6 /48.

  • A rack and its resources are represented by an IPv6 /56.

  • A server and its resources are represented by an IPv6 /64.

Conceptually, you can see this in the following image:

AZ, Rack, and Server IPv6 address allocation
   AZ-1 fd00:1122:3344::/48
 +-----
 |  Rack 1                      Rack 2                     Rack 3
 |  fd00:1122:3344:01/56        fd00:1122:3344:02/56       fd00:1122:3344:03/56
 | +------------------------+  +------------------------+ +------------------------+
 | | Host 1                 |  | Host 1                 | | Host 1                 |
 | | fd00:1122:3344:0101/64 |  | fd00:1122:3344:0201/64 | | fd00:1122:3344:0301/64 |
 | +------------------------+  +------------------------+ +------------------------+
 | | Host 2                 |  | Host 2                 | | Host 2                 |
 | | fd00:1122:3344:0102/64 |  | fd00:1122:3344:0202/64 | | fd00:1122:3344:0302/64 |
 | +------------------------+  +------------------------+ +------------------------+
 |           ...                          ...                        ...
 | +------------------------+  +------------------------+ +------------------------+
 | | Host 31                |  | Host 23                | | Host 10                |
 | | fd00:1122:3344:011f/64 |  | fd00:1122:3344:0217/64 | | fd00:1122:3344:030a/64 |
 | +------------------------+  +------------------------+ +------------------------+
 +-----

   AZ-2 fdff:eedd:ccbb::/48
 +-----
 |  Rack 1                      Rack 2                     Rack 3
 |  fdff:eedd:ccbb:01/56        fdff:eedd:ccbb:02/56       fdff:eedd:ccbb:03/56
 | +------------------------+  +------------------------+ +------------------------+
 | | Host 1                 |  | Host 1                 | | Host 1                 |
 | | fdff:eedd:ccbb:0101/64 |  | fdff:eedd:ccbb:0201/64 | | fdff:eedd:ccbb:0301/64 |
 | +------------------------+  +------------------------+ +------------------------+
 | | Host 2                 |  | Host 2                 | | Host 2                 |
 | | fdff:eedd:ccbb:0102/64 |  | fdff:eedd:ccbb:0202/64 | | fdff:eedd:ccbb:0302/64 |
 | +------------------------+  +------------------------+ +------------------------+
 |           ...                          ...                        ...
 | +------------------------+  +------------------------+ +------------------------+
 | | Host 31                |  | Host 23                | | Host 10                |
 | | fdff:eedd:ccbb:011f/64 |  | fdff:eedd:ccbb:0217/64 | | fdff:eedd:ccbb:030a/64 |
 | +------------------------+  +------------------------+ +------------------------+
 +-----

Based on the above, you can see that each site can be summarized with an IPv6 /48. We’ve then broken this up such that we have eight bits dedicated to indicating the rack in a site and eight bits to indicate the server inside of the rack. This gives us approximately 256 racks in a single AZ and allows us to have up to 256 hosts in each rack.
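
The following is a small sketch of how a server /64 and a rack /56 summary could be derived from the AZ-wide /48, matching the fd00:1122:3344:RRSS::/64 layout in the figure above (rack number in the high byte of the fourth 16-bit group, server number in the low byte). The helper names are illustrative rather than an actual allocation API.

  use std::net::Ipv6Addr;

  /// Compose the /64 prefix for a server from an AZ-wide /48 ULA prefix,
  /// an 8-bit rack number, and an 8-bit server number within that rack.
  fn server_prefix(az_48: Ipv6Addr, rack: u8, server: u8) -> Ipv6Addr {
      let mut segs = az_48.segments();
      // The fourth 16-bit group carries the rack in its high byte and the
      // server in its low byte, e.g. 0x0102 for rack 1, host 2.
      segs[3] = ((rack as u16) << 8) | server as u16;
      Ipv6Addr::from(segs)
  }

  /// The /56 summary advertised for an entire rack.
  fn rack_prefix(az_48: Ipv6Addr, rack: u8) -> Ipv6Addr {
      let mut segs = az_48.segments();
      segs[3] = (rack as u16) << 8;
      Ipv6Addr::from(segs)
  }

  fn main() {
      let az: Ipv6Addr = "fd00:1122:3344::".parse().unwrap();
      assert_eq!(
          server_prefix(az, 0x01, 0x02),
          "fd00:1122:3344:0102::".parse::<Ipv6Addr>().unwrap()
      );
      println!("rack 2 summary: {}/56", rack_prefix(az, 0x02));
      println!("host 23 prefix: {}/64", server_prefix(az, 0x02, 0x17));
  }

The same arithmetic, read in reverse, is what lets a switch summarize an entire rack with a single /56 entry.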

RFD 76 discusses the range of hosts that could fit into a given rack given different configurations. In the densest configuration listed there, 1OU high x 1/3 wide, that would give us 78 servers in the rack. Even if that quantity were able to double, we would still have plenty of addresses available.

The limit of around 256 racks would give us on the order of 10-20 thousand servers that could be represented in a single /48 segment. It is likely the case that this is a reasonable sizing for a Cell or perhaps even an Availability Zone, though more research into the scalability and bandwidth constraints of such a network core needs to be done.

An important aspect of this scheme is the ease of summarizing resources. The more that we can summarize resources, the fewer routing table entries we’ll need to program into the switch. Given our plans to maximize switch resources for Boundary Services, being able to make sure that as we expand the rack we don’t end up exploding our switch state is an important aspect of making sure we can design around [goal-3].

Having a specific set of IPv6 addresses map to a server may at first appear to fly in the face of [goal-5].

Control Plane Use

As described earlier, each server is allocated an IPv6 /64 which it advertises for routing on the network. The process by which each sled acquires the /64 is described in more detail within RFD 259: Who’s On First.

Once provisioned to the sled, that /64 will then be used by the server itself and addresses from that /64 prefix can be allocated to various internal services that run on the control plane. Although each sled will be responsible for actually using the addresses, allocation will need to be split between different entities in the system such as the Rack Setup Service (RSS), Nexus, and the local node (driven primarily by sled agent). The details of the allocation scheme are left to other RFDs.

This addressing scheme implies a few important things:

  1. Addresses for the control plane are not tied to the service instance; they are tied to the server it is located on.

  2. If something related to the control plane migrates between systems, its address will change. Note that this is not true for customer instances.

The rationale for this approach is a few-fold. The first reason is that we’re trying to reduce what we’re advertising in the routing protocol. By using addresses from the server’s prefix, we minimize the number of entries that we have in the routing table, which means that we can much more cleanly summarize a rack.

Another reason for this approach is that a number of aspects of the control plane design suggest that there will be a service discovery mechanism. Because Nexus will probably have some need to scale over time and discover what exists, the underlying addresses for most of that shouldn’t matter.

However, that does also mean that we will need some number of addresses for service discovery itself. If we assume that these may be Cell or Availability Zone-wide, then our plan is to take one of the rack /56 entries and reserve it for this purpose. In this particular case, we will support advertising the individual /64 entries in the routing protocol for the scope of the site. The hope is that the number of entries used for this will be small. One could imagine a handful for consensus such as 3, 5, or 7. This could potentially be augmented by anycast support.

Bootstrap, L2 Networks, SPs, and PSCs

The above sections describe how we believe the control plane and network will operate between hosts. However, there are a few other cases that we need to consider, notably that of the service processors, the power shelf controller, and the problem of cold boot.

Bootstrap Network

When a switch or server boots, they first land on the bootstrap network. Both switches and servers run a routing daemon that attaches them to the bootstrap network. When a server starts, the bootstrap agent running on that server extracts an identifying sequence of bits from that server and concatenates it with the well known boot prefix fdb0::/16 to form a unique IPv6 /64 prefix for the server. As a concrete example, if the service processor MAC address of the server is a8:40:25:10:00:01 then the IPv6 prefix for that server could be formed as fdb0:a840:2510:0001::/64. This sequence of bits could also come from the server’s serial number.

Once the bootstrap agent has determined the server’s unique bootstrap prefix, it does two things:

  1. Self assigns the first address within the prefix (fdb0:<unique-id>::1/64).

  2. Instructs the local routing daemon running on the server to advertise that prefix.
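
Tying these two steps to the earlier example, the sketch below derives the unique bootstrap /64 from the service processor MAC a8:40:25:10:00:01 and self-assigns the first address within it. The function name is hypothetical; the address arithmetic matches the example above.

  use std::net::Ipv6Addr;

  /// Derive a server's unique bootstrap address from its service processor
  /// MAC, under the well-known fdb0::/16 bootstrap prefix, self-assigning
  /// the first address (::1) within the resulting /64.
  fn bootstrap_address(sp_mac: [u8; 6]) -> Ipv6Addr {
      let segs = [
          0xfdb0,
          u16::from_be_bytes([sp_mac[0], sp_mac[1]]),
          u16::from_be_bytes([sp_mac[2], sp_mac[3]]),
          u16::from_be_bytes([sp_mac[4], sp_mac[5]]),
          0, 0, 0, 1,
      ];
      Ipv6Addr::from(segs)
  }

  fn main() {
      // The example from the text: SP MAC a8:40:25:10:00:01.
      let addr = bootstrap_address([0xa8, 0x40, 0x25, 0x10, 0x00, 0x01]);
      assert_eq!(addr, "fdb0:a840:2510:0001::1".parse::<Ipv6Addr>().unwrap());
      println!("bootstrap address: {} (advertise the containing /64)", addr);
  }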

The local routing daemon running on the server advertises the prefix to both of the switches it is directly connected to. The routers running on those switches subsequently re-advertise the prefix to all of the other servers they are directly connected to. This re-advertisement only happens in the direction of server routers, i.e., these advertisements do not leave the rack. In this manner, every server becomes aware of every other server’s self-assigned bootstrap prefix (within the rack) as advertisements propagate. This can be used as a server discovery mechanism for control plane software without having to resort to any form of multicast. The bootstrap agent can simply read the local routing table to see what servers are currently reachable.

Peering and traffic forwarding between routers takes place over link-local addresses. When a router first starts, it sends out neighbor discovery protocol (NDP) router solicitations over link-local multicast to discover neighboring routers. Neighboring routers respond to multicast solicitations sourced from a link-local unicast address over unicast thereby establishing bi-directional unicast link-local communication. From there, the routers engage in a peering protocol and subsequently begin to exchange prefix advertisements.


It’s important to note the directionality of prefix propagation. Servers determine their own prefixes and advertise them. During bootstrap, the switches are not capable of determining what switch they are (switch 0 or switch 1), nor are they capable of coordinating which servers are on which ports across switches. Because of how the cabled backplane is constructed, the port a server is attached to on switch 0 is not necessarily the same as the port it is attached to on switch 1. Thus it would not be possible for a switch-level router to delegate a consistent prefix to servers without some sort of triangulation protocol.

In previous versions of this RFD, two alternatives were discussed for how the bootstrap network would be constructed, an L2 and an L3 variant. We have decided to move forward with the L3 variant for the following reasons.

  • The post-bootstrap network will be a routed L3 network and the bootstrap network must run side-by-side with that network. Having both networks built on the same underlying principles and mechanisms makes for a less complex co-existence and an overall easier system to reason about.

  • Bootstrap prefix propagation through a routing protocol obviates the need to support multicast on the switch. Link-local multicast will still be needed for router peering, but constraining things to link-local means we don’t actually need to replicate packets on the switch as broadcast domains do not extend beyond any single port.

  • Having an actively routed L3 network delivers the benefits of the multipath physical network in the Oxide rack transparently to control-plane bootstrap applications. The salient benefits for the bootstrap network are fault tolerance and downtime free maintenance.

  • Having a single unified network. The L2 approach necessitated two disjoint networks which meant two different bootstrap addresses in two different subnets for each server. While this can certainly be handled at the application level, leveraging multipath routing - which we need for the post-bootstrap network anyway - makes for a much simpler overall network.

SPs and PSCs

Each switch port, whether it is for the RJ45-compatible PSC ports, the service technician, or something else, would be treated as an independent, isolated L2 domain.

If the service processors have no need to talk to one another, then it might make sense to treat them in a similar manner; however, it can make just as much sense to have a single broadcast domain that they can all participate in depending on how the product evolves.

Routing Design and Protocols

Each data link in the system will have an IPv6 interface on it that has a link-local IPv6 address. This will be true of the switch ports and each of the server’s ports. A server’s routing information will always indicate that, to reach another server in the rack, the next hop is one of the two switches, and it will always advertise that the paths to its own server /64 are over its link-local addresses. This coincides with the design of isolated L2 networks and reduces the need for ARP/NDP.

We’ve decided not to cross-connect the switches as part of our network design, mostly because of the question of how much cross-sectional bandwidth that would use, and to try to fit into various different evolutionary designs and other systems (originally in 2020 we were looking at using RIFT, which has since transitioned to delay driven multipath as part of our further experimenting and usage). As a result, each server will need to have reachability information for at least the other servers in its rack pushed down. While we may be able to get away with summarizing portions of remote racks higher up, we’ll have to work to ensure that every server doesn’t need to keep track of distinct paths to every other server, in order to maintain routing scalability.

The internal routing protocol that we use is discussed in RFD 196 Maghemite: The Oxide Routing Stack. This protocol and its needs are somewhat distinct from what we use to interface with a customer’s network (see [bs-adv]) which are more constrained due to the need to interface with our customer’s existing environments (see [goal-4]).

One requirement that we believe we should have is that this routing protocol should be able to operate in a mode where only cryptographically signed messages are accepted. We may require different chains of trust depending on what we are doing. For example, using a server-level secret for information within the rack that is trusted by the rack and then using a rack-level secret that is trusted by the Cell or Availability Zone when joining multiple racks together.

MTU

XXX This section was initially forgotten and needs to be fleshed out.

Summary: Per-Switch State

This section summarizes the order of resources that we will need to maintain in the switch for physical networking. This list may seem small; that is intentional, to allow the majority of the resources to be used for Boundary Services.

From a routing table perspective a switch in the rack would need to have entries for:

  • One entry for each remote rack that is directly routable, summarized as a /56.

  • One entry for each remote site, which may be another AZ or Cell, generally summarized as a /48.

  • Up to 256 /64s for the control plane’s bootstrap /56. Realistically, we expect this to be a single-digit, odd number of entries.

  • One route for each server in the rack that it is directly connected to.

We would need to keep track of the following order of MAC addresses, divided into multiple disjoint L2 networks with corresponding VLANs, within the primary switch:

  • Two for each server: One for the SP and one for the host’s NIC.

  • One for each switch port: the Tofino’s 64 ports, plus one for each of the management switch’s ports and the devices themselves.

  • One for the power shelf controller (PSC)

  • One for each device on the other end of an external port.

While these might increase by a constant factor, the good news is that the total order of these resources should not put a major dent in the Tofino 2’s resources, allowing us to leverage its functionality for more interesting purposes.

Network Virtualization and Encapsulation

One of the major network services that we need to provide is giving each customer their own set of private networks with isolation (VPCs). This means that we are going to have customers that are using overlapping IPv4 and IPv6 addresses.

Our goals have several implications in terms of the constraints and considerations for this part of the problem:

  1. [goal-5] talks about us wanting to be able to run instances in any part of the availability zone. This is an important thing for us to keep in mind. While we have a nicely structured and summarized physical network, when running customer instances, we cannot assume anything about the addresses customers want to use and their physical location in the DC.

  2. [goal-3] makes an important point that the solution we choose cannot limit the expansion of a customer inside of an AZ. This means that we need to think carefully about how we design the architectural implementation of VPCs. This doesn’t mean that we won’t have limits or scalability problems; we will. Instead, we need to think about which axis that limit will first show up on.

  3. As part of trying to make [goal-1] a reality, we want to understand where the solutions we want to consider intersect with what the NIC gives us the ability to do. For example, the NIC may be able to provide checksum or cryptographic support.

Background

The main part of the solution space that we’re exploring is the set of various packet encapsulation schemes. This section provides an introduction to them, their use, and the associated trade offs.

Terminology

There are several terms that are used in this space that may be used in different ways depending on the context. This section lays out how they’re intended to be used here.

encapsulation: Encapsulation refers to the idea of taking some portion of a network packet and wrapping it up in another one. For example, one could say that the TCP portion of a packet is encapsulated in an IP packet. However, one can take an entire Ethernet packet (MAC, IP, L4 header, payload) and wrap it up in another packet.

overlay network: An overlay network refers to the idea of having a network that is multiplexed on top of another network in some fashion. The naming comes from the idea of 'overlaying' multiple disjoint networks on top of another existing network. In general, the addressing of components on an overlay network is independent from what it overlays, or uses as a transport.

underlay network: The underlay network is the network upon which multiple overlay networks are multiplexed. It has its own unique addressing that is leveraged. Oftentimes the underlay network is a physical network, though it too can be virtualized.

VLANs

Virtual LANs (VLANs) are a way of segmenting an Ethernet network. Defined in IEEE 802.1Q, they add four bytes to an Ethernet header by setting the Ethertype to 0x8100. Those four bytes define a 12-bit tag value, a priority, and what the real Ethertype is. The 12-bit tag is often called the 'VLAN ID' or a 'VLAN tag'.

Each ID represents its own independent Ethernet domain, allowing for up to 4096 different Ethernet domains to share the same physical infrastructure. Often, switch ports are set up with a list of VLANs that they will accept traffic on, which is often called tagging. In addition, switches have historically had the ability to add and remove tags for a specific VLAN ID when a port sends traffic without a VLAN header.

In addition, most NIC hardware has the ability to account for the VLAN tag, optionally insert it on transmit, note it or remove it on receive, and in general create hardware filters based upon it.
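
To make the header manipulation concrete, here is a small sketch of encoding and decoding the four added bytes described above; it is illustrative only and uses no particular library.

```rust
/// Build the 4-byte 802.1Q tag: two bytes of TPID (0x8100) followed by a
/// 16-bit TCI holding a 3-bit priority (PCP), a drop-eligible bit (DEI),
/// and the 12-bit VLAN ID.
fn encode_vlan_tag(pcp: u8, dei: bool, vid: u16) -> [u8; 4] {
    assert!(pcp < 8 && vid < 4096);
    let tci: u16 = (u16::from(pcp) << 13) | (u16::from(dei) << 12) | vid;
    let mut tag = [0u8; 4];
    tag[..2].copy_from_slice(&0x8100u16.to_be_bytes());
    tag[2..].copy_from_slice(&tci.to_be_bytes());
    tag
}

/// Pull the VLAN ID back out, returning None if the TPID is not 0x8100.
fn decode_vlan_id(tag: &[u8; 4]) -> Option<u16> {
    if u16::from_be_bytes([tag[0], tag[1]]) != 0x8100 {
        return None;
    }
    Some(u16::from_be_bytes([tag[2], tag[3]]) & 0x0fff)
}
```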

A fundamental challenge with our use of VLANs is both their limited nature and the fact that our customers will probably be assigning us specific VLANs for interacting with their physical network. Due to this combination, VLANs are often best left as something that we use for parts of the implementation of the physical network or our underlay network.

Later revisions to IEEE 802.1 added support for nesting Ethernet tags; however, their support is not as widespread and their use case is designed more for bridging disparate networks together.

VXLAN

Virtual Extensible LAN (VXLAN) is a network encapsulation protocol that encapsulates a full MAC frame in a UDP packet with an 8-byte header. The header allows for a 24-bit network ID, which means that there are over 16 million unique values possible. While there is a standardized UDP destination port, the UDP source port can take any value. This is often a simple hash of the headers of the internal flow. This means that for a given inner flow, it will be consistent, allowing for different load-spreading techniques like equal cost multi-pathing (ECMP) to be used.
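
The source port behavior is simple enough to sketch; the following is an illustrative (not authoritative) example of how an encapsulating host might derive the outer UDP source port from a simplified IPv4 inner 5-tuple, so that a single flow always hashes to the same port while distinct flows spread across ECMP paths.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// (src ip, dst ip, protocol, src port, dst port) of the inner flow, with
/// IPv4 addresses represented as u32s purely for brevity.
type InnerFlow = (u32, u32, u8, u16, u16);

/// Pick a stable outer UDP source port for a given inner flow by folding a
/// hash of that flow into the dynamic port range (49152-65535).
fn outer_source_port(inner: &InnerFlow) -> u16 {
    let mut hasher = DefaultHasher::new();
    inner.hash(&mut hasher);
    49152 + (hasher.finish() % 16384) as u16
}
```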

The VXLAN protocol itself suggests that data is transmitted either by having a fixed destination or by sending it over multicast. In practice, many software stacks leverage VXLAN because of its wide-reaching hardware support in both NICs and switch silicon, but supply their own means of determining the underlying destination. This was the approach that Joyent took when building their implementation and there are similarities in Microsoft’s VL2. However, implementations that did this were notably in violation of the RFC.

An initial challenge with VXLAN is that, because it has both an inner and an outer packet, NIC offloads were slow to support it in the form of checksum offload, VLAN insertion, filtering, and segmentation offload.

Most recent-generation NICs can calculate both the inner and outer IPv4, UDP, and TCP checksums. In addition, most NICs that we might consider support TCP segmentation offload in a VXLAN-aware fashion. That is, they will perform TCP segmentation of the inner packet and replicate and adjust not only the inner Ethernet, IP, and TCP headers, but also the VXLAN header and the outer UDP and related headers.

An area where NIC vendors still vary is the ability to direct traffic based on the full set of outer, VXLAN, and inner headers to a specific set of queues. The benefit of being able to direct and associate specific flows to specific NIC queues is that we can attempt to leverage some of the NIC’s QoS features and apply them to specific customers or instances. While these features are useful and important, we need to remember that we may not actually have enough resources from the NIC to cover the total number of instances that we care about.

Geneve

Geneve is another UDP based tunneling protocol that is similar to VXLAN. It has the same 24-bit identification space, makes the same affordance for the source port, and has a default 8 byte header. However, there are a few interesting differences:

  1. Geneve explicitly does not define a control plane protocol or means for the tunnel destination to be looked up. The RFC explicitly leaves it to the implementation.

  2. Unlike VXLAN, the Geneve header has an explicit protocol type field that allows one to indicate what the encapsulated payload is. It is not required that it be Ethernet, though it allows for it and that may be the common case.

  3. Geneve allows for up to 252 bytes (in four byte quantities) of options to be set in the header. These come with a four byte header and are similar to IP and TCP header options. This means that the tunnel can optionally be associated with metadata that might not be normally used.

One interesting thing from the RFC is to see who has already reserved options, such as VMware, Oracle, and Amazon.
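
Since the fixed header is so small, it is easy to sketch; the following illustrates the layout described above (version and option length, flags, protocol type, and the 24-bit identifier), with the function and its arguments being illustrative rather than a proposed API.

```rust
/// Encode the fixed 8-byte Geneve header: byte 0 carries the 2-bit version
/// and 6-bit option length (in 4-byte words), byte 1 the O/C flags (left
/// clear here), bytes 2-3 the protocol type, bytes 4-6 the 24-bit VNI, and
/// byte 7 is reserved. Options, if any, follow the fixed header.
fn encode_geneve_header(vni: u32, protocol_type: u16, opt_len_words: u8) -> [u8; 8] {
    assert!(vni < (1 << 24) && opt_len_words < (1 << 6));
    let mut hdr = [0u8; 8];
    hdr[0] = opt_len_words & 0x3f;
    hdr[2..4].copy_from_slice(&protocol_type.to_be_bytes());
    hdr[4..7].copy_from_slice(&vni.to_be_bytes()[1..]);
    hdr
}

fn main() {
    // For example, an encapsulated Ethernet frame (protocol type 0x6558) on
    // a hypothetical VPC that was allocated VNI 0x00abcd, with no options.
    let hdr = encode_geneve_header(0x00abcd, 0x6558, 0);
    println!("{hdr:02x?}");
}
```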

IP-based encapsulation

Another form of encapsulation that has been used is IP-based encapsulation. This has manifested in various forms, including:

These techniques are somewhat related insofar as they all leverage an IP header. Each one has its own IP protocol value and, in some cases, encapsulates different types of data; the most notable is NVGRE, which is an IP-based analogue of VXLAN in that it has its own ID space and can encapsulate L2 packets.

Unlike the other techniques described, IPsec is designed around the use cases of authenticating and encrypting traffic. To do so, IPsec has a secondary piece to it called IKE (Internet Key Exchange) which is used to build up those relationships.

Encapsulation with Geneve

For Oxide’s initial product implementation, I propose that we perform network encapsulation with Geneve. Concretely, within a region (or perhaps an availability zone), we would allocate every customer VPC a Geneve ID, as well as reserve some for system purposes such as Boundary Services. This will be used in tandem with [sect-opte] and [sect-onds] to provide a full set of network services and network virtualization.

Most of the IP-based protocols do not have this notion of a virtual ID, which is an important part of differentiating traffic. While it could be created with our own work or by abusing an IPsec AH or ESP Security Association context, creating something new and fighting the protocol and others’ interpretations doesn’t buy us anything compared to a protocol with an explicit notion built in.

The physical network will be leveraged as an underlay network. When a flow leaves a virtual machine, it will pass through [sect-opte] and be directed towards the host that has the destination instance. Intermediate switches will not be required to be aware of this, only the compute nodes and Boundary Services.

With this in mind, it’s worth evaluating how this helps us meet our goals and why this solution was chosen over others.

One of the major impacts with the choice of having customers with their own IPv4 networks is that of [goal-5]. Modern network design focuses on reducing the scale of broadcast domains to improve the operational situation. One distinct advantage of using any overlay network based scheme is that it gives this flexibility at the cost of maintaining a database that has the ability to map between the virtualised network interface of an instance and the physical host it is on. However, that is well trodden ground and is employed in aspects of Microsoft’s VL2, Google’s Andromeda, Joyent’s Fabrics, and other OSS implementations.

Ultimately, the ability to put an instance on any host in an availability zone (or perhaps region) is what allows us to meet [goal-5]. While there will be other aspects of the system that might limit this migration (e.g. storage), this flexibility is well worth the cost.

This ties into a broader benefit of this approach: being able to separate the physical network that is used for general communication of instance data and the control plane from that of the instance. The physical infrastructure needs to worry about how to build itself in a scalable way, and if customer instances were directly on that physical infrastructure, that would ultimately devolve into very large L2 broadcast domains, which can inhibit scalability.

Two goals that are related here are [goal-2] and [goal-3]. The use of the network encapsulation as part of the compute node’s network datapath (discussed in more detail in [sect-opte]) basically pushes out most of the heavy lifting of the network to a combination of the compute nodes and boundary services. This gives us a few useful properties. Once a packet leaves a given server, then it will be able to leverage the general availability characteristics of the physical network and its inherent reliability and scalability. There will be nothing specific that we need to do here by employing the use of an overlay network and performing network virtualization.

Fanning out the work of network encapsulation and decapsulation to the compute nodes and boundary services improves our scalability odds. This doesn’t mean that we won’t hit scalability issues by doing this on the server at some point. This reality was pointed out in the Google Andromeda paper. Here at least, we are reducing the flow state table to the flows that are likely active on a given compute node based on the instances present, which will, on average, be a subset of all the instances in a project.

Importantly, this is often comparing itself to the more classic network approach of a pair of devices that provide a single service that every packet has to pass through.

As discussed in the background section, there is a lot of NIC hardware functionality whose offloads help us with [goal-1]. The ability to have the current suite of stateless offloads (checksums, TSO, etc.) and a foundation for doing LRO in software is useful. Notably, not all protocols have these accelerations available. While VXLAN does, and some IP-based encapsulations do as well, it is often not as well defined. While it is possible that we could use a different protocol and restrict ourselves to a smaller set of vendor cards, the evolutionary options provided by Geneve are more appealing.

In addition, most of the IP based options don’t provide a common way for using information from the inner-flow to influence equal-cost multi-pathing without teaching intermediate nodes about that. While there’s no reason we can’t, being able to leverage something that has it built-in is an advantage.

Evolving our use of Geneve

As much as any of the above, [goal-7] is a core reason to pick Geneve for use here. When we compare VXLAN and Geneve, both have similar hardware support, but Geneve gives us more evolutionary flexibility, so there’s no reason that we shouldn’t pick Geneve today.

There are already ways in which we can use the options in the header. For example, we could insert various things like an instance, project, or organization UUID which would allow us to better track or aggregate data flows at the switch. While it is limited to 252 bytes, that still gives us a pretty good degree of flexibility.

When we compare this to the IP-based suite of protocols that are available for doing this, starting with UDP ultimately gives us a certain bit of flexibility. It means that, if needed, we can deal with a 6:4 translation (which will almost certainly happen). It also means that we reduce the likelihood of hitting middleboxes that won’t support this when we need to bridge across a customer’s broader environment to support multiple regions. Admittedly, it’s rather unlikely that we would find ourselves in a case where we would send any packet across a broader network unencrypted and it would likely be wrapped in something else that has solved that problem.

Oxide Packet Transformation Engine

The Oxide Packet Transformation Engine (OPTE) is a software component that is the cornerstone of the architecture that we’re proposing. We call it a 'transformation engine' because it is responsible for doing the heavy lifting of taking a packet from an instance and transforming it into the form that will be put on the wire and doing the opposite. Critically this means performing a number of actions including:

  • Applying Firewall Rules

  • Applying Routing Rules

  • Geneve Encapsulation and Decapsulation

  • Performing NAT rewriting

  • Some amount of traditional 'virtual switch' functionality

  • ARP, NDP, DHCP, and SLAAC support

  • Bandwidth and Flow Monitoring and Enforcement

Importantly, what we are proposing is an engine. By that, we mean that there is a degree of programmability. As we’ll discuss in the rest of this section, the flexibility of having a programmable interface cannot be overstated.

High Level and Vision

The high level vision that we propose incorporates a combination of ideas from both those articulated in RFD 9 Networking Considerations as well as aspects that we’ve learned from the different implementations discussed in [sect-past].

At the core of OPTE is actually an execution environment itself. The core idea is to be able to build a Rust-based execution engine into which we can feed a series of instructions, rules, and transformations that should occur in a connection-oriented fashion. The main goals for the Rust-based execution engine and rule processing are:

  1. To give us a solid basis for writing programs that we can use to manipulate packets and update independently of the underlying engine where this is actually deployed. This will accelerate our ability to develop these programs, allowing us to actually deliver additional, incremental features.

  2. Have a system that gives us a means of easily serializing and transferring state as part of live migration and upgrade.

  3. If we have our own execution environment one can actually imagine having multiple targets of this environment. While the initial target is based on living in the existing x86 host CPU, one could imagine several additional targets:

    • The CPU core that is performing actions on current NICs from vendors such as Intel, Chelsio, etc.

    • As a Hubris-based application in the ARM cores that are showing up in various forms of SmartNICs in the future.

    • As an execution engine on a soft-core that is embedded in some of the SmartNIC paths that are using FPGAs.

    • As something that could be potentially evaluated even in Switch ASICs!

The hope is that with this engine we would be able to deliver different plugins that can operate in a layered fashion, similar to what Microsoft describes in VFP. With this we could build properties that allow us to perform actions like encapsulation, NAT rewriting and state tracking, determine whether a firewall allows or denies traffic, and more.

OPTE is made up of two major components:

  1. A dataplane component whose jobs are effectively:

    • Classification of traffic from the network and guests

    • Applying transformations and forwarding packets

  2. A control plane oriented agent that handles things like:

    • Asking the control plane for the appropriate transformations for an instance and updating them as appropriate.

    • Handling non-trivial actions such as translating DNS and metadata requests into the corresponding control plane calls.

This split is useful for a couple of reasons. The first is a separation of concerns. In essence, the dataplane part of it may be embedded into operating system kernels or some other semi-privileged state, and we don’t want to bake in the question of how to actually find and track these updates or perform these functions. The means by which we talk to and communicate in the control plane will naturally want to evolve separately from the data plane and vice versa, and aspects of these may also vary based on the environment that we find ourselves in.

For a visual version of how this fits into the broader environment, see [img-compute].

While the full design and architecture need more thought, the rough idea here is that we can install a series of per-guest virtual NIC layers of transformations. Each layer will potentially look up existing state (for example, routing rules) or allocate state (firewall rules) and then may push on a set of transformations to apply once all of the layers are processed (e.g. NAT or Geneve encapsulation). Some examples of how OPTE fits into packet flows are: [flow-inst], [flow-nat], and [flow-extip].
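
A hypothetical sketch of that layered structure is below; none of the names come from the actual OPTE implementation, and the packet is reduced to a byte slice, but it shows the intended shape: each layer can consult or create state, and transformations are only applied after every layer has run.

```rust
use std::net::{Ipv4Addr, Ipv6Addr};

enum Verdict {
    Allow,
    Deny,
}

/// Transformations a layer may queue up, applied only at the end.
enum Transform {
    RewriteNat { new_src: Ipv4Addr, new_port: u16 },
    Encapsulate { vni: u32, underlay_dst: Ipv6Addr },
}

trait Layer {
    /// Inspect the packet, look up or allocate state, and optionally push a
    /// transformation onto the pending list.
    fn process(&mut self, pkt: &[u8], pending: &mut Vec<Transform>) -> Verdict;
}

/// Run every layer in order; a Deny (e.g. from a firewall layer) stops
/// processing, otherwise the accumulated transformations are returned to be
/// applied to the packet (NAT rewrite, Geneve encapsulation, and so on).
fn run_layers(layers: &mut [Box<dyn Layer>], pkt: &[u8]) -> Option<Vec<Transform>> {
    let mut pending = Vec::new();
    for layer in layers.iter_mut() {
        if let Verdict::Deny = layer.process(pkt, &mut pending) {
            return None;
        }
    }
    Some(pending)
}
```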

Packets in the dataplane that require additional processing or are being implemented by a specific control plane service, such as DNS requests for the internal DNS server or instance metadata, will be passed into a bounded queue for processing by the control plane agent. The control plane agent will take care of any necessary rate-limiting, process the packet, and then inject a reply back into the guest network. Examples of these flows are: [flow-dns].
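
A minimal sketch of that hand-off, assuming nothing more than a bounded standard-library channel between the dataplane and the control plane agent, might look like the following; in particular, a full queue must never block the dataplane.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

fn main() {
    // Bounded queue of punted packets; 128 is an arbitrary illustrative depth.
    let (tx, rx) = sync_channel::<Vec<u8>>(128);

    // Dataplane side: punt a DNS or metadata packet without ever blocking.
    let punted = b"captured guest packet bytes".to_vec();
    match tx.try_send(punted) {
        Ok(()) => {}
        Err(TrySendError::Full(_)) => { /* count a drop; the guest can retry */ }
        Err(TrySendError::Disconnected(_)) => { /* the agent has gone away */ }
    }

    // Control plane agent side: rate-limit, translate into control plane
    // calls, and inject a reply back into the guest network.
    while let Ok(pkt) = rx.try_recv() {
        let _ = pkt;
    }
}
```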

In addition, the control plane agent will be responsible for coordinating the snapshotting and freezing of state for live migration, as well as processing and coordinating with the control plane.

Server Orientation

XXX This may want to be a top-level section that describes why we’ve chosen where to break out different components.

A critical aspect of OPTE is that it exists on each server. This has several implications and follows from our goals. While it is easier to maintain the overall set of routing and firewall rules to apply when there are fewer places with that information, that has an inherent challenge with scalability and availability. At the heart of the decision to push this out to the server are [goal-1], [goal-2], and [goal-3].

Ultimately, in this overall design and architecture, there are a few different places that we could do this processing. When we’re trying to think through this, the things we need to ask ourselves are:

  1. How much state do we need to keep to track all of this? Effectively, we need to track the routing and firewall rules for each VPC and then for each instance on that VPC we need to figure out where to keep that state.

  2. How much state can we keep in different devices?

  3. What’s the throughput and latency of doing this rule processing?

With the proposed structure there are primarily two different extremes that we can start with to manage these questions of state, validation, and transformation:

  1. Packets from guests enter the network as is and the network fabric determines what can and can’t be sent.

  2. The network fabric is more of a transport fabric and those seeking to produce or consume traffic are responsible for taking care of making sure what should and shouldn’t be sent.

For the sake of this discussion, when we talk about the 'network fabric' we’re referring to the switches and routers that forward traffic. The producers and consumers generally refer to the servers themselves and boundary services.

The proposed initial bias is to have the network fabric itself be simple and instead push that complexity on the server node. Fundamentally, this is because we believe this will give us a better foundation for creating a more scalable and rich implementation. There are a few reasons for this:

  1. If the switches are responsible for all policy enforcement we’re going to have potential issues with the size of that state. While it may work in the simple cases or fit in the single rack world, we don’t believe that it’s going to work in the larger designs. While there is some flexibility in how we can allocate that state, relying on them for say every instance’s NAT state table may be problematic.

    We believe that it is easier for us to scale out the processing, and reduce the actual size of the required state if we push that into the server (or eventually a SmartNIC on the server). While this comes with an increased cost in communicating out changes, for our initial and even moderate scale environments, we believe that this should be feasible.

  2. A simpler general fabric makes it easier for the physical network to autoconfigure and set itself up and helps us with some of our time to market considerations.

  3. We would like to initially leverage the switch’s resources for use in implementing boundary services. This gives us a nice story at small rack counts by reducing the need for additional, dedicated networking hardware, while still letting us take advantage of the programmability features of our switches.

  4. We can focus on incrementally adding features to the switches that can leverage the additional packet options that we add to Geneve. See [geneve-evolve] for some additional thoughts there.

  5. Due to the programming model of the switch, we believe it should be somewhat easier to create less impactful, seamless updates of software on the server than to reconfigure the P4 programs on the Tofino switch. That said, a fair counter to this is that we need to make that operationally painless regardless.

Limits and State

A fundamental challenge that we have with OPTE is bounding state. While it’s tempting to try and pretend we don’t need to track state, to implement the features we’ve outlined in RFD 9 Networking Considerations and RFD 21 User Networking API we need some amount of per-flow state. However, this state is also the enemy. Many network operators have had to debug the problem of the firewall or NAT state table that is full, causing all subsequent connections to fail.

Fundamentally, in OPTE we have a tension between having limits to help protect the broader system and the knowledge that when these limits are hit there are dire consequences for customers’ instances being able to continue behaving in the ways that they expect. The other challenge is that sizing these limits is difficult. As a starting point, we can draw from other cloud providers who document this information, but the most important thing we need to do is actually provide end-to-end alerts and monitoring in the product for these types of facilities.
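
As a sketch of the kind of bookkeeping this implies (not a proposal for the actual data structure), a bounded flow table needs, at minimum, a hard limit and a counter that can be surfaced through alerting when that limit causes new flows to be rejected.

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

/// Simplified IPv4 5-tuple used as the flow key for illustration.
type FlowKey = (Ipv4Addr, u16, Ipv4Addr, u16, u8);

struct FlowTable<S> {
    flows: HashMap<FlowKey, S>,
    limit: usize,
    /// Incremented whenever a new flow is refused; this is the number that
    /// needs to feed end-to-end alerts and monitoring.
    rejected: u64,
}

impl<S> FlowTable<S> {
    fn new(limit: usize) -> Self {
        Self { flows: HashMap::new(), limit, rejected: 0 }
    }

    /// Track a new flow, refusing it (rather than evicting silently) once
    /// the table is full.
    fn insert(&mut self, key: FlowKey, state: S) -> bool {
        if self.flows.len() >= self.limit && !self.flows.contains_key(&key) {
            self.rejected += 1;
            return false;
        }
        self.flows.insert(key, state);
        true
    }
}
```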

HVM Integration

An important question is how the broader networking infrastructure interfaces with a customer’s hardware virtual machine. At the end of the day, we need to choose a type of device that will show up inside of the guest. There are several important aspects to this consideration that we need to address:

  1. Guest support for the interface ([goal-4])

  2. The interface’s impact on guest performance, security, and the ability to support various offload functionality such as checksums, TCP segmentation, etc. ([goal-1])

  3. How the interface choice impacts our ability to evolve the system ([goal-7])

  4. How the interface impacts features like migration ([goal-5])

Background

Device Emulation

As hardware virtual machines have evolved, so too have the ways that we map devices into them. There is a large spectrum of different strategies that have been employed, which we can roughly group into three different categories:

  1. Faithfully emulating an existing device in the hypervisor

  2. Creating a virtualization-environment specific device (e.g. Virtio)

  3. Passing through a physical hardware resource in some fashion (e.g. leveraging SR-IOV)

In early days of VMware and other hypervisors, the hypervisor would emulate a specific hardware device such as the Intel 82545EM 1 GbE NIC. In this case, when the guest operating system read from or wrote to a specific register of the device’s PCI configuration space and BAR, that would generate an exit from the virtual machine into the hypervisor which would process the request and then return to processing in the guest. This means that reading statistics registers or trying to update the state of packet descriptor rings would all cause guest exits.

To try and mitigate the guest driver problem, whereby a guest needed to have support for the device driver, hypervisors would often support several different devices. However, a problem with this is that the emulated hardware did not often behave the same as the physical hardware. For example, VMware does not faithfully emulate the multiple MAC address filters of devices like the Intel 82545EM. And as luck would have it, while every OS uses the same 80% of the features, each one uses a different 5% of the remaining features. This means that providing an experience that fits in with the native driver requires you either to do a bunch of additional work that may not make sense in a virtualized context or to have the ability to identify yourself and specifically work around things in guest operating systems.

The next major approach that was taken was to create custom devices that admitted that they were in a virtualized context. This technique is often called para-virtualization. Here, rather than trying to leverage an existing device that the operating system supports, there is instead a device presented that the guest needs explicit knowledge of. This device often has different properties to traditional hardware devices in terms of design. They often focus on using shared memory interfaces or taking advantage of other techniques that will minimize the number of exits. They also present themselves as much more simplified devices. For example, with a networking device driver, they generally have no notion of phy management as the phy is virtual.

One of the more prominent implementations here is the virtio device family. These generally show up to a virtual machine guest as a PCI device, though there is support for a purely memory-mapped form. While some reads and writes to memory that trigger virtual machine exits are still performed, a number of data structures and state are explicitly kept in shared memory that is synchronized with memory barriers.

Finally, there has been a large push towards leveraging processor technology such as AMD’s AMD-Vi and Intel’s VT-d and PCI’s single root I/O virtualization (SR-IOV) that allow for one to pass through an entire physical PCIe device, a physical function of a device, or to create a virtual function.

When a physical function or device is passed through, then the operating system is allowed to more or less interact with it directly. Unless the hypervisor is interposing on a region of the device, manipulating it generally does not cause a virtual machine exit. As minimizing exits can lead to much better performance, this is often a useful approach. Of course, it does require that the guest operating system have a device driver for that.

Now, there are a limited number of PCIe lanes and a limited amount of external cabling that one usually wants to use, so these days passing through an entire physical function is generally not done where there are tenancy concerns. Instead, a virtual function is passed through.

A virtual function often requires a variant on the normal device driver for a piece of hardware and, when careful planning is not taken, often requires a device-specific driver. Consider, for example, Intel’s NIC line. The 10 Gbit family based on Niantic (often called ixgbe) has a virtual function driver that is entirely different from the 40 Gbit family based on Fortville (often called i40e). These are also different from what’s used for Intel’s 1 Gbit line. While Intel has finally introduced a specification for what they expect future devices to implement so that this burden is reduced, it highlights a problem in the area: different devices require different drivers.

It’s worth pointing out that other NIC providers are not quite as pessimal as Intel. Generally they manage to stick with a single driver interface for multiple generations of cards and as a result they only need a single virtual function driver for the entire line. However, these are still device specific. If you’re using a Broadcom device and switch to an Intel device, you will not use the same driver.

One challenge with this design is that of migrations. Many hardware interfaces weren’t designed with the idea of migration in mind. For example, if you get an ID from hardware which is sharing its resources between all functions, that identifier might not be scoped to a virtual function. However, the guest driver will have no way of knowing that the IDs it received are no longer valid.

The Problem with VM Exits

Virtual machine exits are a challenge in virtualization performance. Any time that you’re exiting the guest to do work in the hypervisor it means that you’re borrowing CPU time from the guest and injecting latency that might not exist normally. For example, on physical hardware, when you write to a register in PCI configuration space it generally does not entirely stop execution (even if there is blocking for an instruction to retire as it creates a PCIe transaction). Generally speaking, the fewer VM exits that a guest performs, the better.

The cost of a VM exit is worse these days due to micro-architectural flaws in CPUs and the assumption that more will exist. Many hypervisors opt to expose simultaneous multi-threading to the virtual machine guests. One of the results of various vulnerabilities such as L1 Terminal Fault or microarchitectural data sampling is that multiple threads that share hardware resources need to be in the same security domain at the same time. This means that if one thread exits into the hypervisor for some reason, if care is not taken, other state of the hypervisor may be exposed in a way that is vulnerable to various forms of side-channel attacks.

To that end, when a virtual machine guest exits on one thread, in some hypervisors, the hypervisor will force the other threads that are part of the same core to exit and cause them to halt until the processing is done.

While there are many different ways to mitigate this and to reduce the potential for information leakage, through things like changing the virtual address ranges that are mapped while processing in the hypervisor’s kernel module, we need to consider this when deciding upon the guest networking interface.

Changing Guest State

Another thing that we have to be wary of is how the choice of device impacts the guest. The device and the interface that it speaks end up being baked into various aspects of most operating systems. For example, it ties into the NIC enumeration order in the guest. Many systems will include the driver or instance name in the configuration for it. On many Linux distributions, aspects of the PCI layout end up in the device naming.

Needless to say, we want to make sure that whatever devices and ordering we present to customers in a virtual machine instance do not change across the lifetime of the instance. Changing them could cause networking in a guest to no longer come up after a reboot.

Existing Providers

It’s worth taking a look and understanding what different providers do in this space today.

Amazon has gone through a large number of different iterations of networking which vary based on the instance type. In the Xen era, they leveraged paravirtualization. For a while, they had some amount of passthrough of the Intel 82599 (also known as the X520). Today, the Nitro instances and others have support for an Amazon specific device that appears in the instance through SR-IOV. These are known as the Amazon Enhanced Network Adapter (ENA). Linux and FreeBSD drivers are on github.

Google Cloud Platform initially started by leveraging the Virtio interfaces for all of their guests. However, they have developed their own device driver that is passed into the guest via SR-IOV called the gVNIC which leverages the gve driver they have written for Linux and Windows.

Microsoft Azure has two different network interface modes. One is to perform paravirtualization based on Hyper-V’s specific interfaces. The other is to allow for SR-IOV of the Mellanox ConnectX-3 and ConnectX-4 Lx. See their docs for more background.

VMware provides an extensive array of capabilities. They offer a wide array of emulated devices. Many customers will leverage the VMXNET line of paravirtualized devices and then they support the SR-IOV solutions that different hardware supports. Because they are not selling hardware, they do not have specific constraints as compared to the cloud providers.

Hyper-V looks similar to VMware. The major differences are that it emulates many fewer devices and it has its own specific para-virtualization interface. It does not support VMware’s VMXNET or the widely-used Virtio.

Other hypervisors like QEMU/KVM and bhyve are similar to VMware. They have support for a number of emulated devices and they do support SR-IOV solutions as well. The most common use here is with Virtio, giving a common paravirtualized support.

Virtio

I propose that we use Virtio in our product as the initial interface for networking between the guest and the host. This does not suggest or require the use of Virtio between other parts of the product.

With respect to the goals that we laid out, Virtio does help us on a number of them. There is wide driver support in a number of operating systems, including, but not limited to:

  • Linux

  • Windows

  • *BSD

  • illumos

  • Haiku

  • 9Front

  • SCO OpenServer

This wide OS support is important. This is in support of [goal-4]. While some emulated devices have a wide degree of support, they are often widely supported because they are older devices, which means that they aren’t suitable for use with SR-IOV. Instead, if we look at things like SR-IOV devices, the support matrix is much more scattered. Even things like AWS’s ena and GCP’s gve have fairly limited platform support as compared to Virtio, not that they don’t have other merits. An important assumption going into this is: Oxide does not have the resources to write a high quality networking device driver for multiple, different operating systems for our 1.0 product.

Virtio does support a number of offload features and allows us to negotiate support for IPv4 and IPv6 based checksums, segmentation offload, and large receive offload. For guests that are primarily doing HTTP-style traffic with larger assets, these are all features that dramatically improve the performance for the guest. If a guest is focused on the packets per second rates or focused on small packets, then this is less advantageous. Virtio also supports the ability to wire up multiple transmit and receive queues and put together something like receive side scaling through it. While it may not be as featured as some device drivers and not all guests necessarily take advantage of it, it does provide us a useful tool.

With that in mind, it’s worth talking through why we favor Virtio over emulating devices. In general, because the OS support is equal, the main advantage we have with Virtio comes down to the main advantages of para-virtualization over device emulation. Strictly, the number of reads and writes that will trap out is reduced and you have something that is designed knowing it is virtualised. This makes the corresponding device drivers simpler and easier to reason about in the guests.

In addition, because the commonly emulated devices are older, they often don’t have support for things like TCP segmentation offload, large receive offload, or sometimes even things like IPv6 checksum offload. This all flies in the face of [goal-1]. If we had to choose between the two, there’s basically no reason not to use Virtio over an emulated device in this part of the space.

When we compare this to using an SR-IOV and virtual function style solution there are a couple of different considerations for us.

  1. In most cases an SR-IOV virtual function interface ties us to a single vendor’s implementation. Unless we carefully plan around what that interface is, we will tie certain instances and generations to a specific hardware implementation. It certainly will be harder to get vendor A to implement vendor B’s virtual function interface. Though, a neutral Virtio may be easier. This is related to why we see cloud vendors doing their own interface and driver, though there are a myriad of other reasons as well.

  2. For many devices, it’s easier for us to manage the state to support a transparent live migration than it is for many existing SR-IOV based solutions. This doesn’t mean it’s impossible, but many of the devices aren’t designed to make the movement of this data easy. This and the previous consideration make the issue of [goal-5] harder to achieve with SR-IOV than with Virtio. We have to consider not just migrating device state, but also the problem of whether or not other compute nodes are even compatible. If we’re limited by the device, that further constrains us.

  3. There are limits on the number of virtual functions that devices support. In older devices, such as the Intel X520 and related family, it is 64. Devices such as the Mellanox ConnectX-5 EN support up to 512 virtual functions, while Intel’s forthcoming E810 only has 256. This puts a fundamental limit on the scalability of the unit and puts an upper bound on tenancy, especially if we implement support for multiple interfaces inside of the virtual machine.

    While one can initially say that we’re not looking at such a high tenancy, given the talk of spot instances or lambda-like solutions, it is pretty easy to imagine us blowing past these limits, given that the old Joyent no.de PaaS service jammed over 200 containers of 256 MiB each onto systems with 48 GiB of DRAM and maybe 12-16 cores back in 2010/2011. This makes it pretty conceivable that we could exceed these limits with micro instances. This ties back to [goal-3] in part.

  4. A solution that leverages SR-IOV will, on average, have better performance than the corresponding Virtio solution. This is because there is no need to exit and leverage the host CPU. This does make SR-IOV more advantageous for [goal-1]. However, as proven by GCP and others, we should be able to make Virtio pretty competitive in this design, which will be discussed in more detail.

  5. For us to implement the other network services that we need, just having SR-IOV is insufficient. We need the ability to process and transform the packet. While it’s tempting to try and do that in the switch, that may have problems with [goal-3]. This means that we would need to go down the path of looking for a NIC with additional CPU cores on it, which gets into the world of SmartNICs. While being able to support such an environment and evolve into it would be nice, it’s not clear if it’s the right initial choice for us.

Based on the above trade offs, it seems like leveraging host-based Virtio is the right initial choice for us, both today and directionally. While the virtual function interfaces of products on the market may be somewhat snappier, they create more complications in terms of managing those devices. In addition, even if we went down an SR-IOV path, we’d want to actually have a Virtio-like interface for that, or some other widespread, vendor-independent device interface, as the only thing that we know for certain is that we will likely change and evolve the host interface and host side of things.

Evolving Virtio

We need to think of the choice of Virtio or something else as the beginning of a journey and what options it gives us in the future, not just where it leaves us today. As discussed elsewhere, we are planning on leveraging the host to do a bunch of the per-packet processing.

The Doorbell

The first way that we can evolve or enhance Virtio is through the introduction of what we have called 'a doorbell'. One of the challenges with Virtio is that guest exits are involved in most operations to kick off processing of state and when certain updates occur. While such an exit would need to basically signal OPTE, we can instead ask ourselves: what if we actually leverage just a bit of the PCI virtualization features?

Effectively, if we look at what matters from a Virtio perspective, we can actually rig up a small PCIe device that we remap some of the Virtio BAR to. When manipulating a Virtio queue in normal operation, after updating shared data structures, all we need to do is poke a register in a BAR. This PCIe doorbell would then inject interrupts as required with metadata or provide a means for polling from the CPUs on the host that are processing it. This gives us a way to eliminate the exits that are required in the data path.

Because this is a low-throughput device, we don’t need much more than a single lane of PCIe nor a very fast device generation. However, we would want to think through what the scalability is here. Given the desire for different classes of service, an optimization that can be used for larger or more permanent instances is a viable option.

SmartNICs with Virtio-based Virtual Functions

Another possible evolution that could be invisible to customers is to at some point have a custom SmartNIC that exposes virtual functions based on Virtio. In this way, we could decide to push the same version of OPTE into the cores of the SmartNIC. While there are similar scalability challenges to consider as mentioned in comparing Virtio para-virtualization with other approaches, we can choose to offload the subset of customer instances that require a higher class of service than others to such a NIC.

One advantage of this approach as compared to other SR-IOV solutions is the fact that we’re using Virtio. As a result, we should be able to construct a means of migration between servers with and without this hardware, giving us a bit more flexibility.

While there may be other challenges associated with going down a SmartNIC path, the important thing is that choosing a para-virtualized Virtio doesn’t actually stop us from going down this path at all. And there is something compelling about exploring running instances of OPTE in Hubris on such devices and being able to leverage the same engine in multiple spots in the network.

Boundary Services and Customer Integration

Boundary Services refers to a group of logical services that exist in the physical network and provide a path between an Oxide environment and the customer’s broader network or the broader Internet in the case of SaaS-style customers.

Boundary Services has two major responsibilities:

  1. Speaking routing protocols to advertise 'external' IPs to the customer’s broader network and ancillary protocols (e.g. LLDP) for interfacing and diagnostics.

  2. Transforming network traffic that is crossing the boundary between the Oxide environment and the customer’s broader network into the appropriate form for each side and directing it to the right spot.

Let’s take each of these apart in turn.

Advertising the Network

One of the principal jobs that we’ll have is advertising our network to the broader customer environment. While the customer will be giving us control of a specific range of IPv4 and IPv6 addresses, it will be our responsibility to advertise them to the broader customer network in the form of a routing protocol. In general, there are three major forms that this takes:

  1. BGP

  2. OSPF

  3. Hardcoded static routes

From initial conversations we’ve had with different customers, we’ve come to the conclusion that BGP is probably the most important of these three options to support out of the gate. In the spirit of [goal-4], we should start with BGP and then prioritize the hardcoded static routes. As we become successful, need to interface with more customers, and see customer demand, we should look at implementing OSPF support in boundary services.

Ancillary Services

While the network protocol routing is the most important service that we need to implement in boundary services (ignoring the packet transformations), there are a couple of other services that we’ll need to consider and possibly add support for. These include:

  • LLDP - The Link Layer Discovery Protocol started as a means to discover what devices were on either end of a switch port. For many DC operators this is an important tool in understanding the connectivity of the DC and, importantly, what is wired up at a physical level. While we want this in the broader Oxide physical network, it is especially important for boundary services, which will be interfacing with the customer’s broader network, where this may be our only means of knowing which devices and ports we are theoretically connected to.

  • PTP - Some of our customers may choose to utilize the precision time protocol. While RFD 34 Time is still to be written as of the writing of this RFD, to preserve optionality we may need to build some support for PTP into this part of the suite.

Packet Transformations

The Oxide physical network treats itself as an isolated series of L3 IPv6 subnets which is done in service of our goals and gives us a solid foundation for building modern networks. In addition, it serves to isolate us from the varying implementations of our customer’s networks, their IP allocation schemes, and other details.

As we laid out in [sect-virt], we are currently planning on leveraging network encapsulation and virtualization in part to implement VPCs and in service of maintaining [goal-3] in the face of [goal-5]. To do this, boundary services needs to be able to translate traffic between two formats: encapsulated traffic on the Oxide physical network and native traffic on the customer’s broader network.

If you walk through the example flows [flow-nat] and [flow-extip] you will note that all instance related traffic is always encapsulated on the physical network, thus allowing the physical network to ignore the addressing that customers use.

To facilitate making this a reality, when boundary services receives a packet from the broader customer network, it will need to have enough information to encapsulate that packet and forward it to the correct host on the physical network. Conversely, when a packet comes from the Oxide physical network destined for the customer network, boundary services will simply need to decapsulate it and forward it onto the customer side.

To implement this we actually look, initially, to our choice of switch silicon — Tofino. One of the challenges that we have in the initial buy of the rack is to minimize the amount of compute overhead that we are using. Because, as we’ve discussed earlier in [opte-server] and elsewhere, we’re trying to minimize the actual router requirements for the physical network, this means that we have plenty of switch resources leftover exactly for this purpose. Therefore our plan is to initially rely on the switch itself for doing this transformation!

This all implies that boundary services needs all the state for the following:

  • It needs to know, for every ephemeral and floating IP, the corresponding host to forward traffic to.

  • For any IP address used in NAT, it will need to also look at a subset of the port range to determine where to send the packet.

Unfortunately, this means that when provisioning happens and ephemeral IPs, floating IPs, or NAT state are allocated, we need to be able to update boundary services with this information. However, the silver lining here is twofold:

  1. While we need to keep mappings of information, these are not stateful mappings like a firewall might need. Instead, all of the per-connection stateful mappings are being managed at the server.

  2. The size of this state should be smaller than that required for all instances to be able to talk to one another. In other words, the size of the boundary services state should be dramatically smaller than the size of the state would be if the switch was required to understand all of the state discussed in [opte-server].
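
To illustrate the shape of that mapping state (with invented names and types, and no claim about how the real tables will be laid out in the switch), a lookup for inbound traffic needs roughly the following: a floating or ephemeral IP resolves directly to an underlay address and VNI, while a shared NAT address additionally consults the destination port to pick the owning server.

```rust
use std::collections::HashMap;
use std::net::{Ipv4Addr, Ipv6Addr};
use std::ops::RangeInclusive;

enum ExternalMapping {
    /// Floating or ephemeral IP: the whole address belongs to one instance.
    FullIp { underlay: Ipv6Addr, vni: u32 },
    /// Shared NAT IP: each server owns a slice of the port range.
    NatSlices(Vec<(RangeInclusive<u16>, Ipv6Addr, u32)>),
}

/// Resolve an inbound (external IP, port) to the underlay address of the
/// hosting server and the Geneve VNI to encapsulate with.
fn lookup(
    table: &HashMap<Ipv4Addr, ExternalMapping>,
    dst_ip: Ipv4Addr,
    dst_port: u16,
) -> Option<(Ipv6Addr, u32)> {
    match table.get(&dst_ip)? {
        ExternalMapping::FullIp { underlay, vni } => Some((*underlay, *vni)),
        ExternalMapping::NatSlices(slices) => slices
            .iter()
            .find(|(ports, _, _)| ports.contains(&dst_port))
            .map(|(_, underlay, vni)| (*underlay, *vni)),
    }
}
```

These lookups carry no per-connection state, which is what keeps the boundary services tables small relative to the per-flow state kept by OPTE on the servers.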

Oxide Network Directory Services

So far, we’ve talked extensively about requiring a large amount of state to not only be kept, but also be synthesized and synchronized out to all of the different hosts in a given scope and space.

There is a lot about the design and implementation of this that requires additional research and exploration. There are a couple of key ideas that we will need to keep in mind:

  • The backing data store will need to be highly available. If this is down, then networking may not function. This is an important piece of making [goal-2] a reality. However, this also means that it can be wise to err on the side of keeping data cached so as to minimize the impact of such an issue.

  • We should carefully consider what our limits should be to reduce the amount of state that needs to remain consistent and coherent. This will make it easier for us to shard this data as customer environments grow.

  • As many of these requirements coincide with what the control plane requires, we should determine if we can leverage the same state for this.

  • While we know that certain techniques that we start with may not function when we get to the GCP scale based on [past-goog], it also means that some of the simpler techniques that they employed and that were used in the implementation of Joyent’s [past-fab] will suffice to help us get a product out to market.

  • While we don’t need microsecond-level updates, these updates will impact our customers’ ability to make run-time changes and therefore, in the spirit of [goal-1], they do need to feel 'fast'.

Network Services and Observability

A major problem that we face is that of [goal-6]. We are going to be going into customer environments and likely running into various organizational pitfalls. For as many people as have tried to say 'The Network is the Computer', many more say 'The network is the problem'. Therefore, one of the things that is paramount to making the product successful is to be able to eliminate the mystery of the network as a black box.

If we think about the different types of roles that will be using the product as elaborated in RFD 78 Customers, Roles, and Priorities, we can break this down into two different categories:

  1. People who are infrastructure focused and want to primarily understand the physical layout and capabilities. They care about understanding their customers’ applications when there are problems.

  2. People who are application focused and want to primarily understand their application. They don’t want to think about the physical infrastructure, but if it is responsible for a problem they have then they care.

For these different categories, let’s talk through what we want to actually put together.

Physical Observability

From a physical perspective, operators have often relied on scraping switches or leveraging SNMP to build dashboards where they can look at switch ports and understand questions like:

  • What ports are connected to what and are the links up?

  • How much throughput is going on each port?

  • Are any of my ports saturated?

  • Are any ports showing errors?

To that end, we believe that in the broader product we need to be able to visualize this data and make it actionable. See RFD 82 Motivations and Principles for the Design of Operator Facilities for a larger discussion of the rationale.

To be able to build this there are a couple of things that we need to be able to establish. The first of these is topology. To build up topology, we propose that all of the servers and switches in the environment support the link layer discovery protocol (LLDP). This will allow us to exchange information between servers, switches, and other network equipment about what is connected to each switch. Note that this information cannot be relied upon for correctness of the implementation, but rather it should be used to help us diagnose and understand problems. This is analogous to what is proposed in RFD 40 Physical Presence and Location: Requirements and Architecture for understanding the physical layout and composition of the rack.

While LLDP gives us the ability to discern and create topology, we also need to gather regular metrics and feed them into whatever broader solutions for metrics the rack ends up developing. We need regular metrics from our switches and all of the NICs in them. While the full list of metrics that we care about is deserving of its own RFD, at a high level it’s about utilization and errors, with simple breakdowns based upon different dimensions such as packet size, unicast versus multicast, and packet type such as IPv4, IPv6, TCP, UDP, etc.

Application Observability

When we consider the application team’s perspective they generally don’t care about the actual physical network itself, unless there’s a problem. Instead, they generally care about things that impact their application. Common situations here may be things like:

  • Accidentally sending all of your traffic to another instance via the Internet.

  • Having a load balancer configuration that somehow isn’t evenly spreading load.

Much in the same way that the physical view wants to have topology information, we want to eventually be able to construct the same thing but for the application and answer the question of what is talking to what. Unlike the physical view, this is a much more fluid and dynamic construct and is in many ways related to the proposed future work of Firewall Flow Logging.

One possible way to imagine building this is to leverage the information that OPTE has in its firewall tables to build up these pictures. Because this may be driven by snapshots, it may not be the most accurate picture, but it is a useful starting point. The one nice thing about OPTE is that regardless of where it lives, it will have to deal with all data from the guest at some level, which gives us a picture of what connections exist.

Similarly, because OPTE does see all the data, we can consider how we evolve it to track things like per-flow data, which could even allow us to create the equivalent of netstat-style statistics without needing an agent in the guest.
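
To make that concrete, the sketch below shows per-flow accounting keyed by the usual 5-tuple, which is the kind of state such netstat-style statistics would need. It is illustrative only and does not reflect OPTE's actual internal data structures.

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// A flow key in the usual 5-tuple form. This mirrors the kind of state
/// OPTE's firewall layer already has to track; the representation inside
/// OPTE itself may differ.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct FlowKey {
    proto: u8, // e.g. 6 = TCP, 17 = UDP
    src: IpAddr,
    dst: IpAddr,
    src_port: u16,
    dst_port: u16,
}

/// Counters that would back netstat-style statistics without a guest agent.
#[derive(Debug, Default, Clone, Copy)]
struct FlowStats {
    packets: u64,
    bytes: u64,
}

#[derive(Default)]
struct FlowTable {
    flows: HashMap<FlowKey, FlowStats>,
}

impl FlowTable {
    /// Called for each packet processed on behalf of a guest.
    fn record(&mut self, key: FlowKey, bytes: u64) {
        let entry = self.flows.entry(key).or_default();
        entry.packets += 1;
        entry.bytes += bytes;
    }

    /// A periodic snapshot is enough to approximate "what is talking to what".
    fn snapshot(&self) -> Vec<(FlowKey, FlowStats)> {
        self.flows.iter().map(|(k, v)| (k.clone(), *v)).collect()
    }
}
```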

Other Services

In addition to the aspects that are mentioned above, the overall network will need to make sure that functionality like the following works:

  • Time Synchronization, whether NTP, PTP, or something else

  • Some form of Router Discovery for the internal Routing Protocols

  • The foundation for control plane service discovery

Multi-Rack Deployments

While a customer’s Oxide environment will start with a single rack, that is hopefully not going to be the end point of it. While we have proposed a number of things that are meant to keep the overhead down for the single rack environment, this is also designed to be able to scale to a multi-rack environment.

When the system evolves from one rack to multiple racks, there are a few different classes of considerations that we need to have:

  • How do multiple racks address one another?

  • How are multiple racks physically interconnected?

  • What is the impact of multiple racks on the scalability of OPTE, Boundary Services, and others?

Addressing and Connectivity

The first thing that we need to concern ourselves with is the means by which multiple racks are able to physically connect to one another and, when connected, to communicate.

Communication is fairly straightforward. Because each rack has an explicit IPv6 summary (see [sect-phys] for more information) which is carved out of the broader /48, we know that we can assign each subsequent rack one of these and that this will represent everything behind that rack in the routing table. Ensuring that a new rack that is joining the broader cell or availability zone does not use an address that is already in use will require some coordination during installation; however, hopefully it can be limited to that point.
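
As a small illustration of that carving, the sketch below derives a per-rack summary from an availability zone's /48. The /56 per-rack prefix length and the example /48 are assumptions made purely for illustration; the actual allocation is whatever [sect-phys] specifies.

```rust
use std::net::Ipv6Addr;

/// Given an availability zone /48, derive the summary prefix for rack N.
/// The /56 per-rack prefix length here is an assumption for illustration
/// only; the real allocation is specified elsewhere in this RFD.
fn rack_summary(az_prefix: Ipv6Addr, rack_index: u8) -> (Ipv6Addr, u8) {
    // Keep the top 48 bits (the AZ prefix) and clear everything below.
    let base = u128::from(az_prefix) & !((1u128 << 80) - 1);
    // A /56 leaves 8 bits (prefix bits 48..56) to number up to 256 racks.
    let rack_bits = (rack_index as u128) << 72;
    (Ipv6Addr::from(base | rack_bits), 56)
}

fn main() {
    // Hypothetical AZ prefix used only for this example.
    let az: Ipv6Addr = "fd00:1122:3344::".parse().unwrap();
    for rack in 0..3 {
        let (prefix, len) = rack_summary(az, rack);
        println!("rack {rack}: {prefix}/{len}");
    }
}
```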

Because we can represent each rack as a summarized entry, the total amount of routing state that we need should hopefully grow as O(racks) and not at a faster rate. The implicit constant there is still important, but the design should mean we have adequate switch resources for this.

The bigger challenge that we need to think about is connectivity. The way that we evolve switch connectivity between the Cell and availability zone will vary as the environment scales up and we figure out what the desired oversubscription ratios are. Given that we have plenty of additional ports on the switch for this (see RFD 58 Rack Switch), we have options for directly connecting a smaller number of racks or, at some point, connecting to a rack that has additional networking gear for building up larger Clos-style topologies.

The biggest current requirement of this scheme is that all Oxide racks in the same Availability Zone are always connected to one another through Oxide-supplied gear. Put differently, to traverse between two servers in the same availability zone, no equipment from the broader customer network should be involved. While on the surface this may fly in the face of [goal-4], it has large implications for [goal-6] and some of the others.

Fundamentally, we are going to want to be able to leverage our own equipment and have guarantees on its operation and on how packets flow between devices. If customer equipment is in the middle of that, it’s going to make things much more problematic when we deal with management and with understanding why problems have arisen, in particular, Investigations of Unknown Problems. Many of our customers' teams are stratified between network management and server management and are apt to point fingers at one another. The more information we need from devices outside of our control to be able to accurately diagnose a problem, the harder it is going to be for us to resolve that problem with a good experience.

While we know that not all customers will have organizational boundaries that are quite as pathological as described above, when we think about what it means to be able to understand the relative bandwidth allocations in the network and perform QoS, it will become much harder for us to do that while crossing the customer’s broader network segments to talk between two racks.

This does come at a cost. It means that we will eventually need to develop more specialized networking equipment that plays the roles that aggregation and core switches play in our customers' environments today; however, this still seems like an area where we’ll be able to provide value and, by tying that into the automation of the product, ideally be able to offer a better product to our customers.

Service Scalability

An important consideration in all of this is the scalability of network services. There are a couple of different aspects for us to consider:

  • The scalability of services like routing, NAT, firewalls, etc.

  • The scalability of Boundary Services

  • The scalability of Directory, DNS, and metadata services

So, first we should address the scalability of the different user services that are implemented by OPTE. In particular, this includes: VPCs, Routing, Firewalling, and aspects of NAT.

From a performance perspective, because we are fanning this out to each server, as additional racks are added, our resources to perform these actions should be able to scale with that environment. Because each compute node is doing its own routing and firewalling for customer traffic, that should be manageable.

However, as the environment grows, so will the amount of state that is required to perform this. To deal with this, we will need to impose limits on things like the number of firewall rules, the number of instances in a project, or the number that can be shared together in peered VPCs. Ultimately we will need to pick certain values and figure out what will be manageable.

When we consider boundary services, this becomes somewhat trickier. While we are initially starting with boundary services being implemented by the rack switches themselves, when we end up in larger environments it will likely make sense to have additional dedicated boundary services devices that perform this role. To make that scale, we will want a way to shard this work out while keeping each shard highly available. We should work to figure out the approximate order of magnitude of entries we can keep in a group of coherent boundary services devices.

That said, if we do have a way to shard that out such as directing subsets of boundary services to different addresses, then this makes the scalability problem a bit more manageable and we should be able to increase the number of those to handle this, though the connectivity to do that can be challenging.
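
One way to picture that directing of subsets is rendezvous (highest-random-weight) hashing of external addresses across the shard set, which limits how many addresses move when a shard is added or removed. The sketch below is illustrative only, not a committed design, and the shard names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::net::IpAddr;

/// Pick the boundary services shard responsible for an external address
/// using rendezvous hashing: each shard scores the address and the highest
/// score wins. Adding or removing a shard only moves the addresses that
/// scored highest on that shard.
fn shard_for<'a>(addr: IpAddr, shards: &'a [String]) -> Option<&'a str> {
    shards
        .iter()
        .max_by_key(|shard| {
            let mut h = DefaultHasher::new();
            (addr, shard).hash(&mut h);
            h.finish()
        })
        .map(|s| s.as_str())
}

fn main() {
    // Hypothetical shard names; real identifiers would come from the
    // control plane's inventory of boundary services devices.
    let shards = vec!["boundary-a".to_string(), "boundary-b".to_string()];
    let addr: IpAddr = "203.0.113.10".parse().unwrap();
    println!("{addr} is handled by {:?}", shard_for(addr, &shards));
}
```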

Finally there are a collection of different services that we have which are in essence a part of the larger control plane. These will need to be horizontally scalable once we get to a certain point so it’ll be important that we can design an evolutionary path for that from the beginning. Similarly, knowing that we can eventually shard some of those databases may be rather valuable.

Network Services Evolution

This is somewhat of a complex foundation for what we are initially building. However, an important part of it is how it enables us to implement the additional features that we already know we’re going to want in future versions of the product, some of which are discussed briefly in RFD 9 Networking Considerations and RFD 21 User Networking API.

These include the following feature areas:

  • Load Balancers and Integrated DNS

  • IPsec VPNs

  • Cross-Region Connections

  • Flow Logging

  • Multiple IPs per Interface

Load Balancing

A major area that we already know customers want is load balancing. We can roughly break down load balancers into two camps based on whether or not they actually terminate the connection. There is a lot of literature and software on load balancing out there, whether it is related to Microsoft’s Ananta, Google’s Maglev, or more. While we describe a few approaches and ways this can fit together, this is by no means meant to constrain future us, but rather to convince future us that this isn’t impossible.

In any case, it wouldn’t be too hard to augment boundary services with additional general purpose compute resources. For example, one could imagine that for load balancing, boundary services would be able to forward traffic on to a set of intermediate nodes that would do the actual connection termination and health checking of backend instances before forwarding it on. If a connection-forwarding version was instead preferred, we could potentially put that hashing into boundary services itself and have the secondary compute be responsible for performing health checks and updating boundary services appropriately.
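
A minimal sketch of the connection-forwarding variant follows: hash the flow 5-tuple across whatever set of backends the health-checking compute currently reports as healthy. The names are hypothetical, and a plain hash-mod scheme is shown only for brevity; a consistent scheme (e.g. Maglev-style) would reduce churn when the healthy set changes.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::net::IpAddr;

/// The flow identity that boundary services would hash on so that packets
/// of one connection consistently reach the same backend. All names here
/// are illustrative, not a committed interface.
#[derive(Hash)]
struct Flow {
    proto: u8,
    src: IpAddr,
    src_port: u16,
    dst: IpAddr,
    dst_port: u16,
}

/// Pick a backend for a flow from the healthy set maintained by the
/// health-checking compute nodes.
fn pick_backend<'a>(flow: &Flow, healthy: &'a [IpAddr]) -> Option<&'a IpAddr> {
    if healthy.is_empty() {
        return None;
    }
    let mut h = DefaultHasher::new();
    flow.hash(&mut h);
    let idx = (h.finish() as usize) % healthy.len();
    healthy.get(idx)
}
```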

When traffic is leaving the instance, if we are not doing a connection-terminating load balancer, then we can employ a technique called 'direct server return' as discussed in the Ananta paper (and others) and have OPTE direct traffic back towards boundary services to be returned. However, if there is something doing termination, then we need to know how to send traffic back towards that. In an ideal world, that can be managed by OPTE as though sending to any other instance.

IPsec VPNs and Cross-region traffic

IPsec based VPNs and cross-region traffic share interesting properties in that they generally require that data is encrypted in schemes that provide both confidentiality and authentication. While IPsec does require specific mechanisms that we might not require in a cross-region setting, these are more similar than different.

In essence, when there’s a VPN on the scene there is some amount of routing that needs to go on, which fits into the existing models proposed by OPTE rather nicely.

The next challenge, which will be more of an issue for the IPsec VPNs, is that we ultimately need something running IKE on a per-VPN basis and paying attention to routing rules. This will require some amount of HA, which will need to be designed carefully.

The larger challenge that we’ll have is the same for both the cross-region behaviors and IPsec VPNs, namely the actual act of performing encryption. There are many models that we could employ here. On one hand, we could consider pushing this into OPTE. On the other hand, we may not have the resources to do that in either the NIC or in the host CPU and may want to instead have dedicated systems whose job it is to encrypt and forward. This is where aspects of Andromeda’s Hoverboard approach and careful application of hardware acceleration could be useful.

Flow Logging

Firewall flow logging is something that we can build in concert with OPTE. As we discuss in the Application Observability section, building up the notion of a customer’s topology is a similar problem to flow logging. There is a need to be able to collect the data, but then, for flow logging to be useful, one also needs to be able to aggregate and amalgamate it.

The most useful thing here is that the foundation we’ve built has enough information for us to take care of the first step, which is having the data at all. Building the ability to hoover it up at the rate necessary for this to be useful is a complicated feature in and of itself, but this is a useful starting point.

Multiple IPs per Interface

This feature refers to a guest’s device having the ability to have multiple IP addresses associated with a given interface. Implementing this on top of our existing foundation is fairly straightforward. It would require additional database state and support in the firewall API syntax; however, once that was present, all this should require is a slight modification of some of the OPTE processing modules.
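
A sketch of the additional state this implies is shown below: an interface record carrying a set of addresses rather than a single one, each of which OPTE's processing layers would need to accept as a valid inner address. The types are illustrative only, not the actual schema or API.

```rust
use std::net::IpAddr;

/// Illustrative only: what the additional state for a guest interface
/// might look like once it can carry more than one address. The real
/// database schema and firewall API syntax would be specified separately.
struct GuestInterface {
    instance_id: u64,
    /// Previously a single address; now a set.
    addresses: Vec<IpAddr>,
}

impl GuestInterface {
    /// The kind of check OPTE-style processing would perform on an inner
    /// packet's source or destination address.
    fn owns(&self, addr: &IpAddr) -> bool {
        self.addresses.contains(addr)
    }
}
```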

Global Anycast Services

Another thing to explore is the idea of support for anycast services. A great example of this is how Google uses anycast to implement its public DNS server 8.8.8.8. In this case, if multiple availability zones, whether they’re in the same or different regions, are part of the same fleet, they may be able to coordinate and nominate a specific public IP address to be served this way.

Once that’s been done, boundary services can advertise this address on the customer’s broader network, and if it is destined to be a public IP, it will need to be advertised by that customer on the broader Internet. However, if it’s just offering an internal service, that won’t be necessary.

With control over the internal network fabric and our IGP, it should also be possible to create internal anycast services. This may be a useful way to implement some parts of our HA services or control plane services, especially as the environment grows.