The networking APIs that we expose to users are at the heart of the Oxide Rack. This RFD is a part of the broader user API in RFD 4 and serves to focus on the networking part of it. First, this document introduces the terminology that we will use. Then it goes through the high-level concepts and the way that different features interact together. Next, it goes through the specific API endpoints and how they operate. Finally, it talks a bit about future direction.
The following RFDs serve as useful background for this document:
RFD 9 compares the various cloud providers and trade offs that we need to think about with respect to the networking design. RFD 24 covers discussions about the coherence domains and scope of various resources and introduces terminology that covers the different groupings of infrastructure. Finally, RFD 4 introduces all of the basic information about the User API which introduces ideas like projects and more.
Terminology
This section introduces the terminology that we will use throughout the user API. How the different pieces fit together will be described in the Concepts section.
VPC: A VPC is a virtual private cloud. It represents an isolated network fabric.
IP subnet: An IP subnetwork is a specific sub-division of your broader network. It represents a concrete block of either IPv4 or IPv6 addresses and is summarized by a CIDR block.
VPC Subnet: A VPC Subnet is the fundamental building block of a VPC. It represents a subnet from which IP addresses are allocated. A VPC Subnet consists of both an IPv4 and an IPv6 IP subnet.
VPC Routing Table: A table of entries that determine the next destination of a given IP packet.
VPC Firewall: A network firewall is a tool which accepts or denies packets based on a series of rules.
Virtual NIC (VNIC): A network interface card that appears inside of a virtual machine instance. A Virtual NIC has a primary IPv4 and IPv6 address associated with it.
Floating IP: An IP address that can be dynamically assigned from one instance to another through an API call, though the IP address does not appear inside the instance.
Ephemeral IP: A public IP address that is assigned to an instance while the instance is running and is removed when it is not.
Internet Gateway: A special entity on the network that provides a means for instances on a VPC to reach the Internet.
IP Address Pool: A collection of IP addresses maintained by operators of the environment.
Concepts
Resources in an Oxide rack are organized into projects (see RFD 4) and the vast majority of the networking resources are as well. Each project can be thought of as having its own independent physical network fabrics. Just like in a data center, these network fabrics have their own subnetworks, routers, firewall rules, and are isolated from other networks. We call this a virtual private cloud or VPC.
When a project is created, a default VPC is created as well, though a project can contain multiple independent VPCs. Each VPC has full access to all of the standard IPv4 private address space ranges: 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16. It also has access to IPv6 private address space with a randomly generated prefix based on IPv6 Unique Local Addresses in the fd00::/8 range.
A VPC network isn’t exactly the same as a traditional data center network. Here are some of the things which may be different:
The underlying network is virtualized. This means that things like routers and firewalls are taken care of automatically and don’t require a dedicated hardware device to implement. This allows your network’s capabilities to scale with your deployment.
By default, each VPC network is independent and cannot communicate with another one.
All IP addresses on a VPC are private. They are not routable outside of the VPC. The IP addresses are assigned from private address space.
Like a project, a VPC exists beyond a single Oxide rack and exists across an entire region.
Some traditional networking concepts like VLANs do not apply. While ARP and NDP work, other broadcast and multicast traffic does not work.
By default, an instance is restricted to only using the IP and MAC addresses that are assigned to it.
By default, while instances on a VPC can reach the Internet, hosts on the Internet cannot reach instances on a VPC except as part of a flow that originated from that instance within the VPC. This can be changed by allocating a Floating IP to an instance.
VPC Example
Let’s look at an example of a VPC that’s used to implement a blog that’s logically made up of three layers: a load balancer, HTTP/application servers, and a database tier:
Let’s look at the different components. The items listed below correspond to the values in the image.
(1) This represents the entire VPC. A VPC is associated with a project. All of the instances within the VPC are isolated from other VPCs on the system and have their own, private set of IP addresses.
(2) This is the first of two VPC Subnets. It has both an IPv4 and IPv6 CIDR block assigned to it. This VPC Subnet contains all of the application traffic and has its own name.
(3) This is the second of two VPC Subnets. It has both an IPv4 and IPv6 CIDR block assigned to it. This VPC Subnet contains all of the databases.
(4) These are all of the individual instances in the 'App' Subnet (2). Each instance has an IPv4 and IPv6 address, though one could design the system such that only the load balancer had both. Each instance has an address from the corresponding IP CIDR blocks that are assigned to the VPC Subnet.
(5) This represents all of the instances that are assigned to the 'DB' subnet. Like in (4), they have addresses that come from their VPC Subnet.
(6) This is the VPC Router. It is a scalable router that is a part of the underlying network fabric. It maintains a set of routes for the entire VPC that ensures the different VPC Subnets can talk to one another and provides a route to the Internet.
(7) The Internet Gateway is a scalable NAT that allows all of the instances in the 'App' and 'DB' VPC Subnets to make outgoing connections to the Internet.
(8) This is an IPv6 Floating IP address. It provides an external IP address, which allows network communication to be initiated from outside of the project (for example, from the Internet). The floating IP maps a single external IPv6 address, 2600:3c00::f03c:91ff:fe96:a264, to an IPv6 address inside of the VPC. Here, it maps to the load balancer’s internal IPv6 address, fd12:3456:789a::32.
(9) This is an IPv4 Floating IP address. It provides connectivity similar to that in (8). In this case it maps the external IPv4 address 72.14.186.115 to the load balancer’s internal IPv4 address, 10.169.10.30.
To make this clearer, let’s work through some sample flows. Assume that a user is trying to reach the blog over IPv4 with an HTTP GET request. The HTTP request would first target the IPv4 floating IP, 72.14.186.115. When that traffic reaches the Oxide network, the network will translate it into a request directed to the 'Load Balancer' instance, 10.169.10.30.
From here, the 'load balancer' would forward it to one of the two HTTP instances, 'HTTP 1' or 'HTTP 2' based on its internal policies. Assuming it chose 'HTTP 2', then it would send that HTTP request to the 'HTTP 2' instance, using either its IPv4 or IPv6 address. The application server will then break down the request and fulfill it. If as a part of it, it needs to access the database cluster, it’ll send a message to one of the databases, which will be facilitated by the 'VPC Router' (6). There may be firewalls in place that may further restrict communication between the instances and the VPC Subnets.
When the database replies, it will reply back to the 'HTTP 2' instance, whose traffic will be routed through the Router. The 'HTTP 2' instance will then reply to the 'Load Balancer'. When the 'Load Balancer' wants to reply to the traffic, it will go through the 'Internet Gateway' (7). The 'Internet Gateway' (7) is responsible for making sure that the reply has the right outgoing IP address (that of the floating IP) and making sure it goes out to the Internet. All traffic leaving the VPC must go through an 'Internet Gateway' (7).
Let’s consider another example. Say that 'Postgres 2' wants to make a connection to the Internet just to ping 8.8.8.8. In this case, the traffic from 'Postgres 2' would first go to the Router (6). The rules in the router would direct it to the 'Internet Gateway' (7). A NAT session will be established for this communication, and the rack will pick an IP address and port to use for the NAT translation from a rack-wide shared pool. The 'Internet Gateway' (7) rewrites the outgoing packet and sends it on. When the response comes back from 8.8.8.8, the 'Internet Gateway' (7) rewrites it and directs it towards the IPv4 address of 'Postgres 2', 10.168.20.12. That will traverse the 'Router' (6), and then 'Postgres 2' will receive the response.
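The NAT session described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the rack's implementation: the `SourceNat` class, the pool address 198.51.100.7, the port range, and the use of a UDP flow (rather than the ICMP ping in the text) are all hypothetical choices made to keep the example short.

```python
import itertools


class SourceNat:
    """Toy source NAT: maps internal flows onto (address, port) pairs from a shared pool."""

    def __init__(self, pool_addresses, ports=range(1024, 65536)):
        # Hand out (external address, external port) pairs one at a time.
        self._free = itertools.product(pool_addresses, ports)
        self._sessions = {}  # outbound flow -> (ext_ip, ext_port)
        self._reverse = {}   # reply flow -> (internal_ip, internal_port)

    def outbound(self, proto, src_ip, src_port, dst_ip, dst_port):
        """Return the external (address, port) that replaces the packet's source."""
        flow = (proto, src_ip, src_port, dst_ip, dst_port)
        if flow not in self._sessions:
            ext = next(self._free, None)
            if ext is None:
                raise RuntimeError("NAT pool exhausted")
            self._sessions[flow] = ext
            self._reverse[(proto, dst_ip, dst_port, *ext)] = (src_ip, src_port)
        return self._sessions[flow]

    def inbound(self, proto, remote_ip, remote_port, ext_ip, ext_port):
        """Return the internal (address, port) for a reply, or None if no session exists."""
        return self._reverse.get((proto, remote_ip, remote_port, ext_ip, ext_port))


# 'Postgres 2' sends a packet to 8.8.8.8; the reply is mapped back to it.
nat = SourceNat(["198.51.100.7"])
ext_ip, ext_port = nat.outbound("udp", "10.168.20.12", 40000, "8.8.8.8", 53)
reply_target = nat.inbound("udp", "8.8.8.8", 53, ext_ip, ext_port)
```

A reply that does not match an existing session has no internal target, which is why hosts on the Internet cannot initiate connections through the gateway. The sketch never expires sessions or returns ports to the pool; a real NAT would need both.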
In addition to these examples described above there are a couple of other things to point out about the VPC:
All of the instances on the 'App' and 'DB' VPC Subnets can look each other up using internal DNS.
The floating IP address will appear in external DNS. This can be looked up by applications outside of the VPC or even the Oxide fleet.
You can use the Firewall to restrict what can communicate with what in the VPC.
Subnets
A VPC is broken into a series of VPC Subnets. A VPC Subnet spans more than one Oxide rack and can operate in an entire availability zone, which is selected when the subnet is created. Each VPC Subnet has an associated IPv4 and IPv6 CIDR block. The IPv4 block must be allocated from one of the IPv4 private address ranges. The largest IPv4 subnet that can be created is a /8 and the smallest is a /26, which allows for approximately 64 addresses. A VPC may have multiple VPC Subnets, each with distinct IPv4 address ranges; the only constraint is that VPC Subnets may not overlap. VPC Subnets also have IPv6 address ranges associated with them. These must be Unique Local Addresses in the range fd00::/8.
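As a rough illustration of these constraints, the following sketch checks a proposed VPC Subnet using Python's standard `ipaddress` module. The function name `validate_subnet` and its return format are hypothetical, and the smallest allowed IPv4 subnet is taken to be a /26 (64 addresses); treat both as assumptions rather than the API's actual behavior.

```python
import ipaddress

# RFC 1918 private ranges and the IPv6 Unique Local Address range.
PRIVATE_V4 = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]
ULA_V6 = ipaddress.ip_network("fd00::/8")


def validate_subnet(v4_cidr, v6_cidr, existing=(), smallest_prefix=26):
    """Return a list of problems with a proposed VPC Subnet (empty if none)."""
    v4 = ipaddress.ip_network(v4_cidr)
    v6 = ipaddress.ip_network(v6_cidr)
    problems = []
    if not any(v4.subnet_of(p) for p in PRIVATE_V4):
        problems.append("IPv4 block is not within a private address range")
    if v4.prefixlen > smallest_prefix:
        problems.append(f"IPv4 block is smaller than a /{smallest_prefix}")
    if not v6.subnet_of(ULA_V6):
        problems.append("IPv6 block is not a Unique Local Address range")
    # VPC Subnets in the same VPC may not overlap one another.
    for cidr in existing:
        other = ipaddress.ip_network(cidr)
        for mine in (v4, v6):
            if mine.version == other.version and mine.overlaps(other):
                problems.append(f"{mine} overlaps existing block {other}")
    return problems
```

For example, `validate_subnet("10.169.10.128/25", "fd12:3456:789b::/64", existing=["10.169.10.0/24"])` reports an overlap, while the same call with an empty `existing` list reports no problems.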
When a VPC is created, a default VPC Subnet is created for you. This uses an IPv4 address range beginning at 172.30.0.0 and a random IPv6 Unique Local Address range within fd00::/8. The IPv6 range is allocated out of the prefix assigned to the VPC. That prefix may be chosen at the time the VPC is created; otherwise a random prefix will be assigned at that time.
When an instance is created, it is associated with a VPC Subnet and receives an address from that VPC Subnet. By default, the following communication rules are set up on a VPC Subnet:
All instances on the VPC Subnet can talk to one another due to the default firewall rules.
A default gateway is created in the VPC Subnet which can be used to route to other subnets or the Internet.
An IP address is reserved on the network to act as a private DNS server.
All instances can reach the Internet through the default gateway.
When creating an instance, a user may specify an IP address from the subnet to use or allow the system to pick one. In addition, IP addresses in the network may be reserved through the API. A reserved address will not be allocated by the system automatically. It can only be used by explicitly requesting it when provisioning an instance. An instance may use both IPv4 and IPv6 addresses from the VPC Subnet or only a single one.
A number of addresses in a network are used by Oxide to provide rack services. For example, if the subnet were 192.168.1.0/24 in IPv4 and fd12:3456::/64 in IPv6, the following addresses would be reserved:
Use | Logical Address | IPv4 Address | IPv6 Address |
---|---|---|---|
Network Address | First address in the network | 192.168.1.0 | fd12:3456::0 |
Network Gateway | Second address in the network | 192.168.1.1 | fd12:3456::1 |
DNS Services | Third address in the network | 192.168.1.2 | fd12:3456::2 |
Future Use | Fourth address in the network | 192.168.1.3 | fd12:3456::3 |
Future Use | Fifth address in the network | 192.168.1.4 | fd12:3456::4 |
Broadcast Address | Last address in the network | 192.168.1.255 | Not Applicable |
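The table can be derived mechanically from a subnet's CIDR block. The sketch below does so with Python's standard `ipaddress` module; the function name `reserved_addresses` is hypothetical and the code simply mirrors the rows above.

```python
import ipaddress

# Uses of the first five addresses in a subnet, in order, as listed in the table.
LEADING_USES = [
    "Network Address",
    "Network Gateway",
    "DNS Services",
    "Future Use",
    "Future Use",
]


def reserved_addresses(cidr):
    """Return (use, address) pairs that are reserved in the given subnet."""
    net = ipaddress.ip_network(cidr)
    reserved = [(use, net[i]) for i, use in enumerate(LEADING_USES)]
    # Only IPv4 has a broadcast address; IPv6 reserves nothing at the end.
    if net.version == 4:
        reserved.append(("Broadcast Address", net.broadcast_address))
    return reserved


for use, address in reserved_addresses("192.168.1.0/24"):
    print(f"{use}: {address}")
```

An address allocator would skip these addresses when picking an address for a new instance.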
VPC Routers
A VPC Router defines a series of rules that indicate where network traffic should be sent depending on its destination. A VPC Router is part of the underlying fabric of the VPC and is different from the routing table found inside of a guest (e.g. the information you get when you run netstat -rn).
Every rule in a VPC Router contains the following:
Name: A name for the rule.
Description: A textual description for the rule.
Destination: The IP CIDR block that this rule covers. For example, 10.169.10.0/24. However, see also the open questions on Routing Table Destinations.
Target: The place that traffic that matches the destination should be sent.
There are a number of different targets. These targets include:
An Internet Gateway.
A specific VPC Subnet.
A particular instance or IP address.
A VPC that belongs to another project, which peering has been enabled for.
No destination, which says that the traffic should be dropped.
There are two types of routing tables:
The VPC-wide VPC System Router
VPC Custom Routers, which apply to specific VPC Subnets
In each VPC, there is a VPC-wide VPC System Router that is created when the VPC is created. The system automatically adds routes to and removes routes from the VPC System Router. Users cannot add or remove routes in this table directly; however, a few entries can be modified. The VPC System Router contains the following types of entries:
Route Type | Purpose | Destination Type | Modifiable |
---|---|---|---|
Default Route | Determines the default destination of traffic, such as whether it goes to the Internet or not. | An Internet Gateway | Yes |
VPC Subnet Routes | Routes that are automatically added for each VPC Subnet in the VPC. | A VPC Subnet | No, they are added and destroyed with VPC Subnets. |
VPC Peering Routes | Routes that are automatically added when VPC peering is established. | A different VPC | No, they are added and destroyed when VPC peering is established or torn down. |
In addition to the VPC System Router, a VPC may contain a number of VPC Custom Routers. A VPC Custom Router is used to provide additional routes and override the behavior of the VPC System Router. Each VPC Subnet may have a single, optional VPC Custom Router associated with it. The same VPC Custom Router can be associated with multiple VPC Subnets.
Rule Ordering
The rules for picking routes are:
Take the route with the most-specific prefix that matches in either the VPC System Router or a VPC Subnet’s VPC Custom Router.
If two rules have the same most-specific prefix, then the one in a VPC Custom Router applied to a VPC Subnet has priority over the VPC System Router.
When searching for a rule to apply, the VPC routing engine will use the rule with the most specific destination. Consider the following VPC System Router:
Destination | Target | Description |
---|---|---|
0.0.0.0/0 | Internet Gateway | Catch all rule to access the Internet over IPv4 |
::/0 | Internet Gateway | Catch all rule to access the Internet over IPv6 |
10.169.10.0/24 | VPC Subnet A | Rule to route IPv4 traffic to VPC Subnet A. |
10.169.20.0/24 | VPC Subnet B | Rule to route IPv4 traffic to VPC Subnet B. |
10.169.30.0/24 | VPC Subnet C | Rule to route IPv4 traffic to VPC Subnet C. |
fd12:3456:789a::/64 | VPC Subnet A | Rule to route IPv6 traffic to VPC Subnet A. |
fd12:3456:789b::/64 | VPC Subnet B | Rule to route IPv6 traffic to VPC Subnet B. |
fd12:3456:789c::/64 | VPC Subnet C | Rule to route IPv6 traffic to VPC Subnet C. |
If a packet were sent to 10.169.10.5, it would get routed to VPC Subnet A. While VPC Subnet A and the Internet Gateway both match the packet, the VPC Subnet A route is more specific, so it will be taken. If someone were to send a packet to the IPv6 address 2607:f8b0:4005:80b::200e, it only matches the Internet Gateway rule of ::/0, so that is where it will be sent.
Let’s consider a subnet that has a VPC Custom Router with the following rules:
Destination | Target | Description |
---|---|---|
10.169.20.0/24 | Drop | Rule to make sure IPv4 traffic to VPC Subnet B is dropped. |
fd12:3456:789b::/64 | Drop | Rule to make sure IPv6 traffic to VPC Subnet B is dropped. |
172.16.0.0/12 | 10.169.30.33 | A Rule to forward all traffic for a VPN to a specific entry point. |
Let’s consider that this VPC Custom Router is attached to VPC Subnet C and we’re trying to send a packet from it. First, let’s say we’re trying to send a packet to 10.169.10.5. This is on VPC Subnet A. No rule in the VPC Custom Router matches it, so we take the route from the VPC System Router.
If we send a packet to 172.16.23.95, the most specific rule is the one in the VPC Custom Router. While there is a rule that applies in the VPC System Router (the 0.0.0.0/0 catch-all), the one in the VPC Custom Router is more specific. Note that which Router the rule is in doesn’t matter here; all that matters is which rule has the most-specific prefix. Because of this rule, the packet will be forwarded to 10.169.30.33, which we presume is running some kind of VPN software and will send the packet over the VPN.
Finally, let’s say we try to send a packet to 10.169.20.5, which is an IP address in VPC Subnet B. There are two rules that match this address: one is in the VPC System Router and the other is in the VPC Custom Router. Both of them have the same prefix, so there is no winner via the prefix match. This leads us to apply the second rule, that the VPC Custom Router has priority. This means that the packet will be dropped.
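The two selection rules can be captured in a few lines. The sketch below is an illustration only, using Python's standard `ipaddress` module; the `select_route` function and the tuple-based route format are hypothetical, and the tables are a subset of the example routes above.

```python
import ipaddress

# (destination CIDR, target) pairs taken from the example tables.
SYSTEM_ROUTES = [
    ("0.0.0.0/0", "Internet Gateway"),
    ("::/0", "Internet Gateway"),
    ("10.169.10.0/24", "VPC Subnet A"),
    ("10.169.20.0/24", "VPC Subnet B"),
    ("fd12:3456:789a::/64", "VPC Subnet A"),
    ("fd12:3456:789b::/64", "VPC Subnet B"),
]
CUSTOM_ROUTES = [
    ("10.169.20.0/24", "Drop"),
    ("fd12:3456:789b::/64", "Drop"),
    ("172.16.0.0/12", "10.169.30.33"),
]


def select_route(destination, system_routes, custom_routes=()):
    """Pick a target: longest prefix wins, and the custom router wins ties."""
    addr = ipaddress.ip_address(destination)
    best_key, best_target = None, None
    for is_custom, routes in ((False, system_routes), (True, custom_routes)):
        for cidr, target in routes:
            net = ipaddress.ip_network(cidr)
            if net.version != addr.version or addr not in net:
                continue
            # Compare prefix length first, then prefer the custom router.
            key = (net.prefixlen, is_custom)
            if best_key is None or key > best_key:
                best_key, best_target = key, target
    return best_target


for dest in ("10.169.10.5", "172.16.23.95", "10.169.20.5"):
    print(dest, "->", select_route(dest, SYSTEM_ROUTES, CUSTOM_ROUTES))
```

Because the comparison key is the pair (prefix length, is custom), the VPC Custom Router only takes priority when the prefix lengths are equal, which matches the ordering described above.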
Internet Gateway
An Internet Gateway provides instances in a VPC access to the Internet and acts in a similar fashion to a traditional source NAT that one would find in a home network. When a VPC is created, an Internet Gateway is created by default as well. The default Internet Gateway shows up in the VPC System Router by default. A subnet without an Internet Gateway cannot route to the Internet on its own.
It is possible to create additional Internet Gateways and to associate them with VPC Custom Routers. This allows you to be able to have no outbound access by default, but to allow certain subnets to have access.
Each Internet Gateway is associated with a pool of external addresses that may be shared with other VPCs and projects.
However, there are many cases where applications need to deal with remote services that allow and deny access based on the IP address. To deal with this, a number of Floating IP addresses may be allocated and associated with an Internet Gateway for traffic within a single availability zone. When using this mode, the system will not dynamically scale up the number of IP addresses associated with the NAT as that would otherwise defeat any IP address based filtering.
An additional use of this mode is to have no IPs associated with the Internet Gateway, causing all outbound traffic to be dropped. An instance with a floating IP address will use that address when making outbound connections; see the next section for more information.
See also additional design considerations.
External IPs
By default, all networking for instances on a VPC is private to that VPC. This means that while instances can make requests to the Internet, they cannot receive inbound connections.
The system provides two different ways to allocate an external IP address:
Ephemeral IPs: An external IP address that is temporarily assigned to an instance while it is running. Its lifetime is tied to the state of the instance.
Floating IPs: A floating IP is a permanent object that can be attached and detached from an instance. Its life time is separate from any instance it is attached to.
When an instance has an Ephemeral or Floating IP address, that address will not be visible inside of the guest. This means that it will not show up in tools like ifconfig or ip addr. Instead, the Ephemeral or Floating IP address will act as a 1:1 NAT: all traffic to that IP address will be forwarded to that particular instance and its primary interface. When the instance replies, the source address of its traffic will be translated back into the external Ephemeral or Floating IP address.
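To make the 1:1 NAT behavior concrete, here is a minimal sketch. The `Packet` and `OneToOneNat` types are hypothetical and model only address rewriting; the addresses are the IPv4 floating IP and load balancer address from the VPC example, and 203.0.113.9 stands in for an arbitrary Internet host. For simplicity the sketch allows a single external address per private address.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


class OneToOneNat:
    """Rewrites addresses between an external IP and an instance's private IP."""

    def __init__(self):
        self._to_private = {}
        self._to_external = {}

    def attach(self, external_ip, private_ip):
        self._to_private[external_ip] = private_ip
        self._to_external[private_ip] = external_ip

    def detach(self, external_ip):
        private_ip = self._to_private.pop(external_ip)
        del self._to_external[private_ip]

    def inbound(self, pkt):
        """Rewrite the destination to the private address, or return None to drop."""
        private_ip = self._to_private.get(pkt.dst)
        return replace(pkt, dst=private_ip) if private_ip else None

    def outbound(self, pkt):
        """Rewrite the source to the external address if the instance has one."""
        external_ip = self._to_external.get(pkt.src)
        return replace(pkt, src=external_ip) if external_ip else pkt


nat = OneToOneNat()
nat.attach("72.14.186.115", "10.169.10.30")
request = nat.inbound(Packet(src="203.0.113.9", dst="72.14.186.115"))
reply = nat.outbound(Packet(src="10.169.10.30", dst="203.0.113.9"))
```

The guest only ever sees 10.169.10.30 on its interface, which is why the external address does not appear in its local network configuration.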
To support an Ephemeral or floating IPv4 address, the instance must have a corresponding private IPv4 address on its primary interface. Similarly, to support an Ephemeral or Floating IPv6 address, the instance must have a corresponding IPv6 address.
To use an Ephemeral or Floating IP address, an instance must be in a subnet with an Internet Gateway. This helps maintain the simple rule that if an instance does not have an Internet Gateway, it cannot reach the Internet at all. The Internet Gateway will ensure that all traffic from that instance uses its Ephemeral or Floating IP when making new outbound connections and when replying to traffic. If an instance has more than one Floating IP address, then the Internet gateway will use all of the associated addresses when making outgoing connections.
Ephemeral and Floating IPs can be assigned and removed while the instance is running. An Ephemeral IP may be changed to a Floating IP through the API, and a Floating IP can be changed back to an Ephemeral IP as well. Neither of these operations interrupts the connectivity of the IP address.
Ephemeral IPs
An Ephemeral IP is an IP address that is temporarily assigned to an instance while it’s running and released when the instance is stopped or destroyed. An Ephemeral IP is perfect for instances that need external connectivity, but the actual IP address isn’t important.
Instances with Ephemeral IPs assigned to them are always advertised in external DNS using the instance scheme. The name will not change regardless of whether or not the underlying Ephemeral IP address associated with the instance changes.
An instance can either have a single external IPv4 address, a single external IPv6 address, or both a single external IPv4 and IPv6 address configured.
Floating IPs
A Floating IP address is an external IPv4 or IPv6 address that can be moved between different types of objects. For example, a Floating IP could be used to represent a service. As the service is upgraded, new instances are created and destroyed, and the Floating IP address can be moved alongside it. This allows consumers to have a consistent address and name.
A Floating IP address can be assigned to multiple different types of things, including instances, Internet Gateways, and future objects such as load balancers.
Floating IPs have their own DNS names and schemes. They do not show up in an instance’s external DNS, but rather in the Floating IP DNS scheme. Multiple Floating IPs can share the same DNS name. This creates a single DNS entry with multiple records.
IP Pools
An IP pool is a collection of external addresses that are maintained by operators. There are separate pools for IPv4 and IPv6 addresses. A given pool may have more than one IP CIDR block inside of it. IP pools may be made available to all projects or they can be restricted to a specific project.
When allocating an Ephemeral IP, Floating IP, or an Internet Gateway, an IP Pool may be optionally specified, which will cause the corresponding IP address to come from those in the IP Pool. Note, an IP Pool is never used for addresses from VPC Subnets that are used by guests.
DNS
The Oxide API services provide DNS servers for resolving the names of instances to their underlying IP addresses. There are two different types of DNS servers that are provided:
Internal DNS servers that advertise VPC addresses.
External DNS servers that advertise floating IP addresses.
Types of Records
The following types of DNS records are supported for an instance:
A records: These map a host name to an IPv4 address.
AAAA records: These map a hostname to its corresponding IPv6 address.
PTR records: These map an IP address back to the corresponding host name. While less common, these are often required for mail servers and other applications.
For a discussion of other types that could make sense, see the DNS record type design discussion.
Record Structure and Visibility
DNS for Instances
When an instance is created, it is automatically registered in Internal DNS. In this case, the primary IPv4 and IPv6 addresses are registered as A and AAAA records. Internal DNS exists on a per-VPC basis. Using the network’s DNS servers, an instance is always able to resolve any address on its VPC. An instance that is not on a given VPC will not be able to resolve that VPC’s internal names.
When an Ephemeral IP address is assigned to an instance, that instance will appear in external DNS. Names in external DNS are accessible outside of the Oxide environment by other applications. A DNS A record is created whenever an IPv4 Ephemeral IP address is assigned, and a DNS AAAA record is created whenever an IPv6 Ephemeral IP address is assigned.
Names in DNS follow the same structure, regardless of whether or not they are being used internally or externally. This structure is:
<instance>.<az>.inst.<vpc>.<project>.<org>.<suffix>
<instance> refers to the DNS name of the instance
<az> refers to the DNS name of the availability zone
<vpc> refers to the DNS name of the VPC
<project> refers to the DNS name of the project
<org> refers to the DNS name of the organization
<suffix> refers to the DNS suffix that is used. For internal DNS this is always .internal. For external DNS, this varies based on the installation.
Let’s look at an example. Here are two names that refer to the same instance. One is in internal DNS and one is in external DNS:
glorfindel.us-east-1.inst.gondolin.noldor.tolkien.internal
glorfindel.us-east-1.inst.gondolin.noldor.tolkien.oxide.fingolfin.org
Here glorfindel is the DNS name of the instance. us-east-1 is the DNS name of the availability zone. gondolin is the DNS name of the VPC, noldor is the DNS name of the project, and tolkien is the DNS name of the organization. The first DNS host name is the name in internal DNS, which is why it has the .internal suffix. The second name is the one in external DNS, and oxide.fingolfin.org is the suffix. The DNS suffix is specific to an installation.
In all of the above objects, we explicitly said it was the DNS name. The DNS name is a separate name for each object that defaults to the object’s name. DNS has some additional constraints in terms of naming that aren’t always there for the main name attributes. In addition, it’s important that renaming something that users see and interact with on a regular basis doesn’t impact the names that machines are using unless intended.
When two VPCs have been peered together, subnets that are shared will show up in DNS with the corresponding names that match that project.
DNS for Floating IPs
Floating IPs are automatically registered in both Internal and External DNS, similar to instances. However, the same (external) address is always advertised. A DNS A record is created for IPv4 Floating IP addresses and a DNS AAAA record is created for IPv6 Floating IP addresses. Multiple Floating IPs can have the same DNS name. This allows you to create a single hostname with multiple IPv4 and IPv6 records. This can be used to create round-robin DNS or to have both IPv4 and IPv6 support for the same name.
Floating IPs use a similar, but slightly different scheme from instances:
<name>.fip.<vpc>.<project>.<org>.<suffix>
<name> refers to the DNS name of the Floating IP
<vpc> refers to the DNS name of the VPC
<project> refers to the DNS name of the project
<org> refers to the DNS name of the organization
<suffix> refers to the DNS suffix that is used. For internal DNS this is always .internal. For external DNS, this varies based on the installation.
The main difference is that instead of the <instance>.<az>.inst prefix you have <name>.fip.
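Both naming schemes amount to joining labels in a fixed order. The helper functions below are hypothetical and exist only to illustrate the structure; they perform no validation of DNS label syntax.

```python
def instance_fqdn(instance, az, vpc, project, org, suffix="internal"):
    """Build an instance name: <instance>.<az>.inst.<vpc>.<project>.<org>.<suffix>."""
    return ".".join([instance, az, "inst", vpc, project, org, suffix])


def floating_ip_fqdn(name, vpc, project, org, suffix):
    """Build a Floating IP name: <name>.fip.<vpc>.<project>.<org>.<suffix>."""
    return ".".join([name, "fip", vpc, project, org, suffix])


# The internal and external names of the same instance differ only in suffix.
internal_name = instance_fqdn("glorfindel", "us-east-1", "gondolin", "noldor", "tolkien")
external_name = instance_fqdn(
    "glorfindel", "us-east-1", "gondolin", "noldor", "tolkien",
    suffix="oxide.fingolfin.org",
)
print(internal_name)
print(external_name)
```

Note that the Floating IP scheme carries no availability zone label, so a Floating IP's name stays the same regardless of where the object it is attached to lives.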
More concretely, let’s say we had the name blog.fip.gondolin.noldor.tolkien.oxide.fingolfin.org. Here blog is the DNS name of the floating IP. There may be more than one floating IP with the same DNS name, allowing for more than one record to be associated with the name (for example, both an IPv4 and an IPv6 address). As in the previous example, gondolin is the DNS name of the VPC, noldor is the DNS name of the project, and tolkien is the DNS name of the organization. Unlike with instances, there is only ever an external suffix; there is no internal one. Therefore oxide.fingolfin.org is the installation-specific suffix.
Let’s say that this corresponded to the example VPC. If both the IPv4 and IPv6 floating IP addresses from the example had the name 'blog', and we ran the host command on this name, here’s what we’d expect to see:
rm@elbereth ~ $ host blog.fip.gondolin.noldor.tolkien.oxide.fingolfin.org
blog.fip.gondolin.noldor.tolkien.oxide.fingolfin.org has address 72.14.186.115
blog.fip.gondolin.noldor.tolkien.oxide.fingolfin.org has IPv6 address 2600:3c00::f03c:91ff:fe96:a264
Note how both the IPv4 and IPv6 address show up. If instead of one IPv4 and IPv6 floating IP address, we instead had three IPv4 addresses, we would instead see:
rm@elbereth ~ $ host blog.fip.gondolin.noldor.tolkien.oxide.fingolfin.org
blog.fip.gondolin.noldor.tolkien.oxide.fingolfin.org has address 72.2.112.194
blog.fip.gondolin.noldor.tolkien.oxide.fingolfin.org has address 165.225.172.11
blog.fip.gondolin.noldor.tolkien.oxide.fingolfin.org has address 165.225.164.26
For more on DNS, see the DNS Record Scheme design considerations and DNS future directions.
VPC Firewalls
A VPC Firewall is a tool that can be used to limit what instances can and cannot talk to other instances. Each VPC has its own independent VPC firewall. A VPC firewall is made up of a series of rules. There is one set of rules for incoming traffic and a second set of rules for outgoing traffic.
The VPC firewall is a stateful firewall. This means that when a connection is established due to allowed rules, there don’t need to be explicit rules going in the other direction. Consider the case where inbound traffic is denied, but outbound traffic is allowed. When an instance makes an outbound connection, it still expects to receive some amount of inbound traffic in reply. Because the firewall is stateful, an exception is made to allow that specific inbound reply back in.
Each firewall rule has the following attributes:
Status: The rule’s current status. The rule can be enabled or disabled. This allows a rule to be manipulated without deleting it.
Direction: Either inbound or outbound to indicate which direction the rule applies to.
Target: Indicates the group of instances that the rule applies to.
Filters: Indicates a reduction of the Firewall rule. For the rule to apply, it must pass the filter. The filter can cover:
The Source (inbound) or Destination (outbound) addresses.
The IP protocol of the traffic. This could be, for example, TCP, UDP, or ICMP.
The ports that the traffic is using. For example, HTTP traffic commonly uses TCP port 80 or SSH uses TCP port 22.
Action: Describes what to do when a rule matches. There are two options: allow and deny. The former allows traffic through and the latter drops traffic.
Priority: A value between 0 and 65535. Rules are ordered according to priority. The rule with the lowest priority value is evaluated first. The default priority is 1000.
When specifying a target or a filter’s source or destination, it may be any of the values specified in Target Strings.
For filters, the use of explicit IP addresses and blocks is not recommended. Using tags, subnets, and instances, where possible will ensure that as an instance’s IP addresses and NICs change that nothing slips through firewall rules. It also allows most rules to target both IPv4 and IPv6 by default without having to think about whether the instance is using one or the other or updating a rule when that changes.
The priority of the rule dictates its evaluation order. Rules are sorted according to priority, and rules with the lowest priority value are evaluated first. The first rule that matches is used. Unlike routing, the prefix does not matter, only the priority. If there is both a matching allow rule and a matching deny rule at the same priority, then the deny rule takes precedence.
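The evaluation order can be sketched as follows. This is an illustration under stated assumptions rather than the firewall's implementation: the `Rule` type, the dictionary-based packet, and the `evaluate` function are hypothetical, and a rule's target and filters are collapsed into a single `matches` predicate.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    name: str
    direction: str   # "inbound" or "outbound"
    action: str      # "allow" or "deny"
    priority: int    # 0..65535; lower values are evaluated first
    matches: Callable[[dict], bool]
    enabled: bool = True


def evaluate(rules, direction, packet) -> Optional[str]:
    """Return the action of the first matching rule, or None if nothing matches."""
    candidates = [
        r for r in rules
        if r.enabled and r.direction == direction and r.matches(packet)
    ]
    if not candidates:
        return None
    # Lowest priority value wins; at equal priority a deny beats an allow.
    winner = min(candidates, key=lambda r: (r.priority, r.action != "deny"))
    return winner.action


def is_ssh(packet):
    return packet["proto"] == "tcp" and packet["port"] == 22


rules = [
    Rule("deny-all-inbound", "inbound", "deny", 65535, lambda p: True),
    Rule("allow-ssh", "inbound", "allow", 1000, is_ssh),
]
ssh_result = evaluate(rules, "inbound", {"proto": "tcp", "port": 22})
http_result = evaluate(rules, "inbound", {"proto": "tcp", "port": 80})
```

Here the SSH packet is allowed because the allow rule has the lower priority value, while the HTTP packet only matches the catch-all deny rule. Selecting the minimum over matching rules is equivalent to sorting all rules and stopping at the first match.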
Default and Implicit Rules
Each VPC has several implicit rules that cannot be deleted. These implicit rules operate with a priority of 65535, meaning that they are evaluated last. These rules are:
Implied allow outbound: This rule allows all instances to communicate outbound over both IPv4 and IPv6.
Implied deny inbound: This rule causes all inbound traffic to instances to be dropped.
The following rules are added by default when a VPC is created. These rules have a priority of 65534, meaning that they are evaluated before the implied rules, but any rule with a lower priority value takes precedence over them.
Default allow internal inbound: This rule allows inbound traffic to all instances on the VPC as long as it originated from the VPC.
Default allow ssh: This rule allows inbound TCP connections on port 22 from anywhere.
Default allow ICMP: This rule allows inbound ICMP traffic from anywhere. This allows network tools like ping to work.
Default allow RDP: This rule allows inbound TCP connections on port 3389 from anywhere. This allows Windows remote desktop to operate.
Traffic that is Always Blocked
Some classes of traffic are always blocked, no matter what the source or destination. The only Ethernet protocols that are allowed on the network are:
IPv4
ARP
IPv6
Traffic that does not match one of these three types will be dropped. There are no restrictions on the IP protocol that is sent. It can be TCP, UDP, ICMP, or anything else.
By default, instances are not allowed to spoof the IP address or MAC address that they are sending from. If they do not match what has been assigned to the instance and the interface, then that traffic will be dropped. There are some use cases such as software based routers that want to forward packets with different addresses. This can be enabled on a per-instance basis.
Rule Debugging
A challenging aspect with Firewalls is to understand their efficacy. The API should provide a means of evaluating whether a given packet would make it to an instance. See the additional discussion of Firewall Flow Logging in the future directions. Some of that may be useful sooner.
Examples
Let’s take the example that we used at the start and see how we might use firewall rules to further constrain how different things can talk to one another. The following table summarizes the firewall rules that we’re going to apply. These rules are ordered in their priority order.
Rule Number | Direction | Action | Target | Filters: IPs | Filters: Protocol | Filters: Ports | Explanation |
---|---|---|---|---|---|---|---|
1 | Inbound | Allow | VPC Subnet 'DB' | VPC Subnet 'DB' | All | All | This is a rule that allows all of the databases to talk to one another. |
2 | Inbound | Allow | VPC Subnet 'DB' | Tag: HTTP | TCP | 5432 | This rule allows instances with the HTTP tag, being the 'HTTP 1' and 'HTTP 2' servers to connect and query the databases. |
3 | Inbound | Allow | Instance 'Load Balancer' | All | TCP | Ports 80, 443 | This rule allows the load balancer to receive HTTP and HTTPS traffic from outside of the VPC and from the broader Internet. |
4 | Inbound | Allow | VPC Subnet 'App' | Entire VPC | All | All | This rule allows any instance in the 'DB' and 'App' Subnets to make connections to an instance in the 'App' Subnet. |
5 | Inbound | Deny | All | All | All | All | This is the default inbound-deny rule that stops all other inbound traffic from functioning. This is the lowest priority Inbound rule. |
6 | Outbound | Allow | All | All | All | All | This is the default outbound-allow rule that allows all other outbound traffic to function. This is the lowest priority Outbound rule. |
In this example, we have a number of rules in place that allow inbound connectivity between different groups of instances. If we look back at the original flow we described during the VPC Example, these rules allow traffic to flow in a natural way. Here is a bit more information on some of these rules.
First, you’ll notice that there are a lot of inbound allow rules, but there aren’t many outbound allow rules. This is because of the default rules, which are rules 5 and 6. The default firewall rules allow all outbound traffic and deny all inbound traffic. Therefore, we have a number of exceptions to that default deny rule, each of which has a higher priority.
Take rule 3. This rule allows the load balancer to receive inbound connections from anywhere, but only on TCP ports 80 and 443, which are the common ports for HTTP and HTTPS. Without this rule, the load balancer wouldn’t be able to receive traffic. The filter ensures that it’s only on the ports that this should be allowed on. If some other service was started on the load balancer instance, it wouldn’t immediately be accessible to the rest of the world. Similarly, because the target of rule 3 is the load balancer instance, this doesn’t apply to any other instances.
Rules 1 and 2 lock down the 'DB' Subnet in two different ways. First, Rule 1 allows the instances from within the 'DB' Subnet to be able to talk to each other on any port. However, if you’re outside of the 'DB' Subnet you could not talk to any of the instances within it. This would make it rather hard for the databases to actually serve their traffic.
This is where Rule 2 comes into play. It allows inbound access to the databases, but with a rather constraining filter. Because TCP port 5432 is the main port for Postgres, the filter constrains all communication to that port. The other part of the filter is on a tag. Instances can be grouped together with a tag (See RFD 4 for more on tags).
Using a tag we can collect all of the HTTP instances together. The difference between using the tag and using the whole 'App' VPC Subnet is the 'load balancer'. In this case, we don’t want the load balancer to be able to talk to the databases directly, so we use a more constrained set. By using a tag, if we expand the set of instances needed for serving HTTP traffic, we can add the tag to them and know that the firewall rules will automatically adjust. The same is true if we delete an instance with a tag.
Finally, the 4th rule targets all instances in the 'App' Subnet. It allows all instances in the entire VPC to connect to any instance in the 'App' Subnet.
Based on these rules here are various connections that will be denied and result in the traffic being dropped:
The load balancer trying to talk to any of the databases.
The HTTP/Application servers trying to talk to anything other than TCP port 5432 on the databases.
External communication reaching anything other than the load balancer (which could happen if the Floating IPs were moved or an Ephemeral IP were added).
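The outcomes listed above can be checked with a small model of the inbound rules from the table. Everything here is a sketch under assumptions: the instance names, the tuple layout of each rule, and the `selects` helper are invented for illustration, and rules are simply tried in table order.

```python
# Hypothetical instances modelled on the example; names are illustrative.
INSTANCES = {
    "lb":    {"subnet": "App", "tags": set()},
    "http1": {"subnet": "App", "tags": {"HTTP"}},
    "db1":   {"subnet": "DB",  "tags": set()},
    "db2":   {"subnet": "DB",  "tags": set()},
}


def selects(selector, name):
    """True if a selector such as ("subnet", "DB") covers the named host."""
    kind, value = selector
    if kind == "all":
        return True
    inst = INSTANCES.get(name)  # None for hosts outside the VPC
    if inst is None:
        return False
    if kind == "vpc":
        return True
    if kind == "instance":
        return name == value
    if kind == "subnet":
        return inst["subnet"] == value
    if kind == "tag":
        return value in inst["tags"]
    return False


# (target, source filter, protocol, ports, action), in priority order.
INBOUND_RULES = [
    (("subnet", "DB"), ("subnet", "DB"), None, None, "allow"),        # rule 1
    (("subnet", "DB"), ("tag", "HTTP"), "tcp", {5432}, "allow"),      # rule 2
    (("instance", "lb"), ("all", None), "tcp", {80, 443}, "allow"),   # rule 3
    (("subnet", "App"), ("vpc", None), None, None, "allow"),          # rule 4
    (("all", None), ("all", None), None, None, "deny"),               # rule 5
]


def inbound_action(src, dst, proto, port):
    """Return the action of the first inbound rule matching the connection."""
    for target, source, rule_proto, ports, action in INBOUND_RULES:
        if not selects(target, dst) or not selects(source, src):
            continue
        if rule_proto is not None and rule_proto != proto:
            continue
        if ports is not None and port not in ports:
            continue
        return action
    return "deny"
```

With this model, `inbound_action("lb", "db1", "tcp", 5432)` falls through to rule 5 and is denied, while the same connection from `"http1"` is allowed by rule 2.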
VPC Peering
While each VPC is independent and private, there are times that you want to be able to privately share traffic between two VPCs in different projects and organizations within a region without going out over public IP addresses.
For example, a company that provides a database as a service may have a VPC or subnet per customer. In an ideal world, they would be able to peer that VPC or subnet into their customer’s private networks. An example of this is discussed in RFD 9 Networking Considerations.
To facilitate this, two VPCs can be peered together. When two VPCs are peered together, for each subnet that’s been peered, the following happens:
A VPC Peering Gateway is created that represents the remote VPC that can be used in VPC Custom Routers.
A route is added for that subnet to the VPC System Router that uses the VPC peering gateway.
DNS requests for the remote subnet using the DNS scheme that refers to the far side’s VPC, Project, and Organization will function. Note, the local DNS search path will not be modified to include searching that.
One has the ability to specify a remote VPC as the filter of VPC Firewall rules.
When a VPC peering relationship is established, subnet propagation can happen in two different modes:
Automatic mode: All subnets in a VPC are shared. As subnets are created and destroyed, they will automatically be shared or be removed.
Manual mode: Each side can manually specify the subnets that they wish to share.
The system will not allow two subnets from either end of a VPC to be
shared if they overlap in their blocks. The use of IPv6 ULA /
prefixes on a per-VPC basis should eliminate this possibility; however,
it is easy to end up in this situation with IPv4 and the default subnet.
Automatic mode cannot be enabled if there is existing overlap.
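A minimal sketch of the overlap check, using Python's standard `ipaddress` module. The function name and the example blocks are assumptions; the point is only that each pair of blocks, one from each side of the peering, has to be compared.

```python
import ipaddress


def overlapping_pairs(local_blocks, remote_blocks):
    """Return every (local, remote) pair of CIDR blocks that overlap."""
    pairs = []
    for a in map(ipaddress.ip_network, local_blocks):
        for b in map(ipaddress.ip_network, remote_blocks):
            # Blocks of different IP versions can never overlap; skip them.
            if a.version == b.version and a.overlaps(b):
                pairs.append((str(a), str(b)))
    return pairs
```

Two VPCs that both kept the same default IPv4 block would produce a non-empty result here, which is the situation in which sharing, or automatic mode, would be refused.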
When a subnet is removed, the routes, firewall rules, and ability to make DNS requests about that subnet are removed. When a peered VPC relationship is removed, then all other information and rules about it are removed as well.
Currently, a VPC can only be peered to one other in this fashion. See also the discussion of future directions for VPC Peering and VPNs.
Inside the Instance
This section describes the behavior of an instance with respect to the network.
VNICs
Inside an instance, you will find one or more Virtual NICs (VNICs). VNICs show up to the operating system as a normal PCI Express Network Interface Card and show up in normal networking tools like ifconfig, ip link, ipadm, and the PowerShell Get-NetIPAddress.
A VNIC behaves the same as a normal device, with a few exceptions:
The interface will always appear up.
The speed of the interface is not a reflection of the actual speed of the link (there is none, because the NIC is virtual).
Certain commands and tools that ask for features of Ethernet (such as link advertisements, auto-negotiation configuration, or blinking a NIC’s LEDs) will not function the same way and will likely fail.
On each Interface inside of the instance, there will be an IPv4 and/or an IPv6 address depending on the instance’s configuration. These addresses will always come from the same subnet.
DHCP and Options
The per-instance IPv4 and IPv6 addresses will be supplied over DHCP for both IPv4 and IPv6. A number of additional values will be transmitted over DHCP. These include:
The interface’s MTU, which is always 1500 currently. See the MTU design discussion for more on that.
The instance’s hostname, which comes from the instance’s DNS name.
The instance’s DNS domain name, which is based on the VPC Subnet.
The DNS Domain search domain, which is based on the VPC Subnet.
The NTP server for the instance, which is provided by Oxide.
The default gateway for all traffic, which is based on the VPC Subnet.
Currently, there are no plans for custom DHCP options on a per-subnet or VPC basis. However, to better assess if this is required, there are several questions in the DHCP and Options questions section. The design of transmitting several things over DHCP is such that if we want to add that in the future, it is easy to do so.
With the exception of the instance’s host name and possibly the MTU, all of the other properties are properties of the system and not expected to change. For the instance’s host name updates to be propagated into the instance, DHCP must be restarted. For this to happen before the DHCP lease expires, tools inside of the guest instance may be required.
Routing and Gateways
Inside of instances, we are currently planning on borrowing from GCP and
giving instances an off-link gateway and giving the smallest address
allocation (a /
for IPv4 and a /
for IPv6). This uses the
RFC 3442 Local Subnet Routes
feature.
While it requires a bit more knowledge from the guest, this minimizes the amount of ARP traffic that the guest sends and makes sure that all traffic is always destined for an IP address that we control. This makes it easier to deal with and capture packets for firewall rules, routing, and more.
Critically, this means that the only route that an instance needs is that of the off-subnet gateway, which will be the VPC Subnet’s gateway. As the VPC System Router or the VPC Subnet’s VPC Custom Router is updated, no changes need to be propagated to individual instances.
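To make the off-link gateway idea concrete, here is a sketch of how a pair of routes could be encoded in the RFC 3442 classless static route option (DHCP option 121). Each route is encoded as the prefix length, the significant octets of the destination, and the router address, and a router of 0.0.0.0 marks a destination as directly reachable on the link. The gateway address and the choice of exactly these two routes are assumptions for illustration, not a statement of what the rack's DHCP server sends.

```python
import ipaddress


def encode_classless_routes(routes):
    """Encode (destination CIDR, router) pairs as an RFC 3442 option payload."""
    payload = bytearray()
    for destination, router in routes:
        net = ipaddress.IPv4Network(destination)
        significant = (net.prefixlen + 7) // 8  # octets needed for the prefix
        payload.append(net.prefixlen)
        payload += net.network_address.packed[:significant]
        payload += ipaddress.IPv4Address(router).packed
    return bytes(payload)


# Assumed address: the VPC Subnet's gateway is 172.30.0.1.
GATEWAY = "172.30.0.1"
option_121 = encode_classless_routes([
    (f"{GATEWAY}/32", "0.0.0.0"),  # the gateway itself is reachable on-link
    ("0.0.0.0/0", GATEWAY),        # everything else goes via the gateway
])
```

Because the guest only learns a host route to the gateway plus a default route, changes to the VPC routers never need to be pushed into the guest's routing table.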
The Primary and Multiple Interfaces
An instance may have additional interfaces inside of it. Each interface can belong to a different VPC Subnet from the same VPC. An instance’s networking cannot span more than one VPC.
The first interface is considered the 'primary' interface. The primary interface has several properties:
Only the primary interface’s IP addresses show up in Internal DNS.
Ephemeral and Instance IPs are always forwarded to it.
It is the only interface to receive a default route over DHCP.
Additional interfaces from different subnets can also be allocated. These secondary interfaces will not be registered in DNS, though all firewall rules targeting them will be applied.
See additional discussion on the design of Hot-plug of Interfaces and IPs and future direction on Multiple IPs per Interface.
API
The networking APIs and routes are broken into two categories:
VPC-Scoped API Routes
Project-Scoped API Routes
While all resources are always owned by a corresponding project, several
of the resources and routes are specific to a VPC, rather than the
project as a whole. VPC-scoped routes are always under the
/
hierarchy, while the other networking
functions are found under the top-level project API route
/
and represent resources that can be used
across multiple VPCs in the same project, such as
floating IPs.
VPC-scoped API Route Summaries
Verb | Route | Purpose |
---|---|---|
GET | | List all VPCs in a project |
POST | | Create a new VPC in a project |
GET | | Get details about a specific VPC |
PUT | | Update details about a specific VPC |
DELETE | | Delete a VPC |
From here on, all APIs are rooted under the VPC, so there is an implicit
/
at the start of every route.
Verb | Route | Purpose |
---|---|---|
GET | | List all VPC Subnets in a VPC |
POST | | Create a new VPC Subnet |
GET | | Get details of a VPC Subnet |
PUT | | Update details of a VPC Subnet |
DELETE | | Delete a VPC Subnet |
GET | | List IP Addresses on a VPC Subnet |
GET | | Get information about an IP Address on a VPC Subnet |
PUT | | Update details about an IP Address |
Verb | Route | Purpose |
---|---|---|
GET | | List all of the VPC Custom and System Routers |
POST | | Create a new VPC Custom Router |
GET | | Get all of the details of the VPC System or VPC Custom Router |
PUT | | Update the details of a VPC Custom Router |
DELETE | | Delete a VPC Custom Router |
GET | | List all of the routes on a given router. |
PUT | | Create a new route in the specified VPC router. |
Verb | Route | Purpose |
---|---|---|
GET | | Return the VPC’s firewall rules. |
PUT | | Update the VPC’s firewall rules. |
Verb | Route | Purpose |
---|---|---|
GET | | List Internet gateways |
POST | | Create a new Internet gateway |
GET | | Get information about the specified Internet Gateway |
PUT | | Update information about an Internet Gateway |
DELETE | | Delete the specified Internet Gateway |
See also: Future Looking APIs for VPNs.
Verb | Route | Purpose |
---|---|---|
GET | | List VPC Peering gateways |
POST | | Create a new VPC Peering gateway |
GET | | Get information about the specified VPC Peering gateway |
PUT | | Update information about a VPC Peering gateway |
DELETE | | Delete the specified VPC Peering gateway |
GET | | List VPC subnets that have been shared by peering |
GET | | Get details about a VPC Subnet that has been peered |
PUT | | Update information about a Peered VPC subnet |
Project-scoped API Route Summaries
Verb | Route | Purpose |
---|---|---|
GET | | List all floating IPs currently assigned to the VPC |
POST | | Allocate a new floating IP to the project |
GET | | Get information about the specified floating IP. |
PUT | | Update information about the floating IP, such as the instance or Internet gateway it is assigned to or DNS information. |
DELETE | | Delete the specified floating IP. |
GET | | List all IP Pools that are available to a project. |
GET | | Get information about a specific IP Pool. |
See also: Future APIs for Load Balancers.
Verb | Route | Purpose |
---|---|---|
GET | | Look up what resources refer to the specified IP address in the project. |
General API Concepts
Target Strings
In many places in the networking APIs, there is the ability to specify a target string. This occurs in VPC Router entries, VPC firewall rules, and VPC firewall filters. These target strings are made up of two pieces, a resource scope and a resource identifier. For example "vpc:my_vpc", "subnet:db", "ip:192.168.1.1/32", etc. In these cases 'vpc', 'subnet', and 'ip' are the resource scopes, and the parts after the ':' are the resource identifiers. Not all HTTP objects support all of the resource scopes.
In addition, when using lists of resources in the API, we use a similar target string style specification to provide us a good way to extend the types of resources that may be listed in a resource as the system evolves.
The following types of resource scopes and identifiers are generally supported in the API:
Resource Scope | Resource Identifier | Description | Example | Allowed context |
---|---|---|---|---|
vpc | The name of the VPC | Targets all networking traffic from the specified VPC. This is either the current VPC or a peered VPC. | "vpc:default" | Firewall rule target, firewall host filter, route target |
subnet | The name of the VPC Subnet | Targets all networking traffic from the specified VPC Subnet. | "subnet:databases", "subnet:blog-web" | Firewall rule target, firewall host filter, route target |
instance | The name of the instance | Targets all of the IP addresses that are assigned to that instance. | "instance:frontdoor-lb" | Firewall rule target, firewall host filter, route target |
tag | The name of a tag | Targets all instances that have the same tag. | "tag:https" | Firewall rule target, firewall host filter |
ip | A specific IP address or CIDR block | Targets the specific IP address or CIDR block | "ip:10.20.30.0/24", "ip:2600:3c00::f03c:91ff:fe96:a264" | Firewall host filter, route target |
inetgw | The name of an Internet Gateway | Targets the specified Internet Gateway | "inetgw:default" | Firewall host filter, route target |
fip | Name of a Floating IP | Targets the specified floating IP. | "fip:foobazco.org" | Route target |
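A target string can be parsed with a few lines of code. The sketch below is an assumption about how a client might do it, not part of the API; the one detail worth noting is that the split has to happen on the first ':' only, because IPv6 identifiers contain colons themselves.

```python
import ipaddress
from typing import NamedTuple

# Resource scopes taken from the table above.
SCOPES = {"vpc", "subnet", "instance", "tag", "ip", "inetgw", "fip"}


class Target(NamedTuple):
    scope: str
    identifier: str


def parse_target(text: str) -> Target:
    """Split a target string into its resource scope and identifier."""
    # Split on the first ':' only, since IPv6 identifiers contain colons.
    scope, sep, identifier = text.partition(":")
    if not sep or not identifier:
        raise ValueError(f"expected 'scope:identifier', got {text!r}")
    if scope not in SCOPES:
        raise ValueError(f"unknown resource scope {scope!r}")
    if scope == "ip":
        # Accepts a single address or a CIDR block; raises ValueError otherwise.
        ipaddress.ip_network(identifier, strict=False)
    return Target(scope, identifier)
```

Whether a given scope is acceptable still depends on the context, as the "Allowed context" column shows, so a caller would apply a second check on top of this parser.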
Whole Object Updates, ETags, and Conditional PUTs
Every HTTP object has a corresponding GET method and can be updated with a PUT method. When using the GET method, the entire object is returned. To update the object, the entire new version of the object must be included. There is no partial update functionality with the PUT method.
All of the GET and PUT endpoints support the use of an ETag to perform conditional requests. The ETag is returned as an HTTP header on a successful GET or PUT request.
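As a sketch of what a conditional whole-object update could look like from a client, the following builds (but does not send) a PUT that carries the previously returned ETag in the standard HTTP `If-Match` header. The URL, the ETag value, and the object body are hypothetical, and the use of `If-Match` is an assumption based on ordinary HTTP conditional request semantics rather than something this document specifies.

```python
import json
import urllib.request


def build_conditional_put(url: str, etag: str, body: dict) -> urllib.request.Request:
    """Build a PUT that should only apply if the object's ETag still matches."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json", "If-Match": etag},
    )


# Hypothetical URL and object body, for illustration only.
req = build_conditional_put(
    "https://rack.example/project/demo/vpcs/default",
    etag='"v1"',
    body={"name": "default", "description": "updated description"},
)
```

Because the PUT replaces the whole object, the client would normally GET the object, modify the returned copy, and send it back with the ETag from that GET so that a concurrent change is detected rather than overwritten.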
Object Lists
There are several HTTP resources in the networking APIs that represent endpoints that can be used to list resources. For each resource that can be listed, there is pagination, meaning that not all entries will be returned in a single request.
The networking API will work like our other APIs and should follow what was laid out in RFD 4 User Facing API for pagination.
VPC APIs
VPCs are represented by an object. When a project is created, a VPC is usually created automatically. Every VPC contains the following fields:
Key | Value Type | Read-Only | Description |
---|---|---|---|
| String | No | The name of the resource. This is used as the main key in the API. |
| String | No | An optional description. |
| String | No | The name that is used for the resource in DNS. |
| String | yes | A UUID for the object. |
| string | yes | The date and time that the object was created. |
| string | yes | The date and time that the object was last modified. |
List VPCs (GET /project/{project_name}/vpcs)
The List VPCs endpoint, GET /project/{project_name}/vpcs, returns an array of JSON objects, each of which has the form described above.
XXX Example
Create VPC (POST /project/{project_name}/vpcs)
The Create VPC endpoint, POST /project/{project_name}/vpcs, can be used to create a new VPC inside of the project {project_name}. When creating a VPC, one can opt to have the VPC defaults set up or not. When creating a VPC with the defaults, the following will be configured:
A default VPC Subnet will be created with both IPv4 and IPv6 CIDR blocks.
The VPC firewall will be populated with the default rules.
When creating a VPC, the following parameters may be specified.
Key | Value Type | Required | Default Value | Description |
---|---|---|---|---|
| string | yes | - | The name of the resource. |
| string | no | - | An optional description of the VPC. |
| string | no | The value of | Sets the value of the VPC that will be used in DNS. If this is not specified, this property will be seeded with the value of the |
| string | no | The system default | This sets the default prefix that will be used for IPv6 addresses. This must come from the IPv6 Unique Local Addresses in the |
| boolean | no | false | Causes the VPC to be created with its default properties as discussed above. |
VPC Subnet APIs
A VPC Subnet is represented as a JSON object with the following fields:
Key | Value Type | Read-Only | Description |
---|---|---|---|
| String | no | The name of the resource. This is used as the main key in the API. |
| String | no | An optional description. |
| String | yes | The availability zone the subnet is located in. |
| String | no | The name of the optional custom router that this subnet uses. |
| String | No | The name that is used for the resource in DNS. |
| String | yes | A UUID for the object. |
| String | yes | The IPv4 CIDR block associated with the VPC Subnet. |
| String | yes | The IPv6 CIDR block associated with the VPC Subnet. |
| Integer | yes | The MTU of the network in bytes. It is always |
| string | yes | The date and time that the object was created. |
| string | yes | The date and time that the object was last modified. |
| String | yes | The UUID of the VPC this VPC Subnet is a part of. |
VPC Router APIs
Key | Value Type | Read-Only | Description |
---|---|---|---|
| String | no | The name of the resource. This is used as the main key in the API. |
| String | no | An optional description. |
| String | yes | A UUID for the object. |
| string | yes | The date and time that the object was created. |
| string | yes | The date and time that the object was last modified. |
| String | yes | The name of the VPC this VPC Router is a part of. |
VPC Route APIs
The VPC Route rules are represented as a JSON array of route objects. Here’s an example of a set of rules that might appear in a VPC Custom Router’s routes:
    [
        {
            "name": "vpn_v4",
            "destination": "ip:172.16.0.0/12",
            "target": "instance:my_vpn",
            "description": "Rule to route traffic to our DC in Hobbiton"
        },
        {
            "name": "drop_db",
            "destination": "subnet:cust_db",
            "target": "drop",
            "description": "don't let folks route to that subnet"
        }
    ]
The body of each rule has the following fields:
Key | Value Type | Valid Values | Description |
---|---|---|---|
| string | Standard RFC1035 names, as in the rest of the API | The name of the resource. This is used as the main key in the API. |
| string | The 'vpc', 'subnet', and 'ip' Target Strings. | Specifies what network traffic the route matches. See the discussion below. |
| string | The 'vpc', 'subnet', 'instance', 'inetgw', and 'ip' Target Strings | Lists where traffic specified in destination should be forwarded to. See the discussion below. |
| string | N/A | An optional string that provides additional information about the rule |
In the destination member, the Target Strings have the following semantics:
ip: Indicates that the routing rule matches the specified IP addresses.
subnet: Indicates that the routing rule matches the VPC Subnet’s IPv4 and IPv6 prefixes.
vpc: This may only be used to specify a peered VPC and represents all of the subnets that are currently peered.
In the target member, the Target Strings have the following semantics:
ip: Sends all traffic to the specified IP. If this IP is not something that exists on the network, then traffic may be dropped.
instance: Sends all traffic to the primary IP address of the specified instance. If the instance’s IP address changes, this will change along with it.
inetgw: Sends all traffic to the specified Internet Gateway.
subnet: This ensures that the specified traffic is sent to the specified VPC Subnet. This is used by the VPC System Router to ensure that all traffic for a specific VPC Subnet is directed to it.
vpc: This may only be used to specify a peered VPC. This ensures that all traffic is sent to the peered VPC.
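The scope restrictions on the destination and target members can be sketched as a small validation function. The function name is hypothetical, the allowed scopes are taken from the field table above, and accepting the bare target "drop" is an assumption based solely on its appearance in the example route set.

```python
# Allowed scopes for each member, taken from the route field table.
DESTINATION_SCOPES = {"vpc", "subnet", "ip"}
TARGET_SCOPES = {"vpc", "subnet", "instance", "inetgw", "ip"}


def route_errors(route: dict) -> list[str]:
    """Return a list of problems with a route's destination and target scopes."""
    errors = []
    dest_scope = route.get("destination", "").partition(":")[0]
    if dest_scope not in DESTINATION_SCOPES:
        errors.append(f"destination scope {dest_scope!r} is not allowed")
    target = route.get("target", "")
    # The example route set also uses the bare target "drop"; treating it
    # as valid here is an assumption based on that example.
    if target != "drop" and target.partition(":")[0] not in TARGET_SCOPES:
        errors.append(f"target {target!r} is not allowed")
    return errors
```

A fuller check would also enforce that a 'vpc' destination or target names a peered VPC, which requires knowledge of the peering state that this sketch does not model.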
Get VPC Router Routes (GET /routers/{router_name}/routes)
The GET VPC Router Routes endpoint, GET /routers/{router_name}/routes, returns the set of VPC Router routes that exist for the specified router.
XXX Example
Set VPC Router Routes (PUT /routers/{router_name}/routes)
The Set VPC Router Routes endpoint, PUT /routers/{router_name}/routes, replaces the set of VPC Router routes for the specified router. The body format is as described above for GET. Note, the VPC System Router rules have restrictions on what can be modified.
XXX Example
VPC Firewall APIs
The VPC Firewall rules are represented as an array of JSON objects. Here’s an example of a set of rules:
    [
        {
            "name": "http",
            "status": "enabled",
            "direction": "inbound",
            "targets": [ "tag:web" ],
            "filters": {
                "protocols": [ "TCP" ],
                "ports": [ "80", "443" ]
            },
            "action": "allow",
            "priority": 1000,
            "description": "serve our HTTP traffic externally"
        },
        {
            "name": "bastion",
            "status": "enabled",
            "direction": "inbound",
            "targets": [ "subnet:db", "subnet:admin" ],
            "filters": {
                "hosts": [ "vpc:our_vpc", "ip:1.2.3.4/32", "ip:2600:3c00::f03c:91ff:fe96:a264/128" ],
                "protocols": [ "TCP" ],
                "ports": [ "22" ]
            },
            "action": "allow",
            "priority": 50,
            "description": "allow our bastion host to ssh into systems"
        },
        {
            "name": "no-ssh",
            "status": "disabled",
            "direction": "inbound",
            "targets": [ "vpc:our_vpc" ],
            "filters": {
                "protocols": [ "TCP" ],
                "ports": [ "22" ]
            },
            "action": "deny",
            "priority": 100,
            "description": "drop non-bastion ssh traffic, disabled on 4 APR 2020 for debugging by rm"
        }
    ]
Each rule has the following fields:
Key | Value Type | Valid Values | Description |
---|---|---|---|
| string | Standard RFC1035 names, as in the rest of the API | The name of the resource. This is used as the main key in the API. |
| string | "enabled" or "disabled" | Determines whether or not the rule is in effect. |
| string | "inbound" or "outbound" | Determines whether the rule is checked when a packet arrives at an instance (incoming) or when it leaves the instance (outgoing). |
| string array | The 'vpc', 'subnet', 'instance', 'tag', and 'ip' Target Strings | Lists the sets of instances that the rule applies to. |
| JSON Object | Items that filter the scope of the rule | |
| string | "allow" or "deny" | Indicates whether the rule should allow or deny traffic. |
| integer | 0-65535 | Indicates the relative priority of the rule |
| string | N/A | An optional string that provides additional information about the rule |
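The valid values in the table can be checked mechanically before a rule set is submitted. The sketch below is a hypothetical client-side helper; the key names (`status`, `direction`, `targets`, `action`, `priority`) are taken from the JSON example above, and nothing here reflects server-side behavior.

```python
def rule_errors(rule: dict) -> list[str]:
    """Return a list of problems with a firewall rule's basic fields."""
    errors = []
    if rule.get("status") not in ("enabled", "disabled"):
        errors.append("status must be 'enabled' or 'disabled'")
    if rule.get("direction") not in ("inbound", "outbound"):
        errors.append("direction must be 'inbound' or 'outbound'")
    if rule.get("action") not in ("allow", "deny"):
        errors.append("action must be 'allow' or 'deny'")
    priority = rule.get("priority")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if (isinstance(priority, bool) or not isinstance(priority, int)
            or not 0 <= priority <= 65535):
        errors.append("priority must be an integer from 0 to 65535")
    targets = rule.get("targets")
    if not isinstance(targets, list) or not all(isinstance(t, str) for t in targets):
        errors.append("targets must be an array of target strings")
    return errors
```

Since the firewall rules are replaced as a whole with a single PUT, validating every rule locally first avoids having an entire update rejected because of one malformed entry.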
Firewall Filters
A filter is a JSON object with a series of keys, each of which describes a different axis to filter the object on. A filter can have any combination of the following fields, but it must have at least one of them:
hosts: An array of strings, each of which is a target string. See Target Strings for further restrictions
protocols: An array of strings. Valid protocols are: TCP, UDP, and ICMP
ports: An array of strings, each of which is a port to allow. A range of ports can also be specified by including two numbers separated by a '-'. This range is treated as inclusive. For example, "80-100" would match ports 80 through 100, including both 80 and 100.
A given packet must match all of the filters to be affected by a rule. For example, if a filter specifies the protocol TCP and port 12345, then a UDP packet on port 12345 or a TCP packet on another port would not match.
When more than one entry is specified for any of hosts, protocols, or ports, then the packet can match any of them. For example, if you look at the "http" rule in the earlier example, then it will match both TCP packets on port 80 and TCP packets on port 443.
Phrased a different way, the filter categories (hosts, protocols, and ports) are joined together with a logical AND (&&), while entries within a category are joined together with a logical OR (||).
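The AND-across-categories, OR-within-a-category semantics can be sketched as below. The packet representation is an assumption: in particular, `packet["hosts"]` stands for the set of target strings that the remote endpoint is assumed to have already been resolved to, which sidesteps the real work of mapping addresses to VPCs, subnets, and tags.

```python
def port_matches(spec: str, port: int) -> bool:
    """Match a single port ("22") or an inclusive range ("80-100")."""
    low, _, high = spec.partition("-")
    return int(low) <= port <= int(high or low)


def filter_matches(filters: dict, packet: dict) -> bool:
    """AND across the categories that are present, OR within each category."""
    if "hosts" in filters and not set(filters["hosts"]) & packet["hosts"]:
        return False
    if "protocols" in filters and packet["protocol"] not in filters["protocols"]:
        return False
    if "ports" in filters and not any(
        port_matches(spec, packet["port"]) for spec in filters["ports"]
    ):
        return False
    return True
```

A category that is absent from the filter places no constraint on the packet, which is why the "http" rule in the earlier example matches traffic from any host.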
Get VPC Firewall Rules (GET /firewall/rules)
The GET VPC Firewall Rules endpoint, GET /firewall/rules, returns the set of VPC firewall rules that currently exist. The object returned is a single JSON object with all the firewall rules as described above.
XXX Example
Put VPC Firewall Rules (PUT /firewall/rules)
The PUT VPC Firewall Rules endpoint, PUT /firewall/rules, replaces the entire set of firewall rules with the new object. The object format is as described above.
XXX Example
VPC Internet Gateway APIs
Key | Value Type | Read-Only | Description |
---|---|---|---|
| String | no | The name of the resource. This is used as the main key in the API. |
| String | no | An optional description. |
| String | yes | The availability zone the Internet Gateway is located in. |
| String | yes | A UUID for the object. |
| String | No | Either 'automatic' or 'manual'. Indicates whether the system should automatically select IPs to use for the Internet gateway on its own or if the user will explicitly scale up and down the IPs. |
| String Array | No | An optional array of strings that indicate the IP pools to leverage for the Internet gateway when in 'automatic' mode. The gateway must be in 'automatic' mode for this to be used. |
| String Array | No | An array of strings, each of which indicates a floating IP address to use for the gateway. The gateway must be in 'manual' mode for this to be set. |
| String | Yes | Indicates the type of Gateway. This will always return the string 'internet'. |
| string | yes | The date and time that the object was created. |
| string | yes | The date and time that the object was last modified. |
Floating IP APIs
Key | Value Type | Read-Only | Description |
---|---|---|---|
| String | No | The name of the resource. This is used as the main key in the API. |
| String | No | An optional description. |
| String | No | The name that is used for the resource in DNS. |
| String | Yes | The string form of the IP address. |
| String | Yes | Indicates if it’s an IPv4 or IPv6 address. |
| String | Yes | An optional parameter that indicates the name of the IP pool this was allocated from. |
| String | No | A Target String that indicates what the resource is attached to. Supported targets are instances and VPC Internet Gateways. |
| string | yes | The date and time that the object was created. |
| string | yes | The date and time that the object was last modified. |
IP Pool APIs
Key | Value Type | Read-Only | Description |
---|---|---|---|
| String | No | The name of the resource. This is used as the main key in the API. |
| String | No | An optional description. |
| String | yes | The IP CIDR block associated with the IP Pool. Whether it is IPv4 or IPv6 depends on the value in |
| String | Yes | Indicates if this pool references IPv4 or IPv6 addresses. |
| string | yes | The date and time that the object was last modified. |
API Summary
Route | Purpose |
---|---|
| Manage VPCs |
| Manage VPC Subnets |
| Manage IPs on a VPC Subnet |
| Manage VPC System and Custom Routing Tables |
| Manage specific routes for a VPC System and Custom Routing Table |
| Manage VPC Firewall Rules |
| Manage VPC Internet Gateways |
| Manage VPC Peering |
| Manage VPC Subnets received through VPC Peering |
| Manage Floating IPs |
| Manage IP Pools |
| Search for the usage of an IP address |
Open Design Questions
IPv6 Addressing
IPv6 addressing marks an opportunity for a major departure from how IPv4 is often treated in cloud deployments. Because IPv4 addresses are scarce and there is a limited set of both public and private addresses, all of the major cloud providers have virtualized IPv4 addresses that customers use for communication. In addition, they use a 1:1 NAT for mapping Internet-accessible IPv4 addresses to an internal address.
With IPv6, these same constraints aren’t necessarily required. AWS is using their large IPv6 address space allocations to give every VPC a public / which is split into explicit / allocations for a subnet. The implication is that for those IPv6 uses, there is no difference between whether an address is considered public or private. Rather, it is all about the firewall posture.
A related challenge with this model is the question of how a floating IP works. Critically, AWS does not support Elastic IP for IPv6. This makes sense, given the design that they’re using. AWS is likely not playing any kind of virtualization games here; instead, they’re just using normal routing, ACLs, and firewalling.
While the current API proposes one solution to this, there is a related thing that we need to consider, which is how we map addresses to and from one location. The advantage of the current NAT approach is that it doesn’t require any adjustments in the guest. However, this means that it’s also not possible to have multiple external IP addresses and distinguish them. An alternative approach that would require more work in the guest would be to route those addresses to the guest directly. This relies on the use of addresses outside of the ULA space.
The current API design mimics IPv4 and allows IPv4 and IPv6 to have a similar experience. It also works on the assumption that customers will not have a large amount of publicly routed IPv6 space.
To try and help answer these questions, there is the IPv6 part of the customer questions section.
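To make the address-space discussion above concrete, here is a minimal sketch of carving a per-VPC IPv6 block into per-subnet blocks and of telling ULA addresses apart from other space. The prefix lengths (/56 and /64), the helper names, and the documentation-prefix addresses are assumptions chosen for illustration; they are not values taken from this RFD.

```python
import ipaddress

# Assumed sizes, for illustration only: one /56 per VPC, carved into /64 subnets.
VPC_PREFIX_LEN = 56
SUBNET_PREFIX_LEN = 64

# Unique local address space (RFC 4193).
ULA = ipaddress.ip_network("fc00::/7")


def carve_subnets(vpc_block: str, count: int) -> list:
    """Return the first `count` subnet-sized blocks inside a VPC block."""
    vpc = ipaddress.ip_network(vpc_block)
    if vpc.prefixlen != VPC_PREFIX_LEN:
        raise ValueError(f"expected a /{VPC_PREFIX_LEN}, got /{vpc.prefixlen}")
    subnets = vpc.subnets(new_prefix=SUBNET_PREFIX_LEN)
    return [next(subnets) for _ in range(count)]


def is_ula(address: str) -> bool:
    """True if the address falls inside the ULA range."""
    return ipaddress.ip_address(address) in ULA


if __name__ == "__main__":
    for subnet in carve_subnets("2001:db8:1200::/56", 3):
        print(subnet)
    print(is_ula("fd12:3456::1"), is_ula("2001:db8::1"))
```

Under the NAT approach the guest only ever sees addresses for which `is_ula` is true; routing external addresses directly to the guest would mean handing it addresses for which it is false.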
Routing Table Destinations
Currently, VPC Router destinations are mostly described as subnets. As we have other resources that might dynamically expand to more than one IP block (for example a default subnet with IPv4 and IPv6), it may make sense to allow destinations to be input that way. If we did this, that would mean that the destination would include the following:
IPv4 and IPv6 CIDR blocks
Specific Subnets
Networks from VPC Peering
Future looking things like VPNs to and from clouds like AWS or on-premises where routes can be exchanged using BGP.
The main things that this approach raises are questions around things like:
Having to explain the concept to customers that may be a little different from what they’re used to (though there are already different types of targets for routes).
Complexity in the back end and making sure changes are reflected across the system.
Having a way to print out the destination normalized into IP CIDR block form.
On the other hand, it would solve some potential problems:
One wouldn’t forget to route one of IPv4 or IPv6
The system could automatically adjust when faced with changes.
I’m not sure there are many customer questions that I would ask to help us clarify what direction to proceed down. My inclination is probably to add support for it in the user API.
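As a rough illustration of what "normalized into IP CIDR block form" could mean, the sketch below expands a symbolic destination into the CIDR blocks it currently covers. The `subnet:<name>` syntax, the `VpcSubnet` type, and the function name are hypothetical and exist only for this example; they are not part of the proposed API.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class VpcSubnet:
    """A VPC Subnet carries both an IPv4 and an IPv6 block."""
    name: str
    ipv4: str
    ipv6: str


def normalize_destination(dest: str, subnets: dict) -> list:
    """Expand a route destination into canonical CIDR strings.

    `dest` is either a literal CIDR block or "subnet:<name>"
    (a made-up syntax for this sketch).
    """
    prefix = "subnet:"
    if dest.startswith(prefix):
        subnet = subnets[dest[len(prefix):]]
        blocks = [subnet.ipv4, subnet.ipv6]
    else:
        blocks = [dest]
    # Parsing validates each block and yields its canonical string form.
    return [str(ipaddress.ip_network(block)) for block in blocks]


if __name__ == "__main__":
    known = {"default": VpcSubnet("default", "10.0.0.0/24", "fd00:1::/64")}
    print(normalize_destination("subnet:default", known))
    print(normalize_destination("192.168.0.0/16", known))
```

A symbolic destination expands to both address families at once, which is how this approach avoids forgetting to route one of IPv4 or IPv6.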
Internet Gateway Design
The scalability of an Internet Gateway and its design from a user perspective is an interesting challenge. Here are some of the goals and constraints that we have, some of which are in tension:
We want to make it easy for groups of instances to use a fixed set of external IPs to make it easier when external services allow and deny access to them based on the IP address.
We want to be able to make this scale with a user’s footprint in the network. When there is a lot of NAT activity going on we need to figure out how to design this such that we avoid NAT exhaustion. This means we may need the number of IP addresses to scale and indicate when we’re running out. Though if we know when we’re running out, we should avoid asking the user to manage it.
Currently in the VPC Router design there is a VPC-wide System Router. However, the floating IP address may realistically be limited to a single AZ. Is it OK to have users manage addresses on a per-AZ basis?
We want to make sure that we avoid a single instance or point that is doing the NAT, to ensure it’s scalable. In an ideal world, being able to perform the NAT on the local machine that traffic is originating would be great and would fit in with some of what we’re doing from a general architecture and floating IP direction.
I don’t have other concrete API proposals; however, these goals are what led to the current set of thoughts, and they may be useful to keep in mind as we evaluate what makes sense and what doesn’t. There are a bunch of trade offs between scalability by default and having a dedicated set of IP addresses.
The ability for an Internet Gateway to have no IP addresses associated with it feels like a bit of a kludge. While it does simplify the router / floating IP interaction in some ways, I wonder if it makes it more complicated at the same time. It’s not clear to me what percent of instances will want to have all outbound Internet access cordoned off, but also have a floating IP address associated with a single instance in a given subnet.
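The NAT exhaustion concern above comes down to simple arithmetic: for a single destination address and port, each external IP can carry only as many concurrent flows as it has usable source ports. The sketch below works through that under stated assumptions. The usable port range, the 80% warning threshold, and the function names are illustrative choices, not part of the design.

```python
import math

# Assumption: source ports below 1024 are not used for NAT, leaving 64512 per IP.
USABLE_PORTS_PER_IP = 65535 - 1024 + 1


def external_ips_needed(peak_flows: int, ports_per_ip: int = USABLE_PORTS_PER_IP) -> int:
    """External IPs required to carry `peak_flows` concurrent flows to one
    destination address and port."""
    if peak_flows <= 0:
        return 0
    return math.ceil(peak_flows / ports_per_ip)


def nearing_exhaustion(active_flows: int, external_ips: int, threshold: float = 0.8) -> bool:
    """True once port usage crosses `threshold` of capacity (hypothetical policy)."""
    capacity = external_ips * USABLE_PORTS_PER_IP
    return capacity == 0 or active_flows / capacity >= threshold


if __name__ == "__main__":
    print(external_ips_needed(200_000))
    print(nearing_exhaustion(active_flows=60_000, external_ips=1))
```

A check like `nearing_exhaustion` is the kind of signal that could let the system grow the address set itself rather than asking the user to manage it.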
Floating IPs and Routing Tables
The current design requires that a router entry be present in either the system or a subnet’s VPC Custom Router for an Internet Gateway. The current view is that if there is no entry for an Internet Gateway the instance cannot route to the Internet. It’s not clear how much this matches what someone would naively expect or not.
We could consider having this bypass the Internet Gateway and associated policy altogether, though we still have to honor the VPC Router for other aspects (for example a deny rule for a CIDR block). Ultimately, I think it’s clearer that we do require it so that things are consistent, even if it does introduce a few oddities for the Internet Gateway discussed above.
DNS Record Types
At the moment we’re only suggesting supporting the A, AAAA, and PTR records in our DNS systems. There are a few other record types we should look at:
SSHFP Records
The SSHFP record type encodes an ssh host key fingerprint in DNS. This gives a second, out of band means for communicating what a host’s ssh host key fingerprint is, which comes up every time a new host is contacted.
OpenSSH (the mainstay ssh server in Unix-like systems) trusts the ssh host key at two different levels. By default, it will note the fact that this is in DNS in addition to the normal prompt about the host key fingerprint. The second level occurs if the DNS zone is signed with DNSSEC. In that case OpenSSH will automatically accept the host key and will not prompt the user about it (unless there is a mismatch with known hosts).
Because this requires DNSSEC to be present to be fully useful, it is not clear whether or not this makes sense. It also would require an agent inside of the guest to publish and update metadata about the key and know about changes that occurred to it.
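For a sense of what an in-guest agent would have to publish, the following sketch derives SSHFP record data from one line of an OpenSSH public key file. The algorithm and fingerprint-type numbers come from the SSHFP RFCs (4255, 6594, 7479); the function name and the idea of feeding it a `.pub` line are assumptions for illustration only.

```python
import base64
import hashlib

# SSHFP algorithm numbers (RFC 4255, RFC 6594, RFC 7479).
ALGORITHMS = {
    "ssh-rsa": 1,
    "ssh-dss": 2,
    "ecdsa-sha2-nistp256": 3,
    "ssh-ed25519": 4,
}
SHA256_FP_TYPE = 2  # fingerprint type 2 is SHA-256 (RFC 6594)


def sshfp_rdata(public_key_line: str) -> str:
    """Build '<algorithm> <fp-type> <hex digest>' from a line such as
    'ssh-ed25519 AAAA... comment'."""
    key_type, blob_b64 = public_key_line.split()[:2]
    # The fingerprint is taken over the decoded public key blob.
    blob = base64.b64decode(blob_b64)
    digest = hashlib.sha256(blob).hexdigest()
    return f"{ALGORITHMS[key_type]} {SHA256_FP_TYPE} {digest}"


if __name__ == "__main__":
    # Placeholder blob, not a real key; a real agent would read the host key file.
    fake_blob = base64.b64encode(b"not-a-real-host-key").decode()
    print(sshfp_rdata(f"ssh-ed25519 {fake_blob} example-host"))
```

The agent would also need to republish this value whenever the host key changes, which is the update burden described above.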
SRV Records
SRV records provide a generic way to indicate a type of service. An SRV record can indicate multiple IP address and port combinations with different weights. For these records to work they would need cooperation from the instance itself. To see examples of where that might make sense, see the DNS future directions section.
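To illustrate how the weights in an SRV record set are typically consumed, here is a simplified sketch of client-side selection: take the lowest priority value, then choose among those records in proportion to weight. The `SrvRecord` type and `pick_target` function are hypothetical names, and this version omits the RFC 2782 detail that zero-weight records should still have a small chance of being chosen.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    priority: int
    weight: int
    port: int
    target: str


def pick_target(records: list, rng: random.Random) -> SrvRecord:
    """Pick one record: lowest priority first, then weighted random."""
    best = min(r.priority for r in records)
    candidates = [r for r in records if r.priority == best]
    weights = [r.weight for r in candidates]
    if sum(weights) == 0:
        # All weights are zero: fall back to a uniform choice.
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    records = [
        SrvRecord(priority=10, weight=60, port=8080, target="web0.example."),
        SrvRecord(priority=10, weight=40, port=8080, target="web1.example."),
        SrvRecord(priority=20, weight=0, port=8080, target="backup.example."),
    ]
    print(pick_target(records, random.Random()).target)
```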
DNS Record Scheme
The DNS scheme for records that was proposed is somewhat verbose. The following were the goals that went into this design:
Try to avoid the use of UUIDs by ensuring that names could be sufficiently unique without limiting what a user can call something.
Prefer the use of the same name internally as externally to simplify life.
Make sure that the DNS scheme will deal with future directions.
The current scheme is designed as:
<resource>.<vpc>.<project>.<org>.<suffix>
By using the organization, project, and VPC, we can avoid the issue of whether or not duplicates exist. While it’s easy to enforce that there are no duplicates for a single user, when we start to bridge multiple projects and VPCs together, that becomes much more difficult and cumbersome.
Currently, the defined name spaces are for instances and Floating IPs. These take the forms <instance>.<az>.inst and <name>.fip respectively. If we were to integrate load balancers or service names into DNS, then those could have their own scheme. For example we could see all of them as:
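<instance>.<az>.inst.<vpc>.<project>.<org>.<suffix>
<name>.fip.<vpc>.<project>.<org>.<suffix>
<name>.svc.<vpc>.<project>.<org>.<suffix>
<name>.tag.<vpc>.<project>.<org>.<suffix>
<name>.lb.<vpc>.<project>.<org>.<suffix>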
This currently suggests that we want to include everything as below the VPC for resources. That may be reasonable, given that this is the granularity for networking.
To reduce the friction of names, there are a few things we could do. For example, by default, we can set the DNS search domains in an instance to things like inst.<vpc>.<project>.<org>.<suffix> or <vpc>.<project>.<org>.<suffix>. This will give us good ergonomics. In addition, because of things like VPC peering, being able to have longer names allows us to actually expose DNS names to all members of the peered VPC.
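The naming forms and search domains described in this section can be sketched as a few string helpers. The function names and the example organization, project, VPC, and suffix values are made up for illustration; only the label order follows the forms given above.

```python
def instance_fqdn(instance: str, az: str, vpc: str, project: str, org: str, suffix: str) -> str:
    """<instance>.<az>.inst under the VPC's name space."""
    return ".".join([instance, az, "inst", vpc, project, org, suffix])


def floating_ip_fqdn(name: str, vpc: str, project: str, org: str, suffix: str) -> str:
    """<name>.fip under the VPC's name space."""
    return ".".join([name, "fip", vpc, project, org, suffix])


def search_domains(vpc: str, project: str, org: str, suffix: str) -> list:
    """Default search list: the instance name space first, then the VPC itself."""
    base = ".".join([vpc, project, org, suffix])
    return ["inst." + base, base]


def candidates(short_name: str, domains: list) -> list:
    """Names a resolver would try, in order, for a short name."""
    return [f"{short_name}.{domain}" for domain in domains]


if __name__ == "__main__":
    scope = ("prod", "shop", "acme", "oxide.example")
    print(instance_fqdn("web0", "az1", *scope))
    print(candidates("web0.az1", search_domains(*scope)))
    print(candidates("api.fip", search_domains(*scope)))
```

With these search domains a guest can reach an instance as `web0.az1` and a floating IP as `api.fip`, while the full names remain unambiguous across peered VPCs.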
Hot-plug of Interfaces and IPs
We will inevitably hit the point where the wrong IP address is assigned to an instance or an instance wants to have a network interface added or removed. These two are not the same, though have some related challenges.
Today most virtual network interfaces whether they are based on virtio or SR-IOV do not generally show up as a hot-pluggable device. Most networking device drivers have not been made to be hot-pluggable or designed around surprise removal, unless working with a physical USB to Ethernet adapter, which we would not want to emulate for numerous reasons. That means that the only good way to remove an interface from a guest is basically to reboot it.
To that end, GCE explicitly doesn’t allow changing the number of interfaces for an existing instance. Amazon does.
Adding and removing IP addresses from instances is much easier mechanically, but there is still coordination inside of the guest. Because most addresses are assigned via DHCP or statically, someone has to go and update that configuration or trigger a DHCP renewal from the instance. Depending on the amount of machinery that we add to guests, this can be made fairly automatic. Alternatively we can set lower lease times for DHCP, but that has its own trade offs.
While we should design the API in such a fashion that adding support for these is possible, we’ll need to think carefully about the implications of each of the paths and work through that in the design.
MTU
Currently, all networks default to an MTU of 1500 bytes. While this is the most common MTU across the current public Internet, having a higher internal MTU can reduce the overhead of reaching higher throughput. On the other hand, when different networks have different MTUs, that can cause a lot of hard to debug friction, especially due to challenges around Path MTU discovery.
Currently, we include the MTU of subnets and have a default MTU in the top-level VPC. We can make it easy to change these going forward if we’d like as we explore things with customers. Some questions to help us better understand this are in the MTU part of the customer questions section.
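To put a number on the overhead argument, the sketch below compares the share of wire bytes that carry TCP payload at two MTUs. It assumes untagged Ethernet framing and IPv4 and TCP headers without options, and it ignores any encapsulation the underlying network might add; the 9000-byte figure is just a common jumbo frame size, not a proposed default.

```python
# Per-frame Ethernet cost outside the MTU: preamble and SFD (8), header (14),
# FCS (4), and inter-frame gap (12).
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12
# IPv4 (20) plus TCP (20) headers, assuming no options.
IP_TCP_HEADERS = 20 + 20


def tcp_efficiency(mtu: int) -> float:
    """Fraction of bytes on the wire that are TCP payload for full-size segments."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETHERNET_OVERHEAD)


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: {tcp_efficiency(mtu):.2%} payload")
```

The byte-level gain is only a few percentage points; much of the practical benefit of a larger MTU comes from handling fewer packets for the same amount of data.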
Customer Questions
IPv6
To help us understand the open IPv6 questions discussed in IPv6 Addressing, here are some questions we might ask customers:
Do you use IPv6 in your environment today? If so, do you have large public IPv6 allocations?
How important are floating IPs to you? Is there value in them behaving the same way between IPv4 and IPv6?
Do you have applications where multiple different AWS Elastic IPs (or equivalent) are pointed to the same instance? If so, does the fact that you can’t distinguish between them cause problems?
DHCP and Options
If you use AWS, have you ever created a custom DHCP Option Set?
Do you run DHCP in your on-premises environment today? If so, what information do you distribute via DHCP?
When running in the cloud, have you ever used a DNS server other than the one provided?
MTU
Do you use Jumbo Frames inside your data center today or on AWS? If so, have you ever had issues with Path MTU discovery?
Future Directions
This section discusses future ways the networking APIs might evolve and how that might fit into what we do. The purpose here is to make sure we haven’t foreclosed developing down a given path.
DNS
While Oxide is not looking to be a public DNS registrar, there are other ideas that we can adopt that might make life easier for application writers, especially in the face of not having a load balancer for version 1. These ideas were developed originally by Alex Wilson as part of the Joyent Triton Container Naming Service.
The more interesting thing that it did was to provide a bunch of DNS SRV records for services. Services could register an SRV record by pushing a metadata tag out. Once that was the case, CNS would advertise that as part of the SRV record. It would also do some amount of rudimentary health-checking by paying attention to whether the instance was up or down at an API level and if the host it was on was up or down. There was also metadata that an instance could add to temporarily remove itself from the SRV record.
While this requires cooperation, this could prove useful. As part of building out the Manta object store, SRV records where instances would register proved to be quite useful. Though the fact that instances had to explicitly register was important there for liveness and could prove a reason that this doesn’t make sense.
A variant of the above is to allow a collection of addresses based on a tag. This would result in a number of A or AAAA answers depending on which instances had that particular tag.
The current design of our DNS record scheme does allow for expansion in this regard and wouldn’t make it too hard to include either approach.
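The tag-based variant can be sketched as a filter over instances, combined with the API-level up or down check mentioned earlier. The `Instance` type, its fields, and the function name are hypothetical and only illustrate the shape of the lookup.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    address: str
    tags: set = field(default_factory=set)
    running: bool = True  # API-level state, standing in for rudimentary health checking


def answers_for_tag(tag: str, instances: list) -> list:
    """Addresses to return as A or AAAA answers for a tag name."""
    return sorted(i.address for i in instances if tag in i.tags and i.running)


if __name__ == "__main__":
    fleet = [
        Instance("web0", "10.0.0.5", {"web"}),
        Instance("web1", "10.0.0.6", {"web"}, running=False),
        Instance("db0", "10.0.0.9", {"db"}),
    ]
    print(answers_for_tag("web", fleet))
```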
Multiple IPs per Interface
Currently only a single IPv4 or IPv6 address can be assigned to an interface, though multiple interfaces can be assigned to an instance. In the future we should probably allow additional IP addresses to be assigned to an instance.
Multiple Peered VPCs
Today, we have a restriction on the number of VPCs that can be peered together. As we have customers build up systems that exist across more than one region, we will want to be able to support peering together multiple different VPCs. The biggest gotchas are to understand what the routing and firewall rules between multiple disjoint things look like. For example you can imagine a love triangle of peered VPCs where A peers with both B and C, but B and C are not peered:
        VPC A
        /   \
       /     \
      /       \
  VPC C       VPC B
Working through the semantics of this or much more complicated relationships will be important to understand from an ergonomic perspective and also from the overlapping subnet perspective. It may make sense in such a world to allow a VPC Custom Router to pick how to win in the overlapping subnet problem. These are things that we should think about to make sure that the API is future proof for all of this.
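The triangle above raises two mechanical questions: whether peering is transitive, and whether two peers of the same VPC advertise overlapping subnets. The sketch below assumes non-transitive peering and flags overlaps as seen from one VPC. The data shapes and function names are hypothetical, and the addresses are arbitrary private ranges.

```python
import ipaddress
from itertools import combinations


def can_reach(src: str, dst: str, peerings: set) -> bool:
    """Assumes peering is not transitive: only direct peers can reach each other."""
    return src == dst or frozenset((src, dst)) in peerings


def overlapping_peer_subnets(vpc: str, peerings: set, subnets: dict) -> list:
    """Find subnets of two different peers of `vpc` that overlap each other."""
    peers = sorted(other for pair in peerings if vpc in pair for other in pair if other != vpc)
    conflicts = []
    for a, b in combinations(peers, 2):
        for x in subnets[a]:
            for y in subnets[b]:
                if ipaddress.ip_network(x).overlaps(ipaddress.ip_network(y)):
                    conflicts.append((a, x, b, y))
    return conflicts


if __name__ == "__main__":
    peerings = {frozenset(("A", "B")), frozenset(("A", "C"))}
    subnets = {"A": ["10.0.0.0/24"], "B": ["10.1.0.0/16"], "C": ["10.1.2.0/24"]}
    print(can_reach("B", "C", peerings))
    print(overlapping_peer_subnets("A", peerings, subnets))
```

Each tuple returned by `overlapping_peer_subnets` marks a case where VPC A would need some rule, perhaps in a VPC Custom Router, to decide which peer wins.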
VPNs
Based on discussion in RFD 9, we are currently not trying to integrate VPNs in the first version of the product. However, we should consider how they might look.
There are a couple of different types of VPNs that customers are looking for and that we need to consider:
Site to Site VPNs: These are used to bridge on-premises and cloud deployments today.
Remote Access VPNs: These are used where a company has employees that are remote that they’d like to be able to bridge their devices onto a given network. This is sometimes called a 'Road Warrior' VPN.
While both of these are solvable in instances with software, the first case is one where we can really improve the experience with broader integration into the VPC Router and minimizing the setup issues. Realistically, both of these want different looking API structures. For the time being, we’ll only discuss the first category.
Today, all of the major providers utilize IPsec to create these site to site VPNs through a combination of dedicated hardware appliances from companies like Cisco and Juniper as well as scalable software services. BGP routes are exchanged over these links to allow for high-availability and to minimize the manual configuration that is required as networks are added and removed. While there are other technologies such as WireGuard and OpenVPN, we are focusing on IPsec due to the interoperability with AWS, GCP, Azure, and others.
When an IPsec VPN is created, the system will do the following:
Add a new VPC Router rule target that represents this remote connection for VPC Custom Routers and firewall rules.
Add routes that are pushed over the IPsec VPN as a type of route in the VPC System Router as long as they don’t conflict with existing local or peered subnets. We will also need explicit filter lists and rules so that we can clearly deal with conflicting rules and make sure that one side doesn’t leak all of the rules. It may make sense to have an explicit prefix that we only accept routes for. The goal is to make it easy to deal with multiple actual routes for the VPN (AWS and GCP use this for HA) without wreaking havoc with the network. These BGP-based routes will be automatically inserted into and removed from the VPC System Route Table.
Add a new Firewall rule target that covers everything advertised by the IPsec VPN tunnel.
Allow one to put a filter on the types of routes that are accepted over the tunnel.
Indicate which subnets should be advertised over the IPsec connection, which may be all of them.
Potentially allow for Internal DNS queries that originate from over the tunnel to be handled for the shared VPC subnets.
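The route acceptance rules in the list above can be sketched as a small filter: accept an advertised route only if it falls inside an explicitly allowed prefix and does not overlap a local or peered subnet. The function name and parameters are hypothetical, and a real filter would sit in the BGP import policy rather than in standalone code like this.

```python
import ipaddress


def accept_route(advertised: str, allowed_prefixes: list, local_subnets: list) -> bool:
    """Decide whether a route learned over the VPN may enter the System Route Table."""
    route = ipaddress.ip_network(advertised)
    allowed = [ipaddress.ip_network(p) for p in allowed_prefixes]
    # subnet_of raises TypeError across IP versions, so compare versions first.
    if not any(route.version == p.version and route.subnet_of(p) for p in allowed):
        return False
    # Reject anything that collides with existing local or peered subnets.
    return not any(route.overlaps(ipaddress.ip_network(s)) for s in local_subnets)


if __name__ == "__main__":
    allowed = ["172.16.0.0/12"]
    local = ["10.0.0.0/16", "172.16.5.0/25"]
    for candidate in ("172.16.9.0/24", "0.0.0.0/0", "172.16.5.0/24"):
        print(candidate, accept_route(candidate, allowed, local))
```

Restricting acceptance to an explicit prefix keeps a misconfigured peer from injecting a default route, while the overlap check covers the conflict case called out above.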
This would fit into the API with the following high-level API summary:
Verb | Route | Purpose |
---|---|---|
GET | | List VPNs |
POST | | Create a new VPN |
GET | | Get information about the specified VPN |
PUT | | Update information about a VPN |
DELETE | | Delete the specified VPN |
Firewall Flow Logging
Firewall flow logging is a feature that many clouds and switches have which basically operates as a series of flow logs that include what connections were active, statistics about the connection, and what kinds of connections were dropped.
Here are some considerations for the API:
What is the granularity of flow logging? Is it enabled on a per-instance, subnet, or VPC basis?
When flow logging is enabled, are all rules logged or only a subset of them? What is the granularity for that control?
How do we actually collect and manage that data? How do we keep the overhead on the number of connections that are logged to a reasonable volume so as not to disrupt service? Many other cloud systems tie into a logging or analytics engine. It’s not clear exactly what that would look like here.
Is logging of an entire VPC or Subnet all at once important?
This would probably cause us to add additional fields to the existing firewall rules objects or other things such as instances for collecting the flow logging.
This could have the following high-level API endpoints for us to consider. These don’t go into all the details that we would like, but at least give us a starting point for how it might fit into the broader API. At a high level one could create a firewall logging session and then mark certain rules with which of the logging instances they belong to.
Verb | Route | Purpose |
---|---|---|
GET | | List all active VPC firewall logging sessions |
POST | | Create new VPC firewall logging session |
GET | | Get information about a specific VPC firewall logging session |
PUT | | Update information about a specific VPC firewall logging session |
DELETE | | Delete a specified VPC firewall logging session |
GET | | Get the current session data (this interface will probably need to be re-imagined). |
Multiple Disjoint Firewalls
In the current API design, we have a singular firewall resource that contains all the rules that are present in the VPC. The scope of these rules determines what sets of instances we target. There is a possible future where we would want disjoint collections of rules for a VPC. While I’m not sure where exactly this fits in, it’s worth evaluating how this impacts the API.
Today, in the vein of the global firewall we have a single entry point: /. This represents the system firewall. In a world of multiple disjoint firewalls, this could instead be represented as /.
This means that if we were to make a transition, we would need to pick a name for the primary firewall, perhaps system, and then we would need to reserve both system and rules for the names of firewalls and basically rewrite actions on / to /.
An alternative approach would be to say there is an explicit name for the firewall object today and just always have it there, but no means of creating or deleting sets of rules. This would leave / as more of a general listing endpoint which has a single entry and then we’d have / and /. The previously described aspects of firewall logging would all change in an analogous fashion.
I’m not really sure how we would apply these disjoint rule sets to different VMs, but if we came up with something, it would reduce the complexity of adding it.
Load Balancing
Load balancing is a flagship feature of many clouds and one of the things that customers care quite a lot about. RFD 9 Networking Considerations goes into some rationale as to why it’s not in version 1 of the product, but we can imagine it will be there in a subsequent version before too long.
While a whole separate discussion is needed to describe the actual features of load balancers that are worth having (L4, L7, TLS, health checking, forwarding vs. terminating, etc.) there are a few things that we want to tease out as part of the current API design.
Load balancers will almost certainly appear in some form of DNS and they will fall into and out of it. While we don’t know if there would be a single IP or multiple, we should assume that we’ll want to fit it into our existing schemes for DNS in a reasonable way. A token proposal is in the DNS Record Scheme section.
Load balancers probably want to be a regional entity so that they can easily target anything in the AZ. We also probably want to think about how anycast IP addressing may fit into things if we ever want to go to a global scope.
We should make sure it’s easy for floating IPs to possibly be assigned to not just an instance, but also a load balancer. The load balancer may also be internal, so being able to reassign internal addresses is also useful.
We will want to make sure that when instances are replying to traffic from a load balancer that it is taken into account with the routers and gateways. There are several approaches that we will want to think about depending on how the load balancer is implemented. If this is all based on routing games, then we may want Internet Gateways to be a part of this. On the other hand, allocating explicit IPs on the subnets that the load balancer is forwarding to so that there’s always an on-network place to return the traffic to and that it originates it from, may be a smoother experience.
We want to make it easy to understand traffic that originates from and goes to the load balancer in firewall rules for instances.
As we look at things like firewall rule logging and other statistics mechanisms, we’ll want this to fit in as well.
When we design a scalable NAT, a load balancer is in some ways just a slight rehash on aspects of that problem.
From an API perspective, the load balancer pretty easily fits in as a project-scoped service, as it can potentially operate between VPCs. Here are examples of where it would fit in, with an implicit / on the endpoints. However, you could also replace the / with a specific VPC scope of /. Note the APIs here are completely hypothetical and would need to change based on the service we actually want to create:
Verb | Route | Purpose |
---|---|---|
GET | | List load balancers in a project |
POST | | Create a new load balancer. |
GET | | Get information about a specific load balancer. |
PATCH | | Update information about a specific load balancer. |
DELETE | | Delete the specified load balancer. |
GET | | List the load balancer targets |
POST | | Create a new target for a load balancer |
GET | | Get information about a specific target of the load balancer |
PUT | | Update information about a specific target of the load balancer |
DELETE | | Remove a target from the load balancer. |