RFD 24
Multi-Rack Oxide Deployments

The purpose of this RFD is to discuss and consider what it means to manage more than one Oxide rack. It proposes some standardized taxonomy for how to group and think about the operation of more than one rack and from there, discuss what the scope of certain features may want to be.

Industry-standard fault domains

When building services, one of the most important things to consider is the 'fault domain'. Understanding fault domains means understanding under what circumstances two different resources will fail because of a shared resource. For example, two instances that are on the same server will both fail if that server fails.

Many existing cloud offerings such as AWS, GCP, and Azure operate in multiple different parts of the world. Amazon introduced and has educated the market on a semi-standard taxonomy of different fault domains. For high-availability, different services are run in different fault domains.

Availability Zones

The lowest level block that is exposed to consumers of these cloud services is called the availability zone (AZ). An availability zone is sold to the consumer as something that is isolated and in its own failure domain. Generally, this means that an availability zone has its own power, cooling, facilities, and Internet uplink.

Some resources are always created in the context of a specific availability zone (using AWS as an example). This includes things like a virtual machine, or a network subnet. These only exist and operate in a single availability zone (though they can reach resources in another zone).

While infrastructure providers internally have smaller levels of fault domains such as a rack or individual server, these are generally not directly exposed to the user. Instead, the user can define some amount of affinity rules for placement inside the AZ.

Region

A region represents a collection of availability zones that are linked together through high-throughput, low-latency links (latency usually measured in hundreds of microseconds). As a result, the AZs are usually all in the same geographic area, but in separate facilities.

While some specific resources are created in the context of an availability zone, the region is the primary building block. API endpoints exist in the context of a region and many services that providers offer, like S3 or VPC, operate in the context of the entire region.

While GCP does have some resources that are basically seen as global across all of their regions, we’ve been told that this will slowly be changing, which means that most cloud providers will be using the region as a basic building block.

Fault Domains in the Oxide Environment

I propose that we describe the fault domains in an Oxide environment in the following terms. This list starts with the smallest term and grows towards the largest, though in some environments the different levels may be identical:

  1. Server

  2. Rack

  3. Cell

  4. Availability Zone (AZ)

  5. Region

  6. Fleet

Each level here describes the default blast radius of a given unit; that is, a failure of one server should generally not take out another server.

For example, if a server is made unavailable due to a power supply, CPU, or fan failure, that should not cause any other server to fail for the same reason (though it is possible that the other servers could become overloaded as a result of load shifting).

Similarly, if an AZ becomes inaccessible because a backhoe takes out a power line or Internet uplink, then it is expected another availability zone should not fail because of that (otherwise they shouldn’t be considered separate availability zones).

Of course, there are always exceptions to these rules. If a server were to catch fire, then it’s likely that the failure of that server would spread beyond it and impact at least the rack if not the entire AZ.
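
To make the containment relationship concrete, here is a minimal sketch in Rust (the type and field names are purely illustrative, not a committed API) of the proposed hierarchy. Each level contains the one below it, and in a small deployment several levels may collapse into a single instance, for example one rack that is also its own Cell and AZ.

    // Illustrative sketch of the proposed fault-domain hierarchy, from
    // the smallest unit to the largest; just a way to visualize the
    // containment relationship, not a committed data model.

    struct Server {
        id: String,
    }

    struct Rack {
        id: String,
        servers: Vec<Server>,
    }

    struct Cell {
        id: String,
        racks: Vec<Rack>,
    }

    struct AvailabilityZone {
        id: String,
        cells: Vec<Cell>,
    }

    struct Region {
        id: String,
        azs: Vec<AvailabilityZone>,
    }

    struct Fleet {
        id: String,
        regions: Vec<Region>,
    }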

Server

The server represents the basic building block of compute resources in an Oxide rack. Each server has its own independent CPUs, DRAM, NICs, fans, and some amount of directly attached storage. While some of that storage may be used by other servers in a rack, the server it is directly attached to is the only one that will be issuing I/O commands directly to those devices.

Some resources that may traditionally be thought of as belonging to a server may in fact be part of a rack, such as power supplies and power distribution, depending on the design that we end up with.

Rack

The rack represents a collection of resources such as servers, power distribution, top of rack switches, and necessary cabling. The term rack refers to the metal enclosure that all of these resources are physically attached to, in addition to the collection itself.

Cell

In some networking designs, groups of racks will be aggregated together and have a higher degree of bandwidth between one another than other parts of the data center. While a switch in a rack may have a large amount of bandwidth, once you leave the switch that amount of bandwidth is often reduced by at least an order of magnitude if not more.

When a group of racks has been aggregated together and has higher switching bandwidth among its members than to other racks in the data center, we call such a collection a Cell.

Availability Zone

An availability zone should represent a collection of Oxide Racks that are co-located such that they have the same failure characteristics. This means that the racks share one or more of the following resources:

  1. They’re in the same physical building or cage

  2. They utilize the same underlying power source

  3. They utilize the same cooling infrastructure

  4. They utilize the same core network infrastructure

  5. They utilize the same physical route to the Internet at large

Effectively, racks that share resources and are in the same failure domain should generally be considered to be in the same availability zone. If the resources that drive such an environment are not totally independent (for example, the racks share the same power source but everything else is different), then they should still be considered the same availability zone, because the failure of the shared resource (power) will cause both of them to fail.

All of the racks in an availability zone should be reachable by routing through Oxide equipment. We should try to avoid relying on customer networking equipment for communication between Oxide racks.

Region

A region represents a collection of availability zones with a set of low-latency links between them. We should dictate that such links generally be private and dedicated, and capable of sub-1ms ping times.

Each availability zone that makes up a region should be independent. A region is not required to have more than a single availability zone.

In most cloud environments, there is an assumption that the links between availability zones are dedicated links made by the cloud provider. However, in many DC metro networks this is not the case. This is something that we will discuss again in the [trust] section.

It is expected that there is no dedicated link between regions and that instead, traffic between regions will have to cross the Internet.

Fleet

A fleet represents the set of logically related Oxide racks that are owned and operated by the same customer in some coherent fashion. A fleet may be a single rack or multiple racks in different regions and availability zones. A given customer could have multiple independent fleets. It’s easy to imagine a corporate structure where each business unit had its own infrastructure that was purchased and operated independently.

At this time we’re also suggesting that a fleet has a single coherent sense of authentication, authorization, and accounting. It’s a reasonable question whether this should be the case. One could imagine wanting a uniform sense of who the operators are across a fleet, but having different regions with different authorization schemes.

An alternative perspective is a company whose central IT group runs an independent region for each sub-group, where each sub-group might have its own identity scheme. So while a region should have a coherent sense of identity, whether or not the fleet should is an interesting question.

Customer Responsibility

While we have defined a taxonomy for the different fault domains, this taxonomy will be exposed to the customer and it will be up to them to implement it in their deployment. This means that the fault domains we’ve defined will work only if customers have actually constructed them correctly.

While the experience for setting up the first rack may be simple, such as providing a name for a rack, cell, AZ, or region, we need to explore how we can help guide them in correctly setting up the next phase and how to determine that customers are actually setting up independent fault domains. Not only do we want to avoid trying to tell the customer, 'You’re doing it wrong', but we also don’t want to have to write software to take apart something that was combined incorrectly.

We need to make sure that this is a first-class, guided experience and try to validate as much as we can automatically in software. We also need to figure out how to communicate and document this as clearly to customers as we can.
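
As one example of the kind of automatic validation we could do, the sketch below (illustrative Rust, not a committed design) checks measured round-trip times between sites the operator has declared to be separate AZs of one region against the sub-1ms expectation from the Region definition above, and reports pairs that do not look like they belong in the same region.

    use std::time::Duration;

    // Measured round-trip time between two purported AZs; how the
    // measurement is taken is out of scope for this sketch.
    struct AzLinkMeasurement {
        az_a: String,
        az_b: String,
        rtt: Duration,
    }

    // Threshold taken from the sub-1ms expectation in the Region
    // definition above; illustrative only.
    const MAX_INTER_AZ_RTT: Duration = Duration::from_millis(1);

    fn validate_region_links(measurements: &[AzLinkMeasurement]) -> Vec<String> {
        measurements
            .iter()
            .filter(|m| m.rtt > MAX_INTER_AZ_RTT)
            .map(|m| {
                format!(
                    "link {} <-> {} has RTT {:?}; these sites may not belong in one region",
                    m.az_a, m.az_b, m.rtt
                )
            })
            .collect()
    }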

Another challenge that we need to highlight is that if racks in the AZ are connected via customer networking, then we may be subject to a lot of availability concerns. This suggests that we shouldn’t allow an Oxide rack to transit a customer’s network equipment except when going out to the WAN or perhaps between AZs in a region.

Another way to look at this is that we’ve advocated being, so to speak, the single throat to choke for a customer rather than one of multiple vendors. It feels like we should also provide similar insulation between our rack and customers’ environments to minimize the amount of pain and headache we end up with.

Service Coherence

While a single Oxide Rack needs to be able to operate on its own (though a server does not need to be able to operate outside of the rack), when a customer purchases a second Oxide rack, we need that second rack to be able to interoperate with the first one. While we don’t expect most customers to be building out multiple AZs and regions, it’s worth talking about how we would envision such a world looking and what services would work across what scope.

For a given service or piece of management, there are two related starting questions that we need to answer:

  1. What do I contact to ask about a resource?

  2. What is the scope of where that resource lives or that it can act upon?

The answer to both of these questions should be one of the previously enumerated levels: server, rack, cell, AZ (availability zone), region, or fleet.

Another way to think about this is based on the ideas of cache and memory coherency. We can use the term 'coherence' (though perhaps it’s a bit overloaded) to try and describe at what level in the stack we expect some amount of consistency and visibility.

While there are numerous suggestions below that should make this clearer, it’s important to call out that this does not suggest that v1 of the product must end up here, but rather that when we’re building and design features into the product for management, we should strive to build towards this and enable ourselves for future revisions of the product. The reason we need to think about this now is so that we can start off without painting ourselves into a corner from an API design perspective as building in some of these things after the fact can be much more challenging.

We begin by enumerating where different resources should live and what their scope is, and then follow up with the question of management.

Guiding Ideas

When determining what the scope of a given resource should be we want to balance a few different ideas:

  • A customer doesn’t want to have to talk to each rack individually. That breaks the point of having a cloud-like experience. If a large customer had to talk individually to every rack they ran on in AWS, that would be painful.

  • Unlike the cloud providers, we certainly do not have private fiber (nor do our customers, generally) that we can use to interconnect different regions. That means that when crossing regions, we’re crossing the public Internet in most cases. This has very different latency and throughput implications from inside of a DC or region. We should make it obvious to customers when they are constructing something that is going to have different performance characteristics. We don’t want to have to answer the question of why it takes 10x longer to ping 10.0.0.2 than 10.0.0.3 because one is inside the DC and the other is across the Internet.

  • The broader a resource’s scope, the more work that is involved for us to try and make that illusion work. For example, we wouldn’t necessarily want everything to operate globally across the entire fleet due to the challenges that are involved with that from an engineering perspective.

  • However, customers generally want to have a single pane of glass where it’s easy to compare and contrast what they’re running where. We don’t want an operator to have to open and maintain a different browser session just because they have multiple different AZs and regions to manage. This is an area where Joyent’s Triton product was painful to use beyond a single AZ. Because each AZ basically had to be operated on independently, it was easy to confuse which AZ you were operating in and what you were doing.

API Endpoints

I’d argue that by default, API endpoints should, as with AWS, target a region. That means that they have the ability to operate on any given AZ inside of the region, but operating in one region doesn’t give you the ability to operate in another region by default.

This does add a bit of complexity for us on the software end, but I suspect it will lead to a better experience for users as it minimizes the number of different things that they need to think about and manage.
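
As a sketch of what this could look like (the hostname scheme here is purely hypothetical, not a decided convention), each region would expose its own endpoint and a client acts in a region by targeting that region’s endpoint:

    // Hypothetical region-scoped endpoint naming; illustrative only.
    fn api_endpoint(region: &str) -> String {
        format!("https://{}.api.oxide.example.com", region)
    }

    fn main() {
        // A client working in "us-east-1" talks only to that region's
        // endpoint; acting in another region means explicitly targeting
        // that region's endpoint instead.
        println!("{}", api_endpoint("us-east-1"));
        println!("{}", api_endpoint("eu-central-1"));
    }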

Management Pane of Glass

Whatever UIs we provide, we should probably make it easy for them to be fleet-wide. Even if this means that the UI is going and summarizing data from all of the different regional APIs, I think being able to present a uniform experience here will be useful. This doesn’t mean that per-server or per-rack views in the UI aren’t important.

While some organizations will have dedicated teams that only operate a service in a given region, there are many organizations where a single group is responsible for operations in all regions (whether from the user perspective or the administrator who is managing the Oxide rack).

When it comes to fleet management concerns like parts databases, warranty, and inventory management, there will be many cases where someone wants a fleet-wide summary. While it may be acceptable to drill into region- or AZ-specific panes for detailed information, at a glance an operator should be able to tell the health of the entire fleet and which regions or AZs are unhealthy, rather than having to manually cycle through everything.

For a complicated service that is running across multiple regions, it may be desirable for users to be able to get a summary of what’s going on and various statistics in the context of the entire fleet.
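
One way such a fleet-wide view could be backed, assuming region-scoped APIs as described above, is for the UI layer to fan out to each regional API and aggregate the results. The sketch below uses illustrative types only; a real implementation would issue the regional queries concurrently and tolerate partial failures.

    #[derive(Debug, Clone, Copy, PartialEq)]
    enum Health {
        Healthy,
        Degraded,
        Unreachable,
    }

    // What a single regional API might report back to the UI layer.
    struct RegionSummary {
        region: String,
        health: Health,
        unhealthy_azs: Vec<String>,
    }

    // Collapse per-region reports into one fleet-wide answer plus the
    // list of regions that need operator attention.
    fn fleet_summary(regions: &[RegionSummary]) -> (Health, Vec<&RegionSummary>) {
        let needs_attention: Vec<&RegionSummary> = regions
            .iter()
            .filter(|r| r.health != Health::Healthy)
            .collect();
        let overall = if needs_attention.is_empty() {
            Health::Healthy
        } else {
            Health::Degraded
        };
        (overall, needs_attention)
    }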

Specific Resources

Users, Groups, Authentication, and Authorization

By default, it seems like users, groups, how they authenticate, and what they’re authorized to do should be fleet-wide. Having to manage separate passwords or manually take an action across multiple different racks, AZs, and regions seems like a rather tedious way to do something and an easy way to cause someone to miss a permissions update or similar.

This certainly creates a lot more complexity, especially around availability and durability of such services, but should result in a much better experience. If customers are always using something like AD or an external LDAP server, this may not be as bad if we can leverage their existing directory as a way to reference external principals. We should avoid writing Oxide-specific information to their store.

Instances

Instances should run in the context of a specific AZ. While an instance only ever runs on a single server at any given time (and we can show that to the user), in general, from a selection perspective, users should be concerned with the AZ as that’s where we make certain fault guarantees.

In theory, it makes sense to allow an instance to move anywhere inside of an AZ as long as it doesn’t violate any kind of affinity constraints. Of course, practically speaking, due to constraints around storage and the like, it will only be able to move within the rack for some time. However, that seems like a mostly artificial constraint on our part.

Making the instance scoped to a region and freely able to move about would arguably make it harder to actually create something fault-tolerant, which is why I believe the AZ makes the most sense as a starting point. The instance has a dependency on both storage and networking. While our goal is that most networking should be able to cover the AZ, storage realistically probably won’t evolve beyond the Cell, though for many customers that may be the same as the AZ. Because of this, instances may need to be constrained to the Cell.
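
A minimal sketch of that placement rule, with purely illustrative names: candidates are limited to the target AZ, and further limited to a single Cell when the instance’s storage already lives in one.

    #[derive(Clone)]
    struct ServerSlot {
        server_id: String,
        rack_id: String,
        cell_id: String,
        az_id: String,
    }

    fn candidate_servers(
        inventory: &[ServerSlot],
        target_az: &str,
        // The Cell already holding the instance's volumes, if any.
        storage_cell: Option<&str>,
    ) -> Vec<ServerSlot> {
        inventory
            .iter()
            // An instance is scoped to its AZ...
            .filter(|s| s.az_id == target_az)
            // ...and, if its storage is Cell-bound, to that Cell as well.
            .filter(|s| storage_cell.map_or(true, |cell| s.cell_id == cell))
            .cloned()
            .collect()
    }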

Networking

Networking is broken up into several different categories that all have different considerations. While we’re still sketching out RFD 21 User Networking API, the best place to start is with the fundamental building block, that is the subnet. Note, the nomenclature for the various items here is subject to change as we firm it up in RFD 21.

In my mind, the subnet is directly related to the instance. I think we should borrow from AWS and make the scope of an individual subnet be the availability zone. I think it’ll be much easier semantically to limit how far a subnet can travel; clamping that to an AZ should both make the implementation easier and match what a user might expect from either a standard network design perspective or from AWS.

The next thing that we need to think about is the broader virtual network concept, which represents an independent networking domain. See RFD 9 for more background there. I believe that this should conceptually be regional. The reason that we want it to be within the region is mostly one of practicality. In the definition of a region, we know we have low-latency, high-bandwidth links that we can use to interconnect the AZs. However, when we are going between regions, we have to assume that we’re going to be going across the WAN.

Going across the WAN is complicated both from a performance and a security perspective. I think it’s important that transitioning and joining two disjoint networks be something a user explicitly opts into, acknowledging that they’re likely going to pay some cost to do so. My hope is that scoping this to the region makes sense and helps make this clearer.

Both AWS and Azure operate this way. As discussed in RFD 9, GCP’s network by default is across the entire fleet (global). There are certainly some aspects of this that are easier from a customer perspective, so this is something that we should discuss with them.

Making the virtual network regional has a couple of other side effects. It means that a route table should be able to target anything in the region and that firewall rules and their groupings should be able to refer to instances across the region. This does mean that the control plane will have more work to do behind the scenes, but it should create a better user experience as a result.
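
Summarizing the proposed scoping as a sketch (the field names are illustrative and will be superseded by whatever nomenclature RFD 21 settles on): the virtual network is a regional object, each subnet inside it is pinned to one AZ, and firewall rules and route tables hang off the regional object so they can refer to anything in the region.

    // A subnet never spans AZs.
    struct Subnet {
        name: String,
        az: String,
        cidr: String,
    }

    // May reference instances anywhere in the region.
    struct FirewallRule {
        name: String,
        targets: Vec<String>,
    }

    struct RouteTable {
        name: String,
        // Routes may target anything in the region.
        routes: Vec<String>,
    }

    // The virtual network itself is region-scoped.
    struct VirtualNetwork {
        name: String,
        region: String,
        subnets: Vec<Subnet>,
        route_tables: Vec<RouteTable>,
        firewall_rules: Vec<FirewallRule>,
    }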

It’s not clear exactly what scope the internal and external DNS advertised from a project should have. It is probably reasonable to start by scoping them regionally to match what’s going on from an API perspective. If internal DNS only resolved what was in a single AZ, that would severely limit the usability of the DNS solution. From an external perspective, it’s not clear what makes the most sense, but a single region-wide endpoint probably makes sense, depending on how we end up doing delegated DNS domains. We’ll have to come back to this later.

Another important question is basically what the scope of a reassignable 'public' IP address should be. My instinct is that we should match AWS and that this should be region-wide. That will create a lot of complications in terms of how we route requests and other things, depending on how we integrate into a customer’s network. But until we know more, it seems a reasonable starting point. There’s a similar question with outbound NAT that we’ll need to think about.

Given the complexity here, a reasonable question is what do we lose by not doing this. The thing that we need to explore is how customers expect to fail over resources between regions. If, for example, most customers only ever use DNS based failover and if a region is offline would rather fail that entire public IP out of DNS, that might be a much better answer. The concern is that customers are currently using a reassignable IP address to associate a primary database instance or some other resource that is being replicated across regions.

Because of how customers manage their address spaces, if each AZ has its own Internet link, we may not be able to advertise a specific route to a public address from all of the AZs in a region or the coordination there may be too complicated. We should explore whether anycast addressing may help with this or we may need to scope the addresses to an AZ and rely on DNS for higher-level functionality.

Trust Boundaries

One issue with networking in particular is a question of at what point do we trust the network and the intermediate hops. For example, we know that if we were to send traffic on behalf of users across the WAN, it would need to be encrypted and authenticated through something like IPsec, Wireguard, OpenVPN, or some other scheme. I think that this is the right posture in general.

However, other customers may believe that this should be at a different level. While we believe that the cross-AZ link should be dedicated and therefore may not require encryption, some customers may be on a metro network where they don’t believe that the network gives them a good enough guarantee. While we haven’t worked out the customer network architecture, there’s a good chance that a lot of data would default to being unencrypted unless the customer was doing encryption as part of their actual protocol.

This suggests that some customers may not trust traffic at different points in the system, whether that be between racks or between AZs. For the rack case, that would mean that the customer would have to distrust some of the common networking gear in the AZ. For the AZ-to-AZ case, that might mean that the dedicated link isn’t necessarily a guarantee. In either case, it’s also possible that a customer has a different threat model. For example, some customers may want to defend against the kind of issue that has befallen Google and others, where their fiber links were tapped by a state actor.

We believe that traffic within the rack should be able to remain unencrypted from our perspective. While the control plane should still employ mutual authentication and encryption, it seems reasonable that we don’t need to encrypt and authenticate all intra-rack traffic. This doesn’t mean that the TOR or the software running in the rack isn’t susceptible, but that those attacks may not be in our default threat model. As discussion on RFD 6 continues, this will need to evolve.

Storage

I believe storage volumes should live in the same fault domain as instances. That is, a given volume should be able to exist within the AZ and be attached to and detached from instances there. Certainly, being able to say that this is true within the region would be a nicer story, but it makes reasoning about fault tolerance, and almost certainly the implementation, much more complicated. However, GCP does provide an option for cross-zone replication within a region.

That said, it’s hard to say what really makes sense here. We know that our initial implementation will really be focused on a single rack, especially as cross-rack bandwidth may be very different. This may even argue that we’ll have to actually make this rack-scoped, but I think that’s a very weird thing to reason about, especially when compared to AWS, which makes EBS volumes available in a given AZ. The reason this may be weird to reason about is that a user could detach the volume from an instance in rack A, but then try to attach it to an instance in rack B, which would fail. While some operators may want to expose this level of information, many others have tried to restrict it.

Further, because of the challenges with customer networking, having to cross the customer’s equipment will be quite challenging. While it will be a worse customer experience as far as 'why can’t I do this', limiting the scope of volumes or instances to a Cell may be a useful way to trade off flexibility in placement against making sure that a piece of external customer equipment can’t induce a failure and land us in a blame game with the customer.

If we want to eventually make this appear as something that can exist anywhere in the AZ, then we could possibly make that happen assuming that the cost of migration was paid. This would allow storage to move between cells, but the storage could still only live in a single one.
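
A sketch of the attach-time check this implies, with illustrative types and error text: if volumes are Cell-scoped, an attach request is rejected (or turned into an explicit migration) when the instance lives in a different Cell than the volume.

    struct Volume {
        id: String,
        cell_id: String,
    }

    struct Instance {
        id: String,
        cell_id: String,
    }

    // Reject attaches that would silently cross a Cell boundary.
    fn check_attach(volume: &Volume, instance: &Instance) -> Result<(), String> {
        if volume.cell_id == instance.cell_id {
            Ok(())
        } else {
            Err(format!(
                "volume {} lives in cell {}, but instance {} is in cell {}; \
                 attaching requires migrating the volume first",
                volume.id, volume.cell_id, instance.id, instance.cell_id
            ))
        }
    }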

Images

Base images for instances are a special case of storage. Here, being able to make them work region-wide is probably useful by default, though certainly much harder for us to implement as it will likely require coordination and copying between AZs in the region. But given that a user will probably want to launch the same instances across multiple AZs to allow for proper redundancy, it seems reasonable to have images available region-wide if possible.

Projects

A project, as discussed in RFD 4, is a logical construct that represents all of the resources for a given logical application which includes instances, images, networks, etc.

The natural starting point for a project is for it to be regional. That is, a given project exists across the entire region. Customers will be used to this because of AWS’s VPC.

That said, there’s an interesting potential argument to be made for making the project fleet-wide. This could be a more natural way of thinking about things if customers have instances deployed across all regions for redundancy. The most natural way to expose this may be to have a project that consists of sub-projects in the future. The main reason that a fleet-wide project could make sense is that fleet-wide rules for authorization and authentication are easier to manage.

An alternative way to view this is to have the actual project be global (and not just a collection of regional projects). One could create a network per region and associate resources there, but I think from both an implementation and a reasoning perspective, this may make less sense. This would necessitate traversing the WAN, and reasoning about the visibility of resources across the WAN may be much more difficult.

Ultimately, it probably makes the most sense to start from a region perspective.

Summary Table

This table summarizes current thinking about which resources should be coherent across which scopes. This is meant to be a starting point for our discussions and is not meant to suggest what we need to have working with v1 of the product:

Resource                        Coherence
------------------------------  -----------
API Endpoints                   Region
User Interfaces                 Fleet-aware
User Authentication and Groups  Fleet
Instances                       AZ
Storage Volumes                 Cell
Base Images                     Region?
Virtual Networks                Region
Subnets                         AZ
Firewall Rules                  Region
Route Targets                   Region
Internal DNS                    Region
External DNS                    Region
Public IPs                      Region
Projects                        Region
Tags                            Region?