RFD 82
Motivations and Principles for the Design of Operator Facilities

At the core of the Oxide rack is the hardware inside of it. While it’s the software and services that provide the experience that excites customers, how we present the complex reality of hardware, failures, and more will be a part of what keeps our customers with us.

The purpose of this RFD is to drive to consensus on the types of problems that customers have and what we want to solve where those problems intersect with hardware. Importantly, this does not suggest how we solve them (e.g. questions of data coming from the SP, the host, or something else), but instead provides consistent taxonomies and checklists for evaluating our designs.

Other RFDs have already covered major functional components and discussed what we’re looking for from a feature perspective for things like CPUs, network, storage, the quantities, and the layout and form factor of the rack. This RFD is focused on all the other interactions that we have with these components beyond the basic services they provide. For more information, please see the following: RFD 7 Hardware Root of Trust Silicon, RFD 12 Host CPU Evaluation, RFD 9 Networking Considerations, RFD 21 User Networking API, RFD 29 Storage Requirements, RFD 60 Storage Architecture Considerations, RFD 58 Rack Switch, RFD 69 Blind Mating, and RFD 76 Rack Population and Elevation.

Customer and Product Motivations

RFD 78 Customers, Roles, and Priorities gives great background on the different users and priorities within our targeted customers and their needs. In addition to our customers’ motivations, there are several things that we need to consider from Oxide’s perspective, so that we can deliver a high-quality and enjoyable experience not only when our customers use the product, but when we need to support it in the field. We break these into a number of different high-level motivations.

These motivations will be referenced again when we go through the different requirements that we have. Each requirement will be in service of one or more of those motivations.

Capacity Planning (CP)

Capacity planning refers to the act of understanding and evaluating the usage of the varying resources of the rack and, critically, how that usage has changed over time, with the intent of ensuring that a sufficient amount of all resources will be available for the customer’s needs.

For customers this is generally done as a regular or ad-hoc process that isn’t related to an [cat-iup] or [cat-ohf]. It is not a 'hair-on-fire' type of situation.

Here, customers are trying to answer high-level questions that basically tie back into the goal of service availability:

  • When am I going to run out of resources? What is the current utilization of them and how has that changed over time?

  • What steps can I take to improve my current utilization and make better use of my capacity? Who or what is consuming these resources?

  • When do I need to add or purchase additional capacity?

  • Given a future need for a given quantity of resources, what is the best way to fulfill that?

Oxide will probably have questions about the utilization of specific resources and how they have changed over time so we can aid our customers in utilizing their capacity from a support perspective. We may also be interested in understanding how we can learn from them and thus evolve future versions of the product from both a hardware and software perspective.

Product Lifecycle (PL)

The product life cycle area covers a number of cases that we care about, for example the act of installing, upgrading, or decommissioning portions of the rack or the entire rack. Some of the questions that might be asked include:

  • Which component is it that I’m upgrading? How do I make sure that service has been quiesced during the operation? How do I identify it?

  • Are any components in the datacenter predicted to fail or are going to fall out of warranty soon?

  • Were all the new components I installed correctly recognized by the system?

  • Are the newly installed components operating correctly?

Basic Product Operation (BPO)

There is a class of features that are required for the product to operate correctly and efficiently. These are things that don’t relate to direct questions or needs of a customer, but are nonetheless required and important to the correct functioning of the system.

Some examples of items in this category are the thermal control loop of an individual server sled or switch or the ability to reboot a device or a server to cause a new firmware, configuration, or operating system kernel update to occur (e.g. Hubris, host OS, etc.).

One could also imagine cases where some future revision of the product would automatically rebalance and move instances or workloads around in order to better balance and utilize resources like power. Consider an example from Facebook, where the overall power consumption needed to be considered when enabling GPUs or other purpose-built accelerators, which resulted in them powering off other servers. This might require some of the information talked about in [cat-cp] for the system’s own uses.

Operational Hardware Failures (OHF)

Operational hardware failures are a class of failures that the system can detect and handle automatically without necessarily having to page an operator. These types of events may degrade the service provided by a sled or another part of the system (e.g. a failed fan or drive). The broader system may be able to take actions to restore service or it may take operator intervention to do that. However, the critical aspect of this is that, absent subsequent failures, immediate action is not required. Aspects of traditional hardware 'RAS' (Reliability, Availability, and Serviceability) are related to this. The abilities that hardware provides in this space tie into our ability to handle these types of issues, though they are not the only aspect of it.

When compared to [cat-iup], these types of events should generally not cause a 'hair-on-fire' type of scenario. However, it is still important that the system provide high-fidelity information about the problem and its impact, and that we make this easy to service. While this document is focused on hardware, a great deal of this ties into software components as well.

These operational hardware failures are important not just to the customer, but also to Oxide. The more we can understand these types of failures, handle them automatically, and avoid them becoming an [cat-iup], the better the experience for the customer and the lower the chance of our customers experiencing an outage, which ties into their immediate goals of service availability.

In addition, there are also cases where Oxide will want to understand these operational hardware failures so we can determine if there is something more systemic going on.

Questions that we might ask are:

  • Is any part of the rack, whether a sled, switch, or something else in a degraded state? If so, why is it in that state?

  • For any RAS events that have occurred, what is the impact of the event to the service of the rack? What actions has the rack taken automatically? What next steps does the operator or Oxide need to take? What components did this event impact? Which users and instances are impacted?

  • Is this the first time such an event has occurred? Has there been a trend of them?

  • Do I need to replace a component that’s failed or explicitly faulted? If so, where is it and how do I perform that?

Investigation of an Unknown Problem (IUP)

Investigation of an Unknown Problem refers to a diverse set of activities that may involve both customer staff and Oxide support staff. As compared to [cat-cp] or [cat-ohf], this is often a 'hair-on-fire' type of event. Regardless of whether the overall service provided by the product has been impacted, the customer and Oxide are involved in trying to solve this problem. Depending on the type of problem, this may mean that the customers cannot meet their business needs.

With these types of problems, while there may be a flow chart to walk through, the actual set of information and how we may need to react to it are unknown. There are a lot of ad hoc queries or things that one might test as part of such an investigation. Aspects of this may cause us to ask questions of software, firmware, hardware, and the gnarly intersection of them, as problems in the field are never quite so neat.

Questions that either Oxide or customer support personnel are trying to answer might include things like:

  • What is the impact to my customers because of this? Who is impacted, what services, and more?

  • What components of the system might be involved in this?

  • Are there any records of recent errors that immediately preceded this event? Is there a history of related errors?

  • What is the current health of components in the system that might be related to this? What are their current utilization and saturation? Are there any errors?

  • Are there any known software, firmware, or hardware problems that might relate to my rack and be related to the current investigation?

While the above list is not meant to be exhaustive, an important aspect of it is that these are questions that Oxide support staff may be trying to ask of the customer or of their environment. There may be information that Oxide requires that may not be front and center for a customer (e.g. easily findable in the UI); nevertheless, Oxide may need to know it to provide a service experience that buoys our customers when they’re coming to us in a moment of pain and frustration.

Guiding Principles

Throughout this document there are a number of core principles that we are using to guide these requirements and their ultimate expressions.

Relating to our Customer’s Business Needs

The most fundamental principle that we have is that everything we do needs to ultimately be related back to our customer’s business needs in one form or another. RFD 78 Customers, Roles, and Priorities lays out aspects of who these stakeholders are, what they care about, the primacy of the service meeting their business needs, and the importance of the product’s availability.

When we’re considering why certain things are requirements and what is actually important about them, we need to constantly be answering questions like 'How does this help our customer?', 'How does this help our customer’s business?', and 'How does this impact our customer’s ability to deliver value to their business?'.

As we move beyond just requirements and think about the shape of the product, what we display, and how things are configured, this all needs to be in the service of delivering business value to our customers. Take alerts and monitoring: these ultimately need to be tied to the customer’s business needs and what they care about.

There are several cases where we discuss aspects and requirements of the product from Oxide’s perspective, that is, things Oxide needs from the product to support our customers. These things, such as discussions around the service model and what Oxide support staff might need, ultimately are only important insofar as they relate back to our customers. When we bring them up and consider these things, it’s important that we’re always asking ourselves these questions.

An important thing, as is laid out in RFD 78, is that we will have different types of customers and their business needs may be different. As customers grow their environment in the direction of a traditional hyperscaler, the impact of losing a single device becomes much smaller, and while the way the product inherently reacts to failure might not change, the way that we present it to customers might. The failure of one rack is different for a customer with hundreds of racks than for a customer with a single one, and failures that impact one application may always require paging, while those impacting another do not.

Consistency

Where possible, we need to make sure the product behaves in a consistent fashion. We don’t want interacting with different parts of the system to feel unnecessarily different, or to make the product a prime example of Conway’s Law.

There will always be differences in the actual implementation. No two pieces of hardware behave in identical ways, even when they claim to implement the same standard. However, the goal of the product is to unify the experience when working between these different things. Importantly though, differences should not be papered over!

No Surprises

While the product should be delightful, and surprising people by being better than their expectations is good, many of our customers have built up an expectation of how certain things work. Customers are going to come to the product with a large number of preconceptions of how certain things work, and no matter how much education we perform, we cannot assume that we’ll be able to train them to think differently.

This ties into the general concept of affordances, which from a human-computer interaction perspective are the ways in which the design and properties of an object suggest what someone can do with it.

For example, many datacenter technicians are used to being able to pull 2.5" SATA or SAS hard drives and solid state drives from servers without thinking too much about it. They assume that these are hot-swappable and that they can pull them out without thinking about the consequences (e.g. damage).

If someone walks up to the front of the server and sees something that looks the same as that, even though it’s a different type of device, we have to expect that folks will naively assume they can do the same things with it that they’re used to. If an NVMe device that’s in the same type of slot behaves in a different way, that would violate this principle.

Minimizing surprises extends beyond just how things look and the expectations around interactions in interfaces. It also covers how the product handles operations. For example, an automated product action (say, a reboot of a server for an OS or SP update) shouldn’t just go off, without transparency, in a way that interrupts our customer’s business needs.

Coherent Differences

Differentiation is important to our product. Part of what we’re trying to build is something that isn’t on the market today and that’s something that we want to embrace! Critically, if we try to be just another 2U server like Dell or HP, it’ll be hard to distinguish ourselves and compete.

When we do differentiate, we need to keep in mind the previous three principles. We need to make sure that what we’re doing ultimately has business value for our customers no matter what. This doesn’t mean it has to be something that’s immediately obvious to our customer; it may be something that we need to make the product successful from Oxide’s perspective, or something like using a power shelf to reduce the number of rectifiers that can fail and need to be serviced in the system.

From there, it’s important that when we differentiate and make things different from what our customers expect, we’re not setting them up to harm themselves. If something looks and behaves the same way most of the time, but is different in a critical way that can cause problems, then that isn’t useful. Similarly, where we differentiate, it needs to still contribute to an overall coherent customer experience.

Put another way, differences are important, but they need to honor the other three principles. By doing that, we can truly create a product that customers will enjoy and whose differences they will see the value in.

Example Walkthroughs

This section walks through several example flows that we expect will occur with customers. These flows are not intended to suggest a particular architecture or implementation of how they work, but rather tease out some of the different challenges and steps in various processes. The point of them is to highlight the types of information and knowledge that they require.

Diagnosis and Replacement of a Faulty Device

One of the only guarantees that we can make in the product is that at some point some of the hardware components that are inside of it will, unfortunately, fail. We break this flow into several discrete phases. Note, this is not the only way to break these up and is done mostly to help us explore this space. The breakdown in the realized product may differ.

  1. Detection: During this phase, some parts of the system detect and report on facts and observations about components. Examples of this include things like a correctable or uncorrectable error, a specific sensor reading, the timing of an I/O, or the inability to communicate with a specific component.

  2. Diagnosis: Diagnosis takes the facts and observations that were reported during the detection phase and draws conclusions from them. Based on these conclusions, various actions will be taken. A part may be taken out of service, the system may give up on using a device, etc.

  3. Reporting: Reporting takes the actions and information that were determined during the diagnosis phase and makes sure that this is reflected elsewhere. This would include notifications to the customer (e-mail, SMS, pagerduty, etc.), notifications to Oxide support, being recorded in Nexus, or discoverable elements in a web UI or CLI.

  4. Dispatch: If a component needs to be replaced, then the appropriate part will be dispatched to the customer from Oxide. It may already exist on the customer’s site.

  5. Servicing: This is the act of actually replacing the faulted component with a new one and validating that the new component is correctly operating.

  6. Validation: This is the act of making sure the servicing was correct and successful and that everything is now operating as expected.

  7. Resolution: At this point, the system should no longer be degraded and should have returned to full operational capacity with respect to this specific failure.

  8. RCCA: This phase may involve doing a root-cause and corrective analysis of the failed component by Oxide and our vendors. The intent with this phase is to understand why the device failed, learn from that and possibly improve software or better deal with problems in the field.

While there are going to be many types of failures or degraded states that don’t walk through all of these phases, it’s worth teasing apart two different examples of simple failures: the mechanical failure of a fan and an NVMe U.2 device that will no longer power on. The act of detecting and diagnosing most failures will end up being much more involved, as here we’re assuming relatively graceful failures.

A Failed Fan

Presume, for reasons of chance and inescapable entropy, that a fan’s rotor is broken and the fan can no longer mechanically spin. There might be various facts that are noticed when this has happened. The system might notice things like the server sled having a higher ambient temperature, the other fans’ tachometers reporting readings higher than we’d expect, or a fan whose tachometer is reporting a reading of zero or not reporting at all.

The broader system would need to have these facts reported and then it might make a diagnosis that indicates that the fan is broken. It would then indicate that the server sled is degraded. The broader system may try some way of recovering the fan to see if that restores service, and whether that succeeds or fails, it will end up creating a series of notifications. In this case, we’d expect the sled’s thermal control loop to have already taken an action to compensate for this.
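
To make the shape of this concrete, the following is a minimal sketch in Rust of turning tachometer readings into a 'this fan is suspect' conclusion. The names and thresholds are hypothetical, and nothing here is meant to prescribe where detection or diagnosis actually runs.

    /// A single tachometer observation; `None` means the fan did not report.
    struct FanReading {
        fan_id: u8,
        rpm: Option<u32>,
    }

    /// Hypothetical diagnosis rule: a fan is suspect if it reports zero (or
    /// nothing at all) while its peers are spinning well above their normal
    /// range, i.e. the thermal loop is already compensating.
    fn suspect_fans(readings: &[FanReading], normal_max_rpm: u32) -> Vec<u8> {
        let peers_elevated = readings
            .iter()
            .filter_map(|r| r.rpm)
            .any(|rpm| rpm > normal_max_rpm);

        readings
            .iter()
            .filter(|r| matches!(r.rpm, None | Some(0)) && peers_elevated)
            .map(|r| r.fan_id)
            .collect()
    }

    fn main() {
        let readings = [
            FanReading { fan_id: 0, rpm: Some(9_800) },
            FanReading { fan_id: 1, rpm: Some(0) },
            FanReading { fan_id: 2, rpm: Some(9_750) },
        ];
        // Fan 1 reads zero while its peers run hot: report it as suspect.
        println!("suspect fans: {:?}", suspect_fans(&readings, 7_000));
    }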

When those notifications go out, they should include information about the severity of the fault, the impact, and additional next steps that need to be taken. An important part of this is that they need to include enough information to accurately describe which part failed and its location. Operators need to be able to know which server sled is impacted so they can plan, monitor, and understand who is impacted in the broader business. On the other hand, someone servicing the device needs to know where it is physically and which fan they need to replace.

Oxide will need to know the exact details of the failed device so that we can ensure the correct replacement part is sent out. This means getting not just the information about the component itself, but possibly things such as what type of sled it is in (there will be more than one eventually!) and possibly information about the rack. When we perform the RCCA step, there may be more information required.

The person servicing this will enter the datacenter. Different customers will have different levels of expectations for this experience. We need to consider that some will be in co-location facilities and will use what are often called 'remote hands' to perform the action and will not have dedicated staff. The implication of this is that the technician performing the service may not have access to any of the information from the broader Oxide Control Plane and will instead be relying on information that is communicated to them.

That person will be scanning (hopefully) multiple racks to identify the one in question and then trying to identify the sled in question. They will then pull the sled out for service. At that point, the sled will lose power, so the technician will not be able to tell which fan is broken from inspection per se and will need enough information to clearly identify which fan was the one that had failed. If possible, information on the component that allows for confirmation that this is the correct part would be useful. Once confident, they will perform the swap.

After that, the operations team will want to verify that the swap has occurred and the new part is functioning correctly and, where possible, know that it is in fact a different part. Ideally this will have happened because the system has automatically detected that there was a new part inserted and that service has resumed. At this point, the service would be resolved and the failed part would be set aside and possibly shipped to Oxide for RCCA.

The U.2 Device That Won’t Power On

This case is similar to the one described above, and rather than rehash it all, particular differences will be called out. Let’s assume that something has gone terribly wrong and a U.2 NVMe device is now a brick. It will not respond to the application of either 12V main power or 3.3V auxiliary power.

Here, the system may know that a device is mechanically present (because of specific pins) in a given U.2 slot but it is not available for service. While the actual facts and observations that are reported and the means may be different from the failed fan case, the ethos of the 'Detection' phase is the same. The 'Diagnosis' phase will likewise have similar requirements.

One interesting thing in this case is that we will not know the identity of the actual device that is in there. The system may be able to guess that it is the device that was last in that slot, which suggests the control plane may want to store a copy of such information. As part of 'Diagnosis' the automated actions may attempt to power the PCIe root port or take other actions before needing to call for human intervention and ultimately faulting the part and degrading service. Other parts of the broader rack services (such as storage) will need to be informed of this failure and react accordingly.

When servicing the component, this will be something that can be done without powering off and pulling out the sled, unlike the fan, due to the current plan for front-serviceable U.2 devices. Here, the technician would again need to know the exact server sled and then which drive in particular we are referring to. Here, additional identifying information such as the rack, sled, and U.2 bay labels or LEDs could be employed.

With such a device, it is possible that the service flow would first attempt to reseat the device, that is physically remove it and insert it again and see if that restores service. This would call for some way of the technician knowing how to proceed and determining what happens. While we should assume the technician may not have access to the control plane in any form, it may be reasonable to assume they’ll be able to communicate with those that do.

Finally, when a working device is inserted and service is restored, the rest of the steps are similar.

Takeaways

Here are some important things to take away from these different paths as they influence how we approach and write up requirements:

  • The ability to identify serviceable components is important for multiple parties.

  • We cannot assume that all technicians are direct employees of the customer with access to the control plane or an Oxide technician working with the customer.

Reviewing Capacity

An important activity that our customers will be doing is reviewing the capacity available in the system as an aspect of [cat-cp]. Let’s imagine a case where a customer has been using an Oxide rack, or multiple racks, for a period of six months.

When we talk about capacity, it can be tempting to think of it in terms of the underlying hardware resources: CPU cores, GiB of DRAM, network bandwidth, storage IOPS and capacity, etc. However, looking at it from just the raw resource usage perspective leaves out a number of things that a customer might care about. At the end of the day, the thing that’s important to our customers is the number of instances that they can provision of different types because that’s what their users are actually going to do.

This metric isn’t perfect and constructing the means to look at and determine the right set of instance types and quantities to consider is a complex activity, but it translates into what’s actionable to them. One benefit of looking at things this way is that it does capture the presence of islands of capacity.

With this in mind, here are some of the things that we’ll want to be able to potentially ask of the system in general:

  • What is the breakdown of instance types across the entire fleet, for a given user, or for a given project?

  • How has that changed over a period of several months? What are the trends?

  • Is there a given resource that is more stranded or overused than others? Has this changed over time?

  • Have any of the broader resources like storage IOPS or network bandwidth been reaching saturation?

Customers will likely want to be able to slice and dice these questions in many more ways than this limited set. Point-in-time data is only useful if we can compare it to other data.

In addition, some capacity planning exercises may be specific to a particular application. Being able to correlate this data with external data could be quite powerful. Consider being able to track a deployed application’s resources alongside that application’s request volume and application-specific latency details. While the system view is important, it can’t be the solitary means.

To help make this more concrete, we’ll walk through a couple of different capacity scenarios and go through the more specific questions and actions that operators might want to take.

What’s my current operational runway?

Some of the fundamental questions that operators of hardware infrastructure are going to have are related to their current capacity and understanding when they need to buy additional hardware, especially given that acquiring new hardware has a substantial amount of lead time and business overhead.

When operators begin looking at this, they might start with a high-level overview of the system and hypothetically compare the following (the actual set in the product may vary):

  • The 'raw capacity' being what Oxide describes as provisionable resources.

  • The amount of 'assigned quota' being the amount of resources they have assigned to teams in the forms of their quotas.

  • The amount of 'allocations against quota' which looks at what chunk of the 'assigned quota' has actually been allocated.

  • The amount of 'utilized resources' which shows, of those that were allocated, what amount is actually being used (e.g. using 20% of the allocated CPU or storage).

  • The breakdown of individual resources against those (e.g. CPU, DRAM, network BW, storage bytes and IOPS, etc.).

  • The trends in changes of all of the above over different periods.

This might be the basic starting point for an investigation from an operator. Whether this is in the default dashboard or variants thereof, this might lead into a number of different explorations that an operator might pursue based on what they discover.
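
As a hedged illustration of the kind of comparison above, the following Rust sketch models a per-resource snapshot and a naive linear runway projection. The names, fields, and numbers are all hypothetical rather than a proposed API, and real capacity planning would also have to account for instance shapes, islands of capacity, and more.

    /// Hypothetical point-in-time view of a single resource (e.g. CPU cores).
    struct ResourceSnapshot {
        raw_capacity: f64,   // provisionable resources
        assigned_quota: f64, // handed out to teams as quota
        allocated: f64,      // actually allocated against quota
        utilized: f64,       // of the allocated amount, actually in use
    }

    impl ResourceSnapshot {
        fn allocated_pct(&self) -> f64 {
            100.0 * self.allocated / self.raw_capacity
        }
        fn utilized_pct_of_allocated(&self) -> f64 {
            100.0 * self.utilized / self.allocated
        }
    }

    /// Naive linear runway: given periodic allocation samples, project how
    /// many more periods remain until allocation reaches raw capacity.
    fn runway_periods(history: &[f64], raw_capacity: f64) -> Option<f64> {
        if history.len() < 2 {
            return None;
        }
        let last = history[history.len() - 1];
        let growth = (last - history[0]) / (history.len() as f64 - 1.0);
        if growth <= 0.0 {
            return None; // flat or shrinking; no projected exhaustion
        }
        Some((raw_capacity - last) / growth)
    }

    fn main() {
        let cpu = ResourceSnapshot {
            raw_capacity: 4096.0,
            assigned_quota: 3800.0,
            allocated: 2900.0,
            utilized: 1700.0,
        };
        println!(
            "quota assigned: {:.0}, allocated: {:.0}% of raw, utilization of allocated: {:.0}%",
            cpu.assigned_quota,
            cpu.allocated_pct(),
            cpu.utilized_pct_of_allocated()
        );
        // Monthly allocation samples; projects roughly how many months remain.
        let monthly = [2300.0, 2450.0, 2600.0, 2750.0, 2900.0];
        println!("projected runway: {:?} months", runway_periods(&monthly, cpu.raw_capacity));
    }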

Say they discover allocations against quotas for most resources, but there is a resource like DRAM that is relatively underused. There are some natural questions that might follow:

  • What do I see when I break down unused DRAM on a per-rack and per-sled basis? Are there any islands of capacity?

  • Is there a particular other-resource heavy instance type that is being provisioned a lot that could explain this?

  • Are there any instances that have never utilized a large amount of their provisioned resources that might explain this?

Say, on the other hand, that we find a lot of storage is being used. An operator might want to break this down in a couple of ways. The first order might break that usage into the following categories:

  • Storage allocated globally

  • Storage used globally

  • Storage used by images

  • Storage used by attached volumes (i.e. currently accessible to an instance)

  • Storage used by unattached volumes

  • Storage used by backups

Once this is understood, operators may want to break this down based on a per-project basis or some other scope to understand whether it’s been a general proliferation of usage or if there are particular projects that are using substantially more or less.

Variants of this will come up for lots of different resources, possibly including power. Some of these views will want to be broken down physically based on the server, the switch, or another component. Some will want to be broken down logically based on what is responsible for the resource usage.

Another set of questions that an operator might ask here is what the projected longevity of certain devices is. We know that, for example, all flash-based devices have a limited lifetime and at some point will need to be replaced.

Can I increase this application’s quota by 50%?

Operators of infrastructure and application teams will often be dealing with common ground in terms of quotas. As the application team does their own forecasting and planning, a common set of interactions will be coming to infrastructure operators and asking for an increase in their quota. This is one of the more fundamental things that the teams will do, as it ties into the question of when they need to buy more capacity.

There are a couple of interesting things in this interaction that are worth highlighting. The first is that the most natural way to think about this for some teams may be based on types of instances and quantities of them. For example, I have this many database shards and I might need to provision n more of them, which take this many instances of this size with the following disk and network requirements. Importantly, that’s what these application teams are going to create, and due to islands of resources, it could be that even though that quantity of resources is available, they are fragmented such that they can’t be used (see the sketch below).
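
The following Rust sketch illustrates why that fragmentation matters: each instance must fit wholly on a single sled, so aggregate free resources can overstate what is actually placeable. The shape and the numbers are purely illustrative.

    /// Hypothetical instance shape and per-sled free resources.
    #[derive(Clone, Copy)]
    struct Shape {
        vcpus: u32,
        ram_gib: u32,
    }

    struct SledFree {
        vcpus: u32,
        ram_gib: u32,
    }

    /// How many instances of `shape` can actually be placed: each instance
    /// must fit on one sled, so this is a per-sled calculation, not a sum of
    /// total free resources.
    fn placeable(shape: Shape, sleds: &[SledFree]) -> u32 {
        sleds
            .iter()
            .map(|s| (s.vcpus / shape.vcpus).min(s.ram_gib / shape.ram_gib))
            .sum()
    }

    fn main() {
        let shape = Shape { vcpus: 16, ram_gib: 64 };
        // 48 free vCPUs and 192 GiB in aggregate (enough for three instances
        // on paper), but spread such that only two can actually be placed.
        let sleds = [
            SledFree { vcpus: 20, ram_gib: 70 },
            SledFree { vcpus: 20, ram_gib: 70 },
            SledFree { vcpus: 8, ram_gib: 52 },
        ];
        println!("placeable instances: {}", placeable(shape, &sleds));
    }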

However, while that level of information is great, it may be more detailed than they know. Most teams are taking guesses and may not actually be able to accurately quantify this. This already suggests that the better the application teams can describe collections of their instances that relate to one another, the better. That gives them and the folks in charge of the infrastructure more information.

To handle this interaction and exchange here are some of the kinds of questions that operators might have or things they want to do before they go ahead and increase the quota:

  • Review the application team’s current resource usage. What percentage of their resources have they allocated against their quota? Of those allocated resources, how utilized are they? For example, have they overestimated some amount of resource that they need and it’s always idle, or are there moments of peak activity?

  • Review overall capacity trends and what’s been allocated against current quotas to better understand the current capacity of the system.

  • Try and phrase the resource allocation that the application team asks for in terms of instances and resources. If the team doesn’t have that, try and derive it based on the currently provisioned resources.

  • Based on the above, try and answer the question of whether we have the resources to meet the request. Evaluate what happens if we adjust the quantity and distributions of instances. Evaluate how other teams’ usage patterns might impact that and how we expect that to influence things.

Do I need to add another network uplink?

There are many shared resources in the rack that have a finite capacity, but are shared between a large number of users such as the storage subsystem or the network. Let’s focus on the network.

As part of general planning or routine observation, a network operator may find that one of the switch uplinks is getting saturated and will need to ask themselves whether they should add another cable or whether this is indicative of a lot of small users somehow running them out of capacity. Depending on these answers, different actions might be taken. There might be an attempt to cut back and further restrict the aggregate usage of these ancillary instances, it may result in the desire to run another link, or it could result in some other action such as moving instances around or something else.

Some concrete questions that we might ask are:

  • What’s the historical port utilization and saturation been? Is this a constant thing, does it have periodic spikes, or is there any other pattern to the data?

  • What percent of the time is the port saturated, resulting in errors?

  • Is there a correlation between port usage and say overall provisioned resources?

  • What users are contributing to this utilization and saturation? What type of traffic is it? Is it the same users or different ones?

  • What is the current configuration of this uplink? If I have more than one link am I directing traffic across the ports in a reasonable way? If not, why might that be?

What’s the impact on my capacity as I do this rolling upgrade and maintenance?

While we’d rather avoid it, there are often large changes that might need to happen to a fleet of machines from either a hardware or software perspective in the field. This may be due to things like reboots to account for newly discovered security problems that need to be worked around in the OS or a FRU that needs to be replaced in the field. Managing this process and understanding its impact can potentially be a nightmare. While there is a lot of tooling that has been proposed in the product to help with this, like transparent live migration, there are still a number of questions that operators might have.

Let’s presume that a batch of DIMMs or fans have to be replaced in every sled across a given fleet. There are several questions that operators might want to ask such as:

  • How many machines can I service in parallel without impacting my overall capacity? How does this impact isolated islands of capacity? Will I still be able to meet affinity requirements during this time?

  • Based on my growth rate, how much unused capacity do I need to plan on keeping around during this operation?

  • What is the average headroom of a given sled? What about the network? Does moving things around risk bumping against a limit?

  • Are any of my customers disproportionately impacted by this? Are there things that we can communicate or do in advance to mitigate the impact of this?

Should I change my service model based on failure rate?

When we have customers that have larger fleets of Oxide racks and the impact of failures on their overall capacity is less pronounced, then there are some interesting capacity planning questions that our customers might ask. Let’s use storage as an example.

The design of the storage system is such that we can tolerate drive failures and as we expand the number of racks that a customer has, it is likely that they won’t all be replaced and dealt with immediately after a failure occurs. The amount of time that a failed drive is left without being replaced has an impact on the overall available disk capacity and potentially IOPS available.

A reasonable question that a customer might ask is: 'Based on my failure rate to date, what is the impact on my storage availability if I vary the frequency of DC sweeps for disk replacements?' To tease this apart, consider a case where there is a fixed cost for performing a datacenter sweep to replace failed disks. There is an economic point for the business where the cost of those sweeps is more than paid for by the additional, available capacity. The ideal value might shift depending on the impact on meeting customer demand and the size and impact of failures.
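
As a rough, hypothetical model of that tradeoff (not a proposed product feature), the sketch below estimates the average capacity stranded by failed-but-unreplaced drives as a function of the sweep interval; comparing that stranded capacity against the fixed cost of a sweep is what gives the break-even point.

    /// With a roughly constant failure rate, the average number of failed
    /// drives awaiting replacement is about failure_rate * sweep_interval / 2,
    /// so the average stranded capacity scales linearly with the interval.
    fn avg_stranded_tib(failures_per_month: f64, sweep_interval_months: f64, drive_tib: f64) -> f64 {
        failures_per_month * sweep_interval_months / 2.0 * drive_tib
    }

    fn main() {
        let failures_per_month = 1.5; // illustrative fleet-wide drive failure rate
        let drive_tib = 3.2;          // illustrative usable capacity per drive
        for interval in [1.0, 3.0, 6.0] {
            println!(
                "sweep every {interval} month(s): ~{:.1} TiB stranded on average",
                avg_stranded_tib(failures_per_month, interval, drive_tib)
            );
        }
    }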

Here are some concrete pieces of data that we would need to answer this kind of question:

  • What is the historical amount of available storage capacity?

  • How have failures impacted that amount and have I ever been below a target threshold as a result?

  • What has my actual failure rate been to date? What should I expect going forward? Can I leverage Oxide-collected failure data to better inform this prediction?

  • How much capacity would be impacted if I changed the service rate? Would that actually impact what my business can deliver?

Installing a new Network Uplink

An example of [cat-pl] is a case where a customer wants to add more outbound network capacity by adding another network uplink into the rack. This process involves a few different aspects and might work as follows:

  1. Indicate through the control plane that we intend to allocate a new port for a network uplink and have the control plane indicate which port on which switch and rack it should be plugged into.

  2. A technician goes into the datacenter. They must identify the network lead that they are looking for and the transceiver. They will then identify the ports on the switch in question. Like in the Fan and U.2 NVMe cases, there may be coordination that allows us to use labels or LEDs to better identify which port is the correct one. It may also be possible to design things such that it doesn’t matter.

  3. The technician will want to know that they’ve successfully inserted the transceiver and cable and the broader datacenter staff will want to verify that they can get a link up on that port. Once that’s the case, the technician can leave the datacenter.

  4. The broader control plane will begin to leverage that port and datacenter staff will monitor and ensure that traffic is flowing across it correctly.

Requirements

This section lays out different requirements that we have and relates each of them back to the specific motivations outlined above and then explains for each of the motivations why it’s important.

While some aspects of architecture and implementation may be suggested by what’s discussed below, the intent is to focus on what is required and to use subsequent, focused documents to define things like architecture, implementation, communication paths, etc.

The distinctions between these different requirements are somewhat arbitrary and there is a lot of overlap between things like inventory, presence and location, etc.

Component Power Cycle Control

There are many places where the ability to control power of components such as a server sled, switch, or even an individual device is important. This comes across as something that shows up as part of several different motivations.

For [cat-bpo] we need the ability to reboot a server and other components as part of the normal software upgrade process. For example, to complete a host operating system upgrade, the control plane will want the ability to reboot the server. To apply a firmware update to an NVMe device or other component, a reboot of that component may be required, though this is not meant to suggest what the required granularity is.

For [cat-ohf] part of dealing with faulty conditions in the field may include an automated power cycle of the component to try and restore service before taking a more drastic action. Another part of this is that if a component has been determined faulty, it might make sense to power it off to reduce additional issues that could arise from it.

For [cat-iup] this is more about dealing with odd issues that arise in the field. As part of the support troubleshooting process, there may be reasons why support staff will want to issue a reboot to try and clear out a situation. This doesn’t necessarily suggest that we want a common reboot button that’s available or front and center in the UI per se. Though if we expect that to be something we want customers to do, hiding it in the UI can be frustrating.

Similar to the planned cases, this doesn’t extend to just the server sled. This may tie into many other components in the rack, like the power shelf controller, a rectifier, a particular NIC, switch, or even a specific switch port. It’s hard to anticipate what kinds of diagnostic or repair actions we’ll want to take in response to an unknown problem, so the more capabilities that we have, the more flexibility that’ll give us in the field.

Diagnostic Reboot and Crash Dump Support

  • Motivations: IUP

Many operating systems and devices support the ability to explicitly cause the system to fail and create a copy of memory and CPU information in the process and write that out to stable storage. This process is sometimes referred to as creating a crash dump. These mechanisms can often be triggered through hardware schemes. For example, sending a non-maskable interrupt (NMI) on x86 can be used to trigger this.

When dealing with [cat-iup] we would like the ability to inject and request this from systems where possible. It is important that this be an out-of-band mechanism, which often has hardware implications. In many cases, such as an operating system deadlock or similar, the hardware-based mechanism can be triggered in a way to still cause this to occur. Where practical, the ability to trigger this for all components is useful.
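
To illustrate the shape of such an out-of-band mechanism, here is a minimal Rust sketch built around a hypothetical abstraction for whatever signal the platform exposes (e.g. asserting NMI to an x86 host). None of these names reflect an actual Oxide or Hubris interface.

    /// Purely illustrative: an abstraction over whatever out-of-band signal a
    /// platform exposes for forcing a diagnostic dump.
    trait DiagnosticInterrupt {
        fn assert(&mut self);
        fn deassert(&mut self);
    }

    fn request_crash_dump<P: DiagnosticInterrupt>(pin: &mut P) {
        // Pulse the line; the host OS is expected to take a crash dump in
        // response and write it to stable storage before rebooting.
        pin.assert();
        pin.deassert();
    }

    // A fake implementation so the sketch is self-contained and runnable.
    struct FakePin;
    impl DiagnosticInterrupt for FakePin {
        fn assert(&mut self) { println!("NMI asserted"); }
        fn deassert(&mut self) { println!("NMI deasserted"); }
    }

    fn main() {
        request_crash_dump(&mut FakePin);
    }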

Firmware Upgrade

  • Motivations: BPO

Hardware devices from third parties will have some amount of firmware (software which we generally cannot build, modify, or usually even observe). We know that all software will need to be changed during its lifetime, whether to address bugs or changes in operating expectations. As a result, we need hardware devices to support the ability to upgrade their firmware in some fashion.

This is mostly a part of [cat-bpo] as this is something that, like a software update, we’ll need to just take care of automatically as part of the product. We need to make this experience as painless to our customers as we would any other software upgrade. Issuing a firmware upgrade may also show up as part of [cat-iup] or to mitigate a [cat-ohf].

Health and Status

Central to our customer’s ability to have the product be available and providing useful service to their users is the reliability and availability of the underlying devices that make up the system. Critically, we need the ability to answer the question of whether our devices are meeting the expected performance and reliability characteristics.

When we think of a device being healthy, this is more than the traditional RAS (Reliability, Availability, and Serviceability) characteristics. We can also think of a device as providing an SLA for its services. Consider, for example, a storage device. We might phrase an SLA as a certain number of I/Os completed within certain latency bands.
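
As a sketch of what checking such an SLA might look like (the bands and thresholds are illustrative, not proposed values):

    /// Hypothetical SLA expressed as latency bands: e.g. 99% of I/Os complete
    /// in under 1 ms and 99.9% in under 10 ms.
    struct LatencyBand {
        limit_us: u64,
        min_fraction: f64,
    }

    /// Returns true if the observed latencies (in microseconds) meet every band.
    fn meets_sla(latencies_us: &[u64], bands: &[LatencyBand]) -> bool {
        let total = latencies_us.len() as f64;
        bands.iter().all(|band| {
            let within = latencies_us.iter().filter(|&&l| l <= band.limit_us).count() as f64;
            within / total >= band.min_fraction
        })
    }

    fn main() {
        let sla = [
            LatencyBand { limit_us: 1_000, min_fraction: 0.99 },
            LatencyBand { limit_us: 10_000, min_fraction: 0.999 },
        ];
        let observed = vec![250_u64; 995].into_iter().chain(vec![15_000; 5]).collect::<Vec<_>>();
        // 0.5% of I/Os exceeded 10 ms, so the second band fails.
        println!("meets SLA: {}", meets_sla(&observed, &sla));
    }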

In service of [cat-ohf] the product needs to be able to understand and track a device’s current health, understand when errors occur, and know whether it’s operating within the expected thresholds. This is often the very foundation of this entire category. We need reliable ways to know if a device is misbehaving or emitting errors, though it’s worth noting that one should not expect a device to fail gracefully or to always trigger the "I’m failing" bit.

What this means will vary for each type of component in the system. CPUs often have things like the x86 Machine Check Architecture, whereas a corresponding NIC will have PCI Express Advanced Error Reporting and a number of other mechanisms. Some of this data is about tracking from the consumer’s perspective, such as how long an NVMe I/O takes or how long it takes the SP to read a particular sensor.

Components have a large number of [cat-ohf] that we would like to be able to handle gracefully in the name of overall system availability. In addition, there may be some number of transient, non-fatal errors that we need to track (e.g. DIMM correctable errors) and potentially take additional action based upon. Where possible, and where it is a reasonable tradeoff of complexity and economics, we want to select hardware that enables our ability to gather and react to such events.

In addition, the more of a historical record of this information that we can gather, the better we can refine and iterate on the product to deal with new classes of failures. While our individual customer sites may be small, it’ll be important that we learn from aggregating information in the field. As we go to make subsequent vendor decisions, this information can be useful to arm ourselves with. Importantly, this is an area where we can really differentiate. Many servers sold by traditional OEMs hide all errors and attempt to deal with them only in firmware (e.g. MCAv2 cloaking), preventing systems from being able to take more intelligent action in the face of errors.

In a similar vein, when dealing with [cat-iup] we’re going to want to be able to review components in the system and ask questions of their health and status during different phases. The cacophony of failure can lead to problems manifesting in ways that one doesn’t expect. For example, folks have debugged cases where a SAS expander did an internal reset; however, the initial user-visible problem manifested itself as a network connection being reset![1]

As a result, Oxide support staff need a way to understand what kinds of hardware-related events have occurred on the system or if a device is not meeting some sort of SLA when investigating problems.

While [cat-cp] is focused on customer-visible resource usage, understanding the general trends of health and status, particularly as it applies to utilization and saturation, is important. A tricky part of these measurements is that we can lose an accurate sense of utilization and saturation the more that we average a measurement. Take the case of a NIC. A 10 Gbit/s link may have an average use of, say, 4-6 Gbit/s; however, there could be smaller windows, say on a 1 ms basis, where the port is actually saturated and dropping packets. Where possible, we want to make sure that we’re aware of such types of issues and understand that as part of our capacity planning exercise.
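
A small worked example, with illustrative numbers only, of how averaging hides saturation:

    /// Per-millisecond byte counts on a 10 Gbit/s port: the one-second average
    /// can look comfortable while many individual milliseconds are at line
    /// rate (and presumably dropping packets).
    fn main() {
        let line_rate_bytes_per_ms = 10_000_000_000u64 / 8 / 1000; // 1.25 MB per ms
        // Half the milliseconds fully saturated, half idle.
        let samples: Vec<u64> = (0..1000)
            .map(|i| if i % 2 == 0 { line_rate_bytes_per_ms } else { 0 })
            .collect();

        let avg_util = samples.iter().sum::<u64>() as f64
            / (line_rate_bytes_per_ms * samples.len() as u64) as f64;
        let saturated = samples.iter().filter(|&&b| b >= line_rate_bytes_per_ms).count();

        // Prints ~50% average utilization, yet 500 of the 1000 windows were saturated.
        println!(
            "1s average utilization: {:.0}%, saturated 1ms windows: {}/{}",
            100.0 * avg_util,
            saturated,
            samples.len()
        );
    }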

Identity and Inventory

Identity and inventory are related concepts. Identity covers knowing how to uniquely identify components in the fleet. This includes everything ranging from CPUs to fans, transceivers, DIMMs, SSDs, to things like PCBs and even the rack itself. For a compilation of such devices, see the Managed Device Inventory.

Identity provides a means of distinguishing between two otherwise similar components and knowing that they are in fact not identical. While we might not care about the identity of every single thing in the system, such as every capacitor, we will care about more than just FRUs (field replaceable units). There may be major components from vendors that we need to identify for purposes of root-cause and corrective analysis as part of [cat-ohf] even if it’s not a discrete FRU (such as a component on the motherboard, imagine a chipped-down NIC).

For many devices, identity is a tuple of different information such as a manufacturer, part number, and serial number. Some components like x86 CPUs require the CPUID family, model, and stepping values in addition to unique values. In some cases, additional information about the device such as manufacturing date and location, firmware revisions, etc. is useful. For components without identity information, if the component is in a fixed location, some semblance of identity can be derived from that location.
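
A minimal sketch of what such an identity tuple might look like as a data structure; the fields and names are hypothetical and the actual shape will vary by component class.

    /// Hypothetical identity information for a component.
    struct DeviceIdentity {
        manufacturer: String,
        part_number: String,
        serial_number: Option<String>, // not every component has one
        // Extra, class-specific identity, e.g. CPUID family/model/stepping
        // for an x86 CPU or a firmware revision for an SSD.
        extra: Vec<(String, String)>,
    }

    /// For components with no identity information of their own, identity can
    /// be derived from a fixed location.
    enum Identity {
        Reported(DeviceIdentity),
        DerivedFromLocation { location: String },
    }

    fn main() {
        let ssd = Identity::Reported(DeviceIdentity {
            manufacturer: "ExampleCo".into(),
            part_number: "XC-U2-3T2".into(),
            serial_number: Some("S3RI4LNUM".into()),
            extra: vec![("firmware".into(), "1.07".into())],
        });
        let fan = Identity::DerivedFromLocation { location: "sled 14, fan 2".into() };
        for id in [&ssd, &fan] {
            match id {
                Identity::Reported(d) => {
                    println!("{} {} serial {:?}", d.manufacturer, d.part_number, d.serial_number)
                }
                Identity::DerivedFromLocation { location } => {
                    println!("identified only by location: {location}")
                }
            }
        }
    }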

Inventory refers to assembling information about identity, along with some amount of location information and tracking that in some scheme (e.g. a centralized database). In addition, inventory may also care about things such as the component’s capabilities and other useful ancillary information such as warranty or when it was first installed into the datacenter.

From both a [cat-cp] and [cat-pl] perspective, having inventory information allows one to better understand what makes up the rack and what’s in use. When considering what to upgrade or the current state of capacity, it helps to know what actually exists. For example, if an operator decides based on the current utilization of resources that they want to add upstream network capacity, it helps to have an existing inventory of what’s in use, where, and how that will fit together. As part of the general life cycle, understanding a component’s service time and record, expected lifetime, and warranty is important and useful.

When dealing with [cat-ohf] or [cat-iup] understanding the identity of components is important. At the most basic level, if we’re going to dispatch a replacement part we want to know everything about what type of part should be there to begin with.

From there, understanding the exact details of what’s in a device that may be having an unknown problem under investigation is quite important. The same is true when trying to do automated fault management. Knowing the identity of the device and aspects that are specific to it ends up being quite important for managing it, as we will almost certainly have part-specific behavior that may vary based on generation or even, unfortunately, manufacturing run.

Indicators, Buttons, Switches, and Labels

Indicators, buttons, switches, and labels are a core part of the interaction and experience that someone working directly on the hardware as part of a service path has to deal with. There are many cases where this comes up from a hardware perspective in the examples of [exp-fan], [exp-u2], and [exp-uplink].

These different means of interacting physically with the rack are all in service of [cat-ohf] and [cat-pl]. Whenever we think of these, there is an overriding set of questions that we need to think about. At the core, we need to ask: what information are we trying to convey to someone and how is it useful in their current situation? Put another way, what question does someone have that this is trying to answer and what actions should they take in response to this information?

For example, a large number of indicators or labels are going to be for someone who is trying to perform a particular piece of hardware service and who may or may not have anything to do with the company. An example of something we may want to be able to indicate is that a U.2 device or a compute sled has been successfully latched into place and can obtain power. This is trying to solve the problem of making sure that someone does not have to come back into the datacenter to fix the mistake of not having properly seated something.

There are a couple of different properties that all of these should have:

  • It should be hard to confuse this for something else. The intent should be relatively clear and have consistent branding. Consider the case when we have components that are manufactured by someone else, like an SSD or a fan. Since it may have its own branding information, we need to make sure that it’s obvious what relates to the Oxide rack and should be looked at versus what does not.

  • If this is a button, switch, or some other thing that someone is required to operate, it should be clear how to operate and interact with it.

  • Customers may have pre-conceived expectations of what a specific button, switch, or latch does from working with other systems. If it is visually similar, it should work in a similar way.

  • The information communicated should be consistent with other parts of the product. For example, if an indicator on a server reflects some piece of information, looking at that in another view such as a web UI or CLI should not produce different information.

It’s also important to call out that there are many different ways that we can design this part of the system and its service flows. While there is traditional labeling and LEDs, there are also other things we can do. Imagine the case where most technicians do have a smartphone and we can have an application or web UI that they can go to and scan a QR code that brings them the information and visually highlights what they care about. One could even conceive of other much more involved things such as augmented reality interfaces where we can overlay information from the product over what someone is seeing.

These different approaches all have their own sets of tradeoffs, costs, and impacts on the customer experience. However, the important thing is not to assume that we’re wedded to one particular path at this stage, rather that no matter what we do it needs to solve for these particular use cases and requirements and some combination may be employed.

Location and Presence

Location and presence are important tools that are useful for operators of the rack and for datacenter technicians. See also, RFD 40 Physical Presence and Location: Requirements and Architecture.

Every component, whether it be a disk drive, fan, DIMM, the server sled, or even the rack itself, has a physical location. These locations are relative. For example, consider a DIMM. One might care which slot it’s in on the motherboard. However, then the question is which sled that motherboard is in. And from there, where is that sled in the rack, and finally where is that rack in the actual physical data center, and which data center is it.

Presence is often related to location. It answers the question of whether something is physically present in a given location, regardless of whether or not it is functioning. The two are interconnected in terms of utility. Someone usually cares about whether or not something is present in a specific location. Consider the case of a DIMM that is not present: the very next question someone might ask is where it is missing from. See the Managed Device Inventory for details on how presence works for different components.
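
As a hedged sketch, location and presence might be modeled as separate pieces of information along these lines (the names and fields are hypothetical):

    /// Illustrative only: locations are relative and nest, and presence is a
    /// separate question from whether a component is functioning.
    struct Location {
        datacenter: String,
        rack: String,      // e.g. a rack label or identifier
        cubby: u8,         // sled position within the rack
        component: String, // e.g. "DIMM A3" or "U.2 bay 7"
    }

    enum Presence {
        Present,    // mechanically detected in the slot, working or not
        NotPresent, // nothing detected in the slot
        Unknown,    // the slot cannot report presence
    }

    fn describe(loc: &Location, presence: &Presence) -> String {
        let state = match presence {
            Presence::Present => "present",
            Presence::NotPresent => "not present",
            Presence::Unknown => "presence unknown",
        };
        format!(
            "{} in cubby {} of rack {} ({}): {}",
            loc.component, loc.cubby, loc.rack, loc.datacenter, state
        )
    }

    fn main() {
        let loc = Location {
            datacenter: "dc1".into(),
            rack: "rack-07".into(),
            cubby: 14,
            component: "U.2 bay 3".into(),
        };
        for presence in [Presence::Present, Presence::NotPresent, Presence::Unknown] {
            println!("{}", describe(&loc, &presence));
        }
    }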

A major part of [cat-ohf] and [cat-pl] is having service technicians be able to correctly identify a piece of hardware to replace or to insert it. In this case someone is walking into the datacenter and needs to try and determine which particular physical component they care about out of what may be a multitude of racks. We need to assume that this person may not have access to the management console or anything other than the information in a support ticket. For this, the ability to correctly identify the component may mean things like:

  • Labels for things like U.2 bays, switch ports, sleds, rack locations, and the rack.

  • LEDs that can be toggled to correctly identify a specific location such as a U.2 bay, a switch port, or a server from the front or rear.

  • Silkscreen labels on PCBs to make it clear what a slot or component on a PCB is.

While these aren’t necessarily the ways that we should solve these problems, the important thing is making this as easy to get correct as possible. Identifying and servicing the wrong component because of bad location information is a fast way to create a cascading problem.

When we instead turn towards [cat-iup] and another aspect of [cat-ohf], we care about location information for being able to correlate issues and also understand the impact of problems. For example, based on knowing which servers are in the same rack or are being serviced by similar components, we can understand the impact of what has failed and why.

In addition when we’re investigating problems, knowing where things are located can be important because when there are correlations between failures and location, there may sometimes be a shared cause due to something in that location.

When we instead turn to presence, it has a major role in [cat-iup] and [cat-ohf]. When a device disappears from software’s view, being able to answer the question of whether it is physically there or not can be useful in bifurcating the solution space and what needs to be considered. While a device that’s not present may not be fully seated, which is a rather common problem, a device that is present but not operational tells you a lot more about what might be going on and helps direct the investigation.

Having the ability to answer these questions without requiring someone to visually inspect the unit is a useful and important consideration. Each trip into the DC at best only costs money and time. However, sending someone into that environment for a wild goose chase does not leave them endeared. The more information we have going in, the better this will be for our customers.

Something we may want to consider as part of the service flows in [cat-ohf] and [cat-pl] is how we should communicate presence. While the software means are more straightforward, like with location, there may be value to being able to indicate this in the datacenter.

No Irreparable Damage

All of the hardware components that we’re considering have certain physical tolerances for their operating environments. This may be the ambient temperature that they can operate in, the voltage and current that they can handle, or some other attribute. When components exceed these thresholds, that can result in permanent damage and cascading failures such as electrical fires.

For [cat-bpo] we need the ability to make sure that we have appropriate fail safes in either the components or in the surrounding pieces to try and avoid such catastrophic failure. For our customers, temporary service outages are vastly preferred to permanent damage. Examples of this include things like explicit thermal shutoffs.

When these occur, it is valuable to be able to know that they have. In particular, being able to answer this kind of question becomes useful for something like [cat-iup]. If we hit one of these trip points, knowing that fact can be very useful in answering questions to explain and understand what happened.

Recoverability

  • Motivations: OHF

There will always be unplanned issues in the field that have cascading consequences. Consider a couple of different examples:

  • A firmware update is being written out to a non-volatile memory and the rack loses power.

  • An NVMe device has acknowledged a write, but it’s still sitting in its cache when power is lost.

  • A piece of flash containing configuration or firmware is wearing out or errors on a bus cause bad data to be committed.

While the means, methods, and impact vary for each type and class of component (see the Managed Device Inventory for more information and examples), we need to deliver a system that can recover from these types of events. The means of handling these may vary based on the component. For example, it may mean having more than one firmware slot or power-loss protection on an NVMe device. While some components may be able to mitigate this type of failure themselves, for others we will need specific mechanisms to recover from it.
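
To illustrate one such mechanism, here is a minimal sketch of dual firmware slots with fallback, assuming a hypothetical validity check (e.g. a checksum over the image); this is not a description of any actual implementation.

    #[derive(Debug)]
    enum Slot {
        A,
        B,
    }

    struct SlotState {
        valid: bool,  // e.g. a checksum over the image passes
        version: u32,
    }

    /// Prefer the newest valid slot; fall back to the other if the preferred
    /// one was corrupted (say, by power loss mid-update).
    fn boot_slot(a: &SlotState, b: &SlotState) -> Option<Slot> {
        match (a.valid, b.valid) {
            (true, true) => Some(if a.version >= b.version { Slot::A } else { Slot::B }),
            (true, false) => Some(Slot::A),
            (false, true) => Some(Slot::B),
            (false, false) => None, // unrecoverable without outside help
        }
    }

    fn main() {
        // Slot B was being updated when power was lost and no longer validates.
        let a = SlotState { valid: true, version: 7 };
        let b = SlotState { valid: false, version: 8 };
        println!("booting from {:?}", boot_slot(&a, &b));
    }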

Critically, recovery here is about returning the device to service. There may be data loss, whether that’s of user data, software, firmware, configuration, or something else. As part of [cat-ohf] our goal is to be able to handle these events and restore service without sending someone into the datacenter to replace a part.

Thermal Control Loop

  • Motivations: BPO

The server sled and switches in the rack require a thermal control loop. This is a part of [cat-bpo] as it is required for the hardware to function correctly. Without it, hardware might overheat, which will lead to interruptions in service and possibly component damage. This is something that customers will expect exists.

The accuracy and efficiency of the thermal loop may not be a direct customer requirement; however, if we can meet our needs while running the fans at a slower speed, that’ll leave more power for the rest of the rack and hopefully have a positive impact on the lifetime of the fans and possibly other components.
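
As a toy illustration of the idea (and emphatically not the actual control algorithm), a purely proportional controller might map temperature error to fan duty cycle along these lines:

    /// Illustrative only: real thermal control involves multiple sensors,
    /// integral/derivative terms, failure handling, and hard overtemperature
    /// cutoffs.
    fn fan_pwm_percent(temp_c: f32, target_c: f32, gain: f32) -> f32 {
        const MIN_PWM: f32 = 20.0; // keep some airflow even when cool
        let error = temp_c - target_c;
        (MIN_PWM + gain * error).clamp(MIN_PWM, 100.0)
    }

    fn main() {
        for temp in [55.0_f32, 65.0, 75.0, 85.0] {
            println!("{temp} C -> fan at {:.0}%", fan_pwm_percent(temp, 60.0, 4.0));
        }
    }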

Time

At the end of the day, our customers will want the ability to correlate events with times that they understand. Operators will be told by their users that something went wrong at 13:42 and will want to be able to go through and correlate that.

As part of [cat-bpo], [cat-ohf], and [cat-iup], events will be occurring in the system that we need the ability to correlate and ultimately relate back to times that our customers and support staff will understand. We may be able to get away with relative ordering of events and the amount of time that has elapsed between them, such as the failure of a component relative to the reboot of a server. However, if we cannot relate that back to events that our customers experience and discuss, which will use conventional notions of time (e.g. UTC), then we cannot deliver them a good experience.

In addition, there are places that we need time for more general product operation. Warranties have fixed time frames that we want to be able to communicate to customers. Time is required for TLS certificates that we might be serving to users of the API.

Footnotes
  • 1

    If someone tells you that putting a SATA device behind their SAS expander is bug-free, don’t believe them!
