RFD 116
A Midsummer Night's Metric

Hardware and software are constantly generating a multitude of events, telemetry, and information that can be used to provide insight into every aspect of the running system. Some of it relates to the implementation, such as capacity, performance, and availability; other data correlates more directly with business objectives such as conversions, active users, and more. This RFD surveys different use cases for telemetry in the product.

Its published state will indicate general agreement on the use cases for telemetry. This RFD will not propose any specific architecture or implementation; rather, it will serve as input to a subsequent RFD with that focus.

Telemetry and what is a Metric?

When people use the term metrics, it usually refers to something fairly specific: a quantitative measurement that occurs over some period of time. This measurement will likely have a bunch of metadata around it that describes or distinguishes it. Disk I/O operations per second is a classic example.

Along with a quantitative value, it’s also often useful to record contextual information. For example, when collecting statistics about each HTTP request one may store associated metadata (e.g. API route, source IP, response code). This data can then be sliced, diced, and aggregated (even to the point where discrete events resemble rate information as in the previous example).
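
To make the distinction concrete, the following is a rough illustrative sketch (the type names are invented for this example and are not part of any interface) of a quantitative sample carrying descriptive metadata alongside a discrete HTTP request event, plus the kind of aggregation that makes discrete events resemble rate data:

    use std::collections::BTreeMap;
    use std::net::IpAddr;
    use std::time::SystemTime;

    /// A quantitative measurement taken at a point in time (e.g. disk I/O
    /// operations per second), distinguished by free-form metadata.
    struct Sample {
        timestamp: SystemTime,
        value: f64,
        metadata: BTreeMap<String, String>, // e.g. {"disk": "sda"}
    }

    /// A discrete event with contextual fields, e.g. a single HTTP request.
    struct HttpRequestEvent {
        timestamp: SystemTime,
        route: String,
        source_ip: IpAddr,
        response_code: u16,
    }

    /// Aggregate discrete events by (route, response code); done per time
    /// window, this starts to resemble traditional rate metrics.
    fn count_by_route(events: &[HttpRequestEvent]) -> BTreeMap<(String, u16), u64> {
        let mut counts = BTreeMap::new();
        for e in events {
            *counts.entry((e.route.clone(), e.response_code)).or_insert(0) += 1;
        }
        counts
    }

    fn main() {
        let sample = Sample {
            timestamp: SystemTime::now(),
            value: 1234.0,
            metadata: BTreeMap::from([("disk".to_string(), "sda".to_string())]),
        };
        let events = vec![HttpRequestEvent {
            timestamp: SystemTime::now(),
            route: "/example".to_string(),
            source_ip: "192.0.2.10".parse().unwrap(),
            response_code: 200,
        }];
        println!("sample value: {}", sample.value);
        println!("counts: {:?}", count_by_route(&events));
    }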

These may be built into traces that cross beyond a single application or layer of the stack, allowing one to associate metadata and break up something like a logical disk I/O into its constituent parts: network calls, database lookups, and actual physical disk I/O.

Due to that pre-conceived notion of what’s meant by a metric, we’re opting to use a broader term, 'telemetry'. We don’t want to presuppose a specific type of data or how it is stored and maintained (for example, this doesn’t have the same implications that Prometheus assumes when using the term 'metric'). From a customer perspective, we actually want to be able to interleave this data with other types of information that people don’t traditionally think of as metrics. For example:

  • What version of software was running on the Oxide rack during this time window?

  • Were there any hardware failures that occurred near that time?

  • Was my instance undergoing a live migration?

Categories of capability

For each user archetype or collection of activities we can roughly consider three categories for the capabilities we build.

Meeting Pre-Existing Expectations

From using other systems, our customers are going to expect that they can see basic metrics about different resources in the system and answer questions with them. Take the classic case of a virtual machine: people expect a few simple metrics to be within easy reach, such as information related to CPU, network, and disk usage.

The same is also true in other parts of the product. Many hardware operators expect to be able to see the basic information that is present in the ipmitool sensor output or the equivalent from Dell’s iDRAC or HPE’s iLO. In a similar vein, network teams have traditionally expected that they have access to various SNMP MIBs and other means of describing basic information about network PHYs, link state, throughput, errors, and breakdowns of traffic.

An important note about this category is that this may not always be information that we would commonly describe as actionable. It may be very useful and important to solving problems, or it may not. The point of this category is that there will be certain metrics our customers expect to exist, just because. Items in this category will hopefully overlap substantially with the other categories.

These expectations are generally around metrics. See Expectations Based on Existing Products for more detailed examples and comparisons of what exists between platforms.

Core Requirements

Features in this category reflect those critical to the system’s operation. While there may be significant overlap with the previous category, these are driven purely by the exigencies of the system and the concrete needs of users, not merely their familiarity. We expect the need to educate users to be relatively light, as these features and telemetry should match areas of expected knowledge.

Differentiators

A unique thing that we need to think about with respect to use cases is differentiators: effectively, what can we deliver and execute on because we are co-designing software and hardware?

While it’s important that these remain actionable, this category is meant to showcase things we can do because of our unique position. There are a few different ways and examples for us to consider in this case.

First, let’s walk through the networking visualizations that have been part of the product console demo. This is an example of a differentiator for us that comes from providing both the virtual machine abstraction and the underlying networking fundamentals. When customers see this, there are a few reasons it seems to stand out to them, regardless of whether they find the visualization useful:

  • It’s combining information across multiple disparate instances into one, coherent-ish, view.

  • It’s getting at what their VMs are doing without interposing on the VM or requiring agents inside it to obtain the information.

  • Networking data is often high-volume, and trying to capture and mirror all traffic to process it inevitably leads to delays and an impact on the customer.

From our perspective, it promotes the narrative of hardware and software co-design (even if there isn’t as much of that at play in this particular case). But to me, the key insight is that we’re amalgamating information that normally would be spread out across several disparate organizations that sometimes only tolerate each other on good days.

It’s worth teasing out other views that we can correlate and piece together here that are often difficult because of the above. Our ability to differentiate is going to be where we can cut across multiple layers of the stack and help correlate or at least provide some insight into what’s going on.

A good example of this came up while running a distributed object store at scale at Joyent/Samsung. While the team running the system had access to the software and server-level metrics, there were numerous problems that would often relate to the broader physical network, which was managed by different groups. Oftentimes the service team would see issues, and being able to use data from the network to immediately disprove hypotheses would have helped cleave the search space. For example, because of the network design, there was often a high degree of port and uplink utilization, which would lead the service team to ask questions like:

  • Were any ports close to saturation during that period?

  • Did any of the switches drop packets due to a micro-burst of activity?

  • Was there any kind of network reconfiguration or other activity that might explain why there was a service level flurry of TCP-level retransmits, dropped connections, etc.?

While the above examples are related to the network, this is by no means limited to that. Being able to automate understanding of the entire rack and, more importantly, to correlate that with something actionable, or simply to use it as a way to gain insight (such as power and thermal heat maps), may help us get the product across the line. We always have to be careful, though, to make sure that we surface actionable information and that certain things don’t just become support-call generators.

Customers don’t normally get good visibility into things like PCIe link resets, thermal throttling, correctable error rates and how they lead to retries, and how those might translate to or be responsible for the latency that an instance sees in, say, the storage service. While operators may not want to expose this to all of their users, there will be some who do, and being able to put those different pieces side by side is going to be powerful.

Use Cases

It’s essential to understand the reasons customers care about this data and how they’re applying it so that the product can satisfy their needs.

We describe several higher-level classes of telemetry and some intersecting attributes that they can have. These will help us get a flavor for what we’re trying to do and in some cases tie back to use cases and people that have been described previously (e.g. see [rfd78], [rfd82]).

System is Meeting Expectations

One of the important questions that folks who are running services have, whether that be at the application level or down at the hardware level, is whether or not the system they are running is meeting expectations. A key part of this is often telemetry, metrics, and other information, which may tie back to explicit service-level agreements and objectives (SLAs and SLOs) or, as is just as often the case, to a more nebulous, subjective sense of 'Is it working well?'.

While it’s tempting to discard the latter due to its lack of a rigorous definition, when one is a consumer of a service, that qualitative feeling is often just as important. Just as the network architecture goals section describes that people want the network to feel snappy ([rfd63]), the same is true here. Ultimately, the goal is to help our customers answer the question of whether the system is meeting their expectations and, when it is not, to inform them before their customers do. Historical information is a powerful tool for understanding whether findings are novel and suspicious or merely typical.

The challenge for us is that this isn’t across a single level of the system, but more or less, a question across all of them.

An application team will have questions about whether their applications are meeting expectations, and that will in turn potentially depend on the underlying instances themselves and the ability to understand their performance.

Operators who are managing the rack will care about this at several levels with questions such as:

  • Are the Oxide API services operating correctly and responding in a timely fashion?

  • Are there underlying issues going on that might impact certain customers? Can I proactively notice that something is going on, regardless of who is responsible for the issue, and help make sure that they are prepared?

When we come back to the question of how this information is actionable and how it is used, fundamentally this is about helping our customers indict or acquit different parts of the system so they can focus their attention less on blame and more on resolution. While the question of 'Is it working well?' may seem more qualitative than quantitative, proving that the system is meeting expectations will ultimately require that we have the telemetry to answer it in a quantitative way.

This suggests not only having telemetry that knows when something is bad, but also having ways of knowing that something is good. For example, being able to quickly establish or rule out something like I/O latency or packet loss will be invaluable.

Capacity Planning

We know that Capacity Planning is one of the major features that we have to consider from an operator perspective ([rfd82]). This can be viewed as the logical extension to the system meeting expectations: will it continue to do so? The type of data that we collect for capacity planning may at first blush be different from the type of data that we’re collecting to address other use cases and thus deserves to be called out.

However, as we discuss in Differentiators, what’s going to be equally important is to be able to correlate changes and planning here with potentially other information in the system. While there is the macro level of Capacity Planning, this also ties into the broader per-project and instance level views of their resource utilization and being able to understand how they contribute to that macro view.

Monitoring of internal components

Our software and hardware will fail in expected and unexpected (and inexplicable) ways; what matters most is how we detect and handle that failure. RFD 82 describes some of this as Operational Hardware Failures, though it also applies to software. This isn’t limited just to failure: it also applies to how we generally run (such as SP thermal loops), and it ties into upgrade.

We will, of course, attempt to detect and correct problems automatically, but we will never perfectly achieve this goal. Operators (and our support engineers) will need access to internal telemetry even when their ability to understand and act on it may be limited.

For example, an SSD may be misbehaving in ways below our detection threshold, an internal service may be consuming resources, or a NIC might start dropping packets; in all these cases, users’ workloads may be impacted. Internal telemetry will correlate with observed application behavior. The operator’s ability to intervene will be limited to rebooting servers, rebalancing load, or taking components (e.g. an SSD) out of service. We need to inform those interventions with data that can be clearly interpreted.

We’re going to need to store telemetry that covers a large range of activity (a rough sketch of how such heterogeneous records might sit side by side follows the list), including:

  • Hardware utilization, saturation, and errors

  • Underlying device and sensor readings

  • Component revisions that are active at the time

  • Occurrences of events such as link resets, speed renegotiations, etc.

  • Software level errors, resets, etc.
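
As a purely illustrative data-model sketch, and not a proposed schema, one way to picture these heterogeneous kinds of internal telemetry is as timestamped records in a single stream, so that diagnosis engines and support tooling can correlate them; every name below is hypothetical:

    use std::time::SystemTime;

    /// Hypothetical record kinds; this is not a proposed schema.
    enum TelemetryRecord {
        /// A numeric reading from a device or sensor (temperature, fan speed, ...).
        SensorReading { sensor: String, value: f64, unit: String },
        /// The component revision that was active at the time.
        ComponentRevision { component: String, revision: String },
        /// A discrete hardware event such as a link reset or speed renegotiation.
        HardwareEvent { component: String, description: String },
        /// A software-level error or reset in one of our own services.
        SoftwareError { service: String, message: String },
    }

    /// Every record carries a timestamp and a source so that it can later be
    /// correlated with other activity on the same server, switch, or service.
    struct TimestampedRecord {
        timestamp: SystemTime,
        source: String,
        record: TelemetryRecord,
    }

    fn main() {
        let record = TimestampedRecord {
            timestamp: SystemTime::now(),
            source: "sled-07".to_string(), // hypothetical source identifier
            record: TelemetryRecord::HardwareEvent {
                component: "nic0".to_string(),
                description: "link reset".to_string(),
            },
        };
        if let TelemetryRecord::HardwareEvent { component, description } = &record.record {
            println!("{}: {} {}", record.source, component, description);
        }
    }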

It can be useful to break this into a couple of different primary use cases:

  1. Active diagnosis engines. There are going to be a large number of cases where we are going to want to diagnose that hardware or software is potentially faulty and take action at a rack and, eventually, AZ level. To make that a reality, we need to feed this telemetry into those diagnosis engines, which may be operating in the context of a single server or perhaps at an even higher level.

  2. Information that ties into Debugging, isn’t part of an immediate diagnosis engine, or is otherwise gathered in support of the other use cases or to help us correlate activity.

An important callout here is the need to instrument all of our own services, whether that be Nexus request latency, underlying database performance, or execution properties of the storage and networking subsystems. This was a major point raised in recent conversations with a potential customer ([customer-20201110]): they really want to be able to rely on the product itself to help them know whether it is meeting expectations.

Debugging

At Oxide, we believe in the importance of building a debuggable product. We know that we’re going to see failures in the field, unknown problems, and that some of these will, of course, be unreproducible. Our overarching goal is to be able to debug from the first failure that a customer sees.

This suggests that we’re going to need to take swings and guesses at what telemetry we want to be collecting and make sure that we have it, even if it’s not something that’s always exposed directly to customers. As an example, for many parts of the control plane we’re likely going to need basic telemetry and metrics around request rates, errors, and latency distributions. Higher-level telemetry that is already meaningful to our customers and helps us focus our debugging is not only necessary to have, but useful to expose directly to customers. However, we want to avoid customers latching onto telemetry that may not be actionable or, worse, that just becomes a call generator, possibly for bad reasons.
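
A minimal sketch of the kind of per-endpoint control plane telemetry described above might track request counts, error counts, and a coarse latency histogram; this is not the control plane’s actual instrumentation, and the bucket boundaries below are arbitrary:

    use std::collections::BTreeMap;
    use std::time::Duration;

    #[derive(Default)]
    struct EndpointStats {
        requests: u64,
        errors: u64,
        // Cumulative latency buckets: upper bound in microseconds -> count of
        // requests at or below that bound. Requests above the largest bound
        // are counted only in `requests`.
        latency_buckets: BTreeMap<u64, u64>,
    }

    impl EndpointStats {
        fn record(&mut self, latency: Duration, status: u16) {
            self.requests += 1;
            if status >= 500 {
                self.errors += 1;
            }
            let micros = latency.as_micros() as u64;
            // Arbitrary bucket bounds for illustration: 1ms, 10ms, 100ms, 1s.
            for bound in [1_000, 10_000, 100_000, 1_000_000] {
                if micros <= bound {
                    *self.latency_buckets.entry(bound).or_insert(0) += 1;
                }
            }
        }
    }

    fn main() {
        let mut stats = EndpointStats::default();
        stats.record(Duration::from_millis(3), 200);
        stats.record(Duration::from_millis(250), 500);
        println!(
            "requests={} errors={} buckets={:?}",
            stats.requests, stats.errors, stats.latency_buckets
        );
    }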

While the set of things here is ultimately more nebulous, it’s going to be one of the most important use cases that we have and will be central to creating a great support experience. For example, we are going to want to be able to collect a support bundle of sorts that has information about what failed and why, and that includes the additional information we might need to diagnose further without going back and forth with the customer over repeated support interactions.

Product Iteration

Once we ship, our commitment to our customers is that the product will continuously improve (via software updates). Our opportunity (and our commitment to investors) is to make subsequent hardware releases even more capable. We’ll execute both most precisely and effectively when armed with data about the behavior of the product.

We know the product will fail in unique ways in the field, as described in Debugging, and we are always going to uncover and hypothesize new ways of correlating issues. To that end, being able to go back over all the accumulated data that the racks have generated lets us test hypotheses and deliver software updates that do a better job of performing diagnosis or knowing what to highlight to customers.

This will also elucidate hardware gaps and opportunities. For example, if we see a high failure rate for a given part, or one that is correlated with specific conditions, we can use that to change suppliers or some part of the electro-mechanical design in future revisions of the hardware. We might decide that version 2 needs SSDs with higher endurance, or that we can do with much less. Under-used resources might inform both more optimized ratios of CPU/DRAM/disk and product features that let customers target those unproductive resources.

Closing the loop more completely: gathering and analyzing data to inform iteration of the product will itself inform iteration on the telemetry we collect.

Expectations Based on Existing Products

Regardless of our opinions about the best way for our customers to operate the Oxide rack, they will have expectations accumulated from experience with other products in the broader ecosystem. While we need not be obeisant to industry norms (indeed, ignoring them is our raison d’être), it’s worth considering the examples below to understand customer expectations.

Cloud Portals

Many of our customers will be used to using various public clouds today. While there are often dedicated services in the form of Amazon’s CloudWatch or Google’s Cloud Monitoring, there are a number of basic metrics that are presented on a per-VM basis in the various platform overview pages. This is summarized in the following table:

Table 1. Cloud Provider Default per-Instance Metrics
Provider      | Duration | CPU         | Network         | Disk         | Other
AWS           | ?        | Utilization | Bytes / Packets | Bytes / IOPS | CPU bursting, status checks, CloudWatch
Azure         | 30 days  | Utilization | Bytes           | Bytes / IOPS | -
Digital Ocean | 14 days  | Utilization | Bytes           | Bytes        | More with agent
GCP           | 30 days  | Utilization | Bytes / Packets | -            | Separate query builder to show other metrics
Linode        | Lifetime | Utilization | Bytes           | Blocks       | -

While there are more advanced services and agents available for most of these, the main highlight is a surprisingly uniform set of basic metrics for a given instance (sketched as a data structure after the list):

  • CPU Utilization (generally not broken out between user/kernel)

  • Network Bytes and Packets on a per-interface basis

  • Disk Bytes and IOPS on a per-device basis
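
A minimal sketch of that lowest-common-denominator per-instance metric set might look like the following; the field names are hypothetical and nothing here is a proposed interface:

    /// Hypothetical shape of the basic per-instance metrics listed above.
    struct InstanceMetrics {
        /// CPU utilization as a percentage; providers generally do not break
        /// this out between user and kernel time.
        cpu_utilization_pct: f64,
        /// Network counters, reported per interface.
        interfaces: Vec<InterfaceCounters>,
        /// Disk counters, reported per device.
        disks: Vec<DiskCounters>,
    }

    struct InterfaceCounters {
        name: String,
        bytes_in: u64,
        bytes_out: u64,
        packets_in: u64,
        packets_out: u64,
    }

    struct DiskCounters {
        name: String,
        bytes_read: u64,
        bytes_written: u64,
        read_ops: u64,
        write_ops: u64,
    }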

These are also the basic metrics found in common developer tools. Software like virt-manager has the ability to display basic CPU, disk, and network I/O graphs. This goes to show the ubiquity of these measurements, regardless of whether or not they are actionable or regularly used.

Server Management

Many operators are used to being able to pull telemetry from the underlying hardware and use it in various ways, some actionable and some not. There are myriad ways to get at this data, ranging from vendor-specific interfaces to common wrappers (IPMI, Redfish) around vendor-specific data. Just as Redfish roughly describes a schema, but not how any vendor should fill in that schema, the same is true of IPMI. While there are rough entity types, there is no good programmatic way to map entities to things that are known in the system.

If we look at the sensor records used across common Dell, HPE, and SuperMicro systems, we’ll see that there’s a collection of common sensor readings present:

  • Fan Speed

  • Inlet and Exhaust Temperature

  • CPU Temperature

  • Power Consumption

  • DIMM Temperatures

  • Voltage and Current readings (though the part isn’t always specified), often on a per-VRM basis.

  • General system status or device presence (varies dramatically by the vendor)

  • Possible disk device temperature

The main takeaway is that there are a number of pieces of telemetry that are easily exposed in these systems and that operators may be used to seeing and capturing, even if they’re not immediately actionable for them.

User Interaction

Customers will access telemetry directly and indirectly through a few different interfaces.

API Access

As with other customer-facing aspects of the system, telemetry will be fully visible in our API.

We can lean on the API for any use cases we hope to defer. Integrations, reports, archiving: all of these we may choose to defer (possibly indefinitely), but the API provides an escape route for customers in the meantime.

Web Console

The Web Console is one of the main ways that we expect users to engage with our system. Just as the API access described above is important, we cannot assume that customers will want to build integrations and pull telemetry into their existing systems. There is going to be a lot of power and value in being able to get quick validation through the UI and to see graphs showing that activity and changes are happening.

We’re going to want to consider several different pieces here. It should be easy to get quick overviews of basic instance, server, and rack health, with associated graphs and visualizations in the corresponding views. For example, it should be easy to go from an instance to the basic telemetry for that instance. Similarly, when viewing one of the Sidecar switches, it should be easy to get to its telemetry.

We need to consider the tradeoffs in the work required for richer views. For example, being able to overlay data, change the time frame, and build other views such as heatmaps are all going to provide a lot of value, but they require corresponding work.

Timeliness

When working through the Web UI, there is going to be some telemetry and some questions that we can answer in a timely fashion and others that might require more batch processing. As an example, it should probably be relatively fast to answer a question like an instance’s, server’s, or other resource’s base utilization. Similarly, getting the history of events for a single resource shouldn’t be too bad.

However, there are going to be questions and computations that we won’t be able to index in advance and that may devolve into large scans of entire datasets, which will happen when we try to join data together. An example of this kind of question is 'What model of hard drive gets hottest in the hottest location in all of the different server designs, and what is that average temperature?' ([jc-monarch]).

Being up front with users about our ability to answer this kind of question, while making clear that it won’t be a soft real-time operation, helps us properly set expectations and gives us a bit more latitude in the solution space. Striking the balance between what can be returned to a user in a timely fashion versus what takes more processing will sometimes be tricky, but doing so should be helpful, especially as we evolve the product over time and can make more things timely.

Dashboards

Depending on someone’s role and what they care about ([rfd78]), there will be different views that matter to them. While pre-canned dashboards of telemetry are useful, allowing our users to create their own and tailor them to their needs gives us additional flexibility and lets us work with customers to figure out the views that matter to them without requiring software updates.

Alerts and Thresholds

Another core part of the product that we need to consider is how we create alerts based on telemetry, and the related question of thresholds. Just as we’re going to want to generate alerts when a hardware device fails, or send a notification on, say, a password change, we’re going to want to do the same when telemetry indicates that we’ve crossed some threshold.
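
As a minimal sketch, assuming nothing about the eventual alerting design, a threshold might pair a bound with a requirement that it be exceeded for several consecutive samples so that a single noisy reading doesn’t fire an alert; all names below are hypothetical:

    /// Hypothetical threshold rule over a single metric.
    struct ThresholdRule {
        metric: String,     // e.g. "disk_used_pct"
        bound: f64,         // fire when samples exceed this value...
        consecutive: usize, // ...for this many samples in a row
    }

    /// Returns true when the most recent `consecutive` samples all exceed
    /// the bound.
    fn should_fire(rule: &ThresholdRule, recent_samples: &[f64]) -> bool {
        recent_samples.len() >= rule.consecutive
            && recent_samples[recent_samples.len() - rule.consecutive..]
                .iter()
                .all(|v| *v > rule.bound)
    }

    fn main() {
        let rule = ThresholdRule {
            metric: "disk_used_pct".to_string(),
            bound: 90.0,
            consecutive: 3,
        };
        let samples = [85.0, 91.0, 92.5, 93.0];
        println!("{} fires: {}", rule.metric, should_fire(&rule, &samples));
    }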

External Oxide Service

Where possible, it would be incredibly valuable to be able to store telemetry from all customer systems in an Oxide-run service. This service might run on public cloud infrastructure or might be a place where we’d use our own product in a business-critical function.

There are a couple of disparate benefits of this for us and customers:

  • If customers want to keep the storage of this data separate from their own environment, this provides a means of preserving it, particularly in the early days.

  • Support staff may be able to get at this data, making it easier for us to pre-emptively figure out what’s going on without asking the customer for more information.

  • It helps us investigate additional diagnosis hypotheses and determine future product directions.

There are some caveats and security considerations here. There will always be customers who will refuse to let data leave their environment. Similarly, there are security risks that could arise from having this data centralized in one Internet-accessible location.

Scope for V1

While there are a number of things that we could try to do in this space, it’s just as important for us to actually cleave out and be up front about what we’re not doing as part of our initial planning.

General Purpose Customer Application Metrics

We are not trying to build a general-purpose metrics collection interface where our customers can upload their own application-level metrics. While viewing application-related metrics side-by-side with the lower-level platform metrics is a powerful tool (much for the same reasons that we want to be able to do so ourselves, as laid out in Use Cases), it is not something that we are going to try to build today.

There are a couple of reasons for deferring this feature in v1 of the product, and likely for some time after that:

  • Customers likely already have existing systems that they use to consolidate application-level metrics. There are many different software options that have been built around specific languages and/or run-time environments, and there are numerous commercial offerings that folks have already invested in (e.g. New Relic, Honeycomb, etc.).

  • We do not assume that a single customer application will be entirely located in an Oxide environment at this phase. Critically, due to the focus on large Fortune 500 customers in our initial product ([rfd78]), it is almost a lock that parts of their applications will exist outside the Oxide environment.

Indefinite Storage in the Rack

A single rack is inherently finite. If we could overcome the physics of storing more data than there is capacity in the rack, we’d be in a different line of business. As a result, it’s important that we don’t try to design something that attempts to enable this inside the rack. Instead, we should have an explicit time frame for which we keep fine-grained telemetry, which should probably be in the range of 30-90 days (based on the cloud provider defaults).
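
Some back-of-the-envelope arithmetic makes the point; every input below is an assumption chosen only to illustrate scale, not a product number:

    fn main() {
        // All of these figures are assumptions used only for illustration.
        let servers: u64 = 32; // assumed sleds per rack
        let series_per_server: u64 = 10_000; // assumed distinct time series
        let sample_interval_secs: u64 = 10; // assumed collection interval
        let bytes_per_sample: u64 = 16; // timestamp + value, before compression

        let samples_per_series_per_day = 86_400 / sample_interval_secs;
        let bytes_per_day =
            servers * series_per_server * samples_per_series_per_day * bytes_per_sample;

        // Roughly 41 GiB/day under these assumptions, or a few TiB over 90
        // days, before contextual events, replication, or indexes.
        let gib = |b: u64| b as f64 / (1u64 << 30) as f64;
        println!(
            "~{:.0} GiB/day, ~{:.0} GiB over 90 days",
            gib(bytes_per_day),
            gib(bytes_per_day * 90)
        );
    }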

Not bounding retention in this way would lead us down the path of always trying to compress data while still having it grow without bound (though perhaps with a slower growth slope). While there is something appealing to customers about being able to view and compare historical telemetry, and this is an area we should invest in so we can eventually do things like compare year-over-year trends, we need to pick a point at which to draw the line for v1.

Conversely, this suggests a few different ways that users will want to interact with the system, which are discussed in API Access and External Oxide Service.

Third-Party Integrations

While there are a number of different formats that we could export data into, or ways we could wire things up to push data into customers’ existing systems, we should not do this in the initial version of the product. Critically, there are going to be many, many different types of services and formats. With our initial ten customers, we’d be lucky if there were only 11 systems that we needed to interface with.

Customers will have existing metrics systems that they use, and these will likely run the gamut of formats and tools. While some folks may be using Prometheus and Grafana, we’ll probably see someone using the venerable combination of Cacti and rrdtool, and of course several home-grown systems that have been built up over the years and are tailored to our customers’ specific needs.

This implies that customers will have to use the API we provide and do whatever transformations are necessary for it to make sense for their own systems.
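
As an illustration of what such a customer-side transformation might look like, a small adapter could render results pulled from the API into, say, the Prometheus text exposition format; the record shape and metric name below are invented for the example, since this RFD defines neither an API nor an export format:

    /// Hypothetical shape of a per-instance CPU sample as a customer-side
    /// adapter might deserialize it from the API.
    struct InstanceCpuSample {
        instance_id: String,
        timestamp_ms: i64,
        cpu_utilization_pct: f64,
    }

    /// Render samples in the Prometheus text exposition format, one of many
    /// formats a customer-side adapter might choose to emit.
    fn to_prometheus_text(samples: &[InstanceCpuSample]) -> String {
        let mut out = String::new();
        for s in samples {
            out.push_str(&format!(
                "instance_cpu_utilization_percent{{instance_id=\"{}\"}} {} {}\n",
                s.instance_id, s.cpu_utilization_pct, s.timestamp_ms
            ));
        }
        out
    }

    fn main() {
        let samples = vec![InstanceCpuSample {
            instance_id: "inst-1234".to_string(),
            timestamp_ms: 1_600_000_000_000,
            cpu_utilization_pct: 42.5,
        }];
        print!("{}", to_prometheus_text(&samples));
    }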

References