RFD 162
Metrics collection architecture and design

RFD 162 covers the architecture of the metric data collection system within an Oxide rack. We’re trying to understand where data is produced, how that data is collected, and how metadata about which metrics exist is communicated to both internal and external Oxide software.

Background

This RFD assumes some background that can be found in previous RFDs. [rfd116] covers metrics in general; [rfd125] describes our choice to use ClickHouse (CH) as the database for our metrics; and [rfd161] describes how our data is modeled in general and within CH specifically.

Given the early state of this RFD, please refer to the Open Questions. These highlight some of the major tensions and areas where early thoughts may help direct or focus further discussion and experimentation.

Terminology

The terminology from the data model discussion in [rfd161] is assumed, but is reviewed here. A target is a named, schematized source of data; for example, a software process. A metric is a named, schematized stream of measurements; for example, total number of requests a server has received, by path, method, and response code. The schema for both is defined by fields, which have a name and type. A specific sequence of field values defines a single timeseries, with its ID specified by converting each field value to a string, and concatenating them with ":". A measurement is a single sample from a timeseries.
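As a minimal sketch of the timeseries ID rule above, the following Python joins stringified field values with ":". The `Field` type, the example field names, and the choice to place target fields before metric fields are assumptions made for illustration; they are not taken from [rfd161].

```python
from dataclasses import dataclass
from typing import Iterable, Union

FieldValue = Union[str, int, bool]


@dataclass(frozen=True)
class Field:
    """A single named field value (hypothetical type, for illustration only)."""
    name: str
    value: FieldValue


def timeseries_id(target_fields: Iterable[Field], metric_fields: Iterable[Field]) -> str:
    """Convert each field value to a string and concatenate them with ':'."""
    return ":".join(str(f.value) for f in (*target_fields, *metric_fields))


# Hypothetical target (a software process) and metric (request count).
target = (Field("name", "nexus"), Field("instance", 0))
metric = (
    Field("path", "/instances"),
    Field("method", "GET"),
    Field("response_code", 200),
)
print(timeseries_id(target, metric))  # prints "nexus:0:/instances:GET:200"
```

Note that under this rule a field value that itself contains ":" would make the ID ambiguous, so a real implementation would likely need to escape or forbid that character.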

Producer

A producer is a software process that generates measurements. The measurements may derive from multiple targets and metrics. Note that while the producer itself is software, it may be collecting the underlying data from hardware such as a sensor, drive, NIC, etc. The measurements themselves may also be generated entirely by software; e.g., vCPU utilization information.

It’s important to note that a producer may not be the original source of the measurement data. Some metrics will likely be proxied from other sources, and this system must be flexible enough to support this.

Collector

A collector is a software process that ingests metrics from one or more producers. The main intent here is for insertion into our timeseries database, ClickHouse, but one could imagine other processing occurring in or around the collector (alerting, perhaps).

The proposed collection system is called oximeter.

Assignment of producers to collectors

There are many ways to assign producers to collectors. In the interest of expediency, we adopt a simple model here: each producer is assigned to one or more collectors (more than one to improve availability). Nexus is responsible for this initial assignment, and records it in the control plane database.
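One simple way Nexus might make the assignment described above is round-robin. The sketch below is only an illustration: the function name, the `replicas` parameter, and the round-robin policy itself are assumptions, not a decision of this RFD.

```python
def assign_producers(producer_ids, collector_ids, replicas=1):
    """Map each producer ID to `replicas` distinct collector IDs, round-robin.

    Assumes `collector_ids` contains no duplicates.
    """
    if not 1 <= replicas <= len(collector_ids):
        raise ValueError("replicas must be between 1 and the number of collectors")
    n = len(collector_ids)
    return {
        # Producer i starts at collector i (mod n) and takes the next
        # `replicas` collectors, wrapping around the list.
        pid: [collector_ids[(i + k) % n] for k in range(replicas)]
        for i, pid in enumerate(producer_ids)
    }


assignment = assign_producers(["p0", "p1", "p2"], ["c0", "c1"], replicas=2)
```

With two collectors and `replicas=2`, every producer is assigned to both collectors, which corresponds to the "multiple to help availability" case.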

Collection architecture

In broad terms, collection can be pull, in which the collector explicitly polls producers for metrics, or push, in which producers send data to the collector. Both have tradeoffs. A push model allows producers to decide when they wish to generate metrics, rather than being interrupted at a potentially inconvenient time by the collector. A pull model may make it simpler to do tasks like alerting — if you can’t reach the producer, that’s a problem. Pull may also be more secure. For example, if an attacker were able to replace a hardware device, we might more readily detect that fact if we poll for its firmware revision regularly, rather than waiting for the device or its managing software to send the revision to the collector.

Though we’re allowing for future changes, we intend to follow a pull-based collection model at this point. In general, producers let Nexus know what data can be collected from them. Nexus assigns each producer to some number of oximeter instances, along with the interval on which to poll each producer. The oximeter instances may cache their producers locally, and Nexus may also modify their assigned producers.
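To make the per-producer polling interval concrete, here is a small sketch of how a collector could order its polls when each producer has its own interval. It simulates time rather than sleeping or making network requests, and the function name and the priority-queue approach are assumptions for illustration.

```python
import heapq


def poll_order(intervals, until):
    """Return (due_time, producer_id) pairs for every poll due in [0, until].

    `intervals` maps a producer ID to its polling interval in seconds.
    """
    if any(interval <= 0 for interval in intervals.values()):
        raise ValueError("polling intervals must be positive")
    # Every producer is due for its first poll at time 0.
    heap = [(0.0, pid) for pid in sorted(intervals)]
    heapq.heapify(heap)
    schedule = []
    while heap and heap[0][0] <= until:
        due, pid = heapq.heappop(heap)
        schedule.append((due, pid))
        # Re-arm the producer for its next poll.
        heapq.heappush(heap, (due + intervals[pid], pid))
    return schedule


schedule = poll_order({"a": 10, "b": 15}, until=30)
```

A real collector would replace the simulated clock with timers and issue a request to each producer's collection endpoint when its entry comes due.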

Initialization

Nexus is assumed to be running. When a producer starts, it registers with Nexus the list of targets and metric schema that it produces. Each producer also specifies a default collection interval, and an endpoint at which the metrics may be collected.

Nexus verifies that the received schema match what exists in ClickHouse. How Nexus updates the stored schema, or returns an error, when they do not match is still TBD. Nexus also assigns an ID to the producer.
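The schema check described above has three possible outcomes: the schema is new, it matches what is stored, or it conflicts. The sketch below illustrates only that classification. Representing a schema as a mapping from field name to type name is an assumption, and what to do on a conflict is deliberately left out because the RFD leaves it undecided.

```python
from enum import Enum


class SchemaCheck(Enum):
    NEW = "new"            # no schema stored under this name yet
    MATCH = "match"        # stored schema is identical
    CONFLICT = "conflict"  # stored schema differs; handling is undecided


def check_schema(known, name, fields):
    """Compare a registered schema against the stored one of the same name.

    `known` maps schema name -> {field name: type name}.
    """
    existing = known.get(name)
    if existing is None:
        return SchemaCheck.NEW
    return SchemaCheck.MATCH if existing == fields else SchemaCheck.CONFLICT


# Hypothetical stored schema for a request-count metric.
known = {"requests": {"path": "string", "method": "string", "response_code": "i64"}}
```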

When oximeter starts, it also registers with Nexus. Nexus assigns the oximeter instance a collector ID; a list of producers to poll; the schema for the metrics that each producer returns; and the interval on which to poll them. oximeter provides interfaces for modifying the endpoint, schema, or interval of an existing producer, and for adding or removing a producer entirely.

Assuming Nexus is available has some implications. See Open Questions for a discussion.

oximeter caching

oximeter instances should cache their list of producers in local storage. Ideally, this storage is durable, but it should at least survive a restart of oximeter itself. A temporary filesystem may be sufficient, but on-disk might be preferable to survive restart of a sled.
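As one possible shape for this cache, the sketch below persists the producer list as a JSON file and replaces it atomically, so a crash mid-write leaves the previous copy intact. The file format, function names, and record layout are assumptions for illustration.

```python
import json
import os
import tempfile


def save_producers(path, producers):
    """Atomically write the producer list to `path` as JSON."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then rename over the
    # target; os.replace is atomic when both paths are on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(producers, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to stable storage first
        os.replace(tmp_path, path)
    except Exception:
        os.unlink(tmp_path)
        raise


def load_producers(path):
    """Return the cached producer list, or an empty dict if no cache exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}
```

Whether `path` lives on a temporary filesystem or on disk determines whether the cache survives only an oximeter restart or also a sled restart.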

Duplicate collectors

To provide some availability, Nexus may choose to assign one producer to more than one oximeter instance. Importantly, each of these should be backed by a different ClickHouse instance, so each is entirely independent.
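The independence requirement above can be expressed as a constraint on assignment: no two collectors chosen for the same producer may share a ClickHouse instance. A minimal sketch, in which the function name and the collector-to-database mapping are hypothetical:

```python
def pick_independent_collectors(collector_db, replicas):
    """Choose `replicas` collectors, each backed by a different ClickHouse instance.

    `collector_db` maps collector ID -> ID of the ClickHouse instance behind it.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    chosen, used_dbs = [], set()
    for collector_id in sorted(collector_db):
        db = collector_db[collector_id]
        if db in used_dbs:
            continue  # would share a database with an earlier choice
        chosen.append(collector_id)
        used_dbs.add(db)
        if len(chosen) == replicas:
            return chosen
    raise ValueError("not enough independent ClickHouse instances")


# "ox-a" and "ox-b" share a database, so only one of them may be chosen.
collector_db = {"ox-a": "ch-0", "ox-b": "ch-0", "ox-c": "ch-1"}
chosen = pick_independent_collectors(collector_db, replicas=2)
```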

Interfaces

Nexus interfaces

Nexus must provide a variety of public and private interfaces for controlling metrics collection. This includes knobs for registering producers and collectors; listing the metrics available; and modifying the collection interval for the metrics.

To the extent that the interfaces below are made public (or even if they’re not), we may need to prefix every route with /projects/{project_id}/.

/metrics/producers

Visibility: Private

POST Register with Nexus as a new producer. The request must contain:

  • A single IP address at which data for this producer can be collected

  • A default collection interval

  • A list of (targets, metrics) available with their schema

In response, Nexus generates a producer_id which uniquely identifies this producer. This ID is used in the collection endpoint, described below.
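The registration exchange above might look roughly like the following on the Nexus side. This is a sketch under stated assumptions: the request key names (`address`, `interval_secs`, `timeseries`), the use of a random UUID as the producer_id, and the in-memory `registry` standing in for the control plane database are all hypothetical.

```python
import ipaddress
import uuid


def register_producer(registry, request):
    """Validate a registration request and return a newly generated producer_id."""
    # ip_address raises ValueError if the address is not valid IPv4 or IPv6.
    address = str(ipaddress.ip_address(request["address"]))
    interval_secs = float(request["interval_secs"])
    if interval_secs <= 0:
        raise ValueError("default collection interval must be positive")
    if not request["timeseries"]:
        raise ValueError("at least one (target, metric) pair is required")
    producer_id = str(uuid.uuid4())
    registry[producer_id] = {
        "address": address,
        "interval_secs": interval_secs,
        "timeseries": list(request["timeseries"]),
    }
    return producer_id


registry = {}
producer_id = register_producer(
    registry,
    {
        "address": "fd00::1",
        "interval_secs": 10,
        # Each entry pairs a target schema with a metric schema.
        "timeseries": [
            {
                "target": {"name": "string"},
                "metric": {"path": "string", "method": "string"},
            }
        ],
    },
)
```

The returned `producer_id` is the value the producer would then use in its collection endpoint.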

/metrics/producers/{producer_id}

Visibility: Private

PATCH Update information for a producer, such as its exported metrics, IP address for the collection endpoint, or a default interval.

DELETE De-register as a producer. This interface is intended to be used to notify Nexus and a collector to no longer expect the producer’s collection endpoint to be available, and that this should be a normal situation, rather than generating an alert of some kind.

/metrics/collectors

Visibility: Private

POST Register with Nexus as a new collector. The request must contain an IP address that Nexus can use for further communication.

In response, Nexus generates a collector_id which uniquely identifies this collector. This ID is used in later requests from Nexus to control the collection oximeter is responsible for.

/metrics/target-schema

Visibility: Private, possibly public.

GET Return the list of current target schema, maybe with information about the producer that generates them.

/metrics/metric-schema

Visibility: Private, possibly public.

GET Return the list of current metric schema, maybe with information about the producer that generates them.

/metrics/timeseries-schema

Visibility: Private, possibly public.

GET Return the list of current timeseries schema. This is really the list of (target_schema, metric_schema) pairs, and so may be redundant with the above two interfaces, or not very actionable.

/metrics/targets

Visibility: Private, possibly public.

GET Return the list of currently-available actual targets (not their schema).

/metrics/metrics

Visibility: Private, possibly public.

GET Return the list of currently-available actual metrics (not their schema).

/metrics/timeseries

Visibility: Private, possibly public.

GET Return the list of currently-available actual timeseries.

The goal was to list all available timeseries from ClickHouse, but that sounds low-priority, and possibly not the right way to get this information.

Producer interfaces

/metrics/{producer_id}

Visibility: Private

GET Return the currently-available metrics from a producer. This returns the latest data for all metrics that the producer generates. The producer uses the producer_id assigned by Nexus during registration.
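A producer serving this endpoint only needs to remember the most recent measurement for each timeseries. The sketch below shows that bookkeeping without any real HTTP server; the class name, the JSON response shape, and the `handle_get` helper are assumptions for illustration.

```python
import json
import time


class LatestMetrics:
    """Keep the latest measurement per timeseries and serve it on request."""

    def __init__(self, producer_id):
        self.producer_id = producer_id
        self._latest = {}

    def record(self, timeseries_id, value, timestamp=None):
        """Store a measurement, overwriting any older one for the same timeseries."""
        if timestamp is None:
            timestamp = time.time()
        self._latest[timeseries_id] = {"timestamp": timestamp, "value": value}

    def handle_get(self, path):
        """Return (status, body) for a GET request on `path`."""
        if path != f"/metrics/{self.producer_id}":
            return 404, ""
        return 200, json.dumps(self._latest, sort_keys=True)


producer = LatestMetrics("abc123")
producer.record("nexus:0:/instances:GET:200", 41, timestamp=1.0)
producer.record("nexus:0:/instances:GET:200", 42, timestamp=2.0)
status, body = producer.handle_get("/metrics/abc123")
```

Because only the latest sample is kept, any measurements recorded between two polls are lost in this sketch; whether that is acceptable depends on the metric type.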

oximeter interfaces

/producers

Visibility: Private

GET Return the list of producers this oximeter instance pulls from.

PUT Set the list of producers this oximeter instance pulls from.

/producers/{producer_id}

Visibility: Private

GET Return information about the given producer (endpoint, metrics, and interval).

PATCH Update the information for this producer, such as its endpoint, the metrics to expect from it, or their collection interval.

DELETE Remove this producer from the oximeter instance, i.e., tell oximeter to stop collecting from it.
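The oximeter interfaces above amount to a small set of operations on the collector's producer list. A minimal in-memory sketch, in which the class, method, and key names are hypothetical and a missing producer is signalled by KeyError (which an HTTP layer would presumably map to a 404):

```python
class ProducerRegistry:
    """In-memory model of the producers one oximeter instance pulls from."""

    PATCHABLE = {"endpoint", "metrics", "interval_secs"}

    def __init__(self):
        self._producers = {}

    def list_all(self):
        """GET /producers"""
        return {pid: dict(info) for pid, info in self._producers.items()}

    def replace_all(self, producers):
        """PUT /producers: replace the whole producer list."""
        self._producers = {pid: dict(info) for pid, info in producers.items()}

    def get(self, producer_id):
        """GET /producers/{producer_id}"""
        return dict(self._producers[producer_id])

    def patch(self, producer_id, changes):
        """PATCH /producers/{producer_id}: update only the given keys."""
        unknown = set(changes) - self.PATCHABLE
        if unknown:
            raise ValueError(f"cannot patch: {sorted(unknown)}")
        self._producers[producer_id].update(changes)

    def delete(self, producer_id):
        """DELETE /producers/{producer_id}: stop collecting from this producer."""
        del self._producers[producer_id]


registry = ProducerRegistry()
registry.replace_all(
    {"p0": {"endpoint": "fd00::1", "metrics": ["requests"], "interval_secs": 10}}
)
registry.patch("p0", {"interval_secs": 30})
```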

Determinations

Pull-based collection

We’re making the determination to follow a pull-based collection model for now. The oximeter collector will poll producers for data, rather than producers sending data to oximeter. This isn’t to say that push-based collection is precluded by the design here, and we’re trying to keep that possibility open in future work.

Collector caching

oximeter should locally cache information about the producers from which it pulls. Ideally this cache would be durable, but it must at least survive a restart of oximeter itself.

Open Questions

Reliance on Nexus

The process described in Initialization requires that Nexus (and thus CockroachDB) is already running and reachable over the network. This implies that this collection system cannot be used to collect telemetry data prior to a trust quorum being established, or if the producer is otherwise partitioned from Nexus.

Collecting telemetry and metric data prior to a quorum, during sequences like cold start, is probably very useful for debugging those early phases of rack operation. (More so in early versions of the product.) How do we collect that data, and where is it stored? Is it buffered and made available once the quorum is achieved?
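To make the buffering option in the question above concrete, the sketch below shows one possibility: a bounded in-memory buffer on the producer that keeps the most recent samples and is drained once Nexus becomes reachable. This is purely illustrative and not a proposal of this RFD; the class name, the capacity policy (drop oldest), and the drain step are all assumptions.

```python
from collections import deque


class PreQuorumBuffer:
    """Hold the most recent samples until they can be handed to a collector."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest item when full.
        self._samples = deque(maxlen=capacity)
        self.dropped = 0  # how many samples were lost to the capacity limit

    def push(self, sample):
        if len(self._samples) == self._samples.maxlen:
            self.dropped += 1
        self._samples.append(sample)

    def drain(self):
        """Return all buffered samples, oldest first, and empty the buffer."""
        samples = list(self._samples)
        self._samples.clear()
        return samples


buffer = PreQuorumBuffer(capacity=3)
for sample in range(5):
    buffer.push(sample)
```

Tracking `dropped` lets the producer report that early data was lost, which may itself be useful when debugging cold start.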

External interfaces

This RFD focuses on the interfaces used inside the control plane to collect metrics from producers. How do external clients, such as dashboards in the console, operate on the primitives discussed here? The Nexus endpoints at /metrics/producers and /metrics/metrics are probably important levers for managing the collection: clients express the intent to capture this set of data on some interval, and Nexus arranges for an oximeter instance to do so.

Another obvious open area is how the metric data itself actually makes it to the client. Querying is intentionally out of scope for this RFD, but we should ensure that these interfaces support and enable (and definitely don’t hinder) the mechanisms used to query and send data to customers.

Schema updates

How does this system handle updates to the metric schema? This relies in part on how the data is stored, which is explored in [rfd161]. It also requires some amount of consistency, and the right way to manage that in Nexus isn’t yet obvious.

Security Considerations

  • Pull-based collection may help detect certain types of attacks, such as replaced hardware components or unexpected software versions.

  • We need to determine how access to the metric data is controlled.

External References