RFD 162: Metrics collection architecture and design

RFD 162 covers the architecture of the metric data collection system within an Oxide rack. We’re trying to understand where data is produced, how that data is collected, and how metadata about which metrics exist is communicated to both internal and external Oxide software.

Background

This RFD assumes some background that can be found in previous RFDs. [rfd116] covers metrics in general; [rfd125] describes our choice to use ClickHouse (CH) as the database for our metrics; and [rfd161] describes how our data is modeled in general and within CH specifically.

Given the early state of this RFD, please refer to the [_open_questions]. These highlight some of the major tensions and areas where early thoughts may help direct or focus further discussion and experimentation.

Terminology

The terminology from the data model discussion in [rfd161] is assumed, but is reviewed here. A target is a named, schematized source of data; for example, a software process. A metric is a named, schematized stream of measurements; for example, total number of requests a server has received, by path, method, and response code. The schema for both is defined by fields, which have a name and type. A specific sequence of field values defines a single timeseries, with its ID specified by converting each field value to a string, and concatenating them with ":". A measurement is a single sample from a timeseries.
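The timeseries ID rule above can be illustrated with a minimal sketch. The function name and the example field values are hypothetical, and the ordering of the field values is assumed to be whatever the schema defines.

```python
def timeseries_id(field_values):
    """Build a timeseries ID by converting each field value to a string
    and joining the results with ':'.

    `field_values` is the ordered sequence of field values for one
    timeseries; the ordering itself is an assumption of this sketch.
    """
    return ":".join(str(value) for value in field_values)


# Hypothetical example: a server process counting requests
# by path, method, and response code.
example_id = timeseries_id(["nexus", "/instances", "GET", 200])
print(example_id)  # nexus:/instances:GET:200
```

Note that this simple rule does not escape ":" inside a field value, so two different field sequences could in principle produce the same ID.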

Producer

A producer is a software process that generates measurements. The measurements may derive from multiple targets and metrics. Note that while the producer itself is software, it may be collecting the underlying data from hardware such as a sensor, drive, NIC, etc. The measurements themselves may also be generated entirely by software; e.g., vCPU utilization information.

It’s important to note that a producer may not be the original source of the measurement data. Some metrics will likely be proxied from other sources, and this system must be flexible enough to support this.

Collector

A collector is a software process that ingests metrics from one or more producers. The main intent here is for insertion into our timeseries database, ClickHouse, but one could imagine other processing occurring in or around the collector (alerting, perhaps).

The proposed collection system is called oximeter.

Assignment of producers to collectors

There are many ways to assign producers, and the metrics they generate, to collectors. In the interest of expediency, we adopt a simple model here: each producer is assigned to one or more collectors (more than one to improve availability). Nexus is responsible for this initial assignment, and records it in the control plane database.

Collection architecture

In broad terms, collection can be pull, in which the collector explicitly polls producers for metrics, or push, in which producers send data to the collector. Both have tradeoffs. A push model allows producers to decide when they wish to generate metrics, rather than being interrupted at a potentially inconvenient time by the collector. A pull model may make it simpler to do tasks like alerting — if you can’t reach the producer, that’s a problem. Pull may also be more secure. For example, if an attacker were able to replace a hardware device, we might more readily detect that fact if we poll for its firmware revision regularly, rather than waiting for the device or its managing software to send the revision to the collector.

Though we’re allowing for future changes, we intend to follow a pull-based collection model at this point. In general, producers let Nexus know what data can be collected from them. Nexus assigns each producer to some number of oximeter instances, along with the interval on which to poll each producer. The oximeter instances may cache their producers locally, and Nexus may also modify their assigned producers.
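As a rough sketch of the pull model described above, a collector might track a next-poll time for each assigned producer and fetch from those that are due. Every name here (`Assignment`, `Collector`, `poll_due`, the `fetch` callable) is hypothetical, and treating an unreachable producer as a `None` result is an assumption made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Assignment:
    """One producer assigned to this collector (hypothetical shape)."""
    producer_id: str
    endpoint: str
    interval_s: float
    next_poll: float = 0.0


class Collector:
    def __init__(self, fetch):
        # fetch(endpoint) returns a list of measurements, or raises
        # OSError if the producer cannot be reached.
        self.fetch = fetch
        self.assignments = {}

    def assign(self, assignment):
        self.assignments[assignment.producer_id] = assignment

    def poll_due(self, now):
        """Poll every producer whose interval has elapsed at time `now`."""
        results = {}
        for a in self.assignments.values():
            if now < a.next_poll:
                continue
            try:
                results[a.producer_id] = self.fetch(a.endpoint)
            except OSError:
                # An unreachable producer is itself a signal worth recording.
                results[a.producer_id] = None
            a.next_poll = now + a.interval_s
        return results
```

Passing the clock in as `now` keeps the sketch deterministic; a real collector would more likely drive this from timers.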

Initialization

Nexus is assumed to be running. When a producer starts, it registers with Nexus the list of targets and metric schemas that it produces. Each producer also specifies a default collection interval, and an endpoint at which its metrics may be collected.

Nexus verifies that the received schemas match what exists in ClickHouse, or updates them (or returns an error) in a TBD way. Nexus also assigns an ID to the producer.
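The registration step might look roughly like the following sketch. Since the handling of a schema mismatch is still undecided, this sketch arbitrarily picks the "return an error" option; the class and method names, the in-memory dictionaries standing in for ClickHouse and the control plane database, and the use of UUIDs for producer IDs are all assumptions.

```python
import uuid


class SchemaMismatch(Exception):
    """Raised when a registered schema conflicts with a known one."""


class ProducerRegistry:
    def __init__(self):
        # timeseries name -> {field name: field type}; stands in for
        # the schema already recorded in ClickHouse.
        self.known_schema = {}
        # producer ID -> registration details.
        self.producers = {}

    def register(self, endpoint, interval_s, schema):
        # Validate every schema before recording anything, so a failed
        # registration leaves no partial state behind.
        for name, fields in schema.items():
            existing = self.known_schema.get(name)
            if existing is not None and existing != fields:
                raise SchemaMismatch(name)
        self.known_schema.update(schema)
        producer_id = str(uuid.uuid4())  # ID format is an assumption
        self.producers[producer_id] = {
            "endpoint": endpoint,
            "interval_s": interval_s,
            "timeseries": sorted(schema),
        }
        return producer_id
```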

When oximeter starts, it also registers with Nexus. Nexus assigns the oximeter instance a collector ID; a list of producers to poll; the schema for the metrics that each producer returns; and the interval on which to poll them. oximeter provides interfaces for modifying the endpoint, schema, or interval of an existing producer, and for adding or removing a producer entirely.

Assuming Nexus is available has some implications. See [_open_questions] for a discussion.

`oximeter` caching

oximeter instances should cache their list of producers in local storage. Ideally, this storage is durable, but it should at least survive a restart of oximeter itself. A temporary filesystem may be sufficient, but on-disk storage might be preferable so that the cache also survives a restart of the sled.
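One way to make such a cache robust to a crash in the middle of a write is to write a temporary file and rename it into place. The sketch below assumes a JSON file and hypothetical function names; the actual storage format and location are not specified here.

```python
import json
import os
import tempfile


def save_producers(path, producers):
    """Atomically write the producer list to `path` as JSON."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(producers, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to stable storage
        # Rename within the same directory, so readers see either the
        # old file or the new one, never a partial write.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


def load_producers(path):
    """Load the cached producer list, or an empty one if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}
```

On restart, a collector could call `load_producers` to resume polling before it has re-registered with Nexus.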

Duplicate collectors

To provide some availability, Nexus may choose to assign one producer to more than one oximeter instance. Importantly, each of these should be backed by a different ClickHouse instance, so each is entirely independent.
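A sketch of how the independence constraint could be enforced when choosing collectors for a producer follows. The function name, the `(collector_id, clickhouse_id)` pairing, and the first-fit selection order are assumptions; Nexus may well use a different placement policy.

```python
def pick_collectors(collectors, replicas):
    """Choose `replicas` collectors, each backed by a distinct
    ClickHouse instance.

    `collectors` is a list of (collector_id, clickhouse_id) pairs.
    Raises ValueError if not enough independent collectors exist.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    chosen = []
    used_clickhouse = set()
    for collector_id, clickhouse_id in collectors:
        if clickhouse_id in used_clickhouse:
            continue  # would share a database with an earlier choice
        chosen.append(collector_id)
        used_clickhouse.add(clickhouse_id)
        if len(chosen) == replicas:
            return chosen
    raise ValueError("not enough independent collectors")


available = [("c1", "ch-a"), ("c2", "ch-a"), ("c3", "ch-b")]
print(pick_collectors(available, 2))  # ['c1', 'c3']
```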

Interfaces

Nexus interfaces

Nexus must provide a variety of public and private interfaces for controlling metrics collection. This includes knobs for registering producers and collectors; listing the metrics available; and modifying the collection interval for the metrics.

To the extent that the routes below are made public (or even if they’re not), we may need to prefix every route with /projects/{project_id}/.

`/metrics/producers`

Visibility: Private

POST Register with Nexus as a new producer. The request must contain:

A single IP address at which data for this producer can be collected
A default collection interval
A list of (targets, metrics) available with their schema

In response, Nexus generates a producer_id which uniquely identifies this producer. This ID is used in the collection endpoint, described below.
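For illustration, a producer might assemble its registration request body along these lines. The JSON key names (`address`, `interval_s`, `timeseries`) and the schema layout are hypothetical; only the three required pieces of information come from the description above.

```python
import ipaddress
import json


def build_registration(address, interval_s, timeseries):
    """Build a JSON body for POST /metrics/producers.

    `timeseries` is a list of {"target": ..., "metric": ..., "fields": ...}
    entries; this layout is an assumption of the sketch.
    """
    ipaddress.ip_address(address)  # raises ValueError if not a valid IP
    if interval_s <= 0:
        raise ValueError("collection interval must be positive")
    if not timeseries:
        raise ValueError("at least one (target, metric) pair is required")
    return json.dumps(
        {"address": address, "interval_s": interval_s, "timeseries": timeseries},
        sort_keys=True,
    )


body = build_registration(
    "fd00::1",
    10,
    [{
        "target": "http_service",
        "metric": "requests",
        "fields": {"path": "string", "method": "string", "status": "i64"},
    }],
)
```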

`/metrics/producers/{producer_id}`

Visibility: Private

PATCH Update information for a producer, such as its exported metrics, the IP address of its collection endpoint, or its default collection interval.

DELETE De-register as a producer. This interface is intended to be used to notify Nexus and a collector to no longer expect the producer’s collection endpoint to be available, and that this should be a normal situation, rather than generating an alert of some kind.

`/metrics/collectors`

Visibility: Private

POST Register with Nexus as a new collector. The request must contain an IP address that Nexus can use for further communication.

In response, Nexus generates a collector_id which uniquely identifies this collector. This ID is used in later requests from Nexus to control the collection oximeter is responsible for.