RFD 162 covers the architecture of the metric data collection system within an Oxide rack. We’re trying to understand where data is produced, how that data is collected, and how metadata about which metrics exist is communicated to both internal and external Oxide software.
Background
This RFD assumes some background that can be found in previous RFDs. [rfd116] covers metrics in general; [rfd125] describes our choice to use ClickHouse (CH) as the database for our metrics; and [rfd161] describes how our data is modeled in general and within CH specifically.
Given the early state of this RFD, please refer to the Open Questions section. These highlight some of the major tensions and areas where early thoughts may help direct or focus further discussion and experimentation.
Terminology
The terminology from the data model discussion in [rfd161] is assumed, but is
reviewed here. A target is a named, schematized source of data; for example,
a software process. A metric is a named, schematized stream of measurements;
for example, total number of requests a server has received, by path, method,
and response code. The schema for both is defined by fields, which have a name
and type. A specific sequence of field values defines a single timeseries,
with its ID specified by converting each field value to a string and
concatenating them with ":". A measurement is a single sample from a
timeseries.
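As an illustration of the ID rule above, here is a minimal sketch. The function name, the example field values, and the ordering of target fields before metric fields are all assumptions made for this example; [rfd161] is the authority on the actual layout.

```python
def timeseries_id(field_values):
    """Stringify each field value and join the results with ':'."""
    return ":".join(str(value) for value in field_values)


# Hypothetical field values for one timeseries.
target_fields = ["nexus", "sled-7"]        # e.g. service name, host
metric_fields = ["/instances", "GET", 200]  # e.g. path, method, response code

print(timeseries_id(target_fields + metric_fields))
# prints: nexus:sled-7:/instances:GET:200
```

One consequence worth noting: a field value that itself contains ":" would make the resulting ID ambiguous to split, so a real implementation would likely need to escape or restrict such values.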
Producer
A producer is a software process that generates measurements. The measurements may derive from multiple targets and metrics. Note that while the producer itself is software, it may be collecting the underlying data from hardware such as a sensor, drive, NIC, etc. The measurements themselves may also be generated entirely by software; e.g., vCPU utilization information.
Collector
A collector is a software process that ingests metrics from one or more producers. The main intent here is for insertion into our timeseries database, ClickHouse, but one could imagine other processing occurring in or around the collector (alerting, perhaps).
The proposed collection system is called oximeter.
Assignment of producers to collectors
There are many ways to assign metrics from producers to collectors. In the interest of urgency, we take a simple model here, and assign one producer to one or more collectors (multiple to help availability). Nexus will be responsible for this initial assignment, and will record this in the control plane database.
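One possible shape for this assignment is sketched below. The round-robin policy, the function name, and the `replicas` parameter are assumptions for illustration only; the RFD does not specify how Nexus chooses collectors.

```python
def assign_producers(producer_ids, collector_ids, replicas=1):
    """Map each producer to `replicas` distinct collectors, round-robin.

    Hypothetical policy: producer i starts at collector i and takes the
    next `replicas` collectors, wrapping around the list.
    """
    count = len(collector_ids)
    if not 1 <= replicas <= count:
        raise ValueError("replicas must be between 1 and the number of collectors")
    assignments = {}
    for index, producer_id in enumerate(producer_ids):
        assignments[producer_id] = [
            collector_ids[(index + offset) % count] for offset in range(replicas)
        ]
    return assignments


# Three producers spread over three collectors, two collectors each.
plan = assign_producers(["p0", "p1", "p2"], ["c0", "c1", "c2"], replicas=2)
print(plan)
```

Because `replicas` never exceeds the number of collectors, each producer is always given distinct collectors, which is what the availability argument requires.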
Collection architecture
In broad terms, collection can be pull, in which the collector explicitly polls producers for metrics, or push, in which producers send data to the collector. Both have tradeoffs. A push model allows producers to decide when they wish to generate metrics, rather than being interrupted at a potentially inconvenient time by the collector. A pull model may make it simpler to do tasks like alerting — if you can’t reach the producer, that’s a problem. Pull may also be more secure. For example, if an attacker were able to replace a hardware device, we might more readily detect that fact if we poll for its firmware revision regularly, rather than waiting for the device or its managing software to send the revision to the collector.
Though we’re allowing for future changes, we intend to follow a pull-based
collection model at this point. In general, producers let Nexus know what data
can be collected from them. Nexus assigns each producer to some number of
oximeter instances, along with the interval on which to poll each producer.
The oximeter instances may cache their producers locally, and Nexus may also
modify their assigned producers.
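To make the per-producer polling interval concrete, the sketch below computes the order in which a collector would poll its producers over a simulated time window. The function name and the choice to issue the first poll one interval after start are assumptions, not part of the proposal.

```python
import heapq


def poll_schedule(intervals, until):
    """Return (time, producer_id) poll events in time order, up to `until`.

    `intervals` maps a producer ID to its polling interval in seconds.
    Assumption: the first poll happens one interval after time zero.
    """
    heap = [(interval, producer_id) for producer_id, interval in intervals.items()]
    heapq.heapify(heap)
    events = []
    while heap and heap[0][0] <= until:
        due, producer_id = heapq.heappop(heap)
        events.append((due, producer_id))
        # Re-arm the producer for its next poll.
        heapq.heappush(heap, (due + intervals[producer_id], producer_id))
    return events


for due, producer_id in poll_schedule({"a": 10, "b": 15}, until=30):
    print(f"t={due:>3}s poll {producer_id}")
```

A real collector would replace the simulated clock with timers and issue a request to each producer's endpoint when it comes due; the ordering logic stays the same.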
Initialization
Nexus is assumed to be running. When a producer starts, it registers with Nexus the list of targets and metric schema that it produces. Each producer also specifies a default collection interval, and an endpoint at which the metrics may be collected.
Nexus verifies that the received schema match what exists in ClickHouse, or updates it (or returns an error) in a way that is still TBD. Nexus also assigns an ID to the producer.
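The registration step can be sketched with an in-memory stand-in for Nexus. Everything here is hypothetical: the class and method names, the use of UUIDs for producer IDs, and in particular the policy of rejecting a mismatched schema, since the RFD leaves the verify-or-update behavior undecided.

```python
import uuid


class NexusRegistry:
    """In-memory stand-in for the producer registration state in Nexus."""

    def __init__(self):
        self.schemas = {}    # timeseries name -> {field name: field type}
        self.producers = {}  # producer_id -> registration details

    def register_producer(self, address, interval_secs, schemas):
        # Assumed policy: reject any schema that conflicts with a known one.
        # Check everything first so a failed request leaves no partial state.
        for name, schema in schemas.items():
            known = self.schemas.get(name)
            if known is not None and known != schema:
                raise ValueError(f"schema mismatch for {name!r}")
        self.schemas.update(schemas)
        producer_id = str(uuid.uuid4())
        self.producers[producer_id] = {
            "address": address,
            "interval_secs": interval_secs,
            "timeseries": sorted(schemas),
        }
        return producer_id


registry = NexusRegistry()
producer_id = registry.register_producer(
    "fd00::1", 10, {"http_server:requests": {"path": "string", "code": "int"}}
)
print(producer_id, registry.producers[producer_id])
```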
When oximeter starts, it also registers with Nexus. Nexus assigns the
oximeter instance a collector ID; a list of producers to poll; the schema for
the metrics that each producer returns; and the interval on which to poll them.
oximeter provides interfaces for modifying the endpoint, schema, or collection
interval of an existing producer, and for adding or removing a producer entirely.
oximeter caching
oximeter instances should cache their list of producers in local storage.
Ideally, this storage is durable, but it should at least survive a restart of
oximeter itself. A temporary filesystem may be sufficient, but on-disk storage
might be preferable to survive the restart of a sled.
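A minimal sketch of such a cache follows, assuming a JSON file as the storage format and hypothetical function names. It writes to a temporary file and then renames it over the cache, so a crash partway through a write leaves the previous cache intact.

```python
import json
import os
import tempfile


def save_producers(path, producers):
    """Write the producer list to `path`, replacing any previous cache."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as cache_file:
        json.dump(producers, cache_file)
        cache_file.flush()
        os.fsync(cache_file.fileno())  # push the data to disk before renaming
    os.replace(tmp_path, path)         # swap the new cache into place


def load_producers(path):
    """Return the cached producers, or an empty mapping if no cache exists."""
    try:
        with open(path) as cache_file:
            return json.load(cache_file)
    except FileNotFoundError:
        return {}
```

On startup the collector would call `load_producers` before contacting Nexus, then overwrite the cache whenever Nexus changes its assignments.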
Duplicate collectors
To provide some availability, Nexus may choose to assign one producer to more
than one oximeter instance. Importantly, each of these should be backed by
a different ClickHouse instance, so each is entirely independent.
Interfaces
Nexus interfaces
Nexus must provide a variety of public and private interfaces for controlling metrics collection. This includes knobs for registering producers and collectors; listing the metrics available; and modifying the collection interval for the metrics.
/metrics/producers
Visibility: Private
POST Register with Nexus as a new producer. The request must contain:
A single IP address at which data for this producer can be collected
A default collection interval
A list of (targets, metrics) available with their schema
In response, Nexus generates a producer_id which uniquely identifies this
producer. This ID is used in the collection endpoint, described below.
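The three required parts of the registration request could be checked as in the sketch below. The JSON field names (`address`, `interval_secs`, `timeseries`), the example payload, and the validation rules are assumptions for illustration; the RFD does not define a wire format.

```python
import ipaddress

# Hypothetical field names for the three required items.
REQUIRED_FIELDS = ("address", "interval_secs", "timeseries")


def validate_registration(body):
    """Return a list of problems found in a producer registration body."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in body]
    if "address" in body:
        try:
            ipaddress.ip_address(body["address"])
        except ValueError:
            problems.append("address is not a valid IP address")
    if "interval_secs" in body:
        interval = body["interval_secs"]
        if not (isinstance(interval, (int, float)) and interval > 0):
            problems.append("interval_secs must be a positive number")
    return problems


example_body = {
    "address": "fd00::1",
    "interval_secs": 10,
    "timeseries": [
        {
            "target": {"name": "http_server", "fields": {"name": "string"}},
            "metric": {"name": "requests", "fields": {"path": "string", "code": "int"}},
        }
    ],
}
print(validate_registration(example_body))  # an empty list means no problems
```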
/metrics/producers/{producer_id}
Visibility: Private
PATCH Update information for a producer, such as its exported metrics, IP address for the collection endpoint, or a default interval.
DELETE De-register as a producer. This interface is intended to be used to notify Nexus and a collector to no longer expect the producer’s collection endpoint to be available, and that this should be a normal situation, rather than generating an alert of some kind.
/metrics/collectors
Visibility: Private
POST Register with Nexus as a new collector. The request must contain an IP address that Nexus can use for further communication.
In response, Nexus generates a collector_id which uniquely identifies this
collector. This ID is used in later requests from Nexus to control the
collection oximeter is responsible for.
/metrics/target-schema
Visibility: Private, possibly public.
GET Return the list of current target schema, maybe with information about the producer that generates them.
/metrics/metric-schema
Visibility: Private, possibly public.
GET Return the list of current metric schema, maybe with information about the producer that generates them.
/metrics/timeseries-schema
Visibility: Private, possibly public.
GET Return the list of current timeseries schema. This is really the list of
(target_schema, metric_schema) pairs, and so may be redundant with the above
two interfaces, or not very actionable.
/metrics/targets
Visibility: Private, possibly public.
GET Return the list of currently-available actual targets (not their schema).
/metrics/metrics
Visibility: Private, possibly public.
GET Return the list of currently-available actual metrics (not their schema).
/metrics/timeseries
Visibility: Private, possibly public.
GET Return the list of currently-available actual timeseries.
Producer interfaces
/metrics/{producer_id}
Visibility: Private
GET Return the currently-available metrics from a producer. This returns
the latest data for all metrics that the producer generates. The producer uses
the producer_id assigned by Nexus during registration.
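The producer side of this endpoint might look like the sketch below, which keeps only the latest measurement per timeseries and answers a GET for its own path. The class, its methods, the (status, body) return shape, and the 404 for an unknown ID are all hypothetical; no HTTP server is involved.

```python
import time


class Producer:
    """Holds the latest measurement for each timeseries this producer owns."""

    def __init__(self, producer_id):
        self.producer_id = producer_id  # assigned by Nexus at registration
        self._latest = {}

    def record(self, timeseries_id, value, timestamp=None):
        """Store a measurement, overwriting any earlier one for the timeseries."""
        when = time.time() if timestamp is None else timestamp
        self._latest[timeseries_id] = {"timestamp": when, "value": value}

    def handle_get(self, path):
        """Answer GET <path>; only /metrics/{producer_id} is served."""
        if path != f"/metrics/{self.producer_id}":
            return 404, {}
        return 200, dict(self._latest)


producer = Producer("example-producer-id")
producer.record("http_server:requests", 41)
producer.record("http_server:requests", 42)
print(producer.handle_get("/metrics/example-producer-id"))
```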
oximeter interfaces
/producers
Visibility: Private
GET: Return the list of producers this oximeter instance pulls from.
PUT: Set the list of producers this oximeter instance pulls from.
/producers/{producer_id}
Visibility: Private
GET: Return information about the given producer (endpoint, metrics, and interval).
PATCH: Update the information for this producer, such as its endpoint, the metrics to expect from it, or their collection interval.
DELETE: Remove this producer from the oximeter instance, i.e., tell
oximeter to stop collecting from it.
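The semantics of these endpoints can be modeled with a small in-memory structure. This is a sketch under assumptions: the class and method names are hypothetical, each method stands in for one HTTP verb, and error handling for unknown producer IDs is left as a plain `KeyError`.

```python
class ProducerSet:
    """In-memory model of the producers one oximeter instance pulls from."""

    def __init__(self):
        self._producers = {}  # producer_id -> {"endpoint", "metrics", "interval_secs"}

    def put_all(self, producers):
        """PUT /producers: replace the whole set."""
        self._producers = {pid: dict(info) for pid, info in producers.items()}

    def list_ids(self):
        """GET /producers: list the producer IDs."""
        return sorted(self._producers)

    def get(self, producer_id):
        """GET /producers/{producer_id}: return a copy of its details."""
        return dict(self._producers[producer_id])

    def patch(self, producer_id, **changes):
        """PATCH /producers/{producer_id}: update selected details."""
        self._producers[producer_id].update(changes)

    def delete(self, producer_id):
        """DELETE /producers/{producer_id}: stop collecting from it."""
        del self._producers[producer_id]


producers = ProducerSet()
producers.put_all({"p0": {"endpoint": "fd00::1", "metrics": ["requests"], "interval_secs": 10}})
producers.patch("p0", interval_secs=30)
print(producers.get("p0"))
```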
Determinations
Pull-based collection
We’re making the determination to follow a pull-based collection model for now.
The oximeter collector will poll producers for data, rather than producers
sending data to oximeter. This isn’t to say that push-based collection is
precluded by the design here, and we’re trying to keep that possibility open
in future work.
Collector caching
oximeter should cache the information about producers from which it pulls
locally. Ideally this would be durable, but the capability to at least survive
a restart of oximeter itself is important.
Open Questions
Reliance on Nexus
The process described under Initialization requires that Nexus (and thus CockroachDB) are already running, and that they can be reached over the network. This implies that this collection system cannot be used to collect telemetry data prior to a trust quorum being established, or if the producer is otherwise partitioned from Nexus.
Collecting telemetry and metric data prior to a quorum, during sequences like cold start, is probably very useful for debugging those early phases of rack operation (more so in early versions of the product). How do we collect that data, and where is it stored? Is it buffered and then made available once the quorum is achieved?
External interfaces
This RFD focuses on the interfaces used inside the control plane to collect
metrics from producers. How do external clients, such as dashboards in the
console, operate on the primitives discussed here? The Nexus endpoints at
/ and / are probably important levers for managing the collection: clients
express the intent to capture this set of data on some interval, and Nexus
arranges for an oximeter instance to do so.
Another obvious open area is how the metric data itself actually makes it to the client. Querying is intentionally out of scope for this RFD, but we should ensure that these interfaces support and enable (and definitely don’t hinder) the mechanisms used to query and send data to customers.
Schema updates
How does this system handle updates to the metric schema? This relies in part on how the data is stored, which is explored in [rfd161]. It also requires some amount of consistency, and the right way to manage that in Nexus isn’t yet obvious.
Security Considerations
Pull-based collection may help detect certain types of attacks, such as replaced hardware components or unexpected software versions.
We still need to think about how we control access to the metric data.
External References
[rfd116] Oxide Computer Company. RFD 116 A Midsummer Night’s Metric. https://rfd.shared.oxide.computer/rfd/0116
[rfd125] Oxide Computer Company. RFD 125 Telemetry requirements and building blocks. https://rfd.shared.oxide.computer/rfd/0125
[rfd161] Oxide Computer Company. RFD 161 Metric data model. https://rfd.shared.oxide.computer/rfd/0161