RFD 161
Metrics data model

RFD 161 covers how metric data will be modeled throughout the Oxide rack. It focuses on the general schema for the data, and on the storage and querying of that data in our timeseries database.

Background

[rfd116] examines the general background on metrics and telemetry: what data we intend to collect and the purposes to which it will be put. [rfd125] explores the requirements of the metric system in the product, and investigates existing technologies that we might leverage. These are both important background for this RFD. An important determination from RFD 125 that’s relevant for this RFD is our use of ClickHouse (CH) as a timeseries database.

Another important piece of background is Monarch [mon], a timeseries database designed by Google. Of particular interest for our purposes is the data model Monarch describes, and we adopt much of its terminology in this RFD.

Please refer to the [_open_questions] section. These highlight some of the major tensions and areas where early thoughts may help direct or focus further discussion and experimentation.

Data model goals

Before diving into the model, it’s important to consider our requirements. What are the key constraints for modeling our metric data? What are the axes on which we’ll evaluate it?

Expressiveness

A hard constraint for any model is that we must be able to express the data generated by the rack. While this sounds obvious, many existing models are geared towards systems which generate different data than we might expect. For example, many monitoring systems assume that the monitored resource is a web service, where the relevant features of interest are things like request counts and response latencies (usually measured in milliseconds).

The Oxide rack will certainly generate these, but it will also produce much more. As discussed in RFD 116 [rfd116], data such as temperature, link states, firmware revisions, and much more is expected, and we have to be able to express these in the data model.

Action-oriented

Any metric data we collect is worth less than nothing if we can’t gain insight from it. The data model must support effective queries and useful analysis, with the goal of gaining insight that supports action, either to debug the system or iterate on the Oxide product itself.

One important aspect of this is the ability to slice and dice the metric data in a wide variety of ways. The dimensions along which data can be categorized are usually called tags, fields, or labels. Our desired data model must have the ability to associate fields with each measurement, and it must support querying those fields in a variety of ways.

Extensibility

There are clear examples of metrics we can be certain we’ll collect [rfd116], but many open questions remain. Hardware parts, service boundaries, scope, and software are all subject to change, and therefore so is the set of data we can collect from them. It must be plausible, though not necessarily trivial, to add new metric data into this system. When new data is added, the existing system, including its data, collection, storage, and analysis components, must certainly continue to operate.

Efficiency

We’ll have a lot of data. That data must be relatively efficient to work with. Efficiency is intentionally broad here, but it may be broken into several key (non-orthogonal) dimensions, which are often in tension.

  • Query efficiency: We would like queries to be efficient. It’s useful to consider different classes of data here. Queries that are common or visible to the customer, such as dashboards, should be snappy. Those used to support product iteration may be allowed to take much longer to complete (hours or days). This implies that we should organize the data to be suited to those most common queries — such data should be easy to retrieve and operate on. This goal is certainly in tension with the required storage space, for example in database denormalization.

  • Storage space: The volume of data we’ll likely accumulate requires that we consider storage space carefully. ClickHouse is very helpful here. The columnar organization supports high compression ratios, and the database offers many options for codecs tailored to the data. The on-disk size of the data is certainly in tension with query speed, but another clear tradeoff is around retention policies. Ideally, we can keep all Oxide-generated metric data forever. Practically, we’ll have to decide what data is deleted at what age.
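
To make the codec point concrete, the following is a minimal sketch of per-column compression codecs that ClickHouse offers for timeseries-shaped data; the table and column names are purely illustrative.

CREATE TABLE codec_example (
    /* Timestamps compress well with a delta codec followed by a general-purpose codec. */
    timestamp DateTime64(6, 'UTC') CODEC(Delta, ZSTD),
    /* The Gorilla codec targets slowly-changing floating-point measurements. */
    value Float64 CODEC(Gorilla, ZSTD)
)
ENGINE = MergeTree()
ORDER BY timestamp;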

Strong typing

A related goal, though not necessarily a constraint, is a rich type system. Much of our data will be numeric, but there will be strings, IP addresses, and UUIDs galore, and it is desirable that these types be accurately represented in the data model. Types help make a system more expressive, but also support querying, efficient computation and storage, and correctness.

Other goals

  • The system should be discoverable. The set of available metrics, and probably their schema, will change over time, and it should be possible for client code to follow these changes.

Non-goals

  • Interop with any existing components, such as Prometheus.

  • Customer metric data will not be contained in ClickHouse, and we have no plans to offer metrics as a service.

  • Collection of metric data prior to achieving a trust quorum for the rack. The goal is to store data acquired during the steady-state operation of the rack. (See [_open_questions].)

Terminology

Before exploring available solutions, we standardize on a few key terms. The monitoring community often uses these in slightly different or conflicting ways, so it’s important to address some of this ambiguity. (This is intentionally modeled after Google’s Monarch system, which is quite flexible and general.)

Targets

A target is a monitored resource or entity, which is a source of metric data. This may be a virtual machine, an internal or external Oxide web service, a database, a daemon collecting hardware information, or other components.

A target has a string name, and a schema, which is defined by its list of [_fields] (string keys and typed values). The target key is defined by concatenating the target name with each of the values of its fields, converted to strings, joined with the ":" character. See [_example_targets_and_metrics] for detailed examples.
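
As a quick illustration of that construction (the field values here are hypothetical placeholders), the same concatenation can be written directly as a ClickHouse expression:

SELECT arrayStringConcat(
    ['virtual-machine', 'some-project-uuid', 'some-instance-uuid'],
    ':'
) AS target_key;
/* => 'virtual-machine:some-project-uuid:some-instance-uuid' */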

Metrics

A metric is an aspect of a target from which data is derived. This could be data produced directly by an application, such as response latencies, or pulled from a sensor or hardware device. This is intended to capture a single aspect of a target, so there may be many metrics for a single target. For example, we may collect both IO operations and temperature from a disk — these are separate metrics.

As with [_targets], metrics are schematized. They have a string name and a list of [_fields], each with a name and typed value. The metric key is defined analogously to the target key: concatenate the metric’s field values with the metric name, joined by ":". See below for examples.

Timeseries

We use the term timeseries to refer to a timestamped sequence of measurements of a metric; a more complete definition is given below, after fields are introduced.

Fields

A field is a dimension along which targets or metrics may be divided or categorized. These can be thought of as important features of a target, such as the name of a software application or the serial number of a NIC. The intent of fields is to uniquely identify individual targets and metrics, as well as to help correlate metric data with features of their target. As a simple example, knowing that a disk drive failed is unhelpful — knowing a drive in rack A, sled B, slot C, with GUID DDDD is extremely helpful.

An important aspect of fields is that they form a total ordering. One can compare any two items to determine if one is greater than, less than, or equal to the other. It should also be possible to fold a sequence of field values by taking the minimum, maximum, median, or other similar summary operations. But arithmetic operations are not relevant — the data is more categorical than numerical or arithmetic.

Many existing metric systems provide support for fields with a string value, and some also support integers. But to satisfy some of the goals, we’d like to support the following types:

  • string

  • boolean

  • signed 64-bit integer

  • IP address

  • UUID

Metric types

Metric data itself has several different characteristics, which describe the semantics of its values.

  • Data type refers to the type of the measurement, for example a string or float.

  • A metric’s kind may be either cumulative (a running counter) or gauge (an instantaneous measurement).

  • Its rank is either scalar or distribution (a vector or rank-1 array).

  • Metrics should have a unit of measure, which includes a quantity (length, mass, time), and the unit used to measure it (meters, grams, nanoseconds).

Though units of measure play little role in most existing data models, they are crucial to understanding the data. In systems like Prometheus, the unit is usually encoded in the metric’s name, such as request_latency_ms. We can do better, and provide a richer system for annotating units.
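
As a sketch of what "better" might look like, the quantity and unit could be carried as explicit, enumerated metadata rather than as part of the metric name. The table and enum values below are hypothetical and are not part of the schema proposed later in this RFD.

CREATE TABLE metric_units_example (
    metric_name String,
    /* The physical quantity being measured. */
    quantity Enum('time' = 1, 'temperature' = 2, 'data' = 3),
    /* The unit used to measure it. */
    unit Enum('nanoseconds' = 1, 'celsius' = 2, 'bytes' = 3)
)
ENGINE = MergeTree()
ORDER BY metric_name;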

Timeseries

A timeseries is a set of timestamped measurements for a single metric, from a single target. Its schema is the combination of the schema for its target and metric. We can build a timeseries key for each individual series, by concatenating the target keys and metric keys. See the next section for examples.

Example targets and metrics

A service, process, or software task

Targets are intended to identify fairly broad classes of data. One example might be an HTTP-based service, such as Nexus. An example metric for this might be a distribution over request latency, broken out by endpoint, request method, and response status code.

Target schema

  Name:        "oxide-service"
  Field names: "name", "instance"
  Field types: string, Uuid

Metric schema

  Name:        request-latency
  Field names: "path", "method", "response-code"
  Field types: string, string, int
  Kind:        cumulative
  Rank:        distribution
  Data type:   int
  Unit:        nanoseconds
  Bins:        array of int, giving the right bin edges

Example timeseries ID

"oxide-service:nexus:9b5d3aed-9e46-4868-8b7d-156edf22f1a:/instances:GET:404:request-latency"

VM CPU utilization

Another example in the software space is CPU utilization for a virtual machine.

Target schema

  Name:        virtual-machine
  Field names: "project", "instance"
  Field types: Uuid, Uuid (or string, string if we’re using names)

Metric schema

  Name:        cpu-busy
  Field names: cpu-id
  Field types: int
  Kind:        cumulative
  Rank:        scalar
  Data type:   int
  Unit:        nanoseconds

Example timeseries ID

"virtual-machine:edb2682f-594b-49a7-ac4f-eb434d680700:cead30f4-af75-44c3-9c61-c048300b90a:2:cpu-busy

Physical disk temperature

In the hardware domain, perhaps we’d like to track the temperature of disks in the rack.

Target schema

  Name:        disk
  Field names: "rack", "sled", "slot", "model", "serial", "disk-id"
  Field types: UUID, UUID, int, string, string, UUID

Metric schema

  Name:        temperature
  Field names: sensor-id (not sure here)
  Field types: string
  Kind:        gauge
  Rank:        scalar
  Data type:   float
  Unit:        Celsius
  Bins:        None

Example timeseries ID

disk:uuids:for:days:temperature

Illustrative queries

We’d like to explore data model and collection strategies in the context of some specific queries that we’ll likely encounter. These are intended to be representative, either of common queries, or queries that might be large or difficult, but which we’d nonetheless like to support.

  • Return the combined request latency histogram for all nexus instances, independent of request parameters and response code.

  • Return the last measure of CPU utilization for a specific virtual machine, and a specific vCPU of that machine.

  • Return the average temperature of a specific model of physical disk, in all racks.

    • There are other correlations here we might like, such as relating drive temperatures to overall/average sled temperatures. I’d love some help making sure we cover our bases w.r.t. hardware data and queries.

Database organization

ClickHouse (CH) is a columnar DBMS. See [rfd125] for details, but its organization is tabular and similar enough to a traditional DBMS like Postgres that the queries and schema should be clear.

The tables in the database must have schema, though CH has pretty flexible types. So the main axes along which we’re considering the database setup are:

  • How easy is it to add new metrics, or update the schema for existing metrics?

  • What do our example queries look like, under the different possible table setups?

There are two main candidates I’m going to explore here (though if there are others I’d love to hear about them): field unrolling and dynamic tables.

Field unrolling

The core of the idea here is, for each dimension of a measurement, add a row to a separate table for each field key and value. This "expands" or unrolls a measurement into one or more rows in other tables, and so costs storage space. However, the benefit here is that these extra rows may be indexed. Thus while there may be more overall data, we can search it more quickly.

Table schema

In this setup, there are several metadata tables, which describe the target and metric schema, and possibly the timeseries schema (combination of the two). In addition, there is a table for each of the field names, field values, and measurement values. Note that the tables for field and measurement values may really be a collection of tables, one per supported data type for the fields and measurements. I’ll sketch out the schema assuming that’s true, for just two field types, integer and string, and cumulative/scalar and cumulative/distribution measurements.

CREATE TABLE target_schema (
    name String,
    field_names Array(String),
    field_types Array(String) /* Or an enum */
)
ENGINE = MergeTree()
PRIMARY KEY name
ORDER BY (name, field_names);
CREATE TABLE metric_schema (
    name String,
    field_names Array(String),
    field_types Array(String),
    /* "gauge" or "cumulative" */
    kind String,
    /* "scalar" or "distribution" */
    rank String,
    /* Depends on kind/rank, but generally: string, bool, int, float */
    data_type String,
    /* Only for distribution types. */
    bins Array(Float64)
)
ENGINE = MergeTree()
PRIMARY KEY name
ORDER BY (name, field_names);
CREATE TABLE target_fields_int (
    /* FK into the `target_schema` table. */
    target_name String,
    field_name String,
    timeseries_id String,
    field_value Int64
)
ENGINE = MergeTree()
PRIMARY KEY (target_name, field_name, timeseries_id)
ORDER BY (target_name, field_name, timeseries_id, field_value);
CREATE TABLE target_fields_string (
    /* FK into the `target_schema` table. */
    target_name String,
    field_name String,
    timeseries_id String,
    field_value String
)
ENGINE = MergeTree()
PRIMARY KEY (target_name, field_name, timeseries_id)
ORDER BY (target_name, field_name, timeseries_id, field_value);

These are duplicated exactly as the metric_fields_int and metric_fields_string tables, with a metric_name column taking the place of target_name.
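
For concreteness, one of those duplicated tables might look like the following, a sketch assuming metric_name simply takes the place of target_name:

CREATE TABLE metric_fields_string (
    /* FK into the `metric_schema` table. */
    metric_name String,
    field_name String,
    timeseries_id String,
    field_value String
)
ENGINE = MergeTree()
PRIMARY KEY (metric_name, field_name, timeseries_id)
ORDER BY (metric_name, field_name, timeseries_id, field_value);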

CREATE TABLE measurements_cumulative_scalar_int (
    /* FK into the `target_schema` table. */
    target_name String,
    /* FK into the `metric_schema` table. */
    metric_name String,
    timeseries_id String,
    /* A 64-bit timestamp with microsecond precision, UTC timezone. */
    timestamp DateTime64(6, 'UTC'),
    value Int64
)
ENGINE = MergeTree()
PRIMARY KEY (target_name, metric_name, timeseries_id)
ORDER BY (target_name, metric_name, timeseries_id, timestamp);
CREATE TABLE measurements_cumulative_distribution_int (
    /* FK into the `target_schema` table. */
    target_name String,
    /* FK into the `metric_schema` table. */
    metric_name String,
    timeseries_id String,
    /* A 64-bit timestamp with microsecond precision, UTC timezone. */
    timestamp DateTime64(6, 'UTC'),
    /* Each distribution is effectively a nested set of bins and counts, of the
     * same length.
     */
    distribution Nested(
        bins Int64,
        counts Int64
    )
)
ENGINE = MergeTree()
PRIMARY KEY (target_name, metric_name, timeseries_id)
ORDER BY (target_name, metric_name, timeseries_id, timestamp);

Imagine we receive a single measurement for the CPU utilization of a VM. This might look like:

Target name: virtual-machine
Target field names: "project", "instance"
Target field values: "some-uuid", "some-other-uuid"
Metric: cpu-busy
Metric field names: cpu-id
Metric field values: 1
Timestamp: "1999-12-31:12:59:59.999999Z"
Measurement: 100ns

Insertion

To insert this sample, we unroll the key-value pairs of the target and metric fields, and insert one row in the corresponding field table for each of these. We then insert one additional row, containing the measurement itself, into the appropriate measurements table.

For this example, the timeseries ID is: "virtual-machine:some-uuid:some-other-uuid:1". So we insert the rows:

INSERT INTO target_fields_string (*) VALUES
(
    'virtual-machine',
    'project',
    'virtual-machine:some-uuid:some-other-uuid:1',
    'some-uuid'
),
(
    'virtual-machine',
    'instance',
    'virtual-machine:some-uuid:some-other-uuid:1',
    'some-other-uuid'
);
INSERT INTO metric_fields_int (*) VALUES
(
    'cpu-busy',
    'cpu-id',
    'virtual-machine:some-uuid:some-other-uuid:1',
    1
);
INSERT INTO measurements_cumulative_scalar_int (*) VALUES
(
    'virtual-machine',
    'cpu-busy',
    'virtual-machine:some-uuid:some-other-uuid:1',
    '1999-12-31 12:59:59.999999',
    100
);

Querying

In the field-unrolling table setup, we have additional rows for every single measurement — one extra row per field value. The general strategy then is to find the rows matching our conditions (which may be in different tables) and join those on their timeseries IDs. We can then use those filtered timeseries IDs to efficiently look up data in the actual measurements table of interest.

HTTP request counts

In this example, we want to select any matching target field, and we have no constraints on the metric fields when filtering.

SELECT target_name, metric_name, timeseries_id, sum(distribution)
FROM measurements_cumulative_distribution_int
JOIN
(
    SELECT timeseries_id
    FROM target_fields_string
    WHERE (
        target_name = 'oxide-service' AND
        field_name = 'name' AND
        field_value = 'nexus'
    )
) USING (timeseries_id)
JOIN
(
    SELECT timeseries_id
    FROM metric_fields_string
    WHERE metric_name = 'request-latency'
) USING (timeseries_id)
GROUP BY target_name, metric_name, timeseries_id;

This query is definitely not correct, since I’m ignoring the fact that the sum is over distributions. CH has lots of methods for arrays, but it’s not obvious how to structure this using those primitives. A correct, but probably not efficient approach would be to zip the distributions into tuples, and then reduce each tuple with a summation. There are probably other methods, but we should think carefully about how to operate on these arrays in general, especially when aggregating them.
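
One candidate for the elementwise summation is ClickHouse’s -ForEach aggregate function combinator, which applies an aggregate to each array position across rows. A sketch of that approach, ignoring the cumulative-counter semantics and using an IN subquery in place of the explicit JOIN above:

SELECT sumForEach(distribution.counts) AS combined_counts
FROM measurements_cumulative_distribution_int
WHERE metric_name = 'request-latency'
AND timeseries_id IN (
    SELECT timeseries_id
    FROM target_fields_string
    WHERE (
        target_name = 'oxide-service' AND
        field_name = 'name' AND
        field_value = 'nexus'
    )
);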

CPU utilization

In our second example query, we want the last sample of CPU utilization for a VM, whose ID (or name) we know. Generally, the query might (1) find the rows in the target fields that match our VM, (2) join that with the rows of the metric fields that match our CPU, and (3) select the last timepoint from the measurements, where the timeseries ID matches the result of (1) and (2).

SELECT target_name, metric_name, timeseries_id, timestamp, value
FROM measurements_cumulative_scalar_int
JOIN
(
    SELECT timeseries_id
    FROM target_fields_string
    WHERE (
        target_name = 'virtual-machine' AND
        field_name = 'instance' AND
        field_value = 'some-other-uuid'
    )
) USING (timeseries_id)
JOIN
(
    SELECT timeseries_id
    FROM metric_fields_int
    WHERE (
        metric_name = 'cpu-busy' AND
        field_name = 'cpu-id' AND
        field_value = 1
    )
) USING (timeseries_id)
ORDER BY timestamp DESC
LIMIT 1;

This is supposed to use the unrolled field rows to select the timeseries IDs that match our query, JOIN those results for each filtering statement in the query, and then use the resulting timeseries IDs to find the actual measurements.

I believe that the number of JOINs is proportional to the number of filtering conditions that apply to different tables. For example, here we’re filtering on a string parameter and an integer parameter, so there is one JOIN. We must always JOIN this list of timeseries IDs with the appropriate measurements table itself.

If the conditions apply to the same type of field, the filtering could be done entirely inside a single table. For example, if we didn’t care about the CPU ID, but only wanted to find the utilization for a VM by its project and instance UUIDs, these are two separate rows in the target_fields_string table with the same timeseries ID. However, this isn’t a simple logical operator applied to one row: we don’t want rows where the project ID matches OR the instance matches, and a per-row conjunction can never match, since each row carries only one field. Really we want a self-join on the timeseries ID, which may or may not be more efficient, depending on the database implementation and complexity of the query.
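
A sketch of that self-join, using the hypothetical project and instance values from the running example:

SELECT p.timeseries_id
FROM target_fields_string AS p
INNER JOIN target_fields_string AS i
    ON p.timeseries_id = i.timeseries_id
WHERE (
    p.target_name = 'virtual-machine' AND
    p.field_name = 'project' AND
    p.field_value = 'some-uuid' AND
    i.field_name = 'instance' AND
    i.field_value = 'some-other-uuid'
);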

Disk temperature

This query starts in much the same way. We filter the target_fields_string and metric_fields_string tables to get the timeseries IDs, then join that with the measurements table. However, we need to carry some of the target fields through the JOIN with the measurements table, so that we can later manipulate them, such as grouping the results by those. Additionally, we need to compute the average within each group.

SELECT
    target_name,
    metric_name,
    model,
    avg(value) AS avg_temperature
FROM measurements_gauge_scalar_float
JOIN
(
    SELECT timeseries_id, field_value AS model
    FROM target_fields_string
    WHERE (
        target_name = 'disk' AND
        field_name = 'model' AND
        field_value = 'Samsung whatever'
    )
    /* Are there any filters on the `metric_fields_*` tables? */
) USING (timeseries_id)
GROUP BY target_name, metric_name, model
ORDER BY avg_temperature;

This is probably not correct, but the gist is there. What makes this query different is pulling the inner fields through to the outer query, and then grouping them, along with other possible operations.

A small abstraction: We can see some small patterns here already for searching for a field. Assume we know the field comes from schema S, for example, the target schema; that it comes from a target with name N; and that it has type T. Then to search for a field with a (name, value) pair, we’ll see queries like:

SELECT *
FROM S_fields_T
WHERE (
    target_name = N AND
    field_name = name AND
    field_value = value
);

Summary: This model relies heavily on joins. There is one select-statement for each separate filtering condition, i.e., for a filter applied to any one field value. Each of these must be joined together, i.e., we’re reducing the sequence of subqueries with a JOIN operator.

One nice thing is that each filtering operation reduces the size of the data, meaning the input table to the next join is smaller.

Adding new metrics

What does adding a new metric in the field-unrolling setup look like? We’re given a new pair of (target_schema, metric_schema). Verify that these do not yet exist, and return an error if they do. If not, we simply insert those schema into the target_schema and metric_schema tables, respectively. New measurements added that correspond to these schema should be verified to conform to them, but there’s not a lot of other work.
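
As a sketch, registering a hypothetical "switch-port" target with a "bytes-sent" metric amounts to two inserts into the metadata tables:

INSERT INTO target_schema (*) VALUES
    ('switch-port', ['rack', 'switch', 'port'], ['UUID', 'UUID', 'int']);
INSERT INTO metric_schema (*) VALUES
    ('bytes-sent', [], [], 'cumulative', 'scalar', 'int', []);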

Adding new fields

Adding new fields is slightly more complicated. The entries in the schema tables must be updated, by adding a new (field_name, field_type) entry to each array. These tables should remain small, proportional to the number of target schema and metric schema respectively. (On the order of thousands of rows in each seems plausible, as that is sufficient to capture millions of timeseries schema, and the cardinality of each of those schema types defines the total number of timeseries. I.e., for 64-bit integers, if the timeseries schema has 4 columns, that’s (2^64)^4 possible timeseries. That’s... a lot.)
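
A sketch of that update, adding a hypothetical "firmware-rev" field to the disk target. ClickHouse mutations cannot rewrite sorting-key columns such as field_names, so one simple approach is to replace the schema row outright:

ALTER TABLE target_schema DELETE WHERE name = 'disk';
INSERT INTO target_schema (*) VALUES
    ('disk',
     ['rack', 'sled', 'slot', 'model', 'serial', 'disk-id', 'firmware-rev'],
     ['UUID', 'UUID', 'int', 'string', 'string', 'UUID', 'string']);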

There’s nothing that needs to be done to update the measurement tables when new fields are added. New measurements will be unrolled into more rows, as there are now more fields in each measurement. Old data from earlier "versions" of the schema will not have these fields, which means that queries which select for measurements with those fields will not return those rows. Conversely, queries which filter out measurements with those fields will also not return them. Queries which don’t mention those fields at all will return them. This seems like the desired semantics — the answer to the question, "Where does this measurement fall on this axis?" should be "N/A".

Dynamic tables

This database organization takes a very different approach. The intent is to map more closely with the description in the Monarch paper, providing faster queries with stronger type safety. The tradeoff is more work to "manage" the database, in that we’re performing more database-level operations like creating tables, not that we need to tend to the DBMS in the operations or SRE sense.

The idea is straightforward: create a new table for each (target_schema, metric_schema) pair. Rather than encoding the schema for each target and metric in a metadata table (the target_schema table, for example), the schema is encoded directly in the measurement table itself. This removes a level of "meta" in the metadata required, and it reduces the number of joins required (to zero) for the simple and extremely common case of searching within a single timeseries schema.

Table schema

We’ll continue with the same examples as above. It may be useful to keep the same target_schema and metric_schema metadata tables as above, since these contain static data such as units that probably shouldn’t live in each row of the actual measurements table. But at that point, rather than inserting a new row for each measurement into the shared measurements_* tables, we build an entirely new table just for this timeseries schema:

CREATE TABLE `oxide-service:request-latency`
(
    name String,
    instance UUID,
    path String,
    method String,
    response_code Int64,
    timeseries_id String MATERIALIZED arrayStringConcat(
        [name, toString(instance), path, method, toString(response_code)],
        ':'
    ),
    /* A 64-bit timestamp with microsecond precision, UTC timezone. */
    timestamp DateTime64(6, 'UTC'),
    request_latency Nested(
        bins Int64,
        counts Int64
    )
)
ENGINE = MergeTree()
PRIMARY KEY (name, instance, path, method, response_code);

Note that timeseries_id is not in the primary key. This is intentional. The main reason here is that most queries will not have access to this, since they’re by definition looking for timeseries where that ID is only partially known. Also, that ID is the most fine-grained key for each timeseries, so adding it to the primary key before the others (name, instance, etc.) adds no benefit to searching and only wastes space. That is, everything with the same timeseries_id must have the same values for the remaining key columns, so the ID provides no additional information.

Inserting

Inserting measurements in this case is straightforward. Given a sample, just… insert it into the right table.
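
For example, a single request-latency sample might be inserted as follows. All values are hypothetical, and the materialized timeseries_id column is computed automatically:

INSERT INTO `oxide-service:request-latency`
    (name, instance, path, method, response_code, timestamp,
     `request_latency.bins`, `request_latency.counts`)
VALUES
    ('nexus', generateUUIDv4(), '/instances', 'GET', 200, now64(6),
     [100, 200, 400, 800], [10, 5, 2, 0]);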

Querying

Querying is also more straightforward in the dynamic table setup. One must know the tables to search, but the list of available tables is either known statically in an application (for example, the metrics it exports are known) or easily retrieved from the metadata tables. But once the (target_schema, metric_schema) pair is known, the query is simple.
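
For example, a client could list the available measurement tables from ClickHouse’s system tables, relying on the `target:metric` naming convention used here:

SELECT name
FROM system.tables
WHERE database = currentDatabase() AND name LIKE '%:%';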

HTTP request counts
SELECT avg(request_latency)
FROM `oxide-service:request-latency`
WHERE (
    name = 'nexus'
);

The averaging is not entirely correct again, since it’s not yet clear what methods CH provides for summing arrays together elementwise.

CPU utilization
SELECT project, instance, cpu_id, cpu_busy
FROM `virtual-machine:cpu-busy`
WHERE (
    project = 'some-uuid' AND
    instance = 'some-other-uuid' AND
    cpu_id = 1
)
ORDER BY timestamp DESC
LIMIT 1;

Disk temperature
SELECT rack, avg(temperature) AS avg_temp
FROM `disk:temperature`
WHERE (
    model = 'Samsung whatever'
)
GROUP BY rack
ORDER BY avg_temp DESC;

Adding new metrics

Adding new targets or metrics is much more troublesome in the dynamic-tables setup. The new target and metric schema must be inserted into the corresponding metadata tables. But we must also create an entirely new table. The query itself for this is simple: CREATE TABLE `target_name:metric_name`, with the schema similarly transcribed. The issues are mostly around organizing the database itself in a consistent fashion. Who creates the tables? What if they already exist, but have different schema? What if the existing schema has a subset of the columns in the new request? What if the existing schema has columns that the new request doesn’t have? It sounds much easier to accidentally drop a table in this setup, given that we’ll be frequently manipulating them.
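
A sketch of such a table creation for the disk temperature example, using IF NOT EXISTS to make the operation idempotent. Note that this only sidesteps the error when the table already exists; it does nothing to detect a conflicting schema:

CREATE TABLE IF NOT EXISTS `disk:temperature`
(
    rack UUID,
    sled UUID,
    slot Int64,
    model String,
    serial String,
    disk_id UUID,
    sensor_id String,
    timestamp DateTime64(6, 'UTC'),
    temperature Float64
)
ENGINE = MergeTree()
PRIMARY KEY (rack, sled, slot, model, serial, disk_id, sensor_id);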

None of these are insurmountable, but they describe a general class of problem. DBMSs are designed to consistently and correctly manage data within a table, but there are much weaker guarantees (if any) about managing a database as a whole. CH uses an "atomic" database engine by default. The documentation claims this supports non-blocking table drops, renames, or swapping of two tables.

The larger unknown is around resources. It sounds plausible that we’d have thousands of tables in a single database, using this setup. We’d get to that with only on the order of 10 targets each with 100 metrics. (Note that there is not a table per timeseries, but a table per timeseries schema.) How will CH perform in this setup? It’s designed for efficiency with large, but few, tables in mind, using heavily vectorized operations to keep the CPU cache hot. It’s pretty resource-hungry already, and adding many more tables to a database may prove too much.

There is a way to alleviate this, inspired by the Monarch paper again — sharding. We could shard the set of all tables across multiple CH servers, so that each server is responsible for a small portion of timeseries tables. This has a cost, of course, in the horizontally scaled fleet of CH servers, taken from the pool of resources we are selling to our customers. Also, that sharding requires coordination, for which Oxide would be responsible.

Adding new fields

Adding new fields is thankfully not as dramatic as adding new schema. We only need to record the new field in the schema metadata tables, and add a new column to the affected measurement tables themselves. ClickHouse has good support for adding columns, which makes sense given the columnar storage model.

Older records will return the default for that type (an empty string, 0, etc). This may actually be a problem, since at least some of those are valid values for that type (0 for integers being the most egregious case). It doesn’t seem possible to distinguish between a vCPU with a CPU ID of 0 and a previous measurement that originally lacked the CPU ID column.
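
A sketch of such an addition on the dynamic CPU table, with a hypothetical new field. Declaring the column Nullable is one possible mitigation for the default-value ambiguity, since older rows then read back as NULL rather than 0, at some storage and query cost:

ALTER TABLE `virtual-machine:cpu-busy`
    ADD COLUMN thread_id Nullable(Int64);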

Summary

In general this plan is more straightforward for querying, but much more complicated in its implementation. There are no JOINs, unless you want to correlate multiple different timeseries. (And correlating two timeseries with the same schema is a self-join, which may be more efficient.) The table contains data from multiple timeseries, but data from any one timeseries is contiguous. This also shows the benefit that’s brought up in the Monarch paper: related timeseries are closer together, which should provide overall benefits to compression and query efficiency.

The complexity of this implementation is a bit daunting, though. It’s not clear exactly how much work would be required, but the work is significantly less well-understood. More investigation into ClickHouse’s performance with thousands of tables is a must. We’d need a way to manage the tables in the database consistently across a rack or datacenter. That would likely involve the general control plane database (CockroachDB), which is where we’d like to store information with strong consistency requirements. Only the schema would need to be stored here, but the point stands that this arrangement is likely a more challenging distributed system to build.

Determinations

This RFD explores two general methods for modeling metric data: field-unrolling and a dynamic tables approach. The two models are similar in many respects: they support our desired data types, are similarly extensible, are strongly typed, and discoverable. There are two main axes along which the models differ: efficiency and complexity.

The field-unrolling case results in far fewer tables, and instead relies on joins to extract a timeseries by some filtering criteria. These operations can be expensive, but it’s not obvious how costly they will be in practice.

While the dynamic tables approach is likely more efficient (fewer joins), it also implies a lot of complexity at the level of the database, which must be managed by Oxide software (creating or modifying tables, changing table schema). And although the dynamic tables model may be more efficient when querying a single table, it’s not clear how ClickHouse will perform with thousands of tables!

The other important point is that neither of these is irreversible. A client library can be built to hide such complexity, and prototyping seems more important than further experimentation at this point.

We’ve decided to move forward with the field-unrolling model. Metadata tables will store target and metric schema, and separate field and measurement tables will be correlated to extract timeseries data itself.

Open questions

Pre-quorum data

A stated non-goal of this system is storing telemetry and metric data that’s generated prior to achieving a trust quorum for the rack. This raises the critical question: what do we do with that data? It will likely be crucial for debugging sequences like cold start, especially so during early product iteration. How do we collect, store, and query such data?

Managing data with different timescales

A side-effect of adding many fields to each metric is that some of those fields don’t change very often. For example, we might store information related to disk IOPS, and want to keep track of an identifier for the disk in question. But given the volume of that data, it seems extremely wasteful to store a UUID (or some other identifier) inline. The measurements may have a relevant timescale of milliseconds, while the ID only changes over the course of months.

There are two related tensions here. We may expend storage space to make that query faster, for example by reducing joins or other cross-table correlations. Relatedly, while storing the data out of line could reduce storage costs, that system is more complex to design, build, and operate.

How do we decide between these?

Non-covered data

There are several important types of data that are not well-covered here. Though strings are supported, this system is not meant to ingest and store large amounts of logging data. A system like Elasticsearch would be much better suited to such semi- or unstructured data.

Distributed tracing spans are another important type of data that is not easily fit into this model. The tracing-rs ecosystem will be helpful to generate this data, but it’s not clear how to collect and store this. As with logs, a more tailored solution such as OpenTelemetry may be appropriate.

External References