Telemetry and metric data form an essential feature of the Oxide rack. Software
and hardware components throughout the rack produce data, which is collected by
a program called oximeter and stored in a ClickHouse database. This RFD
describes the Oximeter Query Language (OxQL), a domain-specific language
(DSL) for querying and processing the stored data. It is intended to be used as
a building block for both Oxide-internal and customer-facing software.
The following content is (mostly) intended to be appropriate for end-user documentation. After publication, we expect that the majority of this RFD can be re-published as user documentation, which would then become the document of record for OxQL. This RFD will not be updated further.
Background
Motivation
Oxide is in a unique position. Our product takes a holistic approach to rack-scale computing infrastructure, with hardware and software co-designed and sold to customers as a unit. We therefore have a unique opportunity to provide those customers deep insight into their infrastructure, with telemetry data sourced from the smallest hardware sensors all the way to the largest programs. While collecting data is a necessary first step, it’s critical that we also provide a powerful system with which to access this data. This is the purpose of OxQL.
OxQL should enable users to understand their Oxide deployments. In addition to extracting and exporting raw data, OxQL can process the data to provide insight and serve as a foundation for observability tools ranging from graphical dashboards to alerting mechanisms.
Why a new language?
There already exist many languages for querying metric data. Why do we need a new one? The short answer is that there is no existing language that obviously satisfies our needs. Many telemetry systems have implemented their own DSL, and each tends to be tailored (intentionally or not) to the underlying data model and storage mechanisms. Standard languages like SQL enjoy wide usage in relational databases, but are not as well suited to timeseries data. SQL is also complex and arcane, having accrued a wide array of syntax and semantics over its many decades of use. Adopting SQL would require deciding which subset of SQL to support, and how to translate that into an efficient query on the underlying data storage, an area of continual research and improvement even for battle-tested database engines. That work is likely to be expensive and time-consuming.
Additionally, it’s important that our query system integrate well with other parts of the Oxide rack, both hardware and software. Incorporating an off-the-shelf language requires either tailoring the language to Oxide systems, or the other way around. Neither of these is particularly attractive.
Finally, we should also note that adopting any language will require effort to interpret and translate statements into queries on the underlying tables in ClickHouse, which reduces the differential effort between using an existing language and building our own.
In other words, while we could adopt an existing language, it would probably be a similar amount of work to writing our own; provide only superficial compatibility and familiarity; and likely constrain us in ways that drive us to evolve the language or break compatibility further.
Other background
OxQL is designed to query the timeseries data produced by various components in the Oxide rack. The data model for such data is described in detail in [rfd161]. [rfd162] describes the architecture of the collection system itself. This RFD includes a brief overview of the data model, with a focus on the pieces relevant to OxQL. [rfd442] explores how other, related platforms expose metric data. That survey led us to adopt our own DSL.
Key examples
To motivate the design of OxQL, it is helpful to look at a few key examples. These will be used throughout the document to illustrate OxQL’s main concepts. These queries are accessible enough to give a general sense of the language and its core elements.
Selecting raw data
Let’s start by looking at a simple query. The query below selects the number of bytes sent out of sleds' data links over the last minute.
get sled_data_link:bytes_sent
| filter timestamp > @now() - 1m
This query shows a few key features of OxQL:
1. First, we can select raw data from a named table, using the get table operation. In this case, we’re selecting data from the sled_data_link:bytes_sent table.
2. Data can be piped between table operations, using the familiar Unix pipe symbol, |.
3. We use the filter table operation to limit the returned data, in this case based on the timestamp of the data. Data may also be filtered by comparing field values, which we’ll see later.
Running this query on a live system, we might see data like the following:
Note: the oxide CLI command below returns raw data as a detailed JSON object. For clarity and exposition purposes, a Python script is used to format the output. See the details below for more information.
OxQL queries can be submitted to the Oxide API through the timeseries_query API call, or its corresponding HTTP endpoint. This returns data as JSON, which has a rich structure with both data and metadata. The full schema for that object can be found in the OpenAPI document.
The output is quite verbose, however, and so the following Python script is used
in the examples to simplify and format it. It will accept the output of the
command oxide experimental timeseries query
on standard input, and print the
formatted results shown in this document on standard output. The script uses
only modules from the standard library for portability. The examples assume it
is in the current directory, executable, and named fmt-query.py
.
fmt-query.py
#!/usr/bin/env python3
import json
import sys


def format_points(maybe_start_times, timestamps, values):
    # Cumulative and delta samples have start times; gauge samples do not.
    if maybe_start_times is None:
        for (timestamp, value) in zip(timestamps, values):
            value = list(value) if len(value) > 1 else value[0]
            print(f"  {timestamp}: {value}")
    else:
        for (start_time, timestamp, value) in zip(maybe_start_times, timestamps, values):
            value = list(value) if len(value) > 1 else value[0]
            print(f"  {start_time} - {timestamp}: {value}")


def format_one_timeseries(ts):
    fields = ts["fields"]
    for field_name in sorted(fields.keys()):
        data = fields[field_name]
        typ = data["type"]
        value = data["value"]
        print(f" {field_name} ({typ}): {value}")
    points = ts["points"]
    format_points(
        points.get("start_times"),
        points["timestamps"],
        # Transpose the per-dimension value arrays into one tuple per sample.
        zip(*(pts["values"]["values"] for pts in points["values"])),
    )


def format_timeseries(table: str, timeseries: dict):
    print(f"\n{table}\n")
    for ts in timeseries.values():
        format_one_timeseries(ts)
        print()


if __name__ == '__main__':
    data = json.load(sys.stdin)
    for table in data["tables"]:
        table_name = table["name"]
        timeseries = table["timeseries"]
        format_timeseries(table_name, timeseries)
$ oxide experimental timeseries query --query \
    'get sled_data_link:bytes_sent | filter timestamp > @now() - 1m' | \
    ./fmt-query.py

sled_data_link:bytes_sent

 kind (string): vnic
 link_name (string): oxControlService10
 rack_id (uuid): de608e01-b8e4-4d93-b972-a7dbed36dd22
 sled_id (uuid): 5f6720b8-8a31-45f8-8c94-8e699218f28b
 sled_model (string): 913-0000019
 sled_revision (u32): 6
 sled_serial (string): BRM42220017
 zone_name (string): oxz_crucible_2e32fdcc-737a-4430-8290-cb7028ea4d50
  2024-08-14T03:17:43.131022517Z - 2024-08-14T03:37:58.138871843Z: 4086
  2024-08-14T03:37:58.138871843Z - 2024-08-14T03:38:08.143299789Z: 172
  2024-08-14T03:38:08.143299789Z - 2024-08-14T03:38:18.146477647Z: 0
  2024-08-14T03:38:18.146477647Z - 2024-08-14T03:38:28.151284452Z: 0

 kind (string): physical
 link_name (string): cxgbe1
 rack_id (uuid): de608e01-b8e4-4d93-b972-a7dbed36dd22
 sled_id (uuid): 2707b587-9c7f-4fb0-a7af-37c3b7a9a0fa
 sled_model (string): 913-0000019
 sled_revision (u32): 6
 sled_serial (string): BRM44220011
 zone_name (string): global
  1986-12-28T00:00:35.461760437Z - 2024-08-14T03:37:53.041474188Z: 6683986815
  2024-08-14T03:37:53.041474188Z - 2024-08-14T03:38:03.047111912Z: 105039519
  2024-08-14T03:38:03.047111912Z - 2024-08-14T03:38:14.132695318Z: 117618346
  2024-08-14T03:38:14.132695318Z - 2024-08-14T03:38:24.138699820Z: 106764171
  2024-08-14T03:38:24.138699820Z - 2024-08-14T03:38:34.144176654Z: 105105116

<... output continues ...>
Let’s break down this output.
1. The entire output from this query is a table. It has a name, shown here. In this case, we’re selecting raw data from a table in the database, and so the output has the same name as that table, sled_data_link:bytes_sent. We’ll see in later examples that this isn’t always the case.
2. This paragraph shows one timeseries in the sled_data_link:bytes_sent table. Every timeseries in the table has the same field names and field types. For example, every timeseries in this table has a field named kind with type string.
3. The field values for this timeseries. While every timeseries has the same field names and types, the set of all field values for this timeseries is unique. The values uniquely identify the timeseries within its table, similar to a primary key in a relational database. This particular timeseries has the value vnic for its kind field.
4. After the field values are the data points, one on each line. We can see three columns here: a start time, a timestamp, and a data value. The start time reflects the fact that this is a cumulative counter, where the data measures the total number of bytes sent between the start time and timestamp. Note that the start time of each sample is equal to the timestamp of the preceding sample, if there is one.
5. A different timeseries in the same table. Note that the names and types of the fields are the same as in (2), but the values for the fields differ (e.g., the sled_serial field has value BRM44220011 here).
Selecting raw data can be quite useful, especially when first examining a system or its behavior. It can illuminate the basic shape and structure of the data, such as fields and their values; data and timestamp ranges; etc. Still, we often want to process the raw data in various ways, to better understand a system’s behavior and diagnose issues. The next queries will explore how to do that.
Average network usage
As we saw above, data can be piped from one table operation into another. In addition to filters, table operations can also be used to compute averages or other summaries of the data. The query below computes the number of bytes sent over the last hour, averaged into 5 minute windows.
get sled_data_link:bytes_sent
| filter timestamp > @now() - 1h
| align mean_within(5m)
Its output might look like this:
sled_data_link:bytes_sent

 kind (string): vnic
 link_name (string): oxControlService5
 rack_id (uuid): de608e01-b8e4-4d93-b972-a7dbed36dd22
 sled_id (uuid): dd83e75a-1edf-4aa1-89a0-cd8b2091a7cd
 sled_model (string): 913-0000019
 sled_revision (u32): 6
 sled_serial (string): BRM42220016
 zone_name (string): oxz_crucible_0e708ee3-b7a6-4993-a88a-4489add33e29
  2024-08-15T20:01:53.730368377Z: 41571.83377696665
  2024-08-15T20:06:53.730368377Z: 10126.403066282965
  2024-08-15T20:11:53.730368377Z: 10375.428410271988
  2024-08-15T20:16:53.730368377Z: 35976.89723356015
  2024-08-15T20:21:53.730368377Z: 15632.43617543621
  2024-08-15T20:26:53.730368377Z: 10540.75845058832
  2024-08-15T20:31:53.730368377Z: 9825.903589579711
  2024-08-15T20:36:53.730368377Z: 10653.3210599372
  2024-08-15T20:41:53.730368377Z: 9844.633858761896
  2024-08-15T20:46:53.730368377Z: 9554.318609179172
  2024-08-15T20:51:53.730368377Z: 10030.136672115084
  2024-08-15T20:56:53.730368377Z: 10705.870660262928

 kind (string): physical
 link_name (string): cxgbe1
 rack_id (uuid): de608e01-b8e4-4d93-b972-a7dbed36dd22
 sled_id (uuid): db183874-65b5-4263-a1c1-ddb2737ae0e9
 sled_model (string): 913-0000019
 sled_revision (u32): 6
 sled_serial (string): BRM42220009
 zone_name (string): global
  2024-08-15T20:01:53.730368377Z: 3851101.689250712
  2024-08-15T20:06:53.730368377Z: 3123494.0195476883
  2024-08-15T20:11:53.730368377Z: 3134962.1962845605
  2024-08-15T20:16:53.730368377Z: 3331401.7208908484
  2024-08-15T20:21:53.730368377Z: 3322688.084577879
  2024-08-15T20:26:53.730368377Z: 3143185.7570474516
  2024-08-15T20:31:53.730368377Z: 3227492.9576822957
  2024-08-15T20:36:53.730368377Z: 3098943.827306392
  2024-08-15T20:41:53.730368377Z: 3163852.61823991
  2024-08-15T20:46:53.730368377Z: 3103095.5405027973
  2024-08-15T20:51:53.730368377Z: 3181145.248283609
  2024-08-15T20:56:53.730368377Z: 5273885.797166327

<... output continues ...>
1. It’s important to notice that the initial portion of the query here is the same as the previous example. This shows a strength of OxQL: building complex processing pipelines incrementally, by piping table operations into one another. However, we also see a new table operation, align, on the final line.
2. The align table operation. When merging data between timeseries, it’s important that the timestamps be aligned, so that all timestamps occur at regular intervals. The align table operation uses an alignment function to determine (1) the size of the output time window and (2) how to combine data in each timeseries that falls within a window. Here, the mean_within alignment function is used. This alignment function computes the average of the input data in each timeseries within non-overlapping windows. It takes one argument, which is the window size, here 5 minutes, which is exactly the interval of the output samples shown below.
3. We can see that each timeseries in the output table has exactly the same fields as the earlier query. We haven’t modified which timeseries are returned by this query, only what data points are returned.
4. The data now shows the average number of bytes sent over 5 minute windows. There are two important things to note here. First, the start time is now gone, because it’s implied by the timestamp and the 5 minute window passed to the mean_within alignment function. Second, the values are now floating point numbers, because we’ve averaged many values. (This is also reflected in the full JSON output, which explicitly lists the type of data returned in each query, not just in the numerical values themselves.)
Average network usage by sled
In the preceding example, we saw that we can average data over time within one timeseries. (The number of timeseries in the table didn’t change.) But in many situations one actually wants to average data between timeseries. For example, one might be interested in the aggregate network usage for each sled, regardless of which data link was used to actually send data. That can be done by grouping timeseries together.
The query below computes the number of bytes sent in the last hour, averaged over 5 minute windows, for every sled.
get sled_data_link:bytes_sent
| filter timestamp > @now() - 1h
| align mean_within(5m)
| group_by [sled_serial]
1. As in the previous examples, we can see that the first portion of this query is identical. We’re piping outputs together, and so we can extend the query by adding the new table operation at the end.
2. This is a group_by table operation. It accepts a list of field names. For each timeseries in the input table, it groups together all those which have the same values for the listed fields. So in this case, all timeseries with the same sled serial number will be grouped.
The data for this query might look like this:
sled_data_link:bytes_sent
sled_serial (string): BRM42220031
2024-08-15T20:22:27.190710536Z: 496949.09648107865
2024-08-15T20:27:27.190710536Z: 109857.58221837238
2024-08-15T20:32:27.190710536Z: 109232.30921474376
2024-08-15T20:37:27.190710536Z: 105512.25538612343
2024-08-15T20:42:27.190710536Z: 626023.2184284237
2024-08-15T20:47:27.190710536Z: 106813.24027380218
2024-08-15T20:52:27.190710536Z: 105864.14432594774
2024-08-15T20:57:27.190710536Z: 107299.78825178408
2024-08-15T21:02:27.190710536Z: 105852.9834382634
2024-08-15T21:07:27.190710536Z: 111337.96568275186
2024-08-15T21:12:27.190710536Z: 108153.90428668828
2024-08-15T21:17:27.190710536Z: 110668.32619822676
sled_serial (string): BRM42220014
2024-08-15T20:22:27.190710536Z: 568670.8494093848
2024-08-15T20:27:27.190710536Z: 510732.8644817208
2024-08-15T20:32:27.190710536Z: 507523.0107900741
2024-08-15T20:37:27.190710536Z: 525524.4641971652
2024-08-15T20:42:27.190710536Z: 514333.71314438555
2024-08-15T20:47:27.190710536Z: 512395.6909131015
2024-08-15T20:52:27.190710536Z: 508051.0772952132
2024-08-15T20:57:27.190710536Z: 506067.695052197
2024-08-15T21:02:27.190710536Z: 792672.5493766736
2024-08-15T21:07:27.190710536Z: 511381.2428469829
2024-08-15T21:12:27.190710536Z: 501101.7476959403
2024-08-15T21:17:27.190710536Z: 516705.9677027045
<... output continues ...>
We can see that each timeseries in the output table has just one field, sled_serial, which is the field listed in the group_by operation. Each timeseries in the output table is generated by averaging all the timeseries in the group, i.e., all those with the value shown here for the sled_serial field. See below for more details on the group_by table operation.
Joining network input and output
The above example shows how one might look at average network output over time. However, understanding system behavior or diagnosing issues often requires correlating multiple pieces of data at once. In this case, one might also want to know the number of bytes received during this same time period.
This requires merging data from two different tables, not just two timeseries
within the same table. We can do that with a subquery and the join
table
operation, which looks like this:
{
get sled_data_link:bytes_sent;
get sled_data_link:bytes_received
}
| filter timestamp > @now() - 5m
| align mean_within(1m)
| join
The output of the above query might look like this:
sled_data_link:bytes_sent,sled_data_link:bytes_received

 kind (string): vnic
 link_name (string): oxControlService5
 rack_id (uuid): de608e01-b8e4-4d93-b972-a7dbed36dd22
 sled_id (uuid): 7b862eb6-7f50-4c2f-b9a6-0d12ac913d3c
 sled_model (string): 913-0000019
 sled_revision (u32): 6
 sled_serial (string): BRM44220010
 zone_name (string): oxz_crucible_0e38475e-b8b2-4813-bf80-3c170081081a
  2024-08-14T04:35:17.063247129Z: [0.3079514963268704, 1.1129475130409705]
  2024-08-14T04:36:17.063247129Z: [0.0, 57.333333333333336]
  2024-08-14T04:37:17.063247129Z: [21.0, 0.0]
  2024-08-14T04:38:17.063247129Z: [28.666666666666668, 286.6666666666667]
  2024-08-14T04:39:17.063247129Z: [25.2, 0.0]

 kind (string): physical
 link_name (string): cxgbe0
 rack_id (uuid): de608e01-b8e4-4d93-b972-a7dbed36dd22
 sled_id (uuid): b886b58a-1e3f-4be1-b9f2-0c2e66c6bc88
 sled_model (string): 913-0000019
 sled_revision (u32): 6
 sled_serial (string): BRM42220006
 zone_name (string): global
  2024-08-14T04:35:17.063247129Z: [72891.30744207311, 57202.59302643966]
  2024-08-14T04:36:17.063247129Z: [85104.66198650173, 67958.87151877217]
  2024-08-14T04:37:17.063247129Z: [83203.29649990669, 66604.95842044336]
  2024-08-14T04:38:17.063247129Z: [84199.90222011972, 66680.09252436645]
  2024-08-14T04:39:17.063247129Z: [79358.75400917938, 63100.706405006815]

<... output continues ...>
1. A subquery is shown here, nested in curly braces ({}). A subquery can contain any number of queries (or subqueries!), each separated by semicolons (;). In this case there are two: we first select the number of bytes sent, and then the bytes received, on a compute sled's data links. Note that this means the output of the subquery has two tables, one for each of its queries.
2. As we’ve seen in the previous examples, we are reducing the amount of data with the filter table operation. Note that in this case, the filter applies to the entire output of the subquery, so to both tables! (One can also apply filters inside the subquery, if needed.)
3. As we’ve also seen before, we’re aligning the tables, this time using a 1 minute window. Just like with group_by, tables must be aligned in order to join them.
4. Finally, the query uses the join table operation to combine the results from the two tables. The join operation combines the timeseries from each table that have the same field values. Note that this works because the two tables in the subquery have the same field names and types!
5. The table output from the join table operation has a name that’s derived from the two input table names. In this case, by joining them with a ,.
6. OxQL combines the data values from two tables in a join operation by concatenating the values that occur at the same time. (This is why the tables must be aligned first.) In this case, since there are two tables in the subquery, the output also has two data values at each timestamp. It is now two-dimensional, showing the number of bytes sent and received at every time point, respectively.
Note: the join operation is equivalent to a natural inner join. Other kinds of joins (like outer joins) may be supported in the future, which would allow combining tables with different schema.
Intro summary
By examining a few different example queries, we’ve seen many of the major features of OxQL:
OxQL operates on tables, each of which is a set of timeseries with the same schema: the same field names, field types, and datum type.
Each table contains some number of timeseries, which are identified by their unique field values.
Table operations are used to select or process the timeseries in a table.
Table operations can be piped together to build up complex queries.
Subqueries are written by enclosing multiple queries in braces.
After this quick tour of OxQL, the remaining sections are a detailed reference of the language. We’ll start with the main concepts and terminology.
Terminology
Tables and timeseries
A table is a set of data corresponding to a fixed schema. The schema includes the table name; a set of typed key-value pairs called fields; a datum type; and a metric type.
A timeseries is one element in a table, with unique values for the fields. There may be any number of timeseries in a table. All of them have the same schema, but each is uniquely identified by these field values. The values for the fields are used to generate a timeseries key, which is essentially a hash of the unique data identifying the timeseries. Note that the key is used during query processing and to organize the output data, but it is rarely useful in the query output.
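To illustrate the idea of a timeseries key, here is a minimal Python sketch: a stable hash over the sorted field names and values. The hash function and encoding here are hypothetical; the actual key derivation inside oximeter is an internal implementation detail.

import hashlib

# Hypothetical sketch: derive a stable key from a timeseries's field
# values. The real oximeter key computation is an implementation detail.
def timeseries_key(fields: dict) -> str:
    hasher = hashlib.sha256()
    for name in sorted(fields):
        hasher.update(f"{name}={fields[name]}".encode("utf-8"))
    return hasher.hexdigest()

print(timeseries_key({"kind": "vnic", "sled_serial": "BRM42220017"}))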
Fields
A field is a typed key-value pair. It consists of a name and a value, which is one of the following types:
1-, 2-, 4-, or 8-byte integers, both signed and unsigned
Booleans (true or false)
Strings (UTF-8)
UUIDs (16-octet values, often random, used for unique resource identifiers)
IP addresses (v4 or v6)
Fields are used to provide context for the data in a timeseries. They are also used to identify a timeseries, as discussed above, by generating the timeseries key.
In our examples above, the field sled_serial
is a string
serial number for a sled, while the sled_id
field is a UUID. These identify
the compute sled from which the data is derived, for example. Other fields
provide additional context; for example, the link_name
identifies the
individual network data link in question.
Datum types
The datum refers to the measured value of a single sample in a timeseries. It is always associated with a timestamp, and may be associated with a start time if the datum is cumulative.
A datum may have one of the following types:
1-, 2-, 4-, or 8-byte integers, both signed and unsigned
Booleans (true or false)
Strings (UTF-8)
4- or 8-byte IEEE-754 floating point numbers
Histogram (see the following section)
Histograms
In addition to the types above, a datum may also be a histogram. Histograms
let us compactly represent a large amount of data. Rather than individual
values, we store a set of bins and counts within those bins. At each
bin, we keep track of the total number of samples whose value falls between the
upper and lower edges of the bin. For example, one might track the I/O latency
on a disk using a histogram, because there may be a huge number of total I/Os in
any time interval. For oximeter
histograms, the bins may be any integer or
floating-point type, and the counts are always 64-bit unsigned integers.
Let’s look at an example histogram, one tracking those I/O latencies on a disk.
Start time | Timestamp | Bins (milliseconds) | Counts |
---|---|---|---|
17:44:22 | 17:44:22 | [1, 2, 4, 8, … ] | [0, 0, 0, 0, … ] |
17:44:22 | 17:44:32 | [1, 2, 4, 8, … ] | [0, 2, 9, 13, … ] |
17:44:22 | 17:44:42 | [1, 2, 4, 8, … ] | [1, 8, 24, 78, … ] |
The column "Bins" shows the right edges of the first few bins. Each is a time in milliseconds. As I/Os are completed, their latency is measured, and we look up the smallest bin in this array that is larger than the latency, and increment the count in that bin by 1.
Histograms greatly compress the underlying data. There may be thousands or millions of I/Os every second on modern SSDs, while our histogram may only have a few tens or hundreds of buckets. We lose information, since we don’t know the exact latency of any I/O, only that it was within the appropriate bin. In exchange, we can accurately capture information about the distribution of a large number of I/Os, while using a fixed amount of memory per datum. The memory usage is a function of the number of bins, not the number of samples.
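As a concrete illustration of the binning rule described above, the following Python sketch records latency samples into a histogram whose bins array holds the right edges of each bin, using the standard bisect module. The specific edges and samples are invented for the example.

import bisect

bins = [1, 2, 4, 8, 16]          # right bin edges, in milliseconds (example values)
counts = [0] * (len(bins) + 1)   # one extra slot for samples past the last edge

def record(latency_ms: float) -> None:
    # Find the smallest bin whose right edge is larger than the sample,
    # and increment its count, as described above.
    counts[bisect.bisect_right(bins, latency_ms)] += 1

for sample in (0.5, 3.0, 3.9, 7.2, 100.0):
    record(sample)
print(counts)   # [1, 0, 2, 1, 0, 1]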
Missing data points
Sometimes we fail to collect a sample for a timeseries. For example, we might fail to access the hardware containing a temperature sensor, or be unable to reach a peer over the internal network to determine its state. In these situations, we may generate a missing datum.
Missing data points are represented by the JSON value null
in the output of a
query. Missing points still contain lots of the information that a valid sample
does, including a timestamp, datum type, and start time for cumulative counters
or histograms. It is often important to consider missing data. By providing an
explicit null
as the value along with the timing and type information
present for other points, one can accurately distinguish the absence of valid
data from the absence of any measured activity at all.
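Because missing points are explicit nulls rather than silently omitted samples, consumers of query output should be prepared to encounter them. A minimal Python sketch, using the simplified value lists from this document's examples (the values themselves are invented):

# Missing points appear as JSON null, i.e., None after parsing in Python.
values = [4086, None, 172, None, 0]

present = [v for v in values if v is not None]
print(f"{len(values) - len(present)} of {len(values)} points are missing")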
Metric type
The datum type explored in the previous section describes the range of values that a datum can take. Data in a timeseries also has a metric type, which captures how data at different times relates to each other. The metric type takes one of three values:
Gauges
Cumulative counters
Deltas
Gauges
A gauge represents an instantaneous measurement. It captures the value of an underlying metric at one point in time. Each data point is completely independent of the others: different measurements of some underlying quantity taken at different timestamps. For example, the number of items resident in a queue at any time is a gauge, shown below.
Timestamp | Value |
---|---|
17:44:22 | 3 |
17:44:32 | 7 |
17:44:42 | 0 |
17:44:54 | 10 |
At each timestamp, a new, independent sample is recorded. The values may go up or down arbitrarily — in this case, as the queue receives and processes items at different rates.
Cumulative counters
In contrast to gauges, which are instantaneous measurements, a cumulative counter tracks a total value over a range of time. It has a start time in addition to its timestamp, which marks the beginning of its time range. Its value represents the accumulated total encountered from the start time up to the recorded timestamp.
Cumulative counters are widely used in the Oxide product. It’s always possible that an attempt to collect a sample fails — software may be unable to access a hardware sensor or reach a peer on the internal network. Cumulative counters still provide information about the period of time we missed a sample, because the next acquired sample will count events that occurred during that time. Cumulative counters are also useful for identifying program crashes, which are indicated by a reset of the counter's start time. See the section on missing data points for more information.
Here is an example cumulative counter.
Start time | Timestamp | Value |
---|---|---|
17:44:22 | 17:44:22 | 0 |
17:44:22 | 17:44:32 | 1 |
17:44:22 | 17:44:42 | 8 |
17:44:44 | 17:44:52 | 1 |
Continuing with our example of a queue, suppose this cumulative counter tracks the total number of items sent on the queue. (The gauge above tracked the number of items in the queue at any one time.) The counter begins with a start time of 17:44:22, and a sample is taken immediately. The value is 0, meaning no items have been pushed onto the queue.
Another sample is taken 10 seconds later. The start time is the same, and the value has increased to 1. The queue received 1 new item in this 10 second window, and has now seen 1 item total.
After another 10 seconds, an additional sample is taken with the value 8. The queue has received 8 items in total now, and so 7 new items in this time window.
10 seconds later we see a new sample, but the start time has also changed! This could mean that the program managing the queue has restarted, for example. The value is now 1, which means between 17:44:44 and 17:44:52, there was 1 item sent on the queue.
Deltas
The last metric type in OxQL is a delta. In a timeseries with a delta metric type, the value at each point in time represents the change over a time range. Similar to cumulative counters, deltas have a start time in addition to a timestamp. In contrast, the time ranges are non-overlapping, with the start time of each sample exactly equal to the timestamp of the one before it.
Start time | Timestamp | Value |
---|---|---|
17:44:22 | 17:44:32 | 2 |
17:44:32 | 17:44:42 | 7 |
17:44:42 | 17:44:52 | 1 |
17:44:52 | 17:45:02 | 3 |
Here is an example delta timeseries. We can see that each timestamp is equal to the start time of the following sample (if it exists). The value counts the total number of items just within that window.
Deltas are common in OxQL queries. That is mainly because, when selecting data from a cumulative table, it is automatically converted into a delta! Why is that?
As mentioned above, much of the data collected in the Oxide rack is cumulative, whether a counter or a histogram. Cumulative metrics have a number of nice properties that make the collected data more robust and useful in a complex distributed system. (See the section on cumulatives for details.) Despite their wide use, cumulative data is actually quite tricky to work with. The samples in a cumulative timeseries are not independent, since later values contain overlapping information with earlier values. This makes it difficult to combine data from multiple sources, either within the same timeseries (for example, temporally averaging) or across timeseries (say, averaging during a grouping operation). If one were to temporally average a cumulative counter into large time windows, it would weight earlier samples more heavily, since an increment in an early time range is counted again for each later time range in which it also appears.
To address this, both cumulative counters and histograms are automatically and implicitly converted to deltas when they are selected from the database. (There is currently no way to select raw, cumulative data, although this will likely be supported in the future.) After retrieving the data, OxQL computes the adjacent differences between each sample to convert a cumulative into a delta metric. This lets us aggregate, average, join, group, and align timeseries, since each data point is independent of its neighbors.
Original cumulative:

Start time | Timestamp | Value |
---|---|---|
17:44:22 | 17:44:22 | 0 |
17:44:22 | 17:44:32 | 1 |
17:44:22 | 17:44:42 | 8 |
17:44:44 | 17:44:52 | 1 |

Derived delta:

Start time | Timestamp | Value |
---|---|---|
17:44:22 | 17:44:22 | 0 |
17:44:22 | 17:44:32 | 1 |
17:44:32 | 17:44:42 | 7 |
17:44:44 | 17:44:52 | 1 |
The cumulative data shown above is similar to the previous cumulative counter example. The first three rows all share a start time, and the timestamps increment regularly. The value is also monotonically increasing. In the fourth row, we see a restart of the counter, with the start time and value both resetting.
The delta data on the right is derived from the cumulative, depending on whether the counter reset. If it did not reset, the start time of each row in the delta is set to the timestamp of the previous sample (if one exists), and the value is the difference between the current value and the previous (if it exists). If the cumulative counter did reset, then the start time, timestamp, and value are all equal to the corresponding values from the cumulative counter.
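The following Python sketch implements the conversion just described. It assumes each sample is a (start_time, timestamp, value) tuple, and that a change in start time signals a counter reset; this is an illustration of the rule above, not the actual implementation.

def to_delta(samples):
    deltas = []
    prev = None
    for (start, timestamp, value) in samples:
        if prev is None or start != prev[0]:
            # First sample, or the counter reset: pass through unchanged.
            deltas.append((start, timestamp, value))
        else:
            # Otherwise the window begins at the previous timestamp, and
            # the value is the adjacent difference.
            deltas.append((prev[1], timestamp, value - prev[2]))
        prev = (start, timestamp, value)
    return deltas

# The cumulative counter example from above.
cumulative = [
    ("17:44:22", "17:44:22", 0),
    ("17:44:22", "17:44:32", 1),
    ("17:44:22", "17:44:42", 8),
    ("17:44:44", "17:44:52", 1),
]
print(to_delta(cumulative))
# [('17:44:22', '17:44:22', 0), ('17:44:22', '17:44:32', 1),
#  ('17:44:32', '17:44:42', 7), ('17:44:44', '17:44:52', 1)]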
The Oximeter Query Language
Now that we’ve seen several examples of OxQL queries, and defined the key terms and concepts, this section will serve as a more formal reference to the elements of the language. We’ll first cover the syntax of the language, focusing on how strings are parsed into queries, and then cover the key semantic components of a query.
Lexical items
Lexical items are the basic words or units that form the vocabulary of a formal language. This section describes the recognized elements for OxQL. This includes components like tables or field names; functions or table operations; comparison operators; and literals. The grammar of each of these items is described here.
Identifiers
An identifier is a name, such as for a table operation or the field of a table.
IDENTIFIER = { [a-z]+ [a-z0-9]* ("_" [a-z0-9]+)* | SPECIAL_IDENTIFIER }
SPECIAL_IDENTIFIER = { "timestamp" | "start_time" | "datum" | "bins" | "counts" }
For example, in the table operation filter link_name == "foo"
, both the
operation name filter
and the field name link_name
are identifiers.
Identifiers are formatted as lowercase ASCII characters or digits, possibly delimited by underscores, with the structure shown in the grammar above. Identifiers may not start with a digit, though the underscore-delimited words after the first may start that way. The special identifiers are implicitly added to a table by the OxQL query engine, and are used to refer to the common elements that all tables and timeseries share, such as their timestamp or the datum value itself. See [_terminology] for details on the anatomy of a timeseries.
Valid identifiers include:
timestamp
sled_serial
time0
time_new_0
The following are not valid identifiers:
123
0time
a-b-c
!
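The identifier grammar above corresponds to a simple regular expression, which the following Python snippet uses to check the examples; this is an illustration, not the parser OxQL actually uses.

import re

# Mirrors IDENTIFIER above: lowercase words, digits allowed after the
# first character, delimited by underscores.
IDENT = re.compile(r"[a-z]+[a-z0-9]*(_[a-z0-9]+)*$")

for word in ("timestamp", "sled_serial", "time0", "time_new_0",
             "123", "0time", "a-b-c", "!"):
    print(word, bool(IDENT.match(word)))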
Table names
A table name is used to identify a table in the database or the output of an OxQL query.
TARGET_NAME = { [a-z]+ [a-z0-9]* ("_" [a-z0-9]+)* }
METRIC_NAME = TARGET_NAME
TABLE_NAME = { TARGET_NAME ":" METRIC_NAME }
The initial data input to OxQL queries is a table
stored in the telemetry database. These are identified by a name with a
particular structure: a target resource and a metric name, separated by a :
.
For example, the sled_data_link:bytes_sent
table lists the number of bytes
sent on a single sled’s NIC.
Table names are also used to name the output of a query. In many cases, this is
the same as the input table name, but may change during processing. For example,
the name of the output of a join
operation is a concatenation of the input
table names.
Literals
OxQL supports the common types of literals, usually used in filtering expressions or as arguments to table operations. This includes things like booleans, timestamps, numeric values, and strings. Where possible and helpful, the regular expression matching each literal has been included below.
Booleans
The literals true and false are used for boolean values.
BOOLEAN = { true | false }
Integers
OxQL supports signed and unsigned integers.
DECIMAL_INTEGER = { ["-"]? (0 | [1-9] [0-9]*) }
HEX_INTEGER = { ["-"]? "0x" [0-9a-fA-F]+ }
INTEGER = { DECIMAL_INTEGER | HEX_INTEGER }
Signed and unsigned integers are supported. These are sequences of digits, delimited by whitespace. Both base-10 and base-16 (hexadecimal) integers are currently supported.
Floats
OxQL supports floating point literals, in either decimal or scientific notation
DIGIT = { [0-9] }
SIGN = { [+-] }
EXPONENT = { ("e" | "E") SIGN? DIGIT+ }
NUMBER = { (DIGIT+ | DIGIT+ "." DIGIT* | DIGIT* "." DIGIT+) EXPONENT? }
FLOAT = { SIGN? ("inf" | "infinity" | "nan" | NUMBER) }
Floating point values are supported, written in either decimal
notation, with a point, such as 3.0
, or in scientific notation, such as
3.0e-3
. Floats parsed from queries are stored as IEEE-754 64-bit doubles.
(Floats for data stored in the database may be single- or
double-precision, which are widened if needed during a query.) Values that are
not exactly representable as IEEE-754 values will be rounded to the nearest
representable value.
IP addresses
OxQL supports IP addresses of either the V4 or V6 family.
IPv4 and IPv6 addresses are supported, written in the familiar dotted-decimal
notation for IPv4, e.g. "127.0.0.1"
, and colon-separated hex notation for
IPv6, e.g., "2001:adcd::1"
. Addresses are surrounded by double-quotes, "
.
IPv6 addresses should not be surrounded with brackets ([]
), as is sometimes
common when they are followed by a port or scope ID.
UUIDs
Universally-unique identifiers (UUIDs) are 128-bit values, commonly used to identify resources in the Oxide control plane.
HEX_CHAR = { [a-fA-F0-9] }
DASHED_UUID = { HEX_CHAR{8} "-" (HEX_CHAR{4} "-"){3} HEX_CHAR{12} }
BARE_UUID = { HEX_CHAR{32} }
UUID = { '"' (DASHED_UUID | BARE_UUID) '"' }
UUIDs are written as double-quoted strings of hexadecimal characters. Supported
formats include the common 8-4-4-4-12, where groups of 8, 4, 4, 4, and 12 hex
characters are separated by dashes, such as
"116068ca-dcc7-4c0d-9a24-82dc7e0a0bc1"
. The dashes may also be omitted, so
that the previous value is equivalent to "116068cadcc74c0d9a2482dc7e0a0bc1"
.
The ASCII characters in a UUID are case-insensitive:
"116068ca-dcc7-4c0d-9a24-82dc7e0a0bc1"
is equivalent to
"116068CA-DCC7-4C0D-9A24-82DC7E0A0BC1"
.
Durations
An OxQL literal to denote a duration of time.
YEAR = "Y"
MONTH = "M"
WEEK = "w"
DAY = "d"
HOUR = "h"
MINUTE = "m"
SECOND = "s"
MILLISECOND = "ms"
MICROSECOND = "us"
NANOSECOND = "ns"
UNIT = { YEAR | MONTH | WEEK | DAY | HOUR | MINUTE | SECOND | MILLISECOND | MICROSECOND | NANOSECOND }
DURATION = { DECIMAL_INTEGER UNIT }
A literal duration can be written as an unsigned integer count, followed by a duration unit. Supported units, in decreasing size, are:
Y: an approximate year, 365 days
M: an approximate month, 30 days
w: an approximate week, 7 days
d: an approximate day, one 24-hour period
h: an hour, 3600 seconds
m: a minute, 60 seconds
s: seconds
ms: milliseconds
us: microseconds
ns: nanoseconds
For example, the duration 7d
is equal to 7 24-hour periods. Note that the
larger durations are all approximate. One year is always 365 24-hour periods,
and there is no accounting for leap years.
Note that duration units are case-sensitive: M refers to a month and lowercase m to a minute, for example.
Timestamps
A timestamp literal is used to specify an instant in time.
YEAR = { [1-9] [0-9]+ }
MONTH = { [1-9] | 10 | 11 | 12 }
DAY = { [1-9] | [1-2][0-9] | 3[0-1] }
HOUR = { [0-9] | 1[0-9] | 2[0-4] }
MINUTE = { [0-9] | [1-5][0-9] }
SECOND = { [0-9] | [1-5][0-9] }
SUBSEC = { "." [0-9]{1,6} }
DATE = { "@" YEAR "-" MONTH "-" DAY }
TIME = { "@" HOUR ":" MINUTE ":" SECOND }
DATETIME = { "@" YEAR "-" MONTH "-" DAY "T" HOUR ":" MINUTE ":" SECOND SUBSEC? }
NOW = { "@now()" }
TIMESTAMP = { DATE | TIME | DATETIME | NOW }
Timestamps may be expressed in a few formats, all beginning with the @
symbol.
They can be written as just a date, such as @2024-01-02
. In this case,
midnight UTC on the specified date (2 Jan 2024) is used as the time, so the
above value is equal to 2024-01-02T00:00:00 UTC
. All timestamps are in UTC,
and there is no mechanism to express a particular timezone or offset.
Supported formats are:
@%Y-%m-%d: A single date, with a time given by midnight.
@%H:%M:%S: A time within the current day (UTC). Note that this may differ from the current day of the user’s local timezone.
@%Y-%m-%dT%H:%M:%S.%f: A date and time, including fractions of a second up to 9 significant digits. This is very similar to the RFC 3339 format, except that neither an offset nor timezone is allowed.
@now(): The special timestamp literal @now() can be used to represent the current time, when the query is parsed. In this case, it’s also possible to add or subtract a duration literal from the value. For example, @now() - 1m represents one minute in the past. This kind of arithmetic is exclusive to timestamps expressed with @now(), and may not be used with other timestamp formats.
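The @now() arithmetic can be thought of as ordinary datetime arithmetic performed at query parse time. A rough Python analogy (not the actual implementation):

from datetime import datetime, timedelta, timezone

# `@now() - 1m` resolves to the parse time, in UTC, minus the duration.
now = datetime.now(timezone.utc)
one_minute_ago = now - timedelta(minutes=1)
print(one_minute_ago.isoformat())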
Strings
String literals are UTF-8 strings, quoted in either single- or double-quotes. For example, "this" and 'this' are equivalent. If the string itself contains quotes, the opposite kind of quote should be used to surround it. For example, "this contains 'single quotes'" and 'this contains "double quotes"' are both acceptable. If you need to include both kinds of quotes, escape-sequences are also recognized. For example, 'this string contains \'single quotes\'' is acceptable. The full list of supported escape-sequences is:
\n
\r
\t
\\
\0
\u
The last escape-sequence is used to support full Unicode codepoints. Those must
be written as sequences of hex digits, enclosed in \u{}
. For example, the
value \u{1234}
will decode to the codepoint ሴ
— those two representations
are equivalent.
Comparison and logical operators
Comparison operators are used to compare two values, and return a boolean. Most familiar operators are supported, including equality, less-than, etc. Here is the full list:
EQ = { "==" }
NE = { "!=" }
GT = { ">" }
GE = { ">=" }
LT = { "<" }
LE = { "<=" }
REGEX_MATCH = { "~=" }
COMPARISON_OP = { EQ | NE | GT | GE | LT | LE | REGEX_MATCH }
The last operator, ~=, implements regular expression matching, and is only supported for strings. An expression such as "[a-f]+0$" can be used, for example, to match any string ending with one or more characters a-f and a single digit 0. The full supported regular expression syntax is described here.
Logical operators can also be used to combine one or two boolean values together. The supported operators are:
LOGICAL_OP = { AND | OR | XOR | NOT }
AND = { "&&" }
OR = { "||" }
XOR = { "^" }
NOT = { "!" }
Note that the XOR operator is logical, not bitwise. The precedence of the operators follows the common programming order: in order of increasing precedence, they are OR, AND, XOR, and NOT.
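For example, because AND binds more tightly than OR, the two filter expressions below are equivalent (the field names are hypothetical):

filter x == 0 || y == 1 && z == 2
filter x == 0 || (y == 1 && z == 2)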
Basic query structure
An OxQL query is composed of a pipeline of table operations, each of which applies a kind of data transformation to its inputs. Queries can be either flat, composed of simple table operations, or nested, containing subqueries themselves.
QUERY = { FLAT_QUERY | NESTED_QUERY }
FLAT_QUERY = { TABLE_OP ( "|" TABLE_OP )* }
NESTED_QUERY = { "{" QUERY ( ";" QUERY )+ "}" }
A nested query is written as ;
-delimited subqueries, all contained within
braces ({}
).
Table operations
The fundamental unit of an OxQL query is a table operation. Each table contains zero or more timeseries, all with the same field names, field types, and data types. In this sense, a table is analogous to a table in a traditional relational database: it contains entries all with the same schema. In this case, an entry is a timeseries, which has a unique set of field values along with its timestamped data points. (See [_terminology] for more details.) Below is a summary of each supported table operation, and more details on each are in the following sections.
Name | # input tables | # output tables | Description | Constraints |
---|---|---|---|---|
get | 0 | 1 | Retrieve a table by name | None |
filter | N > 0 | N | Filter timeseries in a table | None |
align | N > 0 | N | Temporally align a table's samples | None |
group_by | 1 | 1 | Group timeseries in a table by some fields, reducing along the unnamed fields | Timeseries must be aligned |
join | N > 1 | 1 | Natural inner join between two or more tables, matching up timeseries by their field names, and data points by timestamp | Timeseries must be aligned |
first / last | N > 0 | N | Limit the data points from each timeseries to the first or last | None |
Each table operation may take zero or more tables, and produces zero or more new tables after processing its input. The number of supported tables, and any restrictions on data types, are listed in the section describing each operation.
All queries or subqueries must begin by extracting some data from the timeseries
database, on which the rest of the query operates. This is done using the get
table operation, described in detail here.
Constructing pipelines
Complex queries are constructed by stringing together table operations with the
familiar pipe operator, |
. The output of each table operation is fed into
the next. For example, the query
get sled_data_link:bytes_sent | filter timestamp > @now() - 1m
will first get the raw data from the sled_data_link:bytes_sent
table, and then
restrict it to the last minute from the time the query is run. This will result
in one table, since there is one table named in the query. There may be zero,
one, or many timeseries in the table, depending on the data available in the
database when the query is run.
The above query’s output might look like this:
sled_data_link:bytes_sent

 kind (string): vnic
 link_name (string): oxBootstrap0
 rack_id (uuid): de608e01-b8e4-4d93-b972-a7dbed36dd22
 sled_id (uuid): 71def415-55ad-46b4-ba88-3ca55d7fb287
 sled_model (string): 913-0000019
 sled_revision (u32): 6
 sled_serial (string): BRM42220051
 zone_name (string): oxz_switch
  1986-12-28T00:00:40.081399425Z - 2024-08-17T21:54:42.930353607Z: 22456314166
  2024-08-17T21:54:42.930353607Z - 2024-08-17T21:54:52.936228303Z: 0
  2024-08-17T21:54:52.936228303Z - 2024-08-17T21:55:02.939660391Z: 7296
  2024-08-17T21:55:02.939660391Z - 2024-08-17T21:55:13.234281307Z: 0
  2024-08-17T21:55:13.234281307Z - 2024-08-17T21:55:23.240183102Z: 0
  2024-08-17T21:55:23.240183102Z - 2024-08-17T21:55:33.243404269Z: 7296

 kind (string): physical
 link_name (string): cxgbe0
 rack_id (uuid): de608e01-b8e4-4d93-b972-a7dbed36dd22
 sled_id (uuid): 2707b587-9c7f-4fb0-a7af-37c3b7a9a0fa
 sled_model (string): 913-0000019
 sled_revision (u32): 6
 sled_serial (string): BRM44220011
 zone_name (string): global
  1986-12-28T00:00:35.747836119Z - 2024-08-17T21:54:41.542113525Z: 19729551115
  2024-08-17T21:54:41.542113525Z - 2024-08-17T21:54:51.547409164Z: 855335
  2024-08-17T21:54:51.547409164Z - 2024-08-17T21:55:02.333056352Z: 332582
  2024-08-17T21:55:02.333056352Z - 2024-08-17T21:55:12.335451187Z: 561008

<... output continues ...>
Subqueries
As mentioned above, queries can be either flat or
nested. A query is nested if it contains subqueries, two or more queries
inside braces ({}
), separated by a semicolon (;
). Each subquery itself can
be flat or nested, so that quite complex processing pipelines can be built up
using these basic elements. For example, the query:
{
get sled_data_link:bytes_sent;
get sled_data_link:bytes_received
}
| filter timestamp > @now() - 1m && link_name == "cxgbe0"
will get the number of bytes sent and received on all sleds' physical data links, over the last minute. Further filtering can be done to limit the data, described in detail below. Here is an example output:
sled_data_link:bytes_sent

 kind (string): physical
 link_name (string): cxgbe0
 rack_id (uuid): de608e01-b8e4-4d93-b972-a7dbed36dd22
 sled_id (uuid): bd96ef7c-4941-4729-b6f7-5f47feecbc4b
 sled_model (string): 913-0000019
 sled_revision (u32): 13
 sled_serial (string): BRM27230045
 zone_name (string): global
  1986-12-28T00:00:35.608843070Z - 2024-08-17T21:55:57.796047457Z: 6232264822
  2024-08-17T21:55:57.796047457Z - 2024-08-17T21:56:07.803023952Z: 89948
  2024-08-17T21:56:07.803023952Z - 2024-08-17T21:56:17.808466667Z: 74393

 kind (string): physical
 link_name (string): cxgbe0
 rack_id (uuid): de608e01-b8e4-4d93-b972-a7dbed36dd22
 sled_id (uuid): 2707b587-9c7f-4fb0-a7af-37c3b7a9a0fa
 sled_model (string): 913-0000019
 sled_revision (u32): 6
 sled_serial (string): BRM44220011
 zone_name (string): global
  1986-12-28T00:00:35.747836119Z - 2024-08-17T21:55:52.346916391Z: 19733555641
  2024-08-17T21:55:52.346916391Z - 2024-08-17T21:56:02.349027176Z: 303074
  2024-08-17T21:56:02.349027176Z - 2024-08-17T21:56:12.353425030Z: 559113
  2024-08-17T21:56:12.353425030Z - 2024-08-17T21:56:22.358928927Z: 783220
  2024-08-17T21:56:22.358928927Z - 2024-08-17T21:56:32.364230588Z: 296332
  2024-08-17T21:56:32.364230588Z - 2024-08-17T21:56:42.752143536Z: 310945

<... output continues ...>

sled_data_link:bytes_received

 kind (string): physical
 link_name (string): cxgbe0
 rack_id (uuid): de608e01-b8e4-4d93-b972-a7dbed36dd22
 sled_id (uuid): 71def415-55ad-46b4-ba88-3ca55d7fb287
 sled_model (string): 913-0000019
 sled_revision (u32): 6
 sled_serial (string): BRM42220051
 zone_name (string): global
  1986-12-28T00:00:35.890800854Z - 2024-08-17T21:55:54.258302609Z: 248829949334
  2024-08-17T21:55:54.258302609Z - 2024-08-17T21:56:04.260650837Z: 542105
  2024-08-17T21:56:04.260650837Z - 2024-08-17T21:56:14.263301567Z: 539631
  2024-08-17T21:56:14.263301567Z - 2024-08-17T21:56:24.265977975Z: 260104

 kind (string): physical
 link_name (string): cxgbe0
 rack_id (uuid): de608e01-b8e4-4d93-b972-a7dbed36dd22
 sled_id (uuid): 2707b587-9c7f-4fb0-a7af-37c3b7a9a0fa
 sled_model (string): 913-0000019
 sled_revision (u32): 6
 sled_serial (string): BRM44220011
 zone_name (string): global
  1986-12-28T00:00:35.747836119Z - 2024-08-17T21:55:52.346916391Z: 1715906069
  2024-08-17T21:55:52.346916391Z - 2024-08-17T21:56:02.349027176Z: 60795
  2024-08-17T21:56:02.349027176Z - 2024-08-17T21:56:12.353425030Z: 89646
  2024-08-17T21:56:12.353425030Z - 2024-08-17T21:56:22.358928927Z: 93582
  2024-08-17T21:56:22.358928927Z - 2024-08-17T21:56:32.364230588Z: 58751
  2024-08-17T21:56:32.364230588Z - 2024-08-17T21:56:42.752143536Z: 90325
Note that there are two tables here, printed one after the other. The first
shows the table sled_data_link:bytes_sent
, with its timeseries. This is first
because the subquery listed it first. Next is the
sled_data_link:bytes_received
table, with all its timeseries.
Supported table operations
Table operations form the main processing component of an OxQL query. Each operation receives some number of tables from the operation before it; processes those tables; and generates some number of tables for the next operation to consume.
get
The primordial table operation is get. This retrieves a single table by its name from the database. It takes zero tables as input, and produces exactly one table as output. That output table may have any number of timeseries in it.
GET = { "get" TABLE_NAME }
Example:
get sled_data_link:bytes_sent
filter
A filter
table operation passes through data which matches the provided
filtering expression. Filters take an identifier, a comparison
operator, and a literal. (In the future, more complex expressions beyond
literals may be supported.)
FILTER = { "filter" FILTER_EXPR }
FILTER_EXPR = { SIMPLE_FILTER | COMPOUND_FILTER }
SIMPLE_FILTER = { IDENTIFIER COMPARISON_OP LITERAL }
COMPOUND_FILTER = { NOT? "("? FILTER_EXPR LOGICAL_OP FILTER_EXPR ")"? }
See [_comparison_and_logical_operators] and [_literals] for details on those components.
The provided identifier may name any one of the
fields in the table, or refer to a special identifier, such
as timestamp
. For cumulative or delta metric types, the start_time
identifier is also supported. For scalar (non-histogram) timeseries, the datum
identifier is used to refer to the value of the data point. For histograms, the
bins
and counts
identifiers refer to the values of the data point instead.
Filters may include compound expressions, joined with the logical operators
defined above. The precedence rules are
also defined in that section, but note that comparison operators always bind
more tightly than logical operators. In other words, foo == 0 && bar == 1
is
equivalent to (foo == 0) && (bar == 1)
. Parentheses may be used to group
expressions.
Below are several examples of filter
table operations.
Pass data within the last 1 minute:
filter timestamp > @now() - 1m
Data within the last 1 minute, where a field named
link_name
is exactly equal to the string"foo"
:
filter timestamp > @now() - 1m && link_name == "foo"
Filter points to those in the last day, with a 500 status code; or in the day before, with a status code between 400 and 500:
filter
(timestamp > @now() - 1d && status_code == 500) ||
(timestamp > @now() - 2d && timestamp < @now() - 1d && status_code > 400 && status_code < 500)
Note that the literals used to compare against a field must be comparable to the
actual data type of that field. It is an error, for example, to write filter
foo == 0
if the field "foo"
of the input table doesn’t have an integer type.
align
The align
table operation is used to generate output samples that are evenly
spaced in time. This is critical for combining data from more than one
timeseries or table. Timeseries data is usually sampled on an approximately
regular interval, but almost never exactly so. The align
operation can be used
to generate exactly-spaced samples. This allows points from one timeseries to be
confidently matched up with those of another, letting us perform grouping,
aggregation, or other multi-timeseries operations.
ALIGN = { "align" ALIGNMENT_METHOD }
ALIGNMENT_METHOD = { "mean_within(" DURATION ")" }
The align
table operation takes one argument, which is the alignment method.
Currently the only supported method is mean_within
, which computes an average
of the input timeseries over the provided duration. For example:
get sled_data_link:bytes_sent
| filter timestamp > @now() - 1h
| align mean_within(5m)
will compute the number of bytes sent on a data link over the last one hour, averaged in five minute windows. The output samples will be spaced exactly five minutes apart.
The alignment method also specifies what it means for an input sample to
"overlap" with the output period. For the mean_within
method, any input
timeseries that has any part of its sample period within the output period is
considered to overlap, with the amount of overlap determining the
contribution of that value to the output. This effectively prorates each point
by the amount its time range overlaps with the output window.
For example, suppose there is an input sample over the range [0.1s, 1.1s], and the output period is exactly [0.0s, 1.0s]. Then the input sample will contribute its value weighted by 0.9, since 90% of its interval overlaps with the output window. That is, the input is linearly weighted in the output, where the weight is the fraction of its interval which overlaps with the output period.
If there are zero data points in an output period, then a missing value is
produced in that period, represented by the JSON value null
in the output
data array.
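A small Python sketch of that proration, assuming each input sample covers an interval (start, end) in seconds with a scalar value; this illustrates the weighting described above rather than the exact implementation:

def mean_within(samples, w_start, w_end):
    total = 0.0
    weight = 0.0
    for (start, end, value) in samples:
        overlap = min(end, w_end) - max(start, w_start)
        if overlap <= 0:
            continue   # sample does not overlap the output window
        frac = overlap / (end - start)   # fraction of the sample inside
        total += frac * value
        weight += frac
    # Zero overlapping points produce a missing value (JSON null).
    return total / weight if weight else None

# The example from the text: a sample over [0.1s, 1.1s] contributes with
# weight 0.9 to the output window [0.0s, 1.0s].
print(mean_within([(0.1, 1.1, 10.0), (-0.4, 0.1, 4.0)], 0.0, 1.0))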
group_by
The group_by
table operation can be used to group more than one timeseries in
a table together. One specifies a list of field names on which to group, and
optionally a method for reducing all the points within a group.
GROUP_BY = { "group_by" "[" (FIELD_NAME),* "]" ("," REDUCER)? }
REDUCER = { "mean" | "sum" }
Note that the input timeseries must be aligned first, so that samples from the input timeseries can be correlated across time. An error is generated if the inputs are not aligned. Example:
get sled_data_link:bytes_received
| align mean_within(1m)
| group_by [sled_id], sum
The above query will compute the average number of bytes received on each sled’s
NICs, within one minute periods, and then group by the sled’s UUID. This will
remove all the other fields, leaving only the sled_id
field. Within each
group, all data points at the same timestamp will be summed to generate the
output result. In other words, this computes the total number of bytes received
for all links on a compute sled.
In contrast, the query:
get sled_data_link:bytes_received
| align mean_within(1m)
| group_by [sled_id], mean
will compute the average number of bytes received in each group.
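To make the grouping concrete, here is a small Python sketch of the reduction step, assuming already-aligned timeseries represented as (fields, values) pairs whose value lists line up by timestamp; handling of missing points is omitted.

from collections import defaultdict

def group_by(timeseries, group_fields, reducer):
    # Bucket timeseries by the values of the grouped fields.
    groups = defaultdict(list)
    for (fields, values) in timeseries:
        key = tuple(fields[f] for f in group_fields)
        groups[key].append(values)
    # Reduce across timeseries, one aligned timestamp at a time.
    return {key: [reducer(col) for col in zip(*members)]
            for (key, members) in groups.items()}

links = [
    ({"sled_id": "a", "link_name": "cxgbe0"}, [1.0, 2.0]),
    ({"sled_id": "a", "link_name": "cxgbe1"}, [3.0, 4.0]),
]
print(group_by(links, ["sled_id"], sum))   # {('a',): [4.0, 6.0]}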
join
Grouping is one way to merge data from multiple timeseries. However, one may also wish to correlate data from more than one timeseries over time. For example, one might want to know how many bytes are sent or received for a given sled.
One way to do this is the join
table operation, which combines two or more
tables together. For example:
{
get sled_data_link:bytes_sent;
get sled_data_link:bytes_received
}
| align mean_within(1m)
| join
This query will first retrieve data for both the bytes sent and bytes received
tables. They will separately be aligned to one minute periods by averaging,
since the align
operation takes any number of tables as input, and produces
the same number as its output.
Finally, the join table operation (which takes no arguments) performs a natural inner join on two or more tables. This merges data from timeseries in each table with the exact same values for their fields. It combines data points by concatenating those values which occur at the same timestamp.
For example, the query:
{
get sled_data_link:bytes_sent;
get sled_data_link:bytes_received
}
| filter timestamp > @2024-03-16T00:00:00 && timestamp < @2024-03-16T00:02:00
| align mean_within(20s)
| join
will produce data at 20s intervals, each with two values for the point: the first will give the average number of bytes sent in that interval, and the second the average bytes received. For example, entries might look like this:
2024-03-16 00:00:20.000000000: [1296.3357361535034,1347.6610151429818]
2024-03-16 00:00:40.000000000: [1174.90794741706,1191.0918657310574]
2024-03-16 00:01:00.000000000: [1069.655113621735,1214.166748853574]
2024-03-16 00:01:20.000000000: [1409.5862705234695,1411.122044253881]
2024-03-16 00:01:40.000000000: [1163.3038173144982,1158.1461762530694]
2024-03-16 00:02:00.000000000: [609.0538622363117,642.6238774860385]
Note: as with group_by, the join table operation requires that its input timeseries be aligned. An error is generated if that’s not the case.
first and last
The first
and last
table operations are used to limit the number of data
points from each query. They both take an unsigned integer argument k
,
limiting the output to at most k
samples, independently for each timeseries in
each table.
For example:
get sled_data_link:bytes_sent
| filter kind == "physical"
| last 1
will select the timeseries from the sled_data_link:bytes_sent
table
corresponding to the physical data links, and take the very last sample from
each. The output might look like this:
sled_data_link:bytes_sent

 kind (string): physical
 link_name (string): cxgbe0
 rack_id (uuid): de608e01-b8e4-4d93-b972-a7dbed36dd22
 sled_id (uuid): 71def415-55ad-46b4-ba88-3ca55d7fb287
 sled_model (string): 913-0000019
 sled_revision (u32): 6
 sled_serial (string): BRM42220051
 zone_name (string): global
  1986-12-28T00:00:35.449571704Z - 2024-08-23T16:49:51.282315601Z: 23912605713

 kind (string): physical
 link_name (string): cxgbe1
 rack_id (uuid): de608e01-b8e4-4d93-b972-a7dbed36dd22
 sled_id (uuid): 2707b587-9c7f-4fb0-a7af-37c3b7a9a0fa
 sled_model (string): 913-0000019
 sled_revision (u32): 6
 sled_serial (string): BRM44220011
 zone_name (string): global
  1986-12-28T00:00:35.212682494Z - 2024-08-23T16:49:41.356651510Z: 18131229015

<... output continues ...>
Query output
All OxQL queries output a list of tables. Each table includes some number of timeseries, and each timeseries includes metadata about the data types; field names and values; and the data itself. (See [_terminology] for more details on these terms.)
As mentioned in the examples, the outputs shown above have been simplified and formatted to show the most salient components. The query output returned from the Oxide API does contain all the metadata needed to understand it, including the timeseries names, details about the fields, and data and metric types.
See the OpenAPI spec for the full JSON schema describing a query output.
OxQL cookbook
This section is a quick reference of useful queries. It’s a sort of cookbook for answering common questions, with links to detail sections where useful.
Last sample
get sled_data_link:bytes_sent | last 1
Select the last sample from all timeseries in the sled_data_link:bytes_sent
table. See this section for more details.
Last 5 minutes
get sled_data_link:bytes_sent | filter timestamp > @now() - 5m
Select the last 5 minutes of data from all timeseries in the
sled_data_link:bytes_sent
table. See the section on the filter
table operation and the section on timestamps.
Average in 5 minute windows
get sled_data_link:bytes_sent | align mean_within(5m)
Average all timeseries in the sled_data_link:bytes_sent
table into
non-overlapping 5 minute time windows. See the section on the align
table operation.
Group by a unique ID
get sled_data_link:bytes_sent | align mean_within(5m) | group_by [sled_id]
Average all timeseries in the sled_data_link:bytes_sent
table into
non-overlapping 5 minute time windows, and also average all timeseries with the
same value for the sled_id
field. See the section on the
group_by
table operation.
Correlate two timeseries
{
get sled_data_link:bytes_sent;
get sled_data_link:bytes_received
}
| align mean_within(5m)
| join
Average all timeseries independently from the sled_data_link:bytes_sent
and
sled_data_link:bytes_received
tables into 5 minute windows, and then
concatenate the samples from timeseries with the same field values, at the same
timestamps. See the section on the join
table operation.
Security Considerations
Ideally, queries in OxQL would take the exact same security stance as the rest of Oxide’s public API. We should avoid leaking information in query results that users cannot otherwise see. As an example, it would be nice if users could not see the vCPU utilization of instances in a project for which they don’t have read access. Ideally, we would effectively show the user only the subset of the metric database that refers to objects they can read in other parts of the API.
However, this approach seems complex and expensive, and possibly intractable. Rather than answer this particular question here, we defer to future RFDs to discuss the goals for authorizing metric data, and how we might implement it efficiently in the control plane.
External References
[RFD 125] RFD 125, Telemetry building blocks and requirements
[RFD 161] RFD 161, Metrics data model
[RFD 162] RFD 162, Metrics collection architecture and design
[RFD 304] RFD 304, You CAN Have Your Metric and Query it Too: A Proposal for Future Telemetry APIs
[RFD 442] RFD 442, Requirements for metrics API, and analysis of endpoints from cloud vendors (and VMware)