RFD 45
User System-level API

Operators of the rack will want endpoints that expose metrics and other data at a more global level than a single project or virtual machine. The following endpoints are aimed at operators.

RFD 4 User Facing API Design defines some default behavior for the API which will also apply to the API components defined here.

Metrics

In the user interface, we will want to display graphs that show operators the current status of their system. Operators should also be able to get capacity and utilization metrics from the API. They should be able to view global metrics as well as drill down into individual racks and sleds. They should also be able to view how individual users are utilizing their compute, and to filter by project. These APIs need to be easy to use, and it should be easy to customize the results they return based on the information the operator wants.

Operators should also be able to make custom dashboards and custom utilization reports that are saved to their accounts in the Oxide user interface. This is not necessary for our MVP; we can wait until users request it.

SQL Query interface

A SQL interface for querying metrics has been mocked up in Figma. It would allow operators to query any data we store.

Performance Metrics

Most important for v1

  • How is Disk I/O over time?

  • How is network latency?

  • Are my disks getting slower over time?

Operators will also want to be alerted when any of these crosses a certain threshold.
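One possible way to approach the "Are my disks getting slower over time?" question is to fit a least-squares line through latency samples and look at the sign of the slope. The sketch below is only an illustration: the `(timestamp_seconds, latency_ms)` sample shape and the function names are assumptions, not something this RFD defines.

```python
def latency_trend(samples):
    """Return the least-squares slope of latency over time.

    `samples` is a list of (timestamp_seconds, latency_ms) pairs. This
    shape is assumed for illustration only. The result is in
    milliseconds of latency per second of wall-clock time.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("all samples share the same timestamp")
    return numerator / denominator


def is_getting_slower(samples, min_slope=0.0):
    """True when latency is rising faster than `min_slope` (hypothetical knob)."""
    return latency_trend(samples) > min_slope
```

A positive slope means latency is rising over the sampled window. A real alerting rule would likely also need smoothing and a minimum window length, which are left out here.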

Questions operators answer

Operators will want to answer the following questions:

  • How much capacity do I have?

  • How much of that capacity is used?

  • How has that usage changed over time?

  • How is each resource that I care about being used?

    • CPU

    • DRAM

    • Storage

    • Network

  • Who is using these resources?

    • Who is using the most of these?

    • How has the usage changed over time?

    • Is anyone up against their resource cap?

    • If I need to free up resources, who has never used all of theirs?

  • What is the above on a per project basis?

  • What is this utilization on a per server basis?

  • What is this utilization on a per rack basis?

  • What if I want to do a custom utilization report?

  • How do I know when I am getting low on capacity?

  • What should I look at to know that I should add capacity?

  • What is the capacity and utilization on a project/user/organization basis?

The metrics API should be able to answer all of the above questions.

The ability to set up custom notifications that fire when a metric reaches a certain limit would be awesome. A way for operators to know that they need to increase their capacity would also be great, so that they can scale their Oxide racks before a shortage starts to affect their end users.
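As a rough illustration of the notification idea, the sketch below flags resources whose utilization fraction has reached a limit. It assumes a response shaped like the Metrics data type in this RFD, already parsed into a dict. The function name and the 0.8 default threshold are hypothetical.

```python
def resources_near_capacity(metrics, threshold=0.8):
    """Return the names of resources at or above `threshold` utilization.

    `metrics` maps a resource name ("cpu", "memory", "storage", ...) to
    an object that may carry a fractional `utilization` field. Entries
    without one (such as the network counters) are skipped.
    """
    return sorted(
        name
        for name, values in metrics.items()
        if isinstance(values, dict)
        and values.get("utilization", 0.0) >= threshold
    )


example = {
    "cpu": {"capacity": 4096, "utilization": 0.5},
    "memory": {"capacity": 16384, "utilization": 0.9},
    "storage": {"capacity": 7680, "utilization": 0.8},
    "network": {"receivedBytesCount": 1024},
}
print(resources_near_capacity(example))
```

A notification system could run a check like this against each new sample and alert the operator whenever the returned list is not empty.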

The following API endpoints relate to metrics for operators:

GET /system/metrics
GET /system/metrics?query=some-query-to-make-a-segment-of-metrics
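A client might build these request URLs as sketched below. The syntax of the `query` parameter is not defined in this RFD, so the sketch treats it as an opaque string and only URL-encodes it. The base URL and the example query string are made up.

```python
from urllib.parse import urlencode


def metrics_url(base_url, query=None):
    """Build the URL for GET /system/metrics, optionally with ?query=...

    `query` is passed through as an opaque string because its syntax is
    not specified here.
    """
    url = base_url.rstrip("/") + "/system/metrics"
    if query:
        url += "?" + urlencode({"query": query})
    return url


print(metrics_url("https://rack.example"))
print(metrics_url("https://rack.example", "project=web"))
```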

The Metrics data type will look like the following OpenAPI schema reference:

Metrics:
  type: object
  readOnly: true
  properties:
    cpu:
      type: object
      properties:
        capacity:
          description: |
            The maximum number of cores that can be used in the segment.
            Sampled every 60 seconds. After sampling, data is not
            visible for up to 240 seconds.
            In the example, 4096 defines the capacity for the entire rack.
            For a single sled, it would be 64.
            For an individual VM, it might be more like 1-62(?), excluding
            what we need to run our agent.
          type: number
          format: double
          example: 4096
        utilization:
          description: |
            The fraction of the allocated cores that is currently in use
            in the segment.
            This value can be greater than 1.0 on some
            machine types that allow bursting. This value is reported
            by the hypervisor for the VM and can differ from the
            utilization reported from inside the VM. Sampled every 60
            seconds. After sampling, data is not visible for up to
            240 seconds.
            In the example, 0.5 would mean 2048 cores are being utilized of
            the 4096 maximum cores (50% * 4096).
          type: number
          format: double
          example: 0.5
    memory:
      type: object
      properties:
        capacity:
          description: |
            The maximum memory that can be used in the segment, in MiB.
            Sampled every 60 seconds. After sampling, data is not
            visible for up to 240 seconds.
            In the example, 16384 defines the capacity for the entire rack.
            For a single sled, it would be 512.
            For an individual VM, it might be more like 4-128+.
          type: number
          format: double
          example: 16384
        utilization:
          description: |
            The fraction of the memory capacity that is currently in use
            in the segment.
            Sampled every 60 seconds. After sampling, data is not
            visible for up to 240 seconds.
            In the example, 0.5 would mean 8192 of the 16384 maximum
            is being utilized (50% * 16384).
          type: number
          format: double
          example: 0.5
    storage:
      type: object
      properties:
        capacity:
          description: |
            The maximum storage space that can be used in the segment, in GiB.
            Sampled every 60 seconds. After sampling, data is not
            visible for up to 240 seconds.
            In the example, 7680 defines the capacity for the entire rack.
            For a single sled, it would be ~240 (not sure what we actually
            landed on).
            For an individual VM, it might be more like 50-240.
          type: number
          format: double
          example: 7680
        utilization:
          description: |
            The fraction of the storage capacity that is currently in use
            in the segment.
            Sampled every 60 seconds. After sampling, data is not
            visible for up to 240 seconds.
            In the example, 0.5 would mean 3840 GiB are being utilized of
            the 7680 maximum (50% * 7680).
          type: number
          format: double
          example: 0.5
    network:
      type: object
      properties:
        receivedBytesCount:
          description: |
            Count of bytes received from the network. Sampled
            every 60 seconds. After sampling, data is not visible for
            up to 240 seconds.
          type: integer
          format: int64
          example: 1024
        receivedPacketsCount:
          description: |
            Count of packets received from the network.
            Sampled every 60 seconds. After sampling, data is not
            visible for up to 240 seconds.
          type: integer
          format: int64
          example: 54
        sentBytesCount:
          description: |
            Count of bytes sent over the network. Sampled every
            60 seconds. After sampling, data is not visible for up to
            240 seconds.
          type: integer
          format: int64
          example: 7042
        sentPacketsCount:
          description: |
            Count of packets sent over the network. Sampled every
            60 seconds. After sampling, data is not visible for up to
            240 seconds.
          type: integer
          format: int64
          example: 86
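The worked examples in the schema descriptions (such as 0.5 of 4096 cores being 2048 cores) come down to multiplying capacity by the utilization fraction. A minimal sketch of that arithmetic, assuming a Metrics response already parsed into a dict; the helper names are hypothetical.

```python
def used_amount(resource):
    """Amount in use, in the same unit as `capacity`."""
    return resource["capacity"] * resource["utilization"]


def headroom(resource):
    """Capacity not currently in use.

    This goes negative when utilization exceeds 1.0, which the schema
    allows for CPU on machine types that can burst.
    """
    return resource["capacity"] - used_amount(resource)


# Example values taken from the schema above.
metrics = {
    "cpu": {"capacity": 4096, "utilization": 0.5},
    "storage": {"capacity": 7680, "utilization": 0.5},
}
print(used_amount(metrics["cpu"]))
print(headroom(metrics["storage"]))
```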

Inventory

Operators will want to answer the following questions about the inventory of items that make up a rack:

  • What are all the components that I have in the DC?

  • For each component:

    • Who made the component? When? (VPD Data)

    • What is the warranty information for this component?

    • Is this component a FRU? Or a part of something else?

    • Is there firmware in this component? What revision?

    • What is its history?

      • When did this component enter the DC?

      • Where is this component in the DC?

      • Where has it been?

      • When was firmware upgraded on it?

    • Does it have the subcomponents I expect?

  • What is the health of this component?

    • Is it currently “powered on”?

    • What is its utilization, saturation, and errors?

    • Is its phy linked up at the right rate?

    • Is it running the firmware revision that I expect?

    • Are the device’s settings what I expect them to be?

    • What if it’s broken?

    • How do I warn about a component being unhealthy?

    • How do I want to query all of the components in the DC?

    • Counting the number of different components?

    • Querying various firmware bits?

The inventory API should be able to answer all of the above questions.

The following API endpoints relate to Inventory:

GET /system/inventory - List all inventory items
GET /system/inventory/{id} - Get details on a specific component in inventory

Each inventory item is a Component. The Component data type will look like the following OpenAPI schema reference:

Component:
  type: object
  readOnly: true
  properties:
    id:
      description: |
        The ID of the component. This is assigned upon creation and
        cannot be modified.
      type: string
      format: uuid
      readOnly: true
    health:
      type: object
      properties:
        status:
          description: |
            The status of this component. Can be one of the following:
            - HEALTHY
            - UNHEALTHY
          type: string
          enum:
            - HEALTHY
            - UNHEALTHY
        power:
          description: |
            If the component is powered on, this will be true. If the
            component is powered off, it will be false.
          type: boolean
        utilization:
          description: |
            Utilization of the component.
        saturation:
          description: |
            Saturation of the component.
        errors:
          description: |
            Any errors for the component will be shown here.
          type: array
          items:
            $ref: '#/components/schemas/Error'
    warranties:
      description: |
        Any and all warranty information for the component.
    firmware:
      type: object
      properties:
        history:
          description: |
            The history of firmware on this component.
          type: array
          items:
            $ref: '#/components/schemas/Firmware'
    timeCreated:
      description: |
        The date and time the component was added to the inventory
        database.
      type: string
      format: date-time
    settings:
      type: object
      description: |
        The settings for the component. These will differ depending on
        the object, but we need a way to express them for all objects.
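To illustrate how a client could answer questions such as "What is the health of this component?" across the whole inventory, the sketch below tallies components by `health.status` and lists the unhealthy ones. It assumes the list returned by `GET /system/inventory` has been parsed into dicts shaped like the Component schema. The function names and sample IDs are made up.

```python
from collections import Counter


def count_by_health(components):
    """Tally components by their health.status value."""
    return Counter(c["health"]["status"] for c in components)


def unhealthy_ids(components):
    """IDs of components that report UNHEALTHY, in input order."""
    return [
        c["id"] for c in components
        if c["health"]["status"] == "UNHEALTHY"
    ]


# Made-up sample data for illustration.
inventory = [
    {"id": "a1", "health": {"status": "HEALTHY", "power": True}},
    {"id": "b2", "health": {"status": "UNHEALTHY", "power": True}},
    {"id": "c3", "health": {"status": "HEALTHY", "power": False}},
]
print(count_by_health(inventory))
print(unhealthy_ids(inventory))
```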

The Firmware data type will look like the following OpenAPI schema reference:

Firmware:
  type: object
  properties:
    revision:
      description: |
        The revision of firmware this component is running.
      type: string
    timeUpdated:
      description: |
        The date and time the firmware was updated to this revision.
      type: string
      format: date-time
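Given a component's `firmware.history`, a client could answer "Is it running the firmware revision that I expect?" roughly as follows. This sketch assumes the history is not necessarily ordered, so it picks the entry with the latest `timeUpdated`. It also assumes every timestamp carries a UTC offset or a trailing `Z`. The function names and sample revisions are hypothetical.

```python
from datetime import datetime


def _parse_time(value):
    """Parse a date-time string, accepting a trailing 'Z' for UTC."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def current_revision(history):
    """Revision with the most recent timeUpdated, or None if empty."""
    if not history:
        return None
    latest = max(history, key=lambda fw: _parse_time(fw["timeUpdated"]))
    return latest["revision"]


def runs_expected_firmware(history, expected):
    """True when the most recently applied revision equals `expected`."""
    return current_revision(history) == expected


# Made-up sample data for illustration.
history = [
    {"revision": "1.0.2", "timeUpdated": "2020-06-01T12:00:00Z"},
    {"revision": "1.0.0", "timeUpdated": "2020-01-15T08:30:00Z"},
]
print(current_revision(history))
```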