45 - User System-level API / RFD / Oxide

Operators of the rack will want endpoints that expose metrics and other data at a more global level than a single project or virtual machine. The following endpoints are aimed at operators.

RFD 4 User Facing API Design defines some default behavior for the API, which also applies to the API components defined here.

Metrics

In the user interface, we will want to display graphs that show operators the current status of their system. Operators should also be able to get capacity and utilization metrics from the API. They should be able to view global metrics as well as drill down into individual racks and sleds, see how individual users are utilizing their compute, and filter by project. These APIs need to be easy to use, and it should be easy to customize the results they return based on the information the operator wants.

Operators should also be able to create custom dashboards and custom utilization reports that are saved to their accounts in the Oxide user interface. This is not necessary for our MVP; we can wait until users request it.

SQL Query interface

A SQL interface for querying metrics has been mocked up in Figma. It would allow operators to query any data we store.

Performance Metrics

Most important for v1

How is Disk I/O over time?
How is network latency?
Are my disks getting slower over time?

Operators will also want to be alerted when any of these crosses a certain threshold.

Questions operators answer

Operators will want to answer the following questions:

How much capacity do I have?
How much of that capacity is used?
How has that usage changed over time?
How is each resource that I care about being used?
- CPU
- DRAM
- Storage
- Network
Who is using these resources?
- Who is using the most of these?
- How has the usage changed over time?
- Is anyone up against their resource cap?
- If I need to free up resources, who has never used all of theirs?
What is the above on a per project basis?
What is this utilization on a per server basis?
What is this utilization on a per rack basis?
What if I want to do a custom utilization report?
How do I know when I am getting low on capacity?
What should I look at to know that I should add capacity?
What is the capacity and utilization on a project/user/organization basis?

The metrics API should be able to answer all of the above questions.

The ability to set up custom notifications that fire when a metric reaches a certain limit would be valuable. A way for operators to know when they need to increase capacity would also help, so they can scale their Oxide racks before a shortage starts to affect their end users.
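
As a minimal sketch of such a notification check, the snippet below flags resources whose utilization has reached a limit. It assumes the response is a mapping shaped like the Metrics data type in this RFD, with utilization expressed as a fraction of capacity. The function name and the 0.8 default limit are hypothetical and are not part of this RFD.

```python
def capacity_warnings(metrics, limit=0.8):
    """Return the names of resources whose utilization is at or above limit.

    `metrics` is assumed to map a resource name (cpu, memory, storage, ...)
    to a dict that may contain a fractional "utilization" value.
    The 0.8 default is a placeholder, not a value taken from the RFD.
    """
    return sorted(
        name
        for name, values in metrics.items()
        # Resources such as network report counters, not utilization.
        if "utilization" in values and values["utilization"] >= limit
    )
```

An operator-facing notification could then be raised for each name returned, for example when storage reports a utilization of 0.9.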

The following API endpoints relate to metrics for operators:

GET /system/metrics
GET /system/metrics?query=some-query-to-make-a-segment-of-metrics
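
The two requests above could be built by a client roughly as follows. This is only a sketch: the base URL is a placeholder, and the example query string `rack=1` is made up, since this RFD does not define the query syntax.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the RFD does not specify a host.
BASE_URL = "https://oxide.example"


def metrics_url(query=None):
    """Build the URL for GET /system/metrics, optionally scoped by a query."""
    url = f"{BASE_URL}/system/metrics"
    if query:
        # urlencode percent-escapes the query so it is safe inside a URL.
        url += "?" + urlencode({"query": query})
    return url
```

Calling `metrics_url()` yields the global endpoint, while `metrics_url("rack=1")` yields the same path with an escaped `query` parameter.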

The Metrics data type will look like the following OpenAPI schema reference:

Metrics:
    type: object
    readOnly: true
    properties:
        cpu:
            type: object
            properties:
                capacity:
                    description: |
                        The maximum number of cores that can be used in the segment.
                        Sampled every 60 seconds. After sampling, data is not
                        visible for up to 240 seconds.
                        In the example, 4096 defines the capacity for the entire rack.
                        For a single sled, it would be 64.
                        For an individual VM, it might be more like 1-62(?), which
                        excludes the cores we need to run our agent.
                    type: number
                    format: double
                    example: 4096
                utilization:
                    description: |
                        The fraction of the allocated cores that is currently in use
                        in the segment.
                        This value can be greater than 1.0 on some
                        machine types that allow bursting. This value is reported
                        by the hypervisor for the VM and can differ from the
                        utilization reported from inside the VM. Sampled every 60
                        seconds. After sampling, data is not visible for up to
                        240 seconds.
                        In the example, 0.5 would mean 2048 cores are being utilized of
                        the 4096 maximum cores (0.5 * 4096).
                    type: number
                    format: double
                    example: 0.5
        memory:
            type: object
            properties:
                capacity:
                    description: |
                        The maximum memory that can be used in the segment, in MiB.
                        Sampled every 60 seconds. After sampling, data is not
                        visible for up to 240 seconds.
                        In the example, 16384 defines the capacity for the entire rack.
                        For a single sled, it would be 512.
                        For an individual VM, it might be more like 4-128+.
                    type: number
                    format: double
                    example: 16384
                utilization:
                    description: |
                        The fraction of the memory capacity that is currently in use
                        in the segment.
                        Sampled every 60 seconds. After sampling, data is not
                        visible for up to 240 seconds.
                        In the example, 0.5 would mean 8192 MiB of memory are being
                        utilized of the 16384 MiB maximum (0.5 * 16384).
                    type: number
                    format: double
                    example: 0.5
        storage:
            type: object
            properties:
                capacity:
                    description: |
                        The maximum storage space that can be used in the segment, in GiB.
                        Sampled every 60 seconds. After sampling, data is not
                        visible for up to 240 seconds.
                        In the example, 7680 defines the capacity for the entire rack.
                        For a single sled, it would be ~240, not sure what we actually landed on.
                        For an individual VM, it might be more like 50-240.
                    type: number
                    format: double
                    example: 7680
                utilization:
                    description: |
                        The fraction of the storage capacity that is currently in use
                        in the segment.
                        Sampled every 60 seconds. After sampling, data is not
                        visible for up to 240 seconds.
                        In the example, 0.5 would mean 3840 GiB are being utilized of
                        the 7680 GiB maximum (0.5 * 7680).
                    type: number
                    format: double
                    example: 0.5
        network:
            type: object
            properties:
                receivedBytesCount:
                    description: |
                        Count of bytes received from the network. Sampled
                        every 60 seconds. After sampling, data is not visible for
                        up to 240 seconds.
                    type: integer
                    format: int64
                    example: 1024
                receivedPacketsCount:
                    description: |
                        Count of packets received from the network.
                        Sampled every 60 seconds. After sampling, data is not
                        visible for up to 240 seconds.
                    type: integer
                    format: int64
                    example: 54
                sentBytesCount:
                    description: |
                        Count of bytes sent over the network. Sampled every
                        60 seconds. After sampling, data is not visible for up to
                        240 seconds.
                    type: integer
                    format: int64
                    example: 7042
                sentPacketsCount:
                    description: |
                        Count of packets sent over the network. Sampled every
                        60 seconds. After sampling, data is not visible for up to
                        240 seconds.
                    type: integer
                    format: int64
                    example: 86
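
The worked examples in the descriptions above (such as 0.5 * 4096 = 2048 cores) can be reproduced with a few lines of client code. This is a sketch only: it assumes utilization is a fraction of capacity, as those examples imply, and the class and function names are illustrative rather than part of the API.

```python
from dataclasses import dataclass


@dataclass
class ResourceMetric:
    """One capacity/utilization pair, as in the cpu, memory, and storage objects."""
    capacity: float     # total available in the segment (cores, MiB, or GiB)
    utilization: float  # fraction of capacity in use; may exceed 1.0 when bursting


def amount_used(metric):
    """Absolute amount in use, in the same unit as capacity."""
    return metric.capacity * metric.utilization


def headroom(metric):
    """Remaining capacity, clamped at zero in case utilization exceeds 1.0."""
    return max(0.0, metric.capacity - amount_used(metric))
```

With the rack-level CPU example, `amount_used(ResourceMetric(4096, 0.5))` gives the 2048 cores mentioned in the schema description.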

Inventory

Operators will want to answer the following about the inventory of items that make up a rack:

What are all the components that I have in the DC?
For each component:
- Who made the component? When? (VPD Data)
- What is the warranty information for this component?
- Is this component a FRU? Or a part of something else?
- Is there firmware in this component? What revision?
- What is its history?
  - When did this component enter the DC?
  - Where is this component in the DC?
  - Where has it been?
  - When was firmware upgraded on it?
- Does it have the subcomponents I expect?
What is the health of this component?
- Is it currently "powered on"?
- What is its utilization, saturation, and errors?
- Is its phy linked up at the right rate?
- Is it running the firmware revision that I expect?
- Are the device’s settings what I expect them to be?
- What if it’s broken?
How do I warn about a component being unhealthy?
How do I want to query all of the components in the DC?
- Counting the number of different components?
- Querying various firmware bits?

The inventory API should be able to answer all of the above questions.

The following API endpoints relate to Inventory:

GET /system/inventory - List all inventory items
GET /system/inventory/{id} - Get details on a specific component in inventory

The Inventory data type will look like the following OpenAPI schema reference:

Component:
    type: object
    readOnly: true
    properties:
        id:
            description: |
                The ID of the component. This is assigned upon creation and
                cannot be modified.
            type: string
            format: uuid
            readOnly: true
        health:
            type: object
            properties:
                status:
                    description: |
                        The status of this component. Can be one of the following:
                        - HEALTHY
                        - UNHEALTHY
                    type: string
                    enum:
                        - HEALTHY
                        - UNHEALTHY
                power:
                    description: |
                        If the component is powered on, this will be true. If the
                        component is powered off, it will be false.
                    type: boolean
                utilization:
                    description: |
                        Utilization of the component.
                saturation:
                    description: |
                        Saturation of the component.
                errors:
                    description: |
                        Any errors for the component will be shown here.
                    type: array
                    items:
                        $ref: '#/components/schemas/Error'
        warranties:
            description: |
                Any and all warranty information for the component.
        firmware:
            type: object
            properties:
                history:
                    description: |
                        The history of firmware on this component.
                    type: array
                    items:
                        $ref: '#/components/schemas/Firmware'
        timeCreated:
            description: |
                The date and time the component was added to the inventory
                database.
            type: string
            format: date-time
        settings:
            type: object
            description: |
                The settings for the component. These will differ depending on
                the object, but we need a way to express them for all objects.
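
To illustrate how an operator tool might answer "which components are unhealthy?" from a list of Component objects, here is a small sketch. It assumes the list endpoint returns a sequence of dicts shaped like the schema above; the function names and the "UNKNOWN" fallback label are hypothetical and not part of the API.

```python
from collections import Counter


def unhealthy_ids(components):
    """Return the IDs of components whose health status is UNHEALTHY."""
    return [
        c["id"]
        for c in components
        if c.get("health", {}).get("status") == "UNHEALTHY"
    ]


def count_by_status(components):
    """Count components per health status.

    "UNKNOWN" is a hypothetical label for components that report no
    status; it is not one of the enum values defined by the schema.
    """
    return Counter(
        c.get("health", {}).get("status", "UNKNOWN") for c in components
    )
```

A warning could then be raised for each ID returned by `unhealthy_ids`, while `count_by_status` gives a quick summary for a dashboard.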

The Firmware data type will look like the following OpenAPI schema reference:

Firmware:
    type: object
    properties:
        revision:
            description: |
                The revision of firmware this component is running.
            type: string
        timeUpdated:
            description: |
                The date and time the firmware was updated to this revision.
            type: string
            format: date-time
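
One way to answer "is it running the firmware revision that I expect?" from this data is to take the most recent entry in a component's firmware history. The sketch below assumes that the entry with the latest `timeUpdated` is the running revision and that all timestamps carry a UTC offset or a trailing `Z`; neither assumption is stated by this RFD, and the function names are illustrative.

```python
from datetime import datetime


def _parse_time(value):
    # Accept a trailing "Z" (UTC), which older Python versions of
    # datetime.fromisoformat do not parse directly.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def current_revision(component):
    """Return the most recently applied firmware revision, or None."""
    history = component.get("firmware", {}).get("history", [])
    if not history:
        return None
    latest = max(history, key=lambda fw: _parse_time(fw["timeUpdated"]))
    return latest["revision"]


def runs_expected_revision(component, expected):
    """True if the latest firmware history entry matches the expected revision."""
    return current_revision(component) == expected
```

The same helper could feed a fleet-wide query, for example grouping components by `current_revision` to count how many still run an older revision.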