Background
This RFD expands on [rfd-493] to provide more details on how a Container Storage Interface plugin for Oxide would be implemented.
Kubernetes uses persistent volumes to provide long-term storage to containers. Pods can access volumes by creating persistent volume claims (PVCs) that are fulfilled by Kubernetes during deployment.
These claims can be static, where the volume is pre-created and registered to the cluster manually, or dynamic, where the volume is automatically provisioned when the pod is first created. The end result is a persistent volume (PV) that can be mounted into the container.
To support different storage providers, Kubernetes standardized the volume lifecycle handling into a common interface called the Container Storage Interface (CSI).
The CSI spec uses specific terminology to refer to certain roles and components. This RFD follows the same terminology to keep the content consistent.
CO is the container orchestrator. The most well-known solution is Kubernetes, but other projects, such as Mesos and Nomad, also implement the CSI spec and are other examples of COs.
SP is the storage provider. In the context of this RFD, the SP is the Oxide API, but other examples include any solution that provides persistent data access via a remote API, such as block storage (AWS EBS and GCP Persistent Disks), network storage (NFS and SMB), secret management systems (Vault and AWS Secrets Manager), object storage (AWS S3 and GCP Cloud Storage), etc.
Workload is the unit of work created by the CO. In Kubernetes this can be thought of as a pod or a container.
Volume is the unit of storage created by the SP and used by the workload. In the context of Oxide, the volume is a disk, but it can be a file or a secret, depending on the SP being used.
Node is a host where workloads run, such as Oxide instances.
The CSI spec leaves some implementation details to COs and, with Kubernetes being the most used CO in the industry, it is often used as the "expected" behaviour. This RFD mentions Kubernetes specifically (instead of using the more generic term CO) when referring to details specific to Kubernetes.
CSI plugins (sometimes also called drivers) are binaries that implement the CSI spec and expose an endpoint with a set of RPCs that are called by COs in a specific order to complete the lifecycle of a volume.
They also advertise the capabilities they implement to allow COs to dynamically adjust which calls to make depending on the features supported by the plugin.
The plugin’s RPCs are grouped into three high-level services:
The Identity Service provides basic information about the plugin, such as its name and capabilities.
The Controller Service exposes RPCs that are expected to call the upstream SP APIs to manage the lifecycle of a volume from a remote state perspective.
The Node Service exposes RPCs related to the local node where volumes will be used.
These services can be implemented and deployed as separate binaries or, most commonly, bundled into the same binary that can be configured to run in specific modes, such as node, controller, or all.
In Kubernetes, CSI plugins are packaged as OCI images and deployed as regular pods in the cluster. Pods that provide the Controller Services are usually created using Deployments, with more than one replica for redundancy, and Node Services as DaemonSets so they are available on every node of the cluster.
CSI plugins are also deployed along with a set of sidecar containers in the same pod to help reduce boilerplate and keep plugin implementation focused on the CSI spec instead of Kubernetes-specific details.
For example, in Kubernetes, the Controller Services RPCs are not called directly, but rather plugins are expected to subscribe and listen to specific events about volume creation requests from the Kubernetes API.
Since this is a common requirement for all CSI plugins, the external-provisioner sidecar is provided to listen for the right events from the Kubernetes API and call the appropriate Controller Service RPCs via a local Unix Domain Socket.
Node services are called directly by the Kubernetes agent running on the node (kubelet), but they also use sidecar containers for common functionality, such as livenessprobe to monitor the plugin's health.

Volume lifecycle
The CSI spec provides a few alternatives for volume lifecycle implementation. This RFD focuses on the most complete option as it provides more implementation flexibility.
   CreateVolume +------------+ DeleteVolume
 +------------->|  CREATED   +--------------+
 |              +---+----^---+              |
 |       Controller |    | Controller       v
+++         Publish |    | Unpublish       +++
|X|          Volume |    | Volume          | |
+-+             +---v----+---+             +-+
                | NODE_READY |
                +---+----^---+
               Node |    | Node
              Stage |    | Unstage
             Volume |    | Volume
                +---v----+---+
                |  VOL_READY |
                +---+----^---+
               Node |    | Node
            Publish |    | Unpublish
             Volume |    | Volume
                +---v----+---+
                | PUBLISHED  |
                +------------+
Figure 6: The lifecycle of a dynamically provisioned volume, from
creation to destruction, when the Node Plugin advertises the
STAGE_UNSTAGE_VOLUME capability.
The following sections detail the actions the Oxide CSI plugin needs to take to implement these RPCs.
CreateVolume
The CreateVolume RPC is part of the Controller Services and is the first function to be called when creating a new volume. This call is made before any scheduling decision about where the workload will run.
The CSI plugin is expected to communicate with the SP API to create a new volume given the parameters set in the request, and respond with, among other things, a unique identifier for this volume, which is usually the ID specified by the SP.
Volume creation also needs to meet the volume capability and accessibility requirements.
Volume capability has two main properties.
Access type can be one of block or mount and defines how the volume content is structured. mount volumes are formatted and made available to workloads as a regular filesystem, while block volumes are kept as raw disks.
Access mode controls how many nodes and workloads can mount the volume (one or many) and how they are allowed to access it (read-only or read-and-write). Possible values are SINGLE_NODE_WRITER, SINGLE_NODE_READER_ONLY, MULTI_NODE_READER_ONLY, MULTI_NODE_SINGLE_WRITER, and MULTI_NODE_MULTI_WRITER.
The volume accessibility defines the topology preferences and requisites for where the volume should be accessible from (such as which zone, region, datacenter, rack, etc.) and, therefore, where the volume will be created.
Example 1:
Given a volume should be accessible from a single zone, and
requisite =
{"region": "R1", "zone": "Z2"},
{"region": "R1", "zone": "Z3"}
preferred =
{"region": "R1", "zone": "Z3"}
then the SP SHOULD first attempt to make the provisioned volume
available from "zone" "Z3" in the "region" "R1" and fall back to
"zone" "Z2" in the "region" "R1" if that is not possible.
The request can optionally specify a volume source.
The snapshot source creates the new volume from an existing snapshot. This feature can only be used if the plugin advertises the CREATE_DELETE_SNAPSHOT capability.
The volume source creates a new volume by cloning an existing volume. This feature is gated by the CLONE_VOLUME capability.
If no source is specified, the volume is created as a blank disk.
Oxide implementation
This RPC is implemented using the POST /v1/ API endpoint to create a new blank disk with the given name and disk capacity.
Users can request disk capacity to be a range, starting from a required value (disk MUST be at least this big) to a limit value (disk MUST be at most this big). This range can be relevant when rounding or transforming the requested disk size, since the Oxide API requires it to be set in bytes and as a multiple of the disk block size.
The block size itself (and other Oxide-specific values) can be passed in the RPC request as an opaque key-value store of parameters.
The disk name is provided by the CO, but the plugin may choose a different name when making the API call. This could be important considering that disk serial numbers are truncated to 20 bytes, so names provided by COs may conflict with other disks that share the same prefix. When this happens, only one of the disks is visible from the guest instance, even if both disks are attached.
The RPC response also accepts opaque metadata context that the CO propagates to the other RPCs. The Oxide CSI plugin can use this field to store the name of the disk created so it can later identify it from within an instance.
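To make the capacity rounding and metadata-context handling above concrete, the following Go sketch shows one possible shape for the handler. It uses the Go bindings from the CSI spec; the oxideDisks interface, the 4096-byte default block size, the 1 GiB fallback size, and the csi.oxide.computer/disk-name context key are assumptions made for illustration, not settled interface details.

package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// oxideDisks is a hypothetical wrapper around the Oxide API client; the real
// plugin would implement it on top of the Oxide SDK.
type oxideDisks interface {
	CreateDisk(ctx context.Context, name string, sizeBytes int64) (id string, err error)
}

type controller struct {
	csi.UnimplementedControllerServer
	disks oxideDisks
}

const defaultBlockSize = 4096 // assumed default; could be overridden via req.Parameters

func (c *controller) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error) {
	// Round the required size up to a multiple of the block size, but never
	// past the optional limit provided by the CO.
	required := req.GetCapacityRange().GetRequiredBytes()
	if required == 0 {
		required = 1 << 30 // assume a 1 GiB default when no capacity range is set
	}
	limit := req.GetCapacityRange().GetLimitBytes()
	size := ((required + defaultBlockSize - 1) / defaultBlockSize) * defaultBlockSize
	if limit > 0 && size > limit {
		return nil, status.Errorf(codes.OutOfRange, "size %d rounded up to the block size exceeds limit %d", required, limit)
	}

	// The CO-provided name may collide once truncated to the 20-byte disk
	// serial; a real plugin may derive a shorter, collision-free name here.
	name := req.GetName()

	id, err := c.disks.CreateDisk(ctx, name, size)
	if err != nil {
		return nil, status.Errorf(codes.Internal, "creating disk: %v", err)
	}

	return &csi.CreateVolumeResponse{
		Volume: &csi.Volume{
			VolumeId:      id,
			CapacityBytes: size,
			// Propagated to the node plugin so it can later find the disk by serial.
			VolumeContext: map[string]string{"csi.oxide.computer/disk-name": name},
		},
	}, nil
}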
ControllerPublishVolume
The CO calls the ControllerPublishVolume RPC of the Controller Services once it has selected which node will run the workload.
The request contains the disk ID returned by CreateVolume and the node ID where the workload has been scheduled. The CO retrieves the node ID by calling the NodeGetInfo RPC from the Node Services.
The plugin is expected to make the volume available for the node to use.
Oxide implementation
The implementation uses the POST /v1/ API endpoint to attach the newly created disk to the instance that is running the kubelet where the workload is scheduled to run.
The NodeGetInfo RPC needs to return the Oxide instance ID as the node ID so the controller plugin knows which ID to use in the API call.
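A minimal Go sketch of this RPC is shown below; the AttachDisk method on the hypothetical oxideDisks interface stands in for the corresponding Oxide API call and is not a real SDK signature.

package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// oxideDisks is a hypothetical wrapper around the Oxide API client.
type oxideDisks interface {
	AttachDisk(ctx context.Context, instanceID, diskID string) error
}

type controller struct {
	csi.UnimplementedControllerServer
	disks oxideDisks
}

func (c *controller) ControllerPublishVolume(ctx context.Context, req *csi.ControllerPublishVolumeRequest) (*csi.ControllerPublishVolumeResponse, error) {
	// VolumeId is the disk ID returned by CreateVolume; NodeId is the Oxide
	// instance ID reported by the node plugin's NodeGetInfo.
	if req.GetVolumeId() == "" || req.GetNodeId() == "" {
		return nil, status.Error(codes.InvalidArgument, "volume ID and node ID are required")
	}
	if err := c.disks.AttachDisk(ctx, req.GetNodeId(), req.GetVolumeId()); err != nil {
		return nil, status.Errorf(codes.Internal, "attaching disk: %v", err)
	}
	return &csi.ControllerPublishVolumeResponse{}, nil
}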
NodeStageVolume and NodePublishVolume
The NodeStageVolume and NodePublishVolume RPCs are part of the Node Services, so they run on the node where the volume is being attached.
They are called by the CO when the workload that is going to use the volume is scheduled to the node, but before it is actually created.
When these RPCs are received, the plugin is expected to make the volume ready to be used by the workload and available at the path defined in the request.
The request also includes information about the filesystem and mount flags to use when preparing the volume.
These RPCs are fairly similar, and NodeStageVolume is optional if the plugin does not advertise the STAGE_UNSTAGE_VOLUME capability, but having the two steps provides some additional flexibility.
For example, in Kubernetes the mount path specified in the request to NodeStageVolume is a global path that can be reused by multiple pods, while the NodePublishVolume RPC receives a per-pod path.
Oxide implementation
The specific implementation of these RPCs depends on the host operating system running the node plugin. This RFD assumes Linux as the operating system for the first implementation.
To prepare a blank disk, the node plugin needs to create a partition, format it, and mount it to the path specified in the request.
Since these operations are common across SPs, the Kubernetes development team provides the k8s.io/mount-utils utility package to implement most of this logic.
One key step in this process is correlating the disk ID received in the RPC request with the logical name of the device in the Linux host.
This can be done by reading each disk serial value (using the lsblk command or from the / file) and comparing them with the disk name set by the CreateVolume response in the metadata context.
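The following Go sketch illustrates the serial-matching and format-and-mount steps using k8s.io/mount-utils. It assumes the serial is readable under /sys/block/<device>/serial (as is typical for virtio-blk devices), that CreateVolume stored the disk name under a csi.oxide.computer/disk-name key in the volume context, and that ext4 is an acceptable default filesystem; all three are illustrative choices rather than settled decisions.

package driver

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"strings"

	"github.com/container-storage-interface/spec/lib/go/csi"
	mount "k8s.io/mount-utils"
	utilexec "k8s.io/utils/exec"
)

type node struct {
	csi.UnimplementedNodeServer
	mounter *mount.SafeFormatAndMount
}

func newNode() *node {
	return &node{mounter: &mount.SafeFormatAndMount{Interface: mount.New(""), Exec: utilexec.New()}}
}

// findDeviceBySerial scans /sys/block and returns the device whose serial
// matches the disk name stored in the volume context. Serials are truncated
// to 20 bytes, so the comparison uses the truncated value.
func findDeviceBySerial(serial string) (string, error) {
	if len(serial) > 20 {
		serial = serial[:20]
	}
	entries, err := os.ReadDir("/sys/block")
	if err != nil {
		return "", err
	}
	for _, e := range entries {
		raw, err := os.ReadFile(filepath.Join("/sys/block", e.Name(), "serial"))
		if err != nil {
			continue // device without a serial attribute
		}
		if strings.TrimSpace(string(raw)) == serial {
			return "/dev/" + e.Name(), nil
		}
	}
	return "", fmt.Errorf("no device found with serial %q", serial)
}

func (n *node) NodeStageVolume(ctx context.Context, req *csi.NodeStageVolumeRequest) (*csi.NodeStageVolumeResponse, error) {
	device, err := findDeviceBySerial(req.GetVolumeContext()["csi.oxide.computer/disk-name"])
	if err != nil {
		return nil, err
	}
	fsType := req.GetVolumeCapability().GetMount().GetFsType()
	if fsType == "" {
		fsType = "ext4" // assumed default filesystem
	}
	// SafeFormatAndMount formats the device only if it has no filesystem yet,
	// then mounts it at the staging path shared by all pods on this node.
	if err := n.mounter.FormatAndMount(device, req.GetStagingTargetPath(), fsType, nil); err != nil {
		return nil, err
	}
	return &csi.NodeStageVolumeResponse{}, nil
}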
NodeUnpublishVolume and NodeUnstageVolume
The NodeUnpublishVolume and NodeUnstageVolume RPCs are part of the Node Services and are called by the CO when the workload is stopped and ready to be moved out of the node, either because it was rescheduled or completely stopped. They must undo the actions taken during NodePublishVolume and NodeStageVolume.
Oxide implementation
Similarly to NodePublishVolume and NodeStageVolume, the k8s.io/mount-utils package provides most of the logic necessary to implement these RPCs, and there is no Oxide-specific work to be done.
The function CleanupMountPoint() unmounts the volume from a given path and deletes any remaining unused directories.
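The following is a minimal sketch of how both RPCs can delegate to CleanupMountPoint() from k8s.io/mount-utils; the node type and its mounter field are illustrative.

package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	mount "k8s.io/mount-utils"
)

type node struct {
	csi.UnimplementedNodeServer
	mounter mount.Interface
}

func (n *node) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpublishVolumeRequest) (*csi.NodeUnpublishVolumeResponse, error) {
	// Unmount the per-pod path and remove the now-empty directory.
	if err := mount.CleanupMountPoint(req.GetTargetPath(), n.mounter, true); err != nil {
		return nil, err
	}
	return &csi.NodeUnpublishVolumeResponse{}, nil
}

func (n *node) NodeUnstageVolume(ctx context.Context, req *csi.NodeUnstageVolumeRequest) (*csi.NodeUnstageVolumeResponse, error) {
	// Same cleanup for the node-wide staging path.
	if err := mount.CleanupMountPoint(req.GetStagingTargetPath(), n.mounter, true); err != nil {
		return nil, err
	}
	return &csi.NodeUnstageVolumeResponse{}, nil
}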
ControllerUnpublishVolume
The ControllerUnpublishVolume RPC is part of the Controller Services and is called by the CO when the workload is descheduled from a node.
The plugin is expected to make the volume ready to be consumed and published to a different node.
Oxide implementation
The Oxide CSI plugin uses the POST /v1/ API endpoint to detach the disk from the current instance and make it available to be attached to a different instance, if necessary.
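A rough sketch of the handler is below; the DetachDisk method on the hypothetical oxideDisks interface stands in for the Oxide API call and is not a real SDK signature.

package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// oxideDisks is a hypothetical wrapper around the Oxide API client.
type oxideDisks interface {
	DetachDisk(ctx context.Context, instanceID, diskID string) error
}

type controller struct {
	csi.UnimplementedControllerServer
	disks oxideDisks
}

func (c *controller) ControllerUnpublishVolume(ctx context.Context, req *csi.ControllerUnpublishVolumeRequest) (*csi.ControllerUnpublishVolumeResponse, error) {
	// Detach the disk from the instance it was published to so it can be
	// attached elsewhere by a later ControllerPublishVolume call.
	if err := c.disks.DetachDisk(ctx, req.GetNodeId(), req.GetVolumeId()); err != nil {
		return nil, status.Errorf(codes.Internal, "detaching disk: %v", err)
	}
	return &csi.ControllerUnpublishVolumeResponse{}, nil
}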
DeleteVolume
The DeleteVolume RPC is part of the Controller Services and is called when the CO determines that a volume is no longer needed. The volume is identified by the unique ID returned in the CreateVolume RPC response.
In Kubernetes, this RPC is called when the user deletes a PVC object.
Oxide implementation
The Oxide CSI plugin uses the DELETE /v1/ API endpoint to delete the disk. The disk UUID is retrieved from the volume ID set in the RPC request.
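A sketch of the handler is shown below. The errDiskNotFound sentinel and the DeleteDisk method are illustrative placeholders for the Oxide client; the relevant detail is that a disk that is already gone is treated as a successful deletion, since the CSI spec expects DeleteVolume to be idempotent.

package driver

import (
	"context"
	"errors"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errDiskNotFound is a hypothetical sentinel; the real plugin would map the
// Oxide API's "not found" responses to something like it.
var errDiskNotFound = errors.New("disk not found")

// oxideDisks is a hypothetical wrapper around the Oxide API client.
type oxideDisks interface {
	DeleteDisk(ctx context.Context, diskID string) error
}

type controller struct {
	csi.UnimplementedControllerServer
	disks oxideDisks
}

func (c *controller) DeleteVolume(ctx context.Context, req *csi.DeleteVolumeRequest) (*csi.DeleteVolumeResponse, error) {
	err := c.disks.DeleteDisk(ctx, req.GetVolumeId())
	// DeleteVolume must be idempotent: a disk that no longer exists counts as
	// a successful deletion.
	if err != nil && !errors.Is(err, errDiskNotFound) {
		return nil, status.Errorf(codes.Internal, "deleting disk: %v", err)
	}
	return &csi.DeleteVolumeResponse{}, nil
}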
UX
This section describes the user experience of operators using the Oxide CSI plugin in a Kubernetes cluster running on an Oxide rack.
The general workflow would be similar for other orchestrators but, as mentioned earlier, the focus of this RFD is the Kubernetes integration.
Deploying the Oxide CSI plugin
The Oxide CSI plugin is packaged as an OCI image and can be made available via any OCI-compatible registry, such as Docker Hub, Quay, AWS Elastic Container Registry (ECR), GitHub Container Registry etc.
The image is deployed as pods in the cluster.
The controller plugin can be defined as a Deployment since it does not need to run on any specific node. It should be possible to run multiple instances of the controller plugin for increased reliability, but the implementation must take care to ensure RPCs are idempotent and have proper coordination for concurrent execution, such as some kind of leader election process.
Meanwhile, the node plugin can be defined as a DaemonSet so it is available on every node in the cluster. Each instance of the node plugin can be considered independent from the others since the Kubernetes control plane is responsible for activating the right kubelet where the RPCs are called.
Both plugins are deployed with a handful of supporting sidecars, which are provided by the Kubernetes development team. The following YAML snippets provide a general example of how the pods could be deployed.
kind: Deployment
apiVersion: apps/v1
metadata:
  name: oxide-csi-controller
  labels:
    app.kubernetes.io/name: oxide-csi-driver
spec:
  replicas: 2
  selector:
    matchLabels:
      app: oxide-csi-controller
      app.kubernetes.io/name: oxide-csi-driver
  template:
    spec:
      containers:
        - name: oxide-plugin
          image: oxidecomputer/oxide-csi-plugin:v0.1.0
          args:
            - --endpoint=$(CSI_ENDPOINT)
            - --mode=controller
            # ...
          env:
            - name: CSI_ENDPOINT
              value: unix:///csi/csi.sock
            # Environment variables for Oxide API access.
            - name: OXIDE_HOST
              valueFrom:
                secretKeyRef:
                  name: oxide-secret
                  key: host
            - name: OXIDE_TOKEN
              valueFrom:
                secretKeyRef:
                  name: oxide-secret
                  key: token
            - name: OXIDE_PROJECT
              valueFrom:
                secretKeyRef:
                  name: oxide-secret
                  key: project
            # ...
          volumeMounts:
            - name: socket-dir
              mountPath: /csi
          # ...
        - name: csi-provisioner
          image: registry.k8s.io/sig-storage/csi-provisioner:v5.0.2
          volumeMounts:
            - mountPath: /csi
              name: socket-dir
          # ...
        - name: csi-attacher
          image: registry.k8s.io/sig-storage/csi-attacher:v4.6.1
          volumeMounts:
            - mountPath: /csi
              name: socket-dir
          # ...
        - name: liveness-probe
          image: registry.k8s.io/sig-storage/livenessprobe:v2.13.1
          volumeMounts:
            - mountPath: /csi
              name: socket-dir
          # ...
      # Shared volume so the sidecar containers can communicate
      # with the plugin via its Unix domain socket.
      volumes:
        - name: socket-dir
          emptyDir: {}
      # ...
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: oxide-csi-node
  labels:
    app.kubernetes.io/name: oxide-csi-driver
spec:
  selector:
    matchLabels:
      app: oxide-csi-node
      app.kubernetes.io/name: oxide-csi-driver
  template:
    spec:
      containers:
        - name: oxide-plugin
          image: oxidecomputer/oxide-csi-plugin:v0.1.0
          args:
            - --endpoint=$(CSI_ENDPOINT)
            - --mode=node
            # ...
          env:
            - name: CSI_ENDPOINT
              value: unix:///csi/csi.sock
            # ...
          volumeMounts:
            - name: kubelet-dir
              mountPath: /var/lib/kubelet
              mountPropagation: "Bidirectional"
            - name: plugin-dir
              mountPath: /csi
            # Access host's /dev path.
            - name: device-dir
              mountPath: /dev
          # ...
          securityContext:
            # Run plugin as privileged container to allow formatting and
            # mounting the disk.
            privileged: true
        - name: node-driver-registrar
          image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.11.1
          volumeMounts:
            - name: plugin-dir
              mountPath: /csi
            - name: registration-dir
              mountPath: /registration
            - name: probe-dir
              mountPath: /var/lib/kubelet/plugins/csi.oxide.computer/
          # ...
        - name: liveness-probe
          image: registry.k8s.io/sig-storage/livenessprobe:v2.13.1
          volumeMounts:
            - name: plugin-dir
              mountPath: /csi
          # ...
      volumes:
        - name: kubelet-dir
          hostPath:
            path: /var/lib/kubelet
            type: Directory
        - name: plugin-dir
          hostPath:
            path: /var/lib/kubelet/plugins/csi.oxide.computer/
            type: DirectoryOrCreate
        - name: registration-dir
          hostPath:
            path: /var/lib/kubelet/plugins_registry/
            type: Directory
        - name: device-dir
          hostPath:
            path: /dev
            type: Directory
        - name: probe-dir
          emptyDir: {}
      # ...
Since the plugin sidecars need access to the Kubernetes API, a production deployment also needs to include a set of RBAC rules, which are usually bound to a service account that is used by these containers.
The last two pieces for the plugin deployment are the CSIDriver and StorageClass objects.
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: csi.oxide.computer
  labels:
    app.kubernetes.io/name: oxide-csi-driver
spec:
  # ...
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: oxide-disk
provisioner: csi.oxide.computer
The plugin deployment is fairly standardized across clusters, so it is possible to provide users with a set of base YAML configuration files, or to package them with higher-level tools, such as a Helm chart or Kustomize.
Using the Oxide CSI plugin
With the plugin deployed, users can create and use Oxide disks as regular Kubernetes PVCs.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: oxide-postgres-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: oxide-disk
---
apiVersion: v1
kind: Pod
metadata:
  name: postgres
spec:
  containers:
    - name: postgres
      image: postgres:17.6
      volumeMounts:
        - name: postgres-data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: postgres-data
      persistentVolumeClaim:
        claimName: oxide-postgres-data
Blockers, limitations, and open questions
This section lists all the current blockers, limitations, and open questions that currently affect the development of the Oxide CSI plugin.
Blockers
Blockers prevent the plugin from being developed or adopted in production by users.
Attaching and detaching volumes require instances to be stopped
Priority: High
Requiring instances to be stopped before attaching or detaching a disk prevents most real-world uses of the Oxide CSI plugin, as it results in downtime every time a pod with a PVC is scheduled into the instance.
It can also cause cluster-wide disruptions: when an instance is shut down, all of the pods running on it need to be rescheduled somewhere else, which in turn causes those target instances to restart as well.
Limitations
Limitations are pain points that make plugin implementation harder, or missing features that some users may expect to have.
Instances are limited to a maximum of 8 disks
Priority: Medium
Since each Kubernetes volume correlates to an Oxide disk, this limit affects the number of pods with PVCs that can be scheduled per instance, reducing the overall cluster workload density.
For reference, the table below lists the same limit for other cloud service providers. The exact number varies depending on the instance type of the node ([k8s-storage-limits]).
| Cloud provider | Disks per node |
| --- | --- |
| AWS | 25 or 39 |
| GCP | Up to 127 |
| Azure | Up to 64 |
Plugins can advertise this limit in the response of the NodeGetInfo RPC, so COs are able to take it into consideration during scheduling.
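If the plugin chooses to advertise the limit, the node plugin's NodeGetInfo response could look roughly like the sketch below. Reserving one of the eight slots for the instance's boot disk is an assumption made for illustration.

package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

type node struct {
	csi.UnimplementedNodeServer
	instanceID string // Oxide instance ID, discovered at startup
}

func (n *node) NodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error) {
	return &csi.NodeGetInfoResponse{
		// The controller plugin uses this ID when attaching and detaching disks.
		NodeId: n.instanceID,
		// 8 disks per instance, minus one assumed to be the boot disk.
		MaxVolumesPerNode: 7,
	}, nil
}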
Oxide API authentication
Priority: Medium
The Oxide CSI plugin needs to access the Oxide API to create and delete disks, attach and detach disks from instances, and other operations.
Currently, however, the only way to authenticate API requests is via device tokens, which are attached to a specific user and are hard to manage at scale.
[rfd-553] describes the concept of service accounts, which can help solve this problem.
Retrieving Oxide instance metadata
Priority: Medium
RPCs such as NodeGetInfo require information about the specific Oxide instance where the plugin is running, such as its name and ID. Cloud providers usually expose a metadata endpoint that can be queried from within the instance to retrieve this type of information.
Without a metadata API, users need to manually set static configuration values, such as environment variables, directly into the Oxide CSI plugin container.
The Kubernetes Cloud Controller Manager described in [rfd-493] can help with this problem by adding the Oxide instance name and ID to the node object itself. The Oxide CSI plugin can then query the Kubernetes API to retrieve the information it needs.
One caveat of this approach is that CSI plugins are expected to be CO agnostic, and querying the Kubernetes API directly breaks this assumption.
One alternative is to use the same sidecar pattern to isolate the Kubernetes API in a different container that is then responsible for feeding the information to the Oxide CSI plugin via environment variables.
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: oxide-csi-node
  labels:
    app.kubernetes.io/name: oxide-csi-driver
spec:
  selector:
    matchLabels:
      app: oxide-csi-node
      app.kubernetes.io/name: oxide-csi-driver
  template:
    spec:
      initContainers:
        # This container queries the Kubernetes API to retrieve metadata for
        # the node K8S_NODE_NAME and writes it to a file in the /data volume as
        # KEY=VALUE pairs.
        - name: oxide-instance-metadata
          image: oxidecomputer/oxide-k8s-instance-metadata:v0.1.0
          env:
            - name: K8S_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - name: oxide-instance-metadata
              mountPath: /data
          # ...
      containers:
        - name: oxide-plugin
          env:
            # The Oxide CSI plugin reads the Oxide instance ID from the node
            # metadata retrieved by the initContainer.
            - name: OXIDE_INSTANCE_ID
              valueFrom:
                fileKeyRef:
                  path: config.env
                  volumeName: oxide-instance-metadata
                  key: OXIDE_INSTANCE_ID
          # ...
        # ...
      volumes:
        - name: oxide-instance-metadata
          emptyDir: {}
        # ...
      # ...
Disks cannot be expanded
Priority: Medium
The CSI spec defines the ControllerExpandVolume and NodeExpandVolume RPCs to allow cluster operators to dynamically grow an existing volume, but Oxide disks have a fixed size that is defined upon creation, so the Oxide CSI plugin cannot support this feature.
These RPCs are gated by the EXPAND_VOLUME capability, so COs are able to prevent users from accessing this functionality if the plugin does not indicate support for it.
Lack of disk metadata
Priority: Low
The Controller Service has a method called ListVolumes that is expected to return all the volumes the plugin knows about.
Without some kind of resource tagging or metadata, the plugin would need to rely on name pattern matching to find the disks it created.
This RPC is gated by the LIST_VOLUMES capability, so it would be possible to release the plugin without this functionality.
Only SINGLE_NODE_WRITER access mode supported
Priority: Low
Oxide disks can only be attached to a single instance at a time, and are always available for reads and writes, so the only access mode the plugin can support is SINGLE_NODE_WRITER.
This requirement can be documented and validated by the plugin, and SINGLE_NODE_WRITER is arguably the most common access mode used for disks, so this limitation should have little impact for most users.
Cloud providers also have similar limitations. For example, AWS only supports MULTI_NODE_MULTI_WRITER with access type block and has no read-only support.
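As a sketch, the plugin could reject unsupported modes with a small validation helper shared by CreateVolume and ValidateVolumeCapabilities; the helper name is illustrative.

package driver

import (
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// validateCapabilities rejects any requested access mode other than
// SINGLE_NODE_WRITER, which is the only mode Oxide disks can support today.
func validateCapabilities(caps []*csi.VolumeCapability) error {
	for _, c := range caps {
		if mode := c.GetAccessMode().GetMode(); mode != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {
			return fmt.Errorf("unsupported access mode: %s", mode)
		}
	}
	return nil
}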
Volume cloning
Priority: Low
CSI volumes can be created from three different sources: a blank disk, a snapshot, or by cloning an existing volume. The Oxide API can support the first two use cases, but not the third one.
This feature is gated by the CLONE_VOLUME capability, and not all cloud providers support this functionality either.
Open questions
Open questions are decisions that have been deferred until more information is available to guide implementation.
Topologies and multi-rack clusters
Multi-rack deployments are still under active discussion, but a Kubernetes cluster can already be deployed across multiple Oxide racks. In this scenario, the Oxide CSI plugin needs a way to guarantee that pods are only scheduled to instances in the rack that has the disks they need.
This requires some mechanism for Oxide racks to be uniquely identified so that they can be used as a topology key.
The topology details will also be affected by the final multi-rack support implementation. For example, if an instance on rack A is able to access a disk from rack B, then rack could be defined as a preferred topology instead of a requisite.
Multi-rack environments could also be further grouped into different datacenters or geolocations, so users would need a way to specify this information as well.
External References
[rfd-493] https://493.rfd.oxide.computer/
[rfd-553] https://553.rfd.oxide.computer/
[csi-spec] https://github.com/container-storage-interface/spec/blob/master/spec.md
[k8s-deployment] https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
[k8s-daemonset] https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
[k8s-storage-class] https://kubernetes.io/docs/concepts/storage/storage-classes/
[k8s-csi-driver] https://kubernetes.io/docs/reference/kubernetes-api/config-and-storage-resources/csi-driver-v1/
[k8s-storage-limits] https://kubernetes.io/docs/concepts/storage/storage-limits/
[k8s-external-provisioner] https://kubernetes-csi.github.io/docs/external-provisioner.html
[k8s-livenessprobe] https://kubernetes-csi.github.io/docs/livenessprobe.html
[mount-utils] https://pkg.go.dev/k8s.io/mount-utils
[csi-create-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#createvolume
[csi-controller-publish-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#controllerpublishvolume
[csi-node-get-info] https://github.com/container-storage-interface/spec/blob/master/spec.md#nodegetinfo
[csi-node-stage-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#nodestagevolume
[csi-node-publish-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#nodepublishvolume
[csi-node-unpublish-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#nodeunpublishvolume
[csi-node-unstage-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#nodeunstagevolume
[csi-controller-unpublish-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#controllerunpublishvolume
[csi-delete-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#deletevolume
https://www.redhat.com/en/blog/persistent-volume-support-peer-pods-technical-deep-dive