Background
This RFD expands on [rfd-493] to provide more details on how a Container Storage Interface plugin for Oxide would be implemented.
Kubernetes uses persistent volumes to provide long-term storage to containers. Pods can access volumes by creating persistent volume claims (PVCs) that are fulfilled by Kubernetes during deployment.
These claims can be static, where the volume is pre-created and registered to the cluster manually, or dynamic, where the volume is automatically provisioned when the pod is first created. The end result is a persistent volume (PV) that can be mounted into the container.
To support different storage providers, Kubernetes standardized the volume lifecycle handling into a common interface called the Container Storage Interface (CSI).
The CSI spec uses specific terminology to refer to certain roles and components. This RFD follows the same terminology to keep the content consistent.
CO is the container orchestrator. The most well-known solution is Kubernetes, but other projects, such as Mesos and Nomad, also implement the CSI spec and are other examples of COs.
SP is the storage provider. In the context of this RFD, the SP is the Oxide API, but other examples include any solution that provides persistent data access via a remote API, such as block storage (AWS EBS and GCP Persistent Disks), network storage (NFS and SMB), secret management systems (Vault and AWS Secrets Manager), object storage (AWS S3 and GCP Cloud Storage), etc.
Workload is the unit of work created by the CO. In Kubernetes this can be thought of as a pod or a container.
Volume is the unit of storage created by the SP and used by the workload. In the context of Oxide, the volume is a disk, but it can be a file or a secret, depending on the SP being used.
Node is a host where workloads run, such as Oxide instances.
The CSI spec leaves some implementation details to COs and, with Kubernetes being the most used CO in the industry, it is often used as the "expected" behaviour. This RFD mentions Kubernetes specifically (instead of using the more generic term CO) when referring to details specific to Kubernetes.
CSI plugins (sometimes also called drivers) are binaries that implement the CSI spec and expose an endpoint with a set of RPCs that are called by COs in a specific order to complete the lifecycle of a volume.
They also advertise the capabilities they implement to allow COs to dynamically adjust which calls to make depending on the features supported by the plugin.
The plugin’s RPCs are grouped into three high-level services:
The Identity Service provides basic information about the plugin, such as its name and capabilities.
The Controller Service exposes RPCs that are expected to call the upstream SP APIs to manage the lifecycle of a volume from a remote state perspective.
The Node Service exposes RPCs related to the local node where volumes will be used.
These services can be implemented and deployed as separate binaries or, most commonly, bundled into the same binary that can be configured to run in specific modes, such as node, controller, or all.
In Kubernetes, CSI plugins are packaged as OCI images and deployed as regular pods in the cluster. Pods that provide the Controller Services are usually created using Deployments, with more than one replica for redundancy, and Node Services as DaemonSets so they are available on every node of the cluster.
CSI plugins are also deployed along with a set of sidecar containers in the same pod to help reduce boilerplate and keep plugin implementation focused on the CSI spec instead of Kubernetes-specific details.
For example, in Kubernetes, the Controller Services RPCs are not called directly, but rather plugins are expected to subscribe and listen to specific events about volume creation requests from the Kubernetes API.
Since this is a common requirement for all CSI plugins, the external-provisioner sidecar is provided to listen for the right events from the Kubernetes API and call the appropriate Controller Service RPCs via a local Unix Domain Socket.
Node services are called directly by the Kubernetes agent running on the node (kubelet), but they also use sidecar containers for common functionality, such as livenessprobe to monitor the plugin's health.

Volume lifecycle
The CSI spec provides a few alternatives for volume lifecycle implementation. This RFD focuses on the most complete option as it provides more implementation flexibility.
   CreateVolume +------------+ DeleteVolume
 +------------->|  CREATED   +--------------+
 |              +---+----^---+              |
 |       Controller |    | Controller       v
+++         Publish |    | Unpublish       +++
|X|          Volume |    | Volume          | |
+-+             +---v----+---+             +-+
                | NODE_READY |
                +---+----^---+
               Node |    | Node
              Stage |    | Unstage
             Volume |    | Volume
                +---v----+---+
                |  VOL_READY |
                +---+----^---+
               Node |    | Node
            Publish |    | Unpublish
             Volume |    | Volume
                +---v----+---+
                | PUBLISHED  |
                +------------+
Figure 6: The lifecycle of a dynamically provisioned volume, from
creation to destruction, when the Node Plugin advertises the
STAGE_UNSTAGE_VOLUME capability.
The following sections detail the actions the Oxide CSI plugin needs to take to implement these RPCs.
CreateVolume
The CreateVolume RPC is part of the Controller Services and is the first function to be called when creating a new volume. This call is made before any scheduling decision about where the workload will run.
The CSI plugin is expected to communicate with the SP API to create a new volume given the parameters set in the request, and respond with, among other things, a unique identifier for this volume, which is usually the ID specified by the SP.
Volume creation also needs to meet the volume capability and accessibility requirements.
Volume capability has two main properties.
Access type can be one of block or mount and defines how the volume content is structured. mount volumes are formatted and made available to workloads as a regular filesystem, while block volumes are kept as raw disks.
Access mode controls how many nodes and workloads can mount the volume (one or many) and how they are allowed to access it (read-only or read-and-write). Possible values are SINGLE_NODE_WRITER, SINGLE_NODE_READER_ONLY, MULTI_NODE_READER_ONLY, MULTI_NODE_SINGLE_WRITER, and MULTI_NODE_MULTI_WRITER.
The volume accessibility defines the topology preferences and requisites for where the volume should be accessible from (such as which zone, region, datacenter, rack, etc.) and, therefore, where the volume will be created.
Example 1:
Given a volume should be accessible from a single zone, and
requisite =
{"region": "R1", "zone": "Z2"},
{"region": "R1", "zone": "Z3"}
preferred =
{"region": "R1", "zone": "Z3"}
then the SP SHOULD first attempt to make the provisioned volume
available from "zone" "Z3" in the "region" "R1" and fall back to
"zone" "Z2" in the "region" "R1" if that is not possible.
The request can optionally specify a volume source.
The snapshot source creates the new volume from an existing snapshot. This feature can only be used if the plugin advertises the CREATE_DELETE_SNAPSHOT capability.
The volume source creates a new volume by cloning an existing volume. This feature is gated by the CLONE_VOLUME capability.
If no source is specified, the volume is created as a blank disk.
Oxide implementation
This RPC is implemented using the POST /v1/ API endpoint to create a new blank disk with the given name and disk capacity.
Users can request disk capacity to be a range, starting from a required value (disk MUST be at least this big) to a limit value (disk MUST be at most this big). This range can be relevant when rounding or transforming the requested disk size, since the Oxide API requires it to be set in bytes and as a multiple of the disk block size.
The block size itself (and other Oxide-specific values) can be passed in the RPC request as an opaque key-value store of parameters.
The disk name is provided by the CO, but the plugin may choose a different name when making the API call. This could be important considering that disk serial numbers are truncated to 20 bytes, so names provided by COs may conflict with other disks that share the same prefix. When this happens, only one of the disks is visible from the guest instance, even if both disks are attached.
The RPC response also accepts opaque metadata context that the CO propagates to the other RPCs. The Oxide CSI plugin can use this field to store the name of the disk created so it can later identify it from within an instance.
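To make the capacity rounding and metadata-context handling above concrete, the following Go sketch shows one possible shape for the handler. It uses the Go bindings from the CSI spec; the oxideDisks interface, the 4096-byte default block size, the 1 GiB fallback size, and the csi.oxide.computer/disk-name context key are assumptions made for illustration, not settled interface details.

package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// oxideDisks is a hypothetical wrapper around the Oxide API client; the real
// plugin would implement it on top of the Oxide SDK.
type oxideDisks interface {
	CreateDisk(ctx context.Context, name string, sizeBytes int64) (id string, err error)
}

type controller struct {
	csi.UnimplementedControllerServer
	disks oxideDisks
}

const defaultBlockSize = 4096 // assumed default; could be overridden via req.Parameters

func (c *controller) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error) {
	// Round the required size up to a multiple of the block size, but never
	// past the optional limit provided by the CO.
	required := req.GetCapacityRange().GetRequiredBytes()
	if required == 0 {
		required = 1 << 30 // assume a 1 GiB default when no capacity range is set
	}
	limit := req.GetCapacityRange().GetLimitBytes()
	size := ((required + defaultBlockSize - 1) / defaultBlockSize) * defaultBlockSize
	if limit > 0 && size > limit {
		return nil, status.Errorf(codes.OutOfRange, "size %d rounded up to the block size exceeds limit %d", required, limit)
	}

	// The CO-provided name may collide once truncated to the 20-byte disk
	// serial; a real plugin may derive a shorter, collision-free name here.
	name := req.GetName()

	id, err := c.disks.CreateDisk(ctx, name, size)
	if err != nil {
		return nil, status.Errorf(codes.Internal, "creating disk: %v", err)
	}

	return &csi.CreateVolumeResponse{
		Volume: &csi.Volume{
			VolumeId:      id,
			CapacityBytes: size,
			// Propagated to the node plugin so it can later find the disk by serial.
			VolumeContext: map[string]string{"csi.oxide.computer/disk-name": name},
		},
	}, nil
}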
ControllerPublishVolume
The CO calls the ControllerPublishVolume RPC of the Controller Services once it has selected which node will run the workload.
The request contains the disk ID returned by CreateVolume and the node ID where the workload has been scheduled. The CO retrieves the node ID by calling the NodeGetInfo RPC from the Node Services.
The plugin is expected to make the volume available for the node to use.
Oxide implementation
The implementation uses the POST /v1/ API endpoint to attach the newly created disk to the instance that is running the kubelet where the workload is scheduled to run.
The NodeGetInfo RPC needs to return the Oxide instance ID as the node ID so the controller plugin knows which ID to use in the API call.
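A minimal Go sketch of this RPC is shown below; the AttachDisk method on the hypothetical oxideDisks interface stands in for the corresponding Oxide API call and is not a real SDK signature.

package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// oxideDisks is a hypothetical wrapper around the Oxide API client.
type oxideDisks interface {
	AttachDisk(ctx context.Context, instanceID, diskID string) error
}

type controller struct {
	csi.UnimplementedControllerServer
	disks oxideDisks
}

func (c *controller) ControllerPublishVolume(ctx context.Context, req *csi.ControllerPublishVolumeRequest) (*csi.ControllerPublishVolumeResponse, error) {
	// VolumeId is the disk ID returned by CreateVolume; NodeId is the Oxide
	// instance ID reported by the node plugin's NodeGetInfo.
	if req.GetVolumeId() == "" || req.GetNodeId() == "" {
		return nil, status.Error(codes.InvalidArgument, "volume ID and node ID are required")
	}
	if err := c.disks.AttachDisk(ctx, req.GetNodeId(), req.GetVolumeId()); err != nil {
		return nil, status.Errorf(codes.Internal, "attaching disk: %v", err)
	}
	return &csi.ControllerPublishVolumeResponse{}, nil
}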
NodeStageVolume and NodePublishVolume
The NodeStageVolume and NodePublishVolume RPCs are part of the Node Services, so they run on the node where the volume is being attached.
They are called by the CO when the workload that is going to use the volume is scheduled to the node, but before it is actually created.
When these RPCs are received, the plugin is expected to make the volume ready to be used by the workload and available at the path defined in the request.
The request also includes information about the filesystem and mount flags to use when preparing the volume.
These RPCs are fairly similar, and NodeStageVolume is optional if the plugin does not advertise the STAGE_UNSTAGE_VOLUME capability, but having the two steps provides some additional flexibility.
For example, in Kubernetes the mount path specified in the request to NodeStageVolume is a global path that can be reused by multiple pods, while the NodePublishVolume RPC receives a per-pod path.
Oxide implementation
The specific implementation of these RPCs depends on the host operating system running the node plugin. This RFD assumes Linux as the operating system for the first implementation.
To prepare a blank disk, the node plugin needs to create a partition, format it, and mount it to the path specified in the request.
Since these operations are common across SPs, the Kubernetes development team provides the k8s.io/mount-utils utility package to implement most of this logic.
One key step in this process is correlating the disk ID received in the RPC request with the logical name of the device in the Linux host.
This can be done by reading each disk serial value (using the lsblk command or from the / file) and comparing them with the disk name set by the CreateVolume response in the metadata context.
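The following Go sketch illustrates the serial-matching and format-and-mount steps using k8s.io/mount-utils. It assumes the serial is readable under /sys/block/<device>/serial (as is typical for virtio-blk devices), that CreateVolume stored the disk name under a csi.oxide.computer/disk-name key in the volume context, and that ext4 is an acceptable default filesystem; all three are illustrative choices rather than settled decisions.

package driver

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"strings"

	"github.com/container-storage-interface/spec/lib/go/csi"
	mount "k8s.io/mount-utils"
	utilexec "k8s.io/utils/exec"
)

type node struct {
	csi.UnimplementedNodeServer
	mounter *mount.SafeFormatAndMount
}

func newNode() *node {
	return &node{mounter: &mount.SafeFormatAndMount{Interface: mount.New(""), Exec: utilexec.New()}}
}

// findDeviceBySerial scans /sys/block and returns the device whose serial
// matches the disk name stored in the volume context. Serials are truncated
// to 20 bytes, so the comparison uses the truncated value.
func findDeviceBySerial(serial string) (string, error) {
	if len(serial) > 20 {
		serial = serial[:20]
	}
	entries, err := os.ReadDir("/sys/block")
	if err != nil {
		return "", err
	}
	for _, e := range entries {
		raw, err := os.ReadFile(filepath.Join("/sys/block", e.Name(), "serial"))
		if err != nil {
			continue // device without a serial attribute
		}
		if strings.TrimSpace(string(raw)) == serial {
			return "/dev/" + e.Name(), nil
		}
	}
	return "", fmt.Errorf("no device found with serial %q", serial)
}

func (n *node) NodeStageVolume(ctx context.Context, req *csi.NodeStageVolumeRequest) (*csi.NodeStageVolumeResponse, error) {
	device, err := findDeviceBySerial(req.GetVolumeContext()["csi.oxide.computer/disk-name"])
	if err != nil {
		return nil, err
	}
	fsType := req.GetVolumeCapability().GetMount().GetFsType()
	if fsType == "" {
		fsType = "ext4" // assumed default filesystem
	}
	// SafeFormatAndMount formats the device only if it has no filesystem yet,
	// then mounts it at the staging path shared by all pods on this node.
	if err := n.mounter.FormatAndMount(device, req.GetStagingTargetPath(), fsType, nil); err != nil {
		return nil, err
	}
	return &csi.NodeStageVolumeResponse{}, nil
}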
NodeUnpublishVolume and NodeUnstageVolume
The NodeUnpublishVolume and NodeUnstageVolume RPCs are part of the Node Services and are called by the CO when the workload is stopped and ready to be moved out of the node, either because it was rescheduled or completely stopped. They must undo the actions taken during NodePublishVolume and NodeStageVolume.
Oxide implementation
Similarly to NodePublishVolume and NodeStageVolume, the k8s.io/mount-utils package provides most of the logic necessary to implement these RPCs, and there is no Oxide-specific work to be done.
The function CleanupMountPoint() unmounts the volume from a given path and deletes any remaining unused directories.
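The following is a minimal sketch of how both RPCs can delegate to CleanupMountPoint() from k8s.io/mount-utils; the node type and its mounter field are illustrative.

package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	mount "k8s.io/mount-utils"
)

type node struct {
	csi.UnimplementedNodeServer
	mounter mount.Interface
}

func (n *node) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpublishVolumeRequest) (*csi.NodeUnpublishVolumeResponse, error) {
	// Unmount the per-pod path and remove the now-empty directory.
	if err := mount.CleanupMountPoint(req.GetTargetPath(), n.mounter, true); err != nil {
		return nil, err
	}
	return &csi.NodeUnpublishVolumeResponse{}, nil
}

func (n *node) NodeUnstageVolume(ctx context.Context, req *csi.NodeUnstageVolumeRequest) (*csi.NodeUnstageVolumeResponse, error) {
	// Same cleanup for the node-wide staging path.
	if err := mount.CleanupMountPoint(req.GetStagingTargetPath(), n.mounter, true); err != nil {
		return nil, err
	}
	return &csi.NodeUnstageVolumeResponse{}, nil
}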
ControllerUnpublishVolume
The ControllerUnpublishVolume RPC is part of the Controller Services and is called by the CO when the workload is descheduled from a node.
The plugin is expected to make the volume ready to be consumed and published to a different node.
Oxide implementation
The Oxide CSI plugin uses the POST /v1/ API endpoint to detach the disk from the current instance and make it available to be attached to a different instance, if necessary.
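A rough sketch of the handler is below; the DetachDisk method on the hypothetical oxideDisks interface stands in for the Oxide API call and is not a real SDK signature.

package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// oxideDisks is a hypothetical wrapper around the Oxide API client.
type oxideDisks interface {
	DetachDisk(ctx context.Context, instanceID, diskID string) error
}

type controller struct {
	csi.UnimplementedControllerServer
	disks oxideDisks
}

func (c *controller) ControllerUnpublishVolume(ctx context.Context, req *csi.ControllerUnpublishVolumeRequest) (*csi.ControllerUnpublishVolumeResponse, error) {
	// Detach the disk from the instance it was published to so it can be
	// attached elsewhere by a later ControllerPublishVolume call.
	if err := c.disks.DetachDisk(ctx, req.GetNodeId(), req.GetVolumeId()); err != nil {
		return nil, status.Errorf(codes.Internal, "detaching disk: %v", err)
	}
	return &csi.ControllerUnpublishVolumeResponse{}, nil
}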
DeleteVolume
The DeleteVolume RPC is part of the Controller Services and is called when the CO determines that a volume is no longer needed. The volume is identified by the unique ID returned in the CreateVolume RPC response.
In Kubernetes, this RPC is called when the user deletes a PVC object.
Oxide implementation
The Oxide CSI plugin uses the DELETE /v1/ API endpoint to delete the disk. The disk UUID is retrieved from the volume ID set in the RPC request.
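A sketch of the handler is shown below. The errDiskNotFound sentinel and the DeleteDisk method are illustrative placeholders for the Oxide client; the relevant detail is that a disk that is already gone is treated as a successful deletion, since the CSI spec expects DeleteVolume to be idempotent.

package driver

import (
	"context"
	"errors"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errDiskNotFound is a hypothetical sentinel; the real plugin would map the
// Oxide API's "not found" responses to something like it.
var errDiskNotFound = errors.New("disk not found")

// oxideDisks is a hypothetical wrapper around the Oxide API client.
type oxideDisks interface {
	DeleteDisk(ctx context.Context, diskID string) error
}

type controller struct {
	csi.UnimplementedControllerServer
	disks oxideDisks
}

func (c *controller) DeleteVolume(ctx context.Context, req *csi.DeleteVolumeRequest) (*csi.DeleteVolumeResponse, error) {
	err := c.disks.DeleteDisk(ctx, req.GetVolumeId())
	// DeleteVolume must be idempotent: a disk that no longer exists counts as
	// a successful deletion.
	if err != nil && !errors.Is(err, errDiskNotFound) {
		return nil, status.Errorf(codes.Internal, "deleting disk: %v", err)
	}
	return &csi.DeleteVolumeResponse{}, nil
}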
UX
This section describes the user experience of operators using the Oxide CSI plugin in a Kubernetes cluster running on an Oxide rack.
The general workflow would be similar for other orchestrators but, as mentioned earlier, the focus of this RFD is the Kubernetes integration.
Deploying the Oxide CSI plugin
The Oxide CSI plugin is packaged as an OCI image and can be made available via any OCI-compatible registry, such as Docker Hub, Quay, AWS Elastic Container Registry (ECR), GitHub Container Registry etc.
The image is deployed as pods in the cluster.
The controller plugin can be defined as a Deployment since it does not need to run on any specific node. It should be possible to run multiple instances of the controller plugin for increased reliability, but the implementation must take care to ensure RPCs are idempotent and have proper coordination for concurrent execution, such as some kind of leader election process.
Meanwhile, the node plugin can be defined as a DaemonSet so it is available on every node in the cluster. Each instance of the node plugin can be considered independent from the others since the Kubernetes control plane is responsible for activating the right kubelet where the RPCs are called.
Both plugins are deployed with a handful of supporting sidecars, which are provided by the Kubernetes development team. The following YAML snippets provide a general example of how the pods could be deployed.
kind: Deployment
apiVersion: apps/v1
metadata:
  name: oxide-csi-controller
  labels:
    app.kubernetes.io/name: oxide-csi-driver
spec:
  replicas: 2
  selector:
    matchLabels:
      app: oxide-csi-controller
      app.kubernetes.io/name: oxide-csi-driver
  template:
    spec:
      containers:
        - name: oxide-plugin
          image: oxidecomputer/oxide-csi-plugin:v0.1.0
          args:
            - --endpoint=$(CSI_ENDPOINT)
            - --mode=controller
            # ...
          env:
            - name: CSI_ENDPOINT
              value: unix:///csi/csi.sock
            # Environment variables for Oxide API access.
            - name: OXIDE_HOST
              valueFrom:
                secretKeyRef:
                  name: oxide-secret
                  key: host
            - name: OXIDE_TOKEN
              valueFrom:
                secretKeyRef:
                  name: oxide-secret
                  key: token
            - name: OXIDE_PROJECT
              valueFrom:
                secretKeyRef:
                  name: oxide-secret
                  key: project
            # ...
          volumeMounts:
            - name: socket-dir
              mountPath: /csi
          # ...
        - name: csi-provisioner
          image: registry.k8s.io/sig-storage/csi-provisioner:v5.0.2
          volumeMounts:
            - mountPath: /csi
              name: socket-dir
          # ...
        - name: csi-attacher
          image: registry.k8s.io/sig-storage/csi-attacher:v4.6.1
          volumeMounts:
            - mountPath: /csi
              name: socket-dir
          # ...
        - name: liveness-probe
          image: registry.k8s.io/sig-storage/livenessprobe:v2.13.1
          volumeMounts:
            - mountPath: /csi
              name: socket-dir
          # ...
      # Shared volume so the sidecar containers can communicate
      # with the plugin via its Unix domain socket.
      volumes:
        - name: socket-dir
          emptyDir: {}
      # ...
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: oxide-csi-node
  labels:
    app.kubernetes.io/name: oxide-csi-driver
spec:
  selector:
    matchLabels:
      app: oxide-csi-node
      app.kubernetes.io/name: oxide-csi-driver
  template:
    spec:
      containers:
        - name: oxide-plugin
          image: oxidecomputer/oxide-csi-plugin:v0.1.0
          args:
            - --endpoint=$(CSI_ENDPOINT)
            - --mode=node
            # ...
          env:
            - name: CSI_ENDPOINT
              value: unix:///csi/csi.sock
            # ...
          volumeMounts:
            - name: kubelet-dir
              mountPath: /var/lib/kubelet
              mountPropagation: "Bidirectional"
            - name: plugin-dir
              mountPath: /csi
            # Access host's /dev path.
            - name: device-dir
              mountPath: /dev
          # ...
          securityContext:
            # Run plugin as privileged container to allow formatting and
            # mounting the disk.
            privileged: true
        - name: node-driver-registrar
          image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.11.1
          volumeMounts:
            - name: plugin-dir
              mountPath: /csi
            - name: registration-dir
              mountPath: /registration
            - name: probe-dir
              mountPath: /var/lib/kubelet/plugins/csi.oxide.computer/
          # ...
        - name: liveness-probe
          image: registry.k8s.io/sig-storage/livenessprobe:v2.13.1
          volumeMounts:
            - name: plugin-dir
              mountPath: /csi
          # ...
      volumes:
        - name: kubelet-dir
          hostPath:
            path: /var/lib/kubelet
            type: Directory
        - name: plugin-dir
          hostPath:
            path: /var/lib/kubelet/plugins/csi.oxide.computer/
            type: DirectoryOrCreate
        - name: registration-dir
          hostPath:
            path: /var/lib/kubelet/plugins_registry/
            type: Directory
        - name: device-dir
          hostPath:
            path: /dev
            type: Directory
        - name: probe-dir
          emptyDir: {}
      # ...
Since the plugin sidecars need access to the Kubernetes API, a production deployment also needs to include a set of RBAC rules, which are usually bound to a service account that is used by these containers.
The last two pieces for the plugin deployment are the CSIDriver and StorageClass objects.
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: csi.oxide.computer
  labels:
    app.kubernetes.io/name: oxide-csi-driver
spec:
  # ...
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: oxide-disk
provisioner: csi.oxide.computer
The plugin deployment is fairly standardized across clusters, so it is possible to provide users with a set of base YAML configuration files, or to package them with higher-level tools, such as a Helm chart or Kustomize.
Using the Oxide CSI plugin
With the plugin deployed, users can create and use Oxide disks as regular Kubernetes PVCs.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: oxide-postgres-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: oxide-disk
---
apiVersion: v1
kind: Pod
metadata:
  name: postgres
spec:
  containers:
    - name: postgres
      image: postgres:17.6
      volumeMounts:
        - name: postgres-data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: postgres-data
      persistentVolumeClaim:
        claimName: oxide-postgres-data
Blockers, limitations, and open questions
This section lists all the current blockers, limitations, and open questions that currently affect the development of the Oxide CSI plugin.
Blockers
Blockers prevent the plugin from being developed or adopted in production by users.
Attaching and detaching volumes require instances to be stopped
Priority: High
Requiring instances to be stopped before attaching or detaching a disk prevents most real-world uses of the Oxide CSI plugin, as it results in downtime every time a pod with a PVC is scheduled into the instance.
It can also cause cluster-wide disruptions: when an instance is shut down, all of the pods running on it need to be rescheduled somewhere else, which in turn causes those target instances to restart as well.
Limitations
Limitations are pain points that make plugin implementation harder, or missing features that some users may expect to have.
Instances are limited to a maximum of 8 disks
Priority: Medium
Since each Kubernetes volume correlates to an Oxide disk, this limit affects the number of pods with PVCs that can be scheduled per instance, reducing the overall cluster workload density.
For reference, the table below lists the same limit for other cloud service providers. The exact number varies depending on the instance type of the node ([k8s-storage-limits]).
| Cloud provider | Disks per node |
| --- | --- |
| AWS | 25 or 39 |
| GCP | Up to 127 |
| Azure | Up to 64 |
Plugins can advertise this limit in the response of the NodeGetInfo RPC, so COs are able to take it into consideration during scheduling.
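If the plugin chooses to advertise the limit, the node plugin's NodeGetInfo response could look roughly like the sketch below. Reserving one of the eight slots for the instance's boot disk is an assumption made for illustration.

package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

type node struct {
	csi.UnimplementedNodeServer
	instanceID string // Oxide instance ID, discovered at startup
}

func (n *node) NodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error) {
	return &csi.NodeGetInfoResponse{
		// The controller plugin uses this ID when attaching and detaching disks.
		NodeId: n.instanceID,
		// 8 disks per instance, minus one assumed to be the boot disk.
		MaxVolumesPerNode: 7,
	}, nil
}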
Oxide API authentication
Priority: Medium
The Oxide CSI plugin needs to access the Oxide API to create and delete disks, attach and detach disks from instances, and other operations.
Currently, however, the only way to authenticate API requests is via device tokens, which are attached to a specific user and are hard to manage at scale.
[rfd-553] describes the concept of service accounts, which can help solve this problem.
Retrieving Oxide instance metadata
Priority: Medium
RPCs such as NodeGetInfo require information about the specific Oxide instance where the plugin is running, such as its name and ID. Cloud providers usually expose a metadata endpoint that can be queried from within the instance to retrieve this type of information.
Without a metadata API, users need to manually set static configuration values, such as environment variables, directly into the Oxide CSI plugin container.
The Kubernetes Cloud Controller Manager described in [rfd-493] can help with this problem by adding the Oxide instance name and ID to the node object itself. The Oxide CSI plugin can then query the Kubernetes API to retrieve the information it needs.
One caveat of this approach is that CSI plugins are expected to be CO agnostic, and querying the Kubernetes API directly breaks this assumption.
One alternative is to use the same sidecar pattern to isolate the Kubernetes API in a different container that is then responsible for feeding the information to the Oxide CSI plugin via environment variables.
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: oxide-csi-node
  labels:
    app.kubernetes.io/name: oxide-csi-driver
spec:
  selector:
    matchLabels:
      app: oxide-csi-node
      app.kubernetes.io/name: oxide-csi-driver
  template:
    spec:
      initContainers:
        # This container queries the Kubernetes API to retrieve metadata for
        # the node K8S_NODE_NAME and writes it to a file in the /data volume as
        # KEY=VALUE pairs.
        - name: oxide-instance-metadata
          image: oxidecomputer/oxide-k8s-instance-metadata:v0.1.0
          env:
            - name: K8S_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - name: oxide-instance-metadata
              mountPath: /data
          # ...
      containers:
        - name: oxide-plugin
          env:
            # The Oxide CSI plugin reads the Oxide instance ID from the node
            # metadata retrieved by the initContainer.
            - name: OXIDE_INSTANCE_ID
              valueFrom:
                fileKeyRef:
                  path: config.env
                  volumeName: oxide-instance-metadata
                  key: OXIDE_INSTANCE_ID
          # ...
        # ...
      volumes:
        - name: oxide-instance-metadata
          emptyDir: {}
        # ...
      # ...
Disks cannot be expanded
Priority: Medium
The CSI spec defines the ControllerExpandVolume and NodeExpandVolume RPCs to allow cluster operators to dynamically grow an existing volume, but Oxide disks have a fixed size that is defined upon creation, so the Oxide CSI plugin cannot support this feature.
These RPCs are gated by the EXPAND_VOLUME capability, so COs are able to prevent users from accessing this functionality if the plugin does not indicate support for it.
Lack of disk metadata
Priority: Low
The Controller Service has a method called ListVolumes that is expected to return all the volumes the plugin knows about.
Without some kind of resource tagging or metadata, the plugin would need to rely on name pattern matching to find the disks it created.
This RPC is gated by the LIST_VOLUMES capability, so it would be possible to release the plugin without this functionality.
Only SINGLE_NODE_WRITER access mode supported
Priority: Low
Oxide disks can only be attached to a single instance at a time, and are always available for reads and writes, so the only access mode the plugin can support is SINGLE_NODE_WRITER.
This requirement can be documented and validated by the plugin, and SINGLE_NODE_WRITER is arguably the most common access mode used for disks, so this limitation should have little impact for most users.
Cloud providers also have similar limitations. For example, AWS only supports MULTI_NODE_MULTI_WRITER with access type block and has no read-only support.
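As a sketch, the plugin could reject unsupported modes with a small validation helper shared by CreateVolume and ValidateVolumeCapabilities; the helper name is illustrative.

package driver

import (
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// validateCapabilities rejects any requested access mode other than
// SINGLE_NODE_WRITER, which is the only mode Oxide disks can support today.
func validateCapabilities(caps []*csi.VolumeCapability) error {
	for _, c := range caps {
		if mode := c.GetAccessMode().GetMode(); mode != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {
			return fmt.Errorf("unsupported access mode: %s", mode)
		}
	}
	return nil
}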
Volume cloning
Priority: Low
CSI volumes can be created from three different sources: a blank disk, a snapshot, or by cloning an existing volume. The Oxide API can support the first two use cases, but not the third one.
This feature is gated by the CLONE_VOLUME capability, and not all cloud providers support this functionality either.
Open questions
Open questions are decisions that have been deferred until more information is available to guide implementation.
Topologies and multi-rack clusters
Multi-rack deployments are still under active discussion, but a Kubernetes cluster can already be deployed across multiple Oxide racks. In this scenario, the Oxide CSI plugin needs a way to guarantee that pods are only scheduled to instances in the rack that has the disks they need.
This requires some mechanism for Oxide racks to be uniquely identified so that they can be used as a topology key.
The topology details will also be affected by the final multi-rack support implementation. For example, if an instance on rack A is able to access a disk from rack B, then rack could be defined as a preferred topology instead of a requisite.
Multi-rack environments could also be further grouped into different datacenters or geolocations, so users would need a way to specify this information as well.
External References
[rfd-493] https://493.rfd.oxide.computer/
[rfd-553] https://553.rfd.oxide.computer/
[csi-spec] https://github.com/container-storage-interface/spec/blob/master/spec.md
[k8s-deployment] https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
[k8s-daemonset] https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
[k8s-storage-class] https://kubernetes.io/docs/concepts/storage/storage-classes/
[k8s-csi-driver] https://kubernetes.io/docs/reference/kubernetes-api/config-and-storage-resources/csi-driver-v1/
[k8s-storage-limits] https://kubernetes.io/docs/concepts/storage/storage-limits/
[k8s-external-provisioner] https://kubernetes-csi.github.io/docs/external-provisioner.html
[k8s-livenessprobe] https://kubernetes-csi.github.io/docs/livenessprobe.html
[mount-utils] https://pkg.go.dev/k8s.io/mount-utils
[csi-create-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#createvolume
[csi-controller-publish-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#controllerpublishvolume
[csi-node-get-info] https://github.com/container-storage-interface/spec/blob/master/spec.md#nodegetinfo
[csi-node-stage-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#nodestagevolume
[csi-node-publish-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#nodepublishvolume
[csi-node-unpublish-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#nodeunpublishvolume
[csi-node-unstage-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#nodeunstagevolume
[csi-controller-unpublish-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#controllerunpublishvolume
[csi-delete-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#deletevolume
https://www.redhat.com/en/blog/persistent-volume-support-peer-pods-technical-deep-dive