Background
This RFD expands on [rfd-493] to provide more details on how a Container Storage Interface plugin for Oxide would be implemented.
Kubernetes uses persistent volumes to provide long-term storage to containers. Pods can access volumes by creating persistent volume claims (PVCs) that are fulfilled by Kubernetes during deployment.
These claims can be static, where the volume is pre-created and registered to the cluster manually, or dynamic, where the volume is automatically provisioned when the pod is first created. The end result is a persistent volume (PV) that can be mounted into the container.
To support different storage providers, Kubernetes standardized the volume lifecycle handling into a common interface called the Container Storage Interface (CSI).
The CSI spec uses specific terminology to refer to certain roles and components. This RFD follows the same terminology to keep the content consistent.
CO is the container orchestrator. The most well-known solution is Kubernetes, but other projects, such as Mesos and Nomad, also implement the CSI spec and are other examples of COs.
SP is the storage provider. In the context of this RFD, the SP is the Oxide API, but other examples include any solution that provides persistent data access via a remote API, such as block storage (AWS EBS and GCP Persistent Disks), network storage (NFS and SMB), secret management systems (Vault and AWS Secrets Manager), object storage (AWS S3 and GCP Cloud Storage), etc.
Workload is the unit of work created by the CO. In Kubernetes this can be thought of as a pod or a container.
Volume is the unit of storage created by the SP and used by the workload. In the context of Oxide, the volume is a disk, but it can be a file or a secret, depending on the SP being used.
Node is a host where workloads run, such as Oxide instances.
The CSI spec leaves some implementation details to COs and, with Kubernetes being the most widely used CO in the industry, its behaviour is often treated as the "expected" one. This RFD mentions Kubernetes specifically (instead of using the more generic term CO) when referring to details specific to Kubernetes.
CSI plugins (sometimes also called drivers) are binaries that implement the CSI spec and expose an endpoint with a set of RPCs that are called by COs in a specific order to complete the lifecycle of a volume.
They also advertise the capabilities they implement to allow COs to dynamically adjust which calls to make depending on the features supported by the plugin.
The plugin’s RPCs are grouped into three high-level services:
The Identity Service provides basic information about the plugin, such as its name and capabilities.
The Controller Service exposes RPCs that are expected to call the upstream SP APIs to manage the lifecycle of a volume from a remote state perspective.
The Node Service exposes RPCs related to the local node where volumes will be used.
These services can be implemented and deployed as separate binaries or, most commonly, bundled into the same binary that can be configured to run in specific modes, such as node, controller, or all.
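As an illustration of how the mode ties into capability advertisement, the Identity Service could report the controller service only when running as a controller. The sketch below uses the Go CSI bindings; the struct and the flag handling around it are hypothetical, not final plugin code.

package oxidecsi

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// identityService implements the CSI Identity Service. Whether the plugin
// advertises the controller service depends on the mode it was started in.
type identityService struct {
	csi.UnimplementedIdentityServer
	runController bool // true when started with --mode=controller or --mode=all
}

func (s *identityService) GetPluginCapabilities(ctx context.Context, req *csi.GetPluginCapabilitiesRequest) (*csi.GetPluginCapabilitiesResponse, error) {
	var caps []*csi.PluginCapability
	if s.runController {
		caps = append(caps, &csi.PluginCapability{
			Type: &csi.PluginCapability_Service_{
				Service: &csi.PluginCapability_Service{
					Type: csi.PluginCapability_Service_CONTROLLER_SERVICE,
				},
			},
		})
	}
	return &csi.GetPluginCapabilitiesResponse{Capabilities: caps}, nil
}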
In Kubernetes, CSI plugins are packaged as OCI images and deployed as regular pods in the cluster. Pods that provide the Controller Services are usually created using Deployments, with more than one replica for redundancy, and Node Services as DaemonSets so they are available on every node of the cluster.
CSI plugins are also deployed along with a set of sidecar containers in the same pod to help reduce boilerplate and keep plugin implementation focused on the CSI spec instead of Kubernetes-specific details.
For example, in Kubernetes, the Controller Services RPCs are not called directly, but rather plugins are expected to subscribe and listen to specific events about volume creation requests from the Kubernetes API.
Since this is a common requirement for all CSI plugins, the external-provisioner sidecar is provided to listen for the right events from the Kubernetes API and call the appropriate Controller Service RPCs via a local Unix domain socket.
Node services are called directly by the Kubernetes agent running on the node (kubelet), but they also use sidecar containers for common functionality, such as livenessprobe to monitor the plugin's health.
The Kubernetes design proposal for CSI illustrates how plugins are deployed.

Volume lifecycle
The CSI spec provides a few alternatives for volume lifecycle implementation. This RFD focuses on the most complete option as it provides more implementation flexibility.
Figure 6 (from the CSI spec): The lifecycle of a dynamically provisioned volume, from creation to destruction, when the Node Plugin advertises the STAGE_UNSTAGE_VOLUME capability. CreateVolume places the volume in the CREATED state; it then moves to NODE_READY via ControllerPublishVolume, to VOL_READY via NodeStageVolume, and to PUBLISHED via NodePublishVolume, with the corresponding NodeUnpublishVolume, NodeUnstageVolume, ControllerUnpublishVolume, and DeleteVolume RPCs walking it back.
The following sections detail the actions the Oxide CSI plugin needs to take to implement these RPCs.
CreateVolume
The CreateVolume RPC is part of the Controller Services and is the first function to be called when creating a new volume. This call is made before any scheduling decision about where the workload will run.
The CSI plugin is expected to communicate with the SP API to create a new volume given the parameters set in the request, and respond with, among other things, a unique identifier for this volume, which is usually the ID specified by the SP.
Volume creation also needs to meet the volume capability and accessibility requirements.
A volume capability has two main properties.
Access type can be one of block or mount and defines how the volume content is structured. mount volumes are formatted and made available to workloads as a regular filesystem, while block volumes are kept as raw disks.
Access mode controls how many nodes and workloads can mount the volume (one or many) and how they are allowed to access it (read-only or read-and-write). Possible values are SINGLE_NODE_WRITER, SINGLE_NODE_READER_ONLY, MULTI_NODE_READER_ONLY, MULTI_NODE_SINGLE_WRITER, and MULTI_NODE_MULTI_WRITER.
The volume accessibility defines the topology preferences and requisites for where the volume should be accessible from and, therefore, where the volume will be created.
The topologies are specified as key-value maps, where the keys are called topological domains and the values topological segments. Domains and segments are opaque to the CSI spec, meaning that their semantics are defined by each CSI plugin.
The CSI spec provides a few general examples of how topologies should be evaluated by CSI plugins. Topology support for Oxide is still an open question.
Example 1: Given a volume should be accessible from a single zone, and
  requisite = {"region": "R1", "zone": "Z2"}, {"region": "R1", "zone": "Z3"}
  preferred = {"region": "R1", "zone": "Z3"}
then the SP SHOULD first attempt to make the provisioned volume available from "zone" "Z3" in the "region" "R1" and fall back to "zone" "Z2" in the "region" "R1" if that is not possible.
The request can optionally specify a volume source.
The snapshot source creates the new volume from an existing snapshot. This feature can only be used if the plugin advertises the CREATE_DELETE_SNAPSHOT capability.
The volume source creates a new volume by cloning an existing volume. This feature is gated by the CLONE_VOLUME capability.
If no source is specified, the volume is created as a blank disk.
Oxide implementation
This RPC is implemented using the POST /v1/ API endpoint to create a new blank disk with the given name and disk capacity.
Users can request the disk capacity as a range, from a required value (the disk MUST be at least this big) to a limit value (the disk MUST be at most this big). This range can be relevant when rounding or transforming the requested disk size, since the Oxide API requires it to be set in bytes and as a multiple of the disk block size.
The block size itself (and other Oxide-specific values) can be passed in the RPC request as an opaque key-value store of parameters. If not specified, disks are created with a 4K block size by default.
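As an illustration, the size resolution could look something like the sketch below, which rounds the required bytes up to the block size and validates the result against the optional limit. The helper name and the default block size constant are illustrative, not final plugin code.

package oxidecsi

import (
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// defaultBlockSize is the assumed default Oxide disk block size (4K); the
// actual value can be overridden through the CreateVolume parameters.
const defaultBlockSize int64 = 4096

// resolveDiskSize rounds the required capacity up to a multiple of the block
// size and checks it against the optional limit. A zero required size would
// need a plugin-chosen default, omitted here for brevity.
func resolveDiskSize(cr *csi.CapacityRange, blockSize int64) (int64, error) {
	if blockSize <= 0 {
		blockSize = defaultBlockSize
	}
	required := cr.GetRequiredBytes()
	limit := cr.GetLimitBytes()

	// Round up to the next multiple of the block size.
	size := ((required + blockSize - 1) / blockSize) * blockSize
	if limit > 0 && size > limit {
		return 0, fmt.Errorf("capacity %d rounded up to block size %d exceeds limit %d", required, blockSize, limit)
	}
	return size, nil
}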
The disk name is provided by the CO, but the plugin may choose a different name when making the API call. This could be important considering that disk serial numbers are truncated to 20 bytes, so names provided by COs may conflict with disks that share the same prefix. When this happens, only one of the disks is visible from the guest instance, even if both disks are attached.
The RPC response also accepts opaque metadata context that the CO propagates to the other RPCs. The Oxide CSI plugin can use this field to store the name of the disk created so it can later identify it from within an instance.
Figure: CreateVolume RPC.
One challenge in implementing this RPC is the current blocker that prevents hot plugging disks to instances.
In a single-rack scenario, this process is relatively straightforward, since instances and disks will always be placed in the same rack. But handling multi-rack environments is an open question that will likely require adding support for topologies.
ControllerPublishVolume
The CO calls the ControllerPublishVolume RPC of the Controller Services once it has selected which node will run the workload.
The request contains the disk ID returned by CreateVolume and the node ID where the workload has been scheduled. The CO retrieves the node ID by calling the NodeGetInfo RPC from the Node Services.
The plugin is expected to make the volume available for the node to use.
Oxide implementation
The implementation uses the POST /v1/ API endpoint to attach the newly created disk to the instance running the kubelet where the workload is scheduled to run.
The NodeGetInfo RPC needs to return the Oxide instance ID as the node ID so the controller plugin knows which ID to use in the API call.
Figure: ControllerPublishVolume RPC.
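A rough sketch of this handler is shown below. The oxideClient interface is hypothetical and stands in for whichever Oxide API client the plugin ends up using; the real attach call and its parameters may differ.

package oxidecsi

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// oxideClient is a hypothetical abstraction over the Oxide API; the real
// plugin would wrap the Oxide SDK or issue the HTTP requests directly.
type oxideClient interface {
	AttachDisk(ctx context.Context, instanceID, diskID string) error
}

type controllerService struct {
	csi.UnimplementedControllerServer
	oxide oxideClient
}

func (s *controllerService) ControllerPublishVolume(ctx context.Context, req *csi.ControllerPublishVolumeRequest) (*csi.ControllerPublishVolumeResponse, error) {
	diskID := req.GetVolumeId()   // Oxide disk ID returned by CreateVolume.
	instanceID := req.GetNodeId() // Oxide instance ID returned by NodeGetInfo.
	if diskID == "" || instanceID == "" {
		return nil, status.Error(codes.InvalidArgument, "volume ID and node ID are required")
	}

	// Attach the disk to the instance running the kubelet that will host the workload.
	if err := s.oxide.AttachDisk(ctx, instanceID, diskID); err != nil {
		return nil, status.Errorf(codes.Internal, "attaching disk %s: %v", diskID, err)
	}
	return &csi.ControllerPublishVolumeResponse{}, nil
}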
NodeStageVolume and NodePublishVolume
The NodeStageVolume and NodePublishVolume RPCs are part of the Node Services, so they run on the node where the volume is being attached.
They are called by the CO when the workload that is going to use the volume is scheduled to the node, but before the workload is actually created.
When these RPCs are received, the plugin is expected to make the volume ready to be used by the workload and available at the path defined in the request.
The request also includes information about the filesystem and mount flags to use when preparing the volume.
These RPCs are fairly similar, and NodeStageVolume is optional if the plugin does not advertise the STAGE_UNSTAGE_VOLUME capability, but having the two steps provides some additional flexibility.
For example, in Kubernetes the mount path specified in the request to NodeStageVolume is a global path that can be reused by multiple pods, while the NodePublishVolume RPC receives a per-pod path.
Figure: NodeStageVolume and NodePublishVolume RPCs.
Oxide implementation
The specific implementation of these RPCs depends on the host operating system running the node plugin. This RFD assumes Linux as the operating system for the first implementation.
To prepare a blank disk, the node plugin needs to create a partition, format it, and mount it to the path specified in the request.
Since these operations are common across SPs, the Kubernetes development team provides the k8s.io/mount-utils utility package to implement most of this logic.
One key step in this process is correlating the disk ID received in the RPC request with the logical name of the device on the Linux host.
This can be done by reading each disk's serial value (using the lsblk command or from the / file) and comparing them with the disk name set by the CreateVolume response in the metadata context.
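A sketch of this staging flow, under the assumptions above (Linux, serials exposed under /sys/block, and k8s.io/mount-utils for formatting and mounting), could look like the following. Note that it formats the whole device rather than creating a partition, which is what mount-utils supports directly; the exact sysfs path is an assumption.

package oxidecsi

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"

	mount "k8s.io/mount-utils"
	utilexec "k8s.io/utils/exec"
)

// findDeviceBySerial scans the host's block devices and returns the one whose
// serial matches the disk name stored in the volume context by CreateVolume.
// Reading /sys/block/<dev>/device/serial is one option; parsing
// `lsblk --output NAME,SERIAL` is another.
func findDeviceBySerial(serial string) (string, error) {
	entries, err := os.ReadDir("/sys/block")
	if err != nil {
		return "", err
	}
	for _, e := range entries {
		raw, err := os.ReadFile(filepath.Join("/sys/block", e.Name(), "device", "serial"))
		if err != nil {
			continue
		}
		if strings.TrimSpace(string(raw)) == serial {
			return "/dev/" + e.Name(), nil
		}
	}
	return "", fmt.Errorf("no block device found with serial %q", serial)
}

// stageVolume formats the device (only if it has no filesystem yet) and
// mounts it at the staging path requested by the CO.
func stageVolume(serial, stagingPath, fsType string, mountFlags []string) error {
	device, err := findDeviceBySerial(serial)
	if err != nil {
		return err
	}
	mounter := &mount.SafeFormatAndMount{Interface: mount.New(""), Exec: utilexec.New()}
	return mounter.FormatAndMount(device, stagingPath, fsType, mountFlags)
}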
NodeUnpublishVolume and NodeUnstageVolume
The NodeUnpublishVolume and NodeUnstageVolume RPCs are part of the Node Services and are called by the CO when the workload is stopped and ready to be moved out of the node, either because it was rescheduled or completely stopped. They must undo the actions taken during NodePublishVolume and NodeStageVolume.
Oxide implementation
Similarly to NodePublishVolume and NodeStageVolume, the k8s.io/mount-utils package provides most of the logic necessary to implement these RPCs, and there is no Oxide-specific work to be done.
The function CleanupMountPoint() unmounts the volume from a given path and deletes any remaining unused directories.
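For illustration, both handlers could reduce to a helper along these lines (error handling trimmed):

package oxidecsi

import (
	mount "k8s.io/mount-utils"
)

// unpublishOrUnstage unmounts the given path, if mounted, and removes the
// directory. The same helper can back NodeUnpublishVolume (per-pod path)
// and NodeUnstageVolume (global staging path).
func unpublishOrUnstage(path string) error {
	return mount.CleanupMountPoint(path, mount.New(""), true /* extensiveMountPointCheck */)
}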
ControllerUnpublishVolume
The ControllerUnpublishVolume RPC is part of the Controller Services and is called by the CO when the workload is descheduled from a node.
The plugin is expected to make the volume ready to be consumed and published to a different node.
Oxide implementation
The Oxide CSI plugin uses the POST /v1/ API endpoint to detach the disk from the current instance and make it available to be attached to a different instance, if necessary.
DeleteVolume
The DeleteVolume RPC is part of the Controller Services and is called when the CO determines that a volume is no longer needed. The volume is identified by the unique ID returned in the CreateVolume RPC response.
In Kubernetes, this RPC is called when the user deletes a PVC object.
Oxide implementation
The Oxide CSI plugin uses the DELETE /v1/ API endpoint to delete the disk. The disk UUID is retrieved from the volume ID set in the RPC request.
UX
This section describes the user experience of operators using the Oxide CSI plugin in a Kubernetes cluster running on an Oxide rack.
The general workflow would be similar for other orchestrators but, as mentioned earlier, the focus of this RFD is the Kubernetes integration.
Deploying the Oxide CSI plugin
The Oxide CSI plugin is packaged as an OCI image and can be made available via any OCI-compatible registry, such as Docker Hub, Quay, AWS Elastic Container Registry (ECR), GitHub Container Registry etc.
The image is deployed as pods in the cluster.
The controller plugin can be defined as a Deployment since it does not need to run on any specific node. It should be possible to run multiple instances of the controller plugin for increased reliability, but the implementation must take care to ensure RPCs are idempotent and have proper coordination for concurrent execution, such as some kind of leader election process.
Meanwhile, the node plugin can be defined as a DaemonSet so it is available on every node in the cluster. Each instance of the node plugin can be considered independent from the others, since the Kubernetes control plane is responsible for activating the right kubelet where the RPCs are called.
Both plugins are deployed with a handful of supporting sidecars, which are provided by the Kubernetes development team. The following YAML snippets provide a general example of how the pods could be deployed.
kind: Deployment
apiVersion: apps/v1
metadata:
  name: oxide-csi-controller
  labels:
    app.kubernetes.io/name: oxide-csi-driver
spec:
  replicas: 2
  selector:
    matchLabels:
      app: oxide-csi-controller
      app.kubernetes.io/name: oxide-csi-driver
  template:
    spec:
      containers:
        - name: oxide-plugin
          image: oxidecomputer/oxide-csi-plugin:v0.1.0
          args:
            - --endpoint=$(CSI_ENDPOINT)
            - --mode=controller
            # ...
          env:
            - name: CSI_ENDPOINT
              value: unix:///csi/csi.sock
            # Environment variables for Oxide API access.
            - name: OXIDE_HOST
              valueFrom:
                secretKeyRef:
                  name: oxide-secret
                  key: host
            - name: OXIDE_TOKEN
              valueFrom:
                secretKeyRef:
                  name: oxide-secret
                  key: token
            - name: OXIDE_PROJECT
              valueFrom:
                secretKeyRef:
                  name: oxide-secret
                  key: project
            # ...
          volumeMounts:
            - name: socket-dir
              mountPath: /csi
          # ...
        - name: csi-provisioner
          image: registry.k8s.io/sig-storage/csi-provisioner:v5.0.2
          volumeMounts:
            - mountPath: /csi
              name: socket-dir
          # ...
        - name: csi-attacher
          image: registry.k8s.io/sig-storage/csi-attacher:v4.6.1
          volumeMounts:
            - mountPath: /csi
              name: socket-dir
          # ...
        - name: liveness-probe
          image: registry.k8s.io/sig-storage/livenessprobe:v2.13.1
          volumeMounts:
            - mountPath: /csi
              name: socket-dir
          # ...
      # Shared volume so the sidecar containers can communicate
      # with the plugin via its Unix domain socket.
      volumes:
        - name: socket-dir
          emptyDir: {}
# ...
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: oxide-csi-node
  labels:
    app.kubernetes.io/name: oxide-csi-driver
spec:
  selector:
    matchLabels:
      app: oxide-csi-node
      app.kubernetes.io/name: oxide-csi-driver
  template:
    spec:
      containers:
        - name: oxide-plugin
          image: oxidecomputer/oxide-csi-plugin:v0.1.0
          args:
            - --endpoint=$(CSI_ENDPOINT)
            - --mode=node
            # ...
          env:
            - name: CSI_ENDPOINT
              value: unix:///csi/csi.sock
            # ...
          volumeMounts:
            - name: kubelet-dir
              mountPath: /var/lib/kubelet
              mountPropagation: "Bidirectional"
            - name: plugin-dir
              mountPath: /csi
            # Access host's /dev path.
            - name: device-dir
              mountPath: /dev
            # ...
          securityContext:
            # Run plugin as privileged container to allow formatting and
            # mounting the disk.
            privileged: true
        - name: node-driver-registrar
          image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.11.1
          volumeMounts:
            - name: plugin-dir
              mountPath: /csi
            - name: registration-dir
              mountPath: /registration
            - name: probe-dir
              mountPath: /var/lib/kubelet/plugins/csi.oxide.computer/
          # ...
        - name: liveness-probe
          image: registry.k8s.io/sig-storage/livenessprobe:v2.13.1
          volumeMounts:
            - name: plugin-dir
              mountPath: /csi
          # ...
      volumes:
        - name: kubelet-dir
          hostPath:
            path: /var/lib/kubelet
            type: Directory
        - name: plugin-dir
          hostPath:
            path: /var/lib/kubelet/plugins/csi.oxide.computer/
            type: DirectoryOrCreate
        - name: registration-dir
          hostPath:
            path: /var/lib/kubelet/plugins_registry/
            type: Directory
        - name: device-dir
          hostPath:
            path: /dev
            type: Directory
        - name: probe-dir
          emptyDir: {}
# ...
Since the plugin sidecars need access to the Kubernetes API, a production deployment also needs to include a set of RBAC rules, which are usually bound to a service account that is used by these containers.
The last two pieces of the plugin deployment are the CSIDriver and StorageClass objects.
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: csi.oxide.computer
  labels:
    app.kubernetes.io/name: oxide-csi-driver
spec:
  # ...

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: oxide-disk
provisioner: csi.oxide.computer
The plugin deployment is fairly standardized across clusters, and so it is possible to provide users with a set of base YAML configuration files, or package them with higher-level tools, such as a Helm chart or Kustomize.
Using the Oxide CSI plugin
With the plugin deployed, users can create and use Oxide disks as regular Kubernetes PVCs.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: oxide-postgres-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: oxide-disk

apiVersion: v1
kind: Pod
metadata:
  name: postgres
spec:
  containers:
    - name: postgres
      image: postgres:17.6
      volumeMounts:
        - name: postgres-data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: postgres-data
      persistentVolumeClaim:
        claimName: oxide-postgres-data
Blockers, limitations, and open questions
This section lists the blockers, limitations, and open questions that currently affect the development of the Oxide CSI plugin.
Blockers
Blockers prevent the plugin from being developed or adopted in production by users.
Attaching and detaching volumes require instances to be stopped
Priority: High
Requiring instances to be stopped before attaching or detaching a disk prevents most real-world uses of the Oxide CSI plugin, as it results in downtime every time a pod with a PVC is scheduled into the instance.
It can also cause cluster-wide disruptions when all the pods running on a given instance that is shut down need to be rescheduled somewhere else, causing those instances to restart as well.
Limitations
Limitations are pain points that make plugin implementation harder, or missing features that some users may expect to have.
Instances are limited to a maximum of 8 disks
Priority: Medium
Since each Kubernetes volume correlates to an Oxide disk, this limit affects the number of pods with PVCs that can be scheduled per instance, reducing the overall cluster workload density.
For reference, the table below lists the same limit for other cloud service providers. The exact number varies depending on the instance type of the node ([k8s-storage-limits]).
Cloud provider | Disks per node
---|---
AWS | 25 or 39
GCP | Up to 127
Azure | Up to 64
Plugins can advertise this limit in the response of the NodeGetInfo RPC, so COs are able to take it into consideration during scheduling.
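A sketch of how the node plugin could advertise the limit is shown below; the constant and the way the instance ID is obtained are assumptions.

package oxidecsi

import (
	"context"
	"os"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// maxDisksPerInstance mirrors the current Oxide limit of 8 disks per instance.
// Whether the boot disk should be subtracted from this number is left open.
const maxDisksPerInstance = 8

type nodeService struct {
	csi.UnimplementedNodeServer
}

func (s *nodeService) NodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error) {
	return &csi.NodeGetInfoResponse{
		// Oxide instance ID, e.g. injected into the environment as described
		// in the instance metadata section below.
		NodeId:            os.Getenv("OXIDE_INSTANCE_ID"),
		MaxVolumesPerNode: maxDisksPerInstance,
	}, nil
}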
Oxide API authentication
Priority: Medium
The Oxide CSI plugin needs to access the Oxide API to create and delete disks, attach and detach disks from instances, and other operations.
But currently the only way to authenticate API requests is via device tokens, which are attached to a specific user and are hard to manage at scale.
[rfd-553] describes the concept of service accounts, which can help alleviate this problem. Introducing the concept of machine principals to the Oxide API could help scope requests even further.
Retrieving Oxide instance metadata
Priority: Medium
RPCs such as NodeGetInfo require information about the specific Oxide instance where the plugin is running, such as its name and ID. Cloud providers usually expose a metadata endpoint that can be queried from within the instance to retrieve this type of information, but such endpoints can present access control challenges and be a source of vulnerabilities and unintended data exposure.
At a minimum, the Oxide CSI node plugin needs to know the UUID of the Oxide instance it is running on, since this information is used by the controller plugin to attach and detach disks from the right instance. Additional metadata is required to support topologies, as discussed in [multi-rack].
The Kubernetes Cloud Controller Manager described in [rfd-493] can help with this problem by adding the Oxide instance name and ID to the Kubernetes Node objects themselves. The Oxide CSI plugin can then query the Kubernetes API to retrieve the information it needs, leveraging the comprehensive Kubernetes RBAC system to limit the scope of the request.
One caveat of this approach is that CSI plugins are expected to be CO agnostic, and querying the Kubernetes API directly breaks this assumption.
One alternative is to use the same sidecar pattern to isolate the Kubernetes API calls in a different container that is then responsible for feeding the information to the Oxide CSI plugin via environment variables. Deploying the plugin in other COs can follow a similar pattern.
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: oxide-csi-node
  labels:
    app.kubernetes.io/name: oxide-csi-driver
spec:
  selector:
    matchLabels:
      app: oxide-csi-node
      app.kubernetes.io/name: oxide-csi-driver
  template:
    spec:
      initContainers:
        # This container queries the Kubernetes API to retrieve metadata for
        # the node K8S_NODE_NAME and writes it to a file in the /data volume as
        # KEY=VALUE pairs.
        - name: oxide-instance-metadata
          image: oxidecomputer/oxide-k8s-instance-metadata:v0.1.0
          env:
            - name: K8S_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - name: oxide-instance-metadata
              mountPath: /data
          # ...
      containers:
        - name: oxide-plugin
          env:
            # The Oxide CSI plugin reads the Oxide instance ID from the node
            # metadata retrieved by the initContainer.
            - name: OXIDE_INSTANCE_ID
              valueFrom:
                fileKeyRef:
                  path: config.env
                  volumeName: oxide-instance-metadata
                  key: OXIDE_INSTANCE_ID
          # ...
        # ...
      volumes:
        - name: oxide-instance-metadata
          emptyDir: {}
        # ...
# ...
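For completeness, the sketch below shows what such an init container could do, assuming the Cloud Controller Manager from [rfd-493] records the instance ID as the node's provider ID (the ID format and the output path are assumptions).

package main

import (
	"context"
	"fmt"
	"os"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Look up the Node object for the node this pod is running on.
	nodeName := os.Getenv("K8S_NODE_NAME")
	node, err := clientset.CoreV1().Nodes().Get(context.Background(), nodeName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Assumes the provider ID has a form like "oxide://.../<instance-uuid>"
	// and takes the last path segment as the instance ID.
	providerID := node.Spec.ProviderID
	instanceID := providerID
	if i := strings.LastIndex(providerID, "/"); i >= 0 {
		instanceID = providerID[i+1:]
	}

	// Write KEY=VALUE pairs to the shared emptyDir volume for the plugin container.
	contents := fmt.Sprintf("OXIDE_INSTANCE_ID=%s\n", instanceID)
	if err := os.WriteFile("/data/config.env", []byte(contents), 0o644); err != nil {
		panic(err)
	}
}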
Disks cannot be expanded
Priority: Medium
The CSI spec defines the RPCs ControllerExpandVolume and NodeExpandVolume to allow cluster operators to dynamically grow an existing volume, but Oxide disks have a fixed size that is defined upon creation, so the Oxide CSI plugin cannot support this feature.
These RPCs are gated by the EXPAND_VOLUME capability, so COs are able to prevent users from accessing this functionality if the plugin does not indicate support for it.
Lack of disk metadata
Priority: Low
There are three types of metadata that could be useful for the Oxide CSI plugin to set in the disks it manages.
The first kind is automated pod information. If the CSIDriver object is created with podInfoOnMount set to true, the kubelet provides the following information in the volume context attribute of the NodePublishVolume request:
Pod name
Pod namespace
Pod UID
Pod service account name
The node plugin can update the disk metadata with these values.
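The kubelet passes this information under well-known volume context keys. Below is a sketch of how the node plugin could extract them from the NodePublishVolume request; the actual disk metadata update is left abstract.

package oxidecsi

import (
	"github.com/container-storage-interface/spec/lib/go/csi"
)

// podInfoFromVolumeContext extracts the pod information the kubelet injects
// into the volume context when the CSIDriver object sets podInfoOnMount: true.
func podInfoFromVolumeContext(req *csi.NodePublishVolumeRequest) map[string]string {
	vc := req.GetVolumeContext()
	return map[string]string{
		"pod-name":        vc["csi.storage.k8s.io/pod.name"],
		"pod-namespace":   vc["csi.storage.k8s.io/pod.namespace"],
		"pod-uid":         vc["csi.storage.k8s.io/pod.uid"],
		"service-account": vc["csi.storage.k8s.io/serviceAccount.name"],
	}
}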
The second type of metadata is user-defined values that are set on all disks created by the Oxide CSI plugin. These values can be passed via CLI flags to the controller plugin and are included in every API request that creates a disk. For example, some users may use this feature to tag disks for a specific project or application.
The third type of metadata is a static value that the Oxide CSI plugin can use to implement the ListVolumes RPC of the Controller Service. This RPC needs to return all volumes the plugin knows about, but without a more structured way to determine which Oxide disks were created by the CSI plugin, this list would need to be built from name patterns, which could be unreliable.
The ListVolumes RPC is gated by the LIST_VOLUMES capability, so it would be possible to release the plugin without this functionality. Popular plugins, such as the AWS EBS CSI plugin, do not have the LIST_VOLUMES capability either, so there may not be much impact in not supporting it.
Users can benefit from this disk metadata when trying to correlate their Kubernetes workload with their Oxide infrastructure.
It is important to consider security aspects. CSI plugins run in containers, but node plugins need to run in privileged mode in order to mount and format disks.
The controller plugin is responsible for most of the API calls, but it doesn’t need to run in privileged mode.
The node plugin is mostly responsible for operations that are local to the node, but would need to make mutating API calls to support the pod metadata use case.
Only SINGLE_NODE_WRITER access mode supported
Priority: Deferred
Oxide disks can only be attached to a single instance at a time, and are always available for reads and writes, so the only access mode the plugin can support is SINGLE_NODE_WRITER.
This requirement can be documented and validated by the plugin, and SINGLE_NODE_WRITER is arguably the most common access mode used for disks, so this limitation should have little impact for most users.
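A sketch of that validation, rejecting any requested capability whose access mode is not SINGLE_NODE_WRITER:

package oxidecsi

import (
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// validateAccessModes rejects any requested capability whose access mode is
// not SINGLE_NODE_WRITER, the only mode an Oxide disk can satisfy today.
func validateAccessModes(caps []*csi.VolumeCapability) error {
	for _, c := range caps {
		if mode := c.GetAccessMode().GetMode(); mode != csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER {
			return fmt.Errorf("unsupported access mode: %s", mode)
		}
	}
	return nil
}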
Cloud provider plugins also have similar limitations. For example, AWS only supports MULTI_NODE_MULTI_WRITER with access type block and has no read-only support.
Supporting additional access modes will require substantial work, and without specific customer asks, this functionality can be deferred until needed.
Volume cloning
Priority: Deferred
CSI volumes can be created from three different sources: as a blank disk, from a snapshot, or by cloning an existing volume. The Oxide API can support the first two use cases, but not the third one.
Volume cloning could be implemented as an automated snapshot-and-restore operation, but this can become a complex saga and introduce challenges in terms of error handling and unwinding operations. Even if the Oxide API introduced a volume cloning functionality, the implementation would follow a similar pattern.
This feature is gated by the CLONE_VOLUME capability and not all cloud providers support this functionality, so it should be safe to defer implementation until a specific need arises.
Open questions
Open questions are decisions that have been deferred until more information is available to guide implementation.
Topologies and multi-rack clusters
[rfd-24] and [rfd-543] describe multi-rack deployments. In these scenarios, the Oxide CSI plugin needs to be aware of the Oxide fault domains and service coherence to avoid situations such as creating a disk in a rack that the instance running the workload will never be able to access.
These rules are codified in the CSI spec as topologies. CSI plugins need to resolve the topology request to determine where the volume should be created.
Each node plugin in the cluster can provide the topology segments that it can be accessed from in the response to the NodeGetInfo RPC. The CO then forwards this information to the controller plugin during the CreateVolume RPC so it can determine the best location to create the new volume.
Using the terminology proposed in [rfd-24], and considering that storage volumes have service coherence of a cell, a scenario with three racks (R1, R2, and R3) combined into two cells (C1 [R1, R2] and C2 [R3]) could have the Oxide CSI node plugin responding with the following topologies:
NodeGetInfo
Node plugin running on an instance in R1:
{"topology.oxide.computer/rack": "R1", "topology.oxide.computer/cell": "C1"}
Node plugin running on an instance in R2:
{"topology.oxide.computer/rack": "R2", "topology.oxide.computer/cell": "C1"}
Node plugin running on an instance in R3:
{"topology.oxide.computer/rack": "R3", "topology.oxide.computer/cell": "C2"}
When scheduling a workload on an Oxide instance running in rack R1, the CreateVolume RPC could receive the following topology requirements and preferences:
CreateVolume
requisite =
{"topology.oxide.computer/rack": "R1", "topology.oxide.computer/cell": "C1"},
{"topology.oxide.computer/rack": "R2", "topology.oxide.computer/cell": "C1"}
preferred =
{"topology.oxide.computer/rack": "R1", "topology.oxide.computer/cell": "C1"}
The controller plugin then first attempts to create the new disk in rack R1 (since this is the preferred rack), but falls back to R2 in case of failure. It never attempts to create the disk in R3 because that would violate the topology and cell-boundary constraint.
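A sketch of how the controller plugin could resolve such a request, walking the preferred list before the requisite list (the callback that checks whether a disk can be created in a given set of segments is hypothetical):

package oxidecsi

import (
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// pickTopology walks the preferred topologies first, then the requisite ones,
// and returns the first set of segments the controller can satisfy. The
// canCreateIn callback would encode the cell and rack placement rules
// discussed above.
func pickTopology(req *csi.TopologyRequirement, canCreateIn func(segments map[string]string) bool) (map[string]string, error) {
	candidates := append(append([]*csi.Topology{}, req.GetPreferred()...), req.GetRequisite()...)
	for _, t := range candidates {
		if canCreateIn(t.GetSegments()) {
			return t.GetSegments(), nil
		}
	}
	return nil, fmt.Errorf("no requested topology can be satisfied")
}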
This example uses the rack as the lowest level of failure domain. Cloud environments don't often expose this fine-grained level of resolution to users, but rather larger domains, such as a zone. We may opt for a similar approach and only define the topology.oxide.computer/ segment.
Another aspect to consider is adopting the Well-Known Labels, Annotations and Taints for Kubernetes and using topology.kubernetes.io/ instead of (or in addition to) the more specific topology.oxide.computer/ segment. But it's not clear at this point if there is any advantage in doing so, and more research is needed to understand how these topology segments affect pod scheduling.
Using existing CSI plugins as reference, the AWS EBS CSI plugin uses both segments while GCP only supports a custom one.
Given that multi-rack support is still under active discussion, topology support in the Oxide CSI plugin is deferred until these concepts are more solidified.
Implementation plan
The first release of the Oxide CSI plugin will leverage existing functionality and skip any functionality that is not possible to implement with the current features and APIs in Oxide.
The only additional work that is required is to remove the blocker on hot plugging disks to instances.
The goal of this initial release is to create the base work necessary for an MVP: project structure, CLI parsing, documentation, deployment artifacts, idempotency and concurrency mechanisms, gRPC server implementation etc.
More specifically, the first version of the CSI plugin will support:
Create and destroy Oxide disks.
Attach and detach Oxide disks to instances based on workload scheduling and volume mounts.
Snapshot and restore volumes.
And will have all the limitations and open questions listed above:
Limit of 8 volumes per node.
Plugin will authenticate with the API using device tokens.
The instance ID will need to be set manually on each node plugin instance.
Volumes cannot be expanded.
Disks will not have CSI-specific metadata.
Only SINGLE_NODE_WRITER access mode.
No support for volume cloning.
No concept of topologies.
Despite these limitations, this first release should provide users with an initial integration point from which we can gather feedback on where to focus next.
Some of the limitations also affect other projects, and may be implemented outside the scope of the CSI plugin. As the limitations are solved, we will be able to expand the list of functionalities supported by the CSI plugin.
Local storage
[rfd-584] and [rfd-590] describe the concept of local instance storage. Unlike Oxide disks, local storage has a different lifecycle that is intrinsically connected to the lifecycle of the instance it is attached to, and so it may not be suitable to be managed by the CSI plugin described in this RFD.
Depending on the exact details of how local storage will work and be exposed to instances, it may be possible to just use local volumes directly in Kubernetes. The Local Persistent Volume Static Provisioner could also be useful.
If local storage ends up requiring more work in order to be used with Kubernetes, we should create a separate CSI plugin, potentially with just the Node Services, since all operations will be local to the node.
The CSI spec provides some examples of what these headless deployments would look like.
CO "Node" Host(s) +-------------------------------------------+ | | | +------------+ +------------+ | | | CO | gRPC | Node | | | | +-----------> Plugin | | | +------------+ +------------+ | | | +-------------------------------------------+ Figure 4: Headless Plugin deployment, only the CO Node hosts run Plugins. A Node-only Plugin component supplies only the Node Service. Its GetPluginCapabilities RPC does not report the CONTROLLER_SERVICE capability.
Figure 8 (from the CSI spec): Plugins MAY forego other lifecycle steps by contraindicating them via the capabilities API. Interactions with the volumes of such plugins is reduced to `NodePublishVolume` and `NodeUnpublishVolume` calls.
Users would be able to deploy both plugins into their Kubernetes cluster and define different storage classes for each. This allows them to pick a faster local volume when needed instead of a disk-based volume.
External References
[rfd-24] https://24.rfd.oxide.computer/
[rfd-493] https://493.rfd.oxide.computer/
[rfd-543] https://543.rfd.oxide.computer/
[rfd-553] https://553.rfd.oxide.computer/
[rfd-584] https://584.rfd.oxide.computer/
[rfd-590] https://590.rfd.oxide.computer/
[csi-spec] https://github.com/container-storage-interface/spec/blob/master/spec.md
[k8s-deployment] https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
[k8s-daemonset] https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
[k8s-storage-class] https://kubernetes.io/docs/concepts/storage/storage-classes/
[k8s-csi-driver] https://kubernetes.io/docs/reference/kubernetes-api/config-and-storage-resources/csi-driver-v1/
[k8s-storage-limits] https://kubernetes.io/docs/concepts/storage/storage-limits/
[k8s-external-provisioner] https://kubernetes-csi.github.io/docs/external-provisioner.html
[k8s-livenessprobe] https://kubernetes-csi.github.io/docs/livenessprobe.html
[mount-utils] https://pkg.go.dev/k8s.io/mount-utils
[csi-create-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#createvolume
[csi-controller-publish-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#controllerpublishvolume
[csi-node-get-info] https://github.com/container-storage-interface/spec/blob/master/spec.md#nodegetinfo
[csi-node-stage-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#nodestagevolume
[csi-node-publish-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#nodepublishvolume
[csi-node-unpublish-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#nodeunpublishvolume
[csi-node-unstage-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#nodeunstagevolume
[csi-controller-unpublish-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#controllerunpublishvolume
[csi-delete-volume] https://github.com/container-storage-interface/spec/blob/master/spec.md#deletevolume
https://www.redhat.com/en/blog/persistent-volume-support-peer-pods-technical-deep-dive