RFD 60 - Storage Architecture Considerations

Authors

Adam H. Leventhal

Labels

control-plane, storage

Note

This RFD was the subject of an Oxide and Friends episode, Crucible: The Oxide Storage Service.

Introduction

The Oxide rack will have a block storage facility to provide virtual disks to VM instances. Oxide block storage will provide resilient, high (not highest) performance, secure virtual block devices (disks) to VM instances (see below for discussion of each topic). In addition it will provide mechanisms for storing and instantiating VM images (root devices). It will provide mechanisms for in-rack snapshots as well as out-of-rack backups.

There are several approaches we might reasonably pursue; the purpose of this document is to provide the constraints and criteria that will inform our direction.

Terminal State

The goal of this RFD is to explore various potential architectures, identify relevant investigations, and, finally, decide on a specific plan regarding the architecture. A subsequent RFD will detail that design, decompose it into discrete subcomponents, and flesh out the specification of features deemed important (e.g. VM image template storage, backup, migration), but ancillary in this document. When in the published state, this document will select and describe the high level design from among several options.

User Experience

Unlike networking or IAM, for example, the range of user-visible entities and settings is relatively small. The typical use case will simply be to specify the required size and the number of IOPS required for the disk. Both storage capacity and IOPS are limited resources of the servers in the rack so will typically have an internal billing rate associated with them (these are also the units by which public cloud vendors charge). Creation of new disks should be constant time and "fast" as perceived by a human (i.e. the user should "feel" like creation is responsive).

There are a few other settings a user might manage. On a per-disk basis, users will be able to configure a snapshot creation and retention policy (including a storage capacity reservation for snapshots). Users will be able to duplicate a disk. Users will be able to create VM images (root device templates) from disks. Users will be able to attach and detach disks from VM instances.

Principles and Priorities

A data storage aphorism is that it takes a decade for a system to mature. Our timeline for building and validation does not allow for a decade, however, requiring that we take on a "plumber mentality": as much as possible we need to select well-tested components in the data path, run the piping between them, and get out of the way.

Of course because we’re dealing with persistent storage, we do need to consider how we might evolve the product. We need to assume not just version heterogeneity (i.e. different versions of our software cooperating), but also architecture heterogeneity, anticipating an improved storage architecture, less constrained by time and risk. For example, we should expect to migrate SSDs, servers, and customer VMs.

Our key concerns are data integrity, data security, performance sufficiency (i.e. as opposed to "high" or "highest" performance), and delivery timeline. The first version will be unambitious—especially when compared with dedicated storage products—when it comes to high performance and storage efficiency (i.e. economics).

In a few sections, we’ll cover relevant technologies we might plumb together, but first we’ll discuss hardware, a general architecture, and the various functional components that need to exist in the system. Note that an off-the-shelf storage solution (Ceph is the prime contender) would be a great solution if it satisfies our requirements; the discussion below regarding architecture pertains to components we’d assemble rather than using as a (mostly) complete solution.

Hardware

A key architectural fork in the road is the connectivity of a given storage endpoint (i.e. SSD): does it attach to a single host or to multiple hosts? The advantage of the former is simpler hardware (fewer, less complex parts) and fewer pathological scenarios to avoid (e.g. split brain). The advantage of the latter is potentially reduced need for data redundancy and the accompanying reduction in data traffic. Given that reducing scope is paramount and economics are secondary, I propose that we consider only dedicated storage hardware architectures (i.e. disks in a compute server attached to the CPU via NVMe). This also increases the diversity of suppliers for SSDs as dual port support is much more limited across most vendors.

High Level Architecture

As a black box, the storage facility has relatively straightforward inputs and outputs: on one side it provides virtual disks to VMs via a block protocol; on the other side it connects with disks, also via a block protocol. The storage facility satisfies reads and writes (and management commands) from the VM through a circuitous route that leads to the disks. While it speaks a block protocol on both sides, the inner workings will be quite complex. Expanding that black box, there are two distinct—but highly related—architectures to consider:

The names "North" and "South" are chosen generically to refer to the components close to the customer VM and close to the underlying storage media (SSDs) respectively.

At the top, we’ve got VMs communicating via an emulated block device to the North component. In option A, the customer VM and North are on the same server; in option B, the customer VM is on a different server. North communicates to South over our internal network via whatever network protocol we deem appropriate. South connects with the locally attached SSDs via a block protocol (NVMe). See more on the responsibilities of these components below, but North will send data to (and retrieve data from) multiple instances of South to ensure data redundancy.

The distinction between the two is simply whether VMs run on the same compute server as the "North" component (in which case the two would communicate via a local protocol) or VMs run on a different compute server (in which case the two would communicate via a network protocol). From the perspective of the key architectural questions of this document, the distinction between those two is relatively minor; however, Option B introduces an additional failure domain which adds complexity. The empirical evaluation of CPU and memory consumption of the storage system will inform this decision; I believe it’s one we can safely defer until we have more data. If it turns out North consumes a lot of CPU/memory, customers need large VMs, or for security considerations, we may want to quarantine those workloads. Let’s assume for now that compute servers run customer VMs as well as all components of the storage system.

Without loss of generality, we can think of a volume (as presented to the VM instance) as being composed of multiple chunks (VMware VSAN calls these "components") which are distributed among different compute servers. Later in this document we’ll refine this.

Note that the configuration between North and South-served chunks is more-or-less static. In other words, when a volume is created, the control plane allocates chunks from South and communicates the placement of those chunks to North. This is a less ambitious approach than a true clustered or distributed storage system where data for a given entity, such as a volume, might be scattered among all nodes.
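The static placement described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `place_volume` function, the 1 GiB chunk size, the server names, and the "most free slots first" policy are all hypothetical and not part of this RFD.

```python
from dataclasses import dataclass

CHUNK_BYTES = 1 << 30  # assumed 1 GiB chunk size, for illustration only


@dataclass(frozen=True)
class Chunk:
    south: str   # hypothetical identifier of a South instance
    index: int   # chunk slot on that South instance


def place_volume(size_bytes, free_slots):
    """Control-plane sketch: allocate one chunk per CHUNK_BYTES of volume,
    each time picking the South with the most free slots (assumed policy)."""
    count = -(-size_bytes // CHUNK_BYTES)  # ceiling division
    placement = []
    for _ in range(count):
        south = max(free_slots, key=lambda s: len(free_slots[s]))
        if not free_slots[south]:
            raise RuntimeError("no free chunks left in any South instance")
        placement.append(Chunk(south, free_slots[south].pop(0)))
    return placement


# The resulting list is what the control plane would hand to North.
free = {"south-a": [0, 1, 2], "south-b": [0, 1], "south-c": [0]}
volume_map = place_volume(3 * CHUNK_BYTES, free)
```

Once computed, the map changes only when the control plane intervenes, which is what distinguishes this from a clustered system that rebalances data among all nodes.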

Functional Components

Below is a list of functions—roughly, activities or services within the broader storage system. In some cases there is a seemingly innate choice about which of North or South will host a given mechanism. In other cases, the mechanism could live in either. For any given mechanism it seems ideal to have it implemented in one of North and South and not in both for the sake of simplicity and efficiency.

Data Redundancy (North)

Customer data must be able to survive server and SSD failure. Accordingly North is responsible for sending data to multiple instances of South. Initially this might look like simple redundancy / mirroring (i.e. because economic efficiency isn’t a top-level goal); in the future this will look more like erasure coding (EC) / RAID.

Data Repair and Reconstruction (North)

When chunks / components of a volume become temporarily or permanently absent, we need mechanisms to track which chunks contain valid data, repair chunks after a transient absence, or reconstruct them in coordination with the control plane.

Valid Data Tracking

If a chunk is offline we need to maintain information about ranges of valid data. Consider the case of a 2-way mirror. Most of the time we will keep both sides up to date. If one side is unavailable, only the available half will contain the latest data. When the absent half returns, we will need to ensure that we don’t issue reads for invalid or stale data. Ideally, we might issue reads to known good regions to distribute load.
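A minimal sketch of the 2-way mirror case, assuming a hypothetical `MirrorTracker` that records per-block staleness in memory. A real implementation would need to persist this record and would likely track ranges rather than individual blocks.

```python
class MirrorTracker:
    """Tracks, per mirror side, which blocks missed writes while offline."""

    def __init__(self, sides):
        self.online = {s: True for s in sides}
        self.stale = {s: set() for s in sides}  # block numbers needing repair

    def set_online(self, side, up):
        self.online[side] = up

    def record_write(self, block):
        """Return the sides that received the write; mark the rest stale."""
        written = [s for s in self.online if self.online[s]]
        if not written:
            raise RuntimeError("no mirror side available for write")
        for s in self.online:
            if self.online[s]:
                self.stale[s].discard(block)  # a fresh write makes it valid
            else:
                self.stale[s].add(block)
        return written

    def readable_sides(self, block):
        """Sides that are online and hold valid data for this block."""
        return [s for s in self.online
                if self.online[s] and block not in self.stale[s]]


t = MirrorTracker(["left", "right"])
t.record_write(7)              # both sides current
t.set_online("right", False)
t.record_write(7)              # only "left" sees the new data
t.set_online("right", True)    # "right" returns holding a stale block 7
```

After the sequence above, `readable_sides(7)` excludes the returning side until block 7 is rewritten or repaired, while untouched blocks remain readable from both sides and can share load.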

Repair

When a chunk contains mostly valid data, but has missed some writes, we’ll need to bring it up to date. This is a specific case of reconstruction, but it is likely a necessary optimization because transient failures are uncommon but expected occurrences.

Reconstruction

If (through some coordination of the storage system, control plane, and—possibly—the operator) we determine that a chunk is gone forever, we’ll need a mechanism to reconstruct it in order to regain the desired level of data redundancy.

North seems like the obvious owner for all of these as it is the component that has visibility into the chunks of redundant data.

Data Integrity (North/South)

We want to ensure with extremely high confidence that data returned to a VM as the result of a read matches exactly the data previously written. While SSDs promise very low error rates, we do not want to rely on them to detect invalid data [disagreement welcome on this point if there are dissenting opinions]. This implies that we will store checksums that validate user data.

Legacy storage systems would store user data and checksums together. For example, rather than 512B sectors, they would use media with 520B sectors and store an 8 byte checksum alongside user data. This has some obvious gaps e.g. a sector that contains stale data still appears valid (phantom write), and a write sent to the wrong sector appears valid (misdirected write). Contemporary storage systems employ a Merkle tree of checksums in which the checksum is stored alongside the reference to the data rather than with the data itself. We’ll pursue the modern approach.
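The phantom write gap can be illustrated with a small sketch. It models only one level of parent-held checksums rather than a full Merkle tree, and it uses SHA-256 and in-memory dictionaries as stand-ins; the function names are hypothetical.

```python
import hashlib


def digest(data):
    return hashlib.sha256(data).digest()


# Legacy layout: each sector carries its own checksum next to the data.
def legacy_write(disk, lba, data):
    disk[lba] = (data, digest(data))


def legacy_read_ok(disk, lba):
    data, stored = disk[lba]
    return digest(data) == stored


# Modern layout: the checksum lives with the reference, not with the data.
def ref_write(disk, refs, lba, data):
    disk[lba] = data
    refs[lba] = digest(data)


def ref_read_ok(disk, refs, lba):
    return digest(disk[lba]) == refs[lba]


old, new = b"old" * 100, b"new" * 100

# Phantom write: the device acknowledges the write of `new` but drops it,
# so sector 5 still holds `old` together with its matching checksum.
legacy_disk = {}
legacy_write(legacy_disk, 5, old)
phantom_undetected = legacy_read_ok(legacy_disk, 5)

# With the checksum in the reference, the parent was updated for `new`
# while the data write was dropped, so the mismatch is visible on read.
ref_disk, refs = {}, {}
ref_write(ref_disk, refs, 5, old)
refs[5] = digest(new)
phantom_detected = not ref_read_ok(ref_disk, refs, 5)
```

The same argument applies to a misdirected write: the sector that was overwritten by mistake no longer matches the checksum held by its own reference.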

One could imagine either North or South owning this responsibility:

North: Checksums here are attractive because it would ensure end-to-end integrity including validating that data wasn’t corrupted over the network transport between North and South. Mechanisms such as TCP checksums and TLS sessions can detect corruption in-flight, however an important lesson from Manta was a single end-to-end checksum simplifies protection against software and hardware failures of all types. If we already need to have a translation layer in the North component—i.e. between client-facing LBA and an internal locator—it may be convenient to include a checksum.
South: We may need a translation layer between the internal, network protocol and SSD blocks. It would be easy to include a checksum in this translation layer.

When data corruption is detected, it will be reported to a fault management facility—likely part of the general control plane, beyond the scope of this document.

This may be one of the limited instances where we are content to have overlapping mechanisms between North and South. Both may have a role to play in end-to-end data integrity. In particular when scrubbing data we would ideally like South to be able to operate autonomously, or—barring that—to minimize the network traffic between North and South due to a scrub (e.g. with a command from North to South that includes a checksum that could be validated on South without transferring data across the network).
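One way to picture the scrub command mentioned above is the sketch below, in which North sends only a checksum and South answers with a boolean. The `North` and `South` classes, the `verify` call, and the use of SHA-256 are assumptions for illustration, not a proposed protocol.

```python
import hashlib


class South:
    """Hypothetical South instance holding blocks keyed by LBA."""

    def __init__(self):
        self.blocks = {}

    def write(self, lba, data):
        self.blocks[lba] = data

    def verify(self, lba, expected):
        """Scrub one block locally; only a boolean crosses the network."""
        return hashlib.sha256(self.blocks[lba]).digest() == expected


class North:
    """Hypothetical North that keeps the checksum for each block it wrote."""

    def __init__(self, south):
        self.south = south
        self.checksums = {}

    def write(self, lba, data):
        self.checksums[lba] = hashlib.sha256(data).digest()
        self.south.write(lba, data)

    def scrub(self):
        """Return the LBAs whose stored data no longer matches."""
        return [lba for lba, c in sorted(self.checksums.items())
                if not self.south.verify(lba, c)]


south = South()
north = North(south)
north.write(0, b"a" * 4096)
north.write(1, b"b" * 4096)
south.blocks[1] = b"b" * 4095 + b"x"   # simulate silent corruption on the SSD
bad = north.scrub()
```

In this sketch each scrubbed block costs one small request and one small reply instead of a 4KB transfer; a fully autonomous South would avoid even that, at the price of South holding the checksums itself.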

Snapshots (North/South)

There are two general approaches regarding snapshots:

Co-locate live and snapshot data
Use a dedicated "delta disk" to hold data associated with snapshots

COW filesystems such as ZFS, Btrfs, and WAFL use the former: new data are written to new locations while old data may be retained for snapshots. EMC storage arrays, VMware vSAN, and others use the latter: overwritten data triggers a copy of old data to a disk or region dedicated for snapshot storage. In general, we prefer the first approach as it is simpler to reason about and to operate (although the second has the advantage of simpler capacity management for data vs. snapshot storage).

In either North or South (or a combination of the two) we will have metadata that describes all user data. A snapshot preserves a point-in-time, consistent representation of that metadata along with the data it refers to.

The control plane will own the snapshot policy: when snapshots are created and when they’re deleted. North will, at a minimum, need to coordinate top level snapshot actions. Either North will own the metadata and be responsible for taking snapshots or South will own the metadata and North will need to "freeze" writes and coordinate the various South chunks.

Rate Limiting (North)

Users will administer storage performance with comprehensible knobs such as IOPS (which AWS EBS uses) and throughput. While traffic at any point in the system could be shaped, it makes the most sense to do it closest to the VM before operations have been aggregated or optimized deeper into the system.

While the primary, user-facing rate limiting will be in North, South will likely require some aspect of rate limiting to protect against DoS or other cascading failures.
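As a rough sketch of an IOPS knob enforced in North, the token bucket below admits I/Os at a provisioned rate with a bounded burst. The token bucket is one common choice and not something this RFD prescribes; the class name, parameters, and explicit `now` argument (used to keep the example deterministic) are assumptions.

```python
class IopsLimiter:
    """Token bucket: `iops` tokens are added per second, up to `burst`."""

    def __init__(self, iops, burst, now=0.0):
        self.iops = iops
        self.burst = burst
        self.tokens = float(burst)
        self.last = now

    def try_acquire(self, now, ops=1):
        """Return True if `ops` I/Os may be issued at time `now` (seconds)."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.iops)
        if self.tokens >= ops:
            self.tokens -= ops
            return True
        return False


limiter = IopsLimiter(iops=100, burst=10)
# 25 I/Os arrive at the same instant; only the burst allowance is admitted.
admitted = sum(limiter.try_acquire(now=0.0) for _ in range(25))
# 50 ms later the bucket has refilled enough to admit more work.
later = limiter.try_acquire(now=0.05)
```

A coarser instance of the same structure could serve as the protective limit in South, keyed per North instance rather than per volume.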

Compression (North/South)

Compression could happen at either North or South, but we’ll want to compress before we encrypt. That said, compression may yield little to no benefit. It will be common for a VM instance to write in 4KB units. Even if the data were significantly compressible, we would need a 4KB sector on an SSD to store the data. In addition, combining compression with encryption may leave us exposed to CRIME-style attacks, in which the size of the compressed ciphertext leaks the compressibility of the plaintext, from which it can be possible to extract data. While we should consider it in our architecture, compression may be a post-v1.0 item depending on the efficacy we anticipate.
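The sector-rounding argument can be checked with a few lines of arithmetic. The sketch assumes 4KB SSD sectors and uses zlib purely as a convenient stand-in compressor; the `sectors_needed` helper is hypothetical.

```python
import os
import zlib

SECTOR = 4096  # assumed SSD sector size, matching the 4KB guest write unit


def sectors_needed(data):
    """Sectors consumed if `data` is compressed and stored sector-aligned."""
    compressed = zlib.compress(data)
    stored = min(len(compressed), len(data))  # keep raw if compression loses
    return -(-stored // SECTOR)               # ceiling division


text_4k = (b"highly compressible " * 205)[:4096]
random_4k = os.urandom(4096)
text_128k = text_4k * 32

one_block_text = sectors_needed(text_4k)      # still a whole sector
one_block_random = sectors_needed(random_4k)  # incompressible, one sector
large_block_text = sectors_needed(text_128k)  # far fewer than 32 sectors
```

A compressible 4KB write still occupies one full sector, so any saving would have to come from aggregating writes into larger records before compressing them.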

Encryption (North/South)

Encryption could similarly happen in either component. If we want a different key for each volume (with a given VM granted access to particular keys) then encryption should happen in North (or even north of North, before we receive the block data). It may also be attractive to encrypt in North so data are encrypted end-to-end as it flows through the system.

Note that encryption requires additional metadata for each block such as additional entropy (e.g. salt, initialization vector (IV)), and a message authentication code (MAC — effectively a checksum to validate the ciphertext). We would ideally like to use existing, vetted components for encryption. Barring that, we’ll faithfully replicate the mechanism from a vetted component.

Allocation and Free List Management (North/South)

Storage is a finite resource. Some component of our storage system will need to keep track of what blocks are free or in use, and it will need to allocate new blocks intelligently. Much ink has been spilled regarding the best ways to allocate and manage blocks because it’s a critical determinant of performance especially as the system grows fragmented, full, or both.

A given SSD will host chunks that correspond to disparate volumes. Reasonable architectures could variously assign allocation as a responsibility of either North or South. In North, it would assume a relatively static division of SSDs in South, assigning large chunks in which North would manage the fine-grained blocks. Alternatively, South could handle allocation, managing blocks for different North instances effectively side-by-side. This would allow for a more dynamic management of available capacity.

Device Management (South)

South is responsible for anything that requires direct access to hardware such as retiring devices. It will also collect telemetry relevant for fault management and feed it into a generic facility.

Candidate Software

Ideally we would find a well-suited, open source component that meets our requirements. Barring that, we’d like to assemble a novel architecture with open source components well-suited for portions of our architecture, and de novo code as appropriate.

This section covers a non-exhaustive list of components under evaluation. If there are others that bear discussion, please leave a comment or add the results of your investigation.

Potential Complete Solutions

Ceph

Ceph is a full clustered storage system that supports block, file, and object storage types.

Evaluating Ceph according to the taxonomy of components above, OSDs correspond most closely to South, and librados clients (e.g. RBD, the RADOS Block Device) correspond to North. OSDs are responsible for (aspects of) device management, block allocation, compression, encryption (though there are different options here). RBD is responsible for snapshots. The chatty RADOS layer is distributed among all participating nodes and is responsible for data redundancy, data repair, and data integrity.

Our requirements fall within the scope of Ceph’s intended use. It has plugins for qemu-kvm, openstack, etc. Cloud providers such as Digital Ocean and Tencent use Ceph as the basis for block storage products. Conversely, Ceph has a reputation for pathological failures, high CPU consumption, and a need for hands-on attention (in the words of a contact at DO "Ceph is operated not shipped").

We’re continuing to explore the possibility of using Ceph, in particular concretely evaluating its performance and resource consumption while looking for evidence of operational challenges.

Lustre

Lustre is a clustered filesystem. We haven’t explored it too seriously given its focus on a file-based interface and on building a global namespace—neither of which is particularly relevant for our use.

Lustre is listed for completeness and to invite discussion if people think it’s worth a closer look.

GlusterFS

GlusterFS seems strictly less well-suited than Ceph. It relies on the underlying filesystem for its South equivalent (XFS in particular), and uses Swift to emulate a block layer. Both of those attributes made us defer serious exploration.

GlusterFS is listed for completeness and to invite discussion if people think it’s worth a closer look.

Potential Components

OpenZFS (local filesystem)

OpenZFS (née ZFS) is a local filesystem that incorporates traditional volume manager capabilities. It was developed for Solaris starting in 2001. Its current incarnation, OpenZFS, has recently moved from targeting illumos to Linux as its first-class platform (however illumos is still supported). ZFS is the storage component with which the Oxide team has far and away the most experience and expertise, having written key components, and shipped multiple products based on ZFS: the ZFS Storage Appliance (Fishworks), Manta and other components of the Joyent cloud, and the Delphix Virtual Appliance.

OpenZFS validates all data in a Merkle tree. It supports constant-time snapshots, RAID, compression, and encryption. It’s proved to be reliable and high performance. If we’re going to use something that looks like a local filesystem, OpenZFS is a compelling choice. It compares favorably to other local filesystems, and the team’s expertise means we would proceed quickly.

OpenZFS has facilities for each of the required functional components above, of course limited to a single system rather than replicated across a cluster.

DRBD (distributed volume manager)

DRBD (Distributed Replicated Block Device) is a facility for remote mirroring of block devices. It was first developed in the late ‘90s and integrated into the Linux kernel in the mid ‘00s. It operates like a volume manager with synchronous remote replication.

While it’s worth keeping in mind, DRBD is likely too closely tied both to direct block device ownership and to the Linux kernel to be obviously useful to us.

EdgeFS

EdgeFS is a storage project out of Nexenta primarily focused on Kubernetes storage. It has a strong emphasis on geographic distribution, and heterogeneous protocols—neither of which is relevant for us. While the project seems to have some momentum (and it’s latched on to Rook https://rook.io/ which has even more momentum), it appears to be fairly early, not particularly battle tested, and oriented towards a different problem space than ours.

Apparently the new owners of Nexenta are trying to walk back the open sourcing of EdgeFS… which makes this even easier to ignore.

Distributed K-V Stores

This is a blanket category for technologies such as CockroachDB, Riak, or TiKV: distributed key-value stores that manage redundancy, consensus, etc. A block device is a special case of a key-value store: keys are unique values within the LBA range, and values are 4KB data blocks.
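The "block device as key-value store" observation can be made concrete with a short sketch. A plain dictionary stands in for a distributed k-v client here, and the `KvBlockDevice` class is hypothetical; none of the redundancy or consensus behavior of a real store is modeled.

```python
BLOCK = 4096  # 4KB data blocks, as in the text


class KvBlockDevice:
    """Block device facade over any mapping-like key-value store."""

    def __init__(self, store, num_blocks):
        self.store = store            # stand-in for a distributed k-v client
        self.num_blocks = num_blocks

    def _check(self, lba):
        if not 0 <= lba < self.num_blocks:
            raise IndexError(f"LBA {lba} outside device")

    def write(self, lba, data):
        self._check(lba)
        if len(data) != BLOCK:
            raise ValueError("writes must be exactly one block")
        self.store[lba] = bytes(data)

    def read(self, lba):
        self._check(lba)
        # Never-written blocks read back as zeros, like a fresh disk.
        return self.store.get(lba, bytes(BLOCK))


dev = KvBlockDevice(store={}, num_blocks=1024)
dev.write(3, b"\x01" * BLOCK)
```

Note that keys absent from the store cost nothing, so thin provisioning falls out of this mapping naturally.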

These key-value stores solve many relevant problems such as temporary unavailability, restoring consensus, etc. While it is tantalizing to use such a technology, and further investigation is warranted, there doesn’t seem to be much use of k-v stores for use cases like ours. Being unique doesn’t typically scare off Oxide designs, but in this case—where user data integrity is concerned—we may want to be more cautious.

Potential Architectures

This section describes some architectures under consideration. Additional proposals are extremely welcome. Each subsection describes an architecture as well as how the services above slot into North or South.

Southern Volume Manager

This approach implements South as a very simple layer, carving up each SSD into a number of fixed-sized chunks, each exposed over the network to a specific North (as dictated by the control plane). North would be implemented as a virtual block device layer (such as ZFS exposing a single zvol) using those network-visible, remote block devices as the backing store.

Data Redundancy: North
Data Repair / Reconstruction: North
Data Integrity: North
Snapshots: North
Rate Limiting: North
Compression: North
Encryption: North
Allocation / Free List: North
Device Management: South

This moves almost all responsibility to North, leaving South to monitor disks, and route reads and writes. South would be responsible for aggregating fault telemetry (e.g. failed checksums) from North to diagnose increased failure rates in particular SSDs.

The control plane would be responsible for assigning chunks from various instances of South to various instances of North (within the same Cell). The number of chunks (read, the total capacity) would be dynamic, but slow moving. That is to say, the control plane could assign new chunks (or, potentially, remove chunks), but it would be a relatively uncommon event.

The amount of storage assigned to a given instance of North would be a function of several factors:

The size of the virtual disk
The amount of metadata required to support that virtual disk (a TBD static function related to the size of the disk)
A "performance reservoir" discussed below
The space specified by the user for snapshots

Performance Reservoir

Regarding a "performance reservoir", ZFS (and all COW filesystems) suffer tremendously when the system is low on space. This is because allocations become harder and harder to satisfy as we search for tiny empty spots in a highly fragmented LBA space. Overall performance falls off dramatically and non-linearly as the pool fills. At Delphix, we reserved a relatively large 15% of the total pool size to avoid this performance cliff. Our empirical testing will determine the appropriate level of overprovisioning; we may expose a service or operator knob to adjust this in the field.
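To make the sizing factors concrete, here is a small arithmetic sketch. The 1 GiB chunk size and the 1% metadata fraction are placeholders (the metadata function is still TBD), and the 15% reservoir simply reuses the Delphix figure as an example; the `chunks_for_north` helper is hypothetical.

```python
import math

CHUNK_BYTES = 1 << 30          # assumed chunk size (1 GiB)
METADATA_FRACTION = 0.01       # placeholder; the real function is TBD
RESERVOIR_FRACTION = 0.15      # the Delphix figure, used only as an example


def chunks_for_north(disk_bytes, snapshot_bytes):
    """Chunks to assign so the reservoir is a fixed share of the whole pool."""
    metadata = disk_bytes * METADATA_FRACTION
    usable = disk_bytes + metadata + snapshot_bytes
    pool = usable / (1.0 - RESERVOIR_FRACTION)
    return math.ceil(pool / CHUNK_BYTES)


GIB = 1 << 30
chunks = chunks_for_north(disk_bytes=100 * GIB, snapshot_bytes=20 * GIB)
```

Under these assumptions a 100 GiB disk with a 20 GiB snapshot reservation needs 121 GiB of usable space, which becomes 143 chunks once the reservoir is carved out of the total pool.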

Wrap-up

This approach has several negative attributes:

We need to determine a relatively static overprovisioning
We need the user to state a snapshot quota (though that may be a useful requirement regardless)
Thin provisioning would require some careful logic
- For example, we could assign fewer slices to North and then allocate them on-demand as North depleted its assigned storage

This approach also has the positive attribute of letting us use many ZFS capabilities without much effort: compression, encryption, and checksums out of the box, as well as RAID (though the built-in ZFS RAID may not be well-suited for 4KB virtual block devices).

Northern Mux

This approach implements South as a virtual block device layer (e.g. ZFS exposing zvol volumes). The virtual disk volume in North is statically divided into LBA ranges (e.g. in 1GB chunks); each range has 3 associated instances of South. North and South communicate over a network protocol for both block data and related metadata. North handles a write by checking its LBA and forwarding it to the appropriate South instances. Reads would be distributed to the various components (e.g. round-robin, or more intelligently based on observed performance).
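The routing logic described above can be sketched as follows, assuming LBAs are counted in 4KB blocks and using the 1GB range size from the example. The `Mux` class and the South names are hypothetical, and failure handling is deliberately left out.

```python
import itertools

BLOCK = 4096
RANGE_BYTES = 1 << 30                   # 1GB ranges, as in the example above
BLOCKS_PER_RANGE = RANGE_BYTES // BLOCK


class Mux:
    """Routes block I/O to the three South instances that own each LBA range."""

    def __init__(self, range_owners):
        # range_owners[i] lists the South names for range i (hypothetical)
        self.range_owners = range_owners
        self._rr = [itertools.cycle(owners) for owners in range_owners]

    def range_of(self, lba):
        return lba // BLOCKS_PER_RANGE

    def write_targets(self, lba):
        """A write is forwarded to every mirror of the range."""
        return list(self.range_owners[self.range_of(lba)])

    def read_target(self, lba):
        """Reads rotate round-robin across the mirrors of the range."""
        return next(self._rr[self.range_of(lba)])


mux = Mux([["s1", "s2", "s3"], ["s2", "s4", "s5"]])
first_block_of_range_1 = BLOCKS_PER_RANGE
reads = [mux.read_target(0) for _ in range(4)]
```

The happy path is a division and a table lookup; the complexity discussed below comes entirely from what to do when one of the three targets is unavailable.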

Data Redundancy: North
Data Repair / Reconstruction: North
Data Integrity: North or South (see below)
Snapshots: South (but coordinated by North)
Rate Limiting: North
Compression: North or South (see below)
Encryption: North or South (see below)
Allocation / Free List: South
Device Management: South

This pushes most of the complexity South with the goal of making North a relatively simple layer. Absent a failure, North would forward operations to the appropriate South based simply on the LBA.

Repair and Reconstruction

North does, however, require some sophistication in order to handle data repair and reconstruction. We expect that instances of South will be temporarily unavailable as a part of normal operation (e.g. rebooting a server for an upgrade). It would be prohibitively expensive (and impractical in the details) to fully reconstruct a South instance after a temporary unavailability, therefore North would need to maintain a persistent record of writes pending for a given device. This could be done compactly, e.g. referencing data on the other, known valid components of the mirror. The persistent record could live on the related arms of the mirror or on some volume-wide metadata hosted on different South facilities (e.g. a filesystem or kv-store).

Since South instances won’t always have the correct data, North must be careful to direct reads only to South instances with data we expect to be valid. It would be tempting to declare an instance of South as unavailable for reads until it has been fully recovered from a temporary outage. It’s likely, however, that all arms of a mirror could be incomplete (consider a rolling upgrade across a cluster which would render all South instances temporarily unavailable one at a time). In that case, there would be no "golden copy" of user data. In addition to persisting a record of pending writes, North would also need to keep an in-memory cache of valid blocks (e.g. a bit per dirty block). Some careful consideration would be required to keep the memory footprint of this structure compact.
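A quick estimate of the memory footprint of "a bit per dirty block", together with a minimal bitmap, is sketched below. It assumes 4KB blocks and one flat bitmap per mirror arm; the `DirtyBitmap` class is hypothetical, and a compact design would likely use something sparser.

```python
BLOCK = 4096


def bitmap_bytes(volume_bytes, block=BLOCK):
    """Bytes of memory for one bit per block of one mirror arm."""
    blocks = -(-volume_bytes // block)   # ceiling division
    return -(-blocks // 8)


class DirtyBitmap:
    """One bit per block; a set bit means the block is stale on this arm."""

    def __init__(self, volume_bytes):
        self.bits = bytearray(bitmap_bytes(volume_bytes))

    def mark_dirty(self, blk):
        self.bits[blk // 8] |= 1 << (blk % 8)

    def mark_clean(self, blk):
        self.bits[blk // 8] &= ~(1 << (blk % 8)) & 0xFF

    def is_dirty(self, blk):
        return bool(self.bits[blk // 8] & (1 << (blk % 8)))


TIB = 1 << 40
per_arm = bitmap_bytes(TIB)          # one arm of a 1 TiB volume
bm = DirtyBitmap(1 << 30)
bm.mark_dirty(12345)
```

Under these assumptions a flat bitmap costs 32 MiB per arm per TiB of volume, or 96 MiB for three arms, which is why a sparse or range-based structure deserves consideration.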

Compression, Encryption, and Data Integrity

Given the need for ancillary metadata, we have some flexibility about whether North or South takes primary responsibility for compression, encryption, and data integrity. South could do that work as it already has facilities for each. North could also do these with the benefit of end-to-end security and integrity. North would compress data (optionally), encrypt it, generate a checksum, and store the relevant metadata for each block. To reiterate, this would lean on the metadata storage already required to handle write failures (e.g. due to unavailability) to South instances.

Wrap-up

This approach offers surface-level simplicity, but gets complicated as we wade into the details. While there may be shortcuts, recovering data consistency may be as complex as quorum recovery in distributed k-v stores. The mux (we write) will need to handle some tricky data paths with a wide range of corner cases. This isn’t disqualifying, but we will need to embrace the complexity of this solution should we choose it.

ZFS on ZFS

A simple approach would implement both North and South as ZFS. Southern ZFS would expose volumes over the network. Northern ZFS would build a pool on those volumes, and create a single volume for the VM.

Data Redundancy: North
Data Repair / Reconstruction: North
Data Integrity: North
Snapshots: North
Rate Limiting: North
Compression: North
Encryption: North
Allocation / Free List: North AND South (see discussion below)
Device Management: South

Compared with the two previous approaches, this approach offers the SSD space sharing on South as with the Northern Mux, and it allows us to use—rather than build—complex logic to handle recovery from expected failures. It would also allow for a more flexible assignment of storage from South to North unlike the fixed chunk size of the previous two approaches.

Concerns

This approach has several associated concerns.

ZFS on ZFS carries a word of mouth stigma. This configuration has been used in testing since the earliest days of ZFS, but there are reports of poor performance and instability. These are likely surmountable, but need to be understood and explored.

North and South would both be doing the work of free block management and allocation. It doesn’t necessarily make this architecture untenable, but it does violate a stated goal. Also (as with the Southern Volume Manager above), this approach would require us to determine a storage allocation statically (i.e. how big would we make the North storage pool).

These last two concerns might be mitigated by having South expose much larger volumes than required by North. North would still need to do the work of allocation, but allocation is much simpler in a mostly unoccupied pool. It would, however, be extremely important for North to TRIM unused data and to make sure that path was plumbed all the way through to South.

Wrap-up

The approach carries some uncertainty regarding functionality and performance (that we need to test). It likely trades complexity in the data plane for complexity in the control plane (but that may be a trade we’re happy to take).

ZFS with Remote Allocation

A variant of ZFS on ZFS, this approach imagines replacing the block-based allocator in North’s ZFS with a protocol that delegates allocation to South. All filesystems need to determine a free LBA at which to store new data. For subsequent reads, this LBA effectively acts as a key used to retrieve the related value. One can imagine replacing the code that writes data and returns the LBA to the consumer with something that looks more like interactions with a key/value store, in this case a new, non-block-based interface shared by South.

Consider the ZFS IO pipeline for logical writes above. Rather than allocating a free LBA we would generate a known unique ID.
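The idea of replacing LBA allocation with a generated ID might look roughly like the sketch below. Everything here is hypothetical: the `SouthStore` interface, the use of a random UUID as the unique ID, and the first-free allocation policy are illustrative assumptions, not a description of ZFS internals.

```python
import uuid


class SouthStore:
    """Hypothetical South: owns the free list and maps IDs to physical LBAs."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.location = {}     # block ID -> physical LBA
        self.media = {}        # physical LBA -> data

    def put(self, block_id, data):
        lba = self.free.pop(0)         # allocation happens here, not in North
        self.location[block_id] = lba
        self.media[lba] = data

    def get(self, block_id):
        return self.media[self.location[block_id]]

    def delete(self, block_id):
        lba = self.location.pop(block_id)
        del self.media[lba]
        self.free.append(lba)


def north_write(south, data):
    """North picks a unique ID instead of a free LBA and keeps it as the
    block pointer used for later reads."""
    block_id = uuid.uuid4().hex
    south.put(block_id, data)
    return block_id


south = SouthStore(num_blocks=4)
ptr = north_write(south, b"payload")
```

In this shape North never searches for free space, so only South pays the cost of free list management.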

Data Redundancy: North
Data Repair / Reconstruction: North
Data Integrity: North
Snapshots: North
Rate Limiting: North
Compression: North
Encryption: North
Allocation / Free List: South
Device Management: South

This would be relevant if the ZFS on ZFS approach was deemed of interest, but computing allocations on both North and South was determined to be too computationally costly.

Capacity

A volume has a "size" as presented to the guest OS, but the amount of storage consumed can vary significantly from that specified size.

Smaller than specified:

A volume may be thin-provisioned (i.e. there may be no reservation guaranteeing that space)
Blocks may be compressed, reducing the size of stored data
Blocks may be deduplicated, eliminating redundant data within or even between volumes

Larger than specified:

Data redundancy means there’s up to a 3x multiplier on volume space used
Snapshots can pin as much data as is stored in a volume
Per block metadata overhead (which we’ll need to minimize)
General metadata; e.g., used/free

Our goal is to make efficient use of the raw storage capacity available, but to prioritize the delivery timeline over optimizing storage use. To that end I propose that we try to maintain 3 copies of all user data and consider anything less than that a degraded state from which we need to recover.
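
To make the interaction of these factors concrete, here is a minimal sketch of the arithmetic. The function name, its parameters, and every number in the usage note are assumptions chosen for illustration. Deduplication and general (non-per-block) metadata are left out for brevity.

```python
def raw_bytes_consumed(logical_bytes, compression_ratio=1.0, copies=3,
                       snapshot_pinned_bytes=0, metadata_overhead=0.0):
    """Estimate the raw capacity consumed by one volume.

    logical_bytes: data actually written by the guest (with thin
        provisioning, unwritten space costs nothing).
    compression_ratio: stored size divided by logical size (0.5 means 2:1).
    copies: number of replicas kept of each block.
    snapshot_pinned_bytes: logical data referenced only by snapshots.
    metadata_overhead: per-block metadata as a fraction of stored data.
    """
    stored = (logical_bytes + snapshot_pinned_bytes) * compression_ratio
    stored *= 1.0 + metadata_overhead
    return stored * copies
```

For example, a volume with 40 GiB written, 10 GiB pinned by snapshots, 2:1 compression, and 3 copies gives raw_bytes_consumed(40, compression_ratio=0.5, snapshot_pinned_bytes=10) == 75.0 GiB of raw capacity, regardless of the size presented to the guest.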

Performance

Customers will provision volumes with IOPS limits. To that end, we want to make as many IOPS available as possible. Rather than setting specific, arbitrary IOPS goals for the system based on the raw capabilities of the SSDs we choose, I propose that we choose key performance criteria and build tests that exercise those criteria. We’ll design, test, and optimize the system against those metrics. We’ll use the performance we measure to assess the pool of IOPS available for customers to provision.

100% 4K random writes — this is a good proxy for a pessimal database workload
50/50 4K random read/write — intermix some reads
Provision time — from control plane API call to availability of a volume (a subset of overall VM provision time, another hot path)
100% 128K streaming writes
TBD: some off-the-shelf system benchmarks such as TPC-H (on a known VM/OS/FS)
TBD: in-guest VM microbenchmark on Windows and Linux
- For example, at Delphix our Windows users always used some off-the-shelf tool to evaluate whether the storage we were giving them was "fast"
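
The three synthetic I/O patterns in the list above could be described by a small generator along these lines. This is a sketch only: the workload names, the (op, offset, length) tuple shape, and the seed parameter are assumptions made for this example, and a real test rig would more likely drive an existing benchmarking tool.

```python
import random

KIB = 1024


def workload(name, volume_bytes, count, seed=0):
    """Yield (op, offset, length) tuples for one of the named workloads."""
    rng = random.Random(seed)
    if name == "randwrite-4k":      # 100% 4K random writes
        block, read_fraction, sequential = 4 * KIB, 0.0, False
    elif name == "randrw-4k":       # 50/50 4K random read/write
        block, read_fraction, sequential = 4 * KIB, 0.5, False
    elif name == "seqwrite-128k":   # 100% 128K streaming writes
        block, read_fraction, sequential = 128 * KIB, 0.0, True
    else:
        raise ValueError(f"unknown workload: {name}")

    blocks = volume_bytes // block
    for i in range(count):
        # Streaming writes walk the volume in order; the others pick any block.
        index = i % blocks if sequential else rng.randrange(blocks)
        op = "read" if rng.random() < read_fraction else "write"
        yield op, index * block, block
```

Fixing the seed keeps runs repeatable, which matters when the same workload is used to compare measurements across changes to the system.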

Future-Proofing

North / South interface

The VM <=> North <=> South interfaces are extremely important and may constrain our future ability to innovate in follow-on versions. A common failure of hyperconverged storage is to leave islands of capacity and IOPS unused and inaccessible. For example, VMware’s vSAN product looks most similar to the "Southern Volume Manager" architecture, with a minimum allocatable chunk of 255GB. In addition to this being quite large, it can mean that a given chunk can only deliver the IOPS of that one underlying device or the collection of devices on a single server.

While it may not be critical in version 1.0, we need to consider future customer demands where a large number of IOPS may be concentrated over a small number of GB (or vice versa). We need to keep this in mind when it comes to designing the various interfaces in particular to ensure we aren’t locked into fixed allocations.
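
The "islands" problem can be shown with a little arithmetic. The sketch below assumes, purely for illustration, that each fixed-size chunk is dedicated to one volume and lives on exactly one device. The device_iops ceiling and the function name are hypothetical.

```python
import math

CHUNK_GB = 255  # minimum allocatable chunk from the example above


def fixed_chunk_allocation(size_gb, iops, device_iops):
    """Return (chunks, stranded_gb, iops_deliverable) for a fixed-chunk allocator.

    device_iops is a hypothetical per-device ceiling; each chunk is assumed
    to be dedicated to one volume and to live on exactly one device.
    """
    chunks = math.ceil(size_gb / CHUNK_GB)
    stranded_gb = chunks * CHUNK_GB - size_gb
    iops_deliverable = chunks * device_iops >= iops
    return chunks, stranded_gb, iops_deliverable
```

Under those assumptions, fixed_chunk_allocation(10, 200_000, 100_000) returns (1, 245, False): a small, hot volume strands 245 GB of capacity and still cannot reach its IOPS target, which is the case a finer-grained interface would need to handle.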

Metadata v. Data QoS

Customer data may have very different size and performance attributes compared with the metadata we store to track data, verify checksums, encrypt, track snapshots, etc. While version 1.0 will have a single storage medium (TLC NAND), storage technologies are (again) bifurcating to QLC NAND and pmem (e.g. Optane, zSSD). It would be prudent to anticipate how we might take advantage of those different technologies, and—in particular—how they might be applied to data and metadata.

Architecture experiments and outcomes

After a discussion of various architectures [meet2020_06], we decided to investigate two of the options above (the "Southern Volume Manager" and the "Northern Mux") while deferring the other two.

Southern Volume Manager

This was deemed to be the simplest of the options, but perhaps too simple. If viable, it would let us lean almost exclusively on the known quantity of ZFS while building relatively little and quite simple code in the data path. In particular the only novel software required of this approach would be a network block device to expose portions of SSDs on South to North where they would be consumed by ZFS. The prevailing concern is that this would be a significant departure from the design center for ZFS. ZFS was designed for spinning disks and adapted, nominally, for SSDs and SAN devices. The types of failures and aberrations we expect over our internal network (robust as it is) will differ from those of disks. Nodes (and therefore disks) will reboot and disappear unexpectedly; we weren’t sure how well ZFS would handle this or how challenging it would be to invest in improving ZFS in this regard.

We simulated this configuration in AWS with ZFS (on OmniOS), instances with locally attached disks, and iSCSI as the block protocol between North and South nodes. While there were likely many contributing factors, the performance of this arrangement was quite poor (roughly 10% of the performance of the instance-attached SSDs). In addition (and as anticipated) there was little resilience to failures, e.g. when nodes were taken offline. Some of this was attributable to iSCSI, but after exploring the code and discussing with Matt Ahrens, it seemed that it would be quite challenging to incorporate that resilience into ZFS (more on this below).

While this architecture may have been the simplest option, it proved to be too simple. Further, it more closely resembles an extended local-storage system than a clustered one.

Northern Mux

This was an attractive option since, of the options under consideration, it most closely resembles other distributed storage systems (such as Ceph) while being significantly simpler both in ambition and design. An apprehension here was that it would involve quite a bit of delicate code in the data path that would need to be resilient to all types of failures and pathologies. In addition, we weren’t sure if we could deliver adequate performance.

To explore this we developed a simulator in which we can develop, refine, and test the algorithms that dictate the handling of all failure and success cases. The early results of this simulator have been promising.
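
To give a flavor of the kind of logic such a simulator exercises, here is a toy model of a mux that fans writes out to several South nodes and repairs stale copies after a node returns. It is not the simulator described above. The class names, the stale-block bookkeeping, and the failure model (a node is simply online or offline) are all assumptions made for this sketch.

```python
class SouthNode:
    """Hypothetical in-memory stand-in for one South storage node."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.blocks = {}

    def write(self, lba, data):
        if not self.online:
            raise ConnectionError(f"{self.name} is offline")
        self.blocks[lba] = data

    def read(self, lba):
        if not self.online:
            raise ConnectionError(f"{self.name} is offline")
        return self.blocks[lba]


class NorthMux:
    """Fans each write out to every replica and tracks which copies are stale."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.stale = {node.name: set() for node in nodes}

    def write(self, lba, data):
        acks = 0
        for node in self.nodes:
            try:
                node.write(lba, data)
                self.stale[node.name].discard(lba)
                acks += 1
            except ConnectionError:
                self.stale[node.name].add(lba)  # this copy needs repair later
        if acks == 0:
            raise IOError("no replica accepted the write")
        return acks  # fewer than len(self.nodes) means a degraded state

    def read(self, lba):
        for node in self.nodes:
            if lba in self.stale[node.name]:
                continue  # never serve a copy known to be out of date
            try:
                return node.read(lba)
            except ConnectionError:
                continue
        raise IOError("no up-to-date replica available")

    def repair(self):
        """Copy stale blocks from an up-to-date replica to nodes that are reachable."""
        for node in self.nodes:
            for lba in sorted(self.stale[node.name]):
                try:
                    node.write(lba, self.read(lba))
                except ConnectionError:
                    break  # still offline; try again on a later pass
                self.stale[node.name].discard(lba)
```

With three nodes, taking one offline causes write to return 2 rather than 3. Bringing it back and calling repair restores the third copy. Even this toy version shows where the delicate cases hide: a node failing mid-repair, or the mux itself restarting and losing its record of stale blocks.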

Conclusions

While we could consider investing in ZFS to have it accommodate network-mirror-aware storage nodes, it seems like a similar effort to developing the Northern Mux albeit with a much more complex surround of ZFS making it harder to build, test, and maintain. Combined with the poor performance of the Southern Volume Manager, the Northern Mux looks very attractive: the work will be no more complex (and likely much less complex) with more latitude over the design while conforming more closely with the architectures of modern distributed storage systems.

We consulted with Matt Ahrens on these conclusions to make sure we weren’t making a rash judgement with regard to the capabilities of ZFS. He agreed with the selection of experiments and the conclusions.

The steps from here will be to:

Continue development of the simulator to refine the algorithms of the mux
Build large-scale test rigs in AWS to be able to do functional, failure, and performance testing in advance of our own hardware
Develop test and stress workloads to run in AWS and on Oxide hardware
Develop the storage service, using or sharing code with the simulator

References

[meet2020_06] Google Meet recording from June 2020
[jamboard] All drawings in Jamboard
