Crucible’s raw file format
Crucible stores customer data in raw files on disk, as of crucible#991.
It stores both block data (i.e. actual customer data) and per-block context data (checksums and a nonce + tag for encrypted blocks).
The current raw file format prioritizes block-level alignment: the i'th 4 KiB block is located at offset i × 4096 in the file. That block’s context data is located at the end of the file, along with file-level metadata:
[ --- block --- | --- block --- | ... ] [ ctx a0 | ctx a1 | ... ] [ ctx b0 | ctx b1 | ... ] [ metadata ]
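For concreteness, here is a rough sketch of the offset math in the current layout (the function names are illustrative, and the exact placement of the context arrays is simplified):

const BLOCK_SIZE: u64 = 4096;
const CONTEXT_SLOT_SIZE: u64 = 48; // one context slot; two slots per block

/// Offset of the i'th block: blocks are stored back-to-back at the start
/// of the file.
fn block_offset(i: u64) -> u64 {
    i * BLOCK_SIZE
}

/// Offset of block i's context in slot array `array` (0 = A, 1 = B),
/// assuming the context arrays begin immediately after the block data.
fn context_slot_offset(i: u64, array: u64, block_count: u64) -> u64 {
    let ctx_start = block_count * BLOCK_SIZE;
    ctx_start + (array * block_count + i) * CONTEXT_SLOT_SIZE
}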
There are significant downsides to this layout.
Per-block contexts are stored in two different arrays (a / b in the diagram above) because we need to be robust against crashing between writing the context and writing new block data; we ping-pong between context slots to ensure that we always have a valid context for the current block data. This doubles the per-block context size, from 48 to 96 bytes (i.e. 18.75% overhead for regions with a 512-byte block size).
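To make the recovery rule concrete, here is a minimal sketch (hypothetical names, not the actual Downstairs code) of how dirty startup could pick the valid slot using the per-slot on_disk_hash described later in § Metadata reduction:

// Hypothetical sketch: a trimmed-down context slot carrying only the
// on-disk hash used for slot selection.
struct SlotContext {
    on_disk_hash: u64,
    // ... checksum / nonce / tag elided
}

/// On dirty startup, the valid slot is whichever one recorded a hash that
/// matches the block data currently on disk; `block_hash_on_disk` stands in
/// for re-hashing that block.
fn pick_valid_slot(
    block_hash_on_disk: u64,
    slot_a: Option<SlotContext>,
    slot_b: Option<SlotContext>,
) -> Option<SlotContext> {
    [slot_a, slot_b]
        .into_iter()
        .flatten()
        .find(|ctx| ctx.on_disk_hash == block_hash_on_disk)
}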
It’s possible for the context slot arrays to get "fragmented" (some blocks using array A, and some using array B), which means context reads have to be split between the two arrays. We have extra (and subtle) logic to defragment the slots if this effect becomes significant.
Finally, the separation between block and context means that they are located in different ZFS records (by default, 128 KiB). When doing reads and writes, we must load both the block data (e.g. 4 KiB in a 128 KiB record) and the context data (48 bytes, also in a 128 KiB record).
This is extremely noticeable in the flamegraph, e.g. for 4K random reads:
Notice that read_into calls _pread twice (once in get_block_contexts), and it takes about the same amount of time for each call!
A modest proposal
Prioritizing block-level alignment optimizes for the wrong thing: it’s a good idea if we care about mechanical sympathy with the actual disk, but we’ve got the entirety of ZFS between Crucible and the SSD.
Instead, we should rearrange the raw file format to interleave block and context data, to maximize mechanical sympathy with ZFS:
[ --- block --- | ctx | --- block --- | ctx | ... ] [ --- block --- | ctx | --- block --- | ctx | ... ] [ --- block --- | ctx | --- block --- | ctx | ... ] [ ... ] [ metadata ]
Designing for crash consistency
If we only have a single context per block, we must be absolutely sure that the block and context land together (or not at all). At the ZFS level, this means that the block and context write must always be in the same transaction group (txg).
Transaction splitting in ZFS
ZFS does not currently offer fine-grained control over transaction groups. The most obvious way to be correct is to ensure that the block and context are both written in a single transaction, which is tautologically guaranteed to not be split between two transaction groups.
Under what circumstances can a pwrite or pwritev call be split into multiple ZFS transactions? Such splitting must happen; in the limit, we wouldn’t expect a multi-GB pwrite call to be a single transaction!
To find out, let’s look into the ZFS implementation (usr/).
Both pwrite and pwritev end up calling zfs_write. We are writing to an existing file and are not changing the file size, so the relevant code (in zfs_vnops.c) is the following, where n is bytes remaining:
while (n > 0) {
    woff = uio->uio_loffset;

    /* ...lots of stuff elided */

    /*
     * Start a transaction.
     */
    tx = dmu_tx_create(zfsvfs->z_os);
    dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
    dmu_tx_hold_write(tx, zp->z_id, woff, MIN(n, max_blksz));
    zfs_sa_upgrade_txholds(tx, zp);
    error = dmu_tx_assign(tx, TXG_WAIT);
    if (error) {
        dmu_tx_abort(tx);
        if (abuf != NULL)
            dmu_return_arcbuf(abuf);
        break;
    }

    /* ...more stuff elided */

    /*
     * XXX - should we really limit each write to z_max_blksz?
     * Perhaps we should use SPA_MAXBLOCKSIZE chunks?
     */
    nbytes = MIN(n, max_blksz - P2PHASE(woff, max_blksz));

    if (abuf == NULL) {
        tx_bytes = uio->uio_resid;
        error = dmu_write_uio_dbuf(sa_get_db(zp->z_sa_hdl),
            uio, nbytes, tx);
        tx_bytes -= uio->uio_resid;
    } else {
        /* more stuff elided ... */
    }
nbytes is limited to max_blksz size and alignment. In other words, writes are split into transactions at recordsize (max_blksz) boundaries.
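A rough model of that chunking (plain Rust, not ZFS code), mirroring the nbytes = MIN(n, max_blksz - P2PHASE(woff, max_blksz)) calculation above:

/// Split a write at `woff` of length `n` into (offset, length) chunks, one
/// per ZFS transaction, with chunk boundaries on max_blksz boundaries.
fn transaction_chunks(mut woff: u64, mut n: u64, max_blksz: u64) -> Vec<(u64, u64)> {
    let mut chunks = Vec::new();
    while n > 0 {
        // Same arithmetic as MIN(n, max_blksz - P2PHASE(woff, max_blksz))
        let nbytes = n.min(max_blksz - (woff % max_blksz));
        chunks.push((woff, nbytes));
        woff += nbytes;
        n -= nbytes;
    }
    chunks
}

// e.g. a 12 KiB write at offset 126 KiB with a 128 KiB recordsize splits
// into two transactions: (126 KiB, 2 KiB) and (128 KiB, 10 KiB).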
dmu_write_uio_dbuf fetches data from the file, then calls uiomove to copy over our incoming data (and marks the DMU buffer as dirty). It takes the transaction tx as an argument, so the nbytes passed into dmu_write_uio_dbuf should land in a single transaction.
Aligning our data to ZFS records
Based on the ZFS source, if the pwrite call is bounded to a single ZFS record, it’s guaranteed to only emit one transaction: nbytes will match our actual data size, and dmu_write_uio_dbuf will only be called once.
Requiring that block and context are in the same ZFS record means that we may need to add padding between sets of blocks, i.e. each row in this layout fills a ZFS record:
[ --- block --- | ctx | --- block --- | ctx | ... | padding ] [ --- block --- | ctx | --- block --- | ctx | ... | padding ] [ --- block --- | ctx | --- block --- | ctx | ... | padding ] [ ... ] [ metadata ]
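For example (a sketch with illustrative constants: 4 KiB blocks, the 32-byte packed context proposed below, and a 128 KiB recordsize), each record holds 31 blocks and ends with 3,104 bytes of padding:

const RECORD_SIZE: u64 = 128 * 1024; // ZFS recordsize, stored in file metadata
const BLOCK_SIZE: u64 = 4096;
const CONTEXT_SIZE: u64 = 32; // packed per-block context, proposed below

/// (block, context) pairs per ZFS record: 131072 / 4128 = 31
fn blocks_per_record() -> u64 {
    RECORD_SIZE / (BLOCK_SIZE + CONTEXT_SIZE)
}

/// Padding at the end of each record: 131072 - 31 × 4128 = 3104 bytes
fn record_padding() -> u64 {
    RECORD_SIZE - blocks_per_record() * (BLOCK_SIZE + CONTEXT_SIZE)
}

/// File offset of block `i` in the packed layout: skip whole records, then
/// whole (block, context) pairs within the final record.
fn packed_block_offset(i: u64) -> u64 {
    let record = i / blocks_per_record();
    let within = i % blocks_per_record();
    record * RECORD_SIZE + within * (BLOCK_SIZE + CONTEXT_SIZE)
}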
I anticipate storing the recordsize as part of the file’s metadata and checking it when opening the file. If a read-write volume’s recordsize has changed (which should be rare), migration is relatively straightforward.
The Upstairs does not need to be aware of the ZFS recordsize. It can send a tightly-packed data array, and the Downstairs can split it into recordsize-aligned chunks.
Alternative: add ZFS APIs for fine-grained control
As an alternative, we could add new APIs to ZFS for writes that are guaranteed to land in a single transaction / transaction group.
This is venturing beyond the realm of POSIX, and would look like new syscalls (or ioctls) that hook into the DMU layer directly.
I’m not exactly sure what this would look like; suggestions are welcome!
Metadata reduction
Currently, metadata is written as an Option<OnDiskDownstairsBlockContext>, serialized using bincode. Here are the relevant data structures:
struct OnDiskDownstairsBlockContext {
    block_context: BlockContext,
    on_disk_hash: u64,
}

pub struct BlockContext {
    pub hash: u64,
    pub encryption_context: Option<EncryptionContext>,
}

pub struct EncryptionContext {
    pub nonce: [u8; 12],
    pub tag: [u8; 16],
}
This serializes to 46 bytes, which we round up to 48 bytes (BLOCK_CONTEXT_SLOT_SIZE_BYTES).
In a single-context-slot world, on_disk_hash is not necessary: we use it to decide which context slot is valid on dirty startup, but that would no longer be in question.
Additionally, storing both hash and encryption_context is unnecessary. If we have an encryption_context, then the upstairs needs to decrypt against it, and the decryption process will tell us whether the data is valid; checking the hash is just wasted work.
I propose reducing our context data to the following structure:
enum PackedBlockContext {
    None,
    Unencrypted(u64),
    Encrypted(EncryptionContext),
}
Using bincode, this serializes to 32 bytes (an auspicious value). This brings our per-block overhead for 512-byte blocks from 18.75% (96 / 512) down to 6.25% (32 / 512).
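As a sanity check on those sizes, here is a standalone sketch (assuming serde derives and bincode 1.x’s default configuration, which may not match Crucible’s exact setup):

use serde::Serialize;

#[derive(Serialize)]
struct EncryptionContext {
    nonce: [u8; 12],
    tag: [u8; 16],
}

#[derive(Serialize)]
struct BlockContext {
    hash: u64,
    encryption_context: Option<EncryptionContext>,
}

#[derive(Serialize)]
struct OnDiskDownstairsBlockContext {
    block_context: BlockContext,
    on_disk_hash: u64,
}

#[derive(Serialize)]
enum PackedBlockContext {
    None,
    Unencrypted(u64),
    Encrypted(EncryptionContext),
}

fn main() {
    // Current format: Option tag (1) + hash (8) + Option tag (1)
    //   + nonce (12) + tag (16) + on_disk_hash (8) = 46 bytes
    let old = Some(OnDiskDownstairsBlockContext {
        block_context: BlockContext {
            hash: 0,
            encryption_context: Some(EncryptionContext { nonce: [0; 12], tag: [0; 16] }),
        },
        on_disk_hash: 0,
    });
    assert_eq!(bincode::serialized_size(&old).unwrap(), 46);

    // Proposed format: enum tag (4, a u32) + nonce (12) + tag (16) = 32 bytes
    let new = PackedBlockContext::Encrypted(EncryptionContext { nonce: [0; 12], tag: [0; 16] });
    assert_eq!(bincode::serialized_size(&new).unwrap(), 32);
}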
Removing on_disk_hash means that the Downstairs can no longer check block integrity (for encrypted data). We were already only doing that check if the file was dirty, and are trusting many other properties of ZFS, so I think that trusting ZFS’s checksums is reasonable.
Message format compatibility
For encrypted blocks, the new format does not store a hash. The current ReadResponseHeader implementation expects a hash for every block, so we would either need to change the header or reply with a dummy hash. I would recommend changing the header: we’re still updating the entire rack simultaneously, so we don’t need to worry about message compatibility.
Optional: pre-packing messages
Right now, we send writes with the Message::Write variant.
enum Message {
    // ...
    Write {
        header: WriteHeader,
        data: bytes::Bytes,
    },
}

pub struct WriteHeader {
    pub upstairs_id: Uuid,
    pub session_id: Uuid,
    pub job_id: JobId,
    pub dependencies: Vec<JobId>,
    pub blocks: Vec<WriteBlockMetadata>,
}

pub struct WriteBlockMetadata {
    pub eid: ExtentId,
    pub offset: Block,
    pub block_context: BlockContext,
}
Note that the block contexts are separate from the raw data.
We can write interleaved block data and contexts using pwritev, e.g. something like
pwritev(&[&data[0..4096], &context[0], &data[4096..8192], &context[1], ...]);
It may be more efficient to use a single pwrite instead, which would require the block and context data to be contiguous in memory. Rearranging data in the Crucible downstairs would add a full copy. We’ve spent the past few months removing memcpy from the flamegraph, so that would be unfortunate.
Instead, we could add a new Message variant:
enum Message {
    // ...
    PackedWrite {
        header: PackedWriteHeader,
        data: bytes::Bytes,
    },
}

pub struct PackedWriteHeader {
    pub upstairs_id: Uuid,
    pub session_id: Uuid,
    pub job_id: JobId,
    pub dependencies: Vec<JobId>,
    pub eid: ExtentId,
    pub offset: Block,
    pub checksum: u64,
}
The data member now contains packed (data, context) tuples, with bincode serialization already done in the Upstairs.
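A rough sketch of that packing on the Upstairs side (hypothetical helper; it assumes 4 KiB blocks and the 32-byte context slot from § Metadata reduction, padding each serialized context to that fixed size):

use bytes::{BufMut, BytesMut};
use serde::Serialize;

const BLOCK_SIZE: usize = 4096;
const CONTEXT_SIZE: usize = 32; // fixed slot size for a serialized context

#[derive(Serialize)]
struct EncryptionContext {
    nonce: [u8; 12],
    tag: [u8; 16],
}

#[derive(Serialize)]
enum PackedBlockContext {
    None,
    Unencrypted(u64),
    Encrypted(EncryptionContext),
}

/// Pack (block, context) pairs into one contiguous buffer, so the
/// Downstairs can carve it into recordsize-aligned writes without copying.
fn pack_write(blocks: &[[u8; BLOCK_SIZE]], contexts: &[PackedBlockContext]) -> BytesMut {
    assert_eq!(blocks.len(), contexts.len());
    let mut out = BytesMut::with_capacity(blocks.len() * (BLOCK_SIZE + CONTEXT_SIZE));
    for (block, ctx) in blocks.iter().zip(contexts) {
        out.put_slice(block);
        let ctx_bytes = bincode::serialize(ctx).expect("context serialization");
        out.put_slice(&ctx_bytes);
        out.put_bytes(0, CONTEXT_SIZE - ctx_bytes.len()); // pad to the fixed slot size
    }
    out
}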
The new checksum member is a digest of data to protect against corruption in transit, because I’m also proposing to remove the now-unnecessary on-disk checksums (see § Metadata reduction).
To avoid unnecessary copies during read operations, we’d also introduce an equivalent Message::PackedReadResponse. In-transit corruption for reads is detected with either the checksum (for unencrypted data) or decryption.
Compatibility and migration
It is likely that we’ll need to support old (raw) and new (packed) extents simultaneously, if only because we can’t change read-only snapshots.
Introducing new messages for packed extents means that the Upstairs will send and receive different messages depending on extent type, which is awkward.
As an alternative, we could use the new (packed) message format everywhere, and have the Downstairs translate when necessary. That works fine for reads (where we simply discard superfluous data), but is awkward for writes: for raw extents, the Downstairs would have to compute on_disk_hash, doing 3× the work (because it would be computed on each of the three Downstairs instead of once in the Upstairs).
It’s possible to migrate between the raw and packed on-disk formats, if the region is writeable. This suggests a best-of-both-worlds strategy:
- Read-only regions are left as raw extents. We can automatically translate their read results into Message::PackedReadResponse with no overhead.
- Read-write regions are migrated to packed extents (either at startup or lazily upon the first write). Once migrated, they can use Message::PackedWrite with no additional overhead.

(At this point, we could remove Message::ReadResponse and Message::Write, or reuse their names for PackedReadResponse and PackedWrite.)
Preliminary results
I tested an early version of a packed layout in crucible#1270, and saw significant performance improvements.
Doing random reads to a completely full 128 GiB disk, performance is improved for every read size. For small reads, it’s a 40% improvement!
Read size (blocks) | V1 (raw) | V2 (packed) | Speedup |
---|---|---|---|
1 | 12.7 MiB/sec ± 328.8 KiB/sec | 17.7 MiB/sec ± 269.9 KiB/sec | 1.39× |
2 | 24.6 MiB/sec ± 764.4 KiB/sec | 34 MiB/sec ± 334.8 KiB/sec | 1.38× |
4 | 47.1 MiB/sec ± 1 MiB/sec | 65.4 MiB/sec ± 870.7 KiB/sec | 1.39× |
8 | 81 MiB/sec ± 658.1 KiB/sec | 112.1 MiB/sec ± 1.8 MiB/sec | 1.38× |
16 | 135.8 MiB/sec ± 2.7 MiB/sec | 192.5 MiB/sec ± 3.8 MiB/sec | 1.42× |
32 | 220.5 MiB/sec ± 5.1 MiB/sec | 291.3 MiB/sec ± 6.9 MiB/sec | 1.32× |
64 | 290.2 MiB/sec ± 8 MiB/sec | 368.8 MiB/sec ± 12.4 MiB/sec | 1.27× |
128 | 455.5 MiB/sec ± 5.2 MiB/sec | 549.9 MiB/sec ± 7.6 MiB/sec | 1.21× |
256 | 580.6 MiB/sec ± 19.2 MiB/sec | 746.2 MiB/sec ± 16.5 MiB/sec | 1.29× |
384 | 580.6 MiB/sec ± 36.2 MiB/sec | 680.1 MiB/sec ± 15.8 MiB/sec | 1.17× |
512 | 652 MiB/sec ± 47.6 MiB/sec | 708.8 MiB/sec ± 29.6 MiB/sec | 1.09× |
640 | 641.5 MiB/sec ± 13.4 MiB/sec | 720.8 MiB/sec ± 26 MiB/sec | 1.12× |
768 | 666.2 MiB/sec ± 13.7 MiB/sec | 756 MiB/sec ± 27.1 MiB/sec | 1.13× |
896 | 673.2 MiB/sec ± 14.1 MiB/sec | 746.7 MiB/sec ± 21.9 MiB/sec | 1.11× |
1024 | 707.7 MiB/sec ± 32.3 MiB/sec | 764.8 MiB/sec ± 24.9 MiB/sec | 1.08× |
This implementation uses the existing Message variant, so it writes data to the file with pwritev (instead of pwrite). Using pwrite may be more performant, but it’s hard to tell without further benchmarking.