RFD 490
Packed Crucible extents

Crucible’s raw file format

Crucible stores customer data in raw files on disk, as of crucible#991.

It stores both block data (i.e. actual customer data) and per-block context data (checksums and a nonce + tag for encrypted blocks).

The current raw file format prioritizes block-level alignment: the i'th 4 KiB block is located at offset i × 4096 in the file. That block’s context data is located at the end of the file, along with file-level metadata:

[ --- block --- | --- block --- | ... ]
[ ctx a0 | ctx a1 | ... ]
[ ctx b0 | ctx b1 | ... ]
[ metadata ]
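To make this concrete, here is a minimal sketch of where block i and its context slots live in the current format (my own illustration, not the actual Crucible code; the constant names are made up), assuming 4 KiB blocks and 48-byte context slots:

const BLOCK_SIZE: u64 = 4096;
const CONTEXT_SLOT_SIZE: u64 = 48;

/// Offset of the i'th block's data.
fn block_offset(block: u64) -> u64 {
    block * BLOCK_SIZE
}

/// Offset of the i'th block's context slot in array A (array = 0) or
/// array B (array = 1); both arrays live after all of the block data.
fn context_slot_offset(block: u64, array: u64, block_count: u64) -> u64 {
    block_count * BLOCK_SIZE                       // skip all block data
        + array * block_count * CONTEXT_SLOT_SIZE  // skip array A if using B
        + block * CONTEXT_SLOT_SIZE
}

Note that a block and its context are always far apart in the file; this matters below.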

There are significant downsides to this layout.

Per-block contexts are stored in two different arrays (a / b in the diagram above) because we need to be robust against crashing between writing the context and writing new block data; we ping-pong between context slots to ensure that we always have a valid context for the current block data. This doubles the per-block context size, from 48 to 96 bytes (i.e. 18.75% overhead for regions with a 512-byte block size).
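As a rough illustration of the ping-pong scheme (a conceptual sketch, not the actual Downstairs code):

// Each block has two context slots; a write always targets the slot that is
// *not* currently active, so a crash between the context write and the block
// write leaves the previously-active slot still describing the old data.
#[derive(Clone, Copy)]
enum ContextSlot {
    A,
    B,
}

fn write_block(
    active: &mut ContextSlot,
    write_context: impl Fn(ContextSlot),
    write_data: impl Fn(),
) {
    // 1. write the new context into the inactive slot
    let target = match *active {
        ContextSlot::A => ContextSlot::B,
        ContextSlot::B => ContextSlot::A,
    };
    write_context(target);
    // 2. overwrite the block data; a crash here leaves `active` still valid
    write_data();
    // 3. only now does the new slot become active for this block
    *active = target;
}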

It’s possible for context slot arrays to get "fragmented" (some blocks using array A, and some using array B), which means reads have to be split between the two arrays. We have extra logic to defragment if this effect becomes significant, and that logic is subtle.

Finally, the separation between block and context means that they are located in different ZFS records (by default, 128 KiB). When doing reads and writes, we must load both the block data (e.g. 4 KiB in a 128 KiB record) and the context data (48 bytes, also in a 128 KiB record).

This is extremely noticeable in the flamegraph, e.g. for 4K random reads:

[flamegraph: 4K random read profile]

Notice that read_into calls _pread twice (once in get_block_contexts), and it takes about the same amount of time for each call!

A modest proposal

Prioritizing block-level alignment optimizes for the wrong thing: it’s a good idea if we care about mechanical sympathy with the actual disk, but we’ve got the entirety of ZFS between Crucible and the SSD.

Instead, we should rearrange the raw file format to interleave block and context data, to maximize mechanical sympathy with ZFS:

[ --- block --- | ctx | --- block --- | ctx | ... ]
[ --- block --- | ctx | --- block --- | ctx | ... ]
[ --- block --- | ctx | --- block --- | ctx | ... ]
[ ... ]
[ metadata ]
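The payoff is that a block and its context can be fetched with a single contiguous read. A minimal sketch (illustrative only; it ignores the record padding discussed below, and the helper is not a real Crucible function):

use std::fs::File;
use std::os::unix::fs::FileExt;

const BLOCK_SIZE: usize = 4096;
const CONTEXT_SIZE: usize = 48; // 48 bytes today; 32 after the reduction proposed below

fn read_block_and_context(f: &File, block: u64) -> std::io::Result<(Vec<u8>, Vec<u8>)> {
    let offset = block * (BLOCK_SIZE + CONTEXT_SIZE) as u64;
    let mut buf = vec![0u8; BLOCK_SIZE + CONTEXT_SIZE];
    // one pread, instead of one for block data and another for the context
    f.read_exact_at(&mut buf, offset)?;
    let ctx = buf.split_off(BLOCK_SIZE);
    Ok((buf, ctx))
}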

Designing for crash consistency

If we only have a single context per block, we must be absolutely sure that the block and context land together (or not at all). At the ZFS level, this means that the block and context write must always be in the same transaction group (txg).

Transaction splitting in ZFS

ZFS does not currently offer fine-grained control over transaction groups. The most obvious way to be correct is to ensure that the block and context are both written in a single transaction, which is tautologically guaranteed to not be split between two transaction groups.

Under what circumstances can a pwrite or pwritev call be split into multiple ZFS transactions? Such splitting must happen; in the limit, we wouldn’t expect a multi-GB pwrite call to be a single transaction!

To find out, let’s look into the ZFS implementation (usr/src/uts/common/fs/zfs).

Both pwrite and pwritev end up calling zfs_write. We are writing to an existing file and are not changing the file size, so the relevant code (in zfs_vnops.c) is the following, where n is bytes remaining:

    while (n > 0) {
        woff = uio->uio_loffset;
        /* ...lots of stuff elided */

        /*
         * Start a transaction.
         */
        tx = dmu_tx_create(zfsvfs->z_os);
        dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
        dmu_tx_hold_write(tx, zp->z_id, woff, MIN(n, max_blksz));
        zfs_sa_upgrade_txholds(tx, zp);
        error = dmu_tx_assign(tx, TXG_WAIT);
        if (error) {
            dmu_tx_abort(tx);
            if (abuf != NULL)
                dmu_return_arcbuf(abuf);
            break;
        }

        /* ...more stuff elided */

        /*
         * XXX - should we really limit each write to z_max_blksz?
         * Perhaps we should use SPA_MAXBLOCKSIZE chunks?
         */
        nbytes = MIN(n, max_blksz - P2PHASE(woff, max_blksz));

        if (abuf == NULL) {
            tx_bytes = uio->uio_resid;
            error = dmu_write_uio_dbuf(sa_get_db(zp->z_sa_hdl),
                uio, nbytes, tx);
            tx_bytes -= uio->uio_resid;
        } else {
            /* more stuff elided ... */
    }

nbytes is limited to max_blksz in both size and alignment. In other words, writes are split into transactions at recordsize (max_blksz) boundaries.
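For example (my own sketch of that arithmetic, not ZFS code): with a 128 KiB recordsize, a 4,128-byte write that starts 96 bytes before a record boundary is split into two transactions of 96 and 4,032 bytes.

// Mirrors nbytes = MIN(n, max_blksz - P2PHASE(woff, max_blksz)) from the loop above.
fn tx_chunks(mut offset: u64, mut len: u64, max_blksz: u64) -> Vec<(u64, u64)> {
    let mut chunks = Vec::new();
    while len > 0 {
        // bytes remaining until the next recordsize boundary
        let nbytes = len.min(max_blksz - offset % max_blksz);
        chunks.push((offset, nbytes));
        offset += nbytes;
        len -= nbytes;
    }
    chunks
}

// tx_chunks(131072 - 96, 4128, 131072) => [(130976, 96), (131072, 4032)]
// i.e. a block + context pair straddling a record boundary lands in two transactions.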

dmu_write_uio_dbuf fetches data from the file, then calls uiomove to copy over our incoming data (and marks the DMU buffer as dirty). It takes the transaction tx as an argument, so the nbytes passed into dmu_write_uio_dbuf should land in a single transaction.

Warning
Are there other parts of the ZFS pipeline that could split the transaction, further down the line?

Aligning our data to ZFS records

Based on the ZFS source, if the pwrite call is bounded to a single ZFS record, it’s guaranteed to only emit one transaction: nbytes will match our actual data size, and dmu_write_uio_dbuf will only be called once.

Requiring that block and context are in the same ZFS record means that we may need to add padding between sets of blocks, i.e. each row in this layout fills a ZFS record:

[ --- block --- | ctx | --- block --- | ctx | ... | padding ]
[ --- block --- | ctx | --- block --- | ctx | ... | padding ]
[ --- block --- | ctx | --- block --- | ctx | ... | padding ]
[ ... ]
[ metadata ]
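To put numbers on the padding (a worked example of my own, assuming 4 KiB blocks, the 32-byte contexts proposed below, and the default 128 KiB recordsize):

const RECORD_SIZE: u64 = 128 * 1024;
const BLOCK_SIZE: u64 = 4096;
const CONTEXT_SIZE: u64 = 32;

// 131072 / 4128 = 31 block + context pairs per record...
const BLOCKS_PER_RECORD: u64 = RECORD_SIZE / (BLOCK_SIZE + CONTEXT_SIZE);

// ...leaving 131072 - 31 * 4128 = 3104 bytes of padding per record (~2.4%).
const PADDING_PER_RECORD: u64 =
    RECORD_SIZE - BLOCKS_PER_RECORD * (BLOCK_SIZE + CONTEXT_SIZE);

// The i'th block's data then starts at
//   (i / BLOCKS_PER_RECORD) * RECORD_SIZE
//     + (i % BLOCKS_PER_RECORD) * (BLOCK_SIZE + CONTEXT_SIZE)
// with its context immediately after.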

I anticipate storing the recordsize as part of the file’s metadata and checking it when opening the file. If a read-write volume’s recordsize has changed (which should be rare), migration is relatively straightforward.

The Upstairs does not need to be aware of the ZFS recordsize. It can send a tightly-packed data array, and the Downstairs can split it into recordsize-aligned chunks.

Alternative: add ZFS APIs for fine-grained control

As an alternative, we could add new APIs to ZFS for writes that are guaranteed to land in a single transaction / transaction group.

This is venturing beyond the realm of POSIX, and would look like new syscalls (or ioctl's) that hook into the DMU layer directly.

I’m not exactly sure what this would look like; suggestions are welcome!

Metadata reduction

Currently, metadata is written as an Option<OnDiskDownstairsBlockContext>, serialized using bincode. Here are the relevant data structures:

struct OnDiskDownstairsBlockContext {
    block_context: BlockContext,
    on_disk_hash: u64,
}

pub struct BlockContext {
    pub hash: u64,
    pub encryption_context: Option<EncryptionContext>,
}

pub struct EncryptionContext {
    pub nonce: [u8; 12],
    pub tag: [u8; 16],
}

This serializes to 46 bytes, which we round up to 48 bytes (BLOCK_CONTEXT_SLOT_SIZE_BYTES).
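For reference, here is my reading of where those 46 bytes come from under bincode’s default encoding:

// Option tag (Some)                  1 byte
// BlockContext::hash (u64)           8 bytes
// Option<EncryptionContext> tag      1 byte
// EncryptionContext::nonce           12 bytes
// EncryptionContext::tag             16 bytes
// on_disk_hash (u64)                 8 bytes
//                                   --------
//                                    46 bytes, rounded up to 48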

In a single-context-slot world, on_disk_hash is not necessary: we use it to decide which context slot is valid on dirty startup, but that would no longer be in question.

Additionally, storing both hash and encryption_context is unnecessary. If we have an encryption_context, then the upstairs needs to decrypt against it, and the decryption process will tell us whether the data is valid; checking the hash is just wasted work.

I propose reducing our context data to the following structure:

enum PackedBlockContext {
    None,
    Unencrypted(u64),
    Encrypted(EncryptionContext),
}

Using bincode, this serializes to 32 bytes (an auspicious value). This brings our per-block overhead for 512-byte blocks from 18.75% down to 6.25%.
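The 32 bytes come from the largest variant (again, my reading of bincode’s default encoding; the smaller variants would presumably still occupy a full 32-byte slot so that contexts stay at fixed offsets):

// enum variant tag (u32)             4 bytes
// EncryptionContext::nonce           12 bytes
// EncryptionContext::tag             16 bytes
//                                   --------
// PackedBlockContext::Encrypted      32 bytes
//
// Unencrypted(u64) serializes to 12 bytes and None to 4 bytes.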

Warning
Removing on_disk_hash means that the Downstairs can no longer check block integrity for encrypted data. We were already only doing that check if the file was dirty, and we already trust many other properties of ZFS, so I think trusting ZFS’s checksums here is reasonable.

Message format compatibility

For encrypted blocks, the new format does not store a hash. The current ReadResponseHeader implementation expects a hash for every block, so we would either need to change the header or reply with a dummy hash. I would recommend changing the header: we’re still updating the entire rack simultaneously, so we don’t need to worry about message compatibility.
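One possible shape for that change (hypothetical; the current ReadResponseHeader fields are not reproduced in this RFD) is to report a per-block context instead of a bare hash:

pub enum ReadBlockContext {
    // unwritten block
    Empty,
    // the hash is still sent and checked for unencrypted data
    Unencrypted { hash: u64 },
    // encrypted blocks carry no hash; validity comes from decryption
    Encrypted { ctx: EncryptionContext },
}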

Optional: pre-packing messages

Right now, we send writes with the Message::Write variant.

enum Message {
    // ...
    Write {
        header: WriteHeader,
        data: bytes::Bytes,
    },
}

pub struct WriteHeader {
    pub upstairs_id: Uuid,
    pub session_id: Uuid,
    pub job_id: JobId,
    pub dependencies: Vec<JobId>,
    pub blocks: Vec<WriteBlockMetadata>,
}

pub struct WriteBlockMetadata {
    pub eid: ExtentId,
    pub offset: Block,
    pub block_context: BlockContext,
}

Note that the block contexts are separate from the raw data.

We can write interleaved block data and contexts using pwritev, e.g. something like

pwritev(&[&data[0..4096], &context[0], &data[4096..8192], &context[1], ...]);

It may be more efficient to use a single pwrite instead, which would require the block and context data to be contiguous in memory. Rearranging data in the Crucible downstairs would add a full copy. We’ve spent the past few months removing memcpy from the flamegraph, so that would be unfortunate.

Instead, we could add a new Message variant:

enum Message {
    // ...
    PackedWrite {
        header: PackedWriteHeader,
        data: bytes::Bytes,
    },
}

pub struct PackedWriteHeader {
    pub upstairs_id: Uuid,
    pub session_id: Uuid,
    pub job_id: JobId,
    pub dependencies: Vec<JobId>,
    pub eid: ExtentId,
    pub offset: Block,
    pub checksum: u64,
}

The data member now contains packed (data, context) tuples, with bincode serialization already done in the Upstairs.
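As a sketch of what Upstairs-side packing could look like (illustrative only; pack_blocks is not a real Crucible function, and it assumes PackedBlockContext derives serde::Serialize):

fn pack_blocks(
    blocks: &[(Vec<u8>, PackedBlockContext)],
    block_size: usize,
) -> bytes::Bytes {
    const CONTEXT_SIZE: usize = 32;
    let mut out = Vec::with_capacity(blocks.len() * (block_size + CONTEXT_SIZE));
    for (data, ctx) in blocks {
        // block data, followed immediately by its serialized context
        out.extend_from_slice(data);
        let mut slot = bincode::serialize(ctx).expect("serialization failed");
        // pad to the fixed 32-byte slot size, so the Downstairs only needs
        // to know the serialized size, not the contents
        slot.resize(CONTEXT_SIZE, 0);
        out.extend_from_slice(&slot);
    }
    bytes::Bytes::from(out)
}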

The new checksum member is a digest of data to protect against corruption in transit, because I’m also proposing to remove the now-unnecessary on-disk checksums (see § Metadata reduction).

To avoid unnecessary copies during read operations, we’d also introduce an equivalent Message::PackedReadResponse. In-transit corruption for reads is detected with either the checksum (for unencrypted data) or decryption.

Note
Philosophically, this makes the Downstairs even dumber, which is good. Specifically, the Downstairs no longer looks into the block context data; the only thing that matters is its (serialized) size.

Compatibility and migration

It is likely that we’ll need to support old (raw) and new (packed) extents simultaneously, if only because we can’t change read-only snapshots.

Introducing new messages for packed extents means that the Upstairs will send and receive different messages depending on extent type, which is awkward.

As an alternative, we could use the new (packed) message format everywhere, and have the Downstairs translate when necessary. That works fine for reads (where we simply discard superfluous data), but is awkward for writes: for raw extents, the Downstairs would have to compute on_disk_hash itself, doing 3× the work (the hash would be computed independently by each of the three Downstairs, instead of once in the Upstairs).

It’s possible to migrate between the raw and packed on-disk formats, if the region is writeable. This suggests a best-of-both-worlds strategy:

  • Read-only regions are left as raw extents. We can automatically translate their read results into Message::PackedReadResponse with no overhead

  • Read-write regions are migrated to packed extents (either at startup or lazily upon the first write). Once migrated, they can use Message::PackedWrite with no additional overhead

(at this point, we could remove Message::ReadResponse and Message::Write, or reuse their names for PackedReadResponse and PackedWrite)
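A rough sketch of that strategy (illustrative only; the names below are made up, not the actual Downstairs code):

enum ExtentFormat {
    Raw,    // existing layout, with two context slot arrays at the end
    Packed, // proposed interleaved layout
}

fn open_extent(format: ExtentFormat, read_only: bool) -> ExtentFormat {
    match (format, read_only) {
        // read-only raw extents stay raw; their reads are translated into
        // Message::PackedReadResponse on the fly
        (ExtentFormat::Raw, true) => ExtentFormat::Raw,
        // writable raw extents are migrated, at startup or on first write
        (ExtentFormat::Raw, false) => migrate_to_packed(),
        // packed extents are used as-is
        (ExtentFormat::Packed, _) => ExtentFormat::Packed,
    }
}

fn migrate_to_packed() -> ExtentFormat {
    // read each block and its valid context from the raw layout, rewrite
    // them interleaved (with per-record padding), then update the metadata
    ExtentFormat::Packed
}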

Preliminary results

I tested an early version of a packed layout in crucible#1270, and saw significant performance improvements.

Doing random reads to a completely full 128 GiB disk, performance is improved for every read size. For small reads, it’s a 40% improvement!

Read size (blocks) | V1                           | V2                           | Speedup
1                  | 12.7 MiB/sec ± 328.8 KiB/sec | 17.7 MiB/sec ± 269.9 KiB/sec | 1.39×
2                  | 24.6 MiB/sec ± 764.4 KiB/sec | 34 MiB/sec ± 334.8 KiB/sec   | 1.38×
4                  | 47.1 MiB/sec ± 1 MiB/sec     | 65.4 MiB/sec ± 870.7 KiB/sec | 1.39×
8                  | 81 MiB/sec ± 658.1 KiB/sec   | 112.1 MiB/sec ± 1.8 MiB/sec  | 1.38×
16                 | 135.8 MiB/sec ± 2.7 MiB/sec  | 192.5 MiB/sec ± 3.8 MiB/sec  | 1.42×
32                 | 220.5 MiB/sec ± 5.1 MiB/sec  | 291.3 MiB/sec ± 6.9 MiB/sec  | 1.32×
64                 | 290.2 MiB/sec ± 8 MiB/sec    | 368.8 MiB/sec ± 12.4 MiB/sec | 1.27×
128                | 455.5 MiB/sec ± 5.2 MiB/sec  | 549.9 MiB/sec ± 7.6 MiB/sec  | 1.21×
256                | 580.6 MiB/sec ± 19.2 MiB/sec | 746.2 MiB/sec ± 16.5 MiB/sec | 1.29×
384                | 580.6 MiB/sec ± 36.2 MiB/sec | 680.1 MiB/sec ± 15.8 MiB/sec | 1.17×
512                | 652 MiB/sec ± 47.6 MiB/sec   | 708.8 MiB/sec ± 29.6 MiB/sec | 1.09×
640                | 641.5 MiB/sec ± 13.4 MiB/sec | 720.8 MiB/sec ± 26 MiB/sec   | 1.12×
768                | 666.2 MiB/sec ± 13.7 MiB/sec | 756 MiB/sec ± 27.1 MiB/sec   | 1.13×
896                | 673.2 MiB/sec ± 14.1 MiB/sec | 746.7 MiB/sec ± 21.9 MiB/sec | 1.11×
1024               | 707.7 MiB/sec ± 32.3 MiB/sec | 764.8 MiB/sec ± 24.9 MiB/sec | 1.08×

This implementation uses the existing Message variant, so it uses pwritev (instead of pwrite) to write data to the file. Using pwrite may be more performant, but it’s hard to tell without further benchmarking.