RFD 192
Omicron Database Design

Designing database schemas is hard. There are established methods for laying out tables, columns, indexes, etc. and normalizing them to ensure integrity and ease future expansion. Less widely known but still established patterns exist for operating on the data while preserving the correctness, consistency, and scalability requirements of applications like Omicron. This RFD seeks to centralize our understanding of the general requirements for the database, specific implications of those requirements, and proposed patterns for meeting those requirements.

Note
This RFD describes the intent behind the existing database interface within Nexus. That implementation is far from complete. Thus, the ideas here are neither greenfield nor immutable. As we evolve them, we’ll want to consider the cost of changing direction.

High level goals

Strong consistency. [rfd48] says that in general the public API should be strongly consistent. The example: "if a user creates an Instance, another user with appropriate permissions should immediately see that Instance." [rfd24] describes a similar idea called "coherence". It summarizes the scope across which namespaces are expected to be coherent. We expect the API to be strongly consistent within at least an availability zone for most uses of the API, including CRUD for Projects, Instances, VPCs, VPC Subnets, Disks, Snapshots, Images, etc.[1]

Scalability. We mean two things:

  • We can increase the capacity of the system by adding hardware. CockroachDB facilitates this: we can add hardware to increase the number of databases, tables, or records; amount of data; or requests per second. But we still need to structure the application to avoid bottlenecks. If the first step of every API request updates one particular record in the database, we still wind up limited to the capacity of one system no matter how much we grow the CockroachDB cluster.

  • Informally, the system doesn’t get much slower as it gets much larger. More precisely, we say that API request latency should grow less-than-linearly with the number of records in the database.

See [rfd48] and [rfd53] for more about control plane database requirements.

Modeling control plane data

We’ll use the example of a Project (the API resource), which is a collection of other API resources like Instances, Disks, and VPCs. Instances have links to other resources like Disks. Instances get created through a complex, asynchronous process. VPCs are useful for some of our examples because they’re created synchronously. See [rfd4] for more about the API in general and Projects and Instances in particular. See [rfd21] for more on VPCs.

There are many other collections in the public API:

  • Organizations contain Projects.

  • Projects contain most other resources, including Instances, Disks, VPCs, and so on.

  • VPCs contain other resources like VPC Subnets

  • lots more

We impose these constraints on collections:

  • The API must be able to enumerate the entire collection using paginated requests.

    • Consistency: such an enumeration should include all items present and not renamed during the whole scan. The scan may skip items created, renamed, or deleted during the scan. It may show the same item twice in the case of a rename.

    • Scalability: each page request must take bounded time. More on this below.

  • Regarding deletion:

    • Users cannot create items inside a deleted collection.

    • Users cannot delete a collection that contains any items.[2]

  • Name uniqueness must always be preserved, including for create, rename, and move-to-another-collection (e.g., moving an Instance from one Project to another).

Modeling a Collection

To describe Projects, we start with this:

CREATE TABLE omicron.public.Project (
    /* Identity metadata */
    id UUID PRIMARY KEY,      /* immutable unique id */
    name STRING(63) NOT NULL, /* mutable, unique among live objects in scope */
    time_deleted TIMESTAMPTZ, /* non-NULL only if object has been deleted */

    organization_id UUID NOT NULL, /* foreign key into "Organization" table */

    /* Other identity metadata that are not used programmatically. */
    description STRING(512) NOT NULL,
    time_created TIMESTAMPTZ NOT NULL,
    time_modified TIMESTAMPTZ NOT NULL
);

Here we use database-provided types to store UUIDs and timestamps. Where possible, we use specific types provided by the database, both for efficiency and type-safety among database clients. (This is not a substitute for validation at the API.)

The id, name, description, time_modified, and time_created columns are prescribed by [rfd4] as what we call identity metadata.

id is an immutable, system-generated UUIDv4 that uniquely identifies each object across all time, all types of objects, and all Oxide systems. This makes it a natural primary key. Like many systems, we assume this uniqueness by virtue of the astronomical probability of UUIDv4 collision.

name is a mutable, user-controlled identifier that is unique among live objects of the same type within the object’s parent scope. A Project is contained in an Organization, so the Project’s name is only unique among Projects within the Organization.

Both id and name work as identifiers for API users. We mostly expect people to use name. Software that wants to be robust against other clients renaming (or moving) objects might prefer to use ids. This is fine, but usually not necessary. If the software is just updating individual resources and wants to avoid accidentally applying an update to the wrong thing (because of a rename), it can use ETags with HTTP conditional headers to avoid that.

When Nexus receives an API request, we’ll usually want to resolve the name to an id up front and do most internal work using ids. Otherwise, we can wind up with nasty concurrency issues (i.e., serializability violations) if somebody renames a resource while we’re working on it. This can even lead to security vulnerabilities: if we do an access check on some resource and somebody then renames it, we can wind up operating on a different resource than the one we checked.

Important
It’s true that resolving names to ids early in an API request is a sound way to avoid these (serious) concurrency issues. However, if the entire request can be translated into one database statement, the resolution can be done within the query more efficiently. Compare a two-step SELECT id FROM …​ WHERE name = …​ (resolve the name to an id) plus UPDATE …​ SET foo = …​ WHERE id = …​ (update the record having that id) to just UPDATE …​ SET foo = …​ WHERE name = …​. For this reason, it’s tempting for a Rust database layer to expose a function like update_foo_by_name(), but it’s also very easy to misuse for the reasons mentioned (i.e., if you do this as part of a multi-step operation, you likely will introduce a potentially serious concurrency bug). For this reason, we should avoid this sort of general-purpose API.
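
To make that comparison concrete, here is a minimal sketch using the Project table above (the literal UUIDs and values are hypothetical):

/* Two steps: resolve the name to an id, then update by id. */
SELECT id FROM omicron.public.Project
    WHERE organization_id = '9eed14b9-3347-4c01-b7ce-d9c34e9d6fc1'
      AND name = 'my-project'
      AND time_deleted IS NULL;
UPDATE omicron.public.Project
    SET description = 'updated description', time_modified = NOW()
    WHERE id = '5d552a96-ea45-40ce-9674-f3a3792acc4c'
      AND time_deleted IS NULL;

/* One step: resolve and update in a single statement. */
UPDATE omicron.public.Project
    SET description = 'updated description', time_modified = NOW()
    WHERE organization_id = '9eed14b9-3347-4c01-b7ce-d9c34e9d6fc1'
      AND name = 'my-project'
      AND time_deleted IS NULL;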

Since Projects are contained within Organizations, there would be a separate "Organization" table with a similar schema. Each row in the "Project" table has organization_id, essentially a foreign key into the Organizations table. We do not use explicit foreign keys, though — more on this later under [_use_of_foreign_keys].

time_deleted is used to support soft deletes. More on this later under [_soft_deletes].

Not all API resources need to look exactly this way. For example, disk attachments might be represented not with a separate table but as rows in the Disk table having a non-null "attached_instance" field.
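
For illustration only, such a Disk table might look something like this (this is a sketch, not the actual Disk schema, and the attachment column name is hypothetical):

CREATE TABLE omicron.public.Disk (
    /* Identity metadata, as in the other tables */
    id UUID PRIMARY KEY,
    name STRING(63) NOT NULL,
    time_deleted TIMESTAMPTZ,

    /* Linkage to parent collection */
    project_id UUID NOT NULL,

    /* Attachment state: non-NULL only while attached to an Instance */
    attach_instance_id UUID,

    /* Other identity metadata that are not used programmatically. */
    description STRING(512) NOT NULL,
    time_created TIMESTAMPTZ NOT NULL,
    time_modified TIMESTAMPTZ NOT NULL
);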

The approach here normalizes the database.[3] Some high-scale systems use denormalization to avoid potentially expensive JOIN operations. We expect JOINs to be efficient as long as we’re careful in how we structure the queries — just like any other query. More on this below.

Modeling an Item in a Collection

Instances are scoped to a particular Project. They might look like this:

CREATE TABLE omicron.public.Instance (
    /* Identity metadata */
    id UUID PRIMARY KEY,      /* immutable unique id */
    name STRING(63) NOT NULL, /* mutable, unique among live objects in collection */
    time_deleted TIMESTAMPTZ, /* non-NULL only if object has been deleted */

    /* Linkage to parent collection */
    project_id UUID NOT NULL, /* foreign key into "Project" table */

    /* Other identity metadata that are not used programmatically. */
    description STRING(512) NOT NULL,
    time_created TIMESTAMPTZ NOT NULL,
    time_modified TIMESTAMPTZ NOT NULL
);

This looks pretty similar to the Project, except the link to the parent collection is called "project_id" instead of "organization_id".

Let’s create a unique index like this:

CREATE UNIQUE INDEX ON omicron.public.Instance (
    project_id,
    name
) WHERE
    time_deleted IS NULL;

This is a multi-column unique partial index, which sounds more exotic than it really is:

  • "multi-column" means that the index stores both the project_id and name columns together, sorted first by project_id and then by name.

  • "unique" means that there can be only one entry in the index with the same project_id and name. Any operation that would violate that (like creating an Instance with the same name as another Instance in the same Project) will raise a constraint violation error.

  • "partial" just means that only some rows are in the table are indexed — namely, those where time_deleted IS NULL.

This index is critical — it enables a bunch of things simultaneously:

  • Supports efficient lookup of an Instance within a Project by its name.

  • Supports efficient pagination of all Instances within a Project. Each API request for a page of results corresponds to fetching the same number of sequential entries from this index.

  • Enforces that each Instance’s name is unique among live Instances within its parent Project. The same name can be used by different Instances in different Projects. The same name can also be used by a live Instance and any number of deleted Instances, even with soft deletion. The constraint is maintained when someone renames the Instance or moves it from one Project to another. (All of this would be much harder to implement outside the database.)

In this way, we leverage the database’s ability to validate constraints like name uniqueness. The database is well-designed to handle these efficiently with the desired consistency.

We said above that the id is the primary key for our tables. CockroachDB automatically creates an index on the primary key. This means we can do efficient pagination and direct lookup of Instances by their id. (This does not support efficient pagination of the Instances within a Project by their id — for that, we’d need a separate composite index on the (project_id, id) columns.)
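
To illustrate: paginating a Project’s Instances by name can use the unique partial index above, while pagination by id within a Project would need something like the composite index sketched here (not part of the schema above; the page-marker value is hypothetical):

/* One page of a Project's live Instances, ordered by name */
SELECT * FROM omicron.public.Instance
    WHERE project_id = '5d552a96-ea45-40ce-9674-f3a3792acc4c'
      AND time_deleted IS NULL
      AND name > 'name-of-last-instance-on-previous-page'
    ORDER BY name
    LIMIT 10;

/* Hypothetical composite index that would support pagination by id within a Project */
CREATE INDEX ON omicron.public.Instance (
    project_id,
    id
) WHERE
    time_deleted IS NULL;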

Use of foreign keys

As of this writing, we avoid explicit foreign keys due to this warning from the docs:

Foreign key dependencies can significantly impact query performance, as queries involving tables with foreign keys, or tables referenced by foreign keys, require CockroachDB to check two separate tables. We recommend using them sparingly.

That quote might scare one into thinking that foreign keys are particularly pathological in CockroachDB. But we have no reason to believe this. Referential integrity requires that if we insert or update a record in a table with a foreign key, the database must look up the foreign key value to ensure it exists. It seems likely that’s what the docs are talking about. But that lookup should be indexed, since primary keys are indexed, and so it should be fast. We’d likely pay a higher cost in the application to check the same constraint.

Referential integrity also requires that if we delete a row in the foreign table, we must ensure that there are no rows in this table that reference it. It would be possible to do this efficiently with our unique index by name. Does CockroachDB use that? We would really want to verify this before switching to using explicit foreign keys to avoid seemingly innocuous operations generating very expensive scans. We’d also want to make sure we understand (and carefully choose) the behavior of operations like "DELETE" when foreign keys are used.

It’s worth noting that even if we use explicit foreign keys, that wouldn’t be sufficient for our application-level invariants. We’d need additional checks to ensure that the referenced Project is live, for example.

Soft deletes

Soft delete is a common pattern where instead of deleting a database row, you set a flag to indicate that a row has been deleted. This decouples the removal of the row as far as the user is concerned from other (potentially expensive) consequences of removing the row.

As an example: suppose we have a billing system that uses information in the database to report on usage. A Project might need to show up in a billing statement after it was deleted. Without soft deletes, the billing system would need to handle deleted Projects specially, whereas if the Project persists in the database (with a "deleted" flag), the billing system does not need to do something special to find it. Soft deletes can also be useful to support operations like undelete and for auditing.

The implementation of soft delete affects not just the code that logically deletes rows, but any queries that select live rows, plus the indexes that enforce constraints on live rows. Consumers that only want to look at live objects need to specify the extra time_deleted IS NULL constraint. By using the same time_deleted column in most places, we can implement soft deletes uniformly.
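
For illustration, a soft delete and a live-only read of the Project table above might look like this (literal values are hypothetical):

/* "Delete" the Project as far as users are concerned */
UPDATE omicron.public.Project
    SET time_deleted = NOW()
    WHERE id = '5d552a96-ea45-40ce-9674-f3a3792acc4c'
      AND time_deleted IS NULL;

/* Consumers that only want live Projects must say so explicitly */
SELECT * FROM omicron.public.Project
    WHERE organization_id = '9eed14b9-3347-4c01-b7ce-d9c34e9d6fc1'
      AND time_deleted IS NULL
    ORDER BY name
    LIMIT 10;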

Note
Archiving: Although we probably don’t want to remove rows for deleted objects right away, we also don’t want to keep them around for longer than they’re really needed. They do take up space in the database, which is relatively expensive storage. They potentially take up space in indexes, too. So we’ll need to build some sort of archiver or cleaner that removes rows that have been dead for long enough (potentially after archiving them to cheaper storage, like a simple log file).

Querying the database

This section starts with some guidelines for how we query the database, then goes into several specific patterns we’ll want to use in a number of places. The intent is to implement these patterns once, with good abstractions, so that people don’t have to think about all these details (and all the ways to get them wrong) every time they need to use the database.

General notes

Applications use a variety of patterns to synchronize database updates, including both explicit and implicit locks in the database as well as optimistic concurrency control (OCC). These patterns can have drastically different latency (and correctness) impacts under different kinds of load. Workloads that run well at small scale can see faster-than-linear increase in latency as the system gets busier due to queueing delays waiting for locks. The suggestions here seek to minimize the use of explicit locks, minimize the hold time of all locks, and minimize the need for retry loops.

Prefer small, short-lived transactions to long-running ones. One rule of thumb: transactions should never block on something outside the database. This rule helps ensure liveness and consistent latency for database clients. As an example, you would not want to do this:

/* WARNING: avoid this pattern! */
BEGIN;
SELECT * FROM Instance WHERE id = 12345 FOR UPDATE;
/* over in the database client, look at the row you got back and decide how to update it */
UPDATE Instance SET ... WHERE id = 12345;
COMMIT;

This transaction blocks on both network activity and client activity. The SELECT query locks the row that it returns. When there’s contention, in the best case, the locking and queueing delays create latency bubbles. In the worst case, partitions or software bugs or hangs can leave rows locked for very long periods (even indefinitely). For a good discussion of this pattern and alternatives, see [ringer-2014]. For more on long-running transactions, see [lipiński-2011]. While these posts are oriented around PostgreSQL, many of the same problems apply.

Make transactions smaller, so that each transaction has less work to do. In particular, avoid multiple client-server exchanges per transaction. For example, use common table expressions to group multiple SELECT and INSERT/UPDATE/DELETE/UPSERT clauses together in a single SQL statement.

Closely related is avoiding read-modify-write patterns when possible. Again, [ringer-2014] provides a good overview of different patterns here. Often you can do the read-modify-write in a single SQL statement, which avoids round-trips to the client (possibly with locks held) and retry loops. That in turn improves latency and reduces latency bubbles. For examples, see [_inserting_records] and [_conditional_updates].

Database queries should be scalable, by which we mean the latency should be pretty consistent even as the database grows. This follows from short-lived transactions, but what does this mean? It’s easy to write queries that run quickly on small tables but whose runtime grows linearly (or even superlinearly) with the size of various tables. If you’ve got a table with no index, a simple SELECT on any field other than the primary key will require scanning the whole table!

Some rules of thumb: assume that most tables can grow arbitrarily large, and assume that you’ll need an index to find any rows in the table. What index do you need? This gets tricky — you have to know a bit about how databases execute queries and how indexes affect that, which is beyond the scope of this RFD. The EXPLAIN command helps verify your assumptions. There are lots of useful resources around this:

  • CockroachDB’s EXPLAIN documentation gives a good overview of the command and basic ideas.

  • For a deeper dive, check out the EXPLAIN documentation in PostgreSQL. CockroachDB is not PostgreSQL, particularly under the hood. But the basic principles around query planning, table scans, indexes, and EXPLAIN are very similar.

Indexes do have cost: they take up considerable disk space, plus memory when they’re used.

Warning
Don’t just create an index and assume that the database can use it for your query. Test it with EXPLAIN. It’s easy to accidentally create indexes that aren’t quite what the database needs. Example: you have a table Foo with a string column bar. You have a partial index on bar, defined WHERE bar IS NOT NULL. You run SELECT * FROM Foo WHERE bar = 'baz'. You might assume the database can use the index because any matching rows have bar = 'baz', which implies bar IS NOT NULL, which means those rows are in the index. But the database doesn’t necessarily infer that bar = 'baz' implies bar IS NOT NULL, so it might conclude that it can’t use that index. The last thing you want is to create a huge index that the database can’t actually use for the queries that you created it for. Use EXPLAIN to check.[4]
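
Here is a sketch of that scenario (the table, index, and data are hypothetical) showing how you would check the plan:

CREATE TABLE Foo (
    id UUID PRIMARY KEY,
    bar STRING
);

/* Partial index that we hope the query below can use */
CREATE INDEX foo_by_bar ON Foo (bar) WHERE bar IS NOT NULL;

/* Verify whether the planner actually chooses foo_by_bar before relying on it */
EXPLAIN SELECT * FROM Foo WHERE bar = 'baz';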

Rethinking CRUD

The basic CRUD operations are more complicated for us than just INSERT/SELECT/UPDATE/DELETE:

  • "Create": See [_inserting_records]. We need to maintain various application-level invariants, like you can’t create an Instance in a Project that doesn’t exist. Or in some cases we want the INSERT to succeed when the row already exists.

  • "Read" needs to account for [_soft_deletes] and related mechanisms.

  • "Update": See [_conditional_updates]. To avoid read-modify-write, we want to apply conditions to UPDATE clauses. But we usually want to distinguish between update failing because a row that we wanted to update didn’t exist vs. the preconditions weren’t met.

  • "Delete": See [_soft_deletes]. Removing a row from user visibility doesn’t necessarily mean we don’t want to hang onto it for a little while.

Inserting records

To create a resource, we need to insert a row:

INSERT INTO Instance (id, name, project_id, time_deleted) VALUES (
    'a964661d-6fb2-4119-aa9f-cfdacd70c06c',
    'my-instance',
    '5d552a96-ea45-40ce-9674-f3a3792acc4c',
    NULL
);

If there’s a uniqueness conflict on (project_id, name), that may generate an error that we need to handle. No special SQL is required.

Now, what if we want to return the created object back to the user? We could SELECT it back:

/* WARNING: don't do this */
INSERT INTO Instance (id, name, project_id, time_deleted) VALUES (
    'a964661d-6fb2-4119-aa9f-cfdacd70c06c',
    'my-instance',
    '5d552a96-ea45-40ce-9674-f3a3792acc4c',
    NULL
);
SELECT * FROM Instance WHERE id = 'a964661d-6fb2-4119-aa9f-cfdacd70c06c';

Even when you issue these in the same transaction, depending on the database and isolation level, another transaction could remove or modify the Instance in between those calls. A better option is to use a RETURNING clause:

INSERT INTO Instance (id, name, project_id, time_deleted) VALUES (
    'a964661d-6fb2-4119-aa9f-cfdacd70c06c',
    'my-instance',
    '5d552a96-ea45-40ce-9674-f3a3792acc4c',
    NULL
) RETURNING *;

If this INSERT succeeds, it will return the row that it inserted.

When creating Projects, we may create a few other resources as well: a default Vpc and VpcSubnet, for example. We can handle this using multiple INSERTs. All of the INSERTs should happen in the same transaction so that we do not create detritus for users to clean up in the event of a half-finished operation.
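
A minimal sketch of that transaction (the Vpc and VpcSubnet column lists here are illustrative, not the real schema; the whole transaction should be issued to the database as a single batch rather than one statement at a time):

BEGIN;
INSERT INTO omicron.public.Project
    (id, name, time_deleted, organization_id, description, time_created, time_modified)
VALUES
    ('5d552a96-ea45-40ce-9674-f3a3792acc4c', 'my-project', NULL,
     '9eed14b9-3347-4c01-b7ce-d9c34e9d6fc1', 'example project', NOW(), NOW());
INSERT INTO Vpc (id, name, project_id, time_deleted)
VALUES
    ('4462f7b5-7a48-4dd0-9aa4-4f308159e9df', 'default',
     '5d552a96-ea45-40ce-9674-f3a3792acc4c', NULL);
INSERT INTO VpcSubnet (id, name, vpc_id, time_deleted)
VALUES
    ('c5a7c1f9-7e48-4a43-8f44-df24f22d9d2e', 'default',
     '4462f7b5-7a48-4dd0-9aa4-4f308159e9df', NULL);
COMMIT;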

We also need to handle a dueling administrator removing a collection at the same time as we’re trying to create something inside it. This is its own complicated problem that we’ll set aside for now.

Now, what if the record that we’re creating might already exist? For API calls, that’s probably an (impossible) error because we usually just generated the id, so nothing can already have the same one.[5] But for saga actions, this is common: the action creates the record, and the action needs to be idempotent, so it needs to handle the case that it already ran and created the record. We can do this with:

INSERT INTO Instance (id, name, project_id, time_deleted) VALUES (
    'a964661d-6fb2-4119-aa9f-cfdacd70c06c',
    'my-instance',
    '5d552a96-ea45-40ce-9674-f3a3792acc4c',
    NULL
) ON CONFLICT (id) DO NOTHING;

This statement will succeed (having done nothing) if a record existed with the same id. It will still fail (as we want) if another constraint is violated, like uniqueness of the name column.

We’ve lost the RETURNING clause here. We could use both ON CONFLICT and RETURNING. But it’s not sufficient: if the row already existed, we won’t get its contents because RETURNING only returns whatever was inserted, not the contents of whatever existing row had a conflict that was ignored. Fortunately, for many sagas (especially those that create records), we can design invariants that limit what kinds of concurrent operations are allowed while the saga is running. For example, we can guarantee that an Instance cannot be deleted while its "create" saga is running (by disallowing delete operations on sagas in the "creating" state). Thus, it’s okay to do the sequential INSERT and SELECT (even without a transaction) for this case.[6]
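
A sketch of that sequence for a saga action, reusing the values from the examples above:

/* Idempotently ensure the record exists (succeeds even if the action already ran) */
INSERT INTO Instance (id, name, project_id, time_deleted) VALUES (
    'a964661d-6fb2-4119-aa9f-cfdacd70c06c',
    'my-instance',
    '5d552a96-ea45-40ce-9674-f3a3792acc4c',
    NULL
) ON CONFLICT (id) DO NOTHING;

/*
 * Then fetch whatever row is there now.  This two-step approach is safe only
 * because the saga's invariants rule out concurrent deletion of the Instance.
 */
SELECT * FROM Instance WHERE id = 'a964661d-6fb2-4119-aa9f-cfdacd70c06c';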

Important
We’ve just proposed two different INSERT mechanisms: one for sagas (where we can eliminate concurrency issues) and one for API calls (where concurrency is an issue but we don’t need to return the existing row if the INSERT fails). It may seem ugly to have two different special-case INSERTs, but both are much simpler than trying to come up with something that meets all the constraints. Such a mechanism might be possible with a CTE similar to what we use for conditional updates.
Note
Rust implementation patterns are beyond the scope of this RFD, but it’s worth noting that the id for a record created by a saga should be generated by a separate saga action from the one that inserts the record. Otherwise, the action wouldn’t be idempotent — you’d get two records if you ran the action twice instead of just once.

When objects (like Instances) are created by sagas, they might be in a half-created state. We’ll want to make sure the precise state is represented in the database so that consumers (like the API) can decide whether they want to leave these objects out or represent them differently. For example, Instances have a "creating" state and the API could leave these out or pass the state through so that clients can see exactly what’s happening. We need more consumers before we can generalize patterns here.
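
For example, a consumer that wanted to hide half-created Instances might filter on the run_state column introduced in the next section (a sketch only; whether to hide or expose these states is a product decision):

SELECT * FROM Instance
    WHERE project_id = '5d552a96-ea45-40ce-9674-f3a3792acc4c'
      AND time_deleted IS NULL
      AND run_state != 'creating'
    ORDER BY name
    LIMIT 10;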

Conditional updates

The [_general_notes] section explains why it’s preferable to use UPDATE statements with conditions in the WHERE clause rather than an interactive transaction (in which the client fetches a row, examines it, decides how to update it, and issues an UPDATE — all with the row lock held). Let’s take a closer look.

Suppose our Instance table has these additional columns:

run_state STRING(16) NOT NULL,
run_gen INT8 NOT NULL,

where "run_state" might be "starting", "running", "stopped", etc.[7] and "run_gen" is the generation number for the current "run_state". Let’s suppose the Sled Agent makes a request to Nexus whenever the runtime state of an Instance changes (e.g., because someone in the guest VM issued a poweroff). Nexus’s job is to make sure that the database state always reflects the most recent runtime state of the Instance.

Now, suppose we go through a number of state changes in quick succession — maybe someone issues a reboot and we quickly go through "stopping", "stopped", "starting", and "running". Of course, at Sled Agent, there’s a clear ordering of these events. But when Sled Agent notifies Nexus and Nexus updates the database, these transactions could be processed out of order. (This is especially likely if there are transient failures and requests get retried after some delay.) We can use the generation number to ensure that whatever happened last at the Sled Agent appears last in the database — without retries or explicit locks!

We can write the Nexus query like this:

UPDATE Instance SET run_state = 'running', run_gen = 456 WHERE id = 123 AND run_gen < 456;

In this case, the generation number "456" comes from Sled Agent within the message announcing the state change to "running". We assume it’s able to keep a monotonically increasing counter of state changes.[8]

Now, when you execute the UPDATE statement, you can tell how many rows were updated. If the number is non-zero, then this update was newer than all previous ones and recorded to the database. If it’s zero, then it wasn’t.

The problem: if zero rows were updated, you can’t tell if it’s because the row with id=123 didn’t exist or because it already had a newer generation number. That might matter! Suppose this isn’t a fire-and-forget notification from the Sled Agent but instead an HTTP conditional request — maybe somebody trying to avoid [_dueling_administrators]. In that case, you want to return a 404 if the row is gone, but a 412 ("precondition failed") if the row had a newer generation.

The simplest way we’ve found to distinguish these cases is with a common table expression (CTE):[9]

WITH found_row   AS (SELECT id,run_state,run_gen FROM Instance WHERE id = 123),
     updated_row AS (UPDATE Instance SET run_state = 'running', run_gen = 456 WHERE id = 123 AND run_gen < 456 RETURNING id,run_gen)
SELECT
     found.id        AS found_id,
     found.run_gen   AS found_gen,
     found.run_state AS found_state,
     updated.id      AS updated_id,
     updated.run_gen AS updated_gen
FROM
    found_row found
FULL OUTER JOIN
    updated_row updated
ON
    found.id = updated.id;

Within one statement, this:

  • SELECTs the row in question, presumably producing at most one row;

  • UPDATEs the same row conditional on run_gen < 456, also producing at most one row; and

  • computes a full outer join on these two rows, which essentially creates one row containing the contents of each of these two rows, side-by-side. If either query returned no rows, the corresponding columns are empty.

The result should be one row with columns found_id, found_gen, found_state (which are all NULL if the row did not exist at all), plus updated_id and updated_gen (which are both NULL if the row was not updated). There are three cases:

  • If the "found" columns are NULL, the "updated" columns will also be NULL, and the row did not exist at all.

  • Otherwise, if the "updated" columns are NULL, then the row existed but was not updated. Thus, it did not meet the precondition (run_gen < 456). found_gen and found_state will show the newer generation number and state so you can report a detailed message to the user.

  • Otherwise, the "updated" columns are non-NULL. The row existed and the update was applied.

To work through this in detail, see [_appendix_update_cte].

Dueling administrators

By "dueling administrators", we mean any case where two users make logically inconsistent requests to the API at the same time. (They don’t have to be administrators, and they can even be the same user!) A typical case might be: both users fetch an Instance that is currently "stopped", then user 1 issues a request to "start" the instance, and before user 1’s request completes, user 2 issues a request to delete it. As mentioned above, we expect the system to look basically serializeable, which generally means that the resulting behavior is as if the requests were processed in series. This still gives us a lot of freedom. In this example, any of these outcomes would be okay:

  • user 1’s request completes successfully, user 2’s request fails because a running instance cannot be deleted, and the Instance is running at the end

  • user 2’s request completes successfully, user 1’s request fails because you cannot stop a deleted instance, and the Instance winds up deleted (after not having started)

These would not be okay:

  • both users' requests complete successfully, regardless of the end state of the instance

  • user 1’s request completes, user 2’s request fails as above, but the Instance is anything other than running at the end

  • user 2’s request completes, user 1’s request fails as above, but the Instance is anything other than deleted at the end

There are two parts of this problem: one is making sure that the system always does something reasonable in this situation, and the other is providing users tools to control which of these things happens.

Simple modifications

It’s easy enough to give users tools to synchronize modifications to a single resource using HTTP ETags and conditional requests (see [rfd4]). In our example, both user 1 and user 2 would fetch the Instance, get some ETag back with it, and could make their modification request conditional on the ETag not having changed. One of these would succeed while the other would fail before having done anything. As long as there’s not too much contention, this works well. To implement it, we need to provide a means for generating ETags such that the ETag for a resource will always change when the resource itself changes and usually doesn’t change otherwise.

It’s trickier to help users when one of these requests renames or moves a resource (e.g., moving an Instance from one Project to another). One of the requests will wind up failing because it can’t find the resource. This is out of scope for us.

Creation of an item in a collection that’s being deleted

Suppose we have an empty Project. One user issues a request to delete it while another issues a request to create a new VPC inside it. It would be easy to implement the create-VPC endpoint in a way that checks whether the Project exists, resolving the Project name to its id, then creates a new record in the VPC table that references the Project by its id. But the project-delete request could be processed in the middle of this, after the name-to-id resolution but before the VPC insert. That would leave us with a VPC in a Project that no longer exists! Worse, both users' requests succeeded, which violates our promise of serializability: there is no sequencing of these two operations such that they should both succeed.

To address this, we propose that every collection have a child-resource-generation-number ("rcgen").

To create a child resource inside a collection, we’d do something like this:

WITH
    found_row AS (SELECT id FROM Project WHERE id = ... AND time_deleted IS NULL FOR UPDATE),
    dummy AS (SELECT IF(EXISTS(SELECT id FROM found_row), TRUE, 1/0)),
    updated_row AS (UPDATE Project SET rcgen = rcgen + 1 WHERE id = ... AND time_deleted IS NULL),
    inserted_row AS (INSERT INTO Instance (...) VALUES (...) RETURNING *)
SELECT * FROM inserted_row;

This is confusing and a little hacky. First, we look for the collection. Next, we select "dummy": the point of this expression is to produce an invalid value if we did not find the collection. This will trigger a transaction rollback if that happens. Assuming that doesn’t happen, we bump the rcgen of the collection, conditional on it still being live. Then we attempt to insert the new Instance, returning the whole row that we inserted. (Is there a better way to do this?)

Note

We’re using SELECT …​ FOR UPDATE, which uses a (somewhat) explicit lock. However, the database does not block on a response from the client while holding this lock.

We could skip the FOR UPDATE because CockroachDB uses SERIALIZABLE isolation. We include it here because the FOR UPDATE shouldn’t be worse with CockroachDB and using FOR UPDATE here ensures this could work even with more relaxed isolation levels.

We could make this a lot simpler if we were willing to examine "updated_row" on the client (Nexus) and then decide what to do. But the solution above does all the work in the database without either round-trips with locks held or retry loops.

To remove a collection, we’d do these queries in sequence. They can be in separate transactions:

/*
 * Fetch the rcgen for the collection.  Here, we combine it with the initial
 * resolution from name to id.
 */
SELECT id,rcgen FROM Project WHERE name = ...
/*
 * See if there are any child resources in the collection.  This needs to be
 * repeated for every child resource table.
 *
 * Note that we're not selecting COUNT because that could take forever.
 * This should always be fast, provided we've created an appropriate
 * index.  And that index should exist for pagination in the collection.
 */
SELECT id FROM Instance WHERE project_id = ... LIMIT 1
/*
 * Finally, remove the project (soft delete) _conditional_ on rcgen not
 * having changed.  This should ideally use the conditional update described
 * elsewhere in this RFD to distinguish the different cases where this updates
 * no rows.
 */
UPDATE Project SET time_deleted = NOW() WHERE id = ... AND rcgen = ...

Things one might think are solutions but don’t work:

  1. Use a foreign key reference in the VPC table. This has some issues mentioned under [_use_of_foreign_keys]. It also doesn’t work with [_soft_deletes], since the row for the deleted Project will still be present, just not live.

  2. Use sagas for deletion of projects and creation of resources inside projects. This doesn’t really solve the problem by itself: the real problem here, both at delete and create time, is the gap between time-of-check and time-of-use. The only real way to solve that is to make the INSERT fail if the parent resource doesn’t exist and to make the delete (UPDATE) fail if any child records exist. Sagas here also suck for users because they would imply that these delete and create operations should be asynchronous, even for cases where it’s nothing more than a database update.

Transactions, CTEs, sagas, generation numbers

Transactions, CTEs, sagas, and generation numbers are closely related tools. To review:

Database transactions are used to group a sequence of SQL statements so that together they have ACID semantics. Most people’s intuitive understanding of transactions corresponds to the ANSI "SERIALIZABLE" isolation level. Traditional databases like PostgreSQL do not work this way by default: PostgreSQL uses "READ COMMITTED" by default. This allows for a bunch of things you might not expect, like "two successive SELECT commands can see different data, even though they are within a single transaction, if other transactions commit changes after the first SELECT starts and before the second SELECT starts." For a scarily plausible example, see [ringer-2014]. Unlike PostgreSQL, CockroachDB always uses the "SERIALIZABLE" isolation level.

Note
All statements are executed in a transaction. If you don’t use BEGIN to start a transaction, then your statement is implicitly wrapped in a new transaction. When people say "use a transaction", that’s usually shorthand for "put multiple statements into a single transaction".
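
For example, these two forms behave the same way (using the hypothetical Foo table from the example below):

/* Explicit transaction around a single statement ... */
BEGIN;
UPDATE Foo SET value = 3 WHERE name = 'thing1';
COMMIT;

/* ... behaves the same as the bare statement, which runs in its own implicit transaction */
UPDATE Foo SET value = 3 WHERE name = 'thing1';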

Common Table Expressions (CTEs) essentially let you build a complicated query in terms of subqueries, each with their own name. As mentioned above, for isolation levels that are less strict than "SERIALIZABLE", if you run the same "SELECT" query twice even in one transaction, they may see different results. A CTE allows you to save the result and re-use it so that you’re operating on a single snapshot.[10] See [_conditional_updates] for an example of where we use these and why.

Generation numbers are often used for optimistic concurrency control. In a simple case, a client might SELECT a record that has a generation number, then later UPDATE that record conditional on the generation number not having changed. If the generation number has changed, the client might abort or retry the whole operation again, starting with the SELECT. We describe a similar example under [_conditional_updates] above.

You don’t usually need generation numbers when the SELECT and UPDATE parts can be placed in the same transaction because you can skip the SELECT altogether and issue the UPDATE, including whatever parts of the WHERE clause would have been in the SELECT. To be concrete, let’s suppose you wanted to set value = 3 in some table Foo for the row with name = 'thing1'. If you were going to do:

/* WARNING: don't do this */
BEGIN;
SELECT id,generation FROM Foo WHERE name = 'thing1';
/*
 * Assume you got back one row with id = 3, generation = 7.  You insert the
 * generation number into the following SQL to avoid anything else having
 * modified this row in the meantime.
 */
UPDATE Foo SET value = 3 WHERE id = 3 AND generation = 7;
COMMIT;

you may as well do this:

UPDATE Foo SET value = 3 WHERE name = 'thing1';

The first version might require that you wrap it in a retry loop in case of a concurrent modification. The second version eliminates concurrent modifications altogether. For more complicated conditions, check out the pattern described under [_conditional_updates] to distinguish between the case where the row wasn’t there vs. it didn’t match the precondition.

Sagas are a tool for decomposing a complex sequence of reversible actions such that they will either all happen or all be reversed. Our saga implementation is built on CockroachDB and so uses transactions under the hood. Saga actions can freely use the database, so they can use explicit transactions, CTEs, generation numbers, etc.

Which to use when?

Warning
As of this writing, it’s not clear how to issue a batched transaction using Diesel or tokio-postgres and also get results back from the statements that make up the batch. For that reason, when you need to do this, use a CTE instead to combine the statements. See [_conditional_updates] for an example.

Case 1: If you’re only reading data, and you can be sure that the queries will run quickly (because you’ve verified with EXPLAIN that they can use indexes and scan a bounded number of rows), then you can just use ordinary database queries. You could use an explicit transaction or CTE if you’re reading multiple pieces of data and you want a consistent snapshot of the database.

Example: If you want to fetch the run state of an Instance, you can just fetch that row. If you also want to list the first 10 attached disks, you can just do this too, as long as there are indexes to support these queries. You can use a transaction or CTE to make sure these views are in sync.
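
A sketch of those two reads (the Disk attachment column is the hypothetical one sketched under "Modeling a Collection" above, and an index on (attach_instance_id, name) is assumed):

/* Fetch the Instance's current run state */
SELECT run_state FROM Instance
    WHERE id = 'a964661d-6fb2-4119-aa9f-cfdacd70c06c'
      AND time_deleted IS NULL;

/* List the first 10 attached Disks by name */
SELECT * FROM Disk
    WHERE attach_instance_id = 'a964661d-6fb2-4119-aa9f-cfdacd70c06c'
      AND time_deleted IS NULL
    ORDER BY name
    LIMIT 10;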

Case 2: Whether or not you’re reading any data, if all of these are true:

  • you want to insert, update, or delete any rows; and

  • the queries you want to run do not depend on anything outside the database; and

  • you can be sure that the queries will run quickly (again, having checked with EXPLAIN)

then you can use a single explicit transaction to do all the work. Be sure to do this in a way that issues the whole transaction to CockroachDB at once rather than issuing statements one at a time. The latter would block other clients.

Example: Creating a Project also creates an associated VPC and VPC Subnet. You can insert all these together in one transaction without any round-trips to the client.

Case 3: If any of these is true:

  • you need to issue one or more statements or queries that depend on calling out to an external service (like Sled Agent),

  • you need to issue one or more statements or queries and do some work in Nexus in between them,[11]

  • the database queries would take too long and you need to work in batches instead (e.g., if you wanted to report on a table with a million rows, you could break it into batches of 1,000, assuming the batched queries will be fast (because of indexes))

and you want to either resume or undo everything in the case of a crash, then you probably want to use a saga. Typically:

  1. An early saga action updates the database to reflect that a long operation is ongoing (e.g., an Instance row is inserted in state "creating")

  2. A subsequent saga action makes whatever calls to external services are needed (e.g., request to Sled Agent to start the Instance)

  3. A later saga action updates the database state to reflect that the operation completed (e.g., the Instance row state becomes "running" or some other at-rest state)

In this way, the database state always reflects that a long-running operation is ongoing. A key difference between cases 2 and 3:

  • In case 2, where you make multiple updates in a single transaction, the intermediate states (where the first update has been applied and other ones haven’t) are usually invalid from the application’s perspective (i.e., they violate application-level invariants). They never appear to other consumers because the transaction is atomic.

  • In case 3, where you make multiple updates using multiple transactions via a saga, the intermediate states are first-classed: they’re valid states and other parts of the application need to be aware of them. For example, when you create an Instance, it might not initially have an IP assigned, but it will some time later. Consumers need to be aware of Instances that don’t have IPs yet assigned.

Case 4: What if you need to issue one or more statements or queries interleaved with some work in the client (Nexus), but you don’t want to use a saga? You could use an interactive transaction: have Nexus issue multiple statements, doing the needed work between them. As mentioned above, we’re trying to avoid this because it carries the risk that a slow, hung, or partitioned Nexus blocks other database clients for long periods. Even short delays can accumulate into substantial latency bubbles.

Another way to think about when to use a saga: Generally, if you’re using a saga to implement an API operation, that should be exposed to API users as an asynchronous operation. [rfd4] describes these. They would typically return a "202 Accepted" response with an operationId that the caller can poll on to check completion and status.[12] So you probably don’t want to use a saga for something that can easily be done with a simple INSERT (like creating a Project might be). On the other hand, creating an Instance is the canonical case for a saga: users expect that to be asynchronous, long-running, and either finish successfully or unwind everything.

Determinations

  • We’re using a normalized, relational schema with strong types.

  • We’re using a common set of fields to implement common functions (like id for identification, name uniqueness, time_deleted for soft deletes).

  • We’re using UUIDv4 uuids for identifying objects internally. We assume these are sufficiently unique across all time and all systems.

  • We’re leveraging the database for verifying uniqueness constraints.

  • We’re first-classing the idea of collections, which support efficient pagination with indexes. Relatedly, all queries that might return a large set of results must be paginated.

  • We’re using soft deletes.

  • If at all possible, we’re avoiding interactive transactions and retry loops to minimize latency bubbles. We have not yet found a use case where we need either of these.

  • All queries should scale better-than-linearly with the size of the database. (This mostly means using indexes and verifying that they’re used with EXPLAIN.)

More complicated examples

TODO These examples would be good to flesh out

Example: reserving resources on a Sled

TODO Two parts: (1) picking a sled (ideally in a way that, if we blast a lot of provision requests at the system, they don’t all wind up trying to use the same sled), (2) inserting a record somewhere that reserves the resources, in a way that this transaction will fail if the resources aren’t available. See CockroachDB CHECK functionality or else have each attempt conditional on some generation number.

There is not likely to be any interface here, either to end users or even between internal components. It’s probably easy to change this later. For that reason, we may want to do the simplest thing we can now and work on more clever placement algorithms later.

Example: allocating an IP address

TODO Again, two parts: (1) picking an IP (ideally in a way that, if we blast a lot of provision requests at the system, they don’t all wind up trying to use the same IP), (2) inserting a record somewhere that reserves the IP, in a way that this transaction will fail if the IP isn’t available. See CockroachDB CHECK functionality or else have each attempt conditional on some generation number.

References to related implementations:

Like with reserving Sled resources, there is not likely to be any interface here, either to end users or even between internal components. It’s probably easy to change this later. For that reason, we may want to do the simplest thing we can now and work on more clever allocation algorithms later.

Open Questions

  • Can Diesel do batched queries the way we want (i.e., so they all get sent over to the database at once, and we can still get information back about what happened)?

  • Is it even possible for us to send a multi-statement transaction to the database in a way that gets sent as a batch (i.e., not blocking on the client) and also lets us get the results of the individual statements?

  • How do we implement pagination in the face of auth? Concretely: if we’re going to filter out objects that you’re not supposed to be able to see, then we can no longer rely on a fixed index to efficiently list a page of results. We can punt on this if granularity of "read" auth is large enough. For example, if it’s project-wide, we can still use our project index.

  • How do we implement pagination in the face of arbitrary filtering? (e.g., "list all Instances created from this particular Image"). Doing this efficiently may be beyond the scope of CockroachDB and its ilk. Systems like Dremel are built for this, often at the expense of consistency. So do we not support any filtering? Do we just accept that the database will do more (potentially a lot more) work for some pages than others? This seems bad: it would be easy to kick off arbitrary-length table scans this way by just specifying a filter that matches zero items.

  • Is there a general way (or a few general ways) of generating ETags from our database data?

  • How will we do schema migrations? (Yes, this is a huge open area. We’ve done enough to vet that CockroachDB appears to behave reasonably here, but we’re still going to need schema patterns, usage patterns, and programming patterns to be able to do this well.) The answer should probably be a new RFD.

  • What sort of "track changes" / "past versions" will we want/need? How can we build that in a consistent, easy-to-use way?

Appendix: Latency

Whether or not Oxide winds up with an SRE function, we’ll want to define SLOs like "99.9% of operations complete successfully within 300ms". Such SLOs are essential for setting user expectations and for our support process to evaluate whether a sluggish system is actually broken. The specific latency often isn’t critical so much as that outliers are very rare. Variance is a problem because people want to be able to use the system for a while, develop an understanding of how fast it is (intuitively or formally, depending on the user), and not be surprised by the system getting slower as it gets bigger or busier or they hit more weird cases.

Eliminating latency outliers is a long slog through many edge cases all over the system. In this RFD, we focus on two ways that the database tends to contribute to variable latency:

  • Variance due to unindexed queries, where listing one Project is very fast (because it’s empty) and another Project always times out (because it has 10 million VPCs in it). This is a function of database size and the mitigation is usually to use paginated queries and indexes.

  • Variance due to latency bubbles, which are often related to queueing delays, in turn often related to locks. So for example: you’ve got 5 clients all listing some Project and it’s always fast because we’ve got that indexed. Now somebody comes along and creates a VPC, which requires bumping the "rcgen" column on the Project. We’ve implemented this with the interactive "SELECT …​ FOR UPDATE" approach, and now all five of those queries see a latency blip. Now suppose a bunch of clients are creating VPCs. Each of them sees a blip from the other ones and adds its own blip. They all wind up seeing latency that’s significantly worse than before because the updates are queued up behind each other. This is what we mean by the propagation of latency bubbles. This variance is a function of the number of concurrent clients and the mitigation is to structure queries without locks and retry loops.

Appendix: Update CTE

The [_conditional_updates] CTE is kind of gnarly. This section shows exactly what’s going on. You can use this DB-Fiddle to try this out yourself. This section just copies the code and output from that fiddle in case db-fiddle disappears.

We’ll start with this table:

CREATE TABLE T (id INT, generation INT);

Let’s start with one row in the table having id 1, generation 1:

INSERT INTO T (id, generation) VALUES (1, 1);

Now let’s update the row conditional on generation = 1. This should succeed, since the row does have generation = 1. First, let’s do the two parts of the CTE explicitly. Select the row:

SELECT id, generation FROM T WHERE id = 1;

As you’d expect, this produces:

id | generation
---+-----------
 1 |          1

Now, here’s the UPDATE part:

UPDATE T
    SET generation = 2
    WHERE id = 1 AND generation = 1 RETURNING *;

Because of the RETURNING *, this produces a row describing the (one) row updated:

id | generation
---+-----------
 1 |          2

So that worked.

Now, let’s reset the state back to the beginning.

TRUNCATE TABLE T;
INSERT INTO T (id, generation) VALUES (1, 1);

Do the same exact thing, this time with the CTE. We’re doing the same SELECT query as above and assigning that to "found_rows". Then we’re doing the same UPDATE query as above and assigning that to "updated_rows". The only new thing is we’re taking the results of both queries and joining them together.

WITH found_rows   AS (SELECT id, generation FROM T WHERE id = 1),
     updated_rows AS (UPDATE T SET generation = 2 WHERE id = 1 AND generation = 1 RETURNING *)
SELECT found.id AS found_id,
       found.generation AS found_state,
       updated.id AS updated_id,
       updated.generation AS updated_state
FROM
        found_rows found
FULL OUTER JOIN
        updated_rows updated
ON
        found.id = updated.id;

That query produces:

found_id | found_state | updated_id | updated_state
---------+-------------+------------+--------------
       1 |           1 |          1 |             2

We can tell that this successfully updated the row because the "updated" columns are non-NULL. This is exactly what happened above without the CTE.

Now let’s try the same update again, but without resetting the table to its original state. This time, we’ll expect it to fail because the precondition won’t hold (because we’ve changed the generation number). Like before, we’ll do this without the CTE first.

SELECT id, generation FROM T WHERE id = 1;

produces:

id | generation
---+-----------
 1 |          2

The row was found, as expected. Here’s the conditional UPDATE part:

UPDATE T
    SET generation = 2
    WHERE id = 1 AND generation = 1 RETURNING *;

This time, we get back zero rows because none matched the condition id = 1 AND generation = 1.

Here’s that same (failed) conditional update using the CTE:

WITH found_rows   AS (SELECT id, generation FROM T WHERE id = 1),
     updated_rows AS (UPDATE T SET generation = 2 WHERE id = 1 AND generation = 1 RETURNING *)
SELECT found.id AS found_id,
       found.generation AS found_state,
       updated.id AS updated_id,
       updated.generation AS updated_state
FROM
        found_rows found
FULL OUTER JOIN
        updated_rows updated
ON
        found.id = updated.id;

This time, we get back:

found_id | found_state | updated_id | updated_state
---------+-------------+------------+--------------
       1 |           2 |       null |          null

Finally, let’s show what happens if we try to update a row that doesn’t exist at all. First, the SELECT part, which will find no rows:

SELECT id, generation FROM T WHERE id = 99;

This produces no rows because there is no row in the table with id = 99. Here’s the UPDATE part:

UPDATE T
    SET generation = 2
    WHERE id = 99 AND generation = 1 RETURNING *;

Again, this produces no rows, for the same reason.

And here’s what it looks like with our CTE:

WITH found_rows   AS (SELECT id, generation FROM T WHERE id = 99),
     updated_rows AS (UPDATE T SET generation = 2 WHERE id = 99 AND generation = 1 RETURNING *)
SELECT found.id AS found_id,
       found.generation AS found_state,
       updated.id AS updated_id,
       updated.generation AS updated_state
FROM
        found_rows found
FULL OUTER JOIN
        updated_rows updated
ON
        found.id = updated.id;

which produces no rows.

In this way, we can always tell from the CTE’s output whether the row didn’t exist, didn’t match the extra preconditions, or was successfully updated.

Footnotes
  • 1

    We could be more precise about the consistency model. For a deep dive, see Linearizability versus Serializability and Cockroach Labs’s blog post on CockroachDB’s consistency model. The latter concludes that CockroachDB is "more than serializable, less than strict serializability". Informally, we assume that users expect that if they submit multiple concurrent requests to the API, the result should be consistent with some sequential order of them (serializability) and that if requests do not overlap in time, then they will have been processed in their real-time order (strict serializability). CockroachDB guarantees most of this out of the box. If we care about the obscure and uncommon cases where it does not, we can build slightly more sophisticated mechanisms to deal with them. See the above-linked blog post for details.

  • 2

    We could instead say that deleting a collection deletes everything in it. Such a complex, long-running operation would have to be done via a saga and so be an asynchronous HTTP request. We can always do this later but we propose punting on it for now.

  • 3

    There are several levels of normalization and we’re not specific here about which level we’re talking about. The linked Wikipedia article mentions that, informally, a relational database relation is often described as "normalized" if it meets third normal form.

  • 4

    Yes, the EXPLAIN output can change depending on the data. But if you see the index used in EXPLAIN, you know there are at least some conditions where the index will work, which eliminates a common class of mistake here.

  • 5

    We’re distinguishing here between another record with the same name (which is almost always an error that we’ll bubble out to the user) and one with the same id.

  • 6

    This contrived-sounding but real behavior leads to Nexus’s old sql_insert_unique_idempotent_and_fetch function.

  • 7

    This should be an enum rather than a String.

  • 8

    If you pick at the monotonic counter, it might seem tricky — what if Sled Agent restarts? Or we move an Instance to another Sled Agent? A solution to this is that instead of the Instance generation being a single integer, it could include the Sled Agent uuid, a separate counter that gets incremented any time the Instance is moved, and a counter that gets incremented each time this particular Sled Agent restarts. Each of these is easy to maintain without persistent state. These details are important for the real problem but much more complicated than needed for this example.

  • 9

    The query here is derived from this post. It’s also consistent with the CockroachDB advice mentioned earlier around using common table expressions to combine multiple operations into one statement.

  • 10

    It’s not totally clear whether two subqueries within a CTE can get different results. Do they operate like two queries in the same transaction, or do they have some stricter isolation by virtue of being in the same top-level query?

  • 11

    It’s not yet clear why this would be necessary.

  • 12

    Sagas are almost by definition long-running operations, involving requests to external services that could hang or take a long time. We don’t usually want to wait for a saga to complete in the context of an HTTP request because it’s usually not a great idea to have HTTP requests block for arbitrarily long periods of time.
