RFD 469
Stopgap CockroachDB Updates

[rfd418] discusses how to get from the current Mupdate-driven rack update workflow ([rfd326]) to an automated update workflow. There are already a number of high-priority milestones that automated update is working toward. Meanwhile, we presently remain on CockroachDB v22.1, which is now out-of-support upstream.

CockroachDB only supports upgrading an existing cluster from one major version to the next (such as v22.1 → v22.2). Each major version supports its own on-disk format as well as the previous major version’s on-disk format. By default, once all nodes of a cluster are upgraded to a new version, the upgrade is auto-finalized, and downgrading to the original version is no longer possible. To control finalization, the cluster.preserve_downgrade_option cluster setting must be set before upgrading the cluster. ([crdb-upgrade])
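The constraints above can be modeled roughly as follows. This is a simplified sketch, not CockroachDB's own logic: the version list is just the set of major versions named in this RFD, and both function names are hypothetical.

```python
# Major versions named in this RFD, in release order. Assumed to be the
# complete sequence from v22.1 through v23.2.
MAJOR_VERSIONS = ["22.1", "22.2", "23.1", "23.2"]


def is_supported_upgrade(current: str, target: str) -> bool:
    """Return True if `target` is the major version immediately after `current`."""
    i = MAJOR_VERSIONS.index(current)
    j = MAJOR_VERSIONS.index(target)
    return j == i + 1


def can_downgrade(preserve_downgrade_option: str, previous: str) -> bool:
    """Downgrade remains possible only while the setting pins the previous version.

    An empty string models the setting being unset, in which case the
    upgrade auto-finalizes.
    """
    return preserve_downgrade_option == previous
```

For example, `is_supported_upgrade("22.1", "23.1")` is false because it skips v22.2, and `can_downgrade("", "22.1")` is false because the setting was never set.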

These constraints will eventually need to be supported by the automated update workflow, but we also want to move forward in the short term with upgrading CockroachDB to a supported version. During the April 9 Update Sync, we discussed a procedure to manually roll out new major versions of CockroachDB while retaining the ability to downgrade if needed.

Determination

In version 8 of the control plane, we will introduce a change that sets the cluster.preserve_downgrade_option cluster setting to the current version of CockroachDB. We will then use a tick-tock model, performing the upgrade to each subsequent major version over a series of at least two control plane versions:

  • Tick: Upgrade the version of CockroachDB in the cockroachdb zone to the next major version. Verify on a long-term testing environment that CockroachDB downgrades work properly.

  • Tock: Finalize the upgrade by resetting cluster.preserve_downgrade_option, and then set cluster.preserve_downgrade_option to the current version.

This allows deploying a new CockroachDB major version while retaining the ability to downgrade if needed in a future release, until the Tock for that major version is deployed.
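As a rough illustration of the Tock step, the sketch below builds the two SQL statements involved. `RESET CLUSTER SETTING` and `SET CLUSTER SETTING` are CockroachDB statements; the helper name `tock_statements` is hypothetical, and the sketch does not model waiting for finalization to complete between the two statements, which a real implementation would likely need to do.

```python
import re

SETTING = "cluster.preserve_downgrade_option"


def tock_statements(current_major: str) -> list[str]:
    """Build the SQL for a Tock: finalize, then pin the now-current version."""
    # Only accept values shaped like "22.2", since the value is interpolated
    # into SQL text.
    if not re.fullmatch(r"\d+\.\d+", current_major):
        raise ValueError(f"not a major version: {current_major!r}")
    return [
        # Resetting the setting allows the pending upgrade to finalize.
        f"RESET CLUSTER SETTING {SETTING};",
        # Pinning the current version keeps the next Tick reversible.
        f"SET CLUSTER SETTING {SETTING} = '{current_major}';",
    ]
```

For the v8 release in the schedule, where the setting is presently unset, only the second statement would apply.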

In the immediate term, assuming all goes well, the schedule would be:

  • v8: CockroachDB v22.1 Tock (cluster.preserve_downgrade_option is presently unset, so set it)

  • v9: CockroachDB v22.2 Tick

  • v10: CockroachDB v22.2 Tock

  • v11: CockroachDB v23.1 Tick

  • v12: CockroachDB v23.1 Tock

  • v13: CockroachDB v23.2 Tick

  • v14: CockroachDB v23.2 Tock
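The schedule above follows mechanically from the tick-tock rule, as this small sketch shows. The function name and signature are illustrative only.

```python
def schedule(first_release: int, majors: list[str]) -> list[str]:
    """Lay out Tick/Tock releases, starting with a Tock for the current major."""
    # The first release only pins the version already deployed.
    lines = [f"v{first_release}: CockroachDB v{majors[0]} Tock"]
    release = first_release + 1
    for major in majors[1:]:
        # Each later major version takes two control plane releases.
        lines.append(f"v{release}: CockroachDB v{major} Tick")
        lines.append(f"v{release + 1}: CockroachDB v{major} Tock")
        release += 2
    return lines


for line in schedule(8, ["22.1", "22.2", "23.1", "23.2"]):
    print(line)
```

Each additional major version adds two control plane releases, so reaching v23.2 from v22.1 takes seven releases (v8 through v14) if every Tick is followed immediately by its Tock.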

Two considerations to strive for when implementing this:

  • We shouldn’t add any extra manual steps to the Mupdate process.

  • We shouldn’t write any new code that we’re going to throw away once Nexus is able to drive rolling CockroachDB upgrades on its own.

When preparing to ship a release that includes a new CockroachDB version, we should audit the Technical Advisories associated with that version.

Open Questions

  • Can we upgrade to a new patch release between a Tick and Tock release? (Almost certainly, but we ought to test this.)