For many components, the forthcoming control-plane-driven update feature ([rfd504]) introduces for the first time the possibility that two sides of a network API may be deployed from different releases. This RFD builds on the plan from [rfd421] for Dropshot/Progenitor-based HTTP APIs to ensure that they continue operating correctly across a control-plane-driven update.
The external API is a special case. See [rfd531] for more on that.
Assumptions
As mentioned in [rfd531]:
For the foreseeable future, it will only be supported for customers to upgrade between consecutive releases (e.g., from release 11 to release 12, not from release 10 or earlier directly to release 12). The specific policy does not matter here. What’s important is that there’s an explicit list of supported "from" releases for each upgrade.
Determinations
In addition to supporting updates across consecutive releases for customers, we will support updating to a given commit from recent commits in "main". This is necessary for the dogfood workflow and to be able to use upgrade in development. (We assume it’s not necessary to support upgrading between commits that span new releases — e.g., from some commit in the middle of release 10 to one in the middle of release 11.)
Versioning for every Dropshot/Progenitor API in the system will be managed at runtime in one of three ways, in decreasing order of preference:
Lockstep: when the online update process can update the server and client atomically, both the server and client need only ever support a single release. Such APIs can essentially ignore versioning altogether. These are not common, but they do exist (e.g., the Wicket → Wicketd API).
Server-side-only versioned (sometimes just called "server-side-versioned"): when the online update process can guarantee that all the servers for a given API will be updated before any of its clients, then it’s sufficient to say that the API server in release N will always support its own API as shipped in releases N and N - 1 (and any other supported versions). This is made possible (and, we hope, easy) using the support introduced in Dropshot 0.13.0. Clients don’t need to think about versioning in this case.
Client-side versioned: when there’s no way to ensure that the server side will always be updated first, then servers still need to support the previous release (as in server-side-only versioned), and clients also need to support the API version from the previous release.
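The server-side-only rule above can be modeled as each endpoint carrying a half-open version range. The following is an illustrative sketch only (not the actual Dropshot 0.13 macro syntax): a request tagged with a version sees exactly the endpoints whose range contains it.

```rust
// Sketch: each endpoint declares the version range in which it exists.
// A server in release N serves the union of all still-supported versions.
#[derive(Clone, Copy, PartialEq, PartialOrd)]
struct Version(u32);

#[allow(dead_code)]
struct Endpoint {
    name: &'static str,
    // Introduced in `since`; `removed` is None while the endpoint still exists.
    since: Version,
    removed: Option<Version>,
}

impl Endpoint {
    fn serves(&self, v: Version) -> bool {
        v >= self.since && self.removed.map_or(true, |r| v < r)
    }
}

fn main() {
    // Hypothetical example: "instance_create" was replaced in version 2.
    let old = Endpoint { name: "instance_create", since: Version(1), removed: Some(Version(2)) };
    let new = Endpoint { name: "instance_create_v2", since: Version(2), removed: None };
    // A release-(N-1) client sees only the old endpoint...
    assert!(old.serves(Version(1)) && !new.serves(Version(1)));
    // ...while a release-N client sees only the new one.
    assert!(!old.serves(Version(2)) && new.serves(Version(2)));
}
```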
When client-side-versioning is necessary, Reconfigurator will provide a reliable way for clients to know which API version they should use. This property will change dynamically at runtime.
Note that the choice of lockstep, server-side, and client-side could in principle vary depending on which client of an API we’re talking about. For example, both Nexus and Sled Agent are consumers of the Dendrite DPD API. It’s possible that Nexus would treat it as server-side-versioned while Sled Agent would treat it as client-side. We’d do this if we could always guarantee that Nexus is updated after Dendrite, but not that Sled Agent is always updated after Dendrite.
We’ll build common tooling to ensure that the developer workflow for changing APIs makes several things easy:
It should be easy to create new versions of an API (to effect an incompatible change) while preserving support for previously-shipped (but still supported) APIs.
Automated tests should fail when the server’s API for a previously-shipped API version changes in any meaningful way.
The workflow should account for branching changes (e.g., multiple PRs branched from "main"): when both changes land in "main", they must not produce a logical conflict without also producing an actual merge conflict.
To figure out where we can safely use lockstep and server-side-only versioning, we’ve built tooling to model the API dependency graph. We’ll use this to decide the sequence in which components will be updated as part of an online update. That will tell us which ones can use server-side-only versioning (because they’re updated before their dependents).
We will augment this tooling to try to ensure that if a new API or API dependency is added which violates the assumptions about update order or which APIs can be lockstep or server-side-versioned, then automated tests will fail.
Open Questions
Switch zone and host OS APIs
The ls-apis tooling appears complete enough to define an update ordering for most components, with one big exception: it’s not yet clear which API dependencies need to be client-side-managed vs. server-side-only within the host OS and switch zone. The existing tooling for identifying API dependencies is not by itself sufficient to determine this: it’s only aware of which Rust packages depend on which other APIs; it doesn’t know which instances of those other APIs they use nor which API calls they use. As an example of where this falls short: ls-apis knows that Dendrite DPD depends on MGS. These are both deployed in the switch zone. If Dendrite always talks to the MGS in the same switch zone, then this dependency could be managed in lockstep. If Dendrite DPD might talk to the MGS in the other switch zone, then this dependency must be client-side-managed, because whichever switch zone we update first, its Dendrite might talk to the other (older) one’s MGS.
There are many similar interdependencies among services in the host OS and switch zone and these need to be better characterized to figure out the best update sequence and then which APIs require client-side management.
Even harder is enforcement: even if we can say that the Dendrite → MGS dependency is server-side-managed, if the tool isn’t aware that that’s safe only because they’re always deployed together, then it won’t be able to detect if someone accidentally breaks things by having Dendrite talk to the other switch zone’s MGS.
Client-side metadata
How will Reconfigurator know what version(s) of which APIs are exposed by which components, in a way that doesn’t make it easy for this to get out of date when people change the API? See [_reconfigurators_role_in_client_side_versioning].
Semantic breakage
How big a problem is it that this doesn’t address semantic breakage? How could we address this?
Developer summary (work in progress)
When you’re adding a new API or considering how an existing one will behave across an upgrade, you need to ask: how do producers and consumers of this API relate to other components in the whole system’s API dependency graph? This is a pretty big question!
First of all, if an API is not changing, we do not need to do anything immediately. While we may proactively fix up APIs for online upgrade that we know we’ll need to change, we can potentially defer this work for many APIs that don’t change often.
Given that an API is changing and you’re deciding how to handle upgrade: the next easy case is that we can generally ignore test suites, most developer tools, and anything else that isn’t shipped on real systems. We can also ignore dependencies from clients that are deployed atomically with the server (e.g., Wicket → Wicketd). If these are the only consumers, use a lockstep approach. With lockstep, you just use the existing tooling to generate an OpenAPI spec and client and that’s pretty much it.
If lockstep is not possible, you want to use a server-side-managed approach if at all possible. The path for doing this is still TBD, but the intent is that you can use the same automation that’s used elsewhere to manage a handful of supported API versions at a time. Clients don’t have to think about versioning in this case. If you’re not sure whether you can use this, try it: if you configure it this way in the ls-apis metadata and it can’t work, the tooling should tell you. If you’re still not sure, ask the update team.
If a server-side-only approach is not possible, you’ll need to use client-side versioning. Realistically, we may not have a golden path for this for some time. You might want to consider:
Can we avoid changing the API so much? For example, if A depends on B and B also depends on A, one of them has to be client-side-versioned. Suppose that’s the B → A API. One way to avoid having to change that API very much is to structure things so that the only operation in the B → A API is an operation that tells A to go fetch something from B. That operation would have very little input or output and could be kept very stable, allowing all the changes to happen in the A → B API.
Relatedly, we could consider [_try_not_to_break_apis]. That section explains why that approach can be really problematic, but it’s not necessarily unreasonable for APIs that we expect to change very rarely.
Can we remove the API altogether (or some other API that’s preventing this one from being made server-side-versioned)? This may sound silly but there are a few APIs we’ve discussed and can remove and may do so for exactly this reason (e.g., Oximeter registration).
If these aren’t options, changes to these APIs may become blocked on the work to fully support client-side APIs.
If in doubt, please reach out to the update team!
Discussion
The disaster we’re trying to avoid
The root goal here is to ensure that the control plane can drive the update process from release N to release N + 1 without help from Oxide support. That necessarily means that at least some parts of the control plane must be working throughout the duration of the update process. Being a distributed system, they cannot all be updated atomically. Thus, some amount of dealing with drift between client and server versions is essential.
Concretely, the scenario we’re trying to avoid is something like:
The customer kicks off an upgrade from release N to release N + 1.
The system starts updating individual components to release N + 1.
Partway through, the upgrade gets stuck. There are ways this could happen pretty easily:
Some component that’s already been updated to release N + 1 is calling some component that’s still at release N but it’s trying to use the API from release N + 1. This might happen if a developer didn’t realize that these components could be updated in that order. The best case here is that the older component rejects the request that it doesn’t understand and things grind to a halt.
Some component that’s not yet been updated to release N + 1 is calling some component that’s already at release N + 1 but it’s trying to use the API from release N, but support for that release was removed (or broken) in release N + 1 because a developer didn’t realize it could be called from an older client. Similarly, the best case here is that the newer component rejects the request it doesn’t understand and things grind to a halt.
The customer has to call Oxide support to diagnose and fix this.
It’s pretty likely that we’ll need low-level interfaces (e.g., bash shell, omdb, customer blueprints) to fix this kind of problem.
It’s pretty likely that we’ll actually need a new release to fix this problem. The new software will need to be updated to account for this scenario. This potentially adds days to the time required to resolve the problem.
How do we avoid this? There are so many different components and API dependencies and it literally only takes one bug (in the wrong place) to bring the upgrade to a halt. We can’t rely on people being perfect. This RFD is aimed at:
providing structure so that the obvious way to evolve an API is also the one that works at runtime
providing automation so that we can identify before changes land on "main" if they would introduce problems like the above
Specifically: we’d hope that the automation around identifying API dependencies and which are supposed to be server-side-versioned and which are supposed to be client-side versioned would help us identify a case where somebody’s depending on an API assuming server-side versioning but the client might be updated before the server.
Reconfigurator’s role in client-side versioning
With client-side versioning, client programs incorporate more than one distinct Progenitor-generated client — one for each version of the API that they support talking to. In an ideal world, when upgrading the system from release N to release N + 1, client programs would use the client from release N until all instances of the server are at release N + 1. At that point, they would switch to using the client from release N + 1.
To help clients manage this, we propose:
When creating a new version of a client-side-managed API, developers should be able to define metadata for Reconfigurator that says: when all instances of component X are updated, configure client programs to use the updated Progenitor client.
Client programs make periodic requests to Nexus to fetch what version of the client API they should be using. In cases where the program needs to start even if Nexus is offline (e.g., Sled Agent), the component should persist this information to disk.
The details here are a little tricky because Reconfigurator needs to know what version(s) of which API(s) are served by which components. In an ideal world, this would be embedded into that component’s artifact metadata (discussed in [rfd421]) rather than hardcoded in Reconfigurator or a separate piece of metadata that could get out of date. With this information, Reconfigurator could easily keep track of the maximum API version supported by all instances of an API and make this available to client programs that need to know which version’s client package to use.
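The computation Reconfigurator needs here reduces to taking a minimum: the version every client can safely use is the minimum of the maxima reported by each running server instance. A minimal sketch, assuming each instance reports a single integer maximum:

```rust
// Sketch: given the maximum API version each running instance of a server
// reports (ideally derived from its artifact metadata), the version that is
// safe for every client to use is the minimum of those maxima.
fn max_common_version(instance_maxima: &[u32]) -> Option<u32> {
    instance_maxima.iter().copied().min()
}

fn main() {
    // Mid-update: two instances already speak version 3, one is still at 2,
    // so clients must keep using version 2.
    assert_eq!(max_common_version(&[3, 2, 3]), Some(2));
    // Once the last instance updates, clients can switch to 3.
    assert_eq!(max_common_version(&[3, 3, 3]), Some(3));
}
```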
Developer workflow for modifying APIs
Today, in the world of offline, MUPdate-based update: when you want to change most of our APIs, you do this:
Make the API change in the server implementation.
Regenerate the API spec (in Omicron, this is just cargo xtask openapi generate).
Make the corresponding change in the client implementation.
Repeat steps 1-3 as needed to iterate.
Do it all in one pull request and land the whole thing in one commit on "main".
With lockstep APIs, we can continue to do the same thing.
With server-side-managed APIs, the process is slightly more complicated:
You make the API change in the server implementation while preserving the old version of the API. In practice, this means:
New API endpoints must be annotated with version = NEW_VERSION..
Removed API endpoints must be annotated with version = ..NEW_VERSION
Breaking changes (e.g., newly added required arguments) must be modeled as removal of the old endpoint and addition of a new one.
Regenerate the API spec. This will likely look a little different than above, but the details are TBD.
Make the corresponding change in the client implementation.
Optional: add tests of the server that use the old client.
Repeat steps 1-3 as needed to iterate.
Do it all in one pull request and land the whole thing in one commit on "main".
The hope is to keep this process almost as easy as it is for lockstep APIs.
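The version = NEW_VERSION.. and version = ..NEW_VERSION annotations behave like ordinary half-open Rust ranges. The sketch below illustrates only the semantics (it is not the Dropshot macro itself): a removed endpoint and its replacement meet exactly at NEW_VERSION, so exactly one of them exists at any given version.

```rust
// Sketch of the annotation semantics using plain Rust ranges:
// `version = NEW_VERSION..` means "exists from NEW_VERSION onward";
// `version = ..NEW_VERSION` means "removed as of NEW_VERSION".
const NEW_VERSION: u32 = 2;

fn main() {
    let added = NEW_VERSION..; // new endpoint: NEW_VERSION and later
    let removed = ..NEW_VERSION; // old endpoint: versions before NEW_VERSION

    // A breaking change is modeled as removal plus addition, and exactly
    // one of the two endpoints exists at any given version.
    for v in 0..5 {
        assert!(added.contains(&v) != removed.contains(&v));
    }
}
```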
With client-side-managed APIs, the process is a fair bit more complicated. You need to do all of the above, plus:
create a new Rust package for the new client version
update client programs to depend on the new Rust package
update client programs to periodically query Nexus for which version they should be using
This information may need to be stored persistently if the service must be capable of starting up when Nexus is not yet available.
update client program code paths that use the API to pick the Rust package conditionally based on the currently-configured version
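The conditional dispatch in the last step can be sketched as an enum wrapping one generated client per supported version. All names here are hypothetical stand-ins for the per-version Progenitor packages; the configured version would come from the Nexus query described above.

```rust
// Sketch with hypothetical types: a client-side-versioned consumer wraps one
// Progenitor-generated client per supported API version and picks a variant
// based on the currently-configured version.
struct ClientV1; // stands in for a hypothetical `dpd_client_v1::Client`
struct ClientV2; // stands in for a hypothetical `dpd_client_v2::Client`

enum VersionedClient {
    V1(ClientV1),
    V2(ClientV2),
}

impl VersionedClient {
    fn new(configured_version: u32) -> VersionedClient {
        if configured_version >= 2 {
            VersionedClient::V2(ClientV2)
        } else {
            VersionedClient::V1(ClientV1)
        }
    }

    fn version(&self) -> u32 {
        match self {
            VersionedClient::V1(_) => 1,
            VersionedClient::V2(_) => 2,
        }
    }
}

fn main() {
    assert_eq!(VersionedClient::new(1).version(), 1);
    assert_eq!(VersionedClient::new(2).version(), 2);
}
```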
Testing is also more complicated. You need some way of testing client programs with the old server. The details here are TBD. (You can see why server-side-only versioning is much preferred.)
Developer workflow for changing APIs or API dependencies
Developers can freely create new Dropshot servers, Progenitor-based clients, or add new dependencies between existing components. We need to be sure that when this happens, it doesn’t break online upgrade. The main way this would happen would be that someone adds a new server-side-managed dependency from component A to component B, even though the upgrade system doesn’t always update B before A. But since this information is driven by metadata, we actually want to detect a lot more conditions:
a Dropshot server is added with no associated API metadata
a new Progenitor client is added for an API with no associated metadata
the metadata reflects that a dependency is managed lockstep, but in fact it’s used across components in different deployment units
the metadata reflects that a dependency is managed server-side, but that would imply components are updated in a different order than they are
the metadata reflects that a dependency is managed client-side, but that doesn’t match reality
ls-apis does some of these things today. There’s a detailed README for ls-apis. But the tool is limited. Notably:
It cannot tell if its metadata about which Rust packages are deployed in which deployment units is correct. (This could be made better by using package-manifest.toml to derive deployment units instead of hardcoding them.)
It does not know whether clients are correctly aware of the versions they need to support.
As of this writing, the version in "main" is not aware of lockstep, server-side, or client-side management. A first cut at this is implemented in omicron#7138, but it ignores "lockstep" altogether (which is probably okay) and it also doesn’t support different dependents of an API making a different choice about whether to treat the API as server-side-managed or client-side-managed.
Testability
We expect it to be far easier to test server-side-managed APIs than client-side-managed ones. Since HTTP is largely stateless, for a server-side-managed API to work, it just needs to correctly handle requests from older clients. The logic for doing this is generally pretty straightforward (e.g., populate some new required field with some default value). For testing, such requests can be generated using clients generated from the older OpenAPI spec. And we necessarily have those API specs around. More exhaustive testing might set up an entire system with client programs from the older release, but this is probably overkill for many changes.
By contrast, for client-side-managed APIs, clients' control flow is potentially very different depending on which server version they’re talking to. It’s not easy to stand up a server running the old software. It could be mocked, but it’s not trivial to faithfully mock multi-request sequences. It’s not clear exactly what level of test coverage we’ll want for these programs. This is a major reason why server-side versioning is preferred.
Alternatives considered
Try not to break APIs
A common first approach to this problem is to simply never break APIs. Instead, you find ways of expressing all desired changes in a backwards-compatible way. Want to add some new field? It must be optional to deal with the fact that old clients might not be sending it.
There are several problems with this:
It can be hard to phrase everything you might want to change in a backwards-compatible way.
It can be hard to know that you’ve successfully done so (i.e., that some change really is backwards-compatible).
A gap grows between what’s expressible in the API and what can actually happen in practice. For example: if two releases add new "required" fields (that are optional for compatibility), the API spec makes it look like zero, one, or both may be specified. The logic in the server becomes exponentially complex as it accumulates conditions on these inputs, even though in practice it may be impossible to have one field without the other (because they were added to the clients in a particular order).
You have to live with awkward or confusing APIs forever. Or: it’s hard to know when you can remove support for older behavior.
The end result is that over time, the server-side code becomes hard to reason about (because few of the input constraints are encoded in the type system). Development becomes slow. Further, changes become risky at deployment time. These are all the things we came to Rust and developed Dropshot/Progenitor to avoid.
You could view the approach proposed here and in [rfd421] as first-classing this basic idea in a way that avoids these problems. Instead of trying to phrase all changes in a backwards-compatible way and creating complex logic to figure out what to do, we first-class the allowed sets of inputs ("versions") and use real types to manage them. We get the benefits of static type checking and we can also know when it’s safe to remove old stuff (because there will be no clients for that API version).
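The contrast can be made concrete with two hypothetical input types. In the "never break" style, fields added across releases all become optional and the impossible combinations must be handled; in the versioned style, each version's input is a real type and the impossible states are unrepresentable.

```rust
// Sketch: two hypothetical fields added in successive releases.
// "Never break" style: both are Option, so four combinations typecheck even
// though "field_b without field_a" can never occur in practice.
#[allow(dead_code)]
struct BackwardsCompatInput {
    field_a: Option<u32>, // added in release N
    field_b: Option<u32>, // added in release N + 1
}

// Versioned style: each version's input carries exactly the fields that
// version requires; the invalid combination cannot be constructed.
#[allow(dead_code)]
enum VersionedInput {
    V1,                                // neither field existed yet
    V2 { field_a: u32 },               // field_a required
    V3 { field_a: u32, field_b: u32 }, // both required
}

fn main() {
    let input = VersionedInput::V3 { field_a: 1, field_b: 2 };
    match input {
        // The type system guarantees both fields are present here.
        VersionedInput::V3 { field_a, field_b } => assert!(field_a == 1 && field_b == 2),
        _ => unreachable!(),
    }
}
```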
Have clients ask the server what version it’s running
Instead of having Reconfigurator tell clients what version of an API to use, you can imagine having clients ask the server and then use the corresponding client. There are three major problems with this approach:
Load balancers often present a single TCP connection to their clients that multiplexes requests to multiple servers. In that case, just because the server reported that it speaks version X doesn’t mean the next request you send over the same TCP connection will understand version X. This is not a huge problem right now because we don’t use load balancers internally and we could implement load balancers to be aware of problems like this, but it’s definitely a downside of this approach.
There’s an inherent time-of-check-to-time-of-use (TOCTOU) race between asking the server and trying to use the result. Servers can be upgraded — or even rolled back — immediately after answering such a request. It’s admittedly unlikely that this could happen without breaking the TCP connection unless a load balancer is in use, too.
This information needs to be correctly persisted not to disk, but across different memory contexts. For example: suppose some new instance feature depends on a new Sled Agent API version. Nexus may want to check this at instance creation time, long before it’s even picked a Sled Agent on which to place the instance.
Even if we didn’t care about problems 1-2, this third problem highlights that the conditional code would likely be easier to reason about and test if the condition could be set and tested independently of the dependency. That is: by using something like a feature flag and decoupling the conditional code that depends on the flag from how the flag gets set, both parts can be thought about and tested independently.
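The feature-flag decoupling described above can be sketched as a code path that takes the flag as an argument, so the conditional logic can be tested both ways independently of whatever mechanism (Reconfigurator, persisted config, etc.) sets the flag at runtime. The names below are hypothetical.

```rust
// Sketch: the conditional code path depends only on the flag value, not on
// how the flag was set. Both branches can be exercised directly in tests.
fn placement_request(new_feature_enabled: bool) -> &'static str {
    if new_feature_enabled {
        "create-instance-v2" // hypothetical new Sled Agent API call
    } else {
        "create-instance" // hypothetical old call
    }
}

fn main() {
    // The flag value would come from Reconfigurator at runtime; here we just
    // test both paths.
    assert_eq!(placement_request(false), "create-instance");
    assert_eq!(placement_request(true), "create-instance-v2");
}
```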
Limitations
The approach here is almost exclusively concerned with API syntax. It does not address the question of breaking behavioral changes that do not break the API.
Security Considerations
None known.