RFD 479
Dropshot API traits

Introduction

At Oxide, we use [dropshot] application programming interfaces (APIs) extensively to communicate between various parts of the system. Currently, the only way to define APIs is via an approach that the rest of the RFD calls function-based Dropshot APIs. With this method, individual endpoint functions[1] are annotated with a procedural macro. Endpoints can be combined and an API built out of them. This API can then be used to start an HTTP or HTTPS server.

For example, consider a requirement for an HTTP API that gets and sets a counter. This code implements that API, and defines two functions:

  1. start_server to start an HTTP server.

  2. generate_openapi to generate OpenAPI specs for clients.

use dropshot::endpoint;
// -- other imports elided

#[derive(Deserialize, Serialize, JsonSchema)]
struct CounterValue {
    counter: u64,
}

// Server context shared across functions.
type ServerContext = /* ... */;

/// Gets the counter value.
#[endpoint { method = GET, path = "/counter" }]
async fn get_counter(
    rqctx: RequestContext<ServerContext>,
) -> Result<HttpResponseOk<CounterValue>, HttpError> {
    // ...
}

/// Writes a new counter value.
#[endpoint { method = PUT, path = "/counter" }]
async fn put_counter(
    rqctx: RequestContext<ServerContext>,
    update: TypedBody<CounterValue>,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
    // ...
}

fn create_api() -> ApiDescription<ServerContext> {
    let mut api = ApiDescription::new();
    api.register(get_counter).unwrap();
    api.register(put_counter).unwrap();

    api
}

// Starts a server.
async fn start_server() {
    // Build a description of the API.
    let api = create_api();
    // Use `api` to start a server...
}

// Generates an OpenAPI spec.
fn generate_openapi() {
    let api = create_api();
    let openapi = api.openapi("Counter Server", "1.0.0");
    // ...
}

This RFD introduces an alternative way to define Dropshot APIs, called Dropshot API traits[2].

Current status

Updated 2024-06-13.

  • The Dropshot part of the work is mostly done, and some of it is waiting on review.

  • Within Omicron, an example that converts Nexus’s internal API over is at Omicron PR #5844. This also includes tooling which automatically generates and manages the APIs using cargo xtask.

Motivation

The idea to use a trait to define Dropshot APIs was initially proposed by Sean Klein in Dropshot PR #5247. Much of this section is an expanded form of this post.

Issues with function-based APIs

Function-based Dropshot APIs are straightforward to write and understand, and they’ve overall been very successful at getting us to a shipping product. But along the way, we’ve identified a few recurring problems, which are generally related to tight coupling between the API and the implementation. Specifically: function-based Dropshot APIs require that the implementation be compiled and run before the OpenAPI spec can be generated.

Alternative implementations of the same API

The most obvious issue with the tight coupling is the fact that it is much harder to provide a second implementation of the same API. For example, one may wish to provide a fake implementation of an API that only operates in-memory. With tight coupling, this becomes significantly more difficult.

(It is possible to use a trait behind the functions to achieve this goal; see [traits_at_home]. This RFD essentially proposes adding native support for this approach to Dropshot.)

Slow API iteration cycles

For large servers such as Nexus within [omicron], it can be quite slow to build the implementations. This slows down API iteration cycles considerably. As an example, consider the "Nexus internal API"--the interface between Nexus and internal services. Consider a simple change to add a new endpoint. Generating the OpenAPI specification involves:

  1. Define a function, annotate it with the #[endpoint] macro, and include it in the API description.

  2. Rebuild the OpenAPI spec, typically by running a test that ensures the OpenAPI spec is up-to-date. For example, EXPECTORATE=overwrite cargo nextest run -p omicron-nexus openapi.

This process can take a while! On the author’s workstation using the mold linker, against Omicron commit bca1f2d, this process takes somewhere between 18–60 seconds (depending on how incremental rustc was).

The test also catches some OpenAPI correctness and lint issues, such as:

  • Colliding HTTP routes. For example, Dropshot rejects an API description which has both /task/{task_id}/status and /task/activate endpoints, since the task_id path parameter could itself be activate.

  • Parameters that aren’t in snake_case. Conventionally, OpenAPI schemas use snake_case parameters.

Fixing each such issue would require going through another cycle.
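To make the first lint concrete, here is a hedged sketch of two such colliding endpoints; the names TaskPath and TaskStatus are hypothetical types, not taken from Omicron:

#[derive(Deserialize, JsonSchema)]
struct TaskPath {
    task_id: String,
}

#[endpoint { method = GET, path = "/task/{task_id}/status" }]
async fn task_status(
    rqctx: RequestContext<ServerContext>,
    path: Path<TaskPath>,
) -> Result<HttpResponseOk<TaskStatus>, HttpError> {
    // ...
}

#[endpoint { method = POST, path = "/task/activate" }]
async fn task_activate(
    rqctx: RequestContext<ServerContext>,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
    // ...
}

// Registering both into one ApiDescription fails: at the second path
// segment, the literal "activate" conflicts with the variable
// "{task_id}", so Dropshot rejects the ambiguity up front.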

Circular dependencies

In Nexus, we have many circular dependencies between services; we effectively "break" these cycles with the generation of the OpenAPI document as a static artifact. However, this can be monstrously complex to navigate as, typically, implementations need to compile before a new OpenAPI document may be generated. This becomes particularly problematic when making changes that are incompatible with the old schema.

Note
"Incompatible" encompasses a wider range of changes than one might think at first. It includes many "append-only" changes like adding a new variant to an enum, or a new request parameter.

For example, Nexus and sled-agent depend on each other, since both of them must be able to call into each other. (Nexus instructs sled-agent to run various operations, and sled-agent reports statuses to the Nexus internal API.)

Current Nexus and sled-agent dependency graph
graph LR
    sled_agent_impl([sled-agent])
    nexus_impl([nexus])
    sled_agent_client([sled-agent client])
    nexus_client([nexus client])

    subgraph sled-agent
        sled_agent_client
        sled_agent_impl
    end

    subgraph nexus
        nexus_client
        nexus_impl
    end

    sled_agent_client -->|is generated by| sled_agent_impl
    sled_agent_impl -->|uses| nexus_client
    nexus_impl -->|uses| sled_agent_client
    nexus_client -->|is generated by| nexus_impl

Now, consider this example of a type shared across both APIs. Omicron’s update system has a list of well-known update artifact kinds, defined in omicron-common:

pub enum KnownArtifactKind {
    GimletSp,
    GimletRot,
    Host,
    Trampoline,
    // ...
}

This KnownArtifactKind enum has made its way extensively throughout Omicron, including the Nexus and Sled-Agent APIs--which, as discussed above, have a circular dependency.

How hard is it to add a new artifact kind? The process is documented within Omicron’s source code:

Adding a new KnownArtifactKind
// Adding a new update artifact kind is a tricky process. To do so:
//
// 1. Add it here.
//
// 2. Add the new kind to <repo root>/{nexus-client,sled-agent-client}/lib.rs.
//    The mapping from `KnownArtifactKind::*` to `types::KnownArtifactKind::*`
//    must be left as a `todo!()` for now; `types::KnownArtifactKind` will not
//    be updated with the new variant until step 5 below.
//
// 3. [Add the new KnownArtifactKind to the database.]
//
// 4. Add the new kind and the mapping to its `update_artifact_kind` to
//    <repo root>/nexus/db-model/src/update_artifact.rs
//
// 5. Regenerate the OpenAPI specs for nexus and sled-agent:
//
//    ```
//    EXPECTORATE=overwrite cargo nextest run -p omicron-nexus -p omicron-sled-agent openapi
//    ```
//
// 6. Return to <repo root>/{nexus-client,sled-agent-client}/lib.rs from step 2
//    and replace the `todo!()`s with the new `types::KnownArtifactKind::*`
//    variant.
//
// See https://github.com/oxidecomputer/omicron/pull/2300 as an example.

There are several steps involved in such a simple case--and even for experienced Omicron developers, a full cycle can take up to 30 minutes. More complex changes can take longer, and each time a spec is changed, it’s another 30+ minute cycle. This slows down development velocity considerably.

But if we decouple the API from the implementation, and only require the API to generate the corresponding clients, then the dependency graph becomes:

Proposed Nexus and sled-agent dependency graph
graph LR
    sled_agent_api([sled-agent API])
    sled_agent_impl([sled-agent])
    sled_agent_client([sled-agent client])
    nexus_api([nexus API])
    nexus_impl([nexus])
    nexus_client([nexus client])

    subgraph sled-agent
        sled_agent_client
        sled_agent_api
        sled_agent_impl
    end

    subgraph nexus
        nexus_client
        nexus_api
        nexus_impl
    end

    %% This order of edge definitions creates a symmetric graph
    sled_agent_client -->|is generated by| sled_agent_api
    sled_agent_impl -->|implements| sled_agent_api
    sled_agent_impl -->|uses| nexus_client
    nexus_impl -->|implements| nexus_api
    nexus_client -->|is generated by| nexus_api
    nexus_impl -->|uses| sled_agent_client

In other words, if we put the APIs in separate crates from the implementations, the dependency cycle no longer exists and making changes becomes much easier. Code organization also becomes more comprehensible, since it’s easier to see where the interfaces are defined without having to dig around within the implementations.

In reality…​

Actually, as things are currently set up, the nexus API depends on the sled-agent client so it can implement From for those types.

  • But this isn’t an inherent circular dependency! It’s just one that avoids Rust orphan rule issues. The From impls can be replaced with free-standing functions that do the conversions in another crate, as sketched after this list.

  • And even if this cycle is retained, the scope of circularity is more limited. For example, with KnownArtifactKind, generating the new OpenAPI schema no longer requires dealing with database code first.
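For example, a minimal sketch of such a free-standing conversion; the nexus_api and sled_agent_client crate and type paths here are hypothetical:

// In a third crate that depends on both sides: neither the From trait
// nor the two types are local here, so `impl From` is unavailable, but
// a plain function works fine. (All names in this sketch are
// hypothetical.)
pub fn to_client_kind(
    kind: nexus_api::KnownArtifactKind,
) -> sled_agent_client::types::KnownArtifactKind {
    match kind {
        nexus_api::KnownArtifactKind::Host => {
            sled_agent_client::types::KnownArtifactKind::Host
        }
        // ... remaining variants map one-to-one
    }
}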

Merge conflicts

(This is an extension of the previous section, but is worth calling out as an addendum.)

If there are merge conflicts between two changes to an OpenAPI schema, developers are forced to deal with a schema file on disk that may be out of sync with either side. If those conflicts aren’t fixable by hand, it may require substantial work just to make the implementations compile so that the updated schema can be generated. After that, it’s likely that the updated schemas will again stop the implementations from compiling, requiring a second go-around.

If generating the schema doesn’t require getting the whole implementation to compile, much of this pain can be avoided.

  • Common schema editing operations in large repositories like [omicron] become faster by at least an order of magnitude, and significantly less painful, as documented in [impact].

Guide-level description

A Dropshot API trait consists of a Rust trait annotated with a new procedural macro called dropshot::api_description. API traits decouple the API definition from one or more implementations.

HTTP and OpenAPI definitions

Continuing with the counter API example in [introduction], we first define a trait that looks like:

#[derive(Deserialize, Serialize, JsonSchema)]
struct CounterValue {
    counter: u64,
}

#[dropshot::api_description]
pub trait CounterApi {
    /// The type representing the server's context.
    type Context;

    /// Gets the counter value.
    #[endpoint { method = GET, path = "/counter" }]
    async fn get_counter(
        rqctx: RequestContext<Self::Context>,
    ) -> Result<HttpResponseOk<CounterValue>, HttpError>;

    /// Writes a new counter value.
    #[endpoint { method = PUT, path = "/counter" }]
    async fn put_counter(
        rqctx: RequestContext<Self::Context>,
        update: TypedBody<CounterValue>,
    ) -> Result<HttpResponseUpdatedNoContent, HttpError>;
}

OpenAPI schema generation can be performed using just the API described above, without an actual implementation at hand. The dropshot::api_description macro creates a new module called counter_api, which has a stub_api_description function in it.

fn generate_openapi() {
    // The `stub_api_description` function returns a description which
    // can only be used to generate an OpenAPI schema. Importantly, it does not
    // require an actual implementation.
    let api = counter_api::stub_api_description().unwrap();
    let openapi = api.openapi("Counter Server", "1.0.0");
    // ...
}

Implementation

An implementation will look like:

// This is a never-constructed type that exists solely to name the
// `CounterApi` impl -- it can be an empty struct as well.
enum MyImpl {}

// Server context shared across functions.
struct ServerContext {
    // ...
}

impl CounterApi for MyImpl {
    type Context = ServerContext;

    async fn get_counter(
        rqctx: RequestContext<Self::Context>,
    ) -> Result<HttpResponseOk<CounterValue>, HttpError> {
        // ...
    }

    async fn put_counter(
        rqctx: RequestContext<Self::Context>,
        update: TypedBody<CounterValue>,
    ) -> Result<HttpResponseUpdatedNoContent, HttpError> {
        // ...
    }
}

And the start_server and generate_openapi functions are defined using a new counter_api::api_description function:

async fn start_server() {
    // The `api_description` function returns an ApiDescription<ServerContext>.
    let api = counter_api::api_description::<MyImpl>().unwrap();
    // Use `api` to start a server...
}
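For completeness, here is a hedged sketch of what the elided startup inside start_server might look like, using Dropshot’s HttpServerStarter; the config and logger handling are illustrative, not prescribed by this RFD:

use dropshot::{ConfigDropshot, ConfigLogging, ConfigLoggingLevel, HttpServerStarter};

async fn start_server() -> Result<(), String> {
    // Build the API description from the trait implementation.
    let api = counter_api::api_description::<MyImpl>()
        .map_err(|e| e.to_string())?;

    // Illustrative logger setup.
    let log = ConfigLogging::StderrTerminal {
        level: ConfigLoggingLevel::Info,
    }
    .to_logger("counter-server")
    .map_err(|e| e.to_string())?;

    let context = ServerContext { /* ... */ };

    // Start the server and run it to completion.
    HttpServerStarter::new(&ConfigDropshot::default(), api, context, &log)
        .map_err(|e| e.to_string())?
        .start()
        .await
}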

Choosing between functions and traits

Prototyping: If you’re prototyping with a small number of endpoints, functions provide an easier way to get started. The downside to traits is that endpoint signatures are defined at least twice: once in the trait and once in the implementation.

Small services: For a service that is relatively isolated and quick to compile, traits and functions are both good options.

APIs with multiple implementations: For services that are large enough to have a second, simpler implementation (potentially of only parts of them), a trait is best.

Here’s an archetypal way to organize code for a large service with a real and an in-memory test implementation. Each rounded node represents a binary and each rectangular node represents a library crate (or more than one for "logic").

flowchart LR
    openapi_generator([OpenAPI generator]) --> api
    types[base types]
    stateless_logic[stateless logic]
    api[Dropshot API trait] --> types
    test_lib --> stateless_logic
    test_lib --> api
    top_level_lib --> api
    stateless_logic --> types
    stateful_logic --> stateless_logic

    subgraph production_impl [real implementation]
    production_binary([production binary]) --> top_level_lib
    top_level_lib[top-level library]
    top_level_lib --> stateful_logic
    stateful_logic[stateful logic]
    end

    subgraph test_impl [test implementation]
    test_bin([test binary]) --> test_lib
    test_lib[test library]
    end

Impact

With Dropshot API traits, all of the above operations become at least an order of magnitude faster.

Omicron PR #5653 contains a prototype of this RFD, including a conversion of the Nexus internal API.

  • With this prototype, repeating the test in [slow_iteration] with EXPECTORATE=overwrite cargo nextest run -p nexus-internal-api openapi goes down from 18 seconds to just 1.5 seconds.

  • The time taken from adding a new KnownArtifactKind to generating a new OpenAPI schema also becomes much faster, going from 20+ minutes to taking under a minute.

Determinations

There’s a surprising amount of flexibility in implementation--much more than with function-based APIs. Providing a good user experience requires carefully making several decisions that merit some discussion.

Do we really need to change Dropshot?

There actually is a way to achieve this kind of decoupling with function-based Dropshot APIs.

  1. Define a trait, say MyApiBackend, with methods corresponding to each endpoint.

  2. Use an Arc<dyn MyApiBackend> as the server context.

  3. Each function calls into Arc<dyn MyApiBackend>.

An example of this approach is in Omicron’s installinator-artifactd.
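A minimal sketch of this workaround, reusing the CounterValue type from [introduction]; the trait and names below are illustrative, not taken from installinator-artifactd:

use std::sync::Arc;
use async_trait::async_trait;

// The backend trait: one method per endpoint.
#[async_trait]
pub trait MyApiBackend: Send + Sync {
    async fn get_counter(&self) -> Result<CounterValue, HttpError>;
}

// The server context is a boxed trait object.
type ApiContext = Arc<dyn MyApiBackend>;

// The endpoint function just forwards to the backend.
#[endpoint { method = GET, path = "/counter" }]
async fn get_counter(
    rqctx: RequestContext<ApiContext>,
) -> Result<HttpResponseOk<CounterValue>, HttpError> {
    let value = rqctx.context().get_counter().await?;
    Ok(HttpResponseOk(value))
}

// Each implementation then provides its own `impl MyApiBackend`.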

But there’s a fair amount of extra boilerplate involved with this approach. Specifically, if there are N implementations, each endpoint needs to be specified 2+N times:

  1. As a function.

  2. As a trait method.

  3. As a method implementation in each of the N trait impls.

With native support for Dropshot API traits, this becomes 1+N times[3].

This approach also necessitates use of dynamic dispatch with async_trait, which isn’t a great user experience as documented in [trait_mechanics]. Native API traits result in much nicer rust-analyzer integration and error reporting.

Source of truth

Traditionally, OpenAPI specifications are hand-written, as a kind of interface definition language (IDL). Server and client interfaces may be generated from the IDL, but may also be written separately (and as a result may not be in sync).

With function-based Dropshot APIs, the source of truth is the collection of endpoint and channel invocations written in Rust. The OpenAPI schema is generated from that.

As an alternative to API traits, we could choose to switch to something more like the traditional workflow. But in Dropshot we deliberately decided to use annotations in Rust as the source of truth (see RFD 10 for some discussion), which has worked quite well.

Moreover, there are a few practical issues with any switch today:

  1. We don’t have enough expertise to write OpenAPI schemas directly.

  2. OpenAPI schemas are quite complicated with many optional features, while the way we define endpoints in Rust is generally straightforward.

  3. Converting existing functions to the traditional workflow is costly and time-consuming, while converting even complicated function-based APIs to traits is quite simple.

We continue to believe in Dropshot’s general approach, so the RFD proposes that the source of truth for schemas continues to be in Rust.

The dropshot::api_description macro

Core to Dropshot API traits is the dropshot::api_description macro, which accepts a trait as input (and nothing else).

Continuing with the above CounterApi example, the dropshot::api_description macro does the following:

  1. Gathers the list of endpoints specified in the CounterApi trait.

  2. Verifies that the trait is in the right form and its methods are syntactically correct (discussed below), reporting any errors found.

Then, as its output, the macro generates:

  1. The CounterApi trait mostly as-is, with some minor changes discussed in [auto_trait_bounds] below.

  2. Semantic checks as discussed in [endpoint_verifications].

  3. A support module, described immediately below.

The support module

With function-based Dropshot APIs, the way to turn a set of endpoints into the API is to construct an ApiDescription struct. For example:

fn create_api() -> ApiDescription<ServerContext> {
    let mut api = ApiDescription::new();
    api.register(get_counter).unwrap();
    api.register(put_counter).unwrap();

    api
}

Given a trait and an implementation, there needs to be a way to turn it into an ApiDescription as well. In order to do so, the proc macro generates a support module.

For CounterApi, the macro produces:

#[automatically_derived]
pub mod counter_api {
    use dropshot::{ApiDescriptionBuildError, StubContext};

    pub fn api_description<T: CounterApi>(
    ) -> Result<ApiDescription<T::Context>, ApiDescriptionBuildError> {
        // ... automatically generated code to make an ApiDescription
    }

    pub fn stub_api_description(
    ) -> Result<ApiDescription<StubContext>, ApiDescriptionBuildError> {
        // ... automatically generated code to make a stub description
    }
}

The module has two functions:

  1. api_description, which converts a CounterApi implementation (specified by type) into the corresponding ApiDescription against the Context type.

  2. stub_api_description, which generates a stub description. For more, see [stub_description].

Alternatives

Instead of a module, earlier versions of this RFD proposed putting the functions on a CounterApiFactory type. That approach is appealing because the trait name doesn’t have to be transformed into snake_case. But it would prevent us, in the future, from generating items that can’t live on a type. One use case for that is [delegation].

Another alternative is to generate an extension trait. For example:

pub trait CounterApiExt: CounterApi {
    fn api_description(
    ) -> Result<ApiDescription<Self::Context>, ApiDescriptionBuildError> {
        // ... automatically generated code to make an ApiDescription
    }
}

impl<T: CounterApi> CounterApiExt for T {}

This works for standard implementations, but doesn’t leave an obvious place to define the stub description.

Note
There are some unresolved questions with this approach, which are discussed in [question_api_description].

Trait mechanics and dispatch

While prototyping, the author experimented with three styles of defining and using traits:

  1. Using dynamic dispatch (i.e. boxed trait objects) with the async_trait crate. In this case:

    • The function signatures become async fn endpoint(&self, rqctx: RequestContext<()>, ...).

    • The request context moves to being a part of self.

    • The #[dropshot::api_description] macro outputs an #[async_trait] annotation.

    • Implementations also have to be annotated with #[async_trait].

  2. Using static dispatch with async_trait.

  3. Using static dispatch with native async traits, as supported in Rust 1.75 and above.

    • In this case, the definition needs to be annotated with #[dropshot::api_description], but implementations do not require annotations.

    • Current versions of Rust don’t support dynamic dispatch with native traits.

The following table summarizes the results of the experiment:

Feature                             | Dynamic                | Static, async_trait    | Static, native
------------------------------------|------------------------|------------------------|----------------------------------
MSRV                                | Rust 1.56              | Rust 1.56              | Rust 1.75
Trait object safety                 | Requires object-safe   | Cannot be object-safe  | Cannot be object-safe
Conversion from function-based APIs | Requires modifications | Copy-paste             | Copy-paste
Impls require annotation?           | Yes                    | Yes                    | No
Error reporting (see below)         | Poor                   | Poor                   | Great
rust-analyzer support (see below)   | Poor                   | Poor                   | Better, though with shortcomings

More about error reporting and rust-analyzer support

Ensuring that procedural macros produce good errors is a major challenge. Two examples:

  1. If there are syntax errors in the input that are unrelated to anything the procedural macro needs to do, are they detected early by the parser, or passed through to be reported by rustc? With native traits, #[dropshot::api_description] can perform partial parsing. For more, see [error_codegen] (special case).

  2. If there are semantic errors in the generated output caused by bad input, how do we ensure they’re correctly mapped onto the input? A well-written proc macro carefully annotates each output token with the right input span, but sometimes this isn’t even possible.

    • With #[async_trait], both the definition and the implementation must be annotated. Within the implementation, where most errors are likely to occur, rustc is often unable to point to the actual code at issue. Instead, it just points to the #[async_trait] macro invocation (Span::call_site()). If the trait has many methods, it can be very difficult to find the exact line of code that’s failing.

    • With native traits, the definition must be annotated, but the implementation does not need to be. So error reporting in the implementation is always perfect. Due to partial parsing, error reporting in the definition also tends to be better.


Currently, there’s no way for proc macros to provide good hints to rust-analyzer. This means that rust-analyzer may not generate good code when there are issues.

  • Since #[async_trait] has to be used as an annotation for implementations, rust-analyzer often doesn’t handle errors within them well.

  • With native traits, no annotation is required on implementations.

Another example is the "Generate missing impls" action.

  • With #[async_trait], this action is completely unavailable.

  • With native traits, the implementation requires no annotations so this action is available. However, as of Rust 1.78 there is one major limitation: because the #[dropshot::api_description] macro changes the function signatures to add Send (see [auto_trait_bounds]), the action generates methods that return impl Future<Output = ...> + Send + 'static rather than async fn methods.

The author hopes that in the future, rust-analyzer can more gracefully handle simple cases like the latter. This could be done via a relatively-easy-to-describe heuristic. Handling async_trait better is much more difficult.

Based on these results, the RFD proposes that we commit to option 3, and the rest of the RFD assumes it. The downsides are:

  • Compared to options 1 and 2, the more limited Rust version support. We’re already past this version at Oxide, so this isn’t a factor for us. Also, because API traits are an alternative to function-based APIs, the Dropshot MSRV doesn’t have to be bumped.

  • Compared to option 1, the fact that API traits can’t be object-safe. (That’s because options 2 and 3 define static methods on the trait.) But the only real options are either that, or requiring that traits be object-safe. The latter is much harder to achieve and report good errors for.

Endpoint annotation names

For API traits, what name should the endpoint annotation have? There are two options:

  1. Use the same name as function-based APIs: #[endpoint]. For example:

    #[dropshot::api_description]
    pub trait CounterApi {
        #[endpoint { /* ... */ }]
        async fn get_counter(/* ... */);
    }
  2. Use a different name, for example #[api_endpoint] or #[trait_endpoint].

    #[dropshot::api_description]
    pub trait CounterApi {
        #[api_endpoint { /* ... */ }]
        async fn get_counter(/* ... */);
    }

While sharing the same name across function-based APIs and traits is really nice for user understanding, there is a particularly bad failure mode with it. If users forget to invoke the #[dropshot::api_description] proc macro, then rustc will "helpfully" suggest that they import the dropshot::endpoint macro. If users follow that advice, the error messages get quite long and confusing.

With these considerations in mind, the RFD proposes:

  1. Using #[endpoint] as the annotation.

  2. But also, altering the implementation of the dropshot::endpoint function-based macro, to first see whether the item it is invoked on looks like a required trait method. If so, the macro will produce a helpful error message telling the developer to use #[dropshot::api_description].

    Note

    This won’t catch default trait methods since they look identical to functions, but hopefully will do the job enough of the time that users won’t run into too much confusion. (As of Rust 1.78, proc macros can’t look outside the context they’re invoked in, so it’s not possible to tell that the surrounding context is a trait.)

The endpoint annotation uses the same syntax, and accepts all the same arguments, as the dropshot::endpoint macro[4].

Extra items and default implementations

A general philosophical question is whether API traits should exclusively define endpoints, or whether they should be more general than that. There are two closely-related ideas in this bucket:

  1. In Rust, a trait can have many different kinds of items: methods, associated types, constants, and so on. Should we allow a trait annotated with #[dropshot::api_description] to have non-endpoint items on it? For example, there could be helper methods on the trait, or associated types that non-Dropshot consumers of the trait can use.

  2. Trait methods can have default implementations; in other words, they can be provided methods. Should methods in an API trait, endpoint or otherwise, be allowed to have default implementations?

Supporting these options allows for greater flexibility in how server traits are defined. For example, just like in standard Rust, the endpoints could be default methods (sketched after this list) that are thin wrappers over required methods on:

  • the same trait (similar to std’s io::Write::write_all);

  • a supertrait;

  • a supertrait with a blanket implementation, so that callers can’t override endpoint methods (similar to Tokio’s AsyncWriteExt);

  • the context type, where the type implements some trait; or,

  • some other associated type.
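Here is a hedged sketch of the first pattern, a variant of CounterApi where a default endpoint method wraps a required method on the same trait; the get_counter_inner name is hypothetical:

#[dropshot::api_description]
pub trait CounterApi {
    type Context;

    /// Required method that implementations must provide.
    async fn get_counter_inner(
        rqctx: &RequestContext<Self::Context>,
    ) -> Result<CounterValue, HttpError>;

    /// Endpoint provided as a thin default wrapper, in the style of
    /// std's io::Write::write_all.
    #[endpoint { method = GET, path = "/counter" }]
    async fn get_counter(
        rqctx: RequestContext<Self::Context>,
    ) -> Result<HttpResponseOk<CounterValue>, HttpError> {
        Ok(HttpResponseOk(Self::get_counter_inner(&rqctx).await?))
    }
}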

There are downsides to the extra flexibility, though:

  1. It can be harder to follow control flow--users must be careful to not spaghettify their code too much.

  2. For default implementations that are expected to be overridden, it is easy to forget to do so. (It’s also possible to not expect that methods be overridden, or outright forbid such overrides.)

  3. It is less opinionated than deciding on a best practice and sticking to it.

At this time we’re not entirely sure what patterns users are going to settle into, and would like consumers to have the freedom to explore various options. In other words, we aren’t quite sure what best practices are. So, we allow these features. Extra items will be passed through the macro untouched.

Once we gain some experience we may revisit this decision in the future, either committing to or disallowing extra items. (Or maybe disallowing them by default, but having a flag which lets you opt into them.)

Generating default implementations

A related question is whether we should automatically generate this kind of default implementation, either by default or optionally. It is actually quite straightforward to do so within the procedural macro, so it’s more of a policy question of whether this is a good idea.

  • Generating such implementations by default would mean that a missing method doesn’t result in a compile error. This seems actively bad, so we reject this.

  • In the future, we could generate such implementations optionally via an extra attribute on the #[endpoint] annotation. We regard this as out of scope for this RFD.

Stub description

One of the goals of this RFD is to allow users to generate an OpenAPI schema document without needing to write a real implementation. We’re going to call this the stub description of the API.

The stub description needs to track:

  • Endpoint metadata (method, path, etc).

  • Endpoint documentation.

  • The JSON schemas corresponding to the endpoint parameters (the request and response types).

Endpoint metadata and documentation are obtained through syntactic extraction, and can be provided to the ApiEndpoint directly. However, the JSON schemas are semantic information only available at runtime. How should this information be communicated? There are a few options:

  1. Generate a stub implementation of the trait.

  2. Generate stub handlers that have the same signature as real implementations, but panic when called. Pass them in to ApiEndpoint::new. In other words, pass in a function of the right signature as a value parameter. For example:

    use dropshot::StubContext; // context type to indicate a stub description
    
    async fn handler_put_counter(
        rqctx: RequestContext<StubContext>,
        update: TypedBody<CounterValue>,
    ) -> Result<HttpResponseUpdatedNoContent, HttpError> {
        panic!("this is a stub handler");
    }
    
    let endpoint = ApiEndpoint::new(
        "put_counter",
        handler_put_counter,
        Method::PUT,
        // ...
    );
  3. Add an ApiEndpoint::new_for_types function which takes the request and response parameters as type parameters. For example:

    use dropshot::StubContext;
    
    let endpoint = ApiEndpoint::new_for_types::<
        // The request parameters are passed in as a tuple type
        // (in this case, a 1-element tuple).
        (TypedBody<CounterValue>,),
        // The response is a Result type of the right form.
        Result<HttpResponseUpdatedNoContent, HttpError>,
    >(
        "put_counter",
        Method::PUT,
        // ...
    );

Option 1 is appealing at first, but difficult to achieve in practice. Some things that get in the way include:

  • What if the API trait has a supertrait?

  • What if there are extra items, as discussed in [extra_items]?

Options 2 and 3 are roughly equivalent, but 3 leads to a substantially simpler proc macro implementation. So we choose option 3.

Note
We can likely make this better, as discussed in [interface_rework].

Shared context type

This section is about the rqctx context type, used to share state across all the different endpoints. For function-based APIs, endpoint functions look like:

async fn get_counter(
    rqctx: RequestContext<ServerContext>,
) { /* ... */ }

With API traits, there are two options:

  1. All endpoints should accept rqctx: RequestContext<Self>.

  2. Define an associated type Context for the trait[5], and then accept rqctx: RequestContext<Self::Context>.

Option 2 is better in three ways:

  1. It is strictly more general than option 1, since it’s always possible to turn 2 into 1 by writing type Context = Self.

  2. Option 2 is useful if the shared context is a type like Arc<X>. Keeping in mind that the whole point of this exercise is to put the trait in a different crate from the implementation, Rust’s orphan rules make it impossible to implement a foreign trait on a foreign type. So any Arc instances must be wrapped in a newtype.

  3. Option 2 lets users create multiple implementations that share the same context type. This can be useful in some situations.

The main cost of option 2 is some implementation complexity, but experience with the prototype indicates that it’s only a few more lines of code.

For the above reasons, we choose option 2.
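To illustrate the Arc point above, here is a sketch under option 2; the State, ApiContext, and MyImpl names are hypothetical:

use std::sync::Arc;

// CounterApi is defined in another crate, so under option 1 an
// Arc-based context would require implementing the foreign trait on a
// foreign type. With option 2, a local newtype serves as the context:
pub struct ApiContext(pub Arc<State>);

enum MyImpl {}

impl CounterApi for MyImpl {
    type Context = ApiContext;
    // ... endpoint methods elided
}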

For reasons of readability and in order to better function with rust-analyzer, we do not automatically generate the Context type if it isn’t found. Instead, we require that one be specified, and return an error if one isn’t.

Verifying constraints on endpoint methods

Endpoint methods must satisfy both syntactic and semantic constraints. (Most of these conditions also apply to function-based APIs.) Ensuring the macro fails gracefully if these constraints are violated is key to a good developer experience.

Syntactic constraints

Syntactic constraints are those that can be verified purely by looking at the syntax token stream. They include:

  • The endpoint must actually be a function. For example, it can’t be an associated type.

  • The endpoint must be an async fn [6].

  • The first argument must be RequestContext<Self::Context>.

  • Other than the request context, there must be no other references to Self in the endpoint signature.

  • The signature must not have any lifetime parameters or where clauses.

Not verifying syntactic constraints can lead to extraordinarily inscrutable error messages, so we verify syntactic constraints in the proc macro.
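For instance, here is a hypothetical endpoint that violates two of these constraints; the macro rejects it with errors pointing at the offending tokens:

#[dropshot::api_description]
pub trait BrokenApi {
    type Context;

    // Rejected: endpoint methods are static, so `&self` isn't allowed,
    // and the first argument must be `RequestContext<Self::Context>`.
    #[endpoint { method = GET, path = "/counter" }]
    async fn get_counter(
        &self,
        body: TypedBody<CounterValue>,
    ) -> Result<HttpResponseOk<CounterValue>, HttpError>;
}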

Semantic constraints

Semantic constraints are those that the proc macro cannot evaluate directly. They can be verified at compile time, just not by the proc macro itself. They include:

  • All arguments to methods, other than the first and the last, must implement SharedExtractor.

  • The last argument must implement ExclusiveExtractor.

  • Methods must return a Result<T, E>, with T being an HTTP response and E being exactly dropshot::HttpError.

With function-based APIs, Dropshot currently generates blocks of code which perform these checks. This results in errors that are somewhat easier to follow.

With API traits, the author sought to ensure errors of roughly the same quality or better. Through experimentation, it was found that in most[7] cases, calling the new_for_types function was enough to generate error messages of similar quality. (Or, at least, that adding code blocks similar to the ones added for function-based APIs today produced a lot of noise.)

So on balance it turned out to generally be better to not generate code blocks that verify semantic constraints. This decision is purely empirical, and can continue to evolve as the Rust compiler improves or our needs change.

Automatic trait bounds

For Dropshot to be able to turn a trait implementation into a functioning server, some bounds must be specified:

  1. All endpoint methods, which have been verified to be async fn in [endpoint_verifications], must return Send + 'static futures.

  2. Self: Sized + 'static.

  3. Self::Context: dropshot::ServerContext (where ServerContext is an alias for Send + Sync + Sized + 'static).

Should we insert these automatically, or require that users specify them?

For 1, there isn’t really a choice: there’s nowhere to actually write the Send bound, so we must insert these bounds automatically. (The implementation is borrowed from trait_variant.)
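A sketch of roughly what this rewrite looks like for one endpoint; the real macro output differs in its details:

use std::future::Future;

pub trait CounterApi {
    type Context;

    // The trait author writes
    //   `async fn get_counter(rqctx: RequestContext<Self::Context>)
    //        -> Result<HttpResponseOk<CounterValue>, HttpError>;`
    // and the macro emits approximately the following, since `async fn`
    // in a trait offers no place to write the `Send` bound:
    fn get_counter(
        rqctx: RequestContext<Self::Context>,
    ) -> impl Future<Output = Result<HttpResponseOk<CounterValue>, HttpError>>
        + Send
        + 'static;
}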

For 2 and 3, we choose to insert trait bounds automatically, for consistency with 1. Idiomatic servers are almost always going to respect these bounds. As a bonus, there are fewer error cases to consider.

Code generation in case of errors

If any of the checks in [syntactic_constraints] fail, the macro ensures that compile errors are always generated. Beyond that, there are a few options for what to do:

  1. Simply don’t generate the trait at all.

  2. If the trait as a whole is okay but endpoints have errors, don’t generate those specific endpoints.

  3. Always generate all items, even those that have errors, but do not generate the support module.

  4. Always generate all items, and also generate the support module--but replace their implementations with a panic!().

To the greatest extent possible, we choose option 4. This is primarily motivated by developer experience with rust-analyzer, which deals significantly better with a trait or method that exists over one that doesn’t. Experimentation has shown that this is overall a much nicer user experience.

To emphasize: there will still be compile errors. But rust-analyzer can continue to do its analysis even in the face of them, as long as the item exists in some form.

Special case: syntax error in method body

This is a general problem with any code annotated with a proc macro.

Consider a trait which has an associated method (possibly an endpoint, possibly not) with a default implementation. This implementation has a syntax error within it. For example:

#[dropshot::api_description]
trait MyApi {
    type Context;

    #[endpoint { ... }]
    async fn my_endpoint(rqctx: RequestContext<Self::Context>) -> Result<HttpResponseUpdatedNoContent, Error> {
        "123" 456
    }
}

Most proc macros use syn, which, as of version 2.0.65, attempts to parse the syntax tree completely. So errors will be reported by syn, and no macro output will be generated. This ends up being suboptimal, particularly with rust-analyzer.

The #[dropshot::api_description] macro, however, has no need for inspecting the contents of function bodies. So this RFD proposes not attempting to parse function bodies, and instead passing them through. Then rustc will generate errors for them.

This is an expansion of the ItemFnForSignature approach already taken by Dropshot. See Dropshot issue #134 for more details.

Open Questions

Question: Managing API descriptions

The current way API descriptions work means that operations like tag validation are performed immediately. This makes sense for function-based APIs, but gets in the way of API traits. For example, with the approach currently described in [support_module], it isn’t possible to specify a custom TagConfig. There are a few ways to manage this:

  • Rather than returning an ApiDescription directly, return an ApiDescriptionBuilder which doesn’t perform any validation. Then users are expected to create an ApiDescription themselves, passing in the ApiDescriptionBuilder.

  • Accept a TagConfig as a parameter to the dropshot::api_description macro, e.g. #[dropshot::api_description { tag_config = { ... } }].

  • Provide functions like fn extend_api_description<T: CounterApi>(&mut ApiDescription<T::Context>) either in addition to, or instead of, the usual ones.

  • Make ApiDescription itself delay validation until all endpoints have been added.

There are tradeoffs to each of these solutions. Some questions that may help guide us:

  • Should we allow users to customize the set of methods included in an API description?

  • Should we allow users to combine several ApiDescription instances into a single API, as long as they all share the same server context type?

Future work

This section captures ideas that came up while the RFD was being discussed. We aren’t committing to doing any of these, but they are good directions for future work.

Dropshot interface rework

As discussed in [stub_description], the RFD currently proposes generating a stub_api_description method that returns an ApiDescription<StubContext>. This permits some invalid states at compile-time, such as making a server out of a stub description.

This suggests that we should consider a larger rework of Dropshot’s types at some point. Some thoughts:

  • Should we generate a new StubApiDescription type instead? This way, we could statically prevent turning an ApiDescription<StubContext> into an actual HTTP API.

  • Instead of this, should we generate the OpenAPI specification directly?

  • But that raises the question: should all of the OpenAPI metadata (e.g. name, version) just become attributes of the #[dropshot::api_description] macro?

We’d like to generally defer this work for now, as this project is large enough already without it. The answer to [question_api_description] above will play a role in this.

Components

As described in this RFD, API traits require that all endpoints be defined within the same file. But in practice, there are sometimes endpoints shared across different traits. (For example, a general endpoint that exposes metrics.)

Notably, these shared endpoints may use a more restricted server context that is embedded in application contexts. So one might imagine:

#[dropshot::api_description { context = MetricsContext }]
trait Metrics {
    type MetricsContext;

    #[endpoint { method = GET, path = "/metrics" }]
    async fn get_metrics(
        rqctx: RequestContext<Self::MetricsContext>,
    ) -> Result<HttpResponseOk<Metrics>, HttpError>;
}

And then the actual API’s Self::Context is expected to embed the MetricsContext.

With function-based APIs, the only way to share these kinds of endpoint definitions is with a macro. But with API traits, we can provide a better way to do this by treating Metrics as a "component", using Rust’s trait inheritance. For example:

#[dropshot::api_description {
    components = [Metrics],
}]
trait MyApi: Metrics {
    type Context: AsRef<Self::MetricsContext>;

    // ...
}

Then, an implementation of MyApi would also need to provide one for Metrics, which could be forwarded to a reusable implementation based on a concrete MetricsContext (potentially via [delegation]).

Delegation through macros

Some ideas for delegation include:

  • Provide explicitly unimplemented endpoints, avoiding issues where one can forget to override default methods on a trait.

  • Allow forwarding from a server implementation to a reusable implementation of a component, as described in [components] above.

The most obvious way to do this is by generating declarative macros similar to serde’s forward_to_deserialize_any, but there’s further design work that needs to be sketched out.

External References

  • [dropshot] Dropshot: https://github.com/oxidecomputer/dropshot

  • [omicron] Omicron: https://github.com/oxidecomputer/omicron

Footnotes
  1. Currently, Dropshot supports two different kinds of functions: HTTP endpoints with the #[endpoint] macro, and WebSocket channels with the #[channel] macro. API traits do not have any special implications for channels beyond the ones already for endpoints. So, to keep things simple, the rest of the RFD is going to use "endpoints" to talk about both cases.

  2. Throughout this document, we use the term API to mean a collection of supported HTTP and WebSocket methods with corresponding parameters and response types, as well as associated metadata like a name and version number. This is distinct from server, which is a combination of an API, its implementation, and the specific location (address/port number/SSL) it is associated with.

  3. If there’s just a single implementation of the trait, it is still one more time than function-based APIs. But in return, developers get all the benefits discussed above, and adding a second implementation is also easy.

  4. There is one exception to this: the undocumented _dropshot_crate parameter. That parameter will instead be accepted by the top-level dropshot::api_description macro.

  5. The name Context is conventional, and we can choose to let users override it.

  6. Note that we do not support the impl Future<Output = ...> syntax, just the async fn syntax. This makes it easier to implement some of the semantic checks below.

  7. The one apparent exception being the WebSocket connection type for channels, which the author is still investigating.