479 - Dropshot API traits / RFD / Oxide

RFD 479

Introduction

At Oxide, we use [dropshot] application programming interfaces (APIs) extensively to communicate between various parts of the system. Currently, the only way to define APIs is via an approach that the rest of the RFD calls function-based Dropshot APIs. With this method, individual endpoint functions^[1] are annotated with a procedural macro. Endpoints can be combined and an API built out of them. This API can then be used to start an HTTP or HTTPS server.

For example, consider a requirement for an HTTP API that gets and sets a counter. This code implements that API, and defines two functions:

start_server to start an HTTP server.
generate_openapi to generate OpenAPI documents (JSON files) for clients.

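use dropshot::{
    endpoint, ApiDescription, HttpError, HttpResponseOk,
    HttpResponseUpdatedNoContent, RequestContext, TypedBody,
};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};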

#[derive(Deserialize, Serialize, JsonSchema)]
struct CounterValue {
    counter: u64,
}

// Server context shared across functions.
type ServerContext = /* ... */;

/// Gets the counter value.
#[endpoint { method = GET, path = "/counter" }]
async fn get_counter(
    rqctx: RequestContext<ServerContext>,
) -> Result<HttpResponseOk<CounterValue>, HttpError> {
    // ...
}

/// Writes a new counter value.
#[endpoint { method = PUT, path = "/counter" }]
async fn put_counter(
    rqctx: RequestContext<ServerContext>,
    update: TypedBody<CounterValue>,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
    // ...
}

fn create_api() -> ApiDescription<ServerContext> {
    let mut api = ApiDescription::new();
    api.register(get_counter).unwrap();
    api.register(put_counter).unwrap();

    api
}

// Starts a server.
async fn start_server() {
    // Build a description of the API.
    let api = create_api();
    // Use `api` to start a server...
}

// Generates an OpenAPI document.
fn generate_openapi() {
    let api = create_api();
    let openapi = api.openapi("Counter Server", "1.0.0");
    // ...
}

This RFD introduces an alternative way to define Dropshot APIs, called Dropshot API traits^[2].

Current status

Updated 2024-09-06.

The work described in this RFD has landed and shipped as part of Dropshot 0.11.0.
All of Omicron’s OpenAPI documents are now generated from Dropshot API traits.
Omicron also has a tool that is responsible for keeping OpenAPI documents up-to-date. See [openapi_manager].
The expected benefits have largely been realized: generating OpenAPI documents is now much faster, and the circular dependency between Nexus and sled-agent no longer requires contorted workarounds when changing shared types.

Motivation

The idea to use a trait to define Dropshot APIs was initially proposed by Sean Klein in Dropshot PR #5247. Much of this section is an expanded form of the issue summary.

Issues with function-based APIs

Function-based Dropshot APIs are straightforward to write and understand, and overall they have been very successful at getting us to a shipping product. But along the way, we’ve identified a few recurring problems, generally related to tight coupling between the API and the implementation.

Specifically: With Dropshot, OpenAPI documents are generated based on endpoint type definitions. But with function-based Dropshot APIs, the implementation (not just the definitions) must be compiled before the OpenAPI document can be generated.

Alternative implementations of the same API

The most obvious consequence of the tight coupling is that it is much harder to provide a second implementation of the same API. For example, one may wish to provide a fake implementation of an API that only operates in-memory.

(It is possible to use a trait behind the functions to achieve this goal; see [traits_at_home]. This RFD essentially proposes adding native support for this approach to Dropshot.)

Slow API iteration cycles

For large servers such as Nexus within [omicron], it can be quite slow to build the implementations. This slows down API iteration cycles considerably. As an example, consider the "Nexus internal API": the interface between Nexus and internal services. Adding a new endpoint and generating the updated OpenAPI document involves:

Define a function, annotate it with the #[endpoint] macro, and include it in the API description.
Rebuild the OpenAPI document, typically by running a test that ensures the document is up-to-date. For example, EXPECTORATE=overwrite cargo nextest run -p omicron-nexus openapi.

This process can take a while! On the author’s workstation using the mold linker, against Omicron commit bca1f2d, this process takes between 18 and 60 seconds, depending on how much work incremental compilation is able to reuse.

The test also catches some OpenAPI correctness and lint issues, such as:

Colliding HTTP routes. For example, Dropshot rejects an API description that has both /task/{task_id}/status and /task/activate endpoints, since a task_id could itself be the string activate.
Parameters that aren’t in snake_case. Conventionally, OpenAPI schemas use snake_case parameters.

Fixing each such issue would require going through another cycle.
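As a rough illustration of what these two checks involve, here is a standalone, simplified model. It is not Dropshot’s router code. The helpers `routes_conflict` and `is_snake_case` are hypothetical, HTTP methods are ignored, and the conflict rule encoded here is an assumption chosen to reproduce the example above. The rule is: at the first position where two paths differ, a `{variable}` segment conflicts with anything else.

```rust
/// Returns true if `segment` is a path variable such as `{task_id}`.
fn is_variable(segment: &str) -> bool {
    segment.starts_with('{') && segment.ends_with('}')
}

/// Simplified model (hypothetical): two paths conflict if, at the first
/// position where their segments differ, either segment is a variable.
/// Two different literals simply lead to different branches.
fn routes_conflict(a: &str, b: &str) -> bool {
    let a_segs = a.trim_start_matches('/').split('/');
    let b_segs = b.trim_start_matches('/').split('/');
    for (x, y) in a_segs.zip(b_segs) {
        if x == y {
            continue;
        }
        // The first differing position decides.
        return is_variable(x) || is_variable(y);
    }
    false
}

/// Simplified snake_case check: lowercase ASCII letters, digits, and single
/// underscores, not starting or ending with an underscore.
fn is_snake_case(name: &str) -> bool {
    !name.is_empty()
        && name
            .chars()
            .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_')
        && !name.starts_with('_')
        && !name.ends_with('_')
        && !name.contains("__")
}
```

With function-based APIs, each failure of a check like this is only discovered after the full implementation has compiled.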

Circular dependencies

In Nexus, we have many circular dependencies between services; we effectively "break" these cycles with the generation of the OpenAPI document as a static artifact. However, this can be monstrously complex to navigate as, typically, implementations need to compile before a new OpenAPI document may be generated. This becomes particularly problematic when making changes that are incompatible with the old schema.

Note

"Incompatible" encompasses a wider range of changes than one might think at first. It includes many "append-only" changes like adding a new variant to an enum, or a new request parameter.

For example, Nexus and sled-agent depend on each other, since both of them must be able to call into each other. (Nexus instructs sled-agent to run various operations, and sled-agent reports statuses to the Nexus internal API.)

Current Nexus and sled-agent dependency graph

Now, consider this example of a type shared across both APIs. Omicron’s update system has a list of well-known update artifact kinds, defined in omicron-common:

pub enum KnownArtifactKind {
    GimletSp,
    GimletRot,
    Host,
    Trampoline,
    // ...
}

This KnownArtifactKind enum is used extensively throughout Omicron, including in the Nexus and sled-agent APIs, which, as discussed above, have a circular dependency.
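To make the earlier note about "incompatible" changes concrete: a client built against the old schema typically cannot decode a variant it has never seen. This means that even appending a variant to an enum like this one breaks it. The sketch below is a std-only model. `OldClientKind` and `from_wire` are hypothetical stand-ins for generated client code, and wire names are assumed to match the Rust variant names. The old client knows only the first three variants, while the server has since added Trampoline.

```rust
/// Hypothetical client-side enum generated from an *older* OpenAPI document.
#[derive(Debug, PartialEq)]
enum OldClientKind {
    GimletSp,
    GimletRot,
    Host,
}

impl OldClientKind {
    /// Models a strict deserializer: unknown variant names are rejected.
    fn from_wire(s: &str) -> Result<Self, String> {
        match s {
            "GimletSp" => Ok(Self::GimletSp),
            "GimletRot" => Ok(Self::GimletRot),
            "Host" => Ok(Self::Host),
            other => Err(format!("unknown variant `{other}`")),
        }
    }
}
```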

How hard is it to add a new artifact kind? The process is documented within Omicron’s source code:

Adding a new KnownArtifactKind

// Adding a new update artifact kind is a tricky process. To do so:
//
// 1. Add it here.
//
// 2. Add the new kind to <repo root>/{nexus-client,sled-agent-client}/lib.rs.
//    The mapping from `KnownArtifactKind::*` to `types::KnownArtifactKind::*`
//    must be left as a `todo!()` for now; `types::KnownArtifactKind` will not
//    be updated with the new variant until step 5 below.
//
// 3. [Add the new KnownArtifactKind to the database.]
//
// 4. Add the new kind and the mapping to its `update_artifact_kind` to
//    <repo root>/nexus/db-model/src/update_artifact.rs
//
// 5. Regenerate the OpenAPI specs for nexus and sled-agent:
//
//    ```
//    EXPECTORATE=overwrite cargo nextest run -p omicron-nexus -p omicron-sled-agent openapi
//    ```
//
// 6. Return to <repo root>/{nexus-client,sled-agent-client}/lib.rs from step 2
//    and replace the `todo!()`s with the new `types::KnownArtifactKind::*`
//    variant.
//
// See https://github.com/oxidecomputer/omicron/pull/2300 as an example.
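Steps 2 and 6 exist because the hand-written conversion into the generated client type is an exhaustive match. Adding a variant makes it fail to compile until an arm is added, yet the client type cannot gain that variant until the OpenAPI document is regenerated. The std-only model below illustrates this. The `types` module stands in for generated client code, and `NewArtifact` is a hypothetical newly added variant.

```rust
/// Server-side enum (trimmed), with a hypothetical newly added variant.
#[derive(Clone, Copy, Debug)]
pub enum KnownArtifactKind {
    GimletSp,
    Host,
    NewArtifact,
}

/// Stand-in for the generated client module. Until the OpenAPI document is
/// regenerated (step 5), it only knows about the old variants.
pub mod types {
    #[derive(Debug, PartialEq)]
    pub enum KnownArtifactKind {
        GimletSp,
        Host,
    }
}

impl From<KnownArtifactKind> for types::KnownArtifactKind {
    fn from(kind: KnownArtifactKind) -> Self {
        // The match must be exhaustive, so the new variant needs an arm even
        // though there is nothing to map it to yet (step 2).
        match kind {
            KnownArtifactKind::GimletSp => Self::GimletSp,
            KnownArtifactKind::Host => Self::Host,
            KnownArtifactKind::NewArtifact => {
                todo!("map once types::KnownArtifactKind is regenerated")
            }
        }
    }
}
```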

There are several steps involved in even this simple case, and even for experienced Omicron developers a full cycle can take up to 30 minutes. More complex changes can take longer, and each time a spec is changed, it’s another 30+ minute cycle. This slows down development velocity considerably.

But if we decouple the API from the implementation, and only require the API to generate the corresponding clients, then the dependency graph becomes:

Proposed Nexus and sled-agent dependency graph

In other words, if we put the APIs in separate crates from the implementations, the dependency cycle no longer exists and making changes becomes much easier. Code organization also becomes more comprehensible, since it’s easier to see where the interfaces are defined without having to dig around within the implementations.

In reality…

Actually, as things are currently set up, the nexus API depends on the sled-agent client so it can implement From for those types.

But this isn’t an inherent circular dependency! It exists only to work around Rust’s orphan rules. The From impls can be replaced with free-standing functions that do the conversions in another crate.
And even if this cycle is retained, the scope of circularity is more limited. For example, with KnownArtifactKind, generating the new OpenAPI document no longer requires dealing with database code first.
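The sketch below shows that replacement in std-only form, with modules standing in for crates. The module and type names are hypothetical. In a real workspace they would be separate crates. In that setting the orphan rule forbids the third crate from writing `impl From<nexus_types::X> for sled_agent_client::types::X`, because both the trait and the types are foreign to it, but a plain function is always allowed.

```rust
/// Stand-in for a crate of Nexus-side types (hypothetical).
mod nexus_types {
    #[derive(Debug)]
    pub struct InstanceState {
        pub running: bool,
    }
}

/// Stand-in for the generated sled-agent client (hypothetical).
mod sled_agent_client {
    pub mod types {
        #[derive(Debug, PartialEq)]
        pub enum InstanceState {
            Running,
            Stopped,
        }
    }
}

/// Stand-in for a third crate depending on both. A free function needs no
/// trait impl, so the orphan rule never comes into play.
mod conversions {
    use super::{nexus_types, sled_agent_client::types};

    pub fn instance_state_to_client(
        s: &nexus_types::InstanceState,
    ) -> types::InstanceState {
        if s.running {
            types::InstanceState::Running
        } else {
            types::InstanceState::Stopped
        }
    }
}
```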

Merge conflicts

(This is an extension of the previous section, but is worth calling out as an addendum.)

If there are merge conflicts between two changes to an OpenAPI document, developers are forced to deal with a JSON file on disk that may be out of sync with either side. If those conflicts aren’t fixable by hand, it may take substantial work just to make the implementations compile so that the updated document can be generated. After that, the updated document is likely to stop the implementations from compiling again, requiring a second go-around.

If generating the schema doesn’t require getting the whole implementation to compile, much of this pain can be avoided.

Guide-level description

An API trait is a Rust trait that represents a collection of API endpoints. Each endpoint is defined as a static method on the trait, and the trait as a whole is annotated with #[dropshot::api_description]. (Rust 1.75 or later is required.)

While slightly more complex than function-based servers, API traits separate the interface definition from the implementation. Keeping the definition and implementation in different crates can allow for faster iteration of the interface, and simplifies multi-service repos with clients generated from the OpenAPI output of interfaces. In addition, API traits allow for multiple implementations, such as a mock implementation for testing.

Trait definition

Continuing with the counter API example in [introduction], we first define a trait:

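use dropshot::{
    HttpError, HttpResponseOk, HttpResponseUpdatedNoContent, RequestContext,
    TypedBody,
};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

#[derive(Deserialize, Serialize, JsonSchema)]
pub struct CounterValue {
    pub counter: u64,
}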

#[dropshot::api_description]
pub trait CounterApi {
    /// The type representing the server's context.
    type Context;

    /// Gets the counter value.
    #[endpoint { method = GET, path = "/counter" }]
    async fn get_counter(
        rqctx: RequestContext<Self::Context>,
    ) -> Result<HttpResponseOk<CounterValue>, HttpError>;

    /// Writes a new counter value.
    #[endpoint { method = PUT, path = "/counter" }]
    async fn put_counter(
        rqctx: RequestContext<Self::Context>,
        update: TypedBody<CounterValue>,
    ) -> Result<HttpResponseUpdatedNoContent, HttpError>;
}

API implementation

An implementation will look like:

// This is a never-constructed type that exists solely to name the
// `CounterApi` impl -- it can be an empty struct as well.
enum MyImpl {}

// Server context shared across functions.
struct MyContext {
    // ...
}

impl CounterApi for MyImpl {
    type Context = MyContext;

    async fn get_counter(
        rqctx: RequestContext<Self::Context>,
    ) -> Result<HttpResponseOk<CounterValue>, HttpError> {
        // ...
    }

    async fn put_counter(
        rqctx: RequestContext<Self::Context>,
        update: TypedBody<CounterValue>,
    ) -> Result<HttpResponseUpdatedNoContent, HttpError> {
        // ...
    }
}
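Endpoints are static methods with an associated `Context` type. This means several types can implement the same trait, each with its own context, and generic code can select one by type parameter, as `api_description::<MyImpl>()` does. The following std-only sketch models that shape synchronously and without Dropshot. `Counter`, `RealImpl`, `FakeImpl`, and `put_then_get` are hypothetical names, and the plain `&Self::Context` parameter stands in for `RequestContext<Self::Context>`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Simplified stand-in for `CounterApi`: static methods (no `self`
/// receiver) plus an associated context type.
trait Counter {
    type Context: Send + Sync + 'static;
    fn get_counter(ctx: &Self::Context) -> Result<u64, String>;
    fn put_counter(ctx: &Self::Context, value: u64) -> Result<(), String>;
}

/// "Real" implementation backed by an atomic counter.
enum RealImpl {}
impl Counter for RealImpl {
    type Context = AtomicU64;
    fn get_counter(ctx: &AtomicU64) -> Result<u64, String> {
        Ok(ctx.load(Ordering::SeqCst))
    }
    fn put_counter(ctx: &AtomicU64, value: u64) -> Result<(), String> {
        ctx.store(value, Ordering::SeqCst);
        Ok(())
    }
}

/// Fake implementation for tests: records every write in memory.
enum FakeImpl {}
impl Counter for FakeImpl {
    type Context = Mutex<Vec<u64>>;
    fn get_counter(ctx: &Mutex<Vec<u64>>) -> Result<u64, String> {
        Ok(ctx.lock().unwrap().last().copied().unwrap_or(0))
    }
    fn put_counter(ctx: &Mutex<Vec<u64>>, value: u64) -> Result<(), String> {
        ctx.lock().unwrap().push(value);
        Ok(())
    }
}

/// Generic code that works with any implementation, chosen by type.
fn put_then_get<T: Counter>(ctx: &T::Context, value: u64) -> Result<u64, String> {
    T::put_counter(ctx, value)?;
    T::get_counter(ctx)
}
```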

`start_server` and `generate_openapi`

The dropshot::api_description macro creates a new module called counter_api_mod. This module provides two functions:

api_description: turns an implementation into a Dropshot ApiDescription, which can be used to create an HTTP server that serves requests.
stub_api_description: returns a description that can be used to generate an OpenAPI document, without needing an actual implementation at hand.

The start_server function uses api_description:

async fn start_server() {
    // The `api_description` function returns an ApiDescription<MyContext>.
    let api = counter_api_mod::api_description::<MyImpl>().unwrap();
    // Use `api` to start a server...
}

The generate_openapi function uses stub_api_description:

fn generate_openapi() {
    // The `stub_api_description` function returns a description which
    // can only be used to generate an OpenAPI document. Importantly, it does not
    // require an actual implementation.
    let api = counter_api_mod::stub_api_description().unwrap();
    let openapi = api.openapi("Counter Server", "1.0.0");
    // ...
}

Choosing between functions and traits

Prototyping: If you’re prototyping with a small number of endpoints, functions provide an easier way to get started. The downside to traits is that endpoint signatures are defined at least twice: once in the trait and once in the implementation.

Small services: For a service that is relatively isolated and quick to compile, traits and functions are both good options.

APIs with multiple implementations: For services that are large enough to warrant a second, simpler implementation (possibly of only part of the API), a trait is best.

Here’s an archetypal way to organize code for a large service with a real and an in-memory test implementation. Each rounded node represents a binary and each rectangular node represents a library crate (or more than one for "logic").

Migrating functions to API traits

Existing function-based APIs can be converted to traits, typically without much difficulty.

Following the example in [guide_trait_definition], define a trait (say CounterApi) and annotate it with #[dropshot::api_description].
Add an associated type Context. The requisite bounds Send + Sync + 'static will automatically be added to Context.
For each endpoint function, add the corresponding signature as a required method to the trait.
- Include the #[endpoint] annotation (if specified as #[dropshot::endpoint], change this to #[endpoint]).
- Include the function signature, changing the first parameter to be RequestContext<Self::Context>.
- Include existing doc comments on the function.
- Do not include the implementation.
- In some cases, the current API might have exposed implementation-specific types. Refactor these types out into a shared crate ("base types" in the dependency graph above), or define new types and corresponding From conversions as appropriate. This is usually the most involved part of any conversion.

Then, where the endpoint functions are currently defined:

Add a dependency on the crate containing the API trait.
Following the example in [guide_api_implementation], define the type (say MyImpl) to which the implementation will be attached.
Within impl CounterApi for MyImpl, specify type Context to be your shared context type. For example, type Context = MyContext.
Convert the endpoint functions over to being trait methods. This usually just means copying the function signature into the impl block, and removing the #[endpoint] annotation. Changing RequestContext<MyContext> to RequestContext<Self::Context> is recommended but not required.
Following the example in [guide_integration], update the code that glued together the endpoint functions to use api_description.

Impact

With Dropshot API traits, all of the operations described in [motivation_issues] become at least an order of magnitude faster.

Omicron PR #5653 contains a prototype of this RFD, including a conversion of the Nexus internal API.

With this prototype, repeating the test in [slow_iteration] with EXPECTORATE=overwrite cargo nextest run -p nexus-internal-api openapi goes down from 18 seconds to just 1.5 seconds.
The time from adding a new KnownArtifactKind to generating a new OpenAPI document also drops sharply, from 20+ minutes to under a minute.

The approach this RFD proposes uses static dispatch (e.g. it doesn’t box futures), so there aren’t expected to be any runtime performance implications compared to function-based servers.

Determinations

There’s a surprising amount of flexibility in the implementation, much more than with function-based APIs. Providing a good user experience requires making several careful decisions, backed by experimentation and judgment.

The full reasoning for these determinations is in [rationale].

Native async traits

We use Rust’s native async traits (Rust 1.75+) for API traits. Compared to alternatives like the async_trait library, this decision provides better error reporting and improved IDE support. At the time of publishing this RFD (September 2024), Rust 1.75 has been out for several months, so the impact of the version requirement is expected to be minimal.

For more details, see [trait_mechanics].
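Here is a minimal std-only illustration of what native async traits provide. An `async fn` is declared directly in a trait (Rust 1.75+) and called through a generic type parameter, so each call is statically dispatched and no future is boxed. `CounterSource`, `Fixed`, and `doubled` are hypothetical. The `block_on` helper is a deliberately tiny busy-polling executor written only for this sketch; Dropshot servers actually run on Tokio.

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// A trait with a native `async fn`: no `#[async_trait]`, no boxing.
trait CounterSource {
    async fn get_counter() -> u64;
}

enum Fixed {}
impl CounterSource for Fixed {
    async fn get_counter() -> u64 {
        42
    }
}

/// Generic caller: `T::get_counter()` is resolved at compile time.
async fn doubled<T: CounterSource>() -> u64 {
    T::get_counter().await * 2
}

/// Tiny executor for this sketch: polls with a no-op waker until ready.
/// Only suitable for futures that never actually wait on I/O.
fn block_on<F: Future>(fut: F) -> F::Output {
    fn noop_raw_waker() -> RawWaker {
        fn clone(_: *const ()) -> RawWaker {
            noop_raw_waker()
        }
        fn noop(_: *const ()) {}
        static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    // SAFETY: the vtable functions never dereference the data pointer.
    let waker = unsafe { Waker::from_raw(noop_raw_waker()) };
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

For example, `block_on(doubled::<Fixed>())` evaluates to 84. The same trait could not be used as `dyn CounterSource`, which matches the object-safety row in the table in [trait_mechanics].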

The support module

For each API trait, a corresponding support module is generated, with support for generating:

A real server backed by an implementation of the API trait.
A "stub" description, meant solely for generating OpenAPI documents.

For more details, see [support_module_details] and [stub_description_details].

Endpoint annotations

Within each API trait, endpoints and channels are annotated with #[endpoint] and #[channel], respectively. Both of these accept the same arguments as the existing function-based macros. This minimizes complexity and retains consistency with function-based APIs.

For more details, see [endpoint_annotation_details].

Endpoint constraint verifications

Endpoint methods must have a certain shape, and the api_description macro attempts to catch issues as early as possible to provide a good developer experience. Much of the work here also results in improvements to function-based APIs.

For more details, see [endpoint_verification_details].

Miscellaneous

The shared context type (the T in RequestContext<T>) is an associated type on the trait rather than Self. See [shared_context].
Trait bounds required by Dropshot are automatically inserted. See [auto_trait_bounds].
API traits can have non-Dropshot related items. See [extra_items].
In case of errors, the proc macro attempts to generate code to the greatest extent possible to improve the IDE experience. See [error_codegen].
Tag-related configuration is provided via an argument to the api_description macro. See [tag_config].

Future work

This section captures ideas that came up while the RFD was being discussed. We aren’t committing to doing any of these, but they are good directions for future work.

OpenAPI manager

OpenAPI documents generated by Dropshot are typically checked into a repository, next to their corresponding API definitions. This has several advantages:

OpenAPI documents can be tracked in source control over time, and changes to them can be inspected during code review.
Tooling can consume these documents to ensure API compatibility over time.
External users can download these documents without having to generate them.

In general, it is important that checked-in generated files be kept up-to-date. With function-based APIs, we’ve historically added a test next to each API document that uses expectorate to validate and update it.

With API traits we could keep doing the same thing, but there’s also a better approach that can be taken: providing a dedicated tool called an OpenAPI manager which is responsible for the lifecycle of API documents in the repository.

Within Omicron we have built such a tool, and the results have been quite promising. The interface for the tool is quite straightforward:

cargo xtask openapi list: Lists all managed documents.
cargo xtask openapi check: Checks that all documents are up-to-date.
cargo xtask openapi generate: Updates all documents.
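As a rough illustration of the core of such a tool, the std-only sketch below compares a freshly generated document against the checked-in file. It then either reports that the file is stale (check) or rewrites it (generate). This is not the Omicron tool’s code. `Mode`, `Status`, and `sync_document` are hypothetical. A real tool would produce `generated` from each API’s stub description rather than receive it as a string.

```rust
use std::fs;
use std::io;
use std::path::Path;

#[derive(Clone, Copy)]
enum Mode {
    Check,
    Generate,
}

#[derive(Debug, PartialEq)]
enum Status {
    UpToDate,
    Stale,
    Updated,
}

/// Compares `generated` with the file at `path`. In `Generate` mode, a
/// stale or missing file is overwritten; in `Check` mode it is reported.
fn sync_document(path: &Path, generated: &str, mode: Mode) -> io::Result<Status> {
    let current = match fs::read_to_string(path) {
        Ok(s) => Some(s),
        Err(e) if e.kind() == io::ErrorKind::NotFound => None,
        Err(e) => return Err(e),
    };
    if current.as_deref() == Some(generated) {
        return Ok(Status::UpToDate);
    }
    match mode {
        Mode::Check => Ok(Status::Stale),
        Mode::Generate => {
            fs::write(path, generated)?;
            Ok(Status::Updated)
        }
    }
}
```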

Outside of the one file listing out all of the APIs in the workspace, this tool is not specific to Omicron. Extracting this tool into a reusable crate would be quite useful to consumers.

Dropshot interface rework

As discussed in [stub_description_details], the RFD currently generates a stub_api_description function that returns an ApiDescription<StubContext>. This allows some invalid states to compile, such as making a server out of a stub description.

This suggests that we should consider a larger rework of Dropshot’s types at some point. Some thoughts:

Should we generate a new StubApiDescription type instead? This way, we could statically prevent turning an ApiDescription<StubContext> into an actual HTTP API.
Instead of this, should we generate the OpenAPI document directly?
But that raises the question: should all of the OpenAPI metadata (e.g. name, version) just become attributes of the #[dropshot::api_description] macro?
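One way to get the static guarantee from the first question is a separate type that simply has no way to become a server. The std-only sketch below uses hypothetical names (`StubApiDescription`, `into_server`, `Server`, `EndpointInfo`). Dropshot does not currently define these, and the sketch only shows the shape of the idea.

```rust
use std::marker::PhantomData;

pub struct EndpointInfo {
    pub method: &'static str,
    pub path: &'static str,
}

/// Full description: can generate OpenAPI *and* start a server.
pub struct ApiDescription<C> {
    endpoints: Vec<EndpointInfo>,
    _context: PhantomData<C>,
}

/// Stub description: can only generate OpenAPI. Because it has no
/// `into_server`, misusing a stub is a compile-time error.
pub struct StubApiDescription {
    endpoints: Vec<EndpointInfo>,
}

pub struct Server<C> {
    pub context: C,
    pub routes: usize,
}

fn openapi_paths(endpoints: &[EndpointInfo]) -> Vec<String> {
    endpoints.iter().map(|e| format!("{} {}", e.method, e.path)).collect()
}

impl<C> ApiDescription<C> {
    pub fn new(endpoints: Vec<EndpointInfo>) -> Self {
        Self { endpoints, _context: PhantomData }
    }
    pub fn openapi(&self) -> Vec<String> {
        openapi_paths(&self.endpoints)
    }
    pub fn into_server(self, context: C) -> Server<C> {
        Server { context, routes: self.endpoints.len() }
    }
}

impl StubApiDescription {
    pub fn new(endpoints: Vec<EndpointInfo>) -> Self {
        Self { endpoints }
    }
    pub fn openapi(&self) -> Vec<String> {
        openapi_paths(&self.endpoints)
    }
}
```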

We’d like to generally defer this work for now, as this project is large enough already without it.

Composing API traits

As proposed and implemented in this RFD, an API trait stands alone: it is not possible to combine two or more API traits together. Composition seems generally useful, though. Some use cases for it are:

Splitting large traits into several smaller ones for maintainability.
Sharing a subset of methods across several servers (for example methods that return general server health).
Having a separate API trait per Dropshot version ([rfd421]).

This work is filed as Dropshot issue #1069; we will need to do some design and prototyping work before committing to a direction.

Delegation through macros

One of the promises of API traits is that they allow for second, simpler test implementations. However, test implementations are often partial, since tests may only call a subset of APIs.

With the initial implementation of API traits, the only way to do this is to manually forward to a common method. For example:

use counter_api::{counter_api_mod, CounterApi, CounterValue};
use dropshot::{
    HttpError, HttpResponseOk, HttpResponseUpdatedNoContent, RequestContext,
    TypedBody,
};

struct ServerContext {
    // ...
}

enum TestImpl {}

impl CounterApi for TestImpl {
    type Context = ServerContext;

    async fn get_counter(
        rqctx: RequestContext<Self::Context>,
    ) -> Result<HttpResponseOk<CounterValue>, HttpError> {
        // test implementation here
    }

    async fn put_counter(
        rqctx: RequestContext<Self::Context>,
        update: TypedBody<CounterValue>,
    ) -> Result<HttpResponseUpdatedNoContent, HttpError> {
        method_unimplemented()
    }

    // (Assumes CounterApi also defines an increment_counter endpoint.)
    async fn increment_counter(
        rqctx: RequestContext<Self::Context>,
    ) -> Result<HttpResponseOk<CounterValue>, HttpError> {
        method_unimplemented()
    }
}

fn method_unimplemented<T>() -> Result<T, HttpError> {
    Err(HttpError {
        status_code: http::StatusCode::METHOD_NOT_ALLOWED,
        error_code: None,
        external_message: "Method not implemented in TestImpl"
            .to_string(),
        internal_message: "Method not implemented in TestImpl"
            .to_string(),
    })
}

The api_description code generator could enable support for this use case by generating a declarative macro similar to serde’s forward_to_deserialize_any. For example, there could be a forward_endpoints_to macro, which could be invoked as:

impl CounterApi for TestImpl {
    type Context = ServerContext;

    async fn get_counter(
        rqctx: RequestContext<Self::Context>,
    ) -> Result<HttpResponseOk<CounterValue>, HttpError> {
        // ...
    }

    counter_api_mod::forward_endpoints_to! {
        put_counter increment_counter => method_unimplemented
    }
}

The macro would generate method bodies that ignore provided arguments, and simply delegate to method_unimplemented.
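Here is a std-only sketch of how such a generated macro could work. The support module knows every endpoint’s signature, so it can emit one internal arm per endpoint plus an entry point that loops over the requested names. The trait here is a synchronous stand-in for CounterApi. A real macro would emit `async fn` methods with Dropshot’s request and response types. `forward_endpoints_to!`, `Counter`, and this `method_unimplemented` are all hypothetical.

```rust
pub trait Counter {
    type Context;
    fn get_counter(ctx: &Self::Context) -> Result<u64, String>;
    fn put_counter(ctx: &Self::Context, value: u64) -> Result<(), String>;
    fn increment_counter(ctx: &Self::Context) -> Result<u64, String>;
}

/// One arm per endpoint (with its known signature), plus an entry point
/// that expands each requested method into a forwarding body.
macro_rules! forward_endpoints_to {
    (@one get_counter => $target:ident) => {
        fn get_counter(_ctx: &Self::Context) -> Result<u64, String> { $target() }
    };
    (@one put_counter => $target:ident) => {
        fn put_counter(_ctx: &Self::Context, _value: u64) -> Result<(), String> { $target() }
    };
    (@one increment_counter => $target:ident) => {
        fn increment_counter(_ctx: &Self::Context) -> Result<u64, String> { $target() }
    };
    ($($method:ident)* => $target:ident) => {
        $( forward_endpoints_to!(@one $method => $target); )*
    };
}

/// Shared fallback; ignores all arguments, as described above.
fn method_unimplemented<T>() -> Result<T, String> {
    Err("Method not implemented in TestImpl".to_string())
}

enum TestImpl {}

impl Counter for TestImpl {
    type Context = u64;

    fn get_counter(ctx: &u64) -> Result<u64, String> {
        Ok(*ctx)
    }

    forward_endpoints_to! {
        put_counter increment_counter => method_unimplemented
    }
}
```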

Conclusion and summary

This RFD introduces a new way to create APIs in Dropshot using Rust traits instead of functions. This new approach separates API definitions from implementations. Doing so addresses key challenges we’ve faced with the current approach, such as slow iteration cycles and circular dependencies. In addition, the separation allows us to create different implementations of the same API for various purposes, such as testing.

API traits have been adopted across Omicron, resulting in significant improvements to the day-to-day developer experience. Potential areas for future work include composition of API traits for more modular designs and better support for partial implementations. As we gain experience with this new approach, we will continue to refine API traits and make them an effective way to define Dropshot interfaces.

External References

[Dropshot] Dropshot: Expose APIs from a Rust program.
[Omicron] Omicron: The Oxide control plane.
[RFD 421] RFD 421: Using OpenAPI as a locus of update compatibility.

Appendix: Rationale

This section consists of detailed reasoning for the decisions outlined in [determinations].

Do we really need to change Dropshot?

There is already a way to achieve this kind of decoupling with function-based Dropshot APIs.

Define a trait, say MyApiBackend, with methods corresponding to each endpoint.
Use an Arc<dyn MyApiBackend> as the server context.
Each function calls into Arc<dyn MyApiBackend>.

An example of this approach is in Omicron’s installinator-artifactd.
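The std-only sketch below shows this pattern with the Dropshot parts abstracted away. Each "endpoint" is an ordinary function that receives the shared context and forwards to the trait object. `MyApiBackend`, `InMemoryBackend`, and the `*_endpoint` functions are hypothetical. In the real pattern, async backend methods need `#[async_trait]` to be usable through `dyn`, and this synchronous sketch sidesteps that requirement.

```rust
use std::sync::{Arc, Mutex};

/// The backend trait: one method per endpoint (signature #2).
trait MyApiBackend: Send + Sync {
    fn get_counter(&self) -> u64;
    fn put_counter(&self, value: u64);
}

/// One of N implementations (signatures repeated once per implementation).
struct InMemoryBackend {
    value: Mutex<u64>,
}

impl MyApiBackend for InMemoryBackend {
    fn get_counter(&self) -> u64 {
        *self.value.lock().unwrap()
    }
    fn put_counter(&self, value: u64) {
        *self.value.lock().unwrap() = value;
    }
}

/// Server context used by the function-based endpoints.
type ServerContext = Arc<dyn MyApiBackend>;

/// Function-based "endpoints" (signature #1): each just forwards through
/// dynamic dispatch.
fn get_counter_endpoint(ctx: &ServerContext) -> u64 {
    ctx.get_counter()
}

fn put_counter_endpoint(ctx: &ServerContext, value: u64) {
    ctx.put_counter(value)
}
```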

But there’s a fair amount of extra boilerplate involved with this approach. Specifically, if there are N implementations, each endpoint needs to be specified 2+N times:

As a function.
As a trait method.
Once in each of the N implementations.

With native support for Dropshot API traits, this becomes 1+N times^[3].

This approach also necessitates use of dynamic dispatch with async_trait, which isn’t a great user experience as documented in [trait_mechanics]. Native API traits result in much nicer rust-analyzer integration and error reporting.

Source of truth

Traditionally, OpenAPI documents are hand-written, as a kind of interface definition language (IDL). Server and client interfaces may be generated from the IDL, but may also be written separately (and as a result may not be in sync).

With function-based Dropshot APIs, the source of truth is the collection of endpoint and channel invocations written in Rust. The OpenAPI document is generated from that.

As an alternative to API traits, we could choose to switch to something more like the traditional workflow. But in Dropshot we deliberately decided to use annotations in Rust as the source of truth (see RFD 10 for some discussion), which has worked quite well.

We continue to believe in Dropshot’s general approach, so the RFD proposes that the source of truth for schemas continues to be in Rust.

The `dropshot::api_description` macro

Core to Dropshot API traits is the dropshot::api_description macro, which accepts a trait as input (and nothing else).

Continuing with the above CounterApi example, the dropshot::api_description macro does the following:

Gathers the list of endpoints specified in the CounterApi trait.
Verifies that the trait is in the right form and its methods are syntactically correct (discussed below), reporting any errors found.

Then, as its output, the macro generates:

The CounterApi trait mostly as-is, with some minor changes discussed in [auto_trait_bounds] below.
Semantic checks as discussed in [endpoint_verification_details].
A support module, described immediately below.

Support module details

With function-based Dropshot APIs, the way to turn a set of endpoints into the API is to construct an ApiDescription struct. For example:

fn create_api() -> ApiDescription<ServerContext> {
    let mut api = ApiDescription::new();
    api.register(get_counter).unwrap();
    api.register(put_counter).unwrap();

    api
}

Given a trait and an implementation, there needs to be a way to turn it into an ApiDescription as well. In order to do so, the proc macro generates a support module.

For CounterApi, the macro produces:

#[automatically_derived]
pub mod counter_api_mod {
    use super::CounterApi;
    use dropshot::{ApiDescription, ApiDescriptionBuildError, StubContext};

    pub fn api_description<T: CounterApi>(
    ) -> Result<ApiDescription<T::Context>, ApiDescriptionBuildError> {
        // ... automatically generated code to make an ApiDescription
    }

    pub fn stub_api_description(
    ) -> Result<ApiDescription<StubContext>, ApiDescriptionBuildError> {
        // ... automatically generated code to make a stub description
    }
}

Module name

The default name of the module is a snake_case version of the trait’s name, with the suffix _mod appended. (The suffix avoids name collisions in the case where a crate has the same name as the trait it defines, such as a counter_api crate defining CounterApi.)
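The following is a simplified sketch of this naming rule. It inserts an underscore before each uppercase letter that follows a lowercase letter or digit, lowercases everything, and then appends _mod. The helper name `support_module_name` is hypothetical. The macro’s actual case conversion may treat acronyms and other edge cases differently. For example, this sketch turns "HTTPApi" into "httpapi_mod".

```rust
/// Simplified CamelCase -> snake_case conversion plus the `_mod` suffix.
fn support_module_name(trait_name: &str) -> String {
    let mut out = String::new();
    let mut prev_lower_or_digit = false;
    for c in trait_name.chars() {
        if c.is_uppercase() {
            // Start a new word only after a lowercase letter or digit.
            if prev_lower_or_digit {
                out.push('_');
            }
            out.extend(c.to_lowercase());
            prev_lower_or_digit = false;
        } else {
            out.push(c);
            prev_lower_or_digit = c.is_lowercase() || c.is_ascii_digit();
        }
    }
    out.push_str("_mod");
    out
}
```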

The name can be customized via the module parameter to dropshot::api_description. For example:

#[dropshot::api_description {
    module = "counter_api_support_module",
}]
trait MyApi {
    // ...
}

Module contents

In the initial implementation, the module has two functions:

api_description, which converts a CounterApi implementation (specified by type) into the corresponding ApiDescription against the Context type.
stub_api_description, which generates a stub description. For more, see [stub_description].
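To illustrate the difference between the two functions, here is a std-only model of a support module. It is not the macro’s actual output. `Endpoint`, `ApiDescription`, `handle`, and `openapi_paths` are hypothetical simplifications, and `StubContext` here only models Dropshot’s type of that name. The real description holds handlers bound to the implementation. The stub holds only the metadata that OpenAPI generation needs (here, method and path), so no implementation is required to build it.

```rust
/// Endpoint metadata plus an optional handler.
pub struct Endpoint<C> {
    method: &'static str,
    path: &'static str,
    handler: Option<fn(&C) -> String>,
}

pub struct ApiDescription<C> {
    endpoints: Vec<Endpoint<C>>,
}

impl<C> ApiDescription<C> {
    /// OpenAPI generation only needs metadata, never the handlers.
    pub fn openapi_paths(&self) -> Vec<String> {
        self.endpoints.iter().map(|e| format!("{} {}", e.method, e.path)).collect()
    }
    /// Serving a request needs a handler; a stub has none.
    pub fn handle(&self, path: &str, ctx: &C) -> Option<String> {
        let e = self.endpoints.iter().find(|e| e.path == path)?;
        e.handler.map(|h| h(ctx))
    }
}

pub trait CounterApi {
    type Context;
    fn get_counter(ctx: &Self::Context) -> String;
}

/// Uninhabited placeholder context for stub descriptions.
pub enum StubContext {}

mod counter_api_mod {
    use super::{ApiDescription, CounterApi, Endpoint, StubContext};

    /// Real description: handlers call into the implementation `T`.
    pub fn api_description<T: CounterApi>() -> ApiDescription<T::Context> {
        let handler = T::get_counter as fn(&T::Context) -> String;
        ApiDescription {
            endpoints: vec![Endpoint { method: "GET", path: "/counter", handler: Some(handler) }],
        }
    }

    /// Stub description: same metadata, no implementation needed.
    pub fn stub_api_description() -> ApiDescription<StubContext> {
        ApiDescription {
            endpoints: vec![Endpoint { method: "GET", path: "/counter", handler: None }],
        }
    }
}

/// A concrete implementation, used in the example below.
pub enum MyImpl {}
impl CounterApi for MyImpl {
    type Context = u64;
    fn get_counter(ctx: &u64) -> String {
        format!("counter = {ctx}")
    }
}
```

Because StubContext has no values, a stub description can never be handed a context to serve requests with. The section on [stub_description] discusses the real type’s equivalent limitations.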

Alternatives

Instead of a module, earlier versions of this RFD proposed putting the functions on a CounterApiFactory type. That approach is appealing because the trait name doesn’t have to be transformed into snake_case. But it would prevent us from generating, in the future, items that can’t live on a type, such as declarative macros. One use case for that is [delegation].

Another alternative is to generate an extension trait. For example:

pub trait CounterApiExt: CounterApi {
    fn api_description(
    ) -> Result<ApiDescription<Self::Context>, ApiDescriptionBuildError> {
        // ... automatically generated code to make an ApiDescription
    }
}

impl<T: CounterApi> CounterApiExt for T {}

This works for standard implementations, but doesn’t leave an obvious place to define the stub description.

Trait mechanics and dispatch

While prototyping, the author experimented with three styles of defining and using traits:

Using dynamic dispatch (i.e. boxed trait objects) with the async_trait crate. In this case:
- The function signatures become async fn endpoint(&self, rqctx: RequestContext<()>, …).
- The server context moves to being a part of self (hence RequestContext<()>).
- The #[dropshot::api_description] macro outputs an #[async_trait] annotation.
- Implementations also have to be annotated with #[async_trait].
Using static dispatch with async_trait.
Using static dispatch with native async traits, as supported in Rust 1.75 and above.
- In this case, the definition needs to be annotated with #[dropshot::api_description], but implementations do not require annotations.
- Current versions of Rust don’t support dynamic dispatch with native traits.

The following table summarizes the results of the experiment:

Feature	Dynamic	Static, `async_trait`	Static, native
MSRV	Rust 1.56	Rust 1.56	Rust 1.75
Trait object safety	Requires object-safe	Cannot be object-safe	Cannot be object-safe
Conversion from function-based APIs	Requires modifications	Copy-paste	Copy-paste
Impls require annotation?	Yes	Yes	No
Error reporting (see below)	Poor	Poor	Great
rust-analyzer support (see below)	Poor	Poor	Better, though with shortcomings


More about error reporting and rust-analyzer support

Ensuring that procedural macros produce good errors is a major challenge. Two examples:

If there are syntax errors in the input that are unrelated to anything the procedural macro needs to do, are they detected early by the macro’s parser, or passed through to be reported by rustc? With native traits, #[dropshot::api_description] can parse its input only partially and leave such errors to rustc. For more, see the special case in [error_codegen].
If there are semantic errors in the generated output caused by bad input, how do we ensure they’re correctly mapped onto the input? A well-written proc macro carefully annotates each output token with the right input span, but sometimes this isn’t even possible.
- With #[async_trait], both the definition and the implementation must be annotated. Within the implementation, where most errors are likely to occur, rustc is often unable to point to the actual code at issue. Instead, it just points to the #[async_trait] macro invocation (Span::call_site()). If the trait has many methods, it can be very difficult to find the exact line of code that’s failing.
- With native traits, the definition must be annotated, but the implementation does not need to be, so errors in the implementation are reported exactly as they would be for any other Rust code. Due to partial parsing, error reporting in the definition also tends to be better.

Currently, there’s no way for proc macros to provide good hints to rust-analyzer. This means that rust-analyzer’s analysis and code actions may degrade when there are errors.

Since #[async_trait] has to be used as an annotation for implementations, rust-analyzer often doesn’t handle errors within them well.
With native traits, no annotation is required on implementations.

Another example is the "Generate missing impls" action.

With #[async_trait], this action is completely unavailable.
With native traits, the implementation requires no annotations, so this action is available. However, as of Rust 1.78 there is one major limitation: because #[dropshot::api_description] rewrites each endpoint signature to add a Send bound (see [auto_trait_bounds]), the action generates methods that return impl Future<Output = …> + Send + 'static rather than async fn methods.

The author hopes that in the future, rust-analyzer can handle simple cases like the latter more gracefully, for example via a heuristic that is relatively easy to describe. Handling async_trait better is much more difficult.

Based on these results, the RFD proposes that we commit to option 3 (static dispatch with native async traits), and the rest of the RFD assumes it. The downsides are:

Compared to options 1 and 2, Rust version support is more limited (1.75 and above). We’re already past this version at Oxide, so this isn’t a factor for us. Also, because API traits are an alternative to function-based APIs, the Dropshot MSRV doesn’t have to be bumped.
Compared to option 1, API traits can’t be object-safe. (That’s because options 2 and 3 define static methods on the trait.) The only real alternative is to require that traits be object-safe, which is much harder to achieve and to report good errors for.
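As a rough illustration of option 3, here is a minimal sketch in plain Rust (1.75 or later), independent of Dropshot. The trait CounterApi, the AtomicCounter implementation, and the use of Arc<Self::Context> as a stand-in for RequestContext<Self::Context> are hypothetical, and block_on is a toy executor included only so the example runs. Because the methods are native async fns with no self receiver, they can only be invoked through a concrete type (static dispatch), which is also why such a trait can’t be used as a trait object.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::task::{Poll, RawWaker, RawWakerVTable, Waker};

/// Hypothetical API trait in the style of option 3: native `async fn`s with no
/// `self` receiver. (Such a trait is not object-safe, so
/// `Box<dyn CounterApi<Context = AtomicU64>>` would not compile.)
trait CounterApi {
    /// Shared server state; stands in for `RequestContext<Self::Context>`.
    type Context;

    async fn get_counter(ctx: Arc<Self::Context>) -> u64;
    async fn put_counter(ctx: Arc<Self::Context>, value: u64);
}

/// One implementation; no extra annotation is needed on the impl block.
struct AtomicCounter;

impl CounterApi for AtomicCounter {
    type Context = AtomicU64;

    async fn get_counter(ctx: Arc<AtomicU64>) -> u64 {
        ctx.load(Ordering::SeqCst)
    }

    async fn put_counter(ctx: Arc<AtomicU64>, value: u64) {
        ctx.store(value, Ordering::SeqCst);
    }
}

/// Generic code calls endpoints as `T::method(...)`; the compiler
/// monomorphizes it for each implementation (static dispatch).
async fn increment<T: CounterApi>(ctx: Arc<T::Context>) -> u64 {
    let current = T::get_counter(Arc::clone(&ctx)).await;
    T::put_counter(Arc::clone(&ctx), current + 1).await;
    T::get_counter(ctx).await
}

/// Toy executor: polls with a no-op waker until the future completes.
/// Adequate here only because these futures never wait on anything.
fn block_on<F: Future>(fut: F) -> F::Output {
    fn noop_clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(noop_clone, noop, noop, noop);

    // SAFETY: the vtable functions never dereference the (null) data pointer.
    let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) };
    let mut cx = std::task::Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

The same generic increment function would work unchanged for any other implementation of the trait, such as a test fake.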

Endpoint annotation details

For API traits, what name should the endpoint annotation have? There are two options:

Use the same name as function-based APIs: #[endpoint]. For example:

#[dropshot::api_description]
pub trait CounterApi {
    #[endpoint { /* ... */ }]
    async fn get_counter(/* ... */);
}

Use a different name, for example #[api_endpoint] or #[trait_endpoint].

#[dropshot::api_description]
pub trait CounterApi {
    #[api_endpoint { /* ... */ }]
    async fn get_counter(/* ... */);
}

While sharing the same name across function-based APIs and traits is really nice for user understanding, there is a particularly bad failure mode with it. If users forget to invoke the #[dropshot::api_description] proc macro, then rustc will "helpfully" suggest that they import the dropshot::endpoint macro. If users follow that advice, the error messages get quite long and confusing.

With these considerations in mind, the RFD proposes:

Using #[endpoint] as the annotation.
Also altering the implementation of the function-based dropshot::endpoint macro to first check whether the item it is invoked on looks like a required trait method (a signature with no body). If so, the macro will produce a helpful error message telling the developer to use #[dropshot::api_description].
Note
This won’t catch default trait methods, since they look identical to functions, but it should catch the common case often enough to avoid most of the confusion. (As of Rust 1.78, proc macros can’t look outside the context they’re invoked in, so it’s not possible to tell that the surrounding context is a trait.)
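The check only needs the tokens of the item itself. As a toy sketch of the heuristic, working on source text rather than a token stream (a real proc macro would presumably parse tokens with syn instead), an item looks like a required trait method if it is a function signature terminated by ; rather than by a body. Both function names below, and the error text, are hypothetical.

```rust
/// Toy version of the heuristic: does `item` look like a required trait
/// method, i.e. a `fn` signature that ends in `;` instead of a `{ ... }` body?
fn looks_like_required_trait_method(item: &str) -> bool {
    let item = item.trim();
    let has_fn_keyword = item.split_whitespace().any(|word| word == "fn");
    has_fn_keyword && item.ends_with(';')
}

/// What a function-based `#[endpoint]` might do with that answer.
/// The message text is illustrative only.
fn check_endpoint_item(item: &str) -> Result<(), String> {
    if looks_like_required_trait_method(item) {
        Err("this looks like a trait method: annotate the trait with \
             #[dropshot::api_description]"
            .to_string())
    } else {
        Ok(())
    }
}
```

A default method with a body passes this check unnoticed, since its tokens look exactly like a free function’s.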

The endpoint annotation uses the same syntax, and accepts all the same arguments, as the dropshot::endpoint macro^[4].

Extra items and default implementations

A general philosophical question is whether API traits should exclusively define endpoints, or whether they should be more general than that. There are two closely-related ideas in this bucket:

In Rust, a trait can have many different kinds of items: methods, associated types, constants, and so on. Should we allow a trait annotated with #[dropshot::api_description] to have non-endpoint items on it? For example, there could be helper methods on the trait, or associated types that non-Dropshot consumers of the trait can use.
Trait methods can have default implementations; in other words, they can be provided methods. Should endpoint methods be allowed to have default implementations?

Supporting these options allows for greater flexibility in how API traits are defined. For example, just like in standard Rust, the endpoints could be default methods that are thin wrappers over required methods on:

the same trait (similar to std’s io::Write::write_all);
a supertrait;
a supertrait with a blanket implementation, so that implementors can’t override endpoint methods (similar to Tokio’s AsyncWriteExt);
the context type, where the type implements some trait; or,
some other associated type.
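For example, the "context type implements some trait" pattern might look roughly like the following plain-Rust sketch (Rust 1.75 or later, no Dropshot involved). CounterStore, MemoryStore, DefaultApi, and DoublingApi are hypothetical names, Arc<Self::Context> stands in for RequestContext<Self::Context>, and block_on is a toy executor. The endpoints are provided methods that forward to the context; one implementation keeps the defaults and another overrides a single endpoint.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::task::{Poll, RawWaker, RawWakerVTable, Waker};

/// Hypothetical trait implemented by the shared context type.
trait CounterStore {
    fn load(&self) -> u64;
    fn store(&self, value: u64);
}

struct MemoryStore(AtomicU64);

impl CounterStore for MemoryStore {
    fn load(&self) -> u64 {
        self.0.load(Ordering::SeqCst)
    }
    fn store(&self, value: u64) {
        self.0.store(value, Ordering::SeqCst);
    }
}

/// Endpoints are provided (default) methods: thin wrappers over the context.
trait CounterApi {
    type Context: CounterStore;

    async fn get_counter(ctx: Arc<Self::Context>) -> u64 {
        ctx.load()
    }

    async fn put_counter(ctx: Arc<Self::Context>, value: u64) {
        ctx.store(value);
    }
}

/// Uses both defaults unchanged.
struct DefaultApi;
impl CounterApi for DefaultApi {
    type Context = MemoryStore;
}

/// Overrides one endpoint; `put_counter` still uses the default.
struct DoublingApi;
impl CounterApi for DoublingApi {
    type Context = MemoryStore;

    async fn get_counter(ctx: Arc<MemoryStore>) -> u64 {
        ctx.load() * 2
    }
}

/// Toy executor: polls with a no-op waker until the future completes.
fn block_on<F: Future>(fut: F) -> F::Output {
    fn noop_clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(noop_clone, noop, noop, noop);

    // SAFETY: the vtable functions never dereference the (null) data pointer.
    let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) };
    let mut cx = std::task::Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

DoublingApi also illustrates one of the downsides: nothing forces an implementation to override a provided endpoint, so forgetting to do so compiles silently.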

There are downsides to the extra flexibility, though:

It can be harder to follow control flow—users must be careful to not spaghettify their code too much.
For default implementations that are expected to be overridden, it is easy to forget to do so. (It’s also possible to not expect that methods be overridden, or outright forbid such overrides.)
It is less opinionated than deciding on a best practice and sticking to it.

At this time we’re not sure what patterns users will settle into, or what the best practices are, and we’d like consumers to have the freedom to explore various options. So we allow these features: extra items are passed through the macro untouched.

Once we gain some experience, we may revisit this decision, either committing to or disallowing extra items. (Or perhaps disallowing them by default, with a flag that lets you opt in.)

Generating default implementations

A related question is whether we should automatically generate this kind of default implementation, either by default or optionally. It is actually quite straightforward to do so within the procedural macro, so it’s more of a policy question of whether this is a good idea.

Generating such implementations by default would mean that a missing method doesn’t result in a compile error. This seems actively bad, so we reject this.
In the future, we could generate such implementations optionally via an extra attribute on the #[endpoint] annotation. We regard this as out of scope for this RFD.

Stub description details

One of the goals of this RFD is to allow users to generate an OpenAPI document without needing to write a real implementation. We’re going to call this the stub description of the API.

The stub description needs to track:

Endpoint metadata (method, path, etc).
Endpoint documentation.
The JSON schemas corresponding to the endpoint parameters (the request and response types).

Endpoint metadata and documentation can be extracted syntactically and provided to the ApiEndpoint directly. However, the JSON schemas are semantic information: the proc macro only sees tokens, and the schemas are produced at runtime from the types’ JsonSchema implementations. How should this information be communicated? There are a few options:

1. Generate a stub implementation of the trait.

2. Generate stub handlers that have the same signature as real implementations, but panic when called. Pass them to ApiEndpoint::new; in other words, pass in a function of the right signature as a value parameter. For example:

use dropshot::StubContext; // context type to indicate a stub description

// Dropshot handlers are async, so the stub must be an async fn too.
async fn handler_put_counter(
    _rqctx: RequestContext<StubContext>,
    _update: TypedBody<CounterValue>,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
    panic!("this is a stub handler");
}

let endpoint = ApiEndpoint::new(
    "put_counter",
    handler_put_counter,
    Method::PUT,
    // ...
);

3. Add an ApiEndpoint::new_for_types function which takes the request and response parameters as type parameters. For example:

use dropshot::StubContext;

let endpoint = ApiEndpoint::new_for_types::<
    // The request parameters are passed in as a tuple type
    // (in this case a 1-element tuple.)
    (TypedBody<CounterValue>,),
    // The response is a Result type of the right form.
    Result<HttpResponseUpdatedNoContent, HttpError>,
>(
    "put_counter",
    Method::PUT,
    // ...
);

Option 1 is appealing at first, but difficult to achieve in practice. Some things that get in the way include:

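What if the API trait has a supertrait? The stub would need to implement it too, and the macro can’t see the supertrait’s definition.
What if there are extra items, as discussed in [extra_items]? The stub would need to provide any required ones as well.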

Options 2 and 3 are roughly equivalent, but 3 leads to a substantially simpler proc macro implementation. So we choose option 3.
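The reason passing types suffices is that schema information hangs off the types themselves, so a generic function can compute it without any handler value. Here is a self-contained model of that idea. DescribeType, DescribeArgs, StubEndpoint, and stub_endpoint_for_types are hypothetical names, and plain description strings stand in for real JSON schemas; Dropshot’s actual new_for_types signature may differ.

```rust
use std::marker::PhantomData;

/// Hypothetical stand-in for `JsonSchema`: each type describes itself through
/// a static method, without any value or API implementation existing.
trait DescribeType {
    fn describe() -> String;
}

struct CounterValue;
impl DescribeType for CounterValue {
    fn describe() -> String {
        "CounterValue { counter: u64 }".to_string()
    }
}

struct TypedBody<T>(PhantomData<T>);
impl<T: DescribeType> DescribeType for TypedBody<T> {
    fn describe() -> String {
        format!("body: {}", T::describe())
    }
}

struct HttpResponseUpdatedNoContent;
impl DescribeType for HttpResponseUpdatedNoContent {
    fn describe() -> String {
        "204 No Content".to_string()
    }
}

struct HttpError;
/// A response type is described by its success type.
impl<T: DescribeType> DescribeType for Result<T, HttpError> {
    fn describe() -> String {
        T::describe()
    }
}

/// Request parameters arrive as a tuple type, as in the new_for_types example.
trait DescribeArgs {
    fn describe_args() -> Vec<String>;
}
impl DescribeArgs for () {
    fn describe_args() -> Vec<String> {
        Vec::new()
    }
}
impl<A: DescribeType> DescribeArgs for (A,) {
    fn describe_args() -> Vec<String> {
        vec![A::describe()]
    }
}
impl<A: DescribeType, B: DescribeType> DescribeArgs for (A, B) {
    fn describe_args() -> Vec<String> {
        vec![A::describe(), B::describe()]
    }
}

/// What a stub endpoint records: metadata plus type-derived descriptions.
#[derive(Debug)]
struct StubEndpoint {
    operation_id: String,
    method: &'static str,
    params: Vec<String>,
    response: String,
}

/// Model of `new_for_types`: everything comes from type parameters, so no
/// handler function (real or stub) is needed.
fn stub_endpoint_for_types<Args: DescribeArgs, Resp: DescribeType>(
    operation_id: &str,
    method: &'static str,
) -> StubEndpoint {
    StubEndpoint {
        operation_id: operation_id.to_string(),
        method,
        params: Args::describe_args(),
        response: Resp::describe(),
    }
}
```

Compared with option 2, nothing here needs a function body, which is one way to see why the proc macro gets simpler.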

Note

We can likely make this better, as discussed in [interface_rework].

Shared context type

This section is about the type of the shared context carried by rqctx, which is used to share state across all endpoints. For function-based APIs, methods look like:

async fn get_counter(
    rqctx: RequestContext<ServerContext>,
) { /* ... */ }

With API traits, there are two options:

1. All endpoints should accept rqctx: RequestContext<Self>.
2. Define an associated type Context for the trait^[5], and then accept rqctx: RequestContext<Self::Context>.

Option 2 is better in three ways:

It is strictly more general than option 1, since it’s always possible to turn 2 into 1 by writing type Context = Self.
Option 2 is useful if the shared context is a type like Arc<X>. Keep in mind that the whole point of this exercise is to put the trait in a different crate from the implementation. With option 1, Self would have to be Arc<X>, but Rust’s orphan rules forbid implementing a foreign trait on a foreign type, so the Arc would have to be wrapped in a newtype. With option 2, the implementation can simply write type Context = Arc<X>.
Option 2 lets users create multiple implementations that share the same context type. This can be useful in some situations.

The main cost of option 2 is some implementation complexity, but experience with the prototype indicates that it’s only a few more lines of code.

For the above reasons, we choose option 2.

For reasons of readability and in order to better function with rust-analyzer, we do not automatically generate the Context type if it isn’t found. Instead, we require that one be specified, and return an error if one isn’t.
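A small plain-Rust sketch of why option 2 helps, using synchronous methods for brevity (SharedState, ReadApi, DoubledApi, and read_via are hypothetical names). Two implementations share type Context = Arc<SharedState> without any newtype. Under option 1, an implementation in another crate would instead have to write impl CounterApi for Arc<SharedState>, which the orphan rules reject there because both the trait and Arc are foreign to that crate.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical shared server state.
struct SharedState {
    counter: Mutex<u64>,
}

/// Option 2: the context is an associated type, separate from `Self`.
trait CounterApi {
    type Context;
    fn get_counter(ctx: &Self::Context) -> u64;
}

/// Two implementations sharing the same context type; no newtype needed.
struct ReadApi;
impl CounterApi for ReadApi {
    type Context = Arc<SharedState>;
    fn get_counter(ctx: &Arc<SharedState>) -> u64 {
        *ctx.counter.lock().unwrap()
    }
}

struct DoubledApi;
impl CounterApi for DoubledApi {
    type Context = Arc<SharedState>;
    fn get_counter(ctx: &Arc<SharedState>) -> u64 {
        *ctx.counter.lock().unwrap() * 2
    }
}

/// Generic code names the context as `T::Context`.
fn read_via<T: CounterApi>(ctx: &T::Context) -> u64 {
    T::get_counter(ctx)
}
```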

Endpoint constraint verification details

Endpoint methods must satisfy both syntactic and semantic constraints. (Most of these conditions also apply to function-based APIs.) Ensuring the macro fails gracefully if these constraints are violated is key to a good developer experience.

Syntactic constraints

Syntactic constraints are those that can be verified purely by looking at the syntax token stream. They include:

The endpoint must actually be a function. For example, it can’t be an associated type.
The endpoint must be an async fn ^[6].
The first argument must be RequestContext<Self::Context>.
Other than the request context, there must be no other references to Self in the endpoint signature.
The signature must not have any lifetime parameters or where clauses.

Not verifying syntactic constraints can lead to inscrutable error messages, so we verify syntactic constraints in the proc macro.
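A sketch of that verification under simplifying assumptions: the hypothetical MethodSig struct below stands in for a parsed signature (a real macro would presumably obtain it from syn), and types are plain strings. One plausible design choice, shown here, is to collect every violation rather than stopping at the first.

```rust
/// Hypothetical, simplified view of a parsed trait item.
struct MethodSig<'a> {
    is_fn: bool,
    is_async: bool,
    /// Argument types, in order, as source text.
    arg_types: Vec<&'a str>,
    return_type: &'a str,
    has_lifetime_params: bool,
    has_where_clause: bool,
}

/// Checks the syntactic constraints listed above, returning every violation.
fn check_syntax(sig: &MethodSig) -> Vec<String> {
    let mut errors = Vec::new();
    if !sig.is_fn {
        errors.push("endpoint must be a function".to_string());
        return errors; // nothing else to check on a non-function item
    }
    if !sig.is_async {
        errors.push("endpoint must be an `async fn`".to_string());
    }
    match sig.arg_types.first() {
        Some(&"RequestContext<Self::Context>") => {}
        _ => errors.push("first argument must be `RequestContext<Self::Context>`".to_string()),
    }
    // Toy check: any other mention of `Self` in arguments or return type.
    let other_self = sig.arg_types.iter().skip(1).any(|ty| ty.contains("Self"))
        || sig.return_type.contains("Self");
    if other_self {
        errors.push("`Self` may only appear in the request context argument".to_string());
    }
    if sig.has_lifetime_params || sig.has_where_clause {
        errors.push("signature must not have lifetime parameters or where clauses".to_string());
    }
    errors
}
```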

Semantic constraints

Semantic constraints are those that the proc macro cannot evaluate directly. They can be verified at compile time, just not by the proc macro itself. They include:

All arguments to methods, other than the first and the last, must implement SharedExtractor.
The last argument must implement ExclusiveExtractor.
Methods must return a Result<T, E>, with T being an HTTP response and E being exactly dropshot::HttpError.

With function-based APIs, Dropshot currently generates blocks of code which perform these checks. This results in errors that are somewhat easier to follow.

With API traits, the author sought to ensure errors at roughly the same quality or better. Through experimentation, it was found that in most^[7] cases, calling the new_for_types function was enough to generate error messages of similar quality. (Or, at least, that adding code blocks similar to the ones added for function-based APIs today produced a lot of noise.)

So on balance it turned out to generally be better to not generate code blocks that verify semantic constraints. This decision is purely empirical, and can continue to evolve as the Rust compiler improves or our needs change.
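One way to see why a single generic call can stand in for dedicated check blocks: the semantic constraints can be phrased as trait bounds on type parameters, and rustc then reports any violation at the call site. The sketch below models this in plain Rust. All traits and types are simplified, hypothetical stand-ins rather than Dropshot’s real definitions, the blanket impl making every shared extractor usable in the last position is an assumption, and only argument tuples up to length two are covered.

```rust
use std::marker::PhantomData;

// Hypothetical stand-ins for Dropshot's extractor and response traits.
trait SharedExtractor {}
trait ExclusiveExtractor {}
// Assumption: any shared extractor may also be used in the last position.
impl<S: SharedExtractor> ExclusiveExtractor for S {}

struct Query<T>(PhantomData<T>);
impl<T> SharedExtractor for Query<T> {}

struct TypedBody<T>(PhantomData<T>);
impl<T> ExclusiveExtractor for TypedBody<T> {}

struct HttpError;
trait HttpResponse {}
struct HttpResponseOk<T>(T);
impl<T> HttpResponse for HttpResponseOk<T> {}

/// Argument tuples (excluding the request context): every element but the
/// last must be a SharedExtractor, and the last an ExclusiveExtractor.
trait EndpointArgs {
    const COUNT: usize;
}
impl EndpointArgs for () {
    const COUNT: usize = 0;
}
impl<E: ExclusiveExtractor> EndpointArgs for (E,) {
    const COUNT: usize = 1;
}
impl<S: SharedExtractor, E: ExclusiveExtractor> EndpointArgs for (S, E) {
    const COUNT: usize = 2;
}

/// The return type must be Result<T, HttpError> with T an HTTP response.
trait EndpointResult {}
impl<T: HttpResponse> EndpointResult for Result<T, HttpError> {}

/// Model of a `new_for_types`-style call: the bounds are the checks. A call
/// that violates them fails to compile, with rustc naming the unmet bound.
fn check_endpoint<Args: EndpointArgs, Resp: EndpointResult>() -> usize {
    Args::COUNT
}

// Calls that would NOT compile:
// check_endpoint::<(TypedBody<String>, Query<u32>), Result<HttpResponseOk<u64>, HttpError>>();
//     (TypedBody<String> is not a SharedExtractor)
// check_endpoint::<(), Result<HttpResponseOk<u64>, String>>();
//     (the error type must be HttpError)
```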

Automatic trait bounds

For Dropshot to be able to turn a trait implementation into a functioning server, some bounds must be specified:

1. All endpoint methods, which have been verified to be async fn in [endpoint_verification_details], must return Send + 'static futures.
2. Self: Sized + 'static.
3. Self::Context: dropshot::ServerContext (where ServerContext is an alias for Send + Sync + Sized + 'static).

Should we insert these automatically, or require that users specify them?

For 1, there isn’t really a choice: there’s nowhere to actually write the Send bound, so we must insert these bounds automatically. (The implementation is borrowed from trait_variant.)

For 2 and 3, we choose to insert trait bounds automatically, for consistency with 1. Idiomatic servers are almost always going to respect these bounds. As a bonus, there are fewer error cases to consider.
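To make the three bounds concrete, here is a plain-Rust approximation (Rust 1.75 or later) of what a trait could look like once they are inserted, in the spirit of the trait_variant-style rewrite. This is a hypothetical sketch, not the macro’s actual output: ServerContext is modeled as a trait with a blanket impl, Arc<Self::Context> stands in for RequestContext<Self::Context>, and block_on is a toy executor. Note that the implementation can still be written as a plain async fn.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::task::{Poll, RawWaker, RawWakerVTable, Waker};

/// Hypothetical stand-in for `dropshot::ServerContext` (bound 3).
trait ServerContext: Send + Sync + Sized + 'static {}
impl<T: Send + Sync + Sized + 'static> ServerContext for T {}

/// `Self: Sized + 'static` (bound 2) and an explicit `Send + 'static`
/// future in place of `async fn get_counter(...) -> u64;` (bound 1).
trait CounterApi: Sized + 'static {
    type Context: ServerContext;

    fn get_counter(ctx: Arc<Self::Context>) -> impl Future<Output = u64> + Send + 'static;
}

/// Implementations can still write `async fn`; the compiler checks that the
/// resulting future really is `Send + 'static`.
struct MyImpl;
impl CounterApi for MyImpl {
    type Context = AtomicU64;

    async fn get_counter(ctx: Arc<AtomicU64>) -> u64 {
        ctx.load(Ordering::SeqCst)
    }
}

/// Thanks to the bounds, generic code may move endpoint futures to another
/// thread, as a multithreaded server may need to do.
fn run_on_another_thread<T: CounterApi>(ctx: Arc<T::Context>) -> u64 {
    let fut = T::get_counter(ctx);
    std::thread::spawn(move || block_on(fut)).join().unwrap()
}

/// Toy executor: polls with a no-op waker until the future completes.
fn block_on<F: Future>(fut: F) -> F::Output {
    fn noop_clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(noop_clone, noop, noop, noop);

    // SAFETY: the vtable functions never dereference the (null) data pointer.
    let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) };
    let mut cx = std::task::Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

Without the Send bound on the returned future, the std::thread::spawn call in the generic function would not compile.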

Code generation in case of errors

If any of the checks in [syntactic_constraints] fail, the macro ensures that compile errors are always generated. Beyond that, there are a few options for what to do:

1. Simply don’t generate the trait at all.
2. If the trait as a whole is okay but endpoints have errors, don’t generate those specific endpoints.
3. Always generate all items, even those that have errors, but do not generate the support module.
4. Always generate all items, and also generate the support module, but replace their implementations with a panic!.

To the greatest extent possible, we choose option 4. This is primarily motivated by developer experience with rust-analyzer, which deals significantly better with a trait or method that exists than with one that doesn’t. Experimentation has shown that this is a better user experience overall.

To emphasize: there will still be compile errors. But rust-analyzer can continue to do its analysis even in the face of them, as long as the item exists in some form.

Special case: syntax error in method body

This is a general problem with any code annotated with a proc macro.

Consider a trait which has an associated method (possibly an endpoint, possibly not) with a default implementation. This implementation has a syntax error within it. For example:

#[dropshot::api_description]
trait MyApi {
    type Context;

    #[endpoint { ... }]
    async fn my_endpoint(rqctx: RequestContext<Self::Context>) -> Result<HttpResponseUpdatedNoContent, HttpError> {
        "123" 456
    }
}

Most proc macros use syn, which, as of version 2.0.65, attempts to parse the syntax tree completely. So errors will be reported by syn, and no macro output will be generated. This ends up being suboptimal, particularly with rust-analyzer.

The #[dropshot::api_description] macro, however, does not need to inspect the contents of function bodies. So this RFD proposes not attempting to parse function bodies, and instead passing them through. Then, the Rust compiler will generate errors for them.

This is an expansion of the ItemFnForSignature approach already taken by Dropshot. See Dropshot issue #134 for more details.
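The idea can be illustrated with a toy model that works on source text rather than on a proc_macro token stream: split a method at the first { that isn’t nested inside parentheses or brackets, treat everything before it as the signature to inspect, and hand the body back verbatim, even when it contains a syntax error. The function below is hypothetical and ignores cases a real implementation must handle (for example, braces inside const generic arguments); a real macro would presumably keep the body as an unparsed brace-delimited token group instead.

```rust
/// Toy model: split a method into (signature, body) at the first `{` that is
/// not nested inside `(...)` or `[...]`. Returns `None` for a required method
/// (no body). The body is returned verbatim and never parsed.
fn split_signature_and_body(item: &str) -> Option<(&str, &str)> {
    let mut depth = 0usize;
    for (i, c) in item.char_indices() {
        match c {
            '(' | '[' => depth += 1,
            ')' | ']' => depth = depth.saturating_sub(1),
            '{' if depth == 0 => return Some((item[..i].trim_end(), &item[i..])),
            _ => {}
        }
    }
    None
}
```

Only the signature half needs to be inspected; the body half is emitted unchanged, so rustc, not the macro, reports the syntax error inside it.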

Tag configuration

With Dropshot, API endpoints can have tags specified for them. Dropshot also allows the specification of global tag constraints via tag configuration.

Tag configuration consists of settings like:

Allowing only a specific set of tags.
Requiring that each endpoint have at least one tag, or exactly one tag.
Providing a description, and a link to documentation, regarding each tag.

With API traits, the tag configuration is provided as the tag_config argument to the api_description macro. For example:

use dropshot::EndpointTagPolicy;

#[dropshot::api_description {
    tag_config = {
        tags = {
            "tag1" = {
                description = "Tag 1",
                external_docs = {
                    description = "External docs for tag1",
                    url = "https://example.com/tag1",
                },
            },
        },
        policy = EndpointTagPolicy::ExactlyOne,
        allow_other_tags = false,
    },
}]
trait MyTrait {
    // ...
}
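Conceptually, the tag configuration boils down to a set of per-endpoint checks. The sketch below models the policy and allow_other_tags settings in plain Rust. TagConfig, TagPolicy, and check_endpoint_tags are hypothetical names (TagPolicy only loosely mirrors dropshot::EndpointTagPolicy), and tag descriptions and external docs are omitted because they don’t affect validation.

```rust
/// Loosely mirrors the tag policies described above.
#[derive(Clone, Copy, Debug, PartialEq)]
enum TagPolicy {
    /// No constraint on the number of tags.
    Any,
    /// Every endpoint needs at least one tag.
    AtLeastOne,
    /// Every endpoint needs exactly one tag.
    ExactlyOne,
}

struct TagConfig {
    /// Tags declared in the configuration.
    tags: Vec<&'static str>,
    policy: TagPolicy,
    /// Whether endpoints may use tags not declared in `tags`.
    allow_other_tags: bool,
}

/// Validates one endpoint's tags against the configuration.
fn check_endpoint_tags(config: &TagConfig, endpoint: &str, tags: &[&str]) -> Result<(), String> {
    match config.policy {
        TagPolicy::AtLeastOne if tags.is_empty() => {
            return Err(format!("{endpoint}: at least one tag is required"));
        }
        TagPolicy::ExactlyOne if tags.len() != 1 => {
            return Err(format!(
                "{endpoint}: exactly one tag is required, found {}",
                tags.len()
            ));
        }
        _ => {}
    }
    if !config.allow_other_tags {
        let unknown = tags
            .iter()
            .find(|tag| !config.tags.iter().any(|known| known == *tag));
        if let Some(unknown) = unknown {
            return Err(format!("{endpoint}: tag {unknown:?} is not in the tag configuration"));
        }
    }
    Ok(())
}
```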

Footnotes

1
Currently, Dropshot supports two different kinds of functions: HTTP endpoints with the #[endpoint] macro, and WebSocket channels with the #[channel] macro. API traits do not have any special implications for channels beyond the ones already for endpoints. So, to keep things simple, the rest of the RFD is going to use "endpoints" to talk about both cases.
2
Throughout this document, we use the term API to mean a collection of supported HTTP and WebSocket methods with corresponding parameters and response types, as well as associated metadata like a name and version number. This is distinct from server, which is a combination of an API, its implementation, and the specific location (address/port number/SSL) it is associated with.
3
If there’s just a single implementation of the trait, it is still one more time than function-based APIs. But in return, developers get all the benefits discussed above, and adding a second implementation is also easy.
4
There is one exception to this: the undocumented _dropshot_crate parameter. That parameter will instead be accepted by the top-level dropshot::api_description macro.
5
The name Context is conventional, and we can choose to let users override it.
6
Note that we do not support the impl Future<Output = …> syntax, just the async fn syntax. This makes it easier to implement some of the semantic checks below.
7
The one apparent exception is the WebSocket connection type for channels, which the author is still investigating.
