RFD 223 Web Console Architecture

Background

RFD 169 originally combined two related questions in an unwieldy way: console authentication and the console’s client-server architecture more generally. This RFD pulls the latter discussion out of RFD 169.

A common architecture for a frontend app that depends on a more general-purpose API is to have a dedicated server app for the frontend that handles sessions and proxies requests through to the API. For JS applications this is also used for optimizations like server-side rendering, which requires executing on the server the same UI code that runs in the browser. This pattern has been called "backend for frontend" and is discussed from a security perspective in [oauth-2.0-browser]. For a typical cloud app development scenario, a dedicated backend would be a natural choice for a control panel like the Console.

In the case of the Oxide rack, however, a key part of the technical vision is that we are running a locked-down mostly-Rust software stack. Long experience from our ex-Joyent colleagues informs a strong reluctance to let Node and the npm ecosystem anywhere near the rack internals. We would like to satisfy those constraints without compromising the quality of the web console.

Determinations

Because of the risks of exposing the rack to Node.js and the npm ecosystem, we will go as far as we can with the console as a static JS bundle served by Nexus. This corresponds to the "browser-only" architecture below. As discussed in more detail in RFD 169, this means Nexus must accept session cookies as an authentication mechanism so the console can make API requests directly from the browser.

We will revisit this determination if there are features we want to deliver that require JS on the server, but the user experience needs to be significantly better than baseline to justify the work of containing the risk. Approaches to running JS on the server without Node.js, whether through Deno or direct use of Deno’s Rust bindings for V8, may be a lot more palatable than Node. The potential UX benefits of JS on the server are discussed in detail below. The short version is that they are nice-to-haves with clear value, but we believe the console can still be very good without them.

The console is being built in such a way that adding server integration later on will not result in code being thrown out. We can safely follow the browser-only path without incurring technical debt.

Architectures

Browser-only

The console is a static JS bundle that runs in the browser and calls Nexus directly. This has numerous advantages. There is no JavaScript running on the server. There is nothing but static files to deploy and no extra servers to monitor.

This option requires some sort of access token to be stored in the browser so the console app can talk to Nexus. The standard approach is to store a relatively short-lived session ID in a cookie and have the server maintain a table linking session IDs to user IDs and some metadata to help with expiration. Using the HttpOnly attribute on the cookie prevents malicious scripts from stealing it, but because cookies are automatically sent along with requests to our domain, we also need to use standard CSRF mitigations. See [rfd169] for a detailed discussion.
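As a rough illustration of the session table and cookie attributes described above, here is a minimal sketch. The names (SessionTable, sessionCookie, the cookie name session) are hypothetical, the table is in memory rather than in a database, and SameSite=Strict is only one of the standard CSRF mitigations; none of this is taken from the Nexus implementation.

```typescript
// Hypothetical session record: which user it belongs to and when it expires.
interface Session {
  userId: string;
  expiresAt: number; // epoch milliseconds
}

// In-memory stand-in for the server-side session table.
class SessionTable {
  private sessions = new Map<string, Session>();
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Record a session. The ID is assumed to come from a secure random source.
  create(sessionId: string, userId: string, now: number): Session {
    const session: Session = { userId, expiresAt: now + this.ttlMs };
    this.sessions.set(sessionId, session);
    return session;
  }

  // Return the user ID for a live session, or null if unknown or expired.
  lookup(sessionId: string, now: number): string | null {
    const session = this.sessions.get(sessionId);
    if (session === undefined) return null;
    if (now >= session.expiresAt) {
      this.sessions.delete(sessionId);
      return null;
    }
    return session.userId;
  }
}

// Build a Set-Cookie header value. HttpOnly hides the cookie from scripts,
// Secure restricts it to HTTPS, and SameSite=Strict limits cross-site sends.
function sessionCookie(sessionId: string, ttlMs: number): string {
  const maxAge = Math.floor(ttlMs / 1000);
  return `session=${sessionId}; Path=/; Max-Age=${maxAge}; HttpOnly; Secure; SameSite=Strict`;
}
```

The clock is passed in explicitly so that expiration is easy to reason about; a real server would read the current time itself.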

Console server with DB connection

Architecture diagram

A console server would manage sessions and proxy authenticated API requests from the browser client to Nexus. We would use the session cookie to look up the session, which would point to an API token, and then we’d forward the request to Nexus with the token attached.
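The proxy step can be sketched roughly as follows. This is an illustration only: the cookie name session, the Bearer scheme, and the lookupToken callback are assumptions, and header names are assumed to be lowercased already.

```typescript
// Hypothetical lookup from session ID to the API token stored for that session.
type TokenLookup = (sessionId: string) => string | null;

// Extract one cookie value from a Cookie header such as "a=1; session=xyz".
function readCookie(header: string, name: string): string | null {
  for (const part of header.split(";")) {
    const trimmed = part.trim();
    const eq = trimmed.indexOf("=");
    if (eq > 0 && trimmed.slice(0, eq) === name) {
      return trimmed.slice(eq + 1);
    }
  }
  return null;
}

// Compute the headers to forward to Nexus, or null if the request is
// unauthenticated. The browser's cookie is dropped and the token attached.
function forwardHeaders(
  incoming: Record<string, string>,
  lookupToken: TokenLookup,
): Record<string, string> | null {
  const sessionId = readCookie(incoming["cookie"] ?? "", "session");
  if (sessionId === null) return null;
  const token = lookupToken(sessionId);
  if (token === null) return null;
  const headers: Record<string, string> = { ...incoming };
  delete headers["cookie"];
  headers["authorization"] = `Bearer ${token}`;
  return headers;
}
```

The browser never sees the API token in this arrangement; it only ever holds the session cookie.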

A console server written in Rust would be limited to the following functionality:

Serving JS bundle(s) and other static assets
Session management
Auth redirect and auth code exchange
Proxying API requests from the browser to Nexus (attaching a token)

A console server written in JS could also provide some of the features detailed below.

Rather than giving the console server its own key-value store for sessions, we would rely on the CockroachDB instance which is already set up for the control plane in order to take advantage of replication.

DDoS risk and mitigations

Having a DB connection opens the console up to DDoS risk. Possible mitigations:

Make session token something we can validate before looking it up in the DB
Possible throttling setups
- nginx in front of both Nexus and console
- Nexus is the shield, put console behind it
- A Rust console server could share Nexus throttling logic
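None of the setups above names a throttling algorithm. As one common possibility, here is a token bucket sketch that a server could consult before doing a session lookup in the DB. The class name, parameters, and the choice of token bucket are all assumptions for illustration, not anything Nexus is known to use.

```typescript
// Token bucket: a client may burst up to `capacity` requests, after which it
// is limited to `refillPerSec` requests per second.
class TokenBucket {
  private tokens: number;
  private lastRefill: number; // epoch milliseconds
  private capacity: number;
  private refillPerSec: number;

  constructor(capacity: number, refillPerSec: number, now: number) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity;
    this.lastRefill = now;
  }

  // Returns true if the request may proceed, false if it should be rejected.
  tryAcquire(now: number): boolean {
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

In practice there would be one bucket per client (keyed by address or session), and rejected requests would never reach the database.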

Console server without DB connection

Architecture diagram

Rather than talking directly to CockroachDB for session storage, the console could use client credentials ([oauth-2.0-spec] §4.4) to hit a dedicated Nexus sessions endpoint with a session ID and get back a token. This is not very different from the Cockroach solution — simply replace the arrow from console server to CockroachDB with an arrow to Nexus — and has some advantages:

We don’t have to set up and worry about a database connection from the console server
We can centralize throttling logic inside Nexus rather than worry about the console server overwhelming CockroachDB (see the DDoS mitigations section)

Console server with its own DB

This is only included for completeness. If the console server had its own database, it would have to be replicated across racks, a problem we are already doing a bunch of work to solve with the control plane CockroachDB instance.

Node.js risks

Libraries on npm tend to have a lot of dependencies, and those dependencies are often shoddy and have vulnerabilities. When vulnerabilities are patched, the fixes often take a while to reach whatever is pulling those dependencies in.

Console dependency tree stats

The size of the Console’s dependency tree is quite large. To give a sense of the order of magnitude:

$ yarn list | wc -l
7546

By contrast, in omicron at time of writing:

$ cargo tree | wc -l
1022

Note that the Console number overstates things a bit: there are many duplicates in the list, and many of the dependencies (e.g., Storybook, which has a ton of transitive dependencies) are only for development and wouldn’t make it into either the static bundle or any putative server code. So more significant than the sheer number of dependencies is the typical quality.

Node itself is also likely to contain vulnerabilities and the pace of development on that project may be slowing. We can mitigate these issues by locking dependency versions and keeping our own copies of everything, as well as by running Node in its own illumos zone or even in a VM, but this all takes work and complicates the deployment and upgrade story for Nexus.

Server-side JS without Node

Running a server with Deno

The security issues with Node and npm have prompted the development of alternative JS runtimes like Deno, which is written in Rust and wraps V8. Deno would probably be a lot safer than Node for running a long-running server. Deno does not use npm, and making apps written for Node work on Deno takes some work.

Executing JS scripts directly with V8

However, we can probably get all the benefits of server-side JS without a long-running server. Next.js, SvelteKit, and Remix are designed not to rely on Node specifically, and can also run in so-called "serverless" environments like Cloudflare Workers, which execute JavaScript using the V8 engine directly.

The Deno team maintains high quality Rust bindings for V8 that are core to their product. We may be able to use that to execute JS on the server without bringing in Node or Deno. There is prior art for using the V8 Rust bindings to render React components server-side. This is a bit out there, but could make sense for us as an incremental optimization on top of a solid client-side console. It could also make sense in the context of adding a general serverless worker feature to the rack.

Benefits of server-side JS

Shared code

This is the weakest advantage, so it’s only first to get it out of the way. Sometimes you want to run the exact same code on client and server. The most common example is input validation: you have to validate input on the server no matter what because you can’t trust clients, but ideally you can also run the same validation client-side so the user doesn’t need a server round-trip to know their input is invalid. In order to share this code, you need to be running JS on the server.
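As a sketch of what shared validation looks like, the function below could be imported by both a form component in the browser and a request handler on a JS server. The naming rule itself is invented for illustration and is not the actual rule Nexus enforces.

```typescript
// Hypothetical naming rule: 1 to 63 characters, lowercase letters, digits,
// and hyphens, starting with a letter.
const NAME_PATTERN = /^[a-z][a-z0-9-]{0,62}$/;

// Returns an error message, or null when the name is valid.
function validateName(name: string): string | null {
  if (name.length === 0) return "Name is required";
  if (name.length > 63) return "Name must be at most 63 characters";
  if (!NAME_PATTERN.test(name)) {
    return "Name must start with a letter and contain only lowercase letters, digits, and hyphens";
  }
  return null;
}
```

With a Rust server, the same rule has to be written twice, once in each language, and kept in sync by hand or by code generation.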

Loading spinner problems

There is a basic set of problems client-side web apps tend to have around data fetching. They don’t fundamentally break an app, but they can degrade user experience a little or a lot depending on how many components on the page are fetching data. There are client-side mitigations, but they tend to be partial and/or complicated, so the state of the art here is various forms of server integration. Here I characterize the loading spinner problems and several server-side mitigations: server-side rendering, server-integrated frameworks, and Server Components. These solutions are all React-specific because that’s what we’re using and what I’m most familiar with, and all require Node because they rely on running React code on the server. However, this problem is common to all client-side web apps, and once you get the shape of the problem in React world you can easily find similar approaches in other JS frameworks as well as the Rust + Wasm frameworks. For example, SvelteKit has file-based routing and async navigation like Remix and Next.js.

Over time single-page app devs have come to understand that letting components define their own data requirements by fetching data at render time is the best way to architect apps for ease of reasoning. (Alternate approaches where data requirements are spelled out in a central place and components ask for bits and pieces have led to a lot of pain.) But doing data fetching at the component level comes with downsides:

If every component triggers its own fetch, every component has its own loading spinner and they all finish at different times. You could end up with 20 loading spinners on a complex page.
For fast requests (which should be most of them) loading spinners flash very briefly, almost too fast to see, which is a worse experience than if the UI could somehow hang for 200ms while the request comes back and just render when it’s done.
Request waterfalls: if a component fetches data at render time and conditionally renders a child based on the result, and that child has its own fetch, the child’s fetch won’t kick off until after the parent’s has completed. This can lead to a cascade of requests happening in series rather than in parallel.
Components often have overlapping data requirements, which means you either fetch the same data twice or make things much more complex by pulling data fetching out of the individual components so you can make a single request and pass the data down.
- This sub-problem is more or less solved by tools like React Query (which we are using) and SWR through a client-side global request cache with a TTL of a few seconds, where the cache key is the request URL. Every component fetches its own data and the library handles de-duping globally.
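The de-duping idea can be sketched in a few lines. This is a toy model in the spirit of React Query and SWR, not their actual API: RequestCache and its parameters are hypothetical, and the fetcher and clock are injected so the behavior is easy to follow.

```typescript
// Minimal request cache keyed by URL with a time-to-live.
class RequestCache<T> {
  private entries = new Map<string, { promise: Promise<T>; fetchedAt: number }>();
  private fetcher: (url: string) => Promise<T>;
  private ttlMs: number;

  constructor(fetcher: (url: string) => Promise<T>, ttlMs: number) {
    this.fetcher = fetcher;
    this.ttlMs = ttlMs;
  }

  // Return the cached (possibly still in-flight) promise if it is fresh;
  // otherwise start a new request and cache its promise.
  get(url: string, now: number): Promise<T> {
    const hit = this.entries.get(url);
    if (hit !== undefined && now - hit.fetchedAt < this.ttlMs) {
      return hit.promise;
    }
    const promise = this.fetcher(url);
    this.entries.set(url, { promise, fetchedAt: now });
    return promise;
  }
}
```

Two components asking for the same URL within the TTL share one request. Real libraries add a great deal on top of this, including error handling, invalidation, and refetching.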

Server-side rendering (SSR)

SSR means doing the work required to render the page server-side (including data fetching, in sophisticated implementations) so that initial pageload can be a complete page with no loading spinners. The result of the initial render is sent to the browser as HTML, but then React spins up on the client and everything from there is business as usual. Time to paint is faster because API requests originating from the frontend server have fewer hops (maybe the server is faster at rendering too, but that’s not the main factor). SSRed HTML can also be cached (server-side or client-side), but for data-driven views of live systems like ours, this is of limited utility because cached HTML gets outdated quickly.

Note that SSR is a technical term with a narrow meaning. The other two solutions below also involve something that could be called server-side rendering, but typically SSR refers only to the initial page load. This is one reason it’s not a slam dunk for improving perceived performance: it only affects the initial pageload and not subsequent client-side navigation (i.e., most navigation), which is still going to use async fetches and loading states. The reason it only affects initial pageload is because server integration with the rest of the app’s lifecycle requires more substantial architecture changes, while SSR of the initial render can usually be turned on with just a few lines of code.

Another reason it’s not a slam dunk is that perceived performance with SSR may actually be worse (depending on request latencies) due to user psychology around partial loading and spinners. For example, a client-rendered site where a page shell with a loading state shows up in 200ms and content pops in at 800ms might feel the same as or faster than an SSR page that takes 500ms to do both steps simultaneously because, in the latter case, the user sees nothing for the entire 500ms whereas in the former there is at least some response at 200ms. SSR has to be much faster than CSR in order to have a clear perceived performance benefit.

Server components

In December 2020, the React team previewed a new feature of React currently in development called Server Components [server-components]. They’ve been working on it since then, but I’m not aware of any indication of when it will be ready. With SSR, the entire app is rendered server-side initially and after that the entire app is rendered client-side. Server Components allow you to tell parts of the app to always render on the server: the source for those components is never sent to the client, only the result of rendering. Future renders also have to happen on the server. The idea is that highly interactive parts remain on the client, and parts that render infrequently (or for which renders are triggered by HTTP requests already) go on the server.

If this sounds like classic server-rendered HTML, it should: the goal of Server Components is explicitly to preserve the capability to make native-feeling web apps afforded by SPA frameworks while making them work more like classic web apps when prudent. There is some consensus across languages that this is likely to be a valuable approach going forward: Server Components are quite a bit like Phoenix LiveView over in Elixir/Erlang world and Laravel Livewire in PHP land. It’s all rather complex and only tangentially related to this RFD, so don’t worry too much about it, but I’m including details below for anyone interested.

Server Components address the loading spinner problems by moving all API requests server-side where they will presumably have much shorter round trips. The basic structure is the same (the requests can still happen in series rather than in parallel), but everything is faster and renders all at once at the end. Here is how Server Components address each part of the problem:

Because data fetching is batched up server-side, there doesn’t need to be a loading state for every component on the page. The page is only rendered when all the data is ready.
- If the entire page is taking too long, you can have a page-level loading indicator
- If there are individual API requests that are expected to be very slow, you can render the rest of the page server-side and kick off the slow request client-side with a loading state just as before
- No loading spinners also means no quick-flashing spinners
Waterfalls are still there, but fast data fetching means a sequence of 3 requests doesn’t take 600ms
Overlapping data fetches matter a lot less (up to a point) because the raw data is not being sent over the wire to the client, only the result of rendering is
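To make the waterfall arithmetic concrete, here is a back-of-the-envelope sketch. The 600ms figure above implies roughly 200ms per browser round trip; the 10ms server-to-API latency used in the usage note is purely an assumption, and both function names are hypothetical.

```typescript
// Browser-driven waterfall: every request in the chain pays a full
// browser-to-API round trip.
function clientWaterfallMs(requests: number, browserRttMs: number): number {
  return requests * browserRttMs;
}

// Server-driven waterfall: the browser pays one round trip to the server,
// which then makes the chained requests over a much shorter hop.
function serverWaterfallMs(requests: number, browserRttMs: number, serverRttMs: number): number {
  return browserRttMs + requests * serverRttMs;
}
```

Under those assumptions, clientWaterfallMs(3, 200) is 600 and serverWaterfallMs(3, 200, 10) is 230. The waterfall is still there in the second case, it just costs far less.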

My overall take is that there’s a lot of complexity here and a lot of risk for apps to overdo it. Deciding what goes on the server and what goes on the client can add a lot of conceptual overhead. I expect that the best use of server components will be made by library developers who expose a constrained version of it to app developers, rather than app developers using them directly ad hoc. Regardless, taking advantage of this feature (when it comes out) whether directly or through mediating libraries would require a Node server.

Frameworks with server integration

Remix is just one framework, but I’m going to let it stand in for the best of what’s possible right now. It uses file-based routing like Next.js, but it’s simpler and more grounded in fundamental web APIs like Fetch [remix-next-1] [remix-next-2]. I’m presenting these as an alternative to Server Components, but they will be able to integrate and take advantage of Server Components without app devs having to change their code.

React by itself is sort of the View and the View-Controller layer of the old MVC model, but it is silent on two essential app functions: routing and data fetching. Countless libraries for these are available. What we’re using in the console right now is React Router (by far the most popular option) for the former and React Query for the latter.

Remix is a library by the React Router devs that solves routing and data fetching together in an integrated way. In a Remix app, you define the data requirements for each route in a loader function that runs server-side and sends its result down to the client. If a route needs to make multiple requests and combine them, only the aggregated result of that is sent to the client. "Route" somewhat counterintuitively here means the set of layouts corresponding to the set of route segments. For example, in the route /team/david there would be a file for the /team part (e.g., a page wrapper) and a file for the /david part (the page content) and each can specify its own data requirements. The data requirements for /team/david are simply the respective data requirements for /team and /david, and these two loaders run in parallel on the server.
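The nested-loader idea can be sketched as follows. This is not Remix's API: Loader, segmentPaths, and loadRoute are hypothetical names, and real route matching handles parameters, index routes, and much more.

```typescript
// A loader returns the data one route segment needs.
type Loader = () => Promise<unknown>;

// Split "/team/david" into the nested segment paths ["/team", "/team/david"].
function segmentPaths(path: string): string[] {
  const parts = path.split("/").filter((p) => p.length > 0);
  return parts.map((_, i) => "/" + parts.slice(0, i + 1).join("/"));
}

// Start every matching loader before awaiting any of them, so the loaders
// for all segments of the route run in parallel.
function loadRoute(path: string, loaders: Map<string, Loader>): Promise<unknown[]> {
  const matched = segmentPaths(path)
    .map((p) => loaders.get(p))
    .filter((l): l is Loader => l !== undefined);
  return Promise.all(matched.map((loader) => loader()));
}
```

Because the set of loaders for a URL is known from the route alone, nothing has to render before the fetches begin, which is what removes the waterfall.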

A second key feature is that when the user navigates to a route, the client does not show the target page until after the loaders have completed. In plain React, the pattern is: switch immediately to the new view, kick off fetches (with loading spinners), then render when the data comes back. In Remix it’s: kick off all loaders (show a global loading indicator if you want, but by default do not) and only render the new page after the data comes back.

There is more to Remix, but route-based data requirements and async navigation are sufficient to address the loading spinner problems:

Batching requests server-side means nothing is rendered until all data fetching is done, which means no loading spinners at all. It works a lot more like a classic server app: the user clicks a link, a request goes out, and only when it comes back does the next view show up.
- As with server components, if you have a really slow fetch you can make sure it gets kicked off client side so it doesn’t hold up the rest of the page.
- No spinners means no too-fast spinner problem either.
No waterfalls: file-based nested routes mean data requirements for a given route are statically known, which means all data fetches can be kicked off in parallel rather than in series.
There might still be overlapping data fetches, but components coordinate their needs at the route level, so there should be less of that. Fetches are happening server-side, so while some extra data is sent to the client, most of the back-and-forth is happening server to server.

React alternatives in Rust + WebAssembly

There is an emerging category of tooling that compiles Rust to WebAssembly in order to write single-page browser apps in Rust [rust-web-frameworks]. Because these tools produce client-side bundles, they don’t address the basic question of this RFD, namely whether we need a server for the console. However, because (some of) these tools allow server-side rendering in a Rust server, they might reduce the advantage of Node over Rust in that regard.

However, these tools have major downsides:

Ecosystem around the browser domain is immature compared to JS
- There is JS interop, but that kind of defeats the purpose of using Rust (if the purpose was to not use Node): in order to run that code on the server, you’d have to be running Node there in some form
Hard to hire for because almost no one is writing frontends in Rust
Porting existing code to a slightly different data model is probably harder than it sounds
The raw performance of WebAssembly is useful for doing large amounts of computation in the browser (e.g., [wasm-case-study]) but not for most web UIs. Perceived performance in web UIs is more often about HTTP request latency, large assets, or failure to debounce input events like scrolling.

Given these downsides, it does not seem likely we will want to build the entire console with one of these frameworks. The most plausible use case would be a highly performance-sensitive feature that needs to run in the browser and can’t be made to work in JS. This would mean embedding WebAssembly within a JS app. If we have a problem like that, we’ll know.

References

[oauth-2.0-browser] OAuth 2.0 for Browser-Based Apps (draft). 17 May 2021. https://datatracker.ietf.org/doc/html/draft-ietf-oauth-browser-based-apps-08.
[rust-web-frameworks] Rust web framework comparison. https://awesomeopensource.com/project/flosse/rust-web-framework-comparison#frontend-frameworks-wasm.
[wasm-case-study] How We Used WebAssembly To Speed Up Our Web App By 20X (Case Study). https://www.smashingmagazine.com/2019/04/webassembly-speed-web-app/.
[server-components] Introducing Zero-Bundle-Size React Server Components. 21 December 2020. https://reactjs.org/blog/2020/12/21/data-fetching-with-react-server-components.html.
[remix-next-1] Remix vs Next.js Comparison. https://sergiodxa.com/articles/remix-vs-next-js-comparison.
[remix-next-2] How is Remix different from Next.js. 29 Oct 2020. https://dev.to/dabit3/how-is-remix-different-from-next-js-4jd1.
