RFD 297
Silos and API Resources

[rfd234] describes in detail the motivation for Silos but leaves ambiguous how API resources like Organizations, Projects, or Sleds relate to Silos. This (overdue) RFD describes the approach taken within Omicron. Because much of this is already implemented, there’s some cost to changing it; but that cost will only grow, so if we’re going to change direction, now is the best time.

Determinations

  1. API Resources may be either siloed or non-siloed.

    1. Organizations and everything in them are examples of siloed resources. Siloed resources are virtualized within each Silo. That is: each Silo has a totally different view of its siloed resources. That means that if you fetch /organizations in two different Silos, you’ll see two totally different sets of things. Siloed resources cannot be shared between Silos. There’s no way to even talk about a siloed resource from a different Silo because its name (e.g., /organizations/dev-team) would refer to something else in the other Silo.

    2. Operator resources like Racks, Sleds, and external networking configuration are examples of non-siloed resources. There are non-operator, non-siloed resources too, like global images. All non-siloed resources look the same in all Silos, often because they correspond to a real-world thing (like a Sled).

    Most of the examples of siloed resources are intended for end users (e.g., Instances, Disks, etc.) and most of the examples of non-siloed resources are intended for operators. But neither role nor permission level defines whether a resource is siloed. For example, operators might want to provide a global list of base images. That’s a non-siloed resource that would be used by end users. See [_which_api_resources_are_siloed_today] for examples.

  2. The endpoints for non-siloed resources appear in every Silo. They’re protected by normal access control. So a Silo intended only for end users just doesn’t have any users with access to these resources.
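
To make the distinction concrete, here’s a minimal sketch of how a lookup for a siloed resource differs from one for a non-siloed resource. This is illustrative only, not Omicron’s actual code; all of the type and function names here are hypothetical.

    use std::collections::HashMap;

    // Hypothetical types for illustration -- not Omicron's actual data model.
    struct Organization { name: String }
    struct Sled { serial: String }

    struct Datastore {
        // Siloed resources are keyed by (silo_id, name): each Silo gets a
        // totally different view, and names only mean something within a Silo.
        organizations: HashMap<(u64, String), Organization>,
        // Non-siloed resources have a Silo-independent key and look the same
        // from every Silo.
        sleds: HashMap<String, Sled>,
    }

    struct AuthnContext { silo_id: u64 }

    impl Datastore {
        // "/organizations/dev-team" resolves differently in each Silo because
        // the caller's Silo is part of the lookup key.
        fn organization_fetch(&self, ctx: &AuthnContext, name: &str) -> Option<&Organization> {
            self.organizations.get(&(ctx.silo_id, name.to_owned()))
        }

        // Every Silo sees the same Sleds; ordinary access control determines
        // who may actually fetch them.
        fn sled_fetch(&self, _ctx: &AuthnContext, serial: &str) -> Option<&Sled> {
            self.sleds.get(serial)
        }
    }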

Open Questions

  • Should we allow an operator to control whether a Silo supports:

    • endpoints for non-siloed operator resources,

    • endpoints for siloed resources [and non-siloed end-user resources], or

    • both?

    This would be a Silo-level policy knob separate from access control. When endpoints are disabled with this knob, they would return 404 (as though they don’t exist) rather than 403 (which would imply insufficient privileges). They would 404 even if you were a Silo or Fleet administrator. We might also generate separate OpenAPI specs so that generated clients only include usable endpoints. For more on why we would do this, see [_each_silo_could_be_either_operator_or_not_operator].

    Separate from the merits of this idea is the question of prioritization. Also, if we believe end-user non-siloed resources are a real category, then this knob is really about hiding operator-only endpoints vs. end-user endpoints, which is independent of whether those endpoints are siloed.
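
    If we did build this knob, the important part is the error mapping. Here’s a minimal sketch (hypothetical names, not actual Omicron code) of a check that would run before ordinary access control:

        // Hypothetical sketch of the proposed knob's error semantics.
        enum ApiError {
            NotFound,  // 404: the endpoint appears not to exist at all
            Forbidden, // 403: the endpoint exists but the caller lacks privileges
        }

        // Runs before the normal authz check. A disabled endpoint returns 404
        // for everyone, including Silo and Fleet administrators, so it is
        // indistinguishable from an endpoint that doesn't exist.
        fn check_endpoint_enabled(enabled_in_this_silo: bool) -> Result<(), ApiError> {
            if enabled_in_this_silo {
                Ok(()) // fall through to normal access control (403 on failure)
            } else {
                Err(ApiError::NotFound)
            }
        }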

Discussion

Background on Silos

[rfd234] describes the basic idea of Silos in detail. It also describes how they relate to authentication and identity management. Silos combine two different things:

  • a realm of user identities that corresponds to an "app" in the identity provider (IdP), and

  • a container for resources that cannot be shared outside the Silo

These are more related than they seem because if you can share a resource with somebody, you need to be able to identify both the resource and the other user. That doesn’t mean they have to be all in the same realm, but it sure makes things a lot simpler.

We use two different metaphors for Silos:

  • Silos might be seen as virtual Oxide systems (i.e., separate racks, in terms of the MVP). The expectation is that each Silo will have its own DNS names for the console and API and so its own login URLs, too.[1] For an end user that has access to two Silos, the two Silos may as well be two totally separate Oxide racks.

  • Silos might also be seen as analogous to AWS accounts. Each contains a bunch of resources; if you have access to one, you can probably see a lot of the stuff in it (i.e., visibility tends not to be fine-grained); and things that are closely related tend to be in the same container. (In the case of Silos, they have to be because there’s essentially no cross-Silo sharing.)

It’s important to remember that every user identity is scoped to a Silo. That includes the very initial user(s) created during rack setup, accounts used day-to-day by operators to administer the system, end users, service accounts, etc. (Why? See [_operators_not_in_silos] under "Alternatives considered".)

We expect that for our first round of customers, most Silos will be configured so that all users can see each other as well as the Organizations in the Silo, and potentially all the Projects in all Organizations as well. Put differently: if you want to hide things from people, you’re best off putting them into different Silos. Concretely, a customer would do this by granting the "silo viewer" role to the "everyone" group from their IdP. This isn’t strictly required by the product, but the facilities needed for a good user experience without it aren’t likely to be there. For example, if you had access to a Project, you could go directly to it in the console; but we don’t have a way to list the Projects you have access to, so the top-level screen would just be blank.
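
As a sketch of what that grant might look like in terms of data, here’s the shape of a Silo-level policy with a single role assignment. The types below are loosely modeled on Omicron’s policy types, but the names are hypothetical, not the real API:

    // Hypothetical sketch of a Silo policy granting "viewer" to everyone.
    enum IdentityType { SiloUser, SiloGroup }

    struct RoleAssignment {
        identity_type: IdentityType,
        identity_id: String, // the IdP-provided "everyone" group
        role_name: String,   // "viewer", granted at the Silo level
    }

    struct SiloPolicy {
        role_assignments: Vec<RoleAssignment>,
    }

    fn everyone_is_viewer(everyone_group_id: &str) -> SiloPolicy {
        SiloPolicy {
            role_assignments: vec![RoleAssignment {
                identity_type: IdentityType::SiloGroup,
                identity_id: everyone_group_id.to_string(),
                role_name: "viewer".to_string(),
            }],
        }
    }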

[rfd288] covers organizing API resources to make clear which are siloed and which are not. (This is a little trickier than it sounds. The obvious solution would be to nest the resources under /silo, but an important security feature of Silos is that your Silo is encoded in your authentication credential, not the URL.)
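
A minimal sketch of that security property (again with hypothetical names): the Silo is fixed when the credential is issued, so nothing in the request URL can influence it:

    // Hypothetical sketch: the Silo comes from the credential, not the URL.
    struct Session {
        user_id: String,
        silo_id: u64, // bound at authentication time, immutable thereafter
    }

    // Handler for GET /organizations/{name}. There is no Silo in the path;
    // the lookup is scoped by the Silo baked into the session, so a caller
    // cannot even name (let alone fetch) another Silo's "dev-team".
    fn organizations_get(session: &Session, name: &str) -> String {
        format!("would fetch organization {:?} within silo {}", name, session.silo_id)
    }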

Example Silo deployments

Recall that every Oxide system has a Recovery Silo (see [rfd234]) with at least one account with local-only credentials. This is only intended for emergency fixes to the IdP config of one of the other Silos. We will ignore the Recovery Silo in these examples since it’s the same in all of them and we hope people never need to think about it.

In all of these examples, it’s possible to migrate IdPs by creating a new Silo with the new config and retiring the old one. In a situation like that, the deployment might have more Silos than described below.

Example: One Silo to rule them all (okay)

The simplest possible deployment would involve one Silo used for everything, including day-to-day operations as well as end-user infrastructure. This might be convenient for small-to-medium-sized organizations that all work together. For example, if Oxide in 2021 had an Oxide rack for internal use, we might go with this approach because we’d value easy collaboration and ease of deployment over the isolation benefits of separate Silos, even for operator use.

Example: One operations Silo, one end-user Silo (better?)

A common deployment might be two Silos:

  • an Operations Silo used day-to-day by operators when they want to look at hardware or manage other system-wide resources

  • one Silo used by end users for all Organizations, Projects, etc

Compared to the previous example:

  • Operators who also have any end-user infrastructure will have two identities to deal with. Since each Silo is expected to have its own login URL, they’ll log into the Operations Silo when acting as operators and into the end-user Silo otherwise. Many users like this (because it’s harder to accidentally make operator-level changes). Others don’t (because it’s more identities to manage).

  • Since each Silo shows up in the IdP as a separate app, there’s an extra level of control and auditing around who can access the Operations Silo.

  • At the same time, operators who need visibility into the end-user Silo (e.g., to help support end users) need an account in the IdP app for the end-user Silo so that they can see Organizations, Projects, etc. This in turn has pros and cons: the pro is that an audit of the end-user Silo app at the IdP covers all access to that Silo — there’s no alternate access path for rack operators. The con is that they may need to use both identities while debugging complex problems.

Within this option, the customer could choose whether to put operations-related Projects into the Operations Silo or not. We’d probably suggest that if they had operations-related Projects, they should put them into a separate end-user Silo (see the next option).

This deployment might make sense for small-to-medium-sized organizations that all work together, but are willing to deal with a bit more complexity to better isolate privileged resources.

Note
It’s not necessary to have multiple identity providers deployed in order to use multiple Silos. Identity providers can be configured with multiple apps or service providers. Each Silo looks like a different app in the IdP. From the end user’s perspective, they have one corporate SSO account. When logging into multiple Silos, they’ll be using that one account to log into what look like separate systems.

Example: One operations Silo, many end-user Silos (best?)

Here’s a nice setup:

  • an Operations Silo used day-to-day by operators when they want to look at hardware or manage other system-wide resources

  • a bunch of end-user Silos for different purposes or teams:

    • one for operations-related Projects (so that they’re not in the privileged Operations Silo)

    • one each for various teams (e.g., engineering, product, finance)

    • one each for production/development/testing

    • one each for business units to which the rack owners are "selling" infrastructure

    • one each for various projects that might be secret even within the company

This deployment is where Silos most resemble AWS accounts.

There are a lot of benefits to this approach:

  • all the benefits above: fine-grained control and auditing at the IdP of who has access to what

  • by keeping Projects out of the Operations Silo, it’s even harder for operators to accidentally do things with the wrong account

  • strong isolation for these various teams and use cases

The downside is just the complexity: each Silo needs a corresponding IdP app, and people with access to multiple Silos need to juggle multiple identities. But they’re still authenticated with corporate single sign-on (SSO) from the same IdP, and the different Silos would look to them like different systems, so this shouldn’t be too painful.

This deployment is suitable for any organization willing to tolerate the complexity, and especially for large organizations that have various groups that don’t need to know about each other. The separation could be for security as well as a better user experience for people (so they don’t have to wade through a lot of irrelevant stuff).

Which API resources are siloed today?

Items in parentheses have not been implemented yet.

Siloed resources include:

  • Organizations

    • Projects

      • Instances (including their network interfaces)

      • Disks and Snapshots

      • VPCs (including routers, routes, subnets, firewalls, etc.)

      • Project-local images

  • Users (from the IdP)

    • SSH keys

    • Device access tokens

    • (personal access tokens)

  • Groups (from the IdP)

  • (Service accounts)

Non-siloed resources aimed at operators include:

  • Racks

  • Sleds

  • Silos

  • Device authn requests (because we don’t know who they’re for until they’re verified)

  • Update Available Artifacts

  • (software update management)

  • (Other hardware: switches, PSCs, SPs)

  • (external network configuration)

  • Built-in users

Non-siloed resources aimed at end users include:

  • Global images

  • Built-in roles

Alternatives considered

Operators not in Silos

Why are operators scoped to a Silo at all? [rfd234] goes deeper into this. The short version is: we expect customers to have operators authenticate against their IdP (as opposed to some rack-local username and password). So there needs to be an IdP configuration for that. Silos are the unit of configuring IdPs. Why not have one "global" IdP config (in addition to Silos) that’s used for operators? Well, IdP configurations may need to change. If we had one global IdP config, any change to that IdP would be fraught — what if you accidentally lock everyone out? The way we’ve described things, changing the administrative IdP is easy: you create a new Silo pointing at your new IdP and grant whatever global privileges you want to whatever IdP groups you want. You now have a second realm with global system access and you can retire the first when you’re satisfied that you got it right.

One operator Silo

Okay, operators will have a Silo. But why not nominate one Silo as the "operations silo" and give that one (and only that one) access to non-siloed resources?

This has the same problem as [_operators_not_in_silos] where it’d be fraught if you ever needed to modify the IdP config of the operations silo. The natural extension is to allow a second Silo to be an operations silo… and that’s where we are today. Any silo can be used for operations — it’s just a matter of access control protecting the operations resources.

It’s also worth noting that even if we did this, there would likely still be non-siloed resources available within a Silo.

Each Silo could be either "operator" or "not operator"

What if, when you create a Silo, you had to designate it as either "for operations" or not? If you designate it as "for operations", then you cannot create siloed resources in it (e.g., Organizations, Projects, etc.). If you designate it as "not for operations", then the operations-related non-siloed resources (e.g., hardware resources) would not be visible at all in the Silo.

This isn’t quite as well-defined as it sounds. Users in the operations Silo might want to create API access tokens — well, those are effectively siloed resources. There’s really some new category of "siloed infrastructure resources" (which basically means "Organizations and Projects") that you’re not allowed to create in a designated-for-operations Silo.

This also doesn’t obviate the need for siloed and non-siloed resources because as mentioned in Determinations and [_which_api_resources_are_siloed_today] there are non-siloed resources that are accessible to end users.

It’s possible for a customer to deploy the system this way already by keeping a separate operations Silo in which they do not create any Organizations and using authz to limit access to endpoints. So why would we create this distinction? A few reasons:

  1. It would guide customers to this pattern that we think is better than the alternative (i.e., separate operator vs. end user realms).

  2. For new customers, it may be simpler to explain that they can have an Operations Silo and an End User Silo and encourage them to create both, as opposed to explaining that access control really depends on what permissions they set where.

  3. For customers that want to strictly isolate operator activity in its own Silo, they might like an extra knob that makes it harder for an operator to accidentally expose operator-only endpoints in the end user Silo.

It does make things more complicated, though, and it’s more work to set up (you need two IdP apps). We might imagine simpler deployments (even dev, test, demo, or trial systems) where it’s a lot more convenient to combine everything in one Silo. We could make this choice optional: a Silo could be only for operations, only for end-user resources, or both.

In terms of implementation cost: it seems fairly straightforward to add an immutable bit to each Silo that’s used for this purpose. Using that bit to determine API visibility might be error-prone (how would we add this check to exactly the right endpoints without making it easy to get it wrong?). It might also be confusing because the current OpenAPI spec would include endpoints that are not available in every Silo — more specifically, it would be a little wrong for every Silo. Maybe the answer is that we’d have two OpenAPI specs: one for operations Silos and one for non-operations Silos?
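
For concreteness, here’s a minimal sketch of that bit and the per-Silo choices described above. The enum and field names are invented for illustration, not Omicron’s actual types:

    // Hypothetical sketch of an immutable per-Silo designation.
    enum SiloKind {
        Operations, // only non-siloed operator endpoints are available
        EndUser,    // only siloed (and non-siloed end-user) endpoints
        Combined,   // both; convenient for dev, test, demo, or trial systems
    }

    struct Silo {
        name: String,
        kind: SiloKind, // set at creation time, never changed
    }

    impl Silo {
        // The error-prone part: every endpoint handler must consult the right
        // predicate, and a missed check silently exposes an endpoint.
        fn serves_operator_endpoints(&self) -> bool {
            matches!(self.kind, SiloKind::Operations | SiloKind::Combined)
        }
        fn serves_end_user_endpoints(&self) -> bool {
            matches!(self.kind, SiloKind::EndUser | SiloKind::Combined)
        }
    }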

…and we could also call it something else

Suppose we implement the previous idea and say that a Silo designated only for operations is called something else, like an "Ops Realm". So operators could create "Ops Realms" that (just like Silos) have their own IdP config, and their own login page, etc. On the plus side, we’d avoid overloading the term Silo. But we kind of need a new name for the thing that can be either an Ops Realm or a Silo — the thing that has its own login page, etc. This would also be rather a lot more work as we’d need a separate set of APIs and screens to list and manage these, even though in terms of managing them, they’re exactly like Silos.

Security Considerations

Security considerations are a major aspect of Silo design. See [rfd234] for some. Silos and access control enable operators of the Oxide system to limit visibility and access to both end-user resources (like Projects and Instances) and operational resources (like hardware and system software).

The choice to expose operator resources in end-user Silos could also be phrased as "using existing access control mechanisms to restrict access to these resources". That might be fine; or some customers might prefer the additional defense in depth afforded by not having these resources even available inside some Silos.

Like with many systems, we face the challenge of balancing fine-grained access control (including visibility) with flexibility and easy collaboration. These are by no means mutually exclusive, but features intended for one can often be misused to undermine the other. The choice to even allow people to expose operator resources in standard end-user Silos (even when protected by access control) gives people choices in deploying the system. Some of those choices might violate another customer’s security goals.

Footnotes
  1. Details around this remain to be firmed up.
