RFD 113: Engineering Determination

When you come to a fork in the road, take it.

— Yogi Berra

[rfd5] outlines the phases of engineering, but one phase in particular is sufficiently nuanced that it merits its own RFD: Determination.

Making decisions is a very important element of engineering, and any engineer can recount having gotten it wrong in their career: decisions that were wrong-headed (or otherwise made for the wrong reasons); decisions that were overly deliberated; decisions that were made hastily. At Oxide, we wish to avoid these pathologies, making decisions in a timely and thoughtful manner — which is easier said than done! While there is no single rubric that can guide all decisions, this RFD can hopefully serve as inspiration (or solace?) when deliberating a fork in the road.

Decision versus determination

One (perhaps pedantic) distinction we draw is between decision and determination: on the one hand, these are (for our purposes) roughly synonymous, and both amount to selecting between choices. On the other, for many such selections, the tradeoffs will be too complicated to make an unequivocally "correct" choice: it is more important to discern a viable direction — which may not be the only one! We call this a determination rather than a decision, preferring the implication of direction finding in the wilderness rather than a judge handing down a verdict in a courtroom.

Values in determination

Oxide values come into play — and into tension! — when making determinations. There are particular Oxide values that merit consideration, presented here in loose priority order.

Responsibility: In general, when making determinations of any nature, the needs of the product (and more generally, the customer) must be thought of as constraints. Within these constraints, however, implementation determination should generally lie with those who bear the consequences (that is, those implementing it): the burden of responsibility carries with it the empowerment of determination, and we must grant latitude to those who have signed up to live with the consequences. This is not an excuse to allow individual engineers to make a decision in a vacuum (see values like teamwork and versatility, below), merely that when making implementation decisions, we defer to the judgement of the implementers.
Rigor: There are two dimensions of rigor with respect to determination. First, we must be rigorous about our process for making determinations. Ideally, one could be absolutely rigorous about different options — but this ideal presupposes the ability to thoroughly explore paths not taken. Where this can happen (e.g., when performance, features or economics can be clearly known and crisply compared) it should of course be embraced — but the world is more complicated and often two different paths can’t be so neatly compared (especially in sophisticated software systems that themselves embody many of their own decisions).
The second dimension of rigor may be more subtle: in addition to being rigorous about the process, we must ask how different options themselves differently reflect this value. For example, if comparing two like parts, does one have a management interface that allows us to better understand its failures, thereby allowing us to be more rigorous in how we use the part as a foundation? Ironically, this quality can often feel softer, and it can be hard to be absolutely rigorous in analyzing differing degrees of rigor afforded by different options.
Urgency: When citing the potential for Oxide values to come into tension with one another, many engineers will point to the propensity for urgency to come into tension with rigor — with anyone of any experience having some tale of this tension being poorly resolved. It can be tempting to believe that rigor should always dominate this tension, but this is a mistake that can lead to analysis paralysis; urgency must, in the end, trump rigor. But this is not an excuse to be arbitrarily urgent, for a hasty determination can indeed lead to wasted effort in costly backtracking. (Or, in the words of the old software engineering saw, "weeks of coding can save hours of planning.") There is not a formulaic way to resolve this, other than to advise that extremes be avoided: calls for more rigorous analysis on a well-deliberated determination should be made carefully — as should insistence that no such analysis is required at all.
Finally, as with rigor, there is similarly a second dimension to urgency: we must move through determination with pace, but we also want to factor in the relative expediency of the paths themselves. As with rigor, however, it can be very difficult to know which path is ultimately faster, and this may come down to factors like part availability or team familiarity.
Courage: Courage is not just the courage of conviction (though that too, especially when summoning the will to take an approach that others aren’t taking), but also the often much greater courage of knowing when we are on the wrong track. This happens blessedly rarely, but when it does happen, the fear of lost work shouldn’t prevent us from candidly assessing that our previously determined direction is proving to be increasingly non-viable. In short, we must have both the courage to make a determination — and the courage to change that determination if and when it is increasingly wrong.
Versatility: We value the ability for team members to work at different layers of the stack, and this may play some role in determination: when choosing between similar options, it may be worth making the same determination in two different layers of the stack simply to facilitate versatility. For example, if a particular language or database is used in one layer of the stack, that use should perhaps be given weight when considering the same elsewhere. This shouldn’t serve to overly confine, and where one determination clearly does not fit another problem, it shouldn’t be forced.
Teamwork: In the spirit of collaboration, we seek to be consensus-based as much as possible. Where this isn’t possible, we may seek to find third paths to break a false dichotomy.
Thriftiness: The economics of what we’re building and how is important. We want bang for the buck, and this will factor into some determinations — especially those that have economic ramifications for our unit cost. While this value is last, it’s not because it is unimportant — and often when it plays a role, it is a non-negotiable one. (For example, under no condition will we contemplate licensing proprietary software for which we need to pay a per-unit fee, as this would undermine our margin.)

Communicating determination

Once a determination has been made, it should be written down. This is true even if (and indeed, especially if) a determination is considered to be more directional than unequivocal. That is, a determination may be strong (e.g., a firm part selection), it may be tepid (e.g., selecting a direction to allow for prototyping to plumb strengths and, especially, weaknesses), or it may preserve optionality (e.g., keeping two options alive with a more concrete plan for further evaluation). Determinations should be made clear in either a determination section of an RFD or (as warranted) in their own RFD. The length of the determination section of an RFD (or of the RFD dedicated to a particularly involved determination) should be considered to be shrink-to-fit: in some cases it might be quite long; in others it might be a single sentence.

Past determinations

It may be helpful to discuss some example determinations from Oxide's past. There are, of course, many to pick from as of this writing (this is, after all, the 113th RFD!), but each of these highlights some important aspect of determination.

Root-of-Trust operating system

When we initially waded into OpenTitan, it had immediate appeal: Google had an excellent track record with Titan, and it seemed natural that we would want to play an active role in its open embodiment. Over a short while, though, it became clear that the first ASIC embodiment of OpenTitan (Earl Grey) was not a fit for our schedule. Nonetheless, we continued to move forward with the Tock embedded operating system. However, the deeper we became involved in Tock, the more it became clear that our values were coming into tension with Tock’s. This wasn’t necessarily immediately obvious, as Tock is a relatively young, de novo, Rust-based system, but the abstractions that Tock developed did not lend themselves to rigorous software in several important dimensions. All of this served to inform setting our own direction with Hubris, as described in [rfd13]. In this case, that determination was accelerated by the embrace of Tock: only by wading seriously into it could we really appreciate those elements that made it (in our opinion) unsuitable. This highlights the importance of not overly weighting analysis, and also of being unafraid to change direction.

Host CPU selection

In some ways, the host CPU selection as outlined in [rfd12] represents the cleanest possible determination: on essentially every axis (price, performance, power), one option was clearly superior to the other. Even still, there is more nuance to this determination than might meet the eye. First (and perhaps surprisingly), there was an advisor to the company who viewed the determination as hasty, especially with respect to performance, and felt that we should be modeling full systems performance rather than viewing SPECrate as a proxy for that performance. On the one hand, this perspective is not without foundation, and history has plenty of examples of systems that performed well on microbenchmarks but terribly in practice. But on the other, it was misplaced: there was no reason to believe that further analysis would yield a different conclusion, and we had an urgent need to make a decision. It was an example of more rigor not, in fact, being any more rigorous.

Second (and as elucidated in [rfd12]), even though the host CPU selection was between two CPUs with the same instruction set, there are many, many differences in the chips themselves and the companies behind them — with profound ramifications for the other hardware and software we need to build. In the end, even this cleanest and least controversial of determinations came down to a deliberate series of tradeoffs.

SP network

It was clear that we need to be able to communicate directly with the service processors, but the mechanism for doing so was decidedly unclear. Or rather: the options themselves were both clearly flawed. The first (using the high-speed network via NC-SI) would perhaps involve less complexity and would certainly involve less cabling, if it worked. The second (a dedicated SP network) might involve more complexity and would certainly involve more cabling, but also could be made to work. In the initial discussions, these were the lines drawn: would NC-SI work or wouldn’t it? Ultimately, this is a discussion that is hard to satisfyingly resolve: everyone knew that it might work, but also that it might very well draw us into battle with firmware for one of the most critical components in the system. Rigor and urgency were both in play: which option was faster? Which was more likely to be robust? This was not immediately resolvable, as there were simply too many unknowns, so we made the important decision to preserve both options. This might sound like a punt, but in fact it was exactly the determination: it allowed us to move on to other decisions, the result of which ended up being a clear repudiation of NC-SI, as described in the eponymous [rfd89].

References

[rfd5] Oxide Computer Co. RFD 5: Phases of Engineering. https://5.rfd.oxide.computer. 2020.
[rfd12] Oxide Computer Co. RFD 12: Host CPU Evaluation. https://12.rfd.oxide.computer. 2020.
[rfd13] Oxide Computer Co. RFD 13: Microcontroller Operating System. https://13.rfd.oxide.computer. 2020.
[rfd89] Oxide Computer Co. RFD 89: Arrivederci NC-SI. https://89.rfd.oxide.computer. 2020.
