70 - Capacity, Allocation, and Utilization / RFD / Oxide

Terminology

In-use: resources that are currently being consumed by a process or application (CPU, disk, or memory).
Provisioned: resources that a user has provisioned. These can be either in-use or idle, depending on the application load.
Reserved: resources that are being held by a quota. These may already be provisioned or not yet provisioned.
Capacity: the limit we actually have.

All other terms are math or a segment of the already defined terms.

Utilization: what is actually being used versus what is provisioned, typically a percentage: in-use / provisioned OR provisioned / reserved. Whichever equation we use to calculate this, we must always tell the user which one we used.
Threshold: the customer-designated level at which they want to be alerted. This can be ≤ capacity. The default will be 75% of capacity.
Overprovisioning: allowing operators to provision more of a resource than its capacity. ("Selling more seats than there are on the plane.") Compare with bursting. For example, if a server has 64 CPUs, a system might allow an operator to provision 72 CPUs, which would be 12.5% overprovisioned. Note that this can apply to CPUs, memory, storage, IOPS, or any other resource. This is usually done to improve utilization (improving economic efficiency) when operators expect that not all allocated resources will be used. But if all allocated resources are used, quality of service degrades significantly.
Bursting: allowing applications to use more resources than they’ve been allocated when these resources are available. ("Giving empty seats away to existing customers", maybe?) Compare with overprovisioning. For example, an instance might be allocated 4 CPU-seconds per second. If the system has extra CPU time available, it can allow the instance to use 8 CPU-seconds in a second.

Both overprovisioning and bursting have the potential to significantly degrade customer experience at some future time as a result of activities outside of that customer’s or the operator’s control.
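The math behind these terms can be sketched in a few lines of Python. This is a minimal illustration, not a proposed implementation: the names `ResourceUsage`, `utilization`, `overprovisioned_percent`, and `over_threshold` are hypothetical, and comparing the provisioned amount against the threshold is an assumption, since the definition above does not say which quantity triggers the alert.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceUsage:
    """Amounts of one resource (e.g. CPUs or GB), all in the same unit."""
    in_use: float
    provisioned: float
    reserved: float
    capacity: float


def utilization(usage: ResourceUsage, basis: str = "in-use / provisioned") -> tuple[float, str]:
    """Return (percentage, formula label) so the formula used can always be shown."""
    if basis == "in-use / provisioned":
        numerator, denominator = usage.in_use, usage.provisioned
    elif basis == "provisioned / reserved":
        numerator, denominator = usage.provisioned, usage.reserved
    else:
        raise ValueError(f"unknown utilization basis: {basis!r}")
    if denominator == 0:
        return 0.0, basis  # assumption: nothing to measure against reads as 0%
    return 100.0 * numerator / denominator, basis


def overprovisioned_percent(provisioned: float, capacity: float) -> float:
    """Percentage by which provisioned exceeds capacity (0 when within capacity)."""
    return max(0.0, 100.0 * (provisioned - capacity) / capacity)


def over_threshold(usage: ResourceUsage, threshold_fraction: float = 0.75) -> bool:
    """Assumption: alert once provisioned reaches the threshold (default 75% of capacity)."""
    return usage.provisioned >= threshold_fraction * usage.capacity
```

With a 64-CPU server, `overprovisioned_percent(72, 64)` gives 12.5, matching the overprovisioning example. Returning the formula label alongside the number keeps the "always tell the user what we used" requirement hard to forget.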

High-level limited resources

Storage
- GB
- IOPS: This is more appropriate under health. Like other cloud providers, we should do the "right thing" here and limit IOPS according to the size of the disk.
CPU
Memory
Network: This is more appropriate under health. Like other cloud providers, we should do the "right thing" here and limit network bandwidth according to the size of the instance.

Our principles

Transparently show data. This does not mean showing the customer the raw data, which would not be actionable. Instead, frame the data in a way that is understandable. Make sure the data is clear, accurate, and actionable. The user can also export any data via the API, so they can integrate easily with any existing systems they have.
Provide data so users can appropriately make decisions for themselves.
Make the "right thing" the "easy thing".
Have action buttons for doing “the right thing” based on the data. (Examples below.) This could include “resizing an over-sized disk” based on actual usage patterns or “resizing an over-sized instance” based on usage patterns.
Always inform all users working on a project of the changes made to resources in the project and by whom with a customizable message sent at the time an action took place. (Examples below.) For example, if a rack admin(s) decides to resize an over-sized disk for a project, the users in the project should get an email saying something like: “I resized your disk X in project Y because over the past Z days you were only utilizing 15% of it. This new size should last you a projected 200 days before needing to be resized at your current write rate.” This should be part of a longer form audit log including the timestamp, user, action, and any other relevant data.
Do not over-provision or allow for over-provisioning. Do not allow a lie to cover up what should otherwise be a conversation between teams. Our goal (and value to the customer) is to help them achieve great utilization of their infrastructure through data, not over-subscription.
In each scenario, show the data users most likely want to see, along with the most likely actions. (Examples below.)
Allow rack(s) admins to set thresholds for when they will be notified they are near capacity.
When in doubt allow for sending multiple users or individual users messages. This should be a one-click button to open an email to the specific users or user. We can even pre-populate it, but still allow for customization, if we have enough data to know what is going on. Something like “Your project, X, is only utilizing 10% of the disk space you provisioned. Can you give me the reasoning behind your allocation or permission to resize to X GB so we can free up disk space for other projects? Thank you!” This way rack(s) admins who want to communicate first before resizing something can do so, OR they can resize it for them and send an email after.
Empower developers with the data to do the right thing so rack admin(s) may not need to intervene. (Examples below). Notify developers when their disks or instances are under-utilized with a one-click button to resize if appropriate.
Quotas are a way to prevent individual users and teams from over-allocating all the resources for themselves.
Focus not only on over-sized instances and disks but also under-sized instances and disks. (Examples below.)
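The resize notification described in the principles above quotes a projected number of days until the next resize. One way that projection could be computed is sketched below, assuming a constant write rate. The function names and parameters are hypothetical, and the message wording follows the example email from the principles.

```python
def projected_days_until_full(new_size_gb: float, used_gb: float,
                              write_rate_gb_per_day: float) -> float:
    """Days until the resized disk fills, assuming the write rate stays constant."""
    if write_rate_gb_per_day <= 0:
        return float("inf")  # no net growth, so the disk never fills
    return max(0.0, new_size_gb - used_gb) / write_rate_gb_per_day


def resize_message(disk: str, project: str, days_observed: int,
                   old_size_gb: float, new_size_gb: float, used_gb: float,
                   write_rate_gb_per_day: float) -> str:
    """Pre-populated notification text that the admin can still customize."""
    percent_used = 100.0 * used_gb / old_size_gb
    days_left = projected_days_until_full(new_size_gb, used_gb, write_rate_gb_per_day)
    return (
        f"I resized your disk {disk} in project {project} because over the past "
        f"{days_observed} days you were only utilizing {percent_used:.0f}% of it. "
        f"This new size should last you a projected {days_left:.0f} days before "
        f"needing to be resized at your current write rate."
    )
```

For a 1000 GB disk with 150 GB used, resized to 250 GB at 0.5 GB written per day, this reports 15% utilization and a projection of 200 days.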

Standard metrics

Dashboard. The standard metrics shown on a rack(s) admins dashboard should always include visuals for the limited resources as far as what is in-use, what has been provisioned, reserved, and a comparison to their threshold and capacity. (Examples below).
Views. They should then be able to drill down into tables by user, team, and project, to see where the numbers are coming from (examples below). These tables should automatically be sorted by percent utilized but allow for changing the sort based on clicking a table header or whatever the standard table visual UX is today.
Utilization. If a project’s utilization is 45% or less (we should likely make this number customizable in the future, but for now let’s just call it this), there should be handy buttons for doing “the right thing”, whether that is notifying a team to take action, resizing a disk for a project, or resizing an instance. In the event something is resized by someone other than a project member, for instance a rack(s) admin, the project users should be notified. The rack(s) admin should be able to customize the message that gets sent to the project members so everyone is on the same page. Something like “Your instance X in project Y has been resized since you were only using Z CPU and A GB Mem for the past X days. Thank you!”
Inventory. Based on the remaining capacity for a fleet or rack, we can do the math to show rack admin(s) how many instances of what size are remaining.
Forecasting. Where available it would be nice to forecast based on the past what future utilization/allocation will look like. (Examples below.) Seeing as this would take an analytical model we should do what we can easily for V1 and continue work on this thereafter.
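The inventory math and a simple V1 forecast could look roughly like the sketch below. Both functions are illustrative assumptions rather than a design: `remaining_instances` treats the rack as one pool of free CPU and memory and ignores how instances are placed on individual servers, and `linear_forecast` is a plain least-squares line standing in for the analytical model that would come later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSize:
    """A hypothetical instance size; real sizes would come from the product."""
    name: str
    cpus: int
    memory_gb: int


def remaining_instances(free_cpus: int, free_memory_gb: int,
                        sizes: list[InstanceSize]) -> dict[str, int]:
    """For each size, how many more instances fit if only that size were provisioned."""
    return {
        size.name: min(free_cpus // size.cpus, free_memory_gb // size.memory_gb)
        for size in sizes
    }


def linear_forecast(daily_values: list[float], days_ahead: int) -> float:
    """Fit a least-squares line to daily samples and extrapolate past the last one."""
    n = len(daily_values)
    if n < 2:
        raise ValueError("need at least two daily samples to fit a line")
    mean_x = (n - 1) / 2  # the x values are the day indexes 0..n-1
    mean_y = sum(daily_values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_values))
    slope = sxy / sxx
    return mean_y + slope * ((n - 1 + days_ahead) - mean_x)
```

With 40 free CPUs and 96 GB of free memory, a 2 CPU / 8 GB size fits 12 more times (memory is the limit) and a 16 CPU / 64 GB size fits once.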

If there is time, let’s gamify which developers best utilize their allocation. As in, let’s create a leaderboard ranking the developers and/or projects that have the best utilization for their disks and instances. This will create a culture of caring about this as well as incentivize leadership to reward those folks for more accurately forecasting, and paying attention to their usage.

The leaderboard should be smarter than just a ranking: it should require that the utilization percentage is at or above 95%, and that it has stayed there for at least 7 days, before a developer or project appears on it. Make it hard so people really strive to get there. These numbers can be customized in the future but we can hard code them for v1 if necessary.
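A sketch of that eligibility rule, with the v1 numbers as defaults. It assumes one utilization sample per day and reads "consistent for at least 7 days" as "each of the most recent 7 daily samples is at or above 95%". That reading is an interpretation, and the function names are hypothetical.

```python
def leaderboard_eligible(daily_utilization: list[float],
                         min_percent: float = 95.0, min_days: int = 7) -> bool:
    """True if each of the most recent min_days samples is at or above min_percent."""
    if len(daily_utilization) < min_days:
        return False  # not enough history yet
    return all(u >= min_percent for u in daily_utilization[-min_days:])


def rank_leaderboard(history: dict[str, list[float]],
                     min_percent: float = 95.0,
                     min_days: int = 7) -> list[tuple[str, float]]:
    """Eligible projects with their recent average utilization, best first."""
    rows = []
    for name, samples in history.items():
        if leaderboard_eligible(samples, min_percent, min_days):
            recent = samples[-min_days:]
            rows.append((name, sum(recent) / len(recent)))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Keeping both numbers as parameters means they can be hard coded for v1 and made customizable later without changing the callers.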

The leaderboard can be more open within an organization: visible to users in the org, but hiding any information about the projects themselves. Shaming (ranking which projects need action), by contrast, is not open and is visible only to those with the IAM permissions to see details about those projects.

Billing

In the billing invoices laid out in [rfd56], we should be sure to include easy-to-read utilization and allocation information so folks know when they are wasting resources. This would then incentivize them to either log in to the dashboard or take an action via the command line to "fix" the utilization.

When we notify people that their instances are over-sized we should include how much money they would save as well. This should be a simple calculation since we already would have calculated utilization. Something like "You could save $250 by resizing your VM".
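One possible shape for that calculation: pick the cheapest size that still covers observed peak usage and report the price difference. Everything below is an assumption for illustration. The `PricedSize` type, the prices, and the use of peak CPU and memory as the sizing input are hypothetical, and real pricing would come from the billing model in [rfd56].

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PricedSize:
    """An instance size with a monthly price. All values used here are made up."""
    name: str
    cpus: int
    memory_gb: int
    monthly_price: float


def right_size(peak_cpus: float, peak_memory_gb: float,
               sizes: list[PricedSize]) -> Optional[PricedSize]:
    """Cheapest size that still covers the observed peak usage, or None if none fits."""
    fitting = [s for s in sizes
               if s.cpus >= peak_cpus and s.memory_gb >= peak_memory_gb]
    return min(fitting, key=lambda s: s.monthly_price) if fitting else None


def savings_message(current: PricedSize, peak_cpus: float, peak_memory_gb: float,
                    sizes: list[PricedSize]) -> Optional[str]:
    """Notification text when a cheaper fitting size exists, otherwise None."""
    target = right_size(peak_cpus, peak_memory_gb, sizes)
    if target is None or target.monthly_price >= current.monthly_price:
        return None  # nothing cheaper fits, so there is nothing to suggest
    saved = current.monthly_price - target.monthly_price
    return (f"You could save ${saved:,.0f} a month by resizing your VM "
            f"from {current.name} to {target.name}")
```

Returning `None` when no cheaper size fits keeps the notification from firing for instances that are already sized appropriately.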

Example scenarios

Here are some common scenarios drawn out. By no means must the real thing look exactly like this, and I am sure there are better ways to visualize this information. The key thing is giving folks the information they need. In the scenarios below where a user is seeing an alert in their console, you can expect that if they also have email notifications enabled for alerts, they get an email as well. Notification settings apply to all alerts. See [rfd55] for more details on alerts. In the email alerts we should always generate the command line instructions for whatever action they need to take, as well as link directly to the right screen in the console to perform the action.

Instance over-sized

We should also be sure to include the cost savings from resizing the VM. Like "You will save $100 a month by resizing your VM". This goes for the other examples as well.