Problem Statement
Live migration of virtual machines (RFD 71) is a requirement of the MVP. In a live migration, a VM is paused, moved to another physical machine, and made to resume running on a destination host. This process should ideally not be observable by the guest.
One seam where the fiction of live migration can break is the TSC, an x86 register used by operating systems to implement monotonic time. In illumos, the TSC is also used to make adjustments to the high resolution (wall clock) time. This RFD will provide background on the TSC and propose an implementation for handling the TSC across migrations.
Scope
This RFD focuses only on addressing monotonic time in live migration.
Originally, I intended this RFD to capture all aspects of time for live migration, including both monotonic time and "real time". But as I dove into just the topic of monotonic time, I found I had a lot of prose and some proposals to share that I don’t want to gate on other time-related topics. Monotonic time is used for other time-related functions in the system (such as the wall clock time), so I will reference those as appropriate, but discuss them in further detail in a future RFD.
Supported Architectures
In the product, sleds have an AMD Milan CPU. This means AMD support is our priority for testing and development. Still, we would like our implementations here to be general enough to support Intel CPUs as well.
Background
RFD 34 has some excellent background on monotonic time (and other types of time) that’s worth checking out. In the interest of not repeating what’s in that RFD, this section will only focus on additional context needed on this topic, specifically some more concrete details about the TSC and what tools a hypervisor has for shaping the appearance of the TSC’s value to guests.
Monotonic Time
x86 Time Stamp Counter
The TSC is used as the basis for monotonic time. Implemented as an MSR on each processor, the value of the TSC starts at 0 following processor reset and increments at some frequency, typically related to the clock cycles of the processor. (In illumos, the frequency of the TSC is measured at boot and assumed to stay at the calibrated frequency thereafter.) With the combination of the TSC value and its frequency, an OS can implement a monotonic clock, typically expressed in terms of nanoseconds since boot.
The TSC can be read using the instruction rdtsc (and rdtscp). The rdtsc instruction can generally be called regardless of privilege level (i.e., it can be called from userland).[1]
The TSC can also be read and written to as an MSR using the rdmsr and wrmsr instructions.
TSC in virtual environments
An OS operating in a virtual machine has the same expectations of the TSC:
that its value monotonically increases
that it ticks at a constant frequency
that it begins at 0 following a processor reset (such as boot)
This means that the value presented cannot be the same as the host system TSC, but will have to be adjusted to make sense to the guest OS based on when the guest booted and the guest TSC frequency.
Virtualization support in hardware provides mechanisms for adjusting the TSC. Specifically, the host system can provide to the hardware:
a TSC "offset" to add to the TSC when a guest reads the TSC
a representation of a frequency ratio to "scale" the TSC value based on differing guest/host TSC frequencies
On AMD, this offset is specified by the hypervisor in the Virtual Machine
Control Block, a structure that is passed to the VMRUN command of SVM. The
specific value to set is TSC_OFFSET in the Control Bits (section 15.5.1).
Status quo bhyve implementation
In bhyve today, the TSC offset is simply the negation of the host TSC when the VM was booted. Frequency ratios are not used.
Generally, this means when the guest reads the TSC, it sees the value:[2]
current_host_tsc - boot_time_tsc
This calculation is insufficient in a world with live migration, as TSC values on one host don’t have any bearing on the TSC values of a different host. Calculating the TSC offset is thus an area that must be addressed in our implementation.
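To make the problem concrete, here is a minimal sketch of the status quo calculation. The host TSC values are made up for illustration, the function names are hypothetical, and the model assumes the addition wraps at 64 bits.

```python
MASK64 = (1 << 64) - 1  # model the TSC and its offset as 64-bit values


def boot_offset(boot_time_tsc: int) -> int:
    """Status quo offset: the negated host TSC at guest boot (mod 2^64)."""
    return (-boot_time_tsc) & MASK64


def guest_tsc(current_host_tsc: int, tsc_offset: int) -> int:
    """What the guest reads: host TSC plus offset, wrapping at 64 bits."""
    return (current_host_tsc + tsc_offset) & MASK64


# Illustrative values: the guest boots when the source host TSC is 180e9.
offset = boot_offset(180_000_000_000)
on_source = guest_tsc(183_000_000_000, offset)

# Reusing that offset on a destination whose TSC happens to read 500e9
# makes the guest TSC jump, even though no guest time has passed.
on_destination = guest_tsc(500_000_000_000, offset)

print(on_source, on_destination)
```

The offset only makes sense relative to the host it was computed on, which is why the destination needs its own offset.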
Implementation Constraints
A live migration should ideally not be noticeable by a guest. It’s worth spelling out concrete requirements here for the TSC.
Specifically:
the TSC should never decrease (if it hasn’t been written to by the guest OS)
post-migration, the guest TSC should increment at the same rate as before, within a reasonable error rate
the TSC jumping forward may be acceptable, but we need to ensure we have a crisp idea of what an acceptable range is
Live Migration and the TSC: How do you calculate the guest TSC, anyway?
This section provides some additional background on the math involved in calculating the TSC. Specifically, I’ve found it useful to understand how the guest TSC is calculated, so this section walks through some worked examples.
Keep in mind that the actual calculation is done by hardware, not the hypervisor. But the hypervisor has tools to adjust the guest TSC, which are used in the hardware’s calculation.
Specifically, both AMD and Intel provide a way for the hypervisor to program:
an offset value added to a scaled host TSC reading
a value representing the frequency ratio of guest to host frequencies, used to scale the host TSC reading
So in this section, we will:
outline how to calculate the guest TSC in different scenarios based on its host TSC and frequency, including across live migrations
arrive at general formulas for how to calculate a guest TSC at any given point in time, as well as the appropriate TSC offset that should be programmed by the hypervisor
Calculating the guest TSC (in general)
To start, let’s walk through some examples of what the guest TSC would be in different circumstances, with the goal of understanding what TSC offset the host should program in each circumstance.
For these examples, to keep things simple, we make some assumptions that don’t hold true in the real world, including:
the TSC ticks exactly at its frequency (in reality, it’s affected by a number of factors, including temperature)
live migration is instantaneous
Scenario 1: No migration, guest and host have same TSC frequency
To start, let’s assume that:
we have a VM that has never been migrated
the VM TSC frequency is the same as its host’s TSC frequency
To keep the math simple, let’s suppose that the host frequency is 1GHz, and the value of the TSC when the VM is booted is 180000000000.
So the TSC of the guest and host increment at the same rate:
| Time (seconds) | Guest TSC | Host TSC |
|---|---|---|
| 0 | 0 | 180000000000 |
| 1 | 1000000000 | 181000000000 |
| 2 | 2000000000 | 182000000000 |
| 3 | 3000000000 | 183000000000 |
Looking at these values, we can easily tell what the TSC offset should be: simply the host TSC at the time the guest was booted, negated. (This is, in fact, how things work in bhyve today.)
Thus to calculate the TSC offset and the guest TSC:
# boot_tsc: host TSC at guest boot
tsc_offset = -boot_tsc
guest_tsc = host_tsc + tsc_offset
Scenario 2: No migration, guest and host have different TSC frequencies
Now let’s expand the problem space a bit. Let’s assume:
we have a VM that has never been migrated
the VM TSC frequency is different from its host’s TSC frequency
Suppose the host is still at 1 GHz, but the guest frequency is 0.5 GHz.
We see that now the guest TSC increments at half the speed of the host:
| Time (seconds) | Guest TSC | Host TSC |
|---|---|---|
| 0 | 0 | 180000000000 |
| 1 | 500000000 | 181000000000 |
| 2 | 1000000000 | 182000000000 |
| 3 | 1500000000 | 183000000000 |
Next, let’s think through what the offset should be. In the case where the frequencies are the same, the offset is simply the negated host TSC when the guest booted. But the virtualization support in hardware adds the offset after it scales the host TSC based on the ratio, as such:
guest_tsc = host_tsc * freq_ratio + tsc_offset
Simply using the boot time TSC here would clearly not work. Looking at our example, at t=0, it would give us a negative guest TSC value: 180000000000 * 0.5 - 180000000000 = -90000000000.
One way to think about adjusting the offset here is by calculating a new "effective" TSC boot time value based on the frequency the guest TSC has. That "effective" boot time is simply the boot time TSC times the frequency ratio:
tsc_offset = - (boot_tsc * freq_ratio)
So for our example here, the offset would be -90000000000. This works as expected:
at t=0, the guest TSC is 0:
180000000000 * 0.5 - 90000000000 = 90000000000 - 90000000000 = 0
at t=1, the guest TSC is 500000000:
181000000000 * 0.5 - 90000000000 = 90500000000 - 90000000000 = 500000000
…and so on
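The Scenario 2 arithmetic can be checked with a short sketch. It uses exact rational arithmetic (Python's fractions module) so that no rounding enters the picture yet; the function and constant names are illustrative.

```python
from fractions import Fraction


def boot_tsc_offset(boot_tsc: int, freq_ratio: Fraction) -> Fraction:
    """Offset for a guest booted on this host: the negated 'effective' boot TSC."""
    return -(boot_tsc * freq_ratio)


def guest_tsc(host_tsc: int, freq_ratio: Fraction, tsc_offset: Fraction) -> Fraction:
    """Scale the host TSC first, then add the offset."""
    return host_tsc * freq_ratio + tsc_offset


HOST_HZ = 1_000_000_000     # 1 GHz host
GUEST_HZ = 500_000_000      # 0.5 GHz guest
BOOT_TSC = 180_000_000_000  # host TSC at guest boot

ratio = Fraction(GUEST_HZ, HOST_HZ)
offset = boot_tsc_offset(BOOT_TSC, ratio)

# Guest TSC at t = 0..3 seconds after boot
readings = [int(guest_tsc(BOOT_TSC + t * HOST_HZ, ratio, offset)) for t in range(4)]
print(offset, readings)
```

With a ratio of 1, the same functions reduce to the Scenario 1 calculation.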
Scenario 3: Live migration to a host with a different frequency
Expanding our assumptions further, let’s say we have:
a VM that gets live migrated
the source host and the VM have the same TSC frequency
the source and destination hosts have different frequencies
migration is instantaneous
For our example, let’s assume:
the VM and source host have the frequency of 1 GHz
at the VM’s boot, the host TSC is 180000000000
migration happens at t=3 seconds
the destination host has a frequency of 0.5 GHz
migration occurs when the destination host has TSC value 500000000000
Before migration we have:
| Time (seconds) | Guest TSC | Host TSC |
|---|---|---|
| 0 | 0 | 180000000000 |
| 1 | 1000000000 | 181000000000 |
| 2 | 2000000000 | 182000000000 |
| 3 | 3000000000 | 183000000000 |
After migration, the guest TSC should increment at the same frequency as before:
| Time (seconds) | Guest TSC | Host TSC |
|---|---|---|
| 3 | 3000000000 | 500000000000 |
| 4 | 4000000000 | 500500000000 |
| 5 | 5000000000 | 501000000000 |
| 6 | 6000000000 | 501500000000 |
Our frequency ratio of guest/host is 2. To get the effective boot TSC, we need some combination of the TSC on the destination when the VM was migrated, and the guest TSC, as the guest has been running for some time. To get these values to make sense together, we need to normalize for the frequency:
# host is the destination here
freq_ratio = guest_freq / host_freq
# "effective" boot TSC relative to destination host
# host_migrate_tsc: TSC of dest at migration
# guest_migrate_tsc: TSC of guest at migration
host_migrate_tsc * freq_ratio - guest_migrate_tsc
So in our example, the effective boot TSC relative to this new host is:
500000000000 * 2 - 3000000000
1000000000000 - 3000000000
997000000000
Plugging our example numbers in, this checks out to give us the TSC we expect:
t=3: guest_tsc = 500000000000 * 2 - 997000000000 = 3000000000
t=4: guest_tsc = 500500000000 * 2 - 997000000000 = 4000000000
t=5: guest_tsc = 501000000000 * 2 - 997000000000 = 5000000000
That gives us the following formula for calculating the guest TSC:
guest_tsc = host_tsc * freq_ratio - (host_migrate_tsc * freq_ratio - guest_migrate_tsc)
And our TSC offset is this term:
tsc_offset = - (host_migrate_tsc * freq_ratio - guest_migrate_tsc)
Scenario 4: Live Migration where source, destination, and guest all have different frequencies
For our last scenario, let’s assume that the source, destination, and guest all have different frequencies. This could happen if a VM is migrated twice, for example.
Here are some values to start with:
source freq: 1GHz
source TSC when guest is booted: 180000000000
dest freq: 2GHz
dest TSC when guest is migrated: 500000000000
guest freq: 0.5GHz
migration occurs at t=3 seconds
Our frequency ratios are thus the following:
source host: 0.5
dest host: 0.25
Here’s what the expected guest TSC should look like on the source, before migration:
| Time (seconds) | Guest TSC | Host TSC |
|---|---|---|
| 0 | 0 | 180000000000 |
| 1 | 500000000 | 181000000000 |
| 2 | 1000000000 | 182000000000 |
| 3 | 1500000000 | 183000000000 |
After migration, here’s what we’d expect after a few seconds on the destination:
| Time (seconds) | Guest TSC | Host TSC |
|---|---|---|
| 3 | 1500000000 | 500000000000 |
| 4 | 2000000000 | 502000000000 |
| 5 | 2500000000 | 504000000000 |
The offset on the source side would be:
# boot_tsc is the TSC when the guest was booted
tsc_offset = - (boot_tsc * freq_ratio)
= - (180000000000 * 0.5)
= - 90000000000
Which gives us the guest TSC values we would expect:
guest_tsc = host_tsc * freq_ratio + tsc_offset
t=0: guest_tsc = 180000000000 * 0.5 - 90000000000 = 0
t=1: guest_tsc = 181000000000 * 0.5 - 90000000000 = 500000000
t=2: guest_tsc = 182000000000 * 0.5 - 90000000000 = 1000000000
t=3: guest_tsc = 183000000000 * 0.5 - 90000000000 = 1500000000
The offset on the destination side is:
tsc_offset = - (host_migrate_tsc * freq_ratio - guest_migrate_tsc)
= - (500000000000 * 0.25 - 1500000000)
= - 123500000000
And again, with our formula, we get the TSC values we expect there, too:
guest_tsc = host_tsc * freq_ratio + tsc_offset
t=3: guest_tsc = 500000000000 * 0.25 - 123500000000 = 1500000000
t=4: guest_tsc = 502000000000 * 0.25 - 123500000000 = 2000000000
t=5: guest_tsc = 504000000000 * 0.25 - 123500000000 = 2500000000
TSC offset, frequency ratio, and guest TSC general formulas
To summarize, we have the following formula to calculate the TSC offset on a machine where the VM is booted:
# boot_tsc: host TSC value when guest is booted
freq_ratio = guest_freq / host_freq
tsc_offset = - (boot_tsc * freq_ratio)
And to calculate the offset on the destination machine in a live migration:
# host_migrate_tsc: TSC value of destination host when VM is migrated
# guest_migrate_tsc: TSC value of guest when VM is migrated
freq_ratio = guest_freq / host_freq
tsc_offset = - (host_migrate_tsc * freq_ratio - guest_migrate_tsc)
TSC Offset Formula
We can succinctly describe the TSC offset formula in terms of an "initial" host TSC and an "initial" guest TSC (at the time of boot or migration), and thus we get:
# initial_host_tsc: TSC value of host when VM was booted/migrated
# initial_guest_tsc:
# - 0 if the guest was booted on this host
# - TSC value of guest at migration time, otherwise
freq_ratio = guest_freq / host_freq
tsc_offset = - (initial_host_tsc * freq_ratio - initial_guest_tsc)
Guest TSC Formula
And the guest TSC is calculated as follows:
# host_tsc: host TSC value at time of guest TSC read
# guest_freq: frequency of the guest TSC
# host_freq: frequency of the host TSC
freq_ratio = guest_freq / host_freq
guest_tsc = host_tsc * freq_ratio + tsc_offset
Calculating the Guest TSC (with finite numerical representations)
In the previous section, we discussed how to calculate the TSC of a guest by working out the generic mathematical formulas for the guest TSC, the offset, and the frequency ratio between the guest and the host.
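As a reference point, the generic formulas can be sketched with exact rational arithmetic, checked here against the Scenario 3 and Scenario 4 numbers. The function names are illustrative, and the sketch deliberately ignores the width of any register.

```python
from fractions import Fraction


def tsc_offset(initial_host_tsc, initial_guest_tsc, guest_hz, host_hz):
    """Offset for a guest booted (initial_guest_tsc = 0) or migrated onto a host."""
    freq_ratio = Fraction(guest_hz, host_hz)
    return -(initial_host_tsc * freq_ratio - initial_guest_tsc)


def guest_tsc(host_tsc, offset, guest_hz, host_hz):
    """Scale the host TSC by the guest/host frequency ratio, then add the offset."""
    return host_tsc * Fraction(guest_hz, host_hz) + offset


GHZ = 1_000_000_000

# Scenario 3: 1 GHz guest at TSC 3e9 lands on a 0.5 GHz host whose TSC is 500e9.
s3_offset = tsc_offset(500 * GHZ, 3 * GHZ, GHZ, GHZ // 2)
s3_next = guest_tsc(500 * GHZ + GHZ // 2, s3_offset, GHZ, GHZ // 2)

# Scenario 4: 0.5 GHz guest at TSC 1.5e9 lands on a 2 GHz host whose TSC is 500e9.
s4_offset = tsc_offset(500 * GHZ, 3 * GHZ // 2, GHZ // 2, 2 * GHZ)
s4_next = guest_tsc(500 * GHZ + 2 * GHZ, s4_offset, GHZ // 2, 2 * GHZ)

print(s3_offset, s3_next, s4_offset, s4_next)
```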
But for live migration, we will be doing this math on a computer, in which numerical values are represented with finite representations. In this section, we will discuss how these values are represented on the system and how we can perform these calculations without losing precision, overflowing values, or running into other similar problems.
Representations
Recall that our formula for calculating the guest TSC is:
guest_tsc = host_tsc * freq_ratio + tsc_offset
Let’s first look at how each of these values is represented on Intel and AMD systems.
TSC
The TSC is a 64-bit value. It’s treated as unsigned, as it is a monotonic counter that begins at 0.
TSC Offset
The TSC offset is a 64-bit value.
TSC Scaling: Architecture-Specific Details
Both SVM and VMX provide a mechanism to "scale" the TSC based on frequency. Before looking closer here, let’s review what the manuals say for each.
AMD
In SVM, the frequency ratio is an MSR, the TSC Ratio MSR
(C000_0104h)[3].
The MSR is a 64-bit value, but only 40 bits are used. It’s composed of an integer component (8 bits) and a fractional component (32 bits).
Its bit layout is:
63:40 - reserved / not used
39:32 - integer part (INT)
31:0 - fractional part (FRAC)
The frequency ratio is:
INT + FRAC * 2^(-32)
AMD describes calculating the guest TSC with the following formulas:
TSCRatio = (Desired TSCFreq) / Core P0 frequency
TSC Value (in guest) = (P0 frequency * TSCRatio * t) + VMCB.TSC_OFFSET + (Last Value Written to TSC) * TSCRatio
Where t is time since the TSC was last written via the TSC MSR (or since reset if not written)
P0 frequency is the highest power frequency of a core[4].
This formula is equivalent to the general formula we derived above.
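To illustrate the 8.32 layout of the TSC Ratio MSR, here is a small sketch that packs and unpacks the INT and FRAC fields. The helper names are hypothetical; only the field widths come from the layout described above.

```python
from fractions import Fraction

INT_BITS = 8    # integer part of the ratio
FRAC_BITS = 32  # fractional part of the ratio


def pack_tsc_ratio(int_part: int, frac_part: int) -> int:
    """Build the 40-bit ratio value from its INT and FRAC fields."""
    if not 0 <= int_part < (1 << INT_BITS):
        raise ValueError("INT does not fit in 8 bits")
    if not 0 <= frac_part < (1 << FRAC_BITS):
        raise ValueError("FRAC does not fit in 32 bits")
    return (int_part << FRAC_BITS) | frac_part


def unpack_tsc_ratio(msr_value: int) -> Fraction:
    """Recover the ratio INT + FRAC * 2^(-32) from the packed value."""
    int_part = (msr_value >> FRAC_BITS) & ((1 << INT_BITS) - 1)
    frac_part = msr_value & ((1 << FRAC_BITS) - 1)
    return int_part + Fraction(frac_part, 1 << FRAC_BITS)


half = pack_tsc_ratio(0, 1 << 31)            # ratio 0.5
one_and_a_half = pack_tsc_ratio(1, 1 << 31)  # ratio 1.5
print(hex(half), unpack_tsc_ratio(half))
print(hex(one_and_a_half), unpack_tsc_ratio(one_and_a_half))
```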
Intel
VMX calls the scaling mechanism a "multiplier"[5].
This is how the manual describes the use of the multiplier in calculating the
guest TSC when the TSC MSR is read (either via the rdtsc or rdmsr
instructions)[6]:
RDMSR first computes the product of the value of the IA32_TIME_STAMP_COUNTER
MSR and the value of the TSC multiplier. It then shifts the value of the
product right 48 bits and loads EAX:EDX with the sum of that shifted value and
the value of the TSC offset.
Translating this text into a formula, we get:
guest_tsc = ((host_tsc * multiplier) >> 48) + tsc_offset
TSC Scaling: General Approach
While described slightly differently, both Intel and AMD allow for TSC scaling by allowing the system to program a fixed point number.
AMD’s documentation makes this more obvious by explicitly describing an MSR with an integer and fractional component, but Intel’s approach is similar.
The frequency ratio is thus represented as:
AMD: a 40-bit number with 8 bits of integer and 32 bits of fraction
Intel: a 64-bit number with 16 bits of integer and 48 bits of fraction
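The shared structure of the two formats can be sketched as a single function parameterized by the number of fractional bits. This is a software model of the calculation, not a description of either vendor's hardware; the function name is hypothetical, the numbers reuse Scenario 2, and the result is assumed to wrap at 64 bits.

```python
AMD_FRAC_BITS = 32    # 8.32 fixed point
INTEL_FRAC_BITS = 48  # 16.48 fixed point
MASK64 = (1 << 64) - 1


def scale_and_offset(host_tsc: int, multiplier: int, frac_bits: int, tsc_offset: int) -> int:
    """Scale the host TSC by a fixed point multiplier, then add the offset."""
    scaled = (host_tsc * multiplier) >> frac_bits
    return (scaled + tsc_offset) & MASK64


# A ratio of 0.5 encoded in each format: only the top fractional bit is set.
amd_half = 1 << (AMD_FRAC_BITS - 1)
intel_half = 1 << (INTEL_FRAC_BITS - 1)

# Scenario 2, one second after boot: host TSC 181e9, offset -90e9.
amd_guest = scale_and_offset(181_000_000_000, amd_half, AMD_FRAC_BITS, -90_000_000_000)
intel_guest = scale_and_offset(181_000_000_000, intel_half, INTEL_FRAC_BITS, -90_000_000_000)
print(amd_guest, intel_guest)
```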
Frequency Multiplier calculation
To represent a frequency ratio in a fixed point number, we need to do the
following, where SHIFT is the number of bits used for the fractional element:
scaling_factor = 1 << SHIFT
freq_multiplier = (guest_freq * scaling_factor) / host_freq
If it isn’t intuitive why this works, consider another fixed point number format most of us are familiar with: dollars and cents, which are generally encoded with 2 decimal places for the fractional element.
Suppose you have the dollar value $8.80, and you want to find the ratio of it to another value, $2.20. We will represent the result with a fixed point number with 2 decimal places. Thus, the scaling factor is 1x10^2.
If we plugged these values into our formula above we’d have:
scaling factor = 1 x 10^2 = 100
8.80 * 100 / 2.20 = 400
The result is 400, which is 4.00 encoded with 2 fractional places.
Here’s another example, that includes fractional numbers:
# 10.00 / 4.00 = 2.50
10.00 * 100 / 4.00 = 250
# 250, 2.50 encoded with 2 fractional places
TSC Offset calculation
If freq_multiplier is a fixed point number, and SHIFT is the number of bits
used for the fractional element, then we can refine our guest TSC and TSC offset
equations from before to be:
guest_tsc = ((host_tsc * freq_multiplier) >> SHIFT) + tsc_offset
We can calculate the TSC offset in a similar manner:
# pre-migration
tsc_offset = - ((boot_tsc * freq_multiplier) >> SHIFT)
# post-migration
tsc_offset = - (((host_migrate_tsc * freq_multiplier) >> SHIFT) - guest_migrate_tsc)
Scaling and offsetting the guest TSC: Mathematical Limits
Because we are dealing with finite representations, it’s important to understand where this calculation might run into problems. We will look at each value the hypervisor has to calculate, outline where we might run into problems, and propose reasonable limits to avoid those issues.
The two values we are interested in are:
the frequency multiplier
the TSC offset
Frequency Multiplier limitations
This is our formula for calculating the frequency multiplier:
scaling_factor = 1 << SHIFT
freq_multiplier = (guest_freq * scaling_factor) / host_freq
where SHIFT is the number of bits of the fractional element of the multiplier.
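A minimal sketch of this calculation with integer inputs follows. It assumes frequencies in Hz and truncating integer division; the rounding behavior is a choice made for this sketch, not something specified above.

```python
def freq_multiplier(guest_hz: int, host_hz: int, shift: int) -> int:
    """Fixed point guest/host frequency ratio with `shift` fractional bits."""
    scaling_factor = 1 << shift
    # Multiply before dividing so the fractional bits are not lost.
    return (guest_hz * scaling_factor) // host_hz


GHZ = 1_000_000_000

amd_double = freq_multiplier(2 * GHZ, GHZ, 32)   # ratio 2 in 8.32
intel_half = freq_multiplier(GHZ // 2, GHZ, 48)  # ratio 0.5 in 16.48
amd_third = freq_multiplier(GHZ, 3 * GHZ, 32)    # ratio 1/3, truncated

print(hex(amd_double), hex(intel_half), amd_third)
```

The last value is truncated because 1/3 has no finite binary expansion, which is the precision loss discussed under the limitations below.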
Constraints
The size of the frequency multiplier is constrained by the architecture we are on. Those formats are:
AMD: 8.32 (8 bits integer, 32 bits fraction, 40 bits total)
Intel: 16.48 (16 bits integer, 48 bits fraction, 64 bits total)
Areas of flexibility
We have some choice in how we represent the frequencies of the guest and host, both in terms of units and integer type used to represent them.
(Today, the TSC frequency is an unsigned 64-bit integer representing the frequency in Hz.)
Limitations
Regardless of the types we choose for the frequency, the possible limitations are the same. We can run into trouble if:
the integer portion of the ratio cannot fit into int bits. For AMD, this would be any ratio > 255 (8 bits); for Intel, any ratio > 65535 (16 bits). In this case, we would overflow the int value.
the frac portion of the ratio cannot fit into frac bits. This would mean loss of precision for the guest TSC, as we can no longer faithfully apply the ratio as a scaling mechanism.
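Both conditions can be tested for a given pair of frequencies. The sketch below is one way to do so, using exact rational arithmetic; the function name and return shape are illustrative.

```python
from fractions import Fraction


def check_ratio(guest_hz: int, host_hz: int, int_bits: int, frac_bits: int):
    """Return (int_overflow, precision_loss) for a guest/host frequency ratio."""
    ratio = Fraction(guest_hz, host_hz)
    # The integer part fits only if the ratio is below 2^int_bits.
    int_overflow = ratio >= (1 << int_bits)
    # The fraction is exact only if ratio * 2^frac_bits is a whole number.
    precision_loss = (ratio * (1 << frac_bits)).denominator != 1
    return int_overflow, precision_loss


GHZ = 1_000_000_000

print(check_ratio(GHZ // 2, GHZ, 8, 32))   # 0.5 on AMD: representable
print(check_ratio(GHZ, 3 * GHZ, 8, 32))    # 1/3 on AMD: loses precision
print(check_ratio(256 * GHZ, GHZ, 8, 32))  # 256 on AMD: overflows INT
print(check_ratio(256 * GHZ, GHZ, 16, 48)) # 256 on Intel: fits
```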
TSC Offset limitations
Our formula for calculating the TSC offset is:
# SHIFT = number of bits of fraction
tsc_offset = - (((initial_host_tsc * freq_multiplier) >> SHIFT) - initial_guest_tsc)
Constraints
We have the following constraints on our inputs:
the TSC offset is 64 bits
the freq multiplier is 40 bits, or 64 bits, depending on architecture
all TSC values (guest and host) are 64 bits
Limitations
Let’s break up our calculation into terms to make it easier to talk about:
# Scaling the initial host TSC to get "effective boot time"
effective_boot = (initial_host_tsc * freq_multiplier) >> SHIFT
# Correction for initial guest TSC
boot_tsc = effective_boot - initial_guest_tsc
# Negate boot TSC to get offset
tsc_offset = - boot_tsc
Digging into the arithmetic operations in the calculation, we can see the following opportunities for trouble, working from the inside out:
effective_boot: initial_host_tsc * freq_multiplier will overflow if the result doesn’t fit into 64 bits. This can be mitigated in Rust, which has 128-bit values. But if the product right-shifted SHIFT bits doesn’t fit into 64 bits, then that would still overflow.
boot_tsc: subtracting initial_guest_tsc from the scaled host TSC will underflow if the guest TSC is greater than the scaled host TSC. This could happen if a VM is moved to a machine that has been rebooted, which could easily happen, as live migration is a part of the update story.
negating the entire term will underflow if the result is >= 1 << 63
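The sketch below walks through the same three steps and records which of these hazards a given set of inputs would trigger. Python integers are arbitrary precision, so they stand in for a wide intermediate here; the function name and hazard strings are made up for illustration.

```python
U64_MAX = (1 << 64) - 1


def tsc_offset_hazards(initial_host_tsc, initial_guest_tsc, freq_multiplier, shift):
    """Return (mathematical offset, list of hazards hit along the way)."""
    hazards = []
    product = initial_host_tsc * freq_multiplier
    if product > U64_MAX:
        hazards.append("product needs more than 64 bits")
    effective_boot = product >> shift
    if effective_boot > U64_MAX:
        hazards.append("scaled host TSC needs more than 64 bits")
    if initial_guest_tsc > effective_boot:
        hazards.append("guest TSC exceeds scaled host TSC")
    boot_tsc = effective_boot - initial_guest_tsc
    if boot_tsc >= 1 << 63:
        hazards.append("negation does not fit in a signed 64-bit value")
    return -boot_tsc, hazards


# Scenario 4 destination: ratio 0.25 in 8.32 fixed point is 1 << 30.
print(tsc_offset_hazards(500_000_000_000, 1_500_000_000, 1 << 30, 32))

# A guest with TSC 3e9 moved to a recently rebooted host (TSC 1e9), ratio 1.
print(tsc_offset_hazards(1_000_000_000, 3_000_000_000, 1 << 32, 32))
```

Note that the first case already needs the wider intermediate for the product, even though every input and the final offset fit comfortably in 64 bits.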
Analysis: Upper ratio limit
Understanding where each calculation could run into underflow/overflow is helpful, but it doesn’t offer a concrete sense of what real life values might get us there.
A good starting point is the ratio between guest to host frequencies we will allow, as this is what’s used for scaling.
Let’s work through an example to see what happens. Suppose we allowed the maximum integer component ratio allowed by AMD: 255. How would that impact these calculations?
A ratio of 255 would give us the frequency multiplier:
255 << 32 = 0xff00000000 = 1095216660480
This is a value using all 40 bits of the multiplier. Recall that our host scaling operation looks like:
# Scaling the initial host TSC to get "effective boot time"
effective_boot = (initial_host_tsc * freq_multiplier) >> SHIFT
Both initial_host_tsc and freq_multiplier can be up to 64 bits wide. Assuming we can store the product in a 128-bit value, we will only overflow if the result after the shift needs more than 64 bits, as that’s the size of the offset. The maximum number of bits used by the product of a binary multiplication is the sum of the number of bits needed for each operand; here, once the shift has discarded the fractional bits, that’s at most the int bits (8) for the multiplier, plus 64 bits for the TSC. In order to not overflow the scaled TSC, then, the host TSC can’t go above 56 bits.
The max host TSC value would thus be 2^56 - 1 = 72057594037927935.
At a speed of 1 GHz, that value would be reached in just over 2 years. But keep in mind that this limit applies to a TSC being scaled by the maximum ratio, which in this example means that the guest is ticking roughly 256x faster, an absurd speedup.
Broadly, the tradeoff here is: Bits that are used for the ratio take away from bits that we can use to scale the host TSC. This means the scaled host TSC will overflow faster.
Clearly, allowing the max ratio is not reasonable (nor practical, as a frequency difference of 256x between CPUs is unheard of). Ideally, we want to pick a reasonable upper bound, ensuring that scaling the host TSC will not overflow for reasonable circumstances and move on.
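The numbers in this analysis can be checked with a short sketch, assuming the AMD 8.32 format and a year of 365 days. The helper name is hypothetical.

```python
AMD_INT_BITS = 8
AMD_FRAC_BITS = 32
U64_MAX = (1 << 64) - 1
SECONDS_PER_YEAR = 3600 * 24 * 365

# Multiplier for a ratio of 255: all 8 integer bits set, no fraction.
max_ratio_multiplier = 255 << AMD_FRAC_BITS

# Largest host TSC that is guaranteed to scale without overflowing 64 bits.
max_host_tsc = (1 << (64 - AMD_INT_BITS)) - 1


def scaled_fits_u64(host_tsc: int, multiplier: int) -> bool:
    """True if the scaled host TSC still fits in 64 bits after the shift."""
    return ((host_tsc * multiplier) >> AMD_FRAC_BITS) <= U64_MAX


years_at_1ghz = max_host_tsc / 1_000_000_000 / SECONDS_PER_YEAR

print(hex(max_ratio_multiplier), max_host_tsc, years_at_1ghz)
print(scaled_fits_u64(max_host_tsc, max_ratio_multiplier))
print(scaled_fits_u64(1 << 57, max_ratio_multiplier))
```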
Proposal: Max ratio limit
The next question is what a reasonable ratio limit would be, knowing that each
bit used for the int portion of the ratio takes away from how much the host
TSC can be scaled, and assessing what reasonable assumptions we can make about
how quickly CPU speeds will increase.
I would propose we use at least 4 bits, for a maximum allowed ratio of 15. From some cursory research, it’s clear that 2x is not a sufficient maximum, but 8x probably will be. For some context, the Linux KVM implementation allows a max of 31 (5 bits) for AMD, and 255 (8 bits) for Intel.
Here is a summary of the absolute maximum guest lifetime for different ratios and guest frequencies. This is a best case: depending on when the guest was booted, it might not be able to live this long. A 2x speedup seems plausible, and 15x and 31x much less so. Keep in mind that the ratio implies the guest is running much faster, and presumably began its life on much faster hardware.
| Ratio | Guest Speed | Host Speed | Best Case Time to Overflow |
|---|---|---|---|
| 2 | 1 GHz | 0.5 GHz | 292.4 years |
| 2 | 6 GHz | 3 GHz | 48.7 years |
| 2 | 15 GHz | 7.5 GHz | 19.5 years |
| 15 | 7.5 GHz | 0.5 GHz | 36.5 years |
| 15 | 45 GHz | 3 GHz | 6 years |
| 15 | 112.5 GHz | 7.5 GHz | 2.4 years |
| 31 | 15.5 GHz | 0.5 GHz | 18.2 years |
| 31 | 93 GHz | 3 GHz | 3 years |
| 31 | 232.5 GHz | 7.5 GHz | 0.39 years |
Note: to generate the values for the above table, I used the formula:
# int_bits: int bits needed to represent ratio
(2^(64 - int_bits) - 1) / host_freq_hz / 3600 / 24 / 365
We only begin to approach worryingly low lifetimes in the 31x ratio range, for ridiculous speed differences. 15x also seems sufficient for these purposes.
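The formula translates directly into code. The sketch below evaluates it for the 2x case only, where the ratio needs 2 integer bits; the function name is illustrative, and int_bits is passed in explicitly rather than derived from the ratio.

```python
SECONDS_PER_YEAR = 3600 * 24 * 365


def best_case_years(int_bits: int, host_freq_hz: float) -> float:
    """Years until a host TSC limited to (64 - int_bits) bits would overflow."""
    return (2 ** (64 - int_bits) - 1) / host_freq_hz / SECONDS_PER_YEAR


# A ratio of 2 needs 2 integer bits.
for host_ghz in (0.5, 3, 7.5):
    years = best_case_years(2, host_ghz * 1e9)
    print(f"ratio 2, host {host_ghz} GHz: {years:.1f} years")
```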
Analysis: Lower ratio limit
Because VMs in some circumstances may live much longer than physical hosts, it seems more plausible that VMs will move to hosts with faster TSC frequencies than they have. So how does that impact us?
First, it’s notable that even in a perfectly modeled system, any ratio that can’t be represented exactly as a binary fraction (unlike, e.g., 0.5, 0.25, or 0.75) will lead to some drift when calculating the guest TSC.
As an example using tsc-simulator, a tool I wrote to help me model some of these calculations, here is an example guest TSC over time for a guest of 1GHz, and a host of 3GHz, for a ratio of 1/3:
$ tsc-simulator simulate -d 5 -g 1000000 -f 3000000
DURATION 5 seconds
GUEST FREQUENCY 1000000 KHz
HOST 0
START TIME 0 seconds
TSC 1000000000
FREQUENCY 3000000 KHz
TIME GUEST_TSC HOST_TSC
=== GUEST_BOOT ==================================================================
0 0 1000000000
1 1000000000 4000000000
2 1999999999 7000000000
3 2999999999 10000000000
4 3999999999 13000000000
5 4999999999 16000000000
Some amount of drift is fine and expected in a real system, but since the TSC is used to adjust the wall clock time, too much drift can upset userspace time synchronization daemons such as ntp.
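The drift in the output above can be reproduced with the fixed point formulas from earlier sections. This sketch is not the tsc-simulator source; it assumes an AMD-style 32-bit fractional shift and truncating division, which happens to yield the same guest TSC values.

```python
SHIFT = 32  # assumption: AMD-style 8.32 fixed point multiplier


def simulate(guest_hz: int, host_hz: int, host_boot_tsc: int, seconds: int):
    """Return (time, guest_tsc, host_tsc) rows, one per second since guest boot."""
    multiplier = (guest_hz << SHIFT) // host_hz
    offset = -((host_boot_tsc * multiplier) >> SHIFT)
    rows = []
    for t in range(seconds + 1):
        host_tsc = host_boot_tsc + t * host_hz
        guest_tsc = ((host_tsc * multiplier) >> SHIFT) + offset
        rows.append((t, guest_tsc, host_tsc))
    return rows


# 1 GHz guest on a 3 GHz host whose TSC reads 1e9 at guest boot
rows = simulate(1_000_000_000, 3_000_000_000, 1_000_000_000, 5)
for row in rows:
    print(*row)
```

The guest loses a tick between t=1 and t=2 because the truncated multiplier is slightly smaller than 1/3.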
As a strawman, I considered ntp’s threshold for max frequency error it can correct, 500 ppm. In some initial modeling of how this drift might work, I ran the following test:
compute a guest TSC one second into the host’s future (i.e., incrementing the host TSC by its frequency in Hz)
compute the difference between the guest frequency in Hz and its TSC value
check whether the real frequency was within some ppm threshold (i.e., the diff)
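The test described by these steps can be sketched as follows. This is a reconstruction from the description above rather than the code that was actually run; the function name, the starting host TSC of 0, and the 32-bit shift are assumptions.

```python
NTP_MAX_PPM = 500  # the strawman threshold mentioned above


def ppm_error(guest_hz: int, host_hz: int, shift: int, host_tsc: int = 0) -> float:
    """Frequency error, in ppm, seen by the guest over one host second."""
    multiplier = (guest_hz << shift) // host_hz
    before = (host_tsc * multiplier) >> shift
    # One second into the host's future: advance the host TSC by its frequency.
    after = ((host_tsc + host_hz) * multiplier) >> shift
    observed_hz = after - before
    diff = abs(guest_hz - observed_hz)
    return diff * 1_000_000 / guest_hz


# 1 GHz guest on a 3 GHz host (ratio 1/3)
err = ppm_error(1_000_000_000, 3_000_000_000, 32)
print(err, err <= NTP_MAX_PPM)
```

A ratio that is exactly representable, such as 0.5, yields an error of 0 under this model.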
Even for quite small ratios, I didn’t see error beyond 1 ppm. It’s possible this isn’t the correct way to model this, and I’m open to suggestions there. I suspect modeling error over a longer period of time, and through multiple migrations, might be required.
Further, one bit that hasn’t been discussed yet is the passing of time during migration, in which we presumably will need to move the wall clock forward to account for migration time. Given that the TSC is used to update the wall clock as well, we will need to consider what reasonable error bounds should be there as well — such as moving the TSC forward some amount. That seems far more likely to run into issues.
Proposal: Lower ratio limit
Given that there are several open questions here, I don’t yet have a good proposal for how we should think about the fractional portion of the ratio. But I’d like to get some feedback on this RFD in parallel with other work I’m doing.
Open Questions / TODO
How can we best model the error rate induced by using the frequency multiplier, particularly over time, or multiple migrations?
What is the expected variation of frequencies among a rack? (This is a pretty easy experiment to do even now, with the existing racks we have available.)
What type of interface do we need for syncing between wall clock times across migrations? The TSC alone isn’t sufficient to understand that, as it only makes sense in the context of one host. In the past, I think I’ve seen it discussed that we can expect some synchronization of wall clock time (ntp or similar), but I haven’t yet fleshed out the details of what that’s going to look like here.
The implementation of the kernel interfaces here is a current work in progress — I can update the RFD with more details about what I’m thinking there if it’s of interest, but for now, I’m prototyping that out to see how ergonomic it looks.
External References
RFD 34
RFD 71