358 - Live Migration and Monotonic Time / RFD / Oxide

Problem Statement

Live migration of virtual machines (RFD 71) is a requirement of the MVP. In a live migration, a VM is paused, moved to another physical machine, and made to resume running on a destination host. This process should ideally not be observable by the guest.

One seam where the fiction of live migration can break is the TSC, an x86 register used by operating systems to implement monotonic time. In illumos, the TSC is also used to make adjustments to the high resolution (wall clock) time. This RFD will provide background on the TSC and propose an implementation for handling the TSC across migrations.

Scope

This RFD focuses only on addressing monotonic time in live migration.

Originally, I intended this RFD to capture all aspects of time for live migration, including both monotonic time and "real time". But as I dove into just the topic of monotonic time, I have a lot of prose and some proposals to share that I don’t want to gate on other time-related topics. Monotonic time is used for other time-related functions in the system (such as the wall clock time), so I will reference those as appropriate, but discuss them in further detail in a future RFD.

Supported Architectures

In the product, sleds have an AMD Milan CPU. This means AMD support is our priority for testing and development. Still, we would like our implementations here to be general enough to support Intel CPUs as well.

Background

RFD 34 has some excellent background on monotonic time (and other types of time) that’s worth checking out. In the interest of not repeating what’s in that RFD, this section will only focus on additional context needed on this topic, specifically some more concrete details about the TSC and what tools a hypervisor has for shaping the appearance of the TSC’s value to guests.

Monotonic Time

x86 Time Stamp Counter

The TSC is used as the basis for monotonic time. Implemented as an MSR on each processor, the value of the TSC starts at 0 following processor reset and increments on some frequency, typically related to the clock cycles of the processor. (In illumos, the frequency of the TSC is measured at boot and assumed to stay at the calibrated frequency thereon.) With the combination of the TSC value and its frequency, an OS can implement a monotonic clock, typically expressed in terms of nanoseconds since boot.

The TSC can be read using the instruction rdtsc (and rdtscp). The rdtsc instruction can generally be executed regardless of privilege level (i.e., it can be executed from userland).^[1]

The TSC can also be read and written to as an MSR using the rdmsr and wrmsr instructions.

TSC in virtual environments

An OS operating in a virtual machine has the same expectations of the TSC:

that its value monotonically increases
that it ticks at a constant frequency
that it begins at 0 following a processor reset (such as boot)

This means that the value presented cannot be the same as the host system TSC, but will have to be adjusted to make sense to the guest OS based on when the guest booted and the guest TSC frequency.

Virtualization support in hardware provides mechanisms for adjusting the TSC. Specifically, the host system can provide to the hardware:

a TSC "offset" to add to the TSC when a guest reads the TSC
a representation of a frequency ratio to "scale" the TSC value based on differing guest/host TSC frequencies

On AMD, this offset is specified by the hypervisor in the Virtual Machine Control Block, a structure that is passed to the VMRUN instruction of SVM. The specific value to set is TSC_OFFSET in the Control Bits (section 15.5.1).

Status quo bhyve implementation

In bhyve today, the TSC offset is simply the negation of the host TSC when the VM was booted. Frequency ratios are not used.

Generally, this means when the guest reads the TSC, it sees the value:^[2]

current_host_tsc - boot_time_tsc

This calculation is insufficient in a world with live migration, as TSC values on one host don’t have any bearing on the TSC values of a different host. Calculating the TSC offset is thus an area that must be addressed in our implementation.

Implementation Constraints

A live migration should ideally not be noticeable by a guest. It’s worth spelling out concrete requirements here for the TSC.

Specifically:

the TSC should never decrease (if it hasn’t been written to by the guest OS)
post-migration, the guest TSC should increment at the same rate as before, within a reasonable error rate
the TSC jumping forward may be acceptable, but we need to ensure we have a crisp idea of what an acceptable range is

Live Migration and the TSC: How do you calculate the guest TSC, anyway?

This section provides some additional background on the math involved in calculating the guest TSC. I’ve found it useful to understand how the guest TSC is calculated, so this section walks through some worked examples.

Keep in mind that the actual calculation is done by hardware, not the hypervisor. But the hypervisor has tools to adjust the guest TSC, which are used in the hardware’s calculation.

Specifically, both AMD and Intel provide a way for the hypervisor to program:

an offset value added to a scaled host TSC reading
a value representing the frequency ratio of guest to host frequencies, used to scale the host TSC reading

So in this section, we will:

outline how to calculate the guest TSC in different scenarios based on its host TSC and frequency, including across live migrations
arrive at general formulas for how to calculate a guest TSC at any given point in time, as well as the appropriate TSC offset that should be programmed by the hypervisor

Calculating the guest TSC (in general)

To start, let’s walk through some examples of what the guest TSC would be in different circumstances, with the goal of understanding what TSC offset the host should program in each.

Note that we want the TSC offset to be a constant once it’s determined (unless the TSC is written to); it is not feasible to compute it with each read of the TSC.

For these examples, to keep things simple, we make some assumptions that don’t hold true in the real world, including:

the TSC ticks exactly at its frequency (in reality, it’s affected by a number of factors, including temperature)
live migration is instantaneous

Scenario 1: No migration, guest and host have same TSC frequency

To start, let’s assume that:

we have a VM that has never been migrated
the VM TSC frequency is the same as its host’s TSC frequency

To keep the math simple, let’s suppose that the host frequency is 1GHz, and the value of the TSC when the VM is booted is 180000000000.

So the TSC of the guest and host increment at the same rate:

Guest frequency = Host frequency = 1GHz
Time (seconds)	Guest TSC	Host TSC
0	0	180000000000
1	1000000000	181000000000
2	2000000000	182000000000
3	3000000000	183000000000

Looking at these values, we can easily tell what the TSC offset should be: simply the host TSC at the time the guest was booted, negated. (This is, in fact, how things work in bhyve today.)

Thus to calculate the TSC offset and the guest TSC:

# boot_tsc: host TSC at guest boot
tsc_offset = -boot_tsc
guest_tsc = host_tsc + tsc_offset
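As a quick check of the table above, here is a minimal Python sketch of Scenario 1. The variable and function names are illustrative only; they are not taken from bhyve.

```python
# Scenario 1: guest and host tick at the same frequency (1 GHz).
HOST_FREQ_HZ = 1_000_000_000
boot_tsc = 180_000_000_000  # host TSC at guest boot

# The offset is a constant once the guest has booted.
tsc_offset = -boot_tsc


def guest_tsc(host_tsc: int) -> int:
    """Guest-visible TSC for a given host TSC reading."""
    return host_tsc + tsc_offset


# One (time, guest TSC) row per second after boot, as in the table.
rows = [(t, guest_tsc(boot_tsc + t * HOST_FREQ_HZ)) for t in range(4)]
```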

Scenario 2: No migration, guest and host have different TSC frequencies

Now let’s expand the problem space a bit. Let’s assume:

we have a VM that has never been migrated
the VM TSC frequency is different from its host’s TSC frequency

Suppose the host is still at 1 GHz, but the guest frequency is 0.5 GHz.

We see that now the guest TSC increments at half the speed of the host:

Guest frequency = 0.5 GHz, Host frequency = 1GHz
Time (seconds)	Guest TSC	Host TSC
0	0	180000000000
1	500000000	181000000000
2	1000000000	182000000000
3	1500000000	183000000000

Next, let’s think through what the offset should be. In the case where the frequencies are the same, the offset is simply the host TSC when the guest booted. But the virtualization support in hardware adds the offset after it scales the host TSC based on the ratio, as such:

guest_tsc = host_tsc * freq_ratio + tsc_offset

Simply using the boot time TSC here would clearly not work. Looking at our example, at t=0, it would give us a negative guest TSC value: 180000000000 * 0.5 - 180000000000 = -90000000000.

One way to think about adjusting the offset here is by calculating a new "effective" TSC boot time value based on the frequency the guest TSC has. That "effective" boot time is simply the boot time TSC times the frequency ratio:

tsc_offset = - (boot_tsc * freq_ratio)

So for our example here, the offset would be -90000000000. This works as expected:

at t=0, the guest TSC is 0: 180000000000 * 0.5 - 90000000000 = 90000000000 - 90000000000 = 0
at t=1, the guest TSC is 500000000: 181000000000 * 0.5 - 90000000000 = 90500000000 - 90000000000 = 500000000
…and so on
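The same arithmetic can be sketched in Python. This version uses exact rational arithmetic (fractions.Fraction) rather than the fixed point representation real hardware uses; the names are illustrative.

```python
from fractions import Fraction

host_freq_hz = 1_000_000_000
guest_freq_hz = 500_000_000
boot_tsc = 180_000_000_000  # host TSC at guest boot

freq_ratio = Fraction(guest_freq_hz, host_freq_hz)  # 1/2
# "Effective" boot TSC, scaled to the guest frequency, then negated.
tsc_offset = -(boot_tsc * freq_ratio)


def guest_tsc(host_tsc: int) -> Fraction:
    """Scale the host TSC first, then add the offset."""
    return host_tsc * freq_ratio + tsc_offset


# Guest TSC at t = 0, 1, 2, 3 seconds after boot.
values = [guest_tsc(boot_tsc + t * host_freq_hz) for t in range(4)]
```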

Scenario 3: Live migration to a host with a different frequency

Expanding our assumptions further, let’s say we have:

a VM that gets live migrated
the source host and the VM have the same TSC frequency
the source and destination hosts have different frequencies
migration is instantaneous

For our example, let’s assume:

the VM and source host have a frequency of 1 GHz
at the VM’s boot, the host TSC is 180000000000
migration happens at t=3 seconds
the destination host has a frequency of 0.5 GHz
migration occurs when the destination host has TSC value 500000000000

Before migration we have:

Guest frequency = Host frequency = 1GHz
Time (seconds)	Guest TSC	Host TSC
0	0	180000000000
1	1000000000	181000000000
2	2000000000	182000000000
3	3000000000	183000000000

After migration, the guest TSC should increment at the same frequency as before:

Guest frequency = 1GHz, Host frequency = 0.5GHz
Time (seconds)	Guest TSC	Host TSC
3	3000000000	500000000000
4	4000000000	500500000000
5	5000000000	501000000000
6	6000000000	501500000000

Our frequency ratio of guest/host is 2. To get the effective boot TSC, we need some combination of the TSC on the destination when the VM was migrated, and the guest TSC, as the guest has been running for some time. To get these values to make sense together, we need to normalize for the frequency:

# host is the destination here
freq_ratio = guest_freq / host_freq

# "effective" boot TSC relative to destination host
# host_migrate_tsc: TSC of dest at migration
# guest_migrate_tsc: TSC of guest at migration

host_migrate_tsc * freq_ratio - guest_migrate_tsc

So in our example, the effective boot TSC relative to this new host is:

500000000000 * 2 - 3000000000
1000000000000    - 3000000000
997000000000

Plugging our example numbers in, this checks out to give us the TSC we expect:

t=3: guest_tsc = 500000000000 * 2 - 997000000000 = 3000000000
t=4: guest_tsc = 500500000000 * 2 - 997000000000 = 4000000000
t=5: guest_tsc = 501000000000 * 2 - 997000000000 = 5000000000

That gives us the following formula for calculating the guest TSC:

guest_tsc = host_tsc * freq_ratio - (host_migrate_tsc * freq_ratio - guest_migrate_tsc)

And our TSC offset is this term:

tsc_offset = - (host_migrate_tsc * freq_ratio - guest_migrate_tsc)
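To sanity-check the destination-side offset, here is a small Python sketch of Scenario 3 using exact rational arithmetic. The names mirror the pseudocode above and are otherwise illustrative.

```python
from fractions import Fraction

guest_freq_hz = 1_000_000_000
dest_freq_hz = 500_000_000
host_migrate_tsc = 500_000_000_000  # destination TSC at migration
guest_migrate_tsc = 3_000_000_000   # guest TSC at migration (t=3)

freq_ratio = Fraction(guest_freq_hz, dest_freq_hz)  # 2
tsc_offset = -(host_migrate_tsc * freq_ratio - guest_migrate_tsc)


def guest_tsc(host_tsc: int) -> Fraction:
    return host_tsc * freq_ratio + tsc_offset


# Guest TSC at 0, 1, 2, 3 seconds after migration (t = 3, 4, 5, 6).
values = [guest_tsc(host_migrate_tsc + s * dest_freq_hz) for s in range(4)]
```

The guest keeps ticking at 1 GHz even though the destination host TSC only advances by 500000000 per second.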

Scenario 4: Live Migration where source, destination, and guest all have different frequencies

For our last scenario, let’s assume that the source, destination, and guest all have different frequencies. This could happen if a VM is migrated twice, for example.

Here’s some values to start with:

source freq: 1GHz
source TSC when guest is booted: 180000000000
dest freq: 2GHz
dest TSC when guest is migrated: 500000000000
guest freq: 0.5GHz
migration occurs at t=3 seconds

Our frequency ratios are thus the following:

source host: 0.5
dest host: 0.25

Here’s what the expected guest TSC should look like on the source, before migration:

Time (seconds)	Guest TSC	Host TSC
0	0	180000000000
1	500000000	181000000000
2	1000000000	182000000000
3	1500000000	183000000000

After migration, here’s what we’d expect after a few seconds on the destination:

Time (seconds)	Guest TSC	Host TSC
3	1500000000	500000000000
4	2000000000	502000000000
5	2500000000	504000000000

The offset on the source side would be:

# boot_tsc is the TSC when the guest was booted
tsc_offset = - (boot_tsc 	 * freq_ratio)
	   = - (180000000000 	 * 0.5)
	   = -   90000000000

Which gives us the guest TSC values we would expect:

guest_tsc = host_tsc * freq_ratio + tsc_offset

t=0: guest_tsc = 180000000000 * 0.5 - 90000000000 = 0
t=1: guest_tsc = 181000000000 * 0.5 - 90000000000 = 500000000
t=2: guest_tsc = 182000000000 * 0.5 - 90000000000 = 1000000000
t=3: guest_tsc = 183000000000 * 0.5 - 90000000000 = 1500000000

The offset on the destination side is:

tsc_offset = - (host_migrate_tsc * freq_ratio - guest_migrate_tsc)
	   = - (500000000000     * 0.25       - 1500000000)
	   = - 123500000000

And again, with our formula, we get the TSC values we expect there, too:

guest_tsc = host_tsc * freq_ratio + tsc_offset

t=3: guest_tsc = 500000000000 * 0.25 - 123500000000 = 1500000000
t=4: guest_tsc = 502000000000 * 0.25 - 123500000000 = 2000000000
t=5: guest_tsc = 504000000000 * 0.25 - 123500000000 = 2500000000

TSC offset, frequency ratio, and guest TSC general formulas

To summarize, we have the following formula to calculate the TSC offset on a machine where the VM is booted:

# boot_tsc: host TSC value when guest is booted
freq_ratio = guest_freq / host_freq
tsc_offset = - (boot_tsc * freq_ratio)

And to calculate the offset on the destination machine in a live migration:

# host_migrate_tsc: TSC value of destination host when VM is migrated
# guest_migrate_tsc: TSC value of guest when VM is migrated
freq_ratio = guest_freq / host_freq
tsc_offset = - (host_migrate_tsc * freq_ratio - guest_migrate_tsc)

TSC Offset Formula

We can succinctly describe the TSC offset formula in terms of an "initial" host TSC (at the time of boot or migration), and thus we get:

# initial_host_tsc: TSC value of host when VM was booted/migrated
# initial_guest_tsc:
# - 0 if the guest was booted on this host
# - TSC value of guest at migration time, otherwise

freq_ratio = guest_freq / host_freq
tsc_offset = - (initial_host_tsc * freq_ratio - initial_guest_tsc)

Guest TSC Formula

And the guest TSC is calculated as follows:

# host_tsc: host TSC value at time of guest TSC read
# guest_freq: frequency of the guest TSC
# host_freq: frequency of the host TSC
freq_ratio = guest_freq / host_freq
guest_tsc = host_tsc * freq_ratio + tsc_offset
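Putting the two general formulas together, the sketch below replays Scenario 4 (boot on the source, then migration to the destination). It uses exact rational arithmetic and hypothetical function names; it is a model of the math, not of any hypervisor interface.

```python
from fractions import Fraction

GHZ = 1_000_000_000


def tsc_offset(initial_host_tsc, initial_guest_tsc, guest_freq, host_freq):
    """Offset for a host where the guest was booted (initial_guest_tsc = 0)
    or to which it was migrated (initial_guest_tsc = guest TSC at migration)."""
    ratio = Fraction(guest_freq, host_freq)
    return -(initial_host_tsc * ratio - initial_guest_tsc)


def guest_tsc(host_tsc, offset, guest_freq, host_freq):
    return host_tsc * Fraction(guest_freq, host_freq) + offset


guest_freq = GHZ // 2  # 0.5 GHz guest

# Source host: 1 GHz, guest booted at host TSC 180000000000.
src_offset = tsc_offset(180 * GHZ, 0, guest_freq, 1 * GHZ)
guest_at_migration = guest_tsc(183 * GHZ, src_offset, guest_freq, 1 * GHZ)

# Destination host: 2 GHz, migration at host TSC 500000000000.
dst_offset = tsc_offset(500 * GHZ, guest_at_migration, guest_freq, 2 * GHZ)
guest_at_t5 = guest_tsc(504 * GHZ, dst_offset, guest_freq, 2 * GHZ)
```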

Calculating the Guest TSC (with finite numerical representations)

In the previous section, we discussed how to calculate the TSC of a guest by working out the generic mathematical formulas for the guest TSC, the offset, and the frequency ratio between the guest and the host.

But for live migration, we will be doing this math on a computer, in which numerical values are represented with finite representations. In this section, we will discuss how these values are represented on the system and how we can perform these calculations without losing precision, overflowing values, or running into other similar problems.

Representations

Recall that our formula for calculating the guest TSC is:

guest_tsc = host_tsc * freq_ratio + tsc_offset

Let’s first look at how each of these values is represented on Intel and AMD systems.

TSC

The TSC is a 64-bit value. It’s treated as unsigned, as it is a monotonic counter that begins at 0.

TSC Offset

The TSC offset is a 64-bit value.

TSC Scaling: Architecture-Specific Details

Both SVM and VMX provide a mechanism to "scale" the TSC based on frequency. Before looking closer here, let’s review what the manuals say for each.

Manual Documentation

AMD

In SVM, the frequency ratio is an MSR, the TSC Ratio MSR (C000_0104h)^[3].

The MSR is a 64-bit value, but only 40 bits are used. It’s composed of an integer component (8 bits) and a fractional component (32 bits).

Its bit layout is:

63:40 - reserved / not used
39:32 - integer part (INT)
31:0 - fractional part (FRAC)

The frequency ratio is:

INT + FRAC * 2^(-32)
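To make the 8.32 layout concrete, here is a minimal Python sketch that packs and unpacks a value in that format. The helper names are hypothetical, and the sketch only models the bit layout described above; it does not touch any MSR.

```python
from fractions import Fraction

INT_BITS = 8
FRAC_BITS = 32


def encode_tsc_ratio(int_part: int, frac_part: int) -> int:
    """Pack INT (8 bits) and FRAC (32 bits) into a 40-bit value."""
    if not 0 <= int_part < (1 << INT_BITS):
        raise ValueError("integer part does not fit in 8 bits")
    if not 0 <= frac_part < (1 << FRAC_BITS):
        raise ValueError("fractional part does not fit in 32 bits")
    return (int_part << FRAC_BITS) | frac_part


def decode_tsc_ratio(value: int) -> Fraction:
    """Return INT + FRAC * 2^(-32) as an exact fraction."""
    int_part = (value >> FRAC_BITS) & ((1 << INT_BITS) - 1)
    frac_part = value & ((1 << FRAC_BITS) - 1)
    return int_part + Fraction(frac_part, 1 << FRAC_BITS)


# A ratio of 1.5: INT = 1, FRAC = 0x8000_0000 (one half).
one_and_a_half = encode_tsc_ratio(1, 0x8000_0000)
```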

AMD describes calculating the guest TSC with the following formulas:

TSCRatio = (Desired TSCFreq) / Core P0 frequency

TSC Value (in guest) = (P0 frequency * TSCRatio * t) + VMCB.TSC_OFFSET + (Last Value Written to TSC) * TSCRatio

Where t is time since the TSC was last written via the TSC MSR (or since reset if not written)

P0 frequency is the highest power frequency of a core^[4].

This formula is equivalent to the general formula we derive above.

Intel

VMX calls the scaling mechanism a "multiplier"^[5].

This is how the manual describes the use of the multiplier in calculating the guest TSC when the TSC MSR is read (either via the rdtsc or rdmsr instructions)^[6]:

RDMSR first computes the product of the value of the IA32_TIME_STAMP_COUNTER
MSR and the value of the TSC multiplier. It then shifts the value of the
product right 48 bits and loads EAX:EDX with the sum of that shifted value and
the value of the TSC offset.

Translating this text into a formula, we get:

guest_tsc = ((host_tsc * multiplier) >> 48) + tsc_offset
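A minimal Python model of this formula is shown below. Masking the sum to 64 bits is my assumption about how a signed offset combines with an unsigned counter (two's complement wraparound); the function name is hypothetical.

```python
MASK64 = (1 << 64) - 1


def vmx_guest_tsc(host_tsc: int, multiplier: int, tsc_offset: int) -> int:
    """Model of the scaled-and-offset TSC read described above.

    multiplier is a 16.48 fixed point value; tsc_offset may be negative
    and is applied modulo 2^64.
    """
    scaled = (host_tsc * multiplier) >> 48
    return (scaled + tsc_offset) & MASK64


# A ratio of 0.5 is 1 << 47 in 16.48 fixed point. Using the Scenario 2
# numbers at t=1: host TSC 181000000000, offset -90000000000.
reading = vmx_guest_tsc(181_000_000_000, 1 << 47, -90_000_000_000)
```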

TSC Scaling: General Approach

While described slightly differently, both Intel and AMD support TSC scaling by allowing the system to program a fixed point number.

AMD’s documentation makes this more obvious by explicitly describing an MSR with an integer and fractional component, but Intel’s approach is similar.

The frequency ratio is thus represented as:

AMD: a 40-bit number with 8 bits of integer and 32 bits of fraction
Intel: a 64-bit number with 16 bits of integer and 48 bits of fraction

Frequency Multiplier calculation

To represent a frequency ratio in a fixed point number, we need to do the following, where SHIFT is the number of bits used for the fractional element:

scaling_factor = 1 << SHIFT
freq_multiplier = (guest_freq * scaling_factor) / host_freq

If it isn’t intuitive why this works, consider another fixed point number format most of us are familiar with: dollars and cents, which are generally encoded with 2 decimal places for the fractional element.

Suppose you have the dollar value $8.80, and you want to find the ratio of it to another value, $2.20. We will represent the result as a fixed point number with 2 decimal places, so the scaling factor is 10^2 = 100.

If we plug these values into our formula above, we have:

scaling_factor = 1 x 10^2 = 100
8.80 * 100 / 2.20 = 400

The result is 400, which is 4.00 encoded with 2 fractional places.

Here’s another example, where the resulting ratio has a fractional part:

# 10.00 / 4.00 = 2.50
10.00 * 100 / 4.00 = 250
# 250, 2.50 encoded with 2 fractional places
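Returning to binary fixed point, the multiplier formula can be sketched in Python as follows. The function name is illustrative; frequencies are assumed to be integers in Hz, and integer division truncates any bits that do not fit in the fraction.

```python
def freq_multiplier(guest_freq_hz: int, host_freq_hz: int, frac_bits: int) -> int:
    """Encode guest_freq / host_freq as a fixed point number with
    frac_bits bits of fraction (32 on AMD, 48 on Intel)."""
    scaling_factor = 1 << frac_bits
    return (guest_freq_hz * scaling_factor) // host_freq_hz


half_amd = freq_multiplier(1_000_000_000, 2_000_000_000, 32)    # 0.5 in 8.32
double_intel = freq_multiplier(1_000_000_000, 500_000_000, 48)  # 2.0 in 16.48
third_amd = freq_multiplier(1_000_000_000, 3_000_000_000, 32)   # ~1/3, truncated
```

A ratio such as 1/3 has no exact binary representation, so the encoded value is slightly smaller than the true ratio.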

TSC Offset calculation

If freq_multiplier is a fixed point number, and SHIFT is the number of bits used for the fractional element, then we can refine our guest TSC and TSC offset equations from before to be:

guest_tsc = ((host_tsc * freq_multiplier) >> SHIFT ) + tsc_offset

We can calculate the TSC offset in a similar manner:

# pre-migration
tsc_offset = - ((boot_tsc * freq_multiplier) >> SHIFT)

# post-migration
tsc_offset = - (((host_migrate_tsc * freq_multiplier) >> SHIFT) - guest_migrate_tsc)
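Here is a small Python sketch of the fixed point versions of these equations, checked against the destination side of Scenario 4 (ratio 0.25). It assumes 32 fractional bits and uses Python's unbounded integers, so it does not model overflow; the names are illustrative.

```python
SHIFT = 32  # fractional bits; assumes AMD's 8.32 format


def scale(host_tsc: int, freq_multiplier: int) -> int:
    return (host_tsc * freq_multiplier) >> SHIFT


def tsc_offset(initial_host_tsc: int, initial_guest_tsc: int, freq_multiplier: int) -> int:
    # initial_guest_tsc is 0 when the guest was booted on this host.
    return -(scale(initial_host_tsc, freq_multiplier) - initial_guest_tsc)


def guest_tsc(host_tsc: int, freq_multiplier: int, offset: int) -> int:
    return scale(host_tsc, freq_multiplier) + offset


# Guest at 0.5 GHz on a 2 GHz destination: ratio 0.25, i.e. 1 << 30.
multiplier = (500_000_000 << SHIFT) // 2_000_000_000
offset = tsc_offset(500_000_000_000, 1_500_000_000, multiplier)
at_t5 = guest_tsc(504_000_000_000, multiplier, offset)
```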

Scaling and offsetting the guest TSC: Mathematical Limits

Because we are dealing with finite representations, it’s important to understand where this calculation might run into problems. We will look at each value the hypervisor has to calculate, outline where we might run into problems, and propose reasonable limits to avoid those issues.

The two values we are interested in are:

the frequency multiplier
the TSC offset

Frequency Multiplier limitations

This is our formula for calculating the frequency multiplier:

scaling_factor = 1 << SHIFT
freq_multiplier = (guest_freq * scaling_factor) / host_freq

where SHIFT is the number of bits of the fractional element of the multiplier.

Constraints

The size of the frequency multiplier is constrained by the architecture we are on. Those formats are:

AMD: 8.32 (8 bits integer, 32 bits fraction, 40 bits total)
Intel: 16.48 (16 bits integer, 48 bits fraction, 64 bits total)

Areas of flexibility

We have some choice in how we represent the frequencies of the guest and host, both in terms of units and integer type used to represent them.

(Today, the TSC frequency is an unsigned 64-bit integer representing the frequency in Hz.)

Limitations

Regardless of the types we choose for the frequency, the possible limitations are the same. We can run into trouble if:

the integer portion of the ratio cannot fit into int bits. For AMD, this would be any ratio > 255 (8 bits); for Intel, any ratio > 65535 (16 bits). In this case, we would overflow the int value.
the frac portion of the ratio cannot fit into frac bits. This would mean loss of precision for the guest TSC, as we can no longer faithfully apply the ratio as a scaling mechanism.

TSC Offset limitations

Our formula for calculating the TSC offset is:

# SHIFT = number of bits of fraction
tsc_offset = - (((initial_host_tsc * freq_multiplier) >> SHIFT) - initial_guest_tsc)

Constraints

We have the following constraints on our inputs:

the TSC offset is 64 bits
the freq multiplier is 40 bits or 64 bits, depending on architecture
all TSC values (guest and host) are 64 bits

Limitations

Let’s break up our calculation into terms to make it easier to talk about:

# Scaling the initial host TSC to get "effective boot time"
effective_boot = (initial_host_tsc * freq_multiplier) >> SHIFT

# Correction for initial guest TSC
boot_tsc = effective_boot - initial_guest_tsc

# Negate boot TSC to get offset
tsc_offset = - boot_tsc

Digging into the arithmetic operations in the calculation, we can see the following opportunities for trouble, working from the inside out:

effective_boot: initial_host_tsc * freq_multiplier will overflow if the result doesn’t fit into 64-bits. This can be mitigated in Rust, which has 128-bit values. But if the product right-shifted SHIFT bits doesn’t fit into 64 bits, then that would still overflow.
boot_tsc: subtracting initial_guest_tsc from the scaled host TSC will underflow if the guest TSC is greater than the scaled host TSC. This could easily happen if a VM is moved to a machine that has recently been rebooted, as live migration is a part of the update story.
negating the entire term will underflow if the result is >= 1 << 63
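The three hazards above can be sketched as a checker in Python. Python integers do not overflow, so the sketch computes the exact values and reports which step would not fit if the intermediate values were stored in 64-bit types. The function name and messages are hypothetical, and the negation threshold simply follows the text.

```python
U64_MAX = (1 << 64) - 1


def offset_hazards(initial_host_tsc, initial_guest_tsc, freq_multiplier, shift):
    """Return a list describing which steps of the offset calculation
    would overflow or underflow with 64-bit intermediate values."""
    hazards = []
    effective_boot = (initial_host_tsc * freq_multiplier) >> shift
    if effective_boot > U64_MAX:
        hazards.append("effective_boot overflows 64 bits")
    boot_tsc = effective_boot - initial_guest_tsc
    if boot_tsc < 0:
        hazards.append("boot_tsc underflows (guest TSC > scaled host TSC)")
    if boot_tsc >= (1 << 63):
        hazards.append("negation does not fit in a signed 64-bit offset")
    return hazards


# Guest with TSC 5000000000 lands on a recently rebooted host (ratio 1.0).
rebooted = offset_hazards(1_000_000_000, 5_000_000_000, 1 << 32, 32)
# Ordinary boot: host TSC 180000000000, ratio 0.5.
ordinary = offset_hazards(180_000_000_000, 0, 1 << 31, 32)
```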

Analysis: Upper ratio limit

Understanding where each calculation could run into underflow/overflow is helpful, but it doesn’t offer a concrete sense of what real life values might get us there.

A good starting point is the ratio between guest to host frequencies we will allow, as this is what’s used for scaling.

Let’s work through an example to see what happens. Suppose we allowed the maximum integer component ratio allowed by AMD: 255. How would that impact these calculations?

A ratio of 255 would give us the frequency multiplier:

255 << 32 = 0xff00000000 = 1095216660480

This is a value using all 40 bits of the multiplier. Recall that our host scaling operation looks like:

# Scaling the initial host TSC to get "effective boot time"
effective_boot = (initial_host_tsc * freq_multiplier) >> SHIFT

Both initial_host_tsc and freq_multiplier can be up to 64 bits wide. Assuming we can store the product in a 128-bit value, we will only overflow if the result after the shift needs more than 64 bits, as that’s the size of the offset. The maximum number of bits needed for the product of two binary numbers is the sum of the number of bits in each operand. Since the shift discards exactly the fractional bits of the multiplier, the shifted result needs at most the multiplier’s int bits plus the bits of the host TSC. With all 8 int bits in use, the host TSC can’t go above 64 - 8 = 56 bits if the scaled value is to fit in 64 bits.

The max host TSC value that can be safely scaled would thus be 2^56 - 1 = 72057594037927935. At a speed of 1 GHz, that value would be reached in just over 2 years. But keep in mind that in this example the guest TSC is ticking 255x faster than the host’s, an absurd speedup.

Broadly, the tradeoff here is: Bits that are used for the ratio take away from bits that we can use to scale the host TSC. This means the scaled host TSC will overflow faster.

Clearly, allowing the max ratio is not reasonable (nor practical, as a frequency difference of 256x between CPUs is unheard of). Ideally, we want to pick a reasonable upper bound, ensuring that scaling the host TSC will not overflow for reasonable circumstances and move on.

Proposal: Max ratio limit

The next question is what a reasonable ratio limit would be, knowing that each bit used for the int portion of the ratio takes away from how much the host TSC can be scaled, and assessing what reasonable assumptions we can make about how quickly CPU speeds will increase.

I would propose we use at least 4 bits, for a maximum allowed ratio of 15. From some cursory research, it’s clear that 2x is not a sufficient maximum, but 8x probably will be. For some context, the Linux KVM implementation allows a max of 31 (5 bits) for AMD, and 255 (8 bits) for Intel.

Here is a summary of the best-case guest lifetime for different ratios and frequencies. This is an absolute maximum: depending on the value of the host TSC when the guest was booted, a guest might not be able to live this long. A 2x speedup seems plausible, and 15x and 31x much less so. Keep in mind that the ratio implies the guest is running much faster than the host, and presumably began its life on much faster hardware.

Ratio	Guest Speed	Host Speed	Best Case Time to Overflow
2x	1 GHz	0.5 GHz	292.4 years
2x	6 GHz	3 GHz	48.7 years
2x	15 GHz	7.5 GHz	19.5 years
15x	7.5 GHz	0.5 GHz	36.5 years
15x	45 GHz	3 GHz	6 years
15x	112.5 GHz	7.5 GHz	2.4 years
31x	15.5 GHz	0.5 GHz	18.2 years
31x	93 GHz	3 GHz	3 years
31x	232.5 GHz	7.5 GHz	0.39 years

Note: to generate the values for the above table, I used the formula:

# int_bits: int bits needed to represent ratio
(2^(64 - int_bits) - 1) / host_freq_hz / 3600 / 24 / 365
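The formula translates directly to Python. Which int_bits value corresponds to each row of the table is not stated, so the example values below are my own choices: 8 bits at 1 GHz (the "just over 2 years" case discussed earlier) and 2 bits at 0.5 GHz.

```python
def best_case_years(int_bits: int, host_freq_hz: float) -> float:
    """Years until a host TSC ticking at host_freq_hz exceeds the
    largest value that leaves int_bits of headroom in 64 bits."""
    max_host_tsc = 2 ** (64 - int_bits) - 1
    return max_host_tsc / host_freq_hz / 3600 / 24 / 365


full_amd_ratio = best_case_years(8, 1e9)  # all 8 int bits in use, 1 GHz host
two_bits = best_case_years(2, 0.5e9)      # 2 int bits, 0.5 GHz host
```

Each additional integer bit halves the result, and doubling the host frequency does the same.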

We only begin to approach worryingly low lifetimes in the 31x ratio range, for ridiculous speed differences. 15x also seems sufficient for these purposes.

Analysis: Lower ratio limit

Because VMs in some circumstances may live much longer than physical hosts, it seems more plausible that VMs will move to hosts with faster TSC frequencies than they have. So how does that impact us?

First, it’s notable that even in a perfectly-modeled system, any ratio that can’t be represented exactly as a binary fraction (unlike, e.g., 0.5, 0.25, or 0.75) will lead to some drift when calculating the guest TSC.

As an example using tsc-simulator, a tool I wrote to help me model some of these calculations, here is an example guest TSC over time for a guest of 1GHz, and a host of 3GHz, for a ratio of 1/3:

$ tsc-simulator simulate -d 5 -g 1000000 -f 3000000

 DURATION        5 seconds
 GUEST FREQUENCY 1000000 KHz

 HOST 0
      START TIME 0 seconds
             TSC 1000000000
       FREQUENCY 3000000 KHz


TIME              GUEST_TSC         HOST_TSC
=== GUEST_BOOT ==================================================================
0                         0       1000000000
1                1000000000       4000000000
2                1999999999       7000000000
3                2999999999      10000000000
4                3999999999      13000000000
5                4999999999      16000000000
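The off-by-one values in this output can be reproduced with a few lines of Python, under the assumption that the ratio is encoded with 32 fractional bits (AMD's 8.32 format) and truncated. Whether tsc-simulator computes things exactly this way is an assumption on my part; the names below are illustrative.

```python
SHIFT = 32  # assumes 32 fractional bits

guest_freq_hz = 1_000_000_000
host_freq_hz = 3_000_000_000
host_boot_tsc = 1_000_000_000

# 1/3 cannot be represented exactly, so the multiplier is truncated.
multiplier = (guest_freq_hz << SHIFT) // host_freq_hz
tsc_offset = -((host_boot_tsc * multiplier) >> SHIFT)


def guest_tsc(host_tsc: int) -> int:
    return ((host_tsc * multiplier) >> SHIFT) + tsc_offset


# Guest TSC at t = 0..5 seconds after boot.
readings = [guest_tsc(host_boot_tsc + t * host_freq_hz) for t in range(6)]
```

The truncated multiplier makes the guest TSC fall one tick short from t=2 onward in this window.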

Some amount of drift is fine and expected in a real system, but since the TSC is used to adjust the wall clock time, too much drift can upset userspace time synchronization daemons such as ntp.

As a strawman, I considered ntp’s threshold for max frequency error it can correct, 500 ppm. In some initial modeling of how this drift might work, I ran the following test:

compute a guest TSC one second into the host’s future (i.e., incrementing the host TSC by its frequency in Hz)
compute the difference between the guest frequency in Hz and its TSC value
check whether the real frequency was within some ppm threshold (i.e., the diff)
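One possible reading of the test above, sketched in Python: scale one second's worth of host ticks, compare the result with the nominal guest frequency, and express the difference in ppm. This is my interpretation of the steps, not the original test code, and the function name is hypothetical.

```python
def one_second_error_ppm(guest_freq_hz: int, host_freq_hz: int, shift: int) -> float:
    """Frequency error, in parts per million, seen by the guest over
    one second of host time when the ratio is a truncated fixed point
    number with `shift` fractional bits."""
    multiplier = (guest_freq_hz << shift) // host_freq_hz
    # Guest ticks accumulated while the host TSC advances by host_freq_hz.
    guest_ticks = (host_freq_hz * multiplier) >> shift
    diff = abs(guest_freq_hz - guest_ticks)
    return diff * 1_000_000 / guest_freq_hz


third = one_second_error_ppm(1_000_000_000, 3_000_000_000, 32)
half = one_second_error_ppm(1_000_000_000, 2_000_000_000, 32)
```

For the 1/3 ratio this gives an error of one tick per second, far below a 500 ppm threshold; a ratio of exactly 0.5 gives no error at all.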

Even for quite small ratios, I didn’t see error beyond 1 ppm. It’s possible this isn’t the correct way to model this, and I’m open to suggestions there. I suspect modeling error over a longer period of time, and through multiple migrations, might be required.

Further, one bit that hasn’t been discussed yet is the passing of time during migration, in which we presumably will need to move the wall clock forward to account for migration time. Given that the TSC is used to update the wall clock as well, we will need to consider what reasonable error bounds should be there as well — such as moving the TSC forward some amount. That seems far more likely to run into issues.

Proposal: Lower ratio limit

Given that there are several open questions here, I don’t yet have a good proposal for how we should think about the fractional portion of the ratio. But I’d like to get some feedback on this RFD in parallel with other work I’m doing.

Open Questions / TODO

How can we best model the error rate induced by using the frequency multiplier, particularly over time, or multiple migrations?
What is the expected variation of frequencies among a rack? (This is a pretty easy experiment to do even now, with the existing racks we have available.)
What type of interface do we need for syncing wall clock times across migrations? The TSC alone isn’t sufficient to understand that, as it only makes sense in the context of one host. In the past, I think I’ve seen it discussed that we can expect some synchronization of wall clock time (ntp or similar), but I haven’t yet fleshed out the details of what that’s going to look like here.
The implementation of the kernel interfaces here is a current work in progress — I can update the RFD with more details about what I’m thinking there if it’s of interest, but for now, I’m prototyping that out to see how ergonomic it looks.

External References

RFD 34
RFD 71

Footnotes

1
Assuming that this isn’t disabled via the Time Stamp Disable bit of Control Register 4. See 3.1.3 of the AMD Architecture Programmer’s Manual Volume 2, and 2.5 of Intel 64 and IA-32 Architectures Software Developer’s Manual 3A.
2
This formula is ignoring per-CPU and per-vcpu adjustments, which are also added to the offset. Per-CPU adjustments may occur because cores on the same machine might have slightly different frequencies. Per-vcpu adjustments can occur because the guest can write to the TSC.
3
AMD Architecture Programmer’s Manual Volume 2: System Programming, Section 15.30.5
4
AMD Architecture Programmer’s Manual Volume 2: System Programming, Section 17.1
5
Intel 64 and IA-32 Architectures Software Developer’s Manual 3C, Section 24.6.5
6
Intel 64 and IA-32 Architectures Software Developer’s Manual 3C, Section 25.3, RDMSR bullet point
