Performance Mode

Performance mode reduces noise during VM execution by applying host-side isolation (vCPU pinning, hugepages, NUMA mbind, RT scheduling). On x86_64 it additionally applies a guest-visible CPUID hint (KVM_HINTS_REALTIME) and KVM exit suppression (PAUSE and HLT VM exits are disabled). On aarch64 only the four host-side optimizations apply; KVM exit suppression and CPUID hints are not available.

What it does

On x86_64, seven optimizations are applied when performance_mode is enabled (six host-side, one guest-visible via CPUID). On aarch64, four of these apply (vCPU pinning, hugepages, NUMA mbind, RT scheduling); the x86-specific items (PAUSE/HLT exit disabling, KVM_HINTS_REALTIME CPUID, halt poll) are not available.

Host-side KVM_CAP_HALT_POLL is explicitly skipped on x86_64 — the guest haltpoll cpuidle driver disables it via MSR_KVM_POLL_CONTROL (see below):

vCPU pinning – each virtual LLC is mapped to a physical LLC group on the host. vCPU threads are pinned to cores within their assigned LLC via sched_setaffinity. This prevents the host scheduler from migrating vCPU threads across LLCs, which would add cache thrashing noise to measurements.
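
A minimal sketch of the pinning call, assuming the libc crate; the function name and the single-CPU choice are illustrative (ktstr derives the CPU set from the virtual-LLC to physical-LLC mapping):

use std::io;

// Pin the calling thread to a single host CPU. Passing pid 0 to
// sched_setaffinity targets the calling thread, which is how a vCPU
// thread would pin itself before entering its run loop.
fn pin_self_to_cpu(cpu: usize) -> io::Result<()> {
    unsafe {
        let mut set: libc::cpu_set_t = std::mem::zeroed();
        libc::CPU_ZERO(&mut set);
        libc::CPU_SET(cpu, &mut set);
        if libc::sched_setaffinity(0, std::mem::size_of::<libc::cpu_set_t>(), &set) != 0 {
            return Err(io::Error::last_os_error());
        }
    }
    Ok(())
}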

Hugepages – guest memory is allocated with 2MB hugepages (MAP_HUGETLB) when sufficient free hugepages exist. This eliminates TLB pressure from host-side page walks during guest execution.
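
A sketch of the allocation under stated assumptions (libc crate, anonymous private mapping); ktstr checks free_hugepages up front, whereas this version simply falls back when the hugetlb mmap fails:

use std::{io, ptr};

// Try a 2 MB hugepage-backed mapping first; fall back to regular pages,
// mirroring the behavior when the hugetlb pool is too small. `len` must
// be a multiple of 2 MB for the MAP_HUGETLB attempt to succeed.
fn alloc_guest_memory(len: usize) -> io::Result<*mut libc::c_void> {
    let prot = libc::PROT_READ | libc::PROT_WRITE;
    let flags = libc::MAP_PRIVATE | libc::MAP_ANONYMOUS;
    let huge = unsafe { libc::mmap(ptr::null_mut(), len, prot, flags | libc::MAP_HUGETLB, -1, 0) };
    if huge != libc::MAP_FAILED {
        return Ok(huge);
    }
    let normal = unsafe { libc::mmap(ptr::null_mut(), len, prot, flags, -1, 0) };
    if normal == libc::MAP_FAILED {
        return Err(io::Error::last_os_error());
    }
    Ok(normal)
}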

NUMA mbind – guest memory is bound to the NUMA node(s) of the pinned vCPUs via mbind(MPOL_BIND). This ensures memory allocations are local to the CPUs executing vCPU threads, avoiding cross-node memory access latency.
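
There is no libc wrapper for mbind(2), so a sketch goes through libc::syscall; the MPOL_BIND value comes from <numaif.h>, and the single-node mask is illustrative (ktstr derives the node set from the pinned vCPUs' NUMA nodes):

use std::io;

const MPOL_BIND: libc::c_int = 2; // from <numaif.h>

// Bind an existing mapping to one NUMA node via the raw syscall.
fn bind_to_node(addr: *mut libc::c_void, len: usize, node: u32) -> io::Result<()> {
    let nodemask: libc::c_ulong = 1 << node;
    let maxnode = (8 * std::mem::size_of::<libc::c_ulong>()) as libc::c_ulong;
    let rc = unsafe {
        libc::syscall(libc::SYS_mbind, addr, len, MPOL_BIND,
                      &nodemask as *const libc::c_ulong, maxnode, 0u32)
    };
    if rc != 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(())
}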

RT scheduling – vCPU threads are set to SCHED_FIFO priority 1. The watchdog and monitor threads run at priority 2 on a dedicated host CPU not assigned to any vCPU, so they can preempt for timeout/sampling without competing for vCPU cores. The serial console mutex uses PTHREAD_PRIO_INHERIT to avoid priority inversion between RT vCPU threads and service threads.
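
A sketch of the SCHED_FIFO step with the libc crate (priority 1 for vCPU threads, 2 for watchdog/monitor per the text); the PTHREAD_PRIO_INHERIT console mutex would be configured separately via pthread_mutexattr_setprotocol:

use std::io;

// Put the calling thread on the SCHED_FIFO run queue at `priority`.
// Fails with EPERM unless CAP_SYS_NICE or an adequate RLIMIT_RTPRIO is
// in place (see Prerequisites).
fn make_rt(priority: libc::c_int) -> io::Result<()> {
    let param = libc::sched_param { sched_priority: priority };
    if unsafe { libc::sched_setscheduler(0, libc::SCHED_FIFO, &param) } != 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(())
}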

Disable PAUSE VM exits (x86_64 only) – KVM_CAP_X86_DISABLE_EXITS with KVM_X86_DISABLE_EXITS_PAUSE suppresses VM exits on PAUSE instructions. Guest spinlocks execute PAUSE in tight loops; each PAUSE normally causes a vmexit so the hypervisor can schedule other vCPUs. With dedicated cores (vCPU pinning), this reschedule is unnecessary overhead. The capability is optional – if unsupported, a warning is logged and the VM proceeds without it.

Disable HLT VM exits (x86_64 only) – KVM_X86_DISABLE_EXITS_HLT suppresses VM exits on HLT instructions, the most frequent exit type during boot and idle. BSP shutdown detection uses I8042 reset (port 0x64, value 0xFE via reboot=k) and VcpuExit::Shutdown instead of VcpuExit::Hlt. KVM blocks HLT disable when mitigate_smt_rsb is active (host has X86_BUG_SMT_RSB and cpu_smt_possible()); in that case, only PAUSE exits are disabled.
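
A hedged sketch covering both exit-suppression steps above, assuming the kvm-ioctls and kvm-bindings crates (the constant names mirror the Linux uapi header; the function is illustrative):

use kvm_bindings::{
    kvm_enable_cap, KVM_CAP_X86_DISABLE_EXITS,
    KVM_X86_DISABLE_EXITS_HLT, KVM_X86_DISABLE_EXITS_PAUSE,
};
use kvm_ioctls::VmFd;

// Suppress PAUSE exits, and HLT exits too unless the host's
// mitigate_smt_rsb policy forbids it (in which case only the PAUSE
// bit is requested).
fn disable_exits(vm: &VmFd, include_hlt: bool) -> Result<(), Box<dyn std::error::Error>> {
    let mut cap = kvm_enable_cap {
        cap: KVM_CAP_X86_DISABLE_EXITS,
        ..Default::default()
    };
    cap.args[0] = KVM_X86_DISABLE_EXITS_PAUSE as u64
        | if include_hlt { KVM_X86_DISABLE_EXITS_HLT as u64 } else { 0 };
    vm.enable_cap(&cap)?;
    Ok(())
}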

KVM_HINTS_REALTIME CPUID (x86_64 only) – sets bit 0 of CPUID leaf 0x40000001 EDX, telling the guest kernel that vCPUs are pinned to dedicated host cores. The guest disables PV spinlocks, PV TLB flush, and PV sched_yield (all add hypercall overhead unnecessary on dedicated cores), and enables haltpoll cpuidle (polls briefly before halting, reducing wakeup latency). PV spinlocks require CONFIG_PARAVIRT_SPINLOCKS, which is not in ktstr.kconfig, so that disable is a no-op for ktstr guests.
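
A sketch of setting the hint, assuming kvm-bindings' CpuId wrapper; the leaf and bit values come from the kernel's <asm/kvm_para.h>, and the helper name is illustrative:

use kvm_bindings::CpuId;

const KVM_CPUID_FEATURES: u32 = 0x4000_0001; // KVM paravirt feature leaf
const KVM_HINTS_REALTIME: u32 = 1 << 0;      // EDX bit 0

// Flip the realtime hint in the CPUID table before it is handed to the
// vCPU; the guest reads it as "my vCPUs own their host cores".
fn set_realtime_hint(cpuid: &mut CpuId) {
    for entry in cpuid.as_mut_slice() {
        if entry.function == KVM_CPUID_FEATURES {
            entry.edx |= KVM_HINTS_REALTIME;
        }
    }
}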

Skip host-side halt poll (x86_64 only) – when a guest vCPU halts (executes HLT with nothing to do), KVM can busy-wait briefly on the host before putting the vCPU thread to sleep, reducing wakeup latency at the cost of host CPU time. KVM_CAP_HALT_POLL controls this per-VM ceiling. In performance mode it is not set because the guest haltpoll cpuidle driver (enabled by KVM_HINTS_REALTIME above) handles polling inside the guest and writes MSR_KVM_POLL_CONTROL=0 to disable host-side polling via kvm_arch_no_poll(). Non-performance-mode VMs set KVM_CAP_HALT_POLL to 200µs (matching the x86 kernel default), or 0 when vCPUs exceed host CPUs.
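
For contrast, a sketch of the non-performance-mode path that does set the ceiling; this assumes a kvm-bindings version recent enough to export KVM_CAP_HALT_POLL, and 200_000 ns matches the kernel's halt_poll_ns default:

use kvm_bindings::{kvm_enable_cap, KVM_CAP_HALT_POLL};
use kvm_ioctls::VmFd;

// Cap host-side halt polling at 200µs, or disable it outright when the
// vCPU count oversubscribes the host. args[0] is in nanoseconds.
fn set_halt_poll(vm: &VmFd, vcpus: usize, host_cpus: usize) -> Result<(), Box<dyn std::error::Error>> {
    let mut cap = kvm_enable_cap { cap: KVM_CAP_HALT_POLL, ..Default::default() };
    cap.args[0] = if vcpus > host_cpus { 0 } else { 200_000 };
    vm.enable_cap(&cap)?;
    Ok(())
}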

Prerequisites

Sufficient host CPUs – the host must have at least (llcs * cores_per_llc * threads_per_core) + 1 online CPUs. The extra CPU is reserved for service threads (monitor, watchdog) so they do not share a core with any RT vCPU. The host must also have at least as many LLC groups as virtual LLCs.

2MB hugepages (optional) – the host must have free 2MB hugepages (check /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages). Without them, guest memory falls back to regular pages and a warning is printed.

CAP_SYS_NICE or rtprio limit (optional) – SCHED_FIFO requires either CAP_SYS_NICE (root) or an RLIMIT_RTPRIO >= the requested priority. Set an rtprio limit for non-root use:

# /etc/security/limits.conf
username  -  rtprio  99

Log out and back in for the limit to take effect. Without either capability, RT scheduling is skipped with a warning and vCPU threads run at normal priority (results may be noisy).
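
A sketch of the capability probe behind that warning, assuming the libc crate (euid 0 stands in for a full CAP_SYS_NICE check; the function name is illustrative):

use std::io;

// True when SCHED_FIFO at `prio` can plausibly succeed: root, or an
// RLIMIT_RTPRIO soft limit at or above the requested priority.
fn can_use_rt(prio: libc::rlim_t) -> io::Result<bool> {
    if unsafe { libc::geteuid() } == 0 {
        return Ok(true);
    }
    let mut lim = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
    if unsafe { libc::getrlimit(libc::RLIMIT_RTPRIO, &mut lim) } != 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(lim.rlim_cur >= prio)
}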

Validation

validate_performance_mode() runs during VM build and applies two levels of checks:

Errors (fatal):

  • Total vCPUs + 1 service CPU exceed available host CPUs.
  • Virtual LLCs exceed available LLC groups.
  • Pinning plan cannot be satisfied (an LLC group has fewer available CPUs than the virtual LLC requires).
  • No free host CPU for service threads after vCPU assignment.

Warnings (non-fatal):

  • Insufficient free hugepages – regular page allocation is used.
  • Host load is high – procs_running from /proc/stat exceeds half the vCPU count; results may be noisy (see the sketch after this list).
  • TSC not stable (x86_64 only, checked at VM creation time) – KVM_CLOCK_TSC_STABLE not set after KVM_GET_CLOCK, kvmclock falls back to per-vCPU timekeeping. Timing measurements may have higher variance. Common in nested virtualization.
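
A sketch of the load heuristic mentioned above; field parsing is simplified and the threshold mirrors the half-the-vCPU-count rule (the function name is illustrative):

use std::fs;

// Warn-level check: more runnable processes on the host than half the
// VM's vCPU count suggests measurements will be noisy.
fn host_load_is_high(vcpus: usize) -> bool {
    fs::read_to_string("/proc/stat")
        .unwrap_or_default()
        .lines()
        .find_map(|line| line.strip_prefix("procs_running "))
        .and_then(|v| v.trim().parse::<usize>().ok())
        .map_or(false, |running| running > vcpus / 2)
}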

Usage

In #[ktstr_test]:

#[ktstr_test(
    llcs = 2,
    cores = 4,
    threads = 2,
    performance_mode = true,
)]
fn my_perf_test(ctx: &Ctx) -> Result<AssertResult> {
    // vCPUs are pinned, hugepage-backed
    Ok(AssertResult::pass())
}

Via the builder API:

let vm = vmm::KtstrVm::builder()
    .kernel(&kernel_path)
    .init_binary(&ktstr_binary)
    .topology(1, 2, 4, 2)
    .memory_mb(4096)
    .performance_mode(true)
    .build()?
    .run()?;

When to use

Performance mode is for tests where host-side scheduling noise affects results – fairness spread measurements, scheduling gap detection, imbalance ratio checks. It is not needed for correctness tests (cpuset isolation, starvation detection) where pass/fail is binary.

The gauntlet runs many VMs in parallel. Performance mode on parallel VMs can oversubscribe the host if scheduled naively. Avoid performance_mode unless the host has enough CPUs for the topology matrix.

Two dimensions

Performance mode serves two purposes:

Noise reduction – pinning, hugepages, NUMA mbind, and RT scheduling reduce measurement variance on both architectures. On x86_64, PAUSE and HLT VM exit disabling, the KVM_HINTS_REALTIME CPUID hint, and skipping host-side halt poll further reduce noise. Scheduling gaps, spread, and throughput checks become meaningful because host jitter is controlled. Without performance mode, a 50ms gap could be host noise; with it, the same gap indicates a scheduler problem.

Performance assertions – with stable measurements, tests can set tight thresholds (max_gap_ms, min_iteration_rate, max_p99_wake_latency_ns) to detect scheduling regressions. A test using execute_steps_with can pass custom Assert checks that are evaluated inside the guest against worker telemetry. These thresholds are only meaningful under performance mode’s controlled environment.

Nextest parallelism

Performance-mode tests each consume one LLC group on the host. The vm-perf test group in .config/nextest.toml sets a static max-threads limit. The flock-based LLC slot reservation (acquire_resource_locks) handles runtime contention: if all LLC slots are busy, the test returns ResourceContention.

On contention, the test returns exit code 0 (skip) – it never ran. The SKIP: prefix in stderr distinguishes skips from real passes.

LLC exclusivity validation

When performance_mode is enabled, the build step validates LLC exclusivity: each virtual LLC must reserve the entire physical LLC group it maps to. The validation sums the actual CPU count of each LLC group and checks the total (plus service CPU) fits within the host’s online CPUs. If validation fails, the build returns an error (tests skip with ResourceContention).
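
The arithmetic reduces to a sum check; a sketch with illustrative types (the real validation works from the host topology map):

// Each virtual LLC claims an entire physical group, so the cost of a
// plan is the sum of the mapped groups' real sizes plus the service CPU.
fn validate_exclusivity(mapped_group_sizes: &[usize], online_cpus: usize) -> Result<(), String> {
    let needed: usize = mapped_group_sizes.iter().sum::<usize>() + 1;
    if needed > online_cpus {
        return Err(format!("need {needed} CPUs (incl. service CPU), host has {online_cpus}"));
    }
    Ok(())
}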

Three mode tiers

ktstr’s host-side resource coordination has three effective tiers, selected by the combination of performance_mode, --no-perf-mode/KTSTR_NO_PERF_MODE, and --cpu-cap/KTSTR_CPU_CAP:

Tier 1: performance mode (full isolation)

Enabled when performance_mode=true is set on the VM builder (or via #[ktstr_test(performance_mode = true)]). Acquires LOCK_EX on each selected LLC’s /tmp/ktstr-llc-{N}.lock — the LLC-level exclusive lock already covers every CPU in the group, so per-CPU /tmp/ktstr-cpu-{C}.lock files are NOT touched (try_acquire_all in vmm/host_topology.rs short-circuits the per-CPU loop when LlcLockMode == Exclusive). Applies every isolation feature listed under “What it does”: vCPU pinning via sched_setaffinity, 2 MB hugepages, NUMA mbind, RT SCHED_FIFO scheduling, and (x86_64) PAUSE/HLT exit suppression + KVM_HINTS_REALTIME CPUID.
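
A sketch of the flock pattern all three tiers share, assuming the libc crate; the function is illustrative, and whether a given call site blocks or fails fast (surfacing ResourceContention) is a policy choice, so LOCK_NB is a parameter:

use std::{fs::File, io, os::fd::AsRawFd};

// Reserve one LLC lockfile. Tier 1 passes LOCK_EX; tiers 2 and 3 pass
// LOCK_SH. With LOCK_NB a busy lock surfaces immediately as EWOULDBLOCK.
fn lock_llc(llc: usize, exclusive: bool, nonblocking: bool) -> io::Result<File> {
    let file = File::create(format!("/tmp/ktstr-llc-{llc}.lock"))?;
    let mut op = if exclusive { libc::LOCK_EX } else { libc::LOCK_SH };
    if nonblocking {
        op |= libc::LOCK_NB;
    }
    if unsafe { libc::flock(file.as_raw_fd(), op) } != 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(file) // the lock is held for as long as `file` remains open
}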

Tier 2: no-perf-mode with CPU-cap reservation

Enabled by --no-perf-mode / KTSTR_NO_PERF_MODE=1. Every no-perf-mode VM goes through acquire_llc_plan: the reservation is LOCK_SH across a NUMA-aware, consolidation-aware set of LLCs, sized to meet the CPU budget — either --cpu-cap N (or KTSTR_CPU_CAP=N) if set, or 30% of the calling process’s sched_getaffinity cpuset (minimum 1) if not. The flock granularity stays per-LLC; plan.cpus holds EXACTLY the budget (partial-take on the last LLC when the budget falls mid-LLC). Multiple no-perf-mode VMs coexist on the same LLCs because shared locks are reentrant; a concurrent perf-mode VM attempting LOCK_EX blocks until every no-perf-mode peer has released.
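
A sketch of the budget resolution described above; the env var name comes from the text, CpuCap::resolve is the real surface, and the 30%-with-floor-of-one rule is as stated:

// KTSTR_CPU_CAP wins when set; otherwise take 30% of the CPUs the
// calling process is allowed to run on, never less than one.
fn resolve_budget() -> usize {
    if let Some(n) = std::env::var("KTSTR_CPU_CAP")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
    {
        return n.max(1);
    }
    let allowed = unsafe {
        let mut set: libc::cpu_set_t = std::mem::zeroed();
        libc::sched_getaffinity(0, std::mem::size_of::<libc::cpu_set_t>(), &mut set);
        libc::CPU_COUNT(&set) as usize
    };
    (allowed * 3 / 10).max(1)
}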

Enforcement under --cpu-cap:

  • cgroup v2 cpuset sandbox — the reserved CPUs and derived NUMA nodes are written to a child cgroup’s cpuset.cpus and cpuset.mems, and the build pid is migrated into that cgroup so make -jN gcc children inherit the binding (see the sketch after this list). Under --cpu-cap, narrowing by a parent cgroup is a fatal error; without the flag (but with acquire_llc_plan still running on the 30% default) the sandbox warns and proceeds.
  • Soft-mask affinity — vCPU threads receive a sched_setaffinity mask covering only the reserved CPUs, so the guest’s CPU placement respects the budget even though no pinning is applied.
  • No RT scheduling, no hugepages, no mbind, no KVM exit suppression — these remain off; --cpu-cap is not a partial performance mode.
  • make -jN hint — kernel-build pipelines pass plan.cpus.len() to make so gcc’s fan-out matches the reserved capacity rather than nproc.
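
A hypothetical sketch of the cpuset sandbox referenced in the first bullet; the cgroup mount path, group name, and function name are illustrative, and real code would handle controller delegation and error reporting:

use std::{fs, io, path::Path};

// Create a child cgroup, constrain it to the reserved CPUs and NUMA
// nodes, then migrate `pid` into it so forked children (make, gcc)
// inherit the binding.
fn cpuset_sandbox(name: &str, cpus: &str, mems: &str, pid: u32) -> io::Result<()> {
    let dir = Path::new("/sys/fs/cgroup").join(name);
    fs::create_dir_all(&dir)?;
    fs::write(dir.join("cpuset.cpus"), cpus)?; // e.g. "8-15"
    fs::write(dir.join("cpuset.mems"), mems)?; // e.g. "0"
    fs::write(dir.join("cgroup.procs"), pid.to_string())?;
    Ok(())
}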

This tier is mutually exclusive with performance_mode=true (on the CLI, clap requires = "no_perf_mode" rejects --cpu-cap without --no-perf-mode at parse time) and with KTSTR_BYPASS_LLC_LOCKS=1 (rejected at every entry point because the contract and the bypass escape hatch are contradictory). Library consumers that set performance_mode=true on KtstrVmBuilder directly bypass the CLI parse — KTSTR_CPU_CAP is silently ignored in that path because the builder’s perf-mode branch never consults CpuCap::resolve.

See Resource Budget for the CpuCap, LlcPlan, and ktstr locks surfaces in detail.

Tier 3: default (per-CPU window + LLC LOCK_SH)

Selected when neither performance_mode=true nor --no-perf-mode/KTSTR_NO_PERF_MODE is set — the default path for #[ktstr_test] entries that don’t declare performance_mode (entry.rs KtstrTestEntry::DEFAULT sets performance_mode: false). acquire_cpu_locks (in vmm/host_topology.rs) walks a contiguous CPU window, takes LOCK_EX on each window CPU’s /tmp/ktstr-cpu-{C}.lock, then additionally takes LOCK_SH on the LLC lockfiles covering those CPUs so a perf-mode (tier 1) VM cannot grab LOCK_EX on an LLC that this path is using. No pinning, no isolation, no cgroup sandbox — the per-CPU reservation is purely for host-scheduling-noise avoidance between concurrent VMs.

This is the ONLY tier that actually flocks per-CPU lockfiles. Tier 1 skips them (LLC LOCK_EX already covers every CPU in the group); tier 2 skips them (the cgroup cpuset enforces the cap, and per-LLC LOCK_SH provides sufficient coordination).

Disabling performance mode

--no-perf-mode (or KTSTR_NO_PERF_MODE=1) forces performance_mode=false. The result is tier 2 above — a CPU-capped LOCK_SH reservation (either explicit --cpu-cap N or the 30%-of-allowed default). The feature differences relative to tier 1 are:

  • LLC flock mode — tier 1 holds LOCK_EX on each reserved LLC; tier 2 holds LOCK_SH. Multiple shared holders coexist; an exclusive holder blocks every shared acquirer and vice-versa.
  • Per-CPU flocks — tier 1 relies on LLC-level LOCK_EX for exclusivity and skips the per-CPU /tmp/ktstr-cpu-{C}.lock files (try_acquire_all short-circuits the per-CPU loop, as described under tier 1). Tier 2 also skips them; the cgroup cpuset is the enforcement layer.
  • vCPU pinning — tier 1 pins via sched_setaffinity to the reserved LLC’s CPUs. Tier 2 applies soft-mask affinity (budget-scoped but no 1:1 vCPU-to-CPU binding).
  • RT scheduling — tier 1 only; tier 2 runs vCPU threads at normal priority.
  • Hugepages — tier 1 only; tier 2 uses regular pages.
  • NUMA mbind — tier 1 only; tier 2 instead writes cpuset.mems on its child cgroup to achieve NUMA locality at the cgroup layer.
  • KVM exit suppression (x86_64) — tier 1 only; tier 2 leaves PAUSE and HLT exits enabled.
  • KVM_HINTS_REALTIME CPUID (x86_64) — tier 1 only; tier 2 leaves the guest on PV spinlocks and standard cpuidle.

Use tier 2 on multi-tenant hosts where you want bounded concurrency (at most N concurrent builds or no-perf-mode VMs per host) but cannot afford the full perf-mode contract. Use tier 1 for regression measurement where host jitter must be controlled.

Available via:

  • ktstr shell --no-perf-mode
  • cargo ktstr test --no-perf-mode
  • cargo ktstr coverage --no-perf-mode
  • cargo ktstr llvm-cov --no-perf-mode
  • cargo ktstr shell --no-perf-mode
  • KTSTR_NO_PERF_MODE=1 (any value; presence is sufficient)

--cpu-cap N layers on top of any of the above when present; when absent, the 30%-of-allowed default applies automatically. The env var is read by every VM builder call site (test harness, auto-repro, verifier, shell). The CLI flags set the env var before test execution so library consumers inherit it.