ktstr
ktstr is a test harness for Linux process schedulers, with a focus on sched_ext (BPF-extensible process scheduling). It boots Linux kernels in KVM virtual machines with controlled CPU topologies, runs workloads, and verifies scheduling correctness. It also runs tests under the kernel’s default EEVDF scheduler.
Quick taste
The simplest test calls a canned scenario:
use ktstr::prelude::*;

#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
    scenarios::steady(ctx)
}
cargo ktstr test --kernel ../linux
Without a scheduler attribute, tests run under EEVDF. See
Getting Started for testing a sched_ext scheduler.
Library API
The ktstr::prelude module re-exports the types needed for writing
tests. Declare cgroups and workloads as data with CgroupDef:
use ktstr::prelude::*;

#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("cg_0").workers(2),
        CgroupDef::named("cg_1").workers(2),
    ])
}
The prelude also exports low-level types (CgroupGroup,
WorkloadConfig, WorkloadHandle) for manual cgroup and worker
management, Assert for composable assertion config, and
WorkerReport for telemetry access.
For binary workloads (running schbench, fio, or any external
executable as part of a test), see
Payload Definitions.
#[ktstr_test(payload = FIXTURE)] runs a Payload (binary
workload) alongside the cgroup workers; the scheduler = slot
takes a bare Scheduler reference — the const emitted by
declare_scheduler!.
What it tests
- Fair scheduling – workers get CPU time without starvation or excessive scheduling gaps.
- Cpuset isolation – workers stay on assigned CPUs.
- Dynamic operations – cgroups created, destroyed, and resized mid-run.
- Affinity – scheduler respects thread affinity constraints.
- Stress – many cgroups, many workers, rapid topology changes.
- Stall detection – scheduler doesn’t drop tasks.
Design
Two principles drive ktstr’s architecture:
Fidelity without overhead – every test boots a real Linux kernel in a KVM VM with real cgroups and real BPF programs. No mocking, no containers, no shared state. The VMM is minimal and PCI-free: two 16550 serial ports (COM1 for kernel console, COM2 for application I/O), a shared-memory ring buffer, and three virtio-MMIO devices (virtio-console for guest console I/O, virtio-blk for file-backed block storage with optional btrfs templates, virtio-net for in-VMM L2 loopback used by network workload tests).
Direct access over tooling layers – the host-side monitor reads guest memory directly via BTF (BPF Type Format)-resolved struct offsets to observe scheduler state. The monitor runs entirely host-side — no BPF programs are injected into the guest to collect scheduler telemetry, so observations do not perturb scheduling decisions. (BPF programs loaded by the scheduler under test, the BPF verifier pipeline, and the auto-repro probe pipeline are separate concerns; those are the code under test, not the observation layer.) See Monitor for details on BTF resolution and guest memory introspection.
BPF verifier analysis
The verifier_pipeline tests boot a scheduler in a VM and capture
per-program verifier output from the real kernel verifier. The
default output applies cycle collapse to reduce repetitive loop
unrolling. See BPF Verifier for details.
Auto-repro probe pipeline
When a scheduler crashes, ktstr can automatically rerun the failing scenario with BPF probes attached to the crash-path functions. See Auto-Repro for details.
Workspace structure
| Component | Purpose |
|---|---|
| ktstr (lib) | Core library |
| ktstr-macros | #[ktstr_test], declare_scheduler!, and #[derive(Payload)] proc macros |
| ktstr (bin) | Host-side CLI |
| cargo-ktstr (bin) | Cargo-integrated workflow: test, coverage, llvm-cov, kernel mgmt, verifier analysis, stats, interactive shell |
| scx-ktstr | Minimal BPF scheduler for testing |
ktstr and cargo-ktstr are the two user-facing [[bin]]
targets in the crate; install them with
cargo install --locked ktstr --bin ktstr --bin cargo-ktstr.
The crate also defines two test-fixture [[bin]] targets —
ktstr-jemalloc-probe and ktstr-jemalloc-alloc-worker —
used by the tests/jemalloc_probe_tests.rs integration tests.
The explicit --bin flags scope the install to just the two
operator-facing entry points; without them, cargo install
would also place the test-fixture binaries on $PATH.
Kernel config
ktstr.kconfig in the repo root contains the kernel config fragment
needed for scheduler testing (sched_ext, BPF, kprobes, cgroups).
Append it to the .config in your kernel source tree and run make olddefconfig.
Features
ktstr is a test framework for Linux process schedulers. See Overview for a quick introduction with code examples.
Supported kernels
ktstr’s runtime dispatches to per-kernel-version fallback paths for
the watchdog timeout and event counters. CI explicitly exercises
6.14 and 7.0 on both x86_64 and aarch64. On 7.1+ kernels the watchdog
override uses scx_sched.watchdog_timeout via BTF detection;
older kernels use the static scx_watchdog_timeout symbol.
Event counters follow a different layout split: 6.18+ kernels
(backported to 6.17.7+ stable) read via
scx_sched.pcpu -> scx_sched_pcpu.event_stats; 6.16-6.17 kernels
read scx_sched.event_stats_cpu directly. When neither path is
available, event-counter sampling is silently disabled.
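The version boundaries above can be summarized as a small lookup. The following is a minimal sketch, not ktstr's code: the enum and function names are hypothetical, and the text above describes the watchdog path as selected via BTF detection, so a plain version comparison only approximates the documented behavior.

```rust
/// Hypothetical names, for illustration only.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum WatchdogPath {
    /// Field access: scx_sched.watchdog_timeout
    SchedField,
    /// Static symbol: scx_watchdog_timeout
    StaticSymbol,
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum EventCounterPath {
    /// scx_sched.pcpu -> scx_sched_pcpu.event_stats
    PerCpuStruct,
    /// scx_sched.event_stats_cpu read directly
    EventStatsCpu,
}

/// Watchdog override path for a kernel version (7.1+ uses the struct field).
pub fn watchdog_path(major: u32, minor: u32) -> WatchdogPath {
    if (major, minor) >= (7, 1) {
        WatchdogPath::SchedField
    } else {
        WatchdogPath::StaticSymbol
    }
}

/// Event-counter path for a kernel version, or None when sampling is disabled.
pub fn event_counter_path(major: u32, minor: u32, patch: u32) -> Option<EventCounterPath> {
    let version = (major, minor, patch);
    if version >= (6, 17, 7) {
        // 6.18+ and the 6.17.7+ stable backport.
        Some(EventCounterPath::PerCpuStruct)
    } else if version >= (6, 16, 0) {
        Some(EventCounterPath::EventStatsCpu)
    } else {
        None
    }
}
```

Tuple comparison is lexicographic, so `(6, 18, 0) >= (6, 17, 7)` holds without any special casing for the backport boundary.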
Testing
Real kernel, clean slate, x86/arm parity — every VM test boots its own Linux kernel in KVM, fresh state each run
Tests boot a real Linux kernel in a KVM virtual machine with configurable topology: NUMA nodes, LLCs, cores per LLC, threads per core. Multi-NUMA topologies produce NUMA domains via ACPI SRAT/SLIT/HMAT tables on x86_64. On aarch64, CPU topology is described via FDT cpu nodes with MPIDR affinity. Both architectures are supported (24 topology presets on x86_64, 14 on aarch64). See Gauntlet for the full preset list.
Fast boot — compressed initramfs SHM cache with COW overlay, VMs boot in ms, not s
The initramfs base (test binary + busybox + shared libraries) is LZ4-compressed and cached in a shared memory segment. Concurrent VMs COW-map the cached base into guest memory, avoiding per-VM compression and copy. Per-test arguments are packed as a small suffix appended to the cached base. Result: VM boot time is dominated by kernel init, not initramfs preparation.
Automatic shared library resolution — recursive DT_NEEDED discovery, no need to link statically
Shared library dependencies for the test binary and any injected
host files are resolved automatically by walking DT_NEEDED entries
in ELF headers. The framework builds a complete closure of transitive
dependencies — no manual .so lists or LD_LIBRARY_PATH hacks.
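The transitive-closure step can be sketched as a breadth-first walk. This is an illustration under stated assumptions rather than the framework's implementation: the `needed` map stands in for the DT_NEEDED entries that a real resolver would parse out of each ELF file, and `dependency_closure` is a hypothetical name.

```rust
use std::collections::{BTreeSet, HashMap, VecDeque};

/// Returns every library reachable from `roots` through DT_NEEDED edges.
/// `needed` maps an object name to its direct DT_NEEDED entries
/// (hypothetical pre-parsed input; no ELF parsing happens here).
pub fn dependency_closure(
    roots: &[&str],
    needed: &HashMap<String, Vec<String>>,
) -> BTreeSet<String> {
    let mut seen = BTreeSet::new();
    let mut queue: VecDeque<String> = VecDeque::new();

    // Seed the queue with the direct dependencies of each root object.
    for root in roots {
        for dep in needed.get(*root).into_iter().flatten() {
            queue.push_back(dep.clone());
        }
    }

    // Breadth-first walk; `seen` also guards against dependency cycles.
    while let Some(lib) = queue.pop_front() {
        if !seen.insert(lib.clone()) {
            continue;
        }
        for dep in needed.get(&lib).into_iter().flatten() {
            if !seen.contains(dep) {
                queue.push_back(dep.clone());
            }
        }
    }
    seen
}
```

For a test binary that needs `libelf.so.1` and `libc.so.6`, where `libelf.so.1` in turn needs `libz.so.1`, the closure contains all three libraries even though the binary only names two of them.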
Bare-metal mode — run the same scenarios on real hardware without VMs
cargo ktstr export packages a registered test as a self-extracting
.run script that reproduces the scenario on bare metal without
a VM. The runfile validates host topology and sched_ext support,
then dispatches the test directly under whatever scheduler is
already active. Used for testing under production schedulers and
real topology.
Declarative scheduler registration — one macro declares the binary, default topology, kernels, and assertions
Tests load sched_ext schedulers
via BPF struct_ops inside the VM. The
declare_scheduler!
macro registers a scheduler in the KTSTR_SCHEDULERS distributed
slice — binary path, default topology, kernel filter for the
verifier sweep, assertion overrides, and always-on CLI args all
land in one declaration that tests reference via the bare const
ident the macro emits.
use ktstr::declare_scheduler;

declare_scheduler!(MITOSIS, {
    name = "mitosis",
    binary = "scx_mitosis",
    topology = (1, 2, 4, 1),
    sched_args = ["--exit-dump-len", "1048576"],
});
Without a scheduler = … attribute on #[ktstr_test], tests run
under the kernel’s default scheduler (EEVDF).
Data-driven scenarios — declare what you want, the framework handles cgroups, cpusets, and workers
Scenarios are composable sequences of
steps and ops. You declare intent as data —
the framework creates cgroups, assigns cpusets, spawns workers,
sets scheduling policies, and manages affinity. 50+
canned scenarios across 8 scenario
submodules cover basic, cpuset, dynamic, stress, interaction,
affinity, nested, and performance patterns. (The ops module
is the underlying DSL that every scenario is expressed in, not
a scenario category; the scenarios module is the top-level
catalog aggregator.)
API types:
- CgroupDef – declarative cgroup: name + cpuset + workload(s)
- Step – sequence of ops followed by a hold period
- Op – atomic operation (add/remove cgroup, set/swap/clear cpuset, spawn, stop, set affinity, move tasks)
- CpusetSpec – topology-relative cpuset (LLC-aligned, disjoint, overlapping, range, exact)
- HoldSpec – hold duration (fractional, fixed, or looped)
- AffinityIntent – per-worker affinity (inherit, random subset, LLC-aligned, cross-cgroup, single CPU, exact)
- SchedPolicy – Linux scheduling policy (Normal, Batch, Idle, FIFO, RoundRobin)
- WorkSpec – workload definition for a group of workers
Gauntlet — one test declaration, dozens of topology variants with budget-aware CI selection
A single #[ktstr_test]
auto-expands across topology presets. Multi-kernel runs
(cargo ktstr test --kernel A --kernel B) add the kernel as
an additional dimension. Budget-based selection
(KTSTR_BUDGET_SECS) picks the subset that maximizes coverage
within a CI time limit.
Constraint attributes:
- min_llcs, max_llcs, min_cpus, max_cpus, min_numa_nodes, max_numa_nodes, requires_smt – topology gates
- extra_sched_args – per-test scheduler CLI arguments
#[ktstr_test] proc macro — zero-boilerplate test declaration with nextest integration
Declares tests with topology, scheduler, and constraint attributes.
Generates both nextest-compatible entries and standard #[test]
wrappers. No custom harness or main() needed.
See The #[ktstr_test] Macro.
Library-first — add as a dev-dependency, write tests in your own crate
Add ktstr as a dev-dependency. ktstr::prelude re-exports
all test-authoring types.
See Getting Started for setup.
Automatic lifecycle — boot, load scheduler, run scenario, collect results, shutdown — all handled
The framework manages the full VM lifecycle: boot, scheduler start, scenario execution, result collection, and shutdown. Bidirectional SHM signal slots coordinate graceful shutdown, BPF map writes, and readiness gates between host and guest.
38 work types — configurable workload profiles for different scheduling pressures
Workers are fork()ed processes placed in cgroups:
- SpinWait – tight CPU spin loop
- YieldHeavy – repeated sched_yield with minimal CPU work
- Mixed – CPU spin burst followed by sched_yield
- AluHot – dependent integer multiply chain at high IPC (≥ 2.0)
- SmtSiblingSpin – paired PAUSE-spin pinned across two SMT siblings
- IpcVariance – alternating high-IPC (multiplies) / low-IPC (cache touches) phases
- IoSyncWrite – 16 × 4 KB pwrites + fdatasync per iteration (O_SYNC)
- IoRandRead – 4 KB random pread (O_DIRECT)
- IoConvoy – interleaved sequential pwrite + random pread with periodic fdatasync (O_DIRECT)
- Bursty – CPU burst then sleep (parameterized via Duration)
- IdleChurn – CPU burst then nanosleep (hrtimer + idle-class path)
- PipeIo – CPU burst then pipe exchange (cross-CPU wake placement)
- FutexPingPong – paired futex wait/wake (non-WF_SYNC)
- FutexFanOut – 1:N fan-out wake
- CachePressure – strided RMW sized to pressure L1
- CacheYield – cache pressure + sched_yield
- CachePipe – cache pressure + pipe exchange
- Sequence – compound: loop through phases
- ForkExit – rapid fork+_exit cycling
- NiceSweep – cycle nice level from -20 to 19
- AffinityChurn – rapid self-directed sched_setaffinity
- PolicyChurn – cycle SCHED_OTHER → BATCH → IDLE (→ FIFO/RR with CAP_SYS_NICE)
- NumaMigrationChurn – rotate sched_setaffinity across NUMA nodes
- CgroupChurn – cycle cgroup membership between sibling cgroups
- FanOutCompute – messenger/worker fan-out with matrix-multiply compute
- AsymmetricWaker – paired workers in mismatched scheduling classes share one futex word
- WakeChain – ring of waker-wakee hops (Pipe with WF_SYNC, or Futex)
- EpollStorm – eventfd producers + epoll_wait consumers
- ThunderingHerd – N waiters on one global futex word; broadcast wake
- PageFaultChurn – rapid mmap/fault/MADV_DONTNEED cycling
- NumaWorkingSetSweep – rotate working-set memory across NUMA nodes via mbind
- MutexContention – N-way futex mutex contention
- PriorityInversion – three-tier lock contention (Pi or Plain futex)
- ProducerConsumerImbalance – unbalanced producer/consumer pipeline (queue grows)
- SignalStorm – paired workers fire tkill(partner, SIGUSR1) between CPU bursts
- PreemptStorm – one SCHED_FIFO worker preempts CFS spinners at ~kHz
- RtStarvation – SCHED_FIFO workers monopolise CPU; SCHED_NORMAL workers starve
- Custom – user-supplied work function
See WorkType.
Observability
Zero-overhead introspection — read/write kernel state and BPF maps from the host w/o the guest knowing
All observability is built on direct read/write of guest physical memory from the host via the KVM memory mapping, with page table walks for dynamically allocated addresses. No guest-side instrumentation, no BPF syscalls — the observer does not perturb the scheduler under test.
Kernel state: per-CPU runqueues, sched_domain trees, schedstat counters, and sched_ext event counters — read via BTF-resolved struct offsets from vmlinux. See Monitor.
BPF state: maps discovered by walking kernel data structures through page table translation. Array and percpu_array maps support typed field access via BPF program BTF; hash maps return raw key-value pairs. Read/write — write enables host-initiated crash reproduction.
Types: GuestMem, GuestKernel, BpfMapAccessor
Cast analysis — recover typed pointers from u64 map fields automatically; no annotations required
Schedulers stash kernel kptrs (task_struct *, cgroup *, …) and
arena pointers in BPF map fields the BTF declares as u64 because
BTF cannot express a pointer to a per-allocation type. The cast
analyzer walks the scheduler’s .bpf.o instruction stream, tracks
register state across LDX / STX / stack-spill / kfunc-return, and
records every (source_struct, field_offset) → target_struct
mapping it can prove from the program’s own access pattern.
The renderer feeds those mappings into render_cast_pointer so a
field that previously surfaced as a raw 0xffff… integer now
chases through to the target struct’s fields and prints with a
(cast→arena) or (cast→kernel) annotation distinguishing
cast-recovered pointers from BTF-typed ones. Failure dumps,
periodic captures, and on-demand snapshots all benefit
automatically.
A complementary sdt_alloc bridge recovers a chase target’s real
struct id when the scheduler’s program BTF declares the pointee as
a BTF_KIND_FWD forward declaration (the typical shape for
struct sdt_data __arena * fields whose body lives in a separate
library BTF). The freeze pre-pass populates a
slot_start → ArenaSlotInfo range index from each live
sdt_alloc allocator slot — one entry per slot, carrying
elem_size, header_size, and the resolved payload type id.
When a chase lands on a Fwd terminal, the renderer range-looks
up the slot the chased address falls in and renders the recovered
payload struct from the slot’s payload start (skipping the
union sdt_id header when the chased pointer lands at slot-start
rather than payload-start). The result carries an sdt_alloc
annotation suffix: (sdt_alloc) for the BTF-typed Type::Ptr
arm, or cast→arena (sdt_alloc) / cast→kernel (sdt_alloc)
when the chase originated from a cast-analyzer hit.
A parallel cross-BTF Fwd resolution path covers a different
multi-BTF shape: a BTF_KIND_FWD whose body lives in a sibling
embedded BPF object’s BTF rather than an sdt_alloc slot — the
typical scheduler shape where one .bpf.c declares
struct cgx_target; (forward) and another defines the body. The
cast-analysis pre-pass builds a name-keyed index over every
parsed embedded program BTF (one entry per complete !is_fwd
Struct / Union; first-write-wins on duplicate names; anonymous
types skipped). When a chase target survives the local same-BTF
Fwd resolve as a Fwd, the renderer consults the cross-BTF index
by name (matching aggregate kind — struct vs union); a hit
switches the recursion to the resolved sibling BTF and renders
the full body. No new annotation is introduced — the recovered
subtree carries whatever annotation it would have had if the
struct body lived in the entry BTF.
Runs unconditionally on every scheduler load; no test-author
configuration. False negatives (a missed cast — renderer falls
back to raw u64, the prior behavior) are acceptable; false
positives (a misidentified cast) are not, so the analyzer is
deliberately conservative on conflicting evidence and branch
joins. See Monitor → Cast analysis.
Unified timeline — correlate scenario phases with scheduler telemetry
Stimulus events (cgroup ops, cpuset changes, step transitions) correlated with monitor samples for per-phase scheduler behavior analysis. Each event carries timestamps, operation details, and cumulative worker iteration counts.
Periodic capture + temporal assertions — cadenced sampling across the workload window with monotonicity, rate, steady-state, convergence, and ratio patterns
#[ktstr_test(num_snapshots = N)] fires N host-side
freeze_and_capture boundaries inside the workload’s 10 %–90 %
window, anchored at the first MSG_TYPE_SCENARIO_START. Each
capture is stored on the SnapshotBridge under periodic_NNN
along with the parallel scx_stats JSON observed pre-freeze and a
pause-adjusted elapsed-ms timestamp.
A post_vm callback drains the bridge into a SampleSeries and
projects per-sample columns (SeriesField<T>) along the BPF or
stats axis. Seven temporal patterns evaluate the projections:
- nondecreasing – counter monotonicity (v[i] <= v[i+1])
- strictly_increasing – strict counter monotonicity (v[i] < v[i+1])
- rate_within(lo, hi) – bounded delta-per-millisecond
- steady_within(warmup_ms, tolerance) – post-warmup mean band
- converges_to(target, tolerance, deadline_ms) – three-consecutive-in-band witness before deadline
- always_true – boolean invariant at every sample
- ratio_within(other, lo, hi) – cross-field correlation between two series
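To make the pattern semantics concrete, here is a minimal sketch of four of them as free functions over plain slices. It does not use ktstr's SampleSeries or SeriesField types; the signatures are hypothetical, and details such as whether the deadline is inclusive are assumptions made for the example.

```rust
/// v[i] <= v[i+1] for every adjacent pair.
pub fn nondecreasing(v: &[f64]) -> bool {
    v.windows(2).all(|w| w[0] <= w[1])
}

/// v[i] < v[i+1] for every adjacent pair.
pub fn strictly_increasing(v: &[f64]) -> bool {
    v.windows(2).all(|w| w[0] < w[1])
}

/// Samples are (elapsed_ms, value). Every adjacent delta-per-millisecond
/// must fall inside [lo, hi]; non-advancing timestamps fail the check.
pub fn rate_within(samples: &[(f64, f64)], lo: f64, hi: f64) -> bool {
    samples.windows(2).all(|w| {
        let dt = w[1].0 - w[0].0;
        if dt <= 0.0 {
            return false;
        }
        let rate = (w[1].1 - w[0].1) / dt;
        rate >= lo && rate <= hi
    })
}

/// True when three consecutive samples sit within `tolerance` of `target`
/// and the third of them was taken at or before `deadline_ms` (assumed inclusive).
pub fn converges_to(samples: &[(f64, f64)], target: f64, tolerance: f64, deadline_ms: f64) -> bool {
    samples.windows(3).any(|w| {
        w[2].0 <= deadline_ms && w.iter().all(|(_, v)| (v - target).abs() <= tolerance)
    })
}
```

A series with fewer than two samples passes the monotonicity and rate checks vacuously in this sketch, and a series with fewer than three samples can never satisfy `converges_to`.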
Per-sample projection errors render with the underlying
SnapshotError variant (PlaceholderSample, MissingStats,
FieldNotFound, TypeMismatch, …) so coverage gaps surface with
their cause without re-running. See
Periodic Capture and
Temporal Assertions.
Worker comm / nice / pcomm — set task->comm, nice level, and thread-group leader name on every worker
CgroupDef::comm("name") calls prctl(PR_SET_NAME) on every
worker. CgroupDef::nice(n) calls setpriority(PRIO_PROCESS, 0, n)
on every worker. CgroupDef::pcomm("name") triggers ktstr’s
fork-then-thread spawn path: workers sharing a pcomm value coalesce
into ONE forked thread-group leader whose task->group_leader->comm
is the pcomm string, with worker threads inside it. Each worker
thread additionally sets its own task->comm via the per-WorkSpec
.comm(). Models real applications like chrome (pcomm) hosting
ThreadPoolForeg (per-thread comm). PipeIo/CachePipe and
SignalStorm work correctly under Fork and Thread clone modes,
including inside pcomm-coalesced thread groups. See Tutorial: Step 11.
Inline scheduler configs — pass JSON config strings directly into the test, framework writes to guest path
Schedulers that take a --config JSON file (scx_layered,
scx_lavd, …) declare the arg template + guest path via
Scheduler::config_file_def(arg_template, guest_path). Tests
supply the inline JSON via #[ktstr_test(config = LAYERED_CONFIG)].
The framework writes the content to a temp file, packs it into the
initramfs at the guest path, and substitutes {file} in the arg
template before launching the scheduler. A bidirectional pairing
gate (compile time + runtime) catches mismatched declarations: a
scheduler with config_file_def REQUIRES config = … on every
test, and a scheduler without it REJECTS config = …. See
Tutorial: Step 12 and
The #[ktstr_test] Macro.
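The {file} substitution and the pairing gate can be sketched in a few lines. This is a hedged illustration, not the framework's code: both function names are hypothetical, splitting the template on whitespace is an assumption, and the guest path in the usage note is a made-up example.

```rust
/// Expands an arg template such as "--config {file}" into argv entries,
/// replacing the {file} placeholder with the guest path.
/// Splitting on whitespace is an assumption of this sketch.
pub fn render_config_args(arg_template: &str, guest_path: &str) -> Vec<String> {
    arg_template
        .split_whitespace()
        .map(|token| token.replace("{file}", guest_path))
        .collect()
}

/// Bidirectional pairing check: a scheduler that declares a config file
/// requires the test to supply one, and a scheduler without one rejects it.
pub fn check_pairing(scheduler_has_config_def: bool, test_has_config: bool) -> Result<(), String> {
    match (scheduler_has_config_def, test_has_config) {
        (true, false) => Err("scheduler declares a config file but the test supplies no config".to_string()),
        (false, true) => Err("test supplies a config but the scheduler declares no config file".to_string()),
        _ => Ok(()),
    }
}
```

With the template `--config {file}` and a guest path of `/etc/sched.json`, the rendered arguments are `--config` followed by `/etc/sched.json`.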
no_perf_mode — decouple virtual topology from host hardware for tests with NUMA / LLC counts the host can't satisfy
#[ktstr_test(no_perf_mode = true)] (or KTSTR_NO_PERF_MODE=1)
builds the VM with the declared numa_nodes / llcs / cores /
threads even on smaller hosts. vCPU pinning, hugepages, NUMA
mbind, RT scheduling, and KVM exit suppression are skipped, and
gauntlet preset filtering relaxes host-topology checks to the
single “host has enough total CPUs” inequality. Mutually exclusive
with performance_mode = true (validated at runtime by KtstrTestEntry::validate). See
Tutorial: Step 13
and Performance Mode.
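The relaxed host check can be pictured as a single comparison. The sketch below is an assumption-heavy illustration: the `Topology` struct and both functions are hypothetical, and it assumes `llcs` is the total LLC count, so that the vCPU count is llcs × cores per LLC × threads per core.

```rust
/// Hypothetical mirror of the declared virtual topology.
#[derive(Debug, Clone, Copy)]
pub struct Topology {
    pub numa_nodes: u32,
    /// Assumed to be the total LLC count across all NUMA nodes.
    pub llcs: u32,
    /// Cores per LLC.
    pub cores: u32,
    /// Threads per core.
    pub threads: u32,
}

impl Topology {
    /// Total vCPUs under the assumption stated on `llcs`.
    pub fn total_cpus(&self) -> u32 {
        self.llcs * self.cores * self.threads
    }
}

/// The single inequality used when host-topology checks are relaxed:
/// the host only needs at least as many CPUs as the guest has vCPUs.
pub fn fits_host_no_perf(topo: &Topology, host_cpus: u32) -> bool {
    topo.total_cpus() <= host_cpus
}
```

Under these assumptions a 2-node, 4-LLC, 4-core, 2-thread guest needs 32 host CPUs, regardless of how many NUMA nodes or LLCs the host really has.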
Statistical regression detection — Polars-powered analysis across combinatoric test matrices
Polars-powered aggregation computes scheduling metrics across runs. Run-to-run compare with dual-gate significance thresholds (absolute and relative) catches regressions that single-run assertions miss.
Metrics:
- worst_spread – CPU time fairness (0.0 = perfect)
- worst_gap_ms – longest scheduling gap
- total_migrations / worst_migration_ratio – cross-CPU migration volume
- max_imbalance_ratio – runqueue length imbalance
- p99_wake_latency_us – tail wake-to-run latency
- mean_run_delay_us – mean schedstat run delay
- total_iterations – throughput
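A dual-gate comparison can be sketched as follows. This is one plausible reading rather than the actual Polars pipeline: the function name is hypothetical, the metric is assumed to be lower-is-better (for example worst_gap_ms), and a change is assumed to count only when it clears both the absolute and the relative threshold.

```rust
/// Flags a regression only when the increase over baseline clears BOTH gates.
/// Assumes a lower-is-better metric; thresholds are caller-chosen.
pub fn is_regression(baseline: f64, current: f64, abs_threshold: f64, rel_threshold: f64) -> bool {
    let delta = current - baseline;
    if delta <= 0.0 {
        // Unchanged or improved.
        return false;
    }
    let abs_gate = delta >= abs_threshold;
    // With a zero baseline the relative change is undefined; this sketch
    // lets the absolute gate decide in that case.
    let rel_gate = if baseline > 0.0 {
        delta / baseline >= rel_threshold
    } else {
        true
    };
    abs_gate && rel_gate
}
```

The two gates cover each other's blind spots: the absolute gate suppresses large percentage swings on tiny values, and the relative gate suppresses small percentage drifts on large values.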
Debugging
Auto-repro — automatically captures function arguments and struct state at crash-path call sites
On scheduler crash, extracts the crash stack and discovers
struct_ops callbacks. Attaches BPF kprobes and fentry/fexit probes,
triggers on sched_ext_exit, and reruns the scenario to capture
function arguments and struct field state at each crash-path call
site. See Auto-Repro.
Interactive shell — busybox shell inside the VM with host file injection (debugging, not tests — too slow)
ktstr shell boots a VM with busybox and drops into an interactive
shell. --include-files injects host binaries and libraries with
automatic shared library resolution.
--exec mode — run commands inside the VM non-interactively
ktstr shell --exec "command" runs a command inside the VM and
exits.
Infrastructure
Kernel management — build, cache, and auto-discover kernels from any source
ktstr kernel build builds and caches kernel images from version
numbers, local source paths, or git URLs. Automatic kernel discovery
resolves cached images, host kernels, and CI-provided paths without
manual configuration.
Performance mode — host-side isolation-ish and topology mirroring for maybe workable results
vCPU pinning, 2MB hugepages with pre-fault allocation, NUMA mbind, RT scheduling (SCHED_FIFO), KVM exit suppression (PAUSE + HLT), and KVM_HINTS_REALTIME CPUID — isolates the VM from host noise for reproducible measurements. Topology mirroring maps the VM’s LLC structure to match the host’s physical layout, so cache-aware scheduling decisions in the guest reflect real hardware behavior rather than synthetic geometry. Kinda. See Performance Mode.
Resource-budget coordination — --cpu-cap N bounds concurrent kernel builds and no-perf-mode VMs per host
--cpu-cap N (or KTSTR_CPU_CAP=N) constrains a no-perf-mode VM or
kernel build to exactly N host CPUs, selected by walking whole LLCs
in NUMA-aware, consolidation-aware order (filtered to the calling
process’s sched_getaffinity cpuset), and partial-taking the last LLC
so plan.cpus.len() == N. The full LLC is still flocked for
per-LLC coordination with concurrent ktstr peers. When the flag is
absent, the planner defaults to 30% of the allowed-CPU set (minimum
1). The plan writes the reserved CPUs and NUMA nodes into a cgroup
v2 cpuset sandbox so make -jN gcc fan-out and vCPU soft-mask
affinity respect the budget. On shell, mutually exclusive with
performance_mode=true (clap parse rejection); library consumers
see the env var silently ignored under perf-mode. Mutually exclusive
with KTSTR_BYPASS_LLC_LOCKS=1 at every entry point (contract vs.
bypass conflict rejected at CLI parse plus the library and
kernel-build-pipeline sites). ktstr locks / ktstr locks --json
enumerates every held LLC + per-CPU + cache-entry flock on the host
with holder PID + cmdline for contention diagnosis. See
Resource Budget.
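The planning rule described above (whole LLCs first, then a partial take of the last one) can be sketched like this. It is an illustration only: the function names are hypothetical, the LLC list is assumed to arrive already ordered and filtered to the allowed CPUs, and rounding the 30% default down is an assumption.

```rust
/// Default cap when --cpu-cap is absent: 30% of the allowed CPUs, minimum 1.
/// Rounding down is an assumption of this sketch.
pub fn default_cpu_cap(allowed_cpus: usize) -> usize {
    std::cmp::max(1, allowed_cpus * 30 / 100)
}

/// Walks LLCs in the given order and takes CPUs until `cap` is reached.
/// Returns (selected CPUs, indices of LLCs to lock). The last LLC may be
/// taken only partially, but its index is still returned so the whole LLC
/// can be locked for coordination with peers.
pub fn plan_cpus(llcs: &[Vec<usize>], cap: usize) -> (Vec<usize>, Vec<usize>) {
    let mut cpus = Vec::new();
    let mut locked_llcs = Vec::new();
    for (idx, llc) in llcs.iter().enumerate() {
        if cpus.len() >= cap {
            break;
        }
        if llc.is_empty() {
            continue;
        }
        let remaining = cap - cpus.len();
        cpus.extend(llc.iter().copied().take(remaining));
        locked_llcs.push(idx);
    }
    (cpus, locked_llcs)
}
```

With three 4-CPU LLCs and a cap of 6, the plan takes all of the first LLC, two CPUs from the second, and locks both.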
Guest coverage — profraw collection from inside the VM, merged with host coverage
Guest-side profraw collection via shared memory. The host merges
guest and host coverage for unified cargo llvm-cov reports.
cargo-ktstr — cargo subcommand for the full workflow, more introspection less shell scripts
Wraps cargo nextest run with automatic kernel resolution.
Subcommands (in --help order): test, coverage, llvm-cov,
stats, kernel, model, verifier, funify, completions,
show-host, show-thresholds, export, locks, shell.
See cargo-ktstr.
Remote kernel cache — GHA cache backend for cross-run kernel sharing
GHA cache backend for CI kernel sharing. When KTSTR_GHA_CACHE=1
and ACTIONS_CACHE_URL are set, all cache lookups check the remote
after a local miss before falling back to download (for versions) or
erroring (for cache keys). All successful builds are pushed to the
remote automatically. Non-fatal on failure; local cache is
authoritative.
Real-kernel verifier analysis — boots the kernel, loads the scheduler, reads actual verified instruction counts
Runs the BPF verifier against every declare_scheduler!-registered
scheduler’s struct_ops programs inside a real kernel. Reports
per-program verified instruction counts with cycle collapse —
deduplicating repeated verifier paths to show true unique cost.
The sweep emits one nextest cell per (declared scheduler ×
kernel-list entry × accepted topology preset) tuple, with
parallelism and retries handled natively by nextest.
Reads bpf_prog_aux.verified_insns from guest memory after loading
the scheduler via struct_ops — the same path production uses.
Captures real-world verification costs including map sharing, BTF
resolution, and program composition.
See Verifier.
Host-guest signaling — bidirectional host/VM coordination built into the test library
SHM signal slots enable the host and guest to coordinate without a network stack, serial protocol, or guest-side daemon. Graceful shutdown, readiness gates, and BPF map write triggers flow through shared memory mapped into both address spaces.
See Getting Started to set up your first test, or browse the Recipes for common workflows.
Getting Started
Prerequisites
Linux only (x86_64, aarch64). ktstr boots KVM virtual machines; it does not build or run on other platforms.
- Linux host with KVM access (/dev/kvm)
- Rust toolchain (stable, >= 1.94.1; pinned via rust-toolchain.toml)
- clang (BPF skeleton compilation)
- pkg-config, make, gcc
- autotools (autoconf, autopoint, flex, bison, gawk) – vendored libbpf/libelf/zlib build
- BTF (/sys/kernel/btf/vmlinux – present by default on most distros; set KTSTR_KERNEL if missing)
- Internet access on first build (downloads busybox source)
- Linux kernel 6.12+ for sched_ext tests (check with uname -r). The host kernel has no version requirement beyond KVM; the test kernel is whichever you build or cache via cargo ktstr kernel build. See Supported kernels for per-feature version boundaries.
Ubuntu/Debian:
sudo apt install clang pkg-config make gcc autoconf autopoint flex bison gawk
Fedora:
sudo dnf install clang pkgconf make gcc autoconf gettext-devel flex bison gawk
Install tools
cargo install cargo-nextest # required
cargo install --locked ktstr --bin ktstr --bin cargo-ktstr # both user-facing binaries (optional)
cargo-nextest is required. cargo ktstr test delegates to nextest
internally; without it, cargo ktstr test will fail.
cargo install --locked ktstr --bin ktstr --bin cargo-ktstr
installs the two user-facing binaries (ktstr host-side CLI and
cargo-ktstr dev workflow plugin); the --bin flags scope the
install away from the two test-fixture binaries
(ktstr-jemalloc-probe, ktstr-jemalloc-alloc-worker) that the
crate’s integration tests spawn.
Add the dependency
Add ktstr as a dev-dependency:
[dev-dependencies]
ktstr = { version = "0.4" }
Kernel discovery
Tests require a bootable Linux kernel. The test harness checks these locations in order:
1. KTSTR_TEST_KERNEL environment variable (direct image path).
2. KTSTR_KERNEL environment variable, parsed as one of three forms:
   - Path: search that directory for arch/<arch>/boot/<image>
   - Version (e.g. 6.14.2): look up the version in XDG cache
   - Cache key (from cargo ktstr kernel list): exact cache lookup
3. XDG cache: most recent cached image (newest first); entries built with a different kconfig fragment are skipped. When KTSTR_KERNEL named an explicit version or cache key that was not present in the cache, the cache scan is skipped entirely – discovery moves on to step 4 rather than substituting an unrelated cached kernel.
4. ./linux/arch/<arch>/boot/<image> (workspace-local build tree)
5. ../linux/arch/<arch>/boot/<image> (sibling directory)
6. /lib/modules/$(uname -r)/build/arch/<arch>/boot/<image> (installed kernel build tree)
7. /lib/modules/$(uname -r)/vmlinuz (installed kernel)
8. /boot/vmlinuz-$(uname -r)
9. /boot/vmlinuz (unversioned symlink)
On x86_64, the build-tree image is arch/x86/boot/bzImage; on
aarch64, arch/arm64/boot/Image.
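The filesystem fallbacks at the end of the chain can be written out as an ordered candidate list. This is a sketch of the documented order only, with hypothetical function names; it does not model the environment variables or the XDG cache, and `release` stands for the output of uname -r.

```rust
/// Build-tree image path for an architecture, as listed above.
pub fn image_path_in_tree(arch: &str) -> Option<&'static str> {
    match arch {
        "x86_64" => Some("arch/x86/boot/bzImage"),
        "aarch64" => Some("arch/arm64/boot/Image"),
        _ => None,
    }
}

/// Ordered filesystem candidates checked after the env vars and the cache.
/// Returns an empty list for unsupported architectures.
pub fn fallback_candidates(arch: &str, release: &str) -> Vec<String> {
    let Some(image) = image_path_in_tree(arch) else {
        return Vec::new();
    };
    vec![
        format!("./linux/{image}"),                      // workspace-local build tree
        format!("../linux/{image}"),                     // sibling directory
        format!("/lib/modules/{release}/build/{image}"), // installed kernel build tree
        format!("/lib/modules/{release}/vmlinuz"),       // installed kernel
        format!("/boot/vmlinuz-{release}"),
        "/boot/vmlinuz".to_string(),                     // unversioned symlink
    ]
}
```

A caller would probe these in order and use the first path that exists.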
The host’s installed kernel works for basic testing. For sched_ext tests, build a kernel with the ktstr config fragment (below). See Troubleshooting for details.
Implicit discovery is read-only: it uses existing cache entries (most recent valid entry first; entries built with a different kconfig fragment are skipped) and falls back to local build trees and host paths when nothing matches. It does NOT compute a local-{hash7}-{arch}-kc{suffix} cache key, run the build pipeline, or store a new cache entry from whatever source-tree image it ends up using. To opt into the build + cache-store pipeline so a source tree’s build is recorded and reused under a stable cache key, pass the path explicitly via cargo ktstr test --kernel ../linux; see “What it does” in cargo-ktstr for the full path-mode flow including the cache-hit short-circuit.
Build a kernel
cargo ktstr kernel build downloads a kernel tarball from kernel.org,
configures it with the embedded ktstr.kconfig fragment, builds it,
and caches the result:
cargo ktstr kernel build # latest stable series with >= 8 maintenance releases
cargo ktstr kernel build 6.14.2 # specific version
cargo ktstr kernel build 6.12 # highest 6.12.x patch release
cargo ktstr kernel build 6 # highest 6.x.y release
The bare cargo ktstr kernel build skips series that have fewer
than 8 maintenance releases to keep CI off brand-new series whose
early point releases tend to hit build issues on older toolchains;
pass the specific version explicitly if you need a series that
hasn’t reached .8 yet.
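The default-version rule can be sketched as a selection over known releases. This is an illustration with a hypothetical function name; it assumes "8 maintenance releases" means the series has reached patch level .8, and that the newest patch of the chosen series is the one built.

```rust
use std::collections::BTreeMap;

/// Picks the newest series whose highest patch level is at least 8 and
/// returns that series' highest release as (major, minor, patch).
pub fn pick_default(versions: &[(u32, u32, u32)]) -> Option<(u32, u32, u32)> {
    // Highest patch level seen per (major, minor) series.
    let mut series: BTreeMap<(u32, u32), u32> = BTreeMap::new();
    for &(major, minor, patch) in versions {
        let best = series.entry((major, minor)).or_insert(patch);
        if patch > *best {
            *best = patch;
        }
    }
    // Walk series from newest to oldest and stop at the first mature one.
    series
        .iter()
        .rev()
        .find(|entry| *entry.1 >= 8)
        .map(|(key, patch)| (key.0, key.1, *patch))
}
```

Given releases 6.12.30, 6.13.12, 6.14.9, and 6.15.3, the sketch skips 6.15 (only three maintenance releases) and selects 6.14.9.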
Subsequent runs of cargo ktstr test or cargo nextest run will
find the cached kernel automatically (step 3 in the discovery chain
above).
To build from a local source tree:
cargo ktstr kernel build --source ../linux
To list and manage cached kernels:
cargo ktstr kernel list
cargo ktstr kernel clean --keep 3
See cargo-ktstr for all options.
Manual
cd /path/to/linux
make defconfig
cat /path/to/ktstr/ktstr.kconfig >> .config
make olddefconfig
make -j$(nproc)
ktstr.kconfig in the repo root contains a kernel config fragment
tuned for scheduler testing (sched_ext, BPF, kprobes, minimal boot).
Write a test
Create a file in your crate’s tests/ directory (e.g.
tests/sched_test.rs) and write a #[ktstr_test] function. The
prelude
module re-exports the types you need.
The simplest test uses a canned scenario. AssertResult carries the
pass/fail verdict, diagnostic messages, and per-cgroup statistics from
the run.
use ktstr::prelude::*;

#[ktstr_test(llcs = 1, cores = 2, threads = 1)] // llcs = last-level caches
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
    // `scenarios::steady` is a canned scenario: two cgroups of equal
    // CPU-spin workers, no cpuset restrictions, run for the default
    // duration.
    scenarios::steady(ctx)
}
For custom cgroup topology, declare cgroups with CgroupDef and run
them with execute_defs. A CgroupDef bundles the cgroup name,
optional cpuset, and workload specification into a single declaration.
This is the most common custom test pattern:
use std::time::Duration;
use ktstr::prelude::*;
#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
execute_defs(ctx, vec![
CgroupDef::named("cg_0").workers(4),
CgroupDef::named("cg_1")
.workers(2)
// CPU burst for 50 ms, sleep for 100 ms, repeat.
.work_type(WorkType::bursty(
Duration::from_millis(50),
Duration::from_millis(100),
)),
])
}
execute_defs is a convenience wrapper that creates a single step
holding for the full duration – use it when all cgroups run
concurrently for one phase. Use execute_steps when you need
multiple phases (e.g., adding cgroups mid-test or changing cpusets
between phases).
Step::with_defs pairs a list of CgroupDefs with a HoldSpec that
controls how long the step runs. This example starts two cgroups, then
adds a third mid-test:
use ktstr::prelude::*;
#[ktstr_test(llcs = 1, cores = 4, threads = 1)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
let steps = vec![
// Phase 1: two cgroups for the first half.
Step::with_defs(
vec![
CgroupDef::named("cg_0").workers(2),
CgroupDef::named("cg_1").workers(2),
],
HoldSpec::Frac(0.5),
),
// Phase 2: add a third cgroup for the remaining half.
Step::with_defs(
vec![CgroupDef::named("cg_2").workers(2)],
HoldSpec::Frac(0.5),
),
];
execute_steps(ctx, steps)
}
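To make the hold arithmetic concrete, here is a standalone sketch of how fractional holds could map onto wall-clock windows. The `Hold` enum is a local stand-in for HoldSpec, and the rule that `Frac(f)` means `f` times the scenario duration is an assumption drawn from the "first half" / "remaining half" comments above.

```rust
use std::time::Duration;

/// Local stand-in for ktstr's `HoldSpec`; semantics are assumed.
#[derive(Clone, Copy)]
enum Hold {
    /// Fraction of the scenario duration.
    Frac(f64),
    /// The whole scenario duration.
    Full,
}

/// Returns the (start offset, length) of each step, run back to back.
fn step_windows(total: Duration, holds: &[Hold]) -> Vec<(Duration, Duration)> {
    let mut start = Duration::ZERO;
    holds
        .iter()
        .map(|h| {
            let len = match h {
                Hold::Frac(f) => total.mul_f64(*f),
                Hold::Full => total,
            };
            let window = (start, len);
            start += len; // the next step begins where this one ends
            window
        })
        .collect()
}
```

Under this model, two `Frac(0.5)` steps in a 12 s scenario occupy 0-6 s and 6-12 s.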
How it runs
The framework boots a KVM VM with the requested topology and runs
your test binary as the guest’s init process. Your test function
executes inside the VM – execute_defs and execute_steps
immediately create cgroups, spawn workers, run the workload, and
return assertion results. Ctx provides the guest topology
(ctx.topo) and cgroup management (ctx.cgroups).
What gets checked
Every test automatically checks for worker starvation, scheduling
fairness, scheduling gaps, and host-side runqueue health (including imbalance,
stalls, dispatch queue depth). These defaults come from
Assert::default_checks() and can be overridden per-scheduler or
per-test. See Checking for the full
list of checks and thresholds.
Run
The recommended way to run #[ktstr_test] tests is cargo ktstr test,
which handles kernel resolution and wraps cargo nextest:
cargo ktstr test --kernel ../linux
The ktstr ctor automatically intercepts nextest protocol args
(--list, --exact) for gauntlet expansion and budget-driven test
selection.
Fallbacks:
- cargo nextest run: ctor intercepts, runs gauntlet-expanded tests (you must supply your own kernel via KTSTR_KERNEL / KTSTR_TEST_KERNEL or the discovery chain).
- cargo test: standard harness runs the #[test] wrappers (base topology only, no gauntlet expansion).
Requires /dev/kvm. See
Troubleshooting if KVM
is unavailable.
A passing test prints a single line:
PASS [ 11.34s] my_crate::my_sched_tests ktstr/my_test
A failing test prints assertion details:
FAIL [ 12.05s] my_crate::my_sched_tests ktstr/my_test
--- STDERR ---
ktstr_test 'my_test' [topo=1n1l2c1t] failed:
stuck 3500ms on cpu1 at +1200ms
--- stats ---
4 workers, 2 cpus, 8 migrations, worst_spread=12.3%, worst_gap=3500ms
cg0: workers=2 cpus=2 spread=5.1% gap=3500ms migrations=4 iter=15230
cg1: workers=2 cpus=2 spread=12.3% gap=890ms migrations=4 iter=14870
Each test invocation writes results into
{CARGO_TARGET_DIR or "target"}/ktstr/{kernel}-{project_commit}/
as one *.ktstr.json sidecar per #[ktstr_test] variant. Run
cargo ktstr stats list to see runs (RUN, TESTS, DATE, ARCH
columns). See Runs for the full layout
and analysis workflow.
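The directory template above can be written out as a small path builder. This is a sketch: `run_dir` is a hypothetical helper, and the exact strings ktstr uses for the kernel and commit components are not specified here, so the arguments are placeholders.

```rust
use std::path::PathBuf;

/// Builds `{CARGO_TARGET_DIR or "target"}/ktstr/{kernel}-{project_commit}`.
/// `target_dir` stands in for the value of `CARGO_TARGET_DIR`, if set.
fn run_dir(target_dir: Option<&str>, kernel: &str, project_commit: &str) -> PathBuf {
    PathBuf::from(target_dir.unwrap_or("target"))
        .join("ktstr")
        .join(format!("{kernel}-{project_commit}"))
}
```

A caller would typically pass `std::env::var("CARGO_TARGET_DIR").ok().as_deref()` as the first argument.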
Using cargo-ktstr
cargo ktstr test handles kernel resolution and test execution in
one command:
cargo ktstr test # auto-discover kernel
cargo ktstr test --kernel ../linux # local source tree (builds + caches; subsequent runs hit cache)
cargo ktstr test --kernel 6.14.2 # version (auto-downloads on miss)
cargo ktstr test -- -E 'test(my_test)' # pass nextest args
See cargo-ktstr for details.
Interactive shell
cargo ktstr shell boots a VM with busybox for manual exploration:
cargo ktstr shell # default 1,1,1,1 topology
cargo ktstr shell --topology 1,2,4,1 # 1 NUMA node, 2 LLCs, 4 cores/LLC, 1 thread/core
cargo ktstr shell -i ./my-scheduler # include a file in the guest
cargo ktstr shell -i ./test-data/ # include a directory recursively
Included ELF binaries get automatic shared library resolution.
Directories are walked recursively; their contents appear under
/include-files/<dirname>/ preserving the original structure.
Individual files are available at /include-files/<name> inside the guest.
See cargo-ktstr shell for
details.
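The host-to-guest path mapping described above can be sketched in a few lines. Both function names are hypothetical, and the sketch covers only the path layout, not shared-library resolution or the actual packing step.

```rust
use std::path::{Path, PathBuf};

/// Guest path for a single file passed with `-i`: `/include-files/<name>`.
fn guest_path_for_file(host_file: &Path) -> Option<PathBuf> {
    Some(Path::new("/include-files").join(host_file.file_name()?))
}

/// Guest path for a file found while walking a directory passed with `-i`:
/// `/include-files/<dirname>/<path relative to that directory>`.
fn guest_path_in_dir(host_dir: &Path, host_file: &Path) -> Option<PathBuf> {
    let rel = host_file.strip_prefix(host_dir).ok()?;
    Some(Path::new("/include-files").join(host_dir.file_name()?).join(rel))
}
```

So `-i ./test-data/` with a host file `./test-data/a/b.json` would surface in the guest as `/include-files/test-data/a/b.json`.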
Next steps
To understand scenarios, flags, and checking: Core Concepts.
To write new tests: Writing Tests.
To test your own scheduler: Test a New Scheduler.
Zero to ktstr
This tutorial walks through writing a complete #[ktstr_test] from
scratch. By the end you’ll have a working scheduler test that runs
two cgroups with different lifecycle patterns across a multi-LLC
topology, tunes test duration and the watchdog, and asserts
fairness, throughput parity, and cpuset isolation.
What you’ll build
A test named mixed_workloads that:
- Runs two cgroups on separate LLCs:
  - background_spinner – a persistent CPU-bound load that runs for the entire test duration.
  - phased_worker – a worker that loops through explicit Spin -> Yield -> Spin -> Yield ... phases via WorkType::Sequence.
- Targets a 2-LLC, 4-core topology so the scheduler has a real cache boundary to respect.
- Sets explicit test duration and scx watchdog timeout via #[ktstr_test] attributes.
- Asserts fairness (per-cgroup spread), throughput parity (CV across workers + minimum rate), and cpuset isolation (workers stay on their assigned CPUs). Scheduling gaps and host-side runqueue health are checked automatically.
The complete test is at the end of this page.
Prerequisites
Set up the host and a kernel before continuing:
- Getting Started covers KVM access, the toolchain, and the dev-dependency.
- A bootable Linux kernel image is required. Build one with cargo ktstr kernel build or point at a source tree with cargo ktstr test --kernel ../linux. See Getting Started: Build a kernel for the full kernel-management workflow.
Once the dependency is in place, create a file under your crate’s
tests/ directory (e.g. tests/mixed_workloads.rs) and follow along.
Step 1: The skeleton
Every #[ktstr_test] is a Rust function that takes &Ctx and returns
Result<AssertResult>. Start with an empty body that passes
unconditionally:
use ktstr::prelude::*;
#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
let _ = ctx;
Ok(AssertResult::pass())
}
use ktstr::prelude::*; brings in every type the test body needs –
Ctx, AssertResult, CgroupDef, WorkType, CpusetSpec,
execute_defs, and the Result alias from anyhow. The
#[ktstr_test] attribute registers the function so cargo ktstr test
discovers it and boots a VM with the requested topology.
A test without a scheduler = ... attribute runs under the kernel’s
default EEVDF scheduler — useful as a baseline. Step 2 swaps in a
sched_ext scheduler so the rest of the tutorial exercises that
scheduler instead.
For the full attribute reference, see The #[ktstr_test] Macro.
Step 2: Define your scheduler
To target a sched_ext scheduler, declare it with
declare_scheduler! and reference the generated const from
#[ktstr_test(scheduler = …)]. The example uses scx-ktstr,
the test-fixture scheduler shipped in this workspace; substitute
your own binary name to target a different scheduler.
use ktstr::declare_scheduler;
use ktstr::prelude::*;
declare_scheduler!(KTSTR_SCHED, {
name = "ktstr_sched",
binary = "scx-ktstr",
});
#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 1, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
let _ = ctx;
Ok(AssertResult::pass())
}
declare_scheduler! emits a pub static KTSTR_SCHED: Scheduler
holding the declared fields and registers a private static in the
KTSTR_SCHEDULERS distributed slice via linkme so
cargo ktstr verifier discovers it automatically. The
scheduler = slot on #[ktstr_test] expects an &'static Scheduler
— pass the bare KTSTR_SCHED ident.
The macro fields:
- name – scheduler name for display and sidecar keys.
- binary – binary name for auto-discovery in target/{debug,release}/, the directory containing the test binary, or a KTSTR_SCHEDULER override path. When the scheduler is a [[bin]] target in the same workspace, cargo build already places it where discovery looks. The resolved binary is packed into the VM’s initramfs.
- topology = (numa, llcs, cores, threads) – optional default VM topology. Tests can override individual dimensions via #[ktstr_test(llcs = ...)]. Omitted here; the per-test attributes in Step 4 set every dimension explicitly.
- sched_args = ["--flag", "--another"] – optional CLI args prepended to every test that uses this scheduler. Useful when a scheduler needs the same --enable-llc-style switches in every run; for one-off variations, use #[ktstr_test(extra_sched_args = [...])] on the test instead.
- kernels = ["6.14", "6.15..=7.0"] – optional set of kernel specs the verifier sweep should exercise this scheduler against. See BPF Verifier for the cell emission contract.
For the full attribute surface (sysctls, kargs, config_file,
gauntlet constraints, scheduler-level assertion overrides), see
Scheduler Definitions.
When the macro doesn’t fit — the most common case being inline
JSON config supplied per-test or programmatic composition — define
the Scheduler const through the manual builder instead. Step 12
below walks through that path with scx_layered.
Step 3: Add workloads
A CgroupDef declares a cgroup along with the workers that will run
inside it. The builder methods configure worker count, the work each
worker performs, scheduling policy, and cpuset assignment.
Add two cgroups – both running tight CPU spinners for now. Step 5 will swap one of them for a phased workload:
use ktstr::prelude::*;
#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 1, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
execute_defs(ctx, vec![
CgroupDef::named("background_spinner")
.workers(2)
.work_type(WorkType::SpinWait),
CgroupDef::named("phased_worker")
.workers(2)
.work_type(WorkType::SpinWait),
])
}
Without .with_cpuset(...), a cgroup’s workers run on every CPU
in the test’s topology — they share the VM’s full CPU set
with all other cgroups. .with_cpuset(CpusetSpec::Llc(idx))
(introduced in Step 4) restricts a cgroup to one LLC’s CPUs, and
the other CpusetSpec variants narrow further.
WorkType::SpinWait runs a tight CPU spin loop; it is one of many
work primitives – see WorkType for the
full enum (Bursty, FutexPingPong, CachePressure,
IoSyncWrite, PageFaultChurn, MutexContention, Sequence, etc.)
and the work-type-to-scheduler-behavior mapping table.
execute_defs is a convenience wrapper that runs each cgroup
concurrently for the test’s full duration. Both cgroups are
persistent – they hold for the entire scenario. Use
execute_steps when you need to add cgroups mid-run or swap
cpusets between phases; see Ops and Steps for
the multi-step API.
Step 4: Set topology
The #[ktstr_test] attribute carries the VM’s CPU topology.
Topology dimensions are big-to-little: numa_nodes (default 1),
llcs (total across all NUMA nodes), cores per LLC, and
threads per core. Total CPU count is llcs * cores * threads.
LLC count matters because the last-level cache is the primary
scheduling boundary – tasks sharing an LLC benefit from shared
cache lines, while cross-LLC migration carries a cold-cache penalty.
A scheduler that ignores LLC topology will look fine on llcs = 1
and start failing as soon as there is a real cache boundary to
respect.
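The topology arithmetic is easy to check by hand, and the following standalone sketch spells it out. `Topo` is a local model rather than ktstr's own type; the label format is inferred from the `[topo=...]` tag in test output, and the per-LLC CPU numbering is an assumption made for illustration.

```rust
/// Local model of the four topology dimensions; not ktstr's own type.
struct Topo {
    numa_nodes: u32,
    llcs: u32,    // total across all NUMA nodes
    cores: u32,   // per LLC
    threads: u32, // per core
}

impl Topo {
    /// Total CPU count: llcs * cores * threads.
    fn total_cpus(&self) -> u32 {
        self.llcs * self.cores * self.threads
    }

    /// Label in the `1n2l2c1t` shape seen in test output.
    fn label(&self) -> String {
        format!("{}n{}l{}c{}t", self.numa_nodes, self.llcs, self.cores, self.threads)
    }

    /// CPUs of one LLC, ASSUMING CPUs are numbered contiguously per LLC.
    fn llc_cpus(&self, idx: u32) -> std::ops::Range<u32> {
        let per_llc = self.cores * self.threads;
        idx * per_llc..(idx + 1) * per_llc
    }
}
```

For `llcs = 2, cores = 2, threads = 1` this gives 4 CPUs and, under the contiguous-numbering assumption, CPUs 2 and 3 for LLC 1.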
Bump the topology to two LLCs with two cores each (4 CPUs total) so each cgroup can own its own LLC:
#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 2, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
execute_defs(ctx, vec![
CgroupDef::named("background_spinner")
.workers(2)
.work_type(WorkType::SpinWait)
.with_cpuset(CpusetSpec::Llc(0)),
CgroupDef::named("phased_worker")
.workers(2)
.work_type(WorkType::SpinWait)
.with_cpuset(CpusetSpec::Llc(1)),
])
}
CpusetSpec::Llc(idx) confines a cgroup to the CPUs that belong to
LLC idx. Other variants (Numa, Range, Disjoint, Overlap,
Exact) cover NUMA-node binding, fractional partitioning, and
hand-built CPU sets.
For the full topology surface (NUMA accessors, per-LLC info, cpuset generation helpers), see TestTopology.
Step 5: Compose phased work inside a cgroup
So far both cgroups run identical CPU spinners. The point of this
test is to exercise a scheduler against different lifecycle
patterns at once, so swap phased_worker for a worker that loops
through explicit phases.
WorkType::Sequence { first: Phase, rest: Vec<Phase> } runs each
phase for its specified duration and then advances to the next; when
the last phase ends the loop restarts from first. Phases:
Phase::Spin(Duration), Phase::Sleep(Duration),
Phase::Yield(Duration), Phase::Io(Duration). Use the
WorkType::sequence(first, rest) constructor.
Phase, WorkType, and CpusetSpec are all in
ktstr::prelude::*; only std::time::Duration needs an extra
use line — added on the first line of the example below:
use std::time::Duration;
use ktstr::prelude::*;
#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 2, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
execute_defs(ctx, vec![
// Persistent CPU pressure on LLC 0 for the whole run.
CgroupDef::named("background_spinner")
.workers(2)
.work_type(WorkType::SpinWait)
.with_cpuset(CpusetSpec::Llc(0)),
// Phased worker on LLC 1: spin 100 ms, yield for 20 ms,
// then loop. Stresses the scheduler's wake-after-yield
// placement repeatedly while the LLC-0 spinner keeps
// host runqueue pressure constant.
CgroupDef::named("phased_worker")
.workers(2)
.work_type(WorkType::sequence(
Phase::Spin(Duration::from_millis(100)),
[Phase::Yield(Duration::from_millis(20))],
))
.with_cpuset(CpusetSpec::Llc(1)),
])
}
The two cgroups now exercise distinct paths concurrently:
- background_spinner keeps two CPUs continuously busy on LLC 0.
- phased_worker alternates between burning CPU and yielding on LLC 1, exercising the scheduler’s voluntary-preemption + wakeup placement code paths.
Both cgroups still run for the entire scenario duration: the
phasing happens within each phased_worker worker’s loop, while
execute_defs holds both cgroups across the whole run via
HoldSpec::FULL. To express phasing across cgroups (e.g. add
phased_worker only for the second half of the run), use
execute_steps with multiple Step entries – see
Ops and Steps. Step 9 below adds an Op::snapshot
capture into a step’s op list.
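The looping behaviour of a Sequence workload can be modeled without ktstr at all. The sketch below uses a local `Phase` enum (only the two variants used in this step) and a hypothetical `phase_at` helper to answer which phase is active at a given elapsed time; it illustrates the loop semantics described above and is not the worker implementation.

```rust
use std::time::Duration;

/// Local stand-in for two of the `Phase` variants named in the text.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Phase {
    Spin(Duration),
    Yield(Duration),
}

impl Phase {
    fn duration(self) -> Duration {
        match self {
            Phase::Spin(d) | Phase::Yield(d) => d,
        }
    }
}

/// Returns (completed loops, active phase) at `elapsed`, when `first`
/// followed by `rest` repeats from `first` after the last phase ends.
fn phase_at(first: Phase, rest: &[Phase], elapsed: Duration) -> (u128, Phase) {
    let phases: Vec<Phase> = std::iter::once(first).chain(rest.iter().copied()).collect();
    let cycle: Duration = phases.iter().map(|p| p.duration()).sum();
    assert!(!cycle.is_zero(), "at least one phase must have a non-zero length");
    let loops = elapsed.as_nanos() / cycle.as_nanos();
    let mut offset = elapsed.as_nanos() % cycle.as_nanos();
    for p in &phases {
        if offset < p.duration().as_nanos() {
            return (loops, *p);
        }
        offset -= p.duration().as_nanos();
    }
    unreachable!("offset is always smaller than the cycle length")
}
```

With a 100 ms spin and a 20 ms yield, one loop takes 120 ms, so a 20 s run completes 166 full loops and ends 80 ms into a spin phase.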
Step 6: Tune execution
Several #[ktstr_test] attributes control how the VM runs the
scenario. The defaults are tuned for fast iteration; tune up for
longer / heavier runs:
| Attribute | Default | What it does |
|---|---|---|
| duration_s | 12 | Per-scenario wall-clock seconds. The framework keeps both cgroups running for duration_s seconds, then signals workers to stop and collects reports. |
| watchdog_timeout_s | 5 | sched_ext watchdog fire threshold. Applied via scx_sched.watchdog_timeout on 7.1+ kernels and the static scx_watchdog_timeout symbol on pre-7.1 kernels. When neither path is available the override silently no-ops. |
| memory_mb | 2048 | VM memory in MiB. |
watchdog_timeout_s is sched_ext’s per-task stall threshold — if
a runnable task is not picked for watchdog_timeout_s seconds,
the scheduler exits with SCX_EXIT_ERROR_STALL. The scenario
duration and watchdog are independent; a 12 s scenario with a 5 s
watchdog is normal. Tune the watchdog only when the scheduler
under test is expected to legitimately leave a runnable task
parked longer than the default 5 s.
For the run we’re building, set the duration to 20 s (so each phase iteration repeats many times):
#[ktstr_test(
scheduler = KTSTR_SCHED,
llcs = 2,
cores = 2,
threads = 1,
duration_s = 20,
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
// body unchanged from Step 5 -- two cgroups via execute_defs
}
For the full attribute reference (auto-repro, performance mode, topology constraints, etc.), see The #[ktstr_test] Macro.
Step 7: Add assertions
Default checks already run with no configuration – not_starved is
Some(true) in Assert::default_checks(), which enables:
- Starvation – any worker with zero work units fails the test.
- Fairness spread – per-cgroup max(off-CPU%) - min(off-CPU%) must stay under the spread threshold (release default 15%; debug default 35%: debug builds in small VMs show higher spread, so the threshold loosens automatically when cfg!(debug_assertions) is true).
- Scheduling gaps – the longest wall-clock gap observed at work-unit checkpoints must stay under the gap threshold (release default 2000 ms; debug default 3000 ms, behind the same cfg!(debug_assertions) gate as spread).
Host-side monitor checks (imbalance ratio, DSQ depth, stall
detection, fallback / keep-last event rates) are also enabled by
default with thresholds from MonitorThresholds::DEFAULT.
Cpuset isolation is opt-in – enable it with isolation = true.
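As a rough illustration of the spread check and the build-dependent defaults, consider the standalone sketch below. The function names are hypothetical; the formula and the default values are the ones quoted in the list above, and ktstr's real implementation may differ in details such as rounding.

```rust
/// Per-cgroup fairness spread: max(off-CPU%) - min(off-CPU%) across workers.
fn spread_pct(off_cpu_pct: &[f64]) -> f64 {
    if off_cpu_pct.is_empty() {
        return 0.0;
    }
    let max = off_cpu_pct.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let min = off_cpu_pct.iter().copied().fold(f64::INFINITY, f64::min);
    max - min
}

/// Default spread threshold quoted in the text: 15% release, 35% debug.
fn default_max_spread_pct() -> f64 {
    if cfg!(debug_assertions) { 35.0 } else { 15.0 }
}

/// Default gap threshold quoted in the text: 2000 ms release, 3000 ms debug.
fn default_max_gap_ms() -> u64 {
    if cfg!(debug_assertions) { 3000 } else { 2000 }
}
```

Two workers that spent 10% and 32% of the run off-CPU have a spread of 22 points, which fails a 15% or 20% threshold but passes the 35% debug default.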
Override the spread threshold and add throughput-parity gates:
use std::time::Duration;
use ktstr::prelude::*;
#[ktstr_test(
scheduler = KTSTR_SCHED,
llcs = 2,
cores = 2,
threads = 1,
duration_s = 20,
isolation = true,
max_spread_pct = 20.0,
max_throughput_cv = 0.5,
min_work_rate = 1.0,
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
execute_defs(ctx, vec![
CgroupDef::named("background_spinner")
.workers(2)
.work_type(WorkType::SpinWait)
.with_cpuset(CpusetSpec::Llc(0)),
CgroupDef::named("phased_worker")
.workers(2)
.work_type(WorkType::sequence(
Phase::Spin(Duration::from_millis(100)),
[Phase::Yield(Duration::from_millis(20))],
))
.with_cpuset(CpusetSpec::Llc(1)),
])
}
What each new attribute gates:
- isolation = true – workers must only run on CPUs in their assigned cpuset; any execution on an unexpected CPU fails the test.
- max_spread_pct = 20.0 – per-cgroup fairness override (the release default is 15.0; this loosens it slightly to absorb noise from the phased worker’s yield-driven re-placement). Setting max_spread_pct = 15.0 would silently match the release default and have no observable effect.
- max_throughput_cv = 0.5 – coefficient of variation of work_units / cpu_time across workers. Catches a scheduler that gives some workers disproportionately less effective CPU.
- min_work_rate = 1.0 – minimum work units per CPU-second per worker. Catches the case where every worker is equally slow (CV passes but absolute throughput is too low).
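The two throughput gates complement each other, which a small numeric sketch makes visible. The helper names are hypothetical, and the use of the population standard deviation is an assumption; ktstr may use a different estimator.

```rust
/// Per-worker rate: work units per CPU-second.
fn work_rate(work_units: u64, cpu_time_s: f64) -> f64 {
    work_units as f64 / cpu_time_s
}

/// Coefficient of variation (std dev / mean) of per-worker rates.
/// ASSUMPTION: population standard deviation. Expects a non-empty slice.
fn throughput_cv(rates: &[f64]) -> f64 {
    let n = rates.len() as f64;
    let mean = rates.iter().sum::<f64>() / n;
    let var = rates.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / n;
    var.sqrt() / mean
}

/// True when both the CV gate and the minimum-rate gate pass.
fn throughput_ok(rates: &[f64], max_cv: f64, min_rate: f64) -> bool {
    throughput_cv(rates) <= max_cv && rates.iter().all(|r| *r >= min_rate)
}
```

Rates of 0.5 and 0.5 have a CV of zero, so only the `min_work_rate` gate catches them; rates of 1.0 and 3.0 sit exactly on a 0.5 CV limit.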
#[ktstr_test] exposes the full Assert surface (scheduling gaps,
monitor thresholds, NUMA locality, wake-latency benchmarks). See
Checking for the merge chain
(default_checks() -> Scheduler.assert -> per-test) and the
complete threshold list.
Step 8: Run it
Run the test with cargo ktstr test, scoped to this one test name:
cargo ktstr test --kernel ../linux -- -E 'test(mixed_workloads)'
If cargo ktstr test reports “kernel not found”, the --kernel path
either points at a directory without a built vmlinux or at a kernel
the cache cannot locate. Run cargo ktstr kernel build to populate
the cache, or pass an explicit path to a built kernel source tree —
see Getting Started: Build a kernel
for the resolution order.
If a probe-related error surfaces (“probe skeleton load failed”,
“trigger attach failed”), re-run with RUST_LOG=ktstr=debug to
see the underlying libbpf reason. Common causes: missing tp_btf
target on older kernels (handled automatically by the two-phase
fallback), CONFIG_DEBUG_INFO_BTF=n in the guest kernel (rebuild
with BTF enabled), or a verifier rejection on a non-optional
program (the retry surfaces both the original and retry errors so
the verifier output is preserved).
cargo ktstr test resolves the kernel image, boots a VM with the
declared topology, runs the test as the guest’s init, and reports
the result. A passing run looks like:
PASS [ 11.34s] my_crate::mixed_workloads ktstr/mixed_workloads
A failure prints the violated threshold along with per-cgroup stats:
FAIL [ 12.05s] my_crate::mixed_workloads ktstr/mixed_workloads
--- STDERR ---
ktstr_test 'mixed_workloads' [sched=scx-ktstr] [topo=1n2l2c1t] failed:
unfair cgroup: spread=22% (10-32%) 2 workers on 2 cpus (threshold 20%)
--- stats ---
4 workers, 4 cpus, 12 migrations, worst_spread=22.4%, worst_gap=180ms
cg0: workers=2 cpus=2 spread=22.4% gap=180ms migrations=8 iter=15230
cg1: workers=2 cpus=2 spread=4.1% gap=120ms migrations=4 iter=14870
The detail line unfair cgroup: spread=N% (min-max%) N workers on N cpus (threshold N%) is the exact format produced by
assert::assert_not_starved. The detail-line shapes that
producer emits:
- tid {N} starved (0 work units) – when a worker made no progress. Example:
  ktstr_test 'mixed_workloads' [topo=1n2l2c1t] failed:
  tid 2 starved (0 work units)
- tid {N} stuck {N}ms on cpu{N} at +{N}ms (threshold {N}ms) – when a worker’s longest off-CPU gap crossed Assert::max_gap_ms. Example:
  ktstr_test 'mixed_workloads' [topo=1n2l2c1t] failed:
  tid 7 stuck 2500ms on cpu3 at +4200ms (threshold 2000ms)
- unfair cgroup: spread={pct}% ({lo}-{hi}%) – when per-cgroup fairness exceeded max_spread_pct. Example:
  ktstr_test 'mixed_workloads' [topo=1n2l2c1t] failed:
  unfair cgroup: spread=22% (10-32%) 2 workers on 2 cpus (threshold 20%)
The reporting layer does NOT include the cgroup name — cg{i}
is the positional index in the stats roll-up (cg0, cg1, …)
matching the cg{i}: workers=... cpus=... spread=... per-cgroup
stats line emitted by test_support::eval::evaluate_vm_result.
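When grepping CI logs or writing tooling around failures, it can help to have the three detail-line shapes as format strings. The functions below are illustrative reconstructions from the examples in this step, not ktstr's source; rounding percentages to whole numbers is an assumption based on the sample output.

```rust
/// `tid {N} starved (0 work units)`
fn starved(tid: u32) -> String {
    format!("tid {tid} starved (0 work units)")
}

/// `tid {N} stuck {N}ms on cpu{N} at +{N}ms (threshold {N}ms)`
fn stuck(tid: u32, gap_ms: u64, cpu: u32, at_ms: u64, threshold_ms: u64) -> String {
    format!("tid {tid} stuck {gap_ms}ms on cpu{cpu} at +{at_ms}ms (threshold {threshold_ms}ms)")
}

/// `unfair cgroup: spread={pct}% ({lo}-{hi}%) N workers on N cpus (threshold N%)`
fn unfair(lo: f64, hi: f64, workers: usize, cpus: usize, threshold: f64) -> String {
    format!(
        "unfair cgroup: spread={:.0}% ({:.0}-{:.0}%) {workers} workers on {cpus} cpus (threshold {:.0}%)",
        hi - lo,
        lo,
        hi,
        threshold
    )
}
```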
For the full run lifecycle, sidecar layout, and analysis workflow, see Running Tests.
Step 9: Capture a snapshot
Threshold-based assertions tell you something is off; snapshots tell
you what the scheduler’s state actually was. Op::snapshot(name)
asks the host to freeze every vCPU long enough to read the BPF
(in-kernel program) map state, vCPU registers, and per-CPU counters
into a FailureDumpReport keyed by name, then resumes the guest.
execute_defs (used so far) takes a flat list of cgroups and runs
them concurrently. To inject a snapshot mid-run, switch to
execute_steps, which takes a list of Steps — each step has
setup cgroups, an ops list (where Op::snapshot(...) lives),
and a hold duration:
use ktstr::prelude::*;
#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 2, cores = 2, threads = 1, duration_s = 20)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
execute_steps(ctx, vec![
Step {
setup: Setup::Defs(vec![
CgroupDef::named("background_spinner")
.workers(2)
.work_type(WorkType::SpinWait)
.with_cpuset(CpusetSpec::Llc(0)),
CgroupDef::named("phased_worker")
.workers(2)
.work_type(WorkType::SpinWait)
.with_cpuset(CpusetSpec::Llc(1)),
]),
ops: vec![Op::snapshot("after_workload")],
hold: HoldSpec::FULL,
},
])
}
After the scenario completes, the captured report is keyed by name
on the active SnapshotBridge — the host-side store that owns the
captured FailureDumpReport map for the run. Downstream test code
drains it and walks scalar variables with the dotted-path accessor —
e.g. snap.var("nr_cpus_onln").as_u64()? reads a scheduler global
(any .bss/.data/.rodata symbol; Snapshot::var walks all
three) as a u64.
For the bridge wiring, the full traversal API
(Snapshot::map, SnapshotEntry::get, per-CPU narrowing,
error variants), and the symbol-driven
Op::watch_snapshot variant
that fires whenever the guest writes a kernel symbol, see
Snapshots.
Step 10: Gauntlet expansion
The #[ktstr_test] macro doesn’t just emit a single test – it
also generates a gauntlet of variants that run the same body
across every accepted topology preset (single-LLC, multi-LLC,
multi-NUMA, with/without SMT).
Gauntlet variants are nextest-discovered and run with
cargo ktstr test -- --run-ignored ignored-only -E 'test(gauntlet/)'.
Constrain coverage with min_llcs / max_llcs, min_cpus /
max_cpus, and requires_smt on the attribute. See
Gauntlet Tests for the full
filtering and worked examples.
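Conceptually, the constraint attributes act as a filter over the preset list. The sketch below shows that idea with invented presets; the `Preset` and `Constraints` types are hypothetical, the real preset list lives in ktstr, and treating `requires_smt` as "more than one thread per core" is an assumption.

```rust
/// Hypothetical topology preset, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Preset {
    llcs: u32,
    cores: u32,
    threads: u32,
}

impl Preset {
    fn cpus(&self) -> u32 {
        self.llcs * self.cores * self.threads
    }
}

/// Constraints named after the attributes in the text.
struct Constraints {
    min_llcs: u32,
    max_llcs: u32,
    min_cpus: u32,
    max_cpus: u32,
    requires_smt: bool,
}

/// Keeps only the presets that satisfy every constraint.
fn accepted(presets: &[Preset], c: &Constraints) -> Vec<Preset> {
    presets
        .iter()
        .copied()
        .filter(|p| {
            (c.min_llcs..=c.max_llcs).contains(&p.llcs)
                && (c.min_cpus..=c.max_cpus).contains(&p.cpus())
                && (!c.requires_smt || p.threads > 1)
        })
        .collect()
}
```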
Step 11: Name and prioritize workers
Per-cgroup defaults travel through CgroupDef’s builder methods so
schedulers that key on task->comm or task_struct->static_prio
can be exercised with realistic, distinguishable workers:
CgroupDef::named("background_spinner")
.workers(2)
.comm("bg_spinner") // prctl(PR_SET_NAME, "bg_spinner")
.nice(10) // setpriority(PRIO_PROCESS, 0, 10)
.work_type(WorkType::SpinWait)
- .comm("name") – every worker calls prctl(PR_SET_NAME, name) at startup. The kernel truncates task->comm to 15 bytes inside __set_task_comm. Distinguishes workers in top / ps output and in scheduler tracepoints.
- .nice(n) – every worker calls setpriority(PRIO_PROCESS, 0, n). Values below the calling task’s current nice require CAP_SYS_NICE; ktstr always runs as root in-VM so the full -20..=19 range is available. Skips the syscall entirely when .nice(...) is not chained (workers inherit the parent’s nice).
- .pcomm("name") – set the thread-group leader’s task->comm. Triggers ktstr’s fork-then-thread spawn path: workers sharing a pcomm value coalesce into ONE forked leader process whose task->group_leader->comm is name, with worker threads inside it. Models real applications like chrome (pcomm) hosting ThreadPoolForeg (per-thread comm) and java (pcomm) hosting GC Thread / C2 CompilerThre.
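The 15-byte limit explains why names such as ThreadPoolForeg and C2 CompilerThre look cut off. A minimal sketch of the truncation, using hypothetical helper names, shows what a chosen comm will look like inside the guest:

```rust
/// Longest `task->comm` the kernel keeps: 15 bytes plus the NUL terminator.
const COMM_MAX: usize = 15;

/// What a name looks like after truncation to 15 bytes. Cuts on a UTF-8
/// boundary so the result stays a valid `&str`; the kernel itself
/// truncates raw bytes.
fn truncated_comm(name: &str) -> &str {
    let mut end = name.len().min(COMM_MAX);
    while !name.is_char_boundary(end) {
        end -= 1;
    }
    &name[..end]
}

/// Whether `n` lies in the nice range quoted in the text: -20..=19.
fn in_nice_range(n: i32) -> bool {
    (-20..=19).contains(&n)
}
```

Keeping comm values at 15 bytes or fewer avoids two workers collapsing onto the same truncated name.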
pcomm is a WorkSpec field, NOT a CloneMode variant. The two
real CloneMode variants are Fork (default; each worker is its
own thread group) and Thread (workers share the harness’s tgid as
std::thread::spawn threads). pcomm triggers an in-process
fork-then-thread shape that combines per-process leader visibility
schedulers expect with the in-process thread-spawn dispatch the
worker bodies use. PipeIo and CachePipe workers placed in a
.pcomm(...) cgroup run as threads inside the pcomm container;
their pipe-pair partner indices are computed within the
container’s thread group, not across forked siblings.
SignalStorm uses tkill (per-task signal delivery,
PIDTYPE_PID) rather than kill (PIDTYPE_TGID), so the
partner-vs-self addressing is correct uniformly across Fork and
Thread modes — including inside pcomm-coalesced thread groups.
Per-WorkSpec overrides win over cgroup-level defaults — write
.work(WorkSpec::default().nice(0).comm("hot_spinner")) to opt a
specific worker out of the cgroup-level defaults.
Step 12: Inline scheduler config
Schedulers like scx_layered and scx_lavd accept a JSON config via
--config /path/to/file.json. Declare the arg template + guest path
on a Scheduler const built via the manual builder, then supply the
inline content from the test attribute:
const LAYERED_SCHED: Scheduler = Scheduler::new("layered")
.binary(SchedulerSpec::Discover("scx_layered"))
.topology(1, 2, 4, 1)
.config_file_def("--config {file}", "/include-files/layered.json");
const LAYERED_CONFIG: &str = r#"{ "layers": [{ "name": "default" }] }"#;
#[ktstr_test(scheduler = LAYERED_SCHED, config = LAYERED_CONFIG)]
fn layered_default(ctx: &Ctx) -> Result<AssertResult> {
Ok(AssertResult::pass())
}
The framework writes LAYERED_CONFIG to the guest at the path
declared on the scheduler (/include-files/layered.json) and
substitutes {file} in the arg template with that path before
launching the scheduler binary. A scheduler that declares
config_file_def REQUIRES every test to supply config = …
(compile-time + runtime gate); a scheduler that doesn’t declare it
REJECTS config = … (the content would be silently dropped). See
The #[ktstr_test] Macro
for the full pairing rules.
For schedulers whose config lives on disk on the host (no inline
content), use Scheduler::config_file(host_path) instead — the
host file is packed into the initramfs and --config is injected
into scheduler args automatically; no config = … on the test is
needed in that flavor.
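The `{file}` substitution and the pairing rule can be pictured with a short standalone sketch. Both functions are hypothetical, and splitting the template on whitespace before substituting is an assumption about how the arg template becomes argv entries.

```rust
/// Expands an arg template such as "--config {file}" into argv entries.
/// ASSUMPTION: the template is split on whitespace, then `{file}` is
/// replaced in each token, so a path containing spaces stays one argument.
fn render_config_args(template: &str, guest_path: &str) -> Vec<String> {
    template
        .split_whitespace()
        .map(|tok| tok.replace("{file}", guest_path))
        .collect()
}

/// Pairing rule from the text: a test supplies `config = ...` if and only
/// if its scheduler declares a config-file template.
fn pairing_ok(scheduler_declares_template: bool, test_supplies_config: bool) -> bool {
    scheduler_declares_template == test_supplies_config
}
```

For the LAYERED_SCHED example, the rendered arguments would be `--config` followed by `/include-files/layered.json`.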
Step 13: Decouple virtual topology from host hardware
By default, ktstr pins vCPUs to host cores in a layout that mirrors
the declared virtual topology. A test declaring numa_nodes = 2, llcs = 8 cannot run on a 1-NUMA-node host — the gauntlet preset
filter rejects it. Set no_perf_mode = true to drop the host
mirroring and run the declared virtual topology unchanged:
#[ktstr_test(
numa_nodes = 2,
llcs = 8, // 8 % 2 == 0; the macro requires divisibility
cores = 4,
no_perf_mode = true, // VM built as declared, even on 1-NUMA hosts
)]
fn two_node_test(ctx: &Ctx) -> Result<AssertResult> { /* ... */ }
In no_perf_mode:
- The VM’s virtual topology is built as declared via KVM vCPU layout, ACPI SRAT/SLIT (x86_64), or FDT cpu nodes (aarch64) — the guest sees the full requested NUMA / LLC structure.
- vCPU-to-host-core pinning, 2 MB hugepages, NUMA mbind, RT scheduling, and KVM exit suppression are skipped.
- Host topology constraints (min_numa_nodes, min_llcs, requires_smt, per-LLC CPU widths) are NOT compared against host hardware. The only host check that survives is “total host CPUs >= total vCPUs”.
no_perf_mode = true is mutually exclusive with performance_mode = true (KtstrTestEntry::validate rejects the combination at
runtime). Equivalent to setting KTSTR_NO_PERF_MODE=1 per-test —
either source forces the no-perf path. See
Performance Mode
for the full lifecycle.
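The rules in this step reduce to three checks, sketched below as a standalone function. `Attrs` and `check_host` are hypothetical names, the error strings are invented, and `threads: 1` in the usage is an assumption since the example above does not set it; this is not the logic of KtstrTestEntry::validate itself.

```rust
/// Local model of the attributes involved; not ktstr's own type.
struct Attrs {
    numa_nodes: u32, // must be non-zero
    llcs: u32,
    cores: u32,
    threads: u32,
    no_perf_mode: bool,
    performance_mode: bool,
}

fn check_host(a: &Attrs, host_cpus: u32) -> Result<(), String> {
    // The two modes cannot be combined.
    if a.no_perf_mode && a.performance_mode {
        return Err("no_perf_mode and performance_mode are mutually exclusive".to_string());
    }
    // LLCs must divide evenly across NUMA nodes.
    if a.llcs % a.numa_nodes != 0 {
        return Err(format!("llcs ({}) must be divisible by numa_nodes ({})", a.llcs, a.numa_nodes));
    }
    // In no_perf_mode the only surviving host check is the CPU count.
    let vcpus = a.llcs * a.cores * a.threads;
    if a.no_perf_mode && host_cpus < vcpus {
        return Err(format!("host has {host_cpus} CPUs, VM needs {vcpus}"));
    }
    Ok(())
}
```

With `llcs = 8, cores = 4` and one thread per core, the VM needs 32 vCPUs, so a 16-CPU host would still be rejected under this model.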
Step 14: Periodic capture and temporal assertions
On-demand Op::snapshot (Step 9) captures the scheduler’s BPF state
at a point you choose. Periodic capture fires automatically at
evenly-spaced points across the workload window, producing a
time-ordered SampleSeries (the host-side container of drained
snapshots, in capture order; .periodic_only() filters to
periodic-tagged samples) for temporal assertions. Periodic capture
is only useful when paired with a post_vm callback that drains
the bridge and asserts something about the sequence — the two
attributes belong together.
Enable periodic capture with num_snapshots = N and register the
host-side callback with post_vm = function_name. The callback
drains the bridge and runs assertions over the time-ordered series:
use ktstr::prelude::*;
fn check_dispatch_advances(result: &VmResult) -> Result<()> {
let series = SampleSeries::from_drained(
result.snapshot_bridge.drain_ordered_with_stats(),
)
.periodic_only();
let mut v = Verdict::new();
let nr_dispatched: SeriesField<u64> = series.bpf(
"nr_dispatched",
|snap| snap.var("nr_dispatched").as_u64(),
);
nr_dispatched.nondecreasing(&mut v);
let r = v.into_result();
anyhow::ensure!(r.passed, "temporal assertions failed: {:?}", r.details);
Ok(())
}
#[ktstr_test(
scheduler = KTSTR_SCHED,
llcs = 2,
cores = 2,
threads = 1,
duration_s = 20,
num_snapshots = 5,
post_vm = check_dispatch_advances,
)]
fn dispatch_advances(ctx: &Ctx) -> Result<AssertResult> {
execute_defs(ctx, vec![
CgroupDef::named("workers").workers(2).work_type(WorkType::SpinWait),
])
}
num_snapshots = 5 fires 5 freeze-and-capture boundaries inside the
10%-90% window of the 20 s workload — at roughly +5 s, +7 s, +10 s,
+13 s, +15 s. Each capture freezes every vCPU, reads BPF map state,
and resumes. The host watchdog deadline is extended by each freeze
duration so captures do not eat into the workload budget. The
captures are stored under periodic_000…periodic_004 on the
SnapshotBridge.
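One spacing rule that reproduces the approximate timestamps quoted above is to split the 10%-90% window into `n + 1` equal intervals and capture at each interior boundary. The sketch below implements that rule as an assumption; the authoritative spacing rules are in Periodic Capture, and `periodic_offsets_ms` is a hypothetical name.

```rust
/// Capture tags and offsets (ms) for `n` periodic snapshots in a run of
/// `duration_ms`. ASSUMPTION: the 10%-90% window is divided into n + 1
/// equal intervals with a capture at each interior boundary.
fn periodic_offsets_ms(duration_ms: u64, n: u64) -> Vec<(String, u64)> {
    let start = duration_ms / 10;
    let width = duration_ms * 9 / 10 - start;
    (1..=n)
        .map(|k| (format!("periodic_{:03}", k - 1), start + width * k / (n + 1)))
        .collect()
}
```

For a 20 s run with five snapshots this yields offsets of about 4.7 s, 7.3 s, 10 s, 12.7 s, and 15.3 s, in line with the rounded figures in the text.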
Verdict is the assertion accumulator: every pattern call records
its outcome on the same Verdict, and v.into_result() consumes it
into a pass/fail AssertResult.
The seven temporal patterns on SeriesField:
| Pattern | Type | What it checks |
|---|---|---|
| nondecreasing | u64/f64 | Every consecutive pair: v[i] <= v[i+1] |
| strictly_increasing | u64/f64 | Every consecutive pair: v[i] < v[i+1] |
| rate_within(lo, hi) | f64 | Per-pair delta_value / delta_ms in [lo, hi] |
| steady_within(warmup_ms, tol) | f64 | Post-warmup values within mean ± tol% |
| converges_to(target, tol, deadline_ms) | f64 | 3 consecutive samples in [target ± tol] before deadline |
| ratio_within(other, lo, hi) | f64 | Per-sample self / other in [lo, hi] (cross-field) |
| always_true | bool | Every sample is true |
Every pattern method takes &mut Verdict as its first argument and
returns it, so calls chain into the same accumulator.
SeriesField::each provides per-sample scalar bounds:
field.each(&mut v).at_least(1u64),
field.each(&mut v).between(0.0, 100.0).
When a temporal pattern fails, the AssertDetail entries
identify the offending sample by tag and elapsed-ms timestamp.
Example for nondecreasing flagging a regression on
nr_dispatched:
nr_dispatched (nondecreasing): regression at sample periodic_002 (+10000ms): \
value 41 after prior value 42 at sample periodic_001 (+7000ms)
The rate, steady, converges, ratio, and always-true variants emit parallel shapes — every detail names the pattern, the specific sample(s) involved, and the violating value, so a failing test points at the data without re-running.
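To make the nondecreasing rule and its failure detail concrete, here is a self-contained sketch that checks consecutive pairs and formats a message in the shape shown above. It is an illustration only: the Sample struct and nondecreasing_violations function are hypothetical stand-ins, not ktstr's SeriesField or Verdict types, and the sample values are made up.

```rust
/// One projected sample: capture tag, elapsed time, and value.
/// Hypothetical stand-in for what a SeriesField holds.
struct Sample {
    tag: &'static str,
    elapsed_ms: u64,
    value: u64,
}

/// Returns one message per consecutive pair where v[i] > v[i+1].
fn nondecreasing_violations(field: &str, samples: &[Sample]) -> Vec<String> {
    samples
        .windows(2)
        .filter(|w| w[0].value > w[1].value)
        .map(|w| {
            format!(
                "{} (nondecreasing): regression at sample {} (+{}ms): \
                 value {} after prior value {} at sample {} (+{}ms)",
                field, w[1].tag, w[1].elapsed_ms, w[1].value,
                w[0].value, w[0].tag, w[0].elapsed_ms
            )
        })
        .collect()
}

fn main() {
    let samples = [
        Sample { tag: "periodic_000", elapsed_ms: 5000, value: 40 },
        Sample { tag: "periodic_001", elapsed_ms: 7000, value: 42 },
        Sample { tag: "periodic_002", elapsed_ms: 10000, value: 41 },
    ];
    for msg in nondecreasing_violations("nr_dispatched", &samples) {
        println!("{msg}");
    }
}
```

Collecting every violation rather than stopping at the first mirrors the accumulator idea: one run reports all offending pairs.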
For boundary timing, spacing rules, and the bridge cap, see
Periodic Capture. For the full
projection API (bpf, stats, auto-projectors) and failure
rendering, see
Temporal Assertions.
Step 15: After the run — test statistics
cargo ktstr stats aggregates the sidecar JSON files that each test
variant writes — useful for tracking gauntlet coverage, BPF verifier
complexity, and scheduling behavior across commits. This is a
post-run CLI workflow, not part of the test definition:
cargo ktstr stats # summary: gauntlet coverage, verifier, KVM stats
cargo ktstr stats list # list runs with date, test count, arch
# diff sidecar partitions defined by per-side --a-X / --b-X filter flags
cargo ktstr stats compare --a-kernel 6.14 --b-kernel 6.15
Statistics are collected even on test failure (if: !cancelled() in
CI). For the full subcommand surface, see
cargo-ktstr stats.
The complete test
The shape exercised by every step above, in one file.
sched_args = ["--slow"] always applies scx-ktstr’s --slow mode
(Step 2); watchdog_timeout_s = 10 overrides the sched_ext stall
threshold (Step 6); num_snapshots + post_vm enable periodic
capture and a temporal assertion (Step 14):
use std::time::Duration;

use ktstr::declare_scheduler;
use ktstr::prelude::*;

declare_scheduler!(KTSTR_SCHED, {
    name = "ktstr_sched",
    binary = "scx-ktstr",
    sched_args = ["--slow"],
});

fn check_dispatch_advances(result: &VmResult) -> Result<()> {
    let series = SampleSeries::from_drained(
        result.snapshot_bridge.drain_ordered_with_stats(),
    )
    .periodic_only();
    let mut v = Verdict::new();
    let nr_dispatched: SeriesField<u64> = series.bpf(
        "nr_dispatched",
        |snap| snap.var("nr_dispatched").as_u64(),
    );
    nr_dispatched.nondecreasing(&mut v);
    let r = v.into_result();
    anyhow::ensure!(r.passed, "temporal assertions failed: {:?}", r.details);
    Ok(())
}

#[ktstr_test(
    scheduler = KTSTR_SCHED,
    llcs = 2,
    cores = 2,
    threads = 1,
    duration_s = 20,
    watchdog_timeout_s = 10,
    isolation = true,
    max_spread_pct = 20.0,
    max_throughput_cv = 0.5,
    min_work_rate = 1.0,
    num_snapshots = 5,
    post_vm = check_dispatch_advances,
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .with_cpuset(CpusetSpec::Llc(0)),
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::sequence(
                Phase::Spin(Duration::from_millis(100)),
                [Phase::Yield(Duration::from_millis(20))],
            ))
            .with_cpuset(CpusetSpec::Llc(1)),
    ])
}
Run it:
cargo ktstr test --kernel ../linux -- -E 'test(mixed_workloads)'
What you’ll see when things break
The output examples below are the shapes ktstr emits in real runs. They’re worth skimming before you ship a test so a future failure is recognisable.
Auto-repro probe chain
When the scheduler crashes, ktstr re-runs the scenario with BPF
probes attached and dumps the path leading to the exit. Decoded
struct fields appear inline, with → between fentry-captured
entry values and fexit-captured exit values:
ktstr_test 'demo_host_crash_auto_repro' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
scheduler died
--- auto-repro ---
=== AUTO-PROBE: scx_exit fired ===
ktstr_enqueue main.bpf.c:21
task_struct *p
pid 97
cpus_ptr 0xf(0-3)
dsq_id SCX_DSQ_INVALID
enq_flags NONE
slice 0
vtime 0
weight 100
sticky_cpu -1
scx_flags QUEUED|ENABLED
do_enqueue_task kernel/sched/ext.c:1344
rq *rq
cpu 1
task_struct *p
pid 97
cpus_ptr 0xf(0-3)
dsq_id SCX_DSQ_INVALID → SCX_DSQ_LOCAL
enq_flags NONE
slice 20000000
vtime 0
weight 100
sticky_cpu -1
scx_flags QUEUED|DEQD_FOR_SLEEP → QUEUED
For the probe pipeline architecture, the BTF resolution path,
event-stitching rules, and the demo_host_crash_auto_repro
fixture, see Auto-Repro.
Failure dumps with cast-recovered pointers
The freeze coordinator builds a
FailureDumpReport on every snapshot,
periodic capture, and post-failure dump. Each captured map prints
as a map <name> (type=..., value_size=..., max_entries=...)
header followed by the rendered value (single-entry global
sections like .bss/.data) or entry: key=... blocks
(multi-entry maps). u64 fields the
cast analyzer flagged as
typed pointers chase to the recovered struct and print with a
(cast→arena) or (cast→kernel) annotation distinguishing them
from BTF-typed pointers; an (sdt_alloc) suffix is added when the
sdt_alloc bridge recovered the real payload struct from a
forward-declared pointee. A separate cross-BTF Fwd resolution
path also recovers a forward-declared pointee whose body lives
in a sibling embedded BPF object’s BTF — that path adds no
annotation, the body is rendered transparently:
map scx_lavd.bss (type=array, value_size=4096, max_entries=1)
.bss:
nr_cpus_onln=4
task_ctx_root 0xffff888103a01000 (cast→arena) → task_ctx{cpu_id=2, last_runtime_ns=12345678, nice=0}
current_task 0xffff90124f80c000 (cast→kernel) → task_struct:
pid=4321 weight=100
cpus_ptr 0xffff888103b40000 → cpus={0-3}
taskc_data 0x7f0000080000 (cast→arena (sdt_alloc)) → task_data{slice_ns=20000000, vtime=12345678}
A field that the analyzer cannot prove is a pointer falls back to
its raw u64 shape, which is the prior behavior — no
test-author configuration is required either way.
Verifier output
cargo ktstr verifier runs the BPF verifier against every
declare_scheduler!-registered scheduler’s struct_ops programs
inside a real kernel and prints per-program verified-instruction
counts. The dispatcher hands off to
cargo nextest run -E 'test(/^verifier/)'; nextest fans out
across (scheduler × declared kernel × accepted topology preset)
cells, each cell booting its own VM. Per-cell output starts with
a banner identifying the axis values:
=== ktstr_sched | kernel kernel_6_14_2 | topology tiny-1llc ===
verifier
enqueue verified_insns=42
verifier --- verifier stats ---
processed=42 states=8/10
verifier --- scheduler log ---
func#0 @0
0: R1=ctx() R10=fp0
processed 42 insns (limit 1000000) max_states_per_insn 1 total_states 10 peak_states 8 mark_read 5
When the scheduler did not capture a log, the output is just the per-program table:
=== ktstr_sched | kernel kernel_6_14_2 | topology tiny-1llc ===
verifier
enqueue verified_insns=500
dispatch verified_insns=1200
init verified_insns=300
--raw disables cycle collapsing in the scheduler-log section.
--kernel A --kernel B runs the sweep against multiple kernels;
the cell handler walks KTSTR_KERNEL_LIST to match each cell’s
sanitized kernel label against the resolved set. For the full
verifier-sweep model, cycle-collapse rules, and the
cell-name → kernel matching contract, see
Verifier.
What’s next
- Custom Scenarios – when the declarative ops API is not enough and the scenario needs arbitrary Rust logic between phases.
- Ops and Steps – multi-phase scenarios: add/remove cgroups, swap cpusets, freeze, resume.
- Watch Snapshots – Op::watch_snapshot("symbol") registers a hardware data-write watchpoint (up to 3 per scenario; slot 0 is reserved for the error-class exit_kind trigger).
- MemPolicy – NUMA-aware tests that bind memory allocations to specific nodes and check page locality.
- Performance Mode – pinned vCPUs, hugepages, and LLC-exclusivity validation for benchmark-grade runs.
- Auto-Repro – on a scheduler crash, ktstr can boot a second VM with probes attached and dump the failing state automatically.
- Recipes – task-specific guides (test a new scheduler, A/B compare branches, customize checking, benchmarking, host-state diff, ctprof).
Running Tests
Tests run via cargo ktstr test --kernel ../linux, which resolves the
kernel and wraps cargo nextest run to boot KVM virtual machines for
each #[ktstr_test] entry. Raw cargo nextest run remains available
as a fallback once a kernel is in place via the discovery chain.
Quick reference
# Run all tests
cargo ktstr test --kernel ../linux
# Run a specific test
cargo ktstr test --kernel ../linux -- -E 'test(sched_basic_proportional)'
# Run ignored gauntlet tests
cargo ktstr test --kernel ../linux -- --run-ignored ignored-only -E 'test(gauntlet/)'
Run analysis
Each test invocation writes a *.ktstr.json sidecar per variant
into {CARGO_TARGET_DIR or "target"}/ktstr/{kernel}-{project_commit}/.
cargo ktstr stats list enumerates runs; cargo ktstr stats compare, list-values, list-metrics, and show-host operate on
those sidecars. See Runs for the directory
layout, last-writer-wins semantics, and the comparison workflow.
Budget-based test selection
Set KTSTR_BUDGET_SECS to select the subset of tests that maximizes
feature coverage within a time budget. Useful for CI pipelines or
quick smoke tests.
# Run the best 5 minutes of tests
KTSTR_BUDGET_SECS=300 cargo ktstr test --kernel ../linux
# Budget applies to gauntlet variants too
KTSTR_BUDGET_SECS=600 cargo ktstr test --kernel ../linux -- --run-ignored all
The selector encodes each test as a bitset of properties (scheduler, topology class, SMT, workload characteristics) and greedily picks tests with the highest marginal coverage per estimated second. Duration estimates account for VM boot overhead based on vCPU count.
A summary is printed to stderr during --list:
ktstr budget: 42/1200 tests, 295/300s used, 38/38 configurations covered
When KTSTR_BUDGET_SECS is not set, all tests are listed as usual.
Custom scheduler
Declare a scheduler with declare_scheduler! and reference the
bare const from #[ktstr_test(scheduler = ...)]:
use ktstr::declare_scheduler;
use ktstr::prelude::*;

declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
});

#[ktstr_test(scheduler = MY_SCHED)]
fn my_sched_test(ctx: &Ctx) -> Result<AssertResult> {
    Ok(AssertResult::pass())
}
The binary is injected into the VM’s initramfs and started before
scenarios run. See Test a New Scheduler
for the full end-to-end workflow, and
Payload Definitions
for the #[derive(Payload)] macro that handles binary-kind
workloads (schbench, fio, etc.) — distinct from the
scheduler-under-test surface.
Single Scenario
Running a specific test
cargo ktstr test --kernel ../linux -- -E 'test(sched_basic_proportional)'
Running with verbose output
RUST_BACKTRACE=1 cargo ktstr test --kernel ../linux -- -E 'test(sched_basic_proportional)'
Investigating failures
Run one test with verbose output to see scheduler logs and kernel console:
RUST_BACKTRACE=1 cargo ktstr test --kernel ../linux -- -E 'test(cover_cgroup_cpuset_cross_llc_race)'
VM topology
Each #[ktstr_test] declares its topology via macro attributes:
#[ktstr_test(llcs = 2, cores = 4, threads = 2)]
The test framework boots a VM with the specified topology automatically.
See Investigate a Crash for interpreting failure output and Troubleshooting for common error messages.
Gauntlet
The gauntlet runs every test across 24 topology presets (14 on aarch64).
Gauntlet variants are prefixed with gauntlet/ and
ignored by default.
# Run only base tests (default)
cargo ktstr test --kernel ../linux
# Run only gauntlet variants
cargo ktstr test --kernel ../linux -- --run-ignored ignored-only -E 'test(gauntlet/)'
# Run everything
cargo ktstr test --kernel ../linux -- --run-ignored all
Entries with host_only = true never produce gauntlet variants
(topology variation is meaningless without a VM). See
host_only for
how that flag is set.
Variant naming
Single-kernel runs name each gauntlet variant
gauntlet/{test_name}/{preset}:
- {test_name} – the #[ktstr_test] function name
- {preset} – one of the topology preset names below
When --kernel resolves to two or more kernels (multiple
--kernel flags or a START..END range that expands to
several releases), cargo ktstr test / coverage /
llvm-cov add the kernel as a third dimension and append a
{kernel_label} suffix:
gauntlet/{test_name}/{preset}/{kernel_label}. See
Multi-kernel: kernel as a gauntlet dimension
for how the kernel labels are derived (sanitized from the
resolved version, range expansion, cache key, git source, or
path basename).
To run a single variant:
cargo ktstr test --kernel ../linux -- --run-ignored ignored-only \
-E 'test(=gauntlet/my_test/smt-2llc)'
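The naming scheme above can be expressed as a small formatting function, which is handy when generating -E filters from a script. This is a sketch only: variant_name is a hypothetical helper, not a ktstr function, and the kernel label kernel_6_14_2 is borrowed from the verifier banner example rather than derived by ktstr's own label rules.

```rust
/// Hypothetical helper: build a gauntlet variant name.
/// `kernel_label` is Some(..) only for multi-kernel runs.
fn variant_name(test_name: &str, preset: &str, kernel_label: Option<&str>) -> String {
    match kernel_label {
        // Multi-kernel: the kernel label is appended as a third dimension.
        Some(label) => format!("gauntlet/{test_name}/{preset}/{label}"),
        // Single-kernel: two dimensions only.
        None => format!("gauntlet/{test_name}/{preset}"),
    }
}

fn main() {
    let single = variant_name("my_test", "smt-2llc", None);
    let multi = variant_name("my_test", "smt-2llc", Some("kernel_6_14_2"));
    // The exact-match form used with nextest's -E filter.
    println!("test(={single})");
    println!("test(={multi})");
}
```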
Topology presets
| Preset | Topology | CPUs | LLCs | NUMA | Description |
|---|---|---|---|---|---|
| tiny-1llc | 1n1l4c1t | 4 | 1 | 1 | Single LLC |
| tiny-2llc | 1n2l2c1t | 4 | 2 | 1 | Minimal multi-LLC |
| odd-3llc | 1n3l3c1t | 9 | 3 | 1 | Odd CPU count |
| odd-5llc | 1n5l3c1t | 15 | 5 | 1 | Prime LLC count |
| odd-7llc | 1n7l2c1t | 14 | 7 | 1 | Prime LLC count |
| smt-2llc | 1n2l2c2t | 8 | 2 | 1 | SMT enabled |
| smt-3llc | 1n3l2c2t | 12 | 3 | 1 | SMT, 3 LLCs |
| medium-4llc | 1n4l4c2t | 32 | 4 | 1 | Medium topology |
| medium-8llc | 1n8l4c2t | 64 | 8 | 1 | Medium, many LLCs |
| large-4llc | 1n4l16c2t | 128 | 4 | 1 | Large, few LLCs |
| large-8llc | 1n8l8c2t | 128 | 8 | 1 | Large, many LLCs |
| near-max-llc | 1n15l8c2t | 240 | 15 | 1 | Near maximum |
| max-cpu | 1n14l9c2t | 252 | 14 | 1 | Near KVM vCPU limit |
| medium-4llc-nosmt | 1n4l8c1t | 32 | 4 | 1 | Medium, no SMT |
| medium-8llc-nosmt | 1n8l8c1t | 64 | 8 | 1 | Medium, many LLCs, no SMT |
| large-4llc-nosmt | 1n4l32c1t | 128 | 4 | 1 | Large, no SMT |
| large-8llc-nosmt | 1n8l16c1t | 128 | 8 | 1 | Large, many LLCs, no SMT |
| near-max-llc-nosmt | 1n15l16c1t | 240 | 15 | 1 | Near maximum, no SMT |
| max-cpu-nosmt | 1n14l18c1t | 252 | 14 | 1 | Near KVM vCPU limit, no SMT |
| numa2-4llc | 2n4l4c1t | 16 | 4 | 2 | Multi-NUMA, 2 nodes |
| numa2-8llc | 2n8l8c2t | 128 | 8 | 2 | Multi-NUMA, 2 nodes, SMT |
| numa2-8llc-nosmt | 2n8l16c1t | 128 | 8 | 2 | Multi-NUMA, 2 nodes, no SMT |
| numa4-8llc | 4n8l4c1t | 32 | 8 | 4 | Multi-NUMA, 4 nodes |
| numa4-12llc | 4n12l8c2t | 192 | 12 | 4 | Multi-NUMA, 4 nodes, SMT |
Topology format: {numa_nodes}n{llcs}l{cores_per_llc}c{threads_per_core}t
(e.g. 1n2l4c2t = 1 NUMA node, 2 LLCs, 4 cores per LLC, 2 threads
per core = 16 CPUs). Presets are defined in gauntlet_presets().
Multi-NUMA presets are excluded by default
(max_numa_nodes: Some(1) in TopologyConstraints::DEFAULT), so
tests opt in to NUMA testing by raising max_numa_nodes.
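The topology string can be decoded mechanically. The parser below is an illustrative sketch, not ktstr's own implementation; the Topo struct and parse_topo function are hypothetical names. It reflects what the preset table shows: the CPU count is LLCs × cores per LLC × threads per core, and the NUMA node count partitions the LLCs without multiplying the total.

```rust
/// Parsed `{numa_nodes}n{llcs}l{cores_per_llc}c{threads_per_core}t` string.
#[derive(Debug, PartialEq)]
struct Topo {
    numa_nodes: u32,
    llcs: u32,
    cores_per_llc: u32,
    threads_per_core: u32,
}

impl Topo {
    /// NUMA nodes group the LLCs; they do not multiply the CPU count.
    fn total_cpus(&self) -> u32 {
        self.llcs * self.cores_per_llc * self.threads_per_core
    }
}

/// Illustrative parser: read a number, then expect each suffix in order.
fn parse_topo(s: &str) -> Option<Topo> {
    let mut rest = s;
    let mut fields = [0u32; 4];
    for (slot, suffix) in fields.iter_mut().zip(['n', 'l', 'c', 't']) {
        let end = rest.find(suffix)?;
        *slot = rest[..end].parse().ok()?;
        rest = &rest[end + 1..];
    }
    // Reject trailing junk and zero-sized dimensions.
    if !rest.is_empty() || fields.contains(&0) {
        return None;
    }
    Some(Topo {
        numa_nodes: fields[0],
        llcs: fields[1],
        cores_per_llc: fields[2],
        threads_per_core: fields[3],
    })
}

fn main() {
    let topo = parse_topo("1n2l4c2t").expect("valid topology string");
    println!("{:?} -> {} CPUs on {} node(s)", topo, topo.total_cpus(), topo.numa_nodes);
}
```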
aarch64: ARM64 CPUs do not have SMT. Presets with threads_per_core > 1 are excluded on aarch64, leaving 14 presets (the 5 small presets, 6 -nosmt variants, and 3 non-SMT NUMA presets).
Constraint filtering
#[ktstr_test] topology constraints filter which presets a test runs
on. A preset is skipped when any constraint is not met:
- num_numa_nodes() < min_numa_nodes
- max_numa_nodes is set and num_numa_nodes() > max_numa_nodes
- num_llcs() < min_llcs
- max_llcs is set and num_llcs() > max_llcs
- requires_smt and threads_per_core < 2
- total_cpus() < min_cpus
- max_cpus is set and total_cpus() > max_cpus
See Topology Constraints for the full attribute table and Gauntlet Tests for a worked example showing which presets survive a given constraint set.
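As a rough illustration of how the skip rules combine, the sketch below applies them to three presets from the table. The Preset and Constraints structs are hypothetical stand-ins rather than ktstr's types, and the constraint values in main are chosen for the example (only max_numa_nodes: Some(1) is taken from the documented default).

```rust
/// The parts of a topology preset that constraint filtering looks at.
struct Preset {
    name: &'static str,
    numa_nodes: u32,
    llcs: u32,
    cores_per_llc: u32,
    threads_per_core: u32,
}

impl Preset {
    fn total_cpus(&self) -> u32 {
        self.llcs * self.cores_per_llc * self.threads_per_core
    }
}

/// Illustrative stand-in for the constraint attributes (not ktstr's own type).
struct Constraints {
    min_numa_nodes: u32,
    max_numa_nodes: Option<u32>,
    min_llcs: u32,
    max_llcs: Option<u32>,
    requires_smt: bool,
    min_cpus: u32,
    max_cpus: Option<u32>,
}

/// A preset is skipped as soon as any single rule is violated.
fn skipped(c: &Constraints, p: &Preset) -> bool {
    p.numa_nodes < c.min_numa_nodes
        || c.max_numa_nodes.map_or(false, |m| p.numa_nodes > m)
        || p.llcs < c.min_llcs
        || c.max_llcs.map_or(false, |m| p.llcs > m)
        || (c.requires_smt && p.threads_per_core < 2)
        || p.total_cpus() < c.min_cpus
        || c.max_cpus.map_or(false, |m| p.total_cpus() > m)
}

/// Names of the presets that survive the filter.
fn surviving(c: &Constraints, presets: &[Preset]) -> Vec<&'static str> {
    presets.iter().filter(|p| !skipped(c, p)).map(|p| p.name).collect()
}

fn main() {
    let presets = [
        Preset { name: "tiny-1llc", numa_nodes: 1, llcs: 1, cores_per_llc: 4, threads_per_core: 1 },
        Preset { name: "smt-2llc", numa_nodes: 1, llcs: 2, cores_per_llc: 2, threads_per_core: 2 },
        Preset { name: "numa2-4llc", numa_nodes: 2, llcs: 4, cores_per_llc: 4, threads_per_core: 1 },
    ];
    // Example constraint set: at least 2 LLCs, single NUMA node only.
    let c = Constraints {
        min_numa_nodes: 1,
        max_numa_nodes: Some(1),
        min_llcs: 2,
        max_llcs: None,
        requires_smt: false,
        min_cpus: 1,
        max_cpus: None,
    };
    println!("{:?}", surviving(&c, &presets));
}
```

Here tiny-1llc fails min_llcs and numa2-4llc fails max_numa_nodes, so only smt-2llc remains.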
Budget interaction
When KTSTR_BUDGET_SECS is set, greedy coverage maximization selects
the most diverse set of test configurations within the time budget.
Each candidate test is represented as a feature bitset (CPU count
bucket, LLC count, SMT vs non-SMT, etc.). The selector greedily
picks tests that cover the most uncovered feature bits per
estimated second. The result is a mix of base tests and gauntlet
variants that maximizes configuration diversity within the budget.
See Budget-based test selection.
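The greedy idea can be sketched in a few lines. This is a simplified model under stated assumptions, not ktstr's selector: the Candidate struct, the bit assignments, and the rule of stopping once no remaining candidate adds new coverage are all hypothetical, and estimated durations are assumed to be positive.

```rust
/// A candidate test: feature bitset plus estimated run time in seconds.
struct Candidate {
    name: &'static str,
    features: u64,
    est_secs: f64,
}

/// Repeatedly pick the candidate with the most newly covered feature bits
/// per estimated second that still fits in the remaining budget.
fn select(cands: &[Candidate], budget_secs: f64) -> Vec<&'static str> {
    let mut covered = 0u64;
    let mut remaining = budget_secs;
    let mut taken = vec![false; cands.len()];
    let mut picked = Vec::new();
    loop {
        let mut best: Option<(usize, f64)> = None;
        for (i, c) in cands.iter().enumerate() {
            if taken[i] || c.est_secs > remaining {
                continue;
            }
            // Marginal coverage: bits this candidate adds beyond `covered`.
            let gain = (c.features & !covered).count_ones() as f64;
            if gain == 0.0 {
                continue;
            }
            let score = gain / c.est_secs;
            if best.map_or(true, |(_, s)| score > s) {
                best = Some((i, score));
            }
        }
        let (i, _) = match best {
            Some(b) => b,
            None => break, // nothing fits or nothing adds coverage
        };
        taken[i] = true;
        covered |= cands[i].features;
        remaining -= cands[i].est_secs;
        picked.push(cands[i].name);
    }
    picked
}

fn main() {
    let cands = [
        Candidate { name: "a", features: 0b0111, est_secs: 30.0 },
        Candidate { name: "b", features: 0b0011, est_secs: 10.0 },
        Candidate { name: "c", features: 0b1000, est_secs: 20.0 },
    ];
    println!("{:?}", select(&cands, 35.0));
}
```

With a 35 s budget, b wins the first round on coverage per second, a no longer fits afterwards, and c fills the remaining time.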
Memory allocation
Each gauntlet VM gets max(topology_mb, initramfs_floor) MB of RAM,
where topology_mb = max(cpus * 64, 256, entry.memory_mb) is the
topology-requested minimum and initramfs_floor is computed from
the actual initramfs size after build. For max-cpu (252 CPUs) the
topology minimum is at least 16128 MB.
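The sizing rule is simple enough to check by hand. The function below restates it directly; the name gauntlet_memory_mb is hypothetical, and the initramfs floor is passed in as a plain number because its real value depends on the built initramfs.

```rust
/// Sketch of the sizing rule above. All values are in MB.
fn gauntlet_memory_mb(cpus: u64, entry_memory_mb: u64, initramfs_floor_mb: u64) -> u64 {
    // topology_mb = max(cpus * 64, 256, entry.memory_mb)
    let topology_mb = (cpus * 64).max(256).max(entry_memory_mb);
    // Final size = max(topology_mb, initramfs_floor)
    topology_mb.max(initramfs_floor_mb)
}

fn main() {
    // max-cpu preset: 252 CPUs, no per-test override, small initramfs.
    println!("max-cpu: {} MB", gauntlet_memory_mb(252, 0, 0));
    // tiny-1llc preset: 4 CPUs falls back to the 256 MB floor.
    println!("tiny-1llc: {} MB", gauntlet_memory_mb(4, 0, 0));
}
```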
Runs
Each cargo ktstr test --kernel ../linux invocation writes per-test result
sidecars into a run directory under
{CARGO_TARGET_DIR or "target"}/ktstr/. The directory is the
record of the latest test run for that (kernel, project commit)
pair – there is no separate “baselines” cache.
Warning: Re-running the suite at the same kernel and project commit reuses the same directory and deletes prior sidecars at the first sidecar write of the new run. To preserve a previous run’s outputs, archive the directory elsewhere first (e.g. mv target/ktstr/6.14-abc1234 target/ktstr/6.14-abc1234.archived-{date}) or commit your changes (or amend to drop a -dirty suffix) so the next run lands in a separate snapshot directory.
Layout
target/
└── ktstr/
├── 6.14-abc1234/ # one run: kernel 6.14, project commit abc1234 (clean)
│ ├── test_a.ktstr.json
│ └── test_b.ktstr.json
└── 7.0-def5678-dirty/ # another run: kernel 7.0, project commit def5678 with uncommitted changes
├── test_a.ktstr.json
└── test_b.ktstr.json
Each subdirectory is keyed {kernel}-{project_commit}, where
{kernel} is the kernel version resolved from the directory
KTSTR_KERNEL points at — first the version field in its
metadata.json, else the content of
include/config/kernel.release, else unknown (when
KTSTR_KERNEL is unset or neither file yields a version) — and
{project_commit} is the project tree’s HEAD short hex (7 chars),
suffixed -dirty when the worktree differs from HEAD, or the
literal unknown when the test process is not running inside a
git repository.
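The keying rule can be summarized as a small function. This is a sketch of the naming logic as described above, not the code ktstr runs: run_dir_name is a hypothetical name, and the kernel version and commit probe results are passed in as already-resolved values.

```rust
/// Hypothetical helper: compose the run-directory name.
/// `kernel` is None when no version could be resolved.
/// `commit` is None outside a git repository, else (short hex, is_dirty).
fn run_dir_name(kernel: Option<&str>, commit: Option<(&str, bool)>) -> String {
    let kernel = kernel.unwrap_or("unknown");
    let commit = match commit {
        Some((hex, true)) => format!("{hex}-dirty"),
        Some((hex, false)) => hex.to_string(),
        None => "unknown".to_string(),
    };
    format!("{kernel}-{commit}")
}

fn main() {
    println!("{}", run_dir_name(Some("6.14"), Some(("abc1234", false))));
    println!("{}", run_dir_name(Some("7.0"), Some(("def5678", true))));
    println!("{}", run_dir_name(Some("6.14"), None));
}
```

The first two calls reproduce the directory names shown in the layout tree.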
The commit is discovered by walking parents of the test process’s
working directory until a .git marker is found — for a scheduler
crate using ktstr as a dev-dependency, this is the scheduler
crate’s commit, not ktstr’s. The function
that performs the probe (detect_project_commit) is called from
the test process’s cwd, so running tests from inside the scheduler
crate’s clone yields that crate’s HEAD. Run from inside ktstr’s
clone if you want to record ktstr’s HEAD instead.
Two runs sharing the same kernel and project commit (the typical
“re-run the suite without committing changes” loop) reuse the
same directory: the second run pre-clears any prior
*.ktstr.json files in the directory at first sidecar write so
the directory is a last-writer-wins snapshot of (kernel, project
commit), not an append-only archive of every invocation. Re-run
the suite to regenerate the sidecars; commit your changes (or
amend to drop the -dirty suffix) to land a separate snapshot
directory.
Pre-clear is shallow — only *.ktstr.json files in the
immediate run directory are removed. Subdirectories created by
external orchestrators (per-job gauntlet layouts, cluster shards)
are left untouched, but cargo ktstr stats walks one level of
subdirectories when collecting sidecars, so stale sidecar files
left in subdirectories from a prior run will still appear in
stats output. Operators driving subdirectory layouts must clean
those subdirectories themselves; pre-clear’s contract covers the
top-level only.
Filesystem requirement
The runs root must reside on a local filesystem (ext4, xfs, btrfs, tmpfs). NFS and other remote filesystems are rejected by the advisory lock used for cross-process sidecar-write serialization.
Unknown-commit collisions
When the test process is not inside a git repository (so
detect_project_commit returns None), the on-disk dirname uses
the literal sentinel unknown in the commit slot — every such run
lands in {kernel}-unknown. Concurrent or successive non-git runs
collide on this single directory, with the latest run pre-clearing
the previous one’s sidecars. To disambiguate non-git runs, set
KTSTR_SIDECAR_DIR to a per-run path or place the project tree
under git so each run carries its own commit hash.
ktstr emits a one-shot stderr warning on first sidecar write
under this configuration; setting KTSTR_SIDECAR_DIR both
disambiguates the run and suppresses the warning (the override
branch returns from sidecar_dir before the warning site is
reached).
The unknown sentinel applies to the dirname only. The
in-memory SidecarResult.project_commit field stays None
(serialized as JSON null) for these runs — the dirname uses a
filesystem-safe sentinel, while the JSON field preserves the
original probe outcome. As a consequence, cargo ktstr stats compare --project-commit unknown will not match a sidecar
whose project_commit is None; omit the --project-commit
filter entirely to include None-commit rows in the comparison.
KTSTR_SIDECAR_DIR overrides the sidecar directory itself
(used as-is, no key suffix), not the parent. The override only
affects where new sidecars are written and what bare
cargo ktstr stats reads. When the override is set, pre-clear
is skipped — the operator chose that directory and owns its
contents, so any pre-existing sidecars there are preserved.
cargo ktstr stats list, cargo ktstr stats compare,
cargo ktstr stats list-values, and cargo ktstr stats show-host
all walk {CARGO_TARGET_DIR or "target"}/ktstr/ by default —
pass --dir DIR on compare / list-values / show-host to
point them at an alternate run root (e.g. an archived sidecar
tree copied off a CI host). They do NOT consult
KTSTR_SIDECAR_DIR.
Workflow
- Run tests for kernel A:
  cargo ktstr test --kernel 6.14
- Run again for kernel B:
  cargo ktstr test --kernel 7.0
- List runs:
  cargo ktstr stats list
  Each row carries RUN, TESTS, DATE, and ARCH columns. DATE is the earliest sidecar timestamp present in the directory; under the last-writer-wins semantics, this equals the most recent run’s first sidecar timestamp (the prior run’s sidecars were pre-cleared at the new run’s first write, so only the new run’s timestamps remain). ARCH is the host.arch value (x86_64, aarch64, …) from the run’s first sidecar, or - when no sidecar carries a populated host context. Rows are ordered by directory mtime, most recent first.
- Compare across dimensions:
  cargo ktstr stats compare --a-kernel 6.14 --b-kernel 7.0
  cargo ktstr stats compare --a-kernel 6.14 --b-kernel 7.0 -E cgroup_steady
  cargo ktstr stats compare --a-scheduler scx_rusty --b-scheduler scx_lavd --kernel 6.14
  cargo ktstr stats compare --a-project-commit abcdef1 --b-project-commit fedcba2
  cargo ktstr stats compare --a-project-commit abc1234 --b-project-commit abc1234-dirty
  cargo ktstr stats compare --a-kernel-commit abcdef1 --b-kernel-commit fedcba2
  cargo ktstr stats compare --a-run-source ci --b-run-source local
  The abc1234 vs abc1234-dirty row is the canonical WIP-vs-baseline pattern: run the suite once at a clean commit to capture the baseline directory {kernel}-abc1234, edit the tree without committing, run the suite again to capture {kernel}-abc1234-dirty, then diff the two. Both sidecar pools coexist under target/ktstr/ because the -dirty suffix makes them distinct directories.
  Per-side filters (--a-* / --b-*) partition the sidecar pool into two sides; shared filters (--kernel, --scheduler, --project-commit, --kernel-commit, --run-source, etc.) pin both sides. The seven slicing dimensions are kernel, scheduler, topology, work-type, project-commit, kernel-commit, and run-source; differing on any subset of them defines the A/B contrast. Per-metric deltas are computed using the unified MetricDef registry (polarity, absolute and relative thresholds). Output is colored: red for regressions, green for improvements. The command exits non-zero when regressions are detected. Use cargo ktstr stats list-values to discover available dimension values before constructing a comparison.
- Print analysis for the most recent run (no subcommand):
  cargo ktstr stats
  Picks the newest subdirectory under target/ktstr/ by mtime and prints gauntlet analysis, BPF verifier stats, callback profile, and KVM stats.
- Inspect the archived host context for a specific run:
  cargo ktstr stats show-host --run 6.14-abc1234
  cargo ktstr stats show-host --run archive-2024-01-15 --dir /tmp/archived-runs
  Resolves --run against target/ktstr/ (or --dir when set), scans the run’s sidecars in order, and renders the first populated host-context field via HostContext::format_human: CPU model, memory config, transparent-hugepage policy, NUMA node count, uname triple, kernel cmdline, and every /proc/sys/kernel/sched_* tunable. Same fingerprint stats compare uses for its host-delta section, but available on a single run. Fails with an actionable error when no sidecar carries a host field (pre-enrichment run).
Metric registry discovery
Before configuring per-metric ComparisonPolicy overrides, enumerate
the available metric names:
cargo ktstr stats list-metrics
cargo ktstr stats list-metrics --json
Prints the ktstr::stats::METRICS registry: metric name, polarity
(higher / lower better), default_abs and default_rel gate
thresholds, and display unit. Use the metric names from this list as
keys in ComparisonPolicy.per_metric_percent; unknown names are
rejected at --policy load time so typos surface loudly. --json
emits the same data as a serde array — the row accessor function is
omitted (#[serde(skip)]) so the wire surface carries only
wire-stable fields.
Sidecar format
Each test writes a SidecarResult JSON file containing the test name,
topology, scheduler, work type, pass/fail, per-cgroup stats, monitor
summary, stimulus events, verifier stats, KVM stats, effective sysctls,
kernel command-line args, kernel version, timestamp, and run ID. Files
are named with a .ktstr. infix for discovery. cargo ktstr stats
reads all sidecar files from a run directory (recursing one level for
gauntlet per-job subdirectories).
See also: KTSTR_SIDECAR_DIR.
ktstr
ktstr is the standalone debugging companion to the
#[ktstr_test] test harness.
It owns kernel cache management, interactive VM shells, host-wide
per-thread profiling, and lock introspection — the operations a
scheduler author reaches for when investigating a test failure.
To reproduce a test scenario as a self-contained shell script
without a VM, use cargo ktstr export.
To run the test suite, use
cargo ktstr test.
See also cargo ktstr for the cargo-integrated
companion that also covers test execution, coverage, BPF verifier
stats, and gauntlet statistics.
Build from the workspace:
cargo build --bin ktstr
Subcommands
topo
Show the host CPU topology (CPUs, LLCs, NUMA nodes):
ktstr topo
kernel
The kernel subcommand manages cached kernel images. Subcommands:
list, build, clean. See
cargo-ktstr kernel for full documentation
– the kernel subcommands are identical in both binaries.
shell
Boot an interactive shell in a KVM virtual machine. Launches a VM with busybox and drops into a shell.
ktstr shell
ktstr shell --kernel ../linux
ktstr shell --kernel 6.14.2
ktstr shell --topology 1,2,4,1
ktstr shell -i /path/to/binary
ktstr shell -i my_tool -i another_tool
Files and directories passed via -i are available at
/include-files/<name> inside the guest. Directories are walked
recursively, preserving structure (e.g. -i ./release includes all
files under release/ at /include-files/release/...). Bare names
(without path separators) are resolved via PATH lookup.
Dynamically-linked ELF binaries get automatic shared library
resolution via ELF DT_NEEDED parsing. Non-ELF files are copied as-is.
Stdin must be a terminal. The host terminal enters raw mode for bidirectional stdin/stdout forwarding. Terminal state is restored on all exit paths.
| Flag | Default | Description |
|---|---|---|
| --kernel ID | auto | Kernel identifier: a source directory path (e.g. ../linux), a version (6.14.2 or major.minor prefix 6.14), or a cache key (see ktstr kernel list). Raw image files are rejected. Source directories auto-build; versions auto-download from kernel.org on cache miss. When absent, resolves via cache then filesystem and falls back to downloading the latest stable kernel. |
| --topology N,L,C,T | 1,1,1,1 | Virtual CPU topology as numa_nodes,llcs,cores,threads. All values must be >= 1. |
| -i, --include-files PATH | – | Files or directories to include in the guest. Repeatable. Directories are walked recursively. |
| --memory-mb MB | auto | Guest memory in MB (minimum 128). When absent, estimated from payload and include file sizes. |
| --dmesg | off | Forward kernel console (COM1/dmesg) to stderr in real-time. Sets loglevel=7 for verbose kernel output. |
| --exec CMD | – | Run a command in the VM instead of an interactive shell. The VM exits after the command completes. |
| --no-perf-mode | off | Disable all performance mode features (flock, pinning, RT scheduling, hugepages, NUMA mbind, KVM exit suppression). Also settable via KTSTR_NO_PERF_MODE env var. |
| --cpu-cap N | unset | Reserve only N host CPUs for the shell VM (integer ≥ 1). Requires --no-perf-mode: perf-mode already holds every LLC exclusively, so capping under perf-mode would double-reserve. The planner walks whole LLCs in consolidation- and NUMA-aware order, partial-taking the last LLC so plan.cpus.len() == N exactly. Mutually exclusive with KTSTR_BYPASS_LLC_LOCKS=1. Also settable via KTSTR_CPU_CAP env var (CLI flag wins when both are present). |
cargo ktstr shell runs the same VM boot flow and differs in one
respect: it accepts raw image file paths for --kernel (e.g.
bzImage, Image). Source-tree directories auto-build and no-kernel
invocations auto-download — same as ktstr shell.
ctprof
Capture or compare a host-wide per-thread state snapshot. Useful for diagnosing “the scheduler looks fine but something on the host is still behaving oddly” by producing a baseline/candidate diff of every live thread’s scheduling, memory, and I/O counters — a superset of what any single test’s sidecar captures.
ktstr ctprof capture --output baseline.ctprof.zst
# ... run workload of interest ...
ktstr ctprof capture --output candidate.ctprof.zst
ktstr ctprof compare baseline.ctprof.zst candidate.ctprof.zst
capture walks /proc at capture time and writes every
visible thread’s metric values (cumulative counters from
schedstat / sched / status CSW / page faults / I/O bytes /
taskstats; lifetime peaks from schedstat *_max and
hiwater_*; instantaneous gauges sampled at capture time
including nr_threads, fair_slice_ns, state; categorical /
ordinal scalars including policy, nice, cpu_affinity,
identity strings) as zstd-compressed JSON (conventional
extension .ctprof.zst). Cumulative counters and lifetime
peaks are probe-timing-invariant — sampled twice, the value
either monotonically increased or stayed at its high-water mark
— so a diff between two snapshots measures exactly the
activity over the window. Instantaneous gauges and categorical
scalars are point-in-time readings that can legitimately differ
between two probes of the same thread. Per-cgroup aggregates
(cpu.stat, memory.current) are captured once per distinct
path. Capture is read-only; nothing is attached, no kprobes, no
tracing.
| Flag | Default | Description |
|---|---|---|
| -o, --output PATH | required | Destination path (convention: .ctprof.zst). Existing files are overwritten. |
compare joins two snapshots on the selected grouping
axis (pcomm by default) and renders a per-metric
baseline/candidate/delta table. The join key survives across
captures taken on different hosts or after process restarts, so
deltas reflect the behavior of the named workload rather than a
specific pid. Metrics with cumulative semantics (CPU time, page
faults, wait time) show the candidate-minus-baseline delta;
instantaneous metrics (affinity, cgroup path) show the value at
candidate capture time. See the
ctprof reference for the full metric
registry, aggregation rules, derived-metric formulas, and
taskstats kconfig gating.
| Arg / Flag | Default | Description |
|---|---|---|
BASELINE | required | Path to the baseline .ctprof.zst snapshot. |
CANDIDATE | required | Path to the candidate .ctprof.zst snapshot. |
--group-by AXIS | pcomm | Grouping axis: pcomm (process name), cgroup (cgroup v2 path), comm (thread-name pattern, token-normalized), or comm-exact (synonym for comm --no-thread-normalize). |
--cgroup-flatten GLOB | – | Glob pattern that collapses dynamic cgroup path segments before grouping (e.g. '/kubepods/*/workload'). Repeatable; explicit globs apply before auto-normalize. |
--no-thread-normalize | off | Disable token-based pattern normalization for --group-by comm. Threads group by literal comm. |
--no-cg-normalize | off | Disable token-based normalization for --group-by cgroup. Cgroup paths group by literal post-flatten path. |
--sort-by SPEC | by largest \|delta_pct\| | Multi-key sort spec: metric1[:dir1],metric2[:dir2],.... Each metric is a name from ctprof metric-list; each dir is asc or desc (default desc). |
--display-format FORMAT | full | Per-row column layout. full (default 7 columns), delta-only (drop baseline+candidate), no-pct, arrow (collapse baseline/candidate/delta into one cell), or pct-only. |
--columns NAMES | – | Comma-separated column list overriding --display-format. Valid names: group, threads, metric, baseline, candidate, delta, %, arrow. Order is the rendered order. |
--sections NAMES | every | Comma-separated sub-table list. Valid names: primary, taskstats-delay, derived, cgroup-stats, cgroup-limits, memory-stat, memory-events, pressure, host-pressure, smaps-rollup, sched-ext. Empty renders every section that has data. |
--metrics NAMES | every | Comma-separated metric-name allowlist for primary + derived rows. Names must come from ctprof metric-list. Composes multiplicatively with --sections. |
--wrap | off | Wrap table cells to fit terminal width. Only fires when stdout is a TTY; piped output stays unwrapped so awk/grep pipelines see the same byte sequence. |
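The join-and-delta rule described for compare can be illustrated with a minimal sketch. The types and function names here (MetricKind, Row, compare) are hypothetical and are not ktstr's API; the sketch also assumes groups missing from either snapshot are simply skipped, which this section does not specify.

```rust
use std::collections::BTreeMap;

/// How a metric combines across two snapshots (hypothetical enum for this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum MetricKind {
    /// Monotonic counter: report candidate minus baseline.
    Cumulative,
    /// Point-in-time reading: report the candidate value as-is.
    Instantaneous,
}

/// One rendered row for a single metric.
#[derive(Debug, PartialEq)]
pub struct Row {
    pub group: String,
    pub baseline: u64,
    pub candidate: u64,
    pub reported: i128,
}

/// Join two per-group maps on the group key (e.g. pcomm) and compute the
/// reported value. ASSUMPTION: groups present in only one snapshot are skipped.
pub fn compare(
    baseline: &BTreeMap<String, u64>,
    candidate: &BTreeMap<String, u64>,
    kind: MetricKind,
) -> Vec<Row> {
    baseline
        .iter()
        .filter_map(|(group, &b)| {
            let &c = candidate.get(group)?;
            let reported = match kind {
                MetricKind::Cumulative => c as i128 - b as i128,
                MetricKind::Instantaneous => c as i128,
            };
            Some(Row { group: group.clone(), baseline: b, candidate: c, reported })
        })
        .collect()
}
```

Because the join key is a name rather than a pid, the same group matches across hosts and process restarts, which is the property the paragraph above relies on.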
show renders a single snapshot’s per-(group, metric)
values without diff math. Same flag vocabulary as compare,
minus the baseline/candidate/delta/pct columns:
ktstr ctprof show snapshot.ctprof.zst --group-by cgroup
ktstr ctprof show snapshot.ctprof.zst --sections taskstats-delay
| Arg / Flag | Default | Description |
|---|---|---|
SNAPSHOT | required | Path to the .ctprof.zst snapshot. |
--group-by AXIS | pcomm | Same as compare. |
--cgroup-flatten GLOB | – | Same as compare. Repeatable. |
--no-thread-normalize | off | Same as compare. |
--no-cg-normalize | off | Same as compare. |
--sort-by SPEC | alphabetical | Sort spec; ranks groups by absolute aggregated value (no delta — single snapshot). |
--columns NAMES | – | Comma-separated column list. Show-only valid names: group, threads, metric, value. The compare-only column names are rejected at parse time. |
--sections NAMES | every | Same as compare. |
--metrics NAMES | every | Same as compare. |
--wrap | off | Same as compare. |
metric-list prints every registered metric (primary +
derived) with its description, unit, kconfig gate, and
sched_class scope. Use this to discover the vocabulary
--sort-by and --metrics accept.
ktstr ctprof metric-list
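To make the --sort-by grammar (metric1[:dir1],metric2[:dir2],... with desc as the default direction) concrete, here is a minimal parsing sketch. The function and enum names are hypothetical, and metric names are not checked against the registry that metric-list prints.

```rust
/// Sort direction for one key; `Desc` is the documented default.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Dir {
    Asc,
    Desc,
}

/// Parse `metric1[:dir1],metric2[:dir2],...` into (metric, direction) pairs.
/// Metric names are NOT validated here; ktstr checks them against its
/// registry, which this sketch does not model.
pub fn parse_sort_spec(spec: &str) -> Result<Vec<(String, Dir)>, String> {
    spec.split(',')
        .map(|key| {
            let (name, dir) = match key.split_once(':') {
                Some((name, "asc")) => (name, Dir::Asc),
                Some((name, "desc")) => (name, Dir::Desc),
                Some((_, other)) => return Err(format!("unknown direction: {other}")),
                None => (key, Dir::Desc), // direction omitted: default to desc
            };
            if name.is_empty() {
                return Err("empty metric name".to_string());
            }
            Ok((name.to_string(), dir))
        })
        .collect()
}
```

The returned vector preserves key order, so the first entry is the primary sort key and later entries break ties.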
locks
Enumerate every ktstr flock held on this host. The command is
read-only: it does not attempt to acquire any flock. It is useful
as a troubleshooting companion for --cpu-cap contention: when a
build or test is stalled behind a peer’s reservation, ktstr locks
names the peer (PID + cmdline) without disturbing any of its flocks.
Scans four lock-file roots:
- /tmp/ktstr-llc-*.lock – per-LLC reservations held by perf-mode test runs and --cpu-cap-bounded builds.
- /tmp/ktstr-cpu-*.lock – per-CPU reservations from the same flow.
- {cache_root}/.locks/*.lock – cache-entry locks held during kernel build writes, and source-{path_hash}.lock files held for the duration of kernel build --source and cargo ktstr test --kernel <path> against the same source tree.
- {runs_root}/.locks/{kernel}-{project_commit}.lock – per-run-key sidecar-write locks held for the duration of the (pre-clear + write) cycle to serialize concurrent ktstr processes targeting the same run directory.
Each lock is cross-referenced against /proc/locks to name the
holder PID and cmdline.
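One way such a cross-reference could work is sketched below. It assumes the usual /proc/locks row layout (ordinal, lock type, ADVISORY/MANDATORY, READ/WRITE, pid, MAJOR:MINOR:INODE, start, end) and that the caller already knows the lock file's inode, for example from a stat call. The function name is hypothetical and this is not ktstr's implementation.

```rust
/// Return PIDs that hold (not merely wait on) a FLOCK entry for `inode`.
/// Input is the text of /proc/locks; the row layout is assumed as described
/// in the lead-in.
pub fn flock_holders(proc_locks: &str, inode: u64) -> Vec<u32> {
    proc_locks
        .lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            fields.next()?; // ordinal, e.g. "1:"
            let kind = fields.next()?;
            if kind != "FLOCK" {
                return None; // skips POSIX/OFDLCK rows and "->" waiter rows
            }
            fields.next()?; // ADVISORY / MANDATORY
            fields.next()?; // READ / WRITE
            let pid: u32 = fields.next()?.parse().ok()?;
            // Device and inode are encoded as MAJOR:MINOR:INODE; take the last part.
            let ino: u64 = fields.next()?.rsplit(':').next()?.parse().ok()?;
            (ino == inode).then_some(pid)
        })
        .collect()
}
```

With the PID in hand, the command line can be read from /proc/{pid}/cmdline. Like the parse above, that is a plain read and takes no lock.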
ktstr locks # one-shot snapshot
ktstr locks --json # JSON snapshot
ktstr locks --watch 1s # redraw every second until SIGINT
ktstr locks --watch 1s --json # ndjson stream, one object per interval
| Flag | Default | Description |
|---|---|---|
--json | off | Emit the snapshot as JSON. Pretty-printed in one-shot mode; compact (one object per line, ndjson-style) under --watch. Stable field names — schema documented on ktstr::cli::list_locks. |
--watch DURATION | unset | Redraw the snapshot at the given interval until SIGINT. Value is parsed by humantime: 100ms, 1s, 5m, 1h. Human output clears and redraws in place; --json emits one line-terminated object per interval. |
The same subcommand is available as cargo ktstr locks with
identical flag semantics.
completions
Generate shell completions for ktstr.
ktstr completions bash
ktstr completions zsh
ktstr completions fish
| Arg / Flag | Default | Description |
|---|---|---|
SHELL | required | Shell to generate completions for (bash, zsh, fish, elvish, powershell). |
--binary NAME | ktstr | Binary name to register the completion under. Override when invoking ktstr through a symlink with a different name (the shell looks up completions by argv[0]). |
The same subcommand is available as cargo ktstr completions
with identical flag semantics (--binary accepted on both;
defaults to the respective binary name).
cargo-ktstr
cargo ktstr is a cargo plugin for kernel build, cache, and test
workflows. Subcommands in --help order: test (alias: nextest),
coverage, llvm-cov, stats, kernel, model, verifier,
funify (alias: costume), completions, show-host,
show-thresholds, export, locks, shell.
test
Build the kernel (if needed) and run tests via cargo nextest run.
Also available as cargo ktstr nextest — a visible clap alias that
expands to the same subcommand, so the two forms are interchangeable.
cargo ktstr test # auto-discover kernel
cargo ktstr test --kernel ../linux # local source tree
cargo ktstr test --kernel 6.14.2 # version (auto-downloads on miss)
cargo ktstr test --kernel 6.14.2-tarball-x86_64-kc... # cache key (from kernel list)
cargo ktstr test --kernel 6.12..6.14 # range: every stable+longterm release in [6.12, 6.14]
cargo ktstr test --kernel git+https://example.com/r.git#v6.14 # git URL + ref (tag/branch)
cargo ktstr test --kernel git+https://example.com/r.git#deadbeef1234 # specific commit
cargo ktstr test --kernel 6.14.2 --kernel 6.15.0 # multi-kernel: repeatable
cargo ktstr test --release # release profile (stricter assertions)
--kernel is repeatable and accepts a path, version string,
cache key, version range (START..END), or git source
(git+URL#REF). When absent, the test framework discovers a kernel
from KTSTR_TEST_KERNEL, then KTSTR_KERNEL, then falls back to
cache and filesystem lookup. When --kernel is a path,
cargo-ktstr configures and builds the kernel before running tests.
Version strings auto-download and build on cache miss (both
explicit patch versions like 6.14.2 and major.minor prefixes like
6.14). Cache keys resolve from the cache only — they error if not
cached (run cargo ktstr kernel list to see available keys).
Ranges (START..END) expand against kernel.org’s releases.json
to every stable and longterm release whose version sits inside
[START, END] inclusive (mainline / linux-next rows are dropped).
The endpoints themselves do NOT need to appear in releases.json —
6.10..6.16 brackets the surviving releases even if 6.10 and
6.16 have aged out.
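The inclusive filter can be sketched as follows. Two points are assumptions rather than documented behavior: a START without a patch component is treated as patch 0, and an END without one is treated as covering its whole series. The sketch ignores -rcN suffixes, and the function names are hypothetical.

```rust
/// Parse `MAJOR.MINOR[.PATCH]` into a comparable tuple.
/// ASSUMPTION: `default_patch` stands in for a missing patch component;
/// ktstr's real endpoint handling (and -rcN suffixes) is not modeled here.
pub fn parse_version(s: &str, default_patch: u32) -> Option<(u32, u32, u32)> {
    let mut parts = s.split('.');
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    let patch = match parts.next() {
        Some(p) => p.parse().ok()?,
        None => default_patch,
    };
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some((major, minor, patch))
}

/// Keep every release inside [START, END]. The endpoints themselves do not
/// need to be present in `releases`.
pub fn expand_range<'a>(range: &str, releases: &[&'a str]) -> Option<Vec<&'a str>> {
    let (start, end) = range.split_once("..")?;
    let start = parse_version(start, 0)?;
    let end = parse_version(end, u32::MAX)?;
    Some(
        releases
            .iter()
            .copied()
            .filter(|r| parse_version(r, 0).map_or(false, |v| start <= v && v <= end))
            .collect(),
    )
}
```

In this sketch the range 6.10..6.14 over a release list that contains neither a 6.10.x nor a 6.11.x entry still yields the 6.12.x, 6.13.x, and 6.14.x entries, matching the "endpoints need not appear" rule.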
Git sources (git+URL#REF) clone the repo shallow at the given
ref, build, and cache the result. A repeat invocation against an
unchanged branch tip lands a cache hit; a moved tip rebuilds.
Multi-kernel: kernel as a gauntlet dimension
When --kernel resolves to two or more kernels (multiple
--kernel flags, or a single --kernel START..END range that
expands to several releases), cargo-ktstr resolves all kernels
upfront and exports the resolved set to cargo nextest via the
KTSTR_KERNEL_LIST env var. The test binary’s gauntlet expansion
adds the kernel as an additional dimension to the gauntlet
cross-product, so each (test × scenario × topology × kernel)
tuple becomes a distinct nextest test case. Two name shapes carry
the kernel suffix:
- Base tests: ktstr/{name}/{kernel_label} – one variant per registered #[ktstr_test] per kernel.
- Gauntlet variants: gauntlet/{name}/{preset}/{kernel_label} – one variant per (test × topology preset × kernel).
Single-kernel runs (zero or one resolved kernel) keep the
historical name shapes ktstr/{name} and
gauntlet/{name}/{preset} with no kernel suffix, so
existing CI baselines and per-test config overrides keep matching.
Kernel labels are semantic, operator-readable identifiers
sanitized to kernel_[a-z0-9_]+:
- Version / range expansion → kernel_6_14_2, kernel_6_15_rc3
- Cache key → version prefix only (kernel_6_14_2 from 6.14.2-tarball-x86_64-kc<hash>)
- Git source → kernel_git_{owner}_{repo}_{ref} (e.g. kernel_git_tj_sched_ext_for_next from git+https://github.com/tj/sched_ext#for-next)
- Path → kernel_path_{basename}_{hash6} (e.g. kernel_path_linux_a3f2b1); the 6-char crc32 of the canonical path disambiguates two linux directories under different parents. Dirty-tree builds (uncommitted source changes, mid-build worktree mutations, or non-git trees) append _dirty to the label, e.g. kernel_path_linux_a3f2b1_dirty, so the test report distinguishes the non-reproducible run from a subsequent clean rebuild of the same path.
- Local cache entry → kernel_local_{hash6} (first 6 chars of the source tree’s git short_hash, captured at cache-store time) or kernel_local_unknown for non-git trees. The hash6 keeps two distinct local trees from collapsing to the same label; the unknown literal is the shared bucket for every non-git tree (no discriminator exists at the cache layer to spread them apart).
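For the version case, the sanitization to kernel_[a-z0-9_]+ can be approximated as below. This is a sketch under the assumption that every non-alphanumeric character maps to a single underscore; the helper name is hypothetical, and the git, path, and local-cache shapes are not covered.

```rust
/// Map a version string to a `kernel_[a-z0-9_]+` label.
/// ASSUMPTION: each non-alphanumeric character becomes one '_'.
pub fn version_label(version: &str) -> String {
    let mut label = String::from("kernel_");
    for ch in version.chars() {
        if ch.is_ascii_alphanumeric() {
            label.push(ch.to_ascii_lowercase());
        } else {
            label.push('_');
        }
    }
    label
}
```

Applied to 6.14.2 and 6.15-rc3, this yields kernel_6_14_2 and kernel_6_15_rc3, the two labels given in the list above.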
Filter with nextest’s -E 'test(kernel_6_14)' to pick a single
kernel from a multi-kernel matrix; nextest’s parallelism, retries,
and --ignored flag all apply natively. Sidecars partition per
kernel: each kernel runs in its own
target/ktstr/{kernel}-{project_commit}/ directory keyed on the
resolved kernel’s identity and the project tree’s HEAD short hex
(with -dirty suffix when the worktree differs). Coverage profraw does NOT partition
per kernel — __llvm_profile_write_buffer writes flat into
target/llvm-cov-target/ with PID-keyed filenames
(ktstr-test-{pid}-{counter}.profraw), and cargo-llvm-cov merges
every variant’s profraw automatically into the single output
report.
Build / download / clone failures abort BEFORE any test runs — a missing kernel can’t be tested, and continuing would mask which kernel was requested-but-unavailable in the operator-visible error stream. Test failures within a kernel are nextest-handled normally.
host_only tests under multi-kernel: tests marked
host_only (those that run on the host without booting a VM)
skip the kernel suffix and list / run once regardless of
KTSTR_KERNEL_LIST cardinality. The dispatch sites
(list_tests, list_tests_budget, and --exact’s
run_host_only_test in src/test_support/dispatch.rs) all gate
on entry.host_only before consulting the resolved kernel set.
A host-side test never observes the kernel directory, so
multiplying it across kernels would only run N copies of
identical work and add no signal.
| Flag | Default | Description |
|---|---|---|
--kernel ID (repeatable) | auto | Kernel identifier: path, version, cache key, range (START..END), or git source (git+URL#REF). Repeatable; a multi-kernel set fans the gauntlet across kernels. |
--no-perf-mode | off | Disable all performance mode features (flock, pinning, RT scheduling, hugepages, NUMA mbind, KVM exit suppression). Also settable via KTSTR_NO_PERF_MODE env var. |
--no-skip-mode | off | Convert resource-contention and host-topology-insufficient skips into hard test failures (exit 1 instead of 0). Default behavior skips so a contended runner does not fail tests that simply could not start; setting this flag opts into “if the test cannot run, the test fails”. Exports KTSTR_NO_SKIP_MODE=1 for the test binary. |
--release | off | Build and run tests with the release profile (--cargo-profile release to nextest). Release mode applies stricter assertion thresholds (gap_threshold_ms 2000 vs debug’s 3000, spread_threshold_pct 15% vs debug’s 35%) — tests that barely pass in debug may fail under --release. catch_unwind-based tests and tests gated on #[cfg(debug_assertions)] are skipped. |
What it does (path mode only)
These steps run only when --kernel is a source directory path.
Cached version and cache-key identifiers skip straight to test
execution (step 6); uncached version identifiers run through
download + configure + build + cache-store first. Ranges fan out
to per-version resolution (every release downloads + builds +
caches independently if not already present); git sources clone
shallow at the ref, build, and cache. Multi-kernel resolution
finishes for every requested kernel BEFORE step 6 — the
cargo-nextest invocation in step 6 sees the complete kernel set
as a single KTSTR_KERNEL_LIST export, so nextest fans the
gauntlet across kernels in a single run.
For path mode, the source tree is gix-discovered and classified as either clean (HEAD reachable, index matches HEAD, worktree matches index) or dirty / non-git (any tracked-file diff, or the directory is not a git repo at all). The cache key takes one of three shapes:
- local-{hash7}-{arch}-kc{suffix} – clean git tree, no user .config file in the source tree yet (build will run make defconfig). {hash7} is the source tree’s HEAD short hash; {suffix} distinguishes ktstr framework kconfig fragments.
- local-{hash7}-{arch}-cfg{user_config}-kc{suffix} – clean git tree with a user .config whose CRC32 hash discriminates distinct configurations against the same commit, so iterative .config edits at a fixed commit populate distinct cache entries instead of colliding.
- local-unknown-{path_hash}-{arch}-kc{suffix} – dirty / non-git tree (HEAD does not describe the source). {path_hash} is the full 8-char (32-bit) CRC32 of the canonical source path so two parallel cargo ktstr test --kernel ./linux-a and --kernel ./linux-b runs do not collide on the same local-unknown-... slot.
Dirty / non-git trees never cache — the build pipeline runs in
the source directory, the kernel label gets a _dirty suffix,
and a subsequent run of the same path that goes clean produces a
distinct cache entry under the clean shape.
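The three key shapes can be summarized in a small sketch. The SourceTree struct and cache_key function are hypothetical names for illustration; the hash values are taken as precomputed strings, since the hashing itself is outside this sketch.

```rust
/// Inputs that select the cache-key shape (hypothetical struct for this sketch).
pub struct SourceTree<'a> {
    /// HEAD short hash for a clean git tree; `None` for dirty / non-git trees.
    pub head_hash7: Option<&'a str>,
    /// CRC32 (hex) of a user `.config`, when one exists in the tree.
    pub user_config_crc: Option<&'a str>,
    /// CRC32 (hex, 8 chars) of the canonical source path.
    pub path_hash: &'a str,
}

/// Render the key in one of the three documented shapes.
pub fn cache_key(tree: &SourceTree, arch: &str, kc_suffix: &str) -> String {
    match (tree.head_hash7, tree.user_config_crc) {
        // Clean tree, no user .config yet.
        (Some(hash7), None) => format!("local-{hash7}-{arch}-kc{kc_suffix}"),
        // Clean tree with a user .config: add the cfg segment.
        (Some(hash7), Some(cfg)) => format!("local-{hash7}-{arch}-cfg{cfg}-kc{kc_suffix}"),
        // Dirty / non-git: HEAD does not describe the source, key on the path.
        (None, _) => format!("local-unknown-{}-{arch}-kc{kc_suffix}", tree.path_hash),
    }
}
```

Keying the dirty shape on the path hash rather than HEAD is what keeps two parallel dirty trees from sharing a slot.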
1. Source-tree validation – verifies <kernel>/Makefile and <kernel>/Kconfig both exist. If either is missing, bails with not a kernel source tree.
2. Cache lookup (clean trees only) – looks up the local-{hash7}-{arch}[-cfg{user_config}]-kc{suffix} key (the cfg segment is present iff a user .config exists in the source tree). A cache hit short-circuits to step 6: cargo-ktstr exports the cache entry directory via KTSTR_KERNEL and emits a cargo ktstr: cache hit for {input_path} ({cache_key}, built {age} ago) line on stderr (the , built {age} ago suffix is omitted when the timestamp is unparseable or future-dated). A cache miss continues to step 3.
3. Auto-configure – if <kernel>/.config lacks the CONFIG_SCHED_CLASS_EXT=y sentinel, runs make defconfig (when no .config exists), appends ktstr.kconfig to .config, then runs make olddefconfig.
4. Kernel build – runs make -j$(nproc) KCFLAGS=-Wno-error, then runs validate_kernel_config to verify that critical config options (CONFIG_SCHED_CLASS_EXT, CONFIG_DEBUG_INFO_BTF, CONFIG_BPF_SYSCALL, CONFIG_FTRACE, CONFIG_KPROBE_EVENTS, CONFIG_BPF_EVENTS) survived the build. The kernel build system silently disables options whose dependencies are not met, and the validator surfaces those failures with a per-option remediation hint. make handles the no-op case when the kernel is already built. For dirty / non-git trees this is the unconditional path; for clean trees, it is only reached on cache miss.
5. compile_commands.json + cache store – runs make compile_commands.json (skipped only for transient temp directories like extracted tarballs) so LSP / clangd work against the local tree. Then for clean trees, the kernel image + stripped vmlinux are persisted under the resolved local-{hash7}-{arch}[-cfg{user_config}]-kc{suffix} key with metadata.json recording the source tree path. A post-build re-check of the dirty state catches mid-build mutations (worktree edits or commits that happened during make) and skips the cache store on either signal so a racing-write build cannot land under a stale identity. Dirty / non-git trees skip the cache store unconditionally (no stable HEAD identity for the cache key) but still get compile_commands.json.
6. Test execution – execs cargo nextest run once with KTSTR_KERNEL set in the environment (single-kernel) or with both KTSTR_KERNEL and KTSTR_KERNEL_LIST (multi-kernel; the latter encodes the resolved kernel set as label1=path1;label2=path2;…). For clean Path-spec resolution KTSTR_KERNEL points at the cache entry directory; for dirty or non-git trees it points at the source tree directly. The test binary’s gauntlet expansion adds the kernel as a fifth dimension when the list carries 2+ entries; nextest’s parallelism, retries, and -E filtering apply natively to every (test × kernel) variant.
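The KTSTR_KERNEL_LIST encoding (label=path pairs separated by semicolons) can be decoded with a few lines. This is a sketch only: the function name is hypothetical, and tolerating a trailing semicolon is an assumption rather than documented behavior.

```rust
use std::path::PathBuf;

/// Parse `label1=path1;label2=path2;...` into (label, path) pairs.
/// ASSUMPTION: empty segments (e.g. from a trailing ';') are ignored.
pub fn parse_kernel_list(raw: &str) -> Result<Vec<(String, PathBuf)>, String> {
    raw.split(';')
        .filter(|entry| !entry.is_empty())
        .map(|entry| {
            entry
                .split_once('=')
                .map(|(label, path)| (label.to_string(), PathBuf::from(path)))
                .ok_or_else(|| format!("missing '=' in entry: {entry}"))
        })
        .collect()
}
```

A consumer could read the variable with std::env::var("KTSTR_KERNEL_LIST") and fan out only when the parsed list has two or more entries, mirroring the single-kernel versus multi-kernel split in step 6.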
Implicit vs explicit kernel discovery diverge:
- cargo ktstr test --kernel ../linux (explicit Path spec) routes through the cache pipeline above: the source tree is gix-classified, the local-{hash7}-{arch}[-cfg{user_config}]-kc{suffix} cache key is computed, the kernel is built (or short-circuited on cache hit), and the cache entry directory is exported via KTSTR_KERNEL.
- cargo ktstr test (no --kernel flag) does NOT run the build pipeline or produce a new cache entry. The test binary’s find_kernel chain reads existing cache entries (most-recent-valid first; entries built with a different kconfig fragment are skipped) and falls back to local build trees (./linux, ../linux) and host paths. Whatever pre-built image it finds is returned as-is: no cache key is computed for source trees discovered on the filesystem, no make is invoked, and the result does not land in the kernel cache for a future cache_key-keyed lookup. The KTSTR_KERNEL env var with a path value follows this same direct-image flow; the cache write path is reached only via the cargo ktstr --kernel argument (or via cargo ktstr kernel build --source ../linux as an explicit cache-populate step). Pass --kernel ../linux to opt into the cache pipeline so a clean tree’s build is stored once and reused on subsequent runs.
Passing nextest arguments
Arguments after test are passed through to cargo nextest run:
cargo ktstr test -- -E 'test(my_test)' # nextest filter
cargo ktstr test -- --workspace # all workspace tests
cargo ktstr test -- --retries 2 # nextest retries
coverage
Build the kernel (if needed) and run tests with coverage via
cargo llvm-cov nextest. Same kernel resolution and multi-kernel
semantics as test: --kernel is repeatable; multi-kernel runs
add the kernel suffix to every test name and partition the
sidecar tree per kernel via
target/ktstr/{kernel}-{project_commit}/, where {project_commit}
is the project HEAD short hex (with -dirty when the worktree
differs). Coverage profraw lands flat in
target/llvm-cov-target/ with PID-keyed filenames — it does
NOT partition per kernel — and cargo-llvm-cov merges every
variant’s profraw automatically into the single output report.
cargo ktstr coverage # auto-discover kernel
cargo ktstr coverage --kernel ../linux # local source tree
cargo ktstr coverage --kernel 6.14.2 # version (auto-downloads on miss)
cargo ktstr coverage --kernel 6.14.2 --kernel 6.15.0 # multi-kernel coverage matrix
cargo ktstr coverage --release # release profile (stricter assertions)
cargo ktstr coverage -- --workspace --lcov --output-path lcov.info # lcov output
| Flag | Default | Description |
|---|---|---|
--kernel ID (repeatable) | auto | Same shapes and multi-kernel semantics as cargo ktstr test --kernel: each (test × kernel) variant runs as its own nextest subprocess so cargo-llvm-cov merges every variant’s profraw automatically. |
--no-perf-mode | off | Disable all performance mode features (flock, pinning, RT scheduling, hugepages, NUMA mbind, KVM exit suppression). Also settable via KTSTR_NO_PERF_MODE env var. |
--no-skip-mode | off | Convert resource-contention and host-topology-insufficient skips into hard test failures. Same semantics as on test; exports KTSTR_NO_SKIP_MODE=1 for the test binary. |
--release | off | Collect coverage with the release profile (--cargo-profile release to llvm-cov nextest). Same stricter-threshold caveats as test --release — release mode applies gap_threshold_ms=2000 / spread_threshold_pct=15%, and skips catch_unwind-based tests along with #[cfg(debug_assertions)]-gated tests. |
Requires cargo-llvm-cov and the llvm-tools-preview rustup
component:
cargo install cargo-llvm-cov
rustup component add llvm-tools-preview
Passing arguments
Arguments after coverage are passed through to
cargo llvm-cov nextest:
cargo ktstr coverage -- --workspace --profile ci --lcov --output-path lcov.info
cargo ktstr coverage -- --features integration
profraw layout
Three populations of *.profraw files arise from cargo ktstr
runs. They land in different directories and are not all
collected by the same workflow:
| Filename shape | Directory | Producer | Collected by |
|---|---|---|---|
| default-{pid}-{binary_hash}.profraw | parent of cargo-ktstr binary, joined with llvm-cov-target/ (e.g. target/{profile}/llvm-cov-target/ for cargo run --bin cargo-ktstr, or ~/.cargo/bin/llvm-cov-target/ for an installed binary) | host-side cargo ktstr test (via LLVM_PROFILE_FILE injection) | not auto-collected; needs an explicit cargo llvm-cov report invocation |
| cargo-llvm-cov-managed (shape set by the outer harness) | target/llvm-cov-target/ (workspace target dir, NOT under {profile}) | host-side cargo ktstr coverage (cargo-llvm-cov sets its own LLVM_PROFILE_FILE) | merged into the cargo ktstr coverage report automatically |
| ktstr-test-{pid}-{counter}.profraw | parent of the test binary’s LLVM_PROFILE_FILE env var, falling back to <test-binary parent>/llvm-cov-target/ (typically target/{profile}/deps/llvm-cov-target/ when no env override is in play); under cargo ktstr test, inherits the host-side injected dir, so co-locates with default-{pid}-{binary_hash}.profraw | guest-side __llvm_profile_write_buffer flushed via the SHM ring at VM exit | merged into the cargo ktstr coverage report automatically |
cargo ktstr test injects LLVM_PROFILE_FILE (added to prevent
default.profraw leaking into a kernel source tree when the
shell cwd was the kernel dir; see
Stale vmlinux.btf or default.profraw).
The resulting host-side default-{pid}-{binary_hash}.profraw
files do NOT land in the target/llvm-cov-target/ directory
that cargo ktstr coverage (cargo-llvm-cov) reads; they are NOT
picked up by a later cargo ktstr coverage run unless you
explicitly include them in a cargo llvm-cov report
invocation pointed at the cargo-ktstr binary’s llvm-cov-target/
directory.
To clean accumulated profraw between runs:
# Remove ONLY *.profraw under target/llvm-cov-target/ (top-level glob, non-recursive):
cargo ktstr llvm-cov clean --profraw-only
# Drop host-side test-path profraw next to the cargo-ktstr binary.
# Run only the line(s) matching how cargo-ktstr was launched —
# the brace-list form is bash-only, so each path is its own command
# for portable POSIX shells (sh / dash):
rm -f target/debug/llvm-cov-target/default-*.profraw
rm -f target/release/llvm-cov-target/default-*.profraw
# If ktstr was installed via `cargo install`:
rm -f ~/.cargo/bin/llvm-cov-target/default-*.profraw
--profraw-only is the safe default: it removes only *.profraw
files at the top level of target/llvm-cov-target/ (the
cargo-llvm-cov-managed dir) and leaves coverage reports, profdata, and
build artifacts intact. It does NOT touch the default-*.profraw
files next to the cargo-ktstr binary (under
target/{profile}/llvm-cov-target/ for cargo run / cargo build,
or ~/.cargo/bin/llvm-cov-target/ for cargo install-deployed
binaries) produced by the host-side injection — remove those with
the explicit rm -f lines above for whichever launch mode you use.
Avoid cargo ktstr llvm-cov clean without arguments (recursively
wipes all of target/llvm-cov-target/, including reports) and
--workspace (additionally runs cargo clean on workspace
packages, removing build artifacts); both are destructive beyond
profraw.
To opt out of the host-side LLVM_PROFILE_FILE injection
entirely, export LLVM_PROFILE_FILE yourself before running
cargo ktstr test — the injector only fires when the env is
absent, so an explicit operator setting takes precedence.
llvm-cov
Raw passthrough to cargo llvm-cov with arbitrary arguments. Use
this for llvm-cov subcommands that don’t fit the coverage
flow — report, clean, show-env, etc. When you want
cargo llvm-cov nextest, prefer cargo ktstr coverage;
this subcommand carries the same kernel-resolution and
--no-perf-mode plumbing but hands every remaining argument to
cargo llvm-cov unchanged.
cargo ktstr llvm-cov report --lcov --output-path lcov.info # generate report from prior run
cargo ktstr llvm-cov clean --workspace # wipe accumulated coverage data
cargo ktstr llvm-cov show-env # print env cargo-llvm-cov would set
cargo ktstr llvm-cov --kernel ../linux report # pin kernel + passthrough
| Flag | Default | Description |
|---|---|---|
--kernel ID (repeatable) | auto | Kernel identifier: path, version, cache key, range (START..END), or git source (git+URL#REF). Same multi-kernel semantics as cargo ktstr test --kernel. |
--no-perf-mode | off | Disable all performance mode features (flock, pinning, RT scheduling, hugepages, NUMA mbind, KVM exit suppression). Also settable via KTSTR_NO_PERF_MODE env var. |
--no-skip-mode | off | Convert resource-contention and host-topology-insufficient skips into hard test failures. Same semantics as on test; exports KTSTR_NO_SKIP_MODE=1 for the test binary. |
Note: a bare cargo ktstr llvm-cov (no trailing subcommand)
dispatches to cargo llvm-cov, which runs cargo test. ktstr
tests rely on the nextest harness for gauntlet expansion
(topology-preset variants), verifier cell emission, and VM
dispatch, so under bare cargo test only the #[test] stubs run
and gauntlet variants + verifier cells are silently skipped.
Always pass a subcommand after llvm-cov (most often nextest,
for which cargo ktstr coverage is the shorter route).
kernel
Manage cached kernel images. Three subcommands: list, build,
clean. The standalone ktstr kernel subcommands are identical.
kernel list
List cached kernel images, sorted newest first. With --range,
switches to PREVIEW MODE: prints the versions a START..END range
expands to without performing any download or build.
cargo ktstr kernel list
cargo ktstr kernel list --json # JSON output for CI scripting
cargo ktstr kernel list --range 6.12..6.14 # preview range expansion
cargo ktstr kernel list --range 6.12..6.14 --json # preview as JSON
Default mode walks the local cache. Human-readable output shows
key, version, source type, arch, and build timestamp. Entries built
with a different ktstr.kconfig are marked (stale kconfig).
Entries whose major.minor version is no longer in kernel.org’s
active releases list are marked (EOL); prefix lookups for EOL
series fall back to probing cdn.kernel.org for the latest patch
release.
--range mode performs no cache reads: it fetches kernel.org’s
releases.json once, expands the inclusive range against the
stable and longterm releases (mainline / linux-next dropped),
and prints one version per line on stdout. Use this to answer
“what does --kernel 6.12..6.16 actually cover?” before paying
the build cost — no kernel is downloaded or compiled. With
--json, emits a JSON object carrying the literal range, the
parsed start / end, and the expanded versions array.
| Flag | Description |
|---|---|
--json | Output in JSON format. Each entry includes a boolean eol field (computed at list time by fetching kernel.org’s releases.json) alongside the cached metadata. With --range, emits a single object {range, start, end, versions} instead. |
--range START..END | Switch to range-preview mode. Format: MAJOR.MINOR[.PATCH][-rcN]..MAJOR.MINOR[.PATCH][-rcN]. Performs the single releases.json fetch a real range resolve does, expands inclusively, and prints the version list — no downloads, no builds, no cache lookups. |
kernel build
Download, build, and cache a kernel image. Three source modes:
version (tarball download), --source (local tree), --git (clone).
cargo ktstr kernel build # latest stable from kernel.org
cargo ktstr kernel build 6.14.2 # specific version
cargo ktstr kernel build 6.15-rc3 # RC release
cargo ktstr kernel build 6.12 # latest 6.12.x patch release
cargo ktstr kernel build --source ../linux # local source tree
cargo ktstr kernel build --git URL --ref v6.14 # git clone (shallow, depth 1)
cargo ktstr kernel build --force 6.14.2 # rebuild even if cached
When no version or source is given, kernel build reads kernel.org’s
releases.json and picks the latest stable series that has had at
least 8 maintenance releases, which keeps CI off brand-new majors
whose early builds are more likely to break. A major.minor prefix
(e.g. 6.12) resolves to the highest patch release in that series.
For EOL series no longer in releases.json, it probes cdn.kernel.org
to find the latest available tarball. It skips building when a
cached entry already exists
(use --force to override). Stale entries (built with a different
ktstr.kconfig) are rebuilt automatically. For --source, generates
compile_commands.json for LSP support. Dirty local trees
(uncommitted changes to tracked files) are built but not cached.
| Flag | Description |
|---|---|
VERSION | Kernel version or prefix to download (e.g. 6.14.2, 6.12, 6.15-rc3). A major.minor prefix resolves to the highest patch release, probing cdn.kernel.org for EOL series. Conflicts with --source and --git. |
--source PATH | Path to existing kernel source directory. Conflicts with VERSION and --git. |
--git URL | Git URL to clone. Requires --ref. Conflicts with VERSION and --source. |
--ref REF | Git ref to checkout (branch, tag, commit). Required with --git. |
--force | Rebuild even if a cached image exists. |
--clean | Run make mrproper before configuring. Only meaningful with --source. |
--cpu-cap N | Reserve exactly N host CPUs for the build (integer ≥ 1; must be ≤ the calling process’s sched_getaffinity cpuset size). When absent, 30% of the allowed CPUs are reserved (minimum 1). The planner walks whole LLCs in consolidation- and NUMA-aware order, partial-taking the last LLC so plan.cpus.len() == N exactly. Under --cpu-cap, make -jN parallelism matches the reserved CPU count and the build runs inside a cgroup v2 sandbox that pins gcc/ld to the reserved CPUs + NUMA nodes. Mutually exclusive with KTSTR_BYPASS_LLC_LOCKS=1. Also settable via KTSTR_CPU_CAP env var (CLI flag wins when both are present). |
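The --cpu-cap sizing rule (explicit N between 1 and the allowed CPU count, otherwise 30% of the allowed CPUs with a floor of 1) can be sketched as below. The rounding direction of the 30% default is not stated in this section; the sketch assumes it rounds down, and the function name is hypothetical.

```rust
/// Resolve how many CPUs to reserve for a build.
/// `allowed_cpus` is the size of the caller's sched_getaffinity cpuset.
/// ASSUMPTION: the 30% default rounds down before the minimum of 1 applies.
pub fn resolve_cpu_cap(allowed_cpus: usize, cpu_cap: Option<usize>) -> Result<usize, String> {
    match cpu_cap {
        Some(0) => Err("--cpu-cap must be >= 1".to_string()),
        Some(n) if n > allowed_cpus => {
            Err(format!("--cpu-cap {n} exceeds the {allowed_cpus} allowed CPUs"))
        }
        Some(n) => Ok(n),
        None => Ok((allowed_cpus * 30 / 100).max(1)),
    }
}
```

Under these assumptions a 16-CPU affinity mask reserves 4 CPUs by default, and a 2-CPU mask still reserves 1.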
kernel clean
Remove cached kernel images.
cargo ktstr kernel clean # remove all (with confirmation prompt)
cargo ktstr kernel clean --keep 3 # keep 3 most recent
cargo ktstr kernel clean --force # skip confirmation prompt
cargo ktstr kernel clean --corrupt-only --force # remove only corrupt entries
| Flag | Description |
|---|---|
--keep N | Keep the N most recent VALID cached kernels. Corrupt entries (metadata missing or unparseable, image file absent) are always candidates for removal regardless of this value — a corrupt entry never consumes a keep slot. Mutually exclusive with --corrupt-only. |
--force | Skip confirmation prompt. Required in non-interactive contexts. |
--corrupt-only | Remove only corrupt cache entries (metadata missing or unparseable, image file absent). Valid entries are left untouched regardless of --force. Useful for clearing broken entries after an interrupted build without risking the curated set of good kernels. Mutually exclusive with --keep. |
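The interaction between --keep and --corrupt-only can be expressed as a selection function. This is a sketch with hypothetical types (Entry, removal_candidates); it assumes recency is ordered by a build timestamp and leaves the mutual-exclusion check to argument parsing.

```rust
/// One cache entry as seen by the cleaner (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub key: String,
    pub built_at: u64, // build timestamp; larger is newer
    pub corrupt: bool,
}

/// Return the keys that would be removed.
/// Corrupt entries are always candidates and never consume a keep slot.
pub fn removal_candidates(entries: &[Entry], keep: Option<usize>, corrupt_only: bool) -> Vec<String> {
    // Valid entries, newest first.
    let mut valid: Vec<&Entry> = entries.iter().filter(|e| !e.corrupt).collect();
    valid.sort_by(|a, b| b.built_at.cmp(&a.built_at));

    let mut remove: Vec<String> = entries
        .iter()
        .filter(|e| e.corrupt)
        .map(|e| e.key.clone())
        .collect();
    if !corrupt_only {
        // Skip the N newest valid entries; with no --keep, nothing is skipped.
        remove.extend(valid.iter().skip(keep.unwrap_or(0)).map(|e| e.key.clone()));
    }
    remove
}
```

Counting keep slots only over valid entries is what guarantees that --keep 3 leaves three usable kernels even when newer corrupt entries exist.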
model
Manage the LLM model cache used by OutputFormat::LlmExtract
payloads. fetch downloads the default pinned model into the
ktstr model cache; status reports whether a SHA-checked copy
is already cached; clean deletes the cached artifact plus
its warm-cache sidecar.
cargo ktstr model fetch # download + SHA-check (no-op if cached)
cargo ktstr model status # report cache path + verdict
cargo ktstr model clean # delete cached artifact + sidecar
fetch is a no-op when the cache already holds a SHA-checked
copy. Respects KTSTR_MODEL_OFFLINE=1 — set to refuse network
fetches. Cache root resolution: KTSTR_CACHE_DIR (if set),
then $XDG_CACHE_HOME/ktstr/models/, then
$HOME/.cache/ktstr/models/.
status prints four fields and adds a one-line annotation
when the verdict is anything other than Matches (a clean
hit gets no annotation):
| Field | Description |
|---|---|
model: | Model file name (the pinned default; e.g. Qwen3-4B-Q4_K_M.gguf). |
path: | Absolute cache path ({cache_root}/models/{file}) the producer reads at LlmExtract time. |
cached: | true if an entry exists at path:, false otherwise. |
checked: | true if the cached entry’s SHA-256 matches the pinned digest. |
The annotation distinguishes four verdicts: NotCached (no
entry — emit a cargo ktstr model fetch hint plus the
expected download size), CheckFailed (cached entry could
not be SHA-checked due to an I/O error — re-fetch),
Mismatches (cached entry hash does not match the pinned
digest — re-fetch), Matches (silent — the all-clear path).
Re-fetch is the shared remediation tail for every cached-but-
not-Matches branch.
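The relationship between the four verdicts, the cached: / checked: fields, and the remediation can be written down as a small mapping. The verdict names come from the text above; the Verdict enum itself, the function names, and the returned strings are assumptions made for this sketch rather than ktstr's real types or output.

```rust
/// Verdict names as listed above; the enum type is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    NotCached,
    CheckFailed,
    Mismatches,
    Matches,
}

/// (cached, checked) as the status fields describe them.
pub fn status_flags(v: Verdict) -> (bool, bool) {
    match v {
        Verdict::NotCached => (false, false),
        // An entry exists, but it was not confirmed against the pinned digest.
        Verdict::CheckFailed | Verdict::Mismatches => (true, false),
        Verdict::Matches => (true, true),
    }
}

/// Remediation category; None means no annotation (the all-clear path).
pub fn remediation(v: Verdict) -> Option<&'static str> {
    match v {
        Verdict::NotCached => Some("fetch"),
        // Shared tail for every cached-but-not-Matches branch.
        Verdict::CheckFailed | Verdict::Mismatches => Some("re-fetch"),
        Verdict::Matches => None,
    }
}
```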
clean removes both the GGUF artifact at
{cache_root}/models/{file_name} and its .mtime-size
warm-cache sidecar (a small companion file the SHA fast-path
uses to skip re-hashing on subsequent status calls). Per-
file output names what was deleted with an IEC-prefixed size
in parentheses (removed /path/to/Qwen3-4B-Q4_K_M.gguf (2.34 GiB)); a final freed N total line sums the artifact and
sidecar bytes. A no-op clean (nothing cached) prints a single
no cached model found at {path} line so an idempotent re-run
produces a clear “nothing to do” outcome instead of two
“(absent)” lines. Subsequent cargo ktstr model fetch
re-downloads the pin from scratch.
verifier
Collect BPF verifier statistics for every scheduler declared via
declare_scheduler! in the workspace’s test binaries. Spawns
cargo nextest run -E 'test(/^verifier/)' and lets nextest fan
out per (scheduler × kernel-list entry × accepted topology preset)
cell — each cell boots its own VM, loads the scheduler’s BPF
programs, and reports per-program verified instruction counts
from host-side memory introspection.
cargo ktstr verifier # auto-discover kernel
cargo ktstr verifier --kernel ../linux # pin to one kernel
cargo ktstr verifier --kernel 6.14 --kernel 7.0 # multi-kernel sweep
cargo ktstr verifier --raw # raw verifier log
There are no --scheduler / --scheduler-bin flags: the sweep
discovers schedulers from the KTSTR_SCHEDULERS distributed
slice populated by declare_scheduler!. To exclude a scheduler
from the sweep, omit it from the test binary (or declare it with
SchedulerSpec::Eevdf / SchedulerSpec::KernelBuiltin — both
are skipped at cell-emission time because neither has a
userspace binary to verify).
--kernel is repeatable; cargo-ktstr always exports
KTSTR_KERNEL_LIST to the nextest invocation (synthesizing a
single entry from auto-discovery when no --kernel is passed).
Each scheduler’s kernels = [...] declaration acts as a
per-scheduler filter on the operator-supplied set; an empty (or
omitted) kernels field accepts every entry. See BPF Verifier:
Matrix dimension + per-scheduler filter
for the full filter contract.
--raw exports KTSTR_VERIFIER_RAW=1; the cell handler reads
it via env::var_os and switches format_verifier_output from
the cycle-collapsed default to the raw scheduler-log dump. See
BPF Verifier: Cycle collapse algorithm
for the rendering details.
| Flag | Description |
|---|---|
--kernel ID (repeatable) | Kernel identifier: path, version, cache key, range (START..END), or git source (git+URL#REF). Raw image files (bzImage/Image) are NOT accepted — the verifier needs the cached vmlinux and kconfig fragment alongside the image. Source directories auto-build; version strings auto-download on cache miss. When absent, resolves via cache then filesystem, falling back to auto-download. Raw images are accepted only on cargo ktstr shell. |
--raw | Print raw verifier output without cycle collapse. |
See BPF Verifier for the cell-based dispatch
design and output format, and
Scheduler Definitions
for the declare_scheduler! macro that registers a scheduler
in KTSTR_SCHEDULERS.
shell
Shares the VM boot flow with ktstr shell and accepts the same
flags. See ktstr shell for the full flag
reference. The one behavior difference from ktstr shell is that
cargo ktstr shell accepts raw image file paths for --kernel.
cargo ktstr shell
cargo ktstr shell --kernel 6.14.2
cargo ktstr shell --topology 1,2,4,1
cargo ktstr shell -i ./my-binary -i strace
completions
Generate shell completions for cargo-ktstr. See ktstr completions for the base subcommand.
cargo ktstr completions bash >> ~/.local/share/bash-completion/completions/cargo
cargo ktstr completions zsh > ~/.zfunc/_cargo-ktstr
cargo ktstr completions fish > ~/.config/fish/completions/cargo-ktstr.fish
| Arg | Description |
|---|---|
SHELL | Shell to generate completions for (bash, zsh, fish, elvish, powershell). |
--binary NAME | Binary name for completions. Default: cargo. |
stats
Sidecar analysis, per-record diagnostics, and run-to-run comparison. See Runs for the directory layout.
cargo ktstr stats # print analysis of newest run
cargo ktstr stats list # list runs
cargo ktstr stats list-metrics # list registered regression metrics
cargo ktstr stats compare --a-kernel 6.14 --b-kernel 6.15 # slice on kernel
cargo ktstr stats compare --a-scheduler scx_rusty --b-scheduler scx_alpha # slice on scheduler
cargo ktstr stats compare --a-kernel 6.14 --b-kernel 6.15 --scheduler scx_rusty # slice on kernel, pin scheduler on both sides
cargo ktstr stats compare --a-kernel 6.14 --b-kernel 6.15 -E cgroup_steady # add substring filter
cargo ktstr stats compare --a-project-commit abcdef1 --b-project-commit fedcba2 --no-average # opt out of trial averaging
cargo ktstr stats compare --a-kernel-commit abcdef1 --b-kernel-commit fedcba2 # slice on kernel source HEAD
cargo ktstr stats compare --a-run-source ci --b-run-source local # slice on run environment
cargo ktstr stats explain-sidecar --run RUN_ID # diagnose Option-field absences
When invoked without a subcommand, stats prints an analysis of one
run: the explicit directory in KTSTR_SIDECAR_DIR when that
variable is set, otherwise the most recent run directory under
{CARGO_TARGET_DIR or "target"}/ktstr/ (newest by mtime). With
KTSTR_SIDECAR_DIR set, that directory is the sidecar source
directly – there is no newest-subdirectory walk under it. The
report has four sections:
- Gauntlet analysis – outlier detection, per-scenario/topology dimension summaries, stimulus cross-tab.
- BPF verifier stats – per-program verified instruction counts, warnings for programs near the 1M complexity limit.
- BPF callback profile – per-program invocation counts, total CPU time, and average nanoseconds per call.
- KVM stats – cross-VM averages for exits, halt polling, host preemptions.
list
Print a table of run directories under
{CARGO_TARGET_DIR or "target"}/ktstr/ with four columns:
- RUN: the run-directory leaf name, formatted as {kernel}-{project_commit} per Runs. list does NOT consult KTSTR_SIDECAR_DIR; that override only affects where the test harness writes sidecars, and list always enumerates the default runs-root.
- TESTS: number of sidecars in the directory (and one level of subdirectories, since collect_sidecars walks per-job gauntlet layouts).
- DATE: the earliest sidecar timestamp present in the directory. Under last-writer-wins this equals the most recent run’s first sidecar timestamp (the prior run’s sidecars were pre-cleared at the new run’s first write, so only the new run’s timestamps remain). See Runs for the full semantics.
- ARCH: the host.arch value from the run’s first sidecar (e.g. x86_64, aarch64). Renders as - when no sidecar in the directory carries a populated host context; pre-host-context archives and host-only test stubs that never populate the field land in this bucket.
Rows are sorted by directory mtime, most recent first, so the latest run lands at the top — the operator’s usual interest. Entries whose mtime cannot be read fall back to filename order as a deterministic tiebreaker and sort to the end of the listing.
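The ordering rule above (mtime descending, unreadable mtimes last in filename order) maps onto a single comparator. A minimal sketch, assuming a hypothetical RunRow type with the mtime held as an Option; breaking ties between equal mtimes by name is an extra assumption the text does not state.

```rust
use std::cmp::Ordering;

/// Hypothetical row; `mtime` is None when the directory mtime cannot be read.
pub struct RunRow {
    pub name: String,
    pub mtime: Option<u64>,
}

/// Most recent first; rows without a readable mtime sort to the end by name.
pub fn sort_rows(rows: &mut [RunRow]) {
    rows.sort_by(|a, b| match (a.mtime, b.mtime) {
        // Newer mtime first; equal mtimes fall back to name (an assumption).
        (Some(x), Some(y)) => y.cmp(&x).then_with(|| a.name.cmp(&b.name)),
        (Some(_), None) => Ordering::Less,
        (None, Some(_)) => Ordering::Greater,
        // Deterministic fallback: filename order.
        (None, None) => a.name.cmp(&b.name),
    });
}
```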
list-metrics
List the registered regression metrics and their default
thresholds. Enumerates the ktstr::stats::METRICS registry: metric
name, polarity (higher/lower better), default absolute-delta gate,
default relative-delta gate, and display unit. Use this to see which metric names
ComparisonPolicy.per_metric_percent keys can reference, and what
each default absolute and relative gate starts at before an
override. Default output is a human-readable table; --json emits
a JSON array with the same fields.
cargo ktstr stats list-metrics # table
cargo ktstr stats list-metrics --json # JSON array
| Flag | Default | Description |
|---|---|---|
--json | off | Emit JSON instead of a table. |
list-values
List the distinct values present per filterable dimension in the
sidecar pool. Walks every run directory under target/ktstr/
(or --dir), pools the sidecars, and reports per-dimension sets
for all seven dimensions: kernel, commit, kernel_commit,
source, scheduler, topology, and work_type. The commit
and source keys map to the internal
SidecarResult::project_commit / run_source fields; the JSON
wire keys keep the shorter spellings.
Use this before crafting a cargo ktstr stats compare
invocation to discover what --a-X / --b-X values the pool
actually carries: --a-kernel 6.20 against an empty pool fails
downstream with “no rows match filter A”, and list-values is
the upstream answer to “what kernels do I have?”.
cargo ktstr stats list-values # text per-dim blocks
cargo ktstr stats list-values --json # JSON object
cargo ktstr stats list-values --dir /tmp/archived # archived sidecar tree
The text shape renders one block per dimension with values one per line. The JSON shape emits a single object keyed by dimension name with arrays of values:
{
"kernel": [null, "6.14.2", "6.15.0"],
"commit": [null, "abcdef1", "abcdef1-dirty"],
"kernel_commit": [null, "kabcde7", "kabcde7-dirty"],
"source": [null, "ci", "local"],
"scheduler": ["eevdf", "scx_rusty"],
"topology": ["1n2l4c1t", "1n4l2c1t"],
"work_type": ["SpinWait", "PageFaultChurn"]
}
The JSON keys commit and source are the wire contract;
internally the corresponding fields are
SidecarResult::project_commit and SidecarResult::run_source,
and the per-side filter flags spell as --project-commit /
--run-source (see compare).
kernel, commit, kernel_commit, and source are optional
on the source sidecar (SidecarResult::kernel_version /
project_commit / kernel_commit / run_source are
Option<String>); the textual sentinel unknown and JSON
null both denote a sidecar that did not record a value for
that dimension.
| Flag | Default | Description |
|---|---|---|
--json | off | Emit JSON instead of per-dimension text blocks. |
--dir DIR | target/ktstr/ | Alternate run root. Same semantics as compare --dir. |
show-host
Print the archived HostContext for a specific run: CPU identity,
memory/hugepage config, transparent-hugepage policy, NUMA node
count, kernel uname triple, kernel cmdline, and every
/proc/sys/kernel/sched_* tunable captured at archive time. Useful
for inspecting, on a single run, the same fingerprint that
compare’s host-delta section uses.
The command scans sidecars in the run directory in iteration order
and prints the FIRST sidecar that carries a populated host field —
older pre-enrichment sidecars may have host: None, and the
forward scan tolerates those. If no sidecar has a populated host
field the command fails with an actionable error rather than
returning empty output.
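The forward scan described above amounts to "first populated value wins, otherwise fail loudly". A minimal sketch, assuming a hypothetical Sidecar type whose host field is reduced to an optional string; the error wording is invented for illustration.

```rust
/// Hypothetical sidecar; the real host context is a structured value.
pub struct Sidecar {
    pub path: String,
    pub host: Option<String>,
}

/// Returns the first populated host field in iteration order, tolerating
/// leading sidecars with `host: None`.
pub fn first_host(sidecars: &[Sidecar]) -> Result<&str, String> {
    sidecars
        .iter()
        .find_map(|s| s.host.as_deref())
        .ok_or_else(|| {
            format!(
                "none of the {} sidecar(s) carries a populated host field",
                sidecars.len()
            )
        })
}
```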
| Flag | Default | Description |
|---|---|---|
--run ID | required | Run key (e.g. 6.14-abc1234 or 6.14-abc1234-dirty; from cargo ktstr stats list). |
--dir DIR | target/ktstr/ | Alternate run root. Same semantics as compare --dir: useful for archived sidecar trees copied off a CI host. |
explain-sidecar
Diagnose Option-field absences across a run’s sidecars. Loads
every *.ktstr.json under --run ID (or its subdirectories one
level deep, mirroring compare’s gauntlet-job layout) and reports,
per sidecar, which Option<T> fields landed as None plus the
documented causes for each absence and a classification:
- expected – None is the steady-state shape; no operator action recovers it (e.g. payload for a scheduler-only test, scheduler_commit which no SchedulerSpec variant exposes today).
- actionable – None indicates a recoverable gap; re-running in a different environment (in-repo cwd, non-tarball kernel, non-host-only test) would populate the field.
Different gauntlet variants on the same run legitimately differ on which fields populate (host-only vs VM-backed, scheduler-only vs payload-bearing), so the report is per-sidecar rather than aggregate.
Sidecars are loaded verbatim: this command does NOT rewrite
run_source to "archive" even when --dir is set. This
intentionally diverges from compare / list-values and matches
show-host, because the rewrite would erase the only signal that
surfaces the pre-rename source-key drop case.
The output header reports walked N sidecar file(s), parsed M valid: N counts every
.ktstr.json file the walker visited, M counts how many
parsed against the current schema. walked > parsed signals a
corrupt or pre-1.0-schema sidecar — re-run the test to
regenerate under the current schema.
Per-None blocks in the text output also include a fix:
line for fields whose None is recoverable by an operator
action (e.g. kernel_commit recovers when KTSTR_KERNEL
points at a local kernel git tree). Fields whose None is
the steady-state shape (or a multi-cause set with no single
remediation) emit no fix: line.
When the walk encounters parse failures, the text output
appends a trailing corrupt sidecars (N): block listing
each corrupt path on its own line followed by the serde
error message indented as error: ..., optionally
followed by an enriched: ... line with operator-facing
remediation prose when the parse failure matches a known
schema-drift case (currently the host missing-field
case). When the walk encounters IO failures (file matched
the predicate but read_to_string failed before parsing
could begin — permission denied, mid-rotate truncation,
broken symlink, EISDIR), the text output appends a parallel
io errors (N): block, structured the same way (path on
its own line, error: ... line below) but carrying
std::io::Error::Display rather than serde-error text. IO
errors do NOT carry enriched: lines — there is no
schema-drift catalog for filesystem incidents; the raw
std::io::Error Display is the remediation surface.
Each block is suppressed independently when its source
vec is empty.
All-corrupt and all-IO-failure runs (every predicate-
matching file failed to parse, or every one failed to
read) are NOT a hard error — text output renders the
header (walked N sidecar file(s), parsed 0 valid)
followed directly by the corrupt sidecars (N): and/or
io errors (N): block(s), skipping the per-sidecar
breakdown that has nothing to render. JSON output mirrors
this with valid: 0, _walk.errors and/or
_walk.io_errors populated, and per-field counts at zero.
This preserves structured per-file visibility for
dashboard consumers facing total-failure runs of either
class.
All-corrupt and all-IO-failure runs exit 0 (not a hard error); CI scripts must inspect the JSON channel for failure detection rather than relying on exit code. Two common gating policies, each appropriate for different operational stances:
- Lenient (treat partial failures as warnings): _walk.valid > 0. Accepts any run with at least one successfully-parsed sidecar; per-file parse or IO failures surface in the JSON arrays for triage but do not fail the gate.
- Strict (fail on any sidecar failure): _walk.errors.len() == 0 && _walk.io_errors.len() == 0. Requires every predicate-matching file to parse cleanly. Both checks are required because the two arrays cover disjoint failure classes (parse vs read): a run with zero parse errors but one IO error still has a missing sidecar.
The two policies are NOT equivalent: a run with one valid
and one corrupt sidecar passes lenient (valid == 1 > 0)
but fails strict (errors.len() == 1 > 0). Pick the
policy that matches the operational tolerance for partial
data.
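The two gating policies reduce to two predicates over the _walk envelope. A minimal sketch, assuming a CI script has already decoded the JSON into a hypothetical WalkSummary holding the valid count and the lengths of the two error arrays:

```rust
/// Hypothetical decoded view of the `_walk` envelope: the `valid` count
/// plus the lengths of the `errors` and `io_errors` arrays.
pub struct WalkSummary {
    pub valid: usize,
    pub errors: usize,
    pub io_errors: usize,
}

/// Lenient gate: at least one sidecar parsed successfully.
pub fn lenient_ok(w: &WalkSummary) -> bool {
    w.valid > 0
}

/// Strict gate: no parse failures and no IO failures.
pub fn strict_ok(w: &WalkSummary) -> bool {
    w.errors == 0 && w.io_errors == 0
}
```

A run with one valid and one corrupt sidecar passes lenient_ok and fails strict_ok, which is the divergence the paragraph above describes.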
--json emits a single object with three top-level keys:
- _schema_version: a string version stamp (currently "1") that consumers can gate on for incompatible shape changes.
- _walk: an envelope carrying walked / valid counts (the same numbers the text header reports under “walked N sidecar file(s), parsed M valid”), an errors array of {path, error, enriched_message} entries covering every parse failure (enriched_message is a human-facing remediation string when a known schema-drift case matches, JSON null otherwise), and an io_errors array of {path, error} entries covering every IO failure (file matched the predicate but read_to_string failed; error carries the raw std::io::Error Display).
- fields: each entry under fields carries none_count and some_count (counts across all valid sidecars in the run, summing to _walk.valid), classification, causes, and fix (string when a remediation applies, JSON null otherwise).
Both _walk arrays emit on every render (an empty array when no
failures of that class occurred) so dashboard consumers see a
uniform shape without contains_key branching. With both arrays,
walked == valid + errors.len() + io_errors.len() by
construction in the steady state: every predicate-matching
file lands in exactly one bucket. (Filesystem races between
the count and load passes can perturb this; see the rustdoc
on WalkStats for the full caveat.)
Output produced before the schema-version stamp landed has
no _schema_version key; consumers should treat the key’s
absence as pre-stamp output (compatible with shape "1" in
practice but unstamped).
The version bumps on incompatible shape changes (key
rename, key removal, semantic shift in an existing key) but
NOT on additive changes (new optional top-level keys, new
entries in fields, new optional sub-keys under existing
entries). The stamp is emitted as a JSON string (e.g. "1",
"2"); parse it by stripping the quotes and converting the
inner digits to an integer, then gate on parsed >= 1
(integer comparison) — never use raw string comparison, since
lexicographic order would put "10" ahead of "2". Pin
loosely (e.g. accept any version >= 1) so dashboard code
keeps working when the catalog grows; tighten only on the
specific bumps a consumer cannot tolerate.
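The integer-versus-string comparison point is easy to get wrong, so here is a minimal sketch of a loose version gate. It assumes the consumer's JSON parser has already decoded the stamp to a plain string (quotes removed); the function name is hypothetical.

```rust
/// Returns true when the decoded `_schema_version` stamp parses as an
/// integer that is at least `min`. Unparseable stamps are rejected.
pub fn schema_at_least(stamp: &str, min: u64) -> bool {
    stamp
        .trim()
        .parse::<u64>()
        .map(|v| v >= min)
        .unwrap_or(false)
}
```

For example, schema_at_least("10", 2) holds, whereas the raw string comparison "10" < "2" is also true in lexicographic order, which is exactly the trap the text warns about.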
cargo ktstr stats explain-sidecar --run RUN_ID # text per-sidecar diagnostic
cargo ktstr stats explain-sidecar --run RUN_ID --json # aggregate JSON for dashboards
cargo ktstr stats explain-sidecar --run RUN_ID --dir /path/archive # diagnose archived sidecars
| Flag | Default | Description |
|---|---|---|
--run ID | required | Run key (e.g. 6.14-abc1234 or 6.14-abc1234-dirty; from cargo ktstr stats list). |
--dir DIR | target/ktstr/ | Alternate run root. Same semantics as compare --dir. |
--json | off | Emit aggregate JSON instead of per-sidecar text. |
compare
Pool every sidecar under target/ktstr/ (or --dir), partition
the rows into A and B sides via per-side filter flags, average
each side’s matching sidecars per pairing key (or pass through
distinct sidecars under --no-average), and report regressions
on the A→B delta. Exits non-zero on regression.
The dimensions on which the A and B filters DIFFER are the SLICING dimensions — the axes of the A/B contrast. Every other dimension is part of the dynamic PAIRING key the comparison joins on. Slicing dims are derived automatically from the filters:
# Slice on kernel: A is 6.14, B is 6.15. Pair on every other dim.
cargo ktstr stats compare --a-kernel 6.14 --b-kernel 6.15
# Slice on kernel AND scheduler simultaneously.
cargo ktstr stats compare \
--a-kernel 6.14 --a-scheduler scx_rusty \
--b-kernel 6.15 --b-scheduler scx_alpha
# Slice on project commit, narrow both sides to one scheduler+kernel.
cargo ktstr stats compare \
--a-project-commit abcdef1 --b-project-commit fedcba2 \
--kernel 6.14 --scheduler scx_rusty
# Slice on run environment: CI runs vs local developer runs.
cargo ktstr stats compare \
--a-run-source ci --b-run-source local
Symmetric sugar. Shared --X flags (--kernel, --scheduler,
--topology, --work-type, --project-commit, --kernel-commit,
--run-source) pin BOTH sides to the same value(s). Per-side
--a-X / --b-X flags REPLACE the corresponding shared --X
value for that side only — “more-specific replaces” semantics.
So --kernel 6.14 --a-kernel 6.13 puts A on 6.13 and B on 6.14.
Together the seven slicing dimensions (kernel, scheduler,
topology, work-type, project-commit, kernel-commit,
run-source) cover every typed axis the comparison can contrast
on.
Validation. The dispatch site checks two cases up front; the first is rejected, the second only warns:
- Empty slicing: no --a-X / --b-X at all, OR the per-side flags resolve to identical effective filters. Bails with “specify at least one per-side filter (e.g. --a-kernel 6.14 --b-kernel 6.15) to define what dimension separates the two sides.”
- Multi-dim slicing: slicing on more than one dimension prints a warning to stderr (“warning: slicing on N dimensions; results compress multiple axes into a single A/B contrast”) but continues, because multi-dim contrasts are a deliberate feature for cohort sweeps.
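The "more-specific replaces" rule and the derivation of slicing dimensions can be sketched together. This is an illustration under stated assumptions, not the dispatch code: the Filters alias, the dimension-name keys, and both function names are hypothetical, and value lists are compared as written (order-sensitive).

```rust
use std::collections::BTreeMap;

/// Hypothetical filter set: dimension name (e.g. "kernel") to listed values.
pub type Filters = BTreeMap<String, Vec<String>>;

/// Effective filters for one side: a per-side list, when present,
/// replaces the shared list for that dimension on that side only.
pub fn effective(shared: &Filters, per_side: &Filters) -> Filters {
    let mut out = shared.clone();
    for (dim, vals) in per_side {
        out.insert(dim.clone(), vals.clone());
    }
    out
}

/// Dimensions whose effective A and B filters differ. An empty result
/// corresponds to the "empty slicing" rejection; more than one entry
/// corresponds to the multi-dim warning.
pub fn slicing_dims(a: &Filters, b: &Filters) -> Vec<String> {
    let mut dims: Vec<String> = a.keys().chain(b.keys()).cloned().collect();
    dims.sort();
    dims.dedup();
    dims.into_iter().filter(|d| a.get(d) != b.get(d)).collect()
}
```

With a shared kernel of 6.14 and an A-side kernel of 6.13, effective puts A on 6.13 and B on 6.14, and slicing_dims reports kernel as the only slicing dimension.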
Averaging. By default the comparison aggregates every
matching sidecar within each side into a single arithmetic-mean
row per pairing key, smoothing run-to-run jitter. Failing /
skipped contributors are excluded from the metric mean; the
aggregated row’s passed is the AND across every contributor.
A header line above the comparison table reads averaged across N runs (A) and M runs (B) and a per-group
passes_observed/total_observed block prints below the summary.
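The averaging rule for one pairing-key group can be sketched as follows. The types and field names are hypothetical, the sketch handles a single metric, and two points are assumptions the text does not settle: that a skipped contributor counts as not passed in the AND, and that passes_observed / total_observed mean "passing contributors" and "all contributors" respectively.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Outcome {
    Pass,
    Fail,
    Skip,
}

/// Hypothetical contributor: one sidecar's outcome and one metric value.
pub struct Contributor {
    pub outcome: Outcome,
    pub value: f64,
}

#[derive(Debug, PartialEq)]
pub struct AveragedRow {
    /// Arithmetic mean over passing contributors; None if there are none.
    pub mean: Option<f64>,
    /// AND across every contributor (Skip treated as not passed: assumption).
    pub passed: bool,
    pub passes_observed: usize,
    pub total_observed: usize,
}

pub fn average_group(group: &[Contributor]) -> AveragedRow {
    // Failing and skipped contributors are excluded from the metric mean.
    let passing: Vec<f64> = group
        .iter()
        .filter(|c| c.outcome == Outcome::Pass)
        .map(|c| c.value)
        .collect();
    let mean = if passing.is_empty() {
        None
    } else {
        Some(passing.iter().sum::<f64>() / passing.len() as f64)
    };
    AveragedRow {
        mean,
        passed: group.iter().all(|c| c.outcome == Outcome::Pass),
        passes_observed: passing.len(),
        total_observed: group.len(),
    }
}
```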
+mixed commit marker. When contributors to an averaged
group disagree on the -dirty suffix for the same canonical
hex (some clean, some -dirty), the rendered commit and
kernel_commit columns show {hex}+mixed for that group.
+mixed is a COHORT-level marker (distinct from -dirty,
which is a per-record property of one sidecar): it indicates
mixed working-tree state across the group’s contributors.
Mixed-dirty tracking spans EVERY contributor (passing,
failing, skipped) so WIP-vs-committed disagreement surfaces
in the averaged row even when one of the two states only
appears on a failing run. The marker is rendered against the
canonical un-suffixed hex, so an abc1234 clean entry plus an
abc1234-dirty entry render as abc1234+mixed regardless of
which contributor was scanned first. Homogeneous cohorts
(every contributor clean, every contributor dirty, or every
contributor None) preserve the first-seen value verbatim
and never get the +mixed marker.
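A minimal sketch of the +mixed rendering rule for one commit column follows. The function name and the Option<&str> representation are assumptions. The sketch only covers contributors that share the first-seen canonical hex, and it ignores None entries when they are mixed with populated ones; the text above does not specify either case.

```rust
/// Renders one commit column for an averaged group. Each contributor is
/// None, a clean hex ("abc1234"), or a dirty hex ("abc1234-dirty").
pub fn render_commit(contributors: &[Option<&str>]) -> Option<String> {
    // First-seen populated value; homogeneous cohorts render it verbatim.
    let first: &str = contributors.iter().copied().flatten().next()?;
    let canonical = first.strip_suffix("-dirty").unwrap_or(first);

    let mut clean = false;
    let mut dirty = false;
    for c in contributors.iter().copied().flatten() {
        match c.strip_suffix("-dirty") {
            Some(hex) if hex == canonical => dirty = true,
            None if c == canonical => clean = true,
            _ => {} // different hex: outside the scope of this sketch
        }
    }

    if clean && dirty {
        // Rendered against the un-suffixed hex, so scan order does not matter.
        Some(format!("{canonical}+mixed"))
    } else {
        Some(first.to_string())
    }
}
```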
--no-average keeps each sidecar distinct. If multiple sidecars
on the same side share the same pairing key under --no-average,
the comparison bails with “duplicate pairing keys” — pairing
across A/B sides is ambiguous when one A-row could match many
B-rows. Either drop --no-average to average them, or add
another per-side filter to disambiguate.
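The --no-average ambiguity check is a uniqueness test over one side's pairing keys. A minimal sketch, assuming each sidecar's pairing key has already been rendered to a string; the function name and the exact error text beyond the quoted "duplicate pairing keys" phrase are illustrative.

```rust
use std::collections::HashSet;

/// Fails when two sidecars on the same side share a pairing key, since
/// one A-row could then match many B-rows.
pub fn check_unique_keys(side: &[String]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for key in side {
        // `insert` returns false when the key was already present.
        if !seen.insert(key.as_str()) {
            return Err(format!("duplicate pairing keys: {key}"));
        }
    }
    Ok(())
}
```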
Kernel match shape. A --kernel 6.12 filter (two-segment
major.minor) PREFIX-matches every patch release in that series:
6.12, 6.12.0, 6.12.5 all match. A three-or-more-segment
filter (--kernel 6.14.2, --kernel 6.15-rc3) is strict
equality — 6.14.2 does NOT match 6.14.20. The same shape
applies to --a-kernel / --b-kernel.
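The kernel match shape can be sketched as a single predicate. The segment-counting rule here (split on both "." and "-", so 6.15-rc3 counts as three segments) is an assumption chosen to agree with the examples above, and the sketch does not try to decide whether a two-segment filter should match pre-release strings such as 6.12-rc1.

```rust
/// Two-segment filters prefix-match a release series; anything longer
/// is strict equality.
pub fn kernel_matches(filter: &str, version: &str) -> bool {
    let segments = filter.split(|c: char| c == '.' || c == '-').count();
    if segments == 2 {
        // Match the series itself or any "<filter>.<patch>" release,
        // but not a longer minor number such as "6.120".
        version == filter
            || version
                .strip_prefix(filter)
                .map_or(false, |rest| rest.starts_with('.'))
    } else {
        version == filter
    }
}
```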
Discovering filter values. Run
cargo ktstr stats list-values before
crafting a compare invocation to see what kernel, commit,
kernel_commit, source, scheduler, topology, and
work_type values the sidecar pool actually carries; passing a
--a-kernel 6.20 against an empty pool fails downstream with
“no rows match filter A” and list-values is the upstream
answer to “what have I got?”. list-values reports all seven
filterable dimensions; the JSON keys commit and source map
to the per-side filter flags --project-commit and
--run-source.
When a side comes back as unknown for one of the optional
dimensions (kernel, commit, kernel_commit, source),
cargo ktstr stats explain-sidecar on the
underlying run reports per-sidecar which optional fields are
missing and what each absence means.
| Flag | Default | Description |
|---|---|---|
-E FILTER | – | Substring filter applied to the joined scenario topology scheduler work_type string. Scope is limited: -E does NOT match against kernel, project_commit, kernel_commit, or run_source — those are typed dimensions reachable only via the dedicated --kernel / --project-commit / --kernel-commit / --run-source flags. To narrow on those, use the typed flags. Composes with the typed dimension filters: typed narrows happen first, substring runs over the surviving set. |
--kernel VER (repeatable) | – | Pin BOTH sides to the listed kernel version(s). Sugar for --a-kernel V1 --a-kernel V2 --b-kernel V1 --b-kernel V2. Per-side --a-kernel / --b-kernel REPLACES this shared value for that side only. Major.minor (6.12) prefix-matches; three-segment (6.14.2) is strict. |
--scheduler NAME (repeatable) | – | Pin BOTH sides to the listed scheduler(s). Sugar for --a-scheduler N1 --a-scheduler N2 --b-scheduler N1 --b-scheduler N2. Per-side --a-scheduler / --b-scheduler REPLACES this shared value for that side only. OR-combined: a row matches iff its scheduler field equals ANY listed entry. Strict equality per entry. |
--topology LABEL (repeatable) | – | Pin BOTH sides to the listed rendered topology label(s) (e.g. 1n2l4c2t). Sugar for --a-topology L1 --a-topology L2 --b-topology L1 --b-topology L2. Per-side --a-topology / --b-topology REPLACES this shared value for that side only. OR-combined: a row matches iff its rendered topology label equals ANY listed entry. Strict equality per entry. |
--work-type TYPE (repeatable) | – | Pin BOTH sides to the listed work_type(s) (PascalCase variants of WorkType, e.g. SpinWait). Sugar for --a-work-type T1 --a-work-type T2 --b-work-type T1 --b-work-type T2. Per-side --a-work-type / --b-work-type REPLACES this shared value for that side only. OR-combined: a row matches iff its work_type field equals ANY listed entry. Strict equality per entry. See WorkSpec types. |
--project-commit HASH (repeatable) | – | Pin BOTH sides to listed project_commit value(s) (7-char hex, optional -dirty suffix). Also accepts git revspecs (HEAD, HEAD~N, tags, branches, A..B ranges) resolved against the project repo into the same 7-char short hashes; see --help for details. Filters the ktstr framework commit; the scheduler binary’s commit (SidecarResult::scheduler_commit) is not currently exposed as a filter. |
--kernel-commit HASH (repeatable) | – | Pin BOTH sides to listed kernel_commit value(s) (7-char hex, optional -dirty suffix). Also accepts git revspecs (HEAD, HEAD~N, tags, branches, A..B ranges) resolved against the kernel repo (gix::open against KTSTR_KERNEL’s path); see --help for details. Filters the kernel SOURCE TREE commit (SidecarResult::kernel_commit), distinct from the kernel release version (--kernel): two runs of the same kernel_version with different kernel_commit values represent the same release rebuilt from different trees. Rows whose kernel_commit is None (KTSTR_KERNEL pointed at a non-git path, the underlying source was Tarball / Git rather than a Local tree, or the gix probe failed) NEVER match a populated filter. |
--run-source NAME (repeatable) | – | Pin BOTH sides to listed run-environment source(s). Filters SidecarResult::run_source set by detect_run_source at sidecar-write time: "local" for developer runs, "ci" when KTSTR_CI was set, or rewritten to "archive" at load time when --dir points at a non-default pool root. Rows whose run_source is None (sidecar pre-dates the field) NEVER match a populated filter — same opt-in policy as --kernel / --project-commit / --kernel-commit. Combine per-side --a-run-source ci --b-run-source local to contrast CI runs against developer runs of the same scenarios. |
--a-kernel VER (repeatable) | – | A-side kernel filter. Replaces the shared --kernel for the A side only. |
--a-scheduler NAME (repeatable) | – | A-side scheduler filter, OR-combined. Replaces the shared --scheduler value for the A side only. |
--a-topology LABEL (repeatable) | – | A-side topology filter, OR-combined. Replaces the shared --topology value for the A side only. |
--a-work-type TYPE (repeatable) | – | A-side work_type filter, OR-combined. Replaces the shared --work-type value for the A side only. |
--a-project-commit HASH (repeatable) | – | A-side project-commit filter. Replaces the shared --project-commit for the A side only. |
--a-kernel-commit HASH (repeatable) | – | A-side kernel-commit filter. Replaces the shared --kernel-commit for the A side only. |
--a-run-source NAME (repeatable) | – | A-side run-source filter. Replaces the shared --run-source for the A side only. |
--b-kernel VER (repeatable) | – | B-side kernel filter. Replaces the shared --kernel for the B side only. |
--b-scheduler NAME (repeatable) | – | B-side scheduler filter, OR-combined. Replaces the shared --scheduler value for the B side only. |
--b-topology LABEL (repeatable) | – | B-side topology filter, OR-combined. Replaces the shared --topology value for the B side only. |
--b-work-type TYPE (repeatable) | – | B-side work_type filter, OR-combined. Replaces the shared --work-type value for the B side only. |
--b-project-commit HASH (repeatable) | – | B-side project-commit filter. Replaces the shared --project-commit for the B side only. |
--b-kernel-commit HASH (repeatable) | – | B-side kernel-commit filter. Replaces the shared --kernel-commit for the B side only. |
--b-run-source NAME (repeatable) | – | B-side run-source filter. Replaces the shared --run-source for the B side only. |
--no-average | off | Disable averaging. Each sidecar stays distinct; bails with an actionable error when multiple sidecars on the same side share the same pairing key (since pairing across sides becomes ambiguous). |
--threshold PCT | per-metric default_rel | Uniform relative significance threshold in percent. Overrides the per-metric default_rel for every metric; the absolute gate is always per-metric and cannot be tuned from the CLI. Mutually exclusive with --policy. |
--policy FILE | – | Path to a JSON ComparisonPolicy file with per-metric thresholds. Schema: { "default_percent": N, "per_metric_percent": { "worst_spread": 5.0, ... } }. Priority is per-metric override → default_percent → each metric’s registry default_rel. Per-metric keys are rejected at load time if they do not match a metric in the METRICS registry. Mutually exclusive with --threshold. |
--dir DIR | target/ktstr/ | Alternate runs root for pool collection. Defaults to test_support::runs_root() (typically target/ktstr/). Useful when comparing archived sidecar trees copied off a CI host. |
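The --policy priority chain (per-metric override, then default_percent, then the registry default) is a two-step fallback. A minimal sketch with a hypothetical Policy struct mirroring the JSON schema in the table; it assumes default_percent is optional and that all three values are expressed in the same unit, neither of which the table states outright.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory form of the policy file.
pub struct Policy {
    pub default_percent: Option<f64>,
    pub per_metric_percent: HashMap<String, f64>,
}

/// Resolves the relative threshold for one metric:
/// per-metric override, then default_percent, then the registry default.
pub fn resolve_threshold(policy: &Policy, metric: &str, registry_default_rel: f64) -> f64 {
    policy
        .per_metric_percent
        .get(metric)
        .copied()
        .or(policy.default_percent)
        .unwrap_or(registry_default_rel)
}
```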
Prerequisites
Run tests first to generate sidecar JSON files:
cargo ktstr test # generates target/ktstr/{kernel}-{project_commit}/*.json
cargo ktstr stats # reads the newest run
Set KTSTR_SIDECAR_DIR to override the sidecar directory; otherwise
the default is {CARGO_TARGET_DIR or "target"}/ktstr/{kernel}-{project_commit}/,
where {project_commit} is the project HEAD short hex (with -dirty
when the worktree differs).
show-host
Print the live host context used by the sidecar collector:
CPU identity, memory/hugepage config, transparent-hugepage
policy, NUMA node count, kernel uname triple
(sysname / release / machine), kernel cmdline, and every
/proc/sys/kernel/sched_* tunable. Useful for diagnosing
cross-run regressions that trace back to host-context drift
(sysctl change, THP policy flip, hugepage reservation) or for
confirming what cargo ktstr stats compare would record on
the next run produced here.
cargo ktstr show-host
This is a live snapshot (reads /proc, /sys, and
uname() at invocation time). For the archived host
context captured at sidecar-write time for a past run, use
cargo ktstr stats show-host --run RUN_ID
instead — same HostContext::format_human formatter so the
two outputs are byte-for-byte comparable when the host is
unchanged.
For historical drift between archived runs (host-side diff
across two run partitions), use
cargo ktstr stats compare — its host-delta
section reports which host-context fields changed between
side A and side B using the same HostContext::diff logic.
show-thresholds
Print the resolved assertion thresholds for the named test —
the same merged Assert value run_ktstr_test_inner evaluates
against worker reports, produced by the runtime merge chain
Assert::default_checks().merge(entry.scheduler.assert()).merge(&entry.assert).
Surfaces every threshold field (or none when inherited or
unset) so an operator can see what the test will actually
check against without reading source or guessing which layer
contributed each bound.
cargo ktstr show-thresholds preempt_regression_fault_under_load
| Arg | Description |
|---|---|
| TEST | Function-name-only test identifier as registered in #[ktstr_test] (e.g. preempt_regression_fault_under_load). Use cargo nextest list to enumerate test names, then strip the <binary>:: prefix that nextest prepends to each line before passing the name here. The #[ktstr_test] registry keys on the bare function name, so a name like ktstr::my_test (as printed by nextest) must be trimmed to my_test before it resolves. |
Fails with an actionable message when no registered test
matches the given name; the diagnostic includes a Did you mean ...? Levenshtein suggestion when a near match exists.
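The name trimming and the near-match suggestion can be sketched with the standard library alone. The helper names and the `max_distance` cutoff are assumptions made for illustration; the text above does not state which cutoff ktstr applies.

```rust
/// Keep only the bare function name from a nextest listing such as
/// "ktstr::my_test" (hypothetical helper).
fn bare_test_name(listed: &str) -> &str {
    listed.rsplit("::").next().unwrap_or(listed)
}

/// Classic two-row Levenshtein edit distance over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b_chars.iter().enumerate() {
            let subst = prev[j] + usize::from(ca != *cb);
            cur.push(subst.min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b_chars.len()]
}

/// Closest registered name within `max_distance` edits, if any.
fn suggest<'a>(name: &str, registered: &[&'a str], max_distance: usize) -> Option<&'a str> {
    registered
        .iter()
        .map(|cand| (levenshtein(name, cand), *cand))
        .filter(|(d, _)| *d <= max_distance)
        .min_by_key(|(d, _)| *d)
        .map(|(_, cand)| cand)
}
```

A mistyped `my_tset` is two edits away from `my_test`, so a cutoff of 3 would surface it as the suggestion.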
locks
Enumerate every ktstr flock held on this host — read-only,
does NOT attempt any flock acquire. Troubleshooting companion
for --cpu-cap contention: when a build or test is stalled
behind a peer’s reservation, cargo ktstr locks names the
peer (PID + cmdline) without disturbing any of its flocks.
Scans four lock-file roots:
- /tmp/ktstr-llc-*.lock – per-LLC reservations held by perf-mode test runs and --cpu-cap-bounded builds.
- /tmp/ktstr-cpu-*.lock – per-CPU reservations from the same flow.
- {cache_root}/.locks/*.lock – cache-entry locks held during kernel build writes, and source-{path_hash}.lock files held for the duration of kernel build --source and cargo ktstr test --kernel <path> against the same source tree.
- {runs_root}/.locks/{kernel}-{project_commit}.lock – per-run-key sidecar-write locks held for the duration of the (pre-clear + write) cycle to serialize concurrent ktstr processes targeting the same run directory.
Each lock is cross-referenced against /proc/locks to name
the holder PID and cmdline.
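As a rough sketch of the cross-reference, the parser below pulls the holder PID and inode out of one /proc/locks line. The struct and function names are hypothetical, and matching a lock file to an entry by inode is an assumption about how the lookup could work, not a description of ktstr's code.

```rust
/// Holder of one FLOCK entry in /proc/locks (hypothetical type).
#[derive(Debug, PartialEq)]
struct FlockHolder {
    pid: u32,
    inode: u64,
}

/// Parse a line shaped like
/// "1: FLOCK  ADVISORY  WRITE 4242 08:02:1311768 0 EOF".
/// Non-FLOCK entries and blocked waiters ("1: -> FLOCK ...") yield None.
fn parse_flock_line(line: &str) -> Option<FlockHolder> {
    let fields: Vec<&str> = line.split_whitespace().collect();
    if fields.len() < 6 || fields[1] != "FLOCK" {
        return None;
    }
    let pid = fields[4].parse().ok()?;
    // The sixth field is <major>:<minor>:<inode>; only the inode is needed.
    let inode = fields[5].rsplit(':').next()?.parse().ok()?;
    Some(FlockHolder { pid, inode })
}
```

A caller could compare `inode` against the inode of each lock file found under the four roots, then read /proc/<pid>/cmdline for the holder's command line.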
cargo ktstr locks # one-shot snapshot
cargo ktstr locks --json # JSON snapshot
cargo ktstr locks --watch 1s # redraw every second until SIGINT
cargo ktstr locks --watch 1s --json # ndjson stream, one object per interval
| Flag | Default | Description |
|---|---|---|
| --json | off | Emit the snapshot as JSON. Pretty-printed in one-shot mode; compact (one object per line, ndjson-style) under --watch. Stable field names; schema documented on ktstr::cli::list_locks. |
| --watch DURATION | unset | Redraw the snapshot at the given interval until SIGINT. Value is parsed by humantime: 100ms, 1s, 5m, 1h. Human output clears and redraws in place; --json emits one line-terminated object per interval. |
The same subcommand is available as
ktstr locks with identical flag
semantics.
Install
cargo install --locked ktstr --bin ktstr --bin cargo-ktstr # the two user-facing binaries
The explicit --bin flags scope the install to just ktstr and
cargo-ktstr; without them, cargo install would also place the
test-fixture binaries (ktstr-jemalloc-probe,
ktstr-jemalloc-alloc-worker) on $PATH.
Or build from the workspace:
cargo build --bin cargo-ktstr
Auto-Repro
When a test fails because the scheduler crashes or exits, auto-repro boots a second VM with BPF probes attached to capture function arguments and struct fields along the scheduling path. Stack functions extracted from the crash output seed the probe list; when no crash stack is available (e.g. a BPF text error or verifier failure with no backtrace), auto-repro falls back to dynamic BPF program discovery in the repro VM.
How it works
- First VM – the test runs normally. If the scheduler crashes or exits (BPF error, verifier failure, stall), ktstr captures any stack trace from the scheduler log (COM2) or kernel console (COM1).
- Stack extraction – function names are parsed from the crash trace when available. BPF program symbols (bpf_prog_*) are recognized and their short names extracted. Generic functions (scheduler entry points, spinlocks, syscall handlers, sched_ext exit machinery, BPF trampolines, stack dump helpers) are filtered out. When no stack functions are found, the pipeline continues with an empty probe list.
- BPF discovery – in the repro VM, ktstr discovers loaded struct_ops programs via libbpf-rs and adds them to the probe list alongside any stack-extracted functions. Their kernel-side callers are added (e.g. enqueue -> do_enqueue_task) for bridge kprobes. This step ensures probes can capture variable states across the scheduler exit call chain even when the crash produced no extractable stack.
- BTF resolution – function signatures are resolved from vmlinux BTF (kernel functions) and program BTF (BPF callbacks). Known struct types (task_struct, rq, scx_dispatch_q, etc.) have curated fields resolved to byte offsets. Other struct pointer params have scalar, enum, and cpumask pointer fields auto-discovered from vmlinux or BPF program BTF.
- Second VM – ktstr boots a new VM and reruns the scenario with BPF probes:
  - Kprobe skeleton for kernel function entry (uses bpf_get_func_ip)
  - Fentry/fexit skeleton for BPF callbacks and kernel function exit (batched in groups of 4, shares maps via reuse_fd). Fexit re-reads struct fields after the function executes, capturing post-mutation state alongside the entry snapshot.
  - Tracepoint trigger (tp_btf/sched_ext_exit) fires inside scx_claim_exit() when the scheduler exits, in the context of the current task at exit time
- Stitching – the task_struct pointer is read from the trigger event’s bpf_get_current_task() value. Events with a task_struct parameter are filtered to that pointer; events without a task_struct parameter are retained if their task_ptr (from bpf_get_current_task() at probe time) matches the triggering task. Events are sorted by timestamp and formatted with decoded field values (cpumask ranges, DSQ names, enqueue flags, etc.) and source locations (DWARF for kernel, line_info for BPF).
- Diagnostic tails – the last 40 lines of the repro VM’s scheduler log (COM2, cycle-collapsed), sched_ext dump (COM1), and kernel console (COM1) are appended after the probe output when non-empty. A duration line reports total repro VM wall time. When probe data is absent, a crash reproduction status line indicates whether the crash reproduced.
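The stack-extraction step can be sketched as follows. The sketch assumes the kernel's usual bpf_prog_<16-hex-digit tag>_<name> symbol layout and stack frames of the form symbol+0xoff/0xlen; the function names and the generic-function list passed by the caller are hypothetical stand-ins for ktstr's own filter.

```rust
/// Extract the short program name from a BPF symbol such as
/// "bpf_prog_0123456789abcdef_ktstr_enqueue+0x4c/0x90".
fn bpf_short_name(symbol: &str) -> Option<&str> {
    // Drop the "+0x.../0x..." offset suffix if present.
    let sym = symbol.split('+').next()?;
    let rest = sym.strip_prefix("bpf_prog_")?;
    let (tag, name) = rest.split_once('_')?;
    let is_tag = tag.len() == 16 && tag.chars().all(|c| c.is_ascii_hexdigit());
    (is_tag && !name.is_empty()).then_some(name)
}

/// Turn stack frames into a de-duplicated probe list, skipping any
/// function named in `generic` (caller-supplied, illustrative).
fn probe_candidates<'a>(frames: &[&'a str], generic: &[&str]) -> Vec<&'a str> {
    let mut out: Vec<&'a str> = Vec::new();
    for &frame in frames {
        let sym = frame.trim().split('+').next().unwrap_or("");
        let name = bpf_short_name(sym).unwrap_or(sym);
        if !name.is_empty() && !generic.contains(&name) && !out.contains(&name) {
            out.push(name);
        }
    }
    out
}
```

An empty result corresponds to the "empty probe list" case, where BPF discovery in the repro VM supplies the probes instead.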
Requirements
Auto-repro requires a kernel with the sched_ext_exit tracepoint
(used as the probe trigger). Kernels built with CONFIG_SCHED_CLASS_EXT
and tracepoint support include this. If the tracepoint is unavailable,
auto-repro is skipped and the pipeline diagnostics report the cause.
Enabling auto-repro
In #[ktstr_test]:
#[ktstr_test(auto_repro = true)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> { ... }
auto_repro defaults to true in #[ktstr_test].
Repro mode
During the second VM run, ktstr enables “repro mode”, which disables the work-conservation watchdog. Workers normally send SIGUSR2 to the scheduler when they are stuck for more than 2 seconds. In repro mode the watchdog is off, so the scheduler stays alive and BPF assertion probes can fire.
Example output
The demo_host_crash_auto_repro test triggers a host-initiated crash
via BPF map write and captures the scheduling path. Probe output shows
each function with decoded struct fields and source locations. When
fexit captures post-mutation state, changed fields show an arrow
(→) between entry and exit values:
ktstr_test 'demo_host_crash_auto_repro' [sched=scx-ktstr] failed:
scheduler died
--- auto-repro ---
=== AUTO-PROBE: scx_exit fired ===
ktstr_enqueue main.bpf.c:21
task_struct *p
pid 97
cpus_ptr 0xf(0-3)
dsq_id SCX_DSQ_INVALID
enq_flags NONE
slice 0
vtime 0
weight 100
sticky_cpu -1
scx_flags QUEUED|ENABLED
do_enqueue_task kernel/sched/ext.c:1344
rq *rq
cpu 1
task_struct *p
pid 97
cpus_ptr 0xf(0-3)
dsq_id SCX_DSQ_INVALID → SCX_DSQ_LOCAL
enq_flags NONE
slice 20000000
vtime 0
weight 100
sticky_cpu -1
scx_flags QUEUED|DEQD_FOR_SLEEP → QUEUED
After the probe data, the auto-repro section includes the repro VM duration and the last 40 lines of the repro VM’s scheduler log, sched_ext dump, and dmesg (each only when non-empty).
Demo test
A demo test has this shape (reduced from
demo_host_crash_auto_repro in tests/scenario_coverage.rs):
use std::time::Duration;

use ktstr::prelude::*;
use ktstr::test_support::{BpfMapWrite, KtstrTestEntry, run_ktstr_test};

// Four yield-heavy workers in one cgroup, held for a fixed 8 seconds.
fn scenario_yield_heavy(ctx: &Ctx) -> Result<AssertResult> {
    let steps = vec![Step::with_defs(
        vec![
            CgroupDef::named("demo_workers")
                .work_type(WorkType::YieldHeavy)
                .workers(4),
        ],
        HoldSpec::Fixed(Duration::from_secs(8)),
    )];
    execute_steps(ctx, steps)
}
Run manually to see full output:
cargo ktstr test --kernel ../linux -- --run-ignored ignored-only -E 'test(demo_host_crash_auto_repro)'
BPF Verifier
The verifier sweep boots every declared scheduler in a KVM VM and captures per-program verifier statistics from the real kernel verifier.
Design
The verifier sweep follows ktstr’s two core principles.
Fidelity without overhead. Each scheduler binary runs inside a VM on the same kernel the scheduler targets in production. The verifier that runs is the real verifier in the real kernel — no host-side BPF loading, no version skew between the host kernel’s verifier and the target kernel’s verifier.
Direct access over tooling layers. No subprocess to bpftool
or veristat. The host reads per-program verified_insns directly
from guest memory via bpf_prog_aux introspection and applies
cycle collapse to verifier logs instead of truncating.
Quick start
# Run every declared scheduler against the kernel discovered via
# KTSTR_KERNEL (or the cache).
cargo ktstr verifier
# Run against a specific kernel build.
cargo ktstr verifier --kernel ../linux
# Sweep across multiple kernels (each cell runs against its own).
cargo ktstr verifier --kernel 6.14.2 --kernel 6.15.0
# Print the raw verifier log without cycle collapse.
cargo ktstr verifier --raw
See cargo-ktstr verifier for the full flag list.
How it works
cargo ktstr verifier is a thin dispatcher around
cargo nextest run -E 'test(/^verifier/)'. The matrix that nextest
runs is generated by the test binaries themselves.
- Cell emission – every test binary that links the ktstr test support and contains at least one declare_scheduler! declaration emits a verifier/<sched>/<kernel>/<preset>: test listing for each (declared scheduler × kernel-list entry × accepted gauntlet topology preset) triple. Cells whose topology preset would exceed the host’s CPU / LLC / per-LLC capacity are filtered at emission time (mirroring the gauntlet variant filter).
- Per-cell dispatch – nextest invokes the test binary once per cell with --exact verifier/<sched>/<kernel>/<preset>. The binary’s #[ctor] intercepts the prefix, parses the cell name into its three components, resolves the scheduler binary, kernel directory, and topology, and boots a single VM dedicated to that cell.
- Verifier collection – inside the VM, the scheduler loads its BPF programs via scx_ops_load!; the real kernel verifier runs against them. The host reads per-program verified_insns from bpf_prog_aux via guest physical memory introspection. On load failure, libbpf prints the verifier log to stderr; the VM forwards it to the host via the bulk SHM port between ===SCHED_OUTPUT_START=== / ===SCHED_OUTPUT_END=== markers.
- Rendering – per-program summary lines, then the verifier log with cycle collapse applied (or raw, with --raw).
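The cell-name parsing in the per-cell dispatch step might look like the sketch below. The struct and function names are hypothetical, and the sketch assumes none of the three components contains a slash.

```rust
/// The three components of a verifier cell name (hypothetical type).
#[derive(Debug, PartialEq)]
struct VerifierCell<'a> {
    sched: &'a str,
    kernel: &'a str,
    preset: &'a str,
}

/// Parse "verifier/<sched>/<kernel>/<preset>"; anything else yields None.
fn parse_cell(name: &str) -> Option<VerifierCell<'_>> {
    let mut parts = name.strip_prefix("verifier/")?.split('/');
    let cell = VerifierCell {
        sched: parts.next()?,
        kernel: parts.next()?,
        preset: parts.next()?,
    };
    // Reject trailing components and empty segments.
    let has_empty = cell.sched.is_empty() || cell.kernel.is_empty() || cell.preset.is_empty();
    if parts.next().is_some() || has_empty {
        return None;
    }
    Some(cell)
}
```

Returning `None` for names without the verifier/ prefix lets ordinary test names fall through to the normal test path. The preset name `tiny` used in the checks for this sketch is made up.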
Eevdf + KernelBuiltin scheduler variants have no userspace
binary to load BPF programs from, so they are skipped at
cell-emission time. Direct invocation outside nextest
(--exact verifier/<eevdf-sched>/...) prints a SKIP banner
and exits 0.
Matrix dimension + per-scheduler filter
The verifier sweep matrix is driven by the operator’s --kernel
set, not by per-scheduler declare_scheduler! declarations. The
dispatcher always exports KTSTR_KERNEL_LIST
(label1=path1;label2=path2;...) to the nextest invocation —
even with no --kernel flag it synthesizes a single entry from
the auto-discovered kernel. The test binary’s lister walks that
list as the matrix dimension and emits one cell per (declared
scheduler × kernel-list entry × accepted preset).
Each scheduler’s declare_scheduler! kernels = [...]
declaration acts as a per-scheduler filter on the matrix:
- Empty (kernels = []) accepts every kernel-list entry: the scheduler verifies against everything the operator passes.
- Version specs ("6.14.2") match entries whose raw label equals the version (or whose sanitized label equals the sanitized form of the version).
- Range specs ("6.14..6.16", "6.14..=6.16") match entries whose raw version falls in the inclusive range, parsed via the same decompose_version_for_compare helper the operator-side range expansion uses.
- Path / CacheKey / Git specs match by sanitized-label equality.
# Scheduler declares kernels = ["6.14..6.16"]
# Operator runs --kernel 6.14.2 --kernel 6.15.0 --kernel 6.17.0
# Dispatcher's KTSTR_KERNEL_LIST: kernel_6_14_2, kernel_6_15_0,
# kernel_6_17_0
# Scheduler filter: 6.14.2 ∈ [6.14, 6.16] ✓
# 6.15.0 ∈ [6.14, 6.16] ✓
# 6.17.0 ∈ [6.14, 6.16] ✗ — rejected
# Cells emitted: verifier/<sched>/kernel_6_14_2/<preset>
# verifier/<sched>/kernel_6_15_0/<preset>
cargo ktstr verifier --kernel 6.14.2 --kernel 6.15.0 --kernel 6.17.0
# No --kernel: dispatcher auto-discovers one kernel via the
# cache + filesystem chain and synthesizes a single-entry
# KTSTR_KERNEL_LIST. The auto-discovered entry's label is
# derived from the resolved path (`kernel_path_<basename>_<hash6>`).
# Schedulers with non-empty `kernels = [...]` may filter the
# entry out — operators wanting deterministic coverage should
# always pass --kernel.
cargo ktstr verifier
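The Range filter in the first example can be approximated in a few lines. This sketch is not decompose_version_for_compare: both function names are hypothetical, and it assumes missing version components compare as zero, so a bound of 6.16 means 6.16.0 and a release such as 6.16.3 would fall outside the range here. The real helper may decide that case differently.

```rust
/// Parse "MAJOR[.MINOR[.PATCH]]" into a comparable tuple, padding
/// missing components with zero (an assumption of this sketch).
fn version_key(v: &str) -> Option<(u64, u64, u64)> {
    let mut it = v.split('.').map(|p| p.parse::<u64>());
    let major = it.next()?.ok()?;
    let minor = match it.next() {
        Some(p) => p.ok()?,
        None => 0,
    };
    let patch = match it.next() {
        Some(p) => p.ok()?,
        None => 0,
    };
    Some((major, minor, patch))
}

/// Does `version` fall inside an inclusive "LO..HI" or "LO..=HI" spec?
/// Returns None when the spec is not a range or a version fails to parse.
fn range_accepts(spec: &str, version: &str) -> Option<bool> {
    let (lo, hi) = spec.split_once("..")?;
    let hi = hi.strip_prefix('=').unwrap_or(hi);
    let v = version_key(version)?;
    Some(version_key(lo)? <= v && v <= version_key(hi)?)
}
```

Under these assumptions 6.14.2 and 6.15.0 are accepted by "6.14..6.16" and 6.17.0 is rejected, matching the walkthrough above.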
The verifier cell handler resolves the per-cell kernel
directory by looking up the cell’s sanitized label in
KTSTR_KERNEL_LIST — there is no single-kernel fallback that
would silently run a cell against an unrelated kernel. A label
that doesn’t appear in the list errors out with an actionable
diagnostic naming the present labels and pointing at both fix
paths (add --kernel <SPEC> or drop the matching entry from
declare_scheduler!).
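A minimal sketch of that lookup, assuming only the label1=path1;label2=path2 format described earlier. The function names, the example paths, and the exact error wording are hypothetical.

```rust
use std::path::PathBuf;

/// Split a KTSTR_KERNEL_LIST value into (label, path) pairs.
fn parse_kernel_list(raw: &str) -> Vec<(String, PathBuf)> {
    raw.split(';')
        .filter(|entry| !entry.is_empty())
        .filter_map(|entry| entry.split_once('='))
        .map(|(label, path)| (label.to_string(), PathBuf::from(path)))
        .collect()
}

/// Resolve a cell's kernel directory by label. There is deliberately no
/// fallback: an unknown label is an error that names the labels present.
fn resolve_cell_kernel(raw: &str, label: &str) -> Result<PathBuf, String> {
    let entries = parse_kernel_list(raw);
    match entries.iter().find(|(l, _)| l.as_str() == label) {
        Some((_, path)) => Ok(path.clone()),
        None => {
            let present: Vec<&str> = entries.iter().map(|(l, _)| l.as_str()).collect();
            Err(format!(
                "kernel label '{label}' not in KTSTR_KERNEL_LIST (present: {}); \
                 add --kernel <SPEC> or drop the matching entry from declare_scheduler!",
                present.join(", ")
            ))
        }
    }
}
```

Returning an error instead of the first entry is what prevents a cell from silently running against an unrelated kernel.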
Output
Brief (default)
Per-program summary line:
ktstr_enqueue verified_insns=500
verified_insns is the number of instructions the kernel
verifier processed, read from bpf_prog_aux via host-side
memory introspection.
On load failure, the scheduler-log section shows libbpf’s verifier output with cycle collapse applied — repeating loop iterations are reduced to the first iteration, an omission marker, and the last iteration:
--- 8x of the following 10 lines ---
100: (bf) r0 = r1 ; frame1: R0_w=scalar(id=0,umin=0)
101: (bf) r1 = r2 ; frame1: R1_w=scalar(id=1,umin=1)
...
--- 6 identical iterations omitted ---
100: (bf) r0 = r1 ; frame1: R0_w=scalar(id=70,umin=700)
101: (bf) r1 = r2 ; frame1: R1_w=scalar(id=71,umin=701)
...
--- end repeat ---
Raw (--raw)
Full raw verifier log without cycle collapse. Use for
debugging verification failures where the exact register
state at each iteration matters. The flag exports
KTSTR_VERIFIER_RAW=1 for the nextest invocation; the
in-binary cell handler reads it via env::var_os and switches
the format_verifier_output rendering branch.
Cycle collapse algorithm
The kernel verifier unrolls loops by re-verifying each instruction with updated register states. A bounded loop of 8 instructions verified 100 times produces 800 near-identical lines — differing only in register-state annotations. Naive truncation loses context. Cycle collapse preserves structure: the first iteration shows what the loop does, the last shows the final state, and a count tells you how many iterations were elided.
The algorithm normalizes lines by stripping variable annotations, then detects repeating blocks:
- Normalize – strip ; frame1: R0_w=... annotations, standalone register dumps (3041: R0=scalar()), and inline branch-target state after goto pc+N. Source comments (; for (int j = 0; ...)) are preserved as cycle anchors.
- Detect – find the most frequent normalized line (the “anchor”), compute gaps between anchor occurrences to determine the cycle period, then verify consecutive blocks match after normalization. Minimum period: 5 lines. Minimum repetitions: 3.
- Collapse – replace the cycle with the first iteration, an omission count, and the last iteration. Run iteratively (up to 5 passes) to handle nested loops.
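The three steps can be sketched in a single simplified pass. This is an illustration under stated assumptions, not ktstr's implementation: all function names are hypothetical, normalization only strips a trailing " ; ..." annotation (no standalone register dumps or goto state), the period is taken from the first two anchor occurrences, and there is no iteration for nested loops.

```rust
use std::collections::HashMap;

const MIN_PERIOD: usize = 5; // minimum cycle length in lines
const MIN_REPS: usize = 3; // minimum repetitions before collapsing

/// Strip a trailing " ; ..." register-state annotation.
fn normalize(line: &str) -> &str {
    match line.find(" ; ") {
        Some(pos) => line[..pos].trim_end(),
        None => line.trim_end(),
    }
}

/// Find (start, period, reps) of the first repeating run, if any.
fn detect_cycle(norm: &[&str]) -> Option<(usize, usize, usize)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &line in norm {
        *counts.entry(line).or_insert(0) += 1;
    }
    // Anchor: the most frequent normalized line.
    let anchor = norm.iter().copied().max_by_key(|line| counts[line])?;
    let hits: Vec<usize> = (0..norm.len()).filter(|&i| norm[i] == anchor).collect();
    // Period: gap between the first two anchor occurrences (simplification).
    let period = hits.get(1)? - hits[0];
    if period < MIN_PERIOD {
        return None;
    }
    // Verify: earliest start whose consecutive blocks all match.
    for start in 0..norm.len() {
        let block = norm.get(start..start + period)?;
        let reps = norm[start..]
            .chunks(period)
            .take_while(|chunk| *chunk == block)
            .count();
        if reps >= MIN_REPS {
            return Some((start, period, reps));
        }
    }
    None
}

/// Replace a detected cycle with its first and last iteration.
fn collapse(log: &str) -> String {
    let lines: Vec<&str> = log.lines().collect();
    let norm: Vec<&str> = lines.iter().map(|&l| normalize(l)).collect();
    let (start, period, reps) = match detect_cycle(&norm) {
        Some(cycle) => cycle,
        None => return lines.join("\n"),
    };
    let end = start + period * reps;
    let header = format!("--- {reps}x of the following {period} lines ---");
    let omitted = format!("--- {} identical iterations omitted ---", reps - 2);
    let mut out: Vec<&str> = lines[..start].to_vec();
    out.push(&header);
    out.extend_from_slice(&lines[start..start + period]);
    out.push(&omitted);
    out.extend_from_slice(&lines[end - period..end]);
    out.push("--- end repeat ---");
    out.extend_from_slice(&lines[end..]);
    out.join("\n")
}
```

Detection runs on the normalized lines, but the output is built from the original lines, so the first and last iterations keep their register-state annotations.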
scx-ktstr test flags
scx-ktstr supports these flags to exercise the verifier pipeline:
--fail-verify — sets a .rodata variable before
scx_ops_load!, enabling a code path the BPF verifier
rejects. On failure, libbpf prints the verifier log to stderr.
--verify-loop — sets a .rodata variable that enables
an unrolled loop followed by while(1) in ktstr_dispatch.
The verifier rejects the infinite loop and libbpf prints the
full instruction trace to stderr, exercising cycle collapse.
Core Concepts
ktstr tests compose from four layers:
- Scenarios – what to test: cgroup layout, CPU partitioning, workloads, custom logic.
- Flags – which scheduler features to enable for each run.
- WorkSpec types – what each worker process does: CPU spin, yield, I/O, bursty patterns, pipe-based IPC.
- Checking – how to evaluate results: starvation, fairness, isolation, scheduling gaps, monitor thresholds.
These compose orthogonally. A scenario runs with every valid flag combination, and checks apply uniformly across all runs.
Four supporting concepts complete the picture:
Ops and Steps is the primary API for defining
scenarios – most tests use CgroupDef and execute_defs from
this module. TestTopology provides CPU and
LLC layout for cpuset partitioning.
Performance Mode applies host-side
isolation for noise-sensitive measurements.
Resource Budget describes the
--cpu-cap tier that coordinates concurrent no-perf-mode VMs and
kernel builds via LLC flocks and cgroup v2 cpuset sandboxes.
Scenarios
Scenarios define the scheduling conditions a test creates. Each scenario sets up cgroups, workers, and cpusets to produce a specific condition, then verifies the scheduler handles it correctly.
Canned scenarios (scenarios::*)
ktstr::scenario::scenarios provides curated scenario functions that
can be called directly from #[ktstr_test]:
use ktstr::prelude::*;
#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
scenarios::steady(ctx)
}
| Function | Condition tested | Setup |
|---|---|---|
| steady | Baseline fairness | 2 cgroups, no cpusets, equal CPU-spin load |
| steady_llc | LLC-boundary scheduling | 2 cgroups with LLC-aligned cpusets |
| oversubscribed | Dispatch under oversubscription | 2 cgroups, 32 mixed workers each |
| cpuset_apply | Cpuset assignment on running tasks | Disjoint cpusets applied mid-run |
| cpuset_clear | Cpuset removal on confined tasks | Cpusets cleared mid-run |
| cpuset_resize | Cpuset resizing adaptation | Cpusets shrink then grow |
| cgroup_add | New cgroup appearance | Cgroups added mid-run |
| cgroup_remove | Cgroup removal while others run | Cgroups removed mid-run |
| affinity_change | Affinity mask changes | Worker affinities randomized mid-run |
| affinity_pinned | Narrow-affinity contention | Workers pinned to 2-CPU subset |
| host_contention | Fairness between cgroup and host tasks | Host workers vs cgroup workers |
| mixed_workloads | Mixed workload fairness | Heavy + bursty + IO cgroups |
| nested_steady | Nested cgroup hierarchy | Workers in nested sub-cgroups |
| nested_task_move | Cross-level task migration | Tasks moved between nested cgroups |
Additional custom_* functions are available in
ktstr::scenario::{affinity, basic, cpuset, dynamic, interaction, nested, performance, stress}. See the
API docs
for the full list.
Most tests use these canned functions or build custom scenarios with
CgroupDef and execute_defs / execute_steps (see
Ops and Steps). Custom scenarios receive a Ctx reference
and use the same building blocks; see
Custom Scenarios for the
Ctx struct and helper functions.
WorkType
WorkType controls what each worker process does during a scenario.
The WorkType enum in ktstr::workload is the source of truth.
The variants below are grouped by intent; each one-line summary is
the leading sentence of the variant’s rustdoc. Run cargo doc --open
for full per-variant semantics, parameter ranges, and kernel-path
citations — this page reproduces only the high-level shape.
pub enum WorkType {
// CPU primitives
SpinWait, // Tight CPU spin loop (1024 iterations per cycle).
YieldHeavy, // Repeated sched_yield with minimal CPU work.
Mixed, // CPU spin burst followed by sched_yield.
AluHot { width: AluWidth }, // Dependent integer multiply chain at high IPC (>= 2.0); optional SIMD width.
SmtSiblingSpin, // Tight PAUSE-spin from a paired worker pinned to two SMT siblings.
IpcVariance { // Alternating high-IPC (multiplies) / low-IPC (random cache touches) phases.
hot_iters: u64,
cold_iters: u64,
period_iters: u64,
},
// Block-device I/O (operates on /dev/vda; falls back to per-worker tempfile when absent)
IoSyncWrite, // 16 x 4 KB pwrites + fdatasync per iteration (O_SYNC).
IoRandRead, // Single 4 KB pread at a random sector-aligned offset (O_DIRECT).
IoConvoy, // Interleaved sequential pwrite + random pread with periodic fdatasync (O_DIRECT).
// Burst-and-sleep
Bursty { // CPU burst for `burst_duration`, sleep for `sleep_duration`, repeat.
burst_duration: Duration,
sleep_duration: Duration,
},
IdleChurn { // CPU burst then `nanosleep` (exercises hrtimer + idle-class path).
burst_duration: Duration,
sleep_duration: Duration,
precise_timing: bool,
},
// Cache pressure
CachePressure { size_kb: usize, stride: usize }, // Strided RMW sized to pressure L1.
CacheYield { size_kb: usize, stride: usize }, // Cache pressure burst then sched_yield().
// Wake-placement / cross-CPU paths
PipeIo { burst_iters: u64 }, // CPU burst then 1-byte pipe exchange with a partner worker.
FutexPingPong { spin_iters: u64 }, // Paired futex wait/wake between partner workers (non-WF_SYNC).
CachePipe { size_kb: usize, burst_iters: u64 }, // Cache-hot working set + pipe wake.
FutexFanOut { fan_out: usize, spin_iters: u64 }, // 1:N fan-out wake (one messenger, N receivers).
FanOutCompute { // Messenger/worker fan-out with matrix-multiply compute per receiver.
fan_out: usize,
cache_footprint_kb: usize,
operations: usize,
sleep_usec: u64,
},
AsymmetricWaker { // Paired workers in mismatched scheduling classes share one futex word.
waker_class: SchedClass,
wakee_class: SchedClass,
burst_iters: u64,
},
WakeChain { // Ring of waker-wakee hops via Pipe (WF_SYNC) or Futex wake.
depth: usize,
wake: WakeMechanism,
work_per_hop: Duration,
},
EpollStorm { // eventfd producers + epoll_wait consumers (exclusive autoremove wake).
producers: usize,
consumers: usize,
events_per_burst: u64,
},
ThunderingHerd { // N waiters on ONE global futex word; broadcast FUTEX_WAKE rouses the herd.
waiters: usize,
batches: u64,
inter_batch_ms: u64,
},
// Compound / sequence
Sequence { first: Phase, rest: Vec<Phase> }, // Loop through ordered phases (Spin / Sleep / Yield / Io).
// Lifecycle / scheduling-class churn
ForkExit, // Rapid fork+_exit cycling; parent waitpid's then repeats.
NiceSweep, // Cycle nice level from -20 to 19 across iterations.
AffinityChurn { spin_iters: u64 }, // Rapid self-directed sched_setaffinity to random CPUs.
PolicyChurn { spin_iters: u64 }, // Cycle SCHED_OTHER -> BATCH -> IDLE (-> FIFO/RR if CAP_SYS_NICE).
NumaMigrationChurn { period_ms: u64 }, // Rotate sched_setaffinity across NUMA nodes.
CgroupChurn { groups: usize, cycle_ms: u64 }, // Cycle cgroup membership between sibling cgroups.
// Memory pressure / NUMA
PageFaultChurn { // mmap NOHUGEPAGE -> touch random pages -> MADV_DONTNEED, repeat.
region_kb: usize,
touches_per_cycle: usize,
spin_iters: u64,
},
NumaWorkingSetSweep { // Rotate the working-set memory across NUMA nodes via mbind.
region_kb: usize,
sweep_period_ms: u64,
target_nodes: Vec<usize>,
},
// Lock contention
MutexContention { // N-way futex mutex contention (CAS acquire / FUTEX_WAIT on failure).
contenders: usize,
hold_iters: u64,
work_iters: u64,
},
PriorityInversion { // Three priority tiers contending for one shared lock (Pi or Plain futex).
high_count: usize,
medium_count: usize,
low_count: usize,
hold_iters: u64,
work_iters: u64,
pi_mode: FutexLockMode,
},
// Producer/consumer + signal/preempt pressure
ProducerConsumerImbalance { // Producer / consumer pipeline with deliberately-unbalanced rates.
producers: usize,
consumers: usize,
produce_rate_hz: u64,
consume_iters: u64,
queue_depth_target: u64,
},
SignalStorm { // Paired workers fire tkill(partner, SIGUSR1) between CPU bursts.
signals_per_iter: u64,
work_iters: u64,
},
PreemptStorm { // One SCHED_FIFO worker preempts CFS spinners on the same CPU at ~kHz rate.
cfs_workers: usize,
rt_burst_iters: u64,
rt_sleep_us: u64,
},
RtStarvation { // SCHED_FIFO workers monopolise the CPU at 100%; CFS workers starve.
rt_workers: usize,
cfs_workers: usize,
rt_priority: i32,
burst_iters: u64,
},
// User-supplied
Custom { // User-supplied work function (name + fn pointer).
name: String,
run: fn(&AtomicBool) -> WorkerReport,
},
}
Imports:
WorkType, Phase, SchedPolicy, WorkSpec, and WorkloadConfig are in ktstr::prelude::*. The auxiliary enums FutexLockMode (used by PriorityInversion::pi_mode), WakeMechanism (used by WakeChain::wake), and SchedClass (used by AsymmetricWaker) live under ktstr::workload. Bring them into scope with use ktstr::workload::*; (or import each by name) before writing variant literals that reference them.
Parameterized variants have snake-case convenience constructors —
e.g. WorkType::bursty(burst_duration, sleep_duration),
WorkType::pipe_io(burst_iters),
WorkType::cache_pressure(size_kb, stride),
WorkType::page_fault_churn(region_kb, touches_per_cycle, spin_iters),
WorkType::mutex_contention(contenders, hold_iters, work_iters),
WorkType::priority_inversion(high_count, medium_count, low_count, hold_iters, work_iters, pi_mode),
WorkType::wake_chain(depth, wake, work_per_hop),
WorkType::custom(name, run). Every parameterised variant has one;
see cargo doc --open on WorkType for the full constructor list
and parameter validation rules.
Bursty, IdleChurn, and WakeChain take Duration parameters
(humantime-serialised in captured configs) — pass
Duration::from_millis(N) or
Duration::from_micros(N) from std::time rather than raw integers.
IpcVariance, ProducerConsumerImbalance, RtStarvation,
PriorityInversion, EpollStorm, PreemptStorm, and
ThunderingHerd reject zero-valued counters at spawn time
(WorkTypeValidationError::*).
Choosing a work type
| Scheduler behavior to test | Recommended work type |
|---|---|
| Basic load balancing / fairness | SpinWait (default) |
| Wake placement / sleep-wake cycles | YieldHeavy, FutexPingPong |
| CPU borrowing / idle balance | Bursty |
| Cross-CPU wake latency | PipeIo, CachePipe |
| Cache-aware scheduling | CachePressure, CacheYield |
| Cache-aware fan-out wake latency | FanOutCompute |
| Fan-out wake storms | FutexFanOut |
| Mixed real-world patterns | Sequence |
| Task creation/destruction pressure | ForkExit |
| Priority reweighting / nice dynamics | NiceSweep |
| Rapid CPU migration / affinity churn | AffinityChurn |
| Scheduling class transitions | PolicyChurn |
| Page fault / TLB pressure | PageFaultChurn |
| Lock contention / convoy effect | MutexContention |
| Arbitrary user-defined workload | Custom |
Variants
SpinWait – tight spin loop with spin_loop() hints. 1024
iterations per check. Pure CPU-bound workload.
YieldHeavy – thread::yield_now() on every iteration. Exercises
scheduler wake/sleep paths.
Mixed – 1024 spin iterations then yield. Combines CPU and
voluntary preemption.
IoSyncWrite – 16 × 4 KB pwrites totaling 64 KB at the worker’s
stripe offset (per-worker striping prevents fdatasync from coalescing
across writers), then fdatasync(). Drives fsync-heavy D-state cycles.
Opens /dev/vda with O_SYNC; falls back to a per-worker tempfile
when /dev/vda is absent (host-side unit tests).
IoRandRead – single 4 KB pread at a sector-aligned random
offset within the device capacity. Opens /dev/vda with O_DIRECT
(tempfile fallback); drives high-IOPS short-D-state cycles. Per-worker
xorshift PRNG seeded from tid.
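A minimal sketch of the offset selection, assuming a 512-byte sector and the common 13/7/17 xorshift64 shift triple. Neither value is stated above, and the type and function names are hypothetical.

```rust
const SECTOR: u64 = 512; // assumed alignment unit
const READ_LEN: u64 = 4096; // one 4 KB pread

/// Marsaglia-style xorshift64 generator (shift triple is an assumption).
struct XorShift64(u64);

impl XorShift64 {
    /// Seed from a per-worker value such as the tid; zero is remapped
    /// because an all-zero state would never change.
    fn new(seed: u64) -> Self {
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Pick a sector-aligned offset such that a READ_LEN read stays inside
/// `capacity`. Returns None when the device is smaller than one read.
fn random_read_offset(rng: &mut XorShift64, capacity: u64) -> Option<u64> {
    let last_start = capacity.checked_sub(READ_LEN)?;
    let sectors = last_start / SECTOR + 1;
    Some((rng.next_u64() % sectors) * SECTOR)
}
```

Seeding per worker keeps each worker's offset stream independent without any shared state.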
IoConvoy – alternates 4 KB pwrite at the worker’s monotonic
sequential cursor with 4 KB pread at a random offset; fdatasync()
runs every 16 iterations. /dev/vda opened O_DIRECT (tempfile
fallback). Currently uses direct IO so the pathology surface is the
synchronous flush + IO-mix latency rather than page-cache convoy
build-up.
Bursty – CPU burst for burst_duration, sleep for
sleep_duration, repeat. Both fields are Duration (humantime-
serialised); pass Duration::from_millis(N) from std::time. Frees
CPUs during sleep, exercising CPU borrowing.
PipeIo – CPU burst then 1-byte pipe exchange with a partner
worker. Workers are paired: (0,1), (2,3), etc. Sleep duration depends
on partner scheduling, exercising cross-CPU wake placement. Requires
even num_workers.
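The (0,1), (2,3) pairing amounts to flipping the lowest bit of the worker index. A small sketch, with a hypothetical function name:

```rust
/// Partner index under the (0,1), (2,3), ... pairing.
/// Returns None for an odd worker count or an out-of-range index.
fn partner(worker_idx: usize, num_workers: usize) -> Option<usize> {
    if num_workers % 2 != 0 || worker_idx >= num_workers {
        return None;
    }
    // Flipping bit 0 maps 0<->1, 2<->3, and so on.
    Some(worker_idx ^ 1)
}
```

The odd-count check mirrors the "requires even num_workers" rule: with an odd count the last worker would have no partner to exchange bytes with.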
FutexPingPong – paired futex wait/wake between partner workers.
Each iteration does spin_iters of CPU work then wakes the partner
and waits on a shared futex word. Exercises the non-WF_SYNC wake path.
Requires even num_workers.
CachePressure – strided read-modify-write over a buffer sized
to pressure the L1 cache. Each worker allocates its own buffer
post-fork. size_kb controls buffer size, stride controls the byte
step between accesses.
CacheYield – cache pressure followed by sched_yield(). Tests
scheduler re-placement after voluntary yield with a cache-hot working set.
CachePipe – cache pressure burst then 1-byte pipe exchange with
a partner worker. Combines cache-hot working set with cross-CPU wake
placement. Requires even num_workers.
FutexFanOut – 1:N fan-out wake pattern without cache pressure.
One messenger per group does spin_iters of CPU spin work then wakes
fan_out receivers via FUTEX_WAKE. Receivers measure wake-to-run
latency. For cache-aware fan-out with matrix multiply work, see
FanOutCompute. Requires num_workers divisible by fan_out + 1.
FanOutCompute – messenger/worker fan-out with compute work. One
messenger per group stamps a CLOCK_MONOTONIC timestamp then wakes
fan_out workers via FUTEX_WAKE. Workers measure wake-to-run latency
(time from messenger’s timestamp to worker getting the CPU), sleep for
sleep_usec microseconds (simulating think time), then do operations
iterations of naive matrix multiply over a cache_footprint_kb-sized
working set (three square matrices of u64, O(n^3)). Requires
num_workers divisible by fan_out + 1.
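The divisibility rule follows from the group shape: each group holds one messenger plus fan_out receivers. The sketch below assigns roles by index. Treating the first worker of each group as the messenger is an assumption, and the type and function names are hypothetical.

```rust
/// Role of a worker inside its fan-out group (hypothetical type).
#[derive(Debug, PartialEq)]
enum Role {
    Messenger,
    Receiver,
}

/// Map a worker index to (group index, role) for groups of fan_out + 1.
fn fan_out_role(
    worker_idx: usize,
    num_workers: usize,
    fan_out: usize,
) -> Result<(usize, Role), String> {
    let group_size = fan_out + 1;
    if num_workers % group_size != 0 {
        return Err(format!(
            "num_workers ({num_workers}) must be divisible by fan_out + 1 ({group_size})"
        ));
    }
    let role = if worker_idx % group_size == 0 {
        Role::Messenger
    } else {
        Role::Receiver
    };
    Ok((worker_idx / group_size, role))
}
```

With 8 workers and fan_out = 3 this gives two groups of four: workers 0 and 4 are messengers, the rest are receivers.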
Sequence – compound work pattern: loop through phases in order,
repeat. Each phase runs for its specified duration before the next
starts. Phases are defined via the Phase enum:
- Phase::Spin(Duration) – CPU spin for the given duration.
- Phase::Sleep(Duration) – thread::sleep for the given duration.
- Phase::Yield(Duration) – repeated sched_yield for the given duration.
- Phase::Io(Duration) – simulated I/O (write 64 KB + 100 us sleep) for the given duration.
Sequence cannot be constructed via WorkType::from_name() because
it requires explicit phase definitions. Build it directly:
WorkType::Sequence {
first: Phase::Spin(Duration::from_millis(100)),
rest: vec![
Phase::Sleep(Duration::from_millis(50)),
Phase::Yield(Duration::from_millis(20)),
],
}
ForkExit – rapid fork+_exit cycling. Each iteration forks a
child that immediately calls _exit(0). The parent waitpids then
repeats. Exercises wake_up_new_task, do_exit, and
wait_task_zombie.
NiceSweep – cycles the worker’s nice level from -20 to 19
across iterations. Each iteration: 512-iteration spin burst,
setpriority(PRIO_PROCESS, 0, nice_val), then sched_yield. Exercises
reweight_task and dynamic priority reweighting. Skips negative nice values
when CAP_SYS_NICE is absent. Resets nice to 0 before exit. Records
per-yield wake latency.
AffinityChurn – rapid self-directed CPU affinity changes. Each
iteration: spin_iters spin burst, sched_setaffinity to a random CPU
from the effective cpuset, then sched_yield. Exercises
affine_move_task and migration_cpu_stop. Records per-yield wake
latency.
PolicyChurn – cycles through scheduling policies each iteration.
Each iteration: spin_iters spin burst, sched_setscheduler to the
next policy in the sequence, then sched_yield. Cycles through
SCHED_OTHER, SCHED_BATCH, SCHED_IDLE (and SCHED_FIFO/SCHED_RR
with priority 1 when CAP_SYS_NICE is available). Exercises
__sched_setscheduler and scheduling class transitions. Resets to
SCHED_OTHER before exit. Records per-yield wake latency.
PageFaultChurn – rapid page fault cycling. Workers mmap a
region_kb KB region with MADV_NOHUGEPAGE (forcing 4 KB pages),
touch touches_per_cycle random pages via write faults through
do_anonymous_page, then MADV_DONTNEED to zap PTEs and repeat.
spin_iters iterations of CPU work separate cycles. Exercises
the page allocator, TLB pressure on migration, and rapid user/kernel
transitions. Uses xorshift64 PRNG for random page selection (seeded
from the process ID).
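For readers unfamiliar with xorshift64, the following std-only sketch shows one common formulation and how it could drive random page selection. The 13/7/17 shift triple, the pick_pages helper, and the fixed 4 KB page size are assumptions for illustration; the source does not state which shift constants ktstr uses.

```rust
/// One xorshift64 step using the widely used 13/7/17 shift triple
/// (an assumption here). The state must be non-zero.
fn xorshift64(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

/// Hypothetical helper: choose `touches` page indices inside a region of
/// `region_kb` KB, assuming 4 KB pages.
fn pick_pages(seed: u64, region_kb: usize, touches: usize) -> Vec<usize> {
    const PAGE_KB: usize = 4;
    let num_pages = (region_kb / PAGE_KB).max(1) as u64;
    // Zero is a fixed point of xorshift, so bump a zero seed to 1.
    let mut state = seed.max(1);
    (0..touches)
        .map(|_| (xorshift64(&mut state) % num_pages) as usize)
        .collect()
}

fn main() {
    // Seeded from the process ID, as the description above suggests.
    let seed = u64::from(std::process::id());
    let pages = pick_pages(seed, 4096, 8);
    println!("first pages to touch: {pages:?}");
}
```

Because the generator is deterministic, two workers seeded with the same value would touch the same page sequence, which is one reason a per-process seed is useful.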
MutexContention – N-way futex mutex contention. contenders
workers per group contend on a shared AtomicU32 via CAS acquire
(FUTEX_WAIT on failure). Loop: spin_burst(work_iters) then CAS
acquire, spin_burst(hold_iters) in the critical section, then
store 0 + FUTEX_WAKE(1) to release. Exercises convoy effect,
lock-holder preemption cascading stalls, and futex wait/wake
contention paths. Requires num_workers divisible by contenders.
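The acquire/hold/release loop can be sketched with std threads alone. This is a simplified model, not ktstr's worker code: it uses threads instead of forked processes, and thread::yield_now stands in for FUTEX_WAIT / FUTEX_WAKE, which std does not expose. The names acquire, release, and contend are hypothetical.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// CAS acquire: 0 = unlocked, 1 = locked.
fn acquire(lock: &AtomicU32) {
    while lock
        .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {
        // Stand-in for FUTEX_WAIT on the lock word.
        thread::yield_now();
    }
}

/// Release: store 0. A futex-based version would follow with FUTEX_WAKE(1).
fn release(lock: &AtomicU32) {
    lock.store(0, Ordering::Release);
}

/// Run `contenders` threads that each enter the critical section `rounds`
/// times, and return how many critical sections were executed in total.
fn contend(contenders: usize, rounds: u64) -> u64 {
    let lock = Arc::new(AtomicU32::new(0));
    let shared = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..contenders)
        .map(|_| {
            let lock = Arc::clone(&lock);
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                for _ in 0..rounds {
                    acquire(&lock);
                    // Split read-modify-write: only correct while the lock is held.
                    let v = shared.load(Ordering::Relaxed);
                    shared.store(v + 1, Ordering::Relaxed);
                    release(&lock);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().expect("contender thread panicked");
    }
    shared.load(Ordering::Relaxed)
}

fn main() {
    println!("critical sections run: {}", contend(4, 1_000));
}
```

The split load/store inside the critical section would lose updates without mutual exclusion, so a total equal to contenders times rounds indicates the lock held.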
Custom – user-supplied work function. The run function pointer
receives a reference to the stop flag (&AtomicBool, set by SIGUSR1)
and returns a WorkerReport when the flag becomes true. The
framework handles fork, cgroup placement, affinity, scheduling policy,
and signal setup; the user function owns the work loop and all
WorkerReport field population. Framework telemetry (migration
tracking, gap detection, schedstat deltas, iteration counter updates)
is not provided – the user function is responsible for any telemetry
it needs.
Warning: pgid SIGKILL sweep on teardown. Every worker process
calls setpgid(0, 0) immediately after fork, so the worker and any
children a Custom function spawns share a single process group.
At teardown, stop_and_collect issues killpg(worker_pid, SIGKILL)
on both the graceful-exit and StillAlive-escalation paths, and
WorkloadHandle::drop issues another killpg during handle
destruction. Every descendant that inherits the worker’s pgid
(a helper binary via execv, a subshell via sh -c, a test
fixture the function forks to drive the scheduler) will be
SIGKILLed at teardown. Custom functions that need a child to outlive
the worker must either detach it from the worker’s pgid (call
setpgid(child_pid, 0) after fork) or wait on it explicitly
before returning the WorkerReport.
Function pointers (fn(&AtomicBool) -> WorkerReport) are fork-safe
because they carry no captured state across the fork boundary. Closures
are not supported. Cannot be constructed via WorkType::from_name().
use std::sync::atomic::{AtomicBool, Ordering};
use ktstr::workload::{WorkType, WorkerReport};

fn my_workload(stop: &AtomicBool) -> WorkerReport {
    // `tid` in `WorkerReport` is an `i32` (libc::pid_t). Using
    // `std::process::id() as i32` avoids a direct `libc` dependency in
    // the consumer crate; inside ktstr the two produce the same value
    // because one worker = one process (no threads).
    let tid: i32 = std::process::id() as i32;
    let start = std::time::Instant::now();
    let mut work_units = 0u64;
    while !stop.load(Ordering::Relaxed) {
        // ... custom work ...
        work_units += 1;
    }
    let wall_time_ns = start.elapsed().as_nanos() as u64;
    // Start from `WorkerReport::default()` so the fields you don't
    // populate take their zero / empty values automatically and new
    // fields added to `WorkerReport` in the future do not require an
    // edit here. Only populate the telemetry your custom workload
    // actually produces.
    WorkerReport {
        tid,
        work_units,
        wall_time_ns,
        iterations: work_units,
        ..WorkerReport::default()
    }
}

let wt = WorkType::custom("my_workload", my_workload);
Grouped work types
PipeIo, FutexPingPong, and CachePipe require num_workers
divisible by 2 (paired). FutexFanOut and FanOutCompute require
num_workers divisible by fan_out + 1 (1 messenger + N receivers per
group). MutexContention requires num_workers divisible by
contenders. WorkType::worker_group_size() returns the group size
for these variants, or None for ungrouped types. PipeIo and
CachePipe use pipes; FutexPingPong, FutexFanOut, FanOutCompute,
and MutexContention use shared mmap pages with futex wait/wake.
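The divisibility rules above can be summarized in a few lines. The sketch below is a std-only model: GroupedKind, group_size, and check_worker_count are hypothetical names that mirror the documented rules, not the real WorkType or WorkType::worker_group_size().

```rust
/// Hypothetical stand-in for the grouped work types described above.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
enum GroupedKind {
    SpinWait,
    PipeIo,
    FutexPingPong,
    CachePipe,
    FutexFanOut { fan_out: usize },
    FanOutCompute { fan_out: usize },
    MutexContention { contenders: usize },
}

/// Group size per the documented rules; `None` for ungrouped types.
fn group_size(kind: GroupedKind) -> Option<usize> {
    match kind {
        GroupedKind::SpinWait => None,
        GroupedKind::PipeIo | GroupedKind::FutexPingPong | GroupedKind::CachePipe => Some(2),
        // One messenger plus `fan_out` receivers per group.
        GroupedKind::FutexFanOut { fan_out } | GroupedKind::FanOutCompute { fan_out } => {
            Some(fan_out + 1)
        }
        GroupedKind::MutexContention { contenders } => Some(contenders),
    }
}

/// Reject worker counts that cannot be split into whole groups.
fn check_worker_count(kind: GroupedKind, num_workers: usize) -> Result<(), String> {
    match group_size(kind) {
        None => Ok(()),
        Some(0) => Err("group size must be non-zero".to_string()),
        Some(size) if num_workers % size == 0 => Ok(()),
        Some(size) => Err(format!(
            "{num_workers} workers is not divisible by group size {size}"
        )),
    }
}

fn main() {
    let kind = GroupedKind::FutexFanOut { fan_out: 4 };
    println!("{:?}", check_worker_count(kind, 10)); // two groups of five
    println!("{:?}", check_worker_count(kind, 8));
}
```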
Clone-mode and pcomm interactions
CloneMode is a per-WorkloadConfig enum with two variants:
Fork (the default; each worker is its own thread group, reaped
via waitpid) and Thread (workers share the parent’s tgid, run
as std::thread::spawn threads, reaped via JoinHandle).
pcomm is not a CloneMode variant. It is a WorkSpec
field set via WorkSpec::pcomm(name) or CgroupDef::pcomm(name)
(covered in the tutorial).
When a WorkSpec carries pcomm = Some(name), apply_setup
routes it through the fork-then-thread spawn path: ONE forked
thread-group leader whose task->comm is name hosts every
matching worker as a pthread-style thread under that leader.
Workers sharing a pcomm value coalesce into one container; this
combines the per-process-leader visibility schedulers expect (a
chrome parent, a java parent) with the in-process
std::thread::spawn dispatch shape CloneMode::Thread already
uses for the worker bodies themselves.
PipeIo and CachePipe work correctly inside a pcomm container.
When workers run as threads inside one forked leader, the per-pair
pipe-fd indices computed in the global pipe_pairs table are
addressed by each worker’s position WITHIN the container’s thread
group, so worker A reads its partner’s write end whether the pair
lives in two forked processes (Fork mode) or in two threads of
one pcomm container.
SignalStorm uses tkill(partner_tid, SIGUSR1) (per-task
signal delivery, PIDTYPE_PID), NOT kill (per-tgid,
PIDTYPE_TGID) and NOT tgkill(self_tgid, partner_tid, …)
(would return ESRCH under Fork mode because each forked worker
is its own tgid leader). tkill looks up the target via
find_task_by_vpid(pid) and skips the tgid check, so the signal
hits the partner thread’s per-task pending queue under Fork and
Thread modes uniformly — including inside pcomm-coalesced
thread groups. Sibling threads in a pcomm container do NOT dequeue
each other’s SignalStorm signals because the PIDTYPE_PID queue
is per-task, not per-tgid.
Default values
WorkType::from_name() uses these defaults:
- Bursty: burst_duration=50ms, sleep_duration=100ms
- PipeIo: burst_iters=1024
- FutexPingPong: spin_iters=1024
- CachePressure: size_kb=32, stride=64
- CacheYield: size_kb=32, stride=64
- CachePipe: size_kb=32, burst_iters=1024
- FutexFanOut: fan_out=4, spin_iters=1024
- FanOutCompute: fan_out=4, cache_footprint_kb=256, operations=5, sleep_usec=100
- AffinityChurn: spin_iters=1024
- PolicyChurn: spin_iters=1024
- PageFaultChurn: region_kb=4096, touches_per_cycle=256, spin_iters=64
- MutexContention: contenders=4, hold_iters=256, work_iters=1024
String lookup
WorkType::from_name() accepts PascalCase names matching the enum
variants (e.g. "SpinWait", "FutexPingPong"). Sequence and Custom
return None because they require explicit construction parameters.
WorkType::ALL_NAMES lists every variant name. WorkType::name()
returns the PascalCase name for a given value; for Custom, it returns
the user-provided name field.
WorkloadConfig
WorkloadConfig is the low-level struct passed to
WorkloadHandle::spawn(). CgroupDef builds one internally; use
WorkloadConfig directly when calling setup_cgroups() or
WorkloadHandle::spawn() in custom scenarios.
pub struct WorkloadConfig {
    pub num_workers: usize,              // Number of worker processes to fork
    pub affinity: AffinityIntent,        // Per-worker affinity intent (resolved at spawn time)
    pub work_type: WorkType,             // What each worker does
    pub sched_policy: SchedPolicy,       // Linux scheduling policy
    pub mem_policy: MemPolicy,           // NUMA memory placement policy
    pub mpol_flags: MpolFlags,           // Optional mode flags for set_mempolicy(2)
    pub nice: Option<i32>,               // Per-worker nice via setpriority(2); None inherits
    pub clone_mode: CloneMode,           // Fork (default) or Thread dispatch
    pub comm: Option<Cow<'static, str>>, // task->comm via prctl(PR_SET_NAME); kernel truncates to 15 bytes
    pub uid: Option<u32>,                // Effective UID via setresuid; None inherits
    pub gid: Option<u32>,                // Effective GID via setresgid; None inherits
    pub numa_node: Option<u32>,          // Restrict affinity to one NUMA node's CPU set
    pub composed: Vec<WorkSpec>,         // Secondary worker groups spawned alongside the primary
}
Default: 1 worker, AffinityIntent::Inherit, SpinWait, Normal policy,
Default mem_policy, no mpol_flags, nice/comm/uid/gid/numa_node = None,
clone_mode = Fork, composed = empty.
AffinityIntent is the type-unified affinity expression used at the
top level and inside WorkSpec entries — Inherit, Exact(...),
and RandomSubset(...) are accepted at WorkloadHandle::spawn;
topology-aware variants (SingleCpu, LlcAligned, CrossCgroup,
SmtSiblingPair) require scenario context and are rejected at the
spawn gate with an actionable diagnostic. composed carries
secondary WorkSpec groups that spawn alongside the primary; each
composed entry can override work_type, num_workers,
sched_policy, affinity, etc., and reports back via
WorkerReport::group_idx (0 for the primary, 1..=N for composed
entries in declaration order).
See MemPolicy for the NUMA memory placement API.
Scheduling policies
Workers can run under different Linux scheduling policies:
pub enum SchedPolicy {
    Normal,
    Batch,
    Idle,
    Fifo(u32),       // priority 1-99
    RoundRobin(u32), // priority 1-99
    Deadline {
        runtime: Duration,  // budget per period
        deadline: Duration, // relative deadline from period start
        period: Duration,   // period; Duration::ZERO uses `deadline`
    },
}
Fifo, RoundRobin, and Deadline require CAP_SYS_NICE. The
sched-deadline gate (runtime <= deadline <= period, all non-zero
unless period == Duration::ZERO, which the kernel substitutes
with deadline) is validated user-side in
SchedPolicy::deadline() before sched_setattr so a malformed
Deadline fails fast rather than tunneling EINVAL through the
syscall.
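The gate described above can be sketched as a plain function. This is a std-only illustration of the stated rule (runtime <= deadline <= period, with a zero period standing for deadline); validate_deadline is a hypothetical name and the error strings are invented, so it should not be read as the body of SchedPolicy::deadline().

```rust
use std::time::Duration;

/// Hypothetical user-side check of SCHED_DEADLINE parameters.
/// Returns the effective period on success.
fn validate_deadline(
    runtime: Duration,
    deadline: Duration,
    period: Duration,
) -> Result<Duration, String> {
    if runtime.is_zero() || deadline.is_zero() {
        return Err("runtime and deadline must be non-zero".to_string());
    }
    // A zero period means "use the deadline as the period".
    let effective_period = if period.is_zero() { deadline } else { period };
    if runtime > deadline {
        return Err(format!("runtime {runtime:?} exceeds deadline {deadline:?}"));
    }
    if deadline > effective_period {
        return Err(format!(
            "deadline {deadline:?} exceeds period {effective_period:?}"
        ));
    }
    Ok(effective_period)
}

fn main() {
    let ms = Duration::from_millis;
    println!("{:?}", validate_deadline(ms(2), ms(10), Duration::ZERO));
    println!("{:?}", validate_deadline(ms(20), ms(10), ms(30)));
}
```

Checking before the syscall lets the error message name the offending field, whereas a rejected sched_setattr call only reports EINVAL.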
Overriding work types
The work type override (configured via gauntlet or
Ctx.work_type_override) replaces the default SpinWait work type
for all scenarios that use it. Scenarios with non-SpinWait work types
are not overridden.
Overrides to grouped work types (PipeIo, FutexPingPong,
CachePipe, FutexFanOut, FanOutCompute, MutexContention) are skipped
when num_workers is not divisible by the work type’s group size.
Ops-based scenarios have a separate override mechanism via
CgroupDef.swappable. See Ops and Steps.
Checking
ktstr checks scheduler behavior through two channels: worker-side telemetry and host-side monitoring.
Worker checks
After each scenario, ktstr collects
WorkerReport from every worker
process. Several checks run against these reports:
Starvation – any worker with work_units == 0 fails the test.
Fairness – workers in the same cgroup should get similar CPU time. The “spread” (max off-CPU% - min off-CPU%) must be below a threshold (15% in release builds, 35% in debug). Violations report the spread and per-cgroup statistics.
Scheduling gaps – the longest wall-clock gap observed at work-unit checkpoints. Gaps above a threshold (2000ms release, 3000ms debug) indicate the scheduler dropped a task. Reports include the gap duration, CPU, and timing.
Cpuset isolation – workers must only run on CPUs in their assigned
cpuset. Any execution on an unexpected CPU fails the test. Opt-in via
isolation = true on the #[ktstr_test] attribute or via
Assert::check_isolation(); Assert::default_checks() leaves this
None, so the runtime merge resolves to false and the check is
skipped unless explicitly enabled.
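The starvation and fairness checks reduce to simple arithmetic over per-worker values. The sketch below is a std-only model under stated assumptions: the inputs are plain slices rather than WorkerReport values, and spread_pct, fairness_ok, and any_starved are hypothetical helper names.

```rust
/// Spread = max off-CPU% minus min off-CPU% across workers in one cgroup.
fn spread_pct(off_cpu_pct: &[f64]) -> Option<f64> {
    if off_cpu_pct.is_empty() {
        return None;
    }
    let max = off_cpu_pct.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let min = off_cpu_pct.iter().copied().fold(f64::INFINITY, f64::min);
    Some(max - min)
}

/// The fairness check passes when the spread stays below the threshold.
/// An empty cgroup is treated as passing (an assumption of this sketch).
fn fairness_ok(off_cpu_pct: &[f64], max_spread_pct: f64) -> bool {
    spread_pct(off_cpu_pct).map_or(true, |s| s < max_spread_pct)
}

/// Starvation: any worker that completed zero work units.
fn any_starved(work_units: &[u64]) -> bool {
    work_units.iter().any(|&w| w == 0)
}

fn main() {
    let off_cpu = [10.0, 12.5, 30.0];
    println!("spread: {:?}", spread_pct(&off_cpu));
    println!("passes at 35%: {}", fairness_ok(&off_cpu, 35.0));
    println!("passes at 15%: {}", fairness_ok(&off_cpu, 15.0));
    println!("starved: {}", any_starved(&[5, 0, 3]));
}
```

The same 20-point spread passes under the debug threshold (35%) and fails under the release threshold (15%), which is why results can differ between build profiles.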
Throughput parity – assert_throughput_parity() checks that
workers produce similar throughput (work_units per CPU-second). Two
thresholds:
- max_throughput_cv: coefficient of variation across workers. High CV means the scheduler gives some workers disproportionately less effective CPU. Requires at least 2 workers with nonzero CPU time.
- min_work_rate: minimum work_units per CPU-second per worker. Catches cases where all workers are equally slow (CV passes but absolute throughput is too low).
Neither threshold is set by default; enable via Assert setters or
#[ktstr_test] attributes.
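To make the two thresholds concrete, here is a std-only sketch of a coefficient-of-variation check plus a minimum-rate check. It assumes the population standard deviation (the source does not say whether ktstr uses the population or sample form), and the function names are hypothetical.

```rust
/// Coefficient of variation = standard deviation / mean.
/// Returns None with fewer than two samples or a zero mean.
/// Uses the population standard deviation (an assumption).
fn coefficient_of_variation(rates: &[f64]) -> Option<f64> {
    if rates.len() < 2 {
        return None;
    }
    let n = rates.len() as f64;
    let mean = rates.iter().sum::<f64>() / n;
    if mean == 0.0 {
        return None;
    }
    let variance = rates.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / n;
    Some(variance.sqrt() / mean)
}

/// True when every worker reaches the minimum work_units per CPU-second.
fn meets_min_work_rate(rates: &[f64], min_work_rate: f64) -> bool {
    rates.iter().all(|&r| r >= min_work_rate)
}

fn main() {
    let even = [100.0, 100.0, 100.0, 100.0];
    let skewed = [50.0, 150.0];
    println!("even CV:   {:?}", coefficient_of_variation(&even));
    println!("skewed CV: {:?}", coefficient_of_variation(&skewed));
    // Equally slow workers pass the CV check but fail the rate floor.
    let slow = [1.0, 1.0, 1.0];
    println!("slow CV ok, rate ok: {}", meets_min_work_rate(&slow, 10.0));
}
```

The last case shows why the two thresholds complement each other: uniformly slow workers have a CV of zero and are only caught by the rate floor.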
Benchmarking – assert_benchmarks() checks per-wakeup latency
and iteration throughput. Three thresholds:
- max_p99_wake_latency_ns: p99 of all resume_latencies_ns samples across workers in a cgroup. Populated only for work types that record wake-to-run latency: IoSyncWrite, IoRandRead, IoConvoy, Bursty, PipeIo, FutexPingPong, CacheYield, CachePipe, FutexFanOut (receivers), Sequence (Sleep / Yield / Io phases), ForkExit, NiceSweep, AffinityChurn, PolicyChurn, FanOutCompute, MutexContention. Pure-CPU work types (SpinWait, Mixed, CachePressure, PageFaultChurn) do not record samples.
- max_wake_latency_cv: coefficient of variation of wake latency samples. High CV means inconsistent scheduling latency.
- min_iteration_rate: minimum outer-loop iterations per wall-clock second per worker.
None are set by default. Set via Assert setters or #[ktstr_test]
attributes.
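A p99 over pooled latency samples can be computed in several ways. The sketch below uses the nearest-rank method with integer arithmetic; that choice, and the p99_ns name, are assumptions for illustration, since the source does not specify how ktstr interpolates percentiles.

```rust
/// Nearest-rank p99 over wake-latency samples (nanoseconds).
/// Returns None when no samples were recorded.
fn p99_ns(samples: &[u64]) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // rank = ceil(0.99 * n), computed in integers to avoid float rounding.
    let rank = (sorted.len() * 99 + 99) / 100;
    Some(sorted[rank - 1])
}

fn main() {
    // Pool the samples from every worker in the cgroup, then take the p99.
    let worker_a: Vec<u64> = (1..=50).collect();
    let worker_b: Vec<u64> = (51..=100).collect();
    let pooled: Vec<u64> = worker_a.into_iter().chain(worker_b).collect();
    println!("p99 = {:?} ns", p99_ns(&pooled));
}
```

A check against max_p99_wake_latency_ns would then compare the returned value with the configured limit and skip cgroups whose work type records no samples.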
Monitor checks
The host-side monitor reads guest VM memory (per-CPU runqueue structs via BTF offsets) and evaluates:
- Imbalance ratio: max(nr_running) / max(1, min(nr_running)) across CPUs. The denominator is clamped to 1 so an all-idle sample does not divide by zero.
- Local DSQ depth: per-CPU dispatch queue depth.
- Stall detection: rq_clock not advancing on a CPU with runnable tasks. Idle CPUs and preempted vCPUs are exempt. See Monitor: Stall detection for exemption details.
- Event rates: scx fallback and keep-last event counters.
Monitor thresholds use a sustained sample window (default: 5 samples). A violation must persist for N consecutive samples before failing.
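The imbalance formula and the sustained-window rule can be modeled in a few lines. This std-only sketch takes already-sampled values as slices; imbalance_ratio and sustained_violation are hypothetical names, and the real monitor reads its inputs from guest memory rather than from arrays.

```rust
/// max(nr_running) / max(1, min(nr_running)) across CPUs.
fn imbalance_ratio(nr_running: &[u32]) -> f64 {
    let max = nr_running.iter().copied().max().unwrap_or(0);
    let min = nr_running.iter().copied().min().unwrap_or(0);
    // Clamp the denominator so an all-idle sample yields 0.0, not NaN.
    f64::from(max) / f64::from(min.max(1))
}

/// True once `window` consecutive samples exceed `threshold`.
fn sustained_violation(samples: &[f64], threshold: f64, window: usize) -> bool {
    let window = window.max(1);
    let mut run = 0usize;
    for &s in samples {
        if s > threshold {
            run += 1;
            if run >= window {
                return true;
            }
        } else {
            run = 0; // any healthy sample resets the streak
        }
    }
    false
}

fn main() {
    println!("ratio: {}", imbalance_ratio(&[8, 2, 4]));
    let spiky = [5.0, 5.0, 5.0, 5.0, 1.0, 5.0, 5.0, 5.0, 5.0];
    let stuck = [5.0, 5.0, 5.0, 5.0, 5.0];
    println!("spiky fails: {}", sustained_violation(&spiky, 4.0, 5));
    println!("stuck fails: {}", sustained_violation(&stuck, 4.0, 5));
}
```

The spiky series never reaches five violations in a row, so it passes, while the stuck series fails on its fifth sample.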
NUMA checks
When workers use a MemPolicy, ktstr collects NUMA
page placement data and checks it against thresholds:
Page locality – assert_page_locality() checks the fraction of
pages residing on the expected NUMA node(s). Expected nodes are derived
from the worker’s MemPolicy::node_set() at evaluation time. Page
counts come from WorkerReport::numa_pages (parsed from
/proc/self/numa_maps). Returns 1.0 (vacuously local) when no pages
are observed. Fails if the observed fraction falls below
min_page_locality.
Cross-node migration – assert_cross_node_migration() checks
the ratio of migrated pages to total allocated pages.
WorkerReport::vmstat_numa_pages_migrated provides the delta of the
numa_pages_migrated counter from /proc/vmstat over the work loop.
Fails if the ratio exceeds max_cross_node_migration_ratio.
Slow-tier ratio – max_slow_tier_ratio checks the fraction of
pages on memory-only NUMA nodes (CXL tiers). Fails if more than the
specified fraction of pages land on memory-only nodes.
None of these thresholds are set by default. Set via Assert setters
or #[ktstr_test] attributes.
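The page-locality fraction can be illustrated with ordinary collections. In this std-only sketch the per-node page counts are passed in as a map (the real values come from WorkerReport::numa_pages), and page_locality is a hypothetical helper name.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Fraction of observed pages that reside on the expected NUMA nodes.
/// Returns 1.0 (vacuously local) when no pages were observed.
fn page_locality(pages_per_node: &BTreeMap<usize, u64>, expected: &BTreeSet<usize>) -> f64 {
    let total: u64 = pages_per_node.values().sum();
    if total == 0 {
        return 1.0;
    }
    let local: u64 = pages_per_node
        .iter()
        .filter(|(node, _)| expected.contains(*node))
        .map(|(_, pages)| *pages)
        .sum();
    local as f64 / total as f64
}

fn main() {
    let pages = BTreeMap::from([(0usize, 75u64), (1, 25)]);
    let expected = BTreeSet::from([0usize]);
    let locality = page_locality(&pages, &expected);
    // A min_page_locality of 0.9 would fail here; 0.7 would pass.
    println!("locality = {locality}");
}
```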
Assert struct
Assert is a composable configuration that carries both worker checks
and monitor thresholds:
pub struct Assert {
    // Worker checks
    pub not_starved: Option<bool>,
    pub isolation: Option<bool>,
    pub max_gap_ms: Option<u64>,
    pub max_spread_pct: Option<f64>,
    // Throughput checks
    pub max_throughput_cv: Option<f64>,
    pub min_work_rate: Option<f64>,
    // Benchmarking checks
    pub max_p99_wake_latency_ns: Option<u64>,
    pub max_wake_latency_cv: Option<f64>,
    pub min_iteration_rate: Option<f64>,
    pub max_migration_ratio: Option<f64>,
    // Monitor checks
    pub max_imbalance_ratio: Option<f64>,
    pub max_local_dsq_depth: Option<u32>,
    pub fail_on_stall: Option<bool>,
    pub sustained_samples: Option<usize>,
    pub max_fallback_rate: Option<f64>,
    pub max_keep_last_rate: Option<f64>,
    // NUMA checks
    pub min_page_locality: Option<f64>,
    pub max_cross_node_migration_ratio: Option<f64>,
    pub max_slow_tier_ratio: Option<f64>,
}
Every field is Option. None means “inherit from parent layer.”
Merge layers
Checking uses a three-layer merge:
- Assert::default_checks() – baseline: not_starved enabled, monitor thresholds from MonitorThresholds::DEFAULT.
- Scheduler.assert – scheduler-level overrides.
- Per-test assert – test-specific overrides via #[ktstr_test] attributes.
All fields use last-Some-wins semantics. A Some(false) in a
higher layer can disable a check that a lower layer enabled.
let final_assert = Assert::default_checks()
    .merge(&scheduler.assert)
    .merge(&test_assert);
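Last-Some-wins merging is easy to model with Option::or. The sketch below uses a hypothetical two-field MiniAssert rather than the real Assert struct, so it only illustrates the semantics described above.

```rust
/// Hypothetical two-field stand-in for Assert.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
struct MiniAssert {
    not_starved: Option<bool>,
    max_gap_ms: Option<u64>,
}

impl MiniAssert {
    /// Fields set in `other` win; unset fields fall through to `self`.
    fn merge(self, other: &MiniAssert) -> MiniAssert {
        MiniAssert {
            not_starved: other.not_starved.or(self.not_starved),
            max_gap_ms: other.max_gap_ms.or(self.max_gap_ms),
        }
    }
}

fn main() {
    let defaults = MiniAssert { not_starved: Some(true), max_gap_ms: Some(2000) };
    let scheduler = MiniAssert { not_starved: None, max_gap_ms: Some(5000) };
    let test = MiniAssert { not_starved: Some(false), max_gap_ms: None };

    let merged = defaults.merge(&scheduler).merge(&test);
    // The test layer disables not_starved; the scheduler's gap limit survives.
    println!("{merged:?}");
}
```

Merging with an all-None value leaves the left-hand side unchanged, which is the property that makes an identity constant such as NO_OVERRIDES possible.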
Default thresholds
Worker checks
| Check | Default (release) | Default (debug) |
|---|---|---|
| Scheduling gap | 2000 ms | 3000 ms |
| Fairness spread | 15% | 35% |
Debug builds run in small VMs with higher scheduling overhead, so thresholds are relaxed. Coverage-instrumented builds collect profraw data for code coverage analysis; all assertion and monitor threshold checks run normally.
Monitor checks
| Threshold | Default | Rationale |
|---|---|---|
| max_imbalance_ratio | 4.0 | max(nr_running) / max(1, min(nr_running)) across CPUs (denominator clamped to 1 so an all-idle sample does not divide by zero). Lower values (2-3) false-positive during cpuset transitions. |
| max_local_dsq_depth | 50 | Per-CPU dispatch queue overflow. Sustained depth above this means the scheduler is not consuming dispatched tasks. |
| fail_on_stall | true | Fail when rq_clock does not advance on a CPU with runnable tasks. Idle CPUs (NOHZ) and preempted vCPUs are exempt. |
| sustained_samples | 5 | At ~100ms sample interval, requires ~500ms of sustained violation. Filters transient spikes from cpuset reconfiguration. |
| max_fallback_rate | 200.0/s | select_cpu_fallback events per second across all CPUs. Sustained rate indicates systematic select_cpu failure. |
| max_keep_last_rate | 100.0/s | dispatch_keep_last events per second across all CPUs. Sustained rate indicates dispatch starvation. |
All monitor thresholds use the sustained_samples window – a
violation must persist for N consecutive samples before failing.
Worker checks via Assert
Assert provides assert_cgroup() for running worker-side checks
directly against collected reports:
let a = Assert::default_checks().max_gap_ms(5000);
let result = a.assert_cgroup(&reports, Some(&cpuset));
Use Assert for both the merge chain (#[ktstr_test] attributes,
Scheduler.assert, execute_steps_with) and direct report checking.
Constants
- Assert::NO_OVERRIDES – identity for merge; every field is None, so it overrides nothing. This is not “no checks” – when used as a per-test or per-scheduler assert, the runtime chain still applies defaults because it merges default_checks() -> scheduler -> test.
- Assert::default_checks() – not_starved enabled, monitor thresholds populated from MonitorThresholds::DEFAULT.
AssertResult
AssertResult carries pass/fail status, diagnostic messages, and
aggregated statistics from a scenario run.
Construction
- AssertResult::pass() – creates a passing result with empty details and default stats.
- AssertResult::skip(reason) – creates a passing result with a skip reason in details and skipped = true. Used when a scenario cannot run under the current topology or flag combination but is not a failure.
- AssertResult::fail(detail) – failing result carrying a single AssertDetail. Mirrors pass/skip for the failure axis.
- AssertResult::fail_msg(msg) – shortcut for the common case where the failure is a plain diagnostic message tagged DetailKind::Other.
Mutation and inspection
- result.note(msg) – append an informational annotation tagged DetailKind::Note. Does NOT flip passed or skipped; a note is context, not a verdict. Returns &mut Self so calls chain.
- result.with_note(msg) – builder-style sibling of note that consumes and returns self. Use at the return site to chain a context annotation onto a fresh result without an intermediate let mut.
- result.is_skipped() – convenience accessor returning skipped. Stats tooling uses this to subtract non-executions from pass counts.
- result.is_failed() – convenience accessor returning !passed. Mirrors is_skipped so branches reading “did this claim fail?” don’t negate .passed inline.
Fields
- passed: bool – whether all checks passed.
- skipped: bool – distinguishes a passing result that ran every check from one that skipped execution (topology / flag mismatch, prerequisite absent). AssertResult::skip sets this; pass/fail/fail_msg leave it false.
- details: Vec<AssertDetail> – structured diagnostic entries; each carries a kind: DetailKind (Other, Note, Skip, Temporal, …) plus a human-readable message: String. Consumers filter by kind for routing (failure vs informational note) and read message for display.
- stats: ScenarioStats – aggregated worker telemetry across all cgroups (spread, gaps, migrations, wake latency, iterations).
- measurements: BTreeMap<String, NoteValue> – structured per-test measurements keyed by name. Sidecar consumers and comparison tooling read this map directly without parsing details strings, so populate it (via Verdict::note_value during claim evaluation) for any value a downstream comparison needs to lift programmatically.
Merging
result.merge(other) combines two results. If other.passed is
false, the merged result is also false. Details and stats are
accumulated:
let mut combined = AssertResult::pass();
combined.merge(cgroup_0_result);
combined.merge(cgroup_1_result);
// combined.passed is false if either cgroup failed
// combined.details contains messages from both
Stats merging takes worst values across cgroups for spread, gap, wake
latency, and migration ratio. Counters (total_workers, total_cpus,
total_migrations, total_iterations) are summed.
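The merge rules (pass status ANDed, details concatenated, worst values kept, counters summed) can be modeled with a small struct. ResultSketch below is a hypothetical stand-in with one field per rule; it is not the real AssertResult or ScenarioStats layout.

```rust
/// Hypothetical stand-in: one field per merge rule described above.
#[derive(Debug, Clone, PartialEq)]
struct ResultSketch {
    passed: bool,
    details: Vec<String>,
    worst_gap_ms: u64,    // "worst value" style stat: keep the maximum
    total_workers: usize, // counter style stat: sum
}

impl ResultSketch {
    fn pass() -> Self {
        Self { passed: true, details: Vec::new(), worst_gap_ms: 0, total_workers: 0 }
    }

    fn merge(&mut self, other: ResultSketch) {
        self.passed &= other.passed;
        self.details.extend(other.details);
        self.worst_gap_ms = self.worst_gap_ms.max(other.worst_gap_ms);
        self.total_workers += other.total_workers;
    }
}

fn main() {
    let cg0 = ResultSketch {
        passed: true,
        details: vec!["cg_0 ok".into()],
        worst_gap_ms: 120,
        total_workers: 4,
    };
    let cg1 = ResultSketch {
        passed: false,
        details: vec!["cg_1 starved".into()],
        worst_gap_ms: 2500,
        total_workers: 2,
    };
    let mut combined = ResultSketch::pass();
    combined.merge(cg0);
    combined.merge(cg1);
    println!("{combined:?}");
}
```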
For examples of overriding thresholds at the scheduler and per-test level, see Customize Checking.
Ops and Steps
The ops system is a composable way to express dynamic cgroup topology
changes. It replaces hand-written Action::Custom functions for most
dynamic scenarios.
Op
An Op is an atomic operation on the cgroup topology. The enum is
#[non_exhaustive], so external pattern matches must include a wildcard
arm (_) to stay compatible across ktstr version bumps that add new variants:
| Op | Description |
|---|---|
| AddCgroup | Create a cgroup |
| RemoveCgroup | Stop workers and remove a cgroup |
| SetCpuset | Set a cgroup’s cpuset via CpusetSpec |
| ClearCpuset | Remove cpuset constraints |
| SwapCpusets | Swap cpusets between two cgroups |
| Spawn | Fork workers into a cgroup |
| StopCgroup | Stop a cgroup’s workers |
| SetAffinity | Set worker affinity via AffinityIntent |
| SpawnHost | Spawn workers in the parent cgroup |
| MoveAllTasks | Move all tasks from one cgroup to another |
| RunPayload | Spawn a binary-kind Payload in the background and track its PayloadHandle under the step’s payload set. Subsequent WaitPayload / KillPayload address it by (payload.name, cgroup). Scheduler-kind payloads are rejected at apply time. |
| WaitPayload | Block until the named payload exits naturally, evaluate its checks, and record metrics to the per-test sidecar. Target lookup is by (name, cgroup) composite key; cgroup: None resolves to the unique live copy. No timeout; pair with a bounded HoldSpec or the payload’s own --runtime for time-boxed runs. |
| KillPayload | SIGKILL the named payload, reap the child, evaluate checks, and record metrics. Same (name, cgroup) lookup rules as WaitPayload. Mirrors step-teardown drain for an explicitly-targeted payload. |
| FreezeCgroup | Freeze every task in the named cgroup via cgroup.freeze (kernel-side asynchronous freeze; not a SIGSTOP). Idempotent for already-frozen cgroups. Pair with UnfreezeCgroup to release; teardown auto-unfreezes. See Snapshots for the observer-cgroup deadlock warning. |
| UnfreezeCgroup | Unfreeze every task in the named cgroup via cgroup.freeze. Inverse of FreezeCgroup. Idempotent. |
| Snapshot | Capture a host-side diagnostic snapshot under name via the freeze coordinator: pauses every vCPU, reads BPF map state, vCPU registers, and per-CPU counters into a FailureDumpReport, then resumes. The report is keyed by name on the active SnapshotBridge. No active bridge is a no-op with tracing::warn!. See Snapshots. |
| WatchSnapshot | Capture a snapshot whenever the guest writes to the named kernel symbol; one fire = one capture tagged with the symbol path. Symbol resolution at op execution time looks the name up by verbatim vmlinux ELF symbol-table match: the requested name must appear in the guest kernel’s static symbol table exactly as written (no path expansion, no BTF descent). Maximum 3 watch ops per scenario (3 hardware watchpoint slots; 1 slot reserved for the error-class exit_kind trigger). See Watch Snapshots. |
Op constructors accept string literals directly (no .into() needed):
Op::add_cgroup("cg_0")
Op::set_cpuset("cg_0", CpusetSpec::disjoint(0, 2))
Op::stop_cgroup("cg_0")
Op::spawn("cg_0", WorkSpec::default().workers(4))
Op::set_affinity("cg_0", AffinityIntent::RandomSubset)
Op::spawn_host(WorkSpec::default().workers(4))
Op::freeze_cgroup("cg_0")
Op::unfreeze_cgroup("cg_0")
Op::snapshot("after_spawn")
Op::watch_snapshot("jiffies_64")
SpawnHost creates workers in the parent cgroup, not in a managed
cgroup. Use this to simulate host-level CPU contention alongside
managed cgroups.
OpKind
OpKind is a payload-free discriminant enum generated from Op via
#[strum_discriminants]. It carries the same variant set as Op
(AddCgroup, RemoveCgroup, …, RunPayload, WaitPayload,
KillPayload, FreezeCgroup, UnfreezeCgroup, Snapshot,
WatchSnapshot) with none of the inner fields, so it is cheap to
copy and use as a map key. Framework code uses OpKind when it
only cares WHICH operation ran (per-op statistics, stimulus-event
tagging, verifier/monitor bookkeeping) without the payload. Test
authors rarely spell OpKind directly — the strum::EnumIter
derive also lets tooling enumerate every OpKind variant for
coverage checks.
OpKind shares Op’s #[non_exhaustive] attribute: external
pattern matches over OpKind must include a wildcard arm (_).
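A payload-free discriminant enum works well as a map key for per-op statistics. The sketch below uses a hypothetical OpKindSketch with only four variants (not the strum-generated OpKind) and a wildcard _ arm, which is what Rust requires when matching a #[non_exhaustive] enum from another crate. Inside the defining crate the attribute has no effect, so the wildcard here only demonstrates the shape.

```rust
use std::collections::HashMap;

/// Hypothetical subset of the discriminant enum. Copy + Eq + Hash make it
/// cheap to pass around and usable as a map key.
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum OpKindSketch {
    AddCgroup,
    RemoveCgroup,
    Spawn,
    Snapshot,
}

/// Coarse label per kind. The `_` arm keeps this match compiling when new
/// variants appear in a later version.
fn category(kind: OpKindSketch) -> &'static str {
    match kind {
        OpKindSketch::AddCgroup | OpKindSketch::RemoveCgroup => "topology",
        OpKindSketch::Spawn => "workers",
        _ => "other",
    }
}

/// Per-op statistics: how many times each kind ran.
fn count_ops(kinds: &[OpKindSketch]) -> HashMap<OpKindSketch, usize> {
    let mut counts = HashMap::new();
    for &kind in kinds {
        *counts.entry(kind).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let ran = [
        OpKindSketch::AddCgroup,
        OpKindSketch::Spawn,
        OpKindSketch::Spawn,
        OpKindSketch::Snapshot,
        OpKindSketch::RemoveCgroup,
    ];
    for (kind, n) in count_ops(&ran) {
        println!("{kind:?} ({}) ran {n} time(s)", category(kind));
    }
}
```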
CpusetSpec
CpusetSpec computes a cpuset from the topology at runtime. The enum
is #[non_exhaustive], so external callers should construct via the
associated constructor functions (see the list below this snippet)
rather than naming variant literals; a future field addition (e.g.
a stride on Range) can land behind a defaulted parameter without
breaking call sites. Pattern matches over CpusetSpec must also
include a wildcard arm (_):
pub enum CpusetSpec {
    Llc(usize),                                     // All CPUs in an LLC
    Numa(usize),                                    // All CPUs in a NUMA node
    Range { start_frac: f64, end_frac: f64 },       // Fraction of usable CPUs
    Disjoint { index: usize, of: usize },           // Equal disjoint partitions
    Overlap { index: usize, of: usize, frac: f64 }, // Overlapping partitions
    Exact(BTreeSet<usize>),                         // Exact CPU set
}
Convenience constructors accept parameters directly:
CpusetSpec::disjoint(0, 2), CpusetSpec::range(0.0, 0.5),
CpusetSpec::exact([0, 1, 2]), CpusetSpec::llc(0),
CpusetSpec::numa(0), CpusetSpec::overlap(0, 2, 0.5).
All fractional specs operate on
usable_cpus().
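To give a feel for how partition-style specs could resolve against a list of usable CPUs, here is a std-only sketch of the disjoint and range cases. Everything in it is an assumption for illustration: the free functions are hypothetical, the remainder is handed to the last partition, and fractional bounds are floored. ktstr's actual rounding and remainder handling are not described in this section.

```rust
use std::collections::BTreeSet;

/// Hypothetical resolution of "partition `index` of `of`" over usable CPUs.
/// Chunks are contiguous; the last partition absorbs any remainder.
fn disjoint(usable: &[usize], index: usize, of: usize) -> BTreeSet<usize> {
    if of == 0 || index >= of {
        return BTreeSet::new();
    }
    let chunk = usable.len() / of;
    let start = index * chunk;
    let end = if index + 1 == of { usable.len() } else { start + chunk };
    usable[start..end].iter().copied().collect()
}

/// Hypothetical resolution of a fractional range over usable CPUs.
/// Bounds are clamped to [0, 1] and floored to indices.
fn range(usable: &[usize], start_frac: f64, end_frac: f64) -> BTreeSet<usize> {
    let n = usable.len() as f64;
    let start = (start_frac.clamp(0.0, 1.0) * n).floor() as usize;
    let end = (end_frac.clamp(0.0, 1.0) * n).floor() as usize;
    if start >= end {
        return BTreeSet::new();
    }
    usable[start..end].iter().copied().collect()
}

fn main() {
    let usable: Vec<usize> = (0..8).collect();
    println!("disjoint(0, 2) -> {:?}", disjoint(&usable, 0, 2));
    println!("disjoint(1, 2) -> {:?}", disjoint(&usable, 1, 2));
    println!("range(0.0, 0.5) -> {:?}", range(&usable, 0.0, 0.5));
}
```

Under these assumptions, the partitions produced for a fixed `of` never share a CPU and together cover the whole usable set.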
CgroupDef
CgroupDef bundles three ops that always go together: create cgroup,
set cpuset, spawn workers. It is the primary way to define cgroups in
ops-based scenarios.
let def = CgroupDef::named("cg_0")
.with_cpuset(CpusetSpec::disjoint(0, 2))
.workers(4)
.work_type(WorkType::SpinWait);
Builder methods
- .with_cpuset(CpusetSpec) – set the cpuset (CPU set the cgroup is pinned to).
- .with_cpuset_mems(BTreeSet<usize>) – explicit cpuset.mems override (default derives from the resolved cpuset’s NUMA nodes).
- .workers(n) – set worker count.
- .work_type(WorkType) – set work type (default: SpinWait).
- .sched_policy(SchedPolicy) – set Linux scheduling policy (default: Normal). See WorkSpec Types.
- .work(WorkSpec) – add a work group (multiple calls for concurrent groups).
- .workload(&'static Payload) – attach a binary workload payload to run alongside the worker group; the framework launches it as a child process inside the cgroup. Panics when called with a scheduler-kind Payload (PayloadKind::Scheduler(_)); the scheduler slot is #[ktstr_test(scheduler = ...)] at the test level, not the cgroup-level workload slot. Step-level Op::RunPayload rejects scheduler-kind payloads with an anyhow::Error instead of panicking; the build-time workload call panics because there is no scenario-level recovery path.
- .affinity(AffinityIntent) – set per-worker affinity (default: Inherit).
- .mem_policy(MemPolicy) – set NUMA memory placement policy (default: Default). See MemPolicy.
- .mpol_flags(MpolFlags) – set mode flags for set_mempolicy(2) (default: NONE). See MemPolicy.
- .nice(n) – cgroup-level default per-worker nice value, merged into every WorkSpec whose own nice is unset. See Tutorial: Step 11.
- .comm(name) – cgroup-level default per-worker task->comm via prctl(PR_SET_NAME). Merged into every WorkSpec whose own comm is unset.
- .pcomm(name) – thread-group-leader task->comm for the fork-then-thread spawn path (workers run as threads under one forked leader). Stamps every existing WorkSpec in-place; not order-independent with .work(...).
- .uid(uid) / .gid(gid) – cgroup-level default per-worker effective UID / GID via setresuid / setresgid. Merged into every WorkSpec whose own uid / gid is unset.
- .numa_node(node) – cgroup-level default NUMA-node affinity for every WorkSpec. Merged at apply-setup time.
- .swappable(bool) – opt into gauntlet work type override.
Cgroup controllers
The cgroup-v2 cpu / memory / io / pids controllers are exposed as typed setters (default: unconstrained):
- .cpu_quota_pct(pct) / .cpu_quota(quota, period) / .cpu_unlimited() – write cpu.max (pct is shorthand: 100 = one full CPU). cpu_unlimited resets to the kernel default.
- .cpu_weight(weight) – write cpu.weight (1..=10000, default 100).
- .memory_max(bytes) / .memory_high(bytes) / .memory_low(bytes) / .memory_unlimited() – write memory.max / memory.high / memory.low. memory_unlimited resets memory.max to max.
- .memory_swap_max(bytes) / .memory_swap_unlimited() – write memory.swap.max.
- .io_weight(weight) – write io.weight (1..=10000, default 100).
- .pids_max(n) / .pids_unlimited() – write pids.max.
MemPolicy-cpuset validation
When a cgroup has a cpuset, ktstr validates that the MemPolicy’s
node set is covered by the NUMA nodes reachable from that cpuset. A
MemPolicy::Bind([1]) on a cgroup whose cpuset covers only NUMA
node 0 fails at setup time. Policies without a node set (Default,
Local) skip validation.
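The coverage rule amounts to a subset test between two node sets. The sketch below is std-only and uses a hypothetical validate_mem_nodes function; policies without a node set are modeled as None.

```rust
use std::collections::BTreeSet;

/// Hypothetical setup-time check: every node named by the memory policy
/// must be reachable from the cgroup's cpuset. `None` models policies
/// without a node set (Default, Local), which skip validation.
fn validate_mem_nodes(
    policy_nodes: Option<&BTreeSet<usize>>,
    cpuset_nodes: &BTreeSet<usize>,
) -> Result<(), String> {
    let Some(nodes) = policy_nodes else {
        return Ok(());
    };
    let uncovered: Vec<usize> = nodes.difference(cpuset_nodes).copied().collect();
    if uncovered.is_empty() {
        Ok(())
    } else {
        Err(format!(
            "policy nodes {uncovered:?} are not covered by cpuset nodes {cpuset_nodes:?}"
        ))
    }
}

fn main() {
    let cpuset_nodes = BTreeSet::from([0usize]);
    let bind_node1 = BTreeSet::from([1usize]);
    // Mirrors the Bind([1]) on a node-0 cpuset example above.
    println!("{:?}", validate_mem_nodes(Some(&bind_node1), &cpuset_nodes));
    println!("{:?}", validate_mem_nodes(None, &cpuset_nodes));
}
```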
WorkSpec type overrides and swappable
CgroupDef has a swappable flag (default: false). When true
and a work type override is active (Ctx.work_type_override), the
override replaces this def’s work type.
In contrast, the Scenario-level override (in run_scenario()) only
replaces SpinWait work types. The two mechanisms serve different
scopes:
- Scenario-level: replaces SpinWait in WorkSpec.work_type
- CgroupDef-level: replaces the work type when swappable = true
Both skip overrides to grouped work types when num_workers is not
divisible by the work type’s group size.
WorkSpec type overrides apply only to CgroupDef setup, not to raw
Op::Spawn. Op::Spawn always uses the work type as given. Use
CgroupDef with .swappable(true) when the work type should
participate in gauntlet overrides.
Step
A Step is a sequence of ops with a hold period:
pub struct Step {
    pub setup: Setup,   // CgroupDefs to create after ops
    pub ops: Vec<Op>,   // Operations to apply
    pub hold: HoldSpec, // How long to wait after
}
Setup is either Defs(Vec<CgroupDef>) or Factory(fn(&Ctx) -> Vec<CgroupDef>).
Vec<CgroupDef> implements Into<Setup>, so you can write
setup: vec![...].into() instead of setup: Setup::Defs(vec![...]).
Constructors
Step::new(ops, hold) – creates a step with ops only (no
CgroupDef setup). Use when the step only applies dynamic operations
to an existing topology.
Step::with_defs(defs, hold) – creates a step with CgroupDef
setup and a hold period. The primary constructor for steps that
create cgroups with workers.
Step::set_ops(self, ops) – REPLACES the ops on a step
(builder method). Chain after with_defs to add dynamic operations
to a step that also creates cgroups.
Naming asymmetry: Step::set_ops REPLACES; the sibling Backdrop::with_ops APPENDS. The two methods deliberately use different verbs to signal the different semantics. A Step::new(ops, hold).set_ops(more) chain produces a step whose ops vec is exactly more (the original ops is dropped); a Backdrop::new().with_ops(ops_a).with_ops(ops_b) chain produces a backdrop whose ops vec is ops_a followed by ops_b. If you need to extend a step’s ops vec, build the combined Vec<Op> at the call site and pass it to set_ops, or compose at the Backdrop layer instead.
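The replace-versus-append difference can be illustrated with a small stand-alone model. This is a sketch only: the Step and Backdrop types below are hypothetical stand-ins that model nothing but the ops vector (the hold argument of the real Step::new is omitted), and Op is reduced to a label.

```rust
// Hypothetical stand-ins for ktstr's Op, Step and Backdrop.
// Only the ops vector is modeled; everything else is left out.
#[derive(Debug, Clone, PartialEq)]
struct Op(&'static str);

#[derive(Debug)]
struct Step {
    ops: Vec<Op>,
}

#[derive(Debug, Default)]
struct Backdrop {
    ops: Vec<Op>,
}

impl Step {
    fn new(ops: Vec<Op>) -> Self {
        Step { ops }
    }

    // Replaces: whatever ops the step held before are dropped.
    fn set_ops(mut self, ops: Vec<Op>) -> Self {
        self.ops = ops;
        self
    }
}

impl Backdrop {
    fn new() -> Self {
        Backdrop::default()
    }

    // Appends: earlier ops are kept and the new ones follow them.
    fn with_ops(mut self, ops: Vec<Op>) -> Self {
        self.ops.extend(ops);
        self
    }
}
```

In this model, chaining set_ops twice leaves only the last vector, while chaining with_ops twice leaves both in call order.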
HoldSpec
How long to hold after a step completes:
| Variant | Description |
|---|---|
| Frac(f64) | Fraction of the total scenario duration |
| Fixed(Duration) | Fixed time |
| Loop { interval } | Repeat ops at interval until time runs out |
HoldSpec::FULL is a constant for Frac(1.0) (hold for the full
scenario duration).
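As a rough illustration of how the variants relate to the scenario duration, the sketch below resolves a hold spec to a concrete Duration. The enum is a simplified model, not ktstr's definition: the type of the Loop interval field and the hold_duration helper are assumptions, and treating Loop as running for the whole scenario is a guess based on "until time runs out".

```rust
use std::time::Duration;

// Simplified model of HoldSpec; field types are assumed.
#[allow(dead_code)]
enum HoldSpec {
    Frac(f64),
    Fixed(Duration),
    Loop { interval: Duration },
}

// Mirrors the documented constant: hold for the full scenario duration.
const FULL: HoldSpec = HoldSpec::Frac(1.0);

// Hypothetical helper: how long a step holds, given the scenario total.
fn hold_duration(spec: &HoldSpec, total: Duration) -> Duration {
    match spec {
        HoldSpec::Frac(f) => total.mul_f64(*f),
        HoldSpec::Fixed(d) => *d,
        // Assumption: a Loop step keeps repeating until the scenario ends.
        HoldSpec::Loop { .. } => total,
    }
}
```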
execute_defs
execute_defs(ctx, defs) is a convenience wrapper for the common
pattern of creating cgroups and running them for the full duration:
execute_defs(ctx, vec![
CgroupDef::named("cg_0").workers(4),
CgroupDef::named("cg_1").workers(4),
])
Equivalent to execute_steps(ctx, vec![Step::with_defs(defs, HoldSpec::FULL)]).
execute_steps
execute_steps(ctx, steps) runs a step sequence:
- For each step: apply ops, then apply setup (create cgroups from CgroupDefs), then hold for the specified duration. Ops run first so parent cgroups can be created before children are spawned. Loop steps reverse this: setup runs once before the loop, then ops repeat at the specified interval.
- Check scheduler liveness between steps.
- After all steps: collect worker reports and run checks.
- Writes stimulus events to the SHM ring buffer for timeline analysis.
execute_steps_with
execute_steps_with(ctx, steps, assertions) is the same as
execute_steps but accepts an explicit
Assert for worker checks.
execute_steps is a convenience wrapper that passes None.
use ktstr::prelude::*;
fn my_scenario(ctx: &Ctx) -> Result<AssertResult> {
let assertions = Assert::NO_OVERRIDES
.check_not_starved()
.max_gap_ms(3000);
let steps = vec![/* ... */];
execute_steps_with(ctx, steps, Some(&assertions))
}
When assertions is Some, the provided Assert overrides ctx.assert
for worker checks. When None, uses ctx.assert (the merged
three-layer config: default_checks -> scheduler -> per-test).
TestTopology
TestTopology provides CPU topology information for test
configuration. It discovers CPUs, last-level caches (LLCs), and NUMA
nodes, and generates cpuset partitions for scenarios.
CPU topology hierarchy
ktstr models four levels of CPU topology, from largest to smallest:
- NUMA node – a memory-proximity domain. Each node is a group of CPUs with fast access to a local memory bank. Cross-node memory access is slower.
- LLC (last-level cache) – the largest cache shared by a group of cores. LLCs are the key scheduling boundary: tasks sharing an LLC benefit from shared cache lines.
- Core – a physical execution unit with its own pipeline and L1/L2 caches.
- Thread – an SMT (simultaneous multithreading) sibling. Multiple threads share a single core’s execution resources.
Containment: threads belong to a core, cores belong to an LLC, LLCs
belong to a NUMA node. For example, 2n4l4c2t describes 2 NUMA
nodes with 4 LLCs in total (2 per node), each LLC with 4 cores, and
each core with 2 threads: 4 × 4 × 2 = 32 CPUs total.
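The arithmetic behind the compact notation can be sketched as follows. The parser is hypothetical (ktstr's own parsing code is not shown in this section); it assumes the four fields always appear in n, l, c, t order and that the LLC count is a total across all nodes, which is what the 32-CPU figure implies.

```rust
// Hypothetical parser for the "<N>n<L>l<C>c<T>t" notation.
#[derive(Debug, PartialEq)]
struct TopoSpec {
    numa_nodes: usize,
    llcs: usize,    // total LLCs across all NUMA nodes
    cores: usize,   // cores per LLC
    threads: usize, // threads per core
}

fn parse_topo(s: &str) -> Option<TopoSpec> {
    let mut fields = [0usize; 4];
    let mut rest = s;
    for (i, suffix) in ['n', 'l', 'c', 't'].iter().enumerate() {
        let end = rest.find(*suffix)?;
        fields[i] = rest[..end].parse().ok()?;
        rest = &rest[end + 1..];
    }
    if !rest.is_empty() {
        return None;
    }
    let [numa_nodes, llcs, cores, threads] = fields;
    // LLCs must partition evenly across NUMA nodes.
    if numa_nodes == 0 || llcs % numa_nodes != 0 {
        return None;
    }
    Some(TopoSpec { numa_nodes, llcs, cores, threads })
}

impl TopoSpec {
    fn llcs_per_node(&self) -> usize {
        self.llcs / self.numa_nodes
    }

    fn total_cpus(&self) -> usize {
        self.llcs * self.cores * self.threads
    }
}
```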
Most tests use a single NUMA node (the default). NUMA matters when a
scheduler makes placement decisions based on memory locality.
Single-NUMA topologies (numa_nodes = 1) test scheduling without
memory-locality effects. Multi-NUMA topologies test whether a
scheduler keeps tasks close to their memory. See the
gauntlet NUMA presets
for multi-NUMA configurations.
use ktstr::prelude::*;
pub struct TestTopology {
// private fields — use the accessors below
}
Construction
from_system() -> Result<Self> – reads sysfs
(/sys/devices/system/cpu/) to discover the live topology. Reads LLC
IDs, NUMA node IDs, core IDs, and cache sizes for each online CPU.
Also scans /sys/devices/system/node/ to discover memory-only nodes
(CXL), reads per-node meminfo and inter-node distances.
from_vm_topology(topo: &Topology) -> Self – builds a topology
from a VM spec. Topology fields are big-to-little: NUMA nodes,
last-level caches, cores per LLC, threads per core. Multiple LLCs
can share a NUMA node when numa_nodes < llcs; llcs must be an
exact multiple of numa_nodes so LLCs partition evenly across nodes
(the declare_scheduler! macro rejects violations at compile time
and runtime callers inside ktstr hold the same invariant). CPUs
numbered sequentially. Used as a fallback when sysfs is incomplete
inside a guest VM. For the memory-aware variant, see
from_vm_topology_with_memory.
synthetic(num_cpus, num_llcs) -> Self (test-only) – creates a
topology with evenly distributed CPUs across LLCs. Used in unit tests.
Topology queries
total_cpus() – total number of CPUs.
num_llcs() – number of last-level caches.
num_numa_nodes() – number of NUMA nodes.
all_cpus() -> &[usize] – all CPU IDs, sorted.
all_cpuset() -> BTreeSet<usize> – all CPU IDs as a set.
usable_cpus() -> &[usize] – CPUs available for workload
placement. Reserves the last CPU for the root cgroup (cgroup 0) when
the topology has more than 2 CPUs. On 8 CPUs: usable = 0-6, CPU 7
reserved. Most built-in scenarios and CgroupDef cpuset specs
operate on usable_cpus() automatically; test authors rarely need
to query it directly.
usable_cpuset() -> BTreeSet<usize> – usable CPUs as a set.
llcs() -> &[LlcInfo] – all LLC domains with their CPUs, NUMA
node, cache size, and core map.
cpus_in_llc(idx) -> &[usize] – CPUs belonging to LLC at index.
llc_aligned_cpuset(idx) -> BTreeSet<usize> – CPUs in LLC as a set.
numa_aligned_cpuset(node) -> BTreeSet<usize> – CPUs in all LLCs
belonging to NUMA node node. Filters LLCs by numa_node() == node
and collects their CPUs.
numa_node_ids() -> &BTreeSet<usize> – NUMA node IDs as a
BTreeSet.
numa_nodes_for_cpuset(cpus) -> BTreeSet<usize> – NUMA nodes
covered by the given CPU set. Returns the set of NUMA nodes that
contain at least one LLC with a CPU in the given set.
node_meminfo(node_id) -> Option<&NodeMemInfo> – per-node
memory info (total and free KiB). Returns None when the node ID
is not present or meminfo is unavailable. NodeMemInfo has
total_kb, free_kb, and used_kb() (saturating subtraction).
numa_distance(from, to) -> u8 – inter-node NUMA distance.
Returns 255 when either node ID is not present (matches the kernel’s
unreachable distance). For from_vm_topology() topologies without
explicit distances, returns 10 for local and 20 for remote.
is_memory_only(node_id) -> bool – whether the node is
memory-only (has RAM but no CPUs). Typical for CXL-attached memory
tiers.
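A few of the query rules above are easy to misread, so here is a minimal model of three of them: the last-CPU reservation in usable_cpus, the "at least one LLC with a CPU in the set" rule in numa_nodes_for_cpuset, and the default distances. Topo and Llc are hypothetical stand-ins, not the real TestTopology, and the distance function covers only the case without an explicit distance table.

```rust
use std::collections::BTreeSet;

// Hypothetical stand-ins holding only what these queries need.
struct Llc {
    cpus: Vec<usize>,
    numa_node: usize,
}

struct Topo {
    llcs: Vec<Llc>,
}

impl Topo {
    // All CPU IDs, sorted ascending.
    fn all_cpus(&self) -> Vec<usize> {
        let mut cpus: Vec<usize> =
            self.llcs.iter().flat_map(|l| l.cpus.iter().copied()).collect();
        cpus.sort_unstable();
        cpus
    }

    // The last CPU is reserved for the root cgroup when there are more than 2 CPUs.
    fn usable_cpus(&self) -> Vec<usize> {
        let mut cpus = self.all_cpus();
        if cpus.len() > 2 {
            cpus.pop();
        }
        cpus
    }

    // NUMA nodes that contain at least one LLC with a CPU in `cpus`.
    fn numa_nodes_for_cpuset(&self, cpus: &BTreeSet<usize>) -> BTreeSet<usize> {
        self.llcs
            .iter()
            .filter(|l| l.cpus.iter().any(|c| cpus.contains(c)))
            .map(|l| l.numa_node)
            .collect()
    }

    // Default distances: 10 local, 20 remote, 255 for an unknown node.
    fn numa_distance(&self, from: usize, to: usize) -> u8 {
        let known: BTreeSet<usize> = self.llcs.iter().map(|l| l.numa_node).collect();
        if !known.contains(&from) || !known.contains(&to) {
            255
        } else if from == to {
            10
        } else {
            20
        }
    }
}

// Example fixture: 2 NUMA nodes with one 4-CPU LLC each.
fn two_node_topo() -> Topo {
    Topo {
        llcs: vec![
            Llc { cpus: vec![0, 1, 2, 3], numa_node: 0 },
            Llc { cpus: vec![4, 5, 6, 7], numa_node: 1 },
        ],
    }
}
```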
Construction from VM topology
from_vm_topology(topo) -> Self – build a TestTopology from a
Topology (the VMM’s topology spec). Populates LLCs, NUMA nodes,
distances, per-node memory info, and memory-only node flags.
from_vm_topology_with_memory(topo, total_memory_mb) -> Self –
same as from_vm_topology but accepts an optional total memory size
for uniform topologies. When Some, divides memory evenly across
nodes to populate NodeMemInfo. When None, memory info is omitted.
Cpuset generation
split_by_llc() -> Vec<BTreeSet<usize>> – one set of CPUs per LLC.
overlapping_cpusets(n, overlap_frac) -> Vec<BTreeSet<usize>> –
generates n cpusets with overlap_frac overlap between adjacent
sets. Available to scenarios that want to hand-build overlapping
cpusets (e.g. via CpusetSpec::Exact); CpusetSpec::Overlap
computes its slice inline rather than calling this helper.
cpuset_string(cpus) -> String – formats a CPU set as a compact
range string (e.g. "0-3,5,7-9"). Used when writing cpuset.cpus.
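The compact range format is the same one the kernel accepts in cpuset.cpus. A minimal sketch of the formatting logic, under a hypothetical name (format_cpu_ranges) so it is not mistaken for ktstr's own cpuset_string:

```rust
use std::collections::BTreeSet;

// Format one run of consecutive CPU IDs as "N" or "A-B".
fn fmt_run(start: usize, end: usize) -> String {
    if start == end {
        start.to_string()
    } else {
        format!("{}-{}", start, end)
    }
}

// Hypothetical re-implementation of the compact range format.
// BTreeSet iterates in ascending order, so consecutive IDs are adjacent.
fn format_cpu_ranges(cpus: &BTreeSet<usize>) -> String {
    let mut iter = cpus.iter().copied();
    let mut start = match iter.next() {
        Some(first) => first,
        None => return String::new(),
    };
    let mut end = start;
    let mut parts = Vec::new();
    for cpu in iter {
        if cpu != end + 1 {
            // Gap found: close the current run and start a new one.
            parts.push(fmt_run(start, end));
            start = cpu;
        }
        end = cpu;
    }
    parts.push(fmt_run(start, end));
    parts.join(",")
}
```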
LlcInfo
Each LLC domain is represented by an LlcInfo:
pub struct LlcInfo {
cpus: Vec<usize>,
numa_node: usize,
cache_size_kb: Option<u64>,
cores: BTreeMap<usize, Vec<usize>>, // core_id -> SMT siblings
}
Accessors: cpus(), numa_node(), cache_size_kb(), cores(),
num_cores().
num_cores() returns the number of physical cores (from the core map),
or falls back to cpus.len() if no core map is populated (synthetic
topologies).
How scenarios use topology
TestTopology is available to scenarios via Ctx.topo. The
CpusetSpec variants use topology methods to resolve a cgroup’s
cpuset:
| CpusetSpec | Topology method |
|---|---|
| Llc(idx) | llc_aligned_cpuset(idx) |
| Numa(node) | numa_aligned_cpuset(node) |
| Range { start_frac, end_frac } | usable_cpus() sliced by fraction |
| Disjoint { index, of } | usable_cpus() partitioned into of equal sets |
| Overlap { index, of, frac } | usable_cpus() partitioned with neighbor overlap |
| Exact(set) | no topology resolution (caller-supplied set) |
Llc confines a cgroup to a single LLC’s CPUs; Numa spans all
LLCs in a NUMA node. The fraction- and partition-style variants
operate on the usable-CPUs pool the host reservation has granted.
See Ops and Steps for the full CpusetSpec enum.
CPU list parsing
Two standalone functions parse CPU list strings:
parse_cpu_list(s) -> Result<Vec<usize>> – strict parsing of
"0-3,5,7-9" format. Returns an error on invalid entries.
parse_cpu_list_lenient(s) -> Vec<usize> – lenient parsing that
silently skips invalid entries.
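The strict and lenient variants differ only in how they treat a malformed entry. The sketch below shows that difference with hypothetical function names and a plain String error; what counts as invalid (for example a reversed range such as 5-3, or an empty entry) is an assumption here, not a statement about ktstr's parser.

```rust
// Expand one "N" or "A-B" entry; None when the entry is malformed.
fn parse_entry(entry: &str) -> Option<Vec<usize>> {
    let entry = entry.trim();
    match entry.split_once('-') {
        Some((lo, hi)) => {
            let lo: usize = lo.trim().parse().ok()?;
            let hi: usize = hi.trim().parse().ok()?;
            // Assumption: a reversed range is treated as invalid.
            if lo > hi {
                None
            } else {
                Some((lo..=hi).collect())
            }
        }
        None => entry.parse::<usize>().ok().map(|cpu| vec![cpu]),
    }
}

// Strict: the first malformed entry aborts the whole parse.
fn parse_strict(s: &str) -> Result<Vec<usize>, String> {
    let mut cpus = Vec::new();
    for entry in s.split(',') {
        match parse_entry(entry) {
            Some(expanded) => cpus.extend(expanded),
            None => return Err(format!("invalid CPU list entry: {:?}", entry)),
        }
    }
    Ok(cpus)
}

// Lenient: malformed entries are dropped and the rest is kept.
fn parse_lenient(s: &str) -> Vec<usize> {
    s.split(',').filter_map(parse_entry).flatten().collect()
}
```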
See also: CgroupManager for
set_cpuset() which consumes cpuset strings,
CgroupGroup for RAII cgroup
management, WorkloadHandle for
worker lifecycle, Scenarios for how CpusetSpec
drives cpuset partitioning.
Host-side reservation
TestTopology::numa_distance is also consumed at host-side
--cpu-cap plan time: acquire_llc_plan uses it to order the
spill from the seed NUMA node to nearest-by-distance neighbors
when the CPU budget cannot fit within a single node. The
resulting plan is a LOCK_SH reservation on every selected LLC
(flock granularity stays per-LLC even when the last LLC is
partial-taken for the CPU budget) with a cpuset.mems union
written to a cgroup v2 sandbox. See Resource Budget
for the full pipeline.
MemPolicy
MemPolicy controls NUMA memory placement for worker processes. It
wraps set_mempolicy(2) and is applied after fork, before the work
loop starts.
pub enum MemPolicy {
Default,
Bind(BTreeSet<usize>),
Preferred(usize),
Interleave(BTreeSet<usize>),
Local,
PreferredMany(BTreeSet<usize>),
WeightedInterleave(BTreeSet<usize>),
}
Variants
Default – inherit the parent process’s memory policy. No
set_mempolicy syscall is made.
Bind(nodes) – allocate only from the specified NUMA nodes
(MPOL_BIND). Allocation fails with ENOMEM if all specified nodes
are exhausted.
Preferred(node) – prefer allocations from the specified node,
falling back to others when the preferred node is full
(MPOL_PREFERRED).
Interleave(nodes) – interleave allocations round-robin across
the specified nodes (MPOL_INTERLEAVE).
Local – prefer the nearest node to the CPU where the allocation
occurs (MPOL_LOCAL). No nodemask.
PreferredMany(nodes) – prefer allocations from any of the
specified nodes, falling back to others when all preferred nodes are
full (MPOL_PREFERRED_MANY, kernel 5.15+).
WeightedInterleave(nodes) – weighted interleave across the
specified nodes. Page distribution is proportional to per-node weights
set via /sys/kernel/mm/mempolicy/weighted_interleave/nodeN
(MPOL_WEIGHTED_INTERLEAVE, kernel 6.9+).
Convenience constructors
MemPolicy::bind([0, 1])
MemPolicy::preferred(0)
MemPolicy::interleave([0, 1])
MemPolicy::preferred_many([0, 1])
MemPolicy::weighted_interleave([0, 1])
Node-set constructors (bind, interleave, preferred_many,
weighted_interleave) accept any IntoIterator<Item = usize> –
arrays, ranges, Vec, BTreeSet. preferred takes a single
usize node ID.
MpolFlags
MpolFlags provides optional mode flags OR’d into the
set_mempolicy(2) mode argument:
| Flag | Value | Description |
|---|---|---|
| NONE | 0 | No flags |
| STATIC_NODES | 1 << 15 | Nodemask is absolute, not remapped when the task’s cpuset changes |
| RELATIVE_NODES | 1 << 14 | Nodemask is relative to the task’s current cpuset |
| NUMA_BALANCING | 1 << 13 | Enable NUMA balancing optimization for this policy |
Flags combine with | or MpolFlags::union():
let flags = MpolFlags::STATIC_NODES | MpolFlags::NUMA_BALANCING;
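To make "OR’d into the mode argument" concrete, here is a stand-alone sketch using the bit values from the table. Flags is a hypothetical stand-in for MpolFlags, and mode_arg is an illustrative helper; the only outside fact used is that MPOL_BIND is mode 2 in the Linux UAPI headers.

```rust
// Hypothetical stand-in for MpolFlags, using the documented bit values.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Flags(u32);

#[allow(dead_code)]
impl Flags {
    const NONE: Flags = Flags(0);
    const STATIC_NODES: Flags = Flags(1 << 15);
    const RELATIVE_NODES: Flags = Flags(1 << 14);
    const NUMA_BALANCING: Flags = Flags(1 << 13);

    const fn union(self, other: Flags) -> Flags {
        Flags(self.0 | other.0)
    }
}

impl std::ops::BitOr for Flags {
    type Output = Flags;
    fn bitor(self, rhs: Flags) -> Flags {
        self.union(rhs)
    }
}

// MPOL_BIND as defined in the Linux UAPI mempolicy header.
const MPOL_BIND: u32 = 2;

// Illustrative helper: the value passed as set_mempolicy's mode argument.
fn mode_arg(mode: u32, flags: Flags) -> u32 {
    mode | flags.0
}
```

The mode occupies the low bits and the flags the high bits, so the two never collide when combined.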
Usage in WorkSpec and CgroupDef
WorkSpec and CgroupDef both expose .mem_policy() and
.mpol_flags() builder methods:
use ktstr::prelude::*;
let w = WorkSpec::default()
.workers(4)
.mem_policy(MemPolicy::bind([0]))
.mpol_flags(MpolFlags::STATIC_NODES);
let def = CgroupDef::named("cg_0")
.with_cpuset(CpusetSpec::numa(0))
.workers(4)
.mem_policy(MemPolicy::bind([0]));
Cpuset validation
When a cgroup has a cpuset, ktstr validates that the MemPolicy’s
node set is covered by the NUMA nodes reachable from that cpuset.
A MemPolicy::Bind([1]) on a cgroup whose cpuset covers only NUMA
node 0 will fail with an error at setup time.
Policies without a node set (Default, Local) skip validation.
node_set()
MemPolicy::node_set() returns the NUMA node IDs referenced by the
policy. Returns the node set for Bind, Interleave,
PreferredMany, and WeightedInterleave; a single-element set for
Preferred; and an empty set for Default/Local.
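The node_set() rules and the cpuset validation described above fit together naturally: a policy with an empty node set is trivially covered, which is why Default and Local skip validation. The sketch below models this with a reduced, hypothetical Policy enum and a hypothetical validate function; the error text is a placeholder.

```rust
use std::collections::BTreeSet;

// Reduced model of MemPolicy; only enough variants to show the rules.
#[allow(dead_code)]
#[derive(Debug)]
enum Policy {
    Default,
    Bind(BTreeSet<usize>),
    Preferred(usize),
    Interleave(BTreeSet<usize>),
    Local,
}

// NUMA node IDs referenced by the policy.
fn node_set(policy: &Policy) -> BTreeSet<usize> {
    match policy {
        Policy::Bind(nodes) | Policy::Interleave(nodes) => nodes.clone(),
        Policy::Preferred(node) => BTreeSet::from([*node]),
        Policy::Default | Policy::Local => BTreeSet::new(),
    }
}

// Hypothetical check: every referenced node must be reachable from the cpuset.
fn validate(policy: &Policy, cpuset_nodes: &BTreeSet<usize>) -> Result<(), String> {
    let nodes = node_set(policy);
    if nodes.is_subset(cpuset_nodes) {
        Ok(())
    } else {
        Err(format!(
            "policy nodes {:?} not covered by cpuset nodes {:?}",
            nodes, cpuset_nodes
        ))
    }
}
```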
NUMA checking
Page locality and migration results from workers using MemPolicy are
checked by the NUMA checking
assertions. The expected node set for
locality checks is derived from the worker’s MemPolicy at evaluation
time.
Example: NUMA-aware test
A complete test that checks page locality across two NUMA nodes:
use ktstr::prelude::*;
#[ktstr_test(
numa_nodes = 2, llcs = 4, cores = 4, threads = 1,
min_numa_nodes = 2,
min_page_locality = 0.8,
)]
fn numa_locality(ctx: &Ctx) -> Result<AssertResult> {
execute_defs(ctx, vec![
CgroupDef::named("node0")
.with_cpuset(CpusetSpec::numa(0))
.workers(4)
.mem_policy(MemPolicy::bind([0])),
CgroupDef::named("node1")
.with_cpuset(CpusetSpec::numa(1))
.workers(4)
.mem_policy(MemPolicy::bind([1])),
])
}
Each cgroup’s workers are pinned to a single NUMA node’s CPUs via
CpusetSpec::numa() and their memory allocations are bound to the
same node via MemPolicy::bind(). The min_page_locality threshold
fails the test if less than 80% of pages land on the expected node.
Performance Mode
Performance mode reduces noise during VM execution by applying host-side isolation (vCPU pinning, hugepages, NUMA mbind, RT scheduling). On x86_64, additionally: a guest-visible CPUID hint (KVM_HINTS_REALTIME) and KVM exit suppression (PAUSE and HLT VM exits disabled). On aarch64, the four host-side optimizations apply (vCPU pinning, hugepages, NUMA mbind, RT scheduling); KVM exit suppression and CPUID hints are not available.
What it does
On x86_64, seven optimizations are applied when performance_mode
is enabled (six host-side, one guest-visible via CPUID). On aarch64,
four of these apply (vCPU pinning, hugepages, NUMA mbind, RT
scheduling); the x86-specific items (PAUSE/HLT exit disabling,
KVM_HINTS_REALTIME CPUID, halt poll) are not available.
Host-side KVM_CAP_HALT_POLL is explicitly skipped on x86_64 —
the guest haltpoll cpuidle driver disables it via
MSR_KVM_POLL_CONTROL (see below):
vCPU pinning – each virtual LLC is mapped to a physical LLC
group on the host. vCPU threads are pinned to cores within their
assigned LLC via sched_setaffinity. This prevents the host scheduler
from migrating vCPU threads across LLCs, which would add cache
thrashing noise to measurements.
Hugepages – guest memory is allocated with 2MB hugepages
(MAP_HUGETLB) when sufficient free hugepages exist. This eliminates
TLB pressure from host-side page walks during guest execution.
NUMA mbind – guest memory is bound to the NUMA node(s) of the
pinned vCPUs via mbind(MPOL_BIND). This ensures memory allocations
are local to the CPUs executing vCPU threads, avoiding cross-node
memory access latency.
RT scheduling – vCPU threads are set to SCHED_FIFO priority 1.
The watchdog and monitor threads run at priority 2 on a dedicated
host CPU not assigned to any vCPU, so they can preempt for
timeout/sampling without competing for vCPU cores. The serial console
mutex uses PTHREAD_PRIO_INHERIT to avoid priority inversion between
RT vCPU threads and service threads.
Disable PAUSE VM exits (x86_64 only) – KVM_CAP_X86_DISABLE_EXITS with
KVM_X86_DISABLE_EXITS_PAUSE suppresses VM exits on PAUSE
instructions. Guest spinlocks execute PAUSE in tight loops; each
PAUSE normally causes a vmexit so the hypervisor can schedule other
vCPUs. With dedicated cores (vCPU pinning), this reschedule is
unnecessary overhead. The capability is optional –
if unsupported, a warning is logged and the VM proceeds without it.
Disable HLT VM exits (x86_64 only) – KVM_X86_DISABLE_EXITS_HLT suppresses
VM exits on HLT instructions, the most frequent exit type during
boot and idle. BSP shutdown detection uses I8042 reset (port 0x64,
value 0xFE via reboot=k) and VcpuExit::Shutdown instead of
VcpuExit::Hlt. KVM blocks HLT disable when mitigate_smt_rsb is
active (host has X86_BUG_SMT_RSB and cpu_smt_possible()); in that case,
only PAUSE exits are disabled.
KVM_HINTS_REALTIME CPUID (x86_64 only) – sets bit 0 of CPUID leaf 0x40000001 EDX, telling the guest kernel that vCPUs are pinned to dedicated host cores. The guest disables PV spinlocks, PV TLB flush, and PV sched_yield (all add hypercall overhead unnecessary on dedicated cores), and enables haltpoll cpuidle (polls briefly before halting, reducing wakeup latency). PV spinlocks require CONFIG_PARAVIRT_SPINLOCKS, which is not in ktstr.kconfig, so that disable is a no-op for ktstr guests.
Skip host-side halt poll (x86_64 only) – when a guest vCPU halts (executes
HLT with nothing to do), KVM can busy-wait briefly on the host
before putting the vCPU thread to sleep, reducing wakeup latency at
the cost of host CPU time. KVM_CAP_HALT_POLL controls this
per-VM ceiling. In performance mode it is not set because the guest
haltpoll cpuidle driver (enabled by KVM_HINTS_REALTIME above)
handles polling inside the guest and writes MSR_KVM_POLL_CONTROL=0
to disable host-side polling via kvm_arch_no_poll().
Non-performance-mode VMs set KVM_CAP_HALT_POLL to 200µs (matching
the x86 kernel default), or 0 when vCPUs exceed host CPUs.
Prerequisites
Sufficient host CPUs – the host must have at least
(llcs * cores_per_llc * threads_per_core) + 1 online CPUs. The extra
CPU is reserved for service threads (monitor, watchdog) so they do not
share a core with any RT vCPU. The host must also have at least as many
LLC groups as virtual LLCs.
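The CPU prerequisite is simple arithmetic, sketched here as a hypothetical pre-flight check (check_host_capacity is not a ktstr function, and the error strings are placeholders). For the 2-LLC, 4-core, 2-thread topology used later in this section it requires 16 vCPUs plus 1 service CPU, so 17 online host CPUs.

```rust
// Hypothetical pre-flight check mirroring the two host requirements.
fn check_host_capacity(
    llcs: usize,
    cores_per_llc: usize,
    threads_per_core: usize,
    host_online_cpus: usize,
    host_llc_groups: usize,
) -> Result<(), String> {
    let vcpus = llcs * cores_per_llc * threads_per_core;
    // One extra CPU is reserved for the monitor and watchdog threads.
    let needed = vcpus + 1;
    if host_online_cpus < needed {
        return Err(format!(
            "need {} host CPUs ({} vCPUs + 1 service CPU), found {}",
            needed, vcpus, host_online_cpus
        ));
    }
    if host_llc_groups < llcs {
        return Err(format!(
            "need {} host LLC groups, found {}",
            llcs, host_llc_groups
        ));
    }
    Ok(())
}
```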
2MB hugepages (optional) – the host must have free 2MB hugepages
(check /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages).
Without them, guest memory uses regular pages. A warning is printed.
CAP_SYS_NICE or rtprio limit (optional) – SCHED_FIFO requires
either CAP_SYS_NICE (root) or an RLIMIT_RTPRIO >= the requested
priority. Set an rtprio limit for non-root use:
# /etc/security/limits.conf
username - rtprio 99
Log out and back in for the limit to take effect. Without either capability, RT scheduling is skipped with a warning and vCPU threads run at normal priority (results may be noisy).
Validation
validate_performance_mode() runs during VM build and applies two
levels of checks:
Errors (fatal):
- Total vCPUs + 1 service CPU exceed available host CPUs.
- Virtual LLCs exceed available LLC groups.
- Pinning plan cannot be satisfied (an LLC group has fewer available CPUs than the virtual LLC requires).
- No free host CPU for service threads after vCPU assignment.
Warnings (non-fatal):
- Insufficient free hugepages – regular page allocation is used.
- Host load is high – procs_running from /proc/stat exceeds half the vCPU count, results may be noisy.
- TSC not stable (x86_64 only, checked at VM creation time) – KVM_CLOCK_TSC_STABLE not set after KVM_GET_CLOCK, kvmclock falls back to per-vCPU timekeeping. Timing measurements may have higher variance. Common in nested virtualization.
Usage
In #[ktstr_test]:
#[ktstr_test(
llcs = 2,
cores = 4,
threads = 2,
performance_mode = true,
)]
fn my_perf_test(ctx: &Ctx) -> Result<AssertResult> {
// vCPUs are pinned, hugepage-backed
Ok(AssertResult::pass())
}
Via the builder API:
let vm = vmm::KtstrVm::builder()
.kernel(&kernel_path)
.init_binary(&ktstr_binary)
.topology(1, 2, 4, 2)
.memory_mb(4096)
.performance_mode(true)
.build()?
.run()?;
When to use
Performance mode is for tests where host-side scheduling noise affects results – fairness spread measurements, scheduling gap detection, imbalance ratio checks. It is not needed for correctness tests (cpuset isolation, starvation detection) where pass/fail is binary.
The gauntlet runs many VMs in parallel. Performance mode on
parallel VMs can oversubscribe the host if scheduled naively.
Avoid performance_mode unless the host has enough CPUs for the
topology matrix.
Two dimensions
Performance mode serves two purposes:
Noise reduction – pinning, hugepages, NUMA mbind, and RT scheduling reduce measurement variance on both architectures. On x86_64, PAUSE and HLT VM exit disabling, the KVM_HINTS_REALTIME CPUID hint, and skipping host-side halt poll further reduce noise. Scheduling gaps, spread, and throughput checks become meaningful because host jitter is controlled. Without performance mode, a 50ms gap could be host noise; with it, the same gap indicates a scheduler problem.
Performance assertions – with stable measurements, tests can set
tight thresholds (max_gap_ms, min_iteration_rate,
max_p99_wake_latency_ns) to detect scheduling regressions. A test
using execute_steps_with can pass custom Assert checks that are
evaluated inside the guest against worker telemetry. These thresholds
are only meaningful under performance mode’s controlled environment.
Nextest parallelism
Performance-mode tests each consume one LLC group on the host.
The vm-perf test group in .config/nextest.toml sets a static
max-threads limit. The flock-based LLC slot reservation
(acquire_resource_locks) handles runtime contention: if all LLC
slots are busy, the test returns ResourceContention.
On contention, the test returns exit code 0 (skip) – it never ran.
The SKIP: prefix in stderr distinguishes skips from real passes.
LLC exclusivity validation
When performance_mode is enabled, the build step validates LLC
exclusivity: each virtual LLC must reserve the entire physical
LLC group it maps to. The validation sums the actual CPU count of
each LLC group and checks the total (plus service CPU) fits within
the host’s online CPUs. If validation fails, the build returns an
error (tests skip with ResourceContention).
Three-way mode tier
ktstr’s host-side resource coordination has three effective tiers,
selected by the combination of performance_mode,
--no-perf-mode/KTSTR_NO_PERF_MODE, and --cpu-cap/KTSTR_CPU_CAP:
Tier 1: performance mode (full isolation)
Enabled when performance_mode=true is set on the VM builder (or
via #[ktstr_test(performance_mode = true)]). Acquires LOCK_EX
on each selected LLC’s /tmp/ktstr-llc-{N}.lock — the LLC-level
exclusive lock already covers every CPU in the group, so per-CPU
/tmp/ktstr-cpu-{C}.lock files are NOT touched
(try_acquire_all in vmm/host_topology.rs short-circuits the
per-CPU loop when LlcLockMode == Exclusive). Applies every
isolation feature listed under “What it does”: vCPU pinning via
sched_setaffinity, 2 MB hugepages, NUMA mbind, RT SCHED_FIFO
scheduling, and (x86_64) PAUSE/HLT exit suppression +
KVM_HINTS_REALTIME CPUID.
Tier 2: no-perf-mode with CPU-cap reservation
Enabled by --no-perf-mode / KTSTR_NO_PERF_MODE=1. Every
no-perf-mode VM goes through acquire_llc_plan: the reservation
is LOCK_SH across a NUMA-aware, consolidation-aware set of
LLCs, sized to meet the CPU budget — either --cpu-cap N (or
KTSTR_CPU_CAP=N) if set, or 30% of the calling process’s
sched_getaffinity cpuset (minimum 1) if not. The flock granularity
stays per-LLC; plan.cpus holds EXACTLY the budget (partial-take
on the last LLC when the budget falls mid-LLC). Multiple
no-perf-mode VMs coexist on the same LLCs because shared flocks
do not conflict with one another; a concurrent perf-mode VM attempting
LOCK_EX blocks until every no-perf-mode peer has released.
Enforcement under --cpu-cap:
- cgroup v2 cpuset sandbox – the reserved CPUs and derived NUMA nodes are written to a child cgroup’s cpuset.cpus and cpuset.mems, and the build pid is migrated into that cgroup, so make -jN gcc children inherit the binding. Under --cpu-cap, narrowing by a parent cgroup is a fatal error; without the flag (but with acquire_llc_plan still running on the 30% default) the sandbox warns and proceeds.
- Soft-mask affinity – vCPU threads receive a sched_setaffinity mask covering only the reserved CPUs, so the guest’s CPU placement respects the budget even though no pinning is applied.
- No RT scheduling, no hugepages, no mbind, no KVM exit suppression – these remain off; --cpu-cap is not a partial performance mode.
- make -jN hint – kernel-build pipelines pass plan.cpus.len() to make so gcc’s fan-out matches the reserved capacity rather than nproc.
This tier is mutually exclusive with performance_mode=true (on
the CLI, clap requires = "no_perf_mode" rejects --cpu-cap
without --no-perf-mode at parse time) and with
KTSTR_BYPASS_LLC_LOCKS=1 (rejected at every entry point because
the contract and the bypass escape hatch are contradictory).
Library consumers that set performance_mode=true on
KtstrVmBuilder directly bypass the CLI parse — KTSTR_CPU_CAP
is silently ignored in that path because the builder’s perf-mode
branch never consults CpuCap::resolve.
See Resource Budget for the CpuCap,
LlcPlan, and ktstr locks surfaces in detail.
Tier 3: default (per-CPU window + LLC LOCK_SH)
Selected when neither performance_mode=true nor
--no-perf-mode/KTSTR_NO_PERF_MODE is set — the default path
for #[ktstr_test] entries that don’t declare performance_mode
(entry.rs KtstrTestEntry::DEFAULT sets performance_mode: false). acquire_cpu_locks (in vmm/host_topology.rs) walks a
contiguous CPU window, takes LOCK_EX on each window CPU’s
/tmp/ktstr-cpu-{C}.lock, then additionally takes LOCK_SH on
the LLC lockfiles covering those CPUs so a perf-mode (tier 1)
VM cannot grab LOCK_EX on an LLC that this path is using. No
pinning, no isolation, no cgroup sandbox — the per-CPU reservation
is purely for host-scheduling-noise avoidance between concurrent
VMs.
This is the ONLY tier that actually flocks per-CPU lockfiles. Tier 1 skips them (LLC EX already covers all CPUs in the group); tier 2 skips them (capped LLC SH is enforced via the cgroup cpuset and the flock is sufficient per-LLC coordination).
Disabling performance mode
--no-perf-mode (or KTSTR_NO_PERF_MODE=1) forces
performance_mode=false. The result is tier 2 above — a
CPU-capped LOCK_SH reservation (either explicit --cpu-cap N
or the 30%-of-allowed default). The feature differences
relative to tier 1 are:
- LLC flock mode – tier 1 holds LOCK_EX on each reserved LLC; tier 2 holds LOCK_SH. Multiple shared holders coexist; an exclusive holder blocks every shared acquirer and vice-versa.
- Per-CPU flocks – tier 1 relies on LLC-level LOCK_EX for exclusivity; per-CPU /tmp/ktstr-cpu-{C}.lock files are skipped (try_acquire_all in vmm/host_topology.rs short-circuits the per-CPU loop when LlcLockMode == Exclusive because the LLC lock already covers every CPU in the group). Tier 2 also skips them: the cgroup cpuset is the enforcement layer.
- vCPU pinning – tier 1 pins via sched_setaffinity to the reserved LLC’s CPUs. Tier 2 applies soft-mask affinity (budget-scoped but no 1:1 vCPU-to-CPU binding).
- RT scheduling – tier 1 only; tier 2 runs vCPU threads at normal priority.
- Hugepages – tier 1 only; tier 2 uses regular pages.
- NUMA mbind – tier 1 only; tier 2 instead writes cpuset.mems on its child cgroup to achieve NUMA locality at the cgroup layer.
- KVM exit suppression (x86_64) – tier 1 only; tier 2 leaves PAUSE and HLT exits enabled.
- KVM_HINTS_REALTIME CPUID (x86_64) – tier 1 only; tier 2 leaves the guest on PV spinlocks and standard cpuidle.
Use tier 2 on multi-tenant hosts where you want bounded concurrency
(at most N concurrent builds or no-perf-mode VMs per host) but
cannot afford the full perf-mode contract. Use tier 1 for
regression measurement where host jitter must be controlled.
Available via:
- ktstr shell --no-perf-mode
- cargo ktstr test --no-perf-mode
- cargo ktstr coverage --no-perf-mode
- cargo ktstr llvm-cov --no-perf-mode
- cargo ktstr shell --no-perf-mode
- KTSTR_NO_PERF_MODE=1 (any value; presence is sufficient)
--cpu-cap N layers on top of any of the above when present;
when absent, the 30%-of-allowed default applies automatically.
The env var is read by every VM builder call site (test harness,
auto-repro, verifier, shell). The CLI flags set the env var
before test execution so library consumers inherit it.
Resource Budget
--cpu-cap N adds a third tier between full performance-mode
isolation and unreserved no-perf-mode execution. Instead of
“lock each reserved LLC exclusively” (perf-mode), it reserves a
NUMA-aware, consolidation-aware set of host CPUs under LOCK_SH,
enforces the reservation via a cgroup v2 cpuset sandbox, and scales
make -jN fan-out to the reserved capacity. The flock granularity
stays per-LLC: every selected LLC is flocked whole, but plan.cpus
holds EXACTLY N CPUs (the last LLC is partial-taken when the
budget falls mid-LLC). See
Performance Mode for
the comparison against the other two tiers.
Every no-perf-mode VM and kernel build runs through this pipeline
— there is no “no cap” path. When --cpu-cap is absent, the
planner applies a 30% default of the calling process’s
sched_getaffinity cpuset (minimum 1 CPU). This keeps
sched_setaffinity safe under cgroup-restricted runners (CI
hosts, systemd slices, sudo-under-a-limited-cpuset) where the
process cannot run on every online CPU even if sysfs lists them.
When to use it
- Multi-tenant CI hosts where unbounded parallelism starves concurrent builds but the full performance-mode contract (SCHED_FIFO, hugepages, NUMA mbind, KVM exit suppression) is too heavy.
- Kernel builds run alongside perf-mode VM tests – the shared LOCK_SH coordinates with the perf-mode LOCK_EX so make never stomps a measurement in progress.
- Concurrent no-perf-mode VMs on a shared host – a cap of N CPUs bounds how much capacity each run reserves; peers that would exceed the host’s flock availability wait rather than racing for CPU.
CpuCap — parsed and resolved
CpuCap::new(N: usize) -> Result<CpuCap> constructs a cap from a
CLI integer. N is a CPU count. N == 0 is rejected with
--cpu-cap must be ≥ 1 CPU (got 0) — zero is a scripting sentinel,
not a silent “no cap” fallback.
CpuCap::resolve(cli_flag: Option<usize>) -> Result<Option<CpuCap>>
is the three-tier precedence:
- CLI flag (--cpu-cap N) wins over env var.
- KTSTR_CPU_CAP=N env var applies when the CLI flag is absent. Empty string is treated as unset; 0 or non-numeric values produce the same rejection as the CLI path.
- Neither set → Ok(None). The planner expands this into the 30%-of-allowed default at acquire time.
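The precedence above can be sketched as a pure function. This is an illustration, not ktstr's implementation: Cap, new_cap and resolve are hypothetical names, the environment value is passed in as a parameter instead of being read from the process environment (which keeps the sketch deterministic), and only the zero-rejection message is taken from the text; the non-numeric error string is a placeholder.

```rust
// Hypothetical stand-in for CpuCap: a validated CPU count.
#[derive(Debug, PartialEq)]
struct Cap(usize);

fn new_cap(n: usize) -> Result<Cap, String> {
    if n == 0 {
        Err("--cpu-cap must be ≥ 1 CPU (got 0)".to_string())
    } else {
        Ok(Cap(n))
    }
}

// CLI flag first, then the env value, then "not set".
fn resolve(cli_flag: Option<usize>, env_value: Option<&str>) -> Result<Option<Cap>, String> {
    if let Some(n) = cli_flag {
        return new_cap(n).map(Some);
    }
    match env_value {
        // Unset and empty string are both treated as "no cap given".
        None => Ok(None),
        Some(s) if s.is_empty() => Ok(None),
        Some(s) => {
            let n: usize = s
                .parse()
                .map_err(|_| format!("KTSTR_CPU_CAP is not a CPU count: {:?}", s))?;
            new_cap(n).map(Some)
        }
    }
}
```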
CpuCap::effective_count(allowed_cpus: usize) -> Result<usize>
clamps at acquire time, not construction time.
N > allowed_cpus returns a ResourceContention error naming
both numbers — operators reading the error see immediately that
the cap exceeds the process’s sched_getaffinity cpuset, not the
host’s total online CPU count. Fixing the cap requires either
lowering N or releasing the cgroup restriction on the calling
process.
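Combining the explicit cap with the 30% default gives the CPU budget the planner works toward. The sketch below uses a hypothetical target_cpus function; rounding the 30% default down with integer division is an assumption (the text only states "30%" and "minimum 1"), and the error string is a placeholder.

```rust
// Hypothetical helper: the CPU budget for a plan.
// `cap` is the resolved --cpu-cap value, if any.
// `allowed_cpus` is the size of the process's sched_getaffinity cpuset.
fn target_cpus(cap: Option<usize>, allowed_cpus: usize) -> Result<usize, String> {
    match cap {
        Some(n) if n > allowed_cpus => Err(format!(
            "cpu cap {} exceeds the {} CPUs this process is allowed to run on",
            n, allowed_cpus
        )),
        Some(n) => Ok(n),
        // Assumption: 30% is rounded down, then clamped to at least 1 CPU.
        None => Ok((allowed_cpus * 30 / 100).max(1)),
    }
}
```

Under these assumptions a 16-CPU allowance with no explicit cap yields a budget of 4 CPUs, and a 2-CPU allowance yields the minimum of 1.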
host_allowed_cpus — the reference set
host_allowed_cpus() reads the calling process’s allowed CPUs
via sched_getaffinity(0) with a /proc/self/status
Cpus_allowed_list: fallback. Every consumer of the --cpu-cap
pipeline plans against this set instead of
HostTopology::online_cpus, so sched_setaffinity on the plan’s
CPU list never produces an empty effective mask under a
cgroup-restricted runner.
An empty allowed set is a bail condition, not a fallback to
“every CPU” — guessing on a misconfigured host is worse than
failing visibly. A host topology that has no LLC overlapping the
allowed set (sysfs and sched_getaffinity disagree — e.g. stale
sysfs after hot-plug, cgroup cpuset pinned to CPUs the kernel no
longer reports in LLC groups) also bails with an actionable
diagnostic.
LlcPlan — the ACQUIRE result
acquire_llc_plan(topo, test_topo, cpu_cap) runs three phases:
- DISCOVER: for every LLC, stat the canonical /tmp/ktstr-llc-{N}.lock, read /proc/locks once, and build a snapshot of holders per LLC. No flocks are taken.
- PLAN: rank LLCs (eligible = at least one allowed CPU): consolidation (prefer LLCs with existing holders) first, then fresh LLCs, all tiebroken by ascending index. Seed on the highest-scored LLC’s NUMA node; greedily fill that node before spilling to nearest-by-distance nodes via TestTopology::numa_distance. Accumulate allowed-CPU contribution per LLC until the accumulated count meets target_cpus. Final acquire order is ascending LLC index for livelock safety.
- ACQUIRE: non-blocking LOCK_SH on every selected LLC. A single EWOULDBLOCK drops every held fd and retries once (one TOCTOU retry; the second DISCOVER’s /proc/locks read IS the backoff, and more retries would amplify livelock risk without adding coordination signal).
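The PLAN ranking can be approximated as follows. This sketch deliberately ignores the NUMA seeding and distance-based spill, and it treats "has holders" as a yes/no preference; whether ktstr weighs the holder count is not stated, so that is an assumption. LlcInfo and plan_llcs are hypothetical names.

```rust
/// Hypothetical per-LLC snapshot row produced by a DISCOVER-like pass.
#[derive(Debug, Clone)]
pub struct LlcInfo {
    pub index: usize,
    pub holders: usize,      // peers already holding this LLC's lockfile
    pub allowed_cpus: usize, // CPUs of this LLC inside the allowed set
}

/// Returns the LLC indices to lock, in ascending order, or `None` when the
/// eligible LLCs cannot supply `target_cpus`.
pub fn plan_llcs(llcs: &[LlcInfo], target_cpus: usize) -> Option<Vec<usize>> {
    // Eligible = at least one allowed CPU.
    let mut eligible: Vec<&LlcInfo> = llcs.iter().filter(|l| l.allowed_cpus > 0).collect();
    // Consolidation first (`false` sorts before `true`), then ascending index.
    eligible.sort_by_key(|l| (l.holders == 0, l.index));

    let mut picked = Vec::new();
    let mut total = 0;
    for llc in eligible {
        if total >= target_cpus {
            break;
        }
        picked.push(llc.index);
        total += llc.allowed_cpus;
    }
    if total < target_cpus {
        return None;
    }
    // Acquire order is always ascending index, regardless of ranking order.
    picked.sort_unstable();
    Some(picked)
}
```

Sorting the final selection separately from the ranking keeps the two concerns apart: ranking decides which LLCs are chosen, while the ascending acquire order is what every peer agrees on.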
Partial-take on the last LLC
Post-ACQUIRE, the materialization layer walks each selected LLC’s
CPUs in ascending order, intersects with the allowed set, and
STOPS at exactly target_cpus total. The last selected LLC
typically contributes only a prefix of its allowed CPUs — the
flock is still held at LLC granularity (coordination with
concurrent ktstr peers is always per-LLC), but plan.cpus
reflects the exact CPU budget. sched_setaffinity masks and
cgroup cpuset.cpus writes narrow to that exact set.
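The partial-take walk is easy to picture in code. The following is an illustrative sketch with a hypothetical materialize_cpus function; the real materialization layer works on ktstr's own topology types.

```rust
use std::collections::BTreeSet;

/// Walks each selected LLC's CPUs in ascending order, keeps only CPUs in
/// `allowed`, and stops as soon as `target_cpus` CPUs are collected.
/// `selected_llcs` holds one CPU list per locked LLC, in acquire order.
pub fn materialize_cpus(
    selected_llcs: &[Vec<usize>],
    allowed: &BTreeSet<usize>,
    target_cpus: usize,
) -> Vec<usize> {
    let mut cpus = Vec::with_capacity(target_cpus);
    for llc in selected_llcs {
        let mut sorted = llc.clone();
        sorted.sort_unstable();
        for cpu in sorted {
            if cpus.len() == target_cpus {
                // The last LLC contributes only a prefix of its CPUs.
                return cpus;
            }
            if allowed.contains(&cpu) {
                cpus.push(cpu);
            }
        }
    }
    // Shorter than `target_cpus` only if planning under-selected LLCs.
    cpus
}
```

With LLCs {0..3} and {4..7}, an allowed set of 1 through 6, and a target of 5, the first LLC contributes CPUs 1, 2, 3 and the second contributes only 4 and 5, even though its flock covers the whole LLC.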
The returned LlcPlan carries:
- locked_llcs: Vec<usize> – selected host LLC indices, ASC.
- cpus: Vec<usize> – flat list of reserved CPUs, sized exactly target_cpus (drawn from the selected LLCs’ allowed CPUs, with the last LLC possibly contributing only a prefix).
- mems: BTreeSet<usize> – NUMA nodes actually hosting plan.cpus (an LLC that contributes a partial slice only registers the nodes of its used CPUs).
- snapshot: Vec<LlcSnapshot> – per-LLC discovery trail.
- locks: Vec<OwnedFd> – RAII flock handles; Drop releases.
When mems spans more than one node
(warn_if_cross_node_spill fires), stderr gets a ktstr: reserving LLCs […] across N NUMA nodes warning so the operator
knows to expect cross-node memory latency. Single-node plans are
silent.
Cgroup v2 cpuset sandbox
BuildSandbox::try_create(plan_cpus, plan_mems, hard_error_on_degrade)
writes the plan into a child cgroup under the caller’s own cgroup,
in the kernel-required order: cpuset.cpus → cpuset.mems →
cgroup.procs. A task in a cgroup with empty cpuset.mems may
be killed by the cpuset allocator, so migration into
cgroup.procs MUST happen after both cpuset fields are populated.
After each cpuset write, .effective is read back. Narrowing by
a parent cgroup (e.g. systemd slice restriction) is a fatal error
under --cpu-cap (hard_error_on_degrade = true) and a warn-
only degrade without the flag.
Drop migrates the build pid back to root, tolerates transient
EBUSY on cgroup.rmdir (5 × 10 ms retries), and orphans the
directory with a tag=resource_budget.cgroup_orphan_left warn-
log if the rmdir still refuses. Orphans older than 24 h are
swept on the next sandbox creation.
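The bounded EBUSY retry in Drop follows a common pattern, sketched here in isolation. retry_while_busy is a hypothetical helper, the EBUSY value is hard-coded for Linux to avoid an external crate, and whether ktstr counts "5 × 10 ms" as five attempts in total or five retries after a first attempt is not specified; the sketch assumes five attempts.

```rust
use std::io;
use std::thread;
use std::time::Duration;

/// EBUSY on Linux, hard-coded so the sketch needs no external crate.
const EBUSY: i32 = 16;

/// Runs `op` up to `attempts` times, sleeping `delay` after each EBUSY.
/// Any other error is returned immediately. If every attempt is busy, the
/// last EBUSY is returned so the caller can log and orphan the directory.
pub fn retry_while_busy<F>(mut op: F, attempts: u32, delay: Duration) -> io::Result<()>
where
    F: FnMut() -> io::Result<()>,
{
    let mut last = io::Error::from_raw_os_error(EBUSY);
    for _ in 0..attempts {
        match op() {
            Ok(()) => return Ok(()),
            Err(e) if e.raw_os_error() == Some(EBUSY) => {
                last = e;
                thread::sleep(delay);
            }
            Err(e) => return Err(e),
        }
    }
    Err(last)
}
```

A caller could wrap the directory removal as retry_while_busy(|| std::fs::remove_dir(&path), 5, Duration::from_millis(10)) and fall back to the orphan warning on Err.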
make -jN hint
make_jobs_for_plan(plan) returns plan.cpus.len().max(1). The
kernel-build pipeline threads this as make -jN. Without the
hint, make -j$(nproc) fans gcc children across every online
CPU, defeating the cpuset reservation in scheduling terms — the
kernel still enforces cpuset membership at the fs layer, but
gcc’s parallel width silently violates the budget. The .max(1)
floor guards against make -j0, which GNU make rejects because the -j argument must be a positive integer.
ktstr locks — observational surface
ktstr locks (or cargo ktstr locks) prints every ktstr flock
currently held on the host, cross-referenced against
/proc/locks to name each holder by PID + truncated cmdline.
Read-only — takes no flocks. Four categories:
- LLC locks under /tmp/ktstr-llc-*.lock
- Per-CPU locks under /tmp/ktstr-cpu-*.lock
- Cache-entry locks under {cache_root}/.locks/*.lock
- Run-dir locks under {runs_root}/.locks/{kernel}-{project_commit}.lock, held for the duration of the (pre-clear + write) cycle by serialize_and_write_sidecar so two concurrent ktstr processes targeting the same run-dir key serialize on the sidecar write rather than tearing each other’s mid-write files.
Flags:
- --json: emit a structured snapshot. One-shot uses to_string_pretty for readability; under --watch each frame is compact on its own line (ndjson-style) for streaming consumers. Top-level keys: llcs, cpus, cache, run_dirs. Each row names its lockfile path and a holders array; every holder has pid + cmdline.
- --watch <interval>: redraw on the given interval until SIGINT. Interval uses humantime syntax (100ms, 1s, 5m, 1h).
Use ktstr locks when --cpu-cap acquires fail with
ResourceContention: the error already names busy LLCs, but the
live snapshot shows every contending peer at once.
KTSTR_BYPASS_LLC_LOCKS — escape hatch
Setting KTSTR_BYPASS_LLC_LOCKS=1 (any non-empty value) skips
acquire_llc_plan entirely. The VM boots or the kernel
builds immediately without coordinating against any concurrent
perf-mode run. Use only when the operator explicitly accepts
measurement noise:
- A shell session doing unrelated work alongside tests.
- An isolated developer workstation.
- A CI queue that already serializes jobs at a higher layer.
Mutually exclusive with --cpu-cap / KTSTR_CPU_CAP at every
entry point (CLI parse for shell + kernel build on both
ktstr and cargo ktstr, the kernel_build_pipeline reservation
phase, and the library-layer KtstrVmBuilder::build no-perf-mode
branch). The error wording always contains "resource contract"
so operators can grep for it; the contract and the bypass cannot
coexist at any of those six sites.
Note: the performance_mode=true vs --cpu-cap exclusion is
weaker. It is enforced at CLI parse (shell --cpu-cap requires
--no-perf-mode via clap requires), but library consumers that
set performance_mode=true on KtstrVmBuilder directly see
KTSTR_CPU_CAP silently ignored — the builder’s perf-mode branch
never calls CpuCap::resolve, it goes through
validate_performance_mode + acquire_resource_locks
(LOCK_EX) instead.
Filesystem requirement
Every ktstr lockfile (/tmp/ktstr-llc-*.lock,
/tmp/ktstr-cpu-*.lock, {cache_root}/.locks/*.lock,
{runs_root}/.locks/*.lock) must live on a local filesystem —
tmpfs, ext4, xfs, btrfs, f2fs, or bcachefs are the
explicitly-accepted set. flock(2) behavior
on NFS, CIFS, SMB2, CephFS, AFS, and FUSE is unreliable: NFSv3
is advisory-only without an NLM peer and NFSv4 byte-range
locking does not cover flock(2); SMB does not emit
/proc/locks entries so ktstr cannot enumerate peer holders;
Ceph MDS does not participate in flock serialization across
nodes; AFS does not support flock(2) at all; FUSE flock
semantics depend on whether the userspace server implements the
op. try_flock statfs-checks every lockfile path at open time
via reject_remote_fs in src/flock.rs — hitting any
deny-listed filesystem produces an actionable runtime error
naming the filesystem plus the remediation “Move the lockfile
path to a local filesystem (tmpfs, ext4, xfs, btrfs, f2fs,
bcachefs).” Unknown local filesystems (zfs, erofs, etc.) are
not on the deny-list and pass through, on the basis that
rejecting unknown-but-local is more disruptive than accepting a
potentially-unreliable flock.
Related
- Performance Mode: the full-isolation tier; the tier comparison lives there.
- Environment Variables: KTSTR_CPU_CAP, KTSTR_BYPASS_LLC_LOCKS, and every other ktstr-controlled env var.
Writing Tests
Tests are Rust functions annotated with #[ktstr_test]. Each test
boots a KVM VM, runs the scenario inside it, and evaluates results
on the host.
use ktstr::prelude::*;
#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("cg_0").workers(2),
        CgroupDef::named("cg_1").workers(2),
    ])
}
Run with cargo ktstr test --kernel ../linux. See
Getting Started for setup and
The #[ktstr_test] Macro for all
available attributes. Each test also generates gauntlet variants across
topology presets and flag profiles. See
Gauntlet Tests.
For scenarios that need logic beyond what the ops system can express,
see Custom Scenarios.
The #[ktstr_test] Macro
#[ktstr_test] registers a function as an integration test that runs
inside a VM.
Basic usage
use ktstr::prelude::*;
#[ktstr_test(llcs = 2, cores = 4, threads = 2)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
    // ctx provides cgroup manager, topology, duration, etc.
    Ok(AssertResult::pass())
}
When a scheduler with a default topology is specified, the topology can be omitted:
use ktstr::declare_scheduler;
use ktstr::prelude::*;
declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    // numa, llcs, cores/llc, threads/core
    topology = (1, 2, 4, 1),
});
#[ktstr_test(scheduler = MY_SCHED)]
fn inherited_topo(ctx: &Ctx) -> Result<AssertResult> {
    // Inherits 1n2l4c1t from MY_SCHED
    Ok(AssertResult::pass())
}
declare_scheduler! emits a pub static MY_SCHED: Scheduler and
registers a private linkme static in the KTSTR_SCHEDULERS
distributed slice. The scheduler = slot expects
&'static Scheduler — pass the bare MY_SCHED ident; the macro
takes a reference internally.
The function must have signature
fn(&ktstr::scenario::Ctx) -> anyhow::Result<ktstr::assert::AssertResult>.
What the macro generates
- Renames the function to __ktstr_inner_{name}.
- Registers it in the KTSTR_TESTS distributed slice via linkme.
- Emits a #[test] wrapper that calls run_ktstr_test().
The #[test] wrapper boots a VM with the specified topology and runs
the function inside it.
Attributes
All attributes are optional with defaults.
Topology
| Attribute | Default | Description |
|---|---|---|
| llcs | inherited | Number of LLCs |
| numa_nodes | inherited | Number of NUMA nodes |
| cores | inherited | Cores per LLC |
| threads | inherited | Threads per core |
| memory_mb | 2048 | VM memory in MB |
Each dimension independently inherits from Scheduler.topology when
a scheduler is specified and that dimension is not explicitly set.
Without a scheduler, unset dimensions use macro defaults (numa_nodes=1,
llcs=1, cores=2, threads=1). The default is a single-NUMA topology,
so most tests do not need to set numa_nodes. See
Default topology.
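Per-dimension inheritance amounts to an independent fallback for each field. The sketch below models it with hypothetical Topo and TopoAttrs types (not the macro's real data structures); only the macro defaults numa_nodes=1, llcs=1, cores=2, threads=1 come from the text above.

```rust
/// Hypothetical resolved topology.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Topo {
    pub numa_nodes: u32,
    pub llcs: u32,
    pub cores: u32,
    pub threads: u32,
}

/// Macro defaults used when no scheduler supplies a topology.
pub const MACRO_DEFAULTS: Topo = Topo { numa_nodes: 1, llcs: 1, cores: 2, threads: 1 };

/// Hypothetical view of the attributes written on `#[ktstr_test(...)]`;
/// `None` means the attribute was omitted.
#[derive(Debug, Clone, Copy, Default)]
pub struct TopoAttrs {
    pub numa_nodes: Option<u32>,
    pub llcs: Option<u32>,
    pub cores: Option<u32>,
    pub threads: Option<u32>,
}

/// Each dimension falls back on its own: explicit attribute first, then the
/// scheduler's topology, then the macro default.
pub fn resolve_topology(attrs: TopoAttrs, scheduler: Option<Topo>) -> Topo {
    let base = scheduler.unwrap_or(MACRO_DEFAULTS);
    Topo {
        numa_nodes: attrs.numa_nodes.unwrap_or(base.numa_nodes),
        llcs: attrs.llcs.unwrap_or(base.llcs),
        cores: attrs.cores.unwrap_or(base.cores),
        threads: attrs.threads.unwrap_or(base.threads),
    }
}
```

Setting only cores = 8 on a test whose scheduler declares (1, 2, 4, 1) therefore resolves to 1 NUMA node, 2 LLCs, 8 cores per LLC, and 1 thread per core.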
Scheduler
| Attribute | Default | Description |
|---|---|---|
| scheduler = CONST | &Scheduler::EEVDF | Rust const path to a &'static Scheduler. The bare const emitted by declare_scheduler! (e.g. MY_SCHED) is the expected form. With the default Scheduler::EEVDF, tests without an explicit scheduler = run under the kernel’s default scheduler (EEVDF on Linux 6.6+). |
| extra_sched_args = [...] | [] | Extra CLI args for the scheduler, appended after Scheduler::sched_args. |
| watchdog_timeout_s | 5 | scx watchdog override (seconds). Applied via scx_sched.watchdog_timeout on 7.1+ kernels (BTF-detected) and via the static scx_watchdog_timeout symbol on pre-7.1 kernels. When neither path is available the override silently no-ops. |
Payloads
| Attribute | Default | Description |
|---|---|---|
| payload = CONST | None | Rust const path to a binary-kind Payload (PayloadKind::Binary). Populates KtstrTestEntry::payload; the test body can run it via ctx.payload(&CONST). Scheduler-kind payloads are rejected at compile time; use the scheduler = … slot for those. |
| workloads = [CONST, …] | [] | Array of binary-kind Payload const paths composed alongside the primary payload. Each entry is runnable from the test body via ctx.payload(&CONST); the include-file pipeline packages every referenced binary into the guest automatically. |
| extra_include_files = ["path", …] | [] | Array of string-literal paths to extra host-side files (datasets, fixture configs, helper scripts) that the framework packages into the guest initramfs alongside the binaries declared by scheduler / payload / workloads. Maps onto KtstrTestEntry::extra_include_files (&'static [&'static str]); union with per-payload Payload::include_files is computed at run time via KtstrTestEntry::all_include_files. Use this slot for test-level dependencies that don’t belong on a specific Payload. |
See Payload Definitions for
authoring new Payload fixtures; tests/common/fixtures.rs carries
reusable examples (SCHBENCH, SCHBENCH_HINTED, SCHBENCH_JSON).
Checking
| Attribute | Default | Description |
|---|---|---|
| not_starved | inherited | Enable starvation (zero work units), fairness spread, and scheduling gap checks |
| isolation | inherited | Enable cpuset isolation check (workers must stay on assigned CPUs) |
| max_gap_ms | inherited | Max scheduling gap threshold |
| max_spread_pct | inherited | Max fairness spread threshold |
| max_throughput_cv | inherited | Max coefficient of variation for worker throughput |
| min_work_rate | inherited | Minimum work_units per CPU-second per worker |
| max_imbalance_ratio | inherited | Monitor imbalance ratio |
| max_local_dsq_depth | inherited | Monitor DSQ depth |
| fail_on_stall | inherited | Fail on stall detection |
| sustained_samples | inherited | Sample window for sustained violations |
| max_fallback_rate | inherited | Max fallback event rate |
| max_keep_last_rate | inherited | Max keep-last event rate |
| max_p99_wake_latency_ns | inherited | Max p99 wake latency in nanoseconds |
| max_wake_latency_cv | inherited | Max wake latency coefficient of variation |
| min_iteration_rate | inherited | Minimum iterations per wall-clock second per worker |
| max_migration_ratio | inherited | Max migration ratio (migrations/iterations) per cgroup |
| min_page_locality | inherited | Min fraction of pages on expected NUMA nodes (0.0-1.0) |
| max_cross_node_migration_ratio | inherited | Max ratio of NUMA-migrated pages to total pages (0.0-1.0) |
| max_slow_tier_ratio | inherited | Max fraction of pages on memory-only (CXL) nodes (0.0-1.0) |
not_starved = true enables three distinct checks: starvation (any
worker with zero work units), fairness spread (max-min off-CPU% below
max_spread_pct), and scheduling gaps (longest gap below max_gap_ms).
Each threshold can be overridden independently. See
Customize Checking for
override examples and Checking for
the merge chain.
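To make the three checks concrete, here is a small standalone model. WorkerStats and check_not_starved are hypothetical, the checks run over one flat slice of workers (ktstr's grouping is not described here), and treating a value exactly equal to a threshold as passing is an assumption.

```rust
/// Hypothetical per-worker summary; ktstr's `WorkerReport` carries more.
pub struct WorkerStats {
    pub work_units: u64,
    pub off_cpu_pct: f64,
    pub max_gap_ms: u64,
}

/// Returns one message per failed check; an empty vector means all passed.
pub fn check_not_starved(
    workers: &[WorkerStats],
    max_spread_pct: f64,
    max_gap_ms: u64,
) -> Vec<String> {
    let mut failures = Vec::new();

    // 1. Starvation: any worker with zero work units.
    if workers.iter().any(|w| w.work_units == 0) {
        failures.push("starvation: a worker completed zero work units".to_string());
    }

    // 2. Fairness spread: max minus min off-CPU percentage.
    if !workers.is_empty() {
        let hi = workers.iter().map(|w| w.off_cpu_pct).fold(f64::NEG_INFINITY, f64::max);
        let lo = workers.iter().map(|w| w.off_cpu_pct).fold(f64::INFINITY, f64::min);
        if hi - lo > max_spread_pct {
            failures.push(format!("spread {:.1}% exceeds {:.1}%", hi - lo, max_spread_pct));
        }
    }

    // 3. Scheduling gaps: the longest gap seen by any worker.
    let longest = workers.iter().map(|w| w.max_gap_ms).max().unwrap_or(0);
    if longest > max_gap_ms {
        failures.push(format!("gap {longest} ms exceeds {max_gap_ms} ms"));
    }

    failures
}
```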
Topology constraints
| Attribute | Default | Description |
|---|---|---|
| min_llcs | 1 | Minimum LLCs for gauntlet topology filtering |
| max_llcs | 12 | Maximum LLCs for gauntlet topology filtering |
| min_cpus | 1 | Minimum total CPU count for gauntlet topology filtering |
| max_cpus | 192 | Maximum total CPU count for gauntlet topology filtering |
| min_numa_nodes | 1 | Minimum NUMA nodes for gauntlet topology filtering |
| max_numa_nodes | 1 | Maximum NUMA nodes for gauntlet topology filtering |
| requires_smt | false | Require SMT (threads > 1) topologies. On aarch64 the gauntlet ships only non-SMT presets, so any test with requires_smt = true is skipped entirely on that arch. |
The gauntlet skips presets that do not satisfy these constraints.
Multi-NUMA presets are excluded by default (max_numa_nodes = 1).
See Gauntlet
for filtering rules and
Gauntlet Tests for a worked
example.
Execution
| Attribute | Default | Description |
|---|---|---|
| auto_repro | true | On scheduler crash, boot a second VM with probes attached. Set to false for fast iteration. |
| performance_mode | false | Pin vCPUs to host cores, hugepages, NUMA mbind, RT scheduling, LLC exclusivity validation |
| no_perf_mode | false | Decouple the virtual topology from host hardware: build the VM with the declared numa_nodes / llcs / cores / threads even on smaller hosts; skip vCPU pinning, hugepages, NUMA mbind, RT scheduling, and KVM exit suppression; relax gauntlet preset filtering to the single “host has enough total CPUs” check. Mutually exclusive with performance_mode = true (validated at runtime by KtstrTestEntry::validate). Equivalent to setting KTSTR_NO_PERF_MODE=1 per-test; either source forces the no-perf path. See Performance Mode. |
| duration_s | 12 | Per-scenario duration in seconds |
| expect_err | false | Test expects run_ktstr_test to return Err; disables auto-repro |
| bpf_map_write = CONST | empty | Rust const path to a BpfMapWrite; host writes this value to a BPF map after the scheduler loads. The entry field is a slice; the macro wraps the single path in a one-element slice. |
| host_only | false | Run the test function directly on the host instead of inside a VM. Use for tests that need host tools (e.g. cargo, nested VMs) unavailable in the guest initramfs. |
| num_snapshots = N | 0 | Fire N periodic freeze_and_capture(false) boundaries inside the workload’s 10 %–90 % window; each capture is stored on the host SnapshotBridge under periodic_NNN. 0 disables periodic capture entirely. Validated against MAX_STORED_SNAPSHOTS (= 64), host_only = true, and a 100 ms minimum-spacing rule. See Periodic Capture and Temporal Assertions. |
| cleanup_budget_ms = N | None | Sub-watchdog cap on host-side VM teardown wall time. When the budget is exceeded the test’s AssertResult is folded with a failing AssertDetail. None disables the check. |
| post_vm = PATH | None | Host-side callback invoked after vm.run() returns. Signature: fn(&VmResult) -> anyhow::Result<()>. Use for assertions that need host-side state, e.g. draining VmResult.snapshot_bridge for periodic-capture analysis (see Periodic Capture). |
| config = EXPR | None | Inline scheduler config content (string literal or path to a const &'static str). Written to the guest path declared by the scheduler’s config_file_def; the framework substitutes {file} in the scheduler’s arg template with the guest path. Required when the scheduler declares config_file_def; rejected when it doesn’t. The pairing is enforced at compile time via a const assertion against Payload::config_file_def, and again at runtime by KtstrTestEntry::validate. See Inline scheduler config. |
See Performance Mode for details on
what performance_mode enables, prerequisites, and validation behavior.
Inline scheduler config
Some schedulers (e.g. scx_layered, scx_lavd) accept a JSON config
file via a CLI argument like --config /path/to/config.json. Two
pieces wire this into a test:
- Scheduler declaration: the Scheduler builder declares the arg template and the guest path via .config_file_def:

      const LAYERED_SCHED: Scheduler = Scheduler::new("layered")
          .binary(SchedulerSpec::Discover("scx_layered"))
          .config_file_def("--config {file}", "/include-files/layered.json");

  {file} in the arg template is replaced with the guest path. The framework creates the parent directory (mkdir -p) and writes the config content to /include-files/layered.json inside the guest before the scheduler binary starts.
- Test attribute: the test supplies the inline JSON via config = …:

      const LAYERED_CONFIG: &str = r#"{ "layers": [...] }"#;
      #[ktstr_test(scheduler = LAYERED_SCHED, config = LAYERED_CONFIG)]
      fn layered_test(ctx: &Ctx) -> Result<AssertResult> {
          Ok(AssertResult::pass())
      }

  config = "..." (string literal) and config = SOME_CONST (path to a const &'static str) are both accepted.
The pairing gate is bidirectional:
- A scheduler with config_file_def set requires config = … on every test (otherwise the scheduler binary would launch without --config).
- A scheduler without config_file_def rejects config = … on the test (the content would be silently dropped at dispatch).
Both halves are validated at compile time via a const assertion
emitted by the macro AND at runtime by KtstrTestEntry::validate,
so direct programmatic-entry construction sees the same gate.
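As a rough illustration of the runtime half of the gate and of the {file} substitution, consider the sketch below. Both function names are hypothetical, the error strings are invented, and splitting the rendered template on whitespace is an assumption that would not hold for guest paths containing spaces.

```rust
/// Mirrors the bidirectional pairing rule: both present or both absent.
pub fn check_config_pairing(
    config_file_def: Option<(&str, &str)>,
    config: Option<&str>,
) -> Result<(), String> {
    match (config_file_def, config) {
        (Some(_), None) => {
            Err("scheduler declares config_file_def but the test has no config".to_string())
        }
        (None, Some(_)) => {
            Err("test sets config but the scheduler has no config_file_def".to_string())
        }
        _ => Ok(()),
    }
}

/// Substitutes the guest path for `{file}` and splits the result into args.
pub fn render_config_args(template: &str, guest_path: &str) -> Vec<String> {
    template
        .replace("{file}", guest_path)
        .split_whitespace()
        .map(str::to_owned)
        .collect()
}
```

For the scx_layered declaration shown earlier, rendering "--config {file}" against "/include-files/layered.json" produces the two arguments --config and /include-files/layered.json.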
For schedulers that take a config file from a host-side path
instead of inline content, use Scheduler::config_file(host_path)
instead of config_file_def. The framework packs the host file into
the initramfs at /include-files/{filename} and prepends --config /include-files/{filename} to scheduler args; no config = … on
the test is needed in that flavor.
Example with custom scheduler
Define the scheduler with declare_scheduler! (see
Scheduler Definitions), then
reference it in #[ktstr_test]:
use ktstr::declare_scheduler;
use ktstr::prelude::*;
declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    topology = (1, 2, 4, 1),
    sched_args = ["--enable-llc", "--enable-stealing"],
});
#[ktstr_test(
    scheduler = MY_SCHED,
    not_starved = true,
    max_gap_ms = 5000,
)]
fn my_sched_basic(ctx: &Ctx) -> Result<AssertResult> {
    // Inherits 1n2l4c1t from MY_SCHED
    Ok(AssertResult::pass())
}
declare_scheduler! emits a pub static MY_SCHED: Scheduler
and registers it in the KTSTR_SCHEDULERS distributed slice via
a private linkme static so cargo ktstr verifier discovers it.
The bare MY_SCHED ident is what #[ktstr_test(scheduler = ...)]
expects. See
Scheduler Definitions
for the full macro grammar.
For the manual builder pattern (no distributed-slice registration), see Scheduler Definitions: Manual definition.
Custom Scenarios
For dynamic scenarios (cgroup creation/removal, cpuset changes), prefer
the ops/steps system over raw Action::Custom.
Use Action::Custom only when you need logic that the ops system
cannot express.
Writing a custom scenario
use ktstr::prelude::*;
use ktstr::scenario::*;
fn my_custom_scenario(ctx: &Ctx) -> Result<AssertResult> {
    let wl = dfl_wl(ctx);
    let (handles, _guard) = setup_cgroups(ctx, 2, &wl)?;
    // Custom logic: resize cpusets, move workers, etc.
    std::thread::sleep(ctx.duration);
    Ok(collect_all(handles, &ctx.assert))
}
Helper functions
setup_cgroups(ctx, n, wl) – creates N cgroups, spawns workers,
returns Result<(Vec<WorkloadHandle>, CgroupGroup)>.
Bind the CgroupGroup to a named variable (e.g. _guard) so it
lives until end of scope.
See CgroupGroup for drop semantics.
Imports: setup_cgroups and dfl_wl live in ktstr::scenario, not in the prelude. The use ktstr::scenario::*; line in the example above is required; use ktstr::prelude::*; alone does not bring them into scope.
collect_all(handles, checks) – stops all workers, collects reports,
runs worker-level checks when configured, otherwise falls back to
assert_not_starved(). Merges results: if any worker group fails, the
overall result fails.
dfl_wl(ctx) – creates a WorkloadConfig with
ctx.workers_per_cgroup workers and default settings.
spawn_diverse(ctx, cgroup_names) – spawns different
work types across cgroups, rotating
through (SpinWait, Bursty{50ms burst / 100ms sleep}, IoSyncWrite,
Mixed, YieldHeavy). Each cgroup uses ctx.workers_per_cgroup
workers except IoSyncWrite cgroups, which always use 2 workers so
blocking IO does not drown the scenario.
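The rotation in spawn_diverse can be modeled without spawning anything. In this sketch WorkKind is a hypothetical, parameterless stand-in for ktstr's work types, diverse_plan is an invented name, and starting the rotation at SpinWait for the first cgroup is an assumption based on the order listed above.

```rust
/// Hypothetical stand-in for the five work types in the rotation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WorkKind {
    SpinWait,
    Bursty,
    IoSyncWrite,
    Mixed,
    YieldHeavy,
}

const ROTATION: [WorkKind; 5] = [
    WorkKind::SpinWait,
    WorkKind::Bursty,
    WorkKind::IoSyncWrite,
    WorkKind::Mixed,
    WorkKind::YieldHeavy,
];

/// Returns (cgroup name, work kind, worker count) for each cgroup.
pub fn diverse_plan(
    cgroup_names: &[&str],
    workers_per_cgroup: usize,
) -> Vec<(String, WorkKind, usize)> {
    cgroup_names
        .iter()
        .enumerate()
        .map(|(i, name)| {
            let kind = ROTATION[i % ROTATION.len()];
            // Blocking-IO cgroups are pinned to 2 workers.
            let workers = if kind == WorkKind::IoSyncWrite { 2 } else { workers_per_cgroup };
            (name.to_string(), kind, workers)
        })
        .collect()
}
```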
The Ctx struct
Custom scenarios receive a Ctx reference:
pub struct Ctx<'a> {
    pub cgroups: &'a dyn CgroupOps,
    pub topo: &'a TestTopology,
    pub duration: Duration,
    pub workers_per_cgroup: usize,
    pub sched_pid: Option<libc::pid_t>,
    pub settle: Duration,
    pub work_type_override: Option<WorkType>,
    pub assert: Assert,
    pub wait_for_map_write: bool,
}
cgroups – create/remove cgroups, set cpusets, move tasks. The
slot is a &dyn CgroupOps trait object, not a concrete
CgroupManager, so tests can
substitute a no-op double for host-only scenarios while production
paths receive the real manager. Method signatures are defined on
CgroupOps; see CgroupManager for the production implementation.
topo – query CPU topology (LLCs, NUMA nodes, memory info,
distances). Provides CPU enumeration, LLC/NUMA partitioning, cpuset
generation, and inter-node distance queries. See
TestTopology for the full API reference.
sched_pid – scheduler process ID (Option<libc::pid_t>) for
liveness checks. None when the test runs without an scx scheduler
(the EEVDF default path has no userspace scheduler binary). Unwrap
or is_some_and(...) before passing to process_alive or
kill(Pid::from_raw(pid), None).
settle – time to wait after cgroup creation for the scheduler
to stabilize.
Checking in custom scenarios
Use Assert for both direct report checking and ops-based scenarios.
Call assert.assert_cgroup(reports, cpuset) for manual report
collection, or use execute_steps_with() for ops-based scenarios. See
Checking.
Scheduler Definitions
A Scheduler tells the test framework how to launch and configure
the scheduler under test. Tests reference one via
#[ktstr_test(scheduler = MY_SCHED)]; the verifier sweep reads
every declared scheduler from the KTSTR_SCHEDULERS distributed
slice automatically.
The Scheduler type
pub struct Scheduler {
    pub name: &'static str,
    pub binary: SchedulerSpec,
    pub sysctls: &'static [Sysctl],
    pub kargs: &'static [&'static str],
    pub assert: Assert,
    pub cgroup_parent: Option<CgroupPath>,
    pub sched_args: &'static [&'static str],
    pub topology: Topology,
    pub constraints: TopologyConstraints,
    pub config_file: Option<&'static str>,
    pub config_file_def: Option<(&'static str, &'static str)>,
    pub kernels: &'static [&'static str],
}
config_file packs a host-side file into the initramfs at
/include-files/{filename} and prepends --config /include-files/{filename}
to scheduler args automatically.
config_file_def declares an arg-template + guest-path pair for
schedulers that take inline JSON content via the test attribute
#[ktstr_test(config = …)]: the framework writes the test’s
config_content to the declared guest path and substitutes
{file} in the arg template before launching the scheduler. The
two fields are alternatives — config_file is the host-file path,
config_file_def is the inline-content path. See
The #[ktstr_test] Macro
for the inline pairing gate.
sysctls takes Sysctl values. Construct them with
Sysctl::new("key", "value") in const context. Use the
dot-separated form for the key (e.g. "kernel.foo", not
"kernel/foo"); duplicate keys are applied in order and the
last write wins.
kargs is the extra GUEST KERNEL command-line (not the scheduler
binary’s CLI — use sched_args for that). Do not override the
kargs ktstr injects itself (console=, loglevel=, init=):
those break guest-side init and leave the VM unable to run tests.
kernels is the per-scheduler filter on the
BPF Verifier sweep matrix. The
matrix dimension itself is the operator’s cargo ktstr verifier --kernel <SPEC> set (which the dispatcher always populates into
KTSTR_KERNEL_LIST, including a single auto-discovered entry
when --kernel is omitted). For each scheduler, the lister
emits one cell per (kernel-list entry that passes this filter ×
accepted topology preset).
Each entry is a string consumed by KernelId::parse — the same
parser as the cargo ktstr verifier --kernel <SPEC> CLI flag.
Match semantics per variant:
- Exact Version ("6.14.2"): matches an entry whose raw or sanitized label equals the version.
- Range ("6.14..7.0" or "6.14..=7.0", both inclusive on both endpoints): matches entries whose raw version falls inside [start, end] via decompose_version_for_compare.
- Path / CacheKey / Git ("git+URL#REF", "path/to/dir", "6.14.2-tarball-x86_64-kc..."): matches by sanitized-label equality.
An empty kernels = [] means no filter — the scheduler
verifies against every kernel-list entry the operator supplied.
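The Range semantics (both spellings inclusive on both endpoints) can be illustrated with a simplified comparison. version_key and range_matches are hypothetical helpers, not ktstr's decompose_version_for_compare; padding missing components with zero and rejecting non-numeric suffixes such as -rc tags are assumptions of this sketch.

```rust
/// Splits "major[.minor[.patch]]" into a comparable tuple.
/// Assumption: missing components count as zero.
fn version_key(v: &str) -> Option<(u64, u64, u64)> {
    let mut parts = v.split('.');
    let major: u64 = parts.next()?.parse().ok()?;
    let minor: u64 = match parts.next() {
        Some(p) => p.parse().ok()?,
        None => 0,
    };
    let patch: u64 = match parts.next() {
        Some(p) => p.parse().ok()?,
        None => 0,
    };
    if parts.next().is_some() {
        return None;
    }
    Some((major, minor, patch))
}

/// `Some(true/false)` for a parsable range spec, `None` when `spec` is not
/// a range or any version fails to parse.
pub fn range_matches(spec: &str, candidate: &str) -> Option<bool> {
    let (start, end) = spec.split_once("..")?;
    // "a..b" and "a..=b" mean the same thing: inclusive on both ends.
    let end = end.strip_prefix('=').unwrap_or(end);
    let (lo, hi, c) = (version_key(start)?, version_key(end)?, version_key(candidate)?);
    Some(lo <= c && c <= hi)
}
```

Under these assumptions "6.14..7.0" accepts 6.14, 6.15.3, and 7.0 but not 6.13.9, and a 7.0.1 entry falls outside the range because it compares greater than 7.0.0.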
SchedulerSpec
How to find the scheduler binary:
pub enum SchedulerSpec {
    Eevdf,                  // No sched_ext binary -- use kernel EEVDF
    Discover(&'static str), // Auto-discover by name
    Path(&'static str),     // Explicit path
    KernelBuiltin {         // Kernel-built scheduler (no binary)
        enable: &'static [&'static str],
        disable: &'static [&'static str],
    },
}
KernelBuiltin is for schedulers compiled into the kernel (e.g.
BPF-less sched_ext or debugfs-tuned variants). The enable
commands run in the guest to activate the scheduler; disable
commands run to deactivate it. No binary is injected into the
VM.
SchedulerSpec::has_active_scheduling() returns true for all
variants except Eevdf. When true, the framework runs monitor
threshold evaluation after the scenario and enables auto-repro
on crash.
Eevdf and KernelBuiltin are excluded from the verifier sweep
at cell-emission time — neither has a userspace binary to load
BPF programs from, so the verifier has nothing to verify.
Built-in: EEVDF
Scheduler::EEVDF runs tests without a sched_ext scheduler,
using the kernel’s default EEVDF scheduler. Its binary is
SchedulerSpec::Eevdf. It is the default scheduler for
#[ktstr_test] entries that do not pass scheduler = ....
Defining a scheduler
declare_scheduler! is the preferred entry point: it constructs a
pub static Scheduler and registers it in the KTSTR_SCHEDULERS
distributed slice in one step, so the verifier sweep picks it up
automatically.
use ktstr::declare_scheduler;
use ktstr::prelude::*;
declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    sched_args = ["--exit-dump-len", "1048576"],
    topology = (1, 2, 4, 1),
    kernels = ["6.14", "6.15..=7.0"],
});
#[ktstr_test(scheduler = MY_SCHED)]
fn basic(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("cg_0").workers(2),
        CgroupDef::named("cg_1").workers(2),
    ])
}
The macro emits:
- pub static MY_SCHED: Scheduler with the supplied fields.
- A private static __KTSTR_SCHED_REG_MY_SCHED: &'static Scheduler registered in the KTSTR_SCHEDULERS distributed slice via linkme so cargo ktstr verifier discovers it.
#[ktstr_test(scheduler = ...)] expects an &'static Scheduler —
pass the bare ident (e.g. MY_SCHED). The macro takes a reference
internally, so passing the bare const yields the correct
&Scheduler.
Accepted fields
Every key=value pair after name and binary is optional. The
key names match Scheduler struct fields:
- name = "..." – short human name (required).
- binary = "scx_name" – defaults to SchedulerSpec::Discover(name). Accepts SchedulerSpec::Path("/abs/path"), SchedulerSpec::Eevdf, or SchedulerSpec::KernelBuiltin { enable: &[...], disable: &[...] }.
- sched_args = ["--a", "--b"] – CLI args prepended to every test that uses this scheduler.
- kernels = ["6.14", "6.15..=7.0", "git+URL#REF", "/path", "cache-key"] – verifier sweep set; see the field doc above.
- cgroup_parent = "/path" – must begin with /, must not be "/" alone.
- kargs = ["nosmt"] – guest-kernel cmdline additions.
- sysctls = [Sysctl::new("kernel.foo", "1")] – applied before the scheduler starts.
- topology = (numa_nodes, llcs, cores, threads) – default VM topology for #[ktstr_test] entries.
- constraints = TopologyConstraints { ... } – gauntlet topology constraints inherited by #[ktstr_test] entries.
- config_file = "configs/my_sched.toml" – opaque host-side config to pack into the guest initramfs.
- config_file_def = ("--config={file}", "/include-files/my.json") – alternative inline-config seam (see The #[ktstr_test] Macro).
- assert = Assert::NO_OVERRIDES.method_chain() – scheduler-level assertion overrides merged on top of Assert::default_checks().
Visibility
The identifier can be pub or pub(crate):
declare_scheduler!(pub MY_SCHED, { name = "my_sched", binary = "scx_my_sched" });
declare_scheduler!(pub(crate) INTERNAL, { name = "internal", binary = "scx_internal" });
The macro emits #[allow(missing_docs)] on the generated static
so crates with #![deny(missing_docs)] compile cleanly.
Manual definition
The const builder pattern still works when the macro doesn’t fit — e.g. when the scheduler is composed programmatically or when test-only fixtures need to avoid the distributed-slice registration:
use ktstr::prelude::*;
const MITOSIS: Scheduler = Scheduler::new("scx_mitosis")
    .binary(SchedulerSpec::Discover("scx_mitosis"))
    .topology(1, 2, 4, 1)
    .sched_args(&["--exit-dump-len", "1048576"])
    .cgroup_parent("/ktstr")
    .assert(Assert::NO_OVERRIDES.max_imbalance_ratio(2.0));
A manually-defined Scheduler is not registered in
KTSTR_SCHEDULERS automatically; the verifier sweep does not
see it. Use declare_scheduler! for any scheduler that should
participate in cargo ktstr verifier.
Cgroup parent
Scheduler.cgroup_parent specifies a cgroup subtree under
/sys/fs/cgroup for the scheduler to manage. When set, the VM
init creates the directory before starting the scheduler, and
--cell-parent-cgroup <path> is injected into the scheduler
args. The field is Option<CgroupPath>. CgroupPath::new() is
a const constructor that panics at compile time if the path
does not begin with / or is "/" alone. The
Scheduler::cgroup_parent() builder and the
declare_scheduler! cgroup_parent = "..." field both accept
&'static str and construct a CgroupPath internally.
declare_scheduler!(MITOSIS, {
    name = "scx_mitosis",
    binary = "scx_mitosis",
    topology = (1, 2, 4, 1),
    cgroup_parent = "/ktstr",
});
This creates /sys/fs/cgroup/ktstr in the guest and passes
--cell-parent-cgroup /ktstr to the scheduler binary.
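The compile-time validation described above can be modeled without ktstr at all. The sketch below is a stand-alone illustration, assuming only the two rules stated in this section (the path begins with / and is not "/" alone); CgroupPathSketch and is_valid_cgroup_parent are hypothetical names, not ktstr's actual types.

```rust
/// Stand-alone model of a validated cgroup path (hypothetical; not ktstr's CgroupPath).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CgroupPathSketch(&'static str);

/// The rule from the text: the path begins with '/' and is not "/" alone.
pub const fn is_valid_cgroup_parent(path: &str) -> bool {
    let bytes = path.as_bytes();
    bytes.len() > 1 && bytes[0] == b'/'
}

impl CgroupPathSketch {
    /// Panics on invalid input. When called in a const context the panic
    /// becomes a compile error instead of a runtime failure.
    pub const fn new(path: &'static str) -> Self {
        if !is_valid_cgroup_parent(path) {
            panic!("cgroup parent must begin with '/' and must not be \"/\" alone");
        }
        Self(path)
    }

    pub const fn as_str(&self) -> &'static str {
        self.0
    }
}

// Evaluated at compile time: replacing "/ktstr" with "/" or "ktstr"
// would stop the build at this line.
pub const KTSTR_PARENT: CgroupPathSketch = CgroupPathSketch::new("/ktstr");
```

Because the constructor is a const fn, a bad literal is rejected when the constant is evaluated, which matches the "panics at compile time" behavior described for the real type.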
Config file
Scheduler.config_file specifies a host-side path to an opaque
config file that the scheduler binary reads at startup. The
framework packs the file into the guest initramfs at
/include-files/{filename} and prepends --config /include-files/{filename}
to the scheduler args. ktstr does not parse or validate the
file — it is passed through as-is.
The --config flag name is not configurable. Schedulers that
use config_file must accept --config <path>. For schedulers
that use a different flag, use config_file to place the file
in the guest and add the desired flag via sched_args — the
scheduler will also receive --config and must not reject
unknown flags.
declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    topology = (1, 2, 4, 1),
    config_file = "configs/my_sched.toml",
});
This copies configs/my_sched.toml from the host into the
guest at /include-files/my_sched.toml and passes
--config /include-files/my_sched.toml to the scheduler binary.
Scheduler args
Scheduler.sched_args provides default CLI args that apply to
every test using this scheduler. They are prepended before
per-test extra_sched_args.
declare_scheduler!(MITOSIS, {
    name = "scx_mitosis",
    binary = "scx_mitosis",
    topology = (1, 2, 4, 1),
    cgroup_parent = "/ktstr",
    sched_args = ["--exit-dump-len", "1048576"],
});
Merge order: config_file injection, then cgroup_parent
injection, then sched_args, then per-test extra_sched_args.
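The merge order above can be sketched as a small pure function. This is an illustration only: merge_sched_args is a hypothetical helper, not a ktstr API, and it assumes that each injected flag and its value are passed as two separate argv entries.

```rust
use std::path::Path;

/// Builds the final scheduler argv in the documented merge order.
/// Hypothetical helper for illustration; not part of ktstr.
pub fn merge_sched_args(
    config_file: Option<&str>,
    cgroup_parent: Option<&str>,
    sched_args: &[&str],
    extra_sched_args: &[&str],
) -> Vec<String> {
    let mut args = Vec::new();

    // 1. config_file injection: --config /include-files/{filename}
    if let Some(host_path) = config_file {
        if let Some(name) = Path::new(host_path).file_name().and_then(|n| n.to_str()) {
            args.push("--config".to_string());
            args.push(format!("/include-files/{name}"));
        }
    }

    // 2. cgroup_parent injection: --cell-parent-cgroup <path>
    if let Some(parent) = cgroup_parent {
        args.push("--cell-parent-cgroup".to_string());
        args.push(parent.to_string());
    }

    // 3. scheduler-level defaults, then 4. per-test extras.
    args.extend(sched_args.iter().map(|s| s.to_string()));
    args.extend(extra_sched_args.iter().map(|s| s.to_string()));
    args
}
```

With config_file = "configs/my_sched.toml", cgroup_parent = "/ktstr", and the sched_args from the example above, the scheduler-level arguments land after both injected flags and before anything a single test adds.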
Default topology
Scheduler.topology sets the default VM topology for all tests
using this scheduler. When #[ktstr_test] omits llcs,
cores, and threads, the scheduler’s topology is used.
Explicit attributes on #[ktstr_test] override the scheduler
default.
// (numa_nodes, llcs, cores_per_llc, threads_per_core)
declare_scheduler!(MITOSIS, {
    name = "scx_mitosis",
    binary = "scx_mitosis",
    topology = (1, 2, 4, 1),
});
Arguments are (numa_nodes, llcs, cores_per_llc, threads_per_core).
Most schedulers use numa_nodes = 1 (single NUMA node).
Scheduler::new() defaults to (1, 1, 2, 1) — a minimal
2-CPU single-NUMA VM, sufficient for tests that don’t exercise
topology-dependent scheduling.
Tests that need a different topology (e.g. SMT) override individual dimensions. Unset dimensions still inherit from the scheduler:
// Inherits llcs=2, cores=4 from MITOSIS; overrides threads to 2
#[ktstr_test(scheduler = MITOSIS, threads = 2)]
fn smt_test(ctx: &Ctx) -> Result<AssertResult> { /* ... */ }
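The inheritance rule can be modeled as a per-dimension fallback. The sketch below is a stand-alone illustration under that assumption; TopologyOverrides and resolve_topology are hypothetical names, and only the three dimensions this section names as #[ktstr_test] attributes (llcs, cores, threads) are treated as overridable.

```rust
/// (numa_nodes, llcs, cores_per_llc, threads_per_core)
pub type Topology = (u32, u32, u32, u32);

/// Per-test overrides; None means "inherit from the scheduler".
/// Hypothetical type for illustration; not part of ktstr.
#[derive(Debug, Default, Clone, Copy)]
pub struct TopologyOverrides {
    pub llcs: Option<u32>,
    pub cores: Option<u32>,
    pub threads: Option<u32>,
}

/// Each unset dimension falls back to the scheduler default.
pub fn resolve_topology(scheduler_default: Topology, over: TopologyOverrides) -> Topology {
    let (numa_nodes, llcs, cores, threads) = scheduler_default;
    (
        numa_nodes,
        over.llcs.unwrap_or(llcs),
        over.cores.unwrap_or(cores),
        over.threads.unwrap_or(threads),
    )
}
```

For the smt_test example, the scheduler default (1, 2, 4, 1) combined with threads = 2 resolves to (1, 2, 4, 2).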
Checking overrides
Scheduler.assert provides scheduler-level checking defaults.
These sit between Assert::default_checks() and per-test
overrides in the merge chain.
A scheduler that tolerates higher imbalance:
declare_scheduler!(RELAXED, {
    name = "relaxed",
    binary = "scx_relaxed",
    assert = Assert::NO_OVERRIDES.max_imbalance_ratio(5.0),
});
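One way to picture the three-layer merge chain is an overlay of optional thresholds, where a later layer wins only for the fields it sets. The sketch below is a stand-alone model under that assumption; ChecksSketch is hypothetical, max_gap_ms is an invented second field, and the numeric defaults are illustrative rather than ktstr's real values.

```rust
/// Hypothetical stand-in for an assertion config; not ktstr's Assert.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct ChecksSketch {
    pub max_imbalance_ratio: Option<f64>,
    /// Invented field, present only to show that unset fields pass through.
    pub max_gap_ms: Option<u64>,
}

impl ChecksSketch {
    /// Fields set in `over` win; unset fields keep the value from `self`.
    pub fn overlaid_with(self, over: ChecksSketch) -> ChecksSketch {
        ChecksSketch {
            max_imbalance_ratio: over.max_imbalance_ratio.or(self.max_imbalance_ratio),
            max_gap_ms: over.max_gap_ms.or(self.max_gap_ms),
        }
    }
}

/// Merge chain: defaults, then scheduler-level overrides, then per-test overrides.
pub fn resolve_checks(
    defaults: ChecksSketch,
    scheduler: ChecksSketch,
    per_test: ChecksSketch,
) -> ChecksSketch {
    defaults.overlaid_with(scheduler).overlaid_with(per_test)
}
```

Under this model, the RELAXED scheduler's ratio of 5.0 replaces the default for every test that uses it, while a test that sets its own ratio still takes precedence.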
Kernel-built scheduler example
For schedulers compiled into the kernel (no userspace binary),
use SchedulerSpec::KernelBuiltin with shell commands to
activate/deactivate the scheduler:
use ktstr::declare_scheduler;
use ktstr::prelude::*;
declare_scheduler!(MINLAT, {
    name = "minlat",
    binary = SchedulerSpec::KernelBuiltin {
        enable: &["echo minlat > /sys/kernel/debug/sched/ext/root/ops"],
        disable: &["echo none > /sys/kernel/debug/sched/ext/root/ops"],
    },
});
The enable commands run in the guest before scenarios start.
The disable commands run after scenarios complete.
KernelBuiltin schedulers do not participate in the verifier
sweep (no userspace binary to load BPF programs from); the
declaration is still useful for #[ktstr_test(scheduler = ...)]
attribution and sidecar identification.
Payload Definitions
A Payload describes a binary workload that a test can run
alongside its cgroup workers. The struct encodes
PayloadKind::Binary (an external executable — schbench,
fio, stress-ng) for the workload role. Tests reference a
Payload via #[ktstr_test(payload = FIXTURE)] (primary slot)
or #[ktstr_test(workloads = [FIXTURE_A, FIXTURE_B])]
(additional slots); the test body then runs it via
ctx.payload(&FIXTURE).
The scheduler slot is separate from the payload / workloads
slots — #[ktstr_test(scheduler = MY_SCHED)] takes a bare
Scheduler reference (the declare_scheduler! const), not a
Payload.
#[non_exhaustive] and construction rules
Payload is #[non_exhaustive] (see crate::non_exhaustive).
Downstream crates cannot use struct-literal construction —
a future ktstr bump can add fields without breaking callers
only if everyone constructs through the provided associated
functions:
- Payload::binary(name, binary) – minimal binary-kind Payload with exit-code-only defaults (no declared args, checks, metrics, or include files). Fills name, sets kind = PayloadKind::Binary(binary).
- Payload::new(...) – full positional constructor; the #[derive(Payload)] macro emits a call to this internally.
For richer binary payloads (custom default args, declared
MetricChecks, MetricHints, include_files), use
#[derive(Payload)] on a marker struct — the derive generates
the matching const via the same non-exhaustive-preserving
construction path. tests/common/fixtures.rs has worked
examples — SCHBENCH, SCHBENCH_HINTED, SCHBENCH_JSON —
suitable as reference shapes to copy.
Quick reference: Payload fields
The fields are listed here for readers tracing the fixture
files, not as a license to hand-roll literals. Each is
populated by Payload::binary + the derive’s builder methods:
- name: &'static str – display name that appears in sidecar JSON, stats tables, and test filtering. Distinct from the binary name (kind) so e.g. SCHBENCH_HINTED can run the same schbench binary with a different label.
- kind: PayloadKind – either Binary(executable_name) (for test payloads like schbench) or Scheduler(&'static Scheduler) (the in-memory shape of Payload::KERNEL_DEFAULT and similar scheduler-wrapping payloads). Test authors normally do not construct PayloadKind::Scheduler directly; the #[ktstr_test(scheduler = MY_SCHED)] slot takes the bare Scheduler ref without a Payload wrapper.
- output: OutputFormat – how to interpret the payload’s stdout/stderr. ExitCode (status code only), Json (parse numeric leaves), or LlmExtract(Option<&'static str>) (route through a local LLM with an optional hint).
- default_args: &'static [&'static str] – CLI args prepended to every invocation. Per-test ctx.payload(...).args(...) appends after these.
- default_checks: &'static [MetricCheck] – static assertions applied to the payload’s output/exit (min / max / range / exists / exit_code_eq constructors on MetricCheck). Merged with per-test .checks(...).
- metrics: &'static [MetricHint] – declared metrics the payload emits (name, unit, polarity). Drives list-metrics and comparison thresholds.
- metric_bounds: Option<&'static MetricBounds> – optional per-metric host-side bounds applied AFTER the payload exits. Consumed by LlmExtract payloads (where extraction runs host-side post-VM-exit); Json and ExitCode payloads ignore this field and route assertions through default_checks instead.
- include_files: &'static [&'static str] – extra files packaged into the guest alongside the binary (config files, datasets).
- uses_parent_pgrp: bool – when true, the payload child inherits the test’s process group so the teardown SIGKILL sweep reaches it. Most binaries leave this false and are reaped explicitly.
- known_flags: Option<&'static [&'static str]> – optional allow-list of CLI flags the payload accepts; used by the gauntlet-style flag expansion.
For an end-to-end workflow from building a scheduler to running the gauntlet, see Test a New Scheduler.
Gauntlet Tests
The gauntlet expands each #[ktstr_test] into a matrix of
test × topology_preset variants. The test definition controls
which cells of the matrix it populates.
Controlling topology coverage
Topology constraints in #[ktstr_test] filter which gauntlet
presets a test runs on. See
Topology Constraints
for the full attribute table and
Topology Presets
for the preset list.
Worked example
A test with min_llcs = 2, requires_smt = true, and default
max_numa_nodes = 1 against the
preset table:
- tiny-1llc (1 LLC): excluded – below min_llcs
- All non-SMT presets (tiny-2llc, odd-*, *-nosmt): excluded – requires_smt
- near-max-llc (15 LLCs): excluded – above default max_llcs = 12
- max-cpu (252 CPUs, 14 LLCs): excluded – above default max_cpus = 192 (also above default max_llcs = 12)
- All numa* presets: excluded – above default max_numa_nodes = 1
Result: 6 of 24 presets survive (smt-2llc, smt-3llc,
medium-4llc, medium-8llc, large-4llc, large-8llc). On
aarch64, none survive — all aarch64 presets lack SMT.
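The filtering in this worked example amounts to a conjunction of bounds checks per preset. The sketch below is a stand-alone model, not ktstr's implementation: the Preset and Constraints types are hypothetical, the max_* defaults are the ones quoted above, and every preset value not stated in the list above (CPU counts, SMT flags, the numa-example entry) is an illustrative placeholder rather than real preset data.

```rust
/// Hypothetical preset description; not ktstr's preset type.
#[derive(Debug, Clone, Copy)]
pub struct Preset {
    pub name: &'static str,
    pub numa_nodes: u32,
    pub llcs: u32,
    pub cpus: u32,
    pub smt: bool,
}

#[derive(Debug, Clone, Copy)]
pub struct Constraints {
    pub min_llcs: u32,
    pub max_llcs: u32,
    pub max_cpus: u32,
    pub max_numa_nodes: u32,
    pub requires_smt: bool,
}

impl Constraints {
    /// max_* defaults as quoted in the text; min_llcs = 1 and
    /// requires_smt = false are assumptions.
    pub fn with_defaults() -> Self {
        Self { min_llcs: 1, max_llcs: 12, max_cpus: 192, max_numa_nodes: 1, requires_smt: false }
    }

    /// A preset survives only if it satisfies every constraint.
    pub fn accepts(&self, p: &Preset) -> bool {
        p.llcs >= self.min_llcs
            && p.llcs <= self.max_llcs
            && p.cpus <= self.max_cpus
            && p.numa_nodes <= self.max_numa_nodes
            && (!self.requires_smt || p.smt)
    }
}

pub fn surviving(c: &Constraints, presets: &[Preset]) -> Vec<&'static str> {
    presets.iter().filter(|p| c.accepts(p)).map(|p| p.name).collect()
}

/// A few rows for demonstration. LLC counts for tiny-1llc, near-max-llc,
/// and max-cpu (plus max-cpu's 252 CPUs) come from the text; all other
/// numbers and the "numa-example" name are placeholders.
pub fn sample_presets() -> Vec<Preset> {
    vec![
        Preset { name: "tiny-1llc", numa_nodes: 1, llcs: 1, cpus: 4, smt: false },
        Preset { name: "smt-2llc", numa_nodes: 1, llcs: 2, cpus: 8, smt: true },
        Preset { name: "near-max-llc", numa_nodes: 1, llcs: 15, cpus: 120, smt: true },
        Preset { name: "max-cpu", numa_nodes: 1, llcs: 14, cpus: 252, smt: true },
        Preset { name: "numa-example", numa_nodes: 2, llcs: 4, cpus: 32, smt: true },
    ]
}
```

Applying min_llcs = 2 and requires_smt = true on top of the defaults leaves only smt-2llc from this sample, for the same reasons the list above gives for each exclusion.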
Total variant count
The total number of gauntlet variants for a test is:
valid_presets × resolved_kernels
A test with 8 valid presets produces 8 gauntlet variants under
a single-kernel run; passing two kernels (--kernel A --kernel B)
doubles that to 16. The kernel dimension is contributed by
cargo ktstr test / coverage / llvm-cov at the CLI surface
(zero or one resolved kernels keeps the historical 3-segment
name shape gauntlet/{name}/{preset}; two or more expands the
gauntlet across kernels with an extra {kernel_label} segment).
See
Multi-kernel: kernel as a gauntlet dimension.
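The arithmetic above is small enough to state as code. The sketch below is illustrative; both function names are hypothetical, and treating zero resolved kernels the same as one is an assumption read from the "zero or one resolved kernels" wording.

```rust
/// valid_presets × resolved_kernels, with zero kernels counted like one
/// (assumption: a run without an explicit kernel still produces one
/// variant per preset).
pub fn gauntlet_variant_count(valid_presets: usize, resolved_kernels: usize) -> usize {
    valid_presets * resolved_kernels.max(1)
}

/// The extra {kernel_label} name segment appears only with two or more kernels.
pub fn name_has_kernel_segment(resolved_kernels: usize) -> bool {
    resolved_kernels >= 2
}
```

For the example in the text, 8 valid presets give 8 variants with one kernel and 16 with two.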
Tests that skip gauntlet
- Entries with host_only = true never produce gauntlet variants (no VM to vary topology on). They also skip the kernel-dim multiplication under multi-kernel runs: a host_only test lists and runs once regardless of KTSTR_KERNEL_LIST cardinality, since a host-side test never observes the kernel directory and N copies of identical work would carry no signal. See host_only for how that flag is set, and Multi-kernel: kernel as a gauntlet dimension for the kernel-suffix dispatch contract.
- Tests whose names start with demo_ are ignored by default. Their gauntlet variants are also ignored (all gauntlet variants are ignored).
Cross-references
- Gauntlet (Running Tests) — how to run gauntlet variants, preset table, budget interaction
- The #[ktstr_test] Macro — full attribute reference
Snapshots
A snapshot is a frozen record of guest BPF map state and scheduler
globals captured at a specific point in a scenario. The freeze
coordinator pauses every vCPU long enough to walk the kernel’s BPF
maps, BTF-render every captured value, and bundle the result into a
FailureDumpReport keyed by a name you choose. Test code then reads
it back via the Snapshot accessor for typed traversal.
Op::snapshot("name") is the on-demand capture trigger. Use it to
ask “what does the scheduler look like right now?” at a precise
point in the scenario. For automatic capture on a kernel write to a
specific symbol, see Watch Snapshots. For
cadenced capture across the workload window without invoking
Op::snapshot from the scenario body, see
Periodic Capture — it produces a time-ordered
SampleSeries that flows into
the temporal-assertion patterns
(nondecreasing, rate_within, steady_within, converges_to,
always_true, ratio_within).
Issuing a snapshot
Op::snapshot(name) is a single op in a Step’s op list. The
executor invokes the active SnapshotBridge’s capture callback,
which performs the freeze rendezvous and returns the report; the
bridge stores the report under name.
use ktstr::prelude::*;
let steps = vec![Step {
    setup: vec![CgroupDef::named("workers").workers(2)].into(),
    ops: vec![
        Op::snapshot("after_spawn"),
        // ... other ops ...
        Op::snapshot("after_workload"),
    ],
    hold: HoldSpec::FULL,
}];
execute_steps(ctx, steps)?;
A scenario may issue any number of Op::snapshot ops with distinct
names. Reusing a name overwrites the prior capture (and emits a
tracing::warn!).
Wiring the bridge
The bridge is what turns an Op::snapshot into stored data. The host
typically wires it before execute_steps runs, but a scenario can
install one inline:
use ktstr::prelude::*;
let cb: CaptureCallback = std::sync::Arc::new(|_name: &str| {
    // Production: freeze the VM and build a real FailureDumpReport.
    // Tests: return a hand-crafted report so the executor + bridge
    // pipeline runs without booting a guest.
    Some(FailureDumpReport::default())
});
let bridge = SnapshotBridge::new(cb);
let bridge_handle = bridge.clone();
let _guard = bridge.set_thread_local();
execute_steps(ctx, steps)?;
let captured = bridge_handle.drain();
let report = captured.get("after_spawn").expect("snapshot recorded");
set_thread_local returns a BridgeGuard that restores the prior
bridge on drop, so a nested scenario inside an outer one cannot leak
its bridge into the outer scope. Bind the guard to an
underscore-prefixed identifier such as _guard so the binding lives
for the scope of the scenario — a bare let _ = bridge.set_thread_local()
drops the guard immediately and clears the bridge before any op runs.
must_use will warn if the return value is discarded entirely.
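The difference between let _guard = ... and let _ = ... is a general property of Rust drop guards, and it can be reproduced without ktstr. The sketch below is a stand-alone model of a thread-local slot with a restore-on-drop guard; GuardSketch, install, and current are hypothetical names, and a String label stands in for the bridge.

```rust
use std::cell::RefCell;

thread_local! {
    // Stand-in for the thread-local bridge slot.
    static ACTIVE: RefCell<Option<String>> = RefCell::new(None);
}

/// Restores the previously installed value when dropped.
#[must_use = "dropping the guard immediately uninstalls the value"]
pub struct GuardSketch {
    prior: Option<String>,
}

/// Installs `label` and returns a guard that remembers what was there before.
pub fn install(label: &str) -> GuardSketch {
    let prior = ACTIVE.with(|slot| slot.borrow_mut().replace(label.to_string()));
    GuardSketch { prior }
}

/// Reads the currently installed value, if any.
pub fn current() -> Option<String> {
    ACTIVE.with(|slot| slot.borrow().clone())
}

impl Drop for GuardSketch {
    fn drop(&mut self) {
        let prior = self.prior.take();
        ACTIVE.with(|slot| *slot.borrow_mut() = prior);
    }
}
```

Binding the result to _guard keeps the value installed until the end of the enclosing scope, and a nested install restores the outer value when its own guard drops. Writing let _ = install(...) never binds the guard, so it is dropped at the end of that statement and current() is already back to the prior value on the next line.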
If no bridge is installed, Op::snapshot is a no-op with a
tracing::warn! and the scenario continues. If the capture callback
returns None (capture pipeline unavailable), the bridge stays empty
and the scenario continues. Existing scenarios that never declare
snapshot ops keep working unchanged.
Reading the captured report
Snapshot::new(report) builds a borrowed view over a
FailureDumpReport. The view does not copy the report; accessor
methods walk the report in place and return further borrowed views.
Map-name lookup
let snap = Snapshot::new(report);
let map = snap.map("scx_per_task")?; // SnapshotMap
Snapshot::map(name) returns Result<SnapshotMap, SnapshotError>. A
miss yields SnapshotError::MapNotFound { requested, available } —
the available list enumerates every captured map name so a typo
surfaces in test output.
Top-level globals (.bss / .data / .rodata)
let nr_cpus = snap.var("nr_cpus_onln").as_u64()?;
Snapshot::var(name) walks every *.bss, *.data, and *.rodata
global-section map for a top-level member named name and returns
the unique match as a SnapshotField.
Multiple matches yield
SnapshotError::AmbiguousVar { requested, found_in } —
disambiguate via Snapshot::map(name). A miss yields
SnapshotError::VarNotFound { requested, available } with the
union of every section’s top-level member names.
Entries inside a map
let map = snap.map("scx_per_task")?;
let first = map.at(0); // by ordinal index
let busy = map.find(|e| e.get("tid").as_i64().unwrap_or(-1) == 1234);
let busiest = map.max_by(|e| e.get("runtime_ns").as_u64().unwrap_or(0));
let all_active = map.filter(|e| e.get("runtime_ns").as_u64().unwrap_or(0) > 0);
SnapshotMap exposes:
- at(n) – entry at ordinal index n. Out of range returns SnapshotEntry::Missing(SnapshotError::IndexOutOfRange).
- find(predicate) – first matching entry. No match returns SnapshotEntry::Missing(SnapshotError::NoMatch { op: "find", ... }).
- filter(predicate) – every matching entry collected into a Vec.
- max_by(key_fn) – entry whose key_fn produces the maximum u64. Empty map returns Missing with op: "max_by".
Per-CPU maps
BPF_MAP_TYPE_PERCPU_ARRAY / _PERCPU_HASH / _LRU_PERCPU_HASH maps
require narrowing to a CPU before reading individual values:
let map = snap.map("scx_pcpu")?;
let entry = map.cpu(1).at(0); // CPU 1's slot
let value = entry.get("").as_u64()?; // empty path = root
SnapshotMap::cpu(n) narrows subsequent at / find calls to a
specific CPU’s slot. An out-of-range CPU returns Missing with
SnapshotError::PerCpuSlot { unmapped: false, len, ... }; an
unmapped slot (None in the per-CPU vec) returns the same error
variant with unmapped: true.
Calling entry.get(path) on a per-CPU entry without narrowing
first surfaces SnapshotError::PerCpuNotNarrowed { map } — call
.cpu(N) first.
Field accessors and dotted paths
SnapshotEntry::get(path) and SnapshotField::get(path) walk the
entry’s value side along a dotted path. Each component matches a
struct member; pointer dereferences are followed transparently.
let weight = entry.get("ctx.weight").as_u64()?;
let policy = entry.get("ctx.policy").as_str()?; // enum variant name
let pid = entry.get("leader.pid").as_i64()?; // pointer chase
The dotted-path walker:
- Pointer chase. When a path step lands on RenderedValue::Ptr { deref: Some(...) }, the walker transparently follows the dereference (up to 16 hops) before matching the next component. The test author writes the path the BTF would suggest; pointer indirection is invisible.
- Empty path. get("") returns the current value as a SnapshotField::Value, which is useful for terminal accessors on per-CPU slots that hold a scalar directly.
- Composability. Two-segment paths are equivalent to chained get calls: entry.get("ctx.weight") ≡ entry.get("ctx").get("weight").
  Note that Snapshot::var does not split: it treats the full string as one global name. To walk into a struct, use snap.var("ctx").get("weight").
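The three walker rules can be modeled over a tiny value tree. The sketch below is a stand-alone illustration, not ktstr's walker: Value, WalkError, and get are hypothetical, the value enum has only three variants, and the error variants carry less context than the real SnapshotError.

```rust
/// Minimal stand-in for a rendered value tree (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Uint(u64),
    Struct(Vec<(String, Value)>),
    Ptr { deref: Option<Box<Value>> },
}

#[derive(Debug, PartialEq)]
pub enum WalkError {
    EmptyPathComponent,
    FieldNotFound { component: String },
    NotAStruct { component: String },
}

const MAX_PTR_HOPS: usize = 16;

/// Follows Ptr { deref: Some(..) } links, at most MAX_PTR_HOPS times.
fn chase(mut v: &Value) -> &Value {
    for _ in 0..MAX_PTR_HOPS {
        match v {
            Value::Ptr { deref: Some(inner) } => v = &**inner,
            _ => break,
        }
    }
    v
}

/// Walks `path` one dotted component at a time, starting at `root`.
pub fn get<'a>(root: &'a Value, path: &str) -> Result<&'a Value, WalkError> {
    let mut cur = root;
    if path.is_empty() {
        return Ok(cur); // empty path: the current value itself
    }
    for component in path.split('.') {
        if component.is_empty() {
            return Err(WalkError::EmptyPathComponent);
        }
        // Pointer indirection is resolved before matching the member name.
        cur = chase(cur);
        let Value::Struct(members) = cur else {
            return Err(WalkError::NotAStruct { component: component.to_string() });
        };
        cur = members
            .iter()
            .find(|(name, _)| name.as_str() == component)
            .map(|(_, v)| v)
            .ok_or_else(|| WalkError::FieldNotFound { component: component.to_string() })?;
    }
    Ok(cur)
}
```

In this model a path such as "leader.pid" resolves through a pointer-typed leader member without the caller mentioning the pointer, and get(get(e, "ctx")?, "weight") returns the same value as get(e, "ctx.weight").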
Terminal accessors
SnapshotField exposes typed terminal reads, all returning
Result<T, SnapshotError>:
| Method | Returns | Accepts |
|---|---|---|
| as_u64() | u64 | Uint, non-negative Int/Enum, Bool (0/1), Char (raw byte), Ptr (pointer value, including cast-recovered pointers; see Cast-recovered pointers), per-CPU array key |
| as_i64() | i64 | Int, Uint ≤ i64::MAX, Bool, Char, Enum, per-CPU array key |
| as_bool() | bool | Bool direct; Int/Uint/Char/Enum/Ptr non-zero is true; per-CPU array key |
| as_f64() | f64 | Float, Int, Uint, Enum, per-CPU array key |
| as_str() | &str | Enum with a resolved variant name |
| rendered() | Option<&RenderedValue> | the underlying value when present |
Type mismatches surface as SnapshotError::TypeMismatch { expected, actual, requested } — for example, as_str() on a Uint reports
expected: "Enum", actual: "Uint".
Cast-recovered pointers
Schedulers stash kernel pointers (task_struct *, cgroup *, …)
and arena pointers in BPF map fields whose BTF declares them as
u64 because BTF cannot express a pointer to a per-allocation
type. The host-side
cast analyzer walks the
scheduler’s .bpf.o instruction stream during load, recovers the
target struct for each provable (source_struct, field_offset) → target_struct mapping, and feeds the result into the renderer.
When the renderer encounters a u64 slot the analyzer flagged, it
emits a RenderedValue::Ptr
with cast_annotation set and chases the dereference through the
address-space-appropriate reader. The full set of
cast_annotation values:
| Annotation | Meaning |
|---|---|
| "cast→arena" | Cast analyzer flagged a u64 field; chase resolved to an arena allocation via the BTF-typed pointee. |
| "cast→kernel" | Cast analyzer flagged a u64 field; chase resolved to a kernel slab / vmalloc / per-cpu allocation. |
| "sdt_alloc" | BTF-typed Type::Ptr whose pointee was a BTF_KIND_FWD; the renderer recovered the real payload struct id via the sdt_alloc bridge. No cast-analyzer hit was involved. |
| "cast→arena (sdt_alloc)" | Cast analyzer flagged a u64 field AND the chase target peeled to a Fwd; the bridge recovered the real arena payload struct id. |
| "cast→kernel (sdt_alloc)" | Cast analyzer flagged a u64 field AND the chase target peeled to a Fwd; the bridge recovered the real kernel-side struct id. |
A parallel cross-BTF Fwd resolution path is consulted whenever a
chase target survives the local same-BTF Fwd resolve as a
BTF_KIND_FWD: when the body lives in a sibling embedded BPF
object’s BTF (the multi-.bpf.objs shape), the renderer switches
the recursion to that sibling BTF and renders the full body.
Cross-BTF resolution does NOT add a new annotation — the body is
recovered transparently and the rendered subtree carries whichever
annotation ("cast→arena", "cast→kernel", or None for a
BTF-typed Type::Ptr) it would have had if the same struct lived
in the entry BTF.
From the test author’s perspective:
- as_u64() returns the raw pointer value (matching pre-analysis behavior, so existing tests do not need updating).
- entry.get("ctx.task") and similar dotted-path walks transparently follow the cast-recovered chase; nested struct fields appear under the same path the BTF would suggest for a natively-typed pointer.
- The cast_annotation is visible in failure-dump rendering and diagnostic output so an operator can distinguish cast-recovered pointers from BTF-typed ones; the test API does not require any extra calls to consume them.
Error handling
SnapshotError is the unified error type for every fallible
accessor. Each variant carries the path or available alternatives
needed to fix the call site without re-running the test:
- MapNotFound { requested, available } – Snapshot::map(name) miss.
- VarNotFound { requested, available } – Snapshot::var(name) miss.
- AmbiguousVar { requested, found_in } – more than one *.bss / *.data / *.rodata map exposes a top-level member with the requested name. found_in lists every map (in capture order) where the name was seen; disambiguate via Snapshot::map(name) + .at(0).get(...) against a specific map.
- FieldNotFound { requested, walked, component, available } – a path component did not match any struct member at that depth. walked is the prefix that resolved successfully; component is the failing segment; requested is the original user-supplied path.
- NotAStruct { requested, walked, component, kind } – a path component reached a non-struct value where a struct was expected (e.g. descending into a Uint leaf). kind names the actual variant.
- TypeMismatch { expected, actual, requested } – terminal accessor called on a rendered shape it cannot decode. expected names the scalar type the accessor requires; actual names the rendered variant; requested is the user-supplied lookup string (empty when the accessor was invoked on a leaf without a path walk).
- IndexOutOfRange { map, index, len } – SnapshotMap::at(n) past the entry list end.
- PerCpuSlot { map, cpu, len, unmapped } – out-of-range or unmapped per-CPU slot; unmapped: true distinguishes a None slot from an out-of-range CPU.
- NoMatch { map, op } – predicate-based lookup (find, max_by) found no match. op names the operation.
- EmptyPathComponent { requested } – a path string contained an empty component (e.g. "a..b").
- PerCpuNotNarrowed { map } – entry.get called on a per-CPU entry without cpu(N) first.
- NoRendered { map, side } – entry has no rendered key/value side (BTF type id missing at capture time, leaving hex bytes only).
- PlaceholderSample { tag, reason } – a periodic-capture sample’s underlying FailureDumpReport is a placeholder produced by the freeze-rendezvous timeout fallback. Surfaces when projecting via SampleSeries::bpf; temporal patterns route the variant through their skip path so a placeholder never falsely registers as zero progress against a monotonicity / rate / steady / ratio band. reason carries the rendezvous-timeout cause text.
- MissingStats { tag } – a SampleSeries::stats projection ran on a sample whose stats slot is None (stats client not wired or per-sample stats request failed). Distinct from in-JSON path misses (FieldNotFound / TypeMismatch) so the assertion site can branch on the cause without re-walking the source.
SnapshotError implements std::error::Error and Display, so it
composes with ? and anyhow. The Display impl includes the path
and any available alternatives so a failure message points the test
author at the fix.
Worked example
Capture a snapshot, look up a map, walk into its first entry, and read a nested field:
use ktstr::prelude::*;
fn snapshot_then_inspect(ctx: &Ctx) -> Result<AssertResult> {
    // Wire a bridge for the duration of the scenario.
    let cb: CaptureCallback = std::sync::Arc::new(|_name| {
        // Production: freeze + build a real FailureDumpReport. The
        // host installs this callback in real runs.
        Some(FailureDumpReport::default())
    });
    let bridge = SnapshotBridge::new(cb);
    let handle = bridge.clone();
    let _guard = bridge.set_thread_local();
    // Run the scenario, capturing once after spawn.
    let steps = vec![Step {
        setup: vec![CgroupDef::named("workers").workers(2)].into(),
        ops: vec![Op::snapshot("after_spawn")],
        hold: HoldSpec::FULL,
    }];
    let mut result = execute_steps(ctx, steps)?;
    // Drain the bridge and inspect the captured report.
    let captured = handle.drain();
    let report = captured
        .get("after_spawn")
        .ok_or_else(|| anyhow::anyhow!("snapshot 'after_spawn' missing"))?;
    let snap = Snapshot::new(report);
    // Top-level scalar.
    if let Ok(nr_cpus) = snap.var("nr_cpus_onln").as_u64() {
        result.details.push(AssertDetail::new(
            DetailKind::Other,
            format!("captured nr_cpus_onln = {nr_cpus}"),
        ));
    }
    Ok(result)
}
For the executor + bridge wiring outside a VM, see the host-side
smoke tests in tests/snapshot_e2e.rs — they exercise the same
pipeline against a hand-crafted FailureDumpReport so the assertion
shape is covered without booting a guest.
Composing reads with writes
Snapshots are the read half of the host↔guest interaction. The
write half — pre-seeding a BPF map value before the scenario
starts — is the #[ktstr_test] attribute bpf_map_write = CONST,
which targets a BpfMapWrite constant:
use ktstr::prelude::*;
const TRIGGER_FAULT: BpfMapWrite = BpfMapWrite {
    map_name_suffix: ".bss", // matched against discovered maps
    offset: 42,              // byte offset within the map's value
    value: 1,                // u32 written by the host
};
#[ktstr_test(bpf_map_write = TRIGGER_FAULT, expect_err = true)]
fn fault_then_inspect(ctx: &Ctx) -> Result<AssertResult> {
    // The host has already written `1` at `.bss + 42` before
    // the scenario started. Capture and inspect the resulting
    // scheduler state mid-run.
    /* bridge wiring + Op::snapshot + Snapshot::new as above */
    Ok(AssertResult::pass())
}
The write is event-driven: the host polls for BPF map
discoverability (scheduler loaded), polls the SHM ring for
scenario start, then writes the configured u32 at the configured
offset. Only BPF_MAP_TYPE_ARRAY maps are supported; the framework
finds the map by map_name_suffix (e.g. ".bss") via
BpfMapAccessor::find_map. See Monitor → BPF map writes
for the prerequisites (vmlinux) and the full host-side
contract.
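The effect of the write on the map's value bytes can be pictured with a plain byte buffer. The sketch below is a stand-alone model, not ktstr's host-side writer: find_by_suffix and write_u32_at are hypothetical helpers, the map names are invented, and little-endian byte order is an assumption made so the example is deterministic.

```rust
#[derive(Debug, PartialEq)]
pub enum WriteError {
    OutOfBounds { offset: usize, len: usize },
}

/// Picks the first map whose name ends with `suffix` (e.g. ".bss").
pub fn find_by_suffix<'a>(names: &'a [&'a str], suffix: &str) -> Option<&'a str> {
    names.iter().copied().find(|n| n.ends_with(suffix))
}

/// Writes `value` as four bytes at `offset` inside the map's value buffer.
/// Byte order is assumed little-endian for this illustration.
pub fn write_u32_at(buf: &mut [u8], offset: usize, value: u32) -> Result<(), WriteError> {
    let len = buf.len();
    let end = offset
        .checked_add(4)
        .filter(|&e| e <= len)
        .ok_or(WriteError::OutOfBounds { offset, len })?;
    buf[offset..end].copy_from_slice(&value.to_le_bytes());
    Ok(())
}
```

With the TRIGGER_FAULT constant above (offset 42, value 1), this model sets bytes 42 through 45 of the matched map's value and leaves every other byte untouched.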
Read+write workflows then compose naturally: the test pre-seeds
guest state with bpf_map_write, lets the scheduler run, and
asserts on the resulting state with Op::snapshot + the
Snapshot accessor:
- Write (pre-scenario) – bpf_map_write flips a .bss flag the scheduler reads.
- Run – the scenario’s ops drive workload behavior; the scheduler reacts to the flag.
- Read (mid-scenario) – Op::snapshot("after") captures the scheduler state at the chosen point.
- Assert – Snapshot::var(...).as_u64() / Snapshot::map(...).find(...).get(...).as_*() verifies the reaction. Errors carry the available alternatives so a typo or stale field name surfaces before the test author hand-edits the case.
The write side is a single one-shot poke at scheduler-load time;
there is no Op variant for runtime writes. Ergonomic mid-scenario
state mutation is reserved for cases where the scheduler itself
exports a writable interface (sysfs, debugfs, BPF map command
interface) and the test invokes that interface from a workload
process.
Watch Snapshots
A watch snapshot registers a hardware data-write watchpoint on a
named kernel symbol. The host arms a watchpoint slot via the guest’s
hardware debug facilities; the produced captures share the
Snapshot accessor surface documented in Snapshots.
Watch snapshots are supported on x86_64 and aarch64 KVM hosts. The slot terminology below is arch-neutral — each architecture’s KVM plumbing maps the slots onto its native hardware-watchpoint facility (debug registers on x86_64, hardware watchpoints on aarch64).
Op::watch_snapshot("symbol") is the write-driven capture
trigger. Use it when the question is “what does the scheduler look
like whenever the kernel touches X?” rather than “what does it
look like at this point in my scenario?”. For time-driven capture,
use Op::snapshot instead.
How it works
The full pipeline is implemented and tested end-to-end:
- Op::watch_snapshot(symbol) registers the symbol via the virtio-console port 1 MSG_TYPE_SNAPSHOT_REQUEST TLV frame.
- The freeze coordinator resolves the KVA from the vmlinux ELF, validates 4-byte alignment, and arms a free user watchpoint slot via KVM_SET_GUEST_DEBUG.
- When the guest writes to the watched address, the corresponding debug exit fires and the host identifies which slot tripped.
- The coordinator captures via freeze_and_capture and stores the report in the SnapshotBridge under the symbol tag.
- The report is also mirrored to a sidecar JSON file for post-hoc inspection.
The per-scenario cap of MAX_WATCH_SNAPSHOTS (= 3) is enforced
(slot 0 is reserved for the error-class exit_kind trigger; the
remaining 3 slots are available for user watches). A 4th
Op::watch_snapshot fails the step with a “cap exceeded” message.
Symbol-resolution failures bail immediately so a typo surfaces
visibly.
Op::watch_snapshot covers the full pipeline: registration,
cap enforcement, symbol resolution, hardware arming, and
automatic capture on write.
Issuing a watch
use ktstr::prelude::*;
let steps = vec![Step {
    setup: vec![CgroupDef::named("workers").workers(2)].into(),
    ops: vec![
        Op::watch_snapshot("jiffies_64"),
        Op::watch_snapshot("scx_watchdog_timestamp"),
    ],
    hold: HoldSpec::FULL,
}];
execute_steps(ctx, steps)?;
Each Op::watch_snapshot invokes the active SnapshotBridge’s
register_watch callback with the symbol string. On success, the
callback is responsible for arming a hardware watchpoint that will
fire whenever the guest writes to the symbol’s address. Each fire
produces one capture, tagged with the symbol path itself.
Wiring the bridge
In #[ktstr_test] scenarios that boot a VM, the bridge is wired automatically. Use post_vm to read captures from VmResult::snapshot_bridge. Do not install a thread-local bridge inside the scenario function: the in-VM Op::watch_snapshot registers via the virtio-console port 1 MSG_TYPE_SNAPSHOT_REQUEST TLV frame, the host coordinator arms the watchpoint and stores captures on the bridge it owns, and the test reads them after the VM exits.
The set_thread_local pattern below is for host-side unit tests that exercise the executor in process without booting a guest.
A watch-capable bridge for host-side unit tests needs both a capture
callback and a register_watch callback:
use ktstr::prelude::*;
let cb: CaptureCallback = std::sync::Arc::new(|_name| {
    Some(FailureDumpReport::default())
});
let reg: WatchRegisterCallback = std::sync::Arc::new(|symbol: &str| {
    // Host-side unit tests: record the symbol and return Ok. In a
    // booted VM, the host coordinator's pipeline runs instead;
    // see arm_user_watchpoint in src/vmm/freeze_coord.rs.
    println!("would arm watchpoint on {symbol}");
    Ok(())
});
let bridge = SnapshotBridge::new(cb).with_watch_register(reg);
let _guard = bridge.set_thread_local();
A bridge built only with SnapshotBridge::new(cb) (no
with_watch_register) rejects every Op::watch_snapshot with an
error pointing the operator at the missing wiring.
Symbol resolution
Production resolution is a verbatim match against the vmlinux ELF
symbol table. The freeze coordinator walks Elf::syms and accepts
the symbol whose strtab entry equals the requested string byte-for-byte
— there is no prefix stripping, BTF lookup, kallsyms walk, or per-CPU
offset arithmetic. Use the exact name nm vmlinux would print:
- "jiffies_64" – the kernel’s monotonic tick counter.
- "scx_watchdog_timestamp" – sched_ext’s watchdog timestamp.
Warning: high-frequency symbols soft-lock the guest. Watching a symbol that the kernel writes every jiffy (e.g. jiffies_64 at HZ=1000) fires 1000+ captures per second. Each capture freezes all vCPUs for the full dump pipeline. The guest spends almost all of its wall time paused, which is indistinguishable from a soft lock-up: schedulers stall, watchdogs fire, and the test wedges before any meaningful work runs. Pick symbols the kernel writes at scenario-relevant cadence (a state field, a per-event counter), not on every tick.
The string passed to Op::watch_snapshot must match a vmlinux ELF
symtab entry exactly; otherwise the step fails with
symbol '...' not found in vmlinux symtab. The
register_watch callback on a host-side test bridge can accept any
shape it wants — the e2e tests in tests/snapshot_e2e.rs use
"kernel.a" / "kernel.b" / etc. for the cap-enforcement test and
"exit_kind" for the in-VM test — but the Op::watch_snapshot ops
that flow through the production pipeline (in-VM scenarios with no
host-side bridge override) must use a verbatim ELF symbol.
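The verbatim-match rule, together with the 4-byte alignment check mentioned in the pipeline description, can be sketched over a plain name-to-address table. This is a stand-alone model, not the freeze coordinator's code: resolve_watch_addr and ResolveError are hypothetical, and the addresses used with it are made-up values rather than real kernel addresses.

```rust
#[derive(Debug, PartialEq)]
pub enum ResolveError {
    NotFound(String),
    Misaligned { symbol: String, addr: u64 },
}

/// Resolves `requested` by exact, byte-for-byte name match, then checks
/// that the address satisfies addr & 0x3 == 0.
pub fn resolve_watch_addr(
    symtab: &[(&str, u64)],
    requested: &str,
) -> Result<u64, ResolveError> {
    let addr = symtab
        .iter()
        .find(|(name, _)| *name == requested) // no prefix or suffix matching
        .map(|(_, addr)| *addr)
        .ok_or_else(|| ResolveError::NotFound(requested.to_string()))?;

    if addr & 0x3 != 0 {
        return Err(ResolveError::Misaligned { symbol: requested.to_string(), addr });
    }
    Ok(addr)
}
```

In this model a request for "jiffies" does not resolve against a table that only contains "jiffies_64", which mirrors the "no prefix stripping" rule stated above.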
Maximum of 3 watches per scenario
pub const MAX_WATCH_SNAPSHOTS: usize = 3;
The bridge enforces a per-scenario cap of 3 successfully-registered
watches. The number is tied to the per-vCPU hardware-watchpoint
slots KVM exposes via KVM_SET_GUEST_DEBUG: slot 0 is reserved for
the existing *scx_root->exit_kind watchpoint that drives the
error-class freeze trigger; the remaining three user watchpoint
slots are available for on-demand watches.
A 4th Op::watch_snapshot in the same scenario fails the step with a
“cap exceeded” message:
let steps = vec![Step {
setup: vec![CgroupDef::named("cg").workers(2)].into(),
ops: vec![
Op::watch_snapshot("kernel.a"),
Op::watch_snapshot("kernel.b"),
Op::watch_snapshot("kernel.c"),
Op::watch_snapshot("kernel.d"), // <-- cap exceeded
],
hold: HoldSpec::FULL,
}];
let result = execute_steps(ctx, steps)?;
assert!(!result.passed);
// One AssertDetail carries the cap-exceeded message:
// "Op::WatchSnapshot cap exceeded: scenario already registered 3
// watchpoints (3 user watchpoint slots occupied; slot 0 reserved for the
// error-class exit_kind trigger)..."
A failed register (cap exceeded, callback error, missing
register_watch) does not consume a slot. The bridge rolls the
count back so the scenario can keep trying with different symbols up
to the cap.
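The reserve-then-roll-back accounting described above can be sketched as follows. This is a minimal model, not the SnapshotBridge implementation: WatchCounter and try_register are hypothetical names, and the register callback is reduced to a closure that returns Result<(), String>.

```rust
/// Same value as the documented constant; everything else here is hypothetical.
pub const MAX_WATCH_SNAPSHOTS: usize = 3;

/// Stand-in for the bridge's per-scenario watch accounting.
#[derive(Default)]
pub struct WatchCounter {
    registered: usize,
}

impl WatchCounter {
    /// Reserve a slot, run the register callback, and give the slot back
    /// if the callback fails.
    pub fn try_register<F>(&mut self, symbol: &str, register: F) -> Result<(), String>
    where
        F: FnOnce(&str) -> Result<(), String>,
    {
        if self.registered >= MAX_WATCH_SNAPSHOTS {
            // Cap exceeded: nothing was reserved, so nothing to roll back.
            return Err(format!(
                "cap exceeded: scenario already registered {} watchpoints",
                self.registered
            ));
        }
        self.registered += 1;
        if let Err(reason) = register(symbol) {
            self.registered -= 1; // failed register does not consume a slot
            return Err(format!("register failed for '{symbol}': {reason}"));
        }
        Ok(())
    }

    /// Number of successfully registered watches.
    pub fn registered(&self) -> usize {
        self.registered
    }
}
```

A failed attempt leaves registered() unchanged, so a scenario that mistypes one symbol can still arm three valid watches afterwards.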
Failure modes
The register callback is the single integration point where
production resolution can fail. The reasons documented on
WatchRegisterCallback:
- The symbol does not match any vmlinux ELF symtab entry (typo, symbol stripped from the build, or a non-ELF kernel image).
- The resolved KVA is not 4-byte aligned (the 4-byte watch length the framework arms requires addr & 0x3 == 0 on every supported architecture).
- All three available user watchpoint slots are already allocated inside the host’s KVM plumbing.
- KVM_SET_GUEST_DEBUG rejected the arm.
When the callback returns Err(reason), the executor bails the step
immediately with a message containing the symbol and the failure
reason. Silent degradation is deliberately avoided — a watch that
never fires would look identical to a healthy passing run, and the
test author would never notice the captures were missing.
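The alignment precondition is easy to check ahead of time when the address of a candidate symbol is known (for example from nm vmlinux). The helper below is a hypothetical sketch of that check, not a ktstr API:

```rust
/// Returns Ok when a 4-byte watch could be armed at `kva`,
/// i.e. when the two low address bits are clear.
pub fn check_watch_alignment(kva: u64) -> Result<(), String> {
    if (kva & 0x3) == 0 {
        Ok(())
    } else {
        Err(format!("KVA {kva:#x} is not 4-byte aligned"))
    }
}
```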
Slot 0 (exit_kind) is separate
The existing error-class freeze trigger watches
*scx_root->exit_kind on slot 0 and is not an
Op::watch_snapshot slot. It is wired by the freeze coordinator
independently to detect SCX_EXIT_ERROR writes and drive the
failure-dump pipeline. That trigger is unrelated to the on-demand
watch surface — it always runs, regardless of whether a scenario
declares any Op::watch_snapshot ops. The cap of 3 reflects the
three remaining user slots after slot 0 is held back.
For tests that want the failure dump produced by SCX_EXIT_ERROR,
nothing needs to be declared; the trigger fires automatically and
the dump is written to {sidecar_dir()}/{test_name}.failure-dump.json.
The watch-snapshot in-VM test in tests/snapshot_e2e.rs reads that
file back and feeds it through the Snapshot accessor as a way
to demonstrate the full read path.
Reading captures
Once a watchpoint fires, the resulting report is stored on the bridge
under the tag and read back exactly as Op::snapshot captures are.
Every accessor — Snapshot::map, Snapshot::var,
SnapshotMap::at / find / filter / max_by, dotted-path walks,
typed terminal reads — is shared. See Snapshots for
the full surface.
Periodic Capture
Op::snapshot is on-demand — the test author picks the moment of
capture. Periodic capture is the cadenced complement: the freeze
coordinator fires freeze_and_capture(false) at evenly-spaced points
across the workload window without the scenario body asking. The
result is a time-ordered series of (report, stats, elapsed_ms)
samples that flows naturally into the
temporal-assertion patterns.
Enabling periodic capture
Set num_snapshots = N on the #[ktstr_test] attribute. N is the
number of interior boundaries to fire; 0 (the default) disables
periodic capture entirely.
use ktstr::prelude::*;
#[ktstr_test(num_snapshots = 3, duration_s = 10)]
fn paced_capture(ctx: &Ctx) -> Result<AssertResult> {
execute_defs(ctx, vec![
CgroupDef::named("workers").workers(2).work_type(WorkType::SpinWait),
])
}
When boundaries fire
The window is the 10 %–90 % slice of the workload duration,
anchored at the first MSG_TYPE_SCENARIO_START the freeze
coordinator observes. A 10 % pre-buffer at the start (workload
ramp-up) and a 10 % post-buffer at the end (ramp-down) keep periodic
samples off transient state.
The remaining 80 % is divided into N + 1 equal intervals, yielding
N interior boundary points:
| num_snapshots = N | Boundary timestamps (relative to scenario start) |
|---|---|
| 1 | 0.5·d (midpoint) |
| 3 | 0.3·d, 0.5·d, 0.7·d |
| N ≥ 2 | 0.1·d + (i+1)·0.8·d / (N+1) for i ∈ 0..N |
For a 10 s workload, N = 3 produces captures at scenario_start +
{3 s, 5 s, 7 s}.
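The boundary formula can be checked with a few lines of integer arithmetic. The function below is an illustrative sketch (periodic_boundaries_ms is a hypothetical name, and the coordinator's own rounding is not documented here); it works in whole milliseconds, so results are truncated when the window does not divide evenly.

```rust
/// Boundary offsets in ms from scenario start for `n` periodic captures
/// over a workload lasting `duration_ms`.
pub fn periodic_boundaries_ms(duration_ms: u64, n: u64) -> Vec<u64> {
    let start = duration_ms / 10; // 10 % pre-buffer
    let span = duration_ms * 8 / 10; // the 80 % capture window
    // Boundary i sits (i + 1) intervals into the window, each interval
    // being span / (n + 1) long.
    (0..n).map(|i| start + (i + 1) * span / (n + 1)).collect()
}
```

For duration_ms = 10_000 and n = 3 this yields 3000, 5000 and 7000, matching the table row for N = 3.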
Anchoring at MSG_TYPE_SCENARIO_START means VM boot, BPF verifier
time, and any other pre-scenario work do NOT eat the budget — every
boundary lands inside the workload’s actual run window.
MSG_TYPE_SCENARIO_PAUSE / MSG_TYPE_SCENARIO_RESUME from the guest
shift every un-fired boundary by the cumulative pause duration. The
boundary clock is workload time, not wall-clock: a guest that
pauses for P ns delays each remaining boundary by P ns.
Tag namespace
Each periodic capture is stored on the host’s SnapshotBridge under
"periodic_NNN" — zero-padded 3-digit ordinal index, e.g.
periodic_000, periodic_001, periodic_002. The width is fixed at
3 digits because the bridge cap (see below) maxes out at
MAX_STORED_SNAPSHOTS (= 64 today), so 3 digits always suffices.
Periodic tags coexist with on-demand Op::snapshot tags and
watchpoint-fire tags on the same bridge. Use
SampleSeries::periodic_only (or
periodic_ref() for the borrowed equivalent) to filter to the
periodic timeline before assertions.
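For tests that need to build or recognise these tags by hand, the naming scheme reduces to a zero-padded format string and a prefix check. Both helpers below are hypothetical illustrations, not functions exported by ktstr:

```rust
/// Tag for the periodic capture with the given ordinal, e.g. "periodic_007".
pub fn periodic_tag(index: usize) -> String {
    format!("periodic_{index:03}")
}

/// True for tags in the periodic namespace; on-demand and watchpoint tags
/// chosen by the test author do not carry this prefix.
pub fn is_periodic(tag: &str) -> bool {
    tag.starts_with("periodic_")
}
```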
Capture cost
Each periodic boundary fires the same freeze_and_capture(false)
path that Op::Snapshot dispatches:
- Every vCPU is parked under FREEZE_RENDEZVOUS_TIMEOUT (30 s hard ceiling).
- BPF maps are walked.
- The dump is serialised to JSON.
- The report is stored on the bridge.
On a healthy guest with a typical scheduler-state map size, the freeze is tens of milliseconds (10–100 ms steady state; cold-cache or large guest-memory walks can push higher). The host-side watchdog deadline is extended by the freeze duration after each fire, so periodic captures do not eat into the workload’s wall-clock budget.
Minimum spacing
KtstrTestEntry::validate rejects entries where the per-boundary
interval is below 100 ms — boundaries scheduled closer than that
would fire back-to-back without any workload progress in between.
The exact rule: 0.8 · duration / (N + 1) >= 100 ms. Either
reduce num_snapshots or extend duration_s if validation refuses
the configuration.
Bridge cap
num_snapshots cannot exceed MAX_STORED_SNAPSHOTS (= 64).
Validation rejects higher values rather than silently FIFO-evicting
the earliest periodic samples. Split into multiple test entries if
a longer timeline is needed.
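The two validation rules (minimum spacing and bridge cap) can be prototyped before committing to a configuration. The sketch below mirrors the documented inequalities only; validate_periodic is a hypothetical helper, and the order of checks and the error text inside KtstrTestEntry::validate are assumptions.

```rust
pub const MAX_STORED_SNAPSHOTS: u64 = 64;
pub const MIN_INTERVAL_MS: u64 = 100;

/// Check a (duration, num_snapshots) pair against the two documented rules.
pub fn validate_periodic(duration_ms: u64, num_snapshots: u64) -> Result<(), String> {
    if num_snapshots == 0 {
        return Ok(()); // periodic capture disabled
    }
    if num_snapshots > MAX_STORED_SNAPSHOTS {
        return Err(format!(
            "num_snapshots {num_snapshots} exceeds MAX_STORED_SNAPSHOTS ({MAX_STORED_SNAPSHOTS})"
        ));
    }
    // 0.8 * duration / (N + 1) >= 100 ms, rearranged to stay in integers:
    // 8 * duration >= 10 * 100 * (N + 1).
    if 8 * duration_ms < 10 * MIN_INTERVAL_MS * (num_snapshots + 1) {
        return Err(format!(
            "interval below {MIN_INTERVAL_MS} ms: reduce num_snapshots or extend the duration"
        ));
    }
    Ok(())
}
```

A 1 s workload therefore admits at most 7 periodic captures (8000 / 8 = 1000 ≥ 1000), while 8 captures falls below the 100 ms floor.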
Best-effort delivery
Up to N captures fire, but the run-loop stops servicing periodic
boundaries the moment the kill flag fires. An early VM exit, BSP
done, rendezvous timeout, or watchdog deadline can cut the periodic
sequence short. Tests should assert
result.periodic_fired >= some_lower_bound rather than equality:
fn check_coverage(result: &VmResult) -> Result<()> {
anyhow::ensure!(
result.periodic_target == 3,
"expected num_snapshots = 3, got {}",
result.periodic_target,
);
anyhow::ensure!(
result.periodic_fired >= 2,
"too few periodic samples ({}/{})",
result.periodic_fired,
result.periodic_target,
);
Ok(())
}
result.periodic_target mirrors the configured num_snapshots;
result.periodic_fired is the count actually serviced (including
rendezvous-timeout placeholders). The pair lets a test compute
coverage without re-reading the entry table.
The run-loop additionally abandons the remaining sequence after 2
consecutive rendezvous timeouts and emits a tracing::warn naming
the consecutive-timeout count, so a sustained host overload does
not pile up dozens of placeholder samples.
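The abandon-after-two-timeouts rule can be modelled as a short loop over boundary outcomes. This is a sketch under stated assumptions: Outcome and serviced_before_abandon are hypothetical names, and the sketch assumes the second timeout is itself counted as serviced (as a placeholder) before the sequence is dropped.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Outcome {
    Captured,
    RendezvousTimeout,
}

/// Count how many boundaries are serviced before the sequence is abandoned.
pub fn serviced_before_abandon(outcomes: &[Outcome]) -> usize {
    let mut consecutive_timeouts = 0;
    let mut fired = 0;
    for &outcome in outcomes {
        fired += 1; // placeholders count toward the serviced total
        if outcome == Outcome::RendezvousTimeout {
            consecutive_timeouts += 1;
            if consecutive_timeouts >= 2 {
                break; // sustained overload: drop the remaining boundaries
            }
        } else {
            consecutive_timeouts = 0; // a successful capture resets the streak
        }
    }
    fired
}
```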
Op::snapshot captures composed by the test author land on the
same bridge alongside the periodic_NNN tags; total bridge
occupancy is num_snapshots + user_captures and the bridge
FIFO-evicts past MAX_STORED_SNAPSHOTS.
Draining the bridge
The temporal-assertion pipeline runs on the host, so the drain
happens after vm.run() returns — typically inside a post_vm
callback. Use
SnapshotBridge::drain_ordered_with_stats to take
ownership of the captured (tag, report, stats, elapsed_ms) tuples
in insertion order:
use ktstr::prelude::*;
fn post_vm(result: &VmResult) -> Result<()> {
let series = SampleSeries::from_drained(
result.snapshot_bridge.drain_ordered_with_stats(),
)
.periodic_only();
anyhow::ensure!(
!series.is_empty(),
"no periodic samples — coordinator never fired",
);
// ... walk samples or feed into temporal patterns ...
Ok(())
}
#[ktstr_test(num_snapshots = 3, duration_s = 10, post_vm = post_vm)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
execute_defs(ctx, vec![
CgroupDef::named("workers").workers(2).work_type(WorkType::SpinWait),
])
}
drain_ordered_with_stats returns a
Vec<(String, FailureDumpReport, Option<serde_json::Value>, Option<u64>)>
in the order store() saw inserts. Periodic boundaries land
periodic_000 first, periodic_NNN last. The FIFO eviction at
MAX_STORED_SNAPSHOTS drops the oldest tags from order and
reports together, so a hot run that overflowed the cap returns
the most recent MAX_STORED_SNAPSHOTS captures in insertion order.
drain_ordered (without _with_stats) drops the parallel stats /
elapsed metadata; use it only when the test does not need either.
drain (no ordering, no stats) returns a HashMap and loses the
periodic timeline ordering — avoid for periodic data.
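The FIFO-eviction and ordered-drain behaviour can be pictured with a bounded queue. The type below is a hypothetical model rather than the bridge itself: each capture is reduced to a tag plus a u64 payload, and the cap is assumed to be greater than zero.

```rust
use std::collections::VecDeque;

/// Hypothetical bounded store standing in for the bridge's tag/report storage.
pub struct BoundedStore {
    cap: usize,
    entries: VecDeque<(String, u64)>,
}

impl BoundedStore {
    pub fn new(cap: usize) -> Self {
        Self { cap, entries: VecDeque::new() }
    }

    /// Insert a capture, evicting the oldest entry once the cap is reached.
    pub fn store(&mut self, tag: &str, payload: u64) {
        if self.entries.len() == self.cap {
            self.entries.pop_front();
        }
        self.entries.push_back((tag.to_string(), payload));
    }

    /// Take ownership of every entry in insertion order, leaving the store empty.
    pub fn drain_ordered(&mut self) -> Vec<(String, u64)> {
        self.entries.drain(..).collect()
    }
}
```

With a cap of 3 and four inserts, the drain returns the last three captures in the order they were stored, and a second drain returns nothing.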
Sample anatomy
Each drained tuple unpacks into a Sample<'_> view (via
SampleSeries::iter_samples):
for sample in series.iter_samples() {
let tag: &str = sample.tag; // e.g. "periodic_001"
let elapsed_ms: u64 = sample.elapsed_ms; // ms since run_start
let snap: Snapshot<'_> = sample.snapshot; // BPF state view
let stats: Option<&serde_json::Value> = sample.stats; // scx_stats JSON
// ...
}
elapsed_ms is pause-adjusted: the coordinator subtracts
cumulative MSG_TYPE_SCENARIO_PAUSE/RESUME time (and any in-flight
pause window) before stamping the value. The timestamp is captured
AFTER the scx_stats request returns (or fails) and BEFORE entering
the freeze rendezvous, so elapsed_ms reflects when the running
scheduler’s stats were observed; BPF state is observed up to
FREEZE_RENDEZVOUS_TIMEOUT later than that anchor.
stats is None when the stats client was not wired
(scheduler_binary is absent), or the per-sample stats request
failed (relay rejected, non-zero envelope errno, scheduler not yet
listening). A None slot surfaces through
SampleSeries::stats as a
SnapshotError::MissingStats { tag } per-sample error — distinct
from in-JSON path misses so the assertion site can branch on the
cause.
A sample whose underlying FailureDumpReport is a placeholder
(rendezvous timeout fallback) surfaces through
SampleSeries::bpf as a
SnapshotError::PlaceholderSample { tag, reason } per-sample error
rather than passing a hollow Snapshot to the projection closure.
What to assert
The standard shape is two-stage:
- Compose the series — drain, filter to periodic.
- Project + assert — pick a column, choose a temporal pattern.
For monotonic counters (BPF .bss advancement, scx_stats counter
fields), nondecreasing is the canonical choice. For
utilisation-style metrics that should hold steady once warmup ends,
steady_within captures the invariant. For “system stabilizes near
target”, converges_to witnesses the convergence.
For the full pattern surface, projection helpers, and failure rendering, see Temporal Assertions.
Temporal Assertions
Periodic snapshots produce a series of samples over time. Temporal assertions answer questions about the trajectory — does a counter only ever advance? Does a utilization metric stay near its mean once warmup ends? Does a load average converge before a deadline?
The shape is two-stage:
- Build a SampleSeries from the bridge’s drained periodic captures.
- Project a SeriesField<T> (one column of T-typed values across every sample) and feed it through a temporal pattern (nondecreasing, rate_within, steady_within, converges_to, always_true, ratio_within).
Each pattern records DetailKind::Temporal
details on the Verdict when a sample violates the invariant, and
records Notes when projection errors leave a coverage gap.
For how to enable periodic capture and drain the bridge, see Periodic Capture. This page covers the projection + assertion surface only.
SampleSeries
SampleSeries is the ordered sequence of (tag, report, stats, elapsed_ms) tuples drained from the bridge after the VM exits. Build
it from
SnapshotBridge::drain_ordered_with_stats:
use ktstr::prelude::*;
let drained = vm_result.snapshot_bridge.drain_ordered_with_stats();
let series = SampleSeries::from_drained(drained).periodic_only();
periodic_only() filters to entries whose tag begins with
"periodic_" — it strips on-demand Op::snapshot captures and
watchpoint fires that share the bridge’s tag namespace. Use
periodic_ref() for the borrowed-iterator equivalent when one test
needs both views from the same series.
SampleSeries exposes:
- len(), is_empty(): sample count.
- iter_samples(): borrowed Sample<'_> views (each carrying tag, elapsed_ms, Snapshot<'_>, Option<&Value> stats).
- bpf(label, |snap| …) / stats(label, |sv| …): manual closure projection along the BPF or stats axis.
- bpf_map(map_name) / stats_path(path): typed auto-projection helpers (see Auto-projection).
SeriesField
A SeriesField<T> is one per-sample column extracted from a
SampleSeries. Each slot is a SnapshotResult<T> so a missing
field, type mismatch, or placeholder report on any individual sample
does NOT abort the whole projection — it surfaces at the temporal-
assertion site as a per-sample error the pattern decides how to
handle.
The field carries the per-sample tags and elapsed-ms timestamps alongside the values, so failure messages name the offending sample without the caller re-threading the source series.
Projecting from BPF state
The SampleSeries::bpf closure receives each sample’s
Snapshot<'_>:
let nr_dispatched: SeriesField<u64> = series.bpf(
"nr_dispatched",
|snap| snap.var("nr_dispatched").as_u64(),
);
The closure body is a normal Snapshot accessor expression;
its SnapshotResult<T> return value lands directly in the field.
Projecting from scx_stats JSON
The SampleSeries::stats closure receives each sample’s
StatsValue<'_> — a thin wrapper around the per-sample stats JSON
exposing path("…").as_u64() / as_f64() etc.:
let busy: SeriesField<f64> = series.stats(
"busy",
|sv| sv.path("busy").as_f64(),
);
A sample whose stats slot is None (the stats request failed, the
relay rejected, or the scheduler binary isn’t wired) yields a
SnapshotError::MissingStats { tag } slot, distinct from an
in-JSON path miss (FieldNotFound / TypeMismatch), so the
assertion site can tell coverage gaps apart from data errors.
Auto-projection
The typed auto-projectors discover available field names and emit
ready-to-feed SeriesFields without an explicit closure:
// Top-level scalar member of a BPF map's first entry.
let dispatched = series
.bpf_map("scx_obj.bss")
.at(0)
.field_u64("nr_dispatched");
// Stats path drilling into nested layer/cgroup keys.
let layer_util = series
.stats_path("layers")
.key("batch")
.field_f64("util");
Bulk discovery is also available — member_names() /
u64_fields() / f64_fields() on the BPF projector,
key_names() / u64_fields() / f64_fields() on the stats
projector. The *_fields() helpers project every member that
yields at least one Ok value across the series, dropping
non-numeric / type-mismatched fields silently. Useful for blanket
“every counter must be nondecreasing” sweeps.
The typed field_* helpers handle top-level scalar fields only.
Nested struct members (e.g. "ctx.weight") and per-CPU maps need
the manual closure path through SampleSeries::bpf.
The six temporal patterns
Every pattern takes &mut Verdict and returns the same &mut Verdict so chains of assertions stack onto one accumulator. Each
pattern is a method on SeriesField:
nondecreasing / strictly_increasing
Pass when every consecutive pair satisfies values[i] <= values[i+1] (or <, for the strict variant). The common shape for
kernel counters whose only legal direction is up.
let mut v = Verdict::new();
nr_dispatched.nondecreasing(&mut v);
nr_dispatched.strictly_increasing(&mut v); // require advance every period
Per-sample projection errors are SKIPPED — the affected pair is
dropped, the skip count is logged as a verdict Note, and the
verdict is NOT flipped on missing-data conditions. Adjacent samples
on either side of a gap are still checked. A series with fewer than
2 samples records a Note (“vacuously holds”) and passes.
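The skip-on-error policy can be sketched with plain Result values. This is an illustration, not the SeriesField implementation: the function name and return shape are hypothetical, and the sketch assumes only index-adjacent pairs are compared, so any pair touching an errored sample is dropped and counted.

```rust
/// Returns (indices where a regression was seen, number of skipped pairs).
/// Each slot is Ok(value) or Err(projection error message).
pub fn nondecreasing(values: &[Result<u64, String>]) -> (Vec<usize>, usize) {
    let mut regressions = Vec::new();
    let mut skipped_pairs = 0;
    for i in 1..values.len() {
        match (&values[i - 1], &values[i]) {
            (Ok(prev), Ok(cur)) => {
                if cur < prev {
                    regressions.push(i); // value went backwards at sample i
                }
            }
            // A projection error on either side drops the pair without
            // flipping the outcome; the count feeds a coverage note.
            _ => skipped_pairs += 1,
        }
    }
    (regressions, skipped_pairs)
}
```

A series with fewer than two samples never enters the loop, which corresponds to the "vacuously holds" case.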
rate_within(lo, hi) (f64 only)
Pass when every consecutive (delta_value / delta_ms) lies in
[lo, hi]. Rate is computed from per-sample elapsed-ms timestamps,
so a counter that should advance at ~1 unit/ms reads as
rate_within(0.5, 2.0).
let ticks: SeriesField<f64> = series.bpf("ticks",
|snap| snap.var("ticks").as_f64());
ticks.rate_within(&mut v, 0.5, 2.0);
Failure modes:
- A zero-time delta between adjacent samples records a structured detail naming the offending pair.
- A non-finite rate (NaN / Inf endpoints, or a finite difference that overflows f64) records a non-finite rate detail rather than silently slipping past the band check.
- Caller error (lo > hi) lands as a single detail.
Per-sample projection errors are GAPS — no rate is computed across
the gap, the skip count is logged as a Note with the underlying
error variant.
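The rate computation and its failure modes can be written out over (elapsed_ms, value) pairs. The function below is a hypothetical sketch: it returns plain strings instead of structured Verdict details and leaves out the per-sample projection-error handling.

```rust
/// Check every consecutive delta_value / delta_ms against [lo, hi].
/// Returns one message per violation; an empty Vec means the band held.
pub fn rate_violations(samples: &[(u64, f64)], lo: f64, hi: f64) -> Vec<String> {
    let mut details = Vec::new();
    if lo > hi {
        details.push(format!("caller error: lo {lo} > hi {hi}"));
        return details;
    }
    for pair in samples.windows(2) {
        let (t0, v0) = pair[0];
        let (t1, v1) = pair[1];
        if t1 == t0 {
            details.push(format!("zero-time delta at {t0} ms"));
            continue;
        }
        let rate = (v1 - v0) / (t1 as f64 - t0 as f64);
        if !rate.is_finite() {
            details.push(format!("non-finite rate between {t0} ms and {t1} ms"));
        } else if rate < lo || rate > hi {
            details.push(format!("rate {rate} outside [{lo}, {hi}] at {t1} ms"));
        }
    }
    details
}
```

A counter that gains 1000 units per 1000 ms has a rate of 1.0 unit/ms and sits inside rate_within(0.5, 2.0).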
steady_within(warmup_ms, tolerance) (f64 only)
Pass when every post-warmup sample (elapsed_ms >= warmup_ms)
lies inside [mean·(1-tolerance), mean·(1+tolerance)]. The mean is
computed over the post-warmup samples only — the warmup region is
excluded so ramp-up does not bias the steady-state baseline.
tolerance is a fraction (0.10 = ±10%).
let util: SeriesField<f64> = series.stats("busy",
|sv| sv.path("busy").as_f64());
util.steady_within(&mut v, /*warmup_ms=*/ 1000, /*tolerance=*/ 0.10);
Per-sample projection errors are SKIPPED with a Note. When the
warmup window absorbs every sample, the pattern emits a “no
samples beyond warmup” Note and passes vacuously.
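The band computation is worth seeing spelled out, because the mean comes from the post-warmup samples themselves rather than from a fixed target. The sketch below uses a hypothetical function name, takes (elapsed_ms, value) pairs, and assumes a non-negative mean so that the lower bound stays below the upper bound.

```rust
/// Returns the indices of post-warmup samples that fall outside
/// [mean * (1 - tolerance), mean * (1 + tolerance)].
pub fn steady_violations(samples: &[(u64, f64)], warmup_ms: u64, tolerance: f64) -> Vec<usize> {
    // Keep only samples at or after the warmup boundary.
    let mut post: Vec<(usize, f64)> = Vec::new();
    for (i, &(elapsed_ms, value)) in samples.iter().enumerate() {
        if elapsed_ms >= warmup_ms {
            post.push((i, value));
        }
    }
    if post.is_empty() {
        return Vec::new(); // warmup absorbed everything: vacuous pass
    }
    let mean = post.iter().map(|&(_, v)| v).sum::<f64>() / post.len() as f64;
    let lo = mean * (1.0 - tolerance);
    let hi = mean * (1.0 + tolerance);
    let mut out = Vec::new();
    for &(i, v) in &post {
        if v < lo || v > hi {
            out.push(i);
        }
    }
    out
}
```

Because the baseline is the post-warmup mean, one large outlier shifts the band as well as violating it, so a short series with a single spike can flag more than one sample.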
converges_to(target, tolerance, deadline_ms) (f64 only)
Pass when three consecutive samples land inside [target - tolerance, target + tolerance] AT OR BEFORE deadline_ms. The
intent is “the system stabilizes near target by the deadline” —
three consecutive in-band samples are the convergence-witness shape.
load.converges_to(&mut v, /*target=*/ 1.0, /*tol=*/ 0.5, /*deadline_ms=*/ 5_000);
Distinct outcomes:
- Witness found: pass.
- No witness before deadline: DetailKind::Temporal failure naming the sample count evaluated. If errored samples interrupted in-progress runs, the failure message lists them.
- Insufficient samples: fewer than 3 successfully-projected samples in the deadline window. Records a Note (NOT a verdict failure); absence of data is a coverage gap, not a negative finding. The note distinguishes “did not collect enough samples” from “collected enough samples but never converged”.
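The three outcomes map naturally onto an enum. The sketch below is an illustration with hypothetical names (Convergence, converges_to as a free function); it assumes samples are sorted by elapsed_ms and that every sample projected successfully.

```rust
#[derive(Debug, PartialEq)]
pub enum Convergence {
    Witness,
    NoWitness,
    InsufficientSamples,
}

/// Look for three consecutive in-band samples at or before the deadline.
pub fn converges_to(
    samples: &[(u64, f64)],
    target: f64,
    tolerance: f64,
    deadline_ms: u64,
) -> Convergence {
    let mut in_window = 0usize;
    let mut run = 0usize;
    for &(elapsed_ms, value) in samples {
        if elapsed_ms > deadline_ms {
            break; // samples are assumed time-ordered
        }
        in_window += 1;
        if (value - target).abs() <= tolerance {
            run += 1;
            if run == 3 {
                return Convergence::Witness;
            }
        } else {
            run = 0; // an out-of-band sample restarts the witness run
        }
    }
    if in_window < 3 {
        Convergence::InsufficientSamples
    } else {
        Convergence::NoWitness
    }
}
```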
always_true (bool only)
Pass when every sample’s value is true. Per-sample projection
errors FAIL the assertion (this is a strict pattern — a missing
boolean is a coverage gap that must surface).
let alive: SeriesField<bool> = series.bpf("scheduler_alive",
|snap| snap.var("scheduler_alive").as_bool());
alive.always_true(&mut v);
ratio_within(other, lo, hi) (f64 only)
Pass when every per-index (self_value / other_value) lies in
[lo, hi] — the two series are walked in lock-step at indices
0..N, comparing self[i] / other[i]. Cross-field correlation
across two same-length series.
util.ratio_within(&mut v, &runtime, 0.4, 0.6);
A length mismatch fires a single caller-error detail and aborts
the comparison. A sample where rhs == 0 records a “cannot
compute ratio” detail naming the sample; out-of-band ratios
record a structured detail with the lhs/rhs values. Per-sample
projection errors on either side are SKIPPED with a Note
listing each gap and which side errored.
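The lock-step walk can be sketched over two plain slices. The function name and return shape are hypothetical; the real pattern records structured details on a Verdict and also handles per-sample projection errors, which this sketch leaves out.

```rust
/// Returns the indices whose lhs / rhs ratio is not inside [lo, hi],
/// or Err when the two series differ in length.
pub fn ratio_violations(lhs: &[f64], rhs: &[f64], lo: f64, hi: f64) -> Result<Vec<usize>, String> {
    if lhs.len() != rhs.len() {
        return Err(format!("length mismatch: {} vs {}", lhs.len(), rhs.len()));
    }
    let mut out = Vec::new();
    for (i, (&l, &r)) in lhs.iter().zip(rhs).enumerate() {
        if r == 0.0 {
            out.push(i); // cannot compute a ratio against zero
            continue;
        }
        let ratio = l / r;
        // `contains` is false for NaN, so a NaN ratio is flagged too.
        if !(lo..=hi).contains(&ratio) {
            out.push(i);
        }
    }
    Ok(out)
}
```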
Per-sample scalar checks: each
The temporal patterns are aggregate. For per-sample scalar bounds
(>=, <=, lo..=hi), bypass the patterns via SeriesField::each:
nr_dispatched.each(&mut v).at_least(1u64);
util.each(&mut v).between(0.0_f64, 100.0_f64);
ticks.each(&mut v).at_most(10_000.0_f64);
each runs the comparator on every successfully-projected sample
independently. The first failure records a detail; subsequent
failures pile on so the timeline shows every offending sample, not
just the first.
Per-sample projection errors record a detail and flip the verdict
— each is strict (matches always_true’s policy). NaN samples
report an incomparable failure naming the sample distinctly:
without this branch, IEEE-754 < against NaN is always false, so
a NaN sample would silently pass value < floor / value > ceiling
checks.
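The NaN hazard is a property of IEEE-754 comparisons and can be reproduced in a few lines. Both helpers are hypothetical and only contrast a naive floor check with one that treats an incomparable value as an error:

```rust
use std::cmp::Ordering;

/// Naive floor check: fails only when `value < floor` is true.
/// Every comparison against NaN is false, so NaN passes.
pub fn naive_at_least(value: f64, floor: f64) -> bool {
    let below = value < floor;
    !below
}

/// Strict floor check: reports NaN as incomparable instead of passing it.
pub fn strict_at_least(value: f64, floor: f64) -> Result<bool, String> {
    match value.partial_cmp(&floor) {
        None => Err(format!("incomparable: {value} vs {floor}")),
        Some(ordering) => Ok(ordering != Ordering::Less),
    }
}
```

naive_at_least(f64::NAN, 1.0) returns true, whereas strict_at_least returns an error for the same input.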
Failure rendering
Every temporal failure carries the field’s label, the pattern
name, and the offending sample’s tag + elapsed_ms. A
nondecreasing regression at sample periodic_004 (+850 ms) reads:
nr_dispatched (nondecreasing): regression at sample periodic_004 (+850ms): \
value 100 after prior value 200 at sample periodic_003 (+700ms)
Coverage Notes render WITH the per-sample error variant so the
operator can tell PlaceholderSample (rendezvous timeout),
MissingStats (stats request failed), FieldNotFound (typo /
wrong map), and TypeMismatch apart without re-running under a
debugger:
nr_dispatched (nondecreasing): skipped 1 sample(s) with projection errors: \
periodic_002(+500ms): snapshot has no global variable 'nrdispatch' \
in any *.bss/*.data/*.rodata map (available globals: ["nr_dispatched", \
"stall"])
Worked example
The temporal-assertion pipeline draining the bridge runs on the
host, not inside the guest. #[ktstr_test(post_vm = …)] registers
a host-side callback that receives the VmResult after vm.run()
returns; the callback drains the bridge and walks the resulting
series:
use ktstr::prelude::*;
fn assert_temporal_patterns(result: &VmResult) -> Result<()> {
let series = SampleSeries::from_drained(
result.snapshot_bridge.drain_ordered_with_stats(),
)
.periodic_only();
let mut v = Verdict::new();
// BPF axis: counter must never go backwards between periodic boundaries.
let nr_dispatched: SeriesField<u64> = series.bpf(
"nr_dispatched",
|snap| snap.var("nr_dispatched").as_u64(),
);
nr_dispatched.nondecreasing(&mut v);
// Stats axis: stay under a generous ceiling.
let stats_dispatched: SeriesField<u64> = series.stats(
"nr_dispatched",
|sv| sv.path("nr_dispatched").as_u64(),
);
stats_dispatched.each(&mut v).at_most(1_000_000_000u64);
let r = v.into_result();
anyhow::ensure!(r.passed, "temporal assertions failed: {:?}", r.details);
Ok(())
}
#[ktstr_test(num_snapshots = 3, duration_s = 10, post_vm = assert_temporal_patterns)]
fn dispatch_counter_advances(ctx: &Ctx) -> Result<AssertResult> {
execute_defs(ctx, vec![
CgroupDef::named("workers").workers(2).work_type(WorkType::SpinWait),
])
}
For the periodic-capture wiring, num_snapshots semantics, and the
bridge-drain contract, see Periodic Capture.
For the underlying Snapshot / SnapshotMap / SnapshotEntry
accessors the projection closures call into, see
Snapshots.
Recipes
Standalone examples for common tasks. Each recipe is self-contained.
- Test a new scheduler – end-to-end from binary to integration tests
- Investigate a crash – auto-repro, reading BPF probe output
- A/B compare branches – worktree setup, run and compare
- Capture and compare host state – cargo ktstr show-host snapshot diff for kernel / sched_* tunable / NUMA layout drift
- Diagnose a slow scheduler with ctprof – per-thread profile diff via ktstr ctprof capture / compare, with the taskstats off-CPU lens
- Customize checking – scheduler thresholds, per-test overrides
- Benchmarking and negative tests – performance gates, intentional degradation, Assert checks
Test a New Scheduler
End-to-end workflow: define a scheduler, write tests, run them.
1. Define the scheduler
Use declare_scheduler! to register a scheduler in the
KTSTR_SCHEDULERS distributed slice. The verifier sweep picks
it up automatically.
use ktstr::declare_scheduler;
use ktstr::prelude::*;
declare_scheduler!(MY_SCHED, {
name = "my_sched",
binary = "scx_my_sched",
topology = (1, 2, 4, 1),
kernels = ["6.14", "6.15..=7.0"],
sched_args = ["--exit-dump-len", "1048576"],
});
The macro generates pub static MY_SCHED: Scheduler plus a
private linkme registration so cargo ktstr verifier
discovers the scheduler automatically. Tests reference the
bare MY_SCHED ident via
#[ktstr_test(scheduler = MY_SCHED)].
See Scheduler Definitions for every supported field.
2. Write integration tests
Tests inherit the scheduler’s topology. Override with explicit
llcs, cores, or threads when needed.
use ktstr::prelude::*;
#[ktstr_test(scheduler = MY_SCHED)]
fn basic_steady(ctx: &Ctx) -> Result<AssertResult> {
// Inherits 1n2l4c1t from MY_SCHED
scenarios::steady(ctx)
}
#[ktstr_test(scheduler = MY_SCHED, threads = 2)]
fn smt_steady(ctx: &Ctx) -> Result<AssertResult> {
// Inherits llcs=2, cores=4; overrides threads to exercise SMT
scenarios::steady(ctx)
}
3. Build a kernel
Build a kernel with sched_ext support:
cargo ktstr kernel build
See Getting Started: Build a kernel for version selection and local source builds.
4. Run
cargo ktstr test --kernel ../linux
5. Check BPF complexity (optional)
Collect per-program verifier statistics across the declared kernels and accepted topology presets:
# Use the kernel auto-discovered via KTSTR_KERNEL / cache.
cargo ktstr verifier
# Pin to a specific kernel build.
cargo ktstr verifier --kernel ../linux
# Sweep across multiple kernels. Each scheduler's
# `kernels = [...]` declaration acts as a per-scheduler filter on
# the operator-supplied set; an empty (or omitted) `kernels` field
# means the scheduler runs against every kernel in the sweep.
cargo ktstr verifier --kernel 6.14 --kernel 7.0
See BPF Verifier for output format, cycle collapse, and the cell-name → kernel matching contract.
6. Manage the kernel cache
Cached kernel images accumulate under
$XDG_CACHE_HOME/ktstr/kernels/. Keep a handful of recent
builds and drop the rest when disk pressure grows:
cargo ktstr kernel list # inspect cache contents
cargo ktstr kernel clean --keep 3 # keep the 3 most recent images
cargo ktstr kernel clean --force # remove everything (non-interactive)
7. Debug failures
Boot an interactive shell with the scheduler binary:
cargo ktstr shell -i ./target/debug/scx_my_sched
Inside the guest, run /include-files/scx_my_sched manually to
inspect behavior. See
cargo-ktstr shell for
all flags.
See The #[ktstr_test] Macro
for all available attributes and
Scheduler Definitions
for the full Scheduler type and the declare_scheduler! macro.
Investigate a Crash
When a scheduler crashes during a test, the failure output and auto-repro pipeline help identify the cause.
First step: enable full diagnostics
Rerun the failing test with RUST_BACKTRACE=1 before digging into
individual sections:
RUST_BACKTRACE=1 cargo ktstr test --kernel ../linux -- -E 'test(my_test)'
Setting RUST_BACKTRACE=1 unconditionally appends the
--- diagnostics --- section (init stage, VM exit code, last lines
of kernel console) to every failure, not only when the scheduler
self-dies. It also enables verbose VM console output (equivalent to
KTSTR_VERBOSE=1).
Reading failure output
A test failure message contains up to eight sections, each present only when relevant:
| Section | Content |
|---|---|
| Error line | Test name, scheduler, failure reason. |
| --- stats --- | Per-cgroup worker count, CPU count, spread, gap, migrations, iterations. |
| --- diagnostics --- | Init stage classification, VM exit code, last 20 lines of kernel console. |
| --- timeline --- | Kernel version, topology, scheduler, scenario duration, phase breakdown with monitor samples. |
| --- scheduler log --- | Scheduler process stdout+stderr (cycle-collapsed). |
| --- monitor --- | Host-side monitor: sample count, max imbalance ratio, max local-DSQ depth, sustained-violation flag, SCX event counters (select_cpu_fallback, keep_last, skip_exiting, skip_migration_disabled), per-sched_domain load-balance rates, per-BPF-program verified_insns, and the merged threshold verdict. |
| --- sched_ext dump --- | sched_ext_dump trace lines from the guest kernel. |
| --- auto-repro --- | BPF probe data from a second VM run, plus repro VM duration, scheduler log, sched_ext dump, and dmesg tails. |
--- diagnostics --- appears automatically when the scheduler died
or crashed, or when RUST_BACKTRACE is set to 1 or full.
Auto-repro
auto_repro defaults to true in #[ktstr_test]. When the scheduler
crashes, ktstr automatically:
- Captures the crash stack trace from the scenario output.
- Boots a second VM with BPF kprobes (kernel functions) and fentry probes (BPF callbacks) on each function in the crash chain, plus a tp_btf/sched_ext_exit tracepoint trigger.
- Reruns the scenario to capture function arguments at each crash point.
Reading auto-repro output
The probe output shows each function in the crash chain with:
- Function signature and argument values during execution of the same workload
- Source file and line number
- Call chain context
After the probe data, the auto-repro section includes the repro VM duration and the last 40 lines of the repro VM’s scheduler log (cycle-collapsed), sched_ext dump, and kernel console (dmesg). These supplement probe data when the crash produces sparse or no probe events. When probe data is absent, a crash reproduction status line replaces it.
See Auto-Repro for details on how the two-VM repro cycle works.
A/B Compare Branches
Compare scheduler behavior between two branches by running the
same #[ktstr_test] suite against each, then using
cargo ktstr stats compare to diff per-metric results with
dual-gate (absolute and relative) significance and exit non-zero
on any regression.
Setup worktrees
The examples below use the scx scheduler crate under
~/opensource/scx; substitute your own scheduler crate’s path and
remote everywhere scx appears.
cd ~/opensource/scx
# Create a worktree for the baseline branch.
git worktree add ~/opensource/scx-main upstream/main
Collect both runs into a shared run root
Each cargo nextest run --workspace writes its sidecars into
target/ktstr/{kernel}-{project_commit}/. The {project_commit}
half is the project tree’s HEAD short hex captured at first
sidecar write (suffixed -dirty when the worktree differs from
HEAD), so two branches with distinct HEADs land in distinct
directories.
Two back-to-back runs of the SAME kernel at the SAME commit
reuse the same directory — the second run pre-clears any prior
sidecars at first write, so each directory is a last-writer-wins
snapshot of (kernel, project commit).
Warning: The two worktrees MUST be at distinct commits for A/B comparison to work. If both checkouts share the same HEAD (e.g. the baseline branch and the feature branch happen to point at the same commit), the second run overwrites the first via the last-writer-wins pre-clear and the comparison degenerates to “identical pool of sidecars.” Confirm distinct commits with git -C ~/opensource/scx rev-parse HEAD and git -C ~/opensource/scx-main rev-parse HEAD before invoking the second cargo nextest run.
Every sidecar also carries its own project_commit field (read
from the project tree’s git HEAD at sidecar-write time), so
the runs from two branches land disjoint values on the commit
dimension regardless of how the directories are named. The
project commit is discovered by walking up from the test
process’s current working directory to find a .git marker
— so the cd ~/opensource/scx-main / cd ~/opensource/scx
steps below are load-bearing, not stylistic. Without them the
probe would walk up from wherever you happened to invoke
cargo, potentially ending at an entirely different repo and
recording the wrong commit on every sidecar. The simplest
collection workflow is to merge both branches’ run
subdirectories under one root and rely on
--a-project-commit / --b-project-commit to partition them:
mkdir -p ~/opensource/scx-runs/ktstr
# Baseline.
cd ~/opensource/scx-main
cargo ktstr test --kernel ../linux
mv target/ktstr/* ~/opensource/scx-runs/ktstr/
# Experimental.
cd ~/opensource/scx
cargo ktstr test --kernel ../linux
mv target/ktstr/* ~/opensource/scx-runs/ktstr/
The {kernel}-{project_commit} subdirectory names are unique per
(kernel, project commit) pair, so two branches with distinct
HEADs coexist under one root without collision. Within a single
branch, two clean back-to-back runs at the same commit reuse
one directory (last-writer-wins via per-process pre-clear);
mark one of them as -dirty (uncommitted change) or commit /
amend between runs to land separate directories.
Do not set KTSTR_SIDECAR_DIR: cargo ktstr stats list
and cargo ktstr stats compare walk
{CARGO_TARGET_DIR or "target"}/ktstr/ by default and would
not see runs written to a custom flat directory unless
--dir DIR is passed.
Discover available dimension values
The framework records the project tree’s git commit (discovered
by walking parents of the test process’s cwd to find the
enclosing .git) on every sidecar via
SidecarResult::project_commit, so two runs from different
commits land disjoint values on the commit dimension and
--a-project-commit / --b-project-commit slice between them
without any per-run directory bookkeeping.
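The parent walk is easy to picture with the standard library alone. The sketch below is an assumption about the shape of the lookup, not ktstr's actual code; `find_git_root` is a hypothetical name.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: walks `start` and each of its parents and returns
/// the first directory that contains a `.git` entry (a directory in a
/// normal checkout, a file in a linked worktree).
fn find_git_root(start: &Path) -> Option<PathBuf> {
    start
        .ancestors()
        .find(|dir| dir.join(".git").exists())
        .map(Path::to_path_buf)
}
```

Because the walk starts at the process's working directory, invoking the tests from a directory outside the intended worktree can resolve to a different repository, or to none at all.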
Use
cargo ktstr stats list-values --dir DIR to enumerate the
distinct values of every filterable dimension (kernel,
commit, kernel_commit, source, scheduler, topology,
work_type) present in the pool, so per-side filters target
real values. The commit and source keys map to the
internal SidecarResult::project_commit / run_source fields;
the per-side filter flags spell as --a-project-commit /
--b-project-commit and --a-run-source / --b-run-source
on the compare
subcommand.
cd ~/opensource/scx
CARGO_TARGET_DIR=~/opensource/scx-runs cargo ktstr stats list
CARGO_TARGET_DIR=~/opensource/scx-runs cargo ktstr stats list-values
Compare per-side filter groups
cd ~/opensource/scx
CARGO_TARGET_DIR=~/opensource/scx-runs cargo ktstr stats compare \
    --a-project-commit <baseline-short-hex> \
    --b-project-commit <current-short-hex>
stats compare is pool-driven: every sidecar under the runs
root is loaded into a single pool, and per-side filter flags
(--a-X / --b-X) partition the pool into the A and B
contrasts. The dimensions on which the A and B filters DIFFER
are the slicing dimensions of the contrast; every other
dimension is part of the dynamic pairing key the comparison
joins on. Slicing on project-commit alone joins each
baseline scenario with its matching experimental counterpart
on every other dimension (kernel, kernel-commit, run-source,
scheduler, topology, work_type).
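The split between slicing dimensions and the pairing key can be modeled with two small functions. This is a sketch under assumptions: `Dims`, `slicing_dims`, and `pairing_key` are illustrative names, and the real join may treat unset filters differently.

```rust
use std::collections::BTreeMap;

/// Dimension name -> value, e.g. "scheduler" -> "scx_rusty" (illustrative type).
type Dims = BTreeMap<&'static str, &'static str>;

/// Dimensions on which the A-side and B-side filters differ.
fn slicing_dims(a: &Dims, b: &Dims) -> Vec<&'static str> {
    let mut dims: Vec<&'static str> = a.keys().chain(b.keys()).copied().collect();
    dims.sort_unstable();
    dims.dedup();
    dims.retain(|d| a.get(d) != b.get(d));
    dims
}

/// Pairing key for one sidecar row: its values on every non-slicing dimension.
fn pairing_key(row: &Dims, slicing: &[&'static str]) -> Vec<(&'static str, &'static str)> {
    row.iter()
        .filter(|(dim, _)| !slicing.contains(*dim))
        .map(|(dim, value)| (*dim, *value))
        .collect()
}
```

A baseline row and an experimental row are joined when their pairing keys are equal; a filter that is identical on both sides never becomes a slicing dimension.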
Other slicing axes work the same way:
# Slice on kernel.
cargo ktstr stats compare --a-kernel 6.14 --b-kernel 7.0
# Slice on scheduler, pin both sides to one kernel.
cargo ktstr stats compare \
    --a-scheduler scx_rusty --b-scheduler scx_lavd \
    --kernel 6.14
Shared --X flags pin BOTH sides to the same value; per-side
--a-X / --b-X REPLACE the corresponding shared --X for
that side only (“more-specific replaces”). Slicing on more
than one dimension at once prints a stderr warning but is
supported for cohort sweeps.
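The "more-specific replaces" rule reduces to an `Option` fallback. A minimal sketch, assuming each flag is parsed into an `Option`; `effective_filter` is a hypothetical name.

```rust
/// Hypothetical helper: the effective filter for one side of the contrast.
/// A per-side flag (`--a-X` or `--b-X`), when given, replaces the shared
/// `--X` for that side only; otherwise the shared value applies.
fn effective_filter<'a>(shared: Option<&'a str>, per_side: Option<&'a str>) -> Option<&'a str> {
    per_side.or(shared)
}
```

With `--kernel 6.14 --b-kernel 7.0`, side A resolves to `6.14` and side B to `7.0`, which makes kernel a slicing dimension.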
compare applies the dual-gate significance check from the
unified MetricDef registry to every metric and prints colored
output (red = regression, green = improvement). Rows where
either side has passed=false are dropped from the math and
counted in the summary line; the exit code is non-zero when
any regression is detected, so the command can gate CI
directly. Narrow further with -E SUBSTRING (matches the
joined scenario topology scheduler work_type string),
override the relative gate uniformly with --threshold PCT
or per-metric via --policy FILE. The absolute gate from each
MetricDef is unaffected by --threshold — a delta must
clear both gates to count as significant.
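The dual gate can be sketched as a single predicate. This is an illustration only: the function name, the percentage-of-baseline definition of the relative gate, and the zero-baseline handling are assumptions, not the MetricDef implementation.

```rust
/// Hypothetical dual-gate check: a delta counts as significant only when it
/// clears BOTH the absolute gate (in the metric's own unit) and the relative
/// gate (percent of the baseline value).
fn is_significant(baseline: f64, candidate: f64, abs_gate: f64, rel_gate_pct: f64) -> bool {
    let delta = (candidate - baseline).abs();
    if baseline == 0.0 {
        // Relative change is undefined here; this sketch falls back to the
        // absolute gate alone (an assumption).
        return delta >= abs_gate;
    }
    let rel_pct = delta / baseline.abs() * 100.0;
    delta >= abs_gate && rel_pct >= rel_gate_pct
}
```

Raising the relative gate (the role `--threshold PCT` plays) leaves `abs_gate` untouched, so a tiny absolute change on a small baseline stays insignificant even when its percentage is large.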
See stats compare
for the full per-side flag table and validation rules, and
stats list-values
for the discovery counterpart.
Cleanup
git worktree remove ~/opensource/scx-main
rm -rf ~/opensource/scx-runs
Capture and Compare Host State
Disambiguation: this recipe covers host context (kernel build, CPU model, sched_* tunables, NUMA layout) via
cargo ktstr show-host. For per-thread profiling (scheduling counters, memory / I/O accounting, taskstats delay accounting per thread), see the ctprof reference and the Diagnose a Slow Scheduler with ctprof recipe.
When a gauntlet run passes on one machine and fails on another —
or passes on Monday and fails on Wednesday — the first thing to
check is whether the host itself changed. cargo ktstr show-host
captures a snapshot of the kernel, CPU, memory, scheduler
tunables, and kernel cmdline; cargo ktstr stats compare
surfaces the changes between two sidecars in a host-delta
section of its output so you can see what moved.
Two show-host commands: live vs archived
Two distinct subcommands print host context, and they are NOT interchangeable — pick the one whose target matches your question:
- cargo ktstr show-host captures the live host context by reading /proc, /sys, and uname() at invocation time. Use this when you want to inspect the current machine, e.g. before running a benchmark, after a sysctl change, or to confirm what cargo ktstr stats compare would record on the next run produced here. No prior runs needed.
- cargo ktstr stats show-host --run RUN_ID prints the archived host context captured at sidecar-write time for the named run. Use this when investigating a regression in a past run: what looked like a code change might trace back to a host change at the time the sidecar was produced. Resolves --run against target/ktstr/ (or --dir) and renders the first sidecar in the run that carries a populated host field via the same HostContext::format_human formatter the live show-host uses, so the two outputs are byte-for-byte comparable when the host is unchanged.
The sections below cover the live show-host. For the archived
variant’s flag table see
stats show-host.
Capture: show-host
cargo ktstr show-host
Prints a key: value report covering:
- CPU model + vendor (first /proc/cpuinfo entry).
- Total memory, hugepages total / free, hugepage size (from /proc/meminfo).
- Transparent hugepage policy (thp_enabled, thp_defrag) with the bracketed selection preserved verbatim.
- Every /proc/sys/kernel/sched_* tunable, one entry per line.
- NUMA node count (from CPU→node mapping; memory-only nodes without CPUs are not counted).
- kernel_name / kernel_release / arch (from the uname() syscall).
- /proc/cmdline verbatim.
Absent fields render as (unknown) — an empty sched_* map
renders as (empty) and a missing map renders as (unknown).
The distinction matters when you want to know whether a
dimension was inspected but absent, vs failed to populate.
Sidecars written before the uname_sysname / uname_release /
uname_machine → kernel_name / kernel_release / arch
rename render the renamed fields as (unknown) in show-host
and in stats compare’s host-delta section, and re-running the
test against the current binary regenerates the sidecar with
the new field names populated. Mechanically: the old sidecar
still deserializes cleanly (deserialization is forward-compatible
in the “does-not-error” sense), but the renamed fields land as
None on the new struct because the old-name data does not
migrate to the new field names.
This output is human-oriented. For programmatic access, read
the host field of any sidecar JSON (same schema, identical
values — show-host prints the live snapshot the sidecar
writer would attach to a fresh test run).
Compare: stats compare
cargo ktstr stats compare --a-project-commit <baseline> --b-project-commit <current>
Per-side filter flags (--a-X / --b-X) partition the
sidecar pool into the two sides of the contrast — slice on
project-commit, kernel, scheduler, run-source, etc.
depending on what you are diffing. compare picks the first sidecar with
Some(host) from each side, collects every host field that
differs, and prints a side-by-side delta unconditionally as
part of the compare output (there is no opt-in flag — the
host-delta section appears whenever the two sides disagree on
a host field):
host delta ('A' → 'B'):
kernel_release: 6.14.2 → 6.15.0
thp_enabled: always [madvise] never → always madvise [never]
sched_tunables.sched_migration_cost_ns: 500000 → 100000
Fields that match in both runs are suppressed by design — this
is a diff, not a snapshot. Missing-on-one-side rendering differs
by layer: top-level Option<T> host fields (e.g. kernel_release,
thp_enabled, the whole sched_tunables map) render with
(unknown) on the None side so a regression in the capture
pipeline surfaces instead of silently hiding. Per-key diffs
inside the sched_tunables map use (absent) instead, to
distinguish “the map was captured and this key is not in it”
from “the whole map was unknown at capture time”.
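The two rendering layers can be sketched as follows. The function names and string-typed fields are assumptions made for illustration; only the `(unknown)` / `(absent)` placeholders and the suppression of matching fields come from the description above.

```rust
use std::collections::BTreeMap;

/// Hypothetical renderer for one top-level Option field: matching values are
/// suppressed, and a `None` side shows as `(unknown)`.
fn diff_field(name: &str, a: Option<&str>, b: Option<&str>) -> Option<String> {
    if a == b {
        return None;
    }
    Some(format!("{name}: {} → {}", a.unwrap_or("(unknown)"), b.unwrap_or("(unknown)")))
}

/// Hypothetical renderer for per-key diffs between two captured maps:
/// a key missing on one side shows as `(absent)`.
fn diff_map(prefix: &str, a: &BTreeMap<String, String>, b: &BTreeMap<String, String>) -> Vec<String> {
    let mut keys: Vec<&String> = a.keys().chain(b.keys()).collect();
    keys.sort();
    keys.dedup();
    keys.into_iter()
        .filter(|k| a.get(*k) != b.get(*k))
        .map(|k| {
            let show = |m: &BTreeMap<String, String>| {
                m.get(k).cloned().unwrap_or_else(|| "(absent)".to_string())
            };
            format!("{prefix}.{k}: {} → {}", show(a), show(b))
        })
        .collect()
}
```

Keeping the two placeholders distinct lets a reader tell a capture failure (the whole field is `None`) from a key that the kernel simply did not expose.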
CI integration
Gauntlet runs emit the host block automatically in every
sidecar. To diff the host state across two CI runs, slice the
pool on whatever dimension separates them (typically
--a-project-commit / --b-project-commit or --a-kernel /
--b-kernel) — the host-delta section appears automatically
in the compare output when any host field differs between the
two sides. A CI job can:
- Run the gauntlet on the candidate commit and the baseline.
- Invoke stats compare slicing on the dimension that separates the two runs (e.g. --a-project-commit <baseline> --b-project-commit <current>) and inspect the host-delta section of its output.
- Fail (or annotate the PR) if any host dimension changed; an unchanged host set is the precondition for a clean A/B of scheduler behavior.
Typical hits
Each bullet names the show-host field that carries the signal so
you can cargo ktstr show-host | grep <field> directly, or pluck
the same key out of a sidecar via jq '.host.<field>'.
- thp_enabled (and its companion thp_defrag) changed between runs → explains latency-sensitive regressions that vanish when you pin THP via transparent_hugepage= on the kernel cmdline. The bracketed selection inside the value is the active setting; compare the bracket position, not just the full string.
- sched_tunables.sched_migration_cost_ns differs (look for it inside the sched_* block printed by show-host) → fair scheduler migrated the run onto different CPUs, which changes the idle-steal pressure on scx_* schedulers that depend on it. Other sched_tunables.* keys (sched_wakeup_granularity_ns, sched_min_granularity_ns, sched_latency_ns, sched_rt_runtime_us, etc.) have the same shape; the full set is whatever /proc/sys/kernel/sched_* lists at capture time. Note: the examples above are CFS-era tunables; several of them (sched_wakeup_granularity_ns, sched_min_granularity_ns, sched_latency_ns) were dropped when CFS was replaced by EEVDF in Linux 6.6+, so a run on an EEVDF kernel will simply not have those keys in the map. Their absence is a kernel-version fact, not a capture failure. EEVDF’s own latency-floor knob is exposed as sched_tunables.sched_base_slice_ns on 6.6+ kernels (the replacement for the dropped CFS latency / granularity triple); check for its presence to confirm an EEVDF-era capture.
- kernel_cmdline diverges → isolcpus= / nohz_full= / mitigations= / transparent_hugepage= / numa_balancing= are all boot-time and change the whole scheduling surface. Rebooting the host to match is the correct remediation when you need the comparison to hold. The field is named kernel_cmdline (not cmdline) in both show-host’s printed output and the sidecar JSON to disambiguate from SidecarResult.kargs, which carries the extra kargs the ktstr VMM appended when booting the guest rather than the running host’s boot line.
- kernel_release differs (also check the companion kernel_name and arch fields) → the kernel itself changed; every other host dimension is suspect under cross-kernel comparison. A kernel_name change (uname -s reporting a different OS family, say Linux vs FreeBSD) is a harder stop than a same-family version bump and usually means the two sidecars were produced on entirely different systems.
- hugepages_total / hugepages_free / hugepages_size_kb deltas → benchmark throughput that depends on 2 MiB pages (performance_mode tests) flips outcome when the pool shrinks or the page size changes. All three are reported by show-host in the meminfo-derived block.
- numa_nodes differs → cpusets and cross-node migration signals only make sense within the CPU→node mapping captured at sidecar-write time; a host reconfigured to expose or hide nodes changes what cpus_used and numa_pages mean across the two runs. See the capture caveat: numa_nodes counts only nodes that host at least one CPU (memory-only nodes are not counted), so a delta here can reflect either a hardware / firmware change or a topology reconfiguration that left the memory-only nodes untouched.
- CPU-level skew (cpu_model / cpu_vendor) → microarchitectural differences affect cache-sensitive benchmarks. Always inspect alongside kernel_cmdline because a different CPU usually comes with a different bootloader.
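For the THP fields, comparing the bracket position means extracting the bracketed token rather than comparing whole strings. A minimal sketch, assuming the value keeps the kernel's `always [madvise] never` layout; `active_thp_setting` is a hypothetical helper.

```rust
/// Hypothetical helper: returns the active setting from a THP policy string
/// such as "always [madvise] never", i.e. the token inside the brackets.
/// Returns `None` when the string carries no bracketed selection.
fn active_thp_setting(value: &str) -> Option<&str> {
    let start = value.find('[')? + 1;
    let end = start + value[start..].find(']')?;
    Some(&value[start..end])
}
```

Applied to both sides of a host delta, two values whose active settings are equal describe the same policy even if the surrounding option list differs between kernel versions.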
Seeing the raw sidecar field
show-host reads the live host; the sidecar carries whatever
show-host would have captured at sidecar-write time. To see
the sidecar’s host block directly:
jq '.host' path/to/sidecar.ktstr.json
The field is emitted on every gauntlet run.
Diagnose a Slow Scheduler with ctprof
When a scheduler change makes the workload slower but the test
suite still passes, the regression is usually buried in
per-thread off-CPU time. ktstr ctprof capture snapshots every
live thread’s scheduling, memory, I/O, and taskstats delay
counters; ktstr ctprof compare diffs two snapshots and surfaces
the buckets where time went. This recipe walks through a typical
A/B comparison.
See the ctprof reference for the full metric registry, aggregation rules, derived-metric formulas, and taskstats kconfig gating.
Capture before and after
# Baseline: scheduler A loaded, workload running.
ktstr ctprof capture --output baseline.ctprof.zst
# Switch schedulers, restart workload, wait for steady state.
# ...
# Candidate: scheduler B, same workload.
ktstr ctprof capture --output candidate.ctprof.zst
capture walks /proc once and writes the snapshot. It is
read-only — no kprobes, no tracing — so the act of capturing
does not perturb the measurement. The default capture covers
every live tgid; on a busy host this is hundreds of threads.
The snapshot is zstd-compressed JSON, typically a few MB.
Compare with the taskstats lens
The taskstats-delay section bundles the eight kernel
delay-accounting buckets (CPU, blkio, swapin, freepages,
thrashing, compact, wpcopy, irq) plus their nine derived
metrics (avg_*_delay_ns per bucket, total_offcpu_delay_ns
rollup). Running with --sections taskstats-delay filters the
output down to just the off-CPU view:
ktstr ctprof compare baseline.ctprof.zst candidate.ctprof.zst \
--sections taskstats-delay \
--sort-by total_offcpu_delay_ns:desc
The --sort-by total_offcpu_delay_ns:desc puts the processes
with the largest absolute off-CPU growth at the top. Each row
gives baseline | candidate | delta | %; large positive deltas
on a process that should not have moved are the suspects.
The total_offcpu_delay_ns derivation is:
cpu + blkio + freepages + compact + wpcopy + irq + max(swapin, thrashing)
The formula takes max(swapin, thrashing) rather than swapin + thrashing
because every thrashing event is also a swapin event from the
syscall perspective; summing both would double-count.
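The derivation translates directly into code. The struct and field names below are illustrative stand-ins for the eight buckets, not the snapshot's actual schema.

```rust
/// Per-bucket delay totals in nanoseconds (illustrative field names).
#[derive(Default, Clone, Copy)]
struct DelayTotals {
    cpu: u64,
    blkio: u64,
    swapin: u64,
    freepages: u64,
    thrashing: u64,
    compact: u64,
    wpcopy: u64,
    irq: u64,
}

/// Off-CPU rollup: every bucket is summed except swapin and thrashing,
/// which overlap and therefore contribute only their maximum.
fn total_offcpu_delay_ns(d: &DelayTotals) -> u64 {
    d.cpu + d.blkio + d.freepages + d.compact + d.wpcopy + d.irq + d.swapin.max(d.thrashing)
}
```

With swapin = 30 and thrashing = 10, only the 30 is counted; a naive sum of all eight buckets would overstate the total by 10.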
Drill into the per-bucket averages
If total_offcpu_delay_ns jumped on a process, the per-bucket
avg_*_delay_ns derivations identify which off-CPU phase
grew. In the same compare output (the --sections taskstats-delay
filter keeps both the raw counters AND the 9 derivations
together), look at the suspect process’s row in:
| Bucket | Average derivation | Meaning |
|---|---|---|
| CPU runqueue wait | avg_cpu_delay_ns | Time waiting for the scheduler to pick the task. RACY (count + total update lockless). |
| Block I/O wait | avg_blkio_delay_ns | Synchronous block-device wait. Distinct from schedstat iowait_sum; the canonical delay-accounting reading. |
| Swap-in / Thrashing | avg_swapin_delay_ns / avg_thrashing_delay_ns | Memory pressure. The two overlap — a thrashing event is also a swapin event. |
| Direct memory reclaim | avg_freepages_delay_ns | Allocator hit __alloc_pages slowpath. |
| Memory compaction | avg_compact_delay_ns | Allocator demanded a high-order page; compaction stalled. |
| CoW page-fault | avg_wpcopy_delay_ns | Write-protect-copy fault, e.g. fork-then-write. |
| IRQ handling | avg_irq_delay_ns | Time charged to the task by the IRQ accounting subsystem. |
A growing avg_cpu_delay_ns with flat blkio/swap/freepages
suggests the new scheduler is making poor placement choices —
the task is queueing more often or for longer, but no other
subsystem is to blame. A growing avg_blkio_delay_ns with flat
avg_cpu_delay_ns points away from the scheduler entirely
(disk, network filesystem, or a userspace lock pattern).
Cross-reference the primary table
Once a bucket is identified, look at the underlying counters without the section filter:
ktstr ctprof compare baseline.ctprof.zst candidate.ctprof.zst \
--metrics nr_wakeups,nr_migrations,wait_sum,wait_count,run_time_ns,timeslices
--metrics restricts the rendered rows to the named primary
metrics. Useful pairings when the suspect bucket is CPU runqueue
wait:
- wait_sum / wait_count: schedstat’s average wait per scheduling event (the avg_wait_ns derivation, exposed outside the taskstats-delay section). If this confirms avg_cpu_delay_ns, both delay-accounting paths agree.
- nr_migrations: the new scheduler may be moving the task more aggressively. Cross-CPU migrations cost wall-clock time even when run_time_ns is identical.
- nr_wakeups_affine / nr_wakeups_affine_attempts: the affine_success_ratio derivation; a CFS-only signal that reflects how often wake_affine() succeeded. A large drop with growing avg_cpu_delay_ns is a strong signal for cache-unfriendly placement.
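Both derivations named above are plain counter ratios. The sketch below assumes a zero denominator should yield no value rather than a division error; the function names mirror the derivation names but are not ctprof's API.

```rust
/// Ratio of two counters; `None` when the denominator is zero.
fn ratio(numerator: u64, denominator: u64) -> Option<f64> {
    (denominator != 0).then(|| numerator as f64 / denominator as f64)
}

/// avg_wait_ns = wait_sum / wait_count.
fn avg_wait_ns(wait_sum: u64, wait_count: u64) -> Option<f64> {
    ratio(wait_sum, wait_count)
}

/// affine_success_ratio = nr_wakeups_affine / nr_wakeups_affine_attempts.
fn affine_success_ratio(nr_wakeups_affine: u64, nr_wakeups_affine_attempts: u64) -> Option<f64> {
    ratio(nr_wakeups_affine, nr_wakeups_affine_attempts)
}
```

Computing each ratio separately for the baseline and the candidate snapshot, then comparing the two values, avoids being misled by a change in event counts alone.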
Confirm taskstats data is actually populated
If every taskstats column reads zero, the snapshot likely hit a
gating problem rather than a real “no delay” reading. Inspect
CtprofSnapshot::taskstats_summary (the structured
per-snapshot tally written into the snapshot itself):
- eperm_count > 0: the capturing process lacked CAP_NET_ADMIN. Re-run as root, or grant cap_net_admin+eip via setcap.
- esrch_count near tids_walked: every tid raced exit before the per-tid query landed. Lengthen the workload’s steady-state window and re-capture.
- ok_count == 0 AND eperm_count == 0: the netlink open failed, almost always meaning the kernel was built without CONFIG_TASKSTATS. Rebuild with the kconfig.
- ok_count > 0 but every delay column reads zero: the kernel was built with CONFIG_TASKSTATS and CONFIG_TASK_DELAY_ACCT but booted without delay accounting enabled at runtime. Add delayacct to the kernel cmdline, or set sysctl kernel.task_delayacct=1 and re-capture.
The structured fields above let an operator distinguish each case without scraping the capture-pipeline tracing log.
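The four cases form a small decision tree. The sketch below uses the counter names from the list, but the struct, the enum, and the 90% cutoff standing in for "near" are assumptions made for illustration.

```rust
#[derive(Debug, PartialEq)]
enum Diagnosis {
    MissingCapability,  // eperm_count > 0
    ThreadsExited,      // esrch_count near tids_walked
    NetlinkUnavailable, // ok_count == 0 and eperm_count == 0
    DelayAcctDisabled,  // ok_count > 0 but every delay column is zero
    Populated,
}

/// Illustrative mirror of the summary counters named above.
struct TaskstatsSummary {
    tids_walked: u64,
    ok_count: u64,
    eperm_count: u64,
    esrch_count: u64,
}

fn diagnose(s: &TaskstatsSummary, all_delay_columns_zero: bool) -> Diagnosis {
    if s.eperm_count > 0 {
        Diagnosis::MissingCapability
    } else if s.tids_walked > 0 && s.esrch_count * 10 >= s.tids_walked * 9 {
        // "near" is taken as at least 90% of walked tids (an assumption).
        Diagnosis::ThreadsExited
    } else if s.ok_count == 0 {
        Diagnosis::NetlinkUnavailable
    } else if all_delay_columns_zero {
        Diagnosis::DelayAcctDisabled
    } else {
        Diagnosis::Populated
    }
}
```

The checks run in the order the list gives them, so a permission failure is reported before any weaker signal is considered.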
Related
- ctprof reference: the full metric registry and gating documentation.
- Capture and Compare Host State: the cargo ktstr show-host recipe for host-context diffs (kernel, sched_* tunables, NUMA layout); use that when the hypothesis is “the host config moved” rather than “a workload’s per-thread behaviour moved.”
- A/B Compare Branches: recipe for diffing scheduler-source-tree changes via ktstr’s gauntlet runs; ctprof complements that by surfacing per-thread-level effects the scenario assertions miss.
Customize Checking
Override default checking thresholds for schedulers that tolerate higher imbalance, different gap thresholds, or relaxed event rates.
Scheduler-level overrides
Declare a scheduler with custom assertion overrides:
use ktstr::declare_scheduler;
use ktstr::prelude::*;
declare_scheduler!(RELAXED, {
    name = "relaxed",
    binary = "scx_relaxed",
    assert = Assert::NO_OVERRIDES
        .max_imbalance_ratio(5.0)   // tolerate 5:1 imbalance
        .max_fallback_rate(500.0)   // higher fallback rate ok
        .fail_on_stall(false),      // don't fail on stall
});
These overrides sit between Assert::default_checks() and per-test
overrides in the merge chain.
Per-test overrides via #[ktstr_test]
#[ktstr_test(
    scheduler = RELAXED,
    not_starved = true,
    max_gap_ms = 5000,
    max_imbalance_ratio = 10.0,
    sustained_samples = 10,
)]
fn high_imbalance_test(ctx: &Ctx) -> Result<AssertResult> {
    // Inherits topology from RELAXED
    Ok(AssertResult::pass())
}
Understanding not_starved
not_starved = true enables starvation, fairness spread, and
scheduling gap checks. Each threshold can be overridden independently.
See Checking: Worker checks
for details and default thresholds.
Merge order
Three-layer merge with last-Some-wins semantics. See
Checking: Merge layers.
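Last-Some-wins merging can be modeled with `Option::or`. The `Thresholds` struct below is a stand-in written for illustration, not ktstr's `Assert`, and the numeric values in the usage note are hypothetical.

```rust
/// Stand-in for a layer of optional thresholds (not ktstr's `Assert`).
#[derive(Debug, Clone, Copy, Default, PartialEq)]
struct Thresholds {
    max_gap_ms: Option<u64>,
    max_imbalance_ratio: Option<f64>,
    fail_on_stall: Option<bool>,
}

impl Thresholds {
    /// Fields set in `over` win; unset fields inherit from `self`.
    fn merge(self, over: Thresholds) -> Thresholds {
        Thresholds {
            max_gap_ms: over.max_gap_ms.or(self.max_gap_ms),
            max_imbalance_ratio: over.max_imbalance_ratio.or(self.max_imbalance_ratio),
            fail_on_stall: over.fail_on_stall.or(self.fail_on_stall),
        }
    }
}
```

Chaining `defaults.merge(scheduler).merge(per_test)` reproduces the three layers: a per-test `max_gap_ms` overrides everything below it, while a field the test leaves unset keeps the scheduler's value, or the default if the scheduler is silent too.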
Using Assert directly in ops scenarios
fn my_scenario(ctx: &Ctx) -> Result<AssertResult> {
    let assertions = Assert::NO_OVERRIDES
        .check_not_starved()
        .max_gap_ms(3000);
    let steps = vec![/* ... */];
    execute_steps_with(ctx, steps, Some(&assertions))
}
execute_steps_with applies the given Assert for worker checks.
execute_steps (without _with) passes None, falling back to
ctx.assert (the merged three-layer config: default_checks ->
scheduler -> per-test).
See Ops and Steps for the full step execution model.
Benchmarking and Negative Tests
Recipes for writing tests that check scheduler performance gates (positive tests) and confirm that degraded schedulers fail those gates (negative tests).
Using Assert for checking
Assert carries all checking thresholds. Every field is Option;
None means “inherit from parent layer.”
- In the merge chain: Assert::default_checks() -> Scheduler.assert -> per-test #[ktstr_test] attributes. Use with execute_steps_with() for ops-based scenarios. See Checking.
- For direct report checking: call Assert::assert_cgroup(reports, cpuset).
let a = Assert::default_checks().max_gap_ms(500);
let result = a.assert_cgroup(&reports, None);
Positive benchmarking test
Check that a scheduler passes performance gates under
performance_mode. Use #[ktstr_test] with Assert thresholds:
use ktstr::declare_scheduler;
use ktstr::prelude::*;
declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    topology = (1, 1, 2, 1),
});
#[ktstr_test(
    scheduler = MY_SCHED,
    performance_mode = true,
    duration_s = 3,
    sustained_samples = 15,
)]
fn perf_positive(ctx: &Ctx) -> Result<AssertResult> {
    let checks = Assert::default_checks()
        .min_iteration_rate(5000.0)
        .max_gap_ms(500);
    let steps = vec![Step::with_defs(
        vec![CgroupDef::named("cg_0").workers(2)],
        HoldSpec::FULL,
    )];
    execute_steps_with(ctx, steps, Some(&checks))
}
Key points:
- performance_mode = true pins vCPUs and uses hugepages for deterministic measurements.
- Assert::default_checks() starts from the standard baseline.
- Chain .min_iteration_rate(), .max_gap_ms(), or .max_p99_wake_latency_ns() to set gates.
- execute_steps_with() applies the Assert during worker checks.
Negative test pattern
Check that intentionally degraded scheduling fails the same gates. This confirms that the gates actually catch regressions rather than passing vacuously.
Use expect_err = true on #[ktstr_test] to assert that the test
fails. The macro wraps the test with assert!(result.is_err()) and
disables auto-repro automatically.
use ktstr::declare_scheduler;
use ktstr::prelude::*;
declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    topology = (1, 1, 2, 1),
});
#[ktstr_test(
    scheduler = MY_SCHED,
    performance_mode = true,
    duration_s = 5,
    extra_sched_args = ["--fail-verify"],
    expect_err = true,
)]
fn perf_negative(ctx: &Ctx) -> Result<AssertResult> {
    let checks = Assert::default_checks().max_gap_ms(50);
    let steps = vec![Step::with_defs(
        vec![CgroupDef::named("cg_0").workers(4)],
        HoldSpec::FULL,
    )];
    execute_steps_with(ctx, steps, Some(&checks))
}
Key points:
- expect_err = true tells the harness to assert failure and disable auto-repro.
- extra_sched_args = [...] passes CLI args to the scheduler binary. "--fail-verify" is a real knob that the test fixture scheduler scx-ktstr exposes to force a verifier failure (see scx-ktstr/src/main.rs and scx-ktstr/src/bpf/main.bpf.c); substitute your own scheduler’s equivalent of the behaviour you want to exercise in a negative test.
- The test function returns the scenario result normally; the harness checks that it produces an error.
Metric extraction from stderr
OutputFormat::Json and OutputFormat::LlmExtract read the
payload’s STDOUT as the primary stream, then fall back to STDERR if
stdout is empty or yields no metrics. Some benchmarks emit their
numbers only to stderr — schbench, for example, writes its
Wakeup Latencies percentiles / Request Latencies percentiles
blocks via fprintf(stderr, ...) and leaves stdout blank. The
fallback keeps those benchmarks usable without a redirect.
Consequence: a payload that writes mixed output to both streams
will have metrics extracted from stdout only, because the
fallback fires solely when the primary stream is empty or yields
nothing parseable. If you care about stderr-side numbers for a
stdout-emitting binary, redirect stderr into stdout at the payload
layer (extra_args = ["-c", "cmd 2>&1"] for shell-wrapped
invocations, or whatever equivalent the binary supports).
stress-ng is the mirror trap: progress / per-stressor summaries
go to stderr and stdout is blank, so the fallback sees stress-ng’s
prose. OutputFormat::Json returns zero metrics (stderr is prose,
not JSON); OutputFormat::LlmExtract may extract numbers from the
fallback but results depend on the local model’s tolerance for
that prose format. Keep OutputFormat::ExitCode for stress-ng
unless you are prepared for that tradeoff.
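The stdout-first, stderr-fallback rule can be sketched independently of any real extractor. Both functions below are illustrative: `parse_kv` is a toy `name=value` parser standing in for the JSON and LLM extractors, and `extract_with_fallback` is a hypothetical name.

```rust
/// Parses `stdout` first and falls back to `stderr` only when the primary
/// stream yields no metrics (which includes the empty-stdout case).
fn extract_with_fallback<F>(stdout: &str, stderr: &str, parse: F) -> Vec<(String, f64)>
where
    F: Fn(&str) -> Vec<(String, f64)>,
{
    let primary = parse(stdout);
    if primary.is_empty() {
        parse(stderr)
    } else {
        primary
    }
}

/// Toy parser for `name=value` lines; lines that do not parse are skipped.
fn parse_kv(text: &str) -> Vec<(String, f64)> {
    text.lines()
        .filter_map(|line| {
            let (name, value) = line.split_once('=')?;
            Some((name.trim().to_string(), value.trim().parse::<f64>().ok()?))
        })
        .collect()
}
```

Note the consequence spelled out above: once stdout yields even one metric, stderr is never consulted, so numbers printed only to stderr by a stdout-emitting binary are lost unless the streams are merged at the payload layer.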
Declarative include_files on Payload
Payloads that need host binaries or fixtures in the guest initramfs
can declare them on the Payload itself instead of relying on the
CLI -i / --include-files flag at every invocation. The specs
are resolved at run_ktstr_test time through the same
include-file pipeline the CLI uses.
Spec shapes
Three shapes are accepted; which branch fires is decided by the shape of the path:
- Bare name (single-component, no /, no ./, no ../): looked up first in the harness’s current working directory (path.exists() is tried before the PATH walk), then in the host’s PATH if the cwd lookup misses. The resolved absolute path is packed as include-files/<filename>. Example: "fio" → host /usr/bin/fio → archive include-files/fio.
- Relative or absolute path (starts with /, ./, ../, or contains more than one component): used verbatim and must exist. Relative paths are interpreted against the current working directory at the time the test harness runs (for cargo nextest run that is the workspace root or the individual crate root, depending on how the binary is invoked). A single-file path is packed as include-files/<filename>. Example: "./test-fixtures/workload.json" → archive include-files/workload.json.
- Directory (any path whose resolution is a directory): walked recursively (symlinks followed, non-regular files skipped) and the directory’s basename becomes the root under include-files/. Example: "./helpers" containing a.sh and sub/b.sh → archive entries include-files/helpers/a.sh and include-files/helpers/sub/b.sh.
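The lexical half of this dispatch (bare name versus explicit path) can be sketched without touching the filesystem. The names below are hypothetical and the directory shape is left out because it depends on what the path resolves to at run time.

```rust
use std::path::{Component, Path};

#[derive(Debug, PartialEq)]
enum SpecShape {
    BareName,     // single component: cwd lookup first, then PATH
    ExplicitPath, // used verbatim; must exist
}

/// Hypothetical classifier based only on the shape of the spec string.
fn classify(spec: &str) -> SpecShape {
    let mut components = Path::new(spec).components();
    match (components.next(), components.next()) {
        (Some(Component::Normal(_)), None) if !spec.contains('/') => SpecShape::BareName,
        _ => SpecShape::ExplicitPath,
    }
}

/// Archive slot for a single resolved file: `include-files/<filename>`.
fn archive_path(host_path: &Path) -> Option<String> {
    let name = host_path.file_name()?.to_str()?;
    Some(format!("include-files/{name}"))
}
```

Under this sketch `"fio"` is a bare name while `"./fio"` is an explicit path, which matches the rule that a leading `./` opts out of the PATH walk.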
Base directory for extra_include_files
Strings in extra_include_files follow the same three shapes as
the #[include_files(...)] attribute. They are NOT anchored to
CARGO_MANIFEST_DIR or to the crate source tree — they are
resolved against the harness’s current working directory at test
time, plus the host PATH for bare names. The attribute parser
accepts string literals only, so paths must be plain quoted
strings rather than compile-time expressions like
concat!(env!("CARGO_MANIFEST_DIR"), "/test-fixtures/foo.json").
For test fixtures
shipped alongside the test source, the reliable options are
either a bare name that a build script or test-setup stage has
placed on PATH, or a relative path rooted at the directory
from which the test is invoked.
Per-Payload declaration
Declare via the #[include_files(...)] attribute on
#[derive(Payload)]:
use ktstr::prelude::*;
#[derive(Payload)]
#[payload(binary = "fio")]
#[include_files("fio", "bench-helper")]
#[metric(name = "iops", polarity = HigherBetter, unit = "ops/s")]
struct FioPayload;
The generated FIO const carries include_files: &["fio", "bench-helper"]. The macro generates a const named by converting
the struct name to SCREAMING_SNAKE_CASE (stripping any Payload
suffix), so FioPayload → FIO and BenchDriver → BENCH_DRIVER.
When any #[ktstr_test] uses FIO as a payload or workload, those
files get resolved and packed into the initramfs automatically —
no -i flag needed on the CLI.
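The naming rule can be approximated in plain Rust. This is a sketch of the conversion as described, not the derive macro's code; `payload_const_name` is a hypothetical helper and it splits on every uppercase letter, so acronym-style names may convert differently in the real macro.

```rust
/// Hypothetical helper: strips a trailing `Payload` from the struct name and
/// converts the rest from UpperCamelCase to SCREAMING_SNAKE_CASE.
fn payload_const_name(struct_name: &str) -> String {
    let stem = struct_name.strip_suffix("Payload").unwrap_or(struct_name);
    let mut out = String::new();
    for (i, ch) in stem.chars().enumerate() {
        // Start a new word at every uppercase letter except the first.
        if ch.is_uppercase() && i > 0 {
            out.push('_');
        }
        out.extend(ch.to_uppercase());
    }
    out
}
```

This reproduces the two examples from the text: `FioPayload` becomes `FIO` and `BenchDriver` becomes `BENCH_DRIVER`.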
Fully worked declarative test
Complete end-to-end example of a #[ktstr_test] that relies on
declarative include_files only (no CLI -i flag at runtime). The
fixture binary ships on PATH under a project-controlled bin
directory; the payload declares its own dependency:
use ktstr::declare_scheduler;
use ktstr::prelude::*;
declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    topology = (1, 1, 2, 1),
});
#[derive(Payload)]
#[payload(binary = "bench-driver")]
#[include_files("bench-driver", "bench-helper")]
#[metric(name = "ops_per_sec", polarity = HigherBetter, unit = "ops/s")]
struct BenchDriver;
// The macro generates the `BENCH_DRIVER` const used below: `BenchDriver`
// (UpperCamelCase struct) → `BENCH_DRIVER` (SCREAMING_SNAKE_CASE, `Payload`
// suffix stripped). This is the only way to reference the payload from
// `#[ktstr_test]` attributes and from `ctx.payload(&...)` inside the body.
#[ktstr_test(
    scheduler = MY_SCHED,
    payload = BENCH_DRIVER,
    duration_s = 5,
)]
fn bench_driver_runs_with_declared_helpers(ctx: &Ctx) -> Result<AssertResult> {
    // Harness resolves the payload's `include_files` before boot:
    //   bench-driver → `include-files/bench-driver` (from $PATH)
    //   bench-helper → `include-files/bench-helper` (from $PATH)
    // Both land in the guest initramfs at `/include-files/` and are
    // on the worker's `PATH` during execution. The test body itself
    // does not touch the include set; it runs through `ctx.payload`.
    // `.run()` returns `(AssertResult, PayloadMetrics)`; the test
    // body only wants the AssertResult here, so discard the metrics
    // half of the tuple.
    ctx.payload(&BENCH_DRIVER)
        .run()
        .map(|(assert_result, _metrics)| assert_result)
}
No -i / --include-files flag is needed on any host-side
invocation; the packaging happens automatically as part of
run_ktstr_test.
Test-level extras
Test-level extras that don’t belong on any specific payload go on
the #[ktstr_test] attribute directly:
#[ktstr_test(
    scheduler = MY_SCHED,
    payload = FIO,
    extra_include_files = ["test-fixtures/workload.json"],
)]
fn fio_with_fixture(ctx: &Ctx) -> Result<AssertResult> {
    // test body
    Ok(AssertResult::pass())
}
The declarative set (the scheduler’s include_files, the payload’s,
the workloads’, and extra_include_files) is aggregated at test time
and resolved through the same include-file pipeline that backs the
-i / --include-files flag on ktstr shell and cargo ktstr shell.
#[ktstr_test] resolution and the shell CLIs share the
resolve_include_files resolver and differ only in where their
inputs come from. The union is deduped on identical
(archive_path, host_path) pairs. Two declarations that resolve
to the same archive slot with different host paths surface as a
hard error with both host paths in the diagnostic, rather than one
silently overwriting the other.
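The dedupe-or-fail rule above can be sketched as follows. This is a minimal illustration, not the resolver's actual code: ResolvedInclude and merge_includes are hypothetical names, and the real resolve_include_files may structure its inputs and errors differently.

```rust
use std::collections::BTreeMap;
use std::path::PathBuf;

/// Hypothetical resolved entry: where the file lands in the archive
/// and which host file backs it.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ResolvedInclude {
    pub archive_path: String,
    pub host_path: PathBuf,
}

/// Merge several declaration sources. Identical (archive_path, host_path)
/// pairs collapse to one entry; the same archive_path with a different
/// host_path is a hard error that names both host paths.
pub fn merge_includes(
    sources: &[Vec<ResolvedInclude>],
) -> Result<Vec<ResolvedInclude>, String> {
    let mut by_slot: BTreeMap<String, PathBuf> = BTreeMap::new();
    for inc in sources.iter().flatten() {
        if let Some(existing) = by_slot.get(&inc.archive_path) {
            if existing != &inc.host_path {
                return Err(format!(
                    "archive path {} claimed by both {} and {}",
                    inc.archive_path,
                    existing.display(),
                    inc.host_path.display()
                ));
            }
        } else {
            by_slot.insert(inc.archive_path.clone(), inc.host_path.clone());
        }
    }
    Ok(by_slot
        .into_iter()
        .map(|(archive_path, host_path)| ResolvedInclude { archive_path, host_path })
        .collect())
}
```

Keying the map on the archive path alone is what turns a silent overwrite into a detectable conflict.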
Architecture Overview
ktstr has three execution domains:
- Host process – the test binary running on the host. Manages VM lifecycle, monitors guest memory, evaluates results.
- Guest process – the same test binary running inside the VM as PID 1. Mounts filesystems, starts the scheduler, creates cgroups, forks workers, runs scenarios, writes results to SHM (COM2 fallback).
- Monitor thread – runs on the host while the guest executes. Reads guest VM memory directly to observe scheduler state without instrumenting it.
Execution flow
Host                                Guest
----                                -----
test binary
  |
  +-- build initramfs
  |     (test binary as /init
  |      + optional scheduler)
  |
  +-- boot KVM VM
  |                                 test binary (PID 1 init)
  |                                   |
  +-- start monitor thread            +-- mount filesystems
  |     (reads guest memory)          +-- start scheduler (if any)
  |                                   +-- create cgroups
  |                                   +-- fork workers
  |                                   +-- move workers to cgroups
  |                                   +-- signal workers to start
  |                                   +-- poll scheduler liveness
  |                                   +-- stop workers, collect reports
  |                                   +-- evaluate results
  |                                   +-- write result to SHM (COM2 fallback)
  |
  +-- read result from SHM (COM2 fallback)
  +-- evaluate monitor data
  +-- report pass/fail
Key design decisions
Same binary, two roles. The test binary serves as both host
controller and guest test runner. The initramfs embeds the binary
as /init. When running as PID 1, the Rust init code
(vmm::rust_init) handles the full guest lifecycle: mounts,
scheduler start, test dispatch, and reboot.
Forked workers, not threads. Workers are fork()ed processes
because cgroups operate on PIDs. Each worker must be a separate
process to be placed in its own cgroup.
Host-side monitoring. The monitor reads guest memory via KVM, avoiding BPF instrumentation of the scheduler under test. This eliminates observer effects on scheduling decisions.
Typed flag declarations. Flags use static references instead of string matching, enabling compile-time dependency resolution.
VMM
ktstr includes a purpose-built VMM (virtual machine monitor) that boots Linux kernels in KVM for testing.
KtstrVm builder
let result = vmm::KtstrVm::builder()
.kernel(&kernel_path)
.init_binary(&ktstr_binary)
.topology(numa_nodes, llcs, cores_per_llc, threads_per_core)
.memory_mb(4096)
.run_args(&["run".into(), "--ktstr-test-fn".into(), "my_test".into()])
.build()?
.run()?;
Topology
The VM topology is specified as (numa_nodes, llcs, cores_per_llc, threads_per_core). On x86_64, the VMM creates ACPI tables (MADT,
SRAT, SLIT, and HMAT when numa_nodes > 1) and MP tables. On
aarch64, topology is expressed via FDT cpu nodes with MPIDR-derived
reg properties.
pub struct Topology {
pub llcs: u32,
pub cores_per_llc: u32,
pub threads_per_core: u32,
pub numa_nodes: u32,
pub nodes: Option<&'static [NumaNode]>,
pub distances: Option<&'static NumaDistance>,
}
total_cpus() = llcs * cores_per_llc * threads_per_core.
num_llcs() = llcs.
When nodes is None (the default), memory and LLCs are distributed
uniformly across NUMA nodes with default 10/20 distances. When
Some, each NumaNode specifies its LLC count, memory size, and
optional HMAT attributes (latency_ns, bandwidth_mbs,
mem_side_cache). A NumaNode with llcs = 0 models a CXL
memory-only node.
NumaDistance is an NxN inter-node distance matrix. Diagonal entries
must be 10, off-diagonal > 10, and the matrix must be symmetric (ACPI
SLIT requirements).
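As a sketch of the three matrix constraints just listed, the check below validates a flattened row-major distance matrix. The function name validate_distances and the u8 entry type are assumptions for illustration; the real NumaDistance type may store and validate its entries differently.

```rust
/// Validate an n x n inter-node distance matrix stored row-major in `d`:
/// diagonal entries must be 10, off-diagonal entries must exceed 10,
/// and the matrix must be symmetric.
pub fn validate_distances(n: usize, d: &[u8]) -> Result<(), String> {
    if d.len() != n * n {
        return Err(format!("expected {} entries, got {}", n * n, d.len()));
    }
    for i in 0..n {
        for j in 0..n {
            let v = d[i * n + j];
            if i == j && v != 10 {
                return Err(format!("diagonal [{i}][{i}] is {v}, must be 10"));
            }
            if i != j && v <= 10 {
                return Err(format!("off-diagonal [{i}][{j}] is {v}, must be > 10"));
            }
            if v != d[j * n + i] {
                return Err(format!("matrix is not symmetric at [{i}][{j}]"));
            }
        }
    }
    Ok(())
}
```

For example, the default 10/20 layout for two nodes is the matrix [10, 20, 20, 10], which passes all three checks.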
Use Topology::new(numa_nodes, llcs, cores, threads) for uniform
topologies, or Topology::with_nodes(cores, threads, &nodes) for
explicit per-node configuration.
initramfs
The VMM builds a cpio initramfs containing:
- The test binary (as /init)
- Optional scheduler binary (as /scheduler)
- Shared library dependencies (resolved via ELF DT_NEEDED parsing)
The initramfs is cached under a key derived from the binary contents. A compressed SHM segment enables COW overlay into guest memory, sharing physical pages across concurrent VMs.
Guest-host communication
Serial console – COM2 carries guest stdout/stderr and serves as both
the canonical crash diagnostic transport and a fallback result
transport. The guest panic hook writes PANIC: <info>\n<bt>\n to
COM2; the host parses it via extract_panic_message and surfaces
the backtrace in test failure output. Delimited test results
(between ===KTSTR_TEST_RESULT_START=== /
===KTSTR_TEST_RESULT_END=== sentinels) and exit codes
(KTSTR_EXIT=N) are also written to COM2 as a fallback when the
TLV stream is unavailable.
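A minimal sketch of how a host could recover the fallback result and exit code from a captured COM2 log, using the sentinels and the KTSTR_EXIT=N marker named above. The function names are hypothetical and the payload format between the sentinels is not specified here, so the sketch returns it as an opaque string.

```rust
const RESULT_START: &str = "===KTSTR_TEST_RESULT_START===";
const RESULT_END: &str = "===KTSTR_TEST_RESULT_END===";

/// Return the text between the first START sentinel and the next END
/// sentinel, trimmed of surrounding whitespace.
pub fn extract_result(console: &str) -> Option<&str> {
    let start = console.find(RESULT_START)? + RESULT_START.len();
    let end = console[start..].find(RESULT_END)? + start;
    Some(console[start..end].trim())
}

/// Return the last KTSTR_EXIT=N value found on its own line.
pub fn extract_exit_code(console: &str) -> Option<i32> {
    console.lines().rev().find_map(|line| {
        line.trim()
            .strip_prefix("KTSTR_EXIT=")?
            .parse::<i32>()
            .ok()
    })
}
```

Scanning for the exit marker from the end of the log is a design choice of this sketch: it tolerates earlier kernel or workload output that happens to contain similar text.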
Virtio-console port 1 TLV stream – the primary guest-to-host
data channel. Carries scenario markers (MSG_TYPE_SCENARIO_START,
MSG_TYPE_SCENARIO_END), test results (MSG_TYPE_TEST_RESULT),
exit codes (MSG_TYPE_EXIT), stimulus events (MSG_TYPE_STIMULUS),
scheduler exit notifications (MSG_TYPE_SCHED_EXIT), profraw
coverage data (MSG_TYPE_PROFRAW), per-payload-invocation metrics
(MSG_TYPE_PAYLOAD_METRICS), and raw LlmExtract output
(MSG_TYPE_RAW_PAYLOAD_OUTPUT). Each TLV frame has a CRC32 for
integrity checking.
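To make the CRC-protected framing concrete, here is a sketch of one possible frame encoding. The header layout (little-endian type, length, CRC32, then payload) is an assumption chosen for illustration and is not confirmed by this document; only the presence of a per-frame CRC32 is. The checksum shown is the standard reflected CRC-32 (polynomial 0xEDB88320).

```rust
/// Bitwise CRC-32 (IEEE, reflected). Slow but dependency-free.
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Hypothetical layout: [type u32 LE][len u32 LE][crc32 u32 LE][payload].
pub fn encode_frame(msg_type: u32, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(12 + payload.len());
    out.extend_from_slice(&msg_type.to_le_bytes());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&crc32(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Decode one frame, rejecting short input and corrupted payloads.
pub fn decode_frame(buf: &[u8]) -> Result<(u32, &[u8]), &'static str> {
    if buf.len() < 12 {
        return Err("short header");
    }
    let word = |i: usize| u32::from_le_bytes([buf[i], buf[i + 1], buf[i + 2], buf[i + 3]]);
    let (msg_type, len, crc) = (word(0), word(4) as usize, word(8));
    let payload = buf.get(12..12 + len).ok_or("truncated payload")?;
    if crc32(payload) != crc {
        return Err("crc mismatch");
    }
    Ok((msg_type, payload))
}
```

A receiver that sees a CRC mismatch can drop the frame and fall back to the COM2 transport described above.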
Virtio devices
The VMM implements three virtio-MMIO devices in addition to the
serial console above. All three speak the virtio 1.x MMIO transport
(virtio-v1.2 §4.2.2) with VIRTIO_F_VERSION_1 and use irqfd
(eventfd → KVM GSI) for interrupt delivery.
- virtio-blk (vmm::virtio_blk) – file-backed block device with a single request virtqueue and a token-bucket throttle. Used to give workloads real on-disk filesystems (per-test images cloned from a btrfs template). Advertises VIRTIO_BLK_F_BLK_SIZE, VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_FLUSH, and VIRTIO_RING_F_EVENT_IDX, plus VIRTIO_BLK_F_RO when configured read-only.
- virtio-net (vmm::virtio_net) – two-virtqueue (RX, TX) NIC with an in-VMM L2 loopback backend. Used by network-shaped workloads (TCP/UDP throughput, latency) without depending on the host’s network stack. Advertises VIRTIO_NET_F_MAC so the guest binds a deterministic MAC.
- virtio-console (vmm::virtio_console) – three-port multiport console with eight virtqueues (per virtio-v1.2 §5.3.5: two control queues plus an in/out pair per port, three ports → 2 + 2·3 = 8). Port 0 carries the interactive /dev/hvc0 console alongside the COM1/COM2 16550 serial ports; port 1 carries the guest-to-host TLV stream that delivers exit code, test result, per-payload metrics, raw payload outputs, profraw, and scheduler exit notifications; port 2 is a transparent byte-pipe relay carrying scx_stats request bytes from the host to the in-guest relay thread and the scheduler’s responses back. Advertises VIRTIO_CONSOLE_F_MULTIPORT with max_nr_ports = 3.
Performance mode
When performance_mode is enabled, the VMM applies host-side
isolation (vCPU pinning, hugepages, NUMA mbind, RT scheduling),
guest-visible hints (KVM_HINTS_REALTIME CPUID), and KVM exit
suppression. Non-performance-mode VMs set KVM_CAP_HALT_POLL to
200us; overcommitted topologies set it to 0.
See Performance Mode for the full optimization list, prerequisites, and validation.
Dual-role architecture
The same test binary serves two roles:
Host side – manages the VM lifecycle: builds the initramfs, boots the kernel, runs the monitor, and evaluates results.
Guest side – runs inside the VM as /init (PID 1). The Rust init
code (vmm::rust_init) mounts filesystems, starts the scheduler,
dispatches the test function, then reboots.
The role is determined at runtime:
- PID 1 detection: when running as PID 1, the #[ctor] function ktstr_test_early_dispatch() runs the guest init path, which handles the full guest lifecycle.
- #[ktstr_test] host dispatch: a #[ctor::ctor] function (ktstr_test_early_dispatch) runs before main() in any binary that links against ktstr. When both --ktstr-test-fn and --ktstr-topo are present, it boots a VM and runs the test inside it.
- #[ktstr_test] guest dispatch: when only --ktstr-test-fn is present (no --ktstr-topo), the ctor runs the test function directly – the binary is already inside a VM.
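The three rules above amount to a small decision function. The sketch below is illustrative only: the Role enum, decide_role, and the Passthrough case (fall through to the binary's normal main) are hypothetical names and assumptions, and the real ctor may parse its arguments differently.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Role {
    /// Running as PID 1 inside the VM: run the guest init path.
    GuestInit,
    /// Both flags present: boot a VM and run the test inside it.
    HostDispatch,
    /// Only --ktstr-test-fn present: already in a VM, run the test directly.
    GuestDispatch,
    /// Assumed default: neither flag present, continue to main().
    Passthrough,
}

pub fn decide_role(pid: u32, args: &[&str]) -> Role {
    let has = |flag: &str| args.iter().any(|a| *a == flag);
    if pid == 1 {
        Role::GuestInit
    } else if has("--ktstr-test-fn") && has("--ktstr-topo") {
        Role::HostDispatch
    } else if has("--ktstr-test-fn") {
        Role::GuestDispatch
    } else {
        Role::Passthrough
    }
}
```

Checking the PID first keeps the init path independent of whatever arguments the kernel passes to /init.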
This design means one cargo build produces everything needed for
both host and guest execution. The initramfs embeds the same binary
that built it.
Boot process
- Load kernel (bzImage on x86_64, Image on aarch64) via linux-loader.
- Set up KVM vCPUs with the specified topology.
- Build and load initramfs.
- Set up serial devices (COM1 for console, COM2 for results).
- Boot the kernel.
- Kernel starts /init (the test binary).
- PID 1 detected: the guest init path mounts filesystems, starts the scheduler, dispatches the test function, and reboots.
Monitor
The monitor observes scheduler state from the host side by reading guest VM memory directly. It does not instrument the guest kernel or the scheduler under test.
What it reads
The monitor resolves kernel structure offsets via BTF (BPF Type Format) from the guest kernel. It reads per-CPU runqueue structures to extract:
- nr_running – number of runnable tasks on each CPU
- scx_nr_running – tasks managed by the sched_ext scheduler
- rq_clock – runqueue clock value
- local_dsq_depth – scx local dispatch queue depth
- scx_flags – sched_ext flags for each CPU
- scx event counters (fallback, keep-last, offline dispatch, skip-exiting, skip-migration-disabled, reenq-immed, reenq-local-repeat, refill-slice-dfl, bypass-duration, bypass-dispatch, bypass-activate, insert-not-owned, sub-bypass-dispatch)
When CONFIG_SCHEDSTATS is enabled, the monitor also reads per-CPU
struct rq schedstat fields (run_delay, pcount, sched_count,
ttwu_count, etc.).
The monitor walks the struct sched_domain tree whenever BTF
contains rq->sd and struct sched_domain — no CONFIG_SCHEDSTATS
required. Domain tree walking starts at rq->sd (lowest level) and
follows sd->parent pointers up to the root. Each domain level
provides topology metadata (level, name, flags, span_weight),
runtime fields (balance_interval, nr_balance_failed,
max_newidle_lb_cost), and optional fields (newidle_call,
newidle_success, newidle_ratio; added in 7.0, backported to
6.18.5+ and 6.12.65+, absent on 6.16-6.18.4). When
CONFIG_SCHEDSTATS is also enabled, each
domain additionally provides load balancing stats: lb_count,
lb_failed, lb_balanced, alb_pushed, ttwu_wake_remote, and
other counters indexed by idle type (CPU_NOT_IDLE, CPU_IDLE,
CPU_NEWLY_IDLE).
Sampling
The monitor takes periodic snapshots (MonitorSample) of all per-CPU
state. Each sample captures a point-in-time view of every CPU.
MonitorSummary aggregates samples into peak values (max imbalance
ratio, max DSQ depth, stall detection), per-sample averages
(imbalance ratio, nr_running per CPU, DSQ depth per CPU), and event
counter deltas. Averages are computed over valid samples only
(excluding uninitialized guest memory).
Threshold evaluation
MonitorThresholds defines pass/fail conditions:
pub struct MonitorThresholds {
pub max_imbalance_ratio: f64,
pub max_local_dsq_depth: u32,
pub fail_on_stall: bool,
pub sustained_samples: usize,
pub max_fallback_rate: f64,
pub max_keep_last_rate: f64,
}
A violation must persist for sustained_samples consecutive samples
before triggering a failure. This filters transient spikes from cpuset
transitions and cgroup creation/destruction.
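The sustained-window rule can be sketched as a consecutive-run counter. Sustained is a hypothetical helper, not a type from the crate; the real evaluator presumably keeps one such counter per threshold (and per CPU where relevant).

```rust
/// Tracks how many consecutive samples have violated a threshold.
pub struct Sustained {
    needed: usize,
    run: usize,
}

impl Sustained {
    /// `sustained_samples` of 0 is treated as 1 in this sketch.
    pub fn new(sustained_samples: usize) -> Self {
        Self { needed: sustained_samples.max(1), run: 0 }
    }

    /// Feed one sample. Returns true once the violation has held for
    /// `needed` consecutive samples; any clean sample resets the run.
    pub fn observe(&mut self, violated: bool) -> bool {
        self.run = if violated { self.run + 1 } else { 0 };
        self.run >= self.needed
    }
}
```

With sustained_samples = 3, the sequence violated, violated, clean, violated, violated, violated trips only on the final sample, because the clean sample resets the run.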
Stall detection
A stall is detected when a CPU’s rq_clock does not advance between
consecutive samples. Three exemptions prevent false positives:
- Idle CPUs: when nr_running == 0 in both the current and previous sample, the CPU has no runnable tasks. The kernel stops the tick (NOHZ) on idle CPUs, so rq_clock legitimately does not advance. These CPUs are excluded from stall checks.
- Preempted vCPUs: when the vCPU thread’s CPU time did not advance past the preemption threshold between samples, the host preempted the vCPU. These samples are excluded from stall checks.
- Sustained window: stall detection uses per-CPU consecutive counters and the sustained_samples threshold, matching how imbalance and DSQ depth checks work. A single stuck sample does not trigger failure – the stall must persist for sustained_samples consecutive samples on the same CPU.
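The exemptions above combine into a per-CPU check along these lines. This is a sketch under assumptions: CpuSample and the function names are hypothetical, the preemption decision is passed in as a precomputed flag, and an exempted interval is assumed to reset the consecutive counter (the text only says such samples are excluded).

```rust
#[derive(Clone, Copy)]
pub struct CpuSample {
    pub rq_clock: u64,
    pub nr_running: u32,
}

/// One interval counts toward a stall only if the clock is stuck and
/// neither the idle nor the preempted-vCPU exemption applies.
pub fn is_stall_candidate(prev: CpuSample, cur: CpuSample, vcpu_preempted: bool) -> bool {
    let clock_stuck = cur.rq_clock == prev.rq_clock;
    let idle = prev.nr_running == 0 && cur.nr_running == 0;
    clock_stuck && !idle && !vcpu_preempted
}

/// `series` holds one CPU's samples in order; the bool says whether the
/// vCPU was judged preempted during the interval ending at that sample.
pub fn cpu_stalled(series: &[(CpuSample, bool)], sustained_samples: usize) -> bool {
    let mut run = 0usize;
    for pair in series.windows(2) {
        let (prev, _) = pair[0];
        let (cur, preempted) = pair[1];
        run = if is_stall_candidate(prev, cur, preempted) { run + 1 } else { 0 };
        if run >= sustained_samples.max(1) {
            return true;
        }
    }
    false
}
```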
Uninitialized memory detection
Before the guest kernel initializes per-CPU structures, monitor reads return uninitialized data. Two layers handle this:
- Summary computation (MonitorSummary::from_samples): skips individual samples where any CPU’s local_dsq_depth exceeds DSQ_PLAUSIBILITY_CEILING (10,000) via sample_looks_valid().
- Threshold evaluation (MonitorThresholds::evaluate): checks all samples globally for plausibility. If all rq_clock values are identical across every CPU and sample, or any sample exceeds the plausibility ceiling, the entire report is passed as “not yet initialized” and no per-threshold checks run.
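Both layers can be sketched with two small predicates. DSQ_PLAUSIBILITY_CEILING and sample_looks_valid are named in the text; the struct shapes, the from_pairs constructor, and report_initialized are hypothetical simplifications for illustration.

```rust
pub const DSQ_PLAUSIBILITY_CEILING: u32 = 10_000;

pub struct CpuSnapshot {
    pub rq_clock: u64,
    pub local_dsq_depth: u32,
}

pub struct Sample {
    pub cpus: Vec<CpuSnapshot>,
}

impl Sample {
    /// Convenience constructor from (rq_clock, local_dsq_depth) pairs.
    pub fn from_pairs(pairs: &[(u64, u32)]) -> Self {
        let cpus = pairs
            .iter()
            .map(|&(rq_clock, local_dsq_depth)| CpuSnapshot { rq_clock, local_dsq_depth })
            .collect();
        Sample { cpus }
    }
}

/// Layer 1: a sample is usable only if every CPU's DSQ depth is plausible.
pub fn sample_looks_valid(s: &Sample) -> bool {
    s.cpus.iter().all(|c| c.local_dsq_depth <= DSQ_PLAUSIBILITY_CEILING)
}

/// Layer 2: the report counts as initialized only if every sample is
/// plausible and the rq_clock values are not all identical.
pub fn report_initialized(samples: &[Sample]) -> bool {
    if samples.iter().any(|s| !sample_looks_valid(s)) {
        return false;
    }
    let mut clocks = samples.iter().flat_map(|s| s.cpus.iter().map(|c| c.rq_clock));
    match clocks.next() {
        None => false,
        Some(first) => clocks.any(|c| c != first),
    }
}
```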
BPF map introspection
The monitor module also provides host-side BPF map discovery and
read/write access via bpf_map::BpfMapAccessor. The host reads and
writes guest BPF maps directly through the physical memory mapping
— no guest cooperation or BPF syscalls are needed.
GuestMem
GuestMem wraps a host pointer to the start of guest DRAM and
provides bounds-checked volatile reads and writes for scalar types
(u8/u32/u64). Byte-slice reads (read_bytes) use
copy_nonoverlapping. It also implements x86-64 page table walks
(translate_kva) for both 4-level and 5-level paging, and
granule-agnostic aarch64 walks (4 KB / 16 KB / 64 KB; level count
derived from TCR_EL1’s TG1 + T1SZ fields).
Scalar accesses use volatile semantics because the guest kernel modifies memory concurrently.
GuestKernel
GuestKernel builds on GuestMem by adding kernel symbol
resolution and address translation. It parses the vmlinux ELF
symbol table at construction and resolves paging configuration
(PAGE_OFFSET, CR3, 4-level vs 5-level) from guest memory.
Subsequent reads use cached state.
Three address translation modes are supported:
- Text/data/bss: kva - __START_KERNEL_map. For statically-linked kernel variables (read_symbol_*, write_symbol_*).
- Direct mapping: kva - PAGE_OFFSET. For SLAB allocations, per-CPU data, physically contiguous memory (read_direct_*).
- Vmalloc/vmap: Page table walk via CR3. For BPF maps, vmalloc’d memory, module text (read_kva_*, write_kva_*).
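The two arithmetic modes reduce to a checked subtraction. The sketch below uses the x86_64 value of __START_KERNEL_map and takes PAGE_OFFSET as a parameter, since the text says it is resolved from guest memory. The function names are hypothetical, and the sketch deliberately ignores the phys_base / KASLR adjustment and the page-table-walk mode.

```rust
/// x86_64 base of the kernel text mapping.
pub const START_KERNEL_MAP: u64 = 0xffff_ffff_8000_0000;

/// Text/data/bss mode: offset of a statically linked symbol from the
/// start of the kernel text mapping. None if `kva` lies below it.
pub fn text_offset(kva: u64) -> Option<u64> {
    kva.checked_sub(START_KERNEL_MAP)
}

/// Direct-mapping mode: offset of `kva` from PAGE_OFFSET.
/// None if `kva` lies below the direct map.
pub fn direct_offset(kva: u64, page_offset: u64) -> Option<u64> {
    kva.checked_sub(page_offset)
}
```

Anything that neither mode can express (vmalloc’d memory, BPF maps, module text) has to go through the page table walk instead.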
BpfMapAccessor
BpfMapAccessor resolves BTF offsets for BPF map kernel structures
(struct bpf_map, struct bpf_array, struct xa_node, struct idr)
and provides map discovery and value read/write. It borrows a
GuestKernel for address translation.
BpfMapAccessorOwned is a convenience wrapper that owns the
GuestKernel internally. Use BpfMapAccessor::from_guest_kernel
when you already have a GuestKernel; use BpfMapAccessorOwned::new
when you want a self-contained accessor.
Map discovery walks the kernel’s map_idr xarray:
- Read map_idr (BSS symbol, text mapping translation)
- Walk xa_node tree (SLAB-allocated, direct mapping translation)
- Read struct bpf_map fields. The allocation may be kmalloc’d or vmalloc’d depending on size and flags, so the translation uses translate_any_kva, which handles both paths rather than assuming either.
find_map searches by name suffix (e.g. ".bss" matches
"mitosis.bss"). Only BPF_MAP_TYPE_ARRAY maps are returned.
Use maps() to enumerate all map types without filtering.
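The suffix-plus-type filter can be illustrated on a plain list of discovered maps. MapType and MapInfo here are hypothetical stand-ins for whatever the accessor returns; only the matching rule (name suffix, array maps only) comes from the text.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MapType {
    Array,
    PercpuArray,
    Hash,
}

#[derive(Debug, PartialEq, Eq)]
pub struct MapInfo {
    pub name: String,
    pub map_type: MapType,
}

/// Return the first array map whose name ends with `suffix`.
pub fn find_map<'a>(maps: &'a [MapInfo], suffix: &str) -> Option<&'a MapInfo> {
    maps.iter()
        .find(|m| m.map_type == MapType::Array && m.name.ends_with(suffix))
}
```

Matching on the suffix lets a test name ".bss" without knowing the scheduler-specific prefix that libbpf prepends to the map name.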
Value access for BPF_MAP_TYPE_ARRAY maps reads/writes the inline
bpf_array.value flex array at the BTF-resolved offset. The value
region is vmalloc’d, so each byte access goes through the page table
walker to handle page boundaries.
For BPF_MAP_TYPE_PERCPU_ARRAY maps, bpf_array.pptrs[key] holds
a __percpu pointer (at the same union offset as value). Adding
__per_cpu_offset[cpu] yields the per-CPU KVA in the direct mapping.
read_percpu_array returns one Option<Vec<u8>> per CPU: Some
when the per-CPU PA falls within guest memory, None when it does not.
Typed field access
When a map has BTF metadata (btf_kva != 0), resolve_value_layout
reads the guest’s struct btf and its data blob, parses it with
btf_rs, and resolves the value struct’s fields. This enables
read_field / write_field with type-checked BpfValue variants.
Usage example
Find a scheduler’s .bss map and write a crash variable:
let offsets = BpfMapOffsets::from_vmlinux(vmlinux)?;
let accessor = BpfMapAccessor::from_guest_kernel(&kernel, &offsets)?;
let bss = accessor.find_map(".bss").expect(".bss map not found");
accessor.write_value_u32(&bss, crash_offset, 1);
BpfMapWrite
BpfMapWrite specifies a host-side write to a BPF map during VM
execution. The test runner waits for the scheduler to load (map
becomes discoverable), writes the value, then signals the guest via
SHM to start the scenario.
pub struct BpfMapWrite {
pub map_name_suffix: &'static str, // e.g. ".bss"
pub offset: usize, // byte offset in the map value
pub value: u32, // value to write
}
Use with #[ktstr_test] via the bpf_map_write attribute:
const BPF_CRASH: BpfMapWrite = BpfMapWrite {
map_name_suffix: ".bss",
offset: 42,
value: 1,
};
#[ktstr_test(bpf_map_write = BPF_CRASH, expect_err = true)]
fn crash_test(ctx: &Ctx) -> Result<AssertResult> {
Ok(AssertResult::pass())
}
The map is discovered by name suffix via BpfMapAccessor::find_map.
Only BPF_MAP_TYPE_ARRAY maps are supported. The write targets a
u32 at the specified byte offset within the map’s value region.
Prerequisites
- vmlinux: Required for ELF symbols and BTF. Must match the guest kernel. Symbols include phys_base so the runtime KASLR offset can be resolved via a page-table walk through the BSP’s CR3, breaking the chicken-and-egg between text-symbol PA translation and KASLR.
Cast analysis
BPF maps frequently store kernel pointers (task_struct *,
cgroup *, …) and arena pointers in u64 fields because BTF cannot
express a pointer to a per-allocation type. Without intervention the
renderer treats them as integers and the failure dump shows raw
0xffff…ffff values with no further chase.
The cast analyzer (monitor::cast_analysis::analyze_casts) closes
that gap. The freeze coordinator runs it once per scheduler load,
before any periodic capture or on-demand snapshot would consume
its output:
- The host loads the scheduler binary and locates each .bpf.o ELF in the build artifacts.
- Each program section is decoded through cast_analysis::BpfInsn::from_le_bytes into a flat &[BpfInsn] slab; relocations against .bss / .data / .rodata annotate the corresponding BPF_LD_IMM64 PCs with their datasec target.
- analyze_casts walks the slab forward, tracking register and stack-slot state for each instruction. Two detection paths feed the output: the arena pointer path (LDX through a previously loaded u64 field) and the kernel kptr path (STX of a typed pointer register into a u64 field). Function-entry seeding from bpf_func_info reseeds R1..R5 from the BTF FuncProto so typed parameters propagate correctly across subprogram joins.
- The result is a CastMap (BTreeMap<(source_struct_btf_id, field_byte_offset), CastHit>) cached on the per-VM KtstrVm.cast_map (a LazyCastMap that runs the analyzer on first dump and caches the result process-wide by scheduler binary content hash). The freeze coordinator threads the cached CastMap through DumpContext::cast_map into every per-map render so the renderer can consult it at every dump site.
- render_cast_pointer in monitor::btf_render consumes CastHit via MemReader::cast_lookup. When a u64 field at a recorded (struct, offset) is rendered, the renderer chases the pointer through the address-space-appropriate reader (arena vs slab/vmalloc) and tags the result with a cast_annotation of "cast→arena" or "cast→kernel" (plus a (sdt_alloc) suffix when the bridge described below fired). Failure dumps show the annotation alongside the resolved struct fields, so cast-recovered pointers are visually distinct from BTF-typed ones.
The renderer also consults an sdt_alloc bridge whenever a chase
target peels to a BTF_KIND_FWD forward declaration (typical for
struct sdt_data __arena * fields whose body lives in the
sdt_alloc library’s BTF rather than the scheduler’s program BTF).
The dump-state pre-pass walks each live scx_allocator and
populates a slot_start → ArenaSlotInfo index — one entry
per live allocator slot, carrying elem_size, header_size, and
the resolved payload BTF type id — that
MemReader::resolve_arena_type (in
dump::render_map::AccessorMemReader) range-looks up during the
chase. The lookup finds the slot whose
[slot_start, slot_start + elem_size) range contains the chased
address and routes by offset_in_slot: a slot-start chase
(offset == 0, e.g. the data field of scx_task_map_val
storing the raw sdt_alloc() return) returns the payload type id
with header_skip = header_size; a payload-start chase
(offset == header_size, e.g. the return of scx_task_data(p)
cached in cached_taskc_raw) returns the same payload type id
with header_skip = 0. The renderer reads header_skip + btf_size
bytes from the chased address, slices off the leading
header_skip bytes, and renders the payload struct. The
resulting Ptr carries a sdt_alloc-flavoured annotation:
"sdt_alloc" on the BTF-typed Type::Ptr arm, and
"cast→arena (sdt_alloc)" / "cast→kernel (sdt_alloc)" on the
cast-analyzer-driven path. The sdt_alloc bridge fires only when
the BTF-only resolve has already exhausted same-name siblings;
false-positive risk on that arm is bounded by the arena-window
range check (MemReader::resolve_arena_type returns None for
addresses outside every known allocator slot).
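The range lookup and offset routing described above can be sketched with a BTreeMap keyed by slot start. ArenaSlotInfo and resolve_arena_type are named in the text, but the field types, the Resolved return struct, and the choice to return None for any other in-slot offset are assumptions of this sketch.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy)]
pub struct ArenaSlotInfo {
    pub elem_size: u64,
    pub header_size: u64,
    pub payload_type_id: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub struct Resolved {
    pub payload_type_id: u32,
    pub header_skip: u64,
}

/// Find the slot whose [slot_start, slot_start + elem_size) range contains
/// `addr`, then route on the offset within that slot.
pub fn resolve_arena_type(index: &BTreeMap<u64, ArenaSlotInfo>, addr: u64) -> Option<Resolved> {
    // Greatest slot_start that is <= addr.
    let (&slot_start, info) = index.range(..=addr).next_back()?;
    let offset_in_slot = addr - slot_start;
    if offset_in_slot >= info.elem_size {
        return None; // addr falls in the gap after this slot
    }
    if offset_in_slot == 0 {
        // Slot-start chase: skip the allocator header before the payload.
        Some(Resolved { payload_type_id: info.payload_type_id, header_skip: info.header_size })
    } else if offset_in_slot == info.header_size {
        // Payload-start chase: the pointer already points past the header.
        Some(Resolved { payload_type_id: info.payload_type_id, header_skip: 0 })
    } else {
        None // assumed: other interior offsets are not resolved
    }
}
```

The ordered map makes the containment test a single predecessor query instead of a scan over every live slot.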
A separate cross-BTF Fwd resolution path covers the case where a
BTF_KIND_FWD pointee’s body lives in a sibling embedded BPF
object’s BTF rather than an sdt_alloc slot — the typical
multi-.bpf.objs shape where one object declares
struct cgx_target; (forward) and a sibling object defines
struct cgx_target { ... } (full body). The cast-analysis
pre-pass (vmm::cast_analysis_load::build_fwd_index) walks every
parsed embedded program BTF and records a
name -> (btfs index, type_id) entry for every complete
(!is_fwd) Type::Struct / Type::Union. First-write-wins on
duplicate names: when the same name appears in multiple BTFs the
index keeps the first-seen entry. Anonymous types and Typedef
are not indexed (no name to key on, and typedefs add no body —
the chase peels through them via peel_modifiers_with_id before
consulting the index). The index is threaded through
DumpContext::cross_btf and exposed to the renderer via
MemReader::cross_btf_resolve_fwd. When chase_arena_pointer /
render_cast_pointer peel a chase target through
peel_modifiers_resolving_fwd and the local same-BTF sibling
search came up empty, try_cross_btf_fwd_resolve consults the
cross-BTF index by the Fwd’s name (and aggregate kind — struct
vs union); a hit returns a CrossBtfRef { btf, type_id } and
the chase recursion switches to the resolved sibling BTF for the
pointee render. Cross-BTF resolution does NOT introduce a new
annotation — the body is recovered transparently and the rendered
subtree carries the cast or BTF-typed annotation it would have
had if the same struct lived in the entry BTF. Unlike the
sdt_alloc bridge the cross-BTF index is consulted whenever a
Fwd terminal survives the local resolve — there is no
arena-window gate, since the lookup is purely a name-keyed BTF
table and a name miss simply leaves the chase on its existing
“forward declaration; body not in this BTF” skip path.
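A sketch of the first-write-wins index and the name-plus-kind lookup follows. build_fwd_index and CrossBtfRef are named in the text; TypeEntry, AggKind, resolve_fwd, and the choice to store the aggregate kind alongside the reference are hypothetical simplifications standing in for the parsed BTF types.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AggKind {
    Struct,
    Union,
}

/// Simplified view of one struct/union entry in a parsed BTF.
pub struct TypeEntry {
    pub name: Option<String>,
    pub kind: AggKind,
    pub is_fwd: bool,
    pub type_id: u32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CrossBtfRef {
    pub btf: usize,
    pub type_id: u32,
}

/// Index every complete, named aggregate. The first BTF to define a name wins.
pub fn build_fwd_index(btfs: &[Vec<TypeEntry>]) -> HashMap<String, (AggKind, CrossBtfRef)> {
    let mut index = HashMap::new();
    for (btf, types) in btfs.iter().enumerate() {
        for t in types {
            if t.is_fwd {
                continue; // forward declarations carry no body
            }
            let Some(name) = &t.name else { continue }; // anonymous: no key
            index
                .entry(name.clone())
                .or_insert((t.kind, CrossBtfRef { btf, type_id: t.type_id }));
        }
    }
    index
}

/// Resolve a forward declaration by name, requiring the same aggregate kind.
pub fn resolve_fwd(
    index: &HashMap<String, (AggKind, CrossBtfRef)>,
    name: &str,
    kind: AggKind,
) -> Option<CrossBtfRef> {
    index.get(name).filter(|(k, _)| *k == kind).map(|(_, r)| *r)
}
```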
The analyzer is deliberately conservative: branch joins reset
register and stack state, conflicts drop the offending entry, and
self-stores are rejected. False negatives fall back to raw u64
(the prior behavior); false positives would chase garbage and are
avoided. The analysis is unconditional — no test-author
configuration, no opt-in flag — and the freeze coordinator wires
the resulting CastMap through every snapshot, periodic capture,
and failure dump.
Probe pipeline
The probe pipeline captures function arguments and struct fields during auto-repro. It operates inside the guest VM (not from the host), using two BPF skeletons that share maps.
Architecture
crash stack -> extract functions -> BTF resolve -> load skeletons -> poll
                                                         |
                                   kprobe skeleton       |    fentry/fexit skeleton
                                   (kernel entry)        |    (BPF entry + kernel exit)
                                         |               |               |
                                         v               v               v
                                   func_meta_map    <--shared-->    probe_data
                                         |                     (entry + exit fields)
                                trigger fires (ring buffer)
                                         |
                                read probe_data entries
                                         |
                                   stitch by tptr
                                         |
                              format with entry→exit diffs
Kprobe skeleton (probe.bpf.c)
Attaches to kernel functions via attach_kprobe. The BPF handler:
- Gets the function IP via bpf_get_func_ip
- Looks up func_meta from func_meta_map (keyed by IP)
- Captures 6 raw args from pt_regs
- Dereferences struct fields via BTF-resolved offsets
- Reads char * string params if configured
- Stores result in probe_data (keyed by (func_ip, task_ptr))
The trigger fires via tp_btf/sched_ext_exit (inside
scx_claim_exit()) and sends an EVENT_TRIGGER via ring buffer
with the current task pointer and kernel stack.
Fentry/fexit skeleton (fentry_probe.bpf.c)
Handles both BPF struct_ops callbacks and kernel function exit
capture. Loaded in batches of 4 fentry + 4 fexit programs per
skeleton instance via set_attach_target. Shares probe_data and
func_meta_map with the kprobe skeleton via reuse_fd.
A per-slot is_kernel rodata flag controls argument access:
- BPF callbacks (is_kernel=0): ctx[0] is a void pointer to the real callback arguments. The handler dereferences through it. Uses sentinel IPs (func_idx | (1<<63)) in func_meta_map.
- Kernel functions (is_kernel=1): args are directly in ctx[0..5]. Uses bpf_get_func_ip(ctx) for the real IP, matching the kprobe entry handler’s key.
Fexit handlers look up the existing probe_data entry (written by
fentry or kprobe at function entry) and re-read struct fields into
exit_fields. This captures post-mutation state for paired display.
BTF resolution
Two BTF sources:
- vmlinux BTF (btf-rs): resolves kernel struct offsets. Types in STRUCT_FIELDS (task_struct, rq, scx_dispatch_q, etc.) use curated field lists with chained pointer dereferences (e.g. ->cpus_ptr->bits[0]). Other struct pointer params get scalar, enum, and cpumask pointer fields auto-discovered from vmlinux BTF.
- Program BTF (libbpf-rs): resolves BPF-local struct offsets for types not in vmlinux (e.g. scheduler-defined task_ctx). Auto-discovers scalar, enum, and cpumask pointer fields.
Callback signatures are resolved by:
- ____name inner function in program BTF (typed params)
- sched_ext_ops member in vmlinux BTF (fallback)
- Wrapper function (void *ctx, no useful params)
Field decoding
The output formatter decodes field values based on their key name:
- dsq_id -> SCX_DSQ_INVALID, SCX_DSQ_GLOBAL, SCX_DSQ_LOCAL, SCX_DSQ_BYPASS, SCX_DSQ_LOCAL_ON|{cpu}, BUILTIN({v}), DSQ(0x{hex})
- cpumask_0..3 -> coalesced into one cpus_ptr field rendered as 0x{hex}({cpu-list}): the masked hex of the cpumask words (low-order word first; multi-word masks join with _ between 64-bit chunks) followed by the run-length-collapsed CPU range list (e.g. 0xf(0-3), 0x1_00000000000000ff(0-7,64))
- enq_flags -> WAKEUP|HEAD|PREEMPT
- exit_kind -> ERROR, ERROR_BPF, ERROR_STALL, etc.
- scx_flags -> QUEUED|ENABLED
- sticky_cpu -> -1 for 0xffffffff
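The run-length-collapsed CPU list in the cpus_ptr rendering can be sketched as below. cpu_list is a hypothetical helper that covers only the list inside the parentheses, not the hex prefix; it assumes words[0] holds CPUs 0-63, words[1] holds CPUs 64-127, and so on.

```rust
/// Collapse set bits across 64-bit cpumask words into a range list
/// such as "0-7,64".
pub fn cpu_list(words: &[u64]) -> String {
    let mut ranges: Vec<(usize, usize)> = Vec::new();
    for (w, &word) in words.iter().enumerate() {
        for bit in 0..64usize {
            if word & (1u64 << bit) == 0 {
                continue;
            }
            let cpu = w * 64 + bit;
            // Extend the current range when this CPU is adjacent to its end.
            let extend = matches!(ranges.last(), Some(&(_, end)) if end + 1 == cpu);
            if extend {
                ranges.last_mut().unwrap().1 = cpu;
            } else {
                ranges.push((cpu, cpu));
            }
        }
    }
    ranges
        .iter()
        .map(|&(a, b)| if a == b { a.to_string() } else { format!("{a}-{b}") })
        .collect::<Vec<_>>()
        .join(",")
}
```

Ranges continue across word boundaries, so a mask covering CPUs 0 through 64 renders as a single 0-64 range.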
Event stitching
After the trigger fires, all probe_data entries are read, matched
to functions by IP, then filtered to a single task’s scheduling
journey:
- Read the task_struct pointer from the trigger event’s bpf_get_current_task() value (args[0])
- For functions with a task_struct parameter: keep events where args[param_idx] == tptr
- For functions without a task_struct parameter: keep events where task_ptr == tptr (matched via bpf_get_current_task() at probe time)
Events are sorted by timestamp for chronological output.
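The filter-then-sort step can be sketched as follows. ProbeEvent and stitch are hypothetical shapes, and the per-function task parameter index is supplied as a closure here, whereas the real pipeline derives it from BTF.

```rust
pub struct ProbeEvent {
    pub func_ip: u64,
    pub task_ptr: u64, // bpf_get_current_task() at probe time
    pub args: [u64; 6],
    pub ts_ns: u64,
}

/// Keep only events that belong to the task `tptr`, then order them by time.
/// `task_param_idx` returns Some(i) when the function at that IP takes a
/// task_struct pointer as argument i, and None otherwise.
pub fn stitch(
    mut events: Vec<ProbeEvent>,
    tptr: u64,
    task_param_idx: impl Fn(u64) -> Option<usize>,
) -> Vec<ProbeEvent> {
    events.retain(|e| match task_param_idx(e.func_ip) {
        Some(idx) => e.args.get(idx) == Some(&tptr),
        None => e.task_ptr == tptr,
    });
    events.sort_by_key(|e| e.ts_ns);
    events
}
```

Preferring the explicit task argument over the current task matters for functions that operate on a task other than the one running at probe time.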
Worker Processes
Workers are the processes that generate load for scenarios. They run inside the VM, each in its own cgroup.
Fork, not threads
Workers are fork()ed processes. Cgroups operate on PIDs, so each
worker must be a separate process to be independently placed in a
cgroup.
Two-phase start
Workers wait on a pipe for a “start” signal after fork:
- Parent forks the worker.
- Worker installs SIGUSR1 handler, then blocks on pipe read.
- Parent moves the worker to its target cgroup.
- Parent writes to the pipe, signaling the worker to start.
This ensures workers run inside their target cgroup from the first instruction of their workload.
Custom work types
WorkType::Custom workers follow the same two-phase start (fork,
cgroup placement, start signal), and the framework applies affinity
and scheduling policy before handing control to the user function.
After setup, the run function pointer takes over entirely –
the framework work loop is bypassed.
Stop protocol
Workers install a SIGUSR1 handler that sets an atomic STOP flag. The
main work loop checks this flag each iteration. On stop:
- Parent sends SIGUSR1 to all workers.
- Workers exit their work loop.
- Workers serialize their WorkerReport to a pipe.
- Parent reads reports and waits for child exit.
Telemetry
Each worker produces a WorkerReport:
pub struct WorkerReport {
pub tid: i32,
pub work_units: u64,
pub cpu_time_ns: u64,
pub wall_time_ns: u64,
pub off_cpu_ns: u64,
pub migration_count: u64,
pub cpus_used: BTreeSet<usize>,
pub migrations: Vec<Migration>,
pub max_gap_ms: u64,
pub max_gap_cpu: usize,
pub max_gap_at_ms: u64,
pub resume_latencies_ns: Vec<u64>,
pub wake_sample_total: u64,
pub iteration_costs_ns: Vec<u64>,
pub iteration_cost_sample_total: u64,
pub iterations: u64,
pub schedstat_run_delay_ns: u64,
pub schedstat_run_count: u64,
pub schedstat_cpu_time_ns: u64,
pub completed: bool,
pub numa_pages: BTreeMap<usize, u64>,
pub vmstat_numa_pages_migrated: u64,
pub exit_info: Option<WorkerExitInfo>,
pub is_messenger: bool,
pub group_idx: usize,
pub affinity_error: Option<String>,
}
pub enum WorkerExitInfo {
Exited(i32),
Signaled(i32),
TimedOut,
WaitFailed(String),
/// Thread-mode worker panicked. Exclusive to `CloneMode::Thread`;
/// fork workers surface panics via `Exited(1)` or
/// `Signaled(SIGABRT)` depending on the panic strategy.
Panicked(String),
}
iteration_costs_ns mirrors resume_latencies_ns for per-iteration
wall-clock cost: a reservoir-sampled vector capped at
MAX_WAKE_SAMPLES entries, paired with iteration_cost_sample_total
for the total observation count when the cap is exceeded.
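A capped sample vector paired with an uncapped total can be sketched with classic reservoir sampling (Algorithm R). Whether the worker uses this exact variant is an assumption; the Reservoir type and the inline xorshift64 generator are hypothetical and exist only to keep the example dependency-free.

```rust
pub struct Reservoir {
    pub cap: usize,
    pub samples: Vec<u64>,
    pub total: u64,
    rng: u64,
}

impl Reservoir {
    pub fn new(cap: usize, seed: u64) -> Self {
        // xorshift64 must not start from zero.
        Self { cap, samples: Vec::with_capacity(cap), total: 0, rng: seed.max(1) }
    }

    fn next_rand(&mut self) -> u64 {
        let mut x = self.rng;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.rng = x;
        x
    }

    /// Record one observation. The vector never grows past `cap`,
    /// while `total` counts every observation ever seen.
    pub fn observe(&mut self, v: u64) {
        self.total += 1;
        if self.samples.len() < self.cap {
            self.samples.push(v);
        } else {
            // Replace a random slot with probability cap / total.
            let j = (self.next_rand() % self.total) as usize;
            if j < self.cap {
                self.samples[j] = v;
            }
        }
    }
}
```

Consumers that need a count read total; consumers that need a distribution read samples, which stays an approximately uniform subset of everything observed.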
group_idx is 0 for the primary group and 1..=N for composed
WorkSpec entries in declaration order (mirrors
WorkloadConfig::composed). affinity_error is Some(reason)
when the worker’s sched_setaffinity / mbind setup failed; the
worker still runs and produces a report but the field documents
the divergence from the requested affinity contract.
Three fields worth calling out explicitly:
- wake_sample_total – the TOTAL number of wake-latency observations the worker saw, including samples the reservoir sampler dropped. resume_latencies_ns is clamped to at most 100_000 entries (MAX_WAKE_SAMPLES); on a long run that accumulates more wakes than the cap, the vector stays at the cap while this counter keeps climbing. Host-side consumers reporting “total wakeups observed” read wake_sample_total; percentile / CV computations read resume_latencies_ns.
- completed – true when the worker reached its natural end (outer loop observed STOP and exited cleanly, or a custom-closure payload returned from its run). Sentinel reports synthesised by stop_and_collect’s JSON-parse fallback carry false. Lets consumers distinguish “ran to completion, saw zero iterations” from “died / timed out before recording anything.”
- is_messenger – true only for the messenger worker in a FutexFanOut / FanOutCompute group (the single writer that advances the shared generation and issues futex_wake). Enables per-worker latency-participation assertions: receivers produce resume_latencies_ns entries, messengers record wake-side work but no resume latency.
- off_cpu_ns = wall_time_ns - cpu_time_ns
- exit_info is None on every live-worker-authored report. stop_and_collect synthesises a sentinel WorkerReport with Some(_) when the worker handed back no (or unparseable) JSON, using the WorkerExitInfo enum (Exited(code) / Signaled(signum) / TimedOut / WaitFailed(String); the string carries the underlying waitpid errno rendering) to preserve the reap shape for post-mortem.
- Migrations are tracked every 1024 work units: after each outer iteration the worker checks work_units.is_multiple_of(1024) and runs the migration-detect body iff that is true. The check runs exactly once per outer iteration, so the effective period in outer iterations is 1024 / gcd(units_per_iter, 1024). Default parameters assumed unless noted:
  - Every outer iteration (period = 1 iter): SpinWait (1024), Mixed (1024), Bursty (each outer iter runs spin_burst(1024) some number of times inside the burst_ms loop, always a multiple of 1024), PipeIo (burst_iters=1024), FutexPingPong (spin_iters=1024), CachePressure (1024 strided RMW steps), CacheYield (1024 strided RMW steps), CachePipe (burst_iters=1024), FutexFanOut messenger AND receiver (both call spin_burst(spin_iters) before splitting roles; default 1024), AffinityChurn (spin_iters=1024), PolicyChurn (spin_iters=1024).
  - Every 2 iterations: NiceSweep (spin_burst(512) per iter → gcd(512, 1024) = 512).
  - Every 4 iterations: MutexContention (work_iters=1024 + hold_iters=256 = 1280 per acquire+release → gcd(1280, 1024) = 256, period = 4 iters). FanOutCompute messenger (spin_burst(256) per wake cycle → same 256-unit gcd).
  - Every 16 iterations: PageFaultChurn, with one persistent MAP_PRIVATE | MAP_ANONYMOUS region per worker (default 4 MiB via region_kb=4096), re-faulted each outer iteration via madvise(MADV_DONTNEED). Each iteration contributes touches_per_cycle=256 page writes (each first write after MADV_DONTNEED triggers a minor fault; a birthday-collision xorshift64 index may revisit a page already faulted this cycle, so the fault count is a ceiling, not a floor) + spin_iters=64 = 320 work units (gcd(320, 1024) = 64).
  - Every 64 iterations: IoSyncWrite (16 4-KiB writes per write-then-sleep pair → gcd(16, 1024) = 16); IoRandRead and IoConvoy use the same 64-iteration cadence for their per-iteration pread/pwrite mixes.
  - Every 1024 iterations: YieldHeavy (1 unit per yield), ForkExit (1 unit per fork+wait), FanOutCompute worker (operations=5 matrix multiplies per wake, one work_units tick per multiply → gcd(5, 1024) = 1).
  - Phase-inherited: Sequence inherits whichever phase is currently active. Spin / Yield / Io use the same per-unit accounting as the SpinWait / YieldHeavy / IoSyncWrite groups above; Sleep contributes no work_units and so pauses migration checks while it runs.
  - Not tracked by the framework: Custom workers do not contribute to work_units on the framework’s behalf; migration tracking fires only if the user’s run function increments work_units and emits migrations directly.
- Scheduling gaps (max_gap_ms, max_gap_cpu, max_gap_at_ms) record the longest wall-clock interval between consecutive 1024-work-unit migration-check points plus the CPU the gap was observed on and its time from start. High values indicate preemption or descheduling near a checkpoint boundary. The checkpoint cadence, and therefore the gap-measurement cadence, is governed by the same work_units.is_multiple_of(1024) test that the migration tracker uses, so the effective measurement period in outer iterations matches the per-WorkType periods listed above.
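The period formula 1024 / gcd(units_per_iter, 1024) quoted above can be checked with a few lines of Rust. This is a minimal sketch, not ktstr code: the function names are made up for illustration, and it assumes every outer iteration adds the same number of work units.

```rust
/// Greatest common divisor via Euclid's algorithm.
fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a
}

/// Number of outer iterations between migration checks, assuming each
/// iteration adds exactly `units_per_iter` work units and the check fires
/// when the running total is a multiple of 1024.
/// (Hypothetical helper; the name is not part of ktstr.)
fn migration_check_period(units_per_iter: u64) -> u64 {
    const CHECK_INTERVAL: u64 = 1024;
    assert!(units_per_iter > 0, "zero work units per iteration never reaches a checkpoint");
    CHECK_INTERVAL / gcd(units_per_iter, CHECK_INTERVAL)
}
```

Feeding in the per-iteration unit counts from the list (1024, 512, 1280, 320, 16, 5) reproduces the periods 1, 2, 4, 16, 64, and 1024.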
Benchmarking fields
Workers collect two categories of timing data:
Per-wakeup latency (resume_latencies_ns): timestamp-based samples
recorded around blocking operations. Populated for work types with a
blocking step: Bursty (sleep), PipeIo (pipe read), FutexPingPong
(futex wait), FutexFanOut (futex wait, receivers only), FanOutCompute
(futex wait, workers only — measured as CLOCK_MONOTONIC delta from
messenger’s shared timestamp), CacheYield (yield), CachePipe (pipe
read), IoSyncWrite / IoRandRead / IoConvoy (pread / pwrite / fdatasync
blocking), NiceSweep (yield), AffinityChurn (yield),
PolicyChurn (yield), MutexContention (futex wait on contended
acquire), ForkExit (parent’s waitpid wait), and Sequence when its
phases include Sleep, Yield, or Io.
Each sample is in nanoseconds; most work types use
Instant::elapsed() across the blocking call, while FanOutCompute
uses clock_gettime(CLOCK_MONOTONIC) to measure against the
messenger’s pre-wake timestamp.
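To illustrate how a capped sample vector and an uncapped total can coexist (the resume_latencies_ns / wake_sample_total split), here is a rough sketch of a bounded reservoir. It assumes the classic Algorithm R replacement rule and a xorshift64 generator; the type and field names are invented for the example, and the real worker may sample differently.

```rust
/// Bounded latency reservoir. `total` plays the role of wake_sample_total,
/// `samples` the role of resume_latencies_ns. (Hypothetical type.)
struct WakeReservoir {
    cap: usize,
    samples: Vec<u64>,
    total: u64,
    rng_state: u64,
}

impl WakeReservoir {
    fn new(cap: usize, seed: u64) -> Self {
        // xorshift64 must not start from a zero state.
        Self { cap, samples: Vec::with_capacity(cap), total: 0, rng_state: seed.max(1) }
    }

    /// One xorshift64 step; good enough for an illustration.
    fn next_rand(&mut self) -> u64 {
        let mut x = self.rng_state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.rng_state = x;
        x
    }

    /// Counts every observation, but stores at most `cap` of them.
    fn record(&mut self, latency_ns: u64) {
        self.total += 1;
        if self.samples.len() < self.cap {
            self.samples.push(latency_ns);
            return;
        }
        // Algorithm R: pick a slot in [0, total); the new sample replaces an
        // old one only when the slot falls inside the reservoir.
        let slot = self.next_rand() % self.total;
        if slot < self.cap as u64 {
            self.samples[slot as usize] = latency_ns;
        }
    }
}
```

With this shape, percentile code reads `samples` while wakeup-count reporting reads `total`, which is the same division of labour the report fields describe.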
schedstat deltas: read from /proc/self/schedstat at work-loop
start and end. Three fields:
- schedstat_cpu_time_ns – delta of field 1 (on-CPU time)
- schedstat_run_delay_ns – delta of field 2 (time spent waiting for a CPU)
- schedstat_run_count – delta of field 3 (pcount, the scheduler-in count: incremented each time the scheduler picks this task to execute, across CFS/EEVDF, FIFO/RR, and sched_ext alike). Not a context-switch count: a task that keeps running on the same CPU without leaving the runqueue does not see pcount advance while it runs. For true context-switch counts read /proc/<pid>/status’s voluntary_ctxt_switches and nonvoluntary_ctxt_switches; the worker reads pcount instead because schedstat delivers it alongside run_delay / cpu_time in a single file read.
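As a sketch of how the three schedstat deltas could be derived, the snippet below parses the three whitespace-separated fields of /proc/self/schedstat and subtracts a start snapshot from an end snapshot. The struct and function names are assumptions made for this example, not the worker's actual code.

```rust
use std::fs;

/// The three fields of /proc/<pid>/schedstat, in file order.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Schedstat {
    cpu_time_ns: u64,
    run_delay_ns: u64,
    run_count: u64,
}

/// Parses "cpu_time run_delay pcount"; returns None on malformed input.
fn parse_schedstat(text: &str) -> Option<Schedstat> {
    let mut fields = text.split_whitespace().map(|f| f.parse::<u64>().ok());
    Some(Schedstat {
        cpu_time_ns: fields.next()??,
        run_delay_ns: fields.next()??,
        run_count: fields.next()??,
    })
}

/// Takes a snapshot for the calling process (Linux only).
fn read_schedstat() -> Option<Schedstat> {
    parse_schedstat(&fs::read_to_string("/proc/self/schedstat").ok()?)
}

/// End-minus-start delta; saturating so a bad snapshot cannot underflow.
fn schedstat_delta(start: Schedstat, end: Schedstat) -> Schedstat {
    Schedstat {
        cpu_time_ns: end.cpu_time_ns.saturating_sub(start.cpu_time_ns),
        run_delay_ns: end.run_delay_ns.saturating_sub(start.run_delay_ns),
        run_count: end.run_count.saturating_sub(start.run_count),
    }
}
```

A caller would take one read_schedstat() snapshot before the work loop and one after, then report schedstat_delta(start, end).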
iterations counts outer-loop iterations.
NUMA fields
numa_pages: per-NUMA-node page counts parsed from
/proc/self/numa_maps after the workload completes. Keyed by node ID.
Empty when numa_maps is unavailable.
vmstat_numa_pages_migrated: delta of the numa_pages_migrated
counter from /proc/vmstat between pre- and post-workload snapshots.
Measures cross-node page migrations during the test.
These fields feed the NUMA checking thresholds.
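For readers unfamiliar with the numa_maps format, the following sketch shows one way to sum per-node page counts. Each mapping line in /proc/self/numa_maps carries tokens of the form N<node>=<pages>; the function name and the choice to sum across all mappings are assumptions for illustration, and ktstr's own parser may differ.

```rust
use std::collections::BTreeMap;

/// Sums the `N<node>=<pages>` tokens across every line of a numa_maps dump.
/// Returns an empty map when the text holds no such tokens.
fn parse_numa_pages(numa_maps: &str) -> BTreeMap<u32, u64> {
    let mut pages = BTreeMap::new();
    for token in numa_maps.split_whitespace() {
        // Per-node counts look like `N0=128`; skip every other token.
        let Some(rest) = token.strip_prefix('N') else { continue };
        let Some((node, count)) = rest.split_once('=') else { continue };
        if let (Ok(node), Ok(count)) = (node.parse::<u32>(), count.parse::<u64>()) {
            *pages.entry(node).or_insert(0) += count;
        }
    }
    pages
}
```

Keying the result by node ID in a BTreeMap keeps the output ordered, which makes it easy to compare against per-node thresholds.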
Custom workers produce their own WorkerReport. The framework does
not populate any telemetry fields for Custom – migration tracking,
gap detection, schedstat deltas, NUMA page counts, and iteration
counters are only present if the user’s run function fills them.
Worker-progress watchdog
Workers send SIGUSR2 to the scheduler when they have been stuck for more than 2 seconds. The default POSIX disposition terminates the scheduler process, which ktstr detects as a scheduler death and captures the sched_ext dump from dmesg.
In repro mode, the watchdog is disabled to keep the scheduler alive for BPF probe assertions. The watchdog does not fire for Custom workers because they bypass the framework work loop.
RAII cleanup
WorkloadHandle implements Drop: it sends SIGKILL to all child
processes and waits for them. This prevents orphaned worker processes
on error paths.
WorkloadHandle
WorkloadHandle is the RAII handle to spawned worker processes. It
manages the lifecycle of forked workers: spawning, start signaling,
stop/collection, and cleanup.
use ktstr::prelude::*;
#[must_use = "dropping a WorkloadHandle immediately kills all worker processes"]
pub struct WorkloadHandle { /* ... */ }
Spawning
let config = WorkloadConfig {
num_workers: 4,
work_type: WorkType::Mixed,
..Default::default()
};
let mut handle = WorkloadHandle::spawn(&config)?;
Set only the fields that matter for the test and let
..Default::default() fill in the rest. The spread-default form is the
canonical style in the ktstr codebase — it keeps examples pinned to
intent (num_workers, work_type) and has already absorbed additions
to WorkloadConfig (the NUMA memory-policy fields) without rotting.
Consult the WorkloadConfig rustdoc for the current field list.
spawn() forks num_workers child processes. Each child installs a
SIGUSR1 handler, then blocks on a pipe waiting for the start signal.
Workers do not begin their workload until start() is called.
For grouped work types (PipeIo, CachePipe, FutexPingPong,
FutexFanOut), spawn() validates that num_workers is divisible by
the group size and sets up inter-worker communication (pipes for
PipeIo/CachePipe, shared mmap pages for FutexPingPong/FutexFanOut).
Methods
worker_pids() -> Vec<libc::pid_t> – PIDs of all worker
processes. Used with CgroupManager::move_task() or move_tasks()
to place workers in cgroups before starting them.
start() – signals all workers to begin their workload by writing
to their start pipes. Idempotent: calling it twice has no effect.
Call this after moving workers into their target cgroups.
set_affinity(idx, cpus) -> Result<()> – sets CPU affinity for
the worker at index idx via sched_setaffinity. Use this for
per-worker pinning outside any cgroup, or when you need to change one
worker’s affinity without disturbing the rest. When all workers in a
cgroup should share the same CPU set, prefer
CgroupGroup::add_cgroup — it creates the cgroup,
writes cpuset.cpus once for the whole cgroup, and RAII-removes the
cgroup on drop (including error paths). Reach for
CgroupManager::set_cpuset directly only when
the cgroup’s lifetime must outlive the current scope; the RAII
wrapper is the default because it cleans up on every error path.
snapshot_iterations() -> Vec<u64> – reads all workers’ current
iteration counts from a shared memory region (MAP_SHARED). Each count
is monotonically increasing, read with relaxed ordering. Returns an
empty vec if no workers were spawned. Call periodically during the
workload’s run window to sample forward progress (e.g. to detect stalls
or compute instantaneous rates); the final per-worker totals come back
through stop_and_collect().
stop_and_collect(self) -> Vec<WorkerReport> – sends SIGUSR1 to
all workers, reads their serialized WorkerReport from report pipes,
and waits for exit. Auto-starts workers if start() was not called.
Workers that do not respond within a shared 5-second deadline are
killed with SIGKILL. Consumes the handle.
Typical usage
// 1. Spawn workers (blocked, waiting for start signal)
let mut handle = WorkloadHandle::spawn(&config)?;
// 2. Move workers into their target cgroup. `cgroup.procs` is
// tgid-scoped, so use `worker_pids_for_cgroup_procs()` — it
// bails for Thread-mode workers (whose pids share the harness's
// tgid) and points at `cgroup.threads` instead. Plain
// `worker_pids()` returns the raw pid set without the
// cgroup-procs safety check.
ctx.cgroups.move_tasks("cg_0", &handle.worker_pids_for_cgroup_procs()?)?;
// 3. Signal workers to start
handle.start();
// 4. Wait for workload duration
std::thread::sleep(ctx.duration);
// 5. Stop workers and collect telemetry
let reports: Vec<WorkerReport> = handle.stop_and_collect();
Drop behavior
Dropping a WorkloadHandle without calling stop_and_collect() sends
SIGKILL to all child processes and waits for them. This prevents
orphaned worker processes on error paths. Shared mmap regions (futex
pages and iteration counters) are unmapped on drop.
See also: CgroupManager for cgroup operations, CgroupGroup for RAII cleanup, TestTopology for cpuset generation, Worker Processes for the two-phase start protocol and telemetry details.
CgroupManager
CgroupManager manages cgroup v2 filesystem operations. It creates,
configures, and removes cgroups under a parent directory.
use ktstr::prelude::*;
pub struct CgroupManager {
parent: PathBuf,
}
Construction
use std::collections::BTreeSet;
let cgroups = CgroupManager::new("/sys/fs/cgroup/ktstr");
let mut controllers = BTreeSet::new();
controllers.insert(Controller::Cpuset);
controllers.insert(Controller::Cpu);
cgroups.setup(&controllers)?; // create parent dir, enable cpuset + cpu controllers
new() sets the parent path. setup() takes a
&BTreeSet<Controller> (variants: Cpuset, Cpu, Memory,
Pids, Io), creates the parent directory if it does not exist,
and enables the requested controllers on every ancestor from
/sys/fs/cgroup down to the parent by writing to each level’s
cgroup.subtree_control. An empty set creates the directory and
returns without touching subtree_control. The deterministic
BTreeSet iteration order keeps the rendered subtree_control
write stable between runs.
Methods
parent_path() -> &Path – returns the parent cgroup directory path.
create_cgroup(name) – creates a child cgroup directory. Idempotent:
no error if the directory already exists. Supports nested paths
(e.g. "nested/deep"). For nested paths, enables +cpuset on
intermediate cgroups’ subtree_control.
remove_cgroup(name) – drains tasks from the child cgroup to the
cgroup filesystem root, then removes the directory. No error if the
cgroup does not exist.
set_cpuset(name, cpus) – writes cpuset.cpus for a child cgroup.
The BTreeSet<usize> is formatted as a compact range string via
TestTopology::cpuset_string() (e.g. "0-3,5,7-9").
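The compact range format can be produced by walking the sorted set and merging consecutive CPUs. The sketch below is an independent illustration of that idea; compact_cpuset is a made-up name and is not claimed to match TestTopology::cpuset_string() line for line.

```rust
use std::collections::BTreeSet;

/// Formats one run of consecutive CPUs as "n" or "start-end".
fn fmt_range(start: usize, end: usize) -> String {
    if start == end {
        start.to_string()
    } else {
        format!("{start}-{end}")
    }
}

/// Renders a CPU set in the kernel's cpuset list syntax, e.g. "0-3,5,7-9".
fn compact_cpuset(cpus: &BTreeSet<usize>) -> String {
    let mut iter = cpus.iter().copied();
    // An empty set renders as an empty string.
    let Some(mut start) = iter.next() else { return String::new() };
    let mut end = start;
    let mut parts: Vec<String> = Vec::new();
    for cpu in iter {
        if cpu == end + 1 {
            // Still inside the current run of consecutive CPUs.
            end = cpu;
            continue;
        }
        parts.push(fmt_range(start, end));
        start = cpu;
        end = cpu;
    }
    parts.push(fmt_range(start, end));
    parts.join(",")
}
```

Because BTreeSet iterates in ascending order, no explicit sort is needed before merging.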
clear_cpuset(name) – writes an empty string to cpuset.cpus,
which inherits the parent’s cpuset.
move_task(name, pid) – writes a single PID to the child cgroup’s
cgroup.procs.
move_tasks(name, pids) – moves all PIDs from a slice into the
child cgroup. Tolerates ESRCH (task exited between listing and
migration) with a warning. Retries EBUSY up to 3 times with 100ms
backoff for transient rejections from sched_ext BPF
cgroup_prep_move callbacks. Propagates EBUSY after retries
exhausted. Propagates all other errors immediately.
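The error policy described for move_tasks (skip ESRCH, retry EBUSY a bounded number of times, propagate everything else) can be sketched as a small retry loop. The function below is hypothetical: it takes the actual write as a closure, hard-codes the Linux errno values, and reads "up to 3 times" as three retries after the first attempt, which is an assumption.

```rust
use std::io;
use std::thread;
use std::time::Duration;

// Linux errno values, hard-coded so the sketch needs no external crate.
const ESRCH: i32 = 3;
const EBUSY: i32 = 16;

/// Moves each pid via `write_pid`. Returns how many pids were skipped
/// because the task had already exited (ESRCH).
fn move_tasks_with_retry<F>(pids: &[i32], backoff: Duration, mut write_pid: F) -> io::Result<usize>
where
    F: FnMut(i32) -> io::Result<()>,
{
    const MAX_EBUSY_RETRIES: u32 = 3; // assumed meaning of "up to 3 times"
    let mut skipped = 0;
    for &pid in pids {
        let mut retries = 0;
        loop {
            match write_pid(pid) {
                Ok(()) => break,
                Err(e) if e.raw_os_error() == Some(ESRCH) => {
                    // The task exited between listing and migration.
                    eprintln!("warning: pid {pid} exited before migration");
                    skipped += 1;
                    break;
                }
                Err(e) if e.raw_os_error() == Some(EBUSY) && retries < MAX_EBUSY_RETRIES => {
                    retries += 1;
                    thread::sleep(backoff);
                }
                // EBUSY after the retry budget, or any other error.
                Err(e) => return Err(e),
            }
        }
    }
    Ok(skipped)
}
```

Passing the write as a closure keeps the policy testable without a real cgroup filesystem; a production caller would pass a closure that writes the pid to cgroup.procs and use a 100 ms backoff.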
drain_tasks(name) – moves all tasks from a child cgroup to the
cgroup filesystem root (/sys/fs/cgroup) by reading cgroup.procs
and writing each PID to the root’s cgroup.procs. Drains to root
because the parent has subtree_control set and the kernel’s
no-internal-process constraint rejects writes to a cgroup with
active controllers.
cleanup_all() – recursively removes all child cgroups under the
parent (depth-first), draining tasks at each level. Keeps the parent
directory itself.
Timeout protection
All cgroup filesystem writes use a 2-second timeout. The write runs in a spawned thread; if it does not complete within the timeout, the caller gets an error. This prevents test hangs when cgroup operations block in the kernel (e.g. during scheduler reconfigurations).
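One common way to bound a blocking call is to run it on a helper thread and wait on a channel with a deadline. The sketch below shows that pattern in general form; run_with_timeout is an invented name, and nothing here is taken from ktstr's source.

```rust
use std::io;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `op` on a helper thread and gives up after `timeout`.
fn run_with_timeout<T, F>(timeout: Duration, op: F) -> io::Result<T>
where
    T: Send + 'static,
    F: FnOnce() -> io::Result<T> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already have given up; ignore a failed send.
        let _ = tx.send(op());
    });
    match rx.recv_timeout(timeout) {
        Ok(result) => result,
        Err(mpsc::RecvTimeoutError::Timeout) => {
            Err(io::Error::new(io::ErrorKind::TimedOut, "operation timed out"))
        }
        Err(mpsc::RecvTimeoutError::Disconnected) => {
            Err(io::Error::new(io::ErrorKind::Other, "helper thread exited without a result"))
        }
    }
}
```

A cgroup write would be wrapped as run_with_timeout(Duration::from_secs(2), move || std::fs::write(path, data)). Note that in this pattern a helper thread stuck in the kernel is left detached rather than cancelled; the caller simply stops waiting for it.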
Usage in scenarios
Scenarios access CgroupManager through Ctx.cgroups. The typical
pattern is:
fn custom_scenario(ctx: &Ctx) -> Result<AssertResult> {
let mut guard = CgroupGroup::new(ctx.cgroups);
guard.add_cgroup("cg_0", &cpuset)?;
let mut h = WorkloadHandle::spawn(&config)?;
ctx.cgroups.move_tasks("cg_0", &h.worker_pids_for_cgroup_procs()?)?;
h.start(); // workers block until start() is called
// ... run workload ...
// `guard` drops at end of scope and removes cg_0 even on error.
Ok(result)
}
Bypass CgroupGroup only when you need to hand the
cgroup’s lifetime to a different owner; the RAII wrapper is the default
because it removes the cgroup on every error path, not just the happy
path.
See also: CgroupGroup for RAII cleanup, WorkloadHandle for worker lifecycle, TestTopology for cpuset generation.
CgroupGroup
CgroupGroup is an RAII guard that removes cgroups on drop. It
prevents cgroup leaks when workload spawning or other operations fail
between cgroup creation and cleanup.
use ktstr::prelude::*;
#[must_use = "dropping a CgroupGroup immediately destroys the cgroups it manages"]
pub struct CgroupGroup<'a> {
cgroups: &'a dyn CgroupOps,
names: Vec<String>,
}
Methods
new(cgroups: &dyn CgroupOps) -> Self – creates an empty group
bound to any implementor of CgroupOps (e.g.
CgroupManager in production, an in-memory fake
in tests).
add_cgroup(name, cpuset) -> Result<()> – creates a cgroup and
sets its cpuset. The cgroup is tracked for removal on drop.
add_cgroup_no_cpuset(name) -> Result<()> – creates a cgroup
without setting a cpuset. The cgroup is tracked for removal on drop.
names() -> &[String] – returns the names of all tracked cgroups.
Drop behavior
When the CgroupGroup is dropped, it calls remove_cgroup() on each
tracked cgroup in reverse insertion order so nested children are
removed before their parents (a parent still holding child
directories would fail with ENOTEMPTY).
ENOENT is the one errno the drop swallows silently — it indicates
the directory is already gone (the post-condition cleanup owes), which
can legitimately happen via a TOCTOU race between the inner
exists() check and remove_dir. Every other error (EBUSY from a
surviving task, EACCES, a broken cgroupfs mount, etc.) is emitted
as a tracing::warn! record carrying the cgroup name, the full error
chain, and — for EBUSY or EACCES — a short remediation hint. The
drop never panics and never returns an error (it cannot), but
teardown failures are visible in logs rather than silently swallowed.
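The drop policy above (reverse insertion order, NotFound ignored, other errors logged) can be demonstrated on ordinary directories. This is a simplified stand-in, with a hypothetical DirGroup type that uses std::fs and eprintln! in place of cgroupfs operations and tracing::warn!.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// RAII guard that removes the directories it created, children first.
struct DirGroup {
    root: PathBuf,
    names: Vec<String>,
}

impl DirGroup {
    fn new(root: PathBuf) -> Self {
        Self { root, names: Vec::new() }
    }

    /// Creates `root/name` and tracks it for removal on drop.
    fn add(&mut self, name: &str) -> io::Result<()> {
        fs::create_dir_all(self.root.join(name))?;
        self.names.push(name.to_string());
        Ok(())
    }
}

impl Drop for DirGroup {
    fn drop(&mut self) {
        // Reverse insertion order: a nested child added after its parent
        // is removed before it, so the parent is empty when its turn comes.
        for name in self.names.iter().rev() {
            match fs::remove_dir(self.root.join(name)) {
                Ok(()) => {}
                // Already gone: the post-condition holds, stay silent.
                Err(e) if e.kind() == io::ErrorKind::NotFound => {}
                // Drop cannot return an error, so surface it in the log.
                Err(e) => eprintln!("warning: failed to remove {name}: {e}"),
            }
        }
    }
}
```

Adding "parent" and then "parent/child" and letting the guard fall out of scope removes both; removing them in insertion order would instead fail on the non-empty parent.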
Usage
CgroupGroup is the standard pattern for cgroup lifecycle management
in custom scenarios and in run_scenario() for data-driven scenarios.
fn custom_scenario(ctx: &Ctx) -> Result<AssertResult> {
let mut guard = CgroupGroup::new(ctx.cgroups);
guard.add_cgroup("cg_0", &cpuset_a)?;
guard.add_cgroup("cg_1", &cpuset_b)?;
// If WorkloadHandle::spawn() fails here, guard drops
// and both cgroups are removed automatically.
let mut h = WorkloadHandle::spawn(&config)?;
ctx.cgroups.move_tasks("cg_0", &h.worker_pids_for_cgroup_procs()?)?;
h.start(); // workers block until start() is called
// ... run workload ...
// guard drops at end of scope, removing cg_0 and cg_1.
Ok(result)
}
The helper function setup_cgroups()
returns a CgroupGroup alongside the worker handles:
let (handles, _guard) = setup_cgroups(ctx, 2, &wl)?;
// _guard lives until end of scope; cgroups are cleaned up on drop.
See also: CgroupManager for filesystem operations, WorkloadHandle for worker lifecycle, TestTopology for cpuset generation.
CI
Recipes for running ktstr tests in continuous integration.
Runner requirements
ktstr boots KVM virtual machines. CI runners must provide:
- /dev/kvm access (hardware virtualization enabled)
- Self-hosted runners or a provider that exposes KVM to the guest
GitHub-hosted ubuntu-latest runners do not expose /dev/kvm.
Use self-hosted runners with KVM labels:
runs-on: [self-hosted, X64] # x86_64 (minimum labels)
runs-on: [self-hosted, Linux, kvm, kernel-build, ARM64] # aarch64 (adjust labels to your runner pool)
See Troubleshooting: /dev/kvm not accessible for diagnosing KVM issues on runners, including cloud VM nested virtualization setup (GCP, AWS, Azure).
Runners also need the build dependencies listed in Getting Started: Prerequisites (clang, pkg-config, make, gcc, autotools) and at least 5 GB of free disk for kernel source extraction, build artifacts, and cached images.
Workflow setup
A minimal workflow that builds a kernel, caches it, and runs tests:
name: CI
on:
  push:
    branches: [main]
  pull_request:
jobs:
  test:
    runs-on: [self-hosted, X64]
    env:
      KTSTR_GHA_CACHE: "1"
    steps:
      - uses: actions/checkout@v5
      - uses: dtolnay/rust-toolchain@stable
        with:
          components: rustfmt
      - uses: taiki-e/install-action@v2
        with:
          tool: cargo-nextest
      - name: Install ktstr
        run: cargo install --path . --locked --bin ktstr --bin cargo-ktstr
      - name: Cache kernel images
        uses: actions/cache@v5
        with:
          path: ~/.cache/ktstr/kernels
          key: ktstr-kernels-x64-${{ hashFiles('ktstr.kconfig') }}
          restore-keys: ktstr-kernels-x64-
      - name: Build test kernel
        run: cargo ktstr kernel build
      - run: cargo ktstr test -- --profile ci --features integration
The test harness auto-discovers the built kernel. --profile ci
configures nextest timeouts and retry behavior; see
Nextest CI profile. KTSTR_GHA_CACHE enables
a remote kernel cache; see Caching. To pin a specific
kernel version, see Kernel pinning below.
Kernel pinning
Pin a specific kernel version via the matrix strategy:
strategy:
  fail-fast: false
  matrix:
    kernel-version: ['6.14', '7.0']
steps:
  # ...
  - name: Install ktstr
    run: cargo install --path . --locked --bin ktstr --bin cargo-ktstr
  - name: Build test kernel
    run: cargo ktstr kernel build ${{ matrix.kernel-version }}
  - run: cargo ktstr test --kernel ${{ matrix.kernel-version }} -- --profile ci --features integration
--kernel tells cargo ktstr test which cached kernel to use at
runtime. A major.minor prefix (e.g. 6.14) resolves to the highest
patch release in that series. See
Kernel discovery for the
full resolution chain.
When testing multiple kernel versions, add the version to the cache key (unlike the minimal workflow above, which omits it because it builds a single kernel):
key: ktstr-kernels-x64-${{ matrix.kernel-version }}-${{ hashFiles('ktstr.kconfig') }}
restore-keys: ktstr-kernels-x64-${{ matrix.kernel-version }}-
Caching
actions/cache persists ~/.cache/ktstr/kernels across runs, keyed
on hashFiles('ktstr.kconfig') so kconfig changes trigger a rebuild.
Set KTSTR_GHA_CACHE=1 to enable a remote cache layer that shares
kernels across jobs and workflow runs. Remote failures are non-fatal;
local cache is authoritative.
Budget-based test selection
Set KTSTR_BUDGET_SECS to limit test runtime:
- run: cargo ktstr test -- --profile ci --features integration
  env:
    KTSTR_BUDGET_SECS: "300"
The selector greedily picks tests that maximize feature coverage within the time budget. Useful for smoke-test jobs or constrained runners. See Running Tests: Budget-based test selection.
Coverage
Run tests under cargo ktstr coverage for coverage reports:
coverage:
  runs-on: [self-hosted, X64]
  steps:
    - uses: actions/checkout@v5
    - uses: dtolnay/rust-toolchain@stable
      with:
        components: rustfmt,llvm-tools-preview
    - uses: taiki-e/install-action@v2
      with:
        tool: cargo-llvm-cov,cargo-nextest
    - name: Install ktstr
      run: cargo install --path . --locked --bin ktstr --bin cargo-ktstr
    - name: Cache kernel images
      uses: actions/cache@v5
      with:
        path: ~/.cache/ktstr/kernels
        key: ktstr-kernels-x64-${{ hashFiles('ktstr.kconfig') }}
        restore-keys: ktstr-kernels-x64-
    - name: Build test kernel
      run: cargo ktstr kernel build
    - run: cargo ktstr coverage -- --profile ci --lcov --output-path lcov.info --features integration --exclude-from-report scx-ktstr
Requires llvm-tools-preview rustup component and cargo-llvm-cov.
Pass --exclude-from-report <crate> to exclude scheduler crates from
coverage reports (the example excludes scx-ktstr, the project’s
own test fixture scheduler).
Test statistics
Collect test statistics after the test run:
- name: Test statistics
  if: ${{ !cancelled() }}
  run: cargo ktstr stats
stats reads sidecar JSON files from target/ktstr/
and prints gauntlet analysis, BPF verifier stats, callback profiles,
and KVM stats. The if: !cancelled() condition ensures stats are
collected even on test failure. See
cargo-ktstr stats for
subcommands and options.
aarch64
aarch64 runners use the same workflow as x64. Copy the x64 workflow above and apply these differences:
- Runner labels: [self-hosted, Linux, kvm, kernel-build, ARM64] (adjust to match your runner pool).
- Cache key prefix: arm64 instead of x64.
- sccache must be installed on every runner the workflow targets (x64 and arm64). The workflow’s global RUSTC_WRAPPER=sccache applies to every job; a runner without sccache on $PATH fails the first cargo invocation.
Performance mode
CI runners may lack CAP_SYS_NICE, rtprio limits, or enough host
CPUs for exclusive LLC reservation. Disable performance mode to skip
these features:
- run: cargo ktstr test -- --profile ci --features integration
  env:
    KTSTR_NO_PERF_MODE: "1"
Tests with performance_mode=true are skipped entirely under
--no-perf-mode. See
Performance Mode: Disabling.
Environment variables
See the full reference for all
environment variables. The CI-relevant ones are KTSTR_GHA_CACHE,
KTSTR_BUDGET_SECS, KTSTR_NO_PERF_MODE, KTSTR_KERNEL, and
KTSTR_CACHE_DIR.
Nextest CI profile
The workspace ships a ci nextest profile in .config/nextest.toml.
Compared to the default profile, it raises the slow-timeout
termination threshold from 2 to 3 cycles (terminate-after = 3),
defers per-test output until the run completes
(failure-output = "final"), and continues past failures
(fail-fast = false). Use it with --profile ci.
See Tests pass locally but fail in CI for common CI failure causes.
Troubleshooting
Build errors
clang not found
error: failed to run custom build command for `ktstr`
...
clang: No such file or directory
The BPF skeleton build (libbpf-cargo) invokes clang to compile
.bpf.c sources. Install clang:
- Debian/Ubuntu: sudo apt install clang
- Fedora: sudo dnf install clang
pkg-config not found
error: failed to run custom build command for `libbpf-sys`
...
pkg-config: command not found
libbpf-sys uses pkg-config during its vendored build. Install it:
- Debian/Ubuntu: sudo apt install pkg-config
- Fedora: sudo dnf install pkgconf
autotools errors (autoconf, autopoint, aclocal)
autoreconf: command not found
aclocal: command not found
autopoint: command not found
The vendored libbpf-sys build compiles bundled libelf and zlib from source using autotools. These libraries are not system dependencies – they ship with libbpf-sys – but the autotools toolchain is needed to build them. Install:
- Debian/Ubuntu: sudo apt install autoconf autopoint flex bison gawk
- Fedora: sudo dnf install autoconf gettext-devel flex bison gawk
make or gcc not found
busybox build requires 'make' — install build-essential (Debian/Ubuntu) or base-devel (Fedora/Arch)
busybox build requires 'gcc' — install build-essential (Debian/Ubuntu) or base-devel (Fedora/Arch)
The build script compiles busybox from source for guest shell mode. This requires make and gcc.
- Debian/Ubuntu: sudo apt install make gcc
- Fedora: sudo dnf install make gcc
BTF errors
no BTF source found. Set KTSTR_KERNEL to a kernel build directory,
or ensure /sys/kernel/btf/vmlinux exists.
build.rs generates vmlinux.h from kernel BTF data. It searches
the kernel discovery chain (KTSTR_KERNEL, ./linux, ../linux,
installed kernel) for a vmlinux file, falling back to
/sys/kernel/btf/vmlinux. Most distros ship
/sys/kernel/btf/vmlinux with CONFIG_DEBUG_INFO_BTF enabled.
Fixes:
- Verify BTF is available: ls /sys/kernel/btf/vmlinux
- If missing, set KTSTR_KERNEL to a kernel build directory that contains a vmlinux with BTF: export KTSTR_KERNEL=/path/to/linux
- Build a kernel with CONFIG_DEBUG_INFO_BTF=y.
- Some minimal/cloud kernels strip BTF. Use a distro kernel or build your own.
busybox download failure
failed to obtain busybox source.
tarball (https://github.com/mirror/busybox/archive/refs/tags/1_36_1.tar.gz): download: ...
git clone (https://github.com/mirror/busybox.git): ...
Check network connectivity. First build requires internet access.
build.rs downloads busybox source on first build (tarball first,
git clone fallback). Subsequent builds use the cached binary in
$OUT_DIR.
Fixes:
- Verify network connectivity to github.com.
- If behind a proxy, set HTTP_PROXY / HTTPS_PROXY.
- After a successful first build, no network access is needed unless cargo clean removes the cached binary.
/dev/kvm not accessible
The host-side pre-flight emits one of the following, depending on whether the device node is missing or merely unreadable:
/dev/kvm not found. KVM requires:
- Linux kernel with KVM support (CONFIG_KVM)
- Access to /dev/kvm (check permissions or add user to 'kvm' group)
- Hardware virtualization enabled in BIOS (VT-x/AMD-V)
/dev/kvm: permission denied. Add your user to the 'kvm' group:
sudo usermod -aG kvm $USER
then log out and back in.
ktstr boots Linux kernels in KVM virtual machines. The host must have
KVM enabled and the user must have read+write access to /dev/kvm.
Diagnose:
- Check the device exists and inspect its permissions and owning group: ls -l /dev/kvm. Typical output: crw-rw---- 1 root kvm 10, 232 ....
- Confirm the kvm group exists and see its members: getent group kvm.
Fixes:
- Load the KVM module: modprobe kvm_intel or modprobe kvm_amd.
- Follow the group-membership hint in the error text above (log out and back in afterward for the group change to take effect).
- On cloud VMs (GCP, AWS, Azure) or nested hypervisors, nested virtualization is typically off by default. Enable it per the provider’s instructions (e.g. GCP --enable-nested-virtualization, AWS metal / .metal instance types, Azure Dv3/Ev3+ with nested virt).
- In CI, ensure the runner has KVM access (e.g. runs-on: [self-hosted, kvm]).
No kernel found
no kernel found
hint: set KTSTR_KERNEL to a kernel source directory, a version (e.g. `6.14.2`), or a cache key (see `cargo ktstr kernel list`), or run `cargo ktstr kernel build` to populate the cache
hint: or set KTSTR_TEST_KERNEL=/path/to/bzImage to point at a pre-built bootable image directly (bypasses KTSTR_KERNEL resolution)
On aarch64 the second hint says Image instead of bzImage.
ktstr shell and cargo ktstr shell auto-download the latest
stable kernel when no --kernel is specified and no kernel is found
via the discovery chain. See
Kernel auto-download failures for
download-specific errors.
ktstr needs a bootable Linux kernel image (bzImage on x86_64,
Image on aarch64). See
Kernel discovery for the
search order.
Fixes:
- Download and cache a kernel: cargo ktstr kernel build
- Build from a local tree: cargo ktstr kernel build --source ../linux
- Set KTSTR_TEST_KERNEL to an explicit image path.
- The host’s installed kernel works for basic testing.
Scheduler not found
scheduler 'scx_mitosis' not found. Set KTSTR_SCHEDULER or
place it next to the test binary or in target/{debug,release}/
When using SchedulerSpec::Discover, ktstr searches for the scheduler
binary in:
1. KTSTR_SCHEDULER environment variable.
2. Sibling of the current executable (and, when the test binary lives under target/{debug,release}/deps/, the parent of deps/ one level up; this covers the nextest / integration-test layout where the scheduler binary sits next to the test binary’s parent).
3. target/debug/.
4. target/release/.
5. On-demand build via cargo build against the scheduler’s package name. ktstr invokes the build itself when the preceding four locations have no match, so a fresh checkout with an unbuilt scheduler still produces a usable binary without the caller pre-running cargo build.
Fixes:
- Build the scheduler first: cargo build -p scx_mitosis (skipped automatically if step 5 above can build it on demand, but pre-building makes the first test run faster).
- Set KTSTR_SCHEDULER=/path/to/binary.
- Use SchedulerSpec::Path for an explicit path in #[ktstr_test].
Scheduler died
scheduler process died unexpectedly after completing step 2 of 5 (12.3s into test)
The scheduler process died while the scenario was running. This is usually a crash. The exact message varies by when the crash was detected (between steps, during workload, after completion).
The failure output contains diagnostic sections (each present only when relevant):
- --- scheduler log ---: the scheduler’s stdout and stderr, cycle-collapsed for readability.
- --- diagnostics ---: init stage classification, VM exit code, and the last 20 lines of kernel console output.
- --- sched_ext dump ---: sched_ext_dump trace lines from the guest kernel (present when a SysRq-D dump fired).
Set RUST_BACKTRACE=1 to force --- diagnostics --- on all
failures, not just scheduler deaths.
Next steps:
- Check the --- scheduler log --- for the crash reason.
- Check --- diagnostics --- for BPF errors or kernel oops in the kernel console.
- Enable auto_repro in the test to capture the crash path with BPF probes. See Auto-Repro.
- Run with a longer duration and specific flags to narrow the reproducer.
See Investigate a Crash for the complete failure output format and auto-repro walkthrough.
Insufficient hugepages
performance_mode: WARNING: no 2MB hugepages available, guest memory will use regular pages
performance_mode: WARNING: need N 2MB hugepages, only K free — falling back to regular pages
Performance mode requests 2MB
hugepages for guest memory. The first form fires when no 2MB hugepages
are reserved on the host (free == 0); the second fires when some are
reserved but fewer than the run needs. In both cases the VM falls back
to regular pages and continues to boot.
Fix:
Allocate hugepages before the run:
echo 2048 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
Worker assertion failures
stuck 4500ms on cpu2 at +3200ms (threshold 3000ms)
unfair cgroup: spread=42% (8-50%) 4 workers on 4 cpus (threshold 35%)
The Assert checks (max_gap_ms, max_spread_pct, etc.) detected a
worker metric outside the configured thresholds.
Fixes:
- Check whether the topology has enough CPUs for the scenario. Small topologies produce higher contention, larger gaps, and more spread.
- Use execute_steps_with() with a custom Assert to override thresholds for scenarios that need relaxed limits.
- Check the scheduler’s behavior under the specific flag profile that triggered the failure.
Cgroup name typos
No such file or directory: /sys/fs/cgroup/.../nonexistent/cgroup.procs
A cgroup name passed to Op::SetCpuset, Op::Spawn, or
CgroupManager::move_tasks does not match a previously created
cgroup. Cgroup names are case-sensitive strings.
Fixes:
- Verify the cgroup name matches the name in Op::AddCgroup or CgroupDef::named().
- When using dynamic cgroup names (e.g. format!("cg_{i}")), ensure the same formatting is used in all ops referencing that cgroup.
CpusetSpec errors
cgroup 'cg_0': CpusetSpec validation failed: not enough usable CPUs (4) for 8 partitions
cgroup 'cg_1': CpusetSpec validation failed: index 3 >= partition count 3
cgroup 'cg_2': CpusetSpec validation failed: Range fracs must lie in [0.0, 1.0]: start_frac=-1, end_frac=0.5
A CpusetSpec cannot produce a valid cpuset for the test topology.
execute_steps treats this as a hard error and aborts the step so the
downstream slicing/arithmetic in CpusetSpec::resolve is never reached
with inputs that would panic.
Fixes:
- Guard with a topology check before creating the step: if ctx.topo.usable_cpus().len() < needed { return Ok(AssertResult::skip(...)); }
- Call CpusetSpec::validate(&ctx) in your scenario builder so failures surface before execute_steps runs.
- Reduce the partition count or use CpusetSpec::Llc instead of Disjoint on topologies with fewer CPUs than partitions.
- For Range/Overlap, keep fractions finite and inside [0.0, 1.0]; Range additionally requires start_frac < end_frac.
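The rules listed above can be expressed as plain checks. The sketch below is a standalone restatement under stated assumptions, not the body of CpusetSpec::validate: the function names and error strings are hypothetical, and only the conditions themselves come from the text.

```rust
/// Check the fraction rules for a Range-style spec:
/// both values finite, both inside [0.0, 1.0], and start < end.
/// (Hypothetical helper; error strings are illustrative only.)
fn check_range_fracs(start_frac: f64, end_frac: f64) -> Result<(), String> {
    if !start_frac.is_finite() || !end_frac.is_finite() {
        return Err("fractions must be finite".to_string());
    }
    let unit = 0.0..=1.0;
    if !unit.contains(&start_frac) || !unit.contains(&end_frac) {
        return Err("fractions must lie in [0.0, 1.0]".to_string());
    }
    if start_frac >= end_frac {
        return Err("start_frac must be less than end_frac".to_string());
    }
    Ok(())
}

/// Check that there is at least one usable CPU per partition.
fn check_partitions(usable_cpus: usize, partitions: usize) -> Result<(), String> {
    if usable_cpus < partitions {
        return Err(format!(
            "not enough usable CPUs ({usable_cpus}) for {partitions} partitions"
        ));
    }
    Ok(())
}
```

Running checks like these in a scenario builder lets a test skip cleanly on a small topology instead of failing inside the step.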
Worker count mismatches
PipeIo requires num_workers divisible by 2, got 3
Grouped work types (PipeIo, FutexPingPong, CachePipe,
FutexFanOut, FanOutCompute) require num_workers divisible by their
group size. WorkType::worker_group_size() returns the divisor.
Fixes:
- Set CgroupDef::workers(n) to a value divisible by the work type’s group size (2 for pipe/futex pairs, fan_out + 1 for FutexFanOut and FanOutCompute).
- Use an ungrouped work type (SpinWait, Mixed, Bursty, IoSyncWrite, IoRandRead, IoConvoy, YieldHeavy) if worker count flexibility is needed.
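If the worker count is computed rather than hard-coded, rounding it up to the next multiple of the group size avoids the error. The sketch below is a standalone illustration with hypothetical helper names; in a real test the divisor would come from WorkType::worker_group_size().

```rust
/// Round `workers` up to the next multiple of `group_size`.
/// Panics if `group_size` is zero.
fn round_up_to_group(workers: usize, group_size: usize) -> usize {
    assert!(group_size > 0, "group size must be positive");
    ((workers + group_size - 1) / group_size) * group_size
}

/// Group size for the fan-out work types, per the rule above:
/// `fan_out` receivers plus one sender-side worker.
fn fan_out_group_size(fan_out: usize) -> usize {
    fan_out + 1
}
```

For example, 3 workers with a pair-based work type round up to 4, and 7 workers with a fan-out of 4 round up to 10.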
Cache corruption
6.14.2-tarball-x86_64-kc... (corrupt: metadata.json malformed: ...)
warning: entries marked (corrupt) cannot be used — cached metadata is missing, malformed, or references a missing image. Inspect the entry directory under ~/.cache/ktstr/kernels to remove it manually, or run `kernel clean --corrupt-only --force` which removes ONLY corrupt entries and leaves valid ones intact. ...
A cached kernel entry has missing, unparseable, or
schema-drifted metadata.json, or metadata that references an
image file that is no longer present. This can happen after a
partial write (e.g. disk full, killed process), or after a ktstr
release that evolved the metadata schema in a
non-backward-compatible way. cargo ktstr kernel list surfaces
these as (corrupt: ...) rows; the trailing footer on stderr
summarizes the remediation options. CacheDir::lookup returns
None for corrupt entries so test runs at a specific cache key
fall through to the normal re-build path.
The JSON form (cargo ktstr kernel list --json) emits an
error_kind field on every corrupt entry — one of "missing",
"unreadable", "schema_drift", "malformed", "truncated",
"parse_error", "image_missing", or "unknown" — so CI
scripts can dispatch on a stable token without parsing the
free-form error string.
Fixes:
- Remove ONLY corrupt entries (keeps valid ones intact): cargo ktstr kernel clean --corrupt-only --force
- Remove the corrupt entry along with everything else: cargo ktstr kernel clean --force
- Rebuild a specific version after cleanup: cargo ktstr kernel build --force 6.14.2
- Override the cache directory via KTSTR_CACHE_DIR if the default location is on a problematic filesystem.
- See cargo ktstr kernel clean for all cleanup options, including --keep N --force to preserve the N newest entries.
Stale vmlinux.btf or default.profraw in kernel source tree
After upgrading from an older ktstr version, you may notice extra files in your kernel source directory:
- <source>/vmlinux.btf — a sidecar of the kernel’s .BTF section bytes. Older ktstr versions wrote it next to whichever vmlinux they parsed, including source-tree builds. Current ktstr only writes the sidecar when the vmlinux path is inside the cache root (~/.cache/ktstr/kernels/ or whatever KTSTR_CACHE_DIR points at) so source trees stay pristine.
- <source>/default.profraw — an LLVM coverage runtime artifact. Older ktstr versions could leave it in cwd when a coverage-instrumented cargo ktstr test was launched from inside the kernel tree. Current ktstr injects LLVM_PROFILE_FILE=<cargo-ktstr-binary-parent>/llvm-cov-target/default-{pid}-{binary_hash}.profraw for the bare nextest path so the profraw lands next to the cargo-ktstr binary regardless of cwd. See profraw layout for the per-population directory map.
Both files are leftover state from prior runs and are safe to remove:
rm -f /path/to/linux/vmlinux.btf
rm -f /path/to/linux/default.profraw
If you also see them turn up under a different ktstr-driven
source tree, check that you are running a current ktstr build
(re-run cargo build or cargo install ktstr to pick up the
fix) before deleting again — the guards live in the resolver,
not on disk, so an old binary will keep regenerating these
files.
Cache directory not found
HOME is unset; cannot resolve cache directory. The container init or login shell did not assign HOME — set it to an absolute path, or set KTSTR_CACHE_DIR to an absolute path (e.g. /tmp/ktstr-cache) or XDG_CACHE_HOME to specify a cache location explicitly.
HOME is set to the empty string; cannot resolve cache directory. An empty HOME usually means a Dockerfile or shell rc has `export HOME=` or `ENV HOME=` with no value. Either set HOME to a real absolute path, or set KTSTR_CACHE_DIR to an absolute path (e.g. /tmp/ktstr-cache) or XDG_CACHE_HOME to specify a cache location explicitly.
The kernel image cache requires a writable directory. ktstr resolves
it as: KTSTR_CACHE_DIR > $XDG_CACHE_HOME/ktstr/kernels/ >
$HOME/.cache/ktstr/kernels/. The first form fires when HOME is
absent from the environment (typical of bare container inits or
systemd units with no Environment=HOME=...); the second fires when
HOME is present but assigned to the empty string.
Fix: Set KTSTR_CACHE_DIR to an explicit path, or ensure HOME
is set to a real absolute path.
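The resolution order above can be sketched as a pure function over the three environment values. This is an illustration, not ktstr's resolver: the function names are hypothetical, and treating an empty KTSTR_CACHE_DIR or XDG_CACHE_HOME as unset is an assumption (the text only specifies the behaviour for an empty HOME).

```rust
use std::path::PathBuf;

/// Treat an empty value as unset (assumption for the first two variables).
fn non_empty(v: Option<&str>) -> Option<&str> {
    v.filter(|s| !s.is_empty())
}

/// Resolve the cache directory in the documented precedence order:
/// KTSTR_CACHE_DIR, then $XDG_CACHE_HOME/ktstr/kernels,
/// then $HOME/.cache/ktstr/kernels.
fn resolve_cache_dir(
    ktstr_cache_dir: Option<&str>,
    xdg_cache_home: Option<&str>,
    home: Option<&str>,
) -> Result<PathBuf, String> {
    if let Some(dir) = non_empty(ktstr_cache_dir) {
        return Ok(PathBuf::from(dir));
    }
    if let Some(xdg) = non_empty(xdg_cache_home) {
        return Ok(PathBuf::from(xdg).join("ktstr").join("kernels"));
    }
    match home {
        // The two failure cases correspond to the two messages above.
        None => Err("HOME is unset".to_string()),
        Some("") => Err("HOME is set to the empty string".to_string()),
        Some(h) => Ok(PathBuf::from(h).join(".cache").join("ktstr").join("kernels")),
    }
}
```

In a real program the three arguments would come from std::env::var; passing them in explicitly keeps the precedence logic easy to test.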
Stale kconfig
warning: entries marked (stale kconfig) were built against a different ktstr.kconfig.
Rebuild with: kernel build --force <entry version>
cargo ktstr kernel list marks entries whose stored ktstr_kconfig_hash
differs from the current embedded ktstr.kconfig fragment. This
happens after updating ktstr (which may change the kconfig fragment).
Fix:
Rebuilds happen automatically on the next cargo ktstr kernel build
for stale entries. Use --force to override the cache for other
reasons. See cargo ktstr kernel list
for the full listing output.
Kernel auto-download failures
ktstr: no kernel found, downloading latest stable
fetch https://www.kernel.org/releases.json: <error>
ktstr auto-downloads a kernel when no --kernel is specified and no
kernel is found via the discovery chain (see
Kernel discovery). The same
download path runs when --kernel specifies a version (e.g.
--kernel 6.14.2) that is not in the cache. The CLI label varies:
ktstr: for the standalone binary, cargo ktstr: for the cargo
subcommand.
The <error> above is the underlying reqwest error (DNS resolution,
connection refused, timeout, TLS handshake failure).
fetch https://www.kernel.org/releases.json: HTTP 503
kernel.org returned a non-success status code.
no stable kernel with patch >= 8 found in releases.json
ktstr requires a stable or longterm release with patch version >= 8 to avoid brand-new major versions that may have build issues. This error means releases.json contained no qualifying version.
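The patch-version rule can be illustrated with a small filter over version strings. The sketch below is hypothetical: it assumes the candidate list has already been extracted from releases.json and that the newest qualifying version wins, neither of which is confirmed by the text beyond the patch >= 8 requirement.

```rust
/// Parse "major.minor[.patch]" into a tuple; a missing patch counts as 0.
/// Anything else (for example an -rc suffix) yields `None`.
fn parse_version(v: &str) -> Option<(u32, u32, u32)> {
    let mut parts = v.split('.');
    let major: u32 = parts.next()?.parse().ok()?;
    let minor: u32 = parts.next()?.parse().ok()?;
    let patch: u32 = match parts.next() {
        Some(p) => p.parse().ok()?,
        None => 0,
    };
    if parts.next().is_some() {
        return None;
    }
    Some((major, minor, patch))
}

/// Pick the newest version whose patch level is at least `min_patch`.
/// (Hypothetical helper; "newest wins" is an assumption.)
fn pick_latest_qualifying<'a>(versions: &[&'a str], min_patch: u32) -> Option<&'a str> {
    versions
        .iter()
        .copied()
        .filter_map(|v| parse_version(v).map(|t| (t, v)))
        .filter(|(t, _)| t.2 >= min_patch)
        .max_by_key(|(t, _)| *t)
        .map(|(_, v)| v)
}
```

Given 6.15.2, 6.14.10, and 6.12.30, the filter skips 6.15.2 (patch 2) and selects 6.14.10. When nothing qualifies it returns None, which corresponds to the error above.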
download https://cdn.kernel.org/.../linux-6.14.10.tar.xz: <error>
Network failure during tarball download (same causes as above).
extract tarball: <error>
Tarball extraction failed. Common causes: disk full, insufficient permissions on the temp directory, or a truncated download.
kernel built but cache store failed — cannot return image from temporary directory
The kernel built successfully but could not be stored in the cache. Check disk space and permissions on the cache directory.
For version-specific download errors (HTTP 404, HTML responses), see Kernel download failures.
Fixes:
- Verify network connectivity: curl -sI https://www.kernel.org/releases.json
- Check DNS resolution for kernel.org and cdn.kernel.org.
- Check disk space — the download, extraction, and build require significant disk space.
- If behind a proxy, set HTTP_PROXY, HTTPS_PROXY, and NO_PROXY (reqwest respects these environment variables).
- Override the cache directory via KTSTR_CACHE_DIR if the default location has insufficient space or permissions.
- Pre-download a kernel explicitly: cargo ktstr kernel build 6.14.10 to isolate whether the failure is in version resolution or download.
Kernel download failures
These errors occur when cargo ktstr kernel build or --kernel
specifies an explicit version. For network and extraction errors
during auto-download, see
Kernel auto-download failures.
version 6.14.22 not found. latest 6.14.x: 6.14.10
The requested version does not exist on kernel.org. When a version in the same major.minor series is available in releases.json, the error suggests it.
version 5.4.99 not found
When the series is EOL or not in releases.json, only the “not found” message appears (no suggestion).
RC tarball not found: https://git.kernel.org/torvalds/t/linux-6.15-rc3.tar.gz
RC tarballs are removed from git.kernel.org after the stable version
ships. Use --git with a git.kernel.org URL to clone the tag instead.
download ...: server returned HTML instead of tarball (URL may be invalid)
Some CDN error pages return HTTP 200 with text/html content type.
The download rejects these responses.
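A check of this kind can be as small as inspecting the Content-Type header before writing the body to disk. The sketch below is an assumption about how such a check might look, not ktstr's actual code, and is_html_content_type is a hypothetical name.

```rust
/// True when a Content-Type header value names the text/html media type.
/// Parameters such as `; charset=utf-8` are ignored and the comparison
/// is case-insensitive. (Hypothetical helper.)
fn is_html_content_type(content_type: &str) -> bool {
    let media_type = content_type.split(';').next().unwrap_or("").trim();
    media_type.eq_ignore_ascii_case("text/html")
}
```

A downloader would call this on the response header and bail out with the error shown above when it returns true, even if the status code was 200.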
Fixes:
- Check the suggested version in the error message.
- Verify the version exists: check https://www.kernel.org/releases.json for available versions.
- For RC releases, use --git with a git.kernel.org URL instead of a tarball download.
- Run cargo ktstr kernel build without a version to automatically fetch the latest stable.
Shell mode issues
stdin must be a terminal
stdin must be a terminal for interactive shell mode
cargo ktstr shell requires a terminal for bidirectional I/O
forwarding. Piped or redirected stdin is rejected.
Fix: Run from an interactive terminal session.
include file not found
-i strace: not found in filesystem or PATH
Bare names (without /, ., or ..) are searched in PATH. If the
binary is not in PATH, use an explicit path.
--include-files path not found: ./missing-file
Explicit paths (containing / or starting with .) must exist on
disk.
Fix: Verify the file exists and use the correct path.
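The bare-name versus explicit-path distinction can be sketched as a one-line classifier. It follows the second rule above (contains / or starts with .) and is an illustration only; the enum and function names are hypothetical.

```rust
/// How an include argument is resolved.
#[derive(Debug, PartialEq)]
enum IncludeKind {
    /// Bare name: searched in PATH.
    PathLookup,
    /// Explicit path: must exist on disk as written.
    ExplicitPath,
}

/// Classify an include argument: anything containing '/' or starting
/// with '.' is an explicit path, everything else is a bare name.
fn classify_include(arg: &str) -> IncludeKind {
    if arg.contains('/') || arg.starts_with('.') {
        IncludeKind::ExplicitPath
    } else {
        IncludeKind::PathLookup
    }
}
```

Under this rule strace is looked up in PATH, while ./missing-file and /usr/bin/strace must exist at the given location.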
include directory contains no files
warning: -i ./empty-dir: directory contains no regular files
The directory passed to --include-files was walked recursively but
contained no regular files. FIFOs, device nodes, and sockets are
skipped during the walk.
Fix: Verify the directory contains the files you expect.
Model load failed
GGUF model load failed at /home/.../models/Qwen3-4B-Q4_K_M.gguf. The
file may be corrupt or incompatible with the linked llama.cpp version
— delete the file and re-run `cargo ktstr model fetch` to download
a fresh copy. Check stderr for the upstream llama.cpp rejection reason.
The host-side LLM extraction backend (OutputFormat::LlmExtract)
could not load the cached GGUF weights. The cached file is either
corrupt (partial download, disk error) or incompatible with the
linked llama.cpp version.
Diagnose:
- Re-run with RUST_LOG=llama-cpp-2=info (or =debug for more detail) to surface llama.cpp’s own rejection reason on stderr. The first call to the inference engine routes llama_cpp_2::send_logs_to_tracing events through the tracing subscriber under target "llama-cpp-2" (literal hyphens — see Environment Variables for the EnvFilter shape).
- cargo ktstr model status reports the cache path and verdict (Matches, Mismatches, CheckFailed, NotCached).
Fix:
- Delete the cached file and re-fetch: cargo ktstr model clean && cargo ktstr model fetch. clean removes both the GGUF artifact and its .mtime-size warm-cache sidecar; fetch re-downloads from the pinned URL and SHA-checks the result.
- If model status reports Mismatches, the local file’s hash diverged from the pinned digest — cargo ktstr model fetch will refuse to overwrite a corrupt cache and the explicit clean is required first.
- If you set KTSTR_MODEL_OFFLINE=1, unset it for the re-fetch. See cargo ktstr model.
Flock timeout / NFS rejection
flock LOCK_EX on run-dir target/ktstr/6.14-abc1234 timed out after
30s (lockfile target/ktstr/.locks/6.14-abc1234.lock, holders:
pid=12345 cmd=cargo-ktstr test --kernel 6.14). A peer cargo
ktstr test process is writing sidecars to the same
{kernel}-{project_commit} directory; wait for it to finish or kill
it, then retry.
A peer process is holding the per-run-key advisory flock(2)
that serializes sidecar writes; the helper polled for 30 s and
gave up. Run-dir locks live at
{runs_root}/.locks/{kernel}-{project_commit}.lock and serialize
the (pre-clear + write) cycle so two concurrent ktstr runs
sharing the same key can’t tear partially-written sidecars.
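To locate the lockfile for a given run by hand, the path template above can be expanded directly. The sketch below only restates that template; the function names are hypothetical and the argument values are whatever your run uses.

```rust
use std::path::{Path, PathBuf};

/// Directory that holds the sidecars for one run key.
fn run_dir_path(runs_root: &Path, kernel: &str, project_commit: &str) -> PathBuf {
    runs_root.join(format!("{kernel}-{project_commit}"))
}

/// Lockfile path for one run key, following the documented template
/// `{runs_root}/.locks/{kernel}-{project_commit}.lock`.
fn run_lock_path(runs_root: &Path, kernel: &str, project_commit: &str) -> PathBuf {
    runs_root
        .join(".locks")
        .join(format!("{kernel}-{project_commit}.lock"))
}
```

With a runs root of target/ktstr, kernel 6.14, and commit abc1234, this yields target/ktstr/.locks/6.14-abc1234.lock, the same path that appears in the timeout message.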
target/ktstr/.locks/6.14-abc1234.lock: filesystem NFS is not
supported for ktstr lockfiles (NFSv3 is advisory-only without
an NLM peer; NFSv4 byte-range locking does not cover flock(2)).
Move the lockfile path to a local filesystem (tmpfs, ext4, xfs,
btrfs, f2fs, bcachefs).
try_flock rejects NFS, CIFS, SMB2, CephFS, AFS, and FUSE mounts
because flock(2) semantics on those filesystems are unreliable
(see Resource Budget — Filesystem requirement
for the per-filesystem rationale).
Diagnose:
- cargo ktstr locks (or ktstr locks --watch 1s) prints every ktstr flock currently held on the host with PID + cmdline, including per-run-key sidecar locks under the “Run-dir locks” section (see cargo ktstr locks).
- cat /proc/locks | grep '<lockfile-path-from-error>' falls back to the kernel’s own flock enumeration when the holder is outside ktstr.
- stat -f -c '%T' <runs-root> reports the filesystem type when the rejection error names NFS/CIFS/SMB/CephFS/AFS/FUSE.
Fix:
- For a peer-holder timeout: wait for the peer to finish, kill it (kill <pid> from the holder list), or retry with the peer done.
- For an NFS / remote-fs rejection: relocate the runs root to a local filesystem. Set KTSTR_SIDECAR_DIR to a local path (/tmp/ktstr-sidecars, a tmpfs mount) — note that this override path also skips the cross-process flock, so concurrent runs targeting the same KTSTR_SIDECAR_DIR have no serialization between them. Use the override only for a single-process run or per-process distinct paths.
- The kernel cache’s lockfiles ({cache_root}/.locks/*.lock) face the same constraint — override KTSTR_CACHE_DIR to a local filesystem if the default resolves to NFS. See Cache directory not found.
Tests pass locally but fail in CI
Common causes:
- No KVM: CI runners need hardware virtualization. Check for /dev/kvm access.
- Fewer CPUs: gauntlet topology presets up to 252 CPUs may exceed the runner’s capacity. Use smaller topologies.
- No kernel: set KTSTR_TEST_KERNEL in the CI environment.
- No CAP_SYS_NICE or rtprio: performance-mode tests require CAP_SYS_NICE or an rtprio limit for RT scheduling, and enough host CPUs for exclusive LLC reservation. Pass --no-perf-mode (or set KTSTR_NO_PERF_MODE=1) to disable all performance mode features. Tests with performance_mode=true are skipped entirely under --no-perf-mode.
- Debug thresholds: CI often runs debug builds. Debug builds use relaxed thresholds (3000ms gap, 35% spread) but may still hit limits on slow runners. See default thresholds.
Environment Variables
Environment variables that control ktstr behavior.
User-facing
| Variable | Description | Default |
|---|---|---|
KTSTR_KERNEL | Kernel identifier for cargo-build-time BTF resolution (read by build.rs) and runtime image discovery. Accepts a path (../linux), version string (6.14.2), or cache key (use cargo ktstr kernel list for actual keys). During cargo build, only paths are used (build.rs extracts BTF from vmlinux). At runtime, version strings and cache keys resolve via the XDG cache; paths search only the specified directory (error if no image found). Set automatically by cargo ktstr test --kernel. Overridden by KTSTR_KERNEL_LIST when present: under multi-kernel runs the test binary’s --list / --exact handlers consult KTSTR_KERNEL_LIST first and only fall back to KTSTR_KERNEL when the list env is unset; the producer-side cargo ktstr always sets KTSTR_KERNEL to the FIRST resolved entry alongside the full KTSTR_KERNEL_LIST so downstream code that inspects KTSTR_KERNEL directly still sees a valid path. | Auto-discovered |
KTSTR_KERNEL_LIST | Multi-kernel wire format label1=path1;label2=path2;… consumed by the test binary’s gauntlet expansion. Set by cargo ktstr test / coverage / llvm-cov when the resolved kernel set has 2 or more entries; the test binary’s --list handler emits one variant per kernel (suffix gauntlet/{name}/{preset}/{profile}/{kernel_label} or ktstr/{name}/{kernel_label}) and the --exact handler strips the suffix and re-exports KTSTR_KERNEL to the matching directory before booting the VM. Semicolon is the entry separator (paths can carry : on POSIX); = separates label from path. Empty value or unset means “single-kernel mode” — the test binary falls back to KTSTR_KERNEL. | None (single-kernel) |
KTSTR_CI | Set to any non-empty value to flip every sidecar’s run_source field from "local" (developer-machine default) to "ci". Read at sidecar-write time by detect_run_source; surfaces through cargo ktstr stats compare --run-source ci so CI-produced runs can be partitioned from developer runs without per-run directory bookkeeping. Empty string counts as unset. The third value "archive" is applied at LOAD time (not write time) when cargo ktstr stats compare --dir / list-values --dir pulls sidecars from a non-default pool root — KTSTR_CI does not control that. | None (run_source = "local") |
KTSTR_TEST_KERNEL | Path to a bootable kernel image (bzImage on x86_64, Image on aarch64). See Getting Started and Troubleshooting for search order. | Auto-discovered |
KTSTR_CACHE_DIR | Override the kernel image cache directory. When set, all cache operations use this path instead of the XDG default. | $XDG_CACHE_HOME/ktstr/kernels/ or $HOME/.cache/ktstr/kernels/ |
KTSTR_GHA_CACHE | Set to "1" to enable remote kernel cache via GitHub Actions cache service. Requires ACTIONS_CACHE_URL (set by the GHA runner). Local cache is always authoritative; remote failures are non-fatal. | None (disabled) |
KTSTR_SCHEDULER | Path to a scheduler binary for SchedulerSpec::Discover. See Troubleshooting for search order. | Auto-discovered |
KTSTR_BUDGET_SECS | Time budget in seconds for greedy test selection during --list. Must be positive. See Running Tests. | None (all tests listed) |
KTSTR_SIDECAR_DIR | Directory for per-test result sidecar JSON files. Used as-is when set, no key suffix. Consumed by the test harness (sidecar write path) and by bare cargo ktstr stats (sidecar read path). When this override is set, pre-clear is skipped AND the per-run-key cross-process flock is skipped — the operator chose the directory and owns its contents, so any pre-existing sidecars there are preserved, and ktstr does not coordinate concurrent writers against the override path. Two concurrent runs pointing the same KTSTR_SIDECAR_DIR at the same path therefore have no serialization between them; choose distinct override paths per process (or rely on the default-path branch, which acquires the flock automatically). cargo ktstr stats list, cargo ktstr stats compare, cargo ktstr stats list-values, and cargo ktstr stats show-host walk {CARGO_TARGET_DIR or "target"}/ktstr/ by default and ignore KTSTR_SIDECAR_DIR — pass --dir DIR on compare / list-values / show-host to point them at an alternate run root. See Runs. | {CARGO_TARGET_DIR or "target"}/ktstr/{kernel}-{project_commit}/ (where {project_commit} is the project HEAD short hex, suffixed -dirty when the worktree differs, or the literal unknown when not in a git repo — see Runs for the unknown-commit collision semantics) |
KTSTR_NO_PERF_MODE | Force performance_mode=false and skip flock topology reservation. Disables all performance mode features (pinning, RT scheduling, hugepages, NUMA mbind, KVM exit suppression). Presence is sufficient (any value). See Performance Mode. Also settable via --no-perf-mode CLI flag. | None (disabled) |
KTSTR_CARGO_TEST_MODE | Marks a test invocation that runs without the cargo-ktstr wrapper (typically KTSTR_KERNEL=… KTSTR_CARGO_TEST_MODE=1 cargo test -- some_test). When active, the harness (1) skips the cross-process initramfs SHM cache and builds inline per VM run (process-local HashMap memoization still applies); (2) skips host-topology LLC / per-CPU flock acquisition — tests run on whatever CPUs the OS schedules them onto; (3) skips gauntlet variant expansion in nextest discovery — each #[ktstr_test] runs once with its declared topology, no KTSTR_KERNEL_LIST multi-kernel fan-out; (4) resolves SchedulerSpec::Discover(name) via $PATH first (before the sibling-dir / target-dir / cargo-build chain) so a user can install a scheduler on PATH and run a single test without driving the cargo-ktstr build pipeline. Empty string is treated as unset (rejection mirrors KTSTR_NO_PERF_MODE). Acceptable for development iteration; perf-mode tests still use their measurement contract internally but no peer-coordination flocks are taken. | None (full cargo-ktstr coordination) |
KTSTR_NO_SKIP_MODE | Convert resource-contention and host-topology-insufficient skips into hard test failures (exit code 1). Default behavior is to skip the test (exit code 0) so a contended runner does not fail tests that simply could not start; setting this env var (or passing --no-skip-mode on cargo ktstr test / coverage / llvm-cov) opts into “if the test cannot run, the test fails”. Use when a test environment is supposed to provision sufficient resources and a missing topology is a real configuration error. The CLI flag exports KTSTR_NO_SKIP_MODE=1 for the test binary. Presence is sufficient (any value). | None (skip on contention / topology insufficiency) |
KTSTR_CPU_CAP | Cap the number of host CPUs reserved by a no-perf-mode VM or kernel build to N (integer ≥ 1, a CPU count). The planner walks whole LLCs in consolidation- / NUMA-aware order, filtered to the calling process’s sched_getaffinity cpuset, partial-taking the last LLC so plan.cpus.len() is EXACTLY N. CLI flag --cpu-cap N takes precedence; empty string is treated as unset; 0 or non-numeric values are rejected with a parse error. On shell, --cpu-cap is rejected at clap parse time unless --no-perf-mode is also passed (requires = "no_perf_mode"); on kernel build, no perf-mode concept applies. Library consumers that set performance_mode=true on KtstrVmBuilder directly see the env var silently ignored — the builder’s perf-mode branch never consults CpuCap::resolve. Mutually exclusive with KTSTR_BYPASS_LLC_LOCKS=1 at every entry point (rejection wording contains “resource contract”). See Resource Budget. | None (30% of allowed CPUs, minimum 1) |
KTSTR_BYPASS_LLC_LOCKS | Skip host-side LLC flock acquisition entirely. No coordination against concurrent perf-mode runs. Presence is sufficient (any non-empty value). Mutually exclusive with KTSTR_CPU_CAP / --cpu-cap — the conflict is rejected at every entry point with an error containing “resource contract”. See Resource Budget. | None (coordinate) |
KTSTR_KERNEL_PARALLELISM | Override the rayon pool width cargo ktstr uses for --kernel per-spec fan-out in resolve_kernel_set. Parsed as usize after .trim(); whitespace around the value is tolerated. Values that fail to parse, are negative, or are 0 silently fall through to the default — a typoed export (=abc, =0) does NOT disable parallelism, it degrades to the host-CPU default. Useful when the default is wrong for the host: a fast NIC + slow CPU benefits from a higher value (more concurrent downloads); a contended CI runner benefits from a lower cap to leave bandwidth and CPU for sibling jobs. Scope is narrow: only the bounded ThreadPool resolve_kernel_set builds via ThreadPoolBuilder::install is affected — the global rayon pool that other code paths (nextest harness, polars groupby, etc.) consume is untouched. The build phase inside each per-spec resolve is already serialized at the LLC-flock layer, so raising this knob accelerates download fan-out only, not build time. | std::thread::available_parallelism() (host logical CPU count, falling back to 1 on a sandboxed host where available_parallelism errors) |
KTSTR_VERBOSE | Set to "1" for verbose VM console output (earlyprintk, loglevel=7). | None |
RUST_BACKTRACE | Gates verbose diagnostic output on failure. Also enables verbose VM console output (same as KTSTR_VERBOSE=1) when set to "1" or "full". Propagated to the guest. | None |
RUST_LOG | Controls every ktstr tracing filter — guest-side and host-side. Guest-side: propagated to the VM kernel command line and parsed by the guest tracing subscriber, so guest events are filtered by the same RUST_LOG value the host process saw at launch. Host-side: applied via the EnvFilter the inference engine installs on first call to global_backend() (tracing_subscriber::fmt::try_init() — a no-op when an outer subscriber was already installed). Two host-side targets are useful in practice: "llama-cpp-2" (literal hyphens — the Metadata::target() set by llama_cpp_2::send_logs_to_tracing(LogOptions::default()), carrying llama.cpp / GGML log lines: model-load progress, GGUF parse chatter, KV-cache reservation notes, error reasons) and "ktstr::flock" (the module_path!() default for src/flock.rs, where the shared flock-timeout primitive emits a tracing::debug!("waiting on flock at …") event on each Ok(None) poll iteration). Examples: RUST_LOG=llama-cpp-2=info widens model-load logging to INFO; RUST_LOG=ktstr::flock=debug surfaces flock-contention heartbeats; RUST_LOG=llama-cpp-2=off suppresses llama.cpp output entirely. EnvFilter does prefix-matching on meta.target() without underscore normalization (the hyphenated llama-cpp-2 target is a string literal, not a Rust path). The default EnvFilter derived from an unset RUST_LOG keeps only ERROR-level events, which is exactly the C-side rejection-reason text behind otherwise-opaque InferenceError::ModelLoad / LlamaModelLoadError::NullResult failures. Operators wanting a different sink (file, alternate format) can install their own subscriber FIRST — try_init() becomes a no-op and the operator’s subscriber receives the events. | None (host-side: ERROR-level events on stderr) |
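The KTSTR_KERNEL_LIST wire format described in the table (label1=path1;label2=path2;…) is simple enough to parse by hand when debugging a multi-kernel run. The sketch below is an illustrative parser, not the one in ktstr: the function name is hypothetical, and skipping empty segments (for example after a trailing semicolon) is an assumption.

```rust
/// Parse a `label=path;label=path` list into (label, path) pairs.
/// An empty input yields an empty list, i.e. single-kernel mode.
/// (Hypothetical helper; error strings are illustrative.)
fn parse_kernel_list(raw: &str) -> Result<Vec<(String, String)>, String> {
    let mut entries = Vec::new();
    for entry in raw.split(';') {
        // Assumption: tolerate empty segments such as a trailing ';'.
        if entry.is_empty() {
            continue;
        }
        // Split on the first '=' only, so paths may contain '=' or ':'.
        let (label, path) = entry
            .split_once('=')
            .ok_or_else(|| format!("missing '=' in entry {entry:?}"))?;
        if label.is_empty() || path.is_empty() {
            return Err(format!("empty label or path in entry {entry:?}"));
        }
        entries.push((label.to_string(), path.to_string()));
    }
    Ok(entries)
}
```

Because the semicolon is the entry separator, a path such as /k/b:x passes through intact, which is the reason the table gives for not using a colon.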
jemalloc probe wiring
These variables are only consulted by integration tests that boot a
jemalloc-linked allocator worker inside the VM and attach the
ktstr-jemalloc-probe to it (see tests/jemalloc_probe_tests.rs).
Both are set from a #[ctor] in the test binary so they land before
the test harness dispatches.
What #[ctor] is and why these variables need it
#[ctor] is a Rust attribute (provided by the
ctor crate) that marks a
function to run automatically at binary initialization — after the
dynamic linker sets up the process but before main() is called.
Linux implements this via the .init_array ELF section; the
attribute’s generated code registers the function there. A function
under #[ctor] therefore runs exactly once per process, on the
main thread, before any code inside main() executes.
The two probe-wiring variables are consulted by ktstr’s
nextest pre-dispatch path (ktstr_test_early_dispatch), which
itself runs under a ktstr-owned #[ctor] that intercepts the
nextest protocol args (--list, --exact) before the standard
Rust test harness sees them. The probe-wiring variables must
already be populated when that early dispatch fires, so setting
them from plain test-body code is too late — the sidecar
enumeration and initramfs packing decisions have already run.
Tests needing probe integration install their own #[ctor] that
writes the two variables via std::env::set_var, ensuring both
ktstr’s early dispatch and the VM launch path downstream see the
populated values.
The ctor hook runs under the ctor crate re-exported at
ktstr::__private::ctor, so a new test crate does not need to
add ctor to its own dependencies — it can use the re-export
via ktstr::__private::ctor::ctor and stay in sync with the
version ktstr itself depends on, avoiding the “two ctor
crates, two .init_array entries, ordering undefined” pitfall.
Leaving either variable unset is the normal case — the VM launcher skips probe wiring entirely, and no initramfs entry is added.
| Variable | Description | Default |
|---|---|---|
KTSTR_JEMALLOC_PROBE_BINARY | Absolute host path to the ktstr-jemalloc-probe binary. When set, the probe is packed into every VM’s base initramfs at /bin/ktstr-jemalloc-probe. Typically set by a #[ctor] in the integration test crate to env!("CARGO_BIN_EXE_ktstr-jemalloc-probe"). Empty string is treated the same as unset. | None (no probe packed) |
KTSTR_JEMALLOC_ALLOC_WORKER_BINARY | Absolute host path to the paired ktstr-jemalloc-alloc-worker binary. Packed alongside the probe for the closed-loop tests that run the probe against a live allocator target. Same #[ctor] shape as above using env!("CARGO_BIN_EXE_ktstr-jemalloc-alloc-worker"). Empty string is treated the same as unset. | None (no worker packed) |
LLVM coverage
| Variable | Description | Default |
|---|---|---|
LLVM_COV_TARGET_DIR | Directory for extracted profraw files. | Parent of LLVM_PROFILE_FILE, or <exe-dir>/llvm-cov-target/ |
LLVM_PROFILE_FILE | Standard LLVM profiling output path. ktstr reads its parent as a fallback profraw directory. | None |
Nextest protocol
| Variable | Description | Default |
|---|---|---|
NEXTEST | Set by nextest when it invokes the test binary. ktstr’s #[ctor] dispatch inspects this to decide whether to intercept the nextest protocol args (--list, --exact) for gauntlet expansion and budget-based selection before main() runs. Under plain cargo test, this is unset and the standard harness runs the #[test] wrappers directly. | None |
VM-internal
Set by the host on the guest kernel command line and read by the
guest init (via /proc/cmdline). Not intended for user
configuration; listed here for debugging.
| Variable | Description |
|---|---|
SCHED_PID | PID of the scheduler process inside the guest, published after scheduler spawn. |
KTSTR_MODE | Guest execution mode (run for test dispatch, shell for interactive shell). |
KTSTR_TOPO | Topology string (numa_nodes,llcs,cores,threads) for guest-side scenario resolution. |
KTSTR_SHM_BASE | Host-physical base address of the SHM ring region (hex). |
KTSTR_SHM_SIZE | Size in bytes of the SHM ring region (hex). |
KTSTR_TERM | Terminal type forwarded from the host (sets guest TERM). |
KTSTR_COLORTERM | Color capability forwarded from the host (sets guest COLORTERM). |
KTSTR_COLS | Host terminal column count, used to size the guest pty when available. |
KTSTR_ROWS | Host terminal row count, used to size the guest pty when available. |
Sentinel tokens (===KTSTR_TEST_RESULT_START===,
===KTSTR_TEST_RESULT_END===, KTSTR_EXIT=N,
KTSTR_INIT_STARTED, KTSTR_PAYLOAD_STARTING, KTSTR_EXEC_EXIT)
are protocol markers written to COM2; they are not environment
variables.
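When reading a captured COM2 log by hand, the sentinel tokens make it straightforward to pull out the result payload and exit code. The sketch below is a debugging aid under stated assumptions, not the host-side parser: it assumes the payload sits as text between the START and END markers and that KTSTR_EXIT=N appears at the start of its own line. The function names are hypothetical.

```rust
const RESULT_START: &str = "===KTSTR_TEST_RESULT_START===";
const RESULT_END: &str = "===KTSTR_TEST_RESULT_END===";

/// Return the text between the first START marker and the next END marker,
/// trimmed of surrounding whitespace.
fn extract_result(com2: &str) -> Option<&str> {
    let start = com2.find(RESULT_START)? + RESULT_START.len();
    let end = start + com2[start..].find(RESULT_END)?;
    Some(com2[start..end].trim())
}

/// Return N from the first line of the form `KTSTR_EXIT=N`.
/// Assumption: the token starts its own line.
fn parse_exit_code(com2: &str) -> Option<i32> {
    com2.lines().find_map(|line| {
        line.trim()
            .strip_prefix("KTSTR_EXIT=")?
            .parse::<i32>()
            .ok()
    })
}
```

Both functions return None when the marker is absent, which usually means the guest never reached that stage.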
ctprof
The ctprof profiler captures a host-wide per-thread snapshot of scheduling counters, memory / I/O accounting, CPU affinity, cgroup state, and thread identity, then compares two snapshots to surface what changed. It is a manually-invoked CLI companion to the automated scheduler tests — useful when a run passes on one machine and fails on another, or for A/B comparing host behaviour across kernel / sysctl / workload changes.
This is a different tool from cargo ktstr show-host,
which captures the host context (kernel, CPU model, sched_*
tunables, NUMA layout, kernel cmdline) — aggregate state that
does not change between scenarios. The profiler captures
per-thread cumulative counters that do change, and its
comparison surface is designed for the thread-level diff.
When to use it
- Workload investigation — you observe a regression and want to know which process / thread pool moved in run time, context-switch rate, or migration count.
- Kernel / sysctl A/B — capture before and after flipping a sched_* tunable on an otherwise-identical workload; the compare output surfaces every counter that responded.
- Host baselining — capture on a known-good host, capture on a failing host, compare to isolate what differs at the thread-behaviour level.
The profiler is not invoked automatically by scenarios or the
gauntlet. It is opt-in and operator-driven via the
ktstr ctprof subcommand.
Capture
ktstr ctprof capture --output baseline.ctprof.zst
# ... run workload, change a tunable, reboot a kernel, etc. ...
ktstr ctprof capture --output after.ctprof.zst
capture walks /proc for every live thread group, enumerates
each thread, and reads a handful of procfs sources for each one.
The output is a zstd-compressed JSON snapshot (conventional
extension: .ctprof.zst).
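The walk itself can be pictured with a small standard-library sketch. `list_threads` is a hypothetical helper, not the ctprof implementation; it takes the procfs root as a parameter so it can be pointed at a fixture directory, and it assumes the usual `/proc/<tgid>/task/<tid>` layout.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns (tgid, tid) pairs for every thread visible under `proc_root`.
pub fn list_threads(proc_root: &Path) -> io::Result<Vec<(u32, u32)>> {
    let mut out = Vec::new();
    for entry in fs::read_dir(proc_root)? {
        let entry = entry?;
        // Only numeric directory names are thread groups; skip "self", "sys", etc.
        let Some(tgid) = entry.file_name().to_str().and_then(|s| s.parse::<u32>().ok()) else {
            continue;
        };
        // A process can exit between the directory listing and this read;
        // skip it rather than failing the whole walk.
        let Ok(tasks) = fs::read_dir(entry.path().join("task")) else {
            continue;
        };
        for task in tasks.flatten() {
            if let Some(tid) = task.file_name().to_str().and_then(|s| s.parse::<u32>().ok()) {
                out.push((tgid, tid));
            }
        }
    }
    out.sort_unstable();
    Ok(out)
}
```

Calling `list_threads(Path::new("/proc"))` on a live host returns a point-in-time view; threads may appear or vanish while the walk is in progress.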
What is captured per thread
- Identity — tid, tgid, pcomm (process name from /proc/<tgid>/comm), comm (thread name from /proc/<tid>/comm), cgroup v2 path, start_time_clock_ticks (from /proc/<tid>/stat field 22, in USER_HZ clock ticks), scheduling policy name, nice, CPU affinity mask.
- Scheduling counters (cumulative, from /proc/<tid>/sched; schedstat fields gated by CONFIG_SCHEDSTATS, run_time_ns / wait_time_ns / timeslices gated by CONFIG_SCHED_INFO) — run_time_ns, wait_time_ns, timeslices, voluntary_csw, nonvoluntary_csw, nr_wakeups (plus _local / _remote / _sync / _migrate splits), nr_migrations, wait_sum / wait_count, voluntary_sleep_ns (capture-side normalized as sum_sleep_runtime - sum_block_runtime so the kernel’s sleep/block double-count is stripped before the value reaches the snapshot), block_sum, iowait_sum / iowait_count, core_forceidle_sum, wait_max / sleep_max / block_max / exec_max / slice_max (lifetime peaks).
- Memory — minflt / majflt from /proc/<tid>/stat. allocated_bytes / deallocated_bytes from the jemalloc per-thread TSD counters (tsd_s.thread_allocated / thread_deallocated) read via ptrace + process_vm_readv — populated only for processes linked against jemalloc; glibc arena counters are opaque and read as zero rather than failing capture. smaps_rollup_kb (per-process map of the kernel’s /proc/<tid>/smaps_rollup keys, populated leader-only).
- I/O — rchar, wchar, syscr, syscw, read_bytes, write_bytes, cancelled_write_bytes from /proc/<tid>/io (requires CONFIG_TASK_IO_ACCOUNTING). Note that cancelled_write_bytes records on the truncating task — not the original writer — so it pairs with write_bytes as a group-level signal but per-thread arithmetic between the two is not meaningful.
- Taskstats delay accounting + watermarks — eight delay categories × four fields each (count, total_ns, max_ns, min_ns) plus hiwater_rss_bytes and hiwater_vm_bytes peaks, pulled via the kernel’s TASKSTATS genetlink family. Requires CAP_NET_ADMIN on the capturing process; delay-family fields additionally require CONFIG_TASK_DELAY_ACCT and the runtime delayacct=on toggle, watermark fields require CONFIG_TASK_XACCT. See the Taskstats delay accounting section below for the full field list, gating, and per-bucket semantic caveats.
- PSI host-level — cpu.stat / memory.current aggregates per cgroup (see Per-cgroup enrichment) plus psi (Pressure Stall Information) under each cgroup and at the host level. Requires CONFIG_PSI.
- sched_ext sysfs — state, switch_all, nr_rejected, hotplug_seq, enable_seq from /sys/kernel/sched_ext/. Present only when CONFIG_SCHED_CLASS_EXT is built.
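Reading field 22 of /proc/<tid>/stat has one well-known pitfall: the comm field is parenthesized and may itself contain spaces or parentheses. The sketch below shows the usual workaround of splitting after the last `)`. `start_time_ticks` is a hypothetical helper name, not the ctprof reader.

```rust
/// Extracts field 22 (starttime, in USER_HZ clock ticks) from the content of
/// a /proc/<tid>/stat file. Returns None on any malformed input.
pub fn start_time_ticks(stat: &str) -> Option<u64> {
    // comm (field 2) may contain spaces or ')', so anchor on the LAST ')'
    // instead of splitting on whitespace from the start of the line.
    let rest = &stat[stat.rfind(')')? + 1..];
    // `rest` begins at field 3 (state), so field 22 sits at index 22 - 3 = 19.
    rest.split_ascii_whitespace().nth(19)?.parse().ok()
}
```

Returning `Option` matches the best-effort shape described for the internal readers: a malformed line produces `None` and never a panic.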
Field families and probe-timing invariance:
- Cumulative counters and totals (the majority): wakeups, migrations, csw, run/wait/sleep/block/iowait time, schedstat counts, page-fault counters, syscall counters, byte counters, the taskstats per-bucket *_count and *_delay_total_ns, the jemalloc per-thread TSD counters. Sampled twice at different instants, the value increases monotonically; probe attachment time does not alter the reading.
- Lifetime extrema: schedstat *_max family (wait_max, sleep_max, block_max, exec_max, slice_max), every taskstats *_delay_max_ns / *_delay_min_ns, and the memory watermarks (hiwater_rss_bytes, hiwater_vm_bytes). Per-event extrema rather than sums. The *_max and hiwater_* fields are non-DECREASING over time (kernel keeps the largest); the *_delay_min_ns fields are non-INCREASING (kernel keeps the smallest non-zero observation, so sentinel 0 means “no events observed” — compare against the matching *_count).
- Instantaneous gauges (sensitive to probe timing): nr_threads (signal_struct->nr_threads snapshot), fair_slice_ns (current p->se.slice), and state (task_state_array letter). Sampled at capture time and can legitimately differ between two probes of the same thread.
- Categorical / ordinal scalars: policy, nice, priority, processor, rt_priority, plus identity strings (pcomm, comm, cgroup) and the cpu_affinity cpuset. Sampled at capture time and can change at runtime (e.g. sched_setaffinity mid-run flips processor and cpu_affinity), so they share the gauge family’s probe-timing sensitivity.
Metrics that reset on attachment (perf_event_open counters, BPF tracing samples, etc.) are intentionally absent — they require long-lived instrumentation the capture layer cannot install without disturbing the system it is measuring.
Capture is best-effort
Each internal reader returns Option; a kernel without
CONFIG_SCHED_DEBUG yields None from the /proc/<tid>/sched
reader (and a kernel without CONFIG_SCHEDSTATS yields None
from /proc/<tid>/schedstat and the schedstat-gated
/proc/<tid>/sched keys) without failing the rest of the
thread. Counters collapse to 0, identity strings collapse to
empty, affinity collapses to an empty vec. A missing reading
is indistinguishable from a genuine zero in the output — the
contract is “never fail the snapshot.” Tests that need stronger
guarantees inspect the underlying readers directly (they remain
Option-shaped and are unit-tested in the module).
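The collapse can be illustrated with a minimal sketch. `sched_value` and `collapse` are hypothetical names; the parser assumes the `name : value` line shape of /proc/<tid>/sched and only handles integer values.

```rust
/// Looks up `key` in /proc/<tid>/sched-style text ("name : value" per line).
/// Returns None when the key is absent or the value is not an integer.
pub fn sched_value(text: &str, key: &str) -> Option<u64> {
    text.lines().find_map(|line| {
        let (name, value) = line.split_once(':')?;
        if name.trim() != key {
            return None;
        }
        value.trim().parse().ok()
    })
}

/// Snapshot-side collapse: a missing reading becomes 0, so after this point
/// "absent" and "genuinely zero" can no longer be told apart.
pub fn collapse(reading: Option<u64>) -> u64 {
    reading.unwrap_or(0)
}
```

The reader keeps the distinction (`None` versus `Some(0)`); only the snapshot loses it, which is why tests that care go to the reader directly.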
Per-cgroup enrichment
Every cgroup that contains at least one sampled thread gets a
CgroupStats entry. Fields nest under per-controller
sub-structs:
- cpu: CgroupCpuStats — usage_usec, nr_throttled, throttled_usec (from cpu.stat); max_quota_us, max_period_us (from cpu.max); weight, weight_nice (from cpu.weight / cpu.weight.nice).
- memory: CgroupMemoryStats — current (from memory.current); max, high, low, min (from the matching memory.* files; low and min are protection floors, max and high are limits); stat and events as flat key-value maps mirroring memory.stat and memory.events.
- pids: CgroupPidsStats — current and max from the optional pids controller.
- psi: Psi — per-cgroup Pressure Stall Information from <cgroup>/cpu.pressure / memory.pressure / io.pressure / irq.pressure (gated on CONFIG_PSI).
All fields are read directly from cgroup v2 files, NOT derived from per-thread data, because they are aggregates over the whole cgroup.
Snapshot identity
The top-level CtprofSnapshot also embeds a HostContext
(the same structure show-host prints — kernel, CPU, memory,
sched_* tunables, cmdline). Older tools or synthetic fixtures
that omit the context render (host context unavailable) rather
than failing the compare.
Cgroup namespace caveat
The per-thread cgroup path is read verbatim from
/proc/<tid>/cgroup — it is therefore relative to the cgroup
namespace root the capturing process sees, NOT the
system-global v2 mount root. A process inside a nested cgroup
namespace sees a truncated path; a process outside sees a longer
one. Cross-namespace comparison requires external
canonicalization (the capture layer deliberately does not attempt
it because the right resolution depends on capture-site privilege
and namespace visibility).
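For reference, the cgroup v2 entry in /proc/<tid>/cgroup is the line with hierarchy ID 0 and an empty controller list, i.e. the one starting with `0::`. A minimal sketch of extracting it (`cgroup_v2_path` is a hypothetical helper name):

```rust
/// Returns the cgroup v2 path from the content of /proc/<tid>/cgroup.
/// On hybrid hosts the file also lists v1 hierarchies; those are skipped.
pub fn cgroup_v2_path(cgroup_file: &str) -> Option<&str> {
    cgroup_file.lines().find_map(|line| line.strip_prefix("0::"))
}
```

The returned string is exactly what the reading process sees, so the same thread can yield `/` from inside its own cgroup namespace and a longer path from the host.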
Taskstats delay accounting
The kernel’s TASKSTATS genetlink family delivers per-task
delay-accounting and memory-watermark fields that are NOT
exposed via /proc/<tid>/sched or /proc/<tid>/stat. ctprof
captures them through crate::taskstats — a netlink socket
opens, the family-id resolves via CTRL_CMD_GETFAMILY, and one
TASKSTATS_CMD_GET query per tid is issued. The 34 captured
fields (8 delay categories × 4 bucket fields + 2 watermarks) all
tag Section::TaskstatsDelay so they can be filtered as a
unit.
Capability and kconfig gating
Calling the netlink family requires CAP_NET_ADMIN on the
capturing process (kernel/taskstats.c::taskstats_ops registers
TASKSTATS_CMD_GET with GENL_ADMIN_PERM). ktstr always runs
as root in production so the cap is implicit, but a non-root
operator running ktstr ctprof capture will hit EPERM on the
first query_tid call and every taskstats field will collapse
to zero per the best-effort capture contract.
Per-family kconfig gates and runtime toggles:
- Delay-accounting fields (*_delay_count, *_delay_total_ns, *_delay_max_ns, *_delay_min_ns across the eight categories): require CONFIG_TASKSTATS=y AND CONFIG_TASK_DELAY_ACCT=y AND the runtime delayacct=on toggle (sysctl kernel.task_delayacct=1 or boot param delayacct). The runtime toggle is a separate condition beyond the build-time gates — a kernel built with both CONFIGs but launched without delayacct=on produces all-zero delay readings. ktstr’s standard kernel build includes both kconfigs; the test harness adds delayacct to the guest cmdline.
- Memory-watermark fields (hiwater_rss_bytes, hiwater_vm_bytes): require CONFIG_TASKSTATS=y AND CONFIG_TASK_XACCT=y. They do NOT respond to the delayacct=on runtime toggle — xacct_add_tsk (kernel/tsacct.c) is unconditional once CONFIG_TASK_XACCT is built. xacct_add_tsk reads watermarks from the SHARED mm_struct, so sibling threads of the same tgid all report identical values; kernel threads (mm == NULL) read zero by design.
Any failed gate or missing cap collapses the affected fields
to zero. ktstr’s capture pipeline emits an info-level tracing
line per snapshot summarizing taskstats outcomes AND attaches
the structured tally to CtprofSnapshot::taskstats_summary
(ok_count / eperm_count / esrch_count /
other_err_count), so an operator can distinguish “kernel
doesn’t expose this” from “every tid raced exit” from
“CAP_NET_ADMIN missing” without scraping log lines.
Eight delay categories
| Category | Source | Notes |
|---|---|---|
| cpu_delay_* | tsk->sched_info.{pcount,run_delay} via delayacct_add_tsk (kernel/delayacct.c) | Time waiting on the runqueue. RACY: count + total are not updated atomically (lockless sched_info path); a concurrent reader may observe one ahead of the other. Captures the same wait-for-CPU bucket as schedstat wait_* via a different code path. |
| blkio_delay_* | delayacct_blkio_start / _end (kernel/delayacct.c) | Synchronous block I/O wait. Updates serialize through task->delays->lock so count + total are atomic (unlike cpu_*). The canonical delay-accounting block-I/O reading; distinct from schedstat iowait_sum. |
| swapin_delay_* | delayacct_swapin_start / _end (include/linux/delayacct.h) | Swap-in wait. OVERLAPS with thrashing_* — every thrashing event is also a swapin event from the syscall layer; do not sum the two. |
| freepages_delay_* | delayacct_freepages_start / _end (mm/page_alloc.c) | Direct memory reclaim wait. |
| thrashing_delay_* | delayacct_thrashing_start / _end (mm/workingset.c) | Thrashing wait. Refines swapin tracking — see swapin_*. |
| compact_delay_* | delayacct_compact_start / _end (mm/compaction.c) | Memory-compaction wait. |
| wpcopy_delay_* | delayacct_wpcopy_start / _end (mm/memory.c) | Write-protect-copy (CoW) fault wait. Introduced in taskstats v13. |
| irq_delay_* | delayacct_irq (kernel/delayacct.c) | IRQ-handler windows charged to the task by IRQ accounting. Introduced in taskstats v14. |
Each category has four fields:
- *_count — number of windows observed (MonotonicCount, SumCount).
- *_delay_total_ns — cumulative ns of delay (MonotonicNs, SumNs).
- *_delay_max_ns — longest single window observed (PeakNs, MaxPeak).
- *_delay_min_ns — shortest non-zero window observed (PeakNs, MaxPeak). Sentinel 0 means “no events observed”, NOT “saw a zero-ns event”; compare against the matching *_count to disambiguate.
The two memory watermarks (hiwater_rss_bytes,
hiwater_vm_bytes) are PeakBytes / MaxPeakBytes — see the
MaxPeakBytes row in the
Aggregation rules section below for the
shared-mm semantics.
Compare
ktstr ctprof compare before.ctprof.zst after.ctprof.zst
compare joins the two snapshots on pcomm (process name) by
default — see Grouping for the other axes —
and emits one row per (group, metric) pair. Groups present
on only one side surface as unmatched: a missing row means the
process did not exist on that side, not that it did zero work.
Grouping
- --group-by pcomm (default) — aggregate every thread of the same process together.
- --group-by cgroup — aggregate by cgroup path. Useful for container-per-workload deployments where the process name is ambiguous across cgroups.
- --group-by comm — aggregate by thread name across every process under token-based pattern normalization (tokio-worker-{0..N} → one bucket; kworker/0:1H-events_highpri, kworker/1:0H-events_highpri, … → one bucket). Useful when a thread-pool name spans many binaries and you want one row per pool, not per binary. Disable normalization with --no-thread-normalize.
- --group-by comm-exact — synonym for --group-by comm --no-thread-normalize. Aggregate by literal thread name, no pattern collapse. Use when distinct token values carry meaning (e.g. tracking each kworker/u8:N independently).
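To make the effect of thread-name normalization concrete, here is a minimal sketch that collapses every run of ASCII digits to a single placeholder. This is an assumption about the general idea only: the real token rules are not spelled out here, and `normalize_comm` is a hypothetical name.

```rust
/// Collapses each run of ASCII digits in a thread name to a single 'N', so
/// numbered pool members share one bucket key. Illustrative rule only.
pub fn normalize_comm(comm: &str) -> String {
    let mut out = String::with_capacity(comm.len());
    let mut in_digits = false;
    for ch in comm.chars() {
        if ch.is_ascii_digit() {
            if !in_digits {
                out.push('N');
            }
            in_digits = true;
        } else {
            out.push(ch);
            in_digits = false;
        }
    }
    out
}
```

Under this rule `tokio-worker-0` and `tokio-worker-17` both map to `tokio-worker-N`, and the two kworker names from the list above share the key `kworker/N:NH-events_highpri`.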
Cgroup-path flattening
ktstr ctprof compare before.ctprof.zst after.ctprof.zst \
--group-by cgroup \
--cgroup-flatten '/kubepods/*/pod-*/container' \
--cgroup-flatten '/system.slice/*.scope'
--cgroup-flatten accepts glob patterns that collapse dynamic
segments (pod UUIDs, session scopes, transient unit IDs) to a
canonical form before grouping, so the same logical workload
across two runs lands on the same row even if the kernel
assigned different UUIDs.
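One way to picture the flattening is a segment-wise glob match where a matching path is replaced by the pattern text itself. Both choices are assumptions made for illustration (the actual canonical form and glob dialect are not specified here), and `flatten` / `segment_matches` are hypothetical names.

```rust
/// Matches one path segment against one pattern segment. Supports at most a
/// single '*' per segment (e.g. "pod-*", "*.scope", or a bare "*").
fn segment_matches(pat: &str, seg: &str) -> bool {
    match pat.split_once('*') {
        None => pat == seg,
        Some((prefix, suffix)) => {
            seg.len() >= prefix.len() + suffix.len()
                && seg.starts_with(prefix)
                && seg.ends_with(suffix)
        }
    }
}

/// Returns the first pattern that matches the whole path segment by segment,
/// or the path unchanged when nothing matches. Using the pattern text as the
/// canonical key is an assumption of this sketch.
pub fn flatten<'a>(path: &'a str, patterns: &[&'a str]) -> &'a str {
    let segs: Vec<&str> = path.split('/').collect();
    for pat in patterns {
        let pats: Vec<&str> = pat.split('/').collect();
        if pats.len() == segs.len() && pats.iter().zip(&segs).all(|(p, s)| segment_matches(p, s)) {
            return *pat;
        }
    }
    path
}
```

With this rule, two runs whose pod UUIDs differ both land on the key `/kubepods/*/pod-*/container`, so their rows join.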
Filtering output: --sections vs --metrics
Two complementary filters narrow the rendered output:
- --sections picks which sub-tables render. The default-empty value renders every section that has data; passing a comma-separated list restricts output to the named sub-tables — every section not listed is suppressed before its data-availability gate runs. Valid section names: primary, taskstats-delay, derived, cgroup-stats, cgroup-limits, memory-stat, memory-events, pressure, host-pressure, smaps-rollup, sched-ext. Five (cgroup-stats, cgroup-limits, memory-stat, memory-events, pressure) require --group-by cgroup; naming any of them under a non-cgroup grouping emits a stderr warning and renders zero rows.
- --metrics picks which rows render inside the primary and derived sub-tables. The default-empty value renders every metric; passing a comma-separated list restricts the rendered rows to the named metrics. Names must come from the ctprof metric-list vocabulary (CTPROF_METRICS ∪ CTPROF_DERIVED_METRICS). Has no effect on the secondary sub-tables (cgroup-stats, smaps-rollup, etc.) — those have fixed column shapes and ignore the row filter.
The two compose multiplicatively: --sections primary --metrics run_time_ns shows a single row in the primary
sub-table and nothing else. --sections primary alone keeps
every primary row; --metrics run_time_ns alone keeps the
single row across every section that displays it.
Each metric carries exactly one Section tag in its
registry entry — the 34 taskstats-sourced primary rows and
the 9 taskstats-derived rows tag Section::TaskstatsDelay
rather than Section::Primary / Section::Derived. They
render inside the same primary / derived outer tables but
match a distinct section name, so --sections taskstats-delay
selects exactly the 34 + 9 taskstats rows alone, while
--sections primary excludes them and --sections derived
excludes the 9 taskstats derivations. The three-way split
lets an operator scope to non-taskstats only, taskstats
only, or any combination, without losing the visual grouping
under the same outer headers.
Aggregation rules
Each metric declares its own aggregation rule
(CTPROF_METRICS in src/ctprof_compare.rs). The
AggRule enum is typed: each variant binds an accessor of a
specific metric_types newtype (MonotonicCount,
MonotonicNs, PeakNs, Bytes, etc.) so a registry entry that
pairs a peak field with a sum reduction (e.g. t.wait_max
(PeakNs) bound to a Sum* rule) fails to compile rather
than producing a meaningless 1×1s ⊕ 1000×1ms aggregate. The
14 variants split into five families: Sum reductions, Max
reductions, Range reductions, Mode reductions, and the
Affinity reduction.
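A stripped-down sketch of the typed-rule idea follows, using two newtypes and two variants only. The names mirror the ones above, but the definitions are illustrative and are not the code in src/ctprof_compare.rs.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct MonotonicNs(pub u64);
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct PeakNs(pub u64);

pub struct Thread {
    pub run_time_ns: MonotonicNs,
    pub wait_max: PeakNs,
}

/// Each variant fixes the newtype its accessor must return.
pub enum AggRule {
    SumNs(fn(&Thread) -> MonotonicNs),
    MaxPeak(fn(&Thread) -> PeakNs),
}

pub fn reduce(rule: &AggRule, threads: &[Thread]) -> u64 {
    match rule {
        // Cumulative counters add up (saturating, never wrapping).
        AggRule::SumNs(get) => threads
            .iter()
            .fold(0u64, |acc, t| acc.saturating_add(get(t).0)),
        // Peaks keep the largest value any thread reported.
        AggRule::MaxPeak(get) => threads.iter().map(|t| get(t).0).max().unwrap_or(0),
    }
}

// AggRule::SumNs(|t| t.wait_max) is rejected by the compiler:
// the accessor returns PeakNs where MonotonicNs is expected.
```

Because the mismatch is a type error, the "sum of peaks" mistake is caught at build time and never reaches a rendered table.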
Sum reductions (cumulative counters)
| Variant | Newtype | Output unit | Examples |
|---|---|---|---|
| SumCount | MonotonicCount | unitless | nr_wakeups (+ _local / _remote / _sync / _migrate / _affine / _affine_attempts), nr_migrations, nr_forced_migrations, nr_failed_migrations_*, voluntary_csw, nonvoluntary_csw, minflt, majflt, wait_count, iowait_count, timeslices, syscr, syscw, every taskstats *_delay_count (8 entries) |
| SumNs | MonotonicNs | ns | run_time_ns, wait_time_ns, wait_sum, voluntary_sleep_ns, block_sum, iowait_sum, core_forceidle_sum, every taskstats *_delay_total_ns (8 entries) |
| SumTicks | ClockTicks | USER_HZ ticks | utime_clock_ticks, stime_clock_ticks |
| SumBytes | Bytes | bytes (IEC) | allocated_bytes, deallocated_bytes, rchar, wchar, read_bytes, write_bytes, cancelled_write_bytes |
Group reduction: saturating_add per the no-wraparound contract.
Delta is the signed difference; percent delta is relative to the
before-side. Auto-scale ladder is decimal SI for ns / count,
USER_HZ for ticks, IEC binary for bytes.
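The arithmetic for a Sum row can be sketched as below. The function names are hypothetical, and returning `None` for the percent when the before-side is zero is an assumption of this sketch; the tool's actual handling of that case is not described here.

```rust
/// Group reduction for cumulative counters: saturating, never wrapping.
pub fn group_sum(values: &[u64]) -> u64 {
    values.iter().fold(0u64, |acc, v| acc.saturating_add(*v))
}

/// Signed difference between the after-side and the before-side.
pub fn delta(before: u64, after: u64) -> i128 {
    after as i128 - before as i128
}

/// Percent change relative to the before-side; None when before is zero.
pub fn percent_delta(before: u64, after: u64) -> Option<f64> {
    if before == 0 {
        None
    } else {
        Some(delta(before, after) as f64 / before as f64 * 100.0)
    }
}
```

For example, a group that moved from 200 to 150 has a delta of -50 and a percent delta of -25.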
Max reductions (peaks and gauges)
| Variant | Newtype | Output unit | Examples |
|---|---|---|---|
| MaxPeak | PeakNs | ns | wait_max, sleep_max, block_max, exec_max, slice_max, every taskstats *_delay_max_ns (8 entries), every taskstats *_delay_min_ns (8 entries) |
| MaxPeakBytes | PeakBytes | bytes (IEC) | hiwater_rss_bytes, hiwater_vm_bytes (taskstats lifetime memory watermarks) |
| MaxGaugeNs | GaugeNs | ns | fair_slice_ns (current scheduler slice) |
| MaxGaugeCount | GaugeCount | unitless | nr_threads (process-wide thread count) |
MaxPeak / MaxPeakBytes rows surface the worst single window
or largest watermark any thread in the group has ever observed
— summing per-thread maxes would conflate “one thread with a 1s
spike” with “1000 threads with 1ms spikes each”.
MaxPeakBytes is the byte-typed twin of MaxPeak and routes
through the IEC binary auto-scale ladder so a 7.5 GiB watermark
renders as 7.500GiB rather than dominating the table with raw
byte counts. xacct_add_tsk (kernel/tsacct.c) reads the
watermarks from the SHARED mm_struct, so sibling threads of
the same tgid all report the same value; cross-thread Max
within a single process is a no-op, while cross-process Max
under a multi-tgid bucket picks the largest watermark any tgid
in the bucket reported.
MaxGaugeNs / MaxGaugeCount apply to instantaneous gauges
(read at capture time) where summing has no physical meaning.
nr_threads specifically is leader-only (populated on
tid == tgid, zero elsewhere); Max reads through the leader
so a comm-bucketed group still surfaces the largest process
represented in the bucket. The taskstats *_delay_min_ns rows
also use MaxPeak: min here is the kernel’s per-task lifetime
shortest non-zero observation, so cross-thread Max picks “the
largest minimum any contributor reported”; sentinel 0 means
“no events observed” — compare against the matching count.
Range reductions (bounded ordinals)
| Variant | Newtype | Output | Examples |
|---|---|---|---|
| RangeI32 | OrdinalI32 | [min, max] (i64-widened) | nice, priority, processor |
| RangeU32 | OrdinalU32 | [min, max] (i64-widened) | rt_priority |
The renderer shows [min, max] and the delta uses the midpoint
so a shift on either end is visible.
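A small sketch of the range reduction and a midpoint-based delta follows. The helper names are hypothetical, and computing the midpoint in floating point is an assumption; the renderer's exact rounding is not stated here.

```rust
/// Reduces per-thread ordinals to a [min, max] pair widened to i64.
pub fn range_of(values: &[i32]) -> Option<(i64, i64)> {
    let min = *values.iter().min()? as i64;
    let max = *values.iter().max()? as i64;
    Some((min, max))
}

/// Delta between two ranges, measured at their midpoints.
pub fn midpoint_delta(before: (i64, i64), after: (i64, i64)) -> f64 {
    let mid = |(lo, hi): (i64, i64)| (lo + hi) as f64 / 2.0;
    mid(after) - mid(before)
}
```

Moving only the upper bound of nice from 0 to 10 shifts the midpoint by 5. Note that a perfectly symmetric widening, such as [-5, 5] to [-10, 10], leaves the midpoint unchanged, so the rendered [min, max] pair is still worth reading alongside the delta.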
Mode reductions (categorical)
| Variant | Newtype | Output | Examples |
|---|---|---|---|
| Mode | CategoricalString | most-frequent value + count/total | policy |
| ModeChar | char (coerced) | most-frequent char + count/total | state |
| ModeBool | bool (coerced) | most-frequent bool + count/total | ext_enabled |
Mode is textual: delta is "same" if both modes agree,
"differs" otherwise — there is no arithmetic on a categorical
value. ModeChar and ModeBool coerce to String via
to_string() before reducing because the underlying types are
not themselves Modeable. A 50/50 bool tie resolves
lex-smallest-wins (so "false" wins over "true"); operators
reading a false mode in a heterogeneous bucket should check
the count/total fraction.
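The tie-break is easy to reproduce with an ordered map. The sketch below is illustrative (the function name `mode` and its return shape are assumptions) and shows why a 50/50 bool bucket reports "false".

```rust
use std::collections::BTreeMap;

/// Returns (most frequent value, its count, total) over stringified values.
/// Ties resolve to the lexicographically smallest value.
pub fn mode<I: IntoIterator<Item = String>>(values: I) -> Option<(String, usize, usize)> {
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();
    let mut total = 0usize;
    for v in values {
        *counts.entry(v).or_insert(0) += 1;
        total += 1;
    }
    // BTreeMap iterates in ascending key order; replacing the leader only on
    // a strictly larger count keeps the smallest key when counts tie.
    let mut best: Option<(String, usize)> = None;
    for (key, count) in counts {
        if best.as_ref().map_or(true, |(_, best_count)| count > *best_count) {
            best = Some((key, count));
        }
    }
    best.map(|(key, count)| (key, count, total))
}
```

The returned count and total are what make a tie visible: a mode of "false" with 1 of 2 reads very differently from "false" with 2 of 2.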
Affinity reduction (CPU sets)
| Variant | Newtype | Output | Example |
|---|---|---|---|
| Affinity | CpuSet | AffinitySummary { min_cpus, max_cpus, uniform } | cpu_affinity |
Heterogeneous groups render as "N-M cpus (mixed)". Unlike the
other rules, Affinity does not route through a
metric_types trait — its reduction produces a structured
summary, not a homogeneous newtype.
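A sketch of what such a structured summary could look like is given below. The field names follow the table above, but the meaning assigned to `uniform` (every thread has an identical mask) and the rendering of the uniform case are assumptions of this example.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq)]
pub struct AffinitySummary {
    pub min_cpus: usize,
    pub max_cpus: usize,
    /// Assumed meaning: every thread in the group has the same mask.
    pub uniform: bool,
}

pub fn summarize(masks: &[BTreeSet<u32>]) -> Option<AffinitySummary> {
    let first = masks.first()?;
    Some(AffinitySummary {
        min_cpus: masks.iter().map(|m| m.len()).min()?,
        max_cpus: masks.iter().map(|m| m.len()).max()?,
        uniform: masks.iter().all(|m| m == first),
    })
}

pub fn render(s: &AffinitySummary) -> String {
    if s.uniform {
        // Rendering of the uniform case is a guess for illustration.
        format!("{} cpus", s.max_cpus)
    } else {
        format!("{}-{} cpus (mixed)", s.min_cpus, s.max_cpus)
    }
}
```

A group with one thread pinned to two CPUs and another allowed on four renders as "2-4 cpus (mixed)".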
Derived metrics
Derived metrics consume one or more already-aggregated input
metrics from CTPROF_METRICS and produce a single scalar
with its own auto-scale ladder. They render in a separate
## Derived metrics table below the per-thread table on both
compare and show, with rows colored blue to distinguish
them from the primary table on TTY stdout. Registered in
CTPROF_DERIVED_METRICS in src/ctprof_compare.rs.
The full registry is 17 entries: 8 schedstat / I/O / heap
derivations plus 9 taskstats-derived (the 8 per-bucket
avg_*_delay_ns averages plus the total_offcpu_delay_ns
rollup). Every formula is implemented as a closure over the
group’s metrics map (BTreeMap<String, Aggregated>); a missing
input or a zero denominator yields None, which the renderer
surfaces as - so the operator can distinguish “not
computable” from “computed as zero”.
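The None-versus-zero distinction can be sketched with a plain map of aggregated values. Here the `Aggregated` type is simplified to `u64`, and `ratio` / `render` are hypothetical names, so this is an illustration of the contract and not the registry code.

```rust
use std::collections::BTreeMap;

/// Quotient of two already-aggregated metrics. A missing input or a zero
/// denominator yields None ("not computable"), never 0.0.
pub fn ratio(metrics: &BTreeMap<String, u64>, num: &str, den: &str) -> Option<f64> {
    let n = *metrics.get(num)?;
    let d = *metrics.get(den)?;
    if d == 0 {
        None
    } else {
        Some(n as f64 / d as f64)
    }
}

/// Renders None as "-" and a computed value as a three-decimal scalar.
pub fn render(value: Option<f64>) -> String {
    value.map_or_else(|| "-".to_string(), |x| format!("{x:.3}"))
}
```

A group with wait_sum = 3000 and wait_count = 4 yields 750; the same group with wait_count = 0 renders as `-`, while a genuinely computed zero renders as `0.000`.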
| Metric | Formula | Inputs | Unit | Notes |
|---|---|---|---|---|
| affine_success_ratio | nr_wakeups_affine / nr_wakeups_affine_attempts | nr_wakeups_affine, nr_wakeups_affine_attempts | ratio (0..1) | wake_affine() success ratio. CFS-only signal — sched_ext does not increment the wakeup counters. Bare three-decimal scalar; the renderer suppresses the % column for ratio rows because absolute delta on a [0, 1] ratio is already in percentage points. |
| avg_wait_ns | wait_sum / wait_count | wait_sum, wait_count | ns | Average runqueue-wait duration per scheduling event. Rendered with the ns auto-scale ladder (ns → µs → ms → s). Schedstat-gated (see wait_sum and wait_count); zero across sched_ext threads. |
| cpu_efficiency | run_time_ns / (run_time_ns + wait_time_ns) | run_time_ns, wait_time_ns | ratio (0..1) | Fraction of total scheduler-tracked time spent on-CPU. Higher = less time stuck on the runqueue. Both inputs gated by CONFIG_SCHED_INFO. |
| avg_slice_ns | run_time_ns / timeslices | run_time_ns, timeslices | ns | Average on-CPU slice length. Useful for spotting timeslice-tuning regressions (e.g. a sched_min_granularity_ns change that shrinks slices). Both inputs gated by CONFIG_SCHED_INFO. |
| involuntary_csw_ratio | nonvoluntary_csw / (voluntary_csw + nonvoluntary_csw) | nonvoluntary_csw, voluntary_csw | ratio (0..1) | Fraction of context switches that were preemptions (kernel pulled the task off-CPU) vs. voluntary blocks. High values indicate preemption pressure; low values indicate cooperative blocking. |
| disk_io_fraction | read_bytes / rchar | read_bytes, rchar | ratio (≥ 0) | Fraction of read syscall bytes that traveled past the pagecache layer (cache miss rate; covers local block devices and network filesystems alike). Typically ≤ 1.0, but can exceed 1 when readahead pulls more bytes past the pagecache layer than the syscall requested. Both inputs gated by CONFIG_TASK_IO_ACCOUNTING. |
| live_heap_estimate | allocated_bytes - deallocated_bytes (signed) | allocated_bytes, deallocated_bytes | bytes (IEC, signed) | jemalloc-only live-heap estimate. Glibc and other allocators feed both inputs zero so the derived metric reads zero too — - would imply non-computable but here zero is the genuine reading. Renders on the IEC binary ladder (B → KiB → MiB → GiB → TiB). Per-thread reading carries cross-thread noise: a thread that purely frees objects allocated by other threads reads large negative values; group-level Sum across all threads of the process eliminates the asymmetry. |
| avg_iowait_ns | iowait_sum / iowait_count | iowait_sum, iowait_count | ns | Average iowait interval per blocking event. Schedstat-gated; zero across sched_ext threads. |
| avg_cpu_delay_ns | cpu_delay_total_ns / cpu_delay_count | cpu_delay_total_ns, cpu_delay_count | ns | Average runqueue-wait per scheduling event from the taskstats delayacct path. RACY: the kernel updates count + total via the lockless sched_info path, so a concurrent reader may observe one ahead of the other; the quotient is approximate at the sub-event scale and stable at the integrated scale. Distinct from avg_wait_ns (schedstat) which captures the same wait-for-CPU bucket via a different code path. |
| avg_blkio_delay_ns | blkio_delay_total_ns / blkio_delay_count | blkio_delay_total_ns, blkio_delay_count | ns | Average synchronous block-I/O wait per event from the taskstats delayacct path. Distinct from avg_iowait_ns (schedstat) — this is the canonical delay-accounting block-I/O reading. |
| avg_swapin_delay_ns | swapin_delay_total_ns / swapin_delay_count | swapin_delay_total_ns, swapin_delay_count | ns | Average swap-in wait per event. OVERLAPS with thrashing — every thrashing event is also a swapin event from the syscall layer; do not sum the two averages or the underlying totals directly. |
| avg_freepages_delay_ns | freepages_delay_total_ns / freepages_delay_count | freepages_delay_total_ns, freepages_delay_count | ns | Average direct-reclaim wait per event. |
| avg_thrashing_delay_ns | thrashing_delay_total_ns / thrashing_delay_count | thrashing_delay_total_ns, thrashing_delay_count | ns | Average thrashing wait per event. OVERLAPS with swapin (see avg_swapin_delay_ns). |
| avg_compact_delay_ns | compact_delay_total_ns / compact_delay_count | compact_delay_total_ns, compact_delay_count | ns | Average memory-compaction wait per event. |
| avg_wpcopy_delay_ns | wpcopy_delay_total_ns / wpcopy_delay_count | wpcopy_delay_total_ns, wpcopy_delay_count | ns | Average write-protect-copy (CoW) fault wait per event. |
| avg_irq_delay_ns | irq_delay_total_ns / irq_delay_count | irq_delay_total_ns, irq_delay_count | ns | Average IRQ-handler window per event. |
| total_offcpu_delay_ns | cpu + blkio + freepages + compact + wpcopy + irq + max(swapin, thrashing) | every *_delay_total_ns | ns | Sum of every meaningful off-CPU delay-accounting bucket. The swapin/thrashing pair is combined via .max() rather than summed because the two share syscall-layer events (every thrashing event is also a swapin from the syscall perspective); summing both would double-count thrashing-induced swapins. When CONFIG_TASK_DELAY_ACCT is off, the runtime toggle is off, or the kernel predates a bucket’s introduction (e.g. wpcopy_* lands in v13, irq_* in v14), the missing buckets read zero from the truncated taskstats payload — the rollup degrades to the sum of the populated buckets rather than returning -. The structured taskstats outcome lives on CtprofSnapshot::taskstats_summary for the operator to disambiguate “no data” from “zero data.” |
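The total_offcpu_delay_ns formula from the last row can be written out as a short sketch. `DelayTotals` is a hypothetical struct holding the eight per-bucket `*_delay_total_ns` values, and the saturating addition is an assumption carried over from the Sum reductions.

```rust
/// Per-bucket cumulative delay totals in ns (hypothetical container type).
#[derive(Debug, Default)]
pub struct DelayTotals {
    pub cpu: u64,
    pub blkio: u64,
    pub swapin: u64,
    pub freepages: u64,
    pub thrashing: u64,
    pub compact: u64,
    pub wpcopy: u64,
    pub irq: u64,
}

/// cpu + blkio + freepages + compact + wpcopy + irq + max(swapin, thrashing).
/// swapin and thrashing overlap, so only the larger of the two is counted.
pub fn total_offcpu_delay_ns(d: &DelayTotals) -> u64 {
    [
        d.cpu,
        d.blkio,
        d.freepages,
        d.compact,
        d.wpcopy,
        d.irq,
        d.swapin.max(d.thrashing),
    ]
    .iter()
    .fold(0u64, |acc, v| acc.saturating_add(*v))
}
```

Buckets that read zero (for example wpcopy and irq on an older kernel) simply drop out of the sum, which matches the degrade-to-populated-buckets behavior described in the table.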
The is_ratio column on the registry is load-bearing for the
renderer: ratio rows skip the % column entirely (the absolute
delta already carries percentage-point semantics for a [0, 1]
quantity), and the auto-scale ladder is None (bare
three-decimal scalar). Non-ratio derived metrics reuse the same
ladders as their unit family — Ns for nanosecond derivations,
Bytes for byte derivations.
The 9 taskstats-derived entries (the 8 avg_*_delay_ns
averages plus total_offcpu_delay_ns) tag
Section::TaskstatsDelay rather than Section::Derived so
--sections taskstats-delay renders the full taskstats view —
the 34 raw rows AND the 9 derivations that depend on them —
without dragging in unrelated derivations.
Derived metrics are surfaced by ctprof metric-list
alongside the primary registry, and are valid --sort-by keys
on both compare and show.
Output and interpretation
The comparison prints raw numbers and percent delta. There are no judgment labels (regression vs. improvement) — the meaning of “run_time went up 15%” depends on whether you were measuring a CPU-bound workload (more work done) or a spin-wait pathology (more time wasted). The interpretation is scheduler-specific and left to the operator.
Sort order: by default, rows are sorted by absolute delta
(largest movers first) so the most-changed metrics surface at
the top. Rows with no numeric scalar (policy, heterogeneous
affinity) fall to the bottom.
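The default ordering can be sketched as a comparator that ranks by absolute delta and pushes rows without a numeric scalar to the end. `Row` and `sort_rows` are hypothetical names used for illustration.

```rust
use std::cmp::Ordering;

pub struct Row {
    pub name: &'static str,
    /// None for rows with no numeric scalar (e.g. policy).
    pub delta: Option<f64>,
}

/// Largest absolute delta first; non-numeric rows sink to the bottom.
pub fn sort_rows(rows: &mut [Row]) {
    rows.sort_by(|a, b| match (a.delta, b.delta) {
        (Some(x), Some(y)) => y.abs().total_cmp(&x.abs()),
        (Some(_), None) => Ordering::Less,
        (None, Some(_)) => Ordering::Greater,
        (None, None) => Ordering::Equal,
    });
}
```

A row with a delta of -50 therefore sorts ahead of one with +10, since direction is ignored for ranking.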
File format
.ctprof.zst is zstd-compressed JSON of CtprofSnapshot. The
schema is #[non_exhaustive] so field additions do not break
existing snapshots:
CtprofSnapshot
├── captured_at_unix_ns: u64
├── host: Option<HostContext>
├── threads: Vec<ThreadState>
├── cgroup_stats: BTreeMap<String, CgroupStats>
├── probe_summary: Option<CtprofProbeSummary>
├── parse_summary: Option<CtprofParseSummary>
├── taskstats_summary: Option<TaskstatsSummary>
├── psi: Psi
└── sched_ext: Option<SchedExtSysfs>
TaskstatsSummary carries per-snapshot taskstats genetlink
query outcomes — ok_count, eperm_count, esrch_count,
other_err_count — so an operator can distinguish “no
taskstats data because every tid raced exit” (high
esrch_count) from “no taskstats data because the kernel was
built without CONFIG_TASKSTATS” (the netlink open failed
up-front, every counter zero) from “no taskstats data because
CAP_NET_ADMIN is missing” (high eperm_count).
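A consumer of the summary might classify a snapshot along these lines. The struct mirrors the four counters named above; the `Diagnosis` enum and the "largest error counter wins" heuristic are assumptions of this sketch, not ktstr behavior.

```rust
#[derive(Debug, Default)]
pub struct TaskstatsSummary {
    pub ok_count: u64,
    pub eperm_count: u64,
    pub esrch_count: u64,
    pub other_err_count: u64,
}

#[derive(Debug, PartialEq)]
pub enum Diagnosis {
    HasData,
    /// Every counter zero: the netlink open likely failed up-front.
    FamilyUnavailable,
    /// EPERM dominates: CAP_NET_ADMIN is probably missing.
    MissingCapability,
    /// ESRCH dominates: tids exited before they could be queried.
    TidsRacedExit,
    OtherErrors,
}

pub fn diagnose(s: &TaskstatsSummary) -> Diagnosis {
    if s.ok_count > 0 {
        Diagnosis::HasData
    } else if s.eperm_count == 0 && s.esrch_count == 0 && s.other_err_count == 0 {
        Diagnosis::FamilyUnavailable
    } else if s.eperm_count >= s.esrch_count && s.eperm_count >= s.other_err_count {
        Diagnosis::MissingCapability
    } else if s.esrch_count >= s.other_err_count {
        Diagnosis::TidsRacedExit
    } else {
        Diagnosis::OtherErrors
    }
}
```

Since `taskstats_summary` is an `Option` on the snapshot, a caller would first handle the `None` case (older snapshots) before applying a classification like this.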
ThreadState::start_time_clock_ticks is in USER_HZ (100 on
x86_64 and aarch64), NOT the kernel-internal CONFIG_HZ — so
cross-host comparison between differently-configured kernels on
those architectures is meaningful. Other in-tree architectures
(alpha, for instance, with USER_HZ=1024) would require normalization
at capture time; the capture layer currently targets x86_64 and
aarch64 only.
Snapshots are compressed at zstd level 3 (matching the ktstr
remote-cache convention): the ratio is adequate at fast speed, and
ctprof captures are small enough that higher levels yield
diminishing returns on I/O.
Adding a metric
Adding a per-thread metric to the registry is a three-step mechanical process. The type system enforces the wiring so a mismatch between the kernel-source semantic and the aggregation rule fails to compile rather than producing a silently-wrong group reduction.
1. Add a ThreadState field with the right newtype
Pick the metric_types newtype that matches the kernel-source
semantic of the field — the per-newtype docs name the kernel
call sites that update each category. The shape determines what
aggregation rules are legal in step 3:
| Newtype | When to use |
|---|---|
| MonotonicCount | Pure counter: only goes up across the thread’s lifetime. Examples: nr_wakeups, syscall counts, every taskstats *_delay_count. |
| DeadCounter | Same shape as MonotonicCount but tagged for kernel counters with no live writer (always reads zero). Captured for parser parity but does NOT implement any reduction trait; register with is_dead: true and the renderer flags it [dead]. |
| MonotonicNs | Cumulative-time counter in ns. Examples: run_time_ns, wait_sum, every taskstats *_delay_total_ns. |
| PeakNs | Lifetime high-water mark in ns. Kernel updates via if (delta > stat->max) stat->max = delta. Summing peaks is a category error. Examples: wait_max, slice_max, every taskstats *_delay_max_ns and *_delay_min_ns. |
| PeakBytes | Byte-typed twin of PeakNs: lifetime high-water mark in bytes. Routes through the IEC binary auto-scale ladder. Used for taskstats memory watermarks (hiwater_rss_bytes, hiwater_vm_bytes) read from the shared mm_struct. Pairs with AggRule::MaxPeakBytes. |
| GaugeNs | Instantaneous gauge sampled at capture time (ns). Cannot sum: N near-identical samples collapse to N×gauge with no meaning. Example: fair_slice_ns. |
| GaugeCount | Instantaneous unitless count that goes up AND down. Example: nr_threads. |
| ClockTicks | USER_HZ-scaled time. Examples: utime_clock_ticks, stime_clock_ticks. |
| Bytes | Byte counts. IEC binary auto-scale ladder. Examples: read_bytes, wchar. |
| OrdinalI32 / OrdinalU32 / OrdinalU64 | Bounded scalar: range-aggregated, not summable. Examples: nice (i32), rt_priority (u32). The Rangeable::range_across reduction returns Option<Range<Self>> (see Range<T> below). OrdinalU64 implements Rangeable but is currently unused in the registry; a metric that picks OrdinalU64 requires adding AggRule::RangeU64 alongside the existing RangeI32 and RangeU32 variants. |
| CategoricalString | Categorical value: mode-aggregated. Example: policy. |
| CpuSet | CPU affinity mask: affinity-aggregated. Example: cpu_affinity. |
| Range<T> | Output type of the Rangeable::range_across reduction. Carries min and max of the same T with the min <= max invariant enforced at construction (debug_assert! in Range::new). Not stored on ThreadState: the Aggregated::OrdinalRange boundary unwraps it via into_tuple() to a (i64, i64) pair widened from the underlying OrdinalI32 / OrdinalU32 / OrdinalU64. |
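The range reduction in the last two rows can be pictured with a small standalone sketch. It is not ktstr's code: the OrdinalI32, Range, and range_across definitions below are simplified stand-ins written for illustration (a free function rather than a trait method, and only the i32 variant), and the real signatures may differ.

```rust
/// Hypothetical stand-in for the ordinal newtype (illustration only).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct OrdinalI32(pub i32);

/// Sketch of a min/max pair carrying the `min <= max` invariant.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Range<T> {
    min: T,
    max: T,
}

impl<T: PartialOrd + Copy> Range<T> {
    pub fn new(min: T, max: T) -> Self {
        // Checked in debug builds only, as the table describes.
        debug_assert!(min <= max, "Range requires min <= max");
        Range { min, max }
    }
}

impl Range<OrdinalI32> {
    /// Widen both ends to i64 at the aggregation boundary.
    pub fn into_tuple(self) -> (i64, i64) {
        (i64::from(self.min.0), i64::from(self.max.0))
    }
}

/// Reduce a group of values to its range; `None` for an empty group.
pub fn range_across(values: &[OrdinalI32]) -> Option<Range<OrdinalI32>> {
    let first = *values.first()?;
    let (min, max) = values
        .iter()
        .fold((first, first), |(lo, hi), &v| (lo.min(v), hi.max(v)));
    Some(Range::new(min, max))
}
```

Under these assumptions, reducing the values 5, -3, and 0 yields the pair (-3, 5), and an empty group yields None rather than a made-up range.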
Add the field to ThreadState in src/ctprof.rs:
// In ThreadState struct definition.
/// Description: what the field counts, what kernel call site
/// writes it, and what scheduler classes increment it. Cite
/// `kernel/sched/...` line numbers for the writer.
pub my_new_metric: crate::metric_types::MonotonicCount,
2. Wire the capture path
capture_thread_at_with_tally in src/ctprof.rs is the
single per-thread procfs walk. Add the per-source reader (or
extend an existing one) and stamp the field in the
ThreadState { ... } construction:
// Inside capture_thread_at_with_tally, after the existing
// per-source reads. Wrap in the newtype constructor; never use
// `.into()` (the typed-newtype style is explicit).
my_new_metric: MonotonicCount(sched.my_new_metric.unwrap_or(0)),
The Option::unwrap_or(0) collapse is load-bearing: the
profiler’s contract is “never fail the snapshot,” so a missing
reading lands at the newtype’s Default::default() (zero). The
absent reading is indistinguishable from a genuine zero in the
output — see the Capture is best-effort section.
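The consequence of that collapse can be shown in isolation. The sketch below uses hypothetical stand-ins (a one-field SchedReading struct and a stamp helper, neither of which is taken from ktstr) to show that a missing reading and a genuine zero produce the same stored value.

```rust
/// Hypothetical stand-in for the counter newtype (illustration only).
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct MonotonicCount(pub u64);

/// Hypothetical per-source reading: the field is `None` when the
/// procfs line was missing or failed to parse.
#[derive(Default)]
pub struct SchedReading {
    pub my_new_metric: Option<u64>,
}

/// Collapse the optional reading into the newtype without failing.
pub fn stamp(sched: &SchedReading) -> MonotonicCount {
    MonotonicCount(sched.my_new_metric.unwrap_or(0))
}
```

Here stamp returns MonotonicCount(0) for both None and Some(0), which is why a consumer of the snapshot cannot tell the two cases apart.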
3. Register the metric
Append a CtprofMetricDef entry to CTPROF_METRICS in
src/ctprof_compare.rs. The AggRule variant must match the
newtype chosen in step 1 — the type system enforces this.
CtprofMetricDef {
name: "my_new_metric",
rule: AggRule::SumCount(|t| t.my_new_metric),
sched_class: None, // or Some("cfs-only") / Some("non-ext") / Some("fair-policy")
config_gates: &[], // or &["CONFIG_SCHEDSTATS"], etc.
is_dead: false, // true for kernel-side dead pointers
description: "One-line operator-facing description; surfaces in `ctprof metric-list`.",
section: Section::Primary, // or Section::TaskstatsDelay for taskstats-sourced rows
},
The name field is the canonical metric identifier — used by
--sort-by, --metrics, and the metric-list output. (The
--columns flag accepts layout names — group, threads,
metric, baseline, candidate, delta, %, arrow,
value — not metric names.) Names are ASCII short-form
(matching the capture-side field name where possible).
sched_class and config_gates render as bracketed suffixes
in metric-list output ([cfs-only], [SCHEDSTATS]) so
operators reading a row know which kernels populate the
counter. The section tag drives the --sections per-row
filter — most rows take Section::Primary; taskstats-sourced
rows take Section::TaskstatsDelay.
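As a rough picture of how the bracketed suffixes could be assembled, here is a minimal sketch. The MetricRow struct and render_row function are hypothetical, the suffix order is a guess, and dropping the CONFIG_ prefix is an assumption inferred from the [SCHEDSTATS] example above; the actual renderer may format rows differently.

```rust
/// Hypothetical, trimmed-down view of a registry entry.
pub struct MetricRow<'a> {
    pub name: &'a str,
    pub sched_class: Option<&'a str>,
    pub config_gates: &'a [&'a str],
    pub is_dead: bool,
}

/// Build the metric name followed by its bracketed suffixes.
pub fn render_row(row: &MetricRow<'_>) -> String {
    let mut out = String::from(row.name);
    if let Some(class) = row.sched_class {
        out.push_str(&format!(" [{}]", class));
    }
    for &gate in row.config_gates {
        // Assumption: the CONFIG_ prefix is dropped for display.
        let short = gate.strip_prefix("CONFIG_").unwrap_or(gate);
        out.push_str(&format!(" [{}]", short));
    }
    if row.is_dead {
        out.push_str(" [dead]");
    }
    out
}
```

With sched_class set to Some("cfs-only") and a single CONFIG_SCHEDSTATS gate, this sketch renders my_new_metric [cfs-only] [SCHEDSTATS].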
Compile-time guards
The type system catches the four most common mistakes:
- Wrong reduction family: pairing a PeakNs accessor with AggRule::SumNs fails with a type error. PeakNs does not implement Summable (only Maxable), and the closure’s return type does not match the variant’s expected newtype.
- Wrong unit family: pairing a Bytes accessor with AggRule::SumNs fails the same way.
- Dead counter with live reduction: DeadCounter does not implement Summable / Maxable / Rangeable / Modeable, so any AggRule::Sum* / Max* / Range* / Mode* variant bound to a dead-counter accessor fails to compile. Register the metric only via the is_dead: true flag with whichever variant matches its shape; the rendering layer surfaces it as [dead] and skips numeric reduction.
- Categorical with numeric reduction: pairing a CategoricalString accessor with AggRule::SumCount fails because CategoricalString does not implement Summable.
The closure body cannot be type-checked beyond the variant
boundary, so a body that actively miswraps a field — e.g.
SumNs(|t| MonotonicNs(t.wait_max.0)) laundering a peak through
the sum wrapper — type-checks. Don’t do that. The wrapper
category is load-bearing; the type system catches the variant
mismatch but not the lying inside.
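The guard and its blind spot can both be reproduced in a few lines. This is a simplified model rather than ktstr's definitions: the two-field ThreadState, the trait method signatures, the MaxPeak variant name, and the reduce function are all assumptions made for the sketch.

```rust
/// Hypothetical stand-ins for two of the newtypes.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct MonotonicNs(pub u64);
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PeakNs(pub u64);

/// One trait per reduction family; a newtype opts in explicitly.
pub trait Summable: Copy {
    fn sum_across(values: &[Self]) -> Self;
}
pub trait Maxable: Copy {
    fn max_across(values: &[Self]) -> Option<Self>;
}

impl Summable for MonotonicNs {
    fn sum_across(values: &[Self]) -> Self {
        MonotonicNs(values.iter().map(|v| v.0).sum())
    }
}
impl Maxable for PeakNs {
    fn max_across(values: &[Self]) -> Option<Self> {
        values.iter().map(|v| v.0).max().map(PeakNs)
    }
}

pub struct ThreadState {
    pub run_time_ns: MonotonicNs,
    pub wait_max: PeakNs,
}

/// Each variant pins the accessor's return type to one newtype.
pub enum AggRule {
    SumNs(fn(&ThreadState) -> MonotonicNs),
    MaxPeak(fn(&ThreadState) -> PeakNs),
}

// Rejected by the compiler (expected MonotonicNs, found PeakNs):
// AggRule::SumNs(|t| t.wait_max)

/// Apply the rule across a group of threads.
pub fn reduce(rule: &AggRule, threads: &[ThreadState]) -> Option<u64> {
    match rule {
        AggRule::SumNs(get) => {
            let vals: Vec<MonotonicNs> = threads.iter().map(|t| get(t)).collect();
            Some(MonotonicNs::sum_across(&vals).0)
        }
        AggRule::MaxPeak(get) => {
            let vals: Vec<PeakNs> = threads.iter().map(|t| get(t)).collect();
            PeakNs::max_across(&vals).map(|p| p.0)
        }
    }
}
```

In this model, AggRule::SumNs(|t| MonotonicNs(t.wait_max.0)) compiles and silently adds the peaks together, which is the laundering case the paragraph above warns about: the variant boundary is checked, the closure body is not.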
Optional: derived metric
If the new metric has a useful ratio or sum-of-ratios pairing
with existing inputs, register a DerivedMetricDef in
CTPROF_DERIVED_METRICS (same file). The compute closure
reads inputs via input_scalar(metrics, name)? and returns
Option<DerivedValue>; the ratio_compute and
ratio_of_sum_compute helpers cover the two most common
shapes. Set is_ratio: true when the output is in [0, 1] so
the renderer suppresses the % column. Set section to
Section::Derived for general derivations or
Section::TaskstatsDelay if every input is a taskstats field
(so --sections taskstats-delay keeps the derivation alongside
its raw inputs).
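To make the ratio shape concrete, here is a minimal sketch of a compute step that returns None when an input is absent. The names lookup_scalar and ratio_sketch, the HashMap-based metrics table, and the single-variant DerivedValue enum are hypothetical stand-ins chosen for this example; they are not the signatures of input_scalar or ratio_compute. Returning None on a zero denominator is also an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the derived output type.
#[derive(Debug, PartialEq)]
pub enum DerivedValue {
    Scalar(f64),
}

/// Look up one aggregated input by name; `None` if it is absent.
pub fn lookup_scalar(metrics: &HashMap<&str, f64>, name: &str) -> Option<f64> {
    metrics.get(name).copied()
}

/// Compute numerator / denominator, or `None` when either input is
/// missing or the denominator is zero (assumed behaviour).
pub fn ratio_sketch(
    metrics: &HashMap<&str, f64>,
    num: &str,
    den: &str,
) -> Option<DerivedValue> {
    let n = lookup_scalar(metrics, num)?;
    let d = lookup_scalar(metrics, den)?;
    if d == 0.0 {
        return None;
    }
    Some(DerivedValue::Scalar(n / d))
}
```

The ? operator lets a missing input short-circuit the whole derivation, so a derived row simply has no value when one of its raw inputs does. The input names used to exercise the sketch are picked only for illustration.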
Related
- cargo ktstr show-host – captures the host context (kernel, CPU, tunables) that the profiler embeds as the host field. Use show-host when you want to inspect host configuration only, without the per-thread walk.
- Capture and Compare Host State – recipe covering the show-host / stats compare flow for comparing host context across sidecars (not the per-thread profiler).
- Environment Variables – every ktstr-controlled env var.