Features
ktstr is a test framework for Linux process schedulers. See Overview for a quick introduction with code examples.
Supported kernels
ktstr’s runtime dispatches to per-kernel-version fallback paths for
the watchdog timeout and event counters. CI explicitly exercises
6.14 and 7.0 on both x86_64 and aarch64. On 7.1+ kernels the watchdog
override uses scx_sched.watchdog_timeout via BTF detection;
older kernels use the static scx_watchdog_timeout symbol.
Event counters follow a different layout split: 6.18+ kernels
(backported to 6.17.7+ stable) read via
scx_sched.pcpu -> scx_sched_pcpu.event_stats; 6.16-6.17 kernels
read scx_sched.event_stats_cpu directly. When neither path is
available, event-counter sampling is silently disabled.
Testing
Real kernel, clean slate, x86/arm parity — every VM test boots its own Linux kernel in KVM, fresh state each run
Tests boot a real Linux kernel in a KVM virtual machine with configurable topology: NUMA nodes, LLCs, cores per LLC, threads per core. Multi-NUMA topologies produce NUMA domains via ACPI SRAT/SLIT/HMAT tables on x86_64. On aarch64, CPU topology is described via FDT cpu nodes with MPIDR affinity. Both architectures are supported (24 topology presets on x86_64, 14 on aarch64). See Gauntlet for the full preset list.
Fast boot — compressed initramfs SHM cache with COW overlay, VMs boot in ms, not s
The initramfs base (test binary + busybox + shared libraries) is LZ4-compressed and cached in a shared memory segment. Concurrent VMs COW-map the cached base into guest memory, avoiding per-VM compression and copy. Per-test arguments are packed as a small suffix appended to the cached base. Result: VM boot time is dominated by kernel init, not initramfs preparation.
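The COW step in miniature, as a hedged standalone sketch (this uses the libc crate directly and is not ktstr's code): a MAP_PRIVATE mapping over the shared cache fd is copy-on-write, so concurrent VMs share the unmodified base pages and pay only for the pages they touch.

use std::fs::File;
use std::os::fd::AsRawFd;

// Not ktstr's implementation: MAP_PRIVATE makes writes fault into private
// pages while the cached base stays shared and pristine across VMs.
// The caller must check the result against libc::MAP_FAILED.
unsafe fn cow_view(cache: &File, len: usize) -> *mut libc::c_void {
    libc::mmap(
        std::ptr::null_mut(),
        len,
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_PRIVATE,
        cache.as_raw_fd(),
        0,
    )
}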
Automatic shared library resolution — recursive DT_NEEDED discovery, no need to link statically
Shared library dependencies for the test binary and any injected
host files are resolved automatically by walking DT_NEEDED entries
in ELF headers. The framework builds a complete closure of transitive
dependencies — no manual .so lists or LD_LIBRARY_PATH hacks.
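For intuition, a hedged sketch of the same closure computation that shells out to binutils' readelf instead of parsing ELF headers directly as ktstr does (the search-path handling here is deliberately simplistic and illustrative):

use std::collections::BTreeSet;
use std::path::PathBuf;
use std::process::Command;

// List a binary's (NEEDED) sonames by parsing `readelf -d` output.
fn needed(path: &str) -> Vec<String> {
    let out = Command::new("readelf").args(["-d", path]).output().expect("readelf");
    String::from_utf8_lossy(&out.stdout)
        .lines()
        .filter(|l| l.contains("(NEEDED)"))
        .filter_map(|l| Some(l.split('[').nth(1)?.trim_end_matches(']').to_string()))
        .collect()
}

// Recursive DT_NEEDED closure: resolve each soname against a search path and
// recurse until no new libraries appear.
fn closure(root: &str, search: &[&str]) -> BTreeSet<PathBuf> {
    let mut seen = BTreeSet::new();
    let mut work = vec![PathBuf::from(root)];
    while let Some(bin) = work.pop() {
        for lib in needed(bin.to_str().unwrap()) {
            if let Some(p) = search.iter().map(|d| PathBuf::from(d).join(&lib)).find(|p| p.exists()) {
                if seen.insert(p.clone()) {
                    work.push(p); // follow the new library's own dependencies
                }
            }
        }
    }
    seen
}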
Bare-metal mode — run the same scenarios on real hardware without VMs
cargo ktstr export packages a registered test as a self-extracting
.run script that reproduces the scenario on bare metal without
a VM. The runfile validates host topology and sched_ext support,
then dispatches the test directly under whatever scheduler is
already active. Used for testing under production schedulers and
real topology.
Declarative scheduler registration — one macro declares the binary, default topology, kernels, and assertions
Tests load sched_ext schedulers
via BPF struct_ops inside the VM. The
declare_scheduler!
macro registers a scheduler in the KTSTR_SCHEDULERS distributed
slice — binary path, default topology, kernel filter for the
verifier sweep, assertion overrides, and always-on CLI args all
land in one declaration that tests reference via the bare const
ident the macro emits.
use ktstr::declare_scheduler;

declare_scheduler!(MITOSIS, {
    name = "mitosis",
    binary = "scx_mitosis",
    // tuple follows the topology order used throughout these docs:
    // NUMA nodes, LLCs, cores per LLC, threads per core
    topology = (1, 2, 4, 1),
    // always-on CLI args passed on every launch of this scheduler
    sched_args = ["--exit-dump-len", "1048576"],
});
Without a scheduler = … attribute on #[ktstr_test], tests run
under the kernel’s default scheduler (EEVDF).
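A hedged sketch of a test that opts in to the scheduler declared above (the scheduler = attribute is the documented one; the bare test body and fn signature are assumptions):

use ktstr::prelude::*;

// Illustrative: run under the MITOSIS scheduler declared above rather than
// the kernel-default EEVDF.
#[ktstr_test(scheduler = MITOSIS)]
fn mitosis_smoke() {
    // scenario body elided
}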
Data-driven scenarios — declare what you want, the framework handles cgroups, cpusets, and workers
Scenarios are composable sequences of
steps and ops. You declare intent as data —
the framework creates cgroups, assigns cpusets, spawns workers,
sets scheduling policies, and manages affinity. 50+
canned scenarios across 8 scenario
submodules cover basic, cpuset, dynamic, stress, interaction,
affinity, nested, and performance patterns. (The ops module
is the underlying DSL that every scenario is expressed in, not
a scenario category; the scenarios module is the top-level
catalog aggregator.)
API types:
CgroupDef — declarative cgroup: name + cpuset + workload(s)
Step — sequence of ops followed by a hold period
Op — atomic operation (add/remove cgroup, set/swap/clear cpuset, spawn, stop, set affinity, move tasks)
CpusetSpec — topology-relative cpuset (LLC-aligned, disjoint, overlapping, range, exact)
HoldSpec — hold duration (fractional, fixed, or looped)
AffinityIntent — per-worker affinity (inherit, random subset, LLC-aligned, cross-cgroup, single CPU, exact)
SchedPolicy — Linux scheduling policy (Normal, Batch, Idle, FIFO, RoundRobin)
WorkSpec — workload definition for a group of workers
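A hedged sketch of the intent-as-data shape (the type names come from the list above; every constructor and builder call below is an assumption, not the crate's real signature):

use std::time::Duration;

// Illustrative only: one cgroup of spinning workers on an LLC-aligned cpuset,
// added in a step that holds for two seconds.
let cg = CgroupDef::new("spinners")              // hypothetical constructor
    .cpuset(CpusetSpec::LlcAligned)              // hypothetical variant name
    .work(WorkSpec::new(WorkType::SpinWait, 4)); // hypothetical: 4 SpinWait workers
let step = Step::new(vec![Op::AddCgroup(cg)], HoldSpec::Fixed(Duration::from_secs(2)));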
Gauntlet — one test declaration, dozens of topology variants with budget-aware CI selection
A single #[ktstr_test]
auto-expands across topology presets. Multi-kernel runs
(cargo ktstr test --kernel A --kernel B) add the kernel as
an additional dimension. Budget-based selection
(KTSTR_BUDGET_SECS) picks the subset that maximizes coverage
within a CI time limit.
Constraint attributes:
min_llcs, max_llcs, min_cpus, max_cpus, min_numa_nodes, max_numa_nodes, requires_smt — topology gates
extra_sched_args — per-test scheduler CLI arguments
#[ktstr_test] proc macro — zero-boilerplate test declaration with nextest integration
Declares tests with topology, scheduler, and constraint attributes.
Generates both nextest-compatible entries and standard #[test]
wrappers. No custom harness or main() needed.
See The #[ktstr_test] Macro.
Library-first — add as a dev-dependency, write tests in your own crate
Add ktstr as a dev-dependency. ktstr::prelude re-exports
all test-authoring types.
See Getting Started for setup.
Automatic lifecycle — boot, load scheduler, run scenario, collect results, shutdown — all handled
The framework manages the full VM lifecycle: boot, scheduler start, scenario execution, result collection, and shutdown. Bidirectional SHM signal slots coordinate graceful shutdown, BPF map writes, and readiness gates between host and guest.
38 work types — configurable workload profiles for different scheduling pressures
Workers are fork()ed processes placed in cgroups:
SpinWait — tight CPU spin loop
YieldHeavy — repeated sched_yield with minimal CPU work
Mixed — CPU spin burst followed by sched_yield
AluHot — dependent integer multiply chain at high IPC (≥ 2.0)
SmtSiblingSpin — paired PAUSE-spin pinned across two SMT siblings
IpcVariance — alternating high-IPC (multiplies) / low-IPC (cache touches) phases
IoSyncWrite — 16 × 4 KB pwrites + fdatasync per iteration (O_SYNC)
IoRandRead — 4 KB random pread (O_DIRECT)
IoConvoy — interleaved sequential pwrite + random pread with periodic fdatasync (O_DIRECT)
Bursty — CPU burst then sleep (parameterized via Duration)
IdleChurn — CPU burst then nanosleep (hrtimer + idle-class path)
PipeIo — CPU burst then pipe exchange (cross-CPU wake placement)
FutexPingPong — paired futex wait/wake (non-WF_SYNC)
FutexFanOut — 1:N fan-out wake
CachePressure — strided RMW sized to pressure L1
CacheYield — cache pressure + sched_yield
CachePipe — cache pressure + pipe exchange
Sequence — compound: loop through phases
ForkExit — rapid fork+_exit cycling
NiceSweep — cycle nice level from -20 to 19
AffinityChurn — rapid self-directed sched_setaffinity
PolicyChurn — cycle SCHED_OTHER → BATCH → IDLE (→ FIFO/RR with CAP_SYS_NICE)
NumaMigrationChurn — rotate sched_setaffinity across NUMA nodes
CgroupChurn — cycle cgroup membership between sibling cgroups
FanOutCompute — messenger/worker fan-out with matrix-multiply compute
AsymmetricWaker — paired workers in mismatched scheduling classes share one futex word
WakeChain — ring of waker-wakee hops (Pipe with WF_SYNC, or Futex)
EpollStorm — eventfd producers + epoll_wait consumers
ThunderingHerd — N waiters on one global futex word; broadcast wake
PageFaultChurn — rapid mmap/fault/MADV_DONTNEED cycling
NumaWorkingSetSweep — rotate working-set memory across NUMA nodes via mbind
MutexContention — N-way futex mutex contention
PriorityInversion — three-tier lock contention (Pi or Plain futex)
ProducerConsumerImbalance — unbalanced producer/consumer pipeline (queue grows)
SignalStorm — paired workers fire tkill(partner, SIGUSR1) between CPU bursts
PreemptStorm — one SCHED_FIFO worker preempts CFS spinners at ~kHz
RtStarvation — SCHED_FIFO workers monopolise CPU; SCHED_NORMAL workers starve
Custom — user-supplied work function
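To make the profiles concrete, a standalone sketch of the Mixed profile described above (this mirrors the description, not ktstr's worker code):

use std::time::{Duration, Instant};

// Mixed: a tight CPU spin burst followed by sched_yield, repeated forever.
fn mixed_worker(burst: Duration) {
    loop {
        let start = Instant::now();
        while start.elapsed() < burst {
            std::hint::spin_loop(); // the SpinWait-style burst
        }
        std::thread::yield_now(); // sched_yield(2) on Linux
    }
}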
See WorkType.
Observability
Zero-overhead introspection — read/write kernel state and BPF maps from the host without the guest knowing
All observability is built on direct read/write of guest physical memory from the host via the KVM memory mapping, with page table walks for dynamically allocated addresses. No guest-side instrumentation, no BPF syscalls — the observer does not perturb the scheduler under test.
Kernel state: per-CPU runqueues, sched_domain trees, schedstat counters, and sched_ext event counters — read via BTF-resolved struct offsets from vmlinux. See Monitor.
BPF state: maps discovered by walking kernel data structures through page table translation. Array and percpu_array maps support typed field access via BPF program BTF; hash maps return raw key-value pairs. Read/write — write enables host-initiated crash reproduction.
Types: GuestMem, GuestKernel, BpfMapAccessor
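The read pattern behind this, in a hedged sketch (the type names are from the list above, but both method calls are illustrative stand-ins, not the real API):

// Illustrative stand-ins: resolve a field offset from vmlinux BTF once, then
// read guest memory at struct base + offset, without touching the guest.
let off = kernel.btf_field_offset("scx_sched", "watchdog_timeout")?; // hypothetical
let timeout: u64 = guest_mem.read_u64(sched_base + off)?;           // hypothetical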
Cast analysis — recover typed pointers from u64 map fields automatically; no annotations required
Schedulers stash kernel kptrs (task_struct *, cgroup *, …) and
arena pointers in BPF map fields the BTF declares as u64 because
BTF cannot express a pointer to a per-allocation type. The cast
analyzer walks the scheduler’s .bpf.o instruction stream, tracks
register state across LDX / STX / stack-spill / kfunc-return, and
records every (source_struct, field_offset) → target_struct
mapping it can prove from the program’s own access pattern.
The renderer feeds those mappings into render_cast_pointer so a
field that previously surfaced as a raw 0xffff… integer now
chases through to the target struct’s fields and prints with a
(cast→arena) or (cast→kernel) annotation distinguishing
cast-recovered pointers from BTF-typed ones. Failure dumps,
periodic captures, and on-demand snapshots all benefit
automatically.
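The recorded mapping, as a minimal hedged sketch of its shape (names illustrative):

use std::collections::HashMap;

// Every cast the analyzer can prove: (source struct, field offset) names the
// target struct the u64 field actually points at.
#[derive(Hash, PartialEq, Eq)]
struct CastKey {
    source_struct: String,
    field_offset: u32,
}
type CastMap = HashMap<CastKey, String>; // value: target struct name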
A complementary sdt_alloc bridge recovers a chase target’s real
struct id when the scheduler’s program BTF declares the pointee as
a BTF_KIND_FWD forward declaration (the typical shape for
struct sdt_data __arena * fields whose body lives in a separate
library BTF). The freeze pre-pass populates a
slot_start → ArenaSlotInfo range index from each live
sdt_alloc allocator slot — one entry per slot, carrying
elem_size, header_size, and the resolved payload type id.
When a chase lands on a Fwd terminal, the renderer range-looks
up the slot the chased address falls in and renders the recovered
payload struct from the slot’s payload start (skipping the
union sdt_id header when the chased pointer lands at slot-start
rather than payload-start). The result carries an sdt_alloc
annotation suffix: (sdt_alloc) for the BTF-typed Type::Ptr
arm, or cast→arena (sdt_alloc) / cast→kernel (sdt_alloc)
when the chase originated from a cast-analyzer hit.
A parallel cross-BTF Fwd resolution path covers a different
multi-BTF shape: a BTF_KIND_FWD whose body lives in a sibling
embedded BPF object’s BTF rather than an sdt_alloc slot — the
typical scheduler shape where one .bpf.c declares
struct cgx_target; (forward) and another defines the body. The
cast-analysis pre-pass builds a name-keyed index over every
parsed embedded program BTF (one entry per complete !is_fwd
Struct / Union; first-write-wins on duplicate names; anonymous
types skipped). When a chase target survives the local same-BTF
Fwd resolve as a Fwd, the renderer consults the cross-BTF index
by name (matching aggregate kind — struct vs union); a hit
switches the recursion to the resolved sibling BTF and renders
the full body. No new annotation is introduced — the recovered
subtree carries whatever annotation it would have had if the
struct body lived in the entry BTF.
Runs unconditionally on every scheduler load; no test-author
configuration. False negatives (a missed cast — renderer falls
back to raw u64, the prior behavior) are acceptable; false
positives (a misidentified cast) are not, so the analyzer is
deliberately conservative on conflicting evidence and branch
joins. See Monitor → Cast analysis.
Unified timeline — correlate scenario phases with scheduler telemetry
Stimulus events (cgroup ops, cpuset changes, step transitions) are correlated with monitor samples for per-phase scheduler behavior analysis. Each event carries timestamps, operation details, and cumulative worker iteration counts.
Periodic capture + temporal assertions — cadenced sampling across the workload window with monotonicity, rate, steady-state, convergence, and ratio patterns
#[ktstr_test(num_snapshots = N)] fires N host-side
freeze_and_capture boundaries inside the workload’s 10 %–90 %
window, anchored at the first MSG_TYPE_SCENARIO_START. Each
capture is stored on the SnapshotBridge under periodic_NNN
along with the parallel scx_stats JSON observed pre-freeze and a
pause-adjusted elapsed-ms timestamp.
A post_vm callback drains the bridge into a SampleSeries and
projects per-sample columns (SeriesField<T>) along the BPF or
stats axis. Seven temporal patterns evaluate the projections:
nondecreasing — counter monotonicity (v[i] <= v[i+1])
strictly_increasing — strict counter monotonicity (v[i] < v[i+1])
rate_within(lo, hi) — bounded delta-per-millisecond
steady_within(warmup_ms, tolerance) — post-warmup mean band
converges_to(target, tolerance, deadline_ms) — three-consecutive-in-band witness before deadline
always_true — boolean invariant at every sample
ratio_within(other, lo, hi) — cross-field correlation between two series
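A hedged sketch of applying two of the documented patterns (the projection and assertion call shapes are assumptions; "dispatch_count" is a hypothetical field):

// Illustrative call shapes only.
let dispatches: SeriesField<u64> = series.field("dispatch_count"); // hypothetical projection
dispatches.assert(nondecreasing());        // counter never decreases across samples
dispatches.assert(rate_within(0.0, 50.0)); // at most 50 increments per millisecond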
Per-sample projection errors render with the underlying
SnapshotError variant (PlaceholderSample, MissingStats,
FieldNotFound, TypeMismatch, …) so coverage gaps surface with
their cause without re-running. See
Periodic Capture and
Temporal Assertions.
Worker comm / nice / pcomm — set task->comm, nice level, and thread-group leader name on every worker
CgroupDef::comm("name") calls prctl(PR_SET_NAME) on every
worker. CgroupDef::nice(n) calls setpriority(PRIO_PROCESS, 0, n)
on every worker. CgroupDef::pcomm("name") triggers ktstr’s
fork-then-thread spawn path: workers sharing a pcomm value coalesce
into ONE forked thread-group leader whose task->group_leader->comm
is the pcomm string, with worker threads inside it. Each worker
thread additionally sets its own task->comm via the per-WorkSpec
.comm(). Models real applications like chrome (pcomm) hosting
ThreadPoolForeg (per-thread comm). PipeIo/CachePipe and
SignalStorm work correctly under Fork and Thread clone modes,
including inside pcomm-coalesced thread groups. See Tutorial: Step 11.
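The documented calls together, modeled on the chrome example (only the construction is assumed):

// comm/nice/pcomm are the documented builder calls; ::new is an assumption.
let browser = CgroupDef::new("browser")
    .pcomm("chrome")          // one forked thread-group leader named "chrome"
    .comm("ThreadPoolForeg")  // every worker thread's own task->comm
    .nice(5);                 // setpriority(PRIO_PROCESS, 0, 5) per worker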
Inline scheduler configs — pass JSON config strings directly into the test, framework writes to guest path
Schedulers that take a --config JSON file (scx_layered,
scx_lavd, …) declare the arg template + guest path via
Scheduler::config_file_def(arg_template, guest_path). Tests
supply the inline JSON via #[ktstr_test(config = LAYERED_CONFIG)].
The framework writes the content to a temp file, packs it into the
initramfs at the guest path, and substitutes {file} in the arg
template before launching the scheduler. A bidirectional pairing
gate (compile time + runtime) catches mismatched declarations: a
scheduler with config_file_def REQUIRES config = … on every
test, and a scheduler without it REJECTS config = …. See
Tutorial: Step 12 and
The #[ktstr_test] Macro.
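Both halves of the documented pairing, in a hedged sketch (the LAYERED ident, the JSON body, and the exact declaration context are illustrative):

// Scheduler side (context assumed): declare the arg template + guest path.
//   config_file_def("--config {file}", "/etc/scx/layered.json")
// Test side: supply inline JSON; the framework writes it to the guest path
// and substitutes {file} before launch.
const LAYERED_CONFIG: &str = r#"{"layers": []}"#; // illustrative JSON

#[ktstr_test(scheduler = LAYERED, config = LAYERED_CONFIG)]
fn layered_with_inline_config() {
    // scenario body elided
}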
no_perf_mode — decouple virtual topology from host hardware for tests with NUMA / LLC counts the host can't satisfy
#[ktstr_test(no_perf_mode = true)] (or KTSTR_NO_PERF_MODE=1)
builds the VM with the declared numa_nodes / llcs / cores /
threads even on smaller hosts. vCPU pinning, hugepages, NUMA
mbind, RT scheduling, and KVM exit suppression are skipped, and
gauntlet preset filtering relaxes host-topology checks to the
single “host has enough total CPUs” inequality. Mutually exclusive
with performance_mode = true (validated at runtime by KtstrTestEntry::validate). See
Tutorial: Step 13
and Performance Mode.
Statistical regression detection — Polars-powered analysis across combinatoric test matrices
Polars-powered aggregation computes scheduling metrics across runs. Run-to-run comparison with dual-gate significance thresholds (absolute and relative; sketched after the metric list) catches regressions that single-run assertions miss.
Metrics:
worst_spread — CPU time fairness (0.0 = perfect)
worst_gap_ms — longest scheduling gap
total_migrations / worst_migration_ratio — cross-CPU migration volume
max_imbalance_ratio — runqueue length imbalance
p99_wake_latency_us — tail wake-to-run latency
mean_run_delay_us — mean schedstat run delay
total_iterations — throughput
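The dual-gate idea in a minimal sketch (gate values are illustrative): a delta only counts as a regression when it clears both thresholds, so noise that trips one gate alone is filtered out.

// Both gates must trip: tiny deltas fail the absolute gate even on small
// baselines, and proportionally small drift fails the relative gate.
fn is_regression(baseline: f64, current: f64, abs_gate: f64, rel_gate: f64) -> bool {
    let delta = current - baseline;
    delta > abs_gate && delta > baseline * rel_gate
}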
Debugging
Auto-repro — automatically captures function arguments and struct state at crash-path call sites
On a scheduler crash, auto-repro extracts the crash stack and discovers
struct_ops callbacks. It attaches BPF kprobes and fentry/fexit probes,
triggers on sched_ext_exit, and reruns the scenario to capture
function arguments and struct field state at each crash-path call
site. See Auto-Repro.
Interactive shell — busybox shell inside the VM with host file injection (debugging, not tests — too slow)
ktstr shell boots a VM with busybox and drops into an interactive
shell. --include-files injects host binaries and libraries with
automatic shared library resolution.
--exec mode — run commands inside the VM non-interactively
ktstr shell --exec "command" runs a command inside the VM and
exits.
Infrastructure
Kernel management — build, cache, and auto-discover kernels from any source
ktstr kernel build builds and caches kernel images from version
numbers, local source paths, or git URLs. Automatic kernel discovery
resolves cached images, host kernels, and CI-provided paths without
manual configuration.
Performance mode — host-side isolation and topology mirroring for reproducible measurements
vCPU pinning, 2MB hugepages with pre-fault allocation, NUMA mbind, RT scheduling (SCHED_FIFO), KVM exit suppression (PAUSE + HLT), and KVM_HINTS_REALTIME CPUID together isolate the VM from host noise for reproducible measurements. Topology mirroring maps the VM’s LLC structure to match the host’s physical layout, so cache-aware scheduling decisions in the guest reflect real hardware behavior rather than synthetic geometry. Isolation is best-effort: it reduces host noise but cannot eliminate it. See Performance Mode.
Resource-budget coordination — --cpu-cap N bounds concurrent kernel builds and no-perf-mode VMs per host
--cpu-cap N (or KTSTR_CPU_CAP=N) constrains a no-perf-mode VM or
kernel build to exactly N host CPUs, selected by walking whole LLCs
in NUMA-aware, consolidation-aware order (filtered to the calling
process’s sched_getaffinity cpuset), and taking a partial slice of the last
LLC so that plan.cpus.len() == N. The full LLC is still flocked for
per-LLC coordination with concurrent ktstr peers. When the flag is
absent, the planner defaults to 30% of the allowed-CPU set (minimum
1). The plan writes the reserved CPUs and NUMA nodes into a cgroup
v2 cpuset sandbox so make -jN gcc fan-out and vCPU soft-mask
affinity respect the budget. On the shell subcommand this is mutually
exclusive with performance_mode=true (rejected at clap parse time); library
consumers see the env var silently ignored under perf-mode. It is also mutually exclusive
with KTSTR_BYPASS_LLC_LOCKS=1 at every entry point (contract vs.
bypass conflict rejected at CLI parse plus the library and
kernel-build-pipeline sites). ktstr locks / ktstr locks --json
enumerates every held LLC + per-CPU + cache-entry flock on the host
with holder PID + cmdline for contention diagnosis. See
Resource Budget.
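The documented default, as arithmetic (the integer rounding is an assumption):

// 30% of the allowed-CPU set, minimum 1, when --cpu-cap is absent.
fn default_budget(allowed_cpus: usize) -> usize {
    (allowed_cpus * 30 / 100).max(1)
}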
Guest coverage — profraw collection from inside the VM, merged with host coverage
Guest-side profraw collection via shared memory. The host merges
guest and host coverage for unified cargo llvm-cov reports.
cargo-ktstr — cargo subcommand for the full workflow: more introspection, fewer shell scripts
Wraps cargo nextest run with automatic kernel resolution.
Subcommands (in --help order): test, coverage, llvm-cov,
stats, kernel, model, verifier, funify, completions,
show-host, show-thresholds, export, locks, shell.
See cargo-ktstr.
Remote kernel cache — GHA cache backend for cross-run kernel sharing
GHA cache backend for CI kernel sharing. When KTSTR_GHA_CACHE=1
and ACTIONS_CACHE_URL are set, all cache lookups check the remote
after a local miss before falling back to download (for versions) or
erroring (for cache keys). All successful builds are pushed to the
remote automatically. Non-fatal on failure; local cache is
authoritative.
Real-kernel verifier analysis — boots the kernel, loads the scheduler, reads actual verified instruction counts
Runs the BPF verifier against every declare_scheduler!-registered
scheduler’s struct_ops programs inside a real kernel. Reports
per-program verified instruction counts with cycle collapse —
deduplicating repeated verifier paths to show true unique cost.
The sweep emits one nextest cell per (declared scheduler ×
kernel-list entry × accepted topology preset) tuple, with
parallelism and retries handled natively by nextest.
Reads bpf_prog_aux.verified_insns from guest memory after loading
the scheduler via struct_ops — the same path production uses.
Captures real-world verification costs including map sharing, BTF
resolution, and program composition.
See Verifier.
Host-guest signaling — bidirectional host/VM coordination built into the test library
SHM signal slots enable the host and guest to coordinate without a network stack, serial protocol, or guest-side daemon. Graceful shutdown, readiness gates, and BPF map write triggers flow through shared memory mapped into both address spaces.
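The mechanism in miniature, as a standalone sketch (not ktstr's actual slot layout): an atomic word in memory mapped by both sides acts as a readiness gate.

use std::sync::atomic::{AtomicU32, Ordering};

// Writer publishes; reader spins until it observes the store. Release/Acquire
// ordering makes writes done before signal() visible after wait_ready().
fn signal(slot: &AtomicU32) {
    slot.store(1, Ordering::Release);
}

fn wait_ready(slot: &AtomicU32) {
    while slot.load(Ordering::Acquire) == 0 {
        std::hint::spin_loop();
    }
}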
See Getting Started to set up your first test, or browse the Recipes for common workflows.