ktstr

ktstr is a test harness for Linux process schedulers, with a focus on sched_ext (BPF-extensible process scheduling). It boots Linux kernels in KVM virtual machines with controlled CPU topologies, runs workloads, and verifies scheduling correctness. It also tests under the kernel’s default EEVDF scheduler.

Quick taste

The simplest test calls a canned scenario:

use ktstr::prelude::*;

#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
    scenarios::steady(ctx)
}
cargo ktstr test --kernel ../linux

Without a scheduler attribute, tests run under EEVDF. See Getting Started for testing a sched_ext scheduler.

Library API

The ktstr::prelude module re-exports the types needed for writing tests. Declare cgroups and workloads as data with CgroupDef:

use ktstr::prelude::*;

#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("cg_0").workers(2),
        CgroupDef::named("cg_1").workers(2),
    ])
}

The prelude also exports low-level types (CgroupGroup, WorkloadConfig, WorkloadHandle) for manual cgroup and worker management, Assert for composable assertion config, and WorkerReport for telemetry access.

For binary workloads (running schbench, fio, or any external executable as part of a test), see Payload Definitions. #[ktstr_test(payload = FIXTURE)] runs a Payload (binary workload) alongside the cgroup workers; the scheduler = slot takes a bare Scheduler reference — the const emitted by declare_scheduler!.
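
As a sketch, a payload-equipped test needs only the extra attribute. FIXTURE stands in for a payload const defined per Payload Definitions, and MY_SCHED for a declare_scheduler! const; both names are illustrative:

use ktstr::prelude::*;

// The binary payload runs alongside the cgroup workers for the scenario.
#[ktstr_test(scheduler = MY_SCHED, payload = FIXTURE, llcs = 1, cores = 2, threads = 1)]
fn payload_smoke(ctx: &Ctx) -> Result<AssertResult> {
    scenarios::steady(ctx)
}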

What it tests

  • Fair scheduling – workers get CPU time without starvation or excessive scheduling gaps.
  • Cpuset isolation – workers stay on assigned CPUs.
  • Dynamic operations – cgroups created, destroyed, and resized mid-run.
  • Affinity – scheduler respects thread affinity constraints.
  • Stress – many cgroups, many workers, rapid topology changes.
  • Stall detection – scheduler doesn’t drop tasks.

Design

Two principles drive ktstr’s architecture:

Fidelity without overhead – every test boots a real Linux kernel in a KVM VM with real cgroups and real BPF programs. No mocking, no containers, no shared state. The VMM is minimal and PCI-free: two 16550 serial ports (COM1 for kernel console, COM2 for application I/O), a shared-memory ring buffer, and three virtio-MMIO devices (virtio-console for guest console I/O, virtio-blk for file-backed block storage with optional btrfs templates, virtio-net for in-VMM L2 loopback used by network workload tests).

Direct access over tooling layers – the host-side monitor reads guest memory directly via BTF (BPF Type Format)-resolved struct offsets to observe scheduler state. The monitor runs entirely host-side — no BPF programs are injected into the guest to collect scheduler telemetry, so observations do not perturb scheduling decisions. (BPF programs loaded by the scheduler under test, the BPF verifier pipeline, and the auto-repro probe pipeline are separate concerns; those are the code under test, not the observation layer.) See Monitor for details on BTF resolution and guest memory introspection.

BPF verifier analysis

The verifier_pipeline tests boot a scheduler in a VM and capture per-program verifier output from the real kernel verifier. The default output applies cycle collapse to reduce repetitive loop unrolling. See BPF Verifier for details.

Auto-repro probe pipeline

When a scheduler crashes, ktstr can automatically rerun the failing scenario with BPF probes attached to the crash-path functions. See Auto-Repro for details.

Workspace structure

  • ktstr (lib) — core library
  • ktstr-macros — #[ktstr_test], declare_scheduler!, and #[derive(Payload)] proc macros
  • ktstr (bin) — host-side CLI
  • cargo-ktstr (bin) — Cargo-integrated workflow: test, coverage, llvm-cov, kernel mgmt, verifier analysis, stats, interactive shell
  • scx-ktstr — minimal BPF scheduler for testing

ktstr and cargo-ktstr are the two user-facing [[bin]] targets in the crate; install them with cargo install --locked ktstr --bin ktstr --bin cargo-ktstr. The crate also defines two test-fixture [[bin]] targets — ktstr-jemalloc-probe and ktstr-jemalloc-alloc-worker — used by the tests/jemalloc_probe_tests.rs integration tests. The explicit --bin flags scope the install to just the two operator-facing entry points; without them, cargo install would also place the test-fixture binaries on $PATH.

Kernel config

ktstr.kconfig in the repo root contains the kernel config fragment needed for scheduler testing (sched_ext, BPF, kprobes, cgroups). Copy it to your kernel source tree and run make olddefconfig.

Features

ktstr is a test framework for Linux process schedulers. See Overview for a quick introduction with code examples.

Supported kernels

ktstr’s runtime dispatches to per-kernel-version fallback paths for the watchdog timeout and event counters. CI explicitly exercises 6.14 and 7.0 on both x86_64 and aarch64. On 7.1+ kernels the watchdog override uses scx_sched.watchdog_timeout via BTF detection; older kernels use the static scx_watchdog_timeout symbol.

Event counters follow a different layout split: 6.18+ kernels (backported to 6.17.7+ stable) read via scx_sched.pcpu -> scx_sched_pcpu.event_stats; 6.16-6.17 kernels read scx_sched.event_stats_cpu directly. When neither path is available, event-counter sampling is silently disabled.

Testing

Real kernel, clean slate, x86/arm parity — every VM test boots its own Linux kernel in KVM, fresh state each run

Tests boot a real Linux kernel in a KVM virtual machine with configurable topology: NUMA nodes, LLCs, cores per LLC, threads per core. Multi-NUMA topologies produce NUMA domains via ACPI SRAT/SLIT/HMAT tables on x86_64. On aarch64, CPU topology is described via FDT cpu nodes with MPIDR affinity. Both architectures are supported (24 topology presets on x86_64, 14 on aarch64). See Gauntlet for the full preset list.
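
A multi-NUMA declaration is just more topology attributes. Values here are illustrative; with no scheduler attribute the test runs under EEVDF:

use ktstr::prelude::*;

// 2 NUMA nodes, 4 LLCs total (2 per node), 2 cores per LLC, no SMT: 8 CPUs.
#[ktstr_test(numa_nodes = 2, llcs = 4, cores = 2, threads = 1)]
fn numa_steady(ctx: &Ctx) -> Result<AssertResult> {
    scenarios::steady(ctx)
}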

Fast boot — compressed initramfs SHM cache with COW overlay, VMs boot in ms, not s

The initramfs base (test binary + busybox + shared libraries) is LZ4-compressed and cached in a shared memory segment. Concurrent VMs COW-map the cached base into guest memory, avoiding per-VM compression and copy. Per-test arguments are packed as a small suffix appended to the cached base. Result: VM boot time is dominated by kernel init, not initramfs preparation.

Automatic shared library resolution — recursive DT_NEEDED discovery, no need to link statically

Shared library dependencies for the test binary and any injected host files are resolved automatically by walking DT_NEEDED entries in ELF headers. The framework builds a complete closure of transitive dependencies — no manual .so lists or LD_LIBRARY_PATH hacks.

Bare-metal mode — run the same scenarios on real hardware without VMs

cargo ktstr export packages a registered test as a self-extracting .run script that reproduces the scenario on bare metal without a VM. The runfile validates host topology and sched_ext support, then dispatches the test directly under whatever scheduler is already active. Used for testing under production schedulers and real topology.

Declarative scheduler registration — one macro declares the binary, default topology, kernels, and assertions

Tests load sched_ext schedulers via BPF struct_ops inside the VM. The declare_scheduler! macro registers a scheduler in the KTSTR_SCHEDULERS distributed slice — binary path, default topology, kernel filter for the verifier sweep, assertion overrides, and always-on CLI args all land in one declaration that tests reference via the bare const ident the macro emits.

use ktstr::declare_scheduler;

declare_scheduler!(MITOSIS, {
    name = "mitosis",
    binary = "scx_mitosis",
    topology = (1, 2, 4, 1),
    sched_args = ["--exit-dump-len", "1048576"],
});

Without a scheduler = … attribute on #[ktstr_test], tests run under the kernel’s default scheduler (EEVDF).

Data-driven scenarios — declare what you want, the framework handles cgroups, cpusets, and workers

Scenarios are composable sequences of steps and ops. You declare intent as data — the framework creates cgroups, assigns cpusets, spawns workers, sets scheduling policies, and manages affinity. 50+ canned scenarios across 8 scenario submodules cover basic, cpuset, dynamic, stress, interaction, affinity, nested, and performance patterns. (The ops module is the underlying DSL that every scenario is expressed in, not a scenario category; the scenarios module is the top-level catalog aggregator.)

API types:

  • CgroupDef — declarative cgroup: name + cpuset + workload(s)
  • Step — sequence of ops followed by a hold period
  • Op — atomic operation (add/remove cgroup, set/swap/clear cpuset, spawn, stop, set affinity, move tasks)
  • CpusetSpec — topology-relative cpuset (LLC-aligned, disjoint, overlapping, range, exact)
  • HoldSpec — hold duration (fractional, fixed, or looped)
  • AffinityIntent — per-worker affinity (inherit, random subset, LLC-aligned, cross-cgroup, single CPU, exact)
  • SchedPolicy — Linux scheduling policy (Normal, Batch, Idle, FIFO, RoundRobin)
  • WorkSpec — workload definition for a group of workers
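
Composed, these types express multi-phase scenarios directly. A sketch of a two-phase run: Step::with_defs and HoldSpec::Frac follow the shapes shown in the tutorial below, and treating WorkType::YieldHeavy as a parameterless variant like SpinWait is an assumption:

use ktstr::prelude::*;

#[ktstr_test(llcs = 2, cores = 2, threads = 1)]
fn phased(ctx: &Ctx) -> Result<AssertResult> {
    execute_steps(ctx, vec![
        // Phase 1: a spinner cgroup pinned to LLC 0 for 60% of the run.
        Step::with_defs(
            vec![CgroupDef::named("spin")
                .workers(2)
                .work_type(WorkType::SpinWait)
                .with_cpuset(CpusetSpec::Llc(0))],
            HoldSpec::Frac(0.6),
        ),
        // Phase 2: add a yield-heavy cgroup on LLC 1 for the remainder.
        Step::with_defs(
            vec![CgroupDef::named("yield")
                .workers(2)
                .work_type(WorkType::YieldHeavy)
                .with_cpuset(CpusetSpec::Llc(1))],
            HoldSpec::Frac(0.4),
        ),
    ])
}
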
Gauntlet — one test declaration, dozens of topology variants with budget-aware CI selection

A single #[ktstr_test] auto-expands across topology presets. Multi-kernel runs (cargo ktstr test --kernel A --kernel B) add the kernel as an additional dimension. Budget-based selection (KTSTR_BUDGET_SECS) picks the subset that maximizes coverage within a CI time limit.

Constraint attributes:

  • min_llcs, max_llcs, min_cpus, max_cpus, min_numa_nodes, max_numa_nodes, requires_smt — topology gates
  • extra_sched_args — per-test scheduler CLI arguments
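
For example, a gauntlet-constrained declaration might look like this sketch. Constraint names come from the list above; MY_SCHED and the --slice-us argument are illustrative assumptions:

#[ktstr_test(
    scheduler = MY_SCHED,
    llcs = 2, cores = 2, threads = 2,
    min_llcs = 2,            // skip single-LLC presets
    requires_smt = true,     // only presets with threads > 1
    extra_sched_args = ["--slice-us", "2000"],  // per-test scheduler args
)]
fn smt_aware(ctx: &Ctx) -> Result<AssertResult> {
    scenarios::steady(ctx)
}
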
#[ktstr_test] proc macro — zero-boilerplate test declaration with nextest integration

Declares tests with topology, scheduler, and constraint attributes. Generates both nextest-compatible entries and standard #[test] wrappers. No custom harness or main() needed. See The #[ktstr_test] Macro.

Library-first — add as a dev-dependency, write tests in your own crate

Add ktstr as a dev-dependency. ktstr::prelude re-exports all test-authoring types. See Getting Started for setup.

Automatic lifecycle — boot, load scheduler, run scenario, collect results, shutdown — all handled

The framework manages the full VM lifecycle: boot, scheduler start, scenario execution, result collection, and shutdown. Bidirectional SHM signal slots coordinate graceful shutdown, BPF map writes, and readiness gates between host and guest.

38 work types — configurable workload profiles for different scheduling pressures

Workers are fork()ed processes placed in cgroups:

  • SpinWait — tight CPU spin loop
  • YieldHeavy — repeated sched_yield with minimal CPU work
  • Mixed — CPU spin burst followed by sched_yield
  • AluHot — dependent integer multiply chain at high IPC (≥ 2.0)
  • SmtSiblingSpin — paired PAUSE-spin pinned across two SMT siblings
  • IpcVariance — alternating high-IPC (multiplies) / low-IPC (cache touches) phases
  • IoSyncWrite — 16 × 4 KB pwrites + fdatasync per iteration (O_SYNC)
  • IoRandRead — 4 KB random pread (O_DIRECT)
  • IoConvoy — interleaved sequential pwrite + random pread with periodic fdatasync (O_DIRECT)
  • Bursty — CPU burst then sleep (parameterized via Duration)
  • IdleChurn — CPU burst then nanosleep (hrtimer + idle-class path)
  • PipeIo — CPU burst then pipe exchange (cross-CPU wake placement)
  • FutexPingPong — paired futex wait/wake (non-WF_SYNC)
  • FutexFanOut — 1:N fan-out wake
  • CachePressure — strided RMW sized to pressure L1
  • CacheYield — cache pressure + sched_yield
  • CachePipe — cache pressure + pipe exchange
  • Sequence — compound: loop through phases
  • ForkExit — rapid fork+_exit cycling
  • NiceSweep — cycle nice level from -20 to 19
  • AffinityChurn — rapid self-directed sched_setaffinity
  • PolicyChurn — cycle SCHED_OTHER → BATCH → IDLE (→ FIFO/RR with CAP_SYS_NICE)
  • NumaMigrationChurn — rotate sched_setaffinity across NUMA nodes
  • CgroupChurn — cycle cgroup membership between sibling cgroups
  • FanOutCompute — messenger/worker fan-out with matrix-multiply compute
  • AsymmetricWaker — paired workers in mismatched scheduling classes share one futex word
  • WakeChain — ring of waker-wakee hops (Pipe with WF_SYNC, or Futex)
  • EpollStorm — eventfd producers + epoll_wait consumers
  • ThunderingHerd — N waiters on one global futex word; broadcast wake
  • PageFaultChurn — rapid mmap/fault/MADV_DONTNEED cycling
  • NumaWorkingSetSweep — rotate working-set memory across NUMA nodes via mbind
  • MutexContention — N-way futex mutex contention
  • PriorityInversion — three-tier lock contention (Pi or Plain futex)
  • ProducerConsumerImbalance — unbalanced producer/consumer pipeline (queue grows)
  • SignalStorm — paired workers fire tkill(partner, SIGUSR1) between CPU bursts
  • PreemptStorm — one SCHED_FIFO worker preempts CFS spinners at ~kHz
  • RtStarvation — SCHED_FIFO workers monopolise CPU; SCHED_NORMAL workers starve
  • Custom — user-supplied work function

See WorkType.
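
Mixing pressure profiles in one test is a matter of giving each cgroup a different variant. In the sketch below, treating FutexPingPong and CachePressure as parameterless variants like SpinWait is an assumption:

use ktstr::prelude::*;

// One wake-heavy cgroup against one cache-heavy cgroup on a shared LLC.
#[ktstr_test(llcs = 1, cores = 4, threads = 1)]
fn mixed_pressure(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("wakers").workers(2).work_type(WorkType::FutexPingPong),
        CgroupDef::named("cache").workers(2).work_type(WorkType::CachePressure),
    ])
}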

Observability

Zero-overhead introspection — read/write kernel state and BPF maps from the host without the guest knowing

All observability is built on direct read/write of guest physical memory from the host via the KVM memory mapping, with page table walks for dynamically allocated addresses. No guest-side instrumentation, no BPF syscalls — the observer does not perturb the scheduler under test.

Kernel state: per-CPU runqueues, sched_domain trees, schedstat counters, and sched_ext event counters — read via BTF-resolved struct offsets from vmlinux. See Monitor.

BPF state: maps discovered by walking kernel data structures through page table translation. Array and percpu_array maps support typed field access via BPF program BTF; hash maps return raw key-value pairs. Read/write — write enables host-initiated crash reproduction.

Types: GuestMem, GuestKernel, BpfMapAccessor

Cast analysis — recover typed pointers from u64 map fields automatically; no annotations required

Schedulers stash kernel kptrs (task_struct *, cgroup *, …) and arena pointers in BPF map fields the BTF declares as u64 because BTF cannot express a pointer to a per-allocation type. The cast analyzer walks the scheduler’s .bpf.o instruction stream, tracks register state across LDX / STX / stack-spill / kfunc-return, and records every (source_struct, field_offset) → target_struct mapping it can prove from the program’s own access pattern.

The renderer feeds those mappings into render_cast_pointer so a field that previously surfaced as a raw 0xffff… integer now chases through to the target struct’s fields and prints with a (cast→arena) or (cast→kernel) annotation distinguishing cast-recovered pointers from BTF-typed ones. Failure dumps, periodic captures, and on-demand snapshots all benefit automatically.

A complementary sdt_alloc bridge recovers a chase target’s real struct id when the scheduler’s program BTF declares the pointee as a BTF_KIND_FWD forward declaration (the typical shape for struct sdt_data __arena * fields whose body lives in a separate library BTF). The freeze pre-pass populates a slot_start → ArenaSlotInfo range index from each live sdt_alloc allocator slot — one entry per slot, carrying elem_size, header_size, and the resolved payload type id. When a chase lands on a Fwd terminal, the renderer range-looks up the slot the chased address falls in and renders the recovered payload struct from the slot’s payload start (skipping the union sdt_id header when the chased pointer lands at slot-start rather than payload-start). The result carries an sdt_alloc annotation suffix: (sdt_alloc) for the BTF-typed Type::Ptr arm, or cast→arena (sdt_alloc) / cast→kernel (sdt_alloc) when the chase originated from a cast-analyzer hit.

A parallel cross-BTF Fwd resolution path covers a different multi-BTF shape: a BTF_KIND_FWD whose body lives in a sibling embedded BPF object’s BTF rather than an sdt_alloc slot — the typical scheduler shape where one .bpf.c declares struct cgx_target; (forward) and another defines the body. The cast-analysis pre-pass builds a name-keyed index over every parsed embedded program BTF (one entry per complete !is_fwd Struct / Union; first-write-wins on duplicate names; anonymous types skipped). When a chase target survives the local same-BTF Fwd resolve as a Fwd, the renderer consults the cross-BTF index by name (matching aggregate kind — struct vs union); a hit switches the recursion to the resolved sibling BTF and renders the full body. No new annotation is introduced — the recovered subtree carries whatever annotation it would have had if the struct body lived in the entry BTF.

Runs unconditionally on every scheduler load; no test-author configuration. False negatives (a missed cast — renderer falls back to raw u64, the prior behavior) are acceptable; false positives (a misidentified cast) are not, so the analyzer is deliberately conservative on conflicting evidence and branch joins. See Monitor → Cast analysis.

Unified timeline — correlate scenario phases with scheduler telemetry

Stimulus events (cgroup ops, cpuset changes, step transitions) correlated with monitor samples for per-phase scheduler behavior analysis. Each event carries timestamps, operation details, and cumulative worker iteration counts.

Periodic capture + temporal assertions — cadenced sampling across the workload window with monotonicity, rate, steady-state, convergence, and ratio patterns

#[ktstr_test(num_snapshots = N)] fires N host-side freeze_and_capture boundaries inside the workload’s 10 %–90 % window, anchored at the first MSG_TYPE_SCENARIO_START. Each capture is stored on the SnapshotBridge under periodic_NNN along with the parallel scx_stats JSON observed pre-freeze and a pause-adjusted elapsed-ms timestamp.

A post_vm callback drains the bridge into a SampleSeries and projects per-sample columns (SeriesField<T>) along the BPF or stats axis. Seven temporal patterns evaluate the projections:

  • nondecreasing — counter monotonicity (v[i] <= v[i+1])
  • strictly_increasing — strict counter monotonicity (v[i] < v[i+1])
  • rate_within(lo, hi) — bounded delta-per-millisecond
  • steady_within(warmup_ms, tolerance) — post-warmup mean band
  • converges_to(target, tolerance, deadline_ms) — three-consecutive-in-band witness before deadline
  • always_true — boolean invariant at every sample
  • ratio_within(other, lo, hi) — cross-field correlation between two series

Per-sample projection errors render with the underlying SnapshotError variant (PlaceholderSample, MissingStats, FieldNotFound, TypeMismatch, …) so coverage gaps surface with their cause without re-running. See Periodic Capture and Temporal Assertions.
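
A hedged sketch of the consuming side: the field(...) projection accessor and the ?-propagated return shapes are assumptions; only the type names (SampleSeries, SeriesField) and the pattern method names come from the text above:

use ktstr::prelude::*;

// Project a per-sample column and apply two temporal patterns.
fn temporal_checks(series: &SampleSeries) -> Result<()> {
    let dispatches: SeriesField<u64> = series.field("dispatch_total")?; // assumed accessor
    dispatches.nondecreasing()?;              // counter never regresses
    dispatches.rate_within(0.01, 100_000.0)?; // bounded delta per millisecond
    Ok(())
}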

Worker comm / nice / pcomm — set task->comm, nice level, and thread-group leader name on every worker

CgroupDef::comm("name") calls prctl(PR_SET_NAME) on every worker. CgroupDef::nice(n) calls setpriority(PRIO_PROCESS, 0, n) on every worker. CgroupDef::pcomm("name") triggers ktstr’s fork-then-thread spawn path: workers sharing a pcomm value coalesce into ONE forked thread-group leader whose task->group_leader->comm is the pcomm string, with worker threads inside it. Each worker thread additionally sets its own task->comm via the per-WorkSpec .comm(). Models real applications like chrome (pcomm) hosting ThreadPoolForeg (per-thread comm). PipeIo/CachePipe and SignalStorm work correctly under Fork and Thread clone modes, including inside pcomm-coalesced thread groups. See Tutorial: Step 11.

Inline scheduler configs — pass JSON config strings directly into the test, framework writes to guest path

Schedulers that take a --config JSON file (scx_layered, scx_lavd, …) declare the arg template + guest path via Scheduler::config_file_def(arg_template, guest_path). Tests supply the inline JSON via #[ktstr_test(config = LAYERED_CONFIG)]. The framework writes the content to a temp file, packs it into the initramfs at the guest path, and substitutes {file} in the arg template before launching the scheduler. A bidirectional pairing gate (compile time + runtime) catches mismatched declarations: a scheduler with config_file_def REQUIRES config = … on every test, and a scheduler without it REJECTS config = …. See Tutorial: Step 12 and The #[ktstr_test] Macro.
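
Test-side usage is then a single attribute. In the sketch below, the LAYERED const, its config_file_def declaration, and the JSON body are illustrative assumptions:

// Assumes LAYERED was declared with
// config_file_def("--config {file}", "/etc/layered.json") or equivalent.
const LAYERED_CONFIG: &str = r#"{"layers": []}"#;

#[ktstr_test(scheduler = LAYERED, config = LAYERED_CONFIG, llcs = 2, cores = 2, threads = 1)]
fn layered_smoke(ctx: &Ctx) -> Result<AssertResult> {
    scenarios::steady(ctx)
}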

no_perf_mode — decouple virtual topology from host hardware for tests with NUMA / LLC counts the host can't satisfy

#[ktstr_test(no_perf_mode = true)] (or KTSTR_NO_PERF_MODE=1) builds the VM with the declared numa_nodes / llcs / cores / threads even on smaller hosts. vCPU pinning, hugepages, NUMA mbind, RT scheduling, and KVM exit suppression are skipped, and gauntlet preset filtering relaxes host-topology checks to the single “host has enough total CPUs” inequality. Mutually exclusive with performance_mode = true (validated at runtime by KtstrTestEntry::validate). See Tutorial: Step 13 and Performance Mode.
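
A sketch of an oversized declaration (topology values are illustrative):

// Builds 4 NUMA nodes / 8 LLCs even on a small host; pinning, hugepages,
// and host-topology gauntlet filtering are skipped under no_perf_mode.
#[ktstr_test(numa_nodes = 4, llcs = 8, cores = 2, threads = 1, no_perf_mode = true)]
fn big_topo(ctx: &Ctx) -> Result<AssertResult> {
    scenarios::steady(ctx)
}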

Statistical regression detection — Polars-powered analysis across combinatoric test matrices

Polars-powered aggregation computes scheduling metrics across runs. Run-to-run compare with dual-gate significance thresholds (absolute and relative) catches regressions that single-run assertions miss.

Metrics:

  • worst_spread — CPU time fairness (0.0 = perfect)
  • worst_gap_ms — longest scheduling gap
  • total_migrations / worst_migration_ratio — cross-CPU migration volume
  • max_imbalance_ratio — runqueue length imbalance
  • p99_wake_latency_us — tail wake-to-run latency
  • mean_run_delay_us — mean schedstat run delay
  • total_iterations — throughput

Debugging

Auto-repro — automatically captures function arguments and struct state at crash-path call sites

On scheduler crash, extracts the crash stack and discovers struct_ops callbacks. Attaches BPF kprobes and fentry/fexit probes, triggers on sched_ext_exit, and reruns the scenario to capture function arguments and struct field state at each crash-path call site. See Auto-Repro.

Interactive shell — busybox shell inside the VM with host file injection (debugging, not tests — too slow)

ktstr shell boots a VM with busybox and drops into an interactive shell. --include-files injects host binaries and libraries with automatic shared library resolution.

--exec mode — run commands inside the VM non-interactively

ktstr shell --exec "command" runs a command inside the VM and exits.

Infrastructure

Kernel management — build, cache, and auto-discover kernels from any source

ktstr kernel build builds and caches kernel images from version numbers, local source paths, or git URLs. Automatic kernel discovery resolves cached images, host kernels, and CI-provided paths without manual configuration.

Performance mode — host-side isolation and topology mirroring for reproducible measurements

vCPU pinning, 2MB hugepages with pre-fault allocation, NUMA mbind, RT scheduling (SCHED_FIFO), KVM exit suppression (PAUSE + HLT), and KVM_HINTS_REALTIME CPUID — isolates the VM from host noise for reproducible measurements. Topology mirroring maps the VM’s LLC structure to match the host’s physical layout, so cache-aware scheduling decisions in the guest reflect real hardware behavior rather than synthetic geometry. See Performance Mode.

Resource-budget coordination — --cpu-cap N bounds concurrent kernel builds and no-perf-mode VMs per host

--cpu-cap N (or KTSTR_CPU_CAP=N) constrains a no-perf-mode VM or kernel build to exactly N host CPUs, selected by walking whole LLCs in NUMA-aware, consolidation-aware order (filtered to the calling process’s sched_getaffinity cpuset), and partial-taking the last LLC so plan.cpus.len() == N. The full LLC is still flocked for per-LLC coordination with concurrent ktstr peers. When the flag is absent, the planner defaults to 30% of the allowed-CPU set (minimum 1). The plan writes the reserved CPUs and NUMA nodes into a cgroup v2 cpuset sandbox so make -jN gcc fan-out and vCPU soft-mask affinity respect the budget. On shell, mutually exclusive with performance_mode=true (clap parse rejection); library consumers see the env var silently ignored under perf-mode. Mutually exclusive with KTSTR_BYPASS_LLC_LOCKS=1 at every entry point (contract vs. bypass conflict rejected at CLI parse plus the library and kernel-build-pipeline sites). ktstr locks / ktstr locks --json enumerates every held LLC + per-CPU + cache-entry flock on the host with holder PID + cmdline for contention diagnosis. See Resource Budget.
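
Typical invocations, with flag and subcommand names from the text above (the kernel version is illustrative):

cargo ktstr test --kernel 6.14.2 --cpu-cap 16     # bound this run to 16 host CPUs
KTSTR_CPU_CAP=16 cargo ktstr kernel build 6.14.2  # same budget via the env var
ktstr locks --json                                # inspect held LLC/CPU/cache flocks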

Guest coverage — profraw collection from inside the VM, merged with host coverage

Guest-side profraw collection via shared memory. The host merges guest and host coverage for unified cargo llvm-cov reports.

cargo-ktstr — cargo subcommand for the full workflow: more introspection, fewer shell scripts

Wraps cargo nextest run with automatic kernel resolution. Subcommands (in --help order): test, coverage, llvm-cov, stats, kernel, model, verifier, funify, completions, show-host, show-thresholds, export, locks, shell. See cargo-ktstr.

Remote kernel cache — GHA cache backend for cross-run kernel sharing

GHA cache backend for CI kernel sharing. When KTSTR_GHA_CACHE=1 and ACTIONS_CACHE_URL are set, all cache lookups check the remote after a local miss before falling back to download (for versions) or erroring (for cache keys). All successful builds are pushed to the remote automatically. Non-fatal on failure; local cache is authoritative.
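
In CI this is typically just the env toggle; ACTIONS_CACHE_URL is injected by the GitHub Actions runtime, and the version is illustrative:

KTSTR_GHA_CACHE=1 cargo ktstr kernel build 6.14.2   # remote lookup after a local miss; push on success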

Real-kernel verifier analysis — boots the kernel, loads the scheduler, reads actual verified instruction counts

Runs the BPF verifier against every declare_scheduler!-registered scheduler’s struct_ops programs inside a real kernel. Reports per-program verified instruction counts with cycle collapse — deduplicating repeated verifier paths to show true unique cost. The sweep emits one nextest cell per (declared scheduler × kernel-list entry × accepted topology preset) tuple, with parallelism and retries handled natively by nextest.

Reads bpf_prog_aux.verified_insns from guest memory after loading the scheduler via struct_ops — the same path production uses. Captures real-world verification costs including map sharing, BTF resolution, and program composition. See Verifier.

Host-guest signaling — bidirectional host/VM coordination built into the test library

SHM signal slots enable the host and guest to coordinate without a network stack, serial protocol, or guest-side daemon. Graceful shutdown, readiness gates, and BPF map write triggers flow through shared memory mapped into both address spaces.

High test coverage — broad self-coverage of the framework itself

See Getting Started to set up your first test, or browse the Recipes for common workflows.

Getting Started

Prerequisites

Linux only (x86_64, aarch64). ktstr boots KVM virtual machines; it does not build or run on other platforms.

  • Linux host with KVM access (/dev/kvm)
  • Rust toolchain (stable, >= 1.94.1; pinned via rust-toolchain.toml)
  • clang (BPF skeleton compilation)
  • pkg-config, make, gcc
  • autotools (autoconf, autopoint, flex, bison, gawk) – vendored libbpf/libelf/zlib build
  • BTF (/sys/kernel/btf/vmlinux – present by default on most distros; set KTSTR_KERNEL if missing)
  • Internet access on first build (downloads busybox source)
  • Linux kernel 6.12+ for sched_ext tests (check with uname -r). The host kernel has no version requirement beyond KVM; the test kernel is whichever you build or cache via cargo ktstr kernel build. See Supported kernels for per-feature version boundaries.

Ubuntu/Debian:

sudo apt install clang pkg-config make gcc autoconf autopoint flex bison gawk

Fedora:

sudo dnf install clang pkgconf make gcc autoconf gettext-devel flex bison gawk

Install tools

cargo install cargo-nextest           # required
cargo install --locked ktstr --bin ktstr --bin cargo-ktstr   # both user-facing binaries (optional)

cargo-nextest is required. cargo ktstr test delegates to nextest internally; without it, cargo ktstr test will fail.

cargo install --locked ktstr --bin ktstr --bin cargo-ktstr installs the two user-facing binaries (ktstr host-side CLI and cargo-ktstr dev workflow plugin); the --bin flags scope the install away from the two test-fixture binaries (ktstr-jemalloc-probe, ktstr-jemalloc-alloc-worker) that the crate’s integration tests spawn.

Add the dependency

Add ktstr as a dev-dependency:

[dev-dependencies]
ktstr = { version = "0.4" }

Kernel discovery

Tests require a bootable Linux kernel. The test harness checks these locations in order:

  1. KTSTR_TEST_KERNEL environment variable (direct image path).
  2. KTSTR_KERNEL environment variable, parsed as one of three forms:
    • Path: search that directory for arch/<arch>/boot/<image>
    • Version (e.g. 6.14.2): look up the version in XDG cache
    • Cache key (from cargo ktstr kernel list): exact cache lookup
  3. XDG cache: most recent cached image (newest first); entries built with a different kconfig fragment are skipped. When KTSTR_KERNEL named an explicit version or cache key that was not present in the cache, the cache scan is skipped entirely – discovery moves on to step 4 rather than substituting an unrelated cached kernel.
  4. ./linux/arch/<arch>/boot/<image> (workspace-local build tree)
  5. ../linux/arch/<arch>/boot/<image> (sibling directory)
  6. /lib/modules/$(uname -r)/build/arch/<arch>/boot/<image> (installed kernel build tree)
  7. /lib/modules/$(uname -r)/vmlinuz (installed kernel)
  8. /boot/vmlinuz-$(uname -r)
  9. /boot/vmlinuz (unversioned symlink)

On x86_64, the build-tree image is arch/x86/boot/bzImage; on aarch64, arch/arm64/boot/Image.

The host’s installed kernel works for basic testing. For sched_ext tests, build a kernel with the ktstr config fragment (below). See Troubleshooting for details.

Implicit discovery is read-only: the chain reads existing cache entries (most-recent-valid first; entries built with a different kconfig fragment are skipped) and falls back to local build trees and host paths when nothing matches. It does NOT compute a local-{hash7}-{arch}-kc{suffix} cache key, run the build pipeline, or store a new cache entry from whatever source-tree image it ends up using. To opt into the build + cache-store pipeline so a source tree’s build is recorded and reused under a stable cache key, pass the path explicitly via cargo ktstr test --kernel ../linux; see cargo-ktstr — What it does for the full path-mode flow including the cache-hit short-circuit.

Build a kernel

cargo ktstr kernel build downloads a kernel tarball from kernel.org, configures it with the embedded ktstr.kconfig fragment, builds it, and caches the result:

cargo ktstr kernel build               # latest stable series with >= 8 maintenance releases
cargo ktstr kernel build 6.14.2        # specific version
cargo ktstr kernel build 6.12          # highest 6.12.x patch release
cargo ktstr kernel build 6             # highest 6.x.y release

The bare cargo ktstr kernel build skips series that have fewer than 8 maintenance releases to keep CI off brand-new majors whose early point releases tend to hit build issues on older toolchains; pass the specific version explicitly if you need a series that hasn’t reached .8 yet.

Subsequent runs of cargo ktstr test or cargo nextest run will find the cached kernel automatically (step 3 in the discovery chain above).

To build from a local source tree:

cargo ktstr kernel build --source ../linux

To list and manage cached kernels:

cargo ktstr kernel list
cargo ktstr kernel clean --keep 3

See cargo-ktstr for all options.

Manual

cd /path/to/linux
make defconfig
cat /path/to/ktstr/ktstr.kconfig >> .config
make olddefconfig
make -j$(nproc)

ktstr.kconfig in the repo root contains a kernel config fragment tuned for scheduler testing (sched_ext, BPF, kprobes, minimal boot).

Write a test

Create a file in your crate’s tests/ directory (e.g. tests/sched_test.rs) and write a #[ktstr_test] function. The prelude module re-exports the types you need.

The simplest test uses a canned scenario. AssertResult carries the pass/fail verdict, diagnostic messages, and per-cgroup statistics from the run.

use ktstr::prelude::*;

#[ktstr_test(llcs = 1, cores = 2, threads = 1)]  // llcs = last-level caches
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
    // `scenarios::steady` is a canned scenario: two cgroups of equal
    // CPU-spin workers, no cpuset restrictions, run for the default
    // duration.
    scenarios::steady(ctx)
}

For custom cgroup topology, declare cgroups with CgroupDef and run them with execute_defs. A CgroupDef bundles the cgroup name, optional cpuset, and workload specification into a single declaration. This is the most common custom test pattern:

use std::time::Duration;
use ktstr::prelude::*;

#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("cg_0").workers(4),
        CgroupDef::named("cg_1")
            .workers(2)
            // CPU burst for 50 ms, sleep for 100 ms, repeat.
            .work_type(WorkType::bursty(
                Duration::from_millis(50),
                Duration::from_millis(100),
            )),
    ])
}

execute_defs is a convenience wrapper that creates a single step holding for the full duration – use it when all cgroups run concurrently for one phase. Use execute_steps when you need multiple phases (e.g., adding cgroups mid-test or changing cpusets between phases).

Step::with_defs pairs a list of CgroupDefs with a HoldSpec that controls how long the step runs. This example starts two cgroups, then adds a third mid-test:

use ktstr::prelude::*;

#[ktstr_test(llcs = 1, cores = 4, threads = 1)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
    let steps = vec![
        // Phase 1: two cgroups for the first half.
        Step::with_defs(
            vec![
                CgroupDef::named("cg_0").workers(2),
                CgroupDef::named("cg_1").workers(2),
            ],
            HoldSpec::Frac(0.5),
        ),
        // Phase 2: add a third cgroup for the remaining half.
        Step::with_defs(
            vec![CgroupDef::named("cg_2").workers(2)],
            HoldSpec::Frac(0.5),
        ),
    ];
    execute_steps(ctx, steps)
}

How it runs

The framework boots a KVM VM with the requested topology and runs your test binary as the guest’s init process. Your test function executes inside the VM — execute_defs and execute_steps immediately create cgroups, spawn workers, run the workload, and return assertion results. Ctx provides the guest topology (ctx.topo) and cgroup management (ctx.cgroups).

What gets checked

Every test automatically checks for worker starvation, scheduling fairness, scheduling gaps, and host-side runqueue health (including imbalance, stalls, dispatch queue depth). These defaults come from Assert::default_checks() and can be overridden per-scheduler or per-test. See Checking for the full list of checks and thresholds.

Run

The recommended way to run #[ktstr_test] tests is cargo ktstr test, which handles kernel resolution and wraps cargo nextest:

cargo ktstr test --kernel ../linux

The ktstr ctor automatically intercepts nextest protocol args (--list, --exact) for gauntlet expansion and budget-driven test selection.

Fallbacks:

  • cargo nextest run: ctor intercepts, runs gauntlet-expanded tests (you must supply your own kernel via KTSTR_KERNEL / KTSTR_TEST_KERNEL or the discovery chain).
  • cargo test: standard harness runs the #[test] wrappers (base topology only, no gauntlet expansion).

Requires /dev/kvm. See Troubleshooting if KVM is unavailable.

Passing tests:

    PASS [  11.34s] my_crate::my_sched_tests ktstr/my_test

A failing test prints assertion details:

    FAIL [  12.05s] my_crate::my_sched_tests ktstr/my_test

--- STDERR ---
ktstr_test 'my_test' [topo=1n1l2c1t] failed:
  stuck 3500ms on cpu1 at +1200ms

--- stats ---
4 workers, 2 cpus, 8 migrations, worst_spread=12.3%, worst_gap=3500ms
  cg0: workers=2 cpus=2 spread=5.1% gap=3500ms migrations=4 iter=15230
  cg1: workers=2 cpus=2 spread=12.3% gap=890ms migrations=4 iter=14870

Each test invocation writes results into {CARGO_TARGET_DIR or "target"}/ktstr/{kernel}-{project_commit}/ as one *.ktstr.json sidecar per #[ktstr_test] variant. Run cargo ktstr stats list to see runs (RUN, TESTS, DATE, ARCH columns). See Runs for the full layout and analysis workflow.

Using cargo-ktstr

cargo ktstr test handles kernel resolution and test execution in one command:

cargo ktstr test                                              # auto-discover kernel
cargo ktstr test --kernel ../linux                            # local source tree (builds + caches; subsequent runs hit cache)
cargo ktstr test --kernel 6.14.2                              # version (auto-downloads on miss)
cargo ktstr test -- -E 'test(my_test)'                        # pass nextest args

See cargo-ktstr for details.

Interactive shell

cargo ktstr shell boots a VM with busybox for manual exploration:

cargo ktstr shell                              # default 1,1,1,1 topology
cargo ktstr shell --topology 1,2,4,1           # 1 NUMA node, 2 LLCs, 4 cores/LLC, 1 thread/core
cargo ktstr shell -i ./my-scheduler            # include a file in the guest
cargo ktstr shell -i ./test-data/              # include a directory recursively

Included ELF binaries get automatic shared library resolution. Directories are walked recursively; their contents appear under /include-files/<dirname>/ preserving the original structure. Individual files are available at /include-files/<name> inside the guest. See cargo-ktstr shell for details.

Next steps

To understand scenarios, flags, and checking: Core Concepts.

To write new tests: Writing Tests.

To test your own scheduler: Test a New Scheduler.

Zero to ktstr

This tutorial walks through writing a complete #[ktstr_test] from scratch. By the end you’ll have a working scheduler test that runs two cgroups with different lifecycle patterns across a multi-LLC topology, tunes test duration and the watchdog, and asserts fairness, throughput parity, and cpuset isolation.

What you’ll build

A test named mixed_workloads that:

  • Runs two cgroups on separate LLCs:
    • background_spinner – a persistent CPU-bound load that runs for the entire test duration.
    • phased_worker – a worker that loops through explicit Spin -> Yield -> Spin -> Yield ... phases via WorkType::Sequence.
  • Targets a 2-LLC, 4-core topology so the scheduler has a real cache boundary to respect.
  • Sets explicit test duration and scx watchdog timeout via #[ktstr_test] attributes.
  • Asserts fairness (per-cgroup spread), throughput parity (CV across workers + minimum rate), and cpuset isolation (workers stay on their assigned CPUs). Scheduling gaps and host-side runqueue health are checked automatically.

The complete test is at the end of this page.

Prerequisites

Set up the host and a kernel before continuing:

  • Getting Started covers KVM access, the toolchain, and the dev-dependency.
  • A bootable Linux kernel image is required. Build one with cargo ktstr kernel build or point at a source tree with cargo ktstr test --kernel ../linux. See Getting Started: Build a kernel for the full kernel-management workflow.

Once the dependency is in place, create a file under your crate’s tests/ directory (e.g. tests/mixed_workloads.rs) and follow along.

Step 1: The skeleton

Every #[ktstr_test] is a Rust function that takes &Ctx and returns Result<AssertResult>. Start with an empty body that passes unconditionally:

use ktstr::prelude::*;

#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    let _ = ctx;
    Ok(AssertResult::pass())
}

use ktstr::prelude::*; brings in every type the test body needs – Ctx, AssertResult, CgroupDef, WorkType, CpusetSpec, execute_defs, and the Result alias from anyhow. The #[ktstr_test] attribute registers the function so cargo ktstr test discovers it and boots a VM with the requested topology.

A test without a scheduler = ... attribute runs under the kernel’s default EEVDF scheduler — useful as a baseline. Step 2 swaps in a sched_ext scheduler so the rest of the tutorial exercises that scheduler instead.

For the full attribute reference, see The #[ktstr_test] Macro.

Step 2: Define your scheduler

To target a sched_ext scheduler, declare it with declare_scheduler! and reference the generated const from #[ktstr_test(scheduler = …)]. The example uses scx-ktstr, the test-fixture scheduler shipped in this workspace; substitute your own binary name to target a different scheduler.

use ktstr::declare_scheduler;
use ktstr::prelude::*;

declare_scheduler!(KTSTR_SCHED, {
    name = "ktstr_sched",
    binary = "scx-ktstr",
});

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 1, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    let _ = ctx;
    Ok(AssertResult::pass())
}

declare_scheduler! emits a pub static KTSTR_SCHED: Scheduler holding the declared fields and registers a private static in the KTSTR_SCHEDULERS distributed slice via linkme so cargo ktstr verifier discovers it automatically. The scheduler = slot on #[ktstr_test] expects an &'static Scheduler — pass the bare KTSTR_SCHED ident.

The macro fields:

  • name — scheduler name for display and sidecar keys.
  • binary — binary name for auto-discovery in target/{debug,release}/, the directory containing the test binary, or a KTSTR_SCHEDULER override path. When the scheduler is a [[bin]] target in the same workspace, cargo build already places it where discovery looks. The resolved binary is packed into the VM’s initramfs.
  • topology = (numa, llcs, cores, threads) — optional default VM topology. Tests can override individual dimensions via #[ktstr_test(llcs = ...)]. Omitted here; the per-test attributes in Step 4 set every dimension explicitly.
  • sched_args = ["--flag", "--another"] — optional CLI args prepended to every test that uses this scheduler. Useful when a scheduler needs the same --enable-llc-style switches in every run; for one-off variations, use #[ktstr_test(extra_sched_args = [...])] on the test instead.
  • kernels = ["6.14", "6.15..=7.0"] — optional set of kernel specs the verifier sweep should exercise this scheduler against. See BPF Verifier for the cell emission contract.

For the full attribute surface (sysctls, kargs, config_file, gauntlet constraints, scheduler-level assertion overrides), see Scheduler Definitions.

When the macro doesn’t fit — the most common case being inline JSON config supplied per-test or programmatic composition — define the Scheduler const through the manual builder instead. Step 12 below walks through that path with scx_layered.

Step 3: Add workloads

A CgroupDef declares a cgroup along with the workers that will run inside it. The builder methods configure worker count, the work each worker performs, scheduling policy, and cpuset assignment.

Add two cgroups – both running tight CPU spinners for now. Step 5 will swap one of them for a phased workload:

use ktstr::prelude::*;

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 1, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait),
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::SpinWait),
    ])
}

Without .with_cpuset(...), a cgroup’s workers run on every CPU in the test’s topology — they share the VM’s full CPU set with all other cgroups. .with_cpuset(CpusetSpec::Llc(idx)) (introduced in Step 4) restricts a cgroup to one LLC’s CPUs, and the other CpusetSpec variants narrow further.

WorkType::SpinWait runs a tight CPU spin loop; it is one of many work primitives – see WorkType for the full enum (Bursty, FutexPingPong, CachePressure, IoSyncWrite, PageFaultChurn, MutexContention, Sequence, etc.) and the work-type-to-scheduler-behavior mapping table.

execute_defs is a convenience wrapper that runs each cgroup concurrently for the test’s full duration. Both cgroups are persistent – they hold for the entire scenario. Use execute_steps when you need to add cgroups mid-run or swap cpusets between phases; see Ops and Steps for the multi-step API.

Step 4: Set topology

The #[ktstr_test] attribute carries the VM’s CPU topology. Topology dimensions are big-to-little: numa_nodes (default 1), llcs (total across all NUMA nodes), cores per LLC, and threads per core. Total CPU count is llcs * cores * threads.

LLC count matters because the last-level cache is the primary scheduling boundary – tasks sharing an LLC benefit from shared cache lines, while cross-LLC migration carries a cold-cache penalty. A scheduler that ignores LLC topology will look fine on llcs = 1 and start failing as soon as there is a real cache boundary to respect.

Bump the topology to two LLCs with two cores each (4 CPUs total) so each cgroup can own its own LLC:

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 2, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .with_cpuset(CpusetSpec::Llc(0)),
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .with_cpuset(CpusetSpec::Llc(1)),
    ])
}

CpusetSpec::Llc(idx) confines a cgroup to the CPUs that belong to LLC idx. Other variants (Numa, Range, Disjoint, Overlap, Exact) cover NUMA-node binding, fractional partitioning, and hand-built CPU sets.

For the full topology surface (NUMA accessors, per-LLC info, cpuset generation helpers), see TestTopology.

Step 5: Compose phased work inside a cgroup

So far both cgroups run identical CPU spinners. The point of this test is to exercise a scheduler against different lifecycle patterns at once, so swap phased_worker for a worker that loops through explicit phases.

WorkType::Sequence { first: Phase, rest: Vec<Phase> } runs each phase for its specified duration and then advances to the next; when the last phase ends the loop restarts from first. Phases: Phase::Spin(Duration), Phase::Sleep(Duration), Phase::Yield(Duration), Phase::Io(Duration). Use the WorkType::sequence(first, rest) constructor.

Phase, WorkType, and CpusetSpec are all in ktstr::prelude::*; only std::time::Duration needs an extra use line — added on the first line of the example below:

use std::time::Duration;
use ktstr::prelude::*;

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 2, cores = 2, threads = 1)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        // Persistent CPU pressure on LLC 0 for the whole run.
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .with_cpuset(CpusetSpec::Llc(0)),
        // Phased worker on LLC 1: spin 100 ms, yield for 20 ms,
        // then loop. Stresses the scheduler's wake-after-yield
        // placement repeatedly while the LLC-0 spinner keeps
        // host runqueue pressure constant.
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::sequence(
                Phase::Spin(Duration::from_millis(100)),
                [Phase::Yield(Duration::from_millis(20))],
            ))
            .with_cpuset(CpusetSpec::Llc(1)),
    ])
}

The two cgroups now exercise distinct paths concurrently:

  • background_spinner keeps two CPUs continuously busy on LLC 0.
  • phased_worker alternates between burning CPU and yielding on LLC 1, exercising the scheduler’s voluntary-preemption + wakeup placement code paths.

Both cgroups still run for the entire scenario duration: the phasing happens within each phased_worker worker’s loop, while execute_defs holds both cgroups across the whole run via HoldSpec::FULL. To express phasing across cgroups (e.g. add phased_worker only for the second half of the run), use execute_steps with multiple Step entries – see Ops and Steps. Step 9 below adds an Op::snapshot capture into a step’s op list.

Step 6: Tune execution

Several #[ktstr_test] attributes control how the VM runs the scenario. The defaults are tuned for fast iteration; tune up for longer / heavier runs:

  • duration_s (default 12) — per-scenario wall-clock seconds. The framework keeps both cgroups running for duration_s seconds, then signals workers to stop and collects reports.
  • watchdog_timeout_s (default 5) — sched_ext watchdog fire threshold. Applied via scx_sched.watchdog_timeout on 7.1+ kernels and the static scx_watchdog_timeout symbol on pre-7.1 kernels. When neither path is available the override silently no-ops.
  • memory_mb (default 2048) — VM memory in MiB.

watchdog_timeout_s is sched_ext’s per-task stall threshold — if a runnable task is not picked for watchdog_timeout_s seconds, the scheduler exits with SCX_EXIT_ERROR_STALL. The scenario duration and watchdog are independent; a 12 s scenario with a 5 s watchdog is normal. Tune the watchdog only when the scheduler under test is expected to legitimately leave a runnable task parked longer than the default 5 s.

For the run we’re building, set the duration to 20 s (so each phase iteration repeats many times):

#[ktstr_test(
    scheduler = KTSTR_SCHED,
    llcs = 2,
    cores = 2,
    threads = 1,
    duration_s = 20,
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    // body unchanged from Step 5 -- two cgroups via execute_defs
}

For the full attribute reference (auto-repro, performance mode, topology constraints, etc.), see The #[ktstr_test] Macro.

Step 7: Add assertions

Default checks already run with no configuration – not_starved is Some(true) in Assert::default_checks(), which enables:

  • Starvation – any worker with zero work units fails the test.
  • Fairness spread – per-cgroup max(off-CPU%) - min(off-CPU%) must stay under the spread threshold (release default 15%; debug default 35% — debug builds in small VMs show higher spread, so the threshold loosens automatically when cfg!(debug_assertions) is true).
  • Scheduling gaps – the longest wall-clock gap observed at work-unit checkpoints must stay under the gap threshold (release default 2000 ms; debug default 3000 ms — same cfg!(debug_assertions) gate as spread).

Host-side monitor checks (imbalance ratio, DSQ depth, stall detection, fallback / keep-last event rates) are also enabled by default with thresholds from MonitorThresholds::DEFAULT.

Cpuset isolation is opt-in – enable it with isolation = true. Override the spread threshold and add throughput-parity gates:

use std::time::Duration;
use ktstr::prelude::*;

#[ktstr_test(
    scheduler = KTSTR_SCHED,
    llcs = 2,
    cores = 2,
    threads = 1,
    duration_s = 20,
    isolation = true,
    max_spread_pct = 20.0,
    max_throughput_cv = 0.5,
    min_work_rate = 1.0,
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .with_cpuset(CpusetSpec::Llc(0)),
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::sequence(
                Phase::Spin(Duration::from_millis(100)),
                [Phase::Yield(Duration::from_millis(20))],
            ))
            .with_cpuset(CpusetSpec::Llc(1)),
    ])
}

What each new attribute gates:

  • isolation = true – workers must only run on CPUs in their assigned cpuset; any execution on an unexpected CPU fails the test.
  • max_spread_pct = 20.0 – per-cgroup fairness override (the release default is 15.0; this loosens it slightly to absorb noise from the phased worker’s yield-driven re-placement). Bare max_spread_pct = 15.0 would silently match the default and have no observable effect.
  • max_throughput_cv = 0.5 – coefficient of variation of work_units / cpu_time across workers. Catches a scheduler that gives some workers disproportionately less effective CPU.
  • min_work_rate = 1.0 – minimum work units per CPU-second per worker. Catches the case where every worker is equally slow (CV passes but absolute throughput is too low).

#[ktstr_test] exposes the full Assert surface (scheduling gaps, monitor thresholds, NUMA locality, wake-latency benchmarks). See Checking for the merge chain (default_checks() -> Scheduler.assert -> per-test) and the complete threshold list.

Step 8: Run it

Run the test with cargo ktstr test, scoped to this one test name:

cargo ktstr test --kernel ../linux -- -E 'test(mixed_workloads)'

If cargo ktstr test reports “kernel not found”, the --kernel path either points at a directory without a built vmlinux or at a kernel the cache cannot locate. Run cargo ktstr kernel build to populate the cache, or pass an explicit path to a built kernel source tree — see Getting Started: Build a kernel for the resolution order.

If a probe-related error surfaces (“probe skeleton load failed”, “trigger attach failed”), re-run with RUST_LOG=ktstr=debug to see the underlying libbpf reason. Common causes: missing tp_btf target on older kernels (handled automatically by the two-phase fallback), CONFIG_DEBUG_INFO_BTF=n in the guest kernel (rebuild with BTF enabled), or a verifier rejection on a non-optional program (the retry surfaces both the original and retry errors so the verifier output is preserved).

cargo ktstr test resolves the kernel image, boots a VM with the declared topology, runs the test as the guest’s init, and reports the result. A passing run looks like:

    PASS [  11.34s] my_crate::mixed_workloads ktstr/mixed_workloads

A failure prints the violated threshold along with per-cgroup stats:

    FAIL [  12.05s] my_crate::mixed_workloads ktstr/mixed_workloads

--- STDERR ---
ktstr_test 'mixed_workloads' [sched=scx-ktstr] [topo=1n2l2c1t] failed:
  unfair cgroup: spread=22% (10-32%) 2 workers on 2 cpus (threshold 20%)

--- stats ---
4 workers, 4 cpus, 12 migrations, worst_spread=22.4%, worst_gap=180ms
  cg0: workers=2 cpus=2 spread=22.4% gap=180ms migrations=8 iter=15230
  cg1: workers=2 cpus=2 spread=4.1% gap=120ms migrations=4 iter=14870

The detail line unfair cgroup: spread=N% (min-max%) N workers on N cpus (threshold N%) is the exact format produced by assert::assert_not_starved. Other detail-line shapes the same producer emits:

  • tid {N} starved (0 work units) — when a worker made no progress. Example:

    ktstr_test 'mixed_workloads' [topo=1n2l2c1t] failed:
      tid 2 starved (0 work units)
    
  • tid {N} stuck {N}ms on cpu{N} at +{N}ms (threshold {N}ms) — when a worker’s longest off-CPU gap crossed Assert::max_gap_ms. Example:

    ktstr_test 'mixed_workloads' [topo=1n2l2c1t] failed:
      tid 7 stuck 2500ms on cpu3 at +4200ms (threshold 2000ms)
    
  • unfair cgroup: spread={pct}% ({lo}-{hi}%) — when per-cgroup fairness exceeded max_spread_pct. Example:

    ktstr_test 'mixed_workloads' [topo=1n2l2c1t] failed:
      unfair cgroup: spread=22% (10-32%) 2 workers on 2 cpus (threshold 20%)
    

The reporting layer does NOT include the cgroup name — cg{i} is the positional index in the stats roll-up (cg0, cg1, …) matching the cg{i}: workers=... cpus=... spread=... per-cgroup stats line emitted by test_support::eval::evaluate_vm_result.

For the full run lifecycle, sidecar layout, and analysis workflow, see Running Tests.

Step 9: Capture a snapshot

Threshold-based assertions tell you something is off; snapshots tell you what the scheduler’s state actually was. Op::snapshot(name) asks the host to freeze every vCPU long enough to read the BPF (in-kernel program) map state, vCPU registers, and per-CPU counters into a FailureDumpReport keyed by name, then resumes the guest.

execute_defs (used so far) takes a flat list of cgroups and runs them concurrently. To inject a snapshot mid-run, switch to execute_steps, which takes a list of Steps — each step has setup cgroups, an ops list (where Op::snapshot(...) lives), and a hold duration:

use ktstr::prelude::*;

#[ktstr_test(scheduler = KTSTR_SCHED, llcs = 2, cores = 2, threads = 1, duration_s = 20)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_steps(ctx, vec![
        Step {
            setup: Setup::Defs(vec![
                CgroupDef::named("background_spinner")
                    .workers(2)
                    .work_type(WorkType::SpinWait)
                    .with_cpuset(CpusetSpec::Llc(0)),
                CgroupDef::named("phased_worker")
                    .workers(2)
                    .work_type(WorkType::SpinWait)
                    .with_cpuset(CpusetSpec::Llc(1)),
            ]),
            ops: vec![Op::snapshot("after_workload")],
            hold: HoldSpec::FULL,
        },
    ])
}

After the scenario completes, the captured report is keyed by name on the active SnapshotBridge — the host-side store that owns the captured FailureDumpReport map for the run. Downstream test code drains it and walks scalar variables with the dotted-path accessor — e.g. snap.var("nr_cpus_onln").as_u64()? reads a scheduler global (any .bss/.data/.rodata symbol; Snapshot::var walks all three) as a u64.
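
A minimal drain sketch, reusing the post_vm callback surface that Step 14 introduces (only API pieces shown in this book are assumed; the keyed accessor for a single named snapshot lives in Snapshots):

use ktstr::prelude::*;

fn check_after_workload(result: &VmResult) -> Result<()> {
    // Wrap the drained captures (the "after_workload" snapshot included)
    // and bound a scheduler global per sample.
    let series = SampleSeries::from_drained(
        result.snapshot_bridge.drain_ordered_with_stats(),
    );
    let mut v = Verdict::new();
    let nr_cpus: SeriesField<u64> =
        series.bpf("nr_cpus_onln", |snap| snap.var("nr_cpus_onln").as_u64());
    nr_cpus.each(&mut v).at_least(1u64);
    let r = v.into_result();
    anyhow::ensure!(r.passed, "snapshot assertions failed: {:?}", r.details);
    Ok(())
}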

For the bridge wiring, the full traversal API (Snapshot::map, SnapshotEntry::get, per-CPU narrowing, error variants), and the symbol-driven Op::watch_snapshot variant that fires whenever the guest writes a kernel symbol, see Snapshots.

Step 10: Gauntlet expansion

The #[ktstr_test] macro doesn’t just emit a single test – it also generates a gauntlet of variants that run the same body across every accepted topology preset (single-LLC, multi-LLC, multi-NUMA, with/without SMT).

Gauntlet variants are nextest-discovered and run with cargo ktstr test -- --run-ignored ignored-only -E 'test(gauntlet/)'. Constrain coverage with min_llcs / max_llcs, min_cpus / max_cpus, and requires_smt on the attribute. See Gauntlet Tests for the full filtering and worked examples.

Step 11: Name and prioritize workers

Per-cgroup defaults travel through CgroupDef’s builder methods so schedulers that key on task->comm or task_struct->static_prio can be exercised with realistic, distinguishable workers:

CgroupDef::named("background_spinner")
    .workers(2)
    .comm("bg_spinner")           // prctl(PR_SET_NAME, "bg_spinner")
    .nice(10)                     // setpriority(PRIO_PROCESS, 0, 10)
    .work_type(WorkType::SpinWait)

  • .comm("name") — every worker calls prctl(PR_SET_NAME, name) at startup. The kernel truncates task->comm to 15 bytes inside __set_task_comm. Distinguishes workers in top / ps output and in scheduler tracepoints.
  • .nice(n) — every worker calls setpriority(PRIO_PROCESS, 0, n). Values below the calling task’s current nice require CAP_SYS_NICE; ktstr always runs as root in-VM so the full -20..=19 range is available. Skips the syscall entirely when .nice(...) is not chained (workers inherit the parent’s nice).
  • .pcomm("name") — set the thread-group leader’s task->comm. Triggers ktstr’s fork-then-thread spawn path: workers sharing a pcomm value coalesce into ONE forked leader process whose task->group_leader->comm is name, with worker threads inside it. Models real applications like chrome (pcomm) hosting ThreadPoolForeg (per-thread comm) and java (pcomm) hosting GC Thread / C2 CompilerThre.

pcomm is a WorkSpec field, NOT a CloneMode variant. The two real CloneMode variants are Fork (default; each worker is its own thread group) and Thread (workers share the harness’s tgid as std::thread::spawn threads). pcomm triggers an in-process fork-then-thread shape that combines per-process leader visibility schedulers expect with the in-process thread-spawn dispatch the worker bodies use. PipeIo and CachePipe workers placed in a .pcomm(...) cgroup run as threads inside the pcomm container; their pipe-pair partner indices are computed within the container’s thread group, not across forked siblings. SignalStorm uses tkill (per-task signal delivery, PIDTYPE_PID) rather than kill (PIDTYPE_TGID), so the partner-vs-self addressing is correct uniformly across Fork and Thread modes — including inside pcomm-coalesced thread groups.
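
A sketch of the coalesced shape, reusing the builder methods and comm values from above (whether a test needs this depends on what the scheduler under test keys on):

CgroupDef::named("browser_like")
    .workers(4)
    .pcomm("chrome")            // one forked leader: task->group_leader->comm = "chrome"
    .comm("ThreadPoolForeg")    // per-thread comm inside that leader
    .work_type(WorkType::SpinWait)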

Per-WorkSpec overrides win over cgroup-level defaults — write .work(WorkSpec::default().nice(0).comm("hot_spinner")) to opt a specific worker out of the cgroup-level defaults.

Step 12: Inline scheduler config

Schedulers like scx_layered and scx_lavd accept a JSON config via --config /path/to/file.json. Declare the arg template + guest path on a Scheduler const built via the manual builder, then supply the inline content from the test attribute:

const LAYERED_SCHED: Scheduler = Scheduler::new("layered")
    .binary(SchedulerSpec::Discover("scx_layered"))
    .topology(1, 2, 4, 1)
    .config_file_def("--config {file}", "/include-files/layered.json");

const LAYERED_CONFIG: &str = r#"{ "layers": [{ "name": "default" }] }"#;

#[ktstr_test(scheduler = LAYERED_SCHED, config = LAYERED_CONFIG)]
fn layered_default(ctx: &Ctx) -> Result<AssertResult> {
    Ok(AssertResult::pass())
}

The framework writes LAYERED_CONFIG to the guest at the path declared on the scheduler (/include-files/layered.json) and substitutes {file} in the arg template with that path before launching the scheduler binary. A scheduler that declares config_file_def REQUIRES every test to supply config = … (compile-time + runtime gate); a scheduler that doesn’t declare it REJECTS config = … (the content would be silently dropped). See The #[ktstr_test] Macro for the full pairing rules.

For schedulers whose config lives on disk on the host (no inline content), use Scheduler::config_file(host_path) instead — the host file is packed into the initramfs and --config is injected into scheduler args automatically; no config = … on the test is needed in that flavor.

Step 13: Decouple virtual topology from host hardware

By default, ktstr pins vCPUs to host cores in a layout that mirrors the declared virtual topology. A test declaring numa_nodes = 2, llcs = 8 cannot run on a 1-NUMA-node host — the gauntlet preset filter rejects it. Set no_perf_mode = true to drop the host mirroring and run the declared virtual topology unchanged:

#[ktstr_test(
    numa_nodes = 2,
    llcs = 8,             // 8 % 2 == 0; the macro requires divisibility
    cores = 4,
    no_perf_mode = true,  // VM built as declared, even on 1-NUMA hosts
)]
fn two_node_test(ctx: &Ctx) -> Result<AssertResult> { /* ... */ }

In no_perf_mode:

  • The VM’s virtual topology is built as declared via KVM vCPU layout, ACPI SRAT/SLIT (x86_64), or FDT cpu nodes (aarch64) — the guest sees the full requested NUMA / LLC structure.
  • vCPU-to-host-core pinning, 2 MB hugepages, NUMA mbind, RT scheduling, and KVM exit suppression are skipped.
  • Host topology constraints (min_numa_nodes, min_llcs, requires_smt, per-LLC CPU widths) are NOT compared against host hardware. The only host check that survives is “total host CPUs >= total vCPUs”.

no_perf_mode = true is mutually exclusive with performance_mode = true (KtstrTestEntry::validate rejects the combination at runtime). Equivalent to setting KTSTR_NO_PERF_MODE=1 per-test — either source forces the no-perf path. See Performance Mode for the full lifecycle.
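
The env-var spelling of the same switch, scoped to one run:

KTSTR_NO_PERF_MODE=1 cargo ktstr test --kernel ../linux -- -E 'test(two_node_test)'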

Step 14: Periodic capture and temporal assertions

On-demand Op::snapshot (Step 9) captures the scheduler’s BPF state at a point you choose. Periodic capture fires automatically at evenly-spaced points across the workload window, producing a time-ordered SampleSeries (the host-side container of drained snapshots, in capture order; .periodic_only() filters to periodic-tagged samples) for temporal assertions. Periodic capture is only useful when paired with a post_vm callback that drains the bridge and asserts something about the sequence — the two attributes belong together.

Enable periodic capture with num_snapshots = N and register the host-side callback with post_vm = function_name. The callback drains the bridge and runs assertions over the time-ordered series:

use ktstr::prelude::*;

fn check_dispatch_advances(result: &VmResult) -> Result<()> {
    let series = SampleSeries::from_drained(
        result.snapshot_bridge.drain_ordered_with_stats(),
    )
    .periodic_only();

    let mut v = Verdict::new();

    let nr_dispatched: SeriesField<u64> = series.bpf(
        "nr_dispatched",
        |snap| snap.var("nr_dispatched").as_u64(),
    );
    nr_dispatched.nondecreasing(&mut v);

    let r = v.into_result();
    anyhow::ensure!(r.passed, "temporal assertions failed: {:?}", r.details);
    Ok(())
}

#[ktstr_test(
    scheduler = KTSTR_SCHED,
    llcs = 2,
    cores = 2,
    threads = 1,
    duration_s = 20,
    num_snapshots = 5,
    post_vm = check_dispatch_advances,
)]
fn dispatch_advances(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("workers").workers(2).work_type(WorkType::SpinWait),
    ])
}

num_snapshots = 5 fires 5 freeze-and-capture boundaries inside the 10%-90% window of the 20 s workload — at roughly +5 s, +7 s, +10 s, +13 s, +15 s. Each capture freezes every vCPU, reads BPF map state, and resumes. The host watchdog deadline is extended by each freeze duration so captures do not eat into the workload budget. The captures are stored under periodic_000 through periodic_004 on the SnapshotBridge.

Verdict is the assertion accumulator: every pattern call records its outcome on the same Verdict, and v.into_result() consumes it into a pass/fail AssertResult.

The seven temporal patterns on SeriesField:

  Pattern                                 Type     What it checks
  nondecreasing                           u64/f64  Every consecutive pair: v[i] <= v[i+1]
  strictly_increasing                     u64/f64  Every consecutive pair: v[i] < v[i+1]
  rate_within(lo, hi)                     f64      Per-pair delta_value / delta_ms in [lo, hi]
  steady_within(warmup_ms, tol)           f64      Post-warmup values within mean ± tol%
  converges_to(target, tol, deadline_ms)  f64      3 consecutive samples in [target ± tol] before deadline
  ratio_within(other, lo, hi)             f64      Per-sample self / other in [lo, hi] (cross-field)
  always_true                             bool     Every sample is true

Every pattern method takes &mut Verdict as its first argument and returns it, so calls chain into the same accumulator.

SeriesField::each provides per-sample scalar bounds: field.each(&mut v).at_least(1u64), field.each(&mut v).between(0.0, 100.0).

When a temporal pattern fails, the AssertDetail entries identify the offending sample by tag and elapsed-ms timestamp. Example for nondecreasing flagging a regression on nr_dispatched:

nr_dispatched (nondecreasing): regression at sample periodic_002 (+10000ms): \
value 41 after prior value 42 at sample periodic_001 (+7000ms)

The rate, steady, converges, ratio, and always-true variants emit parallel shapes — every detail names the pattern, the specific sample(s) involved, and the violating value, so a failing test points at the data without re-running.

For boundary timing, spacing rules, and the bridge cap, see Periodic Capture. For the full projection API (bpf, stats, auto-projectors) and failure rendering, see Temporal Assertions.

Step 15: After the run — test statistics

cargo ktstr stats aggregates the sidecar JSON files that each test variant writes — useful for tracking gauntlet coverage, BPF verifier complexity, and scheduling behavior across commits. This is a post-run CLI workflow, not part of the test definition:

cargo ktstr stats                                 # summary: gauntlet coverage, verifier, KVM stats
cargo ktstr stats list                            # list runs with date, test count, arch
cargo ktstr stats compare --a-kernel 6.14 \       # diff sidecar partitions defined by
    --b-kernel 6.15                               #   per-side --a-X / --b-X filter flags

Statistics are collected even on test failure (if: !cancelled() in CI). For the full subcommand surface, see cargo-ktstr stats.

The complete test

The shape exercised by every step above, in one file. sched_args = ["--slow"] always-applies scx-ktstr’s --slow mode (Step 2); watchdog_timeout_s = 10 overrides the sched_ext stall threshold (Step 6); num_snapshots + post_vm enable periodic capture and a temporal assertion (Step 14):

use std::time::Duration;
use ktstr::declare_scheduler;
use ktstr::prelude::*;

declare_scheduler!(KTSTR_SCHED, {
    name = "ktstr_sched",
    binary = "scx-ktstr",
    sched_args = ["--slow"],
});

fn check_dispatch_advances(result: &VmResult) -> Result<()> {
    let series = SampleSeries::from_drained(
        result.snapshot_bridge.drain_ordered_with_stats(),
    )
    .periodic_only();

    let mut v = Verdict::new();

    let nr_dispatched: SeriesField<u64> = series.bpf(
        "nr_dispatched",
        |snap| snap.var("nr_dispatched").as_u64(),
    );
    nr_dispatched.nondecreasing(&mut v);

    let r = v.into_result();
    anyhow::ensure!(r.passed, "temporal assertions failed: {:?}", r.details);
    Ok(())
}

#[ktstr_test(
    scheduler = KTSTR_SCHED,
    llcs = 2,
    cores = 2,
    threads = 1,
    duration_s = 20,
    watchdog_timeout_s = 10,
    isolation = true,
    max_spread_pct = 20.0,
    max_throughput_cv = 0.5,
    min_work_rate = 1.0,
    num_snapshots = 5,
    post_vm = check_dispatch_advances,
)]
fn mixed_workloads(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("background_spinner")
            .workers(2)
            .work_type(WorkType::SpinWait)
            .with_cpuset(CpusetSpec::Llc(0)),
        CgroupDef::named("phased_worker")
            .workers(2)
            .work_type(WorkType::sequence(
                Phase::Spin(Duration::from_millis(100)),
                [Phase::Yield(Duration::from_millis(20))],
            ))
            .with_cpuset(CpusetSpec::Llc(1)),
    ])
}

Run it:

cargo ktstr test --kernel ../linux -- -E 'test(mixed_workloads)'

What you’ll see when things break

The output examples below are the shapes ktstr emits in real runs. They’re worth skimming before you ship a test so a future failure is recognizable.

Auto-repro probe chain

When the scheduler crashes, ktstr re-runs the scenario with BPF probes attached and dumps the path leading to the exit. Decoded struct fields appear inline, with → between the fentry-captured entry value and the fexit-captured exit value:

ktstr_test 'demo_host_crash_auto_repro' [sched=scx-ktstr] [topo=1n1l2c1t] failed:
  scheduler died

--- auto-repro ---
=== AUTO-PROBE: scx_exit fired ===

  ktstr_enqueue                                                   main.bpf.c:21
    task_struct *p
      pid         97
      cpus_ptr    0xf(0-3)
      dsq_id      SCX_DSQ_INVALID
      enq_flags   NONE
      slice       0
      vtime       0
      weight      100
      sticky_cpu  -1
      scx_flags   QUEUED|ENABLED
  do_enqueue_task                                               kernel/sched/ext.c:1344
    rq *rq
      cpu         1
    task_struct *p
      pid         97
      cpus_ptr    0xf(0-3)
      dsq_id      SCX_DSQ_INVALID          →  SCX_DSQ_LOCAL
      enq_flags   NONE
      slice       20000000
      vtime       0
      weight      100
      sticky_cpu  -1
      scx_flags   QUEUED|DEQD_FOR_SLEEP    →  QUEUED

For the probe pipeline architecture, the BTF resolution path, event-stitching rules, and the demo_host_crash_auto_repro fixture, see Auto-Repro.

Failure dumps with cast-recovered pointers

The freeze coordinator builds a FailureDumpReport on every snapshot, periodic capture, and post-failure dump. Each captured map prints as a map <name> (type=..., value_size=..., max_entries=...) header followed by the rendered value (single-entry global sections like .bss/.data) or entry: key=... blocks (multi-entry maps). u64 fields the cast analyzer flagged as typed pointers chase to the recovered struct and print with a (cast→arena) or (cast→kernel) annotation distinguishing them from BTF-typed pointers; an (sdt_alloc) suffix is added when the sdt_alloc bridge recovered the real payload struct from a forward-declared pointee. A separate cross-BTF Fwd resolution path also recovers a forward-declared pointee whose body lives in a sibling embedded BPF object’s BTF — that path adds no annotation, the body is rendered transparently:

map scx_lavd.bss (type=array, value_size=4096, max_entries=1)
.bss:
  nr_cpus_onln=4
  task_ctx_root 0xffff888103a01000 (cast→arena) → task_ctx{cpu_id=2, last_runtime_ns=12345678, nice=0}
  current_task 0xffff90124f80c000 (cast→kernel) → task_struct:
    pid=4321   weight=100
    cpus_ptr 0xffff888103b40000 → cpus={0-3}
  taskc_data 0x7f0000080000 (cast→arena (sdt_alloc)) → task_data{slice_ns=20000000, vtime=12345678}

A field that the analyzer cannot prove is a pointer falls back to its raw u64 shape, which is the prior behavior — no test-author configuration is required either way.

Verifier output

cargo ktstr verifier runs the BPF verifier against every declare_scheduler!-registered scheduler’s struct_ops programs inside a real kernel and prints per-program verified-instruction counts. The dispatcher hands off to cargo nextest run -E 'test(/^verifier/)'; nextest fans out across (scheduler × declared kernel × accepted topology preset) cells, each cell booting its own VM. Per-cell output starts with a banner identifying the axis values:

=== ktstr_sched | kernel kernel_6_14_2 | topology tiny-1llc ===

verifier
  enqueue                                  verified_insns=42

verifier --- verifier stats ---
  processed=42  states=8/10

verifier --- scheduler log ---
func#0 @0
0: R1=ctx() R10=fp0
processed 42 insns (limit 1000000) max_states_per_insn 1 total_states 10 peak_states 8 mark_read 5

When the scheduler did not capture a log, the output is just the per-program table:

=== ktstr_sched | kernel kernel_6_14_2 | topology tiny-1llc ===

verifier
  enqueue                                  verified_insns=500
  dispatch                                 verified_insns=1200
  init                                     verified_insns=300

--raw disables cycle collapsing in the scheduler-log section. --kernel A --kernel B runs the sweep against multiple kernels; the cell handler walks KTSTR_KERNEL_LIST to match each cell’s sanitized kernel label against the resolved set. For the full verifier-sweep model, cycle-collapse rules, and the cell-name → kernel matching contract, see Verifier.

What’s next

  • Custom Scenarios – when the declarative ops API is not enough and the scenario needs arbitrary Rust logic between phases.
  • Ops and Steps – multi-phase scenarios: add/remove cgroups, swap cpusets, freeze, resume.
  • Watch SnapshotsOp::watch_snapshot("symbol") registers a hardware data-write watchpoint (up to 3 per scenario; slot 0 is reserved for the error-class exit_kind trigger).
  • MemPolicy – NUMA-aware tests that bind memory allocations to specific nodes and check page locality.
  • Performance Mode – pinned vCPUs, hugepages, and LLC-exclusivity validation for benchmark-grade runs.
  • Auto-Repro – on a scheduler crash, ktstr can boot a second VM with probes attached and dump the failing state automatically.
  • Recipes – task-specific guides (test a new scheduler, A/B compare branches, customize checking, benchmarking, host-state diff, ctprof).

Running Tests

Tests run via cargo ktstr test --kernel ../linux, which resolves the kernel and wraps cargo nextest run to boot KVM virtual machines for each #[ktstr_test] entry. Raw cargo nextest run remains available as a fallback once a kernel is in place via the discovery chain.

Quick reference

# Run all tests
cargo ktstr test --kernel ../linux

# Run a specific test
cargo ktstr test --kernel ../linux -- -E 'test(sched_basic_proportional)'

# Run ignored gauntlet tests
cargo ktstr test --kernel ../linux -- --run-ignored ignored-only -E 'test(gauntlet/)'

Run analysis

Each test invocation writes a *.ktstr.json sidecar per variant into {CARGO_TARGET_DIR or "target"}/ktstr/{kernel}-{project_commit}/. cargo ktstr stats list enumerates runs; cargo ktstr stats compare, list-values, list-metrics, and show-host operate on those sidecars. See Runs for the directory layout, last-writer-wins semantics, and the comparison workflow.

Budget-based test selection

Set KTSTR_BUDGET_SECS to select the subset of tests that maximizes feature coverage within a time budget. Useful for CI pipelines or quick smoke tests.

# Run the best 5 minutes of tests
KTSTR_BUDGET_SECS=300 cargo ktstr test --kernel ../linux

# Budget applies to gauntlet variants too
KTSTR_BUDGET_SECS=600 cargo ktstr test --kernel ../linux -- --run-ignored all

The selector encodes each test as a bitset of properties (scheduler, topology class, SMT, workload characteristics) and greedily picks tests with the highest marginal coverage per estimated second. Duration estimates account for VM boot overhead based on vCPU count.
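
A minimal sketch of that greedy loop (the Candidate shape and the u64 feature bitset are illustrative, not ktstr’s internal types; est_secs is assumed positive):

struct Candidate {
    name: &'static str,
    features: u64, // one bit per configuration property the test covers
    est_secs: f64, // estimated duration, including VM boot overhead
}

fn select_within_budget(mut pool: Vec<Candidate>, budget_secs: f64) -> Vec<&'static str> {
    let (mut covered, mut used, mut picked) = (0u64, 0.0f64, Vec::new());
    loop {
        // Highest marginal coverage per estimated second, among candidates
        // that still fit in the remaining budget.
        let best = pool
            .iter()
            .enumerate()
            .filter(|(_, c)| used + c.est_secs <= budget_secs)
            .map(|(i, c)| (i, (c.features & !covered).count_ones() as f64 / c.est_secs))
            .filter(|&(_, gain)| gain > 0.0)
            .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
        let Some((i, _)) = best else { return picked };
        let c = pool.swap_remove(i);
        covered |= c.features;
        used += c.est_secs;
        picked.push(c.name);
    }
}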

A summary is printed to stderr during --list:

ktstr budget: 42/1200 tests, 295/300s used, 38/38 configurations covered

When KTSTR_BUDGET_SECS is not set, all tests are listed as usual.

Custom scheduler

Declare a scheduler with declare_scheduler! and reference the bare const from #[ktstr_test(scheduler = ...)]:

use ktstr::declare_scheduler;
use ktstr::prelude::*;

declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
});

#[ktstr_test(scheduler = MY_SCHED)]
fn my_sched_test(ctx: &Ctx) -> Result<AssertResult> {
    Ok(AssertResult::pass())
}

The binary is injected into the VM’s initramfs and started before scenarios run. See Test a New Scheduler for the full end-to-end workflow, and Payload Definitions for the #[derive(Payload)] macro that handles binary-kind workloads (schbench, fio, etc.) — distinct from the scheduler-under-test surface.

Single Scenario

Running a specific test

cargo ktstr test --kernel ../linux -- -E 'test(sched_basic_proportional)'

Running with verbose output

RUST_BACKTRACE=1 cargo ktstr test --kernel ../linux -- -E 'test(sched_basic_proportional)'

Investigating failures

Run one test with verbose output to see scheduler logs and kernel console:

RUST_BACKTRACE=1 cargo ktstr test --kernel ../linux -- -E 'test(cover_cgroup_cpuset_cross_llc_race)'

VM topology

Each #[ktstr_test] declares its topology via macro attributes:

#[ktstr_test(llcs = 2, cores = 4, threads = 2)]

The test framework boots a VM with the specified topology automatically.

See Investigate a Crash for interpreting failure output and Troubleshooting for common error messages.

Gauntlet

The gauntlet runs every test across 24 topology presets (14 on aarch64). Gauntlet variants are prefixed with gauntlet/ and ignored by default.

# Run only base tests (default)
cargo ktstr test --kernel ../linux

# Run only gauntlet variants
cargo ktstr test --kernel ../linux -- --run-ignored ignored-only -E 'test(gauntlet/)'

# Run everything
cargo ktstr test --kernel ../linux -- --run-ignored all

Entries with host_only = true never produce gauntlet variants (topology variation is meaningless without a VM). See host_only for how that flag is set.

Variant naming

Single-kernel runs name each gauntlet variant gauntlet/{test_name}/{preset}:

  • {test_name} – the #[ktstr_test] function name
  • {preset} – one of the topology preset names below

When --kernel resolves to two or more kernels (multiple --kernel flags or a START..END range that expands to several releases), cargo ktstr test / coverage / llvm-cov add the kernel as a third dimension and append a {kernel_label} suffix: gauntlet/{test_name}/{preset}/{kernel_label}. See Multi-kernel: kernel as a gauntlet dimension for how the kernel labels are derived (sanitized from the resolved version, range expansion, cache key, git source, or path basename).

To run a single variant:

cargo ktstr test --kernel ../linux -- --run-ignored ignored-only \
  -E 'test(=gauntlet/my_test/smt-2llc)'

Topology presets

  Preset              Topology     CPUs  LLCs  NUMA  Description
  tiny-1llc           1n1l4c1t        4     1     1  Single LLC
  tiny-2llc           1n2l2c1t        4     2     1  Minimal multi-LLC
  odd-3llc            1n3l3c1t        9     3     1  Odd CPU count
  odd-5llc            1n5l3c1t       15     5     1  Prime LLC count
  odd-7llc            1n7l2c1t       14     7     1  Prime LLC count
  smt-2llc            1n2l2c2t        8     2     1  SMT enabled
  smt-3llc            1n3l2c2t       12     3     1  SMT, 3 LLCs
  medium-4llc         1n4l4c2t       32     4     1  Medium topology
  medium-8llc         1n8l4c2t       64     8     1  Medium, many LLCs
  large-4llc          1n4l16c2t     128     4     1  Large, few LLCs
  large-8llc          1n8l8c2t      128     8     1  Large, many LLCs
  near-max-llc        1n15l8c2t     240    15     1  Near maximum
  max-cpu             1n14l9c2t     252    14     1  Near KVM vCPU limit
  medium-4llc-nosmt   1n4l8c1t       32     4     1  Medium, no SMT
  medium-8llc-nosmt   1n8l8c1t       64     8     1  Medium, many LLCs, no SMT
  large-4llc-nosmt    1n4l32c1t     128     4     1  Large, no SMT
  large-8llc-nosmt    1n8l16c1t     128     8     1  Large, many LLCs, no SMT
  near-max-llc-nosmt  1n15l16c1t    240    15     1  Near maximum, no SMT
  max-cpu-nosmt       1n14l18c1t    252    14     1  Near KVM vCPU limit, no SMT
  numa2-4llc          2n4l4c1t       16     4     2  Multi-NUMA, 2 nodes
  numa2-8llc          2n8l8c2t      128     8     2  Multi-NUMA, 2 nodes, SMT
  numa2-8llc-nosmt    2n8l16c1t     128     8     2  Multi-NUMA, 2 nodes, no SMT
  numa4-8llc          4n8l4c1t       32     8     4  Multi-NUMA, 4 nodes
  numa4-12llc         4n12l8c2t     192    12     4  Multi-NUMA, 4 nodes, SMT

Topology format: {numa_nodes}n{llcs}l{cores_per_llc}c{threads_per_core}t (e.g. 1n2l4c2t = 1 NUMA node, 2 LLCs, 4 cores per LLC, 2 threads per core = 16 CPUs). Presets are defined in gauntlet_presets(). Multi-NUMA presets are excluded by default (max_numa_nodes: Some(1) in TopologyConstraints::DEFAULT), so tests opt in to NUMA testing by raising max_numa_nodes.

aarch64: ARM64 CPUs do not have SMT. Presets with threads_per_core > 1 are excluded on aarch64, leaving 14 presets (the 5 small presets, 6 -nosmt variants, and 3 non-SMT NUMA presets).

Constraint filtering

#[ktstr_test] topology constraints filter which presets a test runs on. A preset is skipped when any constraint is not met:

  • num_numa_nodes() < min_numa_nodes
  • max_numa_nodes is set and num_numa_nodes() > max_numa_nodes
  • num_llcs() < min_llcs
  • max_llcs is set and num_llcs() > max_llcs
  • requires_smt and threads_per_core < 2
  • total_cpus() < min_cpus
  • max_cpus is set and total_cpus() > max_cpus

See Topology Constraints for the full attribute table and Gauntlet Tests for a worked example showing which presets survive a given constraint set.
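
For example, this attribute set keeps only the SMT presets with at least 2 LLCs and at most 64 CPUs: smt-2llc, smt-3llc, medium-4llc, and medium-8llc from the table above (a sketch; the body is elided):

#[ktstr_test(min_llcs = 2, max_cpus = 64, requires_smt = true)]
fn smt_packed(ctx: &Ctx) -> Result<AssertResult> {
    Ok(AssertResult::pass())
}

Raising max_numa_nodes would additionally admit the SMT NUMA presets, except max_cpus = 64 already excludes both (numa2-8llc at 128 CPUs, numa4-12llc at 192).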

Budget interaction

When KTSTR_BUDGET_SECS is set, greedy coverage maximization selects the most diverse set of test configurations within the time budget. Each candidate test is represented as a feature bitset (CPU count bucket, LLC count, SMT vs non-SMT, etc.). The selector greedily picks tests that cover the most uncovered feature bits per estimated second. The result is a mix of base tests and gauntlet variants that maximizes configuration diversity within the budget.

See Budget-based test selection.

Memory allocation

Each gauntlet VM gets max(topology_mb, initramfs_floor) MB of RAM, where topology_mb = max(cpus * 64, 256, entry.memory_mb) is the topology-requested minimum and initramfs_floor is computed from the actual initramfs size after build. For max-cpu (252 CPUs) the topology minimum is at least 16128 MB.
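
At the small end the formula bottoms out on the constant: tiny-1llc (4 CPUs) gives topology_mb = max(4 × 64, 256) = 256 MB (absent an entry.memory_mb override), so the initramfs floor decides whether that VM gets more.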

Runs

Each cargo ktstr test --kernel ../linux invocation writes per-test result sidecars into a run directory under {CARGO_TARGET_DIR or "target"}/ktstr/. The directory is the record of the latest test run for that (kernel, project commit) pair – there is no separate “baselines” cache.

Warning: Re-running the suite at the same kernel and project commit reuses the same directory and deletes prior sidecars at the first sidecar write of the new run. To preserve a previous run’s outputs, archive the directory elsewhere first (e.g. mv target/ktstr/6.14-abc1234 target/ktstr/6.14-abc1234.archived-{date}) or commit your changes (or amend to drop a -dirty suffix) so the next run lands in a separate snapshot directory.

Layout

target/
└── ktstr/
    ├── 6.14-abc1234/        # one run: kernel 6.14, project commit abc1234 (clean)
    │   ├── test_a.ktstr.json
    │   └── test_b.ktstr.json
    └── 7.0-def5678-dirty/   # another run: kernel 7.0, project commit def5678 with uncommitted changes
        ├── test_a.ktstr.json
        └── test_b.ktstr.json

Each subdirectory is keyed {kernel}-{project_commit}, where {kernel} is the kernel version resolved from the directory KTSTR_KERNEL points at — first the version field in its metadata.json, else the content of include/config/kernel.release, else unknown (when KTSTR_KERNEL is unset or neither file yields a version) — and {project_commit} is the project tree’s HEAD short hex (7 chars), suffixed -dirty when the worktree differs from HEAD, or the literal unknown when the test process is not running inside a git repository.

The commit is discovered by walking parents of the test process’s working directory until a .git marker is found — for a scheduler crate using ktstr as a dev-dependency, this is the scheduler crate’s commit, not ktstr’s. The function that performs the probe (detect_project_commit) is called from the test process’s cwd, so running tests from inside the scheduler crate’s clone yields that crate’s HEAD. Run from inside ktstr’s clone if you want to record ktstr’s HEAD instead.
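
The walk itself is a simple ancestor scan; a sketch of the same probe shape (illustrative, not ktstr’s detect_project_commit):

use std::path::Path;

/// Walk from `start` toward the filesystem root, returning the first
/// directory that contains a `.git` marker (file or directory).
fn find_git_root(start: &Path) -> Option<&Path> {
    start.ancestors().find(|dir| dir.join(".git").exists())
}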

Two runs sharing the same kernel and project commit (the typical “re-run the suite without committing changes” loop) reuse the same directory: the second run pre-clears any prior *.ktstr.json files in the directory at first sidecar write so the directory is a last-writer-wins snapshot of (kernel, project commit), not an append-only archive of every invocation. Re-run the suite to regenerate the sidecars; commit your changes (or amend to drop the -dirty suffix) to land a separate snapshot directory.

Pre-clear is shallow — only *.ktstr.json files in the immediate run directory are removed. Subdirectories created by external orchestrators (per-job gauntlet layouts, cluster shards) are left untouched, but cargo ktstr stats walks one level of subdirectories when collecting sidecars, so stale sidecar files left in subdirectories from a prior run will still appear in stats output. Operators driving subdirectory layouts must clean those subdirectories themselves; pre-clear’s contract covers the top-level only.

Filesystem requirement

The runs root must reside on a local filesystem (ext4, xfs, btrfs, tmpfs). NFS and other remote filesystems are rejected by the advisory lock used for cross-process sidecar-write serialization.

Unknown-commit collisions

When the test process is not inside a git repository (so detect_project_commit returns None), the on-disk dirname uses the literal sentinel unknown in the commit slot — every such run lands in {kernel}-unknown. Concurrent or successive non-git runs collide on this single directory, with the latest run pre-clearing the previous one’s sidecars. To disambiguate non-git runs, set KTSTR_SIDECAR_DIR to a per-run path or place the project tree under git so each run carries its own commit hash.

ktstr emits a one-shot stderr warning on first sidecar write under this configuration; setting KTSTR_SIDECAR_DIR both disambiguates the run and suppresses the warning (the override branch returns from sidecar_dir before the warning site is reached).

The unknown sentinel applies to the dirname only. The in-memory SidecarResult.project_commit field stays None (serialized as JSON null) for these runs — the dirname uses a filesystem-safe sentinel, while the JSON field preserves the original probe outcome. As a consequence, cargo ktstr stats compare --project-commit unknown will not match a sidecar whose project_commit is None; omit the --project-commit filter entirely to include None-commit rows in the comparison.

KTSTR_SIDECAR_DIR overrides the sidecar directory itself (used as-is, no key suffix), not the parent. The override only affects where new sidecars are written and what bare cargo ktstr stats reads. When the override is set, pre-clear is skipped — the operator chose that directory and owns its contents, so any pre-existing sidecars there are preserved. cargo ktstr stats list, cargo ktstr stats compare, cargo ktstr stats list-values, and cargo ktstr stats show-host all walk {CARGO_TARGET_DIR or "target"}/ktstr/ by default — pass --dir DIR on compare / list-values / show-host to point them at an alternate run root (e.g. an archived sidecar tree copied off a CI host). They do NOT consult KTSTR_SIDECAR_DIR.

Workflow

  1. Run tests for kernel A:

    cargo ktstr test --kernel 6.14
    
  2. Run again for kernel B:

    cargo ktstr test --kernel 7.0
    
  3. List runs:

    cargo ktstr stats list
    

    Each row carries RUN, TESTS, DATE, and ARCH columns. DATE is the earliest sidecar timestamp present in the directory — under the last-writer-wins semantics, this equals the most recent run’s first sidecar timestamp (the prior run’s sidecars were pre-cleared at the new run’s first write, so only the new run’s timestamps remain). ARCH is the host.arch value (x86_64, aarch64, …) from the run’s first sidecar, or - when no sidecar carries a populated host context. Rows are ordered by directory mtime, most recent first.

  4. Compare across dimensions:

    cargo ktstr stats compare --a-kernel 6.14 --b-kernel 7.0
    cargo ktstr stats compare --a-kernel 6.14 --b-kernel 7.0 -E cgroup_steady
    cargo ktstr stats compare --a-scheduler scx_rusty --b-scheduler scx_lavd --kernel 6.14
    cargo ktstr stats compare --a-project-commit abcdef1 --b-project-commit fedcba2
    cargo ktstr stats compare --a-project-commit abc1234 --b-project-commit abc1234-dirty
    cargo ktstr stats compare --a-kernel-commit abcdef1 --b-kernel-commit fedcba2
    cargo ktstr stats compare --a-run-source ci --b-run-source local
    

    The abc1234 vs abc1234-dirty row is the canonical WIP-vs-baseline pattern: run the suite once at a clean commit to capture the baseline directory {kernel}-abc1234, edit the tree without committing, run the suite again to capture {kernel}-abc1234-dirty, then diff the two. Both sidecar pools coexist under target/ktstr/ because the -dirty suffix makes them distinct directories.
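
    The loop in commands (abc1234 stands in for your clean HEAD short hash):

    cargo ktstr test --kernel 6.14    # clean tree: sidecars land in 6.14-abc1234/
    # ...edit the tree without committing...
    cargo ktstr test --kernel 6.14    # dirty tree: sidecars land in 6.14-abc1234-dirty/
    cargo ktstr stats compare --a-project-commit abc1234 --b-project-commit abc1234-dirty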

    Per-side filters (--a-* / --b-*) partition the sidecar pool into two sides; shared filters (--kernel, --scheduler, --project-commit, --kernel-commit, --run-source, etc.) pin both sides. The seven slicing dimensions are kernel, scheduler, topology, work-type, project-commit, kernel-commit, and run-source; differing on any subset of them defines the A/B contrast. Per-metric deltas are computed using the unified MetricDef registry (polarity, absolute and relative thresholds). Output is colored: red for regressions, green for improvements. The command exits non-zero when regressions are detected. Use cargo ktstr stats list-values to discover available dimension values before constructing a comparison.

  5. Print analysis for the most recent run (no subcommand):

    cargo ktstr stats
    

    Picks the newest subdirectory under target/ktstr/ by mtime and prints gauntlet analysis, BPF verifier stats, callback profile, and KVM stats.

  6. Inspect the archived host context for a specific run:

    cargo ktstr stats show-host --run 6.14-abc1234
    cargo ktstr stats show-host --run archive-2024-01-15 --dir /tmp/archived-runs
    

    Resolves --run against target/ktstr/ (or --dir when set), scans the run’s sidecars in order, and renders the first populated host-context field via HostContext::format_human: CPU model, memory config, transparent-hugepage policy, NUMA node count, uname triple, kernel cmdline, and every /proc/sys/kernel/sched_* tunable. Same fingerprint stats compare uses for its host-delta section, but available on a single run. Fails with an actionable error when no sidecar carries a host field (pre-enrichment run).

Metric registry discovery

Before configuring per-metric ComparisonPolicy overrides, enumerate the available metric names:

cargo ktstr stats list-metrics
cargo ktstr stats list-metrics --json

Prints the ktstr::stats::METRICS registry: metric name, polarity (higher / lower better), default_abs and default_rel gate thresholds, and display unit. Use the metric names from this list as keys in ComparisonPolicy.per_metric_percent; unknown names are rejected at --policy load time so typos surface loudly. --json emits the same data as a serde array — the row accessor function is omitted (#[serde(skip)]) so the wire surface carries only wire-stable fields.
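
A purely hypothetical illustration of the keying rule (the on-disk policy shape is not reproduced in this chapter; metric names must come from list-metrics, and worst_spread here is a placeholder, not a guaranteed registry entry):

{ "per_metric_percent": { "worst_spread": 10.0 } }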

Sidecar format

Each test writes a SidecarResult JSON file containing the test name, topology, scheduler, work type, pass/fail, per-cgroup stats, monitor summary, stimulus events, verifier stats, KVM stats, effective sysctls, kernel command-line args, kernel version, timestamp, and run ID. Files are named with a .ktstr. infix for discovery. cargo ktstr stats reads all sidecar files from a run directory (recursing one level for gauntlet per-job subdirectories).

See also: KTSTR_SIDECAR_DIR.

ktstr

ktstr is the standalone debugging companion to the #[ktstr_test] test harness. It owns kernel cache management, interactive VM shells, host-wide per-thread profiling, and lock introspection — the operations a scheduler author reaches for when investigating a test failure.

To reproduce a test scenario as a self-contained shell script without a VM, use cargo ktstr export. To run the test suite, use cargo ktstr test.

See also cargo ktstr for the cargo-integrated companion that also covers test execution, coverage, BPF verifier stats, and gauntlet statistics.

Build from the workspace:

cargo build --bin ktstr

Subcommands

topo

Show the host CPU topology (CPUs, LLCs, NUMA nodes):

ktstr topo

kernel

The kernel subcommand manages cached kernel images. Subcommands: list, build, clean. See cargo-ktstr kernel for full documentation – the kernel subcommands are identical in both binaries.

shell

Boot an interactive shell in a KVM virtual machine. Launches a VM with busybox and drops into a shell.

ktstr shell
ktstr shell --kernel ../linux
ktstr shell --kernel 6.14.2
ktstr shell --topology 1,2,4,1
ktstr shell -i /path/to/binary
ktstr shell -i my_tool -i another_tool

Files and directories passed via -i are available at /include-files/<name> inside the guest. Directories are walked recursively, preserving structure (e.g. -i ./release includes all files under release/ at /include-files/release/...). Bare names (without path separators) are resolved via PATH lookup. Dynamically-linked ELF binaries get automatic shared library resolution via ELF DT_NEEDED parsing. Non-ELF files are copied as-is.

Stdin must be a terminal. The host terminal enters raw mode for bidirectional stdin/stdout forwarding. Terminal state is restored on all exit paths.

  • --kernel ID (default: auto) — Kernel identifier: a source directory path (e.g. ../linux), a version (6.14.2 or major.minor prefix 6.14), or a cache key (see ktstr kernel list). Raw image files are rejected. Source directories auto-build; versions auto-download from kernel.org on cache miss. When absent, resolves via cache then filesystem and falls back to downloading the latest stable kernel.
  • --topology N,L,C,T (default: 1,1,1,1) — Virtual CPU topology as numa_nodes,llcs,cores,threads. All values must be >= 1.
  • -i, --include-files PATH — Files or directories to include in the guest. Repeatable. Directories are walked recursively.
  • --memory-mb MB (default: auto) — Guest memory in MB (minimum 128). When absent, estimated from payload and include file sizes.
  • --dmesg (default: off) — Forward kernel console (COM1/dmesg) to stderr in real-time. Sets loglevel=7 for verbose kernel output.
  • --exec CMD — Run a command in the VM instead of an interactive shell. The VM exits after the command completes.
  • --no-perf-mode (default: off) — Disable all performance mode features (flock, pinning, RT scheduling, hugepages, NUMA mbind, KVM exit suppression). Also settable via KTSTR_NO_PERF_MODE env var.
  • --cpu-cap N (default: unset) — Reserve only N host CPUs for the shell VM (integer ≥ 1). Requires --no-perf-mode — perf-mode already holds every LLC exclusively, so capping under perf-mode would double-reserve. The planner walks whole LLCs in consolidation- and NUMA-aware order, partial-taking the last LLC so plan.cpus.len() == N exactly. Mutually exclusive with KTSTR_BYPASS_LLC_LOCKS=1. Also settable via KTSTR_CPU_CAP env var (CLI flag wins when both are present).

cargo ktstr shell runs the same VM boot flow and differs in one respect: it accepts raw image file paths for --kernel (e.g. bzImage, Image). Source-tree directories auto-build and no-kernel invocations auto-download — same as ktstr shell.

ctprof

Capture or compare a host-wide per-thread state snapshot. Useful for diagnosing “the scheduler looks fine but something on the host is still behaving oddly” by producing a baseline/candidate diff of every live thread’s scheduling, memory, and I/O counters — a superset of what any single test’s sidecar captures.

ktstr ctprof capture --output baseline.ctprof.zst
# ... run workload of interest ...
ktstr ctprof capture --output candidate.ctprof.zst
ktstr ctprof compare baseline.ctprof.zst candidate.ctprof.zst

capture walks /proc at capture time and writes every visible thread’s metric values as zstd-compressed JSON (conventional extension .ctprof.zst). Four metric classes are recorded:

  • cumulative counters — from schedstat / sched / status CSW / page faults / I/O bytes / taskstats;
  • lifetime peaks — from schedstat *_max and hiwater_*;
  • instantaneous gauges — sampled at capture time, including nr_threads, fair_slice_ns, state;
  • categorical / ordinal scalars — including policy, nice, cpu_affinity, identity strings.

Cumulative counters and lifetime peaks are probe-timing-invariant — sampled twice, the value either monotonically increased or stayed at its high-water mark — so a diff between two snapshots measures exactly the activity over the window. Instantaneous gauges and categorical scalars are point-in-time readings that can legitimately differ between two probes of the same thread. Per-cgroup aggregates (cpu.stat, memory.current) are captured once per distinct path. Capture is read-only; nothing is attached, no kprobes, no tracing.

  • -o, --output PATH (required) — Destination path (convention: .ctprof.zst). Existing files are overwritten.

compare joins two snapshots on the selected grouping axis (pcomm by default) and renders a per-metric baseline/candidate/delta table. The join key survives across captures taken on different hosts or after process restarts, so deltas reflect the behavior of the named workload rather than a specific pid. Metrics with cumulative semantics (CPU time, page faults, wait time) show the candidate-minus-baseline delta; instantaneous metrics (affinity, cgroup path) show the value at candidate capture time. See the ctprof reference for the full metric registry, aggregation rules, derived-metric formulas, and taskstats kconfig gating.
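
For example, every flag value below comes from the flag table that follows:

ktstr ctprof compare baseline.ctprof.zst candidate.ctprof.zst \
  --group-by cgroup --sections taskstats-delay --display-format delta-only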

  • BASELINE (required) — Path to the baseline .ctprof.zst snapshot.
  • CANDIDATE (required) — Path to the candidate .ctprof.zst snapshot.
  • --group-by AXIS (default: pcomm) — Grouping axis: pcomm (process name), cgroup (cgroup v2 path), comm (thread-name pattern, token-normalized), or comm-exact (synonym for comm --no-thread-normalize).
  • --cgroup-flatten GLOB — Glob pattern that collapses dynamic cgroup path segments before grouping (e.g. '/kubepods/*/workload'). Repeatable; explicit globs apply before auto-normalize.
  • --no-thread-normalize (default: off) — Disable token-based pattern normalization for --group-by comm. Threads group by literal comm.
  • --no-cg-normalize (default: off) — Disable token-based normalization for --group-by cgroup. Cgroup paths group by literal post-flatten path.
  • --sort-by SPEC (default: by largest |delta_pct|) — Multi-key sort spec: metric1[:dir1],metric2[:dir2],.... Each metric is a name from ctprof metric-list; each dir is asc or desc (default desc).
  • --display-format FORMAT (default: full) — Per-row column layout: full (the default; 7 columns), delta-only (drop baseline+candidate), no-pct, arrow (collapse baseline/candidate/delta into one cell), or pct-only.
  • --columns NAMES — Comma-separated column list overriding --display-format. Valid names: group, threads, metric, baseline, candidate, delta, %, arrow. Order is the rendered order.
  • --sections NAMES (default: every) — Comma-separated sub-table list. Valid names: primary, taskstats-delay, derived, cgroup-stats, cgroup-limits, memory-stat, memory-events, pressure, host-pressure, smaps-rollup, sched-ext. Empty renders every section that has data.
  • --metrics NAMES (default: every) — Comma-separated metric-name allowlist for primary + derived rows. Names must come from ctprof metric-list. Composes multiplicatively with --sections.
  • --wrap (default: off) — Wrap table cells to fit terminal width. Only fires when stdout is a TTY; piped output stays unwrapped so awk/grep pipelines see the same byte sequence.

show renders a single snapshot’s per-(group, metric) values without diff math. Same flag vocabulary as compare, minus the baseline/candidate/delta/pct columns:

ktstr ctprof show snapshot.ctprof.zst --group-by cgroup
ktstr ctprof show snapshot.ctprof.zst --sections taskstats-delay

  • SNAPSHOT (required) — Path to the .ctprof.zst snapshot.
  • --group-by AXIS (default: pcomm) — Same as compare.
  • --cgroup-flatten GLOB — Same as compare. Repeatable.
  • --no-thread-normalize (default: off) — Same as compare.
  • --no-cg-normalize (default: off) — Same as compare.
  • --sort-by SPEC (default: alphabetical) — Sort spec; ranks groups by absolute aggregated value (no delta — single snapshot).
  • --columns NAMES — Comma-separated column list. Show-only valid names: group, threads, metric, value. The compare-only column names are rejected at parse time.
  • --sections NAMES (default: every) — Same as compare.
  • --metrics NAMES (default: every) — Same as compare.
  • --wrap (default: off) — Same as compare.

metric-list prints every registered metric (primary + derived) with its description, unit, kconfig gate, and sched_class scope. Use this to discover the vocabulary --sort-by and --metrics accept.

ktstr ctprof metric-list

locks

Enumerate every ktstr flock held on this host. Read-only — does NOT attempt any flock acquire. Useful as a troubleshooting companion for --cpu-cap contention: when a build or test is stalled behind a peer’s reservation, ktstr locks names the peer (PID + cmdline) without disturbing any of its flocks.

Scans four lock-file roots:

  • /tmp/ktstr-llc-*.lock — per-LLC reservations held by perf-mode test runs and --cpu-cap-bounded builds.
  • /tmp/ktstr-cpu-*.lock — per-CPU reservations from the same flow.
  • {cache_root}/.locks/*.lock — cache-entry locks held during kernel build writes, and source-{path_hash}.lock files held for the duration of kernel build --source and cargo ktstr test --kernel <path> against the same source tree.
  • {runs_root}/.locks/{kernel}-{project_commit}.lock — per-run-key sidecar-write locks held for the duration of the (pre-clear + write) cycle to serialize concurrent ktstr processes targeting the same run directory.

Each lock is cross-referenced against /proc/locks to name the holder PID and cmdline.

ktstr locks                       # one-shot snapshot
ktstr locks --json                # JSON snapshot
ktstr locks --watch 1s            # redraw every second until SIGINT
ktstr locks --watch 1s --json     # ndjson stream, one object per interval

  • --json (default: off) — Emit the snapshot as JSON. Pretty-printed in one-shot mode; compact (one object per line, ndjson-style) under --watch. Stable field names — schema documented on ktstr::cli::list_locks.
  • --watch DURATION (default: unset) — Redraw the snapshot at the given interval until SIGINT. Value is parsed by humantime: 100ms, 1s, 5m, 1h. Human output clears and redraws in place; --json emits one line-terminated object per interval.

The same subcommand is available as cargo ktstr locks with identical flag semantics.

completions

Generate shell completions for ktstr.

ktstr completions bash
ktstr completions zsh
ktstr completions fish

  • SHELL (required) — Shell to generate completions for (bash, zsh, fish, elvish, powershell).
  • --binary NAME (default: ktstr) — Binary name to register the completion under. Override when invoking ktstr through a symlink with a different name (the shell looks up completions by argv[0]).

The same subcommand is available as cargo ktstr completions with identical flag semantics (--binary accepted on both; defaults to the respective binary name).

cargo-ktstr

cargo ktstr is a cargo plugin for kernel build, cache, and test workflow. Subcommands in --help order: test (alias: nextest), coverage, llvm-cov, stats, kernel, model, verifier, funify (alias: costume), completions, show-host, show-thresholds, export, locks, shell.

test

Build the kernel (if needed) and run tests via cargo nextest run. Also available as cargo ktstr nextest — a visible clap alias that expands to the same subcommand, so the two forms are interchangeable.

cargo ktstr test                                               # auto-discover kernel
cargo ktstr test --kernel ../linux                             # local source tree
cargo ktstr test --kernel 6.14.2                               # version (auto-downloads on miss)
cargo ktstr test --kernel 6.14.2-tarball-x86_64-kc...          # cache key (from kernel list)
cargo ktstr test --kernel 6.12..6.14                           # range: every stable+longterm release in [6.12, 6.14]
cargo ktstr test --kernel git+https://example.com/r.git#v6.14  # git URL + ref (tag/branch)
cargo ktstr test --kernel git+https://example.com/r.git#deadbeef1234  # specific commit
cargo ktstr test --kernel 6.14.2 --kernel 6.15.0               # multi-kernel: repeatable
cargo ktstr test --release                                     # release profile (stricter assertions)

--kernel is repeatable and accepts a path, version string, cache key, version range (START..END), or git source (git+URL#REF). When absent, the test framework discovers a kernel from KTSTR_TEST_KERNEL, then KTSTR_KERNEL, then falls back to cache and filesystem lookup. When --kernel is a path, cargo-ktstr configures and builds the kernel before running tests. Version strings auto-download and build on cache miss (both explicit patch versions like 6.14.2 and major.minor prefixes like 6.14). Cache keys resolve from the cache only — they error if not cached (run cargo ktstr kernel list to see available keys).

Ranges (START..END) expand against kernel.org’s releases.json to every stable and longterm release whose version sits inside [START, END] inclusive (mainline / linux-next rows are dropped). The endpoints themselves do NOT need to appear in releases.json6.10..6.16 brackets the surviving releases even if 6.10 and 6.16 have aged out.

Git sources (git+URL#REF) clone the repo shallow at the given ref, build, and cache the result. A repeat invocation against an unchanged branch tip lands a cache hit; a moved tip rebuilds.

Multi-kernel: kernel as a gauntlet dimension

When --kernel resolves to two or more kernels (multiple --kernel flags, or a single --kernel START..END range that expands to several releases), cargo-ktstr resolves all kernels upfront and exports the resolved set to cargo nextest via the KTSTR_KERNEL_LIST env var. The test binary’s gauntlet expansion adds the kernel as an additional dimension to the gauntlet cross-product, so each (test × scenario × topology × kernel) tuple becomes a distinct nextest test case. Two name shapes carry the kernel suffix:

  • Base tests: ktstr/{name}/{kernel_label} — one variant per registered #[ktstr_test] per kernel.
  • Gauntlet variants: gauntlet/{name}/{preset}/{kernel_label} — one variant per (test × topology preset × kernel).

Single-kernel runs (zero or one resolved kernel) keep the historical name shapes ktstr/{name} and gauntlet/{name}/{preset} with no kernel suffix, so existing CI baselines and per-test config overrides keep matching.

Kernel labels are semantic, operator-readable identifiers sanitized to kernel_[a-z0-9_]+:

  • Version / range expansion → kernel_6_14_2, kernel_6_15_rc3
  • Cache key → version prefix only (kernel_6_14_2 from 6.14.2-tarball-x86_64-kc<hash>)
  • Git source → kernel_git_{owner}_{repo}_{ref} (e.g. kernel_git_tj_sched_ext_for_next from git+https://github.com/tj/sched_ext#for-next)
  • Path → kernel_path_{basename}_{hash6} (e.g. kernel_path_linux_a3f2b1); the 6-char crc32 of the canonical path disambiguates two linux directories under different parents. Dirty-tree builds (uncommitted source changes, mid-build worktree mutations, or non-git trees) append _dirty to the label — e.g. kernel_path_linux_a3f2b1_dirty — so the test report distinguishes the non-reproducible run from a subsequent clean rebuild of the same path.
  • Local cache entry → kernel_local_{hash6} (first 6 chars of the source tree’s git short_hash, captured at cache-store time) or kernel_local_unknown for non-git trees. The hash6 keeps two distinct local trees from collapsing to the same label; the unknown literal is the shared bucket for every non-git tree (no discriminator exists at the cache layer to spread them apart).

Filter with nextest’s -E 'test(kernel_6_14)' to pick a single kernel from a multi-kernel matrix; nextest’s parallelism, retries, and --ignored flag all apply natively. Sidecars partition per kernel: each kernel runs in its own target/ktstr/{kernel}-{project_commit}/ directory keyed on the resolved kernel’s identity and the project tree’s HEAD short hex (with -dirty suffix when the worktree differs). Coverage profraw does NOT partition per kernel — __llvm_profile_write_buffer writes flat into target/llvm-cov-target/ with PID-keyed filenames (ktstr-test-{pid}-{counter}.profraw), and cargo-llvm-cov merges every variant’s profraw automatically into the single output report.

Build / download / clone failures abort BEFORE any test runs — a missing kernel can’t be tested, and continuing would mask which kernel was requested-but-unavailable in the operator-visible error stream. Test failures within a kernel are nextest-handled normally.

host_only tests under multi-kernel: tests marked host_only (those that run on the host without booting a VM) skip the kernel suffix and list / run once regardless of KTSTR_KERNEL_LIST cardinality. The dispatch sites (list_tests, list_tests_budget, and --exact’s run_host_only_test in src/test_support/dispatch.rs) all gate on entry.host_only before consulting the resolved kernel set, so a host-side test never observes the kernel directory and multiplying it across kernels would just run N copies of identical work for no signal.

  • --kernel ID (repeatable; default: auto) — Kernel identifier: path, version, cache key, range (START..END), or git source (git+URL#REF). Repeatable; a multi-kernel set fans the gauntlet across kernels.
  • --no-perf-mode (default: off) — Disable all performance mode features (flock, pinning, RT scheduling, hugepages, NUMA mbind, KVM exit suppression). Also settable via KTSTR_NO_PERF_MODE env var.
  • --no-skip-mode (default: off) — Convert resource-contention and host-topology-insufficient skips into hard test failures (exit 1 instead of 0). Default behavior skips so a contended runner does not fail tests that simply could not start; setting this flag opts into “if the test cannot run, the test fails”. Exports KTSTR_NO_SKIP_MODE=1 for the test binary.
  • --release (default: off) — Build and run tests with the release profile (--cargo-profile release to nextest). Release mode applies stricter assertion thresholds (gap_threshold_ms 2000 vs debug’s 3000, spread_threshold_pct 15% vs debug’s 35%) — tests that barely pass in debug may fail under --release. catch_unwind-based tests and tests gated on #[cfg(debug_assertions)] are skipped.

What it does (path mode only)

These steps run only when --kernel is a source directory path. Cached version and cache-key identifiers skip straight to test execution (step 6); uncached version identifiers run through download + configure + build + cache-store first. Ranges fan out to per-version resolution (every release downloads + builds + caches independently if not already present); git sources clone shallow at the ref, build, and cache. Multi-kernel resolution finishes for every requested kernel BEFORE step 6 — the cargo-nextest invocation in step 6 sees the complete kernel set as a single KTSTR_KERNEL_LIST export, so nextest fans the gauntlet across kernels in a single run.

For path mode, the source tree is gix-discovered and classified as either clean (HEAD reachable, index matches HEAD, worktree matches index) or dirty or non-git (any tracked-file diff, or the directory is not a git repo at all). The cache is keyed in one of three shapes:

  • local-{hash7}-{arch}-kc{suffix} — clean git tree, no user .config file in the source tree yet (build will run make defconfig). {hash7} is the source tree’s HEAD short hash; {suffix} distinguishes ktstr framework kconfig fragments.
  • local-{hash7}-{arch}-cfg{user_config}-kc{suffix} — clean git tree with a user .config whose CRC32 hash discriminates distinct configurations against the same commit, so iterative .config edits at a fixed commit populate distinct cache entries instead of colliding.
  • local-unknown-{path_hash}-{arch}-kc{suffix} — dirty / non-git tree (HEAD does not describe the source). {path_hash} is the full 8-char (32-bit) CRC32 of the canonical source path so two parallel cargo ktstr test --kernel ./linux-a and --kernel ./linux-b runs do not collide on the same local-unknown-... slot.

Dirty / non-git trees never cache — the build pipeline runs in the source directory, the kernel label gets a _dirty suffix, and a subsequent run of the same path that goes clean produces a distinct cache entry under the clean shape.

  1. Source-tree validation — verifies <kernel>/Makefile and <kernel>/Kconfig both exist. If either is missing, bails with not a kernel source tree.
  2. Cache lookup (clean trees only) — looks up the local-{hash7}-{arch}[-cfg{user_config}]-kc{suffix} key (the cfg segment present iff a user .config exists in the source tree). Cache hit short-circuits to step 6: cargo-ktstr exports the cache entry directory via KTSTR_KERNEL and emits a cargo ktstr: cache hit for {input_path} ({cache_key}, built {age} ago) line on stderr (the , built {age} ago suffix is omitted when the timestamp is unparseable or future-dated). Cache miss continues to step 3.
  3. Auto-configure — if <kernel>/.config lacks the CONFIG_SCHED_CLASS_EXT=y sentinel, runs make defconfig (when no .config exists), appends ktstr.kconfig to .config, then runs make olddefconfig.
  4. Kernel build — runs make -j$(nproc) KCFLAGS=-Wno-error, then runs validate_kernel_config to verify critical config options (CONFIG_SCHED_CLASS_EXT, CONFIG_DEBUG_INFO_BTF, CONFIG_BPF_SYSCALL, CONFIG_FTRACE, CONFIG_KPROBE_EVENTS, CONFIG_BPF_EVENTS) survived the build — the kernel build system silently disables options whose dependencies are not met, and the validator surfaces those failures with a per-option remediation hint. make handles the no-op case when the kernel is already built. For dirty / non-git trees this is the unconditional path; for clean trees, only reached on cache miss.
  5. compile_commands.json + cache store — runs make compile_commands.json (skipped only for transient temp directories like extracted tarballs) so LSP / clangd work against the local tree. Then for clean trees, the kernel image + stripped vmlinux are persisted under the resolved local-{hash7}-{arch}[-cfg{user_config}]-kc{suffix} key with metadata.json recording the source tree path. A post-build re-check of the dirty state catches mid-build mutations (worktree edits or commits that happened during make) and skips the cache store on either signal so a racing-write build cannot land under a stale identity. Dirty / non-git trees skip the cache store unconditionally (no stable HEAD identity for the cache key) but still get compile_commands.json.
  6. Test execution — execs cargo nextest run once with KTSTR_KERNEL set in the environment (single-kernel) or with both KTSTR_KERNEL and KTSTR_KERNEL_LIST (multi-kernel; the latter encodes the resolved kernel set as label1=path1;label2=path2;…). For clean Path-spec resolution KTSTR_KERNEL points at the cache entry directory; for dirty or non-git trees it points at the source tree directly. The test binary’s gauntlet expansion adds the kernel as a fifth dimension when the list carries 2+ entries; nextest’s parallelism, retries, and -E filtering apply natively to every (test × kernel) variant.

Implicit and explicit kernel discovery diverge. cargo ktstr test --kernel ../linux (explicit Path spec) routes through the cache pipeline above — the source tree is gix-classified, the local-{hash7}-{arch}[-cfg{user_config}]-kc{suffix} cache key is computed, the kernel is built (or short-circuited on cache hit), and the cache entry directory is exported via KTSTR_KERNEL.

cargo ktstr test (no --kernel flag) does NOT run the build pipeline or produce a new cache entry. The test binary’s find_kernel chain reads existing cache entries (most-recent-valid first; entries built with a different kconfig fragment are skipped) and falls back to local build trees (./linux, ../linux) and host paths. Whatever pre-built image it finds is returned as-is — no cache key is computed for source trees discovered on the filesystem, no make is invoked, and the result does not land in the kernel cache for a future cache_key-keyed lookup. The KTSTR_KERNEL env var with a path value follows this same direct-image flow — the cache write path is reached only via the cargo ktstr --kernel argument (or via cargo ktstr kernel build --source ../linux as an explicit cache-populate step). Pass --kernel ../linux to opt into the cache pipeline so a clean tree’s build is stored once and reused on subsequent runs.
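
To make the split concrete (../linux is an illustrative path):

cargo ktstr kernel build --source ../linux   # explicit cache-populate: key computed, build stored
cargo ktstr test --kernel ../linux           # cache pipeline: builds on miss, short-circuits on hit
cargo ktstr test                             # discovery only: reads what already exists, never builds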

Passing nextest arguments

Arguments after test are passed through to cargo nextest run:

cargo ktstr test -- -E 'test(my_test)'        # nextest filter
cargo ktstr test -- --workspace               # all workspace tests
cargo ktstr test -- --retries 2               # nextest retries

coverage

Build the kernel (if needed) and run tests with coverage via cargo llvm-cov nextest. Same kernel resolution and multi-kernel semantics as test: --kernel is repeatable; multi-kernel runs add the kernel suffix to every test name and partition the sidecar tree per kernel via target/ktstr/{kernel}-{project_commit}/, where {project_commit} is the project HEAD short hex (with -dirty when the worktree differs). Coverage profraw lands flat in target/llvm-cov-target/ with PID-keyed filenames — it does NOT partition per kernel — and cargo-llvm-cov merges every variant’s profraw automatically into the single output report.

cargo ktstr coverage                                               # auto-discover kernel
cargo ktstr coverage --kernel ../linux                             # local source tree
cargo ktstr coverage --kernel 6.14.2                               # version (auto-downloads on miss)
cargo ktstr coverage --kernel 6.14.2 --kernel 6.15.0               # multi-kernel coverage matrix
cargo ktstr coverage --release                                     # release profile (stricter assertions)
cargo ktstr coverage -- --workspace --lcov --output-path lcov.info # lcov output
  • --kernel ID (repeatable; default: auto): Same shapes and multi-kernel semantics as cargo ktstr test --kernel: each (test × kernel) variant runs as its own nextest subprocess so cargo-llvm-cov merges every variant’s profraw automatically.
  • --no-perf-mode (default: off): Disable all performance mode features (flock, pinning, RT scheduling, hugepages, NUMA mbind, KVM exit suppression). Also settable via KTSTR_NO_PERF_MODE env var.
  • --no-skip-mode (default: off): Convert resource-contention and host-topology-insufficient skips into hard test failures. Same semantics as on test; exports KTSTR_NO_SKIP_MODE=1 for the test binary.
  • --release (default: off): Collect coverage with the release profile (--cargo-profile release to llvm-cov nextest). Same stricter-threshold caveats as test --release — release mode applies gap_threshold_ms=2000 / spread_threshold_pct=15%, and skips catch_unwind-based tests along with #[cfg(debug_assertions)]-gated tests.

Requires cargo-llvm-cov and the llvm-tools-preview rustup component:

cargo install cargo-llvm-cov
rustup component add llvm-tools-preview

Passing arguments

Arguments after coverage are passed through to cargo llvm-cov nextest:

cargo ktstr coverage -- --workspace --profile ci --lcov --output-path lcov.info
cargo ktstr coverage -- --features integration

profraw layout

Three populations of *.profraw files arise from cargo ktstr runs. They land in different directories and are not all collected by the same workflow:

  • default-{pid}-{binary_hash}.profraw
      Directory: parent of the cargo-ktstr binary, joined with llvm-cov-target/ (e.g. target/{profile}/llvm-cov-target/ for cargo run --bin cargo-ktstr, or ~/.cargo/bin/llvm-cov-target/ for an installed binary).
      Producer: host-side cargo ktstr test (via LLVM_PROFILE_FILE injection).
      Collected by: not auto-collected; needs an explicit cargo llvm-cov report invocation.
  • cargo-llvm-cov-managed filenames (shape set by the outer harness)
      Directory: target/llvm-cov-target/ (workspace target dir, NOT under {profile}).
      Producer: host-side cargo ktstr coverage (cargo-llvm-cov sets its own LLVM_PROFILE_FILE).
      Collected by: merged into the cargo ktstr coverage report automatically.
  • ktstr-test-{pid}-{counter}.profraw
      Directory: parent of the test binary’s LLVM_PROFILE_FILE env var, falling back to <test-binary parent>/llvm-cov-target/ (typically target/{profile}/deps/llvm-cov-target/ when no env override is in play); under cargo ktstr test, inherits the host-side injected dir, so co-locates with default-{pid}-{binary_hash}.profraw.
      Producer: guest-side __llvm_profile_write_buffer flushed via the SHM ring at VM exit.
      Collected by: merged into the cargo ktstr coverage report automatically.

cargo ktstr test injects LLVM_PROFILE_FILE (added to prevent default.profraw leaking into a kernel source tree when the shell cwd was the kernel dir; see Stale vmlinux.btf or default.profraw). The resulting host-side default-{pid}-{binary_hash}.profraw files do NOT land in the target/llvm-cov-target/ directory that cargo ktstr coverage (cargo-llvm-cov) reads; they are NOT picked up by a later cargo ktstr coverage run unless you explicitly include them in a cargo llvm-cov report invocation pointed at the cargo-ktstr binary’s llvm-cov-target/ directory.

To clean accumulated profraw between runs:

# Remove ONLY *.profraw under target/llvm-cov-target/ (top-level glob, non-recursive):
cargo ktstr llvm-cov clean --profraw-only

# Drop host-side test-path profraw next to the cargo-ktstr binary.
# Run only the line(s) matching how cargo-ktstr was launched —
# the brace-list form is bash-only, so each path is its own command
# for portable POSIX shells (sh / dash):
rm -f target/debug/llvm-cov-target/default-*.profraw
rm -f target/release/llvm-cov-target/default-*.profraw

# If ktstr was installed via `cargo install`:
rm -f ~/.cargo/bin/llvm-cov-target/default-*.profraw

--profraw-only is the safe default: it removes only *.profraw files at the top level of target/llvm-cov-target/ (the cargo-llvm-cov-managed dir) and leaves coverage reports, profdata, and build artifacts intact. It does NOT touch the default-*.profraw files next to the cargo-ktstr binary (under target/{profile}/llvm-cov-target/ for cargo run / cargo build, or ~/.cargo/bin/llvm-cov-target/ for cargo install-deployed binaries) produced by the host-side injection — remove those with the explicit rm -f lines above for whichever launch mode you use. Avoid cargo ktstr llvm-cov clean without arguments (recursively wipes all of target/llvm-cov-target/, including reports) and --workspace (additionally runs cargo clean on workspace packages, removing build artifacts); both are destructive beyond profraw.

To opt out of the host-side LLVM_PROFILE_FILE injection entirely, export LLVM_PROFILE_FILE yourself before running cargo ktstr test — the injector only fires when the env is absent, so an explicit operator setting takes precedence.
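
For example, with an illustrative path (%p is LLVM's standard PID placeholder):

LLVM_PROFILE_FILE=/tmp/ktstr-%p.profraw cargo ktstr test --kernel ../linux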

llvm-cov

Raw passthrough to cargo llvm-cov with arbitrary arguments. Use this for llvm-cov subcommands that don’t fit the coverage flow — report, clean, show-env, etc. When you want cargo llvm-cov nextest, prefer cargo ktstr coverage; this subcommand carries the same kernel-resolution and --no-perf-mode plumbing but hands every remaining argument to cargo llvm-cov unchanged.

cargo ktstr llvm-cov report --lcov --output-path lcov.info    # generate report from prior run
cargo ktstr llvm-cov clean --workspace                         # wipe accumulated coverage data
cargo ktstr llvm-cov show-env                                  # print env cargo-llvm-cov would set
cargo ktstr llvm-cov --kernel ../linux report                  # pin kernel + passthrough
  • --kernel ID (repeatable; default: auto): Kernel identifier: path, version, cache key, range (START..END), or git source (git+URL#REF). Same multi-kernel semantics as cargo ktstr test --kernel.
  • --no-perf-mode (default: off): Disable all performance mode features (flock, pinning, RT scheduling, hugepages, NUMA mbind, KVM exit suppression). Also settable via KTSTR_NO_PERF_MODE env var.
  • --no-skip-mode (default: off): Convert resource-contention and host-topology-insufficient skips into hard test failures. Same semantics as on test; exports KTSTR_NO_SKIP_MODE=1 for the test binary.

Note: a bare cargo ktstr llvm-cov (no trailing subcommand) dispatches to cargo llvm-cov, which runs cargo test — ktstr tests rely on the nextest harness for gauntlet expansion (topology-preset variants), verifier cell emission, and VM dispatch. Under bare cargo test, only the #[test] stubs run and gauntlet variants + verifier cells are silently skipped. Always pass a subcommand after llvm-cov (most often nextest, for which cargo ktstr coverage is the shorter route).

kernel

Manage cached kernel images. Three subcommands: list, build, clean. The standalone ktstr kernel subcommands are identical.

kernel list

List cached kernel images, sorted newest first. With --range, switches to PREVIEW MODE: prints the versions a START..END range expands to without performing any download or build.

cargo ktstr kernel list
cargo ktstr kernel list --json                    # JSON output for CI scripting
cargo ktstr kernel list --range 6.12..6.14        # preview range expansion
cargo ktstr kernel list --range 6.12..6.14 --json # preview as JSON

Default mode walks the local cache. Human-readable output shows key, version, source type, arch, and build timestamp. Entries built with a different ktstr.kconfig are marked (stale kconfig). Entries whose major.minor version is no longer in kernel.org’s active releases list are marked (EOL); prefix lookups for EOL series fall back to probing cdn.kernel.org for the latest patch release.

--range mode performs no cache reads: it fetches kernel.org’s releases.json once, expands the inclusive range against the stable and longterm releases (mainline / linux-next dropped), and prints one version per line on stdout. Use this to answer “what does --kernel 6.12..6.16 actually cover?” before paying the build cost — no kernel is downloaded or compiled. With --json, emits a JSON object carrying the literal range, the parsed start / end, and the expanded versions array.

  • --json: Output in JSON format. Each entry includes a boolean eol field (computed at list time by fetching kernel.org’s releases.json) alongside the cached metadata. With --range, emits a single object {range, start, end, versions} instead.
  • --range START..END: Switch to range-preview mode. Format: MAJOR.MINOR[.PATCH][-rcN]..MAJOR.MINOR[.PATCH][-rcN]. Performs the single releases.json fetch a real range resolve does, expands inclusively, and prints the version list — no downloads, no builds, no cache lookups.
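
The fixed {range, start, end, versions} shape gates cleanly in CI with jq (an external tool, not part of ktstr); a sketch:

cargo ktstr kernel list --range 6.12..6.14 --json | jq -e '.versions | length > 0'   # exit 0 iff the range expands to at least one version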

kernel build

Download, build, and cache a kernel image. Three source modes: version (tarball download), --source (local tree), --git (clone).

cargo ktstr kernel build                               # latest stable from kernel.org
cargo ktstr kernel build 6.14.2                        # specific version
cargo ktstr kernel build 6.15-rc3                      # RC release
cargo ktstr kernel build 6.12                          # latest 6.12.x patch release
cargo ktstr kernel build --source ../linux             # local source tree
cargo ktstr kernel build --git URL --ref v6.14         # git clone (shallow, depth 1)
cargo ktstr kernel build --force 6.14.2                # rebuild even if cached

When no version or source is given, fetches from kernel.org’s releases.json the latest stable series that has had at least 8 maintenance releases — keeping CI off brand-new majors whose early builds are more likely to break. A major.minor prefix (e.g. 6.12) resolves to the highest patch release in that series. For EOL series no longer in releases.json, probes cdn.kernel.org to find the latest available tarball. Skips building when a cached entry already exists (use --force to override). Stale entries (built with a different ktstr.kconfig) are rebuilt automatically. For --source, generates compile_commands.json for LSP support. Dirty local trees (uncommitted changes to tracked files) are built but not cached.

  • VERSION: Kernel version or prefix to download (e.g. 6.14.2, 6.12, 6.15-rc3). A major.minor prefix resolves to the highest patch release, probing cdn.kernel.org for EOL series. Conflicts with --source and --git.
  • --source PATH: Path to existing kernel source directory. Conflicts with VERSION and --git.
  • --git URL: Git URL to clone. Requires --ref. Conflicts with VERSION and --source.
  • --ref REF: Git ref to checkout (branch, tag, commit). Required with --git.
  • --force: Rebuild even if a cached image exists.
  • --clean: Run make mrproper before configuring. Only meaningful with --source.
  • --cpu-cap N: Reserve exactly N host CPUs for the build (integer ≥ 1; must be ≤ the calling process’s sched_getaffinity cpuset size). When absent, 30% of the allowed CPUs are reserved (minimum 1). The planner walks whole LLCs in consolidation- and NUMA-aware order, partial-taking the last LLC so plan.cpus.len() == N exactly. Under --cpu-cap, make -jN parallelism matches the reserved CPU count and the build runs inside a cgroup v2 sandbox that pins gcc/ld to the reserved CPUs + NUMA nodes. Mutually exclusive with KTSTR_BYPASS_LLC_LOCKS=1. Also settable via KTSTR_CPU_CAP env var (CLI flag wins when both are present).
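
A capped build in both spellings the flag above describes:

cargo ktstr kernel build 6.14.2 --cpu-cap 8      # reserve exactly 8 host CPUs; make -j8 follows
KTSTR_CPU_CAP=8 cargo ktstr kernel build 6.14.2  # env spelling; the CLI flag wins when both are set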

kernel clean

Remove cached kernel images.

cargo ktstr kernel clean                          # remove all (with confirmation prompt)
cargo ktstr kernel clean --keep 3                 # keep 3 most recent
cargo ktstr kernel clean --force                  # skip confirmation prompt
cargo ktstr kernel clean --corrupt-only --force   # remove only corrupt entries
  • --keep N: Keep the N most recent VALID cached kernels. Corrupt entries (metadata missing or unparseable, image file absent) are always candidates for removal regardless of this value — a corrupt entry never consumes a keep slot. Mutually exclusive with --corrupt-only.
  • --force: Skip confirmation prompt. Required in non-interactive contexts.
  • --corrupt-only: Remove only corrupt cache entries (metadata missing or unparseable, image file absent). Valid entries are left untouched regardless of --force. Useful for clearing broken entries after an interrupted build without risking the curated set of good kernels. Mutually exclusive with --keep.

model

Manage the LLM model cache used by OutputFormat::LlmExtract payloads. fetch downloads the default pinned model into the ktstr model cache; status reports whether a SHA-checked copy is already cached; clean deletes the cached artifact plus its warm-cache sidecar.

cargo ktstr model fetch                          # download + SHA-check (no-op if cached)
cargo ktstr model status                         # report cache path + verdict
cargo ktstr model clean                          # delete cached artifact + sidecar

fetch is a no-op when the cache already holds a SHA-checked copy. Respects KTSTR_MODEL_OFFLINE=1 — set to refuse network fetches. Cache root resolution: KTSTR_CACHE_DIR (if set), then $XDG_CACHE_HOME/ktstr/models/, then $HOME/.cache/ktstr/models/.
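
For example, pinning the cache root and refusing network fetches (an assumed air-gapped workflow; the path is illustrative):

KTSTR_CACHE_DIR=/mnt/shared-cache cargo ktstr model status
KTSTR_MODEL_OFFLINE=1 cargo ktstr model fetch    # refuses the download when nothing is cached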

status prints four fields and adds a one-line annotation when the verdict is anything other than Matches (a clean hit gets no annotation):

  • model: Model file name (the pinned default; e.g. Qwen3-4B-Q4_K_M.gguf).
  • path: Absolute cache path ({cache_root}/models/{file}) the producer reads at LlmExtract time.
  • cached: true if an entry exists at path:, false otherwise.
  • checked: true if the cached entry’s SHA-256 matches the pinned digest.

The annotation distinguishes four verdicts: NotCached (no entry — emits a cargo ktstr model fetch hint plus the expected download size), CheckFailed (cached entry could not be SHA-checked due to an I/O error — re-fetch), Mismatches (cached entry hash does not match the pinned digest — re-fetch), Matches (silent — the all-clear path). Re-fetch is the shared remediation tail for every cached-but-not-Matches branch.

clean removes both the GGUF artifact at {cache_root}/models/{file_name} and its .mtime-size warm-cache sidecar (a small companion file the SHA fast-path uses to skip re-hashing on subsequent status calls). Per-file output names what was deleted with an IEC-prefixed size in parentheses (removed /path/to/Qwen3-4B-Q4_K_M.gguf (2.34 GiB)); a final freed N total line sums the artifact and sidecar bytes. A no-op clean (nothing cached) prints a single no cached model found at {path} line so an idempotent re-run produces a clear “nothing to do” outcome instead of two “(absent)” lines. Subsequent cargo ktstr model fetch re-downloads the pin from scratch.

verifier

Collect BPF verifier statistics for every scheduler declared via declare_scheduler! in the workspace’s test binaries. Spawns cargo nextest run -E 'test(/^verifier/)' and lets nextest fan out per (scheduler × kernel-list entry × accepted topology preset) cell — each cell boots its own VM, loads the scheduler’s BPF programs, and reports per-program verified instruction counts from host-side memory introspection.

cargo ktstr verifier                              # auto-discover kernel
cargo ktstr verifier --kernel ../linux            # pin to one kernel
cargo ktstr verifier --kernel 6.14 --kernel 7.0   # multi-kernel sweep
cargo ktstr verifier --raw                        # raw verifier log

There are no --scheduler / --scheduler-bin flags: the sweep discovers schedulers from the KTSTR_SCHEDULERS distributed slice populated by declare_scheduler!. To exclude a scheduler from the sweep, omit it from the test binary (or declare it with SchedulerSpec::Eevdf / SchedulerSpec::KernelBuiltin — both are skipped at cell-emission time because neither has a userspace binary to verify).

--kernel is repeatable; cargo-ktstr always exports KTSTR_KERNEL_LIST to the nextest invocation (synthesizing a single entry from auto-discovery when no --kernel is passed). Each scheduler’s kernels = [...] declaration acts as a per-scheduler filter on the operator-supplied set; an empty (or omitted) kernels field accepts every entry. See BPF Verifier: Matrix dimension + per-scheduler filter for the full filter contract.

--raw exports KTSTR_VERIFIER_RAW=1; the cell handler reads it via env::var_os and switches format_verifier_output from the cycle-collapsed default to the raw scheduler-log dump. See BPF Verifier: Cycle collapse algorithm for the rendering details.

  • --kernel ID (repeatable): Kernel identifier: path, version, cache key, range (START..END), or git source (git+URL#REF). Raw image files (bzImage/Image) are NOT accepted — the verifier needs the cached vmlinux and kconfig fragment alongside the image. Source directories auto-build; version strings auto-download on cache miss. When absent, resolves via cache then filesystem, falling back to auto-download. Raw images are accepted only on cargo ktstr shell.
  • --raw: Print raw verifier output without cycle collapse.

See BPF Verifier for the cell-based dispatch design and output format, and Scheduler Definitions for the declare_scheduler! macro that registers a scheduler in KTSTR_SCHEDULERS.

shell

Shares the VM boot flow with ktstr shell and accepts the same flags. See ktstr shell for the full flag reference. The one behavior difference from ktstr shell is that cargo ktstr shell accepts raw image file paths for --kernel.

cargo ktstr shell
cargo ktstr shell --kernel 6.14.2
cargo ktstr shell --topology 1,2,4,1
cargo ktstr shell -i ./my-binary -i strace

completions

Generate shell completions for cargo-ktstr. See ktstr completions for the base subcommand.

cargo ktstr completions bash >> ~/.local/share/bash-completion/completions/cargo
cargo ktstr completions zsh > ~/.zfunc/_cargo-ktstr
cargo ktstr completions fish > ~/.config/fish/completions/cargo-ktstr.fish
  • SHELL: Shell to generate completions for (bash, zsh, fish, elvish, powershell).
  • --binary NAME: Binary name for completions. Default: cargo.

stats

Sidecar analysis, per-record diagnostics, and run-to-run comparison. See Runs for the directory layout.

cargo ktstr stats                                             # print analysis of newest run
cargo ktstr stats list                                        # list runs
cargo ktstr stats list-metrics                                # list registered regression metrics
cargo ktstr stats compare --a-kernel 6.14 --b-kernel 6.15     # slice on kernel
cargo ktstr stats compare --a-scheduler scx_rusty --b-scheduler scx_alpha  # slice on scheduler
cargo ktstr stats compare --a-kernel 6.14 --b-kernel 6.15 --scheduler scx_rusty  # slice on kernel, pin scheduler on both sides
cargo ktstr stats compare --a-kernel 6.14 --b-kernel 6.15 -E cgroup_steady       # add substring filter
cargo ktstr stats compare --a-project-commit abcdef1 --b-project-commit fedcba2 --no-average  # opt out of trial averaging
cargo ktstr stats compare --a-kernel-commit abcdef1 --b-kernel-commit fedcba2    # slice on kernel source HEAD
cargo ktstr stats compare --a-run-source ci --b-run-source local                 # slice on run environment
cargo ktstr stats explain-sidecar --run RUN_ID                                   # diagnose Option-field absences

When invoked without a subcommand, prints gauntlet analysis from either the most recent run directory under {CARGO_TARGET_DIR or "target"}/ktstr/ (newest by mtime) or the explicit directory in KTSTR_SIDECAR_DIR when that variable is set. With KTSTR_SIDECAR_DIR set, that directory is the sidecar source directly – there is no newest-subdirectory walk under it:

  • Gauntlet analysis – outlier detection, per-scenario/topology dimension summaries, stimulus cross-tab.
  • BPF verifier stats – per-program verified instruction counts, warnings for programs near the 1M complexity limit.
  • BPF callback profile – per-program invocation counts, total CPU time, and average nanoseconds per call.
  • KVM stats – cross-VM averages for exits, halt polling, host preemptions.

list

Print a table of run directories under {CARGO_TARGET_DIR or "target"}/ktstr/ with four columns:

  • RUN: the run-directory leaf name, formatted as {kernel}-{project_commit} per Runs. list does NOT consult KTSTR_SIDECAR_DIR — that override only affects where the test harness writes sidecars; list always enumerates the default runs-root.
  • TESTS: number of sidecars in the directory (and one level of subdirectories — collect_sidecars walks per-job gauntlet layouts).
  • DATE: the earliest sidecar timestamp present in the directory — under last-writer-wins this equals the most recent run’s first sidecar timestamp (the prior run’s sidecars were pre-cleared at the new run’s first write, so only the new run’s timestamps remain). See Runs for the full semantics.
  • ARCH: the host.arch value from the run’s first sidecar (e.g. x86_64, aarch64). Renders as - when no sidecar in the directory carries a populated host context — pre-host-context archives and host-only test stubs that never populate the field land in this bucket.

Rows are sorted by directory mtime, most recent first, so the latest run lands at the top — the operator’s usual interest. Entries whose mtime cannot be read fall back to filename order as a deterministic tiebreaker and sort to the end of the listing.

list-metrics

List the registered regression metrics and their default thresholds. Enumerates the ktstr::stats::METRICS registry: metric name, polarity (higher/lower better), default absolute-delta gate, default relative-delta gate, and display unit. Use this to see which metric names ComparisonPolicy.per_metric_percent keys can reference, and what each default absolute and relative gate starts at before an override. Default output is a human-readable table; --json emits a JSON array with the same fields.

cargo ktstr stats list-metrics              # table
cargo ktstr stats list-metrics --json       # JSON array
  • --json (default: off): Emit JSON instead of a table.

list-values

List the distinct values present per filterable dimension in the sidecar pool. Walks every run directory under target/ktstr/ (or --dir), pools the sidecars, and reports per-dimension sets for all seven dimensions: kernel, commit, kernel_commit, source, scheduler, topology, and work_type. The commit and source keys map to the internal SidecarResult::project_commit / run_source fields; the JSON wire keys keep the shorter spellings.

Use this before crafting a cargo ktstr stats compare invocation to discover what --a-X / --b-X values the pool actually carries: --a-kernel 6.20 against an empty pool fails downstream with “no rows match filter A”, and list-values is the upstream answer to “what kernels do I have?”.

cargo ktstr stats list-values                       # text per-dim blocks
cargo ktstr stats list-values --json                # JSON object
cargo ktstr stats list-values --dir /tmp/archived   # archived sidecar tree

The text shape renders one block per dimension with values one per line. The JSON shape emits a single object keyed by dimension name with arrays of values:

{
  "kernel": [null, "6.14.2", "6.15.0"],
  "commit": [null, "abcdef1", "abcdef1-dirty"],
  "kernel_commit": [null, "kabcde7", "kabcde7-dirty"],
  "source": [null, "ci", "local"],
  "scheduler": ["eevdf", "scx_rusty"],
  "topology": ["1n2l4c1t", "1n4l2c1t"],
  "work_type": ["SpinWait", "PageFaultChurn"]
}

The JSON keys commit and source are the wire contract; internally the corresponding fields are SidecarResult::project_commit and SidecarResult::run_source, and the per-side filter flags spell as --project-commit / --run-source (see compare).

kernel, commit, kernel_commit, and source are optional on the source sidecar (SidecarResult::kernel_version / project_commit / kernel_commit / run_source are Option<String>); the textual sentinel unknown and JSON null both denote a sidecar that did not record a value for that dimension.

  • --json (default: off): Emit JSON instead of per-dimension text blocks.
  • --dir DIR (default: target/ktstr/): Alternate run root. Same semantics as compare --dir.
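
The fixed JSON keys make extraction with jq (an external tool) straightforward against the shape above:

cargo ktstr stats list-values --json | jq '.kernel'          # kernels present (null = unrecorded)
cargo ktstr stats list-values --json | jq -r '.scheduler[]'  # one scheduler name per line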

show-host

Print the archived HostContext for a specific run: CPU identity, memory/hugepage config, transparent-hugepage policy, NUMA node count, kernel uname triple, kernel cmdline, and every /proc/sys/kernel/sched_* tunable captured at archive time. Useful for inspecting the same fingerprint compare’s host-delta section uses, available on a single run.

The command scans sidecars in the run directory in iteration order and prints the FIRST sidecar that carries a populated host field — older pre-enrichment sidecars may have host: None, and the forward scan tolerates those. If no sidecar has a populated host field the command fails with an actionable error rather than returning empty output.

  • --run ID (required): Run key (e.g. 6.14-abc1234 or 6.14-abc1234-dirty; from cargo ktstr stats list).
  • --dir DIR (default: target/ktstr/): Alternate run root. Same semantics as compare --dir: useful for archived sidecar trees copied off a CI host.

explain-sidecar

Diagnose Option-field absences across a run’s sidecars. Loads every *.ktstr.json under --run ID (or its subdirectories one level deep, mirroring compare’s gauntlet-job layout) and reports, per sidecar, which Option<T> fields landed as None plus the documented causes for each absence and a classification:

  • expected: None is the steady-state shape; no operator action recovers it (e.g. payload for a scheduler-only test, scheduler_commit which no SchedulerSpec variant exposes today).
  • actionable: None indicates a recoverable gap; re-running in a different environment (in-repo cwd, non-tarball kernel, non-host-only test) would populate the field.

Different gauntlet variants on the same run legitimately differ on which fields populate (host-only vs VM-backed, scheduler-only vs payload-bearing), so the report is per-sidecar rather than aggregate.

Sidecars are loaded verbatim — this command does NOT rewrite run_source to "archive" even when --dir is set. Diverges intentionally from compare / list-values; matches show-host. The override would erase the only signal that surfaces the pre-rename source-key drop case.

The output header reports walked N sidecar file(s), parsed M valid: N counts every .ktstr.json file the walker visited, M counts how many parsed against the current schema. walked > parsed signals a corrupt or pre-1.0-schema sidecar — re-run the test to regenerate under the current schema.

Per-None blocks in the text output also include a fix: line for fields whose None is recoverable by an operator action (e.g. kernel_commit recovers when KTSTR_KERNEL points at a local kernel git tree). Fields whose None is the steady-state shape (or a multi-cause set with no single remediation) emit no fix: line.

When the walk encounters parse failures, the text output appends a trailing corrupt sidecars (N): block listing each corrupt path on its own line followed by the serde error message indented as error: ..., optionally followed by an enriched: ... line with operator-facing remediation prose when the parse failure matches a known schema-drift case (currently the host missing-field case). When the walk encounters IO failures (file matched the predicate but read_to_string failed before parsing could begin — permission denied, mid-rotate truncation, broken symlink, EISDIR), the text output appends a parallel io errors (N): block, structured the same way (path on its own line, error: ... line below) but carrying std::io::Error::Display rather than serde-error text. IO errors do NOT carry enriched: lines — there is no schema-drift catalog for filesystem incidents; the raw std::io::Error Display is the remediation surface. Each block is suppressed independently when its source vec is empty.

All-corrupt and all-IO-failure runs (every predicate-matching file failed to parse, or every one failed to read) are NOT a hard error — text output renders the header (walked N sidecar file(s), parsed 0 valid) followed directly by the corrupt sidecars (N): and/or io errors (N): block(s), skipping the per-sidecar breakdown that has nothing to render. JSON output mirrors this with valid: 0, _walk.errors and/or _walk.io_errors populated, and per-field counts at zero. This preserves structured per-file visibility for dashboard consumers facing total-failure runs of either class.

All-corrupt and all-IO-failure runs exit 0 (not a hard error); CI scripts must inspect the JSON channel for failure detection rather than relying on exit code. Two common gating policies, each appropriate for different operational stances:

  • Lenient (treat partial failures as warnings): _walk.valid > 0. Accepts any run with at least one successfully-parsed sidecar; per-file parse or IO failures surface in the JSON arrays for triage but do not fail the gate.
  • Strict (fail on any sidecar failure): _walk.errors.len() == 0 && _walk.io_errors.len() == 0. Requires every predicate-matching file to parse cleanly. Both checks are required because the two arrays cover disjoint failure classes (parse vs read) — a run with zero parse errors but one IO error still has a missing sidecar.

The two policies are NOT equivalent: a run with one valid and one corrupt sidecar passes lenient (valid == 1 > 0) but fails strict (errors.len() == 1 > 0). Pick the policy that matches the operational tolerance for partial data.
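
Both policies sketch directly as shell gates over the --json output using jq (an external tool; jq -e sets the exit code from the last output's truthiness). $RUN_ID stands for a run key from cargo ktstr stats list:

# Lenient: pass when at least one sidecar parsed.
cargo ktstr stats explain-sidecar --run "$RUN_ID" --json | jq -e '._walk.valid > 0'

# Strict: fail on any parse or read failure; the two arrays cover disjoint classes.
cargo ktstr stats explain-sidecar --run "$RUN_ID" --json \
  | jq -e '(._walk.errors | length) == 0 and (._walk.io_errors | length) == 0'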

--json emits a single object with three top-level keys: _schema_version, _walk, and fields. _schema_version is a string version stamp — currently "1" — that consumers can gate on for incompatible shape changes. _walk is an envelope carrying the walked / valid counts (the same numbers the text header reports under “walked N sidecar file(s), parsed M valid”), an errors array of {path, error, enriched_message} entries covering every parse failure (enriched_message is a human-facing remediation string when a known schema-drift case matches, JSON null otherwise), and an io_errors array of {path, error} entries covering every IO failure (file matched the predicate but read_to_string failed; error carries the raw std::io::Error Display). Both arrays emit on every render — empty array when no failures of that class occurred — so dashboard consumers see a uniform shape without contains_key branching. With both arrays, walked == valid + errors.len() + io_errors.len() by construction in the steady state — every predicate-matching file lands in exactly one bucket. (Filesystem races between the count and load passes can perturb this; see the rustdoc on WalkStats for the full caveat.) Each entry under fields carries none_count and some_count (counts across all valid sidecars in the run, summing to _walk.valid), classification, causes, and fix (string when a remediation applies, JSON null otherwise).

Output produced before the schema-version stamp landed has no _schema_version key; consumers should treat the key’s absence as pre-stamp output (compatible with shape "1" in practice but unstamped).

The version bumps on incompatible shape changes (key rename, key removal, semantic shift in an existing key) but NOT on additive changes (new optional top-level keys, new entries in fields, new optional sub-keys under existing entries). The stamp is emitted as a JSON string (e.g. "1", "2"); parse it by stripping the quotes and converting the inner digits to an integer, then gate on parsed >= 1 (integer comparison) — never use raw string comparison, since lexicographic order would put "10" ahead of "2". Pin loosely (e.g. accept any version >= 1) so dashboard code keeps working when the catalog grows; tighten only on the specific bumps a consumer cannot tolerate.
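
The parse-then-compare rule as a jq one-liner, treating a missing stamp as the pre-stamp "1" per the paragraph above:

cargo ktstr stats explain-sidecar --run "$RUN_ID" --json \
  | jq -e '(._schema_version // "1") | tonumber >= 1'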

cargo ktstr stats explain-sidecar --run RUN_ID                       # text per-sidecar diagnostic
cargo ktstr stats explain-sidecar --run RUN_ID --json                 # aggregate JSON for dashboards
cargo ktstr stats explain-sidecar --run RUN_ID --dir /path/archive    # diagnose archived sidecars
  • --run ID (required): Run key (e.g. 6.14-abc1234 or 6.14-abc1234-dirty; from cargo ktstr stats list).
  • --dir DIR (default: target/ktstr/): Alternate run root. Same semantics as compare --dir.
  • --json (default: off): Emit aggregate JSON instead of per-sidecar text.

compare

Pool every sidecar under target/ktstr/ (or --dir), partition the rows into A and B sides via per-side filter flags, average each side’s matching sidecars per pairing key (or pass through distinct sidecars under --no-average), and report regressions on the A→B delta. Exits non-zero on regression.

The dimensions on which the A and B filters DIFFER are the SLICING dimensions — the axes of the A/B contrast. Every other dimension is part of the dynamic PAIRING key the comparison joins on. Slicing dims are derived automatically from the filters:

# Slice on kernel: A is 6.14, B is 6.15. Pair on every other dim.
cargo ktstr stats compare --a-kernel 6.14 --b-kernel 6.15

# Slice on kernel AND scheduler simultaneously.
cargo ktstr stats compare \
    --a-kernel 6.14 --a-scheduler scx_rusty \
    --b-kernel 6.15 --b-scheduler scx_alpha

# Slice on project commit, narrow both sides to one scheduler+kernel.
cargo ktstr stats compare \
    --a-project-commit abcdef1 --b-project-commit fedcba2 \
    --kernel 6.14 --scheduler scx_rusty

# Slice on run environment: CI runs vs local developer runs.
cargo ktstr stats compare \
    --a-run-source ci --b-run-source local

Symmetric sugar. Shared --X flags (--kernel, --scheduler, --topology, --work-type, --project-commit, --kernel-commit, --run-source) pin BOTH sides to the same value(s). Per-side --a-X / --b-X flags REPLACE the corresponding shared --X value for that side only — “more-specific replaces” semantics. So --kernel 6.14 --a-kernel 6.13 puts A on 6.13 and B on 6.14. Together the seven slicing dimensions (kernel, scheduler, topology, work-type, project-commit, kernel-commit, run-source) cover every typed axis the comparison can contrast on.
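
The replace rule in one invocation:

# Shared --kernel pins both sides to 6.14; --a-kernel replaces it for the A side only.
cargo ktstr stats compare --kernel 6.14 --a-kernel 6.13    # A = 6.13, B = 6.14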

Validation. The dispatch site rejects two cases up front:

  • Empty slicing: no --a-X / --b-X at all, OR the per-side flags resolve to identical effective filters. Bails with “specify at least one per-side filter (e.g. --a-kernel 6.14 --b-kernel 6.15) to define what dimension separates the two sides.”
  • Multi-dim slicing: slicing on more than one dimension prints a warning to stderr (“warning: slicing on N dimensions; results compress multiple axes into a single A/B contrast”) but continues — multi-dim contrasts are a deliberate feature for cohort sweeps.

Averaging. By default the comparison aggregates every matching sidecar within each side into a single arithmetic-mean row per pairing key, smoothing run-to-run jitter. Failing / skipped contributors are excluded from the metric mean; the aggregated row’s passed is the AND across every contributor. A header line above the comparison table reads averaged across N runs (A) and M runs (B) and a per-group passes_observed/total_observed block prints below the summary.

+mixed commit marker. When contributors to an averaged group disagree on the -dirty suffix for the same canonical hex (some clean, some -dirty), the rendered commit and kernel_commit columns show {hex}+mixed for that group. +mixed is a COHORT-level marker (distinct from -dirty, which is a per-record property of one sidecar): it indicates mixed working-tree state across the group’s contributors. Mixed-dirty tracking spans EVERY contributor (passing, failing, skipped) so WIP-vs-committed disagreement surfaces in the averaged row even when one of the two states only appears on a failing run. The marker is rendered against the canonical un-suffixed hex, so an abc1234 clean entry plus an abc1234-dirty entry render as abc1234+mixed regardless of which contributor was scanned first. Homogeneous cohorts (every contributor clean, every contributor dirty, or every contributor None) preserve the first-seen value verbatim and never get the +mixed marker.

--no-average keeps each sidecar distinct. If multiple sidecars on the same side share the same pairing key under --no-average, the comparison bails with “duplicate pairing keys” — pairing across A/B sides is ambiguous when one A-row could match many B-rows. Either drop --no-average to average them, or add another per-side filter to disambiguate.

Kernel match shape. A --kernel 6.12 filter (two-segment major.minor) PREFIX-matches every patch release in that series: 6.12, 6.12.0, 6.12.5 all match. A three-or-more-segment filter (--kernel 6.14.2, --kernel 6.15-rc3) is strict equality — 6.14.2 does NOT match 6.14.20. The same shape applies to --a-kernel / --b-kernel.

Discovering filter values. Run cargo ktstr stats list-values before crafting a compare invocation to see what kernel, commit, kernel_commit, source, scheduler, topology, and work_type values the sidecar pool actually carries; passing --a-kernel 6.20 against an empty pool fails downstream with “no rows match filter A”, and list-values is the upstream answer to “what have I got?”. list-values reports all seven filterable dimensions; the JSON keys commit and source map to the per-side filter flags --project-commit and --run-source.

When a side comes back as unknown for one of the optional dimensions (kernel, commit, kernel_commit, source), cargo ktstr stats explain-sidecar on the underlying run reports per-sidecar which optional fields are missing and what each absence means.

  • -E FILTER: Substring filter applied to the joined scenario topology scheduler work_type string. Scope is limited: -E does NOT match against kernel, project_commit, kernel_commit, or run_source — those are typed dimensions reachable only via the dedicated --kernel / --project-commit / --kernel-commit / --run-source flags. To narrow on those, use the typed flags. Composes with the typed dimension filters: typed narrows happen first, substring runs over the surviving set.
  • --kernel VER (repeatable): Pin BOTH sides to the listed kernel version(s). Sugar for --a-kernel V1 --a-kernel V2 --b-kernel V1 --b-kernel V2. Per-side --a-kernel / --b-kernel REPLACES this shared value for that side only. Major.minor (6.12) prefix-matches; three-segment (6.14.2) is strict.
  • --scheduler NAME (repeatable): Pin BOTH sides to the listed scheduler(s). Sugar for --a-scheduler N1 --a-scheduler N2 --b-scheduler N1 --b-scheduler N2. Per-side --a-scheduler / --b-scheduler REPLACES this shared value for that side only. OR-combined: a row matches iff its scheduler field equals ANY listed entry. Strict equality per entry.
  • --topology LABEL (repeatable): Pin BOTH sides to the listed rendered topology label(s) (e.g. 1n2l4c2t). Sugar for --a-topology L1 --a-topology L2 --b-topology L1 --b-topology L2. Per-side --a-topology / --b-topology REPLACES this shared value for that side only. OR-combined: a row matches iff its rendered topology label equals ANY listed entry. Strict equality per entry.
  • --work-type TYPE (repeatable): Pin BOTH sides to the listed work_type(s) (PascalCase variants of WorkType, e.g. SpinWait). Sugar for --a-work-type T1 --a-work-type T2 --b-work-type T1 --b-work-type T2. Per-side --a-work-type / --b-work-type REPLACES this shared value for that side only. OR-combined: a row matches iff its work_type field equals ANY listed entry. Strict equality per entry. See WorkSpec types.
  • --project-commit HASH (repeatable): Pin BOTH sides to listed project_commit value(s) (7-char hex, optional -dirty suffix). Also accepts git revspecs (HEAD, HEAD~N, tags, branches, A..B ranges) resolved against the project repo into the same 7-char short hashes; see --help for details. Filters the ktstr framework commit; the scheduler binary’s commit (SidecarResult::scheduler_commit) is not currently exposed as a filter.
  • --kernel-commit HASH (repeatable): Pin BOTH sides to listed kernel_commit value(s) (7-char hex, optional -dirty suffix). Also accepts git revspecs (HEAD, HEAD~N, tags, branches, A..B ranges) resolved against the kernel repo (gix::open against KTSTR_KERNEL’s path); see --help for details. Filters the kernel SOURCE TREE commit (SidecarResult::kernel_commit), distinct from the kernel release version (--kernel): two runs of the same kernel_version with different kernel_commit values represent the same release rebuilt from different trees. Rows whose kernel_commit is None (KTSTR_KERNEL pointed at a non-git path, the underlying source was Tarball / Git rather than a Local tree, or the gix probe failed) NEVER match a populated filter.
  • --run-source NAME (repeatable): Pin BOTH sides to listed run-environment source(s). Filters SidecarResult::run_source set by detect_run_source at sidecar-write time: "local" for developer runs, "ci" when KTSTR_CI was set, or rewritten to "archive" at load time when --dir points at a non-default pool root. Rows whose run_source is None (sidecar pre-dates the field) NEVER match a populated filter — same opt-in policy as --kernel / --project-commit / --kernel-commit. Combine per-side --a-run-source ci --b-run-source local to contrast CI runs against developer runs of the same scenarios.
  • --a-kernel VER (repeatable): A-side kernel filter. Replaces the shared --kernel for the A side only.
  • --a-scheduler NAME (repeatable): A-side scheduler filter, OR-combined. Replaces the shared --scheduler value for the A side only.
  • --a-topology LABEL (repeatable): A-side topology filter, OR-combined. Replaces the shared --topology value for the A side only.
  • --a-work-type TYPE (repeatable): A-side work_type filter, OR-combined. Replaces the shared --work-type value for the A side only.
  • --a-project-commit HASH (repeatable): A-side project-commit filter. Replaces the shared --project-commit for the A side only.
  • --a-kernel-commit HASH (repeatable): A-side kernel-commit filter. Replaces the shared --kernel-commit for the A side only.
  • --a-run-source NAME (repeatable): A-side run-source filter. Replaces the shared --run-source for the A side only.
  • --b-kernel VER (repeatable): B-side kernel filter. Replaces the shared --kernel for the B side only.
  • --b-scheduler NAME (repeatable): B-side scheduler filter, OR-combined. Replaces the shared --scheduler value for the B side only.
  • --b-topology LABEL (repeatable): B-side topology filter, OR-combined. Replaces the shared --topology value for the B side only.
  • --b-work-type TYPE (repeatable): B-side work_type filter, OR-combined. Replaces the shared --work-type value for the B side only.
  • --b-project-commit HASH (repeatable): B-side project-commit filter. Replaces the shared --project-commit for the B side only.
  • --b-kernel-commit HASH (repeatable): B-side kernel-commit filter. Replaces the shared --kernel-commit for the B side only.
  • --b-run-source NAME (repeatable): B-side run-source filter. Replaces the shared --run-source for the B side only.
  • --no-average (default: off): Disable averaging. Each sidecar stays distinct; bails with an actionable error when multiple sidecars on the same side share the same pairing key (since pairing across sides becomes ambiguous).
  • --threshold PCT (default: per-metric default_rel): Uniform relative significance threshold in percent. Overrides the per-metric default_rel for every metric; the absolute gate is always per-metric and cannot be tuned from the CLI. Mutually exclusive with --policy.
  • --policy FILE: Path to a JSON ComparisonPolicy file with per-metric thresholds. Schema: { "default_percent": N, "per_metric_percent": { "worst_spread": 5.0, ... } }. Priority is per-metric override → default_percent → each metric’s registry default_rel. Per-metric keys are rejected at load time if they do not match a metric in the METRICS registry. Mutually exclusive with --threshold.
  • --dir DIR (default: target/ktstr/): Alternate runs root for pool collection. Defaults to test_support::runs_root() (typically target/ktstr/). Useful when comparing archived sidecar trees copied off a CI host.
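
A minimal --policy file matching the schema above (the 10% default is an arbitrary illustration; worst_spread is the metric the schema example names):

cat > policy.json <<'EOF'
{ "default_percent": 10.0, "per_metric_percent": { "worst_spread": 5.0 } }
EOF
cargo ktstr stats compare --a-kernel 6.14 --b-kernel 6.15 --policy policy.json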

Prerequisites

Run tests first to generate sidecar JSON files:

cargo ktstr test                     # generates target/ktstr/{kernel}-{project_commit}/*.json
cargo ktstr stats                    # reads the newest run

Set KTSTR_SIDECAR_DIR to override the sidecar directory; otherwise the default is {CARGO_TARGET_DIR or "target"}/ktstr/{kernel}-{project_commit}/, where {project_commit} is the project HEAD short hex (with -dirty when the worktree differs).

show-host

Print the live host context used by the sidecar collector: CPU identity, memory/hugepage config, transparent-hugepage policy, NUMA node count, kernel uname triple (sysname / release / machine), kernel cmdline, and every /proc/sys/kernel/sched_* tunable. Useful for diagnosing cross-run regressions that trace back to host-context drift (sysctl change, THP policy flip, hugepage reservation) or for confirming what cargo ktstr stats compare would record on the next run produced here.

cargo ktstr show-host

This is a live snapshot (reads /proc, /sys, and uname() at invocation time). For the archived host context captured at sidecar-write time for a past run, use cargo ktstr stats show-host --run RUN_ID instead — same HostContext::format_human formatter so the two outputs are byte-for-byte comparable when the host is unchanged.

For historical drift between archived runs (host-side diff across two run partitions), use cargo ktstr stats compare — its host-delta section reports which host-context fields changed between side A and side B using the same HostContext::diff logic.

show-thresholds

Print the resolved assertion thresholds for the named test — the same merged Assert value run_ktstr_test_inner evaluates against worker reports, produced by the runtime merge chain Assert::default_checks().merge(entry.scheduler.assert()).merge(&entry.assert). Surfaces every threshold field (or none when inherited or unset) so an operator can see what the test will actually check against without reading source or guessing which layer contributed each bound.

cargo ktstr show-thresholds preempt_regression_fault_under_load
  • TEST: Function-name-only test identifier as registered in #[ktstr_test] (e.g. preempt_regression_fault_under_load). Use cargo nextest list to enumerate test names — then strip the <binary>:: prefix that nextest prepends to each line before passing the name here. The #[ktstr_test] registry keys on the bare function name, so a name like ktstr::my_test (as printed by nextest) must be trimmed to my_test before it resolves.

Fails with an actionable message when no registered test matches the given name; the diagnostic includes a Did you mean ...? Levenshtein suggestion when a near match exists.
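
A sketch of the trimming step, assuming cargo nextest list prints one <binary>::<name> line per test as described above:

cargo nextest list | sed -n 's/.*:://p' | sort -u    # bare function names the registry keys on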

locks

Enumerate every ktstr flock held on this host — read-only, does NOT attempt any flock acquire. Troubleshooting companion for --cpu-cap contention: when a build or test is stalled behind a peer’s reservation, cargo ktstr locks names the peer (PID + cmdline) without disturbing any of its flocks.

Scans four lock-file roots:

  • /tmp/ktstr-llc-*.lock — per-LLC reservations held by perf-mode test runs and --cpu-cap-bounded builds.
  • /tmp/ktstr-cpu-*.lock — per-CPU reservations from the same flow.
  • {cache_root}/.locks/*.lock — cache-entry locks held during kernel build writes, and source-{path_hash}.lock files held for the duration of kernel build --source and cargo ktstr test --kernel <path> against the same source tree.
  • {runs_root}/.locks/{kernel}-{project_commit}.lock — per-run-key sidecar-write locks held for the duration of the (pre-clear + write) cycle to serialize concurrent ktstr processes targeting the same run directory.

Each lock is cross-referenced against /proc/locks to name the holder PID and cmdline.

cargo ktstr locks                       # one-shot snapshot
cargo ktstr locks --json                # JSON snapshot
cargo ktstr locks --watch 1s            # redraw every second until SIGINT
cargo ktstr locks --watch 1s --json     # ndjson stream, one object per interval
  • --json (default: off): Emit the snapshot as JSON. Pretty-printed in one-shot mode; compact (one object per line, ndjson-style) under --watch. Stable field names — schema documented on ktstr::cli::list_locks.
  • --watch DURATION (default: unset): Redraw the snapshot at the given interval until SIGINT. Value is parsed by humantime: 100ms, 1s, 5m, 1h. Human output clears and redraws in place; --json emits one line-terminated object per interval.

The same subcommand is available as ktstr locks with identical flag semantics.
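
Under --watch, the --json stream is ndjson, so each interval's object pipes straight into jq (an external tool):

cargo ktstr locks --watch 1s --json | jq .    # pretty-print one snapshot per interval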

Install

cargo install --locked ktstr --bin ktstr --bin cargo-ktstr   # the two user-facing binaries

The explicit --bin flags scope the install to just ktstr and cargo-ktstr; without them, cargo install would also place the test-fixture binaries (ktstr-jemalloc-probe, ktstr-jemalloc-alloc-worker) on $PATH.

Or build from the workspace:

cargo build --bin cargo-ktstr

Auto-Repro

When a test fails because the scheduler crashes or exits, auto-repro boots a second VM with BPF probes attached to capture function arguments and struct fields along the scheduling path. Stack functions extracted from the crash output seed the probe list; when no crash stack is available (e.g. a BPF text error or verifier failure with no backtrace), auto-repro falls back to dynamic BPF program discovery in the repro VM.

How it works

  1. First VM – the test runs normally. If the scheduler crashes or exits (BPF error, verifier failure, stall), ktstr captures any stack trace from the scheduler log (COM2) or kernel console (COM1).

  2. Stack extraction – function names are parsed from the crash trace when available. BPF program symbols (bpf_prog_*) are recognized and their short names extracted. Generic functions (scheduler entry points, spinlocks, syscall handlers, sched_ext exit machinery, BPF trampolines, stack dump helpers) are filtered out. When no stack functions are found, the pipeline continues with an empty probe list.

  3. BPF discovery – in the repro VM, ktstr discovers loaded struct_ops programs via libbpf-rs and adds them to the probe list alongside any stack-extracted functions. Their kernel-side callers are added (e.g. enqueue -> do_enqueue_task) for bridge kprobes. This step ensures probes can capture variable states across the scheduler exit call chain even when the crash produced no extractable stack.

  4. BTF resolution – function signatures are resolved from vmlinux BTF (kernel functions) and program BTF (BPF callbacks). Known struct types (task_struct, rq, scx_dispatch_q, etc.) have curated fields resolved to byte offsets. Other struct pointer params have scalar, enum, and cpumask pointer fields auto-discovered from vmlinux or BPF program BTF.

  5. Second VM – ktstr boots a new VM and reruns the scenario with BPF probes:

    • Kprobe skeleton for kernel function entry (uses bpf_get_func_ip)
    • Fentry/fexit skeleton for BPF callbacks and kernel function exit (batched in groups of 4, shares maps via reuse_fd). Fexit re-reads struct fields after the function executes, capturing post-mutation state alongside the entry snapshot.
    • Tracepoint trigger (tp_btf/sched_ext_exit) fires inside scx_claim_exit() when the scheduler exits, in the context of the current task at exit time
  6. Stitching – the task_struct pointer is read from the trigger event’s bpf_get_current_task() value. Events with a task_struct parameter are filtered to that pointer; events without a task_struct parameter are retained if their task_ptr (from bpf_get_current_task() at probe time) matches the triggering task. Events are sorted by timestamp and formatted with decoded field values (cpumask ranges, DSQ names, enqueue flags, etc.) and source locations (DWARF for kernel, line_info for BPF).

  7. Diagnostic tails – the last 40 lines of the repro VM’s scheduler log (COM2, cycle-collapsed), sched_ext dump (COM1), and kernel console (COM1) are appended after the probe output when non-empty. A duration line reports total repro VM wall time. When probe data is absent, a crash reproduction status line indicates whether the crash reproduced.

Requirements

Auto-repro requires a kernel with the sched_ext_exit tracepoint (used as the probe trigger). Kernels built with CONFIG_SCHED_CLASS_EXT and tracepoint support include this. If the tracepoint is unavailable, auto-repro is skipped and the pipeline diagnostics report the cause.

Enabling auto-repro

In #[ktstr_test]:

#[ktstr_test(auto_repro = true)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> { ... }

auto_repro defaults to true in #[ktstr_test].

Repro mode

During the second VM run, ktstr sets “repro mode”, which disables the work-conservation watchdog: workers normally send SIGUSR2 to the scheduler when stuck for more than 2 seconds. In repro mode the scheduler stays alive so BPF assertion probes can fire.

Example output

The demo_host_crash_auto_repro test triggers a host-initiated crash via BPF map write and captures the scheduling path. Probe output shows each function with decoded struct fields and source locations. When fexit captures post-mutation state, changed fields show an arrow (→) between entry and exit values:

ktstr_test 'demo_host_crash_auto_repro' [sched=scx-ktstr] failed:
  scheduler died

--- auto-repro ---
=== AUTO-PROBE: scx_exit fired ===

  ktstr_enqueue                                                   main.bpf.c:21
    task_struct *p
      pid         97
      cpus_ptr    0xf(0-3)
      dsq_id      SCX_DSQ_INVALID
      enq_flags   NONE
      slice       0
      vtime       0
      weight      100
      sticky_cpu  -1
      scx_flags   QUEUED|ENABLED
  do_enqueue_task                                               kernel/sched/ext.c:1344
    rq *rq
      cpu         1
    task_struct *p
      pid         97
      cpus_ptr    0xf(0-3)
      dsq_id      SCX_DSQ_INVALID          →  SCX_DSQ_LOCAL
      enq_flags   NONE
      slice       20000000
      vtime       0
      weight      100
      sticky_cpu  -1
      scx_flags   QUEUED|DEQD_FOR_SLEEP    →  QUEUED

After the probe data, the auto-repro section includes the repro VM duration and the last 40 lines of the repro VM’s scheduler log, sched_ext dump, and dmesg (each only when non-empty).

Demo test

A demo test of this shape (reduced from demo_host_crash_auto_repro in tests/scenario_coverage.rs):

use ktstr::prelude::*;
use ktstr::test_support::{BpfMapWrite, KtstrTestEntry, run_ktstr_test};

fn scenario_yield_heavy(ctx: &Ctx) -> Result<AssertResult> {
    let steps = vec![Step::with_defs(
        vec![
            CgroupDef::named("demo_workers")
                .work_type(WorkType::YieldHeavy)
                .workers(4),
        ],
        HoldSpec::Fixed(Duration::from_secs(8)),
    )];
    execute_steps(ctx, steps)
}

Run manually to see full output:

cargo ktstr test --kernel ../linux -- --run-ignored ignored-only -E 'test(demo_host_crash_auto_repro)'

BPF Verifier

The verifier sweep boots every declared scheduler in a KVM VM and captures per-program verifier statistics from the real kernel verifier.

Design

The verifier sweep follows ktstr’s two core principles.

Fidelity without overhead. Each scheduler binary runs inside a VM on the same kernel the scheduler targets in production. The verifier that runs is the real verifier in the real kernel — no host-side BPF loading, no version skew between the host kernel’s verifier and the target kernel’s verifier.

Direct access over tooling layers. No subprocess to bpftool or veristat. The host reads per-program verified_insns directly from guest memory via bpf_prog_aux introspection and applies cycle collapse to verifier logs instead of truncating.

Quick start

# Run every declared scheduler against the kernel discovered via
# KTSTR_KERNEL (or the cache).
cargo ktstr verifier

# Run against a specific kernel build.
cargo ktstr verifier --kernel ../linux

# Sweep across multiple kernels (each cell runs against its own).
cargo ktstr verifier --kernel 6.14.2 --kernel 6.15.0

# Print the raw verifier log without cycle collapse.
cargo ktstr verifier --raw

See cargo-ktstr verifier for the full flag list.

How it works

cargo ktstr verifier is a thin dispatcher around cargo nextest run -E 'test(/^verifier/)'. The matrix that nextest runs is generated by the test binaries themselves.

  1. Cell emission — every test binary that links the ktstr test support and contains at least one declare_scheduler! declaration emits a verifier/<sched>/<kernel>/<preset>: test listing for each (declared scheduler × kernel-list entry × accepted gauntlet topology preset) triple. Cells whose topology preset would exceed the host’s CPU / LLC / per-LLC capacity are filtered at emission time (mirroring the gauntlet variant filter).

  2. Per-cell dispatch — nextest invokes the test binary once per cell with --exact verifier/<sched>/<kernel>/<preset>. The binary’s #[ctor] intercepts the prefix, parses the cell name into its three components, resolves the scheduler binary, kernel directory, and topology, and boots a single VM dedicated to that cell.

  3. Verifier collection — inside the VM, the scheduler loads its BPF programs via scx_ops_load!; the real kernel verifier runs against them. The host reads per-program verified_insns from bpf_prog_aux via guest physical memory introspection. On load failure, libbpf prints the verifier log to stderr; the VM forwards it to the host via the bulk SHM port between ===SCHED_OUTPUT_START=== / ===SCHED_OUTPUT_END=== markers.

  4. Rendering — per-program summary lines, then the verifier log with cycle collapse applied (or raw, with --raw).

Eevdf + KernelBuiltin scheduler variants have no userspace binary to load BPF programs from, so they are skipped at cell-emission time. Direct invocation outside nextest (--exact verifier/<eevdf-sched>/...) prints a SKIP banner and exits 0.

Matrix dimension + per-scheduler filter

The verifier sweep matrix is driven by the operator’s --kernel set, not by per-scheduler declare_scheduler! declarations. The dispatcher always exports KTSTR_KERNEL_LIST (label1=path1;label2=path2;...) to the nextest invocation — even with no --kernel flag it synthesizes a single entry from the auto-discovered kernel. The test binary’s lister walks that list as the matrix dimension and emits one cell per (declared scheduler × kernel-list entry × accepted preset).

Each scheduler’s declare_scheduler! kernels = [...] declaration acts as a per-scheduler filter on the matrix:

  • Empty (kernels = []) accepts every kernel-list entry — the scheduler verifies against everything the operator passes.
  • Version specs ("6.14.2") match entries whose raw label equals the version (or whose sanitized label equals the sanitized form of the version).
  • Range specs ("6.14..6.16", "6.14..=6.16") match entries whose raw version falls in the inclusive range, parsed via the same decompose_version_for_compare helper the operator-side range expansion uses.
  • Path / CacheKey / Git specs match by sanitized-label equality.
# Scheduler declares kernels = ["6.14..6.16"]
# Operator runs --kernel 6.14.2 --kernel 6.15.0 --kernel 6.17.0
# Dispatcher's KTSTR_KERNEL_LIST: kernel_6_14_2, kernel_6_15_0,
#                                 kernel_6_17_0
# Scheduler filter: 6.14.2 ∈ [6.14, 6.16] ✓
#                   6.15.0 ∈ [6.14, 6.16] ✓
#                   6.17.0 ∈ [6.14, 6.16] ✗ — rejected
# Cells emitted: verifier/<sched>/kernel_6_14_2/<preset>
#                verifier/<sched>/kernel_6_15_0/<preset>
cargo ktstr verifier --kernel 6.14.2 --kernel 6.15.0 --kernel 6.17.0

# No --kernel: dispatcher auto-discovers one kernel via the
# cache + filesystem chain and synthesizes a single-entry
# KTSTR_KERNEL_LIST. The auto-discovered entry's label is
# derived from the resolved path (`kernel_path_<basename>_<hash6>`).
# Schedulers with non-empty `kernels = [...]` may filter the
# entry out — operators wanting deterministic coverage should
# always pass --kernel.
cargo ktstr verifier

The verifier cell handler resolves the per-cell kernel directory by looking up the cell’s sanitized label in KTSTR_KERNEL_LIST — there is no single-kernel fallback that would silently run a cell against an unrelated kernel. A label that doesn’t appear in the list errors out with an actionable diagnostic naming the present labels and pointing at both fix paths (add --kernel <SPEC> or drop the matching entry from declare_scheduler!).

Output

Brief (default)

Per-program summary line:

  ktstr_enqueue                              verified_insns=500

verified_insns is the number of instructions the kernel verifier processed, read from bpf_prog_aux via host-side memory introspection.

On load failure, the scheduler-log section shows libbpf’s verifier output with cycle collapse applied — repeating loop iterations are reduced to the first iteration, an omission marker, and the last iteration:

--- 8x of the following 10 lines ---
100: (bf) r0 = r1 ; frame1: R0_w=scalar(id=0,umin=0)
101: (bf) r1 = r2 ; frame1: R1_w=scalar(id=1,umin=1)
...
--- 6 identical iterations omitted ---
100: (bf) r0 = r1 ; frame1: R0_w=scalar(id=70,umin=700)
101: (bf) r1 = r2 ; frame1: R1_w=scalar(id=71,umin=701)
...
--- end repeat ---

Raw (--raw)

Full raw verifier log without cycle collapse. Use for debugging verification failures where the exact register state at each iteration matters. The flag exports KTSTR_VERIFIER_RAW=1 for the nextest invocation; the in-binary cell handler reads it via env::var_os and switches the format_verifier_output rendering branch.

Cycle collapse algorithm

The kernel verifier unrolls loops by re-verifying each instruction with updated register states. A bounded loop of 8 instructions verified 100 times produces 800 near-identical lines — differing only in register-state annotations. Naive truncation loses context. Cycle collapse preserves structure: the first iteration shows what the loop does, the last shows the final state, and a count tells you how many iterations were elided.

The algorithm normalizes lines by stripping variable annotations, then detects repeating blocks:

  1. Normalize — strip ; frame1: R0_w=... annotations, standalone register dumps (3041: R0=scalar()), and inline branch-target state after goto pc+N. Source comments (; for (int j = 0; ...)) are preserved as cycle anchors.

  2. Detect — find the most frequent normalized line (the “anchor”), compute gaps between anchor occurrences to determine the cycle period, then verify consecutive blocks match after normalization. Minimum period: 5 lines. Minimum repetitions: 3.

  3. Collapse — replace the cycle with the first iteration, an omission count, and the last iteration. Run iteratively (up to 5 passes) to handle nested loops.
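The detection step in this list can be sketched in a few lines. The following is a hypothetical illustration of the anchor-and-period idea over already-normalized lines, not the ktstr implementation; find_cycle and its (start, period, reps) return shape are invented here:

use std::collections::HashMap;

// Hypothetical helper (not the ktstr implementation): given already-
// normalized log lines, find (start, period, reps) of a repeating block
// satisfying the documented minimums (period >= 5, repetitions >= 3).
fn find_cycle(lines: &[String]) -> Option<(usize, usize, usize)> {
    let mut positions: HashMap<&str, Vec<usize>> = HashMap::new();
    for (i, l) in lines.iter().enumerate() {
        positions.entry(l.as_str()).or_default().push(i);
    }
    // Anchor = the most frequent normalized line.
    let occ = positions.values().max_by_key(|v| v.len())?;
    if occ.len() < 3 {
        return None;
    }
    // Candidate period = gap between the first two anchor occurrences.
    let (start, period) = (occ[0], occ[1] - occ[0]);
    if period < 5 {
        return None;
    }
    // Count consecutive blocks that repeat verbatim after normalization.
    let mut reps = 1;
    while start + (reps + 1) * period <= lines.len()
        && lines[start..start + period]
            == lines[start + reps * period..start + (reps + 1) * period]
    {
        reps += 1;
    }
    (reps >= 3).then_some((start, period, reps))
}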

scx-ktstr test flags

scx-ktstr supports these flags to exercise the verifier pipeline:

--fail-verify — sets a .rodata variable before scx_ops_load!, enabling a code path the BPF verifier rejects. On failure, libbpf prints the verifier log to stderr.

--verify-loop — sets a .rodata variable that enables an unrolled loop followed by while(1) in ktstr_dispatch. The verifier rejects the infinite loop and libbpf prints the full instruction trace to stderr, exercising cycle collapse.

Core Concepts

ktstr tests compose from four layers:

  1. Scenarios – what to test: cgroup layout, CPU partitioning, workloads, custom logic.

  2. Flags – which scheduler features to enable for each run.

  3. WorkSpec types – what each worker process does: CPU spin, yield, I/O, bursty patterns, pipe-based IPC.

  4. Checking – how to evaluate results: starvation, fairness, isolation, scheduling gaps, monitor thresholds.

These compose orthogonally. A scenario runs with every valid flag combination, and checks apply uniformly across all runs.

Four supporting concepts complete the picture:

  • Ops and Steps – the primary API for defining scenarios; most tests use CgroupDef and execute_defs from this module.
  • TestTopology – CPU and LLC layout for cpuset partitioning.
  • Performance Mode – host-side isolation for noise-sensitive measurements.
  • Resource Budget – the --cpu-cap tier that coordinates concurrent no-perf-mode VMs and kernel builds via LLC flocks and cgroup v2 cpuset sandboxes.

Scenarios

Scenarios define the scheduling conditions a test creates. Each scenario sets up cgroups, workers, and cpusets to produce a specific condition, then verifies the scheduler handles it correctly.

Canned scenarios (scenarios::*)

ktstr::scenario::scenarios provides curated scenario functions that can be called directly from #[ktstr_test]:

use ktstr::prelude::*;

#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
    scenarios::steady(ctx)
}

Function          Condition tested                         Setup
steady            Baseline fairness                        2 cgroups, no cpusets, equal CPU-spin load
steady_llc        LLC-boundary scheduling                  2 cgroups with LLC-aligned cpusets
oversubscribed    Dispatch under oversubscription          2 cgroups, 32 mixed workers each
cpuset_apply      Cpuset assignment on running tasks       Disjoint cpusets applied mid-run
cpuset_clear      Cpuset removal on confined tasks         Cpusets cleared mid-run
cpuset_resize     Cpuset resizing adaptation               Cpusets shrink then grow
cgroup_add        New cgroup appearance                    Cgroups added mid-run
cgroup_remove     Cgroup removal while others run          Cgroups removed mid-run
affinity_change   Affinity mask changes                    Worker affinities randomized mid-run
affinity_pinned   Narrow-affinity contention               Workers pinned to 2-CPU subset
host_contention   Fairness between cgroup and host tasks   Host workers vs cgroup workers
mixed_workloads   Mixed workload fairness                  Heavy + bursty + IO cgroups
nested_steady     Nested cgroup hierarchy                  Workers in nested sub-cgroups
nested_task_move  Cross-level task migration               Tasks moved between nested cgroups

Additional custom_* functions are available in ktstr::scenario::{affinity, basic, cpuset, dynamic, interaction, nested, performance, stress}. See the API docs for the full list.

Most tests use these canned functions or build custom scenarios with CgroupDef and execute_defs / execute_steps (see Ops and Steps). Custom scenarios receive a Ctx reference and use the same building blocks; see Custom Scenarios for the Ctx struct and helper functions.

WorkType

WorkType controls what each worker process does during a scenario.

The WorkType enum in ktstr::workload is the source of truth. The variants below are grouped by intent; each one-line summary is the leading sentence of the variant’s rustdoc. Run cargo doc --open for full per-variant semantics, parameter ranges, and kernel-path citations — this page reproduces only the high-level shape.

pub enum WorkType {
    // CPU primitives
    SpinWait,                            // Tight CPU spin loop (1024 iterations per cycle).
    YieldHeavy,                          // Repeated sched_yield with minimal CPU work.
    Mixed,                               // CPU spin burst followed by sched_yield.
    AluHot { width: AluWidth },          // Dependent integer multiply chain at high IPC (>= 2.0); optional SIMD width.
    SmtSiblingSpin,                      // Tight PAUSE-spin from a paired worker pinned to two SMT siblings.
    IpcVariance {                        // Alternating high-IPC (multiplies) / low-IPC (random cache touches) phases.
        hot_iters: u64,
        cold_iters: u64,
        period_iters: u64,
    },

    // Block-device I/O (operates on /dev/vda; falls back to per-worker tempfile when absent)
    IoSyncWrite,                         // 16 x 4 KB pwrites + fdatasync per iteration (O_SYNC).
    IoRandRead,                          // Single 4 KB pread at a random sector-aligned offset (O_DIRECT).
    IoConvoy,                            // Interleaved sequential pwrite + random pread with periodic fdatasync (O_DIRECT).

    // Burst-and-sleep
    Bursty {                             // CPU burst for `burst_duration`, sleep for `sleep_duration`, repeat.
        burst_duration: Duration,
        sleep_duration: Duration,
    },
    IdleChurn {                          // CPU burst then `nanosleep` (exercises hrtimer + idle-class path).
        burst_duration: Duration,
        sleep_duration: Duration,
        precise_timing: bool,
    },

    // Cache pressure
    CachePressure { size_kb: usize, stride: usize },    // Strided RMW sized to pressure L1.
    CacheYield { size_kb: usize, stride: usize },       // Cache pressure burst then sched_yield().

    // Wake-placement / cross-CPU paths
    PipeIo { burst_iters: u64 },                        // CPU burst then 1-byte pipe exchange with a partner worker.
    FutexPingPong { spin_iters: u64 },                  // Paired futex wait/wake between partner workers (non-WF_SYNC).
    CachePipe { size_kb: usize, burst_iters: u64 },     // Cache-hot working set + pipe wake.
    FutexFanOut { fan_out: usize, spin_iters: u64 },    // 1:N fan-out wake (one messenger, N receivers).
    FanOutCompute {                                     // Messenger/worker fan-out with matrix-multiply compute per receiver.
        fan_out: usize,
        cache_footprint_kb: usize,
        operations: usize,
        sleep_usec: u64,
    },
    AsymmetricWaker {                                   // Paired workers in mismatched scheduling classes share one futex word.
        waker_class: SchedClass,
        wakee_class: SchedClass,
        burst_iters: u64,
    },
    WakeChain {                                         // Ring of waker-wakee hops via Pipe (WF_SYNC) or Futex wake.
        depth: usize,
        wake: WakeMechanism,
        work_per_hop: Duration,
    },
    EpollStorm {                                        // eventfd producers + epoll_wait consumers (exclusive autoremove wake).
        producers: usize,
        consumers: usize,
        events_per_burst: u64,
    },
    ThunderingHerd {                                    // N waiters on ONE global futex word; broadcast FUTEX_WAKE rouses the herd.
        waiters: usize,
        batches: u64,
        inter_batch_ms: u64,
    },

    // Compound / sequence
    Sequence { first: Phase, rest: Vec<Phase> },        // Loop through ordered phases (Spin / Sleep / Yield / Io).

    // Lifecycle / scheduling-class churn
    ForkExit,                                           // Rapid fork+_exit cycling; parent waitpid's then repeats.
    NiceSweep,                                          // Cycle nice level from -20 to 19 across iterations.
    AffinityChurn { spin_iters: u64 },                  // Rapid self-directed sched_setaffinity to random CPUs.
    PolicyChurn { spin_iters: u64 },                    // Cycle SCHED_OTHER -> BATCH -> IDLE (-> FIFO/RR if CAP_SYS_NICE).
    NumaMigrationChurn { period_ms: u64 },              // Rotate sched_setaffinity across NUMA nodes.
    CgroupChurn { groups: usize, cycle_ms: u64 },       // Cycle cgroup membership between sibling cgroups.

    // Memory pressure / NUMA
    PageFaultChurn {                                    // mmap NOHUGEPAGE -> touch random pages -> MADV_DONTNEED, repeat.
        region_kb: usize,
        touches_per_cycle: usize,
        spin_iters: u64,
    },
    NumaWorkingSetSweep {                               // Rotate the working-set memory across NUMA nodes via mbind.
        region_kb: usize,
        sweep_period_ms: u64,
        target_nodes: Vec<usize>,
    },

    // Lock contention
    MutexContention {                                   // N-way futex mutex contention (CAS acquire / FUTEX_WAIT on failure).
        contenders: usize,
        hold_iters: u64,
        work_iters: u64,
    },
    PriorityInversion {                                 // Three priority tiers contending for one shared lock (Pi or Plain futex).
        high_count: usize,
        medium_count: usize,
        low_count: usize,
        hold_iters: u64,
        work_iters: u64,
        pi_mode: FutexLockMode,
    },

    // Producer/consumer + signal/preempt pressure
    ProducerConsumerImbalance {                         // Producer / consumer pipeline with deliberately-unbalanced rates.
        producers: usize,
        consumers: usize,
        produce_rate_hz: u64,
        consume_iters: u64,
        queue_depth_target: u64,
    },
    SignalStorm {                                       // Paired workers fire tkill(partner, SIGUSR1) between CPU bursts.
        signals_per_iter: u64,
        work_iters: u64,
    },
    PreemptStorm {                                      // One SCHED_FIFO worker preempts CFS spinners on the same CPU at ~kHz rate.
        cfs_workers: usize,
        rt_burst_iters: u64,
        rt_sleep_us: u64,
    },
    RtStarvation {                                      // SCHED_FIFO workers monopolise the CPU at 100%; CFS workers starve.
        rt_workers: usize,
        cfs_workers: usize,
        rt_priority: i32,
        burst_iters: u64,
    },

    // User-supplied
    Custom {                                            // User-supplied work function (name + fn pointer).
        name: String,
        run: fn(&AtomicBool) -> WorkerReport,
    },
}

Imports: WorkType, Phase, SchedPolicy, WorkSpec, and WorkloadConfig are in ktstr::prelude::*. The auxiliary enums FutexLockMode (used by PriorityInversion::pi_mode), WakeMechanism (used by WakeChain::wake), and SchedClass (used by AsymmetricWaker) live under ktstr::workload. Bring them into scope with use ktstr::workload::*; (or import each by name) before writing variant literals that reference them.

Parameterized variants have snake-case convenience constructors — e.g. WorkType::bursty(burst_duration, sleep_duration), WorkType::pipe_io(burst_iters), WorkType::cache_pressure(size_kb, stride), WorkType::page_fault_churn(region_kb, touches_per_cycle, spin_iters), WorkType::mutex_contention(contenders, hold_iters, work_iters), WorkType::priority_inversion(high_count, medium_count, low_count, hold_iters, work_iters, pi_mode), WorkType::wake_chain(depth, wake, work_per_hop), WorkType::custom(name, run). Every parameterised variant has one; see cargo doc --open on WorkType for the full constructor list and parameter validation rules.

Bursty, IdleChurn, and WakeChain take Duration parameters (humantime-serialised in captured configs) — pass Duration::from_millis(N) or Duration::from_micros(N) from std::time rather than raw integers. IpcVariance, ProducerConsumerImbalance, RtStarvation, PriorityInversion, EpollStorm, PreemptStorm, and ThunderingHerd reject zero-valued counters at spawn time (WorkTypeValidationError::*).
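For example, using constructor names from the list above (values are illustrative):

use std::time::Duration;
use ktstr::prelude::*;

// Duration-typed parameters come from std::time, not raw integers.
let bursty = WorkType::bursty(Duration::from_millis(50), Duration::from_millis(100));
let pipe = WorkType::pipe_io(1024);
let cache = WorkType::cache_pressure(32, 64);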

Choosing a work type

Scheduler behavior to test              Recommended work type
Basic load balancing / fairness         SpinWait (default)
Wake placement / sleep-wake cycles      YieldHeavy, FutexPingPong
CPU borrowing / idle balance            Bursty
Cross-CPU wake latency                  PipeIo, CachePipe
Cache-aware scheduling                  CachePressure, CacheYield
Cache-aware fan-out wake latency        FanOutCompute
Fan-out wake storms                     FutexFanOut
Mixed real-world patterns               Sequence
Task creation/destruction pressure      ForkExit
Priority reweighting / nice dynamics    NiceSweep
Rapid CPU migration / affinity churn    AffinityChurn
Scheduling class transitions            PolicyChurn
Page fault / TLB pressure               PageFaultChurn
Lock contention / convoy effect         MutexContention
Arbitrary user-defined workload         Custom

Variants

SpinWait – tight spin loop with spin_loop() hints. 1024 iterations per check. Pure CPU-bound workload.

YieldHeavythread::yield_now() on every iteration. Exercises scheduler wake/sleep paths.

Mixed – 1024 spin iterations then yield. Combines CPU and voluntary preemption.

IoSyncWrite – 16 × 4 KB pwrites totaling 64 KB at the worker’s stripe offset (per-worker striping prevents fdatasync from coalescing across writers), then fdatasync(). Drives fsync-heavy D-state cycles. Opens /dev/vda with O_SYNC; falls back to a per-worker tempfile when /dev/vda is absent (host-side unit tests).

IoRandRead – single 4 KB pread at a sector-aligned random offset within the device capacity. Opens /dev/vda with O_DIRECT (tempfile fallback); drives high-IOPS short-D-state cycles. Per-worker xorshift PRNG seeded from tid.

IoConvoy – alternates 4 KB pwrite at the worker’s monotonic sequential cursor with 4 KB pread at a random offset; fdatasync() runs every 16 iterations. /dev/vda opened O_DIRECT (tempfile fallback). Currently uses direct IO so the pathology surface is the synchronous flush + IO-mix latency rather than page-cache convoy build-up.

Bursty – CPU burst for burst_duration, sleep for sleep_duration, repeat. Both fields are Duration (humantime-serialised); pass Duration::from_millis(N) from std::time. Frees CPUs during sleep, exercising CPU borrowing.

PipeIo – CPU burst then 1-byte pipe exchange with a partner worker. Workers are paired: (0,1), (2,3), etc. Sleep duration depends on partner scheduling, exercising cross-CPU wake placement. Requires even num_workers.

FutexPingPong – paired futex wait/wake between partner workers. Each iteration does spin_iters of CPU work then wakes the partner and waits on a shared futex word. Exercises the non-WF_SYNC wake path. Requires even num_workers.

CachePressure – strided read-modify-write over a buffer sized to pressure the L1 cache. Each worker allocates its own buffer post-fork. size_kb controls buffer size, stride controls the byte step between accesses.

CacheYield – cache pressure followed by sched_yield(). Tests scheduler re-placement after voluntary yield with a cache-hot working set.

CachePipe – cache pressure burst then 1-byte pipe exchange with a partner worker. Combines cache-hot working set with cross-CPU wake placement. Requires even num_workers.

FutexFanOut – 1:N fan-out wake pattern without cache pressure. One messenger per group does spin_iters of CPU spin work then wakes fan_out receivers via FUTEX_WAKE. Receivers measure wake-to-run latency. For cache-aware fan-out with matrix multiply work, see FanOutCompute. Requires num_workers divisible by fan_out + 1.

FanOutCompute – messenger/worker fan-out with compute work. One messenger per group stamps a CLOCK_MONOTONIC timestamp then wakes fan_out workers via FUTEX_WAKE. Workers measure wake-to-run latency (time from messenger’s timestamp to worker getting the CPU), sleep for sleep_usec microseconds (simulating think time), then do operations iterations of naive matrix multiply over a cache_footprint_kb-sized working set (three square matrices of u64, O(n^3)). Requires num_workers divisible by fan_out + 1.

Sequence – compound work pattern: loop through phases in order, repeat. Each phase runs for its specified duration before the next starts. Phases are defined via the Phase enum:

  • Phase::Spin(Duration) – CPU spin for the given duration.
  • Phase::Sleep(Duration)thread::sleep for the given duration.
  • Phase::Yield(Duration) – repeated sched_yield for the given duration.
  • Phase::Io(Duration) – simulated I/O (write 64 KB + 100 us sleep) for the given duration.

Sequence cannot be constructed via WorkType::from_name() because it requires explicit phase definitions. Build it directly:

WorkType::Sequence {
    first: Phase::Spin(Duration::from_millis(100)),
    rest: vec![
        Phase::Sleep(Duration::from_millis(50)),
        Phase::Yield(Duration::from_millis(20)),
    ],
}

ForkExit – rapid fork+_exit cycling. Each iteration forks a child that immediately calls _exit(0). The parent waitpids then repeats. Exercises wake_up_new_task, do_exit, and wait_task_zombie.

NiceSweep – cycles the worker’s nice level from -20 to 19 across iterations. Each iteration: 512-iteration spin burst, setpriority(PRIO_PROCESS, 0, nice_val), then sched_yield. Exercises reweight_task and dynamic priority reweighting. Skips negative nice values when CAP_SYS_NICE is absent. Resets nice to 0 before exit. Records per-yield wake latency.

AffinityChurn – rapid self-directed CPU affinity changes. Each iteration: spin_iters spin burst, sched_setaffinity to a random CPU from the effective cpuset, then sched_yield. Exercises affine_move_task and migration_cpu_stop. Records per-yield wake latency.

PolicyChurn – cycles through scheduling policies each iteration. Each iteration: spin_iters spin burst, sched_setscheduler to the next policy in the sequence, then sched_yield. Cycles through SCHED_OTHER, SCHED_BATCH, SCHED_IDLE (and SCHED_FIFO/SCHED_RR with priority 1 when CAP_SYS_NICE is available). Exercises __sched_setscheduler and scheduling class transitions. Resets to SCHED_OTHER before exit. Records per-yield wake latency.

PageFaultChurn – rapid page fault cycling. Workers mmap a region_kb KB region with MADV_NOHUGEPAGE (forcing 4 KB pages), touch touches_per_cycle random pages via write faults through do_anonymous_page, then MADV_DONTNEED to zap PTEs and repeat. spin_iters iterations of CPU work separate cycles. Exercises the page allocator, TLB pressure on migration, and rapid user/kernel transitions. Uses xorshift64 PRNG for random page selection (seeded from the process ID).

MutexContention – N-way futex mutex contention. contenders workers per group contend on a shared AtomicU32 via CAS acquire (FUTEX_WAIT on failure). Loop: spin_burst(work_iters) then CAS acquire, spin_burst(hold_iters) in the critical section, then store 0 + FUTEX_WAKE(1) to release. Exercises convoy effect, lock-holder preemption cascading stalls, and futex wait/wake contention paths. Requires num_workers divisible by contenders.

Custom – user-supplied work function. The run function pointer receives a reference to the stop flag (&AtomicBool, set by SIGUSR1) and returns a WorkerReport when the flag becomes true. The framework handles fork, cgroup placement, affinity, scheduling policy, and signal setup; the user function owns the work loop and all WorkerReport field population. Framework telemetry (migration tracking, gap detection, schedstat deltas, iteration counter updates) is not provided – the user function is responsible for any telemetry it needs.

Warning — pgid SIGKILL sweep on teardown. Every worker process calls setpgid(0, 0) immediately after fork, so the worker and any children a Custom closure spawns share a single process group. At teardown, stop_and_collect issues killpg(worker_pid, SIGKILL) on BOTH the graceful-exit and StillAlive-escalation paths, and WorkloadHandle::drop issues another killpg during handle destruction. Every descendant that inherits the worker’s pgid (a helper binary via execv, a subshell via sh -c, a test fixture the closure forks to drive the scheduler) will be SIGKILLed at teardown. Closures that need a child to outlive the worker must either detach it from the worker’s pgid (call setpgid(child_pid, 0) after fork) or wait on it explicitly before returning the WorkerReport.

Function pointers (fn(&AtomicBool) -> WorkerReport) are fork-safe because they carry no captured state across the fork boundary. Closures are not supported. Cannot be constructed via WorkType::from_name().

use std::sync::atomic::{AtomicBool, Ordering};
use ktstr::workload::{WorkType, WorkerReport};

fn my_workload(stop: &AtomicBool) -> WorkerReport {
    // `tid` in `WorkerReport` is an `i32` (libc::pid_t). Using
    // `std::process::id() as i32` avoids a direct `libc` dependency in
    // the consumer crate; inside ktstr the two produce the same value
    // because one worker = one process (no threads).
    let tid: i32 = std::process::id() as i32;
    let start = std::time::Instant::now();
    let mut work_units = 0u64;
    while !stop.load(Ordering::Relaxed) {
        // ... custom work ...
        work_units += 1;
    }
    let wall_time_ns = start.elapsed().as_nanos() as u64;
    // Start from `WorkerReport::default()` so the fields you don't
    // populate take their zero / empty values automatically and new
    // fields added to `WorkerReport` in the future do not require an
    // edit here. Only populate the telemetry your custom workload
    // actually produces.
    WorkerReport {
        tid,
        work_units,
        wall_time_ns,
        iterations: work_units,
        ..WorkerReport::default()
    }
}

let wt = WorkType::custom("my_workload", my_workload);

Grouped work types

PipeIo, FutexPingPong, and CachePipe require num_workers divisible by 2 (paired). FutexFanOut and FanOutCompute require num_workers divisible by fan_out + 1 (1 messenger + N receivers per group). MutexContention requires num_workers divisible by contenders. WorkType::worker_group_size() returns the group size for these variants, or None for ungrouped types. PipeIo and CachePipe use pipes; FutexPingPong, FutexFanOut, FanOutCompute, and MutexContention use shared mmap pages with futex wait/wake.
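A sketch of the group-size accessor, assuming worker_group_size returns Option<usize>:

use ktstr::workload::WorkType;

// Paired types report a group size of 2; ungrouped types return None.
assert_eq!(WorkType::PipeIo { burst_iters: 1024 }.worker_group_size(), Some(2));
assert_eq!(WorkType::SpinWait.worker_group_size(), None);

// FutexFanOut groups are 1 messenger + fan_out receivers.
let fan = WorkType::FutexFanOut { fan_out: 4, spin_iters: 1024 };
assert_eq!(fan.worker_group_size(), Some(5));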

Clone-mode and pcomm interactions

CloneMode is a per-WorkloadConfig enum with two variants — Fork (the default; each worker is its own thread group, reaped via waitpid) and Thread (workers share the parent’s tgid, run as std::thread::spawn threads, reaped via JoinHandle).

pcomm is not a CloneMode variant — it is a WorkSpec field set via WorkSpec::pcomm(name) / CgroupDef::pcomm(name) in the tutorial. When a WorkSpec carries pcomm = Some(name), apply_setup routes it through the fork-then-thread spawn path: ONE forked thread-group leader whose task->comm is name hosts every matching worker as a pthread-style thread under that leader. Workers sharing a pcomm value coalesce into one container; this combines the per-process-leader visibility schedulers expect (a chrome parent, a java parent) with the in-process std::thread::spawn dispatch shape CloneMode::Thread already uses for the worker bodies themselves.

PipeIo and CachePipe work correctly inside a pcomm container. When workers run as threads inside one forked leader, the per-pair pipe-fd indices computed in the global pipe_pairs table are addressed by each worker’s position WITHIN the container’s thread group, so worker A reads its partner’s write end whether the pair lives in two forked processes (Fork mode) or in two threads of one pcomm container.

SignalStorm uses tkill(partner_tid, SIGUSR1) (per-task signal delivery, PIDTYPE_PID), NOT kill (per-tgid, PIDTYPE_TGID) and NOT tgkill(self_tgid, partner_tid, …) (would return ESRCH under Fork mode because each forked worker is its own tgid leader). tkill looks up the target via find_task_by_vpid(pid) and skips the tgid check, so the signal hits the partner thread’s per-task pending queue under Fork and Thread modes uniformly — including inside pcomm-coalesced thread groups. Sibling threads in a pcomm container do NOT dequeue each other’s SignalStorm signals because the PIDTYPE_PID queue is per-task, not per-tgid.

Default values

WorkType::from_name() uses these defaults:

  • Bursty: burst_duration=50ms, sleep_duration=100ms
  • PipeIo: burst_iters=1024
  • FutexPingPong: spin_iters=1024
  • CachePressure: size_kb=32, stride=64
  • CacheYield: size_kb=32, stride=64
  • CachePipe: size_kb=32, burst_iters=1024
  • FutexFanOut: fan_out=4, spin_iters=1024
  • FanOutCompute: fan_out=4, cache_footprint_kb=256, operations=5, sleep_usec=100
  • AffinityChurn: spin_iters=1024
  • PolicyChurn: spin_iters=1024
  • PageFaultChurn: region_kb=4096, touches_per_cycle=256, spin_iters=64
  • MutexContention: contenders=4, hold_iters=256, work_iters=1024

String lookup

WorkType::from_name() accepts PascalCase names matching the enum variants (e.g. "SpinWait", "FutexPingPong"). Sequence and Custom return None because they require explicit construction parameters. WorkType::ALL_NAMES lists every variant name. WorkType::name() returns the PascalCase name for a given value; for Custom, it returns the user-provided name field.
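For example:

use ktstr::workload::WorkType;

// PascalCase lookup; parameterized variants get the documented defaults.
let wt = WorkType::from_name("FutexPingPong").expect("known variant");
assert_eq!(wt.name(), "FutexPingPong");

// Sequence and Custom need explicit parameters, so lookup returns None.
assert!(WorkType::from_name("Sequence").is_none());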

WorkloadConfig

WorkloadConfig is the low-level struct passed to WorkloadHandle::spawn(). CgroupDef builds one internally; use WorkloadConfig directly when calling setup_cgroups() or WorkloadHandle::spawn() in custom scenarios.

pub struct WorkloadConfig {
    pub num_workers: usize,           // Number of worker processes to fork
    pub affinity: AffinityIntent,     // Per-worker affinity intent (resolved at spawn time)
    pub work_type: WorkType,          // What each worker does
    pub sched_policy: SchedPolicy,    // Linux scheduling policy
    pub mem_policy: MemPolicy,        // NUMA memory placement policy
    pub mpol_flags: MpolFlags,        // Optional mode flags for set_mempolicy(2)
    pub nice: Option<i32>,            // Per-worker nice via setpriority(2); None inherits
    pub clone_mode: CloneMode,        // Fork (default) or Thread dispatch
    pub comm: Option<Cow<'static, str>>, // task->comm via prctl(PR_SET_NAME); kernel truncates to 15 bytes
    pub uid: Option<u32>,             // Effective UID via setresuid; None inherits
    pub gid: Option<u32>,             // Effective GID via setresgid; None inherits
    pub numa_node: Option<u32>,       // Restrict affinity to one NUMA node's CPU set
    pub composed: Vec<WorkSpec>,      // Secondary worker groups spawned alongside the primary
}

Default: 1 worker, AffinityIntent::Inherit, SpinWait, Normal policy, Default mem_policy, no mpol_flags, nice/comm/uid/gid/numa_node = None, clone_mode = Fork, composed = empty.

AffinityIntent is the type-unified affinity expression used at the top level and inside WorkSpec entries — Inherit, Exact(...), and RandomSubset(...) are accepted at WorkloadHandle::spawn; topology-aware variants (SingleCpu, LlcAligned, CrossCgroup, SmtSiblingPair) require scenario context and are rejected at the spawn gate with an actionable diagnostic. composed carries secondary WorkSpec groups that spawn alongside the primary; each composed entry can override work_type, num_workers, sched_policy, affinity, etc., and reports back via WorkerReport::group_idx (0 for the primary, 1..=N for composed entries in declaration order).
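A minimal construction sketch, assuming WorkloadConfig implements Default per the defaults listed above:

use ktstr::prelude::*;

// Override only what the scenario needs; every other field keeps its
// default (1 worker, Inherit affinity, Fork mode, ...).
let config = WorkloadConfig {
    num_workers: 4,
    work_type: WorkType::YieldHeavy,
    sched_policy: SchedPolicy::Batch,
    nice: Some(5),
    ..Default::default()
};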

See MemPolicy for the NUMA memory placement API.

Scheduling policies

Workers can run under different Linux scheduling policies:

pub enum SchedPolicy {
    Normal,
    Batch,
    Idle,
    Fifo(u32),       // priority 1-99
    RoundRobin(u32), // priority 1-99
    Deadline {
        runtime: Duration,   // budget per period
        deadline: Duration,  // relative deadline from period start
        period: Duration,    // period; Duration::ZERO uses `deadline`
    },
}

Fifo, RoundRobin, and Deadline require CAP_SYS_NICE. The sched-deadline gate (runtime <= deadline <= period, all non-zero unless period == Duration::ZERO, which the kernel substitutes with deadline) is validated user-side in SchedPolicy::deadline() before sched_setattr so a malformed Deadline fails fast rather than tunneling EINVAL through the syscall.
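A construction sketch using the variant shapes above (values illustrative; Fifo and Deadline need CAP_SYS_NICE at spawn):

use std::time::Duration;
use ktstr::prelude::*;

let fifo = SchedPolicy::Fifo(10);

// Must satisfy runtime <= deadline <= period; a Duration::ZERO period
// is substituted with `deadline`, per the user-side gate above.
let dl = SchedPolicy::Deadline {
    runtime: Duration::from_millis(2),
    deadline: Duration::from_millis(10),
    period: Duration::from_millis(10),
};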

Overriding work types

The work type override (configured via gauntlet or Ctx.work_type_override) replaces the default SpinWait work type for all scenarios that use it. Scenarios with non-SpinWait work types are not overridden.

Overrides to grouped work types (PipeIo, FutexPingPong, CachePipe, FutexFanOut, FanOutCompute, MutexContention) are skipped when num_workers is not divisible by the work type’s group size.

Ops-based scenarios have a separate override mechanism via CgroupDef.swappable. See Ops and Steps.

Checking

ktstr checks scheduler behavior through two channels: worker-side telemetry and host-side monitoring.

Worker checks

After each scenario, ktstr collects WorkerReport from every worker process. Several checks run against these reports:

Starvation – any worker with work_units == 0 fails the test.

Fairness – workers in the same cgroup should get similar CPU time. The “spread” (max off-CPU% - min off-CPU%) must be below a threshold (15% in release builds, 35% in debug). Violations report the spread and per-cgroup statistics.

Scheduling gaps – the longest wall-clock gap observed at work-unit checkpoints. Gaps above a threshold (2000ms release, 3000ms debug) indicate the scheduler dropped a task. Reports include the gap duration, CPU, and timing.

Cpuset isolation – workers must only run on CPUs in their assigned cpuset. Any execution on an unexpected CPU fails the test. Opt-in via isolation = true on the #[ktstr_test] attribute or via Assert::check_isolation(); Assert::default_checks() leaves this None, so the runtime merge resolves to false and the check is skipped unless explicitly enabled.

Throughput parityassert_throughput_parity() checks that workers produce similar throughput (work_units per CPU-second). Two thresholds:

  • max_throughput_cv: coefficient of variation across workers. High CV means the scheduler gives some workers disproportionately less effective CPU. Requires at least 2 workers with nonzero CPU time.
  • min_work_rate: minimum work_units per CPU-second per worker. Catches cases where all workers are equally slow (CV passes but absolute throughput is too low).

Neither threshold is set by default; enable via Assert setters or #[ktstr_test] attributes.

Benchmarkingassert_benchmarks() checks per-wakeup latency and iteration throughput. Three thresholds:

  • max_p99_wake_latency_ns: p99 of all resume_latencies_ns samples across workers in a cgroup. Populated only for work types that record wake-to-run latency: IoSyncWrite, IoRandRead, IoConvoy, Bursty, PipeIo, FutexPingPong, CacheYield, CachePipe, FutexFanOut (receivers), Sequence (Sleep / Yield / Io phases), ForkExit, NiceSweep, AffinityChurn, PolicyChurn, FanOutCompute, MutexContention. Pure-CPU work types (SpinWait, Mixed, CachePressure, PageFaultChurn) do not record samples.
  • max_wake_latency_cv: coefficient of variation of wake latency samples. High CV means inconsistent scheduling latency.
  • min_iteration_rate: minimum outer-loop iterations per wall-clock second per worker.

None are set by default. Set via Assert setters or #[ktstr_test] attributes.
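For example, assuming the setter names mirror the field names (as max_gap_ms does under Worker checks via Assert below); the thresholds here are illustrative and workload-dependent:

let a = Assert::default_checks()
    .max_throughput_cv(0.25)            // flag uneven effective CPU across workers
    .min_work_rate(1_000.0)             // floor on work_units per CPU-second
    .max_p99_wake_latency_ns(5_000_000) // 5 ms p99 wake-to-run
    .min_iteration_rate(50.0);          // outer-loop iterations per second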

Monitor checks

The host-side monitor reads guest VM memory (per-CPU runqueue structs via BTF offsets) and evaluates:

  • Imbalance ratio: max(nr_running) / max(1, min(nr_running)) across CPUs. The denominator is clamped to 1 so an all-idle sample does not divide by zero.
  • Local DSQ depth: per-CPU dispatch queue depth.
  • Stall detection: rq_clock not advancing on a CPU with runnable tasks. Idle CPUs and preempted vCPUs are exempt. See Monitor: Stall detection for exemption details.
  • Event rates: scx fallback and keep-last event counters.

Monitor thresholds use a sustained sample window (default: 5 samples). A violation must persist for N consecutive samples before failing.

NUMA checks

When workers use a MemPolicy, ktstr collects NUMA page placement data and checks it against thresholds:

Page localityassert_page_locality() checks the fraction of pages residing on the expected NUMA node(s). Expected nodes are derived from the worker’s MemPolicy::node_set() at evaluation time. Page counts come from WorkerReport::numa_pages (parsed from /proc/self/numa_maps). Returns 1.0 (vacuously local) when no pages are observed. Fails if the observed fraction falls below min_page_locality.

Cross-node migrationassert_cross_node_migration() checks the ratio of migrated pages to total allocated pages. WorkerReport::vmstat_numa_pages_migrated provides the delta of the numa_pages_migrated counter from /proc/vmstat over the work loop. Fails if the ratio exceeds max_cross_node_migration_ratio.

Slow-tier ratiomax_slow_tier_ratio checks the fraction of pages on memory-only NUMA nodes (CXL tiers). Fails if more than the specified fraction of pages land on memory-only nodes.

None of these thresholds are set by default. Set via Assert setters or #[ktstr_test] attributes.
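For example, under the same assumed setter naming:

// Require 90% of pages on the expected node(s) and cap cross-node
// page migration at 5% of allocated pages.
let a = Assert::default_checks()
    .min_page_locality(0.9)
    .max_cross_node_migration_ratio(0.05);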

Assert struct

Assert is a composable configuration that carries both worker checks and monitor thresholds:

pub struct Assert {
    // Worker checks
    pub not_starved: Option<bool>,
    pub isolation: Option<bool>,
    pub max_gap_ms: Option<u64>,
    pub max_spread_pct: Option<f64>,

    // Throughput checks
    pub max_throughput_cv: Option<f64>,
    pub min_work_rate: Option<f64>,

    // Benchmarking checks
    pub max_p99_wake_latency_ns: Option<u64>,
    pub max_wake_latency_cv: Option<f64>,
    pub min_iteration_rate: Option<f64>,
    pub max_migration_ratio: Option<f64>,

    // Monitor checks
    pub max_imbalance_ratio: Option<f64>,
    pub max_local_dsq_depth: Option<u32>,
    pub fail_on_stall: Option<bool>,
    pub sustained_samples: Option<usize>,
    pub max_fallback_rate: Option<f64>,
    pub max_keep_last_rate: Option<f64>,

    // NUMA checks
    pub min_page_locality: Option<f64>,
    pub max_cross_node_migration_ratio: Option<f64>,
    pub max_slow_tier_ratio: Option<f64>,
}

Every field is Option. None means “inherit from parent layer.”

Merge layers

Checking uses a three-layer merge:

  1. Assert::default_checks() – baseline: not_starved enabled, monitor thresholds from MonitorThresholds::DEFAULT.
  2. Scheduler.assert – scheduler-level overrides.
  3. Per-test assert – test-specific overrides via #[ktstr_test] attributes.

All fields use last-Some-wins semantics. A Some(false) in a higher layer can disable a check that a lower layer enabled.

let final_assert = Assert::default_checks()
    .merge(&scheduler.assert)
    .merge(&test_assert);

Default thresholds

Worker checks

Check            Default (release)   Default (debug)
Scheduling gap   2000 ms             3000 ms
Fairness spread  15%                 35%

Debug builds run in small VMs with higher scheduling overhead, so thresholds are relaxed. Coverage-instrumented builds collect profraw data for code coverage analysis; all assertion and monitor threshold checks run normally.

Monitor checks

  • max_imbalance_ratio (default 4.0) – max(nr_running) / max(1, min(nr_running)) across CPUs (denominator clamped to 1 so an all-idle sample does not divide by zero). Lower values (2-3) false-positive during cpuset transitions.
  • max_local_dsq_depth (default 50) – per-CPU dispatch queue overflow. Sustained depth above this means the scheduler is not consuming dispatched tasks.
  • fail_on_stall (default true) – fail when rq_clock does not advance on a CPU with runnable tasks. Idle CPUs (NOHZ) and preempted vCPUs are exempt.
  • sustained_samples (default 5) – at ~100ms sample interval, requires ~500ms of sustained violation. Filters transient spikes from cpuset reconfiguration.
  • max_fallback_rate (default 200.0/s) – select_cpu_fallback events per second across all CPUs. Sustained rate indicates systematic select_cpu failure.
  • max_keep_last_rate (default 100.0/s) – dispatch_keep_last events per second across all CPUs. Sustained rate indicates dispatch starvation.

All monitor thresholds use the sustained_samples window – a violation must persist for N consecutive samples before failing.

Worker checks via Assert

Assert provides assert_cgroup() for running worker-side checks directly against collected reports:

let a = Assert::default_checks().max_gap_ms(5000);
let result = a.assert_cgroup(&reports, Some(&cpuset));

Use Assert for both the merge chain (#[ktstr_test] attributes, Scheduler.assert, execute_steps_with) and direct report checking.

Constants

  • Assert::NO_OVERRIDES – identity for merge; every field is None, so it overrides nothing. This is not “no checks” – when used as a per-test or per-scheduler assert, the runtime chain still applies defaults because it merges default_checks() -> scheduler -> test.
  • Assert::default_checks()not_starved enabled, monitor thresholds populated from MonitorThresholds::DEFAULT.

AssertResult

AssertResult carries pass/fail status, diagnostic messages, and aggregated statistics from a scenario run.

Construction

  • AssertResult::pass() – creates a passing result with empty details and default stats.
  • AssertResult::skip(reason) – creates a passing result with a skip reason in details and skipped = true. Used when a scenario cannot run under the current topology or flag combination but is not a failure.
  • AssertResult::fail(detail) – failing result carrying a single AssertDetail. Mirrors pass / skip for the failure axis.
  • AssertResult::fail_msg(msg) – shortcut for the common case where the failure is a plain diagnostic message tagged DetailKind::Other.

Mutation and inspection

  • result.note(msg) – append an informational annotation tagged DetailKind::Note. Does NOT flip passed or skipped — a note is context, not a verdict. Returns &mut Self so calls chain.
  • result.with_note(msg) – builder-style sibling of note that consumes and returns self. Use at the return site to chain a context annotation onto a fresh result without an intermediate let mut.
  • result.is_skipped() – convenience accessor returning skipped. Stats tooling uses this to subtract non-executions from pass counts.
  • result.is_failed() – convenience accessor returning !passed. Mirrors is_skipped so branches reading “did this claim fail?” don’t negate .passed inline.
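For example, the with_note builder from the list above at a return site:

use ktstr::prelude::*;

fn my_check() -> Result<AssertResult> {
    // Builder-style annotation; the note is context, not a verdict,
    // so `passed` stays true.
    Ok(AssertResult::pass().with_note("topology: 1 LLC, 2 cores"))
}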

Fields

  • passed: bool – whether all checks passed.
  • skipped: bool – distinguishes a passing result that ran every check from one that skipped execution (topology / flag mismatch, prerequisite absent). AssertResult::skip sets this; pass / fail / fail_msg leave it false.
  • details: Vec<AssertDetail> – structured diagnostic entries; each carries a kind: DetailKind (Other, Note, Skip, Temporal, …) plus a human-readable message: String. Consumers filter by kind for routing (failure vs informational note) and read message for display.
  • stats: ScenarioStats – aggregated worker telemetry across all cgroups (spread, gaps, migrations, wake latency, iterations).
  • measurements: BTreeMap<String, NoteValue> – structured per-test measurements keyed by name. Sidecar consumers and comparison tooling read this map directly without parsing details strings, so populate it (via Verdict::note_value during claim evaluation) for any value a downstream comparison needs to lift programmatically.

Merging

result.merge(other) combines two results. If other.passed is false, the merged result is also false. Details and stats are accumulated:

let mut combined = AssertResult::pass();
combined.merge(cgroup_0_result);
combined.merge(cgroup_1_result);
// combined.passed is false if either cgroup failed
// combined.details contains messages from both

Stats merging takes worst values across cgroups for spread, gap, wake latency, and migration ratio. Counters (total_workers, total_cpus, total_migrations, total_iterations) are summed.

For examples of overriding thresholds at the scheduler and per-test level, see Customize Checking.

Ops and Steps

The ops system is a composable way to express dynamic cgroup topology changes. It replaces hand-written Action::Custom functions for most dynamic scenarios.

Op

An Op is an atomic operation on the cgroup topology. The enum is #[non_exhaustive], so external pattern matches must end with .. to stay compatible across ktstr version bumps that add new variants:

  • AddCgroup – create a cgroup.
  • RemoveCgroup – stop workers and remove a cgroup.
  • SetCpuset – set a cgroup’s cpuset via CpusetSpec.
  • ClearCpuset – remove cpuset constraints.
  • SwapCpusets – swap cpusets between two cgroups.
  • Spawn – fork workers into a cgroup.
  • StopCgroup – stop a cgroup’s workers.
  • SetAffinity – set worker affinity via AffinityIntent.
  • SpawnHost – spawn workers in the parent cgroup.
  • MoveAllTasks – move all tasks from one cgroup to another.
  • RunPayload – spawn a binary-kind Payload in the background and track its PayloadHandle under the step’s payload set. Subsequent WaitPayload / KillPayload address it by (payload.name, cgroup). Scheduler-kind payloads are rejected at apply time.
  • WaitPayload – block until the named payload exits naturally, evaluate its checks, and record metrics to the per-test sidecar. Target lookup is by (name, cgroup) composite key; cgroup: None resolves to the unique live copy. No timeout — pair with a bounded HoldSpec or the payload’s own --runtime for time-boxed runs.
  • KillPayload – SIGKILL the named payload, reap the child, evaluate checks, and record metrics. Same (name, cgroup) lookup rules as WaitPayload. Mirrors step-teardown drain for an explicitly-targeted payload.
  • FreezeCgroup – freeze every task in the named cgroup via cgroup.freeze (kernel-side asynchronous freeze; not a SIGSTOP). Idempotent for already-frozen cgroups. Pair with UnfreezeCgroup to release; teardown auto-unfreezes. See Snapshots for the observer-cgroup deadlock warning.
  • UnfreezeCgroup – unfreeze every task in the named cgroup via cgroup.freeze. Inverse of FreezeCgroup. Idempotent.
  • Snapshot – capture a host-side diagnostic snapshot under name via the freeze coordinator: pauses every vCPU, reads BPF map state, vCPU registers, and per-CPU counters into a FailureDumpReport, then resumes. The report is keyed by name on the active SnapshotBridge. No active bridge is a no-op with tracing::warn!. See Snapshots.
  • WatchSnapshot – capture a snapshot whenever the guest writes to the named kernel symbol; one fire = one capture tagged with the symbol path. Symbol resolution at op execution time looks the name up by verbatim vmlinux ELF symbol-table match — the requested name must appear in the guest kernel’s static symbol table exactly as written (no path expansion, no BTF descent). Maximum 3 watch ops per scenario (3 hardware watchpoint slots; 1 slot reserved for the error-class exit_kind trigger). See Watch Snapshots.

Op constructors accept string literals directly (no .into() needed):

Op::add_cgroup("cg_0")
Op::set_cpuset("cg_0", CpusetSpec::disjoint(0, 2))
Op::stop_cgroup("cg_0")
Op::spawn("cg_0", WorkSpec::default().workers(4))
Op::set_affinity("cg_0", AffinityIntent::RandomSubset)
Op::spawn_host(WorkSpec::default().workers(4))
Op::freeze_cgroup("cg_0")
Op::unfreeze_cgroup("cg_0")
Op::snapshot("after_spawn")
Op::watch_snapshot("jiffies_64")

SpawnHost creates workers in the parent cgroup, not in a managed cgroup. Use this to simulate host-level CPU contention alongside managed cgroups.

OpKind

OpKind is a payload-free discriminant enum generated from Op via #[strum_discriminants]. It carries the same variant set as Op (AddCgroup, RemoveCgroup, …, RunPayload, WaitPayload, KillPayload, FreezeCgroup, UnfreezeCgroup, Snapshot, WatchSnapshot) with none of the inner fields, so it is cheap to copy and use as a map key. Framework code uses OpKind when it only cares WHICH operation ran (per-op statistics, stimulus-event tagging, verifier/monitor bookkeeping) without the payload. Test authors rarely spell OpKind directly — the strum::EnumIter derive also lets tooling enumerate every OpKind variant for coverage checks.

OpKind shares Op’s #[non_exhaustive] attribute: external pattern matches over OpKind must end with ...

CpusetSpec

CpusetSpec computes a cpuset from the topology at runtime. The enum is #[non_exhaustive], so external callers should construct via the associated constructor functions (see the list below this snippet) rather than naming variant literals — a future field addition (e.g. a stride on Range) can land behind a defaulted parameter without breaking call sites. Pattern matches over CpusetSpec must also end with ..:

pub enum CpusetSpec {
    Llc(usize),                          // All CPUs in an LLC
    Numa(usize),                         // All CPUs in a NUMA node
    Range { start_frac: f64, end_frac: f64 }, // Fraction of usable CPUs
    Disjoint { index: usize, of: usize },     // Equal disjoint partitions
    Overlap { index: usize, of: usize, frac: f64 }, // Overlapping partitions
    Exact(BTreeSet<usize>),              // Exact CPU set
}

Convenience constructors accept parameters directly: CpusetSpec::disjoint(0, 2), CpusetSpec::range(0.0, 0.5), CpusetSpec::exact([0, 1, 2]), CpusetSpec::llc(0), CpusetSpec::numa(0), CpusetSpec::overlap(0, 2, 0.5).

All fractional specs operate on usable_cpus().

CgroupDef

CgroupDef bundles three ops that always go together: create cgroup, set cpuset, spawn workers. It is the primary way to define cgroups in ops-based scenarios.

let def = CgroupDef::named("cg_0")
    .with_cpuset(CpusetSpec::disjoint(0, 2))
    .workers(4)
    .work_type(WorkType::SpinWait);

Builder methods

  • .with_cpuset(CpusetSpec) – set the cpuset (CPU set the cgroup is pinned to).
  • .with_cpuset_mems(BTreeSet<usize>) – explicit cpuset.mems override (default derives from the resolved cpuset’s NUMA nodes).
  • .workers(n) – set worker count.
  • .work_type(WorkType) – set work type (default: SpinWait).
  • .sched_policy(SchedPolicy) – set Linux scheduling policy (default: Normal). See WorkSpec Types.
  • .work(WorkSpec) – add a work group (multiple calls for concurrent groups).
  • .workload(&'static Payload) – attach a binary workload payload to run alongside the worker group; the framework launches it as a child process inside the cgroup. Panics when called with a scheduler-kind Payload (PayloadKind::Scheduler(_)); the scheduler slot is #[ktstr_test(scheduler = ...)] at the test level, not the cgroup-level workload slot. Step-level Op::RunPayload rejects scheduler-kind payloads with an anyhow::Error instead of panicking; the build-time workload call panics because there is no scenario-level recovery path.
  • .affinity(AffinityIntent) – set per-worker affinity (default: Inherit).
  • .mem_policy(MemPolicy) – set NUMA memory placement policy (default: Default). See MemPolicy.
  • .mpol_flags(MpolFlags) – set mode flags for set_mempolicy(2) (default: NONE). See MemPolicy.
  • .nice(n) – cgroup-level default per-worker nice value, merged into every WorkSpec whose own nice is unset. See Tutorial: Step 11.
  • .comm(name) – cgroup-level default per-worker task->comm via prctl(PR_SET_NAME). Merged into every WorkSpec whose own comm is unset.
  • .pcomm(name) – thread-group-leader task->comm for the fork-then-thread spawn path (workers run as threads under one forked leader). Stamps every existing WorkSpec in-place; not order-independent with .work(...).
  • .uid(uid) / .gid(gid) – cgroup-level default per-worker effective UID / GID via setresuid / setresgid. Merged into every WorkSpec whose own uid / gid is unset.
  • .numa_node(node) – cgroup-level default NUMA-node affinity for every WorkSpec. Merged at apply-setup time.
  • .swappable(bool) – opt into gauntlet work type override.

Cgroup controllers

The cgroup-v2 cpu / memory / io / pids controllers are exposed as typed setters (default: unconstrained):

  • .cpu_quota_pct(pct) / .cpu_quota(quota, period) / .cpu_unlimited() – write cpu.max (pct is shorthand: 100 = one full CPU). cpu_unlimited resets to the kernel default.
  • .cpu_weight(weight) – write cpu.weight (1..=10000, default 100).
  • .memory_max(bytes) / .memory_high(bytes) / .memory_low(bytes) / .memory_unlimited() – write memory.max / memory.high / memory.low. memory_unlimited resets memory.max to max.
  • .memory_swap_max(bytes) / .memory_swap_unlimited() – write memory.swap.max.
  • .io_weight(weight) – write io.weight (1..=10000, default 100).
  • .pids_max(n) / .pids_unlimited() – write pids.max.
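
For instance, a def that grants two full CPUs of quota, doubles the CPU weight, and caps memory and task count (values illustrative):

let def = CgroupDef::named("cg_capped")
    .workers(4)
    .cpu_quota_pct(200)            // cpu.max: two full CPUs
    .cpu_weight(200)               // cpu.weight: double the default 100
    .memory_max(512 * 1024 * 1024) // memory.max: 512 MiB
    .pids_max(64);                 // pids.max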

MemPolicy-cpuset validation

When a cgroup has a cpuset, ktstr validates that the MemPolicy’s node set is covered by the NUMA nodes reachable from that cpuset. A MemPolicy::Bind([1]) on a cgroup whose cpuset covers only NUMA node 0 fails at setup time. Policies without a node set (Default, Local) skip validation.

WorkSpec type overrides and swappable

CgroupDef has a swappable flag (default: false). When true and a work type override is active (Ctx.work_type_override), the override replaces this def’s work type.

In contrast, the Scenario-level override (in run_scenario()) only replaces SpinWait work types. The two mechanisms serve different scopes:

  • Scenario-level: replaces SpinWait in WorkSpec.work_type
  • CgroupDef-level: replaces the work type when swappable = true

Both mechanisms skip the override when the replacement is a grouped work type and num_workers is not divisible by that work type’s group size.

WorkSpec type overrides apply only to CgroupDef setup, not to raw Op::Spawn. Op::Spawn always uses the work type as given. Use CgroupDef with .swappable(true) when the work type should participate in gauntlet overrides.

Step

A Step is a sequence of ops with a hold period:

pub struct Step {
    pub setup: Setup,   // CgroupDefs to create after ops
    pub ops: Vec<Op>,   // Operations to apply
    pub hold: HoldSpec, // How long to wait after
}

Setup is either Defs(Vec<CgroupDef>) or Factory(fn(&Ctx) -> Vec<CgroupDef>). Vec<CgroupDef> implements Into<Setup>, so you can write setup: vec![...].into() instead of setup: Setup::Defs(vec![...]).

Constructors

Step::new(ops, hold) – creates a step with ops only (no CgroupDef setup). Use when the step only applies dynamic operations to an existing topology.

Step::with_defs(defs, hold) – creates a step with CgroupDef setup and a hold period. The primary constructor for steps that create cgroups with workers.

Step::set_ops(self, ops) – REPLACES the ops on a step (builder method). Chain after with_defs to add dynamic operations to a step that also creates cgroups.

Naming asymmetry: Step::set_ops REPLACES; the sibling Backdrop::with_ops APPENDS. The two methods deliberately use different verbs to signal the different semantics. A Step::new(ops, hold).set_ops(more) chain produces a step whose ops vec is exactly more (the original ops is dropped); a Backdrop::new().with_ops(ops_a).with_ops(ops_b) chain produces a backdrop whose ops vec is ops_a + ops_b. To extend a step’s ops vec, build the combined Vec<Op> at the call site and pass it to set_ops, or compose at the Backdrop layer instead.

HoldSpec

How long to hold after a step completes:

  • Frac(f64) – fraction of the total scenario duration.
  • Fixed(Duration) – fixed time.
  • Loop { interval } – repeat ops at interval until time runs out.

HoldSpec::FULL is a constant for Frac(1.0) (hold for the full scenario duration).

execute_defs

execute_defs(ctx, defs) is a convenience wrapper for the common pattern of creating cgroups and running them for the full duration:

execute_defs(ctx, vec![
    CgroupDef::named("cg_0").workers(4),
    CgroupDef::named("cg_1").workers(4),
])

Equivalent to execute_steps(ctx, vec![Step::with_defs(defs, HoldSpec::FULL)]).

execute_steps

execute_steps(ctx, steps) runs a step sequence:

  1. For each step: apply ops, then apply setup (create cgroups from CgroupDefs), hold for the specified duration. Ops run first so parent cgroups can be created before children are spawned. Loop steps reverse this: setup runs once before the loop, then ops repeat at the specified interval.
  2. Check scheduler liveness between steps.
  3. After all steps: collect worker reports and run checks.
  4. Write stimulus events to the SHM ring buffer for timeline analysis.
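
A two-phase sketch tying Step, HoldSpec, and execute_steps together. Values are illustrative, and it assumes with_defs accepts the Vec<CgroupDef> directly via the Into<Setup> conversion noted above:

use ktstr::prelude::*;

fn two_phase(ctx: &Ctx) -> Result<AssertResult> {
    let steps = vec![
        // Phase 1: cg_0 alone for the first half of the scenario.
        Step::with_defs(vec![CgroupDef::named("cg_0").workers(2)], HoldSpec::Frac(0.5)),
        // Phase 2: cg_1 joins for the second half.
        Step::with_defs(vec![CgroupDef::named("cg_1").workers(2)], HoldSpec::Frac(0.5)),
    ];
    execute_steps(ctx, steps)
}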

execute_steps_with

execute_steps_with(ctx, steps, assertions) is the same as execute_steps but accepts an explicit Assert for worker checks. execute_steps is a convenience wrapper that passes None.

use ktstr::prelude::*;

fn my_scenario(ctx: &Ctx) -> Result<AssertResult> {
    let assertions = Assert::NO_OVERRIDES
        .check_not_starved()
        .max_gap_ms(3000);

    let steps = vec![/* ... */];
    execute_steps_with(ctx, steps, Some(&assertions))
}

When assertions is Some, the provided Assert overrides ctx.assert for worker checks. When None, uses ctx.assert (the merged three-layer config: default_checks -> scheduler -> per-test).

TestTopology

TestTopology provides CPU topology information for test configuration. It discovers CPUs, last-level caches (LLCs), and NUMA nodes, and generates cpuset partitions for scenarios.

CPU topology hierarchy

ktstr models four levels of CPU topology, from largest to smallest:

  • NUMA node – a memory-proximity domain. Each node is a group of CPUs with fast access to a local memory bank. Cross-node memory access is slower.
  • LLC (last-level cache) – the largest cache shared by a group of cores. LLCs are the key scheduling boundary: tasks sharing an LLC benefit from shared cache lines.
  • Core – a physical execution unit with its own pipeline and L1/L2 caches.
  • Thread – an SMT (simultaneous multithreading) sibling. Multiple threads share a single core’s execution resources.

Containment: threads belong to a core, cores belong to an LLC, LLCs belong to a NUMA node. For example, 2n4l4c2t describes 2 NUMA nodes, each with 2 LLCs, each LLC with 4 cores, each core with 2 threads = 32 CPUs total.

Most tests use a single NUMA node (the default). NUMA matters when a scheduler makes placement decisions based on memory locality. Single-NUMA topologies (numa_nodes = 1) test scheduling without memory-locality effects. Multi-NUMA topologies test whether a scheduler keeps tasks close to their memory. See the gauntlet NUMA presets for multi-NUMA configurations.

use ktstr::prelude::*;

pub struct TestTopology {
    // private fields — use the accessors below
}

Construction

from_system() -> Result<Self> – reads sysfs (/sys/devices/system/cpu/) to discover the live topology. Reads LLC IDs, NUMA node IDs, core IDs, and cache sizes for each online CPU. Also scans /sys/devices/system/node/ to discover memory-only nodes (CXL), reads per-node meminfo and inter-node distances.

from_vm_topology(topo: &Topology) -> Self – builds a topology from a VM spec. Topology fields are big-to-little: NUMA nodes, last-level caches, cores per LLC, threads per core. Multiple LLCs can share a NUMA node when numa_nodes < llcs; llcs must be an exact multiple of numa_nodes so LLCs partition evenly across nodes (the declare_scheduler! macro rejects violations at compile time; runtime callers inside ktstr uphold the same invariant). CPUs are numbered sequentially. Used as a fallback when sysfs is incomplete inside a guest VM. For the memory-aware variant, see from_vm_topology_with_memory.

synthetic(num_cpus, num_llcs) -> Self (test-only) – creates a topology with evenly distributed CPUs across LLCs. Used in unit tests.

Topology queries

total_cpus() – total number of CPUs.

num_llcs() – number of last-level caches.

num_numa_nodes() – number of NUMA nodes.

all_cpus() -> &[usize] – all CPU IDs, sorted.

all_cpuset() -> BTreeSet<usize> – all CPU IDs as a set.

usable_cpus() -> &[usize] – CPUs available for workload placement. Reserves the last CPU for the root cgroup (cgroup 0) when the topology has more than 2 CPUs. On 8 CPUs: usable = 0-6, CPU 7 reserved. Most built-in scenarios and CgroupDef cpuset specs operate on usable_cpus() automatically; test authors rarely need to query it directly.

usable_cpuset() -> BTreeSet<usize> – usable CPUs as a set.

llcs() -> &[LlcInfo] – all LLC domains with their CPUs, NUMA node, cache size, and core map.

cpus_in_llc(idx) -> &[usize] – CPUs belonging to LLC at index.

llc_aligned_cpuset(idx) -> BTreeSet<usize> – CPUs in LLC as a set.

numa_aligned_cpuset(node) -> BTreeSet<usize> – CPUs in all LLCs belonging to NUMA node node. Filters LLCs by numa_node() == node and collects their CPUs.

numa_node_ids() -> &BTreeSet<usize> – NUMA node IDs as a BTreeSet.

numa_nodes_for_cpuset(cpus) -> BTreeSet<usize> – NUMA nodes covered by the given CPU set. Returns the set of NUMA nodes that contain at least one LLC with a CPU in the given set.

node_meminfo(node_id) -> Option<&NodeMemInfo> – per-node memory info (total and free KiB). Returns None when the node ID is not present or meminfo is unavailable. NodeMemInfo has total_kb, free_kb, and used_kb() (saturating subtraction).

numa_distance(from, to) -> u8 – inter-node NUMA distance. Returns 255 when either node ID is not present (matches the kernel’s unreachable distance). For from_vm_topology() topologies without explicit distances, returns 10 for local and 20 for remote.

is_memory_only(node_id) -> bool – whether the node is memory-only (has RAM but no CPUs). Typical for CXL-attached memory tiers.
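
A few illustrative queries from a test body, assuming a Ctx in scope (ctx.topo is the &TestTopology):

let topo = ctx.topo;

// Size summary.
println!("{} CPUs, {} LLCs, {} NUMA nodes",
    topo.total_cpus(), topo.num_llcs(), topo.num_numa_nodes());

// CPUs in the first LLC, as a set suitable for cpuset writes.
let llc0 = topo.llc_aligned_cpuset(0);

// Inter-node distance: 10 = local, 255 = node not present.
if topo.num_numa_nodes() > 1 {
    let d = topo.numa_distance(0, 1);
    assert!(d >= 10 && d < 255);
}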

Construction from VM topology

from_vm_topology(topo) -> Self – build a TestTopology from a Topology (the VMM’s topology spec). Populates LLCs, NUMA nodes, distances, per-node memory info, and memory-only node flags.

from_vm_topology_with_memory(topo, total_memory_mb) -> Self – same as from_vm_topology but accepts an optional total memory size for uniform topologies. When Some, divides memory evenly across nodes to populate NodeMemInfo. When None, memory info is omitted.

Cpuset generation

split_by_llc() -> Vec<BTreeSet<usize>> – one set of CPUs per LLC.

overlapping_cpusets(n, overlap_frac) -> Vec<BTreeSet<usize>> – generates n cpusets with overlap_frac overlap between adjacent sets. Available to scenarios that want to hand-build overlapping cpusets (e.g. via CpusetSpec::Exact); CpusetSpec::Overlap computes its slice inline rather than calling this helper.

cpuset_string(cpus) -> String – formats a CPU set as a compact range string (e.g. "0-3,5,7-9"). Used when writing cpuset.cpus.

LlcInfo

Each LLC domain is represented by an LlcInfo:

pub struct LlcInfo {
    cpus: Vec<usize>,
    numa_node: usize,
    cache_size_kb: Option<u64>,
    cores: BTreeMap<usize, Vec<usize>>, // core_id -> SMT siblings
}

Accessors: cpus(), numa_node(), cache_size_kb(), cores(), num_cores().

num_cores() returns the number of physical cores (from the core map), or falls back to cpus.len() if no core map is populated (synthetic topologies).

How scenarios use topology

TestTopology is available to scenarios via Ctx.topo. The CpusetSpec variants use topology methods to resolve a cgroup’s cpuset:

  • Llc(idx) – llc_aligned_cpuset(idx)
  • Numa(node) – numa_aligned_cpuset(node)
  • Range { start_frac, end_frac } – usable_cpus() sliced by fraction
  • Disjoint { index, of } – usable_cpus() partitioned into of equal sets
  • Overlap { index, of, frac } – usable_cpus() partitioned with neighbor overlap
  • Exact(set) – no topology resolution (caller-supplied set)

Llc confines a cgroup to a single LLC’s CPUs; Numa spans all LLCs in a NUMA node. The fraction- and partition-style variants operate on the usable-CPUs pool the host reservation has granted.

See Ops and Steps for the full CpusetSpec enum.

CPU list parsing

Two standalone functions parse CPU list strings:

parse_cpu_list(s) -> Result<Vec<usize>> – strict parsing of "0-3,5,7-9" format. Returns an error on invalid entries.

parse_cpu_list_lenient(s) -> Vec<usize> – lenient parsing that silently skips invalid entries.
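
Sketched behavior, per the signatures above:

let strict = parse_cpu_list("0-3,5,7-9")?;       // [0, 1, 2, 3, 5, 7, 8, 9]
let lenient = parse_cpu_list_lenient("0-3,x,5"); // [0, 1, 2, 3, 5] — "x" skipped
assert!(parse_cpu_list("0-3,x").is_err());       // strict parsing rejects "x"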

See also: CgroupManager for set_cpuset() which consumes cpuset strings, CgroupGroup for RAII cgroup management, WorkloadHandle for worker lifecycle, Scenarios for how CpusetSpec drives cpuset partitioning.

Host-side reservation

TestTopology::numa_distance is also consumed at host-side --cpu-cap plan time: acquire_llc_plan uses it to order the spill from the seed NUMA node to nearest-by-distance neighbors when the CPU budget cannot fit within a single node. The resulting plan is a LOCK_SH reservation on every selected LLC (flock granularity stays per-LLC even when the last LLC is partial-taken for the CPU budget) with a cpuset.mems union written to a cgroup v2 sandbox. See Resource Budget for the full pipeline.

MemPolicy

MemPolicy controls NUMA memory placement for worker processes. It wraps set_mempolicy(2) and is applied after fork, before the work loop starts.

pub enum MemPolicy {
    Default,
    Bind(BTreeSet<usize>),
    Preferred(usize),
    Interleave(BTreeSet<usize>),
    Local,
    PreferredMany(BTreeSet<usize>),
    WeightedInterleave(BTreeSet<usize>),
}

Variants

Default – inherit the parent process’s memory policy. No set_mempolicy syscall is made.

Bind(nodes) – allocate only from the specified NUMA nodes (MPOL_BIND). Allocation fails with ENOMEM if all specified nodes are exhausted.

Preferred(node) – prefer allocations from the specified node, falling back to others when the preferred node is full (MPOL_PREFERRED).

Interleave(nodes) – interleave allocations round-robin across the specified nodes (MPOL_INTERLEAVE).

Local – prefer the nearest node to the CPU where the allocation occurs (MPOL_LOCAL). No nodemask.

PreferredMany(nodes) – prefer allocations from any of the specified nodes, falling back to others when all preferred nodes are full (MPOL_PREFERRED_MANY, kernel 5.15+).

WeightedInterleave(nodes) – weighted interleave across the specified nodes. Page distribution is proportional to per-node weights set via /sys/kernel/mm/mempolicy/weighted_interleave/nodeN (MPOL_WEIGHTED_INTERLEAVE, kernel 6.9+).

Convenience constructors

MemPolicy::bind([0, 1])
MemPolicy::preferred(0)
MemPolicy::interleave([0, 1])
MemPolicy::preferred_many([0, 1])
MemPolicy::weighted_interleave([0, 1])

Node-set constructors (bind, interleave, preferred_many, weighted_interleave) accept any IntoIterator<Item = usize> – arrays, ranges, Vec, BTreeSet. preferred takes a single usize node ID.

MpolFlags

MpolFlags provides optional mode flags OR’d into the set_mempolicy(2) mode argument:

  • NONE (0) – no flags.
  • STATIC_NODES (1 << 15) – nodemask is absolute, not remapped when the task’s cpuset changes.
  • RELATIVE_NODES (1 << 14) – nodemask is relative to the task’s current cpuset.
  • NUMA_BALANCING (1 << 13) – enable NUMA balancing optimization for this policy.

Flags combine with | or MpolFlags::union():

let flags = MpolFlags::STATIC_NODES | MpolFlags::NUMA_BALANCING;

Usage in WorkSpec and CgroupDef

WorkSpec and CgroupDef both expose .mem_policy() and .mpol_flags() builder methods:

use ktstr::prelude::*;

let w = WorkSpec::default()
    .workers(4)
    .mem_policy(MemPolicy::bind([0]))
    .mpol_flags(MpolFlags::STATIC_NODES);

let def = CgroupDef::named("cg_0")
    .with_cpuset(CpusetSpec::numa(0))
    .workers(4)
    .mem_policy(MemPolicy::bind([0]));

Cpuset validation

When a cgroup has a cpuset, ktstr validates that the MemPolicy’s node set is covered by the NUMA nodes reachable from that cpuset. A MemPolicy::Bind([1]) on a cgroup whose cpuset covers only NUMA node 0 will fail with an error at setup time.

Policies without a node set (Default, Local) skip validation.

node_set()

MemPolicy::node_set() returns the NUMA node IDs referenced by the policy. Returns the node set for Bind, Interleave, PreferredMany, and WeightedInterleave; a single-element set for Preferred; and an empty set for Default/Local.
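
A sketch of that contract, assuming node_set() returns a BTreeSet<usize>:

use std::collections::BTreeSet;

assert_eq!(MemPolicy::bind([0, 1]).node_set(), BTreeSet::from([0, 1]));
assert_eq!(MemPolicy::preferred(2).node_set(), BTreeSet::from([2]));
assert!(MemPolicy::Local.node_set().is_empty()); // Default/Local: empty set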

NUMA checking

Page locality and migration results from workers using MemPolicy are checked by the NUMA checking assertions. The expected node set for locality checks is derived from the worker’s MemPolicy at evaluation time.

Example: NUMA-aware test

A complete test that checks page locality across two NUMA nodes:

use ktstr::prelude::*;

#[ktstr_test(
    numa_nodes = 2, llcs = 4, cores = 4, threads = 1,
    min_numa_nodes = 2,
    min_page_locality = 0.8,
)]
fn numa_locality(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("node0")
            .with_cpuset(CpusetSpec::numa(0))
            .workers(4)
            .mem_policy(MemPolicy::bind([0])),
        CgroupDef::named("node1")
            .with_cpuset(CpusetSpec::numa(1))
            .workers(4)
            .mem_policy(MemPolicy::bind([1])),
    ])
}

Each cgroup’s workers are pinned to a single NUMA node’s CPUs via CpusetSpec::numa() and their memory allocations are bound to the same node via MemPolicy::bind(). The min_page_locality threshold fails the test if less than 80% of pages land on the expected node.

Performance Mode

Performance mode reduces noise during VM execution by applying host-side isolation (vCPU pinning, hugepages, NUMA mbind, RT scheduling) and, on x86_64, a guest-visible CPUID hint (KVM_HINTS_REALTIME) plus KVM exit suppression (PAUSE and HLT VM exits disabled).

What it does

On x86_64, seven optimizations are applied when performance_mode is enabled (six host-side, one guest-visible via CPUID). On aarch64, four of these apply (vCPU pinning, hugepages, NUMA mbind, RT scheduling); the x86-specific items (PAUSE/HLT exit disabling, KVM_HINTS_REALTIME CPUID, halt-poll handling) are not available. One related host capability, KVM_CAP_HALT_POLL, is deliberately not set on x86_64 — the guest haltpoll cpuidle driver disables host-side polling via MSR_KVM_POLL_CONTROL (see the last item below):

vCPU pinning – each virtual LLC is mapped to a physical LLC group on the host. vCPU threads are pinned to cores within their assigned LLC via sched_setaffinity. This prevents the host scheduler from migrating vCPU threads across LLCs, which would add cache thrashing noise to measurements.

Hugepages – guest memory is allocated with 2MB hugepages (MAP_HUGETLB) when sufficient free hugepages exist. This eliminates TLB pressure from host-side page walks during guest execution.

NUMA mbind – guest memory is bound to the NUMA node(s) of the pinned vCPUs via mbind(MPOL_BIND). This ensures memory allocations are local to the CPUs executing vCPU threads, avoiding cross-node memory access latency.

RT scheduling – vCPU threads are set to SCHED_FIFO priority 1. The watchdog and monitor threads run at priority 2 on a dedicated host CPU not assigned to any vCPU, so they can preempt for timeout/sampling without competing for vCPU cores. The serial console mutex uses PTHREAD_PRIO_INHERIT to avoid priority inversion between RT vCPU threads and service threads.

Disable PAUSE VM exits (x86_64 only) – KVM_CAP_X86_DISABLE_EXITS with KVM_X86_DISABLE_EXITS_PAUSE suppresses VM exits on PAUSE instructions. Guest spinlocks execute PAUSE in tight loops; each PAUSE normally causes a vmexit so the hypervisor can schedule other vCPUs. With dedicated cores (vCPU pinning), this reschedule is unnecessary overhead. The capability is optional – if unsupported, a warning is logged and the VM proceeds without it.

Disable HLT VM exits (x86_64 only) – KVM_X86_DISABLE_EXITS_HLT suppresses VM exits on HLT instructions, the most frequent exit type during boot and idle. BSP shutdown detection uses I8042 reset (port 0x64, value 0xFE via reboot=k) and VcpuExit::Shutdown instead of VcpuExit::Hlt. KVM blocks HLT disable when mitigate_smt_rsb is active (host has X86_BUG_SMT_RSB and cpu_smt_possible()); in that case, only PAUSE exits are disabled.

KVM_HINTS_REALTIME CPUID (x86_64 only) – sets bit 0 of CPUID leaf 0x40000001 EDX, telling the guest kernel that vCPUs are pinned to dedicated host cores. The guest disables PV spinlocks, PV TLB flush, and PV sched_yield (all add hypercall overhead unnecessary on dedicated cores), and enables haltpoll cpuidle (polls briefly before halting, reducing wakeup latency). PV spinlocks require CONFIG_PARAVIRT_SPINLOCKS, which is not in ktstr.kconfig, so that disable is a no-op for ktstr guests.

Skip host-side halt poll (x86_64 only) – when a guest vCPU halts (executes HLT with nothing to do), KVM can busy-wait briefly on the host before putting the vCPU thread to sleep, reducing wakeup latency at the cost of host CPU time. KVM_CAP_HALT_POLL controls this per-VM ceiling. In performance mode it is not set because the guest haltpoll cpuidle driver (enabled by KVM_HINTS_REALTIME above) handles polling inside the guest and writes MSR_KVM_POLL_CONTROL=0 to disable host-side polling via kvm_arch_no_poll(). Non-performance-mode VMs set KVM_CAP_HALT_POLL to 200µs (matching the x86 kernel default), or 0 when vCPUs exceed host CPUs.

Prerequisites

Sufficient host CPUs – the host must have at least (llcs * cores_per_llc * threads_per_core) + 1 online CPUs. The extra CPU is reserved for service threads (monitor, watchdog) so they do not share a core with any RT vCPU. The host must also have at least as many LLC groups as virtual LLCs.

2MB hugepages (optional) – the host must have free 2MB hugepages (check /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages). Without them, guest memory uses regular pages. A warning is printed.

CAP_SYS_NICE or rtprio limit (optional) – SCHED_FIFO requires either CAP_SYS_NICE (root) or an RLIMIT_RTPRIO >= the requested priority. Set an rtprio limit for non-root use:

# /etc/security/limits.conf
username  -  rtprio  99

Log out and back in for the limit to take effect. Without either capability, RT scheduling is skipped with a warning and vCPU threads run at normal priority (results may be noisy).

Validation

validate_performance_mode() runs during VM build and applies two levels of checks:

Errors (fatal):

  • Total vCPUs + 1 service CPU exceed available host CPUs.
  • Virtual LLCs exceed available LLC groups.
  • Pinning plan cannot be satisfied (an LLC group has fewer available CPUs than the virtual LLC requires).
  • No free host CPU for service threads after vCPU assignment.

Warnings (non-fatal):

  • Insufficient free hugepages – regular page allocation is used.
  • Host load is high – procs_running from /proc/stat exceeds half the vCPU count, results may be noisy.
  • TSC not stable (x86_64 only, checked at VM creation time) – KVM_CLOCK_TSC_STABLE not set after KVM_GET_CLOCK, kvmclock falls back to per-vCPU timekeeping. Timing measurements may have higher variance. Common in nested virtualization.

Usage

In #[ktstr_test]:

#[ktstr_test(
    llcs = 2,
    cores = 4,
    threads = 2,
    performance_mode = true,
)]
fn my_perf_test(ctx: &Ctx) -> Result<AssertResult> {
    // vCPUs are pinned, hugepage-backed
    Ok(AssertResult::pass())
}

Via the builder API:

let vm = vmm::KtstrVm::builder()
    .kernel(&kernel_path)
    .init_binary(&ktstr_binary)
    .topology(1, 2, 4, 2)
    .memory_mb(4096)
    .performance_mode(true)
    .build()?
    .run()?;

When to use

Performance mode is for tests where host-side scheduling noise affects results – fairness spread measurements, scheduling gap detection, imbalance ratio checks. It is not needed for correctness tests (cpuset isolation, starvation detection) where pass/fail is binary.

The gauntlet runs many VMs in parallel. Performance mode on parallel VMs can oversubscribe the host if scheduled naively. Avoid performance_mode unless the host has enough CPUs for the topology matrix.

Two dimensions

Performance mode serves two purposes:

Noise reduction – pinning, hugepages, NUMA mbind, and RT scheduling reduce measurement variance on both architectures. On x86_64, PAUSE and HLT VM exit disabling, the KVM_HINTS_REALTIME CPUID hint, and skipping host-side halt poll further reduce noise. Scheduling gaps, spread, and throughput checks become meaningful because host jitter is controlled. Without performance mode, a 50ms gap could be host noise; with it, the same gap indicates a scheduler problem.

Performance assertions – with stable measurements, tests can set tight thresholds (max_gap_ms, min_iteration_rate, max_p99_wake_latency_ns) to detect scheduling regressions. A test using execute_steps_with can pass custom Assert checks that are evaluated inside the guest against worker telemetry. These thresholds are only meaningful under performance mode’s controlled environment.

Nextest parallelism

Performance-mode tests each consume one LLC group on the host. The vm-perf test group in .config/nextest.toml sets a static max-threads limit. The flock-based LLC slot reservation (acquire_resource_locks) handles runtime contention: if all LLC slots are busy, the test returns ResourceContention.

On contention, the test returns exit code 0 (skip) – it never ran. The SKIP: prefix in stderr distinguishes skips from real passes.

LLC exclusivity validation

When performance_mode is enabled, the build step validates LLC exclusivity: each virtual LLC must reserve the entire physical LLC group it maps to. The validation sums the actual CPU count of each LLC group and checks the total (plus service CPU) fits within the host’s online CPUs. If validation fails, the build returns an error (tests skip with ResourceContention).

Three-way mode tier

ktstr’s host-side resource coordination has three effective tiers, selected by the combination of performance_mode, --no-perf-mode/KTSTR_NO_PERF_MODE, and --cpu-cap/KTSTR_CPU_CAP:

Tier 1: performance mode (full isolation)

Enabled when performance_mode=true is set on the VM builder (or via #[ktstr_test(performance_mode = true)]). Acquires LOCK_EX on each selected LLC’s /tmp/ktstr-llc-{N}.lock — the LLC-level exclusive lock already covers every CPU in the group, so per-CPU /tmp/ktstr-cpu-{C}.lock files are NOT touched (try_acquire_all in vmm/host_topology.rs short-circuits the per-CPU loop when LlcLockMode == Exclusive). Applies every isolation feature listed under “What it does”: vCPU pinning via sched_setaffinity, 2 MB hugepages, NUMA mbind, RT SCHED_FIFO scheduling, and (x86_64) PAUSE/HLT exit suppression + KVM_HINTS_REALTIME CPUID.

Tier 2: no-perf-mode with CPU-cap reservation

Enabled by --no-perf-mode / KTSTR_NO_PERF_MODE=1. Every no-perf-mode VM goes through acquire_llc_plan: the reservation is LOCK_SH across a NUMA-aware, consolidation-aware set of LLCs, sized to meet the CPU budget — either --cpu-cap N (or KTSTR_CPU_CAP=N) if set, or 30% of the calling process’s sched_getaffinity cpuset (minimum 1) if not. The flock granularity stays per-LLC; plan.cpus holds EXACTLY the budget (partial-take on the last LLC when the budget falls mid-LLC). Multiple no-perf-mode VMs coexist on the same LLCs because shared locks are reentrant; a concurrent perf-mode VM attempting LOCK_EX blocks until every no-perf-mode peer has released.

Enforcement under --cpu-cap:

  • cgroup v2 cpuset sandbox — the reserved CPUs and derived NUMA nodes are written to a child cgroup’s cpuset.cpus and cpuset.mems, and the build pid is migrated into that cgroup, so make -jN gcc children inherit the binding. Under --cpu-cap, narrowing by a parent cgroup is a fatal error; without the flag (but with acquire_llc_plan still running on the 30% default) the sandbox warns and proceeds.
  • Soft-mask affinity — vCPU threads receive a sched_setaffinity mask covering only the reserved CPUs, so the guest’s CPU placement respects the budget even though no pinning is applied.
  • No RT scheduling, no hugepages, no mbind, no KVM exit suppression — these remain off; --cpu-cap is not a partial performance mode.
  • make -jN hint — kernel-build pipelines pass plan.cpus.len() to make so gcc’s fan-out matches the reserved capacity rather than nproc.

This tier is mutually exclusive with performance_mode=true (on the CLI, clap requires = "no_perf_mode" rejects --cpu-cap without --no-perf-mode at parse time) and with KTSTR_BYPASS_LLC_LOCKS=1 (rejected at every entry point because the contract and the bypass escape hatch are contradictory). Library consumers that set performance_mode=true on KtstrVmBuilder directly bypass the CLI parse — KTSTR_CPU_CAP is silently ignored in that path because the builder’s perf-mode branch never consults CpuCap::resolve.

See Resource Budget for the CpuCap, LlcPlan, and ktstr locks surfaces in detail.

Tier 3: default (per-CPU window + LLC LOCK_SH)

Selected when neither performance_mode=true nor --no-perf-mode/KTSTR_NO_PERF_MODE is set — the default path for #[ktstr_test] entries that don’t declare performance_mode (entry.rs KtstrTestEntry::DEFAULT sets performance_mode: false). acquire_cpu_locks (in vmm/host_topology.rs) walks a contiguous CPU window, takes LOCK_EX on each window CPU’s /tmp/ktstr-cpu-{C}.lock, then additionally takes LOCK_SH on the LLC lockfiles covering those CPUs so a perf-mode (tier 1) VM cannot grab LOCK_EX on an LLC that this path is using. No pinning, no isolation, no cgroup sandbox — the per-CPU reservation is purely for host-scheduling-noise avoidance between concurrent VMs.

This is the ONLY tier that actually flocks per-CPU lockfiles. Tier 1 skips them (LLC EX already covers all CPUs in the group); tier 2 skips them (capped LLC SH is enforced via the cgroup cpuset and the flock is sufficient per-LLC coordination).

Disabling performance mode

--no-perf-mode (or KTSTR_NO_PERF_MODE=1) forces performance_mode=false. The result is tier 2 above — a CPU-capped LOCK_SH reservation (either explicit --cpu-cap N or the 30%-of-allowed default). The feature differences relative to tier 1 are:

  • LLC flock mode — tier 1 holds LOCK_EX on each reserved LLC; tier 2 holds LOCK_SH. Multiple shared holders coexist; an exclusive holder blocks every shared acquirer and vice-versa.
  • Per-CPU flocks — tier 1 relies on LLC-level LOCK_EX for exclusivity; per-CPU /tmp/ktstr-cpu-{C}.lock files are skipped (try_acquire_all in vmm/host_topology.rs short-circuits the per-CPU loop when LlcLockMode == Exclusive because the LLC lock already covers every CPU in the group). Tier 2 also skips them — the cgroup cpuset is the enforcement layer.
  • vCPU pinning — tier 1 pins via sched_setaffinity to the reserved LLC’s CPUs. Tier 2 applies soft-mask affinity (budget-scoped but no 1:1 vCPU-to-CPU binding).
  • RT scheduling — tier 1 only; tier 2 runs vCPU threads at normal priority.
  • Hugepages — tier 1 only; tier 2 uses regular pages.
  • NUMA mbind — tier 1 only; tier 2 instead writes cpuset.mems on its child cgroup to achieve NUMA locality at the cgroup layer.
  • KVM exit suppression (x86_64) — tier 1 only; tier 2 leaves PAUSE and HLT exits enabled.
  • KVM_HINTS_REALTIME CPUID (x86_64) — tier 1 only; tier 2 leaves the guest on PV spinlocks and standard cpuidle.

Use tier 2 on multi-tenant hosts where you want bounded concurrency (at most N concurrent builds or no-perf-mode VMs per host) but cannot afford the full perf-mode contract. Use tier 1 for regression measurement where host jitter must be controlled.

Available via:

  • ktstr shell --no-perf-mode
  • cargo ktstr test --no-perf-mode
  • cargo ktstr coverage --no-perf-mode
  • cargo ktstr llvm-cov --no-perf-mode
  • cargo ktstr shell --no-perf-mode
  • KTSTR_NO_PERF_MODE=1 (any value; presence is sufficient)

--cpu-cap N layers on top of any of the above when present; when absent, the 30%-of-allowed default applies automatically. The env var is read by every VM builder call site (test harness, auto-repro, verifier, shell). The CLI flags set the env var before test execution so library consumers inherit it.

Resource Budget

--cpu-cap N adds a third tier between full performance-mode isolation and unreserved no-perf-mode execution. Instead of “lock each reserved LLC exclusively” (perf-mode), it reserves a NUMA-aware, consolidation-aware set of host CPUs under LOCK_SH, enforces the reservation via a cgroup v2 cpuset sandbox, and scales make -jN fan-out to the reserved capacity. The flock granularity stays per-LLC: every selected LLC is flocked whole, but plan.cpus holds EXACTLY N CPUs (the last LLC is partial-taken when the budget falls mid-LLC). See Performance Mode for the comparison against the other two tiers.

Every no-perf-mode VM and kernel build runs through this pipeline — there is no “no cap” path. When --cpu-cap is absent, the planner applies a 30% default of the calling process’s sched_getaffinity cpuset (minimum 1 CPU). This keeps sched_setaffinity safe under cgroup-restricted CI runners (CI hosts, systemd slices, sudo-under-a-limited-cpuset) where the process cannot run on every online CPU even if sysfs lists them.

When to use it

  • Multi-tenant CI hosts where unbounded parallelism starves concurrent builds but the full performance-mode contract (SCHED_FIFO, hugepages, NUMA mbind, KVM exit suppression) is too heavy.
  • Kernel builds run alongside perf-mode VM tests — the shared LOCK_SH coordinates with the perf-mode LOCK_EX so make never stomps a measurement in progress.
  • Concurrent no-perf-mode VMs on a shared host — a cap of N CPUs bounds how much capacity each run reserves; peers that would exceed the host’s flock availability wait rather than racing for CPU.

CpuCap — parsed and resolved

CpuCap::new(N: usize) -> Result<CpuCap> constructs a cap from a CLI integer. N is a CPU count. N == 0 is rejected with --cpu-cap must be ≥ 1 CPU (got 0) — zero is a scripting sentinel, not a silent “no cap” fallback.

CpuCap::resolve(cli_flag: Option<usize>) -> Result<Option<CpuCap>> is the three-tier precedence:

  1. CLI flag (--cpu-cap N) wins over env var.
  2. KTSTR_CPU_CAP=N env var applies when the CLI flag is absent. Empty string is treated as unset; 0 or non-numeric values produce the same rejection as the CLI path.
  3. Neither set → Ok(None). The planner expands this into the 30%-of-allowed default at acquire time.

CpuCap::effective_count(allowed_cpus: usize) -> Result<usize> clamps at acquire time, not construction time. N > allowed_cpus returns a ResourceContention error naming both numbers — operators reading the error see immediately that the cap exceeds the process’s sched_getaffinity cpuset, not the host’s total online CPU count. Fixing the cap requires either lowering N or releasing the cgroup restriction on the calling process.
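
The precedence and clamping, sketched (assuming these exact return shapes):

assert!(CpuCap::new(0).is_err()); // zero is rejected, never a silent "no cap"

// CLI flag wins over KTSTR_CPU_CAP; neither set would yield Ok(None).
let cap = CpuCap::resolve(Some(8))?.expect("CLI flag was provided");

// Clamping happens at acquire time: an 8-CPU cap against a 4-CPU allowed
// set is a ResourceContention error naming both numbers.
assert!(cap.effective_count(4).is_err());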

host_allowed_cpus — the reference set

host_allowed_cpus() reads the calling process’s allowed CPUs via sched_getaffinity(0) with a /proc/self/status Cpus_allowed_list: fallback. Every consumer of the --cpu-cap pipeline plans against this set instead of HostTopology::online_cpus, so sched_setaffinity on the plan’s CPU list never produces an empty effective mask under a cgroup-restricted runner.

An empty allowed set is a bail condition, not a fallback to “every CPU” — guessing on a misconfigured host is worse than failing visibly. A host topology that has no LLC overlapping the allowed set (sysfs and sched_getaffinity disagree — e.g. stale sysfs after hot-plug, cgroup cpuset pinned to CPUs the kernel no longer reports in LLC groups) also bails with an actionable diagnostic.

LlcPlan — the ACQUIRE result

acquire_llc_plan(topo, test_topo, cpu_cap) runs three phases:

  1. DISCOVER — for every LLC, stat the canonical /tmp/ktstr-llc-{N}.lock, read /proc/locks once, and build a snapshot of holders per LLC. No flocks are taken.
  2. PLAN — rank LLCs (eligible = at least one allowed CPU): consolidation (prefer LLCs with existing holders) first, then fresh LLCs, all tiebroken by ascending index. Seed on the highest-scored LLC’s NUMA node; greedily fill that node before spilling to nearest-by-distance nodes via TestTopology::numa_distance. Accumulate allowed-CPU contribution per LLC until the accumulated count meets target_cpus. Final acquire order is ascending LLC index for livelock safety.
  3. ACQUIRE — non-blocking LOCK_SH on every selected LLC. A single EWOULDBLOCK drops every held fd and retries once (one TOCTOU retry — the second DISCOVER’s /proc/locks read IS the backoff; more retries would amplify livelock risk without adding coordination signal).

Partial-take on the last LLC

Post-ACQUIRE, the materialization layer walks each selected LLC’s CPUs in ascending order, intersects with the allowed set, and STOPS at exactly target_cpus total. The last selected LLC typically contributes only a prefix of its allowed CPUs — the flock is still held at LLC granularity (coordination with concurrent ktstr peers is always per-LLC), but plan.cpus reflects the exact CPU budget. sched_setaffinity masks and cgroup cpuset.cpus writes narrow to that exact set.
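
For example, with target_cpus = 10 and two selected LLCs of 8 allowed CPUs each, the first LLC contributes all 8 CPUs and the second only its first 2; both lockfiles are flocked whole, but plan.cpus holds exactly 10 entries.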

The returned LlcPlan carries:

  • locked_llcs: Vec<usize> — selected host LLC indices, ASC.
  • cpus: Vec<usize> — flat list of reserved CPUs, sized exactly target_cpus (a subset of every selected LLC’s allowed CPUs, with the last LLC possibly contributing only a prefix).
  • mems: BTreeSet<usize> — NUMA nodes actually hosting plan.cpus (an LLC that contributes a partial slice only registers the nodes of its used CPUs).
  • snapshot: Vec<LlcSnapshot> — per-LLC discovery trail.
  • locks: Vec<OwnedFd> — RAII flock handles; Drop releases.

When mems spans more than one node (warn_if_cross_node_spill fires), stderr gets a ktstr: reserving LLCs […] across N NUMA nodes warning so the operator knows to expect cross-node memory latency. Single-node plans are silent.

Cgroup v2 cpuset sandbox

BuildSandbox::try_create(plan_cpus, plan_mems, hard_error_on_degrade) writes the plan into a child cgroup under the caller’s own cgroup, in the kernel-required order: cpuset.cpus -> cpuset.mems -> cgroup.procs. A task in a cgroup with empty cpuset.mems may be killed by the cpuset allocator, so migration into cgroup.procs MUST happen after both cpuset fields are populated.

After each cpuset write, .effective is read back. Narrowing by a parent cgroup (e.g. systemd slice restriction) is a fatal error under --cpu-cap (hard_error_on_degrade = true) and a warn-only degrade without the flag.

Drop migrates the build pid back to root, tolerates transient EBUSY on cgroup.rmdir (5 × 10 ms retries), and orphans the directory with a tag=resource_budget.cgroup_orphan_left warn-log if the rmdir still refuses. Orphans older than 24 h are swept on the next sandbox creation.

make -jN hint

make_jobs_for_plan(plan) returns plan.cpus.len().max(1). The kernel-build pipeline threads this as make -jN. Without the hint, make -j$(nproc) fans gcc children across every online CPU, defeating the cpuset reservation in scheduling terms — the kernel still enforces cpuset membership at the fs layer, but gcc’s parallel width silently violates the budget. The .max(1) floor guards against make -j0 (unbounded on GNU make).

ktstr locks — observational surface

ktstr locks (or cargo ktstr locks) prints every ktstr flock currently held on the host, cross-referenced against /proc/locks to name each holder by PID + truncated cmdline. Read-only — takes no flocks. Four categories:

  1. LLC locks under /tmp/ktstr-llc-*.lock
  2. Per-CPU locks under /tmp/ktstr-cpu-*.lock
  3. Cache-entry locks under {cache_root}/.locks/*.lock
  4. Run-dir locks under {runs_root}/.locks/{kernel}-{project_commit}.lock — held for the duration of the (pre-clear + write) cycle by serialize_and_write_sidecar so two concurrent ktstr processes targeting the same run-dir key serialize on the sidecar write rather than tearing each other’s mid-write files.

Flags:

  • --json — emit a structured snapshot. One-shot uses to_string_pretty for readability; under --watch each frame is compact on its own line (ndjson-style) for streaming consumers. Top-level keys: llcs, cpus, cache, run_dirs. Each row names its lockfile path and a holders array; every holder has pid + cmdline.
  • --watch <interval> — redraw on the given interval until SIGINT. Interval uses humantime syntax (100ms, 1s, 5m, 1h).

Use ktstr locks when --cpu-cap acquires fail with ResourceContention: the error already names busy LLCs, but the live snapshot shows every contending peer at once.
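
For example:

ktstr locks --json                # one-shot, pretty-printed snapshot
ktstr locks --watch 1s            # redraw every second until SIGINT
ktstr locks --watch 500ms --json  # one compact JSON frame per line (ndjson)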

KTSTR_BYPASS_LLC_LOCKS — escape hatch

Setting KTSTR_BYPASS_LLC_LOCKS=1 (any non-empty value) skips acquire_llc_plan entirely. The VM boots or the kernel builds immediately without coordinating against any concurrent perf-mode run. Use only when the operator explicitly accepts measurement noise:

  • A shell session doing unrelated work alongside tests.
  • An isolated developer workstation.
  • A CI queue that already serializes jobs at a higher layer.

Mutually exclusive with --cpu-cap / KTSTR_CPU_CAP at every entry point (CLI parse for shell + kernel build on both ktstr and cargo ktstr, the kernel_build_pipeline reservation phase, and the library-layer KtstrVmBuilder::build no-perf-mode branch). The error wording always contains "resource contract" so operators can grep for it; the contract and the bypass cannot coexist at any of those six sites.

Note: the performance_mode=true vs --cpu-cap exclusion is weaker. It is enforced at CLI parse (shell --cpu-cap requires --no-perf-mode via clap requires), but library consumers that set performance_mode=true on KtstrVmBuilder directly see KTSTR_CPU_CAP silently ignored — the builder’s perf-mode branch never calls CpuCap::resolve, it goes through validate_performance_mode + acquire_resource_locks (LOCK_EX) instead.

Filesystem requirement

Every ktstr lockfile (/tmp/ktstr-llc-*.lock, /tmp/ktstr-cpu-*.lock, {cache_root}/.locks/*.lock, {runs_root}/.locks/*.lock) must live on a local filesystem — tmpfs, ext4, xfs, btrfs, f2fs, or bcachefs are the explicitly-accepted set.

flock(2) behavior on NFS, CIFS, SMB2, CephFS, AFS, and FUSE is unreliable: NFSv3 is advisory-only without an NLM peer and NFSv4 byte-range locking does not cover flock(2); SMB does not emit /proc/locks entries so ktstr cannot enumerate peer holders; Ceph MDS does not participate in flock serialization across nodes; AFS does not support flock(2) at all; FUSE flock semantics depend on whether the userspace server implements the op.

try_flock statfs-checks every lockfile path at open time via reject_remote_fs in src/flock.rs — hitting any deny-listed filesystem produces an actionable runtime error naming the filesystem plus the remediation “Move the lockfile path to a local filesystem (tmpfs, ext4, xfs, btrfs, f2fs, bcachefs).” Unknown local filesystems (zfs, erofs, etc.) are not on the deny-list and pass through, on the basis that rejecting unknown-but-local is more disruptive than accepting a potentially-unreliable flock.

See also:

  • Performance Mode — the full-isolation tier; the tier comparison lives there.
  • Environment Variables — KTSTR_CPU_CAP, KTSTR_BYPASS_LLC_LOCKS, and every other ktstr-controlled env var.

Writing Tests

Tests are Rust functions annotated with #[ktstr_test]. Each test boots a KVM VM, runs the scenario inside it, and evaluates results on the host.

use ktstr::prelude::*;

#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("cg_0").workers(2),
        CgroupDef::named("cg_1").workers(2),
    ])
}

Run with cargo ktstr test --kernel ../linux. See Getting Started for setup and The #[ktstr_test] Macro for all available attributes. Each test also generates gauntlet variants across topology presets and flag profiles. See Gauntlet Tests. For scenarios that need logic beyond what the ops system can express, see Custom Scenarios.

The #[ktstr_test] Macro

#[ktstr_test] registers a function as an integration test that runs inside a VM.

Basic usage

use ktstr::prelude::*;

#[ktstr_test(llcs = 2, cores = 4, threads = 2)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
    // ctx provides cgroup manager, topology, duration, etc.
    Ok(AssertResult::pass())
}

When a scheduler with a default topology is specified, the topology can be omitted:

use ktstr::declare_scheduler;

declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    //          numa, llcs, cores/llc, threads/core
    topology = (1,    2,    4,         1),
});

#[ktstr_test(scheduler = MY_SCHED)]
fn inherited_topo(ctx: &Ctx) -> Result<AssertResult> {
    // Inherits 1n2l4c1t from MY_SCHED
    Ok(AssertResult::pass())
}

declare_scheduler! emits a pub static MY_SCHED: Scheduler and registers a private linkme static in the KTSTR_SCHEDULERS distributed slice. The scheduler = slot expects &'static Scheduler — pass the bare MY_SCHED ident; the macro takes a reference internally.

The function must have signature fn(&ktstr::scenario::Ctx) -> anyhow::Result<ktstr::assert::AssertResult>.

What the macro generates

  1. Renames the function to __ktstr_inner_{name}.
  2. Registers it in the KTSTR_TESTS distributed slice via linkme.
  3. Emits a #[test] wrapper that calls run_ktstr_test().

The #[test] wrapper boots a VM with the specified topology and runs the function inside it.

Attributes

All attributes are optional with defaults.

Topology

  • llcs (default: inherited) – number of LLCs.
  • numa_nodes (default: inherited) – number of NUMA nodes.
  • cores (default: inherited) – cores per LLC.
  • threads (default: inherited) – threads per core.
  • memory_mb (default: 2048) – VM memory in MB.

Each dimension independently inherits from Scheduler.topology when a scheduler is specified and that dimension is not explicitly set. Without a scheduler, unset dimensions use macro defaults (numa_nodes=1, llcs=1, cores=2, threads=1). The default is a single-NUMA topology, so most tests do not need to set numa_nodes. See Default topology.

Scheduler

  • scheduler = CONST (default: &Scheduler::EEVDF) – Rust const path to a &'static Scheduler. The bare const emitted by declare_scheduler! (e.g. MY_SCHED) is the expected form. The default Scheduler::EEVDF runs tests under the kernel’s default scheduler (EEVDF on Linux 6.6+), so tests without an explicit scheduler = run under the kernel default.
  • extra_sched_args = [...] (default: []) – extra CLI args for the scheduler, appended after Scheduler::sched_args.
  • watchdog_timeout_s (default: 5) – scx watchdog override (seconds). Applied via scx_sched.watchdog_timeout on 7.1+ kernels (BTF-detected) and via the static scx_watchdog_timeout symbol on pre-7.1 kernels. When neither path is available the override silently no-ops.

Payloads

  • payload = CONST (default: none) – Rust const path to a binary-kind Payload (PayloadKind::Binary). Populates KtstrTestEntry::payload; the test body can run it via ctx.payload(&CONST). Scheduler-kind payloads are rejected at compile time — use the scheduler = … slot for those.
  • workloads = [CONST, …] (default: []) – array of binary-kind Payload const paths composed alongside the primary payload. Each entry is runnable from the test body via ctx.payload(&CONST); the include-file pipeline packages every referenced binary into the guest automatically.
  • extra_include_files = ["path", …] (default: []) – array of string-literal paths to extra host-side files (datasets, fixture configs, helper scripts) that the framework packages into the guest initramfs alongside the binaries declared by scheduler / payload / workloads. Maps onto KtstrTestEntry::extra_include_files (&'static [&'static str]); the union with per-payload Payload::include_files is computed at run time via KtstrTestEntry::all_include_files. Use this slot for test-level dependencies that don’t belong on a specific Payload.

See Payload Definitions for authoring new Payload fixtures; tests/common/fixtures.rs carries reusable examples (SCHBENCH, SCHBENCH_HINTED, SCHBENCH_JSON).

Checking

All checking attributes default to the inherited (merged) value:

  • not_starved – enable starvation (zero work units), fairness spread, and scheduling gap checks.
  • isolation – enable cpuset isolation check (workers must stay on assigned CPUs).
  • max_gap_ms – max scheduling gap threshold.
  • max_spread_pct – max fairness spread threshold.
  • max_throughput_cv – max coefficient of variation for worker throughput.
  • min_work_rate – minimum work_units per CPU-second per worker.
  • max_imbalance_ratio – monitor imbalance ratio.
  • max_local_dsq_depth – monitor DSQ depth.
  • fail_on_stall – fail on stall detection.
  • sustained_samples – sample window for sustained violations.
  • max_fallback_rate – max fallback event rate.
  • max_keep_last_rate – max keep-last event rate.
  • max_p99_wake_latency_ns – max p99 wake latency in nanoseconds.
  • max_wake_latency_cv – max wake latency coefficient of variation.
  • min_iteration_rate – minimum iterations per wall-clock second per worker.
  • max_migration_ratio – max migration ratio (migrations/iterations) per cgroup.
  • min_page_locality – min fraction of pages on expected NUMA nodes (0.0-1.0).
  • max_cross_node_migration_ratio – max ratio of NUMA-migrated pages to total pages (0.0-1.0).
  • max_slow_tier_ratio – max fraction of pages on memory-only (CXL) nodes (0.0-1.0).

not_starved = true enables three distinct checks: starvation (any worker with zero work units), fairness spread (max-min off-CPU% below max_spread_pct), and scheduling gaps (longest gap below max_gap_ms). Each threshold can be overridden independently. See Customize Checking for override examples and Checking for the merge chain.
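
A sketch of overriding two of these thresholds independently; the values and their literal forms are illustrative, not recommended defaults:

#[ktstr_test(
    llcs = 1, cores = 2, threads = 1,
    not_starved = true,
    max_gap_ms = 2000,   // tighter than the inherited gap threshold
    max_spread_pct = 25, // tighter fairness spread
)]
fn tight_fairness(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![CgroupDef::named("cg_0").workers(4)])
}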

Topology constraints

  • min_llcs (default: 1) – minimum LLCs for gauntlet topology filtering.
  • max_llcs (default: 12) – maximum LLCs for gauntlet topology filtering.
  • min_cpus (default: 1) – minimum total CPU count for gauntlet topology filtering.
  • max_cpus (default: 192) – maximum total CPU count for gauntlet topology filtering.
  • min_numa_nodes (default: 1) – minimum NUMA nodes for gauntlet topology filtering.
  • max_numa_nodes (default: 1) – maximum NUMA nodes for gauntlet topology filtering.
  • requires_smt (default: false) – require SMT (threads > 1) topologies. On aarch64 the gauntlet ships only non-SMT presets, so any test with requires_smt = true is skipped entirely on that arch.

The gauntlet skips presets that do not satisfy these constraints. Multi-NUMA presets are excluded by default (max_numa_nodes = 1). See Gauntlet for filtering rules and Gauntlet Tests for a worked example.

Execution

  • auto_repro (default: true) – on scheduler crash, boot a second VM with probes attached. Set to false for fast iteration.
  • performance_mode (default: false) – pin vCPUs to host cores, hugepages, NUMA mbind, RT scheduling, LLC exclusivity validation.
  • no_perf_mode (default: false) – decouple the virtual topology from host hardware: build the VM with the declared numa_nodes / llcs / cores / threads even on smaller hosts; skip vCPU pinning, hugepages, NUMA mbind, RT scheduling, and KVM exit suppression; relax gauntlet preset filtering to the single “host has enough total CPUs” check. Mutually exclusive with performance_mode = true (validated at runtime by KtstrTestEntry::validate). Equivalent to setting KTSTR_NO_PERF_MODE=1 per-test — either source forces the no-perf path. See Performance Mode.
  • duration_s (default: 12) – per-scenario duration in seconds.
  • expect_err (default: false) – test expects run_ktstr_test to return Err; disables auto-repro.
  • bpf_map_write = CONST (default: empty) – Rust const path to a BpfMapWrite; the host writes this value to a BPF map after the scheduler loads. The entry field is a slice; the macro wraps the single path in a one-element slice.
  • host_only (default: false) – run the test function directly on the host instead of inside a VM. Use for tests that need host tools (e.g. cargo, nested VMs) unavailable in the guest initramfs.
  • num_snapshots = N (default: 0) – fire N periodic freeze_and_capture(false) boundaries inside the workload’s 10 %–90 % window; each capture is stored on the host SnapshotBridge under periodic_NNN. 0 disables periodic capture entirely. Validated against MAX_STORED_SNAPSHOTS (= 64), host_only = true, and a 100 ms minimum-spacing rule. See Periodic Capture and Temporal Assertions.
  • cleanup_budget_ms = N (default: none) – sub-watchdog cap on host-side VM teardown wall time. When the budget is exceeded the test’s AssertResult is folded with a failing AssertDetail. Unset disables the check.
  • post_vm = PATH (default: none) – host-side callback invoked after vm.run() returns. Signature: fn(&VmResult) -> anyhow::Result<()>. Use for assertions that need host-side state — e.g. draining VmResult.snapshot_bridge for periodic-capture analysis (see Periodic Capture).
  • config = EXPR (default: none) – inline scheduler config content (string literal or path to a const &'static str). Written to the guest path declared by the scheduler’s config_file_def; the framework substitutes {file} in the scheduler’s arg template with the guest path. Required when the scheduler declares config_file_def; rejected when it doesn’t. The pairing is enforced at compile time via a const assertion against Payload::config_file_def, and again at runtime by KtstrTestEntry::validate. See Inline scheduler config.

See Performance Mode for details on what performance_mode enables, prerequisites, and validation behavior.

Inline scheduler config

Some schedulers (e.g. scx_layered, scx_lavd) accept a JSON config file via a CLI argument like --config /path/to/config.json. Two pieces wire this into a test:

  1. Scheduler declaration — the Scheduler builder declares the arg template and the guest path via .config_file_def:

    const LAYERED_SCHED: Scheduler = Scheduler::new("layered")
        .binary(SchedulerSpec::Discover("scx_layered"))
        .config_file_def("--config {file}", "/include-files/layered.json");

    {file} in the arg template is replaced with the guest path. The framework creates the parent directory (mkdir -p) and writes the config content to /include-files/layered.json inside the guest before the scheduler binary starts.

  2. Test attribute — the test supplies the inline JSON via config = …:

    const LAYERED_CONFIG: &str = r#"{ "layers": [...] }"#;
    
    #[ktstr_test(scheduler = LAYERED_SCHED, config = LAYERED_CONFIG)]
    fn layered_test(ctx: &Ctx) -> Result<AssertResult> {
        Ok(AssertResult::pass())
    }

    config = "..." (string literal) and config = SOME_CONST (path to a const &'static str) are both accepted.

The pairing gate is bidirectional:

  • A scheduler with config_file_def set requires config = … on every test (otherwise the scheduler binary would launch without --config).
  • A scheduler without config_file_def rejects config = … on the test (the content would be silently dropped at dispatch).

Both halves are validated at compile time via a const assertion emitted by the macro AND at runtime by KtstrTestEntry::validate, so direct programmatic-entry construction sees the same gate.

For schedulers that take a config file from a host-side path instead of inline content, use Scheduler::config_file(host_path) instead of config_file_def. The framework packs the host file into the initramfs at /include-files/{filename} and prepends --config /include-files/{filename} to scheduler args; no config = … on the test is needed in that flavor.
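
A sketch of that flavor, mirroring the builder shape shown earlier; the scheduler name and host path are illustrative:

const FILE_SCHED: Scheduler = Scheduler::new("file_sched")
    .binary(SchedulerSpec::Discover("scx_file_sched"))
    .config_file("configs/file_sched.json"); // host path, packed into the initramfs

No config = … attribute is needed on tests that reference FILE_SCHED.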

Example with custom scheduler

Define the scheduler with declare_scheduler! (see Scheduler Definitions), then reference it in #[ktstr_test]:

use ktstr::declare_scheduler;
use ktstr::prelude::*;

declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    topology = (1, 2, 4, 1),
    sched_args = ["--enable-llc", "--enable-stealing"],
});

#[ktstr_test(
    scheduler = MY_SCHED,
    not_starved = true,
    max_gap_ms = 5000,
)]
fn my_sched_basic(ctx: &Ctx) -> Result<AssertResult> {
    // Inherits 1n2l4c1t from MY_SCHED
    Ok(AssertResult::pass())
}

declare_scheduler! emits a pub static MY_SCHED: Scheduler and registers it in the KTSTR_SCHEDULERS distributed slice via a private linkme static so cargo ktstr verifier discovers it. The bare MY_SCHED ident is what #[ktstr_test(scheduler = ...)] expects. See Scheduler Definitions for the full macro grammar.

For the manual builder pattern (no distributed-slice registration), see Scheduler Definitions: Manual definition.

Custom Scenarios

For dynamic scenarios (cgroup creation/removal, cpuset changes), prefer the ops/steps system over raw Action::Custom.

Use Action::Custom only when you need logic that the ops system cannot express.

Writing a custom scenario

use ktstr::prelude::*;
use ktstr::scenario::*;

fn my_custom_scenario(ctx: &Ctx) -> Result<AssertResult> {
    let wl = dfl_wl(ctx);
    let (handles, _guard) = setup_cgroups(ctx, 2, &wl)?;

    // Custom logic: resize cpusets, move workers, etc.
    std::thread::sleep(ctx.duration);

    Ok(collect_all(handles, &ctx.assert))
}

Helper functions

setup_cgroups(ctx, n, wl) – creates N cgroups, spawns workers, returns Result<(Vec<WorkloadHandle>, CgroupGroup)>. Bind the CgroupGroup to a named variable (e.g. _guard) so it lives until end of scope. See CgroupGroup for drop semantics.

Imports: setup_cgroups and dfl_wl live in ktstr::scenario, not in the prelude. The use ktstr::scenario::*; line in the example above is required — use ktstr::prelude::*; alone does not bring them into scope.

collect_all(handles, checks) – stops all workers, collects reports, runs worker-level checks when configured, otherwise falls back to assert_not_starved(). Merges results: if any worker group fails, the overall result fails.

dfl_wl(ctx) – creates a WorkloadConfig with ctx.workers_per_cgroup workers and default settings.

spawn_diverse(ctx, cgroup_names) – spawns different work types across cgroups, rotating through (SpinWait, Bursty{50ms burst / 100ms sleep}, IoSyncWrite, Mixed, YieldHeavy). Each cgroup uses ctx.workers_per_cgroup workers except IoSyncWrite cgroups, which always use 2 workers so blocking IO does not drown the scenario.

The Ctx struct

Custom scenarios receive a Ctx reference:

pub struct Ctx<'a> {
    pub cgroups: &'a dyn CgroupOps,
    pub topo: &'a TestTopology,
    pub duration: Duration,
    pub workers_per_cgroup: usize,
    pub sched_pid: Option<libc::pid_t>,
    pub settle: Duration,
    pub work_type_override: Option<WorkType>,
    pub assert: Assert,
    pub wait_for_map_write: bool,
}

cgroups – create/remove cgroups, set cpusets, move tasks. The slot is a &dyn CgroupOps trait object, not a concrete CgroupManager, so tests can substitute a no-op double for host-only scenarios while production paths receive the real manager. Method signatures are defined on CgroupOps; see CgroupManager for the production implementation.

topo – query CPU topology (LLCs, NUMA nodes, memory info, distances). Provides CPU enumeration, LLC/NUMA partitioning, cpuset generation, and inter-node distance queries. See TestTopology for the full API reference.

sched_pid – scheduler process ID (Option<libc::pid_t>) for liveness checks. None when the test runs without an scx scheduler (the EEVDF default path has no userspace scheduler binary). Unwrap it, or guard with is_some_and(...), before passing the pid to process_alive or kill(Pid::from_raw(pid), None); a sketch follows this field list.

settle – time to wait after cgroup creation for the scheduler to stabilize.
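A minimal liveness-check sketch for sched_pid inside a custom scenario body, assuming the nix crate is in scope for the kill probe named above (a None signal checks existence without delivering anything):

use nix::sys::signal::kill;
use nix::unistd::Pid;

// sched_pid is None under EEVDF (no userspace scheduler) — guard first.
if let Some(pid) = ctx.sched_pid {
    if kill(Pid::from_raw(pid), None).is_err() {
        anyhow::bail!("scheduler process {pid} exited mid-scenario");
    }
}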

Checking in custom scenarios

Use Assert for both direct report checking and ops-based scenarios. Call assert.assert_cgroup(reports, cpuset) for manual report collection, or use execute_steps_with() for ops-based scenarios. See Checking.

Scheduler Definitions

A Scheduler tells the test framework how to launch and configure the scheduler under test. Tests reference one via #[ktstr_test(scheduler = MY_SCHED)]; the verifier sweep reads every declared scheduler from the KTSTR_SCHEDULERS distributed slice automatically.

The Scheduler type

pub struct Scheduler {
    pub name: &'static str,
    pub binary: SchedulerSpec,
    pub sysctls: &'static [Sysctl],
    pub kargs: &'static [&'static str],
    pub assert: Assert,
    pub cgroup_parent: Option<CgroupPath>,
    pub sched_args: &'static [&'static str],
    pub topology: Topology,
    pub constraints: TopologyConstraints,
    pub config_file: Option<&'static str>,
    pub config_file_def: Option<(&'static str, &'static str)>,
    pub kernels: &'static [&'static str],
}

config_file packs a host-side file into the initramfs at /include-files/{filename} and prepends --config /include-files/{filename} to scheduler args automatically.

config_file_def declares an arg-template + guest-path pair for schedulers that take inline JSON content via the test attribute #[ktstr_test(config = …)]: the framework writes the test’s config_content to the declared guest path and substitutes {file} in the arg template before launching the scheduler. The two fields are alternatives — config_file is the host-file path, config_file_def is the inline-content path. See The #[ktstr_test] Macro for the inline pairing gate.

sysctls takes Sysctl values. Construct them with Sysctl::new("key", "value") in const context. Use the dot-separated form for the key (e.g. "kernel.foo", not "kernel/foo"); duplicate keys are applied in order and the last write wins.
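A sketch of the declaration shape (the key and values are illustrative):

declare_scheduler!(TUNED, {
    name = "tuned",
    binary = "scx_tuned",
    // Duplicate keys apply in order — the "2" write wins here.
    sysctls = [Sysctl::new("kernel.foo", "1"), Sysctl::new("kernel.foo", "2")],
});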

kargs is the extra GUEST KERNEL command-line (not the scheduler binary’s CLI — use sched_args for that). Do not override the kargs ktstr injects itself (console=, loglevel=, init=): those break guest-side init and leave the VM unable to run tests.

kernels is the per-scheduler filter on the BPF Verifier sweep matrix. The matrix dimension itself is the operator’s cargo ktstr verifier --kernel <SPEC> set (which the dispatcher always populates into KTSTR_KERNEL_LIST, including a single auto-discovered entry when --kernel is omitted). For each scheduler, the lister emits one cell per (kernel-list entry that passes this filter × accepted topology preset).

Each entry is a string consumed by KernelId::parse — the same parser as the cargo ktstr verifier --kernel <SPEC> CLI flag. Match semantics per variant:

  • Exact Version ("6.14.2") — matches an entry whose raw or sanitized label equals the version.
  • Range ("6.14..7.0" or "6.14..=7.0" — both inclusive on both endpoints) — matches entries whose raw version falls inside [start, end] via decompose_version_for_compare.
  • Path / CacheKey / Git ("git+URL#REF", "path/to/dir", "6.14.2-tarball-x86_64-kc...") — matches by sanitized-label equality.

An empty kernels = [] means no filter — the scheduler verifies against every kernel-list entry the operator supplied.
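A sketch combining the match variants in one filter (the git URL is a placeholder):

declare_scheduler!(PINNED, {
    name = "pinned",
    binary = "scx_pinned",
    kernels = [
        "6.14.2",       // exact version match
        "6.15..=7.0",   // inclusive range
        "git+https://example.org/linux.git#sched-ext",  // sanitized-label match
    ],
});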

SchedulerSpec

How to find the scheduler binary:

pub enum SchedulerSpec {
    Eevdf,                   // No sched_ext binary -- use kernel EEVDF
    Discover(&'static str),  // Auto-discover by name
    Path(&'static str),      // Explicit path
    KernelBuiltin {          // Kernel-built scheduler (no binary)
        enable: &'static [&'static str],
        disable: &'static [&'static str],
    },
}

KernelBuiltin is for schedulers compiled into the kernel (e.g. BPF-less sched_ext or debugfs-tuned variants). The enable commands run in the guest to activate the scheduler; disable commands run to deactivate it. No binary is injected into the VM.

SchedulerSpec::has_active_scheduling() returns true for all variants except Eevdf. When true, the framework runs monitor threshold evaluation after the scenario and enables auto-repro on crash.

Eevdf and KernelBuiltin are excluded from the verifier sweep at cell-emission time — neither has a userspace binary to load BPF programs from, so the verifier has nothing to verify.

Built-in: EEVDF

Scheduler::EEVDF runs tests without a sched_ext scheduler, using the kernel’s default EEVDF scheduler. Its binary is SchedulerSpec::Eevdf. It is the default scheduler for #[ktstr_test] entries that do not pass scheduler = ....

Defining a scheduler

declare_scheduler! is the preferred entry point: it constructs a pub static Scheduler and registers it in the KTSTR_SCHEDULERS distributed slice in one step, so the verifier sweep picks it up automatically.

use ktstr::declare_scheduler;
use ktstr::prelude::*;

declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    sched_args = ["--exit-dump-len", "1048576"],
    topology = (1, 2, 4, 1),
    kernels = ["6.14", "6.15..=7.0"],
});

#[ktstr_test(scheduler = MY_SCHED)]
fn basic(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("cg_0").workers(2),
        CgroupDef::named("cg_1").workers(2),
    ])
}

The macro emits:

  • pub static MY_SCHED: Scheduler with the supplied fields.
  • A private static __KTSTR_SCHED_REG_MY_SCHED: &'static Scheduler registered in the KTSTR_SCHEDULERS distributed slice via linkme so cargo ktstr verifier discovers it.

#[ktstr_test(scheduler = ...)] expects an &'static Scheduler — pass the bare ident (e.g. MY_SCHED). The macro takes a reference internally, so passing the bare static yields the correct &Scheduler.

Accepted fields

Every key=value pair except name is optional (binary defaults to discovery by name when omitted). The key names match Scheduler struct fields:

  • name = "..." — short human name (required).
  • binary = "scx_name" — defaults to SchedulerSpec::Discover(name). Accepts SchedulerSpec::Path("/abs/path"), SchedulerSpec::Eevdf, or SchedulerSpec::KernelBuiltin { enable: &[...], disable: &[...] }.
  • sched_args = ["--a", "--b"] — CLI args prepended to every test that uses this scheduler.
  • kernels = ["6.14", "6.15..=7.0", "git+URL#REF", "/path", "cache-key"] — verifier sweep set; see the field doc above.
  • cgroup_parent = "/path" — must begin with /, must not be "/" alone.
  • kargs = ["nosmt"] — guest-kernel cmdline additions.
  • sysctls = [Sysctl::new("kernel.foo", "1")] — applied before the scheduler starts.
  • topology = (numa_nodes, llcs, cores, threads) — default VM topology for #[ktstr_test] entries.
  • constraints = TopologyConstraints { ... } — gauntlet topology constraints inherited by #[ktstr_test] entries.
  • config_file = "configs/my_sched.toml" — opaque host-side config to pack into the guest initramfs.
  • config_file_def = ("--config={file}", "/include-files/my.json") — alternative inline-config seam (see The #[ktstr_test] Macro).
  • assert = Assert::NO_OVERRIDES.method_chain() — scheduler-level assertion overrides merged on top of Assert::default_checks().

Visibility

The identifier can be pub or pub(crate):

declare_scheduler!(pub MY_SCHED, { name = "my_sched", binary = "scx_my_sched" });
declare_scheduler!(pub(crate) INTERNAL, { name = "internal", binary = "scx_internal" });

The macro emits #[allow(missing_docs)] on the generated static so crates with #![deny(missing_docs)] compile cleanly.

Manual definition

The const builder pattern still works when the macro doesn’t fit — e.g. when the scheduler is composed programmatically or when test-only fixtures need to avoid the distributed-slice registration:

use ktstr::prelude::*;

const MITOSIS: Scheduler = Scheduler::new("scx_mitosis")
    .binary(SchedulerSpec::Discover("scx_mitosis"))
    .topology(1, 2, 4, 1)
    .sched_args(&["--exit-dump-len", "1048576"])
    .cgroup_parent("/ktstr")
    .assert(Assert::NO_OVERRIDES.max_imbalance_ratio(2.0));

A manually-defined Scheduler is not registered in KTSTR_SCHEDULERS automatically; the verifier sweep does not see it. Use declare_scheduler! for any scheduler that should participate in cargo ktstr verifier.

Cgroup parent

Scheduler.cgroup_parent specifies a cgroup subtree under /sys/fs/cgroup for the scheduler to manage. When set, the VM init creates the directory before starting the scheduler, and --cell-parent-cgroup <path> is injected into the scheduler args. The field is Option<CgroupPath>. CgroupPath::new() is a const constructor that panics at compile time if the path does not begin with / or is "/" alone. The Scheduler::cgroup_parent() builder and the declare_scheduler! cgroup_parent = "..." field both accept &'static str and construct a CgroupPath internally.

declare_scheduler!(MITOSIS, {
    name = "scx_mitosis",
    binary = "scx_mitosis",
    topology = (1, 2, 4, 1),
    cgroup_parent = "/ktstr",
});

This creates /sys/fs/cgroup/ktstr in the guest and passes --cell-parent-cgroup /ktstr to the scheduler binary.

Config file

Scheduler.config_file specifies a host-side path to an opaque config file that the scheduler binary reads at startup. The framework packs the file into the guest initramfs at /include-files/{filename} and prepends --config /include-files/{filename} to the scheduler args. ktstr does not parse or validate the file — it is passed through as-is.

The --config flag name is not configurable. Schedulers that use config_file must accept --config <path>. For schedulers that use a different flag, use config_file to place the file in the guest and add the desired flag via sched_args — the scheduler will also receive --config and must not reject unknown flags.

declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    topology = (1, 2, 4, 1),
    config_file = "configs/my_sched.toml",
});

This copies configs/my_sched.toml from the host into the guest at /include-files/my_sched.toml and passes --config /include-files/my_sched.toml to the scheduler binary.

Scheduler args

Scheduler.sched_args provides default CLI args that apply to every test using this scheduler. They are prepended before per-test extra_sched_args.

declare_scheduler!(MITOSIS, {
    name = "scx_mitosis",
    binary = "scx_mitosis",
    topology = (1, 2, 4, 1),
    cgroup_parent = "/ktstr",
    sched_args = ["--exit-dump-len", "1048576"],
});

Merge order: config_file injection, then cgroup_parent injection, then sched_args, then per-test extra_sched_args.
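For the MITOSIS declaration above (no config_file, cgroup_parent = "/ktstr", two sched_args) plus a hypothetical per-test extra_sched_args = ["--verbose"], that order produces:

scx_mitosis --cell-parent-cgroup /ktstr --exit-dump-len 1048576 --verbose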

Default topology

Scheduler.topology sets the default VM topology for all tests using this scheduler. When #[ktstr_test] omits llcs, cores, and threads, the scheduler’s topology is used. Explicit attributes on #[ktstr_test] override the scheduler default.

// (numa_nodes, llcs, cores_per_llc, threads_per_core)
declare_scheduler!(MITOSIS, {
    name = "scx_mitosis",
    binary = "scx_mitosis",
    topology = (1, 2, 4, 1),
});

Arguments are (numa_nodes, llcs, cores_per_llc, threads_per_core). Most schedulers use numa_nodes = 1 (single NUMA node). Scheduler::new() defaults to (1, 1, 2, 1) — a minimal 2-CPU single-NUMA VM, sufficient for tests that don’t exercise topology-dependent scheduling.

Tests that need a different topology (e.g. SMT) override individual dimensions. Unset dimensions still inherit from the scheduler:

// Inherits llcs=2, cores=4 from MITOSIS; overrides threads to 2
#[ktstr_test(scheduler = MITOSIS, threads = 2)]
fn smt_test(ctx: &Ctx) -> Result<AssertResult> { /* ... */ }

Checking overrides

Scheduler.assert provides scheduler-level checking defaults. These sit between Assert::default_checks() and per-test overrides in the merge chain.

A scheduler that tolerates higher imbalance:

declare_scheduler!(RELAXED, {
    name = "relaxed",
    binary = "scx_relaxed",
    assert = Assert::NO_OVERRIDES.max_imbalance_ratio(5.0),
});

Kernel-built scheduler example

For schedulers compiled into the kernel (no userspace binary), use SchedulerSpec::KernelBuiltin with shell commands to activate/deactivate the scheduler:

use ktstr::declare_scheduler;
use ktstr::prelude::*;

declare_scheduler!(MINLAT, {
    name = "minlat",
    binary = SchedulerSpec::KernelBuiltin {
        enable: &["echo minlat > /sys/kernel/debug/sched/ext/root/ops"],
        disable: &["echo none > /sys/kernel/debug/sched/ext/root/ops"],
    },
});

The enable commands run in the guest before scenarios start. The disable commands run after scenarios complete.

KernelBuiltin schedulers do not participate in the verifier sweep (no userspace binary to load BPF programs from); the declaration is still useful for #[ktstr_test(scheduler = ...)] attribution and sidecar identification.

Payload Definitions

A Payload describes a binary workload that a test can run alongside its cgroup workers. The struct encodes PayloadKind::Binary (an external executable — schbench, fio, stress-ng) for the workload role. Tests reference a Payload via #[ktstr_test(payload = FIXTURE)] (primary slot) or #[ktstr_test(workloads = [FIXTURE_A, FIXTURE_B])] (additional slots); the test body then runs it via ctx.payload(&FIXTURE).

The scheduler slot is separate from the payload / workloads slots — #[ktstr_test(scheduler = MY_SCHED)] takes a bare Scheduler reference (the static emitted by declare_scheduler!), not a Payload.

#[non_exhaustive] and construction rules

Payload is #[non_exhaustive] (see crate::non_exhaustive). Downstream crates cannot use struct-literal construction — a future ktstr bump can add fields without breaking callers only if everyone constructs through the provided associated functions (Payload::binary plus the derive described below).

For richer binary payloads (custom default args, declared MetricChecks, MetricHints, include_files), use #[derive(Payload)] on a marker struct — the derive generates the matching const via the same non-exhaustive-preserving construction path. tests/common/fixtures.rs has worked examples — SCHBENCH, SCHBENCH_HINTED, SCHBENCH_JSON — suitable as reference shapes to copy.

Quick reference: Payload fields

The fields are listed here for readers tracing the fixture files, not as a license to hand-roll literals. Each is populated by Payload::binary + the derive’s builder methods:

  • name: &'static str — display name that appears in sidecar JSON, stats tables, and test filtering. Distinct from the binary name (kind) so e.g. SCHBENCH_HINTED can run the same schbench binary with a different label.
  • kind: PayloadKind — either Binary(executable_name) (for test payloads like schbench) or Scheduler(&'static Scheduler) (the in-memory shape of Payload::KERNEL_DEFAULT and similar scheduler-wrapping payloads). Test authors normally do not construct PayloadKind::Scheduler directly — the #[ktstr_test(scheduler = MY_SCHED)] slot takes the bare Scheduler ref without a Payload wrapper.
  • output: OutputFormat — how to interpret the payload’s stdout/stderr. ExitCode (status code only), Json (parse numeric leaves), or LlmExtract(Option<&'static str>) (route through a local LLM with an optional hint).
  • default_args: &'static [&'static str] — CLI args prepended to every invocation. Per-test ctx.payload(...).args(...) appends after these.
  • default_checks: &'static [MetricCheck] — static assertions applied to the payload’s output/exit (min / max / range / exists / exit_code_eq constructors on MetricCheck). Merged with per-test .checks(...).
  • metrics: &'static [MetricHint] — declared metrics the payload emits (name, unit, polarity). Drives list-metrics and comparison thresholds.
  • metric_bounds: Option<&'static MetricBounds> — optional per-metric host-side bounds applied AFTER the payload exits. Consumed by LlmExtract payloads (where extraction runs host-side post-VM-exit); Json and ExitCode payloads ignore this field and route assertions through default_checks instead.
  • include_files: &'static [&'static str] — extra files packaged into the guest alongside the binary (config files, datasets).
  • uses_parent_pgrp: bool — when true, the payload child inherits the test’s process group so the teardown SIGKILL sweep reaches it. Most binaries leave this false and are reaped explicitly.
  • known_flags: Option<&'static [&'static str]> — optional allow-list of CLI flags the payload accepts; used by the gauntlet-style flag expansion.

For an end-to-end workflow from building a scheduler to running the gauntlet, see Test a New Scheduler.

Gauntlet Tests

The gauntlet expands each #[ktstr_test] into a matrix of test × topology_preset variants. The test definition controls which cells of the matrix it populates.

Controlling topology coverage

Topology constraints in #[ktstr_test] filter which gauntlet presets a test runs on. See Topology Constraints for the full attribute table and Topology Presets for the preset list.

Worked example

A test with min_llcs = 2, requires_smt = true, and default max_numa_nodes = 1 against the preset table:

  • tiny-1llc (1 LLC): excluded — below min_llcs
  • All non-SMT presets (tiny-2llc, odd-*, *-nosmt): excluded — requires_smt
  • near-max-llc (15 LLCs): excluded — above default max_llcs = 12
  • max-cpu (252 CPUs, 14 LLCs): excluded — above default max_cpus = 192 (also above default max_llcs = 12)
  • All numa* presets: excluded — above default max_numa_nodes = 1

Result: 6 of 24 presets survive (smt-2llc, smt-3llc, medium-4llc, medium-8llc, large-4llc, large-8llc). On aarch64, none survive — all aarch64 presets lack SMT.

Total variant count

The total number of gauntlet variants for a test is:

valid_presets × resolved_kernels

A test with 8 valid presets produces 8 gauntlet variants under a single-kernel run; passing two kernels (--kernel A --kernel B) doubles that to 16. The kernel dimension is contributed by cargo ktstr test / coverage / llvm-cov at the CLI surface (zero or one resolved kernels keeps the historical 3-segment name shape gauntlet/{name}/{preset}; two or more expands the gauntlet across kernels with an extra {kernel_label} segment). See Multi-kernel: kernel as a gauntlet dimension.

Tests that skip gauntlet

  • Entries with host_only = true never produce gauntlet variants (no VM to vary topology on). They also skip the kernel-dim multiplication under multi-kernel runs: a host_only test lists and runs once regardless of KTSTR_KERNEL_LIST cardinality, since a host-side test never observes the kernel directory and N copies of identical work would carry no signal. See host_only for how that flag is set, and Multi-kernel: kernel as a gauntlet dimension for the kernel-suffix dispatch contract.
  • Tests whose names start with demo_ are ignored by default; every gauntlet variant they would produce is ignored with them.

Snapshots

A snapshot is a frozen record of guest BPF map state and scheduler globals captured at a specific point in a scenario. The freeze coordinator pauses every vCPU long enough to walk the kernel’s BPF maps, BTF-render every captured value, and bundle the result into a FailureDumpReport keyed by a name you choose. Test code then reads it back via the Snapshot accessor for typed traversal.

Op::snapshot("name") is the on-demand capture trigger. Use it to ask “what does the scheduler look like right now?” at a precise point in the scenario. For automatic capture on a kernel write to a specific symbol, see Watch Snapshots. For cadenced capture across the workload window without invoking Op::snapshot from the scenario body, see Periodic Capture — it produces a time-ordered SampleSeries that flows into the temporal-assertion patterns (nondecreasing, rate_within, steady_within, converges_to, always_true, ratio_within).

Issuing a snapshot

Op::snapshot(name) is a single op in a Step’s op list. The executor invokes the active SnapshotBridge’s capture callback, which performs the freeze rendezvous and returns the report; the bridge stores the report under name.

use ktstr::prelude::*;

let steps = vec![Step {
    setup: vec![CgroupDef::named("workers").workers(2)].into(),
    ops: vec![
        Op::snapshot("after_spawn"),
        // ... other ops ...
        Op::snapshot("after_workload"),
    ],
    hold: HoldSpec::FULL,
}];
execute_steps(ctx, steps)?;

A scenario may issue any number of Op::snapshot ops with distinct names. Reusing a name overwrites the prior capture (and emits a tracing::warn!).

Wiring the bridge

The bridge is what turns an Op::snapshot into stored data. The host typically wires it before execute_steps runs, but a scenario can install one inline:

use ktstr::prelude::*;

let cb: CaptureCallback = std::sync::Arc::new(|_name: &str| {
    // Production: freeze the VM and build a real FailureDumpReport.
    // Tests: return a hand-crafted report so the executor + bridge
    // pipeline runs without booting a guest.
    Some(FailureDumpReport::default())
});
let bridge = SnapshotBridge::new(cb);
let bridge_handle = bridge.clone();
let _guard = bridge.set_thread_local();

execute_steps(ctx, steps)?;

let captured = bridge_handle.drain();
let report = captured.get("after_spawn").expect("snapshot recorded");

set_thread_local returns a BridgeGuard that restores the prior bridge on drop, so a nested scenario inside an outer one cannot leak its bridge into the outer scope. Bind the guard to an underscore-prefixed identifier such as _guard so the binding lives for the scope of the scenario — a bare let _ = bridge.set_thread_local() drops the guard immediately and clears the bridge before any op runs. The #[must_use] attribute only warns when the return value is discarded outright, not on a let _ binding.

If no bridge is installed, Op::snapshot is a no-op with a tracing::warn! and the scenario continues. If the capture callback returns None (capture pipeline unavailable), the bridge stays empty and the scenario continues. Existing scenarios that never declare snapshot ops keep working unchanged.

Reading the captured report

Snapshot::new(report) builds a borrowed view over a FailureDumpReport. The view does not copy the report; accessor methods walk the report in place and return further borrowed views.

Map-name lookup

let snap = Snapshot::new(report);
let map = snap.map("scx_per_task")?;        // SnapshotMap

Snapshot::map(name) returns Result<SnapshotMap, SnapshotError>. A miss yields SnapshotError::MapNotFound { requested, available } — the available list enumerates every captured map name so a typo surfaces in test output.

Top-level globals (.bss / .data / .rodata)

let nr_cpus = snap.var("nr_cpus_onln").as_u64()?;

Snapshot::var(name) walks every *.bss, *.data, and *.rodata global-section map for a top-level member named name and returns the unique match as a SnapshotField. Multiple matches yield SnapshotError::AmbiguousVar { requested, found_in } — disambiguate via Snapshot::map(name). A miss yields SnapshotError::VarNotFound { requested, available } with the union of every section’s top-level member names.
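When a global is ambiguous, the targeted form is a lookup against one specific section map, as the error suggests (the map name here is illustrative):

// AmbiguousVar recovery: name the section map explicitly.
let nr_cpus = snap.map("scx_obj.bss")?.at(0).get("nr_cpus_onln").as_u64()?;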

Entries inside a map

let map = snap.map("scx_per_task")?;
let first = map.at(0);                          // by ordinal index
let busy = map.find(|e| e.get("tid").as_i64().unwrap_or(-1) == 1234);
let busiest = map.max_by(|e| e.get("runtime_ns").as_u64().unwrap_or(0));
let all_active = map.filter(|e| e.get("runtime_ns").as_u64().unwrap_or(0) > 0);

SnapshotMap exposes:

  • at(n) — entry at ordinal index n. Out of range returns SnapshotEntry::Missing(SnapshotError::IndexOutOfRange).
  • find(predicate) — first matching entry. No match returns SnapshotEntry::Missing(SnapshotError::NoMatch { op: "find", ... }).
  • filter(predicate) — every matching entry collected into a Vec.
  • max_by(key_fn) — entry whose key_fn produces the maximum u64. Empty map returns Missing with op: "max_by".

Per-CPU maps

BPF_MAP_TYPE_PERCPU_ARRAY / _PERCPU_HASH / _LRU_PERCPU_HASH maps require narrowing to a CPU before reading individual values:

let map = snap.map("scx_pcpu")?;
let entry = map.cpu(1).at(0);                    // CPU 1's slot
let value = entry.get("").as_u64()?;             // empty path = root

SnapshotMap::cpu(n) narrows subsequent at / find calls to a specific CPU’s slot. An out-of-range CPU returns Missing with SnapshotError::PerCpuSlot { unmapped: false, len, ... }; an unmapped slot (None in the per-CPU vec) returns the same error variant with unmapped: true.

Calling entry.get(path) on a per-CPU entry without narrowing first surfaces SnapshotError::PerCpuNotNarrowed { map } — call .cpu(N) first.
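Putting the narrowing together — a sketch summing a scalar per-CPU slot across CPUs (a real test would take the CPU count from ctx.topo rather than a literal):

// Sum the scalar per-CPU value over the first four CPUs.
let mut total = 0u64;
for cpu in 0..4 {
    if let Ok(v) = snap.map("scx_pcpu")?.cpu(cpu).at(0).get("").as_u64() {
        total += v;
    }
}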

Field accessors and dotted paths

SnapshotEntry::get(path) and SnapshotField::get(path) walk the entry’s value side along a dotted path. Each component matches a struct member; pointer dereferences are followed transparently.

let weight = entry.get("ctx.weight").as_u64()?;
let policy = entry.get("ctx.policy").as_str()?;     // enum variant name
let pid    = entry.get("leader.pid").as_i64()?;     // pointer chase

The dotted-path walker:

  1. Pointer chase. When a path step lands on RenderedValue::Ptr { deref: Some(...) }, the walker transparently follows the dereference (up to 16 hops) before matching the next component. The test author writes the path the BTF would suggest; pointer indirection is invisible.

  2. Empty path. get("") returns the current value as a SnapshotField::Value — useful for terminal accessors on per-CPU slots that hold a scalar directly.

  3. Composability. Two-segment paths are equivalent to chained get calls: entry.get("ctx.weight") ≡ entry.get("ctx").get("weight").

    Note that Snapshot::var does not split — it treats the full string as one global name. To walk into a struct, use snap.var("ctx").get("weight").

Terminal accessors

SnapshotField exposes typed terminal reads, all returning Result<T, SnapshotError>:

  • as_u64() → u64 — accepts Uint, non-negative Int/Enum, Bool (0/1), Char (raw byte), Ptr (pointer value, including cast-recovered pointers — see Cast-recovered pointers), and per-CPU array keys.
  • as_i64() → i64 — accepts Int, Uint ≤ i64::MAX, Bool, Char, Enum, and per-CPU array keys.
  • as_bool() → bool — Bool directly; non-zero Int/Uint/Char/Enum/Ptr is true; accepts per-CPU array keys.
  • as_f64() → f64 — accepts Float, Int, Uint, Enum, and per-CPU array keys.
  • as_str() → &str — accepts an Enum with a resolved variant name.
  • rendered() → Option<&RenderedValue> — the underlying value when present.

Type mismatches surface as SnapshotError::TypeMismatch { expected, actual, requested } — for example, as_str() on a Uint reports expected: "Enum", actual: "Uint".

Cast-recovered pointers

Schedulers stash kernel pointers (task_struct *, cgroup *, …) and arena pointers in BPF map fields whose BTF declares them as u64 because BTF cannot express a pointer to a per-allocation type. The host-side cast analyzer walks the scheduler’s .bpf.o instruction stream during load, recovers the target struct for each provable (source_struct, field_offset) → target_struct mapping, and feeds the result into the renderer.

When the renderer encounters a u64 slot the analyzer flagged, it emits a RenderedValue::Ptr with cast_annotation set and chases the dereference through the address-space-appropriate reader. The full set of cast_annotation values:

  • "cast→arena" — cast analyzer flagged a u64 field; the chase resolved to an arena allocation via the BTF-typed pointee.
  • "cast→kernel" — cast analyzer flagged a u64 field; the chase resolved to a kernel slab / vmalloc / per-cpu allocation.
  • "sdt_alloc" — BTF-typed Type::Ptr whose pointee was a BTF_KIND_FWD; the renderer recovered the real payload struct id via the sdt_alloc bridge. No cast-analyzer hit was involved.
  • "cast→arena (sdt_alloc)" — cast analyzer flagged a u64 field AND the chase target peeled to a Fwd; the bridge recovered the real arena payload struct id.
  • "cast→kernel (sdt_alloc)" — cast analyzer flagged a u64 field AND the chase target peeled to a Fwd; the bridge recovered the real kernel-side struct id.

A parallel cross-BTF Fwd resolution path is consulted whenever a chase target survives the local same-BTF Fwd resolve as a BTF_KIND_FWD: when the body lives in a sibling embedded BPF object’s BTF (the multi-.bpf.objs shape), the renderer switches the recursion to that sibling BTF and renders the full body. Cross-BTF resolution does NOT add a new annotation — the body is recovered transparently and the rendered subtree carries whichever annotation ("cast→arena", "cast→kernel", or None for a BTF-typed Type::Ptr) it would have had if the same struct lived in the entry BTF.

From the test author’s perspective:

  • as_u64() returns the raw pointer value (matching pre-analysis behavior, so existing tests do not need updating).
  • entry.get("ctx.task") and similar dotted-path walks transparently follow the cast-recovered chase; nested struct fields appear under the same path the BTF would suggest for a natively-typed pointer.
  • The cast_annotation is visible in failure-dump rendering and diagnostic output so an operator can distinguish cast-recovered pointers from BTF-typed ones; the test API does not require any extra calls to consume them.

Error handling

SnapshotError is the unified error type for every fallible accessor. Each variant carries the path or available alternatives needed to fix the call site without re-running the test:

  • MapNotFound { requested, available }Snapshot::map(name) miss.
  • VarNotFound { requested, available }Snapshot::var(name) miss.
  • AmbiguousVar { requested, found_in } — more than one *.bss/*.data/*.rodata map exposes a top-level member with the requested name. found_in lists every map (in capture order) where the name was seen; disambiguate via Snapshot::map(name) + .at(0).get(...) against a specific map.
  • FieldNotFound { requested, walked, component, available } — a path component did not match any struct member at that depth. walked is the prefix that resolved successfully; component is the failing segment; requested is the original user-supplied path.
  • NotAStruct { requested, walked, component, kind } — a path component reached a non-struct value where a struct was expected (e.g. descending into a Uint leaf). kind names the actual variant.
  • TypeMismatch { expected, actual, requested } — terminal accessor called on a rendered shape it cannot decode. expected names the scalar type the accessor requires; actual names the rendered variant; requested is the user-supplied lookup string (empty when the accessor was invoked on a leaf without a path walk).
  • IndexOutOfRange { map, index, len }SnapshotMap::at(n) past the entry list end.
  • PerCpuSlot { map, cpu, len, unmapped } — out-of-range or unmapped per-CPU slot; unmapped: true distinguishes a None slot from an out-of-range CPU.
  • NoMatch { map, op } — predicate-based lookup (find, max_by) found no match. op names the operation.
  • EmptyPathComponent { requested } — a path string contained an empty component (e.g. "a..b").
  • PerCpuNotNarrowed { map }entry.get called on a per-CPU entry without cpu(N) first.
  • NoRendered { map, side } — entry has no rendered key/value side (BTF type id missing at capture time, leaving hex bytes only).
  • PlaceholderSample { tag, reason } — a periodic-capture sample’s underlying FailureDumpReport is a placeholder produced by the freeze-rendezvous timeout fallback. Surfaces when projecting via SampleSeries::bpf; temporal patterns route the variant through their skip path so a placeholder never falsely registers as zero progress against a monotonicity / rate / steady / ratio band. reason carries the rendezvous-timeout cause text.
  • MissingStats { tag } — a SampleSeries::stats projection ran on a sample whose stats slot is None (stats client not wired or per-sample stats request failed). Distinct from in-JSON path misses (FieldNotFound / TypeMismatch) so the assertion site can branch on the cause without re-walking the source.

SnapshotError implements std::error::Error and Display, so it composes with ? and anyhow. The Display impl includes the path and any available alternatives so a failure message points the test author at the fix.

Worked example

Capture a snapshot, look up a map, walk into its first entry, and read a nested field:

use ktstr::prelude::*;

fn snapshot_then_inspect(ctx: &Ctx) -> Result<AssertResult> {
    // Wire a bridge for the duration of the scenario.
    let cb: CaptureCallback = std::sync::Arc::new(|_name| {
        // Production: freeze + build a real FailureDumpReport. The
        // host installs this callback in real runs.
        Some(FailureDumpReport::default())
    });
    let bridge = SnapshotBridge::new(cb);
    let handle = bridge.clone();
    let _guard = bridge.set_thread_local();

    // Run the scenario, capturing once after spawn.
    let steps = vec![Step {
        setup: vec![CgroupDef::named("workers").workers(2)].into(),
        ops: vec![Op::snapshot("after_spawn")],
        hold: HoldSpec::FULL,
    }];
    let mut result = execute_steps(ctx, steps)?;

    // Drain the bridge and inspect the captured report.
    let captured = handle.drain();
    let report = captured
        .get("after_spawn")
        .ok_or_else(|| anyhow::anyhow!("snapshot 'after_spawn' missing"))?;
    let snap = Snapshot::new(report);

    // Top-level scalar.
    if let Ok(nr_cpus) = snap.var("nr_cpus_onln").as_u64() {
        result.details.push(AssertDetail::new(
            DetailKind::Other,
            format!("captured nr_cpus_onln = {nr_cpus}"),
        ));
    }

    Ok(result)
}

For the executor + bridge wiring outside a VM, see the host-side smoke tests in tests/snapshot_e2e.rs — they exercise the same pipeline against a hand-crafted FailureDumpReport so the assertion shape is covered without booting a guest.

Composing reads with writes

Snapshots are the read half of the host↔guest interaction. The write half — pre-seeding a BPF map value before the scenario starts — is the #[ktstr_test] attribute bpf_map_write = CONST, which targets a BpfMapWrite constant:

use ktstr::prelude::*;

const TRIGGER_FAULT: BpfMapWrite = BpfMapWrite {
    map_name_suffix: ".bss",   // matched against discovered maps
    offset: 42,                // byte offset within the map's value
    value: 1,                  // u32 written by the host
};

#[ktstr_test(bpf_map_write = TRIGGER_FAULT, expect_err = true)]
fn fault_then_inspect(ctx: &Ctx) -> Result<AssertResult> {
    // The host has already written `1` at `.bss + 42` before
    // the scenario started. Capture and inspect the resulting
    // scheduler state mid-run.
    /* bridge wiring + Op::snapshot + Snapshot::new as above */
    Ok(AssertResult::pass())
}

The write is event-driven: the host polls for BPF map discoverability (scheduler loaded), polls the SHM ring for scenario start, then writes the configured u32 at the configured offset. Only BPF_MAP_TYPE_ARRAY maps are supported; the framework finds the map by map_name_suffix (e.g. ".bss") via BpfMapAccessor::find_map. See Monitor → BPF map writes for the prerequisites (vmlinux) and the full host-side contract.

Read+write workflows then compose naturally: the test pre-seeds guest state with bpf_map_write, lets the scheduler run, and asserts on the resulting state with Op::snapshot + the Snapshot accessor:

  1. Write (pre-scenario)bpf_map_write flips a .bss flag the scheduler reads.
  2. Run — the scenario’s ops drive workload behavior; the scheduler reacts to the flag.
  3. Read (mid-scenario)Op::snapshot("after") captures the scheduler state at the chosen point.
  4. AssertSnapshot::var(...).as_u64() / Snapshot::map(...).find(...).get(...).as_*() verifies the reaction. Errors carry the available alternatives so a typo or stale field name surfaces before the test author hand-edits the case.

The write side is a single one-shot poke at scheduler-load time; there is no Op variant for runtime writes. Ergonomic mid-scenario state mutation is reserved for cases where the scheduler itself exports a writable interface (sysfs, debugfs, BPF map command interface) and the test invokes that interface from a workload process.

Watch Snapshots

A watch snapshot registers a hardware data-write watchpoint on a named kernel symbol. The host arms a watchpoint slot via the guest’s hardware debug facilities; the produced captures share the Snapshot accessor surface documented in Snapshots.

Watch snapshots are supported on x86_64 and aarch64 KVM hosts. The slot terminology below is arch-neutral — each architecture’s KVM plumbing maps the slots onto its native hardware-watchpoint facility (debug registers on x86_64, hardware watchpoints on aarch64).

Op::watch_snapshot("symbol") is the write-driven capture trigger. Use it when the question is “what does the scheduler look like whenever the kernel touches X?” rather than “what does it look like at this point in my scenario?”. For time-driven capture, use Op::snapshot instead.

How it works

The full pipeline is implemented and tested end-to-end:

  1. Op::watch_snapshot(symbol) registers the symbol via the virtio-console port 1 MSG_TYPE_SNAPSHOT_REQUEST TLV frame.
  2. The freeze coordinator resolves the KVA from the vmlinux ELF, validates 4-byte alignment, and arms a free user watchpoint slot via KVM_SET_GUEST_DEBUG.
  3. When the guest writes to the watched address, the corresponding debug exit fires and the host identifies which slot tripped.
  4. The coordinator captures via freeze_and_capture and stores the report in the SnapshotBridge under the symbol tag.
  5. The report is also mirrored to a sidecar JSON file for post-hoc inspection.

The per-scenario cap of MAX_WATCH_SNAPSHOTS (= 3) is enforced (slot 0 is reserved for the error-class exit_kind trigger; the remaining 3 slots are available for user watches). A 4th Op::watch_snapshot fails the step with a “cap exceeded” message. Symbol-resolution failures bail immediately so a typo surfaces visibly.

Op::watch_snapshot covers the full pipeline: registration, cap enforcement, symbol resolution, hardware arming, and automatic capture on write.

Issuing a watch

use ktstr::prelude::*;

let steps = vec![Step {
    setup: vec![CgroupDef::named("workers").workers(2)].into(),
    ops: vec![
        Op::watch_snapshot("jiffies_64"),
        Op::watch_snapshot("scx_watchdog_timestamp"),
    ],
    hold: HoldSpec::FULL,
}];
execute_steps(ctx, steps)?;

Each Op::watch_snapshot invokes the active SnapshotBridge’s register_watch callback with the symbol string. On success, the callback is responsible for arming a hardware watchpoint that will fire whenever the guest writes to the symbol’s address. Each fire produces one capture, tagged with the symbol path itself.

Wiring the bridge

In #[ktstr_test] scenarios that boot a VM, the bridge is wired automatically. Use post_vm to read captures from VmResult::snapshot_bridge. Do not install a thread-local bridge inside the scenario function — the in-VM Op::watch_snapshot registers via the virtio-console port 1 MSG_TYPE_SNAPSHOT_REQUEST TLV frame, the host coordinator arms the watchpoint and stores captures on the bridge it owns, and the test reads them after the VM exits.

The set_thread_local pattern below is for host-side unit tests that exercise the executor in process without booting a guest.

A watch-capable bridge for host-side unit tests needs both a capture callback and a register_watch callback:

use ktstr::prelude::*;

let cb: CaptureCallback = std::sync::Arc::new(|_name| {
    Some(FailureDumpReport::default())
});
let reg: WatchRegisterCallback = std::sync::Arc::new(|symbol: &str| {
    // Host-side unit tests: record the symbol and return Ok. In a
    // booted VM, the host coordinator's pipeline runs instead —
    // see arm_user_watchpoint in src/vmm/freeze_coord.rs.
    println!("would arm watchpoint on {symbol}");
    Ok(())
});

let bridge = SnapshotBridge::new(cb).with_watch_register(reg);
let _guard = bridge.set_thread_local();

A bridge built only with SnapshotBridge::new(cb) (no with_watch_register) rejects every Op::watch_snapshot with an error pointing the operator at the missing wiring.

Symbol resolution

Production resolution is a verbatim match against the vmlinux ELF symbol table. The freeze coordinator walks Elf::syms and accepts the symbol whose strtab entry equals the requested string byte-for-byte — there is no prefix stripping, BTF lookup, kallsyms walk, or per-CPU offset arithmetic. Use the exact name nm vmlinux would print:

  • "jiffies_64" — the kernel’s monotonic tick counter.
  • "scx_watchdog_timestamp" — sched_ext’s watchdog timestamp.

Warning: high-frequency symbols soft-lock the guest. Watching a symbol that the kernel writes every jiffy (e.g. jiffies_64 at HZ=1000) fires 1000+ captures per second. Each capture freezes all vCPUs for the full dump pipeline. The guest spends almost all of its wall time paused, which is indistinguishable from a soft lock-up — schedulers stall, watchdogs fire, and the test wedges before any meaningful work runs. Pick symbols the kernel writes at scenario-relevant cadence (a state field, a per-event counter), not on every tick.

The string passed to Op::watch_snapshot must match a vmlinux ELF symtab entry exactly; otherwise the step fails with symbol '...' not found in vmlinux symtab. The register_watch callback on a host-side test bridge can accept any shape it wants — the e2e tests in tests/snapshot_e2e.rs use "kernel.a" / "kernel.b" / etc. for the cap-enforcement test and "exit_kind" for the in-VM test — but the Op::watch_snapshot ops that flow through the production pipeline (in-VM scenarios with no host-side bridge override) must use a verbatim ELF symbol.

Maximum of 3 watches per scenario

pub const MAX_WATCH_SNAPSHOTS: usize = 3;

The bridge enforces a per-scenario cap of 3 successfully-registered watches. The number is tied to the per-vCPU hardware-watchpoint slots KVM exposes via KVM_SET_GUEST_DEBUG: slot 0 is reserved for the existing *scx_root->exit_kind watchpoint that drives the error-class freeze trigger; the remaining three user watchpoint slots are available for on-demand watches.

A 4th Op::watch_snapshot in the same scenario fails the step with a “cap exceeded” message:

let steps = vec![Step {
    setup: vec![CgroupDef::named("cg").workers(2)].into(),
    ops: vec![
        Op::watch_snapshot("kernel.a"),
        Op::watch_snapshot("kernel.b"),
        Op::watch_snapshot("kernel.c"),
        Op::watch_snapshot("kernel.d"),  // <-- cap exceeded
    ],
    hold: HoldSpec::FULL,
}];
let result = execute_steps(ctx, steps)?;
assert!(!result.passed);
// One AssertDetail carries the cap-exceeded message:
//   "Op::WatchSnapshot cap exceeded: scenario already registered 3
//    watchpoints (3 user watchpoint slots occupied; slot 0 reserved for the
//    error-class exit_kind trigger)..."

A failed register (cap exceeded, callback error, missing register_watch) does not consume a slot. The bridge rolls the count back so the scenario can keep trying with different symbols up to the cap.

Failure modes

The register callback is the single integration point where production resolution can fail. The reasons documented on WatchRegisterCallback:

  • The symbol does not match any vmlinux ELF symtab entry (typo, symbol stripped from the build, or a non-ELF kernel image).
  • The resolved KVA is not 4-byte aligned (the 4-byte watch length the framework arms requires addr & 0x3 == 0 on every supported architecture).
  • All three available user watchpoint slots are already allocated inside the host’s KVM plumbing.
  • KVM_SET_GUEST_DEBUG rejected the arm.

When the callback returns Err(reason), the executor bails the step immediately with a message containing the symbol and the failure reason. Silent degradation is deliberately avoided — a watch that never fires would look identical to a healthy passing run, and the test author would never notice the captures were missing.

Slot 0 (exit_kind) is separate

The existing error-class freeze trigger watches *scx_root->exit_kind on slot 0 and is not an Op::watch_snapshot slot. It is wired by the freeze coordinator independently to detect SCX_EXIT_ERROR writes and drive the failure-dump pipeline. That trigger is unrelated to the on-demand watch surface — it always runs, regardless of whether a scenario declares any Op::watch_snapshot ops. The cap of 3 reflects the three remaining user slots after slot 0 is held back.

For tests that want the failure dump produced by SCX_EXIT_ERROR, nothing needs to be declared; the trigger fires automatically and the dump is written to {sidecar_dir()}/{test_name}.failure-dump.json. The watch-snapshot in-VM test in tests/snapshot_e2e.rs reads that file back and feeds it through the Snapshot accessor as a way to demonstrate the full read path.

Reading captures

Once a watchpoint fires, the resulting report is stored on the bridge under the tag and read back exactly as Op::snapshot captures are. Every accessor — Snapshot::map, Snapshot::var, SnapshotMap::at / find / filter / max_by, dotted-path walks, typed terminal reads — is shared. See Snapshots for the full surface.

Periodic Capture

Op::snapshot is on-demand — the test author picks the moment of capture. Periodic capture is the cadenced complement: the freeze coordinator fires freeze_and_capture(false) at evenly-spaced points across the workload window without the scenario body asking. The result is a time-ordered series of (report, stats, elapsed_ms) samples that flows naturally into the temporal-assertion patterns.

Enabling periodic capture

Set num_snapshots = N on the #[ktstr_test] attribute. N is the number of interior boundaries to fire; 0 (the default) disables periodic capture entirely.

use ktstr::prelude::*;

#[ktstr_test(num_snapshots = 3, duration_s = 10)]
fn paced_capture(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("workers").workers(2).work_type(WorkType::SpinWait),
    ])
}

When boundaries fire

The window is the 10 %–90 % slice of the workload duration, anchored at the first MSG_TYPE_SCENARIO_START the freeze coordinator observes. A 10 % pre-buffer at the start (workload ramp-up) and a 10 % post-buffer at the end (ramp-down) keep periodic samples off transient state.

The remaining 80 % is divided into N + 1 equal intervals, yielding N interior boundary points:

With d as the workload duration, the boundary timestamps relative to scenario start are:

  • num_snapshots = 1 — 0.5·d (the midpoint).
  • num_snapshots = 3 — 0.3·d, 0.5·d, 0.7·d.
  • any N — 0.1·d + (i+1)·0.8·d / (N+1) for i ∈ 0..N (the rows above are instances).

For a 10 s workload, N = 3 produces captures at scenario_start + {3 s, 5 s, 7 s}.

Anchoring at MSG_TYPE_SCENARIO_START means VM boot, BPF verifier time, and any other pre-scenario work do NOT eat the budget — every boundary lands inside the workload’s actual run window.

MSG_TYPE_SCENARIO_PAUSE / MSG_TYPE_SCENARIO_RESUME from the guest shift every un-fired boundary by the cumulative pause duration. The boundary clock is workload time, not wall-clock: a guest that pauses for P ns delays each remaining boundary by P ns.
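For the 10 s, N = 3 example above, a guest that pauses for 1 s before the first boundary fires shifts the captures from scenario_start + {3 s, 5 s, 7 s} to {4 s, 6 s, 8 s}.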

Tag namespace

Each periodic capture is stored on the host’s SnapshotBridge under "periodic_NNN" — zero-padded 3-digit ordinal index, e.g. periodic_000, periodic_001, periodic_002. The width is fixed at 3 digits because the bridge cap (see below) maxes out at MAX_STORED_SNAPSHOTS (= 64 today), so 3 digits always suffices.

Periodic tags coexist with on-demand Op::snapshot tags and watchpoint-fire tags on the same bridge. Use SampleSeries::periodic_only() (or periodic_ref() for the borrowed equivalent) to filter to the periodic timeline before assertions; see SampleSeries in Temporal Assertions.

Capture cost

Each periodic boundary fires the same freeze_and_capture(false) path that Op::Snapshot dispatches:

  1. Every vCPU is parked under FREEZE_RENDEZVOUS_TIMEOUT (30 s hard ceiling).
  2. BPF maps are walked.
  3. The dump is serialised to JSON.
  4. The report is stored on the bridge.

On a healthy guest with a typical scheduler-state map size, the freeze is tens of milliseconds (10–100 ms steady state; cold-cache or large guest-memory walks can push higher). The host-side watchdog deadline is extended by the freeze duration after each fire, so periodic captures do not eat into the workload’s wall-clock budget.

Minimum spacing

KtstrTestEntry::validate rejects entries where the per-boundary interval is below 100 ms — boundaries scheduled closer than that would fire back-to-back without any workload progress in between. The exact rule: 0.8 · duration / (N + 1) >= 100 ms. Either reduce num_snapshots or extend duration_s if validation refuses the configuration.
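For example, num_snapshots = 3 with duration_s = 10 gives 0.8 · 10 s / 4 = 2 s per interval and passes validation; num_snapshots = 15 with duration_s = 1 gives 0.8 s / 16 = 50 ms and is rejected.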

Bridge cap

num_snapshots cannot exceed MAX_STORED_SNAPSHOTS (= 64). Validation rejects higher values rather than silently FIFO-evicting the earliest periodic samples. Split into multiple test entries if a longer timeline is needed.

Best-effort delivery

Up to N captures fire, but the run-loop stops servicing periodic boundaries the moment the kill flag fires. An early VM exit, BSP done, rendezvous timeout, or watchdog deadline can cut the periodic sequence short. Tests should assert result.periodic_fired >= some_lower_bound rather than equality:

fn check_coverage(result: &VmResult) -> Result<()> {
    anyhow::ensure!(
        result.periodic_target == 3,
        "expected num_snapshots = 3, got {}",
        result.periodic_target,
    );
    anyhow::ensure!(
        result.periodic_fired >= 2,
        "too few periodic samples ({}/{})",
        result.periodic_fired,
        result.periodic_target,
    );
    Ok(())
}

result.periodic_target mirrors the configured num_snapshots; result.periodic_fired is the count actually serviced (including rendezvous-timeout placeholders). The pair lets a test compute coverage without re-reading the entry table.

The run-loop additionally abandons the remaining sequence after 2 consecutive rendezvous timeouts and emits a tracing::warn! naming the consecutive-timeout count, so a sustained host overload does not pile up dozens of placeholder samples.

Op::snapshot captures composed by the test author land on the same bridge alongside the periodic_NNN tags; total bridge occupancy is num_snapshots + user_captures and the bridge FIFO-evicts past MAX_STORED_SNAPSHOTS.

Draining the bridge

The temporal-assertion pipeline runs on the host, so the drain happens after vm.run() returns — typically inside a post_vm callback. Use SnapshotBridge::drain_ordered_with_stats to take ownership of the captured (tag, report, stats, elapsed_ms) tuples in insertion order:

use ktstr::prelude::*;

fn post_vm(result: &VmResult) -> Result<()> {
    let series = SampleSeries::from_drained(
        result.snapshot_bridge.drain_ordered_with_stats(),
    )
    .periodic_only();

    anyhow::ensure!(
        !series.is_empty(),
        "no periodic samples — coordinator never fired",
    );

    // ... walk samples or feed into temporal patterns ...
    Ok(())
}

#[ktstr_test(num_snapshots = 3, duration_s = 10, post_vm = post_vm)]
fn my_test(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("workers").workers(2).work_type(WorkType::SpinWait),
    ])
}

drain_ordered_with_stats returns a Vec<(String, FailureDumpReport, Option<serde_json::Value>, Option<u64>)> in the order store() saw inserts. Periodic boundaries land periodic_000 first, periodic_NNN last. The FIFO eviction at MAX_STORED_SNAPSHOTS drops the oldest tags from order and reports together, so a hot run that overflowed the cap returns the most recent MAX_STORED_SNAPSHOTS captures in insertion order.

drain_ordered (without _with_stats) drops the parallel stats / elapsed metadata; use it only when the test does not need either. drain (no ordering, no stats) returns a HashMap and loses the periodic timeline ordering — avoid for periodic data.

Sample anatomy

Each drained tuple unpacks into a Sample<'_> view (via SampleSeries::iter_samples):

for sample in series.iter_samples() {
    let tag: &str          = sample.tag;          // e.g. "periodic_001"
    let elapsed_ms: u64    = sample.elapsed_ms;   // ms since run_start
    let snap: Snapshot<'_> = sample.snapshot;     // BPF state view
    let stats: Option<&serde_json::Value> = sample.stats; // scx_stats JSON
    // ...
}

elapsed_ms is pause-adjusted: the coordinator subtracts cumulative MSG_TYPE_SCENARIO_PAUSE/RESUME time (and any in-flight pause window) before stamping the value. The timestamp is captured AFTER the scx_stats request returns (or fails) and BEFORE entering the freeze rendezvous, so elapsed_ms reflects when the running scheduler’s stats were observed; BPF state is observed up to FREEZE_RENDEZVOUS_TIMEOUT later than that anchor.

stats is None when the stats client was not wired (scheduler_binary is absent), or the per-sample stats request failed (relay rejected, non-zero envelope errno, scheduler not yet listening). A None slot surfaces through SampleSeries::stats as a SnapshotError::MissingStats { tag } per-sample error — distinct from in-JSON path misses so the assertion site can branch on the cause.

A sample whose underlying FailureDumpReport is a placeholder (rendezvous timeout fallback) surfaces through SampleSeries::bpf as a SnapshotError::PlaceholderSample { tag, reason } per-sample error rather than passing a hollow Snapshot to the projection closure.

What to assert

The standard shape is two-stage:

  1. Compose the series — drain, filter to periodic.
  2. Project + assert — pick a column, choose a temporal pattern.

For monotonic counters (BPF .bss advancement, scx_stats counter fields), nondecreasing is the canonical choice. For utilisation-style metrics that should hold steady once warmup ends, steady_within(warmup_ms, tolerance) (f64 only) captures the invariant. For “system stabilizes near target”, converges_to(target, tolerance, deadline_ms) (f64 only) witnesses the convergence.

For the full pattern surface, projection helpers, and failure rendering, see Temporal Assertions.

Temporal Assertions

Periodic snapshots produce a series of samples over time. Temporal assertions answer questions about the trajectory — does a counter only ever advance? Does a utilization metric stay near its mean once warmup ends? Does a load average converge before a deadline?

The shape is two-stage:

  1. Build a SampleSeries from the bridge’s drained periodic captures.
  2. Project a SeriesField<T> — one column of T-typed values across every sample — and feed it through a temporal pattern (nondecreasing, rate_within, steady_within, converges_to, always_true, ratio_within).

Each pattern records DetailKind::Temporal details on the Verdict when a sample violates the invariant, and records Notes when projection errors leave a coverage gap; see Failure rendering.

For how to enable periodic capture and drain the bridge, see Periodic Capture. This page covers the projection + assertion surface only.

SampleSeries

SampleSeries is the ordered sequence of (tag, report, stats, elapsed_ms) tuples drained from the bridge after the VM exits. Build it from SnapshotBridge::drain_ordered_with_stats (see Periodic Capture for the bridge wiring):

use ktstr::prelude::*;

let drained = vm_result.snapshot_bridge.drain_ordered_with_stats();
let series = SampleSeries::from_drained(drained).periodic_only();

periodic_only() filters to entries whose tag begins with "periodic_" — it strips on-demand Op::snapshot captures and watchpoint fires that share the bridge’s tag namespace. Use periodic_ref() for the borrowed-iterator equivalent when one test needs both views from the same series.

SampleSeries exposes:

  • len(), is_empty() — sample count.
  • iter_samples() — borrowed Sample<'_> views (each carrying tag, elapsed_ms, Snapshot<'_>, Option<&Value> stats).
  • bpf(label, |snap| …) / stats(label, |sv| …) — manual closure projection along the BPF or stats axis.
  • bpf_map(map_name) / stats_path(path) — typed auto-projection helpers (see Auto-projection).

SeriesField

A SeriesField<T> is one per-sample column extracted from a SampleSeries. Each slot is a SnapshotResult<T> so a missing field, type mismatch, or placeholder report on any individual sample does NOT abort the whole projection — it surfaces at the temporal-assertion site as a per-sample error the pattern decides how to handle.

The field carries the per-sample tags and elapsed-ms timestamps alongside the values, so failure messages name the offending sample without the caller re-threading the source series.

Projecting from BPF state

The SampleSeries::bpf closure receives each sample’s Snapshot<'_>:

let nr_dispatched: SeriesField<u64> = series.bpf(
    "nr_dispatched",
    |snap| snap.var("nr_dispatched").as_u64(),
);

The closure body is a normal Snapshot accessor expression; its SnapshotResult<T> return value lands directly in the field.

Projecting from scx_stats JSON

The SampleSeries::stats closure receives each sample’s StatsValue<'_> — a thin wrapper around the per-sample stats JSON exposing path("…").as_u64() / as_f64() etc.:

let busy: SeriesField<f64> = series.stats(
    "busy",
    |sv| sv.path("busy").as_f64(),
);

A sample whose stats slot is None (the stats request failed, the relay rejected it, or the scheduler binary isn’t wired) yields a SnapshotError::MissingStats { tag } slot — distinct from an in-JSON path miss (FieldNotFound / TypeMismatch), so the assertion site can distinguish coverage gaps from data errors.

Auto-projection

The typed auto-projectors discover available field names and emit ready-to-feed SeriesFields without an explicit closure:

// Top-level scalar member of a BPF map's first entry.
let dispatched = series
    .bpf_map("scx_obj.bss")
    .at(0)
    .field_u64("nr_dispatched");

// Stats path drilling into nested layer/cgroup keys.
let layer_util = series
    .stats_path("layers")
    .key("batch")
    .field_f64("util");

Bulk discovery is also available — member_names() / u64_fields() / f64_fields() on the BPF projector, key_names() / u64_fields() / f64_fields() on the stats projector. The *_fields() helpers project every member that yields at least one Ok value across the series, dropping non-numeric / type-mismatched fields silently. Useful for blanket “every counter must be nondecreasing” sweeps.

The typed field_* helpers cover top-level scalar fields only. Nested struct members (e.g. "ctx.weight") and per-CPU maps need the manual closure path through SampleSeries::bpf.
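
A blanket sweep then looks like the sketch below, assuming u64_fields() yields an iterable of SeriesField<u64> (the shape the bulk-discovery contract above implies):

// Every u64 member discovered in the map must never regress.
let mut v = Verdict::new();
for field in series.bpf_map("scx_obj.bss").u64_fields() {
    field.nondecreasing(&mut v);
}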

The six temporal patterns

Every pattern takes &mut Verdict and returns the same &mut Verdict so chains of assertions stack onto one accumulator. Each pattern is a method on SeriesField:

nondecreasing / strictly_increasing

Pass when every consecutive pair satisfies values[i] <= values[i+1] (or <, for the strict variant). The common shape for kernel counters whose only legal direction is up.

let mut v = Verdict::new();
nr_dispatched.nondecreasing(&mut v);
nr_dispatched.strictly_increasing(&mut v); // require advance every period

Per-sample projection errors are SKIPPED — the affected pair is dropped, the skip count is logged as a verdict Note, and the verdict is NOT flipped on missing-data conditions. Adjacent samples on either side of a gap are still checked. A series with fewer than 2 samples records a Note (“vacuously holds”) and passes.

rate_within(lo, hi) (f64 only)

Pass when every consecutive (delta_value / delta_ms) lies in [lo, hi]. Rate is computed from per-sample elapsed-ms timestamps, so a counter that should advance at ~1 unit/ms reads as rate_within(0.5, 2.0).

let ticks: SeriesField<f64> = series.bpf("ticks",
    |snap| snap.var("ticks").as_f64());
ticks.rate_within(&mut v, 0.5, 2.0);

Failure modes:

  • A zero-time delta between adjacent samples records a structured detail naming the offending pair.
  • A non-finite rate (NaN / Inf endpoints, or a finite difference that overflows f64) records a non-finite rate detail rather than silently slipping past the band check.
  • Caller error (lo > hi) lands as a single detail.

Per-sample projection errors are GAPS — no rate is computed across the gap, the skip count is logged as a Note with the underlying error variant.

steady_within(warmup_ms, tolerance) (f64 only)

Pass when every post-warmup sample (elapsed_ms >= warmup_ms) lies inside [mean·(1-tolerance), mean·(1+tolerance)]. The mean is computed over the post-warmup samples only — the warmup region is excluded so ramp-up does not bias the steady-state baseline. tolerance is a fraction (0.10 = ±10%).

let util: SeriesField<f64> = series.stats("busy",
    |sv| sv.path("busy").as_f64());
util.steady_within(&mut v, /*warmup_ms=*/ 1000, /*tolerance=*/ 0.10);

Per-sample projection errors are SKIPPED with a Note. When the warmup window absorbs every sample, the pattern emits a “no samples beyond warmup” Note and passes vacuously.

converges_to(target, tolerance, deadline_ms) (f64 only)

Pass when three consecutive samples land inside [target - tolerance, target + tolerance] AT OR BEFORE deadline_ms. The intent is “the system stabilizes near target by the deadline” — three consecutive in-band samples are the convergence-witness shape.

load.converges_to(&mut v, /*target=*/ 1.0, /*tol=*/ 0.5, /*deadline_ms=*/ 5_000);

Distinct outcomes:

  • Witness found — pass.
  • No witness before deadline — DetailKind::Temporal failure naming the sample count evaluated. If errored samples interrupted in-progress runs, the failure message lists them.
  • Insufficient samples — fewer than 3 successfully-projected samples in the deadline window. Records a Note (NOT a verdict failure); absence of data is a coverage gap, not a negative finding. The note distinguishes “did not collect enough samples” from “collected enough samples but never converged”.

always_true (bool only)

Pass when every sample’s value is true. Per-sample projection errors FAIL the assertion (this is a strict pattern — a missing boolean is a coverage gap that must surface).

let alive: SeriesField<bool> = series.bpf("scheduler_alive",
    |snap| snap.var("scheduler_alive").as_bool());
alive.always_true(&mut v);

ratio_within(other, lo, hi) (f64 only)

Pass when every per-index (self_value / other_value) lies in [lo, hi] — the two series are walked in lock-step at indices 0..N, comparing self[i] / other[i]. Cross-field correlation across two same-length series.

util.ratio_within(&mut v, &runtime, 0.4, 0.6);

A length mismatch fires a single caller-error detail and aborts the comparison. A sample where rhs == 0 records a “cannot compute ratio” detail naming the sample; out-of-band ratios record a structured detail with the lhs/rhs values. Per-sample projection errors on either side are SKIPPED with a Note listing each gap and which side errored.

Per-sample scalar checks: each

The temporal patterns are aggregate. For per-sample scalar bounds (>=, <=, lo..=hi), bypass the patterns via SeriesField::each:

nr_dispatched.each(&mut v).at_least(1u64);
util.each(&mut v).between(0.0_f64, 100.0_f64);
ticks.each(&mut v).at_most(10_000.0_f64);

each runs the comparator on every successfully-projected sample independently. The first failure records a detail; subsequent failures pile on so the timeline shows every offending sample, not just the first.

Per-sample projection errors record a detail and flip the verdict — each is strict (matches always_true’s policy). NaN samples report an incomparable failure naming the sample distinctly: without this branch, IEEE-754 < against NaN is always false, so a NaN sample would silently pass value < floor / value > ceiling checks.
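
The NaN hazard is plain IEEE-754 behavior, easy to confirm in isolation:

// Every ordered comparison against NaN is false, so without the dedicated
// incomparable branch a NaN sample would evade both bound checks.
let v = f64::NAN;
assert!(!(v < 0.0));   // not "below floor"
assert!(!(v > 100.0)); // not "above ceiling" either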

Failure rendering

Every temporal failure carries the field’s label, the pattern name, and the offending sample’s tag + elapsed_ms. A nondecreasing regression at sample periodic_004 (+850 ms) reads:

nr_dispatched (nondecreasing): regression at sample periodic_004 (+850ms): \
    value 100 after prior value 200 at sample periodic_003 (+700ms)

Coverage Notes render WITH the per-sample error variant so the operator can tell PlaceholderSample (rendezvous timeout), MissingStats (stats request failed), FieldNotFound (typo / wrong map), and TypeMismatch apart without re-running under a debugger:

nr_dispatched (nondecreasing): skipped 1 sample(s) with projection errors: \
    periodic_002(+500ms): snapshot has no global variable 'nrdispatch' \
    in any *.bss/*.data/*.rodata map (available globals: ["nr_dispatched", \
    "stall"])

Worked example

The temporal-assertion pipeline draining the bridge runs on the host, not inside the guest. #[ktstr_test(post_vm = …)] registers a host-side callback that receives the VmResult after vm.run() returns; the callback drains the bridge and walks the resulting series:

use ktstr::prelude::*;

fn assert_temporal_patterns(result: &VmResult) -> Result<()> {
    let series = SampleSeries::from_drained(
        result.snapshot_bridge.drain_ordered_with_stats(),
    )
    .periodic_only();

    let mut v = Verdict::new();

    // BPF axis: counter must never regress across periodic boundaries.
    let nr_dispatched: SeriesField<u64> = series.bpf(
        "nr_dispatched",
        |snap| snap.var("nr_dispatched").as_u64(),
    );
    nr_dispatched.nondecreasing(&mut v);

    // Stats axis: stay under a generous ceiling.
    let stats_dispatched: SeriesField<u64> = series.stats(
        "nr_dispatched",
        |sv| sv.path("nr_dispatched").as_u64(),
    );
    stats_dispatched.each(&mut v).at_most(1_000_000_000u64);

    let r = v.into_result();
    anyhow::ensure!(r.passed, "temporal assertions failed: {:?}", r.details);
    Ok(())
}

#[ktstr_test(num_snapshots = 3, duration_s = 10, post_vm = assert_temporal_patterns)]
fn dispatch_counter_advances(ctx: &Ctx) -> Result<AssertResult> {
    execute_defs(ctx, vec![
        CgroupDef::named("workers").workers(2).work_type(WorkType::SpinWait),
    ])
}

For the periodic-capture wiring, num_snapshots semantics, and the bridge-drain contract, see Periodic Capture. For the underlying Snapshot / SnapshotMap / SnapshotEntry accessors the projection closures call into, see Snapshots.

Recipes

Standalone examples for common tasks. Each recipe is self-contained.

Test a New Scheduler

End-to-end workflow: define a scheduler, write tests, run them.

1. Define the scheduler

Use declare_scheduler! to register a scheduler in the KTSTR_SCHEDULERS distributed slice. The verifier sweep picks it up automatically.

use ktstr::declare_scheduler;
use ktstr::prelude::*;

declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    topology = (1, 2, 4, 1),
    kernels = ["6.14", "6.15..=7.0"],
    sched_args = ["--exit-dump-len", "1048576"],
});

The macro generates pub static MY_SCHED: Scheduler plus a private linkme registration so cargo ktstr verifier discovers the scheduler automatically. Tests reference the bare MY_SCHED ident via #[ktstr_test(scheduler = MY_SCHED)].

See Scheduler Definitions for every supported field.

2. Write integration tests

Tests inherit the scheduler’s topology. Override with explicit llcs, cores, or threads when needed.

use ktstr::prelude::*;

#[ktstr_test(scheduler = MY_SCHED)]
fn basic_steady(ctx: &Ctx) -> Result<AssertResult> {
    // Inherits 1n2l4c1t from MY_SCHED
    scenarios::steady(ctx)
}

#[ktstr_test(scheduler = MY_SCHED, threads = 2)]
fn smt_steady(ctx: &Ctx) -> Result<AssertResult> {
    // Inherits llcs=2, cores=4; overrides threads to exercise SMT
    scenarios::steady(ctx)
}

3. Build a kernel

Build a kernel with sched_ext support:

cargo ktstr kernel build

See Getting Started: Build a kernel for version selection and local source builds.

4. Run

cargo ktstr test --kernel ../linux

5. Check BPF complexity (optional)

Collect per-program verifier statistics across the declared kernels and accepted topology presets:

# Use the kernel auto-discovered via KTSTR_KERNEL / cache.
cargo ktstr verifier

# Pin to a specific kernel build.
cargo ktstr verifier --kernel ../linux

# Sweep across multiple kernels. Each scheduler's
# `kernels = [...]` declaration acts as a per-scheduler filter on
# the operator-supplied set; an empty (or omitted) `kernels` field
# means the scheduler runs against every kernel in the sweep.
cargo ktstr verifier --kernel 6.14 --kernel 7.0

See BPF Verifier for output format, cycle collapse, and the cell-name → kernel matching contract.

6. Manage the kernel cache

Cached kernel images accumulate under $XDG_CACHE_HOME/ktstr/kernels/. Keep a handful of recent builds and drop the rest when disk pressure grows:

cargo ktstr kernel list                # inspect cache contents
cargo ktstr kernel clean --keep 3      # keep the 3 most recent images
cargo ktstr kernel clean --force       # remove everything (non-interactive)

7. Debug failures

Boot an interactive shell with the scheduler binary:

cargo ktstr shell -i ./target/debug/scx_my_sched

Inside the guest, run /include-files/scx_my_sched manually to inspect behavior. See cargo-ktstr shell for all flags.

See The #[ktstr_test] Macro for all available attributes and Scheduler Definitions for the full Scheduler type and the declare_scheduler! macro.

Investigate a Crash

When a scheduler crashes during a test, the failure output and auto-repro pipeline help identify the cause.

First step: enable full diagnostics

Rerun the failing test with RUST_BACKTRACE=1 before digging into individual sections:

RUST_BACKTRACE=1 cargo ktstr test --kernel ../linux -- -E 'test(my_test)'

Setting RUST_BACKTRACE=1 unconditionally appends the --- diagnostics --- section (init stage, VM exit code, last lines of kernel console) to every failure, not only when the scheduler self-dies. It also enables verbose VM console output (equivalent to KTSTR_VERBOSE=1).

Reading failure output

A test failure message contains up to eight sections, each present only when relevant:

  • Error line — Test name, scheduler, failure reason.
  • --- stats --- — Per-cgroup worker count, CPU count, spread, gap, migrations, iterations.
  • --- diagnostics --- — Init stage classification, VM exit code, last 20 lines of kernel console.
  • --- timeline --- — Kernel version, topology, scheduler, scenario duration, phase breakdown with monitor samples.
  • --- scheduler log --- — Scheduler process stdout+stderr (cycle-collapsed).
  • --- monitor --- — Host-side monitor: sample count, max imbalance ratio, max local-DSQ depth, sustained-violation flag, SCX event counters (select_cpu_fallback, keep_last, skip_exiting, skip_migration_disabled), per-sched_domain load-balance rates, per-BPF-program verified_insns, and the merged threshold verdict.
  • --- sched_ext dump --- — sched_ext_dump trace lines from the guest kernel.
  • --- auto-repro --- — BPF probe data from a second VM run, plus repro VM duration, scheduler log, sched_ext dump, and dmesg tails.

--- diagnostics --- appears automatically when the scheduler died or crashed, or when RUST_BACKTRACE is set to 1 or full.

Auto-repro

auto_repro defaults to true in #[ktstr_test]. When the scheduler crashes, ktstr automatically:

  1. Captures the crash stack trace from the scenario output.
  2. Boots a second VM with BPF kprobes (kernel functions) and fentry probes (BPF callbacks) on each function in the crash chain, plus a tp_btf/sched_ext_exit tracepoint trigger.
  3. Reruns the scenario to capture function arguments at each crash point.
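
When iterating on a known crash, the second VM run is often unwanted overhead. A sketch of the per-test opt-out (MY_SCHED stands in for your own declare_scheduler! const):

use ktstr::prelude::*;

// Skip the repro VM while bisecting a known crash; drop the attribute
// once you want probe data again.
#[ktstr_test(scheduler = MY_SCHED, auto_repro = false)]
fn crash_bisect(ctx: &Ctx) -> Result<AssertResult> {
    scenarios::steady(ctx)
}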

Reading auto-repro output

The probe output shows each function in the crash chain with:

  • Function signature and argument values during execution of the same workload
  • Source file and line number
  • Call chain context

After the probe data, the auto-repro section includes the repro VM duration and the last 40 lines of the repro VM’s scheduler log (cycle-collapsed), sched_ext dump, and kernel console (dmesg). These supplement probe data when the crash produces sparse or no probe events. When probe data is absent, a crash reproduction status line replaces it.

See Auto-Repro for details on how the two-VM repro cycle works.

A/B Compare Branches

Compare scheduler behavior between two branches by running the same #[ktstr_test] suite against each, then using cargo ktstr stats compare to diff per-metric results with dual-gate (absolute and relative) significance and exit non-zero on any regression.

Setup worktrees

The examples below use the scx scheduler crate under ~/opensource/scx; substitute your own scheduler crate’s path and remote everywhere scx appears.

cd ~/opensource/scx

# Create a worktree for the baseline branch.
git worktree add ~/opensource/scx-main upstream/main

Collect both runs into a shared run root

Each cargo nextest run --workspace writes its sidecars into target/ktstr/{kernel}-{project_commit}/. The {project_commit} half is the project tree’s HEAD short hex captured at first sidecar write (suffixed -dirty when the worktree differs from HEAD), so two branches with distinct HEADs land in distinct directories. Two back-to-back runs of the SAME kernel at the SAME commit reuse the same directory — the second run pre-clears any prior sidecars at first write, so each directory is a last-writer-wins snapshot of (kernel, project commit).

Warning: The two worktrees MUST be at distinct commits for A/B comparison to work. If both checkouts share the same HEAD (e.g. baseline branch and feature branch happen to be even), the second run overwrites the first via the last-writer-wins pre-clear and the comparison degenerates to “identical pool of sidecars.” Confirm distinct commits with git -C ~/opensource/scx rev-parse HEAD and git -C ~/opensource/scx-main rev-parse HEAD before invoking the second cargo nextest run.

Every sidecar also carries its own project_commit field (read from the project tree’s git HEAD at sidecar-write time), so the runs from two branches land disjoint values on the commit dimension regardless of how the directories are named. The project commit is discovered by walking up from the test process’s current working directory to find a .git marker — so the cd ~/opensource/scx-main / cd ~/opensource/scx steps below are load-bearing, not stylistic. Without them the probe would walk up from wherever you happened to invoke cargo, potentially ending at an entirely different repo and recording the wrong commit on every sidecar. The simplest collection workflow is to merge both branches’ run subdirectories under one root and rely on --a-project-commit / --b-project-commit to partition them:

mkdir -p ~/opensource/scx-runs/ktstr

# Baseline.
cd ~/opensource/scx-main
cargo ktstr test --kernel ../linux
mv target/ktstr/* ~/opensource/scx-runs/ktstr/

# Experimental.
cd ~/opensource/scx
cargo ktstr test --kernel ../linux
mv target/ktstr/* ~/opensource/scx-runs/ktstr/

The {kernel}-{project_commit} subdirectory names are unique per (kernel, project commit) pair, so two branches with distinct HEADs coexist under one root without collision. Within a single branch, two clean back-to-back runs at the same commit reuse one directory (last-writer-wins via per-process pre-clear); mark one of them as -dirty (uncommitted change) or commit / amend between runs to land separate directories.

Do not set KTSTR_SIDECAR_DIR: cargo ktstr stats list and cargo ktstr stats compare walk {CARGO_TARGET_DIR or "target"}/ktstr/ by default and would not see runs written to a custom flat directory unless --dir DIR is passed.

Discover available dimension values

The framework records the project tree’s git commit (discovered by walking parents of the test process’s cwd to find the enclosing .git) on every sidecar via SidecarResult::project_commit, so two runs from different commits land disjoint values on the commit dimension and --a-project-commit / --b-project-commit slice between them without any per-run directory bookkeeping. Use cargo ktstr stats list-values --dir DIR to enumerate the distinct values of every filterable dimension (kernel, commit, kernel_commit, source, scheduler, topology, work_type) present in the pool, so per-side filters target real values. The commit and source keys map to the internal SidecarResult::project_commit / run_source fields; the per-side filter flags spell as --a-project-commit / --b-project-commit and --a-run-source / --b-run-source on the compare subcommand.

cd ~/opensource/scx
CARGO_TARGET_DIR=~/opensource/scx-runs cargo ktstr stats list
CARGO_TARGET_DIR=~/opensource/scx-runs cargo ktstr stats list-values

Compare per-side filter groups

cd ~/opensource/scx
CARGO_TARGET_DIR=~/opensource/scx-runs cargo ktstr stats compare \
    --a-project-commit <baseline-short-hex> \
    --b-project-commit <current-short-hex>

stats compare is pool-driven: every sidecar under the runs root is loaded into a single pool, and per-side filter flags (--a-X / --b-X) partition the pool into the A and B contrasts. The dimensions on which the A and B filters DIFFER are the slicing dimensions of the contrast; every other dimension is part of the dynamic pairing key the comparison joins on. Slicing on project-commit alone joins each baseline scenario with its matching experimental counterpart on every other dimension (kernel, kernel-commit, run-source, scheduler, topology, work_type).

Other slicing axes work the same way:

# Slice on kernel.
cargo ktstr stats compare --a-kernel 6.14 --b-kernel 7.0

# Slice on scheduler, pin both sides to one kernel.
cargo ktstr stats compare \
    --a-scheduler scx_rusty --b-scheduler scx_lavd \
    --kernel 6.14

Shared --X flags pin BOTH sides to the same value; per-side --a-X / --b-X REPLACE the corresponding shared --X for that side only (“more-specific replaces”). Slicing on more than one dimension at once prints a stderr warning but is supported for cohort sweeps.

compare applies the dual-gate significance check from the unified MetricDef registry to every metric and prints colored output (red = regression, green = improvement). Rows where either side has passed=false are dropped from the math and counted in the summary line; the exit code is non-zero when any regression is detected, so the command can gate CI directly. Narrow further with -E SUBSTRING (matches the joined scenario topology scheduler work_type string), override the relative gate uniformly with --threshold PCT or per-metric via --policy FILE. The absolute gate from each MetricDef is unaffected by --threshold — a delta must clear both gates to count as significant.
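
The dual gate reduces to a simple conjunction. A standalone sketch with illustrative gate values (the real ones come from the MetricDef registry):

// A delta counts as significant only when it clears BOTH gates:
// the absolute floor and the relative (percentage) threshold.
fn significant(baseline: f64, candidate: f64, abs_gate: f64, rel_gate_pct: f64) -> bool {
    let delta = (candidate - baseline).abs();
    delta >= abs_gate && delta >= baseline.abs() * (rel_gate_pct / 100.0)
}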

See stats compare for the full per-side flag table and validation rules, and stats list-values for the discovery counterpart.

Cleanup

git worktree remove ~/opensource/scx-main
rm -rf ~/opensource/scx-runs

Capture and Compare Host State

Disambiguation: this recipe covers host context (kernel build, CPU model, sched_* tunables, NUMA layout) via cargo ktstr show-host. For per-thread profiling (scheduling counters, memory / I/O accounting, taskstats delay accounting per thread), see the ctprof reference and the Diagnose a Slow Scheduler with ctprof recipe.

When a gauntlet run passes on one machine and fails on another — or passes on Monday and fails on Wednesday — the first thing to check is whether the host itself changed. cargo ktstr show-host captures a snapshot of the kernel, CPU, memory, scheduler tunables, and kernel cmdline; cargo ktstr stats compare surfaces the changes between two sidecars in a host-delta section of its output so you can see what moved.

Two show-host commands: live vs archived

Two distinct subcommands print host context, and they are NOT interchangeable — pick the one whose target matches your question:

  • cargo ktstr show-host captures the live host context by reading /proc, /sys, and uname() at invocation time. Use this when you want to inspect the current machine, e.g. before running a benchmark, after a sysctl change, or to confirm what cargo ktstr stats compare would record on the next run produced here. No prior runs needed.
  • cargo ktstr stats show-host --run RUN_ID prints the archived host context captured at sidecar-write time for the named run. Use this when investigating a regression in a past run — what looked like a code change might trace back to a host change at the time the sidecar was produced. Resolves --run against target/ktstr/ (or --dir) and renders the first sidecar in the run that carries a populated host field via the same HostContext::format_human formatter the live show-host uses, so the two outputs are byte-for-byte comparable when the host is unchanged.

The sections below cover the live show-host. For the archived variant’s flag table see stats show-host.

Capture: show-host

cargo ktstr show-host

Prints a key: value report covering:

  • CPU model + vendor (first /proc/cpuinfo entry).
  • Total memory, hugepages total / free, hugepage size (from /proc/meminfo).
  • Transparent hugepage policy (thp_enabled, thp_defrag) with the bracketed selection preserved verbatim.
  • Every /proc/sys/kernel/sched_* tunable, one entry per line.
  • NUMA node count (from CPU→node mapping; memory-only nodes without CPUs are not counted).
  • kernel_name / kernel_release / arch (from the uname() syscall).
  • /proc/cmdline verbatim.

Absent fields render as (unknown) — an empty sched_* map renders as (empty) and a missing map renders as (unknown). The distinction matters when you want to know whether a dimension was inspected but absent, vs failed to populate.

Sidecars written before the uname_sysname / uname_release / uname_machine → kernel_name / kernel_release / arch rename render the renamed fields as (unknown) in show-host and in stats compare’s host-delta section, and re-running the test against the current binary regenerates the sidecar with the new field names populated. Mechanically: the old sidecar still deserializes cleanly (deserialization is forward-compatible in the “does-not-error” sense), but the renamed fields land as None on the new struct because the old-name data does not migrate to the new field names.

This output is human-oriented. For programmatic access, read the host field of any sidecar JSON (same schema, identical values — show-host prints the live snapshot the sidecar writer would attach to a fresh test run).

Compare: stats compare

cargo ktstr stats compare --a-project-commit <baseline> --b-project-commit <current>

Per-side filter flags (--a-X / --b-X) partition the sidecar pool into the two sides of the contrast — slice on project-commit, kernel, scheduler, run-source, etc. depending on what you are diffing. compare picks the first sidecar with Some(host) from each side, collects every host field that differs, and prints a side-by-side delta unconditionally as part of the compare output (there is no opt-in flag — the host-delta section appears whenever the two sides disagree on a host field):

host delta ('A' → 'B'):
  kernel_release: 6.14.2 → 6.15.0
  thp_enabled: always [madvise] never → always madvise [never]
  sched_tunables.sched_migration_cost_ns: 500000 → 100000

Fields that match in both runs are suppressed by design — this is a diff, not a snapshot. Missing-on-one-side rendering differs by layer: top-level Option<T> host fields (e.g. kernel_release, thp_enabled, the whole sched_tunables map) render with (unknown) on the None side so a regression in the capture pipeline surfaces instead of silently hiding. Per-key diffs inside the sched_tunables map use (absent) instead, to distinguish “the map was captured and this key is not in it” from “the whole map was unknown at capture time”.

CI integration

Gauntlet runs emit the host block automatically in every sidecar. To diff the host state across two CI runs, slice the pool on whatever dimension separates them (typically --a-project-commit / --b-project-commit or --a-kernel / --b-kernel) — the host-delta section appears automatically in the compare output when any host field differs between the two sides. A CI job can:

  1. Run the gauntlet on the candidate commit and the baseline.
  2. Invoke stats compare slicing on the dimension that separates the two runs (e.g. --a-project-commit <baseline> --b-project-commit <current>) and inspect the host-delta section of its output.
  3. Fail (or annotate the PR) if any host dimension changed — an unchanged host set is the precondition for a clean A/B of scheduler behavior.

Typical hits

Each bullet names the show-host field that carries the signal so you can cargo ktstr show-host | grep <field> directly, or pluck the same key out of a sidecar via jq '.host.<field>'.

  • thp_enabled (and its companion thp_defrag) changed between runs → explains latency-sensitive regressions that vanish when you pin THP via transparent_hugepage= on the kernel cmdline. The bracketed selection inside the value is the active setting; compare the bracket position, not just the full string.
  • sched_tunables.sched_migration_cost_ns differs (look for it inside the sched_* block printed by show-host) → fair scheduler migrated the run onto different CPUs, which changes the idle-steal pressure on scx_* schedulers that depend on it. Other sched_tunables.* keys (sched_wakeup_granularity_ns, sched_min_granularity_ns, sched_latency_ns, sched_rt_runtime_us, etc.) have the same shape — the full set is whatever /proc/sys/kernel/sched_* lists at capture time. Note: the examples above are CFS-era tunables; several of them (sched_wakeup_granularity_ns, sched_min_granularity_ns, sched_latency_ns) were dropped when CFS was replaced by EEVDF in Linux 6.6+, so a run on an EEVDF kernel will simply not have those keys in the map — their absence is a kernel-version fact, not a capture failure. EEVDF’s own latency-floor knob is exposed as sched_tunables.sched_base_slice_ns on 6.6+ kernels (the replacement for the dropped CFS latency / granularity triple); check for its presence to confirm an EEVDF-era capture.
  • kernel_cmdline diverges → isolcpus= / nohz_full= / mitigations= / transparent_hugepage= / numa_balancing= are all boot-time and change the whole scheduling surface. Rebooting the host to match is the correct remediation when you need the comparison to hold. The field is named kernel_cmdline (not cmdline) in both show-host’s printed output and the sidecar JSON to disambiguate from SidecarResult.kargs, which carries the extra kargs the ktstr VMM appended when booting the guest rather than the running host’s boot line.
  • kernel_release differs (also check the companion kernel_name and arch fields) → the kernel itself changed; every other host dimension is suspect under cross-kernel comparison. A kernel_name change (uname -s reporting a different OS family — Linux vs FreeBSD, say) is a harder stop than a same-family version bump and usually means the two sidecars were produced on entirely different systems.
  • hugepages_total / hugepages_free / hugepages_size_kb deltas → benchmark throughput that depends on 2 MiB pages (performance_mode tests) flips outcome when the pool shrinks or the page size changes. All three are reported by show-host in the meminfo-derived block.
  • numa_nodes differs → cpusets and cross-node migration signals only make sense within the CPU→node mapping captured at sidecar-write time; a host reconfigured to expose or hide nodes changes what cpus_used and numa_pages mean across the two runs. See the capture caveat — numa_nodes counts only nodes that host at least one CPU (memory-only nodes are not counted), so a delta here can reflect either a hardware / firmware change or a topology reconfiguration that left the memory-only nodes untouched.
  • CPU-level skew (cpu_model / cpu_vendor) → microarchitectural differences affect cache-sensitive benchmarks. Always inspect alongside cmdline because a different CPU usually comes with a different bootloader.

Seeing the raw sidecar field

show-host reads the live host; the sidecar carries whatever show-host would have captured at sidecar-write time. To see the sidecar’s host block directly:

jq '.host' path/to/sidecar.ktstr.json

The field is emitted on every gauntlet run.

Diagnose a Slow Scheduler with ctprof

When a scheduler change makes the workload slower but the test suite still passes, the regression is usually buried in per-thread off-CPU time. ktstr ctprof capture snapshots every live thread’s scheduling, memory, I/O, and taskstats delay counters; ktstr ctprof compare diffs two snapshots and surfaces the buckets where time went. This recipe walks through a typical A/B comparison.

See the ctprof reference for the full metric registry, aggregation rules, derived-metric formulas, and taskstats kconfig gating.

Capture before and after

# Baseline: scheduler A loaded, workload running.
ktstr ctprof capture --output baseline.ctprof.zst

# Switch schedulers, restart workload, wait for steady state.
# ...

# Candidate: scheduler B, same workload.
ktstr ctprof capture --output candidate.ctprof.zst

capture walks /proc once and writes the snapshot. It is read-only — no kprobes, no tracing — so the act of capturing does not perturb the measurement. The default capture covers every live tgid; on a busy host this is hundreds of threads. The snapshot is zstd-compressed JSON, typically a few MB.

Compare with the taskstats lens

The taskstats-delay section bundles the eight kernel delay-accounting buckets (CPU, blkio, swapin, freepages, thrashing, compact, wpcopy, irq) plus their nine derived metrics (avg_*_delay_ns per bucket, total_offcpu_delay_ns rollup). Running with --sections taskstats-delay filters the output down to just the off-CPU view:

ktstr ctprof compare baseline.ctprof.zst candidate.ctprof.zst \
    --sections taskstats-delay \
    --sort-by total_offcpu_delay_ns:desc

Sorting by total_offcpu_delay_ns:desc puts the processes with the largest absolute off-CPU growth at the top. Each row gives baseline | candidate | delta | %; large positive deltas on a process that should not have moved are the suspects.

The total_offcpu_delay_ns derivation is:

cpu + blkio + freepages + compact + wpcopy + irq + max(swapin, thrashing)

max(swapin, thrashing) rather than swapin + thrashing because every thrashing event is also a swapin event from the syscall perspective; summing both would double-count.
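
As a standalone function, the rollup is that sum with one max() substitution:

// total_offcpu_delay_ns per the derivation above. max(swapin, thrashing)
// avoids double-counting: every thrashing event is also a swapin event.
fn total_offcpu_delay_ns(
    cpu: u64, blkio: u64, swapin: u64, freepages: u64,
    thrashing: u64, compact: u64, wpcopy: u64, irq: u64,
) -> u64 {
    cpu + blkio + freepages + compact + wpcopy + irq + swapin.max(thrashing)
}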

Drill into the per-bucket averages

If total_offcpu_delay_ns jumped on a process, the per-bucket avg_*_delay_ns derivations identify which off-CPU phase grew. In the same compare output (the --sections taskstats-delay filter keeps both the raw counters AND the 9 derivations together), look at the suspect process’s row in:

  • CPU runqueue wait (avg_cpu_delay_ns) — Time waiting for the scheduler to pick the task. RACY (count + total update lockless).
  • Block I/O wait (avg_blkio_delay_ns) — Synchronous block-device wait. Distinct from schedstat iowait_sum; the canonical delay-accounting reading.
  • Swap-in / Thrashing (avg_swapin_delay_ns / avg_thrashing_delay_ns) — Memory pressure. The two overlap — a thrashing event is also a swapin event.
  • Direct memory reclaim (avg_freepages_delay_ns) — Allocator hit the __alloc_pages slowpath.
  • Memory compaction (avg_compact_delay_ns) — Allocator demanded a high-order page; compaction stalled.
  • CoW page-fault (avg_wpcopy_delay_ns) — Write-protect-copy fault, e.g. fork-then-write.
  • IRQ handling (avg_irq_delay_ns) — Time charged to the task by the IRQ accounting subsystem.

A growing avg_cpu_delay_ns with flat blkio/swap/freepages suggests the new scheduler is making poor placement choices — the task is queueing more often or for longer, but no other subsystem is to blame. A growing avg_blkio_delay_ns with flat avg_cpu_delay_ns points away from the scheduler entirely (disk, network filesystem, or a userspace lock pattern).

Cross-reference the primary table

Once a bucket is identified, look at the underlying counters without the section filter:

ktstr ctprof compare baseline.ctprof.zst candidate.ctprof.zst \
    --metrics nr_wakeups,nr_migrations,wait_sum,wait_count,run_time_ns,timeslices

--metrics restricts the rendered rows to the named primary metrics. Useful pairings when the suspect bucket is CPU runqueue wait:

  • wait_sum / wait_count — schedstat’s average wait per scheduling event (the avg_wait_ns derivation, exposed outside the taskstats-delay section). If this confirms avg_cpu_delay_ns, both delay-accounting paths agree.
  • nr_migrations — the new scheduler may be moving the task more aggressively. Cross-CPU migrations cost wall-clock time even when run_time_ns is identical.
  • nr_wakeups_affine / nr_wakeups_affine_attempts — the affine_success_ratio derivation; CFS-only signal that reflects how often wake_affine() succeeded. A large drop with growing avg_cpu_delay_ns is a strong signal for cache-unfriendly placement.

Confirm taskstats data is actually populated

If every taskstats column reads zero, the snapshot likely hit a gating problem rather than a real “no delay” reading. Inspect CtprofSnapshot::taskstats_summary (the structured per-snapshot tally written into the snapshot itself):

  • eperm_count > 0 — the capturing process lacked CAP_NET_ADMIN. Re-run as root, or grant cap_net_admin+eip via setcap.
  • esrch_count near tids_walked — every tid raced exit before the per-tid query landed. Lengthen the workload’s steady-state window and re-capture.
  • ok_count == 0 AND eperm_count == 0 — the netlink open failed, almost always meaning the kernel was built without CONFIG_TASKSTATS. Rebuild with the kconfig.
  • ok_count > 0 but every delay column reads zero — kernel built with CONFIG_TASKSTATS and CONFIG_TASK_DELAY_ACCT but launched without the runtime delayacct=on toggle. Add delayacct to the kernel cmdline, or set sysctl kernel.task_delayacct=1 and re-capture.

The structured fields above let an operator distinguish each case without scraping the capture-pipeline tracing log.
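
A condensed decision helper over those counters; the field spellings follow the bullets above, while the function shape and the cutoff for "near tids_walked" are illustrative assumptions:

// Rough classifier for an all-zero taskstats section. The >= 90% cutoff
// for "esrch_count near tids_walked" is an arbitrary illustrative choice.
fn classify_gating(ok: u64, eperm: u64, esrch: u64, tids_walked: u64) -> &'static str {
    if eperm > 0 {
        "capture lacked CAP_NET_ADMIN: rerun as root or setcap cap_net_admin+eip"
    } else if ok == 0 {
        "netlink open failed: kernel likely built without CONFIG_TASKSTATS"
    } else if tids_walked > 0 && esrch * 10 >= tids_walked * 9 {
        "tids raced exit: lengthen the steady-state window and re-capture"
    } else {
        "queries succeeded: all-zero delays point at delayacct being off"
    }
}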

  • ctprof reference — the full metric registry and gating documentation.
  • Capture and Compare Host State — the cargo ktstr show-host recipe for host-context diffs (kernel, sched_* tunables, NUMA layout); use that when the hypothesis is “the host config moved” rather than “a workload’s per-thread behaviour moved.”
  • A/B Compare Branches — recipe for diffing scheduler-source-tree changes via ktstr’s gauntlet runs; ctprof complements that by surfacing per-thread-level effects the scenario assertions miss.

Customize Checking

Override default checking thresholds for schedulers that tolerate higher imbalance, different gap thresholds, or relaxed event rates.

Scheduler-level overrides

Declare a scheduler with custom assertion overrides:

use ktstr::declare_scheduler;
use ktstr::prelude::*;

declare_scheduler!(RELAXED, {
    name = "relaxed",
    binary = "scx_relaxed",
    assert = Assert::NO_OVERRIDES
        .max_imbalance_ratio(5.0)    // tolerate 5:1 imbalance
        .max_fallback_rate(500.0)     // higher fallback rate ok
        .fail_on_stall(false),        // don't fail on stall
});

These overrides sit between Assert::default_checks() and per-test overrides in the merge chain.

Per-test overrides via #[ktstr_test]

#[ktstr_test(
    scheduler = RELAXED,
    not_starved = true,
    max_gap_ms = 5000,
    max_imbalance_ratio = 10.0,
    sustained_samples = 10,
)]
fn high_imbalance_test(ctx: &Ctx) -> Result<AssertResult> {
    // Inherits topology from RELAXED
    Ok(AssertResult::pass())
}

Understanding not_starved

not_starved = true enables starvation, fairness spread, and scheduling gap checks. Each threshold can be overridden independently. See Checking: Worker checks for details and default thresholds.

Merge order

Three-layer merge with last-Some-wins semantics. See Checking: Merge layers.
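
Per field, last-Some-wins is the usual Option::or shape. A minimal sketch of one field's merge, assuming Option-typed threshold fields (the Benchmarking recipe below notes every Assert field is Option):

// The most specific Some wins: chain from per-test down to the default
// layer; `or` keeps the left-hand value whenever it is Some.
fn merge<T>(default: Option<T>, scheduler: Option<T>, per_test: Option<T>) -> Option<T> {
    per_test.or(scheduler).or(default)
}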

Using Assert directly in ops scenarios

fn my_scenario(ctx: &Ctx) -> Result<AssertResult> {
    let assertions = Assert::NO_OVERRIDES
        .check_not_starved()
        .max_gap_ms(3000);

    let steps = vec![/* ... */];
    execute_steps_with(ctx, steps, Some(&assertions))
}

execute_steps_with applies the given Assert for worker checks. execute_steps (without _with) passes None, falling back to ctx.assert (the merged three-layer config: default_checks -> scheduler -> per-test).

See Ops and Steps for the full step execution model.

Benchmarking and Negative Tests

Recipes for writing tests that check scheduler performance gates (positive tests) and confirm that degraded schedulers fail those gates (negative tests).

Using Assert for checking

Assert carries all checking thresholds. Every field is Option; None means “inherit from parent layer.”

  • In the merge chain: Assert::default_checks() -> Scheduler.assert -> per-test #[ktstr_test] attributes. Use with execute_steps_with() for ops-based scenarios. See Checking.

  • For direct report checking: call Assert::assert_cgroup(reports, cpuset).

let a = Assert::default_checks().max_gap_ms(500);
let result = a.assert_cgroup(&reports, None);

Positive benchmarking test

Check that a scheduler passes performance gates under performance_mode. Use #[ktstr_test] with Assert thresholds:

use ktstr::declare_scheduler;
use ktstr::prelude::*;

declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    topology = (1, 1, 2, 1),
});

#[ktstr_test(
    scheduler = MY_SCHED,
    performance_mode = true,
    duration_s = 3,
    sustained_samples = 15,
)]
fn perf_positive(ctx: &Ctx) -> Result<AssertResult> {
    let checks = Assert::default_checks()
        .min_iteration_rate(5000.0)
        .max_gap_ms(500);
    let steps = vec![Step::with_defs(
        vec![CgroupDef::named("cg_0").workers(2)],
        HoldSpec::FULL,
    )];
    execute_steps_with(ctx, steps, Some(&checks))
}

Key points:

  • performance_mode = true pins vCPUs and uses hugepages for deterministic measurements.
  • Assert::default_checks() starts from the standard baseline.
  • Chain .min_iteration_rate(), .max_gap_ms(), or .max_p99_wake_latency_ns() to set gates.
  • execute_steps_with() applies the Assert during worker checks.

Negative test pattern

Check that intentionally degraded scheduling fails the same gates. This confirms that the gates actually catch regressions rather than passing vacuously.

Use expect_err = true on #[ktstr_test] to assert that the test fails. The macro wraps the test with assert!(result.is_err()) and disables auto-repro automatically.

use ktstr::declare_scheduler;
use ktstr::prelude::*;

declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    topology = (1, 1, 2, 1),
});

#[ktstr_test(
    scheduler = MY_SCHED,
    performance_mode = true,
    duration_s = 5,
    extra_sched_args = ["--fail-verify"],
    expect_err = true,
)]
fn perf_negative(ctx: &Ctx) -> Result<AssertResult> {
    let checks = Assert::default_checks().max_gap_ms(50);
    let steps = vec![Step::with_defs(
        vec![CgroupDef::named("cg_0").workers(4)],
        HoldSpec::FULL,
    )];
    execute_steps_with(ctx, steps, Some(&checks))
}

Key points:

  • expect_err = true tells the harness to assert failure and disable auto-repro.
  • extra_sched_args = [...] passes CLI args to the scheduler binary. "--fail-verify" is a real knob that the test fixture scheduler scx-ktstr exposes to force a verifier failure (see scx-ktstr/src/main.rs and scx-ktstr/src/bpf/main.bpf.c); substitute your own scheduler’s equivalent of the behaviour you want to exercise in a negative test.
  • The test function returns the scenario result normally; the harness checks that it produces an error.

Metric extraction from stderr

OutputFormat::Json and OutputFormat::LlmExtract read the payload’s STDOUT as the primary stream, then fall back to STDERR if stdout is empty or yields no metrics. Some benchmarks emit their numbers only to stderr — schbench, for example, writes its Wakeup Latencies percentiles / Request Latencies percentiles blocks via fprintf(stderr, ...) and leaves stdout blank. The fallback keeps those benchmarks usable without a redirect.

Consequence: a payload that writes mixed output to both streams will have metrics extracted from stdout only, because the fallback fires solely when the primary stream is empty or yields nothing parseable. If you care about stderr-side numbers for a stdout-emitting binary, redirect stderr into stdout at the payload layer (extra_args = ["-c", "cmd 2>&1"] for shell-wrapped invocations, or whatever equivalent the binary supports).

stress-ng is the mirror trap: progress / per-stressor summaries go to stderr and stdout is blank, so the fallback sees stress-ng’s prose. OutputFormat::Json returns zero metrics (stderr is prose, not JSON); OutputFormat::LlmExtract may extract numbers from the fallback but results depend on the local model’s tolerance for that prose format. Keep OutputFormat::ExitCode for stress-ng unless you are prepared for that tradeoff.
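
A hedged sketch of that redirect for a mixed-stream binary: the extra_args spelling mirrors the note above, while the sh -c wrapper, the attribute placement, and the binary name are illustrative assumptions rather than a confirmed Payload attribute surface:

use ktstr::prelude::*;

// Shell-wrap the real binary so both streams land on stdout, where the
// JSON extractor reads the primary stream. "mybench" is a placeholder.
#[derive(Payload)]
#[payload(binary = "sh", extra_args = ["-c", "mybench 2>&1"])]
#[metric(name = "ops_per_sec", polarity = HigherBetter, unit = "ops/s")]
struct WrappedBench;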

Declarative include_files on Payload

Payloads that need host binaries or fixtures in the guest initramfs can declare them on the Payload itself instead of relying on the CLI -i / --include-files flag at every invocation. The specs are resolved at run_ktstr_test time through the same include-file pipeline the CLI uses.

Spec shapes

Three shapes are accepted; which branch fires is decided by the shape of the path:

  • Bare name (single-component, no /, no ./, no ../) — looked up first in the harness’s current working directory (path.exists() is tried before the PATH walk), then in the host’s PATH if the cwd lookup misses. The resolved absolute path is packed as include-files/<filename>. Example: "fio" → host /usr/bin/fio → archive include-files/fio.
  • Relative or absolute path (starts with /, ./, ../, or contains more than one component) — used verbatim and must exist. Relative paths are interpreted against the current working directory at the time the test harness runs (for cargo nextest run that is the workspace root or the individual crate root, depending on how the binary is invoked). A single-file path is packed as include-files/<filename>. Example: "./test-fixtures/workload.json" → archive include-files/workload.json.
  • Directory (any path whose resolution is a directory) — walked recursively (symlinks followed, non-regular files skipped) and the directory’s basename becomes the root under include-files/. Example: "./helpers" containing a.sh and sub/b.sh → archive entries include-files/helpers/a.sh and include-files/helpers/sub/b.sh.

Base directory for extra_include_files

Strings in extra_include_files follow the same three shapes as the #[include_files(...)] attribute. They are NOT anchored to CARGO_MANIFEST_DIR or to the crate source tree — they are resolved against the harness’s current working directory at test time, plus the host PATH for bare names. The attribute parser accepts string literals only, so paths must be plain quoted strings rather than compile-time expressions like concat!(env!("CARGO_MANIFEST_DIR"), "/test-fixtures/foo.json"). For test fixtures shipped alongside the test source, the reliable options are either a bare name that a build script or test-setup stage has placed on PATH, or a relative path rooted at the directory from which the test is invoked.

Per-Payload declaration

Declare via the #[include_files(...)] attribute on #[derive(Payload)]:

use ktstr::prelude::*;

#[derive(Payload)]
#[payload(binary = "fio")]
#[include_files("fio", "bench-helper")]
#[metric(name = "iops", polarity = HigherBetter, unit = "ops/s")]
struct FioPayload;

The generated FIO const carries include_files: &["fio", "bench-helper"]. The macro generates a const named by converting the struct name to SCREAMING_SNAKE_CASE (stripping any Payload suffix), so FioPayloadFIO and BenchDriverBENCH_DRIVER. When any #[ktstr_test] uses FIO as a payload or workload, those files get resolved and packed into the initramfs automatically — no -i flag needed on the CLI.

Fully worked declarative test

Complete end-to-end example of a #[ktstr_test] that relies on declarative include_files only (no CLI -i flag at runtime). The fixture binary ships on PATH under a project-controlled bin directory; the payload declares its own dependency:

use ktstr::declare_scheduler;
use ktstr::prelude::*;

declare_scheduler!(MY_SCHED, {
    name = "my_sched",
    binary = "scx_my_sched",
    topology = (1, 1, 2, 1),
});

#[derive(Payload)]
#[payload(binary = "bench-driver")]
#[include_files("bench-driver", "bench-helper")]
#[metric(name = "ops_per_sec", polarity = HigherBetter, unit = "ops/s")]
struct BenchDriver;
// The macro generates the `BENCH_DRIVER` const used below — `BenchDriver`
// (UpperCamelCase struct) → `BENCH_DRIVER` (SCREAMING_SNAKE_CASE, `Payload`
// suffix stripped). This is the only way to reference the payload from
// `#[ktstr_test]` attributes and from `ctx.payload(&...)` inside the body.

#[ktstr_test(
    scheduler = MY_SCHED,
    payload = BENCH_DRIVER,
    duration_s = 5,
)]
fn bench_driver_runs_with_declared_helpers(ctx: &Ctx) -> Result<AssertResult> {
    // Harness resolves the payload's `include_files` before boot:
    //   bench-driver  → `include-files/bench-driver`  (from $PATH)
    //   bench-helper  → `include-files/bench-helper`  (from $PATH)
    // Both land in the guest initramfs at `/include-files/` and are
    // on the worker's `PATH` during execution. The test body itself
    // does not touch the include set — it runs through `ctx.payload`.
    // `.run()` returns `(AssertResult, PayloadMetrics)`; the test
    // body only wants the AssertResult here, so discard the metrics
    // half of the tuple.
    ctx.payload(&BENCH_DRIVER)
        .run()
        .map(|(assert_result, _metrics)| assert_result)
}

No -i / --include-files flag is needed on any host-side invocation; the packaging happens automatically as part of run_ktstr_test.

Test-level extras

Test-level extras that don’t belong on any specific payload go on the #[ktstr_test] attribute directly:

#[ktstr_test(
    scheduler = MY_SCHED,
    payload = FIO,
    extra_include_files = ["test-fixtures/workload.json"],
)]
fn fio_with_fixture(ctx: &Ctx) -> Result<AssertResult> {
    // test body
    Ok(AssertResult::pass())
}

The declarative set (scheduler’s include_files + payload’s + workloads’ + extra_include_files) is aggregated at test time and resolved through the same include-file pipeline the CLI’s -i / --include-files flag uses (exposed on ktstr shell and cargo ktstr shell; #[ktstr_test] resolution and the shell CLIs share the same resolve_include_files resolver, just fed from different sources). The union is deduped on identical (archive_path, host_path) pairs. Two declarations that resolve to the same archive slot with different host paths surface as a hard error with both host paths in the diagnostic, rather than one silently overwriting the other.

Architecture Overview

ktstr has three execution domains:

  1. Host process – the test binary running on the host. Manages VM lifecycle, monitors guest memory, evaluates results.

  2. Guest process – the same test binary running inside the VM as PID 1. Mounts filesystems, starts the scheduler, creates cgroups, forks workers, runs scenarios, writes results to SHM (COM2 fallback).

  3. Monitor thread – runs on the host while the guest executes. Reads guest VM memory directly to observe scheduler state without instrumenting it.

Execution flow

Host                          Guest
----                          -----
test binary                   
  |                           
  +-- build initramfs         
  |   (test binary as /init   
  |    + optional scheduler)  
  |                           
  +-- boot KVM VM             
  |                           test binary (PID 1 init)
  |                             |
  +-- start monitor thread      +-- mount filesystems
  |   (reads guest memory)      +-- start scheduler (if any)
  |                             +-- create cgroups
  |                             +-- fork workers
  |                             +-- move workers to cgroups
  |                             +-- signal workers to start
  |                             +-- poll scheduler liveness
  |                             +-- stop workers, collect reports
  |                             +-- evaluate results
  |                             +-- write result to SHM (COM2 fallback)
  |                           
  +-- read result from SHM (COM2 fallback)
  +-- evaluate monitor data   
  +-- report pass/fail        

Key design decisions

Same binary, two roles. The test binary serves as both host controller and guest test runner. The initramfs embeds the binary as /init. When running as PID 1, the Rust init code (vmm::rust_init) handles the full guest lifecycle: mounts, scheduler start, test dispatch, and reboot.

Forked workers, not threads. Workers are fork()ed processes because cgroups operate on PIDs. Each worker must be a separate process to be placed in its own cgroup.

Host-side monitoring. The monitor reads guest memory via KVM, avoiding BPF instrumentation of the scheduler under test. This eliminates observer effects on scheduling decisions.

Typed flag declarations. Flags use static references instead of string matching, enabling compile-time dependency resolution.

VMM

ktstr includes a purpose-built VMM (virtual machine monitor) that boots Linux kernels in KVM for testing.

KtstrVm builder

let result = vmm::KtstrVm::builder()
    .kernel(&kernel_path)
    .init_binary(&ktstr_binary)
    .topology(numa_nodes, llcs, cores_per_llc, threads_per_core)
    .memory_mb(4096)
    .run_args(&["run".into(), "--ktstr-test-fn".into(), "my_test".into()])
    .build()?
    .run()?;

Topology

The VM topology is specified as (numa_nodes, llcs, cores_per_llc, threads_per_core). On x86_64, the VMM creates ACPI tables (MADT, SRAT, SLIT, and HMAT when numa_nodes > 1) and MP tables. On aarch64, topology is expressed via FDT cpu nodes with MPIDR-derived reg properties.

pub struct Topology {
    pub llcs: u32,
    pub cores_per_llc: u32,
    pub threads_per_core: u32,
    pub numa_nodes: u32,
    pub nodes: Option<&'static [NumaNode]>,
    pub distances: Option<&'static NumaDistance>,
}

total_cpus() = llcs * cores_per_llc * threads_per_core. num_llcs() = llcs.

When nodes is None (the default), memory and LLCs are distributed uniformly across NUMA nodes with default 10/20 distances. When Some, each NumaNode specifies its LLC count, memory size, and optional HMAT attributes (latency_ns, bandwidth_mbs, mem_side_cache). A NumaNode with llcs = 0 models a CXL memory-only node.

NumaDistance is an NxN inter-node distance matrix. Diagonal entries must be 10, off-diagonal > 10, and the matrix must be symmetric (ACPI SLIT requirements).
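
Those three constraints are mechanical to check. A standalone sketch (the VMM's actual validation may be stricter):

// ACPI SLIT validity per the requirements above: square matrix, diagonal
// entries exactly 10, off-diagonal entries > 10, and symmetric.
fn slit_valid(d: &[&[u8]]) -> bool {
    let n = d.len();
    d.iter().all(|row| row.len() == n)
        && (0..n).all(|i| (0..n).all(|j| {
            if i == j { d[i][j] == 10 } else { d[i][j] > 10 && d[i][j] == d[j][i] }
        }))
}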

Use Topology::new(numa_nodes, llcs, cores, threads) for uniform topologies, or Topology::with_nodes(cores, threads, &nodes) for explicit per-node configuration.
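
For the uniform case the arithmetic is direct, for example:

// Uniform topology: 2 NUMA nodes, 4 LLCs, 2 cores per LLC, 2 threads
// per core. total_cpus() = llcs * cores_per_llc * threads_per_core = 16.
let topo = Topology::new(/*numa_nodes=*/ 2, /*llcs=*/ 4, /*cores=*/ 2, /*threads=*/ 2);
assert_eq!(topo.total_cpus(), 16);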

initramfs

The VMM builds a cpio initramfs containing:

  • The test binary (as /init)
  • Optional scheduler binary (as /scheduler)
  • Shared library dependencies (resolved via ELF DT_NEEDED parsing)

The initramfs is cached based on a cache key derived from the binary contents. A compressed SHM segment enables COW overlay into guest memory, sharing physical pages across concurrent VMs.

Guest-host communication

Serial console – COM2 carries guest stdout/stderr, the canonical crash diagnostic transport, and a fallback result transport. The guest panic hook writes PANIC: <info>\n<bt>\n to COM2; the host parses it via extract_panic_message and surfaces the backtrace in test failure output. Delimited test results (between ===KTSTR_TEST_RESULT_START=== / ===KTSTR_TEST_RESULT_END=== sentinels) and exit codes (KTSTR_EXIT=N) are also written to COM2 as a fallback when the TLV stream is unavailable.

Virtio-console port 1 TLV stream – the primary guest-to-host data channel. Carries scenario markers (MSG_TYPE_SCENARIO_START, MSG_TYPE_SCENARIO_END), test results (MSG_TYPE_TEST_RESULT), exit codes (MSG_TYPE_EXIT), stimulus events (MSG_TYPE_STIMULUS), scheduler exit notifications (MSG_TYPE_SCHED_EXIT), profraw coverage data (MSG_TYPE_PROFRAW), per-payload-invocation metrics (MSG_TYPE_PAYLOAD_METRICS), and raw LlmExtract output (MSG_TYPE_RAW_PAYLOAD_OUTPUT). Each TLV frame has a CRC32 for integrity checking.

Virtio devices

The VMM implements three virtio-MMIO devices in addition to the serial console above. All three speak the virtio 1.x MMIO transport (virtio-v1.2 §4.2.2) with VIRTIO_F_VERSION_1 and use irqfd (eventfd → KVM GSI) for interrupt delivery.

  • virtio-blk (vmm::virtio_blk) – file-backed block device with a single request virtqueue and a token-bucket throttle. Used to give workloads real on-disk filesystems (per-test images cloned from a btrfs template). Advertises VIRTIO_BLK_F_BLK_SIZE, VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_FLUSH, and VIRTIO_RING_F_EVENT_IDX, plus VIRTIO_BLK_F_RO when configured read-only.
  • virtio-net (vmm::virtio_net) – two-virtqueue (RX, TX) NIC with an in-VMM L2 loopback backend. Used by network-shaped workloads (TCP/UDP throughput, latency) without depending on the host’s network stack. Advertises VIRTIO_NET_F_MAC so the guest binds a deterministic MAC.
  • virtio-console (vmm::virtio_console) – three-port multiport console with eight virtqueues (per virtio-v1.2 §5.3.5: two control queues plus an in/out pair per port, three ports → 2 + 2·3 = 8). Port 0 carries the interactive /dev/hvc0 console alongside the COM1/COM2 16550 serial ports; port 1 carries the guest-to-host TLV stream that delivers exit code, test result, per-payload metrics, raw payload outputs, profraw, and scheduler exit notifications; port 2 is a transparent byte-pipe relay carrying scx_stats request bytes from the host to the in-guest relay thread and the scheduler’s responses back. Advertises VIRTIO_CONSOLE_F_MULTIPORT with max_nr_ports = 3.

Performance mode

When performance_mode is enabled, the VMM applies host-side isolation (vCPU pinning, hugepages, NUMA mbind, RT scheduling), guest-visible hints (KVM_HINTS_REALTIME CPUID), and KVM exit suppression. Non-performance-mode VMs set KVM_CAP_HALT_POLL to 200us; overcommitted topologies set it to 0.

See Performance Mode for the full optimization list, prerequisites, and validation.

Dual-role architecture

The same test binary serves two roles:

Host side – manages the VM lifecycle: builds the initramfs, boots the kernel, runs the monitor, and evaluates results.

Guest side – runs inside the VM as /init (PID 1). The Rust init code (vmm::rust_init) mounts filesystems, starts the scheduler, dispatches the test function, then reboots.

The role is determined at runtime by ktstr_test_early_dispatch, a #[ctor::ctor] function that runs before main() in any binary linking against ktstr:

  • PID 1 detection: when running as PID 1, the ctor runs the guest init path, which handles the full guest lifecycle.
  • #[ktstr_test] host dispatch: when both --ktstr-test-fn and --ktstr-topo are present, the ctor boots a VM and runs the test inside it.
  • #[ktstr_test] guest dispatch: when only --ktstr-test-fn is present (no --ktstr-topo), the ctor runs the test function directly – the binary is already inside a VM.

This design means one cargo build produces everything needed for both host and guest execution. The initramfs embeds the same binary that built it.
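In shape, the dispatch decision looks like this (a minimal sketch — the helper names are hypothetical, and the real ctor handles more cases such as argument parsing and error paths):

fn early_dispatch_sketch() {
    let args: Vec<String> = std::env::args().collect();
    let has_test_fn = args.iter().any(|a| a == "--ktstr-test-fn");
    let has_topo = args.iter().any(|a| a == "--ktstr-topo");

    if std::process::id() == 1 {
        run_guest_init();       // hypothetical: mounts, scheduler, test, reboot
    } else if has_test_fn && has_topo {
        boot_vm_and_run_test(); // hypothetical: host dispatch boots a KVM VM
    } else if has_test_fn {
        run_test_directly();    // hypothetical: already inside a VM
    }
    // No ktstr flags: fall through to the normal main().
}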

Boot process

  1. Load kernel (bzImage on x86_64, Image on aarch64) via linux-loader.
  2. Set up KVM vCPUs with the specified topology.
  3. Build and load initramfs.
  4. Set up serial devices (COM1 for console, COM2 for results).
  5. Boot the kernel.
  6. Kernel starts /init (the test binary).
  7. PID 1 detected: the guest init path mounts filesystems, starts the scheduler, dispatches the test function, and reboots.

Monitor

The monitor observes scheduler state from the host side by reading guest VM memory directly. It does not instrument the guest kernel or the scheduler under test.

What it reads

The monitor resolves kernel structure offsets via BTF (BPF Type Format) from the guest kernel. It reads per-CPU runqueue structures to extract:

  • nr_running – number of runnable tasks on each CPU
  • scx_nr_running – tasks managed by the sched_ext scheduler
  • rq_clock – runqueue clock value
  • local_dsq_depth – scx local dispatch queue depth
  • scx_flags – sched_ext flags for each CPU
  • scx event counters (fallback, keep-last, offline dispatch, skip-exiting, skip-migration-disabled, reenq-immed, reenq-local-repeat, refill-slice-dfl, bypass-duration, bypass-dispatch, bypass-activate, insert-not-owned, sub-bypass-dispatch)

When CONFIG_SCHEDSTATS is enabled, the monitor also reads per-CPU struct rq schedstat fields (run_delay, pcount, sched_count, ttwu_count, etc.).

The monitor walks the struct sched_domain tree whenever BTF contains rq->sd and struct sched_domain — no CONFIG_SCHEDSTATS required. Domain tree walking starts at rq->sd (lowest level) and follows sd->parent pointers up to the root. Each domain level provides topology metadata (level, name, flags, span_weight), runtime fields (balance_interval, nr_balance_failed, max_newidle_lb_cost), and optional fields (newidle_call, newidle_success, newidle_ratio — added in 7.0, backported to 6.18.5+ and 6.12.65+; absent on 6.16-6.18.4). When CONFIG_SCHEDSTATS is also enabled, each domain additionally provides load balancing stats: lb_count, lb_failed, lb_balanced, alb_pushed, ttwu_wake_remote, and other counters indexed by idle type (CPU_NOT_IDLE, CPU_IDLE, CPU_NEWLY_IDLE).

Sampling

The monitor takes periodic snapshots (MonitorSample) of all per-CPU state. Each sample captures a point-in-time view of every CPU.

MonitorSummary aggregates samples into peak values (max imbalance ratio, max DSQ depth, stall detection), per-sample averages (imbalance ratio, nr_running per CPU, DSQ depth per CPU), and event counter deltas. Averages are computed over valid samples only (excluding uninitialized guest memory).

Threshold evaluation

MonitorThresholds defines pass/fail conditions:

pub struct MonitorThresholds {
    pub max_imbalance_ratio: f64,
    pub max_local_dsq_depth: u32,
    pub fail_on_stall: bool,
    pub sustained_samples: usize,
    pub max_fallback_rate: f64,
    pub max_keep_last_rate: f64,
}

A violation must persist for sustained_samples consecutive samples before triggering a failure. This filters transient spikes from cpuset transitions and cgroup creation/destruction.
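A configuration might look like this — the field names come from the struct above, but the values are illustrative, not the defaults:

let thresholds = MonitorThresholds {
    max_imbalance_ratio: 2.0,  // peak runnable-task imbalance across CPUs
    max_local_dsq_depth: 64,   // per-CPU local DSQ depth ceiling
    fail_on_stall: true,
    sustained_samples: 3,      // violation must persist 3 consecutive samples
    max_fallback_rate: 0.05,   // rate fields: consult the rustdoc for units
    max_keep_last_rate: 0.5,
};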

Stall detection

A stall is detected when a CPU’s rq_clock does not advance between consecutive samples. Three exemptions prevent false positives:

  • Idle CPUs: when nr_running == 0 in both the current and previous sample, the CPU has no runnable tasks. The kernel stops the tick (NOHZ) on idle CPUs, so rq_clock legitimately does not advance. These CPUs are excluded from stall checks.

  • Preempted vCPUs: when the vCPU thread’s CPU time did not advance past the preemption threshold between samples, the host preempted the vCPU. These samples are excluded from stall checks.

  • Sustained window: stall detection uses per-CPU consecutive counters and the sustained_samples threshold, matching how imbalance and DSQ depth checks work. A single stuck sample does not trigger failure – the stall must persist for sustained_samples consecutive samples on the same CPU.

Uninitialized memory detection

Before the guest kernel initializes per-CPU structures, monitor reads return uninitialized data. Two layers handle this:

  • Summary computation (MonitorSummary::from_samples): skips individual samples where any CPU’s local_dsq_depth exceeds DSQ_PLAUSIBILITY_CEILING (10,000) via sample_looks_valid().

  • Threshold evaluation (MonitorThresholds::evaluate): checks all samples globally for plausibility. If all rq_clock values are identical across every CPU and sample, or any sample exceeds the plausibility ceiling, the entire report is passed as “not yet initialized” — no per-threshold checks run.

BPF map introspection

The monitor module also provides host-side BPF map discovery and read/write access via bpf_map::BpfMapAccessor. The host reads and writes guest BPF maps directly through the physical memory mapping — no guest cooperation or BPF syscalls are needed.

GuestMem

GuestMem wraps a host pointer to the start of guest DRAM and provides bounds-checked volatile reads and writes for scalar types (u8/u32/u64). Byte-slice reads (read_bytes) use copy_nonoverlapping. It also implements x86-64 page table walks (translate_kva) for both 4-level and 5-level paging, and granule-agnostic aarch64 walks (4 KB / 16 KB / 64 KB; level count derived from TCR_EL1’s TG1 + T1SZ fields).

Scalar accesses use volatile semantics because the guest kernel modifies memory concurrently.

GuestKernel

GuestKernel builds on GuestMem by adding kernel symbol resolution and address translation. It parses the vmlinux ELF symbol table at construction and resolves paging configuration (PAGE_OFFSET, CR3, 4-level vs 5-level) from guest memory. Subsequent reads use cached state.

Three address translation modes are supported:

  • Text/data/bss: kva - __START_KERNEL_map. For statically-linked kernel variables (read_symbol_*, write_symbol_*).
  • Direct mapping: kva - PAGE_OFFSET. For SLAB allocations, per-CPU data, physically contiguous memory (read_direct_*).
  • Vmalloc/vmap: Page table walk via CR3. For BPF maps, vmalloc’d memory, module text (read_kva_*, write_kva_*).
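A hedged sketch of how the three modes surface in the accessor API. The typed method spellings (read_symbol_u64, read_kva_u32) are assumed expansions of the wildcards above, not confirmed signatures:

// Statically-linked kernel variable: text/data/bss translation.
let jiffies = kernel.read_symbol_u64("jiffies_64")?;

// Vmalloc'd BPF map value: page table walk via CR3.
let word = kernel.read_kva_u32(map_value_kva)?;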

BpfMapAccessor

BpfMapAccessor resolves BTF offsets for BPF map kernel structures (struct bpf_map, struct bpf_array, struct xa_node, struct idr) and provides map discovery and value read/write. It borrows a GuestKernel for address translation.

BpfMapAccessorOwned is a convenience wrapper that owns the GuestKernel internally. Use BpfMapAccessor::from_guest_kernel when you already have a GuestKernel; use BpfMapAccessorOwned::new when you want a self-contained accessor.

Map discovery walks the kernel’s map_idr xarray:

  1. Read map_idr (BSS symbol, text mapping translation)
  2. Walk xa_node tree (SLAB-allocated, direct mapping translation)
  3. Read struct bpf_map fields. The allocation may be kmalloc’d or vmalloc’d depending on size and flags, so the translation uses translate_any_kva which handles both paths rather than assuming either.

find_map searches by name suffix (e.g. ".bss" matches "mitosis.bss"). Only BPF_MAP_TYPE_ARRAY maps are returned. Use maps() to enumerate all map types without filtering.

Value access for BPF_MAP_TYPE_ARRAY maps reads/writes the inline bpf_array.value flex array at the BTF-resolved offset. The value region is vmalloc’d, so each byte access goes through the page table walker to handle page boundaries.

For BPF_MAP_TYPE_PERCPU_ARRAY maps, bpf_array.pptrs[key] holds a __percpu pointer (at the same union offset as value). Adding __per_cpu_offset[cpu] yields the per-CPU KVA in the direct mapping. read_percpu_array returns one Option<Vec<u8>> per CPU: Some when the per-CPU PA falls within guest memory, None when it does not.

Typed field access

When a map has BTF metadata (btf_kva != 0), resolve_value_layout reads the guest’s struct btf and its data blob, parses it with btf_rs, and resolves the value struct’s fields. This enables read_field / write_field with type-checked BpfValue variants.

Usage example

Find a scheduler’s .bss map and write a crash variable:

let offsets = BpfMapOffsets::from_vmlinux(vmlinux)?;
let accessor = BpfMapAccessor::from_guest_kernel(&kernel, &offsets)?;
let bss = accessor.find_map(".bss").expect(".bss map not found");
accessor.write_value_u32(&bss, crash_offset, 1);
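When the map carries BTF metadata, the same accessor supports typed reads. A hedged sketch — the field name, the exact read_field signature, and the BpfValue variant spelling are all assumptions:

// Assumes read_field resolves "crashing" against the value struct's BTF layout.
if let BpfValue::U32(v) = accessor.read_field(&bss, "crashing")? {
    println!("crashing = {v}");
}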

BpfMapWrite

BpfMapWrite specifies a host-side write to a BPF map during VM execution. The test runner waits for the scheduler to load (map becomes discoverable), writes the value, then signals the guest via SHM to start the scenario.

pub struct BpfMapWrite {
    pub map_name_suffix: &'static str,  // e.g. ".bss"
    pub offset: usize,                  // byte offset in the map value
    pub value: u32,                     // value to write
}

Use with #[ktstr_test] via the bpf_map_write attribute:

const BPF_CRASH: BpfMapWrite = BpfMapWrite {
    map_name_suffix: ".bss",
    offset: 42,
    value: 1,
};

#[ktstr_test(bpf_map_write = BPF_CRASH, expect_err = true)]
fn crash_test(ctx: &Ctx) -> Result<AssertResult> {
    Ok(AssertResult::pass())
}

The map is discovered by name suffix via BpfMapAccessor::find_map. Only BPF_MAP_TYPE_ARRAY maps are supported. The write targets a u32 at the specified byte offset within the map’s value region.

Prerequisites

  • vmlinux: Required for ELF symbols and BTF. Must match the guest kernel. Symbols include phys_base so the runtime KASLR offset can be resolved via a page-table walk through the BSP’s CR3, breaking the chicken-and-egg between text-symbol PA translation and KASLR.

Cast analysis

BPF maps frequently store kernel pointers (task_struct *, cgroup *, …) and arena pointers in u64 fields because BTF cannot express a pointer to a per-allocation type. Without intervention the renderer treats them as integers and the failure dump shows raw 0xffff…ffff values with no further chase.

The cast analyzer (monitor::cast_analysis::analyze_casts) closes that gap. The freeze coordinator runs it once per scheduler load, before any periodic capture or on-demand snapshot would consume its output:

  1. The host loads the scheduler binary and locates each .bpf.o ELF in the build artifacts.
  2. Each program section is decoded through cast_analysis::BpfInsn::from_le_bytes into a flat &[BpfInsn] slab; relocations against .bss / .data / .rodata annotate the corresponding BPF_LD_IMM64 PCs with their datasec target.
  3. analyze_casts walks the slab forward, tracking register and stack-slot state for each instruction. Two detection paths feed the output: the arena pointer path (LDX through a previously loaded u64 field) and the kernel kptr path (STX of a typed pointer register into a u64 field). Function-entry seeding from bpf_func_info reseeds R1..R5 from the BTF FuncProto so typed parameters propagate correctly across subprogram joins.
  4. The result is a CastMap (BTreeMap<(source_struct_btf_id, field_byte_offset), CastHit>) cached on the per-VM KtstrVm.cast_map (a LazyCastMap that runs the analyzer on first dump and caches the result process-wide by scheduler binary content hash). The freeze coordinator threads the cached CastMap through DumpContext::cast_map into every per-map render so the renderer can consult it at every dump site.
  5. render_cast_pointer in monitor::btf_render consumes CastHit via MemReader::cast_lookup. When a u64 field at a recorded (struct, offset) is rendered, the renderer chases the pointer through the address-space-appropriate reader (arena vs slab/vmalloc) and tags the result with a cast_annotation of "cast→arena" or "cast→kernel" (plus a (sdt_alloc) suffix when the bridge described below fired). Failure dumps show the annotation alongside the resolved struct fields, so cast-recovered pointers are visually distinct from BTF-typed ones.

The renderer also consults an sdt_alloc bridge whenever a chase target peels to a BTF_KIND_FWD forward declaration (typical for struct sdt_data __arena * fields whose body lives in the sdt_alloc library’s BTF rather than the scheduler’s program BTF). The dump-state pre-pass walks each live scx_allocator and populates a slot_start → ArenaSlotInfo index — one entry per live allocator slot, carrying elem_size, header_size, and the resolved payload BTF type id — that MemReader::resolve_arena_type (in dump::render_map::AccessorMemReader) range-looks up during the chase.

The lookup finds the slot whose [slot_start, slot_start + elem_size) range contains the chased address and routes by offset_in_slot: a slot-start chase (offset == 0, e.g. the data field of scx_task_map_val storing the raw sdt_alloc() return) returns the payload type id with header_skip = header_size; a payload-start chase (offset == header_size, e.g. the return of scx_task_data(p) cached in cached_taskc_raw) returns the same payload type id with header_skip = 0. The renderer reads header_skip + btf_size bytes from the chased address, slices off the leading header_skip bytes, and renders the payload struct.

The resulting Ptr carries a sdt_alloc-flavoured annotation: "sdt_alloc" on the BTF-typed Type::Ptr arm, and "cast→arena (sdt_alloc)" / "cast→kernel (sdt_alloc)" on the cast-analyzer-driven path. The sdt_alloc bridge fires only when the BTF-only resolve has already exhausted same-name siblings; false-positive risk on that arm is bounded by the arena-window range check (MemReader::resolve_arena_type returns None for addresses outside every known allocator slot).

A separate cross-BTF Fwd resolution path covers the case where a BTF_KIND_FWD pointee’s body lives in a sibling embedded BPF object’s BTF rather than an sdt_alloc slot — the typical multi-.bpf.objs shape where one object declares struct cgx_target; (forward) and a sibling object defines struct cgx_target { ... } (full body). The cast-analysis pre-pass (vmm::cast_analysis_load::build_fwd_index) walks every parsed embedded program BTF and records a name -> (btfs index, type_id) entry for every complete (!is_fwd) Type::Struct / Type::Union. First-write-wins on duplicate names: when the same name appears in multiple BTFs the index keeps the first-seen entry. Anonymous types and Typedef are not indexed (no name to key on, and typedefs add no body — the chase peels through them via peel_modifiers_with_id before consulting the index).

The index is threaded through DumpContext::cross_btf and exposed to the renderer via MemReader::cross_btf_resolve_fwd. When chase_arena_pointer / render_cast_pointer peel a chase target through peel_modifiers_resolving_fwd and the local same-BTF sibling search came up empty, try_cross_btf_fwd_resolve consults the cross-BTF index by the Fwd’s name (and aggregate kind — struct vs union); a hit returns a CrossBtfRef { btf, type_id } and the chase recursion switches to the resolved sibling BTF for the pointee render.

Cross-BTF resolution does NOT introduce a new annotation — the body is recovered transparently and the rendered subtree carries the cast or BTF-typed annotation it would have had if the same struct lived in the entry BTF. Unlike the sdt_alloc bridge the cross-BTF index is consulted whenever a Fwd terminal survives the local resolve — there is no arena-window gate, since the lookup is purely a name-keyed BTF table and a name miss simply leaves the chase on its existing “forward declaration; body not in this BTF” skip path.

The analyzer is deliberately conservative: branch joins reset register and stack state, conflicts drop the offending entry, and self-stores are rejected. False negatives fall back to raw u64 (the prior behavior); false positives would chase garbage and are avoided. The analysis is unconditional — no test-author configuration, no opt-in flag — and the freeze coordinator wires the resulting CastMap through every snapshot, periodic capture, and failure dump.

Probe pipeline

The probe pipeline captures function arguments and struct fields during auto-repro. It operates inside the guest VM (not from the host), using two BPF skeletons that share maps.

Architecture

crash stack -> extract functions -> BTF resolve -> load skeletons -> poll
                                                         |
                                    kprobe skeleton      |     fentry/fexit skeleton
                                    (kernel entry)       |     (BPF entry + kernel exit)
                                         |               |          |
                                         v               v          v
                                    func_meta_map  <--shared-->  probe_data
                                                         |        (entry + exit fields)
                                              trigger fires (ring buffer)
                                                         |
                                              read probe_data entries
                                                         |
                                              stitch by tptr
                                                         |
                                              format with entry→exit diffs

Kprobe skeleton (probe.bpf.c)

Attaches to kernel functions via attach_kprobe. The BPF handler:

  1. Gets the function IP via bpf_get_func_ip
  2. Looks up func_meta from func_meta_map (keyed by IP)
  3. Captures 6 raw args from pt_regs
  4. Dereferences struct fields via BTF-resolved offsets
  5. Reads char * string params if configured
  6. Stores result in probe_data (keyed by (func_ip, task_ptr))

The trigger fires via tp_btf/sched_ext_exit (inside scx_claim_exit()) and sends an EVENT_TRIGGER via ring buffer with the current task pointer and kernel stack.

Fentry/fexit skeleton (fentry_probe.bpf.c)

Handles both BPF struct_ops callbacks and kernel function exit capture. Loaded in batches of 4 fentry + 4 fexit programs per skeleton instance via set_attach_target. Shares probe_data and func_meta_map with the kprobe skeleton via reuse_fd.

A per-slot is_kernel rodata flag controls argument access:

  • BPF callbacks (is_kernel=0): ctx[0] is a void pointer to the real callback arguments. The handler dereferences through it. Uses sentinel IPs (func_idx | (1<<63)) in func_meta_map.
  • Kernel functions (is_kernel=1): args are directly in ctx[0..5]. Uses bpf_get_func_ip(ctx) for the real IP, matching the kprobe entry handler’s key.

Fexit handlers look up the existing probe_data entry (written by fentry or kprobe at function entry) and re-read struct fields into exit_fields. This captures post-mutation state for paired display.

BTF resolution

Two BTF sources:

  • vmlinux BTF (btf-rs): resolves kernel struct offsets. Types in STRUCT_FIELDS (task_struct, rq, scx_dispatch_q, etc.) use curated field lists with chained pointer dereferences (e.g. ->cpus_ptr->bits[0]). Other struct pointer params get scalar, enum, and cpumask pointer fields auto-discovered from vmlinux BTF.

  • Program BTF (libbpf-rs): resolves BPF-local struct offsets for types not in vmlinux (e.g. scheduler-defined task_ctx). Auto-discovers scalar, enum, and cpumask pointer fields.

Callback signatures are resolved by:

  1. ____name inner function in program BTF (typed params)
  2. sched_ext_ops member in vmlinux BTF (fallback)
  3. Wrapper function (void *ctx, no useful params)

Field decoding

The output formatter decodes field values based on their key name:

  • dsq_id -> SCX_DSQ_INVALID, SCX_DSQ_GLOBAL, SCX_DSQ_LOCAL, SCX_DSQ_BYPASS, SCX_DSQ_LOCAL_ON|{cpu}, BUILTIN({v}), DSQ(0x{hex})
  • cpumask_0..3 -> coalesced into one cpus_ptr field rendered as 0x{hex}({cpu-list}) — the masked hex of the cpumask words (low-order word first; multi-word masks join with _ between 64-bit chunks) followed by the run-length-collapsed CPU range list (e.g. 0xf(0-3), 0x1_00000000000000ff(0-7,64))
  • enq_flags -> WAKEUP|HEAD|PREEMPT
  • exit_kind -> ERROR, ERROR_BPF, ERROR_STALL, etc.
  • scx_flags -> QUEUED|ENABLED
  • sticky_cpu -> -1 for 0xffffffff

Event stitching

After the trigger fires, all probe_data entries are read, matched to functions by IP, then filtered to a single task’s scheduling journey:

  1. Read the task_struct pointer from the trigger event’s bpf_get_current_task() value (args[0])
  2. For functions with a task_struct parameter: keep events where args[param_idx] == tptr
  3. For functions without a task_struct parameter: keep events where task_ptr == tptr (matched via bpf_get_current_task() at probe time)

Events are sorted by timestamp for chronological output.

Worker Processes

Workers are the processes that generate load for scenarios. They run inside the VM, each in its own cgroup.

Fork, not threads

Workers are fork()ed processes. Cgroups operate on PIDs, so each worker must be a separate process to be independently placed in a cgroup.

Two-phase start

Workers wait on a pipe for a “start” signal after fork:

  1. Parent forks the worker.
  2. Worker installs SIGUSR1 handler, then blocks on pipe read.
  3. Parent moves the worker to its target cgroup.
  4. Parent writes to the pipe, signaling the worker to start.

This ensures workers run inside their target cgroup from the first instruction of their workload.
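In shape, the protocol looks like this — a minimal sketch using raw libc; the real worker code layers telemetry, report pipes, and error handling on top, and run_workload / move_to_cgroup are hypothetical helpers:

let mut fds = [0i32; 2];
unsafe { libc::pipe(fds.as_mut_ptr()) };
let (start_rd, start_wr) = (fds[0], fds[1]);

match unsafe { libc::fork() } {
    0 => {
        // Child: park on the pipe until the parent writes the start byte.
        let mut byte = 0u8;
        unsafe { libc::read(start_rd, &mut byte as *mut u8 as *mut _, 1) };
        run_workload(); // first workload instruction runs inside the target cgroup
        unsafe { libc::_exit(0) };
    }
    child => {
        // Parent: cgroup placement happens while the child is still parked.
        move_to_cgroup("cg_0", child);
        let go = 1u8;
        unsafe { libc::write(start_wr, &go as *const u8 as *const _, 1) };
    }
}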

Custom work types

WorkType::Custom workers follow the same two-phase start (fork, cgroup placement, start signal), and the framework applies affinity and scheduling policy before handing control to the user function. After setup, the run function pointer takes over entirely – the framework work loop is bypassed.

Stop protocol

Workers install a SIGUSR1 handler that sets an atomic STOP flag. The main work loop checks this flag each iteration. On stop:

  1. Parent sends SIGUSR1 to all workers.
  2. Workers exit their work loop.
  3. Workers serialize their WorkerReport to a pipe.
  4. Parent reads reports and waits for child exit.
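The flag itself is the classic async-signal-safe pattern; a minimal sketch (do_work_unit is a hypothetical stand-in for one work-loop iteration, and report serialization is elided):

use std::sync::atomic::{AtomicBool, Ordering};

static STOP: AtomicBool = AtomicBool::new(false);

extern "C" fn on_sigusr1(_sig: libc::c_int) {
    STOP.store(true, Ordering::Relaxed); // flag store only: async-signal-safe
}

unsafe { libc::signal(libc::SIGUSR1, on_sigusr1 as libc::sighandler_t) };
while !STOP.load(Ordering::Relaxed) {
    do_work_unit(); // one iteration of the work loop
}
// ...serialize the WorkerReport to the report pipe, then exit.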

Telemetry

Each worker produces a WorkerReport:

pub struct WorkerReport {
    pub tid: i32,
    pub work_units: u64,
    pub cpu_time_ns: u64,
    pub wall_time_ns: u64,
    pub off_cpu_ns: u64,
    pub migration_count: u64,
    pub cpus_used: BTreeSet<usize>,
    pub migrations: Vec<Migration>,
    pub max_gap_ms: u64,
    pub max_gap_cpu: usize,
    pub max_gap_at_ms: u64,
    pub resume_latencies_ns: Vec<u64>,
    pub wake_sample_total: u64,
    pub iteration_costs_ns: Vec<u64>,
    pub iteration_cost_sample_total: u64,
    pub iterations: u64,
    pub schedstat_run_delay_ns: u64,
    pub schedstat_run_count: u64,
    pub schedstat_cpu_time_ns: u64,
    pub completed: bool,
    pub numa_pages: BTreeMap<usize, u64>,
    pub vmstat_numa_pages_migrated: u64,
    pub exit_info: Option<WorkerExitInfo>,
    pub is_messenger: bool,
    pub group_idx: usize,
    pub affinity_error: Option<String>,
}

pub enum WorkerExitInfo {
    Exited(i32),
    Signaled(i32),
    TimedOut,
    WaitFailed(String),
    /// Thread-mode worker panicked. Exclusive to `CloneMode::Thread`;
    /// fork workers surface panics via `Exited(1)` or
    /// `Signaled(SIGABRT)` depending on the panic strategy.
    Panicked(String),
}

iteration_costs_ns mirrors resume_latencies_ns for per-iteration wall-clock cost: a reservoir-sampled vector capped at MAX_WAKE_SAMPLES entries, paired with iteration_cost_sample_total for the total observation count when the cap is exceeded. group_idx is 0 for the primary group and 1..=N for composed WorkSpec entries in declaration order (mirrors WorkloadConfig::composed). affinity_error is Some(reason) when the worker’s sched_setaffinity / mbind setup failed; the worker still runs and produces a report but the field documents the divergence from the requested affinity contract.

Several fields worth calling out explicitly:

  • wake_sample_total — the TOTAL number of wake-latency observations the worker saw, including samples the reservoir sampler dropped. resume_latencies_ns is clamped to at most 100_000 entries (MAX_WAKE_SAMPLES); on a long run that accumulates more wakes than the cap, the vector stays at the cap while this counter keeps climbing. Host-side consumers reporting “total wakeups observed” read wake_sample_total; percentile / CV computations read resume_latencies_ns.

  • completed — true when the worker reached its natural end (outer loop observed STOP and exited cleanly, or a Custom-closure payload returned from its run). Sentinel reports synthesised by stop_and_collect’s JSON-parse fallback carry false. Lets consumers distinguish “ran to completion, saw zero iterations” from “died / timed out before recording anything.”

  • is_messenger — true only for the messenger worker in a FutexFanOut / FanOutCompute group (the single writer that advances the shared generation and issues futex_wake). Enables per-worker latency-participation assertions — receivers produce resume_latencies_ns entries, messengers record wake-side work but no resume latency.

  • off_cpu_ns = wall_time_ns - cpu_time_ns

  • exit_info is None on every live-worker-authored report. stop_and_collect synthesises a sentinel WorkerReport with Some(_) when the worker handed back no (or unparseable) JSON, using the WorkerExitInfo enum (Exited(code) / Signaled(signum) / TimedOut / WaitFailed(String) — the string carries the underlying waitpid errno rendering) to preserve the reap shape for post-mortem.

  • Migrations are tracked every 1024 work units: after each outer iteration the worker checks work_units.is_multiple_of(1024) and runs the migration-detect body iff that is true. The check runs exactly once per outer iteration, so the effective period in outer iterations is 1024 / gcd(units_per_iter, 1024). Default parameters assumed unless noted:

    • Every outer iteration (period = 1 iter): SpinWait (1024), Mixed (1024), Bursty (each outer iter runs spin_burst(1024) some number of times inside the burst_ms loop — always a multiple of 1024), PipeIo (burst_iters=1024), FutexPingPong (spin_iters=1024), CachePressure (1024 strided RMW steps), CacheYield (1024 strided RMW steps), CachePipe (burst_iters=1024), FutexFanOut messenger AND receiver (both call spin_burst(spin_iters) before splitting roles; default 1024), AffinityChurn (spin_iters=1024), PolicyChurn (spin_iters=1024).
    • Every 2 iterations: NiceSweep (spin_burst(512) per iter → gcd(512, 1024) = 512).
    • Every 4 iterations: MutexContention (work_iters=1024 + hold_iters=256 = 1280 per acquire+release → gcd(1280, 1024) = 256, period = 4 iters). FanOutCompute messenger (spin_burst(256) per wake cycle → same 256-unit gcd).
    • Every 16 iterations: PageFaultChurn — one persistent MAP_PRIVATE | MAP_ANONYMOUS region per worker (default 4 MiB via region_kb=4096), re-faulted each outer iteration via madvise(MADV_DONTNEED). Each iteration contributes touches_per_cycle=256 page writes (each first write after MADV_DONTNEED triggers a minor fault; a birthday-collision xorshift64 index may revisit a page already faulted this cycle, so the fault count is a ceiling, not a floor) + spin_iters=64 = 320 work units (gcd(320, 1024) = 64).
    • Every 64 iterations: IoSyncWrite (16 4-KiB writes per write-then-sleep pair → gcd(16, 1024) = 16); IoRandRead and IoConvoy use the same 64-iteration cadence for their per-iteration pread/pwrite mixes.
    • Every 1024 iterations: YieldHeavy (1 unit per yield), ForkExit (1 unit per fork+wait), FanOutCompute worker (operations=5 matrix multiplies per wake, one work_units tick per multiply → gcd(5, 1024) = 1).
    • Phase-inherited: Sequence inherits whichever phase is currently active — Spin / Yield / Io use the same per-unit accounting as the SpinWait / YieldHeavy / IoSyncWrite groups above; Sleep contributes no work_units and so pauses migration checks while it runs.
    • Not tracked by the framework: Custom workers do not contribute to work_units on the framework’s behalf — migration tracking fires only if the user’s run function increments work_units and emits migrations directly.
  • Scheduling gaps (max_gap_ms, max_gap_cpu, max_gap_at_ms) record the longest wall-clock interval between consecutive 1024-work-unit migration-check points plus the CPU the gap was observed on and its time from start. High values indicate preemption or descheduling near a checkpoint boundary. The checkpoint cadence — and therefore the gap-measurement cadence — is governed by the same work_units.is_multiple_of(1024) test that the migration tracker uses, so the effective measurement period in outer iterations matches the per-WorkType tables above.

Benchmarking fields

Workers collect two categories of timing data:

Per-wakeup latency (resume_latencies_ns): timestamp-based samples recorded around blocking operations. Populated for work types with a blocking step: Bursty (sleep), PipeIo (pipe read), FutexPingPong (futex wait), FutexFanOut (futex wait, receivers only), FanOutCompute (futex wait, workers only — measured as CLOCK_MONOTONIC delta from messenger’s shared timestamp), CacheYield (yield), CachePipe (pipe read), IoSyncWrite / IoRandRead / IoConvoy (pread / pwrite / fdatasync blocking), NiceSweep (yield), AffinityChurn (yield), PolicyChurn (yield), MutexContention (futex wait on contended acquire), ForkExit (parent’s waitpid wait), and Sequence when its phases include Sleep, Yield, or Io. Each sample is in nanoseconds; most work types use Instant::elapsed() across the blocking call, while FanOutCompute uses clock_gettime(CLOCK_MONOTONIC) to measure against the messenger’s pre-wake timestamp.

schedstat deltas: read from /proc/self/schedstat at work-loop start and end. Three fields:

  • schedstat_cpu_time_ns – delta of field 1 (on-CPU time)
  • schedstat_run_delay_ns – delta of field 2 (time spent waiting for a CPU)
  • schedstat_run_count – delta of field 3 (pcount — scheduler-in count: incremented each time the scheduler picks this task to execute, across CFS/EEVDF, FIFO/RR, and sched_ext alike). Not a context-switch count — a task that keeps running on the same CPU without leaving the runqueue does not see pcount advance while it runs. For true context-switch counts read /proc/<pid>/status’s voluntary_ctxt_switches and nonvoluntary_ctxt_switches; the worker reads pcount instead because schedstat delivers it alongside run_delay / cpu_time in a single file read.

iterations counts outer-loop iterations.
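A host-side consumer combining these fields might look like this. Only the field names come from the WorkerReport struct above; the helper itself is illustrative:

fn p99_resume_latency(report: &WorkerReport) -> Option<u64> {
    let mut samples = report.resume_latencies_ns.clone();
    if samples.is_empty() {
        return None;
    }
    samples.sort_unstable();
    // wake_sample_total can exceed samples.len() once the reservoir cap is hit.
    eprintln!(
        "{} wakes observed, {} sampled",
        report.wake_sample_total,
        samples.len()
    );
    Some(samples[(samples.len() - 1) * 99 / 100])
}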

NUMA fields

numa_pages: per-NUMA-node page counts parsed from /proc/self/numa_maps after the workload completes. Keyed by node ID. Empty when numa_maps is unavailable.

vmstat_numa_pages_migrated: delta of the numa_pages_migrated counter from /proc/vmstat between pre- and post-workload snapshots. Measures cross-node page migrations during the test.

These fields feed the NUMA checking thresholds.

Custom workers produce their own WorkerReport. The framework does not populate any telemetry fields for Custom – migration tracking, gap detection, schedstat deltas, NUMA page counts, and iteration counters are only present if the user’s run function fills them.

Worker-progress watchdog

Workers send SIGUSR2 to the scheduler when stuck > 2 seconds. The default POSIX disposition terminates the scheduler process, which ktstr detects as a scheduler death and captures the sched_ext dump from dmesg.

In repro mode, the watchdog is disabled to keep the scheduler alive for BPF probe assertions. The watchdog does not fire for Custom workers because they bypass the framework work loop.

RAII cleanup

WorkloadHandle implements Drop: it sends SIGKILL to all child processes and waits for them. This prevents orphaned worker processes on error paths.

WorkloadHandle

WorkloadHandle is the RAII handle to spawned worker processes. It manages the lifecycle of forked workers: spawning, start signaling, stop/collection, and cleanup.

use ktstr::prelude::*;

#[must_use = "dropping a WorkloadHandle immediately kills all worker processes"]
pub struct WorkloadHandle { /* ... */ }

Spawning

let config = WorkloadConfig {
    num_workers: 4,
    work_type: WorkType::Mixed,
    ..Default::default()
};
let mut handle = WorkloadHandle::spawn(&config)?;

Set only the fields that matter for the test and let ..Default::default() fill in the rest. The spread-default form is the canonical style in the ktstr codebase — it keeps examples pinned to intent (num_workers, work_type) and has already absorbed additions to WorkloadConfig (the NUMA memory-policy fields) without rotting. Consult the WorkloadConfig rustdoc for the current field list.

spawn() forks num_workers child processes. Each child installs a SIGUSR1 handler, then blocks on a pipe waiting for the start signal. Workers do not begin their workload until start() is called.

For grouped work types (PipeIo, CachePipe, FutexPingPong, FutexFanOut), spawn() validates that num_workers is divisible by the group size and sets up inter-worker communication (pipes for PipeIo/CachePipe, shared mmap pages for FutexPingPong/FutexFanOut).

Methods

worker_pids() -> Vec<libc::pid_t> – PIDs of all worker processes. Used with CgroupManager::move_task() or move_tasks() to place workers in cgroups before starting them.

start() – signals all workers to begin their workload by writing to their start pipes. Idempotent: calling it twice has no effect. Call this after moving workers into their target cgroups.

set_affinity(idx, cpus) -> Result<()> – sets CPU affinity for the worker at index idx via sched_setaffinity. Use this for per-worker pinning outside any cgroup, or when you need to change one worker’s affinity without disturbing the rest. When all workers in a cgroup should share the same CPU set, prefer CgroupGroup::add_cgroup — it creates the cgroup, writes cpuset.cpus once for the whole cgroup, and RAII-removes the cgroup on drop (including error paths). Reach for CgroupManager::set_cpuset directly only when the cgroup’s lifetime must outlive the current scope; the RAII wrapper is the default because it cleans up on every error path.

snapshot_iterations() -> Vec<u64> – reads all workers’ current iteration counts from a shared memory region (MAP_SHARED). Each count is monotonically increasing, read with relaxed ordering. Returns an empty vec if no workers were spawned. Call periodically during the workload’s run window to sample forward progress (e.g. to detect stalls or compute instantaneous rates); the final per-worker totals come back through stop_and_collect().

stop_and_collect(self) -> Vec<WorkerReport> – sends SIGUSR1 to all workers, reads their serialized WorkerReport from report pipes, and waits for exit. Auto-starts workers if start() was not called. Workers that do not respond within a shared 5-second deadline are killed with SIGKILL. Consumes the handle.
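For example, snapshot_iterations() supports a stall poller during the run window (a sketch; the once-per-second cadence and the zero-delta check are illustrative):

let mut last = handle.snapshot_iterations();
for _ in 0..10 {
    std::thread::sleep(std::time::Duration::from_secs(1));
    let now = handle.snapshot_iterations();
    for (i, (prev, cur)) in last.iter().zip(&now).enumerate() {
        if cur == prev {
            eprintln!("worker {i}: no forward progress in the last second");
        }
    }
    last = now;
}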

Typical usage

// 1. Spawn workers (blocked, waiting for start signal)
let mut handle = WorkloadHandle::spawn(&config)?;

// 2. Move workers into their target cgroup. `cgroup.procs` is
//    tgid-scoped, so use `worker_pids_for_cgroup_procs()` — it
//    bails for Thread-mode workers (whose pids share the harness's
//    tgid) and points at `cgroup.threads` instead. Plain
//    `worker_pids()` returns the raw pid set without the
//    cgroup-procs safety check.
ctx.cgroups.move_tasks("cg_0", &handle.worker_pids_for_cgroup_procs()?)?;

// 3. Signal workers to start
handle.start();

// 4. Wait for workload duration
std::thread::sleep(ctx.duration);

// 5. Stop workers and collect telemetry
let reports: Vec<WorkerReport> = handle.stop_and_collect();

Drop behavior

Dropping a WorkloadHandle without calling stop_and_collect() sends SIGKILL to all child processes and waits for them. This prevents orphaned worker processes on error paths. Shared mmap regions (futex pages and iteration counters) are unmapped on drop.

See also: CgroupManager for cgroup operations, CgroupGroup for RAII cleanup, TestTopology for cpuset generation, Worker Processes for the two-phase start protocol and telemetry details.

CgroupManager

CgroupManager manages cgroup v2 filesystem operations. It creates, configures, and removes cgroups under a parent directory.

use ktstr::prelude::*;

pub struct CgroupManager {
    parent: PathBuf,
}

Construction

use std::collections::BTreeSet;

let cgroups = CgroupManager::new("/sys/fs/cgroup/ktstr");
let mut controllers = BTreeSet::new();
controllers.insert(Controller::Cpuset);
controllers.insert(Controller::Cpu);
cgroups.setup(&controllers)?; // create parent dir, enable cpuset + cpu controllers

new() sets the parent path. setup() takes a &BTreeSet<Controller> (variants: Cpuset, Cpu, Memory, Pids, Io), creates the parent directory if it does not exist, and enables the requested controllers on every ancestor from /sys/fs/cgroup down to the parent by writing to each level’s cgroup.subtree_control. An empty set creates the directory and returns without touching subtree_control. The deterministic BTreeSet iteration order keeps the rendered subtree_control write stable between runs.

Methods

parent_path() -> &Path – returns the parent cgroup directory path.

create_cgroup(name) – creates a child cgroup directory. Idempotent: no error if the directory already exists. Supports nested paths (e.g. "nested/deep"). For nested paths, enables +cpuset on intermediate cgroups’ subtree_control.

remove_cgroup(name) – drains tasks from the child cgroup to the cgroup filesystem root, then removes the directory. No error if the cgroup does not exist.

set_cpuset(name, cpus) – writes cpuset.cpus for a child cgroup. The BTreeSet<usize> is formatted as a compact range string via TestTopology::cpuset_string() (e.g. "0-3,5,7-9").

clear_cpuset(name) – writes an empty string to cpuset.cpus, which inherits the parent’s cpuset.

move_task(name, pid) – writes a single PID to the child cgroup’s cgroup.procs.

move_tasks(name, pids) – moves all PIDs from a slice into the child cgroup. Tolerates ESRCH (task exited between listing and migration) with a warning. Retries EBUSY up to 3 times with 100ms backoff for transient rejections from sched_ext BPF cgroup_prep_move callbacks. Propagates EBUSY after retries exhausted. Propagates all other errors immediately.

drain_tasks(name) – moves all tasks from a child cgroup to the cgroup filesystem root (/sys/fs/cgroup) by reading cgroup.procs and writing each PID to the root’s cgroup.procs. Drains to root because the parent has subtree_control set and the kernel’s no-internal-process constraint rejects writes to a cgroup with active controllers.

cleanup_all() – recursively removes all child cgroups under the parent (depth-first), draining tasks at each level. Keeps the parent directory itself.

Timeout protection

All cgroup filesystem writes use a 2-second timeout. The write runs in a spawned thread; if it does not complete within the timeout, the caller gets an error. This prevents test hangs when cgroup operations block in the kernel (e.g. during scheduler reconfigurations).
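In shape, the pattern is a helper thread plus a bounded receive — a sketch, not the harness’s exact code:

use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn write_with_timeout(path: std::path::PathBuf, data: String) -> std::io::Result<()> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(std::fs::write(&path, data));
    });
    match rx.recv_timeout(Duration::from_secs(2)) {
        Ok(res) => res,
        // The writer thread is leaked if the kernel never returns;
        // the caller gets an error instead of hanging the test.
        Err(_) => Err(std::io::Error::new(
            std::io::ErrorKind::TimedOut,
            "cgroup write timed out after 2s",
        )),
    }
}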

Usage in scenarios

Scenarios access CgroupManager through Ctx.cgroups. The typical pattern is:

fn custom_scenario(ctx: &Ctx) -> Result<AssertResult> {
    let mut guard = CgroupGroup::new(ctx.cgroups);
    guard.add_cgroup("cg_0", &cpuset)?;

    let mut h = WorkloadHandle::spawn(&config)?;
    ctx.cgroups.move_tasks("cg_0", &h.worker_pids_for_cgroup_procs()?)?;
    h.start(); // workers block until start() is called

    // ... run workload ...

    // `guard` drops at end of scope and removes cg_0 even on error.
    Ok(result)
}

Bypass CgroupGroup only when you need to hand the cgroup’s lifetime to a different owner; the RAII wrapper is the default because it removes the cgroup on every error path, not just the happy path.

See also: CgroupGroup for RAII cleanup, WorkloadHandle for worker lifecycle, TestTopology for cpuset generation.

CgroupGroup

CgroupGroup is an RAII guard that removes cgroups on drop. It prevents cgroup leaks when workload spawning or other operations fail between cgroup creation and cleanup.

use ktstr::prelude::*;

#[must_use = "dropping a CgroupGroup immediately destroys the cgroups it manages"]
pub struct CgroupGroup<'a> {
    cgroups: &'a dyn CgroupOps,
    names: Vec<String>,
}

Methods

new(cgroups: &dyn CgroupOps) -> Self – creates an empty group bound to any implementor of CgroupOps (e.g. CgroupManager in production, an in-memory fake in tests).

add_cgroup(name, cpuset) -> Result<()> – creates a cgroup and sets its cpuset. The cgroup is tracked for removal on drop.

add_cgroup_no_cpuset(name) -> Result<()> – creates a cgroup without setting a cpuset. The cgroup is tracked for removal on drop.

names() -> &[String] – returns the names of all tracked cgroups.

Drop behavior

When the CgroupGroup is dropped, it calls remove_cgroup() on each tracked cgroup in reverse insertion order so nested children are removed before their parents (a parent still holding child directories would fail with ENOTEMPTY).

ENOENT is the one errno the drop swallows silently — it indicates the directory is already gone, so the post-condition the cleanup owes is already satisfied; this can legitimately happen via a TOCTOU race between the inner exists() check and remove_dir. Every other error (EBUSY from a surviving task, EACCES, a broken cgroupfs mount, etc.) is emitted as a tracing::warn! record carrying the cgroup name, the full error chain, and — for EBUSY or EACCES — a short remediation hint. The drop never panics and never returns an error (it cannot), but teardown failures are visible in logs rather than silently swallowed.
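In shape — a simplified sketch, where is_enoent stands in for however the real impl classifies the errno:

impl Drop for CgroupGroup<'_> {
    fn drop(&mut self) {
        // Reverse insertion order: nested children before their parents.
        for name in self.names.iter().rev() {
            match self.cgroups.remove_cgroup(name) {
                Ok(()) => {}
                Err(e) if is_enoent(&e) => {} // already gone: nothing owed
                Err(e) => tracing::warn!(cgroup = %name, error = ?e, "cgroup teardown failed"),
            }
        }
    }
}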

Usage

CgroupGroup is the standard pattern for cgroup lifecycle management in custom scenarios and in run_scenario() for data-driven scenarios.

fn custom_scenario(ctx: &Ctx) -> Result<AssertResult> {
    let mut guard = CgroupGroup::new(ctx.cgroups);
    guard.add_cgroup("cg_0", &cpuset_a)?;
    guard.add_cgroup("cg_1", &cpuset_b)?;

    // If WorkloadHandle::spawn() fails here, guard drops
    // and both cgroups are removed automatically.
    let mut h = WorkloadHandle::spawn(&config)?;
    ctx.cgroups.move_tasks("cg_0", &h.worker_pids_for_cgroup_procs()?)?;
    h.start(); // workers block until start() is called

    // ... run workload ...

    // guard drops at end of scope, removing cg_0 and cg_1.
    Ok(result)
}

The helper function setup_cgroups() returns a CgroupGroup alongside the worker handles:

let (handles, _guard) = setup_cgroups(ctx, 2, &wl)?;
// _guard lives until end of scope; cgroups are cleaned up on drop.

See also: CgroupManager for filesystem operations, WorkloadHandle for worker lifecycle, TestTopology for cpuset generation.

CI

Recipes for running ktstr tests in continuous integration.

Runner requirements

ktstr boots KVM virtual machines. CI runners must provide:

  • /dev/kvm access (hardware virtualization enabled)
  • Self-hosted runners or a provider that exposes KVM to the guest

GitHub-hosted ubuntu-latest runners do not expose /dev/kvm. Use self-hosted runners with KVM labels:

runs-on: [self-hosted, X64]                              # x86_64 (minimum labels)
runs-on: [self-hosted, Linux, kvm, kernel-build, ARM64]  # aarch64 (adjust labels to your runner pool)

See Troubleshooting: /dev/kvm not accessible for diagnosing KVM issues on runners, including cloud VM nested virtualization setup (GCP, AWS, Azure).

Runners also need the build dependencies listed in Getting Started: Prerequisites (clang, pkg-config, make, gcc, autotools) and at least 5 GB of free disk for kernel source extraction, build artifacts, and cached images.

Workflow setup

A minimal workflow that builds a kernel, caches it, and runs tests:

name: CI

on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: [self-hosted, X64]
    env:
      KTSTR_GHA_CACHE: "1"
    steps:
      - uses: actions/checkout@v5
      - uses: dtolnay/rust-toolchain@stable
        with:
          components: rustfmt
      - uses: taiki-e/install-action@v2
        with:
          tool: cargo-nextest
      - name: Install ktstr
        run: cargo install --path . --locked --bin ktstr --bin cargo-ktstr
      - name: Cache kernel images
        uses: actions/cache@v5
        with:
          path: ~/.cache/ktstr/kernels
          key: ktstr-kernels-x64-${{ hashFiles('ktstr.kconfig') }}
          restore-keys: ktstr-kernels-x64-
      - name: Build test kernel
        run: cargo ktstr kernel build
      - run: cargo ktstr test -- --profile ci --features integration

The test harness auto-discovers the built kernel. --profile ci configures nextest timeouts and retry behavior; see Nextest CI profile. KTSTR_GHA_CACHE enables a remote kernel cache; see Caching. To pin a specific kernel version, see Kernel pinning below.

Kernel pinning

Pin a specific kernel version via the matrix strategy:

strategy:
  fail-fast: false
  matrix:
    kernel-version: ['6.14', '7.0']
steps:
  # ...
  - name: Install ktstr
    run: cargo install --path . --locked --bin ktstr --bin cargo-ktstr
  - name: Build test kernel
    run: cargo ktstr kernel build ${{ matrix.kernel-version }}
  - run: cargo ktstr test --kernel ${{ matrix.kernel-version }} -- --profile ci --features integration

--kernel tells cargo ktstr test which cached kernel to use at runtime. A major.minor prefix (e.g. 6.14) resolves to the highest patch release in that series. See Kernel discovery for the full resolution chain.

When testing multiple kernel versions, add the version to the cache key (unlike the minimal workflow above, which omits it because it builds a single kernel):

key: ktstr-kernels-x64-${{ matrix.kernel-version }}-${{ hashFiles('ktstr.kconfig') }}
restore-keys: ktstr-kernels-x64-${{ matrix.kernel-version }}-

Caching

actions/cache persists ~/.cache/ktstr/kernels across runs, keyed on hashFiles('ktstr.kconfig') so kconfig changes trigger a rebuild.

Set KTSTR_GHA_CACHE=1 to enable a remote cache layer that shares kernels across jobs and workflow runs. Remote failures are non-fatal; local cache is authoritative.

Budget-based test selection

Set KTSTR_BUDGET_SECS to limit test runtime:

- run: cargo ktstr test -- --profile ci --features integration
  env:
    KTSTR_BUDGET_SECS: "300"

The selector greedily picks tests that maximize feature coverage within the time budget. Useful for smoke-test jobs or constrained runners. See Running Tests: Budget-based test selection.

Coverage

Run tests under cargo ktstr coverage for coverage reports:

coverage:
  runs-on: [self-hosted, X64]
  steps:
    - uses: actions/checkout@v5
    - uses: dtolnay/rust-toolchain@stable
      with:
        components: rustfmt,llvm-tools-preview
    - uses: taiki-e/install-action@v2
      with:
        tool: cargo-llvm-cov,cargo-nextest
    - name: Install ktstr
      run: cargo install --path . --locked --bin ktstr --bin cargo-ktstr
    - name: Cache kernel images
      uses: actions/cache@v5
      with:
        path: ~/.cache/ktstr/kernels
        key: ktstr-kernels-x64-${{ hashFiles('ktstr.kconfig') }}
        restore-keys: ktstr-kernels-x64-
    - name: Build test kernel
      run: cargo ktstr kernel build
    - run: cargo ktstr coverage -- --profile ci --lcov --output-path lcov.info --features integration --exclude-from-report scx-ktstr

Requires llvm-tools-preview rustup component and cargo-llvm-cov. Pass --exclude-from-report <crate> to exclude scheduler crates from coverage reports (the example excludes scx-ktstr, the project’s own test fixture scheduler).

Test statistics

Collect test statistics after the test run:

- name: Test statistics
  if: ${{ !cancelled() }}
  run: cargo ktstr stats

stats reads sidecar JSON files from target/ktstr/ and prints gauntlet analysis, BPF verifier stats, callback profiles, and KVM stats. The if: !cancelled() condition ensures stats are collected even on test failure. See cargo-ktstr stats for subcommands and options.

aarch64

aarch64 runners use the same workflow as x64. Copy the x64 workflow above and apply these differences:

  • Runner labels: [self-hosted, Linux, kvm, kernel-build, ARM64] (adjust to match your runner pool).
  • Cache key prefix: arm64 instead of x64.
  • sccache must be installed on every runner the workflow targets (x64 and arm64). A workflow-global RUSTC_WRAPPER=sccache applies to every job, so a runner without sccache on $PATH fails the first cargo invocation.

Performance mode

CI runners may lack CAP_SYS_NICE, rtprio limits, or enough host CPUs for exclusive LLC reservation. Disable performance mode to skip these features:

- run: cargo ktstr test -- --profile ci --features integration
  env:
    KTSTR_NO_PERF_MODE: "1"

Tests with performance_mode=true are skipped entirely under --no-perf-mode. See Performance Mode: Disabling.

Environment variables

See the full reference for all environment variables. The CI-relevant ones are KTSTR_GHA_CACHE, KTSTR_BUDGET_SECS, KTSTR_NO_PERF_MODE, KTSTR_KERNEL, and KTSTR_CACHE_DIR.

Nextest CI profile

The workspace ships a ci nextest profile in .config/nextest.toml. Compared to the default profile, it raises the slow-timeout termination threshold from 2 to 3 cycles (terminate-after = 3), defers per-test output until the run completes (failure-output = "final"), and continues past failures (fail-fast = false). Use it with --profile ci.

See Tests pass locally but fail in CI for common CI failure causes.

Troubleshooting

Build errors

clang not found

error: failed to run custom build command for `ktstr`
  ...
  clang: No such file or directory

The BPF skeleton build (libbpf-cargo) invokes clang to compile .bpf.c sources. Install clang:

  • Debian/Ubuntu: sudo apt install clang
  • Fedora: sudo dnf install clang

pkg-config not found

error: failed to run custom build command for `libbpf-sys`
  ...
  pkg-config: command not found

libbpf-sys uses pkg-config during its vendored build. Install it:

  • Debian/Ubuntu: sudo apt install pkg-config
  • Fedora: sudo dnf install pkgconf

autotools errors (autoconf, autopoint, aclocal)

autoreconf: command not found
aclocal: command not found
autopoint: command not found

The vendored libbpf-sys build compiles bundled libelf and zlib from source using autotools. These libraries are not system dependencies – they ship with libbpf-sys – but the autotools toolchain is needed to build them. Install:

  • Debian/Ubuntu: sudo apt install autoconf autopoint flex bison gawk
  • Fedora: sudo dnf install autoconf gettext-devel flex bison gawk

make or gcc not found

busybox build requires 'make' — install build-essential (Debian/Ubuntu) or base-devel (Fedora/Arch)
busybox build requires 'gcc' — install build-essential (Debian/Ubuntu) or base-devel (Fedora/Arch)

The build script compiles busybox from source for guest shell mode. This requires make and gcc.

  • Debian/Ubuntu: sudo apt install make gcc
  • Fedora: sudo dnf install make gcc

BTF errors

no BTF source found. Set KTSTR_KERNEL to a kernel build directory,
or ensure /sys/kernel/btf/vmlinux exists.

build.rs generates vmlinux.h from kernel BTF data. It searches the kernel discovery chain (KTSTR_KERNEL, ./linux, ../linux, installed kernel) for a vmlinux file, falling back to /sys/kernel/btf/vmlinux. Most distros ship /sys/kernel/btf/vmlinux with CONFIG_DEBUG_INFO_BTF enabled.

Fixes:

  • Verify BTF is available: ls /sys/kernel/btf/vmlinux
  • If missing, set KTSTR_KERNEL to a kernel build directory that contains a vmlinux with BTF: export KTSTR_KERNEL=/path/to/linux
  • Build a kernel with CONFIG_DEBUG_INFO_BTF=y.
  • Some minimal/cloud kernels strip BTF. Use a distro kernel or build your own.

busybox download failure

failed to obtain busybox source.
  tarball (https://github.com/mirror/busybox/archive/refs/tags/1_36_1.tar.gz): download: ...
  git clone (https://github.com/mirror/busybox.git): ...
  Check network connectivity. First build requires internet access.

build.rs downloads busybox source on first build (tarball first, git clone fallback). Subsequent builds use the cached binary in $OUT_DIR.

Fixes:

  • Verify network connectivity to github.com.
  • If behind a proxy, set HTTP_PROXY / HTTPS_PROXY.
  • After a successful first build, no network access is needed unless cargo clean removes the cached binary.

/dev/kvm not accessible

The host-side pre-flight emits one of the following, depending on whether the device node is missing or merely unreadable:

/dev/kvm not found. KVM requires:
  - Linux kernel with KVM support (CONFIG_KVM)
  - Access to /dev/kvm (check permissions or add user to 'kvm' group)
  - Hardware virtualization enabled in BIOS (VT-x/AMD-V)
/dev/kvm: permission denied. Add your user to the 'kvm' group:
  sudo usermod -aG kvm $USER
  then log out and back in.

ktstr boots Linux kernels in KVM virtual machines. The host must have KVM enabled and the user must have read+write access to /dev/kvm.

Diagnose:

  • Check the device exists and inspect its permissions and owning group: ls -l /dev/kvm. Typical output: crw-rw---- 1 root kvm 10, 232 ....
  • Confirm the kvm group exists and see its members: getent group kvm.

Fixes:

  • Load the KVM module: modprobe kvm_intel or modprobe kvm_amd.
  • Follow the group-membership hint in the error text above (log out and back in afterward for the group change to take effect).
  • On cloud VMs (GCP, AWS, Azure) or nested hypervisors, nested virtualization is typically off by default. Enable it per the provider’s instructions (e.g. GCP --enable-nested-virtualization, AWS metal/.metal instance types, Azure Dv3/Ev3+ with nested virt).
  • In CI, ensure the runner has KVM access (e.g. runs-on: [self-hosted, kvm]).

No kernel found

no kernel found
  hint: set KTSTR_KERNEL to a kernel source directory, a version (e.g. `6.14.2`), or a cache key (see `cargo ktstr kernel list`), or run `cargo ktstr kernel build` to populate the cache
  hint: or set KTSTR_TEST_KERNEL=/path/to/bzImage to point at a pre-built bootable image directly (bypasses KTSTR_KERNEL resolution)

On aarch64 the second hint says Image instead of bzImage.

ktstr shell and cargo ktstr shell auto-download the latest stable kernel when no --kernel is specified and no kernel is found via the discovery chain. See Kernel auto-download failures for download-specific errors.

ktstr needs a bootable Linux kernel image (bzImage on x86_64, Image on aarch64). See Kernel discovery for the search order.

Fixes:

  • Download and cache a kernel: cargo ktstr kernel build
  • Build from a local tree: cargo ktstr kernel build --source ../linux
  • Set KTSTR_TEST_KERNEL to an explicit image path.
  • The host’s installed kernel works for basic testing.

Scheduler not found

scheduler 'scx_mitosis' not found. Set KTSTR_SCHEDULER or
place it next to the test binary or in target/{debug,release}/

When using SchedulerSpec::Discover, ktstr searches for the scheduler binary in:

  1. KTSTR_SCHEDULER environment variable.
  2. Sibling of the current executable (and, when the test binary lives under target/{debug,release}/deps/, the parent of deps/ one level up — this covers the nextest / integration-test layout where the scheduler binary sits next to the test binary’s parent).
  3. target/debug/.
  4. target/release/.
  5. On-demand build via cargo build against the scheduler’s package name — ktstr invokes the build itself when the preceding four locations have no match, so a fresh checkout with an unbuilt scheduler still produces a usable binary without the caller pre-running cargo build.

Fixes:

  • Build the scheduler first: cargo build -p scx_mitosis (optional when step 5 above can build it on demand, but pre-building makes the first test run faster).
  • Set KTSTR_SCHEDULER=/path/to/binary.
  • Use SchedulerSpec::Path for an explicit path in #[ktstr_test].

Scheduler died

scheduler process died unexpectedly after completing step 2 of 5 (12.3s into test)

The scheduler process died while the scenario was running. This is usually a crash. The exact message varies by when the crash was detected (between steps, during workload, after completion).

The failure output contains diagnostic sections (each present only when relevant):

  • --- scheduler log ---: the scheduler’s stdout and stderr, cycle-collapsed for readability.
  • --- diagnostics ---: init stage classification, VM exit code, and the last 20 lines of kernel console output.
  • --- sched_ext dump ---: sched_ext_dump trace lines from the guest kernel (present when a SysRq-D dump fired).

Set RUST_BACKTRACE=1 to force --- diagnostics --- on all failures, not just scheduler deaths.

Next steps:

  • Check the --- scheduler log --- for the crash reason.
  • Check --- diagnostics --- for BPF errors or kernel oops in the kernel console.
  • Enable auto_repro in the test to capture the crash path with BPF probes. See Auto-Repro.
  • Run with a longer duration and specific flags to narrow the reproducer.

See Investigate a Crash for the complete failure output format and auto-repro walkthrough.

Insufficient hugepages

performance_mode: WARNING: no 2MB hugepages available, guest memory will use regular pages
performance_mode: WARNING: need N 2MB hugepages, only K free — falling back to regular pages

Performance mode requests 2MB hugepages for guest memory. The first form fires when no 2MB hugepages are reserved on the host (free == 0); the second fires when some are reserved but fewer than the run needs. In both cases the VM falls back to regular pages and continues to boot.

Fix:

Allocate hugepages before the run:

echo 2048 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
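
To check whether the reservation took effect, read back the 2MB pool counters; free_hugepages must cover the N from the warning:

cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
cat /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages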

Worker assertion failures

stuck 4500ms on cpu2 at +3200ms (threshold 3000ms)
unfair cgroup: spread=42% (8-50%) 4 workers on 4 cpus (threshold 35%)

The Assert checks (max_gap_ms, max_spread_pct, etc.) detected a worker metric outside the configured thresholds.

Fixes:

  • Check whether the topology has enough CPUs for the scenario. Small topologies produce higher contention, larger gaps, and more spread.
  • Use execute_steps_with() with a custom Assert to override thresholds for scenarios that need relaxed limits; see the sketch after this list.
  • Check the scheduler’s behavior under the specific flag profile that triggered the failure.
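
A minimal sketch of the override, assuming Assert exposes the threshold fields named in the failure text and supports struct-update from its Default (the real construction may be builder-style, and the argument order of execute_steps_with is an assumption; my_steps is a placeholder for whatever step list the scenario builds):

use ktstr::prelude::*;

#[ktstr_test(llcs = 1, cores = 2, threads = 1)]
fn relaxed_thresholds(ctx: &Ctx) -> Result<AssertResult> {
    // Hypothetical field names, taken from the failure text above.
    let assert = Assert {
        max_gap_ms: 5000,    // tolerate longer scheduling gaps on small topologies
        max_spread_pct: 50,  // tolerate more cross-worker spread
        ..Assert::default()
    };
    execute_steps_with(ctx, my_steps(), assert) // my_steps(): placeholder step list
}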

Cgroup name typos

No such file or directory: /sys/fs/cgroup/.../nonexistent/cgroup.procs

A cgroup name passed to Op::SetCpuset, Op::Spawn, or CgroupManager::move_tasks does not match a previously created cgroup. Cgroup names are case-sensitive strings.

Fixes:

  • Verify the cgroup name matches the name in Op::AddCgroup or CgroupDef::named().
  • When using dynamic cgroup names (e.g. format!("cg_{i}")), ensure the same formatting is used in all ops referencing that cgroup.

CpusetSpec errors

cgroup 'cg_0': CpusetSpec validation failed: not enough usable CPUs (4) for 8 partitions
cgroup 'cg_1': CpusetSpec validation failed: index 3 >= partition count 3
cgroup 'cg_2': CpusetSpec validation failed: Range fracs must lie in [0.0, 1.0]: start_frac=-1, end_frac=0.5

A CpusetSpec cannot produce a valid cpuset for the test topology. execute_steps treats this as a hard error and aborts the step so the downstream slicing/arithmetic in CpusetSpec::resolve is never reached with inputs that would panic.

Fixes:

  • Guard with a topology check before creating the step (expanded in the sketch after this list): if ctx.topo.usable_cpus().len() < needed { return Ok(AssertResult::skip(...)); }
  • Call CpusetSpec::validate(&ctx) in your scenario builder so failures surface before execute_steps runs.
  • Reduce the partition count or use CpusetSpec::Llc instead of Disjoint on topologies with fewer CPUs than partitions.
  • For Range/Overlap, keep fractions finite and inside [0.0, 1.0]; Range additionally requires start_frac < end_frac.
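
A sketch of the guard from the first bullet, assuming AssertResult::skip takes a reason string (the skip signature is not shown in the error output above):

use ktstr::prelude::*;

#[ktstr_test(llcs = 2, cores = 4, threads = 2)]
fn disjoint_partitions(ctx: &Ctx) -> Result<AssertResult> {
    let partitions = 8;
    // Bail out with a skip (not a failure) before any CpusetSpec is built,
    // so resolve() never sees a topology smaller than the partition count.
    if ctx.topo.usable_cpus().len() < partitions {
        return Ok(AssertResult::skip("fewer usable CPUs than partitions"));
    }
    // ... build the Disjoint CpusetSpec steps and execute them ...
    execute_defs(ctx, vec![CgroupDef::named("cg_0").workers(2)])
}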

Worker count mismatches

PipeIo requires num_workers divisible by 2, got 3

Grouped work types (PipeIo, FutexPingPong, CachePipe, FutexFanOut, FanOutCompute) require num_workers divisible by their group size. WorkType::worker_group_size() returns the divisor.

Fixes:

  • Set CgroupDef::workers(n) to a value divisible by the work type’s group size (2 for pipe/futex pairs, fan_out + 1 for FutexFanOut and FanOutCompute); see the sketch after this list.
  • Use an ungrouped work type (SpinWait, Mixed, Bursty, IoSyncWrite, IoRandRead, IoConvoy, YieldHeavy) if worker count flexibility is needed.
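
A small sketch deriving a valid count from the work type instead of hard-coding one, assuming WorkType is the enum behind the grouped work types (how the work type attaches to a CgroupDef is not shown here):

let ty = WorkType::PipeIo;
let group = ty.worker_group_size(); // 2 for pipe/futex pairs
// Any multiple of the group size passes the divisibility check.
let def = CgroupDef::named("cg_0").workers(group * 4);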

Cache corruption

  6.14.2-tarball-x86_64-kc...                 (corrupt: metadata.json malformed: ...)
warning: entries marked (corrupt) cannot be used — cached metadata is missing, malformed, or references a missing image. Inspect the entry directory under ~/.cache/ktstr/kernels to remove it manually, or run `kernel clean --corrupt-only --force` which removes ONLY corrupt entries and leaves valid ones intact. ...

A cached kernel entry has missing, unparseable, or schema-drifted metadata.json, or metadata that references an image file that is no longer present. This can happen after a partial write (e.g. disk full, killed process), or after a ktstr release that evolved the metadata schema in a non-backward-compatible way. cargo ktstr kernel list surfaces these as (corrupt: ...) rows; the trailing footer on stderr summarizes the remediation options. CacheDir::lookup returns None for corrupt entries so test runs at a specific cache key fall through to the normal re-build path.

The JSON form (cargo ktstr kernel list --json) emits an error_kind field on every corrupt entry — one of "missing", "unreadable", "schema_drift", "malformed", "truncated", "parse_error", "image_missing", or "unknown" — so CI scripts can dispatch on a stable token without parsing the free-form error string.
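
For example, a CI step can tally corrupt entries by kind with jq, assuming the --json output is an array whose corrupt entries carry error_kind at the top level (the exact per-entry field layout is not shown here):

cargo ktstr kernel list --json | jq -r '.[] | select(.error_kind != null) | .error_kind' | sort | uniq -c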

Fixes:

  • Remove ONLY corrupt entries (keeps valid ones intact): cargo ktstr kernel clean --corrupt-only --force
  • Remove the corrupt entry along with everything else: cargo ktstr kernel clean --force
  • Rebuild a specific version after cleanup: cargo ktstr kernel build --force 6.14.2
  • Override the cache directory via KTSTR_CACHE_DIR if the default location is on a problematic filesystem.
  • See cargo ktstr kernel clean for all cleanup options, including --keep N --force to preserve the N newest entries.

Stale vmlinux.btf or default.profraw in kernel source tree

After upgrading from an older ktstr version, you may notice extra files in your kernel source directory:

  • <source>/vmlinux.btf — a sidecar of the kernel’s .BTF section bytes. Older ktstr versions wrote it next to whichever vmlinux they parsed, including source-tree builds. Current ktstr only writes the sidecar when the vmlinux path is inside the cache root (~/.cache/ktstr/kernels/ or whatever KTSTR_CACHE_DIR points at) so source trees stay pristine.
  • <source>/default.profraw — an LLVM coverage runtime artifact. Older ktstr versions could leave it in cwd when a coverage-instrumented cargo ktstr test was launched from inside the kernel tree. Current ktstr injects LLVM_PROFILE_FILE=<cargo-ktstr-binary-parent>/llvm-cov-target/default-{pid}-{binary_hash}.profraw for the bare nextest path so the profraw lands next to the cargo-ktstr binary regardless of cwd. See profraw layout for the per-population directory map.

Both files are leftover state from prior runs and are safe to remove:

rm -f /path/to/linux/vmlinux.btf
rm -f /path/to/linux/default.profraw

If you also see them turn up under a different ktstr-driven source tree, check that you are running a current ktstr build (re-run cargo build or cargo install ktstr to pick up the fix) before deleting again — the guards live in the resolver, not on disk, so an old binary will keep regenerating these files.

Cache directory not found

HOME is unset; cannot resolve cache directory. The container init or login shell did not assign HOME — set it to an absolute path, or set KTSTR_CACHE_DIR to an absolute path (e.g. /tmp/ktstr-cache) or XDG_CACHE_HOME to specify a cache location explicitly.
HOME is set to the empty string; cannot resolve cache directory. An empty HOME usually means a Dockerfile or shell rc has `export HOME=` or `ENV HOME=` with no value. Either set HOME to a real absolute path, or set KTSTR_CACHE_DIR to an absolute path (e.g. /tmp/ktstr-cache) or XDG_CACHE_HOME to specify a cache location explicitly.

The kernel image cache requires a writable directory. ktstr resolves it as: KTSTR_CACHE_DIR > $XDG_CACHE_HOME/ktstr/kernels/ > $HOME/.cache/ktstr/kernels/. The first form fires when HOME is absent from the environment (typical of bare container inits or systemd units with no Environment=HOME=...); the second fires when HOME is present but assigned to the empty string.

Fix: Set KTSTR_CACHE_DIR to an explicit path, or ensure HOME is set to a real absolute path.

Stale kconfig

warning: entries marked (stale kconfig) were built against a different ktstr.kconfig.
Rebuild with: kernel build --force <entry version>

cargo ktstr kernel list marks entries whose stored ktstr_kconfig_hash differs from the current embedded ktstr.kconfig fragment. This happens after updating ktstr (which may change the kconfig fragment).

Fix:

Rebuilds happen automatically on the next cargo ktstr kernel build for stale entries. Use --force to override the cache for other reasons. See cargo ktstr kernel list for the full listing output.

Kernel auto-download failures

ktstr: no kernel found, downloading latest stable
fetch https://www.kernel.org/releases.json: <error>

ktstr auto-downloads a kernel when no --kernel is specified and no kernel is found via the discovery chain (see Kernel discovery). The same download path runs when --kernel specifies a version (e.g. --kernel 6.14.2) that is not in the cache. The CLI label varies: ktstr: for the standalone binary, cargo ktstr: for the cargo subcommand.

The <error> above is the underlying reqwest error (DNS resolution, connection refused, timeout, TLS handshake failure).

fetch https://www.kernel.org/releases.json: HTTP 503

kernel.org returned a non-success status code.

no stable kernel with patch >= 8 found in releases.json

ktstr requires a stable or longterm release with patch version >= 8 to avoid brand-new major versions that may have build issues. This error means releases.json contained no qualifying version.

download https://cdn.kernel.org/.../linux-6.14.10.tar.xz: <error>

Network failure during tarball download (same causes as above).

extract tarball: <error>

Tarball extraction failed. Common causes: disk full, insufficient permissions on the temp directory, or a truncated download.

kernel built but cache store failed — cannot return image from temporary directory

The kernel built successfully but could not be stored in the cache. Check disk space and permissions on the cache directory.

For version-specific download errors (HTTP 404, HTML responses), see Kernel download failures.

Fixes:

  • Verify network connectivity: curl -sI https://www.kernel.org/releases.json
  • Check DNS resolution for kernel.org and cdn.kernel.org.
  • Check free disk space — the download, extraction, and build each consume a significant amount.
  • If behind a proxy, set HTTP_PROXY, HTTPS_PROXY, and NO_PROXY (reqwest respects these environment variables).
  • Override the cache directory via KTSTR_CACHE_DIR if the default location has insufficient space or permissions.
  • Pre-download a kernel explicitly: cargo ktstr kernel build 6.14.10 to isolate whether the failure is in version resolution or download.

Kernel download failures

These errors occur when cargo ktstr kernel build or --kernel specifies an explicit version. For network and extraction errors during auto-download, see Kernel auto-download failures.

version 6.14.22 not found. latest 6.14.x: 6.14.10

The requested version does not exist on kernel.org. When a version in the same major.minor series is available in releases.json, the error suggests it.

version 5.4.99 not found

When the series is EOL or not in releases.json, only the “not found” message appears (no suggestion).

RC tarball not found: https://git.kernel.org/torvalds/t/linux-6.15-rc3.tar.gz
  RC releases are removed from git.kernel.org after the stable version ships.

RC tarballs are removed from git.kernel.org after the stable version ships. Use --git with a git.kernel.org URL to clone the tag instead.

download ...: server returned HTML instead of tarball (URL may be invalid)

Some CDN error pages return HTTP 200 with text/html content type. The download rejects these responses.

Fixes:

  • Check the suggested version in the error message.
  • Verify the version exists: check https://www.kernel.org/releases.json for available versions.
  • For RC releases, use --git with a git.kernel.org URL instead of a tarball download.
  • Run cargo ktstr kernel build without a version to automatically fetch the latest stable.

Shell mode issues

stdin must be a terminal

stdin must be a terminal for interactive shell mode

cargo ktstr shell requires a terminal for bidirectional I/O forwarding. Piped or redirected stdin is rejected.

Fix: Run from an interactive terminal session.

include file not found

-i strace: not found in filesystem or PATH

Bare names (without /, ., or ..) are searched in PATH. If the binary is not in PATH, use an explicit path.

--include-files path not found: ./missing-file

Explicit paths (containing / or starting with .) must exist on disk.

Fix: Verify the file exists and use the correct path.

include directory contains no files

warning: -i ./empty-dir: directory contains no regular files

The directory passed to --include-files was walked recursively but contained no regular files. FIFOs, device nodes, and sockets are skipped during the walk.

Fix: Verify the directory contains the files you expect.

Model load failed

GGUF model load failed at /home/.../models/Qwen3-4B-Q4_K_M.gguf. The
file may be corrupt or incompatible with the linked llama.cpp version
— delete the file and re-run `cargo ktstr model fetch` to download
a fresh copy. Check stderr for the upstream llama.cpp rejection reason.

The host-side LLM extraction backend (OutputFormat::LlmExtract) could not load the cached GGUF weights. The cached file is either corrupt (partial download, disk error) or incompatible with the linked llama.cpp version.

Diagnose:

  • Re-run with RUST_LOG=llama-cpp-2=info (or =debug for more detail) to surface llama.cpp’s own rejection reason on stderr. The first call to the inference engine routes llama_cpp_2::send_logs_to_tracing events through the tracing subscriber under target "llama-cpp-2" (literal hyphens — see Environment Variables for the EnvFilter shape).
  • cargo ktstr model status reports the cache path and verdict (Matches, Mismatches, CheckFailed, NotCached).

Fix:

  • Delete the cached file and re-fetch: cargo ktstr model clean && cargo ktstr model fetch. clean removes both the GGUF artifact and its .mtime-size warm-cache sidecar; fetch re-downloads from the pinned URL and SHA-checks the result.
  • If model status reports Mismatches, the local file’s hash diverged from the pinned digest — cargo ktstr model fetch will refuse to overwrite a corrupt cache and the explicit clean is required first.
  • If you set KTSTR_MODEL_OFFLINE=1, unset it for the re-fetch. See cargo ktstr model.

Flock timeout / NFS rejection

flock LOCK_EX on run-dir target/ktstr/6.14-abc1234 timed out after
30s (lockfile target/ktstr/.locks/6.14-abc1234.lock, holders:
  pid=12345 cmd=cargo-ktstr test --kernel 6.14). A peer cargo
ktstr test process is writing sidecars to the same
{kernel}-{project_commit} directory; wait for it to finish or kill
it, then retry.

A peer process is holding the per-run-key advisory flock(2) that serializes sidecar writes; the helper polled for 30 s and gave up. Run-dir locks live at {runs_root}/.locks/{kernel}-{project_commit}.lock and serialize the (pre-clear + write) cycle so two concurrent ktstr runs sharing the same key can’t tear partially-written sidecars.

target/ktstr/.locks/6.14-abc1234.lock: filesystem NFS is not
supported for ktstr lockfiles (NFSv3 is advisory-only without
an NLM peer; NFSv4 byte-range locking does not cover flock(2)).
Move the lockfile path to a local filesystem (tmpfs, ext4, xfs,
btrfs, f2fs, bcachefs).

try_flock rejects NFS, CIFS, SMB2, CephFS, AFS, and FUSE mounts because flock(2) semantics on those filesystems are unreliable (see Resource Budget — Filesystem requirement for the per-filesystem rationale).

Diagnose:

  • cargo ktstr locks (or ktstr locks --watch 1s) prints every ktstr flock currently held on the host with PID + cmdline, including per-run-key sidecar locks under the “Run-dir locks” section (see cargo ktstr locks).
  • cat /proc/locks | grep '<lockfile-path-from-error>' falls back to the kernel’s own flock enumeration when the holder is outside ktstr.
  • stat -f -c '%T' <runs-root> reports the filesystem type when the rejection error names NFS/CIFS/SMB/CephFS/AFS/FUSE.

Fix:

  • For a peer-holder timeout: wait for the peer to finish, kill it (kill <pid> from the holder list), or retry with the peer done.
  • For an NFS / remote-fs rejection: relocate the runs root to a local filesystem. Set KTSTR_SIDECAR_DIR to a local path (/tmp/ktstr-sidecars, a tmpfs mount) — note that this override path also skips the cross-process flock, so concurrent runs targeting the same KTSTR_SIDECAR_DIR have no serialization between them. Use the override only for a single-process run or per-process distinct paths.
  • The kernel cache’s lockfiles ({cache_root}/.locks/*.lock) face the same constraint — override KTSTR_CACHE_DIR to a local filesystem if the default resolves to NFS. See Cache directory not found.

Tests pass locally but fail in CI

Common causes:

  • No KVM: CI runners need hardware virtualization. Check for /dev/kvm access.
  • Fewer CPUs: gauntlet topology presets up to 252 CPUs may exceed the runner’s capacity. Use smaller topologies.
  • No kernel: set KTSTR_TEST_KERNEL in the CI environment.
  • No CAP_SYS_NICE or rtprio: performance-mode tests require CAP_SYS_NICE or an rtprio limit for RT scheduling, and enough host CPUs for exclusive LLC reservation. Pass --no-perf-mode (or set KTSTR_NO_PERF_MODE=1) to disable all performance mode features. Tests with performance_mode=true are skipped entirely under --no-perf-mode.
  • Debug thresholds: CI often runs debug builds. Debug builds use relaxed thresholds (3000ms gap, 35% spread) but may still hit limits on slow runners. See default thresholds.

Environment Variables

Environment variables that control ktstr behavior.

User-facing

KTSTR_KERNEL
Kernel identifier for cargo-build-time BTF resolution (read by build.rs) and runtime image discovery. Accepts a path (../linux), version string (6.14.2), or cache key (use cargo ktstr kernel list for actual keys). During cargo build, only paths are used (build.rs extracts BTF from vmlinux). At runtime, version strings and cache keys resolve via the XDG cache; paths search only the specified directory (error if no image found). Set automatically by cargo ktstr test --kernel. Overridden by KTSTR_KERNEL_LIST when present: under multi-kernel runs the test binary’s --list / --exact handlers consult KTSTR_KERNEL_LIST first and only fall back to KTSTR_KERNEL when the list env is unset; the producer-side cargo ktstr always sets KTSTR_KERNEL to the FIRST resolved entry alongside the full KTSTR_KERNEL_LIST so downstream code that inspects KTSTR_KERNEL directly still sees a valid path. Default: Auto-discovered.

KTSTR_KERNEL_LIST
Multi-kernel wire format label1=path1;label2=path2;… consumed by the test binary’s gauntlet expansion. Set by cargo ktstr test / coverage / llvm-cov when the resolved kernel set has 2 or more entries; the test binary’s --list handler emits one variant per kernel (suffix gauntlet/{name}/{preset}/{profile}/{kernel_label} or ktstr/{name}/{kernel_label}) and the --exact handler strips the suffix and re-exports KTSTR_KERNEL to the matching directory before booting the VM. Semicolon is the entry separator (paths can carry : on POSIX); = separates label from path. Empty value or unset means “single-kernel mode” — the test binary falls back to KTSTR_KERNEL. Default: None (single-kernel).

KTSTR_CI
Set to any non-empty value to flip every sidecar’s run_source field from "local" (developer-machine default) to "ci". Read at sidecar-write time by detect_run_source; surfaces through cargo ktstr stats compare --run-source ci so CI-produced runs can be partitioned from developer runs without per-run directory bookkeeping. Empty string counts as unset. The third value "archive" is applied at LOAD time (not write time) when cargo ktstr stats compare --dir / list-values --dir pulls sidecars from a non-default pool root — KTSTR_CI does not control that. Default: None (run_source = "local").

KTSTR_TEST_KERNEL
Path to a bootable kernel image (bzImage on x86_64, Image on aarch64). See Getting Started and Troubleshooting for search order. Default: Auto-discovered.

KTSTR_CACHE_DIR
Override the kernel image cache directory. When set, all cache operations use this path instead of the XDG default. Default: $XDG_CACHE_HOME/ktstr/kernels/ or $HOME/.cache/ktstr/kernels/.

KTSTR_GHA_CACHE
Set to "1" to enable remote kernel cache via GitHub Actions cache service. Requires ACTIONS_CACHE_URL (set by the GHA runner). Local cache is always authoritative; remote failures are non-fatal. Default: None (disabled).

KTSTR_SCHEDULER
Path to a scheduler binary for SchedulerSpec::Discover. See Troubleshooting for search order. Default: Auto-discovered.

KTSTR_BUDGET_SECS
Time budget in seconds for greedy test selection during --list. Must be positive. See Running Tests. Default: None (all tests listed).

KTSTR_SIDECAR_DIR
Directory for per-test result sidecar JSON files. Used as-is when set, no key suffix. Consumed by the test harness (sidecar write path) and by bare cargo ktstr stats (sidecar read path). When this override is set, pre-clear is skipped AND the per-run-key cross-process flock is skipped — the operator chose the directory and owns its contents, so any pre-existing sidecars there are preserved, and ktstr does not coordinate concurrent writers against the override path. Two concurrent runs pointing KTSTR_SIDECAR_DIR at the same path therefore have no serialization between them; choose distinct override paths per process (or rely on the default-path branch, which acquires the flock automatically). cargo ktstr stats list, cargo ktstr stats compare, cargo ktstr stats list-values, and cargo ktstr stats show-host walk {CARGO_TARGET_DIR or "target"}/ktstr/ by default and ignore KTSTR_SIDECAR_DIR — pass --dir DIR on compare / list-values / show-host to point them at an alternate run root. See Runs. Default: {CARGO_TARGET_DIR or "target"}/ktstr/{kernel}-{project_commit}/ (where {project_commit} is the project HEAD short hex, suffixed -dirty when the worktree differs, or the literal unknown when not in a git repo — see Runs for the unknown-commit collision semantics).

KTSTR_NO_PERF_MODE
Force performance_mode=false and skip flock topology reservation. Disables all performance mode features (pinning, RT scheduling, hugepages, NUMA mbind, KVM exit suppression). Presence is sufficient (any value). See Performance Mode. Also settable via --no-perf-mode CLI flag. Default: None (disabled).

KTSTR_CARGO_TEST_MODE
Marks a test invocation that runs without the cargo-ktstr wrapper (typically KTSTR_KERNEL=… KTSTR_CARGO_TEST_MODE=1 cargo test -- some_test). When active, the harness (1) skips the cross-process initramfs SHM cache and builds inline per VM run (process-local HashMap memoization still applies); (2) skips host-topology LLC / per-CPU flock acquisition — tests run on whatever CPUs the OS schedules them onto; (3) skips gauntlet variant expansion in nextest discovery — each #[ktstr_test] runs once with its declared topology, no KTSTR_KERNEL_LIST multi-kernel fan-out; (4) resolves SchedulerSpec::Discover(name) via $PATH first (before the sibling-dir / target-dir / cargo-build chain) so a user can install a scheduler on PATH and run a single test without driving the cargo-ktstr build pipeline. Empty string is treated as unset (rejection mirrors KTSTR_NO_PERF_MODE). Acceptable for development iteration; perf-mode tests still use their measurement contract internally but no peer-coordination flocks are taken. Default: None (full cargo-ktstr coordination).

KTSTR_NO_SKIP_MODE
Convert resource-contention and host-topology-insufficient skips into hard test failures (exit code 1). Default behavior is to skip the test (exit code 0) so a contended runner does not fail tests that simply could not start; setting this env var (or passing --no-skip-mode on cargo ktstr test / coverage / llvm-cov) opts into “if the test cannot run, the test fails”. Use when a test environment is supposed to provision sufficient resources and a missing topology is a real configuration error. The CLI flag exports KTSTR_NO_SKIP_MODE=1 for the test binary. Presence is sufficient (any value). Default: None (skip on contention / topology insufficiency).

KTSTR_CPU_CAP
Cap the number of host CPUs reserved by a no-perf-mode VM or kernel build to N (integer ≥ 1, a CPU count). The planner walks whole LLCs in consolidation- / NUMA-aware order, filtered to the calling process’s sched_getaffinity cpuset, partial-taking the last LLC so plan.cpus.len() is EXACTLY N. CLI flag --cpu-cap N takes precedence; empty string is treated as unset; 0 or non-numeric values are rejected with a parse error. On shell, --cpu-cap is rejected at clap parse time unless --no-perf-mode is also passed (requires = "no_perf_mode"); on kernel build, no perf-mode concept applies. Library consumers that set performance_mode=true on KtstrVmBuilder directly see the env var silently ignored — the builder’s perf-mode branch never consults CpuCap::resolve. Mutually exclusive with KTSTR_BYPASS_LLC_LOCKS=1 at every entry point (rejection wording contains “resource contract”). See Resource Budget. Default: None (30% of allowed CPUs, minimum 1).

KTSTR_BYPASS_LLC_LOCKS
Skip host-side LLC flock acquisition entirely. No coordination against concurrent perf-mode runs. Presence is sufficient (any non-empty value). Mutually exclusive with KTSTR_CPU_CAP / --cpu-cap — the conflict is rejected at every entry point with an error containing “resource contract”. See Resource Budget. Default: None (coordinate).

KTSTR_KERNEL_PARALLELISM
Override the rayon pool width cargo ktstr uses for --kernel per-spec fan-out in resolve_kernel_set. Parsed as usize after .trim(); whitespace around the value is tolerated. Values that fail to parse, are negative, or are 0 silently fall through to the default — a typoed export (=abc, =0) does NOT disable parallelism, it degrades to the host-CPU default. Useful when the default is wrong for the host: a fast NIC + slow CPU benefits from a higher value (more concurrent downloads); a contended CI runner benefits from a lower cap to leave bandwidth and CPU for sibling jobs. Scope is narrow: only the bounded ThreadPool resolve_kernel_set builds via ThreadPoolBuilder::install is affected — the global rayon pool that other code paths (nextest harness, polars groupby, etc.) consume is untouched. The build phase inside each per-spec resolve is already serialized at the LLC-flock layer, so raising this knob accelerates download fan-out only, not build time. Default: std::thread::available_parallelism() (host logical CPU count, falling back to 1 on a sandboxed host where available_parallelism errors).

KTSTR_VERBOSE
Set to "1" for verbose VM console output (earlyprintk, loglevel=7). Default: None.

RUST_BACKTRACE
Gates verbose diagnostic output on failure. Also enables verbose VM console output (same as KTSTR_VERBOSE=1) when set to "1" or "full". Propagated to the guest. Default: None.

RUST_LOG
Controls every ktstr tracing filter — guest-side and host-side. Guest-side: propagated to the VM kernel command line and parsed by the guest tracing subscriber, so guest events are filtered by the same RUST_LOG value the host process saw at launch. Host-side: applied via the EnvFilter the inference engine installs on first call to global_backend() (tracing_subscriber::fmt::try_init() — a no-op when an outer subscriber was already installed). Two host-side targets are useful in practice: "llama-cpp-2" (literal hyphens — the Metadata::target() set by llama_cpp_2::send_logs_to_tracing(LogOptions::default()), carrying llama.cpp / GGML log lines: model-load progress, GGUF parse chatter, KV-cache reservation notes, error reasons) and "ktstr::flock" (the module_path!() default for src/flock.rs, where the shared flock-timeout primitive emits a tracing::debug!("waiting on flock at …") event on each Ok(None) poll iteration). Examples: RUST_LOG=llama-cpp-2=info widens model-load logging to INFO; RUST_LOG=ktstr::flock=debug surfaces flock-contention heartbeats; RUST_LOG=llama-cpp-2=off suppresses llama.cpp output entirely. EnvFilter does prefix-matching on meta.target() without underscore normalization (the hyphenated llama-cpp-2 target is a string literal, not a Rust path). The default EnvFilter derived from an unset RUST_LOG keeps only ERROR-level events, which is exactly the C-side rejection-reason text behind otherwise-opaque InferenceError::ModelLoad / LlamaModelLoadError::NullResult failures. Operators wanting a different sink (file, alternate format) can install their own subscriber FIRST — try_init() becomes a no-op and the operator’s subscriber receives the events. Default: None (host-side: ERROR-level events on stderr).

jemalloc probe wiring

These variables are only consulted by integration tests that boot a jemalloc-linked allocator worker inside the VM and attach the ktstr-jemalloc-probe to it (see tests/jemalloc_probe_tests.rs). Both are set from a #[ctor] in the test binary so they land before the test harness dispatches.

What #[ctor] is and why these variables need it

#[ctor] is a Rust attribute (provided by the ctor crate) that marks a function to run automatically at binary initialization — after the dynamic linker sets up the process but before main() is called. Linux implements this via the .init_array ELF section; the attribute’s generated code registers the function there. A function under #[ctor] therefore runs exactly once per process, on the main thread, before any code inside main() executes.

The two environment variables above are consulted by ktstr’s nextest pre-dispatch path (ktstr_test_early_dispatch), which itself runs under a ktstr-owned #[ctor] that intercepts the nextest protocol args (--list, --exact) before the standard Rust test harness sees them. The probe-wiring variables must already be populated when that early dispatch fires, so setting them from plain test-body code is too late — the sidecar enumeration and initramfs packing decisions have already run. Tests needing probe integration install their own #[ctor] that writes the two variables via std::env::set_var, ensuring both ktstr’s early dispatch and the VM launch path downstream see the populated values.

The ctor hook runs under the ctor crate re-exported at ktstr::__private::ctor, so a new test crate does not need to add ctor to its own dependencies — it can use the re-export via ktstr::__private::ctor::ctor and stay in sync with the version ktstr itself depends on, avoiding the “two ctor crates, two .init_array entries, ordering undefined” pitfall.

Leaving either variable unset is the normal case — the VM launcher skips probe wiring entirely, and no initramfs entry is added.
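
A sketch of the wiring, using only the documented pieces (the ctor re-export and Cargo’s compile-time CARGO_BIN_EXE_* paths); the function name is arbitrary:

use ktstr::__private::ctor::ctor;

#[ctor]
fn wire_jemalloc_probe() {
    // Runs at binary init, before main(), so per the ordering described above
    // the variables are already populated when ktstr's early dispatch fires.
    // (Under edition 2024, std::env::set_var must be wrapped in unsafe {}.)
    std::env::set_var(
        "KTSTR_JEMALLOC_PROBE_BINARY",
        env!("CARGO_BIN_EXE_ktstr-jemalloc-probe"),
    );
    std::env::set_var(
        "KTSTR_JEMALLOC_ALLOC_WORKER_BINARY",
        env!("CARGO_BIN_EXE_ktstr-jemalloc-alloc-worker"),
    );
}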

KTSTR_JEMALLOC_PROBE_BINARY
Absolute host path to the ktstr-jemalloc-probe binary. When set, the probe is packed into every VM’s base initramfs at /bin/ktstr-jemalloc-probe. Typically set by a #[ctor] in the integration test crate to env!("CARGO_BIN_EXE_ktstr-jemalloc-probe"). Empty string is treated the same as unset. Default: None (no probe packed).

KTSTR_JEMALLOC_ALLOC_WORKER_BINARY
Absolute host path to the paired ktstr-jemalloc-alloc-worker binary. Packed alongside the probe for the closed-loop tests that run the probe against a live allocator target. Same #[ctor] shape as above using env!("CARGO_BIN_EXE_ktstr-jemalloc-alloc-worker"). Empty string is treated the same as unset. Default: None (no worker packed).

LLVM coverage

LLVM_COV_TARGET_DIR
Directory for extracted profraw files. Default: Parent of LLVM_PROFILE_FILE, or <exe-dir>/llvm-cov-target/.

LLVM_PROFILE_FILE
Standard LLVM profiling output path. ktstr reads its parent as a fallback profraw directory. Default: None.

Nextest protocol

NEXTEST
Set by nextest when it invokes the test binary. ktstr’s #[ctor] dispatch inspects this to decide whether to intercept the nextest protocol args (--list, --exact) for gauntlet expansion and budget-based selection before main() runs. Under plain cargo test, this is unset and the standard harness runs the #[test] wrappers directly. Default: None.

VM-internal

Set by the host on the guest kernel command line and read by the guest init (via /proc/cmdline). Not intended for user configuration; listed here for debugging.

  • SCHED_PID — PID of the scheduler process inside the guest, published after scheduler spawn.
  • KTSTR_MODE — Guest execution mode (run for test dispatch, shell for interactive shell).
  • KTSTR_TOPO — Topology string (numa_nodes,llcs,cores,threads) for guest-side scenario resolution.
  • KTSTR_SHM_BASE — Host-physical base address of the SHM ring region (hex).
  • KTSTR_SHM_SIZE — Size in bytes of the SHM ring region (hex).
  • KTSTR_TERM — Terminal type forwarded from the host (sets guest TERM).
  • KTSTR_COLORTERM — Color capability forwarded from the host (sets guest COLORTERM).
  • KTSTR_COLS — Host terminal column count, used to size the guest pty when available.
  • KTSTR_ROWS — Host terminal row count, used to size the guest pty when available.

Sentinel tokens (===KTSTR_TEST_RESULT_START===, ===KTSTR_TEST_RESULT_END===, KTSTR_EXIT=N, KTSTR_INIT_STARTED, KTSTR_PAYLOAD_STARTING, KTSTR_EXEC_EXIT) are protocol markers written to COM2; they are not environment variables.

ctprof

The ctprof profiler captures a host-wide per-thread snapshot of scheduling counters, memory / I/O accounting, CPU affinity, cgroup state, and thread identity, then compares two snapshots to surface what changed. It is a manually-invoked CLI companion to the automated scheduler tests — useful when a run passes on one machine and fails on another, or for A/B comparing host behaviour across kernel / sysctl / workload changes.

This is a different tool from cargo ktstr show-host, which captures the host context (kernel, CPU model, sched_* tunables, NUMA layout, kernel cmdline) — aggregate state that does not change between scenarios. The profiler captures per-thread cumulative counters that do change, and its comparison surface is designed for the thread-level diff.

When to use it

  • Workload investigation — you observe a regression and want to know which process / thread pool moved in run time, context-switch rate, or migration count.
  • Kernel / sysctl A/B — capture before and after flipping a sched_* tunable on an otherwise-identical workload; the compare output surfaces every counter that responded.
  • Host baselining — capture on a known-good host, capture on a failing host, compare to isolate what differs at the thread-behaviour level.

The profiler is not invoked automatically by scenarios or the gauntlet. It is opt-in and operator-driven via the ktstr ctprof subcommand.

Capture

ktstr ctprof capture --output baseline.ctprof.zst
# ... run workload, change a tunable, reboot a kernel, etc. ...
ktstr ctprof capture --output after.ctprof.zst

capture walks /proc for every live thread group, enumerates each thread, and reads a handful of procfs sources for each one. The output is a zstd-compressed JSON snapshot (conventional extension: .ctprof.zst).

What is captured per thread

  • Identity — tid, tgid, pcomm (process name from /proc/<tgid>/comm), comm (thread name from /proc/<tid>/comm), cgroup v2 path, start_time_clock_ticks (from /proc/<tid>/stat field 22, in USER_HZ clock ticks), scheduling policy name, nice, CPU affinity mask.
  • Scheduling counters (cumulative, from /proc/<tid>/sched; schedstat fields gated by CONFIG_SCHEDSTATS, run_time_ns/wait_time_ns/timeslices gated by CONFIG_SCHED_INFO) — run_time_ns, wait_time_ns, timeslices, voluntary_csw, nonvoluntary_csw, nr_wakeups (plus _local / _remote / _sync / _migrate splits), nr_migrations, wait_sum / wait_count, voluntary_sleep_ns (capture-side normalized as sum_sleep_runtime - sum_block_runtime so the kernel’s sleep/block double-count is stripped before the value reaches the snapshot), block_sum, iowait_sum / iowait_count, core_forceidle_sum, wait_max / sleep_max / block_max / exec_max / slice_max (lifetime peaks).
  • Memory — minflt / majflt from /proc/<tid>/stat. allocated_bytes / deallocated_bytes from the jemalloc per-thread TSD counters (tsd_s.thread_allocated / thread_deallocated) read via ptrace + process_vm_readv — populated only for processes linked against jemalloc; glibc arena counters are opaque and read as zero rather than failing capture. smaps_rollup_kb (per-process map of the kernel’s /proc/<tid>/smaps_rollup keys, populated leader-only).
  • I/O — rchar, wchar, syscr, syscw, read_bytes, write_bytes, cancelled_write_bytes from /proc/<tid>/io (requires CONFIG_TASK_IO_ACCOUNTING). Note that cancelled_write_bytes records on the truncating task — not the original writer — so it pairs with write_bytes as a group-level signal but per-thread arithmetic between the two is not meaningful.
  • Taskstats delay accounting + watermarks — eight delay categories × four fields each (count, total_ns, max_ns, min_ns) plus hiwater_rss_bytes and hiwater_vm_bytes peaks, pulled via the kernel’s TASKSTATS genetlink family. Requires CAP_NET_ADMIN on the capturing process; delay-family fields additionally require CONFIG_TASK_DELAY_ACCT and the runtime delayacct=on toggle, watermark fields require CONFIG_TASK_XACCT. See the Taskstats delay accounting section below for the full field list, gating, and per-bucket semantic caveats.
  • PSI + host-level — cpu.stat / memory.current aggregates per cgroup (see Per-cgroup enrichment) plus psi (Pressure Stall Information) under each cgroup and at the host level. Requires CONFIG_PSI.
  • sched_ext sysfs — state, switch_all, nr_rejected, hotplug_seq, enable_seq from /sys/kernel/sched_ext/. Present only when CONFIG_SCHED_CLASS_EXT is built.

Field families and probe-timing invariance:

  • Cumulative counters and totals (the majority): wakeups, migrations, csw, run/wait/sleep/block/iowait time, schedstat counts, page-fault counters, syscall counters, byte counters, the taskstats per-bucket *_count and *_delay_total_ns, the jemalloc per-thread TSD counters. Sampled twice at different instants the value increases monotonically; probe attachment time does not alter the reading.
  • Lifetime extrema: schedstat *_max family (wait_max, sleep_max, block_max, exec_max, slice_max), every taskstats *_delay_max_ns / *_delay_min_ns, and the memory watermarks (hiwater_rss_bytes, hiwater_vm_bytes). Per-event extrema rather than sums. The *_max and hiwater_* fields are non-DECREASING over time (kernel keeps the largest); the *_delay_min_ns fields are non-INCREASING (kernel keeps the smallest non-zero observation, so sentinel 0 means “no events observed” — compare against the matching *_count).
  • Instantaneous gauges (sensitive to probe timing): nr_threads (signal_struct->nr_threads snapshot), fair_slice_ns (current p->se.slice), and state (task_state_array letter). Sampled at capture time and can legitimately differ between two probes of the same thread.
  • Categorical / ordinal scalars: policy, nice, priority, processor, rt_priority, plus identity strings (pcomm, comm, cgroup) and the cpu_affinity cpuset. Sampled at capture time and can change at runtime (e.g. sched_setaffinity mid-run flips processor and cpu_affinity), so they share the gauge family’s probe-timing sensitivity.

Metrics that reset on attachment (perf_event_open counters, BPF tracing samples, etc.) are intentionally absent — they require long-lived instrumentation the capture layer cannot install without disturbing the system it is measuring.

Capture is best-effort

Each internal reader returns Option; a kernel without CONFIG_SCHED_DEBUG yields None from the /proc/<tid>/sched reader (and a kernel without CONFIG_SCHEDSTATS yields None from /proc/<tid>/schedstat and the schedstat-gated /proc/<tid>/sched keys) without failing the rest of the thread. Counters collapse to 0, identity strings collapse to empty, affinity collapses to an empty vec. A missing reading is indistinguishable from a genuine zero in the output — the contract is “never fail the snapshot.” Tests that need stronger guarantees inspect the underlying readers directly (they remain Option-shaped and are unit-tested in the module).

Per-cgroup enrichment

Every cgroup in which at least one sampled thread resides gets a CgroupStats entry. Fields nest under per-controller sub-structs:

  • cpu: CgroupCpuStats — usage_usec, nr_throttled, throttled_usec (from cpu.stat); max_quota_us, max_period_us (from cpu.max); weight, weight_nice (from cpu.weight / cpu.weight.nice).
  • memory: CgroupMemoryStats — current (from memory.current); max, high, low, min (from the matching memory.* files; low and min are protection floors, max and high are limits); stat and events as flat key-value maps mirroring memory.stat and memory.events.
  • pids: CgroupPidsStats — current and max from the optional pids controller.
  • psi: Psi — per-cgroup Pressure Stall Information from <cgroup>/cpu.pressure / memory.pressure / io.pressure / irq.pressure (gated on CONFIG_PSI).

All fields are read directly from cgroup v2 files, NOT derived from per-thread data, because those are aggregate-over-the-cgroup values.
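
The same files can be read by hand when sanity-checking a CgroupStats entry against the live system (substitute the cgroup path from the snapshot):

cat /sys/fs/cgroup/<cgroup-path>/cpu.stat
cat /sys/fs/cgroup/<cgroup-path>/memory.current
cat /sys/fs/cgroup/<cgroup-path>/cpu.pressure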

Snapshot identity

The top-level CtprofSnapshot also embeds a HostContext (the same structure show-host prints — kernel, CPU, memory, sched_* tunables, cmdline). Older tools or synthetic fixtures that omit the context render (host context unavailable) rather than failing the compare.

Cgroup namespace caveat

The per-thread cgroup path is read verbatim from /proc/<tid>/cgroup — it is therefore relative to the cgroup namespace root the capturing process sees, NOT the system-global v2 mount root. A process inside a nested cgroup namespace sees a truncated path; a process outside sees a longer one. Cross-namespace comparison requires external canonicalization (the capture layer deliberately does not attempt it because the right resolution depends on capture-site privilege and namespace visibility).

Taskstats delay accounting

The kernel’s TASKSTATS genetlink family delivers per-task delay-accounting and memory-watermark fields that are NOT exposed via /proc/<tid>/sched or /proc/<tid>/stat. ctprof captures them through crate::taskstats — a netlink socket is opened, the family id is resolved via CTRL_CMD_GETFAMILY, and one TASKSTATS_CMD_GET query is issued per tid. The 34 captured fields (8 delay categories × 4 bucket fields + 2 watermarks) all tag Section::TaskstatsDelay so they can be filtered as a unit.

Capability and kconfig gating

Calling the netlink family requires CAP_NET_ADMIN on the capturing process (kernel/taskstats.c::taskstats_ops registers TASKSTATS_CMD_GET with GENL_ADMIN_PERM). ktstr always runs as root in production so the cap is implicit, but a non-root operator running ktstr ctprof capture will hit EPERM on the first query_tid call and every taskstats field will collapse to zero per the best-effort capture contract.

Per-family kconfig gates and runtime toggles:

  • Delay-accounting fields (*_delay_count, *_delay_total_ns, *_delay_max_ns, *_delay_min_ns across the eight categories): require CONFIG_TASKSTATS=y AND CONFIG_TASK_DELAY_ACCT=y AND the runtime delayacct=on toggle (sysctl kernel.task_delayacct=1 or boot param delayacct). The runtime toggle is a separate condition beyond the build-time gates — a kernel built with both CONFIGs but launched without delayacct=on produces all-zero delay readings. ktstr’s standard kernel build includes both kconfigs; the test harness adds delayacct to the guest cmdline.
  • Memory-watermark fields (hiwater_rss_bytes, hiwater_vm_bytes): require CONFIG_TASKSTATS=y AND CONFIG_TASK_XACCT=y. They do NOT respond to the delayacct=on runtime toggle — xacct_add_tsk (kernel/tsacct.c) is unconditional once CONFIG_TASK_XACCT is built. xacct_add_tsk reads watermarks from the SHARED mm_struct, so sibling threads of the same tgid all report identical values; kernel threads (mm == NULL) read zero by design.

Any failed gate or missing cap collapses the affected fields to zero. ktstr’s capture pipeline emits an info-level tracing line per snapshot summarizing taskstats outcomes AND attaches the structured tally to CtprofSnapshot::taskstats_summary (ok_count / eperm_count / esrch_count / other_err_count), so an operator can distinguish “kernel doesn’t expose this” from “every tid raced exit” from “CAP_NET_ADMIN missing” without scraping log lines.

Eight delay categories

  • cpu_delay_* — tsk->sched_info.{pcount,run_delay} via delayacct_add_tsk (kernel/delayacct.c). Time waiting on the runqueue. RACY: count + total are not updated atomically (lockless sched_info path); a concurrent reader may observe one ahead of the other. Captures the same wait-for-CPU bucket as schedstat wait_* via a different code path.
  • blkio_delay_* — delayacct_blkio_start / _end (kernel/delayacct.c). Synchronous block I/O wait. Updates serialize through task->delays->lock so count + total are atomic (unlike cpu_*). The canonical delay-accounting block-I/O reading; distinct from schedstat iowait_sum.
  • swapin_delay_* — delayacct_swapin_start / _end (include/linux/delayacct.h). Swap-in wait. OVERLAPS with thrashing_* — every thrashing event is also a swapin event from the syscall layer; do not sum the two.
  • freepages_delay_* — delayacct_freepages_start / _end (mm/page_alloc.c). Direct memory reclaim wait.
  • thrashing_delay_* — delayacct_thrashing_start / _end (mm/workingset.c). Thrashing wait. Refines swapin tracking — see swapin_*.
  • compact_delay_* — delayacct_compact_start / _end (mm/compaction.c). Memory-compaction wait.
  • wpcopy_delay_* — delayacct_wpcopy_start / _end (mm/memory.c). Write-protect-copy (CoW) fault wait. Introduced in taskstats v13.
  • irq_delay_* — delayacct_irq (kernel/delayacct.c). IRQ-handler windows charged to the task by IRQ accounting. Introduced in taskstats v14.

Each category has four fields:

  • *_count — number of windows observed (MonotonicCount, SumCount).
  • *_delay_total_ns — cumulative ns of delay (MonotonicNs, SumNs).
  • *_delay_max_ns — longest single window observed (PeakNs, MaxPeak).
  • *_delay_min_ns — shortest non-zero window observed (PeakNs, MaxPeak). Sentinel 0 means “no events observed”, NOT “saw a zero-ns event”; compare against the matching *_count to disambiguate.

The two memory watermarks (hiwater_rss_bytes, hiwater_vm_bytes) are PeakBytes / MaxPeakBytes — see the MaxPeakBytes row in the Aggregation rules section below for the shared-mm semantics.

Compare

ktstr ctprof compare before.ctprof.zst after.ctprof.zst

compare joins the two snapshots on pcomm (process name) by default — see Grouping for the other axes — and emits one row per (group, metric) pair. Groups present on only one side surface as unmatched rows, so the operator can distinguish “the process did not exist on that side” from “it existed but did zero work”.

Grouping

  • --group-by pcomm (default) — aggregate every thread of the same process together.
  • --group-by cgroup — aggregate by cgroup path. Useful for container-per-workload deployments where the process name is ambiguous across cgroups.
  • --group-by comm — aggregate by thread name across every process under token-based pattern normalization (tokio-worker-{0..N} → one bucket; kworker/0:1H-events_highpri, kworker/1:0H-events_highpri, … → one bucket; a sketch of the idea follows this list). Useful when a thread-pool name spans many binaries and you want one row per pool, not per binary. Disable normalization with --no-thread-normalize.
  • --group-by comm-exact — synonym for --group-by comm --no-thread-normalize. Aggregate by literal thread name, no pattern collapse. Use when distinct token values carry meaning (e.g. tracking each kworker/u8:N independently).
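
A minimal sketch of the normalization idea: collapse digit runs to a placeholder token so sibling pool threads share one bucket (the real normalizer’s rules, e.g. for kworker suffixes, may differ):

fn normalize_comm(comm: &str) -> String {
    let mut out = String::with_capacity(comm.len());
    let mut in_digits = false;
    for c in comm.chars() {
        if c.is_ascii_digit() {
            // First digit of a run emits the placeholder; the rest are dropped.
            if !in_digits {
                out.push_str("{N}");
                in_digits = true;
            }
        } else {
            in_digits = false;
            out.push(c);
        }
    }
    out
}

// "tokio-worker-7"              -> "tokio-worker-{N}"
// "kworker/0:1H-events_highpri" -> "kworker/{N}:{N}H-events_highpri"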

Cgroup-path flattening

ktstr ctprof compare before.ctprof.zst after.ctprof.zst \
    --group-by cgroup \
    --cgroup-flatten '/kubepods/*/pod-*/container' \
    --cgroup-flatten '/system.slice/*.scope'

--cgroup-flatten accepts glob patterns that collapse dynamic segments (pod UUIDs, session scopes, transient unit IDs) to a canonical form before grouping, so the same logical workload across two runs lands on the same row even if the kernel assigned different UUIDs.

Filtering output: --sections vs --metrics

Two complementary filters narrow the rendered output:

  • --sections picks which sub-tables render. The default-empty value renders every section that has data; passing a comma-separated list restricts output to the named sub-tables — every section not listed is suppressed before its data-availability gate runs. Valid section names: primary, taskstats-delay, derived, cgroup-stats, cgroup-limits, memory-stat, memory-events, pressure, host-pressure, smaps-rollup, sched-ext. Five (cgroup-stats, cgroup-limits, memory-stat, memory-events, pressure) require --group-by cgroup; naming any of them under a non-cgroup grouping emits a stderr warning and renders zero rows.
  • --metrics picks which rows render inside the primary and derived sub-tables. The default-empty value renders every metric; passing a comma-separated list restricts the rendered rows to the named metrics. Names must come from the ctprof metric-list vocabulary (CTPROF_METRICS and CTPROF_DERIVED_METRICS). Has no effect on the secondary sub-tables (cgroup-stats, smaps-rollup, etc.) — those have fixed column shapes and ignore the row filter.

The two compose multiplicatively: --sections primary --metrics run_time_ns shows a single row in the primary sub-table and nothing else. --sections primary alone keeps every primary row; --metrics run_time_ns alone keeps the single row across every section that displays it.

Each metric carries exactly one Section tag in its registry entry — the 34 taskstats-sourced primary rows and the 9 taskstats-derived rows tag Section::TaskstatsDelay rather than Section::Primary / Section::Derived. They render inside the same primary / derived outer tables but match a distinct section name, so --sections taskstats-delay selects exactly the 34 + 9 taskstats rows alone, while --sections primary excludes them and --sections derived excludes the 9 taskstats derivations. The three-way split lets an operator scope to non-taskstats only, taskstats only, or any combination, without losing the visual grouping under the same outer headers.
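
Two concrete invocations showing the scoping (metric names taken from the examples above):

ktstr ctprof compare before.ctprof.zst after.ctprof.zst --sections taskstats-delay
ktstr ctprof compare before.ctprof.zst after.ctprof.zst --sections primary --metrics run_time_ns,voluntary_csw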

Aggregation rules

Each metric declares its own aggregation rule (CTPROF_METRICS in src/ctprof_compare.rs). The AggRule enum is typed: each variant binds an accessor of a specific metric_types newtype (MonotonicCount, MonotonicNs, PeakNs, Bytes, etc.) so a registry entry that pairs a peak field with a sum reduction (e.g. t.wait_max (PeakNs) bound to a Sum* rule) fails to compile rather than producing a meaningless 1×1s ⊕ 1000×1ms aggregate. The 14 variants split into five families: Sum reductions, Max reductions, Range reductions, Mode reductions, and the Affinity reduction.
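
A condensed sketch of how the typed binding catches mismatches at compile time (names simplified; the real enum has 14 variants):

// Each variant stores an accessor returning one specific newtype.
struct MonotonicNs(u64);
struct PeakNs(u64);
struct Thread; // stand-in for the per-thread sample record

enum AggRule {
    SumNs(fn(&Thread) -> MonotonicNs),
    MaxPeak(fn(&Thread) -> PeakNs),
}

fn wait_max(_t: &Thread) -> PeakNs {
    PeakNs(0)
}

// AggRule::SumNs(wait_max) would not compile: PeakNs is not MonotonicNs.
const RULE: AggRule = AggRule::MaxPeak(wait_max);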

Sum reductions (cumulative counters)

  • SumCount (MonotonicCount; unitless) — nr_wakeups (+ _local / _remote / _sync / _migrate / _affine / _affine_attempts), nr_migrations, nr_forced_migrations, nr_failed_migrations_*, voluntary_csw, nonvoluntary_csw, minflt, majflt, wait_count, iowait_count, timeslices, syscr, syscw, every taskstats *_delay_count (8 entries)
  • SumNs (MonotonicNs; ns) — run_time_ns, wait_time_ns, wait_sum, voluntary_sleep_ns, block_sum, iowait_sum, core_forceidle_sum, every taskstats *_delay_total_ns (8 entries)
  • SumTicks (ClockTicks; USER_HZ ticks) — utime_clock_ticks, stime_clock_ticks
  • SumBytes (Bytes; bytes, IEC) — allocated_bytes, deallocated_bytes, rchar, wchar, read_bytes, write_bytes, cancelled_write_bytes

Group reduction: saturating_add per the no-wraparound contract. Delta is the signed difference; percent delta is relative to the before-side. Auto-scale ladder is decimal SI for ns / count, USER_HZ for ticks, IEC binary for bytes.
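
The reduction and delta arithmetic in isolation (illustrative shapes; assumes counter values fit in i64 for the signed difference):

fn reduce_sum(values: impl Iterator<Item = u64>) -> u64 {
    // No-wraparound contract: clamp at u64::MAX instead of overflowing.
    values.fold(0, u64::saturating_add)
}

fn delta(before: u64, after: u64) -> (i64, Option<f64>) {
    let d = after as i64 - before as i64; // signed difference
    let pct = (before != 0).then(|| d as f64 * 100.0 / before as f64);
    (d, pct) // percent delta is relative to the before-side
}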

Max reductions (peaks and gauges)

  • MaxPeak (PeakNs; ns) — wait_max, sleep_max, block_max, exec_max, slice_max, every taskstats *_delay_max_ns (8 entries), every taskstats *_delay_min_ns (8 entries)
  • MaxPeakBytes (PeakBytes; bytes, IEC) — hiwater_rss_bytes, hiwater_vm_bytes (taskstats lifetime memory watermarks)
  • MaxGaugeNs (GaugeNs; ns) — fair_slice_ns (current scheduler slice)
  • MaxGaugeCount (GaugeCount; unitless) — nr_threads (process-wide thread count)

MaxPeak / MaxPeakBytes rows surface the worst single window or largest watermark any thread in the group has ever observed — summing per-thread maxes would conflate “one thread with a 1s spike” with “1000 threads with 1ms spikes each”. MaxPeakBytes is the byte-typed twin of MaxPeak and routes through the IEC binary auto-scale ladder so a 7.5 GiB watermark renders as 7.500GiB rather than dominating the table with raw byte counts. xacct_add_tsk (kernel/tsacct.c) reads the watermarks from the SHARED mm_struct, so sibling threads of the same tgid all report the same value; cross-thread Max within a single process is a no-op, while cross-process Max under a multi-tgid bucket picks the largest watermark any tgid in the bucket reported.

MaxGaugeNs / MaxGaugeCount apply to instantaneous gauges (read at capture time) where summing has no physical meaning. nr_threads specifically is leader-only (populated on tid == tgid, zero elsewhere); Max reads through the leader so a comm-bucketed group still surfaces the largest process represented in the bucket. The taskstats *_delay_min_ns rows also use MaxPeak: min here is the kernel’s per-task lifetime shortest non-zero observation, so cross-thread Max picks “the largest minimum any contributor reported”; sentinel 0 means “no events observed” — compare against the matching count.

Range reductions (bounded ordinals)

  • RangeI32 (OrdinalI32; output [min, max], i64-widened) — nice, priority, processor
  • RangeU32 (OrdinalU32; output [min, max], i64-widened) — rt_priority

The renderer shows [min, max] and the delta uses the midpoint so a shift on either end is visible.

Mode reductions (categorical)

  • ModeCategorical (String; most-frequent value + count/total) — policy
  • ModeChar (char, coerced; most-frequent char + count/total) — state
  • ModeBool (bool, coerced; most-frequent bool + count/total) — ext_enabled

Mode is textual: delta is "same" if both modes agree, "differs" otherwise — there is no arithmetic on a categorical value. ModeChar and ModeBool coerce to String via to_string() before reducing because the underlying types are not themselves Modeable. A 50/50 bool tie resolves lex-smallest-wins (so "false" wins over "true"); operators reading a false mode in a heterogeneous bucket should check the count/total fraction.

Affinity reduction (CPU sets)

  • Affinity (CpuSet; output AffinitySummary { min_cpus, max_cpus, uniform }) — cpu_affinity

Heterogeneous groups render as "N-M cpus (mixed)". Unlike the other rules, Affinity does not route through a metric_types trait — its reduction produces a structured summary, not a homogeneous newtype.

Derived metrics

Derived metrics consume one or more already-aggregated input metrics from CTPROF_METRICS and produce a single scalar with its own auto-scale ladder. They render in a separate ## Derived metrics table below the per-thread table on both compare and show, with rows colored blue to distinguish them from the primary table on TTY stdout. Registered in CTPROF_DERIVED_METRICS in src/ctprof_compare.rs.

The full registry is 17 entries: 8 schedstat / I/O / heap derivations plus 9 taskstats-derived (the 8 per-bucket avg_*_delay_ns averages plus the total_offcpu_delay_ns rollup). Every formula is implemented as a closure over the group’s metrics map (BTreeMap<String, Aggregated>); a missing input or a zero denominator yields None, which the renderer surfaces as - so the operator can distinguish “not computable” from “computed as zero”.
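
The None-propagation contract, sketched with a plain scalar map (the real code goes through input_scalar over BTreeMap<String, Aggregated>):

use std::collections::BTreeMap;

// A derived formula over the group's metrics map: a missing input or
// a zero denominator yields None, which the renderer prints as `-`.
fn avg_wait_ns(metrics: &BTreeMap<String, f64>) -> Option<f64> {
    let sum = *metrics.get("wait_sum")?;
    let count = *metrics.get("wait_count")?;
    if count == 0.0 {
        None // "not computable", distinct from "computed as zero"
    } else {
        Some(sum / count)
    }
}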

| Metric | Formula | Inputs | Unit | Notes |
| --- | --- | --- | --- | --- |
| affine_success_ratio | nr_wakeups_affine / nr_wakeups_affine_attempts | nr_wakeups_affine, nr_wakeups_affine_attempts | ratio (0..1) | wake_affine() success ratio. CFS-only signal — sched_ext does not increment the wakeup counters. Bare three-decimal scalar; the renderer suppresses the % column for ratio rows because absolute delta on a [0, 1] ratio is already in percentage points. |
| avg_wait_ns | wait_sum / wait_count | wait_sum, wait_count | ns | Average runqueue-wait duration per scheduling event. Rendered with the ns auto-scale ladder (ns → µs → ms → s). Schedstat-gated (see wait_sum and wait_count); zero across sched_ext threads. |
| cpu_efficiency | run_time_ns / (run_time_ns + wait_time_ns) | run_time_ns, wait_time_ns | ratio (0..1) | Fraction of total scheduler-tracked time spent on-CPU. Higher = less time stuck on the runqueue. Both inputs gated by CONFIG_SCHED_INFO. |
| avg_slice_ns | run_time_ns / timeslices | run_time_ns, timeslices | ns | Average on-CPU slice length. Useful for spotting timeslice-tuning regressions (e.g. a sched_min_granularity_ns change that shrinks slices). Both inputs gated by CONFIG_SCHED_INFO. |
| involuntary_csw_ratio | nonvoluntary_csw / (voluntary_csw + nonvoluntary_csw) | nonvoluntary_csw, voluntary_csw | ratio (0..1) | Fraction of context switches that were preemptions (the kernel pulled the task off-CPU) vs. voluntary blocks. High values indicate preemption pressure; low values indicate cooperative blocking. |
| disk_io_fraction | read_bytes / rchar | read_bytes, rchar | ratio (≥ 0) | Fraction of read-syscall bytes that traveled past the pagecache layer (cache miss rate; covers local block devices and network filesystems alike). Typically ≤ 1.0, but can exceed 1 when readahead pulls more bytes past the pagecache than the syscalls requested. Both inputs gated by CONFIG_TASK_IO_ACCOUNTING. |
| live_heap_estimate | allocated_bytes - deallocated_bytes (signed) | allocated_bytes, deallocated_bytes | bytes (IEC, signed) | jemalloc-only live-heap estimate. Glibc and other allocators feed both inputs zero, so the derived metric reads zero too — - would imply non-computable, but here zero is the genuine reading. Renders on the IEC binary ladder (B → KiB → MiB → GiB → TiB). Per-thread readings carry cross-thread noise: a thread that purely frees objects allocated by other threads reads large negative values; group-level Sum across all threads of the process eliminates the asymmetry. |
| avg_iowait_ns | iowait_sum / iowait_count | iowait_sum, iowait_count | ns | Average iowait interval per blocking event. Schedstat-gated; zero across sched_ext threads. |
| avg_cpu_delay_ns | cpu_delay_total_ns / cpu_delay_count | cpu_delay_total_ns, cpu_delay_count | ns | Average runqueue wait per scheduling event from the taskstats delayacct path. RACY: the kernel updates count + total via the lockless sched_info path, so a concurrent reader may observe one ahead of the other; the quotient is approximate at the sub-event scale and stable at the integrated scale. Distinct from avg_wait_ns (schedstat), which captures the same wait-for-CPU bucket via a different code path. |
| avg_blkio_delay_ns | blkio_delay_total_ns / blkio_delay_count | blkio_delay_total_ns, blkio_delay_count | ns | Average synchronous block-I/O wait per event from the taskstats delayacct path. Distinct from avg_iowait_ns (schedstat) — this is the canonical delay-accounting block-I/O reading. |
| avg_swapin_delay_ns | swapin_delay_total_ns / swapin_delay_count | swapin_delay_total_ns, swapin_delay_count | ns | Average swap-in wait per event. OVERLAPS with thrashing — every thrashing event is also a swapin event from the syscall layer; do not sum the two averages or the underlying totals directly. |
| avg_freepages_delay_ns | freepages_delay_total_ns / freepages_delay_count | freepages_delay_total_ns, freepages_delay_count | ns | Average direct-reclaim wait per event. |
| avg_thrashing_delay_ns | thrashing_delay_total_ns / thrashing_delay_count | thrashing_delay_total_ns, thrashing_delay_count | ns | Average thrashing wait per event. OVERLAPS with swapin (see avg_swapin_delay_ns). |
| avg_compact_delay_ns | compact_delay_total_ns / compact_delay_count | compact_delay_total_ns, compact_delay_count | ns | Average memory-compaction wait per event. |
| avg_wpcopy_delay_ns | wpcopy_delay_total_ns / wpcopy_delay_count | wpcopy_delay_total_ns, wpcopy_delay_count | ns | Average write-protect-copy (CoW) fault wait per event. |
| avg_irq_delay_ns | irq_delay_total_ns / irq_delay_count | irq_delay_total_ns, irq_delay_count | ns | Average IRQ-handler window per event. |
| total_offcpu_delay_ns | cpu + blkio + freepages + compact + wpcopy + irq + max(swapin, thrashing) | every *_delay_total_ns | ns | Sum of every meaningful off-CPU delay-accounting bucket. The swapin/thrashing pair is OR’d with .max() rather than summed because the two share syscall-layer events (every thrashing event is also a swapin from the syscall perspective); summing both would double-count thrashing-induced swapins. When CONFIG_TASK_DELAY_ACCT is off, the runtime toggle is off, or the kernel predates a bucket’s introduction (e.g. wpcopy_* lands in taskstats v13, irq_* in v14), the missing buckets read zero from the truncated taskstats payload — the rollup degrades to the sum of the populated buckets rather than returning -. The structured taskstats outcome lives on CtprofSnapshot::taskstats_summary so the operator can disambiguate “no data” from “zero data.” |
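
The rollup's double-count guard is the one non-obvious piece of arithmetic in the registry; sketched with hypothetical scalar inputs:

// total_offcpu_delay_ns: sum the independent buckets, but OR the
// swapin/thrashing pair with max(), since the two share syscall-layer
// events and summing both would double-count thrashing-induced swapins.
fn total_offcpu(cpu: u64, blkio: u64, freepages: u64, compact: u64,
                wpcopy: u64, irq: u64, swapin: u64, thrashing: u64) -> u64 {
    [cpu, blkio, freepages, compact, wpcopy, irq, swapin.max(thrashing)]
        .into_iter()
        .fold(0, u64::saturating_add)
}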

The is_ratio column on the registry is load-bearing for the renderer: ratio rows skip the % column entirely (the absolute delta already carries percentage-point semantics for a [0, 1] quantity), and the auto-scale ladder is None (bare three-decimal scalar). Non-ratio derived metrics reuse the same ladders as their unit family — Ns for nanosecond derivations, Bytes for byte derivations.

The 9 taskstats-derived entries (the 8 avg_*_delay_ns averages plus total_offcpu_delay_ns) tag Section::TaskstatsDelay rather than Section::Derived so --sections taskstats-delay renders the full taskstats view — the 34 raw rows AND the 9 derivations that depend on them — without dragging in unrelated derivations.

Derived metrics are surfaced by ctprof metric-list alongside the primary registry, and are valid --sort-by keys on both compare and show.

Output and interpretation

The comparison prints raw numbers and percent delta. There are no judgment labels (regression vs. improvement) — the meaning of “run_time went up 15%” depends on whether you were measuring a CPU-bound workload (more work done) or a spin-wait pathology (more time wasted). The interpretation is scheduler-specific and left to the operator.

Sort order: by default, rows are sorted by absolute delta (largest movers first) so the most-changed metrics surface at the top. Rows with no numeric scalar (policy, heterogeneous affinity) fall to the bottom.

File format

.ctprof.zst is zstd-compressed JSON of CtprofSnapshot. The schema is #[non_exhaustive] so field additions do not break existing snapshots:

CtprofSnapshot
├── captured_at_unix_ns: u64
├── host: Option<HostContext>
├── threads: Vec<ThreadState>
├── cgroup_stats: BTreeMap<String, CgroupStats>
├── probe_summary: Option<CtprofProbeSummary>
├── parse_summary: Option<CtprofParseSummary>
├── taskstats_summary: Option<TaskstatsSummary>
├── psi: Psi
└── sched_ext: Option<SchedExtSysfs>

TaskstatsSummary carries per-snapshot taskstats genetlink query outcomes — ok_count, eperm_count, esrch_count, other_err_count — so an operator can distinguish “no taskstats data because every tid raced exit” (high esrch_count) from “no taskstats data because the kernel was built without CONFIG_TASKSTATS” (the netlink open failed up-front, every counter zero) from “no taskstats data because CAP_NET_ADMIN is missing” (high eperm_count).

ThreadState::start_time_clock_ticks is in USER_HZ (100 on x86_64 and aarch64), NOT the kernel-internal CONFIG_HZ — so cross-host comparison between differently-configured kernels on those architectures is meaningful. Other in-tree architectures (alpha, for instance, with USER_HZ=1024) would require normalization at capture time; the capture layer currently targets x86_64 and aarch64 only.

Compression level 3 (matching the ktstr remote-cache convention): adequate ratio at fast speed, and ctprof captures are small enough that further compression produces diminishing returns on I/O.
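
The read path, sketched, assuming the zstd and serde_json crates and a serde-derived CtprofSnapshot:

use std::fs::File;

// Decompress the zstd frame, then deserialize the JSON payload.
// serde ignores unknown fields by default, which together with the
// #[non_exhaustive] schema keeps old snapshots readable as fields land.
fn load_snapshot(path: &str) -> anyhow::Result<CtprofSnapshot> {
    let bytes = zstd::decode_all(File::open(path)?)?;
    Ok(serde_json::from_slice(&bytes)?)
}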

Adding a metric

Adding a per-thread metric to the registry is a three-step mechanical process. The type system enforces the wiring so a mismatch between the kernel-source semantic and the aggregation rule fails to compile rather than producing a silently-wrong group reduction.

1. Add a ThreadState field with the right newtype

Pick the metric_types newtype that matches the kernel-source semantic of the field — the per-newtype docs name the kernel call sites that update each category. The shape determines what aggregation rules are legal in step 3:

| Newtype | When to use |
| --- | --- |
| MonotonicCount | Pure counter — only goes up across the thread’s lifetime. Examples: nr_wakeups, syscall counts, every taskstats *_delay_count. |
| DeadCounter | Same shape as MonotonicCount but tagged for kernel counters with no live writer (always reads zero). Captured for parser parity but does NOT implement any reduction trait — register with is_dead: true and the renderer flags it [dead]. |
| MonotonicNs | Cumulative-time counter in ns. Examples: run_time_ns, wait_sum, every taskstats *_delay_total_ns. |
| PeakNs | Lifetime high-water mark in ns. Kernel updates via if (delta > stat->max) stat->max = delta. Summing peaks is a category error. Examples: wait_max, slice_max, every taskstats *_delay_max_ns and *_delay_min_ns. |
| PeakBytes | Byte-typed twin of PeakNs — lifetime high-water mark in bytes. Routes through the IEC binary auto-scale ladder. Used for taskstats memory watermarks (hiwater_rss_bytes, hiwater_vm_bytes) read from the shared mm_struct. Pairs with AggRule::MaxPeakBytes. |
| GaugeNs | Instantaneous gauge sampled at capture time (ns). Cannot sum — N near-identical samples collapse to N×gauge with no meaning. Example: fair_slice_ns. |
| GaugeCount | Instantaneous unitless count that goes up AND down. Example: nr_threads. |
| ClockTicks | USER_HZ-scaled time. Examples: utime_clock_ticks, stime_clock_ticks. |
| Bytes | Byte counts. IEC binary auto-scale ladder. Examples: read_bytes, wchar. |
| OrdinalI32 / OrdinalU32 / OrdinalU64 | Bounded scalar — range-aggregated, not summable. Examples: nice (i32), rt_priority (u32). The Rangeable::range_across reduction returns Option<Range<Self>> — see Range<T> below. OrdinalU64 implements Rangeable but is currently unused in the registry; a metric that picks OrdinalU64 requires adding AggRule::RangeU64 alongside the existing RangeI32 and RangeU32 variants. |
| CategoricalString | Categorical value — mode-aggregated. Example: policy. |
| CpuSet | CPU affinity mask — affinity-aggregated. Example: cpu_affinity. |
| Range<T> | Output type of the Rangeable::range_across reduction. Carries min and max of the same T with the min <= max invariant enforced at construction (debug_assert! in Range::new). Not stored on ThreadState — the Aggregated::OrdinalRange boundary unwraps it via into_tuple() to an (i64, i64) pair widened from the underlying OrdinalI32 / OrdinalU32 / OrdinalU64. |

Add the field to ThreadState in src/ctprof.rs:

// In ThreadState struct definition.
/// Description: what the field counts, what kernel call site
/// writes it, and what scheduler classes increment it. Cite
/// `kernel/sched/...` line numbers for the writer.
pub my_new_metric: crate::metric_types::MonotonicCount,

2. Wire the capture path

capture_thread_at_with_tally in src/ctprof.rs is the single per-thread procfs walk. Add the per-source reader (or extend an existing one) and stamp the field in the ThreadState { ... } construction:

// Inside capture_thread_at_with_tally, after the existing
// per-source reads. Wrap in the newtype constructor; never use
// `.into()` (the typed-newtype style is explicit).
my_new_metric: MonotonicCount(sched.my_new_metric.unwrap_or(0)),

The Option::unwrap_or(0) collapse is load-bearing: the profiler’s contract is “never fail the snapshot,” so a missing reading lands at the newtype’s Default::default() (zero). The absent reading is indistinguishable from a genuine zero in the output — see the Capture is best-effort section.

3. Register the metric

Append a CtprofMetricDef entry to CTPROF_METRICS in src/ctprof_compare.rs. The AggRule variant must match the newtype chosen in step 1 — the type system enforces this.

CtprofMetricDef {
    name: "my_new_metric",
    rule: AggRule::SumCount(|t| t.my_new_metric),
    sched_class: None, // or Some("cfs-only") / Some("non-ext") / Some("fair-policy")
    config_gates: &[], // or &["CONFIG_SCHEDSTATS"], etc.
    is_dead: false,    // true for kernel-side dead pointers
    description: "One-line operator-facing description; surfaces in `ctprof metric-list`.",
    section: Section::Primary, // or Section::TaskstatsDelay for taskstats-sourced rows
},

The name field is the canonical metric identifier — used by --sort-by, --metrics, and the metric-list output. (The --columns flag accepts layout names — group, threads, metric, baseline, candidate, delta, %, arrow, value — not metric names.) Names are ASCII short-form (matching the capture-side field name where possible). sched_class and config_gates render as bracketed suffixes in metric-list output ([cfs-only], [SCHEDSTATS]) so operators reading a row know which kernels populate the counter. The section tag drives the --sections per-row filter — most rows take Section::Primary; taskstats-sourced rows take Section::TaskstatsDelay.

Compile-time guards

The type system catches the four most common mistakes:

  • Wrong reduction family: pairing a PeakNs accessor with AggRule::SumNs fails with a type error — PeakNs does not implement Summable (only Maxable), and the closure’s return type does not match the variant’s expected newtype.
  • Wrong unit family: pairing a Bytes accessor with AggRule::SumNs fails the same way.
  • Dead counter with live reduction: DeadCounter does not implement Summable / Maxable / Rangeable / Modeable, so any AggRule::Sum* / Max* / Range* / Mode* variant bound to a dead-counter accessor fails to compile. Register the metric only via the is_dead: true flag with whichever variant matches its shape — the rendering layer surfaces it as [dead] and skips numeric reduction.
  • Categorical with numeric reduction: pairing a CategoricalString accessor with AggRule::SumCount fails because CategoricalString does not implement Summable.

The closure body cannot be type-checked beyond the variant boundary, so a body that actively miswraps a field — e.g. SumNs(|t| MonotonicNs(t.wait_max.0)) laundering a peak through the sum wrapper — type-checks. Don’t do that. The wrapper category is load-bearing; the type system catches the variant mismatch but not the lying inside.
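
A stripped-down sketch of why the first guard holds (shapes hypothetical; the real newtypes and traits live in src/metric_types.rs):

pub struct MonotonicNs(pub u64);
pub struct PeakNs(pub u64);

pub trait Summable: Sized {
    fn saturating_add(self, other: Self) -> Self;
}

impl Summable for MonotonicNs {
    fn saturating_add(self, other: Self) -> Self {
        MonotonicNs(self.0.saturating_add(other.0))
    }
}
// Deliberately no `impl Summable for PeakNs`: the SumNs variant pins
// the accessor's return type to MonotonicNs, so
//     AggRule::SumNs(|t| t.wait_max)
// fails with: expected `MonotonicNs`, found `PeakNs`.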

Optional: derived metric

If the new metric has a useful ratio or sum-of-ratios pairing with existing inputs, register a DerivedMetricDef in CTPROF_DERIVED_METRICS (same file). The compute closure reads inputs via input_scalar(metrics, name)? and returns Option<DerivedValue>; the ratio_compute and ratio_of_sum_compute helpers cover the two most common shapes. Set is_ratio: true when the output is in [0, 1] so the renderer suppresses the % column. Set section to Section::Derived for general derivations or Section::TaskstatsDelay if every input is a taskstats field (so --sections taskstats-delay keeps the derivation alongside its raw inputs).
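
The shape of an entry, sketched; name, compute, is_ratio, and section come from the paragraph above, while the remaining fields and the exact ratio_compute signature are assumptions:

DerivedMetricDef {
    name: "my_ratio",
    // ratio_compute divides two already-aggregated inputs and
    // propagates None on a missing input or zero denominator.
    compute: |metrics| ratio_compute(metrics, "my_new_metric", "timeslices"),
    is_ratio: true, // output in [0, 1]: renderer suppresses the % column
    section: Section::Derived,
    description: "my_new_metric per timeslice (hypothetical example).",
},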

See also

  • cargo ktstr show-host — captures the host context (kernel, CPU, tunables) that the profiler embeds as the host field. Use show-host when you want to inspect host configuration only, without the per-thread walk.
  • Capture and Compare Host State — recipe covering the show-host / stats compare flow for comparing host context across sidecars (not the per-thread profiler).
  • Environment Variables — every ktstr-controlled env var.