# Checking
ktstr checks scheduler behavior through two channels: worker-side telemetry and host-side monitoring.
## Worker checks
After each scenario, ktstr collects a `WorkerReport` from every worker process. Several checks run against these reports:
**Starvation** – any worker with `work_units == 0` fails the test.
**Fairness** – workers in the same cgroup should get similar CPU time. The "spread" (`max off-CPU% - min off-CPU%`) must stay below a threshold (15% in release builds, 35% in debug). Violations report the spread and per-cgroup statistics.
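A minimal sketch of the spread computation, assuming per-worker off-CPU percentages have already been collected (helper names are illustrative, not the ktstr API):

```rust
/// Hypothetical helper: the fairness "spread" is the difference between
/// the largest and smallest off-CPU percentage across workers.
fn fairness_spread(off_cpu_pct: &[f64]) -> f64 {
    let max = off_cpu_pct.iter().cloned().fold(f64::MIN, f64::max);
    let min = off_cpu_pct.iter().cloned().fold(f64::MAX, f64::min);
    max - min
}

/// The check passes when the spread stays below the build-dependent
/// threshold (15.0 in release, 35.0 in debug).
fn spread_ok(off_cpu_pct: &[f64], threshold_pct: f64) -> bool {
    fairness_spread(off_cpu_pct) < threshold_pct
}
```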
**Scheduling gaps** – the longest wall-clock gap observed at work-unit checkpoints. Gaps above a threshold (2000 ms release, 3000 ms debug) indicate the scheduler dropped a task. Reports include the gap duration, CPU, and timing.
**Cpuset isolation** – workers must only run on CPUs in their assigned cpuset. Any execution on an unexpected CPU fails the test. Opt in via `isolation = true` on the `#[ktstr_test]` attribute or via `Assert::check_isolation()`; `Assert::default_checks()` leaves this `None`, so the runtime merge resolves to `false` and the check is skipped unless explicitly enabled.
**Throughput parity** – `assert_throughput_parity()` checks that workers produce similar throughput (`work_units` per CPU-second). Two thresholds:

- `max_throughput_cv`: coefficient of variation across workers. A high CV means the scheduler gives some workers disproportionately less effective CPU. Requires at least 2 workers with nonzero CPU time.
- `min_work_rate`: minimum `work_units` per CPU-second per worker. Catches cases where all workers are equally slow (CV passes but absolute throughput is too low).

Neither threshold is set by default; enable them via `Assert` setters or `#[ktstr_test]` attributes.
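The CV math above can be sketched as follows; the function name and the exact nonzero-CPU filter are assumptions, not the internals of `assert_throughput_parity()`:

```rust
/// Hypothetical sketch: coefficient of variation (stddev / mean) over
/// per-worker work rates (work_units per CPU-second). Returns None when
/// fewer than 2 workers have nonzero CPU time, mirroring the documented
/// precondition.
fn throughput_cv(rates: &[f64]) -> Option<f64> {
    let rates: Vec<f64> = rates.iter().cloned().filter(|r| *r > 0.0).collect();
    if rates.len() < 2 {
        return None;
    }
    let mean = rates.iter().sum::<f64>() / rates.len() as f64;
    let var = rates.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / rates.len() as f64;
    Some(var.sqrt() / mean)
}
```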
**Benchmarking** – `assert_benchmarks()` checks per-wakeup latency and iteration throughput. Three thresholds:

- `max_p99_wake_latency_ns`: p99 of all `resume_latencies_ns` samples across workers in a cgroup. Populated only for work types that record wake-to-run latency: `IoSyncWrite`, `IoRandRead`, `IoConvoy`, `Bursty`, `PipeIo`, `FutexPingPong`, `CacheYield`, `CachePipe`, `FutexFanOut` (receivers), `Sequence` (Sleep / Yield / Io phases), `ForkExit`, `NiceSweep`, `AffinityChurn`, `PolicyChurn`, `FanOutCompute`, `MutexContention`. Pure-CPU work types (`SpinWait`, `Mixed`, `CachePressure`, `PageFaultChurn`) do not record samples.
- `max_wake_latency_cv`: coefficient of variation of wake latency samples. A high CV means inconsistent scheduling latency.
- `min_iteration_rate`: minimum outer-loop iterations per wall-clock second per worker.

None are set by default. Set them via `Assert` setters or `#[ktstr_test]` attributes.
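A hedged sketch of computing a p99 over pooled `resume_latencies_ns` samples. ktstr's actual percentile method is not specified here, so this uses the nearest-rank definition:

```rust
/// Hypothetical sketch: nearest-rank p99 over pooled wake-latency
/// samples in nanoseconds. Returns None when no samples were recorded
/// (e.g. pure-CPU work types).
fn p99_ns(samples: &mut Vec<u64>) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_unstable();
    // Nearest-rank percentile: index = ceil(0.99 * n) - 1.
    let idx = ((samples.len() as f64 * 0.99).ceil() as usize).saturating_sub(1);
    Some(samples[idx])
}
```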
## Monitor checks
The host-side monitor reads guest VM memory (per-CPU runqueue structs via BTF offsets) and evaluates:
- Imbalance ratio: `max(nr_running) / max(1, min(nr_running))` across CPUs. The denominator is clamped to 1 so an all-idle sample does not divide by zero.
- Local DSQ depth: per-CPU dispatch queue depth.
- Stall detection: `rq_clock` not advancing on a CPU with runnable tasks. Idle CPUs and preempted vCPUs are exempt. See Monitor: Stall detection for exemption details.
- Event rates: scx fallback and keep-last event counters.
Monitor thresholds use a sustained sample window (default: 5 samples). A violation must persist for N consecutive samples before failing.
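The sustained-window rule can be sketched as a pure function over a stream of per-sample violation flags (name and shape are illustrative, not the monitor's actual code):

```rust
/// Hypothetical sketch: returns true if the violation flag holds for at
/// least `n` consecutive samples anywhere in the stream. A single clean
/// sample resets the streak, which filters transient spikes.
fn sustained_violation(samples: &[bool], n: usize) -> bool {
    let mut streak = 0;
    samples.iter().any(|&violating| {
        if violating { streak += 1 } else { streak = 0 }
        streak >= n
    })
}
```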
## NUMA checks
When workers use a `MemPolicy`, ktstr collects NUMA page placement data and checks it against thresholds:
**Page locality** – `assert_page_locality()` checks the fraction of pages residing on the expected NUMA node(s). Expected nodes are derived from the worker's `MemPolicy::node_set()` at evaluation time. Page counts come from `WorkerReport::numa_pages` (parsed from `/proc/self/numa_maps`). Returns 1.0 (vacuously local) when no pages are observed. Fails if the observed fraction falls below `min_page_locality`.
**Cross-node migration** – `assert_cross_node_migration()` checks the ratio of migrated pages to total allocated pages. `WorkerReport::vmstat_numa_pages_migrated` provides the delta of the `numa_pages_migrated` counter from `/proc/vmstat` over the work loop. Fails if the ratio exceeds `max_cross_node_migration_ratio`.
**Slow-tier ratio** – `max_slow_tier_ratio` checks the fraction of pages on memory-only NUMA nodes (CXL tiers). Fails if more than the specified fraction of pages land on memory-only nodes.
None of these thresholds are set by default. Set them via `Assert` setters or `#[ktstr_test]` attributes.
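A hedged sketch of the three NUMA ratio checks, assuming page counts have already been aggregated from the reports (function and parameter names are illustrative, not the ktstr API):

```rust
/// Hypothetical sketch of the locality fraction, including the
/// documented vacuous case: 1.0 when no pages are observed.
fn page_locality(local_pages: u64, total_pages: u64) -> f64 {
    if total_pages == 0 {
        return 1.0; // vacuously local
    }
    local_pages as f64 / total_pages as f64
}

/// All three checks together: locality must meet its floor, migration
/// and slow-tier fractions must stay under their ceilings.
fn numa_ok(
    local: u64,
    total: u64,
    migrated: u64,
    slow_tier: u64,
    min_locality: f64,
    max_migration_ratio: f64,
    max_slow_tier_ratio: f64,
) -> bool {
    let total_f = total.max(1) as f64;
    page_locality(local, total) >= min_locality
        && migrated as f64 / total_f <= max_migration_ratio
        && slow_tier as f64 / total_f <= max_slow_tier_ratio
}
```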
## Assert struct
`Assert` is a composable configuration that carries both worker checks and monitor thresholds:
```rust
pub struct Assert {
    // Worker checks
    pub not_starved: Option<bool>,
    pub isolation: Option<bool>,
    pub max_gap_ms: Option<u64>,
    pub max_spread_pct: Option<f64>,
    // Throughput checks
    pub max_throughput_cv: Option<f64>,
    pub min_work_rate: Option<f64>,
    // Benchmarking checks
    pub max_p99_wake_latency_ns: Option<u64>,
    pub max_wake_latency_cv: Option<f64>,
    pub min_iteration_rate: Option<f64>,
    pub max_migration_ratio: Option<f64>,
    // Monitor checks
    pub max_imbalance_ratio: Option<f64>,
    pub max_local_dsq_depth: Option<u32>,
    pub fail_on_stall: Option<bool>,
    pub sustained_samples: Option<usize>,
    pub max_fallback_rate: Option<f64>,
    pub max_keep_last_rate: Option<f64>,
    // NUMA checks
    pub min_page_locality: Option<f64>,
    pub max_cross_node_migration_ratio: Option<f64>,
    pub max_slow_tier_ratio: Option<f64>,
}
```
Every field is an `Option`. `None` means "inherit from the parent layer."
## Merge layers
Checking uses a three-layer merge:

1. `Assert::default_checks()` – baseline: `not_starved` enabled, monitor thresholds from `MonitorThresholds::DEFAULT`.
2. `Scheduler.assert` – scheduler-level overrides.
3. Per-test `assert` – test-specific overrides via `#[ktstr_test]` attributes.

All fields use last-`Some`-wins semantics: a `Some(false)` in a higher layer can disable a check that a lower layer enabled.
```rust
let final_assert = Assert::default_checks()
    .merge(&scheduler.assert)
    .merge(&test_assert);
```
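The last-`Some`-wins rule is essentially `Option::or` with the overriding layer on the left. A minimal sketch on a two-field stand-in (hypothetical `MiniAssert`, not the real struct):

```rust
/// Hypothetical two-field stand-in for Assert, illustrating the merge:
/// the overriding layer's Some wins; None inherits from below.
struct MiniAssert {
    not_starved: Option<bool>,
    max_gap_ms: Option<u64>,
}

impl MiniAssert {
    fn merge(&self, over: &MiniAssert) -> MiniAssert {
        MiniAssert {
            // `or` keeps the override when it is Some, else inherits.
            not_starved: over.not_starved.or(self.not_starved),
            max_gap_ms: over.max_gap_ms.or(self.max_gap_ms),
        }
    }
}
```

Note how `Some(false)` in the overriding layer wins over `Some(true)` below it, which is how a per-test `assert` can disable a scheduler-level check.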
## Default thresholds
### Worker checks
| Check | Default (release) | Default (debug) |
|---|---|---|
| Scheduling gap | 2000 ms | 3000 ms |
| Fairness spread | 15% | 35% |
Debug builds run in small VMs with higher scheduling overhead, so thresholds are relaxed. Coverage-instrumented builds collect profraw data for code coverage analysis; all assertion and monitor threshold checks run normally.
### Monitor checks
| Threshold | Default | Rationale |
|---|---|---|
| `max_imbalance_ratio` | 4.0 | `max(nr_running) / max(1, min(nr_running))` across CPUs (denominator clamped to 1 so an all-idle sample does not divide by zero). Lower values (2-3) false-positive during cpuset transitions. |
| `max_local_dsq_depth` | 50 | Per-CPU dispatch queue overflow. Sustained depth above this means the scheduler is not consuming dispatched tasks. |
| `fail_on_stall` | true | Fail when `rq_clock` does not advance on a CPU with runnable tasks. Idle CPUs (NOHZ) and preempted vCPUs are exempt. |
| `sustained_samples` | 5 | At a ~100 ms sample interval, requires ~500 ms of sustained violation. Filters transient spikes from cpuset reconfiguration. |
| `max_fallback_rate` | 200.0/s | `select_cpu_fallback` events per second across all CPUs. A sustained rate indicates systematic `select_cpu` failure. |
| `max_keep_last_rate` | 100.0/s | `dispatch_keep_last` events per second across all CPUs. A sustained rate indicates dispatch starvation. |
All monitor thresholds use the `sustained_samples` window – a violation must persist for N consecutive samples before failing.
## Worker checks via Assert
`Assert` provides `assert_cgroup()` for running worker-side checks directly against collected reports:
```rust
let a = Assert::default_checks().max_gap_ms(5000);
let result = a.assert_cgroup(&reports, Some(&cpuset));
```
Use `Assert` both for the merge chain (`#[ktstr_test]` attributes, `Scheduler.assert`, `execute_steps_with`) and for direct report checking.
## Constants
- `Assert::NO_OVERRIDES` – identity for `merge`; every field is `None`, so it overrides nothing. This is not "no checks" – when used as a per-test or per-scheduler `assert`, the runtime chain still applies defaults because it merges `default_checks() -> scheduler -> test`.
- `Assert::default_checks()` – `not_starved` enabled, monitor thresholds populated from `MonitorThresholds::DEFAULT`.
## AssertResult
`AssertResult` carries pass/fail status, diagnostic messages, and aggregated statistics from a scenario run.
### Construction
- `AssertResult::pass()` – creates a passing result with empty details and default stats.
- `AssertResult::skip(reason)` – creates a passing result with a skip reason in `details` and `skipped = true`. Used when a scenario cannot run under the current topology or flag combination but is not a failure.
- `AssertResult::fail(detail)` – failing result carrying a single `AssertDetail`. Mirrors `pass`/`skip` for the failure axis.
- `AssertResult::fail_msg(msg)` – shortcut for the common case where the failure is a plain diagnostic message tagged `DetailKind::Other`.
### Mutation and inspection
- `result.note(msg)` – append an informational annotation tagged `DetailKind::Note`. Does NOT flip `passed` or `skipped` – a note is context, not a verdict. Returns `&mut Self` so calls chain.
- `result.with_note(msg)` – builder-style sibling of `note` that consumes and returns `self`. Use at the return site to chain a context annotation onto a fresh result without an intermediate `let mut`.
- `result.is_skipped()` – convenience accessor returning `skipped`. Stats tooling uses this to subtract non-executions from pass counts.
- `result.is_failed()` – convenience accessor returning `!passed`. Mirrors `is_skipped` so branches asking "did this claim fail?" don't negate `passed` inline.
### Fields
- `passed: bool` – whether all checks passed.
- `skipped: bool` – distinguishes a passing result that ran every check from one that skipped execution (topology / flag mismatch, prerequisite absent). `AssertResult::skip` sets this; `pass`/`fail`/`fail_msg` leave it `false`.
- `details: Vec<AssertDetail>` – structured diagnostic entries; each carries a `kind: DetailKind` (`Other`, `Note`, `Skip`, `Temporal`, …) plus a human-readable `message: String`. Consumers filter by `kind` for routing (failure vs informational note) and read `message` for display.
- `stats: ScenarioStats` – aggregated worker telemetry across all cgroups (spread, gaps, migrations, wake latency, iterations).
- `measurements: BTreeMap<String, NoteValue>` – structured per-test measurements keyed by name. Sidecar consumers and comparison tooling read this map directly without parsing `details` strings, so populate it (via `Verdict::note_value` during claim evaluation) for any value a downstream comparison needs to lift programmatically.
### Merging
`result.merge(other)` combines two results. If `other.passed` is false, the merged result's `passed` is also false. Details and stats are accumulated:
```rust
let mut combined = AssertResult::pass();
combined.merge(cgroup_0_result);
combined.merge(cgroup_1_result);
// combined.passed is false if either cgroup failed
// combined.details contains messages from both
```
Stats merging takes the worst value across cgroups for spread, gap, wake latency, and migration ratio. Counters (`total_workers`, `total_cpus`, `total_migrations`, `total_iterations`) are summed.
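The worst-value / summed-counter rule can be sketched on a reduced stats struct (field names are illustrative, not the real `ScenarioStats`):

```rust
/// Hypothetical reduced stats: quality metrics take the worst value
/// across cgroups, counters are summed.
#[derive(Clone, Copy)]
struct MiniStats {
    max_gap_ms: u64,    // worst (largest) scheduling gap
    spread_pct: f64,    // worst (largest) fairness spread
    total_workers: u64, // summed counter
}

fn merge_stats(a: MiniStats, b: MiniStats) -> MiniStats {
    MiniStats {
        max_gap_ms: a.max_gap_ms.max(b.max_gap_ms),
        spread_pct: a.spread_pct.max(b.spread_pct),
        total_workers: a.total_workers + b.total_workers,
    }
}
```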
For examples of overriding thresholds at the scheduler and per-test level, see Customize Checking.