Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Checking

ktstr checks scheduler behavior through two channels: worker-side telemetry and host-side monitoring.

Worker checks

After each scenario, ktstr collects WorkerReport from every worker process. Several checks run against these reports:

Starvation – any worker with work_units == 0 fails the test.

Fairness – workers in the same cgroup should get similar CPU time. The “spread” (max off-CPU% - min off-CPU%) must be below a threshold (15% in release builds, 35% in debug). Violations report the spread and per-cgroup statistics.

Scheduling gaps – the longest wall-clock gap observed at work-unit checkpoints. Gaps above a threshold (2000ms release, 3000ms debug) indicate the scheduler dropped a task. Reports include the gap duration, CPU, and timing.

Cpuset isolation – workers must only run on CPUs in their assigned cpuset. Any execution on an unexpected CPU fails the test. Opt-in via isolation = true on the #[ktstr_test] attribute or via Assert::check_isolation(); Assert::default_checks() leaves this None, so the runtime merge resolves to false and the check is skipped unless explicitly enabled.

Throughput parityassert_throughput_parity() checks that workers produce similar throughput (work_units per CPU-second). Two thresholds:

  • max_throughput_cv: coefficient of variation across workers. High CV means the scheduler gives some workers disproportionately less effective CPU. Requires at least 2 workers with nonzero CPU time.
  • min_work_rate: minimum work_units per CPU-second per worker. Catches cases where all workers are equally slow (CV passes but absolute throughput is too low).

Neither threshold is set by default; enable via Assert setters or #[ktstr_test] attributes.

Benchmarkingassert_benchmarks() checks per-wakeup latency and iteration throughput. Three thresholds:

  • max_p99_wake_latency_ns: p99 of all resume_latencies_ns samples across workers in a cgroup. Populated only for work types that record wake-to-run latency: IoSyncWrite, IoRandRead, IoConvoy, Bursty, PipeIo, FutexPingPong, CacheYield, CachePipe, FutexFanOut (receivers), Sequence (Sleep / Yield / Io phases), ForkExit, NiceSweep, AffinityChurn, PolicyChurn, FanOutCompute, MutexContention. Pure-CPU work types (SpinWait, Mixed, CachePressure, PageFaultChurn) do not record samples.
  • max_wake_latency_cv: coefficient of variation of wake latency samples. High CV means inconsistent scheduling latency.
  • min_iteration_rate: minimum outer-loop iterations per wall-clock second per worker.

None are set by default. Set via Assert setters or #[ktstr_test] attributes.

Monitor checks

The host-side monitor reads guest VM memory (per-CPU runqueue structs via BTF offsets) and evaluates:

  • Imbalance ratio: max(nr_running) / max(1, min(nr_running)) across CPUs. The denominator is clamped to 1 so an all-idle sample does not divide by zero.
  • Local DSQ depth: per-CPU dispatch queue depth.
  • Stall detection: rq_clock not advancing on a CPU with runnable tasks. Idle CPUs and preempted vCPUs are exempt. See Monitor: Stall detection for exemption details.
  • Event rates: scx fallback and keep-last event counters.

Monitor thresholds use a sustained sample window (default: 5 samples). A violation must persist for N consecutive samples before failing.

NUMA checks

When workers use a MemPolicy, ktstr collects NUMA page placement data and checks it against thresholds:

Page localityassert_page_locality() checks the fraction of pages residing on the expected NUMA node(s). Expected nodes are derived from the worker’s MemPolicy::node_set() at evaluation time. Page counts come from WorkerReport::numa_pages (parsed from /proc/self/numa_maps). Returns 1.0 (vacuously local) when no pages are observed. Fails if the observed fraction falls below min_page_locality.

Cross-node migrationassert_cross_node_migration() checks the ratio of migrated pages to total allocated pages. WorkerReport::vmstat_numa_pages_migrated provides the delta of the numa_pages_migrated counter from /proc/vmstat over the work loop. Fails if the ratio exceeds max_cross_node_migration_ratio.

Slow-tier ratiomax_slow_tier_ratio checks the fraction of pages on memory-only NUMA nodes (CXL tiers). Fails if more than the specified fraction of pages land on memory-only nodes.

None of these thresholds are set by default. Set via Assert setters or #[ktstr_test] attributes.

Assert struct

Assert is a composable configuration that carries both worker checks and monitor thresholds:

pub struct Assert {
    // Worker checks
    pub not_starved: Option<bool>,
    pub isolation: Option<bool>,
    pub max_gap_ms: Option<u64>,
    pub max_spread_pct: Option<f64>,

    // Throughput checks
    pub max_throughput_cv: Option<f64>,
    pub min_work_rate: Option<f64>,

    // Benchmarking checks
    pub max_p99_wake_latency_ns: Option<u64>,
    pub max_wake_latency_cv: Option<f64>,
    pub min_iteration_rate: Option<f64>,
    pub max_migration_ratio: Option<f64>,

    // Monitor checks
    pub max_imbalance_ratio: Option<f64>,
    pub max_local_dsq_depth: Option<u32>,
    pub fail_on_stall: Option<bool>,
    pub sustained_samples: Option<usize>,
    pub max_fallback_rate: Option<f64>,
    pub max_keep_last_rate: Option<f64>,

    // NUMA checks
    pub min_page_locality: Option<f64>,
    pub max_cross_node_migration_ratio: Option<f64>,
    pub max_slow_tier_ratio: Option<f64>,
}

Every field is Option. None means “inherit from parent layer.”

Merge layers

Checking uses a three-layer merge:

  1. Assert::default_checks() – baseline: not_starved enabled, monitor thresholds from MonitorThresholds::DEFAULT.
  2. Scheduler.assert – scheduler-level overrides.
  3. Per-test assert – test-specific overrides via #[ktstr_test] attributes.

All fields use last-Some-wins semantics. A Some(false) in a higher layer can disable a check that a lower layer enabled.

let final_assert = Assert::default_checks()
    .merge(&scheduler.assert)
    .merge(&test_assert);

Default thresholds

Worker checks

CheckDefault (release)Default (debug)
Scheduling gap2000 ms3000 ms
Fairness spread15%35%

Debug builds run in small VMs with higher scheduling overhead, so thresholds are relaxed. Coverage-instrumented builds collect profraw data for code coverage analysis; all assertion and monitor threshold checks run normally.

Monitor checks

ThresholdDefaultRationale
max_imbalance_ratio4.0max(nr_running) / max(1, min(nr_running)) across CPUs (denominator clamped to 1 so an all-idle sample does not divide by zero). Lower values (2-3) false-positive during cpuset transitions.
max_local_dsq_depth50Per-CPU dispatch queue overflow. Sustained depth above this means the scheduler is not consuming dispatched tasks.
fail_on_stalltrueFail when rq_clock does not advance on a CPU with runnable tasks. Idle CPUs (NOHZ) and preempted vCPUs are exempt.
sustained_samples5At ~100ms sample interval, requires ~500ms of sustained violation. Filters transient spikes from cpuset reconfiguration.
max_fallback_rate200.0/sselect_cpu_fallback events per second across all CPUs. Sustained rate indicates systematic select_cpu failure.
max_keep_last_rate100.0/sdispatch_keep_last events per second across all CPUs. Sustained rate indicates dispatch starvation.

All monitor thresholds use the sustained_samples window – a violation must persist for N consecutive samples before failing.

Worker checks via Assert

Assert provides assert_cgroup() for running worker-side checks directly against collected reports:

let a = Assert::default_checks().max_gap_ms(5000);
let result = a.assert_cgroup(&reports, Some(&cpuset));

Use Assert for both the merge chain (#[ktstr_test] attributes, Scheduler.assert, execute_steps_with) and direct report checking.

Constants

  • Assert::NO_OVERRIDES – identity for merge; every field is None, so it overrides nothing. This is not “no checks” – when used as a per-test or per-scheduler assert, the runtime chain still applies defaults because it merges default_checks() -> scheduler -> test.
  • Assert::default_checks()not_starved enabled, monitor thresholds populated from MonitorThresholds::DEFAULT.

AssertResult

AssertResult carries pass/fail status, diagnostic messages, and aggregated statistics from a scenario run.

Construction

  • AssertResult::pass() – creates a passing result with empty details and default stats.
  • AssertResult::skip(reason) – creates a passing result with a skip reason in details and skipped = true. Used when a scenario cannot run under the current topology or flag combination but is not a failure.
  • AssertResult::fail(detail) – failing result carrying a single AssertDetail. Mirrors pass / skip for the failure axis.
  • AssertResult::fail_msg(msg) – shortcut for the common case where the failure is a plain diagnostic message tagged DetailKind::Other.

Mutation and inspection

  • result.note(msg) – append an informational annotation tagged DetailKind::Note. Does NOT flip passed or skipped — a note is context, not a verdict. Returns &mut Self so calls chain.
  • result.with_note(msg) – builder-style sibling of note that consumes and returns self. Use at the return site to chain a context annotation onto a fresh result without an intermediate let mut.
  • result.is_skipped() – convenience accessor returning skipped. Stats tooling uses this to subtract non-executions from pass counts.
  • result.is_failed() – convenience accessor returning !passed. Mirrors is_skipped so branches reading “did this claim fail?” don’t negate .passed inline.

Fields

  • passed: bool – whether all checks passed.
  • skipped: bool – distinguishes a passing result that ran every check from one that skipped execution (topology / flag mismatch, prerequisite absent). AssertResult::skip sets this; pass / fail / fail_msg leave it false.
  • details: Vec<AssertDetail> – structured diagnostic entries; each carries a kind: DetailKind (Other, Note, Skip, Temporal, …) plus a human-readable message: String. Consumers filter by kind for routing (failure vs informational note) and read message for display.
  • stats: ScenarioStats – aggregated worker telemetry across all cgroups (spread, gaps, migrations, wake latency, iterations).
  • measurements: BTreeMap<String, NoteValue> – structured per-test measurements keyed by name. Sidecar consumers and comparison tooling read this map directly without parsing details strings, so populate it (via Verdict::note_value during claim evaluation) for any value a downstream comparison needs to lift programmatically.

Merging

result.merge(other) combines two results. If other.passed is false, the merged result is also false. Details and stats are accumulated:

let mut combined = AssertResult::pass();
combined.merge(cgroup_0_result);
combined.merge(cgroup_1_result);
// combined.passed is false if either cgroup failed
// combined.details contains messages from both

Stats merging takes worst values across cgroups for spread, gap, wake latency, and migration ratio. Counters (total_workers, total_cpus, total_migrations, total_iterations) are summed.

For examples of overriding thresholds at the scheduler and per-test level, see Customize Checking.