Host-mode worker stall detection via /proc/<pid>/sched polling.
When a scenario runs in host-mode (no VM boot — !is_guest() AND
!cargo_test_mode_active()), the freeze coordinator / KVM-side
stall plumbing is unavailable. This module fills that gap by
polling every worker pid’s /proc/<pid>/sched file from a
background thread and flagging a “task did not run” condition
when both nr_switches and sum_exec_runtime are unchanged
across a sliding window of W samples.
§Signal
/proc/<pid>/sched exposes both nr_switches (total context
switches the task has been involved in) and sum_exec_runtime
(cumulative on-CPU nanoseconds). Both are emitted unconditionally
by kernel/sched/debug.c regardless of CONFIG_SCHEDSTATS, so
the signal works on any production kernel.
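As a rough illustration of how the two counters can be pulled out of that file, here is a minimal sketch. The names Sample, parse_sched_text, and read_sample are hypothetical stand-ins, not this module's API. It also assumes the kernel prints se.sum_exec_runtime as milliseconds with a six-digit nanosecond fraction (for example 12.345678), which is how kernel/sched/debug.c has formatted nanosecond values on the kernels this sketch was written against; verify against your target kernel.

```rust
/// Hypothetical stand-in for the crate's `SchedSample`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Sample {
    pub nr_switches: u64,
    pub sum_exec_runtime_ns: u64,
}

/// Convert "12.345678" (assumed: ms plus a six-digit ns fraction) to ns.
fn ms_fraction_to_ns(value: &str) -> Option<u64> {
    let (whole, frac) = value.split_once('.').unwrap_or((value, "0"));
    if !frac.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    // Normalise the fraction to exactly six digits before parsing.
    let mut digits = String::from(frac);
    digits.truncate(6);
    while digits.len() < 6 {
        digits.push('0');
    }
    let whole: u64 = whole.parse().ok()?;
    let frac: u64 = digits.parse().ok()?;
    whole.checked_mul(1_000_000)?.checked_add(frac)
}

/// Scan `key : value` lines and keep only the two counters of interest.
/// Returns `None` if either counter is missing or malformed.
pub fn parse_sched_text(text: &str) -> Option<Sample> {
    let mut nr_switches: Option<u64> = None;
    let mut runtime_ns: Option<u64> = None;
    for line in text.lines() {
        let Some((key, value)) = line.split_once(':') else {
            continue;
        };
        match key.trim() {
            "nr_switches" => nr_switches = value.trim().parse().ok(),
            "se.sum_exec_runtime" => runtime_ns = ms_fraction_to_ns(value.trim()),
            _ => {}
        }
    }
    Some(Sample {
        nr_switches: nr_switches?,
        sum_exec_runtime_ns: runtime_ns?,
    })
}

/// Read and parse one sample for `pid`; `None` if the task has gone away.
pub fn read_sample(pid: u32) -> Option<Sample> {
    let text = std::fs::read_to_string(format!("/proc/{pid}/sched")).ok()?;
    parse_sched_text(&text)
}
```

Matching on the exact key keeps nr_voluntary_switches and nr_involuntary_switches from being mistaken for nr_switches.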
se.statistics.wait_sum would arguably be a STRONGER signal
(cumulative time waiting in the runqueue — a starved task
grows wait_sum monotonically while a sleeping task does not),
but it IS gated on CONFIG_SCHEDSTATS=y. The monitor sticks
with the unconditional nr_switches + sum_exec_runtime pair
so it stays useful on minimum-config production kernels;
schedstat-aware schedulers (sched_ext schedulers that read
wait_sum via BPF) supplement this with their own latency
probes via the --ktstr-probe-stack pipeline.
Stall heuristic: if Δnr_switches == 0 AND
Δsum_exec_runtime == 0 across W consecutive samples, the task
has neither been picked nor preempted for W * poll_interval —
a stronger signal than either counter alone (a busy-loop on one
CPU could leave nr_switches flat while sum_exec_runtime
climbs; a fully starved task pins both).
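The heuristic above can be sketched as a pure function over a slice of samples. This is an illustrative reading of the rule, not the module's own stall_predicate; Sample, WINDOW, and is_stalled are hypothetical names, and WINDOW simply mirrors the documented W = 4.

```rust
/// Hypothetical stand-in for the crate's `SchedSample`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Sample {
    pub nr_switches: u64,
    pub sum_exec_runtime_ns: u64,
}

/// Mirrors the documented window size W = 4 (assumption for this sketch).
pub const WINDOW: usize = 4;

/// True only when there are at least `WINDOW` samples and every
/// consecutive pair shows zero delta on both counters.
pub fn is_stalled(samples: &[Sample]) -> bool {
    samples.len() >= WINDOW
        && samples.windows(2).all(|pair| {
            pair[1].nr_switches == pair[0].nr_switches
                && pair[1].sum_exec_runtime_ns == pair[0].sum_exec_runtime_ns
        })
}
```

Requiring both deltas to be zero is what keeps a busy-looping task, whose sum_exec_runtime keeps climbing while nr_switches stays flat, from being reported as stalled.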
§Cadence
Default poll interval is 500 ms; window size W = 4 yields a 2 s
detection latency. The interval is overridable via the
crate::KTSTR_STALL_POLL_MS_ENV env var (empty / unset / 0 /
unparseable falls back to the default).
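The fallback rule and the latency arithmetic can be sketched as follows. This is a minimal sketch, not the module's code: poll_interval and detection_latency are hypothetical helpers, the raw value is passed in rather than read from the environment because the variable's actual name lives in crate::KTSTR_STALL_POLL_MS_ENV, and trimming surrounding whitespace is an assumption of the sketch.

```rust
use std::time::Duration;

/// Mirrors the documented 500 ms default (assumption for this sketch).
const DEFAULT_POLL_MS: u64 = 500;
/// Mirrors the documented window size W = 4.
const WINDOW: u32 = 4;

/// Resolve the poll interval from the raw env value.
/// Unset, empty, zero, and unparseable values all fall back to the default.
pub fn poll_interval(raw: Option<&str>) -> Duration {
    let ms = raw
        .map(str::trim)
        .and_then(|s| s.parse::<u64>().ok())
        .filter(|&ms| ms != 0)
        .unwrap_or(DEFAULT_POLL_MS);
    Duration::from_millis(ms)
}

/// Detection latency as documented: W * poll_interval.
pub fn detection_latency(interval: Duration) -> Duration {
    interval * WINDOW
}
```

A caller would typically pass std::env::var(name).ok().as_deref(), where name is the value of the crate constant. With the default interval the latency works out to 4 × 500 ms = 2 s; halving the interval to 250 ms halves it to 1 s.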
§Diagnostic capture
When a stall fires, the polling thread captures a one-shot
diagnostic snapshot from /proc/<pid>/{wchan, syscall, status, stack, cgroup} and /proc/<pid>/task/<pid>/stat, plus the
host’s /proc/loadavg. Each field is read independently and
degrades gracefully to "[unreadable: <reason>]" on EACCES /
ENOENT. /proc/<pid>/stack requires CAP_SYS_ADMIN and is
typically unavailable to unprivileged callers, so its absence
is not a failure.
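The per-field degradation described above might look like the sketch below. read_field and capture are hypothetical helpers, and the exact wording of the reason text (here, the io::Error display string) is an assumption; only the "[unreadable: <reason>]" shape comes from this documentation.

```rust
use std::fs;
use std::path::Path;

/// Read one procfs file, or return a placeholder instead of failing.
pub fn read_field(path: &Path) -> String {
    match fs::read_to_string(path) {
        Ok(text) => text.trim_end().to_string(),
        // Covers EACCES, ENOENT, and any other I/O error.
        Err(err) => format!("[unreadable: {err}]"),
    }
}

/// Collect (path, content-or-placeholder) pairs for one pid.
/// Each file is read independently, so one failure never hides the others.
pub fn capture(pid: u32) -> Vec<(String, String)> {
    let base = format!("/proc/{pid}");
    let mut paths: Vec<String> = ["wchan", "syscall", "status", "stack", "cgroup"]
        .iter()
        .map(|name| format!("{base}/{name}"))
        .collect();
    paths.push(format!("{base}/task/{pid}/stat"));
    paths.push("/proc/loadavg".to_string());

    paths
        .into_iter()
        .map(|path| {
            let value = read_field(Path::new(&path));
            (path, value)
        })
        .collect()
}
```

Returning a string for every field, readable or not, means the snapshot always has the same shape, which keeps report formatting simple when a worker exits between the stall trip and the capture.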
Structs§
- SchedSample - Snapshot of the two scheduler counters this monitor watches.
- StallDiagnostic - One-shot diagnostic snapshot captured at stall-trip time.
- StallReport - One stall report: a worker pid, the sample window that triggered the stall predicate, and the captured diagnostic.
Constants§
- DEFAULT_POLL_INTERVAL_MS - Default poll cadence when KTSTR_STALL_POLL_MS_ENV is unset / empty / 0 / unparseable. 500 ms × W=4 yields a 2 s detection latency: short enough to catch a stuck scheduler within a typical ktstr test duration, long enough that procfs reads stay O(workers) per second rather than swamping the host.
- STALL_WINDOW - Sliding-window size: number of consecutive flat samples that flip the stall predicate. W=4 with DEFAULT_POLL_INTERVAL_MS yields a 2 s detection latency. Constant rather than env-tunable: the operator already controls latency via the poll interval, and a smaller W would false-positive on transient idle.
Functions§
- parse_sched_file - Parse /proc/<pid>/sched content into a SchedSample.
- stall_predicate - Stall predicate: returns true when samples.len() >= STALL_WINDOW AND every consecutive pair has both nr_switches delta == 0 AND sum_exec_runtime_ns delta == 0. A window shorter than STALL_WINDOW never fires (insufficient signal).