Module host_stall

Host-mode worker stall detection via /proc/<pid>/sched polling.

When a scenario runs in host-mode (no VM boot — !is_guest() AND !cargo_test_mode_active()), the freeze coordinator / KVM-side stall plumbing is unavailable. This module fills that gap by polling every worker pid’s /proc/<pid>/sched file from a background thread and flagging a “task did not run” condition when both nr_switches and sum_exec_runtime are unchanged across a sliding window of W samples.

§Signal

/proc/<pid>/sched exposes both nr_switches (total context switches the task has been involved in) and sum_exec_runtime (cumulative on-CPU nanoseconds). Both are emitted unconditionally by kernel/sched/debug.c regardless of CONFIG_SCHEDSTATS, so the signal works on any production kernel.
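A minimal sketch of pulling the two counters out of /proc/<pid>/sched text, for illustration only. `ParsedSample`, `parse_sched`, and `ms_to_ns` are hypothetical names, not the crate's `SchedSample` / `parse_sched_file`. The sketch assumes the usual `key : value` line layout, and that the runtime line is keyed `se.sum_exec_runtime` with a `<milliseconds>.<six-digit remainder>` value, which is how kernel/sched/debug.c prints nanosecond quantities on the kernels I am aware of.

```rust
/// Hypothetical stand-in for the crate's `SchedSample`; field names are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ParsedSample {
    pub nr_switches: u64,
    pub sum_exec_runtime_ns: u64,
}

/// Convert a "<ms>.<fraction>" value into nanoseconds.
/// Assumption: the fraction holds the sub-millisecond remainder in ns.
fn ms_to_ns(value: &str) -> Option<u64> {
    let (whole, frac) = value.split_once('.').unwrap_or((value, ""));
    let ms: u64 = whole.parse().ok()?;
    // Keep at most six fractional digits and right-pad with zeros.
    let mut digits: String = frac.chars().take(6).collect();
    if !digits.chars().all(|c| c.is_ascii_digit()) {
        return None;
    }
    while digits.len() < 6 {
        digits.push('0');
    }
    let sub_ns: u64 = digits.parse().ok()?;
    ms.checked_mul(1_000_000)?.checked_add(sub_ns)
}

/// Scan `key : value` lines; returns None unless both counters are present.
pub fn parse_sched(text: &str) -> Option<ParsedSample> {
    let mut nr_switches: Option<u64> = None;
    let mut runtime_ns: Option<u64> = None;
    for line in text.lines() {
        let Some((key, value)) = line.split_once(':') else {
            continue;
        };
        match key.trim() {
            "nr_switches" => nr_switches = value.trim().parse().ok(),
            "se.sum_exec_runtime" => runtime_ns = ms_to_ns(value.trim()),
            _ => {}
        }
    }
    Some(ParsedSample {
        nr_switches: nr_switches?,
        sum_exec_runtime_ns: runtime_ns?,
    })
}
```

Matching on the exact trimmed key keeps `nr_voluntary_switches` and `nr_involuntary_switches` from being mistaken for `nr_switches`.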

se.statistics.wait_sum would arguably be a STRONGER signal (cumulative time waiting in the runqueue — a starved task grows wait_sum monotonically while a sleeping task does not), but it IS gated on CONFIG_SCHEDSTATS=y. The monitor sticks with the unconditional nr_switches + sum_exec_runtime pair so it stays useful on minimum-config production kernels; schedstat-aware schedulers (sched_ext schedulers that read wait_sum via BPF) supplement this with their own latency probes via the --ktstr-probe-stack pipeline.

Stall heuristic: if Δnr_switches == 0 AND Δsum_exec_runtime == 0 across W consecutive samples, the task has neither been picked nor preempted for W * poll_interval — a stronger signal than either counter alone (a busy-loop on one CPU could leave nr_switches flat while sum_exec_runtime climbs; a fully starved task pins both).
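The heuristic can be sketched as a pure function over a slice of samples. This is an illustration of the rule as described here, not the crate's `stall_predicate`; `Counters`, `WINDOW`, and `is_stalled` are hypothetical names, with `WINDOW` mirroring the documented W = 4.

```rust
/// Mirrors the documented window size W = 4 (hypothetical constant name).
pub const WINDOW: usize = 4;

/// Hypothetical sample type holding the two watched counters.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Counters {
    pub nr_switches: u64,
    pub sum_exec_runtime_ns: u64,
}

/// True when at least WINDOW samples are present and every consecutive
/// pair shows no change in either counter.
pub fn is_stalled(samples: &[Counters]) -> bool {
    if samples.len() < WINDOW {
        // Too few samples to judge: never fire.
        return false;
    }
    samples.windows(2).all(|pair| {
        pair[0].nr_switches == pair[1].nr_switches
            && pair[0].sum_exec_runtime_ns == pair[1].sum_exec_runtime_ns
    })
}
```

Requiring both deltas to be zero is what separates a starved task from a busy loop: a spinning task keeps accumulating runtime, so one changed pair is enough to keep the predicate false.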

§Cadence

Default poll interval is 500 ms; window size W = 4 yields a 2 s detection latency. The interval is overridable via the crate::KTSTR_STALL_POLL_MS_ENV env var (empty / unset / 0 / unparseable falls back to the default).
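The fallback rule and the latency arithmetic can be sketched as follows. `poll_interval`, `detection_latency`, and `DEFAULT_MS` are hypothetical names; the sketch takes the raw env value as a parameter rather than guessing the variable name behind crate::KTSTR_STALL_POLL_MS_ENV, and trimming whitespace before parsing is an assumption.

```rust
use std::time::Duration;

/// Mirrors the documented 500 ms default (hypothetical constant name).
pub const DEFAULT_MS: u64 = 500;

/// Resolve the poll interval from a raw env value.
/// Unset, empty, zero, or unparseable input falls back to the default.
pub fn poll_interval(raw: Option<&str>) -> Duration {
    let ms = raw
        .map(str::trim)
        .and_then(|s| s.parse::<u64>().ok())
        .filter(|&ms| ms != 0)
        .unwrap_or(DEFAULT_MS);
    Duration::from_millis(ms)
}

/// Detection latency as stated in this section: window size times interval.
pub fn detection_latency(interval: Duration, window: u32) -> Duration {
    interval * window
}
```

A caller would typically feed it something like `poll_interval(std::env::var(name).ok().as_deref())`, where `name` is whatever string the crate's constant holds.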

§Diagnostic capture

When a stall fires, the polling thread captures a one-shot diagnostic snapshot from /proc/<pid>/{wchan, syscall, status, stack, cgroup} and /proc/<pid>/task/<pid>/stat, plus the host’s /proc/loadavg. Each field is read independently and degrades to "[unreadable: <reason>]" on EACCES / ENOENT. /proc/<pid>/stack requires CAP_SYS_ADMIN and is typically unreadable for unprivileged callers, so a missing stack is not a failure.
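A minimal sketch of the per-field degradation, assuming each procfs file is read as UTF-8 text. `read_or_placeholder`, `describe`, and `capture` are hypothetical helpers, and the wording used for `<reason>` is a guess; the crate's `StallDiagnostic` may format it differently.

```rust
use std::fs;
use std::io::ErrorKind;
use std::path::Path;

/// Short human-readable reason for a failed read (wording is assumed).
fn describe(kind: ErrorKind) -> String {
    match kind {
        ErrorKind::NotFound => "not found".to_string(),
        ErrorKind::PermissionDenied => "permission denied".to_string(),
        other => format!("{other:?}"),
    }
}

/// Read one file; on any error return a placeholder instead of failing.
pub fn read_or_placeholder(path: &Path) -> String {
    match fs::read_to_string(path) {
        Ok(text) => text.trim_end().to_string(),
        Err(err) => format!("[unreadable: {}]", describe(err.kind())),
    }
}

/// Collect the per-pid fields independently, so one unreadable file
/// (for example `stack` without privileges) does not hide the others.
pub fn capture(pid: u32) -> Vec<(String, String)> {
    let base = format!("/proc/{pid}");
    ["wchan", "syscall", "status", "stack", "cgroup"]
        .iter()
        .map(|name| {
            let path = Path::new(&base).join(name);
            (name.to_string(), read_or_placeholder(&path))
        })
        .collect()
}
```

Returning a string for every field, readable or not, means the snapshot always has the same shape, which keeps the report formatting simple.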

Structs§

SchedSample: Snapshot of the two scheduler counters this monitor watches.
StallDiagnostic: One-shot diagnostic snapshot captured at stall-trip time.
StallReport: One stall report: a worker pid, the sample window that triggered the stall predicate, and the captured diagnostic.

Constants§

DEFAULT_POLL_INTERVAL_MS: Default poll cadence when KTSTR_STALL_POLL_MS_ENV is unset / empty / 0 / unparseable. 500 ms × W=4 yields a 2 s detection latency — short enough to catch a stuck scheduler within a typical ktstr test duration, long enough that procfs reads stay O(workers) per second rather than swamping the host.
STALL_WINDOW: Sliding-window size: number of consecutive flat samples that flip the stall predicate. W=4 at the default DEFAULT_POLL_INTERVAL_MS (500 ms) gives a 2 s detection latency. Constant rather than env-tunable: the operator already controls latency via the poll interval, and a smaller W would false-positive on transient idle.

Functions§

parse_sched_file: Parse /proc/<pid>/sched content into a SchedSample.
stall_predicate: Stall predicate: returns true when samples.len() >= STALL_WINDOW AND every consecutive pair has both nr_switches delta == 0 AND sum_exec_runtime_ns delta == 0. A window shorter than STALL_WINDOW never fires (insufficient signal).