Worker Processes
Workers are the processes that generate load for scenarios. They run inside the VM, each in its own cgroup.
Fork, not threads
Workers are fork()ed processes. Cgroups operate on PIDs, so each
worker must be a separate process to be independently placed in a
cgroup.
Two-phase start
Workers wait on a pipe for a “start” signal after fork:
- Parent forks the worker.
- Worker installs SIGUSR1 handler, then blocks on pipe read.
- Parent moves the worker to its target cgroup.
- Parent writes to the pipe, signaling the worker to start.
This ensures workers run inside their target cgroup from the first instruction of their workload.
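The handshake can be sketched in miniature. This is a logic-only illustration: a thread stands in for the fork()ed child and an mpsc channel for the start pipe, since std has no fork; the function name and log strings are invented for the example.

```rust
use std::sync::mpsc;
use std::thread;

// Two-phase start gate: the "worker" parks until the "parent" has finished
// placement, so no workload instruction runs in the wrong cgroup.
pub fn two_phase_start() -> Vec<String> {
    let (start_tx, start_rx) = mpsc::channel::<()>();     // stands in for the pipe
    let (log_tx, log_rx) = mpsc::channel::<String>();

    let worker_log = log_tx.clone();
    let worker = thread::spawn(move || {
        // Phase 1: alive but parked, like the child blocked on read(pipe).
        worker_log.send("worker: waiting for start".to_string()).unwrap();
        start_rx.recv().unwrap();
        // Phase 2: first workload instruction runs only after placement.
        worker_log.send("worker: running workload".to_string()).unwrap();
    });

    // Parent side: placement happens while the worker is parked.
    log_tx.send("parent: moved worker to cgroup".to_string()).unwrap();
    start_tx.send(()).unwrap();                           // write(pipe) -> start
    worker.join().unwrap();
    drop(log_tx);
    log_rx.iter().collect()
}

fn main() {
    for line in two_phase_start() {
        println!("{line}");
    }
}
```

The ordering guarantee mirrors the real protocol: "running workload" can never be logged before "moved worker to cgroup", because the start message is only sent after placement.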
Custom work types
WorkType::Custom workers follow the same two-phase start (fork,
cgroup placement, start signal), and the framework applies affinity
and scheduling policy before handing control to the user function.
After setup, the run function pointer takes over entirely –
the framework work loop is bypassed.
Stop protocol
Workers install a SIGUSR1 handler that sets an atomic STOP flag. The
main work loop checks this flag each iteration. On stop:
- Parent sends SIGUSR1 to all workers.
- Workers exit their work loop.
- Workers serialize their WorkerReport to a pipe.
- Parent reads reports and waits for child exit.
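The flag-check loop can be sketched as follows. This is a stand-in: a second thread flips the atomic where the real worker's SIGUSR1 handler would, since installing signal handlers is outside std; names are illustrative.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// The work loop polls the STOP flag once per iteration and exits promptly
// after it is set. Returns the iteration count (a work_units analogue).
pub fn work_loop(stop: &AtomicBool) -> u64 {
    let mut iterations = 0u64;
    while !stop.load(Ordering::Relaxed) { // checked each iteration
        iterations += 1;                  // stand-in for one unit of work
        thread::yield_now();
    }
    iterations
}

fn main() {
    let stop = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&stop);
    // Stand-in for the parent's SIGUSR1: set the flag after a short delay.
    let signaler = thread::spawn(move || {
        thread::sleep(Duration::from_millis(10));
        flag.store(true, Ordering::Relaxed);
    });
    let iters = work_loop(&stop);
    signaler.join().unwrap();
    assert!(iters > 0);
    println!("ran {iters} iterations before stop");
}
```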
Telemetry
Each worker produces a WorkerReport:
pub struct WorkerReport {
pub tid: i32,
pub work_units: u64,
pub cpu_time_ns: u64,
pub wall_time_ns: u64,
pub off_cpu_ns: u64,
pub migration_count: u64,
pub cpus_used: BTreeSet<usize>,
pub migrations: Vec<Migration>,
pub max_gap_ms: u64,
pub max_gap_cpu: usize,
pub max_gap_at_ms: u64,
pub resume_latencies_ns: Vec<u64>,
pub wake_sample_total: u64,
pub iteration_costs_ns: Vec<u64>,
pub iteration_cost_sample_total: u64,
pub iterations: u64,
pub schedstat_run_delay_ns: u64,
pub schedstat_run_count: u64,
pub schedstat_cpu_time_ns: u64,
pub completed: bool,
pub numa_pages: BTreeMap<usize, u64>,
pub vmstat_numa_pages_migrated: u64,
pub exit_info: Option<WorkerExitInfo>,
pub is_messenger: bool,
pub group_idx: usize,
pub affinity_error: Option<String>,
}
pub enum WorkerExitInfo {
Exited(i32),
Signaled(i32),
TimedOut,
WaitFailed(String),
/// Thread-mode worker panicked. Exclusive to `CloneMode::Thread`;
/// fork workers surface panics via `Exited(1)` or
/// `Signaled(SIGABRT)` depending on the panic strategy.
Panicked(String),
}
iteration_costs_ns mirrors resume_latencies_ns for per-iteration
wall-clock cost: a reservoir-sampled vector capped at
MAX_WAKE_SAMPLES entries, paired with iteration_cost_sample_total
for the total observation count when the cap is exceeded.
group_idx is 0 for the primary group and 1..=N for composed
WorkSpec entries in declaration order (mirrors
WorkloadConfig::composed). affinity_error is Some(reason)
when the worker’s sched_setaffinity / mbind setup failed; the
worker still runs and produces a report but the field documents
the divergence from the requested affinity contract.
Several fields are worth calling out explicitly:
- wake_sample_total — the TOTAL number of wake-latency observations the worker saw, including samples the reservoir sampler dropped. resume_latencies_ns is clamped to at most 100_000 entries (MAX_WAKE_SAMPLES); on a long run that accumulates more wakes than the cap, the vector stays at the cap while this counter keeps climbing. Host-side consumers reporting “total wakeups observed” read wake_sample_total; percentile / CV computations read resume_latencies_ns.
- completed — true when the worker reached its natural end (outer loop observed STOP and exited cleanly, or a custom-closure payload returned from its run). Sentinel reports synthesised by stop_and_collect’s JSON-parse fallback carry false. Lets consumers distinguish “ran to completion, saw zero iterations” from “died / timed out before recording anything.”
- is_messenger — true only for the messenger worker in a FutexFanOut / FanOutCompute group (the single writer that advances the shared generation and issues futex_wake). Enables per-worker latency-participation assertions — receivers produce resume_latencies_ns entries; messengers record wake-side work but no resume latency.
- off_cpu_ns — derived as wall_time_ns - cpu_time_ns.
- exit_info — None on every live-worker-authored report. stop_and_collect synthesises a sentinel WorkerReport with Some(_) when the worker handed back no (or unparseable) JSON, using the WorkerExitInfo enum (Exited(code) / Signaled(signum) / TimedOut / WaitFailed(String) — the string carries the underlying waitpid errno rendering) to preserve the reap shape for post-mortem.
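The capped-vector-plus-total-counter pattern behind resume_latencies_ns / wake_sample_total can be sketched as a classic reservoir sampler. This is a minimal illustration, not the framework's implementation; the xorshift64 RNG and the WakeSampler name are assumptions.

```rust
// Reservoir sampler sketch: keep at most MAX_WAKE_SAMPLES latency samples
// while an unbounded counter records every observation.
const MAX_WAKE_SAMPLES: usize = 100_000;

pub struct WakeSampler {
    pub samples: Vec<u64>, // resume_latencies_ns analogue, len <= cap
    pub total: u64,        // wake_sample_total analogue, unbounded
    rng: u64,              // xorshift64 state (illustrative RNG choice)
}

impl WakeSampler {
    pub fn new(seed: u64) -> Self {
        Self { samples: Vec::new(), total: 0, rng: seed.max(1) }
    }

    fn next_rand(&mut self) -> u64 {
        // xorshift64: cheap PRNG, good enough for sampling decisions
        self.rng ^= self.rng << 13;
        self.rng ^= self.rng >> 7;
        self.rng ^= self.rng << 17;
        self.rng
    }

    pub fn record(&mut self, latency_ns: u64) {
        self.total += 1;
        if self.samples.len() < MAX_WAKE_SAMPLES {
            self.samples.push(latency_ns); // below cap: always keep
        } else {
            // At cap: overwrite a uniformly chosen slot with probability
            // cap/total (Algorithm R style), so the vector stays a uniform
            // sample while `total` keeps climbing.
            let slot = (self.next_rand() % self.total) as usize;
            if slot < MAX_WAKE_SAMPLES {
                self.samples[slot] = latency_ns;
            }
        }
    }
}

fn main() {
    let mut s = WakeSampler::new(1);
    for i in 0..10u64 {
        s.record(i * 100);
    }
    println!("kept {} of {} observations", s.samples.len(), s.total);
}
```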
- Migrations are tracked every 1024 work units: after each outer iteration the worker checks work_units.is_multiple_of(1024) and runs the migration-detect body iff that is true. The check runs exactly once per outer iteration, so the effective period in outer iterations is 1024 / gcd(units_per_iter, 1024). Default parameters assumed unless noted:
  - Every outer iteration (period = 1 iter): SpinWait (1024), Mixed (1024), Bursty (each outer iter runs spin_burst(1024) some number of times inside the burst_ms loop — always a multiple of 1024), PipeIo (burst_iters=1024), FutexPingPong (spin_iters=1024), CachePressure (1024 strided RMW steps), CacheYield (1024 strided RMW steps), CachePipe (burst_iters=1024), FutexFanOut messenger AND receiver (both call spin_burst(spin_iters) before splitting roles; default 1024), AffinityChurn (spin_iters=1024), PolicyChurn (spin_iters=1024).
  - Every 2 iterations: NiceSweep (spin_burst(512) per iter → gcd(512, 1024) = 512).
  - Every 4 iterations: MutexContention (work_iters=1024 + hold_iters=256 = 1280 per acquire+release → gcd(1280, 1024) = 256, period = 4 iters); FanOutCompute messenger (spin_burst(256) per wake cycle → same 256-unit gcd).
  - Every 16 iterations: PageFaultChurn — one persistent MAP_PRIVATE | MAP_ANONYMOUS region per worker (default 4 MiB via region_kb=4096), re-faulted each outer iteration via madvise(MADV_DONTNEED). Each iteration contributes touches_per_cycle=256 page writes (each first write after MADV_DONTNEED triggers a minor fault; a birthday-collision xorshift64 index may revisit a page already faulted this cycle, so the fault count is a ceiling, not a floor) + spin_iters=64 = 320 work units (gcd(320, 1024) = 64).
  - Every 64 iterations: IoSyncWrite (16 4-KiB writes per write-then-sleep pair → gcd(16, 1024) = 16); IoRandRead and IoConvoy use the same 64-iteration cadence for their per-iteration pread/pwrite mixes.
  - Every 1024 iterations: YieldHeavy (1 unit per yield), ForkExit (1 unit per fork+wait), FanOutCompute worker (operations=5 matrix multiplies per wake, one work_units tick per multiply → gcd(5, 1024) = 1).
  - Phase-inherited: Sequence inherits whichever phase is currently active — Spin / Yield / Io use the same per-unit accounting as the SpinWait / YieldHeavy / IoSyncWrite groups above; Sleep contributes no work_units and so pauses migration checks while it runs.
  - Not tracked by the framework: Custom workers do not contribute to work_units on the framework’s behalf — migration tracking fires only if the user’s run function increments work_units and emits migrations directly.
- Scheduling gaps (max_gap_ms, max_gap_cpu, max_gap_at_ms) record the longest wall-clock interval between consecutive 1024-work-unit migration-check points, plus the CPU the gap was observed on and its offset from start. High values indicate preemption or descheduling near a checkpoint boundary. The checkpoint cadence — and therefore the gap-measurement cadence — is governed by the same work_units.is_multiple_of(1024) test that the migration tracker uses, so the effective measurement period in outer iterations matches the per-WorkType tables above.
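The effective-period arithmetic above can be checked directly. A small sketch (function names are illustrative) that reproduces the per-WorkType table from the 1024 / gcd(units_per_iter, 1024) rule:

```rust
// Migration checks fire when work_units is a multiple of 1024, and
// work_units advances by units_per_iter per outer iteration, so the
// effective check period is 1024 / gcd(units_per_iter, 1024).
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

pub fn effective_period(units_per_iter: u64) -> u64 {
    1024 / gcd(units_per_iter, 1024)
}

fn main() {
    // Per-iteration unit counts from the table above.
    for (name, units) in [
        ("SpinWait", 1024u64),     // every iteration
        ("NiceSweep", 512),        // every 2
        ("MutexContention", 1280), // every 4
        ("PageFaultChurn", 320),   // every 16
        ("IoSyncWrite", 16),       // every 64
        ("YieldHeavy", 1),         // every 1024
    ] {
        println!("{name}: every {} outer iterations", effective_period(units));
    }
}
```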
Benchmarking fields
Workers collect two categories of timing data:
Per-wakeup latency (resume_latencies_ns): timestamp-based samples
recorded around blocking operations. Populated for work types with a
blocking step: Bursty (sleep), PipeIo (pipe read), FutexPingPong
(futex wait), FutexFanOut (futex wait, receivers only), FanOutCompute
(futex wait, workers only — measured as CLOCK_MONOTONIC delta from
messenger’s shared timestamp), CacheYield (yield), CachePipe (pipe
read), IoSyncWrite / IoRandRead / IoConvoy (pread / pwrite / fdatasync
blocking), NiceSweep (yield), AffinityChurn (yield),
PolicyChurn (yield), MutexContention (futex wait on contended
acquire), ForkExit (parent’s waitpid wait), and Sequence when its
phases include Sleep, Yield, or Io.
Each sample is in nanoseconds; most work types use
Instant::elapsed() across the blocking call, while FanOutCompute
uses clock_gettime(CLOCK_MONOTONIC) to measure against the
messenger’s pre-wake timestamp.
schedstat deltas: read from /proc/self/schedstat at work-loop
start and end. Three fields:
- schedstat_cpu_time_ns – delta of field 1 (on-CPU time)
- schedstat_run_delay_ns – delta of field 2 (time spent waiting for a CPU)
- schedstat_run_count – delta of field 3 (pcount — scheduler-in count: incremented each time the scheduler picks this task to execute, across CFS/EEVDF, FIFO/RR, and sched_ext alike). Not a context-switch count — a task that keeps running on the same CPU without leaving the runqueue does not see pcount advance while it runs. For true context-switch counts read /proc/<pid>/status’s voluntary_ctxt_switches and nonvoluntary_ctxt_switches; the worker reads pcount instead because schedstat delivers it alongside run_delay / cpu_time in a single file read.
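A minimal sketch of the snapshot-and-delta read, assuming the documented one-line /proc/self/schedstat format (three space-separated u64s); the parse_schedstat helper is illustrative:

```rust
// /proc/self/schedstat is one line: "<on-cpu ns> <run-delay ns> <pcount>".
pub fn parse_schedstat(line: &str) -> Option<(u64, u64, u64)> {
    let mut it = line.split_whitespace().map(|f| f.parse::<u64>().ok());
    Some((it.next()??, it.next()??, it.next()??))
}

fn main() {
    // Snapshot at work-loop start and end; report the three deltas.
    let read = || std::fs::read_to_string("/proc/self/schedstat").ok();
    if let (Some(start), Some(end)) = (read(), read()) {
        let (cpu0, delay0, count0) = parse_schedstat(&start).unwrap();
        let (cpu1, delay1, count1) = parse_schedstat(&end).unwrap();
        println!("schedstat_cpu_time_ns  = {}", cpu1 - cpu0);
        println!("schedstat_run_delay_ns = {}", delay1 - delay0);
        println!("schedstat_run_count    = {}", count1 - count0);
    } else {
        println!("schedstat unavailable (non-Linux or CONFIG_SCHEDSTATS off)");
    }
}
```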
iterations counts outer-loop iterations.
NUMA fields
numa_pages: per-NUMA-node page counts parsed from
/proc/self/numa_maps after the workload completes. Keyed by node ID.
Empty when numa_maps is unavailable.
vmstat_numa_pages_migrated: delta of the numa_pages_migrated
counter from /proc/vmstat between pre- and post-workload snapshots.
Measures cross-node page migrations during the test.
These fields feed the NUMA checking thresholds.
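The per-node accumulation can be sketched by scanning each numa_maps line for the kernel's N<node>=<pages> tokens; the helper name is illustrative:

```rust
use std::collections::BTreeMap;

// Sum per-node page counts across all /proc/self/numa_maps lines.
// Each mapping line carries zero or more "N<node>=<pages>" tokens.
pub fn accumulate_numa_pages(numa_maps: &str) -> BTreeMap<usize, u64> {
    let mut pages: BTreeMap<usize, u64> = BTreeMap::new();
    for token in numa_maps.split_whitespace() {
        if let Some(rest) = token.strip_prefix('N') {
            if let Some((node, count)) = rest.split_once('=') {
                // Parses guard against non-node tokens that start with 'N'.
                if let (Ok(node), Ok(count)) = (node.parse::<usize>(), count.parse::<u64>()) {
                    *pages.entry(node).or_insert(0) += count;
                }
            }
        }
    }
    pages
}

fn main() {
    match std::fs::read_to_string("/proc/self/numa_maps") {
        Ok(text) => {
            for (node, count) in accumulate_numa_pages(&text) {
                println!("node {node}: {count} pages");
            }
        }
        Err(_) => println!("numa_maps unavailable"), // e.g. !CONFIG_NUMA
    }
}
```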
Custom workers produce their own WorkerReport. The framework does
not populate any telemetry fields for Custom – migration tracking,
gap detection, schedstat deltas, NUMA page counts, and iteration
counters are only present if the user’s run function fills them.
Worker-progress watchdog
Workers send SIGUSR2 to the scheduler when they have been stuck for more than 2 seconds. The default POSIX disposition of SIGUSR2 terminates the scheduler process, which ktstr detects as a scheduler death and captures the sched_ext dump from dmesg.
In repro mode, the watchdog is disabled to keep the scheduler alive for BPF probe assertions. The watchdog does not fire for Custom workers because they bypass the framework work loop.
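The detection half of the watchdog reduces to a simple predicate over two progress snapshots; this sketch omits the SIGUSR2-raising side, and the function name and threshold constant are illustrative:

```rust
use std::time::Duration;

// A worker counts as stuck when its progress counter has not advanced
// across a window longer than the 2-second threshold.
const STUCK_THRESHOLD: Duration = Duration::from_secs(2);

pub fn is_stuck(prev_units: u64, cur_units: u64, window: Duration) -> bool {
    cur_units == prev_units && window > STUCK_THRESHOLD
}

fn main() {
    // No progress over a 3-second window: the watchdog would fire.
    assert!(is_stuck(7, 7, Duration::from_secs(3)));
    println!("stuck predicate ok");
}
```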
RAII cleanup
WorkloadHandle implements Drop: it sends SIGKILL to all child
processes and waits for them. This prevents orphaned worker processes
on error paths.
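The Drop shape can be sketched with std::process (on Unix, Child::kill delivers SIGKILL). This is a minimal stand-in: the real handle also owns cgroup paths and report pipes, and the struct name here mirrors the text rather than quoting the framework.

```rust
use std::process::{Child, Command};

// RAII handle: dropping it kills and reaps every child, so error paths
// cannot leak worker processes.
pub struct WorkloadHandle {
    pub children: Vec<Child>,
}

impl Drop for WorkloadHandle {
    fn drop(&mut self) {
        for child in &mut self.children {
            let _ = child.kill(); // SIGKILL; ignore "already exited" errors
            let _ = child.wait(); // reap to avoid zombies
        }
    }
}

fn main() {
    let handle = WorkloadHandle {
        children: vec![Command::new("sleep").arg("30").spawn().expect("spawn")],
    };
    drop(handle); // children are killed and reaped here, not after 30 s
    println!("children reaped");
}
```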