
Worker Processes

Workers are the processes that generate load for scenarios. They run inside the VM, each in its own cgroup.

Fork, not threads

Workers are fork()ed processes. Cgroups operate on PIDs, so each worker must be a separate process to be independently placed in a cgroup.

Two-phase start

Workers wait on a pipe for a “start” signal after fork:

  1. Parent forks the worker.
  2. Worker installs SIGUSR1 handler, then blocks on pipe read.
  3. Parent moves the worker to its target cgroup.
  4. Parent writes to the pipe, signaling the worker to start.

This ensures workers run inside their target cgroup from the first instruction of their workload.

Custom work types

WorkType::Custom workers follow the same two-phase start (fork, cgroup placement, start signal), and the framework applies affinity and scheduling policy before handing control to the user function. After setup, the run function pointer takes over entirely – the framework work loop is bypassed.

Stop protocol

Workers install a SIGUSR1 handler that sets an atomic STOP flag. The main work loop checks this flag each iteration. On stop:

  1. Parent sends SIGUSR1 to all workers.
  2. Workers exit their work loop.
  3. Workers serialize their WorkerReport to a pipe.
  4. Parent reads reports and waits for child exit.

Telemetry

Each worker produces a WorkerReport:

use std::collections::{BTreeMap, BTreeSet};

pub struct WorkerReport {
    pub tid: i32,
    pub work_units: u64,
    pub cpu_time_ns: u64,
    pub wall_time_ns: u64,
    pub off_cpu_ns: u64,
    pub migration_count: u64,
    pub cpus_used: BTreeSet<usize>,
    pub migrations: Vec<Migration>,
    pub max_gap_ms: u64,
    pub max_gap_cpu: usize,
    pub max_gap_at_ms: u64,
    pub resume_latencies_ns: Vec<u64>,
    pub wake_sample_total: u64,
    pub iteration_costs_ns: Vec<u64>,
    pub iteration_cost_sample_total: u64,
    pub iterations: u64,
    pub schedstat_run_delay_ns: u64,
    pub schedstat_run_count: u64,
    pub schedstat_cpu_time_ns: u64,
    pub completed: bool,
    pub numa_pages: BTreeMap<usize, u64>,
    pub vmstat_numa_pages_migrated: u64,
    pub exit_info: Option<WorkerExitInfo>,
    pub is_messenger: bool,
    pub group_idx: usize,
    pub affinity_error: Option<String>,
}

pub enum WorkerExitInfo {
    Exited(i32),
    Signaled(i32),
    TimedOut,
    WaitFailed(String),
    /// Thread-mode worker panicked. Exclusive to `CloneMode::Thread`;
    /// fork workers surface panics via `Exited(1)` or
    /// `Signaled(SIGABRT)` depending on the panic strategy.
    Panicked(String),
}

iteration_costs_ns mirrors resume_latencies_ns for per-iteration wall-clock cost: a reservoir-sampled vector capped at MAX_WAKE_SAMPLES entries, paired with iteration_cost_sample_total for the total observation count when the cap is exceeded. group_idx is 0 for the primary group and 1..=N for composed WorkSpec entries in declaration order (mirrors WorkloadConfig::composed). affinity_error is Some(reason) when the worker’s sched_setaffinity / mbind setup failed; the worker still runs and produces a report but the field documents the divergence from the requested affinity contract.
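
The capped reservoir can be sketched as follows (a minimal Algorithm R sketch; the struct name, seed, and cap parameter are illustrative, not the framework's actual sampler):

```rust
// Keeps at most `cap` samples while `total` counts every observation,
// mirroring the resume_latencies_ns / wake_sample_total pairing.
struct Reservoir {
    samples: Vec<u64>,
    total: u64,
    cap: usize,
    rng: u64, // xorshift64 state
}

impl Reservoir {
    fn new(cap: usize) -> Self {
        Reservoir { samples: Vec::new(), total: 0, cap, rng: 0x9e3779b97f4a7c15 }
    }
    fn next_rand(&mut self) -> u64 {
        self.rng ^= self.rng << 13;
        self.rng ^= self.rng >> 7;
        self.rng ^= self.rng << 17;
        self.rng
    }
    fn record(&mut self, v: u64) {
        self.total += 1;
        if self.samples.len() < self.cap {
            self.samples.push(v);
        } else {
            // Algorithm R: keep the new sample with probability cap/total.
            let j = (self.next_rand() % self.total) as usize;
            if j < self.cap {
                self.samples[j] = v;
            }
        }
    }
}

fn main() {
    let mut r = Reservoir::new(100);
    for v in 0..1_000 {
        r.record(v);
    }
    assert_eq!(r.samples.len(), 100); // vector stays at the cap
    assert_eq!(r.total, 1_000);       // counter keeps climbing
}
```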

Several fields are worth calling out explicitly:

  • wake_sample_total — the TOTAL number of wake-latency observations the worker saw, including samples the reservoir sampler dropped. resume_latencies_ns is clamped to at most 100_000 entries (MAX_WAKE_SAMPLES); on a long run that accumulates more wakes than the cap, the vector stays at the cap while this counter keeps climbing. Host-side consumers reporting “total wakeups observed” read wake_sample_total; percentile / CV computations read resume_latencies_ns.

  • completed — true when the worker reached its natural end (outer loop observed STOP and exited cleanly, or a custom-closure payload returned from its run). Sentinel reports synthesised by stop_and_collect’s JSON-parse fallback carry false. This lets consumers distinguish “ran to completion, saw zero iterations” from “died / timed out before recording anything.”

  • is_messenger — true only for the messenger worker in a FutexFanOut / FanOutCompute group (the single writer that advances the shared generation and issues futex_wake). Enables per-worker latency-participation assertions — receivers produce resume_latencies_ns entries, messengers record wake-side work but no resume latency.

  • off_cpu_ns — derived as wall_time_ns - cpu_time_ns

  • exit_info is None on every live-worker-authored report. stop_and_collect synthesises a sentinel WorkerReport with Some(_) when the worker handed back no (or unparseable) JSON, using the WorkerExitInfo enum (Exited(code) / Signaled(signum) / TimedOut / WaitFailed(String) — the string carries the underlying waitpid errno rendering) to preserve the reap shape for post-mortem.

  • Migrations are tracked every 1024 work units: after each outer iteration the worker checks work_units.is_multiple_of(1024) and runs the migration-detect body iff that is true. The check runs exactly once per outer iteration, so the effective period in outer iterations is 1024 / gcd(units_per_iter, 1024). Default parameters assumed unless noted:

    • Every outer iteration (period = 1 iter): SpinWait (1024), Mixed (1024), Bursty (each outer iter runs spin_burst(1024) some number of times inside the burst_ms loop — always a multiple of 1024), PipeIo (burst_iters=1024), FutexPingPong (spin_iters=1024), CachePressure (1024 strided RMW steps), CacheYield (1024 strided RMW steps), CachePipe (burst_iters=1024), FutexFanOut messenger AND receiver (both call spin_burst(spin_iters) before splitting roles; default 1024), AffinityChurn (spin_iters=1024), PolicyChurn (spin_iters=1024).
    • Every 2 iterations: NiceSweep (spin_burst(512) per iter → gcd(512, 1024) = 512).
    • Every 4 iterations: MutexContention (work_iters=1024 + hold_iters=256 = 1280 per acquire+release → gcd(1280, 1024) = 256, period = 4 iters). FanOutCompute messenger (spin_burst(256) per wake cycle → same 256-unit gcd).
    • Every 16 iterations: PageFaultChurn — one persistent MAP_PRIVATE | MAP_ANONYMOUS region per worker (default 4 MiB via region_kb=4096), re-faulted each outer iteration via madvise(MADV_DONTNEED). Each iteration contributes touches_per_cycle=256 page writes (each first write after MADV_DONTNEED triggers a minor fault; a birthday-collision xorshift64 index may revisit a page already faulted this cycle, so the fault count is a ceiling, not a floor) + spin_iters=64 = 320 work units (gcd(320, 1024) = 64).
    • Every 64 iterations: IoSyncWrite (16 4-KiB writes per write-then-sleep pair → gcd(16, 1024) = 16); IoRandRead and IoConvoy use the same 64-iteration cadence for their per-iteration pread/pwrite mixes.
    • Every 1024 iterations: YieldHeavy (1 unit per yield), ForkExit (1 unit per fork+wait), FanOutCompute worker (operations=5 matrix multiplies per wake, one work_units tick per multiply → gcd(5, 1024) = 1).
    • Phase-inherited: Sequence inherits whichever phase is currently active — Spin / Yield / Io use the same per-unit accounting as the SpinWait / YieldHeavy / IoSyncWrite groups above; Sleep contributes no work_units and so pauses migration checks while it runs.
    • Not tracked by the framework: Custom workers do not contribute to work_units on the framework’s behalf — migration tracking fires only if the user’s run function increments work_units and emits migrations directly.
  • Scheduling gaps (max_gap_ms, max_gap_cpu, max_gap_at_ms) record the longest wall-clock interval between consecutive 1024-work-unit migration-check points plus the CPU the gap was observed on and its time from start. High values indicate preemption or descheduling near a checkpoint boundary. The checkpoint cadence — and therefore the gap-measurement cadence — is governed by the same work_units.is_multiple_of(1024) test that the migration tracker uses, so the effective measurement period in outer iterations matches the per-WorkType tables above.
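
The effective-period arithmetic above can be checked directly (a small sketch; the helper names are illustrative):

```rust
// period_in_iters = 1024 / gcd(units_per_iter, 1024), per the
// migration-check cadence described above.
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

fn check_period_iters(units_per_iter: u64) -> u64 {
    1024 / gcd(units_per_iter, 1024)
}

fn main() {
    assert_eq!(check_period_iters(1024), 1);  // SpinWait: every iteration
    assert_eq!(check_period_iters(512), 2);   // NiceSweep
    assert_eq!(check_period_iters(1280), 4);  // MutexContention
    assert_eq!(check_period_iters(320), 16);  // PageFaultChurn
    assert_eq!(check_period_iters(16), 64);   // IoSyncWrite
    assert_eq!(check_period_iters(1), 1024);  // YieldHeavy / ForkExit
    println!("periods check out");
}
```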

Benchmarking fields

Workers collect two categories of timing data:

Per-wakeup latency (resume_latencies_ns): timestamp-based samples recorded around blocking operations. Populated for work types with a blocking step: Bursty (sleep), PipeIo (pipe read), FutexPingPong (futex wait), FutexFanOut (futex wait, receivers only), FanOutCompute (futex wait, workers only — measured as CLOCK_MONOTONIC delta from messenger’s shared timestamp), CacheYield (yield), CachePipe (pipe read), IoSyncWrite / IoRandRead / IoConvoy (pread / pwrite / fdatasync blocking), NiceSweep (yield), AffinityChurn (yield), PolicyChurn (yield), MutexContention (futex wait on contended acquire), ForkExit (parent’s waitpid wait), and Sequence when its phases include Sleep, Yield, or Io. Each sample is in nanoseconds; most work types use Instant::elapsed() across the blocking call, while FanOutCompute uses clock_gettime(CLOCK_MONOTONIC) to measure against the messenger’s pre-wake timestamp.

schedstat deltas: read from /proc/self/schedstat at work-loop start and end. Three fields:

  • schedstat_cpu_time_ns – delta of field 1 (on-CPU time)
  • schedstat_run_delay_ns – delta of field 2 (time spent waiting for a CPU)
  • schedstat_run_count – delta of field 3 (pcount — scheduler-in count: incremented each time the scheduler picks this task to execute, across CFS/EEVDF, FIFO/RR, and sched_ext alike). Not a context-switch count — a task that keeps running on the same CPU without leaving the runqueue does not see pcount advance while it runs. For true context-switch counts read /proc/<pid>/status’s voluntary_ctxt_switches and nonvoluntary_ctxt_switches; the worker reads pcount instead because schedstat delivers it alongside run_delay / cpu_time in a single file read.
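
Parsing the three fields and taking deltas is a short split (a sketch; the helper name is hypothetical):

```rust
// /proc/self/schedstat is three space-separated integers:
// on-CPU ns, run-delay ns, and pcount.
fn parse_schedstat(s: &str) -> Option<(u64, u64, u64)> {
    let mut it = s.split_whitespace();
    Some((
        it.next()?.parse().ok()?, // cpu_time_ns
        it.next()?.parse().ok()?, // run_delay_ns
        it.next()?.parse().ok()?, // pcount
    ))
}

fn main() {
    // Live values would come from
    // std::fs::read_to_string("/proc/self/schedstat"); sample strings here.
    let before = parse_schedstat("123456789 4567 42").unwrap();
    let after = parse_schedstat("223456789 9567 99").unwrap();
    let cpu_delta = after.0 - before.0;
    let delay_delta = after.1 - before.1;
    let pcount_delta = after.2 - before.2;
    assert_eq!((cpu_delta, delay_delta, pcount_delta), (100_000_000, 5_000, 57));
    println!("schedstat deltas computed");
}
```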

iterations counts outer-loop iterations.

NUMA fields

numa_pages: per-NUMA-node page counts parsed from /proc/self/numa_maps after the workload completes. Keyed by node ID. Empty when numa_maps is unavailable.

vmstat_numa_pages_migrated: delta of the numa_pages_migrated counter from /proc/vmstat between pre- and post-workload snapshots. Measures cross-node page migrations during the test.

These fields feed the NUMA checking thresholds.
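
The per-node extraction can be sketched as follows (a sketch with a hypothetical helper name; real numa_maps lines carry more tokens than shown):

```rust
use std::collections::BTreeMap;

// Each /proc/self/numa_maps mapping line carries N<node>=<pages>
// tokens; summing them per node yields the numa_pages map.
fn parse_numa_maps(text: &str) -> BTreeMap<usize, u64> {
    let mut pages: BTreeMap<usize, u64> = BTreeMap::new();
    for line in text.lines() {
        for tok in line.split_whitespace() {
            if let Some(rest) = tok.strip_prefix('N') {
                if let Some((node, count)) = rest.split_once('=') {
                    if let (Ok(n), Ok(c)) = (node.parse::<usize>(), count.parse::<u64>()) {
                        *pages.entry(n).or_insert(0) += c;
                    }
                }
            }
        }
    }
    pages
}

fn main() {
    let sample = "7f0000000000 default anon=300 dirty=300 N0=200 N1=100 kernelpagesize_kB=4\n";
    let pages = parse_numa_maps(sample);
    assert_eq!(pages.get(&0), Some(&200));
    assert_eq!(pages.get(&1), Some(&100));
    println!("parsed {} nodes", pages.len());
}
```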

Custom workers produce their own WorkerReport. The framework does not populate any telemetry fields for Custom – migration tracking, gap detection, schedstat deltas, NUMA page counts, and iteration counters are only present if the user’s run function fills them.

Worker-progress watchdog

Workers send SIGUSR2 to the scheduler when they have made no progress for more than 2 seconds. The default POSIX disposition terminates the scheduler process, which ktstr detects as a scheduler death and captures the sched_ext dump from dmesg.

In repro mode, the watchdog is disabled to keep the scheduler alive for BPF probe assertions. The watchdog does not fire for Custom workers because they bypass the framework work loop.

RAII cleanup

WorkloadHandle implements Drop: it sends SIGKILL to all child processes and waits for them. This prevents orphaned worker processes on error paths.