WorkType

WorkType controls what each worker process does during a scenario.

The WorkType enum in ktstr::workload is the source of truth. The variants below are grouped by intent; each one-line summary is the leading sentence of the variant’s rustdoc. Run cargo doc --open for full per-variant semantics, parameter ranges, and kernel-path citations — this page reproduces only the high-level shape.

pub enum WorkType {
    // CPU primitives
    SpinWait,                            // Tight CPU spin loop (1024 iterations per cycle).
    YieldHeavy,                          // Repeated sched_yield with minimal CPU work.
    Mixed,                               // CPU spin burst followed by sched_yield.
    AluHot { width: AluWidth },          // Dependent integer multiply chain at high IPC (>= 2.0); optional SIMD width.
    SmtSiblingSpin,                      // Tight PAUSE-spin from a paired worker pinned to two SMT siblings.
    IpcVariance {                        // Alternating high-IPC (multiplies) / low-IPC (random cache touches) phases.
        hot_iters: u64,
        cold_iters: u64,
        period_iters: u64,
    },

    // Block-device I/O (operates on /dev/vda; falls back to per-worker tempfile when absent)
    IoSyncWrite,                         // 16 x 4 KB pwrites + fdatasync per iteration (O_SYNC).
    IoRandRead,                          // Single 4 KB pread at a random sector-aligned offset (O_DIRECT).
    IoConvoy,                            // Interleaved sequential pwrite + random pread with periodic fdatasync (O_DIRECT).

    // Burst-and-sleep
    Bursty {                             // CPU burst for `burst_duration`, sleep for `sleep_duration`, repeat.
        burst_duration: Duration,
        sleep_duration: Duration,
    },
    IdleChurn {                          // CPU burst then `nanosleep` (exercises hrtimer + idle-class path).
        burst_duration: Duration,
        sleep_duration: Duration,
        precise_timing: bool,
    },

    // Cache pressure
    CachePressure { size_kb: usize, stride: usize },    // Strided RMW sized to pressure L1.
    CacheYield { size_kb: usize, stride: usize },       // Cache pressure burst then sched_yield().

    // Wake-placement / cross-CPU paths
    PipeIo { burst_iters: u64 },                        // CPU burst then 1-byte pipe exchange with a partner worker.
    FutexPingPong { spin_iters: u64 },                  // Paired futex wait/wake between partner workers (non-WF_SYNC).
    CachePipe { size_kb: usize, burst_iters: u64 },     // Cache-hot working set + pipe wake.
    FutexFanOut { fan_out: usize, spin_iters: u64 },    // 1:N fan-out wake (one messenger, N receivers).
    FanOutCompute {                                     // Messenger/worker fan-out with matrix-multiply compute per receiver.
        fan_out: usize,
        cache_footprint_kb: usize,
        operations: usize,
        sleep_usec: u64,
    },
    AsymmetricWaker {                                   // Paired workers in mismatched scheduling classes share one futex word.
        waker_class: SchedClass,
        wakee_class: SchedClass,
        burst_iters: u64,
    },
    WakeChain {                                         // Ring of waker-wakee hops via Pipe (WF_SYNC) or Futex wake.
        depth: usize,
        wake: WakeMechanism,
        work_per_hop: Duration,
    },
    EpollStorm {                                        // eventfd producers + epoll_wait consumers (exclusive autoremove wake).
        producers: usize,
        consumers: usize,
        events_per_burst: u64,
    },
    ThunderingHerd {                                    // N waiters on ONE global futex word; broadcast FUTEX_WAKE rouses the herd.
        waiters: usize,
        batches: u64,
        inter_batch_ms: u64,
    },

    // Compound / sequence
    Sequence { first: Phase, rest: Vec<Phase> },        // Loop through ordered phases (Spin / Sleep / Yield / Io).

    // Lifecycle / scheduling-class churn
    ForkExit,                                           // Rapid fork+_exit cycling; parent waitpid's then repeats.
    NiceSweep,                                          // Cycle nice level from -20 to 19 across iterations.
    AffinityChurn { spin_iters: u64 },                  // Rapid self-directed sched_setaffinity to random CPUs.
    PolicyChurn { spin_iters: u64 },                    // Cycle SCHED_OTHER -> BATCH -> IDLE (-> FIFO/RR if CAP_SYS_NICE).
    NumaMigrationChurn { period_ms: u64 },              // Rotate sched_setaffinity across NUMA nodes.
    CgroupChurn { groups: usize, cycle_ms: u64 },       // Cycle cgroup membership between sibling cgroups.

    // Memory pressure / NUMA
    PageFaultChurn {                                    // mmap NOHUGEPAGE -> touch random pages -> MADV_DONTNEED, repeat.
        region_kb: usize,
        touches_per_cycle: usize,
        spin_iters: u64,
    },
    NumaWorkingSetSweep {                               // Rotate the working-set memory across NUMA nodes via mbind.
        region_kb: usize,
        sweep_period_ms: u64,
        target_nodes: Vec<usize>,
    },

    // Lock contention
    MutexContention {                                   // N-way futex mutex contention (CAS acquire / FUTEX_WAIT on failure).
        contenders: usize,
        hold_iters: u64,
        work_iters: u64,
    },
    PriorityInversion {                                 // Three priority tiers contending for one shared lock (Pi or Plain futex).
        high_count: usize,
        medium_count: usize,
        low_count: usize,
        hold_iters: u64,
        work_iters: u64,
        pi_mode: FutexLockMode,
    },

    // Producer/consumer + signal/preempt pressure
    ProducerConsumerImbalance {                         // Producer / consumer pipeline with deliberately-unbalanced rates.
        producers: usize,
        consumers: usize,
        produce_rate_hz: u64,
        consume_iters: u64,
        queue_depth_target: u64,
    },
    SignalStorm {                                       // Paired workers fire tkill(partner, SIGUSR1) between CPU bursts.
        signals_per_iter: u64,
        work_iters: u64,
    },
    PreemptStorm {                                      // One SCHED_FIFO worker preempts CFS spinners on the same CPU at ~kHz rate.
        cfs_workers: usize,
        rt_burst_iters: u64,
        rt_sleep_us: u64,
    },
    RtStarvation {                                      // SCHED_FIFO workers monopolise the CPU at 100%; CFS workers starve.
        rt_workers: usize,
        cfs_workers: usize,
        rt_priority: i32,
        burst_iters: u64,
    },

    // User-supplied
    Custom {                                            // User-supplied work function (name + fn pointer).
        name: String,
        run: fn(&AtomicBool) -> WorkerReport,
    },
}

Imports: WorkType, Phase, SchedPolicy, WorkSpec, and WorkloadConfig are in ktstr::prelude::*. The auxiliary enums FutexLockMode (used by PriorityInversion::pi_mode), WakeMechanism (used by WakeChain::wake), and SchedClass (used by AsymmetricWaker) live under ktstr::workload. Bring them into scope with use ktstr::workload::*; (or import each by name) before writing variant literals that reference them.

Parameterised variants have snake-case convenience constructors — e.g. WorkType::bursty(burst_duration, sleep_duration), WorkType::pipe_io(burst_iters), WorkType::cache_pressure(size_kb, stride), WorkType::page_fault_churn(region_kb, touches_per_cycle, spin_iters), WorkType::mutex_contention(contenders, hold_iters, work_iters), WorkType::priority_inversion(high_count, medium_count, low_count, hold_iters, work_iters, pi_mode), WorkType::wake_chain(depth, wake, work_per_hop), WorkType::custom(name, run). Every parameterised variant has one; see cargo doc --open on WorkType for the full constructor list and parameter validation rules.

Bursty, IdleChurn, and WakeChain take Duration parameters (humantime-serialised in captured configs) — pass Duration::from_millis(N) or Duration::from_micros(N) from std::time rather than raw integers. IpcVariance, ProducerConsumerImbalance, RtStarvation, PriorityInversion, EpollStorm, PreemptStorm, and ThunderingHerd reject zero-valued counters at spawn time (WorkTypeValidationError::*).
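For instance, a 50 ms burst / 100 ms sleep cycle is expressed with std::time::Duration values directly; the duty-cycle arithmetic below is purely illustrative, not part of the crate:

```rust
use std::time::Duration;

fn main() {
    // Bursty-style parameters: pass Durations, not raw integers.
    let burst_duration = Duration::from_millis(50);
    let sleep_duration = Duration::from_millis(100);

    // Illustrative duty-cycle arithmetic: fraction of wall time spent burning CPU.
    let period = burst_duration + sleep_duration;
    let duty_pct = 100 * burst_duration.as_millis() / period.as_millis();
    assert_eq!(duty_pct, 33); // 50 ms of every 150 ms period
}
```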

Choosing a work type

Scheduler behavior to test                  Recommended work type
Basic load balancing / fairness             SpinWait (default)
Wake placement / sleep-wake cycles          YieldHeavy, FutexPingPong
CPU borrowing / idle balance                Bursty
Cross-CPU wake latency                      PipeIo, CachePipe
Cache-aware scheduling                      CachePressure, CacheYield
Cache-aware fan-out wake latency            FanOutCompute
Fan-out wake storms                         FutexFanOut
Mixed real-world patterns                   Sequence
Task creation/destruction pressure          ForkExit
Priority reweighting / nice dynamics        NiceSweep
Rapid CPU migration / affinity churn        AffinityChurn
Scheduling class transitions                PolicyChurn
Page fault / TLB pressure                   PageFaultChurn
Lock contention / convoy effect             MutexContention
Arbitrary user-defined workload             Custom

Variants

SpinWait – tight spin loop with spin_loop() hints. 1024 iterations per check. Pure CPU-bound workload.

YieldHeavy – thread::yield_now() on every iteration. Exercises scheduler wake/sleep paths.

Mixed – 1024 spin iterations then yield. Combines CPU and voluntary preemption.

IoSyncWrite – 16 × 4 KB pwrites totaling 64 KB at the worker’s stripe offset (per-worker striping prevents fdatasync from coalescing across writers), then fdatasync(). Drives fsync-heavy D-state cycles. Opens /dev/vda with O_SYNC; falls back to a per-worker tempfile when /dev/vda is absent (host-side unit tests).

IoRandRead – single 4 KB pread at a sector-aligned random offset within the device capacity. Opens /dev/vda with O_DIRECT (tempfile fallback); drives high-IOPS short-D-state cycles. Per-worker xorshift PRNG seeded from tid.

IoConvoy – alternates 4 KB pwrite at the worker’s monotonic sequential cursor with 4 KB pread at a random offset; fdatasync() runs every 16 iterations. /dev/vda opened O_DIRECT (tempfile fallback). Currently uses direct IO so the pathology surface is the synchronous flush + IO-mix latency rather than page-cache convoy build-up.

Bursty – CPU burst for burst_duration, sleep for sleep_duration, repeat. Both fields are Duration (humantime-serialised); pass Duration::from_millis(N) from std::time. Frees CPUs during sleep, exercising CPU borrowing.

PipeIo – CPU burst then 1-byte pipe exchange with a partner worker. Workers are paired: (0,1), (2,3), etc. Sleep duration depends on partner scheduling, exercising cross-CPU wake placement. Requires even num_workers.
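With that (0,1), (2,3), … pairing, each worker's partner index is the worker index with the low bit flipped. The XOR formulation below is an illustration of the pairing rule, not necessarily the crate's internal code:

```rust
fn partner(worker_idx: usize) -> usize {
    // (0,1), (2,3), ...: flipping the low bit maps each worker to its pair mate.
    worker_idx ^ 1
}

fn main() {
    assert_eq!(partner(0), 1);
    assert_eq!(partner(1), 0);
    assert_eq!(partner(6), 7);
    // An even num_workers guarantees every worker has a mate in range.
    let num_workers = 8;
    assert!((0..num_workers).all(|i| partner(i) < num_workers));
}
```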

FutexPingPong – paired futex wait/wake between partner workers. Each iteration does spin_iters of CPU work then wakes the partner and waits on a shared futex word. Exercises the non-WF_SYNC wake path. Requires even num_workers.

CachePressure – strided read-modify-write over a buffer sized to pressure the L1 cache. Each worker allocates its own buffer post-fork. size_kb controls buffer size, stride controls the byte step between accesses.

CacheYield – cache pressure followed by sched_yield(). Tests scheduler re-placement after voluntary yield with a cache-hot working set.

CachePipe – cache pressure burst then 1-byte pipe exchange with a partner worker. Combines cache-hot working set with cross-CPU wake placement. Requires even num_workers.

FutexFanOut – 1:N fan-out wake pattern without cache pressure. One messenger per group does spin_iters of CPU spin work then wakes fan_out receivers via FUTEX_WAKE. Receivers measure wake-to-run latency. For cache-aware fan-out with matrix multiply work, see FanOutCompute. Requires num_workers divisible by fan_out + 1.

FanOutCompute – messenger/worker fan-out with compute work. One messenger per group stamps a CLOCK_MONOTONIC timestamp then wakes fan_out workers via FUTEX_WAKE. Workers measure wake-to-run latency (time from messenger’s timestamp to worker getting the CPU), sleep for sleep_usec microseconds (simulating think time), then do operations iterations of naive matrix multiply over a cache_footprint_kb-sized working set (three square matrices of u64, O(n^3)). Requires num_workers divisible by fan_out + 1.
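The working-set sizing follows from "three square matrices of u64": 3·n²·8 bytes must fit in cache_footprint_kb. The sketch below shows that arithmetic; the exact rounding the crate applies is an assumption:

```rust
fn matrix_dim_for_footprint(cache_footprint_kb: usize) -> usize {
    // Three n x n matrices of u64: 3 * n^2 * 8 bytes <= cache_footprint_kb * 1024.
    let budget_elems = cache_footprint_kb * 1024 / (3 * 8);
    (budget_elems as f64).sqrt() as usize
}

fn main() {
    // Default cache_footprint_kb = 256 gives roughly a 104 x 104 matrix per operand.
    let n = matrix_dim_for_footprint(256);
    assert_eq!(n, 104);
    // The three matrices together stay within the requested footprint.
    assert!(3 * n * n * 8 <= 256 * 1024);
}
```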

Sequence – compound work pattern: loop through phases in order, repeat. Each phase runs for its specified duration before the next starts. Phases are defined via the Phase enum:

  • Phase::Spin(Duration) – CPU spin for the given duration.
  • Phase::Sleep(Duration) – thread::sleep for the given duration.
  • Phase::Yield(Duration) – repeated sched_yield for the given duration.
  • Phase::Io(Duration) – simulated I/O (write 64 KB + 100 us sleep) for the given duration.

Sequence cannot be constructed via WorkType::from_name() because it requires explicit phase definitions. Build it directly:

WorkType::Sequence {
    first: Phase::Spin(Duration::from_millis(100)),
    rest: vec![
        Phase::Sleep(Duration::from_millis(50)),
        Phase::Yield(Duration::from_millis(20)),
    ],
}

ForkExit – rapid fork+_exit cycling. Each iteration forks a child that immediately calls _exit(0). The parent waitpids then repeats. Exercises wake_up_new_task, do_exit, and wait_task_zombie.

NiceSweep – cycles the worker’s nice level from -20 to 19 across iterations. Each iteration: 512-iteration spin burst, setpriority(PRIO_PROCESS, 0, nice_val), then sched_yield. Exercises reweight_task and dynamic priority reweighting. Skips negative nice values when CAP_SYS_NICE is absent. Resets nice to 0 before exit. Records per-yield wake latency.

AffinityChurn – rapid self-directed CPU affinity changes. Each iteration: spin_iters spin burst, sched_setaffinity to a random CPU from the effective cpuset, then sched_yield. Exercises affine_move_task and migration_cpu_stop. Records per-yield wake latency.

PolicyChurn – cycles through scheduling policies each iteration. Each iteration: spin_iters spin burst, sched_setscheduler to the next policy in the sequence, then sched_yield. Cycles through SCHED_OTHER, SCHED_BATCH, SCHED_IDLE (and SCHED_FIFO/SCHED_RR with priority 1 when CAP_SYS_NICE is available). Exercises __sched_setscheduler and scheduling class transitions. Resets to SCHED_OTHER before exit. Records per-yield wake latency.

PageFaultChurn – rapid page fault cycling. Workers mmap a region_kb KB region with MADV_NOHUGEPAGE (forcing 4 KB pages), touch touches_per_cycle random pages via write faults through do_anonymous_page, then MADV_DONTNEED to zap PTEs and repeat. spin_iters iterations of CPU work separate cycles. Exercises the page allocator, TLB pressure on migration, and rapid user/kernel transitions. Uses xorshift64 PRNG for random page selection (seeded from the process ID).
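An xorshift64 step is three shift-XOR operations over a nonzero 64-bit state. The 13/7/17 shift triple below is the textbook constant set — an assumption about the crate's exact choice — and the pid seed is hypothetical:

```rust
fn xorshift64(state: &mut u64) -> u64 {
    // Marsaglia's xorshift64: period 2^64 - 1 for any nonzero seed.
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

fn main() {
    // Seeded from a (hypothetical) process ID; any nonzero seed works.
    let mut state = 12345u64;
    let region_pages = (4096 * 1024) / 4096; // region_kb = 4096 -> 1024 pages of 4 KB
    let page = (xorshift64(&mut state) % region_pages as u64) as usize;
    assert!(page < region_pages);
    // A nonzero state never reaches zero.
    assert_ne!(xorshift64(&mut state), 0);
}
```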

MutexContention – N-way futex mutex contention. contenders workers per group contend on a shared AtomicU32 via CAS acquire (FUTEX_WAIT on failure). Loop: spin_burst(work_iters) then CAS acquire, spin_burst(hold_iters) in the critical section, then store 0 + FUTEX_WAKE(1) to release. Exercises convoy effect, lock-holder preemption cascading stalls, and futex wait/wake contention paths. Requires num_workers divisible by contenders.

Custom – user-supplied work function. The run function pointer receives a reference to the stop flag (&AtomicBool, set by SIGUSR1) and returns a WorkerReport when the flag becomes true. The framework handles fork, cgroup placement, affinity, scheduling policy, and signal setup; the user function owns the work loop and all WorkerReport field population. Framework telemetry (migration tracking, gap detection, schedstat deltas, iteration counter updates) is not provided – the user function is responsible for any telemetry it needs.

Warning — pgid SIGKILL sweep on teardown. Every worker process calls setpgid(0, 0) immediately after fork, so the worker and any children a Custom closure spawns share a single process group. At teardown, stop_and_collect issues killpg(worker_pid, SIGKILL) on BOTH the graceful-exit and StillAlive-escalation paths, and WorkloadHandle::drop issues another killpg during handle destruction. Every descendant that inherits the worker’s pgid (a helper binary via execv, a subshell via sh -c, a test fixture the closure forks to drive the scheduler) will be SIGKILLed at teardown. Closures that need a child to outlive the worker must either detach it from the worker’s pgid (call setpgid(child_pid, 0) after fork) or wait on it explicitly before returning the WorkerReport.

Function pointers (fn(&AtomicBool) -> WorkerReport) are fork-safe because they carry no captured state across the fork boundary. Closures are not supported. Cannot be constructed via WorkType::from_name().

use std::sync::atomic::{AtomicBool, Ordering};
use ktstr::workload::{WorkType, WorkerReport};

fn my_workload(stop: &AtomicBool) -> WorkerReport {
    // `tid` in `WorkerReport` is an `i32` (libc::pid_t). Using
    // `std::process::id() as i32` avoids a direct `libc` dependency in
    // the consumer crate; inside ktstr the two produce the same value
    // because one worker = one process (no threads).
    let tid: i32 = std::process::id() as i32;
    let start = std::time::Instant::now();
    let mut work_units = 0u64;
    while !stop.load(Ordering::Relaxed) {
        // ... custom work ...
        work_units += 1;
    }
    let wall_time_ns = start.elapsed().as_nanos() as u64;
    // Start from `WorkerReport::default()` so the fields you don't
    // populate take their zero / empty values automatically and new
    // fields added to `WorkerReport` in the future do not require an
    // edit here. Only populate the telemetry your custom workload
    // actually produces.
    WorkerReport {
        tid,
        work_units,
        wall_time_ns,
        iterations: work_units,
        ..WorkerReport::default()
    }
}

let wt = WorkType::custom("my_workload", my_workload);

Grouped work types

PipeIo, FutexPingPong, and CachePipe require num_workers divisible by 2 (paired). FutexFanOut and FanOutCompute require num_workers divisible by fan_out + 1 (1 messenger + N receivers per group). MutexContention requires num_workers divisible by contenders. WorkType::worker_group_size() returns the group size for these variants, or None for ungrouped types. PipeIo and CachePipe use pipes; FutexPingPong, FutexFanOut, FanOutCompute, and MutexContention use shared mmap pages with futex wait/wake.
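The divisibility gate those rules imply can be sketched as follows — mirroring what worker_group_size() reports, not reproducing the crate's code:

```rust
/// Sketch of the grouped-work-type divisibility gate described above.
/// `group_size` stands in for `WorkType::worker_group_size()`: Some(2) for
/// paired types, Some(fan_out + 1) for fan-out types, Some(contenders) for
/// MutexContention, None for ungrouped types.
fn spawn_gate(num_workers: usize, group_size: Option<usize>) -> Result<(), String> {
    match group_size {
        Some(g) if num_workers % g != 0 => Err(format!(
            "num_workers={num_workers} not divisible by group size {g}"
        )),
        _ => Ok(()),
    }
}

fn main() {
    assert!(spawn_gate(8, Some(2)).is_ok());  // PipeIo: pairs
    assert!(spawn_gate(10, Some(5)).is_ok()); // FutexFanOut fan_out=4: groups of 5
    assert!(spawn_gate(9, Some(2)).is_err()); // odd workers cannot pair
    assert!(spawn_gate(7, None).is_ok());     // ungrouped types accept any count
}
```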

Clone-mode and pcomm interactions

CloneMode is a per-WorkloadConfig enum with two variants — Fork (the default; each worker is its own thread group, reaped via waitpid) and Thread (workers share the parent’s tgid, run as std::thread::spawn threads, reaped via JoinHandle).

pcomm is not a CloneMode variant — it is a WorkSpec field set via WorkSpec::pcomm(name) / CgroupDef::pcomm(name) in the tutorial. When a WorkSpec carries pcomm = Some(name), apply_setup routes it through the fork-then-thread spawn path: ONE forked thread-group leader whose task->comm is name hosts every matching worker as a pthread-style thread under that leader. Workers sharing a pcomm value coalesce into one container; this combines the per-process-leader visibility schedulers expect (a chrome parent, a java parent) with the in-process std::thread::spawn dispatch shape CloneMode::Thread already uses for the worker bodies themselves.

PipeIo and CachePipe work correctly inside a pcomm container. When workers run as threads inside one forked leader, the per-pair pipe-fd indices computed in the global pipe_pairs table are addressed by each worker’s position WITHIN the container’s thread group, so worker A reads its partner’s write end whether the pair lives in two forked processes (Fork mode) or in two threads of one pcomm container.

SignalStorm uses tkill(partner_tid, SIGUSR1) (per-task signal delivery, PIDTYPE_PID), NOT kill (per-tgid, PIDTYPE_TGID) and NOT tgkill(self_tgid, partner_tid, …) (would return ESRCH under Fork mode because each forked worker is its own tgid leader). tkill looks up the target via find_task_by_vpid(pid) and skips the tgid check, so the signal hits the partner thread’s per-task pending queue under Fork and Thread modes uniformly — including inside pcomm-coalesced thread groups. Sibling threads in a pcomm container do NOT dequeue each other’s SignalStorm signals because the PIDTYPE_PID queue is per-task, not per-tgid.

Default values

WorkType::from_name() uses these defaults:

  • Bursty: burst_duration=50ms, sleep_duration=100ms
  • PipeIo: burst_iters=1024
  • FutexPingPong: spin_iters=1024
  • CachePressure: size_kb=32, stride=64
  • CacheYield: size_kb=32, stride=64
  • CachePipe: size_kb=32, burst_iters=1024
  • FutexFanOut: fan_out=4, spin_iters=1024
  • FanOutCompute: fan_out=4, cache_footprint_kb=256, operations=5, sleep_usec=100
  • AffinityChurn: spin_iters=1024
  • PolicyChurn: spin_iters=1024
  • PageFaultChurn: region_kb=4096, touches_per_cycle=256, spin_iters=64
  • MutexContention: contenders=4, hold_iters=256, work_iters=1024

String lookup

WorkType::from_name() accepts PascalCase names matching the enum variants (e.g. "SpinWait", "FutexPingPong"). Sequence and Custom return None because they require explicit construction parameters. WorkType::ALL_NAMES lists every variant name. WorkType::name() returns the PascalCase name for a given value; for Custom, it returns the user-provided name field.

WorkloadConfig

WorkloadConfig is the low-level struct passed to WorkloadHandle::spawn(). CgroupDef builds one internally; use WorkloadConfig directly when calling setup_cgroups() or WorkloadHandle::spawn() in custom scenarios.

pub struct WorkloadConfig {
    pub num_workers: usize,           // Number of worker processes to fork
    pub affinity: AffinityIntent,     // Per-worker affinity intent (resolved at spawn time)
    pub work_type: WorkType,          // What each worker does
    pub sched_policy: SchedPolicy,    // Linux scheduling policy
    pub mem_policy: MemPolicy,        // NUMA memory placement policy
    pub mpol_flags: MpolFlags,        // Optional mode flags for set_mempolicy(2)
    pub nice: Option<i32>,            // Per-worker nice via setpriority(2); None inherits
    pub clone_mode: CloneMode,        // Fork (default) or Thread dispatch
    pub comm: Option<Cow<'static, str>>, // task->comm via prctl(PR_SET_NAME); kernel truncates to 15 bytes
    pub uid: Option<u32>,             // Effective UID via setresuid; None inherits
    pub gid: Option<u32>,             // Effective GID via setresgid; None inherits
    pub numa_node: Option<u32>,       // Restrict affinity to one NUMA node's CPU set
    pub composed: Vec<WorkSpec>,      // Secondary worker groups spawned alongside the primary
}

Default: 1 worker, AffinityIntent::Inherit, SpinWait, Normal policy, Default mem_policy, no mpol_flags, nice/comm/uid/gid/numa_node = None, clone_mode = Fork, composed = empty.
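Because those defaults come from the struct's Default implementation, custom configs are usually written with struct-update syntax. A sketch, assuming the prelude import described above:

```rust
use ktstr::prelude::*;

// Override only the fields that differ from the defaults listed above;
// `..Default::default()` fills in the rest (1 worker, SpinWait, Fork, ...).
let config = WorkloadConfig {
    num_workers: 4,
    work_type: WorkType::YieldHeavy,
    sched_policy: SchedPolicy::Batch,
    ..Default::default()
};
```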

AffinityIntent is the type-unified affinity expression used at the top level and inside WorkSpec entries — Inherit, Exact(...), and RandomSubset(...) are accepted at WorkloadHandle::spawn; topology-aware variants (SingleCpu, LlcAligned, CrossCgroup, SmtSiblingPair) require scenario context and are rejected at the spawn gate with an actionable diagnostic. composed carries secondary WorkSpec groups that spawn alongside the primary; each composed entry can override work_type, num_workers, sched_policy, affinity, etc., and reports back via WorkerReport::group_idx (0 for the primary, 1..=N for composed entries in declaration order).

See MemPolicy for the NUMA memory placement API.

Scheduling policies

Workers can run under different Linux scheduling policies:

pub enum SchedPolicy {
    Normal,
    Batch,
    Idle,
    Fifo(u32),       // priority 1-99
    RoundRobin(u32), // priority 1-99
    Deadline {
        runtime: Duration,   // budget per period
        deadline: Duration,  // relative deadline from period start
        period: Duration,    // period; Duration::ZERO uses `deadline`
    },
}

Fifo, RoundRobin, and Deadline require CAP_SYS_NICE. The sched-deadline gate (runtime <= deadline <= period, all non-zero unless period == Duration::ZERO, which the kernel substitutes with deadline) is validated user-side in SchedPolicy::deadline() before sched_setattr so a malformed Deadline fails fast rather than tunneling EINVAL through the syscall.
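A self-contained sketch of that gate, mirroring the documented rule rather than the crate's actual validator:

```rust
use std::time::Duration;

/// Validate the sched-deadline triple as described above:
/// runtime <= deadline <= period, all non-zero — except that a zero
/// period stands for "use `deadline` as the period".
fn validate_deadline(runtime: Duration, deadline: Duration, period: Duration) -> bool {
    let effective_period = if period == Duration::ZERO { deadline } else { period };
    !runtime.is_zero()
        && !deadline.is_zero()
        && runtime <= deadline
        && deadline <= effective_period
}

fn main() {
    let ms = Duration::from_millis;
    assert!(validate_deadline(ms(2), ms(5), ms(10)));  // well-formed reservation
    assert!(validate_deadline(ms(2), ms(5), ms(0)));   // zero period: period = deadline
    assert!(!validate_deadline(ms(6), ms(5), ms(10))); // runtime > deadline
    assert!(!validate_deadline(ms(0), ms(5), ms(10))); // zero runtime rejected
}
```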

Overriding work types

The work type override (configured via gauntlet or Ctx.work_type_override) replaces the default SpinWait work type for all scenarios that use it. Scenarios with non-SpinWait work types are not overridden.

Overrides to grouped work types (PipeIo, FutexPingPong, CachePipe, FutexFanOut, FanOutCompute, MutexContention) are skipped when num_workers is not divisible by the work type’s group size.

Ops-based scenarios have a separate override mechanism via CgroupDef.swappable. See Ops and Steps.