WorkType
WorkType controls what each worker process does during a scenario.
The WorkType enum in ktstr::workload is the source of truth.
The variants below are grouped by intent; each one-line summary is
the leading sentence of the variant’s rustdoc. Run cargo doc --open
for full per-variant semantics, parameter ranges, and kernel-path
citations — this page reproduces only the high-level shape.
pub enum WorkType {
// CPU primitives
SpinWait, // Tight CPU spin loop (1024 iterations per cycle).
YieldHeavy, // Repeated sched_yield with minimal CPU work.
Mixed, // CPU spin burst followed by sched_yield.
AluHot { width: AluWidth }, // Dependent integer multiply chain at high IPC (>= 2.0); optional SIMD width.
SmtSiblingSpin, // Tight PAUSE-spin from a paired worker pinned to two SMT siblings.
IpcVariance { // Alternating high-IPC (multiplies) / low-IPC (random cache touches) phases.
hot_iters: u64,
cold_iters: u64,
period_iters: u64,
},
// Block-device I/O (operates on /dev/vda; falls back to per-worker tempfile when absent)
IoSyncWrite, // 16 x 4 KB pwrites + fdatasync per iteration (O_SYNC).
IoRandRead, // Single 4 KB pread at a random sector-aligned offset (O_DIRECT).
IoConvoy, // Interleaved sequential pwrite + random pread with periodic fdatasync (O_DIRECT).
// Burst-and-sleep
Bursty { // CPU burst for `burst_duration`, sleep for `sleep_duration`, repeat.
burst_duration: Duration,
sleep_duration: Duration,
},
IdleChurn { // CPU burst then `nanosleep` (exercises hrtimer + idle-class path).
burst_duration: Duration,
sleep_duration: Duration,
precise_timing: bool,
},
// Cache pressure
CachePressure { size_kb: usize, stride: usize }, // Strided RMW sized to pressure L1.
CacheYield { size_kb: usize, stride: usize }, // Cache pressure burst then sched_yield().
// Wake-placement / cross-CPU paths
PipeIo { burst_iters: u64 }, // CPU burst then 1-byte pipe exchange with a partner worker.
FutexPingPong { spin_iters: u64 }, // Paired futex wait/wake between partner workers (non-WF_SYNC).
CachePipe { size_kb: usize, burst_iters: u64 }, // Cache-hot working set + pipe wake.
FutexFanOut { fan_out: usize, spin_iters: u64 }, // 1:N fan-out wake (one messenger, N receivers).
FanOutCompute { // Messenger/worker fan-out with matrix-multiply compute per receiver.
fan_out: usize,
cache_footprint_kb: usize,
operations: usize,
sleep_usec: u64,
},
AsymmetricWaker { // Paired workers in mismatched scheduling classes share one futex word.
waker_class: SchedClass,
wakee_class: SchedClass,
burst_iters: u64,
},
WakeChain { // Ring of waker-wakee hops via Pipe (WF_SYNC) or Futex wake.
depth: usize,
wake: WakeMechanism,
work_per_hop: Duration,
},
EpollStorm { // eventfd producers + epoll_wait consumers (exclusive autoremove wake).
producers: usize,
consumers: usize,
events_per_burst: u64,
},
ThunderingHerd { // N waiters on ONE global futex word; broadcast FUTEX_WAKE rouses the herd.
waiters: usize,
batches: u64,
inter_batch_ms: u64,
},
// Compound / sequence
Sequence { first: Phase, rest: Vec<Phase> }, // Loop through ordered phases (Spin / Sleep / Yield / Io).
// Lifecycle / scheduling-class churn
ForkExit, // Rapid fork+_exit cycling; parent waitpid's then repeats.
NiceSweep, // Cycle nice level from -20 to 19 across iterations.
AffinityChurn { spin_iters: u64 }, // Rapid self-directed sched_setaffinity to random CPUs.
PolicyChurn { spin_iters: u64 }, // Cycle SCHED_OTHER -> BATCH -> IDLE (-> FIFO/RR if CAP_SYS_NICE).
NumaMigrationChurn { period_ms: u64 }, // Rotate sched_setaffinity across NUMA nodes.
CgroupChurn { groups: usize, cycle_ms: u64 }, // Cycle cgroup membership between sibling cgroups.
// Memory pressure / NUMA
PageFaultChurn { // mmap NOHUGEPAGE -> touch random pages -> MADV_DONTNEED, repeat.
region_kb: usize,
touches_per_cycle: usize,
spin_iters: u64,
},
NumaWorkingSetSweep { // Rotate the working-set memory across NUMA nodes via mbind.
region_kb: usize,
sweep_period_ms: u64,
target_nodes: Vec<usize>,
},
// Lock contention
MutexContention { // N-way futex mutex contention (CAS acquire / FUTEX_WAIT on failure).
contenders: usize,
hold_iters: u64,
work_iters: u64,
},
PriorityInversion { // Three priority tiers contending for one shared lock (Pi or Plain futex).
high_count: usize,
medium_count: usize,
low_count: usize,
hold_iters: u64,
work_iters: u64,
pi_mode: FutexLockMode,
},
// Producer/consumer + signal/preempt pressure
ProducerConsumerImbalance { // Producer / consumer pipeline with deliberately-unbalanced rates.
producers: usize,
consumers: usize,
produce_rate_hz: u64,
consume_iters: u64,
queue_depth_target: u64,
},
SignalStorm { // Paired workers fire tkill(partner, SIGUSR1) between CPU bursts.
signals_per_iter: u64,
work_iters: u64,
},
PreemptStorm { // One SCHED_FIFO worker preempts CFS spinners on the same CPU at ~kHz rate.
cfs_workers: usize,
rt_burst_iters: u64,
rt_sleep_us: u64,
},
RtStarvation { // SCHED_FIFO workers monopolise the CPU at 100%; CFS workers starve.
rt_workers: usize,
cfs_workers: usize,
rt_priority: i32,
burst_iters: u64,
},
// User-supplied
Custom { // User-supplied work function (name + fn pointer).
name: String,
run: fn(&AtomicBool) -> WorkerReport,
},
}
Imports:
WorkType, Phase, SchedPolicy, WorkSpec, and WorkloadConfig are in
ktstr::prelude::*. The auxiliary enums FutexLockMode (used by
PriorityInversion::pi_mode), WakeMechanism (used by WakeChain::wake),
and SchedClass (used by AsymmetricWaker) live under ktstr::workload.
Bring them into scope with use ktstr::workload::*; (or import each by
name) before writing variant literals that reference them.
Parameterised variants have snake-case convenience constructors —
e.g. WorkType::bursty(burst_duration, sleep_duration),
WorkType::pipe_io(burst_iters),
WorkType::cache_pressure(size_kb, stride),
WorkType::page_fault_churn(region_kb, touches_per_cycle, spin_iters),
WorkType::mutex_contention(contenders, hold_iters, work_iters),
WorkType::priority_inversion(high_count, medium_count, low_count, hold_iters, work_iters, pi_mode),
WorkType::wake_chain(depth, wake, work_per_hop),
WorkType::custom(name, run). Every parameterised variant has one;
see cargo doc --open on WorkType for the full constructor list
and parameter validation rules.
Bursty, IdleChurn, and WakeChain take Duration parameters
(humantime-serialised in captured configs) — pass
Duration::from_millis(N) or
Duration::from_micros(N) from std::time rather than raw integers.
IpcVariance, ProducerConsumerImbalance, RtStarvation,
PriorityInversion, EpollStorm, PreemptStorm, and
ThunderingHerd reject zero-valued counters at spawn time
(WorkTypeValidationError::*).
Choosing a work type
| Scheduler behavior to test | Recommended work type |
|---|---|
| Basic load balancing / fairness | SpinWait (default) |
| Wake placement / sleep-wake cycles | YieldHeavy, FutexPingPong |
| CPU borrowing / idle balance | Bursty |
| Cross-CPU wake latency | PipeIo, CachePipe |
| Cache-aware scheduling | CachePressure, CacheYield |
| Cache-aware fan-out wake latency | FanOutCompute |
| Fan-out wake storms | FutexFanOut |
| Mixed real-world patterns | Sequence |
| Task creation/destruction pressure | ForkExit |
| Priority reweighting / nice dynamics | NiceSweep |
| Rapid CPU migration / affinity churn | AffinityChurn |
| Scheduling class transitions | PolicyChurn |
| Page fault / TLB pressure | PageFaultChurn |
| Lock contention / convoy effect | MutexContention |
| Arbitrary user-defined workload | Custom |
Variants
SpinWait – tight spin loop with spin_loop() hints. 1024
iterations per check. Pure CPU-bound workload.
YieldHeavy – thread::yield_now() on every iteration. Exercises
scheduler wake/sleep paths.
Mixed – 1024 spin iterations then yield. Combines CPU and
voluntary preemption.
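The three primitives above can be sketched with only std. This is a hedged illustration, not ktstr's worker code; the 1024-iteration batch is the per-cycle spin count quoted in the docs, and the stop-flag loop mirrors the shape every worker runs.

```rust
use std::hint::spin_loop;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// Per-cycle spin count the docs quote for SpinWait and Mixed.
const SPIN_BATCH: u32 = 1024;

fn mixed_cycle() {
    for _ in 0..SPIN_BATCH {
        spin_loop(); // SpinWait body: CPU-relax hint, task stays runnable
    }
    thread::yield_now(); // YieldHeavy body: voluntary preemption
}

// Run Mixed cycles until `stop` is raised or `max` cycles complete.
fn run_cycles(stop: &AtomicBool, max: u64) -> u64 {
    let mut iters = 0;
    while !stop.load(Ordering::Relaxed) && iters < max {
        mixed_cycle();
        iters += 1;
    }
    iters
}

fn main() {
    let stop = AtomicBool::new(false);
    println!("{}", run_cycles(&stop, 10)); // prints 10
}
```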
IoSyncWrite – 16 × 4 KB pwrites totaling 64 KB at the worker’s
stripe offset (per-worker striping prevents fdatasync from coalescing
across writers), then fdatasync(). Drives fsync-heavy D-state cycles.
Opens /dev/vda with O_SYNC; falls back to a per-worker tempfile
when /dev/vda is absent (host-side unit tests).
IoRandRead – single 4 KB pread at a sector-aligned random
offset within the device capacity. Opens /dev/vda with O_DIRECT
(tempfile fallback); drives high-IOPS short-D-state cycles. Per-worker
xorshift PRNG seeded from tid.
IoConvoy – alternates 4 KB pwrite at the worker’s monotonic
sequential cursor with 4 KB pread at a random offset; fdatasync()
runs every 16 iterations. /dev/vda opened O_DIRECT (tempfile
fallback). Currently uses direct IO so the pathology surface is the
synchronous flush + IO-mix latency rather than page-cache convoy
build-up.
Bursty – CPU burst for burst_duration, sleep for
sleep_duration, repeat. Both fields are Duration (humantime-
serialised); pass Duration::from_millis(N) from std::time. Frees
CPUs during sleep, exercising CPU borrowing.
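The burst/sleep duty cycle can be sketched as follows. This is a minimal std-only illustration of the shape, assuming nothing about ktstr's internals (the real worker also records telemetry per cycle):

```rust
use std::thread;
use std::time::{Duration, Instant};

// Hedged sketch of the Bursty duty cycle: hold the CPU for `burst`,
// then sleep for `sleep`, repeated `cycles` times. Returns the number
// of completed cycles.
fn bursty(burst: Duration, sleep: Duration, cycles: u32) -> u32 {
    let mut done = 0;
    for _ in 0..cycles {
        let start = Instant::now();
        while start.elapsed() < burst {
            std::hint::spin_loop(); // CPU burst: stay runnable for `burst`
        }
        thread::sleep(sleep); // free the CPU so idle balance can act
        done += 1;
    }
    done
}

fn main() {
    // Doc defaults are 50 ms burst / 100 ms sleep; scaled down here.
    let n = bursty(Duration::from_millis(2), Duration::from_millis(4), 3);
    println!("{n}");
}
```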
PipeIo – CPU burst then 1-byte pipe exchange with a partner
worker. Workers are paired: (0,1), (2,3), etc. Sleep duration depends
on partner scheduling, exercising cross-CPU wake placement. Requires
even num_workers.
FutexPingPong – paired futex wait/wake between partner workers.
Each iteration does spin_iters of CPU work then wakes the partner
and waits on a shared futex word. Exercises the non-WF_SYNC wake path.
Requires even num_workers.
CachePressure – strided read-modify-write over a buffer sized
to pressure the L1 cache. Each worker allocates its own buffer
post-fork. size_kb controls buffer size, stride controls the byte
step between accesses.
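A strided read-modify-write pass over such a buffer can be sketched like this. The exact access pattern inside ktstr is not reproduced here; this only illustrates how size_kb and stride interact (one touch per stride step, RMW so each access both loads and stores):

```rust
// Hedged sketch of one CachePressure pass: strided RMW over `buf`,
// stepping `stride` bytes per access. Returns the touch count.
fn cache_pressure_pass(buf: &mut [u8], stride: usize) -> u64 {
    let mut touches = 0u64;
    let mut i = 0;
    while i < buf.len() {
        buf[i] = buf[i].wrapping_add(1); // RMW: load + store per touch
        touches += 1;
        i += stride;
    }
    touches
}

fn main() {
    let size_kb = 32; // doc default
    let stride = 64;  // doc default: one touch per 64-byte cache line
    let mut buf = vec![0u8; size_kb * 1024];
    let touches = cache_pressure_pass(&mut buf, stride);
    println!("{touches}"); // 32 KiB / 64 B = 512 touches per pass
}
```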
CacheYield – cache pressure followed by sched_yield(). Tests
scheduler re-placement after voluntary yield with a cache-hot working set.
CachePipe – cache pressure burst then 1-byte pipe exchange with
a partner worker. Combines cache-hot working set with cross-CPU wake
placement. Requires even num_workers.
FutexFanOut – 1:N fan-out wake pattern without cache pressure.
One messenger per group does spin_iters of CPU spin work then wakes
fan_out receivers via FUTEX_WAKE. Receivers measure wake-to-run
latency. For cache-aware fan-out with matrix multiply work, see
FanOutCompute. Requires num_workers divisible by fan_out + 1.
FanOutCompute – messenger/worker fan-out with compute work. One
messenger per group stamps a CLOCK_MONOTONIC timestamp then wakes
fan_out workers via FUTEX_WAKE. Workers measure wake-to-run latency
(time from messenger’s timestamp to worker getting the CPU), sleep for
sleep_usec microseconds (simulating think time), then do operations
iterations of naive matrix multiply over a cache_footprint_kb-sized
working set (three square matrices of u64, O(n^3)). Requires
num_workers divisible by fan_out + 1.
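Since the working set is three square u64 matrices inside cache_footprint_kb, the implied matrix dimension can be derived arithmetically. The floor-rounding below is an assumption about the sizing, not confirmed ktstr behaviour:

```rust
// Hedged sketch: derive the dimension n such that three n*n matrices
// of u64 fit within `cache_footprint_kb`.
fn matrix_dim(cache_footprint_kb: usize) -> usize {
    let bytes = cache_footprint_kb * 1024;
    let per_matrix = bytes / 3; // three square matrices share the budget
    let cells = per_matrix / 8; // u64 = 8 bytes per cell
    (cells as f64).sqrt() as usize
}

fn main() {
    // Doc default cache_footprint_kb = 256 gives n = 104 under this
    // rounding: 3 * 104^2 * 8 = 259584 bytes <= 262144.
    println!("{}", matrix_dim(256));
}
```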
Sequence – compound work pattern: loop through phases in order,
repeat. Each phase runs for its specified duration before the next
starts. Phases are defined via the Phase enum:
Phase::Spin(Duration) – CPU spin for the given duration.
Phase::Sleep(Duration) – thread::sleep for the given duration.
Phase::Yield(Duration) – repeated sched_yield for the given duration.
Phase::Io(Duration) – simulated I/O (write 64 KB + 100 us sleep) for the given duration.
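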
Sequence cannot be constructed via WorkType::from_name() because
it requires explicit phase definitions. Build it directly:
WorkType::Sequence {
first: Phase::Spin(Duration::from_millis(100)),
rest: vec![
Phase::Sleep(Duration::from_millis(50)),
Phase::Yield(Duration::from_millis(20)),
],
}
ForkExit – rapid fork+_exit cycling. Each iteration forks a
child that immediately calls _exit(0). The parent waitpids then
repeats. Exercises wake_up_new_task, do_exit, and
wait_task_zombie.
NiceSweep – cycles the worker’s nice level from -20 to 19
across iterations. Each iteration: 512-iteration spin burst,
setpriority(PRIO_PROCESS, 0, nice_val), then sched_yield. Exercises
reweight_task and dynamic priority reweighting. Skips negative nice values
when CAP_SYS_NICE is absent. Resets nice to 0 before exit. Records
per-yield wake latency.
AffinityChurn – rapid self-directed CPU affinity changes. Each
iteration: spin_iters spin burst, sched_setaffinity to a random CPU
from the effective cpuset, then sched_yield. Exercises
affine_move_task and migration_cpu_stop. Records per-yield wake
latency.
PolicyChurn – cycles through scheduling policies each iteration.
Each iteration: spin_iters spin burst, sched_setscheduler to the
next policy in the sequence, then sched_yield. Cycles through
SCHED_OTHER, SCHED_BATCH, SCHED_IDLE (and SCHED_FIFO/SCHED_RR
with priority 1 when CAP_SYS_NICE is available). Exercises
__sched_setscheduler and scheduling class transitions. Resets to
SCHED_OTHER before exit. Records per-yield wake latency.
PageFaultChurn – rapid page fault cycling. Workers mmap a
region_kb KB region with MADV_NOHUGEPAGE (forcing 4 KB pages),
touch touches_per_cycle random pages via write faults through
do_anonymous_page, then MADV_DONTNEED to zap PTEs and repeat.
spin_iters iterations of CPU work separate cycles. Exercises
the page allocator, TLB pressure on migration, and rapid user/kernel
transitions. Uses xorshift64 PRNG for random page selection (seeded
from the process ID).
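The xorshift64 step named above is a well-known PRNG; a representative form is sketched below. The shift constants inside ktstr are assumed, not documented, so treat this as the family, not the exact generator:

```rust
// Hedged sketch of a classic xorshift64 step (13/7/17 shift constants
// are the textbook choice; ktstr's exact constants are an assumption).
fn xorshift64(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

fn main() {
    // Seeded from the process id, as PageFaultChurn does; |1 avoids the
    // all-zero fixed point of xorshift generators.
    let mut state = std::process::id() as u64 | 1;
    let region_pages = 4096 * 1024 / 4096; // region_kb default / 4 KB pages
    let page = (xorshift64(&mut state) % region_pages as u64) as usize;
    println!("{}", page < region_pages);
}
```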
MutexContention – N-way futex mutex contention. contenders
workers per group contend on a shared AtomicU32 via CAS acquire
(FUTEX_WAIT on failure). Loop: spin_burst(work_iters) then CAS
acquire, spin_burst(hold_iters) in the critical section, then
store 0 + FUTEX_WAKE(1) to release. Exercises convoy effect,
lock-holder preemption cascading stalls, and futex wait/wake
contention paths. Requires num_workers divisible by contenders.
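The acquire/release shape of that loop can be sketched with std atomics. Note the hedge: the real variant parks losers in FUTEX_WAIT and wakes one waiter on release; this std-only version spins instead, which preserves the CAS-acquire / store-release structure but not the sleep path:

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Hedged sketch of the contended lock loop: 0 = unlocked, 1 = locked.
fn contend(lock: &AtomicU32, shared: &AtomicU64, iters: u32) {
    for _ in 0..iters {
        // CAS acquire; the real worker would FUTEX_WAIT on failure.
        while lock
            .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
        shared.fetch_add(1, Ordering::Relaxed); // critical section
        lock.store(0, Ordering::Release); // real worker adds FUTEX_WAKE(1)
    }
}

fn main() {
    let lock = Arc::new(AtomicU32::new(0));
    let shared = Arc::new(AtomicU64::new(0));
    let contenders = 4; // doc default
    let handles: Vec<_> = (0..contenders)
        .map(|_| {
            let (l, s) = (Arc::clone(&lock), Arc::clone(&shared));
            thread::spawn(move || contend(&l, &s, 1000))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    println!("{}", shared.load(Ordering::Relaxed)); // 4 * 1000 = 4000
}
```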
Custom – user-supplied work function. The run function pointer
receives a reference to the stop flag (&AtomicBool, set by SIGUSR1)
and returns a WorkerReport when the flag becomes true. The
framework handles fork, cgroup placement, affinity, scheduling policy,
and signal setup; the user function owns the work loop and all
WorkerReport field population. Framework telemetry (migration
tracking, gap detection, schedstat deltas, iteration counter updates)
is not provided – the user function is responsible for any telemetry
it needs.
Warning — pgid SIGKILL sweep on teardown. Every worker process
calls setpgid(0, 0) immediately after fork, so the worker and any
children a Custom closure spawns share a single process group.
At teardown, stop_and_collect issues killpg(worker_pid, SIGKILL)
on BOTH the graceful-exit and StillAlive-escalation paths, and
WorkloadHandle::drop issues another killpg during handle
destruction. Every descendant that inherits the worker’s pgid
(a helper binary via execv, a subshell via sh -c, a test
fixture the closure forks to drive the scheduler) will be
SIGKILLed at teardown. Closures that need a child to outlive the
worker must either detach it from the worker’s pgid (call
setpgid(child_pid, 0) after fork) or wait on it explicitly
before returning the WorkerReport.
Function pointers (fn(&AtomicBool) -> WorkerReport) are fork-safe
because they carry no captured state across the fork boundary. Closures
are not supported. Cannot be constructed via WorkType::from_name().
use std::sync::atomic::{AtomicBool, Ordering};
use ktstr::workload::{WorkType, WorkerReport};
fn my_workload(stop: &AtomicBool) -> WorkerReport {
// `tid` in `WorkerReport` is an `i32` (libc::pid_t). Using
// `std::process::id() as i32` avoids a direct `libc` dependency in
// the consumer crate; inside ktstr the two produce the same value
// because one worker = one process (no threads).
let tid: i32 = std::process::id() as i32;
let start = std::time::Instant::now();
let mut work_units = 0u64;
while !stop.load(Ordering::Relaxed) {
// ... custom work ...
work_units += 1;
}
let wall_time_ns = start.elapsed().as_nanos() as u64;
// Start from `WorkerReport::default()` so the fields you don't
// populate take their zero / empty values automatically and new
// fields added to `WorkerReport` in the future do not require an
// edit here. Only populate the telemetry your custom workload
// actually produces.
WorkerReport {
tid,
work_units,
wall_time_ns,
iterations: work_units,
..WorkerReport::default()
}
}
let wt = WorkType::custom("my_workload", my_workload);
Grouped work types
PipeIo, FutexPingPong, and CachePipe require num_workers
divisible by 2 (paired). FutexFanOut and FanOutCompute require
num_workers divisible by fan_out + 1 (1 messenger + N receivers per
group). MutexContention requires num_workers divisible by
contenders. WorkType::worker_group_size() returns the group size
for these variants, or None for ungrouped types. PipeIo and
CachePipe use pipes; FutexPingPong, FutexFanOut, FanOutCompute,
and MutexContention use shared mmap pages with futex wait/wake.
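The divisibility rules above can be mirrored in a few lines. This is a hedged, hypothetical mirror of the contract, not ktstr's worker_group_size implementation (which operates on WorkType values, not strings):

```rust
// Hypothetical mirror of the grouping rules: paired types need groups
// of 2, fan-out types need fan_out + 1, MutexContention needs
// `contenders`; ungrouped types return None.
fn worker_group_size(work: &str, fan_out: usize, contenders: usize) -> Option<usize> {
    match work {
        "PipeIo" | "FutexPingPong" | "CachePipe" => Some(2),
        "FutexFanOut" | "FanOutCompute" => Some(fan_out + 1),
        "MutexContention" => Some(contenders),
        _ => None, // ungrouped: no divisibility constraint
    }
}

fn valid_worker_count(num_workers: usize, group: Option<usize>) -> bool {
    group.map_or(true, |g| num_workers % g == 0)
}

fn main() {
    let g = worker_group_size("FanOutCompute", 4, 4); // 1 messenger + 4 receivers
    println!("{}", valid_worker_count(10, g)); // 10 % 5 == 0 -> true
}
```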
Clone-mode and pcomm interactions
CloneMode is a per-WorkloadConfig enum with two variants —
Fork (the default; each worker is its own thread group, reaped
via waitpid) and Thread (workers share the parent’s tgid, run
as std::thread::spawn threads, reaped via JoinHandle).
pcomm is not a CloneMode variant — it is a WorkSpec
field set via WorkSpec::pcomm(name) /
CgroupDef::pcomm(name)
in the tutorial.
When a WorkSpec carries pcomm = Some(name), apply_setup
routes it through the fork-then-thread spawn path: ONE forked
thread-group leader whose task->comm is name hosts every
matching worker as a pthread-style thread under that leader.
Workers sharing a pcomm value coalesce into one container; this
combines the per-process-leader visibility schedulers expect (a
chrome parent, a java parent) with the in-process
std::thread::spawn dispatch shape CloneMode::Thread already
uses for the worker bodies themselves.
PipeIo and CachePipe work correctly inside a pcomm container.
When workers run as threads inside one forked leader, the per-pair
pipe-fd indices computed in the global pipe_pairs table are
addressed by each worker’s position WITHIN the container’s thread
group, so worker A reads its partner’s write end whether the pair
lives in two forked processes (Fork mode) or in two threads of
one pcomm container.
SignalStorm uses tkill(partner_tid, SIGUSR1) (per-task
signal delivery, PIDTYPE_PID), NOT kill (per-tgid,
PIDTYPE_TGID) and NOT tgkill(self_tgid, partner_tid, …)
(would return ESRCH under Fork mode because each forked worker
is its own tgid leader). tkill looks up the target via
find_task_by_vpid(pid) and skips the tgid check, so the signal
hits the partner thread’s per-task pending queue under Fork and
Thread modes uniformly — including inside pcomm-coalesced
thread groups. Sibling threads in a pcomm container do NOT dequeue
each other’s SignalStorm signals because the PIDTYPE_PID queue
is per-task, not per-tgid.
Default values
WorkType::from_name() uses these defaults:
Bursty: burst_duration=50ms, sleep_duration=100ms
PipeIo: burst_iters=1024
FutexPingPong: spin_iters=1024
CachePressure: size_kb=32, stride=64
CacheYield: size_kb=32, stride=64
CachePipe: size_kb=32, burst_iters=1024
FutexFanOut: fan_out=4, spin_iters=1024
FanOutCompute: fan_out=4, cache_footprint_kb=256, operations=5, sleep_usec=100
AffinityChurn: spin_iters=1024
PolicyChurn: spin_iters=1024
PageFaultChurn: region_kb=4096, touches_per_cycle=256, spin_iters=64
MutexContention: contenders=4, hold_iters=256, work_iters=1024
String lookup
WorkType::from_name() accepts PascalCase names matching the enum
variants (e.g. "SpinWait", "FutexPingPong"). Sequence and Custom
return None because they require explicit construction parameters.
WorkType::ALL_NAMES lists every variant name. WorkType::name()
returns the PascalCase name for a given value; for Custom, it returns
the user-provided name field.
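The lookup contract can be illustrated with a small stand-in enum. This is a hedged, partial mirror modelling only a few variants; the real WorkType::from_name covers the whole enum and returns fully parameterised values:

```rust
// Hypothetical mini version of the lookup: PascalCase names map to
// variants; Sequence and Custom return None because they require
// explicit construction parameters.
#[derive(Debug, PartialEq)]
enum MiniWorkType {
    SpinWait,
    YieldHeavy,
    Mixed,
}

fn from_name(name: &str) -> Option<MiniWorkType> {
    match name {
        "SpinWait" => Some(MiniWorkType::SpinWait),
        "YieldHeavy" => Some(MiniWorkType::YieldHeavy),
        "Mixed" => Some(MiniWorkType::Mixed),
        // Explicit-construction-only variants fail lookup:
        "Sequence" | "Custom" => None,
        _ => None, // unknown name
    }
}

fn main() {
    println!("{:?}", from_name("SpinWait")); // Some(SpinWait)
    println!("{:?}", from_name("Sequence")); // None
}
```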
WorkloadConfig
WorkloadConfig is the low-level struct passed to
WorkloadHandle::spawn(). CgroupDef builds one internally; use
WorkloadConfig directly when calling setup_cgroups() or
WorkloadHandle::spawn() in custom scenarios.
pub struct WorkloadConfig {
pub num_workers: usize, // Number of worker processes to fork
pub affinity: AffinityIntent, // Per-worker affinity intent (resolved at spawn time)
pub work_type: WorkType, // What each worker does
pub sched_policy: SchedPolicy, // Linux scheduling policy
pub mem_policy: MemPolicy, // NUMA memory placement policy
pub mpol_flags: MpolFlags, // Optional mode flags for set_mempolicy(2)
pub nice: Option<i32>, // Per-worker nice via setpriority(2); None inherits
pub clone_mode: CloneMode, // Fork (default) or Thread dispatch
pub comm: Option<Cow<'static, str>>, // task->comm via prctl(PR_SET_NAME); kernel truncates to 15 bytes
pub uid: Option<u32>, // Effective UID via setresuid; None inherits
pub gid: Option<u32>, // Effective GID via setresgid; None inherits
pub numa_node: Option<u32>, // Restrict affinity to one NUMA node's CPU set
pub composed: Vec<WorkSpec>, // Secondary worker groups spawned alongside the primary
}
Default: 1 worker, AffinityIntent::Inherit, SpinWait, Normal policy,
Default mem_policy, no mpol_flags, nice/comm/uid/gid/numa_node = None,
clone_mode = Fork, composed = empty.
AffinityIntent is the type-unified affinity expression used at the
top level and inside WorkSpec entries — Inherit, Exact(...),
and RandomSubset(...) are accepted at WorkloadHandle::spawn;
topology-aware variants (SingleCpu, LlcAligned, CrossCgroup,
SmtSiblingPair) require scenario context and are rejected at the
spawn gate with an actionable diagnostic. composed carries
secondary WorkSpec groups that spawn alongside the primary; each
composed entry can override work_type, num_workers,
sched_policy, affinity, etc., and reports back via
WorkerReport::group_idx (0 for the primary, 1..=N for composed
entries in declaration order).
See MemPolicy for the NUMA memory placement API.
Scheduling policies
Workers can run under different Linux scheduling policies:
pub enum SchedPolicy {
Normal,
Batch,
Idle,
Fifo(u32), // priority 1-99
RoundRobin(u32), // priority 1-99
Deadline {
runtime: Duration, // budget per period
deadline: Duration, // relative deadline from period start
period: Duration, // period; Duration::ZERO uses `deadline`
},
}
Fifo, RoundRobin, and Deadline require CAP_SYS_NICE. The
sched-deadline gate (runtime <= deadline <= period, all non-zero
unless period == Duration::ZERO, which the kernel substitutes
with deadline) is validated user-side in
SchedPolicy::deadline() before sched_setattr so a malformed
Deadline fails fast rather than tunneling EINVAL through the
syscall.
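The gate itself is simple arithmetic over the three durations. Below is a hedged sketch of the described validation; the error type and messages are assumptions, not ktstr's actual API:

```rust
use std::time::Duration;

// Hedged sketch of the user-side sched-deadline gate:
// runtime <= deadline <= period, all non-zero, with Duration::ZERO
// period substituted by `deadline` (mirroring the kernel's default).
fn validate_deadline(
    runtime: Duration,
    deadline: Duration,
    period: Duration,
) -> Result<(), &'static str> {
    let effective_period = if period == Duration::ZERO { deadline } else { period };
    if runtime == Duration::ZERO || deadline == Duration::ZERO {
        return Err("runtime and deadline must be non-zero");
    }
    if runtime > deadline {
        return Err("runtime must not exceed deadline");
    }
    if deadline > effective_period {
        return Err("deadline must not exceed period");
    }
    Ok(())
}

fn main() {
    let ok = validate_deadline(
        Duration::from_millis(2),  // budget per period
        Duration::from_millis(5),  // relative deadline
        Duration::from_millis(10), // period
    );
    println!("{}", ok.is_ok());
}
```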
Overriding work types
The work type override (configured via gauntlet or
Ctx.work_type_override) replaces the default SpinWait work type
for all scenarios that use it. Scenarios with non-SpinWait work types
are not overridden.
Overrides to grouped work types (PipeIo, FutexPingPong,
CachePipe, FutexFanOut, FanOutCompute, MutexContention) are skipped
when num_workers is not divisible by the work type’s group size.
Ops-based scenarios have a separate override mechanism via
CgroupDef.swappable. See Ops and Steps.