Ops and Steps

The ops system is a composable way to express dynamic cgroup topology changes. It replaces hand-written Action::Custom functions for most dynamic scenarios.

Op

An Op is an atomic operation on the cgroup topology. The enum is #[non_exhaustive], so external match expressions must include a wildcard arm (_ => ...) to stay compatible across ktstr version bumps that add new variants:

Op	Description
`AddCgroup`	Create a cgroup
`RemoveCgroup`	Stop workers and remove a cgroup
`SetCpuset`	Set a cgroup’s cpuset via `CpusetSpec`
`ClearCpuset`	Remove cpuset constraints
`SwapCpusets`	Swap cpusets between two cgroups
`Spawn`	Fork workers into a cgroup
`StopCgroup`	Stop a cgroup’s workers
`SetAffinity`	Set worker affinity via `AffinityIntent`
`SpawnHost`	Spawn workers in the parent cgroup
`MoveAllTasks`	Move all tasks from one cgroup to another
`RunPayload`	Spawn a binary-kind `Payload` in the background and track its `PayloadHandle` under the step’s payload set. Subsequent `WaitPayload` / `KillPayload` address it by `(payload.name, cgroup)`. Scheduler-kind payloads are rejected at apply time.
`WaitPayload`	Block until the named payload exits naturally, evaluate its checks, and record metrics to the per-test sidecar. Target lookup is by `(name, cgroup)` composite key; `cgroup: None` resolves to the unique live copy. No timeout — pair with a bounded `HoldSpec` or the payload’s own `--runtime` for time-boxed runs.
`KillPayload`	SIGKILL the named payload, reap the child, evaluate checks, and record metrics. Same `(name, cgroup)` lookup rules as `WaitPayload`. Mirrors step-teardown drain for an explicitly-targeted payload.
`FreezeCgroup`	Freeze every task in the named cgroup via `cgroup.freeze` (kernel-side asynchronous freeze; not a SIGSTOP). Idempotent for already-frozen cgroups. Pair with `UnfreezeCgroup` to release; teardown auto-unfreezes. See Snapshots for the observer-cgroup deadlock warning.
`UnfreezeCgroup`	Unfreeze every task in the named cgroup via `cgroup.freeze`. Inverse of `FreezeCgroup`. Idempotent.
`Snapshot`	Capture a host-side diagnostic snapshot under `name` via the freeze coordinator: it pauses every vCPU, reads BPF map state, vCPU registers, and per-CPU counters into a `FailureDumpReport`, then resumes. The report is keyed by `name` on the active `SnapshotBridge`. With no active bridge, the op is a no-op that emits a `tracing::warn!`. See Snapshots.
`WatchSnapshot`	Capture a snapshot whenever the guest writes to the named kernel symbol; one fire = one capture tagged with the symbol path. Symbol resolution at op execution time looks the name up by verbatim vmlinux ELF symbol-table match — the requested name must appear in the guest kernel’s static symbol table exactly as written (no path expansion, no BTF descent). Maximum 3 watch ops per scenario (3 hardware watchpoint slots; 1 slot reserved for the error-class exit_kind trigger). See Watch Snapshots.
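The `(name, cgroup)` lookup described for `WaitPayload` and `KillPayload` can be sketched as follows. This is a minimal illustration, not ktstr's implementation: `PayloadKey` and `resolve_payload` are hypothetical names, and returning an error when `cgroup: None` matches more than one live copy is an assumption about how ambiguity is handled.

```rust
use std::collections::BTreeSet;

/// Hypothetical composite key: (payload name, cgroup name).
pub type PayloadKey = (String, String);

/// Resolve a payload target against the set of live payloads.
/// `Some(cgroup)` requires an exact key match; `None` succeeds only when
/// exactly one live payload carries `name` (assumed behavior).
pub fn resolve_payload(
    live: &BTreeSet<PayloadKey>,
    name: &str,
    cgroup: Option<&str>,
) -> Result<PayloadKey, String> {
    match cgroup {
        Some(cg) => {
            let key = (name.to_string(), cg.to_string());
            if live.contains(&key) {
                Ok(key)
            } else {
                Err(format!("no live payload {name} in cgroup {cg}"))
            }
        }
        None => {
            let mut matches = live.iter().filter(|(n, _)| n.as_str() == name);
            match (matches.next(), matches.next()) {
                (Some(only), None) => Ok(only.clone()),
                (None, _) => Err(format!("no live payload {name}")),
                (Some(_), Some(_)) => {
                    Err(format!("payload {name} is ambiguous; specify a cgroup"))
                }
            }
        }
    }
}
```

Keying on the pair rather than the name alone is what lets the same payload run in several cgroups within one step while still being individually addressable.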

Op constructors accept string literals directly (no .into() needed):

Op::add_cgroup("cg_0")
Op::set_cpuset("cg_0", CpusetSpec::disjoint(0, 2))
Op::stop_cgroup("cg_0")
Op::spawn("cg_0", WorkSpec::default().workers(4))
Op::set_affinity("cg_0", AffinityIntent::RandomSubset)
Op::spawn_host(WorkSpec::default().workers(4))
Op::freeze_cgroup("cg_0")
Op::unfreeze_cgroup("cg_0")
Op::snapshot("after_spawn")
Op::watch_snapshot("jiffies_64")

SpawnHost creates workers in the parent cgroup, not in a managed cgroup. Use this to simulate host-level CPU contention alongside managed cgroups.

OpKind

OpKind is a payload-free discriminant enum generated from Op via #[strum_discriminants]. It carries the same variant set as Op (AddCgroup, RemoveCgroup, …, RunPayload, WaitPayload, KillPayload, FreezeCgroup, UnfreezeCgroup, Snapshot, WatchSnapshot) with none of the inner fields, so it is cheap to copy and use as a map key. Framework code uses OpKind when it only cares which operation ran (per-op statistics, stimulus-event tagging, verifier/monitor bookkeeping), not what payload it carried. Test authors rarely need to name OpKind directly. The strum::EnumIter derive also lets tooling enumerate every OpKind variant for coverage checks.

OpKind shares Op’s #[non_exhaustive] attribute: external match expressions over OpKind must include a wildcard arm (_ => ...).
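A minimal sketch of the discriminant idea, written by hand instead of with strum so it stays self-contained. The three variants and their `name` fields are stand-ins for illustration; `kind`, `count_kinds`, and `touches_topology` are hypothetical helpers, not ktstr API.

```rust
use std::collections::HashMap;

// Stand-in for the real Op enum, trimmed to three variants.
#[derive(Debug, Clone)]
#[non_exhaustive]
pub enum Op {
    AddCgroup { name: String },
    RemoveCgroup { name: String },
    Snapshot { name: String },
}

// Payload-free mirror of Op: Copy + Eq + Hash, so it works as a map key.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
#[non_exhaustive]
pub enum OpKind {
    AddCgroup,
    RemoveCgroup,
    Snapshot,
}

impl Op {
    /// Drop the payload and keep only which operation this is.
    pub fn kind(&self) -> OpKind {
        match self {
            Op::AddCgroup { .. } => OpKind::AddCgroup,
            Op::RemoveCgroup { .. } => OpKind::RemoveCgroup,
            Op::Snapshot { .. } => OpKind::Snapshot,
        }
    }
}

/// Per-kind statistics: count how often each kind of op appears.
pub fn count_kinds(ops: &[Op]) -> HashMap<OpKind, usize> {
    let mut counts = HashMap::new();
    for op in ops {
        *counts.entry(op.kind()).or_insert(0) += 1;
    }
    counts
}

/// The catch-all arm keeps this match compiling when variants are added.
pub fn touches_topology(kind: OpKind) -> bool {
    match kind {
        OpKind::AddCgroup | OpKind::RemoveCgroup => true,
        _ => false,
    }
}
```

Inside the defining crate the compiler does not enforce the catch-all arm; it only becomes mandatory for code in other crates, which is where test authors sit.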

CpusetSpec

CpusetSpec computes a cpuset from the topology at runtime. The enum is #[non_exhaustive], so external callers should construct values via the associated constructor functions (listed after the enum definition) rather than naming variant literals: a future field addition (e.g. a stride on Range) can then land behind a defaulted parameter without breaking call sites. External match expressions over CpusetSpec must also include a wildcard arm (_ => ...):

pub enum CpusetSpec {
    Llc(usize),                          // All CPUs in an LLC
    Numa(usize),                         // All CPUs in a NUMA node
    Range { start_frac: f64, end_frac: f64 }, // Fraction of usable CPUs
    Disjoint { index: usize, of: usize },     // Equal disjoint partitions
    Overlap { index: usize, of: usize, frac: f64 }, // Overlapping partitions
    Exact(BTreeSet<usize>),              // Exact CPU set
}

Convenience constructors accept parameters directly: CpusetSpec::disjoint(0, 2), CpusetSpec::range(0.0, 0.5), CpusetSpec::exact([0, 1, 2]), CpusetSpec::llc(0), CpusetSpec::numa(0), CpusetSpec::overlap(0, 2, 0.5).

All fractional specs operate on usable_cpus().
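To make the fractional variants concrete, here is one plausible way Range and Disjoint could resolve against the usable CPU list. The rounding (floor of `frac * len`) and the contiguous, in-order partitioning are assumptions chosen for the sketch; ktstr's actual resolution may differ. `resolve_range` and `resolve_disjoint` are hypothetical names.

```rust
use std::collections::BTreeSet;

/// Resolve a fractional range over the usable CPU list.
/// Assumption: bounds are floor(frac * len), clamped to the list length.
pub fn resolve_range(usable: &[usize], start_frac: f64, end_frac: f64) -> BTreeSet<usize> {
    let n = usable.len();
    let lo = ((start_frac * n as f64).floor() as usize).min(n);
    let hi = ((end_frac * n as f64).floor() as usize).min(n);
    if lo >= hi {
        return BTreeSet::new();
    }
    usable[lo..hi].iter().copied().collect()
}

/// Resolve partition `index` out of `of` contiguous partitions.
/// Assumption: partitions follow list order and differ in size by at most one.
pub fn resolve_disjoint(usable: &[usize], index: usize, of: usize) -> BTreeSet<usize> {
    if of == 0 || index >= of {
        return BTreeSet::new();
    }
    let n = usable.len();
    let lo = index * n / of;
    let hi = (index + 1) * n / of;
    usable[lo..hi].iter().copied().collect()
}
```

Note that both functions index into the usable list rather than into raw CPU ids, so a reserved CPU 0 never leaks into a partition.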

CgroupDef

CgroupDef bundles three ops that always go together: create cgroup, set cpuset, spawn workers. It is the primary way to define cgroups in ops-based scenarios.

let def = CgroupDef::named("cg_0")
    .with_cpuset(CpusetSpec::disjoint(0, 2))
    .workers(4)
    .work_type(WorkType::SpinWait);

Builder methods

.with_cpuset(CpusetSpec) – set the cpuset (CPU set the cgroup is pinned to).
.with_cpuset_mems(BTreeSet<usize>) – explicit cpuset.mems override (default derives from the resolved cpuset’s NUMA nodes).
.workers(n) – set worker count.
.work_type(WorkType) – set work type (default: SpinWait).
.sched_policy(SchedPolicy) – set Linux scheduling policy (default: Normal). See WorkSpec Types.
.work(WorkSpec) – add a work group (multiple calls for concurrent groups).
.workload(&'static Payload) – attach a binary workload payload to run alongside the worker group; the framework launches it as a child process inside the cgroup. Panics when called with a scheduler-kind Payload (PayloadKind::Scheduler(_)); the scheduler slot is #[ktstr_test(scheduler = ...)] at the test level, not the cgroup-level workload slot. Step-level Op::RunPayload rejects scheduler-kind payloads with an anyhow::Error instead of panicking; the build-time workload call panics because there is no scenario-level recovery path.
.affinity(AffinityIntent) – set per-worker affinity (default: Inherit).
.mem_policy(MemPolicy) – set NUMA memory placement policy (default: Default). See MemPolicy.
.mpol_flags(MpolFlags) – set mode flags for set_mempolicy(2) (default: NONE). See MemPolicy.
.nice(n) – cgroup-level default per-worker nice value, merged into every WorkSpec whose own nice is unset. See Tutorial: Step 11.
.comm(name) – cgroup-level default per-worker task->comm via prctl(PR_SET_NAME). Merged into every WorkSpec whose own comm is unset.
.pcomm(name) – thread-group-leader task->comm for the fork-then-thread spawn path (workers run as threads under one forked leader). Stamps every existing WorkSpec in-place; not order-independent with .work(...).
.uid(uid) / .gid(gid) – cgroup-level default per-worker effective UID / GID via setresuid / setresgid. Merged into every WorkSpec whose own uid / gid is unset.
.numa_node(node) – cgroup-level default NUMA-node affinity for every WorkSpec. Merged at apply-setup time.
.swappable(bool) – opt into gauntlet work type override.

Cgroup controllers

The cgroup-v2 cpu / memory / io / pids controllers are exposed as typed setters (default: unconstrained):

.cpu_quota_pct(pct) / .cpu_quota(quota, period) / .cpu_unlimited() – write cpu.max (pct is shorthand: 100 = one full CPU). cpu_unlimited resets to the kernel default.
.cpu_weight(weight) – write cpu.weight (1..=10000, default 100).
.memory_max(bytes) / .memory_high(bytes) / .memory_low(bytes) / .memory_unlimited() – write memory.max / memory.high / memory.low. memory_unlimited resets memory.max to max.
.memory_swap_max(bytes) / .memory_swap_unlimited() – write memory.swap.max.
.io_weight(weight) – write io.weight (1..=10000, default 100).
.pids_max(n) / .pids_unlimited() – write pids.max.
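The percentage shorthand maps onto the cgroup-v2 `cpu.max` format, which is a quota and a period in microseconds (or the literal `max` for no limit). The sketch below assumes the shorthand uses the kernel's default 100000 µs period; `cpu_max_from_pct` and `cpu_max_unlimited` are hypothetical helpers, not ktstr functions.

```rust
/// Default cgroup-v2 period for cpu.max, in microseconds.
pub const DEFAULT_PERIOD_US: u64 = 100_000;

/// Build the string written to cpu.max for a percentage quota.
/// 100 means one full CPU; 250 means two and a half CPUs.
/// Assumption: the shorthand always uses the default period.
pub fn cpu_max_from_pct(pct: u64) -> String {
    let quota_us = DEFAULT_PERIOD_US * pct / 100;
    format!("{} {}", quota_us, DEFAULT_PERIOD_US)
}

/// The unconstrained value: no quota, default period.
pub fn cpu_max_unlimited() -> String {
    format!("max {}", DEFAULT_PERIOD_US)
}
```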

MemPolicy-cpuset validation

When a cgroup has a cpuset, ktstr validates that the MemPolicy’s node set is covered by the NUMA nodes reachable from that cpuset. A MemPolicy::Bind([1]) on a cgroup whose cpuset covers only NUMA node 0 fails at setup time. Policies without a node set (Default, Local) skip validation.
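The coverage rule amounts to a subset check. In this sketch `MemPolicy` is a trimmed stand-in with only the variants named above, and `validate_mem_policy` and the error text are hypothetical.

```rust
use std::collections::BTreeSet;

/// Trimmed stand-in for the real policy type.
pub enum MemPolicy {
    Default,
    Local,
    Bind(BTreeSet<usize>),
}

impl MemPolicy {
    /// Node set carried by the policy, if it has one.
    fn node_set(&self) -> Option<&BTreeSet<usize>> {
        match self {
            MemPolicy::Bind(nodes) => Some(nodes),
            MemPolicy::Default | MemPolicy::Local => None,
        }
    }
}

/// Check that every node the policy names is reachable from the cpuset.
/// `cpuset_nodes` is the set of NUMA nodes covered by the resolved cpuset.
pub fn validate_mem_policy(
    policy: &MemPolicy,
    cpuset_nodes: &BTreeSet<usize>,
) -> Result<(), String> {
    match policy.node_set() {
        // Policies without a node set skip validation.
        None => Ok(()),
        Some(nodes) if nodes.is_subset(cpuset_nodes) => Ok(()),
        Some(nodes) => Err(format!(
            "policy nodes {:?} are not covered by cpuset nodes {:?}",
            nodes, cpuset_nodes
        )),
    }
}
```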

WorkSpec type overrides and swappable

CgroupDef has a swappable flag (default: false). When true and a work type override is active (Ctx.work_type_override), the override replaces this def’s work type.

In contrast, the Scenario-level override (in run_scenario()) only replaces SpinWait work types. The two mechanisms serve different scopes:

Scenario-level: replaces SpinWait in WorkSpec.work_type
CgroupDef-level: replaces the work type when swappable = true

Both mechanisms skip an override to a grouped work type when num_workers is not divisible by that work type’s group size.

WorkSpec type overrides apply only to CgroupDef setup, not to raw Op::Spawn. Op::Spawn always uses the work type as given. Use CgroupDef with .swappable(true) when the work type should participate in gauntlet overrides.
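The two override rules can be written as a pair of small decision functions. This is a sketch of the rules as stated in this section, not ktstr code: only `SpinWait` comes from the source, while `Ungrouped` and `Grouped(n)` are placeholder variants and both function names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WorkType {
    SpinWait,
    /// Placeholder for any other work type without a group size.
    Ungrouped,
    /// Placeholder for a work type whose workers run in groups of N.
    Grouped(usize),
}

/// A grouped override only fits when the workers divide evenly into groups.
fn fits(override_type: WorkType, num_workers: usize) -> bool {
    match override_type {
        WorkType::Grouped(size) => size != 0 && num_workers % size == 0,
        _ => true,
    }
}

/// CgroupDef-level rule: replace the work type only when swappable is set.
pub fn cgroup_def_work_type(
    current: WorkType,
    swappable: bool,
    num_workers: usize,
    override_type: Option<WorkType>,
) -> WorkType {
    match override_type {
        Some(o) if swappable && fits(o, num_workers) => o,
        _ => current,
    }
}

/// Scenario-level rule: replace the work type only when it is SpinWait.
pub fn scenario_work_type(
    current: WorkType,
    num_workers: usize,
    override_type: Option<WorkType>,
) -> WorkType {
    match override_type {
        Some(o) if current == WorkType::SpinWait && fits(o, num_workers) => o,
        _ => current,
    }
}
```

A raw Op::Spawn would bypass both functions entirely, which matches the rule that it always uses the work type as given.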

Step

A Step is a sequence of ops with a hold period:

pub struct Step {
    pub setup: Setup,   // CgroupDefs to create after ops
    pub ops: Vec<Op>,   // Operations to apply
    pub hold: HoldSpec, // How long to wait after
}

Setup is either Defs(Vec<CgroupDef>) or Factory(fn(&Ctx) -> Vec<CgroupDef>). Vec<CgroupDef> implements Into<Setup>, so you can write setup: vec![...].into() instead of setup: Setup::Defs(vec![...]).

Constructors

Step::new(ops, hold) – creates a step with ops only (no CgroupDef setup). Use when the step only applies dynamic operations to an existing topology.

Step::with_defs(defs, hold) – creates a step with CgroupDef setup and a hold period. The primary constructor for steps that create cgroups with workers.

Step::set_ops(self, ops) – REPLACES the ops on a step (builder method). Chain after with_defs to add dynamic operations to a step that also creates cgroups.

Naming asymmetry: Step::set_ops REPLACES; the sibling Backdrop::with_ops APPENDS. The two methods deliberately use different verbs to signal the different semantics. A Step::new(ops, hold).set_ops(more) chain produces a step whose ops vec is exactly more (the original ops are dropped); a Backdrop::new().with_ops(ops_a).with_ops(ops_b) chain produces a backdrop whose ops vec is ops_a followed by ops_b. If you need to extend a step’s ops vec, build the combined Vec<Op> at the call site and pass it to set_ops, or compose at the Backdrop layer instead.
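The replace-versus-append difference is easy to see with stand-in types. Everything below is a simplified mock for illustration: `Op` is reduced to a label, the hold is a plain `Duration` rather than a HoldSpec, and `extend_step` is a hypothetical helper showing the "build the combined vec at the call site" advice.

```rust
use std::time::Duration;

// Stand-in types for illustration only; not the real ktstr definitions.
#[derive(Debug, Clone, PartialEq)]
pub struct Op(pub &'static str);

#[derive(Debug)]
pub struct Step {
    pub ops: Vec<Op>,
    pub hold: Duration,
}

impl Step {
    pub fn new(ops: Vec<Op>, hold: Duration) -> Self {
        Step { ops, hold }
    }

    /// Replaces the ops vec; whatever was there before is dropped.
    pub fn set_ops(mut self, ops: Vec<Op>) -> Self {
        self.ops = ops;
        self
    }
}

#[derive(Debug, Default)]
pub struct Backdrop {
    pub ops: Vec<Op>,
}

impl Backdrop {
    pub fn new() -> Self {
        Self::default()
    }

    /// Appends to the ops vec; earlier ops are kept.
    pub fn with_ops(mut self, ops: Vec<Op>) -> Self {
        self.ops.extend(ops);
        self
    }
}

/// Extending a step: build the combined vec first, then hand it to set_ops.
pub fn extend_step(step: Step, more: Vec<Op>) -> Step {
    let mut combined = step.ops.clone();
    combined.extend(more);
    step.set_ops(combined)
}
```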

HoldSpec

How long to hold after a step completes:

Variant	Description
`Frac(f64)`	Fraction of the total scenario duration
`Fixed(Duration)`	Fixed time
`Loop { interval }`	Repeat ops at interval until time runs out

HoldSpec::FULL is a constant for Frac(1.0) (hold for the full scenario duration).
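As a rough illustration of how the variants relate to the scenario duration, the sketch below resolves a hold into a concrete `Duration`. The enum is a local mirror of the table above; `hold_for` is a hypothetical method, and treating Loop as occupying the whole duration is an assumption read off "until time runs out".

```rust
use std::time::Duration;

/// Local mirror of the variants in the table above.
pub enum HoldSpec {
    Frac(f64),
    Fixed(Duration),
    Loop { interval: Duration },
}

impl HoldSpec {
    /// Hold for the full scenario duration.
    pub const FULL: HoldSpec = HoldSpec::Frac(1.0);

    /// Resolve the hold against the total scenario duration.
    /// Assumption: Frac is clamped to [0, 1] and Loop runs to the end.
    pub fn hold_for(&self, total: Duration) -> Duration {
        match self {
            HoldSpec::Frac(f) => total.mul_f64(f.clamp(0.0, 1.0)),
            HoldSpec::Fixed(d) => *d,
            HoldSpec::Loop { .. } => total,
        }
    }
}
```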

execute_defs

execute_defs(ctx, defs) is a convenience wrapper for the common pattern of creating cgroups and running them for the full duration:

execute_defs(ctx, vec![
    CgroupDef::named("cg_0").workers(4),
    CgroupDef::named("cg_1").workers(4),
])

Equivalent to execute_steps(ctx, vec![Step::with_defs(defs, HoldSpec::FULL)]).

execute_steps

execute_steps(ctx, steps) runs a step sequence:

For each step: apply ops, then apply setup (create cgroups from CgroupDefs), hold for the specified duration. Ops run first so parent cgroups can be created before children are spawned. Loop steps reverse this: setup runs once before the loop, then ops repeat at the specified interval.
Check scheduler liveness between steps.
After all steps: collect worker reports and run checks.
Throughout the run, execute_steps writes stimulus events to the SHM ring buffer for timeline analysis.
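The ordering rules above can be laid out as a phase plan. This sketch only models the order of phases, not their effects; `HoldKind`, `Phase`, and `plan` are hypothetical names, and the phase granularity is a simplification.

```rust
/// Whether a step holds for a set time or loops its ops.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HoldKind {
    Timed,
    Loop,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    ApplyOps,
    ApplySetup,
    Hold,
    RepeatOpsUntilDeadline,
    LivenessCheck,
    CollectAndCheck,
}

/// Lay out the phases for a step sequence in execution order.
pub fn plan(steps: &[HoldKind]) -> Vec<Phase> {
    let mut out = Vec::new();
    for (i, kind) in steps.iter().enumerate() {
        // Scheduler liveness is checked between steps, not before the first.
        if i > 0 {
            out.push(Phase::LivenessCheck);
        }
        match kind {
            // Regular steps: ops first, then setup, then hold.
            HoldKind::Timed => out.extend([Phase::ApplyOps, Phase::ApplySetup, Phase::Hold]),
            // Loop steps: setup once, then ops repeat at the interval.
            HoldKind::Loop => out.extend([Phase::ApplySetup, Phase::RepeatOpsUntilDeadline]),
        }
    }
    // After all steps: collect worker reports and run checks.
    out.push(Phase::CollectAndCheck);
    out
}
```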

execute_steps_with

execute_steps_with(ctx, steps, assertions) is the same as execute_steps but accepts an explicit Assert for worker checks. execute_steps is a convenience wrapper that passes None.

use ktstr::prelude::*;

fn my_scenario(ctx: &Ctx) -> Result<AssertResult> {
    let assertions = Assert::NO_OVERRIDES
        .check_not_starved()
        .max_gap_ms(3000);

    let steps = vec![/* ... */];
    execute_steps_with(ctx, steps, Some(&assertions))
}

When assertions is Some, the provided Assert overrides ctx.assert for worker checks. When it is None, execute_steps_with uses ctx.assert (the merged three-layer config: default_checks -> scheduler -> per-test).
