Resource Budget
--cpu-cap N adds a third tier between full performance-mode
isolation and unreserved no-perf-mode execution. Instead of
“lock each reserved LLC exclusively” (perf-mode), it reserves a
NUMA-aware, consolidation-aware set of host CPUs under LOCK_SH,
enforces the reservation via a cgroup v2 cpuset sandbox, and scales
make -jN fan-out to the reserved capacity. The flock granularity
stays per-LLC: every selected LLC is flocked whole, but plan.cpus
holds EXACTLY N CPUs (the last LLC is partial-taken when the
budget falls mid-LLC). See
Performance Mode for
the comparison against the other two tiers.
Every no-perf-mode VM and kernel build runs through this pipeline
— there is no “no cap” path. When --cpu-cap is absent, the
planner defaults to 30% of the calling process’s
sched_getaffinity cpuset (minimum 1 CPU). This keeps
sched_setaffinity safe under cgroup-restricted CI runners (CI
hosts, systemd slices, sudo-under-a-limited-cpuset) where the
process cannot run on every online CPU even if sysfs lists them.
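The default-budget arithmetic above can be sketched as follows (the helper name default_budget is hypothetical; the 30% figure and the 1-CPU floor are from the text):

```rust
/// Hypothetical helper: derive the default CPU budget when --cpu-cap is
/// absent: 30% of the process's allowed cpuset, floored at 1 CPU.
fn default_budget(allowed_cpus: usize) -> usize {
    // Integer arithmetic: 30% of the allowed set, never rounding to zero.
    (allowed_cpus * 30 / 100).max(1)
}
```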
When to use it
- Multi-tenant CI hosts where unbounded parallelism starves concurrent builds but the full performance-mode contract (SCHED_FIFO, hugepages, NUMA mbind, KVM exit suppression) is too heavy.
- Kernel builds run alongside perf-mode VM tests — the shared LOCK_SH coordinates with the perf-mode LOCK_EX so make never stomps a measurement in progress.
- Concurrent no-perf-mode VMs on a shared host — a cap of N CPUs bounds how much capacity each run reserves; peers that would exceed the host’s flock availability wait rather than racing for CPU.
CpuCap — parsed and resolved
CpuCap::new(N: usize) -> Result<CpuCap> constructs a cap from a
CLI integer. N is a CPU count. N == 0 is rejected with
--cpu-cap must be ≥ 1 CPU (got 0) — zero is a scripting sentinel,
not a silent “no cap” fallback.
CpuCap::resolve(cli_flag: Option<usize>) -> Result<Option<CpuCap>>
is the three-tier precedence:
- CLI flag (--cpu-cap N) wins over the env var.
- KTSTR_CPU_CAP=N env var applies when the CLI flag is absent. Empty string is treated as unset; 0 or non-numeric values produce the same rejection as the CLI path.
- Neither set → Ok(None). The planner expands this into the 30%-of-allowed default at acquire time.
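A sketch of the construction and precedence rules, with illustrative error strings; the env value is threaded in as a parameter here for testability, whereas the real resolve reads KTSTR_CPU_CAP itself:

```rust
/// Sketch of CpuCap construction and three-tier resolution; error
/// wording is illustrative, not the tool's exact messages.
#[derive(Debug, Clone, Copy, PartialEq)]
struct CpuCap(usize);

impl CpuCap {
    fn new(n: usize) -> Result<CpuCap, String> {
        if n == 0 {
            // Zero is a scripting sentinel, never a silent "no cap".
            return Err("--cpu-cap must be >= 1 CPU (got 0)".into());
        }
        Ok(CpuCap(n))
    }

    /// `env_cap` stands in for reading KTSTR_CPU_CAP from the environment.
    fn resolve(cli_flag: Option<usize>, env_cap: Option<&str>) -> Result<Option<CpuCap>, String> {
        // Tier 1: the CLI flag wins outright.
        if let Some(n) = cli_flag {
            return CpuCap::new(n).map(Some);
        }
        match env_cap {
            // Tier 2: env var applies; empty string counts as unset.
            Some(s) if !s.is_empty() => {
                let n: usize = s
                    .parse()
                    .map_err(|_| format!("KTSTR_CPU_CAP must be a CPU count (got {s:?})"))?;
                CpuCap::new(n).map(Some)
            }
            // Tier 3: neither set; the caller expands the 30% default later.
            _ => Ok(None),
        }
    }
}
```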
CpuCap::effective_count(allowed_cpus: usize) -> Result<usize>
clamps at acquire time, not construction time.
N > allowed_cpus returns a ResourceContention error naming
both numbers — operators reading the error see immediately that
the cap exceeds the process’s sched_getaffinity cpuset, not the
host’s total online CPU count. Fixing the cap requires either
lowering N or releasing the cgroup restriction on the calling
process.
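The acquire-time clamp can be sketched as a free function (the real version is a CpuCap method; error wording is illustrative):

```rust
/// Sketch of the acquire-time clamp: a cap above the allowed cpuset is
/// a contention error naming both numbers.
fn effective_count(cap: usize, allowed_cpus: usize) -> Result<usize, String> {
    if cap > allowed_cpus {
        // Name both numbers so the operator sees the cap is measured
        // against the process's cpuset, not the host's online CPUs.
        return Err(format!(
            "ResourceContention: --cpu-cap {cap} exceeds the {allowed_cpus} CPUs \
             in this process's sched_getaffinity cpuset"
        ));
    }
    Ok(cap)
}
```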
host_allowed_cpus — the reference set
host_allowed_cpus() reads the calling process’s allowed CPUs
via sched_getaffinity(0) with a /proc/self/status
Cpus_allowed_list: fallback. Every consumer of the --cpu-cap
pipeline plans against this set instead of
HostTopology::online_cpus, so sched_setaffinity on the plan’s
CPU list never produces an empty effective mask under a
cgroup-restricted runner.
An empty allowed set is a bail condition, not a fallback to
“every CPU” — guessing on a misconfigured host is worse than
failing visibly. A host topology that has no LLC overlapping the
allowed set (sysfs and sched_getaffinity disagree — e.g. stale
sysfs after hot-plug, cgroup cpuset pinned to CPUs the kernel no
longer reports in LLC groups) also bails with an actionable
diagnostic.
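The Cpus_allowed_list: fallback parse can be sketched in plain std (a simplified stand-in for the real helper):

```rust
/// Sketch of the /proc/self/status fallback: parse the
/// "Cpus_allowed_list:" cpulist (e.g. "0-3,8") into explicit CPU indices.
fn parse_cpus_allowed_list(status: &str) -> Option<Vec<usize>> {
    let list = status
        .lines()
        .find_map(|l| l.strip_prefix("Cpus_allowed_list:"))?
        .trim();
    let mut cpus = Vec::new();
    for part in list.split(',') {
        match part.split_once('-') {
            Some((lo, hi)) => {
                let lo: usize = lo.parse().ok()?;
                let hi: usize = hi.parse().ok()?;
                cpus.extend(lo..=hi);
            }
            None => cpus.push(part.parse().ok()?),
        }
    }
    // An empty allowed set is a bail condition for the caller, not a
    // fallback to "every CPU".
    Some(cpus)
}
```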
LlcPlan — the ACQUIRE result
acquire_llc_plan(topo, test_topo, cpu_cap) runs three phases:
- DISCOVER — for every LLC, stat the canonical /tmp/ktstr-llc-{N}.lock, read /proc/locks once, and build a snapshot of holders per LLC. No flocks are taken.
- PLAN — rank LLCs (eligible = at least one allowed CPU): consolidation (prefer LLCs with existing holders) first, then fresh LLCs, all tiebroken by ascending index. Seed on the highest-scored LLC’s NUMA node; greedily fill that node before spilling to nearest-by-distance nodes via TestTopology::numa_distance. Accumulate allowed-CPU contribution per LLC until the accumulated count meets target_cpus. Final acquire order is ascending LLC index for livelock safety.
- ACQUIRE — non-blocking LOCK_SH on every selected LLC. A single EWOULDBLOCK drops every held fd and retries once (one TOCTOU retry — the second DISCOVER’s /proc/locks read IS the backoff; more retries would amplify livelock risk without adding coordination signal).
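A simplified sketch of the PLAN ranking and accumulation; NUMA seeding and distance-ordered spill are elided, and the LlcCandidate shape is hypothetical:

```rust
/// Hypothetical per-LLC view assembled during DISCOVER.
struct LlcCandidate {
    index: usize,        // host LLC index
    holders: usize,      // existing ktstr flock holders on this LLC
    allowed_cpus: usize, // CPUs in this LLC intersected with the allowed set
}

/// Sketch of PLAN: consolidation-first ranking, greedy accumulation up
/// to target_cpus, acquire order re-sorted ascending.
fn plan_llcs(mut llcs: Vec<LlcCandidate>, target_cpus: usize) -> Vec<usize> {
    // Eligible = contributes at least one allowed CPU.
    llcs.retain(|l| l.allowed_cpus > 0);
    // LLCs with existing holders rank first, then fresh LLCs,
    // tiebroken by ascending index (false sorts before true).
    llcs.sort_by_key(|l| (l.holders == 0, l.index));
    let mut selected = Vec::new();
    let mut accumulated = 0;
    for l in &llcs {
        if accumulated >= target_cpus {
            break;
        }
        selected.push(l.index);
        accumulated += l.allowed_cpus;
    }
    // Final acquire order is ascending LLC index for livelock safety.
    selected.sort_unstable();
    selected
}
```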
Partial-take on the last LLC
Post-ACQUIRE, the materialization layer walks each selected LLC’s
CPUs in ascending order, intersects with the allowed set, and
STOPS at exactly target_cpus total. The last selected LLC
typically contributes only a prefix of its allowed CPUs — the
flock is still held at LLC granularity (coordination with
concurrent ktstr peers is always per-LLC), but plan.cpus
reflects the exact CPU budget. sched_setaffinity masks and
cgroup cpuset.cpus writes narrow to that exact set.
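The partial-take walk can be sketched as follows (the function shape is illustrative, not the tool's signature):

```rust
use std::collections::BTreeSet;

/// Sketch of materialization: walk each selected LLC's CPUs in ascending
/// order, intersect with the allowed set, stop at exactly target_cpus.
fn materialize_cpus(
    selected_llcs: &[(usize, Vec<usize>)], // (LLC index, its CPUs ascending)
    allowed: &BTreeSet<usize>,
    target_cpus: usize,
) -> Vec<usize> {
    let mut cpus = Vec::new();
    for (_llc, llc_cpus) in selected_llcs {
        for &cpu in llc_cpus {
            if cpus.len() == target_cpus {
                // The last LLC contributes only a prefix of its CPUs.
                return cpus;
            }
            if allowed.contains(&cpu) {
                cpus.push(cpu);
            }
        }
    }
    cpus
}
```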
The returned LlcPlan carries:
- locked_llcs: Vec<usize> — selected host LLC indices, ASC.
- cpus: Vec<usize> — flat list of reserved CPUs, sized exactly target_cpus (a subset of every selected LLC’s allowed CPUs, with the last LLC possibly contributing only a prefix).
- mems: BTreeSet<usize> — NUMA nodes actually hosting plan.cpus (an LLC that contributes a partial slice only registers the nodes of its used CPUs).
- snapshot: Vec<LlcSnapshot> — per-LLC discovery trail.
- locks: Vec<OwnedFd> — RAII flock handles; Drop releases.
When mems spans more than one node
(warn_if_cross_node_spill fires), stderr gets a ktstr: reserving LLCs […] across N NUMA nodes warning so the operator
knows to expect cross-node memory latency. Single-node plans are
silent.
Cgroup v2 cpuset sandbox
BuildSandbox::try_create(plan_cpus, plan_mems, hard_error_on_degrade)
writes the plan into a child cgroup under the caller’s own cgroup,
in the kernel-required order: cpuset.cpus → cpuset.mems →
cgroup.procs. A task in a cgroup with empty cpuset.mems may
be killed by the cpuset allocator, so migration into
cgroup.procs MUST happen after both cpuset fields are populated.
After each cpuset write, .effective is read back. Narrowing by
a parent cgroup (e.g. systemd slice restriction) is a fatal error
under --cpu-cap (hard_error_on_degrade = true) and a warn-
only degrade without the flag.
Drop migrates the build pid back to root, tolerates transient
EBUSY on cgroup.rmdir (5 × 10 ms retries), and orphans the
directory with a tag=resource_budget.cgroup_orphan_left warn-
log if the rmdir still refuses. Orphans older than 24 h are
swept on the next sandbox creation.
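A sketch of the ordered writes, assuming dir points at the child cgroup directory; the .effective read-back and the degrade policy are elided:

```rust
use std::fs;
use std::path::Path;

/// Sketch of the kernel-required write order: cpuset.cpus, then
/// cpuset.mems, then cgroup.procs. Migration goes last so a task never
/// lands in a cgroup whose cpuset.mems is still empty.
fn populate_sandbox(dir: &Path, cpus: &str, mems: &str, pid: u32) -> std::io::Result<()> {
    fs::write(dir.join("cpuset.cpus"), cpus)?;
    fs::write(dir.join("cpuset.mems"), mems)?;
    // Only after both cpuset fields are populated does the pid migrate.
    fs::write(dir.join("cgroup.procs"), pid.to_string())
}
```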
make -jN hint
make_jobs_for_plan(plan) returns plan.cpus.len().max(1). The
kernel-build pipeline threads this as make -jN. Without the
hint, make -j$(nproc) fans gcc children across every online
CPU, defeating the cpuset reservation in scheduling terms — the
kernel still enforces cpuset membership at the fs layer, but
gcc’s parallel width silently violates the budget. The .max(1)
floor guards against emitting make -j0, which GNU make rejects as an invalid job count.
ktstr locks — observational surface
ktstr locks (or cargo ktstr locks) prints every ktstr flock
currently held on the host, cross-referenced against
/proc/locks to name each holder by PID + truncated cmdline.
Read-only — takes no flocks. Four categories:
- LLC locks under /tmp/ktstr-llc-*.lock
- Per-CPU locks under /tmp/ktstr-cpu-*.lock
- Cache-entry locks under {cache_root}/.locks/*.lock
- Run-dir locks under {runs_root}/.locks/{kernel}-{project_commit}.lock — held for the duration of the (pre-clear + write) cycle by serialize_and_write_sidecar so two concurrent ktstr processes targeting the same run-dir key serialize on the sidecar write rather than tearing each other’s mid-write files.
Flags:
- --json — emit a structured snapshot. One-shot uses to_string_pretty for readability; under --watch each frame is compact on its own line (ndjson-style) for streaming consumers. Top-level keys: llcs, cpus, cache, run_dirs. Each row names its lockfile path and a holders array; every holder has pid + cmdline.
- --watch <interval> — redraw on the given interval until SIGINT. Interval uses humantime syntax (100ms, 1s, 5m, 1h).
Use ktstr locks when --cpu-cap acquires fail with
ResourceContention: the error already names busy LLCs, but the
live snapshot shows every contending peer at once.
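A simplified sketch of how /proc/locks FLOCK rows can be matched back to a lockfile by its device:inode key (holders_for_inode is hypothetical; real enumeration also resolves each PID's cmdline):

```rust
/// Sketch: collect PIDs of FLOCK holders on a given device:inode.
/// /proc/locks rows look like:
///   "1: FLOCK  ADVISORY  WRITE 1234 00:2f:56789 0 EOF"
fn holders_for_inode(proc_locks: &str, dev_inode: &str) -> Vec<u32> {
    proc_locks
        .lines()
        .filter_map(|line| {
            let f: Vec<&str> = line.split_whitespace().collect();
            // f[1] = lock type, f[4] = holder pid, f[5] = maj:min:inode
            if f.len() >= 6 && f[1] == "FLOCK" && f[5] == dev_inode {
                f[4].parse().ok()
            } else {
                None
            }
        })
        .collect()
}
```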
KTSTR_BYPASS_LLC_LOCKS — escape hatch
Setting KTSTR_BYPASS_LLC_LOCKS=1 (any non-empty value) skips
acquire_llc_plan entirely. The VM boots or the kernel
builds immediately without coordinating against any concurrent
perf-mode run. Use only when the operator explicitly accepts
measurement noise:
- A shell session doing unrelated work alongside tests.
- An isolated developer workstation.
- A CI queue that already serializes jobs at a higher layer.
Mutually exclusive with --cpu-cap / KTSTR_CPU_CAP at every
entry point (CLI parse for shell + kernel build on both
ktstr and cargo ktstr, the kernel_build_pipeline reservation
phase, and the library-layer KtstrVmBuilder::build no-perf-mode
branch). The error wording always contains "resource contract"
so operators can grep for it; the contract and the bypass cannot
coexist at any of those six sites.
Note: the performance_mode=true vs --cpu-cap exclusion is
weaker. It is enforced at CLI parse (shell --cpu-cap requires
--no-perf-mode via clap requires), but library consumers that
set performance_mode=true on KtstrVmBuilder directly see
KTSTR_CPU_CAP silently ignored — the builder’s perf-mode branch
never calls CpuCap::resolve, it goes through
validate_performance_mode + acquire_resource_locks
(LOCK_EX) instead.
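The exclusion enforced at each entry point reduces to a check like the following sketch (wording is illustrative beyond the guaranteed "resource contract" substring):

```rust
/// Sketch of the bypass/cap mutual-exclusion guard. The error text
/// deliberately contains "resource contract" so operators can grep it.
fn check_bypass_vs_cap(bypass_set: bool, cap_set: bool) -> Result<(), String> {
    if bypass_set && cap_set {
        return Err(
            "KTSTR_BYPASS_LLC_LOCKS and --cpu-cap/KTSTR_CPU_CAP are mutually \
             exclusive: the bypass abandons the resource contract the cap \
             depends on"
                .into(),
        );
    }
    Ok(())
}
```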
Filesystem requirement
Every ktstr lockfile (/tmp/ktstr-llc-*.lock,
/tmp/ktstr-cpu-*.lock, {cache_root}/.locks/*.lock,
{runs_root}/.locks/*.lock) must live on a local filesystem —
tmpfs, ext4, xfs, btrfs, f2fs, or bcachefs are the
explicitly-accepted set. flock(2) behavior
on NFS, CIFS, SMB2, CephFS, AFS, and FUSE is unreliable: NFSv3
is advisory-only without an NLM peer and NFSv4 byte-range
locking does not cover flock(2); SMB does not emit
/proc/locks entries so ktstr cannot enumerate peer holders;
Ceph MDS does not participate in flock serialization across
nodes; AFS does not support flock(2) at all; FUSE flock
semantics depend on whether the userspace server implements the
op. try_flock statfs-checks every lockfile path at open time
via reject_remote_fs in src/flock.rs — hitting any
deny-listed filesystem produces an actionable runtime error
naming the filesystem plus the remediation “Move the lockfile
path to a local filesystem (tmpfs, ext4, xfs, btrfs, f2fs,
bcachefs).” Unknown local filesystems (zfs, erofs, etc.) are
not on the deny-list and pass through, on the basis that
rejecting unknown-but-local is more disruptive than accepting a
potentially-unreliable flock.
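A name-based sketch of the deny-list decision; the real reject_remote_fs in src/flock.rs inspects statfs(2) f_type magic numbers rather than filesystem names:

```rust
/// Sketch: reject deny-listed remote filesystems, pass through
/// unknown-but-local ones (zfs, erofs, ...).
fn reject_remote_fs(fs_name: &str, lock_path: &str) -> Result<(), String> {
    const DENY: &[&str] = &["nfs", "cifs", "smb2", "cephfs", "afs", "fuse"];
    if DENY.contains(&fs_name) {
        return Err(format!(
            "lockfile {lock_path} is on {fs_name}, where flock(2) is \
             unreliable. Move the lockfile path to a local filesystem \
             (tmpfs, ext4, xfs, btrfs, f2fs, bcachefs)."
        ));
    }
    Ok(())
}
```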
Related
- Performance Mode — the full-isolation tier; the tier comparison lives there.
- Environment Variables — KTSTR_CPU_CAP, KTSTR_BYPASS_LLC_LOCKS, and every other ktstr-controlled env var.