Monitor
The monitor observes scheduler state from the host side by reading guest VM memory directly. It does not instrument the guest kernel or the scheduler under test.
What it reads
The monitor resolves kernel structure offsets via BTF (BPF Type Format) from the guest kernel. It reads per-CPU runqueue structures to extract:
- `nr_running` – number of runnable tasks on each CPU
- `scx_nr_running` – tasks managed by the sched_ext scheduler
- `rq_clock` – runqueue clock value
- `local_dsq_depth` – scx local dispatch queue depth
- `scx_flags` – sched_ext flags for each CPU
- scx event counters (fallback, keep-last, offline dispatch, skip-exiting, skip-migration-disabled, reenq-immed, reenq-local-repeat, refill-slice-dfl, bypass-duration, bypass-dispatch, bypass-activate, insert-not-owned, sub-bypass-dispatch)
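A rough sketch of one per-CPU read, assuming BTF-resolved offsets and the `read_direct_*` accessors described under GuestKernel below; `RqOffsets`, `CpuSample`, and `percpu_rq_kva` are hypothetical names used only for illustration:

```rust
// Hypothetical sketch — `RqOffsets`, `CpuSample`, and `percpu_rq_kva` are
// illustrative names; the offsets come from BTF at startup.
struct RqOffsets { nr_running: u64, scx_nr_running: u64, rq_clock: u64, local_dsq_depth: u64 }

struct CpuSample { nr_running: u32, scx_nr_running: u32, rq_clock: u64, local_dsq_depth: u32 }

fn read_cpu(k: &GuestKernel, off: &RqOffsets, cpu: usize) -> anyhow::Result<CpuSample> {
    // Per-CPU runqueues live in the direct mapping:
    // rq KVA = &runqueues + __per_cpu_offset[cpu].
    let rq = percpu_rq_kva(k, cpu)?;
    Ok(CpuSample {
        nr_running: k.read_direct_u32(rq + off.nr_running)?,
        scx_nr_running: k.read_direct_u32(rq + off.scx_nr_running)?,
        rq_clock: k.read_direct_u64(rq + off.rq_clock)?,
        local_dsq_depth: k.read_direct_u32(rq + off.local_dsq_depth)?,
    })
}
```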
When CONFIG_SCHEDSTATS is enabled, the monitor also reads per-CPU
struct rq schedstat fields (run_delay, pcount, sched_count,
ttwu_count, etc.).
The monitor walks the struct sched_domain tree whenever BTF
contains rq->sd and struct sched_domain — no CONFIG_SCHEDSTATS
required. Domain tree walking starts at rq->sd (lowest level) and
follows sd->parent pointers up to the root. Each domain level
provides topology metadata (level, name, flags, span_weight) and
runtime fields (balance_interval, nr_balance_failed,
max_newidle_lb_cost) and optional fields (newidle_call,
newidle_success, newidle_ratio — added in 7.0, backported to
6.18.5+ and 6.12.65+; absent on 6.16-6.18.4). When
CONFIG_SCHEDSTATS is also enabled, each
domain additionally provides load balancing stats: lb_count,
lb_failed, lb_balanced, alb_pushed, ttwu_wake_remote, and
other counters indexed by idle type (CPU_NOT_IDLE, CPU_IDLE,
CPU_NEWLY_IDLE).
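A minimal sketch of the walk, assuming BTF-resolved `rq->sd` and `sd->parent` offsets and a `read_direct_u64` accessor from the `read_direct_*` family (`SdOffsets` is a hypothetical name):

```rust
// Hypothetical sketch of the sched_domain tree walk.
struct SdOffsets { rq_sd: u64, sd_parent: u64 }

fn walk_domains(k: &GuestKernel, rq_kva: u64, off: &SdOffsets) -> anyhow::Result<Vec<u64>> {
    let mut levels = Vec::new();
    // Start at rq->sd, the lowest domain level for this CPU.
    let mut sd = k.read_direct_u64(rq_kva + off.rq_sd)?;
    while sd != 0 {
        levels.push(sd); // the real walk decodes topology + runtime fields here
        // Follow sd->parent up to the root (NULL-terminated chain).
        sd = k.read_direct_u64(sd + off.sd_parent)?;
    }
    Ok(levels)
}
```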
Sampling
The monitor takes periodic snapshots (MonitorSample) of all per-CPU
state. Each sample captures a point-in-time view of every CPU.
MonitorSummary aggregates samples into peak values (max imbalance
ratio, max DSQ depth, stall detection), per-sample averages
(imbalance ratio, nr_running per CPU, DSQ depth per CPU), and event
counter deltas. Averages are computed over valid samples only
(excluding uninitialized guest memory).
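Schematically, the imbalance aggregation reduces to something like this sketch (field and method names are assumed; `sample_looks_valid` is described under "Uninitialized memory detection" below):

```rust
// Hypothetical sketch: peak and per-sample average over valid samples only.
fn imbalance_stats(samples: &[MonitorSample]) -> (f64, f64) {
    let valid: Vec<&MonitorSample> =
        samples.iter().filter(|s| sample_looks_valid(s)).collect();
    let peak = valid.iter().map(|s| s.imbalance_ratio()).fold(0.0_f64, f64::max);
    let sum: f64 = valid.iter().map(|s| s.imbalance_ratio()).sum();
    let avg = if valid.is_empty() { 0.0 } else { sum / valid.len() as f64 };
    (peak, avg) // MonitorSummary also tracks DSQ depth, stalls, and counter deltas
}
```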
Threshold evaluation
MonitorThresholds defines pass/fail conditions:
```rust
pub struct MonitorThresholds {
    pub max_imbalance_ratio: f64,
    pub max_local_dsq_depth: u32,
    pub fail_on_stall: bool,
    pub sustained_samples: usize,
    pub max_fallback_rate: f64,
    pub max_keep_last_rate: f64,
}
```
A violation must persist for sustained_samples consecutive samples
before triggering a failure. This filters transient spikes from cpuset
transitions and cgroup creation/destruction.
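The filter reduces to a per-threshold, per-CPU consecutive counter, roughly:

```rust
// Minimal sketch of the sustained-violation filter (names illustrative):
// a violation only fires after `sustained_samples` consecutive hits.
#[derive(Default)]
struct SustainedCounter { consecutive: usize }

impl SustainedCounter {
    fn observe(&mut self, violated: bool, sustained_samples: usize) -> bool {
        self.consecutive = if violated { self.consecutive + 1 } else { 0 };
        self.consecutive >= sustained_samples
    }
}
```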
Stall detection
A stall is detected when a CPU’s rq_clock does not advance between
consecutive samples. Three exemptions prevent false positives:
- Idle CPUs: when `nr_running == 0` in both the current and previous sample, the CPU has no runnable tasks. The kernel stops the tick (NOHZ) on idle CPUs, so `rq_clock` legitimately does not advance. These CPUs are excluded from stall checks.
- Preempted vCPUs: when the vCPU thread’s CPU time did not advance past the preemption threshold between samples, the host preempted the vCPU. These samples are excluded from stall checks.
- Sustained window: stall detection uses per-CPU consecutive counters and the `sustained_samples` threshold, matching how imbalance and DSQ depth checks work. A single stuck sample does not trigger failure – the stall must persist for `sustained_samples` consecutive samples on the same CPU.
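Putting the exemptions together, the per-sample candidate check is roughly the following sketch; the candidate then feeds the same consecutive-counter shape shown under "Threshold evaluation":

```rust
// Hypothetical sketch: a CPU is a stall candidate only if its rq_clock is
// stuck, it was not idle in both samples, and its vCPU actually ran.
fn stall_candidate(prev_nr: u32, cur_nr: u32,
                   prev_clock: u64, cur_clock: u64,
                   vcpu_ran: bool) -> bool {
    let idle = prev_nr == 0 && cur_nr == 0; // NOHZ: tick stops on idle CPUs
    let clock_stuck = cur_clock == prev_clock;
    clock_stuck && !idle && vcpu_ran        // host-preempted samples excluded
}
```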
Uninitialized memory detection
Before the guest kernel initializes per-CPU structures, the monitor’s reads return uninitialized data. Two layers handle this:
- Summary computation (`MonitorSummary::from_samples`): skips individual samples where any CPU’s `local_dsq_depth` exceeds `DSQ_PLAUSIBILITY_CEILING` (10,000) via `sample_looks_valid()`.
- Threshold evaluation (`MonitorThresholds::evaluate`): checks all samples globally for plausibility. If all `rq_clock` values are identical across every CPU and sample, or any sample exceeds the plausibility ceiling, the entire report passes as “not yet initialized” — no per-threshold checks run.
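A sketch of the per-sample filter (the `cpus` field name is assumed):

```rust
const DSQ_PLAUSIBILITY_CEILING: u32 = 10_000;

// Hypothetical sketch: keep a sample only if every CPU's DSQ depth is plausible.
fn sample_looks_valid(sample: &MonitorSample) -> bool {
    sample.cpus.iter().all(|c| c.local_dsq_depth <= DSQ_PLAUSIBILITY_CEILING)
}
```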
BPF map introspection
The monitor module also provides host-side BPF map discovery and
read/write access via bpf_map::BpfMapAccessor. The host reads and
writes guest BPF maps directly through the physical memory mapping
— no guest cooperation or BPF syscalls are needed.
GuestMem
GuestMem wraps a host pointer to the start of guest DRAM and
provides bounds-checked volatile reads and writes for scalar types
(u8/u32/u64). Byte-slice reads (read_bytes) use
copy_nonoverlapping. It also implements x86-64 page table walks
(translate_kva) for both 4-level and 5-level paging, and
granule-agnostic aarch64 walks (4 KB / 16 KB / 64 KB; level count
derived from TCR_EL1’s TG1 + T1SZ fields).
Scalar accesses use volatile semantics because the guest kernel modifies memory concurrently.
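The scalar path reduces to something like this sketch (a simplified stand-in, not the real type; alignment handling elided):

```rust
// Simplified stand-in for GuestMem: host base pointer + guest DRAM size.
struct GuestMemSketch { base: *const u8, size: u64 }

impl GuestMemSketch {
    // Bounds-checked volatile u64 read at a guest-physical offset.
    fn read_u64(&self, gpa: u64) -> Option<u64> {
        let end = gpa.checked_add(8)?;      // reject address overflow
        if end > self.size { return None; } // reject out-of-bounds
        // Volatile: the guest kernel mutates this memory concurrently, so
        // the compiler must not cache or elide the load.
        let p = unsafe { self.base.add(gpa as usize) } as *const u64;
        Some(unsafe { std::ptr::read_volatile(p) })
    }
}
```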
GuestKernel
GuestKernel builds on GuestMem by adding kernel symbol
resolution and address translation. It parses the vmlinux ELF
symbol table at construction and resolves paging configuration
(PAGE_OFFSET, CR3, 4-level vs 5-level) from guest memory.
Subsequent reads use cached state.
Three address translation modes are supported:
- Text/data/bss: `kva - __START_KERNEL_map`. For statically-linked kernel variables (`read_symbol_*`, `write_symbol_*`).
- Direct mapping: `kva - PAGE_OFFSET`. For SLAB allocations, per-CPU data, physically contiguous memory (`read_direct_*`).
- Vmalloc/vmap: page table walk via CR3. For BPF maps, vmalloc’d memory, module text (`read_kva_*`, `write_kva_*`).
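In rough terms, the three translations behind those accessor families look like this sketch (`start_kernel_map`, `phys_base`, and `page_offset` stand for the values resolved at construction; the `phys_base` adjustment is described under Prerequisites):

```rust
// read_symbol_* / write_symbol_*: statically-linked kernel variables.
fn text_kva_to_pa(kva: u64, start_kernel_map: u64, phys_base: u64) -> u64 {
    kva - start_kernel_map + phys_base // KASLR-adjusted physical base
}

// read_direct_*: SLAB allocations, per-CPU data, contiguous memory.
fn direct_kva_to_pa(kva: u64, page_offset: u64) -> u64 {
    kva - page_offset
}

// read_kva_* / write_kva_*: BPF maps, vmalloc, module text — no linear
// formula exists here, so translate_kva walks the page tables via CR3.
```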
BpfMapAccessor
BpfMapAccessor resolves BTF offsets for BPF map kernel structures
(struct bpf_map, struct bpf_array, struct xa_node, struct idr)
and provides map discovery and value read/write. It borrows a
GuestKernel for address translation.
BpfMapAccessorOwned is a convenience wrapper that owns the
GuestKernel internally. Use BpfMapAccessor::from_guest_kernel
when you already have a GuestKernel; use BpfMapAccessorOwned::new
when you want a self-contained accessor.
Map discovery walks the kernel’s map_idr xarray:
- Read `map_idr` (BSS symbol, text mapping translation)
- Walk the `xa_node` tree (SLAB-allocated, direct mapping translation)
- Read `struct bpf_map` fields. The allocation may be kmalloc’d or vmalloc’d depending on size and flags, so the translation uses `translate_any_kva`, which handles both paths rather than assuming either.
find_map searches by name suffix (e.g. ".bss" matches
"mitosis.bss"). Only BPF_MAP_TYPE_ARRAY maps are returned.
Use maps() to enumerate all map types without filtering.
Value access for BPF_MAP_TYPE_ARRAY maps reads/writes the inline
bpf_array.value flex array at the BTF-resolved offset. The value
region is vmalloc’d, so each byte access goes through the page table
walker to handle page boundaries.
For BPF_MAP_TYPE_PERCPU_ARRAY maps, bpf_array.pptrs[key] holds
a __percpu pointer (at the same union offset as value). Adding
__per_cpu_offset[cpu] yields the per-CPU KVA in the direct mapping.
read_percpu_array returns one Option<Vec<u8>> per CPU: Some
when the per-CPU PA falls within guest memory, None when it does not.
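The address computation itself is just (a sketch):

```rust
// Hypothetical sketch: per-CPU value KVA for BPF_MAP_TYPE_PERCPU_ARRAY.
// `pptr` is bpf_array.pptrs[key]; the resulting direct-mapping KVA may or
// may not fall inside guest memory, hence the Option per CPU.
fn percpu_value_kva(pptr: u64, per_cpu_offset: &[u64], cpu: usize) -> u64 {
    pptr.wrapping_add(per_cpu_offset[cpu])
}
```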
Typed field access
When a map has BTF metadata (btf_kva != 0), resolve_value_layout
reads the guest’s struct btf and its data blob, parses it with
btf_rs, and resolves the value struct’s fields. This enables
read_field / write_field with type-checked BpfValue variants.
Usage example
Find a scheduler’s .bss map and write a crash variable:
```rust
let offsets = BpfMapOffsets::from_vmlinux(vmlinux)?;
let accessor = BpfMapAccessor::from_guest_kernel(&kernel, &offsets)?;
let bss = accessor.find_map(".bss").expect(".bss map not found");
accessor.write_value_u32(&bss, crash_offset, 1)?;
```
BpfMapWrite
BpfMapWrite specifies a host-side write to a BPF map during VM
execution. The test runner waits for the scheduler to load (map
becomes discoverable), writes the value, then signals the guest via
SHM to start the scenario.
```rust
pub struct BpfMapWrite {
    pub map_name_suffix: &'static str, // e.g. ".bss"
    pub offset: usize,                 // byte offset in the map value
    pub value: u32,                    // value to write
}
```
Use with #[ktstr_test] via the bpf_map_write attribute:
```rust
const BPF_CRASH: BpfMapWrite = BpfMapWrite {
    map_name_suffix: ".bss",
    offset: 42,
    value: 1,
};

#[ktstr_test(bpf_map_write = BPF_CRASH, expect_err = true)]
fn crash_test(ctx: &Ctx) -> Result<AssertResult> {
    Ok(AssertResult::pass())
}
```
The map is discovered by name suffix via BpfMapAccessor::find_map.
Only BPF_MAP_TYPE_ARRAY maps are supported. The write targets a
u32 at the specified byte offset within the map’s value region.
Prerequisites
- vmlinux: Required for ELF symbols and BTF. Must match the guest kernel. Symbols include `phys_base` so the runtime KASLR offset can be resolved via a page-table walk through the BSP’s CR3, breaking the chicken-and-egg between text-symbol PA translation and KASLR.
Cast analysis
BPF maps frequently store kernel pointers (task_struct *,
cgroup *, …) and arena pointers in u64 fields because BTF cannot
express a pointer to a per-allocation type. Without intervention the
renderer treats them as integers and the failure dump shows raw
0xffff…ffff values with no further chase.
The cast analyzer (monitor::cast_analysis::analyze_casts) closes
that gap. The freeze coordinator runs it once per scheduler load,
before any periodic capture or on-demand snapshot would consume
its output:
- The host loads the scheduler binary and locates each `.bpf.o` ELF in the build artifacts.
- Each program section is decoded through `cast_analysis::BpfInsn::from_le_bytes` into a flat `&[BpfInsn]` slab; relocations against `.bss`/`.data`/`.rodata` annotate the corresponding `BPF_LD_IMM64` PCs with their datasec target.
- `analyze_casts` walks the slab forward, tracking register and stack-slot state for each instruction. Two detection paths feed the output: the arena pointer path (LDX through a previously loaded `u64` field) and the kernel kptr path (STX of a typed pointer register into a `u64` field). Function-entry seeding from `bpf_func_info` reseeds R1..R5 from the BTF FuncProto so typed parameters propagate correctly across subprogram joins.
- The result is a `CastMap` (`BTreeMap<(source_struct_btf_id, field_byte_offset), CastHit>`) cached on the per-VM `KtstrVm.cast_map` (a `LazyCastMap` that runs the analyzer on first dump and caches the result process-wide by scheduler binary content hash). The freeze coordinator threads the cached `CastMap` through `DumpContext::cast_map` into every per-map render so the renderer can consult it at every dump site.
- `render_cast_pointer` in `monitor::btf_render` consumes `CastHit` via `MemReader::cast_lookup`. When a `u64` field at a recorded `(struct, offset)` is rendered, the renderer chases the pointer through the address-space-appropriate reader (arena vs slab/vmalloc) and tags the result with a `cast_annotation` of `"cast→arena"` or `"cast→kernel"` (plus a `(sdt_alloc)` suffix when the bridge described below fired). Failure dumps show the annotation alongside the resolved struct fields, so cast-recovered pointers are visually distinct from BTF-typed ones.
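The map shape and the renderer-side lookup are roughly (a sketch; `CastHit` fields are assumed):

```rust
use std::collections::BTreeMap;

type BtfId = u32;

enum CastKind { Arena, Kernel }

struct CastHit { kind: CastKind, target_btf_id: BtfId }

// Keyed by (source struct BTF id, field byte offset).
type CastMap = BTreeMap<(BtfId, u32), CastHit>;

fn cast_annotation(map: &CastMap, src: BtfId, off: u32) -> Option<&'static str> {
    map.get(&(src, off)).map(|hit| match hit.kind {
        CastKind::Arena => "cast→arena",
        CastKind::Kernel => "cast→kernel",
    })
}
```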
The renderer also consults an sdt_alloc bridge whenever a chase
target peels to a BTF_KIND_FWD forward declaration (typical for
struct sdt_data __arena * fields whose body lives in the
sdt_alloc library’s BTF rather than the scheduler’s program BTF).
The dump-state pre-pass walks each live scx_allocator and
populates a slot_start → ArenaSlotInfo index — one entry
per live allocator slot, carrying elem_size, header_size, and
the resolved payload BTF type id — that
MemReader::resolve_arena_type (in
dump::render_map::AccessorMemReader) range-looks up during the
chase. The lookup finds the slot whose
[slot_start, slot_start + elem_size) range contains the chased
address and routes by offset_in_slot: a slot-start chase
(offset == 0, e.g. the data field of scx_task_map_val
storing the raw sdt_alloc() return) returns the payload type id
with header_skip = header_size; a payload-start chase
(offset == header_size, e.g. the return of scx_task_data(p)
cached in cached_taskc_raw) returns the same payload type id
with header_skip = 0. The renderer reads header_skip + btf_size
bytes from the chased address, slices off the leading
header_skip bytes, and renders the payload struct. The
resulting Ptr carries a sdt_alloc-flavoured annotation:
"sdt_alloc" on the BTF-typed Type::Ptr arm, and
"cast→arena (sdt_alloc)" / "cast→kernel (sdt_alloc)" on the
cast-analyzer-driven path. The sdt_alloc bridge fires only when
the BTF-only resolve has already exhausted same-name siblings;
false-positive risk on that arm is bounded by the arena-window
range check (MemReader::resolve_arena_type returns None for
addresses outside every known allocator slot).
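The range lookup and the offset routing reduce to roughly this sketch (field names mirror the description above but are illustrative):

```rust
use std::collections::BTreeMap;

struct ArenaSlotInfo { elem_size: u64, header_size: u64, payload_type_id: u32 }

// Returns (payload BTF type id, header_skip) for a chased address, or None
// when the address falls outside every known allocator slot.
fn resolve_arena_type(index: &BTreeMap<u64, ArenaSlotInfo>,
                      addr: u64) -> Option<(u32, u64)> {
    // Greatest slot_start <= addr, then check [start, start + elem_size).
    let (start, info) = index.range(..=addr).next_back()?;
    let off = addr - *start;
    if off >= info.elem_size { return None; }
    if off == 0 {
        // Slot-start chase (raw sdt_alloc() return): skip the header.
        Some((info.payload_type_id, info.header_size))
    } else if off == info.header_size {
        // Payload-start chase (e.g. cached scx_task_data(p)): no skip.
        Some((info.payload_type_id, 0))
    } else {
        None
    }
}
```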
A separate cross-BTF Fwd resolution path covers the case where a
BTF_KIND_FWD pointee’s body lives in a sibling embedded BPF
object’s BTF rather than an sdt_alloc slot — the typical
multi-.bpf.objs shape where one object declares
struct cgx_target; (forward) and a sibling object defines
struct cgx_target { ... } (full body). The cast-analysis
pre-pass (vmm::cast_analysis_load::build_fwd_index) walks every
parsed embedded program BTF and records a
name -> (btfs index, type_id) entry for every complete
(!is_fwd) Type::Struct / Type::Union. First-write-wins on
duplicate names: when the same name appears in multiple BTFs the
index keeps the first-seen entry. Anonymous types and Typedef
are not indexed (no name to key on, and typedefs add no body —
the chase peels through them via peel_modifiers_with_id before
consulting the index). The index is threaded through
DumpContext::cross_btf and exposed to the renderer via
MemReader::cross_btf_resolve_fwd. When chase_arena_pointer /
render_cast_pointer peel a chase target through
peel_modifiers_resolving_fwd and the local same-BTF sibling
search came up empty, try_cross_btf_fwd_resolve consults the
cross-BTF index by the Fwd’s name (and aggregate kind — struct
vs union); a hit returns a CrossBtfRef { btf, type_id } and
the chase recursion switches to the resolved sibling BTF for the
pointee render. Cross-BTF resolution does NOT introduce a new
annotation — the body is recovered transparently and the rendered
subtree carries the cast or BTF-typed annotation it would have
had if the same struct lived in the entry BTF. Unlike the
sdt_alloc bridge the cross-BTF index is consulted whenever a
Fwd terminal survives the local resolve — there is no
arena-window gate, since the lookup is purely a name-keyed BTF
table and a name miss simply leaves the chase on its existing
“forward declaration; body not in this BTF” skip path.
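The index itself is a small first-write-wins table, roughly (a sketch; types simplified):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum AggKind { Struct, Union }

// (name, aggregate kind) -> (index into the parsed BTFs, type id).
type FwdIndex = HashMap<(String, AggKind), (usize, u32)>;

fn record_complete_type(idx: &mut FwdIndex, name: &str, kind: AggKind,
                        btf_idx: usize, type_id: u32) {
    // First-write-wins: duplicate names keep the first-seen body.
    idx.entry((name.to_owned(), kind)).or_insert((btf_idx, type_id));
}

fn resolve_fwd(idx: &FwdIndex, name: &str, kind: AggKind) -> Option<(usize, u32)> {
    // A name miss leaves the chase on its existing skip path.
    idx.get(&(name.to_owned(), kind)).copied()
}
```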
The analyzer is deliberately conservative: branch joins reset
register and stack state, conflicts drop the offending entry, and
self-stores are rejected. False negatives fall back to raw u64
(the prior behavior); false positives would chase garbage and are
avoided. The analysis is unconditional — no test-author
configuration, no opt-in flag — and the freeze coordinator wires
the resulting CastMap through every snapshot, periodic capture,
and failure dump.
Probe pipeline
The probe pipeline captures function arguments and struct fields during auto-repro. It operates inside the guest VM (not from the host), using two BPF skeletons that share maps.
Architecture
```
crash stack -> extract functions -> BTF resolve -> load skeletons -> poll
                                                        |
          kprobe skeleton          |        fentry/fexit skeleton
          (kernel entry)           |        (BPF entry + kernel exit)
                |                  |                 |
                v                  v                 v
          func_meta_map       <--shared-->      probe_data
                                                (entry + exit fields)
                                                     |
                                      trigger fires (ring buffer)
                                                     |
                                       read probe_data entries
                                                     |
                                            stitch by tptr
                                                     |
                                    format with entry→exit diffs
```
Kprobe skeleton (probe.bpf.c)
Attaches to kernel functions via attach_kprobe. The BPF handler:
- Gets the function IP via `bpf_get_func_ip`
- Looks up `func_meta` from `func_meta_map` (keyed by IP)
- Captures 6 raw args from `pt_regs`
- Dereferences struct fields via BTF-resolved offsets
- Reads `char *` string params if configured
- Stores the result in `probe_data` (keyed by `(func_ip, task_ptr)`)
The trigger fires via tp_btf/sched_ext_exit (inside
scx_claim_exit()) and sends an EVENT_TRIGGER via ring buffer
with the current task pointer and kernel stack.
Fentry/fexit skeleton (fentry_probe.bpf.c)
Handles both BPF struct_ops callbacks and kernel function exit
capture. Loaded in batches of 4 fentry + 4 fexit programs per
skeleton instance via set_attach_target. Shares probe_data and
func_meta_map with the kprobe skeleton via reuse_fd.
A per-slot is_kernel rodata flag controls argument access:
- BPF callbacks (`is_kernel=0`): `ctx[0]` is a void pointer to the real callback arguments. The handler dereferences through it. Uses sentinel IPs (`func_idx | (1<<63)`) in `func_meta_map`.
- Kernel functions (`is_kernel=1`): args are directly in `ctx[0..5]`. Uses `bpf_get_func_ip(ctx)` for the real IP, matching the kprobe entry handler’s key.
Fexit handlers look up the existing probe_data entry (written by
fentry or kprobe at function entry) and re-read struct fields into
exit_fields. This captures post-mutation state for paired display.
BTF resolution
Two BTF sources:
- vmlinux BTF (`btf-rs`): resolves kernel struct offsets. Types in `STRUCT_FIELDS` (task_struct, rq, scx_dispatch_q, etc.) use curated field lists with chained pointer dereferences (e.g. `->cpus_ptr->bits[0]`). Other struct pointer params get scalar, enum, and cpumask pointer fields auto-discovered from vmlinux BTF.
- Program BTF (`libbpf-rs`): resolves BPF-local struct offsets for types not in vmlinux (e.g. scheduler-defined `task_ctx`). Auto-discovers scalar, enum, and cpumask pointer fields.
Callback signatures are resolved by:
- `____name` inner function in program BTF (typed params)
- `sched_ext_ops` member in vmlinux BTF (fallback)
- Wrapper function (`void *ctx`, no useful params)
Field decoding
The output formatter decodes field values based on their key name:
- `dsq_id` -> `SCX_DSQ_INVALID`, `SCX_DSQ_GLOBAL`, `SCX_DSQ_LOCAL`, `SCX_DSQ_BYPASS`, `SCX_DSQ_LOCAL_ON|{cpu}`, `BUILTIN({v})`, `DSQ(0x{hex})`
- `cpumask_0..3` -> coalesced into one `cpus_ptr` field rendered as `0x{hex}({cpu-list})` — the masked hex of the cpumask words (low-order word first; multi-word masks join with `_` between 64-bit chunks) followed by the run-length-collapsed CPU range list (e.g. `0xf(0-3)`, `0x1_00000000000000ff(0-7,64)`)
- `enq_flags` -> `WAKEUP|HEAD|PREEMPT`
- `exit_kind` -> `ERROR`, `ERROR_BPF`, `ERROR_STALL`, etc.
- `scx_flags` -> `QUEUED|ENABLED`
- `sticky_cpu` -> `-1` for `0xffffffff`
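The cpumask rendering, for instance, comes down to roughly this sketch:

```rust
// Hypothetical sketch of the cpumask rendering: masked hex of the words
// plus a run-length-collapsed CPU list, e.g. "0x1_00000000000000ff(0-7,64)".
fn render_cpumask(words: &[u64]) -> String {
    // Hex part: highest non-zero word unpadded, lower words zero-padded to
    // 16 digits, joined with '_'.
    let mut hi_to_lo = words.iter().rev().skip_while(|w| **w == 0);
    let hex = match hi_to_lo.next() {
        None => "0x0".to_string(),
        Some(top) => {
            let mut s = format!("0x{top:x}");
            for w in hi_to_lo { s.push_str(&format!("_{w:016x}")); }
            s
        }
    };
    // CPU list: run-length collapse the set bits (e.g. "0-7,64").
    let mut ranges = Vec::new();
    let mut run: Option<(usize, usize)> = None;
    for cpu in 0..words.len() * 64 {
        if (words[cpu / 64] >> (cpu % 64)) & 1 == 1 {
            run = match run { Some((s, _)) => Some((s, cpu)), None => Some((cpu, cpu)) };
        } else if let Some((s, e)) = run.take() {
            ranges.push(if s == e { s.to_string() } else { format!("{s}-{e}") });
        }
    }
    if let Some((s, e)) = run {
        ranges.push(if s == e { s.to_string() } else { format!("{s}-{e}") });
    }
    format!("{hex}({})", ranges.join(","))
}
```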
Event stitching
After the trigger fires, all probe_data entries are read, matched
to functions by IP, then filtered to a single task’s scheduling
journey:
- Read the task_struct pointer from the trigger event’s `bpf_get_current_task()` value (`args[0]`)
- For functions with a task_struct parameter: keep events where `args[param_idx] == tptr`
- For functions without a task_struct parameter: keep events where `task_ptr == tptr` (matched via `bpf_get_current_task()` at probe time)
Events are sorted by timestamp for chronological output.
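The filter and ordering reduce to roughly this sketch (field names illustrative, not the real event layout):

```rust
struct ProbeEvent {
    args: [u64; 6],                // raw entry arguments
    task_ptr: u64,                 // bpf_get_current_task() at probe time
    task_param_idx: Option<usize>, // index of a task_struct parameter, if any
    ts: u64,                       // capture timestamp
}

fn filter_journey(mut events: Vec<ProbeEvent>, tptr: u64) -> Vec<ProbeEvent> {
    events.retain(|e| match e.task_param_idx {
        Some(i) => e.args[i] == tptr, // task_struct parameter must match
        None => e.task_ptr == tptr,   // otherwise match the current task
    });
    events.sort_by_key(|e| e.ts);     // chronological output
    events
}
```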