Monitor

The monitor observes scheduler state from the host side by reading guest VM memory directly. It does not instrument the guest kernel or the scheduler under test.

What it reads

The monitor resolves kernel structure offsets via BTF (BPF Type Format) from the guest kernel. It reads per-CPU runqueue structures to extract:

  • nr_running – number of runnable tasks on each CPU
  • scx_nr_running – tasks managed by the sched_ext scheduler
  • rq_clock – runqueue clock value
  • local_dsq_depth – scx local dispatch queue depth
  • scx_flags – sched_ext flags for each CPU
  • scx event counters (fallback, keep-last, offline dispatch, skip-exiting, skip-migration-disabled, reenq-immed, reenq-local-repeat, refill-slice-dfl, bypass-duration, bypass-dispatch, bypass-activate, insert-not-owned, sub-bypass-dispatch)

When CONFIG_SCHEDSTATS is enabled, the monitor also reads per-CPU struct rq schedstat fields (run_delay, pcount, sched_count, ttwu_count, etc.).

The monitor walks the struct sched_domain tree whenever BTF contains rq->sd and struct sched_domain — no CONFIG_SCHEDSTATS required. The walk starts at rq->sd (the lowest level) and follows sd->parent pointers up to the root. Each domain level provides topology metadata (level, name, flags, span_weight), runtime fields (balance_interval, nr_balance_failed, max_newidle_lb_cost), and optional fields (newidle_call, newidle_success, newidle_ratio — added in 7.0, backported to 6.18.5+ and 6.12.65+; absent on 6.16-6.18.4). When CONFIG_SCHEDSTATS is also enabled, each domain additionally provides load-balancing stats: lb_count, lb_failed, lb_balanced, alb_pushed, ttwu_wake_remote, and other counters indexed by idle type (CPU_NOT_IDLE, CPU_IDLE, CPU_NEWLY_IDLE).
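
A minimal sketch of that walk, assuming a read_direct_u64 helper that reads a u64 through the direct mapping and that parent_off is the BTF-resolved byte offset of sd->parent (names and signatures are assumptions, not the crate's actual API):

fn walk_domains(kernel: &GuestKernel, rq_sd: u64, parent_off: u64) -> Vec<u64> {
    // Collect one sched_domain KVA per level, lowest level first.
    let mut levels = Vec::new();
    let mut sd = rq_sd;
    while sd != 0 {
        levels.push(sd);
        // Follow sd->parent toward the root; a NULL pointer terminates the tree.
        sd = kernel.read_direct_u64(sd + parent_off);
    }
    levels
}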

Sampling

The monitor takes periodic snapshots (MonitorSample) of all per-CPU state. Each sample captures a point-in-time view of every CPU.

MonitorSummary aggregates samples into peak values (max imbalance ratio, max DSQ depth, stall detection), per-sample averages (imbalance ratio, nr_running per CPU, DSQ depth per CPU), and event counter deltas. Averages are computed over valid samples only (excluding uninitialized guest memory).
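
A minimal sketch of that averaging rule, treating sample_looks_valid as a free function and assuming a per-sample imbalance accessor (both signatures are assumptions):

fn avg_imbalance_ratio(samples: &[MonitorSample]) -> Option<f64> {
    // Only samples that pass the plausibility check contribute to averages.
    let valid: Vec<f64> = samples
        .iter()
        .filter(|s| sample_looks_valid(s))
        .map(|s| s.imbalance_ratio()) // hypothetical per-sample accessor
        .collect();
    (!valid.is_empty()).then(|| valid.iter().sum::<f64>() / valid.len() as f64)
}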

Threshold evaluation

MonitorThresholds defines pass/fail conditions:

pub struct MonitorThresholds {
    pub max_imbalance_ratio: f64,  // ceiling on the nr_running imbalance ratio
    pub max_local_dsq_depth: u32,  // ceiling on scx local DSQ depth
    pub fail_on_stall: bool,       // treat a detected rq_clock stall as failure
    pub sustained_samples: usize,  // consecutive violating samples required to fail
    pub max_fallback_rate: f64,    // ceiling on fallback-dispatch event rate
    pub max_keep_last_rate: f64,   // ceiling on keep-last event rate
}

A violation must persist for sustained_samples consecutive samples before triggering a failure. This filters out transient spikes caused by cpuset transitions and cgroup creation/destruction.
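
A sketch of that rule (helper name illustrative):

fn sustained_violation(per_sample_violations: &[bool], sustained_samples: usize) -> bool {
    let mut consecutive = 0;
    for &violated in per_sample_violations {
        consecutive = if violated { consecutive + 1 } else { 0 };
        if consecutive >= sustained_samples {
            return true; // the violation persisted for the full window
        }
    }
    false // transient spikes never reach the threshold
}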

Stall detection

A stall is detected when a CPU’s rq_clock does not advance between consecutive samples. Three exemptions prevent false positives:

  • Idle CPUs: when nr_running == 0 in both the current and previous sample, the CPU has no runnable tasks. The kernel stops the tick (NOHZ) on idle CPUs, so rq_clock legitimately does not advance. These CPUs are excluded from stall checks.

  • Preempted vCPUs: when the vCPU thread’s CPU time advanced by less than the preemption threshold between samples, the host preempted the vCPU, so the guest clock could not advance. These samples are excluded from stall checks.

  • Sustained window: stall detection uses per-CPU consecutive counters and the sustained_samples threshold, matching how imbalance and DSQ depth checks work. A single stuck sample does not trigger failure – the stall must persist for sustained_samples consecutive samples on the same CPU.
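
Putting the exemptions together, a per-CPU check might look like this sketch (types and names illustrative):

struct CpuSample { rq_clock: u64, nr_running: u32 }

fn stall_candidate(prev: &CpuSample, cur: &CpuSample, vcpu_preempted: bool) -> bool {
    let clock_stuck = cur.rq_clock == prev.rq_clock;
    let idle = prev.nr_running == 0 && cur.nr_running == 0; // NOHZ: tick stopped
    // A candidate only feeds the per-CPU consecutive counter; failure
    // still requires sustained_samples consecutive hits on the same CPU.
    clock_stuck && !idle && !vcpu_preempted
}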

Uninitialized memory detection

Before the guest kernel initializes per-CPU structures, the monitor’s reads return uninitialized data. Two layers handle this:

  • Summary computation (MonitorSummary::from_samples): skips individual samples where any CPU’s local_dsq_depth exceeds DSQ_PLAUSIBILITY_CEILING (10,000) via sample_looks_valid().

  • Threshold evaluation (MonitorThresholds::evaluate): checks all samples globally for plausibility. If all rq_clock values are identical across every CPU and sample, or any sample exceeds the plausibility ceiling, the entire report is treated as “not yet initialized” and passes — no per-threshold checks run.
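
A sketch of the global gate, assuming MonitorSample carries per-CPU rq_clock and local_dsq_depth vectors (field names are assumptions):

const DSQ_PLAUSIBILITY_CEILING: u32 = 10_000;

fn report_initialized(samples: &[MonitorSample]) -> bool {
    // Gate 1: at least one rq_clock value must differ somewhere.
    let mut clocks = samples.iter().flat_map(|s| s.rq_clock.iter().copied());
    let first = match clocks.next() {
        Some(c) => c,
        None => return false,
    };
    let clocks_advance = clocks.any(|c| c != first);
    // Gate 2: no sample may exceed the DSQ plausibility ceiling.
    let depths_plausible = samples
        .iter()
        .all(|s| s.local_dsq_depth.iter().all(|&d| d <= DSQ_PLAUSIBILITY_CEILING));
    clocks_advance && depths_plausible
}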

BPF map introspection

The monitor module also provides host-side BPF map discovery and read/write access via bpf_map::BpfMapAccessor. The host reads and writes guest BPF maps directly through the physical memory mapping — no guest cooperation or BPF syscalls are needed.

GuestMem

GuestMem wraps a host pointer to the start of guest DRAM and provides bounds-checked volatile reads and writes for scalar types (u8/u32/u64). Byte-slice reads (read_bytes) use copy_nonoverlapping. It also implements x86-64 page table walks (translate_kva) for both 4-level and 5-level paging, and granule-agnostic aarch64 walks (4 KB / 16 KB / 64 KB; level count derived from TCR_EL1’s TG1 + T1SZ fields).

Scalar accesses use volatile semantics because the guest kernel modifies memory concurrently.
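
A minimal sketch of a bounds-checked volatile read in that style, assuming base points at the start of guest DRAM in the host address space (not the crate’s actual layout):

use std::ptr;

pub struct GuestMem {
    base: *const u8,
    len: usize,
}

impl GuestMem {
    pub fn read_u64(&self, gpa: usize) -> Option<u64> {
        // Bounds check before touching the mapping.
        let end = gpa.checked_add(std::mem::size_of::<u64>())?;
        if end > self.len {
            return None;
        }
        // Volatile: the guest kernel mutates this memory concurrently.
        // Assumes gpa is naturally aligned, as kernel scalars are.
        Some(unsafe { ptr::read_volatile(self.base.add(gpa) as *const u64) })
    }
}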

GuestKernel

GuestKernel builds on GuestMem by adding kernel symbol resolution and address translation. It parses the vmlinux ELF symbol table at construction and resolves paging configuration (PAGE_OFFSET, CR3, 4-level vs 5-level) from guest memory. Subsequent reads use cached state.

Three address translation modes are supported:

  • Text/data/bss: kva - __START_KERNEL_map. For statically-linked kernel variables (read_symbol_*, write_symbol_*).
  • Direct mapping: kva - PAGE_OFFSET. For SLAB allocations, per-CPU data, physically contiguous memory (read_direct_*).
  • Vmalloc/vmap: Page table walk via CR3. For BPF maps, vmalloc’d memory, module text (read_kva_*, write_kva_*).
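
A sketch of how the three modes could combine into a single dispatcher, in the spirit of the translate_any_kva helper mentioned below; the range checks, ordering, and helper names are assumptions:

fn translate_any(kernel: &GuestKernel, kva: u64) -> Option<u64> {
    if kva >= kernel.start_kernel_map() {
        // Text/data/bss: linear offset from __START_KERNEL_map,
        // shifted by phys_base to account for KASLR.
        Some(kva - kernel.start_kernel_map() + kernel.phys_base())
    } else if kernel.in_direct_map(kva) {
        // Direct mapping: SLAB allocations, per-CPU data.
        Some(kva - kernel.page_offset())
    } else {
        // Vmalloc/vmap: full page table walk through CR3.
        kernel.translate_kva(kva)
    }
}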

BpfMapAccessor

BpfMapAccessor resolves BTF offsets for BPF map kernel structures (struct bpf_map, struct bpf_array, struct xa_node, struct idr) and provides map discovery and value read/write. It borrows a GuestKernel for address translation.

BpfMapAccessorOwned is a convenience wrapper that owns the GuestKernel internally. Use BpfMapAccessor::from_guest_kernel when you already have a GuestKernel; use BpfMapAccessorOwned::new when you want a self-contained accessor.

Map discovery walks the kernel’s map_idr xarray:

  1. Read map_idr (BSS symbol, text mapping translation)
  2. Walk xa_node tree (SLAB-allocated, direct mapping translation)
  3. Read struct bpf_map fields. The allocation may be kmalloc’d or vmalloc’d depending on size and flags, so the translation uses translate_any_kva which handles both paths rather than assuming either.

find_map searches by name suffix (e.g. ".bss" matches "mitosis.bss"). Only BPF_MAP_TYPE_ARRAY maps are returned. Use maps() to enumerate all map types without filtering.

Value access for BPF_MAP_TYPE_ARRAY maps reads/writes the inline bpf_array.value flex array at the BTF-resolved offset. The value region is vmalloc’d, so each byte access goes through the page table walker to handle page boundaries.

For BPF_MAP_TYPE_PERCPU_ARRAY maps, bpf_array.pptrs[key] holds a __percpu pointer (at the same union offset as value). Adding __per_cpu_offset[cpu] yields the per-CPU KVA in the direct mapping. read_percpu_array returns one Option<Vec<u8>> per CPU: Some when the per-CPU PA falls within guest memory, None when it does not.
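
A sketch of that per-CPU read, assuming hypothetical helpers for the direct-mapping translation and the guest-memory bounds check:

fn read_percpu_array(
    kernel: &GuestKernel,
    pptr: u64,               // __percpu pointer from bpf_array.pptrs[key]
    per_cpu_offset: &[u64],  // guest __per_cpu_offset[] contents
    value_size: usize,
) -> Vec<Option<Vec<u8>>> {
    per_cpu_offset
        .iter()
        .map(|&off| {
            let kva = pptr.wrapping_add(off); // per-CPU KVA in the direct mapping
            let pa = kva.wrapping_sub(kernel.page_offset());
            let end = (pa as usize).checked_add(value_size)?;
            if end > kernel.dram_len() {
                return None; // PA outside guest memory
            }
            Some(kernel.read_direct_bytes(kva, value_size)) // hypothetical reader
        })
        .collect()
}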

Typed field access

When a map has BTF metadata (btf_kva != 0), resolve_value_layout reads the guest’s struct btf and its data blob, parses it with btf_rs, and resolves the value struct’s fields. This enables read_field / write_field with type-checked BpfValue variants.

Usage example

Find a scheduler’s .bss map and write a crash variable:

// Resolve BTF offsets for the map structures, then borrow the guest kernel.
let offsets = BpfMapOffsets::from_vmlinux(vmlinux)?;
let accessor = BpfMapAccessor::from_guest_kernel(&kernel, &offsets)?;
// find_map matches by suffix: ".bss" matches e.g. "mitosis.bss".
let bss = accessor.find_map(".bss").expect(".bss map not found");
accessor.write_value_u32(&bss, crash_offset, 1);

BpfMapWrite

BpfMapWrite specifies a host-side write to a BPF map during VM execution. The test runner waits for the scheduler to load (map becomes discoverable), writes the value, then signals the guest via SHM to start the scenario.

pub struct BpfMapWrite {
    pub map_name_suffix: &'static str,  // e.g. ".bss"
    pub offset: usize,                  // byte offset in the map value
    pub value: u32,                     // value to write
}

Use with #[ktstr_test] via the bpf_map_write attribute:

const BPF_CRASH: BpfMapWrite = BpfMapWrite {
    map_name_suffix: ".bss",
    offset: 42,
    value: 1,
};

#[ktstr_test(bpf_map_write = BPF_CRASH, expect_err = true)]
fn crash_test(ctx: &Ctx) -> Result<AssertResult> {
    Ok(AssertResult::pass())
}

The map is discovered by name suffix via BpfMapAccessor::find_map. Only BPF_MAP_TYPE_ARRAY maps are supported. The write targets a u32 at the specified byte offset within the map’s value region.
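
A sketch of the runner-side sequencing described above (the polling interval and GuestShm handle are illustrative; find_map returning an Option matches the earlier example):

fn apply_bpf_map_write(accessor: &BpfMapAccessor, w: &BpfMapWrite, shm: &GuestShm) {
    // Wait for the scheduler to load: the map becomes discoverable
    // once it appears in the kernel's map_idr.
    let map = loop {
        if let Some(m) = accessor.find_map(w.map_name_suffix) {
            break m;
        }
        std::thread::sleep(std::time::Duration::from_millis(10));
    };
    accessor.write_value_u32(&map, w.offset, w.value);
    shm.signal_scenario_start(); // hypothetical SHM doorbell
}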

Prerequisites

  • vmlinux: Required for ELF symbols and BTF. Must match the guest kernel. Symbols include phys_base so the runtime KASLR offset can be resolved via a page-table walk through the BSP’s CR3, breaking the chicken-and-egg dependency between text-symbol PA translation and KASLR.

Cast analysis

BPF maps frequently store kernel pointers (task_struct *, cgroup *, …) and arena pointers in u64 fields because BTF cannot express a pointer to a per-allocation type. Without intervention the renderer treats them as integers and the failure dump shows raw 0xffff…ffff values with no further chase.

The cast analyzer (monitor::cast_analysis::analyze_casts) closes that gap. The freeze coordinator runs it once per scheduler load, before any periodic capture or on-demand snapshot would consume its output:

  1. The host loads the scheduler binary and locates each .bpf.o ELF in the build artifacts.
  2. Each program section is decoded through cast_analysis::BpfInsn::from_le_bytes into a flat &[BpfInsn] slab; relocations against .bss / .data / .rodata annotate the corresponding BPF_LD_IMM64 PCs with their datasec target.
  3. analyze_casts walks the slab forward, tracking register and stack-slot state for each instruction. Two detection paths feed the output: the arena pointer path (LDX through a previously loaded u64 field) and the kernel kptr path (STX of a typed pointer register into a u64 field). Function-entry seeding from bpf_func_info reseeds R1..R5 from the BTF FuncProto so typed parameters propagate correctly across subprogram joins.
  4. The result is a CastMap (BTreeMap<(source_struct_btf_id, field_byte_offset), CastHit>) cached on the per-VM KtstrVm.cast_map (a LazyCastMap that runs the analyzer on first dump and caches the result process-wide by scheduler binary content hash). The freeze coordinator threads the cached CastMap through DumpContext::cast_map into every per-map render so the renderer can consult it at every dump site.
  5. render_cast_pointer in monitor::btf_render consumes CastHit via MemReader::cast_lookup. When a u64 field at a recorded (struct, offset) is rendered, the renderer chases the pointer through the address-space-appropriate reader (arena vs slab/vmalloc) and tags the result with a cast_annotation of "cast→arena" or "cast→kernel" (plus a (sdt_alloc) suffix when the bridge described below fired). Failure dumps show the annotation alongside the resolved struct fields, so cast-recovered pointers are visually distinct from BTF-typed ones.
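
A sketch of the render-time consult, using the CastMap key shape stated above (the CastHit payload and integer key types are assumptions):

use std::collections::BTreeMap;

enum CastTarget { Arena, Kernel }
struct CastHit { target: CastTarget }

type CastMap = BTreeMap<(u32, u32), CastHit>; // (source_struct_btf_id, field_byte_offset)

fn cast_annotation(map: &CastMap, struct_btf_id: u32, field_off: u32) -> Option<&'static str> {
    map.get(&(struct_btf_id, field_off)).map(|hit| match hit.target {
        CastTarget::Arena => "cast→arena",   // chase through the arena reader
        CastTarget::Kernel => "cast→kernel", // chase through slab/vmalloc
    })
}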

The renderer also consults an sdt_alloc bridge whenever a chase target peels to a BTF_KIND_FWD forward declaration (typical for struct sdt_data __arena * fields whose body lives in the sdt_alloc library’s BTF rather than the scheduler’s program BTF). The dump-state pre-pass walks each live scx_allocator and populates a slot_start → ArenaSlotInfo index — one entry per live allocator slot, carrying elem_size, header_size, and the resolved payload BTF type id — which MemReader::resolve_arena_type (in dump::render_map::AccessorMemReader) consults with a range lookup during the chase.

The lookup finds the slot whose [slot_start, slot_start + elem_size) range contains the chased address and routes by offset_in_slot. A slot-start chase (offset == 0, e.g. the data field of scx_task_map_val storing the raw sdt_alloc() return) returns the payload type id with header_skip = header_size; a payload-start chase (offset == header_size, e.g. the return of scx_task_data(p) cached in cached_taskc_raw) returns the same payload type id with header_skip = 0. The renderer reads header_skip + btf_size bytes from the chased address, slices off the leading header_skip bytes, and renders the payload struct.

The resulting Ptr carries an sdt_alloc-flavoured annotation: "sdt_alloc" on the BTF-typed Type::Ptr arm, and "cast→arena (sdt_alloc)" / "cast→kernel (sdt_alloc)" on the cast-analyzer-driven path. The bridge fires only when the BTF-only resolve has already exhausted same-name siblings; false-positive risk on that arm is bounded by the arena-window range check (MemReader::resolve_arena_type returns None for addresses outside every known allocator slot).
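
A sketch of that range lookup and offset routing (field types are assumptions):

use std::collections::BTreeMap;

struct ArenaSlotInfo {
    elem_size: u64,
    header_size: u64,
    payload_type_id: u32,
}

// Returns (payload type id, header_skip) for a chased address, or None
// when the address lies outside every known allocator slot.
fn resolve_arena_type(
    index: &BTreeMap<u64, ArenaSlotInfo>, // slot_start -> ArenaSlotInfo
    addr: u64,
) -> Option<(u32, u64)> {
    let (&slot_start, info) = index.range(..=addr).next_back()?;
    if addr >= slot_start + info.elem_size {
        return None; // past the end of the nearest slot
    }
    match addr - slot_start {
        0 => Some((info.payload_type_id, info.header_size)), // slot-start chase
        off if off == info.header_size => Some((info.payload_type_id, 0)), // payload-start chase
        _ => None,
    }
}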

A separate cross-BTF Fwd resolution path covers the case where a BTF_KIND_FWD pointee’s body lives in a sibling embedded BPF object’s BTF rather than an sdt_alloc slot — the typical multi-object shape where one .bpf.o declares struct cgx_target; (forward) and a sibling object defines struct cgx_target { ... } (full body).

The cast-analysis pre-pass (vmm::cast_analysis_load::build_fwd_index) walks every parsed embedded program BTF and records a name -> (btfs index, type_id) entry for every complete (!is_fwd) Type::Struct / Type::Union. First write wins on duplicate names: when the same name appears in multiple BTFs, the index keeps the first-seen entry. Anonymous types and Typedef are not indexed (there is no name to key on, and typedefs add no body — the chase peels through them via peel_modifiers_with_id before consulting the index).

The index is threaded through DumpContext::cross_btf and exposed to the renderer via MemReader::cross_btf_resolve_fwd. When chase_arena_pointer / render_cast_pointer peel a chase target through peel_modifiers_resolving_fwd and the local same-BTF sibling search comes up empty, try_cross_btf_fwd_resolve consults the cross-BTF index by the Fwd’s name (and aggregate kind — struct vs union); a hit returns a CrossBtfRef { btf, type_id } and the chase recursion switches to the resolved sibling BTF for the pointee render.

Cross-BTF resolution does not introduce a new annotation — the body is recovered transparently, and the rendered subtree carries the cast or BTF-typed annotation it would have had if the same struct lived in the entry BTF. Unlike the sdt_alloc bridge, the cross-BTF index is consulted whenever a Fwd terminal survives the local resolve — there is no arena-window gate, since the lookup is purely a name-keyed BTF table and a name miss simply leaves the chase on its existing “forward declaration; body not in this BTF” skip path.

The analyzer is deliberately conservative: branch joins reset register and stack state, conflicts drop the offending entry, and self-stores are rejected. False negatives fall back to raw u64 (the prior behavior); false positives would chase garbage and are avoided. The analysis is unconditional — no test-author configuration, no opt-in flag — and the freeze coordinator wires the resulting CastMap through every snapshot, periodic capture, and failure dump.

Probe pipeline

The probe pipeline captures function arguments and struct fields during auto-repro. It operates inside the guest VM (not from the host), using two BPF skeletons that share maps.

Architecture

crash stack -> extract functions -> BTF resolve -> load skeletons -> poll
                                                         |
                                    kprobe skeleton      |     fentry/fexit skeleton
                                    (kernel entry)       |     (BPF entry + kernel exit)
                                         |               |          |
                                         v               v          v
                                    func_meta_map  <--shared-->  probe_data
                                                         |        (entry + exit fields)
                                              trigger fires (ring buffer)
                                                         |
                                              read probe_data entries
                                                         |
                                              stitch by tptr
                                                         |
                                              format with entry→exit diffs

Kprobe skeleton (probe.bpf.c)

Attaches to kernel functions via attach_kprobe. The BPF handler:

  1. Gets the function IP via bpf_get_func_ip
  2. Looks up func_meta from func_meta_map (keyed by IP)
  3. Captures 6 raw args from pt_regs
  4. Dereferences struct fields via BTF-resolved offsets
  5. Reads char * string params if configured
  6. Stores result in probe_data (keyed by (func_ip, task_ptr))

The trigger fires via tp_btf/sched_ext_exit (inside scx_claim_exit()) and sends an EVENT_TRIGGER via ring buffer with the current task pointer and kernel stack.

Fentry/fexit skeleton (fentry_probe.bpf.c)

Handles both BPF struct_ops callbacks and kernel function exit capture. Loaded in batches of 4 fentry + 4 fexit programs per skeleton instance via set_attach_target. Shares probe_data and func_meta_map with the kprobe skeleton via reuse_fd.

A per-slot is_kernel rodata flag controls argument access:

  • BPF callbacks (is_kernel=0): ctx[0] is a void pointer to the real callback arguments. The handler dereferences through it. Uses sentinel IPs (func_idx | (1<<63)) in func_meta_map.
  • Kernel functions (is_kernel=1): args are directly in ctx[0..5]. Uses bpf_get_func_ip(ctx) for the real IP, matching the kprobe entry handler’s key.

Fexit handlers look up the existing probe_data entry (written by fentry or kprobe at function entry) and re-read struct fields into exit_fields. This captures post-mutation state for paired display.
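
A sketch of the two keying schemes side by side (helper name illustrative):

fn func_meta_key(is_kernel: bool, func_idx: u64, func_ip: u64) -> u64 {
    if is_kernel {
        func_ip                  // real IP from bpf_get_func_ip, matches the kprobe key
    } else {
        func_idx | (1u64 << 63)  // sentinel IP for BPF struct_ops callbacks
    }
}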

BTF resolution

Two BTF sources:

  • vmlinux BTF (btf-rs): resolves kernel struct offsets. Types in STRUCT_FIELDS (task_struct, rq, scx_dispatch_q, etc.) use curated field lists with chained pointer dereferences (e.g. ->cpus_ptr->bits[0]). Other struct pointer params get scalar, enum, and cpumask pointer fields auto-discovered from vmlinux BTF.

  • Program BTF (libbpf-rs): resolves BPF-local struct offsets for types not in vmlinux (e.g. scheduler-defined task_ctx). Auto-discovers scalar, enum, and cpumask pointer fields.

Callback signatures are resolved by:

  1. ____name inner function in program BTF (typed params)
  2. sched_ext_ops member in vmlinux BTF (fallback)
  3. Wrapper function (void *ctx, no useful params)

Field decoding

The output formatter decodes field values based on their key name:

  • dsq_id -> SCX_DSQ_INVALID, SCX_DSQ_GLOBAL, SCX_DSQ_LOCAL, SCX_DSQ_BYPASS, SCX_DSQ_LOCAL_ON|{cpu}, BUILTIN({v}), DSQ(0x{hex})
  • cpumask_0..3 -> coalesced into one cpus_ptr field rendered as 0x{hex}({cpu-list}) — the masked hex of the cpumask words (low-order word first; multi-word masks join with _ between 64-bit chunks) followed by the run-length-collapsed CPU range list (e.g. 0xf(0-3), 0x1_00000000000000ff(0-7,64))
  • enq_flags -> WAKEUP|HEAD|PREEMPT
  • exit_kind -> ERROR, ERROR_BPF, ERROR_STALL, etc.
  • scx_flags -> QUEUED|ENABLED
  • sticky_cpu -> -1 for 0xffffffff
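
A sketch of the dsq_id decoder; the bit positions follow sched_ext’s builtin-DSQ encoding (SCX_DSQ_FLAG_BUILTIN in bit 63, SCX_DSQ_FLAG_LOCAL_ON in bit 62), but the exact constants, and the omission of SCX_DSQ_BYPASS, are assumptions here:

fn format_dsq_id(v: u64) -> String {
    const BUILTIN: u64 = 1 << 63;  // SCX_DSQ_FLAG_BUILTIN (assumed)
    const LOCAL_ON: u64 = 1 << 62; // SCX_DSQ_FLAG_LOCAL_ON (assumed)
    if v & BUILTIN == 0 {
        return format!("DSQ(0x{v:x})"); // user-created DSQ id
    }
    if v & LOCAL_ON != 0 {
        return format!("SCX_DSQ_LOCAL_ON|{}", v as u32); // CPU in the low bits
    }
    // SCX_DSQ_BYPASS omitted: its value is scheduler-specific here.
    match v & !BUILTIN {
        0 => "SCX_DSQ_INVALID".to_string(),
        1 => "SCX_DSQ_GLOBAL".to_string(),
        2 => "SCX_DSQ_LOCAL".to_string(),
        n => format!("BUILTIN({n})"),
    }
}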

Event stitching

After the trigger fires, all probe_data entries are read, matched to functions by IP, then filtered to a single task’s scheduling journey:

  1. Read the task_struct pointer from the trigger event’s bpf_get_current_task() value (args[0])
  2. For functions with a task_struct parameter: keep events where args[param_idx] == tptr
  3. For functions without a task_struct parameter: keep events where task_ptr == tptr (matched via bpf_get_current_task() at probe time)

Events are sorted by timestamp for chronological output.
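
A sketch of the filter and sort, assuming each captured entry carries its function IP, the probing task pointer, raw args, and a timestamp (names illustrative):

struct ProbeEvent {
    func_ip: u64,
    task_ptr: u64, // bpf_get_current_task() at probe time
    args: [u64; 6],
    ts_ns: u64,
}

fn stitch_for_task(
    mut events: Vec<ProbeEvent>,
    tptr: u64,
    task_param_idx: impl Fn(u64) -> Option<usize>, // per-function task_struct arg slot
) -> Vec<ProbeEvent> {
    events.retain(|e| match task_param_idx(e.func_ip) {
        Some(i) => e.args[i] == tptr, // function takes a task_struct *
        None => e.task_ptr == tptr,   // fall back to current-task match
    });
    events.sort_by_key(|e| e.ts_ns);  // chronological output
    events
}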