ctprof

The ctprof profiler captures a host-wide per-thread snapshot of scheduling counters, memory / I/O accounting, CPU affinity, cgroup state, and thread identity, then compares two snapshots to surface what changed. It is a manually-invoked CLI companion to the automated scheduler tests — useful when a run passes on one machine and fails on another, or for A/B comparing host behaviour across kernel / sysctl / workload changes.

This is a different tool from cargo ktstr show-host, which captures the host context (kernel, CPU model, sched_* tunables, NUMA layout, kernel cmdline) — aggregate state that does not change between scenarios. The profiler captures per-thread cumulative counters that do change, and its comparison surface is designed for the thread-level diff.

When to use it

  • Workload investigation — you observe a regression and want to know which process / thread pool moved in run time, context-switch rate, or migration count.
  • Kernel / sysctl A/B — capture before and after flipping a sched_* tunable on an otherwise-identical workload; the compare output surfaces every counter that responded.
  • Host baselining — capture on a known-good host, capture on a failing host, compare to isolate what differs at the thread-behaviour level.

The profiler is not invoked automatically by scenarios or the gauntlet. It is opt-in and operator-driven via the ktstr ctprof subcommand.

Capture

ktstr ctprof capture --output baseline.ctprof.zst
# ... run workload, change a tunable, reboot a kernel, etc. ...
ktstr ctprof capture --output after.ctprof.zst

capture walks /proc for every live thread group, enumerates each thread, and reads a handful of procfs sources for each one. The output is a zstd-compressed JSON snapshot (conventional extension: .ctprof.zst).
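Because the container is plain zstd-framed JSON, a snapshot can be inspected without ktstr. A minimal sketch, assuming the zstd and serde_json crates (field names per the File format section below):

```rust
use std::fs::File;

// Decompress a .ctprof.zst and peek at top-level fields without depending
// on ktstr's own types. `captured_at_unix_ns` and `threads` are per the
// CtprofSnapshot schema documented in the File format section.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("baseline.ctprof.zst")?;
    let json = zstd::decode_all(file)?; // one zstd frame -> raw JSON bytes
    let snap: serde_json::Value = serde_json::from_slice(&json)?;
    println!(
        "captured_at_unix_ns={} threads={}",
        snap["captured_at_unix_ns"],
        snap["threads"].as_array().map_or(0, |t| t.len()),
    );
    Ok(())
}
```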

What is captured per thread

  • Identity — tid, tgid, pcomm (process name from /proc/<tgid>/comm), comm (thread name from /proc/<tid>/comm), cgroup v2 path, start_time_clock_ticks (from /proc/<tid>/stat field 22, in USER_HZ clock ticks), scheduling policy name, nice, CPU affinity mask.
  • Scheduling counters (cumulative, from /proc/<tid>/sched; schedstat fields gated by CONFIG_SCHEDSTATS, run_time_ns/wait_time_ns/timeslices gated by CONFIG_SCHED_INFO) — run_time_ns, wait_time_ns, timeslices, voluntary_csw, nonvoluntary_csw, nr_wakeups (plus _local / _remote / _sync / _migrate splits), nr_migrations, wait_sum / wait_count, voluntary_sleep_ns (capture-side normalized as sum_sleep_runtime - sum_block_runtime so the kernel’s sleep/block double-count is stripped before the value reaches the snapshot), block_sum, iowait_sum / iowait_count, core_forceidle_sum, wait_max / sleep_max / block_max / exec_max / slice_max (lifetime peaks).
  • Memory — minflt / majflt from /proc/<tid>/stat. allocated_bytes / deallocated_bytes from the jemalloc per-thread TSD counters (tsd_s.thread_allocated / thread_deallocated) read via ptrace + process_vm_readv — populated only for processes linked against jemalloc; glibc arena counters are opaque and read as zero rather than failing capture. smaps_rollup_kb (per-process map of the kernel’s /proc/<tid>/smaps_rollup keys, populated leader-only).
  • I/O — rchar, wchar, syscr, syscw, read_bytes, write_bytes, cancelled_write_bytes from /proc/<tid>/io (requires CONFIG_TASK_IO_ACCOUNTING). Note that cancelled_write_bytes records on the truncating task — not the original writer — so it pairs with write_bytes as a group-level signal but per-thread arithmetic between the two is not meaningful.
  • Taskstats delay accounting + watermarks — eight delay categories × four fields each (count, total_ns, max_ns, min_ns) plus hiwater_rss_bytes and hiwater_vm_bytes peaks, pulled via the kernel’s TASKSTATS genetlink family. Requires CAP_NET_ADMIN on the capturing process; delay-family fields additionally require CONFIG_TASK_DELAY_ACCT and the runtime delayacct=on toggle, watermark fields require CONFIG_TASK_XACCT. See the Taskstats delay accounting section below for the full field list, gating, and per-bucket semantic caveats.
  • Cgroup + host-level PSI — cpu.stat / memory.current aggregates per cgroup (see Per-cgroup enrichment), plus psi (Pressure Stall Information) under each cgroup and at the host level. Requires CONFIG_PSI.
  • sched_ext sysfs — state, switch_all, nr_rejected, hotplug_seq, enable_seq from /sys/kernel/sched_ext/. Present only when CONFIG_SCHED_CLASS_EXT is built.

Field families and probe-timing invariance:

  • Cumulative counters and totals (the majority): wakeups, migrations, csw, run/wait/sleep/block/iowait time, schedstat counts, page-fault counters, syscall counters, byte counters, the taskstats per-bucket *_count and *_delay_total_ns, and the jemalloc per-thread TSD counters. Sampled twice at different instants, the value increases monotonically; probe attachment time does not alter the reading.
  • Lifetime extrema: schedstat *_max family (wait_max, sleep_max, block_max, exec_max, slice_max), every taskstats *_delay_max_ns / *_delay_min_ns, and the memory watermarks (hiwater_rss_bytes, hiwater_vm_bytes). Per-event extrema rather than sums. The *_max and hiwater_* fields are non-DECREASING over time (kernel keeps the largest); the *_delay_min_ns fields are non-INCREASING (kernel keeps the smallest non-zero observation, so sentinel 0 means “no events observed” — compare against the matching *_count).
  • Instantaneous gauges (sensitive to probe timing): nr_threads (signal_struct->nr_threads snapshot), fair_slice_ns (current p->se.slice), and state (task_state_array letter). Sampled at capture time and can legitimately differ between two probes of the same thread.
  • Categorical / ordinal scalars: policy, nice, priority, processor, rt_priority, plus identity strings (pcomm, comm, cgroup) and the cpu_affinity cpuset. Sampled at capture time and can change at runtime (e.g. sched_setaffinity mid-run flips processor and cpu_affinity), so they share the gauge family’s probe-timing sensitivity.

Metrics that reset on attachment (perf_event_open counters, BPF tracing samples, etc.) are intentionally absent — they require long-lived instrumentation the capture layer cannot install without disturbing the system it is measuring.
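A compact illustration of why the family matters when two snapshots are diffed; the names here are illustrative, not ktstr's types:

```rust
enum Family {
    Cumulative,        // counters/totals: after - before is the work done
    NonDecreasingPeak, // *_max / hiwater_*: only growth is informative
    Gauge,             // nr_threads / fair_slice_ns: point samples
}

// Interpret a two-snapshot reading per the field-family taxonomy above.
fn interpret(family: Family, before: u64, after: u64) -> String {
    match family {
        // A monotonic counter that shrank means the tid was recycled between probes.
        Family::Cumulative if after < before => "tid recycled between snapshots".into(),
        Family::Cumulative => format!("+{} since baseline", after - before),
        Family::NonDecreasingPeak => format!("peak moved {} -> {}", before, after),
        // Gauges legitimately move in either direction; the sign carries no cumulative meaning.
        Family::Gauge => format!("gauge {} -> {} (probe-timing sensitive)", before, after),
    }
}

fn main() {
    println!("{}", interpret(Family::Cumulative, 1_000, 5_000)); // +4000 since baseline
}
```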

Capture is best-effort

Each internal reader returns Option; a kernel without CONFIG_SCHED_DEBUG yields None from the /proc/<tid>/sched reader (and a kernel without CONFIG_SCHEDSTATS yields None from /proc/<tid>/schedstat and the schedstat-gated /proc/<tid>/sched keys) without failing the rest of the thread. Counters collapse to 0, identity strings collapse to empty, affinity collapses to an empty vec. A missing reading is indistinguishable from a genuine zero in the output — the contract is “never fail the snapshot.” Tests that need stronger guarantees inspect the underlying readers directly (they remain Option-shaped and are unit-tested in the module).
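A minimal sketch of the reader shape, with a hypothetical schedstat reader (the real module's readers are richer but share the Option contract):

```rust
use std::fs;

// Option-shaped reader: None for a missing file (kernel lacks the config),
// a bad parse, or a tid that exited mid-walk -- never an error.
fn read_schedstat_run_ns(tid: u32) -> Option<u64> {
    let raw = fs::read_to_string(format!("/proc/{tid}/schedstat")).ok()?;
    // /proc/<tid>/schedstat: "<run_ns> <wait_ns> <timeslices>"
    raw.split_whitespace().next()?.parse().ok()
}

// Snapshot assembly collapses None to zero per the best-effort contract;
// the genuine-zero ambiguity described above starts exactly here.
fn stamp(tid: u32) -> u64 {
    read_schedstat_run_ns(tid).unwrap_or(0)
}

fn main() {
    println!("tid 1 run_ns = {}", stamp(1));
}
```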

Per-cgroup enrichment

Every cgroup in which at least one sampled thread resides gets a CgroupStats entry. Fields nest under per-controller sub-structs:

  • cpu: CgroupCpuStats — usage_usec, nr_throttled, throttled_usec (from cpu.stat); max_quota_us, max_period_us (from cpu.max); weight, weight_nice (from cpu.weight / cpu.weight.nice).
  • memory: CgroupMemoryStats — current (from memory.current); max, high, low, min (from the matching memory.* files; low and min are protection floors, max and high are limits); stat and events as flat key-value maps mirroring memory.stat and memory.events.
  • pids: CgroupPidsStats — current and max from the optional pids controller.
  • psi: Psi — per-cgroup Pressure Stall Information from <cgroup>/cpu.pressure / memory.pressure / io.pressure / irq.pressure (gated on CONFIG_PSI).

All fields are read directly from cgroup v2 files, NOT derived from per-thread data, because those are aggregate-over-the-cgroup values.
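The cgroup v2 stat files are flat key-value lines, so the enrichment readers reduce to a key-value scan. An illustrative sketch (the real reader populates the typed sub-structs above):

```rust
use std::{collections::BTreeMap, fs};

// Parse a cgroup v2 "key value" file (cpu.stat, memory.stat, and
// memory.events all share this shape) into a flat map, skipping
// malformed lines.
fn read_kv(path: &str) -> BTreeMap<String, u64> {
    fs::read_to_string(path)
        .unwrap_or_default()
        .lines()
        .filter_map(|line| {
            let (key, value) = line.split_once(' ')?;
            Some((key.to_string(), value.parse().ok()?))
        })
        .collect()
}

fn main() {
    let cpu = read_kv("/sys/fs/cgroup/system.slice/cpu.stat");
    // usage_usec / nr_throttled / throttled_usec per the CgroupCpuStats fields above.
    println!("{:?}", cpu.get("usage_usec"));
}
```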

Snapshot identity

The top-level CtprofSnapshot also embeds a HostContext (the same structure show-host prints — kernel, CPU, memory, sched_* tunables, cmdline). Older tools or synthetic fixtures that omit the context render (host context unavailable) rather than failing the compare.

Cgroup namespace caveat

The per-thread cgroup path is read verbatim from /proc/<tid>/cgroup — it is therefore relative to the cgroup namespace root the capturing process sees, NOT the system-global v2 mount root. A process inside a nested cgroup namespace sees a truncated path; a process outside sees a longer one. Cross-namespace comparison requires external canonicalization (the capture layer deliberately does not attempt it because the right resolution depends on capture-site privilege and namespace visibility).

Taskstats delay accounting

The kernel’s TASKSTATS genetlink family delivers per-task delay-accounting and memory-watermark fields that are NOT exposed via /proc/<tid>/sched or /proc/<tid>/stat. ctprof captures them through crate::taskstats — it opens a netlink socket, resolves the family id via CTRL_CMD_GETFAMILY, and issues one TASKSTATS_CMD_GET query per tid. The 34 captured fields (8 delay categories × 4 bucket fields + 2 watermarks) all tag Section::TaskstatsDelay so they can be filtered as a unit.

Capability and kconfig gating

Calling the netlink family requires CAP_NET_ADMIN on the capturing process (kernel/taskstats.c::taskstats_ops registers TASKSTATS_CMD_GET with GENL_ADMIN_PERM). ktstr always runs as root in production so the cap is implicit, but a non-root operator running ktstr ctprof capture will hit EPERM on the first query_tid call and every taskstats field will collapse to zero per the best-effort capture contract.

Per-family kconfig gates and runtime toggles:

  • Delay-accounting fields (*_delay_count, *_delay_total_ns, *_delay_max_ns, *_delay_min_ns across the eight categories): require CONFIG_TASKSTATS=y AND CONFIG_TASK_DELAY_ACCT=y AND the runtime delayacct=on toggle (sysctl kernel.task_delayacct=1 or boot param delayacct). The runtime toggle is a separate condition beyond the build-time gates — a kernel built with both CONFIGs but launched without delayacct=on produces all-zero delay readings. ktstr’s standard kernel build includes both kconfigs; the test harness adds delayacct to the guest cmdline.
  • Memory-watermark fields (hiwater_rss_bytes, hiwater_vm_bytes): require CONFIG_TASKSTATS=y AND CONFIG_TASK_XACCT=y. They do NOT respond to the delayacct=on runtime toggle — xacct_add_tsk (kernel/tsacct.c) is unconditional once CONFIG_TASK_XACCT is built. xacct_add_tsk reads watermarks from the SHARED mm_struct, so sibling threads of the same tgid all report identical values; kernel threads (mm == NULL) read zero by design.

Any failed gate or missing cap collapses the affected fields to zero. ktstr’s capture pipeline emits an info-level tracing line per snapshot summarizing taskstats outcomes AND attaches the structured tally to CtprofSnapshot::taskstats_summary (ok_count / eperm_count / esrch_count / other_err_count), so an operator can distinguish “kernel doesn’t expose this” from “every tid raced exit” from “CAP_NET_ADMIN missing” without scraping log lines.
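Since an all-zero delay family has several possible causes, probing the runtime toggle first is cheap. A sketch that reads the standard kernel.task_delayacct sysctl (present on v5.14+ kernels built with delay accounting):

```rust
use std::fs;

// Distinguish "toggle off" from "knob absent" before blaming zeros on the
// workload. The sysctl file mirrors the delayacct= boot parameter.
fn delayacct_enabled() -> Option<bool> {
    let raw = fs::read_to_string("/proc/sys/kernel/task_delayacct").ok()?; // None: knob absent
    Some(raw.trim() == "1")
}

fn main() {
    match delayacct_enabled() {
        Some(true) => println!("delayacct on: zero delay fields are genuine zeros"),
        Some(false) => println!("delayacct off: delay family will read all-zero"),
        None => println!("no task_delayacct sysctl: pre-5.14 kernel or config gate off"),
    }
}
```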

Eight delay categories

| Category | Source | Notes |
|---|---|---|
| cpu_delay_* | tsk->sched_info.{pcount,run_delay} via delayacct_add_tsk (kernel/delayacct.c) | Time waiting on the runqueue. RACY: count + total are not updated atomically (lockless sched_info path); a concurrent reader may observe one ahead of the other. Captures the same wait-for-CPU bucket as schedstat wait_* via a different code path. |
| blkio_delay_* | delayacct_blkio_start / _end (kernel/delayacct.c) | Synchronous block I/O wait. Updates serialize through task->delays->lock, so count + total are atomic (unlike cpu_*). The canonical delay-accounting block-I/O reading; distinct from schedstat iowait_sum. |
| swapin_delay_* | delayacct_swapin_start / _end (include/linux/delayacct.h) | Swap-in wait. OVERLAPS with thrashing_* — every thrashing event is also a swapin event from the syscall layer; do not sum the two. |
| freepages_delay_* | delayacct_freepages_start / _end (mm/page_alloc.c) | Direct memory reclaim wait. |
| thrashing_delay_* | delayacct_thrashing_start / _end (mm/workingset.c) | Thrashing wait. Refines swapin tracking — see swapin_*. |
| compact_delay_* | delayacct_compact_start / _end (mm/compaction.c) | Memory-compaction wait. |
| wpcopy_delay_* | delayacct_wpcopy_start / _end (mm/memory.c) | Write-protect-copy (CoW) fault wait. Introduced in taskstats v13. |
| irq_delay_* | delayacct_irq (kernel/delayacct.c) | IRQ-handler windows charged to the task by IRQ accounting. Introduced in taskstats v14. |

Each category has four fields:

  • *_count — number of windows observed (MonotonicCount, SumCount).
  • *_delay_total_ns — cumulative ns of delay (MonotonicNs, SumNs).
  • *_delay_max_ns — longest single window observed (PeakNs, MaxPeak).
  • *_delay_min_ns — shortest non-zero window observed (PeakNs, MaxPeak). Sentinel 0 means “no events observed”, NOT “saw a zero-ns event”; compare against the matching *_count to disambiguate.

The two memory watermarks (hiwater_rss_bytes, hiwater_vm_bytes) are PeakBytes / MaxPeakBytes — see the MaxPeakBytes row in the Aggregation rules section below for the shared-mm semantics.
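A sketch of the count-gated read implied by the sentinel rule above (field names match the list; the helper is hypothetical):

```rust
// *_delay_min_ns carries sentinel 0 for "no events observed"; only a
// non-zero matching *_count makes the minimum a real observation.
fn shortest_window_ns(delay_count: u64, delay_min_ns: u64) -> Option<u64> {
    (delay_count > 0).then_some(delay_min_ns)
}

fn main() {
    assert_eq!(shortest_window_ns(0, 0), None); // never blocked: no minimum to report
    assert_eq!(shortest_window_ns(12, 3_500), Some(3_500)); // 12 events, shortest 3.5 µs
}
```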

Compare

ktstr ctprof compare before.ctprof.zst after.ctprof.zst

compare joins the two snapshots on pcomm (process name) by default — see Grouping for the other axes — and emits one row per (group, metric) pair. Groups present on only one side surface as unmatched — a row is missing because the process did not exist, not because it did zero work.

Grouping

  • --group-by pcomm (default) — aggregate every thread of the same process together.
  • --group-by cgroup — aggregate by cgroup path. Useful for container-per-workload deployments where the process name is ambiguous across cgroups.
  • --group-by comm — aggregate by thread name across every process under token-based pattern normalization (tokio-worker-{0..N} → one bucket; kworker/0:1H-events_highpri, kworker/1:0H-events_highpri, … → one bucket; see the sketch after this list). Useful when a thread-pool name spans many binaries and you want one row per pool, not per binary. Disable normalization with --no-thread-normalize.
  • --group-by comm-exact — synonym for --group-by comm --no-thread-normalize. Aggregate by literal thread name, no pattern collapse. Use when distinct token values carry meaning (e.g. tracking each kworker/u8:N independently).
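A minimal sketch of the token-based collapse; this is illustrative, not ktstr's actual normalizer:

```rust
// Collapse digit runs so "tokio-worker-7" and "tokio-worker-23" land in one
// bucket. Illustrative only; the placeholder spelling and the handling of
// the kworker/<cpu>:<id> shape are assumptions.
fn normalize_comm(comm: &str) -> String {
    let mut out = String::new();
    let mut in_digits = false;
    for ch in comm.chars() {
        if ch.is_ascii_digit() {
            if !in_digits {
                out.push_str("{N}");
                in_digits = true;
            }
        } else {
            in_digits = false;
            out.push(ch);
        }
    }
    out
}

fn main() {
    assert_eq!(normalize_comm("tokio-worker-7"), "tokio-worker-{N}");
    // Both kworker examples from the list above land in one bucket:
    assert_eq!(
        normalize_comm("kworker/0:1H-events_highpri"),
        "kworker/{N}:{N}H-events_highpri"
    );
}
```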

Cgroup-path flattening

ktstr ctprof compare before.ctprof.zst after.ctprof.zst \
    --group-by cgroup \
    --cgroup-flatten '/kubepods/*/pod-*/container' \
    --cgroup-flatten '/system.slice/*.scope'

--cgroup-flatten accepts glob patterns that collapse dynamic segments (pod UUIDs, session scopes, transient unit IDs) to a canonical form before grouping, so the same logical workload across two runs lands on the same row even if the kernel assigned different UUIDs.
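An illustrative segment-wise matcher showing the flattening idea (ktstr's actual glob semantics may differ): a path that matches a pattern is replaced by the pattern itself, which then serves as the group key.

```rust
// Match one path segment against one pattern segment, where a single '*'
// matches any run of characters ("pod-*" matches "pod-3fd1...").
fn seg_matches(pat: &str, seg: &str) -> bool {
    match pat.split_once('*') {
        None => pat == seg,
        Some((pre, suf)) => {
            seg.len() >= pre.len() + suf.len() && seg.starts_with(pre) && seg.ends_with(suf)
        }
    }
}

// Canonicalize: the first matching --cgroup-flatten pattern becomes the key.
fn flatten<'a>(patterns: &[&'a str], path: &'a str) -> &'a str {
    patterns
        .iter()
        .find(|pat| {
            let (p, s): (Vec<_>, Vec<_>) =
                (pat.split('/').collect(), path.split('/').collect());
            p.len() == s.len() && p.iter().zip(&s).all(|(a, b)| seg_matches(a, b))
        })
        .copied()
        .unwrap_or(path)
}

fn main() {
    let pats = ["/kubepods/*/pod-*/container"];
    assert_eq!(
        flatten(&pats, "/kubepods/burstable/pod-9f1c/container"),
        "/kubepods/*/pod-*/container"
    );
}
```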

Filtering output: --sections vs --metrics

Two complementary filters narrow the rendered output:

  • --sections picks which sub-tables render. The default-empty value renders every section that has data; passing a comma-separated list restricts output to the named sub-tables — every section not listed is suppressed before its data-availability gate runs. Valid section names: primary, taskstats-delay, derived, cgroup-stats, cgroup-limits, memory-stat, memory-events, pressure, host-pressure, smaps-rollup, sched-ext. Five (cgroup-stats, cgroup-limits, memory-stat, memory-events, pressure) require --group-by cgroup; naming any of them under a non-cgroup grouping emits a stderr warning and renders zero rows.
  • --metrics picks which rows render inside the primary and derived sub-tables. The default-empty value renders every metric; passing a comma-separated list restricts the rendered rows to the named metrics. Names must come from the ctprof metric-list vocabulary (CTPROF_METRICS + CTPROF_DERIVED_METRICS). Has no effect on the secondary sub-tables (cgroup-stats, smaps-rollup, etc.) — those have fixed column shapes and ignore the row filter.

The two compose multiplicatively: --sections primary --metrics run_time_ns shows a single row in the primary sub-table and nothing else. --sections primary alone keeps every primary row; --metrics run_time_ns alone keeps the single row across every section that displays it.

Each metric carries exactly one Section tag in its registry entry — the 34 taskstats-sourced primary rows and the 9 taskstats-derived rows tag Section::TaskstatsDelay rather than Section::Primary / Section::Derived. They render inside the same primary / derived outer tables but match a distinct section name, so --sections taskstats-delay selects exactly the 34 + 9 taskstats rows alone, while --sections primary excludes them and --sections derived excludes the 9 taskstats derivations. The three-way split lets an operator scope to non-taskstats only, taskstats only, or any combination, without losing the visual grouping under the same outer headers.

Aggregation rules

Each metric declares its own aggregation rule (CTPROF_METRICS in src/ctprof_compare.rs). The AggRule enum is typed: each variant binds an accessor of a specific metric_types newtype (MonotonicCount, MonotonicNs, PeakNs, Bytes, etc.) so a registry entry that pairs a peak field with a sum reduction (e.g. t.wait_max (PeakNs) bound to a Sum* rule) fails to compile rather than producing a meaningless 1×1s ⊕ 1000×1ms aggregate. The 14 variants split into five families: Sum reductions, Max reductions, Range reductions, Mode reductions, and the Affinity reduction.
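A condensed sketch of the typed-variant pattern (names abridged; the real enum has 14 variants): each variant fixes the accessor's return newtype, so a peak cannot flow into a sum.

```rust
// Newtype wrappers carry the kernel-source semantic of the field.
struct MonotonicNs(u64); // cumulative time: summable
struct PeakNs(u64);      // lifetime high-water mark: max-only

struct ThreadState {
    run_time_ns: MonotonicNs,
    wait_max: PeakNs,
}

// Each variant binds an accessor of exactly one newtype.
enum AggRule {
    SumNs(fn(&ThreadState) -> MonotonicNs),
    MaxPeak(fn(&ThreadState) -> PeakNs),
}

fn main() {
    let registry = vec![
        AggRule::SumNs(|t| MonotonicNs(t.run_time_ns.0)),
        AggRule::MaxPeak(|t| PeakNs(t.wait_max.0)),
        // AggRule::SumNs(|t| PeakNs(t.wait_max.0)) <- type error: expected MonotonicNs
    ];
    let _ = registry.len();
}
```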

Sum reductions (cumulative counters)

| Variant | Newtype | Output unit | Examples |
|---|---|---|---|
| SumCount | MonotonicCount | unitless | nr_wakeups (+ _local / _remote / _sync / _migrate / _affine / _affine_attempts), nr_migrations, nr_forced_migrations, nr_failed_migrations_*, voluntary_csw, nonvoluntary_csw, minflt, majflt, wait_count, iowait_count, timeslices, syscr, syscw, every taskstats *_delay_count (8 entries) |
| SumNs | MonotonicNs | ns | run_time_ns, wait_time_ns, wait_sum, voluntary_sleep_ns, block_sum, iowait_sum, core_forceidle_sum, every taskstats *_delay_total_ns (8 entries) |
| SumTicks | ClockTicks | USER_HZ ticks | utime_clock_ticks, stime_clock_ticks |
| SumBytes | Bytes | bytes (IEC) | allocated_bytes, deallocated_bytes, rchar, wchar, read_bytes, write_bytes, cancelled_write_bytes |

Group reduction: saturating_add per the no-wraparound contract. Delta is the signed difference; percent delta is relative to the before-side. Auto-scale ladder is decimal SI for ns / count, USER_HZ for ticks, IEC binary for bytes.
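A sketch of the Sum* reduction and delta arithmetic just described, simplified to bare u64 (the real code reduces the newtypes):

```rust
// Group reduction: saturating_add so a pathological counter cannot wrap
// the aggregate; delta is signed, percent delta is before-relative.
fn reduce_sum(per_thread: &[u64]) -> u64 {
    per_thread.iter().fold(0, |acc, &v| acc.saturating_add(v))
}

fn deltas(before: u64, after: u64) -> (i64, Option<f64>) {
    let abs = after as i64 - before as i64;
    // Percent delta is undefined against a zero baseline: render "-", not 0%.
    let pct = (before != 0).then(|| abs as f64 / before as f64 * 100.0);
    (abs, pct)
}

fn main() {
    let before = reduce_sum(&[10_000, 20_000]);
    let after = reduce_sum(&[18_000, 24_000]);
    let (abs, pct) = deltas(before, after);
    println!("{abs:+} ({:.1}%)", pct.unwrap_or(f64::NAN)); // +12000 (40.0%)
}
```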

Max reductions (peaks and gauges)

| Variant | Newtype | Output unit | Examples |
|---|---|---|---|
| MaxPeak | PeakNs | ns | wait_max, sleep_max, block_max, exec_max, slice_max, every taskstats *_delay_max_ns (8 entries), every taskstats *_delay_min_ns (8 entries) |
| MaxPeakBytes | PeakBytes | bytes (IEC) | hiwater_rss_bytes, hiwater_vm_bytes (taskstats lifetime memory watermarks) |
| MaxGaugeNs | GaugeNs | ns | fair_slice_ns (current scheduler slice) |
| MaxGaugeCount | GaugeCount | unitless | nr_threads (process-wide thread count) |

MaxPeak / MaxPeakBytes rows surface the worst single window or largest watermark any thread in the group has ever observed — summing per-thread maxes would conflate “one thread with a 1s spike” with “1000 threads with 1ms spikes each”. MaxPeakBytes is the byte-typed twin of MaxPeak and routes through the IEC binary auto-scale ladder so a 7.5 GiB watermark renders as 7.500GiB rather than dominating the table with raw byte counts. xacct_add_tsk (kernel/tsacct.c) reads the watermarks from the SHARED mm_struct, so sibling threads of the same tgid all report the same value; cross-thread Max within a single process is a no-op, while cross-process Max under a multi-tgid bucket picks the largest watermark any tgid in the bucket reported.

MaxGaugeNs / MaxGaugeCount apply to instantaneous gauges (read at capture time) where summing has no physical meaning. nr_threads specifically is leader-only (populated on tid == tgid, zero elsewhere); Max reads through the leader so a comm-bucketed group still surfaces the largest process represented in the bucket. The taskstats *_delay_min_ns rows also use MaxPeak: min here is the kernel’s per-task lifetime shortest non-zero observation, so cross-thread Max picks “the largest minimum any contributor reported”; sentinel 0 means “no events observed” — compare against the matching count.

Range reductions (bounded ordinals)

| Variant | Newtype | Output | Examples |
|---|---|---|---|
| RangeI32 | OrdinalI32 | [min, max] (i64-widened) | nice, priority, processor |
| RangeU32 | OrdinalU32 | [min, max] (i64-widened) | rt_priority |

The renderer shows [min, max] and the delta uses the midpoint so a shift on either end is visible.
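A small sketch of the midpoint rule (illustrative helper names):

```rust
// Range-reduce an ordinal, then compare midpoints across snapshots so a
// shift in either bound shows up in one scalar.
fn range(vals: &[i64]) -> Option<(i64, i64)> {
    Some((*vals.iter().min()?, *vals.iter().max()?))
}

fn midpoint((min, max): (i64, i64)) -> f64 {
    (min + max) as f64 / 2.0
}

fn main() {
    let before = range(&[0, 0, 0]).unwrap(); // nice values: [0, 0]
    let after = range(&[0, 0, -5]).unwrap(); // one thread reniced: [-5, 0]
    println!("{:+}", midpoint(after) - midpoint(before)); // -2.5
}
```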

Mode reductions (categorical)

| Variant | Newtype | Output | Examples |
|---|---|---|---|
| Mode | CategoricalString | most-frequent value + count/total | policy |
| ModeChar | char (coerced) | most-frequent char + count/total | state |
| ModeBool | bool (coerced) | most-frequent bool + count/total | ext_enabled |

Mode is textual: delta is "same" if both modes agree, "differs" otherwise — there is no arithmetic on a categorical value. ModeChar and ModeBool coerce to String via to_string() before reducing because the underlying types are not themselves Modeable. A 50/50 bool tie resolves lex-smallest-wins (so "false" wins over "true"); operators reading a false mode in a heterogeneous bucket should check the count/total fraction.
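A sketch of the mode reduction with the lex-smallest tie-break; BTreeMap key order plus a strictly-greater comparison yields it naturally (illustrative, not the actual implementation):

```rust
use std::collections::BTreeMap;

// Most-frequent value wins; on a tie the lexicographically smallest wins,
// because BTreeMap iterates in key order and only a strictly greater
// count displaces the current best.
fn mode(values: &[&str]) -> Option<(String, usize, usize)> {
    let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
    for v in values {
        *counts.entry(v).or_insert(0) += 1;
    }
    let (best, n) = counts.iter().fold(None, |best, (v, &c)| match best {
        Some((_, bc)) if bc >= c => best,
        _ => Some((*v, c)),
    })?;
    Some((best.to_string(), n, values.len()))
}

fn main() {
    // A 50/50 tie resolves to "false" (lex-smallest), per the rule above.
    assert_eq!(mode(&["true", "false"]), Some(("false".into(), 1, 2)));
}
```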

Affinity reduction (CPU sets)

| Variant | Newtype | Output | Example |
|---|---|---|---|
| Affinity | CpuSet | AffinitySummary { min_cpus, max_cpus, uniform } | cpu_affinity |

Heterogeneous groups render as "N-M cpus (mixed)". Unlike the other rules, Affinity does not route through a metric_types trait — its reduction produces a structured summary, not a homogeneous newtype.

Derived metrics

Derived metrics consume one or more already-aggregated input metrics from CTPROF_METRICS and produce a single scalar with its own auto-scale ladder. They render in a separate ## Derived metrics table below the per-thread table on both compare and show, with rows colored blue to distinguish them from the primary table on TTY stdout. Registered in CTPROF_DERIVED_METRICS in src/ctprof_compare.rs.

The full registry is 17 entries: 8 schedstat / I/O / heap derivations plus 9 taskstats-derived (the 8 per-bucket avg_*_delay_ns averages plus the total_offcpu_delay_ns rollup). Every formula is implemented as a closure over the group’s metrics map (BTreeMap<String, Aggregated>); a missing input or a zero denominator yields None, which the renderer surfaces as - so the operator can distinguish “not computable” from “computed as zero”.
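A minimal sketch of that None-propagation contract, assuming a flattened scalar map (the real closures read typed Aggregated values via input_scalar):

```rust
use std::collections::BTreeMap;

// A derived metric is a closure over already-aggregated inputs. Missing
// input or zero denominator -> None, which renders as "-" (not 0).
fn avg_wait_ns(metrics: &BTreeMap<String, u64>) -> Option<f64> {
    let sum = *metrics.get("wait_sum")?;
    let count = *metrics.get("wait_count")?;
    (count != 0).then(|| sum as f64 / count as f64)
}

fn main() {
    let mut m = BTreeMap::new();
    m.insert("wait_sum".to_string(), 9_000_000_u64);
    assert_eq!(avg_wait_ns(&m), None); // wait_count missing: renders "-", not 0
    m.insert("wait_count".to_string(), 3_000);
    assert_eq!(avg_wait_ns(&m), Some(3_000.0)); // 3 µs average
}
```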

| Metric | Formula | Inputs | Unit | Notes |
|---|---|---|---|---|
| affine_success_ratio | nr_wakeups_affine / nr_wakeups_affine_attempts | nr_wakeups_affine, nr_wakeups_affine_attempts | ratio (0..1) | wake_affine() success ratio. CFS-only signal — sched_ext does not increment the wakeup counters. Bare three-decimal scalar; the renderer suppresses the % column for ratio rows because absolute delta on a [0, 1] ratio is already in percentage points. |
| avg_wait_ns | wait_sum / wait_count | wait_sum, wait_count | ns | Average runqueue-wait duration per scheduling event. Rendered with the ns auto-scale ladder (ns → µs → ms → s). Schedstat-gated (see wait_sum and wait_count); zero across sched_ext threads. |
| cpu_efficiency | run_time_ns / (run_time_ns + wait_time_ns) | run_time_ns, wait_time_ns | ratio (0..1) | Fraction of total scheduler-tracked time spent on-CPU. Higher = less time stuck on the runqueue. Both inputs gated by CONFIG_SCHED_INFO. |
| avg_slice_ns | run_time_ns / timeslices | run_time_ns, timeslices | ns | Average on-CPU slice length. Useful for spotting timeslice-tuning regressions (e.g. a sched_min_granularity_ns change that shrinks slices). Both inputs gated by CONFIG_SCHED_INFO. |
| involuntary_csw_ratio | nonvoluntary_csw / (voluntary_csw + nonvoluntary_csw) | nonvoluntary_csw, voluntary_csw | ratio (0..1) | Fraction of context switches that were preemptions (kernel pulled the task off-CPU) vs. voluntary blocks. High values indicate preemption pressure; low values indicate cooperative blocking. |
| disk_io_fraction | read_bytes / rchar | read_bytes, rchar | ratio (≥ 0) | Fraction of read-syscall bytes that traveled past the pagecache layer (cache-miss rate; covers local block devices and network filesystems alike). Typically ≤ 1.0, but can exceed 1 when readahead pulls more bytes past the pagecache than the syscalls requested. Both inputs gated by CONFIG_TASK_IO_ACCOUNTING. |
| live_heap_estimate | allocated_bytes - deallocated_bytes (signed) | allocated_bytes, deallocated_bytes | bytes (IEC, signed) | jemalloc-only live-heap estimate. Glibc and other allocators feed both inputs zero, so the derived metric reads zero too — "-" would imply non-computable, but here zero is the genuine reading. Renders on the IEC binary ladder (B → KiB → MiB → GiB → TiB). The per-thread reading carries cross-thread noise: a thread that purely frees objects allocated by other threads reads large negative values; group-level Sum across all threads of the process eliminates the asymmetry. |
| avg_iowait_ns | iowait_sum / iowait_count | iowait_sum, iowait_count | ns | Average iowait interval per blocking event. Schedstat-gated; zero across sched_ext threads. |
| avg_cpu_delay_ns | cpu_delay_total_ns / cpu_delay_count | cpu_delay_total_ns, cpu_delay_count | ns | Average runqueue-wait per scheduling event from the taskstats delayacct path. RACY: the kernel updates count + total via the lockless sched_info path, so a concurrent reader may observe one ahead of the other; the quotient is approximate at the sub-event scale and stable at the integrated scale. Distinct from avg_wait_ns (schedstat), which captures the same wait-for-CPU bucket via a different code path. |
| avg_blkio_delay_ns | blkio_delay_total_ns / blkio_delay_count | blkio_delay_total_ns, blkio_delay_count | ns | Average synchronous block-I/O wait per event from the taskstats delayacct path. Distinct from avg_iowait_ns (schedstat) — this is the canonical delay-accounting block-I/O reading. |
| avg_swapin_delay_ns | swapin_delay_total_ns / swapin_delay_count | swapin_delay_total_ns, swapin_delay_count | ns | Average swap-in wait per event. OVERLAPS with thrashing — every thrashing event is also a swapin event from the syscall layer; do not sum the two averages or the underlying totals directly. |
| avg_freepages_delay_ns | freepages_delay_total_ns / freepages_delay_count | freepages_delay_total_ns, freepages_delay_count | ns | Average direct-reclaim wait per event. |
| avg_thrashing_delay_ns | thrashing_delay_total_ns / thrashing_delay_count | thrashing_delay_total_ns, thrashing_delay_count | ns | Average thrashing wait per event. OVERLAPS with swapin (see avg_swapin_delay_ns). |
| avg_compact_delay_ns | compact_delay_total_ns / compact_delay_count | compact_delay_total_ns, compact_delay_count | ns | Average memory-compaction wait per event. |
| avg_wpcopy_delay_ns | wpcopy_delay_total_ns / wpcopy_delay_count | wpcopy_delay_total_ns, wpcopy_delay_count | ns | Average write-protect-copy (CoW) fault wait per event. |
| avg_irq_delay_ns | irq_delay_total_ns / irq_delay_count | irq_delay_total_ns, irq_delay_count | ns | Average IRQ-handler window per event. |
| total_offcpu_delay_ns | cpu + blkio + freepages + compact + wpcopy + irq + max(swapin, thrashing) | every *_delay_total_ns | ns | Sum of every meaningful off-CPU delay-accounting bucket. The swapin/thrashing pair is OR’d with .max() rather than summed because the two share syscall-layer events (every thrashing event is also a swapin from the syscall perspective); summing both would double-count thrashing-induced swapins. When CONFIG_TASK_DELAY_ACCT is off, the runtime toggle is off, or the kernel predates a bucket’s introduction (e.g. wpcopy_* lands in v13, irq_* in v14), the missing buckets read zero from the truncated taskstats payload — the rollup degrades to the sum of the populated buckets rather than returning -. The structured taskstats outcome lives on CtprofSnapshot::taskstats_summary for the operator to disambiguate “no data” from “zero data.” |
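A sketch of the rollup arithmetic in the last row, with the bucket totals passed in directly (the real closure reads them from the metrics map):

```rust
// total_offcpu_delay_ns: sum every bucket, but OR the swapin/thrashing
// pair with max() because thrashing events are also swapin events at the
// syscall layer; summing both would double-count them.
fn total_offcpu_delay_ns(
    cpu: u64, blkio: u64, freepages: u64, compact: u64,
    wpcopy: u64, irq: u64, swapin: u64, thrashing: u64,
) -> u64 {
    [cpu, blkio, freepages, compact, wpcopy, irq, swapin.max(thrashing)]
        .iter()
        .fold(0, |acc, &v| acc.saturating_add(v))
}
```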

The is_ratio column on the registry is load-bearing for the renderer: ratio rows skip the % column entirely (the absolute delta already carries percentage-point semantics for a [0, 1] quantity), and the auto-scale ladder is None (bare three-decimal scalar). Non-ratio derived metrics reuse the same ladders as their unit family — Ns for nanosecond derivations, Bytes for byte derivations.

The 9 taskstats-derived entries (the 8 avg_*_delay_ns averages plus total_offcpu_delay_ns) tag Section::TaskstatsDelay rather than Section::Derived so --sections taskstats-delay renders the full taskstats view — the 34 raw rows AND the 9 derivations that depend on them — without dragging in unrelated derivations.

Derived metrics are surfaced by ctprof metric-list alongside the primary registry, and are valid --sort-by keys on both compare and show.

Output and interpretation

The comparison prints raw numbers and percent delta. There are no judgment labels (regression vs. improvement) — the meaning of “run_time went up 15%” depends on whether you were measuring a CPU-bound workload (more work done) or a spin-wait pathology (more time wasted). The interpretation is scheduler-specific and left to the operator.

Sort order: by default, rows are sorted by absolute delta (largest movers first) so the most-changed metrics surface at the top. Rows with no numeric scalar (policy, heterogeneous affinity) fall to the bottom.

File format

.ctprof.zst is zstd-compressed JSON of CtprofSnapshot. The schema is #[non_exhaustive] so field additions do not break existing snapshots:

CtprofSnapshot
├── captured_at_unix_ns: u64
├── host: Option<HostContext>
├── threads: Vec<ThreadState>
├── cgroup_stats: BTreeMap<String, CgroupStats>
├── probe_summary: Option<CtprofProbeSummary>
├── parse_summary: Option<CtprofParseSummary>
├── taskstats_summary: Option<TaskstatsSummary>
├── psi: Psi
└── sched_ext: Option<SchedExtSysfs>

TaskstatsSummary carries per-snapshot taskstats genetlink query outcomes — ok_count, eperm_count, esrch_count, other_err_count — so an operator can distinguish “no taskstats data because every tid raced exit” (high esrch_count) from “no taskstats data because the kernel was built without CONFIG_TASKSTATS” (the netlink open failed up-front, every counter zero) from “no taskstats data because CAP_NET_ADMIN is missing” (high eperm_count).

ThreadState::start_time_clock_ticks is in USER_HZ (100 on x86_64 and aarch64), NOT the kernel-internal CONFIG_HZ — so cross-host comparison between differently-configured kernels on those architectures is meaningful. Other in-tree architectures (alpha, for instance, with USER_HZ=1024) would require normalization at capture time; the capture layer currently targets x86_64 and aarch64 only.

Compression level 3 (matching the ktstr remote-cache convention): adequate ratio at fast speed, and ctprof captures are small enough that further compression produces diminishing returns on I/O.
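The write side is symmetric to the read sketch above. A minimal sketch at the documented level, assuming the zstd crate:

```rust
use std::fs;

// Level 3 matches the documented convention: fast, with an adequate ratio
// for snapshots that are small to begin with.
fn write_snapshot(path: &str, json: &[u8]) -> std::io::Result<()> {
    fs::write(path, zstd::encode_all(json, 3)?)
}
```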

Adding a metric

Adding a per-thread metric to the registry is a three-step mechanical process. The type system enforces the wiring so a mismatch between the kernel-source semantic and the aggregation rule fails to compile rather than producing a silently-wrong group reduction.

1. Add a ThreadState field with the right newtype

Pick the metric_types newtype that matches the kernel-source semantic of the field — the per-newtype docs name the kernel call sites that update each category. The shape determines what aggregation rules are legal in step 3:

| Newtype | When to use |
|---|---|
| MonotonicCount | Pure counter — only goes up across the thread’s lifetime. Examples: nr_wakeups, syscall counts, every taskstats *_delay_count. |
| DeadCounter | Same shape as MonotonicCount but tagged for kernel counters with no live writer (always reads zero). Captured for parser parity but does NOT implement any reduction trait — register with is_dead: true and the renderer flags it [dead]. |
| MonotonicNs | Cumulative-time counter in ns. Examples: run_time_ns, wait_sum, every taskstats *_delay_total_ns. |
| PeakNs | Lifetime high-water mark in ns. Kernel updates via if (delta > stat->max) stat->max = delta. Summing peaks is a category error. Examples: wait_max, slice_max, every taskstats *_delay_max_ns and *_delay_min_ns. |
| PeakBytes | Byte-typed twin of PeakNs — lifetime high-water mark in bytes. Routes through the IEC binary auto-scale ladder. Used for taskstats memory watermarks (hiwater_rss_bytes, hiwater_vm_bytes) read from the shared mm_struct. Pairs with AggRule::MaxPeakBytes. |
| GaugeNs | Instantaneous gauge sampled at capture time (ns). Cannot sum — N near-identical samples collapse to N×gauge with no meaning. Example: fair_slice_ns. |
| GaugeCount | Instantaneous unitless count that goes up AND down. Example: nr_threads. |
| ClockTicks | USER_HZ-scaled time. Examples: utime_clock_ticks, stime_clock_ticks. |
| Bytes | Byte counts. IEC binary auto-scale ladder. Examples: read_bytes, wchar. |
| OrdinalI32 / OrdinalU32 / OrdinalU64 | Bounded scalar — range-aggregated, not summable. Examples: nice (i32), rt_priority (u32). The Rangeable::range_across reduction returns Option<Range<Self>> — see Range<T> below. OrdinalU64 implements Rangeable but is currently unused in the registry; a metric that picks OrdinalU64 requires adding AggRule::RangeU64 alongside the existing RangeI32 and RangeU32 variants. |
| CategoricalString | Categorical value — mode-aggregated. Examples: policy. |
| CpuSet | CPU affinity mask — affinity-aggregated. Example: cpu_affinity. |
| Range<T> | Output type of the Rangeable::range_across reduction. Carries min and max of the same T with the min <= max invariant enforced at construction (debug_assert! in Range::new). Not stored on ThreadState — the Aggregated::OrdinalRange boundary unwraps it via into_tuple() to a (i64, i64) pair widened from the underlying OrdinalI32 / OrdinalU32 / OrdinalU64. |

Add the field to ThreadState in src/ctprof.rs:

// In ThreadState struct definition.
/// Description: what the field counts, what kernel call site
/// writes it, and what scheduler classes increment it. Cite
/// `kernel/sched/...` line numbers for the writer.
pub my_new_metric: crate::metric_types::MonotonicCount,

2. Wire the capture path

capture_thread_at_with_tally in src/ctprof.rs is the single per-thread procfs walk. Add the per-source reader (or extend an existing one) and stamp the field in the ThreadState { ... } construction:

// Inside capture_thread_at_with_tally, after the existing
// per-source reads. Wrap in the newtype constructor; never use
// `.into()` (the typed-newtype style is explicit).
my_new_metric: MonotonicCount(sched.my_new_metric.unwrap_or(0)),

The Option::unwrap_or(0) collapse is load-bearing: the profiler’s contract is “never fail the snapshot,” so a missing reading lands at the newtype’s Default::default() (zero). The absent reading is indistinguishable from a genuine zero in the output — see the Capture is best-effort section.

3. Register the metric

Append a CtprofMetricDef entry to CTPROF_METRICS in src/ctprof_compare.rs. The AggRule variant must match the newtype chosen in step 1 — the type system enforces this.

CtprofMetricDef {
    name: "my_new_metric",
    rule: AggRule::SumCount(|t| t.my_new_metric),
    sched_class: None, // or Some("cfs-only") / Some("non-ext") / Some("fair-policy")
    config_gates: &[], // or &["CONFIG_SCHEDSTATS"], etc.
    is_dead: false,    // true for kernel-side dead pointers
    description: "One-line operator-facing description; surfaces in `ctprof metric-list`.",
    section: Section::Primary, // or Section::TaskstatsDelay for taskstats-sourced rows
},

The name field is the canonical metric identifier — used by --sort-by, --metrics, and the metric-list output. (The --columns flag accepts layout names — group, threads, metric, baseline, candidate, delta, %, arrow, value — not metric names.) Names are ASCII short-form (matching the capture-side field name where possible). sched_class and config_gates render as bracketed suffixes in metric-list output ([cfs-only], [SCHEDSTATS]) so operators reading a row know which kernels populate the counter. The section tag drives the --sections per-row filter — most rows take Section::Primary; taskstats-sourced rows take Section::TaskstatsDelay.

Compile-time guards

The type system catches the four most common mistakes:

  • Wrong reduction family: pairing a PeakNs accessor with AggRule::SumNs fails with a type error — PeakNs does not implement Summable (only Maxable), and the closure’s return type does not match the variant’s expected newtype.
  • Wrong unit family: pairing a Bytes accessor with AggRule::SumNs fails the same way.
  • Dead counter with live reduction: DeadCounter does not implement Summable / Maxable / Rangeable / Modeable, so any AggRule::Sum* / Max* / Range* / Mode* variant bound to a dead-counter accessor fails to compile. Register the metric only via the is_dead: true flag with whichever variant matches its shape — the rendering layer surfaces it as [dead] and skips numeric reduction.
  • Categorical with numeric reduction: pairing a CategoricalString accessor with AggRule::SumCount fails because CategoricalString does not implement Summable.

The closure body cannot be type-checked beyond the variant boundary, so a body that actively miswraps a field — e.g. SumNs(|t| MonotonicNs(t.wait_max.0)) laundering a peak through the sum wrapper — type-checks. Don’t do that. The wrapper category is load-bearing; the type system catches the variant mismatch but not the lying inside.

Optional: derived metric

If the new metric has a useful ratio or sum-of-ratios pairing with existing inputs, register a DerivedMetricDef in CTPROF_DERIVED_METRICS (same file). The compute closure reads inputs via input_scalar(metrics, name)? and returns Option<DerivedValue>; the ratio_compute and ratio_of_sum_compute helpers cover the two most common shapes. Set is_ratio: true when the output is in [0, 1] so the renderer suppresses the % column. Set section to Section::Derived for general derivations or Section::TaskstatsDelay if every input is a taskstats field (so --sections taskstats-delay keeps the derivation alongside its raw inputs).
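A sketch of the registry-entry shape, extrapolated from the CtprofMetricDef example above; the exact DerivedMetricDef field set is an assumption:

```rust
// Hypothetical entry: field names mirror the text above (compute via
// input_scalar, is_ratio, section); verify against the real struct.
DerivedMetricDef {
    name: "my_ratio",
    // Closure over the group's aggregated inputs; `?` propagates a missing
    // input, and the helper returns None on a zero denominator.
    compute: |metrics| ratio_compute(metrics, "my_new_metric", "my_other_metric"),
    is_ratio: true, // renderer drops the % column for [0, 1] ratios
    description: "my_new_metric per my_other_metric event.",
    section: Section::Derived, // or Section::TaskstatsDelay if every input is taskstats
},
```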

See also

  • cargo ktstr show-host — captures the host context (kernel, CPU, tunables) that the profiler embeds as the host field. Use show-host when you want to inspect host configuration only, without the per-thread walk.
  • Capture and Compare Host State — recipe covering the show-host / stats compare flow for comparing host context across sidecars (not the per-thread profiler).
  • Environment Variables — every ktstr-controlled env var.