Diagnose a Slow Scheduler with ctprof
When a scheduler change makes the workload slower but the test
suite still passes, the regression is usually buried in
per-thread off-CPU time. ktstr ctprof capture snapshots every
live thread’s scheduling, memory, I/O, and taskstats delay
counters; ktstr ctprof compare diffs two snapshots and surfaces
the buckets where time went. This recipe walks through a typical
A/B comparison.
See the ctprof reference for the full metric registry, aggregation rules, derived-metric formulas, and taskstats kconfig gating.
Capture before and after
# Baseline: scheduler A loaded, workload running.
ktstr ctprof capture --output baseline.ctprof.zst
# Switch schedulers, restart workload, wait for steady state.
# ...
# Candidate: scheduler B, same workload.
ktstr ctprof capture --output candidate.ctprof.zst
capture walks /proc once and writes the snapshot. It is
read-only — no kprobes, no tracing — so the act of capturing
does not perturb the measurement. The default capture covers
every live tgid and each of its threads; on a busy host that is
hundreds of threads.
The snapshot is zstd-compressed JSON, typically a few MB.
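The read-only pass is conceptually simple; a minimal sketch of that kind of single /proc walk is below. The live_threads helper and its dict shape are illustrative, not part of ktstr.

```python
import os

def live_threads():
    """Enumerate every live tgid and its tids with one walk of /proc.

    Read-only: a directory listing per process, no kprobes, no tracing,
    mirroring the style of pass described above. Illustrative only.
    """
    threads = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip /proc files that are not pid directories
        tgid = int(entry)
        try:
            tids = [int(t) for t in os.listdir(f"/proc/{tgid}/task")]
        except FileNotFoundError:
            continue  # process exited between the two reads; races are expected
        threads[tgid] = tids
    return threads
```

Races with exiting processes are tolerated rather than retried: a snapshot of a busy host is inherently approximate at the edges.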
Compare with the taskstats lens
The taskstats-delay section bundles the eight kernel
delay-accounting buckets (CPU, blkio, swapin, freepages,
thrashing, compact, wpcopy, irq) plus their nine derived
metrics (avg_*_delay_ns per bucket, total_offcpu_delay_ns
rollup). Running with --sections taskstats-delay filters the
output down to just the off-CPU view:
ktstr ctprof compare baseline.ctprof.zst candidate.ctprof.zst \
--sections taskstats-delay \
--sort-by total_offcpu_delay_ns:desc
The --sort-by total_offcpu_delay_ns:desc flag puts the processes
with the largest absolute off-CPU growth at the top. Each row
shows baseline | candidate | delta | %; large positive deltas
on a process that should not have moved are the suspects.
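The row arithmetic behind that output can be sketched as follows; the compare_rows helper and the process-name-to-value dict shape are hypothetical stand-ins for the real snapshot schema:

```python
def compare_rows(baseline, candidate):
    """Build (name, baseline, candidate, delta, %) rows for one metric
    and sort by delta descending, mirroring a desc sort on absolute growth.

    baseline/candidate map process name -> metric value; only processes
    present in both snapshots are comparable.
    """
    rows = []
    for name in baseline.keys() & candidate.keys():
        b, c = baseline[name], candidate[name]
        delta = c - b
        # A zero baseline makes the percentage undefined; report inf.
        pct = (delta / b * 100.0) if b else float("inf")
        rows.append((name, b, c, delta, pct))
    rows.sort(key=lambda r: r[3], reverse=True)  # largest growth first
    return rows
```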
The total_offcpu_delay_ns derivation is:
cpu + blkio + freepages + compact + wpcopy + irq + max(swapin, thrashing)
The rollup takes max(swapin, thrashing) rather than swapin +
thrashing because every thrashing event is also a swapin event
from the syscall's perspective; summing both would double-count
that time.
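A minimal sketch of that rollup, assuming the eight buckets arrive as a plain dict of cumulative nanosecond counters (the function name and dict shape are illustrative):

```python
def total_offcpu_delay_ns(b):
    """Roll the eight taskstats delay buckets into one off-CPU total.

    swapin and thrashing overlap (a thrashing event is also a swapin
    event), so only the larger of the two is counted; summing both
    would double-count that time.
    """
    return (b["cpu"] + b["blkio"] + b["freepages"] + b["compact"]
            + b["wpcopy"] + b["irq"] + max(b["swapin"], b["thrashing"]))
```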
Drill into the per-bucket averages
If total_offcpu_delay_ns jumped on a process, the per-bucket
avg_*_delay_ns derivations identify which off-CPU phase
grew. In the same compare output (the --sections taskstats-delay
filter keeps the raw counters and all nine derivations
together), look at the suspect process's row in:
| Bucket | Average derivation | Meaning |
|---|---|---|
| CPU runqueue wait | avg_cpu_delay_ns | Time waiting for the scheduler to pick the task. Racy: the kernel updates count and total locklessly, so the pair can be momentarily inconsistent. |
| Block I/O wait | avg_blkio_delay_ns | Synchronous block-device wait. Distinct from schedstat iowait_sum; the canonical delay-accounting reading. |
| Swap-in / Thrashing | avg_swapin_delay_ns / avg_thrashing_delay_ns | Memory pressure. The two overlap — a thrashing event is also a swapin event. |
| Direct memory reclaim | avg_freepages_delay_ns | Allocator hit __alloc_pages slowpath. |
| Memory compaction | avg_compact_delay_ns | Allocator demanded a high-order page; compaction stalled. |
| CoW page-fault | avg_wpcopy_delay_ns | Write-protect-copy fault, e.g. fork-then-write. |
| IRQ handling | avg_irq_delay_ns | Time charged to the task by the IRQ accounting subsystem. |
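Each average derivation in the table is just total delay over event count, guarded for buckets that never fired; a sketch (the helper name is illustrative):

```python
def avg_delay_ns(total_ns, count):
    """Per-bucket average: avg_*_delay_ns = cumulative delay / event count.

    A zero count yields 0.0 rather than a division error; a bucket that
    never fired is a legitimate reading. Because the kernel updates count
    and total locklessly, the pair can be momentarily inconsistent (the
    racy caveat noted for avg_cpu_delay_ns above).
    """
    return total_ns / count if count else 0.0
```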
A growing avg_cpu_delay_ns with flat blkio/swap/freepages
suggests the new scheduler is making poor placement choices —
the task is queueing more often or for longer, but no other
subsystem is to blame. A growing avg_blkio_delay_ns with flat
avg_cpu_delay_ns points away from the scheduler entirely
(disk, network filesystem, or a userspace lock pattern).
Cross-reference the primary table
Once a bucket is identified, look at the underlying counters without the section filter:
ktstr ctprof compare baseline.ctprof.zst candidate.ctprof.zst \
--metrics nr_wakeups,nr_migrations,wait_sum,wait_count,run_time_ns,timeslices
--metrics restricts the rendered rows to the named primary
metrics. Useful pairings when the suspect bucket is CPU runqueue
wait:
- wait_sum / wait_count — schedstat's average wait per scheduling event (the avg_wait_ns derivation, exposed outside the taskstats-delay section). If this confirms avg_cpu_delay_ns, both delay-accounting paths agree.
- nr_migrations — the new scheduler may be moving the task more aggressively. Cross-CPU migrations cost wall-clock time even when run_time_ns is identical.
- nr_wakeups_affine / nr_wakeups_affine_attempts — the affine_success_ratio derivation; a CFS-only signal that reflects how often wake_affine() succeeded. A large drop alongside a growing avg_cpu_delay_ns is a strong signal of cache-unfriendly placement.
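The two derivations named above reduce to simple ratios; a sketch with illustrative helper names:

```python
def avg_wait_ns(wait_sum, wait_count):
    """schedstat average wait per scheduling event (avg_wait_ns)."""
    return wait_sum / wait_count if wait_count else 0.0

def affine_success_ratio(successes, attempts):
    """nr_wakeups_affine / nr_wakeups_affine_attempts: the fraction of
    affine-wakeup attempts that succeeded (CFS-only signal)."""
    return successes / attempts if attempts else 0.0
```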
Confirm taskstats data is actually populated
If every taskstats column reads zero, the snapshot likely hit a
gating problem rather than a real “no delay” reading. Inspect
CtprofSnapshot::taskstats_summary (the structured
per-snapshot tally written into the snapshot itself):
- eperm_count > 0 — the capturing process lacked CAP_NET_ADMIN. Re-run as root, or grant cap_net_admin+eip via setcap.
- esrch_count near tids_walked — every tid raced exit before the per-tid query landed. Lengthen the workload's steady-state window and re-capture.
- ok_count == 0 AND eperm_count == 0 — the netlink open failed, almost always meaning the kernel was built without CONFIG_TASKSTATS. Rebuild with the kconfig.
- ok_count > 0 but every delay column reads zero — the kernel was built with CONFIG_TASKSTATS and CONFIG_TASK_DELAY_ACCT but booted without the runtime delayacct=on toggle. Add delayacct to the kernel cmdline, or set sysctl kernel.task_delayacct=1 and re-capture.
The structured fields above let an operator distinguish each case without scraping the capture-pipeline tracing log.
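That decision ladder can be sketched as a small classifier; the field names (ok_count, eperm_count, esrch_count, tids_walked, all_delays_zero) are hypothetical stand-ins for the real taskstats_summary schema:

```python
def diagnose_taskstats(s):
    """Map a per-snapshot taskstats tally to the gating cases above.

    Field names are illustrative, not the real snapshot schema. Cases
    are checked from most specific to least: permission failure, missing
    kconfig, exit races, then the runtime delayacct toggle.
    """
    if s["eperm_count"] > 0:
        return "grant CAP_NET_ADMIN (run as root, or setcap cap_net_admin+eip)"
    if s["ok_count"] == 0 and s["esrch_count"] == 0:
        return "netlink open failed: kernel likely built without CONFIG_TASKSTATS"
    if s["ok_count"] == 0:
        return "every tid raced exit: lengthen the steady-state window, re-capture"
    if s["all_delays_zero"]:
        return "delay accounting off at runtime: set kernel.task_delayacct=1"
    return "taskstats populated"
```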
Related
- ctprof reference — the full metric registry and gating documentation.
- Capture and Compare Host State — the cargo ktstr show-host recipe for host-context diffs (kernel, sched_* tunables, NUMA layout); use that when the hypothesis is "the host config moved" rather than "a workload's per-thread behaviour moved."
- A/B Compare Branches — recipe for diffing scheduler-source-tree changes via ktstr's gauntlet runs; ctprof complements that by surfacing per-thread-level effects the scenario assertions miss.