Troubleshooting

Build errors

clang not found

error: failed to run custom build command for `ktstr`
  ...
  clang: No such file or directory

The BPF skeleton build (libbpf-cargo) invokes clang to compile .bpf.c sources. Install clang:

  • Debian/Ubuntu: sudo apt install clang
  • Fedora: sudo dnf install clang

pkg-config not found

error: failed to run custom build command for `libbpf-sys`
  ...
  pkg-config: command not found

libbpf-sys uses pkg-config during its vendored build. Install it:

  • Debian/Ubuntu: sudo apt install pkg-config
  • Fedora: sudo dnf install pkgconf

autotools errors (autoconf, autopoint, aclocal)

autoreconf: command not found
aclocal: command not found
autopoint: command not found

The vendored libbpf-sys build compiles bundled libelf and zlib from source using autotools. These libraries are not system dependencies – they ship with libbpf-sys – but the autotools toolchain is needed to build them. Install:

  • Debian/Ubuntu: sudo apt install autoconf autopoint flex bison gawk
  • Fedora: sudo dnf install autoconf gettext-devel flex bison gawk

make or gcc not found

busybox build requires 'make' — install build-essential (Debian/Ubuntu) or base-devel (Fedora/Arch)
busybox build requires 'gcc' — install build-essential (Debian/Ubuntu) or base-devel (Fedora/Arch)

The build script compiles busybox from source for guest shell mode. This requires make and gcc.

  • Debian/Ubuntu: sudo apt install make gcc
  • Fedora: sudo dnf install make gcc

BTF errors

no BTF source found. Set KTSTR_KERNEL to a kernel build directory,
or ensure /sys/kernel/btf/vmlinux exists.

build.rs generates vmlinux.h from kernel BTF data. It searches the kernel discovery chain (KTSTR_KERNEL, ./linux, ../linux, installed kernel) for a vmlinux file, falling back to /sys/kernel/btf/vmlinux. Most distros ship /sys/kernel/btf/vmlinux with CONFIG_DEBUG_INFO_BTF enabled.

Fixes:

  • Verify BTF is available: ls /sys/kernel/btf/vmlinux
  • If missing, set KTSTR_KERNEL to a kernel build directory that contains a vmlinux with BTF: export KTSTR_KERNEL=/path/to/linux
  • Build a kernel with CONFIG_DEBUG_INFO_BTF=y.
  • Some minimal/cloud kernels strip BTF. Use a distro kernel or build your own.
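The lookup order described above can be sketched as a list of candidate paths. This is only an illustration, not the actual build.rs code: the function names are hypothetical, the assumption that vmlinux sits at the root of each kernel directory is mine, and the "installed kernel" step is left out because the text does not say where that file lives.

```rust
use std::path::PathBuf;

/// Candidate BTF sources, in the order the text describes.
/// `ktstr_kernel` stands in for the KTSTR_KERNEL environment variable.
/// ASSUMPTION: vmlinux sits at the root of each kernel directory.
/// The "installed kernel" step is intentionally not modeled.
fn btf_candidates(ktstr_kernel: Option<&str>) -> Vec<PathBuf> {
    let mut out = Vec::new();
    if let Some(dir) = ktstr_kernel.filter(|d| !d.is_empty()) {
        out.push(PathBuf::from(dir).join("vmlinux"));
    }
    out.push(PathBuf::from("./linux/vmlinux"));
    out.push(PathBuf::from("../linux/vmlinux"));
    // Final fallback: BTF exported by the running kernel.
    out.push(PathBuf::from("/sys/kernel/btf/vmlinux"));
    out
}

/// First candidate that exists on disk, if any.
fn first_existing(candidates: &[PathBuf]) -> Option<&PathBuf> {
    candidates.iter().find(|p| p.exists())
}
```

If `first_existing` returns `None` for the full candidate list, that corresponds to the "no BTF source found" error.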

busybox download failure

failed to obtain busybox source.
  tarball (https://github.com/mirror/busybox/archive/refs/tags/1_36_1.tar.gz): download: ...
  git clone (https://github.com/mirror/busybox.git): ...
  Check network connectivity. First build requires internet access.

build.rs downloads busybox source on first build (tarball first, git clone fallback). Subsequent builds use the cached binary in $OUT_DIR.

Fixes:

  • Verify network connectivity to github.com.
  • If behind a proxy, set HTTP_PROXY / HTTPS_PROXY.
  • After a successful first build, no network access is needed unless cargo clean removes the cached binary.

/dev/kvm not accessible

The host-side pre-flight emits one of the following, depending on whether the device node is missing or merely unreadable:

/dev/kvm not found. KVM requires:
  - Linux kernel with KVM support (CONFIG_KVM)
  - Access to /dev/kvm (check permissions or add user to 'kvm' group)
  - Hardware virtualization enabled in BIOS (VT-x/AMD-V)
/dev/kvm: permission denied. Add your user to the 'kvm' group:
  sudo usermod -aG kvm $USER
  then log out and back in.

ktstr boots Linux kernels in KVM virtual machines. The host must have KVM enabled and the user must have read+write access to /dev/kvm.

Diagnose:

  • Check the device exists and inspect its permissions and owning group: ls -l /dev/kvm. Typical output: crw-rw---- 1 root kvm 10, 232 ....
  • Confirm the kvm group exists and see its members: getent group kvm.

Fixes:

  • Load the KVM module as root: sudo modprobe kvm_intel (Intel CPUs) or sudo modprobe kvm_amd (AMD CPUs).
  • Follow the group-membership hint in the error text above (log out and back in afterward for the group change to take effect).
  • On cloud VMs (GCP, AWS, Azure) or nested hypervisors, nested virtualization is typically off by default. Enable it per the provider’s instructions (e.g. GCP --enable-nested-virtualization, AWS .metal instance types, Azure Dv3/Ev3 or later sizes with nested virtualization).
  • In CI, ensure the runner has KVM access (e.g. runs-on: [self-hosted, kvm]).

No kernel found

no kernel found
  hint: set KTSTR_KERNEL to a kernel source directory, a version (e.g. `6.14.2`), or a cache key (see `cargo ktstr kernel list`), or run `cargo ktstr kernel build` to populate the cache
  hint: or set KTSTR_TEST_KERNEL=/path/to/bzImage to point at a pre-built bootable image directly (bypasses KTSTR_KERNEL resolution)

On aarch64 the second hint says Image instead of bzImage.

ktstr shell and cargo ktstr shell auto-download the latest stable kernel when no --kernel is specified and no kernel is found via the discovery chain. See Kernel auto-download failures for download-specific errors.

ktstr needs a bootable Linux kernel image (bzImage on x86_64, Image on aarch64). See Kernel discovery for the search order.

Fixes:

  • Download and cache a kernel: cargo ktstr kernel build
  • Build from a local tree: cargo ktstr kernel build --source ../linux
  • Set KTSTR_TEST_KERNEL to an explicit image path.
  • The host’s installed kernel works for basic testing.

Scheduler not found

scheduler 'scx_mitosis' not found. Set KTSTR_SCHEDULER or
place it next to the test binary or in target/{debug,release}/

When using SchedulerSpec::Discover, ktstr searches for the scheduler binary in:

  1. KTSTR_SCHEDULER environment variable.
  2. Sibling of the current executable (and, when the test binary lives under target/{debug,release}/deps/, the parent of deps/ one level up; this covers the nextest / integration-test layout where the scheduler binary sits next to the test binary’s parent).
  3. target/debug/.
  4. target/release/.
  5. On-demand build via cargo build against the scheduler’s package name — ktstr invokes the build itself when the preceding four locations have no match, so a fresh checkout with an unbuilt scheduler still produces a usable binary without the caller pre-running cargo build.
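Steps 1 to 4 of the search order above can be sketched as an ordered candidate list. This is a hedged illustration rather than ktstr's implementation: `scheduler_candidates` and its parameters are hypothetical names, `env_override` stands in for KTSTR_SCHEDULER, and step 5 (the on-demand cargo build) is not modeled.

```rust
use std::path::{Path, PathBuf};

/// Ordered on-disk candidates for a scheduler binary (steps 1-4).
/// Hypothetical helper; step 5 (on-demand cargo build) is not modeled.
fn scheduler_candidates(
    name: &str,
    env_override: Option<&str>, // value of KTSTR_SCHEDULER, if set
    current_exe: &Path,
) -> Vec<PathBuf> {
    let mut out = Vec::new();
    // 1. Explicit override wins.
    if let Some(p) = env_override.filter(|p| !p.is_empty()) {
        out.push(PathBuf::from(p));
    }
    // 2. Sibling of the current executable, plus the parent of deps/.
    if let Some(dir) = current_exe.parent() {
        out.push(dir.join(name));
        if dir.file_name().map_or(false, |n| n == "deps") {
            if let Some(up) = dir.parent() {
                out.push(up.join(name));
            }
        }
    }
    // 3 and 4. Conventional cargo output directories.
    out.push(Path::new("target/debug").join(name));
    out.push(Path::new("target/release").join(name));
    out
}
```

A caller would take the first candidate that exists on disk. The list may contain duplicates (for example when the test binary already lives under target/debug/deps/), which is harmless for a first-match search.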

Fixes:

  • Build the scheduler first: cargo build -p scx_mitosis (skipped automatically if step 5 above can build it on demand, but pre-building makes the first test run faster).
  • Set KTSTR_SCHEDULER=/path/to/binary.
  • Use SchedulerSpec::Path for an explicit path in #[ktstr_test].

Scheduler died

scheduler process died unexpectedly after completing step 2 of 5 (12.3s into test)

The scheduler process died while the scenario was running. This is usually a crash. The exact message varies by when the crash was detected (between steps, during workload, after completion).

The failure output contains diagnostic sections (each present only when relevant):

  • --- scheduler log ---: the scheduler’s stdout and stderr, cycle-collapsed for readability.
  • --- diagnostics ---: init stage classification, VM exit code, and the last 20 lines of kernel console output.
  • --- sched_ext dump ---: sched_ext_dump trace lines from the guest kernel (present when a SysRq-D dump fired).

Set RUST_BACKTRACE=1 to force --- diagnostics --- on all failures, not just scheduler deaths.

Next steps:

  • Check the --- scheduler log --- for the crash reason.
  • Check --- diagnostics --- for BPF errors or kernel oops in the kernel console.
  • Enable auto_repro in the test to capture the crash path with BPF probes. See Auto-Repro.
  • Run with a longer duration and specific flags to narrow the reproducer.

See Investigate a Crash for the complete failure output format and auto-repro walkthrough.

Insufficient hugepages

performance_mode: WARNING: no 2MB hugepages available, guest memory will use regular pages
performance_mode: WARNING: need N 2MB hugepages, only K free — falling back to regular pages

Performance mode requests 2MB hugepages for guest memory. The first form fires when the host has no free 2MB hugepages (free == 0); the second fires when some are free but fewer than the run needs. In both cases the VM falls back to regular pages and continues to boot.

Fix:

Allocate hugepages before the run:

echo 2048 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
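To pick a value for nr_hugepages, it helps to estimate how many 2MB pages a guest of a given size needs. The sketch below assumes the requirement is simply guest memory rounded up to whole 2 MiB pages; ktstr's actual computation of N is not documented here, and all names are hypothetical.

```rust
const HUGEPAGE_BYTES: u64 = 2 * 1024 * 1024;

/// ASSUMPTION: the requirement is guest memory rounded up to 2 MiB pages.
fn hugepages_needed(guest_mem_bytes: u64) -> u64 {
    (guest_mem_bytes + HUGEPAGE_BYTES - 1) / HUGEPAGE_BYTES
}

#[derive(Debug, PartialEq)]
enum HugepageVerdict {
    Enough,
    NoneFree,                       // first warning form
    TooFew { need: u64, free: u64 }, // second warning form
}

/// Mirrors the two warning forms described in the text.
fn check_hugepages(guest_mem_bytes: u64, free: u64) -> HugepageVerdict {
    let need = hugepages_needed(guest_mem_bytes);
    if free >= need {
        HugepageVerdict::Enough
    } else if free == 0 {
        HugepageVerdict::NoneFree
    } else {
        HugepageVerdict::TooFew { need, free }
    }
}

/// Parses the contents of a sysfs counter such as
/// /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages.
fn parse_count(contents: &str) -> Option<u64> {
    contents.trim().parse().ok()
}
```

Under that assumption, a 4 GiB guest needs 2048 pages, which matches the value written in the command above.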

Worker assertion failures

stuck 4500ms on cpu2 at +3200ms (threshold 3000ms)
unfair cgroup: spread=42% (8-50%) 4 workers on 4 cpus (threshold 35%)

The Assert checks (max_gap_ms, max_spread_pct, etc.) detected a worker metric outside the configured thresholds.

Fixes:

  • Check whether the topology has enough CPUs for the scenario. Small topologies produce higher contention, larger gaps, and more spread.
  • Use execute_steps_with() with a custom Assert to override thresholds for scenarios that need relaxed limits.
  • Check the scheduler’s behavior under the specific flag profile that triggered the failure.

Cgroup name typos

No such file or directory: /sys/fs/cgroup/.../nonexistent/cgroup.procs

A cgroup name passed to Op::SetCpuset, Op::Spawn, or CgroupManager::move_tasks does not match a previously created cgroup. Cgroup names are case-sensitive strings.

Fixes:

  • Verify the cgroup name matches the name in Op::AddCgroup or CgroupDef::named().
  • When using dynamic cgroup names (e.g. format!("cg_{i}")), ensure the same formatting is used in all ops referencing that cgroup.

CpusetSpec errors

cgroup 'cg_0': CpusetSpec validation failed: not enough usable CPUs (4) for 8 partitions
cgroup 'cg_1': CpusetSpec validation failed: index 3 >= partition count 3
cgroup 'cg_2': CpusetSpec validation failed: Range fracs must lie in [0.0, 1.0]: start_frac=-1, end_frac=0.5

A CpusetSpec cannot produce a valid cpuset for the test topology. execute_steps treats this as a hard error and aborts the step so the downstream slicing/arithmetic in CpusetSpec::resolve is never reached with inputs that would panic.

Fixes:

  • Guard with a topology check before creating the step: if ctx.topo.usable_cpus().len() < needed { return Ok(AssertResult::skip(...)); }
  • Call CpusetSpec::validate(&ctx) in your scenario builder so failures surface before execute_steps runs.
  • Reduce the partition count or use CpusetSpec::Llc instead of Disjoint on topologies with fewer CPUs than partitions.
  • For Range/Overlap, keep fractions finite and inside [0.0, 1.0]; Range additionally requires start_frac < end_frac.
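The rules in the list above can be checked up front with a few lines of plain Rust. The sketch below is a stand-in, not CpusetSpec::validate itself: the function names are hypothetical, the order of checks is an assumption, and the message text is modeled on the errors quoted earlier in this section.

```rust
/// Range fractions must be finite, inside [0.0, 1.0], and ordered.
fn validate_range(start_frac: f64, end_frac: f64) -> Result<(), String> {
    let in_unit = |f: f64| f.is_finite() && (0.0..=1.0).contains(&f);
    if !in_unit(start_frac) || !in_unit(end_frac) {
        return Err(format!(
            "Range fracs must lie in [0.0, 1.0]: start_frac={start_frac}, end_frac={end_frac}"
        ));
    }
    if start_frac >= end_frac {
        // Wording of this message is illustrative only.
        return Err(format!(
            "Range requires start_frac < end_frac: start_frac={start_frac}, end_frac={end_frac}"
        ));
    }
    Ok(())
}

/// A partition index must be below the partition count, and the
/// topology must have at least one usable CPU per partition.
fn validate_partition(index: usize, parts: usize, usable_cpus: usize) -> Result<(), String> {
    if index >= parts {
        return Err(format!("index {index} >= partition count {parts}"));
    }
    if usable_cpus < parts {
        return Err(format!(
            "not enough usable CPUs ({usable_cpus}) for {parts} partitions"
        ));
    }
    Ok(())
}
```

Running checks like these in the scenario builder lets a test skip cleanly on small topologies instead of failing inside execute_steps.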

Worker count mismatches

PipeIo requires num_workers divisible by 2, got 3

Grouped work types (PipeIo, FutexPingPong, CachePipe, FutexFanOut, FanOutCompute) require num_workers divisible by their group size. WorkType::worker_group_size() returns the divisor.

Fixes:

  • Set CgroupDef::workers(n) to a value divisible by the work type’s group size (2 for pipe/futex pairs, fan_out + 1 for FutexFanOut and FanOutCompute).
  • Use an ungrouped work type (SpinWait, Mixed, Bursty, IoSyncWrite, IoRandRead, IoConvoy, YieldHeavy) if worker count flexibility is needed.
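The divisibility rule is simple arithmetic, sketched below with hypothetical helper names. The real divisor comes from WorkType::worker_group_size(); here it is passed in as a plain number so the example stays self-contained.

```rust
/// Group size for fan-out work types: one parent plus `fan_out` children,
/// following the "fan_out + 1" rule in the text.
fn fan_out_group_size(fan_out: usize) -> usize {
    fan_out + 1
}

/// Returns an error in the same shape as the message quoted above.
fn check_worker_count(work_type: &str, num_workers: usize, group_size: usize) -> Result<(), String> {
    // A group size of 0 or 1 is treated as "ungrouped" in this sketch.
    if group_size <= 1 || num_workers % group_size == 0 {
        return Ok(());
    }
    Err(format!(
        "{work_type} requires num_workers divisible by {group_size}, got {num_workers}"
    ))
}

/// Smallest valid worker count that is >= `num_workers` (and at least one group).
fn round_up_to_group(num_workers: usize, group_size: usize) -> usize {
    if group_size <= 1 {
        return num_workers;
    }
    if num_workers == 0 {
        return group_size;
    }
    ((num_workers + group_size - 1) / group_size) * group_size
}
```

For example, 3 PipeIo workers round up to 4, and 9 workers with a fan-out of 3 (group size 4) round up to 12.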

Cache corruption

  6.14.2-tarball-x86_64-kc...                 (corrupt: metadata.json malformed: ...)
warning: entries marked (corrupt) cannot be used — cached metadata is missing, malformed, or references a missing image. Inspect the entry directory under ~/.cache/ktstr/kernels to remove it manually, or run `kernel clean --corrupt-only --force` which removes ONLY corrupt entries and leaves valid ones intact. ...

A cached kernel entry has missing, unparseable, or schema-drifted metadata.json, or metadata that references an image file that is no longer present. This can happen after a partial write (e.g. disk full, killed process), or after a ktstr release that evolved the metadata schema in a non-backward-compatible way. cargo ktstr kernel list surfaces these as (corrupt: ...) rows; the trailing footer on stderr summarizes the remediation options. CacheDir::lookup returns None for corrupt entries so test runs at a specific cache key fall through to the normal re-build path.

The JSON form (cargo ktstr kernel list --json) emits an error_kind field on every corrupt entry — one of "missing", "unreadable", "schema_drift", "malformed", "truncated", "parse_error", "image_missing", or "unknown" — so CI scripts can dispatch on a stable token without parsing the free-form error string.
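A consumer of the JSON output might map the error_kind token onto an enum before deciding what to do. The sketch below covers only the token itself; the surrounding JSON layout is not shown in this section, so no parsing of it is attempted. The enum and function names are hypothetical, and folding unrecognized tokens into Unknown is a choice made here for forward compatibility.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ErrorKind {
    Missing,
    Unreadable,
    SchemaDrift,
    Malformed,
    Truncated,
    ParseError,
    ImageMissing,
    Unknown,
}

/// Maps the documented `error_kind` tokens onto an enum.
/// Tokens not in the documented list also map to `Unknown`.
fn parse_error_kind(token: &str) -> ErrorKind {
    match token {
        "missing" => ErrorKind::Missing,
        "unreadable" => ErrorKind::Unreadable,
        "schema_drift" => ErrorKind::SchemaDrift,
        "malformed" => ErrorKind::Malformed,
        "truncated" => ErrorKind::Truncated,
        "parse_error" => ErrorKind::ParseError,
        "image_missing" => ErrorKind::ImageMissing,
        _ => ErrorKind::Unknown,
    }
}
```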

Fixes:

  • Remove ONLY corrupt entries (keeps valid ones intact): cargo ktstr kernel clean --corrupt-only --force
  • Remove the corrupt entry along with everything else: cargo ktstr kernel clean --force
  • Rebuild a specific version after cleanup: cargo ktstr kernel build --force 6.14.2
  • Override the cache directory via KTSTR_CACHE_DIR if the default location is on a problematic filesystem.
  • See cargo ktstr kernel clean for all cleanup options, including --keep N --force to preserve the N newest entries.

Stale vmlinux.btf or default.profraw in kernel source tree

After upgrading from an older ktstr version, you may notice extra files in your kernel source directory:

  • <source>/vmlinux.btf — a sidecar of the kernel’s .BTF section bytes. Older ktstr versions wrote it next to whichever vmlinux they parsed, including source-tree builds. Current ktstr only writes the sidecar when the vmlinux path is inside the cache root (~/.cache/ktstr/kernels/ or whatever KTSTR_CACHE_DIR points at) so source trees stay pristine.
  • <source>/default.profraw — an LLVM coverage runtime artifact. Older ktstr versions could leave it in cwd when a coverage-instrumented cargo ktstr test was launched from inside the kernel tree. Current ktstr injects LLVM_PROFILE_FILE=<cargo-ktstr-binary-parent>/llvm-cov-target/default-{pid}-{binary_hash}.profraw for the bare nextest path so the profraw lands next to the cargo-ktstr binary regardless of cwd. See profraw layout for the per-population directory map.

Both files are leftover state from prior runs and are safe to remove:

rm -f /path/to/linux/vmlinux.btf
rm -f /path/to/linux/default.profraw

If you also see them turn up under a different ktstr-driven source tree, check that you are running a current ktstr build (re-run cargo build or cargo install ktstr to pick up the fix) before deleting again — the guards live in the resolver, not on disk, so an old binary will keep regenerating these files.

Cache directory not found

HOME is unset; cannot resolve cache directory. The container init or login shell did not assign HOME — set it to an absolute path, or set KTSTR_CACHE_DIR to an absolute path (e.g. /tmp/ktstr-cache) or XDG_CACHE_HOME to specify a cache location explicitly.
HOME is set to the empty string; cannot resolve cache directory. An empty HOME usually means a Dockerfile or shell rc has `export HOME=` or `ENV HOME=` with no value. Either set HOME to a real absolute path, or set KTSTR_CACHE_DIR to an absolute path (e.g. /tmp/ktstr-cache) or XDG_CACHE_HOME to specify a cache location explicitly.

The kernel image cache requires a writable directory. ktstr resolves it as: KTSTR_CACHE_DIR > $XDG_CACHE_HOME/ktstr/kernels/ > $HOME/.cache/ktstr/kernels/. The first form fires when HOME is absent from the environment (typical of bare container inits or systemd units with no Environment=HOME=...); the second fires when HOME is present but assigned to the empty string.

Fix: Set KTSTR_CACHE_DIR to an explicit path, or ensure HOME is set to a real absolute path.
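The precedence described above can be sketched as a pure function over the three variables. This is an illustration under stated assumptions, not ktstr's resolver: the function name is hypothetical, and treating an empty KTSTR_CACHE_DIR or XDG_CACHE_HOME as unset is my assumption.

```rust
use std::path::PathBuf;

/// Resolution order: KTSTR_CACHE_DIR, then $XDG_CACHE_HOME/ktstr/kernels,
/// then $HOME/.cache/ktstr/kernels.
/// ASSUMPTION: empty KTSTR_CACHE_DIR / XDG_CACHE_HOME are treated as unset.
fn resolve_cache_dir(
    ktstr_cache_dir: Option<&str>,
    xdg_cache_home: Option<&str>,
    home: Option<&str>,
) -> Result<PathBuf, String> {
    if let Some(dir) = ktstr_cache_dir.filter(|d| !d.is_empty()) {
        return Ok(PathBuf::from(dir));
    }
    if let Some(xdg) = xdg_cache_home.filter(|d| !d.is_empty()) {
        return Ok(PathBuf::from(xdg).join("ktstr").join("kernels"));
    }
    match home {
        // The two error forms quoted above: HOME absent vs. HOME empty.
        None => Err("HOME is unset; cannot resolve cache directory".to_string()),
        Some("") => Err("HOME is set to the empty string; cannot resolve cache directory".to_string()),
        Some(h) => Ok(PathBuf::from(h).join(".cache").join("ktstr").join("kernels")),
    }
}
```

In a real program the three arguments would come from std::env::var(...).ok(); passing them in explicitly keeps the function easy to test.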

Stale kconfig

warning: entries marked (stale kconfig) were built against a different ktstr.kconfig.
Rebuild with: kernel build --force <entry version>

cargo ktstr kernel list marks entries whose stored ktstr_kconfig_hash differs from the current embedded ktstr.kconfig fragment. This happens after updating ktstr (which may change the kconfig fragment).

Fix:

Rebuilds happen automatically on the next cargo ktstr kernel build for stale entries. Use --force to override the cache for other reasons. See cargo ktstr kernel list for the full listing output.

Kernel auto-download failures

ktstr: no kernel found, downloading latest stable
fetch https://www.kernel.org/releases.json: <error>

ktstr auto-downloads a kernel when no --kernel is specified and no kernel is found via the discovery chain (see Kernel discovery). The same download path runs when --kernel specifies a version (e.g. --kernel 6.14.2) that is not in the cache. The CLI label varies: ktstr: for the standalone binary, cargo ktstr: for the cargo subcommand.

The <error> above is the underlying reqwest error (DNS resolution, connection refused, timeout, TLS handshake failure).

fetch https://www.kernel.org/releases.json: HTTP 503

kernel.org returned a non-success status code.

no stable kernel with patch >= 8 found in releases.json

ktstr requires a stable or longterm release with patch version >= 8 to avoid brand-new major versions that may have build issues. This error means releases.json contained no qualifying version.
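The patch-level filter can be illustrated with a small version parser. This sketch only models the "patch >= 8" rule; it does not look at the stable/longterm label in releases.json, and picking the highest qualifying version is an assumption about what "latest" means. All names are hypothetical.

```rust
/// Parses "major.minor" or "major.minor.patch"; a missing patch counts as 0.
/// Anything else (for example "6.15-rc3") is rejected.
fn parse_version(v: &str) -> Option<(u32, u32, u32)> {
    let mut parts = v.split('.');
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    let patch = match parts.next() {
        Some(p) => p.parse().ok()?,
        None => 0,
    };
    if parts.next().is_some() {
        return None;
    }
    Some((major, minor, patch))
}

/// True when the version parses and its patch level is at least 8.
fn qualifies(v: &str) -> bool {
    matches!(parse_version(v), Some((_, _, patch)) if patch >= 8)
}

/// ASSUMPTION: "latest" means the highest qualifying version.
fn pick_latest<'a>(versions: &[&'a str]) -> Option<&'a str> {
    versions
        .iter()
        .copied()
        .filter(|v| qualifies(v))
        .max_by_key(|v| parse_version(v))
}
```

With the input ["6.15.2", "6.14.10", "6.12.30"], the sketch skips 6.15.2 (patch below 8) and selects 6.14.10. An input with no qualifying entry yields None, which corresponds to the error above.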

download https://cdn.kernel.org/.../linux-6.14.10.tar.xz: <error>

Network failure during tarball download (same causes as above).

extract tarball: <error>

Tarball extraction failed. Common causes: disk full, insufficient permissions on the temp directory, or a truncated download.

kernel built but cache store failed — cannot return image from temporary directory

The kernel built successfully but could not be stored in the cache. Check disk space and permissions on the cache directory.

For version-specific download errors (HTTP 404, HTML responses), see Kernel download failures.

Fixes:

  • Verify network connectivity: curl -sI https://www.kernel.org/releases.json
  • Check DNS resolution for kernel.org and cdn.kernel.org.
  • Check disk space — the download, extraction, and build require significant disk space.
  • If behind a proxy, set HTTP_PROXY, HTTPS_PROXY, and NO_PROXY (reqwest respects these environment variables).
  • Override the cache directory via KTSTR_CACHE_DIR if the default location has insufficient space or permissions.
  • Pre-download a kernel explicitly: cargo ktstr kernel build 6.14.10 to isolate whether the failure is in version resolution or download.

Kernel download failures

These errors occur when cargo ktstr kernel build or --kernel specifies an explicit version. For network and extraction errors during auto-download, see Kernel auto-download failures.

version 6.14.22 not found. latest 6.14.x: 6.14.10

The requested version does not exist on kernel.org. When a version in the same major.minor series is available in releases.json, the error suggests it.

version 5.4.99 not found

When the series is EOL or not in releases.json, only the “not found” message appears (no suggestion).

RC tarball not found: https://git.kernel.org/torvalds/t/linux-6.15-rc3.tar.gz
  RC releases are removed from git.kernel.org after the stable version ships.

RC tarballs are removed from git.kernel.org after the stable version ships. Use --git with a git.kernel.org URL to clone the tag instead.

download ...: server returned HTML instead of tarball (URL may be invalid)

Some CDN error pages return HTTP 200 with text/html content type. The download rejects these responses.

Fixes:

  • Check the suggested version in the error message.
  • Verify the version exists: check https://www.kernel.org/releases.json for available versions.
  • For RC releases, use --git with a git.kernel.org URL instead of a tarball download.
  • Run cargo ktstr kernel build without a version to automatically fetch the latest stable.

Shell mode issues

stdin must be a terminal

stdin must be a terminal for interactive shell mode

cargo ktstr shell requires a terminal for bidirectional I/O forwarding. Piped or redirected stdin is rejected.

Fix: Run from an interactive terminal session.

include file not found

-i strace: not found in filesystem or PATH

Bare names (without /, ., or ..) are searched in PATH. If the binary is not in PATH, use an explicit path.

--include-files path not found: ./missing-file

Explicit paths (containing / or starting with .) must exist on disk.

Fix: Verify the file exists and use the correct path.
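The two lookup modes can be told apart with a simple string test. The sketch below follows the rule as stated for explicit paths (contains / or starts with .) and treats everything else as a bare name; the exact rule ktstr applies may differ in edge cases, and the names used here are hypothetical.

```rust
use std::path::PathBuf;

#[derive(Debug, PartialEq)]
enum IncludeArg {
    BareName,     // searched in PATH
    ExplicitPath, // must exist on disk as given
}

/// ASSUMPTION: "explicit" means the argument contains '/' or starts with '.'.
fn classify_include(arg: &str) -> IncludeArg {
    if arg.contains('/') || arg.starts_with('.') {
        IncludeArg::ExplicitPath
    } else {
        IncludeArg::BareName
    }
}

/// Looks a bare name up in a PATH-style string, returning the first
/// directory entry that is a regular file.
fn search_path(name: &str, path_var: &str) -> Option<PathBuf> {
    std::env::split_paths(path_var)
        .map(|dir| dir.join(name))
        .find(|candidate| candidate.is_file())
}
```

So `-i strace` goes through `search_path`, while `-i ./missing-file` is checked directly against the filesystem.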

include directory contains no files

warning: -i ./empty-dir: directory contains no regular files

The directory passed to --include-files was walked recursively but contained no regular files. FIFOs, device nodes, and sockets are skipped during the walk.

Fix: Verify the directory contains the files you expect.

Model load failed

GGUF model load failed at /home/.../models/Qwen3-4B-Q4_K_M.gguf. The
file may be corrupt or incompatible with the linked llama.cpp version
— delete the file and re-run `cargo ktstr model fetch` to download
a fresh copy. Check stderr for the upstream llama.cpp rejection reason.

The host-side LLM extraction backend (OutputFormat::LlmExtract) could not load the cached GGUF weights. The cached file is either corrupt (partial download, disk error) or incompatible with the linked llama.cpp version.

Diagnose:

  • Re-run with RUST_LOG=llama-cpp-2=info (or =debug for more detail) to surface llama.cpp’s own rejection reason on stderr. The first call to the inference engine routes llama_cpp_2::send_logs_to_tracing events through the tracing subscriber under target "llama-cpp-2" (literal hyphens — see Environment Variables for the EnvFilter shape).
  • cargo ktstr model status reports the cache path and verdict (Matches, Mismatches, CheckFailed, NotCached).

Fix:

  • Delete the cached file and re-fetch: cargo ktstr model clean && cargo ktstr model fetch. clean removes both the GGUF artifact and its .mtime-size warm-cache sidecar; fetch re-downloads from the pinned URL and SHA-checks the result.
  • If model status reports Mismatches, the local file’s hash diverged from the pinned digest — cargo ktstr model fetch will refuse to overwrite a corrupt cache and the explicit clean is required first.
  • If you set KTSTR_MODEL_OFFLINE=1, unset it for the re-fetch. See cargo ktstr model.

Flock timeout / NFS rejection

flock LOCK_EX on run-dir target/ktstr/6.14-abc1234 timed out after
30s (lockfile target/ktstr/.locks/6.14-abc1234.lock, holders:
  pid=12345 cmd=cargo-ktstr test --kernel 6.14). A peer cargo
ktstr test process is writing sidecars to the same
{kernel}-{project_commit} directory; wait for it to finish or kill
it, then retry.

A peer process is holding the per-run-key advisory flock(2) that serializes sidecar writes; the helper polled for 30 s and gave up. Run-dir locks live at {runs_root}/.locks/{kernel}-{project_commit}.lock and serialize the (pre-clear + write) cycle so two concurrent ktstr runs sharing the same key can’t tear partially-written sidecars.
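The relationship between a run directory and its lockfile follows directly from the path template above. A minimal sketch, with hypothetical function names:

```rust
use std::path::{Path, PathBuf};

/// Run directory: {runs_root}/{kernel}-{project_commit}
fn run_dir(runs_root: &Path, kernel: &str, project_commit: &str) -> PathBuf {
    runs_root.join(format!("{kernel}-{project_commit}"))
}

/// Lockfile: {runs_root}/.locks/{kernel}-{project_commit}.lock
fn lock_path(runs_root: &Path, kernel: &str, project_commit: &str) -> PathBuf {
    runs_root
        .join(".locks")
        .join(format!("{kernel}-{project_commit}.lock"))
}
```

With a runs root of target/ktstr, kernel 6.14, and commit abc1234, these produce the run-dir and lockfile paths named in the timeout message above. Two processes contend for the lock only when both the kernel and the project commit match.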

target/ktstr/.locks/6.14-abc1234.lock: filesystem NFS is not
supported for ktstr lockfiles (NFSv3 is advisory-only without
an NLM peer; NFSv4 byte-range locking does not cover flock(2)).
Move the lockfile path to a local filesystem (tmpfs, ext4, xfs,
btrfs, f2fs, bcachefs).

try_flock rejects NFS, CIFS, SMB2, CephFS, AFS, and FUSE mounts because flock(2) semantics on those filesystems are unreliable (see Resource Budget — Filesystem requirement for the per-filesystem rationale).

Diagnose:

  • cargo ktstr locks (or ktstr locks --watch 1s) prints every ktstr flock currently held on the host with PID + cmdline, including per-run-key sidecar locks under the “Run-dir locks” section (see cargo ktstr locks).
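  • /proc/locks identifies files by device and inode rather than by path, so look the lockfile up by inode: grep ":$(stat -c %i <lockfile-path-from-error>) " /proc/locks. This falls back to the kernel’s own lock enumeration when the holder is outside ktstr.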
  • stat -f -c '%T' <runs-root> reports the filesystem type when the rejection error names NFS/CIFS/SMB/CephFS/AFS/FUSE.

Fix:

  • For a peer-holder timeout: wait for the peer to finish, or kill it (kill <pid>, using the PID from the holder list), then retry.
  • For an NFS / remote-fs rejection: relocate the runs root to a local filesystem. Set KTSTR_SIDECAR_DIR to a local path (/tmp/ktstr-sidecars, a tmpfs mount) — note that this override path also skips the cross-process flock, so concurrent runs targeting the same KTSTR_SIDECAR_DIR have no serialization between them. Use the override only for a single-process run or per-process distinct paths.
  • The kernel cache’s lockfiles ({cache_root}/.locks/*.lock) face the same constraint — override KTSTR_CACHE_DIR to a local filesystem if the default resolves to NFS. See Cache directory not found.

Tests pass locally but fail in CI

Common causes:

  • No KVM: CI runners need hardware virtualization. Check for /dev/kvm access.
  • Fewer CPUs: gauntlet topology presets go up to 252 CPUs, which may exceed the runner’s capacity. Use smaller topologies.
  • No kernel: set KTSTR_TEST_KERNEL in the CI environment.
  • No CAP_SYS_NICE or rtprio: performance-mode tests require CAP_SYS_NICE or an rtprio limit for RT scheduling, and enough host CPUs for exclusive LLC reservation. Pass --no-perf-mode (or set KTSTR_NO_PERF_MODE=1) to disable all performance mode features. Tests with performance_mode=true are skipped entirely under --no-perf-mode.
  • Debug thresholds: CI often runs debug builds. Debug builds use relaxed thresholds (3000ms gap, 35% spread) but may still hit limits on slow runners. See default thresholds.