Troubleshooting
Build errors
clang not found
error: failed to run custom build command for `ktstr`
...
clang: No such file or directory
The BPF skeleton build (libbpf-cargo) invokes clang to compile
.bpf.c sources. Install clang:
- Debian/Ubuntu: sudo apt install clang
- Fedora: sudo dnf install clang
pkg-config not found
error: failed to run custom build command for `libbpf-sys`
...
pkg-config: command not found
libbpf-sys uses pkg-config during its vendored build. Install it:
- Debian/Ubuntu: sudo apt install pkg-config
- Fedora: sudo dnf install pkgconf
autotools errors (autoconf, autopoint, aclocal)
autoreconf: command not found
aclocal: command not found
autopoint: command not found
The vendored libbpf-sys build compiles bundled libelf and zlib from source using autotools. These libraries are not system dependencies – they ship with libbpf-sys – but the autotools toolchain is needed to build them. Install:
- Debian/Ubuntu: sudo apt install autoconf autopoint flex bison gawk
- Fedora: sudo dnf install autoconf gettext-devel flex bison gawk
make or gcc not found
busybox build requires 'make' — install build-essential (Debian/Ubuntu) or base-devel (Fedora/Arch)
busybox build requires 'gcc' — install build-essential (Debian/Ubuntu) or base-devel (Fedora/Arch)
The build script compiles busybox from source for guest shell mode. This requires make and gcc.
- Debian/Ubuntu: sudo apt install make gcc
- Fedora: sudo dnf install make gcc
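The build errors above all come down to a tool missing from PATH, so it can save time to check for every one of them before the first build. The following is a minimal sketch, not part of ktstr: the function name missing_tools is hypothetical, and the tool list is taken from the error messages and install commands in this section.

```shell
#!/bin/sh
# Print each command from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools named in the build errors and install commands above.
missing_tools clang pkg-config autoreconf aclocal autopoint flex bison gawk make gcc
```

An empty result means every listed tool was found. Any name that is printed maps back to one of the install commands above.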
BTF errors
no BTF source found. Set KTSTR_KERNEL to a kernel build directory,
or ensure /sys/kernel/btf/vmlinux exists.
build.rs generates vmlinux.h from kernel BTF data. It searches
the kernel discovery chain (KTSTR_KERNEL, ./linux, ../linux,
installed kernel) for a vmlinux file, falling back to
/sys/kernel/btf/vmlinux. Most distro kernels are built with
CONFIG_DEBUG_INFO_BTF enabled, which is what exposes
/sys/kernel/btf/vmlinux.
Fixes:
- Verify BTF is available: ls /sys/kernel/btf/vmlinux
- If missing, set KTSTR_KERNEL to a kernel build directory that contains a vmlinux with BTF: export KTSTR_KERNEL=/path/to/linux
- Build a kernel with CONFIG_DEBUG_INFO_BTF=y.
- Some minimal/cloud kernels strip BTF. Use a distro kernel or build your own.
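To see which BTF source a build would most likely pick up, the two ends of the search can be checked by hand. This is a simplified sketch under stated assumptions: the function name btf_source is hypothetical, and it only looks at KTSTR_KERNEL and the sysfs fallback, skipping the ./linux, ../linux, and installed-kernel steps that build.rs also tries.

```shell
#!/bin/sh
# Print the first BTF source that exists, or return 1 if there is none.
# $1: kernel build directory (may be empty)
# $2: sysfs BTF path (defaults to /sys/kernel/btf/vmlinux)
btf_source() {
    btf_kernel_dir=$1
    btf_sysfs=${2:-/sys/kernel/btf/vmlinux}
    if [ -n "$btf_kernel_dir" ] && [ -r "$btf_kernel_dir/vmlinux" ]; then
        printf '%s\n' "$btf_kernel_dir/vmlinux"
    elif [ -r "$btf_sysfs" ]; then
        printf '%s\n' "$btf_sysfs"
    else
        return 1
    fi
}

btf_source "${KTSTR_KERNEL:-}" || echo "no BTF source found" >&2
```

The check only confirms that a candidate file is readable. It does not verify that a vmlinux in the build directory actually contains a .BTF section.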
busybox download failure
failed to obtain busybox source.
tarball (https://github.com/mirror/busybox/archive/refs/tags/1_36_1.tar.gz): download: ...
git clone (https://github.com/mirror/busybox.git): ...
Check network connectivity. First build requires internet access.
build.rs downloads busybox source on first build (tarball first,
git clone fallback). Subsequent builds use the cached binary in
$OUT_DIR.
Fixes:
- Verify network connectivity to github.com.
- If behind a proxy, set HTTP_PROXY/HTTPS_PROXY.
- After a successful first build, no network access is needed unless cargo clean removes the cached binary.
/dev/kvm not accessible
The host-side pre-flight emits one of the following, depending on whether the device node is missing or merely unreadable:
/dev/kvm not found. KVM requires:
- Linux kernel with KVM support (CONFIG_KVM)
- Access to /dev/kvm (check permissions or add user to 'kvm' group)
- Hardware virtualization enabled in BIOS (VT-x/AMD-V)
/dev/kvm: permission denied. Add your user to the 'kvm' group:
sudo usermod -aG kvm $USER
then log out and back in.
ktstr boots Linux kernels in KVM virtual machines. The host must have
KVM enabled and the user must have read+write access to /dev/kvm.
Diagnose:
- Check the device exists and inspect its permissions and owning group: ls -l /dev/kvm. Typical output: crw-rw---- 1 root kvm 10, 232 ....
- Confirm the kvm group exists and see its members: getent group kvm.
Fixes:
- Load the KVM module: modprobe kvm_intel or modprobe kvm_amd.
- Follow the group-membership hint in the error text above (log out and back in afterward for the group change to take effect).
- On cloud VMs (GCP, AWS, Azure) or nested hypervisors, nested virtualization is typically off by default. Enable it per the provider’s instructions (e.g. GCP --enable-nested-virtualization, AWS metal/.metal instance types, Azure Dv3/Ev3+ with nested virt).
- In CI, ensure the runner has KVM access (e.g. runs-on: [self-hosted, kvm]).
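The two pre-flight messages distinguish a missing device node from one that exists but cannot be opened. A rough shell equivalent of that distinction is sketched below. The function name kvm_status is hypothetical, and the check approximates ktstr's own pre-flight rather than reproducing it.

```shell
#!/bin/sh
# Classify a device node as "missing", "denied", or "ok".
# $1: device path (defaults to /dev/kvm)
kvm_status() {
    kvm_dev=${1:-/dev/kvm}
    if [ ! -e "$kvm_dev" ]; then
        echo missing
    elif [ -r "$kvm_dev" ] && [ -w "$kvm_dev" ]; then
        echo ok
    else
        echo denied
    fi
}

case $(kvm_status) in
    missing) echo "load kvm_intel or kvm_amd, or enable nested virtualization" ;;
    denied)  echo "add your user to the kvm group, then log out and back in" ;;
    ok)      echo "/dev/kvm is usable" ;;
esac
```

This is a convenient first step in a CI job, where a missing /dev/kvm would otherwise only show up once the first VM boot fails.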
No kernel found
no kernel found
hint: set KTSTR_KERNEL to a kernel source directory, a version (e.g. `6.14.2`), or a cache key (see `cargo ktstr kernel list`), or run `cargo ktstr kernel build` to populate the cache
hint: or set KTSTR_TEST_KERNEL=/path/to/bzImage to point at a pre-built bootable image directly (bypasses KTSTR_KERNEL resolution)
On aarch64 the second hint says Image instead of bzImage.
ktstr shell and cargo ktstr shell auto-download the latest
stable kernel when no --kernel is specified and no kernel is found
via the discovery chain. See
Kernel auto-download failures for
download-specific errors.
ktstr needs a bootable Linux kernel image (bzImage on x86_64,
Image on aarch64). See
Kernel discovery for the
search order.
Fixes:
- Download and cache a kernel: cargo ktstr kernel build
- Build from a local tree: cargo ktstr kernel build --source ../linux
- Set KTSTR_TEST_KERNEL to an explicit image path.
- The host’s installed kernel works for basic testing.
Scheduler not found
scheduler 'scx_mitosis' not found. Set KTSTR_SCHEDULER or
place it next to the test binary or in target/{debug,release}/
When using SchedulerSpec::Discover, ktstr searches for the scheduler
binary in:
1. KTSTR_SCHEDULER environment variable.
2. Sibling of the current executable (and, when the test binary lives under target/{debug,release}/deps/, the parent of deps/ one level up; this covers the nextest / integration-test layout where the scheduler binary sits next to the test binary’s parent).
3. target/debug/.
4. target/release/.
5. On-demand build via cargo build against the scheduler’s package name. ktstr invokes the build itself when the preceding four locations have no match, so a fresh checkout with an unbuilt scheduler still produces a usable binary without the caller pre-running cargo build.
Fixes:
- Build the scheduler first: cargo build -p scx_mitosis (optional when step 5 above can build it on demand, but pre-building makes the first test run faster).
- Set KTSTR_SCHEDULER=/path/to/binary.
- Use SchedulerSpec::Path for an explicit path in #[ktstr_test].
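When the error is unexpected, it can help to walk the first four search locations by hand. The sketch below approximates that order in shell. The function name find_scheduler is hypothetical, the on-demand cargo build step is left out, and whether ktstr falls through when KTSTR_SCHEDULER points at a missing file is an assumption of this sketch.

```shell
#!/bin/sh
# Print the first existing executable candidate for a scheduler binary.
# $1: scheduler name, $2: directory that holds the test binary.
# Returns 1 when no candidate exists.
find_scheduler() {
    sched_name=$1
    sched_exe_dir=$2
    # Sibling of the current executable.
    set -- "$sched_exe_dir/$sched_name"
    # Under .../deps, also try the parent directory one level up.
    case $sched_exe_dir in
        */deps) set -- "$@" "${sched_exe_dir%/deps}/$sched_name" ;;
    esac
    set -- "$@" "target/debug/$sched_name" "target/release/$sched_name"
    # The environment variable, when set, is checked first.
    if [ -n "${KTSTR_SCHEDULER:-}" ]; then
        set -- "$KTSTR_SCHEDULER" "$@"
    fi
    for candidate in "$@"; do
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

Running find_scheduler scx_mitosis target/debug/deps from the workspace root prints the path that would be used, or returns non-zero when only the on-demand build could satisfy the lookup.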
Scheduler died
scheduler process died unexpectedly after completing step 2 of 5 (12.3s into test)
The scheduler process died while the scenario was running. This is usually a crash. The exact message varies by when the crash was detected (between steps, during workload, after completion).
The failure output contains diagnostic sections (each present only when relevant):
- --- scheduler log ---: the scheduler’s stdout and stderr, cycle-collapsed for readability.
- --- diagnostics ---: init stage classification, VM exit code, and the last 20 lines of kernel console output.
- --- sched_ext dump ---: sched_ext_dump trace lines from the guest kernel (present when a SysRq-D dump fired).
Set RUST_BACKTRACE=1 to force --- diagnostics --- on all
failures, not just scheduler deaths.
Next steps:
- Check the --- scheduler log --- for the crash reason.
- Check --- diagnostics --- for BPF errors or kernel oops in the kernel console.
- Enable auto_repro in the test to capture the crash path with BPF probes. See Auto-Repro.
- Run with a longer duration and specific flags to narrow the reproducer.
See Investigate a Crash for the complete failure output format and auto-repro walkthrough.
Insufficient hugepages
performance_mode: WARNING: no 2MB hugepages available, guest memory will use regular pages
performance_mode: WARNING: need N 2MB hugepages, only K free — falling back to regular pages
Performance mode requests 2MB
hugepages for guest memory. The first form fires when no 2MB hugepages
are reserved on the host (free == 0); the second fires when some are
reserved but fewer than the run needs. In both cases the VM falls back
to regular pages and continues to boot.
Fix:
Allocate hugepages before the run:
echo 2048 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
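The number to write depends on the guest memory size. As a rough guide, and assuming guest memory is backed one-to-one by 2MB pages (the exact figure ktstr requests may differ), the page count is the guest size in MiB divided by two, rounded up. The helper name hugepages_needed and the 4 GiB guest size are illustrative.

```shell
#!/bin/sh
# Number of 2MB hugepages needed to back a guest of $1 MiB, rounded up.
hugepages_needed() {
    echo $(( ($1 + 1) / 2 ))
}

# free_hugepages is the kernel's count of currently unallocated 2MB pages.
free_file=/sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages
need=$(hugepages_needed 4096)   # hypothetical 4 GiB guest
free=$(cat "$free_file" 2>/dev/null || echo 0)
if [ "$free" -lt "$need" ]; then
    echo "need $need 2MB hugepages, only $free free"
fi
```

For a 4096 MiB guest this gives 2048 pages, which matches the value written in the command above.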
Worker assertion failures
stuck 4500ms on cpu2 at +3200ms (threshold 3000ms)
unfair cgroup: spread=42% (8-50%) 4 workers on 4 cpus (threshold 35%)
The Assert checks (max_gap_ms, max_spread_pct, etc.) detected a
worker metric outside the configured thresholds.
Fixes:
- Check whether the topology has enough CPUs for the scenario. Small topologies produce higher contention, larger gaps, and more spread.
- Use execute_steps_with() with a custom Assert to override thresholds for scenarios that need relaxed limits.
- Check the scheduler’s behavior under the specific flag profile that triggered the failure.
Cgroup name typos
No such file or directory: /sys/fs/cgroup/.../nonexistent/cgroup.procs
A cgroup name passed to Op::SetCpuset, Op::Spawn, or
CgroupManager::move_tasks does not match a previously created
cgroup. Cgroup names are case-sensitive strings.
Fixes:
- Verify the cgroup name matches the name in Op::AddCgroup or CgroupDef::named().
- When using dynamic cgroup names (e.g. format!("cg_{i}")), ensure the same formatting is used in all ops referencing that cgroup.
CpusetSpec errors
cgroup 'cg_0': CpusetSpec validation failed: not enough usable CPUs (4) for 8 partitions
cgroup 'cg_1': CpusetSpec validation failed: index 3 >= partition count 3
cgroup 'cg_2': CpusetSpec validation failed: Range fracs must lie in [0.0, 1.0]: start_frac=-1, end_frac=0.5
A CpusetSpec cannot produce a valid cpuset for the test topology.
execute_steps treats this as a hard error and aborts the step so the
downstream slicing/arithmetic in CpusetSpec::resolve is never reached
with inputs that would panic.
Fixes:
- Guard with a topology check before creating the step: if ctx.topo.usable_cpus().len() < needed { return Ok(AssertResult::skip(...)); }
- Call CpusetSpec::validate(&ctx) in your scenario builder so failures surface before execute_steps runs.
- Reduce the partition count or use CpusetSpec::Llc instead of Disjoint on topologies with fewer CPUs than partitions.
- For Range/Overlap, keep fractions finite and inside [0.0, 1.0]; Range additionally requires start_frac < end_frac.
Worker count mismatches
PipeIo requires num_workers divisible by 2, got 3
Grouped work types (PipeIo, FutexPingPong, CachePipe,
FutexFanOut, FanOutCompute) require num_workers divisible by their
group size. WorkType::worker_group_size() returns the divisor.
Fixes:
- Set CgroupDef::workers(n) to a value divisible by the work type’s group size (2 for pipe/futex pairs, fan_out + 1 for FutexFanOut and FanOutCompute).
- Use an ungrouped work type (SpinWait, Mixed, Bursty, IoSyncWrite, IoRandRead, IoConvoy, YieldHeavy) if worker count flexibility is needed.
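Picking a valid worker count is a matter of rounding up to the next multiple of the group size. The arithmetic is sketched below. The helper name round_up_workers is hypothetical, and the fan_out value of 3 is only an example.

```shell
#!/bin/sh
# Round a worker count ($1) up to the next multiple of the group size ($2).
round_up_workers() {
    echo $(( ($1 + $2 - 1) / $2 * $2 ))
}

round_up_workers 3 2    # pipe/futex pairs, group size 2: prints 4
round_up_workers 10 4   # fan_out = 3, group size fan_out + 1 = 4: prints 12
```

A count that is already a multiple of the group size comes back unchanged, so the helper is safe to apply unconditionally.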
Cache corruption
6.14.2-tarball-x86_64-kc... (corrupt: metadata.json malformed: ...)
warning: entries marked (corrupt) cannot be used — cached metadata is missing, malformed, or references a missing image. Inspect the entry directory under ~/.cache/ktstr/kernels to remove it manually, or run `kernel clean --corrupt-only --force` which removes ONLY corrupt entries and leaves valid ones intact. ...
A cached kernel entry has missing, unparseable, or
schema-drifted metadata.json, or metadata that references an
image file that is no longer present. This can happen after a
partial write (e.g. disk full, killed process), or after a ktstr
release that evolved the metadata schema in a
non-backward-compatible way. cargo ktstr kernel list surfaces
these as (corrupt: ...) rows; the trailing footer on stderr
summarizes the remediation options. CacheDir::lookup returns
None for corrupt entries so test runs at a specific cache key
fall through to the normal re-build path.
The JSON form (cargo ktstr kernel list --json) emits an
error_kind field on every corrupt entry — one of "missing",
"unreadable", "schema_drift", "malformed", "truncated",
"parse_error", "image_missing", or "unknown" — so CI
scripts can dispatch on a stable token without parsing the
free-form error string.
Fixes:
- Remove ONLY corrupt entries (keeps valid ones intact): cargo ktstr kernel clean --corrupt-only --force
- Remove the corrupt entry along with everything else: cargo ktstr kernel clean --force
- Rebuild a specific version after cleanup: cargo ktstr kernel build --force 6.14.2
- Override the cache directory via KTSTR_CACHE_DIR if the default location is on a problematic filesystem.
- See cargo ktstr kernel clean for all cleanup options, including --keep N --force to preserve the N newest entries.
Stale vmlinux.btf or default.profraw in kernel source tree
After upgrading from an older ktstr version, you may notice extra files in your kernel source directory:
- <source>/vmlinux.btf: a sidecar of the kernel’s .BTF section bytes. Older ktstr versions wrote it next to whichever vmlinux they parsed, including source-tree builds. Current ktstr only writes the sidecar when the vmlinux path is inside the cache root (~/.cache/ktstr/kernels/ or whatever KTSTR_CACHE_DIR points at) so source trees stay pristine.
- <source>/default.profraw: an LLVM coverage runtime artifact. Older ktstr versions could leave it in cwd when a coverage-instrumented cargo ktstr test was launched from inside the kernel tree. Current ktstr injects LLVM_PROFILE_FILE=<cargo-ktstr-binary-parent>/llvm-cov-target/default-{pid}-{binary_hash}.profraw for the bare nextest path so the profraw lands next to the cargo-ktstr binary regardless of cwd. See profraw layout for the per-population directory map.
Both files are leftover state from prior runs and are safe to remove:
rm -f /path/to/linux/vmlinux.btf
rm -f /path/to/linux/default.profraw
If you also see them turn up under a different ktstr-driven
source tree, check that you are running a current ktstr build
(re-run cargo build or cargo install ktstr to pick up the
fix) before deleting again — the guards live in the resolver,
not on disk, so an old binary will keep regenerating these
files.
Cache directory not found
HOME is unset; cannot resolve cache directory. The container init or login shell did not assign HOME — set it to an absolute path, or set KTSTR_CACHE_DIR to an absolute path (e.g. /tmp/ktstr-cache) or XDG_CACHE_HOME to specify a cache location explicitly.
HOME is set to the empty string; cannot resolve cache directory. An empty HOME usually means a Dockerfile or shell rc has `export HOME=` or `ENV HOME=` with no value. Either set HOME to a real absolute path, or set KTSTR_CACHE_DIR to an absolute path (e.g. /tmp/ktstr-cache) or XDG_CACHE_HOME to specify a cache location explicitly.
The kernel image cache requires a writable directory. ktstr resolves
it as: KTSTR_CACHE_DIR > $XDG_CACHE_HOME/ktstr/kernels/ >
$HOME/.cache/ktstr/kernels/. The first form fires when HOME is
absent from the environment (typical of bare container inits or
systemd units with no Environment=HOME=...); the second fires when
HOME is present but assigned to the empty string.
Fix: Set KTSTR_CACHE_DIR to an explicit path, or ensure HOME
is set to a real absolute path.
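The resolution order can be reproduced in a few lines of shell to check what a given environment would resolve to. This is a sketch of the precedence described above, not ktstr's implementation. The function name resolve_cache_dir is hypothetical, and treating an empty KTSTR_CACHE_DIR or XDG_CACHE_HOME the same as an unset one is an assumption.

```shell
#!/bin/sh
# Resolve the cache directory from three values passed explicitly:
# $1: KTSTR_CACHE_DIR, $2: XDG_CACHE_HOME, $3: HOME.
# Empty arguments are treated as unset (an assumption of this sketch).
resolve_cache_dir() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"
    elif [ -n "$2" ]; then
        printf '%s\n' "$2/ktstr/kernels"
    elif [ -n "$3" ]; then
        printf '%s\n' "$3/.cache/ktstr/kernels"
    else
        echo "HOME is unset or empty; cannot resolve cache directory" >&2
        return 1
    fi
}

resolve_cache_dir "${KTSTR_CACHE_DIR:-}" "${XDG_CACHE_HOME:-}" "${HOME:-}" \
    || echo "set KTSTR_CACHE_DIR to an absolute path" >&2
```

Running this inside the failing container or systemd unit shows quickly whether any of the three variables reaches the process.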
Stale kconfig
warning: entries marked (stale kconfig) were built against a different ktstr.kconfig.
Rebuild with: kernel build --force <entry version>
cargo ktstr kernel list marks entries whose stored ktstr_kconfig_hash
differs from the current embedded ktstr.kconfig fragment. This
happens after updating ktstr (which may change the kconfig fragment).
Fix:
Rebuilds happen automatically on the next cargo ktstr kernel build
for stale entries. Use --force to override the cache for other
reasons. See cargo ktstr kernel list
for the full listing output.
Kernel auto-download failures
ktstr: no kernel found, downloading latest stable
fetch https://www.kernel.org/releases.json: <error>
ktstr auto-downloads a kernel when no --kernel is specified and no
kernel is found via the discovery chain (see
Kernel discovery). The same
download path runs when --kernel specifies a version (e.g.
--kernel 6.14.2) that is not in the cache. The CLI label varies:
ktstr: for the standalone binary, cargo ktstr: for the cargo
subcommand.
The <error> above is the underlying reqwest error (DNS resolution,
connection refused, timeout, TLS handshake failure).
fetch https://www.kernel.org/releases.json: HTTP 503
kernel.org returned a non-success status code.
no stable kernel with patch >= 8 found in releases.json
ktstr requires a stable or longterm release with patch version >= 8 to avoid brand-new major versions that may have build issues. This error means releases.json contained no qualifying version.
download https://cdn.kernel.org/.../linux-6.14.10.tar.xz: <error>
Network failure during tarball download (same causes as above).
extract tarball: <error>
Tarball extraction failed. Common causes: disk full, insufficient permissions on the temp directory, or a truncated download.
kernel built but cache store failed — cannot return image from temporary directory
The kernel built successfully but could not be stored in the cache. Check disk space and permissions on the cache directory.
For version-specific download errors (HTTP 404, HTML responses), see Kernel download failures.
Fixes:
- Verify network connectivity: curl -sI https://www.kernel.org/releases.json
- Check DNS resolution for kernel.org and cdn.kernel.org.
- Check disk space: the download, extraction, and build require significant disk space.
- If behind a proxy, set HTTP_PROXY, HTTPS_PROXY, and NO_PROXY (reqwest respects these environment variables).
- Override the cache directory via KTSTR_CACHE_DIR if the default location has insufficient space or permissions.
- Pre-download a kernel explicitly: cargo ktstr kernel build 6.14.10 to isolate whether the failure is in version resolution or download.
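The patch >= 8 rule explains why the newest release listed in releases.json is sometimes skipped. The filter can be illustrated with a small version check. The helper name patch_qualifies is hypothetical, and treating a two-part version such as 6.15 as patch 0 is an assumption of this sketch rather than documented ktstr behavior.

```shell
#!/bin/sh
# Succeed when a version string such as 6.14.10 has a patch component >= 8.
# A version without a third component (e.g. 6.15) is treated as patch 0.
patch_qualifies() {
    ver_patch=$(printf '%s' "$1" | cut -d. -f3)
    [ "${ver_patch:-0}" -ge 8 ]
}

for v in 6.15 6.15.3 6.14.10; do
    if patch_qualifies "$v"; then
        echo "$v qualifies"
    else
        echo "$v is skipped"
    fi
done
```

Of the three example versions, only 6.14.10 passes the check.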
Kernel download failures
These errors occur when cargo ktstr kernel build or --kernel
specifies an explicit version. For network and extraction errors
during auto-download, see
Kernel auto-download failures.
version 6.14.22 not found. latest 6.14.x: 6.14.10
The requested version does not exist on kernel.org. When a version in the same major.minor series is available in releases.json, the error suggests it.
version 5.4.99 not found
When the series is EOL or not in releases.json, only the “not found” message appears (no suggestion).
RC tarball not found: https://git.kernel.org/torvalds/t/linux-6.15-rc3.tar.gz
RC tarballs are removed from git.kernel.org after the stable version
ships. Use --git with a git.kernel.org URL to clone the tag instead.
download ...: server returned HTML instead of tarball (URL may be invalid)
Some CDN error pages return HTTP 200 with text/html content type.
The download rejects these responses.
Fixes:
- Check the suggested version in the error message.
- Verify the version exists: check https://www.kernel.org/releases.json for available versions.
- For RC releases, use --git with a git.kernel.org URL instead of a tarball download.
- Run cargo ktstr kernel build without a version to automatically fetch the latest stable.
Shell mode issues
stdin must be a terminal
stdin must be a terminal for interactive shell mode
cargo ktstr shell requires a terminal for bidirectional I/O
forwarding. Piped or redirected stdin is rejected.
Fix: Run from an interactive terminal session.
include file not found
-i strace: not found in filesystem or PATH
Bare names (without /, ., or ..) are searched in PATH. If the
binary is not in PATH, use an explicit path.
--include-files path not found: ./missing-file
Explicit paths (containing / or starting with .) must exist on
disk.
Fix: Verify the file exists and use the correct path.
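Which of the two errors appears depends on how the argument is classified. The rule, as described above, can be sketched as a shell pattern match. The function name include_kind is hypothetical, and the sketch follows the description (contains / or starts with .) rather than ktstr's source.

```shell
#!/bin/sh
# Classify an -i / --include-files argument as "path" (must exist on disk)
# or "bare" (searched in PATH).
include_kind() {
    case $1 in
        */*|.*) echo path ;;
        *)      echo bare ;;
    esac
}

for arg in strace ./missing-file /usr/bin/strace; do
    echo "$arg: $(include_kind "$arg")"
done
```

A file in the current directory therefore needs the ./ prefix. Without it, the name is looked up in PATH only.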
include directory contains no files
warning: -i ./empty-dir: directory contains no regular files
The directory passed to --include-files was walked recursively but
contained no regular files. FIFOs, device nodes, and sockets are
skipped during the walk.
Fix: Verify the directory contains the files you expect.
Model load failed
GGUF model load failed at /home/.../models/Qwen3-4B-Q4_K_M.gguf. The
file may be corrupt or incompatible with the linked llama.cpp version
— delete the file and re-run `cargo ktstr model fetch` to download
a fresh copy. Check stderr for the upstream llama.cpp rejection reason.
The host-side LLM extraction backend (OutputFormat::LlmExtract)
could not load the cached GGUF weights. The cached file is either
corrupt (partial download, disk error) or incompatible with the
linked llama.cpp version.
Diagnose:
- Re-run with RUST_LOG=llama-cpp-2=info (or =debug for more detail) to surface llama.cpp’s own rejection reason on stderr. The first call to the inference engine routes llama_cpp_2::send_logs_to_tracing events through the tracing subscriber under target "llama-cpp-2" (literal hyphens; see Environment Variables for the EnvFilter shape).
- cargo ktstr model status reports the cache path and verdict (Matches, Mismatches, CheckFailed, NotCached).
Fix:
- Delete the cached file and re-fetch: cargo ktstr model clean && cargo ktstr model fetch. clean removes both the GGUF artifact and its .mtime-size warm-cache sidecar; fetch re-downloads from the pinned URL and SHA-checks the result.
- If model status reports Mismatches, the local file’s hash diverged from the pinned digest; cargo ktstr model fetch will refuse to overwrite a corrupt cache and the explicit clean is required first.
- If you set KTSTR_MODEL_OFFLINE=1, unset it for the re-fetch. See cargo ktstr model.
Flock timeout / NFS rejection
flock LOCK_EX on run-dir target/ktstr/6.14-abc1234 timed out after
30s (lockfile target/ktstr/.locks/6.14-abc1234.lock, holders:
pid=12345 cmd=cargo-ktstr test --kernel 6.14). A peer cargo
ktstr test process is writing sidecars to the same
{kernel}-{project_commit} directory; wait for it to finish or kill
it, then retry.
A peer process is holding the per-run-key advisory flock(2)
that serializes sidecar writes; the helper polled for 30 s and
gave up. Run-dir locks live at
{runs_root}/.locks/{kernel}-{project_commit}.lock and serialize
the (pre-clear + write) cycle so two concurrent ktstr runs
sharing the same key can’t tear partially-written sidecars.
target/ktstr/.locks/6.14-abc1234.lock: filesystem NFS is not
supported for ktstr lockfiles (NFSv3 is advisory-only without
an NLM peer; NFSv4 byte-range locking does not cover flock(2)).
Move the lockfile path to a local filesystem (tmpfs, ext4, xfs,
btrfs, f2fs, bcachefs).
try_flock rejects NFS, CIFS, SMB2, CephFS, AFS, and FUSE mounts
because flock(2) semantics on those filesystems are unreliable
(see Resource Budget — Filesystem requirement
for the per-filesystem rationale).
Diagnose:
- cargo ktstr locks (or ktstr locks --watch 1s) prints every ktstr flock currently held on the host with PID + cmdline, including per-run-key sidecar locks under the “Run-dir locks” section (see cargo ktstr locks).
- cat /proc/locks | grep '<lockfile-path-from-error>' falls back to the kernel’s own flock enumeration when the holder is outside ktstr.
- stat -f -c '%T' <runs-root> reports the filesystem type when the rejection error names NFS/CIFS/SMB/CephFS/AFS/FUSE.
Fix:
- For a peer-holder timeout: wait for the peer to finish, kill it (kill <pid> from the holder list), or retry with the peer done.
- For an NFS / remote-fs rejection: relocate the runs root to a local filesystem. Set KTSTR_SIDECAR_DIR to a local path (/tmp/ktstr-sidecars, a tmpfs mount). Note that this override path also skips the cross-process flock, so concurrent runs targeting the same KTSTR_SIDECAR_DIR have no serialization between them. Use the override only for a single-process run or per-process distinct paths.
- The kernel cache’s lockfiles ({cache_root}/.locks/*.lock) face the same constraint; override KTSTR_CACHE_DIR to a local filesystem if the default resolves to NFS. See Cache directory not found.
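Before relocating anything, the filesystem type of a candidate directory can be checked against the rejected families. The sketch below builds on the stat command from the Diagnose list. The function name fs_is_rejected is hypothetical, and the type strings (for example fuseblk or smb2) are assumptions about what GNU stat prints and may differ between coreutils versions.

```shell
#!/bin/sh
# Succeed when a filesystem type name belongs to a family that try_flock
# rejects. The patterns are assumptions about the names printed by
# `stat -f -c %T` on Linux.
fs_is_rejected() {
    case $1 in
        nfs*|cifs|smb*|ceph|afs|fuse*) return 0 ;;
        *) return 1 ;;
    esac
}

dir=${1:-.}
fstype=$(stat -f -c '%T' "$dir" 2>/dev/null || echo unknown)
if fs_is_rejected "$fstype"; then
    echo "$dir is on $fstype: move the runs root to a local filesystem"
else
    echo "$dir is on $fstype"
fi
```

Pointing the script at the intended KTSTR_SIDECAR_DIR or KTSTR_CACHE_DIR location before setting the variable avoids trading one rejected mount for another.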
Tests pass locally but fail in CI
Common causes:
- No KVM: CI runners need hardware virtualization. Check for /dev/kvm access.
- Fewer CPUs: gauntlet topology presets up to 252 CPUs may exceed the runner’s capacity. Use smaller topologies.
- No kernel: set KTSTR_TEST_KERNEL in the CI environment.
- No CAP_SYS_NICE or rtprio: performance-mode tests require CAP_SYS_NICE or an rtprio limit for RT scheduling, and enough host CPUs for exclusive LLC reservation. Pass --no-perf-mode (or set KTSTR_NO_PERF_MODE=1) to disable all performance mode features. Tests with performance_mode=true are skipped entirely under --no-perf-mode.
- Debug thresholds: CI often runs debug builds. Debug builds use relaxed thresholds (3000ms gap, 35% spread) but may still hit limits on slow runners. See default thresholds.