On 2026-03-07 at ~01:06 UTC, a git push to russell.ballestrini.net triggered CI pipeline #21163. It failed instantly:
bash: line 62: printf: write error: No space left on device
The build server's root filesystem sat at 100%. Zero bytes available. Every CI job across every project failed the same way. The build server had gone deaf.
timeline
- ~01:06 UTC — Pipeline #21163 fails. No space left on device during get_sources.
- ~01:35 UTC — SSH into build.unturf.com. df confirms / at 100% (98G used, 0 available). Swap at 97%. 31 zombie processes.
- ~01:40 UTC — Removed 9.4G stale golden image tarball from /tmp. Freed first bytes.
- ~01:45 UTC — Killed 15+ orphaned python3 processes from a dead uncloseai-cli CI build holding deleted directories open.
- ~01:50 UTC — Deleted 18 stopped LXD build-* containers (orphaned CI test containers).
- ~01:52 UTC — Purged 12 cached LXD images (50G total, including 9G Ubuntu server images). Disk dropped to 42%.
- ~01:55 UTC — Cleaned stale build slot directories (slots 4-63). Reduced concurrent from 64 to 4. Installed cleanup cron. Disk at 31%.
- ~02:00 UTC — Retriggered pipeline. Build passed.
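The triage above followed the standard full-disk playbook. A sketch of the commands involved (run as root on the affected host; paths are illustrative, not from the incident):

```shell
df -h /                                # confirm usage and free space on /
du -xh --max-depth=1 /var 2>/dev/null | sort -rh | head -5   # largest dirs on this filesystem
# files unlinked while still held open keep their disk space allocated;
# `lsof +L1` lists them, or inspect /proc directly:
find /proc/[0-9]*/fd -lname '*(deleted)' 2>/dev/null | head
```

`du -x` stays on one filesystem, which matters here: crossing into a mounted volume would blame the wrong disk.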
root causes
Five independent failures conspired. Each alone was survivable. Together they filled 98G in weeks.
1. gitlab-runner concurrent = 64
/etc/gitlab-runner/config.toml set concurrent = 64. A shell executor creates a full git checkout per slot per project, so 64 slots across multiple repos meant 64 copies of every codebase. The unsandbox-all-upgradable repo stored a 3.7G FreeBSD and a 710M OpenBSD qcow2 image per slot; slot 15 alone consumed 4.3G.
Fix applied: concurrent = 4. Matches actual workload. Stale slots 4-63 removed. Enforced permanently via salt state gitlab.build-host.ubuntu (pillar-configurable: gitlab-runner:concurrent).
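The salt state enforces the limit via sed. A minimal sketch of that approach against a sample file (the real path is /etc/gitlab-runner/config.toml; values here are illustrative):

```shell
# sample config.toml standing in for /etc/gitlab-runner/config.toml
cat > /tmp/config.toml <<'EOF'
concurrent = 64
check_interval = 0
EOF
# rewrite the concurrent line; the pattern matches any existing value,
# so re-running it is a no-op once the value is already 4
sed -i 's/^concurrent = .*/concurrent = 4/' /tmp/config.toml
grep '^concurrent' /tmp/config.toml   # -> concurrent = 4
```

Because the sed pattern anchors on `^concurrent = `, the edit is idempotent, which is what makes it safe to run from a highstate.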
2. LXD images never expired
CI pipelines launch LXD containers for multi-distro testing (Alpine, Arch, Debian, Fedora, Rocky, Ubuntu, FreeBSD, OpenBSD). LXD cached every base image indefinitely. 12 images accumulated to 50G. Two Ubuntu server images weighed 9G each.
images.remote_cache_expiry defaulted to 10 days, but cached images never aged out because images.auto_update_interval kept refreshing them. No other pruning mechanism existed.
Fix applied: images.remote_cache_expiry = 3, images.auto_update_interval = 0. Weekly cron prunes unused images. Enforced permanently via salt state lxd.build-host.
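Applied by hand against a live LXD daemon, the settings look like this (requires the lxc client; a config fragment, not a runnable script):

```shell
lxc config set images.remote_cache_expiry 3    # expire unused cached images after 3 days
lxc config set images.auto_update_interval 0   # never auto-refresh cached images
lxc config get images.remote_cache_expiry      # verify: prints 3
```

With auto-update disabled, CI pays a one-time pull per fresh image instead of hosting a permanently growing cache.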
3. orphaned LXD containers
CI pipelines create build-* containers but don't always clean them on failure. 18 stopped containers accumulated. Each carried a full rootfs snapshot.
Fix applied: Hourly cron deletes stopped build-* containers.
4. zombie processes holding deleted files
An uncloseai-cli build spawned python3 test servers that outlived the CI job. The runner deleted the build directory, but 15+ processes still held references to the deleted path (their cwd pointed to a deleted inode), pinning the space. Strictly orphans rather than true zombies, but the 31-zombie count in motd was what gave them away.
Fix applied: Hourly cron kills orphaned gitlab-runner processes holding deleted directories (lsof +L1 detection).
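The detection trick is worth seeing in isolation: unlink a file while a process still holds it open, and the kernel keeps the blocks allocated until the last descriptor closes. That is exactly what lsof +L1 (link count below 1) finds. A self-contained demo (Linux; tail stands in for the orphaned python3 servers):

```shell
tmpfile=$(mktemp)
tail -f "$tmpfile" &          # a process holding the file open
holder=$!
sleep 1                       # let tail open the file before we unlink it
rm "$tmpfile"                 # name gone; inode and disk blocks remain
ls -l /proc/"$holder"/fd | grep deleted   # the held fd shows "(deleted)"
kill "$holder"
```

Until the holder dies, df reports the space as used while du cannot find it, the classic signature of this failure mode.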
5. no disk monitoring
Disk usage grew from comfortable to critical over weeks. No alert fired. No cron checked. Nobody noticed until CI broke.
Fix applied: Cron checks disk every 15 minutes, logs warnings above 85% to /var/log/disk-alert.log.
cleanup cron
Installed at /etc/cron.d/build-cleanup:
# Delete stopped LXD build-* containers hourly
0 * * * * root lxc list --format csv -c n,s | grep ',STOPPED' | grep '^build-' | cut -d',' -f1 | xargs -r -I{} lxc delete {}
# Remove all cached LXD images weekly (Sunday 3am); CI pulls fresh images on demand
0 3 * * 0 root lxc image list --format csv -c f | xargs -r -I{} lxc image delete {}
# Clean /tmp files older than 7 days
0 4 * * * root find /tmp -maxdepth 1 -user gitlab-runner -mtime +7 -delete
# Kill orphaned gitlab-runner processes holding deleted directories
0 * * * * root lsof +L1 | grep gitlab-runner | grep deleted | awk '{print $2}' | sort -u | xargs -r kill
# Disk alert: log warning above 85 percent. Note crontab entries must be a
# single line, and literal percent signs must be backslash-escaped.
*/15 * * * * root df --output=pcent / | tail -1 | tr -d ' \%' | awk '{if ($1 > 85) print strftime("[\%Y-\%m-\%d \%H:\%M]"), "DISK WARNING:", $1"\%"}' >> /var/log/disk-alert.log
salt states (permanent fixes)
Manual hotfixes on a server vanish on reprovision. Every fix got codified into salt states in foxhop-states, so a single salt-call state.highstate reproduces them.
gitlab/build-host/ubuntu.sls:
- Enforces concurrent in config.toml via sed (default 4, configurable via pillar gitlab-runner:concurrent)
- Manages /etc/cron.d/build-cleanup with all five cleanup jobs
- Uses /snap/bin/lxc full paths (snap-installed LXD)
- Disk alerts go to syslog via logger -t disk-alert instead of a file
lxd/build-host.sls:
- Sets images.remote_cache_expiry = 3 (days)
- Sets images.auto_update_interval = 0 (CI pulls fresh images on demand)
- Both idempotent with unless guards
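A sketch of what the LXD state could look like, assuming snap-installed LXD and cmd.run with unless guards (state IDs and exact structure are hypothetical, not copied from foxhop-states):

```yaml
lxd-image-cache-expiry:
  cmd.run:
    - name: /snap/bin/lxc config set images.remote_cache_expiry 3
    - unless: /snap/bin/lxc config get images.remote_cache_expiry | grep -qx 3

lxd-image-auto-update:
  cmd.run:
    - name: /snap/bin/lxc config set images.auto_update_interval 0
    - unless: /snap/bin/lxc config get images.auto_update_interval | grep -qx 0
```

The unless guard is what makes a cmd.run idempotent: the set only fires when the get reports a different value.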
top.sls already targets build.unturf.com with both states:
'build.unturf.com':
- gitlab.build-host.ubuntu
- lxd.build-host
disk recovery
Before: 98G used / 98G total = 100% (0 bytes free)
After: 29G used / 98G total = 31% (65G free)
Space recovered:
LXD images purged .............. 50G
Stale build slots removed ...... 10G
/tmp tarball removed ........... 9G
Go module cache cleaned ........ 1G
----
Total freed .................... 70G
lessons
hardcoded limits carry hidden debt. concurrent = 64 seemed harmless on day one. By week eight, 64 slots × multiple repos × qcow2 images = full disk. Today's default becomes tomorrow's outage.
five failures don't equal five problems. Each CI failure looked like "disk full." The actual defect graph had five nodes: excessive concurrency, image caching, container orphaning, process zombies, no monitoring. Fixing only one delays the outage. Fixing all five prevents it.
absence of signal is still a signal. No disk alert fired because no disk alert existed. Silence in monitoring always means one of two things: everything works, or nothing watches. Assume the second until proven otherwise.
build servers need janitors. CI systems produce waste: cached images, stopped containers, orphaned processes, stale checkouts. Without automated cleanup, waste accumulates until something breaks. The cron jobs cost nothing. The outage cost every pipeline on the server.
manual fixes rot. salt states persist. Every fix applied manually on the build server got codified into salt states within the same session. gitlab/build-host/ubuntu.sls manages concurrent limits & the cleanup cron. lxd/build-host.sls manages image cache expiry. A highstate reproduces the fix. A reprovision preserves it. Manual hotfixes buy time. Configuration management buys permanence.