
build server postmortem: disk full, CI dead

On 2026-03-07 at ~01:06 UTC, a git push to russell.ballestrini.net triggered CI pipeline #21163. It failed instantly:

bash: line 62: printf: write error: No space left on device

Build server root filesystem sat at 100%. Zero bytes available. Every CI job across every project failed the same way: the machine could not write a single byte.


timeline

  • ~01:06 UTC — Pipeline #21163 fails. No space left on device during get_sources.
  • ~01:35 UTC — SSH into build.unturf.com. df confirms / at 100% (98G used, 0 available). Swap at 97%. 31 zombie processes.
  • ~01:40 UTC — Removed 9.4G stale golden image tarball from /tmp. Freed first bytes.
  • ~01:45 UTC — Killed 15+ orphaned python3 processes from a dead uncloseai-cli CI build holding deleted directories open.
  • ~01:50 UTC — Deleted 18 stopped LXD build-* containers (orphaned CI test containers).
  • ~01:52 UTC — Purged 12 cached LXD images (50G total, including 9G Ubuntu server images). Disk dropped to 42%.
  • ~01:55 UTC — Cleaned stale build slot directories (slots 4-63). Reduced concurrent from 64 to 4. Installed cleanup cron. Disk at 31%.
  • ~02:00 UTC — Retriggered pipeline. Build passed.

root causes

Five independent failures conspired. Each alone was survivable. Together they filled 98G in a matter of weeks.

1. gitlab-runner concurrent = 64

/etc/gitlab-runner/config.toml set concurrent = 64. A shell executor creates a full git checkout per slot per project, so 64 slots across multiple repos meant 64 copies of every codebase. The unsandbox-all-upgradable repo alone stored a 3.7G FreeBSD and a 710M OpenBSD qcow2 image per slot; slot 15 by itself consumed 4.3G.
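When a disk fills like this, ranking the slot directories by size shows where the weight is. A minimal sketch; the default path below is an assumption (shell-executor checkouts typically live under the runner user's builds directory), point SLOTS_DIR wherever your runner actually checks out:

```shell
# Rank build-slot directories by size, biggest first.
# SLOTS_DIR is an assumed default; override it for your runner.
slots="${SLOTS_DIR:-/home/gitlab-runner/builds}"
du -s "$slots"/* 2>/dev/null | sort -rn | head -5
```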

Fix applied: concurrent = 4. Matches actual workload. Stale slots 4-63 removed. Enforced permanently via salt state gitlab.build-host.ubuntu (pillar-configurable: gitlab-runner:concurrent).
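The enforcement itself is small. A sketch of the sed the salt state applies (sed pattern assumed; the file path is from the runner docs). gitlab-runner periodically rereads config.toml, so no restart should be needed:

```shell
# Pin concurrent = 4 in the runner config; safe to re-run, since the
# replacement matches its own output on a second pass.
config="${RUNNER_CONFIG:-/etc/gitlab-runner/config.toml}"
if [ -f "$config" ]; then
    sed -i 's/^concurrent = .*/concurrent = 4/' "$config"
fi
```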

2. LXD images never expired

CI pipelines launch LXD containers for multi-distro testing (Alpine, Arch, Debian, Fedora, Rocky, Ubuntu, FreeBSD, OpenBSD). LXD cached every base image indefinitely. 12 images accumulated to 50G. Two Ubuntu server images weighed 9G each.

images.remote_cache_expiry defaulted to 10 days, but images never actually expired: images.auto_update_interval kept refreshing them, resetting the clock each time. No other pruning mechanism existed.

Fix applied: images.remote_cache_expiry = 3, images.auto_update_interval = 0. Weekly cron prunes unused images. Enforced permanently via salt state lxd.build-host.
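Both knobs are standard LXD server config keys, so the hotfix reduces to two commands against a live LXD daemon (the /snap/bin path matches how this host installs LXD):

```shell
/snap/bin/lxc config set images.remote_cache_expiry 3
/snap/bin/lxc config set images.auto_update_interval 0
```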

3. orphaned LXD containers

CI pipelines create build-* containers but don't always clean them on failure. 18 stopped containers accumulated. Each carried a full rootfs snapshot.

Fix applied: Hourly cron deletes stopped build-* containers.

4. zombie processes holding deleted files

An uncloseai-cli build spawned python3 test servers that outlived the CI job. The runner deleted the build directory, but 15+ processes still held references to the deleted path (their cwd pointed at a deleted inode), so the kernel could not release the space. The 31 zombie count in motd was the tell.
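The mechanics are easy to demo: the kernel only frees an unlinked file's blocks once the last open descriptor (or cwd) pointing at it goes away. A minimal illustration:

```shell
# A file deleted while a process still holds it open keeps its blocks
# allocated; /proc shows the dangling reference as "(deleted)".
tmpfile=$(mktemp)
echo "still on disk" > "$tmpfile"
exec 3<"$tmpfile"                  # hold an open descriptor
rm "$tmpfile"                      # unlink: the name is gone, the blocks are not
state=$(readlink "/proc/$$/fd/3")  # e.g. "/tmp/tmp.XXXX (deleted)"
echo "$state"
exec 3<&-                          # close: now the kernel frees the space
```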

Fix applied: Hourly cron kills orphaned gitlab-runner processes holding deleted directories (lsof +L1 detection).

5. no disk monitoring

Disk grew from comfortable to critical over weeks. No alert fired. No cron checked. Nobody noticed until CI broke.

Fix applied: Cron checks disk every 15 minutes, logs warnings above 85% to /var/log/disk-alert.log.
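The check itself is one awk expression; exercising it with fake percentages shows the behavior at the 85% boundary:

```shell
# The alert predicate, minus cron and df: values above 85 warn,
# values at or below 85 stay silent.
check() { echo "$1" | awk '{if ($1 > 85) print "DISK WARNING:", $1"%"}'; }
check 91   # prints: DISK WARNING: 91%
check 85   # prints nothing (strictly greater-than)
```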


cleanup cron

Installed at /etc/cron.d/build-cleanup. Two crontab gotchas apply: each entry must be a single line (cron has no backslash continuation), and every % in a command must be escaped as \% or cron treats it as a newline:

# Delete stopped LXD build containers hourly
0 * * * * root lxc list --format csv -c n,s | grep STOPPED | grep '^build-' | cut -d',' -f1 | xargs -r -I{} lxc delete {}

# Prune cached LXD images weekly (Sunday 3am)
0 3 * * 0 root lxc image list --format csv -c f | xargs -r -I{} lxc image delete {}

# Clean /tmp files older than 7 days
0 4 * * * root find /tmp -maxdepth 1 -user gitlab-runner -mtime +7 -delete

# Kill orphaned processes holding deleted directories
0 * * * * root lsof +L1 | grep gitlab-runner | grep deleted | awk '{print $2}' | sort -u | xargs -r kill

# Disk alert: log warning if / exceeds 85%
*/15 * * * * root df --output=pcent / | tail -1 | tr -d ' \%' | awk '{if ($1 > 85) print strftime("[\%Y-\%m-\%d \%H:\%M]"), "DISK WARNING:", $1"\%"}' >> /var/log/disk-alert.log

salt states (permanent fixes)

Manual hotfixes on a server vanish on reprovision. Every fix got codified into salt states in foxhop-states so a salt-call state.highstate reproduces them.

gitlab/build-host/ubuntu.sls:

  • Enforces concurrent in config.toml via sed (default 4, configurable via pillar gitlab-runner:concurrent)
  • Manages /etc/cron.d/build-cleanup with all five cleanup jobs
  • Uses /snap/bin/lxc full paths (snap-installed LXD)
  • Disk alerts go to syslog via logger -t disk-alert instead of a file

lxd/build-host.sls:

  • Sets images.remote_cache_expiry = 3 (days)
  • Sets images.auto_update_interval = 0 (CI pulls fresh images on demand)
  • Both idempotent with unless guards
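With those constraints, each setting reduces to one guarded command. A hypothetical sketch of a single stanza (the state id and exact guard are my assumptions; the key, value, and snap path come from the fixes above):

```yaml
# lxd/build-host.sls (sketch): set the cache expiry only when it
# differs, so repeated highstates are no-ops.
lxd-remote-cache-expiry:
  cmd.run:
    - name: /snap/bin/lxc config set images.remote_cache_expiry 3
    - unless: /snap/bin/lxc config get images.remote_cache_expiry | grep -qx 3
```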

top.sls already targets build.unturf.com with both states:

'build.unturf.com':
  - gitlab.build-host.ubuntu
  - lxd.build-host

disk recovery

Before:  98G used /  98G total = 100%  (0 bytes free)
After:   29G used /  98G total =  31%  (65G free)

Space recovered:
  LXD images purged .............. 50G
  Stale build slots removed ...... 10G
  /tmp tarball removed ...........  9G
  Go module cache cleaned ........  1G
                                  ----
  Total freed .................... 70G

lessons

hardcoded limits carry hidden debt. concurrent = 64 seemed harmless on day one. By week eight, 64 slots × multiple repos × qcow2 images = full disk. Today's default becomes tomorrow's outage.

one symptom doesn't equal one problem. Every CI failure looked like "disk full." The actual defect graph had five nodes: excessive concurrency, image caching, container orphaning, process zombies, no monitoring. Fixing only one delays the outage. Fixing all five prevents it.

absence of signal is itself a signal. No disk alert fired because no disk alert existed. Silence in monitoring always means one of two things: everything works, or nothing watches. Assume the second until proven otherwise.

build servers need janitors. CI systems produce waste: cached images, stopped containers, orphaned processes, stale checkouts. Without automated cleanup, waste accumulates until something breaks. The cron job costs nothing. The outage costs a pipeline.

manual fixes rot. salt states persist. Every fix applied manually on the build server got codified into salt states within the same session. gitlab/build-host/ubuntu.sls manages concurrent limits & the cleanup cron. lxd/build-host.sls manages image cache expiry. A highstate reproduces the fix. A reprovision preserves it. Manual hotfixes buy time. Configuration management buys permanence.





© Russell Ballestrini.