
build server postmortem: disk full, CI dead

On 2026-03-07 at ~01:06 UTC, a git push to russell.ballestrini.net triggered CI pipeline #21163. It failed instantly:

bash: line 62: printf: write error: No space left on device

Build server root filesystem sat at 100%. Zero bytes available. Every CI job across every project failed the same way: the machine could not write a single byte.


timeline

  • ~01:06 UTC — Pipeline #21163 fails. No space left on device during get_sources.
  • ~01:35 UTC — SSH into build.unturf.com. df confirms / at 100% (98G used, 0 available). Swap at 97%. 31 zombie processes.
  • ~01:40 UTC — Removed 9.4G stale golden image tarball from /tmp. Freed first bytes.
  • ~01:45 UTC — Killed 15+ orphaned python3 processes from a dead uncloseai-cli CI build holding deleted directories open.
  • ~01:50 UTC — Deleted 18 stopped LXD build-* containers (orphaned CI test containers).
  • ~01:52 UTC — Purged 12 cached LXD images (50G total, including 9G Ubuntu server images). Disk dropped to 42%.
  • ~01:55 UTC — Cleaned stale build slot directories (slots 4-63). Reduced concurrent from 64 to 4. Installed cleanup cron. Disk at 31%.
  • ~02:00 UTC — Retriggered pipeline. Build passed.

root causes

Five independent failures conspired. Each alone was survivable. Together they filled 98G in a matter of weeks.

1. gitlab-runner concurrent = 64

/etc/gitlab-runner/config.toml set concurrent = 64. A shell executor creates a full git checkout per slot per project, so 64 slots across multiple repos meant 64 copies of every codebase. The unsandbox-all-upgradable repo alone stored a 3.7G FreeBSD and a 710M OpenBSD qcow2 image per slot; slot 15 by itself consumed 4.3G.
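When a disk fills like this, ranking the slot directories by size shows where the weight is. A minimal sketch; the default path below is an assumption (shell-executor checkouts typically live under the runner user's builds directory), point SLOTS_DIR wherever your runner actually checks out:

```shell
# Rank build-slot directories by size, biggest first.
# SLOTS_DIR is an assumed default; override it for your runner.
slots="${SLOTS_DIR:-/home/gitlab-runner/builds}"
du -s "$slots"/* 2>/dev/null | sort -rn | head -5
```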

Fix applied: concurrent = 4. Matches actual workload. Stale slots 4-63 removed. Enforced permanently via salt state gitlab.build-host.ubuntu (pillar-configurable: gitlab-runner:concurrent).
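The enforcement itself is small. A sketch of the sed the salt state applies (sed pattern assumed; the file path is from the runner docs). gitlab-runner periodically rereads config.toml, so no restart should be needed:

```shell
# Pin concurrent = 4 in the runner config; safe to re-run, since the
# replacement matches its own output on a second pass.
config="${RUNNER_CONFIG:-/etc/gitlab-runner/config.toml}"
if [ -f "$config" ]; then
    sed -i 's/^concurrent = .*/concurrent = 4/' "$config"
fi
```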

2. LXD images never expired

CI pipelines launch LXD containers for multi-distro testing (Alpine, Arch, Debian, Fedora, Rocky, Ubuntu, FreeBSD, OpenBSD). LXD cached every base image indefinitely. 12 images accumulated to 50G. Two Ubuntu server images weighed 9G each.

images.remote_cache_expiry defaulted to 10 days, but images never actually expired: images.auto_update_interval kept refreshing them, resetting the clock each time. No other pruning mechanism existed.

Fix applied: images.remote_cache_expiry = 3, images.auto_update_interval = 0. Weekly cron prunes unused images. Enforced permanently via salt state lxd.build-host.
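Both knobs are standard LXD server config keys, so the hotfix reduces to two commands against a live LXD daemon (the /snap/bin path matches how this host installs LXD):

```shell
/snap/bin/lxc config set images.remote_cache_expiry 3
/snap/bin/lxc config set images.auto_update_interval 0
```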

3. orphaned LXD containers

CI pipelines create build-* containers but don't always clean them on failure. 18 stopped containers accumulated. Each carried a full rootfs snapshot.

Fix applied: Hourly cron deletes stopped build-* containers.

4. zombie processes holding deleted files

An uncloseai-cli build spawned python3 test servers that outlived the CI job. The runner deleted the build directory, but 15+ processes still held references to the deleted path (their cwd pointed at a deleted inode), so the kernel could not release the space. The 31 zombie count in motd was the tell.
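The mechanics are easy to demo: the kernel only frees an unlinked file's blocks once the last open descriptor (or cwd) pointing at it goes away. A minimal illustration:

```shell
# A file deleted while a process still holds it open keeps its blocks
# allocated; /proc shows the dangling reference as "(deleted)".
tmpfile=$(mktemp)
echo "still on disk" > "$tmpfile"
exec 3<"$tmpfile"                  # hold an open descriptor
rm "$tmpfile"                      # unlink: the name is gone, the blocks are not
state=$(readlink "/proc/$$/fd/3")  # e.g. "/tmp/tmp.XXXX (deleted)"
echo "$state"
exec 3<&-                          # close: now the kernel frees the space
```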

Fix applied: Hourly cron kills orphaned gitlab-runner processes holding deleted directories (lsof +L1 detection).

5. no disk monitoring

Disk grew from comfortable to critical over weeks. No alert fired. No cron checked. Nobody noticed until CI broke.

Fix applied: Cron checks disk every 15 minutes, logs warnings above 85% to /var/log/disk-alert.log.
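The check itself is one awk expression; exercising it with fake percentages shows the behavior at the 85% boundary:

```shell
# The alert predicate, minus cron and df: values above 85 warn,
# values at or below 85 stay silent.
check() { echo "$1" | awk '{if ($1 > 85) print "DISK WARNING:", $1"%"}'; }
check 91   # prints: DISK WARNING: 91%
check 85   # prints nothing (strictly greater-than)
```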


cleanup cron

Installed at /etc/cron.d/build-cleanup. Two crontab gotchas apply: each entry must be a single line (cron has no backslash continuation), and every % in a command must be escaped as \% or cron treats it as a newline:

# Delete stopped LXD build containers hourly
0 * * * * root lxc list --format csv -c n,s | grep STOPPED | grep '^build-' | cut -d',' -f1 | xargs -r -I{} lxc delete {}

# Prune cached LXD images weekly (Sunday 3am)
0 3 * * 0 root lxc image list --format csv -c f | xargs -r -I{} lxc image delete {}

# Clean /tmp files older than 7 days
0 4 * * * root find /tmp -maxdepth 1 -user gitlab-runner -mtime +7 -delete

# Kill orphaned processes holding deleted directories
0 * * * * root lsof +L1 | grep gitlab-runner | grep deleted | awk '{print $2}' | sort -u | xargs -r kill

# Disk alert: log warning if / exceeds 85%
*/15 * * * * root df --output=pcent / | tail -1 | tr -d ' \%' | awk '{if ($1 > 85) print strftime("[\%Y-\%m-\%d \%H:\%M]"), "DISK WARNING:", $1"\%"}' >> /var/log/disk-alert.log

salt states (permanent fixes)

Manual hotfixes on a server vanish on reprovision. Every fix got codified into salt states in foxhop-states so a salt-call state.highstate reproduces them.

gitlab/build-host/ubuntu.sls:

  • Enforces concurrent in config.toml via sed (default 4, configurable via pillar gitlab-runner:concurrent)
  • Manages /etc/cron.d/build-cleanup with all five cleanup jobs
  • Uses /snap/bin/lxc full paths (snap-installed LXD)
  • Disk alerts go to syslog via logger -t disk-alert instead of a file

lxd/build-host.sls:

  • Sets images.remote_cache_expiry = 3 (days)
  • Sets images.auto_update_interval = 0 (CI pulls fresh images on demand)
  • Both idempotent with unless guards
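With those constraints, each setting reduces to one guarded command. A hypothetical sketch of a single stanza (the state id and exact guard are my assumptions; the key, value, and snap path come from the fixes above):

```yaml
# lxd/build-host.sls (sketch): set the cache expiry only when it
# differs, so repeated highstates are no-ops.
lxd-remote-cache-expiry:
  cmd.run:
    - name: /snap/bin/lxc config set images.remote_cache_expiry 3
    - unless: /snap/bin/lxc config get images.remote_cache_expiry | grep -qx 3
```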

top.sls already targets build.unturf.com with both states:

'build.unturf.com':
  - gitlab.build-host.ubuntu
  - lxd.build-host

disk recovery

Before:  98G used /  98G total = 100%  (0 bytes free)
After:   29G used /  98G total =  31%  (65G free)

Space recovered:
  LXD images purged .............. 50G
  Stale build slots removed ...... 10G
  /tmp tarball removed ...........  9G
  Go module cache cleaned ........  1G
                                  ----
  Total freed .................... 70G

lessons

hardcoded limits carry hidden debt. concurrent = 64 seemed harmless on day one. By week eight, 64 slots × multiple repos × qcow2 images = full disk. Today's default becomes tomorrow's outage.

one symptom doesn't equal one problem. Every CI failure looked like "disk full." The actual defect graph had five nodes: excessive concurrency, image caching, container orphaning, process zombies, no monitoring. Fixing only one delays the outage. Fixing all five prevents it.

absence of signal is itself a signal. No disk alert fired because no disk alert existed. Silence in monitoring always means one of two things: everything works, or nothing watches. Assume the second until proven otherwise.

build servers need janitors. CI systems produce waste: cached images, stopped containers, orphaned processes, stale checkouts. Without automated cleanup, waste accumulates until something breaks. The cron job costs nothing. The outage costs a pipeline.

manual fixes rot. salt states persist. Every fix applied manually on the build server got codified into salt states within the same session. gitlab/build-host/ubuntu.sls manages concurrent limits & the cleanup cron. lxd/build-host.sls manages image cache expiry. A highstate reproduces the fix. A reprovision preserves it. Manual hotfixes buy time. Configuration management buys permanence.





© Russell Ballestrini.