Running Out of Disk Space in Production
A sysadmin's worst nightmare unfolds as a product launch immediately hits a critical disk space wall. This post meticulously details a frantic, multi-stage debugging process, revealing how seemingly innocent Nginx defaults can silently consume gigabytes. It's a relatable tale of infrastructure woes and the crucial lesson of reading documentation carefully, especially under pressure.
The Lowdown
The author recounts a stressful production incident where a newly launched server, meant to distribute digital files, quickly ran out of disk space, leading to customer complaints and service disruption. What began as a simple static file server on a small Hetzner machine escalated into an urgent debugging session to restore service.
- Initial Crisis: Minutes after launch, the 40GB disk on the NixOS server filled up, causing "Insufficient system storage" errors and service interruption for customers trying to download 2.2GB files.
- Panic Debugging (Initial Attempts): The author frantically tried `nix-collect-garbage -d` and `journalctl --vacuum-time=1s` to clear space, but these offered only temporary relief or failed due to lack of space.
- Temporary Solution: Unable to upgrade the server, the author moved the large `/nix/store` to a separate 12GB volume, following NixOS Wiki instructions. This stabilized the root partition, allowing the service to partially recover.
- Large File Download Issue: Even with more space, customers reported that large 2.2GB files were failing to download halfway through.
- Nginx Misconfiguration 1: Investigation revealed that the Nginx `proxy_max_temp_file_size` default of 1024m was too small for the 2.2GB files. Increasing it to 5000m resolved this download issue.
- Nginx Misconfiguration 2 (Root Cause): Disk space spikes reappeared. Using `lsof +L1`, the author discovered Nginx was holding 14.5GB of "deleted" temporary files. A closer look at the Nginx documentation revealed that `proxy_buffering` was enabled by default, causing Nginx to buffer entire responses to disk. Disabling `proxy_buffering` and setting `proxy_max_temp_file_size` to 0 finally stabilized disk usage at 20%.
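The final fix can be sketched as an Nginx config fragment. The directive values come from the post; the `location` block and upstream name are placeholders for illustration:

```nginx
# Illustrative proxy block; upstream name and path are placeholders.
location /downloads/ {
    proxy_pass http://backend;

    # Stream responses to the client instead of spooling them to disk
    # (proxy_buffering is "on" by default).
    proxy_buffering off;

    # Never write proxied responses to temp files. The default cap of
    # 1024m silently truncated the 2.2GB downloads.
    proxy_max_temp_file_size 0;
}
```

With `proxy_buffering off`, Nginx forwards data to the client as it arrives from the upstream, so disk-backed temp files are no longer created for these responses at all.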
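The "deleted but still held open" situation that `lsof +L1` surfaces is easy to reproduce on any Linux box. Here is a minimal sketch (the temp file and file descriptor are illustrative, not the author's actual Nginx temp files): as long as any process keeps a descriptor open, `rm` does not return the space to the filesystem, which is exactly why `df` showed a full disk while `du` could not account for it.

```shell
#!/bin/sh
# Reproduce a "deleted but open" file, the condition `lsof +L1`
# (link count < 1) reports. File and fd are illustrative.
tmpfile=$(mktemp)

exec 3<"$tmpfile"        # keep the file open on fd 3 (stands in for Nginx)
rm "$tmpfile"            # unlink it -- disk space is NOT freed yet

# On Linux, /proc shows the held-open deleted file:
ls -l "/proc/$$/fd/3"    # path is suffixed with "(deleted)"

exec 3<&-                # only closing the descriptor frees the space
```

Restarting or reloading Nginx closes its descriptors, which is why the space only reappeared once the underlying buffering behavior was fixed rather than the symptoms.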
This incident highlights the common pitfalls of rushing under pressure and the often-overlooked importance of thoroughly understanding the default behaviors and documentation of critical components like Nginx, especially when dealing with large files and resource constraints.