HN
Today

Running Out of Disk Space in Production

A product launch hits a full disk within minutes. The post walks through a frantic, multi-stage debugging session and shows how seemingly innocent Nginx defaults can silently consume gigabytes of disk. It's a relatable tale of infrastructure woes and a reminder to read the documentation carefully, especially under pressure.

Score: 4
Comments: 0
Highest Rank: #10
Time on Front Page: 8h
First Seen: Apr 7, 11:00 AM
Last Seen: Apr 7, 11:00 PM

The Lowdown

The author recounts a stressful production incident where a newly launched server, meant to distribute digital files, quickly ran out of disk space, leading to customer complaints and service disruption. What began as a simple static file server on a small Hetzner machine escalated into an urgent debugging session to restore service.

  • Initial Crisis: Minutes after launch, the 40GB disk on the NixOS server filled up, causing "Insufficient system storage" errors and service interruption for customers trying to download 2.2GB files.
  • Panic Debugging (Initial Attempts): The author frantically tried nix-collect-garbage -d and journalctl --vacuum-time=1s to clear space, but these offered only temporary relief or failed due to lack of space.
  • Temporary Solution: Unable to upgrade the server, the author moved the large /nix/store to a separate 12GB volume, following NixOS Wiki instructions. This stabilized the root partition, allowing the service to partially recover.
  • Large File Download Issue: Even with more space, customers reported that large 2.2GB files were failing to download halfway through.
  • Nginx Misconfiguration 1: Investigation revealed the Nginx proxy_max_temp_file_size default of 1024m was too small for the 2.2GB files. Increasing it to 5000m resolved this download issue.
  • Nginx Misconfiguration 2 (Root Cause): Disk space spikes reappeared. Using lsof +L1, which lists open files that have already been unlinked, the author discovered Nginx was holding 14.5GB of "deleted" temporary files. A closer look at the Nginx documentation revealed that proxy_buffering is enabled by default, so Nginx spools any part of an upstream response that does not fit in its memory buffers to temporary files on disk, up to proxy_max_temp_file_size per request. Disabling proxy_buffering and setting proxy_max_temp_file_size to 0 finally stabilized disk usage at 20%.
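
The lsof +L1 step in the last bullet works because a file that has been unlinked still occupies disk space for as long as some process keeps it open. The following is a minimal Linux-only sketch of the same idea, not the author's tooling: it walks /proc/<pid>/fd and reports descriptors whose link target ends in " (deleted)". The function names and the proc_root parameter are illustrative assumptions.

```python
import os
from pathlib import Path


def deleted_open_files(proc_root="/proc"):
    """Yield (pid, path, size_bytes) for open files whose directory entry is gone.

    Linux-only sketch: the kernel appends " (deleted)" to the target of
    /proc/<pid>/fd/<n> once the file has been unlinked but is still open.
    proc_root is a hypothetical parameter, added so the scan can be pointed
    at a test directory.
    """
    for pid_dir in Path(proc_root).iterdir():
        if not pid_dir.name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            fds = list((pid_dir / "fd").iterdir())
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        for fd in fds:
            try:
                target = os.readlink(fd)
            except OSError:
                continue  # descriptor was closed while we were scanning
            if not target.endswith(" (deleted)"):
                continue
            try:
                size = os.stat(fd).st_size  # follows the link to the open file
            except OSError:
                size = 0  # size unavailable; still report the entry
            yield int(pid_dir.name), target, size


def summarize(entries):
    """Return the total bytes held by deleted-but-open files, per PID."""
    totals = {}
    for pid, _path, size in entries:
        totals[pid] = totals.get(pid, 0) + size
    return totals
```

Calling summarize(deleted_open_files()) as root gives a per-process total, which is where a figure like the 14.5GB held by Nginx would show up. Without root, only the caller's own processes are readable and the rest are silently skipped.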

The incident shows how easy it is to misdiagnose a problem when rushing under pressure, and why it pays to understand the default behavior of critical components like Nginx, especially when serving large files on a resource-constrained machine.
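
To make the resource constraint concrete, here is a back-of-envelope sketch of how much disk per-request temp files could occupy under the three proxy_max_temp_file_size settings mentioned in the summary. It assumes each in-flight response spools at most min(file size, cap) to disk and that a cap of 0 disables temp files. The count of 8 concurrent slow downloads is a made-up figure, and the function name is hypothetical.

```python
def worst_case_spool_mb(concurrent_downloads, file_size_mb, max_temp_file_mb):
    """Rough upper bound on disk used by per-request temp files.

    Assumes each in-flight response can spool at most
    min(file size, proxy_max_temp_file_size) to disk, and that a cap of 0
    disables temp files entirely. Memory buffers are ignored.
    """
    per_request = min(file_size_mb, max_temp_file_mb)
    return concurrent_downloads * per_request


FILE_MB = 2200  # the ~2.2GB downloads from the incident, rounded

for cap_mb in (1024, 5000, 0):
    # 8 concurrent slow clients is a made-up figure for illustration
    print(cap_mb, worst_case_spool_mb(8, FILE_MB, cap_mb))
# prints:
# 1024 8192
# 5000 17600
# 0 0
```

Under these assumptions, raising the cap to 5000m lets every download spool its full 2.2GB, so the 14.5GB of deleted temp files reported in the summary corresponds to roughly six or seven fully spooled responses on a 40GB disk.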