
An Ode to Bzip

Faced with disk space limitations for a Minecraft mod's Lua code, the author meticulously compares compression algorithms and concludes that bzip unexpectedly emerges as the top performer. This deep technical dive explains how bzip's use of the Burrows-Wheeler Transform (BWT) sidesteps the complexities of LZ77-based methods, achieving superior ratios and efficient decoding for text-like data. The story resonates on HN for its practical application of theoretical computer science, challenging long-held assumptions about "slow" algorithms.

Score: 11
Comments: 1
Highest Rank: #5
Time on Front Page: 3h
First Seen: Mar 14, 5:00 PM
Last Seen: Mar 14, 7:00 PM
Rank Over Time (chart)

The Lowdown

The author, working on the ComputerCraft Minecraft mod, faced the challenge of compressing Lua code to fit limited disk space. Instead of defaulting to a common compression library like LibDeflate, they set out to find the simplest algorithm with the best compression ratio. The surprising conclusion, contrary to popular belief, was that bzip (and its successor bzip3) outperformed more modern alternatives on text-like data.
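The comparison is easy to reproduce in miniature with Python's standard library, which ships DEFLATE (`zlib`), BWT-based bzip2 (`bz2`), and LZMA (`lzma`). The payload below is a hypothetical stand-in for the article's 327 KB Lua codebase, so exact sizes and rankings will differ from the author's benchmark:

```python
import bz2
import lzma
import zlib

# Hypothetical stand-in for the article's 327 KB Lua codebase.
data = b"local function handler(event, ...)\n  return dispatch[event](...)\nend\n" * 500

# Compress the same payload with three stdlib codecs at high settings.
results = {
    "zlib (DEFLATE, LZ77-based)": len(zlib.compress(data, 9)),
    "bz2 (BWT-based)": len(bz2.compress(data, 9)),
    "lzma (LZMA, LZ77-derived)": len(lzma.compress(data, preset=9)),
}
for name, size in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: {size:6d} bytes ({size / len(data):.2%} of original)")
```

On real, less repetitive source code the gaps between codecs are far narrower than on this synthetic input.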

  • The author benchmarked various popular compression algorithms (gzip, zstd, xz, brotli, lzip, bzip2, bzip3) on a 327 KB Lua codebase.
  • bzip2 and bzip3 achieved significantly better compression ratios, beating even xz and zstd.
  • This superiority stems from bzip's fundamental difference: it uses the Burrows-Wheeler Transform (BWT) instead of the LZ77 scheme employed by most other algorithms.
  • BWT reorders text so that similar characters cluster together, leaving output that simple run-length encoding (RLE) compresses well; LZ77, by contrast, replaces repeated substrings with backreferences.
  • BWT-based compression is deterministic and lacks the heuristics of LZ77, meaning optimal ratios are achieved without complex tuning (e.g., bzip3 doesn't even have compression level flags).
  • While bzip is commonly perceived as slow, the author argues that reputation deserves nuance: where maximum compression is paramount, its slightly slower decode time is an acceptable trade-off for the better ratio.
  • Furthermore, a custom bzip-style decoder, optimized for a single Huffman table, can be remarkably small (around 1.5 KB), competitive with or smaller than highly optimized LZ77 decoders.
  • The author highlights a broader insight: improvements in compression often come from restructuring data and applying powerful general-purpose methods rather than complicating algorithms, drawing parallels to machine learning.
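To make the BWT-then-RLE point concrete, here is a toy sketch (not the article's code): a naive O(n² log n) transform on a made-up string. Real implementations use suffix arrays, and bzip2 additionally applies move-to-front and Huffman coding after the transform rather than plain RLE.

```python
def bwt(s: str, sentinel: str = "\x00") -> str:
    """Naive Burrows-Wheeler Transform: sort all rotations, keep the last column."""
    s += sentinel  # unique terminator makes the transform invertible
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def runs(s: str) -> int:
    """Count the runs a run-length encoder would emit."""
    return sum(1 for i, ch in enumerate(s) if i == 0 or s[i - 1] != ch)

text = "banana banana banana"
out = bwt(text)
print(repr(out))                                       # 'aaannnnnnbbb  \x00aaaaaa'
print(runs(text), "runs before,", runs(out), "after")  # 20 runs before, 6 after
```

Sorting the rotations groups characters by the context that follows them, which is why the last column collapses into long runs.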

In essence, the article provides a compelling technical argument for bzip as an unsung hero for text and code compression, advocating for its use where maximum ratio and a relatively compact decoder are priorities, even challenging its reputation for slowness in specific contexts.
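The compact-decoder claim becomes more plausible once you see how little logic inverting a BWT takes. Below is a naive sketch, not the author's 1.5 KB decoder: real bzip2 decoding uses linear-time LF-mapping plus Huffman and move-to-front stages, while this quadratic version simply rebuilds the sorted rotation table column by column.

```python
def inverse_bwt(last: str, sentinel: str = "\x00") -> str:
    """Invert a BWT by repeatedly prepending the last column and re-sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The row ending in the sentinel is the original text plus sentinel.
    return next(row for row in table if row.endswith(sentinel))[:-1]

# Input is the BWT of "banana banana banana" with a "\x00" sentinel appended.
print(inverse_bwt("aaannnnnnbbb  \x00aaaaaa"))  # banana banana banana
```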