An Ode to Bzip
Faced with disk space limitations for a Minecraft mod's Lua code, the author meticulously compares compression algorithms, concluding that bzip unexpectedly emerges as the top performer. This deep technical dive explains how bzip's unique Burrows-Wheeler Transform (BWT) sidesteps the complexities of LZ77-based methods, achieving superior ratios and efficient decoding for text-like data. The story resonates on HN for its practical application of theoretical computer science, challenging long-held assumptions about 'slow' algorithms.
The Lowdown
The author, working on the ComputerCraft Minecraft mod, faced the challenge of compressing Lua code due to limited disk space. Instead of defaulting to common compression libraries like LibDeflate, they embarked on a quest to find the simplest algorithm with the best compression ratio. The surprising conclusion, contrary to popular belief, was that bzip (and its successor bzip3) offered superior performance for text-like data compared to more modern alternatives.
- The author benchmarked various popular compression algorithms (gzip, zstd, xz, brotli, lzip, bzip2, bzip3) on a 327 KB Lua codebase.
- `bzip2` and `bzip3` achieved significantly better compression ratios, beating even `xz` and `zstd`.
- This superiority stems from bzip's fundamental difference: it uses the Burrows-Wheeler Transform (BWT) instead of the LZ77 scheme employed by most other algorithms.
- BWT reorders text to group similar characters together, making the result highly compressible with simple run-length encoding (RLE), unlike LZ77, which replaces repetitive text with backreferences.
- BWT-based compression is deterministic and lacks the heuristics of LZ77, meaning optimal ratios are achieved without complex tuning (e.g., `bzip3` doesn't even have compression level flags).
- While bzip is commonly perceived as slow, the author argues this is nuanced; for critical applications where maximum compression is paramount, its slightly slower decode time is an acceptable trade-off for a better ratio.
- Furthermore, a custom bzip-style decoder, optimized for a single Huffman table, can be remarkably small (around 1.5 KB), competitive with or smaller than highly optimized LZ77 decoders.
- The author highlights a broader insight: improvements in compression often come from restructuring data and applying powerful general-purpose methods rather than complicating algorithms, drawing parallels to machine learning.
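The BWT-then-RLE idea described above can be sketched in a few lines of Python. This is an illustrative naive implementation (real bzip2 inserts move-to-front and Huffman coding stages, and uses suffix sorting rather than materializing every rotation):

```python
def bwt(s: str) -> str:
    """Naive Burrows-Wheeler Transform: sort all rotations of the
    input and take the last column. '\x00' acts as a sentinel."""
    s += "\x00"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def rle(s: str) -> list[tuple[str, int]]:
    """Run-length encode a string into (char, run_length) pairs."""
    runs: list[tuple[str, int]] = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs

text = "banana banana banana"
transformed = bwt(text)
# The transform groups identical characters into long runs,
# so RLE needs far fewer pairs than on the raw text.
print(repr(transformed))
print(len(rle(transformed)), "runs after BWT vs", len(rle(text)), "runs before")
```

On repetitive, text-like input the transformed string collapses into a handful of runs, which is exactly the structure the article credits for bzip's strong ratios on code.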
In essence, the article provides a compelling technical argument for bzip as an unsung hero for text and code compression, advocating for its use where maximum ratio and a relatively compact decoder are priorities, even challenging its reputation for slowness in specific contexts.
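The ratio comparison is easy to spot-check with Python's standard library, which bundles zlib (LZ77/DEFLATE), bzip2, and xz/LZMA bindings. The sample payload below is a hypothetical stand-in for the author's 327 KB Lua codebase, and the ranking will vary with the corpus; this only shows how such a benchmark is run:

```python
import bz2
import lzma
import zlib

# Hypothetical stand-in for a Lua codebase: repetitive, text-like data.
sample = (
    b"local function update(state)\n"
    b"  for i = 1, #state do state[i] = state[i] + 1 end\n"
    b"end\n"
) * 500

compressors = [
    ("gzip/zlib", lambda d: zlib.compress(d, 9)),
    ("bzip2", lambda d: bz2.compress(d, 9)),
    ("xz/lzma", lambda d: lzma.compress(d, preset=9)),
]

for name, compress in compressors:
    size = len(compress(sample))
    print(f"{name:10s} {size:7d} bytes ({size / len(sample):.1%} of original)")
```

Swapping in a real codebase (and external tools like zstd, brotli, or bzip3, which have no stdlib bindings) reproduces the kind of table the author built.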