Asymmetric Quantization: Near-Lossless Retrieval with 97% Storage Reduction
Mixedbread introduces 'Asymmetric Quantization' to shrink the storage footprint of late interaction retrieval models, a critical component for high-precision search. By keeping query vectors at int8 precision while binarizing document vectors, they cut raw document-vector storage by 97% with minimal impact on retrieval quality. This makes multi-vector search architectures practical and cost-effective at scale, addressing a key challenge in big-data and AI applications.
The Lowdown
Mixedbread's blog post details 'Asymmetric Quantization,' their approach to the storage and cost challenges of late interaction retrieval models, which offer superior search precision but demand substantial resources. The technique is what makes such models viable in production at billion-document scale, as exemplified by their internal silo engine.
- Late interaction models, unlike single-vector embeddings, generate numerous vectors per document, leading to vastly increased storage requirements (e.g., 393 KiB vs. 12 KiB per document for multi-vector vs. single-vector FP32). This multi-vector approach, while more precise, makes the system expensive due to storage, IO, and cold-start times.
- Asymmetric Quantization addresses this by keeping query vectors at higher precision (int8) and storing document vectors at much lower precision (binary signs, one bit per dimension). This exploits the fact that query vectors are short-lived, while document vectors are stored and accessed repeatedly and therefore dominate costs.
- The result is a 32x, or 97%, reduction in raw document-vector storage, from 393 KiB to 12.28 KiB per document, while maintaining near-lossless retrieval quality. Quality drops by only 0.61 points, from 90.26 NDCG@10 (FP32 baseline) to 89.65 NDCG@10.
- Scoring int8 queries against binary document vectors is computationally cheap: instead of performing full multiplies, it sums the query values at the positions where the document bit is positive. The int8-query, binary-document pairing also runs 3.8x faster than the FP32 baseline.
- The post contrasts different quantization pairings: int8×int8 offers strong quality and speed with a 4x storage reduction, while int8×binary is optimal for storage economics. Binary×binary, though fastest and most compact, results in an unacceptable 7.2-point quality drop.
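The int8-query, binary-document scoring described in the bullets above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not Mixedbread's implementation: the function names are hypothetical, the single symmetric int8 scale per query is an assumed quantization scheme, a set bit is taken to mean +1 and a clear bit -1, and the final score is assumed to be the usual MaxSim late-interaction sum (best-matching document token per query token). The key identity is that a dot product with a sign vector equals twice the sum of query values at positive bits, minus the sum of all query values.

```python
import numpy as np


def quantize_query_int8(q: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize query token vectors to int8 with one shared scale (assumed scheme)."""
    scale = float(np.abs(q).max()) / 127.0 or 1.0  # fall back to 1.0 for an all-zero query
    q_int8 = np.clip(np.round(q / scale), -127, 127).astype(np.int8)
    return q_int8, scale


def binarize_docs(d: np.ndarray) -> np.ndarray:
    """Keep only the sign of each dimension, packed 8 dimensions per byte."""
    return np.packbits(d > 0, axis=-1)


def maxsim_int8_binary(q_int8: np.ndarray, doc_bits: np.ndarray, dim: int) -> int:
    """MaxSim score of an int8 query against sign-binarized document vectors."""
    # Unpack to a 0/1 matrix of shape (n_doc_tokens, dim).
    bits = np.unpackbits(doc_bits, axis=-1, count=dim).astype(np.int32)
    q = q_int8.astype(np.int32)  # (n_query_tokens, dim)
    # Because bits are 0/1, this product only selects and adds query values
    # at positions where the document bit is set.
    pos_sum = q @ bits.T  # (n_query_tokens, n_doc_tokens)
    total = q.sum(axis=1, keepdims=True)
    # dot(q, s) with s = 2*bit - 1 equals 2*pos_sum - total.
    sims = 2 * pos_sum - total
    # Late interaction: best document token per query token, summed over the query.
    return int(sims.max(axis=1).sum())
```

With 128-dimensional FP32 vectors (an arbitrary size chosen for illustration), `binarize_docs` produces 16 bytes per vector instead of 512, which is the same 32x ratio that takes 393 KiB down to 12.28 KiB per document. Multiplying the integer score by the returned `scale` gives an approximate score on the original float scale; for ranking documents against a single query, the integer score alone is sufficient.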
By carefully balancing precision where it matters most, Asymmetric Quantization enables the adoption of high-quality late interaction retrieval without the prohibitive storage and performance costs, transforming it from a niche solution into a practical default for large-scale retrieval systems.