A distributed queue in a single JSON file on object storage

Turbopuffer engineers describe how they built a robust distributed job queue using nothing but a single JSON file on object storage, achieving 10x lower latency. The post walks through the design iteratively, showing how simple primitives can form the backbone of a scalable system. It is an example of using common cloud infrastructure in an unconventional way.

Score: 7 | Comments: 0 | Highest Rank: #4 | On Front Page: 11h
First Seen: Feb 24, 10:00 AM | Last Seen: Feb 24, 8:00 PM

The Lowdown

The article outlines how Turbopuffer replaced its internal indexing job queue, a critical component for scheduling asynchronous work, with one built on object storage. The previous sharded queues caused bottlenecks, so the team wanted better performance and reliability with less operational overhead. The result is a distributed queue built around a single JSON file, which shows how much can be done with the fundamental characteristics of object storage.

  • The initial design had clients (pushers and workers) interact directly with a single JSON file, queue.json, on object storage, using compare-and-set (CAS) for atomic updates. This provided strong consistency, but each request cost a full object write, so throughput was capped by object storage write latency (e.g., ~1 request per second on GCS).
  • To improve throughput, the team introduced "group commit": requests are buffered in memory and written to object storage in batches. This decoupled the write rate from the request rate and shifted the bottleneck to network bandwidth, but multiple writers still contended for the same file.
  • The fix for contention was a "brokered group commit," in which a single stateless broker process handles all interactions with object storage. The broker runs one group commit loop on behalf of all clients, so writes are centralized and the queue scales to many clients.
  • For high availability, the system added fault tolerance for both brokers and workers. Broker addresses are stored in queue.json, so clients can find a new broker if one fails, with CAS ensuring correctness during transitions. Worker failures are handled via heartbeats, which let other workers reclaim uncompleted jobs.

By building incrementally on object storage's simple yet powerful primitives, Turbopuffer developed a distributed queue that is reliable and scalable and provides at-least-once delivery with significant performance improvements. The approach shows how a deep understanding of the underlying infrastructure can lead to elegant, performant solutions, even for complex distributed systems problems.
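
The steps in the list above can be sketched in a few dozen lines of Python. This is a minimal illustration under stated assumptions, not Turbopuffer's implementation: the in-memory store stands in for a real object storage bucket, and every name here (InMemoryObjectStore, write_if_generation_matches, cas_update, GroupCommitter, claim_job), the JSON layout, and the 30-second lease are hypothetical choices made for the example.

```python
import json


class PreconditionFailed(Exception):
    """Raised when the stored generation does not match the expected one."""


class InMemoryObjectStore:
    """Hypothetical stand-in for a single object (queue.json) on object storage."""

    def __init__(self):
        self.body = json.dumps({"jobs": []})
        self.generation = 1

    def read(self):
        return self.body, self.generation

    def write_if_generation_matches(self, body, expected_generation):
        # Compare-and-set: the write lands only if nobody else wrote first.
        if expected_generation != self.generation:
            raise PreconditionFailed()
        self.body = body
        self.generation += 1


def cas_update(store, mutate, max_attempts=10):
    """Read-modify-write loop: retry from a fresh read whenever the CAS fails."""
    for _ in range(max_attempts):
        body, generation = store.read()
        state = json.loads(body)
        result = mutate(state)
        try:
            store.write_if_generation_matches(json.dumps(state), generation)
            return result
        except PreconditionFailed:
            continue
    raise RuntimeError("too much contention on queue.json")


class GroupCommitter:
    """Buffers pushes in memory and lands the whole batch with one CAS write."""

    def __init__(self, store):
        self.store = store
        self.pending = []

    def push(self, job_id):
        self.pending.append(job_id)

    def flush(self):
        batch, self.pending = self.pending, []

        def add_all(state):
            for job_id in batch:
                state["jobs"].append({"id": job_id, "worker": None, "heartbeat": 0.0})

        cas_update(self.store, add_all)
        return len(batch)


def claim_job(store, worker, now, lease_seconds=30.0):
    """Claim an unowned job, or one whose owner's heartbeat has gone stale."""

    def pick(state):
        for job in state["jobs"]:
            if job["worker"] is None or now - job["heartbeat"] > lease_seconds:
                job["worker"] = worker
                job["heartbeat"] = now
                return job["id"]
        return None

    return cas_update(store, pick)


store = InMemoryObjectStore()
committer = GroupCommitter(store)
for job_id in ("a", "b", "c"):
    committer.push(job_id)
committer.flush()                               # three pushes, one object write
first = claim_job(store, "w1", now=100.0)       # "a"
second = claim_job(store, "w2", now=101.0)      # "b"
reclaimed = claim_job(store, "w2", now=200.0)   # "a": w1's heartbeat is stale
```

In this sketch, the brokered design would correspond to running a single GroupCommitter in one process on behalf of all clients, so only one writer ever issues CAS writes. Because a job whose owner stops heartbeating can be handed to another worker, a job may run more than once, which matches the at-least-once delivery described above.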