
Show HN: Hacker News archive (47M+ items, 11.6GB) as Parquet, updated every 5m

A new 'Show HN' project unveils a comprehensive, live-updated archive of all Hacker News items since 2006, encompassing over 47 million entries and 11.6GB of data. This meticulously organized dataset is provided in Parquet format via Hugging Face, designed for easy querying with tools like DuckDB. It promises to be an invaluable resource for researchers and developers keen on analyzing HN's rich discourse, technology trends, and community dynamics.

Score: 32
Comments: 3
Highest Rank: #4
Time on Front Page: 3h
First Seen: Mar 18, 5:00 PM
Last Seen: Mar 18, 7:00 PM

The Lowdown

This 'Show HN' project introduces an open-source, complete archive of Hacker News content, making nearly two decades of submissions and discussions readily available for analysis. The dataset is hosted on Hugging Face and is engineered for both accessibility and real-time updates, positioning itself as a critical tool for anyone looking to dive deep into the history and trends of the influential tech community.

  • Comprehensive Coverage: The dataset includes every story, comment, Ask HN, Show HN, job posting, and poll submitted to Hacker News from October 2006 through the present, totaling over 47 million items and 11.6GB of data.
  • Live Updates: An automated pipeline fetches new items every five minutes from the HN Firebase API and commits them as individual Parquet files, ensuring the archive remains current.
  • Data Structure: Data is organized into monthly Parquet files for historical content, with real-time 5-minute blocks for the current day. A daily rollover mechanism ensures monthly files are always complete and deduplicated.
  • Ease of Use: Leveraging standard Parquet format, the dataset is designed for seamless integration with tools like DuckDB, the Hugging Face datasets library, pandas, and huggingface_hub, with practical SQL and Python examples provided.
  • Statistical Insights: The project includes statistics on item types (87.2% comments, 12.7% stories), score distribution (median score of 0, highest ever 6,015), top linked domains (github.com, youtube.com), and the most active story submitters.
  • Technical Implementation: The pipeline is built in Go, using DuckDB for Parquet conversion. Historical data is sourced from ClickHouse Playground, while live data comes directly from the HN Firebase API.
  • Intended Use Cases: It is ideal for language model pretraining, sentiment and trend analysis, community dynamics research, information retrieval benchmarks, and content recommendation model development.
  • Limitations & Biases: The project openly discusses known biases of the HN community (tech, US-centric, moderation effects) and data-specific limitations (e.g., HTML in text fields, integer types for booleans).
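To make the "ease of use" claim concrete, here is a minimal sketch of querying the archive with DuckDB from Python. The remote Hugging Face URL pattern is an illustrative assumption (the placeholder path is not taken from the project's README), so the runnable portion uses a small stand-in table shaped like HN items.

```python
# Sketch: querying the archive's Parquet files with DuckDB.
import duckdb

con = duckdb.connect()  # in-memory database

# Remote query over HTTPS (requires DuckDB's httpfs extension).
# The dataset path below is a placeholder, not the project's actual URL:
# con.execute("INSTALL httpfs; LOAD httpfs;")
# con.sql("""
#     SELECT type, COUNT(*) AS n
#     FROM 'https://huggingface.co/datasets/<user>/<dataset>/resolve/main/2024-01.parquet'
#     GROUP BY type ORDER BY n DESC
# """).show()

# Local demonstration with a tiny stand-in table shaped like HN items:
con.execute("""
    CREATE TABLE items AS SELECT * FROM (VALUES
        (1, 'story',   'Show HN: example', 32),
        (2, 'comment', NULL,               0),
        (3, 'comment', NULL,               0)
    ) AS t(id, type, title, score)
""")
counts = con.sql(
    "SELECT type, COUNT(*) AS n FROM items GROUP BY type ORDER BY n DESC"
).fetchall()
print(counts)  # comments outnumber stories, mirroring the real archive's ratio
```

The same `SELECT ... GROUP BY type` query runs unchanged against the real monthly Parquet files once the correct dataset URL is substituted.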
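The daily rollover described above (merging the day's 5-minute blocks into a complete, deduplicated monthly file) can be sketched as follows. The function name and dict-based merge are illustrative assumptions, not the project's actual Go implementation; the key idea is that later fetches of the same item id overwrite earlier ones.

```python
# Hypothetical sketch of the daily rollover: merge 5-minute blocks
# (oldest first) into one list, deduplicated by item id, keeping the
# most recently fetched version of each item.
from typing import Iterable


def roll_up(blocks: Iterable[list[dict]]) -> list[dict]:
    """Merge blocks into a single deduplicated, id-ordered list."""
    latest: dict[int, dict] = {}
    for block in blocks:
        for item in block:
            latest[item["id"]] = item  # later blocks overwrite earlier ones
    return sorted(latest.values(), key=lambda it: it["id"])


blocks = [
    [{"id": 1, "score": 10}, {"id": 2, "score": 0}],
    [{"id": 1, "score": 32}],  # item 1 re-fetched with an updated score
]
print(roll_up(blocks))
```

This is why the monthly files can be "always complete and deduplicated": re-fetched items simply replace their earlier snapshots during the merge.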
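The live-update pipeline (poll every five minutes, fetch anything newer than the last id seen) can be sketched against the real HN Firebase API. The `/v0/maxitem.json` and `/v0/item/{id}.json` endpoints exist as shown; the loop structure and function names are assumptions, and the project's actual pipeline is written in Go.

```python
# Sketch of one polling cycle against the HN Firebase API.
import json
import urllib.request

API = "https://hacker-news.firebaseio.com/v0"


def new_item_ids(last_seen: int, max_item: int) -> range:
    """Ids created since the previous poll (exclusive of last_seen)."""
    return range(last_seen + 1, max_item + 1)


def fetch_item(item_id: int) -> dict:
    """Fetch a single item (story, comment, job, or poll) as JSON."""
    with urllib.request.urlopen(f"{API}/item/{item_id}.json") as resp:
        return json.load(resp)


def poll_once(last_seen: int) -> tuple[list[dict], int]:
    """Fetch everything newer than last_seen; return items and new high-water mark."""
    with urllib.request.urlopen(f"{API}/maxitem.json") as resp:
        max_item = json.load(resp)
    items = [fetch_item(i) for i in new_item_ids(last_seen, max_item)]
    return items, max_item
```

In the real pipeline, each batch returned by a cycle like `poll_once` would be written out as one 5-minute Parquet block.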

This carefully maintained Hacker News archive opens up a vast repository of tech discourse, giving researchers and developers a robust foundation for data-driven insights.