
Show HN: Hacker News archive (47M+ items, 11.6GB) as Parquet, updated every 5m

A new 'Show HN' project unveils a comprehensive, live-updated archive of all Hacker News items since 2006, encompassing over 47 million entries and 11.6GB of data. This meticulously organized dataset is provided in Parquet format via Hugging Face, designed for easy querying with tools like DuckDB. It promises to be an invaluable resource for researchers and developers keen on analyzing HN's rich discourse, technology trends, and community dynamics.

Score: 32
Comments: 3
Highest Rank: #4
Time on Front Page: 3h
First Seen: Mar 18, 5:00 PM
Last Seen: Mar 18, 7:00 PM

The Lowdown

This 'Show HN' project introduces an open-source, complete archive of Hacker News content, making nearly two decades of submissions and discussions readily available for analysis. The dataset is hosted on Hugging Face and is engineered for both accessibility and real-time updates, positioning itself as a critical tool for anyone looking to dive deep into the history and trends of the influential tech community.

  • Comprehensive Coverage: The dataset includes every story, comment, Ask HN, Show HN, job posting, and poll submitted to Hacker News from October 2006 through the present, totaling over 47 million items and 11.6GB of data.
  • Live Updates: An automated pipeline fetches new items every five minutes from the HN Firebase API and commits them as individual Parquet files, ensuring the archive remains current.
  • Data Structure: Data is organized into monthly Parquet files for historical content, with real-time 5-minute blocks for the current day. A daily rollover mechanism ensures monthly files are always complete and deduplicated.
  • Ease of Use: Leveraging standard Parquet format, the dataset is designed for seamless integration with tools like DuckDB, the Hugging Face datasets library, pandas, and huggingface_hub, with practical SQL and Python examples provided.
  • Statistical Insights: The project includes statistics on item types (87.2% comments, 12.7% stories), score distribution (median score of 0, highest ever 6,015), top linked domains (github.com, youtube.com), and the most active story submitters.
  • Technical Implementation: The pipeline is built in Go, using DuckDB for Parquet conversion. Historical data is sourced from ClickHouse Playground, while live data comes directly from the HN Firebase API.
  • Intended Use Cases: It is ideal for language model pretraining, sentiment and trend analysis, community dynamics research, information retrieval benchmarks, and content recommendation model development.
  • Limitations & Biases: The project openly discusses known biases of the HN community (tech, US-centric, moderation effects) and data-specific limitations (e.g., HTML in text fields, integer types for booleans).
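To make the "ease of use" claim concrete, here is a minimal sketch of querying the archive with DuckDB from Python. The remote Hugging Face URL pattern is an illustrative assumption (the placeholder path is not taken from the project's README), so the runnable portion uses a small stand-in table shaped like HN items.

```python
# Sketch: querying the archive's Parquet files with DuckDB.
import duckdb

con = duckdb.connect()  # in-memory database

# Remote query over HTTPS (requires DuckDB's httpfs extension).
# The dataset path below is a placeholder, not the project's actual URL:
# con.execute("INSTALL httpfs; LOAD httpfs;")
# con.sql("""
#     SELECT type, COUNT(*) AS n
#     FROM 'https://huggingface.co/datasets/<user>/<dataset>/resolve/main/2024-01.parquet'
#     GROUP BY type ORDER BY n DESC
# """).show()

# Local demonstration with a tiny stand-in table shaped like HN items:
con.execute("""
    CREATE TABLE items AS SELECT * FROM (VALUES
        (1, 'story',   'Show HN: example', 32),
        (2, 'comment', NULL,               0),
        (3, 'comment', NULL,               0)
    ) AS t(id, type, title, score)
""")
counts = con.sql(
    "SELECT type, COUNT(*) AS n FROM items GROUP BY type ORDER BY n DESC"
).fetchall()
print(counts)  # comments outnumber stories, mirroring the real archive's ratio
```

The same `SELECT ... GROUP BY type` query runs unchanged against the real monthly Parquet files once the correct dataset URL is substituted.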
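The daily rollover described above (merging the day's 5-minute blocks into a complete, deduplicated monthly file) can be sketched as follows. The function name and dict-based merge are illustrative assumptions, not the project's actual Go implementation; the key idea is that later fetches of the same item id overwrite earlier ones.

```python
# Hypothetical sketch of the daily rollover: merge 5-minute blocks
# (oldest first) into one list, deduplicated by item id, keeping the
# most recently fetched version of each item.
from typing import Iterable


def roll_up(blocks: Iterable[list[dict]]) -> list[dict]:
    """Merge blocks into a single deduplicated, id-ordered list."""
    latest: dict[int, dict] = {}
    for block in blocks:
        for item in block:
            latest[item["id"]] = item  # later blocks overwrite earlier ones
    return sorted(latest.values(), key=lambda it: it["id"])


blocks = [
    [{"id": 1, "score": 10}, {"id": 2, "score": 0}],
    [{"id": 1, "score": 32}],  # item 1 re-fetched with an updated score
]
print(roll_up(blocks))
```

This is why the monthly files can be "always complete and deduplicated": re-fetched items simply replace their earlier snapshots during the merge.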
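The live-update pipeline (poll every five minutes, fetch anything newer than the last id seen) can be sketched against the real HN Firebase API. The `/v0/maxitem.json` and `/v0/item/{id}.json` endpoints exist as shown; the loop structure and function names are assumptions, and the project's actual pipeline is written in Go.

```python
# Sketch of one polling cycle against the HN Firebase API.
import json
import urllib.request

API = "https://hacker-news.firebaseio.com/v0"


def new_item_ids(last_seen: int, max_item: int) -> range:
    """Ids created since the previous poll (exclusive of last_seen)."""
    return range(last_seen + 1, max_item + 1)


def fetch_item(item_id: int) -> dict:
    """Fetch a single item (story, comment, job, or poll) as JSON."""
    with urllib.request.urlopen(f"{API}/item/{item_id}.json") as resp:
        return json.load(resp)


def poll_once(last_seen: int) -> tuple[list[dict], int]:
    """Fetch everything newer than last_seen; return items and new high-water mark."""
    with urllib.request.urlopen(f"{API}/maxitem.json") as resp:
        max_item = json.load(resp)
    items = [fetch_item(i) for i in new_item_ids(last_seen, max_item)]
    return items, max_item
```

In the real pipeline, each batch returned by a cycle like `poll_once` would be written out as one 5-minute Parquet block.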

This carefully maintained Hacker News archive opens up a vast repository of tech discourse, giving researchers and developers a robust foundation for data-driven insights.