Twice this week, I have come across embarrassingly bad data
The author exposes two startling examples of egregiously bad public data from UK institutions, highlighting how basic validation errors undermine trust and lead to poor decisions. Hacker News users debate whether publishing flawed data is better than none, weighing the costs of data cleaning against the erosion of credibility. This discussion underscores the perpetual challenge of data quality and the surprising prevalence of fundamental errors in official datasets.
The Lowdown
This article shines a spotlight on the alarming state of data quality within official UK sources, presenting two recent cases where glaring errors went uncorrected and were publicly disseminated. The author, Andy Brice, argues that such 'garbage data' is not only embarrassing but detrimental to trust and informed decision-making.
- UK Fuel Finder Data: Brice analyzes a CSV file intended to provide UK fuel station locations and prices. He quickly identifies significant outliers, including stations supposedly located in the Indian and South Atlantic oceans, and a 1538:1 ratio between the most expensive and cheapest fuel prices, indicating basic failures of data entry and validation.
- RAC Electric Car Report: Another example involves a graph from the RAC's electric car report, which inexplicably shows a massive drop in UK EV numbers from 1.4 million to 0.0017 million. This error is attributed to a likely decimal point mistake, yet it was published without evident scrutiny.
- Consequences and LLMs: The author warns that this sloppiness erodes institutional trust and could lead to a "slop-apocalypse" if unchecked data is used to train AI models, perpetuating and obscuring errors.
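The basic validation the article calls for is cheap to implement. A minimal sketch of the kind of "smell test" that would have caught both Fuel Finder problems, assuming hypothetical column names and sample rows (the real schema is not shown in the article), is:

```python
import csv
import io

# Hypothetical rows in the rough shape of the Fuel Finder CSV.
# Column names and values are illustrative assumptions, not the real schema.
SAMPLE_CSV = """site_id,latitude,longitude,price_pence
A1,51.5074,-0.1278,145.9
A2,-7.95,-14.35,139.9
A3,53.4808,-2.2426,224444.0
"""

# Rough bounding box for the UK (generous, includes Northern Ireland).
UK_LAT = (49.8, 60.9)
UK_LON = (-8.7, 1.8)

def smell_test(csv_text, max_price_ratio=3.0):
    """Return (site_id, problem) pairs for rows failing basic sanity checks:
    coordinates outside the UK, or prices wildly above the cheapest observed."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    for row in rows:
        lat, lon = float(row["latitude"]), float(row["longitude"])
        if not (UK_LAT[0] <= lat <= UK_LAT[1] and UK_LON[0] <= lon <= UK_LON[1]):
            problems.append((row["site_id"], "coordinates outside UK"))
    cheapest = min(float(r["price_pence"]) for r in rows)
    for row in rows:
        if float(row["price_pence"]) / cheapest > max_price_ratio:
            problems.append((row["site_id"], "implausible price"))
    return problems

print(smell_test(SAMPLE_CSV))
# Flags A2 (mid-Atlantic coordinates) and A3 (price ~1600x the cheapest).
```

A check like this runs in milliseconds before publication; the thresholds are deliberately loose, since the goal is catching the "obviously wrong", not enforcing perfection.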
In conclusion, Brice passionately advocates for a return to fundamental data validation, emphasizing that authors should proofread, programmers test, and data professionals take pride in the quality of their work. He even humorously notes that his own article initially contained a typo, which was swiftly corrected following a reader's observation.
The Gossip
Data Dilemma: Dirty or Denied?
A significant portion of the discussion revolves around the tension between publishing flawed data and withholding it. Many argue that public institutions should release data even if it's messy, since withholding it denies access to those who *can* clean it or find value in its raw form. They highlight the expense of cleaning data and the potential for a 'no data' outcome if perfect cleanliness is required. Conversely, others agree with the author that publishing obviously incorrect data erodes trust and is irresponsible, advocating for at least basic 'smell tests' or disclaimers.
Clean Data's Costly Conundrum
Several users underscore the significant effort and expense involved in thoroughly cleaning data. They point out that meticulous processes, such as Source Data Verification (SDV) in clinical trials, are labor-intensive and costly. This leads to the argument that demanding perfectly clean data from every source might be an unrealistic expectation for under-resourced public bodies, making the 'publish raw' argument more palatable to some.
Trustworthiness and Tainted Truths
There's a strong sentiment that bad data, even with disclaimers, severely damages trust in institutions and the data itself. Commenters emphasize that poor data can lead to incorrect decisions, undermine the credibility of projects and products, and create a perception that the underlying technology is flawed. The 'obviously wrong' nature of some errors makes the lack of basic checks particularly galling, pushing for at least minimal validation.