
UK Biobank health data keeps ending up on GitHub

UK Biobank holds sensitive health and genetic data on half a million volunteers, and that highly restricted data keeps appearing in public GitHub repositories. Although 20,000 researchers access it under strict sharing agreements, the leaks continue, raising serious questions about data governance, participant privacy, and the effectiveness of current remediation efforts. The exposure, which includes instances of re-identification and data offered for sale on Alibaba, shows how hard it is to manage large-scale sensitive datasets when research collaboration is this widespread.

Score: 94
Comments: 24
Highest Rank: #11
On Front Page: 16h
First Seen: Apr 23, 9:00 PM
Last Seen: Apr 24, 12:00 PM
Rank Over Time: 12, 11, 12, 12, 11, 11, 12, 14, 16, 16, 17, 18, 19, 20, 23, 23

The Lowdown

UK Biobank, a crucial resource for health research, collects genetic, health, and lifestyle data from half a million British volunteers. Access is granted to 20,000 researchers worldwide under strict non-sharing agreements, yet this sensitive participant data keeps being uploaded to public GitHub repositories.

The author, a privacy researcher, tracks the DMCA takedown notices UK Biobank sends to GitHub and has identified 110 notices targeting 197 repositories belonging to 170 developers in at least 14 countries, most prominently the US and China. The leaked material often includes genetic/genomic files, tabular data that may contain phenotype or health records, and Jupyter/R notebooks.
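As a rough illustration of how the three kinds of leaked material mentioned above could be told apart, here is a minimal Python sketch that buckets repository file paths by extension. The extension lists, the `classify` and `tally` helper names, and the example paths are all assumptions made for illustration; the article does not say how the author categorized files, and matching on extensions says nothing about whether a file actually contains participant data.

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumed extension lists (not taken from the article): common genomic,
# tabular, and notebook formats a reviewer might want to look at first.
CATEGORIES = {
    "genomic": {".vcf", ".bed", ".bim", ".fam", ".bgen"},
    "tabular": {".csv", ".tsv", ".tab"},
    "notebook": {".ipynb", ".rmd"},
}


def classify(path: str) -> str:
    """Return the category for a file path, or 'other' if nothing matches."""
    suffixes = PurePosixPath(path.lower()).suffixes
    if not suffixes:
        return "other"
    # Look through a trailing .gz so that e.g. chr1.vcf.gz counts as genomic.
    if suffixes[-1] == ".gz" and len(suffixes) > 1:
        suffix = suffixes[-2]
    else:
        suffix = suffixes[-1]
    for category, extensions in CATEGORIES.items():
        if suffix in extensions:
            return category
    return "other"


def tally(paths):
    """Count how many paths fall into each category."""
    return Counter(classify(p) for p in paths)


if __name__ == "__main__":
    # Hypothetical paths, invented for this example.
    example_paths = [
        "data/chr1.vcf.gz",
        "data/pheno.CSV",
        "analysis/gwas.ipynb",
        "analysis/report.Rmd",
        "README.md",
    ]
    print(tally(example_paths))
```

A triage like this only flags candidates for manual review; deciding whether a flagged file is restricted data still requires looking at its contents.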

  • Persistent Leaks: Despite agreements, researchers and even individuals not granted initial access are exposing data.
  • Re-identification Risk: The Guardian re-identified a volunteer using only their approximate birth date and the date of a single major surgery, which contradicts UK Biobank's stated position on re-identification risk.
  • Remediation Challenges: UK Biobank relies on copyright-based DMCA notices for takedowns because the UK has no direct privacy-breach equivalent that triggers rapid platform action. The takedowns often cover only specific files, and previously deleted data frequently reappears.
  • Global Reach: Developers targeted span a wide geographical area, complicating enforcement.
  • Governance Failures: The article frames this as the latest in a series of governance problems at UK Biobank and suggests the organization lacks humility and is unwilling to learn from privacy experts.
  • Escalating Concerns: Recent reports indicate that all 500,000 participant records were found for sale on Alibaba, underscoring the severity of the situation even after UK Biobank moved to a remote access platform run via DNAnexus and Amazon.

The continuing exposure points to a systemic failure to safeguard highly sensitive personal information. It raises serious concerns about participant trust and about whether large-scale biobank initiatives remain viable without more robust, enforceable data governance and security protocols.

The Gossip

The Inescapable Leakage & De-anonymization Dilemma

Commenters widely discussed how difficult, if not impossible, it is to truly anonymize large, sensitive datasets and to prevent leaks once they are shared with thousands of researchers. Many felt it was 'naive' to expect perfect compliance, and compared anonymization to encryption in that both are exposed to ever-improving attacks. Commenters also pointed out that even seemingly innocuous combinations of facts (like a birth date and a surgery date) can lead to re-identification, which makes 'fully informed consent' a complex ethical problem for participants.
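The point about innocuous-looking fact combinations can be made concrete with a small Python sketch. It measures what fraction of records become unique once two quasi-identifiers are combined. The records, field names, and the `unique_fraction` and `match` helpers are entirely synthetic and hypothetical; nothing here reflects real UK Biobank data or the method The Guardian used.

```python
from collections import Counter

# Entirely synthetic toy records, invented for illustration only.
RECORDS = [
    {"pid": "A", "birth_ym": "1955-03", "surgery_date": "2011-06-14"},
    {"pid": "B", "birth_ym": "1955-03", "surgery_date": "2012-01-09"},
    {"pid": "C", "birth_ym": "1955-03", "surgery_date": "2011-06-14"},
    {"pid": "D", "birth_ym": "1961-11", "surgery_date": "2011-06-14"},
    {"pid": "E", "birth_ym": "1948-07", "surgery_date": "2009-10-02"},
]


def unique_fraction(records, keys):
    """Fraction of records whose combination of `keys` values appears exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


def match(records, **known):
    """Return the ids of records consistent with everything an attacker knows."""
    return [r["pid"] for r in records if all(r[k] == v for k, v in known.items())]


if __name__ == "__main__":
    print(unique_fraction(RECORDS, ["birth_ym"]))                  # one attribute
    print(unique_fraction(RECORDS, ["surgery_date"]))              # the other attribute
    print(unique_fraction(RECORDS, ["birth_ym", "surgery_date"]))  # both combined
    print(match(RECORDS, birth_ym="1955-03"))
    print(match(RECORDS, birth_ym="1955-03", surgery_date="2012-01-09"))
```

In this toy table, each attribute alone leaves most records sharing a value with someone else, while the pair narrows one query to a single person. Real datasets hold far more attributes per participant, so the number of distinguishing combinations grows quickly.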

Accountability & Consequences Conundrum

A major concern was the lack of accountability and tangible consequences for those responsible for the breaches. Commenters questioned whether institutions or individual researchers would face sanctions beyond a suspension of access, and drew comparisons with stricter regulatory regimes such as enforcement by the US Department of Health and Human Services (HHS). There was also skepticism that 'random university students' or less experienced researchers truly understand the data security implications, which points to a gap in training or oversight.

Participant Paradox & Data Governance

Many commenters noted the irony that participants cannot access their own data even as it leaks publicly. The discussion also covered the 'pros and cons' of making such data fully open in future biobank projects, with some suggesting this would be a more transparent, albeit riskier, approach. Others countered that it would undoubtedly deter participation, arguing that volunteers did not consent to such broad dissemination and that security through obscurity, while not ideal, serves a practical purpose in protecting sensitive data.

Persistent Breaches & Remediation Failures

Users expressed dismay at the continuing leaks and at what they saw as ineffective remediation by UK Biobank. The recent discovery of all 500,000 participant records for sale on Alibaba was cited as a stark example of ongoing, severe breaches. Commenters noted that leaked data often reappears elsewhere even after takedown notices, which suggests that issuing DMCA notices and moving to a remote access platform do not address the root cause of the exposure.