Data Science for Health – an Alan Turing Institute Summit

Yesterday I attended ‘Data Science for Health – an Alan Turing Institute Summit’ at the Centre for Mathematical Sciences, Cambridge. This post is a brief summary of topics raised that particularly interested me. My background: I have an undergraduate degree in physics, I’m currently doing a computational biology PhD, and I’m interested in the potential of using electronic data to improve healthcare, particularly public health.

The Alan Turing Institute (ATI) is a cross-university collaboration between Oxford, Cambridge, Edinburgh, UCL, and Warwick, with physical headquarters at the British Library. It aims to “undertake data science research at the intersection of computer science, mathematics, statistics and systems engineering; provide technically informed advice to policy makers on the wider implications of algorithms; enable researchers from industry and academia to work together to undertake research with practical applications; and act as a magnet for leaders in academia and industry from around the world to engage with the UK in data science and its applications.”

Alongside the academic institutions there will also be industrial and governmental partners, including Cray, Lloyd’s Register, and GCHQ.

It’s not surprising that GCHQ have a keen interest in the development of techniques for dealing with big data (or as they might euphemistically call it, ‘big metadata’). The summit focused on health data, where one of the main obstacles is obtaining consent. If there is a moral case for blanket monitoring of personal electronic data for surveillance, it rests on the assumption that it is necessary for protecting the general population and does limited harm. I think an equally strong if not stronger moral case can be made for blanket monitoring of personal health data for the improvement of healthcare, which in consequentialist terms seems likely to have a much larger beneficial impact on the general population (do some back-of-the-envelope calculations yourself). It seems bizarre that we are happy to permit the mass extraction of people’s personal electronic communications yet remain incredibly cautious about the legal ramifications of sharing medical data between separate government departments, public services, and academic institutions.

As well as this fragmentation of data sharing, the legal framework surrounding medical data is incredibly complex and variable. For example, I didn’t know beforehand that mortality data sits under a completely different set of legal barriers from healthcare data, despite being a crucial part of epidemiology. Legislation could fix this, but the government shows no sign of improving the situation. In another absurd situation, academic researchers often have to pay six-figure sums to access datasets from GP practices. As Ruth Gilbert pointed out, this would cause public outrage if it were widely known.

Surprisingly, patients do not technically own their data. Legally it belongs to the Secretary of State and the relevant Data Controllers. Ruth Gilbert suggested that patients don’t actually care whether their data is used for administrative or research purposes (despite the different legal frameworks) and that it is not feasible to get informed prior consent for every possible use – nobody can exhaustively know what data will ultimately be used for. She argued that the legal frameworks for consented and unconsented data should be very similar.

The challenge of linking up different datasets was raised several times. Andrew Morris said that every patient should be a research patient. The big problems facing the NHS are chronic disease and multimorbidity (having, for example, heart disease, obesity, and diabetes simultaneously), and understanding these requires integrated databases. Ruth Gilbert added that data from past clinical trials could be incredibly useful for longitudinal research if the original data could be linked to future administrative records.

Academics are quite good at developing ‘improvements’ to healthcare that weren’t asked for by clinicians. Ian Cree pointed out that not all such improvements save money or improve patient care. Detecting cancers earlier is only useful if it’s done early enough to change the outcome. If detection is brought forward by only a small amount, the patient’s prognosis doesn’t improve, yet the earlier diagnosis actually costs the NHS more money than if the cancer hadn’t been detected at that point.

Mary Black of Public Health England raised another fascinating point that I don’t think gets mentioned enough. Some speakers had claimed that the innovation brought about by using big data in the NHS would save money. She argued that there was minimal evidence for this claim, and that in fact the converse was true. Innovation simply expands the remit of healthcare and expands the range of treatable conditions, resulting in an increased cost.

As an example, I think it’s fair to say that the big data community is obsessed with human genomics. Gil McVean argued that genomics should be at the heart of big data because it’s both accurate and easy to collect. These are both good reasons, but I would question whether the vast potential promised by genomics can be realised soon in terms of deliverable improvements to routine healthcare. The paradox is that extremely rare diseases are often the simplest to deal with conceptually: variant filtering identifies a mutation in a specific gene, which illuminates the fundamental biology causing the condition, potentially allowing a treatment to be developed. This is fascinating science, but is actually (very expensive) low-hanging fruit that affects a tiny number of patients compared to the complex issues of how to deal with a multifactorial disease like obesity that affects X% of the population.

Academics in the biomedical sciences are always going to be keen on new directions of research made possible by technological advances. They’re less keen on researching the dull, complicated issues that clinicians face every day. The fact that clinical best practice is often based on minimal evidence and has vast potential for improvement through the interrogation of large datasets is often lost in tech-utopian visions of personalised medicine (strap the Oxford Nanopore MinION to your wrist, wirelessly connect to the cloud, and stream personal informatics 24/7!).

The expectation in the UK is that you only go to see the doctor if you’re sick. The NHS is (mostly) very good at treating the very ill. What it’s less good at is treating the unhealthy. Integrating different datasets should help us to understand how we can improve this situation. Ian Cree referenced the saying: “In the past medicine was simple, safe, and ineffective.” I don’t know how true that is – I’m certainly not sure how safe medicine was in the past – but I hope that we can get better at making it more effective by using the vast amounts of data that we, the patients, generate every day.

  1. Sean Whitton said:

    Is it possible to make the data sufficiently anonymous, or in order to do interesting research do you end up having to assign (e.g.) unique ID numbers which follow patients around, such that if these UIDs are at any point linked to their name, there’s no anonymity anymore?


    • It depends on the question you’re asking. The broader the question, the more anonymising you can get away with. Linking patients to UIDs is usually called ‘pseudonymisation’ and as long as the key remains private (i.e. held by a ‘data controller’ and not given to researchers) it’s not subject to the Data Protection Act. Of course this raises questions of trust about the data controller, and rightly so.

      This article summarises the main principles and approaches when releasing individual data:

      For research on rarer conditions a problem is that patients remain easily identifiable even after pseudonymisation because their records are so distinctive (in a way that, e.g., pseudonymised asthma patients’ records aren’t). However, I think the main danger to anonymity in the future will come from cross-referencing multiple pseudonymised datasets that, considered together, allow anonymity to be broken. Managing this risk as different government departments start releasing independent datasets is going to be difficult.
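
      To make the pseudonymisation idea above concrete, here is a minimal Python sketch using a keyed hash (HMAC-SHA256). The function name, key, and record fields are all illustrative, not from any real NHS system; the point is simply that the same patient maps to the same pseudonym across releases (so records can be linked longitudinally), while only the key holder – the data controller – can regenerate the mapping.

      ```python
      import hmac
      import hashlib

      def pseudonymise(patient_id: str, secret_key: bytes) -> str:
          """Derive a stable pseudonym from an identifier with HMAC-SHA256.

          Only the holder of secret_key (the data controller) can recompute
          the mapping; researchers see only the resulting pseudonyms.
          """
          return hmac.new(secret_key, patient_id.encode(), hashlib.sha256).hexdigest()

      # Hypothetical key, held privately by the data controller.
      key = b"held-by-the-data-controller-only"

      record = {"patient_id": "NHS-123-456-7890", "diagnosis": "asthma"}

      # The released record carries the pseudonym, not the real identifier.
      released = {
          "pid": pseudonymise(record["patient_id"], key),
          "diagnosis": record["diagnosis"],
      }

      # Deterministic: the same patient gets the same pseudonym in every
      # release keyed with the same secret, enabling longitudinal linkage.
      assert released["pid"] == pseudonymise("NHS-123-456-7890", key)
      ```

      Note that this sketch also illustrates the trust problem mentioned above: whoever holds the key can re-identify everyone, and it does nothing to stop re-identification by cross-referencing distinctive pseudonymised records against other datasets.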

