I have acne. Occasionally, I read online forums because I like knowing the experiences of other people with acne. For example, a long time ago I started using a prescription facial cream and wasn’t enjoying the side effects. I wanted to know how long I could expect these to last, so I browsed around some different forums and read dozens of different people’s experiences until I’d satisfied my curiosity.

People enthusiastic about ‘asking for evidence’ might be critical of my decision to take health advice from random strangers on the internet. I think that’s a patronising attitude to take. (Disclaimer: I’m a 2016 ‘Ask for Evidence’ ambassador. Please don’t fire me.)

I didn’t base my decisions on the random strangers alone:

  • I asked my GP when I got the medication about side effects
  • I read NHS Choices
  • I read WebMD
  • I asked a Real Dermatologist who I know personally

Discussion forums were one component of my decision to keep using the medication. But they were a very useful one that provided something none of the others did: patient experiences. I think there’s something valuable in hearing from normal people as well as medical professionals, and anecdotally from my personal experience the advice given seems pretty good.

how bad is dangerous?

So I really enjoyed this recent paper by Cole et al. (2016): “Health Advice from Internet Discussion Forums: How Bad Is Dangerous?”

The authors selected 25 discussion threads offering health advice about three conditions (HIV, diabetes, and chickenpox) on three websites (Reddit, Mumsnet, and Patient). They then got eight qualified doctors and nine non-doctors to assess how accurate, complete, and sensible the threads were.

Perhaps somewhat surprisingly, the information provided on the threads was mostly considered to be “reasonably good” with only a small amount assessed as “poor”. And even for the “poor” information, the authors noted that:

The forums that contained the most inaccurate or controversial information also contained counterbalancing comments that appear able to dilute the potentially harmful consequences of the poor quality information.

So even when there’s bad advice, other people are usually on hand to call it out even if the reader might not pick up on it (which I suspect they often would).

There are a few criticisms that could be made of the study. It’s quite small and focused on a small number of conditions (but to be fair it was a pilot study for a PhD). The analysis combines assessments from doctors (58 of 79, 73%) and non-doctors (21 of 79, 27%) which I find slightly confusing as (a) it’s such an obvious criticism, (b) it seems it wouldn’t be much effort to separate out, and (c) I would doubt it has much of an impact on the conclusions.

The study contributes to existing evidence that the majority of health information online is good (~60-70%) with just a small proportion “genuinely inaccurate” (~5-7%). I look forward to future publications from Cole et al. on this topic. I’d be particularly interested to know what the quality of advice is like on patient websites for specific diseases vs. more general sites like Reddit (guess where the quality of information was most variable?).

(mostly) not terrible

Health advice on the internet is (mostly) not terrible. I think we need to recognise that there’s a whole spectrum of medical advice out there: sufferers of a condition sharing their experiences of different drugs is very different from a recommendation to drink bleach to cure multiple sclerosis. Horror stories of bogus companies selling quack cures (and there are many) shouldn’t stop us using the internet alongside more traditional healthcare routes e.g. talking to qualified practitioners.

Every day, the internet contributes to the healthcare decisions made by millions of people. That contribution can only rise. Doctors can’t do anything about this – and they shouldn’t. We should be educating people about how and where to look for advice, not telling them not to.

the paper

Cole J, Watkins C, Kleine D. Health Advice from Internet Discussion Forums: How Bad Is Dangerous? J Med Internet Res 2016;18(1):e4 DOI: 10.2196/jmir.5051 PMID: 26740148


“Data from all survey results including links to the actual question as it appeared on the discussion forum website” is available…

…as a pdf 😦


Yesterday I attended ‘Data Science for Health – an Alan Turing Institute Summit’ at the Centre for Mathematical Sciences, Cambridge. This post is a brief summary of topics raised that particularly interested me. My background: I have an undergraduate degree in physics, I’m currently doing a computational biology PhD, and I’m interested in the potential of using electronic data to improve healthcare, particularly public health.

The Alan Turing Institute (ATI) is a cross-university collaboration between Oxford, Cambridge, Edinburgh, UCL, and Warwick, with physical headquarters at the British Library. It aims to “undertake data science research at the intersection of computer science, mathematics, statistics and systems engineering; provide technically informed advice to policy makers on the wider implications of algorithms; enable researchers from industry and academia to work together to undertake research with practical applications; and act as a magnet for leaders in academia and industry from around the world to engage with the UK in data science and its applications.”

As well as academic institutions there will also be other industrial and governmental partners, including Cray, Lloyd’s Register, and GCHQ.

It’s not surprising that GCHQ have a keen interest in the development of techniques for dealing with big data (or as they might euphemistically call it, ‘big metadata’). Today’s summit was focused on health data, where one of the main obstacles is requiring consent. If there is a moral case for blanket monitoring of personal electronic data for surveillance, it rests on the assumption that it is necessary for protecting the general population and does limited harm. I think an equally strong if not stronger moral case can be made for blanket monitoring of personal health data for the improvement of healthcare, which in a consequential sense seems likely to have a much larger beneficial impact on the general population (do some back-of-the-envelope calculations yourself). It seems bizarre that we are happy to permit the mass extraction of people’s personal electronic communications yet remain incredibly cautious of the legal ramifications of sharing medical data between separate government departments, public services, and academic institutions.

As well as this fragmentation of data sharing, the legal framework surrounding medical data is incredibly complex and variable. For example, I didn’t know beforehand that mortality data is subject to a completely different set of legal barriers from healthcare data, despite being a crucial part of epidemiology. Legislation could fix this but the government shows no sign of improving the situation. In another absurd situation, academic researchers often have to pay six-figure sums to access datasets from GP practices. As Ruth Gilbert pointed out, this would cause public outrage if it were widely known.

Surprisingly, patients do not technically own their data. Legally it belongs to the Secretary of State and the relevant Data Controllers. Ruth Gilbert suggested that patients don’t actually care whether their data is used for administrative or research purposes (despite the different legal frameworks) and that it is not feasible to get informed prior consent for every possible use – nobody can exhaustively know what data will ultimately be used for. She argued that the legal frameworks for consented and unconsented data should be very similar.

The challenge of linking up different datasets was raised several times. Andrew Morris said that every patient should be a research patient. The big problems facing the NHS are chronic disease and multi-comorbidity (having e.g. heart disease, obesity, and diabetes), and understanding these requires integrated databases. Ruth Gilbert added that data from past clinical trials could be incredibly useful for longitudinal research if the original data could be linked to future administrative records.

Academics are quite good at developing ‘improvements’ to healthcare that weren’t asked for by clinicians. Ian Cree pointed out that not all such improvements save money or improve patient care. Detecting cancers earlier is only useful if it’s done early enough. If you only reduce the detection time by a small amount, the patient won’t have an improved prognosis and will actually cost the NHS more money than if you hadn’t detected the cancer.

Mary Black of Public Health England raised another fascinating point that I don’t think gets mentioned enough. Some speakers had claimed that the innovation brought about by using big data in the NHS would save money. She argued that there was minimal evidence for this claim, and that in fact the converse was true. Innovation simply expands the remit of healthcare and expands the range of treatable conditions, resulting in an increased cost.

As an example, I think it’s fair to say that the big data community is obsessed with human genomics. Gil McVean argued that genomics should be at the heart of big data because it’s both accurate and easy to collect. These are both good reasons, but I would question whether the vast potential promised by genomics can be realized soon in terms of deliverable improvements to routine healthcare. The paradox is that extremely rare diseases are often the simplest to deal with conceptually: variant filtering identifies a mutation in a specific gene, which illuminates the fundamental biology causing the condition, potentially allowing a treatment to be developed. This is fascinating science, but is actually (very expensive) low-hanging fruit that affects a tiny number of patients compared to the complex issues of how to deal with a multifactorial disease like obesity that affects X% of the population.

Academics in the biomedical sciences are always going to be keen on new directions of research made possible by technological advances. They’re less keen on researching the dull, complicated issues that clinicians face every day. The fact that clinical best practice is often based on minimal evidence and has vast potential for improvement through the interrogation of large datasets is often lost in tech-utopian visions of personalized medicine (strap the Oxford MinION to your wrist, wirelessly connect to the cloud, and stream personal informatics 24/7!).

The expectation in the UK is that you only go to see the doctor if you’re sick. The NHS is (mostly) very good at treating the very ill. What it’s less good at is treating the unhealthy. Integrating different datasets should help us to understand how we can improve this situation. Ian Cree referenced the saying: “In the past medicine was simple, safe, and ineffective.” I don’t know how true that is – I’m certainly not sure how safe medicine was in the past – but I hope that we can get better at making it more effective by using the vast amounts of data that we, the patients, generate every day.

I received some Illumina data from collaborators without knowing much about how it had been generated.

Inspecting the files I found that the data had already been demultiplexed and stripped of their barcodes. There were also paired reads for each sample. I wasn’t familiar with how to deal with this sort of data, but Robert Edgar has a discussion here, with Example 2 being the appropriate case: http://www.drive5.com/usearch/manual/ill_demx_reads.html

It’s a simple matter to adapt his helpful solution for the multiple file case, but I always find myself googling basic shell scripting so here’s my version.

First we need to get a list of all of the sample names. Assuming that your file names are in the standard form ‘SampleName_L001_R1_001.fastq’, this can be done by the following:

ls *.fastq | awk -F '_L001' '{print $1}' | sort -u > sample_names.txt
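
To check the pattern before pointing it at real data, here is a minimal sketch that reads file names from standard input instead of ls, so it can be tried without any FASTQ files present. The function name sample_names and the file names SampleA and SampleB are made up for illustration.

```shell
# Derive unique sample names from Illumina-style file names read on stdin.
# Assumes names look like SampleName_L001_R1_001.fastq.
sample_names() {
    awk -F '_L001' '{print $1}' | sort -u
}

# Hypothetical file names, for illustration only.
printf '%s\n' \
    SampleA_L001_R1_001.fastq SampleA_L001_R2_001.fastq \
    SampleB_L001_R1_001.fastq SampleB_L001_R2_001.fastq |
    sample_names
```

Each sample appears once in the output even though it has both an R1 and an R2 file.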

Then loop through all the samples. For each one: 1) merge the forward and reverse reads; 2) filter the merged reads; 3) add the barcodelabel=SampleName annotation; 4) concatenate the reads into a single file.

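while read -r p; do
    echo "Processing reads for $p"
    # 1) Merge the forward (R1) and reverse (R2) reads
    usearch61 -fastq_mergepairs "${p}_L001_R1_001.fastq" \
     -reverse "${p}_L001_R2_001.fastq" -fastqout "${p}_merged.fastq"
    # 2) Quality-filter the merged reads and write them out as FASTA
    usearch61 -fastq_filter "${p}_merged.fastq" \
     -fastaout "${p}_filtered.fa" -fastq_maxee 1.0
    # 3) Add the barcodelabel=SampleName annotation to each header line
    sed -e 's/^>\(.*\)/>\1;barcodelabel='"$p"';/' \
     "${p}_filtered.fa" > "${p}.fa"
    # 4) Append this sample's reads to the combined file
    cat "${p}.fa" >> reads.fa
done < sample_names.txt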
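
The sed step is the least obvious part of the loop, so here is a small sketch of it in isolation, run on a toy FASTA record rather than real usearch output. The function name add_barcodelabel, the read name, and the sequence are all invented for the example.

```shell
# Append a barcodelabel annotation to every FASTA header line read from stdin.
# $1 is the sample name; it must not contain '/' or '&', which are special
# inside this sed substitution.
add_barcodelabel() {
    sed -e 's/^>\(.*\)/>\1;barcodelabel='"$1"';/'
}

# Toy record (hypothetical read name and sequence).
printf '>read1\nACGT\n' | add_barcodelabel SampleA
```

Only lines starting with '>' are rewritten; sequence lines pass through unchanged.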

When running multiple_join_paired_ends.py I encountered the error:

Cannot find fastq-join. Is it installed? Is it in your path?

The solution was apparently to install ea-utils, which contains fastq-join.

So I tried that.

However, make failed with the same error as detailed here: https://groups.google.com/forum/#!msg/ea-utils/nR5qvhgZKIY/yx5BSEta_dQJ

The trick here was that not everything in ea-utils was required to make fastq-join work, as Eric Aronesty pointed out on the above thread:

“Of course fastq-join doesn’t use sparse-hash… so if you ran “make fastq-join” … it would work, even on a Mac.”

Then of course you need to move the resulting fastq-join binary to somewhere in your path. Now multiple_join_paired_ends.py should work fine.
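
As a rough sketch of that last step, the snippet below checks whether fastq-join can be found and, if not, prints one possible way to install it. The helper name on_path and the choice of $HOME/bin as the install directory are my assumptions; any directory already on your path would do.

```shell
# Return success if the named command can be found on the PATH.
on_path() {
    command -v "$1" >/dev/null 2>&1
}

# Assumed install location; adjust to taste.
bin_dir="$HOME/bin"

if on_path fastq-join; then
    echo "fastq-join found at $(command -v fastq-join)"
else
    echo "fastq-join not on PATH; after 'make fastq-join', try:"
    echo "  mkdir -p $bin_dir && cp fastq-join $bin_dir/"
    echo "  export PATH=\"$bin_dir:\$PATH\""
fi
```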

Statistical significance is a concept that even established researchers get completely wrong. If you don’t believe me, just read the list of increasingly desperate descriptions of non-significant results (compiled by Matthew Hankins).

I think this confusion is largely to do with language. After doing a hypothesis test, the word ‘significant’ has a precise meaning: it means that the probability of observing a result at least as extreme as the given one, if the null hypothesis were true, has been calculated and found to be less than some arbitrary pre-determined significance level. This arbitrary level is usually taken to be 0.05, corresponding to a 1 in 20 chance that you’d see such a result if the null hypothesis were true. (If you followed that explanation you’ve almost certainly heard it before.)

However, this meaning of ‘significant’ is different to its everyday meaning.

When you say a result is ‘significant’ to most non-statisticians, they’re likely to start thinking of any of the following closely related words: notable, noteworthy, worthy of attention, remarkable, outstanding, important…this is clearly how it gets (mis)used in practice.

Conversely, saying a result is ‘not significant’ sounds like you’re saying it is none of those things.

Is it any wonder that people become obsessed over whether the p-value passes that arbitrary p < 0.05 threshold when they hear in Applied Stats 101 that their result won’t be ‘important’ unless it does?

Things are further complicated by the fact that ‘clinical significance’ is also a thing. I’ve noticed that, particularly in medical studies, it’s not uncommon to talk about results as being ‘significant’ and imply that they’re clinically significant or important, whereas in fact they’re probably not.

The Wikipedia page on statistical significance stresses:

“The term significance does not imply importance and the term statistical significance is not the same as research, theoretical, or practical significance.” (source)

It’s clear that this message has failed to get through to thousands of students and researchers.

Therefore, I would like to suggest a new word to be used in place of ‘significant’ after performing a hypothesis test:

psignificant (/pʰsɪɡˈnɪfɪk(ə)nt/)

When spoken, the p at the start should be aspirated (‘puh-significant’) to remind everyone that this interpretation is inextricably linked to a p-value from a statistical test and is not the same as the everyday meaning of ‘significant’.

With this new word, I look forward to statements like this appearing in published papers:

“the difference in values is psignificant (p < 0.05) but is too small to be of clinical significance”

This post is just a dumb suggestion. But I don’t think it’s completely fair to blame non-statisticians for misusing p-values when the language used to describe them is misleading.

A recent paper by Sumner et al. in the BMJ analysed 462 press releases associated with scientific papers related to health research published by 20 leading UK universities in 2011. They found that, when compared to the original journal article,

  • “40% of the analysed press releases contained more direct or explicit advice”
  • “33% of primary claims in press releases were more strongly deterministic”
  • “For studies on animals, cells, or simulations, 36% of press releases exhibited inflated inference to humans”

These are sobering statistics. It’s long been the fashionable thing for scientists to blame journalists for the misrepresentation of their work. Sumner et al. have shown that, for the papers they analysed, exaggeration of various kinds at the press release stage is not uncommon. This is the stage where authors should, in theory, have some control, so it’s particularly embarrassing.

Sumner et al. focused on summary statistics about the overall problem. However, as Ben Goldacre (@bengoldacre) pointed out in a linked editorial, it would be possible to use the published data to find “those academics and institutions associated with the worst exaggerations and publish their names online, along with details of the transgressions” (source).

I liked the idea of extracting the information so that individual press releases could be analysed easily, so I downloaded the data.

[Aside: I was disappointed to find that the data files were all in the proprietary MATLAB (.mat) format. I’m not sure why the authors chose to release the data like this given that an individual license for MATLAB costs £1600. Fortunately I have access to MATLAB through my university, but I think journals should be much more active in encouraging the release of supplementary material in non-proprietary formats.]

Data extraction details

In extracting the data for individual press releases, I chose to stick fairly closely to the criteria used by Sumner et al. for simplicity. Each journal article, press release, and related news article included in the study was coded according to a set of guidelines detailed in their supplementary material. I read these guidelines and then tried to link them to the variables present in the final dataset to extract the information of interest for every press release. This wasn’t especially trivial and I may well have made mistakes in this step – the code is available on GitHub for checking. I’m not a MATLAB user normally, so it’s very rough and ready.

I chose to restrict the analysis to press releases and to stick fairly closely to the data used for summary statistics by Sumner et al. The information I aimed to extract was:

  • The reference number given to the paper and press release by Sumner et al.
  • The titles of the paper and press release
  • The authors of the original paper (using a PubMed search by title)
  • The university the press release was issued by
  • The study sample, as reported in the paper and in the press release (to identify e.g. inference to humans from animal research)
  • The strength of any advice given in the paper and in the press release (to identify e.g. exaggeration of advice given to readers)
  • The strength of causation according to the paper and to the press release (to identify e.g. exaggeration of correlative statements into causative statements)
  • The main variables in the study according to the paper and to the press release (to identify e.g. generalization of variables)
  • Whether the word “cure” was used


The resulting csv file is available on GitHub or as a Google Spreadsheet.

I invite anybody who’s interested to take a look and do their own analysis. There are some missing entries in the author data (69/462) due to problems with the PubMed lookup, even for some papers that are present on PubMed – I’m not sure why this is and will try to fix it (EDIT: the code has been updated and now only 18 papers lack authors because they aren’t indexed on PubMed. That indicates that they’re not typical health research papers – sure enough, looking at the titles most are of a more sociological flavour – but I plan to add them manually for completeness when I get a chance). The authors seem to be accurate based on manually checking a small random sample of the data – see for yourself by going to the PubMed URL provided. For other variables, all missing entries (often coded as ‘-9’) are as present in the original data files.

In that spreadsheet it should be possible to sort the publications by a variable and quickly identify papers and authors that were coded as such. You can then look at the original data and use the reference number to read the press release (in folder ‘5. Press releases’) to see if you agree with the assessment of Sumner et al.

For example, let’s look at some in which the study sample was generalized in a major way when written about in the press release. This corresponds to a value of 2 in the ‘Sample_changed’ column. There are 85 press releases which meet this criterion, and you can see what the samples were in the ‘Sample_journal’ and ‘Sample_PR’ columns.

Or take the exaggeration of advice. A value of 3 in the ‘Advice_exaggeration’ column corresponds to explicit advice to the general public present in the press release when the paper contained no advice. There are 19 papers that meet this criterion. I’ve left the Google spreadsheet sorted in this order.
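
If you’d rather do these counts from the command line than by sorting the spreadsheet, something like the following sketch should work. It assumes the sheet has been exported as a tab-separated file (titles contain commas, which makes naive comma-splitting unreliable); the function name count_rows and the file name press_releases.tsv are hypothetical.

```shell
# Count data rows in a tab-separated file where the named column equals a value.
# Usage: count_rows FILE COLUMN VALUE
count_rows() {
    awk -F '\t' -v col="$2" -v val="$3" '
        # Header row: find the index of the requested column.
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
        # Data rows: count matches (nothing is counted if the column is absent).
        c && $c == val { n++ }
        END { print n + 0 }
    ' "$1"
}

# Example calls (press_releases.tsv is a hypothetical export of the sheet):
# count_rows press_releases.tsv Sample_changed 2
# count_rows press_releases.tsv Advice_exaggeration 3
```

If the export matches the spreadsheet described here, the two example calls should reproduce the counts quoted above; a different result would suggest the columns were shifted during export.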

I plan to do some more analysis myself, but please feel free to use the data yourself, bearing in mind some caveats…

Preliminary thoughts

The spreadsheet I’ve provided simply represents my attempt at extracting a summary of what Sumner et al. have already publicly released. I would recommend caution before accusing authors of misrepresentation based solely on the information here, both because I may have extracted the data incorrectly and because that data itself comes from subjective judgements (although the coding guidelines are rigorous and Sumner et al. showed a high concordance of 91% between blinded coders).

Even if the information is accurate, much stronger evidence would be required to suggest that anybody identified here as an author of a paper that had an exaggerated press release was being duplicitous or deliberately misleading. I think the gradual exaggeration of what might have been a measured scientific article can happen when scientists and universities are trying hard to sell their research as relevant to the general public. Sumner told James Hamblin at The Atlantic that correlative statements in papers (“significant associations between variable x and outcome y”) often become causative statements in press releases (“variable x increases risk of outcome y”):

“It is very common for this type of thing to happen…probably partly because the causal phrases are shorter and just sound better. There may be no intention to change the meaning.” (source)

This is a bad situation, but by acknowledging the problem the academic community can begin to tackle it. As Goldacre points out, this should require only “a modest extension of current norms”. (source)

Scientists have a responsibility to avoid exaggerations in press releases for their papers instead of passing the buck to journalists (some of whom have already used this paper as an opportunity to pass the buck right back, which isn’t very helpful). A culture of calling out and attempting to prevent these sorts of exaggerations at every level – in news articles, in the press release, and also in the original paper – would be a good thing. It certainly seems reasonable to me that academics are made accountable for their own press releases as Goldacre recommends.

As to how those press releases should best be written: Sumner et al. are apparently following up this retrospective study with a randomised trial looking at “how different styles of press releases, and variants in specific phrasing, influence the accuracy and quantity of science news” (source).

Get the data:

Data extracted: available as a Google spreadsheet or from GitHub

Code: available via GitHub


Original article: Sumner et al., ‘The association between exaggeration in health related science news and academic press releases: retrospective observational study’, BMJ 2014;349:g7015

Original data and supplementary material: here