As debate rumbles on about how and how much poor statistics is to blame for poor reproducibility, Nature asked influential statisticians to recommend one change to improve science. The common theme? The problem is not our maths, but ourselves.
Another thing that we can now do is run simulated experiments and observations, using random numbers adjusted to have suitable statistical distributions. The results can then be plugged into statistical-significance tests to see how well the tests do.
JEFF LEEK: Adjust for human cognition
In the past couple of decades, many fields have shifted from data sets with a dozen measurements to data sets with millions. Methods that were developed for a world with sparse and hard-to-collect information have been jury-rigged to handle bigger, more-diverse and more-complex data sets. No wonder the literature is now full of papers that use outdated statistics, misapply statistical tests and misinterpret results. The application of P values to determine whether an analysis is interesting is just one of the most visible of many shortcomings.
BLAKELEY B. MCSHANE & ANDREW GELMAN: Abandon statistical significance
In many fields, decisions about whether to publish an empirical finding, pursue a line of research or enact a policy are considered only when results are statistically significant, defined as having a P value (or similar metric) that falls below some pre-specified threshold. This approach is called null hypothesis significance testing (NHST). It encourages researchers to investigate so many paths in their analyses that whatever appears in papers is an unrepresentative selection of the data.
DAVID COLQUHOUN: State false-positive risk, too
To demote P values to their rightful place, researchers need better ways to interpret them. What matters is the probability that a result that has been labelled as statistically significant turns out to be a false positive. This false-positive risk (FPR) is always bigger than the P value.
How much bigger depends strongly on the plausibility of the hypothesis before an experiment is done, that is, on the prior probability of there being a real effect. If this prior probability were low, say 10%, then a P value close to 0.05 would carry an FPR of 76%. To lower that risk to 5% (which is what many people still believe P < 0.05 means), the P value would need to be 0.00045.
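The text does not say how these FPR figures were computed. The sketch below assumes the usual odds form of Bayes' rule, in which the posterior odds of a real effect equal the prior odds times a likelihood ratio. It backs out the likelihood ratio that each quoted FPR implies at a 10% prior, then shows how the same evidence would read under a different prior. The function names are hypothetical, and the likelihood ratio itself depends on the test, sample size and assumed effect size, none of which are given here.

```python
def false_positive_risk(prior, likelihood_ratio):
    """Probability of no real effect given the result (Bayes' rule, odds form).

    likelihood_ratio = P(data | real effect) / P(data | no effect).
    This is an assumed formulation, not one stated in the text.
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return 1 / (1 + posterior_odds)


def implied_likelihood_ratio(fpr, prior):
    """Likelihood ratio needed to reach a given FPR from a given prior."""
    posterior_odds = (1 - fpr) / fpr
    prior_odds = prior / (1 - prior)
    return posterior_odds / prior_odds


# Figures quoted in the text: prior 10%, FPR 76% near P = 0.05,
# and FPR 5% at P = 0.00045.
lr_p05 = implied_likelihood_ratio(0.76, 0.10)    # about 2.8
lr_small = implied_likelihood_ratio(0.05, 0.10)  # about 171

print(f"implied likelihood ratio near P = 0.05: {lr_p05:.2f}")
print(f"implied likelihood ratio at P = 0.00045: {lr_small:.0f}")

# The same evidence under a 50% prior gives a much smaller risk.
print(f"FPR at a 50% prior: {false_positive_risk(0.50, lr_p05):.2f}")
```

On these assumptions, a result near P = 0.05 favours a real effect by a factor of less than 3, which is why a sceptical prior leaves the false-positive risk so high.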
MICHÈLE B. NUIJTEN: Share analysis plans and results
Planning and openness can help researchers to avoid false positives. One technique is to preregister analysis plans: scientists write down (and preferably publish) how they intend to analyse their data before they even see them. This eliminates the temptation to hack out the one path that leads to significance and afterwards rationalize why that path made the most sense. With the plan in place, researchers can still actively try out several analyses and learn whether results hinge on a particular variable or a narrow set of choices, as long as they clearly state that these explorations were not planned beforehand.
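How strongly unplanned paths inflate false positives can be illustrated with a small simulation. The sketch below is an idealization: it assumes no real effect at all, treats each analysis path as an independent z-test on fresh normal data with known unit variance, and reports a study as significant if any path reaches P < 0.05. Real analysis paths on one data set are correlated, so the inflation would usually be smaller than this. All names and parameter values are hypothetical.

```python
import random
from statistics import NormalDist, mean


def two_sided_p(sample):
    """Two-sided z-test P value for a true mean of 0 (unit variance assumed)."""
    z = mean(sample) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def share_with_a_hit(n_paths, n_studies=2000, n=20, alpha=0.05, seed=1):
    """Fraction of null studies in which at least one path is 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        p_values = [
            two_sided_p([rng.gauss(0, 1) for _ in range(n)])
            for _ in range(n_paths)
        ]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_studies


one_path = share_with_a_hit(1)    # a single preregistered analysis
ten_paths = share_with_a_hit(10)  # best of ten unplanned analyses

print(f"one path:  {one_path:.2f}")
print(f"ten paths: {ten_paths:.2f}")
```

With independent paths the expected rates are 0.05 for one path and 1 - 0.95^10, roughly 0.40, for ten, so the simulated values should land near those figures.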
STEVEN N. GOODMAN: Change norms from within
Norms are established within communities partly through methodological mimicry. In a paper published last month on predicting suicidality, the authors justified their sample size of 17 participants per group by stating that a previous study of people on the autism spectrum had used those numbers. Previous publication is not a true justification for the sample size, but it does legitimize it as a model. To quote from a Berwick report on system change, "culture will trump rules, standards and control strategies every single time" (see go.nature.com/2hxo4q2).
As an example, I recently went over to A Website About Stephen Jay Gould's Essays On Natural History and got its maintainer's estimates of what SJG talked about in each of his essays -- a grand total of 299 of them with 52 possible subjects. I looked for correlations with Principal Component Analysis, and I found that most of the principal-component lengths fit onto a line with the highest two being somewhat greater.
I decided to check whether this might be an artifact of the PCA procedure by taking each subject's average frequency and creating a simulated data set with those frequencies and no correlations between subjects. I found the same curve except for the two greatest PCs -- in the real data, they were noticeably greater. So I conclude that SJG's writings had only a limited amount of correlation, that his writing about one subject was pretty much uncorrelated with his writing about some other subject.
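The check described above can be sketched roughly as follows. The real essay table is not reproduced here, so the "observed" data are a made-up stand-in with two planted themes, and only the largest principal component is compared (found by power iteration on the correlation matrix) rather than the whole curve. Every name, frequency and parameter below is an assumption for illustration, not the original analysis.

```python
import random
from statistics import mean, pstdev


def simulate_independent(freqs, n_essays, rng):
    """Null model: each subject appears independently at its own frequency."""
    return [[1.0 if rng.random() < p else 0.0 for p in freqs]
            for _ in range(n_essays)]


def simulate_two_themes(freqs, n_essays, rng, boost=3.0):
    """Hypothetical stand-in for real data: each essay favours one half of the subjects."""
    half = len(freqs) // 2
    rows = []
    for _ in range(n_essays):
        first_theme = rng.random() < 0.5
        row = []
        for j, p in enumerate(freqs):
            favoured = (j < half) == first_theme
            q = min(1.0, p * boost) if favoured else p / boost
            row.append(1.0 if rng.random() < q else 0.0)
        rows.append(row)
    return rows


def correlation_matrix(rows):
    """Pearson correlation matrix of the columns (subjects)."""
    n = len(rows)
    z = []
    for col in zip(*rows):
        m, s = mean(col), pstdev(col)
        # A constant column carries no correlation information.
        z.append([(x - m) / s if s > 0 else 0.0 for x in col])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]


def top_eigenvalue(matrix, iterations=500, seed=0):
    """Largest eigenvalue of a symmetric positive semi-definite matrix (power iteration)."""
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in matrix]
    value = 0.0
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        value = sum(x * x for x in w) ** 0.5
        if value == 0:
            return 0.0
        v = [x / value for x in w]
    return value


rng = random.Random(42)
base = [0.10 + 0.02 * j for j in range(12)]  # hypothetical subject frequencies
observed = simulate_two_themes(base, 299, rng)

# Keep each subject's observed frequency, but drop all correlations.
freqs = [mean(col) for col in zip(*observed)]
null = simulate_independent(freqs, 299, rng)

top_observed = top_eigenvalue(correlation_matrix(observed))
top_null = top_eigenvalue(correlation_matrix(null))
print(f"largest PC, observed stand-in: {top_observed:.2f}")
print(f"largest PC, independent null:  {top_null:.2f}")
```

A largest component well above its null counterpart points to real correlation between subjects. A component at about the null level is what the PCA procedure produces from uncorrelated data.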
But there were some correlations. "Historical Figures in Evolution / Natural History" and "Darwin's Theory" are negatively correlated, meaning that if SJG wrote about one of them in some essay, he tended not to write about the other in that essay. Those were also his favorite essay subjects, appearing in 31% and 25% of the essays, respectively. Looking at how other subjects are correlated with these two, one could compose two broad categories: "Historical Stuff" and "How Evolution Works", with SJG preferring to write about one at a time.