False-positive psychology

Professor Miguel explores a paper by Joseph Simmons, Leif Nelson, and Uri Simonsohn discussing the impact of flexibility in data collection methods.
So what do these folks do? They basically are very worried about the “u” term, “Researcher Degrees of Freedom.” In other words, once you have data in hand, stuff you can do to make an effect significant that is really a null effect, that’s really a zero.
And that leads them to believe that for any given research study that they read they’re pretty skeptical about whether it’s a real effect. So this is their main table, and they’re interested in the likelihood of obtaining a false-positive. So this is just 1 minus the PPV. The PPV is like what are the odds that when I see a result it’s real. And this is sort of the opposite– not exactly, but when I see a result, is it a false-positive, basically. Kind of like 1 minus the PPV. The wording isn’t exactly the same, the denominators are different in those two, but let’s just say it’s roughly 1 minus PPV.
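The "denominators are different" remark can be written out explicitly. This is only a sketch in notation that does not appear in the talk: $\pi$ (the share of tested hypotheses that are real effects), $\alpha$ (the rate at which true nulls get rejected), and $1-\beta$ (power) are all assumed symbols.

```latex
\begin{align*}
\text{Table entry:}\quad
  & P(\text{significant} \mid \text{null is true}) = \alpha \\[4pt]
\text{PPV:}\quad
  & P(\text{effect is real} \mid \text{significant})
    = \frac{(1-\beta)\,\pi}{(1-\beta)\,\pi + \alpha\,(1-\pi)} \\[4pt]
1 - \text{PPV:}\quad
  & P(\text{null is true} \mid \text{significant})
    = \frac{\alpha\,(1-\pi)}{(1-\beta)\,\pi + \alpha\,(1-\pi)}
\end{align*}
```

The table conditions on the null being true, while 1 minus PPV conditions on having seen a significant result. The two are linked: holding $\pi$ and power fixed, anything that inflates $\alpha$ also pushes 1 minus PPV up.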
And they look at researcher degrees of freedom in four cases, which are listed out here. Now, remember, this data is pure noise. They’ve generated noise. So anything – and we’re going to focus on this P less than .05 column, so standard levels of significance. Given that it’s noise, the chance that we reject the null for all of these cases should be exactly or close to .05, other than some sampling variation. If it was .04 I’d be happy. If it was .06 I’d be happy. But it shouldn’t deviate much from .05.
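As a baseline, here is a minimal sketch of that "pure noise" check. It is not the authors' code: the names `z_test_p` and `baseline_rate` are made up for illustration, 20 observations per cell is an assumed sample size, and a z-test with a known standard deviation of 1 stands in for the t-tests used in the paper so that the example needs only the Python standard library.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()  # standard normal, used to turn z into a p-value


def z_test_p(group_a, group_b, sd=1.0):
    """Two-sided p-value for a difference in means when the SD is known.

    Assumption: a known-variance z-test replaces the paper's t-test.
    """
    se = sd * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (fmean(group_a) - fmean(group_b)) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def baseline_rate(n_per_cell=20, n_sims=20_000, alpha=0.05, seed=1):
    """Share of pure-noise experiments that come out 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Both cells are drawn from the same distribution: the true effect is zero.
        treat = [rng.gauss(0, 1) for _ in range(n_per_cell)]
        control = [rng.gauss(0, 1) for _ in range(n_per_cell)]
        hits += z_test_p(treat, control) < alpha
    return hits / n_sims


rate = baseline_rate()
print(f"one pre-specified test on noise: {rate:.3f}")
```

With a single test chosen in advance, the printed rate should sit close to .05, which is the benchmark the other cases are compared against.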
So what are the different cases that they look at? The first is, what if you just had two outcome variables? Very typical situation when you’re, say, doing fieldwork or you’re in the lab. You collect a couple different measures of the same thing. And they say, what is the researcher degrees of freedom here? They’re actually going to allow these things to be pretty highly correlated. They’re not just noise. They’re two correlated measures at .5 of the same thing. If I can focus on one as my main outcome and sort of ignore the other, if I can focus on measure B rather than measure A, or if I can take the average of the two.
Those are the three things I can do with the data. If I can do those three things with these highly correlated outcome measures, I’ve doubled the chance I can reject the null.
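The two-outcome case can be sketched the same way. This is an illustration under assumptions, not the paper's simulation: `correlated_pair` and `flexible_outcome_rate` are hypothetical names, 20 observations per cell is assumed, and known-variance z-tests replace t-tests. The average of two unit-variance measures correlated at r has variance (1 + r) / 2, which is why the test on the average uses its own standard deviation.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def z_test_p(group_a, group_b, sd=1.0):
    """Two-sided p-value for a difference in means when the SD is known."""
    se = sd * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (fmean(group_a) - fmean(group_b)) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def correlated_pair(rng, r):
    """Two standard-normal noise measures with correlation r."""
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    return z1, r * z1 + (1 - r * r) ** 0.5 * z2


def flexible_outcome_rate(n_per_cell=20, r=0.5, n_sims=20_000, alpha=0.05, seed=1):
    """False-positive rate when measure A, measure B, or their average may be reported."""
    rng = random.Random(seed)
    sd_avg = ((1 + r) / 2) ** 0.5  # SD of the average of the two measures
    hits = 0
    for _ in range(n_sims):
        treat = [correlated_pair(rng, r) for _ in range(n_per_cell)]
        control = [correlated_pair(rng, r) for _ in range(n_per_cell)]
        p_a = z_test_p([a for a, _ in treat], [a for a, _ in control])
        p_b = z_test_p([b for _, b in treat], [b for _, b in control])
        p_avg = z_test_p(
            [(a + b) / 2 for a, b in treat],
            [(a + b) / 2 for a, b in control],
            sd=sd_avg,
        )
        # The researcher reports whichever analysis "worked".
        hits += min(p_a, p_b, p_avg) < alpha
    return hits / n_sims


rate = flexible_outcome_rate()
print(f"best of A, B, or their average: {rate:.3f}")
```

Under these assumptions the rate should come out at roughly double the nominal .05, in line with the "doubled the chance" remark.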
Case two, and this is really, again, this is where we’re more in like the lab world. Can I collect ten more observations for a given cell? So I’m in the lab, I’m looking at my data as it comes in. I’m like, “This is looking pretty interesting, I want a little more data for treatment 2, I think that one’s kind of interesting.” If I can do that and get 10 more observations per cell, just 10 more observations, I’ve already gone from 5 to 8 percent. And apparently – and they have some data in the paper here – 70 percent of lab social psychologists do this. They kind of look at their data and decide which treatments to get more data for.
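A minimal sketch of this peek-then-top-up rule, under stated assumptions: start with 20 observations per cell, test, and add 10 more per cell only if the first test misses. The function name `optional_stopping_rate` is hypothetical, and a known-variance z-test stands in for the t-test in the paper.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def z_test_p(group_a, group_b, sd=1.0):
    """Two-sided p-value for a difference in means when the SD is known."""
    se = sd * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (fmean(group_a) - fmean(group_b)) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def optional_stopping_rate(n_start=20, n_extra=10, n_sims=20_000, alpha=0.05, seed=1):
    """False-positive rate when data collection continues after a non-significant peek."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        treat = [rng.gauss(0, 1) for _ in range(n_start)]
        control = [rng.gauss(0, 1) for _ in range(n_start)]
        p = z_test_p(treat, control)
        if p >= alpha:
            # First look was not significant: collect more and test again.
            treat += [rng.gauss(0, 1) for _ in range(n_extra)]
            control += [rng.gauss(0, 1) for _ in range(n_extra)]
            p = z_test_p(treat, control)
        hits += p < alpha
    return hits / n_sims


rate = optional_stopping_rate()
print(f"test at 20 per cell, then again at 30 if needed: {rate:.3f}")
```

The second look can only add rejections, never remove them, so the rate has to end up above .05; in this setup it should land in the neighborhood of the 8 percent mentioned in the talk.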
So already we can’t really believe the p-values there. Three is really another useful one. What if I just have one covariate, gender? And again, it’s correlated with the outcomes through chance here, it’s all noise. What if I have one covariate, so I can either control for the covariate or not, or I can focus on the interaction between treatment and the covariate, just to look at subgroups. Those are the things I could do. Will any of those yield a P less than .05? I’m at 11.7 percent. Just one covariate. People have dozens or hundreds of covariates.
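The covariate case is harder to reproduce without a regression library, so what follows is a deliberately reduced sketch rather than the analysis behind the 11.7 percent figure. It assumes each observation is female with probability one half, and it checks only two things: the overall treatment effect and the treatment-by-gender interaction. The covariate-adjusted model is left out, the names `interaction_p` and `covariate_rate` are hypothetical, and known-variance z-tests replace the paper's tests.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def z_test_p(group_a, group_b, sd=1.0):
    """Two-sided p-value for a difference in means when the SD is known."""
    se = sd * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (fmean(group_a) - fmean(group_b)) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def interaction_p(cells):
    """p-value for 'the treatment effect differs by gender' (known SD of 1)."""
    effect_f = fmean(cells["treat", "f"]) - fmean(cells["control", "f"])
    effect_m = fmean(cells["treat", "m"]) - fmean(cells["control", "m"])
    se = sum(1 / len(values) for values in cells.values()) ** 0.5
    z = (effect_f - effect_m) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def covariate_rate(n_per_cell=20, n_sims=20_000, alpha=0.05, seed=1):
    """False-positive rate when either the main effect or the interaction may be reported."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        cells = {(c, g): [] for c in ("treat", "control") for g in ("f", "m")}
        for condition in ("treat", "control"):
            for _obs in range(n_per_cell):
                gender = rng.choice("fm")  # noise covariate, unrelated to anything
                cells[condition, gender].append(rng.gauss(0, 1))
        treat = cells["treat", "f"] + cells["treat", "m"]
        control = cells["control", "f"] + cells["control", "m"]
        p_values = [z_test_p(treat, control)]
        if all(cells.values()):  # the interaction needs all four cells filled
            p_values.append(interaction_p(cells))
        hits += min(p_values) < alpha
    return hits / n_sims


rate = covariate_rate()
print(f"main effect or gender interaction: {rate:.3f}")
```

In this reduced version the two tests are independent, so the rate should land near 1 − 0.95² ≈ 0.0975. That is below the figure quoted in the talk, which also counts the analysis that controls for the covariate.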
And then the last one here, there were three experimental conditions– what if I could just sort of exclude the data from one of them and sort of not report it? And say, “Oh, that treatment didn’t work.” And that happens all the time, again, in field experiments and lab experiments. People say, “Ah, that just didn’t kind of work.” Again, I have a much higher false-positive rate. So any of these very limited things, 10 observations, 1 covariate, 2 outcomes, lead to misleading p-values. But what gets really scary in this exercise is the ability to combine them.
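Dropping a condition can be sketched as "run all the pairwise comparisons and report the one that worked". This is an illustration only: the name `drop_condition_rate` is hypothetical, 20 observations per condition is assumed, and known-variance z-tests replace the tests used in the paper.

```python
import random
from itertools import combinations
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def z_test_p(group_a, group_b, sd=1.0):
    """Two-sided p-value for a difference in means when the SD is known."""
    se = sd * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (fmean(group_a) - fmean(group_b)) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def drop_condition_rate(n_per_cell=20, n_sims=20_000, alpha=0.05, seed=1):
    """False-positive rate when any one of three noise conditions may be left out."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        conditions = [
            [rng.gauss(0, 1) for _ in range(n_per_cell)] for _ in range(3)
        ]
        # Each pair corresponds to silently dropping the third condition.
        p_values = [z_test_p(a, b) for a, b in combinations(conditions, 2)]
        hits += min(p_values) < alpha
    return hits / n_sims


rate = drop_condition_rate()
print(f"best of three pairwise comparisons: {rate:.3f}")
```

Three chances at significance instead of one should push the rate to more than double the nominal .05 in this setup, even though every condition is pure noise.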
So if I can do all four of these things– add a covariate, look at one additional outcome measure, add a few observations – things that are totally within the bounds of normal practice in a lot of empirical fields across the social sciences– the odds that I’m going to find at least one significant effect are 60 percent now.
Simple solutions to false-positive publications. These are for authors.
Authors have to decide the rule for terminating data collection before data collection begins and show it in the article. So no more of this “I’m going to add 10 or 20 observations.” They’re going to try to tie researchers’ hands on that margin. You have to have at least 20 observations per cell or give some reason why you didn’t. So no more of these “I’m going to have 10 or 12 observations,” which happens in lab experiments in social psychology, and even in some branches of experimental work like economics, and gives you massively underpowered analyses.
For me as an outsider to this literature, I’ve always asked myself, and I think that’s part of the motivation for what they’re doing, why don’t these folks do like half as many experiments and have twice as big a sample size to actually say something definitive? But of course if you’re in a world of false-positive results, that’s the last thing you want to do. You want to have the possibility of false-positives, and big samples will kill all your zero results. So you want small samples and you live on the sampling variation and publish off that. Isn’t that terribly cynical? But that’s basically what they’re saying in this article.
Authors must list all the variables they collected. You can’t just not tell us that you collected another proxy for the same thing, that ex-ante was just as good as the one you published. You have to tell us all three of them or four of them or two of them. Authors have to report all the experimental conditions. These are just tying their hands against all the things they were warning about in the previous table. If you had a certain treatment in this experiment, tell us about it and show us the data. Don’t exclude it without telling us about it. If observations are eliminated, they’re outliers or for some other reason, you have to report what the results are if you include the data.
So this is the beginning of their robustness. If you start dropping data, you may have a good reason for doing so, but I still want to see the full sample results. Yes, you can argue and make the case for dropping them, but don’t hide it.
And again, same exact thing: if you run analysis with a covariate, I want to see the unadjusted analysis. And again, there may be very good reasons for having the covariate. You make that case, but I want to see the unadjusted one. And I’m going to believe your result a lot more if it holds in both cases 5 and 6, and that’s what a robustness table really is.
In this video, we explore how flexibility, common in data collection and analysis, can increase the likelihood of producing a false-positive. Joseph Simmons, Leif Nelson, and Uri Simonsohn, the authors of “False-positive Psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant”, demonstrate this using experiments and computer simulations, and also give simple guidelines that researchers and reviewers can use to reduce the incidence of these errors.

False-positives in social science research can have particularly negative impacts when research is used to inform policy or other outcomes. These outcomes can range from the endorsement of the idea that simply posing in a powerful stance can actually empower someone to the suggestion that increasing austerity measures leads to economic growth.
In this article, authors Joseph Simmons, Leif Nelson, and Uri Simonsohn address the costly issues and frequency of false positives in social science research and suggest six requirements for authors that they believe are simple, low-cost, and straightforward solutions to the problem.
They first define a false positive as the “incorrect rejection of a null hypothesis”. The problem with false positives is that once they are published, they are persistent and there is little incentive to replicate findings to test their significance. The authors write “…false positives waste resources: they inspire investment in fruitless research programs and can lead to ineffective policy changes. Finally, a field known for publishing false positives risks losing its credibility.”
They argue that, with current standards of disclosure, false positives are actually “vastly more likely” due to the ease with which researchers can publish “statistically significant” evidence for nearly any hypothesis.
Many of these issues can be attributed to researcher degrees of freedom, or the multitude of decisions researchers can make about the design and details of their experiments. This freedom increases the likelihood that their analyses will produce false positives, and can be attributed to two factors: “ambiguity in how to best make decisions, and the researcher’s desire to find a statistically significant result”, the latter referring to the temptation and likelihood of researchers coming to conclusions that are consistent with their own desires or beliefs.
The authors performed simulations to estimate the influence of researcher degrees of freedom on the probability of a false-positive result. They focused on four common degrees of freedom that increase the likelihood of a researcher falsely detecting a significant effect. These include flexibility in:
  1. choosing among dependent variables,
  2. choosing sample size,
  3. using covariates, and
  4. reporting only subsets of experimental conditions.
They also suggest six requirements as a solution to the high incidence of false-positives, encouraging appropriate conduct of research, transparency in methods, and holding readers accountable to make informed decisions about the credibility of findings.
Authors must:
  1. Decide the rule for terminating data collection before data collection begins and report this rule in the article. This prevents authors from adding additional observations and further testing to achieve statistical significance if initial results are non-significant.
  2. Collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification. Small samples are “simply not powerful enough to detect most effects” and are “more likely to reflect interim data analysis and a flexible termination rule”.
  3. List all variables collected in a study. This prevents researchers from only reporting convenient subsets of measurements and allows readers to identify degrees of freedom.
  4. Report all experimental conditions, including failed manipulations. This “prevents authors from selectively choosing only to report conditions that yield results consistent with their hypothesis.”
  5. If observations are eliminated, also report what the statistical results are when those observations are included. This requires authors to explain why they eliminated the data and encourages readers to consider the validity of the exclusion.
  6. Report the statistical results of the analysis without a covariate if the analysis includes a covariate. This requires authors to “justify use of the covariate,” reveals “the extent to which a finding is reliant on the presence of a covariate,” and, again, encourages readers to practice discernment about whether the covariate is warranted.
Finally, Simmons, Nelson, and Simonsohn present four guidelines for reviewers to abide by.
Reviewers should…
  1. Ensure that authors follow the requirements.
  2. Be more tolerant of imperfections in results (false-positive findings could be due to an “unreasonable expectation” imposed by reviewers for data to turn out as predicted).
  3. Require authors to demonstrate that their results do not depend on arbitrary analytic decisions.
  4. Require authors to conduct an exact replication if justifications of data collection or analysis are not compelling.
While there has been some criticism of the authors’ proposed solutions, they argue that the requirements for which they advocate impose minimal costs to all involved in research and review and are a step towards discovering and disseminating valid research.
Think about an experiment that recently caught your interest. Were any of the author requirements listed in this article followed, or even disclosed? If not, what do you think are the implications for their findings? Can you think of a time when degrees of freedom might have affected your own research?
You can read the entire article via the DOI in the citation below.
Simmons, Joseph P., Leif D. Nelson, and Uri Simonsohn. 2011. “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.” Psychological Science 22 (11): 1359–66. doi:10.1177/0956797611417632.
This article is from the free online course Transparent and Open Social Science Research.

