Skip to 0 minutes and 0 seconds
Let’s continue our discussion. What should p-values look like? This is the Simonsohn et al. paper. They’re really just going to focus on the distribution of p-values in psychology, for p-values less than .05. That’s the bulk of the point estimates in their literature: in psychology, you just don’t publish if it’s not less than .05, so we’re not missing much anyway. And there’s something actually kind of nice about this, because it narrows the focus to these findings. You can say, well, let’s look at this body of claimed findings. What can the distribution of p-values in this body of studies tell us about whether there was data mining or selective publication?

Skip to 0 minutes and 41 seconds
So, these are the claimed effects. Let’s evaluate the claimed effects. Let’s look at p less than .05. There are a couple of ideas and they’re very intuitive. If there’s no effect, what is the distribution of p-values? It should be a uniform distribution. Okay, that’s intuitive. If there is an effect, the distribution of p-values should be right-skewed. So across .01, .02, up to .05, it should be right-skewed. That’s the idea. In other words, there should be more significant results at the .01 or .02 level than at .04 or .05. Obviously, the more power you have, the more likely you are to get p-values that are really, really small.
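The two cases just described can be checked with a small simulation. The following is a minimal sketch, not the authors' code: it assumes a two-group comparison analyzed with a two-sided z-test and known unit variance, and the function names, effect size, sample size, and seeds are all illustrative choices. It draws each study's test statistic directly from its sampling distribution and reports what share of the significant p-values falls in each .01-wide bin below .05.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def simulate_p_values(effect_size, n_per_group, n_studies, seed):
    """Two-sided z-test p-values for many simulated two-group studies.

    Assumption: both groups have a known standard deviation of 1, so the
    z statistic is normal with mean effect_size * sqrt(n_per_group / 2).
    """
    rng = random.Random(seed)
    shift = effect_size * (n_per_group / 2) ** 0.5
    p_values = []
    for _ in range(n_studies):
        z = rng.gauss(shift, 1.0)
        p_values.append(2 * (1 - STD_NORMAL.cdf(abs(z))))
    return p_values


def p_curve(p_values, n_bins=5, alpha=0.05):
    """Share of the significant p-values that falls in each equal-width bin."""
    counts = [0] * n_bins
    for p in p_values:
        if p < alpha:
            counts[min(int(p / alpha * n_bins), n_bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]


if __name__ == "__main__":
    null_curve = p_curve(simulate_p_values(0.0, 20, 40000, seed=1))
    effect_curve = p_curve(simulate_p_values(0.5, 20, 40000, seed=2))
    print("no effect  :", [round(s, 2) for s in null_curve])
    print("true effect:", [round(s, 2) for s in effect_curve])
```

With no effect, the five shares should come out roughly equal, which is the uniform case. With a true effect, the lowest bin should hold the largest share, which is the right-skewed case in the lecture's terminology.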

Skip to 1 minute and 29 seconds
For adequate power in prospective studies, the rule of thumb is 80 percent power. A lot of studies probably have 50 percent or maybe 40 percent power. This one is a woefully underpowered study. Even in a woefully underpowered study, there are more than twice as many p-values between zero and .01 as there are between .04 and .05. As we get up to adequate power, here it’s 79 percent power, you shouldn’t actually have many p-values between .04 and .05. 79 percent of the time you’re gonna detect significance, and the bulk of that time you’re gonna be way over here, at less than .02. 83 percent of the significant effects are gonna be at .02 or less.
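The link between power and the shape of the p-curve can also be worked out directly. This is a rough sketch under a normal approximation (a two-sided z-test), so its numbers will not match the figures quoted above exactly; the helper names are hypothetical. For a given power, it finds the shift in the test statistic that produces that power, then computes what fraction of the significant results lands in a given p-value range.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_of(delta, alpha=0.05):
    """Probability that a two-sided z-test gives p < alpha when the
    z statistic is normal with mean delta and standard deviation 1."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return (1 - STD_NORMAL.cdf(z_crit - delta)) + STD_NORMAL.cdf(-z_crit - delta)


def noncentrality_for_power(power):
    """Find delta with power_of(delta) == power by bisection.
    Requires 0.05 < power < 1."""
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if power_of(mid) < power:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def share_of_significant(delta, p_low, p_high, alpha=0.05):
    """Fraction of significant results with p_low <= p < p_high."""
    def prob_p_below(p):
        return 0.0 if p <= 0 else power_of(delta, alpha=p)
    return (prob_p_below(p_high) - prob_p_below(p_low)) / power_of(delta, alpha)


if __name__ == "__main__":
    for power in (0.40, 0.79):
        delta = noncentrality_for_power(power)
        print(
            f"power={power:.2f}",
            f"share p<.01: {share_of_significant(delta, 0.0, 0.01):.2f}",
            f"share p<.02: {share_of_significant(delta, 0.0, 0.02):.2f}",
            f"share .04-.05: {share_of_significant(delta, 0.04, 0.05):.2f}",
        )
```

Even at 40 percent power, the share below .01 should be several times the share between .04 and .05, and at 79 percent power most significant results should sit below .02.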

Skip to 2 minutes and 21 seconds
They next ask another question, which is, “What would we expect to see if researchers are data mining?” What will this curve look like? The way they operationalize it here is they say, “Imagine I start with a certain sample, and I keep adding five subjects until I get a significant result. And then I stop. That’s my stopping rule.” So again, we’re in the case of no effect. And this is what the distribution of p-values should look like without manipulation. But they simulate what happens if researchers are allowed to manipulate. If you’re allowed to keep adding five more observations until you get p below .05, then you get a left-skewed p-curve.
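The stopping rule described here is easy to simulate. The sketch below is an illustration under stated assumptions, not the simulation from the paper: it assumes no true effect, a two-sided z-test with known unit variance, a starting sample of 20 per group, batches of 5 per group, and a cap of 60 per group. All of those numbers, and the function names, are hypothetical choices.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_p(sum_a, sum_b, n):
    """Two-sided z-test p-value for two groups of size n with known sd of 1."""
    z = (sum_a / n - sum_b / n) / (2 / n) ** 0.5
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def run_study(rng, start_n=20, step=5, max_n=60, alpha=0.05):
    """Reported p-value of one null study that adds `step` subjects per
    group until p < alpha or the per-group sample reaches max_n."""
    sum_a = sum(rng.gauss(0, 1) for _ in range(start_n))
    sum_b = sum(rng.gauss(0, 1) for _ in range(start_n))
    n = start_n
    p = two_sided_p(sum_a, sum_b, n)
    while p >= alpha and n + step <= max_n:
        sum_a += sum(rng.gauss(0, 1) for _ in range(step))
        sum_b += sum(rng.gauss(0, 1) for _ in range(step))
        n += step
        p = two_sided_p(sum_a, sum_b, n)
    return p


def hacked_p_curve(n_studies=5000, seed=0, n_bins=5, alpha=0.05):
    """Return (share of significant p-values per bin, false-positive rate)."""
    rng = random.Random(seed)
    counts = [0] * n_bins
    for _ in range(n_studies):
        p = run_study(rng)
        if p < alpha:
            counts[min(int(p / alpha * n_bins), n_bins - 1)] += 1
    significant = sum(counts)
    return [c / significant for c in counts], significant / n_studies


if __name__ == "__main__":
    shares, rate = hacked_p_curve()
    print("share per .01-wide bin:", [round(s, 2) for s in shares])
    print("false-positive rate:", round(rate, 3))
```

Under these assumptions, the false-positive rate should come out well above the nominal 5 percent, and the bin between .04 and .05 should hold a larger share than the bin below .01.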

Skip to 3 minutes and 2 seconds
You get lots of observations between .04 and .05 and very few down here. So the intuition here is clear, and again, the write-up in their paper is very nice. They say researchers who are p-hacking have a stopping rule, and ultimately they have limited ambition. They don’t seek to get down to .02 or .01, they seek to get below .05. So there’s gonna be a spike just below .05. This pattern reflects the researcher’s goal of attaining a certain statistical significance level. And the difference couldn’t be any starker. Instead of right-skewed, it’s left-skewed. My favorite part of the paper is when they actually take this to some actual literature. So they go to a leading psychology journal.

Skip to 3 minutes and 49 seconds
And they say, “Is there p-hacking when we look at dozens of studies in this journal?” And they do something very clever, which is they say, “There are studies where we suspect there was a lot of manipulation.” These are experiments where the authors never showed the simple treatment versus control difference. They only showed differences with covariate controls. And the question is, what do the p-curves look like? The blue line is the p-curve. So in these studies, in a psychology journal, where they never show you the raw data and only show you the manipulated result, there are a lot of p’s between .04 and .05, and very, very few out here.

Skip to 4 minutes and 33 seconds
Based on what they just found, this is very troubling. These are studies where, taken together, we shouldn’t have very much confidence in this body of literature. If you take their theory seriously, it looks like those are almost certainly false positives. By contrast, these are the studies where they don’t think there was much indication of data manipulation. And here you have the right skew.


How can we tell when publication bias has led to data mining? In this video, Professor Miguel shows us what the distribution of p-values should look like when (1) there are no observable effects of a treatment, (2) there are observable effects, and (3) data mining is likely to have occurred. He also discusses a 2014 article by Uri Simonsohn, Leif Nelson, and Joseph Simmons (the authors of “False-Positive Psychology”) in the Journal of Experimental Psychology: General that demonstrates widespread data mining in that body of literature.

This video is from the free online course:

Transparent and Open Social Science Research

University of California, Berkeley