Let’s continue our discussion. What should p-values look like? And this is the Simonsohn et al. paper. They’re really just going to focus on the distribution of p-values in psychology, for p-values less than .05. That’s like the bulk of point estimates in their literature. In psychology, you just don’t publish if it’s not less than .05. So we’re not missing much in psychology anyway. And there’s something actually kind of nice about this, because it narrows the focus to these findings. And you can say, well, let’s look at these findings. Let’s look at this body of claimed findings. And what can the distribution of p-values in this body of studies tell us about whether there was data mining or selective publication?
So, these are the claimed effects. Let’s evaluate the claimed effects. Let’s look at p less than .05. There’s a couple ideas and they’re very intuitive. So if there’s no effect, what is the distribution of p-values? There should be a uniform distribution. Okay, that’s intuitive. If there is an effect, the distribution of p-values should be right skewed. So this is .01, .02, to .05, should be right skewed. That’s the idea. In other words there should be more significant results, with significance at .01 or .02 level, than at .04 or .05. Obviously the more power you have, the more likely you are gonna have p-values that are really, really small.
Adequate power in prospective studies, the rule of thumb is 80 percent power. A lot of studies probably have 50 percent or maybe 40 percent power. This is like a woefully underpowered study. Even in a woefully underpowered study, there are more than twice as many p-values between zero and .01 as there are p-values between .04 and .05. As we get up to adequate power, this is 79 percent power. You shouldn’t actually have many p-values between .04 and .05. 79 percent of the time, you’re gonna detect significance, and the bulk of that time you’re gonna be way over here, at less than .02. 83 percent of the significant effects are gonna be at .02 or less.
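The two shapes described so far (flat with no effect, right-skewed with a real effect) can be checked with a small simulation. What follows is a minimal sketch, not the authors' code: it assumes two-group studies with a known variance of 1 analyzed with a z-test, and the function names, sample size, and effect size are all hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_p(z):
    """Two-sided p-value for a z statistic."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def simulate_p_values(effect, n_per_group, n_studies, seed=0):
    """Simulate two-group studies (known unit variance) and return their p-values.

    `effect` is the true difference in means; 0.0 means no effect.
    """
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_studies):
        control = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
        treated = sum(rng.gauss(effect, 1.0) for _ in range(n_per_group)) / n_per_group
        z = (treated - control) / (2.0 / n_per_group) ** 0.5
        p_values.append(two_sided_p(z))
    return p_values


def p_curve_shares(p_values, alpha=0.05, n_bins=5):
    """Share of the significant p-values in each equal-width bin of (0, alpha)."""
    significant = [p for p in p_values if p < alpha]
    if not significant:
        return [0.0] * n_bins
    counts = [0] * n_bins
    for p in significant:
        counts[min(int(p / alpha * n_bins), n_bins - 1)] += 1
    return [count / len(significant) for count in counts]


if __name__ == "__main__":
    # Bins are 0-.01, .01-.02, .02-.03, .03-.04, .04-.05.
    print("no effect:  ", p_curve_shares(simulate_p_values(0.0, 20, 20000)))
    print("real effect:", p_curve_shares(simulate_p_values(0.5, 20, 20000)))
```

With no effect, each bin should hold roughly a fifth of the significant results. With the assumed effect of 0.5 and 20 subjects per group (an underpowered design), the first bin should still be much fuller than the last.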
They next ask another question, which is, “What would we expect to see if researchers are data mining?” Like, what will this curve look like? The way they operationalize it here is they say, “Imagine you start with a certain sample. And I keep adding five subjects until I get a significant result. And then I stop. That’s my stopping rule.” So this again is no effect, we’re in the case of no effect. And this is the real data. This is what the distribution of p-values should look like without manipulation. But if researchers are allowed to manipulate, and they simulate this. If you’re allowed to manipulate and get five more observations until you get p of .05, then you get a left-skewed p-curve.
You get lots of observations between .04 and .05 and very few down here. So, the intuition here is clear, and again, the write up in their paper is very nice. They say, researchers who are p-hacking have a stopping rule. And ultimately they have limited ambition. They don’t seek to get down to .02 or .01, they seek to get below .05. So there’s gonna be a spike at .05. So this pattern reflects the goal of the researcher to sort of attain a certain statistical significance level. And the difference couldn’t be any starker. Instead of right-skewed, it’s left-skewed. My favorite part of the paper is when they actually take this to some actual literature. So they go to a leading psychology journal.
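The stopping rule described above ("keep adding five subjects until the result is significant") can be simulated in a few lines. This is a rough sketch under assumptions that are not from the paper: a true effect of zero, a z-test with known unit variance, a starting sample of 20 per group, and a cap of 40 per group. All names and numbers are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def optional_stopping_p(rng, start_n=20, step=5, max_n=40, alpha=0.05):
    """Run one no-effect study with peeking; return the p-value when the researcher stops.

    The researcher tests at `start_n` subjects per group, then adds `step`
    subjects per group and re-tests, stopping at p < alpha or at `max_n`.
    """
    n = 0
    target = start_n
    sum_control = 0.0
    sum_treated = 0.0
    while True:
        while n < target:
            sum_control += rng.gauss(0.0, 1.0)
            sum_treated += rng.gauss(0.0, 1.0)  # same mean: no true effect
            n += 1
        z = ((sum_treated - sum_control) / n) / (2.0 / n) ** 0.5
        p = 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))
        if p < alpha or n >= max_n:
            return p
        target += step


def hacked_p_curve(n_studies, seed=0, alpha=0.05, n_bins=5):
    """Shares of significant p-values per bin of (0, alpha) under optional stopping."""
    rng = random.Random(seed)
    counts = [0] * n_bins
    for _ in range(n_studies):
        p = optional_stopping_p(rng, alpha=alpha)
        if p < alpha:
            counts[min(int(p / alpha * n_bins), n_bins - 1)] += 1
    total = sum(counts)
    return [count / total for count in counts] if total else [0.0] * n_bins


if __name__ == "__main__":
    # Bins are 0-.01, .01-.02, .02-.03, .03-.04, .04-.05.
    print(hacked_p_curve(20000))
```

Because every extra look only has to push the result just under .05, the bin from .04 to .05 should come out fuller than the bin from 0 to .01, which is the left skew described in the lecture.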
And they say, “Is there p-hacking when we look at dozens of studies in this journal?” And they do something very clever, which is they say, “There’s studies we suspect there was a lot of manipulation on.” There are experiments where the authors never showed the simple treatment versus control difference. They only showed differences with covariate controls. And the question is, what do the p-curves look like? And the blue line is the p-curve. So in these studies, in a psychology journal, where they never show you the raw data, they only show you the sort of manipulated result, there are a lot of p’s between .04 and .05. And very, very few out here.
Based on what they just found, this is very troubling. These are studies where, taken together, we shouldn’t have very much confidence in this body of literature. Just looks like those are almost certainly, if you sort of take their theories seriously, false positives out there. These are the studies where they don’t think there was much indication of data manipulation. And here you have the sort of right-skew.
P-Curve: A tool for detecting publication bias
How can we tell when publication bias has led to data mining? In this video, we’ll show you what the distribution of p-values should look like when (1) there are no observable effects of a treatment, (2) there are observable effects, and (3) data mining is likely to have occurred. We also discuss a 2014 article authored by Uri Simonsohn, Leif Nelson, and Joseph Simmons (the same authors as “False-Positive Psychology”) in the Journal of Experimental Psychology: General that demonstrated widespread data mining in that body of literature.
In this article, authors Joseph Simmons, Leif Nelson, and Uri Simonsohn propose a way to distinguish between truly significant findings and false positives resulting from selective reporting and specification searching, or p-hacking.
P values indicate “how likely one is to observe an outcome at least as extreme as the one observed if the studied effect were nonexistent.” As a reminder, most academic journals will only publish studies with p values less than 0.05, the most common threshold for statistical significance.
Some researchers use p-hacking to “find statistically significant support for nonexistent effects,” allowing them to “get most studies to reveal significant relationships between truly unrelated variables.”
The p-curve can be used to detect p-hacking. The authors define this curve as “the distribution of statistically significant p values for a set of independent findings. Its shape is a diagnostic of the evidential value of that set of findings.”
In order for p-curve inferences to be credible, the p values selected must be:
- associated with the hypothesis of interest,
- statistically independent from other selected p values, and
- distributed uniformly under the null.
It is also important to clarify that the p-curve assesses only reported data and not the theories that they are testing. Similarly, it’s important to keep in mind that if a set of values is found to have evidential value, it doesn’t automatically imply internal or external validity.
Using the p-curve to detect p-hacking is fairly straightforward. If the curve is right-skewed, as in the chart to the right in the figure below, there are more low (0.01s) than high (0.04s) significant p values, suggesting that the findings have evidential value. When non-existent effects are studied (i.e., a study’s null hypothesis is true), all p values are equally likely to be observed, thus producing a uniform curve or a straight line. In the figure below, each chart incorporates a uniform curve that is dotted and red for comparison. Curves that are left-skewed, however, as in the chart to the left in the figure below, indicate more high p values than low ones, which suggests that p-hacking has likely occurred.
“P-curves for Journal of Personality and Social Psychology”
The above figure displays the results of the authors’ demonstration of the p-curve through the analysis of two sets of findings taken from the Journal of Personality and Social Psychology (JPSP). They hypothesized that one set was p-hacked, while the other was not. In the set in which they suspected p-hacking, they noted that the authors of the publications reported results only with a covariate. While there is nothing wrong with including covariates in a study’s design, many researchers will include one only after their initial analysis (without the covariate) was found to be non-significant.
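One way to turn "is this set of p values right-skewed or left-skewed?" into a number is to compare the significant p values against the uniform curve with Fisher's method. The sketch below is a simplified illustration, not a replacement for the authors' app: each significant p value is rescaled to p/0.05 (so it is uniform on 0 to 1 if there is no effect), and the rescaled values are combined into a chi-square statistic. The function names and example p values are hypothetical.

```python
import math


def chi2_sf_even_df(x, k):
    """P(X > x) for a chi-square variable with 2*k degrees of freedom (closed form)."""
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def skew_test(p_values, alpha=0.05, direction="right"):
    """Fisher-style test of skew among significant p-values.

    Returns the probability of a curve at least this skewed in the given
    direction if the significant p-values were uniform on (0, alpha).
    """
    significant = [p for p in p_values if 0.0 < p < alpha]
    if not significant:
        raise ValueError("no significant p-values to test")
    rescaled = [p / alpha for p in significant]
    if direction == "left":
        rescaled = [1.0 - value for value in rescaled]
    statistic = -2.0 * sum(math.log(value) for value in rescaled)
    return chi2_sf_even_df(statistic, len(rescaled))


if __name__ == "__main__":
    mostly_small = [0.001, 0.002, 0.004, 0.01, 0.012, 0.03]   # made-up values
    mostly_large = [0.049, 0.048, 0.046, 0.045, 0.04, 0.047]  # made-up values
    print("right-skew test:", skew_test(mostly_small, direction="right"))
    print("left-skew test: ", skew_test(mostly_large, direction="left"))
```

A small result from the right-skew test is consistent with evidential value; a small result from the left-skew test is consistent with p-hacking. With only a handful of studies, either test has little power, so a non-significant result says little on its own.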
Simmons, Nelson, and Simonsohn provide guidelines to follow when selecting studies to analyze with the p-curve:
- Create a selection rule. Authors should decide in advance which studies to use.
- Disclose the selection rule.
- Maintain robustness to resolutions of ambiguity. If it is unclear whether or not a study should be included, authors should report results both with and without that study. This allows readers to see the extent of the influence of these ambiguous cases.
- Replicate single-article p-curves. Because of the risk of cherry-picking single articles, the authors suggest a direct replication of at least one of the studies in the article to improve the credibility of the p-curve.
In addition to these guidelines, Simmons, Nelson, and Simonsohn also provide five steps to ensure that the selected p-values meet the three selection criteria we mentioned earlier:
- Identify researchers’ stated hypothesis and study design.
- Identify the statistical result testing the stated hypothesis.
- Report the statistical results of interest.
- Recompute precise p-values based on reported test statistics. – This has been made easy through an online app, which you can find at http://p-curve.com/.
- Report robustness results of the p-values to your selection rules.
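To see what "recompute precise p-values based on reported test statistics" means in practice, here is a minimal sketch for two simple cases: a reported z statistic and a reported chi-square statistic with one degree of freedom. The function names are hypothetical, and t and F statistics are left out because they need a statistics library beyond the Python standard library.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_from_z(z):
    """Two-sided p-value for a reported z statistic."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def p_from_chi2_1df(chi2):
    """p-value for a reported chi-square statistic with 1 degree of freedom.

    A chi-square variable with 1 degree of freedom is a squared standard
    normal, so its p-value equals the two-sided p-value of z = sqrt(chi2).
    """
    if chi2 < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return p_from_z(math.sqrt(chi2))


if __name__ == "__main__":
    # A paper might report only "z = 2.10, p < .05"; the precise value is what matters here.
    print(round(p_from_z(2.10), 4))
    print(round(p_from_chi2_1df(4.41), 4))  # same test expressed as chi-square(1)
```

Recomputing matters because papers often report only "p < .05", which is too coarse to place a result in the right part of the curve.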
As with anything, the p-curve is not one hundred percent accurate one hundred percent of the time. The validity of the judgments made from a p-curve may depend on “the number of studies being p-curved, their statistical power, and the intensity of p-hacking.” There isn’t much concern over cherry-picking studies for a p-curve in order to show a lack of evidential value. However, such a practice can be prevented simply with the disclosure of selections, ambiguity, sample size, and other study details.
Additionally, there are a few limitations with the p-curve. First, it “does not yet technically apply to studies analyzed using discrete test statistics” and is “less likely to conclude data have evidential value when a covariate correlates with the independent variable of interest.” It also has a hard time detecting confounding variables; if there is a real effect, but also mild p-hacking, it usually won’t detect the latter.
Simmons, Nelson, and Simonsohn conclude that, with the examination of a distribution of p-values, one will be able to identify whether selective reporting was used or not. What do you think about the p-curve? Would you use this tool?
We’ll use http://p-curve.com/ in an Exercise later in this Activity.
You can read the entire paper here.
Simonsohn, Uri, Leif D. Nelson, and Joseph P. Simmons. 2014. “P-Curve: A Key to the File-Drawer.” Journal of Experimental Psychology: General 143 (2): 534–47. doi:10.1037/a0033242.
© Center for Effective Global Action