[0:00] Let’s continue our discussion. What should p-values look like? And this is the Simonsohn et al. paper. They’re really just going to focus on the distribution of p-values in psychology. For p-values less than .05. That’s like the bulk of point estimates in their literature. In psychology, you just don’t publish if it’s not less than .05. So we’re not missing much in psychology anyway. And there’s something actually kind of nice about this, because it narrows the focus to these findings. And you can say, well, let’s look at these findings. Let’s look at this body of claimed findings. And what can the distribution of p-values in this body of studies tell us about whether there was data mining or selective publication?
[0:41] So, these are the claimed effects. Let’s evaluate the claimed effects. Let’s look at p less than .05. There’s a couple ideas and they’re very intuitive. So if there’s no effect, what is the distribution of p-values? There should be a uniform distribution. Okay, that’s intuitive. If there is an effect, the distribution of p-values should be right skewed. So this is .01, .02, to .05, should be right skewed. That’s the idea. In other words there should be more significant results, with significance at .01 or .02 level, than at .04 or .05. Obviously the more power you have, the more likely you are gonna have p-values that are really, really small.
[1:29] Adequate power in prospective studies, the rule of thumb is 80 percent power. A lot of studies probably have 50 percent or maybe 40 percent power. This is like a woefully underpowered study. Even in a woefully underpowered study, there are more than twice as many p-values between zero and .01 as there are p-values between .04 and .05. As we get up to adequate power, this is 79 percent power. You shouldn’t actually have many p-values between .04 and .05. 79 percent of the time, you’re gonna detect significance, and the bulk of that time you’re gonna be way over here, at less than .02. 83 percent of the significant effects are gonna be at .02 or less.
[2:21] They next ask another question, which is, “What would we expect to see if researchers are data mining?” Like, what will this curve look like? The way they operationalize it here is they say, “Imagine you start with a certain sample. And I keep adding five subjects until I get a significant result. And then I stop. That’s my stopping rule.” So this again is no effect, we’re in the case of no effect. And this is the real data. This is what the distribution of p-values should look like without manipulation. But if researchers are allowed to manipulate, and they simulate this. If you’re allowed to manipulate and get five more observations until you get p of .05, then you get a left-skewed p-curve.
[3:02] You get lots of observations between .04 and .05 and very few down here. So, the intuition here is clear, and again, the write-up in their paper is very nice. They say, researchers who are p-hacking have a stopping rule. And ultimately they have limited ambition. They don’t seek to get down to .02 or .01, they seek to get below .05. So there’s gonna be a spike at .05. So this pattern reflects the goal of the researcher to sort of attain a certain statistical significance level. And the difference couldn’t be any starker. Instead of right-skewed, it’s left-skewed. My favorite part of the paper is when they actually take this to some actual literature. So they go to a leading psychology journal.
[3:49] And they say, “Is there p-hacking when we look at dozens of studies in this journal?” And they do something very clever, which is they say, “There’s studies we suspect there was a lot of manipulation on.” There are experiments where the authors never showed the simple treatment versus control difference. They only showed differences with covariate controls. And the question is, what do the p-curves look like? And the blue line is the p-curve. So in these studies, in a psychology journal, where they never show you the raw data, they only show you the sort of manipulated result, there are a lot of p’s between .04 and .05. And very, very few out here.
[4:33] Based on what they just found, this is very troubling. These are studies where, taken together, we shouldn’t have very much confidence in this body of literature. It just looks like those are almost certainly, if you sort of take their theories seriously, false positives out there. These are the studies where they don’t think there was much indication of data manipulation. And here you have the sort of right skew.
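The stopping rule described in the video can be sketched as a small simulation. This is only an illustration under stated assumptions, not the authors' own simulation code: the function name `hacked_p_curve`, the starting sample of 20 per group, the cap of 50 per group, and the use of a z-test with a known standard deviation of 1 are all choices made here for simplicity.

```python
import math
import random


def hacked_p_curve(start_n=20, step=5, max_n=50, n_sims=5000, seed=1):
    """Simulate optional stopping when the true effect is zero.

    Each simulated study compares two groups drawn from the same
    distribution, tests after start_n subjects per group, and keeps
    adding `step` subjects per group until p < .05 or max_n is reached.
    Returns the share of significant p-values in five bins:
    [0, .01), [.01, .02), [.02, .03), [.03, .04), [.04, .05).
    """
    rng = random.Random(seed)
    counts = [0] * 5
    for _ in range(n_sims):
        sum_a = sum_b = 0.0
        n = 0
        target = start_n
        while target <= max_n:
            # Collect subjects until each group reaches the current target.
            while n < target:
                sum_a += rng.gauss(0.0, 1.0)
                sum_b += rng.gauss(0.0, 1.0)
                n += 1
            # z statistic for a difference in means with known sd = 1.
            z = (sum_a - sum_b) / math.sqrt(2 * n)
            p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
            if p < 0.05:
                counts[min(int(p / 0.01), 4)] += 1
                break  # the researcher stops as soon as p < .05
            target += step
    total = sum(counts)
    return [c / total for c in counts]


if __name__ == "__main__":
    for low, share in zip((0.00, 0.01, 0.02, 0.03, 0.04), hacked_p_curve()):
        print(f"p in [{low:.2f}, {low + 0.01:.2f}): {share:.1%}")
```

Under these assumptions the highest bin, just below .05, should hold a larger share of the significant results than the lowest bin, which is the left skew described in the video.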
P-Curve: A tool for detecting publication bias
How can we tell when publication bias has led to data mining? In this video, we’ll show you what the distribution of p-values should look like when (1) there are no observable effects of a treatment, (2) there are observable effects, and (3) data mining is likely to have occurred. We also discuss a 2014 article authored by Uri Simonsohn, Leif Nelson, and Joseph Simmons (the same authors of “False-Positive Psychology”) in the Journal of Experimental Psychology that demonstrated widespread data mining in that body of literature.
In this article, authors Joseph Simmons, Leif Nelson, and Uri Simonsohn propose a way to distinguish between truly significant findings and false positives resulting from selective reporting and specification searching, or p-hacking.
P values indicate “how likely one is to observe an outcome at least as extreme as the one observed if the studied effect were nonexistent.” As a reminder, most academic journals will only publish studies with p values less than 0.05, the most common threshold for statistical significance.
Some researchers use p-hacking to “find statistically significant support for nonexistent effects,” allowing them to “get most studies to reveal significant relationships between truly unrelated variables.”
The p-curve can be used to detect p-hacking. The authors define this curve as “the distribution of statistically significant p values for a set of independent findings. Its shape is a diagnostic of the evidential value of that set of findings.”
In order for p-curve inferences to be credible, the p values selected must be:
 associated with the hypothesis of interest,
 statistically independent from other selected p values, and
 distributed uniformly under the null.
It is also important to clarify that the p-curve assesses only reported data and not the theories that they are testing. Similarly, it’s important to keep in mind that if a set of values is found to have evidential value, it doesn’t automatically imply internal or external validity.
Using the p-curve to detect p-hacking is fairly straightforward. If the curve is right-skewed as in the chart to the right in the figure below, there are more low (0.01s) than high (0.04s) significant p values, suggesting truly significant p values. When nonexistent effects are studied (i.e., a study’s null hypothesis is true), all p values are equally likely to be observed, thus producing a uniform curve or a straight line. In the figure below, each chart incorporates a uniform curve that is dotted and red for comparison. Curves that are left-skewed however, as in the chart to the left in the figure below, indicate more high p values than low ones; p-hacking has likely occurred.
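The uniform and right-skewed shapes described above can be reproduced with a short simulation. The sketch below is illustrative only and rests on assumptions that are not from the paper: the function name `p_curve` is hypothetical, each study is a two-group comparison with a known standard deviation of 1 (so a z-test applies), and the test statistic is drawn directly from its sampling distribution instead of simulating individual subjects.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def p_curve(effect_size, n_per_group, n_studies=50000, seed=0):
    """Share of significant p-values in five bins of width .01.

    effect_size is the true standardized mean difference; 0 means the
    null hypothesis is true. Bins are [0, .01), ..., [.04, .05).
    """
    rng = random.Random(seed)
    # Expected value of the z statistic for a two-group comparison.
    shift = effect_size * math.sqrt(n_per_group / 2)
    counts = [0] * 5
    for _ in range(n_studies):
        p = two_sided_p(rng.gauss(shift, 1.0))
        if p < 0.05:  # the p-curve uses significant results only
            counts[min(int(p / 0.01), 4)] += 1
    total = sum(counts)
    return [c / total for c in counts]


if __name__ == "__main__":
    print("no effect:  ", [round(s, 2) for s in p_curve(0.0, 20)])
    print("true effect:", [round(s, 2) for s in p_curve(0.5, 50)])
```

With no effect, each bin should hold roughly a fifth of the significant results. With a true effect, most of the significant p values should fall in the lowest bin, and the concentration grows as the sample size (and therefore power) increases.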
“P-curves for Journal of Personality and Social Psychology”
The above figure displays the results of the authors’ demonstration of the p-curve through the analysis of two sets of findings taken from the Journal of Personality and Social Psychology (JPSP). They hypothesized that one set was p-hacked, while the other was not. In the set in which they suspected p-hacking, they realized that the authors of the publication reported results only with a covariate. While there is nothing wrong with including covariates in a study’s design, many researchers will include one only after their initial analysis (without the covariate) was found to be nonsignificant.
Simmons, Nelson, and Simonsohn provide guidelines to follow when selecting studies to analyze with the p-curve:

Create a selection rule. Authors should decide in advance which studies to use.

Disclose the selection rule.

Maintain robustness to resolutions of ambiguity. If it is unclear whether or not a study should be included, authors should report results both with and without that study. This allows readers to see the extent of the influence of these ambiguous cases.

Replicate single-article p-curves. Because of the risk of cherry-picking single articles, the authors suggest a direct replication of at least one of the studies in the article to improve the credibility of the p-curve.
In addition to these guidelines, Simmons, Nelson, and Simonsohn also provide five steps to ensure that the selected p-values meet the three selection criteria we mentioned earlier:
 Identify researchers’ stated hypothesis and study design.
 Identify the statistical result testing the stated hypothesis.
 Report the statistical results of interest.
 Recompute precise p-values based on reported test statistics. This has been made easy through an online app, which you can find at http://pcurve.com/.
 Report robustness results of the p-values to your selection rules.
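To make the recomputation step concrete, here is a minimal sketch of turning a reported test statistic back into a precise p-value. It is not the code behind the online app; the function names are hypothetical, and it covers only two cases that can be handled exactly with Python's standard library: a z statistic and a chi-square statistic with one degree of freedom. Statistics such as t or F need their own distribution functions, for example from a statistics library.

```python
import math


def p_from_z(z):
    """Two-sided p-value for a reported z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def p_from_chi2_1df(chi2):
    """p-value for a reported chi-square statistic with 1 degree of freedom.

    A chi-square variable with 1 df is a squared standard normal, so
    P(X > chi2) equals the two-sided normal p-value at sqrt(chi2).
    """
    return math.erfc(math.sqrt(chi2 / 2))


if __name__ == "__main__":
    # A paper reporting only "p < .05" alongside z = 2.10 can be
    # recomputed to a precise value.
    print(f"z = 2.10        -> p = {p_from_z(2.10):.4f}")
    print(f"chi2(1) = 6.25  -> p = {p_from_chi2_1df(6.25):.4f}")
```

Recomputing matters because papers often report only an inequality such as "p < .05", while the p-curve needs the exact value to place each result in the distribution.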
As with anything, the p-curve is not one hundred percent accurate one hundred percent of the time. The validity of the judgments made from a p-curve may depend on “the number of studies being p-curved, their statistical power, and the intensity of p-hacking.” There isn’t much concern over cherry-picking p-curves to ensure the result of a lack of evidential value. However, such a practice can be prevented simply with the disclosure of selections, ambiguity, sample size, and other study details.
Additionally, there are a few limitations with the p-curve. First, it “does not yet technically apply to studies analyzed using discrete test statistics” and is “less likely to conclude data have evidential value when a covariate correlates with the independent variable of interest.” It also has a hard time detecting confounding variables; if there is a real effect, but also mild p-hacking, it usually won’t detect the latter.
Simmons, Nelson, and Simonsohn conclude that, with the examination of a distribution of p-values, one will be able to identify whether selective reporting was used or not. What do you think about the p-curve? Would you use this tool?
We’ll use http://pcurve.com/ in an Exercise later in this Activity.
You can read the entire paper here.
Reference
Simonsohn, Uri, Leif D. Nelson, and Joseph P. Simmons. 2014. “P-Curve: A Key to the File-Drawer.” Journal of Experimental Psychology: General 143 (2): 534–47. doi:10.1037/a0033242.
© Center for Effective Global Action