This content is taken from the University of California, Berkeley, Center for Effective Global Action (CEGA) & Berkeley Initiative for Transparency in the Social Sciences (BITSS)'s online course, Transparent and Open Social Science Research.

Skip to 0 minutes and 0 seconds Gerber and Malhotra are a useful starting point. They went out and got data from a couple of leading political science journals over a long period of time. They looked at the leading statistical tests in those papers, and they did something really simple: they asked, “What does the distribution of p-values look like?” Regardless of what assumptions you make, there’s no way it should look the way these guys are going to show it looks. There are some really weird patterns, and this is something that comes out in literature after literature. And they’re really going to focus in on this critical value of .05, 95 percent significance, just because there’s so much–

Skip to 0 minutes and 45 seconds what’s the right word – it’s like fetishistic, this obsession with 95 percent significance. There’s a feeling a lot of researchers have that if they don’t have a result that is significant with p less than .05, it’s unpublishable, it’s irrelevant, no one will be interested in it. So they’re going to focus on this. There are a couple of reasons why you would have bias and only certain results would get published. One is editors and referees. Editors and referees may just have this notion that only results that are significant at 95 percent are worth publishing.

Skip to 1 minute and 18 seconds Given that, authors may only submit results that have a certain pattern of significance or have p-values less than .05.

Skip to 1 minute and 28 seconds So maybe there’s no manipulation going on, but there’s the file drawer problem – the so-called cross-study bias problem. Meaning, I have a whole bunch of results, and I only send off the ones that are significant. The others stay in my file drawer. Then there’s the possibility of manipulation within a given dataset. So actually my result isn’t significant, there really is no effect, and I manipulate the data to get a p-value just less than .05. This is a real quote. You guys should just take a minute and read it. This is a quote that Jeremy Weinstein, a professor of political science at Stanford, presented a couple of years ago at a conference.

Skip to 2 minutes and 6 seconds Maybe you guys can just take a second and read this quote. This is from a real referee report. He won’t say who got it – whether it was him or somebody else – but he swears this is a real letter from a journal.

Skip to 2 minutes and 25 seconds So basically what the editor is saying is: gosh, this is a pretty important question, that’s good.

Skip to 2 minutes and 35 seconds They get at a causal impact, that’s great. It’s a pretty important question and there’s some causal evidence on it. But the lack of results–

Skip to 2 minutes and 48 seconds that’s what this means – the lack of results, meaning not everything is significant as we had hoped, really weakens it. What you really need to do is generate new results by looking at some subgroups, looking for some heterogeneity, so I can actually say there’s a result in this damn paper and publish it in – this was either AJPS or APSR or something like that. So this was a couple of years ago. This is normal. This stuff happens. And it’s sort of sad. You might think papers would get judged based on the quality of the question, the quality of the design, the quality of the data, the importance of the finding.

Skip to 3 minutes and 27 seconds And if you find there’s no effect of something that theory says should have an effect, you might even think that’s more interesting. Holy cow, like we’re actually learning something here, we’re not just confirming our priors. Anyhow, this is a concern, this is now, this is sort of what we’re up against. So let’s just turn to the political science literature. This is a histogram of z-statistics for the point estimates in these papers published over 13 years in American Political Science Review and American Journal of Political Science. No one had done this before in political science. There had been a number of papers looking at these kinds of distributions of p-values in other fields, including econ and medical research.

Skip to 4 minutes and 10 seconds Apparently it was quite novel in political science.

Skip to 4 minutes and 15 seconds And they asked the question: is there smoothness – if you look at the distribution of p-values, are they smooth around .05? Any naturally occurring distribution of p-values would be smooth around .05. And they reject smoothness at the 1 in 32 billion level. So what does that mean? These are z-statistics, so the key point is going to be here at 1.96 or 2.0, right?

Skip to 4 minutes and 44 seconds So these are the more significant results over here on the right; less significant on the left. The results with z-statistics less than 1.96 are not significant at 95 percent. These are significant at 95 percent. And there’s this incredible jump at that point.

Skip to 5 minutes and 3 seconds There are three times as many studies just above the cut-off as just below it.

Skip to 5 minutes and 8 seconds There’s another paper by the same two authors that came out the same year. They basically did this for political science and sociology. They got hundreds of articles from the leading journals over a decade and plotted this out, and they find the same thing. So now instead of 3 to 1, it’s 2 to 1 in the sociology journals. Here, again, the jump seems visually particularly stark,

Skip to 5 minutes and 30 seconds where you have this incredible jump. So for some reason papers get published with p-values of .049 but not .051, and that’s some combination of those three factors we talked about– editors, file drawer problem, not sending things out, and data mining.

Do statistical reporting standards affect what is published?

Publication bias can also explain why false positives are so common in peer-reviewed journals. There is a widespread perception among journal editors and reviewers, as well as authors, that only results with at least 95% significance – or p-values less than 0.05 – are worth publishing. In 2008, political scientist Alan Gerber and political economist Neil Malhotra reviewed the distribution of reported significance levels just above and below the 95% threshold in articles published in two leading political science journals – American Political Science Review (APSR) and American Journal of Political Science (AJPS).

As you watch this video and learn about their findings, ask yourself: what does such a bias mean for studies that produce statistically non-significant results, but may nonetheless ask important questions and use rigorous methods?

This article assesses two prestigious journals for publication bias caused by a “reliance on the 0.05 significance level.” Authors Alan Gerber and Neil Malhotra define publication bias as “the outcome that occurs when, for whatever reason, publication practices lead to bias in the published parameter estimates.”

The authors list four ways in which bias can occur:

  1. Editors and reviewers may prefer significant results and reject methodologically sound articles that do not achieve statistical significance thresholds.

  2. Scholars may only submit studies with statistically significant results to journals and place the rest in “file drawers.”

  3. Investigators may adjust sample sizes after observing that results narrowly fail tests of significance.

  4. Researchers may engage in data mining to find model specifications and sub-samples that achieve significance thresholds. Or they may continuously collect data until statistical significance surpasses the 0.05 threshold. This, along with the third item, refers to a practice known as “p-hacking.” We’ll get more into this in the next activity.

To detect publication bias, Gerber and Malhotra conducted a “caliper test” in two leading political science journals, comparing the number of published results with test statistics just above and just below the critical cut-off. Because sampling distributions should reflect continuous probability distributions, the number of values falling just above and just below an arbitrary cut-off should be roughly the same.
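In code, a caliper test boils down to a sign test: if the underlying distribution is smooth, a result landing in a narrow band around the cut-off is equally likely to fall on either side, so the observed split can be checked against a Binomial(n, 1/2). A standard-library sketch (function name and counts are illustrative, not the authors' code):

```python
import math

def caliper_test(n_above, n_below):
    """One-sided exact binomial test. Under a smooth distribution of test
    statistics, a result in a narrow caliper around the cut-off is equally
    likely to land on either side, so n_above ~ Binomial(n, 0.5).
    Returns P(X >= n_above): the chance of an excess at least this large."""
    n = n_above + n_below
    return sum(math.comb(n, k) for k in range(n_above, n + 1)) / 2 ** n

# Illustrative counts echoing the roughly 3-to-1 imbalance described above:
print(caliper_test(90, 30))   # tiny: smoothness is decisively rejected
print(caliper_test(52, 48))   # unremarkable: consistent with smoothness
```

The narrower the caliper, the cleaner the equal-probability assumption, at the cost of fewer observations in the band.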

Their results showed a dramatic spike in published results when critical values were just above the threshold. They concluded that many of the findings in these journals could be false due to bias.

Gerber and Malhotra discuss the consequences of publication bias:

“First, publication bias may result in a significant understatement of the chance of a Type I error, which lends false confidence and may misdirect subsequent research. Second, anticipation of journal practices may distort how studies are conducted, encouraging data mining, specification searches, and post hoc sample size adjustments. Third, and perhaps most important, holding work to the arbitrary standard of p < 0.05 may discourage the pursuit and publication of work that is well designed and on important topics but unlikely to produce precisely measured estimates.”

There is value in well-designed, robust, innovative studies, even if the power of the study is weak. Gerber and Malhotra propose that, along with greater attention to research design, study registries should be implemented to reduce publication bias. We’ll learn more about these registries next week.

You can read the full article here.


Gerber, Alan, and Neil Malhotra. 2008. “Do Statistical Reporting Standards Affect What Is Published? Publication Bias in Two Leading Political Science Journals.” Quarterly Journal of Political Science 3 (3): 313–26. doi:10.1561/100.00008024.
