This content is taken from the University of California, Berkeley, Center for Effective Global Action (CEGA) & Berkeley Initiative for Transparency in the Social Sciences (BITSS)'s online course, Transparent and Open Social Science Research. Join the course to learn more.

Skip to 0 minutes and 0 seconds If you talk to a lot of the leading social psychologists, over the last decade or two there has been a growing perception that a lot of their classic results were just not real. There was a crisis of confidence. In the field there were all those fraud cases we talked about early in the term. But there was this general feeling that it really wasn’t clear which empirical results in this field were credible.

Skip to 0 minutes and 23 seconds So how can we assess robustness? How can we make progress in this field? And try to figure out what to believe and what not to believe in this body of research? A couple of years ago, the Center for Open Science and Project Implicit posted a public call for collaboration. And they said, “Well, let’s try to get as many labs as we can to do a replication of these 13 classic and prominent findings in our field.” It’s a crowdsourced model. So instead of there being me, one lowly scholar, or one lonesome scholar, taking on a field, we’re going to have the power of numbers and the anonymity of numbers.

Skip to 0 minutes and 57 seconds To sort of be part of a process to more systematically gauge robustness. How do they do it? They decided on these 13 kind of important experiments and they created a standardized, computerized interface to test these 13 experiments in one session. So, they streamlined them down to their essence. They were experiments that you could do with two conditions. You know, a treatment and a control. So they were kind of simple experiments, but classic experiments and pretty low cost for the labs. It was like, “Download this thing. We have the whole experimental procedure. Put it on your computer network and recruit subjects.” So all you had to do was recruit subjects. There wasn’t that much additional work.

Skip to 1 minute and 40 seconds And this is actually a listing of the labs. There’s thirty-six labs. There are some in Brazil and the Czech Republic. A bunch in the US and Turkey, Malaysia. A lot in North America. Some in Europe. And the plan all along was to try to publish it together. All of us were going to get together and publish it as a team in a leading journal. Which is what they did. So this is a pretty interesting model. We’re going to have the anonymity of crowds, so we don’t have to worry about retaliation. We’re going to get definitive answers because we are going to all do the analysis, in this case, with different subjects.

Skip to 2 minutes and 14 seconds So by pooling the labs, you get tons of data. And then because it’s so unusual to have these definite tests of so many famous hypotheses, it’s going to fly in a top journal. That’s the idea. So a lot of people wanted to join in because then you get to put on your CV, “Hey, I got a publication in a top journal.” So the incentives now are aligned around publication, not around getting 5K or whatever. It’s like, this is what people want. Researchers want publications.

Skip to 2 minutes and 42 seconds Now before getting into the kind of summary results, I want to talk about two specific tests. The first one is this Gain versus Loss Framing by our most famous psychology and economics pioneers, Tversky and Kahneman. This is really getting into how people view risks. There is an epidemic. Do you frame the intervention as saving 200 out of 600 lives, or having 400 people die out of 600? So if you frame it one way – it’s exactly the same. There are 600 people. Two hundred will live. Four hundred will die. Does that framing mean anything? Now, from the point of view of economic theory or some sort of optimal social planning exercise, these are totally equivalent. There should be no difference.

Skip to 3 minutes and 25 seconds This is really a framing effect. They are going to test whether the framing affects your choice.

Skip to 3 minutes and 35 seconds The second one is flag-priming. People are in the lab and you ask them to do some funny task. You ask them to sort of like estimate the time of the day in a photo. But in some of the photos – in the background, subtly, there is an American flag. And if they see the American flag in the background subtly, people become more conservative. They’re like, “Yeah, you know, I’m not really for gun control. I’m against it because I saw an American flag” kind of thing. This was a very famous paper. It was well cited. It was published recently and it’s about priming. I mean, I’ve done a little bit of work on priming in the lab. It’s very controversial.

Skip to 4 minutes and 18 seconds Like, what does priming do? Just subtly mentioning – There’s tons of work on this. If you subtly mention something to somebody, does it affect how they answer? How they view the world? And there are tons of people who say it does. So this is, you know, a kind of mainstream priming paper where they want to figure out if this holds. They only did this experiment in the American labs because if you show the American flag in Malaysia, why it would or wouldn’t affect views on affirmative action in Malaysia is unclear. There is actually a lot of affirmative action in Malaysia towards ethnic Malays and it is a really big issue.

Skip to 4 minutes and 54 seconds But it has nothing to do with the American flag, I don’t think.

Skip to 4 minutes and 59 seconds How much do these results hold up? It really depends on how you classify “holding up.” So we’re going to have alternate definitions. If you consider “replicate,” which is sort of how psychologists do it, as “Yes, I rejected the null and the effect is in the same direction as the original published paper,” then basically 10 to 11, there is one that is right on the fence, but 10 to 11 of the results replicate. Meaning, if the original paper said there was a positive effect, pooling all this data, you get a positive effect. Ask yourself what replication means if the original effect size was a one standard deviation effect.

Skip to 5 minutes and 36 seconds But here, when you pool thousands of observations, you get a 0.1 of a standard deviation effect. Does that replicate that result or not? But, the most generous definition is 10 to 11 out of the 13 studies broadly replicate. In a more narrow definition, fewer replicate.
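The question above can be made concrete with a little arithmetic. The following is a minimal sketch, not the analysis used in the replication project: it applies the standard large-sample approximation for the standard error of a standardized mean difference (Cohen's d) with two equal-sized groups. The function name `z_test_for_d` and both sample sizes (20 and 3,000 per group) are hypothetical, chosen only to contrast a small original study with a large pooled replication.

```python
import math
from statistics import NormalDist


def z_test_for_d(d, n_per_group):
    """Approximate two-sided z-test for a standardized mean difference d.

    Assumes two equal-sized conditions (treatment and control) with
    n_per_group subjects each, and uses the usual large-sample standard
    error of Cohen's d. Returns (standard error, z statistic, p-value).
    """
    se = math.sqrt(2.0 / n_per_group + d * d / (4.0 * n_per_group))
    z = d / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return se, z, p


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    scenarios = [
        ("small original study", 1.0, 20),
        ("large pooled replication", 0.1, 3000),
    ]
    for label, d, n in scenarios:
        se, z, p = z_test_for_d(d, n)
        print(f"{label}: d={d}, n per group={n}, se={se:.3f}, z={z:.2f}, p={p:.4f}")
```

Under these assumed numbers both estimates are positive and statistically distinguishable from zero, yet one is ten times the size of the other. That gap is why "same direction and significant" is the most generous definition of replication.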

Skip to 5 minutes and 58 seconds So, how does this look? I love this figure. This is an awesome figure. These are the 13 different studies where they mention the original paper there in the citation. And this is Gain versus Loss Framing and Flag Priming, our two favorite cases that we’re going to talk about. What do all these awesome symbols mean? An x means this is the original effect size. So this top one is pretty interesting. This is one where the original effect was sort of moderate – a little less than a one standard deviation effect. But now, the result from pooling all the data is that circle with those two whiskers around it.

Skip to 6 minutes and 33 seconds And you can see the whiskers are really tight because we’ve got 6,000 observations. So we’re estimating an effect really precisely here and you can see all those circles. So you want to compare the circles to the x’s and then the little dots are the individual study samples. Where the shaded ones are the US labs and the white ones are the international, the non-US. Actually, for some of these famous anchoring studies, they found an effect originally, but when you replicate it, you get even stronger effects. So that’s kind of interesting. So those are results where, they don’t replicate. The results are even stronger. I mean, like you can reject the original point estimate because the results are even stronger.
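To illustrate why tight whiskers make it possible to reject the original point estimate, here is a minimal sketch that builds a normal-approximation confidence interval around a pooled standardized effect and checks whether an original estimate falls inside it. The function names and all numbers (an original effect of 0.9, a pooled effect of 1.2, 3,000 subjects per condition) are hypothetical and are not taken from the paper.

```python
import math
from statistics import NormalDist


def confidence_interval_for_d(d, n_per_group, level=0.95):
    """Normal-approximation confidence interval for Cohen's d,
    assuming two equal-sized groups of n_per_group subjects."""
    se = math.sqrt(2.0 / n_per_group + d * d / (4.0 * n_per_group))
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return d - z_crit * se, d + z_crit * se


def original_inside_interval(original_d, pooled_d, n_per_group, level=0.95):
    """True if the original point estimate lies inside the pooled interval."""
    low, high = confidence_interval_for_d(pooled_d, n_per_group, level)
    return low <= original_d <= high


if __name__ == "__main__":
    # Hypothetical values: the pooled effect is larger than the original one.
    low, high = confidence_interval_for_d(1.2, 3000, level=0.99)
    print(f"99% interval for pooled effect: ({low:.3f}, {high:.3f})")
    print("original estimate inside interval:",
          original_inside_interval(0.9, 1.2, 3000, level=0.99))
```

With thousands of observations the interval is narrow, so an original estimate can be rejected even when the replication effect is in the same direction and larger.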

Skip to 7 minutes and 14 seconds For some of them, the x’s and the big circles are closer. And then some of them, towards the bottom here, the original effect was also around one standard deviation. But these are the three that basically don’t replicate at the bottom. And, again, there is one that is sort of right on the margin. So there is some heterogeneity. Now let’s zoom in on Gain versus Loss Framing and the Flag Priming. The original x here, which you can’t really see that well, is over here. So this was about a one standard deviation effect. So for Gain versus Loss Framing, here is a significant effect. If you frame things as Gain versus Loss, you get different choices.

Skip to 7 minutes and 49 seconds The magnitude of the effect is about half as large as the original estimate. So, it’s a question of interpretation. It’s a much smaller effect than the original paper, but there is an effect going in the same direction. So, that’s pretty useful. The other thing you see is there is quite a bit of variation across labs in the finding. And we’ll talk about why. Is it the samples they’re using? Is it something else? Flag Priming is just a dead on zero. The original result was here, about half a standard deviation in normalized units was the effect. But that’s just like a zero. There’s no effect of Flag Priming. If you show people the American flag, it doesn’t really do that much.

Skip to 8 minutes and 28 seconds This is sort of how they summarized the results. You’ve got the original estimates over here.

Skip to 8 minutes and 34 seconds And then you have the sort of median and mean replication estimates here with the confidence interval. And here, they even have the 99% confidence interval. These things are like precisely estimated when you have this much data. So they’re even taking like a wider confidence interval. And then we have something really nice over here which is, study by study, what fraction of the studies have the same direction effect significant? So, again, if we’re using that as our criterion for replication, how often does an individual study that you do, that we know is at least adequately powered and had the same instructions, so there is some quality control. How often do they replicate? This is like power. What are our two cases?
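The "same direction and significant" criterion described above can be sketched as a simple count over lab-level results. This is an illustrative sketch only: the function `replication_rate`, the use of a two-sided z-test, and the four example labs are assumptions and do not reproduce the paper's data or exact tests.

```python
from statistics import NormalDist


def replication_rate(lab_results, original_sign, alpha=0.05):
    """Fraction of labs whose estimate is significant at level alpha
    (two-sided z-test) AND has the same sign as the original effect.

    lab_results: list of (estimate, standard_error) pairs, one per lab.
    original_sign: +1 or -1, the direction of the original finding.
    """
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    hits = 0
    for estimate, std_error in lab_results:
        same_direction = estimate * original_sign > 0
        significant = abs(estimate / std_error) > z_crit
        if same_direction and significant:
            hits += 1
    return hits / len(lab_results)


if __name__ == "__main__":
    # Hypothetical lab-level estimates and standard errors.
    labs = [(0.6, 0.2), (0.5, 0.2), (0.1, 0.2), (-0.5, 0.2)]
    print("share replicating:", replication_rate(labs, original_sign=+1))
```

When a true effect exists, this share behaves like an empirical estimate of statistical power for a single lab's study.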

Skip to 9 minutes and 22 seconds What do they tell us? Gain versus Loss Framing. The original effect was about 1.1 standard deviation units and that falls by about half with all this data. You never get an effect in the opposite direction. 86% of the individual studies give you the same sign and significant. In terms of the Flag Priming, the original effect was about half a standard deviation. You just get this dead on zero basically. There is just no effect of Flag Priming. In none of these individual studies do you even reject the null. In fact, in one of them, you get the wrong sign. That’s about the proportion you’d expect at 95% confidence. About 4%, you know, one out of these studies.
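The remark about the expected proportion of false positives can be checked with a short calculation. This sketch assumes the true effect is exactly zero, that each study is an independent two-sided test at the 5% level, and that there are 36 studies; the count of 36 is an assumption for illustration, since the talk only says "30-something".

```python
def false_positive_summary(n_studies, alpha=0.05):
    """Under a true null, return the expected number of significant
    results and the probability of seeing at least one, assuming the
    studies are independent tests at level alpha."""
    expected = n_studies * alpha
    p_at_least_one = 1.0 - (1.0 - alpha) ** n_studies
    return expected, p_at_least_one


if __name__ == "__main__":
    # 36 studies is an assumed count, not a figure confirmed by the talk.
    expected, p_any = false_positive_summary(36)
    print(f"expected false positives: {expected:.1f}")
    print(f"probability of at least one: {p_any:.2f}")
```

Under these assumptions, a single significant result among a few dozen null studies is unremarkable, which matches the interpretation given in the talk.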

Skip to 10 minutes and 2 seconds So this was kind of an encouraging thing. Like, we know, we pretty much believe that is a zero and yeah, you get one false positive out of 30-something. How much heterogeneity is driven by different things? So one concern is maybe certain labs just produce big effects for some reason. Is there something about the setting? Is there something about the sample? We don’t know. Now, here, it’s reorganized. Each of these is a different lab. So this is Tilburg. And this one is Amazon Mechanical Turk. This is Ithaca. And these are the point estimates for each of the 13 experiments that they ran. And you can see, overall, there isn’t that much variation here.

Skip to 10 minutes and 47 seconds It isn’t like there are some labs that have huge effects and some that always have zeros or something. So they say there isn’t tons of sort of cross-lab variation in overall effects. That’s kind of interesting. The other thing that they show, and I thought this was also interesting – They don’t show much variation due to the study setting. Whether these games were played online or in the lab didn’t really affect the point estimates. That’s kind of encouraging. Of course, it’s all a computer interface anyway, so you might think it wouldn’t matter. But still, if you’re recruiting online samples versus people who walk into the lab, there is quite a bit of different self-selection there. US versus non-US labs.

Skip to 11 minutes and 24 seconds So it wasn’t like the US study samples were different. Now, I might want to cut the data a little differently as a development economist. I might want to separate out Brazil and Malaysia and say, “Oh, these are middle income countries” or whatever – But anyway, overall there was no kind of big differences there. They also, and this was even more surprising, don’t find much in the way of interaction effects with different demographic characteristics. So they also did a survey at the end of lab. Your gender, age, education, kind of basic stuff. They have it in an appendix. That didn’t matter much either. So it looks like most of the variations really are just like sampling variations.

Skip to 12 minutes and 0 seconds It just happens that they are different effects. This is a pretty interesting finding. You know, there is a lot of fear that there is a lot of lab-specific effects. Now, maybe there are lab-specific effects. Maybe the 36 labs that signed up for this are really similar in some way. These 36 labs have, you know, principal investigators who are really into replication. Or they have a particular research approach and so there is sort of less heterogeneity here. I don’t know. Maybe. We don’t know that for sure. But still, they have a lot of different labs. And not much in the way of heterogeneous treatment effects.
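One common way to ask whether variation across labs exceeds what sampling variation alone would produce is Cochran's Q together with the I² statistic from meta-analysis. The sketch below is a generic illustration of that idea under fixed-effect (inverse-variance) weighting. It is not presented as the exact procedure used in the project, and the function name `heterogeneity` and the example estimates are hypothetical.

```python
def heterogeneity(estimates, std_errors):
    """Return (pooled estimate, Cochran's Q, I-squared) for a set of
    lab-level estimates, using inverse-variance weights.

    I-squared is the share of total variation attributed to real
    between-lab differences rather than sampling error (floored at 0).
    """
    weights = [1.0 / (se * se) for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i_squared


if __name__ == "__main__":
    # Hypothetical labs: similar estimates, then widely differing ones.
    similar = heterogeneity([0.50, 0.52, 0.48], [0.10, 0.10, 0.10])
    different = heterogeneity([0.10, 0.90], [0.10, 0.10])
    print("similar labs   (pooled, Q, I^2):", similar)
    print("different labs (pooled, Q, I^2):", different)
```

An I² near zero is what "most of the variation is just sampling variation" looks like in this framing, while a value near one would point to real lab-specific effects.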

A replication example: The Many Labs Project

A paucity of published replications has contributed to a crisis in confidence in social psychology. Recognizing this, a team of 51 researchers from 35 institutions teamed up to create the first Many Labs Project. The project’s aim was to systematically evaluate the robustness of published social science research using the power of anonymity afforded by the team’s large size. Each institution conducted replications of the same studies and the results were collectively analyzed. In this video, I home in on two of these studies in particular that focused on framing and priming.

Note: There is a chart displayed in this video at 8:27 that may be too detailed to read clearly on some devices. If this is your case, you can access the chart by clicking on the link in the SEE ALSO section at the bottom of this page.

“Replication is a central tenet of science; its purpose is to confirm the accuracy of empirical findings, clarify the conditions under which an effect can be observed, and estimate the true effect size,” begin the authors of “Investigating Variation in Replicability.”

They go on to describe the risks that replication guards against and the benefits it can bring:

  • “[f]ailure to identify moderators and boundary conditions of an effect may result in overly broad generalizations of true effects across situations… or across individuals,”

  • “overgeneralization may lead observations made under laboratory observations to be inappropriately extended to ecological contexts that differ in important ways,”

  • “attempts to closely replicate research findings can reveal important differences in what is considered a direct replication… thus leading to refinements of the initial theory,” and

  • “[c]lose replication can also lead to the clarification of tacit methodological knowledge that is necessary to elicit the effect of interest.”

However, replications of published studies are remarkably scarce in social science journals. One reason is that researchers – especially junior ones – fear retaliation from more senior and published scientists. They may also perceive limited interest from journal editors in publishing replications.

In an attempt to address these concerns, Richard Klein and 50 of his colleagues in the psychological sciences took on a “Many Labs” replication project. Their goal was to systematically and semi-anonymously evaluate the robustness of published results in psychology journals, as well as to “establish a paradigm for testing replicability across samples and settings and provide a rich data set that allows the determinants of replicability to be explored” and “demonstrate support for replicability for the 13 chosen effects.”

They selected 13 effects that had been published in behavioral science journals and attempted to replicate them across 36 labs and 11 countries, controlling for methodological procedures and statistical power. The effects included sunk costs, gain versus loss framing, anchoring, retrospective gambler’s fallacy, low-versus-high category scales, norm of reciprocity, allowed/forbidden, flag priming, currency priming, imagined contact, sex differences in implicit math attitudes, and implicit math attitudes relations with self-reported attitudes.

“In the aggregate, 10 of the 13 studies replicated the original results with varying distance from the original effect size. One study, imagined contact, showed a significant effect in the expected direction in just 4 of the 36 samples (and once in the wrong direction), but the confidence intervals for the aggregate effect size suggest that it is slightly different than zero. Two studies – flag priming and currency priming – did not replicate the original effects. Each of these had just one p-value < .05 and it was in the wrong direction for flag priming. The aggregate effect size was near zero whether using the median, weighted mean, or unweighted mean.”
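The three aggregate summaries named in the quotation (median, weighted mean, and unweighted mean) can be sketched as follows. This is an illustration under assumptions: weighting by sample size is one plausible choice rather than a confirmed detail of the paper, and the function name `aggregate_effects` and the example values are hypothetical.

```python
from statistics import mean, median


def aggregate_effects(effects, sample_sizes):
    """Summarize lab-level effect sizes three ways.

    The weighted mean here weights each lab by its sample size, which
    is an assumption made for illustration.
    """
    weighted = sum(e * n for e, n in zip(effects, sample_sizes)) / sum(sample_sizes)
    return {
        "median": median(effects),
        "unweighted_mean": mean(effects),
        "weighted_mean": weighted,
    }


if __name__ == "__main__":
    # Hypothetical lab-level effects and sample sizes.
    print(aggregate_effects([0.0, 0.2, 0.4], [100, 100, 200]))
```

When all three summaries sit near zero, as the authors report for flag priming and currency priming, the conclusion does not depend on how the labs are combined.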

They also found very few differences in replicability and effect sizes across samples or lab settings and conclude “that most of the variation in effects was due to the effect under investigation and almost none to the particular sample used.”

This study is just one of a handful of “Many Labs” replication projects that began in 2011 at the Center for Open Science. These projects have continued to attract interest from researchers and publishers alike. Read more about the project and find other replications.

You can read the whole paper by clicking on the link in the SEE ALSO section at the bottom of this page.

If you have time and want to hear a deeper discussion of replication, check out Episode 7 of the Hi-Phi Nation podcast, titled “Hackademics”. In it, Brian Nosek, a principal investigator of the Many Labs project and co-founder of the Center for Open Science, discusses a complementary replication project, “Estimating the Reproducibility of Psychological Science,” in which he and others involved in the Open Science Collaboration attempted to replicate 100 published experimental and correlational psychology studies. Listen to the episode here.


Klein, Richard A., Kate A. Ratliff, Michelangelo Vianello, Reginald B. Adams Jr, Štěpán Bahník, Michael J. Bernstein, Konrad Bocian et al. “Investigating variation in replicability.” Social Psychology (2014).
