This content is taken from the University of California, Berkeley, Center for Effective Global Action (CEGA) & Berkeley Initiative for Transparency in the Social Sciences (BITSS)'s online course, Transparent and Open Social Science Research.

Skip to 0 minutes and 0 seconds So far we’ve talked about a number of different problems with empirical social science research. You know the Ioannidis findings about how many research literatures are distorted by false positive results – results we really can’t trust – and how pervasive data mining is. We have also gone through the fact that there are large numbers of null results that are just missing from research literatures: basically invisible studies that we can’t take into account when we think about what research findings exist. In the article that I wrote with twenty other co-authors, we lay out a couple of different approaches to start dealing with these problems. We talk about disclosure. We talk about the importance of open data and materials.

Skip to 0 minutes and 46 seconds So these are ways to begin dealing with these challenges. One of the points that came out when we talked about disclosure and about open data and materials is that asking researchers or authors to disclose, on its own, may not be a very powerful tool, because it’s hard to verify what they’ve done. Similarly, asking them to share their data is useful. But if they’re only sharing a piece of their data – only the parts they want the rest of the research community to see – then, again, it’s not going to solve all the problems that we talked about.

Skip to 1 minute and 20 seconds But, in combination with pre-registration of research plans and research hypotheses, these tools actually become more powerful. The idea is that researchers can post their research hypotheses, the data they will use to test them, and their planned research design in a place where other members of the scholarly community can access that information. So if I actually know what you plan to analyze and what data you plan to use, then when you do post a limited subset of your data, I know what to look for. I can say, “Wait a minute! In that pre-registered hypothesis document, you said your primary outcome was X, and X isn’t even in the data.” So there is a lot more accountability when there is pre-registration.
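The kind of check he describes can be sketched in a few lines. Everything here is illustrative – the outcome and column names are invented, and no registry ships this exact tooling:

```python
# Hypothetical example: compare the primary outcomes named in a
# pre-registered analysis plan against the columns of a released dataset.
# All names below are invented for illustration.
registered_outcomes = {"school_attendance", "test_score", "dropout_rate"}
released_columns = {"school_attendance", "household_income", "village_id"}

# Outcomes promised in the plan but absent from the shared data are
# exactly the omissions a referee or reader would want to ask about.
missing = sorted(registered_outcomes - released_columns)
print(missing)  # ['dropout_rate', 'test_score']
```

With the registration public, anyone can run this kind of comparison, which is what gives even a sparse registration its accountability value.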

Skip to 2 minutes and 4 seconds Anyway, just to be more specific, let’s go through the AEA Registry a little bit. The AEA Registry – the American Economic Association’s registry – is new. Its focus is randomized controlled trials; that’s what they aim to register. Of course, there’s been a boom in RCTs in economics that they’re trying to capture, but they very strategically called it socialscienceregistry.org, because they want to make this the central registry for experiments across the social sciences. Okay, so this is the information you need to register a study. It’s actually pretty minimal, honestly. It’s a lot less than you would need for a funding proposal.

Skip to 2 minutes and 42 seconds So if you’ve written a funding proposal, you’ve kind of got all this ready to go. The title, the country, the status. You know, an abstract. Some information on the dates. You need to list your main outcomes. That’s pretty valuable. Again, in terms of this issue of data mining, having to go on the record and say, “These are my two or three main outcomes” is useful. You also have to describe your design: what is your main approach going to be? Now, again, you don’t need that much detail. You don’t have to write down the exact regression equation. But still, this is something.
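To make the “pretty minimal” point concrete, the items he lists could be sketched as a simple record. The field names below are hypothetical stand-ins for the items mentioned in the video, not the socialscienceregistry.org form’s actual schema:

```python
# Illustrative sketch of a minimal trial registration entry.
# Field names are hypothetical, mirroring the video's list.
REQUIRED_FIELDS = [
    "title", "country", "status", "abstract", "start_date", "end_date",
    "primary_outcomes", "experimental_design",
]

def missing_fields(entry):
    """Return required fields that are absent or left empty."""
    return [f for f in REQUIRED_FIELDS if not entry.get(f)]

entry = {
    "title": "Cash Transfers and School Attendance",   # invented example study
    "country": "Kenya",
    "status": "in_development",
    "abstract": "An RCT measuring the effect of cash transfers on attendance.",
    "start_date": "2025-01-01",
    "end_date": "2025-12-31",
    "primary_outcomes": ["attendance rate", "test scores"],
    "experimental_design": "Household-level randomization, two arms.",
}
print(missing_fields(entry))  # [] -> the entry covers all the basics
```

Eight short fields is roughly the scale of it – far less than a funding proposal, which is the point being made.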

Skip to 3 minutes and 12 seconds So they want to make sure that folks are getting on the record with these basic elements. And, you know, it’s pretty straightforward. I just took this screenshot today. You just click on “I want to start a new trial.” This is me – I logged in, I want to start a new trial, I put in the title, whatever – and it just takes you through these really simple screens for each of these things where you can enter the information. So it’s actually pretty straightforward.

Skip to 3 minutes and 36 seconds What if the project you’re working on is not a randomized controlled trial, but you still want to register some hypotheses? There are plenty of studies, for instance, that are prospective: they’re designed in advance of some policy change, say. Let’s say you’re studying political economy, an election is coming up, and you know some data is going to be collected after the election. And you know what research hypothesis you want to test. You can register it on the Open Science Framework. Basically, you can timestamp and archive any document you want on this framework; it’s set up for exactly that. You can create a project title and timestamp your PDF.

Skip to 4 minutes and 16 seconds Make it publicly available and it’ll have a unique identifier you can point people to. So, you can pretty much pre-register whatever you want through the Open Science Framework. So again, as I mentioned before, there’s quite a bit of debate about how much detail there should be in pre-registration material. But my general take is it’s pretty valuable even to have the basics. The design, the main outcomes, the timeframe, all of that is valuable. So let’s think about what the concrete benefits could be, and again, some of this is recapping what we talked about in the first few weeks of the course, and some of it is new.

Skip to 4 minutes and 51 seconds So the first bit of value to even sparse registration is filling in the gaps in the literature. If it became the norm to pre-register studies, we’d have a sense of what studies failed. We’d have a sense of which authors to contact if we wanted information on studies in the literature or we wanted to access their surveys. So that’s going to be useful for understanding publication bias and improving some meta-analysis activity. So that’s one clear benefit. The second clear benefit relates to reducing the risk of data mining because we’ll know what the author’s original intentions were. We’ll know what they were planning to test.

Skip to 5 minutes and 26 seconds And if I’m a referee on a paper, it should be pretty straightforward for me to see the paper, check the registration, and see whether the tests in the paper are wildly divergent from the registration. If they are, I can say, “Wait a minute, you said you were going to run Y1 on X1, but you’re running regressions of Y2 on X2. I want to see Y1 on X1. I want you to include that in the paper.” And that may make clear that the results are not quite as robust as the authors make them out to be – or maybe they are robust. But it will be a lot clearer to know where the authors started.

Skip to 5 minutes and 59 seconds So that could be useful in the refereeing process and for the rest of the scholarly community. Related to that is the issue of generating correctly sized statistical tests – meaning p-values that we can believe in, basically, that are meaningful. If people are doing lots of data mining and running tons of tests, I don’t know what their p-values mean anymore, because they may have strategically picked specifications with p-values of 0.049, just under the significance threshold. So by knowing what they planned to run and asking to see those results, I can see p-values that are more meaningful to me. That’s, again, pretty valuable for the scholarly community.
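The arithmetic behind that worry is simple: under a true null, each test has a 5 percent chance of a false positive, so running many unreported tests makes some “significant” result almost guaranteed. A quick back-of-the-envelope check, assuming independent tests:

```python
# Chance of at least one p < 0.05 among k independent tests of true nulls.
for k in (1, 5, 20, 40):
    p_any = 1 - 0.95 ** k
    print(f"{k:2d} tests -> P(at least one false positive) = {p_any:.2f}")
# 1 test -> 0.05, 5 tests -> 0.23, 20 tests -> 0.64, 40 tests -> 0.87
```

So a reported p-value of 0.049 means something very different depending on whether it came from one pre-specified test or from forty undisclosed ones – which is exactly why knowing the planned tests matters.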

Skip to 6 minutes and 34 seconds As we alluded to before, open data and disclosure are going to be much more effective if you can check what’s been released against the original data plans: you can see what’s missing, what the omissions are, and try to get that information from the author. So again, accountability is going to be enhanced by having a registry. And then a fifth point that I think is very important, which we’ve alluded to a couple of times before – and I can attest to this from my own experience – is that having to write up a pre-analysis plan and register some information about your study in advance actually, I think, most of the time makes the research better.

Skip to 7 minutes and 13 seconds Because people launch projects, sure, they put thought into the design and they go through the motions. They talk about it, they think about the design, they do some power calculations or something. But having to put it on paper, knowing you’re going to have to run that exact analysis at some point in the future, really focuses the mind. There’s a big concern that somehow pre-registration will limit creativity. This comes up all the time. There’s a notion that working through the data and seeing what unexpected correlations are in it is just a central part of the scientific process – that it’s sort of either/or. You either pre-register and you do what you said you were going to do.

Skip to 7 minutes and 50 seconds Or you do exploratory research. I think the counterpoint is that there’s nothing at all to stop you. Let’s say you pre-registered a hypothesis, you collected data, and you presented the results you had pre-registered. There’s absolutely nothing to stop you, or other members of the scholarly community, from doing exploratory work on that data, looking for unexpected correlations. The data is out there. So I don’t think it’s an either/or. The flip side, where I think people worry there could be some stifling of exploratory work, is if journals just won’t publish anything except pre-specified work – if somehow the norm becomes so strong that, unless you pre-specify your hypothesis, they won’t even look at the results.

Skip to 8 minutes and 33 seconds My own view is that this concern is overstated. We’ve been doing field experiments in development economics for almost 20 years now, and still, if you look at published papers in the top economics journals, over 80 percent are observational studies.

Skip to 8 minutes and 51 seconds Exploratory work is inherently more tentative. If you really did look at 30 or 40 different variables and found some unexpected correlations, it’s pretty hard to know how many of those are real and how many are spurious – just the result of data mining or sampling variation. So if you do find interesting correlations, that’s great, and maybe that sets the stage for the next experiment or the next study, where you go in prospectively. One way we put it in our Science article last year is that we want to free exploratory analysis from being portrayed as ex-ante hypothesis testing.
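The 30-or-40-variables point can be demonstrated with a small simulation on pure noise. This is an illustrative sketch, not anyone’s actual analysis: with 40 noise variables and a conventional 5 percent threshold, a couple of “significant” correlations typically appear by chance alone.

```python
import math
import random

random.seed(1)
n, k = 100, 40  # observations per variable, number of candidate variables

def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

outcome = [random.gauss(0, 1) for _ in range(n)]
# For pure noise, |r| > ~1.96/sqrt(n) is roughly "significant at the 5% level".
threshold = 1.96 / math.sqrt(n)
hits = sum(
    abs(corr(outcome, [random.gauss(0, 1) for _ in range(n)])) > threshold
    for _ in range(k)
)
print(f"{hits} of {k} pure-noise variables cross the threshold")
```

None of those hits is “real” – every variable is independent noise – which is why an unexpected correlation found this way is a hypothesis for the next prospective study, not a finding in itself.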

Skip to 9 minutes and 27 seconds So people explore their data, they run a million tests, they mine their data, they do all these things. And then they show a set of tables presenting the p-values as if those were the only tests they had run. And those aren’t the only tests they’ve run, and everybody knows it. So we’re trying to make a break from that. If you label that analysis and those p-values as exploratory – you didn’t pre-specify them – we know what that means. If you do pre-specify, we know what that means, too. It just creates clarity about what’s being done. So I think that’s the potential advantage.

Introduction to pre-registration

I introduced you last week to the concepts of pre-registration (the basis of the Registered Reports publishing format) and pre-analysis plans, which allow researchers to openly share their research questions and their plans for data collection and analysis. In this video, I explore further the motivation for using these tools and what using them generally entails. I also address some common reservations researchers have about pre-registration.
