This content is taken from the University of California, Berkeley, Center for Effective Global Action (CEGA) & Berkeley Initiative for Transparency in the Social Sciences (BITSS)'s online course, Transparent and Open Social Science Research.

Skip to 0 minutes and 0 seconds So this Vines et al piece is really short, but it highlights the potential depth of this problem. Basically what they find when they look at a particular literature in biology is, in a setting where there were no requirements to post data, or very limited requirements to post data, data quickly disappears. It’s very hard to access. So what they do is they get 500 articles over 20 years, published in a specific sub-literature within biology.

Skip to 0 minutes and 31 seconds And they try to get the data, that’s basically the goal. They try to find the email addresses of the authors. The original papers have email addresses. A bunch of times they bounce. They look for them on the web. They try to email them. They try to email them again until they reply. A lot of the time they don’t reply. Even when they reply they rarely share the data. But they basically show the breakdown in each of those steps. It’s kind of an interesting exercise. Now every community will be different. In some research communities there may be better data sharing. But they just did this a few years ago again. So first finding, it’s very difficult to contact published authors.

Skip to 1 minute and 6 seconds A lot of email addresses bounce, and the ability to secure data drops off very quickly. So even five, ten years back, it’s hard to get data. So overall out of their 500 articles, they were only able to get the data for about 20 percent. 80 percent of the data just wasn’t accessible. Now maybe if they’d been more persistent. Maybe if they had sent ten emails instead of two or three. Maybe if they called up the author five times. But that’s not how it’s supposed to go, right. It’s not supposed to be the case you have to spend months of your life getting data. It’s supposed to be freely shared, it’s supposed to be…

Skip to 1 minute and 40 seconds So this is the data on their table. Hopefully you guys can see this. This is sort of the year by year. That’s the publication year on the left. And because they had sort of limited bandwidth they just coded up every other year.

Skip to 1 minute and 56 seconds Once you go back to some of the older articles, 30, 40 percent of the time there’s no working email. For the more recent articles usually there’s a working email.

Skip to 2 minutes and 6 seconds The author responds to the email maybe 50 to 60 percent of the time, but maybe 40 percent of the time they don’t.

Skip to 2 minutes and 17 seconds They very often don’t even give us a response that’s meaningful in terms of the stats and the data. Sometimes they say the data is lost, especially for the old studies. What he says is in a lot of these old studies they’re like, and this is you know 20-some years ago, so it’s on a floppy disk. My new computer doesn’t have a floppy disk reader. I don’t even know how to get this to you.

Skip to 2 minutes and 38 seconds So they only actually get the data 19 percent of the time. This is just like kind of demoralizing. And definitely for the old studies, you’re just not going to get the data basically. This is just lost to science. No one’s going to be able to extend this stuff or reanalyze it. Five years ago, six years ago, 2009 or so, an editor took over AJPS named Rick Wilson who was really strongly in favor of data posting. And made it a requirement that papers published in AJPS post their data. As of 2013 when a review was done, very few political science journals required data posting.

Skip to 3 minutes and 19 seconds And even those who have policies, they only really weakly enforce them. There’s kind of a consensus that that’s the case. Dafoe presents and collects data about just how much of a problem this is. And this is data that links back up to the Vines et al piece and the availability of psychology data. He surveyed Ph.D. students in his class at Yale as well as in a Statistical Methods class at Harvard, asked them what their experience was with the papers they tried to replicate. Like could they find the replication files. I forget what the sample was for his survey here, however many dozen observations from people who were trying to replicate data, replicate studies and try to get data.

Skip to 3 minutes and 58 seconds What he basically finds is in about 50 percent of the cases basically there isn’t enough posted to replicate the study. Some of them have nothing available, even when you ask the authors. Some only have the data, not the code. Some have the code and not the data, et cetera. So basically you have this sort of survey. He consented people and asked them all these questions about their experiences, and this is the kind of key bottom-line figure. This is from his appendix. Proportion of responses where there were just no replication materials available was 20-something percent.

Skip to 4 minutes and 37 seconds Cases where there was limited data, but so limited you can’t even do any analysis, was another 20-something percent. So it was basically half the cases. And then there’s this kind of distribution of quality in terms of availability of code or data. In some it was like yeah, there was enough to kind of do what I needed to do. And others, oh, I could do most of what I needed to do. So there’s this sort of continuum. But in a full half the cases, these folks just couldn’t get the data. And a lot of the papers they wanted to replicate were recent papers. These were like Ph.D. students trying to get on the research frontier, looking at recent papers.

Skip to 5 minutes and 11 seconds So this is like a best case scenario. If you go back 15 years, it’s got to be worse than this, or 10 years. So it’s sort of like field after field, we’re still in a weird equilibrium it seems where people aren’t sharing their data. Journals may be able to exploit their leverage, and they may be a key sort of institution for getting data online. Dafoe presents and contrasts AJPS policy under Rick Wilson, where they very seriously pushed for data sharing, with the APSR policy of encouraging data posting. And making it clear that there is an expectation that data will be shared with interested scholars. But basically no enforcement.

Skip to 5 minutes and 52 seconds But what Rick Wilson did is he basically held up publication early on as editor of a bunch of papers and was like, “No, I’m not going to run your paper until you give me the data.” Like he was serious about it.

Skip to 6 minutes and 2 seconds He would himself check out the replication files to see that they ran.

Skip to 6 minutes and 8 seconds And then they pushed very hard to get all the data posted on Dataverse. The left-hand panel is AJPS, and Rick Wilson takes over in 2009. So in 2009, the papers published in 2009, basically none of those papers published state that their replication data is available. But by 2010 it’s sort of a mix. And by 2011 and ’12 almost all of them state that their data is available. He made them say that. He was like, “You have to state in the article your data is available and you have to post it.” This follow-on plot – again, left-hand side is AJPS, right-hand side is APSR – is data that when you actually go to look for it actually is available.

Skip to 6 minutes and 56 seconds He could find it, Dafoe could easily find the data. And even though it isn’t quite 100 percent, you see a huge improvement from like 2009! This isn’t that long ago! None of the data was available in the leading political science journal. There were like two folks who shared their data or something, like two nerds, you know? And then by the end, it’s 75 percent! So this is a massive improvement. In two years you go from no data to 70 percent of data being posted. That’s a big improvement.

Skip to 7 minutes and 32 seconds So again, it suggests that journals have leverage, and this may be a promising way forward.

Open data makes for more credible science: two articles on disappearing data

Why is scientific misconduct such a problem? One major reason is that, until a few years ago, there weren’t many effective mechanisms for keeping researchers accountable. Even if someone wanted to replicate a study, obtaining data from published authors could be extremely difficult. In 2014, zoologist Timothy Vines and his colleagues wrote a short piece describing just how big an issue disappearing data was. A few years earlier, around 2009, political scientist Rick Wilson had become the new editor of the American Journal of Political Science (AJPS) and begun strictly enforcing a requirement that published authors post their data. Allan Dafoe, another political scientist, saw this as an opportunity to examine how well the enforcement of data sharing requirements served as a way to prevent research fraud. Watch the video to learn what he found.

In recent years, many policies have been put into place that require research to be accessible to anyone through public archives. Political scientist Allan Dafoe, author of “Science Deserves Better: The Imperative to Share Complete Replication Files,” advocates for replication transparency, saying that “good research involves publishing complete replication files, making every step of research as explicit and reproducible as is practical.”

Unfortunately, many researchers still do a poor job preserving their data and it is too often lost. Dafoe’s paper simply argues that, with transparency and publication, “political science will become more refutable, open, cumulative, and accessible.” Without transparency, fraud threatens to reduce the public’s trust in science.

In “The Availability of Research Data Declines Rapidly with Article Age,” Timothy Vines et al. also defend the importance of data transparency through an analysis of the effect of article age on data availability. The study formally investigated the relationships between a published paper’s age and four probabilities:

  1. the probability of finding at least one working e-mail for a first, last, or corresponding author in order to request data;
  2. the conditional probability of a response, given that at least one e-mail appeared to work;
  3. the conditional probability of getting a response that indicated the status of the data, given that a response was received; and
  4. the conditional probability that the data were extant, given that an informative response was received.
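Because each of these probabilities is conditional on the previous step succeeding, multiplying them gives the overall chance that a request ends with data confirmed to exist. The Python sketch below illustrates that chain. The function name `overall_probability` and the stage values are hypothetical placeholders chosen for illustration, not figures from the paper.

```python
from math import prod


def overall_probability(stage_probs):
    """Multiply a chain of conditional probabilities into one overall probability."""
    for p in stage_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return prod(stage_probs)


# Hypothetical stage values, for illustration only (not taken from Vines et al.).
stages = [
    0.8,  # P(at least one working e-mail is found)
    0.5,  # P(response | working e-mail)
    0.5,  # P(informative response | response)
    0.5,  # P(data extant | informative response)
]

print(overall_probability(stages))
```

Even moderately high success rates at each step compound into a low overall rate, which is the pattern the breakdown in the video describes.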

The authors found a negative relationship between the age of the paper and the probability of finding at least one apparently working email, either through the journal or by searching online. In fact, for each additional year, the odds of finding a working email fell by 7%. Additionally, there was a “negative relationship between age of the paper and the probability of the data set being extant (‘shared’ or ‘exists but unwilling to share’).” With each additional year after publication, the odds of data being extant decreased by 17%. Finally, they found a slightly positive effect of article age on working emails found via web searches. Data from older studies tended not to be available, mainly because data sets were lost or stored on inaccessible media such as Zip or floppy disks. Restoring these data using modern computer infrastructure would therefore take an excessive amount of time.
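A yearly decline in odds is not the same as a yearly decline in probability, so it can help to convert between the two. The sketch below assumes the odds are multiplied by a constant factor each year (0.83 for a 17% yearly drop). The starting probability of 0.9 and the function names are hypothetical, chosen only to show the arithmetic.

```python
def prob_to_odds(p):
    """Convert a probability strictly between 0 and 1 into odds."""
    return p / (1.0 - p)


def odds_to_prob(odds):
    """Convert odds back into a probability."""
    return odds / (1.0 + odds)


def probability_after(p0, yearly_odds_ratio, years):
    """Probability after `years` if the odds are multiplied by the ratio every year."""
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must be strictly between 0 and 1")
    odds = prob_to_odds(p0) * yearly_odds_ratio ** years
    return odds_to_prob(odds)


# Hypothetical starting probability (0.9) at publication; 0.83 encodes a 17% yearly drop in odds.
for years in (0, 5, 10, 20):
    print(years, round(probability_after(0.9, 0.83, years), 3))
```

Under this assumption the probability falls slowly while it is still high and faster as it approaches the middle of the range, which is why a fixed percentage drop in odds cannot be read directly as the same percentage drop in probability.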

Because of data’s potential usefulness in studies performed long after collection, the authors advocate for data preservation in public archives where it cannot be lost or withheld by authors.

These articles demonstrate how imperative data availability is for maintaining scientific credibility, both within the research community itself and in the public eye. In an effort to facilitate a transition toward more open data, Allan Dafoe makes various recommendations for how to produce good replication files:

For Statistical Studies:

  1. Do all data preparation and analysis in code.
  2. Adopt best practices for coding, including clarity in code, testing, and running code all the way through.
  3. Build all analysis from primary data files.
  4. Fully describe variables.
  5. Document every empirical claim.
  6. Archive your files.
  7. Encourage co-authors to adopt these standards.
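One possible way to support the "Archive your files" step is to store a checksum manifest alongside the replication files, so that anyone downloading the archive can confirm the files are complete and unchanged. This is an illustrative sketch, not a procedure taken from Dafoe's paper. The function names and the manifest file name `MANIFEST.sha256` are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under `root` (by relative path) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def write_manifest(root, manifest_name="MANIFEST.sha256"):
    """Write the manifest into `root` and return it (hypothetical file name)."""
    manifest = build_manifest(root)
    # Leave out a manifest from an earlier run so it does not list itself.
    manifest.pop(manifest_name, None)
    lines = [f"{digest}  {name}" for name, digest in manifest.items()]
    (Path(root) / manifest_name).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest
```

Calling `write_manifest("replication_files")` on a folder of primary data and code (the folder name here is only an example) would produce a manifest that can be archived together with the files.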

For Journals:

  1. Require complete replication files before acceptance.
  2. Encourage high standards for replication files.
  3. Implement replication audits.
  4. Retract publications with non-replicable analyses.

Data sharing and transparency are scientific public goods, benefiting many and lowering the barrier to entry for students and junior researchers. Open science provides tools, incentivizes caution in study designs, and can produce much more credible research.

Why do you think scientists and researchers don’t make more of an effort to preserve their data after publication?

If you want to dive deeper into the material, you can read both papers in full; the references are listed below.


Dafoe, Allan. 2014. “Science Deserves Better: The Imperative to Share Complete Replication Files.” PS: Political Science & Politics 47 (1): 60–66. doi:10.1017/S104909651300173X.

Vines, Timothy H., Arianne Y. K. Albert, Rose L. Andrew, Florence Débarre, Dan G. Bock, Michelle T. Franklin, Kimberly J. Gilbert, Jean-Sébastien Moore, Sébastien Renaut, and Diana J. Rennison. 2014. “The Availability of Research Data Declines Rapidly with Article Age.” Current Biology 24 (1): 94–97.
