So this Vines et al piece is really short, but it highlights the potential depth of this problem. Basically what they find when they look at a particular literature in biology is, in a setting where there were no requirements to post data, or very limited requirements to post data, data quickly disappears. It's very hard to access. So what they do is they get 500 articles over 20 years, published in a specific sub-literature within biology.
And they try to get the data, that's basically the goal. They try to find the email addresses of the authors. The original papers have email addresses. A bunch of times they bounce. They look for them on the web. They try to email them. They try to email them again until they reply. A lot of the time they don't reply. Even when they reply they rarely share the data. But they basically show the breakdown in each of those steps. It's kind of an interesting exercise. Now every community will be different. In some research communities there may be better data sharing. But they just did this a few years ago again. So first finding, it's very difficult to contact published authors.
A lot of email addresses bounce, and the ability to secure data drops off very quickly. So even five, ten years back, it's hard to get data. So overall out of their 500 articles, they were only able to get the data for about 20 percent. 80 percent of the data just wasn't accessible. Now maybe if they'd been more persistent. Maybe if they had sent ten emails instead of two or three. Maybe if they called up the author five times. But that's not how it's supposed to go, right. It's not supposed to be the case you have to spend months of your life getting data. It’s supposed to be freely shared, it’s supposed to be...
So this is the data on their table. Hopefully you guys can see this. This is sort of the year by year. That's the publication year on the left. And because they had sort of limited bandwidth they just coded up every other year.
Once you go back to some of the older articles, 30, 40 percent of the time there's no working email. For the more recent articles usually there's a working email.
The author responds to the email maybe 50 to 60 percent of the time but maybe 40 percent of the time they don't.
They very often don't even give us a response that's meaningful in terms of the stats and the data. Sometimes they say the data is lost, especially for the old studies. What he says is in a lot of these old studies they're like, and this is you know 20-some years ago, so it's on a floppy disk. My new computer doesn't have a floppy disk reader. I don't even know how to get this to you.
So they only actually get the data 19 percent of the time. This is just like kind of demoralizing. And definitely for the old studies, you’re just not going to get the data basically. This is just lost to science. No one’s going to be able to extend this stuff or reanalyze it. Five years ago, six years ago, 2009 or so, an editor took over AJPS named Rick Wilson who was really strongly in favor of data posting. And made it a requirement that papers published in AJPS post their data. As of 2013 when a review was done, very few political science journals required data posting.
And even those who have policies, they only really weakly enforce them. There’s kind of a consensus that that's the case. Dafoe presents and collects data about just how much of a problem this is. And this is data that links back up to the Vines et al piece and the availability of biology data. He surveyed Ph.D. students in his class at Yale as well as in a Statistical Methods class at Harvard, asked them what their experience was with the papers they tried to replicate. Like could they find the replication files. I forget what the sample was for his survey here, however many dozen observations from people who were trying to replicate data, replicate studies and try to get data.
What he basically finds is in about 50 percent of the cases basically there isn’t enough posted to replicate the study. Some of them have nothing available, even when you ask the authors. Some only have the data, not the code. Some have the code and not the data, et cetera. So basically you have this sort of survey. He consented people and asked them all these questions about their experiences, and this is the kind of key bottom-line figure. This is from his appendix. Proportion of responses where there were just no replication materials available was 20-something percent.
Cases where there was limited data, but so limited you can’t even do any analysis, was another 20-something percent. So it was basically half the cases. And then there’s this kind of distribution of quality in terms of availability of code or data. In some it was like yeah, there was enough to kind of do what I needed to do. And others, oh, I could do most of what I needed to do. So there’s this sort of continuum. But in a full half the cases, these folks just couldn't get the data. And a lot of the papers they wanted to replicate were recent papers. These were like Ph.D. students trying to get on the research frontier, looking at recent papers.
So this is like a best case scenario. If you go back 15 years, it’s like got to be worse than this, or 10 years. So it’s sort of like field after field, we’re still in a weird equilibrium it seems where people aren’t sharing their data. Journals may be able to exploit their leverage, and they may be a key sort of institution for getting data online. Dafoe presents and contrasts AJPS policy under Rick Wilson, where they very seriously pushed for data sharing, with the APSR policy of encouraging data posting. And making it clear that there is an expectation that data will be shared with interested scholars. But basically no enforcement.
But what Rick Wilson did is he basically held up publication early on as editor of a bunch of papers and was like, “No, I’m not going to run your paper until you give me the data.” Like he was serious about it.
He would himself check out the replication files to see that they ran.
And then they pushed very hard to get all the data posted on Dataverse. The left-hand panel is AJPS, and Rick Wilson takes over in 2009. So in 2009, the papers published in 2009, basically none of those papers published state that their replication data is available. But by 2010 it’s sort of a mix. And by 2011 and ’12 almost all of them state that their data is available. He made them say that. He was like, “You have to state in the article your data is available and you have to post it.” This follow-on plot – again, left-hand side is AJPS, right-hand side is APSR – is data that when you actually go to look for it actually is available.
He could find it, Dafoe could easily find the data. And even though it isn’t quite 100 percent, you see a huge improvement from like 2009! This isn’t that long ago! None of the data was available in the leading political science journal. There were like two folks who shared their data or something, like two nerds, you know? And then by the end, it’s 75 percent! So this is a massive improvement. In two years you go from no data to 70 percent of data being posted. That's a big improvement.
So again, it suggests that journals have leverage, and this may be a promising way forward.
Two studies on the problem of disappearing data
Why is scientific misconduct such a problem? One major reason is that, until a few years ago, there weren’t many effective mechanisms for keeping researchers accountable. Even if someone wanted to replicate a study, obtaining data from published authors could be extremely difficult. In 2014, zoologist Dr. Timothy Vines and his colleagues wrote a short piece describing just how big an issue disappearing data was. Several years earlier, around 2009, political scientist Dr. Rick Wilson became the new editor of the American Journal of Political Science (AJPS) and began strictly enforcing the journal’s requirement that published authors post their data. Dr. Allan Dafoe, another political scientist, saw this as an opportunity to examine how well the enforcement of data sharing requirements served as a way to prevent research fraud. Watch the video to learn what he found.
© Center for Effective Global Action