This content is taken from the University of California, Berkeley, Center for Effective Global Action (CEGA) & Berkeley Initiative for Transparency in the Social Sciences (BITSS)'s online course, Transparent and Open Social Science Research. Join the course to learn more.

Skip to 0 minutes and 0 seconds So far we’ve been saying, “Ah researchers, they don’t wanna share their data. They’re being selfish. They wanna get all the publications out of it themselves.” But maybe there’s something they’re not considering. And maybe the thing they’re not considering is, actually, if you posted your data, it might actually be good for you. And not just good for you because your university might condition getting tenure on you being a proponent of open data. That would be different. That would be like a regulation. This would just be like, in the sort of like “sphere” of the debate over ideas and the intellectual community, you will have a higher profile if you share your data with the research community.

Skip to 0 minutes and 36 seconds Now, if that’s true then it’s sort of like: Hey you’re being dumb! Put the time in and share your data. You’re gonna get more citations. People are going to build on your work. And maybe you’re being shortsighted by not sharing your data. So, maybe people are being lazy and they’re not assembling the replication code. But if they put the effort in it would actually be to their benefit because other people could build on what they have been doing.

Skip to 0 minutes and 59 seconds So let’s talk about the Piwowar and Vision piece. And the goal of this piece is to test this question. Does open data, does making your data public, increase your citation count? So what did they do? They assembled 10,000 papers, again from biology, on gene expression microarray data. That sounds really cool. I have no idea what it is. And they classified the availability of data for these different studies. Does making your data available increase later citations? And they find that it does. They find that, on average, articles that had the data publicly available got 9% more citations.

Skip to 1 minute and 43 seconds Now, it turns out that papers that have just been published in the last few years don’t see any of this sort of benefit. It sort of seems to appear over time. So, once you look at papers from, you know, 5, 6, 7 years ago, then the citation increase for papers with publicly available data is 30%.

Skip to 2 minutes and 6 seconds And they have evidence that a lot of the datasets are reused by others. So this, if it’s real, provides a professional incentive to post your data. Now, the problem with this is that these studies may be different. People who post their data may be different from those who don’t post their data. For instance, if people request my data, it may be because they’re actually interested in my study. Like they’re citing my study. They may be more likely to request it and I may be more likely to post it. If that’s the case, this correlation is problematic. Now, they have said they have 124 different characteristics on all these papers. They control for all these things.

Skip to 2 minutes and 50 seconds They control for the citation count of the lead author and they control for the exact subfield and all these other things. But to the extent there’s still some kind of effect which is like, I post my data when there is demand for my data, and there is demand for my data when I have a good paper that’s being cited, then this correlation is harder to interpret. Like, you want some exogenous push in data posting, because of a journal requirement or some other requirement. I don’t know, maybe I’m being too cynical. I just don’t know how convincing, at the end of the day, it is. It’s suggestive.

Skip to 3 minutes and 24 seconds It’s like a neat “finding” and it’s cool that they’re doing this kind of research. But I think we probably need a little more research design. Meaning, some sort of quasi-experimental or experimental variation, to be convinced.

If you post your data, it might actually be good for you!

A 2013 study by Dr. Heather Piwowar and biologist Dr. Todd Vision suggests that sharing data isn’t just good for the scientific research community as a whole, but for individual publishing authors as well. They found a positive correlation between posting data and increased citations. This video discusses their methods, as well as why I’m not entirely convinced this provides a strong incentive to share data.

Data sharing, while attracting growing interest, has yet to become a scientific norm. This may be due, in part, to a common belief that the costs of preparing data and making it widely available outweigh the benefits. However, an article by Heather Piwowar and Todd Vision, titled “Data reuse and the open data citation advantage,” may provide authors an extra incentive to share data. The authors suggest that, on top of allowing for future investigation of past studies and methods, encouraging multiple perspectives regarding data, identifying errors, and improving publication integrity, sharing data leads to higher citation rates.

Measuring that citation benefit is not straightforward, however. Piwowar and Vision acknowledge that few previous analyses had the statistical power to control for the many variables known to predict citation rates, which has led to uncertain estimates of the “citation benefit[s].”

In their study of gene expression microarray data, Piwowar and Vision looked at citation rates while controlling for many known citation predictors, and they investigated how data reuse varies over time and across datasets. Their methods for studying reuse are described below:

“First, we conduct a small-scale manual review of citation contexts to understand the proportion of citations that are made in the context of data reuse. Second, we use attribution through mentions of data accession numbers, rather than citations, to explore patterns in data reuse on a much larger scale.”

They conducted their analysis using many factors as covariates, including date of publication, open access status, number of authors, author country, study topic, and more. Additionally, they examined patterns of data reuse.

The authors conclude that there is a strong citation benefit from open data, and a “direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data.” They found that the number of citations a paper received is strongly correlated with its publication date. Overall, papers with openly available data received more citations, even after controlling for variables known to affect citation rates.
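For readers less used to regression output, it may help to see how a percentage such as the 9% mentioned in the video relates to a regression coefficient. Analyses of this kind often model the logarithm of the citation count, in which case the coefficient on an "open data" indicator has to be exponentiated before it reads as a percentage. The sketch below assumes such a log-linear model; that is an assumption made here for illustration, the function names are hypothetical, and the paper itself should be consulted for the actual specification.

```python
import math

def percent_difference(coef):
    """Percent change in the outcome implied by a coefficient on a 0/1
    indicator, assuming the model's outcome is log(citations)."""
    return (math.exp(coef) - 1.0) * 100.0

def coefficient_for(percent):
    """Inverse conversion: the log-scale coefficient implied by a percent change."""
    return math.log(1.0 + percent / 100.0)

# The two figures quoted in the video: 9% overall, 30% for older papers.
for pct in (9.0, 30.0):
    beta = coefficient_for(pct)
    print(f"{pct:.0f}% more citations <-> log-scale coefficient {beta:.4f}")
```

The conversion says nothing about causality. It only translates between two ways of reporting the same adjusted correlation.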

Piwowar and Vision also list other factors aside from third-party data reuse that may be relevant to “open data citation benefits”:

  1. Data Reuse – Papers with available datasets can be used in ways that papers without data cannot, and they may receive additional citations as a result.

  2. Credibility Signalling – The credibility of research findings may be higher for research papers with available data. Such papers may be preferentially chosen as background citations or the foundation of additional research.

  3. Increased Visibility – Third party researchers may be more likely to encounter a paper with available data, either by a direct link from the data or indirectly through cross-promotion.

  4. Early View – When data is made available before a paper is published, some citations may accrue earlier than they would otherwise because of accelerated awareness of the methods, findings, and so on.

  5. Selection Bias – Authors may be more likely to publish data for papers they judge to be their best quality work, because they are particularly proud or confident of the results.
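The last item, selection bias, is the concern raised in the video, and a small simulation can make it concrete. The sketch below is purely illustrative: every number in it is invented, "quality" is a hypothetical latent variable, and the true effect of sharing on citations is set to zero. Because better papers are both more likely to have posted data and more likely to be cited, a naive comparison still shows a citation gap.

```python
import random
import statistics

def simulate_papers(n_papers=20000, true_effect=0.0, seed=0):
    """Return (quality, shares_data, citations) tuples for hypothetical papers."""
    rng = random.Random(seed)
    papers = []
    for _ in range(n_papers):
        quality = rng.random()  # latent quality in [0, 1); unobserved in real data
        shares_data = rng.random() < 0.2 + 0.6 * quality  # better papers share more often
        expected = 10.0 * (1.0 + quality) * (1.0 + true_effect * shares_data)
        citations = expected + rng.gauss(0.0, 2.0)  # stylized, continuous citation score
        papers.append((quality, shares_data, citations))
    return papers

def naive_advantage(papers):
    """Relative citation gap between sharing and non-sharing papers, ignoring quality."""
    shared = [c for _, s, c in papers if s]
    unshared = [c for _, s, c in papers if not s]
    return statistics.mean(shared) / statistics.mean(unshared) - 1.0

def stratified_advantage(papers, n_bins=10):
    """Average the same gap within narrow quality bins (only possible in a simulation)."""
    gaps = []
    for b in range(n_bins):
        in_bin = [p for p in papers if int(p[0] * n_bins) == b]
        gaps.append(naive_advantage(in_bin))
    return statistics.mean(gaps)

papers = simulate_papers()  # true_effect=0.0: sharing does nothing by construction
print(f"naive gap: {naive_advantage(papers):+.1%}")
print(f"gap within quality bins: {stratified_advantage(papers):+.1%}")
```

In the simulation the gap largely disappears once papers of similar quality are compared, but only because quality is recorded there. In real citation data it is not, which is why controlling for observed characteristics cannot fully rule this explanation out and why an exogenous push toward sharing, such as a journal requirement, would be more convincing.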

The obstacles the authors faced while gathering citation data suggest that “improvements in tools and practice are needed to make impact tracking easier and more accurate, for day-to-day analyses as well as studies for evidence-based policy.”

While there are positive and negative incentives to data sharing, the authors ultimately assert that, in the transition from “data not shown” to a culture where published data is normalized, sharing data should be seen as a tenet of science, and science as a public resource.

If you want to dive deeper into the material, you can read the whole paper, cited below.


Piwowar, Heather A., and Todd J. Vision. 2013. “Data Reuse and the Open Data Citation Advantage.” PeerJ 1 (October): e175. doi:10.7717/peerj.175.
