At the American Political Science Association meetings earlier this year, Gary King, Albert J. Weatherhead III University Professor at Harvard University, gave a presentation on Dataverse. Dataverse is an important tool that many researchers use to archive and share their research materials. As many readers of this blog may already know, the journal that I co-edit, Political Analysis, uses Dataverse to archive and disseminate the replication materials for the articles we publish. I asked Gary to write some remarks about Dataverse, based on his APSA presentation. His remarks are below.
* * * * *
An update on Dataverse
By Gary King
If you’re an academic researcher, odds are you’re not a professional archivist, and so you probably have more interesting things to do when making data available than following the detailed protocols and procedures established over many years by the archiving community. That might be OK for any one of us, but it is a terrible loss for all of us. The Dataverse Network Project offers a solution to this problem: it eliminates transaction costs and changes the incentives to make data available by giving you substantial web visibility and academic citation credit for your data and scholarship (King, 2007). Dataverse Networks are installed at universities and other institutions around the world (e.g., here is the Dataverse Network at Harvard’s IQSS), and together they represent the world’s largest collection of social science research data. In recent years, Dataverse has also been adopted by an increasingly diverse array of other fields, and protocols and procedures are being built out to enable numerous fields of science, social science, and the humanities to work together.
With a few minutes of set-up time, you can add your own Dataverse to your homepage, listing the data sets or replication data sets you make available, with whatever levels of permission you want for the broader community, along with a wide range of professional services (e.g., here’s my Dataverse on my homepage). People will be able to find your data and homepage more easily, explore your data and scholarship, find connections to other resources, download data in any format, and learn the proper way to cite your work. They will even be able to analyze your data while still on your web site, with a vast array of statistical methods, through the transparent and automated connection Dataverse has built to Zelig: Everyone’s Statistical Software, and through Zelig to R. The result is that your data will be professionally preserved and easier to access, effectively automating the tasks of professional archiving, including citing, sharing, analyzing, archiving, preserving, distributing, cataloging, translating, disseminating, naming, verifying, and replicating data.
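A formal data citation of the kind described above combines authors, year, title, a persistent identifier, and (in the Dataverse convention) a Universal Numerical Fingerprint (UNF) and a version number. As a rough sketch of how those pieces fit together, here is a small helper; the function name, field ordering, and all example values are my own illustration, not Dataverse's actual citation output:

```python
# Sketch of assembling a formal data citation in the general style
# Dataverse promotes (authors; year; title; persistent ID; UNF; version).
# Function name, ordering, and example values are illustrative only.

def format_data_citation(authors, year, title, persistent_id,
                         unf=None, version=None):
    """Assemble a human-readable data citation string from its parts."""
    parts = ["; ".join(authors), str(year), f'"{title}"', persistent_id]
    if unf:
        parts.append(unf)            # Universal Numerical Fingerprint
    if version:
        parts.append(f"V{version}")  # dataset version
    return ", ".join(parts)

citation = format_data_citation(
    ["Gary King"], 2014, "Replication Data for: An Example Article",
    "doi:10.7910/DVN/EXAMPLE",       # hypothetical persistent identifier
    unf="UNF:5:abc123==",            # hypothetical fingerprint
    version=1)
print(citation)
```

The point of including the persistent identifier and UNF is that a reader can both locate the exact dataset and verify that the bits they downloaded are the ones the author cited.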
Dataverse is an active project, with new developments in software, protocols, and community connections coming rapidly. A brand-new version of the code, written from scratch, will be available in a few months. Through generous grants from the Sloan Foundation, we have been working hard on eliminating other types of transaction costs involved in capturing data for the research community. These include deep integration with scholarly journals, so that it is trivially easy for an editor to encourage or require that data associated with publications be made available. We presently offer journals three options:
- Do it yourself. Authors publish data to their own dataverse and put the citation to their data in their final submitted paper. Journals verify compliance by having the copyeditor check that the citation exists.
- Journal verification. Authors submit a draft of the replication data to the journal’s Dataverse. The journal reviews it and approves it for release. Finally, the dataset is published with a formal data citation and a link back to the article. (See, for example, the Political Analysis Dataverse, with replication data back to 1999.)
- Full automation. Seamless integration between the journal’s submission system and Dataverse, with an automatic link created between article and data. The result is that the process is easy for both journal and author, and many errors are eliminated.
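Under the do-it-yourself option above, the copyeditor's compliance check amounts to scanning the manuscript for a persistent identifier that signals a data citation. A minimal sketch of such a check; the regular expression and function name are my own illustration, not part of any journal's actual workflow:

```python
import re

# Sketch of the option-1 compliance check: scan manuscript text for a
# persistent identifier (DOI or handle) indicating a data citation.
# The pattern and function are illustrative, not a real journal tool.

PERSISTENT_ID = re.compile(r"\b(doi:10\.\S+|hdl:\S+)", re.IGNORECASE)

def has_data_citation(manuscript_text):
    """Return True if the text contains a DOI- or handle-style identifier."""
    return PERSISTENT_ID.search(manuscript_text) is not None

print(has_data_citation(
    "Replication data are available at doi:10.7910/DVN/EXAMPLE."))  # True
print(has_data_citation("No data citation here."))                  # False
```

A check this simple only confirms that some identifier is present; the journal-verification and full-automation options exist precisely because verifying that the cited data are complete and correct takes more than a pattern match.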
Full automation, our third option, is where we are heading. Already today, in 400 scholarly journals running Open Journal Systems (OJS), the author deposits the data as part of submitting the final draft of the accepted paper for publication, and the citation, permanent links between the data and the article, and formal preservation are all taken care of automatically. We are working on expanding this as an option to all of OJS’s 5,000+ journals, and to a wide array of other scholarly journal publishers. The result will be that we capture data with the least effort on anyone’s part, at exactly the point where it is easiest and most important to capture.
We are also working on extending Dataverse to support the higher levels of security that are prevalent in big data collections and in public health, medicine, and other areas with informative data on human subjects. Yes, you can preserve data and make it available under appropriate protections, even if the data are highly confidential, proprietary, or otherwise sensitive. We are working on other privacy tools as well. Dataverse already has an extensive versioning system, but we are planning to add support for continuously updated data, such as data streamed from sensors; tools for fast online data access, queries, and visualization; analysis methods for when data cannot be moved because of size or privacy concerns; and ways to use the huge volume of web analytics to improve Dataverse and Zelig.
This post is based on the talk I gave at the American Political Science Association meetings in August 2014, using these slides. Many thanks to Mike Alvarez for inviting this post.