SEPTEMBER 2019 CIOAPPLICATIONS.COM

Cloud Dataverse: A Data Repository Platform for the Cloud

Data sharing is being adopted in many scientific communities as the way to make data accessible to others. This is driven by a number of factors, including recent open data policies, funder and journal requirements, and community awareness that reproducing a scientific claim requires access to the underlying data. The research community has developed standards and best practices to incentivize data sharing and improve its quality. Each dataset must have:

1) A data citation to credit data authors
2) A registered global persistent identifier to locate and reference the dataset indefinitely (e.g., a Digital Object Identifier, or DOI)
3) Well-defined restrictions, licenses, and terms of use to know how to access the data
4) Rich metadata describing the dataset to help find and reuse it

(See the Joint Declaration of Data Citation Principles and the FAIR principles.)

The Dataverse repository platform enables the building of repositories without having to implement from scratch all the standards and best practices needed to fully support data sharing and archiving. Dataverse provides additional features such as versioning of datasets, customized virtual repositories within the same hosting infrastructure, multiple roles and permissions to support data management and curation, tiered access based on granted permissions, and APIs to deposit, explore, or visualize the data. The Dataverse software has been developed since 2006 at the Institute for Quantitative Social Science at Harvard University. Like OpenStack, it is open-source, with a growing user and developer community and 22 installations around the world, which can be federated to share metadata.
The Harvard Dataverse repository alone hosts more than 70,000 datasets with contributions from 500 research and academic institutions worldwide.

Cloud Dataverse benefits from the repository infrastructure and rich set of features provided by Dataverse, as well as from cloud technologies that enable storing and computing on large datasets. Our first implementation of Cloud Dataverse is with the Massachusetts Open Cloud (MOC), a regional public cloud effort by Harvard, Boston University, MIT, Northeastern, and UMass along with a community of industry partners.

How does Cloud Dataverse extend Dataverse? First, it integrates with the MOC's OpenStack Swift object storage. Swift provides scalable, low-cost storage optimized for large and unbounded datasets. This integration lets Dataverse users deposit and access large data files directly in Swift storage, without being limited by the Dataverse web interface and APIs, which can only handle datasets up to a few GB. Second, it integrates with the MOC's OpenStack Keystone identity service. This allows data users to find a dataset in a Dataverse repository and seamlessly access the data in the cloud environment using their Dataverse credentials. And third, it integrates with the MOC's OpenStack Sahara service to manage access to computationally intensive data processing frameworks such as Hadoop or Spark.

We are now starting to design how Cloud Dataverse can integrate with other Dataverse repositories so that datasets from federated repositories can be automatically integrated into the cloud. With the convergence of two growing open-source projects, Cloud Dataverse can expand its feature set and be useful to both the scientific and industry communities. But more importantly, Cloud Dataverse represents the necessary next step to combine cloud computing with data sharing.
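
For readers who want a concrete picture of the Swift and Keystone integrations described above, the following is a minimal Python sketch of how a client might address a data file held in Swift object storage and attach a Keystone-issued token. It is not the Cloud Dataverse implementation. The host name, account, container, and object names are hypothetical placeholders, and the helper functions are invented for illustration; only the general Swift URL layout (storage URL, then container, then object) and the `X-Auth-Token` header are standard OpenStack conventions.

```python
from urllib.parse import quote


def swift_object_url(storage_url: str, container: str, object_name: str) -> str:
    """Build the URL of one object under a Swift account's storage URL.

    Hypothetical helper: Swift addresses objects as
    <storage_url>/<container>/<object>.
    """
    base = storage_url.rstrip("/")
    # Object names may contain "/" (pseudo-folders); container names may not.
    return f"{base}/{quote(container, safe='')}/{quote(object_name, safe='/')}"


def auth_headers(keystone_token: str) -> dict:
    """Return the header Swift expects for a Keystone-issued token."""
    return {"X-Auth-Token": keystone_token}


# All values below are placeholders, not real MOC or Dataverse endpoints.
url = swift_object_url(
    "https://swift.example.org/v1/AUTH_demo",
    "dataset-container",
    "raw/part 01.csv",
)
print(url)
print(auth_headers("placeholder-token"))
```

A client would then issue an ordinary HTTP GET or PUT against that URL with those headers, which is how large files can bypass the size limits of a web upload form.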