The Virtual Data Collaboratory (VDC) is a federated data cyberinfrastructure that is designed to drive data-intensive, interdisciplinary and collaborative research, and enable data-driven science and engineering discoveries. VDC accomplishes this by providing seamless access to data and tools to researchers, educators, and entrepreneurs across a broad range of disciplines and scientific domains as well as institutional and geographic boundaries. In addition to enabling researchers to advance research frontiers across multiple disciplines, VDC also focuses on (1) training the next generation of scientists with deep disciplinary expertise and a high degree of competence in leveraging data, cyberinfrastructure, and tools to address research problems and (2) helping data scientists and engineers develop and apply advanced federated data management and analysis tools for high impact scientific applications. To meet this mission, VDC extends beyond its collaborating institutions and leverages NSF investments to provide cyberinfrastructure typically not available to community colleges, state-associated colleges and universities, and regional liberal arts colleges and universities, and to stimulate intense user engagement and adoption by scientists across domains and institutions.

VDC represents state-of-the-art data-intensive computing, storage, and networking solutions, integrated with an innovative data services layer. VDC is federated and coordinated across three geographically distributed Rutgers University campuses in New Jersey and multiple campuses in Pennsylvania and New York, connected by a high-speed network, with the potential to incorporate academic and research institutions across the Mid-Atlantic and the nation. VDC builds on and integrates existing national/international and regional data repositories, including NSF-funded repositories, and leverages local/regional/national ACI investments. Central to the VDC vision are three infrastructural innovations: (1) a regional data science DMZ network that provides services to enable efficient and transparent access to data and computing capabilities; (2) an expandable and scalable architecture for data-centric infrastructure federation; and (3) a data services layer to support research workflows that utilize cutting-edge semantic web technologies, support interdisciplinary research, expand access, and increase the impact of data science worldwide.

The end product is a fully developed system for collaborative use by the research and education community. A data management and sharing system is constructed, based largely on commercial off-the-shelf technology. The system will be integrated with existing research data repositories, such as the Ocean Observatories Initiative and Protein Data Bank repositories. Regional high-performance computing and network infrastructure is leveraged, including New Jersey's Regional Education and Research Network (NJEdge), Pennsylvania's Keystone Initiative for Network Based Education and Research (KINBER), the Extreme Science and Engineering Discovery Environment (XSEDE) computing capabilities, Open Science Grid, and other NSF Campus Cyberinfrastructure investments. The project also develops a custom site federation and data services layer. The data services layer provides services for data linking, search, and sharing; coupling to computation, analytics, and visualization; mechanisms to attach unique Digital Object Identifiers (DOIs), archive data, and publish broadly to internal and wider audiences; and management of the long-term data lifecycle, ensuring immutable and authentic data and reproducible research.

