Role: Senior Personnel
Date: September 2018 - current


Introduction

The Virtual Data Collaboratory (VDC) is a federated data cyberinfrastructure that is designed to drive data-intensive, interdisciplinary and collaborative research, and enable data-driven science and engineering discoveries. VDC accomplishes this by providing seamless access to data and tools to researchers, educators, and entrepreneurs across a broad range of disciplines and scientific domains as well as institutional and geographic boundaries. In addition to enabling researchers to advance research frontiers across multiple disciplines, VDC also focuses on (1) training the next generation of scientists with deep disciplinary expertise and a high degree of competence in leveraging data, cyberinfrastructure, and tools to address research problems and (2) helping data scientists and engineers develop and apply advanced federated data management and analysis tools for high impact scientific applications. To meet this mission, VDC extends beyond its collaborating institutions and leverages NSF investments to provide cyberinfrastructure typically not available to community colleges, state-associated colleges and universities, and regional liberal arts colleges and universities, and to stimulate intense user engagement and adoption by scientists across domains and institutions.

VDC represents state-of-the-art data-intensive computing, storage, and networking solutions, integrated with an innovative data services layer. VDC is federated and coordinated across three geographically distributed Rutgers University campuses in New Jersey and multiple campuses in Pennsylvania and New York by a high-speed network, with the potential to incorporate academic/research institutions across the Mid-Atlantic and the nation. VDC builds on and integrates existing national/international and regional data repositories, including NSF-funded repositories, and leverages local/regional/national advanced cyberinfrastructure (ACI) investments. Central to the VDC vision are three infrastructural innovations: a regional science data DMZ network that provides services to enable efficient and transparent access to data and computing capabilities; an expandable and scalable architecture for data-centric infrastructure federation; and a data services layer to support research workflows that utilize cutting-edge semantic web technologies, support interdisciplinary research, expand access, and increase the impact of data science worldwide.

The end product is a fully developed system for collaborative use by the research and education community. A data management and sharing system is constructed, based largely on commercial off-the-shelf technology. The system will be integrated with existing research data repositories, such as the Ocean Observatories Initiative and Protein Data Bank repositories. Regional high-performance computing and network infrastructure is leveraged, including New Jersey's Regional Education and Research Network (NJEdge), Pennsylvania's Keystone Initiative for Network Based Education and Research (KINBER), the Extreme Science and Engineering Discovery Environment (XSEDE) computing capabilities, Open Science Grid, and other NSF Campus Cyberinfrastructure investments. The project also develops a custom site federation and data services layer. The data services layer provides services for data linking, search, and sharing; coupling to computation, analytics, and visualization; mechanisms to attach unique Digital Object Identifiers (DOIs), archive data, and broadly publish to internal and wider audiences; and management of the long-term data lifecycle, ensuring immutable and authentic data and reproducible research.
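
One way to picture the "immutable and authentic data" goal is a fixity check: a checksum is recorded when a dataset is archived under its identifier, and any later copy is compared against it. The sketch below is illustrative only and does not describe how VDC implements this. The `DatasetRecord` class, its field names, the helper functions, and the placeholder identifier are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a record cannot be modified after creation
class DatasetRecord:
    """Hypothetical catalog entry; the field names are not VDC's schema."""
    identifier: str  # e.g. a DOI assigned when the dataset is published
    sha256: str      # checksum recorded when the dataset was archived


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of the dataset contents."""
    return hashlib.sha256(data).hexdigest()


def register(identifier: str, data: bytes) -> DatasetRecord:
    """Create a catalog record that pins the contents to the identifier."""
    return DatasetRecord(identifier=identifier, sha256=checksum(data))


def is_authentic(record: DatasetRecord, data: bytes) -> bool:
    """Check that a retrieved copy matches the archived checksum."""
    return checksum(data) == record.sha256


if __name__ == "__main__":
    original = b"temperature,salinity\n12.1,35.0\n"
    # Placeholder identifier, not a real DOI.
    record = register("doi:10.0000/example", original)
    print(is_authentic(record, original))
    print(is_authentic(record, original + b"13.0,34.8\n"))
```

A retrieved copy passes the check only if it is byte-for-byte identical to what was archived, which is also what makes a published analysis reproducible against the same input.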


Overarching goals

  • Provide seamless access to data and tools to researchers, educators, and entrepreneurs across a broad range of disciplines and scientific domains as well as institutional and geographic boundaries.
  • Enable data scientists and engineers to develop and apply advanced federated data management and analysis tools for high-impact scientific applications.
  • Train the next generation of scientists with deep disciplinary expertise and a high degree of competence in leveraging data, cyberinfrastructure, and tools to address research problems.


Proposed VDC Architecture

  • Regional science data DMZ network.
  • Scalable data-centric infrastructure federation.
  • Data services to support research.
[Figure: Proposed VDC architecture]


Driving applications

  • Deciphering Sequence and Structural Correlates of Protein Nucleic Acid Interactions (H. Berman & V. Honavar)
  • High-Volume City Data Sharing and Processing for Smart, Resilient, and Sustainable Cities (J. Gong, RU; Z. Zhu, CUNY; X. Liang, University of Pittsburgh; M. Balduccini, Drexel University)
  • Ocean Observatories Initiative (I. Rodero, M. Parashar)


Education and Outreach

  • Incorporate VDC into research-based and general data science/analytics classes so that students can perform large, applied projects (analytics/data science programs at RU, PSU, Drexel, and CUNY).
  • Create a set of easy-to-use modules/online material that could be used for all courses (data management, stewardship, reproducibility, and curation).
  • Leverage NJBDA to impact analytics and data science courses across NJ.
  • Foster learning communities using the online modules to enable peer-to-peer graduate learning through standard meet-up and chat software.


List of personnel

Rutgers University:

  • Ivan Rodero (Principal Investigator)
  • Manish Parashar (Former Principal Investigator)
  • Grace Agnew (Co-Principal Investigator)
  • Barr von Oehsen (Co-Principal Investigator)
  • J. J. Villalobos
  • Ryan Womack

Penn State University:

  • Vasant Honavar (Co-Principal Investigator)
  • Jenni Evans (Co-Principal Investigator)
  • Wayne Figurelle
  • Ryan Gilmore
  • Matt McIntyre
  • Maurie Kelley

NJEdge:

  • Edward Chapel
  • Forough Ghahramani

KINBER:

  • Wendy Huntoon
  • Jennifer Oxenford


Resources

Virtual Data Collaboratory website


Award Information

National Science Foundation - Award Number 1640834