Role: Engineering Lead
Date: February 2017 - July 2019


Introduction

Rutgers’ new supercomputer, named “Caliburn,” is the most powerful system in the state. It was built with a $10 million award to the Rutgers Discovery Informatics Institute (RDI2) from the New Jersey Higher Education Equipment Leasing Fund. High Point Solutions of Bridgewater, N.J., was selected as the lead contractor through a competitive bidding process, with Super Micro Computer Inc. of San Jose, California, serving as the system integrator. Caliburn is built around Intel’s new Omni-Path network interconnect and is among the first clusters to combine the Omni-Path fabric with the latest Intel processors. Along with users at Rutgers, the system is accessible to researchers at other institutions. RDI2 works with the New Jersey Big Data Alliance, founded by Rutgers and seven other universities in the state, to build an industry users program. The capabilities of this new system establish New Jersey's reputation in advanced computing and benefit a broad spectrum of industry sectors and academic disciplines.

Supercomputer

Construction

Phase I

The project was built in three phases. The Phase I system went live in January 2016 and provides approximately 200 teraflops of computational and data-analytics capability and one petabyte of storage to faculty and staff researchers throughout the university. Early users of this system (the Elf cluster) spanned a wide range of disciplines, including chemistry and chemical biology, engineering, genomics, humanities, integrative biology, mathematics, medical informatics, microbiology, physics and astronomy, and proteomics.

The Elf cluster consists of 144 compute servers, each containing two 12-core Intel Xeon E5-2680 v3 processors and 256 GB of main memory. Sixteen of these servers are also equipped with an NVIDIA Tesla K40m graphics processing unit (GPU) accelerator, well suited to the most demanding HPC and big-data problem sets. The cluster's high-performance interconnect is based on Mellanox InfiniBand FDR (56 Gbps), which provides low latency, high scalability, and high reliability.

Phase I also included the deployment of a high-performance parallel file system based on DataDirect Networks (DDN) GRIDScaler, an implementation of IBM Spectrum Scale, previously known as the General Parallel File System (GPFS). Spectrum Scale provides a global namespace, shared file system access, simultaneous file access from multiple nodes, high recoverability and data availability through replication, the ability to make changes while a file system is mounted, and simplified administration even in large environments. DDN GRIDScaler couples these enterprise-grade data protection and availability features and the performance of a parallel file system with DDN's deep expertise and history of supporting highly efficient, large-scale deployments.
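
To illustrate the simultaneous multi-node file access that Spectrum Scale provides, the following is a minimal MPI-IO sketch, assuming mpi4py and NumPy are available on the cluster; the file path /gpfs/scratch/example.dat is hypothetical. Each MPI rank writes its own block of a single shared file, which on GPFS can proceed concurrently from many nodes.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Each rank prepares one block of data to contribute to the shared file.
    block = np.full(1024, rank, dtype=np.float64)

    # Open one shared file on the parallel file system (path is hypothetical).
    fh = MPI.File.Open(comm, "/gpfs/scratch/example.dat",
                       MPI.MODE_CREATE | MPI.MODE_WRONLY)

    # Collective write: every rank writes its block at a disjoint offset.
    offset = rank * block.nbytes
    fh.Write_at_all(offset, block)
    fh.Close()

Launched with, for example, mpirun -n 4 python write_shared.py, the ranks can sit on different nodes and still write disjoint regions of the same file concurrently.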

Technical specs:

The deployed DDN GRIDScaler configuration provides 1 petabyte of usable capacity, served over both InfiniBand and TCP/IP networks.

DDN architecture

Overall, the Phase I system has 3,456 cores and 38 terabytes of main memory, providing 138 teraflops of peak CPU performance, rising to 206 teraflops when the 68 teraflops of peak performance contributed by the GPU server pool are included.
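
These aggregate figures can be sanity-checked from the per-node configuration. The short sketch below does the arithmetic; the 2.5 GHz base clock and 16 double-precision FLOP per cycle per core are assumptions about the Xeon E5-2680 v3 (Haswell, AVX2 with two FMA units), not numbers taken from the report.

    # Back-of-the-envelope check of the Phase I (Elf) aggregate figures.
    NODES = 144
    SOCKETS_PER_NODE = 2
    CORES_PER_SOCKET = 12
    CLOCK_HZ = 2.5e9        # assumed base clock of the Xeon E5-2680 v3
    FLOP_PER_CYCLE = 16     # assumed DP FLOP/cycle/core (AVX2 with two FMAs)

    cores = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
    peak_tflops = cores * CLOCK_HZ * FLOP_PER_CYCLE / 1e12

    print(f"cores    : {cores}")                 # 3456
    print(f"CPU peak : {peak_tflops:.0f} TF")    # ~138 TF, matching the quoted figure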

Phase II

Phase II included a new self-contained modular data center (MDC) that hosts seventeen (17) 42U racks and delivers 350 kilowatts of total power, an average of roughly 20 kW per rack. The cooling system is based on chilled water, with rear-door heat exchangers on the racks. From a safety standpoint, the MDC implements a state-of-the-art fire detection and suppression system.

Phase III

Phase III encompassed the final installation of the supercomputer (the Caliburn cluster) and the last elements of the network. Caliburn provides approximately 600 teraflops of computational and data-analytics capability and over 200 terabytes of non-volatile memory express (NVMe) storage. The cluster is based on a new network interconnect developed by Intel and is among the first to use the Intel Omni-Path fabric. Scaling applications to hundreds or thousands of servers is common practice in today’s HPC clusters, which puts a premium on interconnect bandwidth and latency. The data throughput of first-generation Intel OPA is 100 Gbps in each direction of a link, which corresponds to up to 12.5 GB/s of unidirectional bandwidth and 25 GB/s of bidirectional bandwidth. In addition to high bandwidth, Intel OPA is a low-latency, highly resilient interconnect with a rich set of Quality of Service (QoS) features.
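
As a rough illustration of what these link figures mean for message transfers, the sketch below applies a simple latency-plus-bandwidth (alpha-beta) cost model to a single link; the one-microsecond end-to-end latency is an assumed, illustrative value, not a measured figure for this fabric.

    # Illustrative alpha-beta (latency + bandwidth) model for a single OPA link.
    ALPHA = 1e-6            # assumed end-to-end latency in seconds (illustrative)
    BETA = 1.0 / 12.5e9     # seconds per byte at the 12.5 GB/s unidirectional rate

    def transfer_time(message_bytes: int) -> float:
        """Estimated time to move one message across a single link."""
        return ALPHA + message_bytes * BETA

    for size in (8, 64 * 1024, 16 * 1024 * 1024):
        print(f"{size:>10d} B -> {transfer_time(size) * 1e6:8.1f} us")
    # Small messages are latency-dominated; large ones are bandwidth-dominated.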

Caliburn implements a two-tier Omni-Path fabric. In this configuration, compute nodes are connected to “edge” switches, and these edge switches are in turn connected to “core” switches. Eight (8) core switches and nineteen (19) edge switches make up the Intel OPA network fabric.
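
A rough port budget for such a two-tier fabric can be sketched as follows; the 48-port edge-switch radix and the one-uplink-per-core-switch wiring pattern are assumptions for illustration and do not necessarily reflect the as-built cabling.

    # Rough port budget for a two-tier (edge/core) fabric like Caliburn's.
    EDGE_SWITCHES = 19
    CORE_SWITCHES = 8
    EDGE_RADIX = 48                     # assumed ports per edge switch
    UPLINKS_PER_EDGE = CORE_SWITCHES    # assume one uplink to each core switch

    downlinks_per_edge = EDGE_RADIX - UPLINKS_PER_EDGE
    node_ports = EDGE_SWITCHES * downlinks_per_edge
    oversubscription = downlinks_per_edge / UPLINKS_PER_EDGE

    print(f"node-facing ports : {node_ports}")                # 19 * 40 = 760
    print(f"oversubscription  : {oversubscription:.1f}:1")    # 40/8 = 5.0:1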

Technical specs:

Supercomputer architecture

Omni-Path fabric architecture

Overall, the system has 20,160 cores, 140 terabytes of main memory, and 218 terabytes of non-volatile memory. Sustained performance is 603 teraflops, with a peak performance of 677 teraflops.
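
The quoted peak is consistent with a simple per-core estimate. In the sketch below, the 2.1 GHz clock and 16 double-precision FLOP per cycle per core are assumptions typical of Broadwell-class Xeons with AVX2 FMA, not figures taken from the report.

    # Consistency check of the quoted Caliburn peak figure.
    CORES = 20_160
    CLOCK_HZ = 2.1e9        # assumed base clock
    FLOP_PER_CYCLE = 16     # assumed DP FLOP/cycle/core

    peak_tflops = CORES * CLOCK_HZ * FLOP_PER_CYCLE / 1e12
    print(f"estimated peak: {peak_tflops:.0f} TF")   # ~677 TF, matching the quoted peak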


Technical reports

Caliburn: Advanced Cyberinfrastructure report
High Performance Computing at the Rutgers Discovery Informatics Institute