Feb. 15, 2017 – HPC staff have been hard at work since last November installing the latest iteration of UNT's Talon High Performance Computing (HPC) system. The system provides a computing environment that coordinates hundreds of individual compute servers and allows UNT researchers to run large and complex calculations that model everything from elements at the atomic level to structures on a cosmic scale. This third major upgrade of the Talon HPC system is expected to be available to UNT researchers by the end of February 2017.
The heart of this upgrade is the acquisition of 192 new Dell servers (computers) that feature dual Intel Xeon E5-2680 CPUs with 14 processor cores each, yielding 28 cores per server and 5,376 total cores in this set of compute nodes. In addition, 16 new servers have been acquired, each with the same CPUs as well as dual NVIDIA Tesla K80 Graphics Processing Units (GPUs), each with 4,992 co-processor cores (a total of about 160,000 GPU cores). We have also retained 128 of the Talon 2 servers for continued use, yielding a total of 8,128 CPU cores. With the increased processor core count and improvements in the newest Intel CPUs, Talon's theoretical maximum performance has grown from about 90 teraflops (trillions of floating-point calculations per second) to over 280 teraflops.
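The core counts above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the figures quoted in this article. The variable names are illustrative, and the last value is simply what the stated 8,128-core total implies for the retained servers, not a published specification.

```python
# Figures quoted in the article (variable names are illustrative).
NEW_CPU_NODES = 192            # new Dell compute servers
GPU_NODES = 16                 # new GPU-equipped servers
CPUS_PER_NODE = 2              # "dual" Intel Xeon CPUs
CORES_PER_CPU = 14
GPUS_PER_NODE = 2              # "dual" NVIDIA Tesla K80
CORES_PER_GPU = 4992           # co-processor cores per K80
STATED_TOTAL_CPU_CORES = 8128  # total CPU cores stated in the article

cores_per_node = CPUS_PER_NODE * CORES_PER_CPU
new_cpu_cores = NEW_CPU_NODES * cores_per_node
gpu_node_cpu_cores = GPU_NODES * cores_per_node
gpu_cores = GPU_NODES * GPUS_PER_NODE * CORES_PER_GPU

# Assumption: whatever remains of the stated total belongs to the
# 128 retained servers. This is inferred, not taken from a spec sheet.
implied_retained_cores = (
    STATED_TOTAL_CPU_CORES - new_cpu_cores - gpu_node_cpu_cores
)

print(f"CPU cores per new server:       {cores_per_node}")
print(f"CPU cores in new compute nodes: {new_cpu_cores:,}")
print(f"CPU cores in new GPU nodes:     {gpu_node_cpu_cores:,}")
print(f"GPU co-processor cores:         {gpu_cores:,}")
print(f"Implied retained-server cores:  {implied_retained_cores:,}")
```

The GPU figure works out to 159,744, which is the "about 160,000" quoted above.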
The upgrade started in November 2016, when HPC staff removed 96 of the c32 nodes and 16 GPU nodes from the Talon cluster to prepare for the installation of new computing hardware. New equipment was received and installed during December and January. In late January 2017, a maintenance period was necessary to modify and extend the InfiniBand network infrastructure that supports high-speed communication within the HPC cluster. Included in the scope of this work:
- 200 fiber cables were individually prepared and labeled
- 100 fiber cables had to be untangled, sorted, and tested
- 100 core-to-leaf underground cables were run and connected
- 10 InfiniBand leaves (edge switches) were installed and configured
- 2 new InfiniBand cores (central switches) were installed, configured, and connected
- 1 InfiniBand core switch was moved, configured, and connected
As of mid-February, operating system configurations are being finalized and cluster-installed research applications are being optimized and tested. The same 1.4-petabyte high-performance Lustre scratch file system will continue to be available under the new configuration, along with an additional 700 terabytes of object storage, about which more information will be available at a later date.
Once configuration and testing are completed for the new HPC installation, user accounts will be migrated to the new system. The Talon 2 nodes will then be taken offline and integrated into the Talon 3 system. One major change for Talon users is that, to get the best use out of the newly configured system, Talon 3 will use the SLURM job scheduler. SLURM has an entirely different architecture from the UGE scheduler used on Talon 2, so users will need to use different commands and job scripts. Training seminars on using SLURM are being planned, and sessions will be held both before and after Talon 3 becomes available to help users transition to the new scheduler.
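To give a rough sense of what "different commands and job scripts" means in practice, here is a small sketch that maps a few common UGE commands and script directives to their usual SLURM counterparts. The mappings are generic scheduler equivalents, not Talon 3 settings. The helper name `translate_directive` is hypothetical, and the actual partition names, parallel-environment options, and resource limits on Talon 3 should be taken from the HPC website and training sessions.

```python
# Generic UGE -> SLURM equivalents (not Talon 3-specific configuration).
UGE_TO_SLURM_COMMANDS = {
    "qsub": "sbatch",    # submit a batch job
    "qstat": "squeue",   # list queued and running jobs
    "qdel": "scancel",   # cancel a job
    "qhost": "sinfo",    # show node / partition status
}

UGE_TO_SLURM_DIRECTIVES = {
    "-N": "--job-name",
    "-o": "--output",
    "-e": "--error",
    "-q": "--partition",  # UGE queue roughly corresponds to a SLURM partition
}


def translate_directive(line):
    """Translate one '#$ <flag> [value]' line into an '#SBATCH' line.

    Hypothetical helper for illustration: lines that are not UGE
    directives are returned unchanged, and flags missing from the
    table above are marked for manual review.
    """
    parts = line.split()
    if len(parts) < 2 or parts[0] != "#$":
        return line
    flag, value = parts[1], " ".join(parts[2:])
    slurm_flag = UGE_TO_SLURM_DIRECTIVES.get(flag)
    if slurm_flag is None:
        return f"# TODO (no mapping in this sketch): {line}"
    return f"#SBATCH {slurm_flag}={value}" if value else f"#SBATCH {slurm_flag}"


if __name__ == "__main__":
    for uge_line in ["#$ -N myjob", "#$ -o myjob.out", "#$ -pe mpi 28", "echo hello"]:
        print(translate_directive(uge_line))
```

Options with no one-to-one equivalent, such as UGE parallel environments, are deliberately flagged rather than guessed, since how they translate depends on how the new cluster is configured.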
We ask that Talon users regularly visit the HPC website, https://hpc.unt.edu/, which we will update with critical information such as:
- Basic Talon 3 availability and usage information
- Dates and times of workshops on the new system
- Tips and tricks for making the most efficient use of the system
We are excited to usher in a new era of high-performance computing here at UNT. After the dust clears, we are certain researchers will enjoy access to modern resources that enable their significant research activities.