Research Computing Support Service – Summer 2018 Update

Welcome to the summer 2018 update. Spring was another busy semester, with a number of upgrades and changes detailed below. We will be ordering the new rack in the next few days (see the update for details), and there is still time to invest. If you are interested, please contact us at rcss-support@missouri.edu. For HPC Investor nodes (for research grants that require equipment purchases), a node is estimated to cost $9,675.12 (40 cores). Otherwise, for the new HPC Compute Service, the rate is $2,600 per 10 cores for 5 years, paid in full in advance. This service can be purchased at any time; however, it helps if we have commitments at the time of ordering. We plan to expand/upgrade the cluster every 6-9 months, a rack at a time, depending on demand. Because of this, we will not be adding HPC Investor nodes between orders, which reduces complexity and brings down the node cost. Please let us know in advance if you are interested in investing so we can better plan our upgrades.
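For a back-of-the-envelope comparison, the two purchase options work out to a similar per-core price (using only the figures quoted above):

```python
# Per-core cost comparison of the two purchase options (figures from above).
investor_node_cost = 9675.12          # one 40-core HPC Investor node
investor_per_core = investor_node_cost / 40

compute_service_cost = 2600.00        # 10 cores for 5 years, paid in advance
compute_per_core = compute_service_cost / 10

print(f"Investor: ${investor_per_core:.2f}/core; "
      f"Compute Service: ${compute_per_core:.2f}/core")
```

Note these are list figures only; the Investor price reflects a whole-node, grant-funded purchase while the Compute Service adds flexibility for fractional investments.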

RCSS May 2018 Maintenance Window Overview

  • We upgraded our scientific software management system (Spack and modules) to allow for module auto-loading. Users can now load modules with complex dependency chains in a single step. We currently have 265 software modules (431 total including versions and variants) from a wide spectrum of disciplines.
  • We completed a consolidation and virtualization effort that reduced our infrastructure footprint by two racks, saving power and cooling and making room for more compute and storage.
  • We upgraded the cluster's internal networking infrastructure (DNS, DHCP) and increased the cluster's external connectivity to 80Gbps.
  • The General partition can now run jobs up to four hours in duration; the default remains two hours.
  • There is now an ‘Interactive’ partition for running short test, debug, and interactive jobs. The ‘Interactive’ partition defaults to two hours with a maximum of four hours.
  • SLURM now manages software licenses; jobs that run licensed software (Matlab, SAS) must request a license.
  • The Data Transfer Node (DTN) is degraded due to incompatibilities with the recent CentOS update. Please use the login node for large transfers (its connection has been upgraded). BioCompute users, please use the biologin node.
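The partition and license changes above come together in a job script. A minimal sketch of a short licensed-software job (the module name, license token name, and script body are illustrative, not site-verified):

```shell
#!/bin/bash
#SBATCH --partition=Interactive   # short test/debug partition (max 4 hours)
#SBATCH --time=01:00:00           # request 1 hour (partition default is 2 hours)
#SBATCH --licenses=matlab:1       # SLURM-managed license token (illustrative name)

# Module auto-loading now pulls in complex dependencies automatically.
module load matlab                # illustrative module name
matlab -nodisplay -r "disp('hello'); exit"
```

Requesting the license via `--licenses` lets SLURM hold the job until a seat is free instead of letting it fail at runtime.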

New Services

For researchers who need more computation and storage, we now offer the following new services:

  • HPC Storage Service: Researchers can now purchase additional HPC Storage (/storage/hpc/group) for $15.50/TB/Month (FY2019) billed in 1TB quota increments. This allows researchers to expand and contract the storage space as their research requires.
  • HPC Compute Service: Researchers can now invest in fractional nodes and get immediate fairshare credit. For the upcoming expansion, the rate is $2,600/10 cores/5 years (paid in full in advance). This will allow for better capacity planning and additional volume discounts.

Cluster Expansion

We will expand the cluster again in late June/early July with 32 additional nodes connected by a new 100Gbps EDR InfiniBand fabric to support large parallel jobs. These nodes have the latest-generation Intel CPUs (6th generation) with 40 cores each, for a total of 1280 additional cores. These Intel CPUs are tuned for better machine learning performance, and we look forward to seeing the real-world results. This will bring our physical core count to 7060 (currently 5780).
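The core counts above are easy to verify:

```python
# Sanity check on the expansion arithmetic quoted above.
nodes, cores_per_node = 32, 40
added_cores = nodes * cores_per_node      # 1280 new cores
total_cores = 5780 + added_cores          # new physical core count
print(added_cores, total_cores)           # 1280 7060
```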

New Test Node

We are constantly testing and evaluating new technologies. Our long-term working relationships with vendors keep us in constant contact with the latest technology, and vendors give us access to test equipment as well as larger test facilities. We are currently testing a vendor-provided AMD EPYC system to compare against other architectures and configurations in preparation for a grant proposal. This capability gives us a unique ability to test, architect, and build custom configurations for large investors, as well as provide assurances to granting agencies that these systems will be built, run well, and have broader impact.

Research Computing Community Updates

In addition to working with vendors, we also work within the broader state, regional, and national research computing community to share and discover best practices, solve problems, and advance Cyberinfrastructure as a profession. This community has helped shape our services and the technologies we employ, and has even helped us with simple things like getting specific software packages installed on the cluster. Below are some of the recent highlights:

  • In March 2018 we participated in an NSF-sponsored workshop on the professionalization of Cyberinfrastructure. This workshop was hosted by CaRCC (https://carcc.org/about/), a national consortium for advancing campus CI, of which we are a founding member. This workshop is laying the foundation for CI as a profession, and we were one of the select few invited to participate. A draft paper from this workshop can be found at https://carcc.org/wp-content/uploads/2018/05/CI-Professionalization-Job-Families-Career-Guide.pdf.
  • We are now an XSEDE Level 3 service provider (https://www.xsede.org/ecosystem/service-providers). This allows us to participate in the technical community that supports our national supercomputers, a $110 million NSF program.
  • Our undergraduate student worker just accepted a position at TACC, making CI their career. TACC's Stampede2 is #12 on the Top500 list. Congrats!
  • We just finished architecting, building, and deploying 12 FIONA nodes (https://fasterdata.es.net/science-dmz/DTN/fiona-flash-i-o-network-appliance/) across the Great Plains Network in conjunction with the newly formed Great Plains Research Platform (GPN-RP). The nodes are capable of connecting regional HPC centers at 100Gbps. The GPN-RP is modeled after the Pacific Research Platform (http://prp.ucsd.edu/) and was started in part due to our leadership in connecting local HPC centers to our 100Gbps AL2S connection. The goal is to facilitate cross-institution research that requires high-performance networks.
  • We just upgraded our 100Gbps AL2S connection to IPv6/L3 to regional partners and Internet2 through the GPN. This, combined with the GPN-RP, will allow researchers to securely connect directly to regional and national HPC centers at 100Gbps without being limited by firewalls. To give some perspective, this will allow researchers to transfer one petabyte of data between centers in just over a day.
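The petabyte estimate is easy to reproduce: at the raw 100Gbps line rate the transfer takes about 22 hours, and once realistic protocol and disk overhead is factored in (the ~80% efficiency below is an assumed figure, not a measurement) the wall-clock time lands just over a day:

```python
# One petabyte over a 100 Gbps link (decimal units).
petabyte_bits = 1e15 * 8                    # 8e15 bits
line_rate_bps = 100e9                       # 100 Gbps
seconds = petabyte_bits / line_rate_bps     # 80,000 s
hours_at_line_rate = seconds / 3600         # ~22.2 h

# Assume ~80% effective throughput (illustrative assumption):
hours_realistic = hours_at_line_rate / 0.8  # ~27.8 h, just over a day
print(f"{hours_at_line_rate:.1f} h at line rate, "
      f"{hours_realistic:.1f} h at 80% efficiency")
```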
  • This fall we will participate in the largest temporary research network in the world as part of the SCINet team at Supercomputing 2018, for the second year in a row. This gives us hands-on experience with millions of dollars of next-generation equipment and the professional network around it. Work has already begun in preparation for this event.
  • In January, ShowMeCI (http://showmeci.org/) was formalized as a state-wide Cyberinfrastructure effort for “Sharing Cyberinfrastructure information, education, and resources across the Show Me State.” This gives members easy access to computational resources across the state and a platform for state-wide collaboration.

RCSS Growth and Numbers

Research Computing Support Services (RCSS) passed a major milestone this spring with the replacement of our storage system in January, completing a nearly three-year journey of upgrades, growth, and expansion. We have grown from 90 active users a month using 400,000 core hours (November 2014) to 167 active users a month using 3,000,000 core hours (March 2018), more than seven-fold growth in computation in three and a half years.

  • Users: 830, Groups: 67
  • Active Users/Month: March 167; April 165
  • Software Packages (modules): 265 (431 with versions)
  • Cores: production compute cores/threads 5780 [total cores including hyperthreading]
  • Compute Nodes: 192
  • Nodes built: 817
  • HPC Capacity: 4.23 million core hours/month (30.5 days/month)
  • HPC Cluster utilization (production): March 2,983,949 core hours (71%); April 2,753,853 core hours (65%)
  • GPU Utilization: 30%
  • Ticketing: average 7 tickets a day, 35% grad students, 497 users in 5 months (10/26/2017 – 3/20/2018)
  • HTC Storage: 958TB Allocated
  • CI Engineer: 559 impact hours in 2017, 330 impact hours Jan-March 2018
  • Cluster Investment: 25% community, 20% individual investor, 42% MRI, and 13% BioCompute
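The capacity and utilization figures above are internally consistent, as a quick check shows:

```python
# Monthly HPC capacity and utilization check (24 h/day, 30.5 days/month).
cores = 5780
capacity = cores * 24 * 30.5     # 4,230,960 core hours/month (~4.23M)
march = 2983949 / capacity       # ~71%
april = 2753853 / capacity       # ~65%
print(f"capacity={capacity:,.0f} core hours/month, "
      f"March={march:.0%}, April={april:.0%}")
```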

RCSS Services and FY2019 Service Rates

  • Community High Performance Computing (HPC) – No charge
  • HPC Investor [grant friendly funding, ~$10,500/node]
  • HPC Compute Service [$2,600/10 cores/5 years]
  • HPC Rack Investment – custom configurations with dedicated partitions
  • General Purpose Research Storage (GPRS) [$7/TB/Month] (Columbia, KC)
  • HTC Storage [$120/TB/60Months] – economical, large, low compute intensity
  • HPC Storage [$15.50/TB/Month] – High Performance Parallel Storage
  • UMKC Researcher Managed Backup (rsync) in Kansas City [$120/TB/60Months]
  • Secure4 – DCL4 compute cluster [$1,200/Project/Year; $560/User/Year]
  • 100Gbps IPv6 Internet2/AL2S and regional VLAN connectivity
  • Teaching Cluster (Clark) – 13 HPC Nodes for teaching – No charge
  • Science Gateways (Limited Access)
  • User training and help sessions (Wednesdays, seminars, workshops)
  • Grant writing consultation (letters of support, quotes, configuration)
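Normalizing the storage rates above to a common $/TB/month basis makes the tiers easier to compare (the UMKC backup service shares the HTC rate structure):

```python
# Storage rates normalized to $/TB/month (from the FY2019 rate list above).
rates = {
    "HTC": 120 / 60,           # $120/TB/60 months -> $2.00/TB/month
    "UMKC backup": 120 / 60,   # same rate structure as HTC
    "GPRS": 7.00,              # $7/TB/Month
    "HPC": 15.50,              # $15.50/TB/Month
}
for name, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${rate:.2f}/TB/month")
```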

June 5, 2018