May Maintenance Complete

The May 2018 maintenance window is complete. We updated the internal cluster networking and networking services, and we changed a number of SLURM settings to meet the changing needs of the cluster, most notably increasing the General partition's maximum job length to four hours and adding support for SLURM licenses (Matlab, SAS). The important changes are detailed below. Please review them carefully, as these changes may impact your jobs.

Lewis is expanding soon, and we will be providing new and easier ways to invest and get more fairshare (see below for more details). Please email rcss-support@missouri.edu soon if you are interested in investing.

Important changes to Lewis:

  • The General partition can now run jobs up to four hours in duration, but will still default to two hours.
  • There is now an ‘Interactive’ partition for running short test, debug, and interactive jobs (see the interactive session example after this list). The ‘Interactive’ partition defaults to two hours, with a maximum of four hours. Please do not run a large number of jobs or long jobs on this partition. Users who abuse this will be notified and may have their accounts suspended.
  • The older HPC2 nodes have been taken out of General.
    • Please use ‘Interactive’ for interactive jobs and for short-term testing and debugging.
  • SLURM now manages licenses, and you must request a license to run licensed software (see the batch script example after this list).
    • Matlab now enforces the use of the “--license=matlab:1” flag. Jobs or `srun` sessions that load the matlab module without requesting a license will fail. This allows jobs that require a Matlab license to wait in the queue until a license is free instead of failing. Please see docs.rnet.missouri.edu.
    • SAS now also requires a SLURM license; use the “--license=sas:1” flag.
  • All new software packages will use the new module system and will “autoload” all dependent modules (and specific versions if necessary). Older modules should behave the same.
  • MPI has been upgraded to openmpi/openmpi-2.1.3. For new versions of OpenMPI (including 3.0), you must now use `srun` rather than `mpirun` or `mpiexec` in your jobfile; jobs that use `mpirun` will fail (see the MPI jobfile example after this list).
  • The Data Transfer Node (DTN) is degraded due to incompatibilities with the recent CentOS update. Please use the login node for large transfers; BioCompute users should use the biologin node. We do not have an estimate for when it will be operational again, as this is a known issue with the new version of CentOS.
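
For example, a short interactive session on the ‘Interactive’ partition can be started with `srun`. This is a minimal sketch; the time and memory values are only illustrative:

```
# Start a one-hour interactive shell on the Interactive partition
# (time and memory values are examples; adjust to your needs)
srun -p Interactive --time=01:00:00 --mem=4G --pty /bin/bash
```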
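
Below is a minimal sketch of a batch script that requests a Matlab license so the job waits in the queue until a license is free. The partition, time, script name, and Matlab invocation are illustrative; please check docs.rnet.missouri.edu for the exact module name and version on Lewis:

```
#!/bin/bash
#SBATCH -p General
#SBATCH --time=02:00:00
#SBATCH --license=matlab:1     # wait in the queue until a Matlab license is free

module load matlab             # exact module name/version may differ on Lewis
matlab -nodisplay -nosplash -r "run('my_analysis.m'); exit"
```

For SAS jobs, request “--license=sas:1” instead.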
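
And here is a minimal sketch of an MPI jobfile that launches with `srun` using the upgraded OpenMPI module; the partition, task count, time, and program name are illustrative:

```
#!/bin/bash
#SBATCH -p General
#SBATCH --ntasks=8
#SBATCH --time=02:00:00

module load openmpi/openmpi-2.1.3

# Launch the MPI program with srun; do not use mpirun or mpiexec
srun ./my_mpi_program
```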

We are finishing updates to the documentation for these changes. If you have questions, please email rcss-support@missouri.edu.

The teaching cluster has been named “Clark”, and you can reach it at clark.rnet.missouri.edu in addition to tc.rnet.missouri.edu. The storage system for Clark was updated during this maintenance window and a number of nodes were removed to make room for expanding the Lewis cluster.

In the next maintenance window later this summer, we will add an additional rack with 32 nodes (1,280 cores) on an updated 100Gbps EDR InfiniBand network. If you are interested in investing, please contact rcss-support@missouri.edu soon, as we will be placing the order in the next week or so. The new nodes will have the latest-generation (6th gen) processors with 40 cores, 384GB of RAM, 25Gbps Ethernet, and 100Gbps EDR InfiniBand. For HPC Investors using grant funds, the cost will be approximately $10,500.

We are now offering an HPC Compute Service, which allows you to invest in HPC cores at any time for an immediate increase in capacity (fairshare). Currently a “slice” is estimated at $2,600 for 10 cores for 5 years, paid in full in advance. Investors with more than 40 cores (HPC Investor and HPC Compute Service) will receive 3TB of group storage for the duration of the investment. Please contact us soon if you are interested, as we may need to purchase more nodes if there is large demand.

For users that need high-performance parallel storage, we now offer HPC Storage (/storage/hpc) for $15.50/TB/month, billed monthly in 1TB quota increments, with utilization calculated after compression.

For users that need large, low compute-intensity storage, we continue to offer HTC Storage at $120/TB/5 years, paid in full in advance, with a minimum of 10TB (see docs.rnet.missouri.edu for more details). Please note that HTC Storage is intended for low computational intensity jobs on the cluster. Large parallel jobs or lots of random reads and writes can dramatically impact or crash the HTC Storage system. HTC Storage is built to be large, economical, safe, and performant, in that order. HPC Storage is built for speed.

For the latest information, please watch our website at https://doit.missouri.edu/research/ for updates. We will follow this email with our Spring 2018 update.

Thank you for your patience and support as we continue to work on making Lewis better. Feel free to let us know how we are doing by emailing rcss-support@missouri.edu, or by dropping in for one of our training sessions.

May 29, 2018