Storage Upgrade Complete

After many months of work and planning, the HPC storage upgrade is complete. There has been a massive increase in performance, both for individual jobs and for the cluster overall. Initial testing indicates that individual file read/write speeds are going from tens of MB/s to thousands of MB/s (a 10x-100x speedup for most workloads). We have also built a new all-SSD storage system for /home to improve interactive performance (check out `module avail`!). With different storage technologies in different locations, it is now more important than ever to understand how to make the best use of each storage location. A number of changes have also been necessitated by the design of the new file system, so please read the details below carefully. This and additional information can be found in our updated storage policy.

The most important change is that quotas are no longer defined per path (/data, /scratch, /group) but by the group ID that owns each file. The other important change is that /home, /data, /group, and /scratch all use compression. Quotas are calculated on the compressed size, not the larger actual size of the file, so everyone should be able to store more data within their quota.
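
For example, you can see which group a file is charged to and get a rough idea of how the compressed on-disk usage (what the quota counts) compares with the uncompressed size. This is a minimal sketch; the path below is the new /data location described in the list that follows, so substitute your own directories:

    # The group column is what quota accounting is charged to
    ls -l /storage/hpc/data/$USER

    # On-disk (compressed) usage, which is what counts against the quota
    du -sh /storage/hpc/data/$USER

    # Logical (uncompressed) size of the same files, for comparison
    du -sh --apparent-size /storage/hpc/data/$USER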

  • Home Storage – The home directory has been updated without changing the location (/home/$USER) or quota size. It is high-performance, all-SSD storage that is ideal for source code and (small) file manipulations. You can check your home quota and usage with the command: `df -h /home/$USER`.
  • HPC Storage – This new storage system is where three of our paths now reside: /data, /group, and /scratch. This space has a single shared quota of 100GB unless you have purchased additional storage (HPC Investors get 3TB of HPC storage) or have asked for a higher allowance. This is very different from the previous scheme, as /scratch now has a per-group quota that is shared with the /data folder. You can check your HPC quota for /data and /scratch with the command: `lfs quota -hg $USER /storage/hpc` (see the examples after this list).
    1. /data – The data directory has been updated and moved to /storage/hpc/data/$USER. A symlink to the old path (/data/$USER) has been added for backwards compatibility. Performance is significantly better than before.
    2. /scratch – Likewise, the scratch directory has been updated and is now /storage/hpc/scratch/$USER. A symlink for backwards compatibility has been added (/scratch/$USER). This is also much faster than the previous /scratch folder but, importantly, it now shares a quota with /data. As previously communicated, the /scratch folder was NOT migrated from the old system.
    3. /group – Users who are members of a group will have their files in /group count against the group quota if the files' group ownership is properly set to user:group (as opposed to the default user:user used in the /data directory). Again, performance is much better than before. You can check your group quota with the command: `lfs quota -hg <group-name-here> /storage/hpc` (see the examples after this list).
  • HTC Storage – No change
  • GPRS Storage – No change
  • Local Scratch – No change, but given the performance improvements in the other storage options, advanced users should re-benchmark their runtimes against /scratch or /home.
  • UMKC (Kansas City) Research Managed Archive Storage – No change
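
Here are the quota commands referenced in the list above, gathered in one place, along with an illustrative way to set group ownership so that files in /group are charged to the group quota. The file names and the /group/<group-name-here> layout are examples only – adjust them to your own group and directories:

    # Home quota and usage (all-SSD /home)
    df -h /home/$USER

    # Shared HPC quota covering /data and /scratch (charged to your personal group)
    lfs quota -hg $USER /storage/hpc

    # Group quota for files stored in /group
    lfs quota -hg <group-name-here> /storage/hpc

    # Change a file (or a whole directory tree) in /group to user:group ownership
    # so it counts against the group quota rather than your personal quota
    chgrp <group-name-here> /group/<group-name-here>/results.dat
    chgrp -R <group-name-here> /group/<group-name-here>/project_dir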

Note that some legacy users never had a quota set on one or more of their directories and will now be out of compliance with the new quotas. If your used space is greater than your quota, you will not be able to write to files or create new folders and files – please delete files or move them to a different location (e.g. /home → /data, or /scratch → /group).
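
Below is a minimal sketch of getting back under quota; the directory names are examples, so replace them with your own files and group:

    # See how far over quota you are
    df -h /home/$USER
    lfs quota -hg $USER /storage/hpc

    # Move a large directory from /home into /data
    mv /home/$USER/big_dataset /storage/hpc/data/$USER/

    # Or move scratch results into group space (and re-own them with chgrp,
    # as shown earlier, so they count against the group quota)
    mv /scratch/$USER/results /group/<group-name-here>/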

Thank you, everyone, for being patient during the period when we were seeing performance issues with the Isilon. A special thanks goes to those users who were taxing the system with their research and moved their workloads to our Proof of Concept environment. This allowed us to move to production more quickly and to take load off the existing system.

To give a sense of scale, the change involved over 2,000 drives in 4 racks and the movement of 200,000 GB of data connected to over 200 machines. The new file system contains 960 10TB drives, 35 TB of NVMe storage (faster than SSD), and 2 TB of RAM, all connected by 100Gbps Ethernet (compared to 10Gbps connections on the old system).

If you are seeing increased performance and would like to share how much faster your workflow runs, please send feedback to rcss-support@missouri.edu. If you are seeing performance impacts, we would also like to know so that we can help you improve your workloads (there are some important knobs that users can tune).

Jan. 16, 2018