Spring 2019 Update

This spring saw a lot of changes to the team, with new members arriving and others departing. In January, Brian Marxkors joined us as a Business Technology Analyst; he will help manage our projects, services, and documentation, and will work with researchers on their research computing needs. He was joined in February by Asif Ahamed, our newest researcher-facing Cyberinfrastructure Engineer, who will support researchers with their scientific workflows on the cluster, with a focus on A&S. They join Derek Howard, who came aboard last June as our latest research system administrator supporting the cluster and interactive computing; he is taking over some of the responsibilities of George Robb, who accepted a position at a national lab in April. George was with the team from the beginning, and we thank him for his time, effort, and enthusiasm over the years. Our team continues to grow, and we have open positions for a researcher-facing Cyberinfrastructure Engineer and a research Linux systems administrator.

There have been a number of smaller changes to the cluster this spring. We added two new GPU nodes, each with three NVIDIA V100 GPUs (32 GB of GPU memory per card), 40 CPU cores, and 384 GB of RAM, for a total of six V100 cards. We also addressed some login node latency issues by replacing the login node with new hardware; the issues were very hard to debug and we could not determine their root cause.
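For researchers who want to try the new GPU nodes, a batch script along these lines is one way to request a V100 (a minimal sketch, assuming the cluster schedules with Slurm; the GRES name, resource amounts, and the lack of a partition flag are assumptions and may differ on our system):

```shell
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --gres=gpu:v100:1     # one V100 card (GRES name is an assumption)
#SBATCH --cpus-per-task=10    # a share of the node's 40 cores
#SBATCH --mem=96G             # a share of the node's 384 GB of RAM
#SBATCH --time=01:00:00

# Print the allocated GPU's model and memory to confirm the request
nvidia-smi --query-gpu=name,memory.total --format=csv
```

Submitting with `sbatch` and checking the job's output file should show the V100 and its 32 GB of memory if the request was satisfied.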
The big issue this spring was with the HTC storage system, and it has since been resolved. The outages were triggered by memory (DRAM) failures in the storage node. The first failure triggered an integrity scan that kept the storage system offline for a long time while the data was checked. A second DRAM failure led us to discover a secondary issue, a controller driver bug, that was the root cause of both extended outages. The bug was only triggered by the integrity scan, and it was found and fixed with the help of the Campus Champions community.

This is a reminder that HTC storage was designed for economically storing large datasets (hundreds of terabytes) that are infrequently accessed. It was not designed for high availability, large numbers of files, direct computation, or heavy random I/O. The system's design goals are capacity, cost, simplicity, and performance, in that order, which means we made a number of trade-offs during design to reach a low cost point.
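One practical consequence: because the system handles a few large sequential writes far better than many small files, it helps to bundle small files into an archive before moving them over. A sketch of that workflow (all paths here are placeholders, not real HTC mount points):

```shell
# Illustrative sketch: bundle many small output files into one archive
# so the transfer to HTC storage is a single large sequential write.
# "results/" and "htc_scratch/" are placeholder paths.
mkdir -p results htc_scratch
echo "sample output" > results/run_001.txt

# One compressed archive instead of thousands of individual files
tar -czf results.tar.gz results/

# Move the single archive to the (placeholder) HTC path
cp results.tar.gz htc_scratch/
```

Retrieving the data later is the reverse: copy the archive back to scratch or working storage and extract it there, rather than computing against HTC storage directly.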

May 8, 2019