GPU

Description

  • Some scientific workflows are greatly accelerated when they run on one or more Graphics Processing Units (GPUs). The Lewis Cluster includes a partition dedicated to GPU processing to accommodate these workflows. Two classes of NVIDIA GPUs are available on Lewis: GeForce and Tesla.

GPU Capabilities

17 GPU nodes spanning two generations (a request sketch follows this list):

  • GPU3 (15 Nodes)
    • NVIDIA GeForce GTX 1080 Ti
    • NVIDIA Tesla K40m
    • NVIDIA Tesla K20Xm
    • NVIDIA Tesla P100
  • GPU4 (2 Nodes)
    • NVIDIA Tesla V100
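
A minimal sketch of requesting one of these GPUs through SLURM's generic-resource (GRES) syntax. The partition name (Gpu) and the typed token (v100) are assumptions, since GRES type names are site-specific; confirm them with sinfo and scontrol show node on a GPU node:

    #!/bin/bash
    #SBATCH --partition=Gpu        # assumed partition name; confirm with sinfo
    #SBATCH --gres=gpu:1           # one GPU of any available type
    ##SBATCH --gres=gpu:v100:1    # hypothetical typed request; GRES type tokens
                                   # are site-specific, check scontrol show node
    #SBATCH --time=02:00:00        # within the 2-hour partition limit (see Policies)

    nvidia-smi                     # report which GPU(s) SLURM actually allocated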

Example Use Case

  • As a researcher, I want to train a neural network to classify images, but my project budget does not cover the cost of purchasing and managing the GPU hardware required to complete this task.

Policies

  • The GPU partition must only be used for GPU-accelerated workflows. Jobs running on the GPU partition that do not utilize the GPU are subject to cancellation and potential loss of GPU partition access.
  • The use of srun for active development and debugging is permitted but is limited to allocations of 2 hours or less. Leaving srun sessions idle for extended periods or holding an excessive number of concurrent srun sessions is not permitted (see the interactive sketch after this list).
  • GPU jobs that utilize only 1 GPU should be structured so that other jobs can share the node: the SLURM --exclusive option should NOT be used, and CPU core, memory, and GPU requests should be ‘right-sized’ to the workload. Resource requests should specify the correct class and quantity of GPUs for the algorithm (see the batch script sketch after this list).
  • No more than 50% of the partition resources will be available for concurrent use by any single user.
  • Jobs on the “GPU” partition are limited to 2 hours by default. Jobs of up to 48 hours require “GPU” group membership and additional training.
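
For interactive development and debugging within the 2-hour srun limit above, a hedged one-liner; the partition name follows the same assumption as in the earlier sketch:

    srun --partition=Gpu --gres=gpu:1 --cpus-per-task=4 --mem=16G \
         --time=01:00:00 --pty /bin/bash    # interactive shell on a GPU node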
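
And a minimal right-sized batch script for a single-GPU job, per the node-sharing policy above. The module name and training script are hypothetical placeholders:

    #!/bin/bash
    #SBATCH --job-name=train-cnn
    #SBATCH --partition=Gpu        # assumed partition name; confirm with sinfo
    #SBATCH --gres=gpu:1           # exactly one GPU; note no --exclusive
    #SBATCH --cpus-per-task=4      # modest CPU share so other jobs can use the node
    #SBATCH --mem=32G              # right-sized memory, not the whole node
    #SBATCH --time=02:00:00        # runs up to 48 hours need “GPU” group membership

    module load cuda               # hypothetical module name; check module avail
    python train.py                # placeholder for the actual GPU workload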