Protecting Parallel File Systems from Harm

Modern supercomputers are ushering in a new era of High-Performance Computing (HPC), providing unprecedented compute power for large-scale parallel applications. However, contrary to long-held assumptions about HPC workloads, in which applications were compute-bound and write-dominated (such as scientific simulations), modern applications like Deep Learning training are Read more…

Container Orchestration on HPC Platforms

The last decade witnessed a new era of software development in which developers can write applications independently of the target environment by packaging them, together with their dependencies and environment variables, inside containers. Numerous studies [1-2] have shown that containers are optimal for building and running applications reliably on Read more…

Maintaining Storage Health over Time

Storage is a crucial aspect of HPC clusters. Storage nodes are typically shared among multiple users and are highly utilized – little space sits unused. They also sustain heavy read and write traffic – data is typically written, deleted, and rewritten many times. Read more…

Improving Storage QoS for HPC Centers

Data-centric applications (e.g., data analytics, machine learning, deep learning) running at HPC centers require efficient access to digital information in order to provide accurate results and new insights. Users typically store this information on a shared parallel file system (e.g., Lustre, GPFS) available at HPC facilities. This is Read more…