Modern supercomputers are establishing a new era in High-Performance Computing (HPC), providing unprecedented compute power that enables parallel applications to run at massive scale. However, contrary to long-lived assumptions about HPC workloads, where applications were compute-bound and write-dominated (such as scientific simulations), modern applications like Deep Learning training are data-intensive, read-dominated, and generate massive amounts of metadata operations. Indeed, several research centers, including TACC and AIST, have already observed a surge of metadata operations in their clusters, and they expect this to become more severe over time. These operations put increasing pressure on the cluster's shared Parallel File System (PFS), which is where users store and access their data, causing I/O contention and overall performance degradation. In such an environment, the main challenge that arises is: how can we protect the PFS from harmful I/O workloads?

The metadata problem in PFS

PFSs are the storage backbone of HPC infrastructures, used to store and retrieve, on a daily basis, petabytes of data from hundreds to thousands of concurrent jobs. Most Top500 supercomputers rely on PFSs such as Lustre, BeeGFS, and PVFS as their main storage system. These file systems can be made of hundreds of servers and are organized into three main components with different roles in how users store and access their data. Metadata servers (MDSs) maintain the file system namespace, keeping track of file names, permissions, and the location where files are stored, while storage servers (OSSs) store the actual data. A third component, the file system client, resides at the compute nodes (where jobs are executed) and its main goal is to let users interact with the PFS as if it were a regular local file system, such as ext4 or xfs.

However, regardless of the application, workload, or job, whenever a file needs to be accessed (e.g., create, open, or remove a file, or check access permissions), the I/O path always flows through the metadata server. When creating new files, the file system client asks the MDS to create a new entry in the namespace and to assign the object storage targets (OSTs) that will persist the actual data; for existing files, the MDS returns their location (i.e., the OSTs that hold the data) to the file system client. Given that the number of file system clients is several times higher than the number of available MDSs, this centralized design can severely bottleneck the file system when several concurrent jobs exhibit aggressive metadata I/O behavior, impacting the performance of all running jobs.
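To make this concrete, consider the sketch below of a metadata-heavy job (the file names and counts are hypothetical). Every creat, stat, and unlink must be resolved by the MDS before any data reaches the storage servers, so a loop like this, run across hundreds of processes, translates into millions of metadata requests hitting a handful of MDSs.

```c
/* Illustrative sketch (not taken from any specific application): a
 * metadata-heavy job. Each call below is resolved by the MDS before any
 * data touches the OSSs. Paths and counts are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    char path[256];
    struct stat st;
    for (int i = 0; i < 1000000; i++) {
        snprintf(path, sizeof(path), "/lustre/scratch/job42/sample-%d.dat", i);
        int fd = creat(path, 0644);   /* namespace insert + OST assignment at the MDS */
        if (fd >= 0) close(fd);
        stat(path, &st);              /* attribute lookup served by the MDS */
        unlink(path);                 /* namespace removal at the MDS */
    }
    return 0;
}
```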

Why is this still a problem?

By now, one might be wondering: “if this is such a big problem, why isn't there already a solution for it?”. It turns out that this challenge is more complex than it seems. In fact, while there are some approaches that aim at solving this problem, all of them suffer from at least one of the following shortcomings.

Manual intervention: in several HPC research facilities, system administrators stop jobs with aggressive metadata I/O behavior and temporarily suspend job submission access for users who do not comply with the cluster's guidelines. While this helps protect the file system from metadata-aggressive users, it is a reactive approach that is only triggered after the job has already slowed down the storage system and the other jobs in execution.

Intrusiveness to critical cluster components: while some solutions propose optimizations to mitigate I/O contention and performance variability, they are tightly coupled to the shared PFS. Such an approach requires a deep understanding of the system's internal operation model and profound code refactoring, increasing the work needed to maintain the solution and port it to new platforms.

Partial visibility and I/O control: some solutions overcome the previous challenges by acting at the compute node level, enabling QoS control from the application side and thus not requiring changes to core components of the HPC cluster. However, these solutions can only control the resources where they are placed (acting in isolation), and are unable to holistically coordinate the I/O generated by multiple jobs that compete for the shared storage.

The PADLL storage middleware

To address these challenges, we built PADLL, an application- and file system-agnostic storage middleware that enables QoS control of metadata workflows in HPC storage systems. Fundamentally, it allows system administrators to proactively and holistically control the rate at which requests are submitted to the PFS from all running jobs in the HPC system.

PADLL adopts ideas from the Software-Defined Storage paradigm [1, 2], following a decoupled design made of two planes of functionality: control and data. The data plane is a multi-stage component distributed over the compute nodes, where each stage transparently intercepts and rate limits the I/O requests between a given application and the shared PFS. This enables general applicability and cross-compatibility with POSIX-compliant file systems, and requires no changes to applications or to any core component of the HPC cluster.
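As a rough illustration of what transparent interception could look like, the sketch below hooks the POSIX open call with an LD_PRELOAD shim and throttles it with a simple token bucket before forwarding it to libc. This is an assumption-laden sketch for illustration only, not PADLL's actual implementation; in PADLL's design, the rate would be set by the control plane rather than hardcoded.

```c
/* Minimal sketch of a data plane stage: intercept open() via LD_PRELOAD and
 * rate limit it with a token bucket. Illustrative only; not PADLL's code.
 * Build and run (single-threaded example):
 *   gcc -shared -fPIC -o shim.so shim.c -ldl
 *   LD_PRELOAD=./shim.so ./application */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <time.h>
#include <unistd.h>

static double tokens = 0;                /* token bucket: operations allowed right now */
static double rate_ops_per_sec = 1000;   /* hypothetical limit, would come from the control plane */
static struct timespec last;

static double elapsed(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
}

static void throttle(void) {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    if (last.tv_sec == 0 && last.tv_nsec == 0) last = now;
    tokens += elapsed(last, now) * rate_ops_per_sec;     /* refill the bucket */
    if (tokens > rate_ops_per_sec) tokens = rate_ops_per_sec;  /* cap the burst size */
    last = now;
    while (tokens < 1.0) {                /* over budget: wait until a token is available */
        usleep(1000);
        clock_gettime(CLOCK_MONOTONIC, &now);
        tokens += elapsed(last, now) * rate_ops_per_sec;
        last = now;
    }
    tokens -= 1.0;
}

/* Intercept open(): throttle, then forward to the real libc symbol. */
int open(const char *path, int flags, ...) {
    static int (*real_open)(const char *, int, ...) = NULL;
    if (!real_open)
        real_open = (int (*)(const char *, int, ...)) dlsym(RTLD_NEXT, "open");
    throttle();
    mode_t mode = 0;
    if (flags & O_CREAT) {               /* mode is only present when creating a file */
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }
    return real_open(path, flags, mode);
}
```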

The control plane is a logically centralized entity with global system visibility that continuously monitors and manages all running jobs by adjusting the I/O rate of each data plane stage. It does so by letting system administrators express rate limiting rules at per-job, per-group-of-jobs, or cluster-wide granularity.
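The sketch below illustrates, with hypothetical field names and values rather than PADLL's actual rule format, the kind of policy this enables: a tight cap on a metadata-aggressive job, a looser limit for a group of jobs, and a cluster-wide ceiling as a safety net.

```c
/* Hypothetical sketch of rules an administrator might push from the control
 * plane to the data plane stages. Field names, scopes, and the values are
 * assumptions for illustration, not PADLL's interface. */
#include <stdint.h>
#include <stdio.h>

enum rule_scope { SCOPE_JOB, SCOPE_JOB_GROUP, SCOPE_CLUSTER };

struct rate_rule {
    enum rule_scope scope;     /* granularity of the rule */
    const char *target;        /* job id, group name, or "*" for cluster-wide */
    const char *op_class;      /* e.g., "metadata" vs. "data" operations */
    uint64_t ops_per_second;   /* limit enforced by each data plane stage */
};

int main(void) {
    struct rate_rule rules[] = {
        { SCOPE_JOB,       "job-1337", "metadata", 2000   },  /* throttle one aggressive job */
        { SCOPE_JOB_GROUP, "genomics", "metadata", 10000  },  /* shared budget for a group */
        { SCOPE_CLUSTER,   "*",        "metadata", 250000 },  /* cluster-wide ceiling */
    };
    for (unsigned i = 0; i < sizeof(rules) / sizeof(rules[0]); i++)
        printf("scope=%d target=%s class=%s limit=%lu ops/s\n",
               rules[i].scope, rules[i].target, rules[i].op_class,
               (unsigned long) rules[i].ops_per_second);
    return 0;
}
```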

The vision (and a preliminary version) of PADLL [3] was published at the 2nd Workshop on Re-envisioning Extreme-Scale I/O for Emerging Hybrid HPC Workloads, co-located with the IEEE International Conference on Cluster Computing. This work is a joint effort between INESC TEC, TACC, AIST, and Intel. The BigHPC team is currently working on the final version of PADLL, so that it can be used by research facilities that experience problems similar to those described here.

References:

[1] Ricardo Macedo, João Paulo, José Pereira, Alysson Bessani. A Survey and Classification of Software-Defined Storage Systems. ACM Computing Surveys 53, 3, Article 48, 2020.

[2] Ricardo Macedo, Yusuke Tanimura, Jason Haga, Vijay Chidambaram, José Pereira, João Paulo. PAIO: General, Portable I/O Optimizations With Minor Application Modifications. 20th USENIX Conference on File and Storage Technologies, 2022.

[3] Ricardo Macedo, Mariana Miranda, Yusuke Tanimura, Jason Haga, Amit Ruhela, Stephen Lien Harrell, Richard Todd Evans, João Paulo. Protecting Metadata Servers From Harm Through Application-level I/O Control. 2nd Workshop on Re-envisioning Extreme-Scale I/O for Emerging Hybrid HPC Workloads, co-located with the IEEE International Conference on Cluster Computing, 2022.

Ricardo Macedo, INESC TEC and University of Minho

November 2, 2022