Monitoring can be described as a three-step process: collecting, storing, and alerting. Each step is intrinsically simple and understandable: collecting gathers the necessary data, which can come from a temperature sensor, a RAM usage counter, a power consumption meter, or the number of firewall connections currently open; storing keeps this data available for later analysis; and alerting triggers warnings, either to users or to sysadmins, when a certain threshold is reached.
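The three steps above can be sketched in a few lines of Python. This is a minimal illustration: the metric source, the 90% threshold, and the alert channel are placeholders, not part of any real monitoring stack.

```python
import time
from typing import Callable, List, Tuple


def monitor_once(collect: Callable[[], float],
                 store: List[Tuple[float, float]],
                 threshold: float,
                 alert: Callable[[float], None]) -> None:
    """One monitoring cycle: collect a sample, store it, alert on threshold."""
    value = collect()                   # collecting: gather the necessary data
    store.append((time.time(), value))  # storing: keep it for later analysis
    if value >= threshold:              # alerting: warn when the threshold is reached
        alert(value)


# Example: simulated RAM usage samples with a 90% alert threshold.
samples = iter([55.0, 95.5])
history: List[Tuple[float, float]] = []
for _ in range(2):
    monitor_once(lambda: next(samples), history, 90.0,
                 lambda v: print(f"ALERT: RAM usage at {v:.1f}%"))
```

In a real agent, `collect()` would read a sensor or counter, `store` would be a time-series database rather than an in-memory list, and `alert` would notify a sysadmin instead of printing.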
Big Data issues
In the Big Data scenario, the applications in use are what we call data-centric, and the software was not developed with time-sharing environments in mind. This combination of characteristics leads to irregular workloads with arbitrary and intensive I/O: random, frequent, small reads and writes over many small files. The software is highly specialized, with complex requirements and constantly evolving versions. These programs expect exclusive access to the systems on which they run, resulting in irregular workloads, unpredictable resource usage, and long-lived services, which is the opposite of traditional scientific workloads. In this scenario, storage is one of the major bottlenecks and the one being tackled within the BigHPC project. Here the compute nodes stress the storage component, the parallel file system, and since the file system is shared by all users, it is one of the most important variables to monitor. Alongside the file system itself, the network interfaces, responsible for data transfer between the file system and the compute nodes, are also important to monitor. Additionally, since the software is highly specialized and constantly evolving, containers have been seen as a solution for running these applications in HPC centers. This also raises permission issues on the compute nodes, with most HPC centers resorting to unprivileged containers using technologies such as Singularity.
The usual challenges of monitoring are amplified in HPC infrastructures, since the number of variables per node is large and there are hundreds to thousands of servers, each composed of multiple nodes, to monitor. This creates a scalability problem: a lot of data is generated and must be stored, some of which must be kept for historical purposes (e.g. cluster usage, workload status, accounting), while the rest may be discarded after some time. We must therefore choose carefully which metrics to collect and store, and which to drop, or we may fail to capture precious information that turns out to be needed in the future. Another challenge is the sampling interval for collecting metrics: too long and we may miss important events; too short and we will generate excess information that becomes difficult to manage in the long term. An additional issue, common to most HPC centers for security reasons, is that data transfer on the compute nodes is limited to pushing data from the node; data cannot be pulled from outside. This has an impact on the monitoring infrastructure, which must be able to collect data that is sent from the nodes at arbitrary times. Lastly, the monitoring agent should avoid consuming resources that are meant for normal workloads; its overhead should not exceed 5%.
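Under a push-only policy, the agent on each node timestamps and serializes its samples, then posts them to a collector that only listens and never pulls. The gateway URL, payload shape, and metric names below are assumptions for illustration, not a BigHPC interface.

```python
import json
import time
import urllib.request


def build_payload(node: str, metrics: dict) -> bytes:
    """Timestamp and serialize one batch of samples on the node side."""
    return json.dumps({
        "node": node,
        "timestamp": time.time(),
        "metrics": metrics,  # e.g. {"ram_pct": 72.3, "net_rx_mb": 510}
    }).encode("utf-8")


def push(gateway_url: str, payload: bytes) -> None:
    """Push the batch from the node; the collector never initiates a pull."""
    req = urllib.request.Request(
        gateway_url, data=payload,
        headers={"Content-Type": "application/json"}, method="POST")
    urllib.request.urlopen(req, timeout=5)


payload = build_payload("node042", {"ram_pct": 72.3, "net_rx_mb": 510})
# push("http://monitor.example/ingest", payload)  # hypothetical endpoint
```

Because the nodes decide when to send, the collector side must tolerate batches arriving at arbitrary times and buffer or drop late data accordingly.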
BigHPC specific challenges
In the BigHPC project, as in most HPC centers, root access to the nodes is forbidden, so we are limited to measuring publicly exposed values or our own user-level metrics. Also, since we will be resorting to containers, the monitoring agent must run on the node under the same username as the workload container in order to collect metrics from the workloads inside it. For ease of deployment, all BigHPC components should share a common database, and the other components under development also require a database that can store, for example, binary files. The monitoring backend must inform the virtual manager of the status of the running workloads and provide up-to-date information on where to allocate new workloads. Additionally, the monitoring component must provide the storage manager with the most up-to-date information about storage usage. Failures in HPC scheduling affect currently running jobs by wasting scheduling and queuing time.
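Even without root, an agent can read publicly exposed values from /proc for the processes it owns, which is enough to track a workload running under the same username. The sketch below is Linux-specific and illustrative only; it assumes the workload's processes share the agent's user ID.

```python
import os


def own_process_pids() -> list:
    """PIDs owned by the current (unprivileged) user, read straight from /proc.

    No root needed: process ownership in /proc is visible to every user.
    """
    uid = os.getuid()
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            if os.stat(f"/proc/{entry}").st_uid == uid:
                pids.append(int(entry))
        except FileNotFoundError:
            continue  # the process exited between listdir() and stat()
    return pids


def rss_kb(pid: int) -> int:
    """Resident memory (in kB) of one process, from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0  # e.g. a zombie process with no resident pages


for pid in own_process_pids():
    try:
        print(pid, rss_kb(pid), "kB")
    except FileNotFoundError:
        pass  # process ended before we could read it
```

A container launched by an unprivileged runtime such as Singularity runs its processes under the invoking user, so the same /proc walk covers the workload inside the container as well.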
Both the recipients and targets of the BigHPC project have very different goals regarding the monitoring system. Sysadmins want to know how the cluster is performing and to receive alerts about malfunctioning nodes or misbehaving workloads; they also need long-term monitoring for accounting purposes, covering both the cluster and the users' workloads. Additionally, within the scope of this project, they must know when to launch a workload, hence the need for real-time data.
Users, on the other hand, are mainly interested in their own workloads: they may want to detect execution points where a workload underperforms or could be improved, and they might also be interested in visualizing their data or simulation results.
Stay tuned for the next posts, where we will share the monitoring solution being developed within the BigHPC project, which will address the issues identified in this post.
Bruno Antunes, Wavecom
February 11, 2022