HPC services are no longer solely targeted at traditional HPC applications, i.e., highly parallel modeling and simulation tasks. Indeed, the computational power offered by these services is now being used to support data-centric Big Data and Artificial Intelligence (AI) applications. By combining both computational paradigms, HPC infrastructures will be key for improving the lives of citizens, speeding up scientific breakthroughs in different fields (e.g., health, IoT, biology, chemistry, physics), and increasing the competitiveness of companies.

These three types of applications differ both in how they interact with and leverage computational resources and in the requirements they impose on each type of resource.

The focus of traditional HPC applications is on achieving extreme performance on computationally complex problems, typically simulations of real-world systems (think explosions, oceanography, global weather hydrodynamics, or even cosmological events such as supernovae). This has meant very large parallel processing systems with hundreds, or even thousands, of dedicated compute nodes and vast multi-layer storage appliances, connected by high-speed networks. These applications typically require large slices of computational time and are CPU-bound. Data-driven Big Data applications involve shorter computational tasks and, in some cases, have real-time response requirements (i.e., they are latency-oriented). However, they typically use large datasets, and I/O consumes a significant portion of their runtime. In Artificial Intelligence, complex algorithms train models that analyse multiple sources of streaming data to draw inferences and predict events so that real-time actions can be taken. Many of these applications require specific hardware (e.g., GPUs) to run efficiently.

We are seeing applications from one domain employing techniques from the others. For systems to converge, they must handle the full range of scientific and technical computing applications, and as workloads converge they will require platforms that bring these technologies together. There is an ongoing effort to develop unified infrastructures with converged software stacks and shared compute resources.

With this convergence, HPC infrastructures must support general-purpose AI and Big Data applications that were not designed explicitly to run on specialized HPC hardware and whose users are accustomed to different tools. For example, HPC users submit applications to an HPC workload manager such as Slurm or PBS Pro, which runs them in batch mode as soon as resources become available. Big Data and machine learning applications, on the other hand, are primarily run in containers under the jurisdiction of a container orchestration system such as Kubernetes. A simple way to run both in one cluster would be a static partition: dedicate one subset of nodes to Kubernetes for the Big Data and machine learning applications, and run Slurm on the remaining subset. However, this approach is rigid and prevents the cluster from adapting to demand, for example by devoting the majority of the nodes, or even the whole cluster, to an important HPC simulation or to training a model.
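To make the contrast concrete, the sketch below submits work through the two paths described above: a batch job handed to Slurm with sbatch, and a containerised training job applied to Kubernetes with kubectl. It is an illustration under assumptions, not part of BigHPC itself; the job names, the simulation binary, and the container image are hypothetical placeholders, while the submission mechanisms (Slurm batch directives, a Kubernetes Job manifest) follow the real tools.

```python
import subprocess
import textwrap

# Illustration only: under a static Slurm/Kubernetes split, the same cluster
# receives work through two very different front doors.

# 1) HPC path: write a Slurm batch script and submit it with sbatch.
slurm_script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=hpc-simulation
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=32
    #SBATCH --time=02:00:00
    srun ./simulation_binary      # hypothetical MPI executable
""")
with open("job.sbatch", "w") as f:
    f.write(slurm_script)
subprocess.run(["sbatch", "job.sbatch"], check=True)

# 2) Big Data / ML path: apply a Kubernetes Job manifest with kubectl.
k8s_job = textwrap.dedent("""\
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: training-job
    spec:
      template:
        spec:
          containers:
          - name: trainer
            image: registry.example.org/ml/trainer:latest  # hypothetical image
            resources:
              limits:
                nvidia.com/gpu: 1
          restartPolicy: Never
""")
with open("job.yaml", "w") as f:
    f.write(k8s_job)
subprocess.run(["kubectl", "apply", "-f", "job.yaml"], check=True)
```

Under a static split, each of these submissions can only reach its own subset of nodes, which is exactly the rigidity described above.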

Instead, a single cluster should be able to share its resources dynamically across different types of applications and workloads. BigHPC will simplify the management of HPC infrastructures that support Big Data and parallel computing applications dynamically in the same cluster. It aims at enabling both traditional HPC and novel Big Data analytics applications to be deployed on top of heterogeneous HPC hardware. The MACC and TACC infrastructures are being used to validate the BigHPC components.

Ricardo Vilaça, MACC

September 27, 2022