Storage is a crucial aspect of HPC clusters. Storage nodes are typically shared among multiple users, and are highly utilized – there isn’t a lot of unused space. The storage nodes also experience a lot of reads and writes – data is typically written, deleted, and written again multiple times. In such an environment, the challenge is maintaining the health of the storage over time. 

Storage Health

The health of a storage unit, such as a file system, is reflected in the performance it provides for reads and writes. For example, a newly created file system might offer hundreds of megabytes per second of sustained read and write performance. However, as the file system is used, with files being created and deleted over time, its sustained read and write performance can drop to a third or less of what it offered when newly created.

This problem is well known on traditional file systems and is termed fragmentation. Essentially, the data gets scattered all over the device over time, and finding and reading/writing it becomes expensive. The root cause is that on traditional storage devices, reading/writing data sequentially is about 100 times faster than reading/writing it in a random fashion.

Traditional file systems counter fragmentation by employing a defragmentation tool that collects scattered data together and rewrites it, so that accesses get sequential performance rather than random performance. File systems are occasionally defragmented to maintain good read and write performance; this maintenance operation is crucial for preserving the health of a file system.
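
To make this concrete, below is a minimal sketch of the simplest form of defragmentation: rewriting a file in a single sequential pass so that the new copy is allocated contiguously. Real tools (e4defrag on ext4, for example) work in place and are far more careful; this sketch also assumes the file system has enough contiguous free space for the new copy.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int defrag_by_copy(const char *path, const char *tmp_path) {
        int in = open(path, O_RDONLY);
        int out = open(tmp_path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (in < 0 || out < 0)
            return -1;  /* error handling abbreviated throughout */

        char buf[1 << 16];  /* copy 64 KiB at a time */
        ssize_t n;
        /* One sequential read/write pass: the allocator hands the new
         * copy fresh blocks, contiguous if free space allows, which
         * restores sequential read performance. */
        while ((n = read(in, buf, sizeof buf)) > 0)
            if (write(out, buf, n) != n)
                return -1;

        close(in);
        if (fsync(out) != 0 || close(out) != 0)
            return -1;
        return rename(tmp_path, path);  /* atomically replace the original */
    }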

Does aging matter for persistent memory? 

In the BigHPC project, one of the problems we are studying is how to effectively use new technologies, such as Persistent Memory (PM), in HPC clusters. PM is a new memory technology introduced by Intel. One can think of PM as similar to main memory (Dynamic Random Access Memory, or DRAM); unlike DRAM, though, PM retains data even if the machine is powered down or rebooted. There is a lot of interest in how to effectively use PM in various settings, since it offers high performance for storage.

The question we asked ourselves was: does aging matter for PM? Unlike traditional storage devices such as magnetic hard drives, PM requires no mechanical operation to find and read/write data. Does it matter whether data is close together or spread far apart?

It turns out it does matter, but not for the reasons we originally considered. One of the ways PM is accessed is by reading or writing it directly from the application, without going through a file system. In this access mode, called memory-mapped access, the application sets up a mapping so that a write to, say, its address 10240 is translated into a write to PM location 4096. Thus, reading and writing PM does not incur any software overhead, allowing applications to fully utilize the high performance of PM.
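
For the curious, here is a minimal sketch of memory-mapped PM access on Linux, assuming a PM device mounted with DAX (direct access) at the hypothetical path /mnt/pmem:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        size_t len = 4096;
        int fd = open("/mnt/pmem/data", O_CREAT | O_RDWR, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

        /* Map the file into our address space. On a DAX file system
         * this maps the PM itself: there is no page cache in between. */
        char *pm = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (pm == MAP_FAILED) { perror("mmap"); return 1; }

        /* Ordinary loads and stores now reach PM directly: no system
         * call, no file-system code on the access path. (A real program
         * would also flush the written cache lines for durability.) */
        strcpy(pm, "hello, persistent memory");

        munmap(pm, len);
        close(fd);
        return 0;
    }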

The translation in our example is carried out using a structure maintained by the operating system called the page table. It turns out that setting up the page table is a big cost in accessing PM: the table entries must be in place before the translation can proceed. For small accesses to PM, the cost of setting up this table can be as big as the cost of accessing the PM itself.
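
One way to see this cost from user space on Linux is the (real) MAP_POPULATE mmap flag, which asks the kernel to build the page-table entries up front instead of on first access; without it, the first touch of each page takes a fault that creates the entry. A small sketch:

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    /* Map len bytes of fd with the page table pre-populated, so that
     * later accesses pay only the cost of the PM access itself. */
    char *map_prefaulted(int fd, size_t len) {
        return mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_POPULATE, fd, 0);
    }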

It turns out that if your data is close together, setting up the page tables is easier. You can use one entry to translate a large amount of PM at once (a "huge page", 2 MB instead of the regular 4 KB on x86). However, if the data is spread apart, you need many small entries to accomplish the same translation, and setting up the page table consumes a lot of time.
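
The arithmetic here is striking. Using the standard x86-64 sizes, mapping 1 GiB of PM takes 262,144 regular entries but only 512 huge-page entries:

    #include <stdio.h>

    int main(void) {
        long gib       = 1L << 30;  /* map 1 GiB of PM     */
        long base_page = 4L << 10;  /* regular 4 KiB page  */
        long huge_page = 2L << 20;  /* 2 MiB huge page     */

        printf("regular entries needed:   %ld\n", gib / base_page); /* 262144 */
        printf("huge-page entries needed: %ld\n", gib / huge_page); /* 512    */
        return 0;
    }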

WineFS

Based on this insight, we built a new file system called WineFS (because it gets better with age). WineFS uses a number of techniques to keep data close together so that the page-table setup cost stays low. As a result, the file system provides the same high performance regardless of age – no defragmentation maintenance is required!

WineFS differentiates between data that is meant to be read sequentially and small data that is meant to be read in a random fashion. By maintaining separate allocation pools for these two kinds of data, WineFS can provide contiguous PM extents to applications that benefit from them, while serving small PM allocations to applications for which random performance is acceptable.
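
Here is a minimal sketch of the separate-pool idea, assuming 2 MiB huge-page-aligned extents and 4 KiB blocks; the names and the bump-pointer allocation are hypothetical simplifications, not WineFS's actual interfaces:

    #include <stddef.h>
    #include <stdint.h>

    #define HUGE_EXTENT (2u << 20)  /* 2 MiB, huge-page aligned */
    #define SMALL_BLOCK (4u << 10)  /* 4 KiB                    */

    struct pm_pools {
        uint8_t *aligned_next;   /* bump pointer, aligned pool   */
        uint8_t *aligned_end;
        uint8_t *unaligned_next; /* bump pointer, unaligned pool */
        uint8_t *unaligned_end;
    };

    /* Large, sequential files get whole huge-page-aligned extents,
     * preserving the contiguity that allows huge-page mappings. */
    void *alloc_extent(struct pm_pools *p) {
        if (p->aligned_end - p->aligned_next < HUGE_EXTENT) return NULL;
        void *out = p->aligned_next;
        p->aligned_next += HUGE_EXTENT;
        return out;
    }

    /* Small, randomly accessed data comes from a separate pool, so
     * it never punches holes in the huge-page-aligned region. */
    void *alloc_block(struct pm_pools *p) {
        if (p->unaligned_end - p->unaligned_next < SMALL_BLOCK) return NULL;
        void *out = p->unaligned_next;
        p->unaligned_next += SMALL_BLOCK;
        return out;
    }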

WineFS also uses a technique called journaling to update the file system. Interestingly, journaling causes more writes to PM, since data is first written to a journal and then to its proper place in the file system. However, journaling also preserves the layout of files, and therefore read/write performance, as the file system ages. Hence, the extra write is a worthwhile trade-off in the design of a PM file system.
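
A minimal sketch of the journaling write path, under the assumption that PM is reachable through ordinary pointers; the journal layout and the pm_persist helper are hypothetical simplifications, not WineFS's actual code:

    #include <stddef.h>
    #include <string.h>

    void pm_persist(void *addr, size_t len) {
        /* Placeholder: on real PM hardware this would write back the
         * affected cache lines (clwb) and issue a store fence (sfence). */
        (void)addr; (void)len;
    }

    struct journal { char *head; };

    void journaled_write(struct journal *j, char *dst,
                         const char *src, size_t len) {
        /* Step 1: append the data to the journal and persist it first.
         * This is the extra PM write that journaling pays. (A real
         * journal would also persist a commit record here.) */
        memcpy(j->head, src, len);
        pm_persist(j->head, len);
        j->head += len;

        /* Step 2: write the data to its final, unchanged location.
         * Because dst never moves, the file's layout (and hence its
         * read/write performance) is preserved as the system ages. */
        memcpy(dst, src, len);
        pm_persist(dst, len);
    }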

The design of WineFS was published at the 28th ACM Symposium on Operating Systems Principles (SOSP '21); you can find the citation below [1]. WineFS was a collaboration between UT Austin, CMU, and Google.

WineFS has been made open source, in line with one of the goals of the BigHPC project (disseminating knowledge and inspiring follow-on work). You can access it here: https://github.com/utsaslab/WineFS. The BigHPC team would be happy to help you use it if it fits one of your use cases.

[1] Rohan Kadekodi, Saurabh Kadekodi, Soujanya Ponnapalli, Harshad Shirwadkar, Gregory R. Ganger, Aasheesh Kolli, and Vijay Chidambaram. WineFS: a hugepage-aware file system for persistent memory that ages gracefully. In Proceedings of the 28th ACM Symposium on Operating Systems Principles (SOSP '21), 2021.

Vijay Chidambaram, UT Austin
May 12, 2022