Background

Petascale supercomputing systems at the cutting edge of high-performance computing (HPC) are built on distributed memory architectures. To share information between parallel processes, most conventional HPC software relies on message-passing libraries such as MPI. To make these systems easier to program, modern interconnect fabrics support remote direct memory access (RDMA) operations, which let one process read and write another's memory efficiently without involving the remote CPU.

Building on this hardware support, Partitioned Global Address Space (PGAS) libraries allow the distributed processor memories to be addressed in a global manner. The PGAS model lets application programs for large distributed-memory clusters be developed in a shared-memory style. Writing thread-like code for shared-memory systems is simpler to implement and also permits conventional representations of linked data structures, such as trees and graphs.

The main challenge in using a global view with dynamic data structures such as trees is that data accesses are often fine-grained, whereas data communication must be done in large blocks to remain efficient. To address this, we have developed a runtime library, Global Trees (GT), in conjunction with collaborators at The Ohio State University.

Global Trees organizes data elements into chunks. Chunks are collections of data elements that are globally accessible but located on a single process. All data communication is done at the granularity of a chunk (usually 100 to 10,000 elements). While each chunk is homed on a single process, copies may be cached throughout the system. When a process needs to access a tree node, it transfers the containing chunk from the owner process. This transfer is performed via RDMA, eliminating the need for explicit synchronization with the owning process. Subsequent requests for elements within the same chunk are served from the local cache.
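The chunk-caching scheme above can be sketched as follows. This is an illustrative simulation, not the real GT implementation: the names (gt_gptr, gt_deref, fetch_chunk) and the flat cache array are assumptions, and the RDMA get is replaced by a local stand-in that counts transfers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of chunk-grained caching; names are illustrative,
 * not the actual GT API. */

#define CHUNK_ELEMS 1024          /* elements per chunk */
#define MAX_CHUNKS  64

typedef struct { int chunk_id; int offset; } gt_gptr;  /* portable global pointer */

typedef struct {
    int    valid;                 /* is this chunk present locally? */
    double elems[CHUNK_ELEMS];    /* element payload */
} gt_chunk;

static gt_chunk cache[MAX_CHUNKS];   /* local chunk cache */
static int      transfer_count = 0;  /* whole-chunk transfers performed */

/* Stand-in for the one-sided RDMA get of an entire chunk from its home. */
static void fetch_chunk(int chunk_id) {
    memset(&cache[chunk_id], 0, sizeof(gt_chunk));
    cache[chunk_id].valid = 1;
    transfer_count++;
}

/* Dereference a global pointer: on a miss, fetch the containing chunk;
 * afterwards, every element of that chunk is served locally. */
double *gt_deref(gt_gptr p) {
    if (!cache[p.chunk_id].valid)
        fetch_chunk(p.chunk_id);
    return &cache[p.chunk_id].elems[p.offset];
}

int gt_transfers(void) { return transfer_count; }
```

The key property is visible in the counter: touching two elements of the same chunk costs one transfer, so fine-grained tree accesses amortize into block communication.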

System Design

The Global Trees framework consists of several interconnected components. GT is built upon ARMCI, a one-sided messaging library developed by collaborators at Pacific Northwest National Laboratory. ARMCI provides asynchronous get/put operations on the partitioned global address space. Both ARMCI and Global Trees are compatible with MPI messaging, but can be used independently of any message-passing framework. GT also takes advantage of the Scioto dynamic load-balancing library to manage irregular tree traversals.

Within the GT runtime there are two major components. The Global Chunk Layer (GCL) is a lower-level library that provides a set of primitives for working with global linked data structures. The Global Trees layer provides routines tailored to linked representations of tree data structures.

The Global Chunk Layer provides support for operating on individual data elements using global pointers, as well as on entire chunks. The GCL offers all of these views to maximize the flexibility available to HPC application authors. Individual elements can be referenced by portable global pointers; references within the same chunk can take advantage of special optimizations, incurring nearly the same overhead as native C pointers. Finally, operations on entire chunks (such as do-all parallelism) can use local, direct memory access without any global-pointer overhead.
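The three access views can be illustrated with a minimal sketch. The types and function names here (gcl_gptr, gcl_deref, gcl_chunk_foreach) are assumptions for illustration, not the actual GCL interface, and the chunk is assumed to already be resident locally.

```c
#include <stddef.h>

/* Illustrative sketch of the three GCL-style access views;
 * all names here are hypothetical. */

#define CHUNK_ELEMS 256

typedef struct { int chunk_id; int offset; } gcl_gptr;

typedef struct {
    double elems[CHUNK_ELEMS];
    int    count;                 /* elements in use */
} gcl_chunk;

/* View 1: portable global pointer -> element. In the real system this may
 * trigger a remote fetch; here the chunk is assumed already local. */
double *gcl_deref(gcl_chunk *local, gcl_gptr p) {
    return &local->elems[p.offset];
}

/* View 2: intra-chunk reference -- once one element's address is known,
 * a neighbor in the same chunk is plain pointer arithmetic,
 * nearly as cheap as a native C pointer. */
double *gcl_same_chunk(double *elem, int delta) {
    return elem + delta;
}

/* View 3: chunk-level do-all -- apply fn to every element with direct
 * local memory access, no per-element global-pointer overhead. */
void gcl_chunk_foreach(gcl_chunk *c, void (*fn)(double *)) {
    for (int i = 0; i < c->count; i++)
        fn(&c->elems[i]);
}

/* Example element operation for the do-all view. */
void scale2(double *x) { *x *= 2.0; }
```

An application would use the cheapest view that applies: do-all kernels iterate chunks directly, tree walks use intra-chunk pointers where possible, and only cross-chunk links pay the full global-pointer cost.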

Global Trees provides several features to reduce the effort needed by developers to produce high-performance code. GT provides several common built-in traversals that are locality-aware and automatically load-balanced. Node allocation can be customized to use a variety of strategies, including several locality-aware methods, or any other method written to match the application's data access patterns.
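One locality-aware allocation strategy of the kind described above can be sketched as follows: place a new child in its parent's chunk while space remains, so that parent-child traversals stay within one communication unit. The names and the fixed chunk capacity are hypothetical, not GT's actual allocator interface.

```c
#include <stddef.h>

/* Hedged sketch of a pluggable, locality-aware node allocator;
 * names and sizes are illustrative assumptions. */

#define CHUNK_CAP  4      /* nodes per chunk (tiny, for illustration) */
#define MAX_CHUNKS 16

typedef struct { int used; } chunk_meta;

static chunk_meta chunks[MAX_CHUNKS];
static int nchunks = 1;   /* chunk 0 exists initially */

typedef struct { int chunk_id; int offset; } node_ref;

/* Allocate a child in the same chunk as its parent when space remains,
 * otherwise spill to a fresh chunk. */
node_ref alloc_near_parent(node_ref parent) {
    int cid = parent.chunk_id;
    if (chunks[cid].used >= CHUNK_CAP)
        cid = nchunks++;              /* open a new chunk */
    node_ref r = { cid, chunks[cid].used++ };
    return r;
}
```

A depth-first build with this strategy keeps subtrees clustered into few chunks, so a later traversal of that subtree needs few chunk transfers.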

Applications

Global Trees has been shown to be effective with several scientific applications:

MADNESS is a multiresolution analysis framework for Hartree-Fock and density functional theory (DFT) quantum chemistry applications based on multiwavelets. The MADNESS system relies on octrees with tens of millions of nodes, each node containing over 50 kilobytes of wavelet coefficients. These trees are highly irregular, and the computation at each node involves a multidimensional tensor contraction. Global Trees provides scalable performance for several complex MADNESS operations at over 1,000 nodes.

Barnes-Hut is a tree-based approach to the n-body problem of computing force interactions among a collection of particles. Barnes-Hut uses spatial-decomposition octrees to represent particles distributed in a volume. Unlike MADNESS, Barnes-Hut performs only a small number of floating-point operations per node. Global Trees has been shown to outperform distributed shared memory approaches, such as Cluster OpenMP, and also outperforms other PGAS approaches, including UPC.