RAMA: A FILESYSTEM FOR MASSIVELY PARALLEL COMPUTERS
Ethan L. Miller and Randy H. Katz
University of California, Berkeley
This paper describes a file system design for massively parallel computers which makes very efficient use of a few disks per processor. This overcomes the traditional I/O bottleneck of massively parallel machines by storing the data on disks within the high-speed interconnection network. In addition, the file system, called RAMA, requires little inter-node synchronization, removing another common bottleneck in parallel processor file systems. Support for a large tertiary storage system can easily be integrated into the file system; in fact, RAMA runs most efficiently when tertiary storage is used.
Disk-based file systems are nothing new; they have been used for over thirty years. However, two developments in the past few years have changed the way computers use file systems: massive parallelism and robotically-controlled tertiary storage. It is now possible to have a multi-terabyte file system being accessed by hundreds of processors concurrently. Current disk file systems are not well suited to this environment. They make many assumptions about the speed, capacity, and erasability of the file storage medium and about the stream of requests to the file system. These assumptions continued to hold as disk and processor speeds increased. However, massively parallel machines may have hundreds of processors accessing a single file in different places, and will require data to come from a multi-terabyte store too large to fit cost-effectively on disk. The changes in access patterns to the file system and in the response times of tertiary storage media require a new approach to designing file systems.
The RAMA (Rapid Access to Massive Archive) file system differs from a standard file system in that it treats the disk as merely a cache for the tertiary storage system. Because it relies on optical disk, tape, and other mass storage devices to hold the "true" copies of each file, the disk file system may use different, more efficient methods of arranging data.
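The idea of treating disk as a cache for tertiary storage can be illustrated with a minimal sketch. This is not RAMA's actual design, which later sections describe; the class names `TertiaryStore` and `DiskCache`, the LRU eviction policy, and the block keys are all assumptions made for illustration. The point it shows is that a clean block can be dropped from disk without any further I/O, because the true copy still lives on tertiary storage.

```python
from collections import OrderedDict


class TertiaryStore:
    """Hypothetical stand-in for tape or optical media holding the true copies."""

    def __init__(self):
        self.blocks = {}
        self.reads = 0  # counts slow tertiary reads

    def read(self, key):
        self.reads += 1
        return self.blocks[key]

    def write(self, key, data):
        self.blocks[key] = data


class DiskCache:
    """Disk modeled as an LRU cache in front of tertiary storage (assumed policy)."""

    def __init__(self, store, capacity):
        self.store = store
        self.capacity = capacity
        self.cached = OrderedDict()  # key -> (data, dirty flag)

    def read(self, key):
        if key in self.cached:
            self.cached.move_to_end(key)  # mark as most recently used
            return self.cached[key][0]
        data = self.store.read(key)  # miss: fetch the true copy
        self._insert(key, data, dirty=False)
        return data

    def write(self, key, data):
        self._insert(key, data, dirty=True)

    def _insert(self, key, data, dirty):
        self.cached[key] = (data, dirty)
        self.cached.move_to_end(key)
        while len(self.cached) > self.capacity:
            old_key, (old_data, old_dirty) = self.cached.popitem(last=False)
            if old_dirty:
                # Only modified blocks must be copied back before eviction.
                self.store.write(old_key, old_data)
```

In this sketch a repeated read of the same block touches tertiary storage only once, and evicting an unmodified block costs nothing beyond discarding it.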
This paper describes the design of the RAMA file system. The first section provides some background on relevant file systems. Next, we detail the design of the disk-based portion of the file system. We then discuss the advantages of the system, and some possible drawbacks. We conclude with future directions for research.
There have been few file systems truly designed for parallel machines. While there have been many massively parallel processors, most of them have used uniprocessor-based file systems. These computers generally perform file I/O through special I/O interfaces and employ a front-end CPU to manage the file system. This method has the major advantage that it uses well-understood uniprocessor file systems; little additional effort is needed to support a parallel processor. The disadvantage, however, is that bandwidth to the parallel processor is generally low, as there is only a single CPU managing the file system. Bandwidth is limited by this CPU's ability to handle requests and by the single channel into the parallel processor. Nonetheless, systems such as the CM-2 use this method.
Some parallel processors do use multiprocessor file systems. Generally, though, these systems make a distinction between computing nodes and file processing nodes. This is certainly a step in the right direction, as file operations are no longer limited by a single CPU. These systems, however, are often bottlenecked by centralized control. Additionally, there is often a strict division between I/O nodes and processing nodes. This wastes CPU cycles, as the I/O nodes are idle during periods of heavy computation and the processing nodes are idle during periods of heavy I/O. The CM-5 is an example of this type of file system architecture. The Intel iPSC/2 also uses this arrangement of computation and I/O nodes. In the iPSC/2, data was distributed among many I/O nodes. However, I/O services ran only on I/O nodes, so performance scaled only with the number of I/O nodes. This arrangement works well if I/O needs are modest, but costs too much for a large system with thousands of computation nodes. Such a system would need hundreds or thousands of distinct I/O nodes as well, requiring additional processors and a larger interconnection network, and increasing the machine's cost.
The Bridge file system is one example of a truly parallel file system. In it, each processor has a disk, distributing the file system across the entire parallel computer. Bridge showed good performance, but achieving it required moving computation to where the data actually resided. This approach does not work well for supercomputing applications such as climate models. For these problems, data layout is critical, as interprocessor communication must be optimized. Additionally, this arrangement fails when a single file system must be accessed by both workstations and high-performance computers. The authors reported little speedup for "naive" use of the file system, but such use is necessary if workstations are to share a file system