
An Analysis of File Migration in a Unix Supercomputing Environment

Ethan L. Miller

Randy H. Katz

University of California, Berkeley

ABSTRACT

The supercomputer center at the National Center for Atmospheric Research (NCAR) migrates large numbers of files to and from its mass storage system (MSS) because there is insufficient space to store them on the Cray supercomputer's local disks. This paper presents an analysis of file migration data collected over two years. The analysis shows that requests to the MSS are periodic, with one-day and one-week periods. Read requests to the MSS account for the majority of the periodicity, as write requests are relatively constant over the course of a week. Additionally, reads show a far greater fluctuation than writes over a day and a week, since reads are driven by human users while writes are machine-driven.

1 Introduction*

Over the last decade, computers have made incredible gains in speed. This speedup has encouraged the processing of larger and larger amounts of data; however, storing this data on magnetic disk is not feasible. Instead, most data centers with large data sets use tertiary storage devices such as tapes and optical disks to store much of their data. These devices provide a lower cost per megabyte of storage, but they have longer access times than magnetic disk. Studying the tradeoffs between cheaper, slower tertiary storage and more expensive, faster disk storage can show how to improve response time without increasing storage costs.

The problem is especially acute at computer centers, such as the National Center for Atmospheric Research (NCAR), that deal with large amounts of data that can never be deleted. Data grows at the rate of several terabytes per year [20]. The cost of storing this data on shelved magnetic tape is relatively low, as cartridge tapes are inexpensive. However, storing even 1% of the total data on magnetic disk would be expensive, requiring hundreds of gigabytes of Cray disk storage.

This paper analyzes file migration behavior on the NCAR system described in [1] and [18]. The first section will provide some background on the problem, discussing current mass storage systems and previous work on them. The next section will describe the NCAR system in more detail. We will then present our trace-gathering methods.

* This work was supported in part by University Corporation for Atmospheric Research contract S9128, and an NSF Fellowship.

The main part of the paper is a two-part analysis of the gathered trace data: analyzing the usage patterns for the entire mass storage system (MSS), and studying the behavior of individual files. The first part of the analysis includes system behavior over the course of a day, week, and longer periods. It characterizes user behavior with respect to the entire MSS, showing at what rate data and files are read and written. Other characteristics of the mass store at NCAR, such as request latency and interrequest distribution, are also discussed. The second part of the analysis provides insight for designing migration algorithms, as it focuses on how individual files are treated. This part of the analysis will discuss file size distribution and individual file reference patterns.

Finally, we will present some implications of our findings for migration algorithms and suggest some directions for future research.

2 Background

2.1 History

File migration systems are used by many large computer installations, such as NCAR [1,18] and NASA [7,19], to store more data than would cost-effectively fit on magnetic disk. Tertiary storage, which usually consists of tape and optical disk, lies at the bottom of the "storage pyramid," as shown in Figure 1. Cost and speed increase going up the pyramid, while the size of the memory level increases towards the bottom of the pyramid. CPU cache is at the top of the pyramid; it has the highest cost per byte and is the smallest and fastest of the levels. At the bottom of the pyramid are tape and optical disk, which have slow access speeds, on the order of seconds or minutes, and very low cost, under $10/GB.
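The cost and speed tradeoff between two levels of the pyramid can be illustrated with a small calculation. The following Python sketch is only an illustration and is not taken from the NCAR system: the `StorageLevel` type, both function names, and every numeric value are hypothetical, except that the tape cost is chosen to match the "under $10/GB" bound and the tape latency the "seconds or minutes" range given above. It assumes that all data resides on tape and that some fraction is additionally cached on disk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageLevel:
    """One level of the storage pyramid (all values are illustrative)."""
    name: str
    access_seconds: float   # assumed typical latency to first byte
    dollars_per_gb: float   # assumed cost per gigabyte


def expected_access_seconds(disk: StorageLevel, tape: StorageLevel,
                            disk_hit_ratio: float) -> float:
    """Mean access time when a fraction of requests is satisfied from disk."""
    if not 0.0 <= disk_hit_ratio <= 1.0:
        raise ValueError("disk_hit_ratio must be between 0 and 1")
    return (disk_hit_ratio * disk.access_seconds
            + (1.0 - disk_hit_ratio) * tape.access_seconds)


def total_cost_dollars(disk: StorageLevel, tape: StorageLevel,
                       total_gb: float, disk_fraction: float) -> float:
    """Cost of keeping all data on tape plus a cached fraction on disk."""
    if not 0.0 <= disk_fraction <= 1.0:
        raise ValueError("disk_fraction must be between 0 and 1")
    return (total_gb * tape.dollars_per_gb
            + total_gb * disk_fraction * disk.dollars_per_gb)


# Hypothetical parameters, chosen only to make the arithmetic easy to follow.
DISK = StorageLevel("magnetic disk", access_seconds=0.02, dollars_per_gb=1000.0)
TAPE = StorageLevel("shelved tape", access_seconds=60.0, dollars_per_gb=10.0)

if __name__ == "__main__":
    for hit_ratio in (0.0, 0.5, 0.9, 0.99):
        mean = expected_access_seconds(DISK, TAPE, hit_ratio)
        print(f"disk hit ratio {hit_ratio:.2f}: mean access {mean:.2f} s")
```

Under these assumed values, the mean access time is dominated by the tape latency unless nearly all requests are satisfied from disk, which is why the choice of which files to keep on disk matters more than the raw amount of disk purchased.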