
File Access Patterns in Public FTP Archives

and an Index for Locality of Reference*

Silvano Maffeis

[email protected]

Institut für Informatik der Universität Zürich (IFI)

Winterthurerstr. 190, CH-8057 Zürich, Switzerland

IFI TR 92.13†

August 1992

ABSTRACT

Global filesystems and new file transfer protocols are both badly needed and hard to build as networks grow drastically. In this paper we present results obtained from a three-month investigation of access to public files. We present first results on the popularity of public FTP files, on common operations on public file-archives (deletions, updates and insertions) and on the filesizes encountered. An index for measuring locality of reference to a resource is also proposed. The results show that most file transfers relate to only a small fraction of the files in an archive and that a considerable share of the operations on public files are updates. Further results are presented and interpreted in the paper.

Keywords: file transfer, popularity of files, locality of reference, filesizes, replication

1 INTRODUCTION

The number of hosts in the Internet increases drastically from year to year [7, 5], accompanied by an increase in file transfer traffic. Today there are thousands of anonymous FTP sites with millions of files stored on them. Under these conditions, the traditional FTP utility [10] shows severe inefficiencies such as duplicate file-transmissions [4], long access times and inadequate mechanisms for locating files. Several improvements and redesigns of FTP have therefore been proposed, for example [11, 2, 6] and [8]. In this paper we analyze FTP data-traffic on a per-file basis. The purpose of the analysis is to gather characteristic information on file-accesses in public FTP archives. Such information is useful for designing global filesystems or new file transfer protocols and for simulating data access in large networks.

The rest of the paper is structured as follows: In section 2 we define which specific analyses were performed and how the data was collected. Section 3 presents results concerning the popularity of anonymous FTP files and proposes an index for characterizing locality of reference. Section 4 presents our results relating to the most important operations on file-archives, namely creation of new files, deletion of files and updating of files. File-sizes are examined in section 5. Finally, section 6 summarizes the main findings and concludes the paper.

*This work was supported by Siemens AG, ZFE, Germany and Schweizer Bundesamt für Konjunkturfragen, Grant No. 2255:1

†in: ACM SIGMETRICS Performance Evaluation Review Vol. 20 Nr. 3, March 1993

The tables and figures referenced in the text are included at the end of the paper.

2 CONDUCTED ANALYSES

The intent of our work was to find answers to the following questions:

1. What can be said about the popularity-distribution of public files? Here we aimed at deriving statements of the form "X% of the file transfers from a certain archive relate to Y% of its files".

2. How are the frequencies of operations like creation of new files, deletion of files and updates of files interrelated? Is it valid to assume that updates to files happen very infrequently?

3. How are filesizes distributed in public FTP archives?

4. What can be said about the sizes of the most frequently retrieved files?

To address these questions, two distinct classes of measurements were carried out: we analyzed FTP-logfiles and FTP-indexfiles.
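A statement of the form asked for in question 1 can be derived from per-file transfer counts. The following is a minimal sketch of one way to do so, not the procedure used in this study: the function name and the sample counts are hypothetical, and the list of counts is assumed to contain one entry per file in the archive, including a zero for every file that was never retrieved.

```python
def transfer_share_of_top_files(transfer_counts, file_fraction):
    """Fraction of all transfers that relate to the most popular files.

    transfer_counts: one transfer count per file in the archive
                     (files that were never retrieved count as 0).
    file_fraction:   share of the files to consider, e.g. 0.2 for the
                     20% most frequently retrieved files.
    """
    if not 0 < file_fraction <= 1:
        raise ValueError("file_fraction must be in (0, 1]")

    # Most popular files first.
    counts = sorted(transfer_counts, reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0

    # Number of files in the considered fraction (at least one file).
    top_n = max(1, round(file_fraction * len(counts)))
    return sum(counts[:top_n]) / total


# Hypothetical transfer counts for an archive of ten files.
sample_counts = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]
share = transfer_share_of_top_files(sample_counts, 0.2)
print(f"{share:.0%} of the file transfers relate to 20% of the files")
```

For the hypothetical counts above, the two most popular files account for 70 of the 100 transfers.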

2.1 ANALYZING FTP-LOGFILES

Many FTP server processes (ftpd) log the FTP commands issued by anonymous users to a logfile. With the help of some Perl [12] programs written for this purpose, we derived statements concerning the access-rates to files. This class of measurements was complicated by the fact that many different formats of FTP-logfiles exist.
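The counting step can be illustrated with a short sketch. It is written in Python rather than Perl and is not the code used in this study. The line format assumed here (date, time, host, FTP command, path, separated by whitespace) is purely hypothetical, since real ftpd logfiles differ from site to site; only the RETR command name is taken from the FTP protocol itself.

```python
from collections import Counter


def count_retrievals(log_lines):
    """Count RETR (file retrieval) commands per file path.

    Assumed, hypothetical line format:
        <date> <time> <host> RETR <path>
    Lines without a RETR command are ignored. Paths containing
    whitespace are not handled by this sketch.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if "RETR" in fields:
            idx = fields.index("RETR")
            # The path is expected right after the command.
            if idx + 1 < len(fields):
                counts[fields[idx + 1]] += 1
    return counts


# Hypothetical logfile excerpt in the assumed format.
sample_log = [
    "1992-06-01 10:02:11 host-a.example RETR /pub/a.tar.Z",
    "1992-06-01 10:05:40 host-b.example CWD /pub",
    "1992-06-01 10:06:02 host-b.example RETR /pub/a.tar.Z",
    "1992-06-01 11:15:27 host-c.example RETR /pub/README",
]

for path, n in count_retrievals(sample_log).most_common():
    print(n, path)
```

Supporting a further logfile format would then only require a different way of extracting the command and the path from each line.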

Logfiles were located by having the archie [3] service search the Internet for appropriate filenames. In addition, some operators of well-known archive sites participated directly by granting us access to their FTP-logfiles. We tried to find a few characteristic sites showing a minimum of 2000 file transfers per month and which were located at different