Cache misses due to coherence and directory maintenance is a major reason for poor performance in shared memory multiprocessors. We show that the relationship between a particular access pattern and cache miss ratios for a class of directory-based, write-invalidate cache coherence protocols can be characterised in a small set of parameters.
In order to do this, a reference generator has been designed that, based on parameters automatically extracted from a program, artificially can generate a reference stream that results in the same cold, coherence and directory replacement miss ratios as an execution of the program.
Directory-based, write-invalidate cache coherence protocols have been proposed and incorporated in several recent shared memory multiprocessor implementations. Such protocols keep track of the cached copies using a directory that points to the caches that have copies of each block. The most aggressive solution, called a full-map protocol, uses a bit vector of N bits, assuming N caches, whereas solutions with lower implementation cost use a limited number of pointers, i < N, and are called limited directory protocols .
Although most shared data accesses can be satisfied locally in the cache, coherence and directory maintenance - especially for limited directory protocols - cause cache misses that may limit the performance of the multiprocessor application. The number of such cache misses in parallel applications is highly dependent on the access pattern to shared data which we refer to as the sharing behaviour of a program.
The purpose of this paper is to identify a set of parameters that describes the relationship between shared data accesses and cache misses in directory-based, write-invalidate protocols. We do this by focusing on an access pattern - called stationary data accesses - that loosely speaking shows up in applications where each block is accessed by the same set of processors and in the same fashion (e.g., read-modify-write or read-only) throughout the execution. An example of a program with such an access pattern is a parallelisation of the DGEMM matrix multiplication routine from the BLAS-3 collection of linear algebra routines .
The parameters we have identified include the sharing profile, describing the distribution of references to the same
access class, e.g., read-write blocks and read-only blocks; the block distribution, representing the total number of memory blocks in each access class; and the temporal granularity reflecting the temporal locality of accesses to each block type.
To demonstrate that (1) the parameters can be automatically extracted from the execution of a program and (2) they can accurately predict the cache miss ratios for limited directory protocols, we have developed an analysis tool and a memory reference generator that artificially generates references to stationary data based on the extracted parameters. We show results that demonstrates the accuracy of the modelling framework for three matrix multiplication implementations.
We begin in section 2 to establish the architectural framework for this study. Then in section 3 we define stationary data and develop the framework for quantifying it by a set of parameters. Section 4 describes the reference generator used to verify the correctness of the model in section 5. Before concluding the paper we discuss some related work in section 6.
2 Architectural concepts and assumptions
We assume an architecture with N processors and a shared memory module that can be accessed by all processors in unit time. An infinitely large cache memory is attached to each processor. The caches are kept coherent by means of a directory-based, write-invalidate cache coherence protocol. The identities of the processors holding a copy are recorded in a directory with an entry for each memory block.
We will present results for a class of limited directory protocols denoted Diri NB by Agarwal et al. . NB stands for no-broadcast of coherency actions (in contrast to some other protocols which relies on a broadcasting interconnection medium) and i is the maximum number of copies of a memory block, 0 < i < N. Since a Diri NB protocol has limited space to hold information on the location of block copies it is necessary to invalidate one of the existing copies when a new processor requests a block and the directory is full. This is called directory replacement.
During the execution of the program there will be a number of cache misses due to different reasons. For this study, we consider only infinite caches which means that there will be no replacement misses because of a limited
Modelling Accesses to Stationary Data
in a Shared Memory Multiprocessor
Mats Brorsson and Per Stenstr?m
Department of Computer Engineering, Lund University
P.O. Box 118, S-221 00 LUND, Sweden
In Proceedings of the 7th International Conference on Parallel and Distributed Computing Systems (PDCS?94) Las Vegas, October 1994, pp. 802-807.