
Visualisation of Cache Coherence Bottlenecks in

Shared Memory Multiprocessor Applications

Mats Brorsson and Per Stenström

Department of Computer Engineering, Lund University,

P.O. Box 118, S-221 00 Lund, Sweden

email: [email protected]

Abstract

Shared memory multiprocessors are becoming increasingly important, but one major problem is how to keep the processor caches coherent. Many different solutions to this problem exist, and the performance of a given program depends largely on the access pattern to shared data, the sharing behaviour.

We discuss in this paper how to characterise and visualise sharing behaviour. In terms of cache coherence, the degree of sharing, the access mode and the temporal granularity are found to be essential in order to describe and understand sharing behaviour. The sharing behaviour can be measured by simulation and visualised in a sharing profile diagram.

We use an unblocked matrix multiplication and two different ways of parallelising a blocked matrix multiplication algorithm to illustrate the problems involved and how the sharing profile can aid in choosing the right algorithm.

1.0 Introduction

The use of caches to reduce memory latency and interconnect traffic is imperative in high-performance shared memory multiprocessors. The problem of keeping the contents of the caches coherent is, however, a major obstacle that has to be overcome. A number of coherence mechanisms have been proposed in the literature, and it is widely acknowledged that their relative performance is highly dependent on the sharing behaviour of the workload. In order to understand how a particular application will perform using a given cache coherence protocol, we need a thorough comprehension of how the processors access the shared data structures.

Moreover, when writing new applications for shared memory multiprocessors, or when parallelising existing ones, it may be necessary to consider issues other than the pure algorithmic speedup. An algorithm that has the best efficiency on a PRAM may exhibit a sharing behaviour that leads to a large communication overhead. The performance of a parallel application on a given architecture is a complex function of the algorithmic speedup and the performance penalty paid for keeping the caches coherent.

In this paper we address the importance of analysing and understanding the sharing behaviour of parallel applications. We use a method for measuring and visualising sharing behaviour, which we apply to an unblocked and a blocked matrix multiplication algorithm. We analyse the sharing behaviour of two different ways of parallelising the blocked matrix multiplication algorithm. The analysis is used to explain the differences in performance of the various algorithms when executed on a simulated multiprocessor with a full-map and a limited directory cache coherence protocol, respectively.
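To make the kernels concrete, the sketch below shows in C an unblocked matrix multiplication next to a blocked (tiled) variant. The function names, the matrix dimension N and the block size BS are illustrative choices of ours rather than parameters taken from the paper; the parallel versions discussed later partition these loops across processors.

#define N   256   /* matrix dimension (illustrative) */
#define BS   16   /* block size; BS divides N in this sketch */

/* Unblocked multiplication, c = a * b: the inner loop walks an
   entire row of a and column of b for every element of c. */
void matmul_unblocked(const double a[N][N], const double b[N][N],
                      double c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

/* Blocked (tiled) multiplication: the same arithmetic restructured
   so that BS x BS sub-blocks are reused while they are cache
   resident; c is assumed to be zero-initialised. */
void matmul_blocked(const double a[N][N], const double b[N][N],
                    double c[N][N])
{
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int j = jj; j < jj + BS; j++) {
                        double sum = c[i][j];
                        for (int k = kk; k < kk + BS; k++)
                            sum += a[i][k] * b[k][j];
                        c[i][j] = sum;
                    }
}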

The next section discusses sharing behaviour and how it relates to cache coherence protocols. Section 3.0 presents how the sharing behaviour can be measured and visualised. We then continue with an analysis of the different matrix multiplication algorithms in Section 4.0. In Section 5.0 we present simulation data for two simple cache coherence protocols and discuss the results before concluding the paper in Section 6.0.

2.0 Sharing behaviour in relation to cache coherence protocols

In this section we look at the aspects of sharing that are important for the performance of a cache coherence protocol. If process migration is not allowed, the instructions and the process's private data will also be private to the processing element. We will thus only be concerned with shared data. For simplicity we consider only write-invalidate schemes; similar reasoning can be applied to write-update schemes.

Agarwal et al. discuss in their article a number of directory-based cache coherence protocols [1]. They use the notation Dir_i X, where i is the maximum number of copies of a memory block and X is either NB for Non-Broadcast or B for Broadcast. We will use the Dir_N NB and the Dir_2 NB protocols as examples of how the sharing behaviour affects their relative performance, where N is the total number of caches in the system. Dir_N NB is also called the full-map directory protocol, and Dir_2 NB the limited directory protocol with two pointers.
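As a rough illustration of the difference between the two protocols, the C fragment below sketches what a directory entry might hold in each case, and how a Dir_2 NB directory handles pointer overflow by invalidating an existing copy rather than broadcasting. This is our own simplified sketch, not the protocol implementation used in the simulations; the type names, the NCACHES constant and the invalidate callback are hypothetical.

#include <stdbool.h>
#include <stdint.h>

#define NCACHES 16              /* N: number of caches (illustrative) */

/* Dir_N NB (full map): one presence bit per cache, so a clean block
   may be cached by any number of processors at the same time. */
typedef struct {
    uint32_t present;           /* bit i set => cache i holds a copy  */
    bool     dirty;             /* set => exactly one writable copy   */
} dir_full_map_t;

/* Dir_2 NB (limited, non-broadcast): room for at most two pointers. */
typedef struct {
    int  ptr[2];                /* cache ids, -1 = slot unused        */
    bool dirty;
} dir_limited2_t;

/* Read miss from cache c on a Dir_2 NB entry: if both pointers are in
   use, one existing copy is invalidated to make room (pointer
   overflow), which is where the limited protocol pays for sharing. */
static void dir2_read_miss(dir_limited2_t *e, int c,
                           void (*invalidate)(int cache))
{
    for (int i = 0; i < 2; i++)
        if (e->ptr[i] == c)
            return;                         /* copy already recorded  */
    for (int i = 0; i < 2; i++)
        if (e->ptr[i] < 0) {
            e->ptr[i] = c;                  /* free pointer slot      */
            return;
        }
    invalidate(e->ptr[1]);                  /* overflow: evict a copy */
    e->ptr[1] = c;
}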

In: IEEE Computer Society Technical Committee on Computer Architecture Newsletter, Nr. 3, Fall 1993, pages 32-36.