Visualisation of Cache Coherence Bottlenecks in
Shared Memory Multiprocessor Applications
Mats Brorsson and Per Stenström
Department of Computer Engineering, Lund University,
P.O. Box 118, S-221 00 Lund, Sweden
email: [email protected]
Shared memory multiprocessors are becoming increasingly important, but one major problem is
how to keep the processor caches coherent. Many different solutions to this problem exist, and the
performance of a given program depends largely on the access pattern observed to shared data.
We discuss in this paper how to characterise and visualise sharing behaviour. In terms of cache
coherence, the degree of sharing, the access mode and the temporal granularity are found to be
essential to describe and understand sharing behaviour. The sharing behaviour can be measured
by simulation and visualised in a sharing profile diagram.
We use an unblocked matrix multiplication and two different ways of parallelising a blocked
matrix multiplication algorithm to illustrate the problems involved and how the sharing profile can
aid in choosing the right algorithm.
1.0 Introduction
The use of caches to reduce memory latency and interconnect traffic is imperative in high-performance shared mem-
ory multiprocessors. The problem of keeping the contents of the caches coherent is, however, a major obstacle that
must be overcome. A number of coherence mechanisms have been proposed in the literature, and it is widely
acknowledged that their relative performance is highly dependent on the sharing behaviour of the workload. In order
to understand how a particular application will perform using a given cache coherence protocol we need a thorough
comprehension of how the processors access the shared data structures.
Moreover, when writing new applications for shared memory multiprocessors, or when parallelising existing
ones, it may be necessary to consider issues other than the pure algorithmic speedup. An algorithm that has the best
efficiency on a PRAM may exhibit a sharing behaviour that leads to a large communication overhead. The
performance of a parallel application on a given architecture is a complex function of algorithmic speedup and the
performance penalty paid for keeping the caches coherent.
In this paper we address the importance of analysing and understanding the sharing behaviour of
parallel applications. We use a method for measuring and visualising sharing behaviour, which we apply to an
unblocked and a blocked matrix multiplication algorithm. We analyse the sharing behaviour of two different ways
of parallelising the blocked matrix multiplication algorithm. The analysis is used to explain the differences in per-
formance of the various algorithms when executed on a simulated multiprocessor with a full-map and a limited
directory cache coherence protocol, respectively.
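The blocked (tiled) matrix multiplication discussed above can be sketched as follows. This is a minimal, illustrative Python version; the function name and the block size parameter bs are our own choices, not details taken from the paper.

```python
def matmul_blocked(A, B, n, bs):
    """Blocked matrix multiplication: returns C = A * B.

    A and B are n x n matrices given as lists of lists; bs is the
    block (tile) size, a tuning parameter.  Each bs x bs tile of C
    is computed from tiles of A and B, so the working set of the
    inner loops fits in the cache, improving reuse compared to the
    unblocked triple loop.
    """
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):             # tile row of C
        for jj in range(0, n, bs):         # tile column of C
            for kk in range(0, n, bs):     # tiles along the inner dimension
                for i in range(ii, min(ii + bs, n)):
                    for j in range(jj, min(jj + bs, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + bs, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

The unblocked algorithm corresponds to the special case bs = n, where the tile loops each execute once.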
The next section discusses sharing behaviour and how it relates to cache coherence protocols. Section 3.0 presents
how the sharing behaviour can be measured and visualised. We then continue with an analysis of the different matrix
multiplication algorithms in Section 4.0. In Section 5.0 we present simulation data for two simple cache coherence
protocols and discuss the results before concluding the paper in Section 6.0.
2.0 Sharing behaviour in relation to cache coherence protocols
In this section we will look at the aspects of sharing that are important for the performance of a cache coherence
protocol. If process migration is not allowed, the instructions and the process's private data will also be private to the
processing element. We will thus only be concerned with shared data. For simplicity we consider only write-invalidate
schemes; similar reasoning can be applied to write-update schemes.
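The write-invalidate principle can be illustrated with a toy model of a single shared memory block; this is a sketch of our own, not the protocol machinery used in the paper. On a write, every other cached copy is invalidated, so the next read by another processor misses and refetches the block; a write-update scheme would instead broadcast the new value to the copies.

```python
class WriteInvalidateMemory:
    """Toy write-invalidate model for one shared memory block."""

    def __init__(self, n_procs):
        self.value = 0                       # the memory copy of the block
        self.cached = [None] * n_procs       # per-processor cached copy

    def read(self, p):
        """Return (value, miss) for a read by processor p."""
        miss = self.cached[p] is None
        if miss:
            self.cached[p] = self.value      # fetch the block on a miss
        return self.cached[p], miss

    def write(self, p, v):
        """Processor p writes v: all other copies are invalidated."""
        self.value = v
        self.cached = [None] * len(self.cached)  # invalidate every copy
        self.cached[p] = v                   # the writer keeps the new value
```

In this model a write by one processor forces a subsequent miss at every other processor that re-reads the block, which is the communication cost that the sharing behaviour of an application determines.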
Agarwal et al. discuss in their article a number of directory-based cache coherence protocols. They use the
notation Dir_i X, where i is the maximum number of copies of a memory block and X is either NB for Non-Broadcast
or B for Broadcast. We will use the Dir_N NB and the Dir_2 NB protocols as examples of how the sharing behaviour
affects their relative performance. N is the total number of caches in the system. Dir_N NB is also called the full-map
directory protocol, and Dir_2 NB the limited directory protocol with two pointers.
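The essential difference between the full-map and the limited directory protocols can be sketched with a toy directory entry for one memory block; this is an illustrative model of our own (the eviction choice and the message counter are assumptions, not details from Agarwal et al.). A Dir_i NB entry holds at most i pointers to caches with a copy, so when an (i+1)-th cache reads the block, one existing copy must first be invalidated to free a pointer.

```python
class LimitedDirectory:
    """Sketch of a Dir_i NB directory entry for a single memory block."""

    def __init__(self, i):
        self.i = i                 # maximum number of simultaneous copies
        self.ptrs = set()          # caches currently holding the block
        self.invalidations = 0     # invalidation messages sent so far

    def read(self, cache):
        """A read miss by `cache` adds it to the pointer set."""
        if cache in self.ptrs:
            return
        if len(self.ptrs) == self.i:      # pointer overflow: NB protocols
            victim = next(iter(self.ptrs))
            self.ptrs.discard(victim)     # invalidate one copy to make room
            self.invalidations += 1
        self.ptrs.add(cache)

    def write(self, cache):
        """A write invalidates every other copy; the writer keeps the block."""
        self.invalidations += len(self.ptrs - {cache})
        self.ptrs = {cache}
```

With i = N (the full-map Dir_N NB protocol) the overflow branch in read() is never taken, so widely read-shared data costs nothing extra; with i = 2, a third reader already triggers an invalidation, which is why the degree of sharing matters for the choice between the two protocols.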
In IEEE Computer Society Technical Committee on Computer Architecture Newsletter
Nr. 3, Fall 1993, pages 32-36