| ![]() |
Performance Evaluation of Link-Based Cache Coherence Schemes
H?akan Nilsson and Per Stenstr?om
Department of Computer Engineering, Lund University
P.O. Box 118, S-221 00 Lund, Sweden
Abstract
Large-scale shared-memory multiprocessors rely on private coherent caches by using directory-based protocols. Directory-based protocols preserve network bandwidth by reducing the number of consistency actions. A critical issue becomes how they maintain state information about the set of caches and how they reduce read and write latencies. These tradeoffs are studied in this paper.
We study two link-based approaches, called treebased and linear-list protocols, and contrast their performance and implementation cost to that of a fullmap protocol. Using program-driven simulation and a set of three benchmark programs, we find that treebased and linear-list protocols (e.g. IEEE P1596 SCI) perform almost as well as full-map protocols but to a considerably lower implementation cost. However, if the sharing set is large, linear-list schemes may suffer because of the large write latency while tree-based protocols still perform well.
1 Introduction
Shared-memory multiprocessors offer a flexible and powerful programming model. However, scaling such computers to a large number of processors has shown to be difficult mainly due to the contention and the latency associated with their memory systems. To cope with these problems, a unified approach has been to use caches in conjunction with a directory-based cache coherence protocol implemented in hardware [14].
The objective of directory-based cache coherence protocols is to reduce memory system contention by exclusively sending point-to-point messages to those caches that share a copy of a memory block { the sharing set. In limited directory protocol [5], exact information is maintained only when the sharing set is less or equal to the number of pointers. However, when the sharing set is large, limited directory-based
protocols perform poorly due to broadcast, pointer replacement, or software overhead, that is intrinsic to this class of protocols. Instead, in this paper we focus on the performance and implementation tradeoffs of three protocols that maintain exact information about the sharing set but that differ considerably in hardware complexity { full-map, linear-list, and tree-based protocols.
In full-map directory schemes [4, 13] the cache pointers are represented by a bit vector containing N bits, where N is the number of caches. Since a bit vector is associated with each memory block, the resulting implementation cost for the directory is unacceptable for multiprocessors containing several hundreds of caches.
For large system configurations, say 1024 nodes, researchers have explored other approaches to reduce the implementation cost of the directories. In linkbased protocols the hardware cost is reduced by organizing the caches that have a copy of a memory block into a link structure such as a linear list or a tree. IEEE P1596, known as the Scalable Coherent Interface (SCI) [9], associates two cache pointers with each cache line. All the caches that share a memory block form a double-linked list. Unfortunately, the SCI incurs a severe performance penalty in handling write operations because the invalidation or update messages have to traverse the list of caches resulting in a write latency of O(n), given n caches.
In an earlier work [11], we have proposed a new directory-based cache coherence protocol called the Scalable Tree Protocol (STP). The STP arranges the caches sharing the same memory block in a tree structure. It has a slightly higher implementation cost than the SCI. Thanks to the fact that the sharing set is arranged in a tree-structure, the write latency in the STP is O(log n). A critical issue in the design of treebased cache coherence protocols is how to reduce read and write latencies.
In this paper we investigate the intrinsic performance tradeoffs of full-map, linear-list, and tree-based