This work was partly supported by the Swedish National Board for Technical Development (NUTEK) under contract number 9001797 and by the National Science Foundation under Grant No. CCR-9115725.
Combined Performance Gains of Simple Cache Protocol Extensions
Fredrik Dahlgren, Michel Dubois, and Per Stenstr?m
*Department of Electrical Engineering-Systems
University of Southern California
Los Angeles, CA90089-2562, U.S.A.
Department of Computer Engineering
P.O. Box 118, S-221 00 LUND, Sweden
We consider three simple extensions to directory-based cache coherence protocols in shared-memory multiprocessors. These extensions are aimed at reducing the penalties associated with memory accesses and include a hardware prefetching scheme, a migratory sharing optimization, and a competitive-update mechanism. Since they target different components of the read and write penalties, they can be combined effectively.
Detailed architectural simulations using five benchmarks show substantial combined performance gains obtained at a modest additional hardware cost. Prefetching in combination with competitive-update is the best combination under release consistency in systems with sufficient network bandwidth. By contrast, prefetching plus the migratory sharing optimization is advantageous under sequential consistency and/or in systems with limited network bandwidth.
Private caches in conjunction with directory-based, writeinvalidate protocols are essential, but not sufficient, to cope with the high memory latencies of large-scale shared-memory multiprocessors. Therefore, many recent studies have focused on techniques to tolerate or reduce the latency. In  Gupta et al. compared three approaches to tolerating latency, which are relaxed memory consistency models, software-based prefetching, and multithreading. Whereas the relaxation of the memory consistency model improves the performance of all applications uniformly, the effects of prefetching and multithreading vary widely across the application suite. The success of software-based prefetching is predicated on effective algorithms to predict cache misses at compile
time. Multithreading is only applicable to specially designed processors with more than one context and it requires more application concurrency.
By contrast, in this paper, we evaluate simple extensions to the cache protocol itself. These extensions are completely hardware-based, are transparent to the software, and are compatible with traditional, single-context processors. Moreover, they add only marginally to the overall system complexity while still providing a significant performance advantage when applied separately. We focus on three extensions, adaptive sequential prefetching , a migratory sharing optimization [2,12] and a competitive-update mechanism [10,4].
Adaptive sequential prefetching cuts the number of read misses by fetching a number of consecutive blocks into the cache in anticipation of future misses. The number of prefetched blocks is adapted according to a dynamic measure of prefetching effectiveness which reflects the amount of spatial locality at different times. This simple scheme relies on three modulo-16 counters per cache and two extra bits per cache line, and it removes a large number of cold, replacement, and (sometimes) true sharing misses. Moreover, as opposed to simply adopting a larger block size, it does not affect the false sharing miss rate . The migratory sharing optimization is aimed at reducing the write penalty for migratory blocks, a major component of the overall memory access penalty under sequential consistency. Although this optimization requires only minor modifications to the cache coherence protocol, two independent studies [2,12] have shown that it is very successful in many cases.
The above two protocol extensions are fairly ineffective when it comes to coherence misses. On the other hand, it is well-known that write-update protocols have no coherence miss penalty. The major drawback of these protocols is their large write memory traffic. Competitiveupdate protocols trade off traffic and penalty by mixing updates and invalidations. The idea is simple. Instead of invalidating a block copy at the first write by another pro-
In Proc. of the 21st Ann. Int. Symp. on Computer Architecture (21st ISCA), pp. 187-197, April 1994