Using Write Caches to Improve Performance of Cache Coherence
Protocols in Shared-Memory Multiprocessors
Fredrik Dahlgren and Per Stenström
Department of Computer Engineering, Lund University
P.O. Box 118, S-221 00 LUND, Sweden
Abstract
Write-invalidate protocols suffer from memory-access penalties due to coherence
misses. While write-update or hybrid update/invalidate protocols can reduce
coherence misses, their update traffic can increase memory-system contention. We
show in this paper that update-based cache protocols can perform significantly
better than write-invalidate protocols by incorporating a write cache in each processing
node. Because it is legal to delay the propagation of modifications of a
block until the next synchronization under relaxed memory consistency models, a
write cache can significantly reduce traffic by exploiting locality in write accesses.
By concentrating on a cache-coherent NUMA architecture, we study the
implementation aspects of augmenting a write-invalidate, a write-update and two
hybrid update/invalidate protocols with write caches. Through detailed architectural
simulations using five benchmark programs, we find that write caches, with
only a few blocks each, help write-invalidate protocols to cut the false-sharing
miss rate and hybrid update/invalidate protocols to keep other copies, including
the memory copy, clean at an acceptable write traffic level. Overall, the memory-access
penalty associated with coherence misses is drastically reduced.
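The write-combining idea summarized above can be illustrated with a minimal sketch. It is not the implementation studied in this paper: the class name `WriteCache`, the 32-byte block size, the FIFO replacement policy, and the message-counting scheme are all assumptions chosen for illustration. The sketch only shows how a few-block write cache can merge repeated writes to the same block and propagate them at replacement or at the next synchronization point.

```python
from collections import OrderedDict

BLOCK_SIZE = 32  # bytes per block (assumed value, for illustration only)


class WriteCache:
    """Toy write cache: combines writes per block until replacement or sync."""

    def __init__(self, num_blocks=4):
        self.num_blocks = num_blocks
        self.blocks = OrderedDict()  # block number -> {offset: latest value}
        self.updates_sent = []       # one (block, dirty words) entry per message

    def write(self, addr, value):
        block, offset = divmod(addr, BLOCK_SIZE)
        if block not in self.blocks:
            if len(self.blocks) == self.num_blocks:
                self._flush_oldest()  # make room (FIFO policy is an assumption)
            self.blocks[block] = {}
        # A later write to the same word overwrites the earlier one locally.
        self.blocks[block][offset] = value

    def _flush_oldest(self):
        block, words = self.blocks.popitem(last=False)
        self.updates_sent.append((block, words))

    def synchronize(self):
        """At a synchronization point, all pending writes must be propagated."""
        while self.blocks:
            self._flush_oldest()


def count_messages(trace, num_blocks=4):
    """Return the number of update messages generated for a write trace."""
    wc = WriteCache(num_blocks)
    for addr, value in trace:
        wc.write(addr, value)
    wc.synchronize()
    return len(wc.updates_sent)


# Hypothetical trace of ten writes: eight to block 0 and two to block 1.
trace = [(0, 1), (4, 2), (0, 3), (8, 4), (32, 5),
         (4, 6), (12, 7), (36, 8), (0, 9), (16, 10)]

if __name__ == "__main__":
    print(count_messages(trace, num_blocks=4))
    print(count_messages(trace, num_blocks=1))
```

For this hypothetical trace, a protocol that propagates every write individually would send ten updates, whereas the sketch with four write-cache blocks sends two, one per block. With a single block, the interleaved accesses force replacements and the count rises to five, which suggests why even a few blocks can matter when write accesses alternate between blocks.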
1. INTRODUCTION
Private caches in conjunction with directory-based cache coherence protocols are key to tolerating the memory-access latencies in large-scale, shared-memory multiprocessors. For example, many recent machines such as the Stanford DASH [20], the MIT Alewife [1], and Kendall Square Research's KSR1 [19] use a directory-based, write-invalidate protocol to allow memory blocks to be replicated across the private caches in each processing node. Unfortunately, as the speed gap between the processors and the memory system continues to increase, write-invalidate protocols alone are not sufficient to achieve acceptable processor utilization. This is because processors need to stall for two reasons: on read misses, and on write accesses to blocks that are not present in the private cache or that are replicated in other caches. In general, whereas the latter stall component can be eliminated by relaxing the memory consistency model [14, 21], the former component is more difficult to attack if processors block on read requests.
To appear in Journal of Parallel and Distributed Computing