*This work was supported by the Swedish National Board for Technical Development (NUTEK), contract number 9001797.
Reducing the Write Traffic for a Hybrid Cache Protocol*
Fredrik Dahlgren and Per Stenstr?m
Department of Computer Engineering, Lund University
P.O. Box 118, S-221 00 LUND, Sweden
access from the local processor. A counter that is associated with each cache line does this by counting the number of updates between two consecutive local accesses. While the performance of such a protocol was impressive?it outperformed write-invalidate for two out of four applications?the extra network traffic limited the performance of the other two applications.
We study in this paper how write traffic can be reduced for a competitive-update protocol by incorporating write caches. A write cache uses an allocate-on-a-write-miss, write-back, and a no-allocate-on-a-read-miss strategy, with a single combined dirty/valid bit per word [1, 8]. Under relaxed memory consistency models, it is legal to combine all writes from the same processor in the write cache between two consecutive synchronizations. A writecache block need only be propagated to other caches when a block is replaced or when a synchronization point is reached. Because of the spatial and temporal locality in the write accesses, as we observe experimentally in this study, we show that a write cache with only four entries suffices to drastically reduce the write traffic. The write caches in combination with the lower coherence-miss rate of competitive-update protocols are shown to outperform writeinvalidate protocols for all five benchmark applications.
In Section 2 we discuss the performance limitations of write-invalidate and write-update protocols. Section 3 presents the architectural framework and how a write cache can be incorporated in a hybrid protocol. The experimental results are then presented in Sections 4 and 5. Finally, we relate this study to previous work in Section 6 before we conclude in Section 7.
2.1 Write-Invalidate and Write-Update Protocols
Recent implementations of CC-NUMA architectures, such as the Stanford DASH multiprocessor , use writeinvalidate protocols. This is because they reduce the traffic by exploiting long write-runs  which loosely speaking are write sequences from the same processor to the same block with no intervening access from another processor to that block. Under write-invalidate protocols, a write-run results in exactly one global write transaction, in essence an invalidation. Unfortunately, the cost of this write-traffic reduction is coherence misses that increase the read pen-
Coherence misses limit the performance of write-invalidate cache protocols in large-scale shared-memory multiprocessors. By contrast, hybrid protocols mix updates with invalidations and can reduce the coherence miss rate. The gains of the fewer coherence misses, however, can sometimes be outweighed by contention due to the extra traffic making techniques to cut the write traffic important.
We study in this paper how write traffic for hybrid protocols can be reduced by incorporating a write cache in each node. Detailed architectural simulations reveal that write caches are effective in exploiting locality in write accesses under relaxed memory consistency models. Hybrid protocols augmented with write caches with only a few entries are shown to outperform a write-invalidate protocol for all five benchmark applications under study.
Even though write-invalidate protocols constitute an effective approach to reduce the latency and contention problems of large-scale shared-memory multiprocessors, coherence misses have a predominant effect on processor utilization. In CC-NUMA architectures, this penalty typically includes a latency of several node-to-node traversals because the dirty copy must be first located by interrogating the home node. Thus, it is important to both reduce the coherence miss rate, and the latency taken for each coherence miss.
Write-update protocols  eliminate coherence misses but increase the write traffic due to updates. Although the latency associated with update transactions can be tolerated using a relaxed memory consistency model [7, 11], the increased network traffic can offset the effect of less misses due to contention. A previous study by Nilsson et al.  have shown that hybrid protocols?cache protocols that mix updates and invalidations?can outperform write-invalidate-protocols by selectively updating only copies that are actively being shared. Their study considers a class of hybrid protocols called competitive-update protocols. Such protocols dynamically select cache copies that benefit from being updated by invalidating copies that have been updated a predefined number of times with no
In Proc. of the 1994 International Conference on
Parallel Processing, vol. I, pp. 166-173, August 1994