Previous studies of bus-based shared-memory multiprocessors have shown hybrid write-invalidate/write-update snooping protocols to be incapable of providing consistent performance improvements over write-invalidate protocols. In this paper, we analyze the deficiencies of hybrid snooping protocols under release consistency, and show how these deficiencies can be dramatically reduced by using write caches and read snarfing.
Our performance evaluation is based on program-driven simulation and a set of five scientific applications with different sharing behaviors including migratory sharing as well as producer-consumer sharing. We show that a hybrid protocol, extended with write caches as well as read snarfing, manages to reduce the number of coherence misses by between 83% and 95% as compared to a write-invalidate protocol for all five applications in this study. In addition, the number of bus transactions is reduced by between 36% and 60% for four of the applications and by 9% for the fifth application. Because of the small implementation cost of the hybrid protocol and the two extensions, we believe that this combination is an effective approach to boost the performance of bus-based multiprocessors.
Private caches are essential to reduce bus congestion and to cope with the latency of memory references in bus-based shared-memory multiprocessors. In such systems, snooping cache coherence protocols are commonly accepted as an effective approach to keep shared data coherent, since they utilize the simple and effective broadcast capability of a single bus.
Boosting the Performance of Hybrid Snooping Cache Protocols
Department of Computer Engineering, Lund University
P.O. Box 118, S-221 00 LUND, Sweden
Internet: [email protected], http://www.dit.lth.se/~fredrik/
In a write-invalidate protocol, a write request to a block invalidates all other shared copies of that block. A subsequent read request to an invalidated block results in a coherence miss. The Illinois protocol, which is used in the SGI Challenge multiprocessor, is based on this approach. In a write-update protocol, on the other hand, each write request to shared data updates all other copies of the block, and the block remains shared. Although there are fewer read misses for a write-update protocol, the write traffic on the bus is often so much higher that overall performance decreases. Hybrid write-invalidate/write-update protocols aim at eliminating coherence misses for actively shared blocks while avoiding useless updates to other blocks. One such protocol is competitive-snooping, where a block copy is updated at first, but if the local processor does not access the block during a specific number of updates from other processors, the copy is invalidated. A similar protocol was proposed by Archibald, henceforth referred to as the Archibald protocol, where all shared copies are updated until the same processor has issued a specific number of updates while no other processor has accessed the block. Eggers and Katz evaluated the performance of write-update, write-invalidate, and competitive-snooping for four applications and found that no protocol performed best for all applications, because the best-performing protocol depended on the sharing behavior of the application. The problem is to find a protocol where the bus traffic needed to keep an actively shared block updated is consistently lower than the coherence traffic of a write-invalidate protocol for all applications.
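The two hybrid policies described above differ mainly in where the counter sits: competitive-snooping counts at each receiving copy, while the Archibald protocol counts consecutive updates from one writer. The following Python sketch illustrates that difference only. The class names, the `threshold` parameter, and the choice that the first write after `threshold` updates triggers the invalidation are assumptions made for illustration; they are not taken from the protocol definitions.

```python
class CompetitiveSnoopingCopy:
    """One cached copy of a block (hypothetical model, not the real hardware)."""

    def __init__(self, threshold: int):
        self.threshold = threshold   # assumed tunable parameter
        self.valid = True
        self.remote_updates = 0      # updates received since the last local access

    def local_access(self) -> bool:
        """Local read or write. Returns True on a hit, False on a coherence miss."""
        hit = self.valid
        self.valid = True            # on a miss the block is fetched again
        self.remote_updates = 0      # any local access resets the counter
        return hit

    def remote_update(self) -> None:
        """An update from another processor arrives on the bus."""
        if not self.valid:
            return
        self.remote_updates += 1
        if self.remote_updates >= self.threshold:
            self.valid = False       # copy looks unused: invalidate it locally


class ArchibaldBlock:
    """Per-block view of the writer-side counter (hypothetical model)."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.last_writer = None
        self.run_length = 0          # consecutive updates by last_writer
        self.shared = True

    def read(self, proc: int) -> None:
        """A read by another processor interrupts the writer's run."""
        if proc != self.last_writer:
            self.run_length = 0
            self.shared = True

    def write(self, proc: int) -> str:
        """Returns the bus action taken: 'update', 'invalidate', or 'local'."""
        if proc != self.last_writer:
            self.last_writer = proc
            self.run_length = 0
            self.shared = True
        if not self.shared:
            return "local"           # block is exclusive: no bus traffic
        if self.run_length >= self.threshold:
            self.shared = False      # assumption: this write invalidates the others
            return "invalidate"
        self.run_length += 1
        return "update"
```

In this model, a long run of writes from one processor costs `threshold` update transactions plus one invalidation before the block becomes exclusive, which is the write-sequence cost the text identifies as a deficiency.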
In this paper, we analyze the deficiencies of competitive-snooping and the Archibald protocol, and show how these deficiencies can be dramatically reduced by simple protocol extensions. We identify the major deficiency to be that too many writes are issued in a sequence from the same processor to the same block. We show that a write cache [4,11] is capable of clustering these writes under release consistency, which leads to fewer update transactions that have to be sent on the bus. A write cache allocates a block frame on a write to shared or invalid data, and coalesces writes to the same block from the local processor. At synchronization points in the program, or when a block is replaced from the write cache, the coalesced writes are transferred to other copies as a single update transaction per block. We show that this has a dramatic impact on the effectiveness of the Archibald protocol.
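The coalescing behavior of a write cache can be sketched in a few lines of Python. This is a minimal model under stated assumptions: the `WriteCache` class, its `num_frames` parameter, the FIFO replacement policy, and the `bus_log` list standing in for the bus are all hypothetical and chosen only to make the idea concrete.

```python
from collections import OrderedDict


class WriteCache:
    """Coalesces local writes per block; emits one update per block (sketch)."""

    def __init__(self, num_frames: int, bus_log: list):
        self.num_frames = num_frames      # assumed small, e.g. a handful of frames
        self.frames = OrderedDict()       # block address -> {word offset: value}
        self.bus_log = bus_log            # stand-in for update transactions on the bus

    def write(self, block: int, word: int, value) -> None:
        """A write to shared or invalid data allocates or reuses a block frame."""
        if block not in self.frames:
            if len(self.frames) >= self.num_frames:
                # Replacement (FIFO is an assumption): flush the oldest frame.
                victim, words = self.frames.popitem(last=False)
                self._send_update(victim, words)
            self.frames[block] = {}
        self.frames[block][word] = value  # later writes overwrite earlier ones

    def release(self) -> None:
        """Synchronization point: flush every frame, one transaction per block."""
        while self.frames:
            block, words = self.frames.popitem(last=False)
            self._send_update(block, words)

    def _send_update(self, block: int, words: dict) -> None:
        self.bus_log.append((block, dict(words)))


# Usage: three writes to the same block cause a single update at the release.
log = []
wc = WriteCache(num_frames=2, bus_log=log)
wc.write(0x40, 0, 1)
wc.write(0x40, 1, 2)
wc.write(0x40, 0, 3)
wc.release()
print(len(log))
```

Deferring the updates until the release is what release consistency permits: other processors are not required to observe the individual writes before the synchronization point.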
Another weakness of the Archibald protocol is that if an actively shared block happens to become exclusive, all the nodes that invalidated the block might encounter a subsequent coherence miss. We show how this problem can be solved effectively by read snarfing (also called read-broadcast) [10,14], which means that a data block that is transferred on the bus as a read response not only updates the node that requested it, but also updates all other caches having the block invalidated.
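Read snarfing as described above can be illustrated with a small Python sketch. The representation is an assumption made for brevity: each cache is a dictionary from block address to a `(state, data)` pair, and `read_response` is a hypothetical helper that models a single read response on the bus.

```python
INVALID, SHARED = "I", "S"


def read_response(caches, requester, block, data, snarfing=True):
    """Deliver a read response on the bus; return how many caches snarfed it.

    caches: list of dicts mapping block address -> (state, data).
    A cache snarfs the data only if it still holds a frame for the block
    in the INVALID state, i.e. its copy was invalidated earlier.
    """
    caches[requester][block] = (SHARED, data)
    snarfed = 0
    if snarfing:
        for i, cache in enumerate(caches):
            if i == requester:
                continue
            state, _ = cache.get(block, (None, None))
            if state == INVALID:
                cache[block] = (SHARED, data)   # revalidate without a bus request
                snarfed += 1
    return snarfed


# Usage: caches 1 and 2 hold invalidated copies of block 7; cache 3 never had it.
caches = [{}, {7: (INVALID, None)}, {7: (INVALID, None)}, {}]
count = read_response(caches, requester=0, block=7, data="d")
print(count)
```

With snarfing enabled, one read miss revalidates every invalidated copy, so the other former sharers avoid their own coherence misses on the same block.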
Neither write caches nor read snarfing has previously been considered for a hybrid write-invalidate/write-update snooping protocol. As compared to a write-invalidate protocol, the Archibald protocol extended with write caches and read snarfing eliminates between 83% and 95% of the coherence misses for all applications we have evaluated, while the number of bus transactions is reduced by between 36% and 60% for four out of five applications and by 9% for the fifth application.
In Proc. of the 22nd Ann. Int. Symp. on Computer Architecture (22nd ISCA), June 1995