Simple Compiler Algorithms to Reduce Ownership Overhead
in Cache Coherence Protocols
Jonas Skeppstedt and Per Stenstr?om
Department of Computer Engineering, Lund University
P.O. Box 118, S-221 00 Lund, SWEDEN
We study in this paper the design and efficiency of compiler algorithms that remove ownership overhead in sharedmemory multiprocessors with write-invalidate protocols. These algorithms detect loads followed by stores to the same address. Such loads are marked and constitute a hint to the cache to obtain an exclusive copy of the block. We consider three algorithms where the first one focuses on load-store sequences within each basic block of code and the other two analyse the existence of load-store sequences across basic blocks at the intra-procedural level. Since the dataflow analysis we adopt is a trivial variation of live-variable analysis, the algorithms are easily incorporated into a compiler.
Through detailed simulations of a cache-coherent NUMA architecture using five scientific parallel benchmark programs, we find that the algorithms are capable of removing over 95% of the separate ownership acquisitions. Moreover, we also find that even the simplest algorithm is comparable in efficiency with previously proposed hardware-based adaptive cache coherence protocols to attack the same problem.
Shared-memory multiprocessors often rely on efficient memory hierarchies with private caches and high-bandwidthinterconnection networks to achieve high performance. To ensure consistency among the multiple caches, write invalidate is a widely used cache coherence policy which combines reasonable cache miss rates and a low amount of network traffic for small block sizes. However, write invalidate is an ownershipbased cache coherence protocol, which implies that when a processor needs to modify a block, it must obtain an exclusive copy and request that all other copies are invalidated to ensure that only one copy is modified at a time. Unfortunately, such ownership acquisitions may stall the processor and result in substantial penalties under sequential consistency . Moreover, even though the memory model is relaxed  the additional traffic caused by ownership acquisitions can result in memory-system contention which as a secondary effect can increase read and synchronisation penalties.
Previous studies have focused on hardware-based as well as software-based approaches to remove explicit ownership requests. Starting with hardware-based methods, Cox and
Fowler  and Stenstr?om et al.  have studied adaptive cache coherence protocols that detect non-overlapping loadstore sequences by different processors to the same block? called migratory sharing. Detection is done by augmenting the cache coherence protocol with extra functionality to detect cache-miss requests followed by ownership requests. When such sequences are detected, the block is deemed migratory. Subsequent cache-miss requests for migratory blocks will then obtain an exclusive copy of the block that eliminates explicit ownership acquisitions for the stores. Stenstr?om et al.  found that for applications with substantial migratory sharing, most of the ownership requests were removed which reduced the execution time by as much as 35% under sequential consistency.
Continuing with software-based approaches, Hill et al. have proposed a programming notation called CICO (CheckIn/Check-Out) . With CICO, the programmer provides information about when a data structure is exclusively used. This information can be used by the compiler to insert instructions that prefetch an exclusive copy of the block into the cache. Mowry and Gupta have studied the use of such exclusive-prefetch instructions . They have found that explicit ownership requests associated with the stores can be effectively removed if inserted far ahead of the stores. To be feasible, such instructions have to be inserted by the compiler?and not by the programmer. In a follow-up study, Mowry et al. [15, 13] have designed and evaluated compiler algorithms for the automatic insertion of prefetch instructions in general, and exclusive-prefetch instructions in particular. Their algorithms predict statically which load instructions cause cache misses and insert exclusive prefetch instructions if such loads are followed by a store to the same block. A limitation of their approach, however, is due to the difficulty in predicting cache misses statically in the general case. This may prevent insertion of exclusive prefetches or cause an intolerable instruction overhead if too many exclusive-prefetch instructions are inserted.
We have designed compiler algorithms that can detect and cut the ownership overhead significantly. Like the approach taken by Mowry , our algorithms detect load-store sequences in the program code. Unlike their approach, however, loads followed by stores to the same address are marked and act as a hint to the cache system. Thus, it is up to the cache system to decide whether or not to fetch an exclusive copy. As a result, our approach neither causes any instruction-count overhead, nor does it have to deal with the problem of predicting cache misses statically. This fact simplifies the design of the algorithms and also makes them more general.
The contributions of this paper are the design of three compiler algorithms as well as a detailed performance evalu-