*This work was supported by the Swedish National Board for Technical Development (NUTEK) under the contract P855.
Effectiveness of Hardware-Based Stride and Sequential Prefetching
in Shared-Memory Multiprocessors*
Fredrik Dahlgren and Per Stenström
Department of Computer Engineering, Lund University
P.O. Box 118, S-221 00 LUND, Sweden
We study the relative efficiency of previously proposed stride and sequential prefetching, two promising hardware-based prefetching schemes for reducing read-miss penalties in shared-memory multiprocessors. Although stride accesses dominate in four of the six applications we study, we find that sequential prefetching does better than stride prefetching for three applications. This is because (i) most strides are shorter than the block size (we assume 32-byte blocks), which makes sequential prefetching equally effective for stride accesses, and (ii) sequential prefetching also exploits the locality of read misses for non-stride accesses. However, we find that since stride prefetching causes fewer useless prefetches, it consumes less memory-system bandwidth.
Shared-memory multiprocessors with a general interconnection network often suffer from significant processor stall time, mostly because of the latencies associated with read and write accesses in large-scale shared-memory multiprocessors. While the latency of write accesses can easily be hidden by appropriate write buffers and relaxed memory consistency models, most processors have to stall on read accesses until the requested data has been provided by the memory system. Private caches in conjunction with hardware-based cache coherence maintenance are quite effective at reducing the average read stall time. However, because misses due to cache block coherence maintenance and replacements remain, other techniques are needed to reduce the number of read misses.
Prefetching is a promising approach to reducing the number of read misses, and thus the read penalty. Non-binding prefetching does this by bringing into the cache blocks that will be referenced in the future and are not already present in the cache. The value returned by the prefetch is not bound: the prefetched block is still subject to invalidations and updates by the cache coherence mechanism. Non-binding prefetching approaches proposed in the literature can be either software- or hardware-based. Software-controlled prefetching schemes rely on the user/compiler to insert prefetch instructions prior to a reference that would trigger a miss. By contrast, hardware-controlled prefetching schemes exploit the regularity of data accesses in applications, and need no software support to decide what and when to prefetch.
Two promising hardware-based prefetching strategies in shared-memory multiprocessors are sequential [6, 17] and stride prefetching [5, 10, 13, 16]. Sequential prefetching tries to exploit spatial locality across block boundaries by prefetching consecutive blocks in anticipation of future misses. By contrast, stride prefetching detects and prefetches only blocks associated with strided accesses, but does not require spatial locality to be effective. Clearly, the relative performance of the two approaches depends on the amount of spatial locality and of stride accesses in an application.
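To make the contrast concrete, the following sketch (ours, not from the paper; block size and prefetch degree are assumptions) shows the address generation that sequential prefetching performs on a read miss:

```python
BLOCK_SIZE = 32  # bytes; the block size assumed in this study

def sequential_prefetch_candidates(miss_addr, degree=1):
    """On a read miss to miss_addr, return the addresses of the next
    `degree` consecutive blocks -- the blocks a sequential prefetcher
    would fetch in anticipation of future misses."""
    block = miss_addr // BLOCK_SIZE
    return [(block + i) * BLOCK_SIZE for i in range(1, degree + 1)]
```

For example, a miss to address 0 with degree 2 yields prefetches for the blocks at addresses 32 and 64; no access history is consulted, which is why spatial locality is all this scheme can exploit.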
Another notable difference between the two approaches is the amount of hardware support needed. Whereas sequential prefetching in its simplest form only requires a counter associated with each cache, previously published stride prefetching schemes require fairly complex detection mechanisms and also modifications to the processor.
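The extra detection hardware can be pictured as a small table indexed by the address of the load instruction, tracking the last data address and the observed stride per load. The sketch below (a simplified illustration in the spirit of published reference-prediction tables; all names are ours) issues a prefetch only once the same nonzero stride has been seen twice in a row:

```python
class StrideDetector:
    """Simplified per-load stride detection table (illustrative only).
    Each entry maps a load PC to (last_addr, stride, confirmed)."""

    def __init__(self):
        self.table = {}

    def access(self, pc, addr):
        """Record an access by the load at `pc` to data address `addr`.
        Return a predicted prefetch address once the same nonzero stride
        repeats, otherwise None."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (addr, 0, False)  # first sighting of this load
            return None
        last_addr, stride, confirmed = entry
        new_stride = addr - last_addr
        if new_stride == stride and stride != 0:
            # Stride confirmed: predict the next address in the sequence.
            self.table[pc] = (addr, stride, True)
            return addr + stride
        # Stride changed (or not yet established): update, no prefetch.
        self.table[pc] = (addr, new_stride, False)
        return None
```

Note that the table must observe the data address of individual load instructions, which is why such schemes need access to processor-internal state, unlike the simple per-cache counter that sequential prefetching requires.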
This paper evaluates the relative performance as well as the implementation implications of stride and sequential prefetching schemes in a unified framework consisting of a cache-coherent NUMA architecture, which is discussed in detail in Section 2. In Section 3 we describe the prefetching schemes we evaluate. We then evaluate their relative performance in Sections 4 and 5 using detailed architectural simulators and a set of six scientific applications, in four of which stride accesses dominate. Surprisingly, we find that one of the simplest variations of sequential prefetching does as well as or better than the most aggressive stride prefetching scheme in five of the six applications. There are two intuitive reasons for this. First, when strides are shorter than the block size, sequential prefetching, which brings in consecutive blocks, will be effective. Second, unlike stride prefetching schemes, sequential prefetching can also attack misses caused by non-stride accesses that exhibit high spatial locality. This type of miss is fairly dominant
In Proc. of the First Int. Symp. on High-Performance Computer Architecture (HPCA-1), pp. 68-77, Jan. 1995