To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware techniques is that they need no support from the programmer or compiler.
Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution. However, since the prefetching efficiency varies during the execution of a program, we propose to adapt the number of prefetched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show significant reductions of the read penalty and of the overall execution time.
Shared-memory multiprocessors offer significant performance improvements over uniprocessors at an affordable cost while preserving the intuitively appealing programming model provided by a single, linear memory address space. With the advent of ultra-fast uniprocessors and of massively parallel systems, the shared-memory access penalty (i.e., the average time each processor is blocked on an access to shared data) becomes a significant component of program execution time. Private caches in conjunction with hardware-based cache coherence  are effective at reducing this latency. In a system with caches, most of the memory access penalty originate from the miss rate of read requests, i.e., the fraction of read requests that miss in the cache. By contrast, the latency of write requests can easily be hidden by appropriate write buffers and weaker memory consistency models .
Prefetching is a common technique to reduce the read miss penalty. Good prefetching mechanisms can predict which blocks will miss in the future and bring these blocks
in the cache before they are accessed. Software-controlled prefetching  relies on the user or compiler to insert prefetch instructions in the code. Hardware-controlled prefetching [6, 9, 10] usually takes advantage of the regularity of data accesses in scientific computations to detect access strides dynamically and requires complex hardware.
The main theme of this paper is to evaluate two simple hardware-controlled prefetching schemes with a low implementation cost and requiring no compiler support. These two schemes are called fixed and adaptive sequential prefetching and are nonbinding, meaning that a block prefetched in a cache remains subject to the coherence protocol. In sequential prefetching, the cache controller prefetches on a miss the K blocks following the missing block, where K is the degree of prefetching. In fixed sequential prefetching, the degree of prefetching remains constant throughout the execution whereas it varies dynamically in the case of adaptive sequential prefetching.
Section 2 shows the usefulness of the schemes by reviewing the effects of the cache block size on the miss rate and network traffic. We then describe the two hardware schemes in Section 3 and, in Sections 4 and 5, we evaluate their performance with a detailed multiprocessor simulator model and a set of six benchmark programs. We find that fixed sequential prefetching provides significant read miss reduction in some cases in spite of its simplicity. The adaptive technique, although slightly more expensive to implement, yields considerably lower miss rates coupled with low traffic overhead. We finally relate our results to others? in Section 6 and conclude the paper in Section 7.
2 IMPACT OF BLOCK SIZE
In this section, we review the effects of cache block sizes on the miss rate and memory traffic in multiprocessors. This is important in order to understand the effects of sequential prefetching and to compare them to prefetching effects due to larger block sizes. (Increasing the block size is the simplest and best known means of prefetching.)
A number of studies have reported the effects of block size variations on cache miss rates in the context of shared-memory multiprocessors [5, 8]. We need to distinguish between the different types of misses. In the case of
Fixed and Adaptive Sequential Prefetching in Shared Memory Multiprocessors
Fredrik Dahlgren, Michel Dubois*, and Per Stenstr?m
Department of Computer Engineering
P.O. Box 118, S-221 00 LUND, Sweden
*Department of Electrical Engineering-Systems
University of Southern California
Los Angeles, CA90089-2562, U.S.A.
This work was partly supported by the Swedish National Board for Technical Development (NUTEK) under contract number 9001797 and by the National Science Foundation under Grant No. CCR-9115725.
In Proceedings of the 1993 International Conference on Parallel Processing, vol. I, pp. 56-63, August 1993