A Lockup-free Multiprocessor Cache Design
Per Stenstr?om, Fredrik Dahlgren, and Lars Lundberg
Department of Computer Engineering, Lund University
P.O. Box 118, S-221 00 LUND, Sweden
Shared memory latency is an important performance limitation of large-scale, shared-memory multiprocessors. Although it can be reduced by using private caches and pipelined networks, pipelining is restrictive due to access order requirements as imposed by the programming model.
By letting the compiler (or user) provide access order information to the memory system, considerable performance improvements can be obtained. In this paper, we report on the design of a processing element memory subsystem, in essence a lockup-free cache, that exploits this and controls pipelining and cache coherence for a general class of pipelined networks.
Large-scale, shared-memory multiprocessors rely on private caches to reduce performance degradation due to contention and network latency . Even if caches are infinitely large, communication between the caches are needed in order to maintain cache consistency.
In order to hide the network latency, one can use pipelined networks and exploit memory access pipelining. The buffered MIN in IBM RP3  is one example. The correctness of parallel programs relies on a welldefined access order model. Sequential consistency  is the strictest model and assumes that the result of the execution is the same as if memory accesses are performed in program order. In general, pipelining must be avoided which degrades performance.
Other models restrict access order to synchronization points. They have been formulated differently in terms of pipelining requirements such as the weakly ordered model originally proposed by Dubois et al. , the DRF0-model , and release consistency .
By letting the compiler (or user) provide access ordering information, we can formulate sufficient conditions that can be exploited by the memory system.
This results in performance improvements due to less restrictive pipelining requirements.
In this paper, we formulate information that is extracted from the parallel program, and sufficient conditions needed to maintain various access order models. This information is exploited by a lockup-free cache controller in the global memory request pipelining control. We present a complete VLSI-design of the lockup-free cache controller and outline the requirements on other parts of the memory system.
Related work on this topic is the uniprocessor, lockup-free cache proposal by Kroft . Scheurich and Dubois proposed a multiprocessor lockup-free cache which supports the weakly ordered model and is aimed at bus-oriented architectures . Important objectives of our design were to support general access order models and make it applicable to a general class of networks.
2 Access order information
In this section, we consider three access order models; sequential consistency, weak ordering, and release consistency. We use a classification scheme for variables and define a relation that constrains access order for the correctness of parallel programs under these models.
2.1 Access order relations
The access order relation is based on the notion of performing a memory access" specified by Dubois et al.. In Section 3, we use our classification scheme and relation to formulate sufficient pipelining requirements in terms of memory access performance.
Consider an access sequence S = (a1; a2; : : : ; an) as specified by the program for a single processor, where ai 2 C which is a set of consistency properties. The elements of C = fweak (w), ordinary (o), strong (s), fence (f), gather (g)g. The parallel program is correct if accesses are performed in the order specified by S.