page 1  (4 pages)
2to next section

A Latency-Hiding Scheme for Multiprocessors

with Buffered Multistage Networks

Per Stenstr?om

Department of Computer Engineering, Lund University

P.O. Box 118, S-221 00 LUND, Sweden

Abstract

Multistage networks for large-scale shared-memory multiprocessors are buffered to increase the throughput and hide the latency. This is achieved by pipelining consecutive memory requests from the same processor. However, unrestrictive pipelining may violate strict memory consistency models such as sequential consistency since memory requests are not guaranteed to be performed in program order.

In this paper we propose and evaluate a novel access ordering scheme that allows the processors to pipeline memory requests under the sequential consistency model. This is achieved by access ordering mechanisms at memory that detect when a request arrives out of order. Simulations show that the scheme manages to improve the processor utilization significantly by hiding network latency through pipelining.

1 Introduction

An important design issue for large-scale sharedmemory multiprocessors is to reduce the performance degradation due to memory system latency and contention. It is commonly agreed that the use of private caches is one of the most important techniques to solve this problem [10]. While private caches reduce latency to private data, caching of shared read-write data results in the cache coherence problem.

The cache coherence problem can be solved by using a hardware-implemented cache consistency protocol. In recent implementations, such as the Stanford DASH, cache coherence is maintained by a directorybased, write-invalidate protocol [7]. Upon a write request, a directory keeps track of which caches have copies and an invalidation is sent to those caches. The amount of memory needed to keep track of copies makes hardware-implemented protocols less attractive. An alternative is to maintain cache coherence by software-controlled actions that avoid to cache data structures that may cause inconsistency [10]. Examples of such implementations are the IBM RP3 [8] and the Cedar system [5].

RP3 and Cedar use multistage networks (MINs). Each request has to pass O(log N ) switches given N processors. Clearly, network latency for large-scale machines of several hundred processors may degrade the performance considerably. In order to hide this network latency, each network switch may contain buffers that allow requests from each processor to be

pipelined. Now if two consecutive requests are routed to different memorymodules, they may arrive at memory in reverse order because of contention in switch buffers. Since the result of the execution of a parallel program may depend on the intuitive notion that memory requests are performed in program order, program correctness may be violated. This correctness model has been formalized as the sequential consistency model by Lamport [6]. In order to maintain sequential consistency for multiprocessors using buffered MINs, pipelining of memory requests must generally be avoided.

This paper suggests an access ordering mechanism at memory that detects out-of-order requests. It allows requests to be pipelined and thus reduces the impact of network latency. Simulations show that the scheme provides a 20% speedup of processor utilization for the studied workloads.

In Section 2, we show how the correctness of a parallel program can be violated if memory requests are pipelined in buffered MINs. Sufficient conditions to support the sequential consistency model are also presented. We then present the new latency-hiding scheme in Section 3 and outline its implementation. In Section 4, we evaluate the performance improvement. Finally, in Section 5 we conclude the results.

2 Access order violation in MINs

Consider a multiprocessor architecture that interconnects N processors with N memory modules by a buffered MIN. In Figure 1, eight processors are interconnected with eight memory modules using three stages of switches. Each switch is a 2?2 crossbar with a buffer associated with each input. The buffers make it possible to pipeline memory requests in a packetswitched fashion.

While a private cache may be associated with each processor, we assume that shared read-write data is not cached. Thus, memory requests to shared readwrite data are satisfied by the memory modules.

Now consider the following example parallel program consisting of two processes P1 and P2 that are executed by two processors according to Figure 1.

P1: P2:
A = 1; B = 1;
while (B == 1); while (A == 1);
mutex; mutex;