Abstract
A number of experiments regarding the placement of instructions, private data and shared data in the Non-Uniform Memory Access multiprocessor RP3 have been performed.
Three scientific/mathematical workloads were used in the experiments, and the results have been captured in a simple performance model that takes linear contention into consideration.
The results indicate that it may well be feasible not to have memory local to the processors in RP3-like architectures. There seems to be a trade-off between the effort spent on the design of the memory system and the interconnection network, and the use of local memory, which can be costly in terms of prohibited process migration and more complicated software management.
1.0 Introduction
In the construction of highly parallel multiprocessors, different approaches have been adopted in order to achieve efficient access to main memory. Some use Uniform Memory Access (UMA) and rely on processor cache memories to keep performance from degrading. Others, with Non-Uniform Memory Access (NUMA), achieve this by placing parts of the memory space in memory close to a processor, thus reducing the traffic in the interconnection network. The different architectures require different techniques to reduce network and memory contention.
The IBM RP3 multiprocessor can be said to belong to both of these classes, since its memory can be configured as local memory, which is costly to access from remote processors, and/or as global memory, which is almost uniformly accessible by all processors. It has processor caches and a high-bandwidth, low-latency multistage interconnection network.
The most commonly mentioned advantages of shared memory multiprocessors are ease of programming and load balancing. Programming is easy because it does not matter where shared data are located; load balancing is easy because, with shared memory equally accessible from all processors, it makes no difference on which processor a specific thread executes. (In this paper, the term thread denotes the smallest concurrently executable unit; see Section 2.2.) A thread may even move between processors in order to utilize them more efficiently.
A uniform memory access multiprocessor architecture has the processors on one side of the interconnection network and the memory on the other side. All memory references have to go through the interconnection network. There may be local processor caches in order to minimize the traffic through the interconnection network. In contrast, in the NUMA architecture there is memory close to each processor that is accessible without using the interconnection network. The shared memory paradigm makes it possible, however, to access other processors' memory remotely through the interconnection network.
An obvious advantage of the NUMA architecture is the possibility to access memory locally, without going through the network, which may be time consuming and/or a source of contention that degrades the performance of the system. Memory locations which are accessed by only one thread can be placed in local memory; this increases performance through shorter mean memory access times and reduces the probability of contention. However, local memory has some major disadvantages. For instance, if the local memory is fully utilized, there is a substantial amount of processor-specific state information, which might make thread migration prohibitively expensive. Also, if the difference in how local and remote memory are accessed is not taken into account during memory allocation, performance can be severely degraded by interconnection network contention and/or longer memory access times.
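The effect of local placement on the mean memory access time can be illustrated with a minimal sketch. It assumes a simple weighted average of a local and a remote access time and ignores contention entirely; it is not the performance model used in this paper. The function name `mean_access_time` and the latency values are hypothetical and chosen only for illustration; they are not RP3 measurements.

```python
def mean_access_time(local_fraction, t_local, t_remote):
    """Weighted mean of local and remote access times (contention ignored)."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be in [0, 1]")
    return local_fraction * t_local + (1.0 - local_fraction) * t_remote


# Hypothetical latencies in processor cycles (illustrative, not measured).
T_LOCAL = 10.0
T_REMOTE = 20.0

# Mean access time as a growing share of references is served locally.
for fraction in (0.0, 0.5, 1.0):
    print(fraction, mean_access_time(fraction, T_LOCAL, T_REMOTE))
```

Under these assumptions the mean access time falls linearly from the remote latency to the local latency as more references are served locally. A model that includes contention would add a load-dependent term to the remote access time.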
This paper reports on experiments conducted on the IBM RP3 comparing the use of global, uniformly accessible memory with the use of local memory close to the processors. The 64-processing-element RP3 prototype can be configured with entirely global memory, entirely local memory, or a mix of both, and the user can choose which parts of
Local vs. Global Memory in the IBM RP3:
Experiments and Performance Modelling.
Mats Brorsson
Department of Computer Engineering, Lund University
P.O. Box 118, S-221 00 LUND, Sweden
In Proceedings of the 3rd IEEE Symposium on Parallel and Distributed Processing, Dallas, Texas, December 1991
0-8186-2310-1/91 $1.00 © 1991 IEEE