A Program-driven Simulation Model of an
Department of Computer Engineering, Lund University
P.O. Box 118, S-221 00 Lund, Sweden
A simulation model that supports very accurate modeling of multiprocessors with a hierarchical, packet-switched interconnection network and private caches is explored. The simulation system contains workload simulators and a memory system simulator. The workload simulators are program-driven, i.e. they actually execute programs. The time unit of the simulator is the time between two consecutive memory references from the processors. The performance of the simulation model, although acceptable, could be improved using a trace-driven approach. We show, however, that results obtained from trace-driven simulation methods in the course of multiprocessor performance evaluation are generally not valid. Furthermore, we show that in the evaluation of certain processor architectural features, such as non-blocking architectures, the program-driven approach is necessary.
Shared-memory multiprocessors constitute an important class of computers for meeting the ever-present need for increased computing power. As new architectures are proposed, a need arises to evaluate the performance of the interconnection network, memory system, and cache coherence protocol. This can be done by building prototypes, by simulation, or by analytic methods. Building prototypes of such systems is not feasible, and it is practically impossible to derive accurate analytic models. Therefore, it is commonly agreed that simulation is the only feasible approach.
Traditionally, simulators are trace-driven. This means that the simulation is based on memory references recorded when running the program [Fer78]. This method is extensively used to evaluate uniprocessor caches, see e.g. [Smi87], and there are trace-capturing methods to include references from system calls and the operating system [ASH86]. Trace-driven simulation has also been used to analyze multiprocessors [EK89].
Several problems of the trace-driven approach when exploring multiprocessors are pointed out by Bitar [Bit90]. The traces depend on the host architecture on which they were captured, which may lead to incorrect performance results. When simulating a new architecture, the order of events, the latency of the network, and the number of processors might be completely different from those of the trace host, which makes the traces useless. Synchronization skewing, for example, may give rise to unreal sharing.
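The timing dependence described above can be illustrated with a minimal sketch. It is not taken from the simulator presented in this report: the function name, its parameters, and the numeric values are hypothetical. It counts the memory references that a processor issues while spin-waiting on a lock, given the simulated latency of one probe of the lock variable.

```python
def spin_wait_references(release_time, probe_latency):
    """Count the references issued while spinning on a lock (hypothetical model).

    release_time:  simulated time at which the lock holder releases the lock
    probe_latency: simulated time taken by one probe of the lock variable
    """
    now = 0
    references = 0
    while True:
        now += probe_latency      # one probe of the lock variable completes
        references += 1
        if now >= release_time:   # this probe observes the released lock
            return references


# Reference stream length on a (hypothetical) fast trace host ...
refs_on_trace_host = spin_wait_references(release_time=100, probe_latency=5)
# ... and on a (hypothetical) target with a slower interconnection network.
refs_on_target = spin_wait_references(release_time=100, probe_latency=20)
```

Under these assumed numbers the same program issues 20 references on the trace host but only 5 on the target, so a trace recorded on the host would replay references that the target machine would never generate.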
Another problem with trace-driven simulation arises when exploring a non-blocking processor with a lockup-free cache [SD88]. Data dependency constitutes an important limit on the performance gain of non-blocking processors [DS90]. The traces must therefore contain information on data dependency, e.g. between registers, which complicates the trace-capturing process and restricts the flexibility of this kind of simulation.
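As a rough illustration of why data dependency matters, the following sketch models a tiny in-order instruction stream. All of it is an assumption made for illustration (the instruction format, the one-cycle issue cost, and the rule that every load misses); it is not the processor model of the simulator described in this report. A non-blocking processor continues past an outstanding load and stalls only when an instruction reads a register that has not yet been filled.

```python
def run(program, miss_latency, non_blocking):
    """Return the simulated completion cycle of a small instruction stream.

    program: list of (kind, dest, sources), where kind is "load" or "alu".
    Assumptions: in-order issue, one cycle per instruction, and every load
    misses in the cache and takes miss_latency additional cycles.
    """
    cycle = 0
    ready_at = {}  # register name -> cycle at which its value is available
    for kind, dest, sources in program:
        # Stall until all source registers have been produced.
        for src in sources:
            cycle = max(cycle, ready_at.get(src, 0))
        cycle += 1  # issue the instruction
        if kind == "load" and non_blocking:
            ready_at[dest] = cycle + miss_latency  # miss overlaps later work
        elif kind == "load":
            cycle += miss_latency                  # processor waits for the miss
            ready_at[dest] = cycle
        else:
            ready_at[dest] = cycle
    return max([cycle] + list(ready_at.values()))


# The consumer of r1 comes last: independent work can overlap the miss.
independent_first = [("load", "r1", []), ("alu", "r2", []),
                     ("alu", "r3", []), ("alu", "r4", ["r1"])]
# The consumer of r1 follows the load directly: nothing can overlap.
dependent_first = [("load", "r1", []), ("alu", "r4", ["r1"]),
                   ("alu", "r2", []), ("alu", "r3", [])]
```

With an assumed miss latency of 10 cycles, `run(independent_first, 10, True)` finishes in 12 cycles against 14 for the blocking case, whereas `dependent_first` takes 14 cycles in both cases. The gain thus depends entirely on register-level dependency information, which a plain address trace does not carry.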
This report presents the design and implementation of a program-driven simulator of an MIMD multiprocessor [Fly66]. In a program-driven simulator, the processing elements execute real code and can thereby run applications. Since every action in the processors is simulated, the multiprocessor can be accurately modeled, and the problems with architecture dependency no longer exist. Synchronizations and the actions due to processor communication can be accurately analyzed, and complete knowledge about the data dependency is obtained. A disadvantage of this type of simulator is that it is execution-intensive, which makes simulation very time-consuming. Therefore, the objectives of the simulator approach reported in this paper have been both accuracy and efficiency. The approach has proven to be advantageous when analyzing synchronization traffic and cache coherence schemes on a fine-grain basis, and when analyzing lockup-