

Andreas Steininger
Department of Electrical Measurement, Technical University of Vienna, Austria

Johannes Reisinger
Department of Real-Time Systems, Technical University of Vienna, Austria

Abstract. In critical control systems the demands on reliability and proper timing behavior are very high. The usual approach of running sophisticated software on standard hardware, however, is a compromise with many shortcomings.
This paper presents a solution which has been used for the implementation of a processor board for the MARS system. Hardware and software have been tailored to each other from the development phase onward. Every kind of nondeterministic timing behavior of the hardware has been eliminated or at least strictly bounded in order to fulfil the demands on real-time behavior. In contrast to usual approaches, which allow error detection only in hardware or only in software, the integral design makes a combination of both possible. In this way the fail-silent property of the component can be achieved in a reasonable way.

Keywords. Fault tolerance; real time computer systems; self checking computers; computer control systems; operating systems; computer hardware; MARS system.


An important class of distributed computer control systems has to meet very high dependability requirements. The probability of a failure of such a system, for example a flight control system, has to be kept as low as possible. In particular, two main aspects must be considered: the system has to be highly reliable, and its timing behavior has to be completely deterministic.

The use of general purpose hardware, however, significantly limits the achievable degree of system dependability, even if a sophisticated real-time operating system is used. General purpose hardware claims neither to be highly reliable nor to provide deterministic timing behavior. The best results can be obtained if the dependability requirements are considered not only during operating system design, but also during the design of the hardware. This makes it possible to choose, for each dependability mechanism, whether it is implemented in hardware or in software. In addition, it becomes possible to use some mechanisms which cannot be implemented in software at all.

Real Time Aspects

Predictable timing behavior is of crucial importance, especially in real-time systems, where the results need to be correct in the value domain as well as in the time domain. Timing failures often do not result from a fault in a specific task, but manifest themselves only in special situations, depending on the behavior of other tasks of the system. This property of timing failures makes testing very difficult, especially in a simulated environment. A solution to this problem is to define the timing behavior of each part of the system exactly and to verify it by testing and analysis. This implies the determination of upper bounds on the execution times of tasks, on the execution times of the operating system and on the time needed for communication between tasks. In addition, asynchronous activities of processing elements, such as DMA and memory refresh, have to be bounded. Determining all these bounds and using them in the design and analysis of a system makes it possible to achieve deterministic timing behavior and to avoid timing failures.
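How such bounds might be combined in an analysis can be illustrated with a minimal sketch. It assumes, purely for illustration, that the individual worst-case contributions simply add up and that a task must finish within a fixed deadline. The class `TimingBounds`, the function `meets_deadline` and all numeric values are hypothetical and are not taken from the MARS implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimingBounds:
    """Upper bounds in microseconds (hypothetical figures, for illustration only)."""
    task_wcet: int       # worst-case execution time of the task itself
    os_overhead: int     # worst-case time spent in the operating system
    communication: int   # worst-case time for communication between tasks
    dma_stall: int       # worst-case delay caused by DMA activity
    refresh_stall: int   # worst-case delay caused by memory refresh

    def worst_case_total(self) -> int:
        # Assumption: the contributions are additive in the worst case.
        return (self.task_wcet + self.os_overhead + self.communication
                + self.dma_stall + self.refresh_stall)


def meets_deadline(bounds: TimingBounds, deadline: int) -> bool:
    """True if the summed upper bounds fit within the deadline."""
    return bounds.worst_case_total() <= deadline


bounds = TimingBounds(task_wcet=400, os_overhead=50, communication=120,
                      dma_stall=20, refresh_stall=10)
print(bounds.worst_case_total())      # 600
print(meets_deadline(bounds, 1000))   # True
print(meets_deadline(bounds, 500))    # False
```

The check is only as good as the bounds fed into it: if one asynchronous activity is left unbounded, no total can be given and the analysis says nothing about the worst case.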

Fault Tolerance Strategy

In general, distributed systems are better suited for fault tolerance than centralized systems. In distributed systems 'smallest replaceable units' (SRUs) can be defined. One approach to prevent the propagation of failures is the use of fail-silent SRUs. A fail-silent SRU is defined to deliver either correct results (in the value domain and in the time domain) or no results at all. The safest way to prevent a failed SRU from disturbing the rest of the system is to turn it off immediately after the detection of a failure. To provide continuous service even in this case, two or more SRUs are grouped together in active redundancy, forming a 'fault tolerant unit' (FTU).
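The consequence of the fail-silent assumption for an FTU can be sketched as follows. Because every delivered result is correct by definition, a consumer needs no voting and may take any result that arrives. In this sketch a silent SRU is modelled by `None`; the function name `ftu_output` and the integer result type are hypothetical choices for illustration, not part of the MARS design.

```python
from typing import Optional, Sequence

# A fail-silent SRU delivers either a correct result or nothing (None).
Result = Optional[int]


def ftu_output(replica_results: Sequence[Result]) -> Result:
    """Return the first result delivered by any replica of the FTU.

    Under the fail-silent assumption any delivered result is correct,
    so no comparison or voting between replicas is performed.
    Returns None only if every replica is silent.
    """
    for result in replica_results:
        if result is not None:
            return result
    return None


print(ftu_output([42, 42]))      # 42   (both SRUs operational)
print(ftu_output([None, 42]))    # 42   (one SRU has failed and is silent)
print(ftu_output([None, None]))  # None (the whole FTU has failed)
```

The sketch covers the value domain only; in a real-time system a result that arrives too late would also have to be treated as absent.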


The hardware and operating system development described in this paper is based on the MARS architecture (Kopetz and