

## Durham E-Theses

## The reliability of small digital controllers

Pearson, Jonathon C.

## How to cite:

Pearson, Jonathon C. (1983) The reliability of small digital controllers, Durham theses, Durham University. Available at Durham E-Theses Online: http://etheses.dur.ac.uk/7217/

## Use policy

The full-text may be used and/or reproduced, and given to third parties in any format or medium, without prior permission or charge, for personal research or study, educational, or not-for-profit purposes provided that:

- a full bibliographic reference is made to the original source
- a link is made to the metadata record in Durham E-Theses
- the full-text is not changed in any way

The full-text must not be sold in any format or medium without the formal permission of the copyright holders.
Please consult the full Durham E-Theses policy for further details.

# THE RELIABILITY OF SMALL <br> DIGITAL CONTROLLERS 

by<br>Jonathan C.Pearson BSc(Eng). ACGI

The copyright of this thesis rests with the author. No quotation from it should be published without his prior written consent and information derived from it should be acknowledged.

Thesis submitted for the degree of Doctor of
Philosophy in the Faculty of Science
University of Dürnam

Thesis
1983/PEA

Dedicated to

MY MOTHER AND FATHER

## ABSTRACT

Increasing use is being made of small digital controllers in Industry and Commerce. The failure of such controllers is important since it may either cause ${ }_{\wedge}$ plant to become unsafe or the interruption of production. Faulttolerant techniques are discussed for improving the reliability of digital controllers with special reference to the development of a hybrid electromechanical gas governor, whose electronic controller is an example of a small digital controlier. Three microprocessors are used in a two out of three majority voting configuration and the memory is Hamming code protected. Redundancy techniques are used to protect against faults in other parts of the controller and it will tolerate most classes of transient fault

When comparing designs or attempting to meet reliability criteria. it is necessary to predict the reliability of a system and its individual components. Several sources of failure rate prediction are compared and the wide variation in the failure rates of integrated circuits is highlighted. The comparison concludes by recommending which reliability data source is likely to be most accurate for each type of component.

The gas governor is an example of a repairable system and analysis is developed for predicting the improvement in reliability for repairable redundant systems and for determining the optimum maintenance and repair times for equipment.

The testing of redundant systems is difficult because of their complexity. and under certain circumstances the redundancy can mask design faults. Testing methods using complex test equipment are described. as well as the testing of the experimental controller.

A review is included of other fault-tolerant systems. Although the work on large computers is not directly applicable to small controllers. many of the tecnniques can be used.

## Acknowlengements

I shouid like to express my gratitude to the following people. To Or Clive Preece. my Supervisor. for his nelp. encouragement, friendship. and assistance in the preparation of this Thesis. To the British Gas Corporation tor the provision of a Research Scholarship and equipment. To Dr ken Jenkins for his help. encouragement. and the acquisition of equipment, and also for the close links between the Engineering Research Staition and the Department of Engineering. I would like to thank Mr Robert Halse for his help and discussion. To Dr David Smith for his interest ano valuable discussion about this research. To the Electrical Tecnnicians for jobs too numerous to mention and to Motorola for providing much useful reliability iniormation. Finally 1 would like to thank Gillian for her love and encouragement in the preparation of this Thesis.

## CONTENTS

Page
ix
LIST OF FIGURES
$x i$
LIST OF TABLES ..... xi.i
LIST OF SYMBOLS
1
CHAPTER I The Need for a Reliable Controller
1.1 NATIONAL GAS NETWORK ..... 1
1.2 PNEUMATIC GOVERNORS ..... 2
1.3 LIMITATIONS OF PNEUMATIC GOVERNORS ..... 4
CHAPTER 2 A Review of Other Fault-tolerant Controllers ..... 7
2.1 SMALL CONTROLLERS ..... 7
2.i.1 TMR controllers ..... 7
2.1.2 Fail-safe controller ..... 8
2.1.3 Built in test equipment ..... 9
2.2 LARGE COMPUTERS ..... 9
2.2.1 Space shuttle and avionics ..... 9
2.2.2 Commercial computers ..... 11
2.2.3 Telephone exchange computers ..... 12
2.2.4 Satety monitoring computer ..... 12
2.2.5 PDP11 computer ..... 13
2.2.6 Railway applications ..... 13
2.2.7 Fault-tolerant software ..... 14
2.3 SUMMARY ..... 14
CHAPTER 3 Failure of Components ..... 17
3.1 FAILURE DISTRIBUTION AND MECHANISMS ..... 17
3.2 FAILURE RATE MEASUREMENT AND ACCELERATION ..... 18
3.2.1 Accelerated testing ..... 19
3.3 FAILURE RATE PREDICTION ..... 21
3.3.1 CNET ..... 23
3.3.2 National centre of systems reliability ..... 24
3.3.3 RADC field data ..... 25
3.3.4 European space agency ..... 25
3.3.5 Manufacturers testing ..... 25
3.3.6 Simple models ..... 26
3.3.7 GIDEP computer data base ..... 26
3.3.8 EuReData ..... 27
3.3.9 MIL-217 data base ..... 27
3.4 COMPARISON OF FAILURE RATE DATA ..... 27
3.4.1 Resistors ..... 28
3.4.2 Capacitors ..... 28
3.4.3 Soldered joints ..... 28
3.4.4 Wire-wrap connections ..... 28
3.4.5 Edge connectors ..... 29
3.4.6 Integrated circuits ..... 29
3.4.7 TTL integrated circuits ..... 32
3.4.8 6800 microprocessor ..... 32
3.4.9 $\quad 8080$ microprocessor ..... 33
3.4.10 EPROM ..... 34
3.4.11 Bipolar ROMs ..... 34
3.4.12. Dynamic RAM ..... 34
3.4.13 Static RAMs ..... 34
3.4.14 Comparison with other field data ..... 35
3.4.15 Recommendations ..... 36
3.5 COMPARISON OF DIFFERENT DEVICE TECHNOLOGIES AND ENCAPSULATION ..... 37
3.5.1 CMOS versus TTL ..... 38
3.5.2 Hermetic versus non-hermetic encapsulation ..... 39
3.5.3 Recommendations ..... 43
3.6 SCREENING AND METHODS OF IMPROVING FAILURE RATES ..... 44
3.6.1 Derating and cooling ..... 46
3.7 SOFTWARE AND TRANSIENT ERRORS ..... 47
3.8 SUMMARY ..... 50
CHAPTER 4 Prediction of Reliability Improvement due to Fault-tolerant techniques ..... 52
4.1 METHODS OF Éxpressing RELIABILITY ..... 52
4.2 EFFECT OF REDUNDANCY ON RELIABILITY ..... 54
4.3 EFFECT OF MAINTENANCE ON A SYSTEM ..... 55
4.3.1 Redundant systems ..... 56
4.4 METHODS OF EXPRESSING IMPROVEMENT ..... 59
4.5 SUMMARY ..... 60
CHAPTER 5 Techniques for Improving a System's Fault-tolerance ..... 62
5.1 LEVELS OF FAULT-TOLERANCE ..... 62
5.2 DEGIGN TOOLS ..... 63
5.2.1 FMECA ..... 63
5.2.2 FTA ..... 64
5.3 HARDWARE ..... 64
5.3.1 Watchoog timer ..... 64
5.3.2 Snake ..... 65
5.3.3 Power supply levels ..... 66
5.3.4 Output verification ..... 66
5.3.5 Component redundancy ..... 66
5.3.6 Memory protection ..... 67
5.3.7 Temperature ..... 67
5.4 SOFTWARE ..... 67
5.4.1 Exception nandling ..... 68
5.5 SELF-TESTING ..... 69
5.6 SUMMARY ..... 70
CHAPTER 6 EFFECT OF SYSTEM ARCHITECTURE ON RELIABILITY
AND COST ..... 72
6.1 CHOICE OF MICROPROCESSOR ..... 72
6.1.1 Single chip microprocessors ..... 72
6.1.2 Eight-bit microprocessors ..... 72
6.1.3 Sixteen-bit microprocessors ..... 73
6.2 MAJORITY VOTING ..... 74
6.2.1 Hardware ..... 74
6.2.2 Software ..... 75
6.3 THE USE OF REDUNDANCY AND INCREASED COST ..... 75
CHAPTER 7 Background to Design of Governor Controller ..... 78
7.1 SPECIFICATION ..... 78
7.2 MECHANICAL VALVE ..... 79
7.3 NON FAULT-TOLERANT CONTROLLER ..... 82
7.4 METHODS OF INTRODUCING FAULT-TOLERANCE ..... 83
CHAPTER 8 • Description of Experimental Controller ..... 86
8.1 CONSTRUCTION ..... 86
8.2 TMR MICROPROCESSOR BOARD ..... 86
8.2.1 Microprocessor block ..... 87
8.2.2 Voters ..... 87
8.2.3 Self synchronising clock ..... 88
8.2.4 Synchronisation hardware ..... 89
8.2.5 Reset circuitry and watchdog timers ..... 92
8.2.6 RS232 interface ..... 94
8.3 MEMORY BOARD ..... 94
8.3.1 Decoding ..... 96
8.3.2 Circuit description ..... 96
8.3.3 Address decoding and "spare" EPROMs ..... 97
8.3.4 . Memory matrix ..... 98
8.4 INPUT/OUTPUT BOARD ..... 98
8.4.1 Real time clock ..... 99
8.4.2 Multiplexer and A/D converter ..... 99
8.4.3 Stepper motor driver ..... 100
8.4.4 Solenoid drive circuitry ..... 100
8.4.5 Pressure transducers ..... 101
8.5 PNEUMATIC TEST RIG ..... 102
8.6 ESTIMATED RELIABILITY ..... 104
8.6.1 Revised failure rate prediction ..... 106
CHAPTER 9 Governor Controller Software ..... 107
9.1.1 MAIN ..... 107
9.1.2 RESYNC ..... 108
9.1.3 MERROR ..... 109
9.1.4 CNTRL ..... 110
9.1.5 PRESSR ..... 171
9.1.6 RBKGEN ..... 111
9.1.7 BLOCK ..... 111
9.1.8 SLFTST ..... 112
9.1.9 SFAIL ..... 112
9.1.10 WTRAP ..... 112
9.1.11 SNAKE ..... 113
9.1.12 DFAULT ..... 113
9.1.13 INITL ..... 113
9.1.14 COUT ..... 113
9.1.15 COUTBF ..... 114
9.1.16 PRBUFF ..... 114
9.1.17 MSGE ..... 114
9.1.18 NMOUT ..... 114
9.1.19 TIMLOG ..... 114
CHAPTER 10 Testing of Fault-tolerant Systems ..... 115
10.1 SYSTEM DEBUGGING ..... 115
10.1.1 In-circuit emulation ..... 115
10.1.2 Logic analysis ..... 115
10.2 FAULT INJECTION AND RECOVERY ..... 117
10.3 OPERATIONAL TESTING
10.4 TESTING OF THE GOVERNOR CONTROLLER ..... 118
10.4.1 Pressure control ..... 118
10.4.2 Fault injection ..... 119
10.4.3 Voting errors ..... 119
10.4.4 RAM errors ..... 120
10.4.5 Watchoogs ..... 120
10.4.6 Snake ..... 121
10.4.7 Solenoid failure ..... 121
10.4.8 Pressure transducers ..... 121
10.4.9 Interference testing ..... 121
DISCUSSION ..... 123
LIST OF REFERENCES ..... 129
FIGURES ..... 136
TABLES ..... 174
APPENDIX ..... 185

1. Circuit diagram of power supply ..... 185
2. Circuit board tayout ..... 186
3.1 Error flags low pass filter calcuiation ..... 190
3.2 FPLA programming table ..... 191
3. Calculation of delay before latching error flags ..... 192
4. Encoding ROM data ..... 194
5. Decoding ROM data ..... 195
6. Operation of the real time clock ..... 196
7. Design of redundant solenoid driver circuitry ..... 197
8. List of integrated circuits ..... 201
9. Governor controller software listings ..... 202

## LIST OF FIGURES

Figure Page
1 Cross section of a Donkin FIG280 regulator ..... 136
2 Governor and distribution network ..... 137
3 Twin stream governor ..... 137
4 The bath-tub curve ..... 138
5 TMR voter and error detection in open collector TTL logic ..... 138
6 TMR ring structure ..... 138
7 Graph of acceleration factor versus operating temperature fora reference temperature of $25^{\circ} \mathrm{C}$. Plotted for differentactivation energies.139
Failure rate of a 8085 microprocessor versus case ambient temperature ..... 140
BS9400 screening procedure ..... 141
10 Graph of simplex and TMR reliability versus normalised mission time. ..... 142
11 Grapn of MTTFIF versus ratio $\lambda_{2} / \lambda_{1}$ ..... 142
12 Fault tree analysis for pressure trip ..... 143
13 Software recovery using recovery blocks ..... 144
14 Stepper motor controlled regulator ..... 145
15 Solenoid controlled regulator ..... 145
16 Block diagram of non fault-tolerant controller ..... 146
17 Op-code fetch timing showing synchronisation problem ..... 146
18 RST/HOLD timing ..... 146
19 Photograph of Governor Controller ..... 147
20 Photograph of pneumatic test rig ..... 147
21 Block diagram of microprocessor board ..... 148
22 Microprocessor and buffers - CHANNEL 1 ..... 149
23 Microprocessor and buffers - CHANNEL 2 ..... 150
24 Microprocessor and buffers - CHANNEL 3 ..... 151
25 Voting circuitry ..... 152153
27 Resynchronisation timing monitored by logic analyser ..... 153
28 Clock. Test, and RS232 interface ..... 154
29 Resynchronisation circuitry ..... 155
30 Watchdogs and reset circuitry ..... 156
31 Block diagram of memory Doard ..... 157
32 Part of memory board including RAM and decoders ..... 158
33 Detail of RAM storage circuitry including test circuitry ..... 159
34 Instant ROM. EPROM. and PORT(RD) decoder circuit diagram ..... 160
35 Block diagram of input/output board ..... 161
36 Input/output board ..... 162
37 Stepper motor and redundant solenoid drive circuitry ..... 163
38 Pressure transducer and conditioning circuitry ..... 164
39 Block diagram of pneumatic test rig ..... 165
40 Flow chart of module MAIN ..... 166
41 Flow chart of module RESYNC ..... 167
42 Flow chart of module MERROR ..... 168
43 Flow chart of module MERROR (continued) ..... 169
44 Flow chart of module CNTRL ..... 170
45 Flow chart of module PRESSR ..... 171
46 Graph of governor outlet pressure versus orifice plate differential ..... 172
47 Error messages ..... 173

## LIST OF TABLES

Table ..... Page
1 Failure mechanisms and activation energies in MOS semiconductors ..... 174
2 A comparison of resistor failure rates ..... 174
175
3 A comparison of capacitor failure rates
175
4 A comparison of soldered joints failure rates
175
5 A comparison of wire-wrap joint failure rates
6 A comparison of edge connector failure rates ..... 176
7 Activation energies used in failure rate prediction ..... 176
8 A comparison of TTL integrated circuit failure rates ..... 177
9 A comparison of the fallure rates for a 6800 microprocessor ..... 177
10 A comparison of 6800 microprocessor adjusted failure rates ..... 178
11
A comparison of the failure rates for a 8080 microprocessor ..... 178
12 ..... 179
A comparison of 8080 microprocessor adjusted failure rates15 A comparison of the failure rates of a 16k Dynamic RAM
179
A comparison of the failure rates of a 2716 EPROM
180
A comparison of the failure rates of a 1 k Bipolar ROM180
16
A comparison of the failure rates of a 1 k Static RAM ..... 181
17 The relation between cost. screening level, and Quality factor ..... 181
Recommended maximum junction temperatures for semiconductors ..... 182
Signals carried by back-plane bus ..... 183
Correction bits for SEC/DED Hamming code ..... 184 ..... 2018421 Error position as indicated by error flags

## List of Symbols

| ASCII | American Standard Code for Information Interchange |
| :---: | :---: |
| CMOS | Complementary Metal Oxide Semiconductor |
| DAG | Demand Activated Governing |
| DBE | Double Bit Error |
| dil | Qual in line |
| ECL | Emitter coupled logic |
| EEC | European Economic Community |
| EMF' | Electromagnetic Pulse |
| EPROM | Erasable Programmable Read Only Memory |
| FMEA <br> FMECA | Fault diode and Efdect Analys is Fault Mode and Effect Criticality Analysis |
| FPLA | Field Programmable Logic Array |
| FTA | Fault Tree Analysis |
| f/M hrs | failures per million hours |
| ${ }^{I C E}$ | In Cirenit Enmelation failure rate |
| LSI | Large Scale Integration |
| LSTTL | Low Power Shottky Transistor Transistor Logic |
| MOS | Metal Oxide Semiconductor |
| MTBF | Mean Time Between Failures |
| MTFF | Mean Time to First Failure |
| MTTF | Mean Time To Failure |
| MTTFIF | Mean Time To Failure Improvement Factor |
| MTTR | Mean Time To Repair |
| NASA | North American Space Administration |
| POR | Power On Reset |
| PROM | Programmable Read Only Memory |
| psi | pounds per square inch |
| RAM | Random Access Memory |
| rn | relative humidity |


| HOM | Read Only Memory |
| :--- | :--- |
| RTC | Real Time Clock |
| SEC/DED | Single Bit Error Correction / Double Bit Error Detection |
| SSI | Small Scale Integration |
| STTL | Shottky Transistor Transistor Logic |
| THB | Temperature Humidity Bias |
| TMR | Triple Modular Redundancy |
| TTL | Universal Asynchronous Receiver Transmitter. |
| UART | Visual Display Unit |
| VDU | Very Large Scale Integration |
| VLSI | inches water guage |
| "wg |  |

## CHAPTER 1

## THE NEED FOR A RELIABLE CONTROLLER

### 1.1 NATIONAL GAS NETWORK

in the days before the extraction of natural gas. gas was produced. stored. transmitted, and used locally. Gas was transmitted at low pressure and there was no national network which made the control problem much easier.

The changeover to natural gas involved the installation of a national network. Gas comes ashore from several gas fields at high pressure and is transmitted throughout the country at high pressure. typically at 1000 psi in four foot diameter pipes. Compressor stations are used at regular intervals to overcome pressure drops in the system. Because of the extremely large volume of gas being controlled at high pressure and the absolute need for reliable control. considerable expenditure on fault tolerant main frame ànd mini-computers for control purposes can be justified.

Gas is progressively reduced in pressure as it is transmitted to the consumer. Transmission pressures range from 1000psi to 100psi and control of gas at the bottom end of this range is performed by pneumatics. In order to make full use of the gas network. the technique of line-pack has been developed, where the pipework itself is used to store large quantities of gas in order to smooth out variations in demand. To optimise the efficiency of techniques such as line-pack, it is necessary to replace pneumatic controllers with electronic digital controllers. These controllers must be extremely reliable because of the requirement for safe operation of the network, and failure to supply gas could involve British Gas in litigation resulting in the payment of considerable compensation. Fault tolerant controllers can achieve the levels of reliability required
and many of the techniques described and demonstrated in the following chapters could be used. Although the cost of the controller must be considered more carefully at low pressures. sufficient savings can be made by the more efficient operation of the network to justify such controllers.

Gas linally reaches the "distribution" network at relatively low pressures where there are several thousand governor stations. The quantity of gas handled by each governor is much less than at higher pressures and all control is performed in pneumatics. A large governor station might feed two thousand consumers. The cost of electronic control at this level is critical, nowever the controller must still be very reliable which requires fault tolerant techniques to be used. It is this type of "smali digital controller": that has been studied in depth at Durham and a prototype controller developed. The controller developed is more powerful than that required for simple pressure/flow control of a single regulator and could be expanded to control line-pack as described above.

As well as experimenting with hybrid electronic/mechanical governors. British Gas are making increasing use of microprocessor equipment. Because of the requirement for high reliability when controlling gas. it is essential to consider the reliability of every piece of equipment and introduce fault tolerant techniques if necessary.

### 1.2 PNEUMATIC GOVERNORS

Since the aim of this research was to develop a "small digital controller" to be used as part of an electromechanical governor. it is necessary to study the characteristics of pneumatic governors so that the replacement is no worse than its predecessor in performance and cost.

The terms regulator and governor are virtually synonymous. but the term governor will be used to describe a system naving an inlet and a controlled outlet. The term regulator will be used to describe the
individual regulator valves which comprise the system
Pneumatic regulators have been used by British Gas for many years They have many advantages of which the major ones are listed below : Tried and trusted - For a mechanical device they are very reliable. This is partly due to their mature technology and the vast quantity produced.
(ii) Inexpensive - A typical regulator costs a few hundred pounds.
(iii) They are self powered and need no mains supply.
(iv) They can withstand some unusual operating conditions and transient surges.
(v)

Mechanically they are very rugged and will operate over a wide temperature range as long as they are prevented from freezing up.

A typical regulator valve is shown in figure(1). The valve which controls the flow of gas is mounted within the lower cast iron body. The valve orifice is made from stainless steel to minimise wear and corrosion and the valve seat is made from nitrile rubber to ensure low leakage when shut. The valve is attached to a long stem which passes up into the diaphragm chamber and is attached to the centre of the diaphragm. Movement of the diaphragm causes the valve to open and shut. The diaphragm equilibrium position is reached by balancing the force of the spring on the top of the diaphragm with the gas pressure below. Downstream gas pressure is fed to the pressure tapping in the bottom diaphragm bowl which is shown to the right of figure(1). If the downstream gas pressure rises. this causes the diaphragm to move up which closes the valve and reduces the downstream pressure. The regulator is thus seen to behave as a negative feedback closed loop controller of pressure. The outlet pressure of the regulator is set by the main spring whose compression can be varied by adjusting the top screwed plug. The valve shown is an "open at rest" type and rupturing of the diaphragm will cause the valve to fail fully open.

### 1.3 LIMITATIONS OF PNEUMATIC GOVERNORS

A schematic diagram of a governor feeding a distribution network is shown in figure(2). There will be some point in the network where the largest pressure drop exists between that point and the governor. This point is labelled the worst case pressure point. To complicate matters further, this point moves around as the loading on the network changes. There is a statutory minimum pressure (about 5 inches $w g$ ) at which consumers must be supplied. which means that the governor outlet pressure must always be nigh enough to give this minimum pressure at the worst case pressure point. All other consumers will receive gas at pressures above the statutory minimum.

If is desirable to keep the outlet pressure of the governor as low as possible whilst still satisfying the above conditions because of the problem of leakage. Old distribution pipework is subject to relatively low levels of nonhagardons leakage which is proportional to the network over-pressure. Under non worst case conditions and periods of low demand, there will be considerable over-pressure unless the governor is readjusted. It has been estimated by Murphy[231 that considevable-savings in the calue of lost gas could be made for a 2 wg reduction in nationwide system pressure. Using microprocessor techniques, it should be possible to reduce network over-pressure by at least $2^{\text {n }}$ wg.

As well as resistive pressure drops in the network there is a variable pressure drop across a simple pneumatic regulator. The outlet pressure of a typical regulator drops by 15 per cent as the flow through it is varied from zero to full rated - this is called "droop". Consequently the regulator set point must be set higher than required to compensate for droop at nigh flows. A microprocessor controlled valve could reduce the droop to zero
required in a "twin stream" governor. In large governor stations where it is imperative to supply gas without failure, it is common to use a twin stream regulator configuration as shown in figure(3). Slam shuts are similar to regulator valves. but are designed to close rapidly. cutting off the gas flow. when their set pressure is reached. Typical pressure settings are shown in figure(3). The minimum acceptable pressure at the governor outlet is $12 " \mathrm{wg}$. so the standby regulator is set to give this. Normally the standby regulator is held off and the outlet pressure is controlled at. $14^{\prime \prime} \mathrm{wg}$ by the active stream regulator. If the outlet pressure rises above $18^{\prime \prime} w g$. the active stream is cut off by the slam shut set at 18 "wg and the standby stream takes over. If the pressure rises further to $20^{\prime \prime} \mathrm{wg}$. the standby stream is cut off by the slam shut set to $20^{\prime \prime} \mathrm{wg}$ and gas flow through the governor is cut off. It is difficult to set accurately both regulators and slam shuts. so it is necessary to have their settings several inches wg apart so that they do not inter act with each other. if all the valves and slam shuts were controlled by a microprocessor. then it would be possible to have their set points closer together.

To compensate for resistive pressure drops in the system. it is desirable to reduce the governor pressure under conditions of low flow and to increase the pressure under conditions of high flow. Such control is called "demand activated governing" (DAG). A pneumatic governor has been developed to perform this control function. but is very complex and requires many regulator valves and pneumatic components. If a multi-feed network is fed by severai DAG controllers, then the controllers may interract and make the system unstable. To prevent this it is necessary for the controllers 10 communicate with each other so that interaction is controlled.

The previous discussion has shown that pneumatic regulators are satisfactory in their single form, but as soon as several are connected together to form a governor, penalties must be paid by way of complexity.
increased network leakage and inefficient control of the network. Most of
the disadvantages of pneumatic governors can be overcome by using a
pneumatic/electronic hyorid governor and such tasks as telemetry, health
monitoring and complex control algorithms are made much easier to
implement. it is however essential that the electronic controller is
extremely reliabie. especially in circumstances where the controller is
controlling several valves. It is necessary to use fault tolerant
techniques to achieve such a nigh level of reliability.

## CHAPTER 2

A REVIEW OF OTHER FAULT-TOLERANT CONTROLLERS

### 2.1 SMALL CONTRULLERS

The application of redundancy to small controllers is often more cost sensitive than large computers. Redundancy can be used to achieve non stop processing, tail safe operation, and a reduction in the mean time to repair (MTTR) by the use of built in diagnostics.

### 2.1.1 TMR controllers

Platteter[7] describes a TMR design using three 8085 microprocessors which is similar to the experimental controller described in chapter eight. Platteter recommends the use of TMR for the protection against untestable errors in microprocessors. Three different manufacturers' 8085s are used. each having a different independent design. He proposes that it is better to accept that complex VLSI circuits such as microprocessors will contain untestable faults. and it is better to protect against this using TMR techniques, than to eliminate all manufacturing faults. The voting is periormed al bus level in STTL integrated circuits which will be less reliable than voting in FPLAs. A very unsatisfactory approach is taken towards resynchronisation. Either resynchronisation "just happens" or it is aided by PUSHing all registers onto the stack and then POPing them off. He reports that sometimes the system would not resynchronise and it was necessary to perform a reset. The reasons for the system not synchronising and how this can be overcome is described in chapter eight

Higuchi et al [22] describe a TMR system using three 8085 microprocessors which is resynchronised at regular intervals. This is unecessary since it is better only to resynchronise when a voting error is detected. Resynchronisation is performed by PUSHing all registers onto the stack and then POPing them off. This will not always resynchronise the
program counter and sometimes one processor will be found to lock one clock cycle behind the other two as described in chapter eight. These difficulties are not reported. An alternative TMR configuration is proposed by Higuchi which uses software voting and the inree processors execute tasks at staggered intervals and compare their results when all processors have executed the task. The software will be more complex and voting will not be transparent to the user. but the execution of tasks at different times should reduce the effect of transient errors.

Ryland [20] describes the use of microprocessors in reliable railway signalling equipment. The interlocking system uses three 6800 microprocessors in a TMR configuration. The processors are loosely synchronised and a processor will attempt to shut down any processor with which it disagrees (assassination). If the assassination attempt fails, it will shut itself down (suicide). Fault reporting and the repair of a faulty channel is performed on-line. Two separate data links are provided to the equipment and data is transmitted at 10 k baud in Manchester code with Hamming code protection. The software was written in assembler decause the designer felt that it was easier to achieve reliability because no arithmetic was used and the hardware interface was simple.

Davies et al $[21]$ discuss ring communication structures for a small system. Three 8748 single chip microprocessors are connected in a TMR ring structure and the synchronisation is performed in software and is accomplished by handshaking between the processors or the insertion of a fixed delay betore voting. The outputs from the three microprocessors must De combined which is nearly as complex as voting in hardware. but they propose that software voting offers greater versatility and is not prone to. component failure.

### 2.12 Fail sate controller

A life support system controller is described by Lim [4]. The design
is deliberately not fault-tolerant because of the increased cost. Dut is designed to detect hardware and software faults and to cause the system to fail sate. An alarm is given on failure of the controller so that human operation will ensure continued operation of the life support system.

### 2.1.3 Built in test equipment

Foose [17] describes an industrial microprocessor controller which was designed to reduce the MTTR. The design aim was $80 \%$ automatic testability for $10 \%$ exira cost. The controller is divided into separate modules. each performing a single function, and modules are designed to test themselves. By reducing the MTTR the availability of the controller is increased and maintenance costs reduced. Failure diagnosis can also be performed by unskilled operators:

### 2.2 LARGE COMPUTERS

The majority of research so far has been concentrated on large faulttolerant computers which are used in the space, avionics, and nuclear industries.

### 2.2.1 Space shuttle and avionics

The space shuttle is probably one of the best (and most expensive) examples of a fault-tolerant system. References[12.13] describe the digital processing subsystem. NASA decided to use standard proven avionics computers which nowadays are rather outdated and to implement the redundancy in software. Five identical IBM avionics 32 bit computers are used each having 250 k bytes of memory. The computers share two 16 M byte lape drives for mass storage. The computer contains built in test equipment and can detect $98 \%$ of errors. The computers are interconnected to themselves and to the sensors and actuators by serial buses. The highest reliability configuration consists of four computers in loose
synchronism with the fifth computer executing background tasks. Synchronism is achieved by hardware and software and two computers out of the four can fail without a catastrophic system faifure. The inertial guidance platform is triplicated and connected to different buses. The inertial information undergoes selection filtering which compares the information channels and rejects any that are above a predetermined trip level. With all channels functional, mid-value selection is used which is common for inertial systems and failure of channels results in a degraded selection algorithm. Fallure of a channel must not cause a large transient disturbance at the output of the selection filter. since a fault lasting only one second is enough under certain conditions to crash the shuttle.

The cycle time of the control process is 40 ms in which time selection filtering and control of all the aerodynamic surfaces and rockets is performed. The cycle time of 40 ms is necessary for the stability of the shuttle.

Ahern et al [14] describe a software voter/monitor for selection filtering of inertial guidance information. It is necessary to reject faulty channels and then to perform the vote. Mid-value. selection is again chosen as the best algorithm and the requirement to smoothe the switch-over from one selection algorithm to another is stressed to avoid transient disturbances.

Modern military aircraft are controlled by the fly-by-wire technique. A digital flight computer reads in inertial and pilot information and controls electromechanical actuators which are connected to the control surfaces. Mechanical and hydraulic linkages are replaced. but it is necessary to use redundancy techniques to achieve high reliability in the controller. An added advantage in military aircraft is the increased survivability of the plane if it is damaged. For obvious reasons nothing has been published about modern use of microprocessors in digital flight controllers. nut it is known that several difterent types of microprocessor
including the Texas 9900 and Motorola 6800 are used in redundant configurations in the Tornado. Deets et al [15], in discussing the design and flight experience of a fly-by-wire control system. describe the first aircraft to use the fly-by-wire technique. An Apollo space computer was installed in a modified F 8 fighter. The 16 bit computer had a 36 k word memory. $12 \mu \mathrm{~s}$ instruction cycle and performed the control aigorithm in 30 ms . The initial tests were successful and were conducted with a parallel back up nydraulic system.

Black et al [5] discuss the development of a spaceborne memory, which uses a memory error correction code. having the same number of bits as the SECIDED Hamming code. but which is able to correct double bit errors (DBE). The position of stuck single bit errors is logged so that if a double bit error occurs. the stuck single bit error can be "erased" since its position is known, and a DBE tolerated. In order to achieve the nigh reliability required in space applications using semiconductor random access memory (RAM) the design was required to tolerate douole bit errors.

### 2.2.2 Commercial computers

Commercial computers are now being built with redundant circuitry. especially for memory protection. The aim is to increase the mean time to lailure (MTTF) by redundant circuitry and to reduce the MTTR by built in test equipment. Hence the availability can be increased and maintenance costs reduced. Toschi et al[6],in discussing a fault-tolerant computer memory, describe the advantages gained by adding single bit error correction 10 memory and the tolerance of transient errors. Troublesome and failed memory devices are masked by redundancy and are replaced at periodic maintenance intervals. Swarz [18] describes the design methodology behind the VAX computer with special emphasis on the ability of the computer to tolerate memory. disk. and other errors. The computer is designed 10 minimise the MTTR by the use of a LSI 11 microprocessor
dedicated to diagnosing faults in the VAX computer. In this way the availability is improved and the maintenance costs reduced.

### 2.2.3 Telephone exchange computers

Modern PCM telephone exchange computers must have a very high availability. typically less than two hours down-time in forty years. Fantini et al [2] describe an exchange computer based on the 2800016 bit microprocessor which has a fault coverage of 0.98 . A duplicate processor is used to detect errors by comparison with the main processor. The power supply. clock, and bus are continuously monitored for errors and RAM and ROM checking is performed off-line. The aim of the design is to detect and isolate faults. Most faults were found to be transient, but the system relies on the prompt repair of permanent faults to achieve a high availability.

Ceru et al [10] describe a similar exchange computer. The 16 bit processor is constructed from discrete TTL and LSTTL and consists of two processors operating in synchronism. The detection of an error causes both processors to check themselves and each other. and the first one to finish the checking resumes control of the exchange. Faults are logged and the second processor is resynchronised if possible.

### 2.2.4 Safety monitoring computer

Harbert [3] describes a system using three 808616 bit microprocessors in a fire/gas detection and automatic shut-down controller. The input boards, microprocessor, memory, and output boards are triplicated and the processors operate asynchronously. Magnetic bubble storage is used since this gives improved reliability over spinning memory devices. Several nundred detectors are monitored by the system and their status displayed on colour VDUs, which give a clear display of information. The system controls emergency shut-down and extinguishing equipment.

Canepa et al [19]. in discussing the architecture of multiprocessing systems. describe a system consisting of three LSI 11 computers connected in a versatile triplex configuration: Hardware modification is minimal except for the construction of the voters which use 250 SSI integrated circuits. Voting is performed at bus level which makes the redundancy transparent to the software, allowing standard software to be executed. The processors can be configured to operate singly, with one processor talking to the other two, or in a TMR configuration where the processors run ir synchronism. The system has been built to examine the effect of transient faults on computers. By monitaring the three processor system and injecting taults into one channel, they hope to gain information about the frequency, duration and location of transient faults in a computer.

### 2.2.6 Railway applications

Forsythe et al [16]. in discussing reliable train control applications, describe a system using three $\mathbb{N} S 890016$ bit microprocessors which share a common bus. Each processor has its own memory store and both processors and memory are buffered to improve the fault isolation. The three processors share common RAM and EPROM and voting is performed in software to reduce the hardware costs. The operation of the controller is divided into four sectors. In the first three sectors. each processor executes a task and then swaps tasks with another processor at the end of the sector. In this way the three tasks are executed three times on three different microprocessors. In the fourth sector the results of the computations are compared and recovery is executed if required. Input\%output is performed in the fourth sector only if the error checking is salisfactory In spite of the high level of redundancy. the system was found to lock-up as a result of certain interference tests and the watchdog timer was found to reset the system and restore correct operation.

### 2.2.7 Fault-tolerant software

Much of the fault-tolerance on large computers is implemented in software. References[8.9.17] describe the implementation of "recovery blocks" on a large machine in a high level language. A recovery block is defined as a section of code in which recovery is possible. Calculations undergo a series of acceptance tests. If the acceptance test is not passed an alternative calculation is tried. Before executing an acceptance tesl. variables are stored in a "recovery cache" so that the variables can be restored to their initial state if the alternative calculation and acceptance test fails. Ghani et al [11] describe the "recovery cache" nardware as implemented on a PDP 11 computer.

### 2.3 SUMMARY

Many of the fault-tolerant features reported in small controllers are used in the experimental controller. The designs of Platteter [7] and Higuchi et al [22] are similar to the governor controller which overcomes many of the shortcomings in their designs, such as the failure to resynchronise. Ryland [20] proposes a different TMR structure and reports that software was written in assembler. The governor controller software was likewise written in assembler. Davies et al [21] discuss an alternative TMR structure where the voting is performed in software. This structure is very suitable for single chip microprocessors, but was not used in the governor controller for the reasons given in chapters six and seven. The controller described by Lim [4] is designed to fail safe as is the governor controller. Foose $[171$ reports that built in testability can be incorporated at little extra cost and Ryland [20] mentions the on-line reporting of faults. The governor controller performs on-line fault reporting which allows the availability to be increased and maintenance costs reduced.

Although large computers are outside the definition of small controllers. a discussion of fault-tolerant features is included for the sake of completeness and many of their fault-tolerant features can be adapted for use on small microprocessor controllers. Much of their faulttolerance is implemented in software and is more complicated than that required for a small controller. The space shuttle uses serial buses for the interconnection of units since this is more reliable than using large paraliel buses. The space shuttle and fly-by-wire aircraft use redundant inertial sensors and software is used to select the best inertial information. Mid-value selection is commonly used as it is a fast algorithm to implement. The governor controller, in a similar fashion. uses redundant pressure trasnsducers and selects the best pressure information using software. Time is not critical, so a more suitable averaging algoritnm is used, rather than the faster mid-value selection.

References $\{5.6: 18]$ describe the use of single bit error correction in semiconductor memory. Black et al [5] discuss a code which will correct double bit errors, but which uses the same number of bits as the more normal Hamming code. The experimental controller cannot correct double bit errors, but allows recovery from them, as well as transparently correcting single bit errors.

References $[2.10]$ describe simllar telephone exchange computers consisting of a main processor and a standby spare. This architecture could have been used in the governor controller, but a TMR structure was used in preference. Fantini et al [2] report that most faults experienced were of a transient nature. References $[3.16]$ describe 16 bit TMR controllers where the voting is performed in software. The governor controller performs the voting in hardware which is transparent to the software. Canepa et al [19] describe a TMR system . using ihree PDP11 computers. where the voting is likewise performed in hardware. The complexity of hardware voting in large computers is highlighted by the size
of the required voting circuitry.
Recovery blocks. as described in references [8.9.11]. can usefully be implemented on small controllers. The governor controller uses a special type of recovery block, more properly called a recovery vector, which allows vectored recovery to be executed following the detection of a hardware or software error.

## CHAPTER 3

## FAILURE OF COMPONENTS

### 3.1 FAILURE DISTRIBUTION AND MECHANISMS

The tallure rate of most types of electrical devices follows the classical "Dath-tub" curve of figure(4). Phase one: infant mortality. represents the early life failures of a device and is usually associated with one or more manufacturing defects. After several hundred hours, the failure rate approaches some constant low value. phase two. where it remains for anything from several years to several nundred years and failures occur randomly. Wearout failures. phase three. occur at the end of the useful life of a device and are characterised by a rapidly rising fallure rate with time as the device wears out both physically and electrically. A common wearout mechanism in integrated circuits is corrosion due to moisture trapped inside the device package. Under normal operating conditions, failure due to wearout is rarely experienced with integrated circuits, uniess they are operated for a very long period of time - references[34.27].

The constant failure rate region, phase two, represents failures due to random events such as electrical surges. Most failure rate data sources assume a constant failure rate and present their results in the form of $n$ failures per unit time. typically $n$ failures/10 $0^{6}$ nours. This is satisfactory as long as the wearout region is not encountered during the lifetime of the equipment. This assumption is likely for integrated circuits and resistors, but components such as electrolytic capacitors may be different. The wearout phase of an electrolytic capacitor results from the electrolyte drying up. which might occur after only a few years. However the quoted "constant" fallure rate might suggest a MTTF of several hundred years. Thus it is essential to distinguish between the useful life
of a component and MTTF it wearout is encountered.
An example of the misleading result that this confusion might produce can be given by considering the human life span. The mortality rate for humans approximately follows the "bath-tub" curve. but if only the constant "failure rate" experienced during youth and middie-age is considered. then a MTTF of two thousand years is predicted as opposed to a normal lifespan of about seventy five years.

The majority of components used in "small digital controllers" are integrated circuits, resistors, and small decoupling capacitors. It is therefore valid under favourable operating conditions to consider the failure rate of such components constant and to assume that their useful life is equivalent to the MTTF.

Failure mechanisms in MOS integrated circuits are given in table(1) and are discussed in more detail in references[27,35]. Most potential failures caused by these mechanisms can be detected early by suitable screening. Consideration of these mechanisms is important since the majority of microprocessor and memory components are fabricated using the MOS technology.

### 3.2 FAILURE RATE MEASUREMENT AND ACCELERATION

The simplest method of calculating the failure rate of devices under controlled or field testing is by the equation:

$$
\text { failure rate }=\frac{\text { number of failures }}{\text { no. devices } \times \text { no. hours tested }}
$$

The problem with this equation is that statistically it only represents a point estimate at $50 \%$ confidence. and that if testing reveals no fallures, then obviously the fallure rate is not zero. A statistical solution to this problem is given by the Chi-squared distribution as discussed in references[32.49].

$$
\begin{equation*}
\text { failure rate }=\frac{x^{2}(1-C \cdot L, 2 r+2)}{2 n t^{i}} \tag{3.2.1}
\end{equation*}
$$

```
where: \(\quad x^{2}=\) Chi-square function
    \(C L=\) confidence level expressed as a decimal
    \(r=\) number of failures
    \(n=\) number of parts tested
    \(t=\) total test duration
```

Tables of $\chi^{2}$ are found in many texts on statistics[49].
Equation(3.2.1) is almost universally used. with most failure rates quoted at the $60 \%$ confidence level.

### 3.2.1. Accelerated testing

Even "unreliable" microelectronic devices may last several thousand nours betore failing which makes life-testing a very long process and screening virtually impossible. It has been shown that the failure rate of microelectronic devices exponentially increases with temperature according to the Arrhenius reaction rate equation, references[35,37.38].

```
fallure rate = C exp (-Ea/KT)
```

```
where
Ea = activation energy in eV
K= Boltzmanns constant ( 8.63 x 10 % )
T = absolute temperature
C = an appropriate constant
```

When conducting accelerated tests and analysing test data it is important to remember two things:
(i) The failure rate is exponentially dependent on temperature so that incorrect specification of the device junction temperature will have a large effect on the failure rate.
(ii) The correct activation energy should be chosen. appropriate to the failure mechanism under consideration.

According to the Arrnenius equation it. is possible to accelerate significantly failures by testing at elevated temperatures and the acceleration factor is calculated by:

$$
\begin{equation*}
F a=\exp \left[\frac{E_{A}}{K}\left(\frac{1}{T_{2}}-\frac{1}{T_{1}}\right)\right] \tag{3.2.3}
\end{equation*}
$$

where: $\mathrm{Fa}=$ acceleration factor
$T T=$ test temperature of the junction
T2 = desired temperature of the junction

The graph of figure(7) shows the relationship between Fa and the test temperature. T1. for a desired temperature. T2, of $25^{\circ} \mathrm{C}$. A family of curves is plotted. showing the sensitivity of acceleration factor to activation energy. Activation energies chosen are those corresponding to the major failure mechanisms of table(1).

When considering failures due to moisture ingression into the microcircuit package and subsequent corrosion of the metalisation and bonding wires, a similar acceleration factor may be used. it is common to test under conditions of $85^{\circ} \mathrm{C} / 85 \% \mathrm{rh}$. references[39.43]. The acceleration factor may be calculated according to the Lawson-Harrison law. references [35.44] and is given by

$$
\begin{equation*}
F_{a}=\exp \left[\frac{E_{A}}{K}\left(\frac{1}{T_{2}}-\frac{1}{T_{i}}\right)+b\left(H_{1}^{2}-H_{2}^{2}\right)\right] \tag{3.2.4}
\end{equation*}
$$

```
where : Fa = acceleration factor
    Ea = activation energy
    K == Boltzmanns constant
    T1 = desired junction temperature
    T2 = test junction temperature
    H2 = test humidity
    HI = desired numidity
    b = a constant
```

Reynolds[35] uses values of $\mathrm{Ea}=0.6 \mathrm{eV}$ as per table(1) and $\mathrm{b}=4.4$.

### 3.3 FAILURE RATE PREDICTION

Failure rates may be predicted either by accelerated testing or a combination of accelerated testing. field testing, and mathematical modelling. The most widely used document for failure rate prediction is probably MIL-217 [37], prepared by the American Department of Defense at the Rome Air Development Centre (RADC). This document is regularly updated and has been published as MIL-217 A.B.C and recently D. The first to predict the failure rate of microelectronic devices was MIL-217B. Components covered are those mainly used in defence applications. but this covers resistors. capacitors. most integrated circuits. connectors. switches etc. and field data is mainly gathered from defence applications. A mathematical failure rate is developed for each type of component of the form :

$$
\begin{equation*}
\lambda=\lambda_{b} \Pi_{Q} \Pi_{E} \Pi_{A}-\cdots \cdots \cdots . \tag{3.3.1}
\end{equation*}
$$

where: $\lambda=$ failure rate

$$
\lambda_{b}=\text { base failure rate }
$$

$\Pi_{Q}=$ quality factor

$$
\begin{aligned}
& \Pi_{E}=\text { environmental factor } \\
& \Pi_{A}=\text { application factor }- \text { voltage stress. power rating etc. }
\end{aligned}
$$

The factors $\lambda_{b} . \Pi_{\varphi} . \Pi_{E} . \Pi_{A}$ etc. are tabulated in MIL-217 covering many operating conditions and grades of component.

The failure rate of components depencs on their quality and the MIL217 series attempts to allocate $\Pi_{Q}$ or quality factors according to the grade of manufacture and subsequent screening.

An environment factor. $\Pi_{E}$, is applied to take account of the operating environment of the component. Experience has shown that failure rates depend on the operating environment, as might be expected. For instance a missile launch is more hostile than an aircraft in flight which in turn is more hostile than a ground based environment. For the purpose of evaluation of industrial equipment, the environment chosen as being appropriate is "ground fixed". Gf.

Finally the application factor. $\Pi_{A}$. is taken into account. This factor takes many forms. but is mainly used to reflect the electrical stress under which the component is operating. Any derating of the voltage. current, or power handled by the component will result in an improved value of $\Pi_{A}$

The fallure rate model for MOS and bipolar devices is of particular interest since this covers TTL logic and most microprocessor and memory devices excluding ROMs. The model used is :

$$
\begin{equation*}
\lambda=\Pi_{\varphi} \Pi_{L}\left[C_{1} \Pi_{T} \Pi_{v}+\left(C_{2}+C_{3}\right) \Pi_{E}\right] \quad \text { failures } / 10^{6} \text { nours } \tag{3.3.2}
\end{equation*}
$$

where: $\quad \lambda=$ device tailure rate
$\Pi_{\varphi}=$ quality tactor - depends on grade and screening level
$\Pi_{L}=$ device learning factor - unity for a mature device
$\Pi_{T}=$ temperature acceleration factor - Arrhenius relation

```
\(\Pi_{v}=\) voltage stress factor - unity except for CMOS
\(\mathrm{Cl}=\) device complexity factor - depends on transistor count
\(\mathrm{C} 2=\) device complexity factor - depends on transistor count
C3 = package complexity factor
\(T_{E}=\) environment factor
```

Since $\Pi_{T}$ is exponentially dependent on temperature. then for high temperatures. $\Pi_{T}$ is large and the model can be simplified to the approximation :

$$
\begin{equation*}
\lambda=\Pi_{\varphi} \Pi_{\tau} C_{1} \tag{3.3.3}
\end{equation*}
$$

The failure rate is therefore exponentially dependent on the junction temperature of the device.

### 3.3.1 CNET

In 1972 the Comité de Coordination des Télécommunications in France decided to establish a group of people 10 evaluate and predict the reliability of components used in the telecommunications and computing industries. The first version of their report was published in 1976. reference[38]. This report was based on MIL-217B can earlier version of MIL-2170 [37]. but was biased towards telecommunication and computer equipment operating in favourable environments as opposed to defence equipment operating in hostile environments. Failure rate models are given which are similar to MIL-217. The model given for microcircuits was considerably different. but was updated in 1982 to the following model which is more similar to the MIL-217 model.

$$
\lambda=\Pi_{\varphi} \Pi_{L}\left[C_{1} \Pi_{T} \Pi_{t} \pi_{V}+C_{2} \Pi_{B} \Pi_{E} \Pi_{s}\right]
$$

where: $\lambda=$ device failure rate

```
\(\Pi_{P}=\) quality factor - depends on grade and screening level
\(\Pi_{L}=\) reliability growth factor - unity for a mature device
\(\Pi_{t}=\) temperature acceleration factor - Arrhenius relation
\(\Pi_{T}=\) device technology factor
\(\Pi_{v}=\) voltage stress factor - unity except for CMOS
\(\mathrm{Cl}=\) device complexity factor - depends on transistor count
C2 = device complexity factor - depends on transistor count
\(\Pi_{\mathrm{E}}=\) environment factor
\(T_{B}=\) numidity \(/\) temperature factor
\(\Pi_{s}=\) transportation factor - depends on no. of journeys
```


### 3.3.2 National Centre of Systems Reliability

The NCSA reliability data [34] is a condensed version of their computer data bank neld by the Systems Reliability Service. SRS. The data bank comprises two main parts. The first contains field data gathered from the nuclear industry where the environmental conditions are well known and controlled. The second part contains information from MIL-217C as well as data from other published sources. laboratory tests and theoretical predictions. Martin Marietta Aerospace provide much of the support for the RADC data which in turn influences the MIL-217 series. The SRS data bank contains this data. so the SRS data will not be an independent source to MiL-217.

The failure rate models are similar to MIL-217 with the exception of the microcircuit model, which is :

$$
\begin{equation*}
F=\mathrm{K} 7 \mathrm{Kg}(\mathrm{Fel}+\mathrm{Fe} 2+\mathrm{Ft}) \text { failures } / 10^{6} \text { nours } \tag{3.3.5}
\end{equation*}
$$

where: $F=$ lailure rate for a hermetic device
$k 1=$ unity for mature devices otherwise ten

```
Kg = reliability growth factor
Fel= transistor count complexity / environment factor
Fe2 = packaging factor - depends on no. of pins and environment
Ft = temperature accelerature factor dependent on packaging
```

The model only considers two grades of device. hermetic and non-hermetic. The non-hermetic device failure rate is equal to twice the nermetic failure rate as well as further adjustments incorporated into Ft .


#### Abstract

3.3.3 RADC fieid data

As well as publishing the MIL-217 series, the RADC publishes failure rate data obtained from field experience. Kiein [45] presents much field data on microcircuits which is shown to agree approximately with MIL-217C.


#### Abstract

3.3.4 European Space Agency

The ESA has its own data bank and requires companies wishing to iender for projects to perform a reliability prediction. using the failure rate data generated and supplied by themselves. This has the great advantage that all companies are forced to use the same failure rate data and a valid comparison between proposed designs can be made.


### 3.3.5 Manufacturers testing

Most microcircuit manufacturers publish the results of accelerated testing on their devices. references[27.40]. Devices are subjected to thermal and physical shock tests as well as dynamic testing at elevated temperatures. A large number of devices are tested for typically several thousand hours. and the number of fallures observed are fitted to a Chisquared distribution. The failure rate is typically quoted at the $60 \%$ contidence level and the failure rates are further modified according to
the Arrhenius acceleration factor, to give a failure rate appropriate to the likely conditions of usage. There is some discrepancy between different manufacturer's predictions, since they make use of different activation energies as applied to the Arrhenius acceleration equation. Tests are also performed to verify the suitability of device packages as regards shock and humidity as reported by Motorola [40].

### 3.3.6 Simple models

A simpie model for microcircuits is guoted by the RRE which is useful when very tew parameters are known :

$$
\begin{equation*}
\lambda: \quad \lambda_{B} k d k I \tag{3.3.6}
\end{equation*}
$$

```
where: \(\quad \lambda_{B}=5\) digital bipolar
    \(\lambda_{B}=8\) digital MOS
    \(\lambda_{B}=12\) linear bipolar
    \(k d=\frac{\text { die area in square inches }}{0.015}\)
    \(K I=1+\frac{N-12}{24}\)
```

    \(N=\) number of package leads
    For this model to apply. the ambient temperature must not exceed $55^{\circ} \mathrm{C}$. the junction temperature must not be greater than $40^{\circ} \mathrm{C}$ above ambient and plastic encapsulation must not be used. If plastic encapsulation is used. then it is recommended that the failure rate is doubled.

### 3.3.7 GIDEP computer data base

The Government industry Data Exchange Programme is an American based association which was established in 1959 and provides access to four data

Danks, which are :

| (i) | Engineering data bank |
| :--- | :--- |
| (ii) | Feliability-maintainability data bank |
| (iii) | Failure experience data bank |
| (iv) | Meirology data bank |
| The reliablity-maintainability data bank is of particular interest as |  |
| regards reliability prediction. |  |

### 3.3.8 EuReData

The European Reliability Data Bank Association was established in 1974 on a voluntary basis and was formally constituted in 1979. The organisation is non profit making and is supported by the EEC. Its main aims are to promote data exchange between organisations and to set up standard methods for obtaining and using reliability data.

### 3.3.9 MIL-217 Data base

The lallure rate models and factors contained in the MIL-217 series ideally lend themselves to computerisation. since mathematical formulae are given for all the failure rate factors. $\Pi_{T}, \Pi_{\rho}$ etc. The Predictor package [81] includes the computerisation of MIL-217. The CNET data. likewise. is suitable, nowever data such as NCSR [34] is not suitable since no formulae are given for the failure rate factors. but only tables of figures.

### 3.4 COMPARISON OF FAILURE RATE DATA

The fallure rates for typical components used in a digital controller are given in tables(2) to (16). There are two main sources of data:
(i) Field and accelerated life testing data.

### 3.4.1 Resistors

The tailure rates of oxide film resistors are compared in table(2). Of the predicted data. the CNET prediction is identical to MIL 217 which might be expected since the CNET data is based on MIL 2178. The NCSR prediction is double that of the other predictions. but fits in between the two failure rates based on field experience. In addition the NCSR data is based on field experience within the nuclear industry. so it is proposed that the NCSR failure rate is the value most likely to be correct. There is however good agreement between all values with the worst and best lailure rates only differing by a factor of four.

### 3.4.2 Capacitors

The tailure rates of a 0.1 FF decoupling capacitor are given in table (3). This time agreement is not good with the worst and best values differing by a factor of twenty. This difference may be partly due to the wide variation in capacitor types and it is not possible to make predictions for identical capacitor types. However the CNET and MIL 217 values are in close agreement together with the ICL field data. as was the case for resistors. so it is proposed that the MIL 217 failure rate is accepted for this type of capacitor.

### 3.4.3 Soldered joints

The failure rates of soldered joints are compared. in table(4). Ignoring the CNET prediction which seems to be low. there is good agreement between the other values and again the $1 C L$ field data agrees very closely with MIL 217. It is proposed that the MIL 217 value is accepted.

### 3.4.4 Wire-wrap connections

The lailure rates of wire-wrapped joints are compared in table(5). The MIL 217 and CNET predictions seem grossly optimistic when compared with the field data and soldered joints. In the case of the MIL 217 data it is
suspected that very few wire-wrap joints are used in military equipment. but many more are used in the computer industry. For this reason and because the 1 CL data and Dummer agree so closely, it is proposed that the field data is more likely to be correct.

### 3.4.5 Edge connectors

The failure rate of edge connectors is compared in table(6). It is difficult to compare failure rates because of the wide variety of edge connectors and the mating/unmating cycles are not known for the field data. The Dummer failure rate seems nigh whilst the MIL 217 rate seems low when compared with field data as well as soldered and wire-wrapped joints. Although the CNET data is based on MIL 217, the CNET failure rate for connectors is much higher (a factor of 30). Presumably CNET found the MIL 217 model to be too optimistic. The CNET prediction agrees with the ICL data and seems to be a sensible value. so it is proposed that the CNET data is used in preference to the other sources.

### 3.4.6 Integrated circuits

The failure rates of integrated circuits are several orders of magnitude greater than resistors, capacitors, or connections, and therefore because of the large number of integrated circuits used in a digital contraller. their failure rate dominates the total controller failure rate. The failure rate of integrated circuits has been shown to fit the Arrnenius relationship as discussed in section 3.2. which means that the failure rate is exponentially dependent on junction temperature and activation energy It is theretore essential to make comparisons between different reliability data sources at equivalent activation energies and junction temperatures. otherwise the exponential dependence will introduce large dilferences between the failure rates. The activation energies used by MIL 217D. MIL 217 C and NCSR appropriate to different device technologies are given in
table(7). The higher the activation energy, the higher the failure rate at an elevated temperature.

Tre main difference between MIL 217C and MIL 2170 is the section on microcircuits. Only two activation energies used in the calculation of $T_{T}$ are used in Mil 217 C and no distinction is made between hermetic and nonnermetic (plastic) encapsulation. The revisions incorporated in MIL 2170 seem to improve the predictions. Nine different activation energies are used according to device technology and a distinction is made between nermetic and non-hermetic encapsulation with a higher activation energy quoted for non-hermetic devices. This seems reasonable since non-hermetic devices are more prone to corrosion due to moisture and this fallure mechanism is shown to have a high activation energy as shown in table(1). The NCSR data only uses three activation energies which agree closely with MIL 277D ior hermetic devices. The NCSR activation energies make no distinction between hermetic and non-hermetic devices and it is simply recommended to multiply the fallure rate by two for non-hermetic devices. This approach is felt to be too simple.

The CNET failure rate model uses two activation energies of 0.3 and 1.0 eV in the calculation of TTt . the temperature acceleration factor. The weighting between 0.3 and 1.0 eV is varied according to device technology. This seems to be a better approach than MIL 2170 and NCSR which take the rather simplistic view that only one activation energy is present when many activation energies may all be making contributions through different failure mechanisms.

Some manufacturers accelerated testing makes use of two activation energies. references[28.29]. although one activation energy is seen to dominate

The junction temperature is calculated according to:

$$
\begin{equation*}
T j=\text { Tamb }+ \text { Oja.Pdiss } \tag{3.4.1}
\end{equation*}
$$

```
where: Tj = junction temperature
    Tamb = ambient case temperature
        Qja = junction/ambient thermal resistance
        Pdiss= power dissipated
```

If Pdiss is small as in small CMOS or LSTTL devices. the temperature rise due to Oja.Pdiss will be small and the failure rate will be relatively insensitive to these two parameters. However in the case of a microprocessor which dissipates about 0.5 W . and can physically be felt to run warm, it is important to detervine accurately Oja.Pdiss. Motorola[42] give values of Qja trom $70-115^{\circ} \mathrm{C} / \mathrm{W}$ for a plastic device and $50^{\circ} \mathrm{C} / \mathrm{W}$ for a ceramic device. Considering MIL 217 D and the 6800 microprocessor. Oja is correctly given as $50^{\circ} \mathrm{C} / \mathrm{W}$ for a hermetic device. but the worst case power dissipation of $I W$ is given. It would seem more reasonable to use the typical power dissipation of 0.5 W which reduces the junction temperature by $25^{\circ} \mathrm{C}$. giving a large increase in predicted reliability.
if Oja cannot be determined from manufacturers data or the table in MIL 2170. then MiL 2170 recommends that the following values are used:

| Package type | Oja |
| :---: | :---: |
| ${ }^{\circ} \mathrm{C} / W$ |  |
| < 22 pin nérmetic | 30 |
| < 22 pin non-hermetic | 125 |
| $>22$ pin nermetic | 25 |
| $>22$ pin non-hermetic | 100 |

The NCSR data recommends the following values for Oja:

$$
\begin{array}{ll}
\theta_{j a}=60^{\circ} \mathrm{C} / \mathrm{W} & \text { hermetic devices } \\
\theta_{j a}=155^{\circ} \mathrm{C} / \mathrm{W} & \text { non-hermetic devices }
\end{array}
$$

figures seem excessively high.


#### Abstract

Although the power dissipation. Pdiss, is given in MIL 217D in the same table as eja, it is recommended that the manufacturers data is consulted instead since (for example) the power dissipation of an 8080 microprocessor is incorrectly given as 1.7 W and not 1.5 W as given by Intel [33]. It is further recommended that a more realistic failure rate is obtained by using the typical power dissipation. as quoted in the manufacturers data and not the maximum power dissipation. If possible the device case temperature should be measured to give an accurate value of case ambient.


### 3.4.7 TTL integrated circuits

The fallure rates of TTL integrated circuits are compared in table (8). The junction temperature of the ICL devices is unknown. but all other predictions are based on a junction temperature of $33^{\circ} \mathrm{C}$. The four predicted failure rates use very similar activation energies. Ea. The MIL 217 C prediction is obviously too high by at least an order of magnitude. but the other failure rates are seen to vary from worst to best by a factor of six. The predicted iailure rates are pessimistic when compared with the ICL iield data, alihough this difference may be due to lower junction temperatures of the ICL devices. If the CNET. MIL 217 D and NCSR predictions are compared. they are seen to vary by a factor of three which is thought to be very good. Since the CNET prediction falls between the other two predictions. it is proposed that the CNET data is accepted in preference to the other two for TTL failure rate prediction.

### 3.4.8 6800 Microprocessor

The tailure rates for a 6800 microprocessor are compared in table(9). The activation energies and junction temperatures used to calculate the fallure rates are those recommended by the respective reliability data
sources and are seen to vary considerably. All the predictions use a junction temperature which is at least $15^{\circ} \mathrm{C}$ too high. Again MIL 217C is seen to be 100 high by at least two orders of magnitude. altnough this is partly due to using a higher activation energy than the other predictions.

Table(10) shows the failure rates of table(9) converted to a common base of junction temperature and activation energy. An activation energy of 1.0 eV is chosen. since this is the value used by Motorola as well as the MIL-STD 883 screening procedure. MIL 217 C is ignored since this is much higher than the other values, however both MIL 217 D and NCSR predictions are seen to agree closely and give a sensible value of about one failure per twenty years. The Motorola failure rate seems to be too low and correspands to one failure per thousand years.

## $3.4 .9 \quad 8080$ Microprocessor

The dallure rates of an 8080 microprocessor are compared in table(1). The activation energies are very similar except for MIL 217C. It is interesting to note that Intel chose an activation energy of 0.5 eV as compared with 1.0 eV used by Motorola. The junction temperatures vary considerably, although the junction temperature used by Intel is unknown. but probably around $89^{\circ} \mathrm{C}$. The simple RRE failure rate model given in section 3.3 is seen to give a reasonable result. although probably too high by an order of magnitude. Again MIL 217 C is seen to be too high by about two orders of magnitude.

Table(12) shows the failure rates of table(11) converted to a common base of junction temperature and activation energy. With the exception of MIL 217C. the predictions agree closely. although they are nigher than the predictions for the 6800 by a factor of four. This is because of the higner junction temperature of the 8080 microprocessor. The intel and Motorola failure rates determined by accelerated testing are seen to agree closely. although they are felt to be too low.

### 3.4.10 EPROM

The fallure rates of 2716 EPROMs are compared in table(13) at a common junction temperature. With the exception of MIL 217C. the failure rates agree within an order of magnitude which is much better than for microprocessors. The CNET failure rate is seen to agree most closely with the manufacturers failure rate. An activation energy of 0.55 eV is used by both MIL 217D and NCSR since this is the energy corresponding to NMOS devices. Intel have found that 0.3 eV is more suited to accelerated testing of EPROMs. If the MIL $217 D$ and NCSA failure rates were calculated using an activation energy of 0.3 eV . then agreement between the failure rates would be detter.

### 3.4.11 Bipolar ROMS

The failure rates for a lk bipolar ROM are compared in table(14). With exception of MIL 217C. the failure rates all agree very closely and it is impossible to suggest that one value is more believable than another. It is interesting to note that all reliability data sources use nearly identical activation energies.

### 3.4.12 Dynamic RAM

The failure rates for a 16 k dynamic RAM are compared in table(15). The MIL 217 C value seems too high, whilst the Motorola figure seems too low. If the Motorola failure rate. having an activation energy of 1.0 eV . is converted to a failure rate, having an activation energy of 0.3 ev . then it may be compared sensibly with the Intel failure rate. Both the MIL 2170 and NCSR failure rates are high when compared with the manufacturers accelerated test results and the CNET prediction seems to agree more closely as found by Reynolds [35].

### 3.4.13. Static RAMs

The failure rates of 1 k static RAMs are compared in table(16). With
the exception of MIL 217C. the failure rates agree closely and it is impossible to suggest that one data source is preferable to another.

### 3.4.14 Comparison with other field data

Kiein [45] compares field failure rates with MIL 217C predictions. For all devices except PROMs. MIL 217C is found to be pessimistic by up to wo orders of magnitude. For PROMs. MIL 217C is shown to be too high by an order of magnitude for some devices. whilst it is found to be too low by an order of magnitude for other devices. For microprocessors MIL 217C is pessimistic by up to two orders of magnitude and observed fallure rates are in the range $0.3-2.0$ 1/million hours. For RAMs. MIL 217C is correct for some devices, but is pessimistic by up to two orders of magnitude for the majority of devices, and observed failure rates are in the range $0.1-20$ f/million nours. For ROMs. MIL 217C is pessimistic for most devices by an order of magnitude and observed failure rates are in the range 0.08-0.8 t/million hours

Reynolds [35] concludes that MIL 217C is pessimistic for most microcircuits and that the CNET predictions are more appropriate to "ground fixed" applications.

Daniels et al [46] compare observed failure rates based on field data with predictions made according to MIL 217C. incorporating Notice 1 (May 1980). for twelve pieces of equipment of two types, one being analogue and the other a digital controller. The comparison shows that $79 \%$ of predicted values are within a factor of two of the observed values. although MIL 217C is always pessimistic. In fact where microcircuits are concerned. it is proposed that they should have found there to be wider disagreement between predicted and observed values. It was assumed as a basis for the comparison that the temperature rise within equipment is only $10^{\circ} \mathrm{C}$ above an ambient temperature of $25^{\circ} \mathrm{C}$. Thus the failure rate of a 6800 microprocessor is predicted at a junction temperature of $35^{\circ} \mathrm{C}$. This
juriction temperature is much too low as previous discussion has shown. Assuming a thermal resistance of $\theta \mathrm{j} a=50^{\circ} \mathrm{C} / \mathrm{W}$ and a typical power dissipation of 0.5 W . the temperature rise within a 6800 microprocessor will be at least $25^{\circ} \mathrm{C}$. which gives a junction temperature of $50^{\circ} \mathrm{C}$. Since the types of microcircuits used in the digital controller are not specified it is impossibie to conclude now pessimistic MIL 217 C is. especially for microcircuits.

Claridge 147] makes a comparison between observed and predicted iailure rates of a large instrumentation and protection system. Although very few integrated circuits are used in the system. a useful lesson can probably be learnt. The system comprises about nine nundred sub-units. each containing about forty components. If the observed fallure rates of the sub-units are compared with the predictions, then the failure rates are seen to vary by about two orders of magnitude. If nowever the failure rate of the systems consisting of nine hundred sub-units are compared. then the predicted and observed failure rates vary by a factor of four with the prediction almost always pessimistic. Unfortunately the data source used for failure rate prediction is not quoted. but this example shows that observed and predicted failure rates can be shown to agree more closely if a comparison is made between a system containing many thousand components. rather than a hundred or so. A statistical analysis is always seen to be more accurate if the sample size is increasea.

### 3.4.15 Recommendations

The most widely used data source for reliability prediction is probably the MIL 217 series. It has the advantage that most commonly used components are listed. It has been shown that many of the predictions of MIL 217D. especially those for microcircuits are pessimistic. although it is better to err on the side of caution. If a more accurate failure rate prediction is required. then it is suggested that the following tailure
rates and reliability data sources are used for the following components:

| Component type | Data source / failure rate |
| :--- | :--- |
| Resistor | NCSR |
| Decoupling capacitor | MIL 217D |
| Sotdered joint | MIL 217D |
| Wire-wrap connection | 0.0008 f/M hrs |
| Connectors | CNET |
| TTL integrated circuits | CNET |
| Microprocessors | CNET |
| EPROM | CNET |
| Bipolar PROM | CNET. / MIL 217D |
| Dynamic RAM | CNET |
| Static RAM | CNET / MIL 217D |

The findings of this section as well as Reynolds [35] indicate a preference for the use of CNET data for microcircuits, rather than MIL 217D. although MIL 217D should not be in error by more than one order of magnitude.

Whichever reliability data source is used, is is recommended that for integrated circuits, the thermal resistance $\theta_{j a}$ is determined from the manufacturers data or. failing that, from MIL 217D as described previously In this section. The typical power dissipation. determined from the manufacturers data. should be used to calculate the device junction temperature. Tj.

### 3.5 COMPARISON OF DIFFERENT DEVICE TECHNOLOGIES AND ENCAPSULATION

For a given type of encapsulation. the failure rate depends on the device tecmnology in two ways:

Activation energy - The activation energies corresponding to the tailure mechanisms of different technologies are given in table(7) and discussed in section 3.4. There is fairly good agreement between manufacturers and data sources for failure rate prediction.
(ii)

Fower dissipation -. The device junction temperature depends on the ambient temperature and the temperature rise within the microcırcuit package. which is dependent on the power dissipation. Technologies such as TTL and ECL exhibit much higner power dissipation than CMOS.

### 3.5.1 C.MOS vs TTL logic

It is shown by Peterson [43] that non-nermetic CMOS devices are more sensitive to moisture activated corrosion and surface instability than TTL devices. Many CMOS devices were found to fail because properties such as input current and leakage current increased. sending the device out of tolerance. This is probably to be expected because of the high input impedance of CMOS. If hermetic $C M O S$ is considered. then it is still reported by Johnson et al [48] that TTL devices generally exhibit a much nigher reliabulity.

The comparison between CMOS and TTL is to some extent a trade-off between temperature and activation energy. Bipolar TTL devices dissipate more power than CMOS, therefore have a nigher junction temperature. but their activation energy is lower. CMOS devices fail according to a high activation energy. but dissipate negligible power. Peterson [43] concludes that plastic encapsulated CMOS has a failure rate five to thirly times that of plastic TTL dovices. Although there is considerable evidence suggesting that CMOS devices are less reliable than TTL, the findings of Motorola [39] suggest that CMOS. NMOS and HMOS have approximately equal failure rates. based on nearly six million hours of testing at $125^{\circ} \mathrm{C}$ on seven thousand
devices.
Unless design considerations dictate otherwise. it is recommended that TTL logic is used in preterence to CMOS for the above reasons:

### 3.5.2 Hermetic vs non-nermetic encapsulation

There are Dasically three types of hermetic package:
(i)
(ii)
(iii)

Ceramic package with metal lid - This type of encapsulation is common for VLSI devices and consists of a piece of ceramic moulded around the lead frame with a central "well" in the package. The die is placed in the well and wire-bonded to the lead frame. The device is then sealed by brazing or soldering on a metal iid. The package is available in both DIL and leadiess chip carrier.

Cerdip - This package is available in both DIL and flat pack and consists of a lower slab of ceramic which has the lead frame moulded into it. The die is placed on top of the lead frame and wire-bonded to it. The package is sealed by a ceramic lid with a glass seal around the edges.

Metal can - This type of package is rarely used for logic devices and is mainly used for linear and hybrid devices. The microcircuits are contained within a sealed metal can with connections passing through glass seals on the bottom of the package.

Non-hermetic packages consist of a die attached and wire-bonded to a lead frame and then the die and lead frame are encapsulated using an epoxy. silicone, or phenolic resin.

Apart from other considerations. hermetic devices have a lower thermal resistance (typically $50^{\circ} \mathrm{C} / \mathrm{W}$ ) than plastic devices (typically $100^{\circ} \mathrm{C} / \mathrm{W}$ ). This difference between plastic and ceramic devices may be reduced with the introduction of copper lead frames instead of aluminium as currently used

Motoroia [39] report the advantage of using a copper lead frame and the improvement in plastic device thermal resistance. For devices which dissipate a lot of power, the junction temperature of hermetic devices will be lower than that of plastic devices which should give a more reliable device. If devices are to be used over the full military temperature range of $-55^{\circ} \mathrm{C}$ to $+125^{\circ} \mathrm{C}$. then it is necessary to use ceramic devices.

References[49.50] suggest that ceramic devices are twice as expensive as plastic devices and Hakim et al [50] propose that ninety per cent of all integrated circuits are plastic. This poses a problem for high reliability users of microcircuits since nearly all devices are available in plastic. but not all are availabie in ceramic. It is therefore necessary for the high reliability user to consider the use of plastic devices on the grounds of cost and availability. Plastic microcircuits offer lower weight and it has been suggested that the encapsulant which surrounds the die and wirebonds makes the device more robust.

There are two main disadvantages with plastic devices:
Moisture which is either trapped inside the package when sealed. or which leaks into the package along the lead-frame. has been shown to cause corrosion of the die and wire-bonds. references[43.48.50.54.66].
(ii) Package integrity - If the coefficients of thermal expansion of the die. lead-frame, and moulding compound are not equal. stresses will be set up within the package which may crack the die or break the wire-bonds.

In order to accelerate the failure of devices due to moisture corrosion it is common to perform THB (temperature humidity bias) testing. The device under test is placed under conditions of $85^{\circ} \mathrm{C} / 85 \%$ in with a static bias of +5 V applied. The biassing of the device is organised to minimise the power dissipation of the device so that heat generated on-chip
will upset ine test conditions as little as possible. References[51.52.53] propose equations linking the acceleration in fallure rate to the temperature and relative numidity. Reynolds [35] suggests that equipment under benign conditions of $30^{\circ} \mathrm{C} / 30 \% \mathrm{rn}$ will undergo an acceleration in failure rate of 550 under conditions of $85^{\circ} \mathrm{C} / 85 \% \mathrm{rh}$. whilst the corresponding acceleration factor for equipment normally operated under uncontrolled conditions of $12^{\circ} \mathrm{C} / 80 \% \mathrm{rh}$ is 210 . The latter acceleration rate is chosen as the rate most appropriate to British Gas "field conditions".

Lycoudes [54] concludes that new plastics have virtually eliminated the early problems of package integrity and that the reliability of plastic devices is comparable to that of hermetic devices. provided that environmental conditions are not extreme.

Results of accelerated THB testing are given by Hakim et al [50] who report that hermetic ceramic devices perform best of all. but that the extrapolated failure rates for plastic devices are accepiable. They conclude that the use of plastic devices in military and high reliability equipment is not recommended under all conditions. but their use under controlled conditions might be acceptable.

Bauer et al [66] recommend that plastic devices may be used with caution in high reliability applications as long as their failure rate is estimated to be four times worse than equivalent ceramic devices.

The results of $85^{\circ} \mathrm{C} / 85 \%$ rh testing. thermal shock testing. and temperature cycle testing on plastic packages are reported by Motorola 139.40.41]. The results of thermal shock testing are good. especially in reference[40]. confirming that new plastics have eliminated early problems. The resuits of $85^{\circ} \mathrm{C} / 85 \%$ rh testing. reference[39], are more difficult to interpret. For a random failure process it is a valid assumption to multiply the number of devices under test by the test duration. which gives
an equivalent number of device hours under test conditions. However failure due to corrosion is not a random process and should not be analysed Dy this method. The only conclusion that can be drawn from the Motorola data is that out of a sample of 1717 devices. 18 devices had failed after 1008 hours of testing at $85^{\circ} \mathrm{C} / 85 \% \mathrm{rh}$. If however this recommendation is ignored and the results of $T H B$ testing in reference[39] are used to give a lallure rate at $85^{\circ} \mathrm{C} / 85 \% \mathrm{rh}$, then at $90 \%$ confidence the fallure rate is 14 I/M hours which agrees well with the results of Hakim et al [50]. If this failure rate is extrapolated to conditions of $70^{\circ} \mathrm{C} / 30 \%$ rh such as might prevail under. Motorola's accelerated temperature testing. then the result is a failure rate of $0.37 \mathrm{f} / \mathrm{M}$ hours as compared with the accelerated testing result for plastic devices of $0.2 \mathrm{f} / \mathrm{M}$ hours. Although THB testing results snould not be extrapolated to give a failure rate, it appears that reasonable results are obtained if extrapolation is performed.

For the above reasons. the results of extrapolated THB testing should be treated with caution.

- An interesting result is given by Motorola [39]. in which the failure rates of plastic and ceramic devices are given as $0.2 \mathrm{f} / \mathrm{M}$ hours and 0.24 f/M hours respectively. This suggests that under favourable conditions. there is no difference between the reliability of plastic and ceramic devices. This finding is also the conclusion of Fox [67].

The graph of figure(8) shows the failure rates of equivalent plastic and ceramic 8085 microprocessors plotted against ambient temperature. The failure rates are calculated according to MIL 217 D and CNET data, using the following equations:

MIL 217D
$\left.\lambda_{\text {plastic }}: 0.1295 \exp \left[9270, \frac{1}{298}-\frac{1}{T+358}\right)\right]+1.785$
$\lambda_{\text {ceramic }}=0.0296 \exp \left[6373\left(\frac{1}{298}-\frac{1}{T+315}\right)\right]+0.533$

## CNET

$$
\begin{align*}
& \lambda_{\text {plastic }}=7.28 \times 10^{4} \exp \left[\frac{-3500}{T_{A}+358}\right]+1.27 \times 10^{14} \exp \left[\frac{-11600}{T_{A}+358}\right]+1.575  \tag{3.5.3}\\
& \lambda_{\text {ceramic }}=7.28 \times 10^{4} \exp \left[\frac{-3500}{T_{A}+315}\right]+3.64 \times 10^{13} \exp \left[\frac{-11600}{T_{A}+315}\right]+0.525 \tag{3.5.4}
\end{align*}
$$

The MIL 2170 and CNET predictions for ceramic devices agree closely. but predictions for plastic devices disagree by nearly an order of magnitude. The CNET data which is less pessimistic than MIL 217 D is felt to be more accurate when considered in relation to this section and the findings of Motorola. In fact under favourable conditions. packages may be even better than CNET data predicts.

If is proposed that the lower failure rate of plastic TTL devices as compared with CMOS is due in part to the heat generated by TTL devices. which tends to drive off any moisture present in the package. Since CMOS generates almost negligible heat. any moisture present in the package will not be driven off and moisture activated corrosion will proceed at a faster rate because of the higher level of relative numidity.

### 3.5.3 Recommendations

i. TTL logic should be used in preference to CMOS unless power consumption is critical.
2. There is strong evidence to suggest that hermetic devices are slightly more reliable than plastic devices. and it is recommended that ceramic packages are used in all but the most cost-sensitive applications.
3. Failure rate prediction of plastic encapsulated devices should be performed using CNET data in preference to MIL 217 D .

### 3.6 SCREENING AND METHODS OF IMPROVING FAILURE RATES

One of the mosi common ways of improving the reliability of individual components and electronic systems is the technique of "burn-in". The device under test is operated under normal conditions or more commonly elevated temperature for a period of time which is sufficient to remove the freak population of early fallures which occur on the "infant mortality" section of the "bath tub" curve. As components fail, they are replaced and on completion of a successful burn-in, all weak components will have been replaced and the failure of components will assume a constant rate due to random tailures. Techniques for determining an optimum burn-in profile are given in references[55.56] since it is important to burn-in for sufficient time to expose weak components, but further burn-in time will be a waste of time. or in the case of some components. will exhaust some of the useful life of the components.

For high reliability applications. components are available from manufacturers which have satisfied agreed standard screening procedures. Such screening may involve visual. mechanical. and electrical tests. Integrated circuit failure rate predictions in MIL 217D refer to the screen level of a device according to US MIL STD 883. This screening procedure was introduced in 1968 in order to create an economically feasible standardised integrated circuit screening flow which would achieve inequipment failure rates of $0.8 \mathrm{f} / \mathrm{M}$ hours and $0.04 \mathrm{f} / \mathrm{M}$ hours for class B and class A devices respectively. The standard has been modified since 1968 and now represents a very tough screening specification. An equivalent screening specification is covered by British Standard BS9400. The screening specifications for integrated circuits are covered in detail by National Semiconductors 1581 .

The reiation between MIL 2170 screen level. screening specification. quality factor $\pi q$; and typical relative cost is given in table(17).

Considerable differences exist between MIL STD 883 class S and BS9400 class A. for example. class $S$ screening specifies non destructive wire-bond pull tests, however the two classes are broadly similar. The screening classes $B$ and $C$ of MIL. STD 883 are very similar to those of BS9400. and BS9400 includes a iourth class $D$ not covered by MIL STD 883. The screening requirements of 859400 are shown in figure(9). Considering screening level A. the process begins with a visual examination of the die and wire-bonds under 30-200 times magnification. The condition $A$ visual examination is more stringent than $B$ and specifies maximum allowable deformation in metalisation widths etc. The visual examination is obviously very labour intensive and theretore expensive. The device is then encapsulated and baked to stabilise it. The integrity of the package and wire-bonds is now tested by temperature cycling. mechanical shock, and constant acceleration tests. The package seal is tested next by fine and gross leak tests and then the aevice is tested electrically. In order to detect weak devices. the device is burnt-in at high temperature, after which it is tested again electrically. A high temperature reverse bias is now applied and the device is again tested electrically. Finally the device undergoes an X-ray examination 10 derect bad wire-bonds, particles inside the package. or bad die adhesion. If all these tests are passed successfully. the device continues to routine testing before being released. Screening levels B. C. and $D$ form a sub-set of level $A$ as well as having a less stringent visual examination and burn-in. The only difference between screening level $B$ and $Q$ is the final burn-in of class $B$ devices. It is therefore recommended that devices are procured according to BS9400 class $C$ at a typical relative cost of six times that of a plastic device. The equipment is then assembled and subjected to a burn-in at $125^{\circ} \mathrm{C}$ for 160 hours. In this way. class $C$ devices which survive the assembly and burn-in could be considered 10 be equivalent to class $B$ devices. According to table(17). this burn-in process should increase the reliability by a factor of three. making the
*
equipment an order of magnitude more reliable than equipment using plastic devices. but at only six times the cost.

As an alternative to this burn-in procedure. British Telecom burn-in some nigh reliability equipment for 500 hours at $85^{\circ} \mathrm{C}$.

### 3.6.1 Derating and cooling

Feducing the stresses on a component will increase its life. Jensen [57] suggests that components should be derated according to the following guidelines for high reliability applications:

Resistors - derate power to 0.5 rated Electrolytic capacitors - maintain core temperature below $70^{\circ} \mathrm{C}$

Semiconduciors $-\quad$ Power derating $=0.3$
Current derating $=0.5$
Voltage derating $=0.6$

Wherever possible. the junction temperature of semiconductor devices should not exceed the maximum values of table(18).

The reliability of CMOS and linear integrated circuits can be improved by derating the power supply voltage. This is not possible with TTL and NMOS logic which must be supplied with $+5 \mathrm{~V} \pm 5 \%$. A $5 \%$ reduction in supply voltage would increase the reliability slightly. but at the expense of reduced noise immunity and tolerance of power supply voltage fluctuations.

Since the fallure rate of integrated circuits is approximately exponentially dependent on temperature, it is important to minimise the junction temperature. Therefore circuit boards should be mounted well away from power supplies, heat sinks etc. and provided with as much cooling as possible. in many cases. the use of forced cooling is advantageous and can dramatically reduce the failure rate of equipment.

The power dissipation within a device is also dependent on the output loading. Jensen [57] recommends that the fan-out of digital circuits and

Ine loading of linear circuits is derated by a factor of at least 0.8 .

### 3.7 SOFTWARE AND TRANSIENT ERRORS

Before discussing malfunctions in digital controllers. it is necessary to deline the terms "error" and "fault". They are defined by Anderson and Lee 100$]$ as

```
Error - Part of an erroneous state that constitutes a difference from
a valid state.
Fault .. An error in a component or the design of a system will be referred to as a fault in a component or system
```

An error can thus be seen as the manifestation of a fault, a single fault producing one or more errors.

Faults in digital controllers are either permanent (hard) or transient (soft). Permanent faults are normally easy to diagnose and repair and are caused by the failure of components and software. Unless a software fault is fundamental to the operation of the controller, it is usually possible to recover from the fault by the technique of "exception handling". described iater. Careful testing should reveal permanent software faults. nowever hardware may fail at any time. as described in section 3.4. and redundancy techniques are required to protect against such failures. Although it is difficult to predict accurately the failure rates of hardware as demonstrated in section 3.4 . sufficient data exists to make an estimate which will be correct within an order of magnitude or so.

Transient faults are much harder to deal with since they are by their very nature of short duration, and may be caused by many different sources. It is not always possible to detect the cause of a transient fault. although a particular class of transient fault may be injected into a system to measure the system's fault-tolerance to that class of fault.
internal operation of most microprocessors is dynamic and relies on stored charge. so it is reasonable to expect that smaller device geometries in microprocessors will increase their sensitivity to alpha radiation.

Motorola [61] describe the design procedure for a 64 k dynarnic RAM with particular attention to the effect of alpha radiation. Transient error rates as high às $500 \mathrm{f} / \mathrm{M}$ hours are quoted for initial production units in 1979. Transient error rates for current production devices are quoted as $22 \mathrm{f} / \mathrm{M}$ hours. being 22 times greater than permanent failure rates.
(v) Environmental effects - Often digital controllers are sited in harsh electrical environments which may cause transient faults due to noise on the electrical supply or high electromagnetic fie:ld strengths. LSI circuits are particularly prone 10 electrical .interterence and in the worst enviromments. further measures may be required other than attention to earthing.

```
screening, and provision of a "clean" power supply.
```

A powerful tool for the correction of transient faults is the provision of fault-tolerant software which is written to check for faults and provide any necessary correction.

## 3. $\overline{8}$ SUMMAFIY

Often. as is the case for the experimental controller, a design requirement is that a piece of equipment does not exceed a certain failure rate. It is also necessary to predict the failure rate of equipment so that alternative designs may be compared. The previous chapter has shown that it is difficult to predict accurately the failure rate of equipment. Components are assumed to fail according to the "bath-tup" curve and it is assumed that during the useful life of the component, that the failure rate is constant. In order to screen semiconductor equipment and to produce failure rate data, it is essential to accelerate the failure process and the Arrhenius acceleration equation is universally used when calculating the acceleration in fallure rate due to a rise in temperature.

The wide variation in failure rate prediction data is discussed in section 3.4 and the comparison of different predictions with field data suggests that some data sources are preferable to others. The recommended Galure rate prediction data source for each type of component is given at the end of the section. For most components. the CNET data [38] is to be preferred.

A comparison of device technologies shows that TTL is more reliable than CMOS, which is probably due to the high input impedance of CMOS which makes it very sensitive to moisture, and the higher power dissipation of TTL which is sufficient to dry out the integrated circuit package. Unless design considerations require the use of CMOS in low power applications, it is recommended that TTL is used in preference to CMOS.


## CHAPTER 4

## PREDICTION OF RELIABILITY IMPROVEMENT DUE TO FAULT-TOLERANT TECHNIQUES

### 4.1. METHUUS OF EXPRESSING RELIABILITY

Unlike the physical parameters of an electronic device such as resistance, capacitance, voltage etc. which are easy to use and to measure. ine rellability of a component is a very difficult quantity. The whole of reliability engineering relies on the use of statistics and therefore great caution must be exercised. since statements are made concerning statistical quantities rather than physical quantities.

The failure of electronic components is shown by Cluley [1] to fit the Poisson distribution. The reliability of a system is defined as:

The probability that the system will perform its required function.
under stated conditions. for a stated period of time

Normally a system is required not to fail, so the reliability of a system is given by the probability that no failures are observed during a given tirne period. The probability is given by:

$$
\begin{equation*}
\dot{P}(0)=\exp \left(-\lambda_{t}\right) \tag{4.1.1}
\end{equation*}
$$

where: $t=$ time interval
$\lambda=$ failure rate of the component
$P(0)=$ probability of no failures
which can be rewritten as:
$R=\exp \left(-\lambda_{1}\right) \quad$ where $: R=$ reliability of the component (4.1.2)

In order to calculate the reliability it is necessary to know the failure rate of the component. $\lambda$. which can be measured by means of life testing or may be predicted as described in chapter three.

The concept of "mission time" is useful when it is required to predict the probability that a piece of equipment will operate successfully over a given time perion. The probability of success is given by

$$
P=\exp (-\lambda T) \quad \text { where }: \quad T=\text { mission time } \quad \text { (4.1.3) }
$$

Such an analysis is often used in space and defence when it is required to calculate the probability of a successtul space mission or the probability that a missile will hit its target.

In an industrial environment. mission time is a less useful concept as it is normally required to predict how long a piece of equipment will operate before failing. The mean time to failure. MTTF. is defined by :

$$
\begin{align*}
\text { MTTF } & =\int_{0}^{\infty} R(t) d t  \tag{4.1.4}\\
& =\int_{0}^{\infty} \exp (-\lambda t) d t \\
& =\frac{1}{\lambda}
\end{align*}
$$

If the MTTF is substituted for the mission time for a piece of equipment. then the reliability is given by :

$$
\begin{aligned}
R & =\exp \left(-\lambda \frac{1}{\lambda}\right) \\
& =\exp (-1) \\
& =0.37
\end{aligned}
$$

This means that there is only a $37 \%$ chance that a piece of equipment will operate for a time equal to the MTTF. It is for this reason that the MTTF should be used carefully.

### 4.2 EFFECT OF REDUNDANCY ON RELIABILITY

For $n$ units in series, each having failure rates $\lambda_{1}, \lambda_{2}, \lambda_{3}$. etc., the system reliadiity is given by:

$$
\begin{equation*}
R=\exp \left[-\left(\lambda_{1}+\lambda_{2}+\lambda_{3}+\ldots \ldots .\right) t\right] \tag{4.2.1}
\end{equation*}
$$

The failure rates may simply be added together to calculate the total system failure rate. This is the case for most failure rate predictions and the failure rate of the governor controller is calculated by summing the failure rates of the individual components.

The simplest form of redundancy is a parallel standby system where either component can fail without the system tailing.


Assuming that the two components fail independently. the probability of fallure is given by

$$
P(1)=(1-R)^{2} \quad \text { where } \quad P(f)=\text { probability of failure }
$$

since: $\quad P(s)=1-P(f) \quad$ where $\quad P(s)=$ probability of success
the reliability of the system is given by:

$$
\begin{equation*}
R=2 R-R^{2} \tag{4.2.2}
\end{equation*}
$$

Cluley [1] shows that the MTTF of the system is equal to:

$$
\text { MTTF }=\frac{3}{2 \lambda}
$$

which is only 50 per cent more than the MTTF of a single channel.

If a TMR system having perfect voters is considered. the reliability is given Dy

$$
\begin{equation*}
R=3 R^{2}-2 R^{3} \tag{4.2.3}
\end{equation*}
$$

and the MTTF is given by:

MTTF $=\frac{5}{6 \lambda}$
in this case the MTTF is less than that of a single channel. At first sight this would suggest that a TMR system is inferior to a duplicated or a single channel. however a TMR system can be shown to be more reliable if repair or the concept of mission time is considered.

A graph is plotted in figure(10) of the reliability of a simplex and a TMR system against $\lambda_{t}$. the normalised mission time. It is seen that there is a crossover at which point the TMR system ceases to be more reliable than the simplex system. The position of this point may be calculated by equating the simplex and TMR reliabilities according to:

$$
R=3 R^{2}-2 R^{3}
$$

The solution $R=0.5$ corresponds to $\quad \lambda t=0.693$. The crossover point which ocurrs before $\lambda_{t=1}$ explains why the MTTF of a TMR system is worse than that of a simplex system. For a system without repair. the useful life of a TMR system is limited to $\quad \lambda_{\mathrm{t}}=0.693$ or less.

### 4.3 EFFECT OF MAINTENANCE ON A SYSTEM

The previous sections have not considered repair and a system was considered to have exceeded its useful life once it had failed. This is typical of space and defence equipment. Systems which are repaired when they have failed will now be considered: this siluation can be considered typical of industrial equipment such as the governor controller. Since
systems are repaired between failures, the term MTBF (mean time between failures) is used to replace the term MTTF. The mean time to repair of a failed system is given by the MTTR. The availability of a system is defined as

$$
\begin{equation*}
\dot{A}=\frac{M T B F}{M T B F+M T T R} \tag{4.3.1}
\end{equation*}
$$

The availability may be interpreted as a probability that the system is in working order at any instant. The concept of availability is useful when planning the maintenance of equipment. If the MTBF of a piece of equipment is known. and it is required to achieve a certain level of availability. then the necessary MTTR to achieve that level of availability may be calculated.

### 4.3.1 Redundant systems

The introduction of redundancy into a system, logether with repair. allows considerable improvements to be made in equipment reliability. For example if one channel of a TMR system fails and is repaired before a further channel fails, then the system will operate without interruption. In order to achieve interruption free operation. it is essential to know when components fail so that they may be replaced before further failures occur.

For a repairable system with $n$ units of which one is a spare. the Markov failure model is given below:

where: $\quad \begin{aligned} & \lambda_{1}=\text { failure rate of a single unit } \\ & \lambda_{2}=\text { repair rate } \\ & c=\text { coverage }\end{aligned}$

The coverage is a number between zero and unity and is a measure of the

- proportion of faults which are covered by the redundancy in the system. where unity is equivalent to complete coverage. The transfer rates between states are given by:

$$
\begin{align*}
& \frac{d P_{1}(t)}{d t}=-n \lambda_{1} P_{1}(t)+\lambda_{2} P_{2}(t)  \tag{4.3.2}\\
& \frac{d P_{2}(t)}{d t}=n \lambda_{1} c P_{1}(t)-\lambda_{2} P_{2}(t)-(n-1) \lambda_{1} P_{2}(t)  \tag{4.3.3}\\
& \frac{d P_{3}(t)}{d t}=n \lambda_{1}(1-c) P_{1}(t)+(n-1) \lambda_{1} P_{2}(t) \tag{4.3.4}
\end{align*}
$$

The slate transition matrix is therefore given by :
$\left[\begin{array}{lll}-n \lambda_{1} & n \lambda_{1} c & n \lambda_{1}(1-c) \\ \lambda_{2} & -\left(\lambda_{2}+(n-1) \lambda_{1}\right) & (n-1) \lambda_{1} \\ 0 & 0 & 0\end{array}\right]$

The solution of these equations in terms of the MTFF (mean time to first failure) is given by Dhillon and Singh [71] as:

$$
M T F F=[10][-Q]^{-1} U \quad \text { where }: \quad U=\text { unit matrix }
$$

$Q$ is the reduced state transition matrix which is given by:

$$
|Q|=\left[\begin{array}{cl}
-n \lambda_{1} & n \lambda_{1} c \\
\lambda_{2} & -\left(\lambda_{2}+(n-1) \lambda_{1}\right)
\end{array}\right]
$$

Hence :

$$
[-Q]^{-1}=\frac{1}{\Delta}\left[\begin{array}{cc}
{\left[\lambda_{2}+(n-1) \lambda_{1}\right]} & n \lambda_{1} c \\
\lambda_{2} & n \lambda_{1}
\end{array}\right]
$$

The MTFF is given by

$$
\begin{align*}
\text { MTFF } & =\left[\begin{array}{ll}
1 & 0
\end{array}\right][-Q]^{-1} \\
& =\frac{\lambda_{2}+(n-1) \lambda_{1}+n \lambda_{1} c}{n(n-1) \lambda_{1}^{2}+n \lambda_{1} \lambda_{2}(1-c)} \tag{4.3.6}
\end{align*}
$$

For $\lambda_{1} \ll \lambda_{2}$ this equation can be simplified to:

$$
\begin{equation*}
M T F F=\frac{\lambda_{2}}{n(n-1) \lambda_{1}^{2}\left[1+\frac{(1-c)}{(n-1)} \frac{\lambda_{2}}{\lambda_{1}}\right]} \tag{4.3.7}
\end{equation*}
$$

Which is the same equation as that given by Arnold [72].
Equation(4.3.6) may be checked for a TMR system with no repair which has 100 per cent coverage using the values:

$$
\begin{aligned}
\lambda_{2} & =0 \\
c & =1 \\
n & =3
\end{aligned}
$$

this gives:

$$
M T F F=\frac{5}{6 \lambda}
$$

which agrees with the expression for MTFF derived previously.
The concept of coverage as applied to software is easy to define. It
 hardware if l:: proposed that the coverage is defined by:
$c=\quad \frac{\text { failure rate of protected circuitry }}{\text { failure rate of protected }+ \text { unprotected circuitry }}$

The unprotected circuitry includes all common circuitry which contributes to common mode failures.

### 4.4 METHODS OF EXPRESSING IMPROVEMENT

Any improvement obtained in the reliability of a system can be shown by an improvement in the availability or the reliability of the system for a given mission time. In an industrial environment it is useful to consider any improvement that can be made to the MTTF of a system. The MTTFIF (mean time to failure improvement factor) is defined as:

$$
\text { MTTFIF }=\frac{\text { MTTF (fault tolerant) }}{\text { MTTF (non fault tolerant) }}
$$

Hence the MTTFIF for a system with $n$ units of which one is a spare is given by the modified equation(4.3.7):

$$
\begin{equation*}
\text { MTTFIF }=\frac{\lambda_{2} / \lambda_{1}}{n(n-1)\left[1+\frac{(1-c)}{(n-1)} \frac{\lambda_{2}}{\lambda_{1}}\right]} \tag{4.4.1}
\end{equation*}
$$

Hence the MTTFIF for a TMR system with repair as shown by Pearson and Preece [73] is given by:

$$
\begin{equation*}
\text { MTTFIF }=\frac{\lambda_{2 /} \lambda_{1}}{6\left[1+\frac{k}{6+2 k} \frac{\lambda_{2}}{\lambda_{1}}\right]} \tag{4.4.2}
\end{equation*}
$$

$$
\text { where: } \quad \begin{aligned}
\lambda_{2} & =\text { repair rate } \\
\lambda_{1} & =\text { failure rate of a single unit } \\
k & =\frac{\lambda_{v}}{\lambda_{1}} \\
\lambda_{v} & =\text { failure rate of voters }
\end{aligned}
$$

A graph of MTTFIF against $\quad \lambda_{2} / \lambda_{1}$ for different $k$ factors is shown in figure(11). For large values of $k$. it is seen that little improvement can be made in the MTTF as the ratio $\lambda_{2} / \lambda_{1}$ is increased. For small values of $k$. large improvements can be made to the MTTF as the ratio $\lambda_{2} / \lambda_{1}$ is increased. For a given value of $k$ it is seen that the curve reaches a plateau. after which further increases in $\lambda_{2} / \lambda_{1}$ cause little extra improvement in MTTF. The point at which this plateau is reached depends on $k$ and this point should be used to determine the minimum time to repair. Any further improvements in the repair time will be wasted. since the MTTF cannot be improved much more.

These findings can be summarised by stating that in order to achieve improvement in the MTTF of a system by repair. the fault coverage must be nigh.

A similar approach can be used to determine the improvement to be gained by using SECIDED Hamming code protected memory. In this case there are $n$ RAM chips of which one is a spare. The coverage is calculated as previously defined:

$$
c=\frac{\text { failure rate of RAM chips }}{\text { failure rate of RAM chips + Hamming code circuitry }}
$$

### 4.5 SUMMARY

Two types of reliability analysis have been discussed. The first is concerned with predicting the probability of failure free operation of a non-repairable system for a given mission time. This analysis is typically used in space and defence applications. The second type of analysis is directly applicable to the governor controller. The individual failure rates are anded together to give the total failure rate. The effect of maintenance and repair is then considered and the MTTF of the equipment is calculated. In order. that a repairable redundant controller may be
compared with a non-redundant system without repair. the concept of MTTF improvement factor is introduced. Consideration of the MTTF improvement factor allows the optimum repair time to be calculated.

## CHAPTER 5

## TECHNIQUES FOR IMPROVING A SYSTEM'S FAULT-TOLERANCE

### 5.1 LEVELS OF FAULT-TOLERANCE

A structured approach to fault-tolerance implementation is proposed by Halse et al [74]. Fault-tolerant techniques are assigned levels of recovery as shown below.

level 3
level 2
level 1

At the lowest level, level 1, complete recovery is provided from a limited range of fault types. Recovery is embedded in the hardware design of the system and in the experimental controller consists of recovery from voting and memory errors. Level 2 provides a more limited recovery from a wider range of faults. Any state of the controller which is outside its specification is termed an "exception". As the processing of data proceeds. exception handling routines check for errors which may be caused by external stimuli or design faults. and initiate recovery from them. Recovery routines perform the logging of errors and identical fault recovery may be used to recover from different types of fault. At the highest level. level 3. limited or degraded recovery, such as a fail-safe emergency shut-down. is provided from a wide variety of fault types.

A structured approach to assigning levels of fault recovery is useful. since if a given level fails to recover from a fault. recovery is passed up to the next level until recovery or soft failure is successfully completed.

### 5.2 DESIGN TOOLS

Two commonly used design tools for assessing the reliability of a piece of equipment are FMECA and FTA which are defined below.

### 5.2.1 FMECA

Fault mode. effect and criticality analysis is a formal design exercise where each component in a system is considered and a failure rate is assigned to each mode of failure. For instance a resistor might fail open circuit 25 per cent of the time and fail out of tolerance for 75 per cent. The consequence of resistor failure is listed, together with a criticality rating. Failure open circuit might result in failure of the circuit. whilst parameter drift might result in a degraded but acceptable performance. An example of a FMECA is given below :

| Item | Failure mode | fallure rate <br> f/M hours | effect | criticality <br> rating(l=low <br> $3=$ high) |
| :--- | :--- | :--- | :--- | :--- |
| resistor R1 | open circuit | 0.002 | no output <br> from circuit | 3 |
| resistor R1 | parameter <br> drift | 0.006 | reduced <br> output | 1 |
| capacitor C1 | short cct. | 0.002 | no output <br> reduced gain <br> at high freqs. | 2 |

Hence the failure rate at each criticality rating may be determined. rather than assuming all fallures are catastrophic. This analysis is well suited to analogue circuits and data sources such as NCSR [34] give the percentage failure for each mode. The criticality analysis is of little use for systems containing many digital integrated circuits. Failure rate data gives no information about failure modes and it is assumed that any failure within complex integrated circuits results in total failure of the device.

In this case. the FMECA is shortened to FMEA, fault mode effect analysis. If the tailure rates of all the components in the system are tabulated. then components having a high relative failure rate can be identified and corrective action taken. An example of a FMEA is given in section 7.3 when the microprocessor was found to dominate the total failure rate.

### 5.2.2 FTA

Fault tree analysis is complementary to FMECA and. FMEA. Instead of being a bottom-up process it is a top-down process. The analysis starts by considering failure modes of the complete system and then working downwards to identify the part failures which could generate such system failures. Conventional logic symbols are used to combine events. An example of a FTA for a pressure trip is shown in figure(12). If the pressure exceeds the preset value it is required to trip the process. To prevent spurious trips. two pressure transducers are used. both of which must signal overpressure before action is taken. If probabilities are assigned to each fault, then the probability of an erroneous trip may be calculated.

### 5.3 HARDWARE

When considering the failure rate of hardware it is useful to remember that using less hardware will decrease the failure rate, since there are fewer components to fail. A well managed and designed piece of equipment is likely to be more reliable than a nurried design where the main acceptance test is whether the equipment works or not.

Several methods are now proposed for improving the fault-tolerance of digital controllers. Many of these methods are used on the governor controller described in chapter eight.

### 5.3.1 Watchdog timer

A watchdog timer is probably one of the simplest and most effective
devices that can be added to a system. It will not cure all ills. but should prevent the system remaining in a crashed state. In its simple form. the watchdog consists of a monostable which is triggered under software control by the syatem it is protecting, at time intervals less than the pulse width of the monostable. In this way the monostable generates a constant pulse. If the retriggering pulse. is not received. then after a delay set by the monostable. it times-out and a reset pulse is generated for the system. It is important that the watchoog is retriggered once only per program cycle. and this is usually performed on each pass through the main control loop. There is a small probability that the controller can stick in an erroneous loop. continually retriggering the watchdog, and the watchdog will fail to reset the system. To protect against this more complex watchdogs can be used. As the controller passes in an orderly sequence through a set of instructions. a signature or key is output to the watchdog which is only retriggered it it receives this key in the correct sequence.

### 5.3.2 Snake

A snake is a long sequence of NOPs or restart instructions which occupy any unused address space. An erroneous jump into this address space causes the program to "slide" down the snake until vectored recovery is executed at the end of the snake. Unused address space is connected so that an instruction fetch from non-existent memory fetches a no-operation or software restart instruction. A software restart instruction is used to jump back into the program. If no-operation instructions are used. then the controller will execute a whole series of NOPs until the program counter arrives at a valid address range. The 8085 microprocessor ideally lends itself to this technique as described in chapter eight. It nonexistant memory is pulled high, an instruction fetch will read FF and a RST7 will be executed. If the microprocessor does not support software
restarts and if full memory decoding is used. memory selects corresponding to non-existant memory can be used to force the instruction for a NOP or a restart onto the data bus.

### 5.3.3 Power supply levels

Many logic circuits are only guaranteed to function over a $\pm 5$ per cent power supply variation. Power supplies should be set to the middle of this range to give maximum protection against noise. It is important to detect transient undervoltage of the power supplies and to reset the system. since transient undervoltages have been observed to crash microprocessors. A dip in the +5 V supply to a microprocessor by only iv was found to crash it and such a dip was too fast to be detected by the power-on-reset circuitry.

### 5.3.4 Output verification

After an output function has been performed. the output is read back and is compared with the value which has just been output. In this way it is possible to test for fallure and transient faults in the output circuitry.

### 5.3.5 Component redundancy

Critical circuit components can be duplicated with a spare. This has both cost and reliability advantages over a TMR system, but suffers from the severe drawback that with only two components it is difficult to determine which is in error. A TMR configuration overcomes this difficulty by taking a majority vote. Chapter four has shown that with maintenance. large improvements can be made in the rellability of equipment using a TMR configuration. however care should be taken not to exceed the "useful life" of a TMR system without repair. Voting can be performed in software or by the logic circuit of figure(5).

The addition of an extra parity bit to memory allows errors to be detected as long as an odd number of bits is in error. but it is not possible to correct errors. A better type of protection is the SEC/DED Hamming code which will correct single bit errors and detect double bit errors. The disadvantage with this code is that the memory word length is increased since check bits must be stored as well as data bits.

For small systems the coding can be performed in ROM as described in chapter eight, but for larger word lengths there are several VLSI circuits available which implement SEC/DED Hamming code on 16 bit word lengths. references $[59.79 .80]$. For small systems. the main benefit is to be gained from the correction of transient errors, nowever large systems benefit in addition by the correction of permanent errors.

### 5.3.7 Temperature

As shown in chapter three. the fallure rate of semiconductors increases exponentially with temperature. it is therefore essential to keep the system temperature as low as possible and to use forced cooling if necessary.

### 5.4 SOFTWARE

Structured programming is essential in order to açaieve reliable software. The software is divided into easily managable modules which are written and debugged separately and then linked together to form the complete software package. For small systems. naving a program size of a lew kilobytes it is probably better to write programs in assembler. if assembler is used, each machine instruction is defined by the programmer and is under better control than code produced by a high level language. Because of the complex nature of recovery routines and the difficulty in separating nardware and software in a small controller, it is better to
write recovery routines in assembler.
As more complex microprocessors such as 16 bit devices are used and the program size is increased above a few kilobytes. it becomes increasingly difficult to write in assembler. There is a trade off between writing in a high level language which is easier to document and modify at a later date probably by another programmer, and writing in assembler which is more efficient and where the code produced is under tighter control. It is suggested that for large software packages. only the recovery routines are written in assembler and the rest is written in a high level language such as Pascal which offers structured programming. The machine code produced by the high level language should be examined to ensure that it meets the high reliability requirements of the controller.

### 5.4.1 Exception handling

Any abnormal response in a controller is termed an "exception" which may be caused by transient or design errors in either hardware or software. Exception handing is a powerful technique. implemented in software. where the software continually checks itself for errors and initiates recovery if any are detected. A simple form of exception handling would be to test the input values read into a controller and reject them if they are outside a reasonable range. When exceptions are detected. techniques such as "rollback" are useful in executing recovery. Roll-back is a form of time redundancy which repeats the block of code in which an exception has been detected. In this way it is not necessary to restart the controller and only a small block of code need be repeated. The recovery blocks described in chapter eight are a crude type of roll back and are illustrated in figure(13). The program is divided up into a number of routines. Each routine inputs a few values, performs some calculation or control action. and then continues onto the next routine. Ideally routines pass few variables between each other which should be contained in microprocessor
registers, since the recovery blocks of chapter eight only restore register status following a crash, and make no attempt to restore RAM contents. which could be included.

As shown at the end of each routine. a recovery block is generated. The system is shown to crash after routine 4. so a recovery block is executed which causes routine 4 to be repeated before passing onto routine 5.

### 5.5 SELF-TESTING

Selt-testing or health monitoring of controllers can be performed in spare execution time and is carried out as a background task which should not interrupt control of the process. Chapter four has shown that large improvements can be made in the reliability of equipment if the MTTR is reduced. In order to repair equipment. it is necessary to know that it has failed which means that self-testing is essential in redundant systems where the redundancy will often mask component fallures.

Self-testing can be divided into two groups:

Diagnostic - When an exception is detected. the cause should be determined and transient faults distinguished from permanent faults which require to be repaired. Fault logging may point towards a suspect component or area of bad design if repeated transient faults are logged.
(ii) Preventative - At regular intervals the system modules are tested. Such tests are a checksum test of ROM, test of RAM. input/output circuitry etc.. In the particular case of gas regulator valves where the detailed response characteristics of the system are known, it might be possible to predicl mechanical valve fallures. such as a sticking valve. by injecting a disturbance into the system and analysing the response.


Faults in redundant systems are often easier to diagnose, because failure in a non-redundant system will just cause it to cease operation. If a redundant system can continue to operate in the presence of a fault. it is possible to give a detailed diagnosis of the fault which will reduce the repair time of equipment and skill required.

### 5.6 SUMMARY

In a small controller such as the governor controller, it is difficult to separate the hardware from the software. A structured approach to the implementation of fault-tolerance by assigning fault recovery levels
ensures that most classes of fault will be detected and recovery executed.
A valuable tool for identifying the most unreliable parts of a system is FMECA which was used on the governor controller when deciding where it was necessary to incorporate fault-tolerance. The complementary FTA is discussed for the sake of completeness, and was not used when designing the governor controller because FMECA was found to be more suitable.

Hardware techniques such as a watchdog timer. snake. power supply undervoltage detection. output verification. memory protection. and component redundancy are all used on the governor controller.

Exception handling routines are initiated by hardware on the governor controller and software fault-tolerance is used in selecting the best pressure information from the redundant transducers. Much of the governor controller will test itself, and the nature of all transient and permanent faults is logged. At regular intervals a self-test is periormed on the solenoid valves and the self-test software could be expanded to test the complete sysiem.

## CHAPTER 6

## EFFECT OF SYSTEM ARCHITECTURE ON RELIABILITY AND COST

### 6.1 CHOICE OF MICROPROCESSOR

The first decision to be made when choosing a microprocessor is the processing power required. that is whether a single chip. 8 bit. or 16 bit device is required.

### 6.1.1 Single chip microprocessors

If the process consists of a single control loop and little arithmetic is involved. it is likely that a single chip device will have sufficient processing power. Most single chip microprocessors contain internal read only memory (ROM) and random access memory (RAM) which makes the sharing of memory, Hamming code protection. and the provision of a spare copy of ROM impossible uniess the single chip system is expanded. If a FMEA of the controller snows that the failure rate is too high. it is necessary to incorporate some form of redundancy according to the guidelines of section 6.3. Failure of the controller could be detected and a standby switched in. Whilst this is satisfactory for permanent. faults. it offers little protection against transients. since with only two controllers it is impossible to arrive at a majority decision should the controllers differ. if is for this reason that the TMR configuration is recommended. Voting in a single chip TMR system could be either in hardware or by software.

If the voting were performed in software, a very reliable controller consisting of only a few components could be constructed.

### 6.1.2. Eight bit microprocessors

The design of the controller reported in chapter eight is an example of a redundant 8 bit microprocessor controller where most of the redundancy is implemented in hardware. Although the hardware techniques "described
could be used with most 8 bit microprocessors such as the 280. MCM6800 etc.. the 8085 hardware is ideal for a redundant controller. The provision of serial input/output pins removes the need for three separate UARTS and the provision of four maskable and one non-maskable interrupts removes the need for an interrupt controller. Several interrupts are required because nardware exception handling routines are called by interrupts. The 6800 microprocessor is less suitable because it only has one maskable and one non-maskable interrupt, and has no serial input/output pins. The instructions for NOP and SWI are not ideal for implementing a "snake".

Possibly a better microprocessor than the 8085 would be the NSC800 which is identical in hardware to the 8085 except that no serial input/output pins. are provided. The instruction set is compatible with the 8085 and consists of the more powerful 280 instruction set:

If much of the redundancy were to be performed in software at the expense of processor throughput. it is possible that different microprocessors would be more suitable.

### 6.1.3. Sixteen bit microprocessors

Modern 16 bit microprocessors are really outside the definition of "small" controllers and are almost as powerful as mini-computers. It is probably better to apply more of the redundancy in software because of the processing power and increased hardware and software complexity

The Texas 9900 microprocessor differs from other 16 bit devices because it does not perform calculations on internal registers. but on a block of registers held in RAM which are pointed to by the internal "workspace pointer". Thus if the workspace pointer is corrupted. all registers will be addressed incorrectly and corruption or permanent failure of RAM will cause register errors.

Voting at bus level with 16 bit microprocessors is not to be recommended because of the complexity of the hardware and the number of
signals. Instead it would be better to arrange the three microprocessors as a master and two slaves sharing common memory. Communication between the processors would then be by the common RAM and the voting would be mainly performed in software.

It is likely that 16 bit microprocessors will have a lot of RAM which will probably be shared as well. The RAM should be protected and it is suggested that 16 bit SEC/DED Hamming code integrated circuits could be used to protect the RAM, references $[59.79 .80]$ instead of the circuitry proposed in chapter eight.

Large software packages will be required for a 16 bit controller which are better written in a high level language. Assembler should only be used as necessary and for critical areas such as voting and rec̣overy routines as discussed in chapter five.

In high reliability applications, 16 bit microprocessors shouid only be used if an 8 bit device is not powerful enough, or if the addition of a maths processor to an 8 bit system is less reliable than a 16 bit system. The increased complexity of 16 bit systems over 8 bit systems leads to nigher failure rates.

### 6.2 MAJORITY VOTING

Majority voting in a TMR system can either be performed in hardware or software. the choice of which is influenced by many factors.

### 6.2.1 Hardware

For a small controller using single chip microprocessors and which only has a few outputs. it is easier to vote on the outputs in hardware. Discrete logic can be used. connected as shown in figure(5) or several channels of voling and error detection could be contained in a FPLA as described in chapter eight. If the failure rate of the system is approximated to that of the voter. then a low fallure rate is predicted.
typically $0.02 \mathrm{f} / \mathrm{yr}$ as compared with $0.5 \mathrm{f} / \mathrm{yr}$ for a single microprocessor. according to MIL-217D.

Voting on inputs is probably better performed in software, although if the total software package is small. the fault coverage will be reduced if the extra fault-tolerant software becomes too large and the system reliability will be reduced.

### 6.2.2 Soltware

ihree systems can be connected together in a ring. whether they are single chip or 16 bit microprocessor based. as shown in figure(6). Each microprocessor can communicate with its two neighbours. If a processor gets no response from one neighbour, but can communicate with its oiner neighbour, then the first neighbour must be in error. At any time one microprocessor must be in charge and act as the master and the other two microprocessors act as slaves. The master is however checked by the slaves so that if the slaves are not interrogated by the master within a set time. it is concluded that the master has failed and one of the slaves takes over as master. Communication between processors could be either by shared RAM which is quicker and is recommended for 16 bit devices. or could be by a parallel or serial link. Since single chip devices contain their own RAM. it would be better to communicate via a parallel link.

Software communication and voting routines should not use up too much of the processing power of the devices or fault coverage and system throughput will be decreased. Software voting is prone to transient errors. but once debugged cannot fail permanently as is the case for hardware voters.

### 6.3 THE USE OF REDUNDANCY AND INCREASED COST

If a FMEA of a system shows that the failure rate of some components is unacceptably high. some form of redundancy must be included to mask the high failure rate of the individual components. To a first approximation.
for a set of components which is completely protected by fault-tolerant circuitry, the failure rate of the protection circuitry must not exceed that of the non-redundant components that are to be protected. For example the failure rate of the voters in a TMR configuration should not exceed that of a single channel of the triplex arrangement.

A more accurate method of predicting the benefit likely to be gained from redundancy is the analysis of chapter four which includes the effect of maintenance. The software coverage is calculated according to the relative execution times of the different software routines, whilst the hardware coverage is calculated according to the ratio of protected to unprotected circuitry. The proposed repair or maintenance rate must be defined. then the MTTFIF can be calculated for the redundant configuration. if the MTTFIF is greater than unity. then the use of redundancy is justified. In cost sensitive applications. redundancy should be applied to those parts of the system for which the greatest MTTFIFs are obtained.

The increased cost of redundancy should be considered in relation to the total system cost. For a small controller it is likely that the majority of the cost is due to the cabinet, power supply, printed circuit board and assembly costs. If the equipment is to be used in a hazardous environment, the cost of intrinsic safety measures such as barriers is substantial. It is likely that redundancy will double the component cost of a system and will require a larger printed circuit board and enclosure. but when the total system cost is examined the increase will be less than double. Typical microprocessors cost under $£ 5$ nowadays which makes the component cost's small in comparison with other costs.

A fault-tolerant system will greatly increase the design costs both of hardware and software. If complex voting and communication routines are performed in software. this will greatly increase the software design costs.

The design of a general purpose fault-tolerant controller which contains the necessary fault recovery software would allow the increased design costs to de spread over many units. Although a redundant controller may initially cost more. it should last longer and require less maintenance. The cost of integrated circuits is approximately constant because of ever increasing advances in technology. but the cost of maintenance which is labour intensive is ever increasing.

## CHAPTER 7

## BACKGROUND TO DESIGN OF GOVERNOR CONTROLLER

### 7.1 SPECIFICATION

The governor is of a nybrid nature. consisting of an electronic microprocessor based controller. which in turn controls a mechanical valve which controls the flow of gas through the governor. The two main constraints on the design are reliability and cost (a few hundred pounds). The governor should cost little more than its pneumatic equivalent and should be as reliable. Some increase in cost and apparent degradation of governor reliability can be compensated for by added features and more efficient gas network control. The electronics should consume as little power as possible because of the problem of battery back-up of the mains supply. Whilst the electronics to be described do not prohibit the use of battery back-up. a CMOS design would be better from this point of view and if should be possible to convert the design of chapter eight to CMOS at the expense of computational speed of the controller and increased failure rate. Whilst design considerations may require the use of CMOS. it should be appreciated that the failure rate of CMOS devices is likely to be higher, especially under harsh environmental conditions.

The controller is a single or double loop controller, having a typical instruction cycle of $2 \mu \mathrm{~S}$. Depending on the software. the controller will implement the following control algorithms :-
(i) Three term controller of outlet pressure.
(ii) Clock control - The outlet pressure is varied over a preprogrammed pressure versus time profile so that the governor is able to satisfy the changes in demand throughout the day.
(iii) Flow control - The flow through the governor is maintained constant except for high and low pressure overrides
(iv)

Demand Activated Governing (DAG) - The outlet pressure of the governor is varied according to the flow through the governor. Under. conditions of low flow the outlet pressure is neld at a minimum, whilst under conditions of high flow. the outiet pressure is increased to compensate for resistive pressure drops in the downstream pipework.

The control algorithm implemented on the Durham governor is (iv). DAG since this is the most efficient and sultable for a distribution governor.

### 7.2 MECHANICAL VALVE

There are considerable problems associated with interfacing an electronic controller to a mechanical valve. Ideally the valve would have no moving parts and would control the flow of gas smoothly and linearly from OFF to ON by simply applying an analogue voltage ranging from $O V$ to a few volts. This is obviously not possible. so some form of electromechanical control is required.

There are two types of mechanical valve :-
(i)

Direct acting - The position of the valve is controlled solely by an electrical signal. Failure of the electrical signal causes the valve to fail fully OPEN or fully SHUT.
(ii) Pneumatic back-up - A conventional pneumatic valve is modified so that its set-point is adjusted by an electrical signal. Failure signal of the electrical ${ }_{\wedge}$ causes the valve to fail to its maximum or minimum pressure and not the catastrophic OPEN or SHUT situation. The maximum and minimum are chosen so that the governor will still operate within safe limits. but under normal operating conditions the pressure can be varied between the two set-points. giving efficient control of the network.

British Gas have developed two valves of this type which the electronic controller is capable of driving. The first type uses a stepper motor to vary the compression of the main valve spring and hence the set-point. This is shown in figure(14). The stepper motor drives a gearbox which engages with a screwed nyton plug. Hence the plug may be moved up and down. changing the compression on the main spring. This design has several disadvantages :-
(i) Compiex drive circuitry and high power consumption of the stepper motor.
(ii)
(iii)

The stepper motor must be contained within a large and expensive flame proof enclosure because of the requirement for intrinsic safety.

If the stepper motor or drive signals to it. fail. then the valve will stick at some intermediate position.

The second type of mechanical valve used by British Gas is the one chosen for the experimental governor at Durham. This consists of a modified pneumatic valve controlled by two solenoid valves. By switching the solenoids in the required sequence. it is possible to raise the outlet pressure. noid it constant, or lower it. The valve is shown in figure(15). The valve is a modified Donkin 226 ' $K$ ' pilot which has an extended lower body part containing a dual Bellofram sealed piston with a spring positioned between the topside of the piston and the underside of the pilot valve. The chamber above the upper Bellofram is connected to the pilot outlet pressure. Gas at higher pressure from the inlet to the valve is admitted or exhausted from the lower Bellofram chamber through two solenoid valves which are controlled by the microprocessor controller. The level of pressure in the lower chamber determines the position of the lower spring against the pitol valve. and therefore the amount that it deloads the pilot main spring. The volume between the two Belloframs is vented to atmosphere and serves as a breathing chamber.

A needle valve is positioned near the inlet/outlet port of the lower chamber to provide a means of adjusting the rate of change of loading pressure. The rate of change determines both the accuracy of the set-point and the stability of the pilot valve. The governor is arranged to. fail sate so that if the solenoids are de-energised due to a power failure. then the outlet pressure fails to the maximum pressure controlled by the top spring compression. This is arranged by having the inlet pressure solenoid normally closed and the outlet pressure solenoid normally open. The compression of the lower spring controls the minimum outlet pressure of the pilot valve. Thus the valve will fail safe to a minimum or maximum (normally maximum) pressure. should the solenoids fail and not fully OPEN or SHUT. The only dangerous failure mode is if both solenoids fail OPEN. This provides a bypass round the pilot valve, but flow should be held to a sate limit because the saturated flow through two solenoids is small compared with the flow through the pilot valve.

This design overcomes the problem of the stepper motor system and has several advantages :-
(i) The solenoids operate at low power, typically 0.25 W each which would be ideal for a CMOS controller.
(ii) The solenoids are small, cheap and intrinsically safe.
(iii) Fault-tolerant drive for the solenoids is easier to implement.
(iv) If it is necessary to protect against solenoid tailure. it is both cheap and easy to replace each solenoid with four in a series/parallel configuration.
The one disadvantage is that the solenoids have a typical life of about five million operations and therefore under typical operating conditions would need replacing every eighteen months.

The experimental rig at Durham consists of the electronic controller and the mechanical valve just described. Whilst this is sufficient to verity techniques of flow/pressure control. the small pitot valve cannot
pass the volume of gas used by several nundred consumers as a typical governor station is required to do. To pass a greater volume of gas. the pilot valve could be used to control a large main valve. This technique is the pneumatic equivalent of a transistor Darlington pair. where a large current is controlled in the second transistor by a smaller current injected into the first transistor.

An estimate is needed for the failure rate of the mechanical valve because of the requirement that the mechanical and electrical parts of the governor have failure rates which are of the same order of magnitude. Such data is very difficult to obtain. but British Gas give two failure rates for small valves. Blake and Drew [26] give a failure rate of $0.185 \mathrm{f} / \mathrm{yr}$. either OPEN or CLOSED for an "open at rest" regulator, whilst Drew [25] gives a failure rate of $0.19 \mathrm{f} / \mathrm{yr}$ for an IGA2000 regulator. although the report suggests that this figure is a bit pessimistic. So a failure rate of $0.2 \mathrm{f} / \mathrm{yr}$ seems a good estimate for a small regulator valve. This failure rate excludes the fallure rate of the main governor valve. for which it is impossible to obtain data, but it would be reasonable to assume an equal failure rate to the pilot valve.

The failure rate of the solenoid valves is neglected since regular replacement should reduce their failure rate to a negligible value. If this proves to be a problem "in the field", then a redundant series/parallel configuration of solenoids should cure the problem.

### 7.3 NON FAULT-TOLERANT CONTROLLER

The electronic controller specification has been defined in the previous two sections. It must have a fallure rate of approximately 0.4 $t / \mathrm{yr}$. . interiace to the solenoid controlled valve. and implement the DAG control algorithm. A first iteration design is now proposed which is then analysed by an FMEA (Fault Mode Effect Analysis) which nighlights the areas
of greatest failure rate.
The block diagram of the proposed contro
Using the data of MIL-2170 [37] and actual meas
the following failure rates are obtained for the controt
Pressure transducer P1 (manufacturers failure rate)
Pressure transducer P2 (manufacturers failure rate)
A/D converter
Control circuitry
Microprocessor + address latch + buffers
$4 k$ static RAM
$4 k$ EPROM
Input/output circuitry
Watchdog timer + reset circuitry
TOTAL
10.4 f/million hrs.
10.4 t/million hrs.
$1.93 \mathrm{f} /$ million hrs.
$2.05 \mathrm{f} / \mathrm{million} \mathrm{hrs}$.

460 f/million hrs
12.3 f/million hrs.
$3.0 \mathrm{t} / \mathrm{million} \mathrm{hrs}$.
$0.94 \mathrm{f} / \mathrm{million} \mathrm{hrs}$.
$0.63 \mathrm{f} / \mathrm{million} \mathrm{hrs}$.
$502 \mathrm{f} /$ million hrs .
$=4 \mathrm{f} / \mathrm{yr}$

This total failure rate is unacceptable and is too high by a factor of ten. The microprocessor is clearly the most unreliable part of the system. followed by the pressure transducers and the RAM. It is therefore necessary to redesign the controller using fault-tolerant techniques to mask the high failure rate of the components just mentioned.

### 7.4 METHODS OF INTRODUCING FAULT-TOLERANCE

A further design requirement was that the controller could be programmed by British Gas personnel not having detailed knowledge of faulttolerant programming or hardware techniques. and that standard software should run on the system. For these reasons many of the fault-tolerant features of the controller are transparent to the user and the specialised fault recovery software is pre-programmed and called as subroutines.

Alternative fault-tolerant architectures are discussed in chapters five and six. It was decided to use three microprocessors in a two out of three majority voting configuration. sharing common memory, to overcome the nigh failure rate of a single microprocessor. The voting was to be performed in hardware which has the advantage that, at the output of the voters. the TMR contiguration just "looks" like a single microprocessor as long as the three microprocessors run in exact synchronism. This allows standard software to be run on the system. The voters also act as. and replace. the buffers between the microprocessor and the system bus. Three alternative methods were considered for performing the majority vote :-
(iii)

TTL logic - The components needed for a single channel are shown in figure(5). To decrease the number of gates required. wire ORed outputs are used. The disadvantage with wire ORing is that the rise time of the output is slower than a device with "active" pull-up. Since it is necessary to vote on thirty channels. this would require one hundred and fifty gates contained in thirty eight TTL packages. The large number of integrated circuits required would be unreliable and bulky.

EPROM - A four channel voter could be contained in a 4 k EPROM. The disadvantage is that the EPROM could not drive the system bus without further buffering and that the 450 ns access time of the EPROM would make voting slow and require the microprocessor clock frequency to be reduced, thus reducing processor throughput.

FPLA (Field Programmable Logic Array) - The equivalent voting circuit to figure(5) is programmed into a logic array. A five channel voler can be contained in a FPLA, offering fast voting as well as three error llags: The system bus can be driven directly without further buffering.

The FPLA approach was chosen and is described in more detail tater.
The three processors share common memory. To improve both the
permanent and transient tolerance of faults. it was decided to protect the memory using a SEC/DED. (Single bit error. correction/double bit error detection) Hamming code. reference [77]. Large scale integrated circuits are commercially available to implement this type of protection on 16 bit wide data, references $[59,79.80]$. nowever two per memory are required. The devices are expensive. complex. and of unknown though probably high failure rate. An 8 bit version of these integrated circuits is not available. so it was decided. in the interests of complexity. reliability, and cost. to use a simple encoding/decoding PROM to implement SEC/DED protection. This is described in more detail later.

In order to make the pressure transducers highly reliable. it was decided to use three cheap piezo devices in a majority voting configuration, instead of using one expensive transducer.

As a final protection against transient or software errors. it was decided to incorporate a system "watchdog". which should restore the controller to correct operation following a system crash.

The control "firmware" will be stored in two separate EPROM sets. each capable of executing the control algorithm. This provides a hardware spare as well as allowing a soffware "spare" to be incorporated at a later date. If the spare EPROM set contains the same atgorithm, but coded differently. then if a software error is detected. switching over to the alternative software coding will restore correct operation as lang as the same software error does not exist in both copies of "firmware".

The output drivers to the solenoid valves were designed to tolerate any single component failure because of the unproven reliability of the driver integrated circuit and the large currents being repeatedly switched.

## CHAPTER 8

## DESCRIPTION OF EXPERIMENTAL CONTROLLER

### 8.1 CONSTFUCTION

The controller consists of : three double Eurocard "microbus" prototyping cards housed in a 6 U high $19^{*}$ Vero rack. The rack also houses the power supplies. twenty column printer. and pulse generator used for testing purposes. A photograph of the controller is given in figure(19).

The boards are wire-wrapped as is the system bus to the edge connectors. Each integrated circuit is decoupled by a. 0.luF capacitor. The top side of the prototyping boards consists of a ground plane which is connected to OV at several points. This reduces noise on the power supply and gives a low impedance connection between the power supply and the integrated circuits.

The power supply. whose circuit diagram is given in appendix(1) has the following outputs :-
+5 V @ 5A overvoltage and short circuit protected
+12 V 1A short circuit protected
-12V © 1A short circuit protected
+28 V @ 1A short circuit protected
The mains supply to the power supply is fittered to reduce the effect of mainsborne transients.

The signals carried by the backplane bus are detined in table(19).

The layout of the circuit boards is given in appendix(2).

### 8.2 TMR MICROPROCESSOR BOARD

The block diagram of the microprocessor board is given in ligure(21). Each block will be described in detail with reterence 10 its circult diayram. A two out of Ihree majority vote is performed on inree
microprocessors running in synchronism, the voting being at bus level. The voters drive the system bus directly. A common self-synchronising clock is connected to the processors as well as synchronisation hardware and reset circuitry/watchdog timer. An RS232 interface is provided to communicate with a VDU or printer.

### 8.2.1 Microprocessor block

Three identical channels consisting of a microprocessor and associated buffers are connected as shown in figures(22.23.24). Considering channel 1. as shown in figure(22), an 8085 microprocessor. U9. has its eight least significant address lines A0-A7 demultiplexed by U8 which are then connected to the voters. Address lines A8-A15, are connected directly to the voters and several control lines as well as the SOD line. Bidirectional data buffers U12. U13. separate data into DATA OUT and DATA $I N$. The DATA $\operatorname{IN}$ connections to the three channels are connected in parallel and to the system bus. DATA OUT is connected to the voters. Pull-ups are connected to ADO-AD7 and the RD and WR lines. this is because these lines are tristated during certain instructions and the pull-ups will ensure that all three channels are pulled up to +5 V to prevent voting errors due to the indeterminate nature of a tristated line. All inputs to the microprocessors with the exception of CLK. TEST, and HOLD are connected in parallel, whilst most outputs, as described. are voted upon.

On the circuit diagrams a number in brackets is given after the signal name, this identifies the channel from one to three and is used on both inputs and outputs. Where no bracketed number is given. the signal referred to is an output from the system voters or is a signal on the system bus.
8.2.2 Voters

The voting on ninety signals (thirty from each microprocessor) is performed in six identical FPLAs as shown in figure(25). Each FPLA votes
upon fifteen signals and reduces them to five signals. The remaining three outputs from the FPLAs are used as error flags. all of which are active low 50 that error flags such as $\overline{1+2}$ and $\overline{2+3}$ can be wire ORed. The error flag $\overline{1+2}$ indicates an error is present on channel 1 or 2 . whilst $\overline{2+3}$ indicates an error on channel 2 or 3 . Thus by examining both error flags it is possible to determine which channel does not agree with the other two. The error flags $\mathrm{Vt-V6}$ indicate in which voter there is a voting error. Thus it is possible to determine which channel is in error as well as the nature of the error. address bus. data bus. or control bus.

Due to slight timing differences between the three processors as well as access time delay in the voters, the error flags do not give a constant logic ' 1 ' for no error and logic ' 0 ' for an error. instead a ragged signal is produced which is either predominantly logic ' 7 ' or ' 0 '. For this reason the error flags are filtered by a low pass filter consisting of a 1 k resistor and a 330 pF capacitor. The oscilloscope trace of figure(26) shows the unfiltered error flag at the top and the filtered flag below. It can be seen from the unfiltered signal that the rise time of the voters is considerably longer than the fall time (approx. 50nS). This is because open collector outputs are used which have active pull down. but passive pull-up via the 1 k resistors connected to all of the outputs.

The calculation of the low pass filter values and the programming information for the FPLA is given in appendix(3).

### 8.2.3 Self synchronising clock

The clock input to the 8085 is divided by two internally betore being used as the internal microprocessor clock. This internal clock signal is available at the CLK OUT pin. It was found that on power-up, the divide by two stage could power up in either state. Normally this would not matter. but for three processors which are desired to run in synchronism. this is serious: It was found that one processor could power-up half an internal
clock cycle out of phase with the other two. This situation could only de remedied by switching off the power and trying again. The self synchronising clock circuit of figure(28) was developed to overcome this problem. The three CLK OUT signals are fed to a three to eight line decoder. U22. which controls the three AND gates. U23. The AND gates remove the clock signal to the errant processor, should it be out of synchronism with the other two. A conventional 4 Mhz quartz controlled clock is generated by $U 20$ and associated components. The three AND gates. U23. provide some degree of isolation between the clock inputs. should any of them stick at logic ' 1 ' or ' 0 '.

### 8.2.4 Synchronisation hardware

Additional circuitry is required to synchronise the three processors and latch the voter error flags in the event of a voting error. This circuitry is shown in figure(29). The signals marked PORTxx(RD) or PORTxx(WR) refer to outputs from the port address decoders. where PORTxx(RD) is generated by a INxx instruction and PORTxx(WR) by an OUTxx instruction. The top half of the circuit latches the error flags and the bottom half is used to synchronise the three processors.

When a voting error occurs either or both error flags $\overline{1+2}$ and $\overline{2+3}$ go low. causing the output of U23d pin 11 to go low. Assuming that a PORTOI(WR) has taken place prior to a voting error, the flip/flop U28a. U28b will have been reset. The flip/flop is set by a low on pinl u28a and the falling edge on pin 6 of the flip/flop triggers the monostable U18b. After a short delay, governed by the $33 p F / 22 k$ RC network (440ns), the monostable generates a rising edge which latches the error flags in U31. an octal latch. The delay is necessary so that the- filtered error flags have lime to settle before being latched. The calculation of the necessary delay and hence RC network is given in appendix(4). The latched error flags are read by a read to PORT01. The flip/flop U28 prevents further
transitions on U23a pinll (further voting errors) from triggering 418 b and latching new error flags before the old ones have been read. A write to PORT01 must be executed to reset the flip/flop which re-arms the error flag latching mechanism. A two bit tristate buffer. U30 allows the instantaneous error flags $\overline{1+2}$ and $\overline{2+3}$ to be read as PORT02.

Synchronisation of the processors is achieved by a combination of nardware and software. It is essential to synchronise the three program counters. This is hardware controlled and can only be done by executing a simultaneous RESET or interrupt. A RESET performs a cold start and is therefore undesirable except for initial power-up of the system. when a RESET causes all three processors to be synchronised, so interrupt driven resynchronisation was chosen. The hardware interrupt process synchronises the program counter and the interrupt software completes the rest of the synchronisation. It is not sufficient to simply supply a common interrupt to all three processors because an interrupt is only recognised at the end of the current instruction. Two processors will be executing one instruction. whilst the third. errant processor. may be executing an instruction of different length. The processors will therefore recognise an interrupt at different times and synchronisation cannot be achieved. If several retries are made, then it is possible to execute a simultaneous interrupt, but one further problem exists. It was found that one processor could lock one clock cycle behind the other two. Considering the op-code fetch timing of figure(17), it is seen that data is valid and read at T3. but continues to be valid at T4. Hence one processor can lock one clock period behind and read an instruction at T4 which then becomes T3 for that processor. Hence the errant processor will execute the same op-codes as the other two. but one cycle behind and will read/write incorrect data in three clock cycle read/write operations. It is impossible to cure this situation by execuling further interrupts and the errant processor will remain locked one clock cycle behind.


The RST/HOLD circuitry is shown at the bottom of figure(29). The clock is divided by ten in U27a to give a frequency of 200 kHz which is also used by the $A / D$ converter. The divided clock is fed to the two bit counter U27D which cycles through the following states. changing every ten clock cycles.

## 00 No action

01 RST6.5 interrupt
10 HOLD operation
11 Reset flip/flop U21
The iwo bit counter U27b is normally held cleared (output=00). by a high on pin 14. When a voting error causes U23d pin 11 to go low. this sets flip/flop U21 which is connected to the clear input of the two bit counter and the counter advances through the states given above. When the
count reaches 11 this is detected by U21c which resets the flip/flop U21. The flip/flop formed by U28c. U28d is used to enable/disable the RST/HOLD circuitry. A low on U28C pin 11 holds the output of flip/flop U21 permanently nigh. disabling the RST/HOLD circuitry. The flip/flop is disabled by a write to PORT00 and is enabled by a write to PORT02.

The PORT07(WR) select is connected to flip/flop U21 so that it is set by a write to PORT07. Thus it is possible to trigger the RST/HOLD circuitry by a port write instruction as well as by detection of a voling error. This is useful for testing purposes and software initiated resynchronisation.

The port addresses for the resynchronisation hardware are summarised

| below :- |  |  |
| :--- | :--- | :--- |
| PORT | REAO | WRITE |
| 0 | - | Disable RST/HOLD circuitry |
| 1 | Read latched error flags | Reset flags latch flip/fiop |
| 2 | Read instantaneous error flags | Enable RST/HOLD circuitry |
| 7 | - | Trigger RST/HOLD circuitry |

### 8.2.5 Reset circuitry and watchdog timers

As well as power-on reset (POR) it is necessary to reset the controller if the +5 V logic supply dips below acceptable limits caused by a short interruption to the mains supply. It was found that most RC controlled POR networks will not detect a short interruption in the +5 V supply, but this interruption was sufficient to cause the microprocessor to cease reliable operation and crash. Such a short interruption to the mains supply has been experienced during electrical storms when there is a long delay before auto re-closure of circuit breakers following a lightning strike to the supply cables.

Two watchdog timers are included to reset the system in the event of it crashing. Botn watchdogs are monostables which are triggered by a write
to PORT06. The instruction to retrigger the watchdogs (OUT06) occurs once only in the control software and is located in the main program loop. The monostables are retriggered at a rate greater than their pulse width so that a "consiant" pulse is generated. However if the program crashes this will cause the retriggering operation to cease and the monostable will time out after a delay determined by its pulse width. A TRAP (non maskable) interrupt is generated when the first watchdog times out which generates a warm start. If the warm start fails. then the second watchdog will "timeout" after a further delay and a cold system ${ }^{\text {RESET }}$ will result.

The reset circuitry and watchdogs are shown in figure(30). The level triggered reset circuitry uses a "long tailed pair" contained in U29 as a comparator which compares the reference voltage across two silicon diodes to the +5 V logic supply. The +5 V supply is divided by a potential divider before comparison. By suitable choice of the potential divider, the circuit is arranged to give a logic ' 0 ' if the power supply falls below 4.5 V and will give a logic ' 1 ' otherwise. Power-on-reset is detected and manual reset is provided by a push button switch.

Monostable Ul8a forms the TRAP watchoog which is triggered by a write to PORTO6. The pulse width is set by the RC network and in this case is set to approximately two seconds. When the watchdog times-out. the $Q$ output goes from a logic ${ }^{\prime} 0^{\prime}$ to ${ }^{\prime} 1$ ', generating a TRAP interrupt.

Monostable Ul9a, which has a pulse width of four seconds, forms the RESET watchdog and is retriggered by a write to PORT06. When the watchdog times-out it triggers the monostable U19b. generating a RESET pulse of approximately one millisecond duration. The rising edge of the RESET puise is used to retrigger monostable U19a in case the reset operation has not been successful. In this way the watchoog is retriggered by the RESET pulse as well as a write to PORT06. so that if the reset is not successful. then the watchdog will continue to generate RESET pulses at an interval governed by Ul9a. A divide by two flip/flop is connected to the output of
the RESET watchdog which will "toggle" everytime the watchdog times-out. The output from this lip/flop is used to select one of two EPROM sets. both of which are capable of executing the control program. Thus in the event of a system crash a warm start is tried first. then the spare EPROM set is switched in as well as performing a cold reset.

### 8.2.6 RS232 interface

The serial input and output lines are connected to RS232 drivers as shown in figure(28) so that an RS232 device can communicate with the controller for debugging. error logging. or console input/output purposes.

### 8.3 MEMORY BOARD

The memory board block diagram is given in figure(31) and contains 4 k of static. RAM which is Hamming code protected, two sets of 4 k EPROM. 4 k "instant ROM" and address decoders for memory and ports. At bus level the board "looks" like a conventional memory board and many of the faulttolerant features are transparent to the user. although error flags and additional self-test circuitry are included to make full use of the faulttolerant features of the board.

The RAM is protected by a SEC/DED Hamming code. The necessary check bits are generated by a look-up table in the encoding ROM and stored alongside data in the RAM. When data is read from the RAM. the data and check bits are read out and decoded by a ROM. which corrects the data if necessary as well as flagging errors.

Eight-bit wide data requires five check bits. This would require an encoding ROM having eight address lines and thirteen data outputs. The decoding ROM would need thirteen address lines and thirteen data outputs if error flags are included. Both these ROM sizes are very large and not commercially available. Two $2 k$ EPROMs could be used tor encoding and two 8k EPROMS for decoding. which are commercially available. This approach
would be no more reliable than the design adopted. but would reduce the storage required from sixteen bits wide to thirteen bits wide. The disadvantage of using EPROMs instead of ROMs is their slow access time (450ns) as opposed to 50 ns for bipolar ROMs. This would add to the memory read/write access time and would require the microprocessor to be slowed down, thus reducing processor throughput. The only advantage of EPROMs is their lower power consumption than bipolar ROMs. which may be important in a CMOS design, and their slightly higher reliability.

In order to use small commercially available bipolar ROMs. it was decided to split the eight bit wide data into two identical four bit wide streams which has the added advantage that each stream will correct a single bit error and detect a double bit error. Thus if simultaneous errors occur in both streams. the protection circuitry will correct double bit errors and detect tour bits in error. Each stream has its own Hamming code protection in the form of four check bits. Data is stored as four data bits and four check bits and is read out as eight bits. which are decoded to restore the four data bits as well as generating four error flags. The encoding ROM can be contained in a $32 \times 8$ device and the decoding ROM in a $256 \times 8$ device: both of which are small and easily available.

The ROM look-up tables are calculated as follows. The SEC/DED code consists of four data bits and four check bits organised as :-

$$
\begin{array}{lll}
\text { D3 D2 D1 D0 C3 C2 C1 C0 } & \text { iwhere } \quad D x=\text { data bit } \\
& C x=\text { check bit }
\end{array}
$$

Data is stored and read out in this form. On read out the check bits are re-calculated from the data bits and these check bits are exclusive ORed with the chock bit: road oul from tho RAM. Tho rosult l: four cormocion bit: (.:3 (:3' (:1) 60
$\mathrm{C} 3^{\prime}=\mathrm{C} 3$ (recalculated) $\oplus \mathrm{C} 3$ (stored) etc.
These correction bits identify the position of an error if it is a single bit error or indicate a double bit error according to table(20).

Looking vertically up the columns and writing down a data bit where a 'l' ocurrs in table(20), it can be seen that the check bits are calculated by:-

```
C0 == D0 }\oplus\textrm{D}1\oplus\textrm{D}
Cl=D0 \oplus D1 }\oplus\textrm{D}
C2 = D0 @ D2 }\oplus\textrm{D}
C3 = D1 }\oplus\textrm{D}2\oplus\textrm{D}
```

Appendix(5) shows the look-up table contained in the encoding ROM which is calculated according to the equations above.

### 8.3.1 Decoding

The decoding ROM is also a look-up table which corrects the data if necessary and generates four error flags S2 S1 S0 E. where S2 S1 S0 identify which RAM (out of eight) is in error and $E$ signals the occurrence of a single bit error. If $E$ is not set. but $S O$ is set. then this indicates a double bit error. This is summarised in table(21). The error flags S 2 S1 S0 E are referred to as the four bit error syndrome.

A program was written in assembler to calculate the information to be programmed into the decoding ROM. The decoding ROM look-up table; is given in appendix(6).

### 8.3.2 Circuit description

The circuit diagram of the board is given in figure(32). The encoding ROMs U33. U34 convert the eight bit data to eight data bits and eight check bits which are then stored in the $4 k \times$ l6bit memory matrix which is shown in more detail in figure(33). It is important to use one bit wide RAMs so that the failure of a single device causes a single bit failure which can be corrected. The decoding ROMs U35. U36 regenerate the eight bit data. corrected if necessary. flagging any errors that occur. It is possible to connect a much larger memory matrix between the encoders/decoders which may be dynamic RAM if required. In fact commercially available sixteen bit memory boards would be ideal. The data from the decoding ROMs is connected
to the system bus via the tristate buffer 438 which is enabled by a read request to the RAM. The enable is generated by the OR gate $U 39$ which is connected to the $\operatorname{FAM} \overline{\mathrm{CS}}$ and the $\overline{\mathrm{RD}}$ line: thus producing a $\overline{\mathrm{RAMRD}}$ signal. The error flags are latched into 437 in the event of an error and can be read by a read to PORT00. Three OR gates U39. are connected to form a four input OR gate. The four inputs are connected to the decoding ROM error flags $E$ and $S 0$. Any one of these flags going high indicates an error hence the output of U39 pin 6 is ' 0 ' for no error and ' 1 ' for an error. This single error flag is sampled and latched in U40a on the rising edge of every RAMRD. If a ' 0 ' is latched in, it will have no effect, but if a '1' will cause a ' 1 ' to be latched into $U 40 \mathrm{D}$. causing the $Q$ output to go high. This generates a RST5.5 interrupt as well as latching the error flags. It is left up to the software interrupt routine to interrogate the error flags and determine whether the error is a single or double bit error and which RAM device is in error. A read to PORTO0 reads the error flags. as well as resetting the flip/flop U400. Until the flip/llop is reset. no further error flags can be latched into U37.

### 8.3.3 Address decoding and "spare" EPROMs

Address decoder 441 decodes eight $8 k$ pages which are then further divided into $2 k$ pages by U42. The "instant ROM" and RAM are decoded by U42a, whilst U420 decodes the EPROM memory. The most significant address line connected to U42b. A12. is connected via the exclusive OR gate U43. This is so that when the ROM SELECT line is at ' 0 ' the Al2 line is not inverted and when at ' 1 ' the address line is inverted. With Al2 not inverted the address decoding is as shown with the first EPROM set located from $0000-0 F F F$ and the second from $1000-1$ FFF. If the A12 line is inverted. then EPROM set two is selected instead of set one and vice-versa. The "spare" EPHOM set located from 1000-IFFF can be read as normal memory. nowever it is identical to set one and programmed to run from 0000-0FFF
so will not run in the range 1000-1FFF. If the two sets are swapped by the ROM SELECT line. then the "spare" will appear to occupy memory in the range $0000-0 F F F$ and will run and the other set will appear in the range $1000-$ 1FFF. Thus it is possible for self-testing purposes to read the "spare" EPROM set. and in the event of a failure. a separate set can be switched in whilst allowing interrogation of the old set to determine the cause of failure

The "instant ROM" and EPROM circuitry is shown in figure(34). "Instant ROM" is essentially battery-backed RAM which can be written to and retains data when power is removed. Its pin connections make it pin compatible with EPROM. Two "instant ROMs" are installed U45. U46 located from 2000-2FFF. They are write-protected by the switch shown and are butfered by bidirectional buffers U51. U52. Four EPROMs are installed U47. U48. U49. U50. located from $0000-1 F F F$ and are buffered by U53 which is enabled by the BUFFEN signal from the EPROM decoder U42b.

The input port address decoder is shown at the bottom of figure(34).

### 8.3.4 Memory matrix

The $4 k \times$ l6bit memory matrix is shown in more detail in figure(33). Memories U55-U70 are $4 k \times$ lbit static RAMs and the $\overline{C S}$ line to each device is connected via the OR gates U71-U74. This is for testing purposes only and allows RAM devices to be switched out by preventing their $\overline{C S}$ from going low. Dual-in-line switches S1. S2: are used to switch out RAMs and pulses injected into the control inputs of the OR gates enable the RAMs to be transiently switched out.

### 8.4 INPUT/OUTPUT BOARD

The block diagram of the input/output board is given in figure(35) The board contains a battery-backed real time clock (RTC). eight bit input port. eight bit output port. eight channel A/D converter, eight bit
solenoid driver. as well as the necessary power electronics to drive a stepper motor and solenoid valves. A separate printed circuit board (PCB) holds the six pressure transducers as well as their signal conditioning circuitry.

### 8.4.1 Real time clock

The real lime clock $U 77$ is shown in figure(36). This is buffered by the bidirectional buffer U78. On failure of the +5 V supply. the clock switches over to standby operation and continues to run. A ni-cad battery. which is trickle charged under normal operation. provides the necessary back-up supply. Open collector AND gates $U 75$ are used to buffer the $\overline{R D}$ and $\overline{W F}$ signals so that the RTC read and write inputs are isolated from the power supply when it is switched off and are pulled up to Vad. This is necessary for correct standby operation. The write strobe to the RTC is connected via LKI which write-protects the RTC when it is removed. This is so that a system crash will not corrupt the RTC. The RTC is decoded as sixteen ports from PORT 10 to PORTIF and is programmed and read as described in appendix(7) which is an extract from the Radiospares data sheet.

### 8.4.2 Multiplexer and A/D converter

An eight bit analogue multiplexer. $U 79$ is controlled by three address lines MUXO. MUX1. MUX2 and is connected to the A/D converter U80. Inputs to the multiplexer are low pass filtered with a cut-off frequency of about 170 Hz . inputs $P 0$ to P 6 are from the pressure transducers and S 1 . S2 monitor the voltage applied to the solenoid valves. The A/D converter is clocked by a 200 kHz signal already available which is obtained by dividing the processor clock by ten. The two flip/flops. 481 synchronise the start conversion signal and the clock. A write to PORTO3 starts the conversion and a read to PORTO3 enables the tristate output of the converter, allowing the digital output to be read onto the system bus. The end of conversion llag is connected to the most significant bit of the eight bit input port

U82. The other seven inputs to the port are spare and are connected to a dil switch. The input port is read as PORT04.

The bottom of figure(36) shows the output port address decoder U83.

### 8.4.3 Stepper motor driver

The eight bit output port. 484 which is written to as PORT05 performs two functions. The three least significant bits are used to control the analogue multiplexer via the address lines MUXO to MUX2. whilst the four most significant bits drive four Darlington power transistors. The transistors are sufficient to drive the four phases of a stepper motor with the common stepper motor connection in this case connected to +5 V . Protection diodes are incorporated as shown. The output port is cleared when a RESET is executed.

### 8.4.4 Solenoid drive circuitry

The solenoid valve drive circuitry is shown at the bottom of iigure(37). An octal latch and driver. 485 is connected via two resistor networks to the solenoid valves. The circuit is designed so that solenoids can still be switched on and off in the event of any single component failing. The aim of the design is to protect against failure either short circuit or open circuit of the solenoid driver transistors.

The solenoids, although rated at 28 V . were found to pull-in at greater than 8.0 V and to drop-out at less than 2.6 V . These voltages define the ON and $O F F$ limits in the event of component failure in the driver circuitry. The drive circuit is basically a potential divider with switched resistors forming the lower leg of the divider. By switching in different resistors R1 to R4 the voltage across R5 is varied. This voltage swing is connected across the solenoid and used to switch it ON and OFF. If all transistors are off. then the voltage across R5 is zero, however if one transistor fails short circuit, the voltage across $R 5$ is equal to the minimum value. If afl four transistors are now switched on, the voltage across R5 is
increased. If one transistor fails open circuit. the voltage developed across $R 5$ is reduced and equal to the maximum value. The aim of the potential divider is to ensure that with one transistor failed open or short circuit. the swing from maximum to minimum is sufficient to turn the solenoid $O N$ and OFF. The zener diode is necessary to subtract a bias voltage from the voltage across $R 5$ so that the swing across the solenoid is within its ON and OFF limits. Two zener diodes are used in parallel to protect against the zener failing open circuit. Any of the resistors. transistors. or zener diodes can fail open or short circuit and the driver will continue to switch the solenoid ON and OFF as long as it is a single component failure. The voltage applied to the bottom end of the solenoid is monitored by the $A / D$ converter after being attenuated by the potential divider made up of the $22 k / 1.2 k$ resistors. which attenuate by a factor of approximately twenty. By switching the divider transistors and monitoring the solenoid voltage, it is possible to diagnose component failure in the driver stage. The detailed calculation of values for the resistors, zener diodes and supply voltage is given in appendix(8).

### 8.4.5 Pressure transducers

Six "Foxboro" plezo pressure transducers type $1,80001 \mathrm{Cl} 00 \mathrm{C} 0$ are mounted on a separate printed circuit board. Each transducer uses a quad operational amplifier connected as an insirumentation amplifier as weil as providing the constant current excitation to the transducers. The six transducers and their amplifiers are identical except that there is only one band-gap reference for all six transducers amplifiers.

The circuit diagram for a typical channel is given in figure(38). A band-gap reference is connected to U86a which controls the constant current of 1.5 mA through the transducer. The voltage produced across the transducer is amplified by the instrumentation amplifier U86c. U86d which nas an adjustable gain orea to 300 set by the "span" preset


#### Abstract

potentiometer. A non-inverting amplifier. U86b. having a gain of ten. produces a voltage of 0 to 7.5 V at its output. This voltage is derived from the band-gap reference and is therefore stable and can be varied by the "zero" preset potentiometer. This variable voltage is fed via a large resistance into the inverting input of the final operational amplifier. U86d and is used to zero the output. Resistor $R 6$ is individually chosen to nave a voltage drop of about 1.5 V across it. This is so that slight variations in the 0 to 7.5 V supply do not effect the zeroing very much. Resistors. R1 to R5 are supplied with the transducer and are for temperature compensation. Resistors R1 to R6 are connected as shown to a header plug.

Each transducer and amplifier is adjusted to give a swing of 1.5 V for a 0 to 15 "w.g. pressure range. and a "zero" of 0.1 V .


### 8.5 PNEUMATIC TEST RIG

The test governor was built to demonstrate the reliable. control of gas flow and pressure, and to show that the governor will continue to function in the presence of errors. Test circuitry allows transient and permanent errors to be injected into the controller. A schematic diagram of the pneumatic governor is given in figure(39) as well as a photograph in figure(20).

Considering the schematic diagram of figure(39). dry compressed air at approximately $10 p s i$ enters the " $J$ " governor which reduces the pressure to 20"w.g. which is approximately held constant at a sate inlet pressure for the main control valve. Gas now passes through the modified " $K$ " pilot , as described in chapter(7). which is controlled by the two solenoid valves S1. $\because$ and is stabilised by the neodio valve. II is approciated that it is undesirable to place bends in the plpework close to regulators or pressure tappings. but in order to achieve a compact design. it was necessary to bend the rig round on itself. This is unimportant for demonstration
purposes. A needie valve is used as an adjustable "orifice plate" which has pressure tappings either side of it. Transducers P1. P2. P3 measure the outlet pressure of the "K" pilot and P4. P5. P6 measure the outlet pressure of the governor. The pressure differential across the "orifice plate" is a measure of the flow through the system according to the equation:-

$$
\Delta P=k Q^{2} \quad \text { where } \quad k=\text { orifice plate constant }
$$

$$
\begin{aligned}
Q & =\text { flow } \\
\Delta P & =\text { pressure differential }
\end{aligned}
$$

Manometers are connected to the pressure tappings to give a visual check on the operation of the system as well as being used to calibrate the pressure transducers. A variable downstream load is simulated by another needie valve and gas is finally exhausted to atmosphere through a silencer.

### 8.6 ESTIMATED RELIABILITY

Chapter tinree concludes that MIL 2170 is pessimistic when estimating the reliability of integrated circuits. especially plastic. However so that the failure rate of the fault tolerant controller can be compared with the non fault tolerant controller of section 7.3. an order of merit comparison will be performed using MIL 217 D data. It is likely that the failure rate estimated in this way will be pessimistic. so the failure rate calculation is repeated using the failure rate data sources recommended in chapter three.

When analysing the fault tolerant controller several assumptions were made :

Failures are instantly repaired. If the MTTR is very much less than the MTTF, then this assumption will not produce too large an error. For systems where the MTTR is not insignificant. the analysis of chapter four should be used.
(ii) Three pressure transducers per channel are connected in a TMR configuration with the voting performed in software. For this reason the failure rate of the transducers may be neglected. The three microprocessors are connected in a TMR configuration with the voting performed in FPLAs. The failure rate of the microprocessors is neglected and only the failure rate of the voters is considered.

A spare EPROM set is used therefore the failure rate of the EPROMS may be neglected.
(v) The RAM is protected against single bit failures. so the failure rate of the RAM chips is neglected and the failure rate of the RAM is calculated as the failure rate of the extra Hamming code circuitry
(vi) The failure rate of the solenoids is neglected since frequent replacement should prevent wearout and reduce their failure rate to a low value.
(vii) The failure rate of the solenoid drivers is neglected since any component failure is not fatal and is logged.
(viii) An estimation of the failure rate of the software is not included in the controller failure rate.

Using MIL 217D and measured integrated circuit case temperatures. the following failure rates are obtained for the controller using plastic integrated circuits:

| A/D convertor |  | 1.93 f/million hours |
| :---: | :---: | :---: |
| control circuitry |  | 2.05 f/million hours |
| voters |  | $38.5 \mathrm{f} / \mathrm{million}$ hours |
| watchodog |  | $0.63 \mathrm{f} / \mathrm{million}$ hours |
| Hamming code circuitry |  | $14.5 \mathrm{f} / \mathrm{million}$ hours |
| output circuitry |  | $0.94 \mathrm{f} / \mathrm{million}$ hours |
|  | TOTAL | 58.5 f/million hours |
|  |  | 0.5 t/year |

Comparing these figures with those of section 7.3. an improvement in the failure rate by nearly an order of magnitude is observed. It is seen that the Hamming code circuitry is no more reliable than a $4 k$ block of unprotected memory. If more reliable encoding/decoding ROMs were used. or if the controller were operated at a lower temperature then the Hamming code circuitry would be more reliable than a $4 k$ block of memory. Although tho w:o of lamming code protocled momory gives no decrease in latlune batr: when permanent tailures are considered, there is considerable benetit to be gained from transient error protection. Castillo et al [78] report transient fatlure rates which are up to fifty times greater than permanent

```
faults. The fault tolerant controller offers considerable protection.
against transient errors which is not possible in the non fault tolerant
design.
```


### 8.6.1 Revised failure rate prediction

The failure rate of the fault tolerant controller is now predicted using the recommended fallure rate prediction sources of chapter three.

| Circuit block | Data source | Failure rate |
| :---: | :---: | :---: |
| A/D convertor | MIL 2170 | 1.93 f/M hrs. |
| control circuitry | CNET | $1.83 \mathrm{f} / \mathrm{M} \mathrm{nrs}$. |
| voters etc. | CNET | 18.0 f/M hrs. |
| watchoog | CNET | $0.62 \mathrm{f} / \mathrm{M} \mathrm{hrs}$. |
| Hamming circuitry | CNET | 7.27 f/M hrs. |
| output circuitry | CNET | $0.78 \mathrm{t} / \mathrm{M} \mathrm{nrs}$. |
| address decoders etc. | CNET | $5.69 \mathrm{f} / \mathrm{M} \mathrm{hrs}$. |
| resistors (100) | NCSR | $0.70 \mathrm{f} / \mathrm{M}$ hrs |
| capacitors (100) | MIL 217 D | $0.40 \mathrm{f} / \mathrm{M} \mathrm{hrs}$. |
| wire wrap connections | ICL fleld data | $1.60 \mathrm{f} / \mathrm{M} \mathrm{hrs}$. |
| soldered joints (500) | MIL 217D | $1.30 \mathrm{f} / \mathrm{M}$ hrs. |
| edge connectors (200 prs) | CNET | $0.60 \mathrm{f} / \mathrm{M} \mathrm{hrs}$. |
|  | TOTAL | $40.7 \mathrm{f} / \mathrm{M} \mathrm{hrs}$. |
|  |  | 0.36 f/year |

This faiture rate does not include the failure rate of the power supply, but is likely to be pessimistic for the rest of the controller since all failures are assumed to be fatal and do not result in a degraded performance

The failure rate of 0.36 f/year meets the required value as defined in section 7.3. It is therefore proposed that the controller is suitable for use in a hybrid pressure controller as described in chapter seven.

## CHAPTER 9

## GOVERNOR CONTROLLER SOFTWARE

As a first step towards writing reliable software. modular programming was used. The software consists of nineteen modules written in assembler which are linked togetner to form the complete soffware package which is about $2 k$ bytes in length. The detailed software listings are given in appendix(10) and a detailed description of each module follows, logether with any nocessary flow charts.

### 9.1.1 MAIN

Module MAIN is located at address zero and is executed when a reset is performed. The module contains the jump vectors for interrupts and software restart instructions. The flow chart is given in figure(40). Interrupts are disabled at the start of the module so that the initialisation process is not interrupted. Before the RAM is used by data variables and the stack it is initialised to FF hex. i.e. all ones. This is because of the snake which will be described later. The module INITL is then called which initialises all data variables and registers. It is important to initialise all registers to the same value. since if this is not done. they will contain random data and PUSHing such a register onto the stack will result in a voting error if the three microprocessor registers contain different random data. The system reset is logged as well as the time and EPROM set in use. As described in section 8.3 the EPROM set in use is either the main or the spare set. Finally the RST/HOLD circuitry is enabled before entering the main control loop. With the interrupts disabled the RST/HOLD circuitry will have no effect other than to insert hold pulses. but this slows down the processors. so the RST/HOLD circuitry is disabled at the start of the module to speed up execution. The rest of the module consists of the main control loop. Firstly a
"recovery block" is formed and stored in RAM which will be used to perform vectored recovery if the system crashes. The interrupts are enabled after the formation of the recovery block. This is important on the first pass through the control loop because some interrupt routines use vectored recovery, and if an interrupt and subsequent vectored recovery is executed before the formation and storage of a recovery block. then vectored recovery will not be possible and the system will "lock-up". Each pass through the loop calls the modules CNTRL. PRBUFF. and SLFTST to be described later) and resets the watchdog timers.

### 9.1.2 RESYNC

A voting error between the three processors generates a RST6.5 interrupt which causes a jump to the RESYNC module. This module, whose flow chart is given in figure(41). performs the necessary software resynchronisation, controls the resynchronisation hardware. and finally logs the error. On entry to the module. the RST/HOLD circuitry is disabled to speed up execution. If the RST/HOLD circuitry is successful. then the program counters will be resynchronised.

All registers are first PUSHed onto the stack and then POPed off. In this way a two out of three majority vote is performed on all registers and finally all $S O D$. output pins are reset to zero. The retry counter is now decremented and tested for zero. If the count is not zero. the current state of the error flags is read. If the error flags show no error. then a transient error is logged. but if the error still exists. then hardware is used to generate a RST6.5 interrupt signal and the processors enter a halt state whilst waiting for the interrupt to be recognised. If the retry counter reaches a count of zero. then all retries have been exhausted and a CPI lalure is logged as well as switching out the RST6.5 interrupt and RGT/HOLD circuitry.' As well as logging the transient or permanent error. the channel in error, syndrome (voter flags $V 1-\mathrm{V} 6$ ) and number of retries
executed is logged. Finally the retry counter which is stored in RAM is reset as well as the syndrome latch flip/flop. Vectored recovery is used to return to the main control loop and is initiated by calling module BLOCK.

The program counters are resynchronised by the. RST/HOLD circuitry and by executing a simultaneous interrupt. It is for this reason that retries are performed by re-enabling the interrupts and then generating a hardware interrupt pulse. Thus the retry mechanism is a mixture of hardware and software tecnniques and each retry will attempt to resynchronise the program counters which could not be achieved by software alone.

### 9.1.3 MERROR

A memory error generates a RST5.5 interrupt which causes a jump to the MERROR module. whose flow chart is given in figures(42,43). The module first of all saves the error syndrome for future logging and tests if it is possible to read and write to the RAM. If it is not possible, then subroutines can no longer be used because they use the stack which is contained in RAM. After a number of retries controlled by the retry counter a jump is made to SFAIL which logs the failure of the RAM and executes a soft failure which puts the system in a fail safe condition. If it is possible to read/write 00 to the RAM, then the ability to read/write FF is tested. Testing using both. 00 and $F F$ will reveal stuck-at faults. The corresponding syndromes for a 00 and a FF read/write are both saved and are now compared. If both syndromes are zero and the retry counter has not been exhausted, then unless the RST5.5 pin is stuck at ' 1 '. the error is logged as a transient. If the RST5.5 pin is stuck at ' 1 ', then a hardware failure is logged and the RST5.5 interrupt is switched out. If the retry count has been exhausted, then it is assumed that the error is permanent. The two syndromes are now examined for a double bit error. If a DBE does not exist. then if either syndrome is zero or if they are both equal. the
error is a single bit failure which is logged. If this is not the case. then the extra hardware is in error and this is logged. All permanent errors are logged and then the RST5.5 interrupt is switched out. If either syndrome shows a DBE. then either there is a stuck-at DBE in the check bits or the hardware has failed. The DBE must be in the check bits and not the data bits since the ability to read/write to the RAM nas already been confirmed. To test for a stuck-at fault in the check bits, a table of test data is written to the RAM and then read back. If there is no DBE in the data or check bits, then the hardware must be faulty.

After the type of error has been logged. the syndrome and time are logged and. in the case of a single bit error. a normal return from interrupt is executed. If the error was a DBE. then the return address held in RAM may have been corrupted and vectored recovery is executed instead.

### 9.1.4 CNTRL

This module contains the software which controls the process. The flow chart is given in figure(44). The pressures either side of the orifice plate are read. Pgov being upstream of Pout. The pressure differential across the orifice plate is calculated and the outlet pressure of the regulator valve. Pgov, is controlled according to the equation:

Pgov $=$ Fset +2 Pdelta. $\quad$;where Pset is an arbitary set point

The desired pressure Pgov and the measured value are then compared. If they are equal the solenoids are controlled to hold the pressure. otherwise it is determined whether the pressure is too high or 100 low. If the pressure error is greater than the defined deadband. the solenoids are controlted to either lower or ratse the pressure as appropriale. Finally a delay is executed to prevent instability.

### 9.1.5 PRESSR

This module combines time redundancy with. component redundancy and the flow chart is given in figure(45). Two pressures readings are required Pgov and. Pout, which are the pressures either side of the orifice plate Pgov is upstream of Pout. A total of six pressure transducers are read. arranged as two groups of three. Firstly all six pressure transducers are read eight times and the average calculated and stored in a table Time. redundancy such as this will lessen the effect of an erroneous read or s: noise on the input. Then a majority vote is performed on each group of three transducers. The transducers are taken in pairs and the three different averages are calculated. Then each transducer reading is. compared with the average of the other two in order to determine if one channel disagrees with its related pair. If a channel is found to be in error, the error and time is logged as long as the retry count has not been exhausted and the pressure reading returned by the module vis equal to the average of the two "good" channels. If all three channels are in agreement, then the average of the three readings is calculated and returned by the module. Finally the pressure differential. Pdelta, across the orifice plate is calculated.

### 9.1.6 RBKGEN

A "recovery block" is generated. The contents of all registers are saved in a block of memory for later use.

### 9.1.7 BLOCK

This module is complementary to RBKGEN and is used to execute vectored recovery. The block of memory which contains the contents of all registers at some tirne in the past is reloaded into the registers and execution is restarted at the position the last recovery block was generated

### 9.1.8 SLFTST

This module is used to test the solenoids and driver circuitry and could be expanded to self-test the whole of the system. The self-test routine is executed every ten minutes under control of the clock and any error in the solenoid circuitry is logged. together with the time. A solenoid error is defined as any single component failure which is not fatal because of the redundancy already described. The solenoids are only tested every ten minutes. since in the event of a permanent failure. this will cause the failure to be logged every ten minutes until repaired. Any more frequently would be a nuisance. This is an alternative technique to decrementing a retry counter every time an error is detected and ceasing to $\log$ the error once the retry count has been exhausted.

### 9.1.9 SFAIL

No RAM is used by this module since some soft failures are caused by total fallure of the RAM. For this reason no subroutines are used. The modute logs the type of soft. failure and then puts the system into a failsafe condition, which for the experimental controller means failing to the maximum pressure. After the fail-safe shut down. the processor is halted. The processor will be interrupted from its halt state by the watchdog timer and a system reset will be executed if possible. If not. the shut down procedure will be repeated.

## 9.1:10 WTRAP

$\dot{A}$ non-maskable TRAP interrupt causes a jump to this module. A TRAP interrupt is generated when the first watchoog times-out and vectored recovery is attempted. This module resets the printer. clears the output buffer in case it has been corrupted and then logs the error and time as well as the address at which the TRAP interrupt was called. A retry counter is decremented each time the module is called and on every fifth call. a full system resel is initiated.

This and the following modules $\log$ the address at which the error ocurred. This may be very useful in detecting software errors. If the error always occurs at the same address. then further debugging at that address is probably required to solve the problem.

### 9.1.11 SNAKE

This module is called by a software RST7 instruction which is conveniently FF. Data lines are pulled high and unused memory is filled with FF so that an erroneous jump to non-existent memory or memory initialised to FF will cause a RST7 to be executed. In this way erroneous jumps are detected and execution may be safely vectored back. The module clears the output buffer in case it has been corrupted and then logs the error time. and address at which the RST7 was called. Finally vectored recovery is used to return to the main program.

### 9.1.12 DFAULT

This module may be called to $\log$ software errors and is called by a RST5 instruction. If self checking of software is incorporated at a later date. then a RST5 instruction will log the error and execule vectored recovery. This module is not called by any others at present. and is used for demonstration purposes only.

### 9.1.13 INITL

This is called during a system reset and initialises all registers and variables which are used later.

## 9.1 .14 COUT

A character is converted to serial format and sent to the printer at 1200 baud.

### 9.1.15 COUTBF

This module puts a character in the output buffer. Output is buffered so that the control process is not halted whilst printing messages and logging errors.

### 9.1.16 PRBUFF

Unless the output buffer is empty. a character is fetched and sent to the printer.

### 9.1.17 MSGE

A message whose address is pointed to by the HL register pair is put into the output buffer.
9.1.18 NMOUT

The contents of the $A$ register is converted to two ASCII hexadecimal characters which are put in the output buffer.
9.1.19 TIMLOG

The date and time is read from the real time clock and put into the output buffer.

## CHAPTER 10

## TESTING OF FAULT-TOLERANT SYSTEMS

### 10.1 SYSTEM DEBUGGING

Two techniques are available for debugging microprocessor systems:

### 10.1.1 In-circuit emulation

The microprocessor is removed and replaced with the emulation pod. The emulator usually runs in real time and will therefore simulate any timing problems. The emulation is controlled by the host microcomputer development system. Software is developed on the development system. stored on disk, and then executed on the test system by means of the emulator. Breakpoints may be inserted into the program. the program flow traced. and the contents of registers examined. Emulation is probably easier than logic analysis, but does suffer from some disadvantages when testing multiprocessor and fault-tolerant systems. in-circuit emulators are available for multiprocessor systems. but are very expensive and are usually dedicated to only one processor. They cannot be used to inject faults into the system, other than RAM and register corruption. and interference testing and the monitoring of recovery may cause the emulator to crash. since this forms an important part of the system under test. In fact an emulation pod cannot simulate a microprocessor for the purposes of interference testing.

### 10.1.2 Logic analysis

A logic analyser is really a very fast data logger which will record. in memory, the binary states of many parallel signals in real time. The information captured in this way may be displayed at leisure on a CRT screen in a user friendly format such as a pseudo timing display. hexadecimal or binary list, or a disassembled listing of the code executed
by the microprocessor. Typically such a logic analyser will capture 48 channels of information at rates up to 25 MHz and will store 1 k samples. Information is clocked into the logic analyser by an internal or external clock. The logic analyser must be triggered and the triggering position may be varied within the block of data captured. The triggering sequence can be either the single ocurrence of a trigger word or a complex sequence of trigger words. Trigger words correspond to the 48 channeis of input data as well as extra clock qualifiers.

The logic analyser is a purely passive monitoring device and does not interfere with the operation of the system under test. A logic analyser was used to debug the controller described in chapter eight. Since the voting is performed at bus level, the replacement of one microprocessor by an emulation pod would cause the pod to be out-voted by the other two microprocessors. For this reason logic analysis must be used. but also has several advantages. It does not interfere with the system under test and can monitor all types of fault recovery, including the effects of interference testing.

Since the logic analyser is purely passive. some method of loading software into the test system must be provided. In the system of chapter eight. a "monitor" was written to run on the test system which would control the down-line loading of code into the test system. display memory dumps, allow memory locations to be modifled etc. Test software was stored in battery backed RAM which could be write protected.

The logic analyser timing display was found to be very useful when debugging hardware errors and testing the resynchronisation hardware.

Software can also be tested with the logic analyser. If the inputs to the system under test are varied over their complete range, it is possible to force the system to execute all possible paths via conditional jump instructions. The complex triggering capability of the logic analyser can be used to verify that all possible, jump permutations have been executed.

In order to test the recovery from faults, it is necessary to provide the means for injecting faults into the system. Test connectors should be provided that allow faults to be injected into the microprocessors, RAM. and recovery hardware. Faults are injected by a pulse generator, a short pulse correponds to a transient fault whilst a long pulse corresponds to a permanent fault. The pulse generator may be triggered manually or by the logic analyser.

The exact position in software of fault injection may be controlled using the logic analyser. The logic analyser is programmed to trigger on a combination of address. data, and control signals corresponding to a unique position in the software. The trigger output from the logic analyser is used to trigger the pulse generator, and the logic analyser can then monitor the action of the recovery routine. in this way, the logic analyser will capture normal operation followed by fault injection and recovery. Thus the effectiveness of recovery routines can be tested for faults injected anywhere in the software.

Random faults caused by power supply transients may be simulated by interference testing. The interference simulator is connected in series with the mains connection to the power supply. Voltage dips and transient overvoltage spikes can be simulated, and recovery observed using the logic analyser. Care must be taken that the transients do not affect the correct operation of the logic analyser.

### 10.3 OPERATIONAL TESTING

Once the system has been debugged. it should be tested for all input conditions over its complete operating temperature range.

The system should then be burnt-in at elevated temperature for several hundred hours and then released for use.


#### Abstract

All permanent and transient faults should be logged by the system. preferably by a printer or some other non-volatile means, which will be useful when diagnosing and repairing permanent faults. The logging of transient faults will provide a useful history of the incidence of transient faults in that particular environment. and may require preventative action to be taken such as additional screening or filtering of the power supply.


### 10.4 TESTING OF THE GOVERNOR CONTROLLER

### 10.4.1 Pressure control

The correct control of pressure was first of all tested according to the relation:

```
Pgov = Pset + 2Pdelta
```

```
where Pdelta = Pgov - Pout
    Pgov \(=\) outlet pressure of regulator
    Pout . \(\quad\) pressure downstream of orifice plate
```

The system was found to be stable with the solenoid needle valve fully open. The needle valve which simulates the orifice plate was set to a suitable value and then the downstream load was varied by means of the second needle valve. A graph is plotted in figure(46) of the regulator outlet pressure against the pressure differential across the "orifice plate". Pdelta. The observed values of Pgov are compared with the theoretical result shown by the straight line. Agreement between theoretical and measured values is very close and since the system was also found to be stable. it is concluded that pressure is correctly and accurately controlled.

Faults were injected into the controller by a pulse generator which was either triggered manually by a push button. or by the logic analyser. The pulse generator had two open collector outputs which produced negative and positive pulses. Fault recovery was traced using the logic analyser as well as observing that pressure control was correct and that the correct error message was printed out. The complete set of error messages as produced by the printer is given in figure(47).

### 10.4.3 Voting errors

A long and a short pulse was injected into the RDY line of channel one microprocessor as shown in figure(28). This caused that channel to lose synchronisation and message 1 was printed. This message reports a transient error in channel 1 and one attempt (retry) was found to resynchronise the three processors. The logic analyser timing trace of figure(27) shows the timing of the resynchronisation hardware. Haliway through the fault injection pulse, the $\overline{1+2}$ error flag goes low and at the same time microprocessor 1 is seen to be out of synchronism with the other two as shown by the ALE pulses. The positive edge of the latch pulse. latches the error syndrome which shows that the error is in channel 1 . The system clock at 2 MHz is shown in the centre for comparison. A short time after flagging the error. the RST/HOLD sequence is generated. Firstly a RST6.5 pulse is held for ten clock cycles and then the HOLD pulse for ten clock cycles, after which the RST/HOLD circuit resets. The RST/HOLD sequence is seen to be successful since all ALE pulses are in synchronism afterwards. A long pulse caused error message 2 to be printed. Attempts to resynchronise after 255 . retrles were abandoned and a CPU failure of channel 1 is logged. In both cases vectored recovery restored correct operation of the system.

### 10.4.4 RAM errors

The puise generator was connected to the RAM test circuitry of figure(33). A short pulse injected into a single RAM chip caused error message 3 to be printed. The pulse was short so a transient error was logged and the syndrome $S=10$ correctly identified the RAM chip into which the error had been injected.

A long pulse injected as above caused message 4 to be printed. The pulse was long which caused the retry count to be exhausted. The system therefore assumed that the RAM chip had falled permanently.

A long pulse was injected into the preset pin of $U 40 \mathrm{~b}$ as shown in figure(32). This caused a RST5.5 memory error interrupt to be generated without an error actually existing. The error was correctly logged as a hardware error as shown by message 5.

A long pulse was injected into two RAM chips which stored check bits. Message 6 was printed which shows that there is a stuck-at double bit error In the check bits.

A long pulse was injected into two RAM chips which stored data bits. It was found that it was not possible to read and write to RAM so message 7 was printed as well as executing a soft fallure. The regulator was found to fail safe to its maximum value.

### 10.4.5 Watchdogs

The controller was deliberately crashed and the first watchdog. the TRAP watchdog, was found to restore correct operation via vectored recovery. This error caused message 8 to be printed. The controller was deliberately crashed five times, and on the fifth occasion a full system reset was executed as logged by message 9 . initiated by the second watchdog timer. The trap watchdog routine is programmed to give a full system reset after four attempts. A full system reset was also observed to occur at power-on-reset and after a short power supply interruption which
was correctly detected by the undervoltage detector.
The system was crashed until a second full system reset was performed. but this time message 10 was printed. This differs from message 9 in that EPROM set 01 is logged instead of EPROM set 00 . This confirms that the spare EPROM set was correctly switched in by the RESET watchdog.
10.4.6 Snake

A hardware RST7.5 was generated. This caused a jump to 003C which contains FF which is the code for a RST7 instruction. A RST7 was executed and message 11 records operation of the snake and vectored recovery. The address is given as 003 D and not 003 C because the program counter is incremented before being pushed onto the stack.

### 10.4.7 Solenoid failure

A solenoid driver transistor was shorted out. simulating a short circuit fallure. The solenoid failure was correctly logged as shown by message 12. The failure was logged once only and was at 30 minutes past the hour which corresponds to a call of the self-test routine at ten minute intervals.

### 10.4.8 Pressure transducers

The plastic pipe feeding pressure transducer 3 was bent double. cutting off the transducer. Message 13 shows that the error was correctly logged as channel 3.

### 10.4.9 Interference testing

The switching on and off of equipment near the controller caused several transient errors to be logged. Interruptions to the power supply were detected by the undervoltage detector and caused a full system reset.

Transient spike testing was not performed since this was felt to be more a test of the power supply than the controller. The power supply is fitted with "crowbar" overvoltage protection which prevents spike testing
since the crowbar would be triggered. Also any failure of the power supply
could destroy many integrated circuits which would be time consuming and
expensive to replace in the experimental controller. If it is required to
perform spike testing on the controller. it is suggested that a duplicate
controller is built which uses a professional grade power supply.
It is proposed that interruption testing is a more useful test since
short interruptions to the mains supply have been observed during an
electrical storm. caused by auto-reclosure of circuit breakers. Some
microprocessor equipment has been found to fail interruption testing when
the interruptions are very short (a few mains cycles).

## DISCUSSION

The review of fault-tolerant controllers in chapter two has shown that most of the work elsewhere has been concerned with large computers and that much of the fault-tolerance is implemented in software, although many of the techniques used on large computers are applicable to small controllers. The low cost of small digital controllers makes the design of a reliable controller using redundancy techniques much more cost sensitive. it has been shown that built in testability can be included at little extra cost. The use of redundancy will increase hardware costs as well as significantly increasing design costs, but these costs should be offset by increased reliability and availability.

In order to compare designs, or to meet certain design criteria. it is necessary to predict the failure rate of components. The failure of components is assumed to fit the constant portion of the "bath-tub" curve. This assumes completion of the initial burn-in phase and that components do not wear out. The wearout of integrated circuits is rarely experienced and is normally caused by moisture corrosion when the device is operated in a numid environment. In order to accelerate the failure of integrated circuits for life-testing purposes or to extrapolate failure rate data to different temperatures, the Arrhenius acceleration equation is used. The failure rate of devices is shown to increase exponentially with temperature. Life-test data is analysed using the Chi-squared distribution, however care must be exercised when extrapolating life-test data. For random fallures it is permissible to multiply the number of devices under test by the test duration. This should not be done for nonrandom failures and especially when failures are due to wearout. The fallure rate of integrated, circuits is shown to dominate the overall failure rate of digital controllers and a nuge variation is observed in failure rate predictions for integrated circuits. This variation is partly
due to the use of different activation energies and junction temperatures when calculating the increase in failure rate due to temperature. it is important to use a realistic activation energy and to use the typical power dissipation when calculating the junction temperature. The predictions of MIL-217C are shown to be much too high by about two orders of magnitude. The predictions of MIL-217D and the NCSR data bank are more reasonable. but it is proposed here that the CNET data is best when predicting the failure rates of industrial ground based equipment.

Research elsewhere[43.48] has demonstrated that CMOS is less reliable than TTL and should only be used when it is essential to use devices with low power consumption.

There is much published information suggesting that plastic encapsulated devices should be used with caution since they have a higher failure rate than ceramic devices. New plastics have improved their reliablity, but ceramic encapsulation is still better under conditions of high humidity, since moisture ingression is prevented which can cause corrosion and wearout of the integrated circuit. There is considerable evidence from one manufacturer[39] that under favourable conditions of low temperature and humidity, there is no difference between the failure rates of plastic and ceramic devices. Since the gas governor controller will be used in an unprotected and possibly humid environment; it is recommended that ceramic devices are used throughout.

All equipment should undergo burn-in prior to release in order to expose weak components. It is recommended that integrated circuits are purchased according to BS9400 grade $C$ and are burnt-in after installation In the equipment. After successful burn-in, the grade $C$ devices may be considered equivalent to the higher grade $B$ devices which are predicted to have a failure rate one tenth that of plastic devices.

All components should be derated as recommended here and the use of forced cooling should be considered because of the exponential increase in
fallure rate with temperature.
Transient error rates fifty times those of permanent failures have been reported due to environmental, pattern sensitivity, and alpha radiation effects. It is important to protect equipment against transient errors for which software fault-tolerance is ideal. The gas governor controller will tolerate most classes of transient fault.

The effect of maintenance and repair on the reliability of redundant systems has been shown to achieve large improvements in the reliability of equipment, but is dependent on having a high fault coverage.

When designing fault-tolerant systems. it is useful to adopt a structured approach with defined levels of fault recovery. In this way protection will be afforded against most classes of fault and the higher levels of fault-tolerance should trap errors which are not corrected at the lower levels. The use of FMECA and FMEA is useful for identifying the most unreliable parts of a system which require further design effort.

The choice of microprocessor is important when designing a controller and the size of the control task should be matched to the power of the microprocessor, since the fallure rates of microprocessors increase with complexity. The cholce of microprocessor may influence whether faulttolerance is implemented in hardware or software. Most of the faulttolerance in the governor controller is transparent and implemented in hardware. since this allows standard software to be run on the controller.

The electromechanical gas governor is an example of a highly reliable digital controller. The mechanical part of the governor is a standard piece of equipment developed by British Gas and has the advantage of being simple to control. cheap. intrinsically safe. and will fail safe to pneumatic control in the event of controller failure. The controller uses many of the techniques discussed in this Thesis. Difficuity was experienced in synchronising the three processors which was cured by the development of special circuitry. In order for the TMR system to be more
reliable than a simplex system, the voters must be more reliable than a single microprocessor. This could not have been achieved if the voters were constructed from TTL. but was achieved by voting in FPLAs which are more reliable and compact than TTL voters. The voters were found to be fast enough to permit operation of the microprocessors at an internal speed of 3 MHz . The voters were chosen to have open collector outputs since this allowed wire ORing of the outputs and gave greater flexibility during development. It would be better to use voters with "totem pole" outputs and to perform the required ORing of the outputs using external TTL gates. Pull-up resistors would no longer be required on the outputs and the rise time of the outputs would be improved due to the active pull-up. This should enable the voters to work with faster processors. The reliability of the TMR configuration could be further improved if lower power voters were made available. The currently used FPLAs are constructed in bipolar logic which dissipates considerable power, resulting in a high failure rate. The Hamming code protected memory offers no improvement in the failure rate for permanent failures. This is because the failure rate of the protection circuitry is equal to that of an unprotected 4 k block of memory. An improvement in the failure rate could be achieved if lower power encoding and decoding ROMs were available. having lower failure rates. If the memory size were increased above $4 k$, then the permanent fallure rate would be improved. The main benefit to be galned from protecting the memory is the tolerance of transient errors which have been shown to occur up to fifty times more frequently than permanent fallures. The watchdog reset circultry was found to be very effective in resetting the controller after it had crashed, and could be included at little extra cost on even the most cost-sensitive non-redundant controllers. The piezo pressure transducer amplifiers were: found to drift and the printed circuit board was found to be very sensitive to surface moisture. The problem of moisture was cured by coating the underside of the board with varnish. The
voting on the pressure transducers made this drift less critical. but it would be advisable to replace the amplifiers with purpose built transducer amplifiers. The writing and debugging of the controller software was made much easier by making most of the fault-tolerance transparent and implemented in hardware. The logic analyser was found to be a powerful tool when debugging and testing the controller. The logic analyser timing display was invaluable when debugging the hardware, as was the logic analyser disassembler when tracing software execution. It would have been impossible to develop this controller without the use of a powerful logic analyser.

The use of ICE and logic analysis in the testing of redundant systems is discussed and it is concluded that logic analysis is more suitable for the testing of TMR systems. Since the logic analyser does not interfere with the system under test, it is suitable for interference testing and the monitoring of certain classes of transient fault. it was necessary to include test circuitry in the controller so that the different types of fault recovery could be tested. A short pulse injected into the system was used to simulate a transient fault. whilst a long pulse was used to simulate a permanent failure. A useful test facility was constructed by using the logic analyser to trigger the pulse generator used for fault injection. In this way the logic analyser will monitor the operation of the controller before the fault, during the fault, and during fault recovery.

The logging of transient errors by the controller. during normal operation, will provide much useful information about the nature and frequency of transient errors as they affect the controller.

Further work is suggested in the following areas. The controller should be rebuilt on printed circuit boards and use professional grade power supplies of proven reliability. The controller should then undergo interference and environmental testing. Whilst the recovery software has
been extensively developed and tested, the complete software package could
be expanded to include more self-checking and exception handling. There is
a requirement within British Gas for a similar fault-tolerant controller.
having a much lower power consumption. If the increased failure rate of
CMOS were acceptable, then it should be possible to convert the controller
design presented here to CMOS.

## LIST OF REFERENCES

1 CLULEY J.C. - Electronic Equipment Reliability - Macmillan 1974
2 FANTINI F. LOLLI M. MICHELETTI G - Microtau: A nighly reliable control unit based on a commercially available microprocessor - Proc. Eurocon ' 82 Copenhagen - Pub. N.Holland p375-380

HARBERT F.C. - High integrity safety systems for offshore platiorms (A triplicated microprocessor system for fire/gas and automatic shut-down operations) - Proc. Eurocon ' 82 Copenhagen - Pub. N.Holland p433-437 LIM B.C.B - A fail safe microprocessor controller for a life support system - Proc. Eurocon'82 Copenhagen - Pub. N.Holland p451-455
black C.J. SUNDBERG C.E. WALKER W.K.S - Development of a spacebourne memory with a single error and erasure correction scheme - Proc. 7th Conf. on Fault Tolerant Computing 1977 p50-55

TOSCHI E.A. WATANABE T - A semiconductor memory with fault detection. correction and logging - H.P.Journal p8-13

7 PLATTETER D.G - Transparent protection of untestable LSI microprocessors - Fault tolerant Computing Symposium Tokoyo Oct. 1980 p345-347

ANDERSON T. KERR R - Recovery blocks in action : A system supporting nigh reliability - international Conf. on Software Engineering. San Francisco. Oct. 1976

10 CERU S. COPPADORO G. Morganti M - UDET 7116 : Common control for PCM telephone exchange : Diagnostic software design and availability evaluation - Proc. Ann. Int. Conf. F-T Computing. Toulouse France. June 1978 p16-23

GHANI N. HERON K. LEE P.A - A recovery cache for the PDP 11 - IEEE Trans. on Computers - Vol. C-29 No. 6 June 1980 p546-549

MORALEE D - Super-reliability computer keeps space shuttle safe Crosstalk Electronics and Power May 1981 p359-361

GELDERLOOS H.C. WILSON D.V - Redundancy management of shuttle flight control sensors - Proc. 1976 IEEE Conf. on Descision and Control p462-475

AHERN D.B. LAMONT G.B. PETERSON J.B - Microprocessor development for a digital flight control system voter/monitor - IEEE Proc National Aerospace Electronics Conf., Dayton Ohio May 1976 p454-462

DEETS D.A. SZALAI K.J - Design and flight experlence with a digital fly by wire control system using Apollo guidance system hardware on a F8 aircraft - Integrity in Electronic Flight Control Systems. Pub. AGARD Neuilly sur Seine France April 1977 p21.1-21.30

FORSYTHE W. MARSHALL W.G - Self-checking multiprocessor module for train control applications - Electronics Letters, 11th June 1981 Vol. 17 No. 12 p408-410

FOOSE R - Module minimises repair time of process control systems Electronics 2nd March 1978 p121-124

SWARZ R.S - Reliability and maintainability enhancements for the VAX-11/780-Proc Annual Int Conf F-T Computing Toulouse France. June 1978 p24-28

CANEPA M. CLARK S. SIEWIOREK D - C.vmp : The architecture and implementation of a fault-tolerant multiprocessor - Proc. 7th Conf. on Fault Tolerant Computing 1977 p37-43

RYLAND. H.A - Microprocessor systems for railway signalling applications - IEE Colloquium on Microprocessor Applications requiring High Integrity and Fault-tolerance 11th Oct. 1982

DAVIES O.J. OBAC-RODA $V$ - Aspects of fault-tolerant ring structures IEE Colloquium on Microprocessor Applications requiring High integrity and Fault-tolerance líth Oct. 1982 *

HIGUCHI T. KAMEYAMA M - Design of dependent-fallure-tolerant microcomputer system using triple-modular redundancy - IEEE Trans. on Computers Vol. 291980 p202-205

MURPHY D.T - Demand Activated Governing - British Gas internal report - ERS Governor Seminar March 1979

DREW W.A - Review of equipment fault data collection exercise British Gas internal report ERS R. 954

BLAKE J.D. DREW W.A - Reliability models of pressure reduction stations - British Gas internal report ERS R. 850

PASCOE W - 2107A/2107B N-channel silicon gate MOS 4k RAMs - intel reliability report RR-7 1975

EUZENT B - Intel 2116 N -channel silicon gate 16k dynamic RAM - Intel rellability report RR-16 1977

ROSENBERG $S$ - Intel 2716 16k UV erasable PROM - Intel reliability report RR-19 1979

PASCOE W - Polysilicon fuse blpolar PROMS - Intel reliability report RR-8 1975

PASCOE W - MOS static RAMs - Intel rellability report RR-9 1975
NICHOLS $N$ - 8080/8080A microcomputer - Intel reliability report RR-10 1976

INTEL - Component Data Catalog 1980
INSPEC - Electronic reliability data : A guide to selected components - Pub. IEE London 1981

REYNOLDS F.H - Measuring and modelling integrated circuit failure rates - Proc. Eurocon' 82 Copenhagen - Pub. N:Holland p32-45

Milltary Standard Handbook MIL-HDBK-217C 1979 Dept.Defense USA
Military Standard Handbook MIL-HDBK-217D 1982 Dept.Defense USA
CNET -- Recueil de Données de Flabilité Edition 1980

MOTOROLA - 1982 Memory and micro Reliability - Reliability report No.83/N001

MOTOROLA - Microcomputer components 1979
PETERSON P.W - The performance of plastic encapsulated CMOS microcircuits in a humid environment - IEEE Trans. on Components. Hybrids and Manufacturing Technology Vol.CHMT-2 No. 41979 p422-427

FEDERICI F. MAMMUCARI F. TURCONI G - Influence of plastic encapsulated IC on TLC equipment reliability performance - Proc. Eurocon'82 Copenhagen - Pub. N.Holland p259-264

KLEIN M.R - Microcircuit device reliability : Memory/LSI data - RADC report MDR-13 1979

DANIELS B.K. HUMPHREYS $M$ - How do electronic system failure rate predictions compare with field experience - Proc. Eurocon'82 Copenhagen - Pub. N.Holland p935-944

CLARIDGE A.N - A comparison between predicted and observed reliability in a large instrumentation and protection system - Proc. Eurocon'82 Copenhagen - Pub. N.Holland p421-426

JOHNSON G.M. STITCH M - Microcircuit accelerated testing reveals life limiting failure modes - 15 th Annual Proc. Reliability Physics Las Vegas USA April 1977 p179-195

O'CONNOR P.D.T - Practical Rellability Engineering - Pub. Heyden 1981 HAKIM E.B. REICH B - Can plastic semiconductor devices and microcircuits be used in milltary equipment - Proc. 1974 Annual Reliability and Maintainability Conf. Los Angeles USA Jan 1974 p396-402
51. HAKIM E.B. REICH B - Environmental factors governing field rellability of plastic transistors and integrated circuits - Proc. 1972 Reliability Physics Symposium

PECK D.S. ZIERDT C - Temperature-numidity acceleration of metal electrolysis failure in semiconductor devices - Proc. 1973 Reliability Physics Symposium

LAWSON R.W - The qualification approval of plastic encapsulated components for use in moist environments - Proc. on Plastic Encapsulated Devices RSRE May 1976

LYCOUDES $N$ - The reliability of plastic microcircuits in moist environments - Solid State Technology Oct. 1978 p53-62

SMITH D.J - Reliability and Maintainability in Perspective - Pub. Macmillan

JENSEN F. PETERSEN N.E - Burn in : An engineering approach to the design and analysis of burn-in procedures - Pub. Wiley 1982

JENSEN F - Reliability Tutorial - Presented at Eurocon'82 Copenhagen
NATIONAL SEMICONDUCTORS - The Reliability Handbook Vol. 1 2nd Edition 1982

NATIONAL SEMICONDUCTORS - DP8400 E C Expandable Error Checker and Corrector Oct. 1981

ANDERSON T. LEE P.A - Fault Tolerance : Principles and Practice - Pub. Prentice Hall 1981

MOTOROLA - 64k Dynamic RAM Reliability Design Manual 1981
HAYES J.P - Testing memories for single-cell pattern-sensitive faults - IEEE Trans. on Computers Vol.C-29 No. 31980 p249-254

REDDY S.M. SUK D.S - Test patterns for a class of pattern-sensitive faults in semiconductor random access memories - IEEE Trans. on Computers Vol.C-29 No. 61980 p219-226

ABRAHAM J.A. THATTE S.M - Testing of semiconductor random. access memories - Proc Annual Int. Conf. F-T Computing Los Angeles California June 1977 p81-87

BRODSKY M - Hardening RAMs against soft errors - Electronics April 24 1980 p117-122

BRAUER J.B. KAPFER V.C. TAMBURRINO A.L - Can plastic encapsulated microcircuits provide reliability with economy - Microelectronics (GB) Vol. 11970 p5-24 FOX M.J - A comparison of the performance of plastic and ceramic encapsulations based on evaluation of CMOS integrated circuits Microelectronics and Reliability Vol. 161977 p251-254

KNIGHT L. LUCAS P - Observed fallure rates of electronic components in computer systems - Microelectronics and Reliability Vol. 151976 p239-243

ARNOLD T.F - The concept of coverage and its effect on the reliability model of a repairable system - IEEE Trans. on Computers Vol.C-22 No. 3 1973 p251-254

PEARSON J.C. PREECE C - Hardware and software aspects of fault coverage in small digital controllers - Proc. Eurocon'82 Copennagen Pub. N.Holland p663-668

HALSE R.G. PEARSON J.C. PREECE C - The introduction of fault tolerance into digitally controlled gas regulators - Proc. 1983 International Gas Research Conference London

HUNGER A - Characterisation test of microprocessors using a new approach in self-testing - Proc. Eurocon'82 Copenhagen - Pub.N.Holland p906-910

DANIELS S.F. FASANG P.P - Microbit I : A microcomputer with built in diagnostics - Proc. Eurocon'82 Copenhagen - Pub. N.Holland p625-629 HAMMING R.W - Error detecting and error correcting codes - Bell Systems Technical Journal Vol. 291960 pl47-160

CASTILLO X. MCCONNEL S.R. SIEWIOREK D.P - Derivation and calibration of a transient error reliability model - IEEE Trans. on Computers Vol.C-31 No. 71982 p658-671

TEXAS INSTRUMENTS - Error detection and correction using SN54/74LS360 or SN54/74LS361 - Bulletin CA-201

ADVANCED MICRO DEVICES - Am2960 Cascadable 16 bit error detection and correction unit

PREDICTOR - Computer software package marketed by Management Sciences Inc. - 6022 Constitution Ave. Albuquerque N.M. 87110


## CROSS SECTION OF A DONKIN FIG. 280 REGULATOR




FIGURE 4 The bath-tub curve


FIGURE 5 TMR voter and error detection in open collector TTL logic



FIGURE 7 7 Graph of acceleration factor vs operating temperature for a reference temperature of $25^{\circ} \mathrm{C}$. Plotted for different activation energies.


FIGURE 8 Failure rate of a 8085 Microprocessor vs case ambient temperature



FIGURE 10 Graph of Simplex and TVR reliability vs normalised mission time


FIGURE 11 Graph of MTTFIF vs ratio $\lambda_{2} / \lambda_{1}$





FIGURE 16 Block diagram of non fault-tolerant controller


FIGURE 17 Op-code fetch timing showing synchronisation problem



FIGURE 19 Photograph of governor controller


FIGURE 20 Photograph of pneumatic test rig


FIGURE 21 Block diagram of Microprocessor board


Microprocessor and buffers - One of three identical channels - CHANNEL 1
FIGURE 22
$\stackrel{9}{\square}$
OJNM
AAAAA
DATA BUS
INPUT.



Microprocessor and buffers - One of three identical channels - CHANNEL 3
$\overline{\square อ \text { gynois }}$


SYSTEM BUS - All FPLA outputs pulled-up to +5 V by 1 k resistors


FIGURE 25 Voting circuitry


## FIGURE 26 FPLA error signal showing the effect of filtering



FIGURE 27 Resynchronisation timing monitored by logic analyser


Self-sychronising clock circuitry


RS2.32 Interface




FIGURE 32 Part of Memory board including RAM and decoders







FIGURE 38 Pressure transducer and conditioning circuitry


FIGURE 39 Block diagram of pneumatic test rig



$\xrightarrow{\text { MERROR }}$
(B)




DATE 55:07 HR 1こ: 11
YECTORED RECOYERY
SYND=FA RETRIES=0 I
SYHC ERR CHANMEL 01
DATE 15:07 HR 13:15
YECTORED RECOYERY
SYND=DE RETRIES=FF
CPU FAIL CHANHEL OI
DATE 15:07 HR 13:17
TRANS RAM ERR $\mathrm{S}=10$
DATE 15:07 HR 13:26
RAM FAILURE $5=10$
DATE 15:07 HR 13:18
RAM HHARE FAIL $5=00$
TOTAL RAM FAILURE Message 7
DATE 15:07 HR 13:22
YECTORED RECOYERY
TRAP HDOG ADR=0639
DATE 15:07 HR 13:26
EPROH SET 00
FULL SYSTEM RESET
DATE 15:07 HR 13:24
EPROM SET OI
FULL SYSTEA RESET
DATE 15:07 HR 13:19
YECTORED RECOYERY
SNAKE RST7 ADR=0030
DATE 15:07 HR 13:30
SOLEMOID FAILURE
DATE 15:07 HR 13:16
PRESSR ERPOR CH 03

Message 1 SYHC ERR CHANMEL OI

DATE 15:07 HR 13:15 YECTORED RECOYERY SYND=DE RETRIES=FF CPU FAIL CHANMEL OI

DATE 15:07 HR 13:17 TRANS RAM ERR $\mathrm{S}=10$

DATE 15:07 HR 13:26 RAM FAILURE $5=10$

DATE 15:07 HR 13:18 RAM HHARE FAIL $5=00$

```
DATE 15:07 HR 13:17 Message 6
CBIT STUCK DBE S=20
```

DATE 15:07 HR 13:22 YECTORED RECOUERY TRAP HDOG ADR=0639

DATE 15:07 HR 13:26
EPROH SET 00
FULL SYSTEM RESET

DATE 15:07 HR 13:24
EPROM SET O1
FULL SYSTEH RESET
DATE 15:07 HR 13:19
YECTORED RECOYERY
SNAKE RST7 ADR=0030
DATE 15:07 HR 13:30
SOLEMOID FAILURE

DATE 15:07 HR 13:16 PRESSR ERPOR CH 03

Message 8

Message 9

Message 10

Message 11

Message 12

TABLE 1 Failure mechanisms and activation energies in MOS semiconductors

| Failure mechanism | Type | Activation energy eV | Detection | Preventive measures |
| :---: | :---: | :---: | :---: | :---: |
| slow trapping | wearout | 1.0 | high terp. bias | ultra clean processing |
| contamination | wearout/infant | 1.4 | high terp. bias | ultra clean processing |
| surface charge | wearout | 0.5-1.0 | high temp. bias | ultra clean processing |
| polarisation | wearout | 1.0 | high terp. bias | eliminate phosphorus in gate oxide |
| electromigration | wearout | 1.0 | high temp. operating life | $\mathrm{J}<10^{5} \mathrm{~A} / \mathrm{mm}^{2}$ |
| microcracks | random | - | temperature cycling | contoured oxide steps |
| contacts | wearout/infant | - | high terp. operating | ultra clean processing |
| oxide defects | infant/random | 0.3 | high voltage operating life and cell stress | ultra clean processing |
| aluminium corrosion | wearout | 0.8 | leak detection/ moisture tests | improve packaging |
| galvanic bond pad corrosion in plastic encaps. | wearout | 0.6 | leak detection/ moisture tests | improve packaging |
| charge loss from N channel EPROM | wearout | 0.8 | high terp. tests | improved design |

TABLE 2 A comparison of resistor failure rates

| Source | Failure rate $\mathrm{f} / 10^{6} \mathrm{hrs}$. |  |
| :--- | :--- | :---: |
| ICL data | reference(68) | 0.004 |
| Dunmer | reference(69) | 0.015 |
| NCSR | reference(34) | 0.007 |
| CNET | reference(38) | 0.0035 |
| MIL 217D | reference(37) | 0.0035 |
| MIL 217C | reference(36) | 0.0035 |

Failure modes :
$90 \%$ open circuit
$10 \%$ short circuit

Failure rates calculated for an oxide film resistor in a computer environment with $\mathrm{R}<100 \mathrm{~K}$ Temperature $=25^{\circ} \mathrm{C}$ and stress $=0.1$ (operating wattage/rated wattage)

TABLE 3 A comparison of capacitor failure rates

| Source | Failure rate f/10 hrs. |
| :--- | :---: |
| ICL data reference(68) | 0.003 |
| Dummer | reference(69) |
| NCSR | reference(34) |
| CNET | reference(38) |
| MIL 217D | reference(37) |
| MIL 217C reference(36) | 0.08 |

Failúre rates calculated for a $0.1 u F$ polycarbonate capacitor in a computer environment. With Tamb. $=25^{\circ} \mathrm{C}$ and stress $=0.1$ (operating voltage $/ \mathrm{rated}$ )

TABLE 4 A comparison of soldered joints failure rates

| Source | Failure rate $\mathrm{f} / 10^{6} \mathrm{hrs}$. |
| :--- | :---: |
| ICL data reference (68) | 0.002 |
| Dummer | reference(69) |
| CNET | reference(38) |
| MIL 217D | reference(37) |
| MIL 217C | reference(36) |

TABLE 5 A comparison of wire-wrap joint failure rates

| Source | Failure rate $\mathrm{f} / 10^{6} \mathrm{hrs}$. |
| :--- | :---: |
| ICL data reference(68) | 0.0008 |
| Dummer | reference(69) |
| CNET | reference(38) |
| MIL 217D | reference(37) |
| MIL 217C | reference(36) |

TABLE 6 A comparison of edge connector failure rates

| Source | Failure rate $\mathrm{f} / 10^{6} \mathrm{hrs}$. |
| :--- | :---: |
| ICL data reference(68) | 0.0030 |
| Dummer | reference(69) |
| CNET | reference(38) |
| MIL 217D | reference(37) |
| MIL 217C | reference(36) |

Failure rates calculated for a mating pair of contacts connected once only without repetitive mating/un-mating.

MIL 217 and CNET failure rates calculated for a mating pair of contacts within a 64 way connector of material type $B$.

TABLE 7 Activation energies used in failure rate prediction

| Technology | Package type | Activation energy eV |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  |  | MIL 217D | MIL 217C | NCSR |
| TTL HTTL | Hermetic | 0.40 | 0.40 | 0.40 |
| DTL ECL | non hermetic | 0.45 | 0.40 | 0.40 |
| LTTL STTL | hermetic | 0.45 | 0.40 | 0.40 |
|  | non hermetic | 0.50 | 0.40 | 0.40 |
| LSTTL | hermetic | 0.50 | 0.40 | 0.50 |
|  | non hermetic | 0.55 | 0.40 | 0.50 |
| $\mathrm{I}^{2} \mathrm{~L} \quad$ MNOS | hermetic | 0.60 | 0.40 | - |
|  | non hermetic | 0.80 | 0.40 | - |
| PMOS | hermetic | 0.50 | 0.70 | 0.55 |
|  | non hermetic | 0.70 | 0.70 | 0.55 |
| NMOS CCD | hermetic | 0.55 | 0.70 | 0.55 |
|  | non hermetic | 0.80 | 0.70 | 0.55 |
| CMOS, CMOS/SOS | hermetic | 0.65 | 0.70 | 0.55 |
| and linear | non hermetic | 0.90 | 0.70 | 0.55 |

TABLE 8 A comparison of TTL integrated circuit failure rates

| Source | Failure rate $\mathrm{f} / 10^{6} \mathrm{hrs}$. | Ea eV |
| :--- | :---: | :---: |
| ICL data reference(68) | 0.014 | - |
| NCSR | reference(34) | 0.031 |
| CNET | reference(38) | 0.050 |
| MIL 217D | reference(37) | 0.084 |
| MIL 217C reference(36) | 0.533 | $0.3 \& 1.0$ |

Failure rates calculated for a plastic package containing 4 gates and in a computer environment with Tamb. $=25^{\circ} \mathrm{C}$ and $\mathrm{Tjunct} .=33^{\circ} \mathrm{C}$

TABLE 9 A comparison of the failure rates for a 6800 Microprocessor

| Source | Failure rate $\mathrm{f} / 10^{6} \mathrm{hrs}$. | Ea eV | $\mathrm{Tj}{ }^{\circ} \mathrm{C}$ |
| :--- | :---: | :---: | :---: |
| Motorola reference(39) | 0.30 | 1.0 | 105 |
| MIL 217D reference(37) | 7.80 | 0.55 | 120 |
| MIL 217C | reference(36) | 1380 | 0.70 |
| CNET | reference(38) | 6.60 | $0.3 \&$ |
| NCSR | reference(34) | 6.40 | 1.0 |

Failure rates calculated for a hermetic package containing 1367 gates in a "ground fixed" environment at Tamb. $=70^{\circ} \mathrm{C}$ Oja specified by reliability data source
Pdiss = worst case $=1 \mathrm{~W}$ (except for Motorola $=0.5 \mathrm{~W}$ typical)
Quality level = C1 (approximately computer grade)

TABLE 10 A comparison of 6800 Microprocessor adjusted failure rates

| Source | Failure rate $\mathrm{f} / 10^{6} \mathrm{hrs}$. |
| :--- | :---: |
| Motorola reference(39) | 0.013 |
| MIL 217D | reference(37) |
| MIL 217C | reference(36) |
| CNET | reference(38) |
| NCSR | reference(34) |

Failure rates converted to a common base of :

```
Tamb. = 45 %}\textrm{C
Oja}=5\mp@subsup{0}{}{\circ}\textrm{C}/\textrm{W
Pdiss = 0.5W (typical)
Tj}=7\mp@subsup{0}{}{\circ}\textrm{C
Ea}=1.0\textrm{eV
Quality level = C1
```

** Since CNET data uses both 0.3 and 1.0 eV activation energies, it is not
possible to convert to a single activation energy of 1.0 eV , however the
junction temperature is adjusted from $120^{\circ} \mathrm{C}$ to $70^{\circ} \mathrm{C}$.

TABLE 11 A comparison of the failure rates for a 8080 Microprocessor

|  | Source | Failure rate $\mathrm{f} / 10^{6} \mathrm{hrs}$. | Ea eV | $\mathrm{Tj}{ }^{\circ} \mathrm{C}$ |
| :--- | :---: | :---: | :---: | :--- |
| Intel | reference(32) | 0.12 | 0.50 | 89 typ.? |
| MIL 217D | reference(37) | 4.2 | 0.55 | 106 |
| MIL 217C | reference(36) | 510 | 0.70 | 106 |
| CNET | reference(38) | 3.8 | $0.3 \&$ | 106 |
| NCSR | reference(34) | 10.8 | 1.0 |  |
| RRE |  | 36.2 | 0.55 | 145 |

Failure rates calculated for a hermetic package containing 1100 gates in a "ground fixed" environment at Tamb. $=55^{\circ} \mathrm{C}$

Oja specified by reliability data source
Pdiss = worst case
Quality level = C1 (approximately computer grade)

TABLE 12 A comparison of 8080 Microprocessor adjusted failure rates

| Source | Failure rate $\mathrm{f} / 10^{6} \mathrm{hrs}$. |
| :--- | :---: |
| Intel | reference(32) |
| MIL 217D | reference(37) |
| MIL 217C | reference(36) |
| CNET | reference(38) |
| NCSR | reference(34) |

Failure rates converted to a common base of :
Tamb. $=45^{\circ} \mathrm{C}$
Oja $=50^{\circ} \mathrm{C} / \mathrm{W}$
Pdiss $=0.78 \mathrm{~W}$ (typical)
$\mathrm{Tj}=84^{\circ} \mathrm{C}$
$\mathrm{Ea}=1.0 \mathrm{eV}$
Quality level = C1
** Since CNET data uses both 0.3 and $1.0 e V$ activation energies, it is not possible to convert to a single activation energy of 1.0 eV , however the junction temperature is adjusted from $106^{\circ} \mathrm{C}$ to $84^{\circ} \mathrm{C}$.

TABLE 13 A comparison of the failure rates of a 2716 EPROM

| Source | Failure rate f/10 ${ }^{6} \mathrm{hrs}$. | Ea eV |
| :--- | :---: | :---: |
| Intel | referencé(29) | $0.45 @ 60 \%$ C.L. |
| MIL 217D | reference(37) | 0.30 |
| MIL 217C | reference(36) | 3.5 |
| CNET | reference(38) | 0.64 |
| NCSR | reference(34) |  |

Failure rates calculated with :
Tamb. $=55^{\circ} \mathrm{C}$
$\theta j a=25^{\circ} \mathrm{C} / \mathrm{W}$
Pdiss $=0.525 \mathrm{~W}$
$\mathrm{Tj}=68^{\circ} \mathrm{C}$
Quality level Cl
"ground fixed"environment
hermetic encapsulation

TABLE 14 A comparison of the failure rates of a 1 k Bipolar ROM

| Source | Failure rate f/10 $\mathrm{hrs}$. | Ea eV |  |
| :--- | :--- | :---: | :--- |
| Intel | reference(30) | $0.5 @ 90 \%$ C.L. | 0.40 |
| MIL 217D | reference(37) | 0.85 | 0.45 |
| MIL 217C | reference(36) | 4.90. | 0.40 |
| CNET | reference(38) | 0.31 | $0.3 \&$ |
|  |  | 0.34 | 1.0 |
| NCSR | reference(34) |  | 0.40 |

Failure rates calculated with :

Tamb. $=85^{\circ} \mathrm{C}$
$\theta j a=30^{\circ} \mathrm{C} / \mathrm{W}$
Pdiss $=0.5 \mathrm{~W}$
$\mathrm{Tj}=100^{\circ} \mathrm{C}$
Quality level = C1
"ground fixed" environment
hermetic encapsulation

TABLE 15 A comparison of the failure rates of a 16 k Dynamic RAM

|  | Source | Failure rate $\mathrm{f} / 10^{6} \mathrm{hrs}$. | Ea eV |
| :---: | :---: | :---: | :---: |
| Intel | reference(28) | 0.27 @ 60\% C.L. | 0.30 |
| MIL 217D | reference(37) | 8.0 | 0.55 |
| MIL 217C | reference(36) | 261 | 0.70 |
| CNET | reference(38) | 0.84 | $\begin{aligned} & 0.30 \& \\ & 1.0 \end{aligned}$ |
| NCSR | reference(34) | 4.9 | 0.55 |
| Motorola | reference(40) | 0.083 @ 60\% C.L. | 1.0 |
| Motorola converted | $\begin{aligned} & \text { reference }(40) \\ & \text { to } \mathrm{Ea}=0.3 \mathrm{eV} \end{aligned}$ | 1.5 @ 60\% C.L. | 0.3 |

Failure rates calculated with :

```
Tamb: = 70 %
0ja}=3\mp@subsup{0}{}{\circ}\textrm{C}/\textrm{W
Pdiss = 0.4W
Tj = 820}\textrm{C
"ground fixed"environment
hermetic encapsulation
Quality level = C1
```

TABLE 16 A comparison of the failure rates of a 1 k Static RAM

|  | Source | Failure rate $\mathrm{f} / 10^{6} \mathrm{hrs}$. | Ea eV |
| :---: | :---: | :---: | :---: |
| Intel | reference(31) | 0.4 @ $90 \%$ C.L. | 0.30 |
| MIL 217D | reference(37) | 0.68 | 0.55 |
| MIL 217C | reference(36) | $13.2{ }^{\text {² }}$ | 0.70 |
| CNET | reference(38) | 0.22 | $\begin{aligned} & 0.3 \& \\ & 1.0 \end{aligned}$ |
| NCSR | reference(34) | 0.57 | 0.55 |

Failure rates calculated with :
Tamb. $=55^{\circ} \mathrm{C}$
Oja $=30^{\circ} \mathrm{C} / \mathrm{W}$
Pdiss $=0.2 \mathrm{~W}$
$\mathrm{Tj}=61^{\circ} \mathrm{C}$
Quality level = C1
"ground fixed" environment
hermetic encapsulation

TABLE 17 The relation between cost, screening level and Quality factor

| MIL 217 <br> screen level | Screening method | $\pi_{Q}$ | Typical relative <br> cost |
| :--- | :--- | :--- | :--- |
| S | MIL-M-38510 Class S | 0.5 | $8-20$ |
| S | BS9400 Class A ** | 0.5 |  |
| B | MIL-M-38510 Class B | 1.0 |  |
| B | BS9400 Class B | 1.0 |  |
| B-1 | MIL-STD-883 me.thod 5004 Class B | 3.0 | $4-6$ |
| B-2 | Vendor equivalent of B-1 | 6.5 |  |
| C | MIL-M-38510 Class C | 8.0 | 8.0 |
| C | BS9400 Class C | 10.0 | $2-4$ |
| C-1 | BS9400 Class D |  |  |
| C-1 | Vendor equivalent of C | 13.0 |  |
| D | Commercial Hermetic | 13.0 | 17.0 |
| D-1 | Commercial plastic | 35.0 | 0.5 |

[^0]TABLE 18 Recommended maximum junction temperatures for Semiconductors


TABLE 19 Signals carried by back-plane Bus

| Pin No. | Bottom connector |  | Top connector |  |
| :---: | :---: | :---: | :---: | :---: |
|  | A | C | A | C |
| 1 | OV | OV | -12V | $-12 \mathrm{~V}$ |
| 2 | D7 (out) | D7(in) |  |  |
| 3 | D6 (out) | D6(in) | $+5 \mathrm{~V}$ | OV |
| 4 | D5 (out) | D5(in) | +5V | OV |
| 5 | D4 (out) | D4(in) | +5V | OV |
| 6 | D3 (out) | D3(in) | +5V | OV |
| 7 | D2 (out) | D2(in) | +28V |  |
| 8 | D1 (out) | D1(in) | $+28 \mathrm{~V}$ |  |
| 9 | DO (out) | DO(in) |  |  |
| 10 | SOD | $\overline{\text { RESET }}$ |  |  |
| 11 | CLK OUT | CLK OUT $\div 10$ |  |  |
| 12 | ALE | SID |  |  |
| 13 | IO/ $\bar{M}$ |  |  |  |
| 14 | $\overline{W R}$ |  |  |  |
| 15 | $\overline{\mathrm{RD}}$ |  |  |  |
| 16 | A15 | INTR |  |  |
| 17 | A14 | RST 5.5 |  |  |
| 18 | A13 | RST 6.5 |  |  |
| 19 | A12 | RST 7.5 |  |  |
| 20 | A11 | TRAP |  |  |
| 21 | A10 | ROM SELECT | PORT 7 (WR) | PORT 7 (RD) |
| 22 | A9 |  | PORT 6(WR) | PORT 6(RD) |
| 23 | A8 |  | PORT 5(WR) | PORT 5(RD) |
| 24 | A7 |  | PORT 4 (WR) | PORT 4(RD) |
| 25 | A6 |  | PORT 3(WR) | PORT 3(RD) |
| 26 | A5 |  | PORT 2(WR) | PORT 2(RD) |
| 27 | A4 |  | PORT 1 (WR) | PORT 1 (RD) |
| 28 | A3 |  | PORT O(WR) | PORT O(RD) |
| 29 | A2 |  |  |  |
| 30 | A1 |  |  | SERIAL IN |
| 31 | AO |  |  | SERIAL OUT |
| 32 | $+5 \mathrm{~V}$ | +5V | +12V | +12V |

## TABLE 20 Correction bits for SEC/DED Hamming code

| C3 | C2 | C1 | C0 | Bit in error |
| :--- | :--- | :--- | :--- | :--- |
| 0 | 0 | 0 | 0 | No error |
| 0 | 0 | 0 | 1 | C0 |
| 0 | 0 | 1 | 0 | C1 |
| 0 | 0 | 1 | 1 | DBE |
| 0 | 1 | 0 | 0 | C2 |
| 0 | 1 | 0 | 1 | DBE |
| 0 | 1 | 1 | 0 | DBE |
| 0 | 1 | 1 | 1 | DO |
| 1 | 0 | 0 | 0 | C3 |
| 1 | 0 | 0 | 1 | DBE |
| 1 | 0 | 1 | 0 | DBE |
| 1 | 0 | 1 | 1 | D1 |
| 1 | 1 | 0 | 0 | DBE |
| 1 | 1 | 0 | 1 | D2 |
| 1 | 1 | 1 | 0 | D3 |
| 1 | 1 | 1 | 1 | DBE |

Horizontal parity =
ODD for single bit error (SBE)
EVEN for double bit error (DBE)

TABLE 21 Error position as indicated by error flags

| S2 | S1 | SO | E | Error position |
| :--- | :--- | :--- | :--- | :--- |
| 0 | 0 | 0 | 0 | No error |
| 0 | 0 | 0 | 1 | RAM 0 |
| 0 | 0 | 1 | 1 | RAM 1 |
| 0 | 1 | 0 | 1 | RAM 2 |
| 0 | 1 | 1 | 1 | RAM 3 |
| 1 | 0 | 0 | 1 | RAM 4 |
| 1 | 0 | 1 | 1 | RAM 5 |
| 1 | 1 | 0 | 1 | RAM 6 |
| 1 | 1 | 1 | 1 | RAM 7 |
| 0 | 0 | 1 | 0 | Double bit error |

## APPENDIX 1 Circuit diagram of power supply

The circuit diagram of the controller power supply is given in figure(A1) below.


APPENDIX 2 Circuit board layout

The layout of the three governor controller circuit boards is given in the following figures A2-A4.


FIGURE A2 Microprocessor board layout



## APPENDIX 3.1

## ERROR FLAGS LOW PASS FILTER CALCULATION

Although the three processors are fed by a common clock. they will not run in exact synchronism because of different internal delays: Consideration of the 8085 timing given in reference(33) shows that a maximum difference of 140 ns is possible between processors on the address bus. Any difference between the three channels will cause transient dips in the error flags which must be smoothed out by the low pass filters. The 50 ns rise time of the error flags must be added to the 140 ns giving a 190 ns maximum dip in the error flags. The voter flags must not be allowed to dip below the logic ' $\gamma$ ' level of 2.0 V for a 250 ns interruption which includes a safety margin added to the 190 ns .

Hence the time constant of the LPF is calculated by:

$$
\left.\begin{array}{rlr}
V & =V_{0} \exp -(t / C R) & \text { where : } \\
\Rightarrow 2 & =5 \exp -(250 \mathrm{~ns} / C R) & \\
V=5 V \\
V=2 V
\end{array}\right) .
$$

Using $R=7 k \pm 5 \%$ this gives $C=285 p F$ for $R \min =0.95 k$. Since the capacitor tolerance is $\pm 10 \%$, a 330 pF capacitor is used.

The low pass filter therefore consists of :


The following table will implement a 5 channel TMR voter including error
flags when programmed into the FPLA．


82S 100－I．N • 82S 101－I．N
16X48X8 FPLA PROGRAM TABLE

|  | PROGRAM TABLE ENTRIES |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | INPUT VARIABLE |  |  |  |  |  |  | OUTPUT FUNCTION |  |  |  |  |  |  |  |  |  | OUTPUT ACTIVE LEVEL |  |  |  |  |  |
|  | Im | Im |  |  | Don＇t Care |  |  |  | Prod．Term Present in $\mathrm{Fp}_{\mathrm{p}}$ |  |  |  | Prod．Term Not Present in $\mathrm{FP}_{\mathrm{p}}$ |  |  |  |  | Active High |  |  | Active Low |  |  |
|  | H | L |  |  | －（dash） |  |  | A |  |  |  |  |  | －（period） |  |  |  | H |  |  | L |  |  |
|  | Enter（－）lor unused inputs of used P－terms． |  |  |  |  |  |  |  | 1．Entries independent of output polarity． <br> 2．Enter（A）ior unused outbuts ol used P－terms |  |  |  |  |  |  |  |  | 1 Polarity progremmed once only <br> 2 Entrif（hill for alt unused oulpuls |  |  |  |  |  |
|  | PRODUCT TERM． |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | ACTIVE LEVEL ${ }^{\circ}$ |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | OUTPUT FUNCTION |  |  |  |  |  |
|  | NO． |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  | 2 |  | 0 | 97 | $T \overline{8}$ |  |  | $5{ }^{-1} 4$ | 3 |  | 2］${ }^{-1}$ |  | O |  | 776 |  | 3 | T 2 | $\bigcirc$ |
|  | 0 | － | －－ | － | － | － | － | － | － | － | － | － | － | －－ |  | H | H |  |  |  |  |  | A |
|  | 1 | － | － | － | － | － | － | － | － | － | － | － | － | － |  | H | － |  |  |  |  |  | A |
|  | 2 | － | － | －－ | － | － | － | － | － | － | － | － | － | － | H | － H | H |  |  |  |  |  | A |
|  | 3 | － | －－ | － | － | － | － | －－ | － | － | － | － | H | H | － | $-$ | － |  | － |  |  | A |  |
|  |  |  | － | － |  | － | － | － | － | － | － | H | H | － | － | －－ | － |  |  |  |  | A |  |
|  | 5 |  |  | － | － | － | － | － | － | － | － | $\mathrm{H}^{-}$ | － | H | － |  | － |  |  |  |  | A |  |
|  | 6 |  | － | － | － | －－ | － | － | $\overline{7}$ | H | H | － | － | － | － | $-1$ | － |  |  |  |  | A |  |
|  | 7 | － | －1－ | － | － | － | － | － | H | H | － | － | － | $-1$ | － | － | － |  |  |  |  | A |  |
|  | 8 | － | － | － | － | － | － | － | H | － | H | － | － | － | － | － | － |  |  |  |  | A |  |
|  | 9 | － | － | － | $-$ | － | H | ${ }^{+}$ | － | － | － | － | － | － | － | － | － |  |  |  | A |  |  |
|  | 10 | － | － | － | － |  | H－ | － | － | － | － | － | － | － | － | － | － |  |  |  | A |  |  |
|  | 11 | － | － | － | － | H | － | H | － | － | － | － | － | － | － | －- | － |  |  |  | A |  |  |
|  | 12 | － | － | H | H | － | － | － | － | － | $=$ | － | － | － | － | － | － |  |  | A |  |  |  |
|  | 13 | － | H | H | － | － | － | －1－ | － | － | － | － | － | － | － | －－ | － |  |  | A． |  |  |  |
|  | 14 | － | H | － H | H | － | － | － | $=$ |  |  | － | － | － |  |  |  |  |  | A | － | $\therefore \cdot$ | $\therefore$ |
|  | 15 |  |  | － |  |  |  |  |  |  |  | － |  |  |  |  |  |  | A $\perp$ |  |  |  |  |
|  | 16 |  | － | － | － | － | $-1$ | － | － | － | － | － | $-$ | － |  | H | L |  | A $A$ |  |  |  |  |
|  | 17 | － | －－ | －－ | － | － | － | － | － | － | $\cdots$ | － | － | － | L | H | － | A | A |  |  |  |  |
|  | 18 | － | － | － | － | － | － | － | － | － | － | － | － | － | H． | L | － |  | A |  |  |  |  |
|  | 19 | － | － | － | － | － | － | $-1$ | － | － | － | － | L | H | － | － | － |  | A ${ }^{\text {A }}$ |  |  |  |  |
| $\cdots$ | 20 | － | － | － | － | － | －1－ | －1－ | － | － |  | －H | H | L | － | － | － |  | A A |  |  |  |  |
| \％ | 21 | － | － | － | － | － | － | $-$ | － | － | － | L | H | － | － | － | － | A． | A |  |  |  |  |
| 㫛 | 22 | － | － | － | － | － | － | － | － | － |  | H | L | － | － | － | － | A | －${ }^{\text {A }}$ |  |  | － |  |
| 8 | 23 | － | － | － | － | － | － | $-$ | － | L |  | － | － | － | － | － | － |  | A A |  |  |  |  |
| － | 24 | － | － | － | － | － | － | － | － | H | L | － | － | － | － | － | － |  | A A |  |  | － |  |
| $\stackrel{\text { c }}{\substack{0 \\ 0}}$ | 25 | － | － | $-1$ | － | － | － | － | L | H | － | － | － | － | － | － | － | A | A |  |  |  |  |
| \％ | 26 | － | －1－1 | － | － | － | － |  | H | L | － | － | $-$ | － | － | － | － |  | －A | ． |  |  |  |
| $\stackrel{\square}{9}$ | 27 | － | － | － | － | － |  | H | － | － | － | － | － | － | － | $-$ | － |  | A A |  |  |  |  |
| 9 ¢ | 28 | － | － | －－ | － | － | H | L | － | － | － | － | － | － | － | － | － |  | A A | ． |  | － |  |
| $\pm$ 号 | 29 | － | － | － | － | L | H | － | － | － | － | － | － | － | － | － | － |  |  |  |  |  |  |
| 式 | 30 | － | － | － | － | H | L | － | － | － | － | － | － | － | － | － | － | A | A |  |  |  |  |
|  | 31 | － | － | L | H | － | － | － | － | － | － | － | － | － | － | － | － |  | A A A |  |  | ． |  |
| －E | 32 | － | － | H | L | － | － | － | － | － | － | － | － | － | － | － | － |  | A A |  |  |  |  |
| 5 䙇 | 33 | － | L | H | － | － | － | － | － | － | － | － | $-$ | － | － | － | － | A | －A |  |  |  |  |
|  | 34 | － | H | L | － | － | － | － | － | － | － | － | － | － | － | － | － | A | A |  |  |  |  |
|  | 35 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
|  | 36 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| $\underset{\sim}{2}$ | 37 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
|  | 38 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
|  | 39 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
|  | 40 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | － |  | － |
|  | 41 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
|  | 42 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
|  | 43 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
|  | 44 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 응 | 45 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| ¢ $\sim_{3}^{\text {¢ }}$ | 46 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
|  | 47 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

[^1]
## Appendix 4

## CALCULATION OF DELAY BEFORE LATCHING ERROR FLAGS

Component tolerances in the error flag low pass filters will cause their outputs to change state at different times. therefore the error flags should not be latched until they have all had time to settle to their final value. The outputs of the filters will decay exponentially and. considering the extremes of component tolerances, will have time constants between the limits:

```
tmin.= 0.95k x 297pF=282ns
tmax.= 1.05k x 363pF=381ns
```

The graph below shows the exponential decay of the outputs plotted for the two extremes of time constant.


The TTL logic thresholds of ${ }^{\prime} 1^{\prime}=>2.0 \mathrm{~V}$ and ${ }^{\prime} 0^{\prime}=<0.8 \mathrm{~V}$ leave an indeterminate range of $0.8-2.0 \mathrm{~V}$. Taking the worst case. the voting error is detected at time th when the output, having a time constant of 282 ns . drops below 2V. However the other error flags must be allowed time to settle to the 0.8 V level at a rate governed by the 381 ns time constant. This level is reached at time t2. The difference $t 2$ - 11 is the maximum possible delay between detecting an error and allowing the error flags to settle.

$$
\begin{aligned}
& \mathrm{t} 1=282 \ln 2.5=258 \mathrm{~ns} \\
& \mathrm{t} 2=381 \ln 6.25=698 \mathrm{~ns}
\end{aligned}
$$

Hence: $\mathrm{t} 2-\mathrm{t}=440 \mathrm{~ns}$

Hence the monostable UT8b is designed to give a delay of 440 ns between detecting a voling error and latching the error flags.

APPENDIX 5 Encoding ROM data
$0000 \quad 00172 B 3 C 4 D 5 A \quad 6671$ 8E 99 A5 B2 C3 D4 E3 FF $0010 \quad 00172 \mathrm{~B}$ 3C 4D 5A 6671 8E 99 A5 B2 C3 D4 E3 FF

The above table is programmed into a $32 \times 8$ PROM and is used to encode data into the SEC/DED Hamming code. Four-bit wide data is encoded as four data bits and four check bits.

## APPENDIX 6 Decoding ROM data

| 0000 | 00 | 01 | 03 | 02 | 05 | 02 | 02 | 19 | 07 | 02 | 02 | 2B | 02 | 4 D | 8 |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0010 | 09 | 12 | 12 | 15 | 12 | 13 | 11 | 10 | 12 | 9 F | 5D | 12 | 3B | 12 | 1 |  | 7 |
| 0020 | OB | 22 | 22 | 27 | 22 | AF | 6D | 22 | 22 | 23 | 21 | 20 | 39 | 22 | 22 |  |  |
| 0030 | 32 | 7 D | BF | 32 | 37 | 32 | 32 | 1B | 35 | 32 | 32 | 29 | 30 | 31 | 33 |  | 2 |
| 0040 | OD | 42 | 42 | CF | 42 | 47 | 6B | 42 | 42 | 45 | 59 | 42 | 41 | 40 | 42 |  | 3 |
| 0050 | 52 | 7B | 57 | 52 | DF | 52 | 52 | 1D | 53 | 52 | 50 | 51 | 52 | 49 | 55 |  | 2 |
| 0060 | 62 | 79 | 65 | 62 | 63 | 62 | 60 | 61 | EF | 62 | 62 | 2D | 62 | 4B | 67 |  | 2 |
| 0070 | 71 | 70 | 72 | 73 | 72 | 75 | 69 | 72 | 72 | 77 | 5B | 72 | 3D | 72 | 72 |  |  |
| 0080 | OF | 82 | 82 | CD | 82 | $A B$ | 87 | 82 | 82 | 99 | 85 | 82 | 83 | 82 | 80 |  |  |
| 0090 | 92 | 97 | BB | 92 | DD | 92 | 92 | 1 F | 91 | 90 | 92 | 93 | 92 | 95 | 89 |  |  |
| OOAO | A | A5 | B9 | A2 | A1 | AO | A2 | A3 | ED | A2. | A2 | 2 F | A2 | A7 | 8B |  |  |
| OOBO | B3 | B2 | B0 | B1 | B2 | A9 | B5 | B2 | B2 | 98 | B7 | B2 | 3F | B2 | B2 |  |  |
| OOCO | C2 | C3 | C1 | CO | D9 | C 2 | C2 | C5 | EB | C2 | C2 | C 7 | C 2 | 4F | 8D |  |  |
| OODO | D5 | D2 | D2 | C9 | DO | D1 | D3 | D2 | D2 | 9D | 5F | D2 | D7 | D2 |  |  |  |
| OOEO | E7 | E2 | E2 | CB | E2 | AD | 6 F | E2 | EO | E1 | E3 | E2 | E5 | E2 |  |  |  |
| OOFO |  | 7 F | BD |  |  |  |  |  | E9 |  |  |  |  |  |  |  |  |

The above table is programmed into a $256 \times 8$ PROM and is used to decode data from the SEC/DED Hamming code. Eight-bit wide data, consisting of four data bits and four check bits, is decoded as four data bits and four error flags. Single bit errors in the data are corrected and the error flags signal the occurrence and position of single and double bit errors.

The RS 58174 is a metal gate CMOS circuit that functions as a real time clock and calendar in busorientated microprocessor systems. An interrupt timer is included, which can be programmed to have one of three times. The time base is obtained from a RS 32.768 kHz crystal with time keeping down to 2.2 V for low power standby operation from batteries.

## Application

When the system is powered up it is necessary to enter the correct data into the device registers and start the clock running. The seconds, minutes, hours, days and months counters are all parallel loaded with data from the 4 bit data bus when correctly addressed, $\overline{\mathrm{CS}}$ is low and a write data strobe pulse is given. Data to be entered is set up on the 4 bit data bus; the address from Table 1 for the required register is set on the 4 bit address bus and a write data strobe pulse is sent. Chip select must be low during write and read operations. All information is entered in the same way with the relevant address. To start the clock running at the required instant a logic 1 is written to DB0 at address 14, likewise writing a 0 stops the clock. Data can be read from a register by using the required address as in Table1 and applying a read strobe pulse. The data becomes available on the 4 bit data bus: chip select must be low for this read operation.
The internal counters are arranged as bytes of four bits each. When a byte is addressed it will appear on the data I/O bus enabling independent access. For bytes which do not contain 4 bits (e.g. week days use only 3 bits) the unused bits are not recognised during a write operation and tied to $\mathrm{V}_{\mathrm{SS}}$ during a read operation. The addressable reset latch holds tenths, units and tens of seconds in a reset condition. If a register is updated during a read operation the I/O data is prevented from updating and a subsequent read will return the illegal b.c.d. code 1111. This allows detection that the previous data had changed and is now incorrect. The interrupt timer can be programmed for $0.5,5$ or 60 second intervals and may be coded for single or repeated operation. The open drain interrupt output is pulled to $V_{S S}$ when the timer times out and reading the interrupt register provides the status and internal selected information.

## Standby mode

This is automatically selected when the supply voltage falls to the standby level. ( 2.2 V minimum) with no read or write strobes.

## Test mode

This mode is used in production testing of the circuit. For normal operation the circuit must be in non-test mode and set as part of the system initialisation. Non test mode is set by writing a logic 0 to DB3 at ADO.

## Years Status Register

This is a 4 bit shift register which is shifted each year on the 31st December. The status register must be set in accordance with the table below and no readout capability is available.

|  | DB3 | DB2 | DB1 | DB0 |
| :--- | :---: | :---: | :---: | :---: |
| Leap year | 1 | 0 | 0 | 0 |
| Leap year +1 | 0 | 1 | 0 | 0 |
| Leap year +2 | 0 | 0 | 1 | 0 |
| Leap year +3 | 0 | 0 | 0 | 1 |

TABLE 1

| Selected counter | Address bits |  |  |  | Mode |
| :--- | :--- | :--- | :--- | :--- | :--- |
|  | AD3 | AD2 | AD1 | ADO |  |
| 0 Test only | 0 | 0 | 0 | 0 | Write only |
| 1 Tenths of sec. | 0 | 0 | 0 | 1 | Read only |
| 2 Units of secs. | 0 | 0 | 1 | 0 | Read only |
| 3 Tens of secs. | 0 | 0 | 1 | 1 | Read only |
| 4 Units of mins. | 0 | 1 | 0 | 0 | Read or Write |
| 5 Tens of mins. | 0 | 1 | 0 | 1 | Read or Wrie |
| 6 Units of hours | 0 | 1 | 1 | 0 | Read or Write |
| 7 Tens of hours | 0 | 1 | 1 | 1 | Read or Write |
| 8 Units of days | 1 | 0 | 0 | 0 | Read or Wr:it: |
| 9 Tens of days | 1 | 0 | 0 | 1 | Read or Wrie |
| 10 Day of week | 1 | 0 | 1 | 0 | Read or Write |
| 11 Units of months | 1 | 0 | 1 | 1 | Read or Write |
| 12 Tens of months | 1 | 1 | 0 | 0 | Read or Write |
| 13 Years | 1 | 1 | 0 | 1 | Write only |
| 14 Stop/Start | 1 | 1 | 1 | 0 | Write only |
| 15 Interrupt and status | 1 | 1 | 1 | 1 | Read or Write |

## Appendix 8

## DESIGN OF REDUNDANT SOLENOID DRIVER CIRCUITRY

The solenoid is rated as follows :

```
Vnormal =24V
Vhold =2.6V
Vpull-in = 8.0V
Resistance = 287\Omega
```

An octal driver is used to control two solenoids and the circuit diagram for one solenoid using half of the driver is shown below:


The aim of the design is that any of the components excluding the solenoid can fail. whilst still belng able to switch the solenold $O N$ and $O F F$. The resistor values R2 - R5 must be chosen to give the maximum possible voltage swing at their junction with R7.

The potential divider is simplified to:


With three transistors off and one falled short circuit. the voltage $V 1$ is given by:

$$
V 1=\frac{x}{1+x} \operatorname{Vcc}
$$

Conversely, with three transistors $O N$ and one failed open circuit. the voltage V 2 is given by :

$$
V 2=\frac{x / 3}{1+x / 3} \quad V c c=\frac{x}{3+x} \operatorname{Vcc}
$$

Hence : ' $\quad \frac{\Delta v}{v c c}=\frac{v 1-v 2}{v c c}=\frac{x}{1+x}-\frac{x}{3+x}=v^{\prime}$

For maximum $\Delta V \quad \frac{d V^{\prime}}{d x}=0$

Hence: $\quad x=\sqrt{3}=1.73$

Vmax. $=0.268 \mathrm{Vcc}$

Vcc was chosen to be 28 V which gives $\quad \mathrm{Vmax}=7.5 \mathrm{~V}$

A sufficient swing to turn the solenoids $O N$ and OFF is thus ensured by choosing Vcc $=28 \mathrm{~V}$. but this voltage is sufficiently low so as not to damage the solenoid due to overvoltage. should R1 fail.

The minimum acceptable $V$ is given by:

$$
\begin{aligned}
\text { Vmin } & =\text { Vpull-in }- \text { Vhold } \\
& =8.0-2.6=5.4 \mathrm{~V}
\end{aligned}
$$

Therefore the voltage swing of 7.5 V is 2 V greater than necessary and the excess can be used to provide a safety margin. The safety margin will be reduced by the loading effect of the solenoid on the potential divider.

The zener diode is used to subtract a bias from the voltage swing in order to satisfy the solenoid ON and OFF limits. Considering the diagram below for three transistors OFF and one failed short circuit :


The voltage at the potential divider junction is given by :

$$
V=\frac{1.73 R \times 28}{R+1.73 R}=17.7 V
$$

To ensure that the solenoid is OFF, and including a safety margin of 0.5 V . it is required that $\mathrm{Vs}=2.1 \mathrm{~V}$.

Hence : $\quad V z=(28-17.7)-2.1=8.2 V$

So an 8.2V zener diode is used.
The equivalent series resistance of the solenoid and zener diode is approximately $600 \Omega$. so Rl was chosen to be about a tenth of this value so that the potential divider is not shunted too heavily by the load it is driving. The value chosen was $R 1=68 \Omega \quad$ and $R 2-R 5$ are given by the relation:

$$
R 2=1.73 \times R 1=1.73 \times 68 \approx 120 \Omega
$$

So R2 - R5 were chosen to be $120 \Omega$.

The complete circuit design is now given by


Using the above circuit, the voltages across the solenoid are checked With three transistors. ON the solenoid voltage $=8.7 \mathrm{~V}$ : and with $\therefore$ three transistors OFF the solenoid voltage $=1.9 \mathrm{~V}$. This gives a-0.7V safety: margin on both solenoid limits. The circuit should be set up as follows because of tolerance variations in the resistors and zener diode. The power supply is reduced from 28 V until the solenoid ceases to switch $O N$ and OFF using three of the transistors. The voltage is recorded and is equal to Vmin. The power supply is then increased above 28 V until the solenoid again ceases to function correctly. This voltage is equal to Vmax. Finally the power supply is set mid-way between Vmin. and Vmax.

To protect against fallure open circuit of the zener diode. two diodes are used in parallel. Fallure short circuit of the zener diode is: not fatal as long as no other components fall.

Consideration of the above circultry shows that any component excluding the solenoid can fail. whilst still being able to : switch the solenoid ON and OFF.

| U1-82S101 | U54 - 74LS138 |
| :---: | :---: |
| U2 - 82S101 | U55 - 2141 |
| U3 - 82S101 | U56-2141 |
| U4 - 82S101 | U57-2141 |
| U5 - 82S101 | U58-2141 |
| U6 - 82S101 | U59 - 2141 |
| U7 - 1489 | U60-2141 |
| U8-8212 (1) | U61 - 2141 |
| U9 - 8085 (1) | U62-2141 |
| U10-8085 (2) | U63-2141 |
| U11-8085 (3) | U64 - 2141 |
| U12-8216 (1) | U65-2141 |
| U13-8216 (1) | U66-2141 |
| U14 - 8216 (2) | U67-2141 |
| U15-8216 (2) | U68 - 2141 |
| U16-8216 (3) | U69 - 2141 |
| U17-8216 (3) | U70-2141 |
| U18 - 74LS123 | U71 - 74LS32 |
| U19 - 74LS123 | U72 - 74LS32 |
| U20-74LSO4 | U73 - 74LS32 |
| U21. - 74LS10 | U74 - 74LS32 |
| U22 - 74LS138 | U75 - 74LS09 |
| U23-74LS08 | U76-74LS00 |
| U24 - 8212 (2) | U77-58174 |
| U25-8212 (3) | U78-8216 |
| U26-1488 | U79 - DG508 |
| U27 - 74LS390 | U80 - ZN427 |
| U28 - 74LSOO | U81 - 74LS74 |
| U29 - CA3046 | U82-74LS244 |
| U30-74LS367 | U83-74LS138 |
| U31 - 74LS374 | U84-74LS273 |
| U32 -- 74LS74 | U85-8232 |
| U33 - 74S188 | U86- LM324 |
| U34 - 74S188 | U87 - LM324 |
| U35-74S371 | U88 - LM324 |
| U36-74S371 | U89 - LM324 |
| U37-74LS374 | U90 - LM324 |
| U38 - 81LS97 | U91 - LM324 |
| U39 - 74LS32 |  |
| U40 - 74LST4 |  |
| U41 - 74LS138 |  |
| U42 - 74LS139 |  |
| U43 - 74LS86 |  |
| U44 - 74LS08 |  |
| U45 - Instant ROM |  |
| U46 - Instant ROM |  |
| U47-2516 |  |
| U48-2516 |  |
| U49 - 2516 |  |
| U50-2516 |  |
| U51 - 8216 |  |
| U52-8216 |  |
| U53-81LS97 |  |

APPENDIX 10 Governor Controller Software Listings

The controller software is written as nineteen modules which are linked together to form the complete software package. The module listings are as follows :


```
;Main Calling program
```



```
    NAME ('MAIN)
    CSE6
    EXT WTRAF,DFAULT,MERROR,HESWNC,SNAKE,INITL,MS6E
    EXT TIML06,RBKGEN,CNTRL,PRBUFF,SLFTST,NHOUT
    FUBLIC MAIN,STACK
    OR6 0000H
MAIN: DI ;DISABLE INTERRUPTS
    OUT OOH ;DISABLE RST/HOLD
    JMP MAIN1
;
    ORG 0024H
    JMF UTRAF
    ORE 002BH
    JMF DFAULT ;RST5 SOFTGARE ERROR VECTOR
;
    OR6 002CH
    jMf merrof
    OR6 0034H
    JHP RESYNC
    OR6 0038H
    JMP SNAKE
;
mAIN1: RNI C,OFFH ;INITIALI5E ALL RAM TO FF FOR SNAKE
    LLI H,3000H ;START OF RAM
FILL: MOV M,C
    INX H
    MOV A,H
    CPI 40H
    jNZ FILL ;LOOP UNTIL ALL RAM FILLED
    LXI SP,STACK ;INITIALISE STACK
    CALL INITL ;INITIALISE REGISTERS AND VARIABLES
    LXI H,CLDASG ;LOG COLD RESET
    CALL H56E
    LXI H,ROMMSG
    CALL MS6E
    LDA OOOTH ;READ NUMPER FROH EPROM
    CALL NHOUT ;PRINT IT
    CALL TIMLOE ;AND TIME
    OUT 02H
    ;ENABLE RST/HOLD
    ;GENERATE RECOVERY BLOCK
    ;ENAELE INTERRUPTS AFTER RBKGEN!
    ;RESET HATCHDOES
    ;CONTROL ALGORITHM
    ;PRINT A CHARACTER FROM BUFFER
    ;SELF TEST - SOLENOIDS EVERY 10 MINS.
    ;REPEAT CALLING LOOP
;
CLDM56: DB OAH,'FULL SYSTEM RESET',OOH
;
ROMMS6: DB OAH,'EPFOM SET ',OOH
;
STACK EQU 3FOOH
    END
```



```
;RESYNCHRONISATION ROUTINE - RST6.5 INTERRUPT
;DESTROYS ALL REG5. AND EXECUTES RECOVERY BLOCK
```



```
    NAME ('RESYNC')
    CSEG
    EXT MSGE,##OUT,BLOCK,STACK,CDUTBF
    PUBLIC RESYNC,SRETRY
RESYNC: OUT OOH ;DISABLE RST/HOLD - QUICKER EXECUTION
    LXI SP,STACK ;RESET STACK POINTER
    PUSH PSH ;RESYNCHRONISATION ROUTINE
    PUSH H
    LXI H,OOOOH
    DAD SP
    PUSH H ;PUSH STACK POINTER
    PUSH B
    PUSH D
    POF D ;REVERSE THE PROCESS
    POP B
    POP H
    SPHL ;RESTORE STACK POINTER
    POP H
    POP PSH
    MVI A,40H ;RESET SOD LINE
    SIH
    LXI H,SRETRY
    DCR M ;DECREMENT RETRY COUNTER HELD IN RAM
    JI DISABL ;DISABLE RST 6.5 IF RETRY EXHAUSTED.
    EI ;ENABLE INTERRUPT FOR HARDHARE RETRY
    OUT O2H ;EMABLE RST/HOLD
    IN 02H ;GET CURRENT SYNDROME
    ANI 03H ;MASK OFF ERROR FLAGS
    CPI OJH ;TEST FOR BOTH FLAGS HIGH
    JZ ERRLOG ;NO ERROR SO LOG TRANSIENT
    OUT 07H ;GENERATE HARDMARE RST6.5 INTERRUFT
    HLT ;MAIT FOR INTERRUPT
ERRL06: LXI H,EMSGE ;LOG TRANSIENT ERROR
    JAP NOCHAN
;
DISABL: RIM
    ;SHITCH OUT RST 6.5
    ANI 07H
    ORI OAH
    SIM
    OUT OOH ;DISABLE RST/HOLD
    LXI H,DMSGE ;LOG CPU FAILURE
;
:NOCHAN: CALL HSGE
    IN OIH
    ANI OTH ;STRIP OFF STATUS BITS
                                ;GET LATCHED ERROR SYNDROME
    LXI H,TABLEI; ;LOOK UP TABLE FDR CHANMEL IN ERROR
    ADD L
    HOV L,A
    MOV A,H
    CALL NMOUT ;PRINT IT
    MVI C,OAH
    CALL COUTBF
```

```
;
    LXI H,SYNMSG ;LDE SYNDROME
    CALL HSGE
    IN OIH ;GET LATCHED SYNDROME
    CALL NHOUT ;PRINT IT
;
    LXI H,RETMSG ;LOG NUMBER OF RETRIES
    CALL HSGE
    LXI H,SRETRY
    MVI A,OFFH
    SUB :1
    CALL MMOUT
;
    END: MVI A,OFFH ;RESET RETRY COUNTER
    STA SRETRY ;STORE IN RAM
    OUT O1H ;RESET SYNDROME LATCH FLIP/FLOP
    JMP BLOCK ;JUAP TD RECOVERY BLOCK
;
EMSGE: DB OAH,'SYMC ERR CHANNEL ',OOH
;
    DMSGE: DB OAH,'CPU FAIL CHAMNEL ',OOH
;
    SYNHSG: DB 'SYND=',OOH
;
    RETMSG: DB' RETRIES=',OOH
;
    TABLEI: DB 02H,03H,01H,00H
;
    SRETRY: DS 
    END
```



```
;MEMORY ERROR ROUTINE - RST 5.5 INTERRUPT
;SAVES ALL REGS. UNLESS DBE - RECOVERY BLOCK IF DBE
```



```
;
    RAME ('MERROR')
    CSEG
    EXT MSGE,NHOUT,BLOCK,COUTBF,SFAIL,TIMLOG
    PUBLIC AERROR
MERROR: PUSH PSH ; SAVE REGISTERS
    PUSH H
    PUSH B
    PUSH D
    IN OOH ;SAVE latched SYmDROME & RESET latCh
    MOV C,A ;SAVE SYMDROME IN C REG.
    MVI B,MRETRY ;INITIALISE RETRY COUNTER
MREAD: LXI H,TSTLOC ; IS MRITE/READ TO TESTLOC OK
RMLIOF: XRA A ;CLEAR A
    MOV M,A ;MRITE TO TESTLOC
    MOV A,M ;READ BACK
    ORA A ;SET FLAGS
    JNZ AGAIM ;TRY AGAIM IF READ ERROR
    RIM
    ANI 1OH
    JL RHO1 ;JUMP IF NO RST5.5 GITH ZERO IN A REG
    IN OOH ;OTHERWISE READ SYNDROME
RWO1: MOV E,A ;SAVE SYNDROME IN E REG.
    NVI A,OFFH ;READ/WRITE FF
    HOV m,A ;GRITE TO TESTLOC
    MOV A,M ;READ BACK
    INR A ;FF BECOMES 00
    JZ UPDATE ;O.K. SO CHECK SYMDROME
AGAIN: DCR B ;DECREMENT RETRY COUNTER
    JNZ RWLOOP ;TRY AGAIN IF NOT EXHAUSTED
    LXI H,SFR56 ;LOG TOTAL RAM FAILURE
    JMP SFAIL ;PRINT HESSAGE AND SOFT FAILURE
UPDATE: CALL SYNRD ;READ CURRENT SYNDROKE
    MOV D,A ;SAVE FF.W/R SYNDRDME
    ORA E ;OR IN OO W/R SYNDROME
    Jl ERRLOG ;LOG TRANS ERROR IF BOTH SYMDROMES=00
    IN OOH ;CLEAR SYNDROME LATCH
    DCR B ;DECREMENT RETRY COUNTER
    JNZ HREAD ;RETRY IF NOT EXHAUSTED
    MOU C,E ; OO SYNDROME = DBE ?
    CALL DBE
    J2 DBETST
    MOU C,D
    CALL DBE
    JZ DBETST ;DBE IN CHECK BITS OR HARDUARE
    HOV A,E ;NOT DBE SO ARE OO ,FF SYMDROME EQUAL OR IS OHE
    ORA A ;/EQUAL TO ZERO
    Il RFAIL ;JUMP FOR STUCK AT SIMGLE BIT ERROR
    MOV C,E ;E SYNDROHE NOY ZERO OR harDuARE FAULT SD LOAD
    MOV A,D ; /C REG HITH SYNDROME TO BE LOGGED
    ORA A
    Il RFAIL ;JUMP FOR STUCK AT SBE
```



```
;
SFMS6: DB OAH,'TOTAL RAM FAILURE',OAH,OOH
;
FHSGE: DB OAH, 'RAM FAILURE S=',OOH
;
EMSGE: DE OAH, 'TRAMS RAM ERR }S=\prime,OO
;
HMMSGE: DB OAH,'RAM HHARE FAIL S=',OOH
;
DBFMSG: DB OAH,'CBIT STUCK DBE S=',OOH
;
TSTDTA: DB 33H,55H,66H,99H,OAAH,OCCH,OOH
;
HRETRY EQU 64H
;
    DSE6
    TSTLOC: DS 1 ;RESERVE & BYTE FOR TESTLOC
;
    END
```

| gain control algorithm - dag pressure control |  |  |
| :---: | :---: | :---: |
| ; DESTROYS BC, $\mathrm{HL}, \mathrm{A}, \mathrm{D}$ |  |  |
|  |  |  |
|  |  |  |
| NAME ('CNTRL') |  |  |
| CSE6 |  |  |
| EXT PGOU, POUT, PDELTA,PRESSK |  |  |
| PUBLIC CATRL |  |  |
| CNTRL: | CALL PRESSR | ;READ PRESSURE TRANSDUCERS |
|  | XRA A | ; CLEAR CARRY |
|  | LDA PDELTA | ;GET PRESSURE differental ackus plat |
|  | RAL | ; MULTIPLY BY 2 |
|  | ADI PSET | ;CALCULATE PSET+2PDELTA |
|  | MOV C, A | ; SAVE In C REG. |
|  | LDA PGOV | ;GET GOVERNOR PRESSURE READING |
|  | CHP C | ;COMPARE WITH PSET+2PDELTA |
|  | JI HOLD | ; HOLD PRESSURE If Equal |
|  | JC UP | ;PGOV LOH SO IMCREASE PRESSURE |
| DOUN: | SUB C | ; DECREASE PRESSURE AND CALC. PRESSURE ERROn |
|  | MVI D,OFFH | ; SOLENOID DOUN CONTROL WORD |
|  | JHP BMIDTH | ; MO ACTION IF PRESSURE ERROR ITHINCION |
| UP: | nov B, A | ;IACREASE PRESSURE - REVERSE SUbTRACTON |
|  | MOV $A_{1}$ C |  |
|  | SUB B | ;CALCULATE PRESSURE ERROR |
|  | HVI $\mathrm{D}, 00 \mathrm{H}$ | ;SOLENDID UP CONTROL MORD |
| BUIDTH: | CPI MAX | ; IS PRESSURE ERROR LESS THAN HAX. |
|  | JC HOLD | ;HOLD PRESSURE IF SO |
|  | HOV $A, D$ | ; OTHERWISE ADJUST PRESSURE UP OR DUNH |
|  | JMP SOLDUT | ;OUTPUT TO SOLENOID DRIVERS |
| HOLD: | MVI $\mathrm{A}, \mathrm{OFH}$ | ;SOLENOID HOLD CONTROL HORD |
| SOLOUT: | OUT 04H | ;OUTPUT to solenoid port |
|  | LXI H,1000H | ;DELAY |
| DELAY: | DCR L |  |
|  | JNZ DELAY |  |
|  | DCR H |  |
|  | JNZ DELAY |  |
|  | RET |  |
| ; |  |  |
| MAX | EQU 05 H | ;0.5 MCHES W.6. |
| PSET | EQU 90 | ;9 INCHES M. 6. |
|  | END |  |



```
    ; TAKE TIME avERAgE OF EACH PRESSURE TRANSDUCER, THEN PERFORM
    ;HAjORITY VOTE AND RETURN VALUES - P6OV, POUT, PDELTA
    ;DESTROYS ALL REGISTERS
```



```
    NAFE ('PRESSR')
    CSE6
    EXT MSGE,NMOUT,TIMLOG
    PUBLIC PRESSR,PRETRY,PGOV,POUT,PDELTA
    PRESSR: LXI H,PTABLE ;INITIALISE PRESSURE READINGS TABLE PNTR.
    HVI B,O5H ;CHANNEL COUNTER
    ScAN: CALL PAVE ;TIME AVERAGE FOR 8 READINGS OF CHANNEL
    HOV A,A ;SAVE IN TABLE
    INX H ;IMCREHENT TABLE POINTER
    DCR B ;DECREMENT CHANNEL COUNT
    JP SCAN ;LDOP LATIL FINISHED
    VOTE: LXI H,PTABLE+2 ;START AT END OF TABLE - WORK BACKWARDS
    lXI d,PTABLE+b ;STORE AVERAGES IN TABLE
    CALL AVRGE2 ;CALCULATE 3 DIFFERENT AVES. - CHANNELS 3-5
    LXI H,PTABLE+5 ;REPEAT FJR CHANNELS 0-2
    LXI D,PTABLE+9
    CALL AVRGE2
    LXI H,PTABLE+3 ;TABLE FOR CHANNELS 0-2
    LXI D,PTABLE+9 ;CORRESPONDING AVERAGES
    MVI C,OOH ;CHANNELS OFFSET
    CAlL CHECK ;COMFARE CHANNELS AND AVERAGES AND LOG
    STA PGOY ;/ERRORS - STORE BEST PRESSURE IN PGOV
    LXI H,PTABLE ;REPEAT SAME FOR CHANNELS 3-5
    LXI D,PTABLE+6
    MVI C,O3H
    CALL CHECK
    STA POUT ;STORE BEST PRESSURE IN POUT
    HOV C,A
    LDA PGOU ;CALCULATE PRESSURE DIFFERENTIAL ACROSS
    SUB C
    STA PDELTA
    RET
;
PAVE: PUSH H ;SAVE HL
    LXI H,OOOOH ; 2ERO HL
    LXI D,000OH ; EERD HL
    mVI C,OBH ;READ CHANNEL 8 TIMES
PAVEI: CALL ADREAD ;READ CHANMEL
    MOV E,A
    DAD D ;ADD VALUE TO PREVIOUS SUM
    DCR C ;DECREMENT LOOP COUNTER
    JNZ PAVEI ;LOOP UNTIL DONE 8 TIHES
    HVI C,OTHH ;COUNT 3 SHIFTS RIGHT HL - DIVIDE BY B
PAVE2: MOV A,H
    RAR
    MOV H,A
    MOV A,L
    RAF
    MOV L,A
    OCPC
    jMi EAVEZ ;REFEAT LMTIL S SHIFT GFERGTIUNS
```

|  | $\begin{aligned} & \text { POP H } \\ & \text { RET } \end{aligned}$ | ; RESTORE HL |
| :---: | :---: | :---: |
| ; ADREAD: |  |  |
|  | Mov $A, B$ | ;GET CHANMEL ND. 0-5 |
|  | OUT OSH | ; OUTPUT TO MUX |
|  | OUT 03H | ;START CONUERSION A/D CONVERTOR |
| NEOC: | IN 04H | ; TEST EOC FLAG |
|  | RLC |  |
|  | JNC NEOC | ;LOOP YILL CONVERSION COAFLETE |
|  | IN 03H | ;READ A/D CONVERTOR |
|  | SUI OFFSET | ;SUBTRACT CALIBRATION OFFSET |
|  | RET |  |
| AURGE2: |  |  |
|  | XRA A | ;CLEAR CARRY |
|  | HOV $A, H$ | ;GET CHANNEL C |
|  | HOU C, A | ;SAVE IN C REG. |
|  | DCX H | ;POINT TO CHANNEL B |
|  | ADD $\quad$ M | ;ADD CHANNELS B+C |
|  | RAR | ;DIVIDE BY 2 |
|  | STAK D | ;STORE ( $B+C$ )/2 |
|  | XRA A | ;CLEAR CARRY |
|  | HOV A,C | ;GET CHANMEL C |
|  | DCX H | ;POINT TO CHANNEL A |
|  | INX D | ; INCREMENT TABLE POINTER |
|  | ADD H | ;ADD CHANNELS A+C |
|  | RAR | ;DIVIDE BY 2 |
|  | STAX D | ;STORE ( $A+C) / 2$ |
|  | XRA A | ;CLEAR CARRY |
|  | MOV A, ${ }^{\text {A }}$ | ; GET CHANNEL A |
|  | INX H | ;POINT TO CHANNEL B |
|  | INX D | ; INCREMENT TABLE POINTER |
|  | ADD M | ;ADD CHANNELS A+B |
|  | RAR | ;DIVIDE BY 2 |
|  | STAX D | ;STORE ( $A+B$ )/2 |
|  | RET |  |
| ; <br> AVRGE3: |  |  |
|  | PUSH B | ; SAVE REGISTERS |
|  | PUSH H |  |
|  | PIUSH D |  |
|  | LXI D,0000H | ; INITIALISE SUM |
|  | XRA A | ;CLEAR CARRY |
|  | MOV $A, n$ | ;GEt ist value |
|  | INX H |  |
|  | ADD M | ;ADD 2ND VALUE |
|  | HOV E,A | ; SAVE IN E REG. |
|  | JNC SKIP | ; INCREMENT D IF CARRY |
|  | INR D | - |
| SKIP: | XRA A | ;CLEAR CARRY |
|  | INX H | . |
|  | HOV A, ${ }^{\text {H }}$ |  |
|  | ADD E | ;ADD IN 3RD VALUE |
|  | MOV E,A | ;SAVE IN E REG. |
|  | JNC DIV3 | ; INCREMENT D IF CARRY ELSE SKIF |
|  | INR D |  |
| Qiv3: | LXI H,0000H | ;INITIALISE RESULT |
|  | MVI $\mathrm{C}, 04 \mathrm{H}$ | ; SHIFT AND ADD 4 TIMES EQUALS DIVIDE BY 3 |


| DIV3A: | CALL SHIFT | ; Shift de right one place |
| :---: | :---: | :---: |
|  | CALL SHIFT | ;AND AgAIN |
|  | DAD D | ;ADD TO PARTIAL RESULT |
|  | DCR C | ;DECREMENT SHIFT COUNTER |
|  | JNZ DIV3A |  |
|  | MOV A,L | ; Put result in a reg. |
|  | POP D |  |
|  | POF H |  |
|  | POP B |  |
|  | RET |  |
| ; |  |  |
| SHIFT: | MOV A, D | ;SHIFT D RIGHT THRO CARRY |
|  | RAR |  |
|  | HOV $D, A$ |  |
|  | Hov A,E | ;SHIFT E RIGHt LInking de Shift via carry |
|  | RAR |  |
|  | HOV E,A |  |
|  | RET |  |
| ; |  |  |
| CHECK: | HVI B,O3H | ;CHANMEL COUNTER - 3 CHANMELS |
|  | CALL AVR6E3 | ;Calculate average of 3 Chanmels |
|  | STA PTEMP | ;SAVE RESULT IN CASE IT IS USED |
| CHECK3: | CALL MOD | ; IS MOD ( $A-(\mathrm{B}+\mathrm{C}) / 2)$ ETC. GTR. THAN ALLOMED |
|  | JC CHECK1 | ; JUMP IF NDT |
|  | LDAX D | ; IF SO OVERURITE PRESSURE |
|  | STA PTEMP | ;/hith average of other 2 |
|  | PUSH H | ;SAVE REGS |
|  | PUSH B |  |
|  | LXI H, PRETRY |  |
|  | DCR M | ;DECREMENT RETRY COUNTER |
|  | Jl CHECK2 | ;DO NOT LOG ERROR IF RETRY EXHAUSTED |
|  | LXI H, CHMS6 | ;LOG ERROR |
|  | CALL MS6E |  |
|  | MOV $A, B$ | ;GEt Channel count |
|  | ADD C | ;ADD OFFSET |
|  | CALL NHOUT | ;LOG CHAMNEL IN ERROR |
|  | CALL TIMLO6 | ;AND TIME |
| CHECK2: | MVI $\mathrm{A}, 01 \mathrm{H}$ |  |
|  | STA PRETRY | ; SET RETRY COUATER TO 1 |
|  | POP B | ; RESTORE REGS. |
|  | POP H |  |
| CHECKI: | INX H | ; INCREMENT TABLE COUNTERS |
|  | INX D | ; /TO AEXT CHANNEL |
|  | DCR B | ;DECREHENT LOOP COUNTER |
|  | JNZ CHECK3 | ;LOOP UNTIL ALL 3 CHANNELS DONE |
|  | LDA PTEMP | ;RETURN WITH RESULT IN A REg. |
|  | RET |  |
| ; PL B |  |  |
| H00: | PUSH B | ;SAVE BC |
|  | LDAX D | ; (average of other 2) minus (other value) |
|  | SUB A | ;E.6. (B+C)/2-A |
|  | JNC RODI | ; Jump if result positive |
|  | LDAX D | ; OTHERMISE REVERSE SUBTRACTION ORDER |
|  | HOV $C, A$ |  |
|  | MOV $A, H$ |  |
|  | SUB C |  |

```
MOD1: CPI DELTA ;COMPARE WITH PERHISSABLE DIFFERENCE
    POP B
    RET
;
CHMSE: DB OAH,'PRESSR ERROR CH',OOH
```

```
;
```

;
OFFSET EQU 10
OFFSET EQU 10
dELTA EQU 10
dELTA EQU 10
;1 INCH \#.6.
;1 INCH \#.6.
;1 INCH \#.6.
;1 INCH \#.6.
;
;
DSEG
DSEG
PTABLE: DS 12
PTABLE: DS 12
PTEMP: DS 1
PTEMP: DS 1
PGOV: DS 1
PGOV: DS 1
POUT: DS 1
POUT: DS 1
PDELTA: DS 1
PDELTA: DS 1
PRETRY: DS 1
PRETRY: DS 1
END

```
END
```



```
;SAVES CURRENT PROGRAM STATUS OF ALL REGISTERS
```



```
;
    NAME ('RBKGEN')
    CSE6
    PUBLIC RBKGEN,RBLOCK
    RBK6EN: 
        OUT OOH
        SHLD RBLOCK+6
        XTHL
        SHLD RBLOCK+8 ;SAYE PC RETURN ADDRESS
        XTHL ;RESTORE STACK TOP
        LXI H,0000H
        DAD SP
        SHLD RBLOCK+10 ;SAVE SP
        LXI SP,RBLOCK+6 ;CHANGE SP TO RECOVERY BLOCK ADDRE5S
        PUSH D ; SAVE DE
        PUSH B ;SAVE BC
        PUSH PSH ;SAVE PSH
        RIM
        ;IF RST6.5 DISABLED LEAVE RST/HOLD DISABLED
    AN! 02H
    JNZ RETURN
    OUT O2H
        ;ENABLE RST/KOLD
    RETURN: FOP PSHG
        LHLD RBLDCK+10
        SPHL ;RESTORE SP
        LHLD RBLOCK+6 ;RESTORE HL
        El
        RET
;
    DSE6
RBLOCK: DS 12 ;RESERVE 12 BYTES
    END
```


; RESTORES ALL REGISTERS AND PROGRAM STATUS - COMPLEMENTARY TO RBKGEN
 ;

NAME ('BLOCK')
CSEE
EXT RBLOCK, TIMLDE, MSGE
PUBLIC BLOCK
BLOCK: LXI H,VASG
;LOE VECTORED RECDVERY
CALL HSGE
CALL TIALOG
LXI SP,RBLOCK
;LOAD SP HITH RECOUERY BLOCK START ADDRESS
POP PSW ;RESTORE PSH
POP B ;RESTORE BC
POP D ; RESTORE DE
LHLD RBLOCK +10
SPHL
LHLD RBLOCK +8 ; RESTORE SP

XTHL
LHLD RBLLOCK+6 ;RESTORE HL
RET ; JUMP TO RETURN ADDRESS
;
VA56: DB OAH, 'VECTORED RECOVERY', OOH

END

20
;SELF TEST OF SOLENOID DRIVERS - CAN BE EXPANDED TO TEST ENTIRE SYSTEM (1)

NAME ('SLFTST')
CSE6
EXT MSGE,TIMLDG
PUBLIC SLFTST,TFLAG
SLFTST: IN $14 \mathrm{H} \quad$;READ MINUTES FROM CLOCK
ANI OFH ;SELECT LOMER BITS
CPI OFH ;ILLEGAL READ?
JL SLFTST ; IF SO TRY AGAIN
CPI OOH $\quad$;HIMS $=0$ ?
Jl SLFI ;CLEAR SELF TEST FLAG AND RETURN
XRA A
STA TFLAG
RET
SLFI: LDA TFLAG
INR A
RL
MVI A,OFFH
STA TFLAG
XRA A
OUT 04H
CALL DELAY
HVI $A, 06 H$
CALL ATOD
CPI HIGH
JC ERROR
MVI $A, 07 H$
;TEST FLAG F:OR FFH
; FF BECOMES 00
;RETURN IF FLAG SET
;OTHERUISE SET FLAG AND DO SELF TEST

CALL ATOD
CPI HIGH
JC ERROR MVI A,OFFH OUT O4H CALL DELAY hel $\mathrm{A}, 06 \mathrm{H}$ CALL ATOD CPI LOH JNC ERROR ;ERROR IF TOO HIGH MVI A,OTH ;CHECK OTHER SOLEMOID CALL ATOD CPI LOH JNC ERROR ; ERROR IF TOO HIGH HVI $A, O F H$ OUT O4H
;HOLD PRESSURE VIA SOLENOIDS RET
;
ERROR: LXI H,SOLASG ;LOG FAILURE

CALL HS6E
CALL TIMLOG ;AND TIME
RET
;
ATOD: OUT O5H
OUT O3H ;START CONVERSION
NEOC: IN 04H
RLC

```
    JNC NEOC ;LOOF UNTIL EOC
    IN OSH
    RET
;
DELAY: LXI H,0700H
DELAYI: DCR L
    JNZ DELAYI
        DCR H
        JNZ DELAY!
        RET
;
SOLA56: DB OAH,'SOLENOID FAILURE',OOH
;
HIGH EQU BCH
LOW EQU 61H
;
    DSE6
TFLAG: DS !
    END
```


;USES NO RAM AND THEREFORE NO SUBROUTINES
;PRINTS MESSAGE THEN EXECUTES A SOFTWARE FAILURE - FAIL SAFE

mane ('Sfall')
CSE6
public sfail
SFAIL: MOV C,M
XRA A
ora c
jl SAFE
SFCOUT:
RII
ANI 80 H
jwi SFcout
Mil B, OBH
nwi $\mathrm{A}, \mathrm{OCOH}$ ;CLEAR A
;TEST FOR RULL - END OF MESSAGE
; MAKE PLANT SAFE/SHUT DOLN


SIM
LXI D,0172H ;1200 baUd delay
SFCO2: DCRE
JNZ SFC02
DCR D
JNZ SFCO2
STC
MOU A,C
RAR
MOV C,A
RUI $\mathrm{A}, \mathrm{BOH}$
RAR
XRI BOK
DCR B
JNZ SFCOI
INX H
;GET NEXT CHARACTER iN HESSAGE
JAF SFAIL
SAFE: XRA A
OUT 04H
; TURN BOTH SOLENOIDS DFF - FAILS SAFE
;
;
;
add Extra shut down routimes here
HLT ;halt CONTROLER
END


```
;SYSTEM CRASH - HATCHDDG TIRER CAUSES VECTORED RECOVERY AND LOGS
;TIME AND ADDRESS
;DESTROYS ALL REES.
```



```
NAHE ('HTRAP')
CSEG
EXT MSEE,NHOUT,BLOCK,BUFFST,BUFFRD,BUFFWR,COUT
FUBLIC WTRAP,WRETRY
WTRAP: POF D
LXI H, WRETRY
DCR H
JNZ GTRAP1
HLT
WTRAPL: LXI H,BUFFST
    SHLD BUFFRD
    SHLD BUFFWR
    MVI C,OOH
    CALL COUT
    LXI H,WDMS6
    CALL HSGE
    MOV A,D
    CALL MHOUT
    MOV A,E
    CALL NHOUT
    EI
    JAP BLOCK
    ;EMABLE INTERRUPTS AFTER COUT
    ;VECTORED RECOUERY
;
#DMSE: DR OAH,'TRAP #DOG ADR=',OOH
;
    DSE6
WRETRY: DS 1
;RETRY COUNTER
    END
```

[^2]END

; SOFTHARE ERROR - LOG ADDRESS AND TIME AND EXECUTE VECTORED RECOVERY ;DESTROYS ALL REGS
; $\ddagger$ \#
NAME ('DFAULT')
CSEG
EXT NMOUT, BLOCK,HSEE
PUBLIC DFAULT
DFAULT: POP D
LXI H,SOFMSG ;LOG ERROR
CALL HSGE
MOV A,D
;LOG HIGH ORDER ADDRESS
CALL NMOUT
MOV $A, E$
CALL NBOUT
JMP BLOCK
; VECTORED RECOVERY AND LOG TIME
;
SOFMSG: DB OAH,'SOFT ERROR ADR $=\prime, O O H$

END


```
;INITIALISE REGISTERS AND VARIABLES
```



```
    NAME ('INITL')
    CSEG
    EXT URETRY,SRETRY,PRETRY,TFLAG,COUT
    PUBLIC INITL,BUFFST,BUFFRD,BUFFUR,BUFEND
INITL
    LXI H,00OOH
    LXI B,0000H
    LXI D,0000H
    PISH PSM
    POP PSW
    MV! A,08H
    5IH
    IN OOH ;RESET MEMORY FLAGS LATCH
    OUT OIH ;CLEAR CPU ERROR LATCH
    LXI H,BUFFST ;INITIALISE OUTPUT BUFFER POINTERS
    SHLD BUFFRD
    SHLD BUFFWR
    HVI A,OFFH ;INITIALISE RESYNC RETRY COUNTER
    STA SRETRY
    NVI A,OAH
    STA PRETRY
    MVI A,05H
    STA WRETRY
    XRA A
    STA TFLAG
    HVI C,OOH
    CALL COUT
    RET
;
BUFEND: DN 4000H
BUFFST EQU 3FOOH
;
    DSEG
BUFFRD: DS 2
BUFFHR: DS 2
END
```



```
;OUTPUTS CHARACTER IN C REG. TO SERIAL DEVICE
;DESTRO%S BC,A
```



```
;
    NAME ('COUT')
    CSE6
    PUBLIC COUT
    COUT: PUSH D ;SAVE DE
    OI ;DISABLE INTERRUPTS AND RST/HOLD
    OUT OOH ;/COUT - SPEED SENSITIVE
    MVI B,OBH ;ND. BITS TO SEND
    MVI A,OCOH ;SET START BIT AND SOD ENABLE
    SIM
    LXI D,0172H
    DCR E
    JN2 COZ
    DCR D
    JNZ CO2
    STC ;SET EVENTUAL STOP BITS
    MOV A.C ;ROTATE CHAR, RIGHT FUITINE NEXT
    RAR ;DATA BIT INTO CARRY
    MOV C,A
    HVI A,BOH
    RAR
    XRI BOH
    DCR B
        JN2 CO1
        RIM
        ANI O2H
        JNZ RETURN
        OUT O2H
RETURN:
    POP D
    RET
        ;RESTORE DE - CALLING ROUTINE
        ;/MUST ENABLE INTERRUFTS IF REQUIRED
            END
```



```
;PUTS CHARACTER IN E REG. INTO OUTPUT BUFFER
;DESTROYS A
```



```
    NAME ('COUTBF')
    CSEG
    PUBLIC COUTBF
    EXT BUFFST,BUFEND,BUFFRD,BUFFWR
COUTBF: PUSH H ;SAVE HL REG.
    LHLD BUFFGR ;GET BUFFER URITE POINTER
    INX H ;INCREMENT POINTER
    LDA BUFEND ;TEST FOR END OF BUFFER
    SUB L
    JNI FULTST
    LDA BUFEND+1
    SBE H
    JNZ FULTST
    LXI H,BUFFST ;RESET POINTER TO START OF BUFFER
FULTST: LDA BUFFRD ;TEST FOR BUFFER FULL
    SUB L
    JNZ MRITE
    LDA BUFFRD+1
    SBB H
    JZ RETURN
WRITE: SHLD BUFFUR ;SAVE BUFFER POINTER
    MOV H,C ;SAVE CHARACTER IN BUFFER
RETURH: POP H ;RESTORE HL REG.
    RET
    END
```



```
;PRINT A CHARACYER FROM BUFFER UNLESS EAPTY
```



```
    NAME ('PRBUFF')
    CSEG
    EXT BUFFRD,BUFFHR,BUFEND,BUFFST,COUT
    PUBLIC PRBUFF
PRBUFF: RIM
    ANI 8OH
    RNZ
    LHLD BUFFRD
    LDA BUFFGR
    SUB L
    JNZ READ
    LDA BUFFWR+1
    SBB H
    RZ ;RETURN IF BUFFER EMPTY
READ: INX H ;INCRENENT BUFFER POINTER
    LDA BUFEND ;TEST FOR END OF BUFFER
    SUB L
    JNZ READI
    LDA BUFEND+1
    SBB H
    JN2 READI
    LXI H,BUFFST
READI: SHLD BUFFRD
    MOV C,H
    CALL COUT
    EI
    RET
    END
    ;READ SOD LINE
    ;TEST HSB - PRINTER BUSY LIME
    ;RETURN IF BUSY
    ;TEST IF BUFFER EMPTY
    ;JUMP IF NOT EMPTY
    ;JUMP IF NOT
    ; duMP IF NOT
    ;OTHERUISE SKIP TO BEgINHING OF BUFFER
;STORE NEM BUFFER POINTER
;READ A CHARACTER FROM BUFFER
;PRINT IT
;ENABLE INTERRUPTS DISABLED BY COUT
```



```
;PUTS A MESSAGE POINTED TO BY HL REG. INTO OUTPUT BUFFER
;DESTROYS HL,A
```



```
;
    NAP作 ('MSGE')
    CSE6
    PUBLIC MSGE
    EXT COUTBF
HS6E: PUSH B ;SAVE BC
CHAR: HOV C,M ;EET A CHARACTER
    XRA A
    ORA C
    JI RETURN ;RETURN IF NULL
    CALL COUTBF ;SEND CHARACTER
    INX H
    JHP CHAR ;GET HEXT CHARACTER
RETURN:
POP B
    ;RESTORE BC
RET
END
```



```
;SENDS Nlmber in a REG. TO OUTPUT BUFFER AS 2 HEX DIGITS
;DESTROYS A
```



```
;
    NAME ('NHOUT')
    CSE6
    PUBLIC MMOUT
    EXT COUTBF
    MHOUT: PUSH B
    PUSH FSH ;SAVE ARGUMENT
    RRC ;GET UPPER 4 BITS TO LOWER
    RRC
    RRC
    RRC
    ANI OFH ;SELECT LOB
    CALL CNURT ;CONVERT TO ASCII
    MOV C,A
    CALL CDUTBF ;SEND IT
    POP PSH ;RESTORE ARGUMENT
    ANI OFH ;GET LOMER BITS
    CALL CNVRT ;CONYERT TO ASCII
    MOV C,A
    CALL COUTBF ;SEND IT
    POP B ;RESTORE BC
    RET
;
CNVRT: ORI 3OH ;ADD OFFSET FOR 0-9
    CFI 3AH ; TEST IF A-F
    RC
    ADI 07H
    ;RETURN IF MOT
    ;OTHERHISE ADD FURTHER OFFSET OF 7
;
EAD
```




[^0]:    ** Considerable differences exist in screening specification between MIL-M-38510 Class S and BS9400 Class A.

[^1]:    Indut and Output fields of unused P．farms can be ieft blank Unusedinputs and ontpitis afn frla lerminais inft ionsing

[^2]:     ;SNAKE - INVALID ADDRESS RANGE CAUSES EXECUTION OF RST7 - LO6 ADDRESS, ;TIME AND EXECUTE VECTORED RECDVERY ;DESTROYS ALL REGISTERS
    
    NAME ('SNAKE')
    CSE6
    EXT MMOUT,BLOCK, MSEE,BUFFST,BUFFRD,BUFFUR
    PUBLIC SNAKE
    SNAKE: POP D
    ;GET ERRDR ADDRESS FROM STACK
    LXI H, BUFFST
    ;CLEAR OUTPUT BUFFER IN CASE CORRUPTED
    SHLD BUFFRD
    SHLD BUFFMR
    LXI H, SNMS6
    ;LOG ERROR MESSAGE
    CALL MSGE
    MOV A,D
    CALL MMOUT ;HIGH ORDER ADDRESS
    HOV $A, E$
    CALL MHOUT ;LOW ORDER ADDRESS
    JMP BLOCK ;VECTORED RECOVERY AND LOG TIME
    ;
    SNMS6: DB OAH, 'SMAKE RST7 ADR $=$ ', OOH

