Software implemented fault tolerance for microprocessor controllers: fault tolerance for microprocessor controllers

Wingate, Guy A.S.

How to cite:

Use policy
The full-text may be used and/or reproduced, and given to third parties in any format or medium, without prior permission or charge, for personal research or study, educational, or not-for-profit purposes provided that:

- a full bibliographic reference is made to the original source
- a link is made to the metadata record in Durham E-Theses
- the full-text is not changed in any way

The full-text must not be sold in any format or medium without the formal permission of the copyright holders.

Please consult the full Durham E-Theses policy for further details.
SOFTWARE IMPLEMENTED
FAULT TOLERANCE
FOR MICROPROCESSOR CONTROLLERS

by

Guy A.S. Wingate, B.Sc.(Hons), M.Sc.

A Thesis submitted in fulfilment
of the requirement for the degree of
Doctor of Philosophy

Engineering Science

The University of Durham
1992
DECLARATION

None of the work contained within this thesis has previously been submitted for a degree at this or any other university. The work contained in this thesis is not part of a joint research project.

Copyright ©1992 Guy A.S. Wingate
The copyright of this thesis rests with the author. No quotation from it should be published without Guy A.S. Wingate’s prior written consent, and information derived from it should be acknowledged.
SOFTWARE IMPLEMENTED FAULT TOLERANCE
FOR MICROPROCESSOR CONTROLLERS

Guy A.S. Wingate, B.Sc.(Hons), M.Sc.

ABSTRACT

It is generally accepted that transient faults are a major cause of failure in microprocessor systems. Industrial controllers with embedded microprocessors are particularly at risk from this type of failure because their working environments are prone to transient disturbances which can generate transient faults.

In order to improve the reliability of processor systems for industrial applications within a limited budget, fault tolerant techniques for uniprocessors are implemented. These techniques aim to identify characteristics of processor operation which are attributed to erroneous behaviour. Once detection is achieved, a programme of restoration activity can be initiated.

This thesis initially develops a previous model of erroneous microprocessor behaviour from which characteristics particular to mal-operation are identified. A new technique is proposed, based on software implemented fault tolerance which, by recognizing a particular behavioural characteristic, facilitates the self-detection of erroneous execution. The technique involves inserting detection mechanisms into the target software. This can be quite a complex process and so a prototype software tool called Post-programming Automated Recovery UTility (PARUT) is developed to automate the technique’s application. The utility can be used to apply the proposed behavioural fault tolerant technique for a selection of target processors. Fault injection and emulation experiments assess the effectiveness of the proposed fault tolerant technique for three application programs implemented on an 8, 16, and 32-bit processors respectively. The modified application programs are shown to have an improved detection capability and hence reliability when the proposed fault tolerant technique is applied. General assessment of the technique cannot be made, however, because its effectiveness is application specific.

The thesis concludes by considering methods of generating non-hazardous application programs at the compilation stage, and design features for incorporation into the architecture of a microprocessor which inherently reduce the hazard, and increase the detection capability of the target software. Particular suggestions are made to add a ‘PARUT' phase to the translation process, and to orientate microprocessor design towards the instruction opcode map.
ACKNOWLEDGEMENTS

Firstly, I would like to thank my supervisor Dr. Clive Preece for his tremendous support and encouragement throughout the duration of the research project. I am also grateful to British Gas plc. for funding the research; in particular I would like to thank Dr. Ken Jenkins and colleagues at the Engineering Research Station, Killingworth. In addition, I owe thanks to: Prof. Ed Czeck (Computer Science, Carnegie-Mellon University, USA) for his comments on the work contained within Chapter 5 and 7; Mr. Alan Timothy (Microprocessor Centre, University of Durham) for comments on the design and implementation of an Advanced Micro Device Am29000 processor system C compiler and its implications on the work presented in Chapter 8; Prof. Dan Siewiorek (Carnegie-Mellon University, USA), Dr. Janusz Sosnowski (Warsaw Technical University, Poland), and Prof. Hermann Kopetz (Vienna Technical University, Austria) for the provision of details of their related research; and finally, Mr. Jim Roper (Computer Science, University of Durham) and Dr. Per Nylen (University of Stockholm, Sweden) for access to an Intel 80386 processor and Motorola 68(7)05 processor emulator respectively.

My time at Durham has not been all work! I would like to thank my friends over the years for making my stay so enjoyable: especially the 'tea room boys' – Sabah, Lee, Ken, Jean, and Norman; Hatfield College Middle Common Room; and the university squash and badminton teams for helping me release tension. Thanks also to Grandma for constant provision of orange cake.

Whilst writing this thesis I have phoned home many times to report “It’s finished!”. The natural reply soon became “Except for ... ?”. Lastly then, I would like to thank my parents and my fiancée Sarah for their patient support and to assure them that I really have finished.

(Submitted November 1990)

The examination of this thesis was delayed by illness. I would like to thank the staff of Dryburn Hospital (County Durham), and R.N.H. Haslar (Gosport, Hants) for their care in the intervening period.

(Examined April 1992)
“Bloody instructions, which, being learned, return to plague the inventor.”

'Macbeth': Act 1, Scene 7, Lines 8-10
by William Shakespeare.
Chapter 1

RELIABILITY AND MICROPROCESSOR-BASED CONTROLLERS

1.1. Introduction ........................................... 1
1.2. Microprocessor-Based Control Systems .................. 3
1.3. Faults, Errors, and Failures in Electronic Systems .... 4
1.4. Engineering Reliability Through Design ................ 4
1.4.1. Reliable Hardware ................................ 5
1.4.2. Reliable Software ................................ 5
1.5. Evaluating Controller Reliability ....................... 6
1.5.1. Hardware Reliability ............................... 7
1.5.2. Software Reliability ............................... 7
1.5.3. Interface Reliability ............................... 8
1.6. Low-Cost Enhancement of Controller Reliability .... 8
1.7. Thesis Preview ....................................... 9
# Chapter 2

**TEMPORARY FAULTS: GENERATION, IMPLICATION, & DETECTION**

2.1. Introduction .................................................. 12
2.2. Faults and Their Implication on Microprocessor System Reliability ... 13
2.3. Erroneous Behaviour of Microprocessor Systems ......................... 18
   2.3.1. Data Flow Errors .................................. 21
   2.3.3. Program Flow Errors ................................ 21
2.4. Assessing Error Detection Techniques for Microprocessor Systems ... 23
   2.4.1. Watchdog Timers .................................... 23
   2.4.2. Capability Checking .................................. 23
   2.4.3. Program Flow Monitoring ............................. 24
   2.4.4. Hazards Associated with Error Detection Techniques .......... 27
   2.4.5. A Novel Error Detection Technique ...................... 27
2.5. Reliability Evaluation ........................................ 28
2.6. Summary and Conclusions ...................................... 29

# Chapter 3

**MODELLING ERRONEOUS MICROPROCESSOR BEHAVIOUR**

3.1. Introduction .................................................. 31
3.2. Initiating Erroneous Microprocessor Behaviour .......................... 31
3.3. Erroneous Behaviour .......................................... 32
3.4. Erroneous Execution ......................................... 34
3.5. Halse Execution Model ...................................... 36
3.6. Hybrid Execution Model ..................................... 38
   3.6.1. Linear Erroneous Execution ........................... 38
   3.6.2. Propagating Further Periods of Erroneous Execution .......... 40
   3.6.3. Detection of Erroneous Execution ..................... 41
   3.6.4. Erroneous Execution Stall ............................ 44
3.7. Reliability Analysis ......................................... 45
   3.7.1. Failure Rate .......................................... 46
   3.7.2. Probability of an Event Leading to Failure ................ 48
   3.7.3. Reliability Evaluation ................................ 49
   3.7.4. Mean Time To Failure ................................ 51
3.8. Availability Analysis ........................................ 52
3.9. Summary ..................................................... 53
Chapter 8

GENERATING NON-HAZARDOUS SOFTWARE

8.1. Introduction ......................................................... 162
8.2. Bridging the Semantic Gap ........................................... 163
8.3. The Risks of Erroneous Execution .................................. 163
  8.3.1. Catastrophic Processing Failures ............................... 163
  8.3.2. Critical Hazard Coverage ....................................... 164
8.4. Non-Hazardous Program Area Code ................................. 165
  8.4.1. Hazardous Instruction Formats ................................. 165
  8.4.2. Hazardous Opcodes .............................................. 166
  8.4.3. Hazardous Operands ............................................. 168
    8.4.3.1. Prevention of Address Mode Hazards ..................... 168
    8.4.3.2. Inherent Addressing ....................................... 168
    8.4.3.3. Manipulating Direct Addressing ............................ 169
Chapter 10

CONCLUSION

10.1. Microprocessor Controllers for Industrial Applications ................. 196
10.2. Reliable Microprocessor Controllers ........................................ 196
10.3. Modelling Erroneous Microprocessor Behaviour .............................. 197
10.4. Detecting Erroneous Microprocessor Execution ............................. 199
10.5. Evaluating Fault Tolerance ...................................................... 201
10.6. Generating Non-Hazardous Software ......................................... 201
10.7. Microprocessor Design for Fault Tolerance .................................. 202
10.8. Summary ................................................................................. 202

Bibliography & References ......................................................... 204

Appendix A

INSTRUCTION SET PARAMETERS

A.1. Introduction .............................................................................. 214
A.2. Instructions Influencing Program Flow ....................................... 214
A.3. Microprocessor Jump Type Instruction Data ............................... 215

Appendix B

THE DESIGN OF AN ACCESS GUARDIAN

B.1. Introduction .............................................................................. 222
B.2. An Access Guardian Design ..................................................... 222
B.3. The Address Decoder ............................................................... 225
B.4. The Restart Generator ............................................................. 225
B.5. The Timer Unit .......................................................................... 228
B.6. Design Simulation ..................................................................... 231
B.7. The Design's Hardware Requirement ........................................ 235
B.8. Summary ................................................................................. 235
**Appendix C**

**PARUT AND OTHER RELATED CODE LISTINGS**

<table>
<thead>
<tr>
<th>Section</th>
<th>Description</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>C.1</td>
<td>Introduction</td>
<td>237</td>
</tr>
<tr>
<td>C.2</td>
<td>PARUT Listing</td>
<td>237</td>
</tr>
<tr>
<td>C.3</td>
<td>Microprocessor Description File, MICRO_FILE</td>
<td>257</td>
</tr>
<tr>
<td>C.4</td>
<td>Target Software, CODE_FILE</td>
<td>261</td>
</tr>
<tr>
<td>C.5</td>
<td>Target Software with Fault Tolerance, RESULT_FILE</td>
<td>263</td>
</tr>
<tr>
<td>C.6</td>
<td>PARUT Report File, ANALYSIS_FILE</td>
<td>266</td>
</tr>
<tr>
<td>C.7</td>
<td>PARUT Diagnostics, TRACE_FILE</td>
<td>269</td>
</tr>
</tbody>
</table>

**Appendix D**

**EXAMPLE PROGRAMS**

<table>
<thead>
<tr>
<th>Section</th>
<th>Description</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>D.1</td>
<td>Introduction</td>
<td>273</td>
</tr>
<tr>
<td>D.2</td>
<td>Program 'A' Targeting the Motorola 68000 Microprocessor</td>
<td>274</td>
</tr>
<tr>
<td>D.3</td>
<td>Program 'B' Targeting the Motorola 68(7)05 Microprocessor</td>
<td>284</td>
</tr>
<tr>
<td>D.4</td>
<td>Program 'C' Targeting the Intel 80386 Microprocessor</td>
<td>288</td>
</tr>
</tbody>
</table>

**Appendix E**

**PUBLICATIONS**

<table>
<thead>
<tr>
<th>Section</th>
<th>Description</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>E.1</td>
<td>Introduction</td>
<td>297</td>
</tr>
<tr>
<td>E.2</td>
<td>EUROMICRO '88 Paper</td>
<td>299</td>
</tr>
<tr>
<td>E.3</td>
<td>EUROMICRO '89 Paper</td>
<td>307</td>
</tr>
<tr>
<td>E.4</td>
<td>IEE '89 Paper</td>
<td>315</td>
</tr>
<tr>
<td>E.5</td>
<td>IEE '90 Paper</td>
<td>318</td>
</tr>
<tr>
<td>E.6</td>
<td>EUROMICRO '91 Paper</td>
<td>323</td>
</tr>
<tr>
<td>Figure</td>
<td>Description</td>
<td>Page</td>
</tr>
<tr>
<td>--------</td>
<td>--------------------------------------------------</td>
<td>------</td>
</tr>
<tr>
<td>2.1</td>
<td>Fault and Error Latencies</td>
<td>14</td>
</tr>
<tr>
<td>3.1</td>
<td>Microprocessor Erroneous Behaviour</td>
<td>33</td>
</tr>
<tr>
<td>3.2</td>
<td>Erroneous Execution Model</td>
<td>35</td>
</tr>
<tr>
<td>3.3</td>
<td>Reliability Model</td>
<td>37</td>
</tr>
<tr>
<td>4.1</td>
<td>8-Bit Microprocessor Linear Erroneous Execution</td>
<td>61</td>
</tr>
<tr>
<td>4.2</td>
<td>16-Bit Microprocessor Linear Erroneous Execution</td>
<td>62</td>
</tr>
<tr>
<td>4.3</td>
<td>32-Bit Microprocessor Linear Erroneous Execution</td>
<td>63</td>
</tr>
<tr>
<td>4.4</td>
<td>Catastrophic Failure - Instruction Mix Analysis</td>
<td>68</td>
</tr>
<tr>
<td>4.5</td>
<td>Recovery Through Detection - Instruction Mix Analysis</td>
<td>70</td>
</tr>
<tr>
<td>4.6</td>
<td>Microprocessor Reliability - Instruction Mix Analysis</td>
<td>72</td>
</tr>
<tr>
<td>4.7</td>
<td>Microprocessor Availability - Instruction Mix Analysis</td>
<td>75</td>
</tr>
<tr>
<td>5.1</td>
<td>Functional Address Space Allocation</td>
<td>80</td>
</tr>
<tr>
<td>5.2</td>
<td>The IEJ Characteristic</td>
<td>82</td>
</tr>
<tr>
<td>5.3</td>
<td>Embedded Access Guardian</td>
<td>86</td>
</tr>
<tr>
<td>5.4</td>
<td>Access Guardian</td>
<td>86</td>
</tr>
<tr>
<td>5.5</td>
<td>The HIT Function</td>
<td>89</td>
</tr>
<tr>
<td>5.6</td>
<td>The SEJ Characteristic</td>
<td>90</td>
</tr>
<tr>
<td>5.7</td>
<td>Detection Mechanism Constructions</td>
<td>93</td>
</tr>
<tr>
<td>5.8</td>
<td>Detection Mechanism Placement</td>
<td>94</td>
</tr>
<tr>
<td>5.9</td>
<td>Erroneous Execution Model : Enhanced Fault Recovery</td>
<td>100</td>
</tr>
<tr>
<td>5.10</td>
<td>Microprocessor Reliability with Access Guardian</td>
<td>102</td>
</tr>
<tr>
<td>5.11</td>
<td>Enhanced Microprocessor Reliability</td>
<td>105</td>
</tr>
<tr>
<td>6.1</td>
<td>PARUT Overview</td>
<td>109</td>
</tr>
<tr>
<td>6.2</td>
<td>Program ‘MAIN’ Call Chart</td>
<td>113</td>
</tr>
<tr>
<td>6.3</td>
<td>Screen Dump of PARUT User Interface</td>
<td>117</td>
</tr>
<tr>
<td>6.4</td>
<td>Algorithmic Processing of Invalid Branches</td>
<td>118</td>
</tr>
<tr>
<td>6.5</td>
<td>A Complex Example of Algorithmic Processing</td>
<td>121</td>
</tr>
<tr>
<td>7.1</td>
<td>Basic Microprocessor Topology</td>
<td>129</td>
</tr>
<tr>
<td>7.2</td>
<td>Microprocessor Laboratory System Topology</td>
<td>131</td>
</tr>
<tr>
<td>7.3</td>
<td>Nature of Fault Injection Outcomes</td>
<td>142</td>
</tr>
</tbody>
</table>
TABLES

Table 2.1. : Observed Temporary & Permanent Errors ........................................ 15
Table 2.2. : Fault Emulation/Simulation Experiments ........................................ 19
Table 2.3. : Fault Injection Experiments .......................................................... 20
Table 2.4. : Capability Checks ...................................................................... 25

Table 4.1. : Microprocessor Instruction Set Evaluation ........................................ 56
Table 4.2. : Microprocessor Undefined Instruction Evaluation .......................... 57
Table 4.3. : Random Data Instruction Interpretation .......................................... 59
Table 4.4. : Microprocessor 'Mean Time To Failure' Evaluation ......................... 73

Table 5.1. : MTTF with Unused Area Detection .................................................. 103
Table 5.2. : MTTF Enhancement with Used Area Detection ............................... 103

Table 7.1. : Single-Bit Fault Injection Programme .............................................. 136
Table 7.2. : Fault Injection Outcomes ............................................................... 139
Table 7.3. : Nature of Fault Injection Outcomes ................................................ 141
Table 7.4. : Observed Behaviour of Program 'A' ............................................... 149
Table 7.5. : Observed Behaviour of Program 'B' ............................................... 153
Table 7.6. : Observed Behaviour of Program 'C' ............................................... 154
Table 7.7. : Erroneous Infinite Execution Loop .................................................. 159
Table 7.8. : Placement Deadlock ...................................................................... 159

Table A.1. : MC 6800 Microprocessor Instruction Set Evaluation ....................... 217
Table A.2. : Intel 8048 Microprocessor Instruction Set Evaluation ...................... 217
Table A.3. : Intel 8085 Microprocessor Instruction Set Evaluation ...................... 217
Table A.4. : Intel 8086 Microprocessor Instruction Set Evaluation ...................... 217
Table A.5. : MC 68000 Microprocessor Instruction Set Evaluation ...................... 217
Table A.6. : MC 68010 Microprocessor Instruction Set Evaluation ...................... 217
Table A.7. : AMD 29000 Microprocessor Instruction Set Evaluation ................... 217
Table A.8. : MC 68020 Microprocessor Instruction Set Evaluation ...................... 217
Table A.9. : Intel 80386 Microprocessor Instruction Set Evaluation .................... 217
Table A.10. : MC 6800 Jump Instructions .......................................................... 218
Table A.11. : Intel 8048 Jump Instructions ....................................................... 218
Table A.12. : Intel 8085 Jump Instructions ....................................................... 219
Table A.13. : Intel 8086 Jump Instructions ....................................................... 219
<table>
<thead>
<tr>
<th>Table A.14.</th>
<th>MC 68000, MC 68010, and MC 68020 Jump Instructions</th>
<th>220</th>
</tr>
</thead>
<tbody>
<tr>
<td>Table A.15.</td>
<td>AMD 29000 Jump Instructions</td>
<td>221</td>
</tr>
<tr>
<td>Table A.16.</td>
<td>Intel 80386 Jump Instructions</td>
<td>221</td>
</tr>
<tr>
<td>-------------</td>
<td>---------------------------------------------------</td>
<td>-----</td>
</tr>
<tr>
<td>Table B.1.</td>
<td>Truth Table for SRFF Control Logic</td>
<td>227</td>
</tr>
<tr>
<td>Table B.2.</td>
<td>Set-Reset Flip Flop Transition Table</td>
<td>227</td>
</tr>
<tr>
<td>Table B.3.</td>
<td>Access Guardian Parts List</td>
<td>236</td>
</tr>
<tr>
<td>-------------</td>
<td>---------------------------------------------------</td>
<td>-----</td>
</tr>
<tr>
<td>Table C.1.</td>
<td>Chronological Order of Functions in PARUT Listing</td>
<td>238</td>
</tr>
<tr>
<td>Abbreviation</td>
<td>Description</td>
<td></td>
</tr>
<tr>
<td>--------------</td>
<td>-------------</td>
<td></td>
</tr>
<tr>
<td>ADB</td>
<td>UNIX 'A DeBugger' Facility</td>
<td></td>
</tr>
<tr>
<td>ALU</td>
<td>Arithmetic Logic Unit</td>
<td></td>
</tr>
<tr>
<td>AMD</td>
<td>Advanced Micro Device</td>
<td></td>
</tr>
<tr>
<td>BG-Rig</td>
<td>British Gas Rig (experimental)</td>
<td></td>
</tr>
<tr>
<td>BIC</td>
<td>Bus Interface Circuitry</td>
<td></td>
</tr>
<tr>
<td>Cdf</td>
<td>Cumulative Density Function</td>
<td></td>
</tr>
<tr>
<td>CISC</td>
<td>Complex Instruction Set Computer</td>
<td></td>
</tr>
<tr>
<td>CLASSIC</td>
<td>Custom Logic Analysis Simulation System for Integrated Circuits</td>
<td></td>
</tr>
<tr>
<td>Cm*</td>
<td>Carnegie-Mellon Multi-processor</td>
<td></td>
</tr>
<tr>
<td>CMOS</td>
<td>Complementary Metal-Oxide Semiconductor</td>
<td></td>
</tr>
<tr>
<td>CMUA</td>
<td>Carnegie-Mellon University 'A' File System</td>
<td></td>
</tr>
<tr>
<td>CMU-AFS</td>
<td>Carnegie-Mellon University Andrew File System</td>
<td></td>
</tr>
<tr>
<td>C.vmp</td>
<td>Carnegie-Mellon Voting Multi-processor</td>
<td></td>
</tr>
<tr>
<td>DIS</td>
<td>UNIX 'DIS-assembler' Facility</td>
<td></td>
</tr>
<tr>
<td>ECL</td>
<td>Emitter-Coupled Logic</td>
<td></td>
</tr>
<tr>
<td>EMI</td>
<td>Electro-Magnetic Interference</td>
<td></td>
</tr>
<tr>
<td>ESD</td>
<td>Electro-Static Discharge</td>
<td></td>
</tr>
<tr>
<td>FMECA</td>
<td>Failure-Mode, Effects and Criticality Analysis</td>
<td></td>
</tr>
<tr>
<td>FORTRAN</td>
<td>'Formula Translation' Programming Language</td>
<td></td>
</tr>
<tr>
<td>FTA</td>
<td>Fault Tree Analysis</td>
<td></td>
</tr>
<tr>
<td>FTMP</td>
<td>Fault Tolerant Multi-Processor</td>
<td></td>
</tr>
<tr>
<td>IBM</td>
<td>International Business Machines</td>
<td></td>
</tr>
<tr>
<td>IC</td>
<td>Integrated Circuit</td>
<td></td>
</tr>
<tr>
<td>IEE</td>
<td>Institution of Electrical Engineers</td>
<td></td>
</tr>
<tr>
<td>IEEE</td>
<td>Institute of Electrical and Electronics Engineers</td>
<td></td>
</tr>
<tr>
<td>IEJ</td>
<td>Initial Erroneous Jump</td>
<td></td>
</tr>
<tr>
<td>IPQ</td>
<td>Instruction Prefetch Queue</td>
<td></td>
</tr>
<tr>
<td>HDL</td>
<td>Hardware description Language</td>
<td></td>
</tr>
<tr>
<td>JPI</td>
<td>Jensen and Partners International</td>
<td></td>
</tr>
<tr>
<td>LP</td>
<td>Low Pressure</td>
<td></td>
</tr>
<tr>
<td>MC</td>
<td>Motorola Corporation</td>
<td></td>
</tr>
<tr>
<td>MTTF</td>
<td>Mean Time To Failure</td>
<td></td>
</tr>
<tr>
<td>MTTR</td>
<td>Mean Time To Repair</td>
<td></td>
</tr>
<tr>
<td>NASA</td>
<td>National Aeronautics Space Administration</td>
<td></td>
</tr>
<tr>
<td>NMOS</td>
<td>N-Type Metal-Oxide Semiconductor</td>
<td></td>
</tr>
<tr>
<td>Acronym</td>
<td>Definition</td>
<td></td>
</tr>
<tr>
<td>---------</td>
<td>------------</td>
<td></td>
</tr>
<tr>
<td>PARUT</td>
<td>Post-programming Automated Recovery UTility</td>
<td></td>
</tr>
<tr>
<td>PC</td>
<td>Program Counter</td>
<td></td>
</tr>
<tr>
<td>PCU</td>
<td>Program Control Unit</td>
<td></td>
</tr>
<tr>
<td>pdf</td>
<td>Probability Density Function</td>
<td></td>
</tr>
<tr>
<td>RAM</td>
<td>Random Access Memory</td>
<td></td>
</tr>
<tr>
<td>RISC</td>
<td>Reduced Instruction Set Computer</td>
<td></td>
</tr>
<tr>
<td>RFI</td>
<td>Radio-Frequency Interference</td>
<td></td>
</tr>
<tr>
<td>ROM</td>
<td>Read Only Memory</td>
<td></td>
</tr>
<tr>
<td>SEJ</td>
<td>Subsequent Erroneous Jump</td>
<td></td>
</tr>
<tr>
<td>SLAC</td>
<td>Stanford Linear Accelerator Centre</td>
<td></td>
</tr>
<tr>
<td>SRFF</td>
<td>Set-Reset Flip Flop</td>
<td></td>
</tr>
<tr>
<td>TMR</td>
<td>Triple Modular Redundancy</td>
<td></td>
</tr>
<tr>
<td>TTL</td>
<td>Transistor-Transistor Logic</td>
<td></td>
</tr>
<tr>
<td>VLSI</td>
<td>Very Large Scale Integration</td>
<td></td>
</tr>
</tbody>
</table>
LIST OF SYMBOLS

\( A \) Set of Faults Detected by Access Guardian
\( A_d \) Physical Address in the Address Space
\( A_v \) Availability
\( C_n \) Ripple Counter Base
\( D(k) \) Probability of Detecting Erroneous Execution
\( D_{IEJ}(k) \) Probability of Detecting Erroneous Execution Following an IEJ
\( D_{SEJ}(k) \) Probability of Detecting Erroneous Execution Following an SEJ
\( E \) Error Detection Coverage
\( E_f \) Event Leading to Failure
\( E_r \) Event Leading to Recovery
\( E_s \) Set of Events Disrupting Processor Execution
\( f(t) \) Probability Density Function
\( F \) Set of Injected Faults
\( F(t) \) Cumulative Density Function
\( H(l,j) \) HIT Function: Probability of an SEJ Target in Used Area
\( i \) Number of Defined Opcodes within Instruction Set
\( I \) Mean Period of Linear Erroneous Execution, (No. Instructions)
\( I_f \) Microprocessor Instruction Frequency, (No. Instructions/s)
\( j \) Jump Type Instruction
\( J \) Set of Jump Type Instructions
\( J' \) Return or Unspecified Jump Outcome
\( k \) Instruction Index During Linear Erroneous Execution
\( l \) Jump Type Instruction Location in Used Area
\( L \) Set of Address Space Locations
\( L_d \) Error Detection Latency (No. Instructions)
\( L_s \) Stalling Latency (No. Instructions)
\( m \) Integer
\( M \) Set of Faults Detected by Hardware
\( n \) Number of Bits in Opcode Format
\( N_{eJ} \) Effective Number of 'Jump' Outcome Instructions
\( N_{eRN} \) Effective Number of 'Return' Outcome Instructions
\( N_{eRT} \) Effective Number of 'Restart' Outcome Instructions
\( N_{esw} \) Effective Number of 'Stop/Wait' Outcome Instructions
\( N_{eUJ} \) Effective Number of 'Unspecified Jump' Outcome Instructions
\( N_L \) Number of Locations Open to Instruction Interpretation
Number of Instructions in Recovery Routine

Set of Faults Detected by Software

Probability of Controlled State

Probability of IEJ Target in Used Area

Probability of IEJ Target in Unused Area

Probability of a 'Jump' Instruction Outcome

Functional Probability of a 'Jump' Outcome

Probability of a 'Jump' Instruction Outcome

Functional Probability of a 'J' Outcome

Functional Probability of a 'J' Outcome Following an IEJ or SEJ

Functional Probability of Linear Erroneous Execution

Probability of a 'Non-Jump' Instruction Outcome

Functional Probability of a 'Non-Jump' Instruction Outcome

Functional Probability of a 'Return' Instruction Outcome

Functional Probability of a 'Restart' Instruction Outcome

Probability of SEJ Target in Used Area

Probability of SEJ Target in Unused Area

Functional Probability of a 'Stop/Wait' Instruction Outcome

Functional Probability of a 'Stop/Wait' Instruction Outcome

Functional Probability of a 'Unspecified Jump' Instruction Outcome

Functional Probability of a 'Unspecified Jump' Instruction Outcome

Probability Function

Event Rate, (per hour)

Set of Faults Generating Re-synchronization

Reliability Function

Reliability Signature Monitoring Hardware

Reliability of Microprocessor Employing Signature Monitoring

Return Outcome

Restart Outcome

Number of Bytes in Memory Fetch

Probability of Erroneous Execution Stall

Probability of Erroneous Execution Stall following an IEJ

Probability of Erroneous Execution Stall following an SEJ

Stop/Wait Outcome

Time, (s)

Set of Locations Addressed by Jump Type Instruction

Unspecified Jump Outcome

Member of the Set \{RN, RT, SW, UJ\}
\( y \)  
Member of the Set \( \{IEJ, SEJ\} \)

\( Z(t) \)  
Hazard Rate, (per hour)

\( \beta \)  
Probability Hardware Exception: ‘Restart’ Outcome

\( \gamma \)  
Probability Software Exception: ‘Restart’ Outcome

\( \lambda \)  
Constant Failure Rate, (per hour)

\( \lambda(t) \)  
Failure Rate, (per hour)

\( \Psi \)  
Recursive Element of Function \( D(k) \) and \( S(k) \)
# Chapter 1

## RELIABILITY AND MICROPROCESSOR-BASED CONTROLLERS

<table>
<thead>
<tr>
<th>Section</th>
<th>Title</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.1</td>
<td>Introduction</td>
<td>1</td>
</tr>
<tr>
<td>1.2</td>
<td>Microprocessor-Based Control Systems</td>
<td>3</td>
</tr>
<tr>
<td>1.3</td>
<td>Faults, Errors, and Failures in Electronic Systems</td>
<td>4</td>
</tr>
<tr>
<td>1.4</td>
<td>Engineering Reliability Through Design</td>
<td>4</td>
</tr>
<tr>
<td></td>
<td>1.4.1. Reliable Hardware</td>
<td>5</td>
</tr>
<tr>
<td></td>
<td>1.4.2. Reliable Software</td>
<td>5</td>
</tr>
<tr>
<td>1.5</td>
<td>Evaluating Controller Reliability</td>
<td>6</td>
</tr>
<tr>
<td></td>
<td>1.5.1. Hardware Reliability</td>
<td>7</td>
</tr>
<tr>
<td></td>
<td>1.5.2. Software Reliability</td>
<td>7</td>
</tr>
<tr>
<td></td>
<td>1.5.3. Interface Reliability</td>
<td>8</td>
</tr>
<tr>
<td>1.6</td>
<td>Low-Cost Enhancement of Controller Reliability</td>
<td>8</td>
</tr>
<tr>
<td>1.7</td>
<td>Thesis Preview</td>
<td>9</td>
</tr>
</tbody>
</table>
CHAPTER ONE

RELIABILITY AND MICROPROCESSOR-BASED CONTROLLERS

1.1. Introduction

The continuing technological evolution of microprocessor design has resulted in a large number of commercially available devices with a wide variety of characteristics. Many microprocessors are embedded within control systems to automatically operate, monitor, or control a physical process. Applications range from relatively simple control of domestic appliances such as toasters or washing machines, to the complex control of industrial plant such as power stations or chemical works. An important feature of all these control systems is their reliability, that is, the probability that the system will perform its function under stated environmental conditions, without malfunction, over a specified period of time or operational duration (adapted from [Bennetts, 1979]).

British Gas use microprocessor systems to manage individual governors controlling the transmission of gas to industrial and public customers. Such microprocessor controllers require high reliability because of the proximity of their installation, with the hazard of gas, to the general public.

The United Kingdom gas distribution system involves the transmission of gas at high- and medium- low pressures between the off-shore gas fields, local consumer districts, and end-customers respectively. The low pressure (LP) system dates back to the production of town gas in the 19th century and serves local districts and individual users. The LP network has traditionally been controlled by independent pneumatic governors. Governors are devices which control, through a valve action, the gas pressure in a pipeline. The governors implement a clocking mechanism which alters the gas pressure depending on predicted daily fluctuations in demand. Gas pressures that fall below a critical level allow air to enter the gas pipework and can produce potentially explosive mixtures. To prevent this situation arising the governor system implements a shut-down operation, called 'slam-open', when the
gas pressure falls at an excessive rate or when the gas pressure falls beneath the statutory safety limit of 12.3 mbar. The slam-open operation involves completely opening the governor valve so that adjacent normal and hazardous gas pressures across the valve are equalized. Slam-open operations may cascade through several consecutive governor systems before the mean pressure along a section of pipeline is acceptable.

In order to increase management efficiency of the LP network it is necessary to improve control of the gas distribution. The small value of gas handled by each governor system in the gas network means that a low-cost upgrade is required. To this end, microprocessors have been embedded within the governor control systems [Clark et al, 1987] to facilitate more effective and integrated control of the LP gas distribution system [Wynne et al, 1988]. The distribution of gas can now be managed in an efficient and interactive manner. Seasonal and diurnal load variations can be monitored and appropriate responsive action taken automatically in a real-time environment. Stringent safety regulations of the gas industry require the back-up of electronic systems failure by traditional pneumatic slam-open operation.

British Gas predict failures of the microprocessor assisted governors to occur once every 10 years, a failure rate approximately one thousandth of the original pneumatic governor systems [Clark et al, 1987]. The slam-open back-up ensures that microprocessor controller failures are not catastrophic. Nevertheless, loss of the governor management function incurs a financial penalty; reduced gas distribution efficiency, and repair of the controller which may reside at a remote site. An overall improvement in the reliability of the microprocessor controller is required. British Gas are particularly interested in the effect of software, 'which may become highly unpredictable' under the influence of hardware faults, on system reliability [Clark et al, 1987].

This thesis presents techniques for enhancing the operational reliability of microprocessor-based controllers without adversely increasing their cost. These techniques are applicable to any microprocessor system, including those used by British Gas to manage governors.
1.2. Microprocessor-Based Control Systems

Many control systems have their designs based on complex logic circuits incorporating flip-flops, analogue-to-digital converters, shift-registers, and other logic gate structures. In these cases it is often convenient to incorporate, or replace, such circuitry with a dedicated microprocessor and its support chips. The control system behaviour can now to a large extent be governed by the software stored on a microprocessor memory chip. Software can be maintained without physically altering the system hardware. Such flexibility can be valuable as in a recently reported incident when a car manufacturer discovered a design error in a fuel injection system [IEEE, 1989]. Replacing the system with an alternative was extremely expensive. However, replacement was not necessary: the system was controlled by a microprocessor which was re-programmed to compensate for, and effectively mask, the design error. The problem was rectified at little cost. In such instances the maintenance engineers must be careful not to introduce new errors (the 'Software Death Cycle', [Rigby & Norris, 1990]).

In recent years digital techniques have become so powerful that tasks well suited to analogue systems are often partially or totally controlled by digital systems. For example, a temperature meter based on a thermocouple or thermistor might incorporate a microprocessor and memory in order to improve accuracy by compensating for the instrument's departure from perfect linearity.

Although ever more powerful microprocessors are being developed, most controllers do not require further advanced processing capabilities. A recent Japanese survey [Fujimura, 1989] reported that approximately 80% of microprocessor-based controllers incorporated either an 8 or 16-bit machine. Despite the commercial availability of 32-bit processors for over five years, they were only used in approximately 11% of the controllers. The remaining 9% of controllers had embedded 4-bit microprocessors. Obviously the 4-bit microprocessor-based controllers had a very simple function.
1.3. Faults, Errors, and Failures in Electronic Systems

A system failure is said to occur when the behaviour of the system first deviates from that required by the specification of the system (as defined by [Anderson & Lee, 1982]). System failures are caused by the external exposure of a defective internal state. Deficiencies in the internal state of a system, referred to as ‘errors’, can exist without the generation of a failure.

A system consists of a set of components (or sub-systems) which interact under the control of a design. Errors originate from the activation of defective system components. Defective components are referred to as ‘faults’.

A fault in a digital electronic system is characterized by its nature, extent, and duration [Avizienis, 1976]. The nature of the fault can be classified as either logical or non-logical. A logical fault causes the logic value at a point in the digital circuit to become opposite to the specified value. Non-logical faults include the remaining faults such as a malfunction of the clock signal. The extent of a fault specifies whether the effect of the fault is localized or distributed in the digital system. Finally, the duration of a fault refers to whether the fault is permanent or temporary.

McCluskey & Wakerly [1981] distinguish between two classes of temporary fault, transient and intermittent. Transient faults are non-recurring temporary faults which are caused by environmental influences. They are not repairable because there is no physical damage to the hardware. Intermittent faults are recurring temporary faults caused by deteriorating or ageing hardware. Intermittent faults may eventually become permanent and can be repaired.

1.4. Engineering Reliability Through Design

Reliability can be engineered in a digital system by implementing a disciplined design process. The reliability of a microprocessor-based system depends on its hardware and software design, deficiencies in either are expensive to correct. It is therefore prudent, when developing reliable systems, for the design to be fault-free or fault-tolerant.
1.4.1. Reliable Hardware

Poor specification, design, and manufacture can individually or collectively intro­duce faults within digital electronic systems. Specifications are normally written in natural language which makes their integrity extremely difficult to check. Specifications can be written using Formal Methods, enabling designs to be proven to comply with their specification, but this technique does not ensure that the specification itself is defect-free [Cullyer, 1988].

The integrity of manufacture can be validated using 'black-box' tests. Digital systems, however, can be complex and comprehensive black-box testing extremely expensive. Intel only test 98% of the logic nodes for faults in each manufactured 80486 microprocessor (even though untested nodes may be faulty) because, as for many other digital systems, complete testing is considered prohibitively expensive [IEEE, 1990].

The methods outlined above for the procurement of reliable hardware are all fault avoidance techniques. A complementary approach involves tolerating faults through the implementation of special design features. Fault tolerant techniques can be divided into those that detect faults and initiate recovery such as parity checking and watchdogs, and those that mask faults such as Triple Modular Redundancy (TMR) and error-correcting codes [Carter, 1985].

1.4.2. Reliable Software

Software faults (commonly called ‘bugs’) can arise from the specification, design, or coding process. Typically more than half the faults which are recorded during the software development originate in the specification [O’Connor, 1985]. This is mainly due to the use of natural language for documenting the ‘non-technical’ user requirements specification [Hitt & Webb, 1985]. Engineering principles are being proposed [Sommerville, 1985] to enable defect-free development and maintenance of software.

Software verification involves semantic and syntactic checks on the program code for programmer error, and structured walk-through checks for functional correctness. Black-box tests can be used to identify faults but the complexity of software often
prevents exhaustive checking due to prohibitive costs. The complexity of software
testing can be reduced by adopting a modular code structure. Many methods have
been proposed to assess acceptable test-set coverage for software [Musa et al, 1987]
but they all are subject to the Dijkstra maxim ‘testing reveals the presence of faults,
not their absence’ [Dijkstra, 1972].

Software can be manipulated to tolerate faults. Two well known approaches to
fault tolerant software are N-Version Programming [Chen & Avizienis, 1978], and
Recovery Blocks [Randall, 1975]. Both techniques rely on design diversity, the avail­
ability of multiple implementation of a specification, to tolerate faults. N-Version
Programming requires the independent implementation of multiple, ‘N’, versions of
the specification. These versions are processed in parallel with the same inputs. A
voter collects the outputs and a majority decision made to select the perceived correct
output. Theory implies high reliability for this method, but in practice the multiple
program versions can share common mode failures [Eckhardt & Lee, 1988].

Recovery Blocks consist of a primary routine, which normally performs a task; an
acceptance test which checks the primary routine result; and an alternative routine
which is executed if the check fails. Unlike N-Version Programming where routine
independence is assumed, Recovery Blocks require ensured independence between
the primary routine, the acceptance test, and the alternative routine. Application
of recovery blocks improves reliability. The degree of reliability can be enhanced by
extending the number of independent alternative routines ensuring each acceptance
test is also wholly independent.

Software data structures can also be manipulated so that they can tolerate the
presence of faults. Taylor et al [1980] briefly outline this topic and propose fault
tolerant structures for linear lists and binary trees.

1.5. Evaluating Controller Reliability

To evaluate the reliability of a microprocessor-based controller it is necessary to
apply a ‘systems approach’. The systems approach involves integrating the inter­
dependencies of all sub-systems constituting a whole system. Microprocessor-based
controllers consist of two entities; hardware and software. Many authors, including
Hitt & Webb [1985], and Ferrara et al [1989], integrate calculations for hardware and software reliability. However such reliability assessments do not involve any allowance for the internal interaction of hardware faults on software. Internal hardware/software interaction occurs across what is referred to as the 'interface'. To determine system reliability more accurately it is necessary to integrate assessments of hardware reliability, software reliability, and interface reliability.

1.5.1. Hardware Reliability

It is valuable to calculate the reliability of a hardware product for the duration of its ‘useful’ lifetime. Historical failure data which takes into account benign operating conditions and general age degradation is used to assess the expected lifetime of the hardware. Popular compilations of such data include the United States Air Force ‘Reliability Prediction for Electronic Systems’ (MIL-HDBK-217), and the United Kingdom British Telecom ‘Handbook of Reliability Data’ (HRD-4). Techniques for manipulating this data to reflect hardware architecture are well understood [Lala, 1985].

1.5.2. Software Reliability

Methods of establishing the reliability of software are still under development. Although many techniques have been proposed none have had the widespread acceptance given to the corresponding assessment of hardware reliability.

Assessments of software reliability usually involve the prediction of errors existing in the software. However, the reliability of the software depends not only on the existence of a fault but also its activation. Many authors have used Markov processes to model software reliability [Musa, 1987]. There are two types of Markovian software reliability model widely used; Poisson and binomial. The Poisson models assume an infinite number of faults in the software, whilst the binomial models assume a fixed number of faults. Both model types assume faults exist randomly within the software. Musa [1975] refined the basic Poisson model so that fault activation is a function of the time for which the software is executed. In reality, however, fault exposure is dependent on the fault location within the software and its associated probability of
activation by program execution. Littlewood [1981] attempts to model this situation with a binomial model that weights software faults according to the probability of their execution. Trachtenberg [1990] has recently reviewed and suggested a general theory of software reliability models based on Markovian processes.

The Markov software reliability models provide a valuable indication of the likelihood that a fault will be exposed during software execution. The hazard attributed to fault exposure can be further estimated by using *ad hoc* methods such as 'Fault Tree Analysis' (FTA) or 'Failure-Mode, Effects, and Criticality Analysis' (FMECA).

### 1.5.3. Interface Reliability

In software controlled digital systems, failures can occur which are difficult to diagnose as being due to the exposure of a hardware fault or software error. A distinction is not clear usually because the systems internal hardware/software interface has not been defined. The interface occurs within electronic devices such as processors and memories. For example, a fault in an individual cell on a memory device holding a program can cause what appears to be a software error. Memory devices are sometimes referred to as firmware to reflect their hardware/software interface. Other faults may occur on a data bus line with similar effect.

Permanent faults relating to interface reliability should be identified by the burn-in procurement of the hardware. However because of their limited duration, temporary faults are rarely located during the burn-in process. Assessment of the interface reliability requires knowledge of the occurrence of faults and errors they induce.

### 1.6. Low-Cost Enhancement of Controller Reliability

The reliability of microprocessor controllers can be enhanced by addressing the problem of faults introduced during procurement and operation. Techniques for procuring reliability have been briefly outlined. Operational faults are generated by component aging and transient disturbances in the working environment such as power supply fluctuations, electro-magnetic interference (EMI), and electro-static discharge (ESD) [Siewiorek & Swarz, 1982].
Transient disturbances are associated with temporary faults in digital systems. Unlike analogue or mechanical systems which tend to pass the effect of a transient disturbance as a temporary signal discrepancy whilst retaining overall function, digital systems are susceptible to malfunction in the presence of temporary faults because of their discrete state nature. Indeed it is becoming established that, even in 'benign' working environments, about 90% of microprocessor system failures can be attributed to temporary faults [Siewiorek & Swarz, 1982].

Control systems are often required to operate in 'harsh' industrial environments liable to produce transient disturbances. Although shielding can be employed to reduce the effects of transient disturbances on digital systems, their elimination is rarely possible [Horowitz & Hill, 1986]. The benefit of tolerating temporary faults induced by transient disturbances can be considerable. Industrial microprocessor controllers, however, are often developed within a limited budget which cannot support the redundancy incurred by many established fault tolerant techniques. This thesis approaches the topic of interface reliability, proposing a low-cost software-implemented fault tolerant technique for temporary hardware faults.

1.7. Thesis Preview

The topic of reliability for microprocessor-based controllers has been introduced with respect to the requirement for low-cost fault tolerance (the objective of the research presented in this thesis). Chapter 2 surveys literature investigating the fault-error-failure mechanism in microprocessor systems. The failure process is identified with malfunction, particular hazard being associated with the corruption of program flow. Current techniques to detect this class of fault are reviewed, but many require considerable expense to implement.

As a first step to developing new and more cost-effective techniques to detect program flow corruption, it is useful to consider the character of associated erroneous microprocessor behaviour. Chapter 3 presents a model for erroneous microprocessor execution. Performance parameters are evolved to show the benefit of implementing a detection capability together with a recovery mechanism. These parameters
include detection latency, reliability, Mean Time To Failure (MTTF), and availability. The model is applied to a selection of microprocessors commonly embedded within controllers, results are discussed in Chapter 4.

Microprocessor software displays various characteristics depending on the function of its implementation. Functional sections of code include program areas, data areas, and reserved memory mapped input/output areas. In addition, the microprocessor may have a proportion of its address space which is not populated by software. Whilst correct microprocessor operation executes instructions within the program area in a predictable manner, erroneous execution can invalidly interpret an instruction anywhere in the microprocessor address space in an unpredictable manner. Fault tolerant techniques whose implementation is based on software for detecting erroneous execution within each functional area of the microprocessor address space are presented in Chapter 5. In particular a novel technique for detecting erroneous program flow is proposed.

An algorithm for implementing the proposed fault tolerant technique in the program area is presented in Chapter 6. The technique involves manipulating the program code in order to strategically insert detection mechanisms. The mechanisms are designed to detect erroneous execution by identifying beforehand particular corrupted execution routes in the software. The algorithm is implemented in a software utility so that program code can automatically be given the detection capability. The software utility is called Post-programming Automated Recovery UTility (PARUT). PARUT is designed to be flexible, allowing the generation of fault tolerant code for a selection of microprocessors.

Several example programs are processed by the PARUT algorithm in Chapter 7 so that the performance of the fault tolerant technique can be assessed. Enhanced program code is evaluated by emulation and fault injection programmes. These experiments monitor the behaviour associated with corrupted patterns of erroneous execution. The information obtained from the experiments enables the detection performance of individual programs to be assessed.
Translators are used to generate significant amounts of software for microprocessor based controllers. The programmer has no control over the nature of the translator generated code. Chapter 8 identifies critical hazards which are not covered by the fault tolerant techniques outlined in Chapter 5. These hazards are associated with the catastrophic failures of cessation of processing and infinite execution loops. Techniques are proposed for the translator code generation process so that critical hazards are not produced. These techniques are influenced by the nature and structure of the target microprocessor instruction set.

Fault tolerant microprocessor design features are proposed in Chapter 9. These features facilitate rapid detection of corrupted program flow.

The final chapter reviews the thesis and draws conclusions on the research. Five appendices provide details of: microprocessor parameters applied in Chapter 4 to the model presented in Chapter 3; the design of a hardware unit associated with the software implemented fault tolerant technique proposed in Chapter 5; a code listing of the PARUT tool described in Chapter 6; the example programs evaluated in Chapter 7; and the papers published by the author relating to work presented in this thesis.
Chapter 2

TEMPORARY FAULTS: GENERATION, IMPLICATION, AND DETECTION

2.1. Introduction ........................................... 12
2.2. Faults and Their Implication on Microprocessor System Reliability . 13
2.3. Erroneous Behaviour of Microprocessor Systems ..................... 18
  2.3.1. Data Flow Errors .................................. 21
  2.3.2. Program Flow Errors ............................... 21
2.4. Assessing Error Detection Techniques for Microprocessor Systems ... 23
  2.4.1. Watchdog Timers ................................... 23
  2.4.2. Capability Checking .................................. 23
  2.4.3. Program Flow Monitoring ............................ 24
  2.4.4. Hazards Associated with Error Detection Techniques ............ 27
  2.4.5. A Novel Error Detection Technique ....................... 27
2.5. Reliability Evaluation ................................... 28
2.6. Summary and Conclusions .................................. 29
CHAPTER TWO

TEMPORARY FAULTS: GENERATION, IMPLICATION, AND DETECTION
(A LITERATURE REVIEW)

2.1. Introduction

An industrial environment can be less than ideal for microprocessor-based controllers. In particular externally generated transient events can disrupt microprocessor operation. Erroneous microprocessor behaviour is associated with a degraded or lost control function, and the mal-operation of any equipment under the microprocessor based controller's supervision. Mal-operation of controlled equipment can be extremely hazardous because of the unpredictable nature of erroneous microprocessor behaviour.

This chapter discusses transient events that lead to temporary corruption of a microprocessor bus, register, or memory. Such corruption incurs no permanent hardware damage. The limited duration of temporary faults inhibits their detection. Without detection, and exercised by circuit action, temporary faults generate errors which can spawn other errors, the process terminating as either benign activity or catastrophic failure. The fault-error-failure mechanism is explored using the results of fault observations in real processor systems, fault simulations, and physical fault injection experiments.

Prolonged periods of erroneous behaviour increase the likelihood of generating a catastrophic failure. In order to prevent catastrophic failure, erroneous operation must be detected and appropriate recovery action initiated. Error detection is achieved by identifying characteristics of erroneous behaviour. Several techniques are reviewed which offer fault tolerance suitable for low-cost microprocessor based systems. The performance of the techniques is discussed in relation to the benefit of rapid error detection. Finally, assessing the reliability of a microprocessor systems adopting a fault tolerant technique is considered.
2.2. Faults and Their Implication on Microprocessor System Reliability

Assessing the reliability of a microprocessor based system involves evaluating the probability of failure which in turn is dependent on the fault-error-failure mechanism. As defined in Chapter 1, a fault is a physical defect, an error is an activated fault, and a failure is classified as the deviation of system behaviour from that expected. The time interval between the occurrence of a fault and its manifestation as an error is called the fault latency. Similarly, the time interval between the occurrence of an error and the generation of a failure is called the error latency. Fault and error latencies are shown in Figure 2.1. The relationship between faults, errors, and failures is now explored.

Over the last decade, computer failure data has been collected for several continuously operational computer systems. Diagnosis of the data reveals temporary faults to be a significant cause of microprocessor failure. Collated computer failure data, see Table 2.1., shows temporary faults to account for between 93% and 98% of the induced computer system failures, the remaining failures being due to permanent faults. Furthermore, within the selection of computer systems, temporary faults have been observed to occur approximately every 40 to 330 hours during continuous operation. Temporary faults in digital devices are associated with electro-magnetic interference (EMI) [Liu & Whalen, 1988], electro-static discharge (ESD) [Bhar & Mahon, 1983], electrical noise [Shoji, 1987], ionizing radiation [Amerasekera & Campbell, 1987], and power supply fluctuations [Cortes et al, 1986].

The manifestation of a temporary fault within a microprocessor based system is dependent on the susceptibility of its digital circuitry. Ball & Hardie [1969] report experiments which suggest that the probability of logic malfunction is dependent on the duration of the temporary fault. Typically, temporary faults only had a significant effect when they existed in excess of five clock cycles. Sequential logic was more susceptible than combinational logic, its probability of malfunction being in excess of 90% for temporary faults of 100 clock cycle duration.

Additionally, the miniaturisation of digital integrated circuits (scaling) makes them more vulnerable to many environmental effects [Russel & Sayers, 1989]. Scal-
Figure 2.1: Fault and Error Latencies
<table>
<thead>
<tr>
<th>System</th>
<th>Source</th>
<th>Processor</th>
<th>Technology</th>
<th>Detection Mechanism</th>
<th>Hours Observed</th>
<th>Mean Time To Failure (Hours)</th>
<th>Temporary Faults (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>CMUA</td>
<td>Fuller &amp; Harbison, [1978]</td>
<td>PDP-10</td>
<td>ECL</td>
<td>Parity</td>
<td>5700</td>
<td>44</td>
<td>800-1600</td>
</tr>
<tr>
<td>Cm*</td>
<td>Siewiorek et al, [1978a]</td>
<td>LSI-11</td>
<td>NMOS</td>
<td>Diagnostics</td>
<td>15000</td>
<td>128</td>
<td>4200</td>
</tr>
<tr>
<td>C.vmp</td>
<td>Siewiorek et al, [1978a]</td>
<td>TMR LSI-11</td>
<td>NMOS</td>
<td>Crash</td>
<td>15000</td>
<td>97-328</td>
<td>4900</td>
</tr>
<tr>
<td>Telettra</td>
<td>Morganti et al, [1978]</td>
<td>UDET 7116</td>
<td>TTL</td>
<td>Mismatch</td>
<td>N/A</td>
<td>80-170†</td>
<td>1300</td>
</tr>
<tr>
<td>SLAC</td>
<td>Iyer &amp; Rossetti, [1982]</td>
<td>N/A</td>
<td>N/A</td>
<td>Diagnostics</td>
<td>26000</td>
<td>58</td>
<td>2300</td>
</tr>
<tr>
<td>CMU-AFS</td>
<td>Lin, [1988]</td>
<td>MC 68010/20</td>
<td>NMOS</td>
<td>Diagnosis</td>
<td>212800*</td>
<td>201</td>
<td>6552</td>
</tr>
<tr>
<td>BG-Rig</td>
<td>Wingate &amp; Preece, [1991]</td>
<td>TMR 8085</td>
<td>NMOS</td>
<td>Crash</td>
<td>9400</td>
<td>118</td>
<td>4100</td>
</tr>
<tr>
<td>BG-Rig</td>
<td>Wingate &amp; Preece, [1991]</td>
<td>TMR 8085</td>
<td>NMOS</td>
<td>Crash</td>
<td>12300</td>
<td>154</td>
<td>3100</td>
</tr>
</tbody>
</table>

Notes: † Reported by [McConnel, 1981]; ‡ From which 85% were recovered; * 13 applications monitored over 22 months.

Table 2.1. : Failures Attributed to Temporary and Permanent Faults in Observed Microprocessor Systems
ing is associated with two operational characteristics on VLSI devices; lower operating power and higher processing speed. Lower operating power means that smaller power variations can generate a signal ambiguity or fault. Hence the severity of a transient disturbance required to induce a temporary fault is reduced. Higher processing frequencies enable smaller duration temporary faults to induce a logic fault.

A fault may have various effects depending on the state of the microprocessor system at the time of the fault and the duration of the fault. Errors are not generated when the fault duration is less than the fault latency. A primary error is produced when the duration of the fault is equal to the fault latency, the associated event probability being denoted by $Pr\{Error \mid Fault\}$. Subsequent errors, referred to as secondary errors, are generated in numbers that increase with the fault duration beyond the fault latency [Damm, 1988].

Errors can influence a microprocessor system in several ways. Errors can spawn further errors as modelled by Stiffler [1980] and Kopetz [1982]. Each of these errors can cause passive or active failure depending on whether or not they are dormant. Dormant errors, e.g. memory errors, have an error latency dependent on the access frequency of the corrupted memory locations. Hence, the microprocessor system application can have a major influence on determining whether or not an error leads to failure. Indeed Iyer & Rossetti [1986] and Woodbury & Shin [1990] both report evidence that a processor’s workload can affect fault latency. Furthermore, the same error may have various effects on the microprocessor system depending on the timing of the error manifestation. The probability that an error generates a failure is denoted by $Pr\{Failure \mid Error\}$.

Reliability is defined in Chapter 1 as the probability of operating without failure, and can be expressed as,

$$R = 1 - Pr\{Failure\}. \quad (2.1.)$$

Inserting conditional probabilities, associated with the fault and error latencies, representing the fault-error-failure mechanism yields,
Techniques to improve microprocessor system reliability can be broadly divided into two groups: those that endeavour to prevent or reduce the generation of faults (fault prevention, to reduce \( Pr\{Fault\} \)), and those which attempt to intervene and prevent generated faults from causing system failure (fault tolerance, to reduce \( Pr\{Failure \mid Error\} \) and \( Pr\{Error \mid Fault\} \)).

Fault prevention can involve the strategic implementation of two types of techniques [Anderson & Lee, 1982]. Firstly, fault avoidance can be employed to protect the controller from transient disturbances. This commonly involves 'shielding' the controller to obstruct the effects of a transient disturbance. Secondly, fault removal can be applied to identify weak design or components within the microprocessor-based controller. This process is commonly referred to as 'screening'. Rectifying the weaknesses should reduce the susceptibility of the controller to the effects of transient disturbances.

Anderson & Lee [1982] identified four operations to be complete in order for a system to be fault tolerant.

- error detection,
- damage assessment,
- error recovery, and
- fault treatment and continued system service.

Once an error is detected it is necessary to identify and isolate the damage incurred. Then the system must be restored to a valid operational state in order to prevent recurring errors evoking system failure. Finally, any damage must be repaired and the original system operation re-initiated. Temporary faults do not incur any physical damage and hence on completion of the fault tolerant process, assuming the temporary fault has terminated its existence, the system is returned to complete working order.
The duration of the error until its detection is called *error-detection latency*. The sum of the fault latency and error-detection latency is referred to as the *fault detection latency* [Damm, 1988], see Figure 2.1.

Equation (2.2.) denotes two levels at which fault tolerant techniques can be applied: the circuit (or gate) level $Pr\{Error \mid Fault\}$, and the functional (or component) level, $Pr\{Failure \mid Error\}$. Temporary faults are extremely difficult to detect because of their short duration. The overhead in providing a detection capability for individual faults in a VLSI device at circuit level is considered by Mahmood & McCluskey [1985] to be prohibitive. Some of the faults generated will in any case be benign and hence not require detection. Nevertheless, other faults, exercised by circuit operation, can generate errors. In order for a system to tolerate such faults it is necessary to detect their associated errors before they in turn induce system failure. Fortunately, there appears to be a good correlation between the circuit level and behaviour level reaction of common VLSI design elements, such as Arithmetic Logic Units (ALU) and multiplexers, to the effects of faults [Chakraborty & Ghosh, 1988] which suggests that functional level fault tolerance will be acceptable for most fault tolerant systems.

### 2.3. Erroneous Operation of Microprocessor Systems

The impact of faults upon microprocessor operation has been the subject of much research. Iyer uses fault simulation experiments, see Table 2.2., to investigate the susceptibility of generating and the likelihood of propagating functional element errors. Memory, arithmetic logic units (ALU), and multiplexers are found to be, in descending order, the most susceptible functional elements to error generation and propagation. The remaining fault simulation experiments in Table 2.2., together with the physical fault injection experiments listed in Table 2.3., investigate the effect of such errors on software operation.

Software execution errors generated by faults are diverse, but can be divided into two general groups: data flow errors and program flow errors. Both are affected by fault latencies with a bimodal distribution, i.e. there are two or more distinct classes of error manifestation [Arlat et al, 1990] [Czeck & Siewiorek, 1990]. Faults, such
<table>
<thead>
<tr>
<th>Source</th>
<th>Target Processor</th>
<th>Method</th>
<th>Goal</th>
</tr>
</thead>
<tbody>
<tr>
<td>Marchal &amp; Courtois, [1982]</td>
<td>MC6800 &amp; MC68000</td>
<td>Simulation</td>
<td>Detection Time</td>
</tr>
<tr>
<td>Li et al, [1984]</td>
<td>SBR9000</td>
<td>Emulation</td>
<td>Coverage Measurement</td>
</tr>
<tr>
<td>Yang et al, [1985]</td>
<td>iAPX 432</td>
<td>Emulation</td>
<td>Coverage of TMR</td>
</tr>
<tr>
<td>Lomelino &amp; Iyer, [1986]</td>
<td>AMD 2901</td>
<td>Simulation</td>
<td>Error Propagation</td>
</tr>
<tr>
<td>Duba &amp; Iyer, [1988]</td>
<td>HS 1602</td>
<td>Simulation</td>
<td>Error Propagation</td>
</tr>
<tr>
<td>Kim &amp; Iyer, [1988]</td>
<td>AMD 2901</td>
<td>Simulation</td>
<td>Error Propagation</td>
</tr>
<tr>
<td>Segall et al, [1988]</td>
<td>Intel 80186</td>
<td>Simulation</td>
<td>Concept Study</td>
</tr>
<tr>
<td>Choi et al, [1989]</td>
<td>HS 1602</td>
<td>Simulation</td>
<td>Error Propagation</td>
</tr>
<tr>
<td>Czeck &amp; Siewiorek, [1990]</td>
<td>Intel 80186</td>
<td>Simulation</td>
<td>Detection Time &amp; Coverage</td>
</tr>
</tbody>
</table>

Table 2.2. : Fault Emulation/ Simulation Experiments.
<table>
<thead>
<tr>
<th>Source</th>
<th>Target Processor</th>
<th>Method</th>
<th>Goal</th>
</tr>
</thead>
<tbody>
<tr>
<td>Crouzet &amp; Decouty, [1982]</td>
<td>Intel 8080</td>
<td>Hardware Pin</td>
<td>Methodology Study</td>
</tr>
<tr>
<td>Liu, [1982]</td>
<td>Z8</td>
<td>Ion Radiation</td>
<td>System Evaluation</td>
</tr>
<tr>
<td>Lala [1983]</td>
<td>FTMP</td>
<td>Hardware Pin</td>
<td>System Evaluation</td>
</tr>
<tr>
<td>Cusick et al, [1985]</td>
<td>Z80 &amp; NSC-800</td>
<td>Ion Radiation</td>
<td>System Evaluation</td>
</tr>
<tr>
<td>Chillarege &amp; Bowan, [1989]</td>
<td>IBM 3081</td>
<td>Memory Corruption</td>
<td>Large System Failures</td>
</tr>
<tr>
<td>Arlat et al, [1990]</td>
<td>MC68000</td>
<td>Hardware Pin</td>
<td>Methodology Study</td>
</tr>
<tr>
<td>Madeira et al, [1990]</td>
<td>Z80</td>
<td>Hardware Pin</td>
<td>Error Detection Evaluation</td>
</tr>
</tbody>
</table>

Table 2.3: Fault Injection Experiments.
as stack corruption, are dormant requiring particular processing for their activation whilst other faults generate 'fast' failures. Nineteen percent of faults injected into a Motorola 6809 microprocessor system through ion bombardment [Gunneflo et al, 1989], and 22% of faults injected into an IBM 3081 processor system through memory corruption [Chillarege & Bowan, 1989], produce dormant faults.

Gunneflo's experiment also reports 78%, 17%, and 5% of the injected faults to generate program flow errors, data flow errors, and other consequences respectively [Gunneflo et al, 1989]. Experiments that physically inject faults on processor package pins, Schmid et al [1982] and Schuette & Shen [1986], support these results with 63% and 78% of faults generating program flow errors in Zilog Z80 and Motorola 68000 microprocessor systems respectively. Further, McGough & Swern [1981] report over half the logical faults inserted in a simulation of a AMD 2901 processor system to generate 'wild' branches, i.e. program flow errors. Caution should be exercised when comparing the significance of data and program flow errors in the fault insertion experiments because the experiments use different microprocessors, different fault insertion methods, and different fault locations.

2.3.1. Data Flow Errors

Data flow errors are generated by corruption or incorrect processing of data elements resident in data structures or instruction operands. Erroneous data can produce 'unreasonable' as well as failure conditions [Damm, 1988]. Data flow errors do not disturb the program flow and hence acceptance tests can be embedded within the program to check for bad data.

2.3.2. Program Flow Errors

Program flow errors can be initiated in microprocessor systems by the incorrect identification of a memory location as containing an instruction [Marchal & Courtois, 1982], or corruption of an instruction itself [Carpenter, 1989]. There are three main mechanisms by which faults can generate erroneous program flow. Firstly, corrupted opcodes may have a different instruction length. Carpenter [1989] demonstrates this effect for the Motorola 68000 microprocessor. A change in the instruction length
forces incorrect interpretation of the memory location for the next instruction opcode. Secondly, corruption of a branch or jump instruction operand will alter the location of the next instruction opcode to be executed. Finally, registers in the microprocessor specifying the address of the next instruction opcode may be directly corrupted. A SBR9000 processor fault simulation, [Li et al, 1984], reports 73% of program flow errors to be generated in this way. Either of these three mechanisms effectively initiates execution of an unknown program. The nature of ensuing execution is dependent on the code content of memory under interrogation. The character of this execution is haphazard and not usually in sympathy with the organized execution associated with the application program. This behaviour leads to the malfunction of the controller. Particular hazard is attributed to such occurrences because of the implications of their unpredictable effect on controller operation.

Once incorrect memory locations are accessed for instruction opcodes then there is a high probability that following instruction opcodes will also be incorrectly identified [Marchal & Courtois, 1982]. As erroneous program flow progresses, processing may further corrupt memory containing the original software. Erroneous program flow can terminate naturally via re-synchronization. Re-synchronization involves re-establishing identification of instruction opcodes within the original software application program. A physical fault injection experiment reports 75%, 6%, and 19% of program flow errors diverge permanently from the correct program (program crash), diverge temporarily from the program flow (re-synchronization), and are dormant (stack errors, etc) respectively. Tests cannot be embedded within the application program to identify opcode corruption because the application program is no longer executed.

Sosnowski [1986a] considers steady state outcomes of erroneous behaviour associated with microprocessor failure. He develops models for false loops, traps, and deadlocks. False loops and deadlocks are essentially infinite execution loop phenomena which have also been modelled by Halse & Preece [1987]. Deadlocks involve the termination of execution when the processor enters a ‘wait’ state. Halse & Preece
[1985] and Sosnowski [1986b] investigate characteristics of erroneous behaviour relating to the influence of different microprocessor instruction sets and address space utilization.

2.4. Assessing Error Detection Techniques for Microprocessor Systems

Hardware and software fault tolerant techniques are briefly outlined in Chapter 1. Many techniques, however, may be unsuitable for microprocessor controllers because the costs of their application exceed the controllers budget. This section reviews low-cost fault tolerant techniques for microprocessor controllers requiring high reliability.

2.4.1. Watchdog Timers

One of the most basic techniques for checking the operation of a microprocessor-based system is the use of a watchdog timer [Connet et al., 1972], [Ornstein et al., 1975]. The system is designed such that, under normal operation, program execution signals the watchdog timer within a specified time interval. The signal presets the timer to its initial value. The timer generates an error if no preset signal is forthcoming during the specified time interval. On receiving the error signal from the watchdog, the system initiates suitable recovery action. Typically this involves re-establishing a correct set of operating parameters. Watchdogs incur a small switching overhead and no performance degradation is incurred directly upon the executing software.

Watchdog timers, however, may have a hazardous error-detection latency. Consider a processor operating at 8 MHz, whose mean instruction processing time is 100 clock cycles, implementing a watchdog with a 100 ms interval. If a malfunction occurs in the middle of this interval then approximately 4000 instructions can be processed erroneously before the malfunction is detected. During the malfunction, processor operation is uncontrolled and may have hazardous implications for the processor activity. Therefore, other techniques providing fault detection are being developed.

2.4.2. Capability Checking

Lu [1980] was one of the first to propose what is now commonly referred to as the 'smart' watchdog. These units are based on an additional processor to provide

Capability checks, see Table 2.4., implemented together create reliable computer systems as demonstrated by Schmid et al [1982], Gunneflo et al [1989], and Madeira et al [1990] with 88%, 79%, and 75% fault detection respectively. The variation in fault detection is due to different selection of the capability checks implemented by each system, and the method of fault insertion when evaluating the microprocessor systems fault tolerance. Collectively applying capability checks provides both program and data flow error detection. However, implementing all these techniques can be complex so another simpler alternative is being explored by researchers. It involves detecting data flow errors by placing reasonableness checks in the software [Damm, 1988] (this includes Recovery Blocks, and N-Version Programming), and implementing a monitoring scheme to directly verify the program flow of the application processor.

2.4.3. Program Flow Monitoring

Program flow monitoring schemes are typically based, to some extent, on control flow graphs. The graphs consist of linked nodes. Each node represents a sequence of instructions performing some task, and each link represents the transition conditions, i.e. status information. Lu [1982] proposed a scheme called 'structural integrity
<table>
<thead>
<tr>
<th>Capability Checks</th>
</tr>
</thead>
<tbody>
<tr>
<td>a) incorrect sequence of instructions</td>
</tr>
<tr>
<td>b) branch to invalid destination</td>
</tr>
<tr>
<td>c) fetch illegal instruction</td>
</tr>
<tr>
<td>d) fetch an opcode from a none opcode address</td>
</tr>
<tr>
<td>e) invalid read within permitted memory</td>
</tr>
<tr>
<td>f) invalid write within permitted memory</td>
</tr>
<tr>
<td>g) access to memory outside permitted memory area</td>
</tr>
<tr>
<td>h) watchdog timer</td>
</tr>
</tbody>
</table>

Table 2.4. : Capability Checks
checking' involving the generation of a tag for each task. These tags are checked during execution to verify correct operation. Lu does not check transition conditions. Yau & Chen [1980] ensure each task does not have any inherent loops, hence, preventing the possibility of infinite erroneous execution loops without potential detection. They also implement verification of transition conditions between tasks.

Task tags assigned values based on cyclic encoding of their instruction sequences are called 'signatures'. Two techniques implementing signatures, Path Signature Analysis [Namjoo, 1982] and Signatured Instruction Streams [Shen & Schuette, 1983] embed precomputed signatures into the application program. During program execution, special circuitry re-computes the signature and compares it with the embedded code signature, any ambiguity signalling detection of erroneous program flow. Both techniques impose a performance and code overhead. Schuette & Shen [1986] have implemented an embedded signature technique. The dedicated circuitry required 3947 gates and 5435 bytes of memory, a hardware overhead of approximately 38% compared to the gate count of the host Motorola 68000 application processor. The memory overhead is incurred by embedded tags in the application program: typical overhead estimates range between 10 and 20% [Wilken & Shen 1987]. Finally, pseudo-branches, required by the implementation so that correct execution by-passes embedded signatures, are estimated to reduce application program performance by 10%. The technique implemented in a Motorola 68000 processor system is reported to have a mean detection latency of less than 100 μs, the maximum latency being 3.8 ms. This is a considerable improvement on the detection latency expected from watchdogs. Schuette & Shen [1986] and Segall et al [1988] report a 98% and 94% coverage of program flow errors respectively. Wilken & Shen [1987] review the mechanism of several signature monitoring techniques that embed signatures in the application software.

Namjoo [1983] proposes a smart watchdog to compute run-time signatures independently to verify application processor behaviour. This technique does not incur the performance or code overhead associated to earlier schemes. Eifert & Schuette [1984] refine the technique, replacing the smart watchdog with dedicated circuitry.
These techniques whilst not degrading system performance or requiring extra memory, do require an additional hardware unit. Smart watchdogs introduce approximately 100% redundancy by duplicating the number of processors. Replacing the watchdog with dedicated circuitry implies that the design is not directly applicable for use with different microprocessor types.

2.4.4. Hazards Associated with Error Detection Techniques

It should be noted that those fault tolerant techniques implementing hardware redundancy will also be susceptible to the effects of faults induced by the environment. In particular, hazard is associated with those detection techniques such as the smart watchdogs that use a microprocessor to monitor a microprocessor. Duba & Iyer [1988] and Choi et al [1989] in their fault simulations identify the watchdog element of their microprocessor system to be significantly vulnerable to temporary fault generation, and diagnose a critical fault propagation path between the control unit and the watchdog. The reliability of a microprocessor-based controller implementing a watchdog device can be seriously undermined if the detection capability of the watchdog is lost. Damm [1988] refers to this occurrence as the 'doomsday' syndrome.

2.4.5. A Novel Error Detection Technique

This thesis proposes an alternative technique based on the self-detection of program flow errors by erroneous execution. Potential program flow errors within the software are identified and the code structure enhanced by the strategic placement of detection mechanisms. These mechanisms can only be activated by erroneous program flow. The technique does not inherently require additional hardware. The only overhead is the extended code requirement, and the associated additional execution to by-pass the inserted detection mechanisms. The code extension, and the degradation to application program performance by by-pass operations is comparable with that required by the embedded tags and pseudo-branches reported for an implementation of the embedded signature technique [Schuette & Shen, 1986], [Wilken & Shen, 1987].
2.5. Reliability Evaluation

The reliability of microprocessor systems implementing fault masking can be assessed using architectural analysis. However, this method cannot be used in microprocessor systems implementing other fault tolerant techniques, such as those discussed in this chapter because of the uncertainty of fault detection.

An alternative reliability evaluation which can be adapted to the fault tolerant techniques discussed in this chapter is given by Schuette & Shen [1986]. They propose the following estimation for reliability of a microprocessor-based system, $R_s(t)$, employing signature monitoring,

$$R_s(t) = [R(t).R^F(t)] + [(1 - R(t)).R^F(t).E]. \quad (2.3.)$$

where the reliability of the microprocessor-based system $R(t)$ is the expectation of an error occurring in the microprocessor before time $t$, $R^F(t)$ is the expectation of an error occurring in the additional circuitry required by the fault tolerant mechanism before time $t$, and $E$ is the error coverage of the employed detection mechanism.

Re-arranging equation (2.3.) gives,

$$R_s(t) = R^F(t)[R(t) - (1 - R(t)).E]. \quad (2.4.)$$

The correct operation of the system is dependent on avoiding the ‘doomsday’ syndrome discussed earlier, for the detection mechanism. Therefore the reliability of the system, $R_s(t)$, is directly dependent on the reliability of the detection mechanism, $R^F(t)$. An additional constraint on system reliability is the sum of the probability that the processor is working correctly, $R(t)$, with the probability of fault coverage by the detection mechanism when the processor is not working correctly, $(1 - R(t)).E$.

Reliability engineers commonly use the complementary reliability parameter Mean Time To Failure (MTTF) for failure rates; increases in the event rate reduce the MTTF expectations. Event rates are, as discussed earlier in this chapter, dependent on the occurrence of transient disturbances and the susceptibility of a system to this disturbance. It cannot be assumed that environmental conditions are stable – the
mean rate of occurrence of transient disturbances may approximate to a constant, but it should be recognized that occurrences may cluster. Clustered occurrences are known as 'bursts', and within the C.vmp microprocessor system approximately 25% of observed fault events were bursts [McConnel & Siewiorek, 1978].

Variations in the event rate will alter the expected reliability for individual applications, and hence limit the usefulness of equations like (2.3.) when evaluating system reliability. However, equation (2.3.) can be applied to many other fault tolerant techniques, implemented in a simple single processor system, to provide a comparative index of their effectiveness.

2.6. Summary and Conclusions

Temporary faults have been diagnosed as causing between 15 and 50 times more failures in microprocessor systems than permanent faults. Reliability engineers attribute the generation of many temporary faults to the occurrence of transient environmental disturbances.

Microprocessor-based controllers are often required to operate in harsh environments where transient disturbances are a regular hazard. Prevention techniques involving screening and shielding can be applied in an attempt to eradicate the effects of transient disturbances on microprocessor controllers. In practice these techniques, however, only reduce the problem.

Temporary faults are difficult to detect because of their limited duration. Furthermore, the errors they generate can induce microprocessor malfunction and this may lead to erroneous operation of equipment under directives from the controller. Equipment operation may be haphazard and pose a danger in particular applications.

Microprocessor malfunction should be detected rapidly to reduce the hazard of erroneous equipment operation. It is with this purpose that fault tolerant techniques have been developed. Fault tolerant techniques enable digital systems to isolate and repair the effects of temporary faults before restoring the application function. Program flow errors are identified as having particular influence on the character of
microprocessor malfunction. A selection of techniques applicable to microprocessor-based controllers are reviewed. In particular, techniques are discussed which incur a low system overhead.

Assessment of the fault tolerant techniques reviewed in this chapter involves inserting faults into a microprocessor system and monitoring its response. These assessments only reflect the efficiency of the implemented fault tolerant techniques. A major number of observed computer system failures are attributed to temporary faults. Assessments of microprocessor system reliability should, therefore, take this class of fault into account. Reliability assessments should involve information including knowledge of the susceptibility of the processor to transient disturbances, and the likelihood of such disturbances in the application environment.

In summary, temporary faults can be responsible for a significant number of microprocessor system failures. Fault prevention techniques cannot guarantee the eradication of all faults. It is therefore pertinent to incorporate fault tolerant features into the system design. Effective fault tolerance can be implemented at low-cost. For microprocessor systems with safety applications, developed within limited financial budgets, these techniques can provide highly beneficial and cost-effective reliability.
# Chapter 3

**MODELLING ERRONEOUS MICROPROCESSOR BEHAVIOUR**

<table>
<thead>
<tr>
<th>Section</th>
<th>Title</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>3.1.</td>
<td>Introduction</td>
<td>31</td>
</tr>
<tr>
<td>3.2.</td>
<td>Initiating Erroneous Microprocessor Behaviour</td>
<td>31</td>
</tr>
<tr>
<td>3.3.</td>
<td>Erroneous Behaviour</td>
<td>32</td>
</tr>
<tr>
<td>3.4.</td>
<td>Erroneous Execution</td>
<td>34</td>
</tr>
<tr>
<td>3.5.</td>
<td>Halse Execution Model</td>
<td>36</td>
</tr>
<tr>
<td>3.6.</td>
<td>Hybrid Execution Model</td>
<td>38</td>
</tr>
<tr>
<td>3.6.1.</td>
<td>Linear Erroneous Execution</td>
<td>38</td>
</tr>
<tr>
<td>3.6.2.</td>
<td>Propagating Further Periods of Erroneous Execution</td>
<td>40</td>
</tr>
<tr>
<td>3.6.3.</td>
<td>Detection of Erroneous Execution</td>
<td>41</td>
</tr>
<tr>
<td>3.6.4.</td>
<td>Erroneous Execution Stall</td>
<td>44</td>
</tr>
<tr>
<td>3.7.</td>
<td>Reliability Analysis</td>
<td>45</td>
</tr>
<tr>
<td>3.7.1.</td>
<td>Failure Rate</td>
<td>46</td>
</tr>
<tr>
<td>3.7.2.</td>
<td>Probability of an Event Leading to Failure</td>
<td>48</td>
</tr>
<tr>
<td>3.7.3.</td>
<td>Reliability Evaluation</td>
<td>49</td>
</tr>
<tr>
<td>3.7.4.</td>
<td>Mean Time To Failure</td>
<td>51</td>
</tr>
<tr>
<td>3.8.</td>
<td>Availability</td>
<td>52</td>
</tr>
<tr>
<td>3.9.</td>
<td>Summary</td>
<td>53</td>
</tr>
</tbody>
</table>
CHAPTER THREE

MODELLING ERRONEOUS MICROPROCESSOR BEHAVIOUR

3.1. Introduction

This chapter investigates microprocessor behaviour with particular regard to fault conditions. Temporary hardware faults may disrupt software processing and induce erroneous execution. The event initiating erroneous behaviour is defined. A model is proposed to simulate erroneous microprocessor behaviour. This model is developed for the von Neumann class of microprocessor which has dominated processor design over the last thirty years. Erroneous behaviour is investigated. A facility for detecting erroneous execution is introduced into the model. The efficiency of detection is examined with respect to the latency between initiation and detection of erroneous behaviour. A stochastic reliability model is proposed to assess the effect of software disruption on microprocessor performance. A method is developed for calculating Mean Time To Failure (MTTF) of the microprocessor system. Availability is also determined under the assumption that the processor has a resident recovery routine in its memory. MTTF and availability are common engineering measures of reliability. Hence the modelled reliability for a microprocessor can be compared to other device reliabilities.

The models presented are robust, not relying on the features of any microprocessor(s). Model application allows the analysis and comparison of a wide selection of microprocessors.

3.2. Initiating Erroneous Microprocessor Behaviour

Erroneous microprocessor behaviour is considered to occur when a temporary fault manifests itself as a control flow failure. Loss of the correctly operating control flow will cause the microprocessor to mis-interpret its software with the hazard of malfunction. The ensuing behaviour comprises of propagating erroneous execution with a progressively increasing likelihood of catastrophic failure.
Any control flow failure will be reflected by an erroneous target entry in the microprocessor's program counter, and hence, the following assumption is made.

**Assumption (1):** The event initiating erroneous microprocessor behaviour is that of program counter corruption.

The program counter is considered as a single register used to locate instructions through the whole address space. The nature of the program counter corruption is not known. The event initiating erroneous behaviour may have had a variety of sources including stack pointer corruption, bus-line transients, and memory bit-flips. It is assumed that all bits in the program counter are equally susceptible to error.

**Assumption (2):** The contents of the microprocessor program counter are corrupted randomly by the event initiating erroneous behaviour.

These assumptions enables a mathematical model, based on probability theory, to be developed for erroneous microprocessor behaviour.

### 3.3. Erroneous Behaviour

Consider erroneous behaviour to be initiated by random corruption of the program counter. This event produces a jump in the control flow of the existing software to a random location in the address space of the microprocessor. This random jump is termed the *Initial Erroneous Jump* (IEJ). Erroneous execution then commences. The data contents of the memory at the new location will be executed as if they were instruction codes. Erroneous execution will take place in a linear fashion until the execution of a further jump instruction causes a *Subsequent Erroneous Jump* (SEJ). Repeated periods of linear erroneous execution interspersed by SEJs follow until terminated either by catastrophic failure or system recovery. This process is shown in Figure 3.1., where the execution flow through the address space is shown as a stream of linked periods of linear erroneous execution.
Figure 3.1. Microprocessor Erroneous Behaviour.
3.4. Erroneous Execution

Erroneous execution consists of a sequence of execution states, each state representing the operation of an instruction. Execution states can be categorized with respect to the nature of their outcome. Halse [1984] identified the state outcomes listed below.

- **Non-Jump**: leads to the program counter pointing to the next instruction in the address space.
- **Restart**: leads to a jump to a predefined location in the address space.
- **Unspecified Jump**: leads to a jump to a new location in the address space determined by volatile memory contents.
- **Return**: leads to a jump to an address held in a stack.
- **Stop/Wait**: leads to a cessation of processing; and requires an interrupt or hardware reset to exit from this state.

Restart outcomes are usually generated by interrupts or exceptions. The restart outcome vectors execution to a location predefined by the microprocessor architecture. A recovery routine can be placed at the restart outcome vector target. Hence for controlled recovery, a restart outcome defines erroneous behaviour detection and an ordered return to a recovery routine.

A model for erroneous execution is shown in Figure 3.2. The model shows erroneous microprocessor behaviour being entered by program counter corruption (IEJ). The cascading sequence of state outcomes throughout erroneous execution can be traced. Successive 'non-jump' state outcomes will produce a period of linear erroneous execution. A restart outcome can be used to provide detection of erroneous execution. Any of the remaining state outcomes, including a stop/wait outcome, are defined as generating further erroneous jump (SEJ) in the erroneous execution.

In a particular make of microprocessor, not all possible instruction bit patterns are necessarily defined as instructions. Illegal, or 'undefined' instructions can, when
Figure 3.2.: Erroneous Execution Model
(* manifested transient fault)
executed, result in any one of the state outcomes defined above. The actual state outcome depends on the particular microprocessor die, as manufacturers are not obliged to ensure that every batch produces the same operation. In some machines, such as the Motorola 68000 and Intel 80386, the execution of all undefined instructions is specified as an exception (software interrupt) and hence will produce a restart state outcome.

3.5. Halse Execution Model

This section briefly reviews the foundations of a model for microprocessor operation proposed by Halse [1984]. The model analyses erroneous microprocessor operation.

A statistical model of erroneous execution is made using the assumption that the memory contents, throughout the address space, have a distribution that does not change. This clearly does not reflect the memory content distribution for real microprocessor based systems. Their distribution will vary through the memory map dependent on the utilization of locations. Nevertheless, the model does enable the identification of some general characteristics of erroneous execution.

As a result of erroneous behaviour, erroneous execution will interpret some location in the address space as an instruction. This results in one of two outcomes. Either an erroneous jump is generated which transfers control to another part of the address space; or no jump occurs and control passes onto the next logical location.

Let the probability of an instruction execution yielding a 'jump' or 'non-jump' outcome be $P_J$ and $P_{NJ}$ respectively. Hence by definition,

$$P_J = 1 - P_{NJ}$$

(3.1.)

It follows that the probability of terminating a period of linear execution on the $k^{th}$ instruction (i.e. generate an erroneous jump) is given by

$$P_J(k) = P_{NJ}^{(k-1)} \cdot P_J, \quad \text{where } k \geq 1.$$

(3.2.)

36
When evaluating a microprocessor's erroneous behaviour, it is more realistic to use effective instruction outcome distributions rather than instruction outcome distributions based upon instruction set definitions. It is recognized that some instructions have different outcomes dependent on some conditional test. In particular it is noted that conditional branch instructions can be paired, such that groups of two instructions covered a condition and its complement. Hence each pair of conditional branch instructions, such as a 'branch if zero' and 'branch if not zero', can be treated as if it were a single jump instruction and a single non-jump instruction. Halse [1984] assumes conditional instructions to have a 50% chance of occurring. Although this is strictly not true for individual conditions, the overall treatment of conditional instructions in this manner is considered valid.

Let \( Ne_J \) be the effective number of address space locations that when interpreted as an instruction generate a 'jump' outcome instructions. Let \( N_L \) be the number of address space locations that could be interpreted as an instruction.

Then:

\[
P_J(1) = P_J = \frac{Ne_J}{N_L}, \tag{3.3.}
\]

and,

\[
Ne_J = Ne_{RN} + Ne_{RT} + Ne_{S/W} + Ne_{UJ} \tag{3.4.}
\]

where \( Ne_{RN} \), \( Ne_{RT} \), \( Ne_{S/W} \), and \( Ne_{UJ} \) are the effective numbers of 'return', 'restart', 'stop/wait', and 'unspecified jump' outcome instructions in the available address space.

The probability that termination of a period of linear execution results in a particular outcome \( P_x(k) \), is dependent on the proportion of that type of instruction in the set of 'jump' instructions. Collecting equations (3.2.), (3.3.), and (3.4.) gives,

\[
P_x(k) = \begin{cases} 
0, & \text{when } k = 0, \\
\frac{Ne_x}{Ne_J} P_J(k), & \text{when } k \geq 1.
\end{cases} \tag{3.5.}
\]

where the subscript \( x \) denotes the respective jump outcomes; \( x \in \{RT, UJ, RN, S/W\} \) represents restart, unspecified jump, return, and stop/wait.
3.6. Hybrid Execution Model

A hybrid model is proposed here facilitates further investigation of erroneous microprocessor behaviour. Erroneous behaviour has two phases of execution; linear execution following an IEJ, and linear execution following an SEJ. The hybrid model enables the examination of the characteristics associated with each of the two patterns. In particular, three mechanisms have been identified that terminate linear erroneous execution.

a) Another period of linear erroneous execution is initiated: an ‘unspecified jump’ or ‘return’ state is entered.

b) Processing stalls: a ‘stop/wait’ state is entered.

c) Detection of erroneous execution: a ‘restart’ state is entered.

This section initially models the periods of linear erroneous execution associated with each of the two patterns of behaviour. These are then developed to investigate the probability of further periods of linear erroneous execution being initiated, stalled, or detected.

3.6.1. Linear Erroneous Execution

Both restart and stop/wait erroneous jump outcomes terminate erroneous execution. Unspecified jump and return outcomes initiate a new period of linear erroneous execution. Let \( P_{J'}(k) \) represent the probability of the \( k \)th instruction processed yielding a jump outcome, other than a restart or stop/wait, following an erroneous jump. The subscript \( J' \) represents the jump outcomes; unspecified jump, and return.

\[
P_{J'}(k) = \sum_{x \in \{UJ, RN\}} P_x(k), \quad \text{where } k \geq 0.
\]  

(3.6.)

Logical processing errors such as ‘divide by zero’ can cause premature completion of an instruction’s execution, generating an otherwise unexpected restart outcome within a fraction of the clock cycles normally required by the instruction. These errors have different influences on each of the two phases of erroneous behaviour. An IEJ may not initiate erroneous execution but fire what is considered by the model
to be an immediate restart outcome. The same assumption is made for an SEJ such
that the SEJ instruction is considered to fire a restart rather than a jump outcome.
Let $\beta$ and $\gamma$ be the respective proportion of IEJs and SEJs firing a restart outcome
in this way, such that $0 \leq \beta \leq 1$ and $0 \leq \gamma \leq 1$.

Consider the execution model probability for the outcome of the $k$th processed
instruction, equation (3.5.). Let $P_x^\text{IEJ}(k)$ and $P_x^\text{SEJ}(k)$ be the probability density
functions for the execution outcome of the $k$th processed instruction. The subscript
$x$ denotes the class of outcome; $x \in \{RT, UJ, RN, S/W\}$ representing restart, un­
specified jump, return, and stop/wait outcomes respectively. The superscript $\text{IEJ}$
or $\text{SEJ}$ denotes execution following an IEJ or SEJ respectively.

Evaluating erroneous execution following an IEJ when $k = 0$;

$$P_x^\text{IEJ}(k) = \begin{cases} 0, & \text{where } x \in \{UJ, RN, S/W\}, \\ \beta, & \text{where } x \in \{RT\}. \end{cases} \quad (3.7a.)$$

and when $k \geq 1$;

$$P_x^\text{IEJ}(k) = \begin{cases} (1 - \beta).[(1 - \gamma).P_x(k)], & \text{where } x \in \{UJ, RN, S/W\}, \\ (1 - \beta). [P_x(k) + \gamma.P_f(k)], & \text{where } x \in \{RT\}. \end{cases} \quad (3.7b.)$$

Evaluating erroneous execution following an SEJ when $k = 0$,

$$P_x^\text{SEJ}(k) = 0, \quad \text{where } x \in \{RT, UJ, RN, S/W\}, \quad (3.8a.)$$

and when $k \geq 1$,

$$P_x^\text{SEJ}(k) = \begin{cases} (1 - \gamma).P_x(k), & \text{where } x \in \{UJ, RN, S/W\}, \\ P_x(k) + \gamma.P_f(k), & \text{where } x \in \{RT\}. \end{cases} \quad (3.8b.)$$

Let $P_f(k)$ be the probability of terminating a period of linear erroneous execution
where the subscript $x$ denotes the class of outcome; $x \in \{RT, UJ, RN, S/W\}$ repre­
senting restart, unspecified jump, return, and stop/wait respectively. The superscript
\( y \) denotes execution following an erroneous jump; \( y \in \{IEJ, SEJ\} \) representing IEJ and SEJ respectively.

\[
P^y_P(k) = \sum_x P^y_x(k), \tag{3.9.}
\]

The probability that \( k \) instructions have been linearly processed in the current phase of linear execution is evolved from equation (3.9.),

\[
P^y_L(k) = 1 - \sum_{k} P^y_j(k), \quad \text{where } k \geq 0, \text{ and } y \in \{IEJ, SEJ\}. \tag{3.10.}
\]

The mean number of instructions expected to be executed during each phase of erroneous behaviour, \( I \), is given by,

\[
I = E[K] = \sum_k k \cdot P^y_P(k), \quad \text{where } k \geq 0, \text{ and } y \in \{IEJ, SEJ\}. \tag{3.11.}
\]

where \( K \) is the random variable of the probability density function, defined by equation (3.9.), \( P^y_P(k) \).

If the probability of a jump in either of the patterns of erroneous behaviour is zero then the number of instructions processed during linear erroneous execution is infinite. This, of course, assumes that repeated passes of execution through the address space due to program counter overflow, are considered as a single period of linear erroneous execution.

### 3.6.2. Propagating Further Periods of Linear Erroneous Execution

Periods of linear erroneous execution are propagated when erroneous behaviour generates a SEJ. The probability of a SEJ outcome on the \( k \)th processed instruction during either of the two patterns of linear erroneous execution is developed from equation (3.6.),

\[
P^y_P(k) = \sum_x P^y_x(k), \quad \text{where } k \geq 0 \tag{3.12.}
\]
where the subscript $x$ denotes the class of outcome; $x \in \{UJ, RN\}$ representing unspecified jump, and return respectively. The superscript $y$ denotes execution following an erroneous jump; $y \in \{IEJ, SEJ\}$ representing IEJ and SEJ respectively.

3.6.3. Detection of Erroneous Execution

Error detection latency is defined as the time between the initiation of erroneous behaviour and its detection. This parameter is an important performance characteristic when evaluating detection techniques [Blough & Masson, 1987]. The discrete state nature of the microprocessor model presented in this chapter means that error detection latency is determined as a function of the number of erroneously processed instructions during erroneous behaviour.

Detection of erroneous behaviour can be provided by implementation of restart outcomes. Restart outcomes take execution to an address space location predefined by the processor architecture. A recovery routine can be placed at this address. Now restart outcomes produce controlled return to the recovery routine. Hence erroneous behaviour is detected by a restart outcome. In order to remove any ambiguity, recovery routines are only placed for restart outcomes generated by erroneous behaviour.

Detection of erroneous behaviour may occur during the linear erroneous execution following an IEJ, or one or more SEJs. The probability of detection $D(k)$ on the $k$th processed instruction of erroneous execution is given by,

$$D(k) = D_{IEJ}(k) + D_{SEJ}(k), \quad \text{where } k \geq 0.$$  

where $D_{IEJ}(k)$ and $D_{SEJ}(k)$ respectively represent the detection coefficients of erroneous execution following either an IEJ or SEJ.

The detection coefficients are derived using the execution characteristics of erroneous behaviour during processing following an IEJ and SEJ. Let $P_{IEJ}^{EE}(k)$ and $P_{SEJ}^{EE}(k)$ represent the probability of the $k$th instruction processed yielding detection (restart outcome) of erroneous behaviour following an IEJ and SEJ respectively.

The detection coefficient $D_{IEJ}(k)$ is the probability of detecting erroneous execution on the $k$th instruction before a SEJ occurs,
The detection coefficient $D_{SEJ}(k)$ sums the probability of all possible execution paths resulting in detection after one or more SEJs. Every such execution route requires at least one SEJ, other than a stop/wait or restart outcome representing a processing stall and error detection respectively, after the IEJ commencing erroneous behaviour.

\[
D_{SEJ}(k) = \sum_{m=0}^{k} P_{IEJ}^m(m) \cdot \left[ \sum_{n=0}^{k-m} \Psi(n) \cdot P_{RT}^{SEJ}(k - m - n) \right]. \tag{3.15a}
\]

where,

\[
\Psi(n) = \begin{cases} 
1, & \text{when } n = 0, \\
\sum_{z=1}^{n} P_{IEJ}^z(z) \cdot \Psi(n - z), & \text{when } n \geq 1.
\end{cases} \tag{3.15b}
\]

Equation (3.15b.) calculates the probability associated with periods of erroneous execution initiated by and terminating with an unspecified jump or return outcome.

To demonstrate the function of equation (3.13.) consider a simple example to determine the probability of detecting erroneous execution before four instructions have been erroneously processed. That is, evaluate $D(k)$ when $k = 3$. From equation (3.13.),

\[
D(3) = D_{IEJ}(3) + D_{SEJ}(3) \tag{3.16}
\]

Equation (3.14.) yields the probability of detecting erroneous execution when no SEJs occur,

\[
D_{IEJ}(3) = P_{RT}^{IEJ}(3) \tag{3.17}
\]

Equation (3.15.) yields the probability of detecting erroneous execution when one or more SEJs occur,
\[ D_{SEJ}(3) = P^{IEJ}_{j}(0).1.P^{SEJ}_{RT}(3) + \]
\[ P^{IEJ}_{j}(0).P^{SEJ}_{RT}(1).1.P^{SEJ}_{RT}(2) + \]
\[ P^{IEJ}_{j}(0).P^{SEJ}_{RT}(1).P^{SEJ}_{RT}(2).1.P^{SEJ}_{RT}(1) + \]
\[ P^{IEJ}_{j}(0).P^{SEJ}_{RT}(2).1.P^{SEJ}_{RT}(1) + \]
\[ P^{IEJ}_{j}(0).P^{SEJ}_{RT}(1).P^{SEJ}_{RT}(1).1.P^{SEJ}_{RT}(0) + \]
\[ P^{IEJ}_{j}(0).P^{SEJ}_{RT}(2).1.P^{SEJ}_{RT}(0) + \]
\[ P^{IEJ}_{j}(0).P^{SEJ}_{RT}(3).1.P^{SEJ}_{RT}(0) + \]
\[ P^{IEJ}_{j}(1).1.P^{SEJ}_{RT}(2) + \]
\[ P^{IEJ}_{j}(1).P^{SEJ}_{RT}(1).1.P^{SEJ}_{RT}(1) + \]
\[ P^{IEJ}_{j}(1).P^{SEJ}_{RT}(1).P^{SEJ}_{RT}(1).1.P^{SEJ}_{RT}(0) + \]
\[ P^{IEJ}_{j}(1).P^{SEJ}_{RT}(2).1.P^{SEJ}_{RT}(0) + \]
\[ P^{IEJ}_{j}(2).1.P^{SEJ}_{RT}(1) + \]
\[ P^{IEJ}_{j}(2).P^{SEJ}_{RT}(1).1.P^{SEJ}_{RT}(0) + \]
\[ P^{IEJ}_{j}(3).1.P^{SEJ}_{RT}(0), \]

where equation (3.7.) and equation (3.8.) define \( P^{IEJ}_{j}(0) = 0, P^{SEJ}_{RT}(0) = 0, \) and \( P^{SEJ}_{RT}(0) = 0, \)

Substituting equation (3.17.) and equation (3.18.) into equation (3.16.) gives,

\[ D(3) = P^{IEJ}_{RT}(3) + \]
\[ P^{IEJ}_{j}(1).1.P^{SEJ}_{RT}(2) + \]
\[ P^{IEJ}_{j}(1).P^{SEJ}_{RT}(1).1.P^{SEJ}_{RT}(1) + \]
\[ P^{IEJ}_{j}(2).1.P^{SEJ}_{RT}(1). \]

This simple example demonstrates the function of equation (3.13.). Although complex, the equation does simplify to produce an almost intuitive result.

Error detection latency \( L_d \) can be determined for the microprocessor model by calculating the expectation of detecting erroneous behaviour:
where $K$ is the random variable of the detection function $D(k)$.

This formula determines the mean number of instructions expected to be processed before detection.

### 3.6.4. Erroneous Execution Stall

A processing stall is considered to occur when the microprocessor enters a stop/wait state. Execution stalls require an external hardware interrupt to facilitate state exit and continuing execution, but the occurrence of such events during erroneous execution is unknown. It is therefore important to predict the significance of this eventuality. The following evaluations determine the probability of a processing stall during erroneous execution.

Erroneous execution may stall during the linear erroneous execution following an IEJ, or one or more SEJs. The probability of stalling $S(k)$ on the $k$th processed instruction of erroneous execution is given by,

$$S(k) = S_{IEJ}(k) + S_{SEJ}(k), \quad \text{where } k \geq 0. \tag{3.21.}$$

where $S_{IEJ}(k)$ and $S_{SEJ}(k)$ respectively represent the stalling coefficients of erroneous execution following either an IEJ or SEJ.

The stalling coefficients are derived using the execution characteristics of erroneous behaviour during processing following an IEJ and SEJ. Let $P_{SW}^{IEJ}(k)$ and $P_{SW}^{SEJ}(k)$ represent the probability of the $k$th instruction processed stalling (stop/wait outcome) erroneous behaviour following an IEJ and SEJ respectively.

The stalling coefficient $S_{IEJ}(k)$ is given by,

$$S_{IEJ}(k) = P_{SW}^{IEJ}(k), \quad \text{where } k \geq 0. \tag{3.22.}$$
which is the probability of execution following an IEJ terminating with a stall outcome before a SEJ occurs.

The stalling coefficient $S_{SEJ}(k)$ represents all possible execution routes to a stall incorporating one or more SEJs. Every such execution route requires at least one SEJ, other than a stop/wait or restart outcome representing $S_{IEJ}(k)$ and detection respectively, after the IEJ commencing erroneous behaviour ($P_{IEJ}^f$). The stalling execution path may or may not include SEJs propagating further periods of linear erroneous execution ($\Psi$). Finally, an execution path generating a processing stall must terminate with a stop/wait outcome ($P_{SW}^{SEJ}$). Hence,

$$S_{SEJ}(k) = \sum_{m=0}^{k} P_{IEJ}^f(m). \left( \sum_{n=0}^{k-m} \Psi(n).P_{SW}^{SEJ}(k-m-n) \right).$$

(3.23.)

where $\Psi(n)$ is defined by equation (3.15b.).

The recursive nature of equation (3.15.) is similar to the above equation, and its functional description can be shared.

The stalling latency $L_s$ can be determined for the microprocessor model by calculating the expectation of stalling erroneous behaviour:

$$L_s = E[K] = \sum_k k.S(k),$$

(3.24.)

where $K$ is the random variable of the stalling function $S(k)$.

### 3.7. Reliability Analysis

The reliability model proposed here defines a microprocessor system failure as hazardous behaviour rather than loss of function. Hazardous behaviour is unpredictable and may mutilate system integrity and/or lead to catastrophic failure. The model outlined below describes how loss of function may not immediately induce hazardous behaviour if there is automatic repair.

The reliability of a microprocessor system can be analysed using a state/time random variable stochastic model. Let a microprocessor occupy one of two behavioural states: controlled ($C$), and uncontrolled ($U$). The processor remains in a controlled
state until the occurrence of an event induces a transition to the uncontrolled state. Such transitions are considered system failures and mark the initiation of hazardous microprocessor operation. The probability of a transition in time $\delta t$ is $\lambda(t)\cdot\delta t$, where $\lambda(t)$ is the failure rate. The reliability model is shown in Figure 3.3. The model is called a Markov Process because of its discrete-state, continuous-time nature.

3.7.1. Failure Rate, $\lambda(t)$

Let the sample space $E_s$, comprise of a set of events corresponding to initiated erroneous behaviour, that is, 1EJs. Let $E_r \in E_s$ where $E_r$ is an event leading to recovery, and $E_f \in E_s$ where $E_f$ is an event leading to failure. Within the sample space the conditions $E_r \cup E_f = E_s$, and $E_r \cap E_f = \emptyset$ exist.

Let the probability of the event $E_r$ and $E_f$ be $P(E_r)$ and $P(E_f)$ respectively, which leads to

$$P(E_r) + P(E_f) = 1 \tag{3.25.}$$

Now, assuming the event $E_s$ occurs randomly at a rate of $q$ events per hour, then the failure rate of the microprocessor is given by

$$\lambda(t) = q \cdot P(E_f) \tag{3.26.}$$

The failure rate is not time dependent giving,

$$\lambda(t) = \text{constant} = \lambda \tag{3.27.}$$

so the Markov Process is termed homogeneous.

Equation (3.25.) shows that the probability of an event leading to failure is dependent on the event not leading to recovery. The probability of recovery is itself dependent on the detection of erroneous behaviour.
1 - \lambda(t) \Delta t

C \rightarrow U

\lambda(t) \Delta t

U \rightarrow C

C : Controlled Behaviour State
U : Uncontrolled Behaviour State
\lambda(t) : Failure Rate

Figure 3.3. : Reliability Model
3.7.2. Probability of an Event E Leading to Failure, $P(E_f)$

A stringent specification of failure requires the uncontrolled (erroneous) execution of one or more instructions to be complete. The failure event is therefore any outcome, other than a restart, generated on or before completion of the first erroneously processed instruction following an IEJ. The detection capability of a restart outcome means that its operation is controlled. The probability of a failure event is expressed as,

$$P(E_f) = 1 - \sum_k (D(k)), \quad \text{where } k \leq 1. \quad (3.28.)$$

The cumulative density of the detection function incorporates the probability of detection through the two basic phases of erroneous execution: execution following an IEJ and SEJ. Properties of these phases are now given in respect of the microprocessor model. A jump outcome, other than restart, following an IEJ or SEJ can only occur when an instruction has completed its processing. Equations (3.7a.) and (3.8a.) yield $P_{IEJ}^{IEJ}(0)$ and $P_{SEJ}^{SEJ}(0)$ with nil probabilities. Equation (3.8a.) also yields $P_{RT}^{SEJ}(0)$ a nil probability. Evaluating the effective instruction distribution in the address space given by equation (3.7.) yields $P_{IEJ}^{IEJ}(1) = \beta$, and $P_{RT}^{IEJ}(1) = (1 - \beta)[P_{RT}(1) + \gamma.P_J(1)]$. Substituting equation (3.13.) into equation (3.28.) and applying these conditions gives,

$$P(E_f) = 1 - [\beta + (1 - \beta)(P_{RT}(1) + \gamma.P_J(1))] \quad (3.29.)$$

substituting equations (3.2.) and (3.6.),

$$P(E_f) = 1 - \left[\beta + (1 - \beta)(\frac{Ne_{RT}}{Ne_J}.P_J + \gamma.\left(\frac{Ne_{UJ} + Ne_{RN} + Ne_{S/W}}{Ne_J}\right).P_J)\right], \quad (3.30.)$$

substituting equation (3.4.),

$$P(E_f) = 1 - \left[\beta + \frac{(1 - \beta)}{N_L}.(Ne_{RT} + \gamma.[Ne_{UJ} + Ne_{RN} + Ne_{S/W}])\right], \quad (3.31.)$$
\[ P(E_f) = (1 - \beta) \left( 1 - \frac{N_{eRT} + \gamma [N_{eUJ} + N_{eRN} + N_{eS/w}]}{N_L} \right). \quad (3.32. \) \\

and from equation (3.25.),

\[ P(E_r) = \left[ \beta + \frac{(1 - \beta)}{N_L} \left( N_{eRT} + \gamma [N_{eUJ} + N_{eRN} + N_{eS/w}] \right) \right]. \quad (3.33. \)

### 3.7.3. Reliability Evaluation

Consider the respective probabilities, for the reliability model, of being in a controlled state or uncontrolled state at time \( t + \delta t \).

\[ P_c(t + \delta t) = \left[ 1 - \lambda(t) \cdot \delta t \right] P_c(t) , \quad (3.34. \) \\

\[ P_u(t + \delta t) = \left[ \lambda(t) \cdot \delta t \right] P_c(t) + P_u(t) . \quad (3.35. \)

Substituting equation (3.27.) and re-arranging equations (3.34.) and (3.35.) gives,

\[ \frac{P_c(t + \delta t) - P_c(t)}{\delta t} = -\lambda P_c(t) , \quad (3.36. \) \\

\[ \frac{P_u(t + \delta t) - P_u(t)}{\delta t} = \lambda P_c(t) , \quad (3.37. \)

and passing to a limit as \( \delta t \to 0 \) yields,

\[ \frac{d\{P_c(t)\}}{dt} = -\lambda P_c(t) , \quad (3.38. \) \\

\[ \frac{d\{P_u(t)\}}{dt} = \lambda P_c(t) . \quad (3.39. \)

Re-arranging and integrating equation (3.38.),
\[
\int \frac{d\{P_c(t)\}}{P_c(t)} = -\lambda \int dt, 
\]

(3.40.)

\[
ln\{P_c(t)\} = -\lambda t + C_1, 
\]

(3.41.)

\[
P_c(t) = exp\{-\lambda t + C_1\}. 
\]

(3.42.)

Applying the initial conditions; when \( t = 0 \), then \( P_c(t) = 1 \) and \( P_u(t) = 0 \) giving \( C_1 = 0 \). Hence equation (3.42.) becomes,

\[
P_c(t) = exp\{-\lambda t\}. 
\]

(3.43.)

Now re-arranging and integrating equation (3.39.) gives,

\[
\int d\{P_u(t)\} = \lambda \int P_c(t).dt, 
\]

(3.44.)

\[
\int d\{P_u(t)\} = \lambda \int e^{-\lambda t}dt, 
\]

(3.45.)

\[
P_u(t) = exp\{-\lambda t\} + C_2. 
\]

(3.46.)

Again applying the initial conditions; when \( t = 0 \), then \( P_c(t) = 1 \) and \( P_u(t) = 0 \) giving \( C_2 = 1 \). Hence equation (3.46.) becomes,

\[
P_u(t) = 1 - exp\{-\lambda t\}. 
\]

(3.47.)

The reliability of the microprocessor system in the model is given by the probability of the system remaining in a controlled state. That is,

\[
R(t) = P_c(t), 
\]

(3.48.)
\[ R(t) = \exp\{-\lambda t\}, \quad (3.49) \]

substituting equations (3.26.) and (3.27.) gives

\[ R(t) = \exp\left\{ -qt(1 - \beta) \cdot \left( 1 - \frac{N_{eRT} + \gamma [N_{eUJ} + N_{eRN} + N_{eS/W}]}{N_L} \right) \right\}. \quad (3.50) \]

### 3.7.4. Mean Time To Failure

The concept of Mean Time To Failure (MTTF), used in hardware reliability calculations, can be adapted for this work. It provides a method of comparing hardware and software reliability.

MTTF is defined as,

\[ MTTF^h = \int_0^\infty R(t).dt, \quad (3.51) \]

substituting (3.49.) gives,

\[ MTTF = \int_0^\infty \exp\{-\lambda t\}.dt, \quad (3.52) \]

\[ MTTF = \left[ \frac{\exp\{-\lambda t\}}{\lambda} \right]_0^\infty, \quad (3.53) \]

\[ MTTF = \frac{1}{\lambda}. \quad (3.54) \]

Substituting equations (3.26.) and (3.27.) gives,

\[ MTTF = \frac{1}{qP(E_f)}, \quad (3.55) \]

and substituting equation (3.32.) gives,

\[ MTTF = \left[ \frac{N_L}{q(1 - \beta)(N_L - N_{eRT} - \gamma [N_{eUJ} + N_{eRN} + N_{eS/W}])} \right]. \quad (3.56) \]
3.8. Availability

The availability of a microprocessor is the proportion of time for which the microprocessor is fully operational. An inherent assumption made when calculating availability is that the target system is maintained, i.e. the system has its operation restored after failure. Within the microprocessor model presented in this chapter, restoration is provided by the automatic execution of a recovery routine when a restart outcome is generated during erroneous execution. Availability $A_v$ is dependent on Mean Time To Failure (MTTF) and Mean Time To Repair (MTTR),

$$A_v = \left[ \frac{MTTF}{MTTF + MTTR} \right].$$  \hspace{1cm} (3.57.)

The Mean Time To Failure (MTTF) is defined by equation (3.54.). The Mean Time To Repair (MTTR) includes all processing before the microprocessor is restored to its fully operational state. In order to facilitate repair the microprocessor model must allow for the implementation of a recovery routine. Mean Time To Repair can be estimated using the following equation.

$$MTTR = \left( \frac{L_d + N_R}{I_f} \right),$$  \hspace{1cm} (3.58.)

where $L_d$ is the mean number of instructions processes erroneously before detection of erroneous behaviour (error latency from equation 3.20.), $N_R$ is the mean number of instructions executed after detection by the recovery routine, and $I_f$ is the mean number of instructions executed per hour.

Substituting equations (3.54.) and (3.58.) into (3.57.) gives an estimate for the availability of a microprocessor system that employs coverage for the erroneous behaviour described in this chapter.

$$A_v = \left[ \frac{\frac{1}{\lambda}}{\frac{1}{\lambda} + \left( \frac{L_d + N_R}{I_f} \right)} \right],$$  \hspace{1cm} (3.59.)

$$A_v = \left( \frac{1}{1 + \frac{1}{I_f} \left[ L_d + N_R \right]} \right).$$  \hspace{1cm} (3.60.)
3.9. Summary

The event, induced by a temporary fault, initiating erroneous microprocessor behaviour is defined as an Initial Erroneous Jump (IEJ). Erroneous behaviour is characterized by periods of linear erroneous execution interspersed by erroneous jumps. The characteristics of erroneous execution following an IEJ or SEJ can be statistically modelled. Error latency is derived from detection capabilities in the microprocessor model. Failure mode analysis is used within a Markov Model to determine functions of reliability and Mean Time To Failure (MTTF). Availability of the microprocessor system by the model is estimated under the assumption that a recovery routine is implemented. These functions allow the comparative assessment of recovery techniques to be made for software disrupted by temporary faults in a form which can be related to calculations for permanent faults in digital systems.
Chapter 4

EVALUATING MICROPROCESSOR BEHAVIOUR

4.1. Introduction .................................................. 54
4.2. Instruction Mix Analysis ........................................ 54
4.3. Architecture Parameters for the Microprocessor Model .......... 55
   4.3.1. Built-In Microprocessor Detection Capability .......... 55
   4.3.2. Modelling the Microprocessor Program Counter .......... 58
   4.3.3. Instruction Processing Exceptions .................... 58
4.4. Evaluating Microprocessor Models of Erroneous Behaviour ...... 58
   4.4.1. 8-Bit Processor Evaluations .......................... 64
   4.4.2. 16-Bit Processor Evaluations .......................... 65
   4.4.3. 32-Bit Processor Evaluations .......................... 66
4.5. Catastrophic Failure Analysis ................................ 67
4.6. Recovery Through The Detection of Erroneous Execution ....... 69
4.7. Evaluating Microprocessor Reliability ........................ 67
4.8. Evaluating Microprocessor Availability ....................... 74
4.9. Conclusions ................................................ 76
CHAPTER FOUR
EVALUATING MICROPROCESSOR BEHAVIOUR

4.1. Introduction

A model of erroneous microprocessor behaviour is presented in the previous chapter. This chapter applies the model to a selection of target processors which include 8, 16, and 32-bit architectures using instruction mix analysis.

The chapter commences with the derivation of parameter values required by the model for the microprocessors under investigation. The content of the address space is assumed to be random for the purpose of statistical analysis. Characteristics of erroneous execution are described for each of the target processors. In particular the possibilities of catastrophic failure and recovery are investigated because of their influence on the dependability of a microprocessor based system.

Finally, the reliability of a microprocessor system is considered. A comparison is made between the microprocessors modelled using the reliability parameter Mean Time To Failure (MTTF). Reliability calculations assume that the host processor has no recovery capability. Many of the microprocessors investigated, however, do have a recovery capability provided by the detection attribute of instructions that develop a restart outcome. In order to assess the performance of such maintained microprocessor systems, the microprocessor systems availability is evaluated.

4.2. Instruction Mix Analysis

The model of erroneous microprocessor behaviour presented in Chapter 3 is evaluated using instruction mix analysis. This involves determining the mean instruction distribution for a section of memory and modelling the expected behavioural characteristics. Such modelling is abstracted from actual microprocessor behaviour which is dependent on instruction sequences. Nevertheless, instruction mix analysis does provide a valuable method for indicating the nature of erroneous microprocessor behaviour and its variation between target processors.
This chapter evaluates the erroneous behaviour of a selection of microprocessors. The processors considered are: Motorola 6800, Intel 8048, Intel 8085, Intel 8086, Motorola 68000, Motorola 68010, AMD Am29000, Motorola 68020, and Intel 80386. The microprocessors are chosen to include common application examples of 8, 16, and 32-bit architectures. In addition, these processors implement various design features including reduced instruction sets, ROM instruction decoders, and instruction processing exceptions.

The instruction mix of the target processors is shown in Table 4.1., data being collated from Appendix A. Each instruction set is divided into instruction state outcomes, non-jump, restart, undefined jump, return, and stop/wait. The instruction mix of the undefined instructions within the processor instruction sets is detailed in Table 4.2. Some of these instruction sets contain unspecified instructions which through experiment have been defined [Halse, 1984].

4.3. Architecture Parameters For The Microprocessor Model

This section determines the parameter values for the model of erroneous microprocessor behaviour which are dependent on the processor architecture.

4.3.1. Built-In Microprocessor Detection Capability

The Motorola 68000 family of microprocessors execute instruction op-codes residing at an even byte boundary location in the address space. If an attempt to process an instruction op-code at an odd byte boundary location in the address space is made, then an immediate 'restart' outcome is entered. The outcome is assumed to be immediate because the 'odd byte address' exception does not process the instruction op-code concerned but rather after a few clock cycles determines an illegal odd address has been accessed. The \( \beta \) parameter defined in equations (3.7.) and (3.8.) will therefore have an inherent value of 0.5 for this family of microprocessors.

The remaining microprocessors evaluated within this chapter do not have this or a similar method of generating a restart outcome during instruction processing. The models of these microprocessors therefore define \( \beta \) to be zero.
<table>
<thead>
<tr>
<th>Instruction Type</th>
<th>8-Bit</th>
<th>16-Bit</th>
<th>32-Bit</th>
</tr>
</thead>
<tbody>
<tr>
<td>Microprocessor</td>
<td>MC</td>
<td>Intel</td>
<td>AMD</td>
</tr>
<tr>
<td>8-Bit</td>
<td>6800</td>
<td>8048</td>
<td>8086</td>
</tr>
<tr>
<td></td>
<td>173</td>
<td>183</td>
<td>209</td>
</tr>
<tr>
<td>Non-Jump</td>
<td>0</td>
<td>1</td>
<td>2</td>
</tr>
<tr>
<td></td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>Stop/Wait</td>
<td>2</td>
<td>2</td>
<td>5(4)</td>
</tr>
<tr>
<td>Return</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr>
<td>Unsnepecific Jump</td>
<td>1</td>
<td>0</td>
<td>1(16)</td>
</tr>
<tr>
<td>Restart</td>
<td>1</td>
<td>1(4)</td>
<td>2</td>
</tr>
<tr>
<td>Undefined</td>
<td>59</td>
<td>26</td>
<td>10</td>
</tr>
<tr>
<td>TOTAL</td>
<td>256</td>
<td>256</td>
<td>340</td>
</tr>
</tbody>
</table>

Note: Data from Appendix A. Brackets denote the number of those instructions which are dependent on a condition for their operation.
<table>
<thead>
<tr>
<th>Instruction Type</th>
<th>8-Bit</th>
<th>16-Bit</th>
<th>32-Bit</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>MC</td>
<td>Intel</td>
<td>Intel</td>
</tr>
<tr>
<td></td>
<td>6800</td>
<td>8048</td>
<td>8085</td>
</tr>
<tr>
<td>Non-Jump</td>
<td>49</td>
<td>22</td>
<td>7</td>
</tr>
<tr>
<td>Stop/Wait</td>
<td>4</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Return</td>
<td>2</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Unspecified Jump</td>
<td>4(1)</td>
<td>4(3)</td>
<td>2(2)</td>
</tr>
<tr>
<td>Restart</td>
<td>0</td>
<td>0</td>
<td>1(1)</td>
</tr>
<tr>
<td>TOTAL</td>
<td>59</td>
<td>26</td>
<td>10</td>
</tr>
</tbody>
</table>

*Note:* Data from Appendix A. Brackets denote the number of those instructions which are dependent on a condition for their operation.

**Table 4.2.** Microprocessor Undefined Instruction Evaluation
4.3.2. Modelling the Microprocessor Program Counter

Most of the microprocessors evaluated in this chapter have a single program counter which is capable of specifying every location in the address space. The Intel 8086 and 80386 microprocessors, however, have an internal address bus smaller than its external address bus. To derive a location in the 8086 address space it uses two *program counters*. The address put on the external address bus is the sum of the 16-bit Instruction Pointer and the 16-bit Control Segment Register which has already been left shifted four bits. Hence a 20-bit location is put on the external address bus. This method of deriving the address bus value means that corruption of either the Instruction Pointer or the Control Segment Register corrupts the microprocessor's effective *program counter*. The model considers the generation of an erroneous jump as corruption of the single effective program counter.

4.3.3. Instruction Processing Exceptions

Some microprocessor instruction sets include instructions which generate a restart outcome when an abnormal processing condition is identified. A good example common to most microprocessors is the 'divide by zero' processing exception. Processing exceptions should not be confused with conditional instruction outcomes where a test is incorporated into the instruction operation in order to determine whether or not a task is performed, e.g. conditional branch. For the purpose of statistical analysis within this chapter, instruction processing exceptions are considered not to occur. The $\gamma$ parameter used in equation (3.7.) and (3.8.) is therefore zero.

4.4. Evaluating Microprocessor Models of Erroneous Behaviour

Erroneous execution within the used area is initially modelled by execution through a memory of random content. The two main behavioural characteristics, linear erroneous execution and erroneous jumps, are investigated.

A selection of microprocessor instruction sets are examined in Table 4.3. The distribution of instructions through the address space of random content for many target processors is the same as their instruction set mix. In many instances, this is
<table>
<thead>
<tr>
<th>Instruction Type</th>
<th>8-Bit</th>
<th>16-Bit</th>
<th>32-Bit</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>6800</td>
<td>8048</td>
<td>8085</td>
</tr>
<tr>
<td>Non-Jump</td>
<td>89.648%</td>
<td>84.570%</td>
<td>89.648%</td>
</tr>
<tr>
<td>Stop/Wait</td>
<td>1.953%</td>
<td>0.000%</td>
<td>0.391%</td>
</tr>
<tr>
<td>Return</td>
<td>1.563%</td>
<td>0.782%</td>
<td>1.172%</td>
</tr>
<tr>
<td>Unspecified Jump</td>
<td>6.445%</td>
<td>14.648%</td>
<td>4.688%</td>
</tr>
<tr>
<td>Restart</td>
<td>0.391%</td>
<td>0.000%</td>
<td>3.711%</td>
</tr>
<tr>
<td>Undefined</td>
<td>*</td>
<td>*</td>
<td>*</td>
</tr>
</tbody>
</table>

*Notes*: Instruction set data from Appendix A. The instruction set distributions for the Intel 8086 and 80386 microprocessors have been manipulated for random data interpretation. The symbol '*' denotes that any undefined instructions have been grouped with their associated instruction type. Undefined instructions without the symbol '*' are specified as having a restart outcome. Conditional jump instructions are evaluated as having an equal chance of activation or non-activation, reference equations (3.3) and (3.4).

Table 4.3: Random Data Instruction Interpretation
due to the processor implementing a ROM instruction decoder, within its architecture, which specifies an instruction for all possible opcode bit formats. The Intel 8086 and 80386, however, require their instruction set mix to be manipulated to reflect a random data instruction mix.

Linear erroneous execution state outcomes are plotted as cumulative functions using equation (3.9.) in Figure 4.1. (8-bit processors), Figure 4.2. (16-bit processors), and Figure 4.3. (32-bit processors). The cumulative probability that linear erroneous execution has a particular state outcome after processing a number of instructions, indicated by the instruction index, is shown by the vertical width of the labelled area at that point. The features of each plot to notice are:

i) *Continued Linear Erroneous Execution*

This is the vertical width of the area labelled ‘linear erroneous execution’ at each instruction index. The larger the vertical width, the greater the probability that linear erroneous execution has continued through the number of instructions indicated by the instruction index.

ii) *Termination by Erroneous Jump*

This is the vertical distance of the combined areas beneath the area labelled ‘linear erroneous execution’ at each instruction index. The larger the vertical width, the greater the probability that linear erroneous execution has been terminated by the present (or any preceding) processed instruction indicated by the instruction index.

iii) *Stop/Wait Outcome*

This is the vertical width of the area labelled ‘stop/wait’ at each instruction index. This outcome represents a catastrophic failure involving the termination of processor activity until the appropriate external interrupt re-initiates operation. The likelihood...
Figure 4.1. : 8-Bit Microprocessor Linear Erroneous Execution
Figure 4.2: 16-Bit Microprocessor Linear Erroneous Execution
of such a restoring event occurring is unpredictable. The larger the vertical width of this area, the higher the probability of this outcome.

iv) Restart Outcome

This is the vertical width of the area labelled 'restart' at each instruction index. The restart outcome is the only one that when executed erroneously is considered to generate a controlled outcome. It is for this reason that it will be used for detecting erroneous execution. Hence the 'inherent' detection capability of a microprocessor may be observed by thickness of the 'restart' area. The larger the vertical width, the greater the probability that linear erroneous execution has been detected, and hence terminated, by the present (or any preceding) processed instruction indicated by the instruction index.

The investigation of linear erroneous execution and erroneous jumps gives an indication of the character and attributes of erroneous behaviour.

4.4.1. 8-Bit Processor Evaluations

All the 8-bit microprocessors evaluated exhibit a high probability of periods of linear erroneous execution exceeding ten instructions, see Figure 4.1. In particular the Intel 8048 and 8085 processor models suggest 31% and 27%, respectively, of the periods of linear erroneous execution are expected to terminate within ten instructions. The Motorola 6800 is about twice as likely to terminate a period of linear erroneous execution within ten instructions.

A SEJ terminates all periods of linear erroneous execution in the 8048 microprocessor system. However, for the other two 8-bit processors some periods of linear erroneous execution are terminated by recovery or catastrophic failure. Approximately 79% of the periods of linear erroneous execution are terminated with a SEJ for the Motorola 6800 processor, a similar value of 65% is modelled for the Intel 8085 microprocessor. Although the Intel 8048 processor will never catastrophically fail in
the model, it will also never inherently recover. Within the Motorola 6800 processor non-SEJ terminations of linear erroneous execution as catastrophic failure are expected to occur four more times than recovery. The Intel 8085 processor has a converse relationship, recovery being expected to occur six more times than catastrophic failure.

In summary, the model for the Motorola 6800 microprocessor suggests periods of linear erroneous execution of approximately ten instructions which are approximately five times more likely to terminate in an SEJ than failure, and the chance of recovery is small. The Intel 8048 processor model predicts much longer periods of linear erroneous execution which will always terminate with an SEJ, no catastrophic failure of recovery is possible. Although the Intel 8048 processor will never catastrophically fail in the model, failure is implied by the fact that erroneous execution never ceases. Within the model for the Intel 8085 microprocessor periods of linear erroneous execution are expected of a similar length to those evaluated for the Intel 8048 processor, of which approximately one third terminations are expected to generate recovery, the vast majority of the remaining terminations producing a SEJ.

4.4.2. 16-Bit Processor Evaluations

The instruction mix analysis of erroneous execution presented for the 16-bit microprocessors in Figure 4.2. suggests that these processors have shorter periods of linear erroneous execution than those modelled for the 8-bit processors. The model predicts that in excess of 80% of linear erroneous execution periods will terminate before their tenth processed instruction. The Intel 8086 processor has a mean expected period of linear erroneous execution longer than that for the Motorola 68000 and 68010 microprocessors. This is due to the influence of the instruction set and architecture. The 68000 detection capability is considerably influenced by the 'odd address exception' processor facility. This exception yields a restart outcome for any access to an instruction located at an odd byte address in the memory map.

Within the Intel 8086 processor model there is a 90% probability that a period of linear erroneous execution is terminated by a SEJ. This is much larger than that for the Motorola 68000 and 68010 processors whose model suggests the likelihood of
the same outcome as less than 3%. Hence, not only are the periods of linear erroneous execution expected to be shorter for the Motorola processors than the Intel, but also the Motorola processors will have fewer periods of linear erroneous execution before either catastrophic failure or recovery is attained. The Intel 8086 processor model has a similar probability of linear erroneous execution being terminated by recovery through a restart outcome or catastrophic failure through a stop/wait outcome. The Motorola 68000 processors yields very different results. The probability of a stop/wait outcome for the Motorola 68000 processors is too small to be shown on Figure 4.2(b & c) whilst the probability of a restart outcome and hence recovery is approximately 93% after just four processed instructions during linear erroneous execution.

4.4.3. 32-Bit Processor Evaluations

The result of model application for a selection of 32-bit processors is shown in Figure 4.3. The microprocessors evaluated are the Advanced Micro Devices Am29000 (Version D), the Motorola 68020, and the Intel 80386.

The influence in the model of the ‘odd byte address’ exception of the Motorola 68020 microprocessor is clearly seen as the 50% intercept in Figure 4.3(b). This feature greatly reduces the mean expected period of linear erroneous execution as previously described for the Motorola 68000 and 68010 processors. The Advanced Micro Devices Am29000 achieves similar periods of erroneous execution without this architectural feature. Its performance relies totally on its instruction set attributes. Both the AMD Am29000 and Motorola 68020 models predict approximately 90% of linear execution periods of five instructions to terminate. This characteristic is not shared by the Intel 80386 processor model in which approximately 63% of linear erroneous execution periods of five instructions are expected to terminate.

Termination of linear erroneous execution for the AMD Am29000 and Motorola 68020 processor models have about a 90% expectation of of generating a restart outcome and hence detection and recovery. The Intel 80386 does not compare so favourably with a 37% chance of a restart terminating linear erroneous execution. The
Intel 80386 restart probability is, however, higher than all 8-bit and 16-bit processors considered earlier with the exception of the Motorola 68000 microprocessor family.

The probability for the Motorola 68020, as with other members of the Motorola 68000 processor family considered earlier, of catastrophic failure through a stop/wait outcome is too small to be shown in Figure 4.3(b). At 0.0015% it is very small when compared with the 8 and 16-bit processor evaluations. Although the AMD Am29000 and Intel 80386 processors have a larger probability of 0.391% of a stop/wait outcome which is visible in Figure 4.3(a & c), it is still small in relation to other processor model evaluations.

4.5. Catastrophic Failure Analysis

The probability of a stop/wait outcome is identified as a catastrophic failure. Such a processing outcome stalls execution until an external interrupt generates a restart outcome and hence initiates recovery. The occurrence of a stop/wait outcome during periods of linear erroneous execution has been investigated for a selection of 8, 16, and 32-bit microprocessors. It is valuable to further consider the probability of catastrophic failure as a function of general erroneous execution. Figure 4.4. shows the predicted chance of catastrophic failure using instruction mix analysis for a selection of processors. The graph is developed using equation (3.21) and data from Table 4.3.

The Motorola 6800 microprocessor has a significantly higher probability of producing a catastrophic failure than the other evaluated processors. The Intel 8086 has an 8% probability of catastrophic failure after ten instructions have been processed during erroneous execution which is just under half that expected for the Motorola 6800. The remaining processors are plotted as two groups in Figure 4.4. One group, comprising of the Intel 8085, AMD Am29000, and Intel 80386, has twice the expectation of catastrophic failure of the other group of the Motorola 68000 family processors. This can be illustrated by a comparison between the Intel 8085 and AMD Am29000 which have the respective likelihoods of 4% and 2% for catastrophic failure before ten instructions are erroneously processed.
Figure 4.4. : Catastrophic Failure - Instruction Mix Analysis
4.6. Recovery Through The Detection of Erroneous Execution

The probability of detecting erroneous execution is dependent on the generation of a restart outcome. This can be achieved through either an instruction's natural outcome, or a hardware induced outcome where a processing exception occurs. Although the probability of a restart outcome has been considered during periods of linear erroneous execution, it is valuable to evaluate the general function of this outcome during erroneous behaviour. Figure 4.5. shows the probability of a restart outcome during erroneous execution for a selection of processors. The graph is plotted using equation (3.13) and data from Table 4.3.

Both the Motorola 6800 and Intel 8086 have very low probabilities of generating a restart outcome, the former having less than half the expectation of the other. The Intel 8085 shows a significant improvement with a 32% chance of initiated recovery after ten instruction are erroneously processed. This represents over a four-fold improvement on the Intel 8086 processor. The Intel 80386 also shows an enhanced performance with approximately 63% probability that erroneous execution is detected after ten processed instructions. The remaining processors, the AMD Am29000 and Motorola 68000 processor family, have a much better performance. Their instruction mix models suggest the likelihood of erroneous execution being detected after ten erroneously processed instructions is in excess of 97%. The 'odd byte address' exception is a major contributory factor for the performance of the Motorola 68000 processor family. This influence of the exception is shown as the 50% intercept value for the Motorola 68000 processor plots. The performance of the AMD Am29000 processor is dependent solely on its instruction set mix.

4.7. Evaluating Microprocessor Reliability

Microprocessor reliability is calculated under the ‘worst case’ assumption that any processor activity other than immediate detection is considered as system failure. Detection is provided by those instructions which generate a restart outcome. Immediate detection requires a restart outcome to be attained before the second instruction of erroneous behaviour is processed.
Figure 4.5: Recovery Through Detection - Instruction Mix Analysis
The reliability of the microprocessors modelled, \( R(t) \), is assessed using equations (3.49.) with the substitution of equations (3.27.), (3.26.), and (3.32.), to yield,

\[
R(t) = \exp \left\{ -qt(1 - \beta) \cdot \left( 1 - \frac{Ne_{RT} + \gamma [Ne_{UJ} + Ne_{RN} + Ne_{SW}]}{N_L} \right) \right\}, \tag{4.1.}
\]

where \( q \) is the event rate initiating erroneous behaviour, \( \beta \) is the probability of hardware detection of erroneous execution, \( \gamma \) is the probability of a processing exception, \( Ne_{RT} \), \( Ne_{UJ} \), \( Ne_{RN} \), and \( Ne_{SW} \) are the effective numbers of restart, unspecified jump, return and stop/wait outcome generating instructions within the instruction mix, and finally, \( N_L \) is the number of instructions in the instruction mix.

Details of the instruction mix for reliability evaluation are shown in Table 4.3. The parameter \( \gamma \) is defined in section 4.3.3., for the purpose of statistical analysis, to be zero so equation (4.1.) becomes,

\[
R(t) = \exp \left\{ -qt(1 - \beta) \cdot \left( 1 - \frac{Ne_{RT}}{N_L} \right) \right\}. \tag{4.2.}
\]

The parameter \( \beta \) is set to zero except for the Motorola 68000 microprocessor family where it is set to 50% to represent the 'odd byte address' exception facility. The determination of processor parameters is discussed in section 4.3.

The reliability curves evaluated for the selection of 8, 16, and 32-bit processors considered in this chapter are shown in Figure 4.6. Unfortunately some processor evaluations are so similar that their individual reliability curves cannot be distinguished. In particular this occurs for the Motorola 68000 and 68010 processors, and the Motorola 6800, Intel 8048, and Intel 8086 processors. The variation in the respective processor reliability for these clusters of curves is indicated by the Mean Time To Failure (MTTF) calculations shown in Table 4.4. These calculations assume the events which initiate erroneous behaviour to occur once per month or every 714 hours.

As discussed earlier the Intel 8048 has no detection capability for erroneous execution and hence its MTTF is equivalent to the event rate. The remaining microprocessors have some degree of detection capability depending on instruction mix
Figure 4.6: Microprocessor Reliability - Instruction Mix Analysis
<table>
<thead>
<tr>
<th>Microprocessor</th>
<th>Evaluation Parameters</th>
<th>Mean Time To Failure</th>
</tr>
</thead>
<tbody>
<tr>
<td>Type</td>
<td>Class</td>
<td>$\gamma$</td>
</tr>
<tr>
<td>6800</td>
<td>8-Bit</td>
<td>0.0</td>
</tr>
<tr>
<td>8048</td>
<td>8-Bit</td>
<td>0.0</td>
</tr>
<tr>
<td>8085</td>
<td>8-Bit</td>
<td>0.0</td>
</tr>
<tr>
<td>8086</td>
<td>16-Bit</td>
<td>0.0</td>
</tr>
<tr>
<td>68000</td>
<td>16-Bit</td>
<td>0.0</td>
</tr>
<tr>
<td>68010</td>
<td>16-Bit</td>
<td>0.0</td>
</tr>
<tr>
<td>29000</td>
<td>16-Bit</td>
<td>0.0</td>
</tr>
<tr>
<td>68020</td>
<td>16-Bit</td>
<td>0.0</td>
</tr>
<tr>
<td>80386</td>
<td>16-Bit</td>
<td>0.0</td>
</tr>
</tbody>
</table>

Notes: MTTF derivation from equation (3.49) and data from Table 4.3. Transient events initiating erroneous microprocessor behaviour are assumed to occur once per month (every 714 hours).

Table 4.4: Microprocessor 'Mean Time To Failure' Evaluation
and architectural influences. This detection capability has little effect of the MTTF for the Motorola 6800, Intel 8086, Intel 8085, and Intel 80386 processors. A more significant improvement is shown by the AMD Am29000 processor which has an MTTF approximately double the inter-arrival event period. Finally, the Motorola 68000 processors have the best MTTF which are about three times the inter-arrival event rate. The performance of the Motorola processors can be largely attributed to their 'odd byte address' exception facility which will, under the model conditions, immediately detect half the initiated periods of erroneous microprocessor behaviour.

4.8. Evaluating Microprocessor Availability

The availability of a microprocessor system can only be evaluated if the system is maintained, i.e. manual or automatic repair is facilitated. The model presented for the erroneous behaviour of a microprocessor in Chapter 3 assumes that restart outcome generating instructions can be used to provide a detection capability for erroneous execution. Hence equation (3.60.) can be used to calculate availability,

\[ A_v = \frac{1}{1 + \frac{\lambda}{I_f} (L_d + N_R)} \]  

(4.3.)

where \( \lambda \) is the rate of failure events (per hour) that initiate erroneous behaviour, \( I_f \) is the mean number of instructions executed per hour by the processor, \( L_d \) is the error detection latency, and \( N_R \) is the number of instructions to be processed in the recovery routine.

A three dimensional availability plot shown in Figure 4.7. describes how availability varies with different event rates and error detection latencies. The solid plane denotes the use of a recovery routine with 50 instructions, whilst the dashed plane shows the effect of a larger recovery routine of 1000 instructions.

In order to aid comprehension of the equation (4.3.) plot consider the following example. A microprocessor operates at 0.4 MHz and has a mean instruction processing time of 400 cycles. Failure events initiating erroneous microprocessor behaviour are expected to occur once per second. Hence,
MODEL PERFORMANCE

DASHED SURFACE : VERSION 1
BOLD SURFACE : VERSION 2

Figure 4.7. : Microprocessor Availability - Instruction Mix Analysis
The mean error detection latency \( L_d \) is ten instructions, and the recovery routine consists of 50 instructions \( N_R \). The availability of the system can now be determined using equation (4.3.):

\[
\frac{\lambda}{I_f} = \frac{1.400}{400000} = 0.001
\]  

(4.4.)

\[
A_v = \left( \frac{1}{1 + 0.001[10 + 50]} \right),
\]

(4.5.)

\[
A_v = 94.3\%.
\]  

(4.6.)

The calculated availability can be located on Figure 4.7. by mapping the event rate (per instruction) \( \frac{\lambda}{I_f} \), and error latency (number of instructions) \( L_d \) on the solid surface representing 50 instructions in the recovery routine \( N_R \). The actual operating frequency of a processor may be higher and the mean instruction cycle time shorter than that used in this example. These parameter values would increase the calculated value of system availability.

4.9. Conclusion

A model based on probability theory is used to predict microprocessor erroneous behaviour. Characteristics of linear erroneous execution and their termination are compared for a selection of 8, 16, and 32-bit microprocessors using instruction mix analysis. Whilst this method of analysis does not reflect the time dependency of erroneous behaviour on the instruction sequence, it does provide a valuable insight into the patterns of erroneous behaviour.

Two important processing outcomes are studied: the probability of catastrophic failure and recovery. Catastrophic failure is defined as the entry into a stop/wait state where normal instruction activity ceases. This state can only be exited by the generation of an external interrupt. Such a reset signal cannot be relied upon because they have not been incorporated into the microprocessor system design. The
other important processing outcome is recovery. This is attained through entry into a restart state which by its definition establishes controlled microprocessor activity.

The evaluation of recovery leads to the determination of reliability for the microprocessor system. The selection of target processors is compared using the reliability parameter Mean Time to Failure (MTTF). Many processors show little improvement in the MTTF with respect to the inter-arrival event period. The best results are obtained from the microprocessor models for the AMD Am29000 and Motorola 68000 family which have a three-fold increase in their MTTF compared with the inter-arrival event period.

Finally, the chapter concludes with an investigation of microprocessor system availability. A general plot is shown to indicate the availability of a processor system implementing recovery routine of two sizes and an example discussed. Availability calculations are only made possible where a microprocessor system has an automatic detection and recovery capability for erroneous behaviour.
Chapter 5

DETECTING ERRONEOUS MICROPROCESSOR BEHAVIOUR

5.1. Introduction ................................................................. 78
5.2. Address Space Allocation ............................................. 78
5.3. Erroneous Execution in the Unused Area of the Address Space ...... 79
  5.3.1. The Initial Erroneous Jump Characteristic ..................... 81
  5.3.2. Detecting Erroneous Execution .................................. 81
    5.3.2.1. A Software-Based Technique ............................... 83
    5.3.2.2. Watchdog Timers and Smart Watchdogs ................. 84
    5.3.2.3. The Access Guardian Proposal ............................ 84
5.4. Erroneous Execution in the Used Area of the Address Space ........ 87
  5.4.1. The Subsequent Erroneous Jump Characteristic .................. 87
  5.4.2. Detection Using Software Implemented Fault Tolerance ........ 91
    5.4.2.1. Program Areas ............................................. 91
    5.4.2.2. Data Areas and Reserved Input/Output Areas .......... 96
5.5. The Overheads of Implementing Fault Tolerance ................. 97
  5.5.1. Hardware Fault Tolerance ...................................... 97
  5.5.2. Software Implemented Fault Tolerance ....................... 98
5.6. A Fault Tolerant Strategy for Microprocessor Controllers .......... 99
5.7. Summary ................................................................. 104
5.1. Introduction

The previous chapters of this thesis have proposed and evaluated a model of erroneous behaviour for a selection of microprocessors. This chapter considers methods of exploiting characteristics of erroneous execution as part of a detection strategy. In particular the characteristics of Initial Erroneous Jump (IEJ) and Subsequent Erroneous Jump (SEJ) are identified for this purpose.

The chapter commences by defining the functional allocation of a microprocessor's address space: the used area consisting of program, data, and reserved input/output areas; the unused area consisting of physically implemented and non-existent memory. Detection techniques are then considered for these functional address space allocations. Particular proposals are made using software techniques for the program and physically implemented unused areas of the address space. In instances where a microprocessor does not have its address space totally implemented with physical memory, a proposed hardware unit called an Access Guardian can be implemented to provide detection of unused area access.

The application of fault tolerance incurs a physical and/or performance overhead to the target microprocessor system. Each of the detection techniques considered within this chapter has its required overhead evaluated.

5.2. Address Space Allocation

Within the system memory, areas can be defined which have different execution characteristics dependent on the defined utilization of that memory area. For the purpose of statistical analysis, the memory is divided into functional areas. Each functional area will exhibit a particular instruction distribution.

Initially the address space is divided into two distinct areas, used area and unused area, such that
Total Address Space \( (\text{bytes}) \) = Used Area \( (\text{bytes}) \) + Unused Area \( (\text{bytes}) \) \hspace{1cm} (5.1.)

The unused area is considered to include those address space locations not reserved for, or required by the processor during correct operation. This area can be subdivided depending on the implementation of the resident address space,

- physically implemented memory, and
- non-existent memory.

The used area contains locations reserved for external communication, and locations for instructions and the data they require for correct operation. The used area can be functionally subdivided into,

- program area,
- data area, and
- input/output reserved area.

The program area is considered to contain all software instructions, opcodes and operands. The data area is considered to contain any information required by the software, i.e. data structures including stacks and linked lists. The input/output reserved area contains those locations specified as reserved for communication ports and exception targets.

The functional allocation of the address space into its constituent areas is shown in Figure 5.1. It should be realized that for microprocessor systems, allocation of the address space rarely involves contiguous functional areas.

5.3. Erroneous Execution in the Unused Area of the Address Space

This section initially describes the character of modelled erroneous execution in the unused area of the address space. In particular the Initial Erroneous Jump (IEJ), first described in Chapter 3, is identified with erroneous execution in the unused area. Software and hardware detection techniques are presented which exploit this characteristic. Finally, the design of the hardware technique is detailed.
Figure 5.1: Functional Address Space Allocation
5.3.1. The Initial Erroneous Jump Characteristic

The event associated with the initiation of erroneous behaviour is the Initial Erroneous Jump (IEJ). The destination of an IEJ has been assumed to be a random location in the address space in the model of erroneous execution presented in Chapter 3. Hence the probability that the destination of an IEJ will be in the unused area, \( P_{IEJ}(\text{Unused Area}) \), is dependent on the proportion of the total address space occupied by the used area. That is,

\[
P_{IEJ}(\text{Unused Area}) = \left\{ \frac{\text{Unused Area (bytes)}}{\text{Total Address Space (bytes)}} \right\}
\]

(5.2.)

and,

\[
P_{IEJ}(\text{Unused Area}) = 1 - P_{IEJ}(\text{Used Area})
\]

(5.3.)

A linear relationship between \( P_{IEJ}(\text{Used Area}) \) and \( \text{Used Area} \) for a selection of address space sizes is shown in Figure 5.2. In particular the graph describes the IEJ characteristic exhibited by the Motorola 6800, Intel 8048, and Intel 8085 microprocessors which have a 64 KByte address space, the Intel 8086 microprocessor which has a 1 MByte address space, and finally the Motorola 68000 and 68010 microprocessors which have a 16 MByte address space.

Consider a particular software application on two microprocessors whose only difference is the size of their respective address space. It is evident from equation (5.1.) that the microprocessor with the larger address space will have an unused area occupying a greater proportion of the address space. The IEJ characteristic of equation (5.2.) highlights the profitability of detecting processor execution in the unused area which by definition is erroneous, particular benefit being offered by those microprocessors with a larger address space.

5.3.2. Detecting Erroneous Execution

Access to the unused area of the address space is indicative of erroneous execution. Therefore a technique to detect such access is required to prevent erroneous execution. The unused area may partially or entirely include physical memory locations. These
Figure 5.2 : The IEJ Characteristics
locations may occur as a single contiguous block or as a collection of blocks distributed throughout the address space.

5.3.2.1. A Software-Based Technique

Software constructs can be placed in the blocks of physically implemented unused address space to provide detection of erroneous execution. The principles of their structure are as follows:

i) All instructions within the construct should be without an operand requirement in order that there is only one possible program flow path through the memory.

ii) The program flow path of the software construct must have one or more termination points where recovery action is initiated, otherwise no recovery is possible.

The software construct adopted by particular processor systems is influenced by the availability of instructions in their respective instruction sets.

Error latency can be minimized by the placing of restart instructions without an operand requirement at every location in a block of unused physical memory. Processing any one of these restart instructions initiates execution of a recovery routine for the application software. However, not all microprocessors possess such a restart instruction, e.g. the Intel 8048 processor. For microprocessors like these, a software construct called the 'snake' can be used [Pearson, 1983]. The snake construct involves placing a chain of ‘no-operation’ instructions, without an operand requirement, at each location except the last in a block of unused memory. The terminating location in the block holds a jump instruction. The action of the snake is to ‘slide’ erroneous execution initiated on it to the jump, which then directs execution to the recovery routine. The error latency associated with the processing time required to ‘slide’ down the snake can be reduced by placing intermediate jumps, whose destination is the recovery routine, within the blocks. However, in order to preserve detection of erroneous execution at every location in a block of unused physical memory, such intermediate jumps should not require any operands.
Many microprocessor systems do not implement their entire unused area in physical memory, and hence a complementary or alternative detection technique is required for non-existent memory in the address space.

5.3.2.2. Watchdog Timers and Smart Watchdogs

As described in Chapter 2, watchdog timers and smart watchdogs can be implemented in a microprocessor system to detect access to the unused areas of the processors address space.

IBM [1986] describe an analogue watchdog timer which they have developed for microprocessor controllers. It was developed because automatic reset is required when an embedded microprocessor in a controller improperly executes code. Watchdog timers require the application software to correctly reset the watchdog timer. Hence the programmer requires a prerequisite knowledge of the target processor system in order to satisfy the watchdog timer requirements. Such programming practice is much slower than code generation for microprocessor systems employing a transparent fault tolerant technique.

Namjoo & McCluskey [1982] propose a smart watchdog, based on an additional processor, to detect (in a transparent fashion to the application software) unused area access. The watchdog monitors access to the unused area. On an invalid access, the watchdog signals a hardware interrupt to the host microprocessor indicating detection of erroneous behaviour. The host machine then processes the interrupt in a manner that will initiate recovery. It should be realized that the additional processor implemented by the smart watchdog is susceptible to error generation in the way as the microprocessor it is protecting.

5.3.2.3. The Access Guardian Proposal

The redundancy of the smart watchdog can be reduced by implementing its detection function in a dedicated hardware unit. The design of such a unit, referred to as an ‘Access Guardian’, is proposed here. The Access Guardian provides independent on-line monitoring and detection of unused area access by the microprocessor.
The topology of a microprocessor system incorporating an Access Guardian is shown in Figure 5.3.

The Access Guardian detects whether or not invalid address lines are activated, and if so, impresses an interrupt signal to the microprocessor. This induces a restart state outcome within the microprocessor which then directs execution to a recovery routine for the software application. Implementation of an Access Guardian requires a prerequisite knowledge of the residence of the used area within the microprocessor address space. This implies that Access Guardians can only be used in dedicated microprocessor systems.

The general function of the Access Guardian is shown in Figure 5.4. The Access Guardian takes the system address bus as input to its 'address decoder' which generates a signal when invalid address lines are activated. This signal is then processed with Access Guardian status information by the 'restart generator' to produce an interrupt signal 'RESTART' for the application processor. The interrupt signal must exist slightly in excess of the microprocessor interrupt latency. The interrupt latency is the length of time an interrupt must exist to guarantee processing by the microprocessor. Assuming the interrupt is given highest priority the processor will detect it following the execution of the present instruction. The interrupt signal must therefore be just longer than the longest execution time required by any instruction. A 'timer unit' holds a set interrupt signal for the required period. The detailed design of an Access Guardian is presented in Appendix B.

The effectiveness of detecting unused area access has been investigated using fault insertion experiments. Gunneflo et al [1989] report 60% of faults inserted into an operating Motorola 6809 microprocessor system as generating access to an unused area occupying 88% of the address space. They expect unused area area to fall and rise with decreasing and increasing proportions of the address space occupied by the unused area respectively. This suggestion is supported by Schmid et al [1982] who report only 10% of faults inserted into a Zilog Z80 microprocessor system as generating access to an unused area occupying 10% of the address space.
Figure 5.3.: Embedded Access Guardian

Figure 5.4.: Access Guardian
5.4. Erroneous Execution in Used Area of the Address Space

A technique has been proposed so that erroneous execution in the unused area will be detected. There still remains, however, the possibility of erroneous execution in the used area. This involves periods of linear erroneous execution interspersed by Subsequent Erroneous Jumps (SEJs) within the used area. This characteristic is investigated by tracing successive SEJ destinations, and several microprocessors are evaluated using the model of erroneous microprocessor behaviour proposed in Chapter 3. Techniques are proposed for detecting erroneous execution using software implemented fault tolerance.

5.4.1. The Subsequent Erroneous Jump Characteristic

The probability of a SEJ whose generator and destination are both in the used area can be determined as follows. Let the set \( \{L\} \) contain every location in the used area, and \( N_L \) be the number of items in the set \( \{L\} \). Let the set \( \{J\} \) contain every jump type instructions in the microprocessor instruction set, and \( N_J \) be the number of items in the set \( \{J\} \). Let \( l \) be a location in the used area, and \( j \) be a particular jump type instruction in the instruction set. Hence \( l \in \{L\} \), and \( j \in \{J\} \).

Let the function \( H(l, j) \) represent the percentage of possible destinations generated by a particular jump type instruction \( (j) \) at a used area location \( (l) \), that reside within the used area. This function is referred to as the 'hit' function. Let the set \( \{T\} \) contain all the locations that can be addressed by a jump type instruction \( (j) \) residing at location \( (l) \). Then

\[
H(l, j) = \frac{Pr(\{T\} \cap \{L\})}{Pr(\{L\})}, \quad \text{where } Pr(\{L\}) = 1, \quad (5.4.)
\]

which can be expressed as a conditional probability,

\[
H(l, j) = Pr(\{T\}|\{L\}), \quad (5.5.)
\]

having the boundary condition \( Pr(\{L\}) > 0 \) is satisfied.
Typically jump type instructions can generate destinations within either a local $2^8$, $2^{16}$, or $2^{32}$ byte range. Consider a jump type instruction employing relative addressing with a byte operand specifying the displacement. Various operand values alter the destination of the jump type instruction. If this jump type instruction resides in the middle of the used area of $> 2^8$ bytes then it is guaranteed to generate all its possible destinations within the used area: that is $H(l, j) = 1$. However the same jump type instruction residing in the middle of a used area of $2^7$ bytes only has half its possible destinations within the used area: that is $H(l, j) = 0.5$. Figure 5.5. is provided to further aid comprehension of the HIT function.

Let $P_{SEJ}(Used\ Area)$ be the probability of a jump type instruction $(j)$ selected at random from the microprocessor instruction set, residing at a random location in the used area $(l)$, generating a destination which is also within the used area.

$$P_{SEJ}(Used\ Area) = \frac{1}{N_J \cdot N_L} \sum_{j \in \{J\}} \sum_{l \in \{L\}} H(l, j).$$

(5.6.)

The relationship between $P_{SEJ}(Used\ Area)$ and $Used\ Area$ for a range of 8, 16 and 32-bit microprocessors is investigated in Figure 5.6. All jump type instructions within the 8-bit Intel 8085 microprocessor instruction set specify absolute target addresses and hence its SEJ characteristic is linear. When half the Intel 8085 processors address space is used, there is a 50% expectation that a SEJ in the used area will return to the used area. For the same address space utilization, however, the 8-bit Motorola 6800 microprocessor has an 80% probability that a SEJs generator and destination lie within the used area. This is because the Motorola 6800 processor instruction set does have some relative addressing mode jump instructions. Similar results are obtained for the 16-bit processors evaluated. The Motorola 68000 microprocessor has a much higher probability of a used area SEJ targeting the the used area again than the Intel 8086 because it has a far higher proportion of opcode formats within its instruction set dedicated to relative addressing. For example, a 300 KByte used area on the Motorola 68000 and Intel 8086 have respective probabilities of approximately 95% and 70% that a SEJ generated in the used area will also target the used area. The 32-bit microprocessors evaluated show that for a 300 KByte used
Let `y` be the number of location addressed by jump type instruction `j`.

Let `x` be the number of locations `y` that fall within the used area when then jump type instruction `j` resides at location `l` in the used area.

The HIT function of the jump type instruction at location `l` in the used area is then defined as

$$H(1,j) = \frac{x}{y}$$

Figure 5.5.: The HIT Function
Figure 5.6. : The SEJ Characteristic
area of random data, the AMD Am29000 Intel 80386, and Motorola 68020 microproces­sors are expected to generate a target with the used area of approximately 25%, 60%, and 95% respectively. The variation in the HIT function evaluation for the processors is due to the proportion of relative jump opcode formats in their instruction sets.

A high probability of an SEJ in the used area generating a target address which is also in the used area suggest extended periods of erroneous execution in the used area. Such behaviour can be extremely hazardous, not only involving mal-operation but also program area corruption from invalid data manipulation. Therefore, a method of detection is required within the used area to prevent extended erroneous execution.

5.4.2. Detection Using Software Implemented Fault Tolerance

The following section describes techniques which can be implemented in the software to detect erroneous execution in the used area. The techniques utilize instructions that generate a restart outcome. Such instructions direct execution to a predefined location in memory at which a recovery routine resides. The recovery routine restores operation as required by the application software, two possible recovery strategies are reset and rollback.

5.4.2.1. Program Areas

A software detection technique is proposed which exploits the SEJ characteristic. In particular, erroneous jumps are considered which are generated by invalid interpretation of an operand as an opcode. Such erroneous jumps are referred to as invalid branches. Detection mechanisms are strategically inserted at each identifiable invalid branch destination.

Detection Mechanism Construction

The actual construction of a detection mechanism will vary in detail for different microprocessors. A detection mechanism consists of an initial relative branch instruction over the remainder of the detection mechanism so that logical control flow of
the correctly executing program is not interrupted. The remainder of the detection mechanism consists of a number of 'seed' instructions.

The instructions used to construct a detection mechanism should not require any operands, however, this may not always be possible. In such instances care must be taken to avoid the use of operand bit formats which when erroneously interpreted as an opcode generate erroneous jumps. Failure to ensure that detection mechanisms do not themselves generate erroneous jumps leads to successive detection mechanism placements with no guarantee of placement completion.

A detection mechanism seed instruction is a software exception without operands which, through a restart outcome, directs execution flow to a recovery routine. The number of seeds required depends on the individual placement of a detection mechanism.

General examples of detection mechanism structure are shown in Figure 5.7. The Motorola 68000 microprocessor facilitates detection mechanism constructs like that in Figure 5.7(a), the jump over mechanism being provided by the operandless hexadecimal instruction 600X where 'X' is the necessary relative displacement, and the seed is provided by the hexadecimal instruction 6001 which is branch that always generates an 'odd address' exception (restart) and again does not require any operands. Within the Intel 8048 microprocessor there are no restart instructions or jump instructions without a operand requirement. The detection mechanism in this case is like that in Figure 5.7(b). The mechanism uses the hexadecimal instruction 04XX for the jump over the mechanism, the hexadecimal no-operation instruction 00 for the snake, and the hexadecimal instruction 04XX to jump to the recovery routine (mimic restart function) where 'XX' specifies the hexadecimal representation of the jump destination location.

_Detection Mechanism Placement_

Figure 5.8. shows the two basic methods of inserting a detection mechanism within the target software. Forward invalid branch destinations are covered by inserting the detection mechanism at the location of the preceding instruction.
a) Instruction Set includes restart instructions that do not require operands. The instruction set includes jump instructions.

b) Instruction Set does not have any restart instructions, but has 'no-operation' instructions that do not require operands. The instruction set includes jump instructions.

Figure 5.7: Detection Mechanism Constructions
Figure 5.8: Detection Mechanism Placement
Detection mechanisms are inserted to cover backward invalid branches at the location of their respective following instruction. Each detection mechanism requires sufficient seeds to ensure that the invalid branch destination has a resident seed. The maximum number of seeds required will be equivalent to the byte length of the largest instruction construct in the instruction set.

The insertion of a detection mechanism may itself alter the destination of an invalid branch. This situation arises when the invalid branch generator has a specified displacement contained in the byte locations directly following the host instruction and a detection mechanism is inserted immediately after the host instruction. A good example of this is provided by the Motorola 68(7)05 microprocessor where the 'test and branch' instruction requires two operands, the final operands containing the relative displacement. This instruction has the same size as the maximum length instruction within the instruction set; therefore, whenever an invalid branch has this operation, its destination information will always be contained outside the generating instruction. If a detection mechanism is inserted immediately after the generating instruction then the destination for the invalid branch is altered by the change in the information content at the location specifying the relative displacement. Successive invalid branches can be generated in this manner. Application of detection mechanism placement should test for this situation and take evasive action to prevent changing the destination of an invalid branch.

*The Detection Capability of the Mechanism*

The mechanisms inserted within the program code provide a detection capability which has two methods of activation. The first method of detection is associated with an invalid branch generated by erroneous execution. Detection mechanisms are placed in order to detect such SEJs. Secondly, ensuing linear erroneous execution processing through a placed mechanism is detected when a seed is interpreted as an opcode. This second method of detection is guaranteed to be successful if the detection mechanism has a number of seeds equivalent to the byte length of the longest instruction within the application processors instruction set.
Detection Mechanism Placement Deadlock

Detection mechanisms cannot be placed where the generator and destination of an invalid branch are both operands of a single instruction. This occurrence is referred to as placement deadlock. There are two particular types of placement deadlock, those where the invalid branch has a forward direction and those with a backward direction. In addition placement deadlock occurs when the destination of an invalid branch is displaced forward of its generating instruction by the equivalent or fewer bytes than that required by the detection mechanism’s jump instruction. Placement deadlock with a backward invalid branch is critical because an infinite execution loop may be created. This processing outcome is classified as a failure and has a particular hazard in real-time systems. Placement deadlock involving a forward branch does not share this hazard, erroneous execution continuing unhindered through the associated code.

5.4.2.2. Data and Reserved Input/Output Areas

Data areas and areas reserved for input/output communication can be manipulated using similar techniques to facilitate a detection capability of erroneous execution. The information content of both area types cannot be changed, but the method and location of storage can be altered.

Halse [1984] investigates methods of inserting special sections of code within the data area. Erroneous execution in the data area is detected when it flows through one of these sections of inserted code. The efficiency of various sizes of code insertions with different desperations throughout the data area are analysed by Halse. This technique is not transferable to the reserved input/output area because the locations of this area are fixed and no code insertion is therefore possible.

An alternative technique involves utilizing particular bits of each memory element to specify an opcode restart operation. The remainder of the memory element is free for information storage. This technique can best be explained using the Motorola 68000 as an example. Within this processor the hexadecimal opcode format FXXX can be used where ‘X’ denotes an unspecified content. This opcode format is not necessarily available in upwardly compatible members of the Motorola 68000
family, e.g. the Motorola 68020 reserves this format for a co-processor. The execution of a Motorola 68000 FXXX opcode is defined to be illegal and to generate an exception (restart outcome), and hence can be used to detect erroneous execution. The unspecified opcode bits can be used to contain useful data area or input/output reserved area information. Implementation of this technique in the data area and input/output reserved areas will required the application software only to extract the least significant 12-bits of information from each memory location. In addition, the input/output locations should have the most significant 4-bits hard-wired for the restart opcode format. This technique incurs redundancy, which may be substantial in some microprocessor applications. Indeed, the technique may not be feasible for some microprocessor systems.

A third technique involves moving the location of data areas in the address space so that their address specification within an operand has the format of a restart instruction. This ensures that when erroneous execution in the program area generates a SEJ destination in the data area, erroneous execution is detected. Application of this technique is highly dependent on the host microprocessor instruction set. The best results are obtained for those instruction sets with a larger number of restart generating opcode formats. Small numbers of restart opcode formats restricts the number of locations available for positioning sections of data area. This technique is investigated further by Halse [1984]. Again, however, this technique cannot be applied to the input/output reserved area because its locations are fixed, although some locations may be available that, by coincidence, map restart opcode formats.

5.5. The Overheads of Implementing Fault Tolerance

5.5.1. Hardware Fault Tolerant Techniques

The Access Guardian and smart watchdog have an error latency much smaller than that typically exhibited by watchdog timers. In addition their application, unlike the watchdog timer, is transparent to the application program. The use of an Access Guardian or smart watchdog, therefore, releases the application programmer
from requiring a knowledge of the target processor system architecture in order to produce dedicated software.

The design of the Access Guardian can be applied to most microprocessor systems. The complexity of the 'address decoder' is dependent on the nature of the address space, and the bus architecture employed by the microprocessor. The 'timer unit' will also vary in size depending on the microprocessor interrupt latency. The 'restart generator' has a fixed size. An Access Guardian designed in Appendix B requires 60 logic gates, representing 17 standard TTL IC parts. This represents a significant reduction in the gate overhead introduced by a smart watchdog which typically has thousands of gates.

The Access Guardian acts in parallel with the microprocessor and does not inflict a performance overhead to the system during correct operation. Its reduced size in comparison with a smart watchdog also implies a smaller chance of the 'doomsday' syndrome occurring by which the hardware detection unit fails [Damm, 1988].

5.5.2. Software Implemented Fault Tolerant Techniques

The overheads of used area software enhancement are additional memory requirement and increased operational processing during correct operation of detection mechanism jumps. Modification of the unused physical memory locations does not incur any overhead to the microprocessor system because during the course of correct operation the section of the address space is totally independent of processor action.

The memory extension required by the software implemented fault tolerant technique proposed for the program area can be reduced by generating optimum size detection mechanisms at each placement rather than a default maximum size. Whilst this reduces the memory overhead, it also decreases the effectiveness of the mechanism to detect linear erroneous execution. The relative cost of a byte of physical memory has decreased over recent decades [Freer, 1987] and, therefore, the memory overhead is not predicted to be a major system constraint. Nevertheless, in those systems with a limited memory, optimum size detection mechanisms can be implemented.

A memory overhead is also produced by the insertion of software in the data area to detect erroneous execution. Further details of the expected overhead for particular
processor systems can be found in Halse [1984]. This memory overhead does not have an associated performance overhead.

The extra processing requirement of detection mechanism jumps during correct operation of program code may prove critical for some stringent real-time and parallel processing applications. This processing overhead is not influenced by the placement of optimum or default size detection mechanisms in the target code. However, for the majority of applications this overhead is considered to be acceptable.

The technique proposed for the data and reserved input/output area by which the data content of each memory location is reduced in order to give that location a detection capability, also generates a processing overhead. Data transfers and memory requirement may be increased. The magnitude of this overhead is application dependent. The architecture of a microprocessor could incorporate this technique in order to reduce lost operational performance.

5.6. A Fault Tolerant Strategy for Microprocessor Controllers

The detection techniques described in this chapter cover attributes of erroneous execution associated with the model of erroneous microprocessor behaviour presented in Chapter 3. Individual application of one of the techniques will improve the reliability of the processor system in relation to this mode of failure. However, the reliability of the system can be further improved by the selection of techniques for collective implementation. Such techniques should be complementary and feasible for incorporation into a particular microprocessor system. Hence, the selection of fault tolerant techniques is termed 'strategic'.

The model of erroneous microprocessor behaviour described in Chapter 3 is summarized in Figure 3.2. This figure can now be modified to include the detection capability of unused area access via the Access Guardian, and invalid used area operation via the detection mechanisms planted in the software. The model of erroneous behaviour in a microprocessor implementing such fault tolerant techniques is shown in Figure 5.9.
Figure 5.9: Erroneous Execution Model: Enhanced Transient Fault Recovery (*manifested transient fault)
The reliability of a selection of microprocessors implementing an Access Guardian are shown in Figure 5.10. The evaluations assume a worst case of no recovery capability exhibited by the used area so equation (3.49.) with equation (3.26.) substituted becomes,

\[ R(t) = \exp\{-q.P(E_f).t\}, \quad (5.7.) \]

where,

\[ P(E_f) = \left\{ \frac{\text{Used Area (bytes)}}{\text{Address Space (bytes)}} \right\}, \quad (5.8.) \]

and \( q \) is the event rate, \( P(E_f) \) the probability that the event initiates erroneous execution, and \( t \) is time.

Figure 5.10. has a normalized time base. Reliability can also be expressed as MTTF using the following equation for the microprocessor system described above,

\[ MTTF = \frac{1}{q} \left\{ \frac{\text{Address Space (bytes)}}{\text{Used Area (bytes)}} \right\} \quad (5.9.) \]

Table 5.1. shows the MTTF calculations corresponding to Figure 5.15. where events initiating erroneous microprocessor behaviour are taken to occur once a month (every 714 hours).

The complementary application of fault tolerance in the used area enables the reliability of the microprocessor system to be improved. Future chapters investigate the recovery capability enhancement realized by particular microprocessor systems. Meanwhile it is sufficient to demonstrate the benefit of detecting erroneous microprocessor behaviour in the used area of the address space. Consider a Motorola 68000 microprocessor system with 48 KBytes used area and implementing an Access Guardian. The used area is initially assumed not to have a recovery capability. For the purpose of statistical analysis let the used area be divided into program and
Figure 5.10: Microprocessor Reliability with Access Guardian
(Used Area = 48 KBytes)
<table>
<thead>
<tr>
<th>Microprocessor</th>
<th>Class</th>
<th>$P(E_f)$</th>
<th>MTTF</th>
</tr>
</thead>
<tbody>
<tr>
<td>6800</td>
<td>8-Bit</td>
<td>0.750000</td>
<td>952 hrs = 39 days</td>
</tr>
<tr>
<td>8048</td>
<td>8-Bit</td>
<td>0.750000</td>
<td>952 hrs = 39 days</td>
</tr>
<tr>
<td>8085</td>
<td>8-Bit</td>
<td>0.750000</td>
<td>952 hrs = 39 days</td>
</tr>
<tr>
<td>8086</td>
<td>16-Bit</td>
<td>0.046870</td>
<td>15232 hrs = 21.6 months</td>
</tr>
<tr>
<td>68000</td>
<td>16-Bit</td>
<td>0.001465</td>
<td>487567 hrs = 56 yrs</td>
</tr>
<tr>
<td>68010</td>
<td>16-Bit</td>
<td>0.001465</td>
<td>487567 hrs = 56 yrs</td>
</tr>
<tr>
<td>68020</td>
<td>32-Bit</td>
<td>0.000006</td>
<td>119047619 hrs = 13590 yrs</td>
</tr>
</tbody>
</table>

Table 5.1.: MTTF with Unused Area Detection

<table>
<thead>
<tr>
<th>Program/Data</th>
<th>$P(E_f)$</th>
<th>MTTF</th>
</tr>
</thead>
<tbody>
<tr>
<td>48K/0K</td>
<td>0.001465</td>
<td>487567 hrs = 56 yrs</td>
</tr>
<tr>
<td>40K/8K</td>
<td>0.001383</td>
<td>516476 hrs = 59 yrs</td>
</tr>
<tr>
<td>8K/40K</td>
<td>0.001052</td>
<td>678979 hrs = 78 yrs</td>
</tr>
</tbody>
</table>

Table 5.2.: MTTF Enhancement with Used Area Detection
data areas with no and total recovery capability respectively. Reliability enhancement is now dependent on the proportion of the used area occupied by the data area. Table 5.2. notes 8 KByte and 40 KByte data areas and calculates their respective MTTF influence on the microprocessor system. These calculations are represented as reliability curves in Figure 5.11. Of course the program area can have a recovery capability too: the above calculations are given purely as an example.

The strategic selection of fault tolerant techniques is not limited to those specialized techniques presented in this chapter for particular modes of failure attributed to erroneous microprocessor behaviour described in Chapter 3. Additional techniques such as Recovery Blocks for programs, and parity bit checking for physical memory (both described in Chapter 1), can be incorporated to further enhance reliability through the coverage of other modes of processor system failure.

5.7. Summary

Operational characteristics have been identified in the model of erroneous microprocessor behaviour proposed in Chapter 3. In particular the characteristics of Initial Erroneous Jump (IEJ) and Subsequent Erroneous Jump (SEJ) are associated with erroneous execution within the unused and used areas of the processor address space respectively. The characteristics are modelled for a selection of 8-bit, 16-bit and 32-bit processors, and variations are observed with differences in processor architecture and instruction sets. Detection techniques are proposed which exploit these characteristics in order to provide fault tolerance and hence increased reliability to the microprocessor system.

A detection capability can be provided for the unused area using either a software based technique for physically implemented memory, and/or an Access Guardian which provides an additional detection capability for non-existent memory.

The used area incorporates functional areas for the program, data, and input/output communications, and can be implemented in volatile and non-volatile memory. A detection capability is provided for the program area by the application of a proposed technique involving the strategic insertion of mechanisms at invalid branch
Figure 5.11.: Enhanced Microprocessor Reliability with Access Guardian and Used Area Detection Capability
(SEJ) destinations to detect erroneous execution. The application of this technique and its performance evaluation are covered by future chapters. Other techniques are discussed in relation to providing a detection capability for the data and reserved input/output areas of the used area.

Application of fault tolerance in the used area involves an additional memory overhead. This memory overhead inflicts a processing performance overhead when inserted within the program area, namely the jumps over detection mechanism which avoid correct program flow corruption. The magnitude of these overheads is evaluated for particular applications in the following chapters.

Fault tolerant techniques can be strategically selected for collective application in order to achieve high reliability. The selection criteria used depends on the fault classes which require detection and the feasibility of implementing fault tolerant techniques within particular microprocessor systems.
Chapter 6

POST-PROGRAMMING, AUTOMATED, RECOVERY UTILITY (PARUT)

6.1. Introduction .................................................. 107
6.2. Design and Development Objectives for the PARUT Prototype .... 107
6.3. A Functional Overview of PARUT ................................ 108
6.4. Design Features Incorporated within the PARUT Prototype ...... 110
    6.4.1. Programming Language .................................. 110
    6.4.2. Programming Style ...................................... 110
    6.4.3. The Diagnostic Facility .................................. 110
    6.4.4. Target Software ......................................... 111
    6.4.5. Target Processors ....................................... 112
6.5. A Description of the PARUT Prototype's Operation .............. 112
    6.5.1. Data Code Analysis .................................... 114
    6.5.2. Program Code Analysis .................................. 115
    6.5.3. The 'Seeding' Algorithm ................................ 116
6.6. PARUT: A Review of the Prototype ............................ 122
6.7. PARUT: Developing a Standard Programming Tool ............... 123
6.8. Summary ................................................................ 124
CHAPTER SIX

POST-PROGRAMMING, AUTOMATED, RECOVERY UTILITY (PARUT)

6.1. Introduction

A prototype software utility has been designed to automatically apply the software implemented fault tolerant technique, proposed in the previous chapter, on program code. This utility, called PARUT (Post-programming Automated Recovery UTility), can also apply other software based fault tolerant techniques.

This chapter initially outlines the objectives of PARUT, and these are appraised at the end of the chapter to assess the success of the prototype. PARUT, its function and structure, are described in overview, a description of the physical mechanics of the code can be found by examining the annotated utility listing in Appendix C. Finally, enhancements for the PARUT prototype are suggested, and a proposal for the development and application of PARUT as a standard programming tool discussed.

6.2. Design and Development Objectives for the PARUT Prototype

The software implemented fault tolerant technique proposed in Chapter 5 involves inserting detection mechanisms at machine code level to cover invalid branches associated with erroneous microprocessor behaviour. Manual application of the technique can be complex, especially for large target programs. The PARUT prototype is designed to automate the technique’s application.

As a prototype, PARUT does not have rigorous specification but rather a set of objectives. In order to facilitate a wider application of PARUT, its initial objective is broadened to include those listed below.

- Apply and assess software implemented fault tolerant techniques.
- Process software for a variety of target microprocessors.
- Facilitate application to any software whose host processor is supported.
- Produce a report assessing utility activity.
In addition to the design objectives it is worthwhile specifying some development objectives for the production of this prototype utility. In particular the prototype program should exhibit qualities facilitating programmer/analyst comprehension and modification of module mechanics. These qualities are especially important in the production of a prototype because alterations are commonplace.

6.3. A Functional Overview of PARUT

PARUT has two input requirements: a copy of the software and a description of the microprocessor on which it resides. The microprocessor description input to PARUT is a file, referred to as MICRO_FILE, containing the target processor instruction set. The file lists the defined instructions within the instruction set, specifying each instruction opcode. The software presented to PARUT for processing is in machine code format because the detection capability assessment and fault tolerant technique application require knowledge of the opcode and operand usage on the target processor. The target software is held in a file called CODE_FILE.

The execution of PARUT generates a report file, referred to as ANALYSIS_FILE, which documents the detection capability assessment of the software under investigation. The detection capability is evaluated by determining the proportion invalid branches that are detected during erroneous execution. PARUT also produces a file called RESULT_FILE, containing the fault tolerant version of CODE_FILE, when a software implemented fault tolerant technique is applied to the target software represented in CODE_FILE. The format of this file can be tailored to meet specific requirements. PARUT currently outputs the enhanced software in a format which facilitates easy user interpretation of the utility action at machine code level.

An overview of the PARUT program is shown in Figure 6.1. Examples of the two input files, MICRO_FILE (Motorola 68000 microprocessor) and CODE_FILE, and the two output files, RESULT_FILE and ANALYSIS_FILE, associated with the operation of PARUT on CODE_FILE can be found in Appendix C.
Figure 6.1. : PARUT Overview
6.4. Design Features Incorporated into the PARUT Prototype

Particular design features can be incorporated into the PARUT program to realize prototype objectives, outlined in section section 6.2., or to underpin the procedural activity of PARUT which is described in the next section of this chapter. This section presents such design features and describes the criteria for their application.

6.4.1. Programming Language

PARUT is implemented in the Pascal programming language. This language was selected for the prototype because of its structural constructs and readability. An alternative considered was the programming language C, but was rejected because language constructs such as those involving linked lists are difficult to understand when the reader is not proficient in the language. An important feature of a prototype program is readability. A future development of PARUT might involve the translation of the present code into another language deemed more appropriate. It is considered that Pascal is relatively easy to translate.

6.4.2. Programming Style

The utility program has been written implementing 'good' programming practice [Sommerville, 1985]. This involves developing code in concise modules (a few tens of lines) which exhibit low coupling and high cohesion. Coupling and cohesion refer to the required passing of external parameters to a module, and unity of operation respectively. Such programming practice facilitates easy modification or replacement of modules without disruptive consequences for the remainder of the utility program. Furthermore, 'good' programming practice also encourages the production of readable code.

6.4.3. The Diagnostics Facility

A diagnostics facility is provided to aid understanding of the function of PARUT. The facility is activated by setting the 'DIAGNOSTIC' variable at the beginning of the PARUT listing to 'true'. When active, the facility generates a file called TRACE.FILE in which all functions and procedures accessed by the code operation
insert an entry specifying their name. Nested functions appear in the TRACE_FILE as indented entries. This file can be accessed by the analyst to monitor the chronological activity of PARUT. The last enclosure in Appendix C is a typical TRACE_FILE.

6.4.4. Target Software

PARUT processes target software at machine code level in order to apply and/or analyse its fault tolerance. The data structure chosen to represent the machine code is a linked list of records. This data structure is used because it requires no predefinition of dimensions and can easily be manipulated when inserting records representing code associated with the application of fault tolerance.

Each record within the linked list contains information describing the characteristic of a machine code element (usually 8 or 16 bits). The contents of a record are itemized below and can be found at the beginning of the PARUT listing in Appendix C under the 'TYPE' declaration. The items within each record are reviewed below:

- **next_address & last_address**
  - are pointers connecting adjacent records in the linked list. 'next_address' and 'last_address' are set to nil in the last and first records in the linked list respectively denoting the lists termination.

- **op, optype, & address**
  - specify the absolute value of the machine code element (typically 8 or 16 bits), whether it is an opcode or operand, and its resident location in the address space.

- **offset**
  - is a parameter used in address processing.

- **seeded**
  - specifies the status of an operand identified as a potential invalid branch; 'true' and 'false' signify the presence and absence of fault tolerance respectively.

- **offset**
  - is a parameter used in address processing.
jump type, jump too & jump address

- specify the type of jump instruction that the item ‘op’ generates under erroneous execution as an opcode (details are shown under the ‘TYPE’ declaration at the beginning of the PARUT listing), a pointer to the target location, and the target address respectively. Non-jump instructions set ‘jump_type’, ‘jump too’, and ‘jump_address’, to ‘0’, ‘nil’, and ‘0’ respectively.

jump from

- is a pointer to a record containing machine code identified as potentially generating an invalid branch whose target is this record, the pointer has the default setting of ‘nil’.

6.4.5. Target Processors

Target software is input to the utility via a file called CODE_FILE. PARUT processes the software at source code level. The source code on most microprocessor systems is not directly readable and hence an indirect method of input for the code is required. CODE_FILE contains data generated by UNIX ‘adb’ (a debugger). Particular details of CODE_FILE can be found in Appendix C. In summary, the file contains two sections: a memory dump of the resident source code, and a list of opcode addresses within the source code.

6.5. A Description of the PARUT Prototype’s Operation

This section briefly describes the procedural activity of, and user interaction with, PARUT. An annotated listing of the PARUT code is held in Appendix C.

Within the program structure defined by Pascal, the root module is a function called ‘MAIN’. A call chart of functions and procedures used by MAIN is shown in Figure 6.2. The chart depicts module operations in rectangular boxes, and functions and procedures in rectangular boxes with duplicated vertical bars.

The initial job of module MAIN is to initialize variables and prepare all files required by PARUT. After this is completed, the user is required to respond to a prompt which enquires whether or not a data area requires analysis, and if not,
Sequence of function/procedure operation: left to right

Figure 6.2.: Program 'MAIN' Call Chart

- Function or Procedure: 
- Module Operation:
then whether a program area requires analysis. If neither option is requested, the prompts are repeatedly presented until the user chooses an option. After completion of the requested analysis PARUT terminates activity and returns the user to the host environment.

Data and program code analysis implement a common multi-stage processing approach. The approach involves the following sequence of activity,

i) Generating a linked list to represent the machine code of the software under investigation. This includes the insertion of information within each record of the linked list detailing program flow associated with valid and invalid interpretation of the machine code.

ii) Apply fault tolerance by manipulating the linked list ensuring that the valid program flow is preserved (program code only).

iii) Produce a report documenting fault tolerant analysis of the machine code and, in the case of program code, detail the enhancement provided by the application of software implemented fault tolerant techniques.

6.5.1. Data Code Analysis

Analysis of data code involves four basic operations, each operation being contained within a function or procedure. Initially the user may request by prompt to construct a data structure either from actual data code (procedure BUILD_ADB), or from pseudo-random generated code (procedure BUILD_RNG). Actual data code was received in a UNIX 'DIS' format in an early version of the PARUT prototype, and for this the procedure BUILD_DIS was written. Now data code is received in a UNIX 'ADB' format and hence procedure BUILD_ADB is used, however, BUILD_DIS remains available for future use if required. After generating the data structure for the code, procedure BUILD_JUMPS is used to derive the program flow through the data area if it was incorrectly interpreted as program code. The penultimate operation of data area analysis is the output of the data code (primarily of use when the code
is generated by PARUT) using the procedure PRINT_LIST. Finally, the data code is analysed to determine the hazard of interpreting it as program code, and this is achieved by the procedure DATA_ANALYSIS.

6.5.2. Program Code Analysis

Program area analysis calls more procedures than data area analysis but retains the same basic approach. Initially a data structure is constructed by procedure BUILD_ADB to represent input program code. The user is then requested by a sequence of prompts to select a combination of software implemented fault tolerant techniques to be applied to the code including the technique proposed in Chapter 5.

Selection of any of the fault tolerant techniques offered to the user requires a duplicate copy of the linked list representing the target program code and this is provided by activating the procedure COPY_LIST. This copy is then processed by the procedure BUILD_JUMPS so that the necessary program flow information associated with both valid and invalid interpretation of the program code is incorporated. Then depending on the fault tolerant techniques chosen by the user, the procedures SEED_LIST (technique proposed in Chapter 5), STREAM_LIST (signature analysis), and RELOCATE_LIST (an alternative technique now discarded due to poor results) are executed. Other techniques can be added to those offered by PARUT, and would be included here in the structure of PARUT.

Once the selected techniques have been implemented on the copies of the linked list representing the original target program code, two 'housekeeping' operations are required. Firstly, procedure ALIGN_LIST is called which resets any disturbed absolute branches in the program code. In this manner PARUT does not compromise the transparent application feature of the software implemented fault tolerant technique (proposed in Chapter 5) upon the target software. Secondly, procedure TIDY_LIST is executed to remove any redundancy in the placed detection mechanisms. When the housekeeping is complete, the resultant code enhancement is output by procedure PRINT_LIST.

The operation of program code analysis terminates with a prompt to the user requesting a choice whether or not program flow analysis is required. If it is not required
then the action of this section of code is complete, otherwise the procedure ANALYSIS is activated. As a prerequisite to executing ANALYSIS the original program code data structure must be prepared for comparison with the enhanced program code version(s). This is achieved by the activation of the procedure BUILD_JUMPS. On returning from ANALYSIS this section of the PARUT code completes its operation. ANALYSIS reports instances of placement deadlock when the software implemented fault tolerant technique proposed in Chapter 5 is applied. A screen dump of the user interface for program code analysis is shown in Figure 6.3.

6.5.3. The ‘Seeding’ Algorithm

This section describes the algorithm used by PARUT to apply the fault tolerant technique proposed in Chapter 5. The algorithm is implemented by a procedure called SEED_LIST.

It is necessary before the algorithm is described to introduce some basic terminology and observations concerning the structure of invalid branches within machine code. The ‘range’ of an invalid branch describes the machine code locations lying between its generating location and target address. The range of an invalid branch has a ‘level’ which denotes how many target addresses of other invalid branches lie within its range. Within a section of machine code, invalid branches can ‘group’ incorporating features of four identified basic constructs. These constructs are shown in Figure 6.4. and are:

a) Non-Intersecting Invalid Branches

The generating locations and target addresses of the individual invalid branches range over independent areas of the machine code.

b) Intersecting Invalid Branches

The generating locations and target addresses of the individual invalid branches range over areas of the machine code that overlap. The range of one invalid branch contains the target address but not the generating location of the remaining invalid branch.
#Execution begins

PARUT : TRANSIENT FAULT RECOVERY TOOL
=======================================

INFORMATION INPUT :- Please type appropriate response

Data or Program Area (DATA/PROGRAM) ?
PROGRAM
Detection Mechanism Placement (YES/NO) ?
YES
=> Optimise Placement (YES/NO) ?
YES
Boundary Relocation (YES/NO) ?
NO
Signature Placement (YES/NO) ?
NO

<<< Original Code (for comparison) being prepared >>>

Analysis Required (YES/NO) ?
YES

#Execution terminated

Figure 6.3. : Screen Dump of PARUT User Interface
Figure 6.4: Algorithmic Processing of Invalid Branches

(a) Non-Intersecting Invalid Branches

(b) Intersecting Invalid Branches

(c) Coupled Invalid Branches

(d) Embedded Invalid Branches
c) **Coupled Invalid Branches**

Intersecting invalid branches except that ranges of both invalid branches contain the target address, but not the generating location, of the respective remaining invalid branch.

d) **Embedded Invalid Branches**

The generating location and target addresses of one invalid branch both lie within the range of the remaining invalid branch.

Of course there may be more complex situations of invalid branch interaction in the machine code under investigation, but such situations are constructs of the primitives listed above. The 'seeding' algorithm ensures that all invalid branches are resolved by 'seeding' except where placement deadlock is identified.

**The Algorithm**

**Stage 1.** Check whether or not there remain any unresolved invalid branches within the machine code. If not go to stage 8 of the algorithm.

**Stage 2.** 'Seeding' required. Investigate the machine code resolving invalid branches at level zero unless this is not the first pass of the code, in which case, increment the level to be investigated by 1.

**Stage 3.** Search the machine code until an unresolved invalid branch with the same level as that under investigation is found, or the end of the machine code is located. Searching commences initially from the start of the machine code. However, if an invalid branch has been resolved in the current code pass then the search commences at the location following the last address of the group in which that invalid branch was a member.

**Stage 4.** Resolve the invalid branch unless the end of the machine code was located in which case go to stage 6 of the algorithm.

**Stage 5.** Addresses are updated and valid program flow re-established for the machine code due to the insertion of a detection mechanism. Go to stage 3 of the algorithm.
Stage 6. Remove complex groups of invalid branches. If the present level of investigation is greater than zero then recursively apply the 'seeding' algorithm from stage 3 incrementing the level of investigation from zero to one lower than the present level.

Stage 7. If there remain unresolved invalid branches at the level of investigation after the code pass then return to stage 3 of the algorithm and start a new pass of the code, otherwise go to stage 1.

Stage 8. 'Seeding' complete.

The function of the 'seeding' algorithm is now demonstrated with the example of a complex invalid branch group shown in Figure 6.5. Noted below are the stages processed by the algorithm with status comments. The example should be examined in association with the 'seeding' algorithm.

Stage 1: Unresolved invalid branches.
Stage 2: 'Seeding' required. Level = 0.
Stage 3: Start pass of linked list.
Stage 4: Pass of linked list completed.
Stage 6: No recursive call.
Stage 7: No invalid branches at this level.
Stage 1: Unresolved invalid branches.
Stage 2: 'Seeding' required. Level = 1.
Stage 3: Start pass of linked list.
Stage 4: Invalid branch 'B' identified.
Stage 5: Resolve coupled invalid branch.
Stage 3: Complete pass of linked list.
Stage 4: Pass of linked list completed.
Stage 6: Recursive call to stage 3 with level = 0.
Stage 4: No invalid branches at this level.
Stage 6: Recursive call completed.
Stage 7: Level 1 invalid branches remain unresolved.
Figure 6.5. : A Complex Example of Algorithmic Processing

B = Invalid Branch Reference

1 = Resolving Order of Algorithm
Stage 3: Start pass of linked list.
Stage 4: Invalid branch 'C' identified.
Stage 5: Resolve coupled invalid branch.
Stage 3: Complete pass of linked list.
Stage 4: Pass of linked list completed.
Stage 6: Recursive call to stage 3 with level = 0.
Stage 3: Start pass of linked list.
Stage 4: Invalid branch 'D' identified.
Stage 5: Resolve embedded invalid branch.
Stage 3: Continue pass of linked list.
Stage 4: Invalid branch 'A' identified.
Stage 5: Resolve non-intersecting invalid branch.
Stage 3: Continue pass of linked list.
Stage 4: Pass of linked list completed.
Stage 6: Recursive call completed.
Stage 7: Level 1 invalid branches resolved.
Stage 1: All invalid branches resolved.
Stage 8: ‘Seeding’ complete.

The procedural implementation of this algorithm is now briefly reviewed. Stages 1, 2, and 8 of the algorithm are implemented directly by procedure SEED_LIST, whilst the remaining stages are controlled by the called procedure SEED_PLACEMENT. Procedure SEED_PLACEMENT manages three procedures and recursive activation of itself. SEED_PLACEMENT initially executes procedure SEED_LOCATION to achieve stage 3 and 4 of the algorithm. This routine activates five other procedures. Initially JUMP_DIRECTION is used to determine the forward or backward nature of the invalid branch, then INTERVAL evaluates the level of the invalid branch and procedure TEST_SEED checks whether the invalid branch is already resolved by another detection mechanism placement. If the present invalid branch can be resolved then procedures SEED_DETAILS and PLACE_SEED construct and insert the detection mechanism into the machine code. SEED_DETAILS can place a default size or
optimum size detection mechanism, depending on user criteria passed by the root procedure MAIN. After executing SEED_LOCATION, SEED_PLACEMENT processes stage 5 of the 'seeding' algorithm by activating procedures ADDRESS_LIST and BUILD_JUMPS. These procedures are also directly used by the root procedure MAIN and are described in the following sections of this chapter. Finally, SEED_PLACEMENT implements repetitive and recursive calls to stages 3-7 of the 'seeding' algorithm in order to resolve complex invalid branch groups.

6.6. PARUT: A Review of the Prototype

The PARUT prototype successfully applies the software implemented fault tolerant technique proposed in Chapter 5. The 'seeding' algorithm employed by PARUT appears from experience to be efficient, but no quantitative assessment of its performance has been attempted. The algorithm is based on solving constructs of invalid branches: non-intersecting, intersecting, coupled, and embedded. The algorithm is also validated for an example of complex invalid branch group structure in machine code. In addition PARUT is designed to enable the simple inclusion of other software implemented fault tolerant techniques. In particular the prototype currently implements a simulation of the signature analysis technique. Further techniques can be included as required during any future development.

It is important that the operation of the prototype can be easily understood. A diagnostic facility is built into PARUT enabling the generation of a procedure call list referred to as TRACE_FILE. This list enables the operation of the utility to be monitored and hence aid comprehension of operation. Furthermore, programming language and style are adopted to facilitate understanding of the PARUT program. These qualities of the prototype have also proved valuable during the utility development, facilitating easy code manipulation without the disruption often associated with prototype development of similar sized programs.

A linked list is employed to represent the input machine code for processing by PARUT. This data structure is not dimensioned and does not itself restrict the size of machine code input. Equally, modules processing the linked list are designed not to impose a dimension restriction. However, there will be a constraint on the size
of the input code due to general limitations of the host system environment, e.g. a maximum size of CODE_FILE generation by UNIX 'adb'. Such restrictions are outside the scope of the PARUT development programme.

One of the most difficult objectives to achieve is the use of PARUT with a range of target processor types. Software manipulation required for the application of software implemented fault tolerance uses the pseudo-compiler action of procedure BUILD_JUMPS on the source code of the target processor. Such activity implies knowledge of the processor's instruction set, currently input to PARUT in MICRO_FILE. A robust version of PARUT should incorporate design features which facilitate a complete specification of a microprocessor type within MICRO_FILE to be processed by BUILD_JUMPS whose activity is independent of microprocessor architecture. Implementing such a robust specification is complex; therefore, for simplicity PARUT was developed to target only one processor type: the Motorola 68000 family. This family of microprocessors have a fixed size instruction set and extensions to the used instruction set are upwardly compatible. Hence, although only one processor type is made available by PARUT, the utility in reality can be used with a selection of processors within the Motorola 68000 family. This gives PARUT a base selection of target processors.

6.7. PARUT: Developing a Standard Programming Tool

The PARUT prototype extensively realizes its design and development objectives. It therefore appears feasible to further develop PARUT into a standard programming tool. Such a tool is valuable when implementing and assessing fault tolerance associated with the characteristics of erroneous behaviour described in Chapter 3.

A standard programming tool based on the prototype PARUT should adopt the following recommended enhancements. Firstly, the range of target processors should be extended. This is possibly the most complex modification of PARUT involving the integrated development of a robust BUILD_JUMPS module and general purpose format for MICRO_FILE. Secondly, PARUT should be extended to implement (rather than simulate) other software implemented fault tolerant techniques. The program structure of the PARUT program has been demonstrated to provide easy inclusion
of new techniques. Thirdly, the analysis techniques used to assess the effectiveness of the fault tolerant techniques implemented could, in addition to static analysis, provide dynamic analysis via emulation/simulation.

As a standard programming tool PARUT might be incorporated into a compiler. This would remove the necessity of generating and processing CODE_FILE because all the required information on the target software is inherently available from the translation process of the compiler.

6.8. Summary

This chapter describes the design and development of the prototype programming tool PARUT. Design objectives are successfully attained. In particular a selection of software implemented fault tolerant techniques, including that proposed in Chapter 5, are facilitated for a variety of target processors. This is achieved without undue restrictions on the size of the target software. Additionally the fault tolerance of the software can be assessed in respect of the hazard of erroneous microprocessor behaviour modelled in Chapter 3.

The structured programming language Pascal is used in conjunction with 'good' programming practice to generate readable code and hence ease comprehension and modification. A diagnostic facility which monitors procedure access by PARUT operation is also provided to aid understanding of the utility function. All these design features have proved valuable during the development of the PARUT prototype.

The success of PARUT leads to the proposal that further development be initiated in order to produce a standard programming tool. Enhancements to PARUT for this purpose are outlined. The post-programming nature of the software implemented fault tolerant techniques applied by the PARUT function suggests its possible inclusion within a compiler, providing an additional code enhancement stage at the end of the translation process.
Chapter 7

ASSESSING FAULT TOLERANCE

7.1. Introduction ............................................................ 125
7.2. Assessing the Fault Tolerance of a Microprocessor System ........ 126
  7.2.1. Assessment Parameters ........................................... 126
  7.2.2. Parameter Evaluation ............................................ 126
  7.2.3. Internal Microprocessor Faults .................................. 128
  7.2.4. Assessment Dependence on Application Software ............. 128
  7.2.5. Behavioural Observations ....................................... 130
7.3. Single-Bit Fault Injection Experiment ................................ 130
  7.3.1. Fault Injection Experiment ...................................... 130
  7.3.2. Microprocessor Application Under Investigation ............. 132
  7.3.3. Programme of Injected Faults ................................... 133
  7.3.4. Selected Single-Bit Faults ...................................... 135
  7.3.5. Decoupling the Microprocessor Detection Mechanisms ......... 137
  7.3.6 Performance Evaluation .......................................... 138
7.4. Multiple-Bit Fault Emulation Experiment ............................ 144
  7.4.1. Emulation and Fault Investigation ............................. 144
  7.4.2. Microprocessor Applications under Investigation ............ 145
  7.4.3. Programme of Emulated Faults .................................. 146
  7.4.4. Behavioural Analysis ............................................ 147
  7.4.5. Identified Phases of Erroneous Execution .................... 147
    7.4.5.1. The Initial Erroneous Jump Phase ......................... 148
    7.4.5.2. The Subsequent Erroneous Jump Phase ..................... 151
  7.4.6. Analysing Detection Capability ................................ 151
  7.4.7. Critical Hazards of Erroneous Behaviour ...................... 157
    7.4.7.1. Cessation of Processing .................................. 158
    7.4.7.2. Execution Loops .......................................... 158
    7.4.7.3. Placement Deadlock ...................................... 158
  7.4.8. Re-synchronizing Erroneous Execution ......................... 160
7.5. Summary and Conclusions ........................................... 160
CHAPTER SEVEN
ASSESSING FAULT TOLERANCE

7.1. Introduction

The effectiveness of software implemented fault tolerance in a microprocessor system is difficult to assess. Microprocessors and their memories are VLSI devices which have a huge number of potential logical fault sites. In addition these faults will not always be activated because of time dependent circuit operation. This chapter reports two fault insertion experiments initiated in order to investigate temporary faults within a microprocessor system implementing the software based fault tolerance technique proposed in Chapter 5.

The fault insertion experiments investigate the effect of single-bit and multiple-bit faults on program behaviour. The faults are inserted into memory and the program counter, and the response of the microprocessor system tracked instruction by instruction.

The first set of experiments involve injecting single-bit faults into a microprocessor system. Five classes of single-bit fault are injected in order to model faults at various locations in the microprocessor and memory. The faults include line-errors on the address and data bus during instruction and data cycles, and program counter faults. The response of the microprocessor system once the faults have been injected is monitored. A detailed analysis of the system response gives an important insight into the character and nature of erroneous microprocessor behaviour.

Within microprocessor based systems faults are often observed to affect multiple-bit as well as single-bit locations. This occurrence is investigated by the second set of fault injection experiment. The fault class selected for multiple-bit fault injection is program counter corruption. This fault is selected because it is observed in many of the single-bit fault injection experiments, and it is relatively simple to inject. The fault response of three microprocessor systems (an 8, 16, and 32-bit machine), is analysed.
Finally, the chapter concludes with a summary of experimental observations relating to the effectiveness of the software detection mechanisms and the Access Guardian function. In particular, detection latency and the hazard of re-synchronization are discussed.

7.2. Assessing the Fault Tolerance of a Microprocessor System

7.2.1. Assessment Parameters

Techniques providing tolerance of temporary faults in VLSI devices are difficult to assess. This is particularly true for microprocessor systems. In order to assess the effectiveness of a fault tolerant technique it is necessary to evaluate several performance parameters. Two performance parameters are particularly important: detection latency, and error coverage. Error coverage is the percentage of error conditions that can be detected by the technique, and the time taken between the activation of a fault as an error and its detection is referred to as its detection latency. It is also important to quantify the performance overhead imposed by the technique. The overhead comprises processing degradation and additional hardware requirement.

7.2.2. Parameter Evaluation

Analytic assessment of fault tolerant microprocessors is a difficult task, because in most applications it is impossible to determine appropriate error classes and the distribution of errors amongst these classes. Although models have been developed to investigate the effect of temporary hardware faults on executing microprocessor systems, their analysis is limited by assumptions and imposed restrictions. Mahmood & McCluskey [1985], and Namjoo [1982] have modelled the error coverage of signature analysis techniques but their investigation was limited to the effect of control flow errors, and their estimates proved slightly optimistic compared with actual results presented by Schuette & Shen [1986]. Nevertheless, the model did give a valuable indication of the effectiveness of signature analysis.

Experimental evaluation by fault injection into actual hardware in many cases is the only way to estimate fault tolerance effectiveness successfully. In such experiments
the selection of the fault injection method is crucial. Ideally a fault should be capable of injection at random and specific VLSI device locations, and at a certain point in time for a controlled period. Executing software is dynamic and temporary faults can have different outcomes depending on the activation of circuitry by microprocessing within the processor. This is referred to as the instruction sequence dependency. Higher processor workloads, in a multi-tasking environment, may also increase the probability of activating a fault.

Temporary faults that occur in microprocessor devices are difficult to mimic in the laboratory. Initially fault injection experiments, see Table 2.3., inserted faults via the hardware pins of a device. More recently Chillarege & Bowen [1989] inserted faults into a microprocessors memory. Whilst these faults model internal faults, faults are not actually injected within the device. Although there are methods of injecting internal faults to a microprocessor, notably Damm [1988] by power line fluctuations, and Gunneflo et al [1989] by ion radiation, these methods generate multiple faults at uncontrolled locations.

Controlled fault generation can be provided by using microprocessor emulators and simulators, see Table 2.2., but analysis of the microprocessor response may be limited by the tool’s sophistication. Armstrong & Devlin [1981] suggest that gate-level microprocessor simulators are prohibitively expensive for fault injection experiments. Therefore, they used a microprocessor simulator based on a functional model [Li et al, 1984]. More sophisticated emulators have become available to researchers, whereby gate-level descriptions are incorporated into functional models. Czeck & Siewiorek [1990] recently used such a sophisticated simulator. However, it must be realized that as the simulators increase in complexity so they become increasingly susceptible themselves to development errors. A simulator was not available to the author of this thesis so an alternative method of assessing fault tolerance is utilized.

As mentioned above, an accepted method used by many researchers to obtain assessments of fault tolerant techniques is fault injection. The method is popular because it gives actual error coverage analysis for the injected faults. A limitation of the method is the validity of the result for the whole device. Locations for fault
injection are usually chosen by the experiment designer following some predefined selection criteria. The representivity of these points for the remainder of the device should always be critically examined. This method appeared to be the best approach to analyse the software based fault tolerant technique proposed in Chapter 5.

7.2.3. Internal Microprocessor Faults

Faults are selected for insertion in order to model actual faults in a microprocessor system. The inserted faults are only valid when modelling processor and memory faults, faults in the peripheral devices are not accurately modelled. Before comparing inserted and actual faults, it is necessary to describe the basic structure of the tested microprocessor system.

The microprocessor system under test is considered as the integration of a data path (consisting of an ALU and data registers), and a control path (consisting of an opcode decoder, controller, and program control unit). The program control unit (PCU) contains bus interface circuitry (BIC), program counter (PC), instruction prefetch queue (IPQ), and a controller. The PCU is responsible for address calculations, and it is assumed to be responsible for the reading and writing of opcodes and operands. Interrupt handling and bus arbitration circuitry will not be considered because the inserted faults will not closely model faults within these units. A general microprocessor topology is shown in Figure 7.1. The microprocessor system under test does not have the following features: co-processor, memory cache, instruction pipeline (beyond the prefetch queue), and multiplexed busses.

7.2.4. Assessment Dependence on Application Software

The effectiveness of software based fault tolerant techniques is difficult to access. Techniques such as recovery blocks, signature analysis, and the technique proposed in Chapter 5 are all dependent on the application program for their performance. It is therefore important to select a representative application program when assessing the effectiveness of a fault tolerant technique. Czeck & Siewiorek [1990] and Schuette & Shen [1986] each choose a different collection of application programs.
in order to provide an assessment benchmark. However, particular applications for implementation can have quite different characteristics from any of the benchmark software. Hence the value of a benchmark program is limited to giving an indication of the effectiveness of a fault tolerant technique implemented on a similar application program.

7.2.5. Behavioural Observations

Recently reported fault injection experiments (see Table 2.2.) measure the detection latency of fault tolerant techniques. Such experiments do not enable the mechanism of spawning errors, at a functional level, to be observed. The fault injection experiments described within this chapter involve tracing the instruction sequence of microprocessor operation from the activation of the fault to its detection or deactivation. These experiments provide an interesting insight into the mechanisms, modelled in Chapter 3, of processing failures induced by temporary faults.

7.3. Single-Bit Fault Injection Experiment

7.3.1. Fault Injection Experiment

The experimental set-up of the fault injection programme is shown in Figure 7.2. Within the Engineering Department at the University of Durham there is a Vittese 5 computer system based on the Motorola 68020 microprocessor which services many devices, including terminals independently connected through a dedicated Motorola 68000 microprocessor system, in a multi-user environment. Programs written for implementation on a dedicated Motorola 68000 system are coded on the Vittese in Assembler before being assembled. The object code can then be down-loaded onto an MC68000 microprocessor system if one is attached to the terminal in use. The MC68000 microprocessor system used was developed by the Microprocessor Centre at the University of Durham, and has a special 'monitor' program which provides, amongst other facilities,
Figure 7.2: Microprocessor Laboratory System Topology
The fault injection programme implemented involves injecting faults into the microprocessor memory or program counter. These faults are activated by the application program's execution, and their effects are monitored by using the trace facility and interrogating the register contents to ascertain the processor's status.

7.3.2. Microprocessor Application Under Investigation

For the purpose of this section, a single microprocessor application is analysed. The system chosen has many typical characteristics of an industrial microprocessor based controller such as the monitoring and control of equipment to perform an on-going task or process.

The application system monitors the water level in two connected reservoirs, one higher than the other. If the higher reservoir level falls beneath a minimum marker (solenoid float) then water is pumped from the lower to the higher reservoir. Similarly, if the higher reservoir level goes above a maximum marker (solenoid float) then water is drained from the higher to the lower reservoir.

The controller is based on a Motorola 68000 microprocessor operating at 8MHz although a lower operating frequency would be acceptable for this application. The microprocessor executes software stored within 64 KBytes of memory. The actual program size is initially 381 Bytes. The application program is processed by the PARUT tool, described in Chapter 6, in order to strategically place software detection mechanisms within the code. These mechanisms, designed to provide fault tolerance, for the application program under investigation required an additional 108 Bytes of memory. An annotated listing of the original application program, and the same program after processing by PARUT can be found in Appendix D respectively.

The Motorola 68000 microprocessor system has several detection mechanisms; bus errors (access to unavailable address space) are detected by logic external to the processor whilst address errors (invalid specification of memory locations, e.g. odd
byte address exception) and illegal opcodes are inherently detected by the processor. Collectively these mechanisms shall be referred to as the MC68000 detection facility.

The bus and address errors within the MC68000 detection facility duplicate the function of the Access Guardian proposed in Chapter 5. Therefore an Access Guardian unit is not attached to the microprocessor system. The collected results from the fault injection experiments are processed in order to de-couple the Access Guardian function from the MC68000 detection facility.

7.3.3. Programme of Injected Faults

The fault insertion programme was based on that used by Schuette & Shen [1986]. They injected faults by temporarily altering pin logic values on an embedded Motorola 68000 microprocessor. The faults in this insertion programme are injected by corrupting the microprocessor memory. The experiment models five classes of fault within the microprocessor system as described below.

A. Instruction Cycle : Data Bus Faults

These faults are inserted by corrupting bit positions of an opcode stored in memory. The fault appears on the data bus when the instruction opcode is read. This class of fault models the following situations.

- Memory degradation or data bus line-errors external to the microprocessor.
- Errors in the bus interface circuitry or internal data bus line-errors.
- Errors in the Opcode Decoder.
- Program Counter faults or Address Calculating Circuitry errors as a result of either determining an incorrect branch address (special MC68000 case where displacement is held in the opcode), or corrupted opcode, or opcode decoder error leads to the wrong number of operands being read and hence an incorrect location is accessed for the next opcode.
B. Data Cycle: Data Bus Faults

These faults are inserted by corrupting bit positions of an operand in memory, the fault appearing on the data bus when the operand is accessed. This class of fault models the following situations.

- Memory degradation or data bus line-errors external to the microprocessor.
- Errors in the bus interface circuitry or internal data bus line-errors.
- Faults in the data registers when operands are moved by them.
- Arithmetic Logic Unit (ALU) errors if the operands are processed.
- Program Counter faults or Address Calculating Circuitry errors if the operand is used in determining a branch address.

C. Instruction Cycle: Address Bus Faults

These faults are inserted by replacing an opcode in memory with data from another location in the address space. The fault is activated when the opcode is accessed and an alternative opcode value is put on the data bus. This class of fault models the following situations.

- Memory degradation or data bus line-errors external to the microprocessor.
- Errors in the bus interface circuitry or internal data bus line-errors.
- Faults in the stack pointer (when retrieving the next opcode location), program counter faults, or errors in the Address Calculating Circuitry.
- Multiple faults in the Opcode Decoder which cause severe malfunction and have an effect similar to multiple line-errors on either the internal or external data bus, and burst faults in memory.

D. Data Cycle: Address Bus Faults

These faults are inserted by replacing an operand in memory with data from another location in the address space. The fault is activated when the operand is accessed and an alternative operand value is put on the data bus. This class of fault models the following situations.
- Memory degradation or data bus line-errors external to the microprocessor.
- Errors in the bus interface circuitry or internal data bus line-errors.
- Errors in the Address Calculating Circuitry.
- Alternatively this fault class can mimic multiple faults in the data register, or ALU malfunction, or multiple line-errors on the internal or external data bus, or burst faults in memory.

E. Program Counter Faults

These faults are inserted by corrupting the contents of the program counter using the status facility in the MC68000 board monitor program. The fault becomes active when the next opcode is accessed and processing is forced to deviate from its intended path. This class of fault models the following situations.

- Line-errors on the internal address bus.
- Opcode Decoder faults initiating a branch.
- Corruption of the program counter, errors in the Address Calculating Circuitry, or stack pointer faults which lead to an incorrect branch.

Review

A summary of the injected fault programme and modelled fault situations can be found in Table 7.1. The injected faults are all single-bit; multiple bit faults were not injected in this experiment. The modelled faults are single-bit, or simple errors, except where stated.

7.3.4. Selected Single-Bit Faults

The faults injected for the instruction and data cycle address and data bus experiments, and the corrupted program counter experiment are single-bit corruptions. Bit faults on the address bus and program counter affect address bit positions 1, 4, 8, and 12. Bit fault positions on the data bus are 0, 7, 8, and 15. These bit positions are used by Schuette & Shen [1986] and Damm [1988] in their fault injection
<table>
<thead>
<tr>
<th>Fault Modelled</th>
<th>Fault Class Injected</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>A</td>
</tr>
<tr>
<td>Memory Bit Faults</td>
<td>X</td>
</tr>
<tr>
<td>Memory Select Circuitry Error</td>
<td></td>
</tr>
<tr>
<td>Line Errors: Internal Data Bus</td>
<td>X</td>
</tr>
<tr>
<td>External Data Bus</td>
<td>X</td>
</tr>
<tr>
<td>Internal Address Bus</td>
<td></td>
</tr>
<tr>
<td>External Address Bus</td>
<td></td>
</tr>
<tr>
<td>Bus Interface Circuitry Errors</td>
<td>X</td>
</tr>
<tr>
<td>Faults: Data Registers</td>
<td></td>
</tr>
<tr>
<td>ALU</td>
<td></td>
</tr>
<tr>
<td>Opcode Decoder</td>
<td>X</td>
</tr>
<tr>
<td>Program Counter Faults</td>
<td>X(2,3)</td>
</tr>
<tr>
<td>Address Calculating Circuitry Errors</td>
<td>X(2,3)</td>
</tr>
<tr>
<td>Stack Pointer Faults</td>
<td></td>
</tr>
</tbody>
</table>

**Fault Class Injected:**
A: Instruction Cycle: Data Bus Faults
B: Data Cycle: Data Bus Faults
C: Instruction Cycle: Address Bus Faults
D: Data Cycle: Address Bus Faults
E: Program Counter Faults

**Notes:**
(1) Only if the operand is used in determining a branch address.
(2) Special MC68000 case (1): displacement held in opcode.
(3) Corruption of an opcode or a fault in the Opcode Decoder can result in the wrong number of operands being read for an instruction and hence a program counter or address calculating circuitry error when accessing the next opcode.
(4) When retrieving next opcode address from stack.
(5) Multiple faults causing severe malfunction.
(6) Multiple faults causing severe mis-interpretation of opcode/operand.

Table 7.1. : Single-Bit Fault Injection Programme
experiments. The application program resides in low memory and hence address bus faults on line 12 are typically detected by the Access Guardian function.

In total 2136 faults were injected during the single-bit fault experiment. These faults disrupted program and data flow. A feature of the fault injection programme is that results are dependent on the instruction sequence and not the instruction mix: actual execution paths are traced instruction by instruction. This is important because faults can have different effects when the microprocessor system is in various run-time conditions.

Each instruction cycle fault class had 472 faults injected, as did the program counter fault experiment. The data cycle fault classes comprise of 360 faults each.

7.3.5. Decoupling the Microprocessor Detection Mechanisms

The microprocessor system under investigation has two sources of detection mechanisms: those implemented by the software based fault tolerant techniques proposed in Chapter 5, and those inherently present in the embedded MC68000 microprocessor.

Let the sample space $F$ contain all the faults injected into the microprocessor system. Let the sets $M$ and $P$ be the faults covered by the MC68000 detection facility and the software detection mechanisms planted in the application program respectively. Now,

$$ (M \cap P) = \emptyset, \quad (7.1.) $$

but unfortunately $M$ and $P$ do not describe the whole fault set $F$.

$$ (M \cup P) \subset F. \quad (7.2.) $$

This is because the errors generated by some faults cause erroneous execution to re-synchronize and hence avoid detection.

Let $R$ be the set of faults that generate a re-synchronized outcome, and therefore the injected fault outcomes can be described as follows.
\[(M \cup P) \cup R = F,\]  \hspace{1cm} (7.3.)

\[(M \cup P) \cap R = \emptyset.\]  \hspace{1cm} (7.4.)

The MC68000 detection facility is observed to detect fault outcomes which would otherwise have been diagnosed by an Access Guardian. For the purpose of analysis it is useful to de-couple the Access Guardian function from the inherent MC68000 detection mechanisms. Let set \(A\) contain all the faults that are detectable by the Access Guardian function. Then

\[(M \cap A) = A,\]  \hspace{1cm} (7.5.)

\[(M \cup A) = M.\]  \hspace{1cm} (7.6.)

The faults detected by the MC68000 detection facility without the Access Guardian function is therefore given by \((M \cap \overline{A})\).

### 7.3.6. Performance Evaluation

The detection mechanisms provided fault tolerance for approximately 60% of the faults injected into the system. The outcome of the faults no detected is observed as re-synchronization. The mean latency to either detection or re-synchronization is 1.2 processed instructions. The implications of this latency is discussed below. The overhead associated with the implementation of the software based fault tolerant technique have been estimated in Chapter 5. The hardware overhead attributed to the Access Guardian is approximately 5% of the MC68000 transistor count. The additional memory required by the fault tolerant software is approximately 30% of the original application program size.

A summary of the processing outcome after each single-bit fault is injected is shown in Table 7.2. The processing outcomes are classified as re-synchronization,
<table>
<thead>
<tr>
<th>Injected Fault Class</th>
<th>Re-synchronisation</th>
<th>Detected by</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>with stack corruption</td>
<td>without stack corruption</td>
<td>Software Mechanism</td>
</tr>
<tr>
<td>A</td>
<td>2</td>
<td>133</td>
<td>121</td>
</tr>
<tr>
<td>B</td>
<td>2</td>
<td>210</td>
<td>2</td>
</tr>
<tr>
<td>C</td>
<td>5</td>
<td>132</td>
<td>91</td>
</tr>
<tr>
<td>D</td>
<td>10</td>
<td>198</td>
<td>6</td>
</tr>
<tr>
<td>E</td>
<td>36</td>
<td>145</td>
<td>121</td>
</tr>
<tr>
<td>Total</td>
<td>55</td>
<td>818</td>
<td>341</td>
</tr>
</tbody>
</table>

Table 7.2: Fault Injection Outcomes
detection by a software mechanism, detection by the MC68000 microprocessor excluding the Access Guardian function, and finally, detection by the Access Guardian function. The processing response en route to the monitored outcomes is detailed in Table 7.3. and shown in Figure 7.3.

Re-synchronization involves the program flow, which has already diverged from the specified control path, rejoining the specified control path. Over 40% of the injected faults lead to re-synchronization. This is not surprising, as a significant proportion of the faults injected during the experiment corrupted the instruction without initially affecting the program counter, but corrupted the program counter following the completed execution of the first instruction because the wrong number of operands were interpreted as belonging to the initial opcode. This scenario initiates irregular processing through the coded area before processing once more aligns itself with the original program flow. Gunneflo et al [1989] recorded 5% of his injected faults leading to re-synchronization, the lower value of this figure can be attributed to the nature of his fault experiments. Arlat et al [1989] and Chillarege & Bowen [1989] report 46% and 42% of their fault injections leading to undiagnosed errors which did not prevent continued processing although the function may have been slightly corrupted. These faults together with similar faults, referred to as overwritten faults and collated by Czeck & Siewiorek [1990], (59% [Schuette & Shen, 1986], 60-70% [Czeck & Siewiorek, 1990], and 77% [Choi et al, 1989]), may to some degree be due to the re-synchronization phenomena.

It is clear that re-synchronization is an important class of fault outcome. In the experiment re-synchronization occurred within five erroneously processed instructions, and had a mean occurrence latency of 1.2 instructions. It is interesting to note that 29 of the injected faults classified as re-synchronization, had no effect on execution - the induced erroneous behaviour being totally benign as far as the system status is concerned. For example, in the data cycle with address bus fault, the incorrectly accessed operand address may contain the same operand value as the correct operand address. These occurrences are shown in Table 7.3. where re-synchronization is labelled as occurring after no instructions are processed.
<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>12 78 35 9 1 0</td>
<td>0 70 33 13 1 5 2 0</td>
<td>0 180 2 0</td>
<td>0 21 9 1 0</td>
<td>472</td>
</tr>
<tr>
<td>B</td>
<td>17 194 1 0 0 0</td>
<td>0 0 1 0 1 0 0 0</td>
<td>0 146 0 0</td>
<td>0 0 0 0 0</td>
<td>360</td>
</tr>
<tr>
<td>C</td>
<td>0 107 8 13 9 0</td>
<td>0 57 30 4 0 0 0</td>
<td>0 190 0 0</td>
<td>0 52 1 1 0</td>
<td>472</td>
</tr>
<tr>
<td>D</td>
<td>0 202 6 0 0 0</td>
<td>0 1 4 1 0 0 0</td>
<td>0 146 0 0</td>
<td>0 0 0 0 0</td>
<td>360</td>
</tr>
<tr>
<td>E</td>
<td>0 116 50 14 1 0</td>
<td>0 68 32 12 6 3 0</td>
<td>0 150 0 0</td>
<td>0 17 3 0 0</td>
<td>472</td>
</tr>
<tr>
<td>Total</td>
<td>29 697 100 36 11 0</td>
<td>0 196 100 30 8 8 2 0</td>
<td>0 812 2 0</td>
<td>0 90 13 2 0</td>
<td>2136</td>
</tr>
</tbody>
</table>

Table 7.3. : Nature of Fault Injection Outcomes
Figure 7.3. Fault Injection Outcomes
A particular hazard associated with re-synchronization is that from a user perspective there is a small and perhaps un-noticeable processing glitch which can involve corruption of the microprocessor stack or stack pointer. Approximately 3% of the injected faults re-synchronize with a corrupted stack. A similar result is reported by Damm [1988] who diagnosed 3.6% of his injected faults to cause this error. This hazard may prove critical at a much later processing stage when the return from a subroutine or other stack access occurs. This phenomena of a sleeping fault, called a potential hazard by Chillarege & Bowen [1989], being awakened at some future period could explain the system crash data presented by Czeck & Siewiorek [1990] where high corruption of system integrity is experienced.

Those faults injected to disrupt the data cycle have a 56% probability (approximately) of generating re-synchronization, which is twice the expectation for instruction cycle faults. This is not an unexpected result because corrupted operands will typically alter the result of the function but not the function itself whilst corrupted opcodes will alter the function and interpretation of operands. It is interesting to note that the program counter faults, which are intuitively more allied to instruction cycle faults, generate re-synchronization for 30% of their outcomes which is very similar to the data cycle fault experiment.

In the experiment non-re-synchronized erroneous execution was detected, by either a software detection mechanism, Access Guardian function, or a non-Access Guardian function of the MC68000 detection facility, with a mean latency of 1.2 instructions. This is extremely rapid and highly desirable for reliable systems. The faults trapped with greatest latency (6 instructions) were detected by the software detection mechanism.

Just over one quarter of the faults which do not re-synchronize, are caught by the software detection mechanisms. Only 8, or 1%, of the data cycle faults are detected in this way, compared with 20% of the instruction cycle and program counter faults. The software detection mechanisms have caught erroneous execution as late as the sixth processed instruction, and have a mean detection latency of 1.6 instructions in the experiment.
The Access Guardian function was very successful in detecting injected faults. It detected 40% of the faults inserted with a mean detection latency of one processed instruction. The experimental result reported here compares favourably with other documented versions of this detection function; 60% [Gunneflo et al, 1989] where the used memory filled 12% of the Motorola 6809 microprocessor address space, and 58% [Schmid et al, 1982] where the used area filled 90% of the Zilog Z80 microprocessor address space. As noted by Gunneflo et al [1989], the effectiveness of this detection mechanism will increase as the percentage of used memory in the microprocessor address space decreases.

The remaining injected faults were detected by the illegal opcode facility of the MC68000 microprocessor. These accounted for almost 5% of the detected faults. Early microprocessor designs did not incorporate this facility and under these fault injection experiments would have a reduced reliability. Schmid et al [1982] and Gunneflo et al [1989] attached an illegal opcode detector to their respective Z80 and MC6809 microprocessor systems; the facility detected approximately 35% and 30% of the injected faults. Most modern microprocessor designs incorporate this detection capability. The effectiveness of this mechanism is dependent on the number of illegal opcodes in the instruction (opcode) map, and the data diversity within the microprocessor used memory which is application dependent. It is therefore difficult to quantify the usefulness of this utility, but it can considerably improve a microprocessor system's reliability.

7.4. Multiple-Bit Fault Emulation Experiments

7.4.1. Emulation and Fault Investigation

Emulators are software tools that mimic the register action of a target microprocessor. As such the injection of a fault will not be as accurately modelled as in a simulator which models the functional/gate activity of a microprocessor. However, emulators, unlike simulators, are commonly available and inexpensive. Indeed many modern microprocessor systems are provided with an emulator within a debugging facility.
The register selected for fault investigation is the program counter. Other register faults would only investigate data type errors, whilst program counter faults are indicative of instruction type faults. Gunneflo et al [1989], who injected faults internally at random via ion radiation, found that 77% of faults were instruction type. Other fault injection experiments support this finding: 77% [Schuette & Shen, 1986] and 60% (experienced in the single-bit experiment documented within this chapter). It therefore seems reasonable to conduct experiments that investigate instruction type faults.

7.4.2. Microprocessor Applications Under Investigation

Three application programs have been selected for the multiple-bit fault injection experiment. The first program ('A'), is the same program used for the single-bit fault injection experiment. That is, a slurry pump control application involving the monitoring and control of a reservoir system. Program A is written in assembler for the Motorola 68000 microprocessor. The second program ('B'), is written in assembler for a Motorola 68(7)05 microprocessor based system and is concerned primarily with data movement using the processor input/output ports. The third program ('C'), unlike the other programs, is not an application program but rather a section of high level code written in C. The purpose of this program is to investigate the hidden hazards that can be generated when high level language programs are translated to machine code. Program C is translated into machine code for the Intel 80386 microprocessor.

The three programs were selected to provide a diverse variety of application processors types and sizes; the Motorola 68(7)05, Motorola 68000, and Intel 80386 are 8, 16, and 32-bit machines respectively. As for the single-bit fault injection experiment, the programs are prepared by applying the software based fault tolerant technique proposed in Chapter 5. Annotated assembler/machine code listings of the original application programs and the programs with strategically placed software detection mechanisms are shown in Appendix D.
7.4.3. Programme of Emulated Faults

The faults injected into the program counter are single and multiple-bit corruptions. All possible program counter corruption patterns representing execution in the used area of the microprocessor address space are analysed through fault injection. The remaining program counter corruption patterns generate detection by the Access Guardian and are evaluated. Hence this class of fault is comprehensively analysed. High order bit faults in the program counter typically lead to detection by the Access Guardian feature because the application program resides in low memory.

The programs A, B, and C are investigated to assess their respective fault tolerance with and without the software based fault tolerant technique proposed in Chapter 5. Each program is evaluated:

Version 1: without the software technique applied,
Version 2: with the software technique applied (default detection mechanism size),
Version 3: with the software technique applied (optimum detection mechanism size),

Program A: \{68000\}

The size of this program is 308 bytes, increasing to 500 and 416 bytes when default and optimum size software detection mechanisms are inserted respectively. Faults are emulated to analyse the microprocessor response to even byte program counter corruption covering every location in the program for each of the three program versions; a total of 612 fault runs. Odd byte program counter corruptions are detected automatically by the inherent MC68000 detection facility.

Program B: \{68(7)05\}

The size of this program is 53 bytes, increasing to 86 and 80 bytes when default and optimum size software detection mechanisms are inserted respectively. Faults are emulated to analyse the microprocessor response to program counter corruption covering every location in the program for each of the three program versions; a total of 219 fault runs.
Program C : 80386

The size of this program is 293 bytes, increasing to 365 and 323 bytes when default and optimum size software detection mechanisms are inserted respectively. Faults are emulated to analyse the microprocessor response to program counter corruption covering every location in the program for each of the three program versions; a total of 981 fault runs.

7.4.4. Behavioural Analysis

The performance evaluation in the preceding section, concerning the fault injection experiment, and other fault tolerance evaluations (see Tables 2.2. and 2.3.) involving fault injection, simulation, or emulation, provide static analysis. They do not investigate the processing behaviour associated with the latency of the spawning errors generated by the injected fault, and hence cannot identify dangers or assets of the techniques under evaluation.

The emulation experiment reported here involves tracing erroneous execution instruction by instruction. In this way, the character of erroneous microprocessor behaviour can be investigated. This study validates the erroneous microprocessor behaviour model presented in Chapter 3, demonstrates the effectiveness of the software based fault tolerant technique proposed in Chapter 5, and re-iterates the importance of the re-synchronization phenomenon.

7.4.5. Identified Phases of Erroneous Execution

The erroneous microprocessor behaviour model assumes two phases of erroneous execution: that following an Initial Erroneous Jump (IEJ), and that following a Subsequent Erroneous Jump (SEJ). The fault programme primarily investigates the SEJ phase, but the results can be extended to investigate the IEJ phase under the assumption that an Access Guardian is embedded within the microprocessor system being evaluated. Execution within each phase can generate either a restart, unspecifed jump, return, or stop/wait outcome as described in Chapter 3. Restart outcomes signify detection of erroneous execution, whilst stop/wait outcomes signify a
cessation of execution. Only unspecified jump and return outcomes lead to another SEJ phase of erroneous execution.

The fault emulation results for program A, shown in Table 7.4., are analysed in detail in order to validate the erroneous microprocessor behaviour model. Erroneous execution can be detected in the MC68000 microprocessor system by either the Access Guardian, the inherent MC68000 detection mechanisms, or the software detection mechanisms. The Access Guardian detects access to the unused address space. Inherent MC68000 processor detection mechanisms include the odd byte address exception for the program counter, and the illegal opcode exception. The software detection mechanisms are designed to detect SEJs.

7.4.5.1. The Initial Erroneous Jump Phase

The purpose of Figure 7.4. is to show the behavioural phase of program execution following an Initial Erroneous Jump (IEJ). The IEJ is generated by corrupting the program counter. Some IEJ destinations are detected immediately, such as target locations in the unused area and odd byte addresses, and these are represented by the ordinate intercept in Figure 7.4. During erroneous execution such detection coerces the outcome of an instruction to a restart. The ordinate intercept is lower for version 2 because the injected software mechanisms increase the used memory requirement, which reduces the initial effectiveness of the Access Guardian. The odd byte address exception facility in the MC68000 microprocessor detects all program counter corruptions with an odd byte address, and hence will always have the same detection capability.

During the IEJ phase, detection is provided by the software detection mechanisms or instructions generating conditions that are detected by the Access Guardian or an inherent MC68000 detection mechanism. Version 1 does not have any inserted software detection mechanism, whilst version 2 does. Not all return instructions in program A produce a return outcome, some are liable to generate conditions which are detected by the Access Guardian or an inherent MC68000 detection mechanism and generate a restart outcome. The effect of the software detection mechanisms is
### (a) Version 1: Original Program

<table>
<thead>
<tr>
<th>Jump Outcome</th>
<th>Number of Instructions Processed</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>10</th>
</tr>
</thead>
<tbody>
<tr>
<td>RT</td>
<td></td>
<td>0</td>
<td>13</td>
<td>7</td>
<td>4</td>
<td>5</td>
<td>4</td>
<td>4.5</td>
<td>2.5</td>
<td>3</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>UJ</td>
<td></td>
<td>0</td>
<td>16.5</td>
<td>14</td>
<td>14</td>
<td>16.5</td>
<td>12</td>
<td>6.5</td>
<td>4.5</td>
<td>4</td>
<td>2.5</td>
<td>1</td>
</tr>
<tr>
<td>RN</td>
<td></td>
<td>0</td>
<td>2</td>
<td>2</td>
<td>3</td>
<td>3.5</td>
<td>2</td>
<td>1.5</td>
<td>1.5</td>
<td>1</td>
<td>0.5</td>
<td></td>
</tr>
<tr>
<td>SW</td>
<td></td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td></td>
</tr>
</tbody>
</table>

### (b) Version 2: Insertion of Default Size Detection Mechanisms

<table>
<thead>
<tr>
<th>Jump Outcome</th>
<th>Number of Instructions Processed</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>10</th>
</tr>
</thead>
<tbody>
<tr>
<td>RT</td>
<td></td>
<td>0</td>
<td>93</td>
<td>17</td>
<td>5</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td></td>
</tr>
<tr>
<td>UJ</td>
<td></td>
<td>0</td>
<td>32.5</td>
<td>37.5</td>
<td>33</td>
<td>15</td>
<td>8.5</td>
<td>4</td>
<td>0.5</td>
<td>0</td>
<td>0</td>
<td></td>
</tr>
<tr>
<td>RN</td>
<td></td>
<td>0</td>
<td>2</td>
<td>2</td>
<td>0.5</td>
<td>1</td>
<td>0.5</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td></td>
</tr>
<tr>
<td>SW</td>
<td></td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td></td>
</tr>
</tbody>
</table>

### (c) Version 3: Insertion of Optimum Size Detection Mechanisms

<table>
<thead>
<tr>
<th>Jump Outcome</th>
<th>Number of Instructions Processed</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>10</th>
</tr>
</thead>
<tbody>
<tr>
<td>RT</td>
<td></td>
<td>0</td>
<td>51</td>
<td>16</td>
<td>5</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>UJ</td>
<td></td>
<td>0</td>
<td>32.5</td>
<td>35.5</td>
<td>33</td>
<td>15</td>
<td>8.5</td>
<td>4</td>
<td>0.5</td>
<td>0</td>
<td>0</td>
<td></td>
</tr>
<tr>
<td>RN</td>
<td></td>
<td>0</td>
<td>2</td>
<td>2</td>
<td>0.5</td>
<td>1</td>
<td>0.5</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>SW</td>
<td></td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td></td>
</tr>
</tbody>
</table>

**Notes:**

i) RT - Restart Outcome  
ii) UJ - Unspecifies Jump Outcome  
iii) RN - Return Outcome  
iv) SW - Stop/Wait Outcome

**Jump Outcome Statistics**

i) There are a total of 32 potential jump instructions open to interpretation (version 1).  
ii) 15 valid branches are specified within the original program.  
iii) None of the invalid branches can be detected by an Access Guardian.  
iv) 17 invalid branches can be detected by the insertion of 16 software detection mechanisms (version 2 and version 3).

---

Table 7.4. : Observed Behaviour of Program ‘A’
Figure 7.4. : Program 'A' - IEJ Execution Phase
clearly observed in version 2 in which approximately half of the unspecified jump outcomes now generate a restart outcome, the remaining unspecified jumps occurring within re-synchronization.

7.4.5.2. The Subsequent Erroneous Jump Phase

The observed Subsequent Erroneous Jump (SEJ) phase of erroneous execution for version 1 and 2 of program A are shown in Figure 7.5. The observations made for the SEJ phase are the same as those for the IEJ phase except that the ordinate intercept is origin. This is because the detection capabilities of the Access Guardian and the MC68000 odd byte address exception are taken into account by the detection outcome of their generating instructions in the previous phase of erroneous execution.

The period of linear erroneous execution following an SEJ in version 1 is typically longer than that for version 2. This is denoted in Figure 4 by the combined cumulative jump outcomes for version 1 reaching approximately 100% after 10 instructions processed, compared to 6 instructions processed for version 2. Furthermore, the percentage of SEJ phases terminated by a restart outcome, representing detection, is greater for version 2 than version 1. This observation clearly demonstrates the enhanced detection capability, provided by the insertion of software detection mechanisms, in the SEJ phase.

7.4.6. Analysing Detection Capability

The data collected for the two phases of erroneous execution are inserted into equation (3.13.) in order to determine the dynamic detection capability of the software based fault tolerant technique. The fault emulation results for program A, B, and C, shown in Tables 7.4., 7.5., and 7.6., are processed by equation (3.13.) to produce Figure 7.6(a, b, c). respectively. Each figure shows the detection capability for version 1, 2, and 3 of the program. The programs are assumed to be implemented on a microprocessor system with an embedded Access Guardian. This assumption facilitates evaluation of program counter faults covering the whole address space without the need to model and emulate the unknown data content of the unused address space.
Figure 7.5. : Program 'A' - SEJ Execution Phase
(a) Version 1 : Original Program

<table>
<thead>
<tr>
<th>Jump Outcome</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>10</th>
</tr>
</thead>
<tbody>
<tr>
<td>RT</td>
<td>0</td>
<td>2.5</td>
<td>1.5</td>
<td>0.5</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>UJ</td>
<td>0</td>
<td>15</td>
<td>10</td>
<td>6.5</td>
<td>4.5</td>
<td>5</td>
<td>3</td>
<td>1.5</td>
<td>0.5</td>
<td>0.5</td>
<td>0</td>
</tr>
<tr>
<td>RN</td>
<td>0</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>SW</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>

(b) Version 2 : Insertion of Default Size Detection Mechanisms

<table>
<thead>
<tr>
<th>Jump Outcome</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>10</th>
</tr>
</thead>
<tbody>
<tr>
<td>RT</td>
<td>0</td>
<td>20.5</td>
<td>13.5</td>
<td>6.5</td>
<td>2.5</td>
<td>1.0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>UJ</td>
<td>0</td>
<td>19.5</td>
<td>8</td>
<td>5.5</td>
<td>3</td>
<td>3.5</td>
<td>2</td>
<td>0.5</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>RN</td>
<td>0</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>SW</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>

(c) Version 3 : Insertion of Optimum Size Detection Mechanisms

<table>
<thead>
<tr>
<th>Jump Outcome</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>10</th>
</tr>
</thead>
<tbody>
<tr>
<td>RT</td>
<td>0</td>
<td>14.5</td>
<td>13.5</td>
<td>6.5</td>
<td>2.5</td>
<td>1.0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>UJ</td>
<td>0</td>
<td>19.5</td>
<td>8</td>
<td>5.5</td>
<td>3</td>
<td>3.5</td>
<td>2</td>
<td>0.5</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>RN</td>
<td>0</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>SW</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>

Notes:
- RT - Restart Outcome
- UJ - Unspecifies Jump Outcome
- RN - Return Outcome
- SW - Stop/Wait Outcome

Jump Outcome Statistics
- i) There are a total of 22 potential jump instructions open to interpretation (version 1).
- ii) 7 valid branches are specified within the original program.
- iii) 5 invalid branches can be detected by an Access Guardian (version 1).
- iv) 9 invalid branches can be detected by the insertion of 9 software detection mechanisms, the remaining invalid branches created by the inserted software detection mechanisms can be detected by the Access Guardian (version 2 and version 3).

Table 7.5. : Observed Behaviour of Program 'B'

153
### (a) Version 1: Original Program

<table>
<thead>
<tr>
<th>Jump Outcome</th>
<th>Number of Instructions Processed</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>0</td>
</tr>
<tr>
<td>RT</td>
<td>0</td>
</tr>
<tr>
<td>UJ</td>
<td>0</td>
</tr>
<tr>
<td>RN</td>
<td>0</td>
</tr>
<tr>
<td>SW</td>
<td>0</td>
</tr>
</tbody>
</table>

### (b) Version 2: Insertion of Default Size Detection Mechanisms

<table>
<thead>
<tr>
<th>Jump Outcome</th>
<th>Number of Instructions Processed</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>0</td>
</tr>
<tr>
<td>RT</td>
<td>0</td>
</tr>
<tr>
<td>UJ</td>
<td>0</td>
</tr>
<tr>
<td>RN</td>
<td>0</td>
</tr>
<tr>
<td>SW</td>
<td>0</td>
</tr>
</tbody>
</table>

### (c) Version 3: Insertion of Optimum Size Detection Mechanisms

<table>
<thead>
<tr>
<th>Jump Outcome</th>
<th>Number of Instructions Processed</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>0</td>
</tr>
<tr>
<td>RT</td>
<td>0</td>
</tr>
<tr>
<td>UJ</td>
<td>0</td>
</tr>
<tr>
<td>RN</td>
<td>0</td>
</tr>
<tr>
<td>SW</td>
<td>0</td>
</tr>
</tbody>
</table>

**Notes:**

i) RT - Restart Outcome  
ii) UJ - Unspecifies Jump Outcome 
iii) RN - Return Outcome  
iv) SW - Stop/Wait Outcome 

**Jump Outcome Statistics**

i) There are a total of 35 potential jump instructions open to interpretation (version 1). 
ii) 27 valid branches are specified within the original program. 
iii) None of the invalid branches can be detected by an Access Guardian. 
iv) 7 invalid branches can be detected by the insertion of 6 software detection mechanisms (version 2 and version 3). 
v) There is 1 non-critical placement deadlock. 

Table 7.6: Observed Behaviour of Program 'C'
The relationship between the three versions of each of the programs is very similar. There is a detection capability enhancement shown by versions 2 and 3 over version 1, and version 3 over version 2. However, there is an initial performance overhead associated with inserting software detection mechanisms. This is clearly seen in Figures 7.6(a, b, c), where for the first one or two erroneously processed instructions the detection capability of versions 2 and 3 is lower than that for version 1 (the original) of the program. This performance overhead is due to the increased program size and hence the higher probability that the corruption of the program counter, initiating the IEJ phase of erroneous execution, takes a value corresponding to a location within the program. The performance overhead is quickly over-ridden by the enhanced detection capability provided by the software detection mechanisms.

The relative overhead of the version 2 compared with version 1 depends on the number and default size of the detection mechanisms. The number of detection mechanisms that can be placed is dependent on the code content of the application program. The default size of detection mechanisms is dependent on the application processor's instruction set, and the optimum size required for each placement is dependent on the local code structure. The overhead associated with version 3 is dependent of the memory saving facilitated by using the optimum size of detection mechanism on each placement. There is little difference in versions 2 and 3 for program B because the mean optimum detection mechanism size is about the same as the default detection mechanism size. There is a larger difference in program A, and larger again in program C, because the mean optimum detection mechanism size is smaller than the default detection mechanism size.

The effectiveness of the software detection mechanisms is dependent on their number and distribution within the application program. Program B yields some of the best results due to a large number of software detection mechanisms spread evenly throughout the code. Program A obtains similar benefits from the insertion of software detection mechanisms. Program C results are poorer because fewer software detection mechanisms could be placed, and there is a contiguous block the size of
half the application program which is void of mechanism placements. The effect on program $C$ is to greatly slow down the detection capability.

It will also be noticed that the version 3 results, whilst having a smaller overhead have a slightly reduced detection capability, compared with version 2, after several instructions have been erroneously processed. This is clearly seen for programs $A$ and $C$ in Figure 7.5. The variation in detection capabilities is due to erroneous execution flow, other than a SEJ, being detected by the software detection mechanisms. The default size detection mechanisms detect all non-re-synchronized linear erroneous execution because they have the same number of seed bytes as bytes in the longest instruction construct for the application processor. Hence, it is guaranteed that during linear erroneous execution a seed will be interpreted as an opcode, generating a restart outcome and detection. Optimum size detection mechanism will not detect all linear erroneous execution passing through the detection mechanism because some erroneously interpreted opcodes will consider all the seeds as operands of the current instruction.

Re-synchronization of erroneous execution within the experiment generates a detection capability ceiling. This ceiling is rapidly approached by programs $A$ and $B$, due to the large number of inserted software detection mechanisms. Program $C$ did not facilitate the same potential for software detection mechanism insertion because of the limited number of invalid branches, and hence the detection capability enhancement is not so rapid. Nevertheless, all three programs show improved detection capability. For highly reliable systems, additional techniques should be employed to cover the probability of re-synchronized erroneous execution.

7.4.7. Critical Hazards of Erroneous Behaviour

Three main classes of critical hazard are identified: cessation of processing (described in Chapter 3), placement deadlock (described in Chapter 5), and infinite execution loops. The observations and implications of these hazards within the example programs $A$, $B$, and $C$ are now discussed. The next chapter in this thesis considers a technique for removing these hazards.
7.4.7.1. Cessation of Processing

None of the example programs included code which could be interpreted during erroneous execution as a stop/wait outcome and hence cause microprocessor operation to cease. Program B does, however, include instruction sequences which require user input. These instruction sequences could generate an apparent cessation of processing if erroneous execution re-synchronizes at one of these instruction sequences - the user not being aware of the required input. Such occurrences are very difficult to prevent and an additional fault tolerant methodology is required to remove the hazard.

7.4.7.2. Infinite Execution Loops

These hazards involve the creation of an infinite loop by erroneous jumps. Once erroneous execution enters such a structure, the correct function of the program is permanently lost. The hazard is removed by implementing the software based fault tolerant technique proposed in the thesis, involving the placement of software detection mechanisms and the attachment of an Access Guardian to the microprocessor system as required. A good example of this hazard is demonstrated in program C. The assembler listing of the program can be found in Appendix D. An infinite loop is generated by the erroneous jump at location 1D_{hex} in the 'getvalue' routine, as shown in Table 7.7.

7.4.7.3. Placement Deadlock

The potential hazard of a placement deadlock occurred once during the application of software implemented fault tolerance documented in Appendix D. Placement deadlock describes the situation where an erroneous jump has its generator and destination are operands within the same instruction. The identified placement deadlock, shown in Table 7.8., is at location 2E_{hex} within the 'main' routine in version 1 of program C (listed in Appendix D). Fortunately this potential erroneous jump is not backward, and hence there is no danger of an erroneous infinite execution loop.

Placement deadlocks can, however, be hazardous and it is pertinent to develop techniques for manipulating code with this attribute.
### Table 7.7. Erroneous Infinite Execution Loop

<table>
<thead>
<tr>
<th>Intended Execution</th>
<th>Program Segment</th>
<th>Erroneous Execution</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Address</td>
<td>Code</td>
</tr>
<tr>
<td><code>addl $8,%esp</code></td>
<td>001B</td>
<td>83</td>
</tr>
<tr>
<td><code>operand</code></td>
<td>001C</td>
<td>C4</td>
</tr>
<tr>
<td><code>operand</code></td>
<td>001D</td>
<td>08</td>
</tr>
<tr>
<td><code>movl -(4(%ebp),%eax)</code></td>
<td>001E</td>
<td>FF</td>
</tr>
<tr>
<td><code>operand</code></td>
<td>001F</td>
<td>75</td>
</tr>
<tr>
<td><code>operand</code></td>
<td>0020</td>
<td>FC</td>
</tr>
</tbody>
</table>

### Table 7.8. Placement Deadlock

<table>
<thead>
<tr>
<th>Intended Execution</th>
<th>Program Segment</th>
<th>Erroneous Execution</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Address</td>
<td>Code</td>
</tr>
<tr>
<td><code>call swap</code></td>
<td>002D</td>
<td>E8</td>
</tr>
<tr>
<td><code>operand</code></td>
<td>002E</td>
<td>7A</td>
</tr>
<tr>
<td><code>operand</code></td>
<td>002F</td>
<td>FF</td>
</tr>
<tr>
<td><code>operand</code></td>
<td>002F</td>
<td>FF</td>
</tr>
<tr>
<td><code>operand</code></td>
<td>002F</td>
<td>FF</td>
</tr>
</tbody>
</table>
7.4.8. Re-synchronizing Erroneous Execution

The re-synchronization experienced by version 1 of the example programs is influenced by the relative number of opcodes to operands. Lower ratios encourage re-synchronization, because there are more opcodes available for interpretation as such in the range of locations open to instruction translation during erroneous execution. Version 1 of programs C, A, and B have ascending speeds of re-synchronization, with descending opcode/operand ratios respectively. The opcode/operand ratio is dependent on the instruction mix within the application program, and the instruction constructions within an applications processor's instruction set.

7.5. Summary and Conclusions

This chapter documents the results of two experimental programmes. The first involved injecting 2136 faults into a MC68000 microprocessor-based system and monitoring the system's response. The second involved emulating the response of a MC68(7)05, a MC68000, and an Intel 80386 microprocessor-based system with a combined total of 1812 emulated faults.

The faults selected for the experiments were single and multiple bit. Several microprocessor systems with various detection facilities are monitored in order to evaluate their response to each fault insertion. The behavioural observations highlight the re-synchronization phenomenon whereby erroneous execution, diverted by the activation of an inserted fault, returns to a valid program path. In particular two types of hazard are associated with re-synchronization: placement deadlock, and sleeping corruption of the processor stack. Access to a corrupted stack re-initiates erroneous execution and detection is facilitated by the fault tolerant mechanisms resident in the experimental system. Cessation of processing and placement deadlock are identified as not being covered by the implemented fault tolerant mechanisms, and require other techniques to remove their hazards. Such techniques are proposed and discussed in the next chapter.

Fault injection experiments and fault emulations have indicated the effectiveness of the software based fault tolerant technique proposed in Chapter 5. Performance parameters used to assess the technique are largely dependent on the nature of the
application program to be implemented with the microprocessor system under evaluation. The only overhead which can be accurately stated is the hardware overhead attributed to the design of the Access Guardian, and this is discussed in Chapter 5. Therefore although figures for error coverage, detection latency, and processing overhead are derived for application programs, their replication for other program implementations is not assured. This is a point not sufficiently stressed by other reports of fault injection, simulation, or emulation experiments.

Nevertheless, the benefits of implementing the software based fault tolerant technique have been clearly demonstrated for those microprocessor systems under evaluation. In addition, the behaviour of a microprocessor’s erroneous execution has been observed. These observations have shown the importance of re-synchronizing erroneous execution, and have validated the erroneous microprocessor behaviour model proposed in Chapter 3.
Chapter 8

GENERATING NON-HAZARDOUS SOFTWARE

8.1. Introduction .................................................. 162
8.2. Bridging the Semantic Gap .................................. 163
8.3. The Risks of Erroneous Execution .......................... 163
  8.3.1. Catastrophic Processing Failures ....................... 163
  8.3.2. Critical Hazard Coverage ............................. 164
8.4. Non-Hazardous Program Area Code ....................... 165
  8.4.1. Hazardous Instruction Formats ......................... 165
  8.4.2. Hazardous Opcodes .................................. 166
  8.4.3. Hazardous Operands ................................ 168
    8.4.3.1. Prevention of Addressing Mode Hazards ........ 168
    8.4.3.2. Inherent Addressing ............................. 168
    8.4.3.3. Manipulating Direct Addressing .................. 169
    8.4.3.4. Manipulating Immediate Addressing .............. 170
    8.4.3.5. Manipulating Indirect Addressing ............... 170
    8.4.3.6. Manipulating Indexed Addressing ............... 171
    8.4.3.7. Manipulating Register-Indexed Addressing ...... 171
    8.4.3.8. Manipulating Relative Addressing ............... 172
8.5. Influencing Translator Practices ....................... 174
  8.5.1. Instruction Selection .................................. 174
  8.5.2. Coupling and Cohesion ................................ 174
  8.5.3. Macros .................................................. 175
  8.5.4. Peephole Optimization ................................ 176
8.6. Non-Hazardous Data Area Code ....................... 176
8.7. Non-Hazardous Input/Output Reserved Area Code ......... 177
8.8. Influence of the Instruction Set ....................... 177
  8.8.1. Undefined Instructions ............................... 177
  8.8.2. Restart Instructions .................................. 178
  8.8.3. Stop/Wait and Return Instructions ............... 178
  8.8.4. Unspecified Jump Instructions ....................... 179
8.9. Conclusions ................................................. 179
8.1. Introduction

Previous chapters have identified the characteristics of microprocessor behaviour following an event disrupting software execution. A fault tolerant technique has been developed to detect erroneous behaviour and initiate recovery. There still, however, remains the uncertainty of the nature of any erroneous behaviour.

Erroneous microprocessor execution is dependent upon the operation outcomes of interpreted instructions through the address space. There may be associated risks with erroneous execution which adversely effects system integrity. In particular, erroneous execution may process critical hazards which are indicative of catastrophic processing failures.

Translators can have considerable influence on the nature of target machine code generated to implement a high-level language program [Ciminiera & Valenzano, 1987]. Machine code is produced primarily on the criteria of performance and efficiency, however, such code may incorporate critical hazards as observed in the previous chapter. Microprocessor reliability can be considerably improved if hazardous code is not generated.

This chapter investigates the generation of software by two types of translator: compilers and assemblers. Generated software consists of machine code for program, data, and input/ output reserved areas of the address space. Techniques are proposed for each of these areas to manipulate the machine code in order to prevent the generation of critical hazards within the code released by a translator. To facilitate these techniques, it is necessary to encourage and discourage particular translator practices. These practices are influenced by the characteristics of the target microprocessor. A selection of 8, 16, and 32-bit microprocessors are examined including the Intel 8086 and Motorola 68000 families, and their implications for translator action discussed.
8.2. Bridging the Semantic Gap

Microprocessor design is advancing so rapidly that design-oriented software can quickly become obsolete. It is therefore important that software should be written in a manner that facilitates implementation on a variety of processors. Even with upwardly compatible microprocessor designs, implementation of an original piece of software inefficiently uses the capability of the advanced design.

High-level languages have been developed which are abstracted from processor architectures and configurations to the extent that the language is independent of the target machine. Examples of high-level languages include FORTRAN, Pascal, and C. They are considered portable languages because they allow software to be implemented on a range of processor types.

The difference between a high-level language and its target microprocessor implementation is known as the semantic gap. Bridging this gap has a major influence on the overall execution efficiency and software reliability of the microprocessor application. The tool used to bridge the gap is the translator. Translators automatically process the high-level software down through the levels of abstraction until code is generated which is executable on the target processor. The nature of translation means that the programmer has no control over the generation of the machine code and any hazards it may contain. It is important that translators are designed to generate non-hazardous machine code.

8.3. The Risks of Erroneous Execution

8.3.1. Catastrophic Processing Failures

Two particular hazards identified with catastrophic failure are the possibilities of cessation of processing, and infinite processing loops during erroneous execution. Both occurrences prevent the possible detection of erroneous microprocessor behaviour by software techniques: an external hardware reset being necessary to restore the microprocessor system.
Processing ceases when an instruction execution outcome is a stop/wait state. Exit from this state requires external intervention. The probability of process cessation can be predicted by examining individual address space elements for stop/wait instruction codes. However, it must be recognized that the processing action of the processor with volatile memory can introduce further occurrences of this hazard which cannot be predicted.

Infinite execution loops are instruction sequences that are continuously executed in a cyclic fashion. There is no processing exit from these loops except through some external intervention. Such loops may incorporate a chain of erroneous jumps, and operate through a combination of different functional areas in the address space (described in Chapter 5). Identification of all infinite execution loops for application software would involve tracing the execution path for all possible erroneous execution variants following an Initial Erroneous Jump (IEJ). In reality, execution paths may change. A loop might include a conditional branch instruction. Erroneous execution might change the tested condition codes and hence the loop would not be infinite. Prediction of infinite execution loops would also have to allow for a section of address space being implemented on volatile or non-volatile memory. Obviously volatile memory is subject to manipulation through processor activity such as stack operations.

8.3.2. Critical Hazard Coverage

In order to prevent the occurrence of catastrophic processing failures, it is necessary to provide coverage for all associated critical hazards. The two identified hazards of catastrophic failure, cessation of processing and infinite execution loops, rely on the action of erroneous jumps. It has been shown in the previous chapter that translator generated software can include hazardous code which if interpreted as an opcode produces an erroneous jump. The generation and eventuality of executing such hazardous code is covered to improve the reliability of the microprocessor system.

A fault tolerant technique based on software enhancement has been proposed in Chapter 5. This technique provides rapid termination of erroneous behaviour by
detecting the execution of erroneous jumps. The technique implements an Access Guardian which provides immediate detection of erroneous execution in the unused area of the address space. The effectiveness of the technique in the used area depends on the number of erroneous jumps within the machine code. However there are three particular erroneous jump constructs which cannot be covered by the fault tolerant technique: stop/wait erroneous jump outcomes, return erroneous jump outcomes, and placement deadlock.

The interval between the initiation of erroneous execution and its detection is called error latency. During this period there is a risk of a critical hazard not covered by the fault tolerant technique being executed. Such processing may generate a catastrophic failure.

Clearly there is a need to develop a translator which avoids generating these classes of erroneous jump hazard. Such a translator would reduce the inherent hazard associated with the code without reducing the detection improvement offered by application of the fault tolerant technique. Indeed the ability to manipulate used area code enables not just the prevention of hazards, but also the introduction of attributes facilitating detection of erroneous execution.

8.4. Non-Hazardous Program Area Code

8.4.1. Hazardous Instruction Formats

The program area consists of instruction sequences generated by the translator. Each instruction has a conceptual structure consisting of a descriptor and an address field. The descriptor specifies the instruction operation. The address field specifies the information to be processed by the instruction. The manner in which the address field is interpreted is known as the addressing mode of the instruction. The content of the address field is referred to as the stated address. The address of the referenced memory location, containing the information to be processed, is referred to as the effective address.

The physical implementation of an instruction consists of opcodes representing the descriptor, and operand(s) if required by the addressing mode representing the
stated address. Opcodes are defined in the microprocessor's instruction set. Operands are specified by the addressing modes implemented within the microprocessors architecture. The most common addressing modes are

<table>
<thead>
<tr>
<th>Addressing Mode</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Inherent Addressing</td>
<td>No effective address required.</td>
</tr>
<tr>
<td>Direct Addressing</td>
<td>The stated address is the effective address of a register.</td>
</tr>
<tr>
<td>Immediate Addressing</td>
<td>The effective address is the location immediately following the opcode.</td>
</tr>
<tr>
<td>Indexed Addressing</td>
<td>The effective address is the stated address added to the contents of a specific register (index register).</td>
</tr>
<tr>
<td>Indirect Addressing</td>
<td>The stated address serves as a pointer to the location at which the effective address resides.</td>
</tr>
<tr>
<td>Register-Indirect Addressing</td>
<td>The stated address serves as a pointer to a specified register which contains the effective address.</td>
</tr>
<tr>
<td>Relative Addressing</td>
<td>The stated address is an offset which is added to the contents of a location inherently selected by the opcode to form the effective address.</td>
</tr>
</tbody>
</table>

The hazard identified with catastrophic processing failures, the erroneous jump, can be generated by opcodes and operands. The addressing mode adopted by an instruction influences the requirement and nature of operands. Techniques are proposed for the selection of instructions and their manipulation to prevent the generation of critical hazards. To this end, certain translator practices can be encouraged or discouraged. Several practices are identified and discussed.

### 8.4.2. Hazardous Opcodes

Early microprocessor designs such as the Intel 8080 and Motorola 6800 implement an instruction set in which all operations can be uniquely identified by a single-byte opcode. Hence these opcodes have no associated hazard.
Later processor designs extended the instruction set to improve flexibility and performance. Examples of this type of machine include the Motorola 68000 and Intel 8086. Such microprocessors are known as Complex Instruction Set Computers (CISC). Processor designs that extend the instruction set beyond 256 unique operations require a multiple-byte opcode. However, the second and subsequent opcode bytes are liable to have a hazardous format susceptible to erroneous execution. The nature of this hazard can be determined by mapping the opcode bytes onto the first-byte opcode map.

The Intel 8086 microprocessor family have instructions which require an additional opcode byte to further specify the operation code of the first opcode byte. Hazardous bytes can be identified and suitable translator action taken.

The Intel 80286 microprocessor instruction set has an instruction ‘SETNP’ whose second byte, if interpreted as an opcode, generates a stop/wait outcome. This is a critical hazard associated with catastrophic processing failure. There is no method of manipulating the second-byte of the opcode because its format is completely defined. The selection of this and other instructions that have hazardous opcode bytes that cannot be manipulated should be avoided in order to prevent the possibility of catastrophic failure during erroneous execution.

All instructions in the Motorola 68000 microprocessor family instruction sets have a two-byte opcode. Furthermore, these microprocessors implement memory organization which specifies that instructions reside at even-byte boundary addresses. Interpretation of an instruction at an odd-byte boundary address generates a software exception and hence a restart outcome. Hence there is no hazard associated with the second opcode byte because of the over-ride action of the odd-address exception.

Some modern microprocessor designs have returned to a smaller instruction set, discarding under-utilized instructions. The core of the instruction set facilitates efficient processing within a simplified machine architecture. The processors are known as Reduced Instruction Set Computers (RISC). An example of this processor type is the AMD Am29000. This microprocessor specifies its instructions to have a single-byte opcode and a three-byte operand, with the notable exception of the load and
store instructions where there is a two-byte opcode and a two-byte operand. Single-byte opcodes have no hazard whilst two-byte opcodes may have a hazardous second byte.

8.4.3. Hazardous Operands

8.4.3.1. Prevention of Addressing Mode Hazards

In general, instructions have operands which are susceptible to interpretation as a first-byte opcode during erroneous execution. The addressing mode adopted by an instruction influences the requirement and nature of these operands. Instructions may be selected on the basis of the non-hazardous operands associated with addressing mode and their ease of manipulation.

An exception to the rule is the AMD Am29000 microprocessor. This machine locates instructions at every other even-byte address. An exception facility can be enabled within this processor such that a program counter, containing the address of an operand byte, is masked to access the opcode address. Hence the operands may have a hazardous format but the microprocessor architecture prevents their interpretation during erroneous execution. No operands, therefore, require manipulation to remove any hazard.

The Intel 8086 microprocessor family architecture implements segmentation. Every memory access requires the specification of a segment register in conjunction with an addressing mode. The segment register and the addressing mode implement the most and least significant address bits respectively. Hence all addressing modes facilitated by these processors are pseudo-relative. The allocation of segment registers can be automatic which eases the complexity for low-level programmers and translators alike.

8.4.3.2. Inherent Addressing

In this addressing mode there is no effective address requirement. Inherent addressing instructions have all necessary information for processing within the instruc-
tion opcode. No operands are needed. A common inherent addressing instruction to many microprocessor instruction sets is 'no-operation'.

8.4.3.3. Manipulating Direct Addressing

The number of operands required by an instruction implementing direct addressing to specify the effective address is dependent upon the size of the microprocessor address space and the size of each operand. For instance, a 4 GByte address space requires 4 bytes to specify an absolute address, whilst a 64 KByte address space requires only 2 bytes. Operands typically take a one-byte structure. However, larger byte structures are implemented by particular microprocessors, notably the Motorola 68000 family which have two-byte operands. Each operand used to represent the effective address may have a hazardous format.

The hazard associated with an operand that partially or fully represents an effective address is dependent on that address. Accessed memory can be relocated so that the operands specifying the effective address take non-hazardous code formats.

To give the operands of an instruction implementing direct addressing a restart format, it is necessary to have groups of consecutive restart opcodes within the microprocessor instruction set. Larger groups of restart opcodes reduce the complexity of movement of the used area when covering each directly addressed location. The Intel 8086 microprocessor like the AMD Am29000 has an instruction set which has few useful restart instruction formats. The advanced Intel 80286 and 80386 microprocessors, which are upwardly compatible with the 8086, have no additional restart instructions of a useful format. The Motorola 68000 microprocessor family have a large number of useful restart instruction formats. Their instruction sets of 65536 instructions comprise of approximately 2% defined restart instructions, and 30% undefined instructions specified to generate a software exception which is a restart outcome. Within these restart formats are two large blocks taking the hexadecimal values AXXX and FXXX, where 'X' is a 'don't care' hexadecimal value. These represent undefined instructions which are reserved for implementation by later releases of the microprocessor family. In particular, the Motorola 68030 microprocessor facilitates the use of some FXXX hexadecimal format instructions for co-processor operation.
However, if the microprocessor system does not incorporate a co-processor then these instructions revert to generating a restart outcome. Each of the two blocks of restart instructions when converted to an address range cover 4 KBytes. The remainder of the instructions generating a restart outcome are scattered about the opcode map.

8.4.3.4. Manipulating Immediate Addressing

In the immediate addressing mode, the effective address is, by default, the instruction operand(s). The operands contain data to be processed by the instruction.

The operands are by their nature application specific. Their hazards are dependent on the host microprocessor instruction set. Although the data stored cannot be altered, the format of its storage can be manipulated.

A technique is proposed for microprocessors whose architecture organizes memory on a double byte addressing scheme such as the Motorola 68000 family. Operands therefore have a two-byte format. For single byte data, one byte holds the data, while the remaining byte can be set to a value such that the resultant operand code is non-hazardous. The choice of operand values that do not exhibit a hazard is dependent on the target instruction set.

8.4.3.5. Manipulating Indexed Addressing

This mode of addressing requires operands to represent the stated address and the index register. This addressing mode is particularly useful when implementing a stack structure in memory for microprocessor operation.

Some microprocessor architectures, including the Motorola 68000 family, specify register use within the instruction opcode. Other architectures, such as the Intel 8086 family require an operand to specify register usage. In these instances it is suggested that some registers may be preferred. The generation of operand code specifying particular registers and which has a hazardous format should be avoided.

The operand(s) representing the stated address can be manipulated using the same techniques proposed for direct addressing: operands specifying absolute addresses for direct addressing should not have a hazardous bit format.
The Intel 8086 and Motorola 68000 microprocessor families implement additional variants of the indexed addressing mode. These involve a base address displacement which is added to the effective address. Such extensions to the indexed addressing mode facilitate enhanced data processing techniques.

8.4.3.6. Manipulating Indirect Addressing

The stated address contained in the operands attached to an instruction implementing indirect addressing is an intermediate location in the address space. The content of this location specifies the effective address of the data to be processed by the instruction.

It is proposed that an intermediate location is selected so that the operands specifying its address do not take a hazardous value. However, the intermediate locations available may be restricted by the residence of used areas in the address space. Movement of the used area can release useful locations.

It is valuable when manipulating indirect addressing, to have a large number of scattered opcode formats for restart instructions. This reduces the likelihood of having to move sections of the used area in order to release addresses that mimic a restart opcode format.

Both the Intel 8086 and Motorola 68000 microprocessor families do not implement indirect addressing, preferring to use the equivalent register-indirect addressing mode.

8.4.3.7. Manipulating Register-Indirect Addressing

This method of addressing is closely allied to the indirect addressing mode. The effective address is contained within a register specified by the stated address.

As identified with the index addressing mode, some microprocessor architectures require the specification of a register within an operand. These operands can be manipulated to remove any hazard.

This addressing mode is implemented with particular effect in respect to hazard generation in the Motorola 68000 microprocessor family. These machines implement an instruction using register-indirect addressing without an operand requirement, i.e. the opcode specifies the register. Opcodes for these processors do not have a
hazard, so instructions implementing this addressing mode do not introduce hazards and should be encouraged by translator code generation.

8.4.3.8. Manipulating Relative Addressing

The inherent register specified by relative addressing serves as a pointer to a memory location. The stated address acts as a displacement to the pointer. The result is a method of referencing memory in the vicinity of the address contained in the register.

An important class of instruction that implement relative addressing are branches. These facilitate changes in the otherwise sequential control flow of program execution. Relative addressing is used in conjunction with the microprocessor’s dedicated register, the program counter, which serves as a pointer to the next instruction to be processed.

The displacement required by this addressing mode is stored in one or more operands. A large proportion of relative addressing displacements are local. Operands specifying local forward and backward displacements have low and high hexadecimal values respectively. The hazards associated with such operand values are dependent on the host microprocessor instruction set. Displacements represented by operand code of a hazardous format should not be used. Operand manipulation is only possible where the displacement value is independent of run-time conditions.

Assuming that branch offsets tend to be local, deductions can be made for different microprocessor implementations. This is known as the principle of program locality [Ciminiera & Valenzano, 1987]. Two popular microprocessor families are investigated with respect to local relative addressing hazards.

Firstly, consider a Motorola 68000 target machine. Local backward branches will produce an operand of the hexadecimal format FFXX, an undefined instruction, which, when executed as an opcode, generates a restart outcome and hence no hazard. Local forward branches produce an operand of the format 00XX which, when executed as an opcode, is highly likely to produce an OR instruction. Although there is no immediate hazard associated with an OR instruction it does continue erroneous execution, and may corrupt system data. Implementation on the compatible but
extended instruction set of the Motorola 68020 realizes usage of some of the FFXX instructions for co-processor operation. When a co-processor is incorporated into the microprocessor system, local backward jumps have a probability of producing an erroneous instruction execution hazard. However, if a co-processor is not present then these instructions continue to generate a restart outcome.

Secondly consider the Intel 8086 microprocessor family. Local backward branches tend to produce FF hexadecimal values for the most-significant, and high hexadecimal values for the least-significant operand bytes. Equally, local forward branches tend to produce 00 hexadecimal values for the most-significant, and low hexadecimal values for the least-significant operand bytes. The byte FF hexadecimal value, when erroneously interpreted as a first-byte opcode specifies a multi-byte opcode, which can take a format associated with critical hazards. The byte 00 hexadecimal value, when mapped onto the first-byte opcode map, reveals that erroneous execution will generate an ADD instruction. Whilst this is not classified as a critical hazard, it still allows erroneous execution to propagate and perhaps activate a critical hazard elsewhere. Low and high hexadecimal values can thus generate both detection of erroneous execution and catastrophic processing failures. Hence local branches are fraught with danger for translators generating machine code for the Intel 8086 microprocessor family. Translators can move sections of code within the address space to ensure bytes specifying local branches do not take formats defining critical hazards. The intermediate vacant slots of memory between relocated sections of code can be filled with 'no-operation' single-byte instructions, or sections of code linked by a branch instruction and intermediate locations filled with restart single-byte restart outcome instructions yielding a detection capability. This proposed manipulation of the program area will introduce a processing performance overhead to the correctly executing program. However, this overhead is deemed to be acceptable for most microprocessor applications.
8.5. Influencing Translator Practices

8.5.1. Instruction Selection

It is desirable that a translator uses a microprocessor instruction set efficiently during its code generation phase. That is, program code is produced for optimal performance. Ineffective use of the instruction set could lead to unnecessary increases in the processing time and memory space required by the program code.

Selection of instructions which process registers has a particular benefit. Addressing modes using registers do not require a memory fetch and hence are more efficient than addressing modes using direct addressing. Instructions which process registers thus enhance program code performance.

An example of translator inefficiency is demonstrated by the UNIX 'cc' compiler. The target machine is the Motorola 68000. The instruction set for this microprocessor includes a branch instruction with an embedded byte displacement, for relative addressing, in the opcode. This compiler, however, generates a branch instruction with a two-byte displacement for a branch requiring a single byte displacement. The implemented instruction requires an otherwise unnecessary operand. Resultant program code requires extra memory, and has a slower execution for this instruction.

Whilst generated code should be efficient, it should also not exhibit any critical hazards. Manipulation of the code to remove these hazards may degrade program performance. Particular translator actions can be encouraged or discouraged so that the generated code is acceptable.

8.5.2. Coupling and Cohesion

Selected instructions should exhibit the characteristics of coupling and cohesion. These characteristics are more commonly associated with high-level language software engineering [Sommerville, 1985]. Coupling is a measure of the program code dependence on referenced parameters. Cohesion is a measure of the functionality (unity of operation) of the program code. Translators should generate program code that exhibits both high coupling and cohesion characteristics.
Low coupling indicates the use of specified parameters rather than referenced parameters. Specified parameters give operand formats which may have an associated hazard. Program code may replicate use of the parameter and hence the hazard is propagated. An example of this is demonstrated by the UNIX ‘cc’ translator on the SUN machines whose target processor is the Motorola 68020. The translator generates instructions with an immediate addressing mode to access repeated data values, and hence any hazard associated the data value is replicated. Translators can avoid this problem by implementing reference parameters. Reference parameters are independent of the specified parameter value. The reference can be altered so that its operand format does not have an associated hazard. Such manipulation does not affect the specified value of the referenced parameter. High coupling is, therefore, a useful attribute when manipulating the program code to remove hazards.

Cohesion reflects the efficiency of the generated program code. Low cohesion indicates unnecessary control flow in the generated instruction sequences. Loop invariants are a source of inefficient control flow [Aho & Ullman, 1977]. They involve a computation that produces the same result at each cycle of a loop. The computation can be moved to a point just before the loop is entered. In general, loops are a major source of program inefficiency. High cohesion is an attribute of effective program code generation.

8.5.3. Macros

A macro is a collection of target machine instructions which perform some operation not directly facilitated by the microprocessor instruction set. Macros appear in the intermediate code generated by a translator. The translator replaces the macro with its defined target instruction sequence when producing target machine code. A macro should not be confused with a procedure. Macros are called and expanded by translator action whereas procedures are called and processed by program execution.

The specification of macros should ensure no resident hazards. Removal of hazards can be achieved using the code manipulation techniques outlined in this chapter. The importance of macros increases with processor architectures implementing
smaller instruction sets which provide fewer directly executable operations. This is particularly so in RISC processors such as the AMD Am29000.

8.5.4. Peephole Optimization

Translator generated code usually goes through the process of peephole optimization before release. This is designed to enhance the performance of the program code by increasing the efficiency of small sections of code. The JPI TopSpeed (version 1) Modula-2 translator produces code for the Intel 80296 microprocessor which is particularly effective in this respect. However, optimization may involve manipulation of instruction sequences such that hazards are re-introduced. Nevertheless, peephole optimization is very valuable in generating efficient program code. Therefore, this process cannot be abandoned, but rather it must endeavour not to create any new hazards.

8.6. Non-Hazardous Data Area

These areas can include both volatile and non-volatile memory. Non-volatile memory is used to implement static data structures such as 'look-up' tables. Volatile memory can implement dynamic structures such as stacks, as well as static data structures.

Halse [1984] suggests many variants of a software seeding strategy to implement detection of erroneous behaviour in this area. Experiments demonstrate the effectiveness of seeding. However, erroneous execution in this area can have unpredictable behaviour before detection, depending on the data content. In some instances the content may change through the operation of the microprocessor system.

The AMD Am29000 microprocessor has a separate instruction and data channels, therefore instruction fetches from the data area are impossible. Interpretation of a data area as a program area generates an exception and hence a restart outcome which denotes detection of erroneous behaviour: the data area within this processor is therefore intrinsically non-hazardous.
8.7. Non-Hazardous Input/Output Reserved Area

These memory mapped areas contain locations reserved by the microprocessor architecture for communication with external devices. Used locations will contain time dependent values related to the requests to, and responses from external devices. Unused locations with an attached communication link can have an undefined content.

All memory mapped I/O locations are susceptible to erroneous interpretation as an opcode. Such erroneous behaviour may be hazardous. Manipulating the content of the I/O reserved locations can remove any associated hazard. The proposed alterations of the microprocessor system require knowledge of the I/O communication of the application software.

The values resident at used locations can be manipulated using the same technique proposed for immediate addressing. Alternatively, values sent and received from any external device could be defined to take non-hazardous formats. The communication links for the unused locations can be tied high or low, corresponding to logic 1 and logic 0 respectively. In this manner, unused memory mapped I/O addresses can be set to hold non-hazardous values.

8.8. Influence of the Instruction Set

Manipulation techniques for machine code in the used area of the address space have been proposed, based on microprocessor architecture. Specific implementation is dependent upon the instruction set of the host microprocessor. This section identifies instruction set characteristics, including content and format, which facilitate or hinder the manipulation techniques.

8.8.1. Undefined Instructions

Some microprocessors do not declare the operation of undefined instructions within their instruction set. The physical implementation of such instructions depends on the manufacturer. A manufacturer will, typically, introduce useful and varied operations for these instructions. However, there are no standard operations
for these instructions and their function can vary between manufacturers. Furthermore, the manufacturer is under no obligation to keep a particular function for different fabrications of the same microprocessor. Modern microprocessors have generally avoided this ambiguity by declaring their undefined instructions to generate a software exception - a restart outcome. Undefined instructions have therefore been given a detection attribute. Translators should encourage the use of these codes so that the detection capability of the generated code is enhanced. Modern microprocessors generally declare the operation of all possible instruction opcodes even though the use of some is undefined.

8.8.2. Restart Instructions

Instructions with a restart outcome are used to detect erroneous behaviour. Where possible code should be manipulated to attain this detection capability. The ability of a translator to endow code with this attribute is influenced by the numbers of restart instructions and their position within the target processor's instruction set.

The proportion of instructions in a microprocessor instruction set that generate a restart outcome are typically small, see Appendix A. However, in some microprocessors the undefined instructions are specified to generate a restart outcome. Undefined instructions also usually occur as groups within the instruction set, being reserved for future instruction set extensions. There can be substantial numbers of undefined instructions within an instruction set.

8.8.3. Stop/Wait and Return Instructions

Hazards associated with code which reflect the format of a stop/wait instruction are critical. They are synonymous with cessation of processing: a catastrophic failure. Similarly, code with a return instruction format is deemed a critical hazard because it can create an infinite execution loop during erroneous execution. Such erroneous processing is also associated with catastrophic failure.

Fewer stop/wait and return instructions in the instruction set discourage the generation of critical hazards. This assumes that the translation of target machine code can be broadly considered as a pseudo-random process.
Additionally, the principle of locality, associated with relative addressing, identifies that high and low value operands occur more regularly than other formats. To further reduce the probability of hazard generation, it will be an advantage if the stop/wait and return instructions take mid-range operand values.

The generation of hazards will also be influenced by the application code access to input/output reserved areas. Specific, utilized locations in this area may have an address which has a hazardous format corresponding to a stop/wait or return instruction. The translator generation of these hazards is application specific. For this reason, microprocessors are preferred whose stop/wait and return instructions do not reflect any reserved input/output locations.

8.8.4. Unspecified Jump Instructions

Code which has a format of an unspecified jump instruction is considered hazardous in the same manner as code with a return instruction format, the hazard being indicative of an infinite execution loop and hence catastrophic failure. Although such code has an embedded critical hazard, it is a feature of asset rather than liability. The fault tolerance technique proposed in this thesis uses the occurrence of code with unspecified jump instructions formats to detect erroneous execution. Hence the presence of this code hazard is an attribute which indirectly enhances the software detection capability.

8.9. Conclusion

A significant amount of software produced for microprocessor applications is written in a high-level language which is independent of the target processor. This software is converted by a translator to equivalent machine code implementation. The programmer, therefore, does not have any control over the nature of the machine code generated. This code could contain embedded hazards which cause catastrophic failures during erroneous execution.

It has been shown that translators can implement techniques to manipulate machine code. Such manipulation can prevent the release of hazardous code by the
translator. The techniques involve encouraging and discouraging particular translator practices. Within program areas instructions are selected with respect to their opcode and operand. Both may have a hazardous format. Equivalent operations can be chosen to provide alternative opcodes, and operands can be manipulated by implementing different addressing modes so that no critical hazards are generated. In addition, the configuration of the address space can be altered to cover certain hazards. The manipulation techniques proposed to prevent hazardous code generation in the data and input/output reserved areas are dependent to a large extent on the data structures implemented by the microprocessor. Applying the collection of manipulation techniques can prevent the generation of critical hazards.

The proposed translator techniques are influenced by the format and content of the target processor instruction set. General inferences are made for instruction set characteristics.

The ability to manipulate translation of target machine code also facilitates further working of the code to introduce a detection capability for erroneous execution. This is achieved by encouraging the generation of code with a restart instruction format. Such enhancement of the code can further improve the reliability of a microprocessor application by reducing error latency.
9.1. Introduction

The hazard of erroneous microprocessor behaviour is modelled and evaluated in the early chapters of this thesis. A technique is proposed in Chapter 5 using software implemented fault tolerance to provide detection of erroneous execution. A utility has been built which automatically applies the proposed software based technique to target code, with the performance of the enhanced code being analysed and the benefit of its improved detection capability demonstrated. Furthermore, the previous chapter discusses methods of software translation which reduce the hazard of the target source code in respect of erroneous execution. This chapter concludes the research presented in this thesis by considering design features that can be incorporated within the architecture of a microprocessor to provide fault tolerance.

The design features presented identify particular characteristics of erroneous execution. Once erroneous execution is distinguished from valid processing, appropriate recovery action can be initiated to restore the integrity of the microprocessor system. The performance of the implemented design features and their attributed architecture overhead for a microprocessor is discussed.

9.2. The Effectiveness of Fault Tolerance

As discussed in Chapter 1, the implementation of fault tolerance at the logic level within a microprocessor’s architecture incurs large overheads, in particular, a large extension in the number of gates is required. Chakraborty & Ghosh [1988] show that logic faults have a good correlation to functional errors in processor operation. It is therefore feasible to implement techniques facilitating functional-level fault tolerance. These techniques require less gate overhead than logic fault tolerance because they only attempt to detect particular characteristics of erroneous behaviour rather than all possible failure patterns.
Established functional-level fault tolerant techniques for microprocessor systems, called capability checks, are discussed in Chapter 2. Application of these techniques to microprocessor systems involves the addition of dedicated hardware units and/or software manipulation. Fault tolerant techniques which operate on functional aspects of erroneous execution are extremely difficult to assess. Erroneous behaviour will vary from application to application, depending on the target software and processor and hence the detection capability of the applied technique will vary too. This is known as the 'instruction sequence' dependency. Nevertheless, inclusion of capability checks within microprocessor systems can be beneficial where high reliability is required.

9.3. Implementing Fault Tolerance

The implementation of fault tolerance as described in Chapter 2 involves the co-ordinated detection of an error and restoration of the microprocessor system integrity. This thesis considers the first stage of fault tolerance, i.e. the detection of erroneous microprocessor behaviour as described in Chapter 3. It is proposed that detection can be achieved through the generation of a restart outcome during erroneous execution. Such a processing outcome re-establishes control of the program flow by directing execution to a predefined location in the address space. In order to complete recovery, a restart outcome must initiate execution of a module of code that restores system integrity. The predefined location in memory, accessed as a result of a restart outcome, is therefore hardwired within the microprocessor architecture to be a section of Read Only Memory (ROM). The micro-code in the ROM is programmed with a recovery routine for the application software.

Microprocessor design features are proposed in this chapter which facilitate an increased probability of erroneous execution generating a restart outcome, without the need for peripheral circuitry or software manipulation.

9.4. Influences on Microprocessor Design

It is very difficult to present a list of influences on microprocessor architecture design because of the diversity of application requirements between different processors. For instance, the Intel 4004 is a 4-bit microprocessor designed for simple arithmetical
functions, while the Motorola 68(7)05 is an 8-bit processor specifically designed for programmable controllers, and the Advanced Micro Device Am29000 processor is a 32-bit processor designed for real-time systems requiring large amounts of processing. The architecture of these microprocessors satisfies many design requirements including processing performance, processing capability, and cost. This is achieved through the implementation of various design features: for example, the AMD Am29000 implements a Reduced Instruction Set Computer (RISC) architecture incorporating a cache memory, instruction pipeline, and reduced instruction set in order to improve its run-time processing capability.

The following sections of this chapter consider design features which can be incorporated into the microprocessor architecture to facilitate the self-detection of erroneous behaviour. In particular, design features are proposed which facilitate detection of the erroneous execution modelled in Chapter 3, and the hardware representation of the software implemented fault tolerant technique presented in Chapter 5.

9.5. Instruction Set Architecture

All programs which are executable on a target microprocessor consist of a sequence of machine instructions. Typical instruction formats contain the following basic elements:

- an opcode specifying the operation,
- addressing mode specification for each of the input operands and result, and
- addressing mode data, e.g. immediate data for direct addressing.

An instruction's addressing mode is designated by either the opcode or a reserved tag within the operand. The Motorola 68000 processor family defines the addressing mode of instruction operands within the opcode, whilst the Intel 8086 processor implements a tag within the operand bit format which is decoded to ascertain the operand's addressing mode.
The architecture and organization of the processor instruction set can be manipulated to enhance the inherent detection capability of erroneous execution by the microprocessor. Techniques considered within this section involve the instruction set mix, the opcode map, and the instruction operand requirements.

9.5.1. Instruction Set Mix

The selection of instruction operations for inclusion in an instruction set is the subject of much research [Tanenbaum, 1990]. Instructions can be separated into two broad groups: general purpose and specialized instructions. General purpose instructions include data movement operations that are needed by almost every application. Specialized instructions are only useful in specific applications, e.g. the Motorola 68000 instruction MOVEP takes the content of the D register and stores it in alternate bytes and is intended for communication with specific 8-bit peripheral devices.

Studies of instruction set usage have led to the development of Reduced Instruction Set Computers (RISC) such as the Advanced Micro Device Am29000 processor. These processors incorporate only the most used and general purpose instructions unlike Complex Instruction Set Computers (CISC) which have large instruction sets that are often extended when the processor is upgraded through an upwardly compatible revision. Within the RISC processor, particular tasks that can be achieved by a single sophisticated CISC instruction may require several of the more basic instructions provided in its instruction set. Although the application software for a particular task is larger on a RISC compared to a CISC processor, its simpler data path processing (partially facilitated by the RISC instruction set) improves overall performance.

The instruction mix analysis of Chapter 4 which is used to evaluate the model of erroneous microprocessor behaviour highlights the advantages and disadvantages associated with the inclusion of restart and stop/wait outcome generating operations within the processor instruction set respectively. Stop/wait outcome instructions initiate catastrophic failure during erroneous execution because all independent operation is lost, and restart outcome instructions provide a controlled route for erroneous
execution to a recovery routine which restores processor integrity. Obviously the number of stop/wait outcome generating instructions in the instruction set should be minimized, although the degree of their influence on erroneous execution is dependent on their activation rather than their occurrence in an instruction mix.

9.5.2. Opcode Maps

An instruction opcode is designed to have sufficient bits to identify all facilitated unique operations. The number of bits, $n$, used to specify an opcode format should be an integral multiple, $m$, of the processors memory element length $b$ (usually a byte) in order to simplify data flow processing. Additionally, the number of bits specified for an opcode should be kept as small as possible in order to reduce the memory requirement of software. Hence opcodes which specify $i$ instruction operations are designed with a bit length $n$ where,

$$i \leq 2^n. \quad (9.1.)$$

Re-arranging gives,

$$n \geq \frac{\log i}{\log 2}, \quad (9.2.)$$

$$n \geq \log_2 i, \quad (9.3.)$$

and,

$$n = m.b. \quad (9.4.)$$

The specified opcodes form an instruction set, and the range of opcode formats is usually referred to as the processors opcode map.

Equation (9.1.) notes that there may be some redundancy in the opcode map. This redundancy is acceptable in many architectures because the processing benefits of fixed-length instruction implementation within the overall processor design outweigh the sacrifice of larger average code size [Hennessy & Patterson, 1990].
All redundant opcode formats in the opcode map should be defined to generate a restart outcome. In this manner, execution of an undeclared operation, which is synonymous with erroneous microprocessor behaviour, is detected.

The opcode map can also be designed to eliminate the hazard of instruction operands associated with a local displacement by a relative branch. Program flow typically follows the 'principle of locality', well known by memory systems' designers, and can be explained by using a rule of thumb:

'A program executes about 90% of its instructions in about 10% of its code'.

The implication is that the majority of program flow involves branches whose target destination is in the local vicinity to the location of the generating instruction in the program code. Hennessy & Patterson [1990] describe the analysis of benchmark programs which support this assertion.

Many processor architectures define local branch displacement information to be contained with a branch instruction operands. These operands are susceptible to interpretation as an opcode during periods of erroneous execution. It is therefore proposed that the processor instruction set reserves areas representing low and high values in the opcode map for restart generating instructions, see Figure 9.1. Such opcode map organization ensures that local branch instruction operands interpreted as an opcode generate a restart outcome and hence detect erroneous execution.

A particular hazard associated with branch locality is observed within the Motorola 6805 microprocessor instruction set. Here, branch instructions specify their displacement in the attached operands. Low order relative displacements for forward branch instructions which specify byte operands with the most significant four bits set to '0000' or '0010' are potential invalid branches.

The Motorola 68000 microprocessor reserves high value opcode formats for future upwardly compatible processor extensions. Execution of a memory element with such an opcode format results in a restart outcome via a software exception. This attribute
The unused opcode bit formats represent high and low values. These opcodes can be set to generate a restart outcome. Relative branch instructions specify a local displacement. Operands containing such displacements which are erroneously interpreted as an opcode will now generate a restart outcome and hence detect erroneous execution.
of the Motorola 68000 instruction set facilitates a detection capability where relative branch instructions specify an operand to contain a local backward displacement.

9.5.3. Operand Requirements and Specification

Operands have a bit size equivalent to an opcode in order to preserve the memory and data path organization associated with opcode processing. Microprocessor architects aim to keep the number of operands as small as possible in order to reduce instruction decoding and hence improve performance [Ciminiera & Valenzano, 1987]. Additionally, implementing fewer operands reduces the memory requirement of a program. The number of operands required depends on the amount of data to be processed and the addressing mode used.

The code extension required by the application of the software implemented fault tolerant techniques proposed in Chapter 5 is largely influenced by the operand requirements of instructions within the processor instruction set. Particular influence is observed where default size mechanisms are inserted with target software; the default size is equivalent to the maximum number of operands required by an instruction in the instruction set. Optimum size detection mechanisms may require less additional code, depending on the placement conditions at each insertion. Further details of detection mechanism construction can be found in Chapter 5. In order to reduce the code extension required when applying the software implemented fault tolerant technique proposed in Chapter 5, the instruction architecture should specify instructions to have fewer operands.

Register oriented addressing modes can be encouraged to reduce the operand requirements of an instruction. Such addressing modes remove the necessity for operand specification of absolute addresses or immediate data. Details of the registers to be processed can be specified within the opcode bit format. Where this information cannot be incorporated within the opcode bit format, a single operand can be specified to contain the register allocation information. The bit format of the operand is designated to represent a restart generating opcode. In many instances a single operand will be shorter than the number of operands required to represent an absolute address.
9.6. Input/Output Communication Ports

Chapter 5 discusses two attributes of a microprocessor's architecture which facilitate an erroneous execution detection capability for the reserved input/output port locations in the address space. These attributes can be incorporated into the design specification of a processor.

Firstly, instructions that communicate with external devices may specify access to an input/output port within an operand. In particular, individual instruction operands may be set to contain the whole or partial address of the communication port. The location of these communication ports in the memory map can be defined to take bit formats which represent restart generating instructions in the opcode map. Hence, erroneous execution of an input/output port address as an opcode results in detection and recovery can be initiated.

Secondly, erroneous execution may itself interpret an input/output port location as an opcode. The microprocessor design can incorporate the hardwiring of particular bits in the input/output port such that the bit format represents a restart instruction, as Figure 9.2. Erroneous execution of this location as an opcode is self-detected. This method, of course, incurs an overhead in that the data transfer capability to an external device is reduced because of the redundant bits reserved for erroneous execution detection.

9.7. Memory Organization

This section discusses techniques which involve the organization and implementation of memory used by a microprocessor.

9.7.1. Memory Alignment

Some microprocessors require elements (e.g. byte, double-byte, or quad-byte) in memory to be aligned according to the memory organization. A memory element of size $s$ bytes resident at location $A_d$ in the address space is aligned when,

$$[A_d \mod s] = 0 \quad (9.5.)$$
Figure 9.2: Input/Output Location Content
Hence, byte memory elements in a byte orientated memory organization will always be aligned. Byte memory elements in a double-byte orientated memory are aligned on even byte boundary addresses, but mis-aligned on odd byte boundary addresses. Mis-aligned element access specified by an instruction requires multiple physical memory accesses, whilst aligned memory access requires only one physical memory access by the microprocessor hardware. From a performance perspective, therefore, elements of memory should be aligned rather than mis-aligned.

The memory organization employed in a microprocessor architecture has particular implications for the software implemented fault tolerant techniques proposed in this thesis. Those processors that implement a byte memory organization for the program area allow any instruction operand to be processed erroneously as an opcode. Hence erroneous execution in the program area is not hindered by the memory organization. Other means of memory organizations allow the possibility of mis-aligned opcode access. Mis-aligned opcode access is a characteristic of erroneous behaviour. Fault tolerant techniques can be implemented to detect this error.

The Advanced Micro Device Am29000 microprocessor and the Motorola 68000 microprocessor family implement a similar approach to mis-aligned opcode access for their respective quad-byte and double-byte memory organization. Access to mis-aligned double-byte elements in memory within the Motorola 68000 microprocessor architecture generates an exception which, naturally, is called the 'odd byte address' exception. This method of imposing a double-byte organization in memory to improve the operational performance of the processor, also makes possible a detection capability to prevent erroneous execution. The Advanced Micro Device Am29000 processor has a similar function. Its exception generation can, however, be disabled by a special status register. Enabling the mis-alignment exception results in the least two significant bits of the program counter being masked and hence an opcode address is generated. Erroneous execution is, therefore, re-synchronized.

The design feature described in this section can be incorporated into the architecture of a microprocessor. Digital circuitry can generate a restart outcome when
it detects mis-aligned memory access. Such an outcome can be used to initiate a recovery procedure and hence improve the reliability of the microprocessor system.

9.7.2. Defining Memory Utilization

The address space of a microprocessor can be divided according to its functional allocation. Chapter 5 initially divides the address space into used and unused areas discusses the implications of the division with respect to erroneous behaviour. Moreover, the said chapter describes the digital implementation of a hardware unit called an Access Guardian which detects the unused area access characteristic of erroneous behaviour. The design of the Access Guardian is not complex and can be incorporated into the architecture of a microprocessor.

There, nevertheless, remains the problem of detecting erroneous execution in the used area of the address space. Glaser & Masson [1982] present a hardware unit referred to as a ‘SAFE ROM’ which has been implemented within a microprocessor system by Li et al [1984]. The SAFE ROM is a one-bit wide memory which is attached to each memory element (in this case byte) of Read Only Memory (ROM) to signify its usage as either opcode or operand, see Figure 9.3. Invalid opcode address access is identified by detection circuitry which determines whether or not the location in question has its SAFE ROM bit set to opcode or operand.

A similar approach is proposed here which can be applied to all address space locations regardless of their implementation in either Read Only Memory or Random Access Memory (RAM). Instead of implementing additional memory, a bit of each memory element is reserved to perform the SAFE ROM function. Hence, there is not a memory overhead as such but rather memory redundancy associated with the reserved bit, see Figure 9.4. This method provides detection of erroneous execution for all physically implemented memory. The memory redundancy incurred by applying this technique can be calculated, thus:

\[
\text{Memory Redundancy (\%)} = \frac{1}{n},
\]

where the opcode memory element has \(n\) bits.
### Figure 9.3: SAFE ROM [Glaser & Masson, 1982]

<table>
<thead>
<tr>
<th>Usage Bit</th>
<th>Memory Element Data Bits</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>1 1 1 0 1 0 0 1</td>
</tr>
<tr>
<td>1</td>
<td>0 0 0 0 1 0 1 0</td>
</tr>
<tr>
<td>1</td>
<td>1 1 1 0 0 0 0 1</td>
</tr>
<tr>
<td>0</td>
<td>1 0 0 1 0 1 1 1</td>
</tr>
<tr>
<td>1</td>
<td>0 1 1 0 1 1 0 0</td>
</tr>
<tr>
<td>1</td>
<td>0 0 1 0 1 1 1 1</td>
</tr>
<tr>
<td>0</td>
<td>0 1 1 1 0 1 0 1</td>
</tr>
<tr>
<td>1</td>
<td>1 0 1 0 1 0 0 0</td>
</tr>
</tbody>
</table>

**Usage Bit**
- **Usage Bit = 1**: Defines Opcode
- **Usage Bit = 0**: Defines Operand

### Figure 9.4: Memory with Utilisation Assignment

<table>
<thead>
<tr>
<th>Usage Bit</th>
<th>Data Bits Free for Information</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0 1 0 1 0 0 0</td>
</tr>
<tr>
<td>1</td>
<td>1 0 0 1 0 1 1</td>
</tr>
<tr>
<td>0</td>
<td>1 1 0 1 0 0 1</td>
</tr>
<tr>
<td>0</td>
<td>0 0 1 0 1 0 1</td>
</tr>
<tr>
<td>1</td>
<td>0 1 0 1 1 1 0</td>
</tr>
<tr>
<td>1</td>
<td>1 0 1 0 1 0 1</td>
</tr>
<tr>
<td>1</td>
<td>0 1 0 0 1 0 1</td>
</tr>
</tbody>
</table>

**Usage Bit**
- **Usage Bit = 1**: Defines Opcode
- **Usage Bit = 0**: Defines Operand
The memory redundancy associated with this technique may be considered significant for some processor applications. For example, microprocessors whose memories are organized as byte or double-byte structures have a 12.5% and 6.25% redundancy in their memory respectively. Nevertheless, the technique is extremely effective, providing detection of erroneous execution except in circumstances involving re-synchronization. Detection capability is important for systems requiring high reliability and in these applications the memory redundancy is expected to be acceptable.

9.8. Monitoring Branch Activity

The architecture of a microprocessor can be extended to incorporate design features facilitating the recognition of branch operations similar to that provided by the Motorola 68030 microprocessor. Such a facility could be used to activate special circuitry dedicated to determining whether an invalid or valid branch is being processed.

Two methods of determining invalid branch activity are suggested here. Firstly, the software implemented fault tolerant technique proposed in Chapter 5 can be incorporated into the microprocessor architecture. The required digital circuitry is based on a logical AND function using, as inputs, the recognition of branch activity and the opcode 'usage' bit proposed in the previous section. The outcome is detection of all invalid branches regardless of whether or not their destination leads to the re-synchronization of erroneous execution.

The second technique is based on verifying branch activity, and is commonly referred to as 'signature analysis'. The theory supporting the technique is described in Chapter 2. Schuette & Shen [1986] have implemented the technique using additional digital circuitry in a Motorola 68000 microprocessor based system, which incurred a 17% gate overhead compared with the number of gates in the processor. More recently, the technique has been incorporated within the architecture of development processors where it had a 13% chip area overhead [Leveugle et al, 1990]. The technique detects erroneous execution by failing to verify the occurrence of a valid branch.
On identifying an invalid branch erroneous execution is detected and appropriate hardwired recovery action can be initiated. Although the techniques described above are only activated when branch activity is recognized, they do detect re-synchronized erroneous execution. This contrasts with the technique based on defining memory utilization which is activated more regularly because of its on-line monitoring process, but which cannot detect re-synchronized erroneous execution.

9.9. Conclusion

Design features which can be incorporated into the architecture of a microprocessor to provide a self-detection capability for erroneous behaviour have been proposed. These include suggestions for the instruction set architecture and memory organization. The benefit of the inclusion of these techniques in the microprocessor hardware (enhanced detection capability) are dependent on the particular instruction sequences of erroneous execution for a target processor system. Nevertheless, capability checks are being included into microprocessor designs, notably the mis-aligned opcode address exception. Further capability checks, including those proposed in this chapter, may be implemented within commercial microprocessor designs in the future.
Chapter 10

CONCLUSION

10.1. Microprocessor Controllers for Industrial Applications ............ 196
10.2. Reliable Microprocessor Controllers ................................. 196
10.3. Modelling Erroneous Microprocessor Behaviour .................... 197
10.4. Detecting Erroneous Microprocessor Execution ..................... 199
10.5. Evaluating Fault Tolerance ........................................... 201
10.6. Generating Non-Hazardous Software ................................. 201
10.7. Microprocessor Design for Fault Tolerance ......................... 202
10.8. Summary .................................................................... 202
10.1. Microprocessor Controllers For Industrial Applications

In recent years it has become popular practice to implement industrial control systems using digital circuitry with an embedded microprocessor rather than analogue systems. Microprocessor controllers provide a flexible design approach, the nature of their operation being easily tailored to particular tasks through the alteration of control software. Digital systems, however, are more susceptible to transient disturbances, common within industrial environments, than similar analogue systems. Analogue systems tend to pass the effects of a transient disturbance as a temporary processing discrepancy, whilst digital systems, because of their discrete state nature, can have their operation disrupted. This thesis addresses the problem of improving the reliability of low budget microprocessor systems where fault masking is considered too expensive.

10.2. Reliable Microprocessor Controllers

The failure process of a digital system involves the manifestation of a logic fault, its activation as an error, and finally error spawn until a fatal operating condition is generated. Faults can be manifested as either temporary or permanent logic corruption. Temporary faults have been observed to cause a significant proportion of microprocessor system failures. Studies, reviewed in Chapter 2, suggest that in excess of 90% of processor failures are generated by temporary faults rather than permanent faults.

Temporary faults, attributed to disturbances in the operating environment of a microprocessor based controller, are called transient faults. Environmental disturbances can involve electro-magnetic interference (EMI), electro-static discharge (ESD), electronic noise, ionizing radiation, and power supply fluctuations. The operating environment can often be harsh. In order to achieve high reliability,
microprocessor controllers can implement fault tolerance. Fault tolerant techniques implement four tasks in sequence:

- error detection,
- damage assessment,
- error recovery, and
- system restoration.

Although this thesis focuses on error detection, the remaining fault tolerant tasks are equally important and should be considered when implementing a fault tolerant system.

Additional circuitry to detect individual logic faults can be prohibitive within a low budget microprocessor architecture. An alternative technique for detecting logic faults is based on the premise that logic faults are complemented by processing errors. The technique is referred to as functional fault tolerance because it relies on distinguishing between valid and invalid characteristics of microprocessor execution.

Functional fault tolerance must ensure that erroneous microprocessor execution does not generate a fatal error and hence catastrophic failure. To achieve this, detection techniques can be applied to recognize different attributes of erroneous execution. These techniques are collectively known as 'capability checks'. In order to design and evaluate the effectiveness of capability checks it is necessary to model erroneous microprocessor behaviour.

10.3. Modelling Erroneous Microprocessor Behaviour

Erroneous microprocessor behaviour involves either erroneous data or program flow within executing software. The 'reasonableness' of data flow can be checked by the operating software. Erroneous program flow, however, cannot be verified in this manner because predictable operation of the software is lost. This has serious implications for industrial applications where microprocessors are responsible for the monitoring, process, or control of equipment. Unpredictable processor action can command equipment to malfunction, the hazard of this situation being dependent on the equipment's task.
A model has been developed which investigates the program flow associated with erroneous microprocessor behaviour. Erroneous execution is defined to be initiated by a temporary fault generating an Initial Erroneous Jump (IEJ) through corruption of the processor's program counter. Ensuing erroneous microprocessor behaviour is characterized by periods of linear erroneous execution interspersed with further erroneous jumps called Subsequent Erroneous Jumps (SEJs). In addition, the model allows consideration of particular processing outcomes associated with catastrophic failure and recovery. Recovery is achieved through the processing of an instruction developing a restart outcome, which directs execution to a pre-defined location in memory where a recovery routine resides. The recovery routine is programmed to fulfil the restoration requirements of the application software, two possible recovery strategies are reset and roll-back.

The model of erroneous behaviour is applied to a selection of 8, 16, and 32-bit processors. The following microprocessors are assessed: (8-bit) Motorola 6800, Intel 8048, and Intel 8085; (16-bit) Intel 8086, Motorola 68000, and Motorola 68010; (32-bit) Advanced Micro Device Am29000, Motorola 68020, and Intel 80386. All processors are assumed to have a random content address space. Erroneous execution is evaluated using instruction mix analysis to predict the mean expected operation.

The character of erroneous jumps is studied. The model assumes that IEJs have a random target in the address space. Such erroneous jumps have particular significance in relation to microprocessor reliability when their destination is in the unused area whose code attributes are unknown. Where an IEJ directs erroneous execution to the used area, further erroneous jumps (SEJs) occur as a result of invalid software processing. The model suggests that SEJs typically generate a local target and hence erroneous execution initiated within the used area is likely to remain there for a significant period. This is a hazardous situation because not only is the processor out of control but its erroneous activity could be mutilating system integrity, making restoration of the microprocessor system more difficult.

A Markov process is used to predict the reliability of a microprocessor. In particular, the Mean Time To Failure (MTTF) parameter is used because it has more intu-
itive meaning to reliability engineers. The instruction mix analysis for the selection of microprocessors described above suggests that both the instruction distribution and the processor architecture can have a significant influence on reliability. The Motorola 68000 microprocessor family have the highest reliability with an MTTF three times the mean inter-arrival time of events that initiate erroneous execution. These processors have instruction sets which define the execution of an undeclared opcode and mis-aligned memory access to generate a restart outcome. The Advanced Micro Device Am29000 processor has an MTTF prediction twice that of the mean inter-arrival event period which is due solely to undeclared opcodes generating a restart outcome. The remaining microprocessor reliability models predict a MTTF of similar magnitude to the mean inter-arrival event period. These processors do not define their undeclared opcodes to have a restart outcome, and implement a byte memory organization and hence mis-aligned memory access is impossible.

The availability of a microprocessor system is dependent on the detection of an error and time taken to restore the system integrity. A general model is presented: the influences on availability are the mean inter-arrival event period, the processor operational frequency, and the size of the routine required to restore system integrity. The latter two factors are processor and application dependent. Microprocessor architectures that require little restoration activity and simple application software can reduce the size of the recovery routine. These, together with high processor operating speeds, facilitate higher availability.

10.4. Detecting Erroneous Microprocessor Execution

This thesis proposes a new capability check for detecting erroneous microprocessor execution. The technique is based on software implemented fault tolerance.

The aim of the technique is to identify, through static analysis, the potential targets of erroneous jumps, referred to as invalid branches, and to place at these locations a software detection mechanism which is activated by the erroneous execution. Invalid branches are unsynchronized erroneous jumps and should not be confused with synchronized erroneous jumps. Interpretation of a memory location containing an
opcode is termed 'synchronized', whilst interpretation of any other content is termed 'unsynchronized'.

Different approaches are required when applying software implemented fault tolerance to the used and unused areas of the address space respectively. The first approach considers the unused area, which can consist of physical and non-existent memory. Physical memory is filled with restart generating instructions which do not have an operand requirement. In this way, erroneous execution at any location will develop a restart outcome and hence detection of erroneous behaviour. Non-existent memory requires a hardware solution, and hence a unit called an Access Guardian is designed which detects memory access by monitoring the processor address bus. The complexity of the Access Guardian is dependent on the contiguity of non-existent memory locations, and whether or not the processor has a multiplexed address/data bus such as the Intel 8086 microprocessor.

Secondly, within the used area, detection mechanisms are inserted within the software at invalid branch targets. Some manipulation of the application software may be necessary so that the its function is not disturbed by the placement of detection mechanisms. Construction principles for the detection mechanisms and an algorithm for their placement within the application program are presented. An important limitation of the technique is that of placement deadlock. This describes a situation where an invalid branch destination cannot be covered by a detection mechanism placement because the generator and destination of the invalid branch are both within the same instruction, or the destination resides at the location immediately following the generating instruction. The former can generate a catastrophic failure if an infinite execution loop is created.

A software tool, called the Post-programming Automated Recovery UTility (PARUT), has been developed as a prototype in order to assess the capability of the software implemented fault tolerant technique and to assess the feasibility of developing a standard software tool to apply the technique. The structure and organization of the prototype is described. The tool is designed to be robust, capable of generating enhanced program code for a variety of target processors. The future of
PARUT appears to be its incorporation, as an processing option, within a translator. This is because PARUT uses much of the information inherently required within the translation process, and its phase of activity immediately follows translation.

10.5. Evaluating Fault Tolerance

The dynamics of erroneous execution can only be evaluated through instruction sequence analysis which involves tracing the execution attributed with each initiated period of erroneous behaviour.

The effectiveness of the software implemented fault tolerant technique proposed in this thesis is evaluated by investigating the erroneous behaviour of application software before and after application of the technique. Erroneous microprocessor behaviour is investigated using fault emulation and fault injection experiments. The fault injection experiment involves physically inserting faults on the the address bus, data bus, and program counter during instruction and data cycles. The fault emulation experiment involved inserting faults within a register model of a processor. Almost 4000 faults are investigated for a selection of three programs, each with a different application processor.

Improved performance is observed in the processor systems when they employ the software implemented fault tolerant technique. The degree of improvement is related to the number of detection mechanism placements in the application software. The effectiveness of the software technique is clearly demonstrated. The memory overhead and performance of the software technique is, however, application specific. Within the example programs, the application of the technique required an approximate software extension of 20% to 30% for 16-bit and 32-bit microprocessors.

10.6. Generating Non-Hazardous Software

Many microprocessor systems utilize application code written in a high level language which is independent of the target processor architecture, e.g. the programming language C. In these instances a translator is used to convert the source code through levels of abstraction to the object code. This process can be influenced so that target
code is produced without the hazards associated with catastrophic failure, and with a high inherent detection capability against erroneous execution.

Five tasks of the translation process are identified which can influence the production of reliable code. Firstly, opcodes with more than one addressable memory element can be hazardous on a mis-aligned opcode access. The selection of opcodes during translation should avoid identified hazardous opcodes, their function being implemented by other equivalent instructions. Secondly, the addressing mode selected for an instruction should not generate hazardous operands. Thirdly, macros used to implement high level language constructs should not incorporate instruction sequences with a hazard. Fourthly, peephole optimization should not create new code hazards. Finally, address space allocation for the object code should not introduce hazards, for instance through relative address operands.

These translator proposals have not been implemented. An alternative method of generating non-hazardous software is to design a microprocessor architecture which inherently defines code with a detection capability against erroneous execution.

10.7. Microprocessor Design for Fault Tolerance

Fault tolerant techniques implemented as additional hardware circuitry, with or without software manipulation, can be incorporated within the architecture of a microprocessor. Many modern microprocessors incorporate a mis-aligned memory access exception, and other prototype processors implement signature analysis. The techniques proposed in this thesis concerning software implemented fault tolerance can also be embedded within the design of a microprocessor. These techniques use an Access Guardian to detect all unused area access, and software detection mechanisms to detect invalid branches regardless of their re-synchronization. Additional techniques involve influencing the design of the processor instruction opcode map and operand requirements, reserved input/output ports, and memory organization.

10.8. Summary

Microprocessor systems are incorporated within many industrial control systems. Such applications are often required to be highly reliable. Working environments can,
however, be harsh and microprocessor systems are prone to disruption from transient disturbances. It is therefore necessary to apply fault tolerance to the microprocessor system in order to improve its reliability. The sophistication of the fault tolerance may be limited by budget constraints which prevent fault masking.

The solution is the application of capability checks within a uniprocessor controller. The capability checks identify particular characteristics of erroneous processor behaviour and initiate recovery. This thesis models erroneous microprocessor behaviour and proposes a new low-cost software-implemented capability check involving the recognition of invalid branches. The effectiveness of applying the capability check is demonstrated; however, a general assessment cannot be made because the techniques action is application specific. The error detection capability can be further improved by strategically selecting several capability checks for collective application. It should be realized that these techniques cannot guarantee enhanced reliability because they are reliant on particular attributes of erroneous execution being exhibited. It is pertinent for the reliability engineer to incorporate a back-up detection facility into the system design, such as a watchdog timer as well as a fail-safe action, in order to prevent unpredicted failure modes causing a catastrophic outcome.
Bibliography

and

References
Bibliography

[British Telecom, 1986]

[Fontaine & Barrand, 1989]

[Intel, 1982]

[Intel, 1988]

[Johnson, 1989]

[Klaassen & van Peppen, 1989]

[Lewis, 1987]

[Motorola, 1983]

[Motorola, 1984]

[Motorola, 1987]
References

[Aho & Ullman, 1977]

[Amerasekera & Campbell, 1987]

[Anderson & Lee, 1982]

[Arlat et al, 1990]

[Armstrong & Devlin, 1981]

[Avizienis, 1976]

[Ball & Hardie, 1969]

[Bennetts, 1979]

[Bhar & McMahon, 1983]

[Blough & Masson, 1987]

[Bourne, 1982]

[Carpenter, 1989]
[Carter, 1985]

[Chakraborty & Ghosh, 1988]

[Chen & Avizienis, 1978]

[Chillarege & Bowen, 1989]

[Choi et al, 1989]

[Ciminiera & Valezmzano, 1987]

[Clark et al, 1987]

[Connet et al, 1972]

[Cortes et al, 1986]

[Crouzet & Decouty, 1982]

[Cullyer, 1988]

[Cusick et al, 1985]
[Czech & Siewiorek, 1990]

[Damm, 1988]

[Dijkstra, 1972]

[Duba & Iyer, 1988]

[Eckhardt & Lee, 1988]

[Eifert & Schuette, 1984]

[Ferrara et al, 1989]

[Freer, 1987]

[Fujimura, 1989]

[Fuller & Harbison, 1978]

[Glaser & Masson, 1982]

[Gunneflo et al, 1989]
[Halse, 1984]

[Halse & Preece, 1985]

[Halse & Preece, 1987]

[Hennessy & Patterson, 1990]

[Hitt & Webb, 1985]

[Horowitz & Hill, 1986]

[IBM, 1986]

[IEEE, 1989]

[IEEE, 1990]

[Iyer & Rossetti, 1982]

[Iyer & Rossetti, 1986]

[Kim & Iyer, 1988]

[Kopetz, 1982]
[Lala, 1983]

[Lala, 1985]

[Leveugle et al, 1990]

[Li et al, 1984]

[Lin, 1988]

[Lin & Siewiorek, 1990]

[Littlewood, 1981]

[Liu, 1982]

[Liu & Whalen, 1988]

[Lomelino & Iyer, 1986]

[Lu, 1980]

[Lu, 1982]
[Madeira et al, 1990]

[Mahmood & McCluskey, 1985]

[Mahmood & McCluskey, 1988]

[Mahmood et al, 1983]

[Marchal & Courtois, 1982]

[McCluskey & Wakerly, 1981]

[McConnel, 1981]

[McConnel & Siewiorek, 1978]

[McGough & Swern, 1981]

[McGough & Swern, 1983]

[Millman, 1979]

[Morganti et al, 1978]
[Musa, 1975]

[Musa et al, 1987]

[Namjoo, 1982]

[Namjoo, 1983]

[Namjoo & McCluskey, 1982]

[O'Connor, 1985]

[Ornstein et al, 1975]

[Pearson, 1983]

[Randall, 1975]

[Rigby & Norris, 1990]

[Russel & Sayers, 1989]

[Saxena & McCluskey, 1990]

[Schmid et al, 1982]
[Schuette & Shen, 1986]

[Segall et al, 1988]

[Shen & Schuette, 1983]

[Shoji, 1987]

[Siewiorek & Swarz, 1982]

[Siewiorek et al, 1978a]

[Siewiorek et al, 1978b]

[Sommerville, 1985]

[Sosnowski, 1986a]

[Sosnowski, 1986b]

[Stiffler, 1980]

[Tanenbaum, 1990]
[Taylor et al, 1980]

[Trachtenberg, 1990]

[Wilken & Shen, 1987]

[Wingate & Preece, 1991] {Copy in Appendix E}

[Woodbury & Shin, 1990]

[Wynne et al, 1988]

[Yang et al, 1985]

[Yau & Chen, 1980]
Appendix A

INSTRUCTION SET PARAMETERS

A.1. Introduction ................................................................. 214
A.2. Instructions Influencing Program Flow ................................. 3214
A.3. Microprocessor Jump Type Instruction Data ......................... 215
A.1. Introduction

The model of erroneous microprocessor behaviour described in Chapter 3 is based on the instruction mix of the processor software. The evaluation of the model for a selection of 8, 16, and 32-bit microprocessors, presented in Chapter 4, requires the mix of their respective instruction sets. This appendix contains details of the instruction set parameters used in Chapter 4. The microprocessors considered are: (8-bit) MC 6800, Intel 8048, and Intel 8085; (16-bit) Intel 8086, MC 68000, and MC 68010; (32-bit) AMD 29000-D, MC 68020, and Intel 80386. The notations MC and AMD specify 'Motorola Corporation' and 'Advanced Micro Devices' respectively.

Data for the 8-bit processors is taken from Halse [1984]. Data parameters for the 16 and 32-bit microprocessors has been evaluated from appropriate manufacturers' manuals listed in the bibliography.

A.2. Instructions Influencing Program Flow

Program flow through software is determined by the content of the microprocessor program counter. The instruction set of a microprocessor contains three types of instruction, each affecting the program counter content in a different way. Firstly, 'non-jump' instructions are classified as those instructions that perform some operation and increment the program counter to the next logical instruction location in the address space. Secondly, 'unconditional jump' instructions specify the program counter to contain the next sequential instruction address which may not be the next logical instruction in memory. Finally, 'conditional jump' instructions test a specified parameter value and, if successful, generate a branch like the unconditional jump instruction. If the test is unsuccessful then the program counter increments to the next logical instruction address like the non-jump instruction.
Microprocessor architectures implementing a ROM opcode decoder generate operations for all possible instruction opcode formats, i.e., \(2^n\) instructions where the opcode has \(n\) bits. ROM decoders often have some redundancy, for instance the MC 68000, MC 68010, MC 68020, where \(n = 16\) have more ROM opcode values than implemented operations and hence there is a large number of undeclared instructions. The AMD Am29000 is a Reduced Instruction Set Computer (RISC) and implements a ROM decoder for opcodes of 8 bits leaving fewer unused opcode formats (undeclared instructions). Processors not implementing a ROM decoder, directly interpret the opcode through digital circuitry (examples include the Intel 8048, Intel 8085, Intel 8086, and Intel 80386). These microprocessors have all their instructions hardwire defined, although some instructions may appear undeclared because the manufacture withholds information. Undeclared instructions can be non-jumps, conditional jumps, or unconditional jumps, and knowledge of these instructions may be of benefit to the programmer. Tables A.1. to A.9. summarise details of the instruction sets for a variety of processors.

A.3. Microprocessor Jump Type Instruction Data

A fundamental characteristic associated with the program flow during erroneous microprocessor behaviour is the 'erroneous jump'. Jump type instructions can be categorised in terms of the nature of their branch operation. Chapter 3 defines the following categories of jump instruction:

- **Restart (RT)**: Leads to a jump to a predefined location in the address space.
- **Return (RN)**: Leads to a jump to an address held in a stack.
- **Stop/Wait (SW)**: Leads to cessation of processing and requires an interrupt or hardware reset to exit this state.
- **Unspecified Jump (UJ)**: Leads to a jump to a new location in the address space, determined by volatile memory content.

Deriving the distribution of restart, return, stop/wait, and unspecified jump instructions in an instruction set requires evaluation of all undeclared instructions.
Within the selection of microprocessors analysed in this appendix, the MC 68000, MC 68010, AMD Am29000-D, MC 68020, and Intel 80386 specify their undeclared instructions to generate a restart. Investigation has shown that the undeclared instructions in the remaining processors can be restart, return, stop/wait, or unspecified jump [Halse, 1984].

The distribution of different jump type instructions for the MC 6800, Intel 8048, Intel 8085, and Intel 8086 microprocessors are shown in Tables A.10 to A.13, respectively. Undeclared instructions with a jump type operation are identified in the Tables by the mnemonic ‘***’.

The MC 68000, MC 68010, and MC 68020 instruction sets are upwardly compatible; the distribution of their jump type instructions is collated in Table A.14. Similarly, the jump type instruction distributions for the AMD Am29000 and Intel 80386 microprocessors are shown in Table A.15 and Table A.16, respectively.
<table>
<thead>
<tr>
<th>Instr. Type</th>
<th>Declared</th>
<th>Undeclared</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Non-Jump</td>
<td>173</td>
<td>49</td>
<td>222</td>
</tr>
<tr>
<td>Cond. Jump</td>
<td>14</td>
<td>1</td>
<td>15</td>
</tr>
<tr>
<td>Uncond. Jump</td>
<td>10</td>
<td>9</td>
<td>19</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td>197</td>
<td>59</td>
<td>256</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Instr. Type</th>
<th>Declared</th>
<th>Undeclared</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Non-Jump</td>
<td>183</td>
<td>22</td>
<td>205</td>
</tr>
<tr>
<td>Cond. Jump</td>
<td>20</td>
<td>3</td>
<td>23</td>
</tr>
<tr>
<td>Uncond. Jump</td>
<td>27</td>
<td>1</td>
<td>28</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td>230</td>
<td>26</td>
<td>256</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Instr. Type</th>
<th>Declared</th>
<th>Undeclared</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Non-Jump</td>
<td>209</td>
<td>7</td>
<td>216</td>
</tr>
<tr>
<td>Cond. Jump</td>
<td>24</td>
<td>3</td>
<td>27</td>
</tr>
<tr>
<td>Uncond. Jump</td>
<td>13</td>
<td>0</td>
<td>13</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td>246</td>
<td>10</td>
<td>256</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Instr. Type</th>
<th>Declared</th>
<th>Undeclared</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Non-Jump</td>
<td>262</td>
<td>25</td>
<td>287</td>
</tr>
<tr>
<td>Cond. Jump</td>
<td>20</td>
<td>0</td>
<td>20</td>
</tr>
<tr>
<td>Uncond. Jump</td>
<td>27</td>
<td>6</td>
<td>33</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td>309</td>
<td>31</td>
<td>340</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Instr. Type</th>
<th>Declared</th>
<th>Undeclared</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Non-Jump</td>
<td>39039</td>
<td>0</td>
<td>39039</td>
</tr>
<tr>
<td>Cond. Jump</td>
<td>4224</td>
<td>0</td>
<td>4224</td>
</tr>
<tr>
<td>Uncond. Jump</td>
<td>79</td>
<td>22194</td>
<td>22273</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td>43342</td>
<td>22194</td>
<td>65536</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Instr. Type</th>
<th>Declared</th>
<th>Undeclared</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Non-Jump</td>
<td>39217</td>
<td>0</td>
<td>39217</td>
</tr>
<tr>
<td>Cond. Jump</td>
<td>4224</td>
<td>0</td>
<td>4224</td>
</tr>
<tr>
<td>Uncond. Jump</td>
<td>80</td>
<td>22015</td>
<td>22273</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td>43521</td>
<td>22015</td>
<td>65536</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Instr. Type</th>
<th>Declared</th>
<th>Undeclared</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Non-Jump</td>
<td>151</td>
<td>0</td>
<td>151</td>
</tr>
<tr>
<td>Cond. Jump</td>
<td>27</td>
<td>0</td>
<td>27</td>
</tr>
<tr>
<td>Uncond. Jump</td>
<td>8</td>
<td>70</td>
<td>78</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td>186</td>
<td>70</td>
<td>256</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Instr. Type</th>
<th>Declared</th>
<th>Undeclared</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Non-Jump</td>
<td>41623</td>
<td>0</td>
<td>41623</td>
</tr>
<tr>
<td>Cond. Jump</td>
<td>4832</td>
<td>0</td>
<td>4832</td>
</tr>
<tr>
<td>Uncond. Jump</td>
<td>140</td>
<td>18941</td>
<td>19081</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td>46595</td>
<td>18941</td>
<td>65536</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Instr. Type</th>
<th>Declared</th>
<th>Undeclared</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Non-Jump</td>
<td>333</td>
<td>0</td>
<td>333</td>
</tr>
<tr>
<td>Cond. Jump</td>
<td>37</td>
<td>0</td>
<td>37</td>
</tr>
<tr>
<td>Uncond. Jump</td>
<td>21</td>
<td>225</td>
<td>246</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td>391</td>
<td>225</td>
<td>616</td>
</tr>
</tbody>
</table>

**Tables A.1-9. Microprocessor Instruction Set Evaluation**

Instr. = Instruction; Cond. = Conditional; Uncond. = Unconditional
<table>
<thead>
<tr>
<th>Conditional Jumps</th>
<th>Code Mnemonic Type</th>
<th>Code Mnemonic Type</th>
<th>Code Mnemonic Type</th>
<th>Code Mnemonic Type</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>UJ</td>
<td>UJ</td>
<td>UJ</td>
<td>UJ</td>
</tr>
<tr>
<td></td>
<td>E8</td>
<td>E9</td>
<td>EA</td>
<td>EB</td>
</tr>
<tr>
<td></td>
<td>CALL</td>
<td>CALL</td>
<td>CALL</td>
<td>CALL</td>
</tr>
<tr>
<td></td>
<td>DJN</td>
<td>DJN</td>
<td>DJN</td>
<td>DJN</td>
</tr>
<tr>
<td></td>
<td>UJ</td>
<td>UJ</td>
<td>UJ</td>
<td>UJ</td>
</tr>
<tr>
<td></td>
<td>06</td>
<td>12</td>
<td>16</td>
<td>26</td>
</tr>
<tr>
<td></td>
<td>32</td>
<td>36</td>
<td>46</td>
<td>52</td>
</tr>
<tr>
<td></td>
<td>56</td>
<td>66</td>
<td>72</td>
<td>76</td>
</tr>
<tr>
<td></td>
<td>83</td>
<td>93</td>
<td>94</td>
<td>A4</td>
</tr>
<tr>
<td></td>
<td>64</td>
<td>84</td>
<td>96</td>
<td>A6</td>
</tr>
<tr>
<td></td>
<td>74</td>
<td>84</td>
<td>96</td>
<td>B2</td>
</tr>
<tr>
<td></td>
<td>93</td>
<td>94</td>
<td>96</td>
<td>B6</td>
</tr>
<tr>
<td></td>
<td>44</td>
<td>54</td>
<td>64</td>
<td>74</td>
</tr>
<tr>
<td></td>
<td>34</td>
<td>44</td>
<td>54</td>
<td>64</td>
</tr>
<tr>
<td></td>
<td>24</td>
<td>34</td>
<td>44</td>
<td>54</td>
</tr>
<tr>
<td></td>
<td>14</td>
<td>24</td>
<td>34</td>
<td>44</td>
</tr>
<tr>
<td></td>
<td>04</td>
<td>14</td>
<td>24</td>
<td>34</td>
</tr>
<tr>
<td></td>
<td>**</td>
<td>***</td>
<td>**</td>
<td>**</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Conditional Jumps</td>
<td>Code Mnemonic Type</td>
<td>Code Mnemonic Type</td>
<td>Code Mnemonic Type</td>
<td>Code Mnemonic Type</td>
</tr>
<tr>
<td>-------------------</td>
<td>--------------------</td>
<td>--------------------</td>
<td>--------------------</td>
<td>--------------------</td>
</tr>
<tr>
<td></td>
<td>UJ</td>
<td>UJ</td>
<td>UJ</td>
<td>UJ</td>
</tr>
<tr>
<td></td>
<td>21</td>
<td>22</td>
<td>23</td>
<td>24</td>
</tr>
<tr>
<td></td>
<td>3A</td>
<td>3B</td>
<td>3C</td>
<td>3D</td>
</tr>
<tr>
<td></td>
<td>3E</td>
<td>3F</td>
<td>3G</td>
<td>3H</td>
</tr>
<tr>
<td></td>
<td>4A</td>
<td>4B</td>
<td>4C</td>
<td>4D</td>
</tr>
<tr>
<td></td>
<td>5A</td>
<td>5B</td>
<td>5C</td>
<td>5D</td>
</tr>
<tr>
<td></td>
<td>6A</td>
<td>6B</td>
<td>6C</td>
<td>6D</td>
</tr>
<tr>
<td></td>
<td>7A</td>
<td>7B</td>
<td>7C</td>
<td>7D</td>
</tr>
<tr>
<td></td>
<td>8A</td>
<td>8B</td>
<td>8C</td>
<td>8D</td>
</tr>
<tr>
<td></td>
<td>9A</td>
<td>9B</td>
<td>9C</td>
<td>9D</td>
</tr>
<tr>
<td></td>
<td>A2</td>
<td>A3</td>
<td>A4</td>
<td>A5</td>
</tr>
<tr>
<td></td>
<td>B2</td>
<td>B3</td>
<td>B4</td>
<td>B5</td>
</tr>
<tr>
<td></td>
<td>C2</td>
<td>C3</td>
<td>C4</td>
<td>C5</td>
</tr>
<tr>
<td></td>
<td>D2</td>
<td>D3</td>
<td>D4</td>
<td>D5</td>
</tr>
<tr>
<td></td>
<td>E2</td>
<td>E3</td>
<td>E4</td>
<td>E5</td>
</tr>
<tr>
<td></td>
<td>F2</td>
<td>F3</td>
<td>F4</td>
<td>F5</td>
</tr>
<tr>
<td></td>
<td>**</td>
<td>**</td>
<td>**</td>
<td>**</td>
</tr>
</tbody>
</table>

Table A.10: MC 6800 Jump Instructions
Code = Hexadecimal Instruction Opcode

Table A.11: Intel 8048 Jump Instructions
Code = Hexadecimal Instruction Opcode
### Table A.12. : Intel 8085 Jump Instructions

<table>
<thead>
<tr>
<th>Code</th>
<th>Mnemonic</th>
<th>Type</th>
<th>Code</th>
<th>Mnemonic</th>
<th>Type</th>
</tr>
</thead>
<tbody>
<tr>
<td>76</td>
<td>HLT</td>
<td>SW</td>
<td>C0</td>
<td>RNZ</td>
<td>RT</td>
</tr>
<tr>
<td>C3</td>
<td>JMP</td>
<td>UJ</td>
<td>C2</td>
<td>JNZ</td>
<td>UJ</td>
</tr>
<tr>
<td>C7</td>
<td>RST0</td>
<td>RT</td>
<td>C4</td>
<td>CNZ</td>
<td>UJ</td>
</tr>
<tr>
<td>C9</td>
<td>RET</td>
<td>RN</td>
<td>C8</td>
<td>RZ</td>
<td>RT</td>
</tr>
<tr>
<td>CD</td>
<td>CALL</td>
<td>UJ</td>
<td>CA</td>
<td>JZ</td>
<td>UJ</td>
</tr>
<tr>
<td>CF</td>
<td>RST1</td>
<td>RT</td>
<td>CB</td>
<td>***</td>
<td>RT</td>
</tr>
<tr>
<td>D7</td>
<td>RST2</td>
<td>RT</td>
<td>CC</td>
<td>CZ</td>
<td>UJ</td>
</tr>
<tr>
<td>DF</td>
<td>RST3</td>
<td>RT</td>
<td>D0</td>
<td>RNC</td>
<td>RT</td>
</tr>
<tr>
<td>E7</td>
<td>RST4</td>
<td>RT</td>
<td>D2</td>
<td>JNC</td>
<td>UJ</td>
</tr>
<tr>
<td>E9</td>
<td>PCHL</td>
<td>UJ</td>
<td>D4</td>
<td>CNC</td>
<td>UJ</td>
</tr>
<tr>
<td>EF</td>
<td>RST5</td>
<td>RT</td>
<td>D8</td>
<td>RC</td>
<td>RT</td>
</tr>
<tr>
<td>F7</td>
<td>RST6</td>
<td>RT</td>
<td>DA</td>
<td>JC</td>
<td>UJ</td>
</tr>
<tr>
<td>FF</td>
<td>RST7</td>
<td>RT</td>
<td>DC</td>
<td>CC</td>
<td>UJ</td>
</tr>
<tr>
<td>DD</td>
<td>***</td>
<td>UJ</td>
<td>E0</td>
<td>RP0</td>
<td>RN</td>
</tr>
<tr>
<td>E2</td>
<td>JP0</td>
<td>UJ</td>
<td>E4</td>
<td>CP0</td>
<td>UJ</td>
</tr>
<tr>
<td>E8</td>
<td>RPE</td>
<td>RN</td>
<td>EA</td>
<td>JPE</td>
<td>UJ</td>
</tr>
<tr>
<td>EC</td>
<td>CPE</td>
<td>UJ</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

### Table A.13. : Intel 8086 Jump Instructions

<table>
<thead>
<tr>
<th>Code</th>
<th>Mnemonic</th>
<th>Type</th>
<th>Code</th>
<th>Mnemonic</th>
<th>Type</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F</td>
<td>***</td>
<td>RN</td>
<td>EB</td>
<td>JMP</td>
<td>UJ</td>
</tr>
<tr>
<td>2E</td>
<td>SEG</td>
<td>UJ</td>
<td>F0</td>
<td>LOCK</td>
<td>SW</td>
</tr>
<tr>
<td>3E</td>
<td>SEG</td>
<td>UJ</td>
<td>F1</td>
<td>***</td>
<td>UJ</td>
</tr>
<tr>
<td>9A</td>
<td>CALL</td>
<td>UJ</td>
<td>F2</td>
<td>REP</td>
<td>UJ</td>
</tr>
<tr>
<td>9B</td>
<td>WAIT</td>
<td>UJ</td>
<td>F4</td>
<td>HALT</td>
<td>SW</td>
</tr>
<tr>
<td>C0</td>
<td>***</td>
<td>RN</td>
<td>FE[xx10][0xxx]</td>
<td>CALL</td>
<td>UJ</td>
</tr>
<tr>
<td>C1</td>
<td>***</td>
<td>RN</td>
<td>FE[xx10][1xxx]</td>
<td>CALL</td>
<td>UJ</td>
</tr>
<tr>
<td>C2</td>
<td>RET</td>
<td>RN</td>
<td>FE[xx10][0xxx]</td>
<td>JMP</td>
<td>UJ</td>
</tr>
<tr>
<td>C3</td>
<td>RET</td>
<td>RN</td>
<td>FE[xx10][1xxx]</td>
<td>JMP</td>
<td>UJ</td>
</tr>
<tr>
<td>C8</td>
<td>***</td>
<td>UJ</td>
<td>FF[xx01][0xxx]</td>
<td>CALL</td>
<td>UJ</td>
</tr>
<tr>
<td>C9</td>
<td>***</td>
<td>RN</td>
<td>FF[xx01][1xxx]</td>
<td>CALL</td>
<td>UJ</td>
</tr>
<tr>
<td>CA</td>
<td>RET</td>
<td>RN</td>
<td>FF[xx10][0xxx]</td>
<td>JMP</td>
<td>UJ</td>
</tr>
<tr>
<td>CB</td>
<td>RET</td>
<td>RN</td>
<td>FF[xx10][1xxx]</td>
<td>JMP</td>
<td>UJ</td>
</tr>
<tr>
<td>CC</td>
<td>INT</td>
<td>RT</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>CD</td>
<td>INT</td>
<td>RT</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>CF</td>
<td>IRET</td>
<td>RN</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>E2</td>
<td>LOOP</td>
<td>UJ</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>E8</td>
<td>CALL</td>
<td>UJ</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>E9</td>
<td>JMP</td>
<td>UJ</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>EA</td>
<td>JMP</td>
<td>UJ</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

*Code = Hexadecimal Instruction Opcode ; [ ] = Hexadecimal Digit ; x = Further Specification Required*
<table>
<thead>
<tr>
<th>Code</th>
<th>Quantity</th>
<th>Mnemonic</th>
<th>Type</th>
<th>Code</th>
<th>Quantity</th>
<th>Mnemonic</th>
<th>Type</th>
</tr>
</thead>
<tbody>
<tr>
<td>4AFC</td>
<td>1</td>
<td>ILLEGAL</td>
<td>RT</td>
<td>50C[1xx]</td>
<td>8</td>
<td>DBT</td>
<td>UJ</td>
</tr>
<tr>
<td>4E4x</td>
<td>16</td>
<td>TRAP</td>
<td>RT</td>
<td>51C[1xx]</td>
<td>8</td>
<td>DBF</td>
<td>UJ</td>
</tr>
<tr>
<td>4E70</td>
<td>1</td>
<td>RESET</td>
<td>RT</td>
<td>52C[1xx]</td>
<td>8</td>
<td>DBHI</td>
<td>UJ</td>
</tr>
<tr>
<td>4E72</td>
<td>1</td>
<td>STOP</td>
<td>SW</td>
<td>53C[1xx]</td>
<td>8</td>
<td>DBLS</td>
<td>UJ</td>
</tr>
<tr>
<td>4E73</td>
<td>1</td>
<td>RTE</td>
<td>RN</td>
<td>54C[1xx]</td>
<td>8</td>
<td>DBCC</td>
<td>UJ</td>
</tr>
<tr>
<td>4E75</td>
<td>1</td>
<td>RTS</td>
<td>RN</td>
<td>55C[1xx]</td>
<td>8</td>
<td>DBCS</td>
<td>UJ</td>
</tr>
<tr>
<td>4E76</td>
<td>1</td>
<td>TRAPV</td>
<td>RT</td>
<td>56C[1xx]</td>
<td>8</td>
<td>DBNE</td>
<td>UJ</td>
</tr>
<tr>
<td>4E77</td>
<td>1</td>
<td>RTR</td>
<td>RN</td>
<td>57C[1xx]</td>
<td>8</td>
<td>DBEQ</td>
<td>UJ</td>
</tr>
<tr>
<td>4E[10xx]x</td>
<td>28</td>
<td>JSR</td>
<td>UJ</td>
<td>58C[1xx]</td>
<td>8</td>
<td>DBVC</td>
<td>UJ</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>5AC[1xx]</td>
<td>8</td>
<td>DBPL</td>
<td>UJ</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>5BC[1xx]</td>
<td>8</td>
<td>DBMI</td>
<td>UJ</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>5DC[1xx]</td>
<td>8</td>
<td>DBLT</td>
<td>UJ</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>5EC[1xx]</td>
<td>8</td>
<td>DBGE</td>
<td>UJ</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>5DF[1xx]</td>
<td>8</td>
<td>DBGT</td>
<td>UJ</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>5F[1xxx]</td>
<td>48</td>
<td>TRAPcc</td>
<td>RT</td>
</tr>
<tr>
<td>06[1xx]x</td>
<td>52</td>
<td>CALLM</td>
<td>RT</td>
<td>5FC[1xx]</td>
<td>8</td>
<td>DBLE</td>
<td>UJ</td>
</tr>
<tr>
<td>484[1xxx]</td>
<td>8</td>
<td>BKPT</td>
<td>UJ</td>
<td>F[xxx0][1xxx]</td>
<td>448</td>
<td>cpBcc</td>
<td>UJ</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>F[xxx0][4][xxx]</td>
<td>56</td>
<td>cpDBcc</td>
<td>UJ</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>F[xxx0][7][xxx]</td>
<td>56</td>
<td>cpTRAPcc</td>
<td>RT</td>
</tr>
</tbody>
</table>

Table A.14. : MC 68000, MC 68010, and MC68020 Jump Instructions

Code = Hexadecimal Instruction Opcode ; [] = Hexadecimal Digit ; x = Further Specification Required
### Table A.15. : AMD 29000 Jump Instructions

<table>
<thead>
<tr>
<th>Unconditional Jumps</th>
<th>Conditional Jumps</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Code</strong></td>
<td><strong>Mnemonic</strong></td>
</tr>
<tr>
<td>88</td>
<td>IRETINV</td>
</tr>
<tr>
<td>89</td>
<td>HALT</td>
</tr>
<tr>
<td>8C</td>
<td>IRET</td>
</tr>
<tr>
<td>A0</td>
<td>JMP</td>
</tr>
<tr>
<td>A1</td>
<td>JMP</td>
</tr>
<tr>
<td>A8</td>
<td>CALL</td>
</tr>
<tr>
<td>A9</td>
<td>CALL</td>
</tr>
<tr>
<td>C8</td>
<td>CALLI</td>
</tr>
<tr>
<td>58</td>
<td>ASGT</td>
</tr>
<tr>
<td>5A</td>
<td>ASGTU</td>
</tr>
<tr>
<td>5C</td>
<td>ASGE</td>
</tr>
<tr>
<td>5E</td>
<td>ASGEU</td>
</tr>
<tr>
<td>70</td>
<td>ASEQ</td>
</tr>
<tr>
<td>72</td>
<td>ASNEQ</td>
</tr>
<tr>
<td>A4</td>
<td>JMPF</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Unconditional Jumps</th>
<th>Conditional Jumps</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Code</strong></td>
<td><strong>Mnemonic</strong></td>
</tr>
<tr>
<td>2E</td>
<td>SEG</td>
</tr>
<tr>
<td>3E</td>
<td>SEG</td>
</tr>
<tr>
<td>9A</td>
<td>CALL</td>
</tr>
<tr>
<td>9B</td>
<td>WAIT</td>
</tr>
<tr>
<td>C2</td>
<td>RET</td>
</tr>
<tr>
<td>C3</td>
<td>RET</td>
</tr>
<tr>
<td>CA</td>
<td>RET</td>
</tr>
<tr>
<td>CB</td>
<td>RET</td>
</tr>
<tr>
<td>CC</td>
<td>INT</td>
</tr>
<tr>
<td>CD</td>
<td>INT</td>
</tr>
<tr>
<td>CE</td>
<td>INTO</td>
</tr>
<tr>
<td>CF</td>
<td>IRET</td>
</tr>
<tr>
<td>E2</td>
<td>LOOP</td>
</tr>
<tr>
<td>E8</td>
<td>CALL</td>
</tr>
<tr>
<td>E9</td>
<td>JMP</td>
</tr>
<tr>
<td>EA</td>
<td>JMP</td>
</tr>
<tr>
<td>EB</td>
<td>JMP</td>
</tr>
<tr>
<td>FF[xx01][0xx]</td>
<td>CALL</td>
</tr>
<tr>
<td>FF[xx01][1xx]</td>
<td>CALL</td>
</tr>
<tr>
<td>FF[xx10][0xx]</td>
<td>JMP</td>
</tr>
<tr>
<td>FF[xx10][1xx]</td>
<td>JMP</td>
</tr>
</tbody>
</table>

### Table A.16. : Intel 80386 Jump Instructions

<table>
<thead>
<tr>
<th>Code</th>
<th>Mnemonic</th>
<th>Type</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F00</td>
<td>JO</td>
<td>UJ</td>
</tr>
<tr>
<td>0F01</td>
<td>JNO</td>
<td>UJ</td>
</tr>
<tr>
<td>0F02</td>
<td>JB</td>
<td>UJ</td>
</tr>
<tr>
<td>0F03</td>
<td>JNB</td>
<td>UJ</td>
</tr>
<tr>
<td>0F04</td>
<td>JZ</td>
<td>UJ</td>
</tr>
<tr>
<td>0F05</td>
<td>JNZ</td>
<td>UJ</td>
</tr>
<tr>
<td>0F06</td>
<td>JBE</td>
<td>UJ</td>
</tr>
<tr>
<td>0F07</td>
<td>JNBE</td>
<td>UJ</td>
</tr>
<tr>
<td>0F08</td>
<td>JS</td>
<td>UJ</td>
</tr>
<tr>
<td>0F09</td>
<td>JNS</td>
<td>UJ</td>
</tr>
<tr>
<td>0F0A</td>
<td>JP</td>
<td>UJ</td>
</tr>
<tr>
<td>0F0B</td>
<td>JNP</td>
<td>UJ</td>
</tr>
<tr>
<td>0F0C</td>
<td>JL</td>
<td>UJ</td>
</tr>
<tr>
<td>0F0D</td>
<td>JNL</td>
<td>UJ</td>
</tr>
<tr>
<td>0F0E</td>
<td>JLE</td>
<td>UJ</td>
</tr>
<tr>
<td>0F0F</td>
<td>JNLE</td>
<td>UJ</td>
</tr>
</tbody>
</table>

*Code = Hexadecimal Instruction Opcode ; | = Hexadecimal Digit ;

x = Further Specification Required*
Appendix B

THE DESIGN OF AN ACCESS GUARDIAN

B.1. Introduction ................................................. 222
B.3. The Address Decoder ......................................... 225
B.4. The Restart Generator ....................................... 225
B.5. The Timer Unit ............................................... 228
B.6. Design Simulation ............................................ 231
B.7. The Design's Hardware Requirement ..................... 235
B.8. Summary ..................................................... 235
B.1. Introduction

This appendix describes the design of an Access Guardian proposed in Chapter 5. The Access Guardian detects whether or not invalid address lines are activated by the microprocessor whose operation it is monitoring, and if so, impresses an interrupt signal to the microprocessor. The design is validated through a gate-level simulation, and the hardware requirement is listed. The topology of a microprocessor system incorporating an Access Guardian is shown in Figure B.1.

B.2. An Access Guardian Design

The Access Guardian design presented here monitors a dedicated address bus for access outside a contiguous 16 MByte block of memory, and has a required interrupt latency of ten clock cycles. Whilst particular microprocessor applications are expected to have a more complex Access Guardian specification, the requirements used here are sufficient to indicate design implications.

The Access Guardian design is based on the interaction of three functional units: the 'address decoder', the 'restart generator', and the 'timer unit'. The general function of the Access Guardian is shown in Figure B.2. The 'address decoder' generates a signal when invalid address lines are activated. This signal is then processed with 'timer unit' status information by the 'restart generator' to produce an interrupt signal for the application processor. The interrupt signal must exist slightly in excess of the microprocessor interrupt latency. The interrupt latency is the length of time an interrupt must exist to guarantee processing by the microprocessor. Assuming the interrupt is given highest priority the processor will detect it following the execution of the present instruction. The interrupt signal must therefore be just longer than the longest execution time required by any instruction.
This diagram shows the typical topology of a microprocessor based controller incorporating an Access Guardian.

RESTART is the interrupt signal impressed by the Access Guardian upon the microprocessor to notify detection of invalid address line activity indicative of erroneous abortion.

Figure B.1: Embedded Access Guardian
This diagram shows the functional units of the Access Guardian. Not all address lines may be required by the address decoder. Signal A denotes detection of invalid address line activity. RESTART is the interrupt signal impressed upon the microprocessor which indicates erroneous execution. The timer unit holds the generated RESTART signal for a duration that guarantees processor receipt. The feedback signal FB reacts the RESTART at the end of the timed period if there is no further invalid address line activity. Continuing invalid address line activity initiates a further duration of the RESTART signal output by the restart generator.

The address decoder, restart generator, and timer unit all take the system input signals CLR (clock) and MRESET (manual reset).

Figure B.2. : Access Guardian
B.3. The Address Decoder

The 'address decoder' determines whether or not invalid address lines have been activated. An example decoder is shown in Figure B.3. for a Motorola 68000 address bus (address lines ‘A0’, ‘A1’,...‘A23’ specifying a 16 MByte address space) where only the least significant 16 MByte of memory is used. A simple OR function for the four most significant address lines (‘A23’, ‘A22’, ‘A21’, and ‘A20’) determines an invalid access and generates a signal ‘A’. The 'address decoder', however, will be more complex if sections of the used area are dispersed across the address space, or if the address bus is multiplexed with the data bus as in the Intel 8086 microprocessor.

B.4. The Restart Generator

The 'restart generator' consists of control logic driving a Set-Reset flip-flop (SRFF). The control circuitry determines the logic values for the SRFF depending on the Access Guardian operating conditions. The controller takes the input ‘A’ (from the address decoder), the manual reset line (‘MRESET’), and the feedback signal ‘FB’ (from the timer). The SRFF has inputs ‘S’ and ‘R’, and output ‘Q’. Table B.1. shows the truth table for the control logic to drive the SRFF. The following expressions for ‘S’ and ‘R’ are developed from the truth table.

\[ S = A \overline{MRESET} \quad (B.1.) \]

\[ R = MRESET + FB\overline{A} \overline{MRESET} \quad (B.2.) \]

Applying DeMorgan's Theorem yields to equation (B.1.),

\[ S = A \overline{MRESET} \quad (B.3.) \]

\[ S = \overline{A} + MRESET \quad (B.4.) \]

and to equation (B.2.),
This is a simple example of an address decoder. The address decoder determines whether or not the microprocessor has exercised one of the address bus lines A23, A22, A21, or A20. The complexity of the address decoder is dependent upon the memory configuration employed by the microprocessor application, and whether or not the microprocessor has a multiplexed data/address bus.

A is the signal (active high) that denotes invalid address line activation.

Figure B.3. : Address Decoder
Table B.1. : Truth Table for SRFF Control Logic in Restart Generator.

<table>
<thead>
<tr>
<th>A</th>
<th>MRESET</th>
<th>FB</th>
<th>S</th>
<th>R</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
</tbody>
</table>

Table B.2. : Set-Reset Flip Flop Transition Table

<table>
<thead>
<tr>
<th>$S_n$</th>
<th>$R_n$</th>
<th>$Q_{n+1}$</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>$Q_n$</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>?</td>
</tr>
</tbody>
</table>
The SRFF is enabled by an input clock signal ('CLK'), and its output 'Q' is dependent upon the input control signals 'S' and 'R'. The relationship between 'Q', 'S', and 'R' is shown in the transition table, see Table B.2. The output 'Q' is set high when an invalid address is decoded and there is no manual reset and no feedback signal indicating the continuing activity of a previously identified invalid address line activity. The output 'Q' remains at the same logic value after being set. The SRFF control circuitry resets the output 'Q' to a low when either the manual reset is exerted, or the feedback signals completed processing of the Access Guardian, and there is no current address line discrepancy detected by the address decoder. The logic design for the 'restart generator' implementing equations (B.3.) and (B.5.) to drive the SRFF is shown in Figure B.4.

B.5. The Timer Unit

A 'timer unit' is used to hold the interrupt signal for the required interrupt latency period and is shown in Figure B.5. This unit takes as inputs the RESTART interrupt signal, the manual reset signal ('MRESET'), and the clock signal ('CLK'). The 'timer unit' generates a feedback signal ('FB') which is used by the 'restart generator'. Initially the clock, restart interrupt, and inverse manual reset signal are put through a logic AND gate to produce a control signal 'PULSE'. The 'PULSE' line is used to drive, together with the clock signal 'CLK', the ripple counter. The ripple counter consists of a series of master-slave JK flip-flops active on the negative edge of the clock signal ('CLK'). The 'PULSE' signal provides the clocking signal to the first master-slave JK flip-flop to generate an output 'Q1'. The output 'Q1' is used as the clocking signal for the second master-slave JK flip-flop. This method of connecting the master-slave JK flip-flops is repeated throughout the ripple counter.
MRESET is the manual reset signal
A is the output signal from the address decoder
FB is the feedback signal from the timer unit
CLK is the system clock signal
S is the 'set' input to the SR flip flop
R is the 'reset' input to the SR flip flop
RESTART is the interrupt signal to be impressed upon the microprocessor

Figure B.4: Restart Generator
This example timer unit incorporates a base 10 counter.

MRESET is the manual reset signal.
CLK is the system clock signal.
RESTART is the interrupt signal to be impressed upon the microprocessor.
PULSE is the clocking signal for the counter.
The COUNTER is negative-edge triggered.
COUNT is active high on counter reset.
CLEAR resets the counter on either manual reset (MRESET) or on the positive-edge on the COUNT signal.
FB is the feedback signal; set high after counter timeout.

Figure B.5. : Timer Unit
The 'J' and 'K' inputs to the flip-flops are set high. A 'CLEAR' signal is used to reset the 'Qn' outputs of the flip-flops to logic 0. The ripple counter architecture is shown in Figure B.6. [Millman, 1979].

The ripple counter is set to count to a specific number 'Cn' by taking the flip-flop outputs 'Q1', 'Q2',.. 'Qn' and applying them to a NAND logic gate as required. The example ripple counter is a base 10 counter, the NAND gate taking the binary inputs representing decimal 10, 'Q2' and 'Q4'. The NAND gate generates the 'COUNT' signal which indicates a necessary reset of the ripple counter. This signal line could be connected directly to the 'CLEAR' line but there may be timing difficulties due to the unequal internal delays within the ripple counter flip-flops. These timing difficulties are removed by inserting a latch between the 'COUNT' and 'CLEAR' lines.

The latch unit takes the additional input of the clock signal ('CLK') and the manual reset line ('MRESET'). The latch is now reset by the positive edge of the 'COUNT' signal to set the ripple counter low. The ripple counter itself is clocked on the negative edge of the 'PULSE' signal. There are now no timing difficulties. The manual reset line ('MRESET') is used to set the ripple counter outputs low.

The 'Q1', 'Q2',... 'Qn' outputs of the ripple counter are taken to a feedback unit which consists of an AND logic gate. The outputs of 'Qn' represent the binary of (Cn - 1), where the ripple counter is base Cn. In the example, the ripple counter outputs representing decimal 9 (Q1 and Q4) are used, see Figure B.6. The ripple counter can, however, be extended as required to produce the interrupt signal 'RESTART' of necessary duration depending on the microprocessor interrupt latency. The feedback unit generates an automatic reset signal ('FB') to the 'restart generator'.

B.6. Design Simulation

The gate-level design for the Access Guardian was simulated using a CLASSIC (a trademark of Plessey plc.) Gate Array Simulator. The digital circuit description used as input to the simulator, and describing the design presented in this appendix, is shown in Figure B.7. The output of the simulator is shown in Figure B.8. The design of the Access Guardian is shown to operate correctly.
This example shows the general construction for the counter. It consists of a series of connected Master-Slave JK Flip Flops. All the flip flops have J, K, and Pr (preset) inputs tied high. The flip flop QBAR output is not used. The first flip flop is clocked by the PULSE signal at its Ck input. Successive flip flops are clocked by the Q output of the previous flip flop. The counter is binary, the least significant bit being the output Q1. Counting occurs on the negative-edge of the PULSE signal due to the operation of the master and slave internal to each flip flop. The counter can easily be extended by adding extra Master-Slave JK Flip Flops onto the counter logic structure.

This logic circuit and its inputs and outputs are defined in the logic diagram for the timer unit.

JDEFF01, 02, 03, 04 are JK Master-Slave Flip Flops.

Figure B.6: Ripple Counter Unit
Figure B.7. : Access Guardian Timing Simulation
Figure B.7: Access Guardian Timing Simulation
UNITS = 10 pS
SUMM
ARRAYSIZE = 640

CIRCUIT
GUARDIANRESTART:A23,A22,A21,A20,CLK,RESET;
SUPPLIES:+VDD, -VSS

DECODER1:1N04 [D1,A23,A22,A21,A20]
DECODER2:2INV [A,D2,D1,A]

RESETS1:NOR2 [S, D2, RESET]
RESETS2:NAND3 [E1, F8, D2, RESETBAR, -]
RESETS3:NAND2 [R, RESETBAR, E1]

SR1FF:NAND2 [S1, S, CLK]
SR1FF2:NAND2 [R1, R, CLK]
SR1FF3:2INV [S2, R2, S1, R1]
SR1FF4:NOR2 [SR10BAR, S2, Q1]
SR1FF5:NOR2 [Q, R2, SR10BAR]

GATE2:NAND3 [PBAR1, PULSE, Q, CLK, RESETBAR, PBAR1]

JK1FF1:NAND3 [J11, +, PULSE, JK10BAR, -]
JK1FF2:NAND3 [K11, +, PULSE, Q1, -]
JK1FF3:NAND2 [J12, J11, K12]
JK1FF4:NAND3 [K12, K11, J12, CLEAR, -]
JK1FF5:NAND2 [J13, J12, PBAR1]
JK1FF6:NAND2 [K13, K12, PBAR1]
JK1FF7:NAND2 [Q1, J13, JK10BAR]
JK1FF8:NAND2 [JK10BAR, K13, Q1]

JK2FF1:NAND3 [J21, +, 01, JK20BAR, -]
JK2FF2:NAND3 [K21, +, 01, Q2, -]
JK2FF3:NAND2 [J22, J21, K22]
JK2FF4:NAND3 [K22, PBAR2, K21, J22, CLEAR, Q1]
JK2FF5:NAND2 [J23, J22, PBAR2]
JK2FF6:NAND2 [K23, K22, PBAR2]
JK2FF7:NAND2 [Q2, J23, JK20BAR]
JK2FF8:NAND2 [JK20BAR, K23, Q2]

JK3FF1:NAND3 [J31, +, Q2, JK30BAR, -]
JK3FF2:NAND3 [K31, +, Q2, Q3, -]
JK3FF3:NAND2 [J32, J31, K32]
JK3FF4:NAND3 [K32, PBAR3, K31, J32, CLEAR, Q2]
JK3FF5:NAND2 [J33, J32, PBAR3]
JK3FF6:NAND2 [K33, K32, PBAR3]
JK3FF7:NAND2 [Q3, J33, JK30BAR]
JK3FF8:NAND2 [JK30BAR, K33, Q3]

JK4FF1:NAND3 [J41, +, Q3, JK40BAR, -]
JK4FF2:NAND3 [K41, +, Q3, 04, -]
JK4FF3:NAND2 [J42, J41, K42]
JK4FF4:NAND3 [K42, PBAR4, K41, J42, CLEAR, Q3]
JK4FF5:NAND2 [J43, J42, PBAR4]
JK4FF6:NAND2 [K43, K42, PBAR4]
JK4FF7:NAND2 [Q4, J43, JK40BAR]
JK4FF8:NAND2 [JK40BAR, K43, Q4]

COUNTER:NAND2 [COUNT, Q2, Q4]

LATCH1:NAND2 [EP, COUNT, RST]
LATCH2:NAND2 [RST, P, CLKBAR]
LATCH3:2INV [CLKBAR, CLKB, -]

RST1:NAND2 [CLEARBAR, RST, RESETBAR]
RST2:2INV [CLEAR, RESETBAR, CLEARBAR, MRESET]

FEEDBACK1:NAND2 [FBI, Q1, Q4]
FEEDBACK2:2INV [FB, FBI, -]

RESULT:2INV [RESTART, QN, QN, Q]

END

Figure B.8: Access Guardian Circuit Description
B.7. The Design's Hardware Requirement

The hardware requirement for the Access Guardian design is shown in Table B.3. The design specification requires only 60 logic gates, which can be implemented by 17 standard TTL IC parts.

B.8. Summary

The Access Guardian proposed in Chapter 5 has been designed and its operation verified through a gate-level simulation. The design implemented is simple, other designs may be more appropriate in particular applications. The design chosen here operates independently of the processor whose bus activity it is monitoring.

The hardware requirement of the Access Guardian design presented here for a contiguous 16 MByte of used memory, dedicated address bus, and ten cycle interrupt latency, can increase with more complex microprocessor systems. The complexity of the address decoder will increase with a non-contiguous used area of memory and multiplexed busses. The timer unit may be slightly more complicated with particular binary time representations. The restart generator has a fixed logic gate requirement.
<table>
<thead>
<tr>
<th>Gate Description</th>
<th>Number Required</th>
<th>TTL Part</th>
<th>Description</th>
<th>Number Required</th>
</tr>
</thead>
<tbody>
<tr>
<td>2-Input NAND</td>
<td>28</td>
<td>74F00</td>
<td>Quad 2-Input NAND Gate</td>
<td>7</td>
</tr>
<tr>
<td>2-Input NOR</td>
<td>3</td>
<td>74F02</td>
<td>Quad 2-Input NOR Gate</td>
<td>1</td>
</tr>
<tr>
<td>2-Input Inverter</td>
<td>14</td>
<td>74F04</td>
<td>Hex Inverter</td>
<td>3</td>
</tr>
<tr>
<td>3-Input NAND</td>
<td>14</td>
<td>74F10</td>
<td>Triple 3-Input NAND Gate</td>
<td>5</td>
</tr>
<tr>
<td>4-Input NOR</td>
<td>1</td>
<td>74F260</td>
<td>Dual 5-Input NOR Gate</td>
<td>1</td>
</tr>
<tr>
<td>Total Required</td>
<td>60</td>
<td></td>
<td>Total Required</td>
<td>17</td>
</tr>
</tbody>
</table>

Table B.3. : Access Guardian Parts List.
Appendix C

PARUT AND OTHER RELATED CODE LISTINGS

C.1. Introduction ......................................................... 237
C.2. PARUT Listing ......................................................... 237
C.3. Microprocessor Description File, MICRO_FILE .................... 257
C.4. Target Software, CODE_FILE .................................. 261
C.5. Target Software with Fault Tolerance, RESULT_FILE ............ 263
C.6. PARUT Report File, ANALYSIS_FILE ............................ 266
C.7. PARUT Diagnostics, TRACE_FILE ............................... 269
APPENDIX C

PARUT AND OTHER RELATED CODE LISTINGS

C.1. Introduction

This appendix is designed to be read in conjunction with Chapter 6 which des­cribes the Post-programming Automated Recovery Utility (PARUT). The appendix holds enclosures of the PARUT program listing, and copies of typical input and output files processed by PARUT during its operation. These files, referred to as MICRO_FILE, CODE_FILE, RESULT_FILE, ANALYSIS_FILE, and TRACE_FILE within Chapter 6, are presented respectively with covering notes.

C.2. PARUT Listing

The first enclosure in this appendix is that of the PARUT program. The utility code is written in Pascal. The listing is annotated to aid comprehension of program activity. Table C.1. details, in chronological order, the utility functions and procedures within the listing and is designed for use as an index when examining the PARUT program.
<table>
<thead>
<tr>
<th>Function or Procedure Name</th>
<th>Called Functions or Procedures</th>
</tr>
</thead>
<tbody>
<tr>
<td>placement_deadlock</td>
<td>None</td>
</tr>
<tr>
<td>data_analysis</td>
<td>None</td>
</tr>
<tr>
<td>analysis</td>
<td>placement_deadlock</td>
</tr>
<tr>
<td>jump_direction</td>
<td>None</td>
</tr>
<tr>
<td>interval</td>
<td>None</td>
</tr>
<tr>
<td>test_seed</td>
<td>None</td>
</tr>
<tr>
<td>seed_details</td>
<td>None</td>
</tr>
<tr>
<td>place_seed</td>
<td>None</td>
</tr>
<tr>
<td>analysis</td>
<td>jump_direction</td>
</tr>
<tr>
<td></td>
<td>interval</td>
</tr>
<tr>
<td></td>
<td>test_seed</td>
</tr>
<tr>
<td></td>
<td>seed_details</td>
</tr>
<tr>
<td></td>
<td>place_seed</td>
</tr>
<tr>
<td>digit</td>
<td>None</td>
</tr>
<tr>
<td>build_adb</td>
<td>digit</td>
</tr>
<tr>
<td>build_rng</td>
<td>None</td>
</tr>
<tr>
<td>build_dis</td>
<td>None</td>
</tr>
<tr>
<td>set_jump</td>
<td>None</td>
</tr>
<tr>
<td>align_address</td>
<td>None</td>
</tr>
<tr>
<td>instru_set</td>
<td>None</td>
</tr>
<tr>
<td>build_jumps</td>
<td>instru_set</td>
</tr>
<tr>
<td></td>
<td>set_jump</td>
</tr>
<tr>
<td></td>
<td>align_address</td>
</tr>
<tr>
<td>copy_list</td>
<td>None</td>
</tr>
<tr>
<td>address_list</td>
<td>None</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Function or Procedure Name</th>
<th>Called Functions or Procedures</th>
</tr>
</thead>
<tbody>
<tr>
<td>seed_placement</td>
<td>seed_location</td>
</tr>
<tr>
<td></td>
<td>address_list</td>
</tr>
<tr>
<td></td>
<td>build_jumps</td>
</tr>
<tr>
<td></td>
<td>seed_placement</td>
</tr>
<tr>
<td></td>
<td>build_adb</td>
</tr>
<tr>
<td></td>
<td>copy_list</td>
</tr>
<tr>
<td>seed_list</td>
<td>seed_placement</td>
</tr>
<tr>
<td>relocate_list</td>
<td>None</td>
</tr>
<tr>
<td>stream_list</td>
<td>None</td>
</tr>
<tr>
<td>standard_offset</td>
<td>None</td>
</tr>
<tr>
<td>align_list</td>
<td>standard_offset</td>
</tr>
<tr>
<td>tidy_list</td>
<td>None</td>
</tr>
<tr>
<td>print_list</td>
<td>None</td>
</tr>
<tr>
<td>main</td>
<td>build_rng</td>
</tr>
<tr>
<td></td>
<td>build_adb</td>
</tr>
<tr>
<td></td>
<td>copy_list</td>
</tr>
<tr>
<td></td>
<td>build_jumps</td>
</tr>
<tr>
<td></td>
<td>seed_list</td>
</tr>
<tr>
<td></td>
<td>relocate_list</td>
</tr>
<tr>
<td></td>
<td>stream_list</td>
</tr>
<tr>
<td></td>
<td>address_list</td>
</tr>
<tr>
<td></td>
<td>align_list</td>
</tr>
<tr>
<td></td>
<td>tidy_list</td>
</tr>
<tr>
<td></td>
<td>print_list</td>
</tr>
<tr>
<td></td>
<td>data_analysis</td>
</tr>
<tr>
<td></td>
<td>analysis</td>
</tr>
</tbody>
</table>

Table C.1: Chronological Order of Functions in PARUT Listing
program asp( input, output, trace_file, analysis_file, code_file, result_file );

CONST
  FORWARD = TRUE;
  BACKWARD = FALSE;
  OPERAND = true;
  NO_SEED = false;
  SEED = 24577;
  BBA = 24576;
  MAX = 65536;
  START_ADD = 22000;
  INDENT = 9;

DATA_WORD is a record which contains all the information required to place seeds into a piece of code.

TYPE
  link = data_word;
  data_word = RECORD
    jump_too : link;
    jump_from : link;
    last_address : link;
    address : integer;
    op : integer;
    seeded : boolean;
    jump_address : integer;
    jump_type : integer;
    jump_too : link;
    jump_from : link;
    last_address : link;
    address : integer;
    op : integer;
    seeded : boolean;
    jump_address : integer;
    jump_type : integer;
    jump_too : link;
    jump_from : link;
    last_address : link;
    address : integer;
    op : integer;
    seeded : boolean;
    jump_address : integer;
    jump_type : integer;

FUNCTION placement_deadlock( p : link ) : boolean;
VAR
  q : link;
  found : boolean;
BEGIN
  IF diagnostic THEN write( trace_file, ' placement_deadlock( ');
  q := p;
  found := false;
  WHILE ( (q.address <> p.jump_too.address) AND (q <> nil) AND (found) ) DO
    IF ( q.address > p.jump_too.address ) THEN
      BEGIN (search backwards)
      IF ( q.op_type = OPERAND ) THEN
      OTHERWISE
      BEGIN (search forwards)
      IF ( q.op_type = OPERAND ) THEN
      OTHERWISE
      END;
found := false; (end of instruction search) 
q := q*.next_address; 
END ( then ) 
ELSE BEGIN (search forwards)
IF ( q*.next_address.opcode = opcode ) THEN
found := false; (end of instruction search) 
q := q*.next_address; 
END ( else )
IF ( q = nil ) THEN found := false; (search extended outside used area)
IF diagnostic THEN writeln( trace_filo, 'found' );
placement_doseek := found; 
END;

**************************************************************************

Procedure DATA_ANALYSIS is used to analyze the data area code.

**************************************************************************

PROCEDURE data_analysis( base : link )
VAR
jump_out : INTEGER;
jump_in : INTEGER;
area : INTEGER;
seeds : INTEGER;
p : link;
min, max : INTEGER;
BEGIN
IF diagnostic THEN writeln( trace_filo, 'data_analysis' );
area := 0;
jump_out := 0;
jump_in := 0;
seeds := 0;
p := base;
min := base*.address;
WHILE ( p <> nil ) DO
BEGIN
max := p*.address;
p := p*.next_address;
END ( while )
p := base;
WHILE ( p <> nil ) DO
BEGIN
area := area + 1;
IF ( p*.jump_address <> nil ) THEN
IF ( (p*.jump_address + p*.address < min) OR 
( p*.jump_address + p*.address > max ) ) THEN
jump_out := jump_out + 1
ELSE jump_in := jump_in + 1;
IF ( p*.og = SEED ) THEN seeds := seeds + 1;
p := p*.next_address;
END ( while )
write( trace_filo, 'INITIAL', area, 0, 
jump_in, 0, jump_out, 0, seeds, 0 );
writeln( trace_filo, ' 0' );
END;

**************************************************************************

Procedure ANALYSIS is used to analyze the program area code.

**************************************************************************

PROCEDURE analysis( title : titlotype; var area : INTEGER; base : link )
CONST
OUTSIDE = 3;
INSIDE = 1;
ODACOD = 5; (* JUMP STATES *)
INERN = 6; (* )
NODED = 7; (* )
INFO = 8; (* )

TYPE
data_jump = ARRAY [ OUTSIDE...INFO ] OF REAL;
data_area = ARRAY [ 0..16 ] OF data_jump;
data_array = ARRAY [ 0..16 ] OF data_area;
VAR
p : link;
data : data_array;
umber_seeds : INTEGER;
umber_nodes : INTEGER;
umber_nodes : INTEGER;
umber_blocks : INTEGER;
BEGIN
IF diagnostic THEN writeln( trace_filo, 'analysis : ', title );
umber_seeds := 0;
umber_nodes := 0;
umber_nodes := 0;
umber_blocks := 0;
area := 0;
recovrey := 0;
BEGIN
IF diagnostic THEN writeln( trace_filo, 'area : ', title );
umber_seeds := 0;
umber_nodes := 0;
umber_nodes := 0;
umber_blocks := 0;

BEGIN
  datla. opcodc. j] := 0;
  datla. operand. j] := 0;
END;

WHILE (p <> nil) DO
  BEGIN
    IF (p.jump_type = 0) THEN
      BEGIN
        IF (p.jump_address mod 2 = 1) THEN
        ELSE
          BEGIN
            IF (NOT p. seeded) THEN
            END;
          END;
        END;
      END;
    ELSE
      BEGIN
        IF (p. op = SEED) THEN
          BEGIN
            IF (p. op = BRA + 1 AND (p. next_address. op = SEED)) THEN
              BEGIN
                number. seeds := number. seeds + 1;
              END;
            END;
          END;
        IF (title = ORIGINAL COVER) THEN
          BEGIN
            aorl := aorl + 1;
          END;
        END;
      END;
    END;
  END;

BEGIN
  unseeded_jump := datla. 1. opcode. HOSED] + datla. 1. operand. HOSED] +
  datla. 5. opcode. HOSED] + datla. 5. operand. HOSED] +
  datla. 6. opcode. HOSED] + datla. 6. operand. HOSED];
  opcode. roots := datla. 2. opcode. INFO] + datla. 3. opcode. INFO] + datla. 4. opcode.
  opcode. roots := datla. 7. opcode. INFO] + datla. 12. opcode. INFO] +
  datla. 13. opcode. INFO] + datla. 10. opcode. INFO;}
  operand. roots := data. 2. operand. INFO] + datla. 3. operand. INFO] +
  datla. 4. operand. INFO] + datla. 5. operand. INFO];
  number. roots := opcode. roots + operand. roots;
  number. jump := datla. 1. opcode. INFO] + datla. 5. opcode. INFO] +
  datla. 6. opcode. INFO] + datla. 9. operand. INFO];
  number. jump := datla. 9. operand. INFO] + datla. 6. operand. INFO];
  number. roots := opcode. roots + operand. roots;
  number. roots := data. 11. opcode. INFO];
  number. roots := data. 11. operand. INFO;}
  IF (area > 0) THEN
    BEGIN
      aorl := (number. seeds + aorl) * 100 / (area + 1):
    END;
  END;
END;


write( analysis_file2, 'OUTSIDE TO INFO DO
write( analysis_file2, data[ 10, OPCODE, 1 ]:8:2 );
write( analysis_file2, 'W' );
write( analysis_file2, 'OUTSIDE TO INFO DO
write( analysis_file2, data[ 10, OPEND, 1 ]:8:2 );
write( analysis_file2, 'W' );
END;

FUNCTION jump_direction( var start, finish : link; prog_unit : link): boolean;
VAR
i : Integer;
direction : boolean;
BEGIN
IF diagnostic THEN write( trace_file, ' jump_direction : ');
prog_unit.address(); ' to ', prog_unit. jump_to(). address();
start := prog_unit;
finish := prog_unit;
IF ( prog_unit.address() = START_ADDRESS
< prog_unit. jump_to(). address() ) THEN BEGIN
FOR i := 1 TO prog_unit. jump_address DO
write( analysis_file2, ', ' )
IF ( i mod 2 = 0 ) THEN
finish := finish.next_address;
direction := FORWARD;
END ELSE BEGIN
FOR i := prog_unit. jump_address TO ( -1 ) DO
start := start.next_address;
END;
direction := BACKWARD;
END;

IF diagnostic AND ( direction = FORWARD ) THEN writeln( trace_file, ' : forward' );
IF diagnostic AND ( direction = BACKWARD ) THEN writeln( trace_file, ' : backward' );
jump_direction := direction;
END;

*********************************************************

Procedure INTERVAL determines the number of armorous jump generators and
destinations within an address range.
*********************************************************

PROCEDURE interval{ var cross_over: integer; (No. Generators/Destinations) start, finish : link; (Address Range Boundaries) } var p : link; ( Pointer to search interval )
BEGIN
IF diagnostic THEN writeln( trace_file, ' : interval' );
new( p );
cross_over := 0;
WHILE ( p <> finish ) DO
BEGIN
IF ( p < start ) THEN
  IF ( p^.jump_from <> nil ) THEN
    IF ( ( NOT p^.jump_from^.ooded ) AND
         ( p^.jump_from^.otype = OPERAND ) ) THEN
      cross_over := cross_over + 1;
      p := p^.next_address;
    END; ( while )
  IF diagnostic AND ( cross_over > 0 ) THEN
    writeln( trace_file, ' : not clear ', cross_over,' );
  IF diagnostic AND ( cross_over = 0 ) THEN writeln( trace_file, ' : clear' );
END;

*********************************************************

Function TEST_SEED determines whether the process of coding is necessary
because there might already be a seed placed at the armorous jump location.
*********************************************************

FUNCTION test_seed( unit : link; direction : boolean ); boolean;
var found : boolean;
BEGIN
IF diagnostic THEN writeln( trace_file, ' : test_seed' );
found := false;
CASE ( direction ) OF
  FORWARD : found := unit^.ooded;
  BACKWARD : found := unit^.jump_from^.ooded;
END; ( case )
IF diagnostic AND found THEN writeln( trace_file, ' : found' );
IF diagnostic AND NOT found THEN writeln( trace_file, ' : not found' );
test_seed := found;
END;

*********************************************************

Procedure SEEDDETAILS determines the OFFSET required by the seed and
adjacent locations UNIT and FINISH inbetween which the seed is to be
placed. The appropriate SEEDDETAILS field for the jumping operand is set
true. Note that FINISH is modified due to determining the next or last
opcode in the program linked list.
*********************************************************

PROCEDURE seed_details( start : link; direction : boolean; var unit, finish : link; var offset : integer );
BEGIN
IF diagnostic THEN writeln( trace_file, ' : seed_details' );
offset := 1;
CASE ( direction ) OF
  FORWARD : BEGIN
    start^.ooded := true;
    unit := finish^.last_address;
    WHILE ( unit^.otype = OPERAND ) DO
      BEGIN
        offset := offset + 1;
        unit := unit^.last_address;
      END; ( while )
    END; ( if )
  BACKWARD : BEGIN
    finish^.ooded := true;
    unit := start^.last_address;
    WHILE ( unit^.otype = OPERAND ) DO
      BEGIN
        offset := offset + 1;
        unit := unit^.last_address;
      END; ( while )
    END; ( if )
  END;
finish := unit;
unit := unit^.last_address;
END;

*********************************************************

Procedure PLACE_SEED inserts a seed of length OFFSET inbetween the
adjacent locations UNIT and FINISH in the program linked list.
*********************************************************

PROCEDURE place_seed( unit, finish : link; offset : integer );
var
  good_block : link;
  q : link;
  t : integer;
BEGIN
IF diagnostic THEN writeln( trace_file, ' : place_seed' );
new( q );
q := unit;
new( good_block );
seed_block.next_address := seed_block

FOR i := 1 TO offset DO
    new := seed_block;
    seed_block.next_address := new.address;
    seed_block.address := new.address;
    seed_block.next_address := new.next_address;
    seed_block.address := new.address;
END;

FUNCTION seed_location(var prog_unit : link);
    level : integer;
   /markor : integer ;
    boolean;

VAN
    direction : boolean;
    start : link;
    finish : link;
    base : link;
    top : link;

FUNCTION DIGIT converts a character representing a hexadecimal digit into an integer.
FUNCTION digit( ch : char ) : integer;
VAR
   number : integer;
BEGIN
   IF diagnostic THEN writeln( trace_file, ' digit' );
   ( Diagnostic removed to reduce TRACE_FILE size )
   number := 0;
   CASE ( ch ) OF
     '0' : number := 0;
     '1' : number := 1;
     '2' : number := 2;
     '3' : number := 3;
     '4' : number := 4;
     '5' : number := 5;
     '6' : number := 6;
     '7' : number := 7;
     '8' : number := 8;
     '9' : number := 9;
     'A' : number := 10;
     'B' : number := 11;
     'C' : number := 12;
     'D' : number := 13;
     'E' : number := 14;
     'F' : number := 15;
     '0' : number := 10;
     '1' : number := 11;
     '2' : number := 12;
     '3' : number := 13;
     '4' : number := 14;
     '5' : number := 15;
     '6' : number := 16;
     '7' : number := 17;
     '8' : number := 18;
     '9' : number := 19;
     ELSE
      BEGIN ( no word )
      number := -1;
      END;
   END;
   otherwise BEGIN ( error )
   number := -3;
   END;
END;
(**********************************************************
Procedure BUILD_ADD constructs the linked list that represents the program.
**********************************************************)
PROCEDE build_add;
CONST
   HEADER_LENGTH = 3;
VAR
   i : integer;
   q : link;
   next_section : boolean;
   old_code : integer;
   ch : char;
   number : integer;
   first := boolean;
   label : integer;
BEGIN
   IF diagnostic THEN writeln( trace_file, 'build_add' );
   p0 := base;
   first := true;
   old_code := 0;
   i := 0;
   WHILE ( i < code_file ) AND ( i < HEADER_LENGTH ) DO
   BEGIN
      read( code_file, ch );
      i := i + 1;
   END;
   next_section := false;
   label := 0;
   WHILE NOT next_section DO
   BEGIN ( remove code )
      read( code_file, ch );
      WHILE NOT old( code_file ) DO
      BEGIN
         CASE ( ch ) OF
            ' ' : BEGIN ( new label )
            label := label + 1;
            read( code_file, ch );
            END;
            ' ' : BEGIN ( end of section )
            next_section := true;
            WHILE NOT old( code_file ) DO
            read( code_file, ch );
            END;
            '0' : BEGIN ( get data word )
            read( code_file, ch );
            i := i + 1;
            END;
            IF ( ch = "Z" ) THEN
            BEGIN
            read( code_file, ch );
            number := digit( ch );
            code := 0;
            WHILE ( number > -1 ) DO
            BEGIN ( build number )
            code := code * 16 + number;
            read( code_file, ch );
            number := digit( ch );
            END;
            END;
            read( code_file, ch );
            q := jump( q, nil );
            END;
            q := jump(q, nil );
            q := op_code;
            q := operand;
            q := last_address := p0;
            q := start := true;
            q := offset := label;
            IF first THEN
            BEGIN ( start of program )
            q := address := START_ADDRESS;
            base := q;
            first := false;
            END;
            ELSE
            BEGIN
            q := address := p0 + address + 2;}
PROCEDURE build_rmp(var q, base : link);
VAR  
p : link;
i : integer;
number : integer;
length : integer;
BEGIN
  IF diagoff THEN writeln( 'trace_file', 'build_rmp');
  number := 16380;  
  q := base;
  writeln( 'Input decimal length of generated code : '); 
  readln( user, length )
END; 
p := 1 TO length DO
  IF ( i mod 2 = 0 ) THEN
    BEGIN
      p := start_address;
      base := p;
      END; 
ELSE
      q := q + address + 2;
      p := q + address + 1;
      IF ( p = start_address ) THEN
        BEGIN
          p := p + address + 1;
          IF ( p = start_address ) THEN
            BEGIN
              q := p + address + 1;
              IF ( p = start_address ) THEN
                BEGIN
                  q := p + address + 1;
                  IF ( p = start_address ) THEN
                    BEGIN
                      q := p + address + 1;
                      IF ( p = start_address ) THEN
                        BEGIN
                          q := p + address + 1;
                          IF ( p = start_address ) THEN
                            BEGIN
                              q := p + address + 1;
                              IF ( p = start_address ) THEN
                                BEGIN
                                  p := p + address + 1;
                                  IF ( p = start_address ) THEN
                                    BEGIN
                                      p := p + address + 1;
                                      IF ( p = start_address ) THEN
                                        BEGIN
                                          p := p + address + 1;
                                          IF ( p = start_address ) THEN
                                            BEGIN
                                              p := p + address + 1;
                                              IF ( p = start_address ) THEN
                                                BEGIN
                                                  p := p + address + 1;
                                                  IF ( p = start_address ) THEN
                                                    BEGIN
                                                      p := p + address + 1;
                                                      IF ( p = start_address ) THEN
                                                        BEGIN
                                                          p := p + address + 1;
                                                          IF ( p = start_address ) THEN
                                                            BEGIN
                                                              p := p + address + 1;
                                                              IF ( p = start_address ) THEN
                                                                BEGIN
                                                                  p := p + address + 1;
                                                                  IF ( p = start_address ) THEN
                                                                    BEGIN
                                                                      p := p + address + 1;
                                                                      IF ( p = start_address ) THEN
                                                                        BEGIN
                                                                          p := p + address + 1;
                                                                          IF ( p = start_address ) THEN
                                                                            BEGIN
                                                                              p := p + address + 1;
                                                                              IF ( p = start_address ) THEN
                                                                                BEGIN
                                                                                  p := p + address + 1;
                                                                                  IF ( p = start_address ) THEN
                                                                                    BEGIN
                                                                                      p := p + address + 1;
                                                                                      IF ( p = start_address ) THEN
                                                                                        BEGIN
                                                                                          p := p + address + 1;
                                                                                          IF ( p = start_address ) THEN
                                                                                            BEGIN
                                                                                              p := p + address + 1;
                                                                                              IF ( p = start_address ) THEN
                                                                                                BEGIN
                                                                                                  p := p + address + 1;
                                                                                                  IF ( p = start_address ) THEN
                                                                                                    BEGIN
                                                                                                      p := p + address + 1;
                                                                                                      IF ( p = start_address ) THEN
                                                                                                        BEGIN
                                                                                                          p := p + address + 1;
                                                                                                          IF ( p = start_address ) THEN
                                                                                                            BEGIN
                                                                                                              p := p + address + 1;
                                                                                                              IF ( p = start_address ) THEN
                                                                                                                BEGIN
                                                                                                                    p := p + address + 1;
                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                      BEGIN
                                                                                                                        p := p + address + 1;
                                                                                                                        IF ( p = start_address ) THEN
                                                                                                                          BEGIN
                                                                                                                            p := p + address + 1;
                                                                                                                            IF ( p = start_address ) THEN
                                                                                                                              BEGIN
                                                                                                                                p := p + address + 1;
                                                                                                                                IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                      BEGIN
                                                                                                                                      p := p + address + 1;
                                                                                                                                      IF ( p = start_address ) THEN
                                                                                                                                        BEGIN
                                                                                                                                        p := p + address + 1;
                                                                                                                                        IF ( p = start_address ) THEN
                                                                                                                                          BEGIN
                                                                                                                                          p := p + address + 1;
                                                                                                                                          IF ( p = start_address ) THEN
                                                                                                                                            BEGIN
                                                                                                                                            p := p + address + 1;
                                                                                                                                            IF ( p = start_address ) THEN
                                                                                                                                              BEGIN
                                                                                                                                              p := p + address + 1;
                                                                                                                                              IF ( p = start_address ) THEN
                                                                                                                                                BEGIN
                                                                                                                                                p := p + address + 1;
                                                                                                                                                IF ( p = start_address ) THEN
                                                                                                                                                  BEGIN
                                                                                                                                                  p := p + address + 1;
                                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                                    BEGIN
                                                                                                                                                    p := p + address + 1;
                                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                                      BEGIN
                                                                                                                                                      p := p + address + 1;
                                                                                                                                                      IF ( p = start_address ) THEN
                                                                                                                                                          BEGIN
                                                                                                                                                          p := p + address + 1;
                                                                                                                                                          IF ( p = start_address ) THEN
                                                                                                                                                              BEGIN
                                                                                                                                                              p := p + address + 1;
                                                                                                                                                              IF ( p = start_address ) THEN
                                                                                                                                                                  BEGIN
                                                                                                                                                                  p := p + address + 1;
                                                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                                                      BEGIN
                                                                                                                                                                      p := p + address + 1;
                                                                                                                                                                      IF ( p = start_address ) THEN
                                                                                                                                                                          BEGIN
                                                                                                                                                                          p := p + address + 1;
                                                                                                                                                                          IF ( p = start_address ) THEN
                                                                                                                                                                              BEGIN
                                                                                                                                                                              p := p + address + 1;
                                                                                                                                                                              IF ( p = start_address ) THEN
                                                                                                                                                                                  BEGIN
                                                                                                                                                                                  p := p + address + 1;
                                                                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                                                                      BEGIN
                                                                                                                                                                                      p := p + address + 1;
                                                                                                                                                                                      IF ( p = start_address ) THEN
                                                                                                                                                                                          BEGIN
                                                                                                                                                                                          p := p + address + 1;
                                                                                                                                                                                          IF ( p = start_address ) THEN
                                                                                                                                                                                              BEGIN
                                                                                                                                                                                              p := p + address + 1;
                                                                                                                                                                                              IF ( p = start_address ) THEN
                                                                                                                                                                                                  BEGIN
                                                                                                                                                                                                  p := p + address + 1;
                                                                                                                                      IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
                                                                                                                                  IF ( p = start_address ) THEN
                                                                                                                                    BEGIN
                                                                                                                                    p := p + address + 1;
                                                                                                                                    IF ( p = start_address ) THEN
                                                                                                                                  BEGIN
                                                                                                                                  p := p + address + 1;
Procedure BUILD_DIS constructs the linked list that represents the program.
BASE is a pointer to the first item in the list and is a global.

***************************************************

PROCEDURE build_dis( var q, base : link );
VAR
  i : integer;
  count : integer;
  p : link;
  code : integer;
  ch : char;
  finish_line : boolean;
  first : boolean;
BEGIN
  IF diagnostic THEN writeln( trace_file, 'build_dis' );
  i := 0;
  q := base;
  WHILE ( NOT off( code_file ) ) AND ( ( i < HEADER_LENGTH ) ) DO
    BEGIN [ remove header ]
      readin( code_file );
      i := i + 1;
    END; [ while ]
    first := true;
    WHILE ( NOT off( code_file ) ) DO
      BEGIN [ remove code ]
        FOR i := 1 TO HEADER_LENGTH DO
          count := 0;
          finish_line := false;
          WHILE ( NOT finish_line ) DO
            BEGIN [ get word ]
              IF ( digit( ch ) > -1 ) THEN
                BEGIN
                  count := 0;
                  code := 0;
                  FOR i := 1 TO 4 DO
                    code := code * 16 + digit( ch );
                  END;
                  readin( code_file, ch ); [ perhaps next word ]
                  count := count + 1;
                  p.*.op := code;
                  p.*.last_address := q;
                  p.*.offset := 0;
              END;
              IF ( count = 1 ) THEN p.*.optype := OPCODE
              ELSE p.*.optype := OPNRAND;
            END;
            NEW( p.*.next_address := p.*.address + 2 );
          END;
          THEN ]
          ELSE BEGIN [ do some work ]
            readin( code_file );
            finish_line := true;
          END; [ else ]
          END; [ while ]
          END; [ while ]
      END;
  END;
  ***************************************************

Procedure SET_JUMP constructs the jump data for a particular jump in the linked list program.

***************************************************

PROCEDURE oct_jump( var p , link ; lt , li : integer );
VAR
  q : link;
  relativo : integer;
  finish : boolean;
BEGIN
  IF diagnostic THEN writeln( trace_file, ' oct_jump' );
  THEN
    readin( code_file ) ;
    IF ( ( lt = 0 ) AND ( li = 0 ) ) THEN
      BEGIN [ displacement not word ]
        IF diagnostic THEN writeln( trace_file, ' displacement not word' );
        IF ( p.*.last_address .* .op < MAX / 2 ) THEN
          relativo := p.*.last_address .* .op
        ELSE
          relativo := p.*.last_address .* .op - MAX;
      END; [ then ]
    ELSE BEGIN [ displacement in lower byte ]
      IF diagnostic THEN writeln( trace_file, ' displacement lower byte' );
      relativo := p.*.op and 256;
      IF ( relativo > 127 ) THEN
        relativo := relativo - 256;
      END; [ else ]
      relativo := relativo + 2; [ alter for PC pointing to first operand ]
      q := p.
      finish := false;
    END;
    WHILE ( q.*.address <> p.*.address + relativo ) AND ( ( NOT finish ) ) DO
      BEGIN
        IF ( relativo < 0 ) THEN
          THEN
          ELSE q := q.*.next_address;
        IF ( q.*.last_address = nil ) THEN finish := true;
        IF ( q.*.next_address = nil ) THEN finish := true;
      END; [ while ]
    END;
  END; [ then ]
END; [ else ]
BEGIN
  p.*.jump_toe := q;
  p.*.jump_free := p.*.next_address;
  ( default )
  p.*.jump_free := nil;
  ( DEFAULT )
  p.*.jump_type := 0;
  ( NOT SET )
  p.*.next_address := nil;
  q := p.
  END; [ then ]
ELSE BEGIN [ do some work ]
  readin( code_file );
  finish_line := true;
END; [ else ]
END; [ while ]
END;
END; [ else ]
***************************************************
9  BEGIN  {  JRW  :  (dn)/(ln)+  }
  prog_unit.*jump_type  :=  2;
  END;
10  BEGIN  {  CHK  :  d(ln)/-(ln)  }
  prog_unit.*jump_type  :=  2;
  END;
11  BEGIN  IF  {  (ll  <  13)  }
  BEGIN  (  CHK  )
  prog_unit.*jump_type  :=  2;
  END;
END;
othervise
BEGIN
  [  no_action  ]
END;
END;  [  case  ]
END;  [  if  ]
CASE  (  t1  )  OF
10  BEGIN  IF  {  (lt  =  15)  AND  (ll  =  12)  }
  THEN
  prog_unit.*jump_type  :=  6;  [  ILLEGAL  ]
  END;
14  BEGIN
CASE  (  lt  )  OF
  0  :  BEGIN
  prog_unit.*jump_type  :=  12;  [  TRAP  ]
  END;
7  BEGIN  CASE  (  ll  )  OF
  0  :  BEGIN  [  RESET  ]
  prog_unit.*jump_type  :=  7;
  END;
2  :  BEGIN  [  STOP  ]
  prog_unit.*jump_type  :=  11;
  END;
3  :  BEGIN  [  DTE  ]
  prog_unit.*jump_type  :=  8;
  END;
5  :  BEGIN  [  MPS  ]
  prog_unit.*jump_type  :=  10;
  END;
6  :  BEGIN  [  TRAPV  ]
  prog_unit.*jump_type  :=  13;
  END;
7  :  BEGIN  [  NVR  ]
  prog_unit.*jump_type  :=  9;
  END;
othervise
BEGIN
  [  no_action  ]
END;
END;  [  case  ]
END;
9  BEGIN  IF  {  (ll  <  8)  }
  BEGIN  [  JRW  :  (An)  ]
  prog_unit.*jump_type  :=  6;
  END;
10  BEGIN  IF  {  (ll  >  7)  }
  THEN
  BEGIN  [  JRW  :  (dn)  ]
  prog_unit.*jump_type  :=  6;
  END;
END;
11  BEGIN  CASE  (  ll  )  OF
  0  :  BEGIN  [  JRW  :  (dn,X1)  ]
  prog_unit.*jump_type  :=  6;
  END;
1  :  BEGIN  [  JRW  :  (dn,X1)  ]
  prog_unit.*jump_type  :=  6;
  END;
2  :  BEGIN  [  JRW  :  (dn2,X1)  ]
  prog_unit.*jump_type  :=  6;
  END;
3  :  BEGIN  [  JRW  :  (dn2,X1)  ]
  prog_unit.*jump_type  :=  6;
  END;
4  :  BEGIN  [  JRW  :  (dn,X1)  ]
  prog_unit.*jump_type  :=  6;
  END;
5  :  BEGIN  [  JRW  :  (dn,X1)  ]
  prog_unit.*jump_type  :=  6;
  END;
6  :  BEGIN  [  JRW  :  (dn,X1)  ]
  prog_unit.*jump_type  :=  6;
  END;
7  :  BEGIN  [  JRW  :  (dn7,X1)  ]
  prog_unit.*jump_type  :=  6;
  END;
8  :  BEGIN  [  JRW  :  (An2)  ]
  prog_unit.*jump_type  :=  6;
  prog_unit.*jump_address  :=
  prog_unit.*next_address  .ep  +
  prog_unit.*address  +
  START_ADDRESS;
  align_address(  prog_unit  );
  END;
9  :  BEGIN  [  JRW  :  (An1)  ]
  prog_unit.*jump_type  :=  6;
  prog_unit.*jump_address  :=
  prog_unit.*next_address  .ep  +
  prog_unit.*next_address  .next_address  +
  prog_unit.*address  +
  START_ADDRESS;
  align_address(  prog_unit  );
  END;
10  BEGIN  [  JRW  :  (dnC)  ]
  prog_unit.*jump_type  :=  6;
  set_jump(  prog_unit,  0 ,  0 );
  END;
11  BEGIN  [  JRW  :  (dnC,X1)  ]
  prog_unit.*jump_type  :=  6;
  END;
othervise
  [  RESET  ]
END;
if (ll < 6) THEN
BEGIN (JMP : (An))
prog_unit$.jump_type := 5;
END;
END;
CASE (ll) OF
0 : BEGIN (JMP : d(A0, XI))
prog_unit$.jump_type := 5;
END;
1 : BEGIN (JMP : d(A1, XI))
prog_unit$.jump_type := 5;
END;
2 : BEGIN (JMP : d(A2, XI))
prog_unit$.jump_type := 5;
END;
3 : BEGIN (JMP : d(A3, XI))
prog_unit$.jump_type := 5;
END;
4 : BEGIN (JMP : d(A4, XI))
prog_unit$.jump_type := 5;
END;
5 : BEGIN (JMP : d(A5, XI))
prog_unit$.jump_type := 5;
END;
6 : BEGIN (JMP : d(A6, XI))
prog_unit$.jump_type := 5;
END;
7 : BEGIN (JMP : d(A7, XI))
prog_unit$.jump_type := 5;
END;
8 : BEGIN (JMP : d(A0, C))
prog_unit$.jump_type := 5;
prog_unit$.jump_address :=
prog_unit$.next_address$ .ep -
prog_unit$.addr$ +
START_ADDRESS;
align_address (prog_unit$);
END;
9 : BEGIN (JMP : d(A0, L))
prog_unit$.jump_type := 5;
prog_unit$.jump_address :=
prog_unit$.next_address$.ep $06536 +
prog_unit$.next_address$.next_address$.
prog_unit$.addr$ +
START_ADDRESS;
align_address (prog_unit$);
END;
10 : BEGIN (JMP : d(PC))
END;
13 : BEGIN (JMP : d(PC, XI))
prog_unit$.jump_type := 5;
END;
CASE (ll) OF
0 : BEGIN (BRA)
set_jump (prog_unit, lt, ll);
END;
1 : BEGIN (not a jump)
{ no action }
END;
otherwise (BC)
BEGIN
set_jump (prog_unit, lt, ll);
END;
CASE (ll) OF
12 : BEGIN (DIV/DIVU)
prog_unit$.jump_type := 5;
END;
13 : BEGIN (DIV/DIVU)
prog_unit$.jump_type := 5;
END;
14 : BEGIN (DIV/DIVU)
prog_unit$.jump_type := 5;
END;
15 : BEGIN (DIV/DIVU)
prog_unit$.jump_type := 5;
END;
otherwise
BEGIN
{ no action }
END;
END;
END;
END;
Procedure COPY_LIST takes a copy of the program linked list.

PROCEDURE copy_list( var prog_unit, base : link ; list_base : link );
VAR
  p, q : link;
  first : boolean;
BEGIN
  IF diagnostic THEN writeln( trace_filo, 'copy_list');
  prog_unit := base;
  p := list_base;
  q := nil;
  first := true;
  WHILE ( p <> nil ) DO
    BEGIN
      new( prog_unit );
      prog_unit^.prog_unit := p;
      prog_unit^.list_address := q;
      prog_unit^.jump_next := nil;
      prog_unit^.jump_from := nil;
    END;
    IF first THEN
      base := prog_unit;
      first := false;
    END ( if )
  ELSE BEGIN ( tie up pointers )
    prog_unit^.list_address := p;
    prog_unit := q;
  END ( else )
  END;
END;

Procedure ADDRESS_LIST walks through the program linked list starting at BASE and assigns in a consecutive fashion address, commencing at the passed parameter START_ADDRESS, to the new needed linked list.

PROCEDURE address_list( var prog_unit : link;
                         base : link;
                         start_address : integer );
VAR
  i : integer;
BEGIN
  IF diagnostic THEN writeln( trace_filo, 'address_list');
  prog_unit := base;
  i := start_address;
  WHILE ( prog_unit <> nil ) DO
    BEGIN
      p := prog_unit;
      i := i + 2;
      prog_unit := prog_unit^.next_address;
    END;
END;

Procedure SEED_PLACEMENT injects seeds through a number of paces through code) to resolve the placement of optimum size detection mechanism to cover all erroneous jump. The parameter PLACEMENT keeps track of successful seed insertion for the recursive level of code intertigation.

PROCEDURE seed_placement( base: list;
                          level: integer;
                          placement: boolean );
VAR
  i : integer;
BEGIN
  IF diagnostic THEN writeln( trace_filo, 'seed_placement(LEVEL,level2;')
  in complete := true;
  WHILE in complete DO
    BEGIN
      in complete := false;
      marker := -1;
      unit := base;
      WHILE ( unit <> nil ) DO
        BEGIN
          UN : = unit^.next_address;
          IF diagnostic THEN writeln( trace_filo, 'INFO : Feed Complete');
          unit := base;
        END;
        IF ( unit <> nil ) THEN BEGIN
          BEGIN
            IF ( marker = -1 ) THEN marker := unit^.next_address;
            altered := seed_location( unit, level, marker );
            in complete := in complete OR altered;
            placement := placement OR altered;
          END ( if )
        END;
        unit := base;
      END;
    END;
  END;
END (if)
ELSE incomplete := false;
END; (while)
IF (level > 0) THEN
FOR scan := 0 TO (level -1) DO
coed_placement( base1, scan, plocament);
END;


****************************************************************************

Procedure 3EED LIST repeatedly goes through the linked program list until
all erroneous jumps are resolved by coed. The number of passes required
is dependent on the complexity of invalid branch groups.

*****************************************************************************

PROCEDURE seed_list( var pl, basei : link);
VAR
complete : boolean; ( Seed placements complete in code pass )
placement : boolean; ( Seed placements made in code pass )
level : integer; ( No. interocess erroneous jumps in group )
BEGIN
IF diagnostic THEN writeln( traco_filo, 'seed_list' );
complete := false;
level := 0;
WHILE NOT complete DO
BEGIN
complete := true;
placement := false;
coed_placement( basei, level, plocament );
pl := basei;
WHILE ( pl <> nil ) DO
BEGIN
IF ( NOT pl.hooded ) THEN complete := false;
pl := pl.next_address;
END; (while)
IF ( NOT complete AND NOT placement ) THEN level := level + 1;
END; (while)
END; ( begin)


****************************************************************************

Procedure RELOCATE LIST walks through the linked program list relocating
each instruction to a 8-word address boundary. Specifically 8-word because
the 64000 max instruction is 8-words.

*****************************************************************************

PROCEDURE relocate_list;
CONST
BOUNDARY = 16;
VAR
i : integer;
q, r, x : link; ( program linked list pointer )
first : boolean;
finish : boolean;
BEGIN
IF diagnostic THEN writeln( traco_filo, 'relocate_list' );
q := nil;
new( x ); ( NOP )
r.HJUMP_TEO := nil; ( DEFAULT )
r.HJUMP_FROM := nil; ( DEFAULT )
r.HADD := 0; ( NOT SET )
r.HOPTYPE := OP_CODE; ( SET )
r.HOP := 29001; ( SET )
r.HCOED := false; ( DEFAULT )
r.HOFFSET := 0; ( DEFAULT )
r.HJUMP_ADDRESS := 0; ( NOT SET )
r.HJUMP_TYPE := 0; ( NOT SET )
r.HLAST_ADDRESS := nil; ( NOT SET )
r.HMEXT_ADDRESS := nil; ( NOT SET )
p2 := base2;
first := true;
q := nil;
WHILE ( p2 <> nil ) DO
BEGIN
i := 2; ( operands )
p2.HLAST_ADDRESS := q;
p2 := p2.HNEXT_ADDRESS;
finish := false;
WHILE ( p2 <> nil ) AND NOT finish DO
BEGIN
IF ( p2.HOPTYPE <> OPERAND ) THEN
finish := true
ELSE BEGIN
i := i + 2;
p2 := p2.HNEXT_ADDRESS;
END; (else)
END; (while)
IF ( p2 <> nil ) THEN
BEGIN
p := p2;
q := p2.HLAST_ADDRESS;
WHILE ( i < BOUNDARY ) DO
BEGIN
i := i + 2; ( NOP )
new( p2 );
p2 := p2.HNEXT_ADDRESS;
p2.HLAST_ADDRESS := q;
p2.HNEXT_ADDRESS := p2;
q := p2;
END; (while)
p2 := p;
q.HNEXT_ADDRESS := p2;
END; (if)
END; (while)


*****************************************************************************

Procedure STREAM LIST walks through the linked program list until hashing
blocks are inserted.

*****************************************************************************

PROCEDURE stream_list;
CONST
HASH_NUMBER = 1; ( Number of units required for hashing )
RANDOM GENERATOR = 16390;
VAR

BEGIN
  IF diagnostic THEN writeln( trace_file, ' standard_offset' );
  IF ( address_offset < 0 ) THEN
    address_offset := address_offset + address_offset;
    IF (( l = 0 ) AND ( l = 0 )) THEN
      BEGIN ( alter next word )
        p.nxt_address := address_offset;
      END ;
      ELSE BEGIN ( alter lenor byte )
        p.nxt_op := ( p.nxt_op div 256 ) + address_offset;
      END ;
    END ;
    ELSE
      writeln( trace_file, ' ERROR : offset underflow at `, p.address ; 6 ;
    END ;

BEGIN
  IF diagnostic THEN writeln( trace_file, ' standard_offset' );
  IF ( address_offset > 255 ) THEN
    address_offset := address_offset + address_offset;
    IF (( l = 0 ) AND ( l = 0 )) THEN
      BEGIN ( alter next word )
        p.nxt_address := address_offset;
      END ;
      ELSE BEGIN ( alter lenor byte )
        p.nxt_op := ( p.nxt_op div 256 ) + address_offset;
      END ;
    END ;
    ELSE
      writeln( trace_file, ' ERROR : offset overflow at `, p.address ; 6 ;
    END ;

BEGIN
  IF diagnostic THEN writeln( trace_file, ' standard_offset' );
  IF ( address_offset > 255 ) THEN
    address_offset := address_offset + address_offset;
    IF (( l = 0 ) AND ( l = 0 )) THEN
      BEGIN ( alter next word )
        p.nxt_address := address_offset;
      END ;
      ELSE BEGIN ( alter lenor byte )
        p.nxt_op := ( p.nxt_op div 256 ) + address_offset;
      END ;
    END ;
    ELSE
      writeln( trace_file, ' ERROR : offset underflow at `, p.address ; 6 ;
    END ;
END:
15 : BEGIN
IF ( l1 = 10 ) THEN ( JNP : d(PC) )
standard_offset( 0, 0, prog_unit );
END;
otherwise
BEGIN
{ no action }
END;
END; { case }
END;
6 : BEGIN
CASE ( t1 ) OF
0 : BEGIN ( BRA )
IF ( NOT ( ( l1 = 0 ) AND ( l1 = 6 ) ) ) AND
{ prog_unit".next_address".op = SEED } THEN ( not no
standard_offset( l1, l1, prog_unit ));
END;
1 : BEGIN ( not a jump )
{ no action }
END;
otherwise ( BOC )
BEGIN ( Boc : t1 = 2 to 15 )
standard_offset( l1, l1, prog_unit );
END;
END; { case }
END;
otherwise
BEGIN
{ no action }
END;
END; { case }
CASE ( prog_unit".op ) OF ( check absolute jump )
20152 : BEGIN ( JN = Abc.C)
write1n( trace_file, ' abo address' );
prog_unit".jump_address" :=
prog_unit".jump_to".address -
prog_unit".address";
prog_unit".next_address".op :=
prog_unit".jump_to".address;
END;
20153 : BEGIN ( JNP = Abc.L)
write1n( trace_file, ' abo address' );
prog_unit".jump_address" :=
prog_unit".jump_to".address -
prog_unit".address";
prog_unit".next_address".op :=
prog_unit".jump_to".address div 66536;
prog_unit".next_address".next_address".op :=
prog_unit".jump_to".address mod 66536;
END;
2016 : BEGIN ( JNP = Abc.C)
write1n( trace_file, ' abo address' );
prog_unit".jump_address" :=
prog_unit".jump_to".address -
prog_unit".address";
prog_unit".next_address".op :=
prog_unit".jump_to".address;
END;
2017 : BEGIN ( JNP = Abc.L)
write1n( trace_file, ' abo address' );
prog_unit".jump_address" :=
prog_unit".jump_to".address -
prog_unit".address";
prog_unit".next_address".op :=
prog_unit".jump_to".address;
END;
BEGIN;
if ( pl = basel ) { then }
WHILE ( pl <> nil ) { do }
BEGIN
IF ( pl".op_type = op CODE ) AND ( pl".op > 20576 ) AND
( pl".op < 24580 ) THEN ( BBA )
BEGIN
PL1 := pl ( starting point )
pl := pl".next_address;
mechanism := true;
WHILE ( ( pl".op_type = op CODE ) ) AND ( pl <> nil ) { do }
BEGIN
IF ( ( pl".op = SEED ) ) THEN mechanism := false;
pl := pl".next_address;
END; { while }
pl := pl".next_address;
mechanism := true;
WHILE ( ( pl".op_type = op CODE ) ) AND ( pl <> nil ) { do }
BEGIN
IF ( ( pl".op = SEED ) ) THEN mechanism := false;
pl := pl".next_address;
END; { while }
***********
PROCEDURE tidy_list ( var pl, basel : list );
VAR
q1, q2 : list;
mechanism : boolean;
BEGIN
IF diagnostic THEN write1n( trace_file, ' tidy_list' );
pl := basel;
WHILE ( pl <> nil ) { do }
BEGIN
IF ( ( pl".op_type = op CODE ) AND ( pl".op > 20576 ) AND
( pl".op < 24580 ) ) THEN ( BBA )
BEGIN
q1 := pl ( starting point )
pl := pl".next_address;
mechanism := true;
WHILE ( ( pl".op_type = op CODE ) ) AND ( pl <> nil ) { do }
BEGIN
IF ( ( pl".op = SEED ) ) THEN mechanism := false;
pl := pl".next_address;
END; { while }
pl := pl".next_address;
mechanism := true;
WHILE ( ( pl".op_type = op CODE ) ) AND ( pl <> nil ) { do }
BEGIN
IF ( ( pl".op = SEED ) ) THEN mechanism := false;
pl := pl".next_address;
END; { while }
***********
IF ( mechanism ) THEN { found adjacent mechanism: }
BEGIN
q1*.jump_address := q1*.jump_address + q2*.jump_address;
q1*.op := q1*.op + q2*.jump_address;
q1*.jump_toe := pl;
q2*.offset := 0;
q2*.jump_toe := jump_from := q1;
q2*.jump_toe := nil;
q2*.op := SEED;
q2*.op_type := OPERAND (SEED);
q2*.jump_type := 12; (TRAP)
q2*.jump_address := 1; (SEED);
pl*.jump_from := q1;
pl := pl*.next_address;
END; [if]
END; [if]
p1 := pl*.next_address;
END; [while]
END;

[***********************************************************************

Procedure PRINT_LIST outputs the linked list fields of address, opcode or
oprand, contents of location, and where appropriate the location pointed
to if the present instruction is a jump.

**************************************************************************]

PROCEDURE print_list( num : integer ; title : titletype);
VAR
prog_unit : link;
BEGIN
IF diagnostic THEN writeln( trace_file, 'print_list' );
CASE ( num ) OF
  0 : BEGIN [ Original Code ]
  prog_unit := base0;
END;
  1 : BEGIN [ Detection Mechanism Placement ]
  prog_unit := base1;
END;
  2 : BEGIN [ Boundary Reallocation ]
  prog_unit := base2;
END;
  3 : BEGIN [ Instruction Stream Signature ]
  prog_unit := base3;
END;
  otherwise
  BEGIN [ no action ]
END;
END; [case]
writeln( result_file[ num ] );
writeln( result_file[ num ], '--------------------------');
end;
ELSE
BEGIN
writeln( result_file[ num ] );
WHILE ( prog_unit <> nil ) DO
BEGIN
write( result_file[ num ], prog_unit*.address, prog_unit*.op_type, 
prog_unit*.op, prog_unit*.seeded);
BEGIN ( prog_unit*.jump_address <> 0 ) THEN
BEGIN
IF ( prog_unit*.jump_address mod 2 = 0 ) THEN
writeln( result_file[ num ], ' Exception' );
ELSE writeln( result_file[ num ];
prog_unit := prog_unit*.next_address;
END; [while]
END;

[*****************************************************************************

The MAIN program starts here, and initially constructs a linked list from
the disassembly output from 'dis' ( Vitosco 5 System, Engineering Dept,
University of Durham ), then places nodes throughout there required, then
re-aligns any called addresses within the program list, and then finally
tidies the program list. The completed program list is output into a file
'results' in a coding type format.

*****************************************************************************]

BEGIN
Xdiagnostic := true;
rewrite ( trace_file, 'file=trace' );
base0 := nil;
base1 := nil;
base2 := nil;
base3 := nil;
p0 := base0;
p1 := base1;
p2 := base2;
p3 := base3;
writeln( trace_file, 'TRACE FOR FANUT' );
writeln( trace_file, '--------------------------');
writeln( trace_file );
IF diagnostic THEN writeln( trace_file, 'program' );
writeln( result_file[0], 'file=FANUTtrace0' );
writeln( result_file[1], 'file=FANUTtrace1' );
writeln( result_file[2], 'file=FANUTtrace2' );
writeln( result_file[3], 'file=FANUTtrace3' );
writeln( result_file[4], 'file=analysis' );
writeln( result_file[5], 'file=analysis2' );
writeln( result_file[6], 'file=code' );
writeln( 'user, 'file=SOURCE,INTERACTIVE' );
area := 0;
choice := false;
writeln( 'FANUT : TRANSIENT FAULT RECOVERY TOOL' );
writeln( '--------------------------');
writeln( 'INFORMATION INPUT : Please type appropriate responses');
writeln( '--------------------------');
WRITE( 'INFORMATION INPUT : Please type appropriate responses');
ELSE
BEGIN
writeln('Data or Program Area (DATA/PROGRAM) ?');
readln( user, response );
BEGIN
END;
END;
END;
END;
END;
writeln('Randomly generated or code input?');
readln(user, response);
if (response = RANDOM) then
begin
choice := true;
buid_reg(p0, base0);
writeln(analysis_file, 'randomly generated');
end; (if)
end; (if)
end; (while)
writeln(analysis_file, 'data area: analysis');
writeln(analysis_file, '-----------------'');
write(analysis_file, '

jumps inside jumps outside no. seeds recovery factor');
writeln(analysis_file, 'bytes

data area data area

build_jumps(p0, base0)
print_list(0, 'original')
data_analysis(base0)
end; (if)
if response = 'progress' then
begin
writeln(analysis_file, 'program area: analysis');
writeln(analysis_file, '-----------------'');
write(analysis_file, '

program area

no. unseeded no. seed blocks

area increase

operands jumps

write(analysis_file, 'bytes

build_code
response = 'no'
writeln('option mechanism placement (yes/no)');
readln(user, response)
if (response = 'yes') then
begin
copy_list(p1, base1, base0)
buid_jumps(p1, base1)
response := 'no'
writeln('option placement (yes/no)');
readln(user, response)
if (response = 'yes') then option_code = true
else option_code = false
seed_list(p1, base1)
tidy_list(p1, base1)
address_list(p1, base1, start_address)
align_list(p1, base1)
print_list(1, 'option mechanism placement')
if (if)
end; (if)
response := 'no';
C.3. Microprocessor Description File, MICRO_FILE

The following enclosure is a copy of MICRO_FILE which describes the defined instructions within the Motorola 68000 microprocessor instruction set. Similar versions of MICRO_FILE can be designed for the Motorola 68010 and 68020 microprocessor instruction sets.

The first line of the file contains the number of remaining lines in the file, in this case 359. Each of the remaining lines of this file contains a 16 character code which may or may not be followed by a comment string. For instance, consider the second line of the MICRO_FILE example,

$\begin{array}{c}
1100XX110000XXXX \\
\text{16 character code} \\
ABCD \\
\text{comment string}
\end{array}$

The 16 character code represents a 16-bit opcode value, where the most significant bit is leftmost (or the first character read). Characters ‘1’, ‘0’, and ‘X’ denote logic values ‘1’, ‘0’, and ‘don’t care’ (i.e. either logic value) respectively. The arbitrary logic representation allows multiple opcode values to be represented by a single entry in MICRO_FILE. This is particularly beneficial in the case of the Motorola 68000 microprocessor instruction set which has 43342 defined instructions. Comment strings following character codes describe the instruction represented. Some instructions require several character codes to describe their opcode values, in which case a comment string is only attached to the first entry associated with that instruction.
<table>
<thead>
<tr>
<th>1011XX010110XXX</th>
<th>0011XX000110XXX</th>
</tr>
</thead>
<tbody>
<tr>
<td>1011XX0101110XX</td>
<td>0011XX0001110XX</td>
</tr>
<tr>
<td>1011XX010111100</td>
<td>0011XX000111100</td>
</tr>
<tr>
<td>1011XX110100XXX CMPA</td>
<td>0011XX0010100XXX</td>
</tr>
<tr>
<td>1011XX1101110XX</td>
<td>0011XX0010110XX</td>
</tr>
<tr>
<td>1011XX110111100</td>
<td>0011XX001011100</td>
</tr>
<tr>
<td>0001100000XXX CMPF</td>
<td>0011XX0010000XXX</td>
</tr>
<tr>
<td>00011000011XXX</td>
<td>0011XX0010001XXX</td>
</tr>
<tr>
<td>00011000101XXX</td>
<td>0011XX0010010XXX</td>
</tr>
<tr>
<td>00011000110XXX</td>
<td>0011XX0010011XXX</td>
</tr>
<tr>
<td>00011001001XXX</td>
<td>0011XX0010100XXX</td>
</tr>
<tr>
<td>00011001011XXX</td>
<td>0011XX0010101XXX</td>
</tr>
<tr>
<td>00011001101XXX</td>
<td>0011XX0010110XXX</td>
</tr>
<tr>
<td>00011001110XXX</td>
<td>0011XX0010111XXX</td>
</tr>
<tr>
<td>00011010000XXX EOR</td>
<td>0011XX0000000XXX MOVE (WORD &amp; LONG)</td>
</tr>
<tr>
<td>00011010001XXX</td>
<td>0011XX0000001XXX</td>
</tr>
<tr>
<td>00011010011XXX</td>
<td>0011XX0000011XXX</td>
</tr>
<tr>
<td>00011010101XXX</td>
<td>0011XX0000101XXX</td>
</tr>
<tr>
<td>00011010110XXX</td>
<td>0011XX0000110XXX</td>
</tr>
<tr>
<td>00011010111XXX</td>
<td>0011XX0000111XXX</td>
</tr>
<tr>
<td>00011011000XXX BORI</td>
<td>0011XX0000100XXX</td>
</tr>
<tr>
<td>00011011011XXX</td>
<td>0011XX0000110XXX</td>
</tr>
<tr>
<td>00011011100XXX EORI</td>
<td>0011XX0000111XXX</td>
</tr>
<tr>
<td>00011011110XXX</td>
<td>0011XX0000100XXX</td>
</tr>
<tr>
<td>00011011110XXX</td>
<td>0011XX0000110XXX</td>
</tr>
<tr>
<td>00011011111XXX</td>
<td>0011XX0000111XXX</td>
</tr>
<tr>
<td>00011011100XXX</td>
<td>0011XX0000100XXX</td>
</tr>
<tr>
<td>000110111100000000000</td>
<td>0011XX0000110XXX</td>
</tr>
<tr>
<td>000110111110000000000</td>
<td>0011XX0000111XXX</td>
</tr>
<tr>
<td>000110111110100000000</td>
<td>0011XX0000100XXX</td>
</tr>
<tr>
<td>000110111111000000000</td>
<td>0011XX0000110XXX</td>
</tr>
<tr>
<td>000110111111100000000</td>
<td>0011XX0000111XXX</td>
</tr>
<tr>
<td>000110111111100000000</td>
<td>0011XX0000100XXX</td>
</tr>
<tr>
<td>000110111111100000000</td>
<td>0011XX0000110XXX</td>
</tr>
<tr>
<td>000110111111110000000</td>
<td>0011XX0000111XXX</td>
</tr>
<tr>
<td>000110111111110000000</td>
<td>0011XX0000100XXX</td>
</tr>
<tr>
<td>000110111111110000000</td>
<td>0011XX0000110XXX</td>
</tr>
<tr>
<td>000110111111110000000</td>
<td>0011XX0000111XXX</td>
</tr>
<tr>
<td>000110111111110000000</td>
<td>0011XX0000100XXX</td>
</tr>
<tr>
<td>000110111111110000000</td>
<td>0011XX0000110XXX</td>
</tr>
<tr>
<td>000110111111110000000</td>
<td>0011XX0000111XXX</td>
</tr>
<tr>
<td>000110111111110000000</td>
<td>0011XX0000100XXX</td>
</tr>
<tr>
<td>000110111111110000000</td>
<td>0011XX0000110XXX</td>
</tr>
<tr>
<td>000110111111110000000</td>
<td>0011XX0000111XXX</td>
</tr>
<tr>
<td>000110111111110000000</td>
<td>0011XX0000100XXX</td>
</tr>
<tr>
<td>000110111111110000000</td>
<td>0011XX0000110XXX</td>
</tr>
<tr>
<td>000110111111110000000</td>
<td>0011XX0000111XXX</td>
</tr>
</tbody>
</table>
C.4. Target Software, CODE FILE

As described in Chapter 6, CODE FILE contains the target software to be processed by PARUT. The software is presented to PARUT in a machine code representation. The example of CODE FILE in this enclosure is generated by the UNIX 'adb' facility.

CODE FILE can be split into three sections. The first three lines of the file contain redundant information which is ignored when the file is processed by PARUT. The remaining two sections are each generated by an 'adb' command of the form,

\[ <\text{address}>, <\text{count}> <\text{request}> <\text{modifier}> \]

where \(<\text{address}>\) specifies the location from which processing commences, \(<\text{count}>\) specifies the number of consecutive locations to be processed, and \(<\text{request}>\) specifies the output of the operation specified by \(<\text{modifier}>\). Further details of the UNIX 'adb' facility can be found in Bourne [1982].

The next section of CODE FILE contains a source code dump and is produced by the command,

\text{lstart,180?i}

The second section contains information on the location of opcodes within the source code and is generated by the command,

\text{lstart,180?x}

In both 'adb' commands, 'lstart' is a label in the source code denoting the start of the information to be retrieved, '180' is the hexadecimal number of code lines to be extracted, '?' specifies output to the file system file 'a.out' (later renamed CODE FILE), and 'i' and 'x' specify code dump and opcode information respectively.
a.out file = a.out
cannot open core
read

listart: 0x13FC 0xFF 0x2 0x6602 0x13FC 0x0 0x2 0x6604
0x13FC 0x92 0x2 0x660E 0x13FC 0x1 0x2 0x660A
0x13FC 0xFF 0x2 0x660C 0x203C 0x0 0x1 0x01F9
0x0 0x1A 0x0E28 0xF 0x223C 0x0 0x0 0x1239 0x2
0x6608 0x201 0x1 0x6600 0x10 0x13FC 0x0 0x2
0x6606 0x0FF9 0x0 0x66
ldrain: 0x13FC 0x0 0x2 0x6606
loop: 0x27C 0x0FF 0xFF 0x27F9 0x0 0x0 0x6E4F 0xAA 0x6E71 0x6E71
0x4E71 0x6E71 0x6E79 0x0 0x60
list: 0x2FF1 0x223C 0x0 0x0 0x1239 0x2 0x660C 0x201
0x10 0x6700 0x2A3A 0x1239 0x2 0x6608 0x201 0x2
0xC01 0x0 0x6600 0x28 0x4E69 0x0 0x1A0E 0x1239
0x2 0x6608 0x201 0x2 0x6600 0x0 0x6600 0x10
0x13FC 0x0 0x2 0x6606 0x0E2F 0x0 0x102E
lint1: 0x223C 0x0 0x0 0x1239 0x2 0x660C 0x201 0x2
0x6700 0x2A3E 0x1239 0x2 0x6608 0x201 0x1 0xC01
0x1 0x1 0x6600 0x22F 0x6E9 0x0 0x1A 0x1239 0x2
0x6608 0x201 0x1 0xC01 0x1 0x6600 0x0A 0x13FC
0x1 0x2 0x6606
lout: 0x13FC 0xFF 0x2 0x660C 0x221F 0x4E73
lwait: 0x2F06 0x223C 0x0 0x6FF
l2: 0x686 0x0 0x1 0x6600 0x0FF8 0x2C1F 0x4E75

message: text address not found

listart: mov.b 0x2FF, $dra
listart+0x8: mov.b 0x2E, $dbr
listart+0x10: mov.b 0x2F, $cr
listart+0x18: mov.b 0x21, $cr
listart+0x20: mov.b 0x2FF, $frr
listart+0x28: mov.l 0x21, $dl
listart+0x2C: lea listart, %a0
listart+0x34: trap 0xFF
listart+0x38: mov.l.1 $d0, $d1
listart+0x3C: mov.b $r8, $d1
listart+0x44: and.b $d1, $d1
listart+0x48: bne ldrain
listart+0x4C: mov.b $d0, $r0
listart+0x54: jmp loop
ldrain: mov.b $d0, $r0
loop: loop: and. $d0FF, %a0
loop+0x4: inc listart, %a3
loop+0x8: trap 0xFF
loop+0x10: nop
loop+0x14: nop
loop+0x18: jmp loop
section 3 (cont.)

list: mov.l 0d1, -(%7)
list+0x8: mov.b $d0, $d1
list+0x14: mov.b $r8, $d1
list+0x18: and.b $d1, $d1
list+0x1C: bne lwait
list+0x20: mov.b $r8, $d1
list+0x24: mov.b $r8, $d1
list+0x28: jmp lout
list+0x2C: bne lwait
list+0x30: mov.b $r8, $d1
list+0x34: and.b $d1, $d1
list+0x38: cmp.b $d1, $d0
list+0x3C: bne lwait
list+0x40: mov.b $r8, $d1
list+0x44: bne lwait
list+0x48: lea listart, %a3
list+0x4C: mov.b $d0, $r0
list+0x54: jmp loop
ldrain: mov.b $d0, $r0
loop: loop: and. $d0FF, %a0
loop+0x4: inc listart, %a3
loop+0x8: trap 0xFF
loop+0x10: nop
loop+0x14: nop
loop+0x18: jmp loop
message: text address not found
C.5. Target Software with Fault Tolerance, RESULT.FILE

The application of software implemented fault tolerance can be complex. This enclosure contains an example of the enhanced code generated by PARUT when it applies fault tolerance to the target software. The format of the file is tailored to the development needs of PARUT.

RESULT.FILE contains a listing representing the target machine code which has four columns of information. The first column denotes the decimal address of an opcode or operand. The second and third columns denote whether the location content is an opcode ('true') or an operand ('false'), and the decimal value of that content respectively. The fourth column, by default, is set to 'true'. Positions marked 'false' describe identified potential invalid branches that require resolving by the insertion of a detection mechanism. Finally, the fifth column contains the relative target destination displacement of jump related instructions which have a valid or invalid interpretation. This column may also contain entries marked 'exception' which describe the action of a seed within a placed detection mechanism.

Within the listing of RESULT.FILE, the locations of detection mechanisms inserted within the target software are highlighted by a surrounding box, and potential invalid branches have arrows drawn to emphasis their location and action.
C.6. PARUT Report File, ANALYSIS.FILE

This enclosure contains the tables of information generated by PARUT when it assesses the fault tolerance of target software. The first table provides a summary of hazards posed by invalid branches and, where appropriate, the inclusion of details of the fault tolerance achieved by applying a software implemented fault tolerant technique. The second table collates information regarding the distribution and action of jump related instructions during erroneous execution.

The information contained within ANALYSIS.FILE provides an indication of the target software fault tolerance. In particular, it details the recovery capability of erroneous jumps within target software. The recovery performance of the target software is dependent of the instruction sequence, a dynamic process, and hence the static analysis contained within the file has limited application. Chapter 7 describes experiments which assess the fault tolerance of target software through dynamic testing.
<table>
<thead>
<tr>
<th>TYPE</th>
<th>PROGRAM AREA (BYTES)</th>
<th>NO. UNSEEDED OPERAND JUMPS</th>
<th>NO. SEED BLOCKS</th>
<th>AREA INCREASE</th>
</tr>
</thead>
<tbody>
<tr>
<td>ORIGINAL CODE</td>
<td>308</td>
<td>17</td>
<td>0</td>
<td>0.000 %</td>
</tr>
<tr>
<td>DETECTION MECHANISM PLACEMENT</td>
<td>416</td>
<td>0</td>
<td>16</td>
<td>35.065 %</td>
</tr>
</tbody>
</table>

**ANALYSIS_FILE (1)**
### Original Code

<table>
<thead>
<tr>
<th>OPCODE</th>
<th>CODE</th>
<th>OPCODE</th>
<th>OPCODE</th>
</tr>
</thead>
<tbody>
<tr>
<td>RETAIN</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
</tbody>
</table>

### TOTAL

<table>
<thead>
<tr>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
</tbody>
</table>

### DETECTION MECHANISM PLACEMENT

<table>
<thead>
<tr>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
</tbody>
</table>

### TOTAL

<table>
<thead>
<tr>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
<th>CODE</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
</tbody>
</table>

**ANALYSIS_FILE (2)**
C.7. PARUT Diagnostics, TRACE_FILE

The final enclosure of this appendix contains an example of TRACE_FILE. This file is generated when the 'diagnostic' facility within PARUT is activated. The file contains, in chronological order, a list of procedures and functions operated by PARUT. Indented entries in the file denote nested module calls. TRACE_FILE is intended to aid analyst/programmer comprehension of the PARUT function. Examination of this file should be made in association with Chapter 6.
TRACE FOR PART

program
build_cdb
copy_list
build_jumps
}
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : False
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
instru_set : True
instru_set : True
 instru_set : True
set_jump displacement lower byte
place_men_deadlock : False
intr
C»

§

§
S 3" " S

2- - 8 8
8

0 0

0

fa

0

88

S S 3 g 1 S8
P-

0

tJ
O

O
*>

O

Q

Q

Q O »
-H O
0 +> U r4
O O - H fiJ

O

-6

|

I U

0

TJ

B

>

1

\U

O

1

D

9

a

2 2 22

0

O

0

E

0

a 9 9 9 9 9
I (V xJ xJ xa XJ

O

O

O

V

B

O

< 9 9 9 ? 9 3 d ?
I <V xJ xJ xJ xJ
I O. x l

: c so

ft-3

ft

M

ft

I

ft

JQ

X>

J )

U

U

U

ft

g

A
M

§
1 1 i i
*j2^22*jjw22222.22.22222224jj*22 2 ^ 2 ^ 2 ^ 2 ^ 2 2 2 2^
u.
go
b«cu
g o
3
§ € § 9u § o2 § 2u § 2o '
5 o
a a
a
a
a
a
M

"O

a
„

v

u

„

u

„

0

0

O

0

0

0

B

0

O

O

O

O

O

O

O

O

O

O

a a 5 S a o o a a a a 0 . a a, o o o a a o o a a a a a a a Q O
3 P p' P p 11 I
I
I JI»'»'
p'p I 8 p
g
»'='
1

1

1

5 5 B w n a n i h a "a a S a

v

b.

B r *

•** V

>

I

§
S " "

l H

H

I
I

8§

c c c c o

c o c o S o c o c c c o o

1

1

O
O

O

O

O

<H

i
S

r-t

a
0

0

O

rJ

o,
0

<-«

a
0

_

22
2
§
o
a

»H

a.
O

I

a
0

0

0

0

0

.

o
0

*

0

°

°

o o . a o o . a. a. a c s a o . A a . & a , o, a, o a , a a *g a a
p P »' «,' I S »' p ' f »'£ » ' f »•? g » ' P ' g 1 p' j p' '
Q

1

1

1

1

1

B

11 a a a a a a a 5 a a a l a a a a *ianana3! a | a a 3 a §>* I

c e c c B c a c a o c c c f l o

271

e c c o c o c o c o a c c c e o e ^ a c

B


# Appendix D

## EXAMPLE PROGRAMS

<table>
<thead>
<tr>
<th>Section</th>
<th>Title</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>D.1</td>
<td>Introduction</td>
<td>273</td>
</tr>
<tr>
<td>D.2</td>
<td>Program 'A' Targeting the Motorola 68000 Microprocessor</td>
<td>274</td>
</tr>
<tr>
<td>D.3</td>
<td>Program 'B' Targeting the Motorola 68(7)05 Microprocessor</td>
<td>284</td>
</tr>
<tr>
<td>D.4</td>
<td>Program 'C' Targeting the Intel 80386 Microprocessor</td>
<td>288</td>
</tr>
</tbody>
</table>
D.1. Introduction

This appendix has three main enclosures, each containing an example of the application of the software implemented fault tolerant technique proposed in this thesis. Each enclosure has a suite of program listings, each suite having a different target processor for program implementation. The first and second enclosures contain three program code listings concerning Program A and Program B respectively. The first enclosure listing details the original code written in Assembler. The second and third enclosure listing show the insertion at Assembly code level of the detection mechanisms planted by the software implemented fault tolerant technique. The second listing shows the insertion of default size detection mechanisms, and the third listing shows the insertion of an optimum size detection mechanism for each placement. The final enclosure contains Assembler code listings, like the previous enclosures, except that there is an initial high-level language listing of the original program before its compilation to Assembler code.

To aid comprehension of the listings, two types of function are marked. Firstly, erroneous jumps are highlighted by connecting the generator and destination code, which are circled, with an arrow denoting the branch direction. Secondly, inserted detection mechanisms are distinguished from the program code by their encapsulation in rectangular boxes. The detection capability of the inserted mechanisms is shown where the destination of an erroneous jump lies with a detection mechanism. Erroneous jumps whose destination is outside the memory occupied by the software are represented by arrows which terminate with a 'star' symbol denoting detection by an Access Guardian.
D.2. Program ‘A’ Targeting the Motorola 68000 Microprocessor

The original version of Program ‘A’ is written in Assembler for the 16-bit Motorola 68000 microprocessor and is shown as the first in this enclosure. The program is written as an example for the Engineering Microprocessor Laboratory at the University of Durham. It monitors two water reservoirs and controls the level of one by pumping water from or draining water to the remaining reservoir.

The second and third listings shown the insertion of detection mechanisms by the software implementation of fault tolerance proposed in this thesis. The second and third listings respectively detail default size and optimum size detection mechanism placements.
This program was originally written by Clivo Procio as an example program for the Engineering Microcomputer Laboratory. It monitors two water reservoirs and controls the water level of one by pumping from or draining water to the remaining reservoir.

---

**Listing**

<table>
<thead>
<tr>
<th>Addr</th>
<th>Code</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>00000 L3F 00F 0002</td>
<td>movb 03, data</td>
<td>loop A side for output</td>
</tr>
<tr>
<td>00000 L3F 00F 0002</td>
<td>movb 03, data</td>
<td>loop B side for input</td>
</tr>
<tr>
<td>00010 L3F 009 0002</td>
<td>movb 03x2, xar</td>
<td>initialize C1L and O1L</td>
</tr>
<tr>
<td>00010 L3F 009 0002</td>
<td>movb 03, per</td>
<td>loop active edge for CHI bottom float</td>
</tr>
<tr>
<td>00010 L3F 009 0002</td>
<td>movb 03, per</td>
<td>loop active edge for CHI top float</td>
</tr>
<tr>
<td>00020 L3F 00F 0002</td>
<td>movb 03x2, xfr</td>
<td>initial flag register</td>
</tr>
</tbody>
</table>

---

**Comment**

- **First call up VIA register**
  - Address of data direction reg. A side
  - Address of data direction reg. B side
  - Address of output reg. A side
  - Address of output reg. B side
  - Address of input reg. A side
  - Address of input reg. B side
  - Address of peripheral control reg
  - Address of interrupt flag reg
  - Address of interrupt control reg

- **Loop:**
  - mmb 032, data
  - mmb 03, data
  - movb 03x2, xar
  - movb 03, per
  - movb 03, per
  - movb 03x2, xfr

---

**Interruptions**

- **Trap 015:**
  - CALL Monitor routine

- **Trap 016:**
  - word 13
  - SETVCC

- **VIA and interrupts are now out of scope, as pumping can begin.**
  - Do not first test the state of the float. Instead, after the pump starts, as control will only occur after a float transition takes place.

- **Loop:**
  - mmb 03x2, xfr
  - mmb 03, data
  - mmb 03, data
  - movb 04, xar

- **Loop:**
  - jmp loop
  - Jump to exit loop

- **The cycle now starts, the rest of the control will be by interrupt.**

---

**Interrupt Routine List**

- **Trap 015:**
  - movl 01, d1
  - movl 00, d1
  - movl 02, d1
  - movl 03, d1

- **Trap 016:**
  - movb 01, xar
  - movb 02, xar
  - movb 03, xar
  - movb 04, xar

- **Loop:**
  - jmp loop
This program was originally written by Clive Prosser as an example program for the Engineering Microprocessor Laboratory. It monitors two motor reservoirs and controls the motor level of one by pumping free or draining water to the remaining reservoir.

|| Addr | Code | Listing |
|------|------|---------|
| 0000 | 13FC | .word 05601 | LIM OVER |

; Location: 0020-0021

; Description: Pump Control using interrupts with fault tolerance

; Using registers with own byte branch format

; First set up VIA registers

<table>
<thead>
<tr>
<th>Addr</th>
<th>Code</th>
<th>Listing</th>
</tr>
</thead>
<tbody>
<tr>
<td>0002</td>
<td>6600</td>
<td>.word 05600</td>
</tr>
<tr>
<td>0002</td>
<td>6601</td>
<td>.word 05601</td>
</tr>
<tr>
<td>0002</td>
<td>6602</td>
<td>.word 05602</td>
</tr>
<tr>
<td>0002</td>
<td>6603</td>
<td>.word 05603</td>
</tr>
<tr>
<td>0002</td>
<td>6604</td>
<td>.word 05604</td>
</tr>
<tr>
<td>0002</td>
<td>6605</td>
<td>.word 05605</td>
</tr>
<tr>
<td>0002</td>
<td>6606</td>
<td>.word 05606</td>
</tr>
</tbody>
</table>

; Location: 0030-0031

; Description: Control the motor level of one by pumping free or draining water to the remaining reservoir.

; Location: 0040-0041

; Description: Control the motor level of one by pumping free or draining water to the remaining reservoir.

; Location: 0050-0051

; Description: Control the motor level of one by pumping free or draining water to the remaining reservoir.

; Location: 0060-0061

; Description: Control the motor level of one by pumping free or draining water to the remaining reservoir.

; Location: 0070-0071

; Description: Control the motor level of one by pumping free or draining water to the remaining reservoir.
This program was originally written by Clive Proctor as an example program for the Engineering Microcomputer Laboratory. It monitors two motor reservoirs and controls the motor level of one by pumping from or draining to the reservoir.

Addr Code Listing

[Diagram of program flow and machine code]

| G2(e). Pump Control using interrupts with fault tolerance |
| Using registers with even byte branch format |
| First set up VIA registers |
| 0002 6602 | ddi = 026602 | address of data direction reg A side |
| 0002 6604 | ddr = 026604 | address of data direction reg B side |
| 0002 6606 | err = 026606 | address of output reg A side |
| 0002 6608 | ir = 026608 | address of input reg B side |
| 0002 660a | pcr = 02660a | address of parich control reg |
| 0002 660c | ltr = 02660c | address of interrupt flag reg |
| 0002 660e | lor = 02660e | address of interrupt enable reg |

000000 13FC 0000 0002 | int: movb 00fff,ddr | reset A side for output |

000001 6602 | .word 0a6602 | JUMP OVER |
000002 6601 | .word 0a6601 | SEED |

000000 13FC 0000 0002 | movb 10, ddb | reset B side for input |

000004 6604 | .word 0a6604 | JUMP OVER |
000005 6601 | .word 0a6601 | SEED |

000008 6601 | .word 0a6601 | SEED |

000000 13FC 0000 0002 | movb 00ms,lor | Interrupt CA and CB |

000023 13FC 0001 0002 | movb 01,perc | reset active edge reg for |
| IC1 bottom float |
| load active edge reg for |
| IC1 top float |

000008 6602 | .word 0a6602 | JUMP OVER |
000001 6601 | .word 0a6601 | SEED |

000001 6604 | .word 0a6604 | JUMP OVER |
00000a 6601 | .word 0a6601 | SEED |

000004 6601 | .word 0a6601 | SEED |

000007 13FC 0004 0002 | movb 04,erc | I/O, no pump up |

000000 6604 | .word 0a6604 | JUMP OVER |
000001 6606 | .word 0a6606 | SEED |

000000 6604 | .word 0a6604 | JUMP OVER |
000001 6606 | .word 0a6606 | SEED |

000001 6604 | .word 0a6604 | JUMP OVER |
00000a 6601 | .word 0a6601 | SEED |

000000 6604 | .word 0a6604 | JUMP OVER |
000001 6606 | .word 0a6606 | SEED |

000000 6604 | .word 0a6604 | JUMP OVER |
000001 6606 | .word 0a6606 | SEED |

000000 6604 | .word 0a6604 | JUMP OVER |
000001 6606 | .word 0a6606 | SEED |

000000 6604 | .word 0a6604 | JUMP OVER |
000001 6606 | .word 0a6606 | SEED |

000000 6604 | .word 0a6604 | JUMP OVER |
000001 6606 | .word 0a6606 | SEED |

000000 6604 | .word 0a6604 | JUMP OVER |
000001 6606 | .word 0a6606 | SEED |

000000 6604 | .word 0a6604 | JUMP OVER |
000001 6606 | .word 0a6606 | SEED |
D.3. Program ‘B’ Targeting the Motorola 6805 Microprocessor

Program ‘B’ is written in Assembler code targeted at the 8-bit Motorola 6805 microprocessor. The first listing in this enclosure details the program code. The program demonstrates the SC687 development system for MC68(7)05 software. It simply sends a sequence of user inputs to an output device.

The insertion of detection mechanisms by the software implemented fault tolerant technique, proposed in this thesis, is shown in the second and third listings of this enclosure. Default size and optimum size detection mechanism placements are detailed in the second and third listings respectively.
This program was originally written by Per Nylen (1989) to demonstrate the 6502 development system for 6502 software. It simply sends a sequence of words into the output device.

<table>
<thead>
<tr>
<th>Addr</th>
<th>Code</th>
<th>Listing</th>
</tr>
</thead>
<tbody>
<tr>
<td>0000</td>
<td>LDY 0</td>
<td>; PORT A Output</td>
</tr>
<tr>
<td>0000</td>
<td>BS 5</td>
<td>; PORT B Output</td>
</tr>
<tr>
<td>0000</td>
<td>STY 3</td>
<td>; PORT C Output</td>
</tr>
<tr>
<td>0001</td>
<td>SED 02</td>
<td>; BD TST1</td>
</tr>
<tr>
<td>0002</td>
<td>BSR Y</td>
<td>; Unit while key is pressed</td>
</tr>
<tr>
<td>0003</td>
<td>AS 00</td>
<td>; Send code for &quot;ONE&quot;</td>
</tr>
<tr>
<td>0005</td>
<td>AS 02</td>
<td>; Send code for &quot;TWO&quot;</td>
</tr>
<tr>
<td>0006</td>
<td>AS 03</td>
<td>; Send code for &quot;THREE&quot;</td>
</tr>
<tr>
<td>0007</td>
<td>AS 04</td>
<td>; Send code for &quot;FOUR&quot;</td>
</tr>
<tr>
<td>0008</td>
<td>AS 05</td>
<td>; Send code for &quot;FIVE&quot;</td>
</tr>
<tr>
<td>0009</td>
<td>AS 06</td>
<td>; Send code for &quot;SIX&quot;</td>
</tr>
<tr>
<td>000A</td>
<td>AS 07</td>
<td>; Loop back</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>078C</td>
<td>00</td>
<td>; RC-Oscillator</td>
</tr>
<tr>
<td>078C</td>
<td>BS 0</td>
<td>; Get address $0000 at RESET</td>
</tr>
</tbody>
</table>

This program uses the 6502 processor and the CBM operating system for 6502 software. It demonstrates how to send a sequence of words into an output device.
D.4. Program ‘C’ Targeting the Intel 80386 Microprocessor

The original version of Program ‘C’ is written in the high-level programming language known as C. High-level language programs can be transported for application on many target processors because they describe a function in terms totally abstracted from the architectural influences of any one microprocessor system. The first listing in this enclosure details the original program. The program has no particular function: its purpose is to demonstrate the potential hazards of machine code which are transparent to the high-level language. Functions within the program use local or passed parameters.

The target processor selected for Program ‘C’ is the 32-bit Intel 80386. This microprocessor has been chosen so that the software implemented fault tolerant technique proposed in this thesis is demonstrated on a variety of microprocessors. In order to generate code tailored for application on the Intel 80386 processor, Program ‘C’ is compiled and an Assembler code representation of the source code is shown as the second listing in this enclosure.

Application of the software implemented fault tolerant technique proposed in this thesis is shown within this enclosure at Assembler code level. The third and forth listing respectively detail default size and optimum size detection mechanism placements.
This program was written by Guy Stallato purely to demonstrate that high level
languages hide the potential hazards of machine code when implemented in a
microprocessor system. The program passes parameters by reference amongst
several functions.

/* function to return user input value */
int getvalue()
{ int inputvalue;
  /* get an integer value from the user at standard input */
  printf("input a value: ");
  scanf("%d", &inputvalue);
  printf("inputvalue: %d", inputvalue);
  /* return value for function main */
  return(inputvalue);
  }/* end function getvalue */

/* function to determine power of value */
int powr(int firstnumber, secondnumber)
{ int temppower, i;
  for (i = 0; i <= firstnumber; i++)
    temppower = temppower * (secondnumber);
  /* return power result */
  return(temppower);
  }/* end function powr */

/* function to calculate */
int calculate(int firstnumber, secondnumber)
{ int temppower;
  printf("Original order: %d and %d\n", firstnumber, secondnumber);
  temppower = firstnumber;
  secondnumber = firstnumber;
  *secondnumber = temppower;
  printf("Scramble order: %d and %d\n", *secondnumber, secondnumber);
  /* exit function */
  return;
  }/* end function calculate */

/* main: get two number and calculate second number to power of first */
main()
{ int firtvalue, secondvalue;
  firtvalue = getvalue();
  secondvalue = getvalue();
  printf("Firstvalue: %d and Secondvalue: %d\n", firtvalue, secondvalue);
  calculate(&firtvalue, &secondvalue);
  printf("Secondvalue to power of Firstvalue = %d\n", powr(firtvalue, secondvalue));
  /* exit main function */
  }/* end main */
<table>
<thead>
<tr>
<th>Address</th>
<th>Instruction</th>
<th>Function</th>
</tr>
</thead>
<tbody>
<tr>
<td>002C</td>
<td>50</td>
<td>pushl 0x00a</td>
</tr>
<tr>
<td>002D</td>
<td>26</td>
<td>call swap</td>
</tr>
<tr>
<td>0032</td>
<td>00000000</td>
<td>addl 0, locp</td>
</tr>
<tr>
<td>0033</td>
<td>7F</td>
<td>pushl -8 (locp)</td>
</tr>
<tr>
<td>003C</td>
<td>7F</td>
<td>pushl -0 (locp)</td>
</tr>
<tr>
<td>003D</td>
<td>2D</td>
<td>call proc</td>
</tr>
<tr>
<td>0040</td>
<td>00000000</td>
<td>addl -0, locp</td>
</tr>
<tr>
<td>0043</td>
<td>57</td>
<td>pushl 0x00a</td>
</tr>
<tr>
<td>0044</td>
<td>0000000000</td>
<td>pushl locm+0x120</td>
</tr>
<tr>
<td>0049</td>
<td>0000000000</td>
<td>call locm+0x70</td>
</tr>
<tr>
<td>004C</td>
<td>08000000</td>
<td>addl 0, locp</td>
</tr>
<tr>
<td>0051</td>
<td>CS</td>
<td>locm</td>
</tr>
<tr>
<td>0052</td>
<td>CS</td>
<td>not</td>
</tr>
<tr>
<td>0053</td>
<td>50</td>
<td>pushl 0x00a</td>
</tr>
<tr>
<td>0054</td>
<td>CNEC</td>
<td>movl 0x070, locp</td>
</tr>
<tr>
<td>0055</td>
<td>CNEC</td>
<td>addi 0x030, locp</td>
</tr>
<tr>
<td>0059</td>
<td>BNE7</td>
<td>jnp main+2</td>
</tr>
</tbody>
</table>

*Other routines are referenced in this code, but for the purpose of analysis the assembler program is considered to be complete.*
This is the assembler translation of a program written in C.

**Address** | **Code** | **Listing**
--- | --- | ---
0000 | EB13 | gotvalue:
0002 | 682C10000 | pushl $0tost+0x12C
0007 | 839010000 | call oftost+0x78
000C | 59 | popl tocm
000D | 8C35FC | local -(tchop), tocm
0010 | 50 | pushl tocm
0011 | 838210000 | pushl $0tost+0x12D
0016 | ES1000000 | call oftost+0x88
001B | 83C400 | addl $0, tocm

--- | --- | ---
0012 | EB01 | jmp over1
0016 | C0 | int3

**/ DETECTION MECHANISM 1**

--- | --- | ---
0020 | FF73 | PC
0024 | 888210000 | pushl -(tchop)
0029 | 830210000 | pushl $0tost+0x12E

--- | --- | ---
0034 | EB01 | jmp over2
0036 | C0 | int3

**/ DETECTION MECHANISM 2**

--- | --- | ---
0040 | EB00 | jmp gotvalue+0x3A
0043 | C9 | loavo
0043 | C3 | rot
004C | 55 | pushl tchop
004D | 83C0C | movl tchop, tchop
0057 | 50 | pushl tocm
0060 | 83C7 | jmp gotvalue+0x2A
0060 | EB00 | gotvalue:

--- | --- | ---
0077 | EB02 | jmp over3

**/ DETECTION MECHANISM 3**
0051 81C008
addl -8, loop
0054 SO
pushl 6000
0055 689210000
pushl $offset+610D
005A 839008
call $offset+670C
005U 81C009
addl 60, loop
005F C9
lOdwo
0063 C9
ret
0066 55
pushl 60
0069 83EC
movl $offset+897c, 60
0067 83E908
addl 60, loop
006A 826B
jmp eax+3

Other routines are referenced in this code, but for the purpose of analysis the
accumulator program is considered to be complete.
### Appendix E

**PUBLICATIONS**

| E.1. | Introduction | 297 |
| E.3. | Paper 2: EUROMICRO '89 Paper | 307 |
| E.4. | Paper 3: IEE '89 Paper | 315 |
| E.5. | Paper 4: IEE '90 Paper | 318 |
| E.6. | Paper 5: EUROMICRO '91 Paper | 323 |
APPENDIX E

PUBLICATIONS

E.1. Introduction

To date five papers have been published in connection with the research presented in this thesis. They are as follows,


The first, second, and last paper listed were presented at the international EUROMICRO '88, EUROMICRO '89, and EUROMICRO '91 conferences held in Zürich (Switzerland), Köln (West Germany), and Vienna (Austria) respectively. The third paper was presented at the IEE Colloquium "Control Systems Software Reliability for
TRANSIENT FAULT RECOVERY ASSESSMENT IN 8 AND 16 BIT MICROPROCESSOR BASED CONTROLLERS IN EMBEDDED SYSTEMS

G.A.S. Wingate and C. Preece

United Kingdom
School of Engineering and Applied Science
University of Durham

Keywords: Transient Fault Tolerance, Fault Recovery, Microprocessors, Industrial Controllers, Embedded Systems.

Microprocessors used in embedded systems for industrial control applications are often subject to transient disturbances. This can cause system failure unless fault tolerance can be introduced into the design. This paper discusses software design techniques for enhancing fault tolerance in small digital systems. A metric is proposed for assessing different designs, and its influence on MTTF is illustrated.

1. INTRODUCTION

Modern industrial control systems are increasingly based on digital circuits incorporating microprocessors. In particular, the use of microprocessor based digital controllers in embedded systems provides an example of a low cost application where simple architectures are preferred. However when replacing analogue circuitry with digital systems, it is important to note that the digital replacements may be more susceptible to catastrophic failure from transient disturbances.

Microprocessor based industrial controllers can enter a state of erroneous execution due to the effects of a transient disturbance. The subsequent execution history depends on the particular microprocessor architecture and the system configuration. In many embedded industrial applications there is a requirement for high reliability, and this can be enhanced by attention to software design. Further improvement can be achieved by integration of hardware and software methods. This problem has traditionally been tackled by electrical shielding techniques. The methods presented here provide added protection by increasing the probability of recovery once a transient fault has occurred.

Studies have been published elsewhere, and are referenced below, showing how the probability of recovery can be determined and subsequently enhanced, for systems based on 8-bit microprocessors. Techniques have been proposed for assessing the probability of fault recovery following certain classes of transient disturbance. In this paper this approach is extended to consider 16-bit processors, and in particular the Motorola M68000 series. In order to compare different processors, an overall measurement parameter or metric is defined as the probability of executing an instruction which will cause an ordered re-entry to the program. This metric is used to discuss the implications of processor choice for digital controller design.

This paper shows how the inherent properties of 16-bit microprocessors can be utilised to improve the probability of recovery from a transient fault.

The concept of Mean Time To Failure for a system, commonly used for hardware failure estimation, is adapted to provide a measure of transient fault recovery capability.

2. CLASSES OF TRANSIENT DISTURBANCE

The industrial environment is a source of transient disturbances many of which are derived from power supply transients and electromagnetic radiation. It is critical that industrial digital systems in applications such as real-time monitoring and control, have a minimal possibility of complete system failure from transient disturbances. Digital systems are more prone to such a failure than analogue systems which tend to filter the disturbances.

Practical observations and experiments have shown that transient disturbances can cause an erroneous jump in the execution of a program, due to corruption of the program counter. This may be caused by direct corruption of the program counter, of the bus signals, or by data errors which lead to corruption...
of stored addresses. The statistical calculation of the probability of recovery following an erroneous jump provides a method of quantifying fault tolerance, and leads to the definition of a recovery metric, which enables recovery strategies to be assessed qualitatively.

Execution at any address results in the processor entering one of a fixed number of classified states. The probability of reaching a particular state may be calculated, based on the proportion of the instruction types in the instruction set, and their distribution in the microprocessor. Some of these states lead to further erroneous execution, whereas others allow an ordered recovery to take place.

Within the system memory, areas can be defined which have different characteristics dependent on the defined utilisation of that memory area. For the purpose of statistical analysis, the memory is divided into distinctive areas, and models are derived for erroneous execution in these areas. The models for 16-bit microprocessors follow the same principles as those developed for 8-bit microprocessors. The differences are due to the inherent architecture and instruction word-length of the 16-bit machines. The common model allows comparisons between machine types to be made. Once figures for recovery probabilities have been calculated for a particular design, techniques for improving the metric can be proposed. The aim of the design technique is to maximise the probability of an ordered recovery after an erroneous jump.

3. REVIEW OF 8-BIT CONCEPTS

The concepts used in the study of 8-bit machines are reviewed here, in order to illustrate the comparison between 8- and 16-bit processors. A full description will be found in Reference 1.

Take the M6800 microprocessor which has an addressing range of 64K bytes using a 16-bit address bus. An erroneous jump from a running program to a random address in the address space results in an entry to one of five states, as illustrated in Figure 1, the state reached being determined by the value of the data at the particular address. The probability of reaching one of the five states can be calculated knowing the proportion of particular instructions within the instruction set.

Of the 256 possible op-codes in the M6800 microprocessor, not all are defined. Those that are undefined have various state outcomes. The probabilities of entering each state after an erroneous jump are shown in Figure 2 for the M6800 when the contents of the memory area are assumed to be random. The probabilities of entering each state change as further instructions are executed following the original jump as shown in the figure, the plotted lines represent the boundaries between states.

The initial probability of entering a particular state, \( P_S \), is given by

\[
P_S = \frac{N_S}{N_T}
\]

where \( N_S \) is the number of op-codes corresponding to this state, and \( N_T \) is the total number of possible op-codes.

The probability that \( K \) instructions will be executed before a state is reached where a jump out of ordered processing will occur is given by

\[
P_J(K) = (1 - P_J)^{K-1}.P_J
\]

from Reference 2.

An ordered recovery following a transient event implies the execution of a restart instruction. The probability of executing such an instruction, illustrated in Figure 2, is low because of the small proportion of restart op-codes in the M6800 instruction set. Similar results are cited for other 8-bit processors to confirm this conclusion for this class of processor.

Techniques for improving the probability of recovery in 8-bit systems have been reported in Reference 3. It is shown below that these methods are much more effective when applied to 16-bit processors.
0.6

ERRONEOUS EXECUTION

RETURN

UNSPECIFIED JUMP

0.4

0.2

0.0

0 1 2 3 4 5 6 7 8 9 10

INSTRUCTIONS EXECUTED

Figure 2. 16-Bit Model Erroneous Execution State Outcome.

4. COMPARISON WITH THE M68000 SERIES

4.1 State and Type definitions

The microprocessors considered here in the Motorola M68000 family are the M68000, M68010, and the M68020. These microprocessors have addressing capabilities of between 24 and 32 bits, and have 16 bit data buses. The instruction set is based upon a 16-bit instruction word. This gives the microprocessors a possible instruction set of 65536 instructions.

The M68000 family of microprocessors are micro-coded. That is, the 16-bit op-code is presented to an execution unit. This unit is effectively a ROM to which the 16-bit op-code is an address. The ROM then releases, for every possible address variant, an appropriate sequence of micro-codes which will carry out the requested operation. Illegal and undefined instructions are treated in exactly the same way as legal and defined instructions. Illegal and undefined instructions have a specified operation: an exception call.

This fact is particularly significant in terms of fault recovery as execution of any invalid op-code leads to an exception, which, in turn, can lead to an ordered recovery through an exception service routine.

The numbers of defined and undefined instructions for the M68000 series are shown in Table 1, where they are compared with equivalent figures for the M6800.

<table>
<thead>
<tr>
<th>Microprocessor</th>
<th>defined</th>
<th>undefined</th>
</tr>
</thead>
<tbody>
<tr>
<td>MC6800</td>
<td>197</td>
<td>59</td>
</tr>
<tr>
<td>M68000</td>
<td>43342</td>
<td>22194</td>
</tr>
<tr>
<td>M68010</td>
<td>43521</td>
<td>22015</td>
</tr>
<tr>
<td>M68020</td>
<td>46595</td>
<td>18941</td>
</tr>
</tbody>
</table>

Table 1. Number of defined and undefined instructions for the MC6800 and M68000 microprocessor family.

The instruction types can be classified as shown in Table 2.

<table>
<thead>
<tr>
<th>Instruction Type Classification</th>
</tr>
</thead>
<tbody>
<tr>
<td>Non - Jump</td>
</tr>
<tr>
<td>Restart ( software interrupt, software exception )</td>
</tr>
<tr>
<td>Return</td>
</tr>
<tr>
<td>Stop ( wait )</td>
</tr>
<tr>
<td>Undefined Instruction</td>
</tr>
<tr>
<td>Unspecified Jump</td>
</tr>
</tbody>
</table>

Table 2. Instruction Type Classification

We consider an "op-code state" to be the state resulting from the interpretation of one of the instruction types as follows:

Non-Jump - leads to the program counter pointing to the location following a valid single or multi-byte instruction.

Restart - leads to a jump to a predefined location in the memory map (by exception).

Return - leads to a jump to an address held in a stack.

Stop/Wait - leads to cessation of processing; and requires an interrupt or hardware reset to exit from this state.

Undefined Instruction - leads to a restart in the M68000 series because of the undefined instruction exception feature.

Unspecified Jump - leads to a jump to a new location determined by local memory contents.

Of these states, only two are capable of providing an ordered restart, the restart and the undefined instruction exception. A diagram showing the possible states...
following an erroneous jump in a 16-bit microprocessor is shown in Figure 3. This diagram can be compared to the corresponding diagram for 8-bit microprocessors shown in Figure 1.

4.2 Microprocessor Model

In developing the model of microprocessor operation subsequent to a transient event leading to an erroneous jump, the assumption is made that there is an equal probability of the program counter containing any address within the memory map.

The memory map is divided into notional areas. Assumptions will be made later about the properties of particular areas which will modify the statistical evaluation of state outcome. The categories are shown in Table 3.

Table 3. Memory Map Categories

<table>
<thead>
<tr>
<th>Memory Map Category</th>
</tr>
</thead>
<tbody>
<tr>
<td>Input/Output Reserved Area</td>
</tr>
<tr>
<td>Program Area</td>
</tr>
<tr>
<td>Data Area</td>
</tr>
<tr>
<td>Unused Area</td>
</tr>
</tbody>
</table>

If an initial assumption is made that the data area of the memory contains random data, then comparison can be made between the 8-bit and 16-bit machines, by calculating the probability of entering a particular state as before. However, an important feature of the M68000 leads to an addition to the equation. Any attempt to fetch an instruction from an odd address leads to an exception. Assuming that the value of the corrupted program counter is random, then the probability that the program counter holds an odd address is 0.5.

All probability calculations for the M68000 series therefore apply to even addresses, recovery is guaranteed for all odd address references. A diagram of state probability for the M68000 following an erroneous jump to random data area is shown in Figure 4. Comparison of this diagram with Figure 2 shows the enhanced probability of restart in the M68000. The percentages of different instruction types in the microprocessor instruction sets are given in Table 4.

Table 4. Instruction Type Percentages in Microprocessor Instruction Sets.

<table>
<thead>
<tr>
<th>Instruction Type</th>
<th>M68000</th>
<th>M68010</th>
<th>M68020</th>
<th>M6800</th>
</tr>
</thead>
<tbody>
<tr>
<td>Non-Jump</td>
<td>57.825</td>
<td>58.095</td>
<td>59.327</td>
<td>91.600</td>
</tr>
<tr>
<td>Restart</td>
<td>1.970</td>
<td>1.970</td>
<td>3.683</td>
<td>0.300</td>
</tr>
<tr>
<td>Return</td>
<td>0.003</td>
<td>0.006</td>
<td>0.006</td>
<td>1.300</td>
</tr>
<tr>
<td>Stop</td>
<td>0.002</td>
<td>0.002</td>
<td>0.002</td>
<td>1.600</td>
</tr>
<tr>
<td>Undefined Instruction</td>
<td>33.865</td>
<td>33.592</td>
<td>28.902</td>
<td>*</td>
</tr>
<tr>
<td>Unspecified Jump</td>
<td>6.335</td>
<td>6.335</td>
<td>8.080</td>
<td>5.200</td>
</tr>
</tbody>
</table>

* The M6800 has 59 undefined instructions, but unlike the MC68000 family of microprocessors, these instructions do not lead to exception and recovery. They have therefore been grouped with the other instruction types dependant on their effective action.

5. EXECUTION OF A PROGRAM IN AN EMBEDDED SYSTEM

For many industrial control and monitoring applications, the program may occupy only a small proportion of the addressable memory map. When a transient event occurs, the corrupted program counter may point to any location, whether used or not, in the addressing range. We will refer to the area occupied by the program, data storage, and memory mapped I/O as the used area, and the remainder of the map as the unused area. If the unused area is filled with data which are interpreted as restart instructions, or other instructions which would generate exceptions,
then the probability of recovery is further improved. Methods of achieving this have been discussed in Reference 3, the most powerful of which is bus-biasing. This consists of external circuitry which asserts a bit pattern on the data bus when any unused memory address is accessed during an instruction fetch. If the bit pattern is chosen to force an exception, then all references to unused areas provide recovery.

In this situation the probability of recovery depends, not only on the proportions of restart instructions in the instruction set, but also on the ratio of used to unused memory in the whole addressable memory area.

The concept of Mean Time To Failure, MTTF, used in hardware reliability calculations, can be adapted for this work. It provides a method of comparing the improvement in reliability brought about by fault recovery with other hardware and software methods in embedded systems. It also enables comparisons between designs using different microprocessors to be made.

6. MTTF FOR A SYSTEM SUBJECT TO TRANSIENT EVENTS

Let the sample space $E$, comprise a set of events corresponding to erroneous jumps in the running program. Let $E_r \in E$ where $E_r$ is an event leading to a recovery, and $E_f \in E$ where $E_f$ is an event leading to a failure. We can also state that $E_r \cup E_f = E$ and $E_r \cap E_f = \emptyset$.

Let the probability of event $E_r$ be $P(E_r)$ and the probability of event $E_f$ be $P(E_f)$, which leads to

$$P(E_r) + P(E_f) = 1$$  \hspace{1cm} (3)

If we assume that transient events occur at a rate of $k$ events per hour, then the rate of failures per hour is given by $\Lambda$ where

$$\Lambda = k.P(E_f)$$  \hspace{1cm} (4)

Assuming an exponential probability distribution function $f(t)$, then

$$f(t) = e^{-\Lambda t}$$  \hspace{1cm} (5)

and

$$MTTF = \frac{1}{k.P(E_f)} = \frac{1}{k.(1 - P(E_r))}$$  \hspace{1cm} (7)

Equation 7 gives a value of MTTF for the system when subjected to transient events at the rate of $k$ per hour, and where the probability of recovery from any single event is $P(E_r)$. It allows a comparative assessment to be made for software recovery techniques in a form which can be related to MTTF calculations for permanent faults in digital systems.
Let us take an example of a system having a used area of 48K bytes. If we assume, as a worst case, that any erroneous jump into the 48K byte area will result in system failure, but any jump into the unused area is recoverable, then the effect of different processors on the reliability can be seen in Figure 5. These curves are drawn with a normalised time base.

It is helpful to consider a numerical example to illustrate the point. If we assume that the system is subjected to transient events at a rate of, say, $1.4 \times 10^{-3}$ per hour, or approximately once per month. Table 5 shows the effect on MTTF of variation in $P(E_T)$. The MTTF for this event rate with no recovery, (i.e. $P(E_T) = 0$), is 714 hours.

<table>
<thead>
<tr>
<th>Processor</th>
<th>$P(E_T)$</th>
<th>MTTF</th>
</tr>
</thead>
<tbody>
<tr>
<td>M6800</td>
<td>0.250000</td>
<td>952 hrs = 39 days</td>
</tr>
<tr>
<td>8086 *</td>
<td>0.250000</td>
<td>952 hrs = 39 days</td>
</tr>
<tr>
<td>8086 **</td>
<td>0.999969</td>
<td>23405714 hrs = 2672 yrs</td>
</tr>
<tr>
<td>8086 ***</td>
<td>0.953125</td>
<td>15238 hrs = 21 mths</td>
</tr>
<tr>
<td>M68000</td>
<td>0.998535</td>
<td>487567 hrs = 56 yrs</td>
</tr>
<tr>
<td>M68010</td>
<td>0.998535</td>
<td>487567 hrs = 56 yrs</td>
</tr>
<tr>
<td>M68020</td>
<td>0.999994</td>
<td>119047619 hrs = 13590 yrs</td>
</tr>
</tbody>
</table>

* assuming corruption of the program counter only, and that the program is contiguous.

** assuming corruption of the segment register only, and that the program is contiguous.

*** assuming corruption of the program counter and segment register are taken together as a single register.

In the case of the 8086, two registers are involved in specifying the address of any instruction. Corruption of each register has been treated separately. A third case has been considered (assuming that the 8086 could be considered to have a single rather than a multiple, program counter) so that a comparison for a microprocessor with the same addressing range as the 8086 can be made. Table 5 illustrates the probability of recovery and the MTTF for a range of microprocessors.

It can be seen that once the value of $P(E_T)$ approaches a value of 0.99 or better, a small increase in value can bring a large improvement in MTTF.

As the proportion of used area increases, the change in MTTF can be calculated as, in general,

$$MTTF = \frac{\text{total area in bytes}}{k \times \text{used area in bytes}}$$ (8)

This calculation assumes that any entry into the used area leads to failure. However this is not actually the case. As mentioned above, different areas of the memory map have different properties, and more accurate figures for recovery probability can be determined for these areas.

7. MODIFICATION TO MTTF ESTIMATE BY MEMORY CATEGORIES

Within the used area of the memory map, that is the area containing the program with its data area, and mapped I/O addresses, distinctly different properties of the contents of these areas can be defined.

The program area contains program code consisting of op-codes and operands. As a first approximation we may assume that there are no restart codes in the program area. This means that any erroneous jump to a program area results in $P(E_T) = 0$.

The data area on the other hand, consists of numerical data which bear no relation to an ordered sequence of instructions. It is therefore possible to assume a random distribution of data values. We here include I/O mapped registers in the data area. For the M68000 this gives a value of $P(E_T) = 0.35835$ within the data area for all even addresses.

![Figure 6. Improved M68000 Microprocessor Reliability.](image-url)
The enhancement of recovery for different used areas of memory is shown in the examples given in Figure 6. A typical program area of 48K is assumed in one case to comprise 40K of program and 8K of data, in the second case the proportions are reversed. These are compared with the case of a 48K program with no data area. If we assume an event rate of approximately 1 per month as in the examples above, then the MTTF for these applications can be evaluated. The results for implementation on the M68000 are shown in Table 6.

Table 6. MTTF for various Program/Data ratios in M68000

<table>
<thead>
<tr>
<th>Program/Data</th>
<th>( P(E) )</th>
<th>MTTF</th>
</tr>
</thead>
<tbody>
<tr>
<td>48K / 0K</td>
<td>0.998535</td>
<td>487567 hrs = 56 yrs</td>
</tr>
<tr>
<td>40K / 8K</td>
<td>0.998617</td>
<td>516476 hrs = 59 yrs</td>
</tr>
<tr>
<td>8K / 40K</td>
<td>0.998048</td>
<td>678979 hrs = 78 yrs</td>
</tr>
</tbody>
</table>

Any pattern of memory use can be analysed to give a more accurate value of MTTF for a particular application program.

8. ENHANCEMENT OF RECOVERY

The detailed treatment of data and program areas of the memory is dependent on the properties of the code for a particular application. Some general rules can, however, be formulated.

As a first approximation, if data areas can be considered to contain random numbers, then the proportions of codes given by Table 4 apply. The inherent metric is a function of the restart and undefined instructions. However, over 50% of the instruction types are in the non-jump category, that is, execution proceeds beyond them in sequence. This observation introduces the possibility of utilising this property to force further restarts by seeding the data areas with recovery traps, sequences of codes spread throughout the data area. As well as providing direct recovery if an erroneous jump lands on a trap, it also enhances the probability of recovery after execution of a non-jump instruction.

The treatment of program areas is quite different. Here the codes are valid instructions, and entry to any program area has a high probability of resuming valid code execution. (Reference 1). Seeding the program area is application dependent, and the programmer needs to have this in mind when writing the code. Alternatively this function might be implemented by a high level compiler. Two examples of programming techniques involving operands in the M68000 series will be sufficient to illustrate the point.

i) A proportion of a program area may contain codes representing addresses. These addresses point to items of data storage. If the data is stored at addresses which themselves represent invalid instruction codes, then erroneous execution of any of these operands in the program area will force restarts.

ii) An alternative form of addressing can also produce the same result. If data is referenced using backward relative addressing, then execution of the operands containing the negative data offset will cause invalid op-code exceptions in the M68000 and M68010. This is because setting the four most significant bits in the word produces a code which is interpreted as an exception in these processors.

9. DISCUSSION

The method enables the reliability of microprocessor based embedded systems to be assessed, and different designs to be compared. It also provides a statistical basis for developing the techniques for enhancing the fault tolerance of small systems which are described in Reference 3.
The results suggest that software recovery for transient events can provide an additional means of protection for microprocessor based systems in addition to that traditionally provided by external hardware watchdog circuits. In selected 16-bit machines the figures for MTTF can be high even for onerous transient event rates.

ACKNOWLEDGEMENTS

The authors wish to acknowledge the support of the UK Science and Engineering Research Council and the British Gas Engineering Research Station, Killingworth, Newcastle upon Tyne. They would also like to thank Dr. R.G. Halse, of Westinghouse Signals Ltd., Chippenham, Wilts. for his suggestions and helpful comments on the paper.

REFERENCES


PERFORMANCE EVALUATION OF A NEW DESIGN-TOOL FOR MICROPROCESSOR TRANSIENT FAULT RECOVERY

G.A.S. Wingate (UK) & C. Preece (UK)

School of Engineering and Applied Science,
University of Durham, DH1 3LE, England.

Keywords: Transient Fault Tolerance, Fault Recovery, Design Tool, Embedded Systems, Microprocessor Controllers.

Approach: Evaluation.

A model of microprocessor erroneous behaviour has led to the development of a new design tool to automate the introduction of transient fault tolerance into program code. The design tool, PARUT (Post-programming Automated Recovery UTility) provides a method of enhancing existing program code to optimise the recovery capability following a transient disturbance. The tool can be used to implement a number of different recovery strategies, some of which may involve additional hardware. The paper examines the performance of the design tool for a range of techniques.

1. INTRODUCTION

Modern designs of industrial control systems incorporate microprocessors for control and monitoring purposes. The versatility that microprocessor based controllers offer to the designer, and the flexibility that customised software provides, makes them attractive in many industrial situations.

However, industrial environments are often harsh, falling short of the ideal for computer systems. In particular, microprocessor operation can be corrupted by externally generated transients events such as electrical power transients [1] and electro-magnetic radiation [2, 3, 4]. Even in 'benign' operating conditions transients have been observed to cause between 80% and 90% of digital system failures [5, 6, 7, 8].

In analogue systems these transients go through a 'filtering' process which generally means that the control function is not lost. Digital systems are much more liable to lose all control function following a transient disturbance. It is important that digital design should incorporate mechanisms for recovery in the event of erroneous execution.

Transient events considered in this paper are those which lead to corruption of data on the bus, or in the memory, or registers of a microcomputer system. While such corruption can lead to erroneous behaviour of the system, no permanent hardware damage is incurred, and if control of the computational process can be re-established, then this permits the possibility of overall recovery of the control system.

A number of well known techniques are available to reduce the probability of failure from transients. These include both hardware and software enhancement and usually incorporate some form of redundancy. A technique such as hardware modular redundancy with voting will prevent many transient failures but involves considerable hardware overhead [9]. Watchdogs circuits [10], although commonly used, themselves suffer from transients. They also introduce a performance overhead by constantly interrupting microprocessor operation.

This paper includes a comparison of two other recovery techniques which aim to improve transient fault tolerance by the inclusion of both hardware and software enhancements. These have been applied to 8-bit, 16-bit, and 32-bit microprocessors [11, 12, 13]. The hardware modification is simple with a very low overhead. The software enhancement has involved up to 60% increase in memory requirement, but this figure is very much code dependent. A post-programming utility has been built which automates the method of enhancing software to improve transient fault tolerance. The utility is described, and its performance with transient fault tolerant techniques is evaluated.

2. ERRONEOUS BEHAVIOUR

We consider erroneous behaviour to be initiated by corruption of the microprocessor’s internal state which leads to the corruption of the program counter. It has been suggested that 25% of transients affecting digital systems will corrupt the internal state [6]. The erro-
neous behaviour produced by a microprocessor is characterised as a sequence of erroneous execution states terminated either by recovery or catastrophic failure.

Execution states can be defined in terms of the outcome of each operation. The possible states are defined as follows:

**Non-Jump** - leads to the program counter pointing to the location following a valid instruction.

**Restart** - leads to a jump to a predefined location in the address space.

**Unspecified Jump** - leads to a jump to a new location in the address space determined by local memory contents.

**Return** - leads to a jump to an address held in a stack.

**Stop/Wait** - leads to a cessation of processing; and requires an interrupt or hardware reset to exit from this state.

Not all possible instruction bit patterns in a microprocessor are necessarily defined. In such cases these undefined instructions can, when executed, result in any of the defined states. It should be noted however that because no specification is available for these undefined instructions, manufacturers are not obliged to ensure every die batch produces the same operation for each of the undefined instructions. However, the problem does not arise in all microprocessors. In the case of the the Motorola 68000 family, for example, the execution of all undefined instructions results in an ‘exception’ (or software interrupt) leading to the ‘restart’ state as defined above.

For purposes of discussion, we assume that a transient fault will cause random corruption of the program counter contents. The instruction following the transient event will be fetched from a location pointed to by the corrupted contents of the program counter resulting in a jump in the control flow of the executing software. This random jump is termed the Initial Erroneous Jump (IEJ). We define execution following the IEJ as ‘erroneous execution’. The content of the memory at the new execution location is fetched and executed as if it were a valid instruction. The outcome of this erroneous execution will result in a new behavioural state. Execution of a non-jump type instruction continues linear erroneous execution. If the new state is a jump type (but not a ‘restart’), then a Subsequent Erroneous Jump (SEJ) will occur. In order to achieve controlled recovery it is necessary to maximise the probability of a ‘restart’ type instruction being executed as soon as possible after the inception of erroneous execution.

### 3. MODEL

A model of erroneous execution has been presented elsewhere [11, 12, 13], and is reviewed briefly here.

The probability of a jump type outcome after *k* instructions have been executed in a random area [11] is given by

$$P_J(k) = P_{NJ}(k - 1)P_J$$

where $P_J$ and $P_{NJ}$ are the initial probabilities of a jump and non-jump outcome. If the simplifying assumption is made that the content of the memory at the jump target is random, then the probabilities for individual outcomes can be calculated from the distribution of their associated instructions within the instruction set, and multiplying by $P_J(k)$.

Improvement in the probability of recovery following an initial erroneous jump (IEJ) has already been discussed in a previous paper [12] where unused address space is filled with restart type instructions. Major improvements were shown for a range of 8, 16, and 32 bit microprocessors. For erroneous execution in the ‘used areas’ of memory, the probability of a number of random subsequent erroneous jumps (SEJ) landing within the used area is given by

$$P_{SEJ}(\text{Used Area}) = \left\{ \frac{\sum_{j=1}^{L} \left( \frac{\text{No. addressed bytes in used area}}{\text{No. bytes in address range}} \right)}{N_JN_L} \right\}$$

**Figure 1 : SEJ Characteristic.**
where the summations to J, and L, map each jump type instruction and, every location in the used area respectively. \( N_J \) is the number of jump type instructions in the instruction set. \( N_L \) is the number of locations in the used area.

The SEJ characteristic for the Motorola 68000 and Intel 8086 is shown in Figure 1. This graph emphasises the importance of implementing a recovery technique within the used area. If no recovery technique is implemented then there is a high probability of an extended period of erroneous behaviour consisting of many SEJs. A detailed analysis of this characteristic is given in Reference 13.

4. PARUT (Post-programming Automated Recovery Utility)

PARUT has been designed to accomplish the following:

i) To analyse the original code and report the IEJ recovery capability inherent in the original code, and the SEJ recovery capability.

ii) To enhance the fault tolerance of the original code by a selection of methods. This usually involves inserting some redundancy into the code. When this is completed, the utility re-aligns the original software control flow which will have been offset at the machine code level by the introduction of redundancy.

iii) To analyse the enhanced code and report the improvement in the IEJ recovery capability, the improvement in the SEJ recovery capability, and the coded area extension overhead required.

The PARUT program has two input requirements, a description of the microprocessor on which the code is to reside, and a copy of the code. Processing this information results in two output streams, a report on the original and the enhanced codes, and a copy of the enhanced codes. An overview of the PARUT program is shown in Figure 2.

In the present version of the PARUT program, detection mechanisms cannot be inserted for SEJs that originate within, and whose destinations lie within, the same instruction [13]. The utility does, however, report these occurrences.

5. APPLICATION: Motorola 68000

Studies of a range of microprocessor types have suggested that while many features of erroneous behaviour are common to all, [11, 12], detailed analysis requires that each type is considered separately. We consider here the Motorola 68000 microprocessor as a typical example of commonly used 16/32 bit microprocessors. The analysis produced for this microprocessor is specific, but the general approach and results are valid for a range of microprocessors and architectures.

The model for the Motorola 68000 follows from the description of microprocessor behaviour. A particular feature of the Motorola 68000 family of microprocessors is the 'odd address exception'. The handling routine for this exception can be written so that it directs execution to the recovery routine. With unused even addresses also giving a restart (via a hardware address access guardian), the microprocessor gives an enhanced probability of recovery following an IEJ.

![Figure 2: PARUT Overview.](image)

![Figure 3: Example of Detection Mechanism Placement.](image)
To improve the recovery capability of the microprocessor still further, PARUT can implement detection mechanism placement. The detection mechanism used for the Motorola 68000 is shown in Figure 3. It consists of an initial one-word relative branch instruction over the remainder of the mechanism, so that logical control flow of the correctly executing program is not interrupted. This does however incur an additional processing overhead. The remainder of the detection mechanism consists of five one-word seed instructions. A seed instruction is a software exception instruction which directs execution flow to the recovery routine. Five seed instructions are necessary because the maximum length of an instruction in the 68000 microprocessor is five words. Detection mechanisms must be placed so as not to disrupt correct execution flow. The rule for placement is that where possible each SEJ destination becomes a seed in the detection mechanism, thus ensuring a restart during the next execution cycle, and an increased probability of recovery following an SEJ. The model applicable to the M68000 is shown in Figure 4.

![Figure 4: Erroneous Execution Model.](image)

6. PERFORMANCE EVALUATION

The results presented here show the effect of the PARUT utility on a particular piece of code. The results are entirely code dependant, and will vary greatly from one program to another. An assessment is made of the improvement in recovery capability achieved by detection mechanism placement.

The performance evaluation of original and PARUT enhanced codings requires examination of the execution histories belonging to each of the two phases of erroneous behaviour. The first phase is defined as that of erroneous execution following an IEJ. The second phase consists of the erroneous execution following each of a number of SEJs until either recovery or catastrophic failure occurs. The example coding is taken from a short Motorola 68000 control program written in assembler code.
6.1 Phase 1 Observations

The execution histories for the original code, the code with detection mechanism placement, and the embedded signature code, are shown in Figures 5, 6, and 7 respectively. These results are for execution histories following an Initial Erroneous Jump.

Both enhanced codings exhibit a lower initial probability of restart. This overhead is due to the extended memory required for the inserted detection mechanisms and embedded signatures.

The code with detection mechanism placement has a smaller probability that the outcome state will be a jump (without restart) compared to the original code. This is due to the probability of an initial SEJ destination being a detection mechanism seed delivering a restart outcome.

The signatured code also shows a decrease in the probability jump (without restart) outcome state. This is because the signature process delivers a restart outcome for any initial SEJ which does not synchronised with valid program flow.

6.2 Phase 2 Observations

The execution histories for the original code, the code with detection mechanism placement, and the embedded signature code, are shown in Figures 8, 9, and 10 respectively. These results are for execution histories following a Subsequent Erroneous Jump.

Some SEJ destinations cannot be determined due to their use of data which is only specified at run-time. In such instances, for analysis purposes, the destinations
are assumed to be random. This is important in estimating the proportion of SEJ destinations which lie outside the used area, or at an odd address, and which affect the probability of restart.

The effect of placing detection mechanisms is clearly demonstrated by the increased probability of restart as shown in Figure 9 compared to the original code shown in Figure 8.

In the case of the signature method, an SEJ by an attempted execution of an operand will not complete its operation because of the hardware detection circuitry. On the other hand an SEJ by an attempted execution of an op-code results in re-synchronisation with the program flow.

6.3 Recovery Performance

Recovery is assumed to be achieved through a restart type instruction directing the program flow to a recovery routine.

The absolute probability of restart can be calculated from the information held in both the execution histories for erroneous execution following an IEJ and SEJ. Let \( P_{RT}(k) \) be the absolute probability of restart after \( k \) instructions processed following an IEJ. Let \( J_{RT}(k) \) and \( I_{RT}(k) \) be the probabilities of restart and, jump without restart, after \( k \) instructions processed following an IEJ. Let \( S_{RT}(k) \) and \( S_{J}(k) \) be the probabilities of restart and, jump without restart, after \( k \) instructions processed following an SEJ. Finally, let \( n, x, y, z \) be instruction indices.

\[
P_{RT}(w) = I_{RT}(w) + \sum_{i=0}^{n} \sum_{x=0}^{z} \left( \frac{1}{Z} \right) S_{J}(y) \cdot S_{RT}(z) \tag{3}
\]

where \( x + \sum_{i=0}^{n} y + z = w \), and \( n, x, y, z \geq 0 \).

The absolute probability of restart for the different enhanced versions of the program are shown in Figure 11. The performance of the PARUT generated code shows improvement over the original code after one or more instructions have been erroneous executed. The period of erroneous behaviour is reduced leading to a smaller probability of data corruption. This in turn will improve availability, and reduce the chance of catastrophic failure.

Absolute probabilities of restart tends to a value less than 100% because of the probability of resynchronisation with the program flow. If resynchronisation occurs, complementary recovery techniques are required [15, 16].

6.4 Overheads

Detection mechanism placement incurs a memory overhead. This overhead is clearly demonstrated in the reduced initial probability of recovery through restart for erroneous behaviour following an IEJ, see Figure 6. Examination of Figure 11 reveals how this initially reduces the absolute probability of restart of the PARUT enhanced code compared to the original code. However
we note in Figure 11 that the performance of the enhanced code improves due to the increased probability of restart following a SEJ produced by detection mechanism placement, see Figure 9.

The execution of the detection mechanism jumps over the seeds, during correct program flow, will incur a small processing overhead. The influence of this on the program performance is code dependant.

7. DISCUSSION

The method described in this paper for improving the transient fault recovery capability is valuable for applications which require a high degree of dependability or which are safety critical.

The microprocessor model described above can be extended to other types of microprocessors [11, 12]. The design tool PARUT is based on this model and is therefore widely applicable.

The recovery technique used for erroneous execution following an SEJ requires the unused area to have a 100% detection capability during the processing of the initial instruction. In the case of the Motorola 68000 microprocessor, this can be achieved by additional circuitry that detects whether invalid address lines have been activated, and if so, impresses a bus error exception signal to the microprocessor. This is acceptable if the used area forms a contiguous block in the memory map. If it does not, then bus-biassing can be used so that the quiescent bus value is loaded to represent an exception instruction format.

One of the major advantages of the detection mechanism placement is seen in the software development cycle. As a post-programming technique, this approach does not constrain the initial software, and the programmer need not be aware of its subsequent application. There is no pre-requisite programming requirement, and the original language of the software is immaterial because the utility is applied to the machine code. The method has wide application, existing software can be processed as a maintenance up-grade, or new software processed for immediate enhancement.

The use of the utility is proposed as part of an overall fault-tolerant strategy, an addition to and not a replacement for, other software and hardware techniques. Watchdogs alone would permit recovery after some interrupt interval, but the intervening period could consist of prolonged erroneous behaviour. The method proposed here reduces the mean period of erroneous behaviour and hence decreases the probability of catastrophic failure.
8. CONCLUSIONS

It has been shown that the injection of exception generating mechanisms into the machine code of a program can enhance the probability of recovery following a transient disturbance. This technique provides coverage for transient events which cause erroneous jumps into the program code. A particular program example shows performance improvement without the need for complex additional hardware. The technique is implemented by a software utility, PARUT, applied to existing program code. The method is therefore transparent to the programmer. The utility can be used to investigate other techniques of fault coverage, and can form part of an overall design strategy for reliable digital controllers.

ACKNOWLEDGEMENTS

The authors would like to acknowledge the support of the UK Science and Engineering Research Council and the British Gas Engineering Research Station, Killingworth, Newcastle upon Tyne, England.

REFERENCES


† Special Note : The probability of outcome for a particular state is represented by the vertical distance of the labelled outcome band shown on the graph.
Abstract

This paper outlines the design of a microprocessor based controller with fault tolerance from transient disturbances. Such events can cause erroneous jumps within executing software. Fault tolerance is achieved through automatic enhancement of the application software. Performance issues are briefly discussed.

1. Introduction

Industrial controllers for monitoring and control are often based on microprocessors. The versatility offered to design aspects, both hardware and software, make them attractive in many industrial situations.

Operating conditions within industrial environments are often harsh. Transient disturbances such as mains power fluctuations [1], and electro-magnetic radiation [2] may occur. Analogue systems effectively 'filter' these events without losing their control function by passing the transient event as a temporary signal discrepancy. Digital systems having a discrete nature are much more liable to lose all control function. Transients have been observed to cause between 80 and 90% of digital system failures [3,4]. It is important that digital systems should incorporate mechanisms for recovery from transient failures.

This paper considers those transient events which lead to corruption of bus information, memory contents, and register contents of a microprocessor system. Such corruption can induce erroneous behaviour whilst no permanent hardware damage occurs. If control of the microprocessor can be restored then overall system recovery is attainable.

Most techniques available replicate hardware and/or software, involve particular programming style, or require complex dedicated hardware. All these techniques are expensive in design and/or construction and are generally tailored to individual applications. The technique described here consists of automated software enhancement transparent to the programmer. The technique is directly applicable to a range of microprocessor systems and involves the self-detection of erroneous behaviour.

2. Erroneous Behaviour

We consider erroneous behaviour to be initiated by the corruption of the microprocessor's program counter. The erroneous behaviour produced by the microprocessor is characterised as a sequence of erroneous states terminated either by catastrophic failure or system recovery.

Execution states can be defined in terms of the outcome of each operation [5]. The possible states are defined below.

Erroneous execution may be within the used (program, data, or I/O reserved) area, or the unused area of the microprocessor address space. Recovery is attained through a restart state which vectors execution to a predefined memory location which holds the recovery routine.

School of Engineering and Applied Science, University of Durham, DH1 3LE. England.
Non-Jump - leads to the program counter pointing to the location following a valid instruction.

Restart - leads to a jump to a predefined location in the address space.

Unspecified Jump - leads to a jump to a new location in the address space determined by local memory contents.

Return - leads to a jump to an address held in a stack.

Stop/Wait - leads to a cessation of processing; and requires an interrupt or hardware reset to exit from this state.

3. Detection Technique

Erroneous execution can flow through both used and unused regions of the address space of a microprocessor system. Detection is based on the occurrence of a restart state during erroneous execution. Techniques for used and unused areas will now be presented.

3.1. Unused Area Detection

All execution within the unused area is defined to be erroneous and hence total detection coverage is required. Detection is achieved by ensuring all memory locations have the capability of generating restart state when accessed. Unused address space consisting of memory elements has every location filled with a restart outcome instruction [6]. Where there are no memory elements, a restart outcome is achieved through a simple hardware unit. This unit monitors the address bus and any illegal access results in the unit developing an external reset for the microprocessor. The microprocessor will treat the external reset as a restart state.

3.2. Used Area Detection

Erroneous execution within the used area may flow through either program, data, or I/O reserved areas. Each will now be considered.

Program areas consist of opcodes and operands. Erroneous processing of an opcode will follow a valid execution path which is out of phase with the desired execution flow. Fault tolerance can be introduced by implementing software techniques such as recovery blocks or assertion tests [7].

Erroneous processing of an operand as an instruction will lead to erroneous execution dependent upon local memory contents. Such erroneous execution will follow paths, unpredictable to the programmer, through memory, leading to a danger of catastrophic failure. To detect this mode of erroneous behaviour, mechanisms are placed within memory so that any operand processed as a jump instruction will have at its destination a restart outcome instruction. This means that any operand being interpreted as an instruction and developing an erroneous jump will be followed by a restart and hence recovery. A design tool PARUT (Post-programming Automated Recovery UTility) automates this process [8].

Data and I/O reserved areas cannot have their structure altered in the same manner as the program areas. Erroneous execution within these areas may be detected through a watchdog timer.

4. Post-programming Automated Recovery UTility (PARUT)

A software tool called PARUT has been built to implement code enhancement. PARUT can provide automated code enhancement for a range of microprocessors. The tool works on machine code. There is no dependency on the original program language or pre-requisite programming style, hence the enhancement is transparent to the programmer. Enhancement may be provided
before software release or as a maintenance up-grade.

The performance improvement can be quantified with the application of PARUT. This is described in Reference 8. Many other fault tolerant techniques have been suggested but few offer analyses of their performance.

5. System Recovery

System recovery is attained through execution of the recovery routine accessed by the detection restart states. This routine may implement a number of recovery mechanisms including roll-back, roll-forward, or cold-start. The choice of recovery method will often be determined by the specific application of the microprocessor based controller.

6. Discussion

Detection is statistically very rapid. Early results show that detection within 500ns (10 instructions) has a likelihood of 99.9% [8]. A watchdog interval of 100ms, in a similar processor would permit approximately 2000 instructions to be erroneously processed before detection. Extended periods of erroneous execution increase the probability of catastrophic system malfunction.

Traditional methods of fault tolerance for this class of failure have involved large redundancy, or complex dedicated hardware, both very expensive and specific to a particular microprocessor. The technique presented here is easily transferable to other microprocessors, and its implementation, based on software, is automated.

7. Acknowledgements

The authors would like to thank the Science and Engineering Research Council, and British Gas plc (Killingworth, Newcastle-upon-Tyne) for sponsoring this work.

8. References


1. INTRODUCTION

Fault tolerance for system architectures is typically associated with multiple levels of redundancy, common examples are duplex, triplex, and quadruplex. Such system architectures are sometimes referred to as NMR (N-Modular Redundancy). Duplex systems can identify the module in error and switch to continue processing on the standby module. NMR systems, of a higher or than duplex, mask errant modules. Whilst these architectures offer very high reliability, their application also incurs in excess of 100% redundancy. The associated cost of this overhead may be significant and perhaps unacceptable for low budget systems. In such situations a uniprocessor fault tolerant approach may be appropriate.

2. FAULTS AND FAILURES

A significant hazard for all processor systems is that of temporary fault generation. Temporary faults, unlike permanent faults, incur no physical damage and have a limited duration and hence system recovery is possible without physical repair. Studies have suggested that temporary faults cause between 10 and 50 times more processor system failures than permanent faults.

Temporary faults can be classified as transient or intermittent. Transient faults occur unpredictably and are generated by environmental influences on the processor system such as electro-magnetic radiation, alpha-particles, power supply disturbances, and radio-frequency interference. Intermittent faults are recurring temporary faults and are indicative of imminent permanent fault generation.

3. CAPABILITY CHECKING

Namjoo & McCluskey [1982] first used the term 'capability checking' to describe a self-detection scheme that could be implemented by a uniprocessor system to identify errant processing. Since then, it has become evident in the literature that the collective application of a selection of capability checks provides the best method of achieving a highly reliable uniprocessor system.

The fault tolerant techniques proposed for capability checking in uniprocessor systems employ various strategies to detect erroneous processing. The strategies are based on the identification of different processing characteristics associated with erroneous behaviour. Implementation of these techniques incurs an overhead, physical (software and/or hardware) and/or processing (time). The various capability checks reported in the literature are summarised below.

Clive Preece is and Guy Wingate was formerly with the School of Engineering and Applied Science, University of Durham. Guy Wingate is now with ICI Engineering (Improved Manufacturing Systems Technology Group), Chilton House, Billingham, Cleveland.
A. Watchdog Timer

This is one of the most basic fault tolerant techniques, and involves a dedicated hardware unit called a 'watchdog timer'. The watchdog timer is incorporated into the processor system in such a way that failure of the processor to reset the watchdog timer periodically will result in the watchdog sending an alarm signal, representing unidentified failure, back to the processor. Watchdog timers incur a small switching overhead. A disadvantage of the watchdog timer as a stand-alone technique is that the timer interval to detection can allow many hundreds of instructions to be processed haphazardly. The length and effect of this period of malfunction vary will vary between individual processor types and their applications.

B. Fetch Invalid Instruction

Most processor architectures have defined and undefined instruction opcodes. Some instruction sets, however, do not specify the action of their undefined instructions which may or may not have an operation. Two good examples are the Motorola 68000 in which all possible instruction opcodes have a specified operation, and the Intel 8085 which does not specify all its opcodes of which some undefined opcodes have useful operations [Denhardt, 1979]. In order to prevent the execution of an opcode of unknown operation, only defined instruction fetches should be allowed - all illegal instruction fetches should be detected. This technique requires additional decoder circuitry to be added to the uniprocessor system.

C. Invalid Opcode Address

Glaser & Masson [1982] proposed a SAFE ROM whereby an extra memory-bit is attached to each memory unit (usually a byte) to signify usage as an opcode of operand. Interpretation of an instruction activates decoding of the 'usage' bit and if the location is not specified as an opcode then erroneous processing is assumed to have been identified. The technique incurs a memory overhead and additional circuitry to decode the extra memory-bit. Furthermore, the techniques cannot be implemented in those locations in the address space implementing Random Access Memory (RAM) or non-existant memory.

D. Invalid Read/Write Within Permitted Memory

A dedicated hardware unit can be embedded within the processor system to ensures that a read is not made from a write only address, eg. a specified output port, or that a write is made to a read only address, eg. a Read Only Memory (ROM) location.

E. Unused Memory Access & Non-Existant Memory Access

A commonly used technique involves additional circuitry to check that the address bus does not carry signals accessing unused locations of physical memory or addresses without resident physical memory in the address space.

F. Invalid Branch

An invalid branch involves the incorrect interpretation of an instruction as a branch and should not be confused with interpreting a valid branch instruction whose destination is incorrect. A technique has been proposed by Wingate & Preece [1989] in which software detection blocks are strategically placed within the application code at locations identified as destinations of invalid branch instructions. A complementary hardware unit called an Access
Guardian detects unpermitted memory access in parts of the address space not implemented in physical memory.

G. Incorrect Sequence of Instructions

This method, commonly referred to as 'signature analysis', assigns tag values on a cyclic encoding of instruction sequences which are inserted within the software before implementation. Additional circuitry monitors the tags in the software, comparing the tag with a hardware generated tag. A favourable comparison verifies execution, whilst a mis-match of tag values signifies the identification of erroneous behaviour. A good review of this approach can be found in Mahmood & McCluskey [1988].

4. EFFECTIVENESS

The effectiveness of the fault tolerant techniques can be assessed using two parameters, fault coverage and fault latency. Fault coverage is derived from fault insertion experiments (injection, emulation, or simulation) expressing the percentage of faults detected. Fault latency describes the time interval that passes between the fault insertion and its detection.

4.1. Fault Coverage

Schmid et al [1982] identified erroneous program flow as the most prominent exposure feature of uniprocessor failure induced by a fault. They evaluated the individual performance of some of the techniques outlined above. The three most successful techniques, during fault simulations on the Z80 microprocessor, with 63%, 58%, and 56% fault coverage respectively were those based on detecting incorrect sequences of instructions, invalid opcodes, and unused memory access. Similar results are reported by Gunnello et al [1989] and Li et al [1984] for the Z80 and SBR9000 processors respectively.

The reliability of a uniprocessor system can be improved by collectively applying several techniques. Table 1 collates data from three experiments evaluating different combinations of uniprocessor fault tolerant techniques. The fault coverage of the combinations is only an indication of the performance. There will be some variation in the results due to different methods of fault injection employed by the authors.

<table>
<thead>
<tr>
<th>Source</th>
<th>Techniques Employed</th>
<th>Fault Coverage</th>
</tr>
</thead>
<tbody>
<tr>
<td>Schmid et al [1982]</td>
<td>B, C, D, E, G</td>
<td>73%</td>
</tr>
<tr>
<td>Gunnello et al [1989]</td>
<td>C, D, E, G</td>
<td>79%</td>
</tr>
<tr>
<td>Madiera et al [1990]</td>
<td>A, B, D, E</td>
<td>75%</td>
</tr>
</tbody>
</table>

Table 1: Collective Application of Capability Checks
The overall fault coverage might appear low, but some account must be made for benign faults and those faults which generate data value errors and do not disrupt normal program action.

4.2. Fault Latency

Arlat et al [1990] describe a bimodal distribution of fault latencies, i.e. there are two or more distinct classes of error manifestation. Chillarge & Bowen [1989] identify some faults to be dormant, such as stack corruption requiring particular processor activity to exercise the fault, whilst other faults induce 'fast failure'. It is therefore important that error detection be provided with a minimal fault latency in order to detect faults that would otherwise generate a 'fast failure'.

The fault tolerant techniques outlined above have a short fault latency, e.g. the incorrect instruction sequence detection reported by Schmid et al [1982] occurred with a mean latency of 8 μs. The precise latencies for each fault tolerant technique will vary depending on their host uniprocessor.

The collective application of a selection of fault tolerant techniques, whilst improving the fault coverage, incur a higher mean detection latency. Assessment of the detection latency is application specific, further details can be found in the references given in Table 1.

5. FUTURE WORK

Future work is needed to compare the individual and collective effectiveness of all the capability checks listed in this paper. This will facilitate the assessment of the benefits of strategically applying particular selections of capability checks to a uniprocessor application and the overheads they incur.

6. CONCLUSION

Processes or equipment that require low budget and yet reliable control can utilise fault tolerant uniprocessor systems. Such systems have high reliability without the order of magnitude redundancy associated with NMR architectures. Reliability can be further improved by implementing fault tolerant techniques, but this has the effect of increasing fault latency. Nevertheless, fault latency is much less than that introduced by stand-alone watchdog timers traditionally associated with uniprocessor systems. In many industrial applications fault latency is not a critical factor because the equipment or process under control has a response time much longer than the fault latency.

REFERENCES


ANALYSIS OF FAILURE DATA COLLECTED FROM A TMR MICROPROCESSOR CONTROLLER

Guy A.S. Wingate* and Clive Preece
School of Engineering and Applied Science
University of Durham, England

KEYWORDS: Microprocessor System Reliability, Failure Analysis, Real Time Systems, Temporary Faults.

Experimental failure data has been collected over two operational periods, each in excess of one year's duration, from a Triple Modular Redundancy (TMR) microprocessor controller based on the Intel 8085. Failures of embedded microprocessors are diagnosed as due to either temporary or permanent faults. There are few published studies covering temporary fault analysis of microprocessor failures; the research reported here is a valuable addition. Failures attributed to temporary faults are observed to occur approximately 40 times more frequently than those attributed to permanent faults. Further analysis of each data set reveals a very good correlation (0.992 and 0.995) to a constant failure rate, which is associated with an exponential inter-arrival distribution. This paper will be of interest to reliability engineers considering aspects of operational microprocessor system reliability.

1. INTRODUCTION

Microprocessor failures can be diagnosed as due to either permanent or temporary faults [9]. Permanent faults are physical defects, whilst temporary faults have a limited duration and do not incur physical damage. Temporary faults are often divided within the literature into transient and intermittent classes. Transient faults occur unpredictably and are attributed to temporary environmental conditions such as electrical power disturbances, electro-magnetic interference, radio-frequency interference, electro-static discharge, and alpha-particle radiation. Intermittent faults are recurring temporary faults and are associated with the imminent creation of a permanent fault (wear-out phenomena), or are faults whose activation is pattern sensitive.

Reliability engineers have observed electrical systems to exhibit a time dependent failure rate, referred to as the hazard rate, \( Z(t) \). The Weibull function is widely used to describe the hazard rate as it varies during a system's lifetime (known as the 'Bathtub Curve'). The function is

\[
Z(t) = \frac{\beta}{\alpha^\beta} t^{\beta-1}
\]

where \( \alpha \) and \( \beta \) are known as the scale and shape parameters respectively.

The Bathtub Curve divides the lifetime of an electronic system into three phases. Firstly, the 'burn-in' phase involves the identification of premature faults and is modelled by \( \beta < 1 \) : hazard rate decreases. Secondly, the 'useful period' is marked by unpredictable faults, induced by component ageing and/or environmental stress, and is modelled by \( \beta = 1 \) : constant hazard rate. A constant failure rate is a special case of the Weibull function implying an exponential distribution of inter-arrival failure times. Finally, the 'wear-out' phase occurs when faults due to component degra-
2. DESCRIPTION OF MICROPROCESSOR CONTROLLER

Failure data was collected from a TMR microprocessor system, based on the Intel 8085, used to control a gas governor system. The microprocessor system, see Figure 1., was designed and implemented by Pearson [8], and is briefly reviewed below.

The TMR architecture provides fault tolerance for embedded processor failures. A voter is implemented which compares thirty bus channel signals from each processor every 3 μs. The voter outputs the majority agreed value for each signal. Hence, the correct output is guaranteed when not more than one processor fails. The voter implemented by Pearson has additional circuitry which identifies single errant processors, the diagnosis being output on two error flag signals. When more than one processor fails concurrently, the majority decision process breaks down, and the voter outputs a signal to indicate voting failure. Some processor failures require multiple reset attempts (every 3 μs), however, if more than 100 reset attempts are required a steady state failure is assumed to have occurred. The voter can operate at a maximum speed of 3 MHz and is the primary constraint on the microprocessor controller’s operational speed, the Intel 8085 microprocessor being capable of operating at 8 MHz.

<table>
<thead>
<tr>
<th>System</th>
<th>Source</th>
<th>Processor</th>
<th>Technology</th>
<th>Detection Mechanism</th>
<th>Hours Observed</th>
<th>Mean Time To Failure (Hours)</th>
<th>Temporary Faults</th>
<th>Permanent Faults</th>
</tr>
</thead>
<tbody>
<tr>
<td>CMUA</td>
<td>Fuller &amp; Harbison [2] PDP-10</td>
<td>ECL</td>
<td>Parity</td>
<td>5700</td>
<td>44</td>
<td>800-1600</td>
<td>94.8 - 97.3</td>
<td></td>
</tr>
<tr>
<td>Cn*</td>
<td>Siewiorek et al. [10] LSI-11</td>
<td>NMOS</td>
<td>Diagnostics</td>
<td>15000</td>
<td>128</td>
<td>4200</td>
<td>97.0</td>
<td></td>
</tr>
<tr>
<td>C.vmp</td>
<td>Siewiorek et al. [11] TMR LSI-11</td>
<td>NMOS</td>
<td>Crash</td>
<td>15000</td>
<td>97-328</td>
<td>4900</td>
<td>93.7 - 98.1</td>
<td></td>
</tr>
<tr>
<td>Teletra</td>
<td>Morganti et al. [7] UDET 7116</td>
<td>TTL</td>
<td>Mismatch</td>
<td>N/A</td>
<td>80-170*</td>
<td>1300</td>
<td>88.4 - 94.2</td>
<td></td>
</tr>
<tr>
<td>SLAC</td>
<td>Iyer &amp; Rossetti [3] N/A</td>
<td>N/A</td>
<td>Diagnostics</td>
<td>26000</td>
<td>58</td>
<td>2300</td>
<td>97.5**</td>
<td></td>
</tr>
<tr>
<td>CMU-AFS</td>
<td>Lin &amp; Siewiorek, [5] MC 68010/20</td>
<td>NMOS</td>
<td>Diagnosis</td>
<td>212800***</td>
<td>201</td>
<td>6552</td>
<td>97.3</td>
<td></td>
</tr>
</tbody>
</table>

Notes: * Reported by McConnel [6]  
** From which 85% were recovered  
*** 13 applications monitored over 22 months

Table 1: Observed Temporary & Permanent Faults

224
In addition, the microprocessor controller implements self-synchronising clock signals for the embedded microprocessors, a common reset signal, Random Access Memory (RAM) with code protection, Read Only Memory (ROM) duplicated - the reserve copy being selected by a watchdog timer, and an uninterruptable power supply.

3. DATA COLLECTION

The microprocessor controller was in continuous operation from October 1983 to February 1985, and from June 1989 to July 1990. During these periods the controller retained operation through the automatic activation of fault tolerant mechanisms implemented by the system. Failure of the fault tolerant mechanisms resulted in a crash outcome. Instances of automatic recovery are assumed to restore a temporary fault condition because the failure incurred no physical damage. Crashes require operator intervention for recovery, via manual reset or repair, and hence because of their steady state failure condition are attributed to permanent faults.

The controller self-diagnoses failures from which system integrity can automatically be restored. Self-diagnosis information is output by the controller on an attached dedicated printer. A typical print-out is shown in Figure 2. The print-out shows the date and time of the incident, the identified errant processor, and the number of sequential reset attempts required to secure restored system operation.

The collected 1983/85 failure data was recorded over 12299 hours, noted 3 permanent faults and 79 temporary faults, and is referred to as data set 'A'. Similarly, the 1989/90 failure data is referred to as data set 'B' and recorded 3 permanent and 80 temporary faults inducing failure over 9360 hours. In both failure data sets, temporary faults are diagnosed as responsible for 96.3% of controller failures.

4. DATA ANALYSIS

The analysis of temporary fault failure sets 'A' and

![Figure 3. : Collected Failure Data Set 'A'](image-url)
'B' are shown in Figures 3. and 4. respectively. Each figure has three graphs: (a) showing the cumulative number of events against operational time; (b) the failure time of each event as they occurred; and (c) the inter-arrival failure time distribution of the events.

4.1. Failure Occurrence

The observed failures caused by temporary faults are shown in Figures 3(a). and 4(a). for data sets 'A' and 'B' respectively. It is interesting to notice the step function in these figures, particularly of data set 'A'. The three steps of higher failure rate occurrence in data set A are recorded for the months of December 1983, July 1984, and December 1984. The steps are harder to distinguish for data set 'B', but occur at the similar months of December 1989 and May 1990. Other processor systems have been observed to have a workload dependent failure rate [3, 4]. This, however, does not explain the observed failure rate step function for the TMR controller because its workload is deemed to be constant. The method of Lin and Siewiorek [5] applies Dispersion Frame Technique (DFT) to processor system failure data and observes an increased hazard rate associated with an intermittent fault. Such an observation does not fit the TMR controller data because the steps of increased failure rate do not terminate with a failure attributed to a permanent fault. An alternative explanation is that the failure rate is dependent on some external environmental influence but this suggestion cannot be substantiated.

Analysing linear regression on the whole of each failure data set still yields a very good correlation to a constant failure rate, \( \lambda_{LR} \), despite the observed step function. Failure data sets 'A' and 'B' yield Mean Time To Failure (MTTF) parameters \( \lambda_{LR} \) of similar magnitude, 157.7 hours and 112.0 hours respectively. This compares favourably with the sample data MTTF of 147.8 hours and 117.6 hours for data sets A and B. Table 2 summarises the linear regression analysis on the whole of each failure data set.

4.2. MTTF Distribution

The inter-arrival times of failures diagnosed as due to temporary faults are shown in their sequential order of occurrence for failure data sets 'A' and 'B' in Figures 3(b). and 4(b). respectively. The plots suggest a memoryless failure mechanism because the scatter of inter-arrival failure times appears unchanged throughout the observed periods of controller operation.

A histogram distribution of the inter-arrival times for failures attributed to temporary faults for data sets 'A' and 'B' are shown in Figures 3(c). and 4(c). respectively. Histogram bars represent the number of failures.
that occur within sequential time intervals following the last failure event. The intervals, \( \Delta \), for the histogram, distribution plot are calculated using the following equation from Lewis [4]:

\[
\Delta = r(1 + 3.3\log_{10}(N_i))^{-1}
\]  

(2.)

where \( \Delta \) is considered to be a 'reasonable' interval, \( r \) is the range of values taken by the data, and \( N_i \) is the number of data items under analysis.

As, \( \Delta \) is not usually a convenient value for plotting a histogram an approximate value \( \Delta' \) is chosen. Table 3 summarises the \( \Delta \) derivations and the choice of \( \Delta' \). The histogram interval for data set 'A' and 'B' is 80 and 70 hours respectively.

The distributions for data sets 'A' and 'B', yielding a correlation of 0.969 and 0.956 respectively with the Draper & Smith [1] non-linear regression method, are modelled by,

\[
\gamma \exp\{-\lambda_{LR}t\}
\]  

(3.)

where \( \gamma \) is given by,

\[
\gamma = \lambda_{LR}N_i\Delta'
\]  

(4.)

The exponential distribution of the failure data validates the memoryless characteristic observation made earlier for the inter-arrival failure times.

4.3. Controller Unavailability

Some detected failures were only successfully recovered after multiple attempts (every 3 \( \mu \)s) to restore controller operation. Table 4 gives the down time distribution for the failures from which automatic recovery was achieved. Multiple attempts to restore the system integrity implies that the temporary fault causing the failure was still active when recovery action was initiated. Alternatively, a burst of temporary faults may have occurred. The nature of the collected failure data prevents further analysis or postulation.

<table>
<thead>
<tr>
<th>Failure Data Set</th>
<th>( r ) (hours)</th>
<th>( N_i ) (hours)</th>
<th>( \Delta ) (hours)</th>
<th>( \Delta' ) (hours)</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>600</td>
<td>79</td>
<td>82.6</td>
<td>80</td>
</tr>
<tr>
<td>B</td>
<td>500</td>
<td>80</td>
<td>68.7</td>
<td>70</td>
</tr>
</tbody>
</table>

Table 3. : Histogram Distribution Interval

<table>
<thead>
<tr>
<th>Down Time (( \mu )s)</th>
<th>Number of Failures</th>
</tr>
</thead>
<tbody>
<tr>
<td>Data Set 'A'</td>
<td>Data Set 'B'</td>
</tr>
<tr>
<td>0 - 3</td>
<td>59</td>
</tr>
<tr>
<td>3 - 6</td>
<td>20</td>
</tr>
<tr>
<td>6 - 9</td>
<td>0</td>
</tr>
<tr>
<td>9 - 12</td>
<td>0</td>
</tr>
<tr>
<td>12 - 15</td>
<td>0</td>
</tr>
<tr>
<td>15 - 18</td>
<td>0</td>
</tr>
<tr>
<td>18 - ( \infty )</td>
<td>0</td>
</tr>
</tbody>
</table>

Table 4. : Failure Induced Unavailability

<table>
<thead>
<tr>
<th>Failure Data Set</th>
<th>Linear Regression</th>
<th>Data Sample</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>( \lambda_{LR} ) (hrs(^{-1}))</td>
<td>MTTF (hrs)</td>
</tr>
<tr>
<td>A</td>
<td>0.00634</td>
<td>157.7</td>
</tr>
<tr>
<td>B</td>
<td>0.00893</td>
<td>112.0</td>
</tr>
</tbody>
</table>

Table 2. : Linear Regression Analysis
5. RELIABILITY ASSESSMENT

Siewiorek & Swartz [9] describe four evaluation parameters and their inter-relationship for failure distribution analysis: probability density function (pdf), cumulative density function (Cdf), reliability, and hazard function. Pdf, \( f(t) \), defines the probability of a failure occurring at a specific time. Cdf, \( F(t) \), defines the probability of failure occurring at or before a specific duration of operation. Reliability, \( R(t) \), is the probability of not observing a failure before a specific duration of operation. Finally, the hazard function, \( Z(t) \), is defined as the time dependent failure rate. Within this experiment the hazard function appears time independent, denoted by the constant \( \lambda \), hence equation (1.) becomes,

\[
Z(t) = \lambda
\]

Basic reliability theory gives the following parameter relationship,

\[
F(t) = 1 - R(t)
\]

\[
Z(t) = \frac{f(t)}{R(t)}
\]

The probability density function is given by converting the time intervals, used by the experimental results histogram, in equation (3.) into a general time parameter,

\[
f(t) = \lambda_{LR} \exp\{-\lambda_{LR}t\}
\]

Reliability is evaluated using equation (7.),

\[
R(t) = \exp\{-\lambda_{LR}t\}
\]

and the cumulative density function is evaluated from equation (6.),

\[
F(t) = 1 - \exp\{-\lambda_{LR}t\}
\]

The performance parameters of pdf, Cdf, and reliability are evaluated for failures attributed to temporary faults and are plotted in Figures 5., 6., and 7. respectively. In each graph, data set 'A' evaluations are shown.
Analysis of failure data collected from a TMR microprocessor controller

1.0
0.8
0.6
0.4
0.2
0.0
100 200 300 400 500 600
TIME (HOURS)

Solid line : from Data Set 'A'
Broken line : from Data Set 'B'

Figure 7 : Reliability

as solid lines and data set 'B' as broken lines. The pdf plot presents the inter-arrival failure times modelled for the two data sets in Figure 3(c) and 4(c). Data set 'B' exhibits a higher failure rate than data set 'A' which may be due to the aging of the controller or varying working conditions. Figure 6 plots the Cdf for each data set, ie the running sum of the pdf coefficients, and clearly shows the higher failure rate observed for data set 'B'. The effect of the observed failure rates on the controller reliability for each data set is shown in Figure 7. Data set 'A' with the lower failure rate has a higher reliability.

6. DISCUSSION

The aim of the work was to observe the effects of temporary faults on a real control system. The results presented add to the limited body of published data in this area. The failure history of a TMR microprocessor controller has been collected over a 17 and a 13 month period. During each period approximately 80 failures occurred, of which 96% were diagnosed as due to temporary rather than permanent faults. The inter-arrival time of the failures, attributed to temporary faults, follows an exponential distribution. This observation suggests that the controller was operated during its useful period as specified by the Bathtub Curve commonly used by reliability engineers. These results demonstrate the validity of modelling the effects of temporary faults using techniques developed for permanent fault modelling, with constant hazard rates.

ACKNOWLEDGEMENTS

The authors wish to record their debt to former research students Dr. J.C. Pearson and Dr. R.G. Halse who helped to gather the data on which the results in this paper are based. Continued support from the Science and Engineering Research Council and from British Gas is also acknowledged.

REFERENCES

ERRATA:

Three typo-graphical errors appear in the above paper. Mean and Standard Deviation are incorrectly marked on Figure 4b & 5b: the values given there should be ignored. Table 3 should mark $N_i$ as an integer. Correlation in Figure 4a should read ‘0.995’.