Trust your life to a computer?
Texas Instruments have recently launched a new range of ARM-core microcontrollers called Hercules. They contain redundant circuits to detect and deal with faults when used in safety-critical applications. Specifically, two Cortex R4F cores run a common program in lockstep and a comparator unit makes sure they produce the same outputs.
Computers are unreliable, prone to making random mistakes. Anybody using a PC for standard office work will testify to that: mysterious error messages, program ‘crashes’ and even the dreaded ‘blue screen of death’. For many years most computer installations’ MTBF (Mean Time Between Failures) could be measured in hours, and most of the problems were down to the hardware. Imagine a computer built with thousands of thermionic valves (tubes): keeping it going for a whole day involved increasing the power supply voltage first thing in the morning to stress weak valves into failure. With these replaced, the machine would work without problems for the next 24 hours – usually. Even later very high performance computers built from early integrated circuits were not much better. It’s a good job these room-sized, power-hungry monsters wouldn’t fit into a train cab or an airplane! The Apollo Guidance Computer is probably the first example of a mobile computer on which lives depended. Fortunately it was monitored by a human brain. Just as well: it became overloaded during the first moon landing, requiring Neil Armstrong to switch to manual control. Mind you, it was human error – a switch left on – that caused the overload in the first place.
Nowadays, it is generally assumed that the hardware is very reliable and that undiscovered errors in the program software cause all the problems. Certainly a great deal of effort has gone into devising software development tools using ‘Formal Methods’, and high-level languages such as Modula-2 and Ada are designed to stop the programmer making mistakes. We still use C, though, and its inherent flexibility and forgiving nature can produce exactly those hidden bugs that cause potentially dangerous failures in a real-time control system. The use of a real-time operating system (RTOS) helps to ensure a solid program structure is developed from the start. Those of you who have followed my efforts to get FreeRTOS working on the RS EDP will know about its commercial version: SafeRTOS.
The reality is though, that the spectre of hardware failure still exists. It is less likely to be a ‘hard’ fault providing the devices are working well within their specified supply voltage and temperature margins, but more likely a transient fault or ‘glitch’. For this reason all mobile computer systems whose failure could lead to injury or death contain redundant circuits with a lot of self-checking and output monitoring. The most basic is Double Modular Redundancy (DMR) with two independent processors running a common program and comparing outputs at fixed intervals. One mismatch can trigger a retry, and if that fails external circuits will shut the whole system down because there is no way to determine which processor has failed! To improve system availability, Triple Modular Redundancy (TMR) or even Quadruplex (QMR) allows one or two hard faults to occur before the system must be shut down. Most modern flight control systems use this approach. Some even go to the extent of using a different microcontroller type in each redundant circuit running software from different programmers to eliminate that most dreaded of situations: the undetectable common-mode fault.
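The TMR voting described above can be sketched in a few lines of C. This is an illustrative software model, not code from any real flight system: each bit of the output takes the value that at least two of the three channels agree on, so a single faulty channel is simply out-voted.

```c
#include <stdint.h>

/* Bitwise majority voter for Triple Modular Redundancy (TMR).
 * For every bit position, the result is whatever value at least
 * two of the three input channels agree on. */
static uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

/* Unlike DMR, TMR can also identify the disagreeing channel for
 * fault logging: returns a bitmask (bit0 = a, bit1 = b, bit2 = c)
 * of channels whose output differs from the voted result. */
static unsigned tmr_faulty_channels(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t v = tmr_vote(a, b, c);
    return (a != v) | ((b != v) << 1) | ((c != v) << 2);
}
```

For example, `tmr_vote(0xFF, 0xFF, 0x00)` returns `0xFF`, and `tmr_faulty_channels` flags the third channel – the diagnostic information a two-processor DMR system cannot provide, which is why DMR must shut down on a persistent mismatch while TMR can carry on.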
Texas Instruments have launched a DMR system on a chip, the most powerful of the range being the TMS570 series based on two 160 MHz ARM Cortex R4F (floating-point) cores. It’s been designed to meet the safety requirements of IEC 61508/SIL3 and contains many error-checking features alongside the redundant cores:
- The processors are locked to the same clock frequency, but one operates 1.5 cycles behind the other. The output of the leading processor is delayed by 1.5 clock cycles to bring the two outputs into line for the benefit of a comparator unit, which generates an interrupt if a difference is detected. Thanks to the offset, a transient power supply glitch or rogue cosmic particle impact is very unlikely to cause both processors to make the same, undetectable, mistake.
- The second core is flipped and rotated to help reduce physical common-mode failures. The cores have a wide separation and there is a guard ring around each to reduce capacitive coupling.
- The memory has error correcting logic and all the communication channels feature parity error detection. All the error signals generate interrupts and it is left to the programmer to handle these.
- Built-in test circuits are included.
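Since the chip leaves error handling to the programmer, the response policy has to live in software. The sketch below shows one plausible shape for it – the register name, status bit and types are invented for illustration; a real TMS570 design would use TI’s device headers and its Error Signaling Module. It implements the DMR policy described earlier: one mismatch triggers a retry, a second forces a shutdown, because with two cores there is no way to tell which one failed.

```c
#include <stdint.h>

/* Hypothetical lockstep compare-error handler sketch.
 * CCM_STATUS and CCM_COMPARE_ERROR are invented stand-ins for
 * the device's real status register and flag. */
#define CCM_COMPARE_ERROR (1u << 0)     /* assumed status bit   */

volatile uint32_t CCM_STATUS;           /* stand-in for a register */

typedef enum { SYSTEM_OK, SYSTEM_RETRY, SYSTEM_SHUTDOWN } fault_action_t;

static unsigned mismatch_count;         /* mismatches seen so far */

/* Called from the comparator's error interrupt. */
fault_action_t lockstep_error_handler(void)
{
    if ((CCM_STATUS & CCM_COMPARE_ERROR) == 0)
        return SYSTEM_OK;               /* spurious interrupt */

    CCM_STATUS &= ~CCM_COMPARE_ERROR;   /* acknowledge the error */

    if (mismatch_count++ == 0)
        return SYSTEM_RETRY;            /* transient glitch: try again */

    return SYSTEM_SHUTDOWN;             /* persistent fault: fail safe */
}
```

A production handler would of course do far more – reset the mismatch count after a successful retry, log the event, and drive the outputs to a known-safe state – but the essential decision tree is this simple.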
With these hardware features coupled with a safety RTOS, it should be possible to bring down the cost of safety-critical systems and even relieve the frustrations of users trying to sort out failures in more mundane applications. The next logical step is a triple-core system that can keep running while flagging either that a transient fault occurred but had no repercussions, or that a hard fault has occurred but maintenance can wait for a more convenient time.