TinyML: AI for Microcontrollers
One of a new generation of microcontroller boards with hardware optimised for embedded AI at the ‘Edge’. Image credit: Eta Compute
No new electronic gadget is considered to be ‘high-tech’ nowadays unless it features some form of alleged ‘artificial intelligence’ or AI. The really sophisticated applications require staggering amounts of computer power – can anything be done with a humble microcontroller? I’ll start with some basics first.
What is AI?
One major problem encountered when trying to explain or discuss artificial intelligence is that everyone, whether layman or scientist/engineer, has a firm opinion on the subject. This is because there’s no agreed definition with limits or boundaries. So, a layman will tend to imagine that the current technology is only a small step away from giving us machines that really think, take our jobs, reproduce themselves and eventually see us, their builders, as vastly inferior beings, with inevitably bad consequences. It may happen, but this dystopian vision comes from a major feature of the human brain that current AI does not possess: imagination. Ask a scientist or mathematician working on AI about its capabilities and you will be overwhelmed with incomprehensible jargon, ending up none the wiser but convinced it must be really advanced because you didn’t understand a word. Ask an engineer and the response will probably be rude, as their boss’s boss has read all about AI in a management magazine and wants it to be included in all future products. And anyway, Marketing is insisting that old labels such as ‘Smart’ or ‘Intelligent’ must be replaced with ‘Powered by AI’ or sales will fall off a cliff. Too cynical? Don’t you believe it.
So where are we with a definition? From an engineering perspective, the current concept of AI revolves around a very powerful pattern-detection algorithm based on an artificial version of the neural network found in the human brain. The Artificial Neural Network (ANN) is usually run in simulation on a conventional multi-core computer. Ultimately, the aim is to realise the ANN in hardware using a network of artificial neurons communicating with each other using the brain’s method of timed electrical pulses or ‘Spikes’. Hence the term Spiking Neural Network (SNN). There are three principal tasks involved in creating a working neural network: designing a network model to suit the application, training the network, and finally inferencing or inferring results from the network model.
A particular form of ANN is known as a Convolutional Neural Network (CNN) and consists of an array of interconnected Perceptrons (Fig.1), based on standard digital technology. This simple example has a four-input layer, two intermediate or ‘hidden’ layers, and a two-output layer. When starting a design from scratch, how do you decide on the number of inputs, outputs and importantly, the number of hidden layers?
The first two are relatively easy. Example: For object detection/classification within colour RGB image data from a 320 x 200 pixel digital camera, you will need 320 x 200 x 3 = 192,000 inputs! The output layer will reflect the number of objects that the network will be trained to recognise. So, say, eleven outputs for ten objects including one for ‘nothing recognised’. The hidden layers will at least start out containing 192,000 nodes, but the big question is how many layers are required? In this ‘dense’ model with every node in one layer connected by a weighted link to every node in the next, there is a vast amount of low-level processing required. Each node in the first hidden layer will need 192,000 multiplies for a start (one for each weighted link), and if the number format is 32-bit floating-point you can see that any embedded processor is going to need a hardware Floating-Point Unit (FPU) at least.
Imagine the processing power needed to drive a network trained to recognise hundreds of different objects in HD video with input frames changing at 30fps. And each frame could contain dozens of objects. That’s the kind of performance needed for an autonomous vehicle vision system. If that’s the ML application you have in mind, then you need a very expensive multi-processor system; the kind that measures its performance in many petaFLOPS (10^15 floating-point operations per second). Tesla has recognised that if their car auto-driving computer Autopilot is ever going to deserve the name, it will need exaFLOPS (10^18 FLOPS) just to train it. They have announced the development of their own 362 teraFLOPS (10^12 FLOPS) D1 chip, which will be combined with 24 others on a ‘training tile’. With 120 tiles spread over several cabinets, they should just get throughput to the exaFLOPS level. That should be enough. Probably.
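To put rough numbers on the paragraph above, here is a back-of-envelope calculation in Python. It is only a sketch: it assumes the first hidden layer has the same width as the input layer, counts multiplies only (no additions or activation functions), and takes the training cluster as 3,000 D1 chips in total, which is the product of the chips-per-tile and tile counts quoted above.

```python
# Back-of-envelope sizing for the dense image network described above.
WIDTH, HEIGHT, CHANNELS = 320, 200, 3

inputs = WIDTH * HEIGHT * CHANNELS             # one input per pixel per colour
hidden_nodes = inputs                          # assumption: first hidden layer as wide as the input
multiplies_per_node = inputs                   # one multiply per weighted link
multiplies_per_layer = hidden_nodes * multiplies_per_node
weight_bytes_fp32 = multiplies_per_layer * 4   # 32-bit float = 4 bytes per weight

print(f"inputs:               {inputs:,}")
print(f"multiplies per node:  {multiplies_per_node:,}")
print(f"multiplies per layer: {multiplies_per_layer:,}")
print(f"fp32 weight storage:  {weight_bytes_fp32 / 1e9:.1f} GB")

# Training cluster estimate: 3,000 chips at 362 teraFLOPS each.
D1_FLOPS = 362e12
TOTAL_CHIPS = 3000
cluster_flops = D1_FLOPS * TOTAL_CHIPS
print(f"cluster throughput:   {cluster_flops / 1e18:.3f} exaFLOPS")
```

A single fully connected layer of this size already needs tens of billions of multiplies per frame, which is why nobody builds image networks this way on a microcontroller.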
That’s MegaML, what about TinyML?
I’ve provided the numbers above to force home the point that natural intelligence is vastly superior in every way to that of even the most sophisticated machine on the planet. In spite of all that computer power, consuming vast amounts of electrical power, it’s a strong possibility that vision systems based on CNNs with video input only will never be accurate enough for autonomous vehicles. Given that, what on Earth can you do with a single microprocessor-based SBC? Actually, quite a lot – just don’t get too ambitious. TinyML is causing much excitement in the IoT world (Internet of Things). The concept of connecting millions of household appliances and other sources of data to the ‘Cloud’ via the Internet seemed like a great idea at the time. Data could be collected by these devices, processed by Cloud servers and results or control signals returned. Unfortunately, the sheer volume of processing, not to mention security issues with medical/fitness monitors for example, rather put a damper on things. Until someone thought of doing some of the data processing locally, before sending it to the Cloud – so-called ‘Edge’ processing.
An example of AI at the Edge
The automated factories that are the basis of Industry 4.0 will be full of machines that must be kept running at peak efficiency. As there will be no human operators around to spot signs of imminent breakdown, the machines must be monitored by computers using electronic sensors. It’s unlikely that a single sensor or even type of sensor will be able to indicate a fault condition reliably. So, an electric motor about to suffer bearing failure may start to emit high-frequency sound at certain frequencies related to shaft rotation. This could progress to low-frequency vibration accompanied by an increase in armature current, a drop in rotation speed and an increase in case temperature. All of these changes could be very small initially and go unnoticed by a human monitor until it becomes too late for remedial action, such as applying an oil-can! The monitoring computer processes signals from accelerometers and thermo-sensors attached to bearing housings, ambient temperature sensors and microphones, and makes use of existing data from shaft tachometers and power supply current sensors. Standard computer programming based on an IF… THEN… ELSE structure could be used to make sense of all the data, but AI offers an alternative where real data taken from real machines is used to train a CNN. The advantage of using AI in these circumstances is that given a full set of training data, the network should not only detect all the known input patterns that lead to failures but may locate new, more subtle patterns overlooked by human inspection. A look at the pattern of weights generated might reveal redundant inputs that contribute little or nothing to the analysis. If the inputs used produce false positives or negatives it may suggest that an additional sensor is needed.
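For comparison, this is roughly what the conventional IF… THEN… ELSE approach looks like. It is a minimal sketch only: the sensor field names, the threshold values and the ‘two symptoms means fault’ rule are all hypothetical, invented for illustration and not taken from any real motor.

```python
from dataclasses import dataclass


@dataclass
class MotorReading:
    vibration_g: float      # accelerometer RMS on the bearing housing
    case_temp_c: float      # case temperature
    ambient_temp_c: float   # ambient temperature
    current_a: float        # armature current
    speed_rpm: float        # shaft tachometer


# Hypothetical limits for one imaginary motor.
MAX_VIBRATION_G = 0.5
MAX_TEMP_RISE_C = 40.0
MAX_CURRENT_A = 12.0
MIN_SPEED_RPM = 1400.0


def rule_based_check(r: MotorReading) -> str:
    """Classify a reading with fixed thresholds: 'ok', 'warning' or 'fault'."""
    symptoms = 0
    if r.vibration_g > MAX_VIBRATION_G:
        symptoms += 1
    if r.case_temp_c - r.ambient_temp_c > MAX_TEMP_RISE_C:
        symptoms += 1
    if r.current_a > MAX_CURRENT_A:
        symptoms += 1
    if r.speed_rpm < MIN_SPEED_RPM:
        symptoms += 1

    if symptoms >= 2:
        return "fault"
    elif symptoms == 1:
        return "warning"
    else:
        return "ok"


healthy = MotorReading(0.1, 45.0, 20.0, 9.5, 1450.0)
failing = MotorReading(0.8, 75.0, 20.0, 12.6, 1380.0)
print(rule_based_check(healthy), rule_based_check(failing))
```

The weakness is plain to see: every threshold has to be chosen by hand, and several small drifts that each stay just under their limit pass unnoticed. That combination of weak signals is exactly what a trained network is meant to pick up.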
Clearly, TinyML means what its name implies: it’s designed for relatively small applications with few inputs/outputs, and a low number of objects/conditions to recognise. That makes it ideal for making embedded ‘Things’ a whole lot smarter, reducing the load on the Internet’s communication links and servers.
I’ve already shown that object detection/classification in real-time video requires supercomputer levels of processor power. Rather less ambitious projects can be run on some of the powerful but inexpensive SBCs and ARM Cortex-M microcontroller boards on the market. Low-resolution image object detection is well within the capabilities of the Raspberry Pi 4, BeagleBone AI, Sony SPRESENSE, Arduino Nano 33 BLE Sense and even the Raspberry Pi Pico. The first two are relative processing heavyweights running versions of the Linux operating system. The next two are less-powerful microcontroller boards but feature many on-board sensors. Very basic AI can also be run on 32-bit microcontrollers with no FPU and a minimum of 16KB of memory – the Raspberry Pi Pico falls into this category. When working with low-power processors there are steps that can be taken to speed things up, for example:
- Use 8-bit integer numbers instead of 32-bit floating-point. This loss in resolution will severely affect detection accuracy on very large networks like the car vision system. But 8-bit numbers will be more than adequate for applications capable of implementation on a microcontroller.
- Pooling-layers can be inserted to reduce the number of nodes in subsequent layers. The technique is frequently used with image data where the features required for detection are strong, and reducing the resolution leaves them ‘blockier’ but still recognisable by subsequent smaller layers.
- When processing an RGB colour image (that’s three inputs per pixel), it can actually improve detection if only one colour channel is used.
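The three tricks above can each be sketched in a few lines of plain Python. Treat this as an illustration of the ideas, not of what any particular toolchain does: the function names are my own, and the quantiser uses a single symmetric scale factor, which is simpler than the scale-plus-offset schemes real tools tend to use.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers using one shared scale factor."""
    max_abs = max(abs(v) for v in values) or 1.0   # avoid a zero scale
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [x * scale for x in q]


def max_pool_2x2(image):
    """Halve width and height by keeping the largest value in each 2x2 block."""
    rows, cols = len(image), len(image[0])
    return [
        [max(image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1])
         for c in range(0, cols - 1, 2)]
        for r in range(0, rows - 1, 2)
    ]


def take_channel(rgb_image, channel):
    """Keep one colour channel (0 = R, 1 = G, 2 = B) from an image of RGB tuples."""
    return [[pixel[channel] for pixel in row] for row in rgb_image]


weights = [0.4, -1.0, 0.25, 0.0]
q_weights, scale = quantize_int8(weights)
print(q_weights, dequantize(q_weights, scale))

image = [[1, 2, 0, 0],
         [3, 4, 0, 5],
         [0, 0, 9, 8],
         [1, 0, 7, 6]]
print(max_pool_2x2(image))

print(take_channel([[(10, 200, 30), (40, 50, 60)]], 1))
```

Each weight now occupies one byte instead of four, the pooled image has a quarter of the original nodes, and a single channel cuts the input count by two-thirds.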
Training the Model
The dominant AI development environment at the moment is Google’s TensorFlow. Initially aimed at applications needing supercomputers, there is now a cut-down version for more modest systems called TensorFlow Lite. An even-more cut-down version is now available for microcontrollers with or without FPUs and a minimum 16KB of memory. Unsurprisingly, it’s called TensorFlow Lite for Microcontrollers. Training a CNN so that it sort of works is easy; training it so it doesn’t make mistakes in life-or-death situations is very, very much harder. So hard in fact that no current system for object detection being developed for driverless cars is anywhere near achieving an acceptable level of confidence. A number of fatal crashes in recent years caused by incorrect interpretation of visual data testify to that. Fortunately, TinyML networks, being small, are a lot easier to train. Training for large networks is usually performed by Cloud-based tools because of the need for hugely powerful computers, but for TinyML some tools will run on a local PC. The process of training is deceptively simple. So, for the machine diagnostic application:
- Collect a large number of sensor data sets with each ‘tagged’ according to whether it’s come from a machine known to be failing, or from one running normally. If the network has multiple outputs for reporting different kinds of failure, then all the corresponding input data sets must be so tagged. It is very important to have many examples for each condition to be detected.
- The simulated network is normally filled with random weight values initially. It is not cleared because if all weights (multipliers) are zero then nothing will change when the training starts!
- The training program presents a set of input data to the CNN input, runs the simulation and checks the output to see if it matches the tagged (correct) answer. If not, weights are adjusted and the network is run again. If necessary, this process is repeated until the network ‘converges’ on the right answer. This is a massive over-simplification but it broadly describes the learning process known as back-propagation.
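The three steps above can be sketched as a toy program. This is a deliberately tiny, assumption-laden example: a network with two inputs, three hidden nodes and one output, trained on eight made-up sensor readings (normalised vibration and temperature rise) tagged 0 for ‘normal’ and 1 for ‘failing’. The data, network size, learning rate and epoch count are all hypothetical, and a real project would use a framework such as TensorFlow and not hand-written loops.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, w_out):
    """Run one sample through the network; a constant 1.0 acts as the bias input."""
    xb = x + [1.0]
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, xb))) for ws in w_hidden]
    hb = hidden + [1.0]
    out = sigmoid(sum(w * v for w, v in zip(w_out, hb)))
    return hidden, out


def mean_squared_error(data, w_hidden, w_out):
    return sum((forward(x, w_hidden, w_out)[1] - y) ** 2 for x, y in data) / len(data)


def train(data, n_hidden=3, epochs=2000, rate=0.5, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Step 2: start from random weights, not zeros.
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, y in data:
            # Step 3: present a sample and compare the output with its tag.
            hidden, out = forward(x, w_hidden, w_out)
            delta_out = (out - y) * out * (1.0 - out)
            # Propagate the output error back to each hidden node.
            delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                            for j in range(n_hidden)]
            # Nudge every weight against its share of the error.
            hb = hidden + [1.0]
            for j in range(n_hidden + 1):
                w_out[j] -= rate * delta_out * hb[j]
            xb = x + [1.0]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_hidden[j][i] -= rate * delta_hidden[j] * xb[i]
    return w_hidden, w_out


# Step 1: tagged data sets (hypothetical values): 0 = normal, 1 = failing.
DATA = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.15, 0.3], 0), ([0.3, 0.2], 0),
        ([0.8, 0.7], 1), ([0.9, 0.6], 1), ([0.7, 0.9], 1), ([0.6, 0.8], 1)]

w_hidden, w_out = train(DATA)
for x, y in DATA:
    print(x, y, round(forward(x, w_hidden, w_out)[1], 3))
```

With data this cleanly separated the network settles quickly; real sensor data is noisier and needs far more examples per condition.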
In practice, the process of training can take a long time, with much human intervention to tweak parameters until the network delivers acceptable results. In order to test it properly, fresh input data not used to train the network is run through it in simulation. That should be enough to verify that a TinyML network is ‘intelligent’ enough for the task. Those working on object detection for autonomous vehicles have found that a system that delivers say 95% accuracy in the lab can’t manage better than 50% out on the road. It seems that for this application at least, you never have enough training data.
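Holding back fresh data for testing is simple to arrange. The sketch below shuffles a list of tagged samples, sets aside a quarter that the network never sees during training, and scores a predictor on that held-back portion. The `predict` function here is only a stand-in for a trained network, and the samples and the 25% split are assumptions for illustration.

```python
import random


def split_tagged_data(samples, test_fraction=0.25, seed=0):
    """Shuffle tagged samples and hold back a fraction the network never trains on."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, samples):
    """Fraction of samples whose predicted tag matches the true tag."""
    correct = sum(1 for features, tag in samples if predict(features) == tag)
    return correct / len(samples)


def predict(features):
    """Stand-in for a trained network: a hypothetical threshold on the input sum."""
    return 1 if sum(features) > 1.0 else 0


# Hypothetical tagged readings: 0 = normal, 1 = failing.
SAMPLES = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.15, 0.3], 0), ([0.3, 0.2], 0),
           ([0.8, 0.7], 1), ([0.9, 0.6], 1), ([0.7, 0.9], 1), ([0.6, 0.8], 1)]

train_set, test_set = split_tagged_data(SAMPLES)
print(len(train_set), "training samples,", len(test_set), "held back")
print("accuracy on unseen data:", accuracy(predict, test_set))
```

A high score on the held-back set only tells you the network copes with data that resembles the training set, which is the gap between lab and road described above.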
Inferring a Result
Once ‘trained’, the ‘Inferencing’ phase begins when the network is shown live data and it looks for patterns that match those generated by its training. Results are inferred rather than calculated and are presented in terms of the probability of a match. If it all works in simulation then the network can be Flashed into the embedded microcontroller’s memory and tested in a real-time system.
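What ‘presented in terms of the probability of a match’ means in practice can be shown with a softmax function, one common way of turning raw output-layer values into probabilities. This is a sketch under my own assumptions: the labels, the scores and the 0.6 confidence threshold are invented, and rejecting low-confidence results by threshold is just one option (a dedicated ‘nothing recognised’ output is another).

```python
import math


def softmax(scores):
    """Convert raw output-layer scores into probabilities that sum to 1."""
    peak = max(scores)                       # subtract the peak for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def infer(scores, labels, min_confidence=0.6):
    """Return the most probable label, or 'nothing recognised' if confidence is low."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < min_confidence:
        return "nothing recognised", probs[best]
    return labels[best], probs[best]


LABELS = ["normal", "bearing wear", "overheating"]   # hypothetical conditions
print(infer([0.2, 3.1, 0.4], LABELS))   # one output clearly dominates
print(infer([1.0, 1.1, 0.9], LABELS))   # outputs too close to call
```

The application code then decides what to do with the probability, for example raising an alarm only when the same condition is reported with high confidence several times in a row.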
Cheating, Bias and Insight
Despite having nothing like the capability of a natural brain, CNNs can exhibit behaviours that may seem a bit unnerving at first. Firstly, they can ‘cheat at school’, giving the ‘correct’ answer but for the wrong reasons. This can happen when your training data is not varied enough. For example, a warehouse robot may be trained to recognise green cardboard boxes. So, the training set contains pictures of all shapes and sizes of green box. First time out, the robot makes a grab for a human warehouseman. Why? Because he or she is wearing green overalls. While being trained, the network ‘deduced’ that anything coloured green must be a box – a nice ‘short-cut’ but potentially fatal. Secondly, for similar reasons the machine may produce a biased result. This has happened with criminal record databases loaded with an unrepresentative number of black people’s records. Thirdly, at last a positive feature: insight. CNNs are very good at spotting patterns in large volumes of data missed by a human brain: noticing subtle patterns of behaviour which could identify a serial killer in a vast police database for example. Or finding emerging health trends in a large population by analysing (anonymised) data from fitness watches.
Massive artificial neural networks running on supercomputers may grab the headlines when they beat humans at board games such as Chess and Go. Much smaller TinyML networks may not have the apparent insight to play games or make new discoveries in science and medicine, but they can save the life of a human being, or spot an ailing machine by making some simple deductions based on sensor inputs. They are also less likely to make catastrophic inferences such as mistaking an articulated lorry across the road in front of the car they are driving for a bridge over the road. A lot of new hardware is starting to appear aimed squarely at the TinyML/Edge computing market that everybody believes is going to be the next big thing. The ECM3532 chip from Eta Compute (see heading picture) is designed for embedded AI and is based on an ARM Cortex-M3 core working with an NXP DSP core. Another new device, GAP9 from GreenWaves Technologies, aimed at the same market, contains a cluster of ten cores with a RISC-V architecture. Both of these companies also offer a complete toolchain for developing AI applications with their products. Expect a rash of microcontroller chips optimised for TinyML to appear in the next year or so. In the future, it’s likely AI will dominate the Internet of Things. But not the human race, just yet…