The AI IQ Test
Intelligence comes in many forms. Emotional and social intelligence might be described as subjective and wholly tied to the human condition, while the ability to process information is not restricted to organic brains. Knowledge processing is how we might describe general intelligence, and on that basis the definition of intelligence has expanded to include machines.
It isn’t hard to see how machines made that leap. We can consider knowledge to be gained through logical reasoning and problem solving. Machines are very good at logical reasoning; it’s what Babbage’s Difference Engine was founded on. But real intelligence also covers inferring, based on that knowledge. Here, the line between basic machines and artificially intelligent machines has been firmly drawn. Inferencing is a major part of AI and machine learning systems. We are now busy putting inferencing into smaller devices, to elevate them from the execution of simple procedures, to making and acting on their own decisions.
We can now say that machines have intelligence, but just how much is still open to question. Measuring an ability to complete a task correctly is a good indication that at least some intelligence is present. But measuring how well that task has been done reintroduces a subjective element.
The Intelligence Quotient, or IQ, was originally derived from mental age divided by chronological age, multiplied by 100. Today it is measured more objectively, by placing a test score within the distribution of scores for a reference population and expressing it in standard deviations from the mean, but it is still only an estimate. Perhaps we shouldn’t put too much emphasis on an IQ score, but it is a helpful way of standardising ability.
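To make the two definitions concrete, here is a minimal sketch in Python. The ratio form follows the original mental-age definition. The deviation form assumes the commonly used scale with a mean of 100 and a standard deviation of 15; the function names and the sample scores are purely illustrative and are not taken from any particular test.

```python
from statistics import mean, pstdev


def ratio_iq(mental_age, chronological_age):
    """Original definition: mental age over chronological age, times 100."""
    if chronological_age <= 0:
        raise ValueError("chronological_age must be positive")
    return 100.0 * mental_age / chronological_age


def deviation_iq(raw_score, population_scores, scale_mean=100.0, scale_sd=15.0):
    """Modern approach: place a raw score within a reference population.

    scale_mean and scale_sd are assumptions (the widely used 100/15 scale).
    """
    mu = mean(population_scores)
    sigma = pstdev(population_scores)
    if sigma == 0:
        raise ValueError("population scores must not all be identical")
    z = (raw_score - mu) / sigma  # standard deviations from the mean
    return scale_mean + scale_sd * z


# A child of 8 performing at the level of a typical 10-year-old.
print(ratio_iq(10, 8))  # 125.0

# A raw score one standard deviation above an illustrative population mean.
print(round(deviation_iq(60, [40, 60]), 1))  # 115.0
```

The same deviation approach could, in principle, rank an AI against a population of other systems sitting the same test, provided that test and that population can be agreed on.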
Applying a standard way of measuring ability is arguably even more important when we apply it to an artificially intelligent device. While a person’s IQ score can’t really tell you much beyond how they performed at that moment on that day, an AI should perform consistently and, if reinforcement learning is used, consistently better over time. That would make an AIQ score much more meaningful and potentially useful, particularly when it comes to selling AI as a service.
It follows that the discussions around how to measure an AI IQ are even more varied and opinionated than those around measuring human intelligence. At the very least, it should be possible to track how an AI improves over time. Unlike people, we should expect an AI not to suffer from age-related degeneration in its intelligence. Logically, the AI being deployed today should just get better with age.
Can AI age?
One thing an IQ test may tell you is how that degeneration affects your ability to perform tasks. The use of brain training applications has become quite popular within certain age groups because they are claimed to slow or reverse the effects of aging, from the point of view of brain activity. An organic brain benefits from regular exercise, much as a muscle does.
The same is probably not true of AI, although perhaps there just isn’t enough data to confirm that yet. But if an AI does improve with use, and that use takes time, can we think of AI as having ages? An adolescent AI would be curious and carefree, while the more mature AI would be responsible and cautious, drawing on its experience of past mistakes. The AI that is deployed at the edge and in volume would need to sit somewhere between these two ages: benefiting from shared knowledge while possibly also contributing to the greater experience.
If you were paying for AI as a service, would you demand an AI that has put away its childish things? In practice, the only way to assess the age of an AI – or perhaps maturity is a better term – would be to measure its capability in some way. The IQ test is raising its hand.
Although it is most commonly found in server farms and data centres, where processing resources are scalable to the point of being almost infinite, it is important to remember that AI is still very much a collaboration between hardware and software. This is even more apparent when we put that AI at the network’s edge, where the processing resources are strictly finite. The processor performance needed to run AI is, without doubt, higher than the resources needed in a regular endpoint. Just how much higher will depend on how much intelligence is needed and how well the hardware can execute it. This is why having some form of standard measurement could be crucial in the near future.
Building a benchmark for ML
When we talk about putting AI at the edge it is generally in the context of using inference models to implement ML (machine learning). These are models that are typically trained as AI systems in larger, less constrained frameworks, such as a cloud computing data centre, and then pared down to create a model that can infer, that is, make decisions based on the data presented. These are still complex systems comprising both specialist hardware and software, and while efforts are being made to port ML models to even the smallest architectures, they still generally require a considerable number of CPU or GPU cycles.
The Embedded Microprocessor Benchmark Consortium, EEMBC, has been developing independent benchmarks for over 20 years. One of its most recent is the MLMark benchmark, which characterises machine learning inference on edge devices. A guiding tenet for EEMBC is that its benchmarks must be reproducible, transparent and constrained.
In order to meet those objectives, manufacturers must implement the benchmark using the data set and rules provided. MLMark uses a data set of images, so the benchmark is a measure of how well a platform can implement ML to detect features in images. While this is a real-world example of how ML is being used, it by no means covers all the possible ways manufacturers may use ML in edge devices. However, it does provide a good baseline for measuring how a specific processor performs, typically using the manufacturer’s software framework.
But for these reasons, the MLMark benchmark cannot provide, and is not intended to provide, a way of measuring the intelligence of any given platform for any given application. That would require a different approach.
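The kind of measurement an inference benchmark makes can be sketched in a few lines of Python. This is not the MLMark harness, whose rules and data set are defined by EEMBC; it is a hypothetical illustration in which `run_benchmark`, `toy_infer` and the sample data are all invented for the example. It runs a fixed data set through an inference function and reports accuracy, median latency and throughput.

```python
import time
from statistics import median


def run_benchmark(infer, samples, labels, warmup=2, repeats=5):
    """Time an inference function over a fixed, labelled data set."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must be the same length")

    # Warm-up passes are not timed, so one-off start-up costs are excluded.
    for _ in range(warmup):
        for x in samples:
            infer(x)

    latencies = []
    correct = 0
    for _ in range(repeats):
        for x, y in zip(samples, labels):
            start = time.perf_counter()
            prediction = infer(x)
            latencies.append(time.perf_counter() - start)
            if prediction == y:
                correct += 1

    total = repeats * len(samples)
    elapsed = sum(latencies)
    return {
        "accuracy": correct / total,
        "median_latency_s": median(latencies),
        "throughput_per_s": total / elapsed if elapsed > 0 else float("inf"),
    }


def toy_infer(features):
    """Stand-in 'model': predicts class 1 when the features sum to more than 1."""
    return int(sum(features) > 1.0)


samples = [[0.2, 0.1], [0.9, 0.8], [0.6, 0.7], [0.1, 0.3]]
labels = [0, 1, 1, 1]
result = run_benchmark(toy_infer, samples, labels)
print(result["accuracy"])  # 0.75
```

Fixing the data set and the procedure is what makes the numbers comparable between platforms. It also shows the limit described above: the score says how fast and how accurately one task was done, not how intelligent the platform is.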
Explain it to me like I’m a 4-year-old
Intelligence in people is measured based on qualitative and quantitative metrics. Children are graded through examination, with average scores providing the framework for classification. The same approach could be applied to AI, if we can come up with and agree on the standard examinations.
Another option is to closely define the type of intelligence embedded in the device, and to what degree. For example, an edge device may predominantly employ logical reasoning but have little or no artificial empathy. That would be a good option for a security system monitoring door and window sensors, but perhaps not so good for access control using facial recognition.
The need for these metrics hasn’t gone unnoticed. The Performance Metrics for Intelligent Systems – PerMIS – workshops have been taking place for the last twenty years. Originally part-funded by government agencies, including the National Institute of Standards and Technology (NIST), in the U.S., PerMIS continues its search.
There is growing interest in measuring artificial intelligence in terms of creativity. Creative services are commonly cited as being less under threat from AI than other activities, particularly those that can be mechanically automated. But, recently, examples of artificial creativity have become more frequent. They include teaching an AI not only to play a musical instrument but also to compose a score, write a poem or paint a picture.
This puts the measurement of intelligence firmly back on the side of the subjective. It is difficult for people to agree on the emotional intelligence of a piece of art, so how are we supposed to judge an AI’s ability to create something that embodies emotional intelligence?
Putting AI at the edge
In general, understanding how much AI is ‘enough’ really matters when the resources available are limited. If we can’t currently work out how much AI is needed for a given application without trying, then perhaps we just need to get on with trying?
This means putting the resources in the right place. For edge devices, which are typically small, low power and cheap (but not necessarily all or any of those things), the processor used will represent the biggest part of the budget under all three of those headings. If we assume that ML is going to be implemented, then there are two options: put more processing resources into the edge device or squeeze the ML down so it fits in the processing resources available. Both of these scenarios have their merits.
Manufacturers can now offer much more performance per watt. Techniques such as power and clock gating help keep system power low, while the use of hardware accelerators and high performance memory keeps execution power low. And, on top of all this, the industry continues to drive feature sizes down, which comes with the benefit of lower operating power.
At the same time, a lot of effort is going into optimising the software models generated by training. Researchers are finding new ways to prune the neural network so that the inference engines are smaller, requiring fewer system resources. Platforms including the uTensor framework and TensorFlow Lite Micro are helping to forge the path here.
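Pruning takes many forms, and one of the simplest is magnitude pruning: weights closest to zero are assumed to contribute least and are set to zero. The sketch below, in plain Python, illustrates that one technique only. It is not how uTensor or TensorFlow Lite Micro implement it, and the function names and the weight values are invented for the example.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of a layer's weights.

    weights is a list of rows (a 2-D weight matrix); sparsity is the
    fraction of weights to remove, between 0.0 and 1.0.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0.0 and 1.0")

    magnitudes = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(magnitudes) * sparsity)
    if n_prune == 0:
        return [row[:] for row in weights]

    # Every weight at or below this magnitude is removed. Ties at the
    # threshold can prune slightly more than the requested fraction.
    threshold = magnitudes[n_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def sparsity_of(weights):
    """Fraction of weights that are exactly zero."""
    flat = [w for row in weights for w in row]
    return sum(1 for w in flat if w == 0.0) / len(flat)


layer = [[0.5, -0.05, 0.3],
         [0.01, -0.8, 0.2]]
pruned = magnitude_prune(layer, sparsity=0.5)
print(pruned)               # [[0.5, 0.0, 0.3], [0.0, -0.8, 0.0]]
print(sparsity_of(pruned))  # 0.5
```

The zeros only save memory and cycles if the storage format and the runtime can skip them, and in practice a pruned model is usually retrained to recover accuracy, which is where the hardware and software collaboration comes back in.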
This is where the real effort seems to be going at the moment, not in evaluating how to measure AI or even estimate how much intelligence it takes for a given task, but in making AI easier to deploy, wherever it can be used. We may never develop a really satisfactory way of measuring AI or modelling how much AI an application needs, but if we win the race to make processing resources more affordable, in every sense, then perhaps it doesn’t really matter.