Is AI about to run on every device?
The recent explosion in the use of Artificial Intelligence (AI) is associated for many people with web-based tools like ChatGPT, Gemini or Copilot. However, there has also been growing interest in Local AI, or Edge AI. This is the implementation of AI in individual products rather than in the cloud. It is aimed not at generative AI such as Large Language Models (LLMs) but at locally processing complicated and messy sensor data, such as video feeds or data from an accelerometer or inertial measurement unit.
Prompted by this trend, silicon manufacturers have recently released microcontrollers that advertise performance and efficiency gains by including an on-chip Neural Processing Unit (NPU). In this article, Jasper Rowell and Aidan O’Hare from DCA’s software and electronics team look at what is available and whether this new hardware could push the limits of where Edge AI is practical and feasible.
What is Edge AI, and what can I do with it?
Edge AI refers to any form of AI that is being run locally rather than on a centralised cloud server. This has several benefits, such as lower latency, local data storage for better privacy, and reduced bandwidth requirements, or even the ability to function without an internet connection. However, local implementation comes at the cost of having to have a system powerful enough to run these models, and so poses a challenge for smaller, battery-powered embedded devices.
Traditional embedded systems often run relatively simple firmware, often implementing logical control to process inputs from sensors and drive actuators or user interface elements. Edge AI, however, increases the computational load. The focus moves from a human-defined “if this happens, then do that” set of rules to machine learning (ML) where an input dataset is analysed against a known pattern. The processor has a lot more work to do to implement the necessary matrix operations, requiring more microcontroller operations and increasing the power consumption.
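As a rough illustration of this difference in workload, the Python sketch below contrasts a single rule-based check with one small fully connected neural network layer, counting the multiply-accumulate operations the layer needs. The function names, threshold and layer sizes are hypothetical and chosen only for illustration; they are not taken from any particular product or model.

```python
def rule_based(accel_magnitude_g, threshold_g=1.5):
    """Traditional firmware logic: one comparison per sample."""
    return accel_magnitude_g > threshold_g


def dense_layer(inputs, weights, biases):
    """One fully connected layer with ReLU: a matrix-vector product.

    weights is a list of rows, one row per output neuron.
    Returns the outputs and the number of multiply-accumulates used.
    """
    outputs = []
    macs = 0
    for row, bias in zip(weights, biases):
        total = bias
        for x, w in zip(inputs, row):
            total += x * w
            macs += 1
        outputs.append(max(0.0, total))  # ReLU activation
    return outputs, macs


# Hypothetical sizes: a window of 64 sensor values feeding 32 neurons.
inputs = [0.01 * i for i in range(64)]
weights = [[0.001 * (i + j) for j in range(64)] for i in range(32)]
biases = [0.0] * 32

outputs, macs = dense_layer(inputs, weights, biases)
print(rule_based(1.8))  # True: a single comparison
print(macs)             # 2048 multiply-accumulates for one small layer
```

A practical model stacks several such layers and runs them on every new window of sensor data, which is where the extra processor load comes from.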
Some typical ways that Edge AI can be leveraged are:
- Classification - ML models excel at dealing with complex and combined sensor data, especially for classification where an activity or system behaviour needs to be recognised. This works especially well for applications such as gesture recognition, which can be difficult with traditional processing and is the example that many manufacturers give for their chips. See Nordic’s Edge AI Demo Videos, or DCA’s use of machine learning with a hybrid-reality train driver’s desk layout, where pose estimation and posture recognition models allowed us to analyse how easy it was for a user to reach the system controls.
- Prediction - By using a variety of sensors, the system can make predictions about its environment. This is used in cases such as fault prediction in industrial applications.
- “Always On” cameras can use AI to detect motion so that they only record the required information; this can even be built into the sensor itself, as with Sony’s Always-on Image Sensor.
- Smart devices often use forms of ML and AI for image processing and voice activation. Products like the Amazon Echo send most data externally for more powerful and accurate processing, whereas smartphones will often use a mix of local and cloud-based ML techniques for tasks such as voice recognition and post-processing of photos.
Edge AI has been performing tasks of the type listed above on embedded devices without dedicated NPUs for several years, but these tend to be physically larger and often stationary devices. This is generally because the processing capability needed and/or the power requirements of these systems exceed what is possible for a portable, battery-operated device. At DCA we are continually dealing with the contradictory challenges of achieving the desired high level functionality while keeping products small enough to be readily portable and attractive to use, and often with the additional constraint of running from a compact battery. The more power a device consumes, the larger the battery needs to be, which quickly becomes the key factor determining the product size and form factor. Commercially successful products are created when the right balance is struck between functionality, power requirements and the resulting physical size.
Implementing Edge AI
Implementing Edge AI differs from traditional microcontroller development. As with all AI model development, most of the work is performed in creating the model itself, prior to implementation on the microcontroller. Building an AI model for microcontrollers starts out with the same process as non-microcontroller models, where data is first collected and split into sets. The data used for training is labelled, and the model is then created through rounds of training, validation and optimisation.
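The first step described above, splitting labelled data into sets, could be sketched as follows. This is a minimal Python example using only the standard library; the 70/15/15 split, the `split_dataset` helper and the toy gesture labels are assumptions made for illustration, not a recommended recipe.

```python
import random


def split_dataset(samples, train_pct=70, val_pct=15, seed=0):
    """Shuffle labelled samples and split them into train/validation/test sets."""
    shuffled = samples[:]                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed for a repeatable split
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]          # whatever remains is held back for testing
    return train, val, test


# Hypothetical labelled data: (sensor window, label) pairs.
samples = [([i, i + 1, i + 2], "shake" if i % 2 else "still") for i in range(100)]

train, val, test = split_dataset(samples)
print(len(train), len(val), len(test))  # 70 15 15
```

The training set is used to fit the model, the validation set to guide tuning and optimisation, and the test set is kept aside to check the final result on data the model has not seen.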
To convert this trained ML model into something a microcontroller can use, a framework such as LiteRT for Microcontrollers (formerly TensorFlow Lite for Microcontrollers) can be employed. If the target microcontroller has an NPU then the model will need to be compiled using the manufacturer’s tools, to make proper use of the NPU hardware. Some examples of these tools are the Ethos-U Vela and STM32Cube.AI compilers for the Arm Ethos-U and ST NPUs respectively.
Depending on the manufacturer and architecture of the microcontroller used, tools may exist to simplify the development of Edge AI models, such as no-code tools like Nordic’s Edge AI Lab or TI’s Edge AI Studio. These perform the training and optimisation of the model, exporting directly into a library format that can be used in the embedded code and is optimised for the target NPU hardware. These do, however, have the limitation that they only work for specific pre-defined activities such as object detection.
Most model development then happens separately from the implementation of the embedded software on the microcontroller. Whilst it can still take significant amounts of effort and time to tune the models for better performance, model development remains a very high-level process with limited control over the resulting output. This is especially true for more complex models, which are often regarded as “black boxes”. In such cases, understanding and predicting the inner workings of the models is difficult, if not impossible. This is in stark contrast to implementing traditional data processing techniques on microcontrollers, which tends towards very low-level deterministic code whose behaviour is highly predictable.
Why are microcontrollers starting to include NPUs and are they worth using?
An NPU inside a microcontroller is a dedicated piece of hardware that can, usually, perform matrix operations in parallel. This makes it less versatile than the standard microcontroller core but means it can perform these calculations faster and more power-efficiently. For certain computationally heavy workloads such as neural networks or some mathematical operations, this lets the microcontroller core offload all the difficult processing to this specialist hardware. The claims of improved performance vary wildly between manufacturers. NXP claims an increase in speed of up to 172x when compared to non-NPU performance. Other manufacturers like Nordic and TI claim more modest performance improvements of 7x to 10x. These discrepancies depend partly on the kind of operations that are possible on these devices, where support for 8-bit integer and 16-bit floating point operations varies depending on the manufacturer, and partly on the activity that is being benchmarked. The increased performance claims also come with significant increases in power efficiency when compared to performing the same Edge AI calculations without an NPU. This is in part because NPU-enabled systems can perform the calculations used for AI significantly faster and the power consumption of an added NPU is minimal, so the overall energy used per inference is decreased.
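The reasoning about consumption per inference can be made concrete with a little arithmetic: energy is power multiplied by time, so a shorter inference can use less energy even if the active power is slightly higher. The Python sketch below uses made-up figures (10 mW for 50 ms without an NPU, 12 mW for 5 ms with one); they are assumptions for illustration only, not measurements or manufacturer data.

```python
def energy_per_inference_uj(active_power_mw, inference_time_ms):
    """Energy in microjoules: milliwatts x milliseconds = microjoules."""
    return active_power_mw * inference_time_ms


# Hypothetical figures, chosen only to illustrate the calculation.
cpu_only = energy_per_inference_uj(active_power_mw=10, inference_time_ms=50)
with_npu = energy_per_inference_uj(active_power_mw=12, inference_time_ms=5)

print(cpu_only, with_npu)  # 500 60
```

In this invented case the NPU draws 20% more power while active but finishes ten times sooner, so each inference uses roughly an eighth of the energy.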
A key factor in assessing the performance improvement that an NPU can bring to the standard microcontroller is the level of quantisation used by the AI model implemented on the device. Quantisation in AI is a technique for model optimisation that reduces the size of the model by decreasing the numerical precision of its parameters, which can in turn speed up execution. The challenge is to find the degree of quantisation that provides the desired performance boost without an unacceptable drop in precision and efficacy. The difference in quantisation levels between models is often why the speed-ups quoted by manufacturers can differ so dramatically.
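To show what reducing numerical precision means in practice, here is a minimal Python sketch of one common scheme, affine quantisation to 8-bit integers. It is a simplified illustration rather than the method used by any particular vendor tool, and the helper names and example weights are hypothetical.

```python
def quantisation_params(values, qmin=-128, qmax=127):
    """Choose a scale and zero point that map the value range onto int8."""
    lo = min(min(values), 0.0)   # make sure 0.0 is inside the range
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:             # all values are zero
        scale = 1.0
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantise(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integers in [qmin, qmax], clipping at the ends."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantise(q_values, scale, zero_point):
    """Recover approximate floats from the stored integers."""
    return [(q - zero_point) * scale for q in q_values]


# Hypothetical layer weights.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]

scale, zero_point = quantisation_params(weights)
q_weights = quantise(weights, scale, zero_point)
restored = dequantise(q_weights, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q_weights)
print(max_error <= scale)  # the rounding error stays within one quantisation step
```

Each weight now occupies one byte instead of the four used by a 32-bit float, at the cost of a small rounding error per value. Whether that error is acceptable depends on the model, which is the trade-off described above.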
The arithmetic capability of the microcontroller also has a major impact on the degree of performance enhancement. Attempting to do floating point matrix operations on an embedded microcontroller, particularly one without a floating-point unit, will be a natural weak point. Hence, for models using floating point arithmetic, there are higher potential savings by offloading the calculations to dedicated hardware.
It is also worth mentioning that improved speed does not improve the precision of these models, and whilst it may make larger and more accurate models more feasible to run, available RAM is often the limiting factor for certain types of models and so should be factored in when selecting components. Embedded microcontrollers for small, battery-powered applications generally use internal RAM of relatively modest size though manufacturers seem to mostly offer NPUs in chips that also include an increased amount of RAM to accommodate the models.
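A quick way to sanity-check the memory constraint is to estimate the footprint of a model at different precisions. The Python sketch below assumes, for simplicity, that the weights and the working (activation) buffers both sit in RAM; in many designs the weights are kept in flash instead, so treat this as a pessimistic estimate. The parameter count, buffer size and RAM size are hypothetical.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def model_size_kib(num_params, precision):
    """Storage needed for the model parameters, in KiB."""
    return num_params * BYTES_PER_PARAM[precision] / 1024


def fits_in_ram(num_params, precision, activation_kib, ram_kib):
    """True if the parameters plus activation buffers fit in the available RAM."""
    return model_size_kib(num_params, precision) + activation_kib <= ram_kib


# Hypothetical model and device: 150,000 parameters, 64 KiB of activation
# buffers, 256 KiB of internal RAM.
for precision in ("float32", "float16", "int8"):
    size = model_size_kib(150_000, precision)
    ok = fits_in_ram(150_000, precision, activation_kib=64, ram_kib=256)
    print(f"{precision}: {size:.1f} KiB of parameters, fits: {ok}")
```

With these invented numbers only the 8-bit version fits, which shows how quantisation and available RAM interact when selecting a component.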
So, a device with an NPU would be able to perform Edge AI functions quicker, whilst requiring less power than a traditional embedded microcontroller. This sounds perfect, so should we have an NPU in every product? Some applications may well benefit from improved processing speeds and efficiency, but there are some downsides that need to be considered.
Potential Downsides
The fact that microcontrollers with NPUs exist doesn’t mean they are right for your application. As discussed above, devices have been running Edge AI for several years without a dedicated NPU.
The following considerations should be borne in mind before selecting a processor with an NPU:
- Limited Options - There are far fewer microcontrollers available with a built-in NPU, so for applications that require specific features you may find no options available that meet all your requirements. Often the chips that include NPUs are large, power-hungry devices aimed at embedded Linux or vision-heavy applications. There are far fewer low-power devices available, which can limit their suitability for applications where long battery life is needed. Whilst having an NPU may reduce the amount of energy that a system needs per inference, if the closest suitable chip with an NPU is designed for significantly more demanding tasks than your application requires, then some of the savings might be lost to the inherently higher power draw of these more powerful devices.
- Manufacturer Variance - Not all NPUs are made equal. As different NPUs have different levels of precision and speed, they will naturally be better suited for different applications. For example, the performance of large computer vision models will typically suffer less from being heavily quantised, meaning that an NPU that can only perform 8-bit integer operations may not have any impact on accuracy. This is in contrast to some smaller, highly tuned AI application models that may lose significant accuracy when heavily quantised and where 16-bit floating point support would be required. Ensuring that the speed and accuracy benefits of the particular component match the intended use case is essential.
- Cost - Having an NPU adds more cost to the microcontroller. Comparable TI chips can be seen to be around 40% more expensive for the NPU models in high volumes (at the time of writing, the MSPM0G1107 microcontroller is approximately £0.70 whereas the NPU-enabled MSPM0G5187 is approximately £1.00 – both at 10k quantity). Many manufacturers are also only offering NPUs in their more high-powered chips, which again increases part cost if a less capable microcontroller would suffice for the application.
- Software - To make full use of the hardware, it may well be necessary to use the microcontroller manufacturer’s proprietary, closed-source software, which is designed to optimise models for their hardware. Dependency on the manufacturer’s support for this software could be a commercial concern, especially for web-only tools, which the vendor can update or discontinue at any time, potentially changing or breaking functionality. Providing long term support for a product also becomes difficult if there is no guarantee that the tools used to make the software will exist in the future. The use of these closed-source tools may also complicate submission for regulatory approval where this requires an explanation of the model.
- Data Security - A lot of manufacturers such as TI and Nordic offer no-code alternatives to train the models. While this allows for faster development of products, applications may be limited depending on the capabilities of the manufacturer’s tool. Some custom models are also only available through a web-based service, requiring all the training data to be submitted to the cloud rather than being trained locally, raising potential concerns about confidentiality, IP rights and data protection.
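The point made under Limited Options, that per-inference savings can be lost to a higher baseline power draw, can be checked with a simple duty-cycle calculation. The Python sketch below compares two imaginary devices that each run one inference per second and sleep the rest of the time. All current, timing and battery figures are assumptions for illustration, not data for any real part.

```python
def average_current_ma(active_ma, active_ms, sleep_ma, period_ms):
    """Average current over one wake/sleep cycle."""
    sleep_ms = period_ms - active_ms
    return (active_ma * active_ms + sleep_ma * sleep_ms) / period_ms


def battery_life_hours(capacity_mah, avg_current_ma):
    """Idealised battery life, ignoring self-discharge and converter losses."""
    return capacity_mah / avg_current_ma


# Hypothetical small microcontroller without an NPU: slow inference, tiny sleep current.
small_mcu = average_current_ma(active_ma=5, active_ms=50, sleep_ma=0.002, period_ms=1000)

# Hypothetical larger NPU-equipped chip: fast inference, higher sleep current.
npu_chip = average_current_ma(active_ma=20, active_ms=5, sleep_ma=0.2, period_ms=1000)

print(f"small MCU: {small_mcu:.3f} mA average, {battery_life_hours(200, small_mcu):.0f} h")
print(f"NPU chip:  {npu_chip:.3f} mA average, {battery_life_hours(200, npu_chip):.0f} h")
```

In this invented case the NPU chip uses less charge per inference (20 mA for 5 ms against 5 mA for 50 ms), yet its higher sleep current gives it the higher average draw and the shorter battery life. With a more frequent inference rate the balance would shift towards the NPU, so the comparison needs repeating with real figures for each application.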
While these products may be billed as revolutionary, real-world testing and implementation are required to validate whether the claims made hold true in a particular application. It is inevitably harder and riskier to make a case for adopting these products without more concrete examples of how they perform in real-world applications. Such examples are currently hard to find owing to the devices’ relatively recent availability.
Conclusions
Whilst NPU-enabled microcontrollers may offer significant performance increases for the right application, an initial review suggests that the benefits of adopting NPUs for common applications may be subtle rather than revolutionary. In some devices, the microcontroller power draw is not the limiting factor, and for devices that deal with human interaction, cutting processing times down by milliseconds may have no noticeable effect on the usability of the device.
However, the use of NPUs should certainly be considered in the right application and they will likely find their way into many of the devices that are already running Edge AI software. Their relevance may quickly increase as consumers’ expectations mean that the adoption of Edge AI in products becomes more commonplace and they should be considered as a potentially useful tool for any embedded engineer looking to get the most out of their system. These technologies may not completely replace the traditional processing methods that have developed over many years, but they should be used in tandem to create smarter, more reliable products.