The Hardware Needed to Power AI at the Edge
Device and Model Optimisations Bring Agentic AI to the Edge
The rise of agentic artificial intelligence (AI) promises to transform embedded systems. To achieve high-level goals defined by the user, devices will use agentic AI to adapt autonomously to changing conditions and perform tasks without external intervention. This marks a major change in usage compared to conventional reactive AI, where the main use case in edge and embedded systems is to analyse images, video, audio, and other sensor inputs for downstream processing[1]. These traditional models will continue to be important, but they will form part of a larger, AI-orchestrated system.
With agentic AI, sensor data processed by analytical AI models will feed information to reasoning modules that determine the most appropriate actions. In contrast to the agentic systems developed for enterprise workflows, embedded devices will increasingly run all or most of the required AI models locally. Local execution unlocks a number of critical benefits[2]. Decisions can be made with low latency and without constant communication with remote servers, which improves overall reliability. Sensitive data never needs to be sent to a third-party system, which improves privacy. Local execution also reduces bandwidth usage and cloud costs.
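To make this division of labour concrete, the short Python sketch below shows one way such a loop could be organised: a conventional perception model classifies a sensor frame locally and a small reasoning model picks the next action from a fixed set. The function names and action list are hypothetical placeholders rather than part of any vendor SDK.

```python
# Minimal sketch of an edge agent loop: an analytical model interprets the
# sensor stream and a local reasoning model chooses the next action, with no
# cloud round-trip. classify_frame(), generate_plan() and the action table
# are hypothetical placeholders, not APIs from any particular vendor SDK.

from dataclasses import dataclass

@dataclass
class Observation:
    label: str          # e.g. "person_detected", output of a small CNN
    confidence: float

def classify_frame(frame) -> Observation:
    # Stand-in for an NPU-accelerated vision model running on-device.
    return Observation(label="person_detected", confidence=0.93)

def generate_plan(goal: str, obs: Observation) -> str:
    # Stand-in for a small local LLM prompted with the goal and observation.
    return "open_gate" if obs.confidence > 0.9 else "log_event"

ACTIONS = {"open_gate", "sound_alarm", "log_event", "do_nothing"}

def agent_step(goal: str, frame) -> str:
    obs = classify_frame(frame)                  # perception stays local
    action = generate_plan(goal, obs).strip()    # reasoning stays local too
    return action if action in ACTIONS else "do_nothing"   # simple guard-rail

print(agent_step("let authorised staff through the gate", frame=None))
```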
Local processing will take advantage of ongoing developments in AI-focused hardware acceleration and software to deliver this increased efficiency. Research indicates that, for the repetitive and narrowly focused tasks typical of many embedded systems, smaller, highly tailored models can outperform large, generalised models.
Hardware acceleration will be a major consideration in agentic AI at the edge. The reasoning modules employ technologies designed for generative AI, such as the large language models (LLMs) used by services like ChatGPT. The key element is the Transformer structure[3]. This places increased demand on computational throughput during inferencing, the process of running a trained AI model on new data in the field. Embedded devices face significant challenges in delivering the computing throughput demanded by generative AI models[4]. For this reason, developers need to pay attention both to the software overhead of each generative-AI model they deploy and to the hardware acceleration available to run it.
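A rough sense of that demand comes from two widely used approximations: generating one token costs on the order of two operations per parameter, and the weights alone occupy the parameter count multiplied by the storage width. The short calculation below applies these rules of thumb to a 1.1-billion-parameter model; the results are order-of-magnitude estimates, not benchmarks.

```python
# Back-of-envelope estimate of what Transformer inference asks of an
# embedded device. Both formulas are common approximations, not vendor data.

def weight_memory_mb(params: float, bits_per_weight: int) -> float:
    """Memory needed just to hold the weights, in megabytes."""
    return params * bits_per_weight / 8 / 1e6

def flops_per_token(params: float) -> float:
    """~2 operations per parameter per generated token (decoder-only LLM)."""
    return 2 * params

PARAMS = 1.1e9  # a TinyLlama-class model

print(f"FP16 weights : {weight_memory_mb(PARAMS, 16):,.0f} MB")   # ~2,200 MB
print(f"INT4 weights : {weight_memory_mb(PARAMS, 4):,.0f} MB")    # ~550 MB
print(f"Per token    : {flops_per_token(PARAMS) / 1e9:.1f} GOPS")  # ~2.2 GOPS
```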
Streamlined models
Cloud-based models are often associated with sizes routinely measured in tens of billions of parameters, or neuron weights. But organisations developing open-source generative-AI models have produced more streamlined software with 3 billion parameters or fewer. Some models in Hugging Face’s Smol series, developed by the open-source group as a line of size-optimised LLMs, have as few as 135 million parameters. Recent releases have been optimised for agentic use[5]. Another project is TinyLlama, which uses a longer training schedule to improve the accuracy of a compact model based on Meta’s widely used, cloud-oriented Llama 2. TinyLlama has just 1.1 billion parameters[6]. The graph in Figure 1 shows how loss falls steadily with longer training, suggesting that additional training could optimise the model even further. Lower losses translate into higher accuracy during inferencing.
Agentic AI invariably relies on the cooperation of multiple models. Though the top-level agent will probably use a generative-AI model, other sub-tasks do not necessarily need to use the same technology, even for functions such as speech generation. Developers can optimise performance by selecting models that offer the trade-offs they need in terms of power, memory usage, and accuracy.

Figure 1: Training for longer steadily reduced losses in TinyLlama, with no saturation even past 3 trillion tokens (Source: https://kaitchup.substack.com/p/tinyllama-pre-training-a-small-llama)
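Models of this size are small enough to host with lightweight, CPU-only runtimes. As a minimal sketch, the example below loads a compressed (4-bit GGUF) build of TinyLlama through the open-source llama-cpp-python bindings; the file name, thread count and prompt are assumptions for illustration, and any similarly packaged small model would be loaded the same way.

```python
# Minimal sketch: running a small, 4-bit-quantised LLM entirely on-device
# using the llama-cpp-python bindings. The GGUF file path is hypothetical;
# substitute whichever quantised model the project actually ships.

from llama_cpp import Llama

llm = Llama(
    model_path="tinyllama-1.1b-chat.Q4_K_M.gguf",  # ~600 MB on disk
    n_ctx=2048,      # context window; larger values cost more RAM
    n_threads=4,     # match the number of efficient CPU cores available
)

prompt = "Q: The door sensor reports an obstruction. What should the gate do?\nA:"
result = llm(prompt, max_tokens=48, stop=["\n"])
print(result["choices"][0]["text"].strip())
```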
There are further optimisations implementors can make to improve throughput on embedded hardware. Two important techniques are network pruning and quantisation[7], the latter increasingly implemented using microscaling data formats[8]. Pruning reduces the overall number of operations needed for each forward pass through the model during inference. The technique removes neurons that do not have a significant influence on any output; leaving them out of the calculations makes little difference to accuracy.
Quantisation often delivers improved throughput on memory- and compute-constrained devices, frequently with little change in accuracy. It replaces floating-point arithmetic with 8-bit integer arithmetic or even smaller word widths: a 4-bit-quantised version of TinyLlama uses around 600MB of memory. Because it is possible to use many 8-bit arithmetic engines in parallel in place of a single high-precision floating-point unit, an embedded processor with a single-instruction, multiple-data (SIMD) or similar execution pipeline can deliver major improvements in throughput for the same energy and die cost. The TinyML movement has developed further optimisations to reduce the computational and memory overhead of AI models[9].
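The toy PyTorch sketch below applies both ideas to a single linear layer: unstructured magnitude pruning zeroes the least influential weights, and dynamic quantisation stores and executes the layer with 8-bit integers. The layer size and 40% pruning ratio are arbitrary example values, and compressing a full Transformer involves considerably more care than this.

```python
# Toy illustration of pruning and 8-bit quantisation in PyTorch.
# Layer size and the 40% pruning ratio are arbitrary example values.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 512)

# Pruning: mask the 40% of weights with the smallest magnitude, i.e. the
# connections least likely to influence the output.
prune.l1_unstructured(layer, name="weight", amount=0.4)
prune.remove(layer, "weight")          # make the zeros permanent
sparsity = (layer.weight == 0).float().mean().item()
print(f"zeroed weights: {sparsity:.0%}")

# Quantisation: store and execute the layer with 8-bit integer weights.
model = nn.Sequential(layer)
quantised = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantised(x).shape)              # same interface, smaller footprint
```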
In practice, quantisation is used more commonly than pruning in edge-based devices. With Transformer-based models, pruning often demands that the model be retrained afterwards. Though this provides some scope for optimising the memory layout for higher performance, it increases project time and cost significantly.
Pruning is often a poor fit for many hardware accelerators because it leads to less uniform memory accesses. This irregularity reduces the ability to prefetch relevant data into caches using pipelined reads, and the memory controllers that perform these accesses are usually optimised for contiguous blocks of memory. However, it is possible to implement memory-access controllers in hardware that overcome many of these penalties.
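The fragment below illustrates why. Storing a pruned weight matrix in compressed sparse row (CSR) form replaces one contiguous array with separate value and index arrays, so each multiply first has to look up the column it belongs to; the 90% sparsity used here is purely an illustrative figure.

```python
# Why pruned (sparse) weights upset prefetch-friendly memory access:
# CSR storage splits one dense array into values plus index arrays, and the
# column indices must be dereferenced before each multiply. 90% sparsity is
# an arbitrary illustrative value.

import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
dense = rng.standard_normal((256, 256))
dense[rng.random((256, 256)) < 0.9] = 0.0     # prune ~90% of the weights

sparse = csr_matrix(dense)
print("stored non-zeros :", sparse.nnz)
print("column indices   :", sparse.indices[:8], "...")   # irregular pattern
print("dense bytes      :", dense.nbytes)
print("sparse bytes     :", sparse.data.nbytes + sparse.indices.nbytes
                            + sparse.indptr.nbytes)
```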
Hardware acceleration
Hardware acceleration goes hand-in-hand with software-based changes. There are several ways in which hardware designers implement this acceleration, and some platforms, such as the AMD (formerly Xilinx) range of devices, let developers dedicate different parts of the hardware to each of these specific purposes. The most common acceleration technique is the use of SIMD or very long instruction word (VLIW) pipelines that allow multiple data elements to be processed in parallel. SIMD pipelines pair naturally with quantisation: a suitably designed accelerator can handle twice as many operations each cycle by using 8-bit operands instead of 16-bit words. The instructions are typically optimised for vector and matrix arithmetic, as these operations lie at the core of any neural-network model.
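The arithmetic behind that gain is straightforward, as the short calculation below shows for a generic 128-bit SIMD datapath; the register width is an assumption chosen only for illustration, and real accelerators differ in their exact lane counts.

```python
# How narrower operands raise SIMD throughput: a fixed-width register holds
# more lanes when each element is smaller. The 128-bit register width is an
# illustrative assumption; real accelerator pipelines vary.

REGISTER_BITS = 128

for name, bits in [("float32", 32), ("int16", 16), ("int8", 8)]:
    lanes = REGISTER_BITS // bits
    print(f"{name:>7}: {lanes:2d} operands processed per instruction")
# Prints 4, 8 and 16 lanes: halving the word width doubles the work done
# per cycle for the same register file and roughly the same energy.
```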
The other direction in hardware acceleration is memory-access optimisation. Complex address generators can select the right data for each successive operation in a way that guarantees an uninterrupted stream of data to the execution pipeline. This provides a speedup over hardware with more conventional memory interfaces in two ways. One is in handling the sparse matrices that can result from pruning. The other is in improving throughput on the Transformer models required for agentic AI, whose inferencing speed depends more heavily on memory throughput than that of traditional convolutional neural network (CNN) models. Caching, memory layout and access pipelining are therefore important components of Transformer acceleration.
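A simple roofline-style estimate shows the effect. If every generated token reads the full weight set once, achievable tokens per second cannot exceed memory bandwidth divided by model size; the bandwidth values in the sketch below are generic examples rather than figures for any particular device.

```python
# Why Transformer decoding tends to be memory-bound: each generated token
# reads (roughly) every weight once, so memory bandwidth caps the token rate.
# The bandwidth figures are generic examples, not data for any specific SoC.

def max_tokens_per_second(model_bytes: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s * 1e9 / model_bytes

MODEL_BYTES = 600e6   # ~600 MB, a 4-bit TinyLlama-class model

for bw in (8, 25, 100):   # GB/s, spanning LPDDR-class to wider memories
    print(f"{bw:>3} GB/s -> at most {max_tokens_per_second(MODEL_BYTES, bw):.1f} tokens/s")
```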
AI Hardware – Solutions for Edge, Centralised, and Everything In Between
AI workloads, from edge inference to large-scale model training, place unique demands on electronic systems. AI often combines significant computation, high memory bandwidth, and continuous data flows, while adhering to strict latency and power constraints. Engineers must design systems that can process sensor inputs, execute models, and deliver insights reliably, all while maintaining energy efficiency and thermal stability.
Each tier of AI deployment, from sensor nodes and edge devices to near-edge infrastructure and centralised data centres, presents distinct hardware considerations. Optimising performance requires careful balancing of resources while considering integration and communication with other systems.
The right hardware choices directly shape scalability, efficiency, and the viability of emerging use cases. Avnet Silica assists engineers in these critical decisions through its comprehensive expertise and a vast selection of innovative hardware, helping to drive the creation of AI systems throughout all application levels.

Figure 2: Block Diagram of the Qualcomm Hexagon coprocessor (Source: Qualcomm)
Support for both memory and arithmetic acceleration is available in platforms such as the AMD (formerly Xilinx) Zynq UltraScale+ programmable system-on-chip (SoC) devices. These combine Arm® Cortex®-A53 cores with programmable logic. The field-programmable gate array (FPGA) logic on the Zynq gives developers the ability to implement highly tuned custom accelerators for memory accesses as well as arithmetic functions. Developers can also choose to employ AMD’s optimised deep-learning processor IP within the FPGA fabric to accelerate inference for more mainstream neural networks. The Arm® cores provide the throughput for agent-controller functions and for other software components, such as symbolic-reasoning engines, databases, and compilers.
The newer Versal adaptive compute acceleration platform (ACAP) devices include AI Engine DSP blocks. These blocks are designed for the vector and matrix operations used in both complex signal processing and machine learning tasks at the edge. The Vitis AI development toolkit complements this hardware, providing compilers and runtime libraries to map machine-learning applications onto the underlying hardware efficiently.
Qualcomm chose a VLIW architecture for the Hexagon accelerators in its Dragonwing and Snapdragon embedded processors[10]. Hexagon combines parallel vector arithmetic and multidimensional tensor operations with a scalar processor that can run an operating system, such as Linux, independently of the Arm® application processor on the same die. Combining scalar, vector, and tensor engines provides a high degree of software flexibility and supports higher-level neural-network optimisations, such as layer fusion. The seventh generation of Hexagon, used in the Snapdragon 8 processor family, can deliver a sustained throughput of more than 50 tera operations per second (TOPS).
NXP’s i.MX9 family of applications processors incorporates dedicated accelerators for machine-learning applications. The i.MX95 currently has the most powerful accelerator in the NXP range through the inclusion of the eIQ Neutron neural processing unit (NPU). With a throughput of 2 TOPS, the NPU can support Transformer-based workloads as well as the CNNs used for image and sensor analysis.
Renesas’ AI-focused dynamically reconfigurable processor (DRP-AI) pushes performance to as high as 15 TOPS. The RZ/V2M and RZ/V2L devices are optimised for real-time, high-resolution image recognition at very low power, making them well suited to smart cameras and robotic vision.
Summary
Though developed initially with cloud deployment in mind, agentic AI has many applications in edge and embedded systems. By coupling models that have lower resource requirements with AI-acceleration technologies developed for the low-power, real-time environment, Avnet Silica can help design teams take advantage of the latest developments in machine learning and give their products much greater autonomy.
References
[1] https://www.ibm.com/think/topics/agentic-ai-vs-generative-ai
[2] https://www.xenonstack.com/blog/agentic-ai-edge-computing
[4] https://arxiv.org/abs/2106.16006
[5] https://huggingface.co/blog/smollm3
[6] https://www.qualcomm.com/developer/blog/2025/06/optimizing-your-ai-model-for-the-edge
[7] https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf; https://arxiv.org/pdf/2310.10537
[8] https://arxiv.org/abs/2401.02385
