SriramK
A Hybrid Subsystem Architecture to Elevate Edge AI

2 Oct 2025 • 4 minute read

The world of artificial intelligence is moving beyond the cloud and into our everyday devices, from smart sensors to robotics and AR/VR headsets. One of the key components enabling this shift is the neural processing unit (NPU), also known as an AI accelerator: specialized hardware designed to execute AI models. Optimized for neural network, deep learning, and machine learning tasks, NPUs handle the fundamental, math-intensive operations that power these workloads, while CPUs and GPUs handle a wider variety of tasks.

NPU architectures have evolved over time to accommodate the changing AI landscape. This evolution, driven by new and emerging use cases, has led to distinct NPU design philosophies, which can be broadly categorized into three generations, as shown in Table 1 below.

Table 1: NPU Performance and Application Tiers

Type  | Key Architectural Features | Models Supported
Gen 1 | Basic matrix multiplication, fixed-point processing, limited programmability | Convolutional neural networks (CNNs)
Gen 2 | Matrix multiplication with some level of programmability to handle some complex activation functions | CNNs, RNNs, and some transformers
Gen 3 | Massive parallelism, optimized for FP8/FP4/INT8, built-in programmable core to handle more complex activation functions | Large language models (LLMs), large vision models (LVMs)


A deep learning workload comprises a wide range of operations, including data pre-processing, activation functions, and other data transformations. Despite their specialization, NPUs are not a silver bullet for the entire AI pipeline.

Gen 1 NPUs embody the original AI-at-the-edge philosophy and are highly optimized for one thing: massive matrix multiplication, which forms the core of CNN-based models. When these NPUs encounter a layer they don't support, they have no choice but to stop, hand the data over to the main host CPU, wait for it to finish, and then retrieve the results. This creates three major architectural issues in an AI subsystem, illustrated by the sketch after the list below:

  • CPU Bottleneck: A general-purpose CPU is architecturally inefficient at performing the parallel data processing required for these AI layers. This offloading process becomes the slowest part of the entire AI inference pipeline.
  • Data Traffic Jam: Constantly moving large tensors between the NPU's memory, the CPU's caches, and system DRAM consumes significant power and time, adding latency and negating the NPU's efficiency benefits.
  • Increased System Complexity: Software developers must manage this complex, fragmented workflow. The AI model is no longer running on a single accelerator but is partitioned across multiple processors, making performance unpredictable and debugging difficult.
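
To make the offload penalty concrete, here is a minimal, purely illustrative Python simulation of the fallback pattern: a Gen 1 NPU runs only a fixed set of layer types, and every unsupported layer forces the activation tensor to hop between the NPU and the host CPU. The layer list, op names, and the NPU_SUPPORTED set are hypothetical placeholders, not a real driver API or model graph.

```python
# Illustrative only: counts how often a tensor must cross the NPU/CPU boundary
# when a Gen 1 NPU cannot execute a layer natively.
from dataclasses import dataclass

NPU_SUPPORTED = {"conv2d", "relu", "max_pool", "fully_connected"}

@dataclass
class Layer:
    name: str
    op_type: str

def count_handoffs(layers):
    location = "npu"      # where the activation tensor currently lives
    handoffs = 0          # each handoff is a DMA copy plus a stalled NPU
    for layer in layers:
        target = "npu" if layer.op_type in NPU_SUPPORTED else "cpu"
        if target != location:
            handoffs += 1
            location = target
    return handoffs

# A rough transformer block: note how often execution bounces off the NPU.
vit_block = [Layer("patch_embed", "reshape"), Layer("attn_qkv", "fully_connected"),
             Layer("softmax", "softmax"), Layer("attn_out", "fully_connected"),
             Layer("gelu", "gelu"), Layer("mlp", "fully_connected")]
print(count_handoffs(vit_block), "NPU/CPU handoffs for a single transformer block")
```

Even this toy graph shows six handoffs per block, which is exactly the data traffic jam described above.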

These issues have become even more pronounced with the rise of complex transformer models. These models introduce operations such as the Gaussian error linear unit (GELU), layer normalization, Softmax, and complex element-wise operations that a Gen 1 NPU is forced to offload, creating new system bottlenecks.
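
To see why these layers sit poorly on a MAC-only datapath, the short NumPy sketch below implements GELU (tanh approximation), layer normalization, and Softmax; each depends on transcendental functions and reductions rather than matrix multiplies. This is generic reference code for the standard operations, not the NeuroEdge 130 implementation.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU, the form commonly used in transformer code.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token vector to zero mean / unit variance, then scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Example: 4 tokens with a 768-wide embedding (a typical ViT-Base width).
tokens = np.random.randn(4, 768).astype(np.float32)
out = softmax(layer_norm(gelu(tokens), np.ones(768), np.zeros(768)))
```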

A Hybrid Architecture: NPU + AI Co-Processor (AICP)

These limitations warrant a new approach to designing the AI subsystem: a hybrid architecture. Pairing the NPU with a companion such as the Cadence Tensilica NeuroEdge 130 AI Co-Processor, which is designed specifically to handle these offload tasks, can create a more powerful and efficient AI subsystem, simplify the design, and accelerate time to market.

The end-to-end inference flow for this kind of hybrid AI subsystem is a multi-step process that strategically leverages the strengths of both the NPU and the NeuroEdge 130 AICP. The execution of a vision transformer (ViT) proceeds as detailed below; the same flow applies to other models, including LLMs and VLMs. A pseudocode sketch of the flow follows Step 4.

Step 1: Offloaded Pre-Processing: The NeuroEdge 130 AICP performs the initial, data-intensive, and non-MAC-heavy pre-processing tasks, including dividing the image into patches, converting them into tokens, and applying positional encodings.

Step 2: NPU-Centric Compute: Once the data is prepared, the NeuroEdge 130 AICP, acting as the control processor, transfers the data to the NPU, where the MAC arrays execute the math-intensive layers. This streamlined data flow keeps the NPU's expensive parallel units at near-constant, high utilization.

Step 3: Offloaded Post-Processing: After the core computational layers are completed on the NPU, the aggregated output is returned to the NeuroEdge 130 AICP. The AICP then handles the final classification/post-processing tasks, including pooling and the final Softmax activation, operations it is specifically optimized to perform.

Step 4: Output Generation: The final classification probabilities are produced by the NeuroEdge 130 AICP, completing the inference cycle.
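
Below is a compact sketch of the four-step flow, using NumPy as a stand-in for both processors. The function split mirrors the steps above, but every name, shape, and weight is illustrative; this is not the NeuroEdge 130 SDK or a real ViT.

```python
import numpy as np

def aicp_preprocess(image, patch=16, dim=192):
    # Step 1 (AICP): slice the image into patches, project them to tokens,
    # and add positional encodings. Projection weights are random stand-ins.
    h, w, c = image.shape
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    tokens = patches @ np.random.randn(patch * patch * c, dim).astype(np.float32)
    return tokens + positional_encoding(tokens.shape[0], dim)

def positional_encoding(n_tokens, dim):
    pos = np.arange(n_tokens)[:, None]
    i = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle)).astype(np.float32)

def npu_compute(tokens, depth=2):
    # Step 2 (NPU): the MAC-heavy attention and MLP blocks, reduced here to matmuls.
    dim = tokens.shape[1]
    for _ in range(depth):
        tokens = np.maximum(tokens @ np.random.randn(dim, dim).astype(np.float32), 0.0)
    return tokens

def aicp_postprocess(tokens, n_classes=10):
    # Steps 3-4 (AICP): pool the tokens, apply a classifier head, and run Softmax.
    pooled = tokens.mean(axis=0)
    logits = pooled @ np.random.randn(tokens.shape[1], n_classes).astype(np.float32)
    e = np.exp(logits - logits.max())
    return e / e.sum()

image = np.random.rand(224, 224, 3).astype(np.float32)
probs = aicp_postprocess(npu_compute(aicp_preprocess(image)))
print(probs.round(3))
```

The point of the split is that only npu_compute() touches the MAC-heavy matrix work; everything else is data-shaping and control code that belongs on a flexible, programmable core rather than on the NPU's parallel arrays.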

Table 2: Layer-by-Layer Execution Mapping

ViT Layer | Computational Characteristics | Optimal Execution Location | Rationale
Input Embedding & Patching | Data slicing, reformatting, non-MAC ops | Offloaded to AICP | Data pre-processing not suited for parallel NPU cores; requires a flexible, programmable processor
Positional Encoding | Vector addition, low compute | Offloaded to AICP | Low-intensity data manipulation; would idle the NPU's parallel units
Self-Attention Mechanism | High MAC operations, large matrix multiplications | Executed on NPU | Core parallel workload; canonical task for the NPU's tensor acceleration unit
Multi-Layer Perceptron (MLP) Blocks | Extremely high MAC ops, accounts for >50% of total MACs | Executed on NPU | The primary computational bottleneck; the reason an NPU is in the system
Final Layers (Pooling, Softmax) | Low MAC ops (pooling), specialized function (Softmax) | Offloaded to AICP | Non-MAC-intensive and specialized mathematical functions are handled more efficiently by a flexible co-processor
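
One way to operationalize Table 2 is as a static placement map that a graph partitioner or runtime could consult. The sketch below is a hypothetical encoding; the layer names and the default-to-AICP policy are chosen for illustration, not taken from the NeuroEdge 130 toolchain.

```python
# Hypothetical encoding of the Table 2 mapping as data a partitioner could consume.
EXECUTION_MAP = {
    "patch_embedding":     "AICP",  # data slicing / reformatting, non-MAC ops
    "positional_encoding": "AICP",  # low-compute vector addition
    "self_attention":      "NPU",   # large matrix multiplications
    "mlp_block":           "NPU",   # >50% of total MACs
    "pooling":             "AICP",  # low MAC count
    "softmax_head":        "AICP",  # specialized math function
}

def place(layer_name):
    # Default unknown layers to the programmable co-processor rather than the host CPU.
    return EXECUTION_MAP.get(layer_name, "AICP")

for name in ["patch_embedding", "self_attention", "mlp_block", "softmax_head"]:
    print(f"{name:>20} -> {place(name)}")
```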


The Benefits of a Hybrid Architecture

  • Enabling Advanced Features: A more capable AI subsystem allows for the deployment of cutting-edge features, like on-device generative AI, advanced sensor fusion, and multi-modal models, that would be impossible on a Gen 1 NPU alone, creating significant product differentiation.
  • Lower Power and Smaller Area: By using a purpose-built co-processor instead of an inefficient general-purpose CPU, designs can achieve a significant reduction in dynamic power and optimal silicon area, lowering manufacturing costs and extending battery life.
  • Faster Time to Market: The combination of a mature, extensible hardware architecture and a unified software development kit (SDK) reduces development complexity and risk, allowing teams to bring innovative AI-powered products to market faster.

In the next post, we'll look at how this hybrid architecture applies to systems with Gen 2 and Gen 3 NPUs. In the meantime, learn more about the Cadence NeuroEdge 130 AI Co-Processor.

