Scale Up vs. Scale Out in Modern AI Factories

10 Apr 2026 • 7 minute read

Choosing pod fabrics, planning bisection bandwidth, and managing ordering semantics

Two Worlds Inside the AI Factory

As AI factories scale into tens of thousands of accelerators, architects must navigate two very different networking worlds. Inside a pod, accelerators are expected to behave as a single logical compute organism, synchronizing gradients and activations with precise timing. Outside the pod, the system expands into a distributed cluster where traffic patterns—data ingestion, KV‑cache lookups, multi‑tenant inference—behave more like a wide, shared backbone.

Treating these domains as equivalent leads to unpredictable performance. Understanding their boundaries is what allows large‑scale systems to operate smoothly. The scale-up domain is optimized for performance, while the scale-out domain prioritizes flexibility and cost-effectiveness. This separation allows architects to build systems that are both powerful and efficient.

Inside the Pod: Why Scale‑Up Fabrics Exist

The scale-up domain exists because modern training workloads require accelerators to access each other's memory nearly as if it were local. Tensor parallelism, expert routing, and collective operations demand tightly bounded, highly deterministic latency. Even tiny variations, such as micro-jitter on a link or in a queue, can stall an entire training step. Scale-up fabrics therefore adopt tightly controlled signaling, predictable ordering semantics, and minimal protocol overhead.

To achieve this, scale-up networks are designed for extreme performance, with end-to-end latency on the order of a microsecond or less. Because every microsecond counts at modern GPU speeds, these fabrics strip away heavyweight transport layers and rely on link-level mechanisms such as credit-based flow control, so a sender never transmits into a full receive buffer and data is not dropped. These fabrics emphasize short physical reach, whether via on-board traces, copper cabling, or tightly integrated optics for larger pods. The focus is not on routing flexibility but on preserving a clean, stable memory-semantic environment across dozens or hundreds of accelerators.
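The credit-based flow control mentioned above can be illustrated with a toy model. This is a minimal sketch, not the mechanism of any particular fabric: the class name CreditLink, the slot count, and the "flit" terminology are assumptions made for illustration. The sender holds one credit per free receive-buffer slot and stalls, rather than drops, when credits run out.

```python
from collections import deque


class CreditLink:
    """Toy model of one direction of a credit-based link (hypothetical)."""

    def __init__(self, rx_buffer_slots):
        # The sender starts with one credit per free slot in the receiver.
        self.credits = rx_buffer_slots
        self.rx_buffer = deque()

    def try_send(self, flit):
        """Transmit only if a credit is available; otherwise stall (no drop)."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.rx_buffer.append(flit)
        return True

    def drain_one(self):
        """Receiver consumes one flit and hands a credit back to the sender."""
        if not self.rx_buffer:
            return None
        flit = self.rx_buffer.popleft()
        self.credits += 1
        return flit


def run_demo():
    link = CreditLink(rx_buffer_slots=2)
    sent = [link.try_send(i) for i in range(3)]  # third send must stall
    drained = link.drain_one()                   # frees one slot
    retry = link.try_send(2)                     # now succeeds
    return sent, drained, retry, list(link.rx_buffer)


if __name__ == "__main__":
    print(run_demo())
```

Because the sender can never overrun the receiver, the link is lossless without retransmission timers, which is one reason this style of flow control suits short, tightly controlled links.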

A prime example of this philosophy is NVIDIA's NVLink technology. In systems like the Vera Rubin NVL72, NVLink is extended across the rack to create a single, massive scale-up domain of 72 GPUs. This architecture provides a unified memory space, allowing all GPUs to communicate with extremely high bandwidth and low latency, effectively acting as one super-processor.

NVLink is not alone in this space. UALink is another major standard that enables high-speed, memory-semantic communication among accelerators within a scale-up domain. Developed by a consortium of industry leaders, UALink is designed to provide robust and efficient scaling for heterogeneous AI factories, supporting data sharing and coordination across accelerators in complex computing environments. SUE (Scale-Up Ethernet) and ESUN (Ethernet for Scale-Up Networking) also play important roles in this ecosystem, bringing Ethernet-based approaches to high-speed data exchange in scale-up domains. Together with NVLink, these standards are key enablers of scale-up networks, each designed to maximize parallelism and efficient data exchange across accelerator-rich platforms.

Beyond the Pod: The Scale‑Out Imperative

Once computation crosses the boundaries of a pod, the network's role changes. The system must deliver high aggregate bandwidth across racks and rows, balancing flows across many possible paths. Latency still matters, but determinism gives way to resilience, congestion response, and adaptive routing. Large‑scale inference, often shaped by bursty, user‑driven workloads, puts additional pressure on these networks to handle hotspots, queue buildup, and rapid shifts in load.

Scale-out networks draw from traditional data center architectures, prioritizing flexibility and cost-efficiency. They are built to handle diverse traffic types and connect a vast number of nodes, sometimes across geographically dispersed sites. Their latency is higher than that of scale-up fabrics, commonly microseconds to tens of microseconds inside a data center and rising to milliseconds once traffic crosses sites, but it is sufficient for tasks like data parallelism and connecting different AI pods.

These networks rely on technologies that support multipathing, congestion marking, out-of-order forwarding, and global reach. For instance, RDMA over Converged Ethernet (RoCE) is a common choice for scale-out fabrics, providing remote direct memory access over a standard Ethernet network so that one node can read or write another node's memory without involving the remote CPU. These properties make scale-out fabrics well suited to connecting many pods rather than tightly coupling the accelerators within them.
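Congestion marking, one of the mechanisms listed above, is often implemented with Explicit Congestion Notification using a queue-depth profile in the style of RED/WRED. The sketch below is illustrative only: the function name and the parameters k_min, k_max, and p_max are assumed names, and real switches expose their own configuration knobs and units.

```python
def ecn_mark_probability(queue_depth, k_min, k_max, p_max):
    """Probability of setting the ECN mark on a packet (illustrative profile).

    Below k_min nothing is marked, at or above k_max everything is marked,
    and in between the probability ramps linearly from 0 up to p_max.
    All thresholds use the same (assumed) unit, e.g. kilobytes of queue.
    """
    if queue_depth <= k_min:
        return 0.0
    if queue_depth >= k_max:
        return 1.0
    return p_max * (queue_depth - k_min) / (k_max - k_min)


if __name__ == "__main__":
    for depth in (50, 250, 400):
        print(depth, ecn_mark_probability(depth, k_min=100, k_max=400, p_max=0.2))
```

Marking early and probabilistically lets senders slow down before the queue overflows, which is how a shared fabric signals congestion without dropping packets.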

Planning Bisection Bandwidth: Two Different Philosophies

Before diving deeper, it's important to understand the concept of bisection bandwidth. In network design, bisection bandwidth refers to the total bandwidth available for communication between two equal halves of a network when it is split down the middle. This metric indicates the maximum data capacity that can flow between these halves without bottlenecks, making it a key factor in evaluating how well a large computing cluster can handle simultaneous traffic between nodes.
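As a worked example of this definition, the sketch below estimates bisection bandwidth for a simple two-tier leaf-spine fabric. It is a back-of-the-envelope calculation under stated assumptions: the function names are hypothetical, the cut separates the leaves into two equal halves, bandwidth is counted in one direction, and the spine layer is assumed not to be the bottleneck.

```python
def leaf_spine_bisection_gbps(num_leaves, uplinks_per_leaf, uplink_gbps):
    """One-direction bisection bandwidth of a two-tier leaf-spine fabric.

    Traffic from one half of the leaves to the other must leave through
    the uplinks of the sending half, so those uplinks bound the capacity.
    """
    if num_leaves % 2 != 0:
        raise ValueError("num_leaves must be even to split into equal halves")
    return (num_leaves // 2) * uplinks_per_leaf * uplink_gbps


def oversubscription_ratio(hosts_per_leaf, host_gbps, uplinks_per_leaf, uplink_gbps):
    """Downlink capacity divided by uplink capacity at a leaf (1.0 = non-blocking)."""
    return (hosts_per_leaf * host_gbps) / (uplinks_per_leaf * uplink_gbps)


if __name__ == "__main__":
    # Hypothetical fabric: 8 leaves, each with 16 uplinks of 800 Gbps.
    print(leaf_spine_bisection_gbps(8, 16, 800))   # 51200
    # Hypothetical leaf: 32 hosts at 400 Gbps sharing 8 uplinks of 800 Gbps.
    print(oversubscription_ratio(32, 400, 8, 800))  # 2.0
```

A ratio of 1.0 corresponds to a non-blocking design, while a ratio above 1.0 means that hosts can collectively offer more traffic than the uplinks can carry.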

Bisection bandwidth inside a pod must be as close to ideal as possible because scale‑up traffic is synchronous. A single slow link slows the entire training job. High‑efficiency, non‑blocking designs with extremely low switch latencies are required. The network must deliver consistent bandwidth regardless of which accelerators communicate because every training step repeats the same communication patterns thousands of times.
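To see why a single slow link matters, consider a rough estimate of the bandwidth term of a ring all-reduce. The article does not name a specific collective algorithm, so the ring is an assumption chosen for illustration, and the estimate ignores per-hop latency and overlap with compute. Because every step of the ring waits for all participants, the slowest link sets the pace.

```python
def ring_allreduce_seconds(message_bytes, link_gbps_list):
    """Bandwidth-only estimate of ring all-reduce time (illustrative).

    Each of the n ranks sends 2 * (n - 1) chunks of size message_bytes / n,
    and every step is gated by the slowest link in the ring.
    """
    n = len(link_gbps_list)
    slowest_bytes_per_s = min(link_gbps_list) * 1e9 / 8
    return 2 * (n - 1) / n * message_bytes / slowest_bytes_per_s


if __name__ == "__main__":
    healthy = [400] * 8           # eight hypothetical 400 Gbps links
    degraded = [400] * 7 + [200]  # one link running at half rate
    print(ring_allreduce_seconds(1e9, healthy))   # about 0.035 s
    print(ring_allreduce_seconds(1e9, degraded))  # about 0.070 s
```

In this sketch, one link at half rate doubles the communication time for all eight accelerators, and the penalty recurs on every training step.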

In the scale‑out domain, the goal is different. Here, architects focus on avoiding persistent hotspots rather than chasing theoretical maximums. The network must gracefully handle the unpredictable nature of multi-tenant workloads. Techniques such as packet spraying, congestion marking with ECN (Explicit Congestion Notification), and telemetry‑driven load balancing become more valuable than strict full‑bisection engineering. This shift acknowledges that large clusters cannot be perfectly uniform and that intelligence in the transport layer can compensate for uneven traffic patterns.
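The value of packet spraying can be seen by comparing it with per-flow hashing on a small example. This is a simplified sketch under stated assumptions: the flow names, packet counts, and four-path fabric are invented for illustration, CRC32 stands in for whatever hash a real switch uses, and spraying is modeled as plain round-robin.

```python
import zlib
from collections import Counter


def flow_hash_loads(packets, num_paths):
    """Per-flow hashing: every packet of a flow follows the same path."""
    loads = Counter()
    for flow_id, _seq in packets:
        path = zlib.crc32(flow_id.encode()) % num_paths
        loads[path] += 1
    return [loads[p] for p in range(num_paths)]


def spray_loads(packets, num_paths):
    """Packet spraying: successive packets rotate across all paths."""
    loads = [0] * num_paths
    for i, _pkt in enumerate(packets):
        loads[i % num_paths] += 1
    return loads


# Three hypothetical elephant flows of 1000 packets each over 4 equal paths.
packets = [(f"flow-{f}", seq) for f in range(3) for seq in range(1000)]
hashed = flow_hash_loads(packets, 4)
sprayed = spray_loads(packets, 4)

if __name__ == "__main__":
    print("per-flow hashing:", hashed)
    print("packet spraying: ", sprayed)
```

With only three flows, hashing can occupy at most three of the four paths, so at least one path carries 1000 packets or more, while spraying places 750 on each. The trade-off is that sprayed packets of one flow can arrive out of order, which the receiving endpoint must tolerate.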

Ordering Semantics: Tight vs. Relaxed

The two domains also diverge sharply in how they treat packet ordering. In a scale‑up fabric, memory‑semantic operations depend on predictable sequencing. Even when completions may legally arrive out of order, requests themselves must follow strict rules. This ensures that accelerators participating in collectives or expert routing can trust the timing of the data they receive, which is crucial for deterministic performance.

Scale‑out fabrics relax these constraints. In large routed networks, enforcing strict packet‑level ordering is not only expensive but detrimental to performance. Modern transport protocols are designed to tolerate out‑of‑order delivery while reconstructing messages cleanly at the endpoints. This flexibility allows fabrics to use multipathing more efficiently and avoid head‑of‑line blocking, an essential feature for maintaining high throughput at cluster scale.
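Endpoint reconstruction of out-of-order packets can be sketched as follows. This is a minimal illustration rather than the behavior of any specific transport: the class name MessageReassembler and the use of per-packet sequence numbers are assumptions, and real implementations may instead place each packet directly into memory at its target offset.

```python
class MessageReassembler:
    """Rebuilds one message from packets that may arrive in any order."""

    def __init__(self, total_packets):
        self.total = total_packets
        self.chunks = {}

    def receive(self, seq, payload):
        """Store a packet by sequence number; return the message once complete."""
        if 0 <= seq < self.total:
            # setdefault keeps the first copy, so duplicates are ignored.
            self.chunks.setdefault(seq, payload)
        if len(self.chunks) == self.total:
            return b"".join(self.chunks[i] for i in range(self.total))
        return None


def run_demo():
    r = MessageReassembler(total_packets=3)
    arrivals = [(2, b"out"), (0, b"sca"), (1, b"le-")]  # arrival order differs from send order
    return [r.receive(seq, data) for seq, data in arrivals]


if __name__ == "__main__":
    print(run_demo())  # [None, None, b'scale-out']
```

Because ordering is restored only at the endpoint, the fabric is free to send each packet along whichever path is least loaded.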

The Role of Physical Reach and Signaling

Another divider between scale-up and scale-out behavior is the physical medium. Scale-up links begin with passive copper and PCB traces, which are ideal for very short reach at the lowest power cost. But as pods grow, these copper domains cannot stretch across racks while maintaining signal integrity at modern per-lane speeds (112 Gbps and 224 Gbps).

The result is a progression into pluggable optics, then near‑packaged optics (NPO), and eventually co‑packaged optics (CPO), where the optical I/O is integrated directly with the accelerator package. This transition shrinks electrical trace lengths to mere millimeters, reducing power consumption and latency. This evolution sets the physical ceiling for how large a pod can be before it must spill into the scale‑out network.

Deciding Where the Boundary Should Be

The practical question for architects is not which technology is "better," but where the transition from scale‑up to scale‑out should be drawn. The dividing line moves based on the model's parallelism strategy, the number of accelerators per pod, and the network's signaling constraints.

For example, a system designed for training a single, massive model will push the scale-up boundary outward to encompass as many accelerators as possible. In contrast, a large-scale inference fleet serving many different models will benefit from a smaller pod size, pulling the boundary inward to prioritize the routing flexibility of the scale-out network. Designing AI factories at a modern scale means shaping this boundary deliberately rather than letting cabling limits or switch availability define it by accident.
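One way to reason about where the boundary falls is to check which parallelism dimensions fit inside a pod. The sketch below is a simplified planning aid under loud assumptions: the function name is hypothetical, dimensions are listed from most to least latency-sensitive (tensor parallelism first, data parallelism last), and a dimension stays inside the pod only if the cumulative group size does not exceed the pod size.

```python
def split_parallelism(pod_size, degrees):
    """Assign parallelism dimensions to the scale-up or scale-out domain.

    degrees is an ordered list of (name, degree) pairs, with the most
    latency-sensitive dimension first. Dimensions are packed into the pod
    in that order until the cumulative group no longer fits.
    """
    inside, outside = [], []
    group = 1
    for name, degree in degrees:
        if not outside and group * degree <= pod_size:
            group *= degree
            inside.append(name)
        else:
            outside.append(name)
    return inside, outside


if __name__ == "__main__":
    plan = [("tensor", 8), ("pipeline", 4), ("data", 64)]  # hypothetical job
    print(split_parallelism(72, plan))  # (['tensor', 'pipeline'], ['data'])
    print(split_parallelism(8, plan))   # (['tensor'], ['pipeline', 'data'])
```

In this example, a 72-accelerator pod keeps tensor and pipeline traffic on the scale-up fabric, while an 8-accelerator pod pushes pipeline traffic onto the scale-out network. A real plan would also weigh memory capacity, message sizes, and topology.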

The New Frontier: Hybrid and "Scale-Across" Models

The industry is also exploring ways to blur these boundaries. Current commercially available switches, for instance, incorporate features like lossless connectivity and low latency to make Ethernet-based scale-out networks perform more like scale-up fabrics. This allows for building very large, high-performance clusters using a familiar technology.

Furthermore, as AI workloads grow too large for a single data center, a new concept of "scale-across" is emerging. This approach aims to extend high-performance networking across multiple data center sites, maintaining performance levels comparable to an intra-data center network. This hybrid scaling strategy is critical for supporting the next generation of massive AI models that require computational resources beyond what any single location can provide.

Scale-Up/Scale-Out with Cadence

Cadence delivers top-tier scale-up and scale-out IP solutions for AI factory networks, including:

224G SerDes, the key building block for high-speed data movement, enabling reliable, energy-efficient signaling across copper and optical links to support both short-reach scale-up fabrics and long-reach scale-out interconnects at the highest bandwidth densities.
UALink Controller for unified accelerator memory access, providing a standards-based, memory-semantic interconnect that scales coherent communication across CPUs and accelerators within a pod, supporting tightly synchronized training and heterogeneous compute architectures.
Ultra Ethernet controllers for scalable Ethernet fabrics, delivering congestion-managed, multipath-capable connectivity optimized for large-scale AI clusters, inference fleets, and multi-tenant workloads beyond the pod boundary.
C2C IP for NVLink Fusion for coherent, high-bandwidth, and low-latency communication. Used in NVIDIA scale-up rack architectures like Vera Rubin NVL72, it is available to specialized AI infrastructure developers (e.g., hyperscalers building their own XPUs) via the NVLink Fusion program, of which Cadence is an ecosystem partner.

Explore how Cadence can help you architect your next-generation AI infrastructure.