
Author

Paul McLellan
Tags: system in package, SiP, chiplet, more than Moore, 3D packaging

Die-to-Die Interconnect: The UltraLink D2D PHY IP

13 Nov 2019 • 5 minute read

One of the big trends that has been happening somewhat below the radar is the growth of various forms of 3D packaging. I noted this at HOT CHIPS this summer, when a big percentage of the designs were not a single big die but multiple die in the same package. I wrote about it in my post HOT CHIPS: Chipletifying Designs.

We can argue about the implications of "the end of Moore's Law," but it seems clear that there is no longer a compelling economic reason to put your design into 7nm (and below). If you need the performance, the lower power, or the density, then go ahead. Until recently, moving a design to the most leading-edge process not only got you those things, it was also cheaper per transistor. This meant there was an economic rationale to stay on the leading edge even if your design performed perfectly well in the previous-generation node. A competitive dynamic played into this, too: if you didn't move to the advanced node and your competition did, you would be at a big cost disadvantage. For years, decades even, the rule of thumb was that a new process node would roughly double the transistor density while costing about 15% more per mm2, cutting the cost per transistor by a bit over 40%. That is no longer true, as the graph from Lisa Su's keynote at HOT CHIPS showed.
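
To see why the economics flip, it helps to run the arithmetic. Here is a small Python sketch of how cost per transistor behaves under the old rule of thumb versus a steeper wafer-cost curve (my numbers are normalized and illustrative, not actual foundry pricing):

    # Cost per transistor = cost per mm2 / transistors per mm2.
    # All values are normalized, illustrative numbers, not real foundry pricing.
    def cost_per_transistor(cost_per_mm2, transistors_per_mm2):
        return cost_per_mm2 / transistors_per_mm2

    old = cost_per_transistor(1.00, 1.0)    # previous node, normalized to 1
    new = cost_per_transistor(1.15, 2.0)    # classic rule: 2x density, +15% cost/mm2
    print(f"classic scaling: {new / old:.1%} of the old cost per transistor")  # 57.5%

    steep = cost_per_transistor(1.80, 2.0)  # hypothetical: cost/mm2 up 80% instead
    print(f"steeper wafer costs: {steep / old:.1%}")  # 90.0%, the saving nearly vanishes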

Another trend is that 3D packaging has gotten cheaper as a result of many designs running in high-volume production, especially for smartphones and servers. The cost balance has shifted so that integrating everything on one big SoC has become less attractive, while putting multiple die into a package has become more attractive. The crossover point obviously depends on the details of the design and the actual costs, but the direction of the trends is clear. Moore's Law is giving way to what is catchily called More than Moore.

Another manufacturing reality is that very big chips don't yield as well as the same design split into separate die. If the design is a big multi-core processor or an FPGA, it is attractive to split it into multiple die, since the die will all be identical. At the highest end of all, there is a hard limit: the maximum reticle size, the largest die the lithography equipment can image. If a design would be larger than that, there is no alternative to splitting it into multiple die, perhaps separately packaged as a "chipset" but increasingly in some sort of 3D package.
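
The yield effect can be made concrete with the classic first-order Poisson yield model, Y = exp(-A x D0). A minimal Python sketch, assuming an illustrative defect density of 0.2 defects/cm2 (my assumption, not a foundry number), comparing one large die to the same area split into four chiplets:

    import math

    # First-order Poisson yield model: Y = exp(-A * D0).
    # D0 is an assumed, illustrative defect density (defects per cm^2).
    def poisson_yield(area_cm2, d0=0.2):
        return math.exp(-area_cm2 * d0)

    monolithic = 6.0           # cm^2: one 600mm^2 die, near the reticle limit
    chiplet = monolithic / 4   # the same design split into four 150mm^2 die

    print(f"monolithic die yield: {poisson_yield(monolithic):.1%}")  # ~30%
    print(f"per-chiplet yield:    {poisson_yield(chiplet):.1%}")     # ~74%
    # With known-good-die testing, the fraction of wafer area that ends up in
    # working product is just the per-die yield, so the split wins by ~2.5x.
    print(f"usable-silicon ratio: {poisson_yield(chiplet) / poisson_yield(monolithic):.1f}x")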

Parts of a design other than pure digital, such as analog, RF, photonics, or high-speed SerDes I/O, don't benefit from scaling at all. Another reason to leave these parts of the design in an older node is time to market in the very early days of a new node. These parts of the design require test chips and silicon qualification, and leaving them in the older node gets those test chips off the critical path for the first designs in the new node.

A great example that pulls all this together is AMD's Zen 2, which they presented at HOT CHIPS. In the same package, it has several identical compute die and either a client or a server I/O die. The compute die is built in 7nm and has 3.8B transistors. The server and client I/O die are both built in 12nm; the server die, at 8.3B transistors, is much bigger than the client die, at 2B transistors. AMD's Rome server product has eight compute die and the server I/O die in the same package. The client version, Matisse, has two compute die and the client I/O die. See the image from their presentation.

Several other designs presented at HOT CHIPS made a similar decision, putting the pure digital compute engine into an aggressive node and doing a second chip in a less advanced node to hold the SerDes, RF, analog, photonics, or whatever else the design requires.

Chiplets

Once it is accepted that not everything needs to be designed in the same process node, a more modularized approach to SoC design may become attractive in the longer term, with comparatively large numbers of small die known as chiplets.

The chiplet value proposition is:

  • Flexibility to pick the best process node for each part; in particular, SerDes I/O and analog do not need to be on the "core" process node
  • Better yield due to small die size
  • Shorter IC design cycles and lower integration complexity through use of pre-existing chiplets
  • Lower manufacturing costs by purchasing known-good die (KGD)
  • A volume manufacturing cost advantage when the same chiplet(s) are used in many designs

The long-term vision of this approach is that system-in-package (SiP) becomes the new SoC, and chiplets become the new "IP". However, for this to be viable, there need to be standard/common communication interfaces between the chiplets.

The basic idea is not new. Indeed, in Gordon Moore's 1965 Electronics article, in which he introduced what became known as Moore's Law, he also said:

It may prove to be more economical to build large systems out of smaller functions, which are separately packaged and interconnected.

D2D Interconnect

Cadence has created a 7nm UltraLink D2D PHY IP and a test chip (or should that be a test chiplet?) containing our 40G SerDes for chip-to-chip connectivity, along with the die-to-die (D2D) high-bandwidth, low-power, low-latency in-package interconnect. It is low power, uses NRZ signaling (as opposed to PAM-4), and needs no forward error correction (FEC). The top-level design aim was to maximize bandwidth across the edge of the die (the beachfront) without bump pitches so tight as to necessitate expensive silicon interposers (although those can obviously still be used if they are motivated by other reasons, such as HBM memory stacks).

The details are:

  • Line rate of 20-40Gbps
  • ~500Gbps bidirectional bandwidth per 1mm of beachfront
  • Insertion loss of 8dB @ Nyquist (25-40mm)
  • Ultra-low power: ~1.5pJ/bit
  • Ultra-low latency (~2.8ns TX, ~2.6ns RX)
  • DC coupled
  • Forwarded clock, raw BER of 1e-15, no FEC
  • Single-ended NRZ signaling with spatial encoding for signal and power integrity
  • Sideband for link management
  • Targets 130µm bump pitch for MCM applications
  • Also supports microbumps for silicon interposers
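
To get a feel for these figures, here is a quick back-of-the-envelope sketch in Python (my arithmetic, not Cadence's, assuming the full 40Gbps line rate and combining the 500Gbps/mm and 1.5pJ/bit numbers from the list above):

    # Back-of-the-envelope numbers derived from the spec list above.
    line_rate_gbps = 40        # top of the 20-40Gbps range
    bw_per_mm_gbps = 500       # bidirectional bandwidth per mm of beachfront
    energy_pj_per_bit = 1.5

    # Lanes (both directions combined) needed to carry 500Gbps at 40Gbps each:
    print(f"~{bw_per_mm_gbps / line_rate_gbps:.1f} lanes per mm of beachfront")  # ~12.5

    # Power to move 500Gbps across 1mm of die edge at 1.5pJ/bit:
    power_w = bw_per_mm_gbps * 1e9 * energy_pj_per_bit * 1e-12
    print(f"~{power_w:.2f} W per mm of beachfront")  # ~0.75 W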

Here is the 40G PHY eye diagram:

Example Design

An example of the sort of design enabled by this technology is the 25.6Tbps switch shown above. It is built on an organic substrate (cheaper than a silicon interposer). Each chiplet provides 1.6Tbps of bandwidth, so 16 of them give an aggregate bandwidth of 25.6Tbps. The D2D interface is used between the chiplets and the switch core itself.
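
As a sanity check on that arithmetic, and on the power budget the D2D links imply, here is a short sketch, again assuming the ~1.5pJ/bit figure above and assuming (my illustration, not a Cadence figure) that every bit crosses one D2D hop:

    chiplets = 16
    bw_per_chiplet_tbps = 1.6

    total_tbps = chiplets * bw_per_chiplet_tbps
    print(f"aggregate bandwidth: {total_tbps:.1f} Tbps")  # 25.6 Tbps

    # If all of that traffic crosses a D2D link once at ~1.5pJ/bit:
    d2d_power_w = total_tbps * 1e12 * 1.5e-12
    print(f"~{d2d_power_w:.1f} W spent on the D2D links")  # ~38.4 W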

SEMI

With perfect timing, SEMI just published the 2019 Heterogeneous Integration Roadmap. Here is the Overview and Executive Summary.

Learn more about the UltraLink D2D PHY IP.

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.

