Arm's Shawn Hung (based in Austin) and Cadence's Rod Metcalfe presented on 3D design at Arm DevSummit, in a presentation titled Implementing 3D Neoverse N1: 3D Design Merits Meet In-Depth Analysis. They described an Arm Neoverse N1 implemented on two die that were then attached face to face with a process known as hybrid wafer bonding. This was Arm's first face-to-face wafer-bonded design.
There is a huge increase in interest in, and use of, various forms of advanced packaging that often go under the catchy name "More than Moore". The first chips that used 3D technology were actually the image sensors for cameras, which flipped the image sensor itself over (so the light entered through the back of the thinned die) and then attached it to the image processor, which could then pull data vertically from the sensor rather than having to get the data to the edge of the image sensor die. The next 3D chip that got attention was Xilinx's large FPGA, where they split the array into four identical die and mounted them on an interposer. AMD's product line of CPUs is all built out of a range of die assembled on an interposer. The driver for AMD was that a die that big would not yield, or perhaps not even fit in the reticle, plus the high-end chips used HBM2 memories, which pretty much require an interposer. For a look at the range of designs that are using advanced SiP (system-in-package), see my post HOT CHIPS: Chipletifying Designs.
What Shawn described was something more ambitious still: take a monolithic design, split it into two identically sized die, and then flip the top die over and attach it to the lower die to form a sandwich (as in the picture above). He described it as a test chip, but it is really more of a proof-of-concept, and there are no plans to actually tape out and manufacture it.
There are several motivations for why you might want to manufacture a processor like this:
They did a previous proof-of-concept design called Trishul last year (which they reported on at Arm TechCon although I didn't see it) to prove the readiness of 3D stacking:
For this project, the plan was to stack the main memory on the upper tier of a 3D microprocessor, since main memory is a significant bottleneck. In principle, for a processor with considerable memory demand, increasing the size of the on-chip L2 cache is an efficient approach to improve performance...except increasing the size of the L2 cache increases the time to access this memory. Folding the cache over the top of the logic stages of the pipeline reduces this access time. The design was done in 7nm.
In fact, for thermal reasons, it makes more sense to put the memory (L1 and L2 caches) on the bottom tier and logic on the top tier. This also enabled them to double the size of the L2 cache. It is only possible to build a 1MB L2 cache with a 9-cycle read in 3D. In 2D, it requires two extra cycles.
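As a quick check of what those two cycles mean, here is a small Python calculation. It uses only the 9-cycle figure and the two extra cycles quoted above; the variable names are mine.

```python
# L2 read latency figures quoted in the talk: 9 cycles in 3D,
# two extra cycles for the same 1MB cache in 2D.
CYCLES_3D = 9
EXTRA_CYCLES_2D = 2

cycles_2d = CYCLES_3D + EXTRA_CYCLES_2D
saving = EXTRA_CYCLES_2D / cycles_2d  # fraction of the 2D latency removed

print(f"2D read: {cycles_2d} cycles, 3D read: {CYCLES_3D} cycles")
print(f"3D removes {saving:.1%} of the 2D read latency")
```

So folding the cache takes roughly 18% off the L2 read latency, measured in cycles.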
Rod explained some details about the Cadence 3D-IC solution. I won't repeat that since I've covered that extensively, for example in my post John Park's Webinar on Chiplets from a few months ago.
This is the flow that was used for this design. There are some considerations about what goes where. Blocks that communicate frequently should be assigned to adjacent tiers, since that decreases the length of the inter-block connections. This both increases communication bandwidth and reduces power. But blocks with high switching activity should not be placed on top of each other vertically, so that the temperature profile stays within specified limits. The vertical connections were handled as virtual anchor cells, which are pairs, one on each die, conceptually aligned in 3D (see the example diagram).
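The rule about not stacking high-activity blocks can be sketched as a simple floorplan check. This is my own illustration, not part of the Cadence flow: the Block fields, the normalized activity value, the 0.6 threshold, and the assumption that both tiers share one post-flip coordinate frame are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Block:
    name: str
    tier: int        # 0 = bottom die, 1 = top die
    x0: float
    y0: float
    x1: float
    y1: float
    activity: float  # hypothetical normalized switching activity, 0..1

def overlap_area(a: Block, b: Block) -> float:
    """Area shared by the two block footprints, 0.0 if they do not overlap."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return max(w, 0.0) * max(h, 0.0)

def hot_stacks(blocks, threshold=0.6):
    """Return (bottom, top) name pairs where two high-activity blocks overlap."""
    bottom = [b for b in blocks if b.tier == 0]
    top = [b for b in blocks if b.tier == 1]
    return [
        (lo.name, hi.name)
        for lo, hi in product(bottom, top)
        if lo.activity >= threshold
        and hi.activity >= threshold
        and overlap_area(lo, hi) > 0
    ]

# Made-up floorplan: blk_b and blk_c are both busy and sit on top of each other.
floorplan = [
    Block("blk_a", 0, 0, 0, 10, 10, 0.3),
    Block("blk_b", 0, 2, 2, 6, 6, 0.8),
    Block("blk_c", 1, 0, 0, 5, 5, 0.9),
]
print(hot_stacks(floorplan))
```

A real flow would rely on thermal simulation rather than a fixed threshold, but a check like this catches the obvious conflicts before detailed analysis.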
The design is actually done with the virtual anchor cells connected by a dummy wire that doesn't really exist. Eventually, that wire is removed and the two die flipped. But in the meantime, all the 2D design algorithms work normally.
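The virtual-anchor idea can be sketched in a few lines of Python. This is only an illustration of the concept, not the tool's data model: the Anchor and AnchorPair classes, the flip_top and finalize helpers, and the assumption that the top die is drawn mirrored in x during 2D design are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Anchor:
    die: str   # "bottom" or "top"
    x: float
    y: float

@dataclass(frozen=True)
class AnchorPair:
    bottom: Anchor
    top: Anchor
    dummy_wire: bool = True  # placeholder net seen by the 2D algorithms

def flip_top(anchor: Anchor, die_width: float) -> Anchor:
    """Mirror a top-die anchor in x, as happens when the die is flipped face down."""
    return Anchor(anchor.die, die_width - anchor.x, anchor.y)

def finalize(pairs, die_width):
    """Remove the dummy wires and express top-die anchors in the bottom-die frame."""
    return [
        AnchorPair(p.bottom, flip_top(p.top, die_width), dummy_wire=False)
        for p in pairs
    ]

def aligned(pair: AnchorPair, tol: float = 1e-9) -> bool:
    """True if the two anchors of a pair coincide after the flip."""
    return (abs(pair.bottom.x - pair.top.x) <= tol
            and abs(pair.bottom.y - pair.top.y) <= tol)

# One pair on a hypothetical 100-unit-wide die, top anchor drawn mirrored.
pairs = [AnchorPair(Anchor("bottom", 30.0, 40.0), Anchor("top", 70.0, 40.0))]
final = finalize(pairs, die_width=100.0)
print(all(aligned(p) for p in final))
```

While the dummy wire exists, each pair looks like an ordinary two-pin net, which is why the existing 2D algorithms need no changes.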
Because both die are bonded face-to-face (and are the same size), traditional flip-chip packaging will not work: both the "top" and the "bottom" of the stack are actually the backsides of die. Power was handled with through-silicon vias (TSVs) going through the bottom die. It was then spread out through the bottom die, and across the wafer bond to power the top die.
Shawn also went into a lot of detail about constructing the clock tree across the two die using Innovus Implementation and the CCOpt tool. I'm going to skip that as being too much of a deep dive for a post like this. But the clock tree was better than in the 2D implementation, with 18% lower clock latency, about half the number of clock buffers, and 27% lower clock tree power. What's not to like?
Even a 2D microprocessor requires some level of thermal analysis. It is even more essential for a design like this, since the bottom die is sandwiched between the package substrate and the top die, so there are fewer paths for heat to "escape". The top die is in thermal contact with the heatsink, so it is less of a challenge. The Celsius Thermal Solver was used to create heatmaps.
The heatmap above shows the N1 in 2D on the left (which was used to develop models of the package and heatsink), and the two folded N1 die on the right. Celsius shows that the steady-state temperature when running 'maxpower' is 6°C higher than that of the 2D N1. In reality, the difference might be lower, since 'maxpower' is a power-virus vector that goes beyond any realistic workload.
Voltus was used to do IR analysis. This is more critical than ever since all the power for the top die passes through the bottom die. Indeed, Voltus showed that most IR drop is on via pillars stacked on top of the TSV (see the diagrams earlier).
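To see why one element of the power path can dominate, here is a minimal Ohm's-law sketch in Python. It is not how Voltus works, and every number in it is a made-up placeholder chosen for illustration; the segment names and the ir_drops and dominant helpers are mine.

```python
def ir_drops(current_a, segments):
    """Voltage drop (V = I * R) across each series segment of a power path."""
    return {name: current_a * r_ohm for name, r_ohm in segments}

def dominant(drops):
    """Name of the segment with the largest drop."""
    return max(drops, key=drops.get)

# Placeholder resistances in ohms for one path into the top die (not measured data).
path = [
    ("tsv", 0.02),
    ("via_pillar", 0.10),
    ("hybrid_bond", 0.01),
]
drops = ir_drops(0.05, path)  # 50 mA through the path, also a placeholder
total = sum(drops.values())

for name, v in drops.items():
    print(f"{name}: {v * 1e3:.2f} mV ({v / total:.0%} of total)")
print("dominant segment:", dominant(drops))
```

Because the segments are in series, they all carry the same current, so the one with the highest resistance sets most of the drop.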
The final conclusions: