Paul McLellan

System in Package, Why Now?

6 Dec 2019 • 6 minute read

At HOT CHIPS this summer, one of the things I noticed was just how many of the designs being presented used some form of 3D packaging with multiple die. I wrote about many of them in my post HOT CHIPS: Chipletifying Designs. At last year's HOT CHIPS, I don't remember any designs like this being presented. So that raises the obvious question: why now?

Moore and More

For over 50 years, the semiconductor industry has enjoyed the benefits of Moore's Law. But now the economics of semiconductor scaling no longer work. Moore's Law was mainly an economic law. If you read his original article (based on just four datapoints!), he points out that the economically optimal number of transistors on a chip was doubling every couple of years. Of course, underlying it was the development of technology to make this true, and until a few years ago that continued. The very high-level economic proposition was that each process generation doubled the number of transistors in the same area at a wafer-cost increase of just 15%, cutting the cost per transistor by roughly 40%. But now transistors get more expensive each generation, since the processes are so complex and the capital investment to build a working fab is so large (these days, that includes EUV steppers at over $100M each). So we have a process roadmap from 7nm, to 5nm, to 3nm, and a couple of generations after that. But the economics are such that these processes will not just be more expensive per wafer, as has been true for decades, but more expensive per transistor.
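To check that arithmetic (using the stylized 2x-density and 15% wafer-cost rule-of-thumb figures above, not data for any specific process):

```latex
\frac{\text{cost per transistor (new node)}}{\text{cost per transistor (old node)}}
  = \frac{1.15\,C_{\text{wafer}} / 2N}{C_{\text{wafer}} / N}
  = \frac{1.15}{2} \approx 0.575
```

That is, each generation historically delivered transistors at a little under 60% of the previous cost, a saving of just over 40% per transistor.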

Gordon Moore knew this day would come, and he said he expected it to arrive much sooner. He never expected his law to last for over 50 years. In fact, he was somewhat embarrassed about it. I saw a video interview with him at SEMICON West a couple of years ago, and when he was asked what he'd like to be remembered for, he said "Anything but Moore's Law." But in that original paper in Electronics he said:

It may prove to be more economical to build large systems out of smaller functions, which are separately packaged and interconnected.

Well, that day has come.

The other trend, which has been going on for some time, is that complex packaging (by which I simply mean any way of putting more than one die in a package) has become more economical. Like all mass-production technologies, it has been driven down the cost curve by learning from mass production. Large microprocessors use interposer-based technologies. Smaller (both in transistor count and physically) communication chips have been using fan-out wafer-level packaging (FOWLP) technologies. Since smartphones ship about 1.5B units per year, meaning that any individual model may ship in the hundreds of millions, that is a lot of learning.

Putting these things together, the balance has changed. The choice between manufacturing a huge number of transistors on a single chip and building smaller die and packaging them together is now a genuinely complex decision. Until recently, at least for large designs, the economics always came down on the side of a single SoC. Now, as evidenced by this year's HOT CHIPS, the decision often favors the complex packaging.

Die Size

Large die yield less well than small die. If fatal defects are randomly spread across a wafer, then a large die is more likely to be hit by one. Large die also waste more area around the edge of the wafer, since there are more places where a whole die no longer fits. In the past, despite this, it was more economical to suck it up and build a big SoC rather than build separate die and package them together. The economics now favor building smaller die, especially if a full system can make use of multiple copies of the same die. It is not too challenging to build a many-core microprocessor this way, or an FPGA (but obviously a huge die with no regularity cannot take advantage of this).
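A minimal sketch of the yield effect, using the simple Poisson model (yield = exp(-A*D0), where A is die area and D0 is defect density; the defect density and die sizes below are illustrative assumptions, not data for any real process):

```python
import math

def poisson_yield(area_mm2: float, d0_per_cm2: float) -> float:
    """Fraction of good die under the simple Poisson model Y = exp(-A * D0)."""
    area_cm2 = area_mm2 / 100.0  # convert mm^2 to cm^2
    return math.exp(-area_cm2 * d0_per_cm2)

D0 = 0.1  # defects per cm^2 (illustrative assumption)

# One large 800 mm^2 SoC versus the same logic split into four 200 mm^2 die.
big = poisson_yield(800.0, D0)
small = poisson_yield(200.0, D0)

print(f"800 mm^2 die yield: {big:.1%}")    # ~44.9%
print(f"200 mm^2 die yield: {small:.1%}")  # ~81.9%
# Each small die is tested before assembly, so good die are never
# discarded just because they sit next to a defect on the wafer.
```

Even before accounting for the edge-of-wafer waste, the small die come out far ahead, which is exactly the effect described above.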

There is another problem with very large designs. The lithography process has a maximum reticle size (typically about 26mm x 33mm, or a little over 850mm²). If the design is larger than that, then splitting it up is the only option.

Actually, that's not quite true. As I described in my post HOT CHIPS: The Biggest Chip in the World, Cerebras built a single chip that is the largest square you can cut from a 300mm wafer. This approach requires special handling of the interconnect across what is normally the scribe-line (although obviously the die were never separated). It also requires a lot of regularity, since all the die have to be the same. For most designs, this approach is not going to be effective. It does, however, remind me of a conversation I had with a guy from Digital in the 1980s, who pointed out that you could do wafer-scale integration with MicroVAX chips since you only needed to get three signals onto the wafer: power, ground, and Ethernet. So who knows? Maybe the Cerebras approach will become more widely used.

Keep Your Memory Close

All high-performance processors, whether CPUs, GPUs, deep-learning processors, or anything else, require access to large memories, either as caches or for directly storing the (big) data. A huge amount of the power consumed in most computation goes into simply moving the data around, not doing the actual calculations. A lot of the latency in the overall calculation comes from this movement, too. So an obvious thing to do is to move the memory closer to the processor, which reduces power and improves performance.
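To put rough numbers on that claim, here is a back-of-the-envelope comparison using often-quoted ballpark energy figures for an older process node (the exact picojoule values vary by node and design; treat them purely as illustrative assumptions):

```python
# Ballpark energy costs, in picojoules (illustrative assumptions only;
# figures of this general shape appear in many architecture talks).
ENERGY_PJ = {
    "32-bit add":        0.1,
    "32-bit multiply":   3.0,
    "32-bit SRAM read":  5.0,    # small on-chip memory
    "32-bit DRAM read":  640.0,  # off-chip access
}

add = ENERGY_PJ["32-bit add"]
for op, pj in ENERGY_PJ.items():
    print(f"{op:18s} {pj:7.1f} pJ  ({pj / add:7.0f}x a 32-bit add)")

# Under these assumptions, an off-chip DRAM read costs thousands of times
# more energy than the arithmetic it feeds, so shortening the path to
# memory (on-package HBM, stacked DRAM) attacks the dominant cost.
```

Whatever the exact figures for a given process, it is the huge ratio between an off-chip memory access and an on-chip arithmetic operation that makes keeping your memory close such a big win.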

The "obvious" way to do this would be to put the DRAM on the same chip as the processor, but there are two problems with this. First, there are all the die-size limitations I discussed earlier. Second, although it is possible to mix DRAM and logic processes, it is costly: you can't add DRAM to a logic chip with just a couple of extra masks.

The earliest approach to this is known as package-in-package (PiP). This slightly odd term distinguishes it from package-on-package (PoP), where two ball-grid-array (BGA) packages are literally stacked on top of each other. In PiP, two die, such as a smartphone application processor and a DRAM, are put in the same package and everything is wire-bonded, avoiding the complexity of things like through-silicon vias (TSVs). Smartphones have been doing this for years.

For high-performance computing (HPC), that is not enough memory. These designs typically want to access several high-bandwidth memory (HBM or HBM2) stacks. Each stack consists of a logic die with four or eight DRAM die stacked on top, everything connected with TSVs. So a stack is already a 3D-IC, albeit not very useful on its own. It is then put on an interposer alongside the processor. AMD's Fiji GPU was one of the first designs to use this approach.

There is also a JEDEC Wide I/O standard for high-bandwidth memory. The interface is standardized (so the memory doesn't depend on the particular design), and the memory die, with TSVs, is placed directly on top of the logic die. Since Wide I/O has 1,000 or more pins, it can deliver very high bandwidth without requiring all the SerDes overhead of a DDR interface.
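To see why a wide, slow interface can win, here is a rough bandwidth comparison (the pin counts and per-pin data rates below are illustrative assumptions, not figures from the JEDEC specs):

```python
def bandwidth_gb_s(pins: int, gbit_per_pin: float) -> float:
    """Aggregate bandwidth in GB/s for a parallel interface."""
    return pins * gbit_per_pin / 8.0  # 8 bits per byte

# A wide, slow on-package interface: many pins, modest per-pin rate.
wide = bandwidth_gb_s(pins=1024, gbit_per_pin=0.8)      # illustrative

# A narrow, fast off-package interface: few pins, each driven hard.
narrow = bandwidth_gb_s(pins=64, gbit_per_pin=3.2)      # illustrative

print(f"1024 pins @ 0.8 Gb/s each: {wide:6.1f} GB/s")   # 102.4 GB/s
print(f"  64 pins @ 3.2 Gb/s each: {narrow:6.1f} GB/s") #  25.6 GB/s
# The wide interface delivers more total bandwidth at a far lower per-pin
# rate, which allows simple, low-power I/O circuits instead of
# power-hungry high-speed PHYs.
```

Stacking the memory on top of the logic die is what makes that many connections practical in the first place; routing them out through a conventional package and board would not be.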

This approach is used for CMOS image sensors (CIS), too. The sensor is not strictly memory, but it is memory-like. Surprisingly, if you don't already know this, the light enters the sensor through the back of the wafer, so that the interconnect does not get in the way. The sensor wafer is thinned until it is effectively transparent to the light, and then flipped over. The associated logic die is designed to be exactly the same size, so that the flipped sensor fits perfectly on top of it. Sometimes a third die, a DRAM, is slipped into the middle of the stack; Sony's three-layer CIS is built exactly this way.

More!

But wait, there's more (not Moore). This post will continue next week.

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.