Huawei talks Smart Memory at Hot Chips 22: “The only practical solution”

30 Aug 2010 • 3 minute read
Last week saw the 22nd Hot Chips conference, held at Stanford University, and one of the companies presenting their latest thoughts on “hot chips” was global networking leader Huawei. Sailesh Kumar presented some details on a network-processing chip currently under development at Huawei that is essentially “smart memory.” Such a chip is mostly memory: in this case, according to an EETimes article (http://www.eetimes.com/electronics-news/4206434/Huawei-smart-memory-chip), 32 Mbytes of IBM’s embedded DRAM (eDRAM) in 45nm process technology.

Now every chip designer knows that the most efficient way to place a large amount of memory on an ASIC is to put all of that memory in one large block with one memory controller, one set of address decoders, and one set of sense amps for the data. That’s the most efficient way to embed memory from a purely silicon perspective. However, that’s not how Huawei’s design team has architected this chip. Instead, the 32 Mbytes of eDRAM are split into 16 separate blocks, and each block has its own attached processor, called an SM Engine. The 16 SM Engines communicate over a local interconnect grid that looks to be either a large crosspoint switch or a full-blown network on chip (NoC).
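To make the partitioning concrete, here is a minimal Python sketch of the idea. Only the 32 Mbytes, the 16 blocks, and the SM Engine name come from the presentation; the bank API and the flow-hash steering are illustrative assumptions, not Huawei’s design.

```python
# Illustrative model only: the article says 32 Mbytes of eDRAM split into 16
# blocks, each with its own SM Engine. Everything else here (bank API, the
# flow-hash steering) is an assumption for illustration.

TOTAL_EDRAM_BYTES = 32 * 1024 * 1024           # 32 Mbytes total (from the article)
NUM_ENGINES = 16                               # 16 SM Engines (from the article)
BANK_BYTES = TOTAL_EDRAM_BYTES // NUM_ENGINES  # 2 Mbytes private to each engine

class SmartMemoryEngine:
    """One processor paired with a private memory bank; no shared arbitration."""
    def __init__(self, engine_id: int):
        self.engine_id = engine_id
        self.bank = bytearray(BANK_BYTES)      # private eDRAM bank

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self.bank[offset:offset + length])

    def write(self, offset: int, data: bytes) -> None:
        self.bank[offset:offset + len(data)] = data

engines = [SmartMemoryEngine(i) for i in range(NUM_ENGINES)]

def engine_for_flow(flow_id: int) -> SmartMemoryEngine:
    # Hypothetical steering: keep all of a flow's state in one private bank.
    return engines[flow_id % NUM_ENGINES]
```

Because each bank belongs to exactly one engine, no two engines ever contend for the same memory interface in this model.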

So why is Huawei not taking the most efficient approach from the silicon perspective? For a system designer, the answer is simple. Placing all of that memory in one block creates an artificial system bottleneck. With one large block of eDRAM, all 16 on-chip processors would need to access that memory through one memory interface. Certainly, it’s possible to add more interfaces to the memory block to create a multiport memory, but then the memory array itself would need to run faster. From a systems perspective, you achieve optimum balance with 16 processors, 16 memory blocks, and 16 memory interfaces. Such an approach boosts effective memory bandwidth by 16x, with some silicon overhead.
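A quick back-of-envelope calculation shows where the 16x figure comes from; the interface width and clock below are assumed, illustrative numbers, not figures from the presentation.

```python
# Assumed, illustrative numbers: the same interface width and clock in both
# cases, so the only difference is how many interfaces transfer data in parallel.

INTERFACE_WIDTH_BITS = 256        # assumed data width per memory interface
INTERFACE_CLOCK_HZ = 500e6        # assumed eDRAM interface clock

one_interface_gbps = INTERFACE_WIDTH_BITS * INTERFACE_CLOCK_HZ / 1e9
sixteen_interfaces_gbps = 16 * one_interface_gbps

print(f"one shared interface : {one_interface_gbps:.0f} Gbit/s")
print(f"16 private interfaces: {sixteen_interfaces_gbps:.0f} Gbit/s (16x)")
```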

There are other benefits to this approach as well. One of the biggest is that each of the 16 memory blocks is inherently locked to its processor as a private resource. There is no question of access arbitration and no possibility of nasty system-level problems such as priority inversion, deadlock, or access-latency variation.

Network processing is one of those problems that is known to be “embarrassingly parallel.” Although there’s a torrent of packets entering the network processor, each packet needs a finite amount of processing. Cisco has already designed at least two processor-array chips that take the approach of giving individual processors responsibility for handling packets from birth to death: the 192-processor SPP and the Quantum Flow Processor with 40 quad-threaded processors. Now Huawei has taken a somewhat different approach along the same axis: fewer processors with more memory per processor and more threads per processor.

Within the problem context, why do this? After all, the smart memory approach flies in the face of conventional, contemporary ASIC design that couples an ASIC with big chunks of external, commodity SDRAM in the form of one or more DDR2 or DDR3 modules. This design approach is currently in favor because commodity SDRAM modules represent the absolute lowest cost per bit for any RAM available. Bandwidth is the reason for taking a different approach. To get the bandwidth needed in a multi-Gbit or Tbit router, you’d need several DDR channels, which incurs more silicon for memory controllers. Worse, you need more package pins to talk to each additional DDR channel and IC package technology advances far more slowly than Moore’s Law. You also need more power to talk to all of those DDR SDRAMs. Meanwhile, on-chip processing power and bandwidth are growing at a far faster pace than the bandwidth for external DDR memory interfaces.
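A rough pin-count estimate illustrates the point. The channel efficiency, per-channel pin count, and bandwidth target below are my assumptions for illustration, not figures from the presentation.

```python
# How many commodity DDR3 channels (and package pins) would a hypothetical
# off-chip memory design need? All numbers below are assumptions.
import math

DDR3_PEAK_GBPS = 12.8 * 8     # DDR3-1600, 64-bit channel: 12.8 GB/s peak
EFFICIENCY = 0.4              # assumed efficiency for short, random accesses
PINS_PER_CHANNEL = 120        # rough pins per DDR channel (data + address + control)

TARGET_GBPS = 1500            # hypothetical memory bandwidth for a 400 Gbit/s
                              # packet path (several times line rate, because
                              # each packet touches memory many times)

channels = math.ceil(TARGET_GBPS / (DDR3_PEAK_GBPS * EFFICIENCY))
print(f"{channels} DDR3 channels, roughly {channels * PINS_PER_CHANNEL} package pins")
```

Even with generous assumptions, the pin count quickly becomes implausible for a single package, which is exactly the wall the on-chip approach avoids.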

Here are a few statistics from Huawei’s presentation that help convey the nature of the memory-bandwidth problem for network processors (a rough tally follows the list):

1. A FIB (forwarding information base) lookup, used to determine a destination address, requires about six round-trip memory accesses (read/write).
2. An ACL (access control list) lookup, used to determine the type of handling a packet requires, needs about 20 round-trip memory accesses.
3. A hash-table lookup requires about four round-trip memory accesses.
4. Counters and policers require about four round-trip memory accesses.
5. Packet queue operations require about five round-trip memory accesses.
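
Adding up the list gives a sense of scale. The 400 Gbit/s rate comes from the presentation’s own conclusion; the minimum frame size and the per-frame wire overhead are my assumptions.

```python
# Worst-case arithmetic from the per-packet figures above plus two assumptions
# (minimum-size Ethernet frames and standard preamble + interframe gap).

ACCESSES_PER_PACKET = 6 + 20 + 4 + 4 + 5   # FIB + ACL + hash + counters + queues = 39

LINE_RATE_BPS = 400e9                      # 400 Gbit/s, the target named in the talk
MIN_FRAME_BYTES = 64                       # assumed worst case: minimum Ethernet frame
WIRE_OVERHEAD_BYTES = 20                   # assumed preamble + interframe gap per frame

packets_per_sec = LINE_RATE_BPS / ((MIN_FRAME_BYTES + WIRE_OVERHEAD_BYTES) * 8)
accesses_per_sec = packets_per_sec * ACCESSES_PER_PACKET

print(f"~{packets_per_sec / 1e6:.0f} million packets per second, worst case")
print(f"~{accesses_per_sec / 1e9:.1f} billion memory accesses per second")
```

That works out to roughly 600 million packets per second and on the order of 20 billion memory accesses per second under these assumptions.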

That’s a lot of memory accesses per processor and per packet. So what’s the bottom line? Huawei’s presentation makes it clear: a large amount of distributed on-chip memory reduces pin count, power, cost, and area by 10x. That’s a big enough number to get any system designer’s attention. In fact, says the Hot Chips presentation, this approach is the “only practical solution for 400Gbps” and beyond. Do you really need a more definitive statement than that?
