Designing Chips for Hyperscale Data Centers: IP

9 Apr 2020 • 5 minute read

Last week I wrote two posts about the progression from the first commercial computers to today's hyperscale cloud data centers.

I didn't talk about the technology used to build the computers at each phase. The earliest commercial computers were built using discrete transistors, and lots of wire. Memory was ferrite core. For example, here is a picture of an IBM 360 mainframe with the covers off. This one is in the Deutsches Museum in Munich, which I wrote about in German Computer Museums. It is the largest science museum in the world, and highly recommended if you ever find yourself in southern Germany.

The minicomputers and workstations, at least in the beginning, were built out of standard TTL 7400 series integrated circuits. These were small-scale integration, initially created by Texas Instruments, but later available from many semiconductor companies (who kept the same numbers to avoid confusion). Each component would have about 16 pins and hold something like four 2-input NAND gates (74xx00), or dual D-type flops (74xx74). One component used to build the ALUs of the computers of that era was the 74181 ALU, which I covered in the last part of my post Carry: Electronics. The later part of this era made the transition from ferrite core memory to semiconductor memory aka DRAM.
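To give a feel for what designing with these parts meant, here is a minimal Python sketch of how one quad 2-input NAND package (the 74xx00 mentioned above) is enough to build an XOR function. The function names are mine, purely for illustration, and the model ignores everything electrical (timing, fan-out, and so on).

```python
def nand(a: int, b: int) -> int:
    """One 2-input NAND gate: output is 0 only when both inputs are 1."""
    return 0 if (a and b) else 1


def xor_from_nands(a: int, b: int) -> int:
    """XOR built from exactly four NAND gates, i.e. one quad-NAND package."""
    n1 = nand(a, b)    # gate 1
    n2 = nand(a, n1)   # gate 2
    n3 = nand(b, n1)   # gate 3
    return nand(n2, n3)  # gate 4


if __name__ == "__main__":
    # Print the truth table for the four-gate XOR.
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_from_nands(a, b))
```

Anything more complex than this, such as an adder or an ALU, meant wiring together many such packages on a board.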

Once we get to PCs, server farms, and cloud data centers, the computers are all built around single-chip semiconductor microprocessors.

But a cloud data center has a lot more in it than just the microprocessor. There is a complete interconnect fabric at the level of racks, then rows of racks, then whole data centers. The storage is typically separate from the servers themselves, unlike in a laptop (or any of the earlier computers). Much of the networking has transitioned from copper to optical fiber, too. The arrival of 3D NAND flash means that many "disk" drives are now flash-based (known as SSDs, for solid-state drives) as opposed to rotating media (known as HDDs, for hard disk drives). HDDs are still cheaper per byte and continue to be used for "cold storage", meaning data that is not expected to be actively accessed (such as backups, as opposed to databases).

The big difference between SoCs for hyperscale cloud data centers and other markets, such as mobile, is that data centers are the main part of the semiconductor market that goes by the initials HPC, for high-performance computing. Foundries produce special processes that make a different performance/power tradeoff than is required for mobile or IoT. They also offer standard cell libraries that don't squeeze the number of tracks so aggressively, and so provide higher performance.

Communication

There are two different types of communication that are important inside a hyperscale data center: between the chips and between the systems.

The first type of connectivity is between chips inside each unit (server, storage server, router, and so on). This is typically some form of fast serial link powered by a SerDes (serializer-deserializer). However, the actual protocol running over those links will typically have some other name, such as PCIe, CCIX, ASE. Or, for memories, DDRx. The way to build an SoC that uses any of these protocols is to license IP from a company like Cadence. We call this Design IP (or DIP for short). There are two good reasons not to do this yourself. The first is that building a high-performance SerDes from scratch is a complex undertaking that most SoC design groups are not staffed to accomplish successfully. But perhaps more important is the second reason: you can't build such a SerDes without building a test chip. If you do it yourself, then that puts the test chip and its validation on the critical path for your SoC. Cadence has to build test chips too, of course, but we are in deep partnerships with the leading-edge foundries, get early access to PDKs, and can run test chips earlier than any individual design group. If you license DIP from Cadence, you can get all the silicon data without needing to wait.

The current state-of-the-art SerDes run at 112G (and the next lower speed is 56G). Both of these use PAM4 signaling, which transmits two bits per symbol by using four voltage levels instead of two. As a result, they have a distinctive eye diagram with three stacked eyes when they are running correctly, as in the photo below of Cadence's 112G SerDes 7nm test chip.
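To make the PAM4 idea concrete, here is a minimal Python sketch that maps pairs of bits onto four amplitude levels and back. The Gray-coded mapping and the level values (-3, -1, +1, +3) are a common textbook convention that I am assuming for illustration, and the function names are mine. This is not a description of any particular SerDes implementation, and the rate calculation ignores coding overhead.

```python
# Assumed Gray-coded mapping: adjacent levels differ in only one bit.
PAM4_LEVELS = {(0, 0): -3, (0, 1): -1, (1, 1): 1, (1, 0): 3}
PAM4_BITS = {level: pair for pair, level in PAM4_LEVELS.items()}


def pam4_encode(bits):
    """Turn a flat list of bits into PAM4 symbols, two bits per symbol."""
    if len(bits) % 2 != 0:
        raise ValueError("PAM4 needs an even number of bits")
    return [PAM4_LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


def pam4_decode(symbols):
    """Recover the bit stream from ideal (noise-free) PAM4 symbols."""
    bits = []
    for symbol in symbols:
        bits.extend(PAM4_BITS[symbol])
    return bits


def symbol_rate_gbaud(bit_rate_gbps, bits_per_symbol=2):
    """Nominal symbol rate for a given bit rate, ignoring any overhead."""
    return bit_rate_gbps / bits_per_symbol


if __name__ == "__main__":
    data = [0, 0, 0, 1, 1, 1, 1, 0]
    symbols = pam4_encode(data)
    print(symbols)                 # four symbols carry eight bits
    print(pam4_decode(symbols) == data)
    print(symbol_rate_gbaud(112))  # nominal symbol rate of a 112G link
```

Four levels mean three gaps between adjacent levels, which is where the three stacked eyes in a PAM4 eye diagram come from.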

The serial signals might just run from one chip to another nearby, but they can also run across boards, through connectors, and through cables. For this to work correctly, the signal integrity needs to be analyzed closely. Cadence has two families of products to do this. One is the Clarity and Celsius solvers, which are focused on 3D modeling of connectors and cables. The Celsius Thermal Solver also takes heat flow (including air or coolant) into account. The other is the Sigrity family, which handles signal integrity at the board and package level. The Allegro family is the set of tools used to actually design the boards and packages. They are tightly integrated with both the signal integrity tools and the chip design tools. It is possible to analyze a chip in the context of a package, for example.

The second type of connectivity is between the various systems inside the data center. The details depend on how the data center was actually designed, but there are typically three distances that are relevant: from the server to the router or switch at the top of the rack (ToR), within the data center (often known as East-West traffic), and from inside the data center to somewhere else outside (often known as North-South traffic). These directional names come from how a data center is normally drawn on a slide or a whiteboard, with all the servers across the middle of the slide, traffic into the data center (known as southbound) at the bottom, and traffic out of the data center (known as northbound) at the top of the slide.
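The three distances above can be sketched as a simple classification of a flow by where its two endpoints sit. This is a toy Python model: the `Endpoint` type, the `classify_traffic` function, and the labels it returns are hypothetical names of my own, and I am assuming that rack names are unique within a data center.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    # None means the endpoint is outside any data center we operate.
    data_center: Optional[str]
    rack: Optional[str] = None


def classify_traffic(src: Endpoint, dst: Endpoint) -> str:
    """Label a flow by the longest distance it has to travel."""
    outside = src.data_center is None or dst.data_center is None
    if outside or src.data_center != dst.data_center:
        return "north-south"  # crosses the data center boundary
    if src.rack == dst.rack:
        return "intra-rack"   # only needs the top-of-rack switch
    return "east-west"        # rack to rack within one data center


if __name__ == "__main__":
    a = Endpoint("dc1", "rack07")
    b = Endpoint("dc1", "rack07")
    c = Endpoint("dc1", "rack42")
    internet = Endpoint(None)
    print(classify_traffic(a, b))
    print(classify_traffic(a, c))
    print(classify_traffic(a, internet))
```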

These connections all use various forms of Ethernet, sometimes over copper and sometimes over fiber. One big limitation is the space on the front of each rack available to actually connect the communication medium, since there is clearly a maximum of 19" available for that on a 19" rack (actually a bit less). Ethernet IP is divided into three parts: the controller (also known as the MAC), the PCS (physical coding sublayer), and the PHY. The PHY is tied to the process, since it is an analog block. The PCS and MAC are synthesizable. Cadence has a wide range of PHYs for different processes, and MACs and PCSs for different data rates.

Design Tools

The next post looks at the design tools required for the design of chips for hyperscale data centers.
