
Paul McLellan

Designing Chips for Hyperscale Data Centers: Tools

10 Apr 2020 • 5 minute read


Yesterday's post, Designing Chips for Hyperscale Data Centers: IP, covered the high-performance IP that is most useful in designing hyperscale cloud data centers. Today, it is the turn of the tools required to design SoCs, boards, and whole systems for such data centers.

Chips

Data center SoCs are the highest-performance chips for any market segment. Mobile SoCs, and SoCs for any other battery-powered device, are much more focused on getting adequate performance at low power: battery life is the most critical budget. Automotive is somewhere in the middle, with the additional issue of ultra-high reliability, since cars are life-critical in a way that mobile phones are not. Of course, reliability is important in data centers, too, but typically there is a lot of redundancy built into any cloud data center. When there are hundreds of thousands of servers in dozens of data centers, rare events, such as a processor or a disk drive failing, happen regularly across the whole compute fabric. I'm sure the numbers are different today, if only because those drives were all rotating media, but Google's 2007 paper Failure Trends in a Large Disk Drive Population pointed out that annual failure rates went from 1.7% for drives in their first year of operation to 8.6% for three-year-old drives. That is one reason that the Google File System stores each file a minimum of three times in three different locations.
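
To get a feel for why keeping three copies works, here is a back-of-the-envelope sketch (my own illustration, not from the Google paper) of how replication suppresses data loss at those annual failure rates. It naively assumes independent failures and ignores re-replication after a failure, which in practice makes the odds even better.

```python
# Rough illustration: probability that every replica of a file is lost,
# assuming independent drive failures at the annual rates quoted from the
# 2007 Google paper, and ignoring re-replication after a failure
# (illustrative assumptions, not the actual GFS reliability model).
first_year_afr = 0.017   # 1.7% annual failure rate for first-year drives
third_year_afr = 0.086   # 8.6% annual failure rate for three-year-old drives
replicas = 3             # GFS stores each file at least three times

for afr in (first_year_afr, third_year_afr):
    p_all_lost = afr ** replicas   # all copies fail in the same year
    print(f"AFR {afr:.1%}: P(all {replicas} replicas lost) ~ {p_all_lost:.1e}")
```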

The design flow for an HPC chip for a data center is the same digital flow as for any other design. The tradeoff of power and performance is different, of course. But even in a chip dissipating 100W, a full analysis of power is needed (to make sure it doesn't dissipate 200W). I covered this type of mainstream digital design in my recent post Digital Full Flow for 5/7nm.
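
As a reminder of what drives a power number like that, here is a minimal dynamic-power estimate using the standard P = αCV²f relationship. The activity factor, capacitance, voltage, and frequency below are made-up illustrative values, not figures from any real data center SoC.

```python
# Back-of-the-envelope dynamic power, P = alpha * C * V^2 * f.
# All numbers are illustrative assumptions, not from a real design.
alpha = 0.15      # average switching activity factor
c_total = 6.0e-7  # total switchable on-die capacitance, farads (600 nF)
v_dd = 0.75       # supply voltage, volts
f_clk = 2.0e9     # clock frequency, hertz (2 GHz)

p_dynamic = alpha * c_total * v_dd ** 2 * f_clk
print(f"Estimated dynamic power: {p_dynamic:.0f} W")   # roughly 100 W
```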

Boards and Systems

PCBs cover a huge range of performance, much bigger than the range for chips. It always surprised me when I came across rare instances of people designing chips with very low performance. One memorable example was EM Microelectronic-Marin in Neuchâtel in Switzerland. They were owned by Swatch, and their chips for watches had a clock frequency of 32kHz. But boards commonly range from simple low-performance boards that a fred-in-a-shed type of company might design, up to behemoths that go in cloud data centers or 5G basestations. But small is not always simple. See, for example, my two posts about designing the Raspberry Pi board, which is small but not low-performance: James Adams Talks About How Raspberry Pi Was Designed and The $10 Raspberry Pi Zero W. Many IoT devices face similar constraints: physically small, cheap, but not low-performance, since they contain radios and sensors such as video cameras. But that's not today's topic. We are at the other end of the scale, with some of the highest performance boards and systems.

Cadence's suite for high-performance PCB design is Allegro. Cadence has a lower-end PCB design system under the name OrCAD, but for high-performance designs, where tight integration with signal integrity is so important, Allegro is the clear choice. For example, in the two posts about the Raspberry Pi linked above, the original design used OrCAD, but the higher-performance Raspberry Pi Zero W used Allegro.

The biggest challenge in designing very high-performance boards is coping with signal integrity issues resulting from very high-performance SerDes. For the last decade or so, limitations on the number of pins on packages have meant a switch from relatively low-speed parallel interfaces to extremely high-speed serial interfaces. These interfaces can take literally hundreds of thousands of bits to lock when powered up, so clearly circuit simulation is not going to be the tool of choice for analysis.
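
To see why, here is some rough arithmetic (with assumed link and simulator parameters, purely for illustration) on what a brute-force transient simulation of a lock sequence would involve.

```python
# Why SPICE-level transient simulation is impractical for SerDes lock:
# illustrative assumptions, not measured figures.
bits_to_lock = 200_000    # order of "hundreds of thousands" of bits
data_rate = 32e9          # assumed 32 Gb/s NRZ link
time_step = 50e-15        # assumed 50 fs transient time step

ui = 1.0 / data_rate               # one unit interval, seconds
sim_time = bits_to_lock * ui       # total time to simulate
time_points = sim_time / time_step # transient points to solve
print(f"Simulated time: {sim_time * 1e6:.2f} us")
print(f"Transient time points: {time_points:.1e}")  # ~1e8 nonlinear solves per corner
```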

High-speed serial interfaces are modeled using a standard called IBIS, and a second standard called AMI. For details of how these two technologies work together, see my post AMI and IBIS: Who Put the Eye in AMI? With the soon-to-be-published DDR5 memory interface standard, some of the same issues arise and so AMI is required there to make sure that the DFE (decision feedback equalization) works correctly with open eyes.
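
To make the DFE idea concrete, here is a minimal sketch of decision feedback equalization in plain Python. The tap weights and samples are invented for illustration, and a real IBIS-AMI model would of course adapt the taps and operate on real waveforms rather than a handful of numbers.

```python
# Minimal decision feedback equalizer (DFE) sketch: previously decided
# symbols are fed back through tap weights and subtracted from the incoming
# sample before it is sliced. Taps and samples are illustrative only.
def dfe(samples, taps):
    decisions = []
    history = [0.0] * len(taps)         # most recent decisions first
    for x in samples:
        # Cancel post-cursor inter-symbol interference from earlier bits.
        x_eq = x - sum(w * d for w, d in zip(taps, history))
        d = 1.0 if x_eq >= 0 else -1.0  # slicer decision (+1/-1 symbols)
        decisions.append(d)
        history = [d] + history[:-1]    # shift the decision history
    return decisions

print(dfe([0.8, -0.2, 0.6, -0.9, 0.1], taps=[0.25, 0.1]))
```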

When we go to the next level, where signals have to get on and off the board and over to the next board, the signal integrity issues and the precision required go up. Full 3D modeling of connectors and the area near where the connector attaches to the board (known as the breakout region) is needed, because there is electromagnetic interaction between the signals on the board and the connector itself. Cadence's tool for this type of analysis is the Clarity 3D Solver. Another tool, the Celsius Thermal Solver, takes care of thermal issues, which are especially difficult since temperature affects resistance, resistance affects power, and power affects temperature. Add in heat sinks, fans, and airflow, and the analysis rapidly becomes very complex. In all of these situations, it is important that the analysis is done in a combined fashion. If the chips, boards, and connectors are modeled and analyzed separately, there will be inaccuracy from double counting some effects, and more inaccuracy from not modeling others that affect signal integrity. You can read more about the Clarity and Celsius solvers in my posts Bringing Clarity to System Analysis, Celsius: Thermal and Electrical Analysis Together at Last, and Under the Hood of Clarity and Celsius Solvers.
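
That temperature/resistance/power loop is the essence of why electrical and thermal analysis have to be solved together. Here is a toy fixed-point iteration of my own (not how Celsius or Clarity actually work internally) showing the coupling converge for a single conductor, with all values assumed for illustration.

```python
# Toy electro-thermal coupling: temperature raises resistance, resistance
# raises I^2*R power, and power raises temperature. All values are assumptions.
ambient = 45.0     # inlet air temperature, degrees C
r_theta = 0.25     # thermal resistance to ambient, C/W
r0 = 1.0e-3        # conductor resistance at 25 C, ohms
tc = 0.0039        # temperature coefficient of copper, per degree C
current = 150.0    # supply current, amps

temp = ambient
for i in range(20):                       # simple fixed-point iteration
    r = r0 * (1 + tc * (temp - 25.0))     # resistance at this temperature
    power = current ** 2 * r              # I^2 * R dissipation
    temp_new = ambient + r_theta * power  # temperature that power produces
    if abs(temp_new - temp) < 1e-3:
        break
    temp = temp_new

print(f"Converged in {i + 1} iterations: {power:.1f} W at {temp:.1f} C")
```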

Boards in hyperscale cloud data centers take all of these considerations to the max. They involve chips that dissipate a lot of power. They involve aggressive air cooling. The signal frequencies are very high. The data rates of signals leaving the board for the networks are very high. The connectors can be very complex. Heat transfer, fluid dynamics, circuit analysis, and electromagnetics interact in ways that require them to be analyzed together.


Sign up for Sunday Brunch, the weekly Breakfast Bytes email.