Best of CadenceLIVE 2020: Hyperscale Data Centers

25 Mar 2021 • 4 minute read

There is something in philosophy known as the Sorites paradox. If you have a heap of sand, and you remove a grain, is it still a heap? Well, sure. So, remove another grain. It's still a heap. But if you keep removing sand, does it always remain a heap? Well, no. A single grain of sand is not a heap. Sorites sounds like he should be an Ancient Greek philosopher, but actually, it's just the Greek word for a heap.

In the same way, I don't think that you can define a clear line between a hyperscale data center and a server farm. However, a hyperscale data center contains literally tens or hundreds of thousands of servers, and since these days they are all multicore, that could mean millions of cores. It's not just the servers, either. There is also the communication architecture, which is usually divided into the rack level (linking all the servers in a single rack to the router on the top), the data center level (sometimes called the spine), and then the links from the data center to other data centers and to client machines.

One of the first companies to build data centers on this scale was Google. Its first servers looked like the one pictured below. Well, they didn't just look like this: this is an actual 1999-vintage Google server rack that you can see at the Computer History Museum in Mountain View (coincidentally, just near the Googleplex since, like much of Google, it is in an old Silicon Graphics building).

Hyperscale compute is one of the semiconductor technologies that were covered by customer presentations in last year’s virtual CadenceLIVE Europe event. To help share the learnings broadly, Cadence has consolidated several of the best presentations by key vertical segments: hyperscale, 5G, automotive, and artificial intelligence and machine learning (AI/ML). These presentations feature either customers who are designing hyperscale compute systems, or technology that is directly applicable to solving the challenges presented by hyperscale. The semiconductors at the heart of hyperscale data centers and far- and near-edge applications require the most advanced design techniques in order to power the innovation the cloud offers the world. These chips feature advanced nodes, large design sizes, massive hierarchies, and power concerns, all on tough schedules.

Arm at 4GHz

Let's face it, "Arm" and "4GHz" are not usually words you see in the same sentence. Arm made its reputation with low-power cores for mobile applications, with modest clock rates. But over the last few years, it has made a more serious foray into the data center world. For more on that, see my posts:

At CadenceLIVE Europe, Arm's Stephane Caneau, Olivier Rizzo, Bastien Metsu, and Florian Chailleux, along with Cadence's Ravi Andrew and Prashanth Lingalah, presented "How We Pushed Largest 5nm High-Performance Arm Core to 4GHz Frequency". The video is presented by Stephane with an appearance by Olivier.

As you might guess from his name, Stephane is French and is based in Sophia-Antipolis near Nice, where I lived for nearly six years.

I don't know which processor this was. Stephane only said, "the largest A-class core" (Cortex-A78?). The previous implementation, done by the original CPU team in 7nm, had achieved 3.2GHz.
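To put those two numbers in perspective, here is a quick back-of-the-envelope calculation. It is not from the presentation, just arithmetic on the 3.2GHz and 4GHz figures, and the helper name clock_period_ps is made up for illustration.

```python
def clock_period_ps(freq_ghz: float) -> float:
    """Return the clock period in picoseconds for a frequency in GHz."""
    return 1000.0 / freq_ghz


prev_ghz = 3.2    # previous 7nm implementation, as quoted above
target_ghz = 4.0  # 5nm frequency goal, as quoted above

prev_period = clock_period_ps(prev_ghz)
target_period = clock_period_ps(target_ghz)

# Relative frequency increase, and how much time is lost from every cycle.
uplift_pct = (target_ghz / prev_ghz - 1.0) * 100.0
budget_cut_ps = prev_period - target_period

print(f"Period at {prev_ghz}GHz: {prev_period:.1f}ps")
print(f"Period at {target_ghz}GHz: {target_period:.1f}ps")
print(f"Frequency uplift: {uplift_pct:.0f}%")
print(f"Time removed from each cycle: {budget_cut_ps:.1f}ps")
```

In other words, going from 3.2GHz to 4GHz is a 25% frequency increase, and every register-to-register path has to fit in 250ps instead of 312.5ps.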

Some specific challenges of the process and the design that Stephane called out:

High-resistive metal stack
Extra-low Vt cells can help close timing, but leakage becomes unacceptable if they are used too generously
Very high instance count with over 6M placeable instances
Large floorplan with a single clock
Many timing-critical RAM-dominated paths
Physical design requires 15 days and 100GB memory

The runtime for the whole chip was obviously too long to allow experimentation at that level. Instead, Arm adopted a divide-and-conquer strategy (see the diagram). Arm also had access to the latest pre-release version of the digital full flow. The parts of the Cadence flow that contributed most to meeting the frequency goal were:

Latest Genus Synthesis with advanced physical techniques
Genus iSpatial was one of the biggest contributors to Fmax uplift
Enhanced timing-driven placement engine
Via pillar aware optimization
Physical re-structuring and re-synthesis
Enhanced cluster skewing
Tighter integration between Tempus signoff and physical design
Power optimization to improve leakage without degrading maximum frequency

The graph shows the march to 4GHz. The drops at the start were due to switching to a new PDK and physical library that was more realistic but also more pessimistic in some areas.

Watch Stephane and Olivier's Presentation at CadenceLIVE Europe

Watch All the Hyperscale Videos
