Paul McLellan

Designing Planet-Scale Video Chips on Google Cloud

1 Nov 2021 • 4 minute read

In the last couple of months, I have attended two presentations about Google's video-encoder chip for YouTube videos. The motivation for Google to create this chip is easy to understand. Approximately a zillion videos per minute are uploaded to YouTube, and they all need to be encoded into a wide variety of screen formats and resolutions (mobile, 1080p, 4K, etc.). Video encoding is also getting more demanding as compression ratios and resolutions increase: AV1 (8K, 60fps) takes 8000X the resources of H.264 (1080p, 24fps). Doing all this in software requires a huge number of servers (and a lot of electrical power). Creating a dedicated chip would greatly reduce the resources required, so Google created one. As the chip was deployed worldwide and took over the heavy lifting, the CPU cycles required dropped dramatically.

[Chart: CPU cycles required as the VCU was deployed]
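
To get a feel for where a number like 8000X comes from, here is a quick back-of-the-envelope sketch. The split between raw pixel rate and per-pixel encoder effort is my own inference for illustration, not a breakdown given in either presentation.

// Rough back-of-the-envelope check of the 8000X figure quoted above.
// The split between pixel-rate growth and per-pixel codec complexity is an
// inference for illustration, not a number from the presentations.
#include <cstdio>

int main() {
    const double px_1080p = 1920.0 * 1080.0;                 // H.264 reference: 1080p
    const double px_8k    = 7680.0 * 4320.0;                 // AV1 target: 8K
    const double pixel_ratio = px_8k / px_1080p;             // 16x more pixels
    const double fps_ratio   = 60.0 / 24.0;                  // 2.5x more frames
    const double pixel_rate_ratio = pixel_ratio * fps_ratio; // ~40x raw pixel rate

    // If the total is ~8000x, the remaining factor is per-pixel encoder effort.
    const double implied_codec_factor = 8000.0 / pixel_rate_ratio; // ~200x

    std::printf("pixel rate ratio: %.1fx, implied per-pixel factor: %.0fx\n",
                pixel_rate_ratio, implied_codec_factor);
    return 0;
}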

The two presentations were:

  • HOT CHIPS: Video Coding Unit (VCU) by Aki Kuusela and Clint Smullen
  • CadenceLIVE Americas: Designing Planet-Scale Video Chips on Google Cloud by Sashi Obilisetty, Peeyush Tugnawat, and Jon Heisterberg

HOT CHIPS

[Diagram: Google VCU core internals]

The VCU core internals are shown in the diagram above. The core is replicated ten times on the chip, along with I/O, memory interfaces, and an Arm control processor. It provides full hardware acceleration of H.264 and VP9 (a format Google originated in 2013, although it is an open standard and is now used by many others). Other formats, if they are required, remain in software. The chip sits on the standard AXI/APB buses. At the HOT CHIPS presentation, Google went into a fair bit of detail about the core (that's what HOT CHIPS presentations are all about), but I'm going to skip that, since what was more interesting was the design methodology they used.
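
As a rough illustration of that hardware/software split, an encode request might be routed something like the sketch below. This is purely illustrative; the types and function names are hypothetical, not Google's actual software stack.

// Hypothetical sketch of the hardware/software split described above:
// H.264 and VP9 are offloaded to one of the ten VCU cores, everything else
// falls back to a software encoder. All names here are made up.
#include <cstdio>

enum class Codec { H264, VP9, AV1, Other };

struct EncodeJob {
    Codec target;  // requested output format
};

// Stubs standing in for the real offload and CPU paths.
bool vcu_core_encode(const EncodeJob&) { std::puts("VCU hardware path"); return true; }
bool software_encode(const EncodeJob&) { std::puts("software fallback"); return true; }

bool encode(const EncodeJob& job) {
    switch (job.target) {
        case Codec::H264:
        case Codec::VP9:
            return vcu_core_encode(job);  // full hardware acceleration
        default:
            return software_encode(job);  // other formats remain in software
    }
}

int main() {
    encode({Codec::VP9});  // hardware path
    encode({Codec::AV1});  // software path
    return 0;
}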

The Google codec team has been using high-level synthesis (HLS) design flows for almost 10 years. I would love to tell you that the team used our Stratus High-Level Synthesis, but Stratus didn't exist a decade ago; the work was done with Catapult, using a C++-based flow. HLS was instrumental in VCU development, enabling SW/HW co-design and allowing very fast design iteration (a small, generic C++ sketch of this single-source style appears after the list below).

The advantages of HLS, whichever tool you are using, are:

  • No separate algorithmic model needed, single source of truth
  • Always bit-exact results between model and RTL
  • 5-10x less code to write, review, and maintain vs. RTL
  • Software development tools
    • Address/MemorySanitizer
    • Distributed computing
  • Testing throughput 7-8 orders of magnitude higher vs. RTL
  • 99% of the functional bugs found in C++ before running any RTL simulation
  • Team working on high-value problems
    • Leave cycle-by-cycle design for the compiler
    • No debugging of block internal timing bugs
  • Design space exploration
    • Try out high number of algorithms/architectures
  • Able to keep adding features & improvements very late in the process
  • Technology scaling is trivial
    • Compiler creates new data path/FSM for a new clock target & technology from the same C++ source
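
To make the single-source-of-truth point concrete, here is a minimal, generic sketch of the style: one C++ function serves as both the algorithmic model and the HLS input, and the same testbench vectors are later replayed against the generated RTL, which must match bit for bit. This is not Google's VCU code, and a real Catapult or Stratus flow would typically also use bit-accurate datatypes; it just shows the shape of the approach.

// Minimal sketch of the "single source of truth" idea: one C++ function is
// both the algorithmic model and the input to high-level synthesis, and the
// same testbench that drives it in simulation later checks the generated RTL.
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// 8x8 sum of absolute differences, a core motion-estimation primitive,
// written in synthesizable style: fixed bounds, no dynamic allocation.
uint32_t sad_8x8(const uint8_t cur[64], const uint8_t ref[64]) {
    uint32_t acc = 0;
    for (int i = 0; i < 64; ++i) {
        int diff = static_cast<int>(cur[i]) - static_cast<int>(ref[i]);
        acc += static_cast<uint32_t>(diff < 0 ? -diff : diff);
    }
    return acc;
}

// Software testbench: these vectors would later be replayed against the
// synthesized RTL, and the results must match bit for bit.
int main() {
    uint8_t cur[64], ref[64];
    std::srand(1);
    for (int t = 0; t < 1000; ++t) {
        for (int i = 0; i < 64; ++i) {
            cur[i] = static_cast<uint8_t>(std::rand() & 0xFF);
            ref[i] = static_cast<uint8_t>(std::rand() & 0xFF);
        }
        uint32_t result = sad_8x8(cur, ref);
        // Sanity bound: the SAD of an 8x8 block of 8-bit pixels cannot
        // exceed 64 * 255. Sanitizers and distributed test farms apply
        // directly to this C++ long before any RTL exists.
        if (result > 64u * 255u) {
            std::printf("test %d failed: %u\n", t, result);
            return 1;
        }
    }
    std::puts("all vectors passed");
    return 0;
}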

[Image: Google VCU block diagram and photo]

CadenceLIVE

Recently, Google announced a custom video chip to accelerate video processing for YouTube, and Google's TPU series of chips is also well known for accelerating AI/ML workloads. These are just a couple of examples of chips designed by Google engineers. Google's hardware teams have been leveraging Google Cloud for chip design for a few years. Semiconductor customers can accelerate their time to market by leveraging the elasticity, scale, and innovative AI/ML solutions offered by Google Cloud. Many large enterprises are choosing Google Cloud for their digital transformation, since it provides a highly secure and reliable platform, infrastructure modernization solutions, and an AI platform for the semiconductor industry. The three presenters shared how they leverage Google Cloud internally for designing chips such as the Argos video chip, the challenges they faced, verification and design workload migration, and best practices.

[Diagram: Google design flow]

Above is a simplified design flow showing how chips are designed within Google using Google Cloud. One key point is that the compute resources required, both the number of machines and the class of machine, vary as a design moves through the flow. I won't repeat everything presented about the advantages of using EDA in the cloud; I think they are well-known by now.

The above chart shows the typical way a design moves into Google Cloud. The first step is cloud bursting: offloading on-premises data centers to get additional compute when needed, without having to over-provision in-house capacity. Next, the whole design flow is moved into the cloud. Finally, the design team can leverage AI and machine learning (ML) to further optimize the flow and benefit from learning carried over from one design to the next.
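
As a sketch of that first cloud-burst stage, the scheduling decision is conceptually as simple as the following. This scheduler is made up for illustration and is not Google's actual infrastructure.

// Hypothetical illustration of the cloud-burst idea: keep jobs on premises
// while there is capacity, and burst only the overflow to cloud machines
// instead of over-provisioning the in-house data center.
#include <cstdio>

struct Farm {
    int slots_total;  // provisioned compute slots in the on-prem farm
    int slots_busy;   // slots currently running jobs
};

// Number of pending jobs that should burst to the cloud.
int jobs_to_burst(const Farm& on_prem, int pending_jobs) {
    int free_slots = on_prem.slots_total - on_prem.slots_busy;
    return pending_jobs > free_slots ? pending_jobs - free_slots : 0;
}

int main() {
    Farm on_prem{10000, 9500};  // in-house farm near capacity
    int pending = 2000;         // e.g., a nightly regression run
    std::printf("burst %d of %d jobs to the cloud\n",
                jobs_to_burst(on_prem, pending), pending);
    return 0;
}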

Google does most of its design work completely in the cloud, so there is no need to deal with the overhead of moving data in and out of the cloud; data remains in the cloud until tapeout. Google has experience with projects using over 100,000 cores, and an individual machine can have up to 6 terabytes of memory. For storage, Google has used 6 petabytes of SSD for silicon development work, with downtime of about one in every million disk-hours.

One of the well-known advantages of the cloud versus on-premises data centers is the ability to scale up in minutes and then scale back down again. The above chart shows the last 22 months of EDA usage.

Above are some of the lessons that Google has learned from scaling large designs to 100% cloud usage, and below are the advantages that Google gained by moving its designs into Google Cloud.

There is a replay of the CadenceLIVE presentation available.

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email