Paul McLellan
18 Aug 2020

Climbing Annapurna to the Clouds

One of the keynotes at last week's CadenceLIVE Americas 2020 was by Nafea Bshara. He is a VP/Distinguished Engineer at Amazon, working on system/hardware/silicon products for AWS infrastructure. But perhaps more importantly, he joined AWS in 2015 when Annapurna Labs was acquired by them. He was a founder of the company and its CTO. They were a bit of a mystery company, doing something to do with Arm, but that was all anyone knew.

In fact, the first time I came across Annapurna was when their CEO Hrvoye Bilic "Bili" gave a keynote at CDNLive Israel in 2018. AWS had acquired them by then but he still said almost nothing about what they were doing.

It turns out that today AWS builds almost entirely custom infrastructure: their own routers, storage, networking...and their own processor chips. That's what they were doing.

At HOT CHIPS last year, I wrote a post, HOT CHIPS: The AWS Nitro Project, about how they went about doing it. Over time, they replaced almost all of the software and hardware in their servers except for the x86 processor chips. The above picture is their current configuration.

In October, I wrote a post, EDA in the Cloud: Astera Labs, AWS, Arm, and Cadence Report, about Astera Labs moving their entire EDA design flow into the cloud, and what that took. They were running on Arm-based AWS instances based on a processor called Graviton. If you don't know what AWS "spot" instances and "on-demand" instances are, that post is worth reading first, since the distinction matters for some of the graphs below.

Then, in December last year, I wrote a post, Xcelium Is 50% Faster on AWS's New Arm Server Chip, about the Graviton2. As Nafea put it early in his keynote:

It's the highest performing server we have for mainstream workloads.

For non-mainstream workloads, they also have a machine learning processor called Inferentia, which I don't seem to have mentioned!

The Keynote

Nafea titled the keynote Annapurna Labs' Journey: How the Cloud and Industry Collaboration Help Bend the Curve for Chip Development. They started the company in 2011 with a small team. They discovered that the semiconductor industry has a unique culture of embracing startups. Not just Cadence, but Arm, TSMC, assembly houses, and so on. The semiconductor business is high-risk and worldwide.

Their first chip was 28nm and they got through that with partners. But the trends in advanced-node development, the way they were doing it, were not sustainable. You can add your own graph of the soaring costs of design. Or headcount. All going in the wrong direction. They decided to abandon the server farms that Annapurna had owned before AWS acquired them, and move everything to the cloud.

We've all seen graphs like this showing usage in the cloud, but this is real data and shows how extreme the changes are in EDA. This is data from Astera Labs (not Annapurna/AWS) that they presented at TSMC last year. The blue parts of the bars are spot instances, which come at about a 90% discount. The red ones are on-demand. During tapeout, they needed a lot. Then came a 90% reduction just two months later, before the new design started to ramp. With fixed resources (a server farm), the design would have taken a lot longer since they would have been under-resourced. Then, a couple of months later, they would have been paying a lot of money for servers that they were not using.
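
To make the spot versus on-demand economics concrete, here is a back-of-the-envelope sketch in Python. The numbers are mine, not Astera's or Annapurna's, and I'm assuming the amortized cost of an owned server is roughly comparable to the on-demand rate; the point is just how badly a farm sized for the tapeout peak fares against elastic capacity that leans on spot pricing.

    # Hypothetical figures for illustration only.
    ON_DEMAND_RATE = 1.00   # $/core-hour, assumed on-demand price
    SPOT_DISCOUNT = 0.90    # spot instances at roughly a 90% discount

    # Assumed monthly demand in core-hours across a design cycle:
    # ramp, tapeout peak, then the lull before the next design starts.
    demand = [200_000, 800_000, 2_000_000, 300_000, 150_000, 600_000]
    spot_fraction = 0.8     # assume 80% of the work tolerates spot interruptions

    # Fixed farm: you pay for peak capacity every month, used or not.
    fixed_cost = max(demand) * ON_DEMAND_RATE * len(demand)

    # Elastic cloud: pay only for what you use, mostly at spot prices.
    blended_rate = (spot_fraction * ON_DEMAND_RATE * (1 - SPOT_DISCOUNT)
                    + (1 - spot_fraction) * ON_DEMAND_RATE)
    cloud_cost = sum(demand) * blended_rate

    print(f"fixed farm sized for peak : ${fixed_cost:,.0f}")
    print(f"elastic spot/on-demand mix: ${cloud_cost:,.0f} "
          f"({cloud_cost / fixed_cost:.0%} of the fixed cost)")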

Another aspect is that different EDA workloads require different server capabilities. Some need a lot of cores, some need a lot of memory, and so on. The cloud allows you to use the optimal server for each task. (I'm not an expert on AWS server names, so I'm not going to attempt to pick the right ones to make the point.)
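
As a toy illustration of that matching problem (the catalog below is made up, not real AWS instance types), the scheduling decision is simply: for each job, pick the cheapest server shape that satisfies its core and memory needs.

    # Hypothetical server shapes and EDA jobs, for illustration only.
    CATALOG = {  # name: (cores, GB of RAM, $/hour)
        "compute-heavy": (64, 128, 3.0),
        "memory-heavy": (16, 512, 4.0),
        "balanced": (32, 256, 3.5),
    }

    JOBS = {  # job: (cores needed, GB of RAM needed)
        "rtl-simulation": (48, 64),
        "place-and-route": (16, 384),
        "static-timing": (32, 192),
    }

    def best_fit(cores_needed, mem_needed):
        """Return the cheapest catalog entry that meets both requirements."""
        candidates = [(price, name) for name, (c, m, price) in CATALOG.items()
                      if c >= cores_needed and m >= mem_needed]
        return min(candidates)[1] if candidates else None

    for job, (cores, mem) in JOBS.items():
        print(f"{job:16s} -> {best_fit(cores, mem)}")

With a fixed farm you would have to buy one shape for everything; in the cloud, each job gets the shape that fits.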

The most recent change is that they are now using AWS Graviton2 for most workloads. They get faster runs and a whopping 89% increase in performance per dollar. Arm's development team has seen similar improvements simulating the Cortex-A53, with a cost reduction of 40-50% for the same workload.
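
If you want to see what an 89% improvement in performance per dollar means in money terms, here is the arithmetic with hypothetical baseline numbers; only the 89% figure comes from the keynote.

    # Assumed baseline: a regression that takes 10 hours on a $2/hour instance.
    baseline_cost = 10.0 * 2.0                      # $20 per regression (assumed)
    baseline_perf_per_dollar = 1.0 / baseline_cost  # regressions per dollar

    # 89% better performance per dollar on Graviton2 (figure from the talk).
    graviton2_perf_per_dollar = baseline_perf_per_dollar * 1.89
    graviton2_cost = 1.0 / graviton2_perf_per_dollar

    print(f"baseline cost per regression : ${baseline_cost:.2f}")
    print(f"Graviton2 cost per regression: ${graviton2_cost:.2f} "
          f"({1 - graviton2_cost / baseline_cost:.0%} cheaper)")

Note that 1/1.89 works out to roughly a 47% cost reduction for the same work, which lands right in the 40-50% range Arm reported.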

A Fundamental Shift

Historically:

  • CIO/CFO-level decision on which servers to buy and how long to keep them
  • The inherent conflict between running old servers for multiple years versus buying newer and faster ones
  • Long debates on which server to buy since different workloads have different needs
  • Long lead times buying servers, and remote teams don't get the latest servers the way headquarters does
  • Security is sometimes a secondary consideration

In cloud-based EDA development:

  • Team leaders are responsible for picking the optimal servers; it is their job to optimize for time-to-market versus IT cost
  • Each team has its own budget and its spend is clearly monitored
  • Teams can use as many servers, of any type they want, within their budget
  • CIO and CISO continue to drive security, monitoring, central license servers, storage/backup, and so on

As a result of this switch to the cloud, they now organize teams differently. Team leaders pick the servers and optimize for schedule and cost. Teams have their own budgets and are responsible for controlling their spend. The CIO/CISO/etc. continue to drive the big company-wide functions.

His final words:

It's really quite refreshing.


Sign up for Sunday Brunch, the weekly Breakfast Bytes email

Tags:
  • nitro |
  • EDA |
  • cloud |
  • annapurna |
  • aws |
  • cadence cloud |
  • graviton