Earlier this week I wrote a post covering the AWS presentation from HOT CHIPS about the Nitro project. Although the Nitro chips all contain Arm processors, that doesn't make them "Arm servers" in the sense that the processor running the application code is an Arm. Anthony Liguori mentioned in passing that Annapurna (part of AWS) had also created an Arm-based server chip called Graviton. This is the chip under the hood if you use an AWS EC2 A1 instance.
At the recent TSMC OIP Symposium, Jitendra Mohan of Astera Labs presented the first half of Accelerating Semiconductor Design Flows on the Cloud. One experience was that they used the cloud a lot more than they were used to with on-premises data centers. They estimated the amount of storage, but it doubled. And then doubled again. Of course, when they were done, they could simply scale back. The same thing happened with compute, which was 3X what they expected. They simply did far more simulations. Of course, the costs went up. "How much? Don't ask."
What worked? He feels they got a higher quality chip through better verification. From assembling a team to having a working chip was less than a year. They had independent compute interfaces for verification and physical design, meaning they were not sharing the same data center servers for both needs, which would have been hard if they had gone the on-prem route. AWS was solid and in the entire time they only had one machine go down unexpectedly.
What didn't work? If you have unlimited compute power then that exposes new bottlenecks. The job scheduler was one. They used PBS (which used to be RTDA's NC when I did some consulting for them years ago), although LSF and the others are similar. "Schedulers are just not optimized for highly ephemeral computing".
There is room for improvement in the EDA tools too, he felt. The tools could be more optimized and tailored for cloud infrastructure, to take advantage of different types of instance and different tiers of storage.
Bottom line: It worked out well to be on the cloud.
Mark Duffield of AWS took over to talk about The Amazon/Annapurna Experience. When Amazon acquired Annapurna, they had an on-prem data center. It clearly made little sense to expand that, given that AWS had approximately a zillion servers already. Annapurna developed three chips, and all three were developed 100% on AWS. They were Nitro, announced last year; Graviton, which is an Arm-server chip; and Inferentia, a neural network accelerator chip. Graviton was announced last year, and is now available in the A1 instance. Inferentia is "not GAed yet, due out soon."
By "eating their own dogfood" AWS/Annapurna found out what it took to make semiconductor design work on AWS, which has stood them in good stead as other customers have started to use it.
One thing I learned that day is that there are three different ways of using Amazon instances, with very different price points:
This diagram actually comes from the Arm presentation below, but it summarizes AWS's information:
In a subsequent presentation, Ajay Chopra of Arm, and Sheena Shankar of Cadence, presented Cloud-Based Characterization with Cadence Liberate Trio Characterization Suite and Arm-based Graviton. Just to be clear, all this work is running on AWS using A1 instances that contain Arm Cortex-A72 cores running at 2.3GHz. There are various configurations as in the table:
Sheena went first. The focus of this post is the challenges and results of running in the cloud, not especially the capabilities of Liberate Trio itself. For that, see my post Liberate Trio: Characterization Suite in the Cloud. Characterization is, in some ways, ideal for scaling into the cloud. A cell library contains hundreds of cells, perhaps thousands. These all have to be characterized at dozens or even hundreds of corners (combinations of process corner, supply voltage, and temperature). The computationally expensive part of the process is running circuit simulation for each cell-corner combination. However, the simulations are independent of one another and do not have to synchronize. This means that characterization scales linearly up to 50,000 CPUs.
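To make the "no connection between one simulation and another" point concrete, here is a minimal Python sketch of how cell-corner jobs can be enumerated and farmed out. The cell names, corner values, and the `simulate` stand-in are all hypothetical, and this is not how Liberate Trio itself is implemented; it only illustrates why the workload parallelizes so easily.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical library and corner lists; real ones would be far larger.
cells = ["INV_X1", "NAND2_X1", "DFF_X1"]
processes = ["ss", "tt", "ff"]
voltages = [0.72, 0.80, 0.88]
temperatures = [-40, 25, 125]


def build_jobs(cells, processes, voltages, temperatures):
    """One job per (cell, process, voltage, temperature) combination."""
    return list(product(cells, processes, voltages, temperatures))


def simulate(job):
    """Stand-in for one circuit simulation; it only returns a label."""
    cell, process, vdd, temp = job
    return f"{cell}@{process}/{vdd:.2f}V/{temp}C"


jobs = build_jobs(cells, processes, voltages, temperatures)

# No job reads another job's result, so they can run in any order
# and on any number of workers.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(simulate, jobs))

print(len(jobs), "independent jobs")
```

Since the job list is just a Cartesian product, adding cells or corners multiplies the job count without adding any coupling between jobs.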
Keeping track of 50,000 simultaneous jobs requires an industrial-strength message broker. This provides fault-tolerant Liberate job management regardless of network stability. It enables distribution of characterization jobs and is shared among multiple users. It also works with existing server-farm resource management solutions. Bolt handles common problems such as host out of RAM, NFS disk out of space, NFS automount failure, and overloading.
Ajay came on to talk about Arm's experience with Liberate and AWS. The big motivation for Arm to move to the cloud is that library characterization is CPU intensive but Arm has limited in-house capacity—it was a necessity. The big benefit of cloud is the paradigm shift from being CPU-limited to being human-limited. Instead of using 4K slots (simulations) for four months, they could switch to 20K slots for under a month. Costs are down 35%, runtime down 30%, overall turnaround time is down.
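The slot numbers can be sanity-checked with a few lines of Python. This is a back-of-the-envelope sketch that assumes the total amount of simulation work stays the same and that scaling is perfectly linear, neither of which was stated in the presentation; the function name is my own.

```python
def wall_clock_months(total_slot_months, slots):
    """Elapsed time if the work spreads evenly over the given slots."""
    return total_slot_months / slots


# 4K slots busy for four months is 16K slot-months of work.
work = 4_000 * 4

# The same work spread over 20K slots.
months = wall_clock_months(work, 20_000)
print(f"{months:.1f} months")
```

Under those assumptions the run takes about 0.8 months, consistent with "under a month."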
To optimize cost in the cloud, compute is the place to focus since that is 84% of the cost. The other 16% of the cost is made up of Bolt (which runs On-Demand, not Spot, since it must not get pre-empted), data transfer, and storage cost. Arm uses Spot pricing for all the actual simulations since it is the cheapest. Although any Spot instance can be taken away with two minutes' notice, Liberate has a checkpointing system and so can save the simulation and restart it when resources are made available again. Amazon says that the Spot interruption rate is less than 5%.
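The checkpoint-and-restart idea is what makes Spot instances usable for long simulations. Here is a minimal Python sketch of the general pattern, not Liberate's actual mechanism: the function name, the JSON checkpoint format, and the `interrupt_at` parameter (which fakes a Spot reclaim) are all assumptions made for illustration.

```python
import json
import os
import tempfile


def run_with_checkpoints(total_steps, ckpt_path, interrupt_at=None):
    """Run a step loop, resuming from ckpt_path if it exists.

    Returns True when all steps are done, False if interrupted.
    """
    step = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            step = json.load(f)["step"]

    while step < total_steps:
        if interrupt_at is not None and step == interrupt_at:
            return False  # simulated Spot reclaim
        step += 1  # stand-in for one unit of simulation work
        with open(ckpt_path, "w") as f:
            json.dump({"step": step}, f)  # progress saved after each step
    return True


ckpt = os.path.join(tempfile.mkdtemp(), "sim.ckpt.json")
first = run_with_checkpoints(10, ckpt, interrupt_at=6)  # stops after 6 steps
second = run_with_checkpoints(10, ckpt)  # resumes from step 6 and finishes
print(first, second)
```

The second call does not redo the first six steps, so an interruption costs only the work done since the last checkpoint.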
There is a big difference in pricing depending on the region, as much as 40%. Since library characterization doesn't have tight latency requirements, it can take advantage of underused regions such as Ohio rather than heavily loaded regions like Oregon. Of course, you need to keep a close watch on this as regions can go from underused to heavily loaded over time.
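Picking a region by price is easy to automate. The sketch below uses made-up per-hour prices, chosen only so that the gap matches the "as much as 40%" figure; they are not real AWS quotes, and the helper name is hypothetical. In practice the prices would have to be refreshed regularly, for the reason given above.

```python
def cheapest_region(prices):
    """Return the region name with the lowest price."""
    return min(prices, key=prices.get)


# Hypothetical per-hour Spot prices, for illustration only.
spot_prices = {
    "us-east-2 (Ohio)": 0.0060,
    "us-west-2 (Oregon)": 0.0100,
}

best = cheapest_region(spot_prices)
saving = 1 - spot_prices[best] / max(spot_prices.values())
print(f"{best}: {saving:.0%} cheaper than the most expensive region")
```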
Liberate works so that only the high-compute jobs themselves are run in the cloud. CellBuilder and the bridge to the cloud run back in Arm's own on-prem data center, as in the diagram below.
Benchmarking Arm versus Intel AWS instances showed that the Arm instances are 3.8X cheaper than x86. There is a 25% cost reduction even for the same throughput, once more cores are used to compensate for the single-thread slowdown (comparing 8 Arm cores against 4 Intel cores on the Oregon Spot market in July, to be precise).
Bottom line: Arm will run Liberate Trio for production library characterization using Graviton A1 instances on AWS by the end of 2019.
Of course, as I reported earlier, Astera Labs already are using EDA in the cloud for production, as are Amazon/Annapurna.