Author

Paul McLellan

Scaling to One Million Cores on AWS

31 Jan 2023 • 4 minute read

At CadenceLIVE Europe last year, Ludwig Nordstrom of AWS presented Scaling to 1 Million+ Cores to Reduce Time to Results, with up to 90% Discount on Compute Costs. I think that there are currently two trends in EDA infrastructure that cut across almost all design tools: adding AI and machine learning (ML) to the tools, and switching from running on a single enormous server to running massively parallel in the cloud on fairly normal configurations. There is actually a play in AWS for AI/ML, too, for those design tools that can take advantage of GPUs, since AWS has instances with attached NVIDIA GPUs. But this presentation was more about scaling, and some practical advice on how to keep costs under control when you scale massively.

He started off by addressing why you might use the cloud in general, and AWS in particular, for EDA. In the front-end part of the design cycle, there are lots of jobs. For example, verification requires millions of simulation runs, many of which are quite short. The most extreme example is library characterization, where each standard cell crossed with each process corner is its own job. Each job is independent, apart from competing for the same resources, so it is comparatively straightforward to scale to enormous numbers of machines. On the other hand, in the back end, there are huge jobs. But under the hood, many (most) Cadence tools have been rearchitected to take advantage of large numbers of machines. For example, timing signoff of a large design using Tempus can scale. In fact, as Ludwig said, and I would agree with him, the cloud is becoming the standard signoff platform. It is also the standard physical verification platform for Pegasus. Flying horses in the cloud, or something like that.
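
As an aside from me (not from the presentation): the "each job is independent" property maps naturally onto an AWS Batch array job, which fans one submission out into thousands of child jobs. Here is a minimal boto3 sketch; the queue name, job definition, and job count are made-up placeholders.

```python
# Minimal sketch (placeholders, not from the talk): fan out N independent
# characterization runs as one AWS Batch array job.
import boto3

batch = boto3.client("batch")

NUM_JOBS = 10_000  # e.g., standard cells x process corners

response = batch.submit_job(
    jobName="library-char",
    jobQueue="eda-spot-queue",           # hypothetical queue backed by spot instances
    jobDefinition="char-sim:1",          # hypothetical container job definition
    arrayProperties={"size": NUM_JOBS},  # one child job per (cell, corner) pair
)
# Each child job reads the AWS_BATCH_JOB_ARRAY_INDEX environment variable
# (0..NUM_JOBS-1) and maps it to the cell/corner it should simulate.
print("Submitted array job:", response["jobId"])
```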

One thing about the cloud is that compute hours are fungible. This means that running a job for 10 hours on 1,000 CPUs has the same cost as running it on 10,000 CPUs for just one hour. For the front end (lots of small, independent EDA jobs), this scales essentially perfectly. The back end scales too, but not quite so nicely. For example, Tempus or Pegasus doesn't scale completely linearly from 1 to 10,000 CPUs.
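
To put numbers on the fungibility claim, here is a back-of-the-envelope calculation. The per-CPU-hour price is an arbitrary placeholder of mine, not an AWS rate; the point is that cost depends only on the CPU-hour product, provided the job scales linearly.

```python
# Back-of-the-envelope: cost = CPUs x hours x price per CPU-hour.
# The $0.05/CPU-hour figure is an arbitrary placeholder, not an AWS price.
PRICE_PER_CPU_HOUR = 0.05

def job_cost(cpus: int, hours: float) -> float:
    return cpus * hours * PRICE_PER_CPU_HOUR

print(job_cost(1_000, 10))  # 1,000 CPUs for 10 hours -> 500.0
print(job_cost(10_000, 1))  # 10,000 CPUs for 1 hour  -> 500.0 (same cost, 10x faster)
```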


AWS has various levels of service, with very different prices. At the most expensive end is "on-demand," where you pay for compute capacity by the second with no long-term commitments. At the other extreme are "spot instances," which come with savings of up to 90% off the on-demand prices. The instances themselves are exactly the same; the downside is that your jobs may be pre-empted at short notice. From AWS's point of view, this is a way of renting out spare capacity at a big discount, while retaining the ability to take the capacity back if a higher-paying customer shows up. EDA workloads are already set up to handle server failure, and being pre-empted and kicked off a spot instance looks almost exactly the same.
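
For the curious, switching an ordinary EC2 launch to spot is a small change to the API call. This is a minimal boto3 sketch; the AMI ID and instance type are placeholders I invented, and the interesting part is the InstanceMarketOptions block.

```python
# Minimal sketch: launching a spot instance rather than an on-demand one.
# The AMI ID and instance type are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI with the tools installed
    InstanceType="c5.18xlarge",
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "SpotInstanceType": "one-time",
            # Terminate (rather than stop/hibernate) when AWS reclaims the
            # capacity; the job scheduler is expected to resubmit elsewhere.
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
print(response["Instances"][0]["InstanceId"])
```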

Spot instances come from spot pools. Each instance family, in each instance size, in each availability zone, in each region, is a separate spot pool with a separate price. It might sound as if spot instances are impractical to rely on, with capacity coming and going all the time. But, in fact, less than 5% of spot instances were interrupted in the previous three months. Of course, you can only use spot instances for workloads that can handle interruptions. Short-running jobs can simply be restarted if pre-empted. Long-running jobs need checkpoint-and-retry strategies, or something similar.
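
Here is what checkpoint-and-retry can look like in practice. AWS publishes a two-minute interruption notice through the instance metadata endpoint shown below; everything else in this sketch (the step loop, the checkpoint function) is a hypothetical stand-in for whatever the tool actually provides.

```python
# Sketch of checkpoint-and-retry on a spot instance. The metadata URL is the
# documented spot interruption-notice endpoint; it returns 404 until AWS has
# scheduled this instance for reclamation, then 200 about two minutes ahead.
import time
import urllib.request

METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending() -> bool:
    try:
        urllib.request.urlopen(METADATA_URL, timeout=1)
        return True   # 200: termination is roughly two minutes away
    except OSError:   # 404 / timeout / not on EC2: keep working
        return False

def run_one_step(step: int) -> None:
    time.sleep(1)     # placeholder for one unit of real simulation work

def checkpoint(step: int) -> None:
    print(f"checkpointing at step {step}")  # placeholder: persist state off-box

def run_job(total_steps: int, start_step: int = 0) -> None:
    step = start_step  # resume point recovered from the last checkpoint
    while step < total_steps:
        run_one_step(step)
        step += 1
        if interruption_pending():
            checkpoint(step)
            return     # exit cleanly; the scheduler restarts the job elsewhere

run_job(total_steps=100)
```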

AWS also makes available Cyclone. It is not an AWS service but an open-source, community-supported solution. Cyclone is a high-performance HPC scheduler that integrates with AWS Batch, EC2, and Slurm across AWS regions and on-prem to create superclusters. Cyclone lets customers leverage the 25 AWS regions, along with on-prem capacity, to scale their compute clusters.


Cyclone brings the benefits of global scale, diversifying across all spot pools globally. It is smart enough to prioritize regions with lower spot costs and will leverage the available capacity across all regions without having to retry jobs. Global scale lets you use the instance types that work best for your jobs and still get the scale you need.
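
The presentation didn't show Cyclone's internals, but the core idea of price-aware pool selection is easy to sketch against the public EC2 API. The snippet below finds the cheapest availability zone for one instance type in one region using describe_spot_price_history; a global scheduler would repeat this across regions and instance types. The instance type is an arbitrary choice of mine.

```python
# Sketch of price-aware spot-pool selection: find the cheapest current spot
# price for one instance type across a region's availability zones.
from datetime import datetime, timezone

import boto3

def cheapest_spot_pool(region: str, instance_type: str = "c5.18xlarge"):
    ec2 = boto3.client("ec2", region_name=region)
    history = ec2.describe_spot_price_history(
        InstanceTypes=[instance_type],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.now(timezone.utc),  # only the latest price points
    )["SpotPriceHistory"]
    # Keep the most recent quote per availability zone, then pick the cheapest.
    latest = {}
    for entry in sorted(history, key=lambda e: e["Timestamp"]):
        latest[entry["AvailabilityZone"]] = float(entry["SpotPrice"])
    return min(latest.items(), key=lambda kv: kv[1])

zone, price = cheapest_spot_pool("us-east-1")
print(f"cheapest pool: {zone} at ${price:.4f}/hour")
```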

Ludwig had an example from the Max Planck Institute, which provisioned 4,000+ EC2 instances to run 20,000 jobs, each with up to 7 hours of runtime, on spot for drug discovery, using Cyclone configured for six AWS regions. The result:

"Using more than 4,000 instances, 140,000 cores, and 3,000 GPUs around the globe, our simulation ensemble that would normally take weeks to complete on a typical on-premises cluster consisting of several hundred nodes, finished in about two days in the cloud."

So a few weeks become a few days. What's not to like?

Learn More

There are lots of Breakfast Bytes posts about scaling into the cloud. Here are just a few:

  • CadenceLIVE: Pegasus on AWS, Let Physical Verification Fly
  • CadenceLIVE: Characterizing Libraries with Liberate and CloudBurst
  • Barefoot in a CloudBurst: Tempus on 2000+ CPUs
  • Climbing Annapurna to the Clouds
  • AWS: Amazon's Own Experience with EDA in the Cloud
  • Liberate Trio on AWS/Graviton2 Instances

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email

