Spectre Tech Tips: Increasing Performance and Capacity Using Spectre X Distributed Simulation

15 Dec 2020 • 5 minute read

Support for multithreaded/multi-core simulation has been available in Spectre for many years, allowing users to run a single analysis, for example, TRAN or HB, on a single machine with multiple cores. However, simulating designs that use advanced-node processes can run into performance and capacity limits, especially when post-layout parasitics are included.

Spectre X introduced distributed simulation as an extension of multithreaded simulation in which a single analysis uses cores from several machines. Distributed simulation provides access to more cores, which increases performance and capacity and provides more value for large designs.

The "single analysis" distributed simulation above is different from distributed parametric sweeps or Monte Carlo simulations, where multiple analyses are run on different machines.

Why and When?

There are two reasons for running a Spectre X distributed simulation: performance and capacity. With a distributed simulation, you increase performance by using cores from several machines and increase capacity by combining the memory of those machines.

Distributed simulation for transient analysis performs best when the circuit is very large and of the post-layout type. For the same number of cores, a non-distributed simulation on a single host is generally faster than a distributed multi-host simulation. For example, running a 32 core (+mt=32) simulation on a single host is typically faster than a distributed 2 hosts x 16 core simulation under the same circumstances.

If your circuit is small and you are running transient analysis, a distributed simulation may be slower than a single-host run because of the overhead that distribution requires.

For harmonic balance, it is not the circuit size, but the number of harmonics that decides whether the distributed simulation is beneficial or not.

Use Model

There are two use models for a distributed simulation: local and farm.

Local Distributed Simulation

If your machine network is not on a farm, you can run a distributed simulation on a local network by specifying the machine resources directly at the Spectre command line:

spectre +xdp=rsh|ssh +hosts "host1:proc1 host2:proc2 ..." +mt=#T

Here, host1 and host2 are the hostnames of the machines on the network on which you want to run the simulation. You can add as many hosts as you like, but more than eight hosts will not yield the best performance. proc1 and proc2 are the number of Spectre processes to be run on each host. Specify either 1 or the number of NUMA nodes (typically the number of sockets) on each machine; any other value will not yield optimum performance. #T is the number of threads for each requested process.
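As a minimal sketch of how the pieces of this command fit together, the Python helper below assembles the argument list from a host-to-process mapping. The function name and its parameters are assumptions made for illustration; only the +xdp, +hosts, and +mt options come from the command above, and the netlist argument is omitted just as it is there.

```python
import shlex


def build_local_xdp_command(hosts, threads, launcher="ssh"):
    """Assemble a local distributed Spectre X command line (illustrative helper).

    hosts    -- dict mapping hostname -> number of Spectre processes on that host
    threads  -- threads per process (the +mt value)
    launcher -- "rsh" or "ssh" (the +xdp value)
    """
    if launcher not in ("rsh", "ssh"):
        raise ValueError("launcher must be 'rsh' or 'ssh'")
    if not hosts:
        raise ValueError("at least one host is required")
    if threads < 1 or any(procs < 1 for procs in hosts.values()):
        raise ValueError("process and thread counts must be positive")

    # +hosts takes one quoted string of host:process pairs separated by spaces.
    hosts_arg = " ".join(f"{name}:{procs}" for name, procs in hosts.items())
    return ["spectre", f"+xdp={launcher}", "+hosts", hosts_arg, f"+mt={threads}"]


def total_cores(hosts, threads):
    """Cores used overall: processes summed over all hosts, times threads each."""
    return sum(hosts.values()) * threads


# Two hosts, one process per host, 16 threads per process.
example_hosts = {"host1": 1, "host2": 1}
argv = build_local_xdp_command(example_hosts, threads=16)
print(shlex.join(argv))
print(total_cores(example_hosts, 16), "cores in total")
```

Passing the result as an argument list (for example to subprocess.run) avoids shell quoting problems with the space-separated +hosts value.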

Farm Distributed Simulation

It is more common to run a distributed simulation on a farm, and the command-line option for this is simple:

spectre +xdp

It is simple because the farm job submission command, bsub, requests all the resources. The farm then provides the machines, processes, and threads, and the +xdp option tells Spectre X to read this allocation.

The following is an example requesting a total of 128 cores on 4 machines on an LSF farm:

bsub -q <queue> -n 128 -R "span[ptile=32]" spectre +xdp

The above command will run a total of 4 Spectre processes, each with 32 threads. A single process will be run on each of the 4 machines resulting in 4*1*32=128 cores.

When you run a distributed harmonic balance simulation on an LSF farm, you can benefit from running one process per NUMA node (or socket). The benefit comes from shared memory utilization, and is enabled by adding +ppn at the Spectre command line.

bsub -q <queue> -n 128 -R "span[ptile=32]" spectre +xdp +ppn

The above command will run a total of 8 Spectre processes, each with 16 threads. In this case, the machines given by the farm each have 2 NUMA nodes (sockets), therefore, 2 processes will be run on each of the 4 machines. This results in 4*2*16=128 cores for the analysis.
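The arithmetic behind the two farm examples can be sketched as follows. This is only a model of the numbers quoted above, assuming every host supplied by the farm has the same number of NUMA nodes; the function and parameter names are hypothetical, and the actual placement is decided by the farm and Spectre X.

```python
def xdp_layout(total_slots, slots_per_host, numa_nodes_per_host=1, ppn=False):
    """Return (hosts, processes, threads_per_process) for a farm request.

    total_slots         -- value passed to bsub -n
    slots_per_host      -- value of ptile in the span[] resource string
    numa_nodes_per_host -- NUMA nodes (sockets) per machine, assumed uniform
    ppn                 -- True when +ppn is given (one process per NUMA node)
    """
    if total_slots % slots_per_host:
        raise ValueError("total_slots must be a multiple of slots_per_host")
    hosts = total_slots // slots_per_host

    # Without +ppn there is one process per host; with it, one per NUMA node.
    procs_per_host = numa_nodes_per_host if ppn else 1
    if slots_per_host % procs_per_host:
        raise ValueError("slots_per_host must divide evenly among processes")
    threads = slots_per_host // procs_per_host

    return hosts, hosts * procs_per_host, threads


# bsub -n 128 -R "span[ptile=32]" spectre +xdp
print(xdp_layout(128, 32))                                   # (4, 4, 32)
# Same request with +ppn on machines that have 2 NUMA nodes each
print(xdp_layout(128, 32, numa_nodes_per_host=2, ppn=True))  # (4, 8, 16)
```

In both cases the product of processes and threads per process equals the 128 cores requested from the farm.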

Accuracy

The Spectre X distributed simulation retains the golden accuracy of Spectre X while increasing performance and capacity.

Performance

The Spectre X distributed simulation provides exceptional performance and capacity improvement compared to any other circuit simulator.

The chart below shows a large (500k M, 30M C, 3M R) post-layout transient simulation. Spectre X provides improved performance and scaling over Spectre APS, as can be seen by comparing 8T, 16T, and 32T. When you move to 2 and then 4 hosts with 32 cores each, the simulation scales up to almost 5X over 8T. This case shows over 30X performance gain versus a single thread. In some cases, you can get up to 60X performance improvement using 128 cores.

The chart below shows a similar performance improvement when running a distributed harmonic balance analysis. You can now include a large number of additional harmonics, improve the performance, and retain the accuracy.

Another benefit of a Spectre X distributed simulation is increased capacity through a lower memory requirement on each host involved in the simulation. With distributed harmonic balance analysis, the per-host memory reduction is roughly linear up to 4 hosts, which gives you the ability to simulate using 4X the memory.

Setup Recommendations

To improve Spectre X performance, the following farm and machine settings are recommended:

Disable hyperthreading
- If hyperthreading cannot be disabled, it is important not to overload the machine. The scheduler of a farm should only schedule the number of physical cores on a machine and not include the logical (hyper) cores.
Disable power saving
- The cores/CPUs should run with a constant, maximum frequency, and not be throttled.
Affinity
- If affinity (CPU pinning) is used, make sure to pin the CPUs on the same NUMA node, or at least only physical cores when hyperthreading is enabled.
Red Hat 7.X or equivalent
- The kernel of Red Hat 7.X contains fixes for optimal performance on multi-socket/multi-NUMA machines. Spectre performs much better with this OS version when running HCC (High Core Count) simulations like 16T+.
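To check a machine against the hyperthreading and NUMA recommendations above, one option is to read the topology reported by the Linux lscpu utility. The sketch below parses lscpu-style text and reports the number of physical cores a scheduler should offer. The function name is hypothetical and the sample text is illustrative, not output captured from a real farm machine.

```python
def schedulable_cores(lscpu_text):
    """Summarize CPU topology from lscpu-style "Key: value" lines."""
    fields = {}
    for line in lscpu_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    sockets = int(fields["Socket(s)"])
    cores_per_socket = int(fields["Core(s) per socket"])
    threads_per_core = int(fields["Thread(s) per core"])
    numa_nodes = int(fields["NUMA node(s)"])

    physical = sockets * cores_per_socket
    return {
        # Only physical cores should be scheduled, not logical (hyper) cores.
        "physical_cores": physical,
        "logical_cpus": physical * threads_per_core,
        "hyperthreading": threads_per_core > 1,
        "numa_nodes": numa_nodes,
        "cores_per_numa_node": physical // numa_nodes,
    }


# Illustrative sample: a 2-socket machine with hyperthreading enabled.
SAMPLE = """\
CPU(s):              64
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
"""

info = schedulable_cores(SAMPLE)
print(info)
```

For this sample, the farm should schedule at most 32 slots on the machine even though 64 logical CPUs are visible, and a process pinned to one NUMA node has 16 physical cores available.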

About Spectre Tech Tips

Spectre Tech Tips is a blog series aimed at exploring the capabilities and potential of Spectre®. In addition to providing insight into the useful features and enhancements in Spectre, this series broadcasts the voice of different bloggers and experts, who share their knowledge and experience on all things related to Spectre.