For some time now, SoC design groups have had to optimize PPA: performance, power and area. These three trade off against each other. For example, increasing the clock frequency gives higher performance (better) but also higher power (worse), and perhaps a little more area (worse). The catchy way of saying this has been "power is the new timing," which captures in a few words the fact that the power budget, not performance, is usually the biggest challenge at signoff.
However, today there is a new challenge that is becoming very important: thermal. Presumably the catch-phrase will need to be "thermal is the new power." More and more SoCs are limited by thermal constraints, the requirement to get all the power out of the package without the chip overheating. This is especially difficult in smartphones, which combine high-performance application processors with small physical size (and no heat-sinks, fans or other cooling). But it is not just smartphones. In a datacenter, total power is one of the limiting factors. A datacenter is all about performance, so typically the operating frequency for any SoC will be pushed as far as the thermal envelope will allow.
Of course, power and thermal are intimately related: thermal effects are an unwanted by-product of the power. Thermal affects performance too, since as a chip gets hotter its performance decreases.
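The link between frequency, voltage and power can be made concrete with the standard formula for dynamic CMOS power, P = αCV²f. The numbers below are purely illustrative, not taken from any real chip, but they show why lowering voltage and frequency together pays off so well:

```python
# Dynamic CMOS power: P = alpha * C * V^2 * f, where alpha is switching
# activity, C is switched capacitance, V is supply voltage, f is clock
# frequency. All values here are made up for illustration.
def dynamic_power(alpha, c_farads, v_volts, f_hz):
    return alpha * c_farads * v_volts ** 2 * f_hz

base = dynamic_power(0.2, 1e-9, 1.0, 2.0e9)    # nominal operating point
scaled = dynamic_power(0.2, 1e-9, 0.8, 1.5e9)  # lower V and f together (DVFS)
print(f"power reduced to {scaled / base:.0%} of nominal")  # -> 48%
```

Because voltage enters as a square, a 20% voltage drop combined with a 25% frequency drop cuts dynamic power by more than half.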
This is not just a theoretical issue. There are designs, such as the one above, where performance increases as you would expect going from one core to two cores to four cores. But with eight cores the chip overheats and is throttled back (admittedly too aggressively in this case) and, as a result, eight cores deliver the same performance as a single core.
Historically, thermal analysis has been treated almost as an afterthought, part of the signoff process when the design is basically complete. That approach is no longer adequate: thermal issues need to be addressed earlier in the design process, when it is still possible to do something about them. This is analogous to the changes we have made in the design process to handle power and performance early, so that architectural changes can be made if necessary.
For most SoCs, the only way to do a good analysis of power and thermal issues is to bring up the operating system and run either a representative load or a synthetic benchmark. Since booting a modern operating system involves billions of instructions, the most practical way to do this is with hybrid emulation, running the software binary on a fast processor model and the rest of the system on an emulator. Simulation is just too slow.
Some level of thermal analysis can be done at the level of the chip as a whole, such as checking whether the total power can be dissipated by the package in the physical environment where the chip will be run. But typically thermal issues are localized. Two things make the analysis tricky. First, thermal issues do not arise instantly: heat takes time to build up. Second, thermal effects do not remain restricted to the part of the chip where the heat is generated; instead, they spread to surrounding areas.
Detailed thermal analysis acknowledges this and is typically done using a tiled approach: the chip is divided into a grid of smaller areas and then the analysis can be done using a subset of the vectors generated during the emulation phase. The environment (package, board, etc.) of the chip also needs to be modeled since different packages have very different abilities to dissipate heat.
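A toy model can illustrate both tricky properties, the time lag and the spreading, using exactly this kind of tiled grid. The coefficients and grid size below are invented for illustration; a real flow calibrates them against the package and board model:

```python
# Toy tile-based thermal model: the die is a grid of tiles, each with its own
# power density. Heat builds up over time, diffuses to neighboring tiles, and
# leaks out through the package toward ambient. All coefficients are made up.
def step(temps, power, k_diffuse=0.2, k_heat=0.5, k_cool=0.05, ambient=25.0):
    rows, cols = len(temps), len(temps[0])
    nxt = [row[:] for row in temps]
    for r in range(rows):
        for c in range(cols):
            neighbors = [temps[nr][nc]
                         for nr, nc in ((r-1, c), (r+1, c), (r, c-1), (r, c+1))
                         if 0 <= nr < rows and 0 <= nc < cols]
            spread = sum(n - temps[r][c] for n in neighbors) / len(neighbors)
            nxt[r][c] = (temps[r][c]
                         + k_heat * power[r][c]               # local heating
                         + k_diffuse * spread                 # neighbor flow
                         - k_cool * (temps[r][c] - ambient))  # package cooling
    return nxt

# One hot tile (say, a busy core) in a 4x4 die; everything starts at ambient.
power = [[0.0] * 4 for _ in range(4)]
power[1][1] = 10.0
temps = [[25.0] * 4 for _ in range(4)]
for _ in range(50):
    temps = step(temps, power)
print(f"hot tile: {temps[1][1]:.1f} C, far corner: {temps[3][3]:.1f} C")
```

After enough steps even the far corner, which generates no heat of its own, sits above ambient, which is why the analysis cannot be done tile-by-tile in isolation.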
Modern SoCs, especially those containing multi-core CPUs and GPUs, usually contain sensors for measuring the temperature so that adjustments can be made. Typically this is done under software control, although it can also be done in hardware. In fact, there are usually failsafe sensors to prevent complete thermal runaway, which can lead to fires. If a core is running too hot, there are three main ways to mitigate the problem. First, throttle back the clock frequency. Second, lower the power supply voltage, which typically requires reducing the clock frequency as well, an approach known as DVFS, for dynamic voltage and frequency scaling. Third, move a high-CPU-usage task from the overheating core to one that is only lightly loaded and let the hot core cool down.
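A software thermal governor ties these three mitigations to the on-chip sensor readings. The sketch below shows one plausible policy, with all thresholds, names and the ordering of the mitigations invented for illustration (real governors are considerably more sophisticated):

```python
# Hypothetical thermal governor: per-core sensor temperatures drive one of the
# three mitigations. Thresholds are illustrative, not from any real design.
THROTTLE_C = 95.0   # hard-throttle the clock above this
DVFS_C = 85.0       # scale voltage and frequency down above this
MIGRATE_C = 80.0    # first try moving heavy work to a cooler core

def govern(core_temps, core_loads):
    """Return (core, action) decisions for one control interval."""
    actions = []
    coolest = min(range(len(core_temps)), key=core_temps.__getitem__)
    for core, temp in enumerate(core_temps):
        if temp >= THROTTLE_C:
            actions.append((core, "throttle"))
        elif temp >= DVFS_C:
            actions.append((core, "dvfs_down"))
        elif temp >= MIGRATE_C and core_loads[core] > 0.8 and coolest != core:
            actions.append((core, f"migrate_to_{coolest}"))
    return actions

# Four cores: one critical, one warm, one hot-and-busy, one cool and idle.
print(govern([97.0, 88.0, 82.0, 60.0], [0.9, 0.7, 0.9, 0.1]))
# -> [(0, 'throttle'), (1, 'dvfs_down'), (2, 'migrate_to_3')]
```

Note that task migration only helps if there is actually a cooler, lightly loaded core to migrate to; otherwise the governor falls through to DVFS or throttling.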
At the architectural level there are other changes that can be made. One of the most common is to move some software functionality from pure software running on a core of the main CPU to a specialized offload processor, such as the Tensilica HiFi DSP that is used for MP3 decode and other audio functions. These specialized blocks are hundreds to thousands of times more power-efficient than just running code on the main CPU.
Ignoring thermal management is not an option for a modern SoC. Thermal, power and performance all interact, so it's critical to optimize all three of them together. Sometimes this is called PTP, performance-thermal-power. Frequency affects power, which affects thermal, which in turn affects performance. Analysis that looks at all three holistically is going to be required for all advanced SoCs.
If you do not, then your competition will be using your chips to melt butter.