I recently attended a webinar presented by Rajat Chaudhry, who is a Product Engineering Director in the Multi-Physics System Analysis Group. The title was Chip Thermal Analysis with Celsius Thermal Solver. The word "chip" is important. Celsius can be used for thermal analysis at the system level, but this webinar (and this post) focused on using Celsius for analyzing thermal issues in chips. I wrote about Celsius last September when we announced it, in my post Celsius: Thermal and Electrical Analysis Together at Last and dived (dove?) a little bit deeper in Under the Hood of Clarity and Celsius Solvers. The key thing about thermal analysis is that you can't really do it without also doing electrical analysis: temperature affects power (especially leakage) and power affects temperature.
You can see those earlier posts for more details on Celsius, but I'll just point out that it combines both Finite Element Analysis (FEA) and Computational Fluid Dynamics (CFD). You need FEA to handle conduction, and CFD to handle convection and airflow. To give a more concrete example, if you put a heatsink on a package and put a fan on the heatsink, Celsius can use these two technologies to tell you how hot the chip is going to get.
When it comes to chip-level thermal analysis, the important partner tool is Voltus, or Cadence Voltus IC Power Integrity Solution to give its full name. The diagram above shows how the two tools work together. On the left is Voltus. It produces signoff IR drop results, which is its main function. But it also produces Voltus Thermal Models. These feed into Celsius on the right. In addition to Voltus Thermal Models for the chip, Celsius takes in environmental data: the package and board layout, whether there are heatsinks or fans, the ambient temperature, material properties, and so on. Celsius then provides Signoff System Thermal Gradient information and Signoff Thermal Stress (and those cute diagrams where red is hot and blue is cold). But it also produces a Chip Thermal Map. This map is then fed back into Voltus, which can now do a more accurate analysis of chip power and IR drop.
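To make the feedback loop concrete, here is a minimal sketch of the idea in Python. It is not how Voltus or Celsius work internally: it collapses the whole chip to a single temperature, assumes leakage grows exponentially with temperature, and assumes a single junction-to-ambient thermal resistance. All function names and parameter values are made up for illustration.

```python
import math

def total_power(temp_c, p_dynamic=4.0, p_leak_ref=0.5, t_ref=25.0, t_scale=40.0):
    """Power side of the loop (hypothetical model): dynamic power plus
    leakage that grows exponentially with temperature."""
    return p_dynamic + p_leak_ref * math.exp((temp_c - t_ref) / t_scale)

def die_temperature(power_w, t_ambient=25.0, r_theta=10.0):
    """Thermal side of the loop (hypothetical model): one lumped
    junction-to-ambient thermal resistance in degC per watt."""
    return t_ambient + r_theta * power_w

def solve_electrothermal(t_start=25.0, tol=1e-6, max_iter=200):
    """Alternate between the two models until the temperature stops changing."""
    temp = t_start
    for iteration in range(1, max_iter + 1):
        power = total_power(temp)          # temperature -> power
        new_temp = die_temperature(power)  # power -> temperature
        if abs(new_temp - temp) < tol:
            return new_temp, power, iteration
        temp = new_temp
    raise RuntimeError("no convergence: possible thermal runaway")

if __name__ == "__main__":
    temp, power, iterations = solve_electrothermal()
    print(f"converged after {iterations} passes: {temp:.1f} degC, {power:.2f} W")
```

The point of the sketch is that a single pass (power at ambient, then temperature) underestimates the final temperature, because the leakage at the higher temperature has not yet been counted. In the real flow, each "pass" is a Voltus run followed by a Celsius run on a full thermal map rather than one number.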
The Voltus Thermal Model contains on-die material properties, metal densities (since metal conducts heat as well as electricity), static and transient power information, and temperature-dependent power. The user has a lot of control over the model granularity. The chip is basically divided up into tiles and the user can control how many, and can even vary the number of tiles per layer, or use finer granularity at hot spots compared to cooler parts of the die. A Voltus Thermal Model for a chip can also be built by combining multiple Voltus Thermal Models from different blocks of IP. To give a sense of the granularity of the tiling, a small chip might be tiled into a 10x10 array, a large chip 500x500. Because all materials conduct heat to some extent, it is not necessary to work at nanometer or even micron precision.
As you can see from the above diagram, creating a Voltus Thermal Model takes as input the libraries (libs and LEF), the layout (DEF), vectors (VCD, FSDB, PHY), and thermal properties of the materials.
On the webinar, Rajat ran through an example in a reasonable level of detail. I won't try to cover everything in this blog post. It doesn't make much sense to analyze the thermal properties of a chip in isolation since it is affected by the package and board at a minimum. The example had a die in a wire-bond package on a simple PCB. The diagram above shows the setup. The conditions are an ambient temperature of 25°C, no airflow, natural convection. You can obviously change this, and can even use different temperatures for different parts of the board, or add fans, heatsinks, and other artifacts with thermal impact. This example keeps it simple and just models PCB top and bottom, and package top and bottom for heat-transfer coefficients.
The simple analysis just starts from the fact that the chip dissipates 4.85W and this is assumed to be evenly distributed across the die. The diagram above shows the result of Celsius' analysis. Variation across the die is just 3°C from 86°C to 89°C.
Next, Rajat redid the analysis, but this time using the Voltus Thermal Models from Voltus (instead of 4.85W evenly across the die). You can see the result above. This thermal map can be read back into Voltus and the analysis repeated with more accurate thermal data. The total power is still 4.85W but it is dissipated differently. Importantly (and potentially fatally) the peak temperature has jumped by more than 10°C compared to the uniform distribution, from 89°C to 100°C, and the variation across the chip is now 15°C (remember, it was just 3°C before).
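If you want to convince yourself that the same total power can give a very different peak temperature, a toy tile model is enough. The sketch below is my own illustration, not the Celsius algorithm: it tiles the die into a small grid, gives each tile an assumed lateral conductance to its neighbors and an assumed vertical conductance to ambient, and relaxes to steady state with Gauss-Seidel iteration. The conductance values and the hot-spot shape are invented, so the absolute temperatures mean nothing; only the comparison matters.

```python
def solve_tiles(power, t_amb=25.0, g_lat=0.02, g_vert=0.001,
                tol=1e-7, max_iter=20000):
    """Steady-state tile temperatures for a square grid of per-tile power (W).

    g_lat and g_vert are assumed conductances (W/degC) between neighboring
    tiles and from each tile to ambient. Die edges are treated as insulated.
    """
    n = len(power)
    temp = [[t_amb] * n for _ in range(n)]
    for _ in range(max_iter):
        max_change = 0.0
        for r in range(n):
            for c in range(n):
                neighbors = [temp[rr][cc]
                             for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                             if 0 <= rr < n and 0 <= cc < n]
                # Heat balance for one tile: in from neighbors + own power = out to ambient.
                new = ((g_lat * sum(neighbors) + g_vert * t_amb + power[r][c])
                       / (g_lat * len(neighbors) + g_vert))
                max_change = max(max_change, abs(new - temp[r][c]))
                temp[r][c] = new
        if max_change < tol:
            break
    return temp

def uniform_map(n, total_w):
    """Total power spread evenly over an n x n grid."""
    return [[total_w / (n * n)] * n for _ in range(n)]

def hotspot_map(n, total_w, hot_fraction=0.5, hot_size=2):
    """Same total power, but hot_fraction of it lands in a central hot_size x hot_size block."""
    start = (n - hot_size) // 2
    hot_tiles = hot_size * hot_size
    hot_w = hot_fraction * total_w / hot_tiles
    cool_w = (1.0 - hot_fraction) * total_w / (n * n - hot_tiles)
    return [[hot_w if start <= r < start + hot_size and start <= c < start + hot_size
             else cool_w
             for c in range(n)] for r in range(n)]

def summarize(temp):
    """Return (minimum, maximum) tile temperature."""
    flat = [t for row in temp for t in row]
    return min(flat), max(flat)

if __name__ == "__main__":
    for name, pmap in (("uniform", uniform_map(10, 4.85)),
                       ("hot spot", hotspot_map(10, 4.85))):
        lo, hi = summarize(solve_tiles(pmap))
        print(f"{name}: {lo:.1f} to {hi:.1f} degC")
```

With insulated edges, the uniform map gives a flat temperature, while the hot-spot map gives both a higher peak and a wider spread for the same 4.85W. That is the qualitative effect in Rajat's example.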
The Chip Power Density Map (to the right) looks similar to the Thermal Map. This is normal unless there is a lot of metal conductivity inside the chip (in the thermal sense). Most variation is in the XY direction with little in the Z direction. The top layer is 0.2°C lower so variation top-to-bottom in this case is not strong.
I'm not going to go through the details of transient analysis. But I'll set up the problem. Let's assume you have a chip with dynamic thermal management. That means there are temperature sensors on the chip (hopefully the hot spots—run Voltus to find them!) and if the chip gets too hot then the clock is throttled (or perhaps a fan is turned on). So how fast does it run in practice? If the chip gets hot enough to require throttling if run at full clock speed, then a typical behavior might be that the chip runs for a time gradually getting hotter. Then the chip is throttled, performance drops, and the chip cools. Once it is cool enough, it can go back into high power mode. The key question is what is the actual system performance under these conditions? What percentage of the time does it run at full performance versus being throttled?
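A back-of-the-envelope version of that question can be sketched with a single lumped thermal resistance and capacitance plus a throttle with hysteresis. This is my own simplification, not the transient analysis from the webinar, and every number (thermal resistance, thermal capacitance, power levels, thresholds) is an assumption chosen only to make the behavior visible.

```python
def simulate_throttling(duration_s=600.0, dt=0.01, t_amb=25.0,
                        r_theta=15.0, c_theta=2.0,
                        p_full=5.0, p_throttled=2.0,
                        t_throttle=90.0, t_resume=80.0):
    """Forward-Euler simulation of C * dT/dt = P - (T - T_amb) / R.

    Returns (fraction of time at full speed, peak temperature in degC).
    All parameter values are hypothetical.
    """
    steps = round(duration_s / dt)
    temp = t_amb
    peak = temp
    throttled = False
    full_speed_steps = 0
    for _ in range(steps):
        # Dynamic thermal management: throttle when hot, resume once cool enough.
        if not throttled and temp >= t_throttle:
            throttled = True
        elif throttled and temp <= t_resume:
            throttled = False
        power = p_throttled if throttled else p_full
        temp += dt * (power - (temp - t_amb) / r_theta) / c_theta
        peak = max(peak, temp)
        if not throttled:
            full_speed_steps += 1
    return full_speed_steps / steps, peak

if __name__ == "__main__":
    fraction, peak = simulate_throttling()
    print(f"full speed {fraction:.0%} of the time, peak {peak:.1f} degC")
```

The fraction it returns includes the initial warm-up from ambient, so a longer simulated window gives a better estimate of the long-run duty cycle. On a real chip the temperature is a map rather than one number, which is why the sensor placement matters.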
This requires vectors of some sort. Temperature changes slowly, over seconds or tens of seconds, while chips run at GHz frequencies. So you probably don't have a long enough vector sequence to use for a full thermal analysis (10s at 2GHz is 20B vectors). Usually, a full profile is assembled inside Celsius from several static thermal models, each generated with its own set of vectors.
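The arithmetic, and the general shape of the workaround, can be sketched as follows. The mode names, power numbers, and schedule format are hypothetical; the only figures taken from the text are the 10s window and the 2GHz clock.

```python
CLOCK_HZ = 2e9  # 2GHz clock, as in the example in the text

def cycles_for(seconds, clock_hz=CLOCK_HZ):
    """Number of clock cycles (one vector per cycle) needed to cover a time window."""
    return int(seconds * clock_hz)

# Hypothetical average power per operating mode, each imagined as coming
# from a short vector window rather than seconds of simulation.
MODE_POWER_W = {"idle": 0.8, "typical": 3.0, "peak": 4.85}

def build_profile(schedule):
    """Turn [(mode, seconds), ...] into [(start_s, end_s, power_w), ...]."""
    profile = []
    t = 0.0
    for mode, seconds in schedule:
        profile.append((t, t + seconds, MODE_POWER_W[mode]))
        t += seconds
    return profile

if __name__ == "__main__":
    print(cycles_for(10))  # 20000000000
    for segment in build_profile([("idle", 5), ("peak", 10), ("typical", 15)]):
        print(segment)
```

A piecewise-constant power profile like this covers tens of seconds of thermal behavior while only needing a short set of vectors per mode.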
The last thing that you want to do in a chip design is discover during signoff that your floorplan is not good. This is not an easy thing to change late in the design cycle. What you need is Thermal-Aware Chip Floorplanning early in the design cycle when the floorplan can be changed. For example, in the diagram above there are 5 blocks. If they are all close then the maximum temperature is 99°C. But if they are spread out a little, there is a 15°C reduction in the maximum temperature. Note that the power dissipation is the same, but there is a big improvement in thermal response. Celsius can be used to do a quick analysis of possible floorplans to find something good.
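To get a feel for why spreading blocks out helps, here is a deliberately crude sketch. It is not what Celsius does: it assumes each block heats every other block by an amount that decays exponentially with distance, and simply adds the contributions. The coordinates, block powers, self-heating resistance, and decay length are all invented for illustration.

```python
import math

def peak_temperature(blocks, t_amb=25.0, r_self=12.0, decay_mm=2.0):
    """Hottest block temperature for a list of (x_mm, y_mm, power_w) blocks.

    Assumed toy model: block j raises the temperature at block i by
    power_j * r_self * exp(-distance / decay_mm), and contributions add up.
    """
    peak = t_amb
    for xi, yi, _ in blocks:
        rise = 0.0
        for xj, yj, pj in blocks:
            distance = math.hypot(xi - xj, yi - yj)
            rise += pj * r_self * math.exp(-distance / decay_mm)
        peak = max(peak, t_amb + rise)
    return peak

# Two hypothetical floorplans with the same five blocks and the same total power.
CLUSTERED = [(4, 4, 1.0), (5, 4, 1.0), (4, 5, 1.0), (5, 5, 1.0), (4.5, 4.5, 0.85)]
SPREAD = [(1, 1, 1.0), (8, 1, 1.0), (1, 8, 1.0), (8, 8, 1.0), (4.5, 4.5, 0.85)]

if __name__ == "__main__":
    print(f"clustered: {peak_temperature(CLUSTERED):.1f} degC")
    print(f"spread:    {peak_temperature(SPREAD):.1f} degC")
```

Even this crude model shows the trend: with the total power unchanged, moving the blocks apart lowers the peak because each block sees less heat from its neighbors. The real numbers depend on the package, board, and materials, which is what the Celsius analysis accounts for.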
We can get more accuracy as the chip design progresses by using block-level thermal models and IP thermal models. With a floorplan, Celsius can merge these models into a combined thermal model.