Intel and PSS...and Simics, a Blast from My Past

24 Jun 2019 • 4 minute read

One of the newest standards in verification is PSS, the Portable Stimulus Standard. Whereas UVM is focused on verification of blocks and chips, PSS addresses how to do verification at the system level, where there are multiple processors with multiple cores, complex devices like cameras, and a notion that "anything can happen". At this level, the systems are not really deterministic. A call might arrive at a smartphone at any moment. The user might take a photo while listening to an MP3 stored on the phone or streamed over Wi-Fi. Even when all the blocks have already been individually verified, the number of cases like this that can be manually coded up is very limited. That's where PSS comes in. In a system like Cadence's Perspec System Verifier, a simple model of the design at this level can be created, and then used to randomly generate a huge number of test cases.

At DAC, Joydeep Maitra of Intel gave a presentation on their experience titled "Low-Power Validation of Heterogeneous SoCs Using PSS".

He started off with a good joke that I hadn't heard before. How many software engineers does it take to change a light bulb? None, since that's a hardware problem. How many hardware engineers does it take to change a light bulb? None, since there will be a software workaround.

Their philosophy and motivation is to create a single framework that can be used for pre-silicon testing (find bugs before tape-in), post-silicon testing (day 1 bringup and finding corner cases), and customer support (debugging anything that turns up during production).

Here's a simplified system architecture, involving a CPU and several "offload processors", including a couple of Tensilica cores on the right, and various other subsystems, all connected together by a NoC (network-on-chip). Note that each core has power-down states, and there is an SoC deepsleep too. The kind of problem that a system like this can easily have is that, for example, one of the Tensilica processors is shut down and the main CPU sends it a message through the NoC...but it is asleep so that isn't going to work.

So let's assume we want to do a randomized, multi-master, multi-memory, power-state traversal test. This will involve powering various blocks down and up, and running enough communication that the sort of error I just described might get caught. That means not just running vectors but running C code on all the cores. Then, to shake up the system, the test alters the power states, puts the SoC into and out of deep sleep, and randomly changes the NoC frequency, the wakeup events, and the functional threads. On the real silicon, these tests can be run for literally weeks. Well, they can run for weeks on an emulator, too, but usually someone else needs it rather sooner than that.
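
To make the idea concrete, here is a toy sketch in Python of a randomized power-state traversal. It is not how Perspec or Intel's framework works. The core names, the three actions, and the `wake_on_message` flag are all assumptions of mine for illustration. It simply powers cores up and down at random, sends messages between them, and records every message that lands on a sleeping core.

```python
import random

# Hypothetical core names, for illustration only.
CORES = ["cpu", "dsp0", "dsp1"]


def run_traversal(seed, steps=1000, wake_on_message=False):
    """Randomly toggle power states and send messages between cores.

    Returns a list of (step, source, target) tuples for every message
    that was sent to a powered-down core that did not wake up.
    """
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    powered = {core: True for core in CORES}
    failures = []
    for step in range(steps):
        action = rng.choice(["power_down", "power_up", "send"])
        target = rng.choice(CORES)
        if action == "power_down":
            powered[target] = False
        elif action == "power_up":
            powered[target] = True
        else:
            source = rng.choice([c for c in CORES if c != target])
            if not powered[source]:
                continue  # a sleeping core cannot send anything
            if not powered[target]:
                if wake_on_message:
                    powered[target] = True  # design wakes the target
                else:
                    failures.append((step, source, target))
    return failures


if __name__ == "__main__":
    print(len(run_traversal(seed=1)), "messages hit a sleeping core")
    print(len(run_traversal(seed=1, wake_on_message=True)), "with wakeup")
```

The fixed seed is the important design choice: a random test is only useful if a failing run can be reproduced exactly when it is time to debug it.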

Here's how this is done using PSS.

Modeling. The components are modeled. This often means pulling models for, say, a PCIe4 interface from a library. At other times it might involve creating a model. For more details on how that is done, see my post Perspec Modeling.
Scenario generation. The use-cases that each block might perform are defined, and then these are composed. A constraint solver prevents actions that cannot legally occur together from being scheduled together, and also enforces other, more complex constraints.
Test generation. A test intent flow is created and then Perspec generates C-code and emulation/simulation checks and stimulus. The tests are portable across platforms.
Test execution. Joydeep's diagram just shows emulator and silicon, but simulation and FPGA prototyping are also possibilities.
Debug and analysis. Latency measurements, coverage analysis, and automated log analysis. And, obviously, debugging any issues detected.
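
The scenario generation step above can be sketched in a few lines of Python. This is only a toy under my own assumptions: the block names, their actions, and the single constraint (streaming audio needs Wi-Fi to be on) are invented for illustration, and it filters a brute-force enumeration rather than using a real constraint solver, which is what a PSS tool would do.

```python
import itertools
import random

# Hypothetical blocks and the actions each one can perform.
ACTIONS = {
    "camera": ["idle", "capture"],
    "audio": ["idle", "play_mp3", "stream_wifi"],
    "modem": ["idle", "incoming_call"],
    "wifi": ["off", "on"],
}


def is_legal(scenario):
    """Return True if the combination of actions can happen together."""
    # Assumed constraint: audio cannot stream unless Wi-Fi is on.
    if scenario["audio"] == "stream_wifi" and scenario["wifi"] != "on":
        return False
    return True


def legal_scenarios():
    """Yield every legal combination of one action per block."""
    names = list(ACTIONS)
    for combo in itertools.product(*(ACTIONS[name] for name in names)):
        scenario = dict(zip(names, combo))
        if is_legal(scenario):
            yield scenario


def pick_scenarios(count, seed):
    """Randomly pick `count` legal scenarios, reproducibly."""
    rng = random.Random(seed)
    pool = list(legal_scenarios())
    return [rng.choice(pool) for _ in range(count)]


if __name__ == "__main__":
    for scenario in pick_scenarios(count=3, seed=42):
        print(scenario)
```

Enumeration works here because there are only 24 combinations. With realistic numbers of blocks and actions the space explodes, which is why a solver that constructs legal scenarios directly is needed.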

In the presentation, the above diagram is animated. DUT code is code running on the simulated processors. Simics is a word I wasn't expecting to see. It is Virtutech's system modeling and simulation system. Virtutech was acquired by Intel (actually, it was acquired by Wind River when it was part of Intel, but Intel kept it when they spun Wind River out again). I was VP of marketing for Virtutech for several years. In this example, the Simics code is running the checkers to make sure nothing untoward happens.

Creating the software test cases and the corresponding hardware checker and stimulus mechanisms is a one-time task. Having done it, the low-power verification team could pull in test development activity by four months. The challenge with tests at this level without using the Perspec/PSS approach is that the failures are often at a very high level ("the operating system crashed"). But with this approach it is possible to look at the threads and zoom in on the problem area. They couldn't really run multiple threads before, which made the system under test much more deterministic than it is in the real world.

To demonstrate the power of this approach, here is one critical bug they found. This type of bug requires a high level of randomization and parallelization in the test flow, which is impractical to achieve in a simulation-level or single-threaded test. What happened was that one of the Tensilica cores attempted to access the LMU in the CCA cluster when the cluster was powered off. The design requires this to trigger a wakeup for the cluster to enable access to the memory. No issue was seen in any of the SoC verification flows. But when the test framework randomly changed the NoC clock down to the reference clock frequency while it had scheduled the access, the test failed. This is a bug that would almost certainly have escaped to silicon, if not out into the field.
