Formal Post-Silicon Debug

25 Oct 2018 • 6 minute read

Two outstanding presentations at the recent Jasper User Group (JUG) were on using JasperGold (JG) for post-silicon debug: "In Case of Emergency Call 1-800-FORMAL" by Laurent Arditi of Arm, and "Accelerating Post-Silicon Debug with Formal Verification" by Jim Kasak of HP Enterprise. As it happens, Laurent won this year's best presentation award, and Jim has won the best presentation award three times at past JUGs.

Both presentations were based on real recent events, with only the product names changed. These two presentations are a sort of masterclass on using Formal for post-silicon debug.

Laurent Arditi

Laurent got a panicked message from a manager: "need help on a scan dump." Often, this is all the information provided by the customer: essentially a list of every flop and what state it is in. Arm has a more difficult problem than HPE in some ways, since it typically designed only the processor subsystem and not the rest of the chip. Management wants results fast, but it is like looking for a needle in a haystack. The scan dump is close to unreadable, and it is unlikely that a designer will spot an error just by looking at it. But they need to resolve the situation: if the bug is in Arm's IP then they need to find a workaround, and if the bug is elsewhere, they need to provide some useful information on what the other IP blocks are doing wrong.

The firefighters are the designers and verification engineers, but the formal team needs to be there too. A key thing is to keep the formal testbench runnable even years (seven in this case) after the design shipped. Arm uses an abstraction layer on top of JG, which makes it straightforward to keep things up to date as new versions of JG are released. However, even if the formal testbenches are not available, it is much easier to build them than to build simulation testbenches.

The first naive solution is to create a cover to verify that the dumped state is valid. If the dump cover is covered, then the dumped state is valid. If it is unreachable, then the bug is not in the IP. But it is not that simple, especially with old testbenches. An unreachable dump cover may just be unreachable in formal, and not in the actual design. It is very difficult to identify the combination of flop values that make the cover unreachable.

A refined approach is to analyze the dumped cycle and check whether it is a partially reachable state. Define an oracle wire, W, which is true on hitting the state, and then auto-generate one assume and one cover per flop:

cpu.memsys.slot0.A_asm: assume property (W |-> cpu.memsys.slot0.A);
cpu.memsys.slot0.A_cover: cover property (cpu.memsys.slot0.A);
cpu.memsys.slot0.B_asm: assume property (W |-> !cpu.memsys.slot0.B);
cpu.memsys.slot0.B_cover: cover property (!cpu.memsys.slot0.B);

Then add a cover on W, called state_ok. If state_ok is unreachable, you can then use get_needed_assumptions to find out which assumptions make it unreachable.

There are many advantages to the refined approach, mainly that it can indicate individually which covers are unreachable.
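As a rough illustration of the auto-generation step, here is a minimal Python sketch that turns a dumped state into the per-flop assume/cover pairs plus the state_ok cover. It is not Arm's actual script: the dictionary-based dump format, the function name gen_properties, and the restriction to single-bit flops are all assumptions made for the example. The oracle wire W is assumed to be declared elsewhere in the testbench.

```python
from typing import Dict, List


def gen_properties(dump: Dict[str, int], oracle: str = "W") -> List[str]:
    """Emit one assume and one cover per flop, then the cover on the oracle wire.

    `dump` maps a hierarchical flop name to its dumped value (0 or 1).
    This format is hypothetical; a real scan dump would need parsing first.
    """
    lines: List[str] = []
    for flop, value in dump.items():
        # A dumped 1 is used as-is; a dumped 0 becomes a negated term.
        expr = flop if value else "!" + flop
        lines.append(f"{flop}_asm: assume property ({oracle} |-> {expr});")
        lines.append(f"{flop}_cover: cover property ({expr});")
    # Single cover on the oracle wire, named state_ok as in the text.
    lines.append(f"state_ok: cover property ({oracle});")
    return lines


# Hypothetical two-flop dump, mirroring the example in the text.
example_dump = {"cpu.memsys.slot0.A": 1, "cpu.memsys.slot0.B": 0}
generated = gen_properties(example_dump)

if __name__ == "__main__":
    print("\n".join(generated))
```

Running the script prints the four properties shown above followed by the state_ok cover. A real flow would also have to deal with multi-bit registers and with flops that are abstracted away in the formal testbench.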

If state_ok is still unreachable, then the bug may be showing its symptom in the CPU while actually being caused by another block not respecting its protocol.

Using assume discrimination with get_needed_assumptions might zoom in on a single protocol rule that is violated, which is a good hint for a problem in external (non-Arm) IP. This turned out to be the case with this real-world example.
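The idea of narrowing down to the one protocol rule that matters can be sketched conceptually in Python. This is only a toy model of the reasoning, not how get_needed_assumptions is implemented: the is_unreachable callback is a hypothetical stand-in for a formal run with a given set of active assumptions, and the rule names are invented for the example.

```python
from typing import Callable, FrozenSet, List


def needed_assumptions(
    assumptions: List[str],
    is_unreachable: Callable[[FrozenSet[str]], bool],
) -> List[str]:
    """Greedily drop every assumption that is not needed for unreachability.

    `is_unreachable(active)` stands in for a formal run that reports whether
    state_ok is unreachable when only the `active` assumptions are enabled.
    """
    needed = list(assumptions)
    if not is_unreachable(frozenset(needed)):
        return []  # state_ok is reachable even with all rules, nothing to narrow
    for rule in list(needed):
        trial = [r for r in needed if r != rule]
        if is_unreachable(frozenset(trial)):
            needed = trial  # still unreachable without this rule, so drop it
    return needed


# Hypothetical protocol rules; only one of them actually blocks the dumped state.
rules = ["valid_stable", "no_resp_without_req", "burst_len_legal"]


def toy_oracle(active: FrozenSet[str]) -> bool:
    # Toy stand-in: the state is unreachable only while this one rule is assumed.
    return "no_resp_without_req" in active


culprits = needed_assumptions(rules, toy_oracle)

if __name__ == "__main__":
    print(culprits)
```

In this toy setup the search ends with a single rule. If the silicon did reach the dumped state, a result like that suggests the neighboring block broke exactly that rule.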

If state_ok is covered, even with the complete protocol checker constraints, then the problem is in the CPU IP. Now the need for a software fix is really urgent! First, the root cause needs to be found (and, lower priority, look into why those assertions were not present already). The software fix can be verified using formal verification, although it may involve more blocks than just the CPU.

Laurent talked about three techniques that can be useful here, since the verification may be running into depth limits in JasperGold: state swarm, prove from a given trace, and fake reset sequence.

The results on this real problem, with a partner facing deadlock with a silicon sample:

Proved several issues were not CPU functional issues: unreachable state
Identified one issue potentially caused by another IP: interface property violations could lead to faulty state
Confirmed one real CPU bug and provided a trace for simulation to reproduce the bug
Refuted several workarounds
Validated hardware fixes with bug hunting techniques

The emergency flow is shown graphically in the diagram below:

Jim Kasak

It is a fact of life that some bugs escape pre-silicon verification. There are two places where such a bug can be discovered:

By the post-silicon verification team: the impact is high since a bug fix may require a new ASIC tapeout, plus the bug may make testing other aspects of the silicon impossible.
By a customer: the bug is critical if it causes downtime or an unusable product. If it is not resolved fast, the customer may give up being a customer, existing orders may be canceled, and there may be warranty returns.

This can result in distressed customers, unplanned multi-million dollar tapeouts, and delays in product launches. This really gets management attention.

Jim's solution is to use formal verification to see whether various hypotheses about what is going on explain what is being seen. This is shown in the chart below:

Jim worked through four examples, but that seems too much for a blog post like this so I'll just cover the first one. This involves an Ethernet switch that stops transmitting packets when a particular new device is attached. It was rapidly clear that the new device was spewing illegal less-than-64-byte packets (known as runts). But the switch should automatically drop them so it shouldn't cause a problem.

It was known by the architect that runts could cause a problem if they got through to the main switch fabric, rather than being dropped in the front end. So that gave a starting hypothesis as to what might be going on.

When they set up formal and investigated, they got a counterexample with the input packet size being precisely 7. For any other packet size they could prove that it would work correctly. They had the culprit.

They found a software fix, and then used formal to check it, but it didn't work. A second fix passed formal. They had saved the business the cost of an extra tapeout (millions of dollars).

I won't go into detail on the other examples, as I said above, but they involved:

A CDC error where the problem had been correctly flagged but incorrectly waived. Again, using formal, they found a software workaround.
ASIC data corruption observed about once a week. Obviously, this sort of bug is really difficult to track down. The critical signal when they narrowed it down was a watchdog timer...that expired about once a week. The architect said it was a low priority feature so they disabled it.
Sporadic RAM read parity errors. They managed to prove that it was not a logic error, allowing a focus on other possible causes (timing error, memory margin, etc.). The designer was so relieved when he was exonerated.

So next time you have post-silicon bugs to track down, add JasperGold to your armory of weapons.
