Paul McLellan

DVCon: There Be Dragons!

31 Mar 2022 • 12 minute read

At the recent DVCon, there was a panel session titled SoC Verification Hidden Dragons. The panel was moderated by Brian Bailey of Semiconductor Engineering and consisted of three users and a vendor:

  • Mike Chin of Intel. He is a validation engineer focused on IP validation and reusable content.
  • Adnan Hamid of Breker (the vendor). He is CTO and is focused on IP and SoC verification.
  • Ty Garibay of Mythic AI. He is VP Engineering at Mythic, focused on analog/digital verification issues.
  • Balachandran (Bala) Rajendra of Dell-EMC. He is CTO for the semiconductor design vertical.

Ty had the best line of the session, where he compared verification to a dragon.

It's really hard to get started since you have to wade through fire. Then there is a huge amount of verification to be done, and finally there is a long tail that goes on forever.

In the rest of this post, when I introduce a question with "Q", that is Brian asking the panel. I've written this post as if everything is quoted word-perfectly, but actually, some of it is paraphrased, and some of the discussion is omitted.

Q: How does block-level verification differ from SoC verification?

Mike: It's breadth versus depth. It's very much a depth problem to validate all the nooks and crannies within an IP. But at the IC level we don't need the same depth; now we are looking more broadly across the entire SoC. It's an orthogonal problem going from IP to SoC.

Bala: Good analogy. There is a little bit of parallelism since there are pre-verified IPs on the SoC. SoC is a breadth problem, it is true, but sometimes you also want to dig deeper.

Ty: Verification at IP level is like single plates of armor covering a single spot. The SoC is like a suit of armor and the weaknesses are at the seams. I trust the engineers we have doing the IP verification, but worry about where on the seams a bug might break through.

Q: Adnan, what attempts have been made to reform verification for SoC?

Adnan: The landscape is shifting. IPs are getting complex, so the UVM approach doesn't scale. So we are introducing System-UVM. At SoC level, we are not interested in all the details of the IP, but we need to check that the IPs work well together, even though I don't have time to understand all the IPs. Also, firmware is pushed way too late in the process; we need to test firmware on the IP, which is one thing we enable at the System-UVM level.

Q: We need a combination of hardware and software to put together end-to-end scenarios. What is the scope of SoC verification, in the sense of being the same as the eventual SoC?

Mike: It is not enough to look at just real use cases, since once something changes a bug will be found. Production use cases are not sufficient, so we must go with synthetic too. With interactions between IPs, the complexity grows huge. In order to do good engineering, we need synthetic, but we need a method to focus on what is important. Where are the cracks in the armor?

Ty: We have a finite number of simulation cycles measured in machines, licenses, engineers, time. The most important thing we do every day is decide where to focus our limited resources. We have to go synthetic since we don’t have enough compute to simulate a whole workload. A lot is put on the architectural level to get simplicity that can be addressed synthetically so there isn’t an exponentially growing space.

Q (from the audience): What do you mean by synthetic validation?

Mike: Internal tests that we produce, not something you'd find in a driver. Typically we validate to the spec.

Ty: Software tools are being created in parallel with silicon so even if we wanted to run a “real” workload it doesn’t exist yet. So we need to put something together ourselves.
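
To make "synthetic validation" a bit more concrete, here is a minimal sketch of the pattern Mike describes: randomized stimulus produced by the team and checked against a reference model of the spec, rather than against a production workload. This illustration is mine, not anything the panel showed; the saturating-adder spec and the run_dut placeholder are invented for the example.

    import random

    WIDTH = 8
    MAX_VAL = (1 << WIDTH) - 1

    def spec_add_saturate(a, b):
        # Reference model written straight from the (hypothetical) spec:
        # unsigned addition that saturates at the bus width.
        return min(a + b, MAX_VAL)

    def run_dut(a, b):
        # Placeholder for the design under test. In a real flow this would
        # drive a simulator or emulator; here it mirrors the spec so the
        # sketch runs stand-alone.
        return min(a + b, MAX_VAL)

    def synthetic_test(num_cases=10000, seed=1):
        rng = random.Random(seed)
        for _ in range(num_cases):
            a, b = rng.randint(0, MAX_VAL), rng.randint(0, MAX_VAL)
            expected = spec_add_saturate(a, b)
            actual = run_dut(a, b)
            assert actual == expected, f"{a}+{b}: got {actual}, expected {expected}"
        print(f"{num_cases} synthetic cases passed")

    synthetic_test()

The point is the structure, not the adder: the team invents the stimulus and the checking, and "validating to the spec" means the expected results come from a model of the spec, not from a driver or a real workload.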

Q: A little earlier, Ty, you mentioned not having enough cycles. In some ways this is because we are stuck at the RTL level. Can we take it up a level?

Ty: We do use architectural models especially for software where you are not so concerned with very accurate simulation of the actual blocks. Today, the only accelerator we have is a hardware box. At some point, we need to get on the box and run a big chunk of stuff.

Adnan: For most SoCs, abstract models don’t have enough detail. We need to be testing the abstract model with the same contents as we will use with the design. For example, using randomized content on the abstract model. This is possible today but is not being done.

Q (audience): What are your views on the future of formal verification when AI/ML is emerging? Do we really need both?

Adnan: I adore formal. If we could do everything with formal I’d be the first to go after it. If someone can show me how to use formal on silicon that would be great, but there are electrical issues that formal doesn’t cover. We need dynamic testing, meaning we need dynamic test generation, which creates a capacity problem. On the topic of ML, I’m in the business of using AI planning algorithms, which are fancy rules. But with verification, it is not good enough to be 98% correct, we must be 100%.

Ty: Can machine learning effectively guide randomized testing? And get to the 98% level faster leaving the 2% that still has to be addressed with more rigorous techniques? With formal, we seem to need to hire a PhD who did a thesis in formal, so it has been challenging to deploy.

Mike: AI/ML may be something we can use to find corner cases and generate tests to cover them. That can be a big step to finding those high quality bugs that we need to find. We need everything.

Bala: If it enables human-in-the-loop it is a good tool and will add value. I have limited resources and if ML can assist and move the curve to the left then it is good.

Mike: If I can find 10% more bugs with ML, that is great, and I can then go find the other, harder 10% with different techniques.

Adnan: Guiding where to look for debug can be useful. What I don’t see is feeding ML a lot of existing cases and hoping to see what the tests should be.

Mike: It is all about identifying where we lack knowledge.
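
One concrete way to read "ML guiding randomized testing" is coverage-directed test generation: track which coverage bins each randomized configuration tends to hit, and bias future runs toward configurations that are still finding new bins. The sketch below is my own, deliberately simple, bandit-style weighting, not any panelist's actual flow; the knob names and bins are invented, and a real flow would pull hit data from the coverage database rather than from the run_test stub.

    import random
    from collections import defaultdict

    # Hypothetical stimulus "knobs" a randomized test could be configured with.
    KNOB_VALUES = ["short_burst", "long_burst", "back_to_back", "idle_heavy"]

    def run_test(knob, rng):
        # Stand-in for one randomized simulation run: returns the coverage
        # bins the run hit. In reality this comes from the coverage database.
        bins = {knob + "_basic"}
        if rng.random() < 0.1:
            bins.add(knob + "_corner")  # rare corner-case bin
        return bins

    def coverage_directed(runs=200, seed=7):
        rng = random.Random(seed)
        seen = set()
        score = defaultdict(lambda: 1.0)  # optimistic prior per knob
        for _ in range(runs):
            # Bias knob selection toward settings that recently found new coverage.
            weights = [score[k] for k in KNOB_VALUES]
            knob = rng.choices(KNOB_VALUES, weights=weights)[0]
            new_bins = run_test(knob, rng) - seen
            seen |= new_bins
            # Reward knobs that uncovered something new, decay the rest.
            score[knob] = 0.9 * score[knob] + len(new_bins)
        print(f"hit {len(seen)} coverage bins: {sorted(seen)}")

    coverage_directed()

This matches what Ty and Mike are asking for: let the cheap statistical machinery chase the easy coverage quickly, and save the scarce expert effort (formal, directed tests) for what is left over.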

Q: Follow-up question. Considering the state space is large, what would a good ratio of simulation to emulation be? And we should drop post-silicon in too.

Adnan: We can put some algebra around it and talk about combinational coverage, sequential coverage, concurrent coverage. One thing planning algorithms like to do is start with the output and work backward. As you get more capacity, emulation, or post-silicon, we can try more ways to reach states, but we should admit we are solving very difficult problems. In a project, you might run 10^7 to 10^8 test cases on a state space of 10^50.
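
To put Adnan's numbers in perspective (this back-of-the-envelope arithmetic is mine, not the panel's): even in the impossible best case where every test case lands on a distinct state, the fraction of the space you touch is

    \[ \frac{10^{8}\ \text{test cases}}{10^{50}\ \text{states}} = 10^{-42} \]

so any practical campaign samples a vanishingly small slice of the space, which is why the discussion keeps coming back to deciding where to focus.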

Ty: It's a miracle it works at all. From a practical point of view, with the limitation of resources, we will use simulation for months or years. We don't have a solid enough design to put on emulation until maybe the last 3 months of the design, doing test stuff, power analysis, and so on. Total cycles may be more in those 3 months than for the two years of simulation, but the engineer time is a small percentage versus what we invest in a typical Verilog simulation.

Bala: I’m a big fan of emulation, but you can use it earlier, not just in the last 3 months. Do randomized testing even. If it is audio, you can use simulation, but video needs emulation since it is so complex.

Q: Where are you finding bugs? In short runs, long runs, in the concurrent space. Who has been tracking this?

Bala: The numbers in yesterday's keynote were 20% in formal, 65% in simulation, 10% in hardware. Depending on when you do it in the whole continuous integration and development flow, you will find early bugs in formal, and later in emulation you may find fewer bugs since they've already been found earlier.

Ty: There is a practical limit on use cases but we do find bugs in software. These are bugs we could likely only have found on the emulator since the case would be too large for simulation. We may find a few bugs in emulation but mostly software bugs.

Adnan: This question ran a little false for me. I spend a lot of time at our customers. The rate at which we can find bugs is gated most by how fast we can come up with test cases.

Mike: I agree, but Ty said he is running simulation for years. But I need something faster and more responsive for software development. Simulation is just too slow.

Adnan: Many bugs we find are software guys misinterpreting what the spec says. It is great if we can get that low-level firmware routine to run on UVM. It’s like getting a heart to beat without the body around. We need to make sure firmware is working correctly on the hardware at the lowest level.

Q: What about better visualization tools? Companies are having to develop their own tools to fill in some of the needs. Different kinds of SoCs see different challenges. Mobile, datacenter, AI accelerators. How much should we be relying on EDA companies to provide just what we want versus developing just what we need?

Ty: There are lots of SaaS companies today helping companies visualize their data. Some EDA companies are providing similar tools to rapidly create dashboards. We would like to see more non-vendor-specific solutions offered in that space. Look at all the regression results and build up a visualization. Every verification team likes to think they know their specific products best.

Adnan: Humans are visual animals and think about problems visually. We already have visualization for coverage. We can do the same for sequential coverage and concurrent coverage. But the data looks horrible at that level because the state space is so large.

Mike: That's the problem. We can start from simple visualization, and we are visually stimulated, but for the amount of time we are going to spend creating the right visualization to take the 10^4 out of the coverage space, I would ask whether I would be better served by having a tool that can identify the complexity I would be trying to identify in the visualization.

Ty: If you could have open unencrypted output from the tools then we can post-process, whether it is visualization or AI/ML.

Mike: We still have a problem identifying the individual trees, the problems, in this very large forest.

Q: A question. Ty, you said you were short on compute resources; Adnan, you said you are only constrained by how fast you can develop tests. How do we cope with the explosion in the size of regression suites?

Bala: I'm excited about more and more moving to the cloud. When on-prem, teams don’t worry about money. Licenses and computers are provided by the IT team. Moving to the cloud, people will become more conscious of resources and optimize. Verification engineers are great at optimizing!

Ty: It is certainly a limitation for us. In the end, especially today, the tightest resource is the engineers. We are all fighting for the same superstars who can make a big difference to our teams. Without the people, you are tempted to think you can run a million jobs instead of a hundred thousand to get to the end goal, but that's not the case. I've read several articles saying that the world is short of tens of thousands or hundreds of thousands of IC engineers.

Adnan: Yes, we are short of people. Why? Because we don’t have enough test cases. Why do we need these really bright guys? We haven’t given them a good calculus. When things take 6 hours to run, people need to work on 3 or 4 things at a time and humans are not very good at that. We have to go to more simulation acceleration. Nobody talks about engineers sitting around for a job that runs for 6 hours.

Bala: That's a very long coffee break! If a test bench fails, we work out which check-in caused it and bounce it back to the engineer who did that check-in. If we have a choice of going after mosquitos versus a shark, I'd rather go for the mosquitos since in aggregate they cause more problems.
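
Bala's triage step, working out which check-in broke a regression and bouncing it back to its author, is essentially a bisection over the check-in history (the same idea as git bisect). Here is a minimal sketch of my own, assuming the check-ins are ordered and the failure reproduces deterministically; test_passes is a hypothetical stand-in for re-running the failing test bench at a given check-in.

    def first_bad_checkin(checkins, test_passes):
        # Binary-search an ordered list of check-ins for the first one at which
        # the failing test bench stops passing. Assumes the test passed before
        # checkins[0] and fails at checkins[-1].
        lo, hi = 0, len(checkins) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if test_passes(checkins[mid]):
                lo = mid + 1   # still passing here, so the breakage came later
            else:
                hi = mid       # already failing here, so look earlier
        return checkins[lo]

    # Toy example: check-ins c0..c9, with the regression introduced at c6.
    history = ["c%d" % i for i in range(10)]
    print(first_bad_checkin(history, lambda c: int(c[1:]) < 6))   # prints c6

The appeal is that the number of re-runs grows with the logarithm of the number of check-ins, which matters when each re-run is a multi-hour simulation.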

Mike: We are trying to scale humans to meet the needs of our designs. But tools have limits. The simulation will take 6 hours and will always take 6 hours. But we are not measuring the human cost of tackling more and more complex designs.

Adnan: Can we analyze tests without actually running them? The IP verification engineer is the only person in the whole company who really understands what the IP does. It is those guys' job to understand it 100%.

Mike: Our tools today are only as fast as they are. If we can get a technological leap forward and reduce 6 hours to 3 that would be great.

Ty: If it runs for more than 4-6 hours, we break it up.
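
Ty's rule of thumb, breaking up anything that runs longer than 4-6 hours, amounts to sharding the regression under a wall-clock budget. Here is a minimal greedy sketch of my own; the test names and runtime estimates are invented, and a real flow would take the estimates from previous regression runs.

    def shard_regression(tests, budget_hours=4.0):
        # Greedily pack (name, estimated_hours) tests into shards so that no
        # shard exceeds the wall-clock budget; longest tests go first to keep
        # the shards reasonably balanced.
        shards = []   # each shard is [total_hours, [test names]]
        for name, hours in sorted(tests, key=lambda t: -t[1]):
            for shard in shards:
                if shard[0] + hours <= budget_hours:
                    shard[0] += hours
                    shard[1].append(name)
                    break
            else:
                shards.append([hours, [name]])
        return shards

    tests = [("soc_boot", 3.0), ("dma_stress", 2.5), ("pcie_smoke", 1.0),
             ("video_pipe", 3.5), ("audio_loopback", 0.5)]
    for i, (hours, names) in enumerate(shard_regression(tests)):
        print("shard %d: %.1f hours -> %s" % (i, hours, names))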

Q: What is the dragon? Is it the simulation limit of what we can do? The number of scenarios we can run through? What is the verification limit?

Adnan: It’s not a dragon, it’s a hydra.

Ty: I see greater and greater emphasis on architectural specification and formal verification at those high levels, so that doing IP verification will lead to a greater probability of a successful SoC. We won't be able to verify with traditional methods, just as we don't verify boards with traditional methods.

Adnan: The biggest enemy is us. We have IP guys, integration guys, firmware guys (off on a different planet). These walls need to be broken down. How can we get useful sequences we write for UVM and make them available to firmware? These are big organizational issues.

Brian: Well, we are out of time. I want to thank everyone. We can’t give them a clap anyone can hear, but this was a very interesting panel.

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.
