EDPS is the Electronic Design Process Symposium, historically held in Monterey, but this year in Milpitas for the first time. My first post about the symposium is Solving the Design to Manufacturing Problems in Milpitas. That post ended saying that the rest of the day was taken up by two more sessions:
There were actually five people who presented in this session:
I am going to focus on just two of these, to keep this post to a reasonable length. David White because...well, Cadence. And Jeff Dyck of Solido, since they have been doing machine learning for longer than anyone else, I believe, with it being a key component of their products since they were founded up in the Saskatchewan prairies. Plus some slide decks were missing from the thumbdrive of the proceedings.
David opened by pointing out a big weakness of traditional EDA tools. They pretty much assume that each run of a tool is completely independent and behave as if nothing has been learned from any previous runs. Actually, that's not quite true. The user of the tools is expected to inspect the results and use information there to inform better choices of switches and parameters for subsequent runs. But the tool has no memory.
For example, during physical design:
The process seems to be especially difficult in EDA since there are lots of factors that may be unobservable, such as design intent or preferences. The process is dynamic, with one decision depending on others in a way that is not captured in machine-readable form. Users often complain that they are trying to get a tool to do something, but since there is no way for them to express that something, they have to indirectly use the knobs that they are given.
A trivial example from another domain that I like to use to show this: suppose you want a picture on one page of a document (in, say, Word) with the text about it on the opposite page. Of course, you can place everything explicitly, but then when other things in the document change, that becomes the wrong decision. But (as far as I know) there is no direct way to express your "design intent," which is that you want the picture opposite the text. In a design with millions of placeable instances, there are lots of things like this that the designer wishes for but has no way to express to the physical design tool.
The solution is for the tool to learn what the designer considers "good" and do more of it. This requires dealing with a lot of unstructured data, so the three key technologies are analytics, machine learning, and optimization. These can then be used, as in the above diagram, to prepare the data, do inference on it, and then adapt the way the tool processes the design.
Over the last few years, there has been a complete about-turn on how to do feature extraction in vision, moving from "programming" smart algorithms using very clever tricks and algorithms, to simply throwing the problem at a convolutional neural network and "training" it to recognize things. This whole about-turn can be summed up in the phrase "training is the new programming." This is what deep learning is. David gave a short tutorial on deep learning, but I'm assuming you already know, or else if you want a summary of where the latest thinking is, then try my post Machine Learning for Higher Performance Machine Learning.
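The "training is the new programming" idea can be shown with the simplest possible learner. Everything below is a made-up illustration (a toy perceptron rather than a convolutional network): instead of hand-coding a classification rule, we learn an equivalent rule from labeled examples.

```python
import numpy as np

# "Programming": a hand-coded rule that classifies points above a line.
def programmed(x):
    return 1 if x[0] + x[1] > 1.0 else 0

# "Training": learn the same rule from labeled examples with a perceptron.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = np.array([programmed(x) for x in X])   # labels from the "ground truth" rule

w = np.zeros(2)
b = 0.0
for _ in range(100):                       # training epochs
    for xi, yi in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        w += (yi - pred) * xi              # weights change only on mistakes
        b += (yi - pred)

def learned(x):
    return 1 if x @ w + b > 0 else 0

# The learned rule agrees with the programmed one almost everywhere.
agreement = sum(learned(x) == programmed(x) for x in X) / len(X)
```

Nobody wrote the decision boundary into `learned`; it emerged from the data. A deep network does the same thing at vastly larger scale, with millions of weights instead of three.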
The same ideas that are applicable to handling very large unstructured image datasets are also applicable to EDA. Some places that look especially attractive are fast models for parasitic extraction, hotspot detection in layout, place and route, and macro-models for circuit simulation.
For example, machine learning is used in Virtuoso Electrically-Aware Design (EAD) to estimate in-design capacitance, as shown in the above diagram. IR analysis is done in a similar way. David hinted at other work going on at Cadence (as has Anirudh Devgan in various keynotes in the last year or so) but since we haven't announced anything, you'll just have to wait until we do.
Jeff is the VP of Technical Operations for Solido Design Automation. They have a number of products, but the problem they try to address is characterization of analog designs, memories, and standard cells. To do this in a modern process doesn't just require characterizing at a few corners. There is no knowing where the worst case and the best case might occur. You might think that the typical case for any given parameter would always fall between the best case and the worst case...but you'd be wrong. So the only safe approach is to characterize everywhere, potentially millions of corners. What Solido does is use machine learning to reduce this intractably large number of simulations, discovering where the interesting places are without simulating everything.
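The economics of this idea can be sketched with a toy surrogate model. Everything here is a hypothetical stand-in (the "simulator," the two process parameters, and the quadratic model are illustrations, not Solido's actual algorithms): simulate a small sample of corners, fit a cheap model, and use the model to point at the interesting corner.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for an expensive SPICE run: a delay as a function
# of two process parameters, each swept over [-3, 3] sigma.
def simulate(v):
    return 1.0 + 0.30 * v[0] - 0.20 * v[1] + 0.05 * v[0] * v[1] + 0.02 * v[1] ** 2

# The full corner grid: simulating all of it is what we want to avoid.
grid = np.array(list(itertools.product(np.linspace(-3, 3, 100), repeat=2)))  # 10,000 corners

# Simulate only a small random subset...
idx = rng.choice(len(grid), size=60, replace=False)
Xs = grid[idx]
ys = np.array([simulate(v) for v in Xs])

# ...and fit a quadratic surrogate model by least squares.
def features(V):
    a, b = V[:, 0], V[:, 1]
    return np.column_stack([np.ones(len(V)), a, b, a * b, a ** 2, b ** 2])

coef, *_ = np.linalg.lstsq(features(Xs), ys, rcond=None)

# Use the cheap surrogate to find the predicted worst-case corner,
# then spend one real simulation confirming it.
pred = features(grid) @ coef
worst = grid[np.argmax(pred)]
```

Here 60 simulations plus one confirmation run stand in for 10,000, and the worst-case corner the surrogate finds is the true one. The real products use far more sophisticated, adaptive techniques, but the payoff has the same shape.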
This sort of approach parallelizes naturally onto big datacenter/cloud infrastructure, with literally 1,000 or more cores. Since they have been doing it for a long time, they have learned that "things break and we need to recover and keep going." My PhD thesis was actually on distributed file systems, and it is interesting to see some of the ideas required to build a robust file system creeping back into what is required to build a robust parallelized EDA tool. Disks may go offline and come back, or they may go offline and die (if you ever saw a head crash on an old disk drive, with smoke pouring out of the drive, you knew you were never going to be reading that data again).
The other lesson that they learned early on is that you have to understand the machine learning algorithms, since you have to be able to prove that the answers are correct. As Jeff put it, "nobody wants to get the wrong answer faster." In the early days of Solido, back in 2006/7, nobody would buy anything: customers got the right answer, but Solido couldn't prove the technology was working correctly.
This required them to build infrastructure to do what they were doing more robustly:
The above diagram shows the basic idea of what Solido does. The triangle on the left represents "all" the points that would have to be simulated to be sure of covering everything. Instead, only a few points (the ones shown missing from the triangle) are simulated (and moved to the box on the right, since they now have data).
By using machine learning, they can deduce a lot about how the parameters interact and derive the entire surface of the model without needing to do all the simulations. Jeff went into a lot more detail on what is going on under the hood, but there isn't really space here to go into that. Having got the model, they can then run additional simulations and prove that at all those points, the model is very close to reality and thus the model is good, what Solido calls self-verification.
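That self-verification step can be sketched the same way. Again, everything below is a toy (a least-squares surrogate standing in for Solido's proprietary models): after fitting the model, spend a few extra simulations at points the model has never seen, and accept the model only if its predictions match.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-in for the expensive simulator (delay vs. two parameters).
def simulate(v):
    return 0.5 + 0.1 * v[0] - 0.3 * v[1] + 0.04 * v[0] * v[1]

# Feature basis for the surrogate model.
def features(V):
    return np.column_stack([np.ones(len(V)), V[:, 0], V[:, 1], V[:, 0] * V[:, 1]])

# Build the model from a small set of training simulations.
Xtrain = rng.uniform(-3.0, 3.0, size=(40, 2))
ytrain = np.array([simulate(v) for v in Xtrain])
coef, *_ = np.linalg.lstsq(features(Xtrain), ytrain, rcond=None)

# Self-verification: simulate fresh points the model never saw and
# accept the model only if it matches reality at every one of them.
Xcheck = rng.uniform(-3.0, 3.0, size=(20, 2))
ycheck = np.array([simulate(v) for v in Xcheck])
errors = np.abs(features(Xcheck) @ coef - ycheck)
model_ok = bool(errors.max() < 1e-6)
```

The key point is that the verification runs are independent of the training runs, so a model that passes has earned some trust, which addresses the "nobody wants the wrong answer faster" objection.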
One other thing that Jeff talked about was taking their machine learning technology, stripping out all the circuit simulation stuff, and using that infrastructure to solve other customer problems. They partner with a lead customer, define the problem, prototype it and, if successful, Solido can productize the solution working with the partner. They have been doing this stuff for over ten years, since before it was a hot topic, so they know a lot that other companies have yet to discover.
The final session of EDPS was (mostly) about test. It took the form of a panel discussion, with questions from the moderator, Ron Leckie of Infras-Advisors, and then later from the audience. The panel was:
The first question was about what were the problems integrating across the silos of design, test, manufacturing, etc.
Dan said that there were starting to be big improvements in chip-package co-design, and they can now walk the design team through what needs to happen for them to design something that can be put in a package, not just manufactured to the bare die stage. With IP vendors, though, it can be very hard to get the information that is needed.

Keith said that there are no real technical problems. The fabs don't really want to share information like yield statistics, which is one problem, but even if they did, there isn't really a mechanism. Typically what is needed is not the raw data, which can be obfuscated, but standard formats are a problem because there are none.

Zoë pointed out that, as she said in her keynote, structural test doesn't exercise chips in the same way as the system. At Cisco, life revolves around the traffic test, and you can't do that on an ATE. High coverage is required in the right spots, but there are challenges deciding where the right spots are.

Craig said that NVIDIA has changed a lot over the years and there is more and more talk about deep learning (and NVIDIA is selling more and more boards and systems to do that). NVIDIA is trying to adopt an integrated test flow and reduce redundant testing, with the basic idea being to find a defect as close as possible to where it occurred (if it is a process problem, for example, catch it at wafer sort, not after packaging). Minimizing waste is a big thing; they aim never to throw away a board because of a bad chip.
The next question was how to handle DVFS (dynamic voltage and frequency scaling). Large server processors make a lot of use of DVFS, throttling clocks, and dark silicon (not all cores on). The voltage drops are all different depending on which cores are on. The temperatures are all different. How do you handle that?
Zoë said that Cisco uses this all the time, but each IDM has a different approach to voltage scaling, making it harder. Craig said that at NVIDIA they are going through a lot of this, and the only solution is to get system designers, board designers, and system architects to all work together. This is another area with a big gap between functional test and test on the ATE.
A question from the audience was how do all the different types of test (analog, digital, structural) go together?
Derek said that ATE can now test everything at the same time rather than requiring separate ATEs, but developing the test program is very hard. System-level test needs things like jitter, crosstalk, and the effect of cables. If you had an unlimited number of power supplies, you could do more, but typically many pins share one power supply, so you don't get the access you want inside the chip. You need to decide at design time what you may need to isolate, since you can't do it all.
Question: When you have very large die mounted on a ceramic substrate, differential thermal expansion causes delamination. But you don't have this problem with stacked die, since they have the same thermal expansion.
Question: How to improve yield when nobody will share the data required?
Zoë said that we need chip traceability, blinded from the full history of fab data. You need the silo so that the technology doesn't get in the way of business. A big thing is to avoid actually having to ship the parts back, just the data, because shipping parts back takes "forever." Craig focused on memory and said that you need a silo of information between the memory supplier and the customer. Memories are the toughest nut in the module business, since memory suppliers share nothing. It took NVIDIA two years to get the information, and it came with "don't tell your purchasing guys or they will beat us to death." You can't even find out how many cells needed to be repaired on a die; they won't open up.
Question: Memory repair has been going on for a long time. What about logic? Do you put on spare cores?
I missed who said it (probably someone from NVIDIA), but they admitted that the number of cores in the final product is fewer than are actually on the chip, so yes, there are spare cores. Derek said that they actually build redundancy into the tester itself: stuff they think might be useful in the future. They put Arm cores in there in case they need to do additional calibration. So even at the tester level, building in some redundancy can have a lot of value.
Question: About multi-die ICs... Self-check? Maybe self-repair?
For NVIDIA, it is all driven by automotive standards, since a car has to check itself and run diagnostics. So there will be a lot more use of these built-in test modes. Zoë said that a driver for Cisco is that huge die don't yield, so it makes more sense to break them apart and use chiplets, compared to a very big die with a very expensive mask set. Javi pointed out that stacked memories are already very popular, where each layer is tested separately to produce known-good die. Each layer has redundancy, but it can only be used within that layer. If you had more flexibility, you might not even need to test the die at all prior to assembly.
And with that EDPS was over and we all went out into the Friday rush-hour Silicon Valley traffic.
Sign up for Sunday Brunch, the weekly Breakfast Bytes email.