Simon Segars opened Arm TechCon with a new look, having discovered that real men have beards. This is the 15th Arm TechCon.
In this post I'm going to focus on the new things that Arm announced during the keynotes.
Back when TechCon started, Simon reminisced, digital cell phones had finished their period of explosive growth and nobody knew what was coming next. Then smartphones came along and had their own period of explosive growth. Now growth has slowed again. But there are still enormous opportunities ahead in mobile, in what will turn out to be a new era of computing: The 5th Wave of Computing.
Intelligent semiconductors are becoming ubiquitous. Some 150B chips based on Arm products have now shipped. It took roughly 50 years for the first 50B Arm chips; the next 50B took just two years. That first 50B will seem like a warm-up act as we move towards a trillion connected devices.
Already announced back in July was Arm Flexible Access. This is targeted at smaller companies: you only pay for the IP you actually use at tapeout, there is zero license fee if the project is paused, changed, or stopped, and the terms are straightforward. There is a big chart listing which IP is included in the program; it excludes the big, expensive Cortex-Axx processors. When someone asked about that during the press Q&A, Arm pointed out that those are already multi-million-dollar projects, so this sort of flexible access doesn't make sense there. Simon said that a new partner comes into the program about once per week, many of them new customers rather than existing customers returning.
A change that Simon announced from the stage is Arm Custom Instructions. I have been involved with Arm since back when it was still part of Acorn. Arm has, until now, always taken the view that any software written for an Arm processor should run on any other with the same instruction set. I remember, back when I was at VLSI, that we had to be able to use Arm's manufacturing test vectors to test the microprocessor during manufacturing. Now that is total compatibility. "We are famed for standardization," Simon said.
Anyway, going forward you can add instructions. There are hooks in the instruction set for different types of instructions: how many registers are read, how many are written, and so on. There are corresponding hooks in the compilers, making it easy to create libraries that get to the custom instructions. The first processor to have this custom instruction capability is the Cortex-M33, in early 2020, and it will be built into all future v8-M-class processors. There are about 30 licensees already for that processor, and they will get the custom instruction capability for free. "This is a big change for us."
Next up was Dipti Vachani, who runs automotive and IoT at Arm. She talked about safety, software, and compute as the three missing links. But then she announced a fourth: collaboration. She announced a new consortium called AVCC, which stands for Autonomous Vehicle Computing Consortium. The members she announced were Arm, Bosch, Continental, Denso, GM, Toyota, NVIDIA, and NXP. The aim is to work on a common architecture. During the press Q&A, Arm said that the focus is not directly on standards, although standardization bodies might want to take some of the work they do and run with it.
Ian, Arm's VP of marketing for the cloud, had a couple of announcements, too.
First, the next processor after the one code-named Hercules has the code-name Matterhorn. So the series is A73, A75, A76, A77, Hercules, Matterhorn.
Matterhorn will have matrix multiply instructions, doubling performance on GEMM, which stands for General Matrix-Matrix Multiply.
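GEMM is the inner-loop workhorse of neural-network training and inference, which is why dedicated matrix-multiply instructions pay off. For reference, here is a minimal sketch of the kernel such instructions accelerate (a naive textbook version, not Arm's implementation):

```python
def gemm(A, B):
    """Naive general matrix-matrix multiply: returns C = A x B.

    A is n-by-k, B is k-by-m, and the result C is n-by-m.
    """
    n, k, m = len(A), len(B), len(B[0])
    assert all(len(row) == k for row in A), "inner dimensions must match"
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                # One multiply-accumulate per iteration -- the operation
                # that matrix instructions fuse into wider hardware steps.
                C[i][j] += A[i][p] * B[p][j]
    return C
```

A matrix-multiply instruction effectively performs many of these innermost multiply-accumulates in a single operation, which is where a claimed doubling on GEMM comes from.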
The other thing he announced is that Matterhorn will have support for bfloat16, sometimes called "brain float" since the motivation is its use in neural networks. Bfloat16 has the dynamic range of FP32 but fits into 16 bits, with corresponding reductions in silicon area and power. It does this by having the same number of bits for the exponent as FP32, but just 7 bits for the mantissa. It turns out that deep learning doesn't really need a lot of precision (often 8-bit integers are used). I believe bfloat16 was first developed by Google. Certainly, the first time I heard about it was at Cliff Young's keynote at Linley last year (see my post Inside Google's TPU for more about that).
The final keynote was Sha Rabii, VP of silicon engineering at Facebook. He talked about the computational directions required for augmented reality systems. The challenge is to fit everything into stylish AR glasses, which can then give you the ability to see in the dark, translate on the fly, and, the one everyone loves, remember everyone's names. He went into a little detail on that last one, showing just how hard it is. You don't want thousands of people's names floating over everyone in the audience at a keynote, but you don't want to have to make a specific request either. So it needs to be like a little assistant sitting on your shoulder. Then you need graphics to render the annotations, "world locking" to attach them to the person of interest as they move or as you move your head, and accounting for occlusions when others move in front of them. Plus, of course, the display needs to be bright enough to see in daylight.
He moved on to silicon. We need dramatic improvements, he pointed out: 100X in performance/power. Performance requirements are moderate, but power efficiency needs to be very high. The form factor is very demanding for stylish, lightweight glasses, the temperature rise has to be very limited for user comfort, and the wireless needs to be permanently connected at high speed. Latency of the on-device compute has to be very low so that rendered images stay locked to the see-through view of the world.
For the rest of his talk, he addressed three things: low power, AI, and process technology.
Power-constrained computing is dominated by memory access.
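That claim is easy to quantify with the much-cited energy numbers from Mark Horowitz's ISSCC 2014 keynote (45 nm figures, so illustrative rather than current): fetching an operand from off-chip DRAM costs hundreds of times more energy than computing with it.

```python
# Approximate energy per operation at 45 nm (Horowitz, ISSCC 2014):
fp32_add_pj = 0.9         # 32-bit floating-point add
fp32_mult_pj = 3.7        # 32-bit floating-point multiply
dram_read_32b_pj = 640.0  # 32-bit read from off-chip DRAM

# The DRAM access dwarfs the arithmetic it feeds:
ratio = dram_read_32b_pj / fp32_add_pj
```

Minimizing data movement, rather than speeding up the arithmetic itself, is therefore the main lever in power-constrained designs.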
AI today tends to use a single large, monolithic, highly capable accelerator. But an alternative is to put a neural net in every IP block (audio, vision, speech, SLAM, SRAM, power management, and so on). Then each network can be customized for its particular compute element. Power can be reduced more than you might think, since neural networks are an approximate technology already, so they can tolerate some errors. He reckons you can reduce the supply voltage and save 25% of the power before the error rate increases unacceptably.
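As a sanity check on that 25% figure: if dynamic power scales with the square of supply voltage (the standard first-order P ∝ C·f·V² model; the exact scaling behavior of his design is my assumption, not something he stated), only a modest voltage drop is needed.

```python
# First-order dynamic power model: P scales as V^2, with frequency and
# switched capacitance held constant (an assumption, not from the talk).
v_scale = 0.87               # run at 87% of nominal supply voltage
power_scale = v_scale ** 2   # relative dynamic power, about 0.76
savings = 1.0 - power_scale  # fraction of power saved, about 24%
```

So a roughly 13% voltage reduction buys the quoted quarter of the power, and the approximate nature of neural networks is what absorbs the resulting increase in errors.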
As for process technology, we need new wire and transistor materials and physics. Steep slope devices. Reliable lower resistance interconnect materials. "There are diminishing benefits from process technology and that puts the spotlight on innovation in our system."
Sign up for Sunday Brunch, the weekly Breakfast Bytes email.