The very last presentation at HOT CHIPS earlier this week was by Elene Terry of Microsoft on The Silicon at the Heart of the HoloLens 2. Her talk covered why Microsoft built custom silicon for the HoloLens 2 and what is in it.
I wrote about HoloLens 1 a couple of years ago in my post Lifting the Veil on Hololens, covering the opening keynote at that year's Embedded Vision Summit.
The Holographic Processing Unit, the HPU2, is in the front of the unit, and there is a separate application processor (AP) at the back. Since the HPU is "brow mounted", the design is very thermally constrained.
I have used the HoloLens (the 1, not the 2). One of the most impressive things is what they call hologram stability. These are not true laser holograms; they are images superimposed on the world. For example, they can outline an object, say to help a mechanic repair something, and the outline appears to be locked onto the object (as in the image). If you move your head or your eyes, the hologram stays in place.
Here's a partial list of functions that the HPU2 handles:
Here's the obligatory HOT CHIPS die shot. It is a 79mm² die built in TSMC 16FF+, with 123M gates, 2 billion transistors, and 125 Mb of SRAM. It taped out in September 2016 and was first-time-right silicon.
The chip contains two kinds of general-purpose compute nodes, known as SFP, for SIMD fixed-point, and VFP, for vector floating-point, each containing two Tensilica processors. Both are 128-bit (SFP was only 64-bit in HPU1). There are 13 of these nodes on the die, 7 SFP and 6 VFP, giving a total of over 1 TOPS of programmable compute across the 26 Tensilica processors. Each of the tracker functions is statically assigned to one of these compute nodes.
They have hundreds of customized instructions, some of them the obvious things for geometry like ARCTAN and SQRT. But they also did a lot of analysis of inner loops to find operations that took tens of instructions and could be reduced to a single instruction. One example she gave was boxavg_2x16x8, which averages a block of pixels, an operation applied thousands of times per frame. The custom instruction reduced it to a single cycle.
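To make concrete what such a custom instruction collapses, here is a scalar reference model of a block-average kernel. This is purely illustrative: the exact geometry implied by the boxavg_2x16x8 name (block dimensions, output count) wasn't spelled out, so the 16×8 block size here is an assumption. The point is that the nested loop below costs hundreds of scalar operations, which the custom instruction performs in one cycle.

```python
def box_average(pixels, top, left, h=16, w=8):
    """Scalar reference model of a block-average kernel (the kind of
    inner loop that boxavg_2x16x8 collapses into a single instruction).
    pixels: 2D list of 8-bit values; block geometry is assumed, not
    taken from the talk. Returns the integer mean of the h*w block."""
    total = 0
    for r in range(top, top + h):        # h rows ...
        for c in range(left, left + w):  # ... of w pixels each
            total += pixels[r][c]
    return total // (h * w)              # integer average
```

Run on a simple ramp image, each output takes h*w adds plus a divide in software, versus one cycle in the HPU2's vector pipeline.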
There is also some "hardened compute", meaning functions implemented directly in hardware rather than in software on the processors. They took existing C code and turned it into RTL. For their joint bilateral filter, this reduced power to 1/3 and latency to 1/30th of the software version (the slide said "reduced by 1/3 and 1/30th", which I take to mean the resulting power was 1/3 and the latency 1/30th of what they were before). There is also a hardened neural network on the chip, for use in one of the workloads.
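For readers unfamiliar with the filter that got hardened: a joint (or cross) bilateral filter smooths one image while taking its edge-stopping weights from a second "guide" image, so edges in the guide are preserved in the output. The sketch below is a textbook reference model, not Microsoft's C code; the parameter names and clamp-to-edge border handling are my assumptions. It is exactly this kind of deeply nested, per-pixel loop that benefits from being turned into RTL.

```python
import math

def joint_bilateral(data, guide, r, sigma_s, sigma_r):
    """Textbook joint bilateral filter reference model (assumed form,
    not Microsoft's actual C code). Smooths `data` using spatial
    Gaussian weights and range weights computed from `guide`.
    r: window radius; sigma_s/sigma_r: spatial/range Gaussian widths."""
    h, w = len(data), len(data[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            num = den = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    # clamp-to-edge border handling (an assumption)
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    ws = math.exp(-(dx * dx + dy * dy) / (2 * sigma_s ** 2))
                    dg = guide[yy][xx] - guide[y][x]
                    wr = math.exp(-(dg * dg) / (2 * sigma_r ** 2))
                    num += ws * wr * data[yy][xx]
                    den += ws * wr
            out[y][x] = num / den        # normalized weighted average
    return out
```

Every output pixel needs (2r+1)² exponentials, multiplies, and a divide, which is why a fixed-function hardware pipeline can beat a programmable core on both power and latency by such large factors.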
The full list of hardware blocks Elene gave was:
The thermal budget was 2.4W averaged over 30 minutes at 85°C. They took four workloads, represented by the pictures below, and ensured they could accommodate them within that power budget. Of course, they did the usual things like clock gating and power gating. Most of the digital logic runs at 250MHz and the processors at 500MHz. They got additional savings by lowering the supply voltage to below 630mV, which cut power by a further 20%.
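That 20% figure is consistent with the usual first-order model in which dynamic CMOS power scales with the square of the supply voltage (P ≈ C·V²·f at fixed frequency). The helper below is an illustration of that square law only; the nominal voltage the HPU2 scaled down from wasn't stated, so the ~10% drop in the example is a hypothetical, and real savings also depend on leakage and timing margin.

```python
def dynamic_power_ratio(v_new, v_old):
    """First-order dynamic-power scaling: P = C * V^2 * f, so at fixed
    frequency power scales as (V_new / V_old)^2. Illustrative only;
    ignores leakage and any frequency change."""
    return (v_new / v_old) ** 2

# A hypothetical ~10% voltage reduction (e.g. 0.70V -> 0.63V) gives
# roughly a 19% dynamic-power saving, in the ballpark of the quoted 20%.
ratio = dynamic_power_ratio(0.63, 0.70)
```

This is why voltage scaling is such a powerful lever: the saving is quadratic in V, whereas clock gating only recovers power linearly with switching activity.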
There is a lot of I/O. The AP at the back handles WiFi and storage, but also rendering, which the HPU doesn't do. So data is sent to the back for rendering and then returned to the front for display, which requires a lot of fast I/O and was a challenge.
Elene wrapped up with a comparison of HPU2 vs HPU1:
There are also 4 new workloads that HPU1 didn't handle:
Sign up for Sunday Brunch, the weekly Breakfast Bytes email.