Samsung Galaxy S9's Application Processor

20 Sep 2018 • 5 minute read

At this year's HOT CHIPS, Jeff Rupley of Samsung presented the application processor that goes in their Galaxy S9 and S9+ smartphones. Apple only ever gives cursory information about their Ax chips, and I don't remember seeing a lot of detail about the HiSilicon chips that go into Huawei's smartphones, so this was an opportunity to get a more detailed look under the hood at a state-of-the-art smartphone SoC. The chip is called M3, simply because it is the 3rd version.

Jeff gave some insight into the development schedule:

Planning started Q2 2014
RTL started Q1 2015
Forked features for an incremental M2 in Q4 2015
Replanned for a bigger M3 push Q1 2016
First tapeout Q1 2017
Product launch (Galaxy S9) Q1 2018

The chip is an Arm v8.0 64-bit (with 32-bit compatibility) processor, manufactured in Samsung's 10nm LPP process. It runs at 2.8GHz.

Jeff talked about the processor in parts: the front end (instruction decode and branch prediction), the middle machine (instruction reordering and dispatch), the FPU, and the load/store unit.

There is not a lot of point in having an enormous 228-entry ROB (reorder buffer) in the middle machine (comparable to Intel server chips) unless your branch prediction is extraordinarily good. They did a lot of work on this, using machine learning in the branch predictor. The M2 MPKI (mispredicts per 1000 instructions) was already 3.92, so it is hard to get much better. But they got the M3 down to 3.29. I have no idea what numbers other processor manufacturers achieve since I don't recall anyone revealing their statistics before.
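To put the two MPKI figures in perspective, here is a minimal Python sketch. The `mpki` helper and the idea of comparing the average gap between mispredicts against the ROB size are my own illustration, not something from the presentation, and the gap is only an average (real mispredicts are bursty).

```python
def mpki(mispredicts: int, instructions: int) -> float:
    """Branch mispredicts per 1000 instructions (hypothetical helper)."""
    return 1000.0 * mispredicts / instructions

# Figures quoted in the text
m2_mpki = 3.92
m3_mpki = 3.29
rob_entries = 228

# Relative reduction in mispredicts going from M2 to M3
reduction = (m2_mpki - m3_mpki) / m2_mpki

# Average number of instructions between two mispredicts
m2_gap = 1000.0 / m2_mpki
m3_gap = 1000.0 / m3_mpki

print(f"MPKI reduction: {reduction:.1%}")
print(f"M2: one mispredict every {m2_gap:.0f} instructions on average")
print(f"M3: one mispredict every {m3_gap:.0f} instructions on average")
print(f"M3 gap vs ROB size: {m3_gap / rob_entries:.2f}x")
```

On these numbers the reduction is about 16%, and the average run between mispredicts on the M3 (roughly 300 instructions) is somewhat longer than the 228-entry ROB, which is the sense in which a big window only pays off with a good predictor.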

Everything is bigger and wider in the middle machine:

Decode up to 6 instructions per cycle (vs 4 in the M2)
Rename, dispatch, retire up to 6 instructions per cycle (vs 4)
Up to 9 integer ops issued per cycle (vs 7) and a 4th ALU including a second multiplier
228 entry ROB (vs 100)
128 entry distributed integer scheduler (also >2X)
More ops done in 1 cycle, and some optimized to 0-cycle (no idea quite what that means, but I assume the ops somehow get overlapped with stuff in the front end)
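The M2-to-M3 ratios implied by the list above are easy to tabulate. This is just arithmetic on the quoted numbers; the dictionary layout and names are my own.

```python
# (M2 value, M3 value) as quoted in the list above
widths = {
    "decode per cycle": (4, 6),
    "rename/dispatch/retire per cycle": (4, 6),
    "integer ops issued per cycle": (7, 9),
    "ROB entries": (100, 228),
}

# Growth factor of each resource from M2 to M3
ratios = {name: m3 / m2 for name, (m2, m3) in widths.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

The front end gets 1.5x wider, while the reorder window grows by more than 2x, so the window is growing faster than the width.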

The floating point unit is also a "beast". With so much machine learning built around floating point MAC operations, they have added a lot. A 3rd dispatch and issue port. 3x 128b FP FMAC/FADD. 62-entry FP scheduler (>2x). FMAC down from 5 cycles to 4. FADD down from 3 cycles to 2.

The overall pipeline is in the above diagram.

The load and store unit has been beefed up, with 2 loads per cycle (up from 1) and 1 store per cycle. It can handle 12 outstanding misses (versus 8 before). The translation lookaside buffers (TLBs) have been expanded with a new mid-level DTLB and an L2 TLB with 4 times the capacity.

[Figure: M3 instructions per cycle]

The graph above shows the result across 4800 instruction traces. The IPC (instructions per cycle) has gone from 1.26 for the M2 to 2.01 (so cycles per instruction is below 0.5).
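For anyone who wants to check the IPC arithmetic, here is a short Python sketch. It uses only the two averages quoted above; the variable names are mine, and the uplift is a ratio of averages across the traces, not a per-trace figure.

```python
# Average IPC figures quoted in the text
m2_ipc = 1.26
m3_ipc = 2.01

# Cycles per instruction is the reciprocal of IPC
m2_cpi = 1.0 / m2_ipc
m3_cpi = 1.0 / m3_ipc

# Relative IPC improvement from M2 to M3
uplift = m3_ipc / m2_ipc - 1.0

print(f"M2 CPI: {m2_cpi:.3f}")
print(f"M3 CPI: {m3_cpi:.3f}")
print(f"IPC uplift: {uplift:.1%}")
```

The M3 CPI comes out just under 0.5, and the IPC uplift is a little under 60%, before any gain from clock frequency.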

No HOT CHIPS presentation is complete without a die picture. The above plot shows the chip layout. This is just one core, and there are 4 cores on the whole M3.

The overall performance is impressive. The above chart compares the M3 to the M2 and also to the Arm A75 (presumably as it comes from Arm before all the modifications that Samsung made, but even running at a slightly higher clock rate). The graph for performance per Watt was equally impressive.

To wrap up, Jeff said that they were on a roll and are doing a new processor every year. He didn't quite say it, but it seems clear that there will be an M4 in 2019.

The takeaway that I got from this presentation is that there is starting to be very little difference between mobile processors (at least at the very high end) and server processors. Servers have a higher clock rate and burn a lot more power as a result; mobile processors have to back that off a bit (but 2.8GHz is not backing off a lot). Servers have a lot more cores too. But the underlying ingredients (speculative execution, large caches, very wide and deep out-of-order execution, great branch prediction, and more) make for similar architectures.

HiSilicon

I wrote this soon after HOT CHIPS even though it is only appearing now. In the meantime, after I said that I'd not seen anything about HiSilicon's processors, the company announced its latest chip, the Kirin 980, which is the world's first 7nm mobile chip. It was announced at IFA, Europe's biggest tech show, which I wasn't at, so this is second-hand information. What does IFA stand for, I hear you ask? Internationale Funkausstellung. So just say IFA like everyone else. In case you don't know, HiSilicon is a wholly-owned fabless semiconductor arm of Huawei, based in Shenzhen, just over the river from Hong Kong.

It has 6.9B transistors and it took 1000 engineers 36 months (3 years) to design. Venture Beat doesn't understand semiconductor design, since it says "it took more than 5,000 prototypes" to get it right. I can only guess that meant they compiled the RTL for verification 5,000 times, which is still about 5 times per day for 3 years. Power is down 40% from the prior 10nm chip. It also contains two (up from one) NPUs, neural processing units, presumably for doing all the MACs associated with neural net inference.
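The "about 5 times per day" estimate can be checked with a few lines of Python. The 250 working days per year is my own assumption, added only to show how the rate changes if weekends and holidays are excluded.

```python
# Figures quoted in the text
prototypes = 5000
years = 3
engineers = 1000

calendar_days = years * 365
working_days = years * 250  # assumption: about 250 working days per year

per_calendar_day = prototypes / calendar_days
per_working_day = prototypes / working_days

# Total design effort implied by 1000 engineers for 3 years
engineer_years = engineers * years

print(f"Per calendar day: {per_calendar_day:.1f}")
print(f"Per working day:  {per_working_day:.1f}")
print(f"Engineer-years:   {engineer_years}")
```

That gives roughly 4.6 per calendar day, or closer to 7 per working day under the stated assumption.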

All that and "Facebook opens 0.3s faster; Snapchat opens 0.2s faster." That seems underwhelming, maybe unless you are an impatient teenager. More impressive to me is LTE cat 21 with download speeds of 1.4 Gbps, and WiFi even faster ("the world's fastest") at 1.7 Gbps. Despite its high-end specs, Huawei has announced that it will be used by Honor, its budget brand, in its Magic 2 smartphone, and not just in the high-end Mate 20 (not yet officially announced).
