HOT CHIPS: Arm's Morello

18 Oct 2022

HOT CHIPS was back in the summer, and I covered it in two overview posts (and some others on specific topics):

I mentioned that I would be writing about the Arm Morello presentation. Well, better late than never! Richard Grisenthwaite of Arm presented Arm Morello Evaluation Platform — Validating CHERI-Based Security in a High-Performance System. The whole CHERI/Morello project, which seems to involve Cambridge University, the University of Edinburgh (my Almae Matres, yes I did learn Latin in school), and Microsoft, is based on the premise that the biggest challenge facing the design of electronic systems is security. I would put the top challenge as coping with the fact that processors improve performance very fast but memory does not. But security is certainly up there. In fact, the Spectre and Meltdown security vulnerabilities arose from all the work to improve processor performance with out-of-order execution and caches.

security is greatest challenge

Matt Miller of Microsoft, in 2019, pointed out that around 70% of CVEs are memory unsafety issues. CVE stands for Common Vulnerabilities and Exposures; a CVE is what we might call a security bug. If you are not a programmer, this won't mean much, but the top four memory unsafety issues were:

Heap out of bounds
Use-after-free
Type confusion
Uninitialized use

Chromium (the open-source browser project that underlies Chrome, Edge, and many other web browsers) had a similar report:

70% of our serious security bugs are memory safety problems

Modern languages like Rust make this sort of error a lot less of an issue, but there are billions of lines of C and C++ that are not going away any time soon. Students might learn Python but for work that requires serious computation (like EDA), C++ is still the workhorse language.

CHERI

I last wrote about CHERI/Morello in my post What Is a Capability? CAP, CHERI, and Morello in 2020. In that post I said:

Morello, an experimental CHERI-extended, multicore, superscalar ARMv8-A processor, system on chip (SoC), and prototype board to be available from late 2021.

Well, it is now 2022 and Morello is a 7nm SoC implementation of a capability-enhanced version of the Arm Neoverse N1 processor.

CHERI architecture

The above slide gives a summary of the CHERI architecture. The basic idea of a capability is that it gives you permission to access a certain area of memory, and this is checked on all memory accesses. This obviously requires extensive hardware support if it is not going to slow computation down excessively. In CHERI, there is a 128-bit capability in the register file (plus a tag bit to prevent getting around the system by forging a capability). The PC (program counter) is a capability (called the PCC) that limits what code can be executed.
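To make the idea concrete, here is a minimal software sketch of a capability, assuming a toy model rather than the real CHERI or Morello encoding. The names `Capability`, `Memory`, `restrict`, and `faults` are all hypothetical, and the permission set is reduced to three strings. It shows the two properties described above: every access is checked against bounds and permissions, and a capability that has been tampered with loses its tag and can no longer be used.

```python
from dataclasses import dataclass, replace


class CapabilityFault(Exception):
    """Raised where real hardware would trap on an invalid access."""


@dataclass(frozen=True)
class Capability:
    base: int          # lowest address the holder may touch
    length: int        # size of the permitted region in bytes
    address: int       # current pointer value
    perms: frozenset   # subset of {"load", "store", "execute"}
    tag: bool = True   # validity tag; a forged capability has it cleared

    def restrict(self, base, length, perms=None):
        """Derive a narrower capability; rights may shrink but never grow."""
        perms = self.perms if perms is None else frozenset(perms)
        if not self.tag:
            raise CapabilityFault("untagged capability")
        if base < self.base or base + length > self.base + self.length:
            raise CapabilityFault("bounds would grow")
        if not perms <= self.perms:
            raise CapabilityFault("permissions would grow")
        return replace(self, base=base, length=length, address=base, perms=perms)

    def check(self, perm, size):
        """The check applied to every access of `size` bytes."""
        if not self.tag:
            raise CapabilityFault("tag cleared")
        if perm not in self.perms:
            raise CapabilityFault("missing permission: " + perm)
        end = self.address + size
        if self.address < self.base or end > self.base + self.length:
            raise CapabilityFault("out of bounds")


class Memory:
    def __init__(self, size):
        self.data = bytearray(size)

    def load(self, cap, size=1):
        cap.check("load", size)
        return bytes(self.data[cap.address:cap.address + size])

    def store(self, cap, value):
        cap.check("store", len(value))
        self.data[cap.address:cap.address + len(value)] = value


def faults(action):
    """Return True if calling `action` raises a CapabilityFault."""
    try:
        action()
    except CapabilityFault:
        return True
    return False


mem = Memory(64)
root = Capability(0, 64, 0, frozenset({"load", "store"}))
buf = root.restrict(16, 8)                    # an 8-byte "heap object"
mem.store(buf, b"12345678")                   # fits exactly
overflow = faults(lambda: mem.store(buf, b"123456789"))  # one byte too many

# Overwriting a capability as plain data is modeled by clearing its tag.
forged = replace(buf, length=64, tag=False)
forgery_caught = faults(lambda: mem.load(forged, 32))
```

In this sketch the heap overflow and the forged pointer both fault before memory is touched, which is the behavior the hardware checks are meant to provide.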

The two key applications of these CHERI primitives are:

Efficient, fine-grained memory protection for C/C++
- Strong source-level compatibility, but requires recompilation and minor source-code changes
- Deterministic and secret-free referential, spatial, and temporal memory safety
- Retrospective studies estimate 2⁄3 of memory-safety vulnerabilities mitigated
- Generally modest overhead (0%-5%, some pointer-dense workloads higher)
Scalable software compartmentalization
- Multiple software operational models from objects to processes
- Increases exploit chain length: Attackers must find and exploit more vulnerabilities
- Orders-of-magnitude performance improvement over MMU-based techniques (>90% reduction in IPC overhead in early FPGA-based benchmarks)

If you want to do a really deep dive into the security of CHERI C/C++ then Microsoft's Security Response Center created a 42-page report Security Analysis of CHERI ISA. One conclusion was that it would have mitigated at least two-thirds of Microsoft's security issues that I mentioned at the start of this post.

Arm has created a prototype architecture, a software model, a toolchain, and so on. Lots of detail on Arm's Morello page. But the most significant thing has to be...

arm morello chip

This is a 110mm² design in TSMC's N7 process, a CPU that runs at 2.5GHz. A Morello-based system needs other changes than just a new CPU. It requires some additional storage, 1 bit per 16B of data. That is a lot. Note that it is 1 bit per 16 bytes, not per 16 megabytes or something. These extra bits can either be stored by widening existing memory structures, or by storing them in a separate structure. System buses need to transport tag information, which is done using existing signals to decrease "boil the ocean" protocol changes.
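The arithmetic behind "1 bit per 16 bytes" is easy to work through. The sketch below assumes a hypothetical 16 GiB system (the DRAM size is my choice for illustration, not a Morello specification) and computes how much storage the tags need.

```python
GRANULE_BYTES = 16   # one tag bit guards each 16-byte granule
BITS_PER_BYTE = 8


def tag_bytes(memory_bytes):
    """Bytes of tag storage needed for `memory_bytes` of data."""
    granules = memory_bytes // GRANULE_BYTES
    return (granules + BITS_PER_BYTE - 1) // BITS_PER_BYTE  # round up


GIB = 1 << 30
MIB = 1 << 20

dram = 16 * GIB                                # hypothetical system size
overhead_mib = tag_bytes(dram) / MIB           # tag storage in MiB
overhead_fraction = tag_bytes(dram) / dram     # 1 bit per 128 bits of data

print(f"{overhead_mib:.0f} MiB of tags, {overhead_fraction:.2%} overhead")
```

So the overhead is 1/128 of memory, a bit under 1%, or 128 MiB of tags for the assumed 16 GiB of DRAM.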

morello bounds checking

The biggest change in the architecture from a "normal" Arm Neoverse is the bounds checking. The upper and lower bounds information is compressed into 64 bits. But when bounds checks are needed it has to be decompressed. And "when bounds checks are needed" means on every load, every store, and every branch. The decompression is done in parallel with address generation.

Decoding a compressed CHERI format requires two shifters, one adder, two short comparators, and one wide comparator. One big issue with compressing two 64-bit bounds and a 64-bit pointer into 128 bits is that not everything can be represented. There are three regions as a result: the entire address space, the representable region, and the dereferenceable region. When necessary, the representable region will need to be recalculated.
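Why can't everything be represented? A deliberately simplified sketch helps, assuming a floating-point-style scheme in which bounds are stored as a small mantissa plus a shared exponent. This is not the actual Morello encoding (which, among other things, stores bounds relative to the pointer, and that is where the representable region comes from). The 14-bit mantissa width and the function names are assumptions chosen for illustration.

```python
MANTISSA_BITS = 14  # assumed field width for illustration only


def compress_bounds(base, top):
    """Return (exponent, base_units, length_units) covering [base, top).

    Picks the smallest exponent for which the length, measured in
    units of 2**exponent bytes, fits in MANTISSA_BITS bits.
    """
    exponent = 0
    while True:
        unit = 1 << exponent
        lo = base // unit            # round base down to the unit
        hi = -(-top // unit)         # round top up to the unit
        if hi - lo < (1 << MANTISSA_BITS):
            return exponent, lo, hi - lo
        exponent += 1


def decompress_bounds(exponent, lo, length_units):
    """Expand the compressed fields back into byte addresses."""
    base = lo << exponent
    return base, base + (length_units << exponent)


def representable(base, top):
    """True if the bounds survive a compress/decompress round trip exactly."""
    return decompress_bounds(*compress_bounds(base, top)) == (base, top)


small = (0x1003, 0x1003 + 100)          # 100-byte object, odd alignment
large = (0x1003, 0x1003 + 1_000_000)    # ~1 MB object, same odd alignment

small_exact = representable(*small)
large_exact = representable(*large)
large_padded = decompress_bounds(*compress_bounds(*large))
```

In this toy scheme the small object gets byte-exact bounds, while the large one has its base rounded down and its top rounded up to a 64-byte unit. An allocator would have to align and pad large objects so that the rounded bounds never overlap a neighbor.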

morello pcc

The PC is extended to 129 bits to form the Program Counter Capability or PCC. See above. All branches (including procedure returns) need bounds checks. Direct branches (including conditional branches) will be within the existing PCC. But indirect (calculated) branches, including returns, may not just be within the existing PCC but may change the PCC. One of the challenges is that all high-performance microprocessors such as Morello contain a branch predictor, and that has to continue to work. There are three possible ways to handle this (and I don't think Richard said which of them is used in the chip):

Extend branch predictor to hold PCC (simple 1-bit direction prediction or more complex and larger predictions)
Statically predict PCC does not change and take branch mispredict penalty if wrong
Stall PCC dependencies until PCC is known
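As a rough illustration of the second option, the sketch below counts how often a "PCC does not change" prediction would be wrong over a trace of branches. Everything here is hypothetical: the 12-cycle penalty, the trace, and the representation of a PCC as a `(base, limit)` pair are assumptions for the example, not figures from the presentation.

```python
MISPREDICT_PENALTY = 12  # hypothetical pipeline-flush cost in cycles


def static_pcc_penalty(branch_trace, initial_pcc):
    """Cost of always predicting that a branch leaves the PCC unchanged.

    branch_trace: the PCC bounds (base, limit) in effect after each branch.
    Returns (mispredict count, total penalty cycles).
    """
    current = initial_pcc
    mispredicts = 0
    for new_pcc in branch_trace:
        if new_pcc != current:       # the "unchanged" prediction was wrong
            mispredicts += 1
            current = new_pcc
    return mispredicts, mispredicts * MISPREDICT_PENALTY


app = (0x4000, 0x8000)   # PCC bounds while running application code
lib = (0x1000, 0x2000)   # PCC bounds while running a library compartment

# Two branches inside the app, a call into the library, one branch
# inside it, then a return followed by two more app branches.
trace = [app, app, lib, lib, app, app, app]
mispredicts, cycles = static_pcc_penalty(trace, initial_pcc=app)
```

Only the call into the library and the return from it change the PCC here, so two of the seven branches pay the penalty. Whether that is acceptable depends on how often real code crosses PCC boundaries, which is presumably one of the things an evaluation platform is meant to measure.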

cheri capabilities in memory

There are tradeoffs in how to store the CHERI capability validity tag in memory. Ideally, you could just get a DRAM with the extra bits but that's not going to happen. So you can carve out space in DRAM for the tags, at the cost of increased latency (two accesses) although this can be mitigated with more complex cache structures. Or you can use the ECC bits, at the cost of decreased resolution of single-bit errors.
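The carve-out option can be sketched as a simple address mapping. The layout below is an assumption for illustration (a tag table placed directly above the data region, with tag bits packed eight to a byte), and `CarveOutDram` and `tag_location` are hypothetical names, not anything from Arm's design. The access counter makes the latency cost visible: with no tag cache, every capability load touches DRAM twice.

```python
GRANULE_BYTES = 16  # one tag bit per 16-byte capability-sized granule


def tag_location(phys_addr, tag_table_base):
    """Map a data address to the (byte address, bit index) of its tag."""
    granule = phys_addr // GRANULE_BYTES
    return tag_table_base + granule // 8, granule % 8


class CarveOutDram:
    """Toy DRAM with a tag table carved out above the data (assumed layout)."""

    def __init__(self, data_bytes):
        self.tag_table_base = data_bytes
        self.cells = bytearray(data_bytes + data_bytes // 128)
        self.accesses = 0            # counts DRAM reads

    def _read(self, addr, size):
        self.accesses += 1
        return bytes(self.cells[addr:addr + size])

    def set_tag(self, addr, valid):
        byte_addr, bit = tag_location(addr, self.tag_table_base)
        if valid:
            self.cells[byte_addr] |= 1 << bit
        else:
            self.cells[byte_addr] &= ~(1 << bit) & 0xFF

    def load_capability(self, addr):
        data = self._read(addr, GRANULE_BYTES)                    # access 1: data
        byte_addr, bit = tag_location(addr, self.tag_table_base)
        tag = (self._read(byte_addr, 1)[0] >> bit) & 1            # access 2: tag
        return data, bool(tag)


dram = CarveOutDram(4096)
dram.set_tag(0x90, True)                 # granule 9: tag byte 1, bit 1
data, tag = dram.load_capability(0x90)
_, neighbor_tag = dram.load_capability(0xA0)
```

A tag cache in front of this table would let one fetched tag byte serve eight neighboring granules, which is the kind of "more complex cache structure" that hides the second access.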

One of Richard's last slides was the one below: what feedback does he (well, the whole team) want to get from Morello:

But Arm is a commercial microprocessor IP company, so the ultimate question of all is:

what to put in future arm processors
