At the recent HOT CHIPS conference, Scott Johnson of Google talked about some of the security challenges that Google faces. There have been stories about hackers inserting malware into the supply chain. Given the stories about the NSA intercepting Cisco router shipments and adding trojan loggers, this is not pure paranoia. As Scott put it, "how do we even know it is our equipment?" The solution is to tag and verify every device. Cloud companies like Google have numbers of servers measured in the millions, so you can't just go around and check them all visually.
The next problem is verifying the boot chain. When a server (or even a smartphone) is powered on, it first runs what is called the primary bootstrap, usually out of ROM (which can't be changed). Its function is to find the real bootstrap, sometimes called the secondary bootstrap or the bootloader. This performs various checks, then finds the real code for the operating system and transfers control to it. Google worries about whether the bootloader is truly their code, and then whether the operating system code is truly Google's operating system. Remember, Google is not worried about some teenager in their basement; they are worried about national organizations and organized crime. The solution is to sign and verify all boot code.
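To make the chain-of-trust structure concrete, here is a toy sketch. Real verified boot uses public-key signatures over each stage's image; this simplified version has each stage carry just a hash of the next stage, which still captures the "each stage verifies the next before jumping to it" idea. All of the image names are hypothetical.

```python
import hashlib

def digest(image: bytes) -> str:
    return hashlib.sha256(image).hexdigest()

# Hypothetical stage images (stand-ins for real firmware binaries).
bootloader_image = b"secondary-bootstrap"
os_image = b"google-operating-system"

# The immutable ROM stage carries the expected measurement of the
# bootloader; the bootloader carries the expected measurement of the OS.
ROM_EXPECTED = digest(bootloader_image)
LOADER_EXPECTED = digest(os_image)

def boot(loader: bytes, kernel: bytes) -> str:
    # Stage 1: primary bootstrap (ROM) verifies the bootloader.
    if digest(loader) != ROM_EXPECTED:
        return "halt: bootloader failed verification"
    # Stage 2: bootloader verifies the OS before transferring control.
    if digest(kernel) != LOADER_EXPECTED:
        return "halt: OS failed verification"
    return "booted"

print(boot(bootloader_image, os_image))    # booted
print(boot(b"tampered", os_image))         # halt: bootloader failed verification
```

Because the ROM stage is immutable, an attacker who replaces the bootloader or OS image changes its hash and the chain refuses to proceed.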
They rapidly came to the conclusion that they need a silicon root of trust; building on that, they can move up to the datacenter hardware, then to the software infrastructure (operating system etc.), and then up to the cloud software. They wanted this to have four important properties:
So they decided to create a chip to do this. In turn, the above requirements led to a set of requirements for the chip itself:
The chip they built is called Titan. It sits low down in the system hierarchy, as you can see from the above diagram. Titan is a secure, low-power microcontroller designed with cloud security as a first-class consideration. But it is more than just a chip: it also involves a supporting system and security architecture, and a secure manufacturing flow.
Their motivation for doing their own chip was partially that there wasn't anything existing they could use. But they also wanted complete ownership and auditability, and to build up local expertise in the area rather than depend on third-party security experts. Also, new attack vectors arrive all the time, so they wanted agility and velocity. If it is their chip, they can respond faster.
The above diagram shows the architecture of the chip.
The blue boxes are the processor and memory: a 32-bit microcontroller core, boot ROM, flash for instructions and data, an SRAM scratchpad, and one-time programmable (OTP) fuses (more about these later).
The green boxes contain cryptographic acceleration, key management and storage, and a true random number generator (TRNG), along with the usual mix of peripherals.
The red boxes are physical defenses, live status checking, and hardware security alert response.
Let's take a look under the hood.
The verified boot progresses as follows, with each stage verifying the next. There are duplicate copies of the flash code so that it can be updated live, and the system is still in good shape if it fails during the update. Code signing is taken seriously, and though the details were beyond the scope of this talk, Scott said that there are multiple key holders, offline logs, and playbooks for who can do what, and when.
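The duplicate-flash idea can be sketched as A/B slots: boot from the active copy, and fall back to the other copy if the active image fails verification. The slot names and layout here are hypothetical, not Titan's actual flash organization.

```python
import hashlib

KNOWN_GOOD = hashlib.sha256(b"titan-firmware").hexdigest()

def verify(image: bytes) -> bool:
    return hashlib.sha256(image).hexdigest() == KNOWN_GOOD

def select_slot(slot_a: bytes, slot_b: bytes, active: str) -> str:
    """Try the active flash copy first; fall back to the duplicate if the
    active image fails verification (e.g. power was lost mid-update)."""
    slots = {"A": slot_a, "B": slot_b}
    for name in [active] + [s for s in slots if s != active]:
        if verify(slots[name]):
            return name
    return "halt"  # neither copy verifies: refuse to boot

good = b"titan-firmware"
# An update to slot B was interrupted; the chip still boots from slot A.
print(select_slot(good, b"partially-written", active="B"))   # A
```

The key property is that an update only ever overwrites the inactive slot, so a verified good image always remains available.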
The boot works like this:
Trust is established at manufacturing. Each tested device is uniquely identified with an assigned serial number (unique but not secret), and it then generates its own cryptographically strong identity key. This is done using multiple silicon technologies (ROM, fuse, flash, logic), all of which need to be defeated to compromise the chip. This identity is registered in an off-site secure database. Parts are shipped and then put on datacenter devices for production. They are then available for attestation, proof that the servers are Google's. The boot ROM is locked down at tapeout, so it has to be small and bug-free, since there is no way to change it.
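Attestation can be sketched as a challenge-response protocol between the registration database and the device. This toy version uses a symmetric HMAC key as the identity (production schemes typically use per-device asymmetric keys); the class and field names are invented for illustration.

```python
import hmac, hashlib, secrets

class TitanDevice:
    """Toy model of a part that generates its own identity key on-chip."""
    def __init__(self, serial: str):
        self.serial = serial                            # unique but not secret
        self._identity_key = secrets.token_bytes(32)    # never leaves the chip

    def register(self, db: dict) -> None:
        # At manufacturing: identity is recorded in the off-site database.
        db[self.serial] = self._identity_key

    def attest(self, challenge: bytes) -> bytes:
        return hmac.new(self._identity_key, challenge, hashlib.sha256).digest()

def verify_device(db: dict, serial: str, device: TitanDevice) -> bool:
    """Later, in the datacenter: prove the part is a genuine registered one."""
    challenge = secrets.token_bytes(16)   # fresh nonce, so replays don't work
    expected = hmac.new(db[serial], challenge, hashlib.sha256).digest()
    return hmac.compare_digest(device.attest(challenge), expected)

registry = {}
dev = TitanDevice("SN-000123")
dev.register(registry)
print(verify_device(registry, dev.serial, dev))   # True
```

A counterfeit part can copy the (non-secret) serial number, but without the registered identity key it cannot answer the challenge.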
After manufacturing, there is a continuing need to guarantee authenticity, so Titan is in one of six states, and moves irreversibly from one to the next by blowing OTP fuses. The six states are:
The above diagram shows the fuses used for each state. Note that, due to the choice of fuses, a given chip can only move from left to right, and a development chip (for playing with in the lab) can never be enabled for production.
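The irreversibility follows from the physics of OTP fuses: a fuse can only ever go from unblown to blown, so state bits can only be set, never cleared. The sketch below makes up fuse names and a lockout rule to show how a development chip can be permanently excluded from production; Titan's actual fuse assignments and encoding differ.

```python
# Hypothetical fuse bits; Titan's real assignments are not public here.
TESTED, DEV, PROD = 1, 2, 4

class FuseBank:
    def __init__(self):
        self.bits = 0

    def blow(self, fuse: int) -> None:
        self.bits |= fuse   # OR only: a blown fuse can never be cleared

    def production_ok(self) -> bool:
        # Production mode requires the production fuse and forbids the DEV
        # fuse; since fuses only ever get set, a chip that once entered the
        # lab path is permanently locked out of production.
        return bool(self.bits & PROD) and not (self.bits & DEV)

lab_chip = FuseBank()
lab_chip.blow(TESTED)
lab_chip.blow(DEV)       # chip enters the development path
lab_chip.blow(PROD)      # even blowing the production fuse later...
print(lab_chip.production_ok())   # False: the DEV fuse is set forever
```

Because there is no operation that clears a bit, the state machine is monotonic by construction, not by software policy.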
Scott admitted that some of this is overkill for a datacenter that is already protected by armed guards. If you manage to get into a datacenter, you are probably not going to use lasers to attack the Titan chips, but they wanted to learn what it would take, and in the future Titan or similar chips might be used in less secure environments such as smartphones.
In the event that tampering is detected, Titan responds with one of: an interrupt, a non-maskable interrupt, freezing the system, or a full system reset.
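As a toy dispatch over those four responses (the event names and the escalation policy here are invented for illustration; the talk did not say which events map to which response):

```python
from enum import Enum

class Response(Enum):
    INTERRUPT = "interrupt"
    NMI = "non-maskable interrupt"
    FREEZE = "freeze system"
    RESET = "full system reset"

# Hypothetical mapping from tamper event to response severity.
POLICY = {
    "voltage_glitch": Response.INTERRUPT,
    "clock_tamper": Response.NMI,
    "shield_breach": Response.FREEZE,
    "key_store_tamper": Response.RESET,
}

def respond(event: str) -> Response:
    # Anything unrecognized gets the most severe response.
    return POLICY.get(event, Response.RESET)

print(respond("shield_breach").value)   # freeze system
```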
Titan as described is proprietary to Google, but the basic security mechanisms and the digital implementation are commodities, and good candidates for open-sourcing. So Google is moving towards an open, transparent implementation of a secure root-of-trust, built around a RISC-V processor. It could be implemented in "any" technology, with standard-cells, memories, I/Os etc provided either open source or by the foundry, along with foundry specific blocks such as OTP and flash. Some of the blocks, such the TRNG, require more than digital logic and would depend on an analog implementation (with a digital wrapper). Those blocks have dotted red lines around the blocks in the above diagram. In fact, Google has set up the Silicon Transparency Working Group along with lowRISC, and ETHZurich to drive this project. Eventually, this will be open to anyone (some time next year, probably).
Sign up for Sunday Brunch, the weekly Breakfast Bytes email.