
Paul McLellan
parakernel
networking
nic

I/O Is Faster than the CPU—What Now?

22 May 2019 • 5 minute read

At his keynote at CDNLive Silicon Valley, Andy Bechtolsheim made a throwaway remark that 1600G Ethernet would be a problem since "the packet rate is just 333 picoseconds so that needs wide Ethernet ports" (see my post Andy Bechtolsheim: 85 Slides in 25 Minutes, Even the Keynote Went at 400Gbps). At the time I wondered if I'd heard correctly. After all, a CPU cycle is about 300 picoseconds, and a DRAM access is 70-100 nanoseconds, or roughly 200-300 times as long (see my post Numbers Everyone Should Know). The keynote moved on and I didn't think much more about it.

Then, on the plane to CDNLive Munich, I happened to come across the paper I/O Is Faster than the CPU – Let’s Partition Resources and Eliminate (Most) OS Abstractions. The opening sentence of the abstract puts into clear words the half-baked thought that I'd had in Andy's keynote:

I/O is getting faster in servers that have fast programmable NICs and non-volatile main memory operating close to the speed of DRAM, but single-threaded CPU speeds have stagnated. Applications cannot take advantage of modern hardware capabilities when using interfaces built around abstractions that assume I/O to be slow.

NIC stands for "network interface card," although today it is often a chip or even a block of IP on an SoC. In a desktop PC, it is the type of card that plugs into PCIe to connect to the system bus and provides an Ethernet connection. In a modern server, it is the interface to the type of 400G optical connection that Andy was talking about in his keynote.

As it happens, despite having spent my entire career in the EDA and semiconductor industries, my PhD is in operating systems, in particular, networked operating systems (the title of my thesis was The Design of a Networked File System). So this is an area that I have been interested in for forty years and have, at least to some extent, kept an eye on.

A modern operating system like Linux, Android, or iOS is constructed in a way that has not changed much in those forty years. Some things are fast (like memory), and some things are slow (like disk and network operations). As a result, these resources are managed differently:

  • Memory is fast, so it makes no sense to provide an operating system call to access memory. Instead, the operating system gives processes a share of memory, then the process just accesses memory directly (with load and store instructions in the code). Hardware support, either in the form of virtual memory or in the form of segment registers, ensures that the process cannot access memory outside of its allocated share.
  • I/O is slow, so accessing a file or a network resource requires a call to the operating system, and the operating system then calls a device driver. In fact, the operating system can even do some work to provide nice abstractions, such as making everything look like a file, without worrying about the inefficiency of the interface. (The short sketch after this list contrasts the two paths.)
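
To make the split concrete, here is a minimal C sketch (my illustration, not from the paper): memory handed to a process is touched with ordinary load and store instructions, while even a trivial I/O operation crosses into the kernel through a system call and, below that, a device driver.

    /* Same buffer, two very different paths. */
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        /* Fast path: memory allocated to the process, then accessed
         * with plain stores -- the OS is not involved per access. */
        char *buf = malloc(4096);
        if (!buf) return 1;
        strcpy(buf, "hello, world\n");

        /* Slow path: I/O goes through the kernel and a device driver;
         * every write() is a user/kernel transition. */
        write(STDOUT_FILENO, buf, 13);

        free(buf);
        return 0;
    }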

Three things have changed:

  • Networks have got really fast, with rates of up to 400Gbps. Getting in and out of the operating system, and in and out of the device driver, now takes longer than the gap between successive packets (see the back-of-envelope arithmetic after this list).
  • Rotating disks (HDD) have been replaced with flash-based drives (SSD), and then, sometimes, with storage-class-memory-based drives. Again, these are getting so fast that the operating system overhead is significant.
  • A new level of non-volatile memory has been added to the memory hierarchy (this is still more in the future, but it is coming). It makes no sense to manage this like a disk drive, with sector read and write calls. It needs to be managed more like DRAM, where the process is given access and then simply makes memory accesses directly from the code.
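
The first bullet is easy to check with back-of-envelope arithmetic (my numbers, not the paper's): a minimum-size Ethernet frame is 64 bytes, plus an 8-byte preamble and a 12-byte inter-frame gap, so 84 bytes, or 672 bits, on the wire.

    #include <stdio.h>

    int main(void) {
        const double bits_per_frame = 84.0 * 8.0;   /* 672 bits on the wire */
        const double line_rate_bps  = 400e9;        /* 400 Gbps */
        const double interval_ns = bits_per_frame / line_rate_bps * 1e9;
        printf("one minimum-size frame every %.2f ns\n", interval_ns);  /* ~1.68 ns */
        /* A system call round trip alone typically costs hundreds of
         * nanoseconds, so a per-packet trip through the kernel cannot
         * keep up with the line rate. */
        return 0;
    }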

The solution to this conundrum is for the operating system to handle fast devices in the same way that memory (DRAM today, but core memory in earlier decades) has always been handled. For memory, the operating system allocates regions to processes but provides no abstractions over how they are used: the process simply accesses its memory with the underlying hardware, in whatever way it wants. The hardware is set up to catch an errant process touching memory that was not allocated to it, which is what keeps things safe even though the operating system does not check every memory access individually with a system call.
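
In today's terms, that model looks something like the following sketch (standard POSIX, nothing parakernel-specific): one system call to set up a region, then raw loads and stores, with the MMU rather than the kernel policing the boundaries.

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 4096;
        /* One system call to map a page into this process's address space. */
        char *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED) return 1;

        region[100] = 42;             /* an ordinary store, no OS on this path */
        printf("%d\n", region[100]);

        /* Touching an address outside the mapping would trap in hardware
         * (SIGSEGV); the kernel only gets involved to deliver the fault,
         * never to approve each individual access. */
        munmap(region, len);
        return 0;
    }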

The focus of the paper is on giving processes direct access to receive/transmit queue pairs (RX/TX queues) rather than forcing them to go through the operating system to send or receive packets. There is still a need for a device driver to set up the pairs and to configure the NIC to forward packets to the appropriate processes. Transmission is the easier direction to handle, since it originates in each process, which simply adds packets to its outgoing TX queue; receiving requires incoming packets to be inspected by the hardware and added to the appropriate RX queue. The basic architecture is shown in the diagram in the paper. They call this a parakernel architecture.
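
To give a feel for what that fast path looks like from the application's side, here is a hypothetical sketch. The ring layout and names are my own illustration, not the paper's API, and ordinary heap memory stands in for the descriptor rings a driver would map into the process; the point is simply that sending and receiving become loads and stores, with no system call per packet.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define RING_SIZE 8

    struct pkt_desc { char data[64]; uint32_t len; };

    struct ring {
        struct pkt_desc slots[RING_SIZE];
        uint32_t head, tail;              /* consumer / producer indices */
    };

    int main(void) {
        /* In a parakernel-style system these rings would be set up once by
         * the device driver and shared with the NIC; here they are plain
         * local memory so the example runs anywhere. */
        struct ring rx = {0}, tx = {0};

        /* Pretend the NIC delivered one packet into our private RX ring. */
        memcpy(rx.slots[rx.tail % RING_SIZE].data, "ping", 4);
        rx.slots[rx.tail % RING_SIZE].len = 4;
        rx.tail++;

        /* The application's per-packet path: poll RX, produce into TX,
         * and never enter the operating system. */
        while (rx.head != rx.tail) {
            struct pkt_desc *in  = &rx.slots[rx.head % RING_SIZE];
            struct pkt_desc *out = &tx.slots[tx.tail % RING_SIZE];
            memcpy(out->data, in->data, in->len);   /* echo the payload */
            out->len = in->len;
            tx.tail++;
            rx.head++;
        }

        printf("queued %u packet(s) for transmit\n", (unsigned)tx.tail);
        return 0;
    }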

The conclusion of the paper is worth reading, even if you are not an OS geek like me who is interested enough in the topic to read the whole thing:

The current works to bypass the kernel clearly highlight that OS abstractions are barriers that restrict the I/O performance. These works build on a well-known fact that exporting physical resources to the applications has the potential to solve a plethora of problems plaguing current OS architectures. We present an OS structure called parakernel which partitions the resources that can be partitioned and multiplexes only those resources that are not partitioned. This allows the parakernel to have a small trusted computing base which can be implemented in a high-level language. The parakernel facilitates application level parallelism and complements the thread-per-core design of popular server applications.

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.