Cache Coherency Is the New Normal

12 Oct 2016 • 6 minute read

You hear a lot about cache coherency these days. In fact, at the recent Linley processor conference, no fewer than three companies announced new cache-coherent networks-on-chip (NoCs).

Caching

The first cache I ever ran into was on a computer at Cambridge University called Titan. It had a 32-word instruction cache, indexed off the lower five bits of the PC. It was a normal direct-mapped cache. If the higher-order bits of the PC (bit 5 and above) matched the tag stored with the cache entry, then instead of fetching the instruction from memory it was pulled from the cache. Of course, this was much faster; that is the point of caches. If the higher-order bits didn't match (a cache miss), the instruction was fetched from memory and the cache was updated. These days, when three-level caches are common and cache sizes are measured in megabytes, this seems almost comically small. Would such a tiny cache make any difference? It turns out, when you think about it, that the architecture of the cache means that any loop of 32 instructions or fewer will run entirely out of the cache after its first pass. Since processors spend a lot of time in small loops, especially if they lack instructions for clearing or copying areas of memory, this made a big difference.
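To make the small-loop argument concrete, here is a minimal Python sketch of a 32-entry direct-mapped instruction cache. It is a toy model, not a description of Titan's actual hardware: the class name, the fake instruction memory, and the loop addresses are all assumptions chosen for illustration.

```python
class DirectMappedICache:
    """Toy 32-word direct-mapped instruction cache (illustrative only)."""

    def __init__(self, memory, index_bits=5):
        self.memory = memory
        self.index_bits = index_bits
        self.size = 1 << index_bits          # 2**5 = 32 entries
        self.tags = [None] * self.size       # higher-order PC bits per entry
        self.words = [None] * self.size      # cached instruction words
        self.hits = 0
        self.misses = 0

    def fetch(self, pc):
        index = pc & (self.size - 1)         # lower five bits select the entry
        tag = pc >> self.index_bits          # remaining bits identify the address
        if self.tags[index] == tag:
            self.hits += 1
            return self.words[index]
        # Cache miss: go to memory and update the entry.
        self.misses += 1
        word = self.memory[pc]
        self.tags[index] = tag
        self.words[index] = word
        return word


def run_loop(cache, start, length, iterations):
    """Fetch a loop body of `length` instructions `iterations` times."""
    for _ in range(iterations):
        for pc in range(start, start + length):
            cache.fetch(pc)
    return cache.hits, cache.misses


memory = list(range(1024))                   # hypothetical instruction words

small = DirectMappedICache(memory)
small_hits, small_misses = run_loop(small, start=100, length=20, iterations=10)

large = DirectMappedICache(memory)
large_hits, large_misses = run_loop(large, start=100, length=40, iterations=10)

print("20-instruction loop:", small_hits, "hits,", small_misses, "misses")
print("40-instruction loop:", large_hits, "hits,", large_misses, "misses")
```

In this sketch the 20-instruction loop misses only on its first pass (20 misses, 180 hits). The 40-instruction loop keeps evicting its own instructions wherever two addresses share the same lower five bits, so it ends up with 184 misses and 216 hits.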

Another key thing to note is that the programmers don't have to do anything. If the cache is turned on, then the code will run unchanged, just faster. It is invisible to the programmers. The hardware designers worry about the cache, but they give the illusion to the software engineers that it doesn't exist.

Avoiding Cache Incoherency

Before talking about cache coherency, it is worth digressing to cache incoherency. In a modern multicore SoC with perhaps four or eight cores, each processor (usually) has its own L1 cache, there is perhaps a common L2 cache, and then off-chip DRAM. The problem comes when one core has the data at a particular address in its L1 cache, and another core writes a new value to the same address in its own L1 cache and (perhaps) writes it out to the L2 cache (called write-through). Without special precautions, the first core is going to read the wrong value: it doesn't know that the latest (correct) value is in the L2 cache, and the copy of the data in its own L1 cache is now stale.
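The stale-read scenario can be reproduced in a few lines of Python. This is a deliberately naive sketch: the IncoherentL1 class, the dictionary standing in for the shared L2, and the address 0x40 are all assumptions made up for the example, and no real cache works on whole Python dictionaries.

```python
class IncoherentL1:
    """Private write-through L1 cache with no coherency mechanism."""

    def __init__(self, l2):
        self.l2 = l2          # shared L2, modeled as a plain dict
        self.lines = {}       # this core's private copies

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.l2[addr]   # miss: fill from L2
        return self.lines[addr]                # hit: may be stale

    def write(self, addr, value):
        self.lines[addr] = value
        self.l2[addr] = value                  # write-through to L2


shared_l2 = {0x40: 1}
core_a = IncoherentL1(shared_l2)
core_b = IncoherentL1(shared_l2)

first_read = core_a.read(0x40)    # core A caches the value 1
core_b.write(0x40, 2)             # core B updates its L1 and the L2
stale_read = core_a.read(0x40)    # core A still sees its old copy

print("L2 holds", shared_l2[0x40], "but core A reads", stale_read)
```

After core B's write, the L2 holds 2 while core A keeps reading 1 from its own L1. Nothing in the model tells core A that its copy is out of date.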

To avoid this, a procedure called "bus snooping" can be used. Each cache listens to the L2 memory bus and when, in the above example, the first cache notices the value being written by the second core, it invalidates the value in its own cache, forcing the correct value to be fetched from memory the next time it is accessed. Alternatively, a more sophisticated snooping scheme can be used, so that when one cache detects a read to the shared cache and it holds the latest value, it supplies that value directly. This can be faster since it avoids accessing the L2 cache unnecessarily when the correct value is already in one of the L1 caches, and it doesn't require immediate write-through of every changed value.
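The simpler of the two schemes, invalidating on a snooped write, might look like the following Python sketch. The SnoopBus and SnoopingL1 names are hypothetical and the model is heavily simplified: real protocols such as MESI track a state per cache line, and the cache-to-cache transfer variant described above is not modeled here.

```python
class SnoopBus:
    """Shared bus in front of the L2; every attached cache sees every write."""

    def __init__(self, l2):
        self.l2 = l2
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def read(self, addr):
        return self.l2[addr]

    def write(self, source, addr, value):
        self.l2[addr] = value
        for cache in self.caches:
            if cache is not source:
                cache.snoop_write(addr)    # let the other caches react


class SnoopingL1:
    """Write-through L1 that invalidates its copy when it snoops a write."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}
        self.invalidations = 0
        bus.attach(self)

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.bus.read(addr)   # miss: refill from L2
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.write(self, addr, value)

    def snoop_write(self, addr):
        if addr in self.lines:
            del self.lines[addr]           # drop the now-stale copy
            self.invalidations += 1


bus = SnoopBus({0x40: 1})
snoop_a = SnoopingL1(bus)
snoop_b = SnoopingL1(bus)

before = snoop_a.read(0x40)    # core A caches the value 1
snoop_b.write(0x40, 2)         # core A snoops this write and invalidates
after = snoop_a.read(0x40)     # miss, so the fresh value comes from L2

print("core A read", before, "then", after)
```

Here core A's second read misses because its line was invalidated, so it picks up the new value 2 from the L2. The software running on either core did nothing special, which is the whole point.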

With a lot of cores and caches, clearly this gets complex very fast. The CPU-centric view of modern SoCs is outdated anyway. In fact, recently a group of companies including AMD, ARM, Huawei, Qualcomm, Xilinx, and Mellanox announced the Cache Coherent Interconnect for Accelerators, which they abbreviate to CCIX. But after announcing the initiative, they went quiet and it is not clear exactly what they are doing. Obviously there are some big names missing too, such as Samsung and Intel.

Forcing the software programmers to manage complicated cache flush algorithms and generally having to be aware of the cache hierarchy goes against the basic idea of a cache, which is that the hardware runs faster but the caching is invisible to the programmer. The solution is to add cache coherence to the underlying system architecture. In practice, this has to be done using a NoC that automates the process. Assembling an on-chip cache-coherent bus structure by hand is simply too hard, and will lead to weird hard-to-track-down errors when coherency is violated. Don't believe me? The last cache-coherent NoC architecture that ARM came up with was run through JasperGold technology to make sure there were no corner cases that were mishandled. There were. You probably are less skilled at designing cache-coherent NoCs than ARM, and they were on their second generation.

And at the Linley processor conference recently, Jeff Defilippi of ARM announced the company's third generation. If you didn't believe that cache-coherent NoCs are important, then Netspeed and Arteris also announced new cache-coherent NoC architectures at the same conference, less than an hour later.

ARM

ARM actually announced two products: the CoreLink CMN-600 Coherent Mesh Network and the CoreLink DMC-620 Dynamic Memory Controller. They say that this gives 5X more throughput (terabit/second bandwidth), the fastest route to DDR4 memory (50% latency reduction), and support for up to 128 CPUs and 32 I/O-coherent subsystems. See the block diagram below for how the pieces fit together. The memory controller has TrustZone built in and support for 3D stacked DRAM (where it can deliver 1TB per channel).

The interconnect is designed to scale from IoT edge nodes all the way up to cloud datacenter servers.

Netspeed

Next up was Anush Mohandass of Netspeed, whose title I stole for this blog post. The first part of his presentation was about why cache coherency is important, but we'll take that as given. Their new NoC is called Netspeed Gemini III. One interesting aspect is that they use machine learning to optimize the network topology. This could just be a buzzword, since anything to do with machine learning is hot right now, or it could be a new optimization approach. Anush's last slide claimed to be "one generation ahead of the competition", but with everyone except Sonics announcing a new product that day, I think everyone is one generation ahead of where everyone else was the day before.

Arteris

The final new NoC was from Arteris. I have to admit that when Arteris Mark I was sold to Qualcomm, who took the engineering team leaving just the business side of things, I thought it unlikely that Arteris Mark II would do much more than service the existing non-Qualcomm customer base. But they seem to have fired up a new engineering team and are producing new products. Mathew Mangan presented "Implementing Cache-Coherent Hardware Acceleration for ADAS and Machine Learning". Did I say machine learning was hot? Their product is called Ncore Cache Coherent Interconnect IP.

One feature is the idea of a proxy cache, which allows some level of caching for processing elements that have no cache of their own. It reduces latency, bandwidth, and power, since not all communication has to go through off-chip DRAM.

So over the decades, cache has gone from 32 words to a cache-coherent NoC linking dozens of processing elements.
