Cache coherency has become a big issue as the architecture inside devices such as smartphones and datacenter CPUs has grown more complicated. In the old days...which sounds like the start of a fairy tale...there was a single processor and the memory hierarchy was built to service it. There were some wrinkles, like DMA devices, but they could be handled as exceptions since the operating system controlled the devices. When multi-core CPUs became the norm, some cache coherency could be added fairly simply since all the cores were the same, under the control of the same design team, and running the same operating system. When caches were small, a brute-force approach was simply to flush the cache so that a device reading from DRAM would get the correct value and there would be no stale data.
That is no longer the case. Very compute-intensive workloads such as deep learning are often offloaded from the main processors onto specialized hardware constructed out of GPUs, FPGAs, or specialized processors such as the Tensilica Vision P6 DSP. In addition, very-high-speed networking at 50G and 100G (and soon 200G and 400G) means that a typical architecture contains many devices making high-bandwidth accesses to memory in a way that the "main" CPU does not closely control. The legacy I/O and memory architecture is creating a bottleneck to system performance. There is also the potential of a new level in the memory hierarchy, now known as storage class memory, which has many of the attributes of NAND Flash (in particular, being non-volatile) with a performance close to DRAM.
The traditional way to handle the interface from a server core to an accelerator has been through PCI Express (PCIe) and explicitly programming the transfer. But that risks losing many of the advantages of the accelerator.
Cache coherency means that any device can read from and write to memory, and get the correct data, despite the accessing device not knowing where in the cache hierarchy the correct data is. However, now there are more places that the "correct" value of an address may reside, and more devices trying to make accesses, all running at higher and higher data rates. The only practical solution is to add cache coherency to the interconnect so that devices always get the correct value and the interconnect handles the "make it so" aspects. However, proprietary standards will not do, since the devices, whether chips or IP blocks, come from a range of suppliers and each device needs to interoperate in multiple systems. A proprietary approach brings all the obvious drawbacks of a closed business model. What is required is an open standard.
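To make the stale-data problem concrete, here is a minimal sketch in Python. It is a toy model, not the CCIX protocol: the `ToyInterconnect` class, the device names, and the write-through, write-invalidate policy are all assumptions chosen for illustration. It shows an accelerator reading an old value when the interconnect does nothing, and the correct value when the interconnect invalidates other caches on a write.

```python
class ToyInterconnect:
    """Toy model (hypothetical): per-device caches in front of one memory."""

    def __init__(self, coherent):
        self.coherent = coherent  # True: invalidate other caches on every write
        self.memory = {}          # address -> value
        self.caches = {}          # device name -> {address: value}

    def attach(self, device):
        self.caches[device] = {}

    def read(self, device, addr):
        cache = self.caches[device]
        if addr not in cache:
            # Cache miss: fetch the value from memory and keep a local copy.
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, device, addr, value):
        # Write-through: update the writer's cache and memory together.
        self.caches[device][addr] = value
        self.memory[addr] = value
        if self.coherent:
            # The interconnect handles the "make it so" part:
            # every other cached copy of this address is dropped.
            for other, cache in self.caches.items():
                if other != device:
                    cache.pop(addr, None)


def demo(coherent):
    bus = ToyInterconnect(coherent)
    bus.attach("cpu")
    bus.attach("accelerator")
    bus.write("cpu", 0x100, 1)
    bus.read("accelerator", 0x100)         # accelerator now caches the value 1
    bus.write("cpu", 0x100, 2)             # cpu updates the same address
    return bus.read("accelerator", 0x100)  # what does the accelerator see?


print(demo(coherent=False))  # 1: the accelerator reads its stale copy
print(demo(coherent=True))   # 2: the stale copy was invalidated
```

Real coherent interconnects track line states and ownership rather than simply dropping copies, but the observable guarantee is the same: the reader does not need to know where the current value lives.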
Last year, the Cache Coherent Interconnect for Accelerators was announced, usually known by its sort-of initials, CCIX, and usually pronounced "C6". The founding members of CCIX (from May last year) include AMD, ARM, Huawei, IBM, Mellanox, Qualcomm, and Xilinx. Among the second wave of members (in October last year) are Cadence, Arteris, Broadcom, Cavium, IDT, Keysight, Micron, NetSpeed, Red Hat, Synopsys, Teledyne LeCroy, TI, and TSMC. There are probably more now, but even with that list you can see that there is broad support across the industry, with some obvious exceptions such as Intel and NVIDIA. Intel's absence is not entirely surprising, since their roadmap is to integrate accelerators inside the package and use their Altera FPGA fabric to do so. NVIDIA has their own NVLink technology, although everyone assumes that if CCIX takes off they will support it, since otherwise using NVIDIA GPUs for offload processing would become a challenge.
The basic idea of CCIX is to allow processors based on different instruction set architectures (ISAs) to extend their cache coherency to accelerators, interconnect, and I/O. These highly capable accelerators become a key component in the processor system. This means that the system designer can pick and choose components from multiple vendors and put them together, and reading and writing data should be handled correctly, with all the performance advantages that come from multi-level caches.
The implementation piggybacks on the PCIe standard. It has the same PHY and the same data-link layer, but replaces the transaction layer. The PHY is one of the harder and more complex parts of the interface to develop, since it depends closely on aspects of the underlying semiconductor process. The plan is to support 25Gbps with a fallback to PCIe speed of 16Gbps.
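As a rough illustration of the layering and the speed numbers above, the sketch below lists the two stacks as described in this post and computes raw signaling rates. The layer labels, the function names, and the 16-lane link width are assumptions for the example, and the figures ignore line-encoding and protocol overhead, so they are upper bounds rather than usable bandwidth.

```python
# Layer stacks as described above, top to bottom (labels are illustrative).
PCIE_STACK = ["transaction layer (PCIe)", "data-link layer", "PHY"]
CCIX_STACK = ["transaction layer (CCIX)", "data-link layer", "PHY"]


def shared_layers(stack_a, stack_b):
    """Return the layers that appear unchanged in both stacks."""
    return [layer for layer in stack_a if layer in stack_b]


def raw_link_gbps(per_lane_gbps, lanes):
    """Raw signaling rate of a link, before any encoding overhead."""
    return per_lane_gbps * lanes


CCIX_LANE_GBPS = 25  # planned CCIX rate per lane (from the text)
PCIE_LANE_GBPS = 16  # PCIe fallback rate per lane (from the text)
LANES = 16           # assumed link width, purely for illustration

print(shared_layers(PCIE_STACK, CCIX_STACK))
print(raw_link_gbps(CCIX_LANE_GBPS, LANES), "Gbps raw at the CCIX rate")
print(raw_link_gbps(PCIE_LANE_GBPS, LANES), "Gbps raw at the fallback rate")
print(CCIX_LANE_GBPS / PCIE_LANE_GBPS, "x speedup per lane")
```

Per lane, the planned rate is a little over one and a half times the fallback rate, which is why reusing the PHY while raising its speed is attractive.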
At the start of this month, April 2017, the CCIX Consortium was started. Everyone is moving over to the new organization, new members are joining, and there is a new website. According to Millind Mittal of Xilinx speaking at CDNLive, there are multiple designs in flight. He also revealed the reason for the slightly odd name. CCIA was too much of a mouthful and didn't abbreviate well, so they switched it to CCIX so it would abbreviate to C6 in conversation, giving me a title for this post. As President Obama discovered when he mispronounced "corpsman", people care about these details. If you want to sound knowledgeable, "C6" is how you must say it.
The first of two Whiteboard Wednesday videos on CCIX was published yesterday. Watch for the second one next week.