As you may know, I don’t come from a highly technical background—most of what I know about the semiconductor industry is what I have picked up on the job. But I watched a couple of Whiteboard Wednesdays by Tom Hackett, published in April of 2017, that explained what CCIX is, and why it’s important. And to do this, he had to go all the way back to the beginning of computing in the 1940s—and in doing so, he answered some questions I have had about computing, in general.
In 1945, the mathematician and scientist John von Neumann came up with the architecture that computers still use today. It consists of a processing element (the CPU), a memory element that stores program code and data, and some input/output (I/O) devices.
Figure 1: Von Neumann architecture
This was a fundamental and important development because the program code was stored in the memory along with the data: the CPU had to go to the memory element to access the code. Simple as that. This architecture jump-started the computer age and worked well through the ‘40s and ‘50s, running on vacuum tubes (before transistors or integrated circuits).
But as we got into the ‘60s and ‘70s, more and more applications were being run on computers, requiring more and more memory to hold those applications and their data, and a bottleneck developed between the CPU and the memory. This was the memory bottleneck, or the code bottleneck; it was the program code that caused the problem, since it resided in the memory.
The way to fix that was to give the CPU a little memory of its own. This is the cache, where some of the program code and data can reside inside the CPU, letting the processor operate on that content without going off-chip to the main memory. Around this time, the processing unit in the CPU also began to be referred to as the “core”.
Figure 2: Adding the cache and the core
This architecture was the solution to the code bottleneck, and worked well through the ‘70s, ‘80s, and ‘90s. Moore’s Law was fueling the growth of computing, and circuits were getting denser and the CPUs were getting faster, being measured from kilohertz to megahertz to gigahertz.
Come the year 2000, however, that growth came to an end. At that point, processor clock speeds topped out at 2-4GHz, depending on the application. The problem was a thermal one: if clocked any faster, the device generated too much heat. This created another bottleneck: the core bottleneck.
In about 2005, Intel came up with a solution: they added another processing core into the CPU. This dual-core system could process applications almost twice as fast at the same clock frequency. Each core needed its own cache, of course, to avoid the code bottleneck.
The problem comes when both cores need the same data at the same time. Say core 1 has just changed a data value, and core 2 now needs that same data. Core 2 can’t rely on its own cache, or even on main memory, because those copies are now stale; it must get the value from core 1. So, you can either add some extra software to manage that, or you can take a hardware approach.
Figure 3: The hardware approach
The hardware approach is to add another piece of hardware, called the cache coherent interconnect (CCI). This hardware manages the cache coherency and keeps the entire memory system organized.
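To make the coherency problem concrete, here is a toy sketch in Python of what a CCI has to do: when one core writes a value, every other core’s cached copy of that address must be invalidated so no one reads stale data. The class and method names are hypothetical, and a real CCI is hardware running a protocol such as MESI, not a dictionary of dictionaries; this is only meant to show the write-invalidate idea.

```python
# Toy model of the coherency problem a CCI solves. All names here are
# made up for illustration; real interconnects implement hardware
# protocols (e.g., MESI), not Python dictionaries.

class ToyCCI:
    """Tracks which cores cache each address and invalidates stale copies."""

    def __init__(self, memory):
        self.memory = memory   # main memory: addr -> value
        self.caches = {}       # core_id -> {addr: value}

    def attach(self, core_id):
        self.caches[core_id] = {}

    def read(self, core_id, addr):
        cache = self.caches[core_id]
        if addr not in cache:              # cache miss: fetch from memory
            cache[addr] = self.memory[addr]
        return cache[addr]

    def write(self, core_id, addr, value):
        # Write-invalidate: remove every other core's copy first.
        for other_id, other_cache in self.caches.items():
            if other_id != core_id:
                other_cache.pop(addr, None)
        self.caches[core_id][addr] = value
        self.memory[addr] = value          # write-through, for simplicity

cci = ToyCCI({0x100: 1})
cci.attach("core1")
cci.attach("core2")
cci.read("core2", 0x100)         # core 2 caches the old value
cci.write("core1", 0x100, 42)    # core 1 updates it; core 2's copy is invalidated
print(cci.read("core2", 0x100))  # core 2 re-fetches and sees the new value
```

Without the invalidation step in `write`, core 2’s final read would return its stale cached copy instead of core 1’s update, which is exactly the situation described above.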
In the last five years or so, however, this architecture has started to reach its limits. One solution is to add more CPU chips that follow the same layout and tie them into the system.
Figure 4: The CPU bottleneck
But now that everything has its own cache distributed across different processor chips, how do we manage this cache coherency problem? This is the CPU bottleneck.
Just as it was optimal to use a CCI inside one chip, it is also optimal to do this between chips. This lets us add more processing power by adding more processor chips.
Figure 5: Cache Coherent Interconnect (CCI)
But we’re not done yet. What about all the data that we must have access to, now that we can stream videos and use AI?
To no one’s surprise, there has been an exponential increase in the amount of data flowing through the internet. While this increase was initially driven by the proliferation of mobile devices (streaming video, etc.), today it is driven by things like deep learning (DL) and artificial intelligence (AI) applications. But remember that we’re still limited by our processor speed (2-4GHz). And because all of this is taking place on servers in the cloud, we can call this the cloud bottleneck.
Hardware designers had to look at the kinds of data being processed by the CPUs. It turns out that certain classes of data and processing jobs can be singled out and handled by special-purpose hardware. The designers added extra hardware components, called accelerators, to take on this kind of processing. This could include security-related tasks that require a lot of computation, AI applications, certain network functions… the list goes on. And the accelerators all have caches of their own.
Figure 6: Accelerators in the mix
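The idea of routing each class of job to the hardware best suited for it can be sketched as a simple dispatch table. The function names and job types below are entirely hypothetical; real systems do this routing in drivers and schedulers, not a Python dictionary.

```python
# Illustrative only: route job classes to special-purpose "hardware",
# falling back to the general-purpose CPU. All names are made up.

def run_on_cpu(job):
    return f"CPU ran {job}"

def run_on_crypto_engine(job):
    return f"crypto accelerator ran {job}"

def run_on_ai_engine(job):
    return f"AI accelerator ran {job}"

# Each job class maps to the hardware best suited for it.
DISPATCH = {
    "encrypt": run_on_crypto_engine,
    "inference": run_on_ai_engine,
}

def dispatch(job_type, job):
    handler = DISPATCH.get(job_type, run_on_cpu)  # default: general CPU
    return handler(job)

print(dispatch("encrypt", "TLS handshake"))  # routed to the crypto engine
print(dispatch("parse", "log file"))         # no accelerator fits: CPU handles it
```

The payoff is that the regular, compute-heavy jobs leave the general-purpose cores free for everything else, which is exactly why accelerators ease the cloud bottleneck.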
With this approach, we can now manage all this new data. But how do we tie it into the system?
One option is to think of these accelerators as I/O devices and connect them into the I/O channel… but there’s a problem.
Figure 7: Accelerators as I/O
Just as the cores were competing for memory, now the accelerators compete for memory along with the CPUs, creating another bottleneck! But there is a better solution: if we extend the CCI and connect it to the new elements, we can get past this cloud bottleneck.
Figure 8: Extending the CCI to the accelerators
This brings us to the final bottleneck: the company bottleneck. It’s not an architectural issue, it’s a practical one. It’s hard for any one company to supply all these elements, and existing CCIs are all proprietary. We need a CCI that multiple companies can use.
This is where CCIX, pronounced “see-six” (not “209” in Roman numerals, or even “see see eye eks”), comes in. CCIX is a cache coherent interconnect designed specifically for accelerators. It’s an open standard that any company can adopt, standardizing cache management across the components of a computer system.
Figure 9: CCIX in action
But there are millions of servers deployed in the cloud. How are we going to take this to all those servers? The answer is to use existing infrastructure. Another standard, PCI Express (PCIe), already handles most of the I/O in servers. So CCIX uses PCIe at the lower levels to connect everything together, with the same cabling, connectors, and so forth, while CCIX does its job of managing the caches at the upper levels.
Figure 10: CCIX and PCIe work together
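The layering idea can be sketched in a few lines: CCIX messages ride as the payload of PCIe-framed packets, so the link itself is unchanged. The packet fields and function names below are invented for illustration; the real CCIX and PCIe specifications define the actual message and packet formats.

```python
# A loose sketch of protocol layering. The "PCIE" and "SNOOP" framing
# bytes are made up; real PCIe and CCIX formats are defined by their specs.

def pcie_transport(payload: bytes) -> bytes:
    """Stand-in for PCIe's job: frame and carry bytes over the link."""
    header = b"PCIE"            # pretend link-layer framing
    return header + payload

def ccix_snoop_message(addr: int) -> bytes:
    """Stand-in for a CCIX coherency message (e.g., 'invalidate this address')."""
    return b"SNOOP" + addr.to_bytes(8, "big")

# CCIX rides on PCIe: the coherency message is just the payload of a
# PCIe-framed packet, so existing cabling and connectors keep working.
packet = pcie_transport(ccix_snoop_message(0x1000))
print(packet[:4])  # the link still looks like ordinary PCIe framing
```

This is why CCIX can reach servers already in the field: from the cabling’s point of view, nothing changed.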
With this overview, now you can read Paul’s blog CCIX Is Pronounced C6 and the CCIX IP web page and understand what they’re talking about! And if you’re interested in the history of computing, check out Paul’s blog Domain-Specific Computing 1: The Dark Ages of Computer Architecture.
So, keep an eye out for CCIX, until the next bottleneck comes along that we all have to solve!