MeeraC
16 May 2019

Von Neumann Bottlenecks and CCIX

As you may know, I don’t come from a highly technical background—most of what I know about the semiconductor industry is what I have picked up on the job. But I watched a couple of Whiteboard Wednesdays by Tom Hackett, published in April of 2017, that explained what CCIX is, and why it’s important. And to do this, he had to go all the way back to the beginning of computing in the 1940s—and in doing so, he answered some questions I have had about computing, in general.

Some History

In 1945, a mathematician and scientist, John von Neumann, came up with the architecture for computers that we still use today. It consists of a processing element (CPU), a memory element that stores program code and data, and a couple of input/output (I/O) devices.

Figure 1: von Neumann architecture

This was a fundamental and important development because the program code was stored in the memory along with the data, and the CPU went to the memory element to fetch that code. Simple as that. This architecture jump-started the computer age and worked well through the '40s and '50s, running on vacuum tubes (not transistors or integrated circuits).
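
For the curious, here is a toy Python sketch of a stored-program machine in that spirit. The instruction names are made up for this example, and real machines of the era worked in hardware, not software; the point is only the key idea that the program and its data sit in the same memory, and the processor fetches both from there.

```python
# A toy stored-program machine: instructions and data share one memory,
# and the processing element fetches both from it. (Illustrative only;
# the instruction set here is invented for this sketch.)

memory = [
    ("LOAD", 8),     # address 0: load the value stored at address 8
    ("ADD", 9),      # address 1: add the value stored at address 9
    ("STORE", 10),   # address 2: store the result at address 10
    ("HALT", None),  # address 3: stop
    None, None, None, None,
    2,               # address 8: data
    3,               # address 9: data
    0,               # address 10: the result will be written here
]

def run(memory):
    pc, acc = 0, 0                  # program counter and accumulator
    while True:
        op, addr = memory[pc]       # fetch the next instruction from memory
        pc += 1
        if op == "LOAD":
            acc = memory[addr]
        elif op == "ADD":
            acc += memory[addr]
        elif op == "STORE":
            memory[addr] = acc
        elif op == "HALT":
            return memory

run(memory)
print(memory[10])   # -> 5
```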

The Code Bottleneck

But as we got into the '60s and '70s, more and more applications were being run on computers, requiring more and more memory to hold those applications and their data, and a bottleneck developed between the CPU and the memory. This was the memory bottleneck, or the code bottleneck; it was the program code that caused the problem, since it resided in the memory.

The way to fix that was to give the CPU a little bit of its own memory. This is the cache, where some of the program code and data can reside inside the CPU, so the processor can operate on that content without going off the chip to reach the main memory. Also, the processing unit in the CPU began to be referred to as the "core".

Figure 2: Adding the cache and the core

This architecture was the solution to the code bottleneck, and it worked well through the '70s, '80s, and '90s. Moore's Law was fueling the growth of computing: circuits were getting denser and CPUs were getting faster, with clock speeds climbing from kilohertz to megahertz to gigahertz.
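
If it helps to see the idea in something concrete, here is a tiny Python sketch of what a cache buys you. It's purely illustrative (real hardware caches are managed by the silicon itself, not by software like this), but it shows why keeping recently used data close to the processor saves trips out to main memory.

```python
# A toy sketch of why a cache helps: keep a small copy of recently used data
# close to the processor so repeated accesses don't require a slow trip out
# to main memory. (Illustrative only; real hardware caches work differently.)

MAIN_MEMORY = {addr: addr * 2 for addr in range(1024)}   # stands in for DRAM

cache = {}          # small, fast, on-chip copy of recently used data
slow_accesses = 0   # count how often we had to go off-chip

def read(addr):
    global slow_accesses
    if addr in cache:                 # cache hit: stay on-chip
        return cache[addr]
    slow_accesses += 1                # cache miss: slow trip to main memory
    value = MAIN_MEMORY[addr]
    cache[addr] = value               # keep a copy close by for next time
    return value

for _ in range(1000):
    read(42)                          # the same address, over and over

print(slow_accesses)                  # -> 1: only the first read went off-chip
```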

The Core Bottleneck

Come the year 2000, however, that growth came to an end. At that point, processor cores topped out at around 2-4GHz, depending on the application. The problem was a thermal one: if the core ran any faster, the device generated too much heat. This created another bottleneck: the core bottleneck.

In about 2005, Intel came up with a solution: they added another processing core into the CPU. This dual-core system could process applications almost twice as fast at the same clock frequency. Each core needed its own cache, of course, to avoid the code bottleneck.
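
As a rough illustration of why a second core helps, here is a small Python sketch that farms two independent jobs out to two worker processes. It's only an analogy (worker processes standing in for cores, with no caches involved), but two jobs finish in roughly the time one would take, with no increase in clock speed.

```python
# A loose illustration of the dual-core idea: two independent jobs run side
# by side instead of one after the other. (Toy example using Python worker
# processes, not real cores.)

from concurrent.futures import ProcessPoolExecutor

def busy_work(n):
    # Stands in for an application keeping one core busy.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    jobs = [5_000_000, 5_000_000]
    with ProcessPoolExecutor(max_workers=2) as pool:    # one worker per "core"
        results = list(pool.map(busy_work, jobs))       # both jobs run at once
    print(results)
```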

The problem comes when both cores need the same data at the same time. If core 1 has just changed a data value in its own cache and core 2 then needs that same data, core 2 can't simply go to its own cache or even to main memory, because those copies are stale; it must get the value from the first core. So, you can either add some extra software to manage that, or you can take a hardware approach.

Figure 3: The hardware approach

The hardware approach is to add another piece of hardware, called the cache coherent interconnect (CCI). This hardware manages the cache coherency and keeps the entire memory system organized.
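
Here is a minimal Python sketch of the kind of bookkeeping a cache coherent interconnect handles: when one core writes a value, stale copies in the other cores' caches are invalidated, so the next read fetches the fresh value. This is a toy model of the idea, not any particular protocol that real hardware implements.

```python
# A minimal sketch of invalidation-based cache coherency, loosely in the
# spirit of what a CCI does. Toy model only; real interconnects implement
# detailed protocols (MESI and friends) entirely in hardware.

class CoherentSystem:
    def __init__(self, num_cores):
        self.memory = {}                                # shared main memory
        self.caches = [dict() for _ in range(num_cores)]

    def write(self, core, addr, value):
        self.caches[core][addr] = value                 # update the writer's cache
        self.memory[addr] = value                       # write through to memory
        for other, cache in enumerate(self.caches):
            if other != core:
                cache.pop(addr, None)                   # invalidate stale copies

    def read(self, core, addr):
        if addr in self.caches[core]:
            return self.caches[core][addr]              # hit: local copy is valid
        value = self.memory[addr]                       # miss: fetch the fresh value
        self.caches[core][addr] = value
        return value

system = CoherentSystem(num_cores=2)
system.write(core=0, addr=0x10, value=42)               # core 0 changes the data
print(system.read(core=1, addr=0x10))                   # core 1 sees 42, not stale data
```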

The CPU Bottleneck

In the last five years or so, however, this architecture has started to reach its data limitations. One solution is to add more CPUs that follow the same layout and then tie them into the system.

Figure 4: The CPU bottleneck

But now that every core has its own cache, distributed across different processor chips, how do we manage the cache coherency problem? This is the CPU bottleneck.

Just as it was optimum to use a CCI inside of one chip, it is also optimum to do this between chips. This lets us add more processing power by adding more processor chips.

Figure 5: Cache Coherency Interconnect (CCI)

But we’re not done yet. What about all the data that we must have access to, now that we can stream videos and use AI?

The Cloud Bottleneck

To no one's surprise, there has been an exponential increase in the amount of data flowing through the internet. While this increase was driven initially by the proliferation of mobile devices (streaming videos, etc.), today it is driven by things like deep learning (DL) and artificial intelligence (AI) applications. But remember that we're still limited by our processor speed (2-4GHz). And because all of this is taking place on servers in the cloud, we can call this the cloud bottleneck.

Hardware designers had to look at the kind of data being processed by the CPUs. It turns out that there are certain classes of data and processing jobs that can be singled out and processed by special-purpose hardware. The designers added in some extra hardware components, called accelerators, to handle this kind of processing. This could include security-related tasks that require a lot of computation, AI applications, certain network functions… the list goes on. And they all have caches of their own.

Figure 6: Accelerators in the mix
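
To make the idea of singling out certain job classes concrete, here is a toy Python sketch in which recognized kinds of work are routed to special-purpose "accelerators" and everything else falls back to the general-purpose CPU. The handler names are hypothetical placeholders for this illustration, not real device APIs.

```python
# A rough sketch of the accelerator idea: certain classes of work (crypto,
# AI inference, networking) are routed to dedicated hardware instead of
# running on a general-purpose core. Handler names are made up.

def run_on_cpu(job):
    return f"CPU handled {job['kind']}"

def run_on_crypto_engine(job):
    return f"crypto accelerator handled {job['kind']}"

def run_on_ai_accelerator(job):
    return f"AI accelerator handled {job['kind']}"

ACCELERATORS = {
    "encrypt": run_on_crypto_engine,
    "inference": run_on_ai_accelerator,
}

def dispatch(job):
    # Recognized job classes go to dedicated hardware; everything else
    # falls back to the general-purpose CPU.
    handler = ACCELERATORS.get(job["kind"], run_on_cpu)
    return handler(job)

print(dispatch({"kind": "inference"}))    # -> AI accelerator handled inference
print(dispatch({"kind": "spreadsheet"}))  # -> CPU handled spreadsheet
```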

With this approach, we can now manage all this new data. But how do we tie these accelerators into the system?

One option is to think of these accelerators as I/O devices and connect them into the I/O channel… but there’s a problem.

Figure 7: Accelerators as I/O

Just as the cores were competing for memory, now we have accelerators competing for memory along with the CPUs, creating another bottleneck! But there is a better solution: if we extend the CCI and connect it to the new elements, we can get past this cloud bottleneck.

Figure 8: Extending the CCI to the accelerators

The Company Bottleneck

This brings us to the final bottleneck, the company bottleneck. It's not an architectural issue; it's a practical one. It's hard for any one company to supply all these elements, and CCIs are all proprietary. We need a CCI that multiple companies can use.

This is where CCIX, pronounced "see-six" (not "209" in Roman numerals, or even "see see eye eks"), comes in. CCIX is a cache coherent interconnect designed specifically for accelerators. It's an open standard that any company can join, standardizing cache management across the components of a computer system.

Figure 9: CCIX in action

But there are millions of servers deployed in the cloud. How are we going to take this to all those servers? The answer is to use existing infrastructure. Another standard, called PCI Express (PCIe), is already used for most of the I/O in servers. So CCIX uses PCIe at the lower levels to connect everything together, reusing the same cabling, connectors, and so forth, while at the upper levels CCIX does its job of managing the caches.

Figure 10: CCIX and PCIe work together
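
Here is a conceptual Python sketch of that layering, with CCIX's coherency messages riding on top of a PCIe transport. The class names are purely illustrative (there is no actual software API like this); the point is only that the existing lower-level plumbing carries the new upper-level protocol.

```python
# A conceptual sketch of the layering described above: CCIX reuses PCIe's
# lower layers (links, connectors, cabling) as transport and adds its own
# coherency protocol on top. Illustrative only; not a real CCIX or PCIe API.

class PcieLink:
    """Stands in for the existing PCIe physical/data-link layers."""
    def send(self, payload):
        print(f"PCIe link carries: {payload}")

class CcixProtocolLayer:
    """Coherency messages ride on top of the PCIe transport."""
    def __init__(self, link):
        self.link = link

    def invalidate(self, addr):
        # A coherency action (e.g. invalidating a remote cache line) is
        # expressed as a CCIX message, then handed to PCIe for delivery.
        self.link.send(f"CCIX invalidate cache line at {hex(addr)}")

ccix = CcixProtocolLayer(PcieLink())
ccix.invalidate(0x1000)
```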

 

With this overview, now you can read Paul’s blog CCIX Is Pronounced C6 and the CCIX IP web page and understand what they’re talking about! And if you’re interested in the history of computing, check out Paul’s blog Domain-Specific Computing 1: The Dark Ages of Computer Architecture.

So, keep an eye out for CCIX, until the next bottleneck comes along that we all have to solve!

—Meera

Tags:
  • ccix
  • Cadence on the Beat
  • history