HOT CHIPS: CXL Tutorial

30 Aug 2022 • 4 minute read

HOT CHIPS 2022 took place recently. The event was virtual again because, when the organizers had to make a final decision, no academic institution would let them use a stage (in recent years, HOT CHIPS has been held at either De Anza College or Stanford University). But the format was the same as ever, with Sunday dedicated to two tutorials.

This year the tutorials were:

CXL Overview and Evolution
Heterogeneous Compilation in MLIR

This post will cover the morning's CXL tutorial. I'll cover MLIR in a later post. By the way, MLIR stands for multi-level intermediate representation. For once, the ML does not stand for machine learning. Plus, of course, I will write several posts about the main conference itself, which took place on Monday and Tuesday. You can still register for HOT CHIPS, even though it is over, and then watch the video replays and download the slides.

Compute Express Link

I have written about CXL previously a couple of times. See my posts:

Recently, CXL 3.0 was released, and much of the tutorial was focused on the capabilities of the 3.0 release (and the significant differences from 2.0). Cadence also announced verification IP (VIP) for CXL 3.0, which you can read about on the product page Simulation VIP for CXL.

The tutorial was in four parts:

CXL Overview and Evolution by Ishwar Agarwal of Intel
CXL2/CXL3 Coherency Deep Dive by Robert Blankenship of Intel
Memory Use Cases and Challenges by Prakash Chauhan of Meta and Mahesh Wagh of AMD
CXL3 Fabric Introduction and Use Cases by Tony Brewer of Micron and Nathan Kalyanasundharam of AMD

Some background. CXL stands for Compute eXpress Link and is an open industry standard for high-speed communications, backed by nearly 200 member companies. The 1.0 standard was released in March 2019, with 1.1 following in September the same year, 2.0 in November 2020, and 3.0 just this month, August 2022. The CXL standard piggybacks on the PCIe standards. CXL 1.1 and 2.0 are aligned with PCIe 5, in particular its 32 GT/s data rate. CXL 3.0 aligns with PCIe 6, which runs at 64 GT/s and uses PAM4 signaling. Most of the tutorial was focused on CXL 3.0 rather than the earlier versions.
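To put those transfer rates in perspective, here is a back-of-the-envelope sketch of raw per-direction link bandwidth. The x16 link width is my assumption (the tutorial summary above only gives the per-lane rates), and the calculation ignores encoding, FLIT, and protocol overhead, so real usable bandwidth will be lower.

```python
def raw_bandwidth_gbytes_per_s(gt_per_s: float, lanes: int) -> float:
    """Raw bandwidth in GB/s for one direction of a link.

    Treats each transfer as one bit per lane and ignores all
    encoding and protocol overhead (an idealized upper bound).
    """
    return gt_per_s * lanes / 8  # divide by 8 to convert bits to bytes


# Assumed x16 links; the lane count is illustrative, not from the tutorial.
pcie5_x16 = raw_bandwidth_gbytes_per_s(32, 16)  # CXL 1.1 / 2.0 signaling rate
pcie6_x16 = raw_bandwidth_gbytes_per_s(64, 16)  # CXL 3.0 signaling rate

print(f"32 GT/s x16: {pcie5_x16:.0f} GB/s raw per direction")
print(f"64 GT/s x16: {pcie6_x16:.0f} GB/s raw per direction")
```

The doubling from 32 GT/s to 64 GT/s carries straight through to the raw bandwidth, which is where the "higher bandwidth" item in the CXL 3.0 summary comes from.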

The CXL transaction layer consists of three sub-protocols multiplexed over a single link:

CXL.io is used for discovery, configuration, interrupts, and so on
CXL.cache provides device access to processor memory
CXL.mem provides processor access to device memory

[Figure: Comparison of CXL versions]

Note that CXL 3.0 is backwards compatible with CXL 1.0/1.1 and CXL 2.0. The table above compares the versions (green ticks indicate supported features, grey boxes unsupported ones). In particular, all the additional capabilities in 3.0 are directed toward shared memory pools, which are covered in more detail below.

[Figure: CXL use cases]

The above diagram shows some representative CXL use cases. On the left are caching devices and accelerators, known as type 1 devices. In the middle are accelerators with their own memory (such as GPUs), known as type 2 devices. On the right are devices such as memory controllers providing shared memory, known as type 3 devices.
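One way to keep the three device types straight is by which sub-protocols each one runs. The sketch below encodes the mapping as it is usually described for CXL (type 1: CXL.io plus CXL.cache; type 2: all three; type 3: CXL.io plus CXL.mem). The mapping is general CXL background rather than something spelled out in this post, and the names `Protocol`, `DEVICE_TYPES`, and `uses` are purely illustrative.

```python
from enum import Flag, auto


class Protocol(Flag):
    """The three CXL sub-protocols multiplexed over one link."""
    IO = auto()     # discovery, configuration, interrupts
    CACHE = auto()  # device access to processor memory
    MEM = auto()    # processor access to device memory


# Illustrative mapping of device type number to the sub-protocols it uses.
DEVICE_TYPES = {
    1: Protocol.IO | Protocol.CACHE,                 # caching devices, accelerators
    2: Protocol.IO | Protocol.CACHE | Protocol.MEM,  # accelerators with own memory
    3: Protocol.IO | Protocol.MEM,                   # memory devices
}


def uses(device_type: int, protocol: Protocol) -> bool:
    """Return True if the given device type runs the given sub-protocol."""
    return protocol in DEVICE_TYPES[device_type]


for dev_type in sorted(DEVICE_TYPES):
    names = [p.name for p in Protocol if uses(dev_type, p)]
    print(f"type {dev_type}: {', '.join(names)}")
```

Every device type runs CXL.io, since that is how devices are discovered and configured; the type is really determined by whether the device caches host memory, exposes its own memory to the host, or both.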

[Figure: Global shared memory with CXL]

The latency of CXL 3.0 memory is approaching that of directly attached DRAM: about 200ns for CXL versus 100ns for DRAM. So it doesn't completely obviate the need for processors to have their own dedicated memory, but the penalty for adding pooled memory is quite low. In particular, it is a lot lower than the cost of each processor duplicating data in its local memory and then copying between different memories when needed. This allows global shared memory to be used, as in the above diagram. Even if the disaggregated memory is spread around the rack, the access time is still under 600ns.
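To get a feel for what "quite low" means, here is a simple sketch that blends the two latency figures quoted above. It assumes every access goes either to local DRAM or to CXL-attached memory and ignores caches, queuing, and bandwidth limits, so treat it as an illustration of the arithmetic and not a performance model. The function name and the example fractions are my own.

```python
LOCAL_DRAM_NS = 100  # approximate local DRAM latency quoted above
CXL_NS = 200         # approximate CXL-attached memory latency quoted above


def average_latency_ns(cxl_fraction: float,
                       cxl_ns: float = CXL_NS,
                       local_ns: float = LOCAL_DRAM_NS) -> float:
    """Weighted average latency when a fraction of accesses go to CXL memory."""
    if not 0.0 <= cxl_fraction <= 1.0:
        raise ValueError("cxl_fraction must be between 0 and 1")
    return (1.0 - cxl_fraction) * local_ns + cxl_fraction * cxl_ns


# Hypothetical access mixes, chosen only to show the trend.
for fraction in (0.0, 0.25, 0.5, 1.0):
    print(f"{fraction:4.0%} of accesses to CXL: "
          f"{average_latency_ns(fraction):.0f} ns average")
```

With a quarter of accesses going to pooled memory, the blended figure under these assumptions is 125ns, which is why keeping hot data local and colder data in the pool is an attractive split.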

[Figure: CXL multiple levels of switching]

Another important change with CXL 3.0 is that it allows multi-level switching. CXL 2.0, on the left in the above diagram, allows connection to one type 1 or type 2 device and multiple type 3 devices. CXL 3.0 allows connection to more than one device type (up to 16 CXL.cache devices) and also allows CXL switches to be connected to further CXL switches, known as a cascade. Also important is that CXL 3.0 enables non-tree topologies and peer-to-peer (p2p) communication without going through a host processor.
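The idea of cascaded switches and p2p traffic can be pictured as a small graph. The sketch below models a made-up topology with two switches and finds the shortest path between an accelerator and a memory device. All node names are hypothetical, and this only models connectivity; it says nothing about how a real CXL fabric routes or arbitrates traffic.

```python
from collections import deque

# Hypothetical topology: one host, two cascaded switches, three devices.
LINKS = [
    ("host0", "switch_a"),
    ("switch_a", "switch_b"),  # switch-to-switch link, i.e. the cascade
    ("switch_a", "accel0"),
    ("switch_b", "accel1"),
    ("switch_b", "mem0"),
]


def build_graph(links):
    """Build an undirected adjacency map from a list of (a, b) links."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, src, dst):
    """Breadth-first search; returns the list of nodes from src to dst, or None."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            return path
        for nxt in sorted(graph[node]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


GRAPH = build_graph(LINKS)
p2p_path = shortest_path(GRAPH, "accel0", "mem0")
print(" -> ".join(p2p_path))
print("passes through host:", "host0" in p2p_path)
```

In this toy topology the accelerator reaches the memory device through the two switches alone, which is the shape of the p2p case: the host is attached to the fabric but is not on the data path.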

Don't forget that in all this memory sharing, cache coherency is maintained across the various caches in the devices. The second presentation was a deep dive into coherency, but that is too much for a blog post like this. Let me just point out that CXL uses the MESI protocol/states with snooping to maintain coherency. For more details on this, see my post What's the Difference Between MOESI and MESI? Cache-Coherence for Poets.
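For readers who have not met MESI before, here is a minimal sketch of the textbook state machine for a single cache line in a single cache. This is the generic MESI protocol as taught in architecture courses, not the actual CXL.cache message flows from the coherency deep dive, and the event names are my own simplification.

```python
from enum import Enum


class State(Enum):
    M = "Modified"   # only copy, dirty
    E = "Exclusive"  # only copy, clean
    S = "Shared"     # clean, other caches may also hold it
    I = "Invalid"    # not present in this cache


def next_state(state: State, event: str, others_have_copy: bool = False) -> State:
    """Textbook MESI transition for one line in one cache.

    Events (simplified, illustrative names):
      local_read / local_write      - this cache's own processor accesses the line
      snoop_read / snoop_invalidate - another agent reads or wants ownership
    """
    if event == "local_read":
        if state is State.I:
            # A miss fills the line: Shared if someone else holds it, else Exclusive.
            return State.S if others_have_copy else State.E
        return state  # hits in M, E, or S leave the state unchanged
    if event == "local_write":
        return State.M  # other copies, if any, are invalidated first
    if event == "snoop_read":
        # A valid line is downgraded to Shared (M writes back its data first).
        return State.I if state is State.I else State.S
    if event == "snoop_invalidate":
        return State.I
    raise ValueError(f"unknown event: {event}")


state = State.I
trace = []
for event in ("local_read", "local_write", "snoop_read", "snoop_invalidate"):
    state = next_state(state, event)
    trace.append(state)
    print(f"{event:>16} -> {state.value}")
```

The trace walks one line through Exclusive, Modified, Shared, and back to Invalid, which covers all four states in the name.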

[Figure: CXL GFAM example]

The ultimate in this type of memory architecture is a Global Fabric Attached Memory (GFAM) device (GFD), which differs from a traditional processor-centric architecture by completely disaggregating the memory from the processing units and implementing a large shared memory pool. The memories can be the same type (e.g., DRAM) or different types (e.g., a mixture of DRAM and NAND flash). The above diagram shows an example with machine-learning accelerators attached to GFAM fabrics containing a mixture of volatile and non-volatile memories.

CXL 3.0 Summary

Enhanced memory pooling enables new memory usage models
Multi-level switching with multiple host and fabric capabilities
New symmetric coherency capabilities
Higher bandwidth
Optimized system-level flows with advanced switching, efficient p2p, and fine-grained resource sharing across multiple domains
Rack-scale memory fabric, a step on the journey to realizable memory-centric computing

Cadence's CXL IP

Read the datasheet Controller IP for CXL.
