Recently, it was HOT CHIPS 2022. The event was virtual again since, by the time the organizers had to make a final decision, no academic institution would let them use a venue (in recent years, HOT CHIPS has been held at either De Anza College or Stanford University). But the format was the same as ever, with Sunday dedicated to two tutorials.
This year the tutorials were:
- CXL (Compute Express Link), in the morning
- MLIR, in the afternoon
This post will cover the morning's CXL tutorial. I'll cover MLIR in a later post. By the way, MLIR stands for multi-level intermediate representation. For once, the ML does not stand for machine learning. Plus, of course, I will cover several presentations from the main conference itself, which took place on Monday and Tuesday, in future posts. By the way, you can still register for HOT CHIPS, even though it is over, and you can then watch the video replays and download the slides.
I have written about CXL previously a couple of times. See my posts:
Recently, CXL 3.0 was released, and much of the tutorial was focused on the capabilities of the 3.0 release (and the significant differences from 2.0). Cadence also announced verification IP (VIP) for CXL 3.0, which you can read about on the product page Simulation VIP for CXL.
The tutorial was in four parts:
Some background. CXL stands for Compute eXpress Link and is an open industry standard for high-speed communication, with nearly 200 member companies. The 1.0 standard was released in March 2019, with 1.1 following in September the same year, 2.0 in November 2020, and 3.0 just this month, August 2022. The CXL standard piggybacks on the PCIe standards. CXL 1.1 and 2.0 are aligned with PCIe 5, in particular its 32 GT/s transfer rate. CXL 3.0 aligns to PCIe 6 with 64 GT/s and uses PAM4 signaling. Most of the tutorial was focused on CXL 3.0 rather than the earlier versions.
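The transfer rates above translate directly into raw link bandwidth. Here is a quick back-of-the-envelope sketch (the function name is mine, and the figures are raw rates before protocol and encoding overhead):

```python
# Rough per-direction raw bandwidth for a x16 CXL link, before
# protocol overhead. Transfer rates are those quoted in the PCIe 5/6
# specs; each transfer carries one bit per lane.

def raw_bandwidth_gbps(gt_per_s: float, lanes: int = 16) -> float:
    """Raw link bandwidth in GB/s per direction."""
    return gt_per_s * lanes / 8  # divide by 8 bits per byte

# CXL 1.1/2.0 on PCIe 5: 32 GT/s per lane
print(raw_bandwidth_gbps(32))   # 64.0 GB/s per direction
# CXL 3.0 on PCIe 6: 64 GT/s per lane (PAM4 doubles the data rate
# without doubling the channel's Nyquist frequency)
print(raw_bandwidth_gbps(64))   # 128.0 GB/s per direction
```

So moving from CXL 2.0 to 3.0 doubles the raw bandwidth of a x16 link from 64 to 128 GB/s per direction.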
The CXL transaction layer consists of three sub-protocols multiplexed over a single link:
- CXL.io, for discovery, configuration, and I/O (essentially PCIe)
- CXL.cache, which lets a device coherently cache host memory
- CXL.mem, which lets the host access device-attached memory
Note that CXL 3.0 is backwards compatible with CXL 1.0/1.1 and CXL 2.0. The above table shows a comparison (green ticks indicate supported, grey boxes are unsupported). Note, in particular, that all the additional capabilities in 3.0 are directed toward shared memory pools. More details on this coming up below.
The above diagram shows some representative CXL use cases. On the left, caching devices and accelerators, known as type 1 devices. In the middle, accelerators with their own memory (such as GPUs), known as type 2 devices. And on the right, devices like memory controllers providing shared memory, known as type 3 devices.
CXL 3.0 is approaching DRAM latency, about 200ns for CXL versus 100ns for DRAM. So it doesn't completely obviate the need for processors to have their own dedicated memory, but the penalty for adding additional pooled memory is quite low. In particular, it is a lot lower than each processor duplicating data in its local memory and then copying between different memories when needed. This allows global shared memory to be used, as in the above diagram. Even if the disaggregated memory is spread around the rack, the access time is still under 600ns.
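The trade-off can be put into numbers with a simple weighted-average access-time calculation. The latencies below are the rough figures quoted above; the access mix is a made-up example, not a measurement:

```python
# Average memory access time for a mix of local and CXL-attached accesses.
# Latencies are the approximate figures from the tutorial; the 80/15/5
# access mix is hypothetical, for illustration only.

LOCAL_DRAM_NS = 100   # local DDR
CXL_DIRECT_NS = 200   # directly attached CXL memory
CXL_RACK_NS   = 600   # disaggregated memory elsewhere in the rack

def avg_access_ns(frac_local: float, frac_cxl: float, frac_rack: float) -> float:
    assert abs(frac_local + frac_cxl + frac_rack - 1.0) < 1e-9
    return (frac_local * LOCAL_DRAM_NS
            + frac_cxl * CXL_DIRECT_NS
            + frac_rack * CXL_RACK_NS)

# 80% of accesses hit local DRAM, 15% pooled CXL memory, 5% rack-scale
print(avg_access_ns(0.80, 0.15, 0.05))  # 140.0 ns
```

Even with 20% of accesses going to pooled memory, the average access time rises only modestly above the pure local-DRAM case, which is the point of the argument above.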
Another important change with CXL 3.0 is that it allows multiple levels of switching. CXL 2.0, on the left in the above diagram, allows connection to one type 1 or type 2 device and multiple type 3 devices. CXL 3.0 allows connection to more than one device type (up to 16 CXL.cache devices) and also allows CXL switches to be connected to further CXL switches, known as cascading. Also important is that CXL 3.0 enables non-tree topologies and peer-to-peer (p2p) communication without going through a host processor.
Don't forget that in all this memory sharing, cache coherency is maintained across the various caches in the devices. The second presentation was a deep dive into coherency, but that is too much for a blog post like this. Let me just point out that CXL uses the MESI protocol/states with snooping to maintain coherency. For more details on this, see my post What's the Difference Between MOESI and MESI? Cache-Coherence for Poets.
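For readers who want a feel for what MESI means in practice, here is a toy per-cache-line state machine, the textbook transition table the protocol is named after. This is deliberately simplified (real CXL coherency involves host/device snoop messages and more corner cases); the table and function names are mine:

```python
# A toy MESI state machine for a single cache line: Modified, Exclusive,
# Shared, Invalid. Transitions fire on local reads/writes and on snooped
# reads/writes from other caches. Simplified: a local read from Invalid
# assumes no other cache holds the line, so it goes to Exclusive.

MESI = {
    # (state, event) -> next state
    ("I", "local_read"):  "E",   # fill from memory, no other sharers
    ("I", "local_write"): "M",
    ("E", "local_write"): "M",   # silent upgrade, no bus traffic needed
    ("E", "snoop_read"):  "S",
    ("S", "local_write"): "M",   # other sharers must be invalidated
    ("M", "snoop_read"):  "S",   # write back dirty data, then share
    ("M", "snoop_write"): "I",
    ("E", "snoop_write"): "I",
    ("S", "snoop_write"): "I",
}

def next_state(state: str, event: str) -> str:
    # Events not in the table leave the state unchanged (e.g., a read hit)
    return MESI.get((state, event), state)

state = "I"
for ev in ["local_read", "snoop_read", "local_write", "snoop_write"]:
    state = next_state(state, ev)
    print(ev, "->", state)   # E, then S, then M, then I
```

The key property snooping preserves is that a line is writable (M or E) in at most one cache at a time, which is exactly what CXL.cache has to guarantee across devices.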
The ultimate in this type of memory architecture is a Global Fabric Attached Memory (GFAM) device (GFD), which differs from a traditional processor-centric architecture by completely disaggregating the memory from the processing units and implementing a large shared memory pool. The memories can all be the same type (e.g., DRAM) or different types (a mixture of DRAM and NAND flash, say). The above diagram shows an example with machine-learning accelerators attached to GFAM fabrics containing a mixture of volatile and non-volatile memories.
Read the datasheet Controller IP for CXL.
Sign up for Sunday Brunch, the weekly Breakfast Bytes email.