Last week it was the Persistent Memory Summit 2020, which has been running annually since 2013. Jim Pappas gave the state of the union address to open the summit. Back in 2017, he used a space analogy where you just need to mix hydrazine and nitrogen tetroxide to get combustion, no igniter required. He figured that you just need to mix storage and memory to get architecture disruption. Everything has taken longer than expected, but his 2020 space analogy is that "we have cleared the tower".
I think it is important to understand all the developments in persistent memory for two reasons:
Non-volatile memory such as flash or 3DXpoint can be used in three ways, and during the day all of these were discussed:
There have historically been two barriers to adoption. One has been that technologies have been slow in coming, and they all have different tradeoffs. Flash is not good for an intermediate level of persistent memory in the hierarchy since it cannot be written directly (you have to erase a block at a time...plus handle wear-leveling). Ferroelectric memory is a future technology to keep an eye on. MRAM is what all the foundries use for embedded memories as a replacement for flash, but it is too expensive for standalone products. RRAM has been "disappointing". So for now, it is all PCRAM aka phase-change RAM aka 3DXpoint aka Optane (Intel's name). In almost all the presentations, and for the rest of this post, I'm going to assume we are talking about 3DXpoint-style memory when I say "persistent memory".
Dave Eggleston pointed out the big dilemma later in the day:
The keynote was given by Andy Bechtolsheim of Arista Networks. Much of it was an updated version of his keynote at CDNLive Silicon Valley last year that I covered in my post Andy Bechtolsheim: 85 Slides in 25 Minutes, Even the Keynote Went at 400Gbps. He went just as fast this time. "His clock rate is over 3GHz" was one remark in the summing up at the end of the day.
One thing he had added for the summit was a look at protocols for accessing storage over Ethernet, which was all new to me. The first technology is RoCE, which stands for RDMA-over-converged-Ethernet, where RDMA in turn stands for remote-direct-memory-access. As Andy put it, "if your network is fast enough it doesn't matter where the memory is located", so these protocols allow memory/storage on one processor to be accessed without interrupting the remote processor or requiring an operating system context-switch. Current implementations use priority flow control (PFC) to avoid packet loss, since a single packet drop requires redoing an 8-megabyte transfer and so is very disruptive. A new version using explicit-congestion-notification (ECN) has been released.
Next, there is NVMe-over-TCP/IP, which leverages the TCP/IP protocol used all over the network. NVMe stands for non-volatile-memory-express. It is scalable to any size network, but the TCP/IP protocol has a fair bit of overhead, which reduces performance. The "new kid on the block" is NVMe Block Storage, which is non-standard but is already being used in production by people who couldn't wait for TCP/IP to be available.
These protocols enable "disaggregated storage to be realized in a way that mere mortals can use it," as Andy put it.
Andy Rudoff came on next to talk about the persistent programming model. His day job is at Intel, but he was careful to point out that he was there as a founding member of the SNIA NVM Programming Technical Working Group. He did point out, though, on one of his slides, that you could tell he worked for Intel because he said 3DXpoint is "available", not "finally available" as appeared in a couple of later presentations.
The diagram above shows the basic model. The operating system contains a persistent-memory-aware filesystem. It also allows an application to map part of the persistent memory into its own address space and then access it just by using regular load and store instructions, which is known as direct-access or DAX. The path on the left is like a really fast SSD. The path on the right is like a really fast page cache.
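To make the "map it and then just load and store" idea concrete, here is a minimal sketch in Python. It is only an illustration: it maps an ordinary temporary file rather than real persistent memory, and the function name `map_and_store` is made up for this example. On a real system, the file would live on a persistent-memory-aware filesystem mounted for direct access, which is an assumption this sketch does not exercise.

```python
import mmap
import os
import tempfile


def map_and_store(path: str, size: int = 4096) -> bytes:
    """Map a file into the address space, then write and read it like memory.

    Hypothetical helper: an ordinary file stands in for persistent memory.
    """
    # Create a fixed-size backing file (a stand-in for a persistent region).
    with open(path, "wb") as f:
        f.truncate(size)

    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), size) as region:
            # "Store": a plain slice assignment, no write() system call.
            region[0:5] = b"hello"
            # "Load": a plain slice read, no read() system call.
            return bytes(region[0:5])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(map_and_store(os.path.join(tmp, "pmem.bin")))
```

The point is that, once the mapping exists, the application touches the data with ordinary memory operations rather than going through the filesystem for every access.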
The difference comes, however, when you use flush. Just to be clear, "flush" means that you force values out from the caches and DRAM into the persistent memory, so that if there is a failure, the data will survive. This has never been required in conventional (non-persistent-memory) systems, since all the memory, both caches and DRAM, is lost in the event of a failure, so it really doesn't matter where each item of data was. In a system with persistent memory, the contents of caches and DRAM will be lost after a failure but the persistent memory will...err...persist. To add to the complexity, modern processors have delayed writes, which act as a sort of hidden cache, and they have no mechanism to tell whether a write has completed yet (since, with DRAM, it was never important).
Flush requires support in the hardware (Intel processors since Cascade Lake already have this) since normally there is no way for a program to make this happen and no way to determine when the whole operation is complete and everything has reached persistent memory safely.
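As a rough analogy for what an explicit flush looks like from the application side, here is a small Python sketch. It is an assumption-laden stand-in: `mmap.flush()` asks the operating system to write a mapped file's changes back to its backing store, which is not the same thing as the processor-level flush to persistent memory described above. The helper name `store_and_flush` is invented for this example.

```python
import mmap
import os
import tempfile


def store_and_flush(path: str, payload: bytes, size: int = 4096) -> bytes:
    """Write into a mapped file, flush, and read the bytes back from the file.

    Hypothetical helper: an ordinary file stands in for persistent memory.
    """
    with open(path, "wb") as f:
        f.truncate(size)

    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), size) as region:
            region[0:len(payload)] = payload
            # Until this call returns, the program has no guarantee that
            # the new bytes have left volatile buffers.
            region.flush()

    # Re-read through the normal file interface to see what was stored.
    with open(path, "rb") as f:
        return f.read(len(payload))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(store_and_flush(os.path.join(tmp, "pmem.bin"), b"durable"))
```

The shape is what matters here: store first, then flush, and only treat the data as safe once the flush has returned.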
There are two levels of ambition in using persistence across reboots, which is where it really differs from DRAM. The less ambitious is only to retain the contents of the persistent memory after a controlled shutdown, when the operating system can flush everything to the memory and add at least a little data to indicate that it was a controlled shutdown. The more ambitious is to retain the contents even after a crash of some sort. When the system is rebooted after a crash, it will not find that little bit of data, so it "knows" it is recovering from a crash and may have to do extra work to reconstruct and verify the contents. In practice, you need to create a mechanism for atomic transactions, and pick up the pieces during a restart (discard all transactions that failed, complete all transactions that succeeded) just like in a disk-based database.
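The restart logic described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any real persistent-memory library does it: the `PersistentRegion` class is a plain in-memory object standing in for persistent memory, the log is a simple redo log, and all the names (`begin`, `write`, `commit`, `shutdown`, `recover`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PersistentRegion:
    """Stand-in for persistent memory: everything here 'survives' a crash."""
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)   # redo log entries
    clean_shutdown: bool = False               # the "little bit of data"


def begin(region: PersistentRegion, txid: int) -> None:
    region.log.append(("begin", txid))


def write(region: PersistentRegion, txid: int, key: str, value: int) -> None:
    # Writes go to the log first; data is only updated for committed work.
    region.log.append(("write", txid, key, value))


def commit(region: PersistentRegion, txid: int) -> None:
    region.log.append(("commit", txid))


def apply_committed(region: PersistentRegion) -> None:
    """Complete transactions that committed, discard those that did not."""
    committed = {entry[1] for entry in region.log if entry[0] == "commit"}
    for entry in region.log:
        if entry[0] == "write" and entry[1] in committed:
            region.data[entry[2]] = entry[3]
    region.log.clear()


def shutdown(region: PersistentRegion) -> None:
    """Controlled shutdown: settle the log, then leave the marker."""
    apply_committed(region)
    region.clean_shutdown = True


def recover(region: PersistentRegion) -> bool:
    """Run at restart. Returns True if the previous shutdown was clean."""
    was_clean = region.clean_shutdown
    # Clear the marker so that a later crash is detectable.
    region.clean_shutdown = False
    if not was_clean:
        apply_committed(region)
    return was_clean


if __name__ == "__main__":
    region = PersistentRegion()
    begin(region, 1); write(region, 1, "a", 10); commit(region, 1)
    begin(region, 2); write(region, 2, "b", 20)   # crash before commit
    print(recover(region), region.data)
```

In this toy run, transaction 1 committed before the simulated crash and is completed at restart, while transaction 2 never committed and is discarded, which is the same discipline a disk-based database follows.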
One hiccup was when "the Linux guys said we can't give you that flushing" and forced you to access the operating system on every write, which made everything too slow. There is also a requirement to replicate all the filesystem metadata since you don't have a RAID array of disks when using persistent memory, so an uncorrectable error in the metadata could mean you lose everything.
With Linux, a special device driver needs to be added to handle the flushing. This works but has one big disadvantage: it doesn't follow the POSIX model, the standard Unix interface between the operating system and application programs.
Look for blog posts about the Twitter and Oracle presentations next week.
All the presentations (except Andy's Keynote...he never gives out his slides) are available in the SNIA Educational Library.