Paul McLellan

Exadata: An Epic Journey at Oracle with Persistent Memory

6 Feb 2020 • 5 minute read

The Persistent Memory Summit 2020 took place a couple of weeks ago. See my post Persistent Memory: We Have Cleared the Tower for an overview. This week I am covering two presentations, from Twitter and Oracle. There were a number of other presentations of proof-of-concept systems, but these two actually have systems with persistent memory in production use. You can read about Twitter in Persistent Memory at Twitter. Today it is Oracle's turn. This is not meant to be some sort of commercial for Exadata or Oracle. I doubt you are in the market for one of these (I hate to think how much they cost). But this is one of the first systems to use persistent memory and so is a sort of poster child for its benefits.

 Exadata

Jia Shi of Oracle presented Exadata with Persistent Memory: An Epic Journey. You probably have no idea what an Exadata is; I certainly didn't. It is an Oracle product that has been in existence for 10 years and gone through eight and a half versions. The goal is "ideal database hardware": scalable, and optimized across compute, networking, and storage to deliver the fastest performance at the lowest cost.

Larry Ellison, Oracle's longtime CEO and CTO, said:

Exadata is the most successful product class that Oracle has ever done

By the way, if you Google "CEO of Oracle" you will get Larry. But he stepped down in 2014 and appointed Mark Hurd and Safra Catz co-CEOs. Unfortunately, Mark died late last year aged just 62, and so now Safra Catz alone is CEO. Larry is still CTO (and Chairman).

The table above shows the ten+ year journey from the first Exadata V1 in 2008 to the X8M that Jia was discussing in her presentation. It has been shipping since September of last year and is in commercial use at Oracle's customers. It runs RDMA over 100G Ethernet (RDMA over Converged Ethernet, also known as RoCE). As I described in the overview post about the summit, RDMA allows one processor to access the memory of another without involving the CPU or the operating system of the target.

It is worth looking at the specs of the X8M since they are pretty impressive by any standard: 384 CPU cores, 2.3 petabytes of data, 16M read IOPS, 200Gb/s RDMA bandwidth.

The difference between the X8 and the X8M is the addition of persistent memory. The processor generation is the same in both (Xeon 8260 aka Cascade Lake), the network is the same, and the software is the same. If you look at the performance (last row) in the table above, you can see that performance grew fast in the early generations but this slowed and "capped out" as Jia described it. But adding persistent memory was a game-changer, increasing performance by ~2.5X compared to the version without persistent memory.

The Secret Sauce

Jia teased us with pictures of the secret sauce. The board on the left is Intel's Optane persistent memory. Unlike DRAM, it survives power failures, but it requires sophisticated algorithms to maintain data integrity across failures. The board on the right provides RDMA, allowing one processor to access the memory of another directly. In particular, the database server can read memory regions directly from the storage servers without involving the storage server's CPU or operating system. Previously, Exadata boxes ran RDMA on InfiniBand.

If they had just "dropped in" persistent memory, then they would be stuck with the conventional storage flow:

  • Database issues read I/O call to OS
  • OS sends message to storage
  • Storage CPU issues read to persistent memory
  • Storage CPU sends reply to server OS
  • Server OS wakes up database

This flow is dominated by the cost of network and I/O software, interrupts, and operating system context switches. Even though the persistent memory access time is just 1us, the overall end-to-end latency would be about 100us, so well over 90% of the time would be wasted.

Instead, they use RDMA direct to the persistent memory, resulting in an end-to-end latency of 19us, 10X faster than the same box without persistent memory. This is mostly due to cutting out all the context switches in both operating systems.
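As a rough check on those numbers, here is a minimal sketch of the arithmetic using only the three latencies quoted above. The constant and function names are mine, not Oracle's. Note that the last ratio compares RDMA against the hypothetical "drop-in" flow, which is a different comparison from the 10X figure against the box without persistent memory.

```python
# Figures quoted in the post, in microseconds.
PMEM_ACCESS_US = 1.0      # raw persistent memory access time
CONVENTIONAL_US = 100.0   # end-to-end latency through the conventional I/O flow
RDMA_US = 19.0            # end-to-end latency with RDMA direct to persistent memory


def overhead_fraction(device_us: float, end_to_end_us: float) -> float:
    """Share of the end-to-end latency that is not spent in the device itself."""
    return 1.0 - device_us / end_to_end_us


conventional_overhead = overhead_fraction(PMEM_ACCESS_US, CONVENTIONAL_US)
rdma_overhead = overhead_fraction(PMEM_ACCESS_US, RDMA_US)
speedup_vs_drop_in = CONVENTIONAL_US / RDMA_US

print(f"conventional flow overhead: {conventional_overhead:.0%}")  # 99%
print(f"RDMA flow overhead: {rdma_overhead:.0%}")                  # 95%
print(f"RDMA vs. drop-in flow: {speedup_vs_drop_in:.1f}X")         # 5.3X
```

Even with RDMA, most of the 19us is still spent outside the persistent memory itself, but the absolute overhead drops from about 99us to about 18us.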

Recovery

This may be too much information on how recovery logs work in databases. But, hey, my PhD was actually on distributed filesystems and databases, and I haven't geeked out on this stuff for years. The way Exadata conventionally logged was:

  • Database sends request to storage
  • Storage server issues simultaneous writes to the flash log and the disk drive
  • Storage server sends acknowledgment to the database

The big optimization is that the storage server does not need to wait for the hard disk drive write to complete before it acknowledges, just the flash write (which is much faster). If the system crashes, the flash contains all the recovery data, which can subsequently be read from flash and written to disk. But this has the same problem as the read access above: 90% of the time is wasted in context switches.
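The conventional logging flow above can be sketched as a toy model. This is only an illustration of "acknowledge once flash has the record", not Oracle's implementation: the class, its fields, and its method names are all hypothetical, and in-memory Python containers stand in for the flash log and the disk.

```python
from dataclasses import dataclass, field


@dataclass
class StorageServer:
    """Toy storage server that acknowledges once the flash log holds the record."""
    flash_log: list = field(default_factory=list)     # fast, durable log
    disk: dict = field(default_factory=dict)          # slow, durable home location
    pending_disk: list = field(default_factory=list)  # disk writes in flight (lost on crash)

    def write(self, key, value) -> str:
        # Both writes are issued together; only the flash write is waited on.
        self.flash_log.append((key, value))
        self.pending_disk.append((key, value))
        return "ack"

    def finish_disk_writes(self) -> None:
        # The slow disk writes complete later; the flash records are then redundant.
        for key, value in self.pending_disk:
            self.disk[key] = value
        self.pending_disk.clear()
        self.flash_log.clear()

    def crash(self) -> None:
        # In-flight disk writes are lost; the flash log survives.
        self.pending_disk.clear()

    def recover(self) -> None:
        # Replay the flash log, in order, onto the disk.
        for key, value in self.flash_log:
            self.disk[key] = value
        self.flash_log.clear()


server = StorageServer()
server.write("row1", "v1")   # acknowledged before the disk write finishes
server.crash()               # the disk write never completed
server.recover()
print(server.disk)           # {'row1': 'v1'}
```

The acknowledged write survives the crash because the flash log, not the disk, is what the acknowledgment depends on.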

Instead, the redo log is written to persistent memory using RDMA, cutting out all those context switches. So now transactions are logged in persistent memory, and then the flash log and the disk drive are written asynchronously. The persistent memory log records only need to be kept until the information is safely in the flash, at which point the original code is able to recover from a crash. After a crash, though, any log records still held in persistent memory must first be written to the flash, since those asynchronous writes may not have completed.
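Here is a minimal sketch of that tiered scheme, again as a toy model rather than anything from Oracle's code. The class and method names are hypothetical, plain Python lists stand in for the persistent memory and flash logs, and a list append stands in for the RDMA write.

```python
class TieredRedoLog:
    """Toy model: commit to persistent memory, destage to flash asynchronously."""

    def __init__(self):
        self.pmem_log = []   # stands in for the RDMA-written persistent memory log
        self.flash_log = []  # the original flash log
        self.disk = {}       # final home of the data

    def commit(self, key, value) -> str:
        # The commit returns as soon as the record is in persistent memory.
        self.pmem_log.append((key, value))
        return "committed"

    def destage(self, max_records=None) -> None:
        # Background step: copy records to flash, then trim them from persistent memory.
        count = len(self.pmem_log) if max_records is None else max_records
        batch = self.pmem_log[:count]
        self.flash_log.extend(batch)
        del self.pmem_log[:len(batch)]

    def recover(self) -> None:
        # First flush whatever is still only in the persistent memory log...
        self.destage()
        # ...then run the original flash-based recovery, replaying in order.
        for key, value in self.flash_log:
            self.disk[key] = value
        self.flash_log.clear()


log = TieredRedoLog()
log.commit("a", 1)
log.commit("b", 2)
log.destage(max_records=1)   # only the first record reached flash
log.commit("a", 3)
log.recover()                # simulate a restart after a crash at this point
print(log.disk)              # {'a': 3, 'b': 2}
```

Because records are trimmed from persistent memory only after they are in flash, every committed record is in at least one durable log at any moment, and replaying in commit order gives the latest value for each key.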

Faster log writes using persistent memory speed up commit times for transactions, which matters because any log write slowdown stalls the whole database.

Using persistent memory and Ethernet for RDMA (as opposed to InfiniBand) delivers big speedups, as you can see from the table at the start of this post: 2.5X the performance of the system that omits persistent memory and RoCE (and 320X higher performance than the original 2008 Exadata V1).
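Those two multipliers also imply how far Exadata had come before persistent memory arrived. This is a back-of-the-envelope sketch that assumes both figures measure the same performance metric, which the post does not state explicitly. The variable names are mine.

```python
# Multipliers quoted in the post.
TOTAL_GAIN_VS_V1 = 320.0   # latest system vs. the 2008 Exadata V1
PMEM_ROCE_GAIN = 2.5       # latest system vs. the one without persistent memory and RoCE

# Assumption: both ratios refer to the same metric, so they can be divided.
implied_gain_before_pmem = TOTAL_GAIN_VS_V1 / PMEM_ROCE_GAIN

print(f"implied gain over V1 before persistent memory: {implied_gain_before_pmem:.0f}X")  # 128X
```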

More Information

All the presentations are available in the SNIA Educational Library. In particular, see the Oracle presentation.

 
