The idea of adding processors to DRAM has been discussed for years: the silicon cost would be small, if only we could work out what to do with them. Usually, people who suggest this don't really think it through. They assume they would get something close to a state-of-the-art microprocessor on the same die as the DRAM. Then they look at the specs for a DRAM process and realize that you can't easily build a useful microprocessor in one.
At HOT CHIPS last month, Fabrice Devaux of Upmem detailed how they had done it successfully in a presentation titled The True Processing in Memory Accelerator.
Unless you are a DRAM designer, you probably don't know all that much about DRAM processes. By logic standards they are horrible: the transistors are slow, there are only a few metal layers, and everything is optimized for bit-cell density rather than for logic.
Upmem has developed a processor-in-memory (PIM) architecture and chip, embedded eight processors on each die, and is delivering them as standard DDR4-2400 DIMM modules with 16 chips. This means that a server CPU has the potential to be helped by thousands of additional cores. They see boosts of 20+X for data-intensive applications, with power efficiency 10X better because the data no longer needs to move between the DIMMs and the CPU. The cost increment is small. They have done this on a modern (2x nm) unmodified DRAM process.
The first problem is that most of the basic foundation IP required is unavailable. There are no digital standard cells and no SRAMs. So the first thing they did was to create a digital library and four different SRAMs, from 320 bits up to 16KB, single and dual port. The focus of Fabrice's presentation was on the processor itself, but I'm sure building a good SRAM in a DRAM process has an interesting set of challenges. Of course, there is plenty of DRAM memory on the chip already, but it was never designed to be modified, so it was necessary to minimize changes to the DRAM IP.
Building a fast processor using slow transistors is obviously a challenge. It takes a 14-stage pipeline to reach 500 MHz. The approach they took was to allow up to 24 hardware threads, with the pipeline interleaved so that each stage is running a different thread. As a result there is no operand bypass, no stalling, and no need for branch prediction. However, it does require a minimum of 11 threads to hit 100% performance. The pipeline then retires one instruction per cycle, just not from the same thread. There is 1GB/s transfer from DRAM, with transfers from 8B to 2KB. One DPU is roughly equivalent to 1/6th of a Xeon core on PIM applications (branchy, integer-only code).
The heavy multi-threading also implies an implicit memory hierarchy. There is no data cache because there is too much threading for it to be effective. Instead, there is a 64KB SRAM called WRAM. There is no instruction cache, instructions run out of a 24KB SRAM called IRAM. DMA instructions move data between the DRAM and the WRAM and IRAM. The DMA is executed by a separate DMA engine with minimal impact on pipeline performance. This diagram shows the pipeline and memories.
The instruction set architecture (ISA) for the processor is proprietary; they examined and rejected both Arm and RISC-V. It is a clean target for the LLVM/Clang compiler system, and it is scalar, in-order, and multi-threaded.
There is no OS since DPUs are never shared, which is also a dramatic security simplification. There are simply so many DPUs that there is never any need to share one.
One problem with doing PIM is that the data has to be stored in a special way. Normally, words are stored "horizontally", with eight bits of each 64-bit word stored in each memory chip. But that makes it impossible for the processors to do much. Instead, data needs to be rotated to be "vertical", so that all the bits of each 64-bit word are completely contained within a single DRAM chip and are all accessible to the processors on that chip. So eight 64-bit horizontal words spread across all the memory chips are turned into eight vertical words, one in each memory chip. This rotation, an 8x8 matrix transposition within a 64-byte cache line, is done by the library, and thus very efficiently.
The DPUs can be programmed in C and can take on the performance-critical part of the application code, with libraries doing a lot of the heavy lifting. The main server processor (x86, Arm64, Power9) acts as the orchestrator and still executes most of the application code since it is not performance critical.
The table below shows the speedup on a few algorithms from using DPUs versus running the same algorithm on the host x86 server with standard DRAM. It looks like a 20-40X speedup for algorithms that are a good match for this sort of PIM architecture.
Production started in Q3 2018, with samples available about now. Mass production is scheduled for Q1 2020.
Sign up for Sunday Brunch, the weekly Breakfast Bytes email.