Squeeze bandwidth inefficiencies out of DDR DRAMs in memory subsystem designs

24 May 2010 • 6 minute read
This blog starts with a simple, sad truth: DDR DRAMs are naturally inefficient. If this statement bothers you, just get over it. All human-made artifacts have inefficiencies and DRAMs are no different. However, there are things you can do to squeeze every bit of bandwidth efficiency out of a DDR DRAM, and your efforts can be rewarded with significant performance gains. You can improve memory-subsystem bandwidth by 20-30%, or perhaps more, depending on how efficient your subsystem was to begin with. This added bandwidth efficiency buys you one of two benefits, depending on your system design goals. The obvious benefit is more bandwidth at a given SDRAM clock rate. But perhaps your system doesn't actually need more bandwidth; perhaps lower-power operation and lower memory costs are more pressing. In that case, efficiency improvements can cut the required SDRAM clock rate, which reduces power consumption and lets you use slower, less expensive, and less power-hungry SDRAMs in your design.

Here’s a figure that illustrates these two benefits.

SDRAM Efficiency Examples

The top bar in the figure shows the bandwidth available from a DDR3-1600 SDRAM subsystem, which uses an 800MHz transfer clock. The theoretical peak bandwidth of such a system is 1600 Mtransfers/sec. However, in this theoretical system, the memory controller achieves only 50% transfer efficiency with the SDRAM, so the memory subsystem has an effective bandwidth of 800 Mtransfers/sec. By making the memory controller somewhat smarter with respect to the SDRAM's needs, and with no change in clock rate, the memory controller can extract more bandwidth from the SDRAM. In the case shown by the second bar in the figure, a controller that boosts SDRAM transfer efficiency to 75% achieves 1200 Mtransfers/sec with no increase in memory cost or clock rate. Alternatively, that same 75% memory-controller efficiency could be used to achieve the same 800-Mtransfers/sec bandwidth that the 50%-efficient memory controller obtained with an 800MHz transfer clock, but the controller with improved efficiency can hit the 800-Mtransfers/sec goal with a peak transfer rate of only 1066 Mtransfers/sec. DDR3-1066 SDRAMs cost considerably less than DDR3-1600 SDRAMs, so there are real cost and power savings to be had by improving memory-controller efficiency. It's really important to keep in mind that clock speed isn't the critical figure of merit here. Effective bandwidth (actual data throughput) is.
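If you'd like to check that arithmetic yourself, here's a minimal sketch in Python. The helper names are mine, purely for illustration; effective bandwidth is simply the peak transfer rate multiplied by efficiency, and the required peak rate for a target bandwidth is the inverse calculation.

```python
# Hypothetical helpers for the bandwidth arithmetic above -- not from any
# real controller tool, just the two formulas made explicit.

def effective_bandwidth(peak_mtps, efficiency):
    """Effective bandwidth (Mtransfers/sec) = peak rate x efficiency."""
    return peak_mtps * efficiency

def required_peak_rate(target_mtps, efficiency):
    """Peak rate (Mtransfers/sec) needed to reach a target bandwidth."""
    return target_mtps / efficiency

print(effective_bandwidth(1600, 0.50))  # 800.0  -- the top bar
print(effective_bandwidth(1600, 0.75))  # 1200.0 -- same clock, smarter controller
print(required_peak_rate(800, 0.75))    # ~1066.7 -- a DDR3-1066 part suffices
```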

How can a memory controller make an SDRAM more efficient? Throughput-improvement techniques are based on a firm understanding of the causes of SDRAM inefficiencies. An SDRAM can be completely, 100% efficient if all memory accesses are of the same type (read or write) and are directed at the same memory page. As soon as the memory controller must switch to another SDRAM page, access inefficiencies appear because the controller must open the new page and get it ready to access (activate it) before the actual access. Another way to achieve high transfer efficiency is to direct consecutive accesses to a rotating list of memory banks so that the access pattern never touches different rows in the same bank on sequential accesses. Unfortunately, few real memory-access patterns look like either of these use cases, so we must look elsewhere for efficiency improvement.
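To make that concrete, here's a toy Python model that scores an access pattern by how often it hits an already-open row. The timing numbers are round-number assumptions chosen for readability, not values from any DDR datasheet.

```python
# A toy model of row-buffer (page) locality. Timing values are assumptions.

BURST_CYCLES = 4        # cycles of data transfer per access (assumption)
ROW_MISS_PENALTY = 8    # extra cycles to precharge + activate a new row (assumption)

def transfer_efficiency(accesses):
    """Score an access stream: accesses is a list of (bank, row) tuples."""
    open_row = {}                       # bank -> currently open row
    busy = useful = 0
    for bank, row in accesses:
        if open_row.get(bank) != row:   # page miss: the row must be opened first
            busy += ROW_MISS_PENALTY
            open_row[bank] = row
        busy += BURST_CYCLES            # the burst itself moves real data
        useful += BURST_CYCLES
    return useful / busy

# Every access to the same page: 0.8 here, approaching 1.0 for longer runs.
print(transfer_efficiency([(0, 5)] * 8))
# Ping-ponging between two rows of one bank: every access misses (~0.33).
print(transfer_efficiency([(0, 5), (0, 9)] * 4))
# Rotating across four banks: the second pass hits every open row (0.5).
print(transfer_efficiency([(b, 5) for b in range(4)] * 2))
```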

Improvements to SDRAM operational efficiency must emphasize intelligent, traffic-based page management and intelligent command reordering. To achieve these efficiency improvements, a memory controller must gather the commands issued by one or more memory-using blocks on the chip, order those commands in real time for the highest SDRAM efficiency based on the immediate state of all the SDRAMs attached to the controller, and possibly reorder the access commands in real time, based on assigned priorities and efficiency considerations, as new commands enter the memory controller's command queue.
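Here's one way to picture such a scheduling policy. The sketch below is a simplification of mine, not a description of any shipping controller: it scans a pending queue oldest-first, promotes the first request that hits an already-open row, and falls back to strict ordering so nothing starves.

```python
# Illustrative open-row-first scheduling policy (request format assumed).

from collections import deque

def pick_next(queue, open_row):
    """queue: deque of dicts with 'bank', 'row', 'op'; open_row: bank -> row."""
    for i, req in enumerate(queue):           # oldest-first scan
        if open_row.get(req["bank"]) == req["row"]:
            del queue[i]                      # page hit: service it now
            return req
    return queue.popleft()                    # no hits: take the oldest (fairness)

queue = deque([
    {"bank": 0, "row": 9, "op": "RD"},   # would force a row switch on bank 0
    {"bank": 1, "row": 3, "op": "RD"},   # hits bank 1's open row
])
open_row = {0: 5, 1: 3}
print(pick_next(queue, open_row))        # the bank 1 read is promoted
```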

Before looking at the nitty-gritty details of these efficiency-improving techniques, it's helpful to step back and discuss why most SOC designs have standardized on one (or a very few) SDRAM ports. The most efficient memory, from access and bandwidth perspectives, is on-chip memory. Ideally, every memory-using function block on the SOC would have its own block of dedicated on-chip memory. Distributing memory on chip in this manner boosts raw available bandwidth and greatly diminishes memory resource conflicts. (There's less conflict because there are more memory resources.) Each new block of memory with its own independent memory interface also clearly increases available bandwidth, but at a price, of course. On-chip memory is relatively expensive. On-chip embedded SRAM can easily cost many dollars per Mbyte as opposed to memory packaged in stand-alone bulk SDRAM chips, which cost dollars per Gbyte. That's a 1000x difference in the per-bit cost of memory. Consequently, system designers strive mightily to cram all of a system's memory (or as much as possible) into a standard SDRAM chip to get the most memory per BOM dollar. Part of that mighty effort involves funneling the various memory-access streams on the SOC into one SDRAM port. A memory controller performs that funneling, and it's the memory controller's job to make sure that all of the on-chip memory-using function blocks get the bandwidth they need from the attached SDRAM. The challenge is to give each memory client in the system the bandwidth, latency, and quality of service that it needs from the single SDRAM resource.

That said, let's take a look at some possible memory-access optimizations. The following figure shows two memory read sequences, each directed at a different bank of memory. The sequence on the right is unoptimized: it shows an activation of Bank 0 followed (after an appropriate number of wait cycles) by five read operations directed at that bank. The command sequence then performs a read from Bank 1, which starts by activating Bank 1, followed by the read commands after an appropriate delay. No data comes from the SDRAM for eight cycles because the memory controller waited to activate Bank 1 until all reads from Bank 0 were complete. Those eight cycles represent lost bandwidth. Waiting for all accesses to Bank 0 to complete before activating Bank 1 needlessly delays the read operations on Bank 1.

SDRAM Read Optimization

The left side of the figure shows what can happen if the memory controller reorders the low-level commands sent to the SDRAM using a look-ahead ordering algorithm and bank interleaving. By promoting the activation command for Bank 1 ahead of the read commands for Bank 0, the memory controller prepares Bank 1 for reads earlier in the sequence. As a result, the reads from Bank 1 occur sooner, recovering the eight cycles of lost bandwidth. To safely reorder these low-level commands, the memory controller must fully understand the SDRAM's rules of operation, the rules for activating more than one memory bank at a time, and the required timing for bank activation and bank reading. Note that in this example, the memory controller has not reordered the memory-access commands it received from the on-chip functional blocks. It has merely interleaved the bank-activation and read commands sent to the SDRAM; the read commands to the two banks remain in order.
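In code form, the reordering amounts to hoisting the next bank's activate command forward. This little sketch uses a made-up command encoding and assumes that tRRD and the other activation timings permit the earlier ACT:

```python
# Before/after view of the bank interleave in the figure (command encoding
# is hypothetical; a real controller must check tRRD/tFAW and bank state).

unoptimized = ["ACT b0",
               "RD b0", "RD b0", "RD b0", "RD b0", "RD b0",
               "ACT b1",
               "RD b1", "RD b1", "RD b1", "RD b1", "RD b1"]

def hoist_next_activate(cmds):
    """Move Bank 1's ACT to just after Bank 0's ACT, leaving reads in order."""
    out = list(cmds)
    act = out.pop(out.index("ACT b1"))
    out.insert(out.index("ACT b0") + 1, act)
    return out

print(hoist_next_activate(unoptimized))
# ['ACT b0', 'ACT b1', 'RD b0', ..., 'RD b1', ...] -- Bank 1's activation
# latency now overlaps Bank 0's data transfer, recovering idle bus cycles.
```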

The above example is a simple one that exploits the SDRAM's bank-interleaving abilities. It's also possible to reorder the memory-access commands (reads and writes) from the on-chip functional blocks themselves, as long as coherency considerations are observed. For example, you really don't want to promote a read to an address ahead of a write to the same address if that wasn't the original order of issue for those memory-access commands. However, it's also possible to safely violate coherency considerations in some cases, if you know what you're doing. But that's a more advanced topic, for a different blog entry.
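A minimal version of that coherency check might look like this (the request format is assumed for illustration): a read may jump the queue only if no older pending write targets the same address.

```python
# Read-after-write hazard check before promoting a read (request shape assumed).

def can_promote_read(read_req, older_reqs):
    """True if no older pending write touches the read's address."""
    return not any(r["op"] == "WR" and r["addr"] == read_req["addr"]
                   for r in older_reqs)

pending = [{"op": "WR", "addr": 0x1000}, {"op": "WR", "addr": 0x2000}]
print(can_promote_read({"op": "RD", "addr": 0x3000}, pending))  # True: safe
print(can_promote_read({"op": "RD", "addr": 0x1000}, pending))  # False: RAW hazard
```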

(This blog entry is based on a presentation created by Denali's Marc Greenberg.)
