Paul McLellan

RISC-V Cores: SweRV and ET-Maxion

4 Jan 2019 • 6 minute read


December saw the first RISC-V Summit, held at the Santa Clara Convention Center. I covered it in my post RISC-V: Real Products in Volume. The one-sentence summary of the state of RISC-V is that it is already dominant in academia, and has some traction with DARPA too. I doubt any chips will be built in academia that are not RISC-V-based, and it is clear that a lot of ideas for things like hardware security will be prototyped on RISC-V. The big question is how significant it will be in the commercial world.

The big potential driver in the industry is the end of progress in general-purpose processor performance. This is a confluence of a couple of factors: the inability to increase clock rates significantly to get extra performance, and the lack of new ideas in computer architecture to increase instructions per cycle. This means that the only route to improved performance is through special-purpose CPUs that match a specialized workload (such as neural networks, or wireless modem signal processing) to specialized architectures that can do tens or even hundreds of times better.

One leading indicator is how many commercial implementations are being built. At the summit, the details of several implementations were described; I will cover the two most significant in this blog post. Another teaser came from Qualcomm, whose Greg Wright finished by saying that they would be shipping a volume product containing RISC-V in 2019. However, he didn't say what the processor was. I'm guessing it is probably one of SiFive's, although it is certainly possible that Qualcomm has its own implementation that it is keeping secret for now (they already have their own Arm implementation, their own DSP, and their own GPU).

SweRV

At the RISC-V workshop in December 2017, Martin Fink, the CTO of Western Digital, said that they would switch all their cores to RISC-V over the next few years. Their cores are mostly inside controllers for flash memory (remember, Western Digital acquired SanDisk), HDDs, and SSDs. But that adds up to a lot of volume. In this year's keynote, Martin said:

Western Digital ships in excess of 1 billion cores per year...and we expect to double that

He announced Western Digital's new core, called SweRV (the RV standing for RISC-V, obviously; the "We" is Western Digital). It is a two-way superscalar in-order core with a 9-stage pipeline. In 28nm, it will run at up to 1.8 GHz. Its performance is actually impressive. The table below shows how various cores stack up in CoreMark/MHz, a measure of CPU goodness that is independent of the semiconductor process:

SweRV comes in at 4.9 CoreMark/MHz. Even though it is an in-order core, it beats some out-of-order cores such as BOOM (the Berkeley Out-of-Order Machine, admittedly an academic project) and the Arm Cortex-A15 (admittedly an old implementation). The core is targeted at NAND controller implementations.
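As a rough back-of-the-envelope check, an absolute CoreMark score can be estimated by multiplying the CoreMark/MHz figure by the clock frequency. A minimal sketch using the numbers from the talk (the result is an estimate, not a published benchmark score):

```python
# Estimate an absolute CoreMark score from the process-independent
# CoreMark/MHz figure and a target clock frequency.
coremark_per_mhz = 4.9   # SweRV's figure from the talk
clock_mhz = 1800         # up to 1.8 GHz in 28nm

estimated_score = coremark_per_mhz * clock_mhz
print(f"Estimated CoreMark at {clock_mhz} MHz: {estimated_score:.0f}")
# -> Estimated CoreMark at 1800 MHz: 8820
```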

Martin also announced that they would open-source the core, and a lot of the associated environment including the ISS (instruction set simulator) that can be run against any implementation as a part of verification.

It will be completely open-sourced. You can download it in Q1. It is written in Verilator-clean SystemVerilog and has an unrestricted “knock yourself out” license.
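An ISS used this way is typically run in lock-step with the RTL: both execute the same program, and the architectural state is compared after every retired instruction. A minimal sketch of the idea, where the iss and rtl objects and their step() methods are hypothetical stand-ins rather than Western Digital's actual tooling:

```python
# Lock-step co-simulation: the ISS is the golden model, the RTL is the
# design under test; any divergence in architectural state flags a bug.
# The iss/rtl objects and their step() methods are hypothetical.
def cosimulate(iss, rtl, max_instructions=1_000_000):
    for n in range(max_instructions):
        iss_state = iss.step()   # e.g. pc plus the 32 integer registers
        rtl_state = rtl.step()   # e.g. extracted from a retirement trace
        if iss_state != rtl_state:
            raise AssertionError(
                f"Divergence at instruction {n}: ISS={iss_state}, RTL={rtl_state}")
```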

Later in the day, Zvonimir Bandic, Robert Golla, and Dejan Vucinic presented more details in a presentation titled CPU Project in Western Digital: From Embedded Cores for Flash Controllers to Vision of Datacenter Processors with Open Interfaces. With specialized processors offloading the compute-intensive parts of the workload, their vision of a datacenter CPU architecture has a medium-performance out-of-order RISC-V core running the general-purpose OS and the non-compute-intensive parts of the applications. It only needs to be medium performance since there is little point in spending a huge amount of silicon on higher performance once the compute-intensive portion is being run elsewhere. It is a better tradeoff to use that silicon for high-bandwidth, low-latency accelerator interfaces, and to support a standardized memory fabric for further scale-out. See the diagram below:

Robert Golla, the architect of SweRV, presented the internals. As Martin had said in the keynote, it is a 9-stage pipeline. It has four stall points: at fetch1 (for cache misses and line fills), at align (to form instructions from 3 fetch buffers), at decode (to decode up to 2 instructions from 4 instruction buffers), and at commit (which can commit up to 2 instructions per cycle). There are two execution pipes (I0 and I1 on the diagram below), as you would expect for a 2-way superscalar processor. There is a separate 3-cycle-latency multiply pipe, but divide is done out of pipe since it takes 34 cycles.

The branch predictor uses the standard GSHARE algorithm, which indexes a table of counters with a hash of the branch address and the global branch history. Branches that hit in the BTB incur a 1-cycle penalty, branches that mispredict in the primary ALUs incur a 4-cycle penalty, and branches that mispredict in the secondary ALUs incur a 7-cycle penalty.
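GSHARE is simple enough to sketch in a few lines: XOR the branch address with a global history register, and use the result to index a table of 2-bit saturating counters. A minimal illustrative model (the table size and hashing details are generic, not SweRV's actual configuration):

```python
# Minimal GSHARE branch predictor: global history XORed with the branch PC
# indexes a table of 2-bit saturating counters. Sizes are illustrative.
class Gshare:
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # global branch history
        self.counters = [1] * (1 << index_bits)   # 0..3, start weakly not-taken

    def _index(self, pc):
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # True means predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask
```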

The physical layout is in the picture below. At the SSG corner at 1 GHz, the core (without memories) is 0.132 mm² in TSMC 28nm (125°C, SVT, 150ps clock skew).

You can watch the presentation:

Esperanto Maxion 

Esperanto's Polychronis Xekalakis gave a "sneak preview" of the ET-Maxion and how they went about building it. First, a summary of Maxion:

  • High-frequency design operating at 2+ GHz in TSMC 7nm (they feel the frequency can go higher)
  • 10 stages from fetch to write-back for a 1-cycle ALU op: 4-stage fetch, 2-cycle allocate and rename, 2-cycle dispatch and read PRF, 1-cycle execute, and 1-cycle write-back
  • Fetch and decode: 48-entry iTLB, banked 32KB iCache, 2K-entry compressed BTB, state-of-the-art conditional predictor with separate path-based indirect predictor
  • OoO and execution: 64-entry distributed scheduler, 128-entry reorder buffer, 32-entry load queue, 32-entry store queue, 8R/4W 128-entry iPRF, 3R/2W 64-entry fPRF, 1 load/store pipe, 2 simple ALUs, 2 complex ALU/branch pipes
  • Memory: 32-entry dTLB and 1K-entry unified L2 TLB, fully coherent 64KB data cache and unified 4MB L2 cache, aggressive stride prefetchers for the L1 and L2 data caches
  • RISC-V ISA: support for the compressed ISA and privileged ISA, fully respects the relaxed consistency model, supports the external debug spec

They started with BROOM, the current version of the Berkeley Out-of-Order Machine. It was a great starting point, but Esperanto did a lot of work to make it industrial-strength and to support the features in the microarchitecture:

  • Wider fetch/decode
  • State-of-the-art branch predictor
  • Larger caches
  • Support for the compressed instruction extension
  • Redesigned front end and load/store
  • Improved silicon design

The performance is roughly the same as the Arm Cortex-A57, and about twice as fast as BOOM. It will be "competitive with high-end Arm cores". One caveat I would add is that Esperanto is comparing a core that is just taping out with cores that are in production (so maybe a 2-year lag), and not comparing it to unannounced cores in development.

The floorplan is below:

One advantage Esperanto has, coming to the design in 2018, is that they can architect the implementation to resist side-channel timing attacks (Spectre and Meltdown). As they said, "we wouldn't have been thinking about this a couple of years ago." Their philosophy is that you can still speculate, but you need to ensure that no effect of the speculation can be observed. They referred to a talk in Barcelona last year (it was at the same time as CDNLive EMEA, so I was unable to attend and don't have a post on it). Central to the design is the notion of a "point of no return", or PnR, a logical point beyond which instructions are guaranteed to retire; updates of state past the PnR are generally safe. Other mitigations they are considering are not allowing speculative updates to the branch prediction unit, not filling the data cache speculatively, and not training hardware prefetchers speculatively. The performance impact of these is small, and they are quite effective at making timing attacks significantly harder. Some of these are in the first core, with more planned for the second generation.
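The PnR idea can be sketched as a gate on microarchitecturally visible side effects: a speculative load's cache fill is held in a private buffer, and only committed to the shared cache once the load is past the point of no return. A minimal illustrative model of the policy, not Esperanto's actual implementation:

```python
# Sketch of PnR-gated speculation: a speculative load's fill is held in a
# private buffer and only moved into the shared cache once the load is past
# the point of no return (guaranteed to retire). Squashed loads leave no
# observable footprint. Purely illustrative, not Esperanto's design.
class PnRGatedCache:
    def __init__(self):
        self.cache = {}           # shared, timing-observable state
        self.spec_buffer = {}     # fills from still-speculative loads

    def speculative_load(self, addr, memory):
        if addr in self.cache:
            return self.cache[addr]
        # Miss: fetch the data, but stash it privately rather than
        # filling the cache while the load could still be squashed.
        self.spec_buffer[addr] = memory[addr]
        return self.spec_buffer[addr]

    def past_pnr(self, addr):
        # The load can no longer be squashed: safe to update shared state.
        if addr in self.spec_buffer:
            self.cache[addr] = self.spec_buffer.pop(addr)

    def squash(self, addr):
        # Misspeculation: discard the buffered fill; the cache never changed,
        # so an attacker cannot observe the speculative access.
        self.spec_buffer.pop(addr, None)
```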

Again, you can watch the whole presentation:


Sign up for Sunday Brunch, the weekly Breakfast Bytes email.