At this year's CDNLive, AdaptIP presented their experiences with high-level synthesis (HLS), in particular Cadence's Stratus HLS product. The presenter was billed as being Farhad Mighani but he couldn't make it, so instead it was Mike Sharp and Mike "Mac" McNamara. If that last name seems familiar in the HLS world, it should. Mac ran Cadence's C-to-Silicon project before he left to found AdaptIP as its CEO. Cadence then acquired Forte, which had an HLS product called Cynthesizer. Stratus is the combination of these two products.
AdaptIP was founded as a different sort of IP company, one built around complex algorithms that they code at a high level in SystemC and then run through HLS to create RTL and, if appropriate, take further through synthesis and place and route. The advantage of using HLS for this sort of design is that the netlist required depends critically on the process, the performance required, and the user's requirements for optional aspects of the protocol. The designs that they are working on are for interfaces such as H.265, LTE, and USB 3.0. As Mac said, they "try not to write any Verilog." From a staffing point of view, their approach is to add HLS experts and then add domain experts.
The example used during the presentation was IEEE 802.11ah. This is a version of WiFi (whenever you see IEEE 802.11xx, think WiFi) suited for IoT, also known as HaLow. It operates in 700-900MHz bands normally used by cordless phones and garage-door openers. But at this relatively low frequency, it can go over a kilometer, unlike the wireless router in your living room, and supports multi-hop for longer distances. IoT devices tend not to be on the whole time: they wake up, take a measurement, communicate, and then go back to sleep, which means initiating a connection also needs to be fast.
Building an IEEE 802.11ah device depends on which end of the connection is being considered, the access point or the IoT device itself. A likely scenario would be to design the "thing" end in a process like 40ULP for multi-year battery life, and to design the access point in 28nm or even in a FinFET process like 16FFC. Since the access point has to handle up to 8191 connections (pop quiz: where does that number come from?*), it has much more demanding computational requirements than the device end, which only needs to have a single connection and spends most of its time hibernating. The PHY is the same as IEEE 802.11ac but at a tenth of the clock rate. There are advanced antenna features such as beam forming and 4x4 MIMO (multiple antennas). Security is the same as IEEE 802.11n and IEEE 802.11ac.
Two of the most complex blocks are the FFT/IFFT block and the Viterbi decoder. As an idea of size, the FFT is 140K gates after synthesis through Stratus and Genus. Their methodology is to start in MATLAB to get the algorithm correct, using floating point. They can then feed the MATLAB model random data and capture the results, which gives them an extensive test suite. The MATLAB model also enables them to decide what precision is required. Next, the algorithm is coded up in SystemC using fixed point, and the code is verified against MATLAB using the tests. Finally, it is run through HLS for whatever processes, options, and constraints are appropriate.
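To make that flow concrete, here is a minimal sketch in Python rather than MATLAB and SystemC. It is not AdaptIP's code: the small FIR filter, the tap values, the function names, and the 1e-3 tolerance are all assumptions chosen for illustration. The idea is the same, though. A floating-point golden model is fed random data, a fixed-point version is checked against the captured results, and the precision is swept to see how many fractional bits are actually needed.

```python
import random

def to_fixed(x, frac_bits):
    """Quantize a float to a signed integer with frac_bits fractional bits."""
    return int(round(x * (1 << frac_bits)))

def fir_float(samples, taps):
    """Floating-point golden model (stands in for the MATLAB reference)."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, t in enumerate(taps):
            if n - k >= 0:
                acc += t * samples[n - k]
        out.append(acc)
    return out

def fir_fixed(samples, taps, frac_bits):
    """Fixed-point model: integer multiply-accumulate, rescaled at the output."""
    s_q = [to_fixed(s, frac_bits) for s in samples]
    t_q = [to_fixed(t, frac_bits) for t in taps]
    out = []
    for n in range(len(s_q)):
        acc = 0
        for k, t in enumerate(t_q):
            if n - k >= 0:
                acc += t * s_q[n - k]  # product carries 2*frac_bits fractional bits
        out.append(acc / float(1 << (2 * frac_bits)))
    return out

def max_error(frac_bits, samples, taps):
    """Largest absolute difference between golden and fixed-point outputs."""
    gold = fir_float(samples, taps)
    test = fir_fixed(samples, taps, frac_bits)
    return max(abs(g - t) for g, t in zip(gold, test))

def min_frac_bits(tolerance, samples, taps, max_bits=24):
    """Smallest number of fractional bits that meets the error tolerance."""
    for bits in range(1, max_bits + 1):
        if max_error(bits, samples, taps) <= tolerance:
            return bits
    return None

# Random stimulus, seeded so the captured "test suite" is repeatable.
random.seed(0)
samples = [random.uniform(-1.0, 1.0) for _ in range(256)]
taps = [0.25, 0.5, 0.25]  # hypothetical filter, for illustration only

for bits in (4, 8, 12, 16):
    print(bits, "fractional bits, max error", max_error(bits, samples, taps))
print("fractional bits needed for 1e-3:", min_frac_bits(1e-3, samples, taps))
```

In a real flow the fixed-point side would be the SystemC source itself, so the same vectors that qualify the model also qualify the code that goes into HLS.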
For example, with the FFT, the first synthesis was too slow (too many clock cycles). Mike made performance more important than area, split the data memory into two (real and imaginary), eliminated some unnecessary copies and computation found by code inspection, and added more loop unrolling. The final result was 96 clock cycles for 32 point and 168 clock cycles for 64 point. That was an 11X improvement from the first synthesis for 16 point and 18X for 32 point.
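As a rough picture of what splitting the data into real and imaginary memories looks like at the source level, here is a small radix-2 FFT in Python that keeps the two parts in separate arrays. This is an assumption-laden software sketch, not the presented design: the function names are hypothetical, the arithmetic is floating point, and nothing here models clock cycles or Stratus directives. In hardware, the point of the split is that the two arrays can map to two memories that are read and written in the same cycle.

```python
import cmath
import math
import random

def fft_split(re, im):
    """In-place radix-2 decimation-in-time FFT on separate real/imag lists."""
    n = len(re)
    assert n > 0 and (n & (n - 1)) == 0, "length must be a power of two"
    # Bit-reversal permutation, applied identically to both arrays.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            re[i], re[j] = re[j], re[i]
            im[i], im[j] = im[j], im[i]
    # Butterfly stages: each butterfly touches one pair in each array.
    size = 2
    while size <= n:
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):
                angle = -2.0 * math.pi * k / size
                wr, wi = math.cos(angle), math.sin(angle)
                a = start + k
                b = a + half
                tr = wr * re[b] - wi * im[b]
                ti = wr * im[b] + wi * re[b]
                re[b], im[b] = re[a] - tr, im[a] - ti
                re[a], im[a] = re[a] + tr, im[a] + ti
        size *= 2

def dft_reference(x):
    """Direct O(n^2) DFT on complex samples, used only as a golden model."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def check(n, seed=1):
    """Return the largest error between fft_split and the reference for n points."""
    rng = random.Random(seed)
    re = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    im = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    gold = dft_reference([complex(r, i) for r, i in zip(re, im)])
    fft_split(re, im)
    return max(abs(complex(r, i) - g) for r, i, g in zip(re, im, gold))

for points in (32, 64):
    print(points, "point FFT, max error vs reference:", check(points))
```

The inner butterfly loop is also the natural candidate for unrolling, since each iteration within a stage works on an independent pair of elements.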
Mac presented a similar story about the Viterbi decoder, starting in MATLAB, recoding in SystemC, and then gradually working with Stratus to produce a much better result. The initial implementation used 17,000 clock cycles, which they got down to what looked like zero on the graph but was presumably a few hundred. Mac couldn't remember the precise number.
Rules of thumb they have learnt: Don't code for a specific number of resources, don't inline or unroll loops, don't add synthesis directives too soon. Basically you may just be forcing non-optimal synthesis. It is sometimes better to organize the SystemC close to the hardware partitioning so that it is possible to use HLS directives on smaller structures and avoid exponentially increasing compilation times.
The summary of the advantages of this approach is:
AdaptIP (and their radio partner) will be showing this design running in FPGAs at DAC, so go along and check it out. Talking of DAC, Mac will be next year's DAC chair, which also means he is this year's vice-chair. Chuck Alpert of Cadence is this year's chair.
Get more details on Cadence's Stratus HLS product.
*Answer to the pop quiz, where does 8191 come from. If you don't know, you are not a computer scientist. 8192 is 2^13 and so 8191 is the largest number that can be represented in 13 bits. I'm guessing that connection 0 indicates something special and is not used.
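For anyone who wants to check the arithmetic, a few lines of Python will do it. The variable names are made up for illustration, and treating identifier 0 as reserved is only the guess made above, not something taken from the standard.

```python
ID_BITS = 13  # width of the connection identifier field, per the pop quiz

# Largest value that fits in ID_BITS bits: all thirteen bits set.
max_id = (1 << ID_BITS) - 1
print(max_id, bin(max_id))

# Assumption (the guess above): identifier 0 is reserved, so 1..max_id are usable.
usable_ids = range(1, max_id + 1)
print(len(usable_ids), "usable connection identifiers")
```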