Care for Some Gates With Your Server?

17 Feb 2016 • 4 minute read

One of the big themes from the Linley Data Center Conference earlier this month was the need to get more performance out of each server without increasing its power requirements. Adding processing power, whether as more cores per server or as more servers, also increases power consumption, so any incremental improvement in performance comes at the price of increased power. In addition, many algorithms run out of steam once they have "enough" cores or servers. Very large data centers have an almost unlimited number of servers, so it is hard to make any particular algorithm run faster just by adding more of them, even ignoring power budget issues. Google search is not going to run faster by adding another thousand processors (which Google does pretty much every day, by the way).

So the big problem is how to get more performance out of each server. Clock rates stopped increasing nearly a decade ago, and we switched to delivering additional compute power using multiple cores. Now we are running into another wall: simply adding more cores doesn't speed up an algorithm enough to be interesting, and using a lot of cores consumes more power. A large part of the cost of ownership of a large data center is power, both the power needed for all the servers and routers and the power for the air conditioning to get the heat back out again. Simply adding more cores, or more servers, is neither the most power-efficient nor the fastest way to implement many algorithms.
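The diminishing returns from adding cores can be illustrated with Amdahl's law. Nobody at the conference put it in these terms as far as I recall, so treat this as my own sketch: the function name and the 95% parallel fraction are assumptions picked purely for illustration.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup under Amdahl's law.

    parallel_fraction: share of the runtime that can be spread across cores
                       (0.95 below is an illustrative assumption, not a measurement).
    cores:             number of cores working on the parallel part.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part never shrinks; only the parallel part is divided up.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    for cores in (1, 2, 4, 8, 16, 64, 1024):
        print(f"{cores:5d} cores -> {amdahl_speedup(0.95, cores):6.2f}x")
```

With 5% of the work stuck in serial code, the speedup can never reach 20x however many cores are added, while the power bill keeps growing with every core.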

Some algorithms, such as data compression and encryption, are sufficiently ubiquitous that most processors already include specialized units to handle them faster and more power-efficiently, offloading the main cores at the same time. During the panel session at the end of the second day, and in various other presentations, it was pointed out that small amounts of data must be handled inside the processor core, since that work needs to run in lock step with the core rather than being sent out for processing with the result coming back asynchronously later. But for large amounts of data, it makes more sense to have a separate compression processor or encryption processor. For more complex algorithms like image and speech recognition, which might require neural nets, it makes even more sense to use a specialized offload processor designed specifically for that algorithm or class of algorithms. There are various approaches, with different levels of flexibility.
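The small-versus-large distinction comes down to a break-even point. Here is a minimal sketch of that trade-off, assuming a simple linear cost model in which offloading pays a fixed round-trip latency and then processes data at a higher rate. The function names and all the numbers (1 GB/s in the core, 10 GB/s in the accelerator, 10 microseconds round trip) are hypothetical and were not quoted by any speaker.

```python
def core_time(n_bytes: float, core_bytes_per_s: float) -> float:
    """Seconds to process n_bytes in the main core (no hand-off cost)."""
    return n_bytes / core_bytes_per_s


def offload_time(n_bytes: float, accel_bytes_per_s: float, round_trip_s: float) -> float:
    """Seconds to process n_bytes in an accelerator, including a fixed round trip."""
    return round_trip_s + n_bytes / accel_bytes_per_s


def break_even_bytes(core_bytes_per_s: float, accel_bytes_per_s: float,
                     round_trip_s: float) -> float:
    """Data size above which offloading is faster than staying in the core."""
    if accel_bytes_per_s <= core_bytes_per_s:
        return float("inf")  # the accelerator never catches up
    return round_trip_s / (1.0 / core_bytes_per_s - 1.0 / accel_bytes_per_s)


if __name__ == "__main__":
    # Hypothetical figures, chosen only to show the shape of the trade-off.
    CORE, ACCEL, LATENCY = 1e9, 10e9, 10e-6
    print(f"break-even: {break_even_bytes(CORE, ACCEL, LATENCY):.0f} bytes")
    for size in (1_000, 100_000, 10_000_000):
        faster = "offload" if offload_time(size, ACCEL, LATENCY) < core_time(size, CORE) else "core"
        print(f"{size:>10d} bytes -> {faster}")
```

Under these made-up numbers the crossover is at roughly 11KB: anything smaller is better handled in the core, and anything much larger is better sent to the offload processor.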

The most common approaches to offloading specialized algorithms are:

- Use an FPGA, covered in more detail below
- Use a GPU, especially one from NVIDIA, where the entire CUDA environment makes it easier to program
- Use a specialized reprogrammable processor, such as the Tensilica cores for audio processing, video processing, and recognition
- Build custom hardware for a very specific task. Although this is the most efficient approach of all, it is also the most inflexible, and so it locks in the algorithm for the lifetime of the server, which is several years

Alex Grbic of "Altera now part of Intel" talked about FPGA Breakthroughs for Data Center Acceleration. In a June 2015 investor presentation, Intel forecast that up to one-third of cloud service provider nodes would use FPGAs by 2020. There are two primary FPGA suppliers, Altera and Xilinx. Altera's next-generation arrays were always planned to use Intel as a foundry, and Intel subsequently acquired Altera, meaning that, in principle, processors and FPGAs could eventually be on the same die. Xilinx has FPGAs and processors on the same die already, but they are ARM processors. Xilinx's success in adding FPGA acceleration to the data center is therefore closely bound up with how successful ARM turns out to be in the data center server market, which I will write about later this week.

Since Intel servers are the standard in data centers, at least for the time being, the big promise of the Altera acquisition is that the way that FPGAs and processors are used together could also become standard. Even before worrying about putting FPGA and cores on the same die, they can be put in the same package. Indeed, Alex announced (or rather pre-announced with minimal details) a product with an Intel Xeon and an Altera FPGA in the same package. This will be available starting in Q1 2016 (so I expect a more formal announcement is imminent). It is targeted at the largest cloud service providers for algorithm acceleration.

Alex pointed out that there are four breakthroughs that make FPGA in the cloud more accessible than before:

Programmer-friendly acceleration
- Use OpenCL for programming
- Channels/pipe extensions
DSP breakthroughs
- Hard FPGA floating-point (single-precision) multipliers and adders
- Up to 8+ TFLOPS with Stratix 10
- Vector modes
New FPGA architecture
- Hyperflex Architecture
- Up to 2X faster clock speed vs previous generation
High-bandwidth memory (HBM) can be integrated in the same package as an FPGA
- 256GB/s per interface vs 85GB/s with 4 DDR4 banks
- 2/3 lower energy per bit accessed vs DDR
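The HBM numbers on the slide are easy to sanity-check. This is a back-of-the-envelope sketch, assuming that "4 DDR4 banks" means four 64-bit DDR4 channels at about 2666 MT/s. That is my assumption, since the talk did not specify the memory speed, and the function names are just for illustration.

```python
def ddr_channel_gb_per_s(mega_transfers_per_s: float, bus_bytes: int = 8) -> float:
    """Peak bandwidth of one DDR channel in GB/s (64-bit bus = 8 bytes per transfer)."""
    return mega_transfers_per_s * 1e6 * bus_bytes / 1e9


def bandwidth_ratio(hbm_gb_per_s: float, ddr_gb_per_s: float) -> float:
    """How many times more bandwidth one HBM interface offers than the DDR setup."""
    return hbm_gb_per_s / ddr_gb_per_s


HBM_GB_PER_S = 256.0        # per interface, from the slide
DDR4_SLIDE_GB_PER_S = 85.0  # four DDR4 banks, from the slide
ASSUMED_DDR4_MT_PER_S = 2666  # assumption: not stated in the talk

four_channels = 4 * ddr_channel_gb_per_s(ASSUMED_DDR4_MT_PER_S)
energy_fraction = 1.0 - 2.0 / 3.0  # "2/3 lower" leaves one-third of the energy per bit

if __name__ == "__main__":
    print(f"4 x DDR4-{ASSUMED_DDR4_MT_PER_S}: {four_channels:.1f} GB/s")
    print(f"HBM vs slide DDR4 figure: {bandwidth_ratio(HBM_GB_PER_S, DDR4_SLIDE_GB_PER_S):.2f}x")
    print(f"HBM energy per bit vs DDR: {energy_fraction:.2f}x")
```

Under that assumption the four channels come to about 85GB/s, which matches the slide, so a single HBM interface delivers roughly three times the bandwidth at about a third of the energy per bit.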

Alex reckons that this approach gives 2-5X better performance/watt compared to using a GPU. He didn't say how it compares to just using the main processor core(s), but presumably the advantage is a lot larger, since the GPU approach is already a more efficient solution than the cores alone. It remains to be seen whether Intel's projection that up to one-third of cloud service provider nodes will contain an FPGA accelerator turns out to be true. But since Intel already has very deep relationships with the major cloud service providers (for example, they make custom cores for Amazon), it is probably grounded in reality and not just a marketing person's dream put on a PowerPoint slide.