The Latest MLPerf Results for Inference

14 May 2021 • 3 minute read

Just before the recent Linley Spring Processor Conference 2021, MLPerf released its latest round of benchmark results (just for inference). Until now, MLPerf benchmarks have not taken power efficiency into account. Even in the data center, where power and cooling are among the biggest costs of ownership, this is shortsighted. For edge inference devices, it is clearly silly. You are simply not going to put a 300W processor in your pocket, no matter how many TOPS it provides to your smartphone.

The benchmarks were based on six widely available and fairly well known models in different application areas:

Image Classification with ResNet-50
Object Detection with SSD-ResNet34
Medical Image Segmentation with 3D UNET
Speech-to-Text with RNNT
Language Processing with BERT
Recommendation Engines with DLRM

These six benchmarks were also run in four different ways, as shown in this table:

Single stream processes just one stream of data, and the next query is sent as soon as the previous one is done. Multiple stream sends a new query at regular intervals. Server sends queries at random intervals (Poisson-distributed arrivals). Offline makes all the data available at the beginning so that it can potentially be batched up or pipelined.
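To make the difference between the four modes concrete, here is a minimal Python sketch of when queries would be issued in each one. It is only an illustration based on the descriptions above, not MLPerf's actual load generator; the function names and parameters (latencies, interval, rate_qps) are hypothetical.

```python
import random

def single_stream(latencies):
    """Issue each query as soon as the previous one has completed."""
    t, issue_times = 0.0, []
    for latency in latencies:
        issue_times.append(t)
        t += latency  # the next query waits for this one to finish
    return issue_times

def multi_stream(n_queries, interval):
    """Issue a new query at a fixed, regular interval."""
    return [i * interval for i in range(n_queries)]

def server(n_queries, rate_qps, seed=0):
    """Poisson arrivals: gaps between queries are exponentially distributed."""
    rng = random.Random(seed)
    t, issue_times = 0.0, []
    for _ in range(n_queries):
        t += rng.expovariate(rate_qps)  # mean gap is 1 / rate_qps seconds
        issue_times.append(t)
    return issue_times

def offline(n_queries):
    """Make every query available at time zero, so it can be batched."""
    return [0.0] * n_queries
```

The only thing that changes between modes is the arrival pattern, which is why the same model and hardware can look quite different from one mode to the next.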

Submitters to this round include Alibaba, Centaur Technology, Dell Technologies, EdgeCortix, Fujitsu, Gigabyte, HPE, Inspur, Intel, Lenovo, Krai, Mobilint, Neuchips, NVIDIA, Qualcomm Technologies, Supermicro, and Xilinx. There are several obvious names missing, such as Google, which didn't submit to this round. Also, none of the fabless startups submitted, despite being members of MLCommons, the consortium behind MLPerf.

There were a total of 1,994 systems submitted, with 850 for the energy efficiency benchmark. Despite the large numbers, most of the systems were actually neural networks implemented on NVIDIA GPUs, so it is not surprising that NVIDIA had the highest performance. When energy efficiency is considered, though, the picture changes: NVIDIA is fast but power-hungry.

Energy efficiency was measured by running the models/hardware for ten minutes and averaging the power, which evens out any power management such as DVFS that might be in use, not to mention variations in the power as different data is processed by the model.
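As a rough illustration of why averaging over a long window matters, the sketch below turns a series of power readings into an average power and an energy-per-query figure. It assumes evenly spaced samples, and the function names and numbers are hypothetical; it is not MLPerf's actual measurement methodology.

```python
def average_power_watts(samples_watts):
    """Mean of evenly spaced power readings taken over the run."""
    return sum(samples_watts) / len(samples_watts)

def energy_per_query_joules(samples_watts, duration_s, queries_completed):
    """Total energy (average power x time) divided by the work done."""
    total_energy_j = average_power_watts(samples_watts) * duration_s
    return total_energy_j / queries_completed

# Hypothetical ten-minute run: the power swings as DVFS kicks in and as
# different data is processed, but the average smooths that out.
samples = [100.0, 200.0, 300.0]
print(average_power_watts(samples))                    # 200.0
print(energy_per_query_joules(samples, 600.0, 60000))  # 2.0
```

A single instantaneous reading of 100W or 300W would give a very different impression from the 200W average.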

Of course, there needs to be a quality standard associated with the benchmarks, since machine learning has a probabilistic aspect to it, unlike a normal processor benchmark like Dhrystone or SPEC, where the benchmark has to get the precisely correct answer. The table below shows how these quality standards were set, in general at 99% accuracy compared to a reference implementation using FP32 (32-bit floating point, what you might think of as "full accuracy"), and a latency constraint as to how long the model could take to produce each result.
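The two-part quality check can be sketched in a few lines of Python. This is only an illustration of the idea: the function names, the accuracy figures, and the latency numbers are hypothetical, and the latency check takes the strictest reading (every result within the bound), which may not be how the actual rules define it.

```python
def meets_accuracy_target(accuracy, fp32_reference_accuracy, target_ratio=0.99):
    """True if accuracy is at least target_ratio of the FP32 reference."""
    return accuracy >= target_ratio * fp32_reference_accuracy

def meets_latency_bound(latencies_s, bound_s):
    """True if every result was produced within the latency bound."""
    return max(latencies_s) <= bound_s

# Hypothetical quantized model checked against a hypothetical FP32 reference.
reference = 0.80
print(meets_accuracy_target(0.795, reference))  # True
print(meets_accuracy_target(0.790, reference))  # False
print(meets_latency_bound([0.010, 0.012, 0.015], bound_s=0.015))  # True
```

A result only counts if both checks pass, which stops a submission from buying speed by giving up accuracy.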

The benchmarks were also run in two other ways: closed and open. Closed means that the precise reference model (including all the weights) has to be run, so it is purely a test of the system's implementation. Open means that the reference model can be adjusted or even retrained, provided all the accuracy standards continue to be met.

Results

Here are all the results (big spreadsheets which tell you more than you probably want to know). Note that there are four menu items at the top for closed, open, closed+power, open+power. So there are really four spreadsheets in each of these:

MLPerf has many rules about what anyone is allowed to say, mostly about what the companies producing the entries are allowed to claim. Cadence did not submit anything, so we are not claiming anything, and I'll just stick to giving you the links to the results so that you can see the raw data if you are interested. Or there is a good summary in the IEEE Spectrum article "These Might Be the Fastest (and Most Efficient) AI Systems Around".
