Paul McLellan

MLPerf: Benchmarking Machine Learning

16 Sep 2019 • 6 minute read

Most presentations at the recent HOT CHIPS conference are about actual chips, mostly processors of one kind or another. But in machine learning, there is a need to be able to compare these processors. This is where MLPerf comes in, putting together a suite of benchmarks for machine learning.

Before ImageNet came along, the situation was even worse, since we didn't even have good data that was commonly shared. (See my post ImageNet: The Benchmark that Changed Everything.) But, to be honest, I didn't name that post correctly, since ImageNet isn't really a benchmark: it is a large labeled dataset that can be used as the basis for benchmarking algorithms such as the various flavors of ResNet. In turn, these networks and data can be used as a pseudo-benchmark for deep learning processors.

The problem is that there are too many moving parts, and it is not clear whether it is an apples-to-apples comparison when two processors claim a certain number of image inferences per second. The way ImageNet changed things was by forming the basis for an image-recognition competition, focused on the software and on accuracy. Once ImageNet and the ILSVRC competition were created, it didn't take long for recognition algorithms to go from being poor to being better than humans.

Processor Benchmarks

Back in 1972, general-purpose processors were having the same problem. MIPS (millions of instructions per second) is too inaccurate a measure without actually running some code, and it makes a big difference which code you run: whether there is floating point, whether vector units are being used, and so on. In England, the Whetstone benchmarks were created. They were initially written in Algol-60, using a compiler that had been created at English Electric in Whetstone in Leicestershire (pronounced lester-shire). The benchmark was translated into the more widely available Fortran but immediately ran into the issue that Fortran compilers were really good and optimized away some of the benchmark instructions, since they didn't contribute to the output of the program. This is a general problem with synthetic benchmarks of this kind, which attempt to force the compiler to output a statistically representative mix of instructions so that it is the underlying hardware being measured, not primarily the quality of the compiler. Nonetheless:

The Fortran Whetstone programs were the first general-purpose benchmarks that set industry standards of computer system performance.

Whetstone primarily measures floating-point performance. In 1984, the Dhrystone benchmark was created to measure integer performance. The name Dhrystone and its odd spelling are just wordplay on Whetstone; there is no place named Dhrystone. It was originally developed in Ada and then ported to C. Dhrystone suffered from some of the same compiler issues as Whetstone: code whose results were never used would be eliminated. The creators put a lot of effort into trying to force compilers not to do that, and some compiler options can also be used.

The Standard Performance Evaluation Corporation produced another benchmark for integer performance, SPECint. Unlike Whetstone and Dhrystone, it is not a synthetic benchmark: it consists of 12 largish programs that do things like playing Go or running the Simplex linear-programming optimization algorithm. Like other attempts at hardware benchmarks, it again suffers from smart compilers, especially ones that vectorize, producing anomalous results. However, SPECint is probably the most widely used processor benchmark today, and most general-purpose processors being announced at HOT CHIPS had SPECint numbers.

MLPerf

At HOT CHIPS, Peter Mattson, the general chair of MLPerf, presented ML Benchmark Design Challenges.

The problem MLPerf is addressing was well summed up by Cliff Young of Google in his Linley keynote late last year:

Hey, I got a Resnet-50, might not be the same code as yours but we’ll call it Resnet-50 anyway, and here are our numbers.

MLPerf is "A machine learning performance benchmark suite with broad industry and academic support." There are about 100 members (including Cadence).

The benchmarks are divided up by the type of network and the application: vision, speech, language, commerce, and research (such as GANs). One aim was to make use of whatever public datasets were available, since "good now is better than perfect later". Training and inference are benchmarked separately.

Training

The image shows how training is done. On the left is a public dataset such as ImageNet. In the middle is the model to be trained, and there is a target quality level. These quality levels will increase as the state of the art advances. There are two separate divisions. One, called closed, is intended to be a pure apples-to-apples comparison of hardware, so the model is completely specified. That's an exaggeration, since there are still minor differences, such as whether the hardware uses floating point. The datasets and the models used are in this table.
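
To make the closed-division methodology concrete, here is a minimal sketch (my own illustration, not MLPerf reference code) of the time-to-train idea: train a fixed model on a fixed dataset until a target quality is reached, and the benchmark score is the wall-clock time that took. The toy perceptron and generated dataset stand in for the real models and datasets.

```python
# Toy sketch of closed-division training: fixed model, fixed data,
# train until a quality target is hit, report the time it took.
import random
import time

TARGET_ACCURACY = 0.97  # stand-in for the benchmark's quality target


def make_dataset(n=2000, seed=0):
    """Toy 2D binary-classification data standing in for ImageNet et al."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append(((x1, x2), 1 if x1 + x2 > 0 else 0))
    return data


def accuracy(w, b, data):
    correct = sum(1 for (x, y) in data
                  if (w[0] * x[0] + w[1] * x[1] + b > 0) == (y == 1))
    return correct / len(data)


def train_to_target(data, lr=0.1):
    """Perceptron-style updates, looping until the quality target is met."""
    w, b = [0.0, 0.0], 0.0
    start = time.perf_counter()
    epochs = 0
    while accuracy(w, b, data) < TARGET_ACCURACY:
        epochs += 1
        for (x, y) in data:
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = y - pred  # -1, 0, or +1
            w[0] += lr * err * x[0]
            w[1] += lr * err * x[1]
            b += lr * err
    return epochs, time.perf_counter() - start


if __name__ == "__main__":
    epochs, seconds = train_to_target(make_dataset())
    print(f"hit {TARGET_ACCURACY:.0%} accuracy after {epochs} epoch(s) "
          f"in {seconds:.3f}s; the benchmark score is that time")
```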

 The open division includes innovation in the model, too, so it is measuring not just the underlying hardware performance but the entire learning system.

Another issue is that different batch sizes produce different performance, but working out the optimal batch size (and the other hyperparameters) is not really the point of the benchmark, and doing so can be very expensive, requiring a large number of runs. So, for the time being, there is "hyperparameter borrowing" during the review process, where known good values can be shared among different implementations.

Inference

The inference benchmarks work in a similar way. Data is pulled from the public dataset (such as an image from ImageNet) and run through the trained network (such as ResNet), producing a result (what is in the image, in this case). Again, there is a target accuracy to be achieved, and again there are open and closed divisions. In the closed division, the whole network is specified as part of the benchmark (including the weights), and just the hardware performance is being measured. In the open division, the network is not specified and innovation is allowed.

There are actually four scenarios for inference, each with different characteristics (two of them are sketched in code after this list):

  • Single stream (e.g., cellphone augmented vision): Quality measure is latency.
  • Multiple stream (e.g., driving with multiple cameras): Quality measure is the number of streams.
  • Server (e.g., translation service): Quality measure is queries per second.
  • Batch (e.g., classifying a dataset of thousands of photos): Quality measure is throughput.
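
As a rough illustration of how the same trained model gets measured differently across those scenarios, here is a minimal sketch (mine, not the MLPerf LoadGen) of the single-stream and batch cases: the first reports a per-query latency percentile, the second reports overall throughput.

```python
# Toy sketch of two inference scenarios: single-stream measures per-query
# latency, batch measures samples processed per second.
import statistics
import time


def fake_inference(sample):
    """Stand-in for running one input through a trained network."""
    return sum(i * i for i in range(1000)) + sample  # burn a little CPU


def single_stream(samples):
    """Issue one query at a time; the metric is a latency percentile."""
    latencies = []
    for s in samples:
        t0 = time.perf_counter()
        fake_inference(s)
        latencies.append(time.perf_counter() - t0)
    return statistics.quantiles(latencies, n=100)[89]  # 90th percentile


def batch(samples):
    """Process the whole set as fast as possible; the metric is throughput."""
    t0 = time.perf_counter()
    for s in samples:
        fake_inference(s)
    return len(samples) / (time.perf_counter() - t0)


if __name__ == "__main__":
    data = list(range(1000))
    print(f"single-stream p90 latency: {single_stream(data) * 1e6:.1f} us")
    print(f"batch throughput: {batch(data):.0f} samples/s")
```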

Another big thing in inference is compressing the model. For now, the rules they are using are as follows (a toy quantization sketch follows this list):

  • Model must be mathematically equivalent to the one supplied (which is FP32)
  • Quantization is allowed (e.g., 8-bit), but must be mathematically honest and not hand trained for the specific benchmark
  • No retraining (in the closed division)
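
As a concrete (and deliberately tiny) example of the kind of quantization the closed division allows, here is a sketch of symmetric post-training int8 quantization of FP32 weights. This is my own illustration under those rules, not MLPerf code: a single per-tensor scale, applied without any retraining.

```python
# Toy post-training quantization: map FP32 weights to int8 codes with one
# per-tensor scale, so w ~= scale * q with q in [-127, 127]. No retraining.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid scale == 0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    return [scale * q for q in codes]


if __name__ == "__main__":
    fp32_weights = [0.81, -0.33, 0.002, -1.20, 0.0, 0.57]
    codes, scale = quantize_int8(fp32_weights)
    approx = dequantize(codes, scale)
    print("int8 codes:", codes)
    print("max error:", max(abs(a - b) for a, b in zip(fp32_weights, approx)))
```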

Note that the no-retraining constraint makes it harder to optimize for sparse matrices with lots of zeros. One technique that is used is to force weights that are nearly zero to be exactly zero, and then retrain the network to adjust the other weights to take account of that.
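
For illustration, here is what that pruning step looks like in isolation; this is a toy sketch of the technique described above, not anything from MLPerf. In practice, it is the retraining (fine-tuning) pass that follows which the closed-division rule disallows.

```python
# Toy magnitude pruning: force weights below a threshold to exactly zero,
# producing the sparsity that specialized hardware can exploit.
def prune_small_weights(weights, threshold=0.05):
    pruned = [0.0 if abs(w) < threshold else w for w in weights]
    sparsity = sum(1 for w in pruned if w == 0.0) / len(pruned)
    return pruned, sparsity


if __name__ == "__main__":
    layer = [0.8, -0.01, 0.03, -0.6, 0.002, 0.4, -0.04, 0.0]
    pruned, sparsity = prune_small_weights(layer)
    print(pruned)  # small weights are now exactly zero
    print(f"{sparsity:.0%} of the weights can be skipped")
```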

Another variable, especially in inference, is the scale of the design. Some designs are scalable (you can use different numbers of chips depending on the throughput needed). For now, scale is very simply just the number of chips. This is an area that needs more work. Another obvious benchmark value is power, and this is planned for the future, too.

Improvements

One reason for having benchmarks is simply to compare implementations. But another is to drive improvement over time, since that makes it easier to see which changes are really effective and which are marginal.

The graph above shows the improvements in the fastest 16-chip system tested (there were no hints as to whose design this was) between the v0.5 release of the benchmarks and the v0.6 release. Note that the quality targets increased between the two versions, so this is actually roughly a doubling of performance in six months.

Peter wrapped up saying:

It's Agile benchmarking: we launched, we’re iterating, we have more to do.

More Details

All things MLPerf are on the website mlperf.org. They are also creating a non-profit called MLCommons, which will be the future home not just of MLPerf but also of public datasets, best practices, and outreach, with the aim to "accelerate ML innovation and increase its positive impact on society."

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.