Paul McLellan

Artificial Intelligence...and Artificial Performance

3 Jun 2020 • 9 minute read

Do you know what this is? It's a benchmark.

The Ordnance Survey (OS) of Britain created around half a million of these, and the horizontal line at the top of each one is at a known height above sea level. Actually, it is a height above ODN, the Ordnance Datum Newlyn, defined as the mean sea level between 1915 and 1921 at Newlyn (in Cornwall, in the far south-west of England, not far from where my Dad lives). An angled piece of metal, known as a "bench", could be inserted into the top horizontal mark, hence it became known as a "bench mark", and later it became a single word. Some of the benchmarks date back to 1831.

There are 190 of these benchmarks that are Fundamental Bench Marks, or FBMs. These are the only benchmarks still used by the OS; the other half million are no longer maintained (and may be inaccurate due to subsidence, missing due to road building, and so on). The FBMs are used for GPS correction and what the OS calls "heighting".

Benchmarks Today

Of course, today, we use the word "benchmark" to mean something closer to a standard way of comparing one thing to another, or perhaps of comparing something to a version that is held up as the best or as perfection.

Today's post is going to look at the issues with benchmarking neural networks running on different hardware.

General-purpose processors faced a similar problem. In the early days, when everything used the same (usually x86) architecture, clock frequency was a reasonable surrogate. But it didn't allow comparisons between different instruction sets or different system architectures (the size and levels of cache, for example). As a result, standard benchmarks were created, starting in 1972 with Whetstone, which measured floating-point performance. That was followed by Dhrystone for integer performance in 1984. SPECint is one of the most heavily used today (SPEC stands for the Standard Performance Evaluation Corporation); it has been in use since 1992, although the most significant version dates from 2006.

Neural network processors face the same kind of definitional problem that network accelerator chips and general-purpose processors did 25 years ago. What does 1 TOPS (tera operations per second) mean? What does it mean to say your network processes ResNet-50 at 1000 images per second? Is 4 TOPS in 5 Watts better than 1 TOPS in 1 Watt?

The Problem for Neural Nets

Google's Cliff Young laid out the problem at a Linley keynote I saw him give (see my post Google TPU Software for the full post). He was discussing how people typically compare implementations:

Hey I got a ResNet-50, might not be the same code as yours but we’ll call it ResNet-50 anyway, and here are our numbers.

This post will take a look at how people compare performance. I'm only going to look at inference-at-the-edge in SoCs, as opposed to training and cloud-based inference. Training is almost always done in cloud data centers using either GPUs or perhaps something more specialized, and much of the focus is on the cost (in $) to train a given model. But inference has to be done in chips that may be running on batteries and may be in cheap consumer products. There is usually neither the power nor the area budget to be doing floating-point inference. I think most people have been surprised by just how much we can cut back neural networks in terms of precision of the operations without losing overall accuracy of the network. I certainly have been surprised. Also, a lot of the connections between nodes that have a marginal effect can be pruned. In effect, weights that are close to zero can simply be zeroed out completely and ignored with minimal impact on accuracy.
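Pruning is easy to illustrate. Here is a minimal NumPy sketch (illustrative code, not any particular framework's API) that zeroes out near-zero weights and reports the resulting sparsity, which is exactly what a sparsity-aware inference engine can exploit:

```python
import numpy as np

def prune_small_weights(weights, threshold=0.01):
    """Zero out weights whose magnitude is below the threshold.

    The zeroed weights contribute nothing to the output, so a
    sparsity-aware engine can skip those MAC operations entirely.
    """
    pruned = np.where(np.abs(weights) < threshold, 0.0, weights)
    sparsity = 1.0 - np.count_nonzero(pruned) / pruned.size
    return pruned, sparsity

# Toy layer: after training, many weights cluster near zero
rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)
w_pruned, sparsity = prune_small_weights(w)
print(f"sparsity after pruning: {sparsity:.1%}")
```

The threshold here (0.01) is arbitrary; in practice it is chosen, and the network often retrained, so that the overall accuracy is preserved.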

Today, I think there are four main ways that people compare designs:

  • Standard networks such as ResNet-50 or Inception-v3
  • Raw measures such as TOPS or GFLOPS
  • TOPS modulated by what you have to pay to get them, so adding in some measurement of power or area
  • Benchmarks such as MLPerf

Standard Networks

The problem with standard networks like ResNet-50 is that they are not actually standard. These networks are typically trained in the cloud using 32-bit floating-point operations, but they then need to be mapped onto the inference engine. This mapping typically involves reducing the precision and finding additional zero values.
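As a sketch of what that precision reduction can look like, here is a simple symmetric 8-bit quantization of a weight tensor (an illustrative scheme; production tools use more careful calibration and often per-channel scales):

```python
import numpy as np

def quantize_int8(weights):
    """Map float32 weights to int8 using one symmetric scale factor."""
    scale = np.max(np.abs(weights)) / 127.0   # largest magnitude maps to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values to check the quantization error."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).normal(scale=0.05, size=(64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
print("worst-case rounding error:", float(np.max(np.abs(dequantize(q, scale) - w))))
```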

For the image classification networks, the usual measure of goodness is how many images or frames per second they can classify. Image classification is in a particularly strong position due to ImageNet, a standard database of millions of annotated images that can be used as the "big data" to test any recognition network (see my post ImageNet: The Benchmark that Changed Everything).

Another issue with image classification is whether the images are handled in bulk ("classify these million photographs") or one at a time ("classify what my phone is looking at right now"). There are two important differences between these two setups. One is that in the bulk case images can be handled in parallel, starting one image before the previous one is finished. The other is that in the bulk case, more images per second is always better. The video example is different. An image arrives 30 times per second and has to be classified there and then. There is no real advantage if the network can classify 40 images per second when processing 30fps video.
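A back-of-the-envelope sketch of the difference (all numbers invented for illustration):

```python
# Bulk classification: throughput is what matters.
images = 1_000_000
throughput_fps = 1000                      # images classified per second (illustrative)
print(f"bulk job finishes in {images / throughput_fps / 3600:.2f} hours")

# Live video: per-frame latency is what matters.
video_fps = 30
frame_budget_ms = 1000 / video_fps         # ~33 ms to handle each frame
latency_ms = 1000 / 40                     # a network capable of 40 images/s
print(f"frame budget {frame_budget_ms:.1f} ms, per-image latency {latency_ms:.1f} ms")
# Once every frame is classified within its 33 ms budget, being able
# to run at 40 fps instead of 30 fps buys nothing.
```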

Raw Measures

For inference, the raw measure is typically TOPS, trillions of operations per second. However, it matters a lot what the operation is, since inference is most often done with 8-bit operations but can use 4-bit or even 2-bit. Or it can use 16-bit fixed point, or the 16-bit "brain" floating-point format BFLOAT16, which has the same number of exponent bits as FP32 (to get the dynamic range) but many fewer mantissa bits than FP16. The size of the operation matters a lot for two reasons. One is simply that the smaller sizes require fewer resources (area, power). The second is that the MACs might be reconfigurable. An example would be a MAC that could do one 16-bit fixed-point operation or be reconfigured to do two 8-bit fixed-point operations with the same hardware. That is twice the headline TOPS number. Taking it down to 2-bit means a headline number eight times as big. But a 16-bit MAC operation is not the same thing at all as eight 2-bit MAC operations.

Another place where there can be debate is whether an operation is a full multiply-accumulate, or whether an 8-bit multiply followed by a 16-bit add counts as two operations. You can end up with 1 TMAC/s being 2 TOPS or 1 TOPS depending on your definition.
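The arithmetic behind a headline TOPS number is simple, which is exactly why the definitions matter. Here is a sketch that makes the assumptions explicit (the MAC count, clock, and reconfiguration scheme are all illustrative):

```python
def headline_tops(num_16bit_macs, clock_ghz, bit_width=8, ops_per_mac=2):
    """Headline TOPS for a MAC array.

    Assumes each 16-bit MAC can be split into (16 // bit_width) narrower
    MACs, and that one MAC counts as ops_per_mac operations (multiply
    plus add). Both are exactly the kind of assumption a datasheet may
    or may not spell out.
    """
    effective_macs = num_16bit_macs * (16 // bit_width)
    return effective_macs * ops_per_mac * clock_ghz * 1e9 / 1e12

base = headline_tops(512, clock_ghz=1.0, bit_width=16)        # 16-bit fixed point
print(headline_tops(512, 1.0, bit_width=8) / base)             # 2x the headline number
print(headline_tops(512, 1.0, bit_width=2) / base)             # 8x the headline number
print(headline_tops(512, 1.0, bit_width=8, ops_per_mac=1))     # count a MAC as one op: headline halves
```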

Counting Power and Area

Without knowing what "operations" are being counted, TOPS is not a very good measure. But it is worse than that, since it is not what potential users care about when they are evaluating IP. They care about all of PPA: performance, power, and area. So much better measures are TOPS/W ("TOPS per watt"), which takes power into account, and TOPS/mm2 ("TOPS per square millimeter"), which takes area into account. For this sort of processor, where the number of MACs drives both power and area, these two measures will often move together: more MACs means more area, and means more power, too.
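A toy comparison shows how the normalized metrics change the picture (numbers invented, echoing the earlier "4 TOPS in 5 Watts versus 1 TOPS in 1 Watt" question):

```python
# Raw TOPS alone favors IP_B; TOPS/W and TOPS/mm2 tell a different story.
candidates = {
    "IP_A": {"tops": 1.0, "watts": 1.0, "mm2": 1.0},
    "IP_B": {"tops": 4.0, "watts": 5.0, "mm2": 8.0},
}

for name, c in candidates.items():
    print(f"{name}: {c['tops']:.0f} TOPS, "
          f"{c['tops'] / c['watts']:.2f} TOPS/W, "
          f"{c['tops'] / c['mm2']:.2f} TOPS/mm2")
```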

Benchmarks

There have been a number of benchmarks in the past, such as Stanford's DAWNBench. These have been discontinued and their experience rolled into MLPerf, which is the industry-wide "standard" benchmark suite today. Everyone who has anything to do with neural networks, including Cadence, is a member.

The high-level goals are:

  • Agile development because ML is changing rapidly
  • Serve both the commercial and research communities
  • Enforce replicability to ensure reliable results
  • Use representative workloads, reflecting production use-cases
  • Keep benchmarking effort affordable (so all can play)

I wrote about MLPerf last year in my post MLPerf: Benchmarking Machine Learning. Note that MLPerf comes in two flavors, "open" and "closed". In the "closed" part, the measure is just the raw performance of the hardware so the weights are fully specified. In the "open" part, any optimization is allowed, such as reducing the bit lengths (known as quantization).

MLPerf has been described to me as "a good start". There has been more of an emphasis on training (as opposed to inference). I think that this is because academics care mostly about training. They care about inference as a metric for measuring how good the network is. When designing a real device with an SoC inside, the tradeoffs are different. A tiny increase in accuracy is not attractive if it comes with a massive increase in power or area (or both). For now, the MLPerf suite of benchmarks for inference is limited to four for vision (two versions of two networks), and one for translation.

Tensilica DNA 100 Processor

Cadence has been providing AI inference IP since 2016, when the Tensilica Vision P6 DSP was introduced. After that, Cadence introduced the Vision Q7 DSP and the DNA (Deep Neural-network Accelerator) products.

Vision DSPs can be used if customers are looking for a multi-purpose solution. For example, if the goal is to run vision workloads most of the time and also run some AI workloads, a Vision DSP may do the job. If the goal is a high-performance, low-power, and highly scalable AI solution, DNA IP may do the job. The DNA IP is not designed to run computer vision or imaging workloads; rather, it is optimized to run AI workloads.

DNA optimizes for sparsity, looking for zeros in the inference data and in the weights, and dynamically avoiding the multiplications by zero. This is a big gain in energy efficiency. Comparing TOPS/W, the DNA IP is 5X better than a DSP; comparing perf/area, the DNA IP is 2X better than a DSP.

Cadence offers the Vision Q7 DSP, which provides 512 8-bit MACs (about 1 TOPS), and also offers DNA IP that provides 512 8-bit MACs (also about 1 TOPS). If one looks only at the raw MACs available, the two IPs are equivalent, but they serve different purposes. The DNA IP with 512 8-bit MACs provides 5X higher perf/W and more than 2X higher perf/mm2, and delivers more than 2.5X the raw inferences per second on a ResNet-50 network compared to a Vision Q7 DSP.
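As a sanity check on that "512 8-bit MACs, about 1 TOPS" figure, assuming a clock around 1 GHz and counting each MAC as two operations (neither assumption is stated in this post):

```python
macs = 512            # 8-bit MACs in either IP
clock_hz = 1.0e9      # assumed clock frequency, for illustration only
ops_per_mac = 2       # multiply and accumulate counted as separate operations
print(f"{macs * ops_per_mac * clock_hz / 1e12:.2f} TOPS")   # ~1 TOPS
```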

Summary

Early benchmarks for general-purpose processors, such as Dhrystone, suffered from a problem as compilers got more powerful: parts of the benchmark could be heavily optimized by the compiler and so no longer measured the hardware's performance. Neural network processors suffer from similar issues, since there are at least three things that you might be benchmarking:

  • The raw performance of the underlying hardware (usually focused on how many of the MACs behind the headline TOPS number can actually be kept busy)
  • The performance of the hardware when a trained network (complete with prespecified weights) is evaluated
  • The performance of the software and hardware system, taking a trained network and then optimizing it for the processor, and then running the optimized network

Ultimately, the last of those is the most appropriate, using the entire ecosystem of hardware and software around the chip or processor. But it can be very difficult to compare two different implementations because so much is different. You risk getting an apples-and-oranges comparison that tells you little. For IP, a raw TOPS number without looking at power and/or area is pretty meaningless. However, for a standard semiconductor product, the performance per $ might be the most appropriate.

As car manufacturers say at the end of many commercials:

Your mileage may vary.

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.