Rowen on Vision, Innovation, and the Deep Learning Explosion

13 Oct 2017 • 5 minute read

The keynote for the second day of the Linley Processor Conference was by Chris Rowen. Chris was the founder of Tensilica, and became the CTO of the IP group when Cadence acquired it. A year or so ago, he left Cadence to work in deep learning and founded Cognite Ventures. This is a hot area. As a symptom of just how hot, NIPS, the Conference on Neural Information Processing Systems, is sold out this year.

It's the Vision, Stupid

One of his messages is that, to all intents and purposes, all data is vision. Of course, other sensor data like temperature or GPS location might be important, but in terms of data volume, that other data doesn't rise to even a single pixel on the bar graph. It's all vision all the time. 99% of all raw data is pixels.

There are now more image sensors in the world than people. If you do the math, 2x10¹⁰ sensors * 5x10⁸ pixels/second/sensor = 10¹⁹ pixels/second. That's a lot of data. Camera costs are dropping towards $5 but the storage and bandwidth costs associated with this data are immense. Without extreme semantic compression, a $5 camera can generate enough data that it would cost millions of dollars to store it all for a few years. Even at one frame per second, the storage costs are prohibitive (see the table above).
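To make the arithmetic above concrete, here is a minimal sketch. The sensor count and per-sensor pixel rate are the figures quoted from the talk; the bytes-per-pixel value and the helper name `raw_bytes_per_year` are assumptions introduced only for illustration, and no storage price is assumed.

```python
# Figures quoted from the talk
SENSORS = 2 * 10**10                    # image sensors worldwide
PIXELS_PER_SECOND_PER_SENSOR = 5 * 10**8

SECONDS_PER_YEAR = 365 * 24 * 3600

# Total pixel rate across all sensors
total_pixels_per_second = SENSORS * PIXELS_PER_SECOND_PER_SENSOR


def raw_bytes_per_year(pixels_per_second, bytes_per_pixel):
    """Uncompressed bytes produced in one year.

    bytes_per_pixel is an assumption supplied by the caller; the talk
    did not give a value.
    """
    return pixels_per_second * bytes_per_pixel * SECONDS_PER_YEAR


# One camera at the quoted rate, assuming (hypothetically) 1 byte per pixel
one_camera = raw_bytes_per_year(PIXELS_PER_SECOND_PER_SENSOR, bytes_per_pixel=1)

print(f"{total_pixels_per_second:.0e} pixels/second worldwide")
print(f"{one_camera / 1e15:.1f} PB/year from one camera, uncompressed")
```

Multiplying the per-camera figure by whatever storage price you consider realistic shows why a cheap camera can imply a very large storage bill without aggressive compression.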

Vision is fundamentally hard. It has been totally transformed in the last few years by the use of deep learning and neural nets, arguably driven by the creation of ImageNet. Despite the name, ImageNet is not a net; it is 1.2M images, classified into 1000 categories, including 120 breeds of dogs. It turns out that deep learning is better than humans once you get to the breeds (see the above pictures to see how well you can do on just three pictures, and remember there are another 20,000 pictures to go).

Training

As I have mentioned before, computer vision is not yet at the stage of a child, who can learn the concept of a zebra from seeing a single zebra in the zoo, without requiring thousands of labeled images. One way to generate data that Chris talked about is adversarial networks. One network creates data and tries to fool the recognition network. As the recognition network gets better, the data network has to get better, too, and so both networks improve together.

For example, in the pictures above, the European facades on the right are completely artificial: they were generated by a network. The recognition network is trying to decide whether a given picture is real (a real photo of a building) or fake (generated by the other network).
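The alternating loop described above can be sketched on a toy problem. This is not the image model from the talk: it is a one-dimensional illustration in which "real" data are numbers drawn around a hypothetical mean of 4, the generator is a single learnable offset `b`, and the recognition network is a logistic classifier with parameters `w` and `c`. All names and values here are assumptions chosen for illustration.

```python
import math
import random


def sigmoid(s):
    # Numerically stable logistic function
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train(steps=200, batch=64, lr=0.1, real_mean=4.0, seed=0):
    rng = random.Random(seed)
    b = 0.0          # generator: fake sample = noise + b
    w, c = 0.0, 0.0  # recognizer: P(real | x) = sigmoid(w * x + c)
    for _ in range(steps):
        # Recognizer step: push P(real) up on real data, down on fakes
        real = [rng.gauss(real_mean, 1.0) for _ in range(batch)]
        fake = [rng.gauss(0.0, 1.0) + b for _ in range(batch)]
        grad_w = grad_c = 0.0
        for x in real:
            err = sigmoid(w * x + c) - 1.0   # target label is 1
            grad_w += err * x
            grad_c += err
        for x in fake:
            err = sigmoid(w * x + c)         # target label is 0
            grad_w += err * x
            grad_c += err
        w -= lr * grad_w / batch
        c -= lr * grad_c / batch

        # Generator step: move b so the recognizer scores fakes as real
        fake = [rng.gauss(0.0, 1.0) + b for _ in range(batch)]
        grad_b = sum((sigmoid(w * x + c) - 1.0) * w for x in fake) / batch
        b -= lr * grad_b
    return b, w, c


b, w, c = train(steps=1)
print("after one round: generator offset", b, "recognizer weight", w)
```

After the first round the recognizer has learned that larger values look more real, and the generator has responded by shifting its output upward, toward the real data. Plain alternating updates like these can oscillate rather than settle, which is one reason practical adversarial training needs more care than this sketch shows.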

Security, Robustness, and Privacy

Security and robustness are big issues. Along with the obvious device attacks, some of which are easy if you have physical access, the algorithms can be fragile. For example, the picture above is recognized correctly as Reese Witherspoon. But with fake glasses the algorithm goes for Russell Crowe!

Although Chris didn't talk about it, I've seen other work where traffic sign recognition can be spoofed by the carefully manipulated addition of a small amount of confusion. The above stop sign is recognized as a 45mph speed limit sign due to the addition of a few black and white rectangles.

Privacy is a big concern once there is a lot of automated recognition, coupled with big data gathered over time. More cameras mean there is better coverage. If they are coordinated, you get global tracking. This can be correlated with other aspects of an individual such as retail or credit card use. With deep analytics, it is not hard to extract a lot of seemingly private data.

Where Does Computing Happen

One big issue is where the computing is done, basically in the cloud or on the edge device. I won't cover that in detail since Meera already covered it in her post Visual Ventures with Chris Rowen, which covered Chris's dry run of the same presentation at Cadence a couple of days earlier.

Deep Learning Platforms

There has been a revolution in the architecture of deep learning platforms, with the power efficiency (vertical scale) around 1000X better than a general-purpose CPU (blue dot in the middle at the bottom). GPUs (orange and yellow) and FPGAs (red) can achieve good performance, but they are not particularly good when looking at power. GPUs are generally using 32-bit floating point precision, which is often overkill. The light and dark green dots are Tensilica vision and neural network DSPs. Google's TPU (purple) is about as high performance as you can get, but since it is intended for use in a datacenter, its power efficiency is not the best.
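To illustrate why 32-bit floating point can be overkill, here is a small sketch that computes the same dot product (the core of a MAC operation) once in floating point and once with values quantized to 8-bit integers. The choice of 8-bit symmetric quantization, the helper name `quantize_int8`, and the sample weights and inputs are all assumptions for illustration; the talk did not specify a particular reduced-precision format.

```python
def quantize_int8(values):
    """Symmetric linear quantization to integers in [-127, 127].

    Returns the integer values and the scale needed to map them back.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical weights and inputs for a single neuron
weights = [0.42, -1.30, 0.07, 0.88, -0.56]
inputs = [0.9, 0.5, -2.0, 0.25, 1.5]

q_w, s_w = quantize_int8(weights)
q_x, s_x = quantize_int8(inputs)

exact = dot(weights, inputs)             # floating-point result
approx = dot(q_w, q_x) * s_w * s_x       # integer MACs, rescaled once

print("float:", exact)
print("int8 :", approx)
print("abs error:", abs(exact - approx))
```

The integer version does all of its multiply-accumulates on small integers and applies the scale factors only once at the end, which is the kind of arithmetic that is much cheaper in silicon than 32-bit floating point.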

What's under the hood of one of those Tensilica neural network DSPs (NNDSPs)? The above diagram has more details, in this case the Vision C5 DSP. The keys to a good NNDSP are:

Scale to many 1000s of MACs
High MAC density and high MAC utilization
High memory bandwidth with low memory latency
Low load on the (control) CPU
High programmability
Low sensitivity to batch size

The low load on the CPU is important, since otherwise the CPU can be the dominant power cost. For example, the Google TPU requires more CPU power to stream the instructions to the TPU than the TPU itself uses to do the actual calculations.

AI Companies

Chris wrapped up by taking a look at which companies are working in this area. Somewhat tongue in cheek, he said that the definition of an AI company is "any company started since 2015." But there are a lot of genuine AI companies. They are almost all in the US, the UK, China, and Israel. That's what Chris said, and he showed the pie chart above, although I can't see Israel on it. Maybe the pie chart was drawn up by one of those UN committees that likes to pretend it doesn't exist. In addition, France, Germany, and Japan do a lot of research but, for all the usual structural reasons, not many startups get created.

Here are a couple of interesting applications to show what is going on. To the left is Qualcomm's tiny computer vision module, which consumes less than 2mW and can be trained on what it has to recognize. If the answer is weeds, then on the right is a John Deere (actually Blue River, but Deere just bought them) lettuce-bot, which can recognize weeds and precisely hit them, or can do precision thinning by applying weedkiller to some of the plants and leaving the remainder unaffected.

Summary

Cameras will be everywhere. Computers with new architectures will be watching, probably many from startups. Security, privacy, and robustness need more attention.
