Paul McLellan
Tags: embedded vision, convolutional neural nets, autonomous vehicles, neural nets
Breakfast Bytes

Visual Baggage

16 Jun 2017 • 7 minute read

If you have been watching my "What's for Breakfast?" weekly video previews, then you will have noticed that I've been traveling a bit recently. The most recent videos were from Chicago, the moon, Shanghai, Munich, Mexico City, and beer pong in the cafeteria. (Full disclosure: I didn't actually go to the moon...and the beer pong didn't involve any actual beer.)

Embedded Vision

The week before I left for China was the Embedded Vision Summit, which I covered in:

  • Embedded Vision Summit: "It's a Visual World"
  • Lifting the Veil on Hololens
  • Deep Understanding, Not Just Deep Vision
  • Bier Goggles: 1000X in 3 Years
  • Rowen: How to Start an Embedded Vision Company

One of the themes of the Embedded Vision Summit was just how much progress has been made in visual recognition. Some of this comes from faster (or just more) computers, and some from increasing the number of layers in the convolutional neural nets. The table below shows the growth, with the layer count going from 22 to 246 in just the last couple of years.

[Table: growth in the number of CNN layers, from 22 to 246]

One thing that was clear from the Embedded Vision Summit is that the researchers building these recognition systems are primarily focused on getting the best recognition possible, with no concern for how much computation is required to do it. If using a net with 246 layers and an entire server farm pushes the recognition percentages to world-leading levels, then the researchers get a paper out of it, or a PhD. At the other end of the scale, people implementing embedded recognition systems have a very constrained problem, for both power and cost reasons. You simply can't put a supercomputer in the trunk of every autonomous car.
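
To make that scale concrete, here is a rough back-of-the-envelope sketch of my own (not something presented at the summit) that counts multiply-accumulate operations for a plain stack of 3x3 convolution layers. The constant resolution and channel width are simplifying assumptions; only the two layer counts echo the table above.

```python
# Back-of-the-envelope compute cost of a deep CNN. Illustrative assumptions:
# every layer is a 3x3 convolution at a fixed 224x224 resolution and 64 channels.

def conv_macs(h, w, c_in, c_out, k=3):
    """Multiply-accumulates for one k x k convolution producing an h x w map."""
    return h * w * c_in * c_out * k * k

def plain_cnn_macs(depth, h=224, w=224, channels=64):
    """Total MACs for `depth` conv layers at constant resolution and width."""
    total = conv_macs(h, w, 3, channels)                        # first layer: RGB in
    total += (depth - 1) * conv_macs(h, w, channels, channels)  # remaining layers
    return total

for depth in (22, 246):                                         # the two endpoints mentioned above
    print(f"{depth:4d} layers ~ {plain_cnn_macs(depth) / 1e9:7.1f} GMACs per image")
```

Even with these crude assumptions, the point stands: compute grows roughly linearly with depth, which is easy to buy in a server farm and much harder to fit into the power and cost budget of a car.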

Between Embedded and Thousands of Servers

In between are people tackling real-world recognition problems where they can assume connectivity to the cloud or to a datacenter. They can design practical systems whose recognition is close to the very best, at a reasonable cost, without the full constraints of an embedded solution.

If you have been to Europe recently and you have a biometric passport, you put your passport on the machine, stand in front of the camera, and if the system decides you look sufficiently like your passport photograph then the gate opens and you are admitted into the country without ever seeing an immigration officer. Somehow the US has managed to end up with a half-baked system where you put your passport on the machine, it prints out a voucher with a photograph of you on it, and then you stand in line for an immigration officer to do pretty much what they used to do before the machine part. But those systems that decide whether your face is you don't need to do it all on a little processor running on battery power. On the other hand, they can't require a 10,000-server datacenter for an hour: the recognition can take at most a few seconds and has some sort of cost budget.

This is actually a very good use of computers and a nice balance. The system can be tweaked so that it has a very low false positive rate (admitting the wrong person) at the expense of a higher false negative rate (the gate doesn't open and an immigration officer has to decide that you look better having shaved your beard off). It also plays to one old strength of computers and one new one.
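
As a concrete illustration of that tweaking, here is a toy sketch of my own (not the gate vendors' algorithm, and with invented similarity-score distributions): raising the match threshold trades false accepts for false rejects.

```python
# Toy illustration of the threshold trade-off described above. The score
# distributions are made up; real face matchers have their own statistics.
import random

random.seed(0)
genuine  = [random.gauss(0.80, 0.08) for _ in range(10_000)]  # you vs. your own passport photo
impostor = [random.gauss(0.45, 0.10) for _ in range(10_000)]  # someone else vs. your photo

for threshold in (0.55, 0.65, 0.75):
    false_accept = sum(s >= threshold for s in impostor) / len(impostor)  # wrong person admitted
    false_reject = sum(s <  threshold for s in genuine)  / len(genuine)   # right person sent to an officer
    print(f"threshold {threshold:.2f}: "
          f"false accept {false_accept:7.3%}, false reject {false_reject:7.3%}")
```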

An Old Strength: Computers Don't Get Bored

The old strength is that computers are very good at boring, repetitive tasks in which the event of interest happens very rarely. They don't get bored. Think of the airbag controller in your car. "Have we crashed? No. Have we crashed? No..." every millisecond for five years. Until, one day, on the many-billionth iteration, suddenly it is "Have we crashed? Yes...Are we sure? Yes...Fire the squib in the airbag, tighten the seat belts, report our GPS coordinates back to the network."

Humans are notoriously bad at the type of routine, boring job where events of interest happen rarely. Those events are especially hard to handle when they require real in-depth expertise to address. Running a nuclear power plant is perhaps the most extreme example. For years, everything runs smoothly. Then, suddenly, out of the blue, you get Three Mile Island. Or the plant is hit by a tsunami. At those times, you need people in the control room who are very smart, understand every detail of how the control systems work, and know what every option is. But people that smart don't want to spend their entire career sitting in a nuclear power plant control room doing routine tasks. Airline pilots are similar, requiring in-depth expertise only when something unusual happens; normal flights are uneventful. In all these cases, simulators are used to expose operators to many more "events" than happen in real life, to keep them ready to handle them.

A New Strength: Visual Recognition Is Now Better Than Humans

Visual recognition has now reached the point that it is better than humans in many cases. This assumes the existence of extensive training data. Humans are still better at the experiential type of learning. As Jitendra Malik pointed out in his keynote, you can take a toddler to the zoo and point out what a zebra looks like and that is all that it takes. Visual recognition algorithms still require thousands of training images of zebras in all sorts of scenes to get the idea nailed. But with extensive training data, machines are slightly better than humans. Cadence's CactusNet outperforms humans as well as every other known network on the German Traffic Sign Recognition Benchmark (GTSRB).
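
For a sense of what "thousands of training images" means in practice, here is a generic supervised-training sketch. It is not CactusNet or any network shown at the summit; the signs/train directory layout, the tiny network, and the hyperparameters are assumptions made purely for illustration.

```python
# Generic sketch: train a small CNN on labelled sign images, assuming a
# hypothetical layout signs/train/<class_name>/*.png. Not CactusNet.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("signs/train", transform=transform)  # thousands of images per class
loader = DataLoader(train_set, batch_size=64, shuffle=True)

model = nn.Sequential(                         # deliberately small; real nets are far deeper
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, len(train_set.classes)),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):                        # many passes over the labelled data
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```

The toddler needs one zebra; this loop needs the whole directory tree, many times over.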

There are a couple of tricks for getting training data when none exists. One is to use video and attempt to predict something, then run the video forward to see whether the prediction was right. There must be enough TV footage of professional poker in existence that you could probably train an algorithm to play high-level poker from a standing start.

Another approach is to follow a human around. NVIDIA did an experiment with a car where they taught it to drive simply by having it learn from scratch, with a human driver grabbing the wheel all the time at first, since the computer had no idea that staying on the road was part of driving. Those of us who have had teenagers can maybe relate to this a bit too much. But after a comparatively short period of time it drove reasonably well. There are several videos of this; here is one.
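
The technique behind that experiment, learning to steer by imitating the human at the wheel, is usually called behavioural cloning. Below is a heavily simplified sketch of the idea, my own illustration rather than NVIDIA's actual pipeline; the network shape, the frame size, and the random stand-in data are all assumptions.

```python
# Behavioural-cloning sketch: regress the human's steering angle from the
# camera frame, so the human driver effectively provides the labels.
# Everything below (architecture, frame size, fake data) is illustrative.
import torch
import torch.nn as nn

class SteeringNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 48, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(48, 1)            # single output: steering angle

    def forward(self, frame):
        return self.head(self.features(frame).flatten(1))

model = SteeringNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Stand-in for logged (camera frame, human steering angle) pairs; random data
# here just so the sketch runs end to end.
recorded_drive = [(torch.randn(8, 3, 66, 200), torch.randn(8, 1)) for _ in range(10)]

for frames, human_angles in recorded_drive:
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(frames), human_angles)
    loss.backward()
    optimizer.step()
```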

Baggage Handling

The reason I started off this post talking about the travel I have been doing is that my bags have been through a lot of baggage X-rays. Mostly very slowly. Another peeve about the US is that most countries by now have pretty good systems for moving bags through the X-ray machine, but we are stuck with metal tables along which we have to push the bags ourselves. It's as if 9/11 happened last week, not 16 years ago. But I digress. It occurred to me that the job of checking baggage for bombs is perfect for automation and terrible for humans to be doing:

  • Very boring, with only very occasional simulated bombs used to test. And probably the odd person who left a knife in their baggage or something
  • Easy to generate training data by having the machine "follow" humans around since millions of bags are inspected every day by humans
  • Can be done with connectivity, doesn't require an embedded solution with a tiny power envelope
  • Machines can do the visual recognition as well as humans (I'm guessing)
  • Total speculation, but I am sure they could run the bags through a lot faster, more like the speed of bar codes at the supermarket

In fact, it is so obviously a great idea, I wondered why it wasn't already done by computers. I emailed Jeff Bier, the chairman of the Embedded Vision Alliance that runs the Summit and asked if he knew that this was going on. He said he doesn't specifically know of anything, but he'd be very surprised if there aren't several companies working on it. I hope so.

So, when suddenly the TSA gets an order of magnitude more efficient due to machine vision, you read it first on Breakfast Bytes.