Jeff Bier is the founder of the Embedded Vision Alliance, which runs the annual Embedded Vision Summit (and other things). You won't be surprised to learn that his talk at the Embedded Neural Network Seminar was on embedded vision, titled When Every Device Can See.
Of course, one of the biggest drivers of embedded vision is autonomous driving, but it's not the only one.
A recent Nature paper described a program that analyzed skin lesions with accuracy comparable to professional dermatologists, who undergo 12 years of training after high school. Skin lesions are just blotchy shapes. As Jeff pointed out, recognizing them is a lot harder than recognizing traffic signs, where there is only a limited number of signs to pick from and they have been deliberately designed to be recognized and to be distinguishable. Skin lesions come in all shapes and sizes, the cancerous ones look pretty similar to the benign ones, and they were not designed to be easy to recognize.
Another example was lip reading. He showed us silent videos of people speaking, and, of course, we had no idea what the people were saying. The best humans can manage (I'm assuming deaf people) is 52% accuracy. The LipNet program can manage over 90%.
His last example was Amazon Go. If you don't know what this is, it is a convenience store. But there are no cashiers. You just go into the store, take what you want, and walk out. It keeps track of what you pick up (and can handle you changing your mind and putting things back) and then charges your credit card for what you leave with. Watch the 2-minute video below to get a better idea.
Jeff listed what he sees as the critical challenges to embedded vision achieving its full potential.
Not all vision is done using neural nets. The old way (until a couple of years ago) was to develop complex algorithms for specific tasks. That's what Mobileye does for driving, for example. However, it is hard to scale. So these days more and more people are going the neural net route, where the algorithms are trained first and the weights calculated are then used in the recognition phase. The pie chart below shows the percentages using neural nets (and the growth rate since last year). The transformation is happening fast—deep neural networks are transforming how we extract meaning from what start out as visual inputs, whether photographs or video.
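The two-phase workflow described above can be sketched in a few lines. This is a hypothetical toy example, not anything from the talk: a training phase computes the weights of a tiny classifier by gradient descent, and a separate recognition phase just applies the frozen weights in a forward pass—the part that would run on the embedded device.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "vision" data: points left of the y-axis are class 0, right are class 1.
X = rng.normal(0, 1, size=(200, 2))
y = (X[:, 0] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# --- Training phase (offline): learn the weights by gradient descent ---
w = np.zeros(2)
b = 0.0
for _ in range(500):
    p = sigmoid(X @ w + b)           # forward pass
    grad_w = X.T @ (p - y) / len(y)  # gradient of the cross-entropy loss
    grad_b = np.mean(p - y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

# --- Recognition phase (on device): weights are frozen, only a forward pass runs ---
def recognize(points):
    return (sigmoid(points @ w + b) > 0.5).astype(int)

print(recognize(np.array([[2.0, 0.3], [-1.5, 0.8]])))  # → [1 0]
```

Real networks have millions of weights rather than two, but the split is the same: the expensive learning happens once, and the deployed system only runs the cheap recognition step.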
The state of the art at maximizing visual perception per watt is the Microsoft HoloLens. It contains 24 Tensilica processors. If you are going to Mobile World Congress in Barcelona at the end of the month, you can come by and try the HoloLens at the Cadence booth. The HoloLens integrates the physical (what you can actually see) with the virtual without any time lag. Even twitchy things like video games work on HoloLens. Here's a video of a mixed reality game being played with HoloLens. This is a good example of how specialization drives efficiency. You could probably build a HoloLens with a bunch of general-purpose Intel server processors, but then you wouldn't want it on your head because the power would make it unacceptably hot.
One of the big issues with machine vision is the lack of trained engineers who have studied or worked on it. Luckily there are increasingly powerful libraries available. For example, Vuforia, originally developed by Qualcomm, lets you build augmented reality applications without knowing the underlying algorithms or processor.
The Embedded Vision Summit is coming up May 1st through 3rd at the Santa Clara Convention Center. Every year the conference seems to be half as large again as the previous year. A high-level view of the program is above. The summit website has lots of details including a link for registration.