Paul McLellan

Embedded Vision: Seeing Round Corners, and Reasoning on Microcontrollers

30 May 2019 • 10 minute read

May is a month that seems to have many things associated with it. "Sell in May and go away," people who work in the financial sector will tell you. Or "I thought that spring must last forevermore; For I was young and loved, and it was May," as Vera Brittain said. But in Silicon Valley, May is the Embedded Vision Summit. Not just in Silicon Valley. As Jeff Bier, the President of the Embedded Vision Alliance, said in his opening remarks, there were people there from Uruguay, India, Israel, China, Russia, Europe, and probably some more regions my typing was too slow to catch.

Each day of the main conference opens with a keynote. The first day was Ramesh Raskar of MIT Media Lab presenting Making the Invisible Visible: Within Our Bodies, the World Around Us, and Beyond. The big takeaway for me was that if you can control and time light at femtosecond resolutions, you can do seemingly impossible things.

The second day was Google's Pete Warden on why The Future of Computer Vision and Machine Learning Is Tiny. He has been looking into what it takes to do neural network inference on tiny systems that can run for years. I'll cover what I found most interesting from these two keynotes. Eventually, video of them should be available on the Alliance website.

 Ramesh Raskar

Ramesh started with an intriguing proposition:

It is easier for me with camera technology to see retinal structure at fine resolution than look at that resolution on your skin. Due to the lens, I have a microscope that gives me 5um resolution. It is a great indicator of your cardiovascular health.

He runs the Camera Culture Group at the MIT Media Lab. One area they work in is making invisible objects visible by exploiting the time dimension. If you take a laser pointer and turn it on for a few femtoseconds, you get a packet of photons that is about 0.2mm long but is traveling at the speed of light. By exploiting this, using what his group calls femtography or femto-photography, you can do some surprising things. As he put it, "one man's noise is another man's signal."

If you look at the setup in the diagram above, you get an idea of how this technology sees around corners. There is multi-path scattering of light. After three bounces (the door, the object, the door again), only a few photons get back to the detector at the camera, and you need extremely high time resolution to make anything of them. One thing everyone who reads this blog should already know is that light travels about a foot in a nanosecond, or about 1/100th of an inch in a picosecond. So with 1-2ps time resolution, it is possible to distinguish light paths to sub-millimeter accuracy and end up with an overall reconstruction of the object around the corner that is accurate to about 5mm.
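
To put those numbers together, here is a quick back-of-the-envelope calculation in Python. The speed of light and the 1-2ps timing figures are the ones quoted above; the rest is just arithmetic:

```python
# Back-of-the-envelope: how path-length resolution follows from timing resolution.
# The 1-2 ps figures are the ones quoted in the talk; everything else is arithmetic.

C = 3.0e8  # speed of light in m/s

def path_resolution_mm(time_resolution_ps: float) -> float:
    """Distance light travels during one timing bin, in millimeters."""
    return C * time_resolution_ps * 1e-12 * 1e3

for t_ps in (1.0, 2.0):
    print(f"{t_ps:.0f} ps timing resolution -> {path_resolution_mm(t_ps):.2f} mm path resolution")

# Output:
# 1 ps timing resolution -> 0.30 mm path resolution
# 2 ps timing resolution -> 0.60 mm path resolution
# After three bounces and the geometric reconstruction, the accuracy on the
# hidden object degrades to roughly 5 mm, as described above.
```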

Nature made a very professional video about this technology that Ramesh showed (3 minutes):

Similar techniques can be used for looking inside your body with 5um resolution, 5mm below the skin. This is known as PSG, for photo scatterography. The three challenges are scattering, depth, and fluorescence lifetime. But some of the technology required is getting very cheap. For example, a single photon avalanche diode (SPAD) can give picosecond resolution and can cost as little as $50.

Next, Ramesh moved on to reading a book without opening the cover. Can we send some sort of electromagnetic signal, look at the reflections, and read the text? Light doesn't work since it doesn't penetrate. X-rays don't work since they go straight through. Terahertz waves, with wavelengths in the 100s of um, are in between. A page is 25-100um thick, and ink is about 5-10um thick. This can be combined with OCR technology to overcome the noisy images. One complication is that one layer of characters obscures the next, so it's also essential to subtract out the pages in front of the one being read. Today, they can do this for nine pages. He showed a video that I can't find online, but here's another one about the technology (4 minutes):

These approaches using very precise timing can be used to see through fog. As Ramesh said:

We giggle in Boston when we hear about automated driving trials in Arizona where there is no fog and rain.

Another area his group is looking at is automating machine learning. There are three parts to any machine learning project: capture, analyze, and act. There's lots of research on analysis, but how to capture the data is still an unsolved problem in most cases. Data is very siloed due to privacy and trade secrets. They are focused on medical data and algorithms, but the ideas generalize.

There need to be techniques that work without revealing all the patient data. Some approaches are: anonymize, obfuscate, smash, and encrypt. These work but may not be enough for the owners of the data. For example, encryption conceals the data in transit but eventually it has to be decrypted for use in training.

The big challenge is how to let the edge devices, each of which holds a small piece of the data (for example, a single patient's records), contribute without sharing that data with the cloud. Apple has an approach called "differential privacy." Google has "federated learning": instead of sending the data to the code, send the code to the data. It's a very powerful approach, incrementally updating the weights and then sending them back.

Today, the server has the neural network and the clients have the data. The federated model is sent to the clients, and each improves it a little. The catchphrase is "share the wisdom, not the raw data." 
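
To make the round structure concrete, here is a minimal sketch of federated averaging in Python with NumPy. It is only an illustration of the idea described above, not Google's federated learning implementation: the server broadcasts the current weights, each client takes a few gradient steps on its own private data, and only the updated weights (never the raw data) are sent back and averaged. The linear model, data, and hyperparameters are all illustrative.

```python
import numpy as np

def client_update(weights, X, y, lr=0.1, steps=5):
    """Each client improves the shared model a little using only its local data."""
    w = weights.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w  # only the updated weights leave the device, never X or y

def federated_round(server_weights, clients):
    """Broadcast the model, collect each client's update, and average them."""
    updates = [client_update(server_weights, X, y) for X, y in clients]
    return np.mean(updates, axis=0)

# Toy example: three "hospitals", each with private data that is never pooled.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

w = np.zeros(2)
for _ in range(30):
    w = federated_round(w, clients)
print("learned weights:", w)  # close to [2.0, -1.0] without sharing raw data
```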

 Pete Warden

Pete Warden works for Google. Google is a big player in neural networks and deep learning, having developed TensorFlow (which is now open source with many developers) and the series of TPUs. TensorFlow now runs on over two billion devices in production using deep learning. Most of those deployments are on either big servers or smartphones. For example, Live Caption works even offline for any audio and video on your phone, since it keeps the data on-device. It doesn't need a cellular or Wi-Fi connection.
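
For anyone who hasn't tried it, on-device inference of this kind boils down to loading a converted model and invoking it locally, with no network round trip. Here is a minimal sketch using the TensorFlow Lite Python interpreter; the model file name and the zeroed input frame are placeholders, not a real Google model:

```python
import numpy as np
import tensorflow as tf

# Minimal sketch of on-device inference with a TensorFlow Lite model.
# "model.tflite" and the input contents are placeholders for illustration.

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed one frame of (already preprocessed) sensor data, e.g. an audio spectrogram.
frame = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], frame)
interpreter.invoke()

scores = interpreter.get_tensor(output_details[0]["index"])
print("class scores:", scores)  # everything stays on the device
```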

Google has a single-chip TPU for edge devices. But Pete said that it's in the 500mW range. It's really efficient for big computation but it is overkill and consumes too much power for the sorts of applications he's thinking about.

But Pete is interested in much smaller devices, which is why he says the future of machine learning is "tiny." By smaller, he means microcontrollers with just tens of kilobytes of memory.

Machine learning works. Voice interfaces are expanding. Computer vision is dominated by machine learning.

Pete started with his pitch from two years ago:

  • Machine learning on phones is real, not a fad
  • It is useful for developing features across the board
  • Any program that requires complex rules and heuristics can benefit
  • Interactivity, battery usage, and privacy make on-device useful

"These have all proven to be true," he said.

He admitted that he's not a prophet; it's just that in his job working on TensorFlow Lite he sees a lot of deep learning applications where developers come and ask, "Can we do this?"

He is now investing in machine learning on embedded platforms smaller than phones, which he feels is going to be important. He is also working on getting "big" applications to be much smaller, clearly a key requirement for getting to microcontrollers. For example, a full server-quality keyboard ASR (automatic speech recognition) model has been shrunk to 80MB and runs on the processor in a typical Pixel phone. Over time, he thinks we can get down to a microcontroller-type power regime. Then there is the possibility of a 50c embedded chip that runs on a coin battery for a year and can do full speech recognition, giving it a voice interface. That's important since microphones are small and cheap.

Other important areas are in industrial IoT for predictive maintenance and monitoring. This needs to be what Pete calls "peel and stick," with no mains connection needed. Although the machines being monitored usually have electric power, on a factory floor it can cost thousands of dollars to wire in a new device. If a battery-powered unit with microphones, cameras, and accelerometers can simply be stuck on the machine and run for years, the monitoring can be done for a much lower price.

Flying across the country on a red-eye, Pete noticed that at 3am the towns down below are lit up like Disneyland, even though in most places there is nobody around at that hour. He has a dream of streetlights that only come on when somebody is nearby.

In agriculture, the potential is to recognize pests or weeds using vision sensors scattered through the fields with tiny cameras. Pete showed a video from PlantVillage using this type of approach to monitor cassava (aka manioc), which is an important crop in some African countries. Some diseases that affect cassava can reduce the crop by 40% or even destroy it completely. Today this is all on phones, but it could be cheap devices scattered in the fields. Here's the video (1' 30"):

The big concern is energy, which is the key bottleneck. Cost is a secondary issue, and Pete assumes it will drop with volume if the energy problem is solved. The challenge with energy is that tethering a device is expensive and impractical (electricity is not available on 90% of the earth's surface), and replacing or recharging batteries only scales with the number of people available to do it. Today, we manage to charge one phone every night; it's a pain but doable. Imagine if we had dozens of devices per person: charging them all would be too much of a tax. Even replacing our smoke-alarm batteries is a pain.

As Pete put it succinctly:

Any device that requires human attention at any frequency to work is a tax on our time. Peel, stick, and forget devices are the scalable way. Devices like this for temperature, lighting, and humidity are already available, sponsored by the Department of Energy.

 This boils down to batteries that last years, or energy harvesting. In practice, this means less than 1mW (mobile phones are around 1W, so 1/1000 of that). Radio is an energy hog, so we can't stream sensor data to the cloud. The only approach that works is to do on-device recognition of exceptions, and only then turn on the radio. Sensors and computation can be very low power. There are experimental image sensors that power themselves from ambient light. Microphones only need hundreds of microwatts. But radio inescapably requires energy for the transmitter (Bluetooth LE is high tens of milliwatts, for example).
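
As a rough illustration of why the radio has to stay off almost all of the time, here is a quick duty-cycle budget. The 1mW overall budget and the "high tens of milliwatts" radio figure come from the talk; the split between the radio and the always-on sensing is an assumption for illustration:

```python
# Rough duty-cycle budget: how often can the radio be on under a 1 mW average budget?
# The 1 mW budget and ~50 mW radio figure follow the talk; the split is illustrative.

POWER_BUDGET_MW = 1.0   # total average power budget for the device
RADIO_ACTIVE_MW = 50.0  # assumed BLE transmit power ("high tens of milliwatts")
ALWAYS_ON_MW = 0.5      # assumed sensor + on-device inference share of the budget

radio_budget_mw = POWER_BUDGET_MW - ALWAYS_ON_MW
duty_cycle = radio_budget_mw / RADIO_ACTIVE_MW
print(f"Radio can be active {duty_cycle:.1%} of the time")        # -> 1.0%
print(f"That is about {duty_cycle * 86400:.0f} seconds per day")  # -> ~864 s

# Streaming raw sensor data would keep the radio on continuously (50x over budget),
# so the device must recognize events locally and transmit only the exceptions.
```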

Microprocessors can be very low power: Ambiq has a milliwatt-level Cortex-M4, and others are working on the problem too. There is no theoretical reason we can't compute in microwatts. Neural networks require a lot of arithmetic but not a lot of loads and stores, and arithmetic can be done with very little energy compared to accessing memory.

The SparkFun board he showed, developed jointly with Google, uses the Ambiq Micro Apollo3 Blue, which runs TensorFlow Lite using only 6uA/MHz. It can last for ten days on a 2032 coin battery.
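
As a rough sanity check on that claim, here is one way the arithmetic could work out. Only the 6uA/MHz figure comes from the article; the clock frequency, the overhead for the microphone, memory, and leakage, and the coin-cell capacity are all assumptions:

```python
# Rough battery-life estimate for an always-on TensorFlow Lite wake-word demo.
# Only the 6 uA/MHz figure comes from the article; the rest are assumptions.

CORE_UA_PER_MHZ = 6.0   # Apollo3 Blue figure quoted above
CLOCK_MHZ = 48.0        # assumed operating frequency
OVERHEAD_UA = 650.0     # assumed microphone, memory, and leakage current
COIN_CELL_MAH = 225.0   # typical CR2032 capacity

core_ua = CORE_UA_PER_MHZ * CLOCK_MHZ        # ~288 uA for the CPU itself
total_ma = (core_ua + OVERHEAD_UA) / 1000.0  # ~0.94 mA total average draw
lifetime_days = COIN_CELL_MAH / total_ma / 24.0
print(f"CPU draw: {core_ua:.0f} uA, total: {total_ma:.2f} mA")
print(f"Estimated lifetime: {lifetime_days:.0f} days")  # roughly ten days
```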

So Pete's dream is:

Low-power sensors plus enough compute running deep learning models. Neural networks can make sense of the sensor data, and then only wake up the radio when something actionable happens. There are 250B embedded devices active in the world right now, and the number shipped is growing 20% per year. We have the chance to solve many problems around health, food, and the environment using machine learning, so let's do it.

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.