Paul McLellan
20 May 2022

Rowen's Prisms for Audio and Video

This week was the 11th Embedded Vision Summit. That means the first one, back in 2011, was just a couple of years after what I regard as the watershed event in vision: the poster session (it didn't even make the main conference) announcing ImageNet. With millions of labeled images, neural network approaches, which had been regarded as a "tapped-out mine that would never produce any gold," raced past the explicitly programmed approaches to the point where, today, these networks are better than humans at recognizing dog breeds or traffic signs. For more about ImageNet, see my post ImageNet: The Benchmark that Changed Everything. You can read about three of the people who never gave up on the deep learning approach in my post Geoff Hinton, Yann LeCun, and Yoshua Bengio Win 2019 Turing Award (actually, it was the 2018 Turing Award, bestowed in 2019).

In fact, it was Yann LeCun giving a keynote at the 2014 Embedded Vision Summit that opened my eyes to how much had changed. He had a little camera attached to his laptop and he would point it at things while we watched the screen. He pointed at his shoe and it said "shoe". His coffee, and it said "coffee". The amusing one was when he pointed at his pastry and it said "bagel". "Well, it was trained in New York," Yann said.

One person whose sessions I regard as must-see at the Embedded Vision Summit is Chris Rowen. He founded Tensilica, was its CEO for many years, and then did that CEO/CTO switch. Of course, Cadence acquired Tensilica in 2013. At the time, I was working for SemiWiki and Tensilica was one of my accounts, so I would write about them every couple of months, often involving a briefing with Chris. You can still read my post Cadence To Acquire Tensilica, complete with a hybrid logo (which I would never be allowed to do now as an official member of our Corporate Marketing organization). Chris was the VP of our processor IP business after the acquisition, before leaving to found BabbleLabs, which Cisco acquired in 2020. His title now seems to be VP of Engineering, Collaboration AI, at Cisco (and all his slides have WebEx on them).

This year, he presented one of the first non-keynote talks, System Imperatives for Audio and Video AI at the Edge. The slides below are his.

Something Chris has been discussing for years is what he now calls the "grand tradeoff". There are no numbers on the axes but they are logarithmic in the sense that they span many orders of magnitude. Up in the top left is the ultimate in flexibility, using a public cloud to run software. Down in the bottom right is the ultimate in performance, with a dedicated chip (or IP block)...but no flexibility at all since changing the algorithm probably means manufacturing a new chip, which is slow and expensive. The algorithm is "crystallized in silicon". Fundamental technology progress can move the whole curve up and to the right, but normally all you can do is pick a point on the curve.

Chris's key questions:

  • How much more efficient must edge solutions be?
  • What split of edge-cloud in hybrid systems?
  • When is technology mature enough to freeze into silicon?

Of course, this tradeoff exists in many domains, not just vision and AI. Software is slow but flexible. Chips can be the ultimate in speed but silicon is the least flexible medium of all.

In the vision and AI domain, there are considerations other than flexibility and performance. Pushing against the cloud direction are privacy concerns, the amount of energy needed to move the data, and latency. Pushing against designing a special chip are time to market, the amount of engineering required, and the ease of data reuse.

The compromise is to run some things "always-on" (or AON), run mid-sized models on the more powerful edge processors, and go to the cloud for big models that only need to run on rare events.
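
To make that split concrete, here is a minimal sketch (my own illustration in Python, not anything from Chris's talk) of a three-tier dispatcher: a tiny always-on trigger runs continuously, a mid-sized edge model runs only when the trigger fires, and the cloud is invoked only for the rare cases the edge model cannot handle confidently. The function bodies and thresholds are placeholders.

```python
import numpy as np

# Hypothetical three-tier split: always-on trigger, mid-sized edge model, cloud fallback.
# The model functions and thresholds are placeholders, not anything from the talk.

def aon_trigger(frame: np.ndarray, threshold: float = 1e-3) -> bool:
    """Always-on tier: something as cheap as an energy detector or a tiny keyword model."""
    return float(np.mean(frame ** 2)) > threshold

def edge_model(segment: np.ndarray):
    """Mid-sized model on the device's DSP/NPU; returns (label, confidence)."""
    return "speech_command", 0.72            # placeholder inference

def cloud_model(segment: np.ndarray) -> str:
    """Big model in the public cloud; expensive and higher latency, so used rarely."""
    return "full_transcription"              # placeholder for a call out to a cloud service

def process(frame: np.ndarray, segment: np.ndarray, confidence_floor: float = 0.85):
    if not aon_trigger(frame):               # runs on every frame, lowest power
        return None
    label, confidence = edge_model(segment)  # runs only when the trigger fires
    if confidence >= confidence_floor:
        return label
    return cloud_model(segment)              # escalate only on the rare hard cases
```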

Rowen's Prism

BabbleLabs worked mostly on sound, I think. So here is Rowen's Prism for audio. On the left is raw audio. Using machine learning, this is split into several streams, like light through a prism. Near talk is people near the microphone. Far talk is other people further away (who may be noise or may be trying to communicate). Then there are music, noise, and other streams. Each of those can then be processed specially depending on the need, and recombined into the rendered audio.
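
Here is a minimal sketch of that decompose-process-recombine structure (my own Python illustration; the separators are placeholders where a real system would use learned source-separation and enhancement models):

```python
import numpy as np

# Sketch of the "prism" structure: split raw audio into streams, process each stream on
# its own, then mix back only what is wanted. The separator below is a placeholder; in
# practice each stream would come from a learned source-separation/enhancement model.

def separate(raw: np.ndarray) -> dict:
    """Placeholder ML 'prism': returns one stream per source class."""
    silence = np.zeros_like(raw)
    return {"near_talk": raw, "far_talk": silence, "music": silence, "noise": silence}

def render(streams: dict, gains: dict) -> np.ndarray:
    """Recombine the streams with per-stream processing (here just a gain)."""
    out = np.zeros_like(next(iter(streams.values())))
    for name, signal in streams.items():
        out += gains.get(name, 0.0) * signal
    return out

raw_audio = np.random.randn(16000)        # one second at 16 kHz, stand-in input
streams = separate(raw_audio)
# e.g. keep the near talker, attenuate far talkers, drop music and noise entirely
rendered = render(streams, {"near_talk": 1.0, "far_talk": 0.3, "music": 0.0, "noise": 0.0})
```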

This brought Chris to the "audio iceberg". The obvious ML suspects are:

  • Noise reduction
  • Speech-to-text
  • Text-to-speech
  • Talker ID
  • Keyword trigger ("hey Google")

But there is so much more, what he calls ML below the surface:

  • Beamforming
  • Non-linear echo cancellation
  • Voice activity detection
  • Single talker isolation
  • Background talker isolation
  • Noise analysis/synthesis
  • Voice cloning
  • Prosody transfer
  • Music identification/synthesis
  • Packet loss concealment
  • Health monitoring
  • And more...see the iceberg above

In his talk, Chris had an example showing noise removal (near-talker focus) and then enabling far-talker focus. 
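
His demo used learned models, but as a rough baseline for what "noise removal" means, here is a classical spectral-gating sketch (my own illustration, assuming the first half-second of the clip contains only noise):

```python
import numpy as np
from scipy.signal import stft, istft

# Classical spectral-gating noise reduction, as a non-ML baseline for the kind of
# "near-talker focus" demo Chris showed (his demo used learned models, not this).

def denoise(audio: np.ndarray, fs: int = 16000, noise_secs: float = 0.5,
            oversubtract: float = 1.5) -> np.ndarray:
    f, t, spec = stft(audio, fs=fs, nperseg=512)          # hop is nperseg // 2 = 256
    magnitude, phase = np.abs(spec), np.angle(spec)
    # Estimate the noise floor from the leading frames, assumed to contain no speech.
    noise_frames = max(1, int(noise_secs * fs / 256))
    noise_floor = magnitude[:, :noise_frames].mean(axis=1, keepdims=True)
    # Gate: subtract the (scaled) noise floor from the magnitude and clamp at zero.
    gated = np.maximum(magnitude - oversubtract * noise_floor, 0.0)
    _, out = istft(gated * np.exp(1j * phase), fs=fs, nperseg=512)
    return out
```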

Chris moved on to video, and Rowen's Prism for video. It is the same idea: use machine learning to decompose the video into different streams, process them separately, and then recombine all or some of them to produce the rendered video.
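
To make one video stream concrete, here is a sketch (again my own illustration, not Chris's) of what a person/background split enables: segment the person, process the background separately (blur it), and composite the streams back together. The segmentation model is a placeholder.

```python
import numpy as np

# One video-prism stream made concrete: split person from background, process the
# background on its own (blur it), then recombine. The segmenter is a placeholder for
# a learned model run per frame.

def segment_person(frame: np.ndarray) -> np.ndarray:
    """Placeholder: returns a soft mask in [0, 1], 1 where the person is."""
    return np.ones(frame.shape[:2], dtype=np.float32)     # stand-in for an ML model

def blur(frame: np.ndarray, k: int = 15) -> np.ndarray:
    """Cheap separable box blur along rows and columns (illustrative, not fast)."""
    out = frame.astype(np.float32)
    kernel = np.ones(k, dtype=np.float32) / k
    for axis in (0, 1):
        out = np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="same"), axis, out)
    return out

def render(frame: np.ndarray) -> np.ndarray:
    mask = segment_person(frame)[..., None]                # H x W x 1
    background = blur(frame)
    return mask * frame + (1.0 - mask) * background        # composite the streams back

frame = np.random.randint(0, 255, (120, 160, 3)).astype(np.float32)  # stand-in camera frame
out = render(frame)
```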

Video comes with its own iceberg of obvious applications above the water surface, and the mass of other less obvious things that can be done beneath the water.

Overlapping Models

The real world is more complex, with overlapping models. For example, on a WebEx video feed there may be:

  • Background segmentation
  • Rich gestures
  • Face localization
  • 3D model

These compete for the available compute, and there are challenges both in unifying all the models into one and in running them all independently.
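
One simple way to picture that competition for compute (a toy sketch of my own, not how WebEx actually schedules anything) is a per-frame time budget that the models draw down in priority order:

```python
# Toy per-frame compute budget for overlapping models (my own illustration).
# Costs are made-up milliseconds of a video frame budget.
MODELS = [
    ("face_localization",       2.0),   # highest priority: runs every frame
    ("background_segmentation", 6.0),
    ("rich_gestures",           5.0),
    ("3d_model",               12.0),   # lowest priority: runs only when time is left
]

def schedule(frame_budget_ms: float = 16.0):
    """Run models in priority order until the per-frame budget is spent."""
    remaining, scheduled = frame_budget_ms, []
    for name, cost_ms in MODELS:
        if cost_ms <= remaining:
            scheduled.append(name)
            remaining -= cost_ms
    return scheduled

print(schedule(16.0))   # ['face_localization', 'background_segmentation', 'rich_gestures']
```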

The grand challenge is to unify everything across heterogeneous media and the whole variety of endpoints: headsets, deskphones, smartphones, laptops, browsers, TVs, and so on. This is WebEx's challenge: to take this mixture and make it work.

Silicon

So where does ML silicon fit into all this? The above graph shows some actual chips with different levels of programmability and flexibility. All of these might be appropriate depending on the application.

Guidance

Chris's advice on all this:

  1. Know thy application – accuracy, data, footprint, latency, use-cases

  2. Understand the tradeoff between development efficiency and execution efficiency

  3. Don’t freeze a sub-optimal algorithm

  4. Better data beats a bigger network

  5. Design application hierarchy to move as little data as possible

  6. ML Responsibly: Fairness + Transparency + Privacy + Security

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.


Tags:
  • Chris Rowen
  • Embedded Vision Summit
  • Cisco
  • embedded vision