Paul McLellan

Assessing Bias in Computer Vision Systems

19 Jun 2019 • 5 minute read

I came across a fascinating document from Facebook on methods to assess bias in computer vision systems. At F8 2019 (their user conference), they highlighted some of the things that they are doing to address labeling bias, algorithmic bias, and more. For example, making sure that their computer vision systems work with all skin tones, and allowing augmented reality effects to work well with everyone regardless of things like facial features or hair tone. These are the sorts of things that come up when you just look at the US, or perhaps even just look at Silicon Valley.

I've actually written on this topic before, covering a presentation at last year's Embedded Vision Summit. See my post Overcoming Bias in Computer Vision.

But they realized that there are other forms of potential bias when you look on an international scale. As they say:

Events such as weddings or commonly used household items (dish soap, for instance) can look very different in different places, so computer vision systems trained with data predominantly from one region may not perform as well when classifying images from somewhere else.

This strikes me as similar to a bias that we have in psychology and other social sciences. Almost any paper that you read on psychology is sampling from a population that consists of students at universities in the West. The cute term for this is that the samples are all WEIRD. This stands for Western, Educated, and from Industrialized, Rich, and Democratic countries. In fact, even that doesn't capture how limited the samples are, since most college students are in a small age band from late teens to mid-20s. Maybe it should be WEIRDY to add Young to the mix. These samples are then used to generalize to universal truths about humans everywhere. It only takes a moment's thought to realize that the way college students interact might be very different from, say, hunter-gatherers in Papua New Guinea. Facebook, for one thing.

The Facebook AI researchers decided to test the performance of their algorithms, along with other object recognition systems that they could get access to. They started with Dollar Street, a collection of publicly available photos of household items gathered by the Gapminder Foundation. Gapminder was founded by Hans Rosling to commercialize the statistical display approach he showed in his famous TED talk, which you are probably one of the 14M people to have seen. As it says on the Dollar Street website:

We visited 264 families in 50 countries and collected 30,000 photos.

They are all classified. Here, for example, is part of the matrix of "social drinks". The number is the monthly income of the person/family where the photo was taken, not the cost of the drink!

They then used their algorithms to identify pictures. For example, here are a few, showing soap, spices, and toothpaste from different countries. Underneath the photos are examples of labels generated by computer vision algorithms.
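To make this concrete, here is a minimal sketch of generating labels like these with an off-the-shelf ImageNet classifier. This is not Facebook's object recognition system, just an illustrative stand-in using torchvision's pretrained ResNet-50, and the image filename is made up.

```python
# Illustrative only: label a photo with an off-the-shelf ImageNet classifier
# (torchvision ResNet-50), not Facebook's production system.
# Assumes torchvision >= 0.13 and a local JPEG file.
import torch
from torchvision import models
from torchvision.io import read_image

weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights)
model.eval()

preprocess = weights.transforms()        # resize, crop, normalize
img = read_image("household_item.jpg")   # hypothetical filename
batch = preprocess(img).unsqueeze(0)

with torch.no_grad():
    probs = model(batch).softmax(dim=1)

# Print the top-5 labels, similar to the captions under the photos
top5 = probs.topk(5)
for p, idx in zip(top5.values[0], top5.indices[0]):
    print(f"{weights.meta['categories'][int(idx)]}: {p.item():.1%}")
```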

Their conclusion from this analysis:

photos from several countries in Africa and Asia were less likely to be identified accurately than photos from Europe and North America. Our analysis showed that this issue is not specific to one particular object-recognition system, but rather broadly affects tools from a wide range of companies, including ours. Using the Dollar Street data set and comparing performance for different income groups, we found that the accuracy of Facebook's object-recognition system varies by roughly 20 percent.

This map shows how well Facebook's object recognition performs on the Dollar Street dataset. Green indicates that it performs well (dark green best of all), and orange/yellow/red that it performs poorly (red worst of all). Grey is no data.

Having picked Papua New Guinea somewhat randomly above, as about the most un-WEIRD population I could immediately think of, it is interesting to see that it is the deepest red in the above map. If your geography is not good enough to place PNG on a map, it is the east end of that big island just to the north of Australia. (The west part of the island is part of Indonesia.)

The Facebook team looked at things other than geographic discrepancies. They also recorded each household's monthly consumption income, adjusted for purchasing power parity, as well as its location. These are not entirely independent; after all, poor countries are poorer than rich ones. But there are poor people in rich countries and rich people in poor countries. The object recognition systems performed 10% to 20% better in classifying Dollar Street images for the wealthiest households than for the least wealthy households.
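Here is a minimal sketch of this kind of breakdown, assuming you have a table of per-image predictions on a Dollar Street-style dataset. The file name and column names are my own invention, not the actual Dollar Street schema.

```python
# Group per-image prediction results by country and by income band.
# "predictions.csv" and its columns are assumptions for illustration.
import pandas as pd

df = pd.read_csv("predictions.csv")
# assumed columns: country, monthly_income_usd_ppp, true_label, predicted_label
df["correct"] = df["true_label"] == df["predicted_label"]

# Accuracy per country: what the world map above visualizes
by_country = df.groupby("country")["correct"].mean().sort_values()

# Accuracy per income quartile (PPP-adjusted monthly consumption income)
df["income_band"] = pd.qcut(df["monthly_income_usd_ppp"], 4,
                            labels=["poorest", "lower", "upper", "wealthiest"])
by_income = df.groupby("income_band", observed=True)["correct"].mean()

print(by_country.head(10))  # the lowest-accuracy countries
print(by_income)            # Facebook reports a gap of roughly 10-20%
```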

The next thing the Facebook researchers did was look at the image datasets typically used for training and testing neural network algorithms, for example ImageNet, which I wrote about in my post ImageNet: The Benchmark that Changed Everything. What they found was that:

these collections can have a very skewed geographic distribution: almost all the photos come from Europe and North America, whereas relatively few photos come from populous regions in Africa, South America, and Central and Southeast Asia. This uneven distribution may lead to biases in object-recognition systems trained on these data sets. Such systems may be much better at recognizing a traditionally Western wedding than a traditional wedding in India, for example, because they were not trained on data that included extensive examples of Indian weddings.

Another issue they identified is that the hashtags used for training should come from many languages, not be dominated by English. As they point out, "the word 'wedding' will typically return very different photos than a query for, say, the Hindi word for wedding, 'शादी.'"

This is still work-in-progress, but what Facebook is doing is:

The first step in our approach is to use Facebook's unsupervised word embedding technology to learn multi-lingual hashtag embeddings. Subsequently, we train our convolutional networks to predict the hashtag embedding that corresponds to the training image. The use of unsupervised word embedding allows us to train on images that are annotated in hundreds of different languages, including languages that have relatively few speakers. In addition to the training of multilingual vision models, we are exploring techniques that use location information to ensure we select a data set that is geographically representative of the world population. This method works by resampling training images to match a geographic target distribution. As we work to implement these measures, we'll also explore other ways to improve our object recognition systems.
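To unpack what that means in practice, here is a rough sketch of the two ideas: training a convolutional network to predict a multilingual hashtag embedding instead of a fixed set of English labels, and resampling training images toward a geographic target distribution. Everything here (the embedding dimensionality, the loss, the dataset attributes, and the target distribution) is assumed for illustration; Facebook's actual implementation will differ.

```python
# Sketch only: (1) a convnet that regresses onto pretrained multilingual
# hashtag embeddings, and (2) geographic resampling of training images.
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import models

EMB_DIM = 300  # dimensionality of the (hypothetical) hashtag embeddings

# Replace the classification head with an embedding head
backbone = models.resnet50(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, EMB_DIM)

def embedding_loss(image_emb, hashtag_emb):
    # Pull the image embedding toward the embedding of its hashtag.
    # Cosine distance is an assumption; the actual objective may differ.
    return 1.0 - F.cosine_similarity(image_emb, hashtag_emb, dim=1).mean()

def region_weights(image_regions, target_share):
    # Weight each image so that sampled regions follow target_share
    # (e.g. world population share) rather than the raw dataset counts.
    counts = {r: image_regions.count(r) for r in set(image_regions)}
    return [target_share[r] / counts[r] for r in image_regions]

# Assumed usage, with a dataset that knows each image's region:
# sampler = WeightedRandomSampler(region_weights(dataset.regions, target_share),
#                                 num_samples=len(dataset))
# loader = DataLoader(dataset, batch_size=256, sampler=sampler)
```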


Sign up for Sunday Brunch, the weekly Breakfast Bytes email.