The German Traffic Sign Benchmark divides the signs into groups: speed limits, danger signs, and so on. It turns out that humans make far fewer errors between groups than within groups. They might mistake a 30 kph speed-limit sign for an 80 kph sign, but very rarely for a stop sign. The errors CNNs make are much less structured, spread almost uniformly across groups, so there is clearly room for improvement. How can we teach CNNs to be more like humans?
The approach is to build a hierarchical CNN using a divide-and-conquer approach. One CNN is used to classify the sign into a group, and then each group has its own CNN for classification within the group. It may sound like this is duplicating a lot of work and is likely to make things worse, but the resulting networks are a lot simpler (and, of course, can be optimized in the ways outlined in yesterday’s post).
It turns out that the families that humans use naturally are not the best for this approach, and a CNN family clustering (CFC) algorithm can be used to develop better families. Interestingly, this is something Netflix discovered. When they started, they classified movies as romance, comedy, action, etc. and tried to decide how much each person likes each of those classifications. When they had a competition to improve their recommendation system, the leaders all ignored that information and independently clustered movies into groups that seemed to have predictive power, but didn’t necessarily make sense to humans.
A simplified view of the algorithm: split the signs into seed sets, extend each seed set with similar signs, and accept a set as a family once the variation within it is low enough; a set with too much variation is split again.
In the street-sign example, you can see the first split into seed set A and seed set B. Seed set B, once extended with additional members, is a "good" family, but there is too much variation in seed set A and it needs to be split further.
Once the process is complete, we end up with five families. Some of the clusters, such as the one containing all the triangular warning signs, are similar to what a human would create. But there are some unexpected groupings. For example, seed set B in the above diagram consists of signs with a diagonal stripe from top right to bottom left, and ends up including the keep-left sign that a human would most likely group with the other directional signs.
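The split-extend-split-again loop described above can be sketched in a few lines of Python. This is only an illustration, not the actual CFC algorithm: the real variation metric and splitting rule are not spelled out here, so `variation` (spread of 1-D feature values) and `split_in_two` (a midpoint partition) are stand-ins chosen purely to make the recursion concrete.

```python
from statistics import mean

def variation(items):
    """Spread of 1-D feature values around their mean (stand-in metric)."""
    m = mean(items)
    return max(abs(x - m) for x in items)

def split_in_two(items):
    """Assumed splitter: partition around the midpoint of the value range."""
    mid = (min(items) + max(items)) / 2
    return [x for x in items if x <= mid], [x for x in items if x > mid]

def cluster_families(items, max_variation):
    """Recursively split until every family's internal variation is acceptable."""
    if len(items) <= 1 or variation(items) <= max_variation:
        return [items]          # a "good" family: keep it as-is
    a, b = split_in_two(items)  # too much variation: split into seed sets
    return (cluster_families(a, max_variation) +
            cluster_families(b, max_variation))

# Toy feature values standing in for sign images.
families = cluster_families([1.0, 1.2, 4.0, 4.3, 9.0, 9.1], max_variation=0.5)
# → [[1.0, 1.2], [4.0, 4.3], [9.0, 9.1]]
```

Note how the middle group only emerges after a second split, mirroring the diagram where seed set A has to be divided again.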
The training and recognition phases are fairly obvious. The family classification CNN is trained using all the training data. Then a CNN for each family is trained using just the data for that family. During inference, the family classification CNN is used to identify the family and then the appropriate CNN is used to complete recognition. See the picture above.
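The two-stage inference flow can be made concrete with a toy sketch. The functions below are placeholders for the trained CNNs described above (the family classifier and the per-family classifiers); the string-based "classification" is purely illustrative.

```python
def classify_family(sign):
    """Stage 1: route the input to a family (stand-in for the family CNN)."""
    return "speed_limit" if sign.endswith("kph") else "warning"

# Stage 2: one classifier per family, each trained only on that family's data.
FAMILY_CLASSIFIERS = {
    "speed_limit": lambda sign: f"speed limit {sign}",
    "warning":     lambda sign: f"warning: {sign}",
}

def recognize(sign):
    family = classify_family(sign)           # family CNN picks the group
    return FAMILY_CLASSIFIERS[family](sign)  # per-family CNN finishes the job

print(recognize("30kph"))  # → speed limit 30kph
```

The key point is that each stage-2 network only ever has to discriminate within one family, which is what keeps the individual networks small.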
One reason this process is effective is that, after the hierarchical decomposition into families, resources can be allocated to the per-family CNNs in proportion to each family's difficulty. This works better than a single huge CNN for the entire set of signs, or a sub-optimal split into families that are convenient for people but may leave a few especially hard-to-identify signs in each family.
The results are, indeed, improved. Using the pre-defined (human) families, the misclassification rate across families was 0.23% and within families was 0.33%. With the CFC-created families (shown in the pictures above), the error rates are reduced to 0.1% and 0.24%, respectively.
It turns out that this Cadence HCNN is the current leader (probably not for long) on the German street-sign benchmark, achieving a recognition rate of 99.82% with a complexity of only 83 million MACs. Traffic-sign recognition neural networks now exceed human performance.
Next: The Future of Neural Networks...and Our Robot Overlords
Previous: How to Optimize Your CNN