Understanding DNNs

Convolutional neural networks (CNNs) are powerful tools that reach high accuracy on a broad range of vision tasks, but at the same time their millions of parameters make it notoriously difficult to understand what actually happens inside the "black box". We use methods from human psychophysics and principled modifications of CNN architectures to achieve a better understanding of CNN strategies, biases and failures. Below, a few projects exemplify our approach.

Texture bias in standard CNNs

Convolutional neural networks have little trouble recognising texturised images, even though these objects no longer have a global object shape (Brendel & Bethge, 2019).
For a long time, CNNs were thought to perform object recognition by learning increasingly complex representations of object shapes. If this were the case, however, why would CNNs still recognise texturised images so well even though the global object shape is destroyed? We put forward the "texture hypothesis" of CNN object recognition: we showed that a so-called bag-of-features CNN can reach AlexNet-level performance on ImageNet object recognition despite being constrained to recognising only small, texture-like patterns. Texture recognition, without any analysis of global object shape, is therefore a valid strategy to "solve" object recognition.
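The core idea can be sketched in a few lines: classify many small patches independently and average their class evidence, so that no spatial relationship between patches (i.e. no global shape information) enters the decision. This is only an illustrative sketch, not the actual BagNet architecture from the paper; `patch_classifier`, `patch_size` and `stride` are hypothetical stand-ins.

```python
import numpy as np

def bag_of_features_predict(image, patch_classifier, patch_size=33, stride=8):
    """Illustrative bag-of-features inference: slide a small window over
    the image, classify each patch on its own, and average the logits.
    `patch_classifier` is a hypothetical function mapping a patch to a
    vector of class logits."""
    h, w, _ = image.shape
    logits = []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            patch = image[y:y + patch_size, x:x + patch_size]
            logits.append(patch_classifier(patch))
    # Averaging discards where each patch came from: the global
    # arrangement of parts (object shape) cannot influence the result.
    return np.mean(logits, axis=0)
```

Because the average is order-invariant, shuffling the patches would not change the prediction, which is exactly why such a model can only exploit texture-like local evidence.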

In a closely related project, we analysed whether widely used CNNs are more sensitive to texture or to shape. The figure below shows how we created images in which texture and shape belong to different categories. A cat with elephant skin is a cat to humans, but an elephant to CNNs: unlike humans, all standard CNNs show a striking bias towards textures. We aim to close this gap.
Deep neural networks show a bias to classify images according to texture rather than shape, whereas humans show the opposite bias (Geirhos et al., 2019).
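The texture/shape preference can be quantified with a simple shape-bias fraction: among cue-conflict images where the model's prediction matches either the shape category or the texture category, the fraction matching the shape category. A minimal sketch, assuming each trial is recorded as a `(predicted, shape_label, texture_label)` tuple (these names are illustrative):

```python
def shape_bias(decisions):
    """Fraction of shape decisions among all shape-or-texture decisions.
    `decisions` is an iterable of (predicted, shape_label, texture_label)
    tuples; predictions matching neither label are ignored."""
    shape_hits = sum(1 for p, s, t in decisions if p == s)
    texture_hits = sum(1 for p, s, t in decisions if p == t)
    total = shape_hits + texture_hits
    return shape_hits / total if total else float("nan")
```

A shape bias near 1 means shape-driven decisions (as for humans), near 0 means texture-driven decisions (as for standard ImageNet-trained CNNs).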

Generalisation towards noisy input

Through systematic comparisons with human observers in the lab, we were able to show that current CNNs are far less robust to all sorts of image perturbations. Worse yet, even when trained on eight different noise types, they fail to cope with a novel ninth noise type. This highlights how much more robust the human visual system is, and we aim to build CNNs that handle noisy input much better.
Standard CNNs fail to recognise noisy images like these (Geirhos et al., 2018).

Key Papers

R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel
ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness
International Conference on Learning Representations (ICLR), 2019

W. Brendel and M. Bethge
Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet
International Conference on Learning Representations (ICLR), 2019

R. Geirhos, C. R. M. Temme, J. Rauber, H. H. Schütt, M. Bethge, and F. A. Wichmann
Generalisation in humans and deep neural networks
Advances in Neural Information Processing Systems 31, 2018