Visualizing and Understanding Convolutional Networks

M.D. Zeiler, R. Fergus

ECCV 2014 (Honourable Mention for Best Paper Award), arXiv:1311.2901 (Nov 28, 2013)

Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark. However, there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. We also perform an ablation study to discover the performance contribution from different model layers. This enables us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on the Caltech-101 and Caltech-256 datasets.

Hierarchical Convolutional Deep Learning in Computer Vision

M.D. Zeiler. Advisor: R. Fergus

Unpublished PhD Thesis (Nov 8, 2013)

It has long been the goal in computer vision to learn a hierarchy of features useful for object recognition. Spanning the two traditional paradigms of machine learning, unsupervised and supervised learning, we investigate the application of deep learning methods to tackle this challenging task and to learn robust representations of images. We begin our investigation with the introduction of a novel unsupervised learning technique called deconvolutional networks. Based on convolutional sparse coding, we show this model learns interesting decompositions of images into parts without object label information. This method, which easily scales to large images, becomes increasingly invariant by learning multiple layers of feature extraction coupled with pooling layers. We introduce a novel pooling method called Gaussian pooling to enable these layers to store continuous location information while being differentiable, creating a unified objective function to optimize. In the supervised learning domain, a well-established model for recognition of objects is the convolutional network. We introduce a new regularization method for convolutional networks called stochastic pooling which relies on sampling noise to prevent these powerful models from overfitting. Additionally, we show novel visualizations of these complex models to better understand what they learn and to provide insight on how to develop state-of-the-art architectures for large-scale classification of 1,000 different object categories. We also investigate related problems in deep learning. First, we introduce a model for the task of mapping one high dimensional time series sequence onto another. Second, we address the choice of nonlinearity in neural networks, showing evidence that rectified linear units outperform other types in automatic speech recognition. Finally, we introduce a novel optimization method called ADADELTA which shows promising convergence speeds in practice whilst being robust to hyper-parameter selection.

Regularization of Neural Networks using DropConnect

L. Wan, M.D. Zeiler, S. Zhang, Y. LeCun, R. Fergus

ICML 2013 (June 16, 2013)

We introduce DropConnect, a generalization of Dropout (Hinton et al., 2012), for regularizing large fully-connected layers within neural networks. When training with Dropout, a randomly selected subset of activations are set to zero within each layer. DropConnect instead sets a randomly selected subset of weights within the network to zero. Each unit thus receives input from a random subset of units in the previous layer. We derive a bound on the generalization performance of both Dropout and DropConnect. We then evaluate DropConnect on a range of datasets, comparing to Dropout, and show state-of-the-art results on several image recognition benchmarks by aggregating multiple DropConnect-trained models.
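The difference between the two masking schemes described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `keep_prob` parameter and the plain-list matrix representation are assumptions made for illustration, and the nonlinearity and any rescaling of the retained values are omitted.

```python
import random

def dropout_layer(x, W, keep_prob, rng):
    """Dropout (sketch): zero a random subset of the layer's output activations."""
    out = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return [o if rng.random() < keep_prob else 0.0 for o in out]

def dropconnect_layer(x, W, keep_prob, rng):
    """DropConnect (sketch): zero a random subset of individual weights instead."""
    out = []
    for row in W:
        # Each output unit receives input from its own random subset of inputs.
        masked = [w if rng.random() < keep_prob else 0.0 for w in row]
        out.append(sum(w * xi for w, xi in zip(masked, x)))
    return out

rng = random.Random(0)          # hypothetical seed, for repeatability only
x = [1.0, 2.0]
W = [[1.0, 2.0], [3.0, 4.0]]
print(dropout_layer(x, W, 0.5, rng))
print(dropconnect_layer(x, W, 0.5, rng))
```

With Dropout an output unit is either kept whole or removed, whereas with DropConnect a unit can survive with only part of its incoming connections.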

On Rectified Linear Units for Speech Processing

M.D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G.E. Hinton

ICASSP 2013 (May 26, 2013)

Deep neural networks have recently become the gold standard for acoustic modeling in speech recognition systems. The key computational unit of a deep network is a linear projection followed by a point-wise non-linearity, which is typically a logistic function. In this work, we show that we can improve generalization and make training of deep networks faster and simpler by substituting the logistic units with rectified linear units. These units are linear when their input is positive and zero otherwise. In a supervised setting, we can successfully train very deep nets from random initialization on a large vocabulary speech recognition task, achieving lower word error rates than using a logistic network with the same topology. Similarly, in an unsupervised setting, we show how we can learn sparse features that can be useful for discriminative tasks. All our experiments are executed in a distributed environment using several hundred machines and several hundred hours of speech data.
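As a small illustration of the two point-wise non-linearities compared above, the following sketch defines both. The function names are chosen here for illustration and are not taken from the paper.

```python
import math

def logistic(x):
    """Logistic unit: squashes any input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Rectified linear unit: linear for positive input, zero otherwise."""
    return x if x > 0 else 0.0

for v in (-2.0, 0.0, 3.0):
    print(v, logistic(v), relu(v))
```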

Stochastic Pooling for Regularization of Deep Convolutional Neural Networks

Matthew D. Zeiler and Rob Fergus

ICLR 2013 (May 2, 2013)
Supplemental Material: Images, Videos

We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.
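The sampling step described above can be sketched for a single pooling region as follows. This is an illustrative sketch rather than the authors' code: the function name and the flat-list representation of a region are assumptions, and the activations are assumed to be non-negative (for example, after a rectified linear unit) so that they can be normalised into probabilities.

```python
import random

def stochastic_pool(region, rng):
    """Sample one activation from a pooling region (sketch).

    Each activation is picked with probability proportional to its value,
    i.e. a multinomial over the region. Assumes non-negative activations.
    """
    total = sum(region)
    if total == 0:
        return 0.0  # all-zero region: nothing to sample from
    return rng.choices(region, weights=region, k=1)[0]

rng = random.Random(0)                     # hypothetical seed
region = [0.0, 1.5, 0.5, 2.0]              # one flattened 2x2 pooling region
print([stochastic_pool(region, rng) for _ in range(5)])
```

Larger activations are chosen more often, but smaller non-zero ones still get selected occasionally, which is the source of the sampling noise.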

ADADELTA: An Adaptive Learning Rate Method

Matthew D. Zeiler

arXiv:1212.5701 (Dec 27, 2012)

We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent. The method requires no manual tuning of a learning rate and appears robust to noisy gradient information, different model architecture choices, various data modalities and selection of hyperparameters. We show promising results compared to other methods on the MNIST digit classification task using a single machine and on a large scale voice dataset in a distributed cluster environment.
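A minimal sketch of a per-dimension update of this kind is given below. It keeps decaying averages of squared gradients and squared updates and scales each step by the ratio of their root mean squares. The class name, the list-based parameter representation, and the default values of `rho` and `eps` are assumptions made for illustration, not settings prescribed by the abstract.

```python
import math

class Adadelta:
    """Per-dimension adaptive step sizes using only first-order information (sketch)."""

    def __init__(self, n, rho=0.95, eps=1e-6):
        self.rho, self.eps = rho, eps
        self.avg_sq_grad = [0.0] * n   # decaying average of squared gradients
        self.avg_sq_step = [0.0] * n   # decaying average of squared updates

    def step(self, params, grads):
        new_params = []
        for i, (p, g) in enumerate(zip(params, grads)):
            self.avg_sq_grad[i] = self.rho * self.avg_sq_grad[i] + (1 - self.rho) * g * g
            # Ratio of RMS of past updates to RMS of gradients; no learning rate.
            rms_step = math.sqrt(self.avg_sq_step[i] + self.eps)
            rms_grad = math.sqrt(self.avg_sq_grad[i] + self.eps)
            delta = -(rms_step / rms_grad) * g
            self.avg_sq_step[i] = self.rho * self.avg_sq_step[i] + (1 - self.rho) * delta * delta
            new_params.append(p + delta)
        return new_params

# Usage: a few steps on f(x) = x^2, whose gradient is 2x.
opt = Adadelta(1)
x = [1.0]
for _ in range(10):
    x = opt.step(x, [2.0 * x[0]])
print(x)
```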

Differentiable Pooling for Hierarchical Feature Learning

Matthew D. Zeiler and Rob Fergus

arXiv:1207.0151v1 (July 3, 2012)

We introduce a parametric form of pooling, based on a Gaussian, which can be optimized alongside the features in a single global objective function. By contrast, existing pooling schemes are based on heuristics (e.g. local maximum) and have no clear link to the cost function of the model. Furthermore, the variables of the Gaussian explicitly store location information, distinct from the appearance captured by the features, thus providing a what/where decomposition of the input signal. Although the differentiable pooling scheme can be incorporated in a wide range of hierarchical models, we demonstrate it in the context of a Deconvolutional Network model (Zeiler et al. ICCV 2011). We also explore a number of secondary issues within this model and present detailed experiments on MNIST digits.
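To make the idea of a parametric, differentiable pooling concrete, here is a rough sketch that pools a 2D region with normalised Gaussian weights. It is an assumption-laden illustration and not the formulation from the paper: the function name, the isotropic Gaussian with a single `sigma`, and the normalised weighted sum are all choices made here. The centre `(mu_r, mu_c)` plays the role of the "where" variables, and the pooled value the "what".

```python
import math

def gaussian_pool(region, mu_r, mu_c, sigma):
    """Pool a 2D region with normalised Gaussian weights centred at (mu_r, mu_c)."""
    weighted_sum, weight_total = 0.0, 0.0
    for r, row in enumerate(region):
        for c, a in enumerate(row):
            dist_sq = (r - mu_r) ** 2 + (c - mu_c) ** 2
            w = math.exp(-dist_sq / (2.0 * sigma ** 2))
            weighted_sum += w * a
            weight_total += w
    return weighted_sum / weight_total

region = [[1.0, 2.0], [3.0, 4.0]]
print(gaussian_pool(region, 1.0, 1.0, 0.1))   # narrow Gaussian: close to the entry at (1, 1)
print(gaussian_pool(region, 0.5, 0.5, 1e6))   # very wide Gaussian: close to the region mean
```

Because the output varies smoothly with `mu_r`, `mu_c` and `sigma`, these variables could in principle be optimized by gradient methods alongside the features, unlike a hard local maximum.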

Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines

Matthew D. Zeiler, Graham W. Taylor, Leonid Sigal, Iain Matthews, and Rob Fergus

NIPS 2011 (December 12-17, 2011)
Supplemental Material: Supplementary, Videos

We present a type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence. It shares the desirable properties of RBMs: efficient exact inference, an exponentially more expressive latent state than HMMs, and the ability to model nonlinear structure and dynamics. We apply our model to a challenging real-world graphics problem: facial expression transfer. Our results demonstrate improved performance over several baselines modeling high-dimensional 2D and 3D data.

Adaptive Deconvolutional Networks for Mid and High Level Feature Learning

Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus

ICCV 2011 (November 6-13, 2011)

We present a hierarchical model that learns image decompositions via alternating layers of convolutional sparse coding and max pooling. When trained on natural images, the layers of our model capture image information in a variety of forms: low-level edges, mid-level edge junctions, high-level object parts and complete objects. To build our model we rely on a novel inference scheme that ensures each layer reconstructs the input, rather than just the output of the layer directly beneath, as is common with existing hierarchical approaches. This makes it possible to learn multiple layers of representation and we show models with 4 layers, trained on images from the Caltech-101 and Caltech-256 datasets. When combined with a standard classifier, features extracted from these models outperform SIFT, as well as representations from other feature learning methods.

Deconvolutional Networks

Matthew D. Zeiler, Dilip Krishnan, Graham W. Taylor, and Rob Fergus

CVPR 2010 (June 13-18, 2010)
Supplemental Material: Images, Videos

Building robust low and mid-level image representations, beyond edge primitives, is a long-standing goal in vision. Many existing feature detectors spatially pool edge information which destroys cues such as edge intersections, parallelism and symmetry. We present a learning framework where features that capture these mid-level cues spontaneously emerge from image data. Our approach is based on the convolutional decomposition of images under a sparsity constraint and is totally unsupervised. By building a hierarchy of such decompositions we can learn rich feature sets that are a robust image representation for both the analysis and synthesis of images.

Deconvolutional Networks for Feature Learning

Matthew D. Zeiler, Dilip Krishnan, Graham W. Taylor, and Rob Fergus

The Learning (Snowbird) Workshop (April 6-9, 2010)
Supplemental Material: Images, Videos

Building robust low-level image representations, beyond edge primitives, is a long-standing goal in vision. In its most basic form, an image is a matrix of intensities. How we should progress from this matrix to stable mid-level representations, useful for high-level vision tasks, remains unclear. Popular feature representations such as SIFT or HOG spatially pool edge information to form descriptors that are invariant to local transformations. However, in doing so important cues such as edge intersections, grouping, parallelism and symmetry are lost…

Modeling pigeon behaviour using a Conditional Restricted Boltzmann Machine

Matthew D. Zeiler, Graham W. Taylor, Niko F. Troje, and Geoffrey E. Hinton

ESANN 2009 (April 22-24, 2009)
Supplemental Material:  Videos

In an effort to better understand the complex courtship behaviour of pigeons, we have built a model learned from motion capture data. We employ a Conditional Restricted Boltzmann Machine (CRBM) with binary latent features and real-valued visible units. The units are conditioned on information from previous time steps to capture dynamics. We validate a trained model by quantifying the characteristic ‘head-bobbing’ present in pigeons. We also show how to predict missing data by marginalizing out the hidden variables and minimizing free energy.

Learning Pigeon Behaviour Using Binary Latent Variables

Matthew D. Zeiler. Supervised by Geoffrey E. Hinton and Graham W. Taylor

Engineering Science Undergraduate Thesis (April 14, 2009)

In an effort to better understand the complex courtship behaviour of pigeons, we have built a model learned from motion capture data. We employ a Conditional Restricted Boltzmann Machine with binary latent features and real-valued visible units. The units are conditioned on information from previous time steps in a sequence to learn long-term effects and infer current features. We validate a trained model by quantifying the characteristic ‘head-bobbing’ present in generated pigeon motion. We also introduce a method of predicting missing data by marginalizing out the hidden variables and minimizing the free energy of the model. An alternative prediction method using forward and reverse passes over gaps of missing markers was presented as well. Lastly, the effects of head and foot motion on prediction results were analyzed.