Welcome to my site
This is my personal site which plays host to my publications, videos, and some information about myself.
I work on Machine Learning and Computer Vision and
recently defended my Computer Science PhD at New York University in those fields.
I have since launched a company called Clarifai to bring
large-scale deep learning into everyday use. Through Clarifai, we won the ImageNet
2013 Classification Challenge (results) and have
since been making huge strides in improving accuracy, speed, and memory usage of these
models.
News & Events
Recent Publications
Visualizing and Understanding Convolutional Networks
M.D. Zeiler, R. Fergus
ECCV 2014 (Honourable Mention for Best Paper Award), arXiv:1311.2901 (Nov 28, 2013)
Abstract
Large Convolutional Network models have recently demonstrated impressive
classification performance on the ImageNet benchmark. However there is no
clear understanding of why they perform so well, or how they might be
improved. In this paper we address both issues. We introduce a novel
visualization technique that gives insight into the function of
intermediate feature layers and the operation of the classifier. We also
perform an ablation study to discover the performance contribution from
different model layers. This enables us to find model architectures that
outperform Krizhevsky et al. on the ImageNet classification benchmark. We
show our ImageNet model generalizes well to other datasets: when the
softmax classifier is retrained, it convincingly beats the current
state-of-the-art results on Caltech-101 and Caltech-256 datasets.
Hierarchical Convolutional Deep Learning in Computer Vision
M.D. Zeiler. Advisor: R. Fergus
Unpublished PhD Thesis (Nov 8, 2013)
Abstract
It has long been the goal in computer vision to learn a hierarchy of
features useful for object recognition. Spanning the two traditional
paradigms of machine learning, unsupervised and supervised learning, we
investigate the application of deep learning methods to tackle this
challenging task and to learn robust representations of images.
We begin our investigation with the introduction of a novel unsupervised
learning technique called deconvolutional networks. Based on convolutional
sparse coding, we show this model learns interesting decompositions of
images into parts without object label information. This method, which
easily scales to large images, becomes increasingly invariant by learning
multiple layers of feature extraction coupled with pooling layers. We
introduce a novel pooling method called Gaussian pooling to enable these
layers to store continuous location information while being
differentiable, creating a unified objective function to optimize.
In the supervised learning domain, a well-established model for
recognition of objects is the convolutional network. We introduce a new
regularization method for convolutional networks called stochastic pooling
which relies on sampling noise to prevent these powerful models from
overfitting. Additionally, we show novel visualizations of these complex
models to better understand what they learn and to provide insight on how
to develop state-of-the-art architectures for large-scale classification
of 1,000 different object categories.
We also investigate related problems in deep learning. First, we introduce
a model for the task of mapping one high dimensional time series sequence
onto another. Second, we address the choice of nonlinearity in neural
networks, showing evidence that rectified linear units outperform other
types in automatic speech recognition. Finally, we introduce a novel
optimization method called ADADELTA which shows promising convergence
speeds in practice whilst being robust to hyper-parameter selection.
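As a rough illustration of the ADADELTA method mentioned at the end of the abstract above, the sketch below follows the per-dimension update rule given in the ADADELTA report (arXiv:1212.5701). The function names, the state dictionary, the default values of rho and eps, and the toy quadratic objective are assumptions made for this example, not code from the thesis.

```python
import numpy as np


def init_state(x):
    """Create zeroed running averages (hypothetical helper for this sketch)."""
    return {"Eg2": np.zeros_like(x), "Edx2": np.zeros_like(x)}


def adadelta_step(x, grad, state, rho=0.95, eps=1e-6):
    """Apply one ADADELTA update and return the new parameters."""
    # Running average of squared gradients.
    state["Eg2"] = rho * state["Eg2"] + (1.0 - rho) * grad ** 2
    # Per-dimension step: RMS of past updates divided by RMS of gradients.
    dx = -np.sqrt(state["Edx2"] + eps) / np.sqrt(state["Eg2"] + eps) * grad
    # Running average of squared updates.
    state["Edx2"] = rho * state["Edx2"] + (1.0 - rho) * dx ** 2
    return x + dx


# Toy problem (assumed for illustration): minimise f(x) = sum(x**2),
# whose gradient is 2 * x.
x = np.array([3.0, -2.0])
state = init_state(x)
for _ in range(100):
    x = adadelta_step(x, 2.0 * x, state)
```

No learning rate appears in the update: the step size in each dimension comes from the two running averages, which is the property the abstract refers to as robustness to hyper-parameter selection.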
Regularization of Neural Networks using DropConnect
L. Wan, M.D. Zeiler, S. Zhang, Y. LeCun, R. Fergus
ICML 2013 (June 16, 2013)
Abstract
We introduce DropConnect, a generalization of Dropout (Hinton et al.,
2012), for regularizing large fully-connected layers within neural
networks. When training with Dropout, a randomly selected subset of
activations are set to zero within each layer. DropConnect instead sets
a randomly selected subset of weights within the network to zero. Each
unit thus receives input from a random subset of units in the previous
layer. We derive a bound on the generalization performance of both
Dropout and DropConnect. We then evaluate DropConnect on a range of
datasets, comparing to Dropout, and show state-of-the-art results on
several image recognition benchmarks by aggregating multiple
DropConnect-trained models.
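The difference between the two schemes described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's implementation: the function names and the keep_prob parameter are assumptions, a single mask is shared across the whole batch, and the nonlinearity and any test-time rescaling are omitted.

```python
import numpy as np


def dropout_layer(x, W, keep_prob, rng):
    """Dropout: zero a random subset of the layer's output activations."""
    a = x @ W
    mask = rng.random(a.shape) < keep_prob  # one Bernoulli draw per activation
    return a * mask


def dropconnect_layer(x, W, keep_prob, rng):
    """DropConnect: zero a random subset of the weights instead."""
    mask = rng.random(W.shape) < keep_prob  # one Bernoulli draw per weight
    # Each output unit now sees only a random subset of the input units.
    return x @ (W * mask)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # batch of 4 inputs with 8 features (assumed sizes)
W = rng.standard_normal((8, 5))    # fully-connected layer with 5 output units
out_dropout = dropout_layer(x, W, keep_prob=0.5, rng=rng)
out_dropconnect = dropconnect_layer(x, W, keep_prob=0.5, rng=rng)
```

With keep_prob set to 1.0 both functions reduce to the plain linear layer x @ W, which is a quick sanity check on the masking.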
On Rectified Linear Units for Speech Processing
M.D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G.E. Hinton
ICASSP 2013 (May 26, 2013)
Abstract
Deep neural networks have recently become the gold standard for acoustic
modeling in speech recognition systems. The key computational unit of a
deep network is a linear projection followed by a point-wise
non-linearity, which is typically a logistic function. In this work, we
show that we can improve generalization and make training of deep networks
faster and simpler by substituting the logistic units with rectified
linear units. These units are linear when their input is positive and zero
otherwise. In a supervised setting, we can successfully train very deep
nets from random initialization on a large vocabulary speech recognition
task achieving lower word error rates than using a logistic network with
the same topology. Similarly in an unsupervised setting, we show how we
can learn sparse features that can be useful for discriminative tasks. All
our experiments are executed in a distributed environment using several
hundred machines and several hundred hours of speech data.
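The substitution described above, a linear projection followed by either a logistic or a rectified linear nonlinearity, can be illustrated with a minimal sketch. The function names and array sizes are assumptions made for this example and are not taken from the paper's distributed implementation.

```python
import numpy as np


def logistic(z):
    """Logistic unit: squashes its input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def relu(z):
    """Rectified linear unit: linear for positive input, zero otherwise."""
    return np.maximum(z, 0.0)


def layer(x, W, b, nonlinearity):
    """Linear projection followed by a point-wise nonlinearity."""
    return nonlinearity(x @ W + b)


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 6))   # 2 input frames with 6 features (assumed sizes)
W = rng.standard_normal((6, 3))
b = np.zeros(3)
h_logistic = layer(x, W, b, logistic)
h_relu = layer(x, W, b, relu)
```

The rectified output is exactly zero wherever the projection is negative, so the resulting activations are sparse, whereas logistic outputs are never exactly zero.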
Stochastic Pooling for Regularization of Deep Convolutional Neural Networks
Matthew D. Zeiler and Rob Fergus
ICLR 2013 (May 2, 2013)
Supplemental Material:  Images  Videos
Abstract
We introduce a simple and effective method for regularizing large
convolutional neural networks. We replace the conventional deterministic
pooling operations with a stochastic procedure, randomly picking the
activation within each pooling region according to a multinomial
distribution, given by the activities within the pooling region. The
approach is hyper-parameter free and can be combined with other
regularization approaches, such as dropout and data augmentation. We achieve
state-of-the-art performance on four image datasets, relative to other
approaches that do not utilize data augmentation.
ADADELTA: An Adaptive Learning Rate Method
Matthew D. Zeiler
arXiv:1212.5701 (Dec 27, 2012)
Abstract
We present a novel per-dimension learning rate method for gradient
descent called ADADELTA. The method dynamically adapts over time using only
first order information and has minimal computational overhead beyond
vanilla stochastic gradient descent. The method requires no manual tuning of
a learning rate and appears robust to noisy gradient information, different
model architecture choices, various data modalities and selection of
hyperparameters. We show promising results compared to other methods on the
MNIST digit classification task using a single machine and on a large scale
voice dataset in a distributed cluster environment.
Differentiable Pooling for Hierarchical Feature Learning
Matthew D. Zeiler and Rob Fergus
arXiv:1207.0151v1 (July 3, 2012)
Abstract
We introduce a parametric form of pooling, based on a Gaussian,
which can be optimized alongside the features in a single global objective
function. By contrast, existing pooling schemes are based on heuristics
(e.g. local maximum) and have no clear link to the cost function of the
model. Furthermore, the variables of the Gaussian explicitly store location
information, distinct from the appearance captured by the features, thus
providing a what/where decomposition of the input signal. Although the
differentiable pooling scheme can be incorporated in a wide range of
hierarchical models, we demonstrate it in the context of a Deconvolutional
Network model (Zeiler et al. ICCV 2011). We also explore a number of
secondary issues within this model and present detailed experiments on MNIST
digits.
Recent Software Added
Adaptive Deconvolutional Network Toolbox
This toolbox includes code that implements an Adaptive Deconvolutional Network as described in the paper
Adaptive Deconvolutional Networks for Mid and High Level Feature Learning. It may also be used to implement a Deconvolutional Network as described in the paper
Deconvolutional Networks, though this is no longer the recommended method. The toolbox has functions to train a Deconvolutional Network, to visualize the learned filters, and to reconstruct a new image from a trained model. There are also files for making descriptors that can be used with
Svetlana Lazebnik's Spatial Pyramid Matching code with a few minor modifications. The Deconvolutional Network Toolbox also works with (and now includes) the
IPP Convolutions Toolbox, which drastically improves performance (just ensure the IPP Convolutions Toolbox files are on your MATLAB path and are compiled with your IPP libraries).
Download (.zip) 
Documentation (html)
Suggested Software
Eero Simoncelli's Matlab Pyramid Toolbox
The MEX files contained within this package significantly speed up the LUT performance of the Deconvolutional Network (used for hyperlaplacian priors other than L1-norm).
Related Publications
Adaptive Deconvolutional Networks for Mid and High Level Feature Learning
Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus
International Conference on Computer Vision (November 6-13, 2011)
Supplemental Material:  Slides
Abstract
We present a hierarchical model that learns image decompositions
via alternating layers of convolutional sparse coding and max pooling. When
trained on natural images, the layers of our model capture image information
in a variety of forms: low-level edges, mid-level edge junctions,
high-level object parts and complete objects. To build our model we rely on
a novel inference scheme that ensures each layer reconstructs the input,
rather than just the output of the layer directly beneath, as is common with
existing hierarchical approaches. This makes it possible to learn multiple
layers of representation and we show models with 4 layers, trained on
images from the Caltech-101 and 256 datasets. When combined with a standard
classifier, features extracted from these models outperform SIFT, as well as
representations from other feature learning methods.
Deconvolutional Networks
Matthew D. Zeiler, Dilip Krishnan, Graham W. Taylor, and Rob Fergus
Computer Vision and Pattern Recognition (June 13-18, 2010)
Supplemental Material:  Images  Videos
Abstract
Building robust low and mid-level image representations, beyond
edge primitives, is a long-standing goal in vision. Many existing feature
detectors spatially pool edge information which destroys cues such as edge
intersections, parallelism and symmetry. We present a learning framework
where features that capture these mid-level cues spontaneously emerge from
image data. Our approach is based on the convolutional decomposition of
images under a sparsity constraint and is totally unsupervised. By building
a hierarchy of such decompositions we can learn rich feature sets that are a
robust image representation for both the analysis and synthesis of
images.
Deconvolutional Networks for Feature Learning
Matthew D. Zeiler, Dilip Krishnan, Graham W. Taylor, and Rob Fergus
The Learning (Snowbird) Workshop (April 6-9, 2010)
Supplemental Material:  Images  Videos
Abstract
Building robust low-level image representations, beyond edge
primitives, is a long-standing goal in vision. In its most basic form, an
image is a matrix of intensities. How we should progress from this matrix to
stable mid-level representations, useful for high-level vision tasks,
remains unclear. Popular feature representations such as SIFT or HOG
spatially pool edge information to form descriptors that are invariant to
local transformations. However, in doing so important cues such as edge
intersections, grouping, parallelism and symmetry are lost...
Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines
This toolbox provides MATLAB implementations of ioTRBMs and FIOTRBM models for use in facial retargeting experiments.
Download (.zip) 
Documentation (html)
Related Publications
Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines
Matthew D. Zeiler, Graham W. Taylor, Leonid Sigal, Iain Matthews, and Rob Fergus
Neural Information Processing Systems (December 12-17, 2011)
Supplemental Material:  Supplementary  Videos
Abstract
We present a type of Temporal Restricted Boltzmann Machine that
defines a probability distribution over an output sequence conditional on
an input sequence. It shares the desirable properties of RBMs: efficient
exact inference, an exponentially more expressive latent state than HMMs,
and the ability to model nonlinear structure and dynamics. We apply our
model to a challenging real-world graphics problem: facial expression
transfer. Our results demonstrate improved performance over several
baselines modeling high-dimensional 2D and 3D data.