In this dissertation, we propose models and methods targeting image understanding
tasks. In particular, we focus on Fisher kernel based approaches for the image classification
and object localization problems. We group our studies into the following
three main chapters.
First, we propose novel image descriptors based on non-i.i.d. image models.
Our starting point is the observation that local image regions are implicitly assumed
to be independent and identically distributed (i.i.d.) in the bag-of-words
(BoW) model. We introduce non-i.i.d. models by treating the parameters of the
BoW model as latent variables, which renders all local regions dependent. Using
the Fisher kernel framework we encode an image by the gradient of the data log-likelihood
with respect to model hyper-parameters. Our representation naturally
involves discounting transformations, providing an explanation of why such transformations
have proven successful. Using variational inference we extend the basic
model to include Gaussian mixtures over local descriptors, and latent topic models
to capture the co-occurrence structure of visual words.
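
The link between the Fisher kernel and discounting can be illustrated with a minimal sketch. It assumes, purely for illustration, that the BoW multinomial has a Dirichlet prior with hyper-parameters alpha, so that the visual word counts of an image follow a Polya (Dirichlet compound multinomial) distribution. The function names are hypothetical, and the normalization by the Fisher information matrix is omitted.

```python
import math

def digamma(x):
    """Digamma function for x > 0 (recurrence, then asymptotic series)."""
    result = 0.0
    while x < 6.0:               # shift x upward using psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def polya_log_likelihood(counts, alpha):
    """Log-likelihood of a visual word sequence with the given counts
    under the (assumed) Dirichlet compound multinomial model."""
    a0, n = sum(alpha), sum(counts)
    ll = math.lgamma(a0) - math.lgamma(a0 + n)
    for c, a in zip(counts, alpha):
        ll += math.lgamma(a + c) - math.lgamma(a)
    return ll

def polya_fisher_gradient(counts, alpha):
    """Gradient of the log-likelihood with respect to each alpha_k.
    This unnormalized gradient plays the role of the image descriptor."""
    a0, n = sum(alpha), sum(counts)
    shared = digamma(a0) - digamma(a0 + n)   # identical for every visual word
    return [shared + digamma(a + c) - digamma(a) for c, a in zip(counts, alpha)]

if __name__ == "__main__":
    counts = [0, 1, 2, 10]       # toy visual word counts for one image
    alpha = [0.5] * 4            # toy hyper-parameters
    print(polya_fisher_gradient(counts, alpha))
```

In this sketch each count enters the descriptor through digamma(alpha_k + n_k), which grows roughly logarithmically, so additional occurrences of an already frequent visual word contribute less and less. This is one way to see how a discounting transformation can arise from a non-i.i.d. model.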
Second, we present an object detection system based on the high-dimensional
Fisher vector image representation. For computational and storage efficiency, we
use a recent segmentation-based method to generate class-independent object detection
hypotheses, in combination with data compression techniques. Our main
contribution is a method to produce tentative object segmentation masks to suppress
background clutter in the features. We show that re-weighting the local image
features based on these masks improves object detection performance significantly.
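
The re-weighting idea can be sketched as follows. For brevity the sketch aggregates local features into a weighted bag-of-words histogram rather than a Fisher vector, and the function name, the feature format and the mask format are all assumptions made for illustration.

```python
def masked_bow_histogram(features, mask, n_words):
    """Aggregate local features into a histogram, weighting each feature
    by the tentative segmentation mask value at its location.

    features: iterable of (row, col, visual_word_index) tuples
    mask:     2D list of weights in [0, 1], high where the object is likely
    """
    hist = [0.0] * n_words
    total = 0.0
    for row, col, word in features:
        weight = mask[row][col]      # features on background get a low weight
        hist[word] += weight
        total += weight
    if total > 0:
        hist = [h / total for h in hist]   # normalize by the total weight
    return hist

if __name__ == "__main__":
    mask = [[1.0, 0.0],
            [0.0, 1.0]]
    features = [(0, 0, 0), (0, 1, 1), (1, 1, 0)]
    print(masked_bow_histogram(features, mask, n_words=2))
```

With a binary mask this reduces to discarding features outside the mask. A soft mask instead down-weights likely background features without removing them entirely.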
Third, we propose a weakly supervised object localization approach. Standard
supervised training of object detectors requires bounding box annotations of object
instances. This time-consuming annotation process is sidestepped in weakly
supervised learning, which requires only binary class labels that indicate the
absence/presence of object instances. We follow a multiple-instance learning approach
that iteratively trains the detector and infers the object locations. Our main
contribution is a multi-fold multiple-instance learning procedure, which prevents
training from prematurely locking onto erroneous object locations. We show that
this procedure is particularly important when high-dimensional representations,
such as Fisher vectors, are used.
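
The multi-fold procedure can be sketched roughly as below. The sketch assumes that the positive images are split into folds and that the windows of each fold are re-localized by a detector trained only on the other folds. The initialization from the first candidate window, the toy mean-difference detector and all names are illustrative assumptions, not the system described here.

```python
import random

def multifold_mil(pos_bags, neg_feats, train_fn, score_fn,
                  n_folds=3, n_iters=5, seed=0):
    """pos_bags: one list of candidate window features per positive image.
    Returns the selected window index per image and the final detector.
    Requires n_folds >= 2 so that every fold has training positives."""
    rng = random.Random(seed)
    selected = [0] * len(pos_bags)   # assumption: window 0 is the initial guess
    order = list(range(len(pos_bags)))
    for _ in range(n_iters):
        rng.shuffle(order)
        folds = [order[k::n_folds] for k in range(n_folds)]
        new_selected = list(selected)
        for held_out in folds:
            held = set(held_out)
            # Train only on positives from the other folds.
            train_pos = [pos_bags[i][selected[i]] for i in order if i not in held]
            model = train_fn(train_pos, neg_feats)
            # Re-localize the held-out images with that detector.
            for i in held_out:
                scores = [score_fn(model, x) for x in pos_bags[i]]
                new_selected[i] = max(range(len(scores)), key=scores.__getitem__)
        selected = new_selected
    final_pos = [pos_bags[i][selected[i]] for i in range(len(pos_bags))]
    return selected, train_fn(final_pos, neg_feats)

def train_mean_difference(pos, neg):
    """Toy detector: difference between positive and negative feature means."""
    dim = len(pos[0])
    mean = lambda rows, d: sum(r[d] for r in rows) / len(rows)
    return [mean(pos, d) - mean(neg, d) for d in range(dim)]

def score_linear(model, x):
    return sum(w * v for w, v in zip(model, x))

if __name__ == "__main__":
    # Window 0 mimics a full-image window, window 1 background, window 2 the object.
    bags = [[[0.5, 0.5], [0.1 * (i % 2), 0.0], [1.0, 1.0 - 0.05 * (i % 3)]]
            for i in range(6)]
    negatives = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
    print(multifold_mil(bags, negatives, train_mean_difference, score_linear)[0])
```

Because the detector that re-localizes an image never saw that image's current window during training, it cannot simply confirm its own previous choice. That is the intuition behind avoiding premature lock-in, which matters most when the features are high-dimensional enough to fit the training windows almost perfectly.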
Finally, in the appendix of the thesis, we present our work on person identification
in uncontrolled TV videos. We show that cast-specific distance metrics can be
learned without labeling any training examples by utilizing face pairs within tracks
and across temporally overlapping tracks. We show that the obtained metrics improve
face-track identification, recognition and clustering performance.
Keywords
Image classification, object detection, weakly supervised training, computer vision,
machine learning.