This dissertation is about extracting and making use of the structure
and hierarchy present in images. We develop a new low-level, multiscale,
hierarchical image segmentation algorithm designed to detect image regions
regardless of their shapes, sizes, and levels of interior homogeneity. We model
a region as a connected set of pixels surrounded by ramp edge discontinuities
whose magnitude is large compared to the variation inside the region. Each
region is associated with a scale determined by the magnitude of the weakest
part of its boundary. By traversing the range of all possible scales, we obtain
all regions present in the image. Regions strictly merge as the scale increases;
hence a tree is formed in which the root node corresponds to the whole image,
nodes close to the root along a path are large, and their child nodes are
smaller and capture embedded details.
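This scale-indexed nesting can be sketched in code. The following is a minimal illustration, assuming regions have already been detected and annotated with their scales; the `Region` class and `build_region_tree` function are hypothetical names, not the dissertation's implementation:

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple

@dataclass
class Region:
    """A node in the segmentation tree (illustrative structure)."""
    pixels: FrozenSet[Tuple[int, int]]   # pixel coordinates covered by the region
    scale: float                         # contrast of the weakest boundary part
    children: List["Region"] = field(default_factory=list)

def build_region_tree(regions: List[Region]) -> Region:
    """Nest regions by increasing scale: each region absorbs, as children,
    the previously placed regions whose pixels it contains."""
    roots: List[Region] = []
    for reg in sorted(regions, key=lambda r: r.scale):   # fine to coarse
        remaining = []
        for r in roots:
            if r.pixels <= reg.pixels:     # r is nested inside reg
                reg.children.append(r)
            else:
                remaining.append(r)
        roots = remaining + [reg]
    # The whole image is assumed to appear as the largest-scale region.
    assert len(roots) == 1
    return roots[0]
```

Sorting by scale guarantees that every region becomes a child of the smallest enclosing region surviving to a larger scale, which is exactly the strict-merging property stated above.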
To evaluate the accuracy and precision of our algorithm, and to compare
it with existing algorithms, we develop a new benchmark dataset for
low-level image segmentation. In this benchmark, small patches of many images
are hand-segmented by human subjects. We provide evaluation methods
for both boundary-based and region-based performance of algorithms. We
show that our proposed algorithm outperforms existing low-level segmentation
algorithms on this benchmark.
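As a concrete illustration of boundary-based scoring, precision and recall over matched boundary pixels can be computed roughly as follows. This is a simplified sketch with a square matching tolerance; the benchmark's actual matching protocol may differ:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def boundary_precision_recall(pred: np.ndarray, gt: np.ndarray, tol: int = 2):
    """Precision/recall between predicted and ground-truth boundary maps.
    `pred` and `gt` are boolean arrays; a boundary pixel counts as matched
    if a counterpart lies within `tol` pixels (a common, simplified choice)."""
    struct = np.ones((2 * tol + 1, 2 * tol + 1), dtype=bool)
    gt_near = binary_dilation(gt, structure=struct)
    pred_near = binary_dilation(pred, structure=struct)
    precision = (pred & gt_near).sum() / max(pred.sum(), 1)
    recall = (gt & pred_near).sum() / max(gt.sum(), 1)
    return precision, recall
```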
Next, we investigate the segmentation-based statistics of natural images.
Such statistics capture geometric and topological properties of images that
cannot be obtained using pixel-, patch-, or subband-based methods. We compile
segmentation statistics from a large number of images and propose a Markov
random field based model for estimating them. Our estimates confirm some
previously reported statistical properties of natural images and yield new
ones. To demonstrate the value of these statistics, we successfully use them
as priors in image classification and semantic image segmentation.
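For concreteness, simple geometric and topological statistics of the kind described here, such as region areas and the number of neighbors per region, can be gathered from a segmentation label map along these lines (an illustrative sketch, not the dissertation's code):

```python
import numpy as np

def region_statistics(labels: np.ndarray):
    """From an integer label map, collect per-region area (geometric)
    and the number of distinct adjacent regions (topological)."""
    areas = np.bincount(labels.ravel())
    # Horizontally and vertically adjacent pixel pairs with different labels
    # reveal which regions touch; store each pair once, unordered.
    pairs = set()
    for a, b in [(labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])]:
        diff = a != b
        pairs.update((min(i, j), max(i, j))
                     for i, j in zip(a[diff].tolist(), b[diff].tolist()))
    degree = np.zeros(labels.max() + 1, dtype=int)
    for i, j in pairs:
        degree[i] += 1
        degree[j] += 1
    return areas, degree
```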
We also investigate the importance of different visual cues for describing
image regions when solving the region correspondence problem. We design
psychophysical experiments to learn the weights of these cues by measuring
their impact on binocular fusibility in human subjects.
Using a head-mounted display, we show a set of elliptical regions to one eye
and slightly different versions of the same set of regions to the other eye
of human subjects. We then ask them whether the ellipses fuse or not. By
systematically varying the parameters of the elliptical shapes, and testing for
fusion, we learn a perceptual distance function between two elliptical regions.
We evaluate this function on ground-truth stereo image pairs.
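A perceptual distance of the kind learned from these fusion judgments could take the following weighted form; the five-parameter ellipse encoding and the weight values below are placeholders, not the experimentally learned quantities:

```python
import numpy as np

def ellipse_distance(p: np.ndarray, q: np.ndarray, w: np.ndarray) -> float:
    """Weighted distance between two ellipse parameter vectors.
    Each vector packs, e.g., center x/y, axis lengths, and orientation;
    the weights w would be fitted to the human fusion data."""
    d = p - q
    return float(np.sqrt(np.sum(w * d * d)))

# Hypothetical usage: two nearly identical ellipses, placeholder weights.
p = np.array([0.0, 0.0, 2.0, 1.0, 0.30])   # cx, cy, a, b, theta
q = np.array([0.1, 0.0, 2.2, 1.0, 0.25])
w = np.array([1.0, 1.0, 0.5, 0.5, 2.0])
print(ellipse_distance(p, q, w))
```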
Finally, we propose a novel multiple instance learning (MIL) method. In
MIL, in contrast to classical supervised learning, the entities to be classified
are called bags, each of which contains an arbitrary number of elements
called instances. We propose an additive model for bag classification where
we exploit the idea of searching for discriminative instances, which we call
prototypes. We show that our bag classifier can be learned in a boosting
framework, leading to an iterative algorithm that learns prototype-based
weak learners and linearly combines them. At each iteration of our proposed
method, we search for a new prototype so as to maximally discriminate between
the positive and negative bags, which are themselves weighted according
to how well they were discriminated in earlier iterations. Unlike previous
instance-selection-based MIL methods, we do not restrict the prototypes to
a discrete set of training instances but allow them to take arbitrary values
in the instance feature space. We also do not restrict the total number of
prototypes or the number of selected instances per bag; these quantities are
completely data-driven. We show that our method outperforms state-of-the-art
MIL methods on a number of benchmark datasets. We also apply our
method to large-scale image classification, where we show that the automatically
selected prototypes map to visually meaningful image regions.
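The boosting loop described above can be sketched as follows. The max-over-instances bag response, the RBF similarity, and the random candidate search are illustrative simplifications; in particular, the dissertation's method optimizes prototypes over the continuous feature space rather than over a candidate set:

```python
import numpy as np

def bag_response(bag: np.ndarray, prototype: np.ndarray, sigma: float = 1.0):
    """Soft 'does this bag contain an instance near the prototype?' score:
    the maximum RBF similarity over the bag's instances."""
    d2 = np.sum((bag - prototype) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2)).max()

def boost_prototypes(bags, labels, n_rounds=10, n_candidates=200, rng=None):
    """Toy boosting loop: each round picks the prototype (searched here over
    random candidates drawn from the instances; a continuous optimizer could
    replace this) that best separates the currently weighted bags."""
    rng = rng or np.random.default_rng(0)
    y = np.asarray(labels, dtype=float) * 2 - 1          # {0,1} -> {-1,+1}
    w = np.full(len(bags), 1.0 / len(bags))              # bag weights
    all_instances = np.vstack(bags)
    prototypes, alphas = [], []
    for _ in range(n_rounds):
        cands = all_instances[rng.choice(len(all_instances), n_candidates)]
        best = max(cands, key=lambda c: abs(np.dot(
            w * y, [2 * bag_response(b, c) - 1 for b in bags])))
        h = np.array([2 * bag_response(b, best) - 1 for b in bags])
        err = np.clip(np.sum(w[h * y < 0]), 1e-6, 1 - 1e-6)
        alpha = 0.5 * np.log((1 - err) / err)            # AdaBoost step size
        w *= np.exp(-alpha * y * h)                      # reweight bags by how
        w /= w.sum()                                     # well they were split
        prototypes.append(best)
        alphas.append(alpha)
    return prototypes, alphas
```

The reweighting line is what focuses each subsequent prototype search on the bags that earlier prototypes discriminated poorly, matching the iterative behavior described above.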