In this paper, we propose a new approach to learning a discriminative model for figure/ground segmentation by incorporating the bagoffeatures and conditional random field (CRF) techniques. We advocate the use of image patches instead of superpixels as the basic processing unit. The latter has a homogeneous appearance and adheres to object boundaries, while an image patch often contains more discriminative information (e.g., local image structure) to distinguish its categories. We use pixellevel sparse coding to represent an image patch. With the proposed feature representation, the unary classifier achieves a considerable binary segmentation performance. Further, we integrate unary and pairwise potentials into the CRF model to refine the segmentation results. The pairwise potentials include color and texture potentials with neighborhood interactions, and an edge potential. High segmentation accuracy is demonstrated on three benchmark datasets: the Weizmann horse dataset, the VOC2006 cow dataset, and the MSRC multiclass dataset. Extensive experiments show that the proposed approach performs favorably against the stateoftheart approaches.
I. INTRODUCTION
Figure/ground segmentation aims at partitioning an image into regions of coherent properties as a means for separating objects from their backgrounds. Considerable effort has been made to develop many advanced techniques in the recent years, where learning segmentation has attracted considerable attention of researchers because of its significant performance in classification applications. Further, probabilistic graphical models have also been remarkably successful in segmentation applications.
In any image system, feature representation is crucial to enhancing the system performance. How to manifest an image and how to capture salient properties of the object regions are still challenging problems. The bagoffeatures (BoF) model
[1

3]
has been widely used in the field of image processing. The model treats an image as a collection of unordered appearance descriptors extracted from local patches, quantizes them into discrete ‘visual words,’ and then computes a compact histogram representation. In this work, we propose a patchlevel BoF model to effectively represent an image patch from raw image data. By pixellevel dictionary learning, sparse coding, and spatial pyramid matching, the feature representation can capture the salient properties of the image patch, thus resulting in high patchwise segmentation accuracy.
Learning segmentation converts the image segmentation problem into a data clustering problem of image elements. One of the core challenges for machine learning is to discover what kind of information can be learned from the data sources and cluster this data into segments depicting the same object. Ren and Malik
[4]
proposed a classification model for segmentation, which feeds the Gestalt grouping cues into a linear classifier to discriminate between good and bad segmentation. Wu and Nevatia
[5]
developed a method to simultaneously detect and segment objects by boosting the edgelet feature classifiers. Duygulu et al.
[6]
modeled object recognition as a process of annotating image regions with words, and learning a mapping between region types and keywords by using an EM algorithm.
Probabilistic graphical models usually construct a cost function on the basis of some image constraints and formulate the image segmentation problem as a stochastic optimization problem. A condition random field (CRF) provides a principled approach to incorporating datadependent interactions; the complex joint probability distribution need not be modeled in this case. In this work, we use the CRF model to fuse multiple visual cues. For the CRF model, the definition of unary and pairwise potentials is very important. Previously, the unary potential was directly defined on feature spaces
[7]
. Lately, researchers have paid more attention on using a classifier to generate a unary potential
[8

15]
, and most of them prefer using a pixel or a superpixel as the basic processing unit. In contrast, we use a regular image patch. Image patches on object boundaries contain rich local structure information of an object (see
Fig. 1
). While superpixels usually have a homogeneous appearance and an almost uniform size along with edge preservation, particularly for weak boundaries, these properties weaken the discriminative capability of the unary classifier when the superpixel is taken as the sampling unit.
Some image patches on object boundaries. Ignoring their color information, we find that there are many similar spatial structures, which can be considered the common characteristics of the objects.
Our main contributions in this paper are twofold. First, we use an image patch as a sample of a unary classifier and propose an upgrade patch feature representation based on pixellevel sparse coding, which can capture more structure information of the local contour of objects. Second, we propose the color and texture pairwise potentials with neighborhood interactions and an edge potential representing edge continuity, which are validated to be very effective in our experiments.
II. CONDITIONAL RANDOM FIELDS
CRFs are probabilistic models for segmenting data with structured labels
[16]
, which are defined on a twodimensional discrete lattice, every site on which corresponds to a graph node. Let
G
(
V,E
) be an undirected graph with image patches as nodes
V
and the links between pairs of nodes as edges
E
. CRFs directly model the distribution
P
L

I,w,v
of node labels
L
conditional on image data
I
for node parameters
w
and edge parameters
v
. We are interested in finding the labels
where
l_{i}
denotes the label of the
i
^{th}
node. In this work, we are concerned with binary segmentation (foreground and background), i.e.,
l_{i}
∈ {1,1}. The joint distribution over the labels
L
given the observations
I
can be expressed as follows:
where
z
represents a normalized constant known as the partition function,
N_{i}
denotes the set of neighbors of the
i
^{th}
node in graph
G
, and
A_{i}
and
I_{ij}
indicate the unary and pairwise potentials, respectively. The unary potential is modeled using a local discriminative model that decides the association of a given node to a certain class, ignoring the interaction of its neighbors. In contrast, the pairwise potential is regarded as a datadependent smoothing function that denotes the interaction between two nodes. Both terms explicitly depend on a predefined set of features from
I
.
III. MAIN WORK
We use a CRF model to learn the conditional distribution over figure/ground labeling given an image, which allows us to incorporate different levels and different types of features in a single unified model.
 A. Unary Potentials
In this work, the unary potentials are defined by the prediction probability obtained from a linear support vector machine (SVM) classifier. Different from the existing feature descriptions, we train a pixellevel overcomplete dictionary to sparsely represent image patches in a highdimensional space.
 1) PixelLevel Texture Descriptor
Gabor wavelets have received considerable attention because of biological reasons and their optimal resolution in both frequency and spatial domains. The Gabor wavelet representation can capture the local structure corresponding to the spatial scale, spatial localization, and orientation selectivity. It can characterize the spatial frequency structure in the image, while preserving the information of spatial relations. However, many existing image representation approaches in the Gabor domain merely consider the magnitude information. In this work, we proposed a new pixellevel feature descriptor, which fuses the Gabor magnitude and the Gabor phase.
To eliminate local noise interference, a simple smoothing filter is used for removing image noise in advance. Then, we perform the Gabor transform in
D
directions and
S
scales on a given gray image, and respectively, denote the magnitude response and the phase response in direction
θ
and scale
σ
as
ρ_{θσ}
and
α_{θσ}
. Further, the 2
π
phase space is uniformly quantized into
J
intervals as
Φ
_{j}
= [
ø
_{j,}
_{min}
,
ø
_{j,}
_{min}
+
ς
] ,
j
= 1,...
J
.
ς
= 2
π
/
J
represents the quantization step, and
ø
_{j,}
_{min}
denotes the margin value between two phase intervals
Φ
_{j}
_{1}
and
Φ
_{j}
. Suppose that phase response
α_{θσ}
belongs to the
j
^{th}
interval
Φ
_{j}
. Then, we compute Eq. (2) to get a
J
dimensional vector
y
_{θσ}
in direction
θ
and scale
σ
as follows:
where
An example of diagram
J
= 8 is shown in
Fig. 2
, in which the phase space is quantized into eight intervals and
ς
=
π
/4. Assuming the phase response
α_{θσ}
∈
Φ
_{2}
, we can update the margin value as
the
J
dimensional vector
An example diagram of J = 8 ; phase response α_{θσ} belongs to the 2^{nd} phase interval Φ_{2}.
We concatenate all vectors
y
_{θσ}
of the given
D
directions and
S
scales as the pixellevel descriptor
y
= [
y
_{1,1}
,
y
_{1,2}
,...,
y
_{D,S}
] . The
ℓ
=
D
×
S
×
J
dimensional feature vector not only describes the distribution of the phase response in each scale and direction but also reflects the magnitude response.
 2) PixelLevel Dictionary Learning and Coding
Sparse representations have demonstrated considerable success in numerous applications, and the sparse modeling of signals has been proven to be very effective in signal reconstruction and classification. We randomly sample some pixels from the training image set to learn an overcomplete pixellevel dictionary. Assuming that we collect N training samples, we define a matrix
Y
∈
R
^{ℓ×N}
as the columns of samples:
where
y
_{i}
stands for the
ℓ
dimensional texture feature of the
i
^{th}
sample.
Using an overcomplete dictionary
D
∈
R
^{ℓ×L}
, which contains
L
atoms as column vectors, we can approximate the observed sample
y
well by using a sparse linear combination of these atoms. In particular, there exists a sparse coefficient vector
x
such that
y
can be approximated as
y
≈
D
_{x}
, where the vector
x
represents the weighted contribution of these atoms when reconstructing the observed sample. Given the training samples, we can learn the dictionary
D
by solving the following optimization problem:
where
λ
denotes a balance parameter and the second term enforces
x
to have a small number of nonzero elements.
The optimization problem is convex in
D
or
X
while fixing the other, but not in both simultaneously
[17]
. We solve it by alternating the optimization over
D
and
X
; the dictionary
D
can be initialized by randomly sampling
L
columns from
Y
or by Kmeans clustering. When fixing
D
, the optimization becomes a standard sparse coding problem, which can be solved very efficiently by using the featuresign search algorithm. When fixing
X
, the problem reduces to a least squares problem (as shown in Eq. (5)), which can be solved by using the Lagrange dual algorithm.
Once the overcomplete dictionary
D
is given, the texture feature
y
of each pixel can be coded as Ldimensional sparse vector
x
by solving the following l1norm regularization problem:
 3) Patch Feature Representation
The pivotal role of the unary potential in the CRFbased segmentation model has been demonstrated. It can be taken as a local decision term, which decides the association of a given graph node to a certain class. Usually, the use of the unary classifier alone leads to high accuracy as compared to the full CRF model as it can segment most parts of an object and loses only some details of the object boundaries.
In this work, we integrate the texture feature and the color feature to represent an image patch. We partition a patch into 1×2, 2×2 segments in two different scales, and then, compute the max pooling vector of the sparse codes of pixels within each of the five segments. We finally concatenate all the vectors to form a vector representation of the texture feature. The socalled spatial pyramid matching has had remarkable success in image classification applications. Color information is very useful for identifying the classes of image patches. For example, backgrounds (e.g., sky, water, grass, and tree) are usually distinguishable from objects (e.g., cow, sheep, and bird) in color. For a patch, we compute 64bin histograms in each CIE Lab color channel as its color feature and then, concatenate the texture vector and the color vector to form the final feature representation. In our experiments, we fixed the size of dictionary
D
as 2048; thus, the dimension of the patch feature is 2048×5+64×3 =10432 . We also find that max pooling outperforms the other alternative pooling methods.
 4) Unary Potential Computation
We train a binary linear SVM classifier to predict the figure/ground probabilities of an image patch, which are used for computing the unary potential. However, the boundaries of the foreground segmented by the CRF model using patches as graph nodes have a blocking effect. To generate results close to the ground truth, we split patches into some perceptually meaningful entities by using the oversegmentation boundary map, which is generated by an existing region merging method
[18]
. As shown in
Fig. 3
, a patch is split into several regions. These regions are considered the graph nodes, and their unary potential values are defined as the corresponding prediction probability of the host patch. That is, the regions split from a patch are assigned the same probability. In particular, the unary potential in Eq. (1) is defined as follows:
where
P
(
l_{i}
l
I
) denotes the local class posterior, that is, the prediction probability given by the unary classifier.
Split patches. (a) Patch partition, where each square stands for an image patch. (b) Overlaying a patch partition on the image segmentation. (c) Local magnifying map, where an irregular region stands for a split patch.
 B. Pairwise Potentials
After the unary binary classification, we can already obtain good segmentation results, but the classifier separately processes each image patch. The mutual dependence among neighboring patches is ignored, which results in some neighboring patches with a similar appearance being possibly improperly labeled as opposite classes (see
Fig. 4
). Therefore, as contextual knowledge is necessary for image segmentation, we define the pairwise potential to address this problem. In this work, the pairwise penalty
I_{ij}
is defined as the weight
w_{ij}
of a graph edge, that is,
Unary binary segmentation results (above: input images, below: binary classification results).
In image segmentation, the weights encode a graph affinity such that a pair of nodes with a high weight edge is considered to be strongly connected and edges with low weights represent the nearly disconnected nodes. We exploit the color, texture, and edge cues to model the connection between nodes and incorporate the three types of potentials in a unified CRF framework using prelearned parameters. Assuming that the superscripts
c, t
, and
e
denote the color, texture, and edge, respectively, we can rewrite
w_{ij}
(
I
) as follows:
where
g_{ij}
(
I
) represents a distance function defined over node pairs (
i, j
).
w_{ij}
(
I
) synthetically reflects the connectivity of nodes
i
and
j
on multiple feature spaces.
 1) Color Potential and Texture Potential
Color information is an essential and typical representation for images and is a key element for distinguishing objects. Mean and histogram are two common color descriptors. Mean only describes the average color component rather than the color distribution in a region. We use the CIE Lab histogram as a color descriptor for computing the color potential. Similarly, each channel is uniformly quantified into 64 levels, and then, three channels are concatenated to form a 192dimensional color vector. The experimental comparison demonstrates that the histogram descriptor is more effective than the mean descriptor, increasing the overall pixelwise labeling accuracy by 3.8%.
Every region in natural images is not isolated and is strongly connected with its adjacent regions. When computing
it is unreasonable to only use the node pair (
i, j
) and ignore the neighboring nodes. Therefore, we consider the neighborhood interactions in the main steps summarized in
Algorithm 1
.
where
D
(
i, j
) = [
_{X2}
(
h
_{i}
+
h
_{mi}
,
h
_{j}
) +
_{X2}
(
h
_{i}
,
h
_{j}
+
h
_{mi}
)] / 2.
Similarly, we can compute the texture potential
Further,
h
_{i}
stands for a texture histogram, which is computed as follows:
where
K
denotes the number of pixels in the region node
i
, and
x
represents the
L
dimensional sparse texture vector described in Section IIIA. Experiments demonstrate that the step of incorporating neighborhood interactions increases the overall pixelwise labeling accuracy by 2.3%. We also find that the color potential using the
Lab
/
_{X2}
descriptor outperforms the GMM color model as used in
[10]
.
 2) Edge Potential
Inside a very small region, edge information basically indicates local image shape priors; the regions belonging to the same object often have strong edge continuities, which are described as the edge potential in this paper. As shown in
Fig. 5
, we find that there are many edges (blue lines) going through neighboring nodes. If two neighboring nodes are crossed by an edge, they very possibly belong to a visual unit and have the same figure/ground label. Motivated by this observation, we define the edge potential to capture the cue of edge continuity.
Edge continuity. (a) Split patches. (b) Overlying image edge map on a split patch partition. (c) Local magnifying map.
Given an image, we compute its binary gradient magnitude
M
. Let
S
be the node index matrix, which indicates that the graph node that a pixel belongs to. Assuming that
c_{n}
= (
x_{n}
,
y_{n}
) denotes the coordinate of pixel
n
and (
i, j
) represents a pair of neighboring nodes, we can denote their common boundaries as a pixel pair set, as follows:
Then, the edge potential is computed as follows:
where
and
 C. Parameter Estimation
The parameter vector
v
in Eq. (7) is automatically learned from the training data. Given a set of training images
T
= {(
L
^{(n)}
,
I
^{(n)}
),
n
= 1,...
N
}, we assume that all the training data are independent and identically distributed. We then use the conditional maximum likelihood (CML) criterion to estimate
v
. Its log likelihood is computed as follows:
where the last term is the logpartition function. In general, the evaluation of the partition function is a NPhard problem. We could use either sampling techniques (e.g., the Markov chain Monte Carlo method
[19]
) or some approximations (e.g., those of the free energy
[20]
, piecewise training
[21]
, pseudolikelihood
[22]
) to estimate the parameters.
The optimal parameter
maximizes the log conditional likelihood according to the CML estimation as follows:
This can be solved by using the gradient descent method. The derivative of the log likelihood
L
(
v
) is written as follows:
where the second term
E_{P}
( · ) denotes the expectation with respect to the distribution
P
(
l

I
^{(n)}
,
v
). That is,
In general, the expectation cannot be computed analytically because of the combinatorial number of elements in the configuration space of labels. In this work, we use belief propagation
[23]
to approximate it.
IV. EXPERIMENT AND DISCUSSION
 A. Image Datasets
We evaluate the proposed approach using three datasets. The MSRC dataset
[10]
contains 591 images with 21 categories. The performance of the unary classifier on this dataset is measured by using the pixel precision. Furthermore, for comparison with a previous work
[24]
, we select the following 13 classes of 231 images with a 7class foreground (cow, sheep, airplane, car, bird, cat, and dog) and a 6class background (grass, tree, sky, water, road, and building) as the data subset. The ground truth labeling of the dataset contains pixels labeled as ‘void’ (i.e., colorcoded as black), which implies that these pixels do not belong to any of the 21 classes. In our experiments, void pixels are ignored for both the training and the testing of the unary classifier. The dataset is randomly split into roughly 40% training and 60% test sets, while ensuring approximately proportional contributions from each class.
The second dataset is the Weizmann horse dataset
[25]
, which includes the side views of many horses that have different appearances and poses. We have also used the VOC2006 cow database
[26]
in which ground truth segmentations are manually created. For the two datasets, the numbers of images in the training and test sets are exactly the same as in
[27]
.
 B. Common Parameters
When extracting the pixellevel texture descriptor, we set the parameters of the Gabor filter as scales
σ
= {1,1.2} and directions
θ
= {0,
π
/ 4,
π
/ 2, 3
π
/ 4}. The phase response is uniformly quantified into eight regions. Hence, the size of the pixellevel feature vector is 4×2×8 = 64 . We randomly select roughly 60000 samples from all the training images to learn the dictionary
D
and ensure approximately proportional contributions from each image. We set the dictionary size as 2048. Thus, a 64dimensional pixellevel vector is sparsely encoded as a 2048dimensional vector; then, by spatial pyramid matching and max pooling, we extract a patch feature from 16×16 pixel patches, which are densely sampled with a step size of 4 pixels.
During the training of the unary classifier, a patch possibly contains multiple labels; however, we take the label that accounts for more than 75% of all the pixels in this patch as its label. We find that the number of patches of the training images in each class on average is in the order of 10000, and some classes have more than 100000 patches. Considering the memory and computational constraints, we randomly select 8000 patches from each class to construct a patch dataset for the evaluation of the unary classification, and each class sample is randomly split into 25% training and 75% test sets. For efficiency, we reduce the dimension of a patch feature from 10432 to 4000 by using the incremental principal component analysis (PCA) algorithm
[28]
, in which we feed 20% samples to increment PCA in order to approximate the mean vector and the basis vectors.
 C. Unary Accuracy
To evaluate the performance of the proposed patch representation, we use a simple linear SVM classifier to conduct 21class classification experiments on the MSRC dataset. We select 1200 patches per class as training samples and the rest of the patches as the testing samples. We achieve patchwise labeling accuracy of 71.0%, while the stateoftheart approach
[10]
gives pixelwise accuracy of 69.6%. For a fair comparison, we further refine the patch precision segmentations to the pixel precision ones by simply postprocessing.
In particular, we first get split patches (i.e., graph nodes) by using an existing segmentation method as described in Section IIIA. The nodes are not always larger than the patches in size. Then, we take the nodes within the same segment as generated by
[18]
as the content consistent nodes. Finally, the label of each node is decided by a majority vote of the labels of its neighboring nodes, which must also be its contentconsistent nodes. After the above processing, we achieve pixelwise accuracy of 72.1%.
We also evaluate the binary classification performance on the 13class dataset. We select 2200 patches per class as the training samples and label the 7class foreground and the 6class background as positive samples and negative samples, respectively. Thus, we achieve patchwise labeling accuracy of 87.5%. After the postprocessing, we achieve pixelwise accuracy of 88.4%. The unary pixelwise accuracy on the Weizmann dataset and the VOC2006 dataset is 89.9% and 94.5%, respectively.
 D. Potentials Analysis
On the 13class dataset, we compare the unary potential accuracy with the full model accuracy. The latter is improved by 3.2% on average. This seemingly small numerical improvement corresponds to a large perceptual improvement (see
Fig. 6
), which shows that our pairwise potentials are effective.
Contribution of pairwise potentials. (a) Input images. (b) Results of the unary classification with patchwise accuracy. (c) Postprocessing results with pixelwise accuracy. (d) Results of the full CRF model. The first two examples show an increase in accuracy of 0.1% and 3.1%, respectively, while the last example significantly improves the accuracy by 39.0%.
 E. Comparison
We evaluate the performance of the proposed method against that of three stateoftheart methods
[24
,
27
,
29]
. The quantitative measure is the accuracy, namely segmentation cover, which is defined as the percentage of correctly classified pixels in the image (both foreground and background)
[24
,
29]
.
On the MSRC dataset, since the performance varies substantially for different classes, we respectively give the accuracy of each class. We list the quantitative comparison of seven classes in
Table 1
, which shows that our method outperforms the two competitors except for the cat class.
Quantitative comparison on the MSRC dataset
Quantitative comparison on the MSRC dataset
In addition, the method proposed in
[24]
only selects 10 images for each class such that there is a single object in each image, while we compute the segmentation accuracy on the 13class subdataset of 231 images, and many images contain several object instances. The difference in testing data also indicates that our method is more robust than that proposed in
[24]
.
Fig. 7
shows some visual examples of the same images as reported in
[24]
. Although the accuracies of some examples are lower than those in the case of the competitor methods, our overall accuracy is higher.
Qualitative comparison with [24] for the MSRC database. The first and the third rows show the results reported in [24]. The second and the fourth rows show our results.
On the Weizmann and VOC2006 datasets, we compute the 2class confusion matrix, as shown in
Table 2
, which shows that the proposed method performs favorably against the method proposed in
[27]
on the first dataset and much better on the second one.
Figs. 8
and
9
show the same examples as those considered in
[27]
. The reason that the results on the cow dataset are very goodies that the appearances of the foreground and background are respectively homogeneous and the spatial distribution of the foreground is very compact. Compared to the horse dataset, the foreground of many images are inhomogeneous; in particular, horse shanks are very slim and their colors are different from those of the body, which leads to horse shanks going missing from the final segmentation, as shown in
Fig. 8
. In addition, the similar appearance of the foreground and the shadow in the background possibly causes some errors.
Quantitative comparison on the Weizmann and VOC2006 datasets
Quantitative comparison on the Weizmann and VOC2006 datasets
Examples of representative segmentation results on the Weizmann horse dataset. From top to down: input images, results reported in [27], and our results.
Examples of representative segmentation results on the VOC2006 cow images. From top to down: Input images, results reported in [27], and our results.
V. CONCLUSIONS
In this paper, we propose a new discriminative model for figure/ground segmentation. First, a pixellevel dictionary is learnt from mass pixelwise Gabor descriptors; second, each pixel is mapped as a highdimensional sparse vector, and then, all the sparse vectors in a patch are fused to represent the patch by max pooling and spatial matching. The proposed unary features can simultaneously capture the appearance and context information, which significantly enhances the unary classification accuracy. The upgrade color and texture potentials with neighborhood interactions and the proposed edge potential weaken the interference of abnormal nodes during graph affinity computation. The experimental results demonstrate that the proposed approach is powerful with a comparison with three stateoftheart approaches. In the future, we hope to integrate explicit semantic context and salient information to make the algorithm more intelligent.
Acknowledgements
This work was supported by the Natural Science Foundation of China under Grants 61405022 and 61371157.
BIO
Lihe Zhang
received the Ph.D. degree in signal and information processing from Beijing University of Posts and Telecommunications, Beijing, China, in 2004. He is currently an Associate Professor with the School of Information and Communication Engineering, Dalian University of Technology (DUT). His current research interests include computer vision and pattern recognition.
Yongri Piao
received the B.S. degree in automation engineering from Jilin University, China, in 2003, and the M. S. and Ph.D. degrees in information and communication engineering from Pukyong National University, Republic of Korea, in 2005 and 2008, respectively. From September 2008 to December 2011, he was a Research Professor at the 3D display research center of Kwangwoon Un iversity. Since March 2012, he has been an Associate Professor at the School of Information and Communication Engineering, Dalian University of Technology, Dalian, China. His research interests include optical imaging and 3D display, optical and digital encryptions, 3D pattern recognition and tracking, 2D/3D image processing.
FeiFei L.
,
Perona P.
“A Bayesian hierarchical model for learning natural scene categoriesm”
in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR2005)
San Diego, CA
2005
524 
531
Wu L.
,
Hoi S. C.
,
Yu N.
2010
“Semanticspreserving bagofwords models and applications,”
IEEE Transactions on Image Processing
19
(7)
1908 
1920
DOI : 10.1109/TIP.2010.2045169
Lazebnik S.
,
Schmid C.
,
Ponce J.
“Beyond bags of features: spatial pyramid matching for recognizing natural scene categories,”
in Proceedings of 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
New York City, NY
2006
2169 
2178
Ren X.
,
Malik J.
“Learning a classification model for segmentation,”
in Proceedings of 9th IEEE International Conference on Computer Vision
Nice, France
2003
10 
17
Wu B.
,
Nevatia R.
“Simultaneous object detection and segmentation by boosting local shape feature based classifier,”
in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR2007)
Minneapolis, MN
2007
1 
8
Duygulu P.
,
Barnard K.
,
de Freitas J. F.
,
Forsyth D. A.
“Object recognition as machine translation: learning a lexicon for a fixed image vocabulary,”
in Proceedings of 7th European Conference on Computer Vision (ECCV2002)
Copenhagen, Denmark
2002
97 
112
He X.
,
Zemel R. S.
,
CarreiraPerpiñán M. A.
“Multiscale conditional random fields for image labeling,”
in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR2004)
Washington, DC
2004
695 
702
Kohli P.
,
Ladicky L.
,
Torr P. H.
“Robust higher order potentials for enforcing label consistency,”
in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR2008)
Anchorage, AK
2008
1 
8
Ladicky L.
,
Russell C.
,
Kohli P.
,
Torr P. H.
“Associative hierarchical CRFs for object class image segmentation,”
in Proceedings of 2009 IEEE 12th International Conference on Computer Vision
Kyoto, Japan
2009
739 
746
Shotton J.
,
Winn J.
,
Rother C.
,
Criminisi A.
“Textonboost: joint appearance, shape and context modeling for multiclass object recognition and segmentation,”
in Proceedings of 9th European Conference on Computer Vision (ECCV2006)
Graz, Austria
2006
1 
15
Chen C.
,
Freedman D.
,
Lampert C. H.
“Enforcing topological constraints in random field image segmentation,”
in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR2011)
Providence, RI
2011
2089 
2096
Rosenfeld A.
,
Weinshall D.
“Extracting foreground masks towards object recognition,”
in Proceedings of 2011 IEEE International Conference on Computer Vision (ICCV)
Barcelona, Spain
2011
1371 
1378
Singaraju D.
,
Vidal R.
“Using global bag of features models in random fields for joint categorization and segmentation of objects,”
in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR2011)
Providence, RI
2011
2313 
2319
Fulkerson B.
,
Vedaldi A.
,
Soatto S.
“Class segmentation and object localization with superpixel neighborhoods,”
in Proceedings of 2009 IEEE 12th International Conference on Computer Vision
Kyoto, Japan
2009
670 
677
Galleguillos C.
,
McFee B.
,
Belongie S.
,
Lanckriet G.
“Multiclass object localization by combining local contextual interactions,”
in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR2010)
San Francisco, CA
2010
113 
120
Kumar S.
,
Hebert M.
“Discriminative random fields: a discriminative framework for contextual interaction in classification,”
in Proceedings of 9th IEEE International Conference on Computer Vision
Nice, France
2003
1150 
1157
Lee H.
,
Battle A.
,
Raina R.
,
Ng A. Y.
“Efficient sparse coding algorithms,”
in Proceedings of Advances in Neural Information Processing Systems (NIPS2006)
Vancouver, Canada
2006
801 
808
Nock R.
,
Nielsen F.
2004
“Statistical region merging,”
IEEE Transactions on Pattern Analysis and Machine Intelligence
26
(11)
1452 
1458
DOI : 10.1109/TPAMI.2004.110
Vishwanathan S. V. N.
,
Schraudolph N. N.
,
Schmidt M. W.
,
Murphy K. P.
“Accelerated training of conditional random fields with stochastic gradient methods,”
in Proceedings of the 23rd International Conference on Machine Learning (ICML)
Pittsburgh, PA
2006
969 
976
Sutton C.
,
McCallum A.
“Piecewise training of undirected models,”
in Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI2005)
Edinburgh, Scotland
2005
1 
8
Li S. Z.
2001
Markov Random Field Modeling in Image Analysis.
Springer
Tokyo
Frey B. J.
,
MacKay D. J.
“A revolution: belief propagation in graphs with cycles,”
in Proceedings of Advances in Neural Information Processing Systems (NIPS1998)
Denver, CO
1998
479 
485
Vicente S.
,
Rother C.
,
Kolmogorov V.
“Object cosegmentation,”
in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR2011)
Providence, RI
2011
2217 
2224
Borenstein E.
,
Ullman S.
“Combined topdown and bottomup segmentation,”
in Proceedings of the 2004 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'04)
Washington, DC
2004
Everingham M.
The VOC 2006 database [Internet].
Available:
Zhang L.
,
Ji Q.
2010
“Image segmentation with a unified graphical model,”
IEEE Transactions on Pattern Analysis and Machine Intelligence
32
(8)
1406 
1425
DOI : 10.1109/TPAMI.2009.145
Ross D. A.
,
Lim J.
,
Lin R. S.
,
Yang M. H.
2008
“Incremental learning for robust visual tracking,”
International Journal of Computer Vision
77
(13)
125 
141
DOI : 10.1007/s1126300700757
Carreira J.
,
Sminchisescu C.
“Constrained parametric mincuts for automatic object segmentation,”
in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010)
San Francisco, CA
2010
3241 
3248