In this study, we propose a distance metric learning approach called discriminant metric learning (DML) for face verification, which addresses a binary-class problem: classifying whether or not two input images depict the same subject. The critical issue in solving this problem is how to measure the distance between two images. Among various methods, the large margin nearest neighbor (LMNN) method is a state-of-the-art algorithm. However, to compensate for the entangled data distributions that LMNN faces under the high appearance variation of unconstrained environments, DML penalizes violations of the distance relationship for negative pairs, i.e., images with different labels, while being integrated with LMNN, which models the distance relationship between positive pairs, i.e., images with the same label. The likelihoods of the input images, estimated using the DML and LMNN metrics, are then weighted and combined for further analysis. Additionally, rather than using the k-nearest neighbor (kNN) classification mechanism, we propose a verification mechanism that measures the correlation of the class label distributions of neighbors to reduce the false negative rate for positive pairs. The experimental results show that DML can modify the relation of negative pairs in the original LMNN space and compensate for LMNN's performance on faces with large variances, such as pose and expression.
1. Introduction
Face recognition is an active research issue in the field of computer vision and has been studied for more than two decades [1]–[10]. It has a wide range of practical applications, including surveillance and border-control systems. In a traditional access system, users enter a site using keys or integrated circuit (IC) cards; however, security is doubtful because these keys can easily be duplicated. Recently, biometrics has become popular for access systems, with fingerprint and palm-print recognition among the most common approaches. However, these methods are inconvenient due to the required contact process. In contrast, the non-contact process of using the human face to convey a subject's identity as an access key is more attractive.
Not exclusive to the private security field, face recognition/verification also plays an important role in public security. We are surrounded by surveillance systems, and every crossroad is equipped with cameras to record every moment. Consider an urgent event, for example, the police pursuing a criminal and attempting to determine his movements. Manually checking every frame of every camera record is impossible. A more efficient technique would be a face verification system that could filter the results and present frames of possible suspects for further analysis. However, the development of robust face verification is challenging since facial images contain an immense variety of expressions, orientations, lighting conditions, occlusions, and so forth. In recent years, researchers have focused on raising the accuracy rate with respect to these variations, including the design of good features for face representation [4], [6], [9], [11], [12] and distance metric learning [10], [11], [13].
Due to its wide use in security applications, we focus our efforts in this study on the problem of face verification, i.e., determining whether a pair of facial images shows the same or different persons. The fundamental problem is how to measure the similarity between facial examples. A surge of recent research [13]–[20] has focused on Mahalanobis metric learning to improve the performance of k-nearest neighbor (kNN) classification. In an uncontrolled environment, it is assumed that facial images can only be detected by a face detector [21], and thus a high degree of variance results from lighting conditions, poses, occlusions, and background clutter, making verification challenging. To tackle the highly entangled data distributions caused by the above factors, we propose a distance metric algorithm that heavily penalizes violations of the distance relationship for between-class data while preserving the within-class relationships. In addition, we propose a validation approach that measures the correlation between the label distributions of neighbors to improve the true positive rate.
2. Related Work
Face recognition has been an active research topic for more than two decades because of its practical applications. Of particular note, face verification, which aims to verify whether an input image is that of a claimed subject, has attracted more attention in recent years. It differs from face identification in that the test subject(s) in the input pair need not be included in the training dataset. In practical applications, however, the uncontrolled environment causes problems in terms of the immense variations in facial pose, expression, lighting conditions, occlusion, and so on, and the reliability of previous research developed under controlled settings is thus limited. In 2009, Kumar et al. [9] analyzed these failure cases and found that such mistakes would be avoidable if more facial attributes could be analyzed separately before classification. The authors proposed two methods for face verification in uncontrolled settings. The first method used a high-level face representation to recognize the presence or absence of 65 attributes, such as a round face, gender, and so on. The second method was based on a simile classifier and aimed at recognizing the similarities between the facial regions of the test image pairs with an extra identity dataset as prior knowledge. Sixty people were used in the study, which yielded a significant improvement on the LFW dataset [22]. However, this approach has to define a number of reliable and relevant features [5]. Inspired by the results detailed in [9], many alternative approaches addressed unconstrained face recognition/verification via robust feature learning [11], [23], [24]–[26], or via the similarity measures between feature descriptors [14], [27].
For feature extraction, texture-based local features have been applied to face recognition/verification, including LBP [28], SIFT [29], and Gabor [30], [31]. In 2004, Ahonen et al. [4] investigated the LBP feature, which encodes the intensity relationship between each pixel and its neighboring pixels, and found that this approach yielded good results. LBP is insensitive to lighting changes and provides promising results compared to global features and other texture-based approaches [4]. Further modifications of the LBP approach have since been proposed [32]–[34]. Rather than using one specific feature, the strategy of combining multiple features has been applied to face verification [35], [36]. In [35], the authors combined multiple texture-based features in score-level fusion and showed that such a combination can provide better verification results than the use of one specific feature, by approximately 5.7% on average. Color information is another important feature, and the integration of color information can improve recognition performance compared to methods relying solely on color or texture information [37]–[39]. In [36], the authors proposed new features, including color local Gabor wavelets (CLGWs) and color local binary patterns (CLBP), to combine texture and color features. How best to use texture and color information together, however, remains an open problem [36]. Instead of designing hand-crafted encoding methods, the approaches of [6] and [40] applied learning frameworks to select discriminant features in order to avoid the difficulties associated with obtaining optimal encoding methods manually. In [6], an unsupervised learning-based method was proposed to encode the local microstructures of a face into a set of more uniformly distributed discrete codes. Middle-level features using unsupervised feature learning with deep network architectures [11], [24]–[26] have also been applied to face verification. Further information regarding this topic can be found in [41] and [42].
After obtaining the feature descriptors, the subsequent process for face verification is to measure the similarity between two descriptors. Inspired by the idea of "One-Shot Learning" techniques [43], [44], Wolf et al. [45] proposed the one-shot similarity (OSS) approach, which classifies a pair of test images via a discriminant model learned from a single positive sample and a set of prior background negative samples, thereby addressing the problem of limited positive samples. Following this, the authors of [46] extended the OSS approach by combining multiple OSS scores to improve the recognition rate, and further considered the ranking results of the query image to propose the "Two-Shot Similarity" (TSS) approach [35]. Although the discriminative models are produced per pair of vectors being compared [35], [45], [46] and are often better suited to comparing the test pair, two classifiers need to be trained each time two images are compared. Depending on the classifiers used in the implementation, this may lead to additional computational cost in the test process. With the aim of improving recognition capability for two faces in an uncontrolled environment, Yin et al. [10] proposed an "Associate-Predict" (AP) model for face recognition based on the conjecture that a transition process is performed with given prior knowledge in a person's brain. The model is built on a prior identity dataset, which differs from the extra unlabeled datasets of [9], [35], [45], [46], in that each identity has multiple facial images with large intra-class variations. When a pair of test images is compared, the input face is first associated with a number of the most similar identities from the identity dataset, and the new appearance of the input face in different settings is predicted using appearance-prediction matching. In addition, one person-specific classifier is trained for likelihood-prediction matching. The accuracy of facial component extraction and the selection of the correct identity for appearance prediction influence the performance of the Associate-Predict model. Different from [9], which uses a global unlabeled dataset, the authors built the person-specific model on a prior identity dataset to classify the input face against the most similar faces to improve recognition. As in [46], [35], the use of an online classifier may incur additional computational cost.
In spite of the new frameworks that have been proposed for face verification [9], [35], [45], [46], the similarity measure between facial descriptors is the core of this line of research. The information theoretic metric learning (ITML) [16] approach, for example, is applied for each OSS score [46]. In order to tackle the highly entangled data distributions captured in uncontrolled environments [47], the kNN classifier, the simplest nonlinear classifier, is most often applied on the basis of the Euclidean distance metric for recognition. However, the Euclidean distance metric ignores the statistical properties of the data that might be estimated from a large training set of labeled examples [13]. Several other distance metric algorithms [14]–[20] have also been proposed to obtain a new distance metric that exploits data properties from class labels. In [27], cosine similarity metric learning (CSML) was proposed to learn a transformation matrix by measuring the cosine similarity between an image pair. In addition, the Mahalanobis distance metric can be learned based on various objective functions. Relevant component analysis (RCA) [15] is intermediate between the unsupervised method of PCA and the supervised method of LDA, using chunklet information (a subset of a class) to learn a full-rank Mahalanobis distance metric. Unlike LDA, since between-class information is not explicitly imposed in the objective function, the improvement for kNN-based classification with the RCA metric is limited. Similar to the goal of LDA of minimizing the within-class distance whilst maximizing the between-class distance [48], Xing et al. [19] proposed a Mahalanobis metric for clustering (MMC) with side information, which represented the first convex objective function for distance metric learning. Because MMC was built with a normal or unimodal assumption for clusters, it is not particularly appropriate for kNN classifiers [13]. In contrast to RCA and MMC, large margin nearest neighbor (LMNN) classification [13] is the first method to impose a constraint on the distance metric specifically for kNN classification: via the learned metric, the k nearest neighbors always belong to the same class, while examples from different classes are separated by a large margin. A series of experiments has shown that the LMNN approach yields better results than PCA, LDA, RCA, and MMC [13]. In order to extend the LMNN approach to the binary-class problem of face verification, Guillaumin et al. [14] proposed logistic discriminant-based metric learning (LDML), which modifies the LMNN constraints with a probability formulation by learning a metric from a set of labeled image pairs. LDML provides better results than [16] in the binary-class problem. In addition, a comparison mechanism, the marginalized kNN classifier (MkNN), was also proposed to verify a test pair using a set of labeled images. However, an incorrect classification results when two images of the same subject receive a low similarity rating because the class labels of their corresponding k nearest neighbors are uniformly distributed.
3. Overview of the Proposed Face Verification System
Fig. 1 shows an overview of the training and test processes in the proposed face verification system. The training process commences by detecting faces in the training images and normalizing the face geometry according to the locations of the eyes. To consider the spatial information of the face, each face image is divided into 3×3 regions. Then, the 59-dimensional local binary pattern feature is extracted from each region, and the features are further concatenated into one 531-dimensional feature vector f_{i}. To develop a discriminant metric for verification, N_{p} positive pairs (two images of the same subject) and N_{n} negative pairs (two images of different subjects) are generated from the training dataset. These training pairs are then used to learn the distance metric M_{Dis}, which is composed of two parts, M_{LMNN} and M_{DML}. The distance relationship of the positive pairs is learned via the distance metric M_{LMNN} [13], and the distance relationship of the negative pairs is learned via our proposed metric M_{DML}. Hence, we can not only minimize the within-class distance but also maximize the between-class distance. Note that violations of the distance relationship for the negative pairs are heavily penalized via M_{DML} to reduce the false positive rate for unconstrained verification.
Flowchart of the proposed face verification system. (a) The training process (b) The test process
In the test process, a pair of two facial images is input, and the LBP features are extracted from each test image as in the training process. Then, the similarity between each test image and the training images is evaluated based on the trained distance metrics M_{LMNN} and M_{DML}, respectively. The proposed verification mechanism, correlation of the k nearest neighbors (CkNN), constructs the corresponding kNN code for each test image and then measures the correlation between the two kNN codes. This measurement is then applied to decide whether the test pair shows the same subject or not.
4. Face Verification System
In this section, we discuss the details of the proposed distance metric and verification mechanism. First, we introduce the extracted features, then the design concept for the proposed distance metric and the optimization process. Lastly, we present the coding and verification mechanism, namely CkNN. In the following discussion, the training data set is composed of N subjects with n_{i} images each, and the size of each image is w×h.
 4.1 Feature extraction by LBP
The LBP is a texture-based feature that has been demonstrated to perform very well for face recognition [28]. Its mechanism is the use of binary codes to represent the intensity (gray-value) relationship between the processing pixel and its surrounding pixels. For each processing pixel, these binary codes are transformed into a decimal value, and the statistical distribution of the decimal values from all pixels is then represented by a histogram that serves as the facial feature vector. Fig. 2 shows examples of assigning each pixel a gray value according to its corresponding decimal value. We can see that the dark pixels correspond to facial components and facial contours. Note that in [28], binary codes are further divided into uniform and non-uniform patterns. A uniform pattern is one in which the changes between binary codes, i.e., 0 to 1 or 1 to 0, occur no more than twice, and the remaining patterns are designated as non-uniform patterns. From Fig. 3, we can see that the uniform patterns capture locally important features such as corners and edges; hence, each of them is recorded in one specific bin of the histogram, while all of the non-uniform patterns are recorded together in one bin. The resulting facial image can be represented by a 59-dimensional histogram.
Coding results of a local binary pattern
Examples of a uniform pattern corresponding to locally important facial features
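To make the encoding concrete, the following minimal sketch (in Python with NumPy; the function names are ours, not from the paper) computes the 8-neighbor binary code of one pixel and checks the uniform-pattern condition described above:

```python
import numpy as np

def lbp_bits(img, r, c):
    """8-neighbour LBP of pixel (r, c): each neighbour whose gray value
    is >= the centre contributes a 1-bit, clockwise from the top-left."""
    centre = img[r, c]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    return [1 if img[r + dr, c + dc] >= centre else 0 for dr, dc in offsets]

def is_uniform(bits):
    """Uniform pattern: at most two circular 0/1 transitions."""
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8)) <= 2

patch = np.array([[9, 9, 9],
                  [9, 1, 9],
                  [9, 9, 9]])
bits = lbp_bits(patch, 1, 1)      # all neighbours brighter than the centre
print(bits, is_uniform(bits))     # [1, 1, 1, 1, 1, 1, 1, 1] True
```

The bit list is conventionally read as an 8-bit number to obtain the decimal code mentioned above.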
The use of a statistical histogram as a feature representation is popular [49]. One advantage is that it is invariant to rotation; the disadvantage is the loss of spatial information, which is particularly damaging when geometry is important. One way to cope with this situation and maintain the geometric relationship is to divide the object (face) into multiple regions [28]. One histogram is then used to represent each region, and the final feature vector is obtained by concatenating all histograms. In our work, therefore, we divide each facial image into 3×3 regions, each represented by a 59-dimensional feature vector. In the end, the 9 histograms are concatenated into one 531-dimensional feature vector for each training face.
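The region-based descriptor can be sketched as follows (our own illustrative code, not the authors' implementation): the 256 possible LBP codes are mapped to 59 bins (58 uniform patterns plus one shared non-uniform bin), and a per-region histogram is built for each of the 3×3 blocks:

```python
import numpy as np

def uniform_lut():
    """Map each 8-bit LBP code to a bin: the 58 uniform patterns get
    their own bins (0..57); every non-uniform code shares bin 58."""
    lut = np.full(256, 58, dtype=np.int64)
    nxt = 0
    for code in range(256):
        bits = [(code >> i) & 1 for i in range(8)]
        if sum(bits[i] != bits[(i + 1) % 8] for i in range(8)) <= 2:
            lut[code] = nxt
            nxt += 1
    return lut

def face_descriptor(codes, rows=3, cols=3):
    """codes: 2-D array of per-pixel LBP codes (0..255).
    Returns the 3x3-region histograms concatenated: 9 * 59 = 531-d."""
    lut = uniform_lut()
    h, w = codes.shape
    hists = []
    for i in range(rows):
        for j in range(cols):
            block = codes[i * h // rows:(i + 1) * h // rows,
                          j * w // cols:(j + 1) * w // cols]
            hists.append(np.bincount(lut[block].ravel(), minlength=59))
    return np.concatenate(hists)

codes = np.zeros((120, 100), dtype=np.int64)   # a dummy 120x100 code image
print(face_descriptor(codes).shape)            # (531,)
```

Note that exactly 58 of the 256 codes satisfy the uniform condition, which is where the 59-bin histogram size comes from.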
 4.2 Distance Metric Learning
LMNN [13] is one of the state-of-the-art methods for learning a Mahalanobis distance metric that reduces within-class distances and enlarges between-class distances. Using this metric, the kNN classifier benefits from the modified distance relationships. However, the distribution of unconstrained facial data within the same class is highly nonlinear, and the data of different classes are even entangled [47]. Hence, we use LMNN to minimize the within-class distance, together with discriminant metric learning (DML), which is designed to penalize violations of between-class distance relationships.
 4.2.1 Large-Margin Nearest Neighbor Metric Learning
LMNN metric learning [13] derives a metric favored by the kNN classifier by calculating the Mahalanobis distance between data values x_{i} and x_{j} via the matrix M_{LMNN}. If the matrix degenerates into an identity matrix, Eq. (1) estimates the Euclidean distance between x_{i} and x_{j}. The idea of LMNN, as shown in Fig. 4, is to minimize the within-class distances (the distances between the blue squares) while pushing the between-class distances (the distances between blue squares and black triangles) larger by at least one unit. In other words, for each data value x_{i}, the main objective is to minimize the distance for the positive pairs (x_{i}, x_{j}), where x_{j} is one target neighbor [13], i.e., one of the k nearest neighbors having the same class label as x_{i}, while maximizing the distance for the negative pairs (x_{i}, x_{l}), where x_{l} is one of the kNNs having a different class label from x_{i}. Thus, the objective function can be derived as in [13]:
where M_{LMNN} is a positive semidefinite matrix, i.e., all of its eigenvalues are greater than or equal to 0, β is a parameter that controls the relative importance of the within-class and between-class distances, and ξ is a slack variable that penalizes violations of the distance conditions between (x_{i}, x_{j}) and (x_{i}, x_{l}). In Eq. (2), the first term minimizes the distance of the positive pairs, and the second term uses ξ to keep the distance of each negative pair greater than the distance of the corresponding positive pair by a margin of one unit. More details on the optimization process can be found in [13].
Schematic illustration of LMNN. Each data value is represented by an icon, with three classes (square, triangle, and circle) shown. The left figure shows the data relationships before applying the LMNN metric; the right figure shows the modified data relationships for x_{i} and its neighbors afterward. Specifically, the within-class distances (between the blue squares) are minimized, while the between-class distances (between the blue squares and black triangles) are maximized.
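The rendered equations did not survive extraction; based on the definitions above and the standard formulation of [13], Eqs. (1) and (2) presumably take the following form (a reconstruction under those assumptions, not a verbatim copy of the original):

```latex
% Eq. (1): Mahalanobis distance under the learned metric
d_{M}(\mathbf{x}_i, \mathbf{x}_j) =
  (\mathbf{x}_i - \mathbf{x}_j)^{T} M_{\mathrm{LMNN}} (\mathbf{x}_i - \mathbf{x}_j)

% Eq. (2): LMNN objective -- pull target neighbors x_j close while keeping
% impostors x_l at least one unit farther than the matched positive pair
\min_{M \succeq 0,\; \xi \geq 0}\;
  (1-\beta) \sum_{i,\, j \rightsquigarrow i} d_{M}(\mathbf{x}_i, \mathbf{x}_j)
  + \beta \sum_{i,\, j \rightsquigarrow i} \sum_{l} \xi_{ijl}
\quad \text{s.t.}\quad
  d_{M}(\mathbf{x}_i, \mathbf{x}_l) - d_{M}(\mathbf{x}_i, \mathbf{x}_j)
  \geq 1 - \xi_{ijl}
```

Here j ⇝ i denotes that x_{j} is a target neighbor of x_{i}, matching the definitions in the paragraph above.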
 4.2.2 Discriminant Metric Learning
In order to tackle entangled data distributions and reduce the false positive rate, a discriminant metric is designed to enhance the penalization of negative pairs that violate the distance relationship, i.e., where the between-class distance is smaller than the within-class distance plus a one-unit safety margin. Fig. 5 shows a schematic illustration of the discriminant metric: if the distance of a negative pair, i.e., the distance between the processing data value x_{i} and a neighbor with a different class label (shown as black triangles), is smaller than the distance of the positive pair, a larger cost is assigned. On the other hand, if the distance of the negative pair is larger than the distance of the positive pair, a smaller cost is assigned. Hence, via the sigmoid function, which ranges from 0 to 1, the objective function for the discriminant metric is designed as follows:
where σ(●) is the sigmoid function, P_{i} and N_{i} are the sets of x_{i}'s neighbors whose class labels are the same as or different from that of x_{i}, respectively, and d(●) is the Mahalanobis distance between x_{i} and x_{j} (or x_{l}), measured with the distance metric M. When the argument z is larger, the function value σ(z) is closer to 1, while when z is smaller, σ(z) is closer to 0. In contrast to Eq. (1), the sigmoid function is smooth and has a closed-form first-order derivative. Thus, the optimum of M_{DML} can be approached by taking the derivative with respect to M as
where X_{P} = (x_{i} − x_{j})(x_{i} − x_{j})^{T}, X_{N} = (x_{i} − x_{l})(x_{i} − x_{l})^{T}, and d_{P} and d_{N} are the distances of the positive and negative pairs, respectively. Using the gradient descent method with the identity matrix as the initial value, the optimal value of M_{DML} can be obtained. According to the optimization process, the time taken for each iteration includes: 1) the distance calculation time for the training examples, and 2) the distance metric updating time, with computational complexities of O(nd^{2} + n^{2}d) and O(knd^{2}), respectively, where n is the number of training data, k is the number of neighbors, and d is the dimension of the LBP feature. Because the distance calculations for the training data are independent of each other, this step can be parallelized in a distributed computing framework such as MapReduce to speed up the offline process. In such a framework, each computing node serves a subset of the training samples to accelerate the DML training time, and therefore the execution of the offline process can be fast.
Schematic illustration of DML. The left figure shows the data relationships before applying the DML metric; the right figure shows the modified data relationships for x_{i} and its neighbors with class labels different from x_{i}'s afterward. Specifically, the between-class distances (to the black triangles) are maximized.
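A minimal sketch of one DML gradient-descent iteration, assuming the sigmoid surrogate loss described above (the variable names and the explicit positive-semidefinite projection step are our choices, not taken from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dml_step(M, X, triples, lr=0.01):
    """One gradient step.  triples: (i, j, l) with label(i) == label(j)
    != label(l); the loss sigma(d_P - d_N) is near 1 when the negative
    pair is NOT farther than the positive pair (a violation)."""
    grad = np.zeros_like(M)
    for i, j, l in triples:
        dp = X[i] - X[j]                      # positive-pair difference
        dn = X[i] - X[l]                      # negative-pair difference
        d_p = dp @ M @ dp                     # Mahalanobis distances
        d_n = dn @ M @ dn
        s = sigmoid(d_p - d_n)
        w = s * (1.0 - s)                     # sigmoid derivative
        grad += w * (np.outer(dp, dp) - np.outer(dn, dn))
    M = M - lr * grad
    # project back onto the positive semidefinite cone
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(np.clip(vals, 0.0, None)) @ vecs.T

X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
M = dml_step(np.eye(2), X, [(0, 1, 2)])
print(np.all(np.linalg.eigvalsh(M) >= -1e-9))   # True: M stays PSD
```

The outer products of the pair differences correspond to X_{P} and X_{N} in the derivative above; the weight s(1 − s) shrinks toward zero for triples that already satisfy the margin, so well-separated pairs contribute little to the update.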
 4.3 Correlation of the kNN Code
For verification, the output predicts whether a pair of images belongs to the same class. In [14], the authors considered the neighbors' class labels and proposed MkNN [14] to measure the label distribution similarity between neighbors. However, when the data distribution is heavily entangled in the transformed metric space, two images of the same subject may be surrounded by data from different subjects; in the worst case, the k neighbors come from k different subjects, and the classification might be wrong due to the low distribution probability.
As with MkNN, instead of estimating only the data distance via Eq. (2) or (3) to predict whether two images belong to the same class, the class information of their corresponding neighbors is considered. After learning the distance metrics M_{LMNN} and M_{DML}, during the verification process the local binary pattern feature of each data value x_{i} is extracted, and the distances to the training images are measured based on M_{LMNN} and M_{DML} to obtain the corresponding k nearest neighbors under each metric. The kNN codes for x_{i} are then defined as follows:

where the dimension of each kNN code is the number of classes, δ(●) is an indicator function, K is the number of nearest neighbors, and the labels of the k-th nearest neighbors of x_{i} measured by M_{LMNN} and M_{DML}, respectively, are given by the label function y(●). In other words, the kNN code contains the class label distribution of the neighbors surrounding x_{i}. Fig. 6 shows an example with the corresponding kNN code b_{i} = [3, 2, 0, 2]^{T}.
Example of kNN code construction. The code size is the number of classes (4, in this example). For x_{i}, the 7 nearest neighbors measured by the distance metrics M_{LMNN} and M_{DML} are used, and the resulting kNN code is b_{i} = [3, 2, 0, 2]^{T}.
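The kNN-code construction can be sketched as follows (illustrative Python with hypothetical names; the metric M would be M_{LMNN} or M_{DML} in turn):

```python
import numpy as np

def knn_code(x, X_train, y_train, M, k):
    """Class-label histogram of x's k nearest training neighbours under
    the Mahalanobis metric M; the code length is the number of classes."""
    diffs = X_train - x
    # d(x, x_n) = (x - x_n)^T M (x - x_n) for every training sample n
    dists = np.einsum('nd,de,ne->n', diffs, M, diffs)
    nearest = np.argsort(dists)[:k]
    n_classes = int(y_train.max()) + 1
    return np.bincount(y_train[nearest], minlength=n_classes)

# Toy data reproducing the example of Fig. 6: 4 classes, k = 7
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [100.0]])
y_train = np.array([0, 0, 0, 1, 1, 3, 3, 2])
print(knn_code(np.array([0.0]), X_train, y_train, np.eye(1), k=7))
# [3 2 0 2]
```

Each test image thus yields two such codes, one per learned metric, which are compared in the next subsection.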
After obtaining the kNN code for each image of the test image pair (x_{i}, x_{j}), verification is performed by computing the correlation coefficients of the two kNN codes under each metric, where the c-th elements of the corresponding kNN codes and their means are used in the standard correlation formula. The final verification result V(x_{i}, x_{j}) is defined by thresholding: V(x_{i}, x_{j}) = 1, obtained when the combined correlation r_{c} exceeds the threshold γ, indicates that the test image pair (x_{i}, x_{j}) shows the same subject; otherwise, the two images show different subjects. Here, r_{c} is a weighted similarity of r^{LMNN} and r^{DML} with the coefficient w, as given by Eq. (10).
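The final decision can be sketched as below (illustrative Python; the weighted-combination form r_c = w·r^{LMNN} + (1 − w)·r^{DML} and the function names are our reading of Eq. (10), not a verbatim transcription):

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient of two kNN codes."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a @ a) * (b @ b))
    return float(a @ b / denom) if denom > 0 else 0.0

def verify(bi_lmnn, bj_lmnn, bi_dml, bj_dml, w=0.6, gamma=0.5):
    """Declare 'same subject' (1) when the weighted correlation of the
    two images' kNN codes exceeds the threshold gamma."""
    r_c = w * pearson(bi_lmnn, bj_lmnn) + (1 - w) * pearson(bi_dml, bj_dml)
    return 1 if r_c > gamma else 0

b = np.array([3.0, 2.0, 0.0, 2.0])
print(verify(b, b, b, b))   # 1: identical codes correlate perfectly
```

Two images of the same subject should draw neighbors from similar classes under both metrics, giving highly correlated codes and hence r_c near 1.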
5. Experimental Results
We evaluate the performance of the proposed verification system using the LFW dataset [22], which is a challenging benchmark for face verification. We first describe the training and test protocols for the experiments, then describe the experiments conducted to investigate the optimal parametric settings of the proposed approaches. Finally, we compare our proposed approach with existing algorithms.
 5.1 Training and test protocol
The LFW database is challenging, as it comprises 13,233 images of 5,749 subjects downloaded from Yahoo! News between 2002 and 2003. It contains a large variety of facial poses, expressions, lighting conditions, and occlusions, as shown in Fig. 7. Because the number of images per subject varies, and our proposed approach learns the distance relationships between positive and negative pairs, only subjects with more than 10 images are selected; thus, 116 subjects with 1,691 images are used in our experiments. Note that the aligned versions of the faces are used in the following experiments, and after face detection, only the central 100×120-pixel part is cropped. For each experiment, ten runs are performed, and each run randomly selects 10 images per subject for the training process. The remaining images are used for testing.
Facial examples in the LFW database
 5.2 DML parametric settings
In Eq. (10), our approach measures the distance of one image to the other training images using two distance metrics: LMNN [13] and our proposed metric DML. We use the value w for a weighted combination of the distances measured by these two metrics. In other words, the weight w indicates the relative importance of each distance metric. We varied w from 0.4 to 0.7 and used receiver operating characteristic (ROC) curves, with the x-axis showing the false positive rate (FPR) and the y-axis the true positive rate (TPR), to present the experimental results in Fig. 8. We can see that the effect of the weight value is not pronounced. Table 1 lists the true positive rate when the false positive rate is set to 0.3. Because w = 0.6 yielded the best results, this value was used in the following experiments.
ROC curves by setting various weighted values for w in Eq. (10).
True positive rate with a 30% false positive rate by setting various weighted values for w
In the second experiment, we investigated the k value of the kNN code for verification, which can be seen as a range in the feature space. For each image, the kNN code estimates the label distribution of k neighbors among the 1,160 training images. Two images of the same subject should have similar label distributions. From the results shown in Fig. 9, we can see that when k is smaller than 40 (about 0.035 times the number of training images), performance is unsatisfactory due to insufficient statistical information for comparison. However, when k is larger than 100 (about 0.086 times the number of training images), performance degrades as well. This is because, in our experimental settings, each subject has ten images in the training database; with k set to 100, the estimate of the label distribution for each image is based on approximately 10 subjects, and since the LFW database contains a great variety of appearances per subject, the broad spread of similar distance measurements causes confusion. Table 2 shows the true positive rate with the false positive rate set to 0.3. According to these results, k is set to 60 in our experiments.
ROC curves by setting various k values of the kNN code for verification. As shown, a k value that goes from 40 to 80, about 0.035 to 0.070 times the training data size, is recommended.
True positive rate with a 30% false positive rate by setting various values for k in the kNN code
In addition, in the test process, when obtaining the kNN code for each input test image, rather than using probability to measure the kNN code similarity as in MkNN [14], we propose the CkNN approach, which measures the correlation coefficients of the kNN codes. Fig. 10 compares the ROC curves with MkNN [14]. When the false positive rate is set to 0.3, the true positive rates for MkNN and the proposed measurement approach for the kNN code are 81.22% and 84.32%, respectively. The proposed approach outperforms MkNN because MkNN makes incorrect classifications when two images of the same subject receive a low similarity rating due to the class labels of their corresponding k nearest neighbors being uniformly distributed.
ROC curves by different measurement approaches for the kNN code: MkNN and the proposed approach, CkNN.
 5.3 System performance comparison
The proposed metric LMNN + DML using CkNN is compared with the existing metric learning algorithms LMNN
[13]
and LDML
[14]
with the classification mechanism MkNN.
Fig. 11 shows the ROC curves. Using the classification mechanism CkNN, LMNN + DML compensates for the drawbacks of LMNN and DML and provides better results than either alone. In addition, even the proposed metric DML alone with CkNN provides better results than LDML with MkNN [14].
Table 3 lists the true positive rate when the false positive rate is set to 0.3. Compared with LMNN alone, integrating DML with LMNN improves the verification rate by 4.3%.
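The weighting of the two metrics' likelihoods described for this framework can be sketched as follows; the exponential kernel, the weight w, and the scale parameters are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np

def fused_score(d_lmnn, d_dml, w=0.5, s_lmnn=1.0, s_dml=1.0):
    # Turn each metric's distance into a likelihood with an
    # exponential kernel, then combine the two with weight w.
    p_lmnn = np.exp(-d_lmnn / s_lmnn)
    p_dml = np.exp(-d_dml / s_dml)
    return w * p_lmnn + (1.0 - w) * p_dml

# A pair that LMNN places far apart but DML places close still
# receives a moderate fused score:
print(fused_score(d_lmnn=4.0, d_dml=0.5))
```

The fused score lets one metric compensate for the other: a pair heavily penalized by only one of the two metrics is not automatically rejected.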
Comparison of ROC curves obtained by the proposed approach and other existing methods: LMNN+DML+CkNN, LMNN+CkNN, DML+CkNN, and LDML+MkNN
Performance comparison of true positive rate with 30% false positive rate
To further analyze the verification power of LMNN and DML together, for each test image x_{test} we generated m test pairs with training images, (x_{test}, x_i), i = 1, …, m (m = 1160 in our work), to verify whether or not (x_{test}, x_i) are of the same subject.
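The per-image error count used to flag hard examples can be sketched as below; the toy decision and label sequences stand in for the real classifier outputs and ground truth:

```python
m = 1160  # training images paired with each test image, as in the text

# Toy stand-ins for the classifier's per-pair decisions and the
# ground-truth same-subject labels:
decisions = [i % 3 == 0 for i in range(m)]
truth = [i % 2 == 0 for i in range(m)]

errors = sum(d != t for d, t in zip(decisions, truth))
# An image is flagged as a hard example when it is misclassified
# on more than 0.5 x m of its pairs:
print(errors, errors > 0.5 * m)  # → 579 False
```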
Figs. 12 and 13 show examples of images that were incorrectly classified more than 0.5×m times by LMNN and DML, respectively. We see that the error examples for LMNN are images with larger variations in pose, expression, and occlusion, while the errors for DML are frontal images. This is because DML considers only between-class information, so its estimated within-class data distribution is not as compact as that of LMNN. Hence, compared with LMNN, DML is expected to modify the distance relationships for between-class data to cope with large variations. To verify DML's abilities, we selected 51 outlier examples (shown in Fig. 14), including 20 images with non-frontal poses, 20 images with exaggerated expressions, and 11 images with heavy occlusions, for testing; the ROC curves are shown in Fig. 15. The true positive rates with a false positive rate of 0.3 for LMNN and DML are 66.86% and 71.96%, respectively. We can see that DML has more tolerance for facial variations than LMNN. Therefore, in our method, we integrate LMNN and DML so they can compensate for each other.
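The operating point reported throughout the experiments (true positive rate at a 30% false positive rate) can be computed from verification scores as in this sketch; the score arrays are illustrative:

```python
import numpy as np

def tpr_at_fpr(scores_pos, scores_neg, target_fpr=0.3):
    # Choose the threshold giving the target false positive rate on
    # the negative (different-subject) pairs, then report the true
    # positive rate on the positive (same-subject) pairs.
    thr = np.quantile(scores_neg, 1.0 - target_fpr)
    return float(np.mean(scores_pos > thr))

pos = np.array([0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.45, 0.35, 0.3, 0.2])
neg = np.array([0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.45, 0.5, 0.6])
print(tpr_at_fpr(pos, neg))  # → 0.7
```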
Examples incorrectly classified by LMNN on more than one half of the test pairs
Examples incorrectly classified by DML on more than one half of the test pairs
Outlier examples used to analyze DML's abilities in between-class verification
ROC curves obtained by LMNN and DML using 51 outlier examples
In addition, we study the impact on the number of failure cases of the different appearance factors, including frontal view (cases within ±15° out-of-plane rotation), expression (excluding cases with a neutral expression), pose (excluding frontal-view cases), and others (the remaining cases), for LMNN+DML and LMNN, as shown in Fig. 16. The three examples shown in Fig. 16(a)-(c) are the most improved cases, and the three in Fig. 16(d)-(f) are the worst cases. The test example in Fig. 16(a) is a slightly rotated smiling face. Compared with LMNN, LMNN+DML improves the rates by 7.1%, 43.3%, 77.8%, and 71.4% for the frontal view, expression, pose, and other factors, respectively. The test examples in Fig. 16(b) and Fig. 16(c) involve occlusion and a non-frontal view with expression, respectively. It is not surprising that, in Fig. 16(b), the number of failure cases for the frontal-view factor exceeds that in Fig. 16(a) and (c) due to the occlusion and missing facial information. The integration of DML and LMNN reduces the error rate by 26.4% and 58.1% for the frontal view and pose, respectively. In Fig. 16(c), because the variations in rotation angle and degree of expression are higher than those in Fig. 16(a), more failure cases occurred for both LMNN+DML and LMNN, especially in the expression and pose factors. For the three cases shown in Fig. 16(a)-(c), we achieve a significant average improvement of 52.8% in the pose factor.
The impact on the number of failure cases of the different appearance factors, including frontal view (cases within ±15° out-of-plane rotation), expression (excluding cases with a neutral expression), pose (excluding frontal-view cases), and others (the remaining cases), for LMNN+DML and LMNN, with the test example shown in the top-right corner.
The test example in Fig. 16(d) is a woman wearing a hat, which casts an uneven shadow on her face. For LMNN+DML, the number of failure cases for the frontal view is higher than that for expression, pose, or others, and the degradation relative to LMNN is significant. Although the distance relations of the negative pairs that violated the constraint are modified in the training process, those negative data remain close in the transformed space. Fig. 16(e) is a test example occluded by a white headband; the numbers of failure cases for all four factors are higher than those of pure LMNN. Fig. 16(f) shows the results for a grin. Because many female subjects in the LFW dataset are captured with expressions, especially smiles and grins, the data distribution is highly entangled, and the error rate is therefore higher than in the other five cases for both LMNN and LMNN+DML. We observed that the degradation is small for the pose and expression variations, and that DML, which considers only the distance relation of negative pairs, has limitations for inter-class variations of frontal views.
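The negative-pair constraint discussed here can be sketched as a hinge-style penalty under the learned Mahalanobis metric: a different-label pair contributes to the loss when its distance falls below a margin. The helper names, margin value, and toy data below are illustrative assumptions, not the paper's exact DML formulation:

```python
import numpy as np

def mahalanobis_sq(x, y, M):
    # Squared distance under the learned metric M (a PSD matrix).
    d = x - y
    return float(d @ M @ d)

def negative_pair_penalty(X, labels, pairs, M, margin=1.0):
    # A negative pair (different labels) violates the constraint
    # when its distance under M falls below the margin; positive
    # pairs are handled by LMNN and skipped here.
    loss = 0.0
    for i, j in pairs:
        if labels[i] != labels[j]:
            loss += max(0.0, margin - mahalanobis_sq(X[i], X[j], M))
    return loss

X = np.array([[0.0, 0.0], [0.5, 0.0], [2.0, 2.0]])
labels = [0, 1, 1]
pairs = [(0, 1), (0, 2), (1, 2)]
# Only the close negative pair (0, 1) is penalized:
print(negative_pair_penalty(X, labels, pairs, np.eye(2)))  # → 0.75
```

When the learned transform cannot push such a violated pair past the margin, the pair remains close in the transformed space, which is the failure mode described above.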
Fig. 17 shows the failure examples classified by LMNN and LMNN+DML, organized according to the test examples in Fig. 16. From Fig. 17(a) and (c), we see that DML helps to correctly classify images with large variations that were incorrectly classified by LMNN. For example, facial images with expression and a slight out-of-plane rotation angle (the first row in Fig. 17(a)), and even those with a larger rotation angle and occlusion on the cheek (the second row in Fig. 17(a)), can be correctly classified by DML. Fig. 17(c) shows that DML can classify cases with a higher degree of expression, including grinning, than LMNN (Fig. 17(a)). We observed that images with a higher degree of expression (the first row in Fig. 17(c)) and rotated facial images (the second row in Fig. 17(c)) can be rescued by DML from LMNN's errors. However, DML has limited correction ability for images with small variations, as shown in Fig. 17(b) and (d). Regardless of inter-class variations, we can observe that these failure cases are near-frontal images with a lower degree of smile expression (the first rows of Fig. 17(b) and (d)) compared with the examples in Fig. 17(a) and (c), or with a neutral expression (the second rows of Fig. 17(c) and (d)). This is consistent with the results observed in Fig. 16: the improvement for the frontal view in Fig. 16(a)-(c) is smaller than that for the pose factor, and such test examples might be incorrectly classified, leading to the low accuracy shown in Fig. 16(d)-(f). The reason is that considering only the class label distribution of the nearest neighbors in the test process can cause this misclassification, especially for an entangled data distribution like the LFW dataset. To improve the accuracy, combining multiple complementary features such as texture and color [36], or learning more robust features from facial images with deep network architectures [41], [42], can reduce the inter-class variation. Some researchers have also tried to classify images by using ranking results from the training data or an extra data set [35] to improve accuracy on entangled data.
(a) and (c): examples incorrectly classified by LMNN but correctly classified by LMNN+DML. (b) and (d): examples incorrectly classified by both LMNN and LMNN+DML. The test image is shown with a green edge.
6. Conclusion
In this paper, we propose a face verification framework that uses a distance metric based on two concepts. First, we propose a distance metric, DML, that penalizes violations of the distance relationship of negative pairs. Second, the distance relationship of positive pairs is optimized via LMNN. The experimental results confirm that the proposed verification framework achieves a lower false positive rate than using LMNN alone. Moreover, the proposed classification mechanism, which measures the label distributions of the kNN codes of two images, can correct the errors caused by low probabilities in an entangled data distribution and provides better performance than MkNN. In this study, only texture-based local features are extracted from facial images. Inspired by the impressive recognition rate improvements achieved by combining texture and color features [36], [40], we plan to investigate the use of this effective combination approach in the metric learning framework in future research.
Acknowledgements
This work is supported by the National Science Council (NSC), Taiwan, under Contract MOST 103-2221-E-151-031. The authors also gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.
BIO
Ju-Chin Chen received her B.S., M.S., and Ph.D. degrees in computer science and information engineering from the National Cheng Kung University, Tainan, Taiwan, in 2002, 2004, and 2010, respectively. She is now an assistant professor in the Department of Computer Science and Information Engineering at the National Kaohsiung University of Applied Sciences, Taiwan. Her research interests lie in the fields of machine learning, pattern recognition, and image processing.
Pei-Hsun Wu received his B.S. degree in computer science and information engineering from National Chung Cheng University, Chiayi, Taiwan, in 2009. He received his M.S. degree in computer science and information engineering from the National Cheng Kung University, Tainan, Taiwan, in 2011. His research interests lie in the fields of machine learning and computer vision.
Jenn-Jier James Lien received his M.S. and Ph.D. degrees in electrical engineering from Washington University, St. Louis, MO, and the University of Pittsburgh, Pittsburgh, PA, in 1993 and 1998, respectively. From 1995 to 1998, he was a research assistant at the Vision Autonomous Systems Center in the Robotics Institute at Carnegie Mellon University, Pittsburgh, PA. From 1999 to 2002, he was a senior research scientist at L-1 Identity (formerly Visionics) and a project lead for the DARPA surveillance project. He is now a professor in the Department of Computer Science and Information Engineering at the National Cheng Kung University, Taiwan.
References
[1] Wang X., Tang X., "Dual-space linear discriminant analysis for face recognition," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2004, vol. 2, pp. 564-569.
[2] Wang X., Tang X., "A unified framework for subspace face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9), pp. 1222-1228, 2004. DOI: 10.1109/TPAMI.2004.57
[3] Wang X., Tang X., "Random sampling for subspace face recognition," International Journal of Computer Vision, 70(1), pp. 91-104, 2006. DOI: 10.1007/s11263-006-8098-z
[4] Ahonen T., Hadid A., Pietikäinen M., "Face description with local binary patterns: application to face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12), pp. 2037-2041, 2006. DOI: 10.1109/TPAMI.2006.244
[5] Berg T., Belhumeur P., "Tom-vs-Pete classifiers and identity-preserving alignment for face verification," in Proc. of British Machine Vision Conference, 2012.
[6] Cao Z., Yin Q., Tang X., Sun J., "Face recognition with learning-based descriptor," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2707-2714.
[7] Chen D., Cao X., Wang L., Wen F., Sun J., "Bayesian face revisited: a joint formulation," in Proc. of European Conference on Computer Vision, 2012, pp. 566-579.
[8] Chen D., Cao X., Wen F., Sun J., "Blessing of dimensionality: high-dimensional feature and its efficient compression for face verification," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3025-3032.
[9] Kumar N., Berg A.C., Belhumeur P.N., Nayar S.K., "Attribute and simile classifiers for face verification," in Proc. of International Conference on Computer Vision, 2009, pp. 365-372.
[10] Yin Q., Tang X., Sun J., "An associate-predict model for face recognition," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 497-504.
[11] Zhu Z., Luo P., Wang X., Tang X., "Deep learning identity-preserving face space," in Proc. of International Conference on Computer Vision, 2013, pp. 113-120.
[12] Simonyan K., Parkhi O.M., Vedaldi A., Zisserman A., "Fisher vector faces in the wild," in Proc. of British Machine Vision Conference, 2013.
[13] Weinberger K.Q., Blitzer J., Saul L.K., "Distance metric learning for large margin nearest neighbor classification," Journal of Machine Learning Research, 10, pp. 209-244, 2009.
[14] Guillaumin M., Verbeek J., Schmid C., "Is that you? Metric learning approaches for face identification," in Proc. of International Conference on Computer Vision, 2009, pp. 489-505.
[15] Bar-Hillel A., Hertz T., Shental N., Weinshall D., "Learning a Mahalanobis metric from equivalence constraints," Journal of Machine Learning Research, 6, pp. 937-965, 2005.
[16] Davis J., Kulis B., Jain P., Sra S., Dhillon I., "Information-theoretic metric learning," in Proc. of International Conference on Machine Learning, 2007, pp. 209-216.
[17] Globerson A., Roweis S., "Metric learning by collapsing classes," Advances in Neural Information Processing Systems, 2005, pp. 451-458.
[18] Goldberger J., Roweis S., Hinton G., Salakhutdinov R., "Neighbourhood components analysis," Advances in Neural Information Processing Systems, 2004, pp. 513-520.
[19] Xing E., Ng A., Jordan M., Russell S., "Distance metric learning, with application to clustering with side-information," Advances in Neural Information Processing Systems, 2002, pp. 505-512.
[20] Kedem D., Tyree S., Sha F., Lanckriet G.R., Weinberger K.Q., "Non-linear metric learning," Advances in Neural Information Processing Systems, 2012, pp. 2573-2581.
[21] OpenCV, http://opencv.org/
[22] Huang G., Ramesh M., Berg T., Learned-Miller E., "Labeled faces in the wild: a database for studying face recognition in unconstrained environments," University of Massachusetts, 2007.
[23] Liang Y., Liao S., Wang L., Zou B., "Exploring regularized feature selection for person specific face verification," in Proc. of International Conference on Computer Vision, 2011, pp. 1676-1683.
[24] Sun Y., Wang X., Tang X., "Hybrid deep learning for face verification," in Proc. of International Conference on Computer Vision, 2013, pp. 1489-1496.
[25] Taigman Y., Yang M., Ranzato M.A., Wolf L., "DeepFace: closing the gap to human-level performance in face verification," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1701-1708.
[26] Huang G.B., Lee H., Learned-Miller E., "Learning hierarchical representations for face verification with convolutional deep belief networks," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2518-2525.
[27] Nguyen H.V., Bai L., "Cosine similarity metric learning for face verification," in Proc. of Asian Conference on Computer Vision, 2010, pp. 709-720.
[28] Ojala T., Pietikäinen M., Mäenpää T., "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), pp. 971-987, 2002. DOI: 10.1109/TPAMI.2002.1017623
[29] Lowe D.G., "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 60(2), pp. 91-110, 2004.
[30] Wiskott L., Fellous J.M., Krüger N., Malsburg C.V.D., "Face recognition by elastic bunch graph matching," IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), pp. 775-779, 1997. DOI: 10.1109/34.598235
[31] Pinto N., DiCarlo J., Cox D., "How far can you get with a modern face recognition test set using only simple features?," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 2591-2598.
[32] Tan X., Triggs B., "Enhanced local texture feature sets for face recognition under difficult lighting conditions," Lecture Notes in Computer Science, 4778, pp. 168-182, 2007.
[33] Zhang L., Chu R., Xiang S., Liao S., Li S., "Face detection based on multi-block LBP representation," Lecture Notes in Computer Science, 4642, pp. 11-18, 2007.
[34] Brahnam S., Jain L.C., Nanni L., Lumini A., "Local binary patterns: new variants and applications," Springer, 2014.
[35] Wolf L., Hassner T., Taigman Y., "Effective unconstrained face recognition by combining multiple descriptors and learned background statistics," IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(10), pp. 1978-1990, 2010. DOI: 10.1109/TPAMI.2010.230
[36] Choi J.Y., Ro Y.M., Plataniotis K.N., "Color local texture features for color face recognition," IEEE Transactions on Image Processing, 21(3), pp. 1366-1380, 2012. DOI: 10.1109/TIP.2011.2168413
[37] Shih P., Liu C., "Improving the face recognition grand challenge baseline performance using color configurations across color spaces," in Proc. of International Conference on Image Processing, 2006, pp. 1001-1004.
[38] Liu Z., Liu C., "A hybrid color and frequency features method for face recognition," IEEE Transactions on Image Processing, 17(10), pp. 1975-1980, 2008. DOI: 10.1109/TIP.2008.2002837
[39] Wang J., Liu C., "Color image discriminant models and algorithms for face recognition," IEEE Transactions on Neural Networks, 19(12), pp. 2088-2097, 2008. DOI: 10.1109/TNN.2008.2005140
[40] Choi J.Y., Ro Y.M., Plataniotis K.N., "Boosting color feature selection for color face recognition," IEEE Transactions on Image Processing, 20(5), pp. 1425-1434, 2011. DOI: 10.1109/TIP.2010.2093906
[41] Ranzato M., Boureau Y., LeCun Y., "Sparse feature learning for deep belief networks," Advances in Neural Information Processing Systems, 2007, pp. 1185-1192.
[42] Taigman Y., Yang M., Ranzato M., Wolf L., "DeepFace: closing the gap to human-level performance in face verification," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[43] Li F.F., Fergus R., Perona P., "One-shot learning of object categories," IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), pp. 594-611, 2006. DOI: 10.1109/TPAMI.2006.79
[44] Fink M., "Object classification from a single example utilizing class relevance pseudo-metrics," Advances in Neural Information Processing Systems, 2005, pp. 449-456.
[45] Wolf L., Hassner T., Taigman Y., "Descriptor based methods in the wild," in Proc. of Workshop on Faces in Real-Life Images: Detection, Alignment, and Recognition, 2008.
[46] Taigman Y., Wolf L., Hassner T., "Multiple one-shots for utilizing class label information," in Proc. of British Machine Vision Conference, 2009, pp. 1-12.
[47] Chu W.S., Chen J.C., Lien J.J., "Kernel discriminant transformation for image set-based face recognition," Pattern Recognition, 44(8), pp. 1567-1580, 2011. DOI: 10.1016/j.patcog.2011.02.011
[48] Belhumeur P., Hespanha J., Kriegman D., "Eigenfaces vs. Fisherfaces: recognition using class specific linear projection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), pp. 711-720, 1997. DOI: 10.1109/34.598228
[49] Lazebnik S., Schmid C., Ponce J., "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 2169-2178.