Relative entropy is a divergence measure between two probability density functions of a random variable. Assuming that the random variable has only two alphabets, the relative entropy becomes the cross-entropy error function, which can accelerate the training convergence of multilayer perceptron neural networks. Also, the nth-order extension of cross-entropy (nCE) error function exhibits improved performance in terms of learning convergence and generalization capability. In this paper, we derive a new divergence measure between two probability density functions from the nCE error function, and we compare the new divergence measure with the relative entropy through the use of three-dimensional plots.
1. INTRODUCTION
Multilayer perceptron (MLP) neural networks can approximate any function given enough hidden nodes [1]-[3], and this has widened the application of MLPs to fields such as pattern recognition, speech recognition, time series prediction, and bioinformatics. MLPs are usually trained with the error back-propagation (EBP) algorithm, which minimizes the mean-squared error (MSE) function between the outputs of the MLP and their desired values [4]. However, the EBP algorithm suffers from slow learning convergence and poor generalization performance [5], [6]. This is due to the incorrect saturation of output nodes and over-specialization to training samples [6].
Usually, sigmoidal functions are adopted as the activation functions of nodes in an MLP. The sigmoidal activation function can be divided into a central linear region and two outer saturated regions. When an output node of an MLP lies in the extreme saturated region of the sigmoidal activation function opposite to its desired value, we say the output node is "incorrectly saturated." Incorrect saturation makes the weight updates small, and consequently learning convergence becomes slow. Also, when an MLP is trained too intensively on its training samples, it becomes over-specialized to them and its generalization performance on untrained test samples is poor.
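The effect of incorrect saturation on the weight update can be sketched numerically. The following is a minimal illustration (our own, not from the paper) using the standard EBP error signal at a sigmoid output node under the MSE error function, delta = (t − y)·y·(1 − y):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# EBP error signal (delta) at a sigmoid output node under the MSE error
# function: (t - y) * y * (1 - y), where t is the desired value.
def delta_mse(t, y):
    return (t - y) * y * (1.0 - y)

# A node in the central region: y not far from its desired value t = 1.
y_central = sigmoid(2.0)
# An incorrectly saturated node: y near 0 although t = 1.
y_saturated = sigmoid(-6.0)

# Although the output error |t - y| is larger for the saturated node,
# the factor y * (1 - y) drives its weight update toward zero.
print(delta_mse(1.0, y_central))
print(delta_mse(1.0, y_saturated))
```

The saturated node produces a far smaller update despite its larger output error, which is the mechanism behind the slow convergence described above.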
The cross-entropy (CE) error function accelerates the EBP algorithm by decreasing the incorrect saturation of output nodes [5]. Furthermore, the nth-order extension of cross-entropy (nCE) error function attains accelerated learning convergence and improved generalization capability by decreasing the incorrect saturation as well as preventing over-specialization to training samples [6].
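To see why CE mitigates incorrect saturation, compare the output-node error signals of the two error functions: with a sigmoid output node, MSE yields (t − y)·y·(1 − y) while CE yields t − y, so the vanishing sigmoid-derivative factor cancels out. A small numerical sketch (our own illustration):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Output-node error signals (deltas) for desired value t and output y:
def delta_mse(t, y):
    return (t - y) * y * (1.0 - y)  # MSE: damped by the sigmoid derivative

def delta_ce(t, y):
    return t - y                    # CE: the y * (1 - y) factor cancels

# Incorrectly saturated node: desired value t = 1 but y is near 0.
y = sigmoid(-6.0)
print(delta_mse(1.0, y))  # nearly zero, so learning stalls
print(delta_ce(1.0, y))   # close to 1, a large corrective update
```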
Information theory has played a great role in the neural network community. From the information-theoretic view, many learning rules for improved performance have been derived, such as minimum class-entropy, entropy minimization, and feature extraction using information-theoretic learning [7]-[11]. Information theory can also be a basis for constructing neural networks [12], and an upper bound on the probability of error was derived based on Renyi's entropy [13]. Maximizing the information contents of hidden nodes can also improve the performance of MLPs [14], [15]. In this paper, we focus on the relationship between relative entropy and the CE error function.
Relative entropy is a divergence measure between two probability density functions [16]. Assuming that a random variable has two alphabets, the relative entropy becomes the cross-entropy (CE) error function, which can accelerate the learning convergence of MLPs. Since the nCE error function is an extension of the CE error function, there must be a divergence measure corresponding to nCE just as there is for CE. In this sense, this paper derives a new divergence measure from the nCE error function. In section 2, the relationship between the relative entropy and CE is introduced. Section 3 derives a new divergence measure from the nCE error function and compares it with the relative entropy. Finally, section 4 concludes this paper.
2. RELATIVE ENTROPY AND CROSS-ENTROPY
Consider a random variable x whose probability density function (p.d.f.) is p(x). In the case that the p.d.f. of x is estimated with q(x), we need to measure how accurate the estimation is. Therefore, the relative entropy is defined as a divergence measure between p(x) and q(x) by [16]

D(p‖q) = Σ_x p(x) log( p(x) / q(x) ).        (1)

Let's assume that the random variable x has only two alphabets, 0 and 1, with the probabilities

p(x=1) = p,  p(x=0) = 1 − p.        (2)

Also,

q(x=1) = q,  q(x=0) = 1 − q.        (3)

Then,

D(p‖q) = p log(p/q) + (1−p) log((1−p)/(1−q))
       = −H(x) + { −p log q − (1−p) log(1−q) }.        (4)
Here,

H(x) = −p log p − (1−p) log(1−p)        (5)

is the entropy of a random variable x with two alphabets, and

CE(p, q) = −p log q − (1−p) log(1−q)        (6)

is the cross-entropy. If we assume that 'q' corresponds to a real output value 'y' of an MLP output node and 'p' corresponds to its desired value 't', we can define the cross-entropy error function as

E_CE = −t log y − (1−t) log(1−y).        (7)

Thus, the cross-entropy error function is one specific type of relative entropy under the assumption that a random variable has only two alphabets [15].
We can use the unipolar [0, 1] mode or the bipolar [−1, +1] mode for describing node values of MLPs. Since 't' and 'y' correspond to 'p' and 'q', respectively, they are in the range [0, 1]. Thus, the relationship between the relative entropy and the cross-entropy error function is based on the unipolar mode of node values.
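The decomposition in Eq. (4) can be checked numerically. The sketch below is our own illustration, assuming the standard binary-alphabet formulas; it verifies that the relative entropy equals the cross-entropy minus the entropy:

```python
import math

def entropy(p):
    # H(x) = -p log p - (1-p) log(1-p), binary alphabet
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def cross_entropy(p, q):
    # CE(p, q) = -p log q - (1-p) log(1-q)
    return -p * math.log(q) - (1 - p) * math.log(1 - q)

def relative_entropy(p, q):
    # D(p||q) = p log(p/q) + (1-p) log((1-p)/(1-q))
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p, q = 0.8, 0.3
# The two sides of the decomposition D(p||q) = -H(p) + CE(p, q) agree.
print(relative_entropy(p, q))
print(-entropy(p) + cross_entropy(p, q))
```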
3. NEW DIVERGENCE MEASURE FROM THE nth ORDER EXTENSION OF CROSS-ENTROPY
The nth-order extension of cross-entropy (nCE) error function was proposed, based on the bipolar mode of node values, as Eq. (8) [6], where n is a natural number. In order to derive a new divergence measure from the nCE error function based on the relationship between the relative entropy and the CE error function, we need a unipolar-mode formulation of the nCE error function; this is derived as Eq. (9). We will derive new divergence measures from Eq. (9) with n = 2 and 4.
When n = 2, the nCE error function given by Eq. (9) becomes Eq. (10), with the constituent terms given by Eqs. (11) and (12). By substituting Eqs. (11) and (12) into Eq. (10), we obtain Eq. (13). In order to derive a new divergence measure corresponding to nCE (n = 2), t and y are substituted with p and q, respectively. This is the reverse of the procedure used to derive Eq. (7) from Eq. (6), in which 'p' and 'q' were substituted with 't' and 'y', respectively. Then we obtain Eq. (14). Thus, following the form of the last equation in Eq. (4), the new divergence measure is derived as Eq. (15), with its terms defined in Eq. (16).
When n = 4, the nCE error function given by Eq. (9) becomes Eq. (17), with the constituent terms given by Eqs. (18)-(22). Substituting Eqs. (18), (19), (20), (21), and (22) into Eq. (17), we obtain Eq. (23). By substituting t and y with p and q, respectively, we obtain Eq. (24). Thus, following the form of the last equation in Eq. (4), the new divergence measure is derived as Eq. (25), with its terms defined in Eq. (26).
In order to compare the new divergence measures given by Eqs. (15) and (25) with the relative entropy given by Eq. (4), we plot them over the range in which p and q lie in [0, 1]. Fig. 1 shows the three-dimensional plot of the relative entropy D(p‖q). The x and y axes correspond to p and q, respectively, and the z axis corresponds to D(p‖q). D(p‖q) attains its minimum of zero when p = q, and it increases as p moves away from q. Since D(p‖q) is a divergence measure, it is not symmetric.
Fig. 1. The three-dimensional plot of relative entropy D(p‖q) with two alphabets
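The properties described above, namely a minimum of zero on the diagonal p = q, growth away from it, and asymmetry, can be checked with a short numerical sketch of Eq. (4). This is our own illustration, using the 0·log 0 = 0 convention at the boundary:

```python
import math

def D(p, q):
    # Relative entropy for a binary alphabet; a term with p-probability 0
    # contributes nothing (the 0 * log 0 = 0 convention).
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p, q) + term(1 - p, 1 - q)

# Minimum of zero on the diagonal p = q ...
print(D(0.4, 0.4))
# ... strictly positive off the diagonal ...
print(D(0.9, 0.2))
# ... and not symmetric: D(p||q) != D(q||p) in general.
print(D(0.2, 0.9))
```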
Fig. 2 shows the three-dimensional plot of the new divergence measure F(p‖q; n=2) given by Eq. (15). F(p‖q; n=2) attains its minimum of zero when p = q and increases as p moves away from q, as D(p‖q) does. Furthermore, we can see that F(p‖q; n=2) is flatter than D(p‖q). Also, the three-dimensional plot of F(p‖q; n=4) shown in Fig. 3 attains its minimum of zero when p = q and is flatter than F(p‖q; n=2) shown in Fig. 2. So, increasing the order n of the new divergence measure makes the divergence measure flatter.
Fig. 2. The three-dimensional plot of the new divergence measure with two alphabets when n=2, F(p‖q; n=2)
Fig. 3. The three-dimensional plot of the new divergence measure with two alphabets when n=4, F(p‖q; n=4)
When applying MLPs to pattern classification, the optimal outputs of an MLP based on various error functions were derived in [6] and [18]; we plot them in Fig. 4. The optimal output of an MLP based on the CE error function is a first-order function of the a posteriori probability that a certain input sample belongs to a specific class. When the nCE error function with n=2 is used for training MLPs, as shown in Fig. 4, the optimal output of the MLP is flatter than in the CE case. And the nCE error function with n=4 gives an optimal output that is flatter than in the CE and nCE (n=2) cases. The two-dimensional contour plots of the CE and nCE error functions also show the same property [17]. So, we can argue that the property of the divergence measures derived from CE and nCE coincides with the two-dimensional contour plots of the CE and nCE error functions in [17] and the optimal outputs in [6] and [18].
Fig. 4. Optimal outputs of MLPs. Here, Q(x) denotes the a posteriori probability that a certain input x belongs to a specific class
4. CONCLUSIONS
In this paper, we have introduced the relationship between relative entropy and the CE error function. When a random variable has only two alphabets, the relative entropy becomes the cross-entropy. Based on this relationship, we derived a new divergence measure from the nCE error function. Comparing the three-dimensional plots of the relative entropy and the new divergence measure when n=2 and 4, we can argue that the order n of the new divergence measure has the effect of flattening the divergence measure. This property coincides with the previous results comparing the optimal outputs and contour plots of CE and nCE.
Acknowledgements
This work was supported by the National Research Foundation of Korea Grant funded by the Korean Government (NRF-2014K2A2A4001441).
BIO
Sang-Hoon Oh
He received his B.S. and M.S. degrees in Electronics Engineering from Busan National University in 1986 and 1988, respectively. He received his Ph.D. degree in Electrical Engineering from Korea Advanced Institute of Science and Technology in 1999. From 1988 to 1989, he worked for LG Semiconductor, Ltd., Korea. From 1990 to 1998, he was a senior research staff member at the Electronics and Telecommunications Research Institute (ETRI), Korea. From 1999 to 2000, he was with the Brain Science Research Center, KAIST. In 2000, he was with the Brain Science Institute, RIKEN, Japan, as a research scientist. In 2001, he was an R&D manager of Extell Technology Corporation, Korea. Since 2002, he has been with the Department of Information Communication Engineering, Mokwon University, Daejeon, Korea, and is now a professor. Also, he was with the Division of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, USA, as a visiting scholar from August 2008 to August 2009. His research interests are machine learning, speech signal processing, pattern recognition, and bioinformatics.
Hiroshi Wakuya
He received the B.E. degree in electronic engineering from Kyushu Institute of Technology, Kitakyushu, Japan, in 1989, and the M.E. and Ph.D. degrees in electrical and communication engineering from Tohoku University, Sendai, Japan, in 1991 and 1994, respectively. In 1994, he joined the staff of Saga University as a Research Associate. In 1995, he became a Lecturer. From 1999 to 2000, he was a Visiting Scientist at the University of Louisville, Louisville, Kentucky, USA. Since 2004, he has been an Associate Professor. His research interests include neural networks, intelligent instrumentation, and biological engineering.
Sun-Gyu Park
He received the Ph.D. degree in architecture from Tokyo University, Japan, in September 2004. He has been a professor at Mokwon University, Korea, since March 2009. His main research fields are initial crack prediction in concrete, technology development of blast furnace slag concrete using alkali activators, and performance enhancement of recycled aggregate concrete.
Hwang-Woo Noh
He received the B.S. and M.S. degrees from the Department of Industrial Design, Hanbat National University, Daejeon, Korea, in 1996 and 2003, respectively. He also completed his Ph.D. degree course at the Department of Industrial Design, Chungnam National University, Daejeon, Korea, in 2013. From 1997 to 2008, he worked as a representative of DesignComphics Inc. He joined the faculty of the Department of Visual Design, Hanbat National University, Daejeon, Korea, in 2009. During 2008-2009 he served as an executive director of the Korea Design Industrial Association, where he was nominated Head of the Daejeon-Chungcheong Branch. He is currently a professor at Hanbat National University, Secretary-General of the Daejeon Design Development Forum, and a Vice-president of the Korea Contents Association. His main research interests include Visual Communication Design and its Fundamentals, Packaging Design, and Disaster Prevention Design.
Jae-Soo Yoo
He received his M.S. and Ph.D. degrees in Computer Science from the Korea Advanced Institute of Science and Technology, Korea, in 1991 and 1995, respectively. He is now a professor in Information and Communication Engineering, Chungbuk National University, Korea. He has also been the president of the Korea Contents Association since 2013. His main research interests include sensor data management, big data, and mobile social networks.
Byung-Won Min
He received his M.S. degree in computer software from Chungang University, Seoul, Korea, in 2005. He worked as a professor in the Department of Computer Engineering, Youngdong University, Youngdong, Chungbuk, Korea, from 2005 to 2008. He received his Ph.D. degree in Information and Communication Engineering from Mokwon University, Daejeon, Korea, in 2010. His research interests include SaaS & Mobile Cloud, Database, and Software Engineering. Recently he has become interested in Big Data processing and its applications.
Yong-Sun Oh
He received B.S., M.S., and Ph.D. degrees in electronic engineering from Yonsei University, Seoul, Korea, in 1983, 1985, and 1992, respectively. He worked as an R&D engineer at the System Development Division of Samsung Electronics Co. Ltd., Kiheung, Kyungki-Do, Korea, from 1984 to 1986. He joined the Dept. of Information & Communication Engineering, Mokwon University, in 1988. During 1998-1999 he served as a visiting professor at Korea Maritime University, Busan, Korea, where he was nominated as Head of the Academic Committee of KIMICS, an institute. He returned to Mokwon University in 1999, and served as Dean of the Central Library and Information Center from 2000 to 2002, as Director of the Corporation of Industrial & Educational Programs from 2003 to 2005, and as Dean of the Engineering College and Dean of Management Strategic Affairs from 2010 to 2013, respectively. He was the President of KoCon from 2006 to 2012. During his sabbatical years, he worked as an Invited Researcher at ETRI from 2007 to 2008, and as a Visiting Scholar at KISTI from 2014 to 2015. His research interests include Digital Communication Systems, Information Theory, and their applications. Recently he is interested in Multimedia Content and Personalized e-Learning.
References
Rumelhart, D. E. and McClelland, J. L., Parallel Distributed Processing, Cambridge, MA, 1986.
Oh, S.-H., "Improving the Error Back-Propagation Algorithm with a Modified Error Function," IEEE Trans. Neural Networks, vol. 8, pp. 799-803, 1997. DOI: 10.1109/72.572117
El-Jaroudi, A. and Makhoul, J., "A New Error Criterion for Posterior Probability Estimation with Neural Nets," Proc. IJCNN'90, vol. III, pp. 185-192, Jun. 1990.
Ridella, S., Rovetta, S., and Zunino, R., "Representation and Generalization Properties of Class-Entropy Networks," IEEE Trans. Neural Networks, vol. 10, pp. 31-47, 1999. DOI: 10.1109/72.737491
Erdogmus, D. and Principe, J. C., "Entropy Minimization Algorithm for Multilayer Perceptrons," Proc. IJCNN'01, vol. 4, pp. 3003-3008, 2001.
Hild II, K. E., Erdogmus, D., Torkkola, K., and Principe, J. C., "Feature Extraction Using Information-Theoretic Learning," IEEE Trans. PAMI, vol. 28, no. 9, pp. 1385-1392, 2006. DOI: 10.1109/TPAMI.2006.186
Lee, S.-J., Jone, M.-T., and Tsai, H.-L., "Constructing Neural Networks for Multiclass-Discretization Based on Information Theory," IEEE Trans. Sys., Man, and Cyb. Part B, vol. 29, pp. 445-453, 1999. DOI: 10.1109/3477.764881
Erdogmus, D. and Principe, J. C., "Information Transfer Through Classifiers and Its Relation to Probability of Error," Proc. IJCNN'01, vol. 1, pp. 50-54, 2001.
Kamimura, R. and Nakanishi, S., "Hidden Information Maximization for Feature Detection and Rule Discovery," Network: Computation in Neural Systems, vol. 6, pp. 577-602, 1995. DOI: 10.1088/0954-898X_6_4_004
Torkkola, K., "Nonlinear Feature Transforms Using Maximum Mutual Information," Proc. IJCNN'01, vol. 4, pp. 2756-2761, 2001.
Cover, T. M. and Thomas, J. A., Elements of Information Theory, John Wiley & Sons, 1991.
Oh, S.-H., "Contour Plots of Objective Functions for Feed-Forward Neural Networks," Int. Journal of Contents, vol. 8, no. 4, pp. 30-35, 2012. DOI: 10.5392/IJoC.2012.8.4.030
Oh, S.-H., "Statistical Analyses of Various Error Functions for Pattern Classifiers," CCIS, vol. 206, pp. 129-133, 2011.