Hidden nodes play a key role in the information processing of feed-forward neural networks, in which inputs are processed through a series of weighted sums and nonlinear activation functions. In order to understand the role of hidden nodes, we must analyze the effect of the nonlinear activation functions on the weighted sums to hidden nodes. In this paper, we focus on the effect of nonlinear functions from the viewpoint of information theory. Under the assumption that the nonlinear activation function can be approximated piece-wise linearly, we prove that the entropy of the weighted sums to hidden nodes decreases after the piece-wise linear transformation. Therefore, we argue that the nonlinear activation function decreases the uncertainty among hidden nodes. Furthermore, the more the hidden nodes are saturated, the more their entropy decreases. Based on this result, we can say that, after successful training of feed-forward neural networks, hidden nodes tend to lie not in the linear region but in the saturated regions of the activation function, with the effect of reducing uncertainty.
1. INTRODUCTION
When an input sample is presented to a feed-forward neural network (FNN), it is processed through a series of weighted sums and nonlinear activation functions. It has been proved that FNNs with enough hidden nodes can approximately implement any function [1]-[4]. Herein, the weighted sums to hidden nodes are a sort of projection from the input space to a hidden feature space, followed by element-wise nonlinear activation functions. There have been several research results on the role of hidden nodes. Oh and Lee proved that the nonlinear function of hidden nodes has the effect of decreasing correlations among hidden nodes [5]. They also argued that FNNs are a special type of nonlinear whitening filter [5]. Shah and Poon showed that hidden nodes with sigmoidal activation functions can produce linearly independent internal representations [6].
In the neural network field, information theory has provided many fruitful research results. Lee et al. reported that FNNs have the capability of extracting information for pattern classification by hierarchically keeping inter-class information while reducing intra-class variations [7]. Learning rules stemming from information theory have been proposed [8]-[14]. The upper bound on the probability of error was derived based on Renyi's entropy [15]. Information theory can also provide a construction strategy for neural networks [16]. Furthermore, hidden information maximization and maximum mutual information methods were proposed for feature extraction [17], [18]. In this paper, we focus on the nonlinear activation functions of hidden nodes from the viewpoint of information theory. Under the assumption that the nonlinear activation function can be approximated piece-wise linearly, we derive that the entropy of hidden nodes decreases after the piece-wise linear transformation. Based on this derivation, we can interpret the role of hidden nodes in terms of entropy, or uncertainty.
2. NONLINEAR EFFECT ON THE ENTROPY OF HIDDEN NODES
In FNNs, inputs are processed through a series of weighted sums and nonlinear activation functions. When inputs or weights are perturbed randomly, the weighted sums to hidden nodes are approximately jointly Gaussian according to the central limit theorem [19]-[21]. Therefore, we analyze the effect of the nonlinear function on jointly Gaussian random variables.
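As an illustration of this Gaussian tendency, the following minimal sketch (with a hypothetical layer width, input distribution, and weight scale, none of which come from the text) forms the weighted sums to one hidden node from randomly drawn inputs and checks that their skewness and excess kurtosis are close to the Gaussian values of zero.

import numpy as np

rng = np.random.default_rng(42)
n_inputs, n_samples = 64, 100_000        # hypothetical layer width and sample count

# random inputs and one fixed random weight vector (illustrative choices only)
x = rng.uniform(-1.0, 1.0, size=(n_samples, n_inputs))
w = rng.normal(0.0, 1.0 / np.sqrt(n_inputs), size=n_inputs)

u = x @ w                                # weighted sums to a single hidden node

# standardized third and fourth moments; both are ~0 for a Gaussian
s = (u - u.mean()) / u.std()
skewness = (s**3).mean()
excess_kurtosis = (s**4).mean() - 3.0
print(f"skewness = {skewness:.3f}, excess kurtosis = {excess_kurtosis:.3f}")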
If u and v are jointly Gaussian random variables with zero means, then the joint entropy of u and v [22] is given by

h(u,v) = \ln\!\left(2\pi e\,\sigma_u \sigma_v \sqrt{1-r^2}\right). \qquad (1)

Herein, \sigma_u and \sigma_v are the standard deviations of u and v, respectively, and r is the correlation coefficient between u and v.
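For reference, Eq. (1) follows from the standard expression for the differential entropy of a multivariate Gaussian [22], using only the covariance matrix of (u, v):

h(u,v) = \frac{1}{2}\ln\!\left[(2\pi e)^2 \det\Sigma\right], \qquad
\Sigma = \begin{pmatrix} \sigma_u^2 & r\,\sigma_u\sigma_v \\ r\,\sigma_u\sigma_v & \sigma_v^2 \end{pmatrix}, \qquad
\det\Sigma = \sigma_u^2\sigma_v^2(1-r^2),

so that h(u,v) = \ln\!\left(2\pi e\,\sigma_u\sigma_v\sqrt{1-r^2}\right).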
Let us assume that u and v are transformed into y and z as follows:

y = \begin{cases} a u, & u \ge 0, \\ b u, & u < 0, \end{cases} \qquad
z = \begin{cases} c v, & v \ge 0, \\ d v, & v < 0, \end{cases} \qquad (2)

where a, b, c, and d are nonzero real values. Then, the entropy of y is given by

h(y) = -\int_{-\infty}^{\infty} f_y(y) \ln f_y(y)\, dy, \qquad (3)

where f_y(y) is the probability density function (p.d.f.) of random variable y.
By substituting the p.d.f. of y, which follows from Papoulis [19], Eq. (3) can be evaluated. Here, f_u(u) is the p.d.f. of random variable u, and the integral of f_u(u) over each half-axis equals 1/2 since u is Gaussian with zero mean. Accordingly, the entropy of z can be evaluated in the same way.
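Since the displayed equations of this step are not reproduced above, the marginal-entropy result can be sketched from Eqs. (2) and (3) under the stated assumptions (positive slopes and a zero-mean, symmetric Gaussian u); the exact numbered equations of the original derivation may differ:

f_y(y) = \frac{1}{a} f_u\!\left(\frac{y}{a}\right) \ \ (y \ge 0), \qquad
f_y(y) = \frac{1}{b} f_u\!\left(\frac{y}{b}\right) \ \ (y < 0),

h(y) = h(u) + \tfrac{1}{2}(\ln a + \ln b), \qquad
h(z) = h(v) + \tfrac{1}{2}(\ln c + \ln d),

where the factor 1/2 comes from \int_0^{\infty} f_u(u)\,du = \int_{-\infty}^{0} f_u(u)\,du = \tfrac{1}{2}.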
The joint entropy after the piece-wise linear transformations is defined by

h(y,z) = -\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{yz}(y,z) \ln f_{yz}(y,z)\, dy\, dz. \qquad (8)

The joint p.d.f. of y and z can be separated into four quadrants, one for each combination of the signs of y and z. Therefore, Eq. (8) is also separated into four parts, one integral per quadrant, in Eq. (10). Since the joint p.d.f. of u and v is given by

f_{uv}(u,v) = \frac{1}{2\pi \sigma_u \sigma_v \sqrt{1-r^2}} \exp\!\left[-\frac{1}{2(1-r^2)}\left(\frac{u^2}{\sigma_u^2} - \frac{2 r u v}{\sigma_u \sigma_v} + \frac{v^2}{\sigma_v^2}\right)\right], \qquad (11)

the first quadrant term in Eq. (10) is derived accordingly, as Eq. (12).
The second term on the right side of Eq. (12) is evaluated using the fact that y = au is symmetric and zero mean, and the remaining term follows according to the same procedure. The required integral results are given by Papoulis [19] and by Oh [23]. By substituting Eqs. (18) and (20) into Eq. (12), the first-quadrant term, Eq. (24), is obtained. Using the same procedure from Eq. (11) to Eq. (24), we can derive the remaining quadrant terms, Eqs. (25), (26), and (27). By substituting Eqs. (24), (25), (26), and (27) into Eq. (10), we attain the joint entropy h(y,z) as Eq. (28). Finally, we substitute Eq. (1) into Eq. (28) to obtain the final expression.
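Because the displayed results of Eqs. (12)-(28) are not reproduced above, the final relationship can be sketched from the quadrant decomposition. For the zero-mean bivariate Gaussian, the quadrant probabilities are 1/4 ± arcsin(r)/(2π), and the quadrant-wise Jacobians of Eq. (2) are ac, bc, bd, and ad; since \ln(ac) + \ln(bd) = \ln(bc) + \ln(ad) = \ln(abcd), the arcsin(r) contributions cancel, giving (under the stated assumptions, and possibly differing in form from the original Eq. (28)):

h(y,z) = h(u,v) + \tfrac{1}{2}(\ln a + \ln b + \ln c + \ln d),

and, after substituting Eq. (1),

h(y,z) = \ln\!\left(2\pi e\,\sigma_u\sigma_v\sqrt{1-r^2}\right) + \tfrac{1}{2}\ln(abcd).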
In FNNs, we usually use the tanh(·) sigmoidal function (Fig. 1) as the nonlinear activation function of hidden nodes. Assuming that the nonlinear activation function can be approximated piece-wise linearly as in Eq. (2), the parameters a, b, c, and d correspond to the slopes of the nonlinear activation function. As shown in Fig. 2, the slope is less than or equal to 0.5. Thus, the logarithms of the slopes are negative and h(y,z) < h(u,v). Also, the less steep the slopes are, the more the joint entropy after the nonlinear function decreases.
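As a numerical sanity check of this claim, the following sketch uses hypothetical slope values (all at most 0.5) and the change-of-variables identity h(y,z) = h(u,v) + E[ln|J|], where J is the quadrant-wise Jacobian of Eq. (2); E[ln|J|] is estimated by Monte Carlo over jointly Gaussian samples.

import numpy as np

rng = np.random.default_rng(0)
n = 500_000
sigma_u, sigma_v, r = 1.0, 1.0, 0.3      # hypothetical standard deviations and correlation
cov = [[sigma_u**2, r*sigma_u*sigma_v],
       [r*sigma_u*sigma_v, sigma_v**2]]
u, v = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

# hypothetical piece-wise linear slopes, mimicking a partly saturated sigmoid
a, b, c, d = 0.5, 0.1, 0.4, 0.05

# joint entropy of (u, v): closed form of Eq. (1) for the zero-mean bivariate Gaussian
h_uv = np.log(2 * np.pi * np.e * sigma_u * sigma_v * np.sqrt(1 - r**2))

# Jacobian of the quadrant-wise linear map: a*c, b*c, b*d, or a*d depending on signs
jac = np.where(u >= 0, a, b) * np.where(v >= 0, c, d)
h_yz = h_uv + np.log(jac).mean()         # h(y,z) = h(u,v) + E[ln|J|]

print(f"h(u,v) = {h_uv:.4f} nats, h(y,z) = {h_yz:.4f} nats")   # h(y,z) < h(u,v)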
Consequently, we can argue that the nonlinear activation function decreases the entropy, or uncertainty, of hidden nodes. The sigmoid activation function can be separated into a linear region with a steep slope and saturated regions with gentle slopes, as shown in Fig. 1. When hidden node values are in the saturated regions of the sigmoid activation function, the entropy decreases much more. This coincides with the argument that hidden nodes tend to be saturated after successful training.
Fig. 1. The sigmoid activation function of a hidden node.
Fig. 2. The slope of the sigmoid activation function.
Furthermore, the entropy of hidden nodes provides a hierarchical understanding of the information extraction capabilities acquired through learning. Lee et al. argued that input samples carry inter-class information as well as intra-class variation [7]. The inter-class information is the information content indicating that an input sample belongs to a specific class, and the intra-class variation is a measure of the average variation within classes, including noise contamination. After learning, input samples are projected to hidden nodes through weighted sums and element-wise nonlinear transformations. In this paper, we proved that the entropy of hidden nodes decreases after the nonlinear transformations. The decrease of hidden nodes' entropy corresponds to the decrease of intra-class variations, as pointed out by Lee et al. [7].
3. CONCLUSIONS
In this paper, we prove that the entropy of jointly Gaussian random variables decreases after piece-wise linear transformations. Also, the less steep the slopes are, the more the joint entropy decreases. Since the nonlinear activation function of hidden nodes can be approximated piece-wise linearly, we can argue that the nonlinear activation function decreases the uncertainty among hidden nodes. Furthermore, the entropy of hidden nodes decreases much more after successful training of FNNs.
BIO
Sang-Hoon Oh
received his B.S. and M.S. degrees in Electronics Engineering from Pusan National University in 1986 and 1988, respectively. He received his Ph.D. degree in Electrical Engineering from the Korea Advanced Institute of Science and Technology in 1999. From 1988 to 1989, he worked for LG Semiconductor, Ltd., Korea. From 1990 to 1998, he was a senior researcher at the Electronics and Telecommunications Research Institute (ETRI), Korea. From 1999 to 2000, he was with the Brain Science Research Center, KAIST. In 2000, he was with the Brain Science Institute, RIKEN, Japan, as a research scientist. In 2001, he was an R&D manager at Extell Technology Corporation, Korea. Since 2002, he has been with the Department of Information Communication Engineering, Mokwon University, Daejon, Korea, where he is now a full professor. He was also with the Division of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, USA, as a visiting scholar from August 2008 to August 2009. His research interests include machine learning, speech signal processing, pattern recognition, and bioinformatics.
REFERENCES
Y. Liao, S. C. Fang, and H. L. W. Nuttle, "Relaxed Conditions for Radial-Basis Function Networks to be Universal Approximators," Neural Networks, vol. 16, pp. 1019-1028, 2003. DOI: 10.1016/S0893-6080(02)00227-7
S. H. Oh and Y. Lee, "Effect of Nonlinear Transformations on Correlation Between Weighted Sums in Multilayer Perceptrons," IEEE Trans. Neural Networks, vol. 5, pp. 508-510, 1994. DOI: 10.1109/72.286927
J. V. Shah and C. S. Poon, "Linear Independence of Internal Representations in Multilayer Perceptrons," IEEE Trans. Neural Networks, vol. 10, pp. 10-18, 1999. DOI: 10.1109/72.737489
Y. Lee and H. K. Song, "Analysis on the Efficiency of Pattern Recognition Layers Using Information Measures," Proc. IJCNN'93 Nagoya, vol. 3, pp. 2129-2132, 1993.
A. El-Jaroudi and J. Makhoul, "A New Error Criterion for Posterior Probability Estimation with Neural Nets," Proc. IJCNN'90, vol. 3, pp. 185-192, 1990.
S. Ridella, S. Rovetta, and R. Zunino, "Representation and Generalization Properties of Class-Entropy Networks," IEEE Trans. Neural Networks, vol. 10, pp. 31-47, 1999. DOI: 10.1109/72.737491
D. Erdogmus and J. C. Principe, "Entropy Minimization Algorithm for Multilayer Perceptrons," Proc. IJCNN'01, vol. 4, pp. 3003-3008, 2001.
K. E. Hild II, D. Erdogmus, K. Torkkola, and J. C. Principe, "Feature Extraction Using Information-Theoretic Learning," IEEE Trans. PAMI, vol. 28, no. 9, pp. 1385-1392, 2006. DOI: 10.1109/TPAMI.2006.186
R. Li, W. Liu, and J. C. Principe, "A Unifying Criterion for Instantaneous Blind Source Separation Based on Correntropy," Signal Processing, vol. 87, no. 8, pp. 1872-1881, 2007. DOI: 10.1016/j.sigpro.2007.01.022
S. Ekici, S. Yildirim, and M. Poyraz, "Energy and Entropy-Based Feature Extraction for Locating Fault on Transmission Lines by Using Neural Network and Wavelet Packet Decomposition," Expert Systems with Applications, vol. 34, pp. 2937-2944, 2008. DOI: 10.1016/j.eswa.2007.05.011
D. Erdogmus and J. C. Principe, "Information Transfer Through Classifiers and Its Relation to Probability of Error," Proc. IJCNN'01, vol. 1, pp. 50-54, 2001.
S. J. Lee, M. T. Jone, and H. L. Tsai, "Constructing Neural Networks for Multiclass-Discretization Based on Information Theory," IEEE Trans. Sys., Man, and Cyb.-Part B, vol. 29, pp. 445-453, 1999. DOI: 10.1109/3477.764881
R. Kamimura and S. Nakanishi, "Hidden Information Maximization for Feature Detection and Rule Discovery," Network: Computation in Neural Systems, vol. 6, pp. 577-602, 1995. DOI: 10.1088/0954-898X/6/4/004
K. Torkkola, "Nonlinear Feature Transforms Using Maximum Mutual Information," Proc. IJCNN'01, vol. 4, pp. 2756-2761, 2001.
A. Papoulis, Probability, Random Variables, and Stochastic Processes, 2nd ed., McGraw-Hill, New York, 1984.
T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley and Sons, Inc., 1991.
S. H. Oh, "Decreasing of Correlations Among Hidden Neurons of Multilayer Perceptrons," Journal of the Korea Contents Association, vol. 3, no. 3, pp. 98-102, 2003.