Hidden nodes play a key role in the information processing of feedforward neural networks, in which inputs are processed through a series of weighted sums and nonlinear activation functions. To understand the role of hidden nodes, we must analyze the effect of the nonlinear activation functions on the weighted sums to hidden nodes. In this paper, we focus on the effect of the nonlinear functions from the viewpoint of information theory. Under the assumption that the nonlinear activation function can be approximated piecewise linearly, we prove that the entropy of the weighted sums to hidden nodes decreases after the piecewise linear functions. Therefore, we argue that the nonlinear activation function decreases the uncertainty among hidden nodes. Furthermore, the more the hidden nodes are saturated, the more the entropy of hidden nodes decreases. Based on this result, we can say that, after successful training of feedforward neural networks, hidden nodes tend to lie not in the linear region but in the saturated regions of the activation function, with the effect of uncertainty reduction.
1. INTRODUCTION
When an input sample is presented to a feedforward neural network (FNN), it is processed through a series of weighted sums and nonlinear activation functions. It has been proved that FNNs with enough hidden nodes can approximately implement any function [1]-[4]. Herein, the weighted sums to hidden nodes are a sort of projection from the input space to a hidden feature space, followed by elementwise nonlinear activation functions. Several studies have sought to understand the role of hidden nodes. Oh and Lee proved that the nonlinear function of hidden nodes has the effect of decreasing correlations among hidden nodes [5]. They also argued that FNNs are a special type of nonlinear whitening filter [5]. In addition, Shah and Poon showed that hidden nodes with sigmoidal activation functions have the ability to produce linearly independent internal representations [6].
In the neural network field, information theory has provided many fruitful research results. Lee et al. reported that FNNs have a capability of information extraction for pattern classification by hierarchically keeping interclass information while reducing intraclass variations [7]. Learning rules stemming from information theory were proposed [8]-[14]. An upper bound for the probability of error was derived based on Renyi’s entropy [15]. Information theory can also provide a construction strategy for neural networks [16]. In addition, hidden information maximization and maximum mutual information methods were proposed for feature extraction [17], [18]. In this paper, we focus on the nonlinear activation functions of hidden nodes from the viewpoint of information theory. Under the assumption that the nonlinear activation function can be approximated piecewise linearly, we derive that the entropy of hidden nodes decreases after the piecewise linear transformation. Based on the derivation, we can interpret the role of hidden nodes from the viewpoint of entropy or uncertainty.
2. NONLINEAR EFFECT ON THE ENTROPY OF HIDDEN NODES
In FNNs, inputs are processed through a series of weighted sums and nonlinear activation functions. When inputs or weights are perturbed randomly, the weighted sums to hidden nodes are approximately jointly Gaussian according to the central limit theorem [19]-[21]. Therefore, we analyze the effect of the nonlinear function on jointly Gaussian random variables.
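The Gaussian approximation of the weighted sums can be checked numerically. The following is a minimal sketch, not the author's experiment: the two weight vectors, the uniform input perturbation, and the sizes `n_inputs` and `n_samples` are all hypothetical choices made for illustration. It draws weighted sums to two hidden nodes and reports the excess kurtosis, which is zero for an exactly Gaussian variable.

```python
import numpy as np

def weighted_sum_samples(n_inputs=100, n_samples=20000, seed=0):
    """Draw weighted sums (u, v) to two hidden nodes under random input perturbations."""
    rng = np.random.default_rng(seed)
    # Hypothetical fixed weights of two hidden nodes (illustrative values only).
    w = rng.normal(size=(2, n_inputs)) / np.sqrt(n_inputs)
    # Zero-mean, clearly non-Gaussian input perturbations: uniform on [-1, 1].
    x = rng.uniform(-1.0, 1.0, size=(n_samples, n_inputs))
    return x @ w.T  # shape (n_samples, 2); columns are u and v

def excess_kurtosis(s):
    """Sample excess kurtosis; close to 0 for Gaussian data."""
    s = s - s.mean()
    return float(np.mean(s**4) / np.mean(s**2) ** 2 - 3.0)

uv = weighted_sum_samples()
print("excess kurtosis of u:", excess_kurtosis(uv[:, 0]))
print("excess kurtosis of v:", excess_kurtosis(uv[:, 1]))
```

A single uniform input has an excess kurtosis of -1.2, so values near zero for the sums indicate that the weighted sums are much closer to Gaussian than the individual inputs.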
If $u$ and $v$ are jointly Gaussian random variables with zero means, then the joint entropy of $u$ and $v$ [22] is given by

$$h(u,v) = \log\left(2\pi e\,\sigma_u \sigma_v \sqrt{1-r^2}\right). \tag{1}$$

Herein, $\sigma_u$ and $\sigma_v$ are the standard deviations of $u$ and $v$, respectively, and $r$ is the correlation coefficient between $u$ and $v$.
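The joint entropy of two zero-mean jointly Gaussian variables has the standard closed form $\log(2\pi e\,\sigma_u\sigma_v\sqrt{1-r^2})$ [22]. A minimal sketch below evaluates it in nats and cross-checks it against the general covariance-determinant formula; the function names are illustrative and not part of the paper.

```python
import math
import numpy as np

def joint_gaussian_entropy(sigma_u, sigma_v, r):
    """Joint differential entropy h(u, v) in nats for zero-mean jointly Gaussian u, v."""
    return math.log(2.0 * math.pi * math.e * sigma_u * sigma_v * math.sqrt(1.0 - r**2))

def joint_entropy_from_cov(cov):
    """General Gaussian formula: 0.5 * log((2*pi*e)^n * det(K)) for covariance matrix K."""
    n = cov.shape[0]
    return 0.5 * math.log((2.0 * math.pi * math.e) ** n * np.linalg.det(cov))

sigma_u, sigma_v, r = 2.0, 3.0, 0.6
cov = np.array([[sigma_u**2, r * sigma_u * sigma_v],
                [r * sigma_u * sigma_v, sigma_v**2]])
print(joint_gaussian_entropy(sigma_u, sigma_v, r))
print(joint_entropy_from_cov(cov))  # agrees with the line above up to rounding
```

The entropy falls as $|r|$ approaches 1, since strongly correlated variables carry less joint uncertainty than independent ones with the same variances.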
Let us assume that $u$ and $v$ are transformed into $y$ and $z$ as follows:
where $a$, $b$, $c$, and $d$ are nonzero real values. Then, the entropy of $y$ is given by
where $f_y(y)$ is the probability density function (p.d.f.) of random variable $y$. By substituting [19]
Eq. (3) can be derived by
Here, $f_u(u)$ is the p.d.f. of random variable $u$ and
since $u$ is Gaussian with zero mean. Accordingly, the entropy of $z$ is given by
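The marginal entropy after a piecewise linear map can be sketched as follows. This is a hedged reconstruction, not the paper's own derivation: it assumes the transformation takes the form $y = au$ for $u \ge 0$ and $y = bu$ for $u < 0$, with $a, b > 0$ so that the map is one-to-one. It uses only the change-of-variables rule for densities and $\Pr(u \ge 0) = 1/2$ for a zero-mean Gaussian.

```latex
% Assumed form (hypothetical): y = a u for u >= 0, y = b u for u < 0, with a, b > 0.
\begin{align*}
f_y(y) &= \begin{cases}
  \dfrac{1}{a}\, f_u\!\left(\dfrac{y}{a}\right), & y \ge 0,\\[2ex]
  \dfrac{1}{b}\, f_u\!\left(\dfrac{y}{b}\right), & y < 0,
\end{cases}\\
h(y) &= -\int_{-\infty}^{\infty} f_y(y)\,\log f_y(y)\,dy \\
     &= h(u) + \Pr(u \ge 0)\,\log a + \Pr(u < 0)\,\log b \\
     &= \tfrac{1}{2}\log\!\left(2\pi e\,\sigma_u^2\right) + \tfrac{1}{2}\log(ab).
\end{align*}
```

Under the same assumption for $z$ with slopes $c, d > 0$, the analogous result is $h(z) = \tfrac{1}{2}\log(2\pi e\,\sigma_v^2) + \tfrac{1}{2}\log(cd)$, so each marginal entropy decreases whenever the product of its two slopes is below 1.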
The joint entropy after the piecewise-linear transformations is defined by
The joint p.d.f. of $y$ and $z$ can be separated into four quadrants as follows:
Therefore, Eq. (8) is also separated into four parts as
Since the joint p.d.f. of $u$ and $v$ is given by
the first quadrant term in Eq. (10) is derived by
Here,
and
The second term on the right side of Eq. (12) becomes
Since $y = au$ is symmetric and zero mean,
And according to the same procedure,
By Papoulis [19],
and
Also by Oh [23],
and
By substituting Eqs. (18) and (20) into Eq. (12),
Using the same procedure from Eq. (11) to (24), we can derive that
and
By substituting Eqs. (24), (25), (26), and (27) into Eq. (10), we attain
Finally we take the derivation
with the substitution of Eq. (1) into (28).
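The end result of the quadrant-by-quadrant derivation can be cross-checked with a shorter argument. The sketch below is an assumption-laden summary rather than the paper's route: it takes $y = au$ or $bu$ and $z = cv$ or $dv$ according to the signs of $u$ and $v$, with $a, b, c, d > 0$, and uses the change-of-variables identity $h(y,z) = h(u,v) + E[\log|J|]$ together with the known quadrant probabilities of a zero-mean bivariate Gaussian.

```latex
% Assumptions (not from the source): a, b, c, d > 0; quadrant k of the (u, v)
% plane is scaled by the constant Jacobian J_k.
\begin{align*}
h(y,z) &= h(u,v) + \sum_{k=1}^{4} P_k \log J_k,
  \qquad J_1 = ac,\; J_2 = bc,\; J_3 = bd,\; J_4 = ad,\\
P_1 = P_3 &= \frac{1}{4} + \frac{\sin^{-1} r}{2\pi},
  \qquad P_2 = P_4 = \frac{1}{4} - \frac{\sin^{-1} r}{2\pi},\\
h(y,z) &= h(u,v) + \frac{1}{2}\log(abcd)
        = \log\!\left(2\pi e\,\sigma_u \sigma_v \sqrt{abcd\,(1-r^2)}\right).
\end{align*}
```

The terms proportional to $\sin^{-1} r$ cancel because $\log(ac) - \log(bc) + \log(bd) - \log(ad) = 0$, so under these assumptions the entropy change depends only on the slopes and not on the correlation.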
In FNNs, we usually use the tanh(.) sigmoidal function (Fig. 1) as the nonlinear activation function of a hidden node. Assuming that the nonlinear activation function can be approximated piecewise linearly as in Eq. (2), the parameters $a$, $b$, $c$, and $d$ correspond to the slopes of the nonlinear activation function. As shown in Fig. 2, the slope is less than or equal to 0.5. Thus,
and $h(y,z) < h(u,v)$. Also, the less steep the slopes are, the more the joint entropy decreases after the nonlinear function.
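The dependence of the entropy change on the slopes can be illustrated numerically. The sketch below rests on assumptions that are not taken from the paper: the transformation is taken to be $y = au$ ($u \ge 0$) or $bu$ ($u < 0$) and $z = cv$ ($v \ge 0$) or $dv$ ($v < 0$) with positive slopes, for which the change-of-variables rule gives $h(y,z) - h(u,v) = E[\log(\text{slope}_u \cdot \text{slope}_v)] = \tfrac{1}{2}\log(abcd)$. The slope values and the correlation are hypothetical.

```python
import math
import numpy as np

def entropy_change(a, b, c, d):
    """Closed-form h(y,z) - h(u,v) in nats under the assumed piecewise linear map."""
    return 0.5 * math.log(a * b * c * d)

def expected_log_jacobian(a, b, c, d, r, n=200000, seed=0):
    """Monte Carlo estimate of E[log |det J|] for unit-variance u, v with correlation r."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, r], [r, 1.0]]
    u, v = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    slope_u = np.where(u >= 0, a, b)  # slope applied to u in each half-line
    slope_v = np.where(v >= 0, c, d)  # slope applied to v in each half-line
    return float(np.mean(np.log(slope_u * slope_v)))

# Hypothetical slopes, all at most 0.5.
a, b, c, d = 0.5, 0.2, 0.4, 0.1
print("closed form :", entropy_change(a, b, c, d))
print("Monte Carlo :", expected_log_jacobian(a, b, c, d, r=0.6))
print("gentler slopes:", entropy_change(0.1, 0.1, 0.1, 0.1))
```

With all four slopes at most 0.5 the change is negative, and shrinking any slope makes it more negative, which matches the statement that gentler slopes reduce the joint entropy further.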
Consequently, we can argue that the nonlinear activation function decreases the entropy, or uncertainty, of hidden nodes. The sigmoid activation function can be separated into a linear region with a steep slope and saturated regions with gentle slopes, as shown in Fig. 1. When hidden node values are in the saturated region of the sigmoid activation function, the entropy decreases much more. This coincides with the argument that hidden nodes tend to be saturated after successful training.
Fig. 1. The sigmoid activation function of hidden node.
Fig. 2. The slope of the sigmoid activation function.
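The effect of saturation on entropy can also be examined without the piecewise linear approximation. The sketch below is an illustration under stated assumptions rather than the paper's method: it takes a single weighted sum $u \sim N(0, \sigma^2)$ passed through $\tanh$, and uses the change-of-variables identity $h(\tanh u) - h(u) = E[\log(1 - \tanh^2 u)]$, evaluated by Gauss-Hermite quadrature. The values of `sigma` are hypothetical; a larger `sigma` drives the node further into saturation.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def entropy_change_tanh(sigma, deg=80):
    """E[log tanh'(u)] in nats for u ~ N(0, sigma^2), i.e. h(tanh(u)) - h(u)."""
    # Nodes and weights for the weight function exp(-x^2 / 2).
    x, w = hermegauss(deg)
    u = np.abs(sigma * x)
    # log(1 - tanh(u)^2) = 2 * log(sech(u)), written in an overflow-safe form.
    log_slope = 2.0 * (np.log(2.0) - u - np.log1p(np.exp(-2.0 * u)))
    return float(np.sum(w * log_slope) / np.sqrt(2.0 * np.pi))

for sigma in (0.5, 1.0, 2.0, 4.0):
    print(f"sigma = {sigma}: entropy change = {entropy_change_tanh(sigma):.3f} nats")
```

The change is negative for every `sigma` and grows in magnitude as `sigma` increases, in line with the claim that saturated hidden nodes lose more entropy than nodes operating in the linear region.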
Furthermore, the entropy of hidden nodes provides a hierarchical understanding of the information extraction capabilities acquired through learning. Lee et al. argued that input samples carry interclass information as well as intraclass variation [7]. The interclass information is the information content that an input sample belongs to a specific class, and the intraclass variation is a measure of the average variation within the classes, including noise contamination. After learning, input samples are projected to hidden nodes through weighted sums and elementwise nonlinear transformations. In this paper, we proved that the entropy of hidden nodes decreases after the nonlinear transformations. The decrease in the entropy of hidden nodes corresponds to the decrease in intraclass variations, as pointed out by Lee et al. [7].
3. CONCLUSIONS
In this paper, we prove that the entropy of jointly Gaussian random variables decreases after piecewise linear transformations whose slopes are gentle, as is the case for the sigmoid activation function. Also, the less steep the slopes are, the more the joint entropy decreases. Since the nonlinear activation function of hidden nodes can be approximated piecewise linearly, we can argue that the nonlinear activation function decreases the uncertainty among hidden nodes. Furthermore, the entropy of hidden nodes decreases much more after successful training of FNNs, when hidden nodes tend to be saturated.
BIO
Sang-Hoon Oh received his B.S. and M.S. degrees in Electronics Engineering from Pusan National University in 1986 and 1988, respectively. He received his Ph.D. degree in Electrical Engineering from Korea Advanced Institute of Science and Technology (KAIST) in 1999. From 1988 to 1989, he worked for LG Semiconductor, Ltd., Korea. From 1990 to 1998, he was a senior research staff member at the Electronics and Telecommunication Research Institute (ETRI), Korea. From 1999 to 2000, he was with the Brain Science Research Center, KAIST. In 2000, he was with the Brain Science Institute, RIKEN, Japan, as a research scientist. In 2001, he was an R&D manager at Extell Technology Corporation, Korea. Since 2002, he has been with the Department of Information Communication Engineering, Mokwon University, Daejon, Korea, and is now a full professor. He was also with the Division of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, USA, as a visiting scholar from August 2008 to August 2009. His research interests are machine learning, speech signal processing, pattern recognition, and bioinformatics.
REFERENCES
Liao Y., Fang S. C., Nuttle H. L. W. (2003). “Relaxed Conditions for Radial-Basis Function Networks to be Universal Approximators,” Neural Networks, 16, 1019-1028. DOI: 10.1016/S08936080(02)002277
Oh S. H., Lee Y. (1994). “Effect of Nonlinear Transformations on Correlation Between Weighted Sums in Multilayer Perceptrons,” IEEE Trans. Neural Networks, 5, 508-510. DOI: 10.1109/72.286927
Shah J. V., Poon C. S. (1999). “Linear Independence of Internal Representations in Multilayer Perceptrons,” IEEE Trans. Neural Networks, 10, 10-18. DOI: 10.1109/72.737489
Lee Y., Song H. K. (1993). “Analysis on the Efficiency of Pattern Recognition Layers Using Information Measures,” Proc. IJCNN’93 Nagoya, 3, 2129-2132.
El-Jaroudi A., Makhoul J. (1990). “A New Error Criterion for Posterior Probability Estimation with Neural Nets,” Proc. IJCNN’90, 3, 185-192.
Ridella S., Rovetta S., Zunino R. (1999). “Representation and Generalization Properties of Class-Entropy Networks,” IEEE Trans. Neural Networks, 10, 31-47. DOI: 10.1109/72.737491
Erdogmus D., Principe J. C. (2001). “Entropy Minimization Algorithm for Multilayer Perceptrons,” Proc. IJCNN’01, 4, 3003-3008.
Hild II K. E., Erdogmus D., Torkkola K., Principe J. C. (2006). “Feature Extraction Using Information-Theoretic Learning,” IEEE Trans. PAMI, 28(9), 1385-1392. DOI: 10.1109/TPAMI.2006.186
Li R., Liu W., Principe J. C. (2007). “A Unifying Criterion for Instantaneous Blind Source Separation Based on Correntropy,” Signal Processing, 87(8), 1872-1881. DOI: 10.1016/j.sigpro.2007.01.022
Ekici S., Yildirim S., Poyraz M. (2008). “Energy and Entropy-Based Feature Extraction for Locating Fault on Transmission Lines by Using Neural Network and Wavelet Packet Decomposition,” Expert Systems with Applications, 34, 2937-2944. DOI: 10.1016/j.eswa.2007.05.011
Erdogmus D., Principe J. C. (2001). “Information Transfer Through Classifiers and Its Relation to Probability of Error,” Proc. IJCNN’01, 1, 50-54.
Lee S. J., Jone M. T., Tsai H. L. (1999). “Constructing Neural Networks for Multiclass-Discretization Based on Information Theory,” IEEE Trans. Sys., Man, and Cyb.-Part B, 29, 445-453. DOI: 10.1109/3477.764881
Kamimura R., Nakanishi S. (1995). “Hidden Information Maximization for Feature Detection and Rule Discovery,” Network: Computation in Neural Systems, 6, 577-602. DOI: 10.1088/0954898X/6/4/004
Torkkola K. (2001). “Nonlinear Feature Transforms Using Maximum Mutual Information,” Proc. IJCNN’01, 4, 2756-2761.
Papoulis A. (1984). Probability, Random Variables, and Stochastic Processes, second ed., McGraw-Hill, New York.
Cover T. M., Thomas J. A. (1991). Elements of Information Theory, John Wiley and Sons, Inc.
Oh S. H. (2003). “Decreasing of Correlations Among Hidden Neurons of Multilayer Perceptrons,” Journal of the Korea Contents Association, 3(3), 98-102.