Effect of Nonlinear Transformations on Entropy of Hidden Nodes
International Journal of Contents. 2014. Mar, 10(1): 18-22
Copyright © 2014, The Korea Contents Association
  • Received : October 01, 2013
  • Accepted : January 12, 2014
  • Published : March 28, 2014
About the Authors
Sang-Hoon Oh
shoh@mokwon.ac.kr

Abstract
Hidden nodes play a key role in the information processing of feed-forward neural networks, in which inputs are processed through a series of weighted sums and nonlinear activation functions. In order to understand the role of hidden nodes, we must analyze the effect of the nonlinear activation functions on the weighted sums to hidden nodes. In this paper, we focus on the effect of nonlinear functions from the viewpoint of information theory. Under the assumption that the nonlinear activation function can be approximated piece-wise linearly, we prove that the entropy of the weighted sums to hidden nodes decreases after the piece-wise linear functions. Therefore, we argue that the nonlinear activation function decreases the uncertainty among hidden nodes. Furthermore, the more the hidden nodes are saturated, the more their entropy decreases. Based on this result, we can say that, after successful training of feed-forward neural networks, hidden nodes tend to lie not in the linear regions but in the saturation regions of the activation function, with the effect of reducing uncertainty.
1. INTRODUCTION
When an input sample is presented to a feed-forward neural network (FNN), it is processed through a series of weighted sums and nonlinear activation functions. It has been proved that FNNs with enough hidden nodes can approximate any function [1] - [4] . Herein, the weighted sums to hidden nodes are projections from the input space to a hidden feature space, followed by element-wise nonlinear activation functions. There have been several research results on the role of hidden nodes. Oh and Lee proved that the nonlinear function of hidden nodes has the effect of decreasing correlations among hidden nodes [5] . They also argued that FNNs are a special type of nonlinear whitening filter [5] . Shah and Poon showed that hidden nodes with sigmoidal activation functions can produce linearly independent internal representations [6] .
In the neural network field, information theory has provided many fruitful research results. Lee et al. reported that FNNs extract information for pattern classification by hierarchically keeping inter-class information while reducing intra-class variations [7] . Learning rules stemming from information theory have been proposed [8] - [14] . The upper bound on the probability of error was derived based on Renyi’s entropy [15] . Information theory can also provide a construction strategy for neural networks [16] . In addition, hidden information maximization and maximum mutual information methods were proposed for feature extraction [17] , [18] . In this paper, we focus on the nonlinear activation functions of hidden nodes from the viewpoint of information theory. Under the assumption that the nonlinear activation function can be approximated piece-wise linearly, we derive that the entropy of hidden nodes decreases after the piece-wise linear transformation. Based on this derivation, we can interpret the role of hidden nodes in terms of entropy, or uncertainty.
2. NONLINEAR EFFECT ON THE ENTROPY OF HIDDEN NODES
In FNNs, inputs are processed through a series of weighted sums and nonlinear activation functions. When inputs or weights are perturbed randomly, the weighted sums to hidden nodes are approximately jointly Gaussian by the central limit theorem [19] - [21] . Therefore, we analyze the effect of the nonlinear function on jointly Gaussian random variables.
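This central-limit argument can be illustrated numerically: a weighted sum of many independent non-Gaussian inputs is already close to Gaussian. A minimal sketch (assumes NumPy is available; the dimensions and distributions below are illustrative choices, not from the paper):

```python
import numpy as np

# Weighted sums of many independent, non-Gaussian (uniform) inputs are
# approximately Gaussian by the central limit theorem.  We check that the
# standardized sums have near-zero skewness and excess kurtosis.
rng = np.random.default_rng(1)
n_inputs, n_samples = 100, 50_000
x = rng.uniform(-1.0, 1.0, size=(n_samples, n_inputs))  # non-Gaussian inputs
w = rng.uniform(-1.0, 1.0, size=n_inputs)               # fixed random weights
s = x @ w                                               # weighted sums

z = (s - s.mean()) / s.std()
skew = np.mean(z**3)             # ~ 0 for a Gaussian
kurt_excess = np.mean(z**4) - 3  # ~ 0 for a Gaussian

assert abs(skew) < 0.1 and abs(kurt_excess) < 0.15
```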
If u and v are jointly Gaussian random variables with zero means, then the joint entropy of u and v [22] is given by
h(u,v) = ln( 2πe σu σv √(1 − r²) )    (1)
Herein, σu and σv are the standard deviations of u and v , respectively, and r is the correlation coefficient between u and v .
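Eq. (1) is the bivariate case of the general Gaussian entropy (1/2)·ln((2πe)² det Σ), and the two forms can be compared numerically as a quick sanity check. A minimal sketch (assumes NumPy; the parameter values are illustrative):

```python
import numpy as np

# Check that ln(2*pi*e * sigma_u * sigma_v * sqrt(1 - r^2))  [Eq. (1)]
# equals (1/2) * ln((2*pi*e)^2 * det(Cov)) for a zero-mean Gaussian pair.
sigma_u, sigma_v, r = 1.5, 0.8, 0.6  # illustrative values

h_eq1 = np.log(2 * np.pi * np.e * sigma_u * sigma_v * np.sqrt(1 - r**2))

cov = np.array([[sigma_u**2,            r * sigma_u * sigma_v],
                [r * sigma_u * sigma_v, sigma_v**2]])
h_general = 0.5 * np.log((2 * np.pi * np.e) ** 2 * np.linalg.det(cov))

assert abs(h_eq1 - h_general) < 1e-9
```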
Let’s assume that u and v are transformed into y and z as follows:
y = a·u (u ≥ 0),  y = b·u (u < 0);
z = c·v (v ≥ 0),  z = d·v (v < 0)    (2)
where a, b, c , and d are positive real values. Then, the entropy of y is given by
h(y) = −∫₋∞∞ fy(y) ln fy(y) dy    (3)
where fy ( y ) is the probability density function (p.d.f.) of random variable y . By substituting [19]
fy(y) = fu(y/a)/a  (y ≥ 0),   fy(y) = fu(y/b)/b  (y < 0)    (4)
Eq. (3) can be evaluated as
h(y) = −∫₀∞ fu(u) ln[ fu(u)/a ] du − ∫₋∞⁰ fu(u) ln[ fu(u)/b ] du = h(u) + (1/2) ln(ab)    (5)
Here, f u ( u ) is the p.d.f. of random variable u and
∫₀∞ fu(u) du = ∫₋∞⁰ fu(u) du = 1/2    (6)
since u is Gaussian with zero mean. Accordingly, the entropy of z is given by
h(z) = h(v) + (1/2) ln(cd)    (7)
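The marginal relation h(y) = h(u) + (1/2)·ln(ab) can be verified by numerically integrating the entropy of y under its change-of-variables density. A minimal sketch (assumes NumPy; the slope values are illustrative):

```python
import numpy as np

# Numerical check: for y = a*u (u >= 0), y = b*u (u < 0) with
# u ~ N(0, sigma_u^2), the entropy is h(y) = h(u) + (1/2)*ln(a*b).
sigma_u, a, b = 1.0, 0.5, 0.2  # illustrative values

def f_u(u):
    return np.exp(-u**2 / (2 * sigma_u**2)) / (sigma_u * np.sqrt(2 * np.pi))

def f_y(y):
    # Change-of-variables density of y: one slope per half-line.
    return np.where(y >= 0, f_u(y / a) / a, f_u(y / b) / b)

y = np.linspace(-10.0, 10.0, 2_000_001)
fy = f_y(y)
log_fy = np.log(np.where(fy > 0, fy, 1.0))     # avoid log(0) in dead tails
h_y_numeric = -np.sum(fy * log_fy) * (y[1] - y[0])

h_u = 0.5 * np.log(2 * np.pi * np.e * sigma_u**2)  # Gaussian entropy of u
h_y_formula = h_u + 0.5 * np.log(a * b)

assert abs(h_y_numeric - h_y_formula) < 1e-3
```

The integration grid is wide enough that both scaled tails are negligible; only the jump of the density at y = 0 limits the accuracy.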
The joint entropy after the piece-wise-linear transformations is defined by
h(y,z) = −∫₋∞∞ ∫₋∞∞ fyz(y,z) ln fyz(y,z) dy dz    (8)
The joint p.d.f. of y and z can be separated into four quadrants as follows:
fyz(y,z) = fuv(y/a, z/c)/(ac)  (y ≥ 0, z ≥ 0)
fyz(y,z) = fuv(y/a, z/d)/(ad)  (y ≥ 0, z < 0)
fyz(y,z) = fuv(y/b, z/c)/(bc)  (y < 0, z ≥ 0)
fyz(y,z) = fuv(y/b, z/d)/(bd)  (y < 0, z < 0)    (9)
Therefore, Eq. (8) is also separated into four parts as
h(y,z) = −∬_{u≥0,v≥0} fuv(u,v) ln[ fuv(u,v)/(ac) ] du dv − ∬_{u≥0,v<0} fuv(u,v) ln[ fuv(u,v)/(ad) ] du dv
         − ∬_{u<0,v≥0} fuv(u,v) ln[ fuv(u,v)/(bc) ] du dv − ∬_{u<0,v<0} fuv(u,v) ln[ fuv(u,v)/(bd) ] du dv    (10)
Since the joint p.d.f. of u and v is given by
fuv(u,v) = [ 1/( 2π σu σv √(1 − r²) ) ] exp{ −[ u²/σu² − 2r·uv/(σu σv) + v²/σv² ] / ( 2(1 − r²) ) }    (11)
the first quadrant term in Eq. (10) is derived by
−∬_{u≥0,v≥0} fuv(u,v) ln fuv(u,v) du dv + ln(ac) ∬_{u≥0,v≥0} fuv(u,v) du dv    (12)
Here,
−ln fuv(u,v) = ln( 2π σu σv √(1 − r²) ) + [ u²/σu² − 2r·uv/(σu σv) + v²/σv² ] / ( 2(1 − r²) )    (13)
and
P1 = ∬_{u≥0,v≥0} fuv(u,v) du dv    (14)
The second term in the right side of Eq. (12) becomes
ln(ac) ∬_{u≥0,v≥0} fuv(u,v) du dv = P1 ln(ac)    (15)
Since u is symmetric with zero mean,
P1 + P2 = P(u ≥ 0) = 1/2,  where P2 = ∬_{u≥0,v<0} fuv(u,v) du dv    (16)
And according to the same procedure,
P3 + P4 = 1/2,  P1 = P4,  P2 = P3    (17)
where P3 and P4 are the probabilities of the quadrants (u < 0, v ≥ 0) and (u < 0, v < 0), respectively.
By Papoulis [19] ,
P1 = ∬_{u≥0,v≥0} fuv(u,v) du dv = 1/4 + arcsin(r)/(2π)    (18)
and
P2 = ∬_{u≥0,v<0} fuv(u,v) du dv = 1/4 − arcsin(r)/(2π)    (19)
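The quadrant probability P(u ≥ 0, v ≥ 0) = 1/4 + arcsin(r)/(2π) can be spot-checked by Monte Carlo simulation. A minimal sketch (assumes NumPy; the correlation value is illustrative):

```python
import numpy as np

# Monte Carlo check of the first-quadrant probability of a zero-mean
# jointly Gaussian pair with correlation r (Papoulis):
#   P(u >= 0, v >= 0) = 1/4 + arcsin(r) / (2*pi).
rng = np.random.default_rng(0)
r = 0.6  # illustrative value
cov = [[1.0, r], [r, 1.0]]
samples = rng.multivariate_normal([0.0, 0.0], cov, size=2_000_000)

p_mc = np.mean((samples[:, 0] >= 0) & (samples[:, 1] >= 0))
p_formula = 0.25 + np.arcsin(r) / (2 * np.pi)

assert abs(p_mc - p_formula) < 2e-3
```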
Also by Oh [23] ,
∬_{u≥0,v≥0} [ u²/σu² − 2r·uv/(σu σv) + v²/σv² ] fuv(u,v) du dv = 2(1 − r²) P1    (20)
∬_{u≥0,v<0} [ u²/σu² − 2r·uv/(σu σv) + v²/σv² ] fuv(u,v) du dv = 2(1 − r²) P2    (21)
∬_{u<0,v≥0} [ u²/σu² − 2r·uv/(σu σv) + v²/σv² ] fuv(u,v) du dv = 2(1 − r²) P3    (22)
and
∬_{u<0,v<0} [ u²/σu² − 2r·uv/(σu σv) + v²/σv² ] fuv(u,v) du dv = 2(1 − r²) P4    (23)
By substituting Eqs. (18) and (20) into Eq. (12),
−∬_{u≥0,v≥0} fuv(u,v) ln[ fuv(u,v)/(ac) ] du dv = ( 1/4 + arcsin(r)/(2π) ) [ h(u,v) + ln(ac) ]    (24)
Using the same procedure from Eq. (11) to (24), we can derive that
−∬_{u≥0,v<0} fuv(u,v) ln[ fuv(u,v)/(ad) ] du dv = ( 1/4 − arcsin(r)/(2π) ) [ h(u,v) + ln(ad) ]    (25)
−∬_{u<0,v≥0} fuv(u,v) ln[ fuv(u,v)/(bc) ] du dv = ( 1/4 − arcsin(r)/(2π) ) [ h(u,v) + ln(bc) ]    (26)
and
−∬_{u<0,v<0} fuv(u,v) ln[ fuv(u,v)/(bd) ] du dv = ( 1/4 + arcsin(r)/(2π) ) [ h(u,v) + ln(bd) ]    (27)
By substituting Eqs. (24), (25), (26), and (27) into Eq. (10), we obtain
h(y,z) = h(u,v) + (1/2) ln(abcd)    (28)
Finally, by substituting Eq. (1) into Eq. (28), we obtain
h(y,z) = ln( 2πe σu σv √( abcd (1 − r²) ) )    (29)
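The main result, h(y,z) = h(u,v) + (1/2)·ln(abcd), can be verified by direct numerical integration of the joint entropy after the piece-wise linear map. A minimal sketch (assumes NumPy; unit variances and the slope values are illustrative):

```python
import numpy as np

# Numerical check: h(y,z) = h(u,v) + (1/2)*ln(a*b*c*d) for zero-mean,
# unit-variance jointly Gaussian (u, v) with correlation r, where y, z
# follow the piece-wise linear map (slope a or b for u, c or d for v).
r, a, b, c, d = 0.5, 0.5, 0.4, 0.45, 0.35  # illustrative values

def f_uv(u, v):
    norm = 2 * np.pi * np.sqrt(1 - r**2)
    q = (u**2 - 2 * r * u * v + v**2) / (2 * (1 - r**2))
    return np.exp(-q) / norm

def f_yz(y, z):
    # Change of variables: one slope pair per quadrant.
    su = np.where(y >= 0, a, b)  # slope applied to u in this quadrant
    sv = np.where(z >= 0, c, d)  # slope applied to v in this quadrant
    return f_uv(y / su, z / sv) / (su * sv)

step = 0.01
grid = np.arange(-6.0, 6.0, step)
Y, Z = np.meshgrid(grid, grid)
F = f_yz(Y, Z)
logF = np.log(np.where(F > 0, F, 1.0))   # avoid log(0) in dead tails
h_yz_numeric = -np.sum(F * logF) * step**2

h_uv = np.log(2 * np.pi * np.e * np.sqrt(1 - r**2))  # joint entropy of (u, v)
h_yz_formula = h_uv + 0.5 * np.log(a * b * c * d)

assert abs(h_yz_numeric - h_yz_formula) < 1e-2
```

With slopes below one, (1/2)·ln(abcd) is negative, so the numeric joint entropy comes out below h(u,v), as the derivation predicts.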
In FNNs, we usually use the tanh(·) sigmoidal function ( Fig. 1 ) as the nonlinear activation function of hidden nodes. Assuming that the nonlinear activation function is approximated piece-wise linearly as in Eq. (2), the parameters a, b, c, and d correspond to the slopes of the activation function. As shown in Fig. 2 , the slope is less than or equal to 0.5. Thus,
(1/2) ln(abcd) ≤ (1/2) ln( (0.5)⁴ ) = −2 ln 2 < 0    (30)
and h ( y,z ) < h ( u,v ). Also, the less steep the slopes are, the more the joint entropy decreases after the nonlinear function.
Consequently, we can argue that the nonlinear activation function decreases the entropy, or uncertainty, of hidden nodes. The sigmoid activation function can be separated into a linear region with a steep slope and saturation regions with gentle slopes, as shown in Fig. 1 . When hidden node values fall in the saturation regions of the sigmoid activation function, the entropy decreases much more. This coincides with the argument that hidden nodes tend to be saturated after successful training.
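A slope bounded by 0.5 is consistent with the activation g(x) = tanh(x/2) (equivalently 2/(1 + e⁻ˣ) − 1), whose derivative (1 − tanh²(x/2))/2 peaks at 0.5 at the origin and vanishes in the saturation regions. Assuming that form (the paper's figures are not reproduced here), a short check:

```python
import numpy as np

# Slope of the assumed activation g(x) = tanh(x/2):
#   g'(x) = (1 - tanh(x/2)**2) / 2,
# maximal (0.5) in the linear region around 0, near zero when saturated.
x = np.linspace(-10.0, 10.0, 10001)
slope = (1.0 - np.tanh(x / 2) ** 2) / 2.0

assert abs(slope.max() - 0.5) < 1e-9          # steepest slope 0.5, at x = 0
assert slope[0] < 0.01 and slope[-1] < 0.01   # saturation: slope -> 0
```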
Fig. 1. The sigmoid activation function of a hidden node
Fig. 2. The slope of the sigmoid activation function
Furthermore, the entropy of hidden nodes provides a hierarchical understanding of the information extraction capabilities acquired through learning. Lee et al. argued that input samples carry inter-class information as well as intra-class variation [7] . The inter-class information is the information content indicating that an input sample belongs to a specific class, and the intra-class variation is a measure of the average variation within each class, including noise contamination. After learning, input samples are projected to hidden nodes through weighted sums and element-wise nonlinear transformations. In this paper, we proved that the entropy of hidden nodes decreases after the nonlinear transformations. The decrease in the hidden nodes’ entropy corresponds to the decrease in intra-class variations pointed out by Lee et al. [7] .
3. CONCLUSIONS
In this paper, we proved that the entropy of jointly Gaussian random variables decreases after piece-wise linear transformations. Also, the less steep the slopes are, the more the joint entropy decreases. Since the nonlinear activation function of hidden nodes can be approximated piece-wise linearly, we can argue that the nonlinear activation function decreases the uncertainty among hidden nodes. Furthermore, the entropy of hidden nodes decreases much more after successful training of FNNs.
BIO
Sang-Hoon Oh
received his B.S. and M.S. degrees in Electronics Engineering from Pusan National University in 1986 and 1988, respectively. He received his Ph.D. degree in Electrical Engineering from the Korea Advanced Institute of Science and Technology (KAIST) in 1999. From 1988 to 1989, he worked for LG Semiconductor, Ltd., Korea. From 1990 to 1998, he was a senior research staff member at the Electronics and Telecommunications Research Institute (ETRI), Korea. From 1999 to 2000, he was with the Brain Science Research Center, KAIST. In 2000, he was with the Brain Science Institute, RIKEN, Japan, as a research scientist. In 2001, he was an R&D manager at Extell Technology Corporation, Korea. Since 2002, he has been with the Department of Information Communication Engineering, Mokwon University, Daejeon, Korea, where he is now a full professor. He was also with the Division of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, USA, as a visiting scholar from August 2008 to August 2009. His research interests include machine learning, speech signal processing, pattern recognition, and bioinformatics.
References
Hornik K. , Stinchcombe M. , White H. 1989 “Multilayer feedforward networks are universal approximators,” Neural Networks 2 359 - 366    DOI : 10.1016/0893-6080(89)90020-8
Hornik K. 1991 “Approximation Capabilities of Multilayer Feedforward Networks,” Neural Networks 4 251 - 257    DOI : 10.1016/0893-6080(91)90009-T
Suzuki S. 1998 “Constructive Function Approximation by Three-layer Artificial Neural Networks,” Neural Networks 11 1049 - 1058    DOI : 10.1016/S0893-6080(98)00068-9
Liao Y. , Fang S. C. , Nuttle H. L. W. 2003 “Relaxed Conditions for Radial-Basis Function Networks to be Universal Approximators,” Neural Networks 16 1019 - 1028    DOI : 10.1016/S0893-6080(02)00227-7
Oh S. H. , Lee Y. 1994 “Effect of Nonlinear Transformations on Correlation Between Weighted Sums in Multilayer Perceptrons,” IEEE Trans., Neural Networks 5 508 - 510    DOI : 10.1109/72.286927
Shah J. V. , Poon C. S. 1999 “Linear Independence of Internal Representations in Multilayer Perceptrons,” IEEE Trans., Neural Networks 10 10 - 18    DOI : 10.1109/72.737489
Lee Y. , Song H. K. 1993 “Analysis on the Efficiency of Pattern Recognition Layers Using Information Measures,” Proc. IJCNN’93 Nagoya 3 2129 - 2132
El-Jaroudi A. , Makhoul J. 1990 “A New Error Criterion for Posterior probability Estimation with Neural Nets,” Proc. IJCNN’90 3 185 - 192
Bichsel M. , Seitz P. 1989 “Minimum Class Entropy: A Maximum Information Approach to Layered Networks,” Neural Networks 2 133 - 141    DOI : 10.1016/0893-6080(89)90030-0
Ridella S. , Rovetta S. , Zunino R. 1999 “Representation and Generalization Properties of Class-Entropy Networks,” IEEE Trans. Neural Networks 10 31 - 47    DOI : 10.1109/72.737491
Erdogmus D. , Principe J. C. 2001 “Entropy Minimization Algorithm for Multilayer Perceptrons,” Proc. IJCNN’01 4 3003 - 3008
Hild II K. E. , Erdogmus D. , Torkkola K. , Principe J. C. 2006 “Feature Extraction Using Information-Theoretic Learning,” IEEE Trans. PAMI 28 (9) 1385 - 1392    DOI : 10.1109/TPAMI.2006.186
Li R. , Liu W. , Principe J. C. 2007 “A Unifying Criterion for Instantaneous Blind Source Separation Based on Correntropy,” Signal Processing 87 (8) 1872 - 1881    DOI : 10.1016/j.sigpro.2007.01.022
Ekici S. , Yildirim S. , Poyraz M. 2008 “Energy and Entropy-Based Feature Extraction for Locating Fault on Transmission Lines by Using Neural Network and Wavelet packet Decomposition,” Expert Systems with Applications 34 2937 - 2944    DOI : 10.1016/j.eswa.2007.05.011
Erdogmus D. , Principe J. C. 2001 “Information Transfer Through Classifiers and Its Relation to Probability of Error,” Proc. IJCNN’01 1 50 - 54
Lee S. J. , Jone M. T. , Tsai H. L. 1999 “Constructing Neural Networks for Multi class-Discretization Based on Information Theory,” IEEE Trans. Sys., Man, and Cyb.-Part B 29 445 - 453    DOI : 10.1109/3477.764881
Kamimura R. , Nakanishi S. 1995 “Hidden Information Maximization for Feature Detection and Rule Discovery,” Network: Computation in Neural Systems 6 577 - 602    DOI : 10.1088/0954-898X/6/4/004
Torkkola K. 2001 “Nonlinear Feature Transforms Using Maximum Mutual Information,” Proc. IJCNN’01 4 2756 - 2761
Papoulis A. 1984 Probability, Random Variables, and Stochastic Processes second ed. McGraw-Hill New York
Lee Y. , Oh S. H. , Kim M. W. 1993 “An Analysis of Premature Saturation in Back-Propagation Learning,” Neural Networks 6 719 - 728    DOI : 10.1016/S0893-6080(05)80116-9
Oh Y. , Oh S. H. 1994 “Input Noise Immunity of Multilayer Perceptrons,” ETRI Journal 16 35 - 43    DOI : 10.4218/etrij.94.0194.0013
Cover T. M. , Thomas J. A. 1991 Elements of Information Theory John Wiley and Sons, INC.
Oh S. H. 2003 “Decreasing of Correlations Among Hidden Neurons of Multilayer Perceptrons,” Journal of the Korea Contents Association 3 (3) 98 - 102