A new incremental learning algorithm using the extended data expression, based on probabilistic compounding, is presented in this paper. An incremental learning algorithm generates an ensemble of weak classifiers and compounds these classifiers into a strong classifier using weighted majority voting in order to improve classification performance. We introduce a new probabilistic weighted majority voting founded on the extended data expression, in which the class distribution of the output is used to compound classifiers. UChoo, a decision tree classifier for the extended data expression, is used as the base classifier, as it yields an extended output expression that defines the class distribution of the output. The extended data expression and the UChoo classifier are powerful techniques for classification and rule refinement problems. In this paper, the extended data expression is applied to obtain probabilistic results with probabilistic majority voting. To show its performance advantages, the new algorithm is compared with Learn++, an incremental ensemble-based algorithm.
I. INTRODUCTION
In many practical applications where data become available in small batches from time to time, the problem known as "catastrophic forgetting" requires data to be learned incrementally without losing classification performance on old data. Traditional classification algorithms do not allow incremental learning: when new data become available, these algorithms are reinitialized and retrained using a combination of old and new data. The problem becomes unsolvable if the old data are no longer available. Under such conditions, it is necessary to update an existing classifier incrementally to accommodate new data without forgetting previously acquired knowledge.
There are several approaches to incremental learning. One of them uses recently developed ensemble-based algorithms that make creative use of bootstrap-based ideas. First, this approach generates an ensemble of classifiers for each dataset that becomes available; the result is an ensemble of ensembles of classifiers. The classifiers are then compounded into a final classifier.
One of the main characteristics of an ensemble is diversity among its classifiers. This is obvious, because combining identical classifiers does not lead to performance improvement. Diversity among classifiers can be achieved by adapting the classifiers to the training data distribution to create different training data subsets. This procedure guarantees that the classifiers focus on the more informative instances.
There are several algorithms for compounding classifiers. In simple majority voting, a commonly used combination rule, each classifier votes for the class it predicts, and the class receiving the largest number of votes is the ensemble decision. In weighted majority voting, each classifier has a voting weight inversely proportional to its classification error, and the winner is the class with the largest total vote. Also, an algebraic combination of the class values can be used, so that the winner is the class receiving the highest combined vote [1]. All these algorithms consider just one class as the result of a classifier, while the classifier can give a class distribution as a result. For example, the leaf of a decision tree, in the case of a weak classifier, can contain events from all classes. Losses of information are unavoidable when only the winner class is chosen. An alternative approach is to consider the whole class distribution.
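To make the information loss concrete, here is a minimal sketch (the numbers are hypothetical, not taken from this paper's experiments) contrasting simple majority voting with combining full class distributions; the two rules can disagree on the same ensemble:

```python
# Three weak classifiers report class distributions over classes {0, 1}.
# Hypothetical numbers chosen only to illustrate the effect.
dists = [
    [0.45, 0.55],
    [0.45, 0.55],
    [0.90, 0.10],
]

# Hard (simple) majority voting: each classifier first commits to one class.
votes = [max(range(2), key=lambda c: d[c]) for d in dists]
hard_winner = max(range(2), key=votes.count)            # votes are 1, 1, 0 -> class 1

# Probabilistic combining: sum the distributions, then pick the winner.
combined = [sum(d[c] for d in dists) for c in range(2)]
soft_winner = max(range(2), key=lambda c: combined[c])  # [1.8, 1.2] -> class 0

print(hard_winner, soft_winner)
```

Committing each classifier to a single class discards how confident it was, which is exactly the information the third classifier carried here.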
A new probabilistic majority voting scheme for incremental learning, based on the probabilistic output expression, is presented in this paper.
The effectiveness of the extended data expression was shown previously [2, 3]. In this paper, a new application of the extended data expression, combining the outputs of weak classifiers, is shown to be effective.
The rest of the paper is organized as follows. First, the extended data expression is described. Second, UChoo, a decision tree classifier using the same entropy measures as C4.5 but with the extended data expression, is described. Next, a new incremental learning algorithm founded on probabilistic majority voting is proposed. Finally, experimental results are provided to show the performance of the new algorithm in comparison with the Learn++ incremental learning algorithm.
II. RELATED WORK
Learn++ is an incremental learning algorithm that was introduced by Polikar et al. [4-6]. It is based on AdaBoost, which uses multiple classifiers to allow the system to learn incrementally. The algorithm works on the concept of using many weak learners to give a good overall classification. The weak learners are trained on separate subsets of the training data, and the classifiers are then compounded using weighted majority voting. The weights for the weighted majority voting are calculated from the performance of the classifiers on the entire training data set.
III. EXTENDED DATA EXPRESSION
First, consider a simple example of data in the common expression. Table 1 consists of training data, where each datum is described by three attributes (two discrete and one continuous) and a class. The "Outlook" attribute has three outcomes: 'sunny', 'overcast', and 'rain'. "Windy?" has two outcomes: 'True' and 'False'. Both "Outlook" and "Windy?" are discrete attributes, while "Temperature" is a continuous attribute whose values range from 60 to 80. Finally, the class has two outcomes: 'Play' and 'Don't Play'.
Table 2 shows the modified expression [2] of the training data set of Table 1. There are two main features of the extended expression. First, each attribute has a distribution over its possible values, i.e., each attribute entry is filled with a probabilistic value ranging from 0 to 1. Second, each event has a weight value, which shows how much importance the event has. In this example, all the data are assumed to have the equal weight of one.
Data can also be described by an expert in the field. For example, an expert may say that when it is 'sunny' and the temperature is around 70, the probability to play is 2/3, regardless of whether it is windy or not.
In addition, the expert can say that the data or 'rule' he describes has the importance of a weight of 30 compared to an individual datum in the data set. UChoo can then express the case as event #1, shown in the first row of Table 3.
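As an illustration of how such an event might be represented in code, the following sketch encodes the expert rule above; the field names and dictionary layout are my own, not part of UChoo:

```python
# One event in extended data expression: every discrete attribute entry is a
# distribution over that attribute's outcomes, and the event carries a weight.
event = {
    "Outlook":     {"sunny": 1.0, "overcast": 0.0, "rain": 0.0},
    "Windy?":      {"True": 0.5, "False": 0.5},    # "regardless of windy"
    "Temperature": 70.0,                           # continuous attribute
    "class":       {"Play": 2 / 3, "Don't Play": 1 / 3},
    "weight":      30.0,                           # expert rule worth 30 instances
}

# Each discrete distribution must sum to one (the paper's membership condition).
for key in ("Outlook", "Windy?", "class"):
    assert abs(sum(event[key].values()) - 1.0) < 1e-9
print("valid extended event, weight", event["weight"])
```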
Since each event has a weight, an instance is redefined in Table 3 as an event with a weight value of 1. In other words, each attribute can have a distribution over its possible values in an instance, and each row of Table 3 has a weight value, which shows how much importance the event has. For example, if each observed instance is assumed to have an importance weight of 1, the expert is now able to describe the importance of his idea or rule with a weight value, so that it can be compared with an observed instance. The event in the first row of Table 3 has the weight value of 30; in other words, this event has an importance equivalent to that of 30 instances.
An event is a collection of instances with equal attribute value distributions and class value distributions. An instance can be regarded as an event having the weight of 1. Therefore, the number of events may not be equal to the number of instances. For example, in Table 3, the number of instances is 35 while the number of events is 6. The number of instances in a data set T is denoted as |T| in this paper, while the number of events is denoted as p.
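Under this convention, p and |T| can be read directly off the event weights. A minimal sketch matching the example of 6 events and 35 instances (the individual weights are assumed: one expert rule of weight 30 plus five observed instances of weight 1):

```python
# Six events: the expert rule with weight 30 plus five observed instances.
weights = [30, 1, 1, 1, 1, 1]

p = len(weights)   # number of events, p
T = sum(weights)   # number of instances, |T|

print(p, T)  # 6 events, 35 instances
```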
Table 1. Training data set
Table 2. Extended data expression
Table 3. Extended data expression with the expert knowledge
The UChoo algorithm utilizes this extended data expression to build classifiers from incomplete, missing, or varying-dimensional data, as in [2, 3].
IV. UCHOO ALGORITHM
The UChoo algorithm uses an entropy measure similar to that of C4.5 [7], but it can generate classifiers not only from the observed data set, as in C4.5, but also from the combination of the observed data with rules or data from an expert described in the form of the extended data expression.
The entropy measure equations of C4.5 are as follows:

$Gain\_ratio(A) = Gain(A) / Split\_info(A)$

$Gain(A) = Info(T) - Info_A(T)$

$Info(T) = -\sum_{i=1}^{k} \frac{freq(C_i, T)}{|T|} \log_2 \frac{freq(C_i, T)}{|T|}$

$Info_A(T) = \sum_{j=1}^{n} \frac{|T_{Aj}|}{|T|} Info(T_{Aj})$

$Split\_info(A) = -\sum_{j=1}^{n} \frac{|T_{Aj}|}{|T|} \log_2 \frac{|T_{Aj}|}{|T|}$

where $Gain\_ratio$ is used as the measure of how well an attribute $A$ can differentiate the classes, $Gain(A)$ is the mutual information between the class and the attribute $A$, and $Split\_info(A)$ is the normalization term ensuring that the number of outcomes of attribute $A$ does not affect the information measure. Here $A$ is an attribute, $k$ is the number of values of the class, and $n$ is the number of outcomes of the attribute.
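As a sketch of the C4.5 measure itself (the helper names are mine), the gain ratio can be computed from per-outcome class counts; the counts below are illustrative, for 14 events with class totals (9, 5) split by a three-valued attribute:

```python
import math

def entropy(counts):
    """Entropy of a class-count vector, in bits; zero counts contribute nothing."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(class_counts, splits):
    """C4.5 gain ratio: (Info(T) - Info_A(T)) / Split_info(A).

    class_counts: class counts over the whole set T.
    splits: one class-count vector per outcome of attribute A.
    """
    total = sum(class_counts)
    info_t = entropy(class_counts)
    info_a = sum(sum(s) / total * entropy(s) for s in splits)
    split_info = entropy([sum(s) for s in splits])
    return (info_t - info_a) / split_info

# 14 events, class totals (9, 5), split into outcomes of sizes 5 / 4 / 5.
splits = [[2, 3], [4, 0], [3, 2]]
print(round(gain_ratio([9, 5], splits), 3))  # ~0.156
```

The attribute with the largest gain ratio is chosen for splitting; `Split_info` penalizes attributes that fragment the data into many outcomes.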
New equations for the extended data expression are defined as follows. The class membership weight $C_i(m)$ states how much the $m$th event belongs to the class $C_i$, and the following condition must be satisfied:

$\sum_{i=1}^{k} C_i(m) = 1.$

The outcome membership weight $O_{Aj}(m)$ states how much the outcome value $j$ of the attribute $A$ can happen in the $m$th event, and the following condition must be satisfied:

$\sum_{j=1}^{n} O_{Aj}(m) = 1.$

$T_{Aj}$ is the subset of $T$ that has the outcome value $j$ in the attribute $A$. For example, if the attribute $A$ has three values, the set $T$ is divided into three subsets. $freq(C_i, T)$ is the number of instances in the set $T$ that have the class value $C_i$, as follows:

$freq(C_i, T) = \sum_{m=1}^{p} C_i(m) \cdot Weight(m, T),$

where $Weight(m, T)$ is the weight value of the $m$th event in the set $T$. $|T|$ is the number of instances in the set $T$:

$|T| = \sum_{m=1}^{p} Weight(m, T).$

An instance means an event whose weight is 1. Therefore, if an event has the weight $W$, it represents $W$ instances with equal distribution values of the attributes and class. $|T_{Aj}|$ is the number of instances in the set $T_{Aj}$, as follows:

$|T_{Aj}| = \sum_{m=1}^{p} O_{Aj}(m) \cdot Weight(m, T_{Aj}).$

In other words, multiplying $O_{Aj}(m)$, the outcome possibility value of each event in $T_{Aj}$, by $Weight(m, T_{Aj})$ for each $m$ and adding all the products, we get $|T_{Aj}|$. The best attribute for splitting at a node of the recursion tree is then the one with the largest $Gain\_ratio$, computed using the entropy equations with these newly defined values.
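A minimal sketch of these weighted quantities (the helper names are mine; for simplicity it assumes $Weight(m, T_{Aj}) = Weight(m, T)$, i.e., events keep their weights inside the subset):

```python
# Each event: (class distribution, outcome distribution for attribute A, weight).
# Toy two-class, two-outcome data; the numbers are illustrative only.
events = [
    ({"c1": 1.0, "c2": 0.0}, {"a1": 1.0, "a2": 0.0}, 2.0),
    ({"c1": 0.5, "c2": 0.5}, {"a1": 0.5, "a2": 0.5}, 4.0),
    ({"c1": 0.0, "c2": 1.0}, {"a1": 0.0, "a2": 1.0}, 2.0),
]

def freq(ci, evts):
    """freq(C_i, T) = sum_m C_i(m) * Weight(m, T)."""
    return sum(cd[ci] * w for cd, _, w in evts)

def size(evts):
    """|T| = total weight = number of instances."""
    return sum(w for _, _, w in evts)

def size_aj(j, evts):
    """|T_Aj| = sum_m O_Aj(m) * Weight(m)."""
    return sum(od[j] * w for _, od, w in evts)

print(freq("c1", events), size_aj("a1", events), size(events))  # 4.0 4.0 8.0
```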
V. INCREMENTAL LEARNING ALGORITHM BASED ON NEW PROBABILISTIC MAJORITY VOTING
 A. Algorithm
Ensemble-based algorithms create a collection of weak classifiers and compound them into a strong classifier. Such algorithms use weighted majority voting based on the performance of the classifiers. In this case, a weak classifier gives out a hypothesis that decides one class for each training event.
Consider a situation for a decision tree: a leaf contains 100 events, 45 of which belong to class 1 while the other 55 belong to class 2. The hypothesis assigns this leaf to class 2, since class 2 has more popularity in the leaf.
A probabilistic expression is more natural. For the previous example, the weak classifier gives the class distribution as its hypothesis: 0.45 for class 1 and 0.55 for class 2. This class distribution can be used to compound hypotheses.
Consider the example in Fig. 1. Here, $n$ is the number of classifiers, $k$ is the number of classes, and $m$ is the number of databases. $h^j(x) = (h_1^j(x), \ldots, h_k^j(x))$ is the class distribution of the $j$th classifier,

$H(x) = \sum_{j=1}^{n} w_j h^j(x)$

is the compound hypothesis of the ensemble, and the weights in the compound distribution can be calculated as follows using the error $e^j$ of the $j$th classifier:

$w_j = \log\left(\frac{1 - e^j}{e^j}\right).$
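The compounding step can be sketched as follows, taking the classifier weight to be the AdaBoost-style $\log((1-e^j)/e^j)$, which I assume is the intended form; the class distributions and error values below are illustrative:

```python
import math

def compound(dists, errors):
    """Weighted sum of class distributions with w_j = log((1 - e_j) / e_j)."""
    k = len(dists[0])
    weights = [math.log((1 - e) / e) for e in errors]
    return [sum(w * d[c] for w, d in zip(weights, dists)) for c in range(k)]

# Two classifiers over three classes; errors 0.2 and 0.4 (illustrative).
dists = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
H = compound(dists, [0.2, 0.4])
winner = max(range(3), key=lambda c: H[c])
print(winner)
```

The more accurate classifier (error 0.2) receives the larger weight, so its preferred class dominates the compound distribution here.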
The probabilistic expression is easily realized by using the UChoo classifier and the extended data expression. The weights of events in the extended data expression can be used to define the class distribution given as the result of a classifier. UChoo is the classifier based on this data expression, and the class distribution in a leaf can be expressed as follows:

$h_i(x) = K \sum_{m \in T_{leaf}} C_i(m) \cdot Weight(m, T_{leaf}),$

where $T_{leaf}$ is the set of events in the leaf and $K$ in Eqs. (10) and (11) is a normalization constant that makes the sum of all the weights equal to one.

Fig. 1. Ensemble of classifiers.
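A sketch of the normalized leaf distribution (the helper names are mine), reproducing the 45/55 leaf example from above:

```python
def leaf_distribution(leaf_events, classes):
    """h_i = K * sum_m C_i(m) * Weight(m), with K chosen so the h_i sum to 1."""
    raw = {c: sum(cd[c] * w for cd, w in leaf_events) for c in classes}
    total = sum(raw.values())  # 1/K
    return {c: v / total for c, v in raw.items()}

# A leaf holding 100 instances: 45 of class "c1", 55 of class "c2".
leaf = [({"c1": 1.0, "c2": 0.0}, 45.0), ({"c1": 0.0, "c2": 1.0}, 55.0)]
dist = leaf_distribution(leaf, ["c1", "c2"])
print(dist)  # {'c1': 0.45, 'c2': 0.55}
```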
Define here the training data

$S = \{(x_i, y_i, w_i)\},$

where $w_i \ge 0$ is the weight of an instance, $x_i \in X$ is the $i$th instance in the extended attribute space $X$, and $y_i \in Y$ is the class label of $x_i$ in the extended class space $Y$. $T$ is the number of weak classifiers. The output of WeakLearn for an event $x$ is the hypothesis $h(x) = (h_1(x), \ldots, h_k(x))$, where $h_i(x)$ states how much the event $x$ belongs to the class $C_i$ and

$\sum_{i=1}^{k} h_i(x) = 1.$
The UChoo classifier with a pruning parameter is used as WeakLearn. The pruning parameter controls the weakness of a classifier by controlling the ratio of the majority class instances to the total instances in a leaf. Pruning of 40% means that each leaf consists of at least 60% of one class. The class distribution in each leaf is determined by the formula described above.
The block diagram of the incremental learning is shown in Fig. 2, and the algorithm is described in Fig. 3.
At first, for the available database, the distribution is initialized as uniform, and all the events have an equal probability of being chosen for the training subset $TR_t$. At each iteration, the weak classifier (UChoo) is trained on $TR_t$ to obtain the hypothesis $h^t = \{h^t(x_i) \mid x_i \in S\}$. The compounded hypothesis is calculated according to the normalized class distribution in a leaf, given above.
Weights are changed using the performance of the compounded classifier by applying the AdaBoost rule [8], so that the weights of misclassified events are increased relative to the correctly classified training events. In AdaBoost, weights are changed according to the performance of the hypothesis produced by the current weak classifier; here, the performance of the compounded hypothesis is used to update the weights, so the ability of the resulting hypothesis improves after each iteration. Also, resampling with replacement is used to obtain the training subset $TR_t$, so events that are difficult to learn can be chosen for the training set more than once. This means an incorrectly classified instance receives a higher weight, and the UChoo classifier focuses on such events. In the end, the final hypothesis is obtained by compounding the hypotheses for all the databases.
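The weight update can be sketched with the standard AdaBoost rule applied to the compounded hypothesis (the exact normalization used in the algorithm of Fig. 3 may differ; this shows only the direction of the update):

```python
def update_weights(weights, correct, error):
    """Scale correctly classified events by beta = e / (1 - e), then renormalize.

    Relative to the correct events, misclassified events end up with more weight,
    so they are more likely to be resampled into the next training subset.
    """
    beta = error / (1 - error)
    new = [w * (beta if ok else 1.0) for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]

# Four events; the compound hypothesis misclassifies the last one; error 0.25.
w = update_weights([0.25] * 4, [True, True, True, False], 0.25)
print(w)  # the misclassified event now carries the largest weight
```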
To show the power of the proposed probabilistic result expression, the new incremental algorithm explained in this section is compared with Learn++ using UChoo as the weak classifier. The Learn++ algorithm follows the same scheme but uses deterministic weighted majority voting.
 B. Analysis
Fig. 2. Block diagram of incremental learning.
Fig. 3. The proposed algorithm.
Since we use the same scheme as Learn++, the error bound is defined in the same way. First, consider a two-class problem, where there are only class 0 and class 1. In this case the hypothesis $h_1^t(x)$ denotes the probability of class 1 for the event $x$. Accordingly, the probability of class 0 is

$h_0^t(x) = 1 - h_1^t(x).$

The error bound of the compound hypothesis $H^t$ is obviously the AdaBoost error bound [8], that is,

$E \le 2^t \prod_{\tau=1}^{t} \sqrt{e^\tau (1 - e^\tau)}.$

The final hypothesis is

$H_{final}(x) = \arg\max_i \sum_{t=1}^{T} \log\left(\frac{1 - e^t}{e^t}\right) h_i^t(x).$

It is the same as the one in [6], and therefore the error bound is

$E \le 2^T \prod_{t=1}^{T} \sqrt{e^t (1 - e^t)}.$

In the case of more than 2 classes, the multiclass problem can be reduced to a two-class problem with the same error bound, as shown in [8].
Notice that we define an upper bound on the error, which says that the probabilistic majority is not worse than the deterministic one. In [9], the authors show that summing the outputs of real-valued neural networks and then selecting the best prediction performs better than selecting the best prediction of each network and then combining them with a majority rule. The experiments in the next section are carried out in the same way.
VI. EXPERIMENTS
The algorithms are tested on data sets collected from the UCI Machine Learning Repository [10]: Glass (214 events), Balance Scale (625 events), Vehicle (846 events), and Pendigits (10992 events). The data sets do not contain any missing or unknown values. Detailed descriptions and performance on each data set are given in the following.
As the goal of the experiments is to show the advantages of the new probabilistic majority voting, the algorithm is compared with Learn++. The experiments are made under equal conditions for both algorithms: the UChoo classifier is used as WeakLearn, the same number of hypotheses is generated, and the training and testing sets are the same for each experiment.
At first, each data set is randomly divided into two subsets: the first (80%) is used to train the classifier and the second (20%) to test the algorithms. To show the incremental effect, the training data set is divided into 3 databases, organized so that previously seen data are not used in later stages of learning. An exception is made for the Glass data, whose training data set is not divided into 3 databases because the number of events is small.
For each database, 12 weak hypotheses are generated, and the weakness of the hypotheses is varied with a pruning parameter between 0% and 40%. All the experiments show that the error varies only slightly with the number of hypotheses in the range from 10 to 20, while more than 20 hypotheses make the system very slow.
 A. Glass Data
The data set consists of 214 events; 172 events are used for training and the remaining events for testing. Each event consists of 10 attributes, with 6 classes.
Table 4 shows the performance of the algorithms on the test data using different pruning parameters. The performance of the probabilistic incremental algorithm is considerably better than that of Learn++ with UChoo as the base classifier for all tests. Also note that the error is slightly less for weak learners when the pruning parameter varies from 25% to 40%. This indicates that strong classifiers tend to overfit and show little diversity among themselves. The experiments show an advantage of probabilistic majority voting over weighted majority voting.
Table 4. Error on glass data
 B. Vehicle Data
The data set consists of 846 events. Among them, 677 events are used for training and the remaining events for testing. Each data event is described by 18 attributes, from which one of 4 classes is determined.
Table 5 shows the performance of the algorithms on the test data using different pruning parameters. We can observe a better performance of the probabilistic incremental algorithm compared to that of Learn++, although the errors are similar for all values of the pruning parameter.
In Tables 6 and 7, the incremental ability of the two algorithms is compared. The third column shows the average error of the weak classifiers learned on each database. The 4th, 5th, and 6th columns show the performance of the classifier at each incremental step on all databases.
Table 5. Error on vehicle data
Table 6. Error of probabilistic incremental learning on vehicle data (pruning 30%)
Table 7. Error of Learn++ on vehicle data (pruning 30%)
Table 8. Error on balance-scale data
Table 9. Error of probabilistic incremental learning on balance-scale data (pruning 30%)
Table 10. Error of Learn++ on balance-scale data (pruning 30%)
Performance increases incrementally from one database to another. The new algorithm gives a 10% improvement, while Learn++ achieves 4% on the testing data at the final incremental step.
 C. BalanceScale Data
The Balance Scale Weight & Distance data set, obtained from the UCI repository, was generated to model psychological experimental results. The data set consists of 625 events and has 5 attributes; 500 events are used for training and 125 events for testing.
Table 8 compares the errors of the new algorithm and Learn++ for the pruning parameter in the range of [0, 40]. A better performance of the new algorithm is observed, and both algorithms show the advantage of combining weak classifiers into a better compound classifier. In Tables 9 and 10, an 11% performance improvement is observed on the testing data.
 D. Pendigits Data
The data set consists of 10992 events; 8794 events are used for training and the remaining 2198 events for testing. The data set consists of 16 attributes, which describe handwritten numerical characters from 0 to 9.
Table 11 shows the performance of the algorithms on the test data using different pruning parameters. The error of the new boosting algorithm is less than that of Learn++. Tables 12 and 13 show a 3% performance improvement in incremental ability. The errors of the compound classifiers are shown in Table 14. The errors decrease through iterations, which demonstrates that the algorithm is learning incrementally.
Table 11. Error on pendigits data
Table 12. Error of probabilistic incremental learning on pendigits data (pruning 30%)
Table 13. Error of Learn++ on pendigits data (pruning 30%)
Table 14. Error of new boosting algorithm through iterations on pendigits data
Table 15. Error of Learn++ through iterations on pendigits data
Tables 14 and 15 show the errors of both algorithms through iterations on the testing set of the Pendigits data. The two algorithms have almost the same error on the first iteration, but the final hypothesis of the new algorithm has better performance.
All the results show the advantage of probabilistic majority voting over weighted majority voting.
 E. Overall Performance
Figs. 4 through 7 compare the newly proposed algorithm with Learn++ for different pruning parameters. The x coordinate of the figures is (100 - pruning parameter) in percent. The new algorithm shows better results in almost all the experiments for various pruning parameters. Also, in Tables 6, 9, and 12, the performance of the hypothesis is higher on its own training data set, and the performance on the test set improves as the hypothesis trained on a new database is added. For the Pendigits data the improvement is only 3%, but for the vehicle and balance data it is 10%.
In the Pendigits data case, however, the error rate decreased from 6.23% to 3.09%, which is actually a significant performance improvement, as the error rate decreased by more than 50%.
VII. CONCLUSION
This paper presents a new incremental algorithm based on the probabilistic result expression. Such an expression prevents the loss of information incurred by weighted majority voting. The results of the experiments corroborate the advantage of the probabilistic result expression over the deterministic expression.
Fig. 4. Error rate on glass data.
Fig. 5. Error rate on vehicle data.
Fig. 6. Error rate on balance-scale data.
Fig. 7. Error rate on pendigits data.
The extended data expression and the UChoo classifier are powerful techniques for classification and rule refinement problems. In this paper, the extended data expression is applied to obtain probabilistic results with probabilistic majority voting.
The proposed algorithm shows good performance, which confirms the usefulness of the extended data expression and the UChoo classifier in solving classification problems.
To calculate the probabilistic majority voting, the number of instances in the leaf should be counted. Because UChoo divides the space recursively according to the entropy measure, the subspace in the leaf is captured naturally and the probability of a certain class can be seen. For a classifier such as an artificial neural network, it is not easy to define the subspace to which the test data belong, and hence probabilistic majority voting is not done naturally. This fact signifies the advantage of decision tree type classifiers as a basic building block of a more complicated ensemble of classifiers.
Acknowledgements
This work was supported by research fund of Chungnam National University.
REFERENCES
[1] Duin R. P. W., "The combining classifier: to train or not to train?," in Proceedings of the 16th International Conference on Pattern Recognition, Quebec, Canada, 2002, pp. 765-770.
[2] Kong J. M., Seo D. H., Lee W. D., "Rule refinement with extended data expression," in Proceedings of the 6th International Conference on Machine Learning and Applications, Cincinnati, OH, 2007, pp. 310-315.
[3] Kim D. H., Lee D. H., Lee W. D., "Classifier using extended data expression," in Proceedings of the IEEE Mountain Workshop on Adaptive and Learning Systems, Logan, UT, 2006, pp. 154-159.
[4] Polikar R., "Bootstrap-inspired techniques in computational intelligence," IEEE Signal Processing Magazine, vol. 24, no. 4, pp. 59-72, 2007. DOI: 10.1109/MSP.2007.4286565
[5] Polikar R., Udpa L., Udpa S. S., Honavar V., "Learn++: an incremental learning algorithm for multilayer perceptron networks," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Istanbul, Turkey, 2000, pp. 3414-3417.
[6] Polikar R., Udpa L., Udpa S. S., Honavar V., "Learn++: an incremental learning algorithm for supervised neural networks," IEEE Transactions on Systems, Man, and Cybernetics Part C: Applications and Reviews, vol. 31, no. 4, pp. 497-508, 2001. DOI: 10.1109/5326.983933
[7] Quinlan J. R., "Bagging, boosting, and C4.5," in Proceedings of the 13th National Conference on Artificial Intelligence, Portland, OR, 1996, pp. 725-730.
[8] Schapire R. E., "The boosting approach to machine learning: an overview," in MSRI Workshop on Nonlinear Estimation and Classification, Berkeley, CA, 2002.
[9] Murphey Y. L., Chen Z., Feldkamp L., "Incremental neural learning using AdaBoost," in Proceedings of the International Joint Conference on Neural Networks, Honolulu, HI, 2002, pp. 2304-2308.
[10] UCI machine learning repository [Internet]. Available: http://archive.ics.uci.edu/ml/