SVM-Based Incremental Learning Algorithm for Large-Scale Data Stream in Cloud Computing

KSII Transactions on Internet and Information Systems (TIIS), Oct. 2014, 8(10): 3378-3393

- Received: April 09, 2014
- Accepted: October 31, 2014
- Published: October 28, 2014


We have witnessed the rapid development of information technology in recent years. One of the key phenomena is the fast, near-exponential increase of data. Consequently, most traditional data classification methods fail to meet the dynamic and real-time demands of today's data processing and analysis needs, especially for continuous data streams. This paper proposes an improved incremental learning algorithm for large-scale data streams, which is based on SVM (Support Vector Machine) and is named DS-IILS. The DS-IILS takes the load condition of the entire system and the node performance into consideration to improve efficiency. A threshold on the distance to the optimal separating hyperplane is defined in the DS-IILS algorithm, and the samples of the history sample set and the incremental sample set that fall within this threshold are all reserved and treated as the training sample set. To design a more accurate classifier, the effects of the data volumes of the history sample set and the incremental sample set are handled by weighted processing. Finally, the algorithm is implemented in a cloud computing system and is applied to study user behaviors. The results of the experiment are provided and compared with other incremental learning algorithms. The results show that the DS-IILS can improve training efficiency and guarantee relatively high classification accuracy at the same time, which is consistent with the theoretical analysis.
With the advancement of information technology, large amounts of data are being generated from operations in every industry, such as weather monitoring, web logs, phone records, financial transactions, etc. This type of data, often denoted as a data stream [1], is different from traditional data collections due to its high speed, high dimension, real-time and dynamic nature. These data must be processed quickly to extract effective information in time for decision making, planning, and research purposes [1][2]. For example, we can predict and respond to purchase trends by analyzing customers' behavioral characteristics; weather forecasting accuracy can be improved by obtaining information from the image flow data sent back by satellites; and extracting and mining the Internet data stream helps with emergency monitoring and with analyzing public opinion on the Internet. All of these applications have made our daily life more convenient and more intelligent. Cloud computing technology facilitates data stream processing using large-scale clusters and distributed computing, which improves the computing capacity and the transmission efficiency of network services. Using traditional data classification methods to process contemporary data streams is increasingly difficult, because the data form has transformed from static to dynamic. Data are generated continuously over time and the volume of data is infinite, so we cannot store and compute all of the data. Operations performed on a data stream need to be long-term and dynamic, which means that the classification algorithm must learn from its data stream in order to update the classifier, and that it must produce highly accurate classifications. Algorithms that can adapt to the features of a data stream and classify it efficiently have become one of the most active research directions in this field.
Among all classification technologies, the SVM algorithm has the most advantages when processing small samples of non-linear and high-dimensional data. Based on the VC dimension theory and SRM (the structural risk minimization inductive principle), which seeks an optimal compromise between empirical and confidence risks, SVM increases the generalization ability of the classifier while maintaining high accuracy [3]. Linearly non-separable problems are resolved by introducing a penalty factor and by mapping the problems into a higher-dimensional space with the aid of the kernel function; in this way, non-linear problems are converted into linear ones. SVM has better classification performance when handling high-dimensional problems because its classifier is simple and the algorithm's computational complexity is not impacted by the dimensions of the data. The support vector set, which is only a small part of the sample data, is introduced to represent the characteristics of the whole data set, so SVM is suitable for incremental learning, during which the support vector sets of history samples are reserved to identify the relevant information in the history data.
This paper studies the classification methods of data streams using the SVM theory. In applying this theory, several additional contributions are made: (1) an improved incremental learning SVM algorithm is proposed, referred to as IILS, which is appropriate for data stream classification; (2) by focusing on system load conditions and node performance, an improved incremental learning SVM algorithm, referred to as DS-IILS, is developed to resolve the classification problems in a large-scale data stream; (3) these algorithms are implemented in a cloud computing system and are used to analyze the behaviors of a bank’s customers to predict their purchase trends. Meanwhile, the results are compared with other data classification algorithms to prove the effectiveness of the proposed algorithms.
Section 2 of this paper introduces related work and describes the problems to be studied. In section 3, we present the IILS and DS-IILS algorithms and their computational complexity analyses. The experimental analysis is delineated in section 4. The paper concludes with a presentation of future work that can be carried out as a result of this research.
f(x)=0 is the historical optimal separating hyperplane, and P_1, P_2, P_3, P_4 are the support vectors in the historical sample set. When new samples A_1, A_2, A_3, A_4, A_5 arrive, g(x)=0 is the new optimal separating hyperplane, and A_1, A_4, A_5, Q_1, Q_2, Q_3, Q_4 are the support vectors of the new classifier. Traditional KKT incremental learning finds the samples that do not meet the KKT condition and trains on them, which reduces the sample size. But in Fig. 1, A_5 conforms to the KKT condition, yet it is also a support vector of the new classifier. Because A_5 is close to the historical optimal hyperplane, this kind of data should also be considered when filtering samples. Besides the samples that violate the KKT condition, i.e. the sample points with y_i f(x_i) < 1, data points like A_5 are reserved as well. We therefore relax the condition to all sample points that satisfy y_i f(x_i) < 1 + ε, ε > 0. The parameter ε is introduced to retain samples within a certain interval of the hyperplane; a larger ε means more sample points, and when ε is infinite, all samples are included in the new sample set. Also, in examining the reserved samples, we find that Q_1, Q_2, Q_3, Q_4 change from non-support vectors to support vectors through incremental learning, so the sample points of the historical sample set that satisfy y_i f(x_i) < 1 + ε, ε > 0 should also be reserved. Apparently, a larger ε means higher accuracy and a larger sample set, which leads to higher computational demands, so ε needs to be chosen accordingly. Our research mainly focuses on reserving important classification sample information and discarding useless samples.
Fig. 1. Change analysis of the support vector set after incremental learning
In the meantime, our user behavior analysis takes into account the general trend of users, so a larger number of incremental samples is beneficial for the classifier. Therefore, when historical data and incremental data are weighted separately, our weighting principle is that the weights increase with the volume of data.
We assume that the historical training set (whose size is l) is given by formula (1), and the incremental sample set (whose size is k) by formula (2). We define the weight of the samples in {(x_1, y_1), ···, (x_l, y_l)} ∈ (X × Y)^l as d_i for 1 ≤ i ≤ l, and the weight of the samples in {(x_{l+1}, y_{l+1}), ···, (x_{l+k}, y_{l+k})} ∈ (X × Y)^k as d_i for l+1 ≤ i ≤ l+k.
After weighting the samples, the quadratic programming problem is changed to formula (3), where C > 0 is the penalty factor. The transformed dual problem is described by formula (4); that is, the optimal solution of formula (4) is the solution of our problem, where α is the Lagrange multiplier. The result α^* = (α^*_1, ···, α^*_{l+k})^T is the optimal solution, from which the decision function is constructed, in which x_p, d_p are any positive-class support vector and its weight, and x_q, d_q are any negative-class support vector and its weight.
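The weighting principle (weights grow with data volume) admits many instantiations, and the paper's exact weight formula is rendered as an image in the source, so the proportional scheme below is only an assumed example of how per-set weights d could be derived before scaling the penalty C per sample.

```python
def volume_weights(l, k):
    """One possible instantiation of the weighting principle: each
    sample's weight grows with the volume of the set it belongs to,
    normalized so the two weights sum to 1.

    This proportional form is an assumption; the paper only requires
    that weights increase with data volume.
    """
    d_hist = l / (l + k)   # weight for each historical sample
    d_inc = k / (l + k)    # weight for each incremental sample
    return d_hist, d_inc

# Equal volumes give equal weights; a 9:1 split favors the larger set.
d_hist, d_inc = volume_weights(l=5000, k=5000)
```

In a weighted soft-margin SVM, such a weight would typically multiply the penalty factor C of each sample in the quadratic program.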
A is the initial sample set, and B and D are incremental sample sets (we take two incremental sets for this example). The amount of samples in A is N_A, and N_B and N_D are the amounts in B and D. The data set before the sliding window is the historical sample set, whose amount of samples is N_0. The data set in the current sliding window is the incremental sample set, whose amount of samples is N_t.
The procedure of IILS is as follows:
A_i (1 ≤ i ≤ H) can be calculated using formula (5), where H is the number of nodes in the cloud. By selecting nodes that have lower concurrent access rates as work nodes, the training efficiency of the whole system is improved. Task_i is the amount of tasks executing on the ith node and SumTask_i is the maximum number of tasks that can be executed on it. If there are H nodes in the cloud, the overall concurrent access rate of the system is described by formula (6). The load threshold L is defined to be C_A, that is, L = C_A. If 0 ≤ A_i ≤ L, then the ith node belongs to the light-load node set L_light; otherwise it belongs to the heavy-load node set L_heavy. Selected nodes must come from L_light, so that the system load balance is guaranteed and the processing efficiency is improved.
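Formulas (5) and (6) appear as images in the original, so the sketch below assumes the natural reading A_i = Task_i / SumTask_i with C_A the mean rate over all H nodes; the task counts are made-up values for illustration.

```python
def partition_nodes(tasks, sum_tasks):
    """Split cloud nodes into light- and heavy-load sets.

    Assumed reading of the paper's formulas: A_i = Task_i / SumTask_i
    is node i's concurrent access rate (formula 5), C_A is the mean
    rate over all nodes (formula 6), and the load threshold L = C_A.
    """
    rates = [t / s for t, s in zip(tasks, sum_tasks)]
    L = sum(rates) / len(rates)  # C_A, used as the load threshold
    light = [i for i, a in enumerate(rates) if a <= L]
    heavy = [i for i, a in enumerate(rates) if a > L]
    return rates, L, light, heavy

# Four nodes, each able to run 10 tasks, currently running 2, 8, 5, 1.
rates, L, light, heavy = partition_nodes([2, 8, 5, 1], [10, 10, 10, 10])
```

Only the indices in `light` would then be considered as work nodes for incremental training.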
Based on previous considerations, the proposed improved incremental learning algorithm of SVM focusing on the large-scale data streams, named the DS-IILS, is described below:
First: data stream preprocessing
Second: analysis of system load condition
Third: data stream parallel processing
n, the scale of the sample set l, the amount of support vectors N_SV, and their distribution. Commonly, N_SV is much smaller than l, that is, (N_SV / l) << 1, so the computational complexity of SVM is dominated by the support vectors rather than by the full sample set. Assume the scale of the newly incremental sample set is k, with 0 < k/l < 1. In traditional SVM algorithms, the newly incremental samples are added to the historical samples to be retrained, so the scale of the sample set is l(1 + k/l) and the computational complexity grows with it. After adopting IILS, the scale of the training set, which includes only the samples of the history sample set and the newly incremental sample set that are within the scope of the threshold, is much smaller than l(1 + k/l). Generally, therefore, the computational complexity of IILS is less than the complexity of the other algorithms.
The core of the DS-IILS is the IILS, but the DS-IILS also considers the system load and adopts parallel optimization, so the run time is greatly reduced and the work efficiency is significantly improved.
We adopt Clementine 12 to preprocess the data and analyze the importance of variables. Our algorithm is realized based on libSVM [25]. The data set is the Bank Marketing data set [26] from the UCI repository. The data are generated by marketing campaigns of financial institutions in Portugal, which try to predict customers' purchase behaviors by classification. There are 17 attributes in the data set, among which IDs ranging from 1 to 9 are customer attributes, including age, job, education, etc., IDs ranging from 9 to 16 are customers' contact information, and ID 17 is the target attribute, in which true denotes purchase behavior and false denotes non-purchase behavior.
C are 0.1, 1, 10, and 100. The results are shown in Table 2.
Contrast experiment of kernel function and penalty factor
There are several factors to consider when choosing a kernel function. Although the training time of the polynomial kernel function is slightly shorter than that of the RBF kernel function, the overall training accuracy of the classifier with the RBF kernel function is higher. Furthermore, the more parameters a kernel function has, the more strongly the classification result depends on them, and the more time is consumed in selecting them. The RBF kernel function has fewer parameters than the polynomial kernel function; therefore, we selected the RBF kernel function for this experiment.
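The parameter-count argument can be made concrete: the RBF kernel exposes a single parameter (gamma), while the polynomial kernel exposes three (gamma, coef0, degree). The implementations and sample values below are illustrative and are not the paper's code.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2): one parameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def poly_kernel(x, z, gamma=1.0, coef0=1.0, degree=3):
    """Polynomial kernel K(x, z) = (gamma * <x, z> + coef0)^degree:
    three parameters to tune."""
    return (gamma * sum(a * b for a, b in zip(x, z)) + coef0) ** degree

k1 = rbf_kernel((1.0, 0.0), (1.0, 0.0))   # identical points
k2 = poly_kernel((1.0, 0.0), (0.0, 1.0))  # orthogonal points
```

Grid-searching one parameter is far cheaper than searching three, which is the practical reason stated above for preferring RBF.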
Moreover, considering the selection of the penalty factor C, we can see in Table 2 that only when C is 100 is the training time obviously longer, while in the other cases the training time is stable; so the value of C is selected from 0.1, 1, and 10. According to the classification accuracy, when C is 10 the accuracy is higher and steadier across different samples; so we set the value of the penalty factor C at 10 for this experiment.
y_i f(x_i) < 1 + ε, ε > 0 have to be reserved in the DS-IILS. A properly chosen ε leads to better classification performance, so we run contrast experiments on ε. The size of the initial sample set is 5000, as is that of each of the incremental sample sets 1 through 3. Based on the analysis in section 4.2, we adopt the RBF kernel function and the penalty factor C = 10. The value of ε is selected from among 0, 0.2, 0.4, 0.6, and 0.8. We conduct the experiment 10 times under the same conditions, and the average results are reported in Fig. 2 and Fig. 3. Fig. 2 shows the effect of ε on training time and Fig. 3 depicts its effect on accuracy.
Effect of ε on training time
Effect of ε on accuracy
According to the results, we can conclude that a larger ε leads to more training time and higher accuracy. As a balance between the two, 0.4 is chosen as the value of ε for the subsequent comparisons.
C is 10 and ε is set at 0.4. The size of data block x is 32 MB. To assure the accuracy of the experiment results, we conduct the experiment 50 times under the same conditions, and the average results are reported in Fig. 4 and Fig. 5. Training time is compared in Fig. 4 and accuracy in Fig. 5. From the results, we can see that, in general, the incremental learning algorithms improve efficiency to a large extent with only a small decline in accuracy, so incremental learning is effective. This is consistent with the complexity analyses of the algorithms in section 3.3. Also, the accuracy of the KKT incremental algorithm decreases significantly when a large amount of data emerges, because it reserves fewer samples. Although the training time of the DS-IILS is slightly longer than that of the KKT incremental algorithm, the accuracy of the DS-IILS is obviously improved. Besides, the DS-IILS has higher efficiency than the BatchSVM algorithm, especially when data are continuously increasing, because it does not accumulate a large amount of samples. So the performance of the DS-IILS is superior.
Comparison of training time with 4 algorithms
Comparison of accuracy with 4 algorithms
Next, the sample sizes are enlarged to demonstrate the algorithm's efficiency in processing a large-scale data stream. The sample sizes of the initial sample set and the incremental sample sets 1 through 3 are 1 million each, and the testing sample size is 100,000. Fig. 4 shows that the DS-IILS is clearly better than the standard SVM algorithm, so we only compare it with the KKT algorithm and the BatchSVM algorithm here. The results are shown in Fig. 6, from which we can conclude that the system training time of the DS-IILS is much shorter than that of the other algorithms. This is because the DS-IILS considers not only the system load but also the node performance: when scheduling tasks, it can first select the better-performing nodes to process the tasks.
Comparison of training time on a large-scale data stream with 3 algorithms
Ning Wang received the B.E. degree in computer science and technology from Shandong University of Technology, China, in 2003, and the M.S. degree in computer software and theory from Yunnan Normal University, China, in 2006. She is currently working toward the Ph.D. degree with School of Computer and Communication Engineering, University of Science and Technology Beijing, China. Her current research interests include cloud computing and data mining. Email: wangning200658@126.com.
Yang Yang received his Ph.D. degree from the University of Science and Technology in Lille, Lille, France in 1988. Currently, he is a professor and Ph.D. supervisor in the School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China. His research interests include cloud computing, intelligent control and multimedia communication. Email: yyang@ustb.edu.cn.
Liyuan Feng received the B.S. degree from University of Science and Technology Beijing, China, in 2011, and the M.E. degree from University of Science and Technology Beijing, China, in 2014. Her current research interests are big data, cloud computing and data stream analysis. Email: fengliyuar@163.com.
Zhenqiang Mi received the B.S. in automation and the Ph.D. in communication engineering, both from the School of Information Engineering, University of Science and Technology Beijing, in 2006 and 2011, respectively. Since 2011, he has been an assistant professor with the School of Computer and Communication Engineering, University of Science and Technology Beijing. His research interests include mobile ad hoc networks and cloud computing in mobile environments. Email: mizq@ustb.edu.cn.
Kun Meng received his Ph.D. degree from the University of Science and Technology Beijing, Beijing, China. He is now a Postdoctoral Fellow of the Department of Computer Science and Technology, Tsinghua University. His research interests include QoS of computer networks, performance evaluation, network security analysis, service computing and stochastic models. Email: mengkun1024@163.com.

1. Introduction

2. Related work and problem analysis

- 2.1 Related work

The processing model of a data stream is different from that of a traditional database because of its fast, scalable, and dynamic features. Only part of the data stream can be processed due to its potentially infinite nature. Each record in the data stream corresponds to a time point, called a timestamp. Data in a specific time range are processed according to the user's requirements; the time range is defined as a time window [4]. Data in the time window are part of the data stream, so the result is an approximation within an acceptable error range. There are three commonly used window models: the landmark window model, the sliding window model, and the damped window model [4]. None of these models has an obvious advantage over the others, so the application conditions have to be taken into consideration in order to adopt a suitable window model. The sliding window model [4] is interested in the data in the present window, so old data are cleared whenever new data arrive, and the window slides along the time axis with the arrival of new data. In this paper, the sliding window model is adopted due to our application requirements.
There are two primary research directions in data stream classification and mining algorithms: (1) single-classifier data mining algorithms, in which classification is realized by continuous learning and updating of classifiers; (2) composite-classifier algorithms, which combine several similar or different classifiers and weight their outcomes linearly to improve accuracy, thereby absorbing the advantages of all classifiers. Focusing on the problems of concept shift, unbalanced data sets, missing data content, and noise, Ghazikhani [5] proposed an integrated classifier in an online neural network to elevate its performance. Ditzler [6] suggested that concept shift and data set unbalancing could be settled jointly. Farid [7] developed new methods for concept shift detection and self-adaptive integration. Hashemi [8] came up with the FlexDT algorithm to prevent fake concept shifts caused by noise, and thereby improved accuracy. The paper [9] proposed a new SVM model containing a procedure for data cleaning; its experiment aimed at a large amount of unbalanced data within a data stream.
In other related works, Sun [10] adopted a circulated structure to update multi-classifiers based on the features of the data stream, and thereby improved its efficiency. Couellan [11] treated the training problem as a common nonlinear optimization problem with a special decomposition property and adopted incremental gradients to calculate the inner product; the algorithm he developed had better performance in terms of prediction accuracy and CPU time. Zheng [12] proposed an online SVM incremental learning algorithm, consisting of learning prototypes and learning support vectors, to resolve learning problems with large-scale data. Hou [13] proposed the RedTree algorithm, in which the classification of multiple data streams was achieved using composite sampling. Cazzolate [14] came up with an incremental decision tree algorithm based on the VFDT (Very Fast Decision Tree) system, which achieved better performance in noise processing and accuracy. Abdallah [15] introduced a new classification of behavior recognition systems based on computer clusters, integrating increments from data stream mining and active learning. Dilectin [16] classified real-time data for tsunami warnings using a KNN (K-Nearest Neighbor) algorithm and dynamically detected data. In the study of human face gender recognition, Yang [17] proposed a new framework for faster face gender recognition using a local ternary pattern and an extreme learning machine. Syed [18] proposed a Batch incremental learning algorithm based on SVM, which could carry out incremental learning effectively as data change; the training data set used in network text classification was massive and fast-changing. Ding [19] suggested the FVI-SVM algorithm, which narrowed the volume of the training data set using the KKT (Karush-Kuhn-Tucker) condition and improved classification efficiency. Gonzalez-Mendoza [20] considered the KKT optimality conditions in order to present a strategy for implementing the SVM-QP (the quadratic programming problem of Support Vector Machines). Among the algorithms studied, the BatchSVM incremental learning algorithm continually accumulates ever-larger support vector sets, so when data increase endlessly it brings too great a computational burden. Besides, the incremental learning algorithm based on the KKT conditions filters out more incremental samples during training, so its classification accuracy is lower.
In a weighting method of the SVM algorithm, which treated the gray image as a digital terrain model, Zheng [21] developed an adaptive weighted least squares support vector machine, named the LS-SVM, to iteratively estimate the optimal gray surface underlying a noisy image; the LS-SVM worked on Gaussian noise while the weighted LS-SVM worked on outliers and non-Gaussian noise. Xing [22] presented a feature selection and weighted support vector machine (FSWSVM) classifier-based algorithm to detect ships using polarimetric SAR imagery. For the online classification of data streams with imbalanced class distribution, Zhu [23] gave an incremental Linear Proximal Support Vector Machine (LPSVM), called the DCIL-IncLPSVM, which provided robust learning performance in the case of class imbalance. Fergani [24] presented a new way to deal with the class imbalance problem, creating an efficient way of choosing a suitable regularization parameter C for the Soft-Support Vector Machines (C-SVM) method in order to facilitate automatic recognition of activities in a smart home environment. None of these papers considered the case of imbalanced data volumes between the history sample set and the incremental sample set in incremental learning algorithms. In the research reported in this paper, we pay attention to this issue and handle it using weighted processing to improve the classification accuracy.
- 2.2 Problem analysis

There are several defects in traditional incremental learning algorithms. For example, the standard SVM [9] is not an incremental algorithm; although its classification accuracy is higher, the higher the data volume, the longer the training time. By discarding useless sample sets and keeping only useful support vector sets, the BatchSVM [18] can realize a smaller data set volume. But this advantage disappears when the amount of support vectors grows larger, because the subset volume increases through continuous iteration and accumulation; moreover, a data stream, which is real-time and continuous, poses a heavy computational burden. An incremental learning algorithm based on the KKT condition [20] also decreases the number of incremental samples that take part in the next step of training; this lowers the classification accuracy, because many incremental samples are filtered out at the same time. Moreover, during data stream processing, we take the data of a sliding window as the incremental sample set and the previous data as the historical sample set, so the problem of an unbalanced sample amount emerges as the ratio of the history sample set becomes larger and larger. Data streams all have their own features, so the solutions vary from case to case. For example, user behavior analysis aims to analyze the general trend of users, so the major factor affecting the general behavior trend is the scale of the data; that is, slight changes in individual users' behaviors cannot affect the general trend. Based on this analysis, we improved the SVM incremental learning algorithm in order to adapt to the features of a data stream used to perform user behavior analysis.
The performance of the SVM incremental learning algorithm depends on reserving important information in the history samples. Properly reserving information will simplify calculation and provide effective information for subsequent training. The key point is the support vector set.
The change of the support vector set after incremental learning is illustrated in Fig. 1.

3. The Improved incremental learning algorithm of SVM

- 3.1 IILS Algorithm design

Based on the analysis in section 2.2, and aiming to resolve the issues of the BatchSVM [18] and the KKT incremental algorithm [20], namely the problem of reserving valid history data and the problem of imbalanced data volumes between the history sample set and the incremental sample set, a data-stream-oriented improved incremental learning algorithm of SVM, named the IILS, is described below:
Premise: Assume that
- 1) In the initial state, A is the initial sample set and B is the incremental set; then N_0 = N_A, N_t = N_B. By training A, we acquire the classifier ψ_A, the support vector set SV_A and the decision function f_A(x);
- 2) With a set threshold ε > 0, samples in A that meet y_i f_A(x_i) < 1 + ε are reserved and become the historical outline set, denoted by A', while samples that are too far away from the optimal separating hyperplane are discarded;
- 3) Detect B using ψ_A; if there are no samples that violate the KKT condition, then ψ_A is the classifier after incremental learning, denoted by ψ_{A+B}, and the historical outline set is (A+B)' = A', with N_0 = N_A + N_B;
- 4) If not all samples of B meet the KKT condition, then the samples in A' and the samples of B that satisfy y_i f(x_i) < 1 + ε are trained with weights to get the classifier ψ_{A+B}, the support vector set SV_{A+B} and the decision function f_{A+B}(x). Samples in the training set that satisfy y_i f_{A+B}(x_i) < 1 + ε are reserved, denoted by (A+B)', and treated as the historical outline set after updating, while samples that are too far away from the optimal separating hyperplane are abandoned;
- 5) Detect D using ψ_{A+B} and iterate to acquire the new classifier ψ and the decision function f(x), until there are no newly incremental sets.
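The steps above can be summarized in a short skeleton. The `train` callback is a pluggable stand-in for SVM training (the paper performs it with libSVM); the mean-split rule used as the example trainer is purely illustrative and is NOT an SVM.

```python
def iils(initial, increments, train, eps):
    """Skeleton of the IILS procedure (steps 1-5 above).

    `train` takes a list of (x, y) samples and returns a decision
    function f(x); `eps` is the reservation threshold.
    """
    f = train(initial)  # step 1: train on the initial set
    # Step 2: keep the historical outline set within the threshold.
    outline = [(x, y) for x, y in initial if y * f(x) < 1 + eps]
    for inc in increments:
        # Step 3: if no incremental sample violates the KKT condition
        # (y * f(x) >= 1 for all), keep the old classifier.
        if all(y * f(x) >= 1 for x, y in inc):
            continue
        # Step 4: retrain on the outline set plus the incremental
        # samples inside the threshold, then update the outline set.
        training = outline + [(x, y) for x, y in inc if y * f(x) < 1 + eps]
        f = train(training)
        outline = [(x, y) for x, y in training if y * f(x) < 1 + eps]
    return f  # step 5: final classifier after all increments

# Illustrative stand-in "trainer": a 1-D rule splitting the class means.
def mean_split_train(samples):
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == -1]
    mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x, mid=mid: x - mid

f = iils([(-2.0, -1), (2.0, 1)], [[(-0.5, 1), (3.0, 1)]],
         mean_split_train, eps=1.5)
```

Note how the far-away incremental point (3.0, 1) is filtered out of the retraining set while the near-hyperplane violator (-0.5, 1) shifts the decision boundary.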

- 3.2 DS-IILS Algorithm design

To improve the processing efficiency of a large-scale data stream in the system, i.e. the optimization of the task schedule, we consider the effect of node concurrent access on the system load, which is also one of our optimization strategies. The concurrency access rate A_i of each node is calculated by formula (5).

- (1) The data stream is divided into blocks of the same size, x MB, and they are sorted by arrival time;
- (2) The amount of blocks is denoted by E;

- (1) Calculate the concurrent access rate A_i of each node using formula (5);
- (2) Calculate the concurrent access rate C_A of the whole system using formula (6) and define the load threshold L = C_A;
- (3) If 0 ≤ A_i ≤ L, then the ith node belongs to L_light; otherwise it belongs to L_heavy. The number of nodes in L_light is M;
- (4) Sort the nodes of L_light in ascending order by their concurrent access rate.

- (1) If E ≤ M, the data blocks are put into the sorted nodes in their arriving order. All nodes process their assigned blocks in parallel; that is, each node calls IILS and executes incremental learning internally, simultaneously with all other nodes. Then the decision function of each node is weighted and voted according to the amount of samples it has processed, the final decision function f(x) is determined, and the algorithm ends.
- (2) If E > M, the first M data blocks are put into the sorted nodes in their arriving order. Meanwhile, all nodes process their assigned blocks in parallel; that is, each node calls IILS and executes incremental learning internally, simultaneously with all other nodes. The decision function and the amount of processed samples are reserved. Make E = E - M and return to the second step.
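The dispatch loop over E blocks and M light-load nodes can be sketched as follows; the per-node IILS training and the final weighted vote are elided, and the block names and node indices are illustrative.

```python
def schedule_blocks(blocks, m):
    """Rounds of the DS-IILS parallel phase.

    While blocks remain, the first m are dispatched to the m sorted
    light-load nodes in arrival order; the loop repeats (E = E - M)
    until every block has been assigned.  Returns one list of
    (node, block) assignments per round.  Each node would call IILS
    on its block; that training step is elided here.
    """
    rounds = []
    while blocks:
        batch, blocks = blocks[:m], blocks[m:]
        rounds.append([(node, blk) for node, blk in enumerate(batch)])
    return rounds

# Five blocks, two light-load nodes: three rounds of dispatch.
rounds = schedule_blocks(["b0", "b1", "b2", "b3", "b4"], m=2)
```

After the last round, each node's decision function would be weighted by the number of samples it processed and combined by voting into the final f(x).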

- 3.3 Computation complexity analyses of IILS and DS-IILS

The computational complexity of the SVM is affected by the dimension of samples

4. Experiment

- 4.1 Simulation environment and data set

The campus cloud of USTB (University of Science and Technology Beijing), in which Hadoop is adopted as the underlying storage, MapReduce as the upper parallel computing framework, and virtualization technology is also employed, served as the experimental platform. Our evaluations of the algorithms are performed within this campus cloud. It currently consists of 10 data nodes with different performance levels. Their configurations are shown in Table 1. Algorithms are evaluated on two aspects: (1) training time, which reflects the algorithm's efficiency, and (2) classification accuracy, which represents the algorithm's learning ability. Meanwhile, to ensure the accuracy and effectiveness of the experiment's results, we adopt Java for all algorithms, run each experiment 50 times under the same conditions, and report the average results.
Table 1. Configuration of cloud nodes


- 4.2 Selection of kernel function and penalty factor

In these experiments, we mainly compare the RBF kernel function with a polynomial kernel function. The sample sizes are 4505, 9025, 13724, and 17960, respectively, and the values of the penalty factor
Contrast experiment of kernel function and penalty factor

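The two kernel functions compared in this subsection can be written out explicitly. The sketch below is illustrative only; the gamma, degree, and coef0 values are assumptions, not the paper's tuned settings.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def poly_kernel(x, z, degree=3, coef0=1.0):
    """Polynomial kernel: K(x, z) = (x . z + coef0) ^ degree."""
    dot = sum(a * b for a, b in zip(x, z))
    return (dot + coef0) ** degree
```

The RBF kernel depends only on the distance between the two samples, while the polynomial kernel depends on their inner product, which is why the two behave differently as the sample size and penalty factor vary.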

- 4.3 Parameter selection in improved incremental algorithm

Historical samples and incremental samples that satisfy the threshold on the distance to the optimal separating hyperplane are reserved to form the training sample set.
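The retention rule can be sketched as follows. This is a hypothetical illustration, assuming the magnitude of the classifier's decision value |f(x)| is used as the measure of distance to the separating hyperplane; the function name and signature are not from the paper.

```python
# Hypothetical sketch of the threshold-based retention rule: samples whose
# decision value |f(x)| falls within the threshold lie near the separating
# hyperplane and are kept for the next round of incremental training.

def reserve_samples(samples, decision_value, threshold):
    """Keep samples (historical or incremental) whose distance to the
    optimal separating hyperplane, measured by |f(x)|, is within threshold."""
    return [x for x in samples if abs(decision_value(x)) <= threshold]
```

For example, with a one-dimensional decision function f(x) = x - 0.5 and threshold 1.0, the samples 0.0 and 0.4 are reserved while 2.0 is discarded, since only the first two lie within distance 1.0 of the hyperplane.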

- 4.4 Experimental comparisons

The IILS algorithm is the core of the DS-IILS algorithm in this experiment; i.e., the IILS is a subfunction of the DS-IILS and is the special case of the DS-IILS when the number of cloud nodes is one. In the experimental comparisons, we therefore compared the DS-IILS with the standard SVM [9], the BatchSVM incremental algorithm [18], and the KKT incremental algorithm [20], excluding the IILS. Again, the comparisons focus on training time and accuracy in order to determine the effectiveness of the DS-IILS algorithm. In Fig. 4 and Fig. 5, the sizes of the initial sample set and of incremental sample sets 1 through 3 are all 5000; the size of the testing sample set is 20000. The kernel function is RBF and the penalty factor

5. Conclusion

This paper reports on our efforts to improve the SVM incremental learning algorithm as it applies to the classification of data streams, specifically to adapt it to the features of the large-scale data streams used for customer behavior analysis. First, we proposed an improved incremental learning algorithm based on SVM theory, in which a threshold on the distance from the optimal separating hyperplane is defined. Historical samples and incremental samples that satisfy the threshold are reserved to construct the training sample set, and the ratio between the historical sample set and the incremental sample set is handled by weighted processing. Second, by taking the load condition and node performances into consideration, we exploited cloud computing capabilities to extend the incremental learning algorithm to large-scale data streams, thereby improving both the training efficiency and the classification accuracy of the system. Finally, the proposed algorithm was implemented on the cloud platform and compared with the standard SVM algorithm, the BatchSVM incremental algorithm, and the KKT incremental learning algorithm in the analysis of customer behaviors. The experimental results show that the DS-IILS algorithm achieves higher training efficiency and classification accuracy for a large-scale data stream, which is consistent with the theoretical analysis. In this paper we studied two-class incremental learning algorithms based on SVM; our future work will address multi-class incremental learning algorithms.
References

[1] Wu W., Gruenwald L., "Research issues in mining multiple data streams," in Proc. of the First International Workshop on Novel Data Stream Pattern Mining Techniques, ACM, July 25, 2010, pp. 56-60.
[2] Yuan Y. B., Lu J., Cao F. L., "Baby formula classification based on forth order polynomial smoothing support vector machine," in Proc. of International Conference on Machine Learning and Cybernetics, July 10-13, 2011, pp. 685-691.
[3] Pang S., Zhu L., Chen G., Sarrafzadeh A., Ban T., Inoue D., "Dynamic class imbalance learning for incremental LPSVM," Neural Networks, vol. 44, pp. 87-100, 2013. DOI: 10.1016/j.neunet.2013.02.007
[4] Ding W., Han Y., Wang J., Zhao Z., "Space reduction for extreme aggregation of data stream over time-based sliding window," in Proc. of 2012 IEEE 5th International Conference on Cloud Computing, June 24-29, 2012, pp. 1002-1003.
[5] Ghazikhani A., Yazdi H. S., Monsefi R., "Class imbalance handling using wrapper-based random oversampling," in Proc. of 2012 20th Iranian Conference on Electrical Engineering (ICEE 2012), May 15-17, 2012, pp. 611-616.
[6] Ditzler G., Polikar R., "Incremental learning of concept drift from streaming imbalanced data," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 10, pp. 2283-2301, 2013. DOI: 10.1109/TKDE.2012.136
[7] Farid D. M., Zhang L., Hossain A., Rahman C. M., Strachan R., Sexton G., Dahal K., "An adaptive ensemble classifier for mining concept drifting data streams," Expert Systems with Applications, vol. 40, no. 15, pp. 5895-5906, 2013. DOI: 10.1016/j.eswa.2013.05.001
[8] Hashemi S., Yang Y., "Flexible decision tree for data stream classification in the presence of concept change, noise and missing values," Data Mining and Knowledge Discovery, vol. 19, no. 1, pp. 95-131, 2009. DOI: 10.1007/s10618-009-0130-9
[9] Tang Y., Zhang Y. Q., Chawla N. V., Krasser S., "SVMs modeling for highly imbalanced classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 1, pp. 281-288, 2009. DOI: 10.1109/TSMCB.2008.2002909
[10] Sun N., Guo Y., "Model on data stream classification with incremental learning," Computer Engineering and Design, vol. 33, no. 11, pp. 4225-4229, 2012.
[11] Couellan N. P., Trafalis T. B., "On-line SVM learning via an incremental primal-dual technique," Optimization Methods and Software, vol. 28, no. 2, pp. 256-275, 2013. DOI: 10.1080/10556788.2011.633705
[12] Zheng J., Shen F., Fan H., Zhao J., "An online incremental learning support vector machine for large-scale data," Neural Computing and Applications, vol. 22, no. 5, pp. 1023-1035, 2013. DOI: 10.1007/s00521-011-0793-1
[13] Hou W., Yang B., Wu C., Zhou Z., "RedTrees: a relational decision tree algorithm in streams," Expert Systems with Applications, vol. 37, no. 9, pp. 6265-6269, 2010. DOI: 10.1016/j.eswa.2010.02.096
[14] Cazzolato M. T., Ribeiro M. X., Yaguinuma C., Santos M. T. P., "A statistical decision tree algorithm for data stream classification," in Proc. of the 15th International Conference on Enterprise Information Systems, July 4-7, 2013, pp. 217-223.
[15] Abdallah Z. S., Gaber M. M., Srinivasan B., Krishnaswamy S., "StreamAR: incremental and active learning with evolving sensory data for activity recognition," in Proc. of 2012 IEEE 24th International Conference on Tools with Artificial Intelligence (ICTAI), November 7-9, 2012, pp. 1163-1170.
[16] Mercy B. V., "Classification and dynamic class detection of real time data for tsunami warning system," in Proc. of 2012 International Conference on Recent Advances in Computing and Software Systems (RACSS), April 25-27, 2012, pp. 124-129.
[17] Yang J., Jiao Y., Xiong N., Park D., "Fast face gender recognition by using local ternary pattern and extreme learning machine," KSII Transactions on Internet and Information Systems, vol. 7, no. 7, pp. 1705-1720, 2013.
[18] Syed N. A., Huan S., Kah L., Sung K., "Incremental learning with support vector machines," in Proc. of the Workshop on Support Vector Machines at the International Joint Conference on Artificial Intelligence (IJCAI 1999), July 31-August 6, 1999, pp. 352-356. http://citeseer.ist.psu.edu/viewdoc/download?doi=10.1.1.46.6367&rep=rep1&type=pdf
[19] Ding W., Xue A., "Fast incremental learning SVM for Web text classification," Application Research of Computers, vol. 29, no. 4, pp. 1275-1278, 2012.
[20] Gonzalez-Mendoza M., Orozco R. E. I., "Quadratic optimization fine tuning for the support vector machines learning phase," Expert Systems with Applications, vol. 41, no. 3, pp. 886-892, 2014. DOI: 10.1016/j.eswa.2013.08.019
[21] Zheng S., Yang C., Hendriks E. A., Wang X., "Adaptive weighted least squares SVM based snowing model for image denoising," International Journal of Wavelets, Multiresolution and Information Processing, vol. 11, no. 6, pp. 1-25, 2013. DOI: 10.1142/S0219691313500434
[22] Xing X., Ji K., Zou H., Sun J., "Feature selection and weighted SVM classifier-based ship detection in PolSAR imagery," International Journal of Remote Sensing, vol. 34, no. 22, pp. 7925-7944, 2013. DOI: 10.1080/01431161.2013.827812
[23] Zhu L., Pang S., Chen G., Sarrafzadeh A., "Class imbalance robust incremental LPSVM for data streams learning," in Proc. of The 2012 International Joint Conference on Neural Networks (IJCNN), June 10-15, 2012, pp. 1-8.
[24] Fergani B., Clavier L., "Importance-weighted the imbalanced data for C-SVM classifier to human activity recognition," in Proc. of 2013 8th International Workshop on Systems, Signal Processing and their Applications (WoSSPA), May 12-15, 2013, pp. 330-335.
[25] Chang C. C., Lin C. J., "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 1-27, 2011. DOI: 10.1145/1961189.1961199
[26] Moro S., Cortez P., Rita P., "A Data-Driven Approach to Predict the Success of Bank Telemarketing," Decision Support Systems, vol. 62, pp. 22-31, 2014. DOI: 10.1016/j.dss.2014.03.001

@article{E1KOBZ_2014_v8n10_3378,
  title={SVM-Based Incremental Learning Algorithm for Large-Scale Data Stream in Cloud Computing},
  author={Wang, Ning and Yang, Yang and Feng, Liyuan and Mi, Zhenqiang and Meng, Kun and Ji, Qing},
  journal={KSII Transactions on Internet and Information Systems (TIIS)},
  publisher={Korean Society for Internet Information},
  volume={8},
  number={10},
  pages={3378--3393},
  year={2014},
  month={Oct},
  url={http://dx.doi.org/10.3837/tiis.2014.10.005},
  doi={10.3837/tiis.2014.10.005}
}