An Efficient Algorithm for Big Data Prediction of Pipelining, Concurrency (PCP) and Parallelism based on TSK Fuzzy Model
Journal of the Korea Institute of Information and Communication Engineering. 2015. Oct, 19(10): 2301-2306
Copyright © 2015, The Korean Institute of Information and Communication Engineering
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
  • Received : August 14, 2015
  • Accepted : September 24, 2015
  • Published : October 31, 2015
Jang-Young Kim
jykim77@suwon.ac.kr

Abstract
As the information age accelerates, the time has come to deal with exabytes of data, and big data transfer technology is essential for processing data at this scale. This paper proposes an algorithm that predicts the optimal combination of Pipelining, Concurrency, and Parallelism (PCP), the major functions of GridFTP, so that big data can be transferred under optimal conditions. In addition, the author introduces a simple design process for the Takagi-Sugeno-Kang (TSK) fuzzy model and designs a model that predicts transfer throughput under the optimal combination of pipelining, concurrency and parallelism. Finally, the author evaluates the proposed algorithm and the TSK model to demonstrate their superiority.
Ⅰ. INTRODUCTION
Big data transfer technologies have come into the spotlight because of the three defining properties of big data: velocity, volume and variety.
As these technologies are actively researched, data transfer techniques in computer networking are also emerging as an important research topic. Although FTP is widely used to transfer data between computers, its throughput can degrade, especially for big data transfers. To address this problem, GridFTP [1] is popularly used; it aims to achieve optimal throughput for large data transfers while enhancing the security, speed and reliability of data transmission. Existing work has tried to find optimal values of Pipelining, Concurrency and Parallelism (PCP) by using historical PCP datasets. This paper therefore proposes an efficient algorithm, together with a Takagi-Sugeno-Kang (TSK) fuzzy model [2, 3], to predict the optimal PCP values and the resulting throughput. PCP comprises the following functions.
Pipelining transmits files continuously without waiting for the response to the previous transmission. Concurrency transmits multiple files at the same time over different channels. Parallelism transmits different parts of the same file at the same time over multiple parallel data channels. All three PCP functions are therefore very useful for large-scale data transmission.
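To make the difference between parallelism and concurrency concrete, the following minimal Python sketch shows how work could be divided in each case. It is only an illustration: the function names `parallel_ranges` and `concurrent_assignment` are hypothetical, the even split and the round-robin assignment are assumptions, and nothing here is part of the GridFTP API. Pipelining concerns command ordering on a channel and is not modelled.

```python
def parallel_ranges(file_size, parallelism):
    """Parallelism: split ONE file into contiguous byte ranges, one per data channel."""
    base, extra = divmod(file_size, parallelism)
    ranges, start = [], 0
    for i in range(parallelism):
        # The first `extra` channels carry one additional byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length))  # half-open range [start, end)
        start += length
    return ranges


def concurrent_assignment(files, concurrency):
    """Concurrency: distribute WHOLE files round-robin over several channels."""
    channels = [[] for _ in range(concurrency)]
    for i, name in enumerate(files):
        channels[i % concurrency].append(name)
    return channels


print(parallel_ranges(10, 3))
print(concurrent_assignment(["a", "b", "c", "d", "e"], 2))
```

Parallelism divides a single large file, whereas concurrency keeps files whole and helps most when there are many of them.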
GridFTP [1] can improve throughput by optimizing the PCP values, its major tunable functions, according to file size, number of files, bandwidth, round-trip time (RTT) and buffer size. Prior studies [4-10] proposed algorithms for finding the optimal combination of PCP values. However, a drawback of these studies is that transfer throughput may decrease because of overhead: the conventional algorithms find the optimal combination of PCP values from information gathered by transferring sample files through the network channel.
Hence, this paper proposes an efficient algorithm, based on a fuzzy model, for predicting the optimal combination of PCP for a given data transfer circumstance. The algorithm relies on abundant measured experimentation datasets that record throughput as a function of PCP in different testbed environments. In addition, the author designs a TSK fuzzy model that predicts the data transfer throughput under the optimal combination of PCP, using the preprocessed experimentation dataset.
Ⅱ. BACKGROUND
The abundant PCP experimentation datasets measured on various testbeds contain more than seventeen factors, such as file size, number of files, bandwidth, round-trip time, buffer size, pipelining, concurrency, parallelism, throughput and transfer duration. The system is complex and difficult to interpret because each factor has a non-linear relationship with the throughput and with the other factors. The PCP background is illustrated in Fig. 2.
In this paper, the author uses clustering, a data classification technique that identifies the nature of the PCP experimentation dataset and incorporates the concept of fuzziness to reflect more specific characteristics of the data. The author designs a model that predicts the highest data transfer throughput under the optimal combination of pipelining, concurrency and parallelism, based on the PCP experimentation dataset. The Takagi-Sugeno-Kang fuzzy model [2, 3] can approximate very complex non-linear systems through fuzzy rule-based inference.
- 2.1. Takagi-Sugeno-Kang Fuzzy Model
The TSK fuzzy model is a fuzzy rule-based inference method for approximating complex non-linear systems. It creates several rules by dividing the input space into fuzzy regions and then infers the final output by applying the degree of fitness of the input to each rule. The steps of the procedure are shown in Fig. 1.
Fig. 1 TSK fuzzy model
Fig. 2 PCP Background
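The rule-based inference described in Section 2.1 can be sketched in a few lines of Python. This is a generic first-order TSK inference step, not the author's trained model: the Gaussian membership functions, the rule parameters and the names `gaussian_membership` and `tsk_predict` are all assumptions made for illustration.

```python
import math


def gaussian_membership(x, center, sigma):
    """Degree (between 0 and 1) to which x belongs to a fuzzy region around `center`."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)


def tsk_predict(inputs, rules):
    """First-order TSK inference: fitness-weighted average of linear rule outputs."""
    weighted_sum = 0.0
    total_strength = 0.0
    for rule in rules:
        # Fitness (firing strength) of the rule: product of the input memberships.
        strength = 1.0
        for x, center, sigma in zip(inputs, rule["centers"], rule["sigmas"]):
            strength *= gaussian_membership(x, center, sigma)
        # Consequent of the rule: a linear function of the inputs.
        output = rule["bias"] + sum(c * x for c, x in zip(rule["coeffs"], inputs))
        weighted_sum += strength * output
        total_strength += strength
    if total_strength == 0.0:
        raise ValueError("no rule fires for this input")
    return weighted_sum / total_strength


# Two hypothetical rules over a single input (values chosen only for illustration).
rules = [
    {"centers": [0.0], "sigmas": [1.0], "coeffs": [0.0], "bias": 10.0},
    {"centers": [2.0], "sigmas": [1.0], "coeffs": [0.0], "bias": 20.0},
]
print(tsk_predict([1.0], rules))  # input midway between both regions
```

An input that lies midway between the two regions fires both rules equally, so the prediction falls halfway between the two rule outputs.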
Ⅲ. PROPOSED ALGORITHM
The proposed algorithm finds the most similar data by comparing a given input with the existing abundant datasets. The most similar data is the entry with the lowest sum of error rates when the factors of the input are compared with the factors of each entry in the dataset. In addition, a maximum operator is added to account for factors that contain outliers.
The proposed algorithm uses the following steps to find the optimal PCP values. First, it calculates the error between each factor of each data entry in the dataset and the corresponding factor of the input data. Second, it calculates the maximum value of each factor. Third, it obtains the error rate by dividing each error by the maximum value of the corresponding factor. Fourth, it calculates the sum of the error rates (SumER) for each data entry. Fifth, it calculates the maximum error rate (MaxER) for each data entry. Sixth, it calculates the final error rate by combining SumER and MaxER using a weight coefficient. Finally, it finds the minimum of the final error rates, and the data entry with that minimum is taken as the most similar to the input data.
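The seven steps above can be sketched in Python as follows. This is one reading of the description rather than the author's implementation. Two points are assumptions: the maximum of each factor is taken over the dataset, and SumER and MaxER are combined with complementary weights `weight` and `1 - weight`. The function name `most_similar` and the sample values are hypothetical.

```python
def most_similar(dataset, query, weight=0.5):
    """Return (index, final error rate) of the dataset row closest to `query`."""
    n_factors = len(query)
    # Step 2 (assumed reading): maximum value of each factor over the dataset.
    maxima = [max(abs(row[j]) for row in dataset) for j in range(n_factors)]

    best_index, best_score = None, float("inf")
    for i, row in enumerate(dataset):
        # Step 1: error of each factor against the input data.
        errors = [abs(row[j] - query[j]) for j in range(n_factors)]
        # Step 3: error rate = error divided by the maximum of that factor.
        rates = [e / m if m else 0.0 for e, m in zip(errors, maxima)]
        sum_er = sum(rates)  # Step 4: SumER
        max_er = max(rates)  # Step 5: MaxER
        # Step 6 (assumed form): weighted combination of SumER and MaxER.
        final_er = weight * sum_er + (1.0 - weight) * max_er
        # Step 7: keep the entry with the minimum final error rate.
        if final_er < best_score:
            best_index, best_score = i, final_er
    return best_index, best_score


# Hypothetical rows: (file size in bytes, number of files, BW in Mbps, RTT in s, buffer size in bytes)
dataset = [
    (1_000_000, 100, 1000, 0.01, 32768),
    (50_000_000, 10, 10000, 0.05, 65536),
    (1_000_000_000, 1, 10000, 0.1, 131072),
]
query = (48_000_000, 11, 9500, 0.05, 65536)
index, score = most_similar(dataset, query)
print(index, score)
```

Because every error is divided by the maximum of its own factor, a file size measured in bytes does not dominate a round-trip time measured in seconds.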
The comparison above works on error rates rather than raw errors. Several factors must be considered simultaneously for each data entry, and if the ranges of the factors differ in scale, a comparison of raw errors cannot determine which data is most similar. By using error rates, the proposed algorithm compensates for the differences in range between factors and can therefore compare data that contain several factors.
Error rates are thus well suited to comparing data whose factors have different scales. In addition, the author applies a maximum operator to the error rates to control how strongly outliers are removed.
Finally, the compared value is the sum of two terms: the sum of the error rates multiplied by a weight coefficient, and the maximum error rate multiplied by a weight coefficient. If the weight coefficient is close to 1, it is difficult to remove outliers because the method is a total comparison. If the weight coefficient is close to 0, outliers are more likely to be removed because the method is a partial comparison. The weight coefficient can be adjusted so that the most similar data are found appropriately.
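One way to write this combination as an equation is given below. The text does not state how the two weights relate, so the complementary form w and 1 - w is an assumption, chosen because it reproduces the two limiting cases described above. The symbols are introduced here for illustration: x_ij is factor j of data entry i, q_j is factor j of the input data, M_j is the maximum value of factor j, and w is the weight coefficient.

```latex
\begin{aligned}
\mathrm{ER}_{ij} &= \frac{\lvert x_{ij} - q_j \rvert}{M_j}, \\
\mathrm{SumER}_i &= \sum_{j} \mathrm{ER}_{ij}, \qquad
\mathrm{MaxER}_i = \max_{j} \mathrm{ER}_{ij}, \\
\mathrm{FinalER}_i &= w \cdot \mathrm{SumER}_i + (1 - w) \cdot \mathrm{MaxER}_i, \qquad 0 \le w \le 1 .
\end{aligned}
```

Under this form, w = 1 reduces to a total comparison over all factors, while w = 0 depends only on the single largest error rate of each data entry.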
Ⅳ. EXPERIMENTAL RESULTS
- 4.1. Proposed algorithm for finding optimal combination of PCP values for the highest throughput based on PCP experimentation dataset
The author applied the proposed algorithm to a PCP experimentation dataset, generated on various testbeds, that records data transfer throughput as a function of file size, number of files, bandwidth, round-trip time, buffer size, pipelining, concurrency and parallelism. The input data describes a given file transmission circumstance. The author compared the characteristics of the input data with those of the data in the PCP experimentation dataset, determined the most similar data entry, and took its PCP values as the optimal combination of pipelining, concurrency and parallelism for that circumstance.
In this experiment, the proposed algorithm used the five factors that most strongly affect data transfer throughput: file size, number of files, bandwidth, round-trip time and buffer size. The PCP experimentation dataset must also be preprocessed before the proposed algorithm is applied to it.
The PCP experimentation dataset contains many entries that share the same file size, number of files, bandwidth, round-trip time and buffer size but differ in their PCP values and throughput. The preprocessed experimentation dataset is therefore constructed by extracting, for every distinct combination of file size (Bytes), number of files, bandwidth (BW, Mbps), round-trip time (RTT, seconds) and buffer size (BS, Bytes), the entry with the highest throughput (Th, Mbps).
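The preprocessing step can be sketched as a group-and-select operation. The record layout (dictionaries with keys such as `file_size` and `throughput`) and the function name `keep_highest_throughput` are assumptions; the actual dataset format is not given in the paper.

```python
def keep_highest_throughput(records):
    """For each combination of the five factors, keep the record with the highest throughput."""
    best = {}
    for rec in records:
        key = (rec["file_size"], rec["num_files"], rec["bandwidth"],
               rec["rtt"], rec["buffer_size"])
        # Replace the stored record only if this one achieved a higher throughput.
        if key not in best or rec["throughput"] > best[key]["throughput"]:
            best[key] = rec
    return list(best.values())


# Hypothetical measurements: the first two share all five factors but differ in PCP.
records = [
    {"file_size": 1_000_000, "num_files": 100, "bandwidth": 1000, "rtt": 0.01,
     "buffer_size": 32768, "pp": 1, "cc": 1, "p": 1, "throughput": 310.0},
    {"file_size": 1_000_000, "num_files": 100, "bandwidth": 1000, "rtt": 0.01,
     "buffer_size": 32768, "pp": 4, "cc": 2, "p": 2, "throughput": 720.0},
    {"file_size": 50_000_000, "num_files": 10, "bandwidth": 10000, "rtt": 0.05,
     "buffer_size": 65536, "pp": 2, "cc": 4, "p": 4, "throughput": 5400.0},
]
preprocessed = keep_highest_throughput(records)
print(len(preprocessed))
```

Each surviving record carries the PCP values that achieved the best observed throughput for its circumstance, which is what the similarity search later returns.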
Table 1 Preprocessed Experimentation Dataset
The testing dataset is constructed randomly, within an error of 10% of the preprocessed experimentation dataset.
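The paper does not say how the random testing data are drawn. A minimal sketch, assuming each factor is perturbed independently and uniformly by at most 10% of its value, could look like this (the function name `perturb` and the record layout are hypothetical):

```python
import random


def perturb(record, factors, max_error=0.10, rng=None):
    """Return a copy of `record` with each listed factor shifted by at most max_error (relative)."""
    rng = rng or random.Random()
    noisy = dict(record)  # the original record is left untouched
    for name in factors:
        # Assumption: uniform relative noise in [-max_error, +max_error].
        # Note that integer factors such as the number of files become fractional.
        noisy[name] = record[name] * (1.0 + rng.uniform(-max_error, max_error))
    return noisy


factors = ["file_size", "num_files", "bandwidth", "rtt", "buffer_size"]
record = {"file_size": 1_000_000.0, "num_files": 100.0, "bandwidth": 1000.0,
          "rtt": 0.05, "buffer_size": 32768.0, "throughput": 850.0}
sample = perturb(record, factors, rng=random.Random(0))
print(sample)
```

Passing a seeded `random.Random` instance makes the generated testing data reproducible.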
Table 2 Testing Dataset
Table 3 Results of the Experiment
The experimental results show that the algorithm determines the most similar data with high accuracy, even though the ranges of the factors differ widely.
- 4.2. TSK fuzzy model for predicting throughput under optimal combination of PCP
The preprocessed experimentation dataset described above is used as training data in the design of the TSK fuzzy model. The testing dataset is constructed from the original PCP experimentation dataset by extracting, for each combination of file size, number of files, bandwidth, round-trip time and buffer size, the entry with the second-highest throughput.
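The construction of this testing dataset can be sketched as follows. The record layout and the function name `second_highest_throughput` are assumptions, as is the decision to skip groups that contain only one measurement, a case the paper does not address.

```python
from collections import defaultdict


def second_highest_throughput(records, key_fields):
    """For each group sharing `key_fields`, return the record ranked second by throughput."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[f] for f in key_fields)].append(rec)

    runners_up = []
    for group in groups.values():
        if len(group) < 2:
            continue  # assumption: a single measurement has no runner-up
        ranked = sorted(group, key=lambda r: r["throughput"], reverse=True)
        runners_up.append(ranked[1])
    return runners_up


key_fields = ["file_size", "num_files", "bandwidth", "rtt", "buffer_size"]
# Hypothetical measurements for one circumstance plus one isolated measurement.
records = [
    {"file_size": 1_000_000, "num_files": 100, "bandwidth": 1000, "rtt": 0.01,
     "buffer_size": 32768, "throughput": 310.0},
    {"file_size": 1_000_000, "num_files": 100, "bandwidth": 1000, "rtt": 0.01,
     "buffer_size": 32768, "throughput": 720.0},
    {"file_size": 1_000_000, "num_files": 100, "bandwidth": 1000, "rtt": 0.01,
     "buffer_size": 32768, "throughput": 655.0},
    {"file_size": 50_000_000, "num_files": 10, "bandwidth": 10000, "rtt": 0.05,
     "buffer_size": 65536, "throughput": 5400.0},
]
testing = second_highest_throughput(records, key_fields)
print(testing)
```

The testing inputs coincide with the training inputs while the target throughput is slightly lower, so the test measures how closely the model tracks near-optimal transfers.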
Ⅴ. CONCLUSION AND FUTURE WORK
In this paper, the author predicted the optimal combination of pipelining, concurrency and parallelism (PCP) for given file transfer circumstances, such as accruing overhead and network saturation, based on experimental datasets measured on various testbeds. It should therefore be feasible to transfer large-scale data at optimal throughput with GridFTP by using the optimal combination of PCP.
In addition, the author designed a model that predicts file transfer throughput under the optimal combination of PCP by using TSK fuzzy rule-based inference. In future work, the author will optimize the weight of the error rate in the proposed algorithm based on an objective function. The resulting optimization algorithm is expected to find the optimal values of PCP and throughput more efficiently.
BIO
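Jang-Young Kim
Feb. 2005: B.Eng. in Computer Science, Yonsei University
May 2010: M.Eng., Pennsylvania State Univ.
Jul. 2013: Ph.D. in Engineering, State University of New York
Aug. 2013: Assistant Professor, University of South Carolina
Mar. 2014: Assistant Professor, Department of Computer Science, University of Suwon
※ Research interests: Big data, Cloud computing, Networks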
References
[1] GridFTP, Globus Online. http://www.globus.org
[2] Takagi T., Sugeno M. (1985) "Fuzzy identification of systems and its applications to modeling and control," IEEE Trans. Syst., Man, Cybern. 15, 116-132. DOI: 10.1109/TSMC.1985.6313399
[3] Sugeno M., Yasukawa T. (1993) "A fuzzy-logic-based approach to qualitative modeling," IEEE Trans. Fuzzy Syst. 1, 7-31. DOI: 10.1109/TFUZZ.1993.390281
[4] Yildirim E., Kim J., Kosar T. (2012) "Optimizing the sample size for a cloud-hosted data scheduling service," Proc. 2nd International Workshop on Cloud Computing and Scientific Applications (CCSA, in conjunction with CCGRID12).
[5] Kim J., Yildirim E., Kosar T. (2012) "A highly-accurate and low-overhead prediction model for transfer throughput optimization," Proc. of DISCS Workshop, November 2012.
[6] Allen B., Bresnahan J., Childers L., Foster I., Kandaswamy G., Kettimuthu R., Kordas J., Link M., Martin S., Pickett K., Tuecke S. (2012) "Software as a service for data scientists," Communications of the ACM 55(2), 81-88. DOI: 10.1145/2076450.2076468
[7] Yildirim E., Kim J., Kosar T. (2013) "Modeling Throughput Sampling Size for a Cloud-hosted Data Scheduling and Optimization Service," Future Generation Computer Systems (FGCS) 29(7), 1795-1807. DOI: 10.1016/j.future.2013.01.003
[8] Yildirim E., Kim J., Kosar T. (2012) "How GridFTP Pipelining, Parallelism and Concurrency Work: A Guide for optimizing large dataset transfers," Proceedings of IEEE/ACM Supercomputing'12 Workshop on Network-Aware Data Management (NDM 2012), Salt Lake City, UT, November 2012 (Best Paper Award).
[9] Yildirim E., Balman M., Kosar T. (2012) "Data-intensive Distributed Computing: Challenges and Solutions for Large-scale Information Management," ch. Data-aware Distributed Computing, IGI-Global.
[10] Yildirim E., Yin D., Kosar T. (2011) "Prediction of optimal parallelism level in wide area data transfers," IEEE Transactions on Parallel and Distributed Systems 22(12), 2033-2045. DOI: 10.1109/TPDS.2011.228