Document Classification Model Using Web Documents for Balancing Training Corpus Size per Category
Journal of Information and Communication Convergence Engineering. 2013. Dec, 11(4): 268-273
Copyright ©2013, The Korean Institute of Information and Communication Engineering
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
  • Received : July 08, 2013
  • Accepted : September 27, 2013
  • Published : December 31, 2013
About the Authors
So-Young Park
Juno Chang
Taesuk Kihl
tsroad@smu.ac.kr

Abstract
In this paper, we propose a document classification model using Web documents as a part of the training corpus in order to resolve the imbalance of the training corpus size per category. For the purpose of retrieving the Web documents closely related to each category, the proposed document classification model calculates the matching score between word features and each category, and generates a Web search query by combining the higher-ranked word features and the category title. Then, the proposed document classification model sends each combined query to the open application programming interface of the Web search engine, and receives the snippet results retrieved from the Web search engine. Finally, the proposed document classification model adds these snippet results as Web documents to the training corpus. Experimental results show that the method that considers the balance of the training corpus size per category exhibits better performance in some categories with small training sets.
I. INTRODUCTION
Recently, a huge number of documents have been produced and stored in digital archives [1] . Therefore, document classification, the task of assigning a document to a predefined category, plays a very important role in many information management and retrieval tasks. Most digital documents are frequently updated, and writers disseminate information and present their ideas on various topics [2] . Unlike news articles written by well-educated journalists and standardized according to official style guides, most digital documents, such as weblogs and Twitter tweets, tend to contain colloquial sentences and slang that mislead the classifier, because anybody can publish anything [3 - 6] . To address these difficulties of document classification, researchers have proposed the following approaches.
First, naive Bayes-based approaches have performed well on tasks such as spam filtering and email categorization, and they require only a small training corpus to estimate the parameters necessary for document classification [4] . However, the conditional independence assumption is violated by real-world data, and the model does not consider the frequency of word occurrences [7] . Therefore, these approaches perform very poorly when features are highly correlated [8] .
Second, support vector machine (SVM)-based approaches have been recognized as some of the most effective document classification methods among supervised machine learning algorithms. Nevertheless, these approaches have certain difficulties with respect to parameter tuning and kernel selection [7] . Further, the performance of the SVM, which is best suited to binary classification, degrades rapidly as the amount of training data decreases, resulting in relatively poor performance on large-scale datasets with many labels [8] .
Third, knowledge-based approaches have utilized knowledge, such as the meanings of words and the relationships between them, obtained from an ontology such as WordNet [8] . However, considering the inevitable incompleteness of an ontology, the effect of word sense and relation disambiguation is quite limited [9 , 10] .
Fourth, Web-based approaches have been used to mine desired information, such as popular opinions, from Web documents obtained by traversing the Web for information related to a particular topic of interest. However, Web crawlers attempt to search the entire Web, which is impossible because of the size and complexity of the World Wide Web [11] .
In this paper, we propose a document classification model that resolves the imbalance of the training corpus size per category by utilizing Web documents. The rest of this paper is organized as follows: Section II presents an overview of the proposed model, consisting of a prediction phase that assigns a category to a given document and a learning phase that balances the training corpus size by adding Web documents. Section III presents some experimental results, and Section IV concludes the paper with a summary of the characteristics of the proposed method.
II. WEB-BASED DOCUMENT CLASSIFICATION MODEL
As shown in Fig. 1 , the proposed Web-based document classification model is composed of a prediction phase and a learning phase. Given an unlabeled document in the prediction phase, the feature extraction step extracts word features, such as unigrams and bigrams, from the document. Then, the document classification step predicts the category of the document by utilizing the learned parameters. On the other hand, the learning phase obtains these learned parameters, such as the word-frequency distribution, from the category-labeled documents. Section II-A introduces the document classification method, and Section II-B describes the Web document acquisition method in detail.
Fig. 1. Proposed Web-based document classification model.
- A. Document Classification Method
The proposed document classification method selects the category c_i with the highest matching probability for the given unlabeled document D , as represented in Eq. (1), where c_i denotes an element of the category set C :

    c* = argmax_{c_i ∈ C} P(c_i | d)    (1)

In order to deal with the document classification problem mathematically, the proposed model represents the unlabeled document D as the document vector d = (d_1, d_2, ..., d_k) consisting of the feature values extracted from the document by the feature extraction step. Each feature value indicates the number of times that the corresponding word feature, such as the bigram “arcade game”, appears in the document [12] . Most elements of the document vector take a zero value.
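The feature extraction step can be sketched as follows; this is an illustrative implementation only (the function name and the naive whitespace tokenization are our assumptions, not MALLET's):

```python
from collections import Counter

def extract_features(text):
    """Count unigram and bigram word features in a document.

    Returns a sparse document vector as a Counter mapping each word
    feature (e.g., the bigram "arcade game") to its frequency;
    features absent from the document are implicitly zero.
    """
    tokens = text.lower().split()  # naive whitespace tokenization
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)
```

For example, extract_features("Arcade game with arcade classics") counts the unigram "arcade" twice and the bigram "arcade game" once, while every unseen feature stays at zero, reflecting the sparsity of the document vector.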
In the maximum entropy framework [12 - 15] , the conditional probability of predicting an outcome ci given the history d is defined as follows:
    P(c_i | d) = exp( Σ_{i=1..k} λ_i f_i(d, c_i) ) / Z(d)    (2)

    Z(d) = Σ_{c ∈ C} exp( Σ_{i=1..k} λ_i f_i(d, c) )    (3)
Table 1. Some word features selected on the basis of the chi-square statistics (SNS: social network service).
where f_i(d, c_i) denotes the feature function, λ_i represents the weighting parameter of f_i(d, c_i) , k refers to the number of features, and Z(d) indicates the normalization factor defined in Eq. (3) [13 - 15] .
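Eqs. (2) and (3) can be computed as in the following sketch; the data layout is our assumption, and a real maximum entropy toolkit such as MALLET also learns the weights λ_i, which this fragment simply takes as given:

```python
import math

def maxent_probability(feature_values, weights):
    """Compute P(c_i | d) of Eq. (2) for every category.

    feature_values: {category: [f_1(d, c), ..., f_k(d, c)]}, the feature
                    function values for document d and each category
    weights:        [lambda_1, ..., lambda_k], the learned parameters
    Returns {category: probability}, normalized by Z(d) as in Eq. (3).
    """
    scores = {c: math.exp(sum(l * f for l, f in zip(weights, fs)))
              for c, fs in feature_values.items()}
    z = sum(scores.values())  # normalization factor Z(d)
    return {c: s / z for c, s in scores.items()}
```

The exponentials of the weighted feature sums are divided by Z(d) so that the probabilities over all categories sum to one.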
- B. Web Document Acquisition Method
In order to balance the training corpus size per category, the proposed Web-based document classification model increases the number of labeled documents by utilizing a Web search engine to retrieve the Web documents.
In the query generation step shown in Fig. 1 , the proposed method first selects some useful word features per category on the basis of the chi-square statistics [16 , 17] . The chi-square statistics of word feature fj in the category ci is defined as follows:
    χ²(f_j, c_i) = N (AD − CB)² / [ (A + C)(B + D)(A + B)(C + D) ]    (4)
where A denotes the number of documents containing the word feature f_j in the category c_i , B represents the number of documents containing f_j in categories other than c_i , C indicates the number of documents in c_i not containing f_j , D refers to the number of documents not containing f_j in categories other than c_i , and N denotes the total number of documents. The chi-square statistic of each word feature f_j is computed for every category, and the word features with the top n highest chi-square statistics are used as query candidates. Table 1 shows some examples of word features selected for categories such as health, music, shopping, social network service (SNS), and sports according to the chi-square statistics.

Fig. 2. Document distribution according to categories.
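Eq. (4) can be evaluated directly from the four document counts; the sketch below additionally guards against a zero denominator, a detail the paper does not discuss:

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of word feature f_j for category c_i (Eq. 4).

    a: documents in c_i containing f_j
    b: documents outside c_i containing f_j
    c: documents in c_i not containing f_j
    d: documents outside c_i not containing f_j
    """
    n = a + b + c + d  # total number of documents N
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0
```

A feature that occurs only inside the category scores high (e.g., chi_square(10, 0, 0, 10) is 20), while a feature distributed evenly across categories scores zero.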
Given the selected word features per category, the proposed Web document acquisition method generates queries for the Web search engine. Considering that the lower-ranked word features can be less related to the characteristics of the category, the proposed method combines two of the higher-ranked word features and the title of the category for the purpose of retrieving the Web documents closely related to the category. Therefore, each query consists of three word features: the category title, one of the top 5 ranked word features, and one of the top 200 ranked word features excluding the top 5. For each category, the proposed query generation step yields at most 975 queries by pairing the 5 higher-ranked word features with the 195 lower-ranked word features.
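The query generation step can be sketched as follows, assuming the word features are given in descending chi-square order (the function and parameter names are ours):

```python
from itertools import product

def generate_queries(category_title, ranked_features, top=5, limit=200):
    """Combine the category title with pairs of ranked word features.

    Each query joins the category title, one of the top-`top` features,
    and one feature ranked between `top` and `limit`, yielding at most
    top * (limit - top) queries (975 with the paper's settings).
    """
    high = ranked_features[:top]        # top 5 word features
    low = ranked_features[top:limit]    # ranks 6..200 (195 features)
    return [f"{category_title} {h} {l}" for h, l in product(high, low)]
```

With 200 ranked features per category, this produces exactly 5 × 195 = 975 three-term queries.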
Finally, the proposed Web document acquisition method sends each combined query to the open application programming interface (API) of the Web search engine, and receives the snippet results retrieved from the Web search engine. The proposed Web document acquisition method treats the snippet results of each query as one Web document.
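A sketch of this acquisition step is given below. The paper does not name the search engine or its API, so the endpoint URL and the "items"/"snippet" response fields are purely illustrative assumptions:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint: the actual open API used in the paper is not specified.
SEARCH_API_URL = "https://api.example.com/search?q="

def snippets_to_document(response):
    """Join the snippet results of one query into a single Web document,
    as the proposed acquisition method does."""
    return " ".join(item["snippet"] for item in response.get("items", []))

def fetch_web_document(query):
    """Send one combined query to the (assumed) search API and return
    its snippet results as one Web document."""
    url = SEARCH_API_URL + urllib.parse.quote(query)
    with urllib.request.urlopen(url) as resp:
        return snippets_to_document(json.load(resp))
```

Each of the up to 975 queries per category would thus contribute at most one labeled Web document to the training corpus.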
III. EXPERIMENTS
In order to demonstrate the validity of utilizing the Web documents, we tested the MALLET document classification package [12] with a mobile application description document corpus [18] , which is divided into 90% for the training set and 10% for the test set. The document corpus consists of 3,521 documents and 302,772 words. Each document has 3 to 515 words, and a document is composed of 85.99 words on average. As described in Fig. 2 , the distribution of documents among categories in the corpus is not balanced. For instance, the documents corresponding to the Utility category account for roughly 21% of the corpus, while those corresponding to the Music category account for only 0.28%.

Fig. 3. Performance variation by the addition of Web documents.
The proposed method is evaluated on the basis of the following criteria: precision, recall, and F-measure. Precision indicates the ratio of correct candidate categories to the candidate categories predicted by the proposed document classification model. Recall indicates the ratio of correct candidate categories to the total number of categories of the documents in the corpus. The F-measure is the harmonic mean of the precision and the recall.
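The three criteria can be computed as in the following sketch, which assumes one predicted category per document, in which case precision and recall coincide:

```python
def evaluate(predicted, gold):
    """Precision, recall, and F-measure over predicted category labels.

    predicted: list of predicted categories (one per document)
    gold:      list of correct categories (one per document)
    """
    correct = sum(p == g for p, g in zip(predicted, gold))
    precision = correct / len(predicted)
    recall = correct / len(gold)
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```

Because the denominators are both the number of documents here, the F-measure equals precision and recall whenever exactly one label is predicted per document.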
Fig. 3 shows that the performance is improved by adding Web documents; each performance value is an average over 10-fold cross-validation. In this figure, the y-axis indicates the recall, and the x-axis denotes the number of Web documents added. Because the proposed document classification model assigns a category to every given document, the precision is the same as the recall. The baseline performance without any document classification model is 21%, because the documents corresponding to the Utility category account for roughly 21% of the corpus.

Fig. 4. Performance difference between baseline and addition of Web documents.
The Equal size performance indicates the recall of the MALLET document classification model learned from the training set with a different number of Web documents per category, chosen to balance the training corpus size. The Equal addition performance indicates the recall of the MALLET model learned from the training set with an equal number of Web documents per category [19] . For example, the Equal size method does not add any Web documents to categories, such as Utility, that already have many documents, while it adds many Web documents to categories, such as Music, that have few documents. On the other hand, the Equal addition method always adds the same number of Web documents to each category, according to the x-axis value.
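The two strategies can be contrasted in a small sketch. The exact target size used by the Equal size method is not given in the paper, so topping each category up toward the largest one, capped by the x-axis value n, is our assumption:

```python
def web_documents_to_add(category_sizes, strategy, n):
    """How many Web documents to add per category.

    'equal_addition' adds n Web documents to every category;
    'equal_size' tops each category up toward the largest one,
    adding at most n documents (an illustrative balancing rule).
    """
    if strategy == "equal_addition":
        return {c: n for c in category_sizes}
    target = max(category_sizes.values())  # size of the largest category
    return {c: max(0, min(n, target - s)) for c, s in category_sizes.items()}
```

Under this rule a large category such as Utility receives no Web documents, while a sparse category such as Music receives up to n.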
Fig. 3 shows that the Equal size method is slightly more effective than the Equal addition method [19] : by balancing the number of documents per category, the document classification model reduces the tendency of an ambiguous or less informative document to be assigned to the category with relatively more documents. Further, Fig. 3 shows that the performance does not always increase with the addition of Web documents, because the characteristics of mobile application description documents are considerably different from those of Web documents, which are very sensitive to the Web search queries. Moreover, the performance does not generally increase much upon the addition of Web documents, as most category prediction results in the test set are biased towards the few categories with many documents, such as the Utility, Puzzle/Board game, and Health/Fitness categories, whereas the Web documents improve the classification performance for the categories with few documents, such as the Music, Testurself, and SNS categories.
Fig. 4 shows the effects of the Web documents according to category at the peak performance, obtained by adding 150 Web documents per category. It indicates that the effects of the Web documents can differ considerably between categories, although the overall performance did not increase much upon their addition. In particular, the recall and the precision of the Music category increased by 100% because the document classification model without Web documents did not predict the Music category at all, whereas the model with Web documents predicted the correct Music category once. Categories with fewer mobile application description documents, such as the Testurself and SNS categories, showed increased recall because the probability of predicting these categories increased upon adding the corresponding Web documents. On the other hand, categories with many documents, such as the Utility and Health/Fitness categories, showed increased precision because the possibility of over-predicting these categories decreased. However, the recall of the Food and Education categories decreased because these categories had too few documents for the proposed method to find appropriate queries for Web document retrieval.
IV. CONCLUSIONS
In this paper, we propose a document classification model utilizing Web documents. The proposed model has the following characteristics. First, it can adjust the balance of the training corpus size per category by adding Web documents to the training corpus. Second, it can retrieve Web documents closely related to each category by combining two of the higher-ranked word features and the category title, according to the matching score between the word features and the category. Experimental results show that the proposed model improves the performance in some categories with small training sets by balancing the number of documents per category. In the future, we intend to compare the effects of feature selection methods such as mutual information and information gain. Further, we plan to automatically generate word clusters and apply them to the proposed model.
Acknowledgements
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2012R1A1A3013405). Further, this paper is an expanded version of the paper “Application of Web Search Results for Document Classification” published in the Proceedings of ICFICE 2013.
References
[1] Nyberg K., Raiko T., Tiinanen T., Hyvonen E. 2010 “Document classification utilising ontologies and relations between documents” in Proceedings of the 8th Workshop on Mining and Learning with Graphs, Washington, DC, 86-93
[2] Ayyasamy R. K., Tahayna B., Alhashmi S., Eu-Gene S., Egerton S. 2010 “Mining Wikipedia knowledge to improve document indexing and classification” in Proceedings of the 10th International Conference on Information Sciences, Signal Processing and their Applications, Kuala Lumpur, Malaysia, 806-809
[3] Ferreira R., Freitas F., Brito P., Melo J., Lima R., Costa E. 2013 “RetriBlog: an architecture-centered framework for developing blog crawlers” Expert Systems with Applications 40(4), 1177-1195. DOI: 10.1016/j.eswa.2012.08.020
[4] Park S., Kim C. W., An D. U. 2009 “E-mail classification and category reorganization using dynamic category hierarchy and PCA” Journal of Information and Communication Engineering 7(3), 351-355
[5] Yun H. 2011 “Classifying temporal topics with similar patterns on Twitter” Journal of Information and Communication Engineering 9(3), 295-300. DOI: 10.6109/jicce.2011.9.3.295
[6] Yun H. 2012 “Quantifying influence in social networks and news media” Journal of Information and Communication Convergence Engineering 10(2), 135-140. DOI: 10.6109/jicce.2012.10.2.135
[7] Baharudin B., Lee L. H., Khan K. 2010 “A review of machine learning algorithms for text-documents classification” Journal of Advances in Information Technology 1(1), 4-20. DOI: 10.4304/jait.1.1.4-20
[8] Rubin T. N., Chambers A., Smyth P., Steyvers M. 2012 “Statistical topic models for multi-label document classification” Machine Learning 88(1-2), 157-208. DOI: 10.1007/s10994-011-5272-5
[9] Lu G., Huang P., He L., Cu C., Li X. 2010 “A new semantic similarity measuring method based on Web search engines” WSEAS Transactions on Computers 9(1), 1-10
[10] Jialei Z., Hwang C. G., Jung G. D., Choi Y. K. 2011 “A design of K-XMDR search system using topic maps” Journal of Information Communication Engineering 9(3), 287-294. DOI: 10.6109/jicce.2011.9.3.287
[11] Samarawickrama S., Jayaratne L. 2011 “Automatic text classification and focused crawling” in Proceedings of the 6th International Conference on Digital Information Management, Melbourne, Australia, 143-148
[12] McCallum A. K. MALLET: a machine learning for language toolkit [Internet]. Available: http://mallet.cs.umass.edu
[13] Berger A. L., Della Pietra V. J., Della Pietra S. A. 1996 “A maximum entropy approach to natural language processing” Computational Linguistics 22(1), 39-71
[14] Lim J. H., Hwang Y. S., Park S. Y., Rim H. C. 2004 “Semantic role labeling using maximum entropy model” in Proceedings of the Conference on Computational Natural Language Learning, Boston, MA, 122-125
[15] Tan H., Zhao T., Wang H., Hong W. P. 2007 “Identification of Chinese event types based on local feature selection and explicit positive & negative feature combination” International Journal of the Korean Institute of Maritime Information and Communication Sciences 5(3), 233-238
[16] Yang Y., Pedersen J. O. 1997 “A comparative study on feature selection in text categorization” in Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, 412-420
[17] Seki K., Mostafa J. 2005 “An application of text categorization methods to gene ontology annotation” in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, 138-145
[18] Kihl T., Chang J., Park S. Y. 2012 “Application tag system based on experience and pleasure for hedonic searches” in Convergence and Hybrid Information Technology, Springer, Heidelberg, Germany, 342-352
[19] Park S. Y., Chang J., Kihl T. 2013 “Application of Web search results for document classification” in Future Information Communication Technology and Applications, Springer, Heidelberg, Germany, 293-298