Advanced
Enhancing Text Document Clustering Using Non-negative Matrix Factorization and WordNet
Enhancing Text Document Clustering Using Non-negative Matrix Factorization and WordNet
Journal of Information and Communication Convergence Engineering. 2013. Dec, 11(4): 241-246
Copyright ©2013, The Korean Institute of Information and Commucation Engineering
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/li-censes/bync/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
  • Received : December 28, 2012
  • Accepted : March 22, 2013
  • Published : December 31, 2013
Download
PDF
e-PUB
PubReader
PPT
Export by style
Share
Article
Author
Metrics
Cited by
TagCloud
About the Authors
Chul-Won Kim
Department of Computer Engineering, Honam University, Gwangju 506-714, Korea
Sun Park
Networked Computing System Lab., Gwangju Institute of Science and Technology, Gwangju 500-712, Korea
sunpark@nm.gist.ac.kr

Abstract
A classic document clustering technique may incorrectly classify documents into different clusters when documents that should belong to the same cluster do not have any shared terms. Recently, to overcome this problem, internal and external knowledge-based approaches have been used for text document clustering. However, the clustering results of these approaches are influenced by the inherent structure and the topical composition of the documents. Further, the organization of knowledge into an ontology is expensive. In this paper, we propose a new enhanced text document clustering method using non-negative matrix factorization (NMF) and WordNet. The semantic terms extracted as cluster labels by NMF can represent the inherent structure of a document cluster well. The proposed method can also improve the quality of document clustering that uses cluster labels and term weights based on term mutual information of WordNet. The experimental results demonstrate that the proposed method achieves better performance than the other text clustering methods.
Keywords
I. INTRODUCTION
Traditional document clustering methods are based on the bag of words (BOW) model, which represents documents with features, such as weighted term frequencies. However, these methods ignore semantic relationships between terms within a document set. The clustering performance of the BOW model is dependent on a distance measure of document pairs. However, this distance measure cannot reflect the real distance between two documents because the documents are composed of high-dimension terms with respect to complicated document topics. Further, the results of clustering documents are influenced by the document properties or the cluster forms desired by the user. Recently, internal and external knowledge-based approaches have been used for overcoming the problems of the vector model-based document clustering method [1 - 3] .
Internal knowledge-based document clustering determines the inherent structure of a document set by using a factorization technique [4 - 9] . These methods have been studied intensively, and although they have many advantages, the successful construction of semantic features from the original document set remains limited with respect to the organization of very different documents or the composition of similar documents [10] .
External knowledge-based document clustering exploits the term ontology constructed using an external knowledge database with respect to Wikipedia or WordNet [11 - 14] . The term ontology techniques can improve the BOW term representation of document clustering. However, it is often difficult to locate a comprehensive ontology that covers all the concepts mentioned in the document collection, which leads to a loss of information [1 , 12] . Moreover, the ontology-based method incurs a relatively high cost as the ontology has to be constructed manually by knowledge engineers and domain experts.
In order to resolve the limitations of knowledge-based approaches, in this paper, we propose a text document clustering method that uses non-negative matrix factorization (NMF) and WordNet. The proposed method combines the advantages of the internal and external knowledge-based methods. In the proposed method, first, meaningful terms of a cluster for describing the cluster topics of documents are extracted using NMF. The extracted terms well represent the document clusters through semantic features (i.e., internal knowledge) that have the inherent structure of the documents. Second, the term weights of documents are calculated using the term mutual information (TMI) of the synonyms of documents terms obtained from WordNet (i.e., external knowledge). The term weights can easily classify documents into an appropriate cluster by extending the coverage of a document with respect to a cluster label.
The rest of this paper is organized as follows: Section II reviews related works on text document clustering. Section III describes the NMF algorithm. Section IV presents the proposed text document clustering method. Finally, Section V presents the evaluation and experimental results, and Section VI concludes this paper.
II. RELATED WORKS
Recently, a knowledge-based document clustering method, which is used for increasing the efficiency of document clustering, has been proposed; the techniques used in the method can be divided into internal and external knowledgebased techniques.
As an internal knowledge-based approach, Li et al. [8] pro-posed a document clustering algorithm, called the adaptive subspace iteration (ASI), which explicitly models the subspace structure and works well for high-dimensional data. This is influenced by the composition of the document set for document clustering. To overcome the orthogonal problem of latent semantic indexing (LSI), Xu et al. [4] proposed a document partitioning method based on NMF in the given document corpora. The results from the abovementioned method have a stronger semantic interpretation than those from LSI, and the clustering result can be derived easily using the semantic features of NMF. However, this method cannot be kernelized because the NMF must be performed in the original feature space of the data points. Wang et al. [9] used clustering with local and global regularization (CLGR), which uses local label predictors and global label smoothing regularizers. They achieved satisfactory results because the CLGR algorithm uses fixed neighborhood sizes. However, the different neighborhood sizes cause the final clustering results to deteriorate [9] .
The external knowledge-based techniques for document clustering include TMI with conceptual knowledge by WordNet [11] , concept mapping schemes from Wikipedia [12] , concept weighting from domain ontology [13] , and fuzzy associations with condensing cluster terms by WordNet [14] .
III. NON-NEGATIVE MATRIX FACTORIZATION
This section reviews the NMF theory along with the corresponding algorithm. In this paper, we define the matrix notation as follows: Let X*j be the j -th column vector of matrix X, Xi* be the i th row vector, and Xij be the element of the i th row and the j -th column. NMF decomposes a given m × n matrix A into a non-negative semantic feature matrix W and a non-negative semantic variable matrix H , as shown in Eq. (1) [10] .
Lager Image
where W denotes an m × r non-negative matrix and H represents an r × n non-negative matrix. Usually, r is chosen to be smaller than m or n ; hence, the total sizes of W and H are smaller than the size of the original matrix A .
Further, an objective function is used for minimizing the Euclidean distance between each column of A and its approximation à =WH ; this function was proposed by Lee and Seung [10] . As the objective function, the following Frobenius norm is used:
Lager Image
W and H are continuously updated until Θ E ( W,H ) converges under the predefined threshold or exceeds the set number of repetitions. The update rules are as follows:
Lager Image
Lager Image
IV. PROPOSED DOCUMENT CLUSTERING METHOD
This paper proposes a document clustering method that uses cluster label terms generated by NMF and term weights based on the TMI of WordNet. The proposed method consists of three phases: preprocessing, extraction of cluster terms and term weights, and document clustering. In the subsections below, each phase is explained in full.
- A. Preprocessing
In the preprocessing phase, Rijsbergen’s stop words list is used for removing all stop words, and word stemming is removed using Porter stemming algorithm [15] . Then, the term document frequency matrix A is constructed from the document set.
- B. Cluster Term Extraction and Term Weight Calculation
This section consists of two phases: cluster term extraction and term weight calculation. The cluster terms corresponding to the properties of the document clusters are extracted by using the semantic features of NMF; these terms can explain the topic of the document cluster well.
The extraction method can be described as follows: First, the term document frequency matrix A is constructed by executing the preprocessing phase. Second, the number of clusters (i.e., the number of semantic features r ) is set, and NMF is performed on matrix A to decompose the two sematic feature matrices W and H . Finally, matrix W and Eq. (5) are used for extracting the cluster terms. The column vector of matrix W corresponds to the cluster, and the row vector of matrix W refers to the terms of the document; that is, an element of matrix W (i.e., the semantic feature value) indicates the extent to which the term reflects the cluster. The equation for extracting cluster terms is as follows:
Lager Image
where Cp denotes the term set of the p th cluster and Aij represents the term corresponding to the semantic feature of the i -th row and the j -th column in matrix W . The average semantic feature value, asf , is as follows:
Lager Image
where m denotes the total number of rows (i.e., the number of terms) and n represents the total number of columns (i.e., the number of clusters).
Example 1 illustrates the cluster term extraction.
Example 1. Table 1 shows the six documents (i.e., the extracted the [2] ’s Figure 4.10). Table 2 shows the term document frequency matrix generated in the preprocessing phase, described in Table 1 . Table 3 presents the sematic feature matrix W obtained through NMF from Table 2 , and the result of the average of non-zero elements of the semantic features vector asf calculated using Eq. (6). Table 4 shows the results of the extracted cluster terms from Table 3 , which match the semantic feature values greater than the average semantic feature value asf .
The term weights are calculated using TMI based on the synonyms of WordNet. WordNet is a lexical database for the English language where words (i.e., terms) are grouped in synsets consisting of synonyms and thus representing a specific meaning of a given term [16] .
Document set of composition of six documents
Lager Image
Document set of composition of six documents
Term document frequency matrix fromTable 1
Lager Image
Term document frequency matrix from Table 1
Theasfand semantic feature matrixWby NMF fromTable 2NMF: non-negative matrix factorization.
Lager Image
The asf and semantic feature matrix W by NMF from Table 2 NMF: non-negative matrix factorization.
Extracted cluster terms
Lager Image
Extracted cluster terms
Class label terms may be restricted by the properties of a document cluster and the document composition. To resolve this problem, in this study, we use the term weight of documents by using the TMI on synonyms obtained from WordNet. Term weights of the document are calculated by using Jing’s TMI as in Eq. (7) [11] . In the equation for Jing’s TMI, δ il indicates the semantic information between two terms. If term Alj appears in the synonyms of Aij obtained from WordNet, δ il will be treated in the same level for different Aij and Alj , otherwise, δ il will be set to zero.
Lager Image
- C. Document Clustering
This section explains document clustering using cosine similarity between the cluster terms and the term weights of the documents. The proposed method is described as follows: First, the cosine similarity between the cluster terms and the term weights is calculated using Eq. (8). Then, the document having the highest similarity value with respect to the class label is added to a document cluster [3 , 15] .
The cosine similarity function between the sentence vectors and the query is computed as follows [15] :
Lager Image
where A*a and A*b denote the a th document and the b th document, respectively. Further, m denotes the number of terms.
V. EXPERIMENTS AND EVALUATION
In this study, we use the dataset of 20 newsgroups for the performance evaluation [17] . To evaluate the proposed method, mixed documents were randomly chosen from the abovementioned dataset. A normalized mutual information metric related to Eqs. (9) and (10) was used for measuring the document clustering performance [2 - 4 , 7 - 9] . The cluster numbers for the evaluation method were set in the range of 2 to 10, as shown in Fig. 1 . For each given cluster number k , 50 experiments were performed on different randomly selected clusters, and the final performance values were the average of the values obtained from these experiments.
The normalized mutual information metric
Lager Image
was used for measuring the document clustering performance [2 - 4 , 7 - 9] . To measure the similarity between the two sets of document clusters C = { c1, c2, ..., ck } and C' = { c'1, c'2, ..., c'k }, the following mutual information metric MI ( C,C' ) was used:
Lager Image
where p ( ci ) and p ( c'j ) denote the probabilities that a document arbitrarily selected from the corpus belongs to ci and c'j , respectively, and p ( ci, c'j ) denotes the joint probability that the selected document simultaneously belongs to ci and c'j . MI ( C, C' ) takes values between zero and max ( H ( C ), H ( C' )), where H ( C ) and H ( C' ) are the entropies of C and C' , respectively. The metric does not need to locate the corresponding counterpart in C' , and the value is maintained for all permutations. A normalized metric MI , which takes values between zero and one, was used as shown in Eq. (10) [2 - 4 , 7 - 9] :
Lager Image
In this study, seven different document clustering methods were implemented, as shown in Fig. 1 . The NMF, ASI, CLGR, RNMF, and FPCA methods are document clustering methods based on internal knowledge, and the FAWDN and TMINMF methods are clustering methods based on a combination of the internal and the external knowledge. WNMF denotes the proposed method described in this paper. FAWDN denotes the previously proposed method that is based on WordNet and fuzzy theory [14] . FPCA is the previously proposed method based on principal component analysis (PCA) and fuzzy relationships [6] , and RNMF is the method proposed previously using NMF and cluster refinement [5] . NMF denotes Xu’s method using NMF [4] . ASI is Li’s method using adaptive subspace iteration [8] . Lastly, CLGR denotes Wang’s method using local and global regularization [9] .
As seen in Fig. 1 , the average normalized metric of WNMF is 14.99% higher than that of NMF, 14.48% higher than that of ASI, 9.21% higher than that of CLGR, 6.66% higher than that of RNMF, 4.80% higher than that of FPCA, and 3.98% higher than that of FAWDN.
VI. CONCLUSION
In this paper, we proposed an enhanced text document clustering method using NMF and WordNet. The proposed method uses the semantic features of the document on the basis of the internal knowledge of NMF for extracting the
Lager Image
Evaluation results of performance comparison. NMF: nonnegativematrix factorization, ASI: adaptive subspace iteration, CLGR: clusteringwith local and global regularization, RNMF: cluster refinement NMF, FPCA:fuzzy and principal component analysis, FAWDN: NMF based fuzzy andWordNet, WNMF: cluster based WordNet and NMF.
cluster terms, which are well represented within the important inherent structure of the document cluster. In order to solve the limitation of the internal knowledge-based clustering methods with respect to the influence of the internal structure of documents, the proposed method uses TMI of WordNet to calculate the term weights of documents. Further, this method uses a similarity between the cluster terms and the term weights to improve the quality of the text document clustering. It was demonstrated that the value of the normalized mutual information metric is higher in the case of the proposed method than in the case of the other text document clustering methods for a dataset of 20 newsgroups.
References
Hu J. , Fang L. , Cao Y. , Zeng H. J. , Li H. , Yang Q. , Chen Z. 2008 “Enhancing text clustering by leveraging Wikipedia semantics” in Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval Singapore 179 - 186
Chakrabarti S. 2003 Mining the Web: Discovering Knowledge fromHypertext Data. Morgan Kaufmann Publishers Boston, MA
Baeza-Yates R. , Ribeiro-Neto B. 2011 Modern InformationRetrieval: The Concepts and Technology behind Search 2nd ed. Addison-Wesley New York, NY
Xu W. , Liu X. , Gong Y. 2003 “Document clustering based on nonnegativematrix factorization” in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in information Retrieval Toronto, Canada 267 - 273
Park S. , An D. U. , Cha B. , Kim C. W. 2009 “Document clusteringwith cluster refinement and non-negative matrix factorization” in Proceedings of the 16th International Conference on Neural Information Processing Bangkok, Thailand 281 - 288
Park S. , Kim K. J. 2010 “Document clustering using non-negativematrix factorization and fuzzy relationship” Journal of Korea Navigation Institute 14 (2) 239 - 246
Xu W. , Gong Y. 2004 “Document clustering by concept factorization” in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval Sheffield, UK 202 - 209
Li T. , Ma S. , Ogihara M. 2004 “Document clustering via adap-tivesubspace iteration” in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval Sheffield, UK 218 - 225
Wang F. , Zhang C. , Li T. 2007 “Regularized clustering for documents” in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval Amsterdam, The Netherlands 95 - 102
Lee D. D. , Seung H. S. 1999 “Learning the parts of objects by nonnegative matrix factorization” Nature 401 (6755) 788 - 791    DOI : 10.1038/44565
Jing L. , Zhou L. , Ng M. K. , Huang J. Z. 2006 “Ontology-baseddistance measure for text clustering” in Proceedings of 2006 SIAM International Conference on Data Mining Bethesda, MD
Hu X. , Zhang X. , Lu C. , Park E. K. , Zhou X. 2009 “ExploitingWikipedia as external knowledge for document clustering” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Paris, France 389 - 396
Tar H. H. , Nyaunt T. T. S. 2011 “Ontology-based concept weighting fortext documents” World Academy of Science, Engineering and Technology 57 249 - 253
Park S. , Lee S. R. 2011 “Enhancing document clustering usingcondensing cluster terms and fuzzy association” Journal of IEICE Transactions on Information and Systems 94D (6) 1227 - 1234    DOI : 10.1587/transinf.E94.D.1227
Frakes W. B. , Baeza-Yates R. 1992 Information Retrieval: DataStructures & Algorithms. Prentice-Hall Englewood Cliffs, NJ
Miller G. A. 1995 “WordNet: a lexical database for English” Communications of the ACM 38 (11) 39 - 41    DOI : 10.1145/219717.219748
The 20 newsgroups data set [Internet] Available: http://qwone.com/~jason/20Newsgroups/.