Traditional clustering methods are usually based on the bag-of-words (BOW) model. A disadvantage of the BOW model is that it ignores the semantic relationships among terms in the data set. To resolve this problem, ontology or matrix factorization approaches are usually used. However, a major problem of the ontology approach is that it is usually difficult to find a comprehensive ontology that can cover all the concepts mentioned in a collection. This paper proposes a new document clustering method using semantic features and fuzzy relations to address the problems of the ontology and matrix factorization approaches. The proposed method improves the quality of document clustering because it uses fuzzy relation values between semantic features and terms to clearly distinguish dissimilar documents within clusters. The selected cluster label terms can better represent the inherent structure of a document set because they are derived from semantic features obtained by nonnegative matrix factorization. The experimental results demonstrate that the proposed method achieves better performance than other document clustering methods.
I. INTRODUCTION
The rapidly growing availability of a large quantity of textual data, such as online news, blogs, emails, and Internet bulletin boards, has created the need for effective document clustering methods. In addition, document clustering has been receiving increased attention as an important method for unsupervised document organization, automatic summarization, topic extraction, and information filtering or retrieval [1-7].
Recent studies on document clustering use machine learning techniques [2, 5, 8-10], graph-based methods [7, 11], and matrix factorization-based methods [12-14]. Machine learning-based methods use a semi-supervised clustering model that incorporates prior knowledge about documents' membership [2, 5, 8-10]. Graph-based methods model the given document set as an undirected graph in which each node represents a document [7, 11]. Matrix factorization-based methods use semantic features of the document set for document clustering [12-15].
However, a major problem of the ontology approach is that it is usually difficult to find a comprehensive ontology that can cover all the concepts mentioned in a collection [5]. In addition, a matrix factorization approach might fail to successfully decompose semantic features from a data set whose data objects are viewed from extremely different viewpoints, or are highly articulated objects [13, 16, 17].
In this paper, we propose a document clustering method using semantic features by nonnegative matrix factorization (NMF) and fuzzy relations. The proposed method uses fuzzy relations between semantic features and terms in a document set to resolve the matrix factorization approach problem. NMF can represent an individual object as a nonnegative linear combination of partial information extracted from a large volume of objects [13, 16, 17]. NMF can easily extract semantic features that represent the inherent structure of data objects. The factorization result of NMF has a better semantic interpretation, and the clustering result can be easily derived from it [16]. Fuzzy relations [18] use the concept of fuzzy set theory [19] to model vagueness in information retrieval. The basic concept of fuzzy relations involves the construction of index terms from a set of documents [18].
The proposed method has the following advantages. First, it can extract important cluster label terms from a document set using semantic features by NMF. By this means, it can identify the major topics and subtopics of clusters with respect to their semantic features. Second, it can remove dissimilar documents from clusters using fuzzy relations between semantic features and document terms. Thus, it can improve the quality of document clustering by removing dissimilarity information.
The rest of the paper is organized as follows: Section II describes works related to document clustering methods. In Section III, we review NMF and fuzzy relations in detail. In Section IV, the proposed document clustering method is introduced. Section V shows the evaluation and experimental results. Finally, we conclude in Section VI.
II. RELATED WORKS
Traditional clustering methods can be classified into partitioning, hierarchical, density-based, and grid-based methods. Most of these methods use distance functions as objective criteria and are not effective in high-dimensional spaces [1, 3, 4, 6, 20].
Li et al. [20] proposed a document clustering algorithm called adaptive subspace iteration (ASI), which uses explicit modeling of the subspace structure associated with each cluster. Wang et al. [21] proposed an approach for clustering multi-type interrelated data objects; it fully explores the relationships between data objects for clustering analysis. Park et al. [12] proposed a document clustering method using NMF and cluster refinement. Park et al. [15] proposed a document clustering method using latent semantic analysis (LSA) and fuzzy association. Xu et al. [14] proposed a document partitioning method based on NMF of the given document corpus. Xu and Gong [13] proposed a data clustering method that models each cluster as a linear combination of the data points, and each data point as a linear combination of the cluster centers. Li and Ding [22] presented an overview and summary of various matrix factorization algorithms for clustering and theoretically analyzed the relationships among them.
Wang et al. [7] proposed clustering with local and global regularization (CLGR), which uses local label predictors and a global label smoothness regularizer. Liu et al. [9] proposed a document clustering method using cluster refinement and model selection. It uses a Gaussian mixture model and the expectation-maximization algorithm to conduct initial document clustering, and then refines the initially obtained clusters by voting on the cluster label of each document.
Ji and Xu [8] proposed a semi-supervised clustering model that incorporates prior knowledge about documents' membership for document clustering analysis. Zhang et al. [10] adopted a relaxation labeling-based clustering algorithm, which uses both content and linkage information in the data set, to evaluate the effectiveness of different types of links for document clustering [11]. Hu et al. [5] proposed a document clustering method that exploits Wikipedia as external knowledge, using Wikipedia to resolve the external-ontology problem of document clustering. Fodeh et al. [2] proposed a document clustering method using an ensemble model that combines statistics and semantics [7].
III. NMF AND FUZZY RELATION THEORY
 A. NMF
In this paper, we define the matrix notation as follows: let X_{*j} be the j-th column vector of matrix X, X_{i*} be the i-th row vector, and X_{ij} be the element of the i-th row and the j-th column.
NMF decomposes a given m × n matrix A into a nonnegative semantic feature matrix W and a nonnegative semantic variable matrix H, as shown in Eq. (1):

A ≈ WH, (1)

where W is an m × r nonnegative matrix and H is an r × n nonnegative matrix. Usually, r is chosen to be smaller than m or n, so that the total sizes of W and H are smaller than that of the original matrix A.
We use the objective function that minimizes the Euclidean distance between each column of A and its approximation Ã = WH, which was proposed by Lee and Seung [16, 17]. As the objective function, the Frobenius norm is used:

Θ_{E}(W, H) = ||A − WH||²_{F}. (2)

We keep updating W and H until Θ_{E}(W, H) converges below a predefined threshold or the number of repetitions is exceeded. The update rules are as follows:

H_{αβ} ← H_{αβ} (W^{T}A)_{αβ} / (W^{T}WH)_{αβ}, (3)

W_{αβ} ← W_{αβ} (AH^{T})_{αβ} / (WHH^{T})_{αβ}. (4)
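As an illustration, the multiplicative update procedure can be sketched in Python with NumPy. This is a minimal sketch, not the authors' implementation: the random nonnegative initialization, the small epsilon guard against division by zero, and the exact stopping test are assumptions.

```python
import numpy as np

def nmf(A, r, max_iter=50, tol=1e-3, seed=0):
    """Multiplicative-update NMF (Lee-Seung): A (m x n) ~ W (m x r) @ H (r x n)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Nonnegative random initialization (the paper's Example 1 instead uses all 0.5).
    W = rng.random((m, r)) + 0.1
    H = rng.random((r, n)) + 0.1
    eps = 1e-10  # guard against division by zero
    prev = np.inf
    for _ in range(max_iter):
        # Update rules: multiply each entry by the ratio of numerator/denominator terms.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
        err = np.linalg.norm(A - W @ H) ** 2  # Frobenius objective of Eq. (2)
        if prev - err < tol:  # stop when the objective stops improving
            break
        prev = err
    return W, H
```

Because every update is a multiplication by a nonnegative ratio, W and H stay nonnegative throughout, which is what gives the factors their parts-based interpretation.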
Example 1. We illustrate the NMF algorithm with an example: let r be 3, the number of repetitions be 50, and the tolerance be 0.001. When the initial elements of the W and H matrices are 0.5, the algorithm decomposes the matrix A into the W and H matrices as shown in Fig. 1.
Fig. 2 shows an example of sentence representation using NMF. The column vector A_{*3} corresponding to the third sentence is represented as a linear combination of the semantic feature vectors W_{*l}, with coefficients given by the semantic variable column vector H_{*3}.
The power of the two nonnegative matrices W and H is described as follows: all semantic variables (H_{lj}) can be used to represent each sentence, and W and H are represented sparsely. Intuitively, it makes more sense for each sentence to be associated with some small subset of a large array of topics (W_{*l}), rather than with just one topic or all of the topics. In each semantic feature (W_{*l}), NMF has grouped together semantically related terms [3].
Fig. 1. Result of the nonnegative matrix factorization algorithm.
Fig. 2. Example of sentence representation using semantic features and semantic variables.
 B. Fuzzy Relations Theory
In this section, we give a brief review of fuzzy relation theory [1, 18, 19], which is used in document clustering. The relevant definitions are as follows:
Definition 1. A fuzzy relation between two finite sets X = {x_{1}, …, x_{u}} and Y = {y_{1}, …, y_{v}} is formally defined as a binary fuzzy relation f: X × Y → [0, 1], where u and v are the numbers of elements in X and Y, respectively.
Definition 2. Given a set of index terms, T = {t_{1}, …, t_{u}}, and a set of documents, D = {d_{1}, …, d_{v}}, each t_{i} is represented by a fuzzy set h(t_{i}) of documents, h(t_{i}) = {F(t_{i}, d_{j}) | ∀d_{j} ∈ D}, where F(t_{i}, d_{j}) is the significance (or membership) degree of t_{i} in d_{j}.
Definition 3. The fuzzy related terms (RT) relation is based on the evaluation of the co-occurrences of t_{i} and t_{j} in the set D. A simplification of the fuzzy RT relation based on the co-occurrence of terms is given as follows:

r_{i,j} = n_{i,j} / (n_{i} + n_{j} − n_{i,j}), (5)

where r_{i,j} represents the fuzzy RT relation between terms i and j, n_{i,j} is the number of documents containing both the i-th and j-th terms, n_{i} is the number of documents including the i-th term, and n_{j} is the number of documents including the j-th term.
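The simplified co-occurrence relation can be computed directly from a binary term-document incidence matrix. The following is a minimal sketch assuming such a matrix (terms in rows, documents in columns); the function name is illustrative, not from the paper.

```python
import numpy as np

def fuzzy_rt_matrix(B):
    """Fuzzy related-terms relation from a binary term-document incidence
    matrix B (terms x documents): r_ij = n_ij / (n_i + n_j - n_ij)."""
    B = (np.asarray(B) > 0).astype(float)
    co = B @ B.T                           # n_ij: docs containing both terms i and j
    n = np.diag(co)                        # n_i: docs containing term i
    denom = n[:, None] + n[None, :] - co   # n_i + n_j - n_ij
    with np.errstate(invalid="ignore", divide="ignore"):
        R = np.where(denom > 0, co / denom, 0.0)
    return R
```

Note that R is symmetric and its diagonal is 1 for every term that occurs at least once, matching the requirement that a term has the strongest relationship to itself.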
Definition 4. The membership degree of each document to each of the cluster sets can be defined as follows:

μ_{i,j} = 1 − ∏_{t_{a}∈d_{i}, t_{b}∈CT^{j}} (1 − r_{a,b}), (6)

where μ_{i,j} is the membership degree of d_{i} belonging to CT^{j}, and r_{a,b} is the fuzzy relation between term t_{a} ∈ d_{i} and term t_{b} ∈ CT^{j}. CT^{j} is the term set representing the topic of the j-th cluster.
IV. PROPOSED DOCUMENT CLUSTERING METHOD
In this section, we propose a method that clusters documents using semantic features by NMF and fuzzy relations.
Fig. 3. Document clustering method using semantic features and fuzzy relations. NMF: nonnegative matrix factorization.
The proposed method consists of the preprocessing phase, the cluster label extraction phase, and the document clustering phase. We next give a full explanation of the three phases shown in Fig. 3.
 A. Preprocessing
In the preprocessing phase, we remove all stopwords using Rijsbergen's stopword list and perform word stemming with Porter's stemming algorithm [3, 6]. Then we construct the term-frequency vector for each document in the document set [1, 3, 6].
Let T_{i} = [t_{1i}, t_{2i}, …, t_{ni}]^{T} be the term-frequency vector of document i, where element t_{ji} denotes the frequency with which term j occurs in document i. Let A be an m × n term-by-document matrix, where m is the number of terms and n is the number of documents in the document set.
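A minimal sketch of the term-frequency matrix construction is given below. It uses a small stand-in stopword list and omits Porter stemming; in the paper these come from Rijsbergen's list and Porter's algorithm, respectively.

```python
from collections import Counter

# Stand-in stopword list; the paper uses Rijsbergen's full list.
STOPWORDS = {"the", "a", "of", "in", "and", "to", "is"}

def term_doc_matrix(docs):
    """Build an m x n term-by-document frequency matrix A (as nested lists)
    after stopword removal; stemming is omitted in this sketch."""
    counts = [Counter(w for w in d.lower().split() if w not in STOPWORDS)
              for d in docs]
    terms = sorted(set().union(*counts))            # row order of A
    A = [[c[t] for c in counts] for t in terms]     # A[j][i] = freq of term j in doc i
    return terms, A
```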
 B. Cluster Label Term Extraction by NMF
In the cluster label term extraction phase, we use semantic features by NMF [15-17] to extract cluster label terms. The proposed cluster label term extraction method is described as follows. First, the preprocessing phase is performed, and the term-document frequency matrix is constructed. Table 1 shows the term-document frequency matrix with respect to 7 documents and 6 terms. Table 2 shows the semantic feature matrix W obtained by NMF from Table 1. The terms having the top semantic values in each column of Table 2 are selected as the cluster label terms. Table 3 shows the cluster label terms extracted from Table 2.
Table 1. Term-document frequency matrix
Table 2. Semantic feature matrix W by nonnegative matrix factorization from Table 1
Table 3. Result of cluster label term extraction from Table 2
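Selecting the top-valued terms from each column of W can be sketched as follows; the function name and the `top_k` parameter are illustrative, not from the paper.

```python
import numpy as np

def cluster_label_terms(W, terms, top_k=2):
    """For each semantic-feature column of W, pick the terms with the
    largest semantic values as that cluster's label terms."""
    W = np.asarray(W)
    labels = []
    for l in range(W.shape[1]):
        idx = np.argsort(W[:, l])[::-1][:top_k]   # top-k rows of column l
        labels.append([terms[i] for i in idx])
    return labels
```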
 C. Clustering Document by Fuzzy Relations
The document clustering phase is described as follows. First, we construct the term correlation matrix M with respect to the relationship between the cluster label terms and the terms of the document set, using the fuzzy RT relation of Eq. (5). The term correlation matrix is an n × n symmetric matrix whose element m_{ij} takes a value in the interval [0, 1], in which 0 indicates no relationship and 1 indicates a full relationship between the terms t_{i} and t_{j}. Therefore, m_{ij} is equal to 1 for all i = j, since a term has the strongest relationship to itself [18]. Table 4 shows the term correlation matrix obtained using Table 1 and Eq. (5).
Second, a document d_{i} is clustered into the cluster C^{j} for which the membership degree μ_{i,j} of Eq. (6) is maximal. A term t_{a} in d_{i} is associated with cluster C^{j} if the terms t_{b} in CT^{j} (for cluster C^{j}) are related to the term t_{a} [18]. Table 5 shows the result of document clustering using Table 4 and Eq. (6).
Table 4. Term correlation matrix from Table 2 by the fuzzy related terms relation
Table 5. Result of document clustering from Table 4
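The assignment step can be sketched as below. Since the exact form of the membership degree is not reproduced here, this sketch assumes the complement-product membership μ_{i,j} = 1 − ∏(1 − r_{a,b}) that is common in fuzzy information retrieval; the paper's Eq. (6) may differ in detail, and all names are illustrative.

```python
import numpy as np

def cluster_documents(doc_terms, cluster_terms, R, term_index):
    """Assign each document to the cluster whose label terms it is most
    related to, using mu = 1 - prod(1 - r_ab) over term pairs (assumed
    form of the membership degree)."""
    assignments = []
    for terms_d in doc_terms:                 # terms of document d_i
        mus = []
        for ct in cluster_terms:              # label terms CT^j of cluster j
            prod = 1.0
            for ta in terms_d:
                for tb in ct:
                    prod *= 1.0 - R[term_index[ta], term_index[tb]]
            mus.append(1.0 - prod)            # membership degree mu_ij
        assignments.append(int(np.argmax(mus)))  # cluster with maximal mu
    return assignments
```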
 V. EXPERIMENTS AND EVALUATION
We use the Reuters document corpus (http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html) to evaluate the proposed method. The Reuters corpus has 21,578 documents, which are grouped into 135 clusters [7]. We use the normalized mutual information (NMI) metric for measuring document clustering performance [7, 13, 14].
We conducted a performance evaluation by testing the proposed method and comparing it with 6 other representative document clustering methods on the same data corpus. We implemented 7 document clustering methods: FNMF, FLSA, RNMF, KM, NMF, ASI, and CLGR. In Fig. 4, Fisher NMF (FNMF) denotes the proposed method. Feature LSA (FLSA) denotes our previously proposed method using LSA and fuzzy association [15]. Robust NMF (RNMF) denotes our previous method using NMF and cluster refinement [12]. KM denotes the partitioning method using traditional k-means [1, 3, 4, 6]. NMF denotes Xu's method using nonnegative matrix factorization [14]. ASI denotes Li's method using adaptive subspace iteration [20]. CLGR denotes Wang's method using clustering with local and global regularization [7].
The evaluation results are shown in
Fig. 4
. The evaluations were conducted for the cluster numbers ranging from 2 to 10. For each given cluster number
k
, 50 experiment runs were conducted on different randomly chosen clusters, and the final performance values were obtained by averaging the values from the 50 experiments.
Fig. 4. Evaluation results of the performance comparison. NMI: normalized mutual information, KM: k-means, NMF: nonnegative matrix factorization, ASI: adaptive subspace iteration, CLGR: clustering with local and global regularization, RNMF: robust NMF, FLSA: feature latent semantic analysis, FNMF: Fisher NMF.
VI. CONCLUSION
In this paper, we have presented a document clustering method using semantic features by NMF and fuzzy relations. The proposed method has the following advantages. First, it can identify dissimilar documents within clusters by fuzzy relations between cluster label terms and documents, thereby improving the quality of document clustering. Second, it can easily extract cluster label terms that cover the major topics of a document set well, using semantic features by NMF. Experimental results show that the proposed method outperforms 6 other document clustering methods.
REFERENCES
[1] S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data. Boston, MA: Morgan Kaufmann, 2003.
[2] S. J. Fodeh, W. F. Punch, and P. N. Tan, "Combining statistics and semantics via ensemble model for document clustering," in Proceedings of the 24th Annual ACM Symposium on Applied Computing, Honolulu, HI, 2009, pp. 1446-1450.
[3] W. B. Frankes and B. Y. Ricardo, Information Retrieval: Data Structures & Algorithms. Englewood Cliffs, NJ: Prentice-Hall, 1992.
[4] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed. Boston, MA: Morgan Kaufmann, 2006.
[5] X. Hu, X. Zhang, C. Lu, E. K. Park, and X. Zhou, "Exploiting Wikipedia as external knowledge for document clustering," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 2009, pp. 389-396.
[6] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. New York, NY: ACM Press, 1999.
[7] F. Wang, C. Zhang, and T. Li, "Regularized clustering for documents," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, 2007, pp. 95-102.
[8] X. Ji and W. Xu, "Document clustering with prior knowledge," in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, WA, 2006, pp. 405-412.
[9] X. Liu, Y. Gong, W. Xu, and S. Zhu, "Document clustering with cluster refinement and model selection capabilities," in Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, 2002, pp. 191-198.
[10] X. Zhang, X. Hu, and X. Zhou, "A comparative evaluation of different link types on enhancing document clustering," in Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Singapore, 2008, pp. 555-562.
[11] T. Hu, H. Xiong, W. Zhou, S. Y. Sung, and H. Luo, "Hypergraph partitioning for document clustering: a unified clique perspective," in Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Singapore, 2008, pp. 871-872.
[12] S. Park, D. U. An, B. R. Cha, and C. W. Kim, "Document clustering with cluster refinement and nonnegative matrix factorization," in Proceedings of the 16th International Conference on Neural Information Processing, Bangkok, Thailand, 2009, pp. 281-288.
[13] W. Xu and Y. Gong, "Document clustering by concept factorization," in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, UK, 2004, pp. 202-209.
[14] W. Xu, X. Liu, and Y. Gong, "Document clustering based on nonnegative matrix factorization," in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, 2003, pp. 267-273.
[15] S. Park, D. U. An, B. R. Cha, and C. W. Kim, "Document clustering with semantic features and fuzzy association," in Proceedings of the 4th International Conference on Information Systems, Technology and Management, Bangkok, Thailand, 2010, pp. 167-175.
[16] D. D. Lee and H. S. Seung, "Learning the parts of objects by nonnegative matrix factorization," Nature, vol. 401, no. 6755, pp. 788-791, 1999. DOI: 10.1038/44565.
[17] D. D. Lee and H. S. Seung, "Algorithms for nonnegative matrix factorization," in Advances in Neural Information Processing Systems 13. Cambridge, MA: MIT Press, 2001, pp. 556-562.
[18] C. Haruechaiyasak, M. L. Shyu, S. C. Chen, and X. Li, "Web document classification based on fuzzy association," in Proceedings of the 26th International Computer Software and Applications Conference on Prolonging Software Life: Development and Redevelopment, Oxford, UK, 2002, pp. 487-492.
[19] L. A. Zadeh, "Fuzzy sets," in Readings in Fuzzy Sets for Intelligent Systems. San Francisco, CA: Morgan Kaufmann, 1993.
[20] T. Li, S. Ma, and M. Ogihara, "Document clustering via adaptive subspace iteration," in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, UK, 2004, pp. 218-225.
[21] J. Wang, H. Zeng, Z. Chen, H. Lu, L. Tao, and W. Y. Ma, "ReCoM: reinforcement clustering of multi-type interrelated data objects," in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, 2003, pp. 274-281.
[22] T. Li and C. Ding, "The relationships among various nonnegative matrix factorization methods for clustering," in Proceedings of the 6th International Conference on Data Mining, Hong Kong, China, 2006, pp. 362-371.