This paper proposes an approach for visualizing research trends using theme maps and supplementary information. The proposed algorithm consists of the following steps. First, text mining is used to construct a vector space of keywords. Second, correspondence analysis is employed to reduce the high dimensionality and to express the relationships between documents and keywords. Third, kernel density estimation is applied to generate three-dimensional data that shows the concentration of the document set. Fourth, a cartographic concept is adapted for visualizing research trends. Finally, relative vitalization information is provided for more accurate research trend analysis. The proposed algorithm is tested on papers about Traditional Korean Medicine.
1. INTRODUCTION
People live in three-dimensional space and recognize objects in the real world, analyzing and integrating information in order to use the objects around them. Maps were invented to exploit this spatial perception ability [1], [2]. In general, a map is used to determine where we are and how to reach a destination. The same concept can be used to express the underlying structure of a document set; the result is called a 'theme map' and is expressed as a three-dimensional landscape [2]-[4]. Such a map conveys relational information between documents. If a theme map is constructed in three dimensions, however, analyzing the relationships among documents requires changing the viewpoint and integrating information [2], [5]. This extra effort makes it hard to gain insight into document relationships from the theme map, which creates the need for an alternative concept such as a bird's-eye view [1].
A contour map provides a bird's-eye view of the topography of a three-dimensional landscape [6], [7], and we can use the contour map metaphor to grasp the relationships among documents. The contour map metaphor is put to good use in the IT field to visualize mutual relations: web structure, software procedures, document relationships, and so on [2]. To construct a contour map, we must extract the information that best represents the documents and expresses their underlying relationships. In general, a contour map is composed of lines and colors, called contour lines and depth colors respectively [2], [7]. If two documents are similar, they are placed closer together than any others on the theme map. Peaks appear on the map where there is a high concentration of documents, and the distance between valleys and ridges shows how closely the themes are related. Constructing and visualizing the contour map requires a three-step algorithm: information analysis, dimension reduction, and map visualization [2], [6], [7].
This research aims to provide a high-quality research trend analysis tool by creating a theme map together with supplementary information. Papers on Traditional Korean Medicine (TKM) are used to demonstrate the algorithms and procedures suggested in this study.
2. RELATED TECHNOLOGIES
 2.1 Information analysis
To visualize information on the theme map, information analysis is needed, which comprises an indexing process and an analysis process [1]. The indexing process extracts information from unstructured documents; it consists of a keyword extraction stage using natural language processing and the construction of a vector space from the extracted keywords. Automatic indexing is a familiar algorithm for representing the corpus space of each document, and natural language processing can represent a document's content well. Part-of-speech tagging is the basic technology in the noun-processing algorithm, and this method is more accurate than indexing-based algorithms [8].
The analysis process discovers structure in the information extracted during the indexing stage. In general, it uses classification and clustering methods. Classification assigns each document to predefined groups, whereas clustering creates variable groups based on document similarity. Widely used classification methods include the naive Bayesian method, k-nearest neighbors, and network models. A commonly used clustering algorithm is the self-organizing map (SOM), which produces a two-dimensional grid representation of N-dimensional features [7]. Other popular clustering algorithms include multidimensional scaling, the k-nearest neighbor method, and the K-means algorithm [9].
 2.2 Dimension reduction
Each document is represented by a keyword vector, and each keyword corresponds to one dimension, so the vector space of a document set stores information in high dimensions. Dimension reduction is needed to visualize the relationships in this high-dimensional vector space; two- or three-dimensional visualization is commonly preferred. This process always raises the problem of information distortion. Multidimensional scaling (MDS), the self-organizing map (SOM), principal component analysis (PCA), and correspondence analysis (CA) are commonly used [8], [10]. MDS is a set of related statistical techniques often used in information visualization to show the relationships and structure of data, and is commonly used to display ordination results. The MDS algorithm operates on a similarity matrix and locates each item in n-dimensional space, typically two or three dimensions [10].
SOM is a kind of unsupervised machine learning based on artificial neural network methods. It produces a low-dimensional representation of the input space, called a map, and is widely used to visualize high-dimensional data in a low-dimensional space. The SOM model was first described as an artificial neural network by the Finnish professor Teuvo Kohonen [7]. PCA is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components [11]. The first principal component carries the most information about the original data, and each succeeding component carries as much of the remaining information as possible. If a data set is high-dimensional and hard to view, PCA can supply the user with a lower-dimensional picture, a "shadow" of the object, and the selected dimensions (principal components) form the most informative viewpoint.
CA was developed by Jean-Paul Benzécri and is conceptually similar to PCA, but scales the data so that rows and columns are treated equivalently [12]. CA typically deals with a contingency table or matrix, and decomposes the table into orthogonal factors using the chi-square metric. CA can produce the relations between rows and columns, and it has a distinct advantage over other dimension reduction algorithms: a duality that can reveal nonlinear relationships between the rows and columns of a multidimensional contingency table.
 2.3 Map visualization
There are seven kinds of information visualization methods: 1D, 2D, 3D, multidimensional, tree, network, and temporal [13]. Among these seven categories, we reviewed the multidimensional methods, which are capable of representing map data. The multidimensional approach represents information as multidimensional objects and projects them into a three-dimensional or two-dimensional space. The SPIRE system belongs to this category [14].
ThemeScape is a document analysis tool that uses a theme map in the SPIRE system, implemented to help solve the problem of information overload [6]. There are also several graphical tools, such as MATLAB [15], Mathematica [16], ChartFX [17], TeeChart [18], and the R package [19], that can represent three-dimensional data as a contour map.
Figure 1. An example of ThemeScape
3. ALGORITHM
This section explains the algorithm for building the theme map using the contour map metaphor described above. The algorithm is composed of the following seven steps:

✓ Vector space formation

✓ Correspondence analysis

✓ Kernel density estimation

✓ Local maxima detection

✓ K-nearest neighbor detection

✓ Contour map generation

✓ Research vitalization analysis
In the implementation, the output of each step is used as the input of the next, and the algorithm is executed sequentially. First, a vector space in matrix form is created from the keywords extracted from the documents. Next, correspondence analysis is applied to this vector space to obtain two-dimensional coordinates for the documents. Kernel density estimation then transforms these coordinates into a smoothed curved surface. To locate the peaks, local maxima detection and k-nearest neighbor detection are used, and the data are visualized as a contour map with the R package. Finally, the resulting data are analyzed to obtain vitalization information for the documents.
Step one: vector space formation
The analysis of a natural language (unstructured) document provides a characterization of its content. This process consists of extracting essential descriptors and representing these features. Text features are typically of one of three general types, though any number of variations and hybrids are possible [4]. The first type is frequency-based, built on first-order statistics: the counts of keywords in a document are measured and used as a feature vector. The second type is based on higher-order statistics captured by Bayesian or neural networks. The third type is semantic in nature; this approach uses a word corpus and knowledge of the language to extract the semantics of the text.
We used the keyword frequency method to create a vector space as shown in Figure 2. To incorporate experts' knowledge, the keywords (Keyword 1, Keyword 2, …, Keyword n) are selected from the term dictionary [20]. In Figure 2, the keyword counts are expressed as C_mn for each document (Document 1, Document 2, …, Document m).
Figure 2. The matrix of the document-keyword vector space
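As an illustration of this step, the document-keyword count matrix of Figure 2 can be sketched in a few lines of Python. This is not the paper's actual implementation (which indexes against a TKM term dictionary); the document texts, keywords, and function name here are hypothetical.

```python
from collections import Counter

def build_vector_space(documents, dictionary_keywords):
    """Build the document-keyword count matrix C: one row per document,
    one column per dictionary keyword, entries C_mn = keyword counts."""
    matrix = []
    for text in documents:
        counts = Counter(text.lower().split())
        matrix.append([counts[k] for k in dictionary_keywords])
    return matrix

docs = ["ginseng extract reduced stress in mice",
        "insulin and ginseng combined administration"]
keywords = ["ginseng", "insulin", "stress"]
C = build_vector_space(docs, keywords)
# C[0] holds the counts of 'ginseng', 'insulin', 'stress' in the first document
```

A real indexer would add part-of-speech tagging and dictionary matching rather than whitespace tokenization, but the resulting matrix has the same shape as Figure 2.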
Step two: correspondence analysis
CA is a multivariate descriptive method based on a data matrix with nonnegative elements, and it is related to principal component analysis (PCA). In general CA is applied to categorical data, but it is applicable to any kind of data. CA has a duality characteristic that comes from scaling the data so that rows and columns are treated equivalently. Thus we can derive not only the relations within the row data and within the column data, but also the relations between row data and column data.
The CA solution can be neatly encapsulated in the singular value decomposition (SVD) of a suitably transformed matrix [12]. To summarize the theory, first divide the I × J data matrix, denoted by N, by its grand total n to obtain the so-called correspondence matrix P = (1/n)N. Let the row and column marginal totals of P be the vectors r and c respectively, that is, the vectors of row and column masses, and let D_r and D_c be the diagonal matrices of these vectors. The computational algorithm for obtaining the coordinates of the row and column profiles with respect to the principal axes, using the SVD, is as follows:

Calculate the matrix of standardized residuals, S = D_r^(-1/2) (P - r c^T) D_c^(-1/2), and compute its SVD, S = U Σ V^T

Principal coordinates of rows: F = D_r^(-1/2) U Σ

Principal coordinates of columns: G = D_c^(-1/2) V Σ

Standard coordinates of rows: X = D_r^(-1/2) U

Standard coordinates of columns: Y = D_c^(-1/2) V

Finally, we can obtain the two-dimensional coordinates of each document (row) and each keyword (column) from X and Y.
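The CA computation can be sketched with NumPy as below. This is an illustrative re-implementation, not the paper's code, and the 3 × 3 sample matrix is made up; it returns the first two principal coordinates of rows and columns, which is one common choice for plotting.

```python
import numpy as np

def correspondence_analysis(N):
    """Correspondence analysis via SVD of the standardized residuals."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                       # correspondence matrix P = N/n
    r = P.sum(axis=1)                     # row masses
    c = P.sum(axis=0)                     # column masses
    Dr_half = np.diag(1.0 / np.sqrt(r))   # D_r^(-1/2)
    Dc_half = np.diag(1.0 / np.sqrt(c))   # D_c^(-1/2)
    S = Dr_half @ (P - np.outer(r, c)) @ Dc_half   # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    F = (Dr_half @ U) * sv                # principal coordinates of rows
    G = (Dc_half @ Vt.T) * sv             # principal coordinates of columns
    return F[:, :2], G[:, :2]             # 2-D coordinates of docs, keywords

rows, cols = correspondence_analysis([[10, 2, 1], [1, 8, 3], [2, 1, 9]])
```

In the paper's pipeline the input would be the document-keyword matrix from step one, and the returned coordinates feed the kernel density estimation of step three.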
Step three: kernel density estimation
In mathematics, extrapolation is an algorithm for constructing new points from a set of known points. It is similar to interpolation, which constructs new points between known points. Several kinds of extrapolation are widely applied in computation. Linear extrapolation and polynomial extrapolation are commonly used, but the results they produce are poor. Conic extrapolation and French curve extrapolation generate better results than the previously mentioned algorithms, but they use more points to construct each new point and thus consume more computational power [21].
In statistics, kernel density estimation (KDE) is a nonparametric way of estimating the probability density function of a random variable. Given some data about a sample of a population, kernel density estimation makes it possible to extrapolate the data to the entire population [22]. This means that we can construct three-dimensional coordinates from the results of CA by applying KDE. If x_1, x_2, …, x_n are independent samples of random variables that follow the same probability distribution, then the kernel density approximation is

f̂_h(x) = (1/nh) Σ_{i=1}^{n} K((x − x_i)/h),

where K is some kernel and h is a smoothing parameter called the bandwidth.
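A minimal one-dimensional KDE sketch is shown below (the paper applies KDE to the two-dimensional CA coordinates; the Gaussian kernel, bandwidth, and sample values here are illustrative assumptions, and the 2-D case follows by using a product kernel over both coordinates).

```python
import numpy as np

def kde(x, samples, h,
        kernel=lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)):
    """Kernel density estimate f_h(x) = (1/nh) * sum_i K((x - x_i) / h),
    with a Gaussian kernel by default."""
    samples = np.asarray(samples, dtype=float)
    return np.mean(kernel((x - samples) / h)) / h

# Documents clustered near 0 produce a high density there; the gap at 1.0
# between the two clusters has low density, which is what creates peaks
# and valleys on the theme map.
samples = [0.0, 0.1, 0.2, 2.0, 2.1]
density_near_cluster = kde(0.1, samples, h=0.3)
density_in_gap = kde(1.0, samples, h=0.3)
assert density_near_cluster > density_in_gap
```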
Step four: local maxima detection
Local maxima detection and local minima detection are used to detect points with outstanding characteristics or features, and they are commonly applied in image processing and computer vision. There are several kinds of feature detection algorithms: corner detection and blob detection were commonly used in the early days. Image features, which must be well defined and stable, are extracted with corner detection, while blob detection is very similar to local maxima detection in its use of blob descriptors. More recently, interest point detection has become the common terminology in image processing. An interest point is a point in the image that in general can be characterized as follows [23]:

It has a clear, preferably mathematically wellfounded, definition.

It has a welldefined position in image space.

The local image structure around the interest point is rich in terms of local information contents.
We used local maxima detection to locate the peaks of the three-dimensional data constructed by KDE. If x satisfies f(x) ≥ f(x') for all x' located around x, that is, for all x' with |x − x'| < δ for some δ > 0, then we call f(x) a local maximum. Because we deal with three-dimensional data, we have to find all positions (x, y) that satisfy f(x, y) ≥ f(x', y') for all neighboring (x', y'). Because these detected peaks are the locally highest points of the three-dimensional data, they mark the positions where the density of documents is high.
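A simple grid-based version of this peak detection might look like the following sketch (illustrative only: the paper does not specify its grid resolution or neighborhood handling, and this version uses a strict comparison against the 8 neighbors to avoid reporting flat regions).

```python
import numpy as np

def local_maxima(z):
    """Return the (i, j) interior grid positions whose value is strictly
    greater than all 8 neighbours -- a sketch of KDE peak detection."""
    peaks = []
    rows, cols = z.shape
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            patch = z[i-1:i+2, j-1:j+2].copy()
            patch[1, 1] = -np.inf          # exclude the centre itself
            if z[i, j] > patch.max():
                peaks.append((i, j))
    return peaks

z = np.zeros((5, 5))
z[1, 1] = 3.0   # a high-density peak
z[3, 3] = 2.0   # a lower peak
print(local_maxima(z))  # → [(1, 1), (3, 3)]
```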
Step five: k-nearest neighbor detection
Nearest neighbor detection (NND), also known as similarity detection or closest point detection, is an optimization problem of finding the closest points in a metric space [24]. There are numerous variants of the NND problem; the two best known are k-nearest neighbor detection (kNN) and ε-approximate nearest neighbor detection. kNN detection identifies the top k nearest neighbors of a point. We can compute the distance between two points p and q using the Euclidean distance function

d(p, q) = sqrt(Σ_i (p_i − q_i)^2).

The Euclidean distances between all peak points and all keyword positions are measured, and kNN identifies the top k nearest keyword positions for each peak detected from the KDE surface. This means that the k most representative keywords are displayed at each peak point.
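This labeling step can be sketched as follows (the keyword names, coordinates, and choice of k are hypothetical; the paper's distance computation may differ in detail but uses the same Euclidean ranking).

```python
import math

def k_nearest_keywords(peak, keyword_positions, k):
    """Return the labels of the k keywords closest (Euclidean distance)
    to a peak position on the 2-D CA plane."""
    dist = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
    ranked = sorted(keyword_positions, key=lambda item: dist(peak, item[1]))
    return [label for label, _ in ranked[:k]]

keywords = [("ginseng", (0.1, 0.2)), ("stress", (0.15, 0.25)),
            ("insulin", (2.0, 1.5)), ("blood", (0.9, 0.9))]
print(k_nearest_keywords((0.0, 0.0), keywords, k=2))  # → ['ginseng', 'stress']
```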
Step six: contour map visualization
A contour line (also isoline, isogram, or isarithm) of a function of two variables is a curve along which the function has a constant value [25]. In cartography, a contour map is illustrated with a set of contour lines, each joining points of equal height. Contour lines show hills, valleys, and peaks; they can represent the steepness of slopes, show the distance between peaks, and so on. Contour levels are the successive contour lines, represented by interpolated colors, which help people understand the contour map. In general, a contour plot is a graphic representation of the relationships among three numeric variables (x, y, and z) in two dimensions using the contour map metaphor.
As noted in Section 2.3, several graphical tools, such as MATLAB [15], Mathematica [16], ChartFX [17], TeeChart [18], and the R package [19], can represent three-dimensional data as a contour map. We chose the R package as our graphical tool because it provides a Java connection with which a web-based program can be developed. Besides this, it offers plenty of statistical libraries to support numeric and statistical calculation.
Step seven: research vitalization analysis
After constructing the theme map, we can analyze the heights of its peaks. Correspondence analysis represents the relation between documents and keywords, so by analyzing the peak heights we can obtain the relative vitalization of the documents. Because determining the absolute vitalization of documents would require additional statistical information, we can only obtain relative vitalization information. It can be divided into as many stages as the user wants; in this study, we used three stages (low, mid, high) of relative vitalization. The detailed process of vitalization analysis is as follows:

Calculate the height difference between the highest peak and the ground

Divide this height into three equal bands

Collect the (x, y) positions classified into each stage

Obtain the vitalization of the documents and keywords from their positions
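The steps above can be sketched as follows (an illustrative Python version; the paper's system reads heights off the KDE surface at each document and keyword position, whereas this sketch starts from a plain list of heights, and the names are hypothetical).

```python
def vitalization_stages(heights, names):
    """Split the range between the ground (height 0) and the highest peak
    into three equal bands and assign each item to low/mid/high."""
    top = max(heights)
    stages = {}
    for name, h in zip(names, heights):
        if h >= 2 * top / 3:
            stages[name] = "high"
        elif h >= top / 3:
            stages[name] = "mid"
        else:
            stages[name] = "low"
    return stages

print(vitalization_stages([0.9, 0.5, 0.1], ["A", "B", "C"]))
# → {'A': 'high', 'B': 'mid', 'C': 'low'}
```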
4. AN EXAMPLE OF IMPLEMENTATION: THE CASE OF TKM PAPERS
 4.1 Test bed: a TKM paper collection
We used the Korea Institute of Oriental Medicine's Oriental Advanced Searching Integrated System (OASIS) to test the developed algorithm. OASIS is the largest database service system in the field of TKM in Korea [26].
Table 1. Data fields serviced by OASIS
Table 1 shows the information provided by OASIS, from which we extracted the title, keywords, and abstract. To construct the experts' term dictionary, we used 22,334 TKM terms from the TKM Korean-English dictionary [20] and 11,160 medical terms from the Korean Medical Library Engine [27].
 4.2 Theme map creation
We selected 252 papers retrieved from OASIS using the keyword 'ginseng'. First, the system calculates each keyword's frequency in the retrieved papers and cuts off the keyword list at a threshold to obtain a high-quality visualization. Using the keyword list, it generates the paper-keyword matrix, which serves as the vector space fed to CA. CA was applied to the paper-keyword matrix to generate two-dimensional data for the papers and keywords, and the remaining steps of the algorithm were applied as described above. Finally, we obtain the theme map in Figure 3; the red lettering was not generated by the system but was added to provide an analysis example.
Figure 3. Theme map generated from ginseng-related papers
 4.3 Research trend analysis
The implemented system can generate research vitalization information based on the three stages, assigning each paper and keyword to one of them. The generated keyword list is shown in Table 2: high vitalization contains 33 keywords, mid vitalization 31 keywords, and low vitalization 22 keywords. Among the 252 papers, 131 are assigned to the high vitalization stage, 53 to the mid stage, and 68 to the low stage.
Table 2. Keyword list for each vitalization stage
Now we can analyze the research trend more accurately using the theme map and the vitalization keyword list. Figure 3 can be analyzed as below; this analysis was done by TKM experts.

A: Research on ginseng's effect on stress

B: Research on ginseng's effect on inflammation suppression

C: Research on ginseng tissue

D: Research on combined administration of ginseng and insulin

E: Research on chaphora ginseng's effect on cell destruction

F: Research on the effects of western ginseng

G: Research on the effects of specific ginseng components on blood and muscle

H: Research on ginseng's effects on blood and cells
 4.4 Performance evaluation
In order to provide a performance evaluation and further evidence of the scalability of our methodology, we set up an evaluation environment with separate web and database servers.

Web server: Intel Pentium D 3.40 GHz, 2.50 GB RAM, Windows XP SP3, Tomcat 5.5

Database server: 2 × Xeon 3.0 GHz, 2 GB RAM, Solaris 9, Oracle 9i

We used the System.currentTimeMillis() function to measure the actual operation times in this environment; the results, averaged over 10 runs, are summarized in Table 3. The vector space formation, correspondence analysis, kernel density estimation, and local maxima detection algorithms each executed within 100 ms. Visualization took about 800 ms, and the vitalization analysis algorithm took almost no time. The total time for information visualization was 1,053 ms.
Table 3. Performance evaluation results
5. DISCUSSION & CONCLUSIONS
In this study, we presented a new keyword-based theme map visualization algorithm. We also suggested a new analysis algorithm to obtain relative vitalization, and we illustrated that it is possible to analyze research trends through the theme map and vitalization information. Using keywords extracted from documents has several advantages for information retrieval, visualization, and analysis over using document references. A keyword-based theme map analysis system can maximize the automated portion of the work because it reduces the need for extra expert knowledge.
If each document is not represented by a high-dimensional vector whose components indicate the document's discriminating words and how those words are connected to all the other topics of interest that span and describe the document collection, then visualization tools may suffer from the bottleneck and inaccurate placement that occur when a landscape visualization like ThemeScape is based on an intervening document projection onto the two-dimensional plane, as occurred with Galaxies [6]. In this paper we adopted correspondence analysis, whose row and column geometries have similar interpretations, and thereby avoided the problems mentioned above.
In the map, height is created in proportion to the density of papers, so we can derive relative vitalization information from the heights. The difference between the ground and the highest peak on the map is divided into three equal bands, and the corresponding two-dimensional positions are calculated. All document positions and all keyword positions are compared against these bands, and the developed system provides the document list and keyword list for each stage. The researcher then analyzes the research trend with the theme map and the vitalization information.
Grasping research trends and deciding on new research areas by analyzing papers manually takes a great deal of time and expert knowledge. In this study, we showed that it takes less time and effort when the theme map and supplementary information, such as vitalization information, are used. We used a TKM paper database to demonstrate the developed algorithms, but the same technique can be applied to patents, project reports, research notes, and so on.
By its nature, this study is exploratory and needs further extension and elaboration in terms of methodology and application. First, the absolute vitalization stage needs to be analyzed; this goal cannot be reached using the theme map's information alone, so extra information is needed. Second, more rigorous testing must be performed to establish trust in these results: materials from various research fields must be evaluated to show that the system is useful, and comparisons with experts' analysis results are needed. Third, CA has the advantage of capturing the relationship between documents and keywords, but like PCA, MDS, SOM, and other methods it suffers from information loss due to dimension reduction; a new algorithm is needed to reduce this loss.
Acknowledgements
This research was supported by the project, “Development integrated infrastructure for research information resource in Korean Traditional Medicine” funded by the Korea Institute of Oriental Medicine (K14141).
BIO
Yea SangJun
He received B.S. and M.S. degrees in computer science from KAIST, Korea, in 2002 and 2004, respectively. Since 2008, he has been with the Information Development & Management Group, Korea Institute of Oriental Medicine.
Kim Chul
He received B.S. and M.S. degrees in industrial engineering from KAIST, Korea, in 1998 and 2000, respectively, and received a Ph.D. in oriental medicine informatics from Wonkwang University, Korea, in 2009. Since 2006, he has been with the Information Development & Management Group, Korea Institute of Oriental Medicine.
References
ASIS&T (2005). Annual Review of Information Science and Technology, Vol. 39. Information Today, Medford, NJ.
Chung W., Chen H., Nunamaker Jr. J. F. (2003). "Business Intelligence Explorer: A Knowledge Map Framework for Discovering Business Intelligence on the Web," Proceedings of the 36th Hawaii International Conference on System Sciences.
Wise J. A., Thomas J. J., Pennock K., Lantrip D., Pottier M., Schur A., Crow V. (1995). "Visualizing the Non-Visual: Spatial Analysis and Interaction with Information from Text Documents," Proceedings of the 1995 IEEE Symposium on Information Visualization, Atlanta, Georgia, 51-58.
Anick P. G., Vaithyanathan S. (1997). "Exploiting Clustering and Phrases for Context-Based Information Retrieval," Proceedings of the ACM SIGIR Annual International Conference on Research and Development in Information Retrieval, 314-323.
Wise J. A. (1999). "The Ecological Approach to Text Visualization," Journal of the American Society for Information Science 50(13), 1224-1233.
Kaski S. (1995). "Data Exploration Using Self-Organizing Maps," Acta Polytechnica Scandinavica, Mathematics, Computing and Management in Engineering, Series 82, 57 pp.
Salton G. (1989). Automatic Text Processing. Addison-Wesley, Reading, MA.
Yang Y. Y., Akers L., Kose T., Yang C. B. (2008). "Text Mining and Visualization Tools - Impressions of Emerging Capabilities," World Patent Information 30(4), 280-293. DOI: 10.1016/j.wpi.2008.01.007.
Kruskal J. B., Wish M. (1978). Multidimensional Scaling. Sage University Paper Series on Quantitative Applications in the Social Sciences. Sage Publications, Beverly Hills and London.
Theodorou Y., Drossos C., Alevizos P. (2007). "Correspondence Analysis with Fuzzy Data: The Fuzzy Eigenvalue Problem," Fuzzy Sets and Systems 158, 704-721. DOI: 10.1016/j.fss.2006.11.011.
Greenacre M. (1983). Theory and Applications of Correspondence Analysis. Academic Press, London.
Shneiderman B. (1996). "The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations," Proceedings of the IEEE Workshop on Visual Languages, 336-343.
Boyack K. W., Wylie B. N., Davidson G. S. (2002). "Domain Visualization Using VxInsight for Science and Technology Management," Journal of the American Society for Information Science and Technology 53(9), 764-774. DOI: 10.1002/asi.10066.
MathWorks website. Available at: <>
Wolfram website. Available at: <>
SoftwareFX website. Available at: <>
TeeChart website. Available at: <>
R Project website. Available at: <>
Dictionary Publishing Committee (2004). Korean-English Dictionary of Oriental Medicine. Jimundang.
Brezinski C., Redivo Zaglia M. (1991). Extrapolation Methods: Theory and Practice. North-Holland.
Wasserman L. (2005). All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics.
Lindeberg T. (1993). "Detecting Salient Blob-Like Image Structures and Their Scales with a Scale-Space Primal Sketch: A Method for Focus-of-Attention," International Journal of Computer Vision 11(3), 283-318. DOI: 10.1007/BF01469346.
Beyer K., Goldstein J., Ramakrishnan R., Shaft U. (1999). "When Is Nearest Neighbor Meaningful?," Proceedings of the 7th International Conference on Database Theory (ICDT).
Courant R., Robbins H., Stewart I. (1996). What Is Mathematics? An Elementary Approach to Ideas and Methods. Oxford University Press, New York.
OASIS website. Available at: <>
Korean Medical Library Engine website. Available at: <>