In this paper, we describe a novel method for document layout analysis based on a Fuzzy Energy Matrix (FEM). An FEM is a two-dimensional matrix that contains the likelihood of text and non-text and is generated using fuzzy set theory; the key idea is to define an energy map of the document in order to categorize text and non-text. The proposed mechanism is designed to run on a low-resolution document image, and hence has a fast processing speed. The method has been tested on the public ICDAR 2009 dataset for comparison against other state-of-the-art methods, and also on Korean documents. The experimental results indicate that the scheme achieves superior segmentation accuracy, in terms of both precision and recall, while requiring less computation time than other state-of-the-art document image analysis methods.
1. INTRODUCTION
In recent years, high-resolution cameras have become standard in smartphones, and document recognition on smartphones is an attractive application for users. New applications must therefore pay attention to algorithmic time consumption in order to run in a smartphone environment. In the document recognition field, document layout analysis is a necessary preprocessing step. A document recognition scenario usually consists of three tasks. First, a layout analysis application segments text and non-text. Second, text lines and words are categorized. Finally, an OCR engine recognizes the word information. The layout analysis algorithm should therefore be fast and accurate for the subsequent stages. However, traditional document analysis approaches tend to undervalue time consumption and rely on heuristics to increase performance. Our algorithm considers not only performance but also processing time.
Over the past few decades, various document layout analysis methods have been proposed using different paradigms. There are generally two models for document analysis: bottom-up and top-down. Bottom-up approaches rely on low-level features such as pixels and connected-component results; categorized components are successively merged into groups of blocks that represent paragraphs or figures. Top-down approaches instead divide the page into groups of blocks, which are iteratively split into sub-blocks to obtain the text and figures. Top-down methods are usually faster than bottom-up methods, but less accurate.
Among the bottom-up approaches, Kise et al. proposed a document analysis algorithm using the area Voronoi diagram. To reduce processing time, they focus on text segmentation, because the quality of segmented non-text is not important for the OCR recognition stage; for the same reason, our method also does not spend much time on accurately detecting non-text. Caponetti et al. utilized a neuro-fuzzy methodology to segment documents. Their strategy is capable of describing the physical structure of a document: pixel-level features are extracted using fuzzy sets, and after training, an adaptive neuro-fuzzy network recognizes text and non-text. Jain et al. used a top-down approach based on connected component extraction; their document analysis model preserves top-down generation, so that a document is logically represented for interactive editing, storage, retrieval, transfer, and logical analysis, and they demonstrated good results. Henry et al. extract the background region of a document image using computational-geometry algorithms for off-line enumeration of maximal white rectangles and on-line rectangle unification. This algorithm is simple and fast, but also strongly heuristic: on a limited dataset a heuristic algorithm can reach high performance, but good performance cannot be expected in a different environment.
In our work, fuzzy theory with a rectangular membership function is used simply to convert continuous input values into several grade values in order to build the FEM. We adopt fuzzy theory because continuous input values are not suitable for generating the FEM directly.
The remainder of this paper consists of the following sections. Section 2 introduces fuzzy set theory. Section 3 presents the method for extracting component-level features from the connected component results. Section 4 details the generation of the Fuzzy Energy Matrix. Section 5 briefly describes the post-processing algorithm. Section 6 shows the experimental results on public datasets, and Section 7 concludes the paper.
2. FUZZY THEORY
A fuzzy set is a set whose elements have degrees of membership represented by a real number in the interval [0, 1]. Fuzzy sets are a powerful tool for representing imprecise quantities, because they use degrees of membership rather than crisp quantitative variables. For example, a weather condition cannot be cleanly assigned a value such as '1: good' or '0: bad', because weather takes many intermediate states; a fuzzy set, however, can turn such binary conditions into meaningful graded classifications. The definition of a fuzzy set can be given as follows:
Let U be the universe of discourse, where U = {u1, u2, ..., un}. A fuzzy set is determined by the universe of discourse and its subintervals. The fuzzy set A on the universe of discourse is defined as follows:
A = {(x, μA(x)) | x ∈ U},   (1)
where μA denotes the membership function of the fuzzy set A, μA : U → [0, 1]; μA(ui) denotes the degree of membership of the crisp interval ui for the input linguistic value x, with μA(ui) ∈ [0, 1] and 1 ≤ i ≤ n. Each linguistic value x is assigned to a fuzzy set, and μA(ui) is a rectangular membership degree, as shown in equation (2):
μA(ui) = 1 if x ∈ ui, and 0 otherwise.   (2)
Through equation (2), each observed feature is assigned to the fuzzy set in which it has the highest degree of membership. In our study, we change the quantitative features of text and non-text into graded variables using these fuzzy set rules.
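As a concrete illustration, the rectangular membership degree of equation (2) can be sketched in a few lines of Python (the interval boundaries used in the test are arbitrary example values, not the paper's):

```python
def rect_membership(x, left, right):
    """Rectangular membership degree: 1 if x falls in the crisp
    subinterval [left, right), 0 otherwise (cf. equation (2))."""
    return 1.0 if left <= x < right else 0.0
```

With a uniform partition of the universe of discourse, each observation thus receives degree 1 in exactly one subinterval and 0 everywhere else.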
3. FEATURE EXTRACTION
The aim of this section is to extract features that represent text and non-text. To extract features, the input image is first down-sampled to 1/5 scale, because a low-resolution document image is sufficient for segmentation and reduces the time expense. We combine two binarization results, obtained with Otsu's method and Sauvola's method. Each binarization technique has its own advantage: Otsu cleanly preserves figure regions, while Sauvola preserves tables and separators. We therefore use their linear combination, followed by a morphological filling operation, as the input image.
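A minimal sketch of this combination step, assuming a grayscale input in [0, 255]; the Sauvola parameters k = 0.2, R = 128 and the window size 15 are common defaults, not values given in the paper:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def otsu_threshold(img):
    """Global Otsu threshold: maximize between-class variance of the histogram."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    total, sum_total = img.size, np.dot(np.arange(256), hist)
    sum_b = w_b = 0.0
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_b += hist[t]
        w_f = total - w_b
        if w_b == 0 or w_f == 0:
            continue
        sum_b += t * hist[t]
        m_b, m_f = sum_b / w_b, (sum_total - sum_b) / w_f
        var = w_b * w_f * (m_b - m_f) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def sauvola_threshold(img, win=15, k=0.2, R=128.0):
    """Local Sauvola threshold surface from the windowed mean and std."""
    f = img.astype(np.float64)
    mean = uniform_filter(f, win)
    std = np.sqrt(np.maximum(uniform_filter(f * f, win) - mean * mean, 0))
    return mean * (1 + k * (std / R - 1))

def combined_binarization(img):
    """Foreground mask: union of the Otsu and Sauvola foregrounds."""
    return (img <= otsu_threshold(img)) | (img < sauvola_threshold(img))
```

The union keeps both the large flat regions that Otsu captures and the locally contrasted strokes (tables, separators) that Sauvola captures, before the morphological filling step.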
The initialization figure illustrates this process. Given the binary input image, the connected component algorithm is applied. In order to categorize text versus non-text, we propose a feature extraction function (3) based on the spatial properties of the connected components.
Initialization of the input image: (a) Otsu (b) Sauvola (c) linear combination of the Otsu and Sauvola results
Here the arguments of F are the height and width of a component, respectively, and the remaining coefficients are weight values. The function F is designed to produce higher energy when a component is non-text: non-text usually varies less uniformly than text, and the gap between its height and width values is larger than for text. F depends mainly on the height value, because the input is a low-resolution image: width features are often distorted when small characters touch horizontally, whereas most height features retain their properties even in a low-resolution environment, as shown in the binarization figure below. In this part we extract only elemental features based on geometrical properties, so the function F alone is not enough to categorize a component as text or non-text, because several non-text components have features similar to those of large text. This problem is a main challenge in the document analysis field; we resolve it using the Fuzzy Energy Matrix in the next section.
Binarization with low-resolution image (a) original (b) binary result.
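The component-level extraction described above can be sketched as follows; since the exact form of the feature function (3) is not reproduced here, the weighting below (height plus a weighted height/width gap) is only an illustrative stand-in, and the function name is ours:

```python
import numpy as np
from scipy.ndimage import label, find_objects

def component_features(binary, w1=1.0, w2=0.5):
    """Label connected components in a binary image and return, for each,
    (height, width, F), where this illustrative F grows with the
    height/width gap so that non-text components receive more energy."""
    labeled, n = label(binary)
    feats = []
    for sl in find_objects(labeled):
        h = sl[0].stop - sl[0].start  # bounding-box height
        w = sl[1].stop - sl[1].start  # bounding-box width
        feats.append((h, w, w1 * h + w2 * abs(h - w)))
    return feats
```

The height term dominates by design, matching the observation that height survives the 1/5 down-sampling better than width.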
4. FUZZY ENERGY MATRIX (FEM)
In this section, the ultimate goal is to generate the Fuzzy Energy Matrix for classifying text and non-text. The FEM is a two-dimensional matrix representing the grade of membership of the feature value F and of the location in the document. As mentioned in the previous section, the function F alone is not enough to distinguish large text from figures, so we additionally include a location feature. The combined features tell us how many components with similar F values are located at a similar level of space. The location of components is an efficient feature for distinguishing large text from figures, because most large text is arranged in rows. Therefore, we generate the FEM based on these properties.
- 4.1 Fuzzification
After the connected component algorithm, we extract two features for each component, its vertical coordinate and its F value, and then transform these quantitative features into grades of membership using a universe of discourse and subintervals. In fuzzy set theory this process is called "fuzzification". The lengths of the intervals used to partition the universe of discourse are determined subjectively. The fuzzification algorithm is as follows:
Step 1. Define the two universes of discourse and their subintervals based on the document height h: one universe of discourse corresponds to the function F and the other to the image location.
Step 2. Define the length of the intervals as 15; each universe of discourse Ud is then partitioned into subintervals, each specified by its left and right boundaries.
Step 3. Fuzzify the input features; the pseudocode can be given as follows:
The fuzzified vectors, indexed according to the length of the intervals, are used to generate the Fuzzy Energy Matrix in the next section.
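With the fixed interval length of 15 from Step 2, fuzzification reduces to binning each feature into its subinterval index; a minimal sketch (the function name is ours):

```python
def fuzzify(value, interval_len=15):
    """Return the index of the subinterval in which `value` has
    rectangular membership degree 1, given a uniform partition of the
    universe of discourse into intervals of length `interval_len`
    (15 in the paper)."""
    return int(value // interval_len)
```

Both the F value and the vertical location of a component are fuzzified this way, yielding the pair of grid indices used to build the FEM.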
- 4.2 Matrix Generation
The Fuzzy Energy Matrix is a two-dimensional probability matrix for classifying text versus non-text, built from the fuzzified vectors of the previous section. To generate the FEM, we first make a matrix whose axes represent the fuzzy grades of F and of location. All fuzzified vectors are then assigned to cells of this matrix, and we count the frequency corresponding to each grade. In the resulting matrix we can observe how many components fall into each fuzzy grade; we assume that a fuzzy-level cell containing large energy is text, whereas isolated small energy is non-text, because similar F values located at similar grades in the Fuzzy Energy Matrix are a strong indicator of text. The integrated values are normalized to the range 0 to 1 using a sigmoid function. The normalization equation can be given:
Fuzzy Energy Matrix with Document in Fig. 3
Normalization Fuzzy Energy Matrix using Sigmoid in Fig. 3
Here the first parameter is the gradient of the sigmoid curve, and the second is a threshold value reflecting the text feature. In this study we set the threshold to 4: if more than four components with the same F value are detected at the same location, there is a high probability of text. Pseudocode for the FEM process can be given:
In the figure, the segmented non-text energy is shown with green coordinates, and panel (c) shows the document energy corresponding to the input. The text and non-text results are determined from the document energy with a threshold of 0.5 (energy > 0.5: text; energy < 0.5: non-text).
Classification of text using the FEM: (a) original (b) binary result (c) energy map in 3D (d) text segmentation result (e) non-text segmentation result (green coordinates are fuzzified values; the energies can be checked in Table 1 and Table 2)
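Putting the pieces together, FEM generation and sigmoid normalization can be sketched as below; the interval length 15 and the count threshold 4 follow the paper, while the sigmoid gain and the function names are our assumptions:

```python
import numpy as np

def fuzzy_energy_matrix(components, interval_len=15, count_thresh=4, gain=1.0):
    """Build the FEM from (F_value, y_location) pairs: count how many
    components fall into each (F grade, location grade) cell, then squash
    the counts into [0, 1] with a sigmoid centred on `count_thresh`, so
    cells holding more than four similar components approach 1 (text)."""
    fs = [int(f // interval_len) for f, _ in components]
    ys = [int(y // interval_len) for _, y in components]
    fem = np.zeros((max(fs) + 1, max(ys) + 1))
    for i, j in zip(fs, ys):
        fem[i, j] += 1  # frequency per fuzzy grade
    return 1.0 / (1.0 + np.exp(-gain * (fem - count_thresh)))
```

A component is finally labelled text when its cell energy exceeds 0.5, matching the 0.5 rule above.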
5. POST-PROCESSING
In this section, the main focus is on four post-processing tasks: table detection, noise removal, separator detection, and generating boundaries for text.
- 5.1 Text boundary
In order to generate a boundary region for each text block, we propose a background extraction algorithm based on a projection technique. The process is as follows:
Step 1. A horizontal morphological dilation is applied to the segmented text results (Fig. 4(a)).
Step 2. On the dilation result, horizontal and vertical projections are used to extract the background. A projection run stops when it meets a text pixel; if the length of the resulting run is more than half the image height (or width), it is considered background. Fig. 4 shows the estimated background.
Step 3. Finally, we check for non-text pixels in the black region of Fig. 4(c). If there are no non-text pixels, we draw a rectangle around the text in the observed black region; otherwise, the morphology result itself is used for the case of observed non-text.
Text boundary generation (a) segmented text (b) morphology result (c) background extraction (d) final result
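Step 2's projection scan can be sketched as follows, under the assumption that the mask is a boolean NumPy array and that, for brevity, runs are measured from the top and left borders only (the paper's scan covers both directions):

```python
import numpy as np

def extract_background(text_mask):
    """Mark as background each run of non-text pixels that starts at the
    left (resp. top) border and is longer than half the image width
    (resp. height), following Step 2's stopping condition."""
    h, w = text_mask.shape
    bg = np.zeros_like(text_mask, dtype=bool)
    for y in range(h):                      # horizontal projection per row
        xs = np.flatnonzero(text_mask[y])
        run = int(xs[0]) if xs.size else w  # stop at the first text pixel
        if run > w // 2:
            bg[y, :run] = True
    for x in range(w):                      # vertical projection per column
        ys = np.flatnonzero(text_mask[:, x])
        run = int(ys[0]) if ys.size else h
        if run > h // 2:
            bg[:run, x] = True
    return bg
```

The complement of the extracted background then delimits candidate text boundary rectangles for Step 3.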
- 5.2 Heuristic filter for Non-text
First, in order to detect tables, our algorithm examines whether components of the same fuzzy grade exist in a non-text region, because most tables contain many similar components arranged uniformly. In this study, when the number of such components counted inside a non-text region is more than 10, the region is considered a table. When a table has no ruling lines, this approach has a strong advantage over traditional methods, because the geometric information of the words arranged in the table is easily obtained from the fuzzified values. Second, separators come in two types, horizontal and vertical; the ratio between the horizontal and vertical extents is used to detect them, and if the ratio is less than 0.2 we consider the component a separator. Finally, noise components are eliminated according to component size.
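The three filters above can be condensed into a sketch like the following; the thresholds 10 (table) and 0.2 (separator) come from the text, while the function shape and the noise size limit are our assumptions:

```python
def filter_nontext(height, width, same_grade_count=0, min_size=3):
    """Classify a non-text region using the Section 5.2 heuristics:
    many same-grade components -> table; a very elongated aspect ratio
    -> separator; a tiny component -> noise; anything else stays non-text."""
    if same_grade_count > 10:
        return "table"
    if min(height, width) / max(height, width) < 0.2:
        return "separator"
    if max(height, width) < min_size:
        return "noise"
    return "non-text"
```

Because the table test reuses the fuzzified grades, line-less tables are still caught, which is the advantage claimed over traditional line-based detectors.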
6. EXPERIMENTAL RESULTS
We implemented the proposed algorithm in MATLAB 2012 on an Intel(R) Core(TM) i5-4670 CPU at 3.40 GHz. The performance of the proposed method is evaluated on the public ICDAR 2009 dataset, which contains 55 document images and has been used in most comparative works. We also evaluated 100 of our own document images, 50 English and 50 Korean, against ground-truth masks.
The DICE system comprises four main steps: first, each pixel is classified, primarily into machine-printed text; second, isolated vectors are eliminated; third, open and close operations remove small regions; finally, interior pixels are removed and the contours of polygons are extracted. The Fraunhofer system is a hybrid of separator detection, page segmentation, and text line and region extraction. Tesseract is a bottom-up method submitted by Ray Smith of Google Inc.; it uses binary morphology and connected component analysis to estimate the region type (text, image, separator, or unknown), and two of its key techniques are neighbourhood stroke-width measurement and assessing the appropriateness of overlap between adjacent connected components.
The F-measure results demonstrate the performance relative to other state-of-the-art methods. As mentioned previously, our algorithm focuses on processing time, yet its accuracy is also competitive with the traditional methods, and the average processing time of the proposed algorithm is 1.4 seconds in the MATLAB environment. Unfortunately, the other state-of-the-art methods did not report their processing times. However, those methods operate on full-resolution images, so we can estimate that their processing time may well exceed 3 seconds: implementing only the connected component and binarization algorithms on full-resolution documents already requires at least 2 seconds on the ICDAR 2009 (PRIMA) dataset. The proposed algorithm instead uses a low-resolution document image (scale 1/5) to decrease time consumption while maintaining good performance.
The table of F-measure results shows the proposed algorithm's performance alongside the other state-of-the-art methods.
F-measure result with ICDAR 2009 datasets
PRIMA measure with ICDAR 2009 datasets
The table reports the PRIMA measure for different region types for the four submitted methods. The PRIMA Lab (University of Salford, Manchester, United Kingdom) provides the evaluation tool for the PRIMA measure.
F-measure results on our 100-document Korean and English dataset
The figure illustrates visualization results of the proposed method with 3D maps.
Visualization results of proposed method with 3D map (a) ICDAR 2009 (b) Our datasets.
7. CONCLUSION
We have presented a document image analysis method that applies an FEM to a low-resolution image. Both the accuracy and the processing time of our algorithm compare well against the state-of-the-art methods; however, several drawbacks remain. First, the FEM algorithm by itself can only detect text and non-text; the detection of tables, separators, and noise depends on a heuristic post-processing algorithm, and as mentioned previously, a heuristic algorithm is not the best approach. We plan to address this first drawback by improving the feature extraction function. Second, the performance of the proposed algorithm depends on the initial binarization result: when characters touch each other after binarization, the effect on the proposed mechanism can be critical. To overcome this second problem, we need to extract a reasonable feature for recognizing touched characters in future work.
ACKNOWLEDGEMENT
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (2015-018993).
Kang Han Oh
He received the B.S. in Computer Science from Honam University in 2010 and the M.S. in Electronics & Computer Engineering from Chonnam National University in 2013. He is currently pursuing a Ph.D. in Electronics & Computer Engineering at Chonnam National University, Korea. His research interests are pattern recognition, machine learning, and image processing.
Soo Hyung Kim
He received his B.S. degree in Computer Engineering from Seoul National University in 1986, and his M.S. and Ph.D. degrees in Computer Science from the Korea Advanced Institute of Science and Technology in 1988 and 1993, respectively. From 1990 to 1996, he was a senior member of the research staff at the Multimedia Research Center of Samsung Electronics Co., Korea. Since 1997, he has been a professor in the Department of Computer Science, Chonnam National University, Korea. His research interests are pattern recognition, document image processing, medical image processing, and ubiquitous computing.
REFERENCES
Bhupendra Kumar P., "Integrated Fuzzy-HMM for project uncertainties in time-cost tradeoff problem," Applied Soft Computing. DOI: 10.1016/j.asoc.2014.03.035
"A Threshold Selection Method from Gray Level Histograms," IEEE Transactions on Systems, Man and Cybernetics.
"A robust algorithm for text string separation from mixed text/graphics images," IEEE Trans. Pattern Anal. Mach. Intell.
Anil K. J., "Document representation and its application to page decomposition," IEEE Trans. Pattern Anal. Mach. Intell. DOI: 10.1109/34.667886
O'Gorman D. L., "The document spectrum for page layout analysis," IEEE Trans. Pattern Anal. Mach. Intell. DOI: 10.1109/34.244677
"Document page segmentation using neuro-fuzzy approach," Applied Soft Computing. DOI: 10.1016/j.asoc.2006.11.008
"Segmentation of page images using the area Voronoi diagram," Computer Vision and Image Understanding. DOI: 10.1006/cviu.1998.0684
"ICDAR 2009 Page Segmentation Competition."
"Hybrid Page Layout Analysis via Tab-Stop Detection."