Document Layout Analysis Based on Fuzzy Energy Matrix
Document Layout Analysis Based on Fuzzy Energy Matrix
International Journal of Contents. 2015. Jun, 11(2): 1-8
Copyright © 2015, The Korea Contents Association
  • Received : January 19, 2015
  • Accepted : April 01, 2015
  • Published : June 28, 2015
Export by style
Cited by
About the Authors
KangHan Oh
SooHyung Kim

In this paper, we describe a novel method for document layout analysis that is based on a Fuzzy Energy Matrix (FEM). A FEM is a two-dimensional matrix that contains the likelihood of text and non-text and is generated through the use of Fuzzy theory. The key idea is to define an Energy map for the document to categorize text and non-text. The proposed mechanism is designed for execution with a low-resolution document image, and hence our method has a fast processing speed. The proposed method has been tested on public ICDAR 2009 datasets to conduct a comparison against other state-of-the-art methods, and it was also tested with Korean documents. The results of the experiment indicate that this scheme achieves superior segmentation accuracy, in terms of both precision and recall, and also requires less time for computation than other state-of-the-art document image analysis methods.
In recently years, high resolution phone cameras have been developed and are widely used in the smart phones to get better image. The document recognition based on smart phone is also attractive theme for smart phone user. Therefore new application focus on algorithm time consuming in order to lunch smart phone environment. In the document recognition field, document image layout analysis is necessary preprocessing part. Usually, a scenario of document recognition consists of three tasks. First, a document layout analysis application segments text and non-text. Second, text lines and words are categorized. Finally, the OCR engine recognizes word information. Therefore layout analysis algorithm should be fast and precise for next part. However traditional document analysis approaches [6] - [10] tend to undervalue time consuming and made algorithm heuristic for increasing performance. So our algorithm considers not only performance but also processing time.
In the past few decades various document layout analysis method have been proposed using different paradigms. There are generally two models for documents analysis: bottom-up and top-down. The bottom-up approach relies on low-level features such as pixel and contented component result. In the document analysis, generally, categorized components can be merged into successively group of blocks represent paragraph or figure [6] - [9] , [11] . Whereas, the top-down approaches divide page into group of blocks, which are successively split into sub-blocks, in an iterative mechanism, to obtain the text, figure [10] . Usually, top-down method is faster than bottom-up but it is less accurate than bottom-up.
In the bottom-up approaches, Koich at al [9] proposed a document analysis algorithm using Voronoi diagram. In order to reduce processing time, they focus on text segmentation because quality of segmented non-text is not important in the OCR engine process for recognition. Therefore our method also does not spend much time for accurately detecting non-text. Laura et al [8] utilized a neuro-fuzzy methodology to segment document. The proposed method strategy is capable of describing the physical structure of document. They extract pixel level features using fuzzy set and then after training, an adaptive neuro fuzzy network algorithm recognizes text and non-text. In [11] used top-down approach based on the connected component extraction. The proposed document analysis model, which preserves top-down generation, is proposed based on which a document is logically represented for interactive editing, storage, retrieval, transfer, and logical analysis. The authors have demonstrated pretty good results. Henry at el [10] extracts background region in the document image using computational-geometry algorithms for off-line enumeration of maximal white rectangles and on-line rectangle unification. This algorithm is simple and fast however algorithm is so heuristic. In the limited datasets, heuristic algorithm can reach to high performance but it cannot expect good performance in different environment.
In our work, the fuzzy theory with rectangular membership function is used for just converting continuous input values to several grade values in order to make FEM. The reason for taking fuzzy theory is that the continuous input values are not suitable for generating FEM.
The remainder of our paper consists of following sections. In Section 2, we introduce the fuzzy set theory. Section 3 presents the method for extracting component level features from connected component algorithm result. Section 4 provides detail for generating fuzzy energy matrix. Section 5 briefly demonstrates post-processing algorithm. In Section 6, we show the experimental results with public datasets, and we conclude our paper in Section 7.
A Fuzzy set is a set whose elements have levels of membership represented by a real number in the interval [0, 1]. In order to represent imprecise vectors, fuzzy set is very powerful tool that uses degree of variables rather than quantitative variables. For example weather condition may have a value such as ‘1: good’ or ‘0: bad’ that are not clearly defined because weather has various conditions. But the fuzzy set can represent binary weather conditions as meaningful classifications [1] . The definition of fuzzy set can be given as follows:
Definition. Let U be the universe of discourse, where U= u 1 , u 2 ,… un . A fuzzy set is determined depending on universe of discourse and sub intervals. The Fuzzy set A can be defined on the universe of discourse follow as;
PPT Slide
Lager Image
Where fA denotes the membership function of the fuzzy set A, fA : U → [0 ,1], and fA ( x,ui ) denotes the degree of membership function of the crisp interval ui with input linguistic value x, and fA ( x,ui ) ∈ [0 1] and 1 ≤ i ≤ n , linguistic values x should be assigned to each fuzzy set [2] . Here fAi ( x,uj ) is a rectangular membership degree and it is shown in equation (2)
PPT Slide
Lager Image
In equation the observations of features are converted to the fuzzy set has got the highest degrees of membership value. In our study, we changed quantitative features of text and non-text into degree of variables using Fuzzy set rules.
The aim of this section is to extract features, which represent text and non-next. To extract features, an input image is down-sampled to 1/5 because the low-resolution document image is enough to segment document and can reduce time expense. We have utilized combined 2 binarization results using otsu [3] and sauvola [4] . Each binarization technique has advantages: the otsu can clearly keep figure region and the sauvola can keep table and separator therefore we used linear combination result with morphological filling operation as an input image. Fig. 1 illustrates this process. Given binary input image, the connected component algorithm is applied to binary image. In order to categorize between text and non-text, we proposed a feature extraction function (3) based on spatial properties of the connected components
PPT Slide
Lager Image
PPT Slide
Lager Image
Initialization input image (a) Otsu (b) Sauvola (c) Linear combining result between Otsu and Sauvola
Where H,W , height and width of component respectively. The α and β are weight value. The function F is designed to increase energy when component is non-text. Usually non-next has not uniformly variation in comparison with text and gap of scalar between H and W is bigger than text. The function F depends on H value because our input image is low-resolution image so we often see distorted width features from touched small text horizontally whereas most of height features keep its property despite low-resolution environment as in Fig. 2 . In this part, we just extract elemental features based on geometrical properties so the function F is not enough for categorizing component into text and non-text. The reason is because feature of several non-texts have similar features with big text. This problem is main challenge task in the document analysis field. We will figure out problem using Fuzzy Energy Matrix in the next section.
PPT Slide
Lager Image
Binarization with low-resolution image (a) original (b) binary result.
In this section, the ultimate goal is to generate the Fuzzy Energy Matrix to classify text and non-text. The FEM is two-dimensional matrix representing the grade of membership of F and location in the document. As mentioned in the previous section, the function F is not enough to recognize between big text and figure so we included a location feature in the document. These combining features can provide that how many components, which have similar F , are located in similar level space to us. The location of components is an efficient feature for recognizing big text from figure because most of big texts are arranged in a row. Therefore, we will generate the FEM using based on mentioned properties.
- 4.1 Fuzzification
After connected component algorithm, we extracts two features, which are height coordinate of vector and F , and then extracted quantitative features are transformed into “grade of membership” using universe of discourse and sub intervals. In the fuuzy set theory, this process is called “Fuzzification”. The lengths of intervals used in partitioning universe of discourse are determined subjectively. The following Fuzzification algorithm can be given:
  • Step 1. Define the two universe of discourse and subintervals. Based on document height.
PPT Slide
Lager Image
Where h is document height. The U and V are universe of discourses based on function F and image location, respectively.
Step 2. The length of intervals is defined as n = 15 and equation can be defined follows;
PPT Slide
Lager Image
  • vi,ui= [L_boundi,R_boundi]
Where Ud represents each universe of discourse.
[ Rboundi ] and [ Lboundi ] are boundary (right/ left) of intervals.
Step 3. We fuzzified input features, and pseudocode can be given:
PPT Slide
Lager Image
The fuzzified vectors ( Fv 1 , Fv 2 ) based on the length of intervals are applied to generate Fuzzy Energy Matrix in the next section.
- 4.2 Matrix Generation
The Fuzzy Energy Matrix represents 2-dimensional probability matrix for classifying between text and non-text based on Fuzzified vectors from previous section. In order to generate FEM, first we make a n ×n matrix which represents fuzzy grade for F, location . And then whole of fuzzified vectors are assigned to level of matrix and then we count frequency corresponding to each grade. In the Table 1 , we can observe that how many components are located in each fuzzy grade and assume that fuzzy level space including large energy is text. Whereas isolated small energy is non-text because when similar F values are located in similar grade in the Fuzzy Energy Matrix, it is strong feature representing the text. The integrated values in Table 2 are normalized to range 0~1 using sigmoid function [5] . The normalization equation can be given:
PPT Slide
Lager Image
Fuzzy Energy Matrix with Document inFig. 3
PPT Slide
Lager Image
Fuzzy Energy Matrix with Document in Fig. 3
Normalization Fuzzy Energy Matrix using Sigmoid inFig. 3
PPT Slide
Lager Image
Normalization Fuzzy Energy Matrix using Sigmoid in Fig. 3
Where γ is gradient of curve and X is Threshold value considering text feature. In this study, we have used a X =4. If number of same F values are detected in same location more than X =4, it indicates a high probability with text. A pseudocode for FEM process can be given:
PPT Slide
Lager Image
In Fig. 3 , we can see segmented non-text energy with green coordinates from the Table 1 , Table 2 . And the Fig. 3 (c) is a document energy corresponding to n_FEM . The text and non-text results are defined using document energy with T=0.5 (energy >0.5: text, energy <0 .5: non-text).
PPT Slide
Lager Image
Classification text using FEM (a) original (b) binary result (c) Energy map with 3D (d) text segmentation result (e) non-text segmentation result (green coordinates are fuzzified values * we can check energy in the Table 1, Table 2)
In this section, the main focus was introducing on five tasks which are Table detection, noise, separator, generating boundary for text.
- 5.1 Text boundary
In order to generate boundary region for text, we proposed a background extraction algorithm based on projection technique. This process is as follows:
  • Step 1. The morphology horizontal dilation algorithm is applied to the segmented text results (Fig. 4(a)).
  • Step 2. In the dilation result, projection technique with horizo-ntal and vertical are utilized for extracting background. The stop condition is that when projection algorithm may be faced with text pixel. And then if length of projection result is more than (image height/2, width/2), the result is considered background.Fig. 4shows estimated background.
  • Step 3. Finally, we observe non-text pixels in the black regionFig. 4(c). If there are no non-text pixels, we made rectangle around text in the observing black region whereas the morphology result is just utilized for case of observed non-text.
PPT Slide
Lager Image
Text boundary generation (a) segmented text (b) morphology result (c) background extraction (d) final result
- 5.2 Heuristic filter for Non-text
First, in order to segment Table, our algorithm examines components, which has fuzzy grade u1 , are existed in the non-text region because most of table has a number of similar components arranged in the non-text region uniformly. In this study, when number of counted component u1 in the non-text more then 10, it considers table. When table doesn’t have lines, this way has strong advantage compare to traditional methods because we easily utilized geometric information of words arranged table using fuzzified values. Second, the separator is 2-type which are horizontal and vertical. A ratio between horizontal and vertical is utilized to detect separator. If ratio less than 0.2, we have considered the separator. Finally, the noise components are eliminated according to size of component.
We have implemented the proposed algorithm using MATLAB 2012 on an Intel(R) Core(TM) i5-4670 CPU 3.40GHz. The performance of the proposed method is evaluated on public ICDAR 2009 datasets because they contain 55 document images and have been utilized in most comparative works [12] . Also we have utilized our 100 document images, which are 50 English and 50 Korean for evaluation with ground truth mask.
The DICE system comprises two main steps. First each pixel is categorized primarily into machine-printed text. Second, isolated vectors are eliminated. Third, open and close operations are used to remove small regions. Finally, interior pixels are removed and contours of polygons are extracted [12] . The Fraunhofer system includes hybrid methods, which are separator detection, page segmentation and Text line and region extraction [11] . The Tesseract is bottom-up method submitted by Ray Smith of Google Inc. This method includes binary morphology and connected component analysis, to estimate the type, which are text, image, separator and unknown. Two of the key methods employed include neighbourhood stroke-width measurement and appropriateness of overlap between adjacent connected components [13] .
Fig. 5 demonstrates the performance using F-measure result with other state-of-the-art methods. As mentioned previously, our algorithm focused on processing time however our performance is also good as compare to traditional methods in the fig. 5 , fig. 6 and average of proposed algorithm processing time is a 1.4 second in the MATLAB environment. Unfortunately, other state-of-the-art methods didn’t mention their processing time. However the state-of-the-art methods are operated on the full-resolution image so we can guess that their processing time may well exceed 3 second because while implementing connected component and binarization algorithm in the full-resolution document, algorithm requires processing time of at least 2 second in the ICDAR 2009 datasets (PRIMA). Whereas proposed algorithm have used low-resolution document image (scale 1/5) to decrease time consuming and our performance is not so bad. Fig. 5 shows proposed algorithm performance with other state-of-the-art methods using F-measure.
PPT Slide
Lager Image
F-measure result with ICDAR 2009 datasets
PPT Slide
Lager Image
PRIMA measure with ICDAR 2009 datasets
In the Fig. 6 , the PRIMA measure for different region types for the four submitted methods. The PRIMA Lab (University of Salford, Manchester, United Kingdom) provides evaluation tool for PRMIA measure.
PPT Slide
Lager Image
F-measure with 100 datasets with Korean and English
Fig. 8 illustrates visualization results of proposed method with 3D maps
PPT Slide
Lager Image
Visualization results of proposed method with 3D map (a) ICDAR 2009 (b) Our datasets.
We have presented a document image analysis method using FEM with low-resolution image. In particular, our algorithm performance and processing time are good compare to the state-of-art-methods however there are several drawbacks. First, the FEM algorithm can only detect text and non-text, in order to detect, table, separator, noise, it depends on heuristic post-processing algorithm. As mentioned previously, a heuristic algorithm is not best way. Therefore we will figure out the first drawback using improvement in F equation. Second, proposed algorithm performance depends on initial binarization result. When character are touched each other in the binarization result, it may cause a critical effect on our proposed mechanism. In order to overcome second problem, we need extract a reasonable feature for recognizing touched character in the future works.
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2015-018993).
Kang Han Oh
He received the B.S in Computer Science from Honam University in 2010. And he received the M.S in Electronic & Computer Engineering at Chonnam National University in 2013. He has been taking the Ph.D course in Electronics & Computer Engineering at Chonnam National University, Korea. His research interests are pattern recognition, machine learning and Image processing.
Soo Hyung Kim
He received his B.S degree in Computer Engineering from Seoul National University in 1986, and his M.S and Ph.D degrees in Computer Science from Korea Advanced Institute of Science and Technology in 1988 and 1993 respectively. From 1990 to 1996, he was a senior member of research staff in Multimedia Research Center of Samsung Electronics Co., Korea. Since 1997, he has been a professor in the Department of Computer Science, Chonnam National University, Korea. His research interests are pattern recognition, document image processing, medical image processing, and ubiquitous computing.
Bhupendra Kumar P. , Sanjay S. 2014 “Integrated Fuzzy-HMM for project uncertainties in time-cost tradeoff problem,” J. Applied Soft Computing 21 320 - 329    DOI : 10.1016/j.asoc.2014.03.035
Chen S. M. 1996 “Forecasting enrollments based on fuzzy time series,” Fuzzy Sets and Systems 81 311 - 319    DOI : 10.1016/0165-0114(95)00220-0
Otsu N. 1975 “A Threshold Selection method from Gray Level Histogram,” IEEE Transactions on Systems, Man and Cybernetics 9 (1) 62 - 66
Sauvola J. , Pietikainen M. 2000 “Adaptive document image binarization,” J. The Journal of the Pattern Recognition 225 - 236    DOI : 10.1016/S0031-3203(99)00055-2
Fletcher A. , Kasturi R. 1998 “A robust algorithm for text string separation from mixed text/graphics images,” IEEE Trans. Pattern Anal. Mach. Intell. 10 294 - 308
Anil K. J. , Bin Y. 1998 “Document representation and its application to page decomposition,” IEEE Trans. Pattern Anal. Mach. Intell. 20 294 - 308    DOI : 10.1109/34.667886
O’Gorman D. L. 1993 “The document spectrum for page layout analysis,” IEEE Trans. Pattern Anal. Mach. Intell. 15 1162 - 1173    DOI : 10.1109/34.244677
Laura C. , Ciro C. , Przemyslaw G. 2008 “Document page segmentation using neuro-fuzzy approach,” Applied Soft Computing 8 118 - 126    DOI : 10.1016/j.asoc.2006.11.008
Kise K. , Sato A. , Iwata M. 1998 “Segmentation of page images using the area Voronoi diagram,” In Computer Vision Image Understanding 70 (3) 370 - 382    DOI : 10.1006/cviu.1998.0684
Baird D. H. S. 1994 “Backgroud structure in document images,” Int. J. Pattern Recognition Artif. Intell. 8 1013 - 1030    DOI : 10.1142/S0218001494000516
Anil K. J. , Bin Y. 1998 “Document representation and its application to pagedecomposition,” IEEE Trans. Pattern Anal. Mach. Intell. 20 294 - 308    DOI : 10.1109/34.667886
Antonacopoulos A. , Pletschacher S. , Bridson D. , Papadopoulos C. “ICDAR2009 Page Segmentation Competition” Proc. ICDAR 2009 1370 - 1374
Smith Ray “Hybrid page Layout Analysis via Tab-Stop Detection” Proc. ICDAR Barcelona, Spain 2009 241 - 245