GPU-Based Optimization of Self-Organizing Map Feature Matching for Real-Time Stereo Vision
GPU-Based Optimization of Self-Organizing Map Feature Matching for Real-Time Stereo Vision
Journal of Information and Communication Convergence Engineering. 2014. Jun, 12(2): 128-134
Copyright © 2014, The Korea Institute of Information and Commucation Engineering
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License ( which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
  • Received : December 09, 2013
  • Accepted : April 09, 2014
  • Published : June 30, 2014
Export by style
Cited by
About the Authors
Kajal Sharma
Inkyu Moon

In this paper, we present a graphics processing unit (GPU)-based matching technique for the purpose of fast feature matching between different images. The scale invariant feature transform algorithm developed by Lowe for various feature matching applications, such as stereo vision and object recognition, is computationally intensive. To address this problem, we propose a matching technique optimized for GPUs to perform computations in less time. We optimize GPUs for fast computation of keypoints to make our system quick and efficient. The proposed method uses a self-organizing map feature matching technique to perform efficient matching between the different images. The experiments are performed on various image sets to examine the performance of the system under varying conditions, such as image rotation, scaling, and blurring. The experimental results show that the proposed algorithm outperforms the existing feature matching methods, resulting in fast feature matching due to the optimization of the GPU.
Feature selection and matching is a key component in many computer vision tasks, such as path finding, obstacle detection, navigation, and stereo vision [1 - 4] . Several strategies have been already proposed for keypoint detection [5 , 6] . Schmid and Mohr [7] used Harris corners as interest points in image recognition problems to match features against a large database of images. This method allows features to be matched under arbitrary orientation changes, but it is sensitive to image scaling. Lowe [8] proposed a scale-invariant feature transform (SIFT) descriptor for the extraction of interest points from an image that is invariant to both scale and rotation. The SIFT technique, however, uses a 128-dimensional vector to describe the SIFT feature that is computationally intensive. In a recent research [9] , we have implemented an efficient feature matching technique, which is faster than Lowe’s SIFT, with a selforganizing map (SOM) that can be used for real-time stereo matching applications. In this paper, we extended our research and implement the proposed method with graphics processing units (GPUs) to further optimize the execution time.
Due to the advancements of parallel processing techniques, multi-core CPU methods are widely used to accelerate computationally intensive tasks [10] . Modern programmable graphics hardware contains powerful co-processing GPUs with a peak performance of hundreds of GigaFLOPS, which is an order of magnitude higher than that of the CPUs [11] . For accelerating computer vision applications, many researchers are now exploiting parallelism supported by modern programmable graphics hardware that provides a great scope for acceleration to run computations in parallel [12 , 13] . Some researchers also use specialized and reconfigurable hardware to speed up these algorithms [14 - 16] . One example is discovering concurrency in parallel computing, where coloring is used to identify subtasks that can be carried out or data elements that can be updated simultaneously [17] . Yet another example of coloring is the efficient computation of sparse derivative matrices [18] . With the increasing programmability and computational power of GPUs, the recent work by Sinha et al. [10] accelerates some parts of the SIFT algorithm by using the hardware capacities of the GPU. A 10× speedup was obtained, which makes this technique feasible for video applications [10] . A variety of computer vision algorithms have been parallelized, providing significant acceleration to the computation [10 , 12 , 13] .
In this paper, we present a technique that is designed to achieve fast feature matching in images with the use of neural networks and GPUs. Our contribution is the proposal of a GPU-optimized matching technique based on Kohonen’s SOM [19] . The presented method provides a significant reduction of computation time as compared to Lowe’s SIFT. In our approach, the scale space for keypoint extraction is configured in parallel for detecting the candidate points among which the number of keypoints is reduced with the SOM neural network. The descriptor vector generation is accelerated by the GPU, and matching is accomplished on the basis of competitive learning. The key idea is to optimize keypoint extraction with a GPU and to reduce the descriptor size with the winning calculation method. Similar winning pixels in the images are found and associated to accomplish feature matching. The proposed method using a GPU is faster as multi-processing is used.
The rest of this paper is organized as follows: Section II gives a brief overview of stereo vision. The procedure of feature matching with the GPU-optimized method is presented in Section III. Experimental results are shown in Section IV, while Section V presents the conclusions.
Stereo vision is based on acquisition of three-dimensional (3D) data from different views obtained by a single moving camera or a fixed arrangement of several cameras. The 3D position of an object is determined by triangulating the optical rays from at least two views of the same object point. Optical axes of two camera-lens units are configured in parallel, and the straight line joining the two optical centers is parallel to each image’s horizontal line in order to meet the epipolar constraint ( Fig. 1 ). The 3D image data are obtained on the basis of the position of the points in the left and the right images [20] .
PPT Slide
Lager Image
Stereo vision system acquiring three-dimensional (3D) object point.
The coordinates of a 3D point P of the object ( X , Y , Z ) are as follows:
PPT Slide
Lager Image
where e denotes the distance between two optical centers, f indicates the focal length of the two lens, and δ refers to the disparity of P . Disparity is defined as the difference in the location of the object point between the left image and the right image. ( XL , YL ) and ( XR , YR ) are the coordinates of the projection of the point P in two images, left and right, respectively ( Fig. 1 ). The disparity is given by δ = ( XL - XR ), i.e., the difference in position in the left and the right images.
In this section, we explain the proposed GPU-based method to optimize image matching along with the SOM methodology. The GPU is a special-purpose processing unit with a single instruction multiple thread parallel architecture. With the advent of multi-core CPUs and many-core GPUs, mainstream processor chips are now parallel systems. Moreover, their parallelization continues to scale with Moore‘s law. Compute unified device architecture (CUDA) considers GPU hardware as an independent platform that can provide a programming environment and minimize the need for understanding the graphics pipeline. A GPU hardware chip has N multiprocessors (MPs), and each MP has M scalar processors. The GPU memory is divided into global, shared, and constant. In addition to the three main memories, there are registers, which are on-chip memories ( Fig. 2(a) ). Variables that reside in registers are accessed at a very high speed in a highly parallel manner. Registers are allocated to individual threads; each thread can only access its own registers [21] . A kernel function typically uses registers to hold the frequently accessed variables that are private to each thread. When the kernel function to be executed with CUDA is ready, a one-block grid must be configured. The block generates a large number of threads to share data with the other threads that ensure parallel processing. The thread is composed of hierarchical SIMD architecture, and a collection of threads is called a thread block, or simply, a block. After the allocation of memory and transferring of data to the GPU, these thread blocks are initialized to execute the assigned parallel jobs. The GPU executes multiple threads in parallel and independently processes vector streams. Fig. 2(b) shows the graphic framework of our proposed GPU-based method for feature matching with SIFT and SOM.
PPT Slide
Lager Image
Graphics processing unit (GPU) architecture. (a) Memory, thread, and block organization, (b) GPU graphics framework.
The parts executing on the GPU that would mainly reduce the computation time are described in subsections III-A and III-B. The streams of instructions and fragments (pixels) are designed to be processed in parallel on the GPU with a peak performance of GFLOPS (GigaFLOPS). Fragment processors perform the role of computational kernels, and different computational steps are often mapped to different fragment programs [10] . Texture mapping on the GPU is analogous to the CPU’s random read-only memory interface; the fragment processors apply a fragment program (a pixel shader) to each fragment in the stream in order to compute the final color of each pixel. The GPU processes the pixels with parallel processing by the fragment processors and calculates the four color values at once as a vector. The image data are rearranged into a four-channel RGBA image where one color value in the RGBA image represents a 2 × 2 block of the gray-level image, thereby reducing the image size by 4. The RGBA image is processed on the GPU, and the fragment shader arranges them into one RGBA image format ( Fig. 3 ). The Gaussian convolution is directly applied to the RGBA image format where the Gaussian kernel consists of even and odd values to perform semi-convolutions, which are added further to form the final result. The calculations are performed in the fragment shader, and the computational overhead is low because of the rearrangement of the image in the RGBA image format.
PPT Slide
Lager Image
Gaussian convolution in RGBA 16-GPU format for odd and even samples.
- A. Parallel Scale Space Configuration and Construction of the Descriptor
In our approach, the construction of a Gaussian scale space is accelerated by the GPU by using fragment programs. In order to extract the candidate keypoint, the scale space L ( x, y, σ ) is computed in parallel by the convolution of the variable-scale Gaussian G ( x, y, σ ) over the input image I ( x, y ) as follows:
PPT Slide
Lager Image
where * denotes the convolution operation in x and y . Stable keypoint locations in the scale space can be computed from the difference of Gaussians (DOG) separated by a constant multiplicative factor k given by
PPT Slide
Lager Image
The intensity image, gradients, and the DOG values are stored in the RGBA packed format and computed in parallel in the same passage by using vector operations in the fragment programs. The Hessian matrix H ( x, y, σ ) is computed by the second-order derivative of the Gaussian blurred image as follows:
PPT Slide
Lager Image
where Lxy denotes the second-order derivative of the Gaussian image in both horizontal and vertical directions. An equal number of threads is assigned σ1 ,…, σn according to the number of given variables scaled for Gaussian operations in order to simultaneously construct individual pyramids of m octaves. Let mI be the number of features detected in I and D be the dimension of the descriptors ( D = 128). A texture of size mI × D is created and filled with the mI descriptor values, each one occupying a column.
- B. Winner-Based Image Matching and Acceleration Using GPU
We use the SOM algorithm to map high-dimensional keypoints to a lower dimensional space by using competitive learning [8] . The pseudo-code of the proposed algorithm is shown in Fig. 4 . Fig. 5 summarizes the complete algorithm.
PPT Slide
Lager Image
Pseudo-code of the presented algorithm.
PPT Slide
Lager Image
Complete algorithm of the suggested approach. GPU: graphics processing unit, SOM: self-organizing map, DOG: difference of Gaussians.
To optimize the algorithm, the descriptor and the locator section are implemented on the GPU in order to solve the complexity of feature matching that consumes a considerable amount of time to obtain the descriptor vector. To this end, keypoints are scanned to obtain the descriptor and locator vector that can run in parallel on the GPU. In the initial stage, a function reads an image that returns the SIFT keypoints, which consist of locators and descriptors of an image that are stored in a temporary file. The temporary file is then processed separately for the scanning and organization of the locators and descriptors. The scanning and organization of the locators and descriptors is carried out in a parallel manner on the GPU. The processors consist of a front-end that reads/decodes the data, which is responsible for allocating memory on the GPU and transferring data from the CPU to the GPU, and initializes the kernel instructions. A backend is made up of a group of eight calculating units and two super-function units where instructions are executed in the SIMD mode; the same instruction is sent to all threads in the warp. NVIDIA calls this mode of execution ‘single instruction multiple threads (SIMT)’. The streaming multiprocessors’ operating mode is as follows:
  • 1) At each cycle, a warp ready for execution is selected by the front-end, which starts execution of an instruction.
  • 2) To send the instruction to all 32 threads in the warp, the backend will take four cycles, but since it operates at double the frequency of the front-end, from its point of view, only two cycles will be executed
The presented approach uses scale invariant feature vectors instead of an image database as an input to the SOM network. A topological map is obtained with the use of the SOM network, and we obtain a 2D neuron grid where each neuron is associated with a weight vector containing 128 element descriptors. The 128-dimensional descriptor generation is accelerated using the GPU in order to increase the execution speed of the algorithm. The descriptor vector is read, and its value is recorded individually in an array; further, the total value is recorded using built-in libraries. The value is later used as a limit for declaring threads and blocks for the GPU. As per the limit of each block, only 512 threads can be accommodated in each block and there can be a total of 65535 blocks in the grid. Each value is assigned to each thread in the block, and the number of blocks depends on the total value divided by 512 (number of threads). An inspection of the temporary file containing the keypoints of each image indicates that the first four values are the locator values followed by the 128 descriptor values. The GPU threads are employed to arrange these locator and descriptor values accordingly.
For image matching, we use a learning algorithm based on the concept of the nearest neighbor learning. One image is considered the reference and the next image as the matching image. Both images are represented in terms of the winning neurons in the SOM network as explained in the pseudo-code. After the network is trained, input data are distributed throughout the grid of neurons. The feature vectors are arranged according to their internal similarity with the SOM, thereby forming a topological map of the input vectors. The winning neuron is found for each pixel of the next image, and the pixel value is associated to it once the winner is found. Feature matching is performed by associating the similar winning pixels between the pixels of the reference image and its neighbor image. By iteratively repeating these steps, winning pixels are obtained with the SOM and the matching between the pixels of the different image pairs is accomplished.
In this section, we present some experimental results to show the performance of the proposed method. Experiments were performed using images captured with a Kinect camera designed by Microsoft; the experimental details and results of two test image sets are as follows: each image set contained 15 different images, and the performance of the proposed GPU-based method was compared with that of Lowe’s method and that of the SOM-based feature matching method [9] . For a comparative analysis of the execution performance, the algorithm was run on a CPU and a GPU independently. Experiments were conducted on an Intel Core i3 processor endowed with an NVIDIA GPU, GeForce 310 with a global memory of 512 MB. The CUDA, proposed by NVIDIA for its graphics processors, exposes a programming model that integrates the host (CPU) and the GPU codes in the same code source files. The main programming introduced by the programming model is an explicitly parallel function invocation (kernel), which is executed by a user-specified number of threads.
Every CUDA kernel is explicitly invoked by the host code and executed by the device, while the host side code continues the execution asynchronously after instantiating the kernel. The image size for the two test image sets is 480 × 380 pixels. Experiments were conducted under varying situations, such as rotation, scaling, and blurring conditions. Figs. 6 and 7 show the matching results for two image sets obtained using Lowe’s method and with the SOM-based feature matching method [9] . The proposed method achieves the same matching results as those described in [9] with much less computation time. It is found that the average matching time is 0.13953 s for Lowe’s SIFT, while for the SOM-based feature matching method [9] , it is 0.01447 seconds, which is reduced to 0.00165 seconds by using the proposed GPU-based method.
PPT Slide
Lager Image
Test image set 1: feature matching between image 1 and image 2 under varying conditions. (a) Matching between two images, (b) rotation of 30°, (c) scaling with a 0.5 scale factor, and (d) blurring.
PPT Slide
Lager Image
Test image set 2: feature matching between image 8 and image 9 under varying conditions. (a) Matching between two images, (b) rotation of 30°, (c) scaling with a 0.5 scale factor, and (d) blurring.
Using the SOM-based feature matching [9] method, we can accelerate Lowe’s algorithm by nine times on average, and yet another nine times by using the proposed GPU-based method. Thus, the presented GPU-based method yields a performance improvement of approximately 90 times. Tables 1 and 2 show the comparison results of the computation time on two image sets for different images, in the cases of the Lowe’s method, SOM-based feature matching method [9] , and the proposed GPU-based method. Experimental results show that the SOM-based feature matching method [9] performs more efficient matching than Lowe’s SIFT. A significant reduction in the computation time is obtained by using the proposed GPU-based method.
Comparison results for test image set 1 with Lowe’s method, SOM-based feature matching method[9], and the proposed GPU-based method
PPT Slide
Lager Image
GPU: graphics processing unit, SOM: self-organizing map.
Comparison results for test image set 2 with Lowe’s method, SOM-based feature matching method[9], and the proposed GPU-based method
PPT Slide
Lager Image
GPU: graphics processing unit, SOM: self-organizing map.
This paper presents a matching method to obtain the features under varying conditions with reduced processing time. The computation time of the proposed method is reduced as compared to Lowe’s method and further optimized by using a GPU. Experiments on various test images have been carried out to evaluate how well the presented method performs on the matching problem as compared to Lowe’s method. Experimental results show that the proposed method produces better matching results with a significant reduction in computation times.
Kajal Sharma received her B.E. in Computer Engineering from University of Rajasthan, India, in 2005, and her M.Tech. and Ph.D. in Computer Science from Banasthali University, Rajasthan, India, in 2007 and 2010, respectively. From October 2010 to September 2011, she worked as a postdoctoral researcher at Kongju National University, Korea. Since October 2011, she has been working as a postdoctoral researcher at the School of Computer Engineering, Chosun University, Gwangju, Korea. Her research interests include image and video processing, neural networks, computer vision, and robotics. She has published many research papers in various national and international journals and conferences.
Saifullah received his B.E. in Electronics Engineering from N.E.D University of Engineering and Technology, Pakistan, in 2011, and his M.S. in Computer Engineering from Chosun University, Gwangju, Korea, in 2013. From September 2011 to June 2013, he worked as a graduate research assistant in the 3D Image Processing Lab of the Computer Engineering Department of Chosun University under the supervision of Dr. Inkyu Moon. As a research assistant, Mr. Saifullah worked on symmetrical cryptography, digital holographic encryption, neural network algorithms, and acceleration of these algorithms on a graphics processing unit. His research interests include image processing, neural networks, algorithm implementation, and computer vision. He has published research papers in various international conferences.
Inkyu Moon received his B.S. and M.S. in Electronics Engineering from Sungkyunkwan University, Korea, in 1996 and 1998, respectively, and his M.S. and Ph.D. in Electrical and Computer Engineering from University of Connecticut in 2007. From January 2008 to January 2009, he was a researcher in a post-doctoral position at the University of Connecticut. He joined Chosun University, Korea, in 2009, and is currently, an associate professor at the School of Computer Engineering. He has to his credit more than 50 publications, including 30+ peer reviewed journal articles and 20+ conference proceedings (10+ Keynote Addresses and invited conference papers). Dr. Moon is a member of IEEE, OSA, and SPIE. He is on the Editorial Board of the Korea Multimedia Society.
Brown M. Z. , Burschka D. , Hager G. D. 2003 “Advances in computational stereo,” IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (8) 993 - 1008    DOI : 10.1109/TPAMI.2003.1217603
Pribanic T. , Obradovic N. , Salvi J. 2012 “Stereo computation combining structured light and passive stereo matching,” Optics Communications 285 (6) 1017 - 1022    DOI : 10.1016/j.optcom.2011.10.045
Lee C. H. , Lim Y. C. , Kwon S. , Lee J. H. 2011 “Stereo vision-based vehicle detection using a road feature and disparity histogram,” Optical Engineering 50 (2) 027004 - 027004    DOI : 10.1117/1.3535590
Belongie S. , Malik J. , Puzicha J. 2002 “Shape matching and object recognition using shape contexts,” IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (4) 509 - 522    DOI : 10.1109/34.993558
Shi J. , Tomasi C. 1994 “Good features to track,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 593 - 600
Mikolajczyk K. , Schmid C. 2004 “Scale & affine invariant interest point detectors,” International Journal of Computer Vision 60 (1) 63 - 86    DOI : 10.1023/B:VISI.0000027790.02288.f2
Schmid C. , Mohr R. 1997 “Local grayvalue invariants for image retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (5) 530 - 534    DOI : 10.1109/34.589215
Lowe D. G. 2004 “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision 60 (2) 91 - 110    DOI : 10.1023/B:VISI.0000029664.99615.94
Sharma K. , Kim S. G. , Singh M. P. 2012 “An improved feature matching technique for stereo vision applications with the use of self-organizing map,” International Journal of Precision Engineering and Manufacturing 13 (8) 1359 - 1368    DOI : 10.1007/s12541-012-0179-z
Sinha S. N. , Frahm J. M. , Pollefeys M. , Genc Y. 2011 “Feature tracking and matching in video using programmable graphics hardware,” Machine Vision and Applications 22 (1) 207 - 217    DOI : 10.1007/s00138-007-0105-z
Bjorke K. A. 2006 “Image processing on parallel GPU pixel units,” Proceedings of SPIE 6065 606515 -
Fung J. , Mann S. 2005 “OpenVIDIA: parallel GPU computer vision,” in Proceedings of the 13th Annual ACM international conference on Multimedia 849 - 852
Yang R. , Pollefeys M 2003 “Multi-resolution real-time stereo on commodity graphics hardware,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 211 - 217
Zach C. , Karner K. , Bischof H. 2004 “Hierarchical disparity estimation with programmable graphics hardware,” in Proceedings of the 12th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision 275 - 282
Bramberger M. , Brunner J. , Rinner B. , Schwabach H. 2004 “Realtime video analysis on an embedded smart camera for traffic surveillance,” in Proceedings of the 10th IEEE Real-Time and Embedded Technology and Applications Symposium 174 - 181
Klupsch S. , Ernst M. , Huss S. A. , Rumpf M. , Strzodka R. 2002 “Real time image processing based on reconfigurable hardware acceleration,” in Proceedings of the IEEE Workshop Heterogeneous Reconfigurable Systems on Chip (SoC) 1 - 7
Jones M. T. , Plassmann P. E. 1994 “Scalable iterative solution of sparse linear systems,” Parallel Computing 20 (5) 753 - 773    DOI : 10.1016/0167-8191(94)90004-3
Saad Y. 1996 “ILUM: a multi-elimination ILU preconditioner for general sparse matrices,” SIAM Journal on Scientific Computing 17 (4) 830 - 847    DOI : 10.1137/0917054
Kohonen T. 1990 “The self-organizing map,” Proceedings of the IEEE 78 (9) 1464 - 1480    DOI : 10.1109/5.58325
Toulminet G. , Bertozzi M. , Mousset S. , Bensrhair A. , Broggi A. 2006 “Vehicle detection by means of stereo vision-based obstacles features extraction and monocular pattern analysis,” IEEE Transactions on Image Processing 15 (8) 2364 - 2375    DOI : 10.1109/TIP.2006.875174
Kirk D. , Hwu W. 2010 Programming Massively Parallel Processors: A Hands-On Approach Morgan Kaufmann Publisher Burlington, MA