A Programmable SoC for Various Image Applications Based on Mobile Devices
Journal of Korea Multimedia Society. 2014. Mar, 17(3): 324-332
Copyright © 2014, Korea Multimedia Society
  • Received : December 24, 2013
  • Accepted : February 13, 2014
  • Published : March 31, 2014
About the Authors
Bongkyu Lee
Dept. of Computer Science and Statistics, Cheju National University

This paper presents a programmable System-on-a-Chip (SoC) for embedded applications that require neural network computations. The system is fully implemented on a Field-Programmable Gate Array (FPGA) based prototyping platform. The SoC consists of an embedded processor core and a reconfigurable hardware accelerator for neural computations. Its performance is evaluated using a real image processing application, an optical character recognition (OCR) system.
The demand for ‘smart’ devices in consumer electronics is increasing, motivated by the widespread use of low-cost embedded electronics in various environments [1]. It is also desirable that electronic devices be capable of sensing and understanding their surroundings and of adapting their services to the context. Artificial Neural Networks (ANN) have been in the spotlight for this purpose, primarily due to their wide range of applicability [2]. The Multilayer Perceptron (MLP) is the most frequently used ANN because it can model non-linear systems and establish non-linear decision boundaries in classification problems such as optical character recognition (OCR), data mining and image processing/recognition [2].
However, an MLP demands extremely high computational throughput, which makes real-time operation difficult on embedded devices with constrained processing capabilities. An attractive solution is to design dedicated hardware for MLP acceleration [3].
Hardware implementation of MLPs has been an active topic for many years, with accuracy, required area, and processing speed as the main concerns. Various MLP implementations have succeeded as full hardware designs, for example using an ASIC design method [3 - 5]. However, a full hardware implementation is not effective in terms of cost and implementation complexity. Recently, the reconfigurable computing paradigm has become a topic of active research, and MLP implementations that exploit the capability of reconfigurable devices such as FPGAs have become widespread [6].
Designers have long used FPGAs in board-level designs. To create high-performance, versatile platforms, some architectures have started to incorporate logic operations and interconnections that can be reconfigured at run time. Adding reconfigurable logic to an SoC provides the flexibility to change functionality after fabrication. Compared to programmable processors, these architectures offer the potential for higher performance and power efficiency with greater flexibility [7]. To boost the impact of reconfigurable SoCs, some research has focused on extracting parallelism from applications and algorithms and mapping it efficiently onto the reconfigurable architecture [8].
Although a few FPGA-based hardware implementations have been proposed thus far [9 - 11], hardware implementation of an MLP remains a challenging problem for embedded applications. Since different pre-processing and post-processing techniques can be combined with an MLP in real applications, the system should be reconfigurable according to the application. There is also a strong need for hardware designs that can accommodate variations in network structure without hardware redesign [12]. These problems can be overcome by a software/hardware co-design method, which analyzes the timing of the different portions of the algorithm and implements the time-intensive parts in hardware [13]. An SoC that combines a microprocessor with configurable hardware accelerators can deliver large speedups while keeping the flexibility of software models.
In this paper, we implement a novel MLP-SoC architecture for smart applications on embedded devices. To test and debug the target architecture efficiently at the register transfer level (RTL), an FPGA-based prototyping platform is designed and implemented. The implemented SoC can accommodate variations in network structures and applications without hardware modification. To evaluate the SoC, an OCR system is built on the prototyping platform on which the SoC is implemented. The experimental results demonstrate the effectiveness of the SoC in terms of both speed and recognition rate.
Our goal is to implement an MLP-SoC that can be used for embedded processing. In the MLP-SoC implementation, the prototyping platform, the data representation and precision, and the hardware components play important roles in the design decisions.
- 2.1 The prototyping platform
As the complexity of SoC design constantly grows and reusable IP libraries become plentiful, the main design issue shifts to finding a verification method that can handle a complex SoC system easily. Thus a low-cost co-verification solution consisting of an FPGA-based hardware emulator and an embedded processor is introduced. It provides good visibility of the internal signals of the system design mapped into the emulator and is useful for verifying complex SoC design steps [14].
For SoC design and verification, FPGA-based prototyping platforms have become popular for co-verification and rapid prototyping [15]. Mapping the entire design of the target SoC into an FPGA gives an accurate and fast representation. Some basic components, including the CPU, the bus system and the associated interconnection blocks, are selected for the platform. LEON 2 [16] is selected as the programmable processor and implemented in the FPGA. For communication between internal components, an on-chip AMBA AHB/APB bus system is also implemented in the FPGA.
In addition to the FPGA chip, the platform offers SDRAM-based memory (128 Mbytes) and flash-based storage (8 Mbytes). The SDRAM-based memory is used for storing external data, while the flash-based storage is used for storing software blocks. Fig. 1 shows the implemented prototyping platform, which has a compact size of 112×129 mm.
Fig. 1. The prototyping platform.
- 2.2 MLP for image processing/recognition
An MLP for image processing/recognition applications consists of processing elements arranged in layers. Typically, it requires three or more layers of processing nodes: an input layer, one or more hidden layers, and an output layer. Every processing node in a layer is fully or partially connected to the nodes in the layers above and below it. The weighted connections define the behavior of the network and are adjusted during training through a supervised training algorithm called back-propagation [2].
During recognition, an input vector is presented to the input layer. For each successive layer, the input to a node is the sum of the products of the incoming vector components with their respective connection weights:
$$\mathrm{sum}_i = \sum_{j} w_{ij}\,\mathrm{out}_j$$
where $w_{ij}$ is the weight connecting node $j$ to node $i$ and $\mathrm{out}_j$ is the output from node $j$.
The output of node $i$ is $\mathrm{out}_i = f(\mathrm{sum}_i)$, where $f$ denotes the activation function of the node; a sigmoid or a hyperbolic tangent function is frequently used. This output is sent to all nodes in the next layer, and the process continues through all the layers of the network until the output layer is reached and the output vector is computed. Table 1 shows the hardware constraints for the target SoC.
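The recognition pass described above can be sketched in a few lines of Python. This is only an illustration of the weighted sum and activation step, assuming fully connected layers with no bias terms; the function names and the tiny example network are hypothetical and are not taken from the implemented SoC.

```python
import math

def sigmoid(x):
    """Logistic activation, one of the two functions named in the text."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, layers, f=sigmoid):
    """Propagate an input vector through fully connected layers.

    layers[k][i][j] holds w_ij, the weight connecting node j of the
    previous layer to node i of layer k (illustrative layout).
    """
    out = list(inputs)
    for weights in layers:
        # sum_i = sum over j of w_ij * out_j, then out_i = f(sum_i)
        out = [f(sum(w_ij * out_j for w_ij, out_j in zip(row, out)))
               for row in weights]
    return out

# Hypothetical network: 2 inputs, 2 hidden nodes, 1 output node.
hidden_weights = [[0.5, -0.5], [1.0, 1.0]]
output_weights = [[1.0, -1.0]]
result = forward([1.0, 0.0], [hidden_weights, output_weights])
```

Passing `f=math.tanh` instead of the default switches the activation function without touching the rest of the computation.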
Table 1. Target hardware constraints
Since a floating-point representation of data (weights, inputs, outputs) in a neural network may still be impractical for embedded hardware, we use fixed-point representations for weights, inputs and outputs. Unsigned 8-bit values represent inputs, while signed 9-bit values represent outputs, since some activation functions produce negative outputs. Weights are stored in the weight table as signed 12-bit fixed-point values. Implementing a specific activation function directly in hardware is not appropriate for our work, since the target hardware should be reconfigurable. Thus, we use a lookup table that stores the output values of the activation function. With this method, several different activation functions can be implemented on fixed hardware.
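As a rough illustration of the fixed-point and lookup-table idea, the sketch below quantizes values to signed integers and precomputes an activation function into a table of signed 9-bit outputs. The bit widths (9-bit outputs, 12-bit weights) come from the text; the table size, the covered input range, the number of fractional bits and all function names are assumptions made for the example.

```python
import math

def to_signed_fixed(value, bits, frac_bits):
    """Quantize a real value to a signed integer of `bits` bits, saturating."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, round(value * (1 << frac_bits))))

def build_activation_table(f, entries=256, x_min=-8.0, x_max=8.0, frac_bits=8):
    """Precompute f over [x_min, x_max) as signed 9-bit outputs.

    entries, x_min, x_max and frac_bits are assumed values, not the SoC's.
    """
    step = (x_max - x_min) / entries
    return [to_signed_fixed(f(x_min + k * step), 9, frac_bits)
            for k in range(entries)]

def lookup(table, x, x_min=-8.0, x_max=8.0):
    """Return the table entry for x, clamping inputs outside the range."""
    k = int((x - x_min) / (x_max - x_min) * len(table))
    return table[max(0, min(len(table) - 1, k))]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Two activation functions served by the same lookup mechanism.
tanh_table = build_activation_table(math.tanh)
sigmoid_table = build_activation_table(sigmoid)
weight = to_signed_fixed(0.75, 12, 8)  # a weight as a signed 12-bit value
```

Changing the activation function only means loading different table contents, which is the property the text relies on for reconfigurability.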
Fig. 2 shows the top-level block diagram of the MLP-SoC. It comprises the LEON 2 core (main processor), the MLP co-processor (hardware accelerator), a memory controller, a camera interface and the bus system. All of these components are integrated into the FPGA of the prototyping platform.
Fig. 2. Architectural overview of the MLP-SoC.
LEON 2 is a 32-bit RISC processor compliant with the SPARC V8 architecture. It is highly configurable and thus well suited to SoC designs. Also, software written in C can be executed directly on the LEON 2 core using a cross tool chain [16]. We implement the LEON 2 core (shown in the dotted box in Fig. 3) from open VHDL source in the FPGA of the prototyping platform. The camera interface controller and the I2C circuit can handle a few image sensors with their fixed logic.
Fig. 3. Block diagram of the standard LEON 2 processor.
- 3.1 MLP computation co-processor
Fig. 4 shows the architecture of the implemented MLP computation co-processor, which is dedicated to the basic computations of neurons. As seen in the figure, the co-processor consists of two major parts: the host interface block, for memory accesses and the bus interface, and the MLP block, for neural computing.
Fig. 4. Architectural overview of the MLP computation co-processor.
The host interface block is responsible for the bus interfaces between the MLP computation co-processor and the other controllers. It consists of two direct memory access (DMA) units, a source DMA and a destination DMA. The source DMA block retrieves an input from the external memory and stores it in an input buffer (2K × 8 bits). Two buffers are provided so that one input can be processed while the next is being buffered. The block then signals the MLP block to start the computation for the current input. When the computation completes, the destination DMA block stores the output generated by the MLP block in the external memory. This data stream is useful when the size and the number of inputs are large.
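The two-buffer scheme can be illustrated with a small software model. The sketch below is a sequential simulation only: in hardware, the prefetch by the source DMA overlaps with the computation, which plain Python does not reproduce. The buffer capacity follows the 2K × 8 bits stated in the text; the function names and the control flow are assumptions.

```python
BUFFER_BYTES = 2048  # each input buffer holds 2K x 8 bits

def load(source):
    """Model the source DMA: fetch the next input, or None when exhausted."""
    data = next(source, None)
    if data is not None and len(data) > BUFFER_BYTES:
        raise ValueError("input does not fit in one buffer")
    return data

def process_stream(inputs, compute):
    """Run `compute` on every input using two alternating buffers."""
    source = iter(inputs)
    buffers = [None, None]
    active = 0
    buffers[active] = load(source)          # fill the first buffer
    results = []
    while buffers[active] is not None:
        # Prefetch the next input into the idle buffer. In hardware this
        # step runs concurrently with the computation that follows.
        buffers[1 - active] = load(source)
        # The MLP block works on the active buffer; the destination DMA
        # would then write the result back to external memory.
        results.append(compute(buffers[active]))
        active = 1 - active                 # swap buffer roles
    return results

outputs = process_stream([[1, 2], [3], [4, 5, 6]], sum)
```

Here `sum` stands in for the neural computation; any function of one buffer can be substituted.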
In order to accommodate the constraints described in Table 1, the MLP block consists of storage units and a computation module. There are three different static memories: the function table, the hidden node register file and the weight table. An activation function, such as a sigmoid or hyperbolic tangent function, can be loaded into the function table without modifying the hardware. The weight table consists of 128K × 19 bits: 12 bits store the weight value and 7 bits store the hidden node index. The hidden node register file consists of 128 × 24 bits for storing intermediate results of hidden/output nodes.
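To make the 19-bit weight table entry concrete, the sketch below packs a signed 12-bit weight and a 7-bit hidden node index into one integer and unpacks it again. The field widths are from the text; the bit ordering (index in the upper 7 bits, weight in the lower 12) and the function names are assumptions, since the text does not specify the layout.

```python
WEIGHT_BITS = 12  # signed weight value
INDEX_BITS = 7    # hidden node index, 0..127

WEIGHT_MASK = (1 << WEIGHT_BITS) - 1

def pack_entry(node_index, weight):
    """Pack (index, weight) into a 19-bit entry (assumed layout)."""
    if not 0 <= node_index < (1 << INDEX_BITS):
        raise ValueError("node index needs more than 7 bits")
    if not -(1 << (WEIGHT_BITS - 1)) <= weight < (1 << (WEIGHT_BITS - 1)):
        raise ValueError("weight does not fit in signed 12 bits")
    # Two's-complement encoding of the weight in the low 12 bits.
    return (node_index << WEIGHT_BITS) | (weight & WEIGHT_MASK)

def unpack_entry(entry):
    """Recover (index, weight) from a packed 19-bit entry."""
    node_index = entry >> WEIGHT_BITS
    raw = entry & WEIGHT_MASK
    # Sign-extend when the top weight bit is set.
    weight = raw - (1 << WEIGHT_BITS) if raw & (1 << (WEIGHT_BITS - 1)) else raw
    return node_index, weight

entry = pack_entry(5, -3)
```

A 7-bit index addresses 128 nodes, which matches the 128 entries of the hidden node register file.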
Fig. 5 shows the RTL diagram of the computation module. The computation module obtains inputs from an input buffer and computes the activation values of all nodes of successive layers until the values of the output nodes are computed. It then sends the output values to the host interface block, which saves them into SRAM-based memory. Fig. 5 also shows the precisions of the implemented logic.
Fig. 5. The RTL diagram of the computation module.
The implemented MLP-SoC is fully synthesized from a VHDL model and downloaded into the FPGA (Xilinx XC2V8000) of the prototyping platform. This MLP-SoC architecture provides fast processing of neural connections and transfer functions, and is well suited to MLP-type neural models. The operational clock rate of the FPGA is 30 MHz, and this clock is fed to all components implemented in hardware. In the next section, we verify the effectiveness of our reconfigurable architecture with a real application by building an OCR system on the implemented architecture.
OCR is the process by which a computer maps a digitized character image to text. It is the basis for many types of embedded applications, such as portable translators, electronic dictionaries and personal digital assistants [18]. The algorithm of the target OCR system consists of three main stages [19], as shown in Fig. 6. First, an image is acquired by the Micron MT9V112 image sensor [20] connected to the camera interface. Second, a preprocessing step segments the image into individual characters using a histogram-based method. The extracted characters are converted into binary-valued images (0 or 255) [21], and normalization for skew and size variations is then performed to obtain the 30×24-pixel input images for the MLP. Finally, each normalized image is presented to the MLP for recognition.
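The preprocessing stage can be illustrated with a simple projection-histogram sketch. The text does not give the details of the histogram-based method, so the version here is one common variant, assumed for illustration: it counts dark pixels per column of a single text line and cuts wherever the count drops to zero. The threshold of 128, the dark-ink-on-light-background convention and the function names are all assumptions; skew and size normalization are not covered.

```python
def column_histogram(image, threshold=128):
    """Count pixels darker than `threshold` in each column."""
    width = len(image[0])
    return [sum(1 for row in image if row[x] < threshold) for x in range(width)]

def segment_characters(image, threshold=128):
    """Split a text-line image into character crops at empty columns."""
    histogram = column_histogram(image, threshold)
    spans, start = [], None
    for x, count in enumerate(histogram):
        if count and start is None:
            start = x                      # a character begins
        elif not count and start is not None:
            spans.append((start, x))       # a character ends
            start = None
    if start is not None:                  # character touching the right edge
        spans.append((start, len(histogram)))
    return [[row[a:b] for row in image] for a, b in spans]

def binarize(image, threshold=128):
    """Convert a grayscale crop to binary values, 0 or 255."""
    return [[255 if p >= threshold else 0 for p in row] for row in image]

# Hypothetical 3x7 grayscale line with two dark strokes.
line = [[255, 0, 0, 255, 10, 255, 255]] * 3
characters = [binarize(c) for c in segment_characters(line)]
```

Each crop would then be normalized to the 30×24-pixel input size before being passed to the MLP.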
Fig. 6. The processing flow of the implemented OCR system.
Table 3. Configurations of MLPs
Since the structure of a neural network, such as the number of nodes or the activation function, can be varied for better performance in a specific application, an SoC for MLPs should accept these variations. To show the reconfigurable property of the MLP-SoC, we build two MLPs on the same architecture. Table 3 shows the configurations of the implemented MLPs. We successfully implemented each MLP in the FPGA of the prototyping platform independently, without any modification of the existing hardware. Training for each MLP is conducted on a separate desktop computer with the same learning data set (English letters a-z and A-Z, Times New Roman). Using the trained weights, recognition experiments are performed with three 320×240-pixel document images containing 730 characters in total. The OCR systems recognize 686 (version_1) and 719 (version_2) characters correctly, so the recognition rates are 94% (version_1) and 98% (version_2).
Table 4. Speed of each processing module for Reconfigurable OCR system
The other important issue in evaluating the MLP-SoC is recognition speed. We measure the time required by each module of the OCR system with the version_2 MLP to recognize one 320×240-pixel document containing 260 characters. Table 4 shows the times required by all modules for this task. The OCR system can process nearly 43 characters per second. The neural computation module requires 3.9 seconds, while a software implementation of the target MLP requires 869 seconds on LEON 2. That is, the MLP computation co-processor speeds up the neural computation about 223 times compared to the software implementation. Since most commercial software OCR systems run on servers or desktop computers with more capable hardware, such as powerful CPUs, they do not require long processing times. However, wearable/mobile devices are constrained in their processing capabilities by cost and power consumption. Thus hardware acceleration is a well-suited solution for implementing intelligent tasks on hardware-constrained devices. From the experiments, we conclude that the implemented MLP-SoC can be used to build smart embedded devices capable of various image processing applications.
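The reported figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the text; the variable names are illustrative, and the page time is merely what the quoted throughput implies, not a value read from the table.

```python
# Figures quoted in the text.
CHARS_TOTAL = 730
CORRECT = {"version_1": 686, "version_2": 719}
SOFTWARE_SECONDS = 869.0   # MLP in software on LEON 2
HARDWARE_SECONDS = 3.9     # MLP on the co-processor
PAGE_CHARS = 260
CHARS_PER_SECOND = 43

def recognition_rate(correct, total):
    """Recognition rate in percent."""
    return 100.0 * correct / total

rates = {name: recognition_rate(n, CHARS_TOTAL) for name, n in CORRECT.items()}
speedup = SOFTWARE_SECONDS / HARDWARE_SECONDS
implied_page_seconds = PAGE_CHARS / CHARS_PER_SECOND

print({name: round(rate, 1) for name, rate in rates.items()})
print(round(speedup, 1), round(implied_page_seconds, 2))
```

The rates round to the 94% and 98% given above, and the ratio of 869 to 3.9 seconds rounds to the stated factor of 223.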
In this paper, we design and implement the architecture of an MLP-SoC suitable for small smart devices. The implemented SoC is tested and verified at the RTL using the FPGA-based prototyping platform. Without modifying the existing hardware, we successfully build two application systems on the designed architecture by reconfiguring the SoC. The example shows that the MLP-SoC can be used effectively in various mobile/wearable devices that need intelligent capabilities. Chip fabrication for the implemented architecture is in progress.
Bongkyu Lee
He received the Ph.D. degree from the Department of Computer Engineering, Seoul National University, in 1995. Since 1996, he has been a professor at Jeju National University. His research interests include neural networks, image processing and augmented reality.
[1] Pentland A., Choudhury T. 2000 “Face Recognition for Smart Environments” IEEE Computer 33 (2) 50 - 55 DOI: 10.1109/2.820039
[2] Zurada J.M. 1992 Introduction to Artificial Neural Systems PWS Publishing Company New Jersey
[3] Schoenauer T., Jahnke A., Roth U., Klar H. 1998 “Digital Neurohardware: Principles and Perspectives” Neuronal Networks in Applications 2 (20) 101 - 106
[4] Mathia K., Clark J., Colbert B., Saeks R. 1996 “Benchmarking and MIMD Neural Network Processor” WCNN’96 1203 - 1210
[5] Theocharides T., Link G., Vijaykrishnan N., Irwin M.J., Wolf W. 2004 “Embedded Hardware Face Detection” Proc. the 17th International Conference on VLSI Design 569 - 572
[6] Brogatti M., Lertora F., Foret B., Cali L. 2003 “A Reconfigurable System Featuring Dynamically Extensible Embedded Microprocessor, FPGA, and Customizable I/O” IEEE Journal of Solid-State Circuits 38 (3) 521 - 529 DOI: 10.1109/JSSC.2002.808288
[7] Tsai T.H., Yang Y.C., Liu C.N. 2005 “A Hardware/Software Co-Design of MP3 Audio Decoder” The Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology 41 (1) 111 - 127 DOI: 10.1007/s11265-005-6254-2
[8] Nahvi M., Ivanov A. 2004 “Indirect Test Architecture for SoC Testing” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 23 (7) 1128 - 1142 DOI: 10.1109/TCAD.2004.829796
[9] Ortigosa E.M., Canas A., Ros E., Ortigosa P.M., Mota S., Diaz J. 2006 “Hardware Description of Multi-Layer Perceptrons with Different Abstraction Levels” Microprocessors and Microsystems 30 (7) 435 - 444 DOI: 10.1016/j.micpro.2006.03.004
[10] Vitabile S., Conti V., Gennaro F., Sorbello F. 2005 “Efficient MLP Digital Implementation on FPGA” Proc. the 8th Euromicro Conference on DSD 124 - 129
[11] Rosado-Munoz A., Soria-Olivas E., Gomez-Chova L., Frances J.V. 2008 “An IP Core and GUI Implementing Multilayer Perceptron with a Fuzzy Activation Function on Configurable Logic Devices” Journal of Universal Computer Science 14 (10) 1678 - 1694
[12] Pormann M., Franzmeier M., Kalte H., Witkowski U., Ruckert U. 2002 “A Reconfigurable SOM Hardware Accelerator” Proc. the European Symposium on Artificial Neural Networks Bruges (Belgium) 337 - 342
[13] Islam M.S., Beg M.S., Bhuyan M.S., Othman M. 2006 “Design and Implementation of Discrete Cosine Transform Chip for Digital Consumer Products” IEEE Transactions on Consumer Electronics 52 (3) 998 - 1003 DOI: 10.1109/TCE.2006.1706499
[14] Lee J.H., Kim S.C. 2011 “Analysis of Verification Methodologies based on a SoC Platform Design” International Journal of Contents 7 (1) 23 - 28 DOI: 10.5392/IJoC.2011.7.1.023
[15] Valle P.G.D., Atienza D., Paci G., Poletti F. 2007 “Application of FPGA Emulation to SoC Floorplan and Packaging Exploration” Proc. the XXII Conference on Design of Circuits and Integrated Systems 236 - 240
[16] 2004 LEON2 Processor User’s Manual
[17] Shon S.M., Yang S.H., Kim S.W., Baek K.H., Paik W.H. 2006 “SoC Design of an Auto Focus Driving Image Signal Processor for Mobile Camera Applications” IEEE Transactions on Consumer Electronics 52 (1) 10 - 16 DOI: 10.1109/TCE.2006.1605018
[18] Nakajima H., Matsuo Y., Nagata M., Saito K. 2005 “Portable Translator Capable of Recognizing Characters on Signboard and Menu Captured by Built-In Camera” Proc. the ACL Interactive Poster and Demonstration Sessions 61 - 64
[19] Lee B., Cho Y., Cho S. 1995 “Translation, Scale and Rotation Invariant Pattern Recognition using PCA and Reduced Second Order Neural Network” Neural, Parallel & Scientific Computation 3 (3) 417 - 429
[20] 2005 MT9V112 Manual
[21] Na M.Y., Kim H.J., Kim T.Y. 2011 “An Illumination and Background-Robust Hand Image Segmentation Method based on Dynamic Threshold Values” Journal of Korea Multimedia Society 14 (5) 607 - 613