This paper presents a programmable System-on-a-Chip (SoC) for various embedded applications that need neural network computations. The system is fully implemented on a Field-Programmable Gate Array (FPGA)-based prototyping platform. The SoC consists of an embedded processor core and a reconfigurable hardware accelerator for neural computations. The performance of the SoC is evaluated using a real image processing application, an optical character recognition (OCR) system.
1. INTRODUCTION
The demand for 'smart' devices in consumer electronics is increasing, motivated by the widespread use of low-cost embedded electronics in various environments. It is also desirable that electronic devices be capable of sensing and understanding their surroundings and adapting their services to the context. Artificial Neural Networks (ANNs) have been spotlighted for this purpose, primarily due to their wide range of applicability. The Multilayer Perceptron (MLP) is the most frequently used ANN because of its ability to model non-linear systems and establish non-linear decision boundaries in classification problems such as optical character recognition (OCR), data mining, and image processing/recognition.
However, an MLP requires very high computational throughput, and this complexity is a serious obstacle to real-time operation on embedded devices with constrained processing capabilities. An attractive solution is to design dedicated hardware for MLP acceleration.
Hardware implementation of MLPs has been a hot topic for many years, with accuracy, required area, and processing speed as the main concerns. Various MLP implementations have succeeded using fully custom hardware approaches such as ASIC design. However, full hardware implementation is not effective in terms of cost and implementation complexity. Recently, the reconfigurable computing paradigm has become a topic of active research, and, by utilizing the capability of reconfigurable devices, the implementation of MLP structures on FPGAs has become widespread.
Designers have long used FPGAs in board-level designs. To create high-performance, versatile platforms, some architectures have started incorporating logic operations and interconnections that can be reconfigured at run time. Adding reconfigurable logic to an SoC provides the flexibility to change functionality after fabrication. Compared to programmable processors, these architectures offer the potential to achieve higher performance and power efficiency with greater flexibility. To boost the impact of reconfigurable SoCs, some research has been done on extracting parallelism from applications and algorithms and mapping it efficiently onto the reconfigurable architecture.
Although a few FPGA-based hardware implementations have been proposed thus far, hardware implementation of an MLP remains a challenging problem for embedded applications. Since different pre-processing and post-processing techniques can be combined with an MLP in real applications, the system should be reconfigurable according to the application. There is also a strong need for hardware designs that can accommodate variations in network structure without hardware redesign. These problems can be overcome by a software/hardware co-design method, which analyzes the timing of the different portions of the algorithm and implements the time-intensive parts in hardware. An SoC that has a microprocessor and related configurable hardware accelerators can deliver large speedups while keeping the flexibility of software models.
In this paper, we implement a novel MLP-SoC architecture for smart applications on embedded devices. To test and debug the target architecture efficiently at the register transfer level (RTL), an FPGA-based prototyping platform is designed and implemented. The implemented SoC can accommodate variations in network structures and applications without hardware modification. To evaluate the SoC, an OCR system is built on the prototyping platform where the SoC is implemented. The experimental results demonstrate the effectiveness of the SoC in terms of both speed and recognition rate.
2. ISSUES IN THE IMPLEMENTATION OF MLP-SOC
Our goal is to implement an MLP-SoC that can be used for embedded processing. In implementing the MLP-SoC, the prototyping platform, the data representation and precision, and the hardware components play important roles in design decisions.
- 2.1 The prototyping platform
As the complexity of SoC design constantly grows and reusable IP libraries become abundant, the main design issue shifts to finding a verification method that can handle a complex SoC easily. Thus, a low-cost co-verification solution consisting of an FPGA-based hardware emulator and an embedded processor is introduced. It provides good visibility of the internal signals of the system design mapped onto the emulator, which is useful for verifying the complex SoC design steps.
For the design and verification of SoCs, FPGA-based prototyping platforms have become popular in co-verification and rapid prototyping. Mapping the entire design of the target SoC onto an FPGA gives an accurate and fast representation. Some basic components, including a CPU, a bus system, and associated interconnection blocks, are selected for designing the platform. LEON 2 is selected as the programmable processor and implemented on the FPGA. For communication between internal components, an on-chip AMBA AHB/APB bus system is also implemented on the FPGA.
In addition to the FPGA chip, the platform offers SDRAM-based memory (128 Mbytes) and flash-based storage (8 Mbytes). The SDRAM-based memory unit is used for storing external data, while the flash-based storage is used for storing software blocks.
The figure below shows the implemented prototyping platform, which has a compact size of 112×129 mm.
Figure: The prototyping platform.
- 2.2 MLP for image processing/recognition
An MLP for image processing/recognition applications consists of processing elements arranged in layers. Typically, it requires three or more layers of processing nodes: an input layer, one or more hidden layers, and an output layer. Every processing node in a given layer is fully or partially connected to the nodes in the layers above and below it. The weighted connections define the behavior of the network and are adjusted during training through a supervised training algorithm called back-propagation.
In recognition, an input vector is presented to the input layer. For successive layers, the input to each node is the sum of the scalar products of the incoming vector components with their respective weighted connections:
$$\mathrm{net}_j = \sum_i w_{ij}\,\mathrm{out}_i$$
where $w_{ij}$ is the weight connecting node $i$ to node $j$ and $\mathrm{out}_i$ is the output from node $i$. The output of a node $j$ is $\mathrm{out}_j = f(\mathrm{net}_j)$, which is then sent to all nodes in the next layer. This continues through all the layers of the network until the output layer is reached and the output vector is computed. Here, $f$ denotes the activation function of each node. A sigmoid or a hyperbolic tangent function is frequently used.
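The layer-by-layer computation described above can be sketched in floating point as follows. This is only an illustrative reference model, not the hardware implementation; the function names and the `weights[j][i]` layout are assumptions made for this example.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, one of the activation functions named in the text."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(weights, inputs, f=sigmoid):
    """Compute the outputs of one layer.

    weights[j][i] is the weight connecting input node i to node j
    (this storage layout is an assumption of the sketch).
    """
    outputs = []
    for w_j in weights:
        # Weighted sum of the incoming values for node j.
        net_j = sum(w_ji * out_i for w_ji, out_i in zip(w_j, inputs))
        # The activation function maps the sum to the node output.
        outputs.append(f(net_j))
    return outputs


def mlp_forward(layers, x, f=sigmoid):
    """Propagate the input vector x through every layer in turn."""
    for weights in layers:
        x = layer_forward(weights, x, f)
    return x
```

Passing `math.tanh` as `f` switches the same code to the hyperbolic tangent activation.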
The following table shows the hardware constraints for the target SoC.
Table: Target hardware constraints
Since a floating-point representation of data (weights, inputs, outputs) in a neural network may still be impractical for embedded hardware, we use fixed-point representations for weights, inputs, and outputs. Unsigned 8-bit values represent the inputs, while signed 9-bit values are used for the outputs, since some activation functions produce negative outputs. Weights are stored in the weight table as signed 12-bit fixed-point values. Implementing a specific activation function directly in hardware is not appropriate for our work, since the target hardware should be reconfigurable. Thus, we use a lookup table that stores the output values of the activation function. With this method, several different activation functions can be implemented with fixed hardware.
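A minimal software sketch of these two ideas, signed 12-bit weight quantization and a table-driven activation function, is given below. The bit widths come from the text; the number of fractional bits, the table size, the tabulated input range, and the scaling of outputs to ±255 are assumptions chosen only for illustration.

```python
import math


def saturate(value, bits, signed):
    """Clamp an integer to the range representable in the given bit width."""
    lo = -(1 << (bits - 1)) if signed else 0
    hi = (1 << (bits - 1)) - 1 if signed else (1 << bits) - 1
    return max(lo, min(hi, value))


def quantize_weight(w, frac_bits=8):
    """Convert a real weight to a signed 12-bit fixed-point code.

    frac_bits (position of the binary point) is an assumed parameter.
    """
    return saturate(round(w * (1 << frac_bits)), 12, signed=True)


def build_activation_lut(f, entries=256, x_min=-8.0, x_max=8.0):
    """Tabulate f over [x_min, x_max] as signed 9-bit codes.

    Outputs are scaled by 255, so f values in [-1, 1] map to [-255, 255].
    """
    step = (x_max - x_min) / (entries - 1)
    return [
        saturate(round(f(x_min + k * step) * 255), 9, signed=True)
        for k in range(entries)
    ]


def lookup(lut, x, x_min=-8.0, x_max=8.0):
    """Evaluate the tabulated function at x by nearest-entry lookup."""
    x = max(x_min, min(x_max, x))
    k = round((x - x_min) / (x_max - x_min) * (len(lut) - 1))
    return lut[k]


# Changing the activation function only means loading a different table.
tanh_lut = build_activation_lut(math.tanh)
sigmoid_lut = build_activation_lut(lambda x: 1.0 / (1.0 + math.exp(-x)))
```

The `lookup` routine is identical for both tables, which mirrors the point made above: the datapath stays fixed and only the table contents change.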
3. THE MLP-SOC
The figure below shows the top-level block diagram of the MLP-SoC. It comprises the LEON 2 core (main processor), the MLP co-processor (hardware accelerator), a memory controller, a camera interface, and a bus system. All of these components are integrated into the FPGA of the prototyping platform.
Figure: Architectural overview of the MLP-SoC.
LEON 2 is a 32-bit RISC processor compliant with the SPARC V8 architecture. It is highly configurable and thus very suitable for SoC designs. Also, software written in C can be executed directly on the LEON 2 core using a cross toolchain. We implement the LEON 2 core (shown as the dotted box in the figure) on the FPGA of the prototyping platform using the open VHDL source. The camera interface controller and the I²C circuit are capable of handling a few image sensors with their fixed logic.
Figure: Block diagram of the standard LEON 2 processor.
- 3.1 MLP computation co-processor
The figure below shows the architecture of the implemented MLP computation co-processor, which is dedicated to the basic computations of neurons. As seen in the figure, the co-processor consists of two major parts: the host interface block, for memory accesses and the bus interface, and the MLP block, for neural computing.
Figure: Architectural overview of the MLP computation co-processor.
The host interface block is responsible for the bus interfaces between the MLP computation co-processor and other controllers. It contains two direct memory access (DMA) units, a source DMA and a destination DMA. The source DMA block retrieves an input from the external memory and stores it in one of the input buffers (2K × 8 bits). Two buffers are provided so that one input can be processed while the next is being buffered. The block then signals the MLP block to start the computation task for the current input. When the computation task completes, the destination DMA block stores the output generated by the MLP block in the external memory. This data stream is useful when the size and the number of inputs are large.
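The two-buffer scheme can be illustrated with a small sequential model. In hardware the prefetch and the computation overlap in time; this sketch only shows how the roles of the two buffers alternate. The function and variable names are hypothetical.

```python
_EMPTY = object()  # marks a buffer that holds no pending input


def process_stream(inputs, compute):
    """Model of ping-pong buffering over a stream of inputs.

    While the active buffer is being consumed by `compute`, the next
    input is loaded into the other buffer; then the roles swap.
    """
    source = iter(inputs)
    buffers = [_EMPTY, _EMPTY]
    results = []
    active = 0
    # The source DMA fills the first buffer before computation starts.
    buffers[active] = next(source, _EMPTY)
    while buffers[active] is not _EMPTY:
        # Prefetch the next input into the idle buffer.
        buffers[1 - active] = next(source, _EMPTY)
        # The MLP block processes the active buffer; the result stands
        # in for the data the destination DMA would write back.
        results.append(compute(buffers[active]))
        # Swap roles: the prefetched buffer becomes the active one.
        active = 1 - active
    return results
```

For example, `process_stream([[1, 2], [3, 4], [5]], sum)` consumes three inputs in order while never holding more than two of them at once.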
To accommodate the target hardware constraints, the MLP block consists of storage and a computation module. There are three different static memories: the function table, the hidden node register file, and the weight table. An activation function, such as the sigmoid or hyperbolic tangent function, can be implemented in the function table without modifying the hardware. The weight table consists of 128K × 19 bits: 12 bits store the weight value and 7 bits store the hidden node index. The hidden node register file consists of 128 × 24 bits for storing the transient results of hidden/output nodes.
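To make the 19-bit weight table entry concrete, the sketch below packs a signed 12-bit weight and a 7-bit hidden node index into one word and unpacks it again. The field widths are taken from the text; the field order (index in the upper bits, weight in the lower bits) and the two's complement encoding are assumptions, since the actual bit layout is not specified.

```python
WEIGHT_BITS = 12  # signed weight field (from the text)
INDEX_BITS = 7    # hidden node index field (from the text)
WEIGHT_MASK = (1 << WEIGHT_BITS) - 1


def pack_entry(weight, node_index):
    """Pack a weight and a node index into one 19-bit table entry."""
    if not -(1 << (WEIGHT_BITS - 1)) <= weight < (1 << (WEIGHT_BITS - 1)):
        raise ValueError("weight does not fit in 12 signed bits")
    if not 0 <= node_index < (1 << INDEX_BITS):
        raise ValueError("node index does not fit in 7 bits")
    # Assumed layout: [18:12] node index, [11:0] two's complement weight.
    return (node_index << WEIGHT_BITS) | (weight & WEIGHT_MASK)


def unpack_entry(entry):
    """Recover (weight, node_index) from a packed 19-bit entry."""
    raw = entry & WEIGHT_MASK
    # Sign-extend the 12-bit field.
    weight = raw - (1 << WEIGHT_BITS) if raw & (1 << (WEIGHT_BITS - 1)) else raw
    return weight, entry >> WEIGHT_BITS
```

A 7-bit index addresses 128 nodes, which matches the 128 entries of the hidden node register file.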
The figure below shows the RTL diagram of the computation module. The computation module obtains inputs from an input buffer and computes the activation values of all nodes of successive layers until the values of the output nodes are computed. Then it sends the output values to the host interface block, which saves them in the SDRAM-based memory. The figure also shows the precisions of the implemented logic.
Figure: The RTL diagram of the computation module.
The implemented MLP-SoC is fully synthesized from a VHDL model and loaded into the FPGA (Xilinx XC2V8000) of the prototyping platform. This MLP-SoC architecture provides fast processing of neural connections and transfer functions, and is well suited for MLP-type neural models. The operational clock rate of the FPGA is 30 MHz. This clock source is fed to all components that are implemented in hardware. In the next section, we verify the effectiveness of our reconfigurable architecture with a real application by building an OCR system on the implemented architecture.
4. APPLICATION EXAMPLE: OCR SYSTEM
OCR is the process by which a computer maps a digitized character image to text. It is the basis for many different types of embedded applications, such as portable translators, electronic dictionaries, and personal digital assistants. The algorithm of the target OCR system consists of three main stages, as shown in the figure below. First, an image is acquired by the Micron MT9V112 image sensor connected to the camera interface. Second, a preprocessing step segments the image into individual characters using a histogram-based method. The extracted characters are converted into binary-valued images (0 or 255). Then normalization for skew and size variations is performed to obtain the 30×24-pixel input images of the MLP.
Figure: The processing flow of the implemented OCR system.
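The preprocessing stage can be illustrated with a simplified sketch. It is not the authors' implementation: the fixed threshold, the use of a column projection histogram on the binarized image, dark ink on a light background, nearest-neighbor resizing, and the reading of 30×24 as 30 rows by 24 columns are all assumptions, and skew correction is omitted.

```python
def binarize(gray, threshold=128):
    """Map each gray pixel to 0 or 255 using a fixed (assumed) threshold."""
    return [[255 if p >= threshold else 0 for p in row] for row in gray]


def split_columns(binary, ink=0):
    """Find character column ranges from a vertical projection histogram.

    Returns a list of (start, end) column spans, end exclusive, where
    at least one pixel in each column equals the ink value.
    """
    width = len(binary[0])
    hist = [sum(1 for row in binary if row[c] == ink) for c in range(width)]
    spans, start = [], None
    for c, count in enumerate(hist):
        if count and start is None:
            start = c                 # a character begins
        elif not count and start is not None:
            spans.append((start, c))  # an empty column ends the character
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


def resize_nearest(img, out_h=30, out_w=24):
    """Normalize an image to out_h x out_w by nearest-neighbor sampling."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]
```

Each normalized character then provides 30 × 24 = 720 binary-valued inputs to the MLP.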
Table: Configurations of MLPs
Since the structure of a neural network, such as the number of nodes or the activation function, can be varied for better performance in a specific application, an SoC for MLPs should accept these variations. To show the reconfigurable property of the MLP-SoC, we build two MLPs on the same architecture.
The table of MLP configurations lists the two implemented MLPs. We successfully implement each MLP on the FPGA of the prototyping platform independently, without any modification of the existing hardware. Each MLP is trained on a separate desktop computer with the same training data set (English letters a-z and A-Z, Times New Roman). Using the trained weights, recognition experiments are performed on three 320×240-pixel document images containing 730 characters in total. The OCR systems recognize 686 (version_1) and 719 (version_2) characters correctly, giving recognition rates of 94% (version_1) and 98% (version_2).
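As a quick check of the reported rates, the arithmetic can be reproduced from the character counts given above. The helper name is hypothetical.

```python
def recognition_rate(correct, total):
    """Percentage of correctly recognized characters."""
    return 100.0 * correct / total


TOTAL_CHARACTERS = 730  # characters in the three test documents

for name, correct in (("version_1", 686), ("version_2", 719)):
    rate = recognition_rate(correct, TOTAL_CHARACTERS)
    print(f"{name}: {rate:.1f}%")
```

The computed values are about 94.0% and 98.5%, which round to the reported 94% and 98%.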
Table 4: Speed of each processing module for the reconfigurable OCR system
Another important aspect of the evaluation of the MLP-SoC is recognition speed. We measure the time required by each module of the OCR system with the version_2 MLP to recognize one 320×240-pixel document containing 260 characters. Table 4 shows the required times of all modules for the task. The OCR system can process nearly 43 characters per second. The neural computation module requires 3.9 seconds, while the software implementation of the target MLP requires 869 seconds on LEON 2. This difference is mainly due to the MLP computation co-processor, which speeds up the neural computation 223 times compared to the software implementation. Since most commercial software OCR systems run on servers or desktop computers with more capable hardware, such as powerful CPUs, they do not require long processing times. However, wearable/mobile devices are constrained in their processing capabilities by cost and power consumption. Thus, hardware acceleration is an effective way to implement intelligent tasks on hardware-constrained devices. From the experiments, we conclude that the implemented MLP-SoC can be used to build smart embedded devices capable of various image processing applications.
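The speed figures quoted above can be cross-checked with a few lines of arithmetic. The sketch assumes that the 3.9 figure for the neural computation module is in seconds, like the 869-second software measurement; the total processing time is derived from the quoted throughput and is only approximate.

```python
SOFTWARE_SECONDS = 869.0     # MLP in software on LEON 2
COPROCESSOR_SECONDS = 3.9    # MLP on the co-processor (unit assumed: seconds)
CHARACTERS = 260             # characters in the test document
CHARS_PER_SECOND = 43        # reported throughput ("nearly 43")


def speedup(baseline, accelerated):
    """Ratio of baseline time to accelerated time."""
    return baseline / accelerated


def total_time(characters, rate):
    """Total processing time implied by a throughput in characters/second."""
    return characters / rate


print(f"speedup: {speedup(SOFTWARE_SECONDS, COPROCESSOR_SECONDS):.1f}x")
print(f"implied total time: {total_time(CHARACTERS, CHARS_PER_SECOND):.2f} s")
```

The ratio is about 222.8, consistent with the reported factor of 223, and the quoted throughput implies roughly 6 seconds for the whole document.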
5. CONCLUSION
In this paper, we design and implement the architecture of an MLP-SoC suitable for small smart devices. The implemented SoC is tested and verified at the RTL using the FPGA-based prototyping platform. Without modifying the existing hardware, we successfully build two application systems on the designed architecture by reconfiguring the SoC. The example shows that the MLP-SoC can be used effectively in various mobile/wearable devices that need intelligent capabilities. We are in the process of fabricating a chip for the implemented architecture.
He received the Ph.D. degree from the Department of Computer Engineering, Seoul National University, in 1995. Since 1996, he has been a professor at Jeju National University. His research interests include neural networks, image processing, and augmented reality.