High Accuracy Skeleton Estimation using 3D Volumetric Model based on RGB-D
Journal of Broadcast Engineering. 2020. Dec, 25(7): 1095-1106
Copyright © 2020, The Korean Institute of Broadcast and Media Engineers
  • Received : November 18, 2020
  • Accepted : November 30, 2020
  • Published : December 30, 2020
About the Authors
Kyung-Jin Kim
Department of Electronic Material Engineering, Kwangwoon University
Byung-Seo Park
Department of Electronic Material Engineering, Kwangwoon University
Ji-Won Kang
Department of Electronic Material Engineering, Kwangwoon University
Jin-Kyum Kim
Department of Electronic Material Engineering, Kwangwoon University
Woo-Suk Kim
Department of Electronic Material Engineering, Kwangwoon University
Dong-Wook Kim
Department of Electronic Material Engineering, Kwangwoon University
Young-Ho Seo
Department of Electronic Material Engineering, Kwangwoon University
yhseo@kw.ac.kr

Abstract
In this paper, we propose an algorithm that extracts a high-precision 3D skeleton from a 3D model generated with a distributed RGB-D camera system. When a 3D model is reconstructed from distributed RGB-D cameras, its geometry can be exploited to obtain a skeleton with higher precision. To improve the precision of the 2D skeletons, we use principal component analysis (PCA) to find the conditions under which 2D skeletons are extracted reliably. High-quality 2D skeletons are then obtained and combined to estimate a high-precision 3D skeleton. Since the resulting skeleton may still contain errors, we also propose an algorithm that removes these errors using the information of the 3D model. Experiments show that the proposed method extracts skeletons with very high accuracy.
Ⅰ. Introduction
As the virtual reality and augmented reality industries become more common, 3D video content technologies that provide immersive experiences are being actively developed. 3D video content is applied to various fields such as games, video services, medical care, and education [1]. All of these techniques target virtual models. A representative 3D data format for virtual objects is the point cloud, which expresses an object as a set of points. This data basically contains 3D coordinate information and texture coordinate information for each point, and color information, normal information, material information, etc. are added depending on the application [1][2].
Computer vision aims to embody human visual perception using computers. Since its core task is extracting information by analyzing images or video captured by a camera, detecting the location and orientation of objects is a key technology in computer vision. Among such tasks, recognizing the pose a person takes is called human pose estimation [3]. Literally, it can be viewed as the problem of estimating where the joints of a person's body are located in a photo or video. However, not all joints are visible in the image: even for the same pose, the appearance depends on the shooting direction, joints may be hidden by other objects or by clothing, and estimation can be difficult depending on the lighting. Although human pose estimation has long been studied in computer vision, it remains a difficult problem [3].
Skeleton extraction is the most commonly used tool among technologies for analyzing human posture and movement. Many signal-processing techniques for skeleton extraction have been studied, and in recent years many deep-learning-based techniques have been developed [4]. A representative deep learning network that extracts 2D skeletons is OpenPose [5], which detects a large number of people at once at a speed of 8.8 fps [4][5]. Studies have also extracted 3D skeletons by solving occlusion problems with multiple RGB-D cameras [6]. In this paper, point cloud data is created using a multi-view camera system, the occlusion problem is solved using this data, and a deep-learning-based 3D skeleton is extracted. In addition, inaccurately extracted joints are corrected using the point cloud model to obtain a high-precision skeleton.
This paper describes the process of acquiring a photo-realistic point cloud sequence in Section 2 and introduces the proposed 3D skeleton extraction method in Section 3. Section 4 shows experimental results, and Section 5 concludes the paper.
Ⅱ. 3D Reconstruction using RGB-D Camera System
- 1. Camera System
This section describes the method for generating point clouds for skeleton generation. First, we implemented a system for acquiring point clouds using 8 RGB-D cameras, shown in Fig. 1. The 8 sets of RGB and depth images acquired using this system are converted into point clouds, so 8 point cloud sets are generated.
Fig. 1. 3D point cloud capturing system: (a) vertical, (b) horizontal shooting angle and range
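For reference, the conversion from a depth map to a point cloud can be sketched as below. This is a minimal illustration assuming pinhole intrinsics (fx, fy, cx, cy) and a color image registered to the depth map; the function name is illustrative, and the 0.1–1.5 m depth threshold is taken from the experimental setup in Section IV rather than from a published implementation.

```python
import numpy as np

def depth_to_point_cloud(depth_m, rgb, fx, fy, cx, cy, z_min=0.1, z_max=1.5):
    """Back-project a depth map (in meters) into a colored point cloud.

    depth_m : (H, W) float array of depth values in meters
    rgb     : (H, W, 3) color image registered to the depth map
    fx, fy, cx, cy : pinhole intrinsics of the depth camera
    Points outside [z_min, z_max] are discarded (depth thresholding).
    """
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m
    valid = (z > z_min) & (z < z_max)      # keep only the object's depth range
    x = (u - cx) * z / fx                  # pinhole back-projection
    y = (v - cy) * z / fy
    points = np.stack([x[valid], y[valid], z[valid]], axis=1)
    colors = rgb[valid]
    return points, colors
```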
- 2. Extrinsic Calibration
First, a 3D Charuco board is used to find matching points in the RGB images input from the multiple cameras. The use of Charuco boards is not essential [7]. Figure 2(a) shows one side of the Charuco board, and Fig. 2(b) shows the matching points of the 3D Charuco board displayed in the world coordinate system. The origin of the world coordinate system is set to one corner of the 3D Charuco board. To obtain the 3D coordinates of the matching points, calibration between the depth and RGB images is performed [8], and the 3D coordinates of the matching points are obtained from the depth map.
Fig. 2. Charuco board used to acquire feature points: (a) Charuco board, (b) world coordinate system obtained through the Charuco board
Next, the extrinsic parameters of each camera are obtained using the matched coordinates in the point cloud sets for generating the 3D model. These parameters are calculated with an optimization algorithm so that the SED (Squared Euclidean Distance) between the matched coordinates is minimal [9]. The transformation matrix of the coordinate system includes parameters for the rotation angles and translation values for each of the x, y, and z axes. After setting one camera as the reference coordinate system, the parameters for converting the coordinates of the other cameras to the reference coordinate system are obtained. $X_{ref}$ represents the coordinates of the reference camera and $X_i$ represents the coordinates of the remaining cameras. $R_{i \to ref}$ and $t_{i \to ref}$ represent the rotation matrix and translation vector from each camera to the reference camera. The initial $R_{i \to ref}$ is the identity matrix and $t_{i \to ref}$ is all zeros. When Eq. (1) is applied with these initial parameters, the result equals $X_i$, and it converges to $X_{ref}$ during optimization.
$$\hat{X}_{ref} = R_{i \to ref} \, X_i + t_{i \to ref} \tag{1}$$
The loss function to be optimized is the average SED between $X_{ref}$ and the transformed coordinates $\hat{X}_{ref}$. Equation (2) represents the error function, where $M$ is the number of matched points.
$$E = \frac{1}{M} \sum_{k=1}^{M} \left\| X_{ref}^{(k)} - \hat{X}_{ref}^{(k)} \right\|^2 \tag{2}$$
The process of differentiating the loss function with respect to the coordinate transformation parameters and updating the parameters to minimize the loss can be expressed as Eq. (3) [10]. $\alpha$ is a constant learning rate, set to 0.01. $P_{n+1}$ and $P_n$ are the parameters at the (n+1)-th and n-th iterations, respectively.
$$P_{n+1} = P_n - \alpha \frac{\partial E}{\partial P_n} \tag{3}$$
When this process is performed more than 200,000 times, the average error over the 8 cameras is reduced to 2.98 mm. Once the parameters of each camera are obtained by Eq. (3), the transformation from the camera coordinate system to the world coordinate system can be performed using Eq. (4), and the point clouds can be aligned in a unified coordinate system. $P_W$ represents world coordinates (reference camera coordinates), and $P_C$ represents camera coordinates [12].
$$P_W = R_{i \to ref} \, P_C + t_{i \to ref} \tag{4}$$
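A minimal sketch of the optimization loop behind Eqs. (1)–(3) might look as follows. The per-axis Euler-angle parameterization matches the description above, but the numerical gradient, helper names, and iteration budget are assumptions: the paper does not specify how the derivative in Eq. (3) is computed.

```python
import numpy as np

def rotation_matrix(rx, ry, rz):
    """Rotation from per-axis angles (radians), composed as Rz @ Ry @ Rx."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def calibrate(X_i, X_ref, alpha=0.01, iters=200_000, eps=1e-6):
    """Gradient descent on the mean SED of Eq. (2).

    X_i, X_ref : (M, 3) matched 3D coordinates from the Charuco board.
    The paper reports >200,000 iterations; with the slow numerical
    gradient used here for clarity, far fewer may be practical.
    """
    p = np.zeros(6)  # [rx, ry, rz, tx, ty, tz]: R starts as identity, t as zero

    def loss(p):
        R = rotation_matrix(*p[:3])
        pred = X_i @ R.T + p[3:]           # Eq. (1)
        return np.mean(np.sum((pred - X_ref) ** 2, axis=1))  # Eq. (2)

    for _ in range(iters):
        grad = np.zeros(6)
        for k in range(6):                 # central-difference gradient
            dp = np.zeros(6)
            dp[k] = eps
            grad[k] = (loss(p + dp) - loss(p - dp)) / (2 * eps)
        p -= alpha * grad                  # Eq. (3) update
    return rotation_matrix(*p[:3]), p[3:]
```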
Ⅲ. High-precision 3D Skeleton Estimation
- 1. Proposed Algorithm
When the point cloud is captured by the multi-view RGB-D camera system, projection images on four planes are generated for 3D skeleton extraction. Next, the 2D skeleton of each projected image is extracted using the OpenPose library, and the intersection points of the joints in space are calculated for the 3D skeleton. Finally, a post-processing step for high-precision 3D skeleton extraction is performed. Figure 3 shows the proposed algorithm for skeleton extraction.
Fig. 3. Work flow for 3D skeleton extraction
- 2. Pre-Processing
When a 2D skeleton is extracted by feeding the projection image of the point cloud into the OpenPose network, the accuracy is highest for the image projected from the front direction. Therefore, by analyzing the spatial distribution of the 3D coordinates of the point cloud, the front of the object is found, and the point cloud is rotated so that its front direction is parallel to the z-axis. Principal Component Analysis (PCA), which finds the principal components of distributed data, is used to find the frontal direction [13].
Figure 4 shows the two principal-component vectors $e_1$ and $e_2$ found using principal component analysis when the data are elliptical in the 2D plane. These two vectors represent the distribution of the data well; by calculating their directions and magnitudes, we can effectively analyze the data distribution [13].
Fig. 4. Example of PCA in 2D
By performing principal component analysis on the three-dimensional coordinates of the point cloud, vectors that most simply represent the distribution of the point cloud along the x, y, and z axes can be obtained. Since the distribution along the y-axis, the vertical direction of the object, is not needed to find the front, the point cloud is projected onto the xz plane and principal component analysis is performed in 2D over the x and z axes. In this way, a more accurate front direction can be found and the amount of computation is reduced. PCA first computes the covariance matrix and then finds its eigenvectors. Of the two eigenvectors obtained, the one with the smaller eigenvalue corresponds to $e_2$ in Fig. 4, and this vector represents the front direction. Figure 5 shows the point cloud before and after rotating the object so that its front lies on the z-axis using the vector found through PCA.
Fig. 5. Object rotation (a) before and (b) after using PCA
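A sketch of this front-finding step, assuming NumPy and y as the vertical axis; the function names are illustrative.

```python
import numpy as np

def front_direction_xz(points):
    """Find the object's front direction by 2D PCA on the xz plane.

    points : (N, 3) point cloud with y as the vertical axis.
    Returns the eigenvector with the smaller eigenvalue, which points
    along the front (perpendicular to the object's widest spread).
    """
    xz = points[:, [0, 2]] - points[:, [0, 2]].mean(axis=0)
    cov = np.cov(xz, rowvar=False)           # 2x2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    front_2d = eigvecs[:, 0]                 # smaller-eigenvalue vector (e2)
    return np.array([front_2d[0], 0.0, front_2d[1]])

def rotate_front_to_z(points):
    """Rotate the cloud about the y-axis so the front aligns with +z."""
    f = front_direction_xz(points)
    angle = -np.arctan2(f[0], f[2])          # angle that maps f onto +z
    c, s = np.cos(angle), np.sin(angle)
    R_y = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return points @ R_y.T
```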
After finding the front of the object, an AABB (Axis-Aligned Bounding Box) is set up to determine the projection planes in space. To project from 3D onto a 2D plane, world coordinates are transformed into coordinates on the projection plane through the MVP (Model View Projection) matrix, a 4x4 matrix. Then, to convert to the pixel coordinate system, the dynamic range is rescaled and the coordinates are quantized to integers [11].
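A simplified orthographic stand-in for this projection step is sketched below; the exact MVP matrices, image resolution, and depth-ordering convention of the actual system are not given in the paper, so they are assumptions here.

```python
import numpy as np

def project_to_image(points, colors, axis=2, res=512):
    """Orthographically project a point cloud onto one AABB face.

    axis : the axis to project along (2 = along z, i.e. a front view).
    Points nearer the viewer overwrite farther ones so the visible
    surface wins; the sign convention is illustrative only.
    """
    lo, hi = points.min(axis=0), points.max(axis=0)          # AABB extents
    keep = [i for i in range(3) if i != axis]                 # in-plane axes
    uv = (points[:, keep] - lo[keep]) / (hi[keep] - lo[keep] + 1e-9)
    px = np.clip((uv * (res - 1)).astype(int), 0, res - 1)    # quantization
    order = np.argsort(-points[:, axis])                      # far points first
    img = np.zeros((res, res, 3), dtype=np.uint8)
    img[res - 1 - px[order, 1], px[order, 0]] = colors[order] # flip v upward
    return img
```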
- 3. 2D Skeleton Extraction using Deep learning
When the 4 projection images are created, 2D skeletons are extracted using OpenPose [5]. OpenPose was presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017 and was developed at Carnegie Mellon University in the United States. Based on a Convolutional Neural Network (CNN), it is a library that can extract the features of several people's bodies, hands, and faces from images in real time. Its distinguishing characteristic is that the poses of several people can be found quickly. Before OpenPose was announced, multi-person pose estimation mainly used the top-down method, which detects each person in the image and then repeatedly estimates a pose for each detected person. OpenPose is a bottom-up method that improves performance without this repeated processing: it estimates the joints of all people at once, then connects the joint positions and regroups them into per-person skeletons. In general, bottom-up methods face the problem of determining which person a joint belongs to. To address this, OpenPose uses Part Affinity Fields, which make it possible to infer which person a body part belongs to. The skeleton extraction result of OpenPose is output as an image and a JSON file. Figure 6 shows the 2D skeleton extraction results for the projected images.
Fig. 6. Extraction of the 2D skeleton of the projected image: (a) front, (b) right, (c) rear, (d) left
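OpenPose writes its per-frame results to a JSON file whose "people" entries hold flattened (x, y, confidence) triplets per joint. A minimal reader might look like the following (the function name is illustrative):

```python
import json
import numpy as np

def load_openpose_keypoints(json_path):
    """Read 2D joints from an OpenPose output JSON file.

    OpenPose stores, for each detected person, a flat list
    [x0, y0, c0, x1, y1, c1, ...] where c is the detection confidence.
    Returns an (num_people, num_joints, 3) array.
    """
    with open(json_path) as f:
        data = json.load(f)
    people = [np.array(p["pose_keypoints_2d"]).reshape(-1, 3)
              for p in data["people"]]
    return np.stack(people) if people else np.empty((0, 0, 3))
```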
- 4. Joint Intersection Calculation
The joint coordinates extracted on the four projection planes are first restored from the 2D pixel coordinate system back to the 3D coordinate system. When the corresponding coordinates on the four planes are connected, four candidate coordinates that intersect in space are obtained for each joint. Among these four coordinates, any coordinate that is 3 cm or more away from the other coordinates is judged to contain an error and is removed. The 3D joint is then obtained as the average of the remaining candidate coordinates. Figure 7 shows an example of extracting the 3D joint of the right hand.
Fig. 7. The extracted joint with error: (a) 2D skeleton, (b) joint intersection, (c) incorrect joint, (d) target joint and point cloud in the neighborhood
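A sketch of this fusion rule, assuming the four candidates are already given as 3D coordinates. How the paper handles the degenerate case where all candidates are mutually distant is not stated, so the fallback here is an assumption.

```python
import numpy as np

def fuse_joint_candidates(candidates, reject_dist=0.03):
    """Fuse the four per-joint intersection coordinates into one 3D joint.

    candidates : (4, 3) intersection points from the four projection
    planes. A candidate farther than reject_dist (3 cm) from every
    other candidate is treated as erroneous and dropped; the surviving
    candidates are averaged.
    """
    candidates = np.asarray(candidates)
    d = np.linalg.norm(candidates[:, None] - candidates[None, :], axis=2)
    np.fill_diagonal(d, np.inf)
    keep = d.min(axis=1) < reject_dist     # close to at least one other point
    if not keep.any():                     # degenerate case: keep everything
        keep[:] = True
    return candidates[keep].mean(axis=0)
```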
- 5. Post-Processing
Figure 7(d) shows a joint located outside the object because it was extracted incorrectly. Such misaligned joints lie outside the 3D model and need to be corrected, so a post-processing step is performed. First, the neighborhood point cloud of the joint is found. Then, the center point and radius of the sphere of Eq. (5) that fit the neighborhood point cloud with the smallest error are obtained.
$$(x - a)^2 + (y - b)^2 + (z - c)^2 = r^2 \tag{5}$$
For the N neighborhood points, the center point with the smallest error becomes the corrected joint position. Figure 8 shows an example of finding this center point. Since a sphere is fully defined by only 4 points, the system of sphere equations becomes overdetermined as the number of points grows; we therefore find the sphere that minimizes the error of Eq. (5). In Fig. 8(a) the fitted sphere is too large for the points, and in Fig. 8(c) it is too small. Figure 8(b) gives an appropriate result, and the center of this sphere becomes the corrected joint. After finding, through an iterative process, the virtual sphere that best contacts the neighboring point cloud, the joint is moved to the center of that sphere.
Fig. 8. Sphere estimation for correcting the unsuitable joint: (a) large sphere, (b) medium sphere, (c) small sphere
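The sphere of Eq. (5) can also be fitted to the N neighborhood points in closed form by linearizing the equation. This linear least-squares formulation is a standard technique and an assumption here, since the paper only states that the minimum-error sphere is found iteratively.

```python
import numpy as np

def fit_sphere(points):
    """Least-squares sphere fit to the point cloud around a joint.

    Expanding (x-a)^2 + (y-b)^2 + (z-c)^2 = r^2 gives a system linear
    in (a, b, c, r^2 - a^2 - b^2 - c^2), solved in one lstsq call.
    The returned center is the corrected joint position.
    """
    P = np.asarray(points, dtype=float)
    A = np.hstack([2 * P, np.ones((len(P), 1))])
    b = (P ** 2).sum(axis=1)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    center = w[:3]
    radius = np.sqrt(w[3] + center @ center)
    return center, radius
```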
Figure 9 shows the correction of an error joint through the above post-processing. When an error joint is obtained as in Fig. 9(a), applying the proposed post-processing yields a corrected joint as in Fig. 9(b). The figure shows that with post-processing the joint is positioned stably inside the model.
Fig. 9. Joints located outside the object: (a) before correction, (b) moved inside the object after correction
Ⅳ. Experiment and Result
- 1. Capturing System
In this paper, eight Microsoft Azure Kinect cameras were used, arranged in the capturing system of Figure 1. Four units were installed at a height of 0.7 m from the ground, and the remaining four were placed at a height of 1.5 m to capture the top of the object. A threshold was applied to the depth values to obtain a point cloud only for objects within 0.1 m to 1.5 m. Figure 10 shows the actual camera system.
Fig. 10. The distributed camera system used
- 2. 3D Calibration
Each camera outputs RGB and depth images at 30 fps. Using these two images and the camera's intrinsic parameters, a point cloud in the camera coordinate system can be generated. When the 8 point cloud sets are generated, they are integrated into one 3D model through the camera calibration optimization process. Table 1 shows the average calibration error of the point clouds; all cameras have an error of less than 5 mm.
Table 1. Mean registration error of the point cloud
Figure 11 shows the point cloud of a captured Charuco board box before and after registration. Figure 11(a) is the point cloud before registration, and Fig. 11(b) is the point cloud after registration. From Fig. 11, it can be seen that the registered cloud matches the shape of the actual Charuco board box.
Fig. 11. Point cloud before and after integration: (a) point cloud output from each camera, (b) point cloud integrated through the coordinate transformation parameters
- 3. 3D Skeleton Extraction Result
The 3D skeleton extraction experiment was performed on a sequence of a graphics model with a ground-truth skeleton [15]. The joint error was measured by MPJPE (Mean Per Joint Position Error). MPJPE is the average joint error between the ground truth skeleton and the predicted skeleton, calculated using Eq. (6) [16].
$$\mathrm{MPJPE} = \frac{1}{N} \sum_{i=1}^{N} \left\| P_{0}^{(i)} - P^{(i)} \right\|_2 \tag{6}$$
In Eq. (6), $P_0$ is a joint coordinate of the ground truth skeleton, and $P$ is the corresponding joint coordinate of the predicted skeleton. N is the number of joints; in this paper, 15 joints were used. Using the proposed algorithm, we measured how much the MPJPE improved. Since each algorithm defines joint positions differently, an error due to the definition of the positions is always included; therefore, the standard deviation of MPJPE was also calculated to check whether the skeleton was extracted stably. Figure 12 compares the ground truth skeleton with the corrected skeleton in three frames. In Fig. 12, the red skeleton is the ground truth and the blue skeleton is the one predicted by the proposed algorithm.
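Eq. (6) translates directly into a few lines of NumPy; a minimal sketch (names illustrative):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error, Eq. (6): the average Euclidean
    distance between predicted and ground-truth joint coordinates.

    pred, gt : (N, 3) arrays of the N joint coordinates (N = 15 here).
    """
    return np.linalg.norm(pred - gt, axis=1).mean()
```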
Fig. 12. (a) Mesh, (b) point cloud and ground truth skeleton, (c) point cloud, ground truth skeleton, and skeleton before correction, (d) point cloud, ground truth skeleton, and corrected skeleton, for frames 1, 4, and 9
The MPJPE measurements for the skeletons of Fig. 12(c) and (d) are shown in the graph of Fig. 13. The results confirm that the corrected skeleton has less error and is extracted stably.
Fig. 13. MPJPE results for each frame before and after correction
Ⅴ. Conclusion
In this paper, we proposed an algorithm that extracts a 3D skeleton from a photo-realistic point cloud sequence acquired at 30 fps with an 8-camera RGB-D system. Through camera calibration with an optimization algorithm, an integrated point cloud with an error of less than 5 mm could be created. To extract the skeleton, projection images for the four sides of the object are generated and 2D skeletons are extracted using the OpenPose deep learning library, and a post-processing step was proposed for high-precision skeleton extraction. As a result, a high-precision 3D skeleton could be extracted from the generated point cloud sequence without a separate motion capture device. With post-processing, the skeleton was visibly more stable than when using OpenPose alone, which was also confirmed numerically through the per-frame MPJPE. High-precision 3D skeletons can be useful in various applications such as 3D model animation, motion recognition, and compression.
This work was supported by the Technology development Program (S2949268) funded by the Ministry of SMEs and Startups (MSS, Korea).
BIO
Kyung-Jin Kim
- 2019. 02 : Received B.S. in Department of Electronic Material Engineering, Kwangwoon University
- 2019. 03 ~ Current : Pursuing M.S. in Department of Electronic Material Engineering, Kwangwoon University
- Research of Interest : Pointcloud, Digital Holography, 2D/3D Image processing and Compression
Byung-Seo Park
- 2019. 02 : Received B.S. in Department of Business Administration, Kwangwoon University
- 2019. 03 ~ Current : Pursuing M.S. in Department of Electronic Material Engineering, Kwangwoon University
- Research of Interest : Pointcloud, Deep learning, 2D/3D Image processing
Ji-Won Kang
- 2019. 02 : Received B.S. in Department of Electronic Material Engineering, Kwangwoon University
- 2019. 09 ~ Current : Pursuing M.S. in Department of Electronic Material Engineering, Kwangwoon University
- Research of Interest : Image processing, Digital hologram, Deep learning
Jin-Kyum Kim
- 2019. 02 : Received B.S. in Department of Electronic Material Engineering, Kwangwoon University
- 2019. 03 ~ Current : Pursuing M.S. in Department of Electronic Material Engineering, Kwangwoon University
- Research of Interest : Digital Holography, 2D/3D Image processing and Compression
Woo-Suk Kim
- 2018. 08 : Received B.S. in Department of Electrical and Electronic Control Engineering, National Hankyung University
- 2018. 09 ~ Current : Pursuing M.S. in Department of Electronic Material Engineering, Kwangwoon University
- Research of Interest : Digital Holography, 2D/3D Image processing and Compression, Super resolution Image processing
Dong-Wook Kim
- 1983. 02 : Received B.S. Department of Electronic Engineering, Hanyang University
- 1985. 02 : Received M.S. in Hanyang University
- 1991. 09 : Ph.D. in Department of Electronic Engineering, Georgia Tech
- 1992. 03 ~ Current : Associate professor in Department of Electronic Material Engineering, Kwangwoon University
- Research of Interest : 3D Image processing, Digital hologram, Digital VLSI Testability, VLSI CAD, DSP design, Wireless Communication
Young-Ho Seo
- 1999. 02 : Received B.S. in Department of Electronic Material Engineering, Kwangwoon University
- 2001. 02 : Received M.S. in Department of Electronic Material Engineering, Kwangwoon University
- 2004. 08 : Received Ph.D. in Department of Electronic Material Engineering, Kwangwoon University
- 2005. 09 ~ 2008. 02 : Assistant professor in Hansung University
- 2008. 03 ~ Current : Associate professor in Department of Electronic Material Engineering, Kwangwoon University
- Research of Interest : Immersive Media, 2D/3D Image processing, Digital hologram
References
Guo Y. , Bennamoun M. , Sohel F. , Lu M. , Wan J. 2015 An Integrated Framework for 3-D Modeling, Object Detection, and Pose Estimation From Point-Clouds IEEE Transactions on Instrumentation and Measurement 64 (3) 683 - 693    DOI : 10.1109/TIM.2014.2358131
Schreer O. 2019 Advanced Volumetric Capture and Processing SMPTE Motion Imaging Journal 128 (5) 18 - 24
Munea T. L. , Jembre Y. Z. , Weldegebriel H. T. , Chen L. , Huang C. , Yang C. 2020 The Progress of Human Pose Estimation: A Survey and Taxonomy of Models Applied in 2D Human Pose Estimation IEEE Access 8 133330 - 133348    DOI : 10.1109/ACCESS.2020.3010248
Rim Beanbonyka 2020 Real-time Human Pose Estimation using RGB-D images and Deep Learning Journal of Internet Computing and Services 21 (3) 113 - 121
Cao Zhe 2018 OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields arXiv preprint arXiv:1812.08008
He Haoyang 2019 Interacting Multiple Model-Based Human Pose Estimation Using a Distributed 3D Camera Network IEEE Sensors Journal 19 (22) 10584 - 10590    DOI : 10.1109/JSEN.2019.2931603
An Gwon Hwan 2018 Charuco board-based omnidirectional camera calibration method Electronics 7 (12) 421 -    DOI : 10.3390/electronics7120421
Zhang Z. 2000 A flexible new technique for camera calibration IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (11) 1330 - 1334    DOI : 10.1109/34.888718
Ruder Sebastian 2016 An overview of gradient descent optimization algorithms arXiv preprint arXiv:1609.04747
Lee S. 2019 Convergence Rate of Optimization Algorithms for a Non-strictly Convex Function Institute of Control Robotics and Systems 349 - 350
Jackins C.L. , Tanimoto S.L. 1980 Oct-trees and their use in representing-three-dimensional objects Comput. Graphics Image Process. 14 (3) 249 - 270    DOI : 10.1016/0146-664X(80)90055-6
Kim K. , Park B. , Kim J. , Kim D. , Seo Young-Ho 2020 Holographic augmented reality based on three-dimensional volumetric imaging for a photorealistic scene Optics Express 28 35972 - 35985    DOI : 10.1364/OE.411141
Wold Svante , Kim Esbensen , Geladi Paul 1987 Principal component analysis Chemometrics and Intelligent Laboratory Systems 2 (1–3) 37 - 52    DOI : 10.1016/0169-7439(87)80084-9
Barequet Gill , Har-Peled Sariel 2001 Efficiently approximating the minimum-volume bounding box of a point set in three dimensions Journal of Algorithms 38 (1) 91 - 109    DOI : 10.1006/jagm.2000.1127
https://free3d.com/ko/3d-model/nathan-animated-003-walking-644277.html
Luvizon Diogo 2019 Machine Learning for Human Action Recognition and Pose Estimation based on 3D Information, Ph.D. dissertation, Cergy Paris Université