High Accuracy Skeleton Estimation using 3D Volumetric Model based on RGB-D

Journal of Broadcast Engineering, Dec. 2020, 25(7): 1095-1106

- Received : November 18, 2020
- Accepted : November 30, 2020
- Published : December 30, 2020

In this paper, we propose an algorithm that extracts a high-precision 3D skeleton from a 3D model generated with a distributed RGB-D camera system. When the 3D model captured by the distributed cameras is exploited, a skeleton with higher precision can be obtained. To improve the precision of the 2D skeletons, we use principal component analysis (PCA) to find the conditions under which a 2D skeleton is extracted well. High-quality 2D skeletons are obtained in this way, and a high-precision 3D skeleton is extracted by combining their information. Since the resulting skeleton may still contain errors, we also propose an algorithm that removes these errors using the information of the 3D model. With the proposed method, we were able to extract skeletons of very high accuracy.
Fig. 1. 3D point cloud capturing system: (a) vertical, (b) horizontal shooting angle and range
Fig. 2. Charuco board used to acquire feature points: (a) Charuco board, (b) world coordinate system obtained through the Charuco board
Next, we use a method for obtaining the extrinsic parameters of each camera using the matched coordinates in the point cloud sets used to generate the 3D model. These parameters are calculated with an optimization algorithm such that the SED (squared Euclidean distance) of the matched coordinates is minimal [9]. The transformation matrix of the coordinate system includes parameters for the rotation angles and translation values for each of the x, y, and z axes. After setting one camera as the reference coordinate system, the parameters for converting the coordinates of the other cameras to the reference coordinate system are obtained. Here X_{ref} represents the coordinates of the reference camera, X_{i} represents the coordinates of the remaining cameras, and R_{i→ref} and t_{i→ref} represent the rotation matrix and translation vector from each camera to the reference camera:

X_{i}′ = R_{i→ref} X_{i} + t_{i→ref}        (1)

The initial R_{i→ref} is the identity matrix and the initial t_{i→ref} is all zeros, so applying Eq. (1) with the initial parameters simply returns X_{i}; the result converges to X_{ref} as the optimization proceeds. The loss function to be optimized is the average SED of X_{ref} and X_{i}′, given by the error function of Eq. (2):

E = (1/N) Σ_{k=1}^{N} ‖X_{ref,k} − X_{i,k}′‖²        (2)

Differentiating the loss function with respect to the coordinate transformation parameters and updating the parameters to minimize it can be expressed as Eq. (3) [10]:

P_{n+1} = P_{n} − α ∂E/∂P_{n}        (3)
Here α is the learning rate, a constant set to 0.01. P_{n+1} and P_{n} are the parameters at the (n+1)-th and n-th iterations, respectively.
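The update loop of Eqs. (1)-(3) can be illustrated with a minimal numpy sketch. This is not the paper's implementation: it assumes the rotation is parameterized by three Euler angles, uses numerical gradients for simplicity, and all function names are ours.

```python
import numpy as np

def rotation_matrix(rx, ry, rz):
    """Rotation about the x, y, and z axes, composed as Rz @ Ry @ Rx."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def mean_sed(params, X_i, X_ref):
    """Loss of Eq. (2): mean squared Euclidean distance after Eq. (1)."""
    R = rotation_matrix(*params[:3])
    t = params[3:]
    X_i_prime = X_i @ R.T + t          # Eq. (1): X_i' = R X_i + t  (row vectors)
    return np.mean(np.sum((X_ref - X_i_prime) ** 2, axis=1))

def calibrate(X_i, X_ref, alpha=0.01, iters=20000, eps=1e-6):
    """Gradient-descent update of Eq. (3) with central-difference gradients.
    The initial R is the identity (all angles 0) and t is zero, as in the text."""
    p = np.zeros(6)                    # [rx, ry, rz, tx, ty, tz]
    for _ in range(iters):
        grad = np.zeros(6)
        for k in range(6):
            d = np.zeros(6)
            d[k] = eps
            grad[k] = (mean_sed(p + d, X_i, X_ref)
                       - mean_sed(p - d, X_i, X_ref)) / (2 * eps)
        p = p - alpha * grad           # Eq. (3): P_{n+1} = P_n - α ∂E/∂P_n
    return rotation_matrix(*p[:3]), p[3:]
```

With matched point sets from two cameras, `calibrate` returns an estimate of R_{i→ref} and t_{i→ref} that aligns one camera's points to the reference camera.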
When this process is performed for more than 200,000 iterations, the average error of the 8 cameras is reduced to 2.98 mm. Once the parameters of each camera are obtained by Eq. (3), the transformation from the camera coordinate system to the world coordinate system can be performed using Eq. (4), and the point cloud can be aligned to the unified coordinate system:

P_{W} = R_{i→ref} P_{C} + t_{i→ref}        (4)

P_{W} represents the world coordinates (reference camera coordinates), and P_{C} represents the camera coordinates [12].
Fig. 3. Work flow for 3D skeleton extraction
The two principal component vectors, found using principal component analysis when the data are elliptical in the 2D plane, represent the distribution of the data well. By calculating the direction and magnitude of these vectors, we can effectively analyze the data distribution
[13]
.
Fig. 4. Example of PCA in 2D
By performing principal component analysis on the three-dimensional coordinates of the point cloud, vectors that most simply represent the distribution of the point cloud along the x, y, and z axes can be obtained. Since the distribution along the y-axis, the vertical direction of the object, is not needed to find the front, the point cloud is projected onto the xz plane and principal component analysis is performed in 2D on the x and z axes. In this way, a more accurate front direction can be found and the amount of computation is reduced. PCA first computes the covariance matrix and then finds its eigenvectors. Of the two eigenvectors obtained, the one with the smaller eigenvalue represents the front direction. Figure 5 shows the point cloud before and after rotating the object so that its front lies on the z-axis using the vector found through PCA.
Fig. 5. Object rotation (a) before and (b) after using PCA
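The front-direction search described above can be sketched as follows: project the cloud onto the xz plane, take the eigenvector of the 2D covariance matrix with the smaller eigenvalue, and rotate about the y axis so this direction lies on the z axis. A minimal numpy sketch; the function names are ours.

```python
import numpy as np

def front_direction(points):
    """Project the point cloud onto the xz plane and return the unit
    eigenvector of the 2D covariance matrix with the smaller eigenvalue,
    i.e. the front direction described in the text."""
    xz = points[:, [0, 2]]                  # drop the vertical y axis
    cov = np.cov(xz, rowvar=False)          # 2x2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigvecs[:, 0]                    # vector with the smaller eigenvalue

def rotate_front_to_z(points):
    """Rotate about the y axis so that the front direction lies on the z axis."""
    fx, fz = front_direction(points)
    theta = np.arctan2(fx, fz)              # angle of the front from the z axis
    c, s = np.cos(theta), np.sin(theta)
    R_y = np.array([[c, 0, -s],
                    [0, 1,  0],
                    [s, 0,  c]])            # rotation by -theta about y
    return points @ R_y.T
```

After `rotate_front_to_z`, the direction of smallest horizontal spread (the front) is aligned with the z axis, matching the before/after views of Fig. 5.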
After finding the front of the object, an AABB (axis-aligned bounding box) is set up to determine the projection planes in space. To project from 3D onto a 2D plane, the world coordinates are transformed into coordinates on the projection plane through the MVP (model-view-projection) matrix, a 4x4 matrix. Then, to convert to the pixel coordinate system, the dynamic range is rescaled and the coordinates are quantized to integers [11].
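The rescaling and quantization step can be illustrated with the orthographic core of this pipeline. The sketch below is a simplification, not the paper's full 4x4 MVP transform: it projects onto a single plane, assumes an image size of 512x512, and the function name is ours.

```python
import numpy as np

def orthographic_pixels(points, width=512, height=512):
    """Project a point cloud orthographically onto the xy image plane.
    The AABB supplies the dynamic range, which is rescaled to pixel
    coordinates and quantized to integers, as described in the text."""
    lo = points.min(axis=0)                  # AABB minimum corner
    hi = points.max(axis=0)                  # AABB maximum corner
    # normalize x and y into [0, 1] using the AABB dynamic range
    uv = (points[:, :2] - lo[:2]) / (hi[:2] - lo[:2])
    # rescale to pixel range and quantize; flip y so the image origin is top-left
    px = np.clip((uv[:, 0] * (width - 1)).round().astype(int), 0, width - 1)
    py = np.clip(((1 - uv[:, 1]) * (height - 1)).round().astype(int), 0, height - 1)
    return np.stack([px, py], axis=1)
```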
Fig. 6. Extraction of the 2D skeleton of the projected image: (a) front, (b) right, (c) rear, (d) left
Fig. 7. The extracted joint with error: (a) 2D skeleton, (b) joint intersection, (c) incorrect joint, (d) target joint and point cloud in the neighborhood
For the N neighboring points, the center point with the smallest error becomes the corrected joint position. Figure 8 shows an example of finding this center point. Although a sphere is fully defined by only 4 points, for N points the system is overdetermined, so we must find the sphere whose center c and radius r minimize the fitting error of Equation (5):

E = Σ_{k=1}^{N} (‖p_k − c‖ − r)²        (5)

In Fig. 8(a), a sphere that is too large for the points is fitted, and in Fig. 8(c) one that is too small. Fig. 8(b) gives an appropriate result, and the center of this sphere becomes the corrected joint. After obtaining, through an iterative process, the virtual sphere that best contacts the neighboring points, the joint is moved to the center of the sphere.
Fig. 8. Sphere estimation for correcting the unsuitable joint: (a) large sphere, (b) medium sphere, (c) small sphere
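One common way to realize such a sphere search is a least-squares fit. The sketch below uses the closed-form algebraic fit obtained by linearizing ‖p − c‖² = r², rather than the iterative process described in the text, so its error measure is an approximation of the geometric one; the function name is ours.

```python
import numpy as np

def fit_sphere(points):
    """Algebraic least-squares sphere fit.
    From |p - c|^2 = r^2 we get the linear system
    2 p·c + (r^2 - |c|^2) = |p|^2, solved for c and r."""
    A = np.hstack([2 * points, np.ones((len(points), 1))])
    b = np.sum(points ** 2, axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    center = sol[:3]
    radius = np.sqrt(sol[3] + center @ center)
    return center, radius
```

Given the N neighboring points of an erroneous joint, the returned center would serve as the corrected joint position.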
Figure 9 shows the correction of an erroneous joint through the above post-processing. When an erroneous joint is obtained as in Fig. 9(a), applying the proposed post-processing yields a corrected joint as in Fig. 9(b). The figure shows that with post-processing the joint is positioned stably inside the model.
Fig. 9. Joints located outside the object: (a) before correction, (b) moved inside the object after correction
Fig. 10. The distributed camera system used
Figure 11 shows the point cloud of the Charuco board box before and after registration. Figure 11(a) is the point cloud before registration, and Fig. 11(b) is the point cloud after registration. From Fig. 11, it can be seen that the registered cloud matches the shape of the actual Charuco board box.
Fig. 11. Point cloud before and after integration: (a) point cloud output from each camera, (b) point cloud integrated through the coordinate transformation parameters
In Equation (6), P0_i is the i-th joint coordinate of the ground truth skeleton, P_i is the corresponding joint coordinate of the predicted skeleton, and N is the number of joints; 15 joints were used in this paper. Equation (6) gives the average joint error:

MPJPE = (1/N) Σ_{i=1}^{N} ‖P0_i − P_i‖        (6)

Using the proposed algorithm, we confirmed how much the MPJPE improved. Since each algorithm defines the joint positions differently, an error arising from the definition of the positions is always included. Therefore, the standard deviation of the MPJPE was also calculated to judge whether the skeleton was extracted stably.
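Equation (6) is straightforward to compute; a minimal sketch (function name ours), with the per-frame standard deviation then taken over the sequence of MPJPE values:

```python
import numpy as np

def mpjpe(gt, pred):
    """Eq. (6): mean Euclidean distance over the N joints between the
    ground-truth joints P0 (gt) and the predicted joints P (pred),
    both arrays of shape (N, 3)."""
    return np.mean(np.linalg.norm(gt - pred, axis=1))
```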
Figure 12 compares the ground truth skeleton and the corrected skeleton in three frames. In Fig. 12, the red skeleton is the ground truth, and the blue skeleton is the one predicted by the proposed algorithm.
Fig. 12. Frames 1, 4, and 9: (a) mesh, (b) point cloud and ground truth skeleton, (c) point cloud, ground truth skeleton, and skeleton before correction, (d) point cloud, ground truth skeleton, and skeleton after correction
The MPJPE measurements for the skeletons of Fig. 12(c) and (d) are shown in the graph of Fig. 13. The results confirm that the corrected skeleton has less error and is extracted stably.
Fig. 13. MPJPE results for each frame before and after correction
※ This work was supported by the Technology development Program (S2949268) funded by the Ministry of SMEs and Startups (MSS, Korea).
Kyung-Jin Kim
- 2019. 02 : Received B.S. in Department of Electronic Material Engineering, Kwangwoon University
- 2019. 03 ~ Current : Pursuing M.S. in Department of Electronic Material Engineering, Kwangwoon University
- Research of Interest : Pointcloud, Digital Holography, 2D/3D Image processing and Compression
Byung-Seo Park
- 2019. 02 : Received B.S. in Department of Business Administration, Kwangwoon University
- 2019. 03 ~ Current : Pursuing M.S. in Department of Electronic Material Engineering, Kwangwoon University
- Research of Interest : Pointcloud, Deep learning, 2D/3D Image processing
Ji-Won Kang
- 2019. 02 : Received B.S. in Department of Electronic Material Engineering, Kwangwoon University
- 2019. 09 ~ Current : Pursuing M.S. in Department of Electronic Material Engineering, Kwangwoon University
- Research of Interest : Image processing, Digital hologram, Deep learning
Jin-Kyum Kim
- 2019. 02 : Received B.S. in Department of Electronic Material Engineering, Kwangwoon University
- 2019. 03 ~ Current : Pursuing M.S. in Department of Electronic Material Engineering, Kwangwoon University
- Research of Interest : Digital Holography, 2D/3D Image processing and Compression
Woo-Suk Kim
- 2018. 08 : Received B.S. in Department of Electrical and Electronic Control Engineering, National Hankyung University
- 2018. 09 ~ Current : Pursuing M.S. in Department of Electronic Material Engineering, Kwangwoon University
- Research of Interest : Digital Holography, 2D/3D Image processing and Compression, Super resolution Image processing
Dong-Wook Kim
- 1983. 02 : Received B.S. Department of Electronic Engineering, Hanyang University
- 1985. 02 : Received M.S. in Hanyang University
- 1991. 09 : Ph.D. in Department of Electronic Engineering, Georgia Tech
- 1992. 03 ~ Current : Associate professor in Department of Electronic Material Engineering, Kwangwoon University
- Research of Interest : 3D Image processing, Digital hologram, Digital VLSI Testability, VLSI CAD, DSP design, Wireless Communication
Young-Ho Seo
- 1999. 02 : Received B.S. in Department of Electronic Material Engineering, Kwangwoon University
- 2001. 02 : Received M.S. in Department of Electronic Material Engineering, Kwangwoon University
- 2004. 08 : Received Ph.D. in Department of Electronic Material Engineering, Kwangwoon University
- 2005. 09 ~ 2008. 02 : Assistant professor in Hansung University
- 2008. 03 ~ Current : Associate professor in Department of Electronic Material Engineering, Kwangwoon University
- Research of Interest : Immersive Media, 2D/3D Image processing, Digital hologram

Ⅰ. Introduction

As virtual reality and augmented reality have become more common in recent years, 3D video content technology that provides immersive experiences has also been actively developed. 3D video contents are applied in various fields such as games, video services, medical care, and education
[1]
. All of these techniques target virtual models. As representative 3D data for virtual objects, there is a point cloud that expresses an object in the form of a point. This data basically contains 3D coordinate information and texture coordinate information for each point, and color information, normal information, material information, etc. are additionally composed according to the application
[1]
[2]
.
Computer vision aims to embody human visual perception using computers. Since the key is extracting information by analyzing images or video captured by a camera, detecting the location and orientation of objects is a core technology in computer vision. Among such tasks, recognizing the pose a person takes is called human pose estimation
[3]
Literally, it can be viewed as the problem of estimating how the joints of a person's body are arranged in a photo or video. However, not all joints are visible in the image: even for the same pose, the appearance depends on the shooting direction, joints may be occluded by other objects or clothing, and estimation can be difficult depending on the lighting. Although human pose estimation has long been studied in computer vision, it remains a difficult problem
[3]
.
Skeleton extraction is the most commonly used tool for analyzing human posture and movement. Much research has been conducted on skeleton extraction: many signal processing techniques have been studied, and in recent years many deep-learning-based techniques have been developed
[4]
A representative deep learning network for extracting 2D skeletons is OpenPose
[5]
. This network detects a large number of people at once at a speed of 8.8 fps
[4]
[5]
. Studies have also been conducted to extract 3D skeletons by solving occlusion problems using multiple RGB-D cameras
[6]
. In this paper, point cloud data are created using a multi-view camera system, the occlusion problem is solved using these data, and a deep-learning-based 3D skeleton is extracted. In addition, inaccurately extracted parts are corrected using the point cloud model to extract a high-precision skeleton.
This paper describes the process of acquiring a photorealistic point cloud sequence in Section 2 and introduces the proposed 3D skeleton extraction method in Section 3. Section 4 shows the results of the algorithm, and Section 5 concludes the paper.
Ⅱ. 3D Reconstruction using RGB-D Camera System

- 1. Camera System

This section describes the method for generating point clouds for skeleton generation. First, we implemented a system for acquiring point cloud using 8 RGB-D cameras, which is shown in
Fig. 1
. The 8 sets of RGB and depth images acquired using the system in
Fig. 1
are converted to a point cloud. As a result, 8 sets of point clouds are generated.

- 2. Extrinsic Calibration

First, a 3D Charuco board is used to find matching points in the RGB images input from the multiple cameras. The use of Charuco boards is not essential
[7]
.
Figure 2(a)
is one side of the Charuco board, and
Fig. 2(b)
is the result of detecting the matching points on the 3D Charuco board and displaying them in the world coordinate system. The origin of the world coordinate system is set to one corner of the 3D Charuco board. To obtain the 3D coordinates of the matching points, calibration between the depth and RGB images is performed
[8]
, and 3D coordinates of the matching points are obtained from the depth map.

Ⅲ. High-precision 3D Skeleton Estimation

- 1. Proposed Algorithm

When the point cloud is captured through the multi-view RGB-D camera system, projection images of four planes are generated for 3D skeleton extraction. Next, the 2D skeleton of each projected image is extracted using the OpenPose library, and the intersection points of the joints in space are calculated for the 3D skeleton. Finally, a post-processing step for high-precision 3D skeleton extraction is performed.
Figure 3
shows the proposed algorithm for skeleton extraction.

- 2. Pre-Processing

When a 2D skeleton is extracted by feeding the projection image of the point cloud into the OpenPose network, the accuracy is highest for the image projected from the front direction. Therefore, by analyzing the spatial distribution of the 3D coordinates of the point cloud, the front of the object is found, and the point cloud is rotated so that its front direction is parallel to the z-axis. Principal Component Analysis (PCA) is used to find the frontal direction
[13]
. Principal component analysis is used to find the principal components of distributed data.
Figure 4 shows an example of the two principal component vectors found for such a distribution.

- 3. 2D Skeleton Extraction using Deep learning

When the 4 projection images are created, the 2D skeletons are extracted using OpenPose
[5]
. OpenPose is a project announced at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017 and developed at Carnegie Mellon University in the United States. It is a library based on convolutional neural networks (CNNs) that can extract the features of multiple people's bodies, hands, and faces from images in real time. Its distinguishing characteristic is that the poses of several people can be found quickly. Before OpenPose was announced, the top-down method was mainly used to estimate the poses of several people: each person is detected in the image, and a pose is found for each detected person in turn. OpenPose is a bottom-up method that improves performance without this repetitive processing: the joints of all people are estimated first, the positions of the joints are connected, and they are then regrouped into the joints of each person. In general, the bottom-up approach has the problem of determining which person a joint belongs to. To resolve this, OpenPose uses Part Affinity Fields, which make it possible to infer which person a body part belongs to. The skeleton extraction results of OpenPose are output as an image and a JSON file.
Figure 6
is the result of 2D skeleton extraction of the projected image.
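Since OpenPose writes its results to JSON (one file per frame when run with `--write_json`), the 2D joints of each projected view can be read back for the later intersection step. A sketch assuming the standard output layout, where each person carries a flat `pose_keypoints_2d` list of (x, y, confidence) triplets; the function name and the one-person-per-image assumption are ours, the latter because each projected image here contains a single object.

```python
import json
import numpy as np

def load_openpose_keypoints(json_path):
    """Parse one OpenPose output JSON file and return the first person's
    2D joints as an (n_joints, 3) array with columns x, y, confidence."""
    with open(json_path) as f:
        data = json.load(f)
    people = data.get("people", [])
    if not people:
        return None                          # no detection in this view
    kp = np.array(people[0]["pose_keypoints_2d"], dtype=float)
    return kp.reshape(-1, 3)
```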

- 4. Joint Intersection Calculation

The joint coordinates extracted on the four projection planes in space are restored from the 2D skeleton pixel coordinate system back to the 3D coordinate system. When the corresponding coordinates on the four planes are connected, four intersection coordinates in space are obtained. Among these four coordinates, any coordinate at a distance of 3 cm or more from the other coordinates is judged to contain an error and is removed. The 3D joint is then obtained as the average of the remaining candidate coordinates.
Figure 7
is an example of extracting the 3D joint of the right hand.
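The candidate filtering above can be sketched as follows. This assumes "distance from other coordinates" means the distance to the nearest other candidate, which is one plausible reading of the text; the function name and the fallback when all candidates are rejected are ours.

```python
import numpy as np

def fuse_joint_candidates(candidates, thresh=0.03):
    """Average the intersection candidates for one joint, discarding any
    candidate whose nearest other candidate is 3 cm (0.03 m) or farther."""
    c = np.asarray(candidates, dtype=float)
    keep = []
    for i in range(len(c)):
        others = np.delete(c, i, axis=0)
        # keep the candidate if it lies close to at least one other candidate
        if np.min(np.linalg.norm(others - c[i], axis=1)) < thresh:
            keep.append(i)
    if not keep:                             # degenerate case: keep everything
        keep = list(range(len(c)))
    return c[keep].mean(axis=0)
```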

- 5. Post-Processing

Figure 7(d)
shows a joint located outside the object because it was extracted incorrectly. Such misaligned joints lie outside the 3D model and need to be corrected, so a post-processing step is performed. First, the neighboring points of the joint are found. Then, the sphere center and radius that minimize the error of Equation (5) for these neighboring points are obtained.

Ⅳ. Experiment and Result

- 1. Capturing System

In this paper, eight Microsoft Azure Kinect cameras were used. The camera arrangement follows the capturing system shown in Figure 1. Four units were installed at a height of 0.7 m from the ground, and the remaining four at a height of 1.5 m to capture the top of the object. A threshold was applied to the depth values to obtain a point cloud for objects within 0.1 m to 1.5 m.
Figure 10
is the actual camera system.

- 2. 3D Calibration

Each camera outputs RGB and depth images at 30 fps. Using these two images and the camera's intrinsic parameters, a point cloud in the camera coordinate system can be generated. When the 8 sets of point clouds are generated, they are integrated into a 3D model through the camera calibration optimization process.
Table 1
shows the average registration error of the point cloud; all cameras have an error of less than 5 mm.
Table 1. Mean registration error of point cloud


- 3. 3D Skeleton Extraction Result

The 3D skeleton extraction experiment was performed on a sequence of a graphics model with a ground truth skeleton
[15]
. The numerical value for the joint error was obtained by means of MPJPE (Mean Per Joint Position Error). MPJPE represents the average value of the joint error between the ground truth skeleton and the predicted skeleton, and is calculated using Equation (6)
[16]
.

Ⅴ. Conclusion

In this paper, we propose an algorithm that extracts a 3D skeleton from a photorealistic point cloud sequence acquired at 30 fps through an 8-camera RGB-D system. Through a camera calibration process using an optimization algorithm, an integrated point cloud with an error of less than 5 mm could be created. To extract the skeleton, projection planes for the four sides of the object are generated, and the 2D skeletons are extracted using the OpenPose library, a deep learning model. A post-processing step was then proposed for high-precision skeleton extraction. In this way, a high-precision 3D skeleton could be extracted from the generated point cloud sequence without a separate motion capture device. The improvement from post-processing over using OpenPose alone was confirmed both visually and numerically through the per-frame MPJPE. By extracting high-precision 3D skeletons, the method can be useful in various applications such as 3D model animation, motion recognition, and compression.
References

[1] Y. Guo, M. Bennamoun, F. Sohel, M. Lu, and J. Wan, "An Integrated Framework for 3-D Modeling, Object Detection, and Pose Estimation From Point-Clouds," IEEE Transactions on Instrumentation and Measurement, vol. 64, no. 3, pp. 683-693, 2015. DOI: 10.1109/TIM.2014.2358131
[2] O. Schreer, "Advanced Volumetric Capture and Processing," SMPTE Motion Imaging Journal, vol. 128, no. 5, pp. 18-24, 2019.
[3] T. L. Munea, Y. Z. Jembre, H. T. Weldegebriel, L. Chen, C. Huang, and C. Yang, "The Progress of Human Pose Estimation: A Survey and Taxonomy of Models Applied in 2D Human Pose Estimation," IEEE Access, vol. 8, pp. 133330-133348, 2020. DOI: 10.1109/ACCESS.2020.3010248
[4] B. Rim, "Real-time Human Pose Estimation using RGB-D images and Deep Learning," Journal of Internet Computing and Services, vol. 21, no. 3, pp. 113-121, 2020.
[5] Z. Cao, "OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields," arXiv preprint arXiv:1812.08008, 2018.
[6] H. He, "Interacting Multiple Model-Based Human Pose Estimation Using a Distributed 3D Camera Network," IEEE Sensors Journal, vol. 19, no. 22, pp. 10584-10590, 2019. DOI: 10.1109/JSEN.2019.2931603
[7] G. H. An, "Charuco Board-Based Omnidirectional Camera Calibration Method," Electronics, vol. 7, no. 12, p. 421, 2018. DOI: 10.3390/electronics7120421
[8] Z. Zhang, "A Flexible New Technique for Camera Calibration," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, pp. 1330-1334, 2000. DOI: 10.1109/34.888718
[9] S. Ruder, "An Overview of Gradient Descent Optimization Algorithms," arXiv preprint arXiv:1609.04747, 2016.
[10] S. Lee, "Convergence Rate of Optimization Algorithms for a Non-strictly Convex Function," Institute of Control, Robotics and Systems, pp. 349-350, 2019.
[11] C. L. Jackins and S. L. Tanimoto, "Oct-trees and Their Use in Representing Three-Dimensional Objects," Computer Graphics and Image Processing, vol. 14, no. 3, pp. 249-270, 1980. DOI: 10.1016/0146-664X(80)90055-6
[12] K. Kim, B. Park, J. Kim, D. Kim, and Y.-H. Seo, "Holographic Augmented Reality Based on Three-Dimensional Volumetric Imaging for a Photorealistic Scene," Optics Express, vol. 28, pp. 35972-35985, 2020. DOI: 10.1364/OE.411141
[13] S. Wold, K. Esbensen, and P. Geladi, "Principal Component Analysis," Chemometrics and Intelligent Laboratory Systems, vol. 2, no. 1-3, pp. 37-52, 1987. DOI: 10.1016/0169-7439(87)80084-9
[14] G. Barequet and S. Har-Peled, "Efficiently Approximating the Minimum-Volume Bounding Box of a Point Set in Three Dimensions," Journal of Algorithms, vol. 38, no. 1, pp. 91-109, 2001. DOI: 10.1006/jagm.2000.1127
[15] https://free3d.com/ko/3d-model/nathan-animated-003-walking-644277.html
[16] D. Luvizon, "Machine Learning for Human Action Recognition and Pose Estimation based on 3D Information," Ph.D. dissertation, Cergy Paris Université, 2019.
