3D pose estimation in videos using convolutional neural network.
Thesis Discipline: Computer Science
Degree Grantor: University of Canterbury
Degree Name: Doctor of Philosophy
This thesis proposes, develops, and evaluates several convolutional neural network (CNN) based methods for 3D single-person pose estimation in RGB video. The research goals are achieved by studying image processing methods that use machine learning algorithms and applying them to different aspects of the pose estimation task. The theoretical framework for these goals is the design of convolutional neural networks for pose estimation, which this work explores and extends.
Different object detection, object tracking, and activity recognition methods are compared and evaluated in this thesis. State-of-the-art pose estimation methods that regress pose from images are extensively reviewed and used as the starting point of this thesis. The thesis also introduces pose-guided image synthesis methods, which can be used to create images that contain a person in a given pose.
This thesis proposes a three-stage CNN-based framework for 3D pose estimation of a single person in RGB video. The task is divided into three sub-tasks: human object detection, 2D pose estimation, and 3D pose regression. A state-of-the-art object detection method called Faster R-CNN, a state-of-the-art 2D pose estimation method known as Stacked Hourglass, and a greedy-style 3D pose reconstruction method called Projection Matching Pursuit are applied to the three sub-tasks respectively. The proposed framework is then evaluated on the Human3.6M dataset and an Olympic figure-skating video. The results show that the framework produces visually satisfactory 3D pose estimates for many poses, but not for unusual poses such as those often seen in figure skating.
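The division into three sub-tasks can be read as a simple per-frame composition of stages. The following is a minimal sketch of that data flow only, not the thesis implementation: the function name `estimate_3d_poses`, the type aliases, and the three callables are hypothetical placeholders standing in for the detector, the 2D pose estimator, and the 3D reconstruction step, and skipping frames with no detection is an assumption.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical type aliases, introduced only for this illustration.
Frame = object                                # placeholder for one RGB frame
BBox = Tuple[int, int, int, int]              # (x_min, y_min, x_max, y_max)
Pose2D = List[Tuple[float, float]]            # one (x, y) per joint
Pose3D = List[Tuple[float, float, float]]     # one (x, y, z) per joint


def estimate_3d_poses(
    frames: Sequence[Frame],
    detect_person: Callable[[Frame], Optional[BBox]],   # stage 1: person detection
    estimate_2d: Callable[[Frame, BBox], Pose2D],       # stage 2: 2D pose estimation
    lift_to_3d: Callable[[Pose2D], Pose3D],             # stage 3: 3D pose regression
) -> List[Optional[Pose3D]]:
    """Run the three stages frame by frame.

    Assumption: a frame with no detected person yields None instead of a pose.
    """
    results: List[Optional[Pose3D]] = []
    for frame in frames:
        box = detect_person(frame)
        if box is None:
            results.append(None)
            continue
        joints_2d = estimate_2d(frame, box)
        results.append(lift_to_3d(joints_2d))
    return results
```

Passing the stages as callables keeps each sub-task replaceable, which matches the way the abstract names one method per stage.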
One reason that convolutional neural networks perform poorly on images with unusual poses is a lack of training data. This thesis proposes a method to augment a human pose dataset using generative adversarial networks (GANs), which have shown potential in the area of conditional image synthesis. The task of human pose dataset augmentation is to generate a large number of labeled pose-image data pairs from a small training dataset. The task is divided into three sub-tasks: pose data augmentation, mask image generation, and RGB image generation. One Variational Autoencoder (VAE) network and two GANs are designed to complete the three sub-tasks respectively. The method is then evaluated on the Human3.6M dataset. The experimental results show that this data augmentation method can help the training of the pose estimation neural network.
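The augmentation task described above produces labeled pairs because each generated image is paired with the pose it was synthesized from. A minimal sketch of that flow follows, under stated assumptions: `augment_pose_dataset` and its three callables are hypothetical names standing in for the VAE and the two GANs, and the sketch assumes each stage consumes only the output of the previous one, which the abstract does not specify.

```python
from typing import Any, Callable, List, Tuple


def augment_pose_dataset(
    n_samples: int,
    sample_pose: Callable[[], Any],       # stage 1: new pose (e.g. drawn from a VAE)
    pose_to_mask: Callable[[Any], Any],   # stage 2: pose -> mask image (first GAN)
    mask_to_rgb: Callable[[Any], Any],    # stage 3: mask -> RGB image (second GAN)
) -> List[Tuple[Any, Any]]:
    """Return n_samples (pose, image) pairs; the pose is the label of its image."""
    pairs: List[Tuple[Any, Any]] = []
    for _ in range(n_samples):
        pose = sample_pose()
        mask = pose_to_mask(pose)
        image = mask_to_rgb(mask)
        pairs.append((pose, image))
    return pairs
```

The resulting pairs could then be appended to the original training set of the pose estimation network.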