3D human pose estimation in video with temporal convolutions and semi-supervised training

Posted on 16/04/2019, in Paper.
  • Overview: This paper proposes a temporal CNN model that generates 3D pose keypoints from 2D keypoints. It also introduces a semi-supervised training method, back-projection, that helps with the problem that 3D keypoint labels are limited.
  • 2D->3D keypoints: The mainstream approach to 3D keypoint detection relies on 2D predictions, and lifting 2D to 3D was thought to be straightforward if ground-truth 2D keypoints are given. Previous works go from 2D to 3D in a single-frame setting, but recent work has started using temporal information, in the form of 2D keypoint trajectories.
  • Temporal dilated CNN: This is very similar to the one used in NLP. There are two empirical modifications in this paper: a) the padding on the boundary replicates the edge frames rather than using zeros; b) the model outputs one frame at a time with a limited receptive field (this is for computational purposes).
  • Back-projection: This is my favorite part. Because the reverse direction is direct (projecting 3D keypoints back to 2D ones, given camera parameters), the authors propose to train with a reconstruction-like loss, with the 2D->3D model as the encoding part. Naive reconstruction would give a trivial solution, with the 2D->3D model simply outputting its input, so they also introduce a bone length L2 loss that matches the lengths between joints in the predictions to those in the labeled 3D batch.
  • Result: They achieved SOTA on Human3.6M and HumanEva-I. The semi-supervised learning method is even more amazing. The ablation shows the bone length loss term is very important.
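
To make the back-projection idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the 4-joint skeleton in BONES, the distortion-free pinhole camera in project, and all function names are my own assumptions for illustration. The real setup also has to handle things like the global position of the person and a more detailed camera model, which I skip here.

```python
import math

# Hypothetical 4-joint skeleton: (parent, child) index pairs defining bones.
BONES = [(0, 1), (1, 2), (2, 3)]

def project(pose_3d, focal=1.0, center=(0.0, 0.0)):
    """Pinhole projection of camera-space joints (x, y, z) to 2D (u, v).
    Assumption: no lens distortion, z > 0 for every joint."""
    return [(focal * x / z + center[0], focal * y / z + center[1])
            for x, y, z in pose_3d]

def reprojection_loss(pred_3d, input_2d, focal=1.0, center=(0.0, 0.0)):
    """Mean Euclidean distance between back-projected and input 2D joints."""
    proj = project(pred_3d, focal, center)
    return sum(math.dist(p, q) for p, q in zip(proj, input_2d)) / len(input_2d)

def mean_bone_lengths(poses_3d):
    """Average length of each bone over a batch of 3D poses."""
    return [sum(math.dist(pose[a], pose[b]) for pose in poses_3d) / len(poses_3d)
            for a, b in BONES]

def bone_length_loss(unlabeled_pred_3d, labeled_3d):
    """L2 distance between the mean bone lengths of the two batches."""
    u = mean_bone_lengths(unlabeled_pred_3d)
    l = mean_bone_lengths(labeled_3d)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, l)))

# A labeled pose and a "prediction" that is the same pose scaled by 2.
labeled = [(0.0, 0.0, 2.0), (0.0, 1.0, 2.0), (1.0, 1.0, 2.0), (1.0, 1.0, 4.0)]
scaled = [(2 * x, 2 * y, 2 * z) for x, y, z in labeled]
input_2d = project(labeled)

reproj = reprojection_loss(scaled, input_2d)
bone = bone_length_loss([scaled], [labeled])
```

In this toy case the scaled pose projects to exactly the same 2D points, so the reprojection term alone cannot tell it apart from the labeled pose, while the bone length term is clearly non-zero. That is one geometric reason a bone length constraint is a useful companion to the reprojection loss.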

Overflowed notes: This paper attracts me with its contribution in connecting two tasks: 2D and 3D keypoint detection in videos. There usually isn't enough information to infer 3D from 2D in a single frame, but with the help of the temporal dimension, i.e. the tracking, we are able to reconstruct the 3D keypoints very well. The conjugate task is essentially estimating the camera parameters. I guess if the 2D->3D training could be performed end-to-end, it would also benefit the 2D detector through task-augmentation (this is a word I made up btw).
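
For a rough feel of how much temporal context a dilated stack can see, here is a small sketch in plain Python. The function names are mine, and the configuration in the example (kernel width 3 with dilations 1, 3, 9, 27, 81) is an assumption chosen for illustration rather than something stated in the notes above. It works on scalar frames only and is not a trained model.

```python
def receptive_field(kernel_size, dilations):
    """Number of input frames seen by one output of a dilated 1D conv stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def replicate_pad(frames, pad):
    """Pad a sequence by repeating its first and last frames (not zeros)."""
    return [frames[0]] * pad + list(frames) + [frames[-1]] * pad

def dilated_conv1d(frames, weights, dilation):
    """Unpadded ('valid') dilated 1D cross-correlation over scalar frames."""
    span = (len(weights) - 1) * dilation
    return [sum(w * frames[t + k * dilation] for k, w in enumerate(weights))
            for t in range(len(frames) - span)]

# Assumed configuration: kernel width 3, dilation growing by 3x per layer.
frames_seen = receptive_field(3, [1, 3, 9, 27, 81])

# Replicate padding keeps the output the same length as the input
# without injecting artificial zeros at the sequence boundaries.
padded = replicate_pad([1, 2, 3], 1)
smoothed = dilated_conv1d(padded, [1, 1, 1], dilation=1)
```

With this assumed configuration a single output frame depends on a couple of hundred input frames, which is where the extra information for resolving the single-frame ambiguity comes from.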