
Human Pose Estimation in Videos


In this paper, we present a method to estimate a sequence of human poses in unconstrained videos. In contrast to the commonly employed graph optimization framework, which is NP-hard and needs approximate solutions, we formulate this problem as a unified two-stage tree-based optimization problem for which an efficient and exact solution exists. Although the proposed method finds an exact solution, it does not sacrifice the ability to model the spatial and temporal constraints between body parts in the video frames; indeed, it models the symmetric parts better than existing methods. The proposed method is based on two main ideas, 'Abstraction' and 'Association', which enforce the intra- and inter-frame body part constraints respectively without adding computational complexity to the polynomial-time solution. Using the idea of 'Abstraction', a new concept of 'abstract body part' is introduced to model not only the tree-based body part structure, as in existing methods, but also extra constraints between symmetric parts. Using the idea of 'Association', optimal tracklets are generated for each abstract body part in order to enforce the spatiotemporal constraints between body parts in adjacent frames. Finally, a sequence of the best poses is inferred from the abstract body part tracklets through the tree-based optimization. We evaluated the proposed method on three publicly available video-based human pose estimation datasets and obtained dramatically improved performance compared to state-of-the-art methods.

Human pose estimation is crucial for many computer vision applications, including human-computer interaction, activity recognition, and video surveillance. It is a very challenging problem due to large appearance variance, the non-rigidity of the human body, different viewpoints, cluttered backgrounds, self-occlusion, etc. Recently, significant progress has been made on human pose estimation in unconstrained single images; however, human pose estimation in videos is a relatively new and challenging problem that still needs significant improvement. A single-image pose estimation method can be applied to each video frame to get an initial pose estimate, and a further refinement across frames can then make the estimates consistent and more accurate. However, due to the innate complexity of video data, the problem formulations of most video-based human pose estimation methods are very complex (usually NP-hard), so they are solved approximately, which yields sub-optimal solutions. Furthermore, most existing methods model body parts as a tree structure and tend to suffer from double-counting issues (that is, symmetric parts such as the left and right ankles are easily confused with each other). In this paper, we formulate the video-based human pose estimation problem in a different manner, which makes the problem solvable exactly in polynomial time and also effectively enforces the spatiotemporal constraints between body parts (which handles the double-counting issues).

Figure 1: Intuitions

Figure 2: Symmetric Body Parts


We propose two key ideas to tackle this issue, which approximate the original fully connected model by a simplified tree-based model. The first idea is Abstraction: in contrast to the standard tree representation of body parts, we introduce a new concept, abstract body parts, to conceptually combine the symmetric body parts (please refer to Figure 2). This way, we take advantage of the symmetric nature of the human body parts without introducing simple cycles into the formulation. The second idea is Association, with which we generate optimal tracklets for each abstract body part to ensure temporal consistency. Since each abstract body part is processed separately, this does not introduce any temporal simple cycles into the graph.
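To make the two ideas concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions and not the formulation used in the paper: the candidate format `(x, y, score)`, the `min_separation` filter, the additive unary score, the distance-based smoothness penalty, and the function names `abstract_states` and `best_tracklet` are all hypothetical. It combines the candidates of a symmetric pair (for example, left and right ankle) into joint states, then picks the best chain of joint states across frames with Viterbi-style dynamic programming, which is exact on a chain.

```python
import math
from itertools import product

# A candidate is a hypothetical (x, y, score) detection for one body part.
# An abstract-part state is a (left_candidate, right_candidate) pair.

def abstract_states(left, right, min_separation=5.0):
    """Pair left/right candidates, dropping pairs that sit on the same spot.

    Excluding near-coincident pairs is one simple way (an assumption here)
    to stop both symmetric parts from claiming the same image evidence.
    """
    return [(l, r) for l, r in product(left, right)
            if math.dist(l[:2], r[:2]) >= min_separation]

def best_tracklet(frames, smoothness=0.1):
    """Return the highest-scoring sequence of states, one per frame.

    frames[t] is the list of abstract-part states available in frame t.
    """
    def unary(s):
        # Detection evidence: sum of the two candidates' scores.
        return s[0][2] + s[1][2]

    def pairwise(a, b):
        # Temporal consistency: penalise motion of both parts between frames.
        moved = math.dist(a[0][:2], b[0][:2]) + math.dist(a[1][:2], b[1][:2])
        return -smoothness * moved

    best = [unary(s) for s in frames[0]]
    back = []  # back[t][j] = index of the best predecessor of state j
    for prev, cur in zip(frames, frames[1:]):
        new_best, ptr = [], []
        for s in cur:
            scores = [best[i] + pairwise(p, s) for i, p in enumerate(prev)]
            i_star = max(range(len(prev)), key=scores.__getitem__)
            new_best.append(scores[i_star] + unary(s))
            ptr.append(i_star)
        best = new_best
        back.append(ptr)

    # Backtrack from the best final state.
    idx = max(range(len(best)), key=best.__getitem__)
    path = [idx]
    for ptr in reversed(back):
        idx = ptr[idx]
        path.append(idx)
    path.reverse()
    return [frames[t][i] for t, i in enumerate(path)]
```

In a toy case where both ankle detectors respond most strongly at the same location in one frame, choosing each part independently would place both ankles on that location. With joint states, that pairing is never a candidate, and the smoothness term keeps each ankle close to its position in the previous frame.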

Figure 3: The framework


We show results on three publicly available datasets.

Figure 4: Results on Three Datasets

Code and Datasets


Related Publication

Dong Zhang, Mubarak Shah, Human Pose Estimation in Videos, International Conference on Computer Vision (ICCV) 2015, Santiago, Chile, Dec. 11-18, 2015.