Human Pose Estimation

VAISHNAVI JAMDAR
11 min read · Jun 1, 2021

Take a look at how Human Pose Estimation works.

What exactly is Human Pose Estimation?

Human pose estimation is a long-standing problem in computer vision. Imagine detecting an individual’s every little gesture in real time and running a biomechanical analysis on it: the technology would have far-reaching implications, with possible applications in video monitoring, assisted living, advanced driver assistance systems (ADAS), and sports analysis. Concretely, human pose estimation refers to a family of computer vision techniques for estimating the major keypoints on the human body (knees, hips, elbows, neck, shoulders, and feet) from 2D images and videos. These keypoints support a variety of estimates: the body’s posture (lying down, stretching), its placement in a scene, its movement, and even the activity being performed.

Why this blog?

Human pose estimation is a significant subject that has captivated the computer vision community for several decades. However, most of the literature on pose estimation (research papers and blogs) is rather complex, making it challenging for someone new to get up to speed. This blog aims to give you a basic grasp of pose estimation and perhaps spark your interest in the topic. Anyone with no prior exposure to the field can follow along; a fundamental background in computer vision is enough to properly comprehend the blog. We hope it encourages more developers and makers to try pose estimation in their own projects.

Problem Definition

The following axes can be used to characterize the problem statement:

The Number of People Being Tracked

Pose estimation is categorized as single-person or multi-person based on the number of individuals to be identified in an image. Single-person pose estimation is substantially easier, since the pose of only one person is estimated, even though the input image may contain (and generally does contain) more than one person. Multi-person pose estimation, by contrast, determines the poses of all individuals present in the image, and must additionally address the problem of inter-person occlusion. With the introduction of deep-learning-based architectures and the availability of large-scale datasets such as MPII, which covers both single- and multi-person poses, multi-person pose estimation (MPPE) has recently gained more and more attention.

Fig 1: Pose estimation is divided into two categories: (a) single-person pose estimation and (b) multi-person pose estimation.

Input Modality

Modality refers to the various forms of input available. Ranked by ease of availability, the three most common types of input are:

1. Red-Green-Blue (RGB) image: The images we see around us on a daily basis, and therefore the most typical input for pose estimation. Models working on RGB-only input have a large advantage over others in terms of input availability: ordinary cameras (which capture RGB images) are everywhere, so these models can be used across a huge range of devices.

2. Depth image: In a depth image, also called a “time-of-flight” image, the value of each pixel encodes its distance from the camera, as measured by time of flight. The introduction and popularity of low-cost devices like the Microsoft Kinect have made it effortless to obtain depth data. Depth images can also complement RGB images to create more elaborate and precise computer vision models, while depth-only models are widely used where privacy is a concern.

3. Infrared (IR) image: In an IR image, the value of a pixel is determined by the amount of infrared light reflected back to the camera. Experimentation in computer vision based on IR images is minimal compared to RGB and depth images. The Microsoft Kinect also records IR images; however, at present there are no pose estimation datasets that include IR images.

Fig 2: RGB image (Left), Depth image (Center), Infrared image (Right)

Static Image vs Video

A video is nothing but a sequence of images, where every two successive frames share an enormous portion of their content (which is the basis of most video compression techniques). This temporal dependence in videos can be exploited while performing pose estimation.

For a video, a sequence of poses needs to be produced, one for each frame of the input. The estimated poses should ideally be consistent across consecutive frames, and the algorithm needs to be computationally efficient enough to handle a huge number of frames. On the other hand, the problem of occlusion is arguably easier to solve for video, thanks to nearby past or future frames in which the occluded body part is visible.

If temporal characteristics are not part of the pipeline, it is feasible to apply static pose estimation to each frame of a video independently. However, the results are typically not as good as desired due to jitter and inconsistency across frames.

Fig 3: Removing jitter from the video
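One simple and common remedy for this jitter is to smooth the per-frame predictions over time. Below is a minimal sketch, assuming keypoints are stored as per-frame (x, y) arrays; the window size is an arbitrary choice, and real systems often use more sophisticated filters (e.g. a Kalman filter).

```python
import numpy as np

def smooth_keypoints(keypoints, window=5):
    """Reduce frame-to-frame jitter with a simple moving average.

    keypoints: array of shape (num_frames, num_joints, 2) holding
    the per-frame (x, y) predictions of a static pose estimator.
    """
    half = window // 2
    smoothed = np.copy(keypoints)
    for t in range(len(keypoints)):
        lo, hi = max(0, t - half), min(len(keypoints), t + half + 1)
        smoothed[t] = keypoints[lo:hi].mean(axis=0)  # average a short window
    return smoothed

# Example: 100 frames, 17 joints (COCO-style), 2D pixel coordinates
poses = np.random.rand(100, 17, 2) * 640
stable_poses = smooth_keypoints(poses, window=5)
```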

2-D vs 3-D Estimation

Depending on the required output dimension, the pose estimation problem can be categorized into 2D pose estimation and 3D pose estimation. 2D pose estimation predicts the location of body joints within the image (in terms of pixel values). 3D pose estimation, by contrast, predicts the three-dimensional arrangement of all the body joints as its final output.

Fig 4: 2D vs 3D estimation of human pose
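The difference is easiest to see in the shape of the output. A small sketch (the joint names and coordinate values are invented for illustration):

```python
import numpy as np

# 2D estimation: one (x, y) pixel location per joint
pose_2d = np.array([
    [320.0, 180.0],  # e.g. nose, in pixels
    [310.0, 240.0],  # e.g. neck
])

# 3D estimation: one (x, y, z) location per joint, typically in
# millimetres in camera or world coordinates
pose_3d = np.array([
    [12.5, -310.2, 4205.0],
    [10.1, -250.7, 4190.3],
])
```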

Body Model

Every pose estimation model agrees upon a body model beforehand. This simplifies the task to estimating the parameters of that body model. The N-joint rigid kinematic skeleton model is most commonly used as the final output, with N typically in the range of 13 to 30. Kinematic models are represented as graphs, where each vertex V represents a joint and the edges E encode constraints about the structure of the body model.
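To make the graph description concrete, here is a toy kinematic model as plain Python data; the joint names and the choice of N = 13 are illustrative, not taken from any specific dataset.

```python
# Vertices are joints; edges encode which joints are connected.
JOINTS = [
    "head", "neck", "l_shoulder", "r_shoulder", "l_elbow", "r_elbow",
    "l_wrist", "r_wrist", "l_hip", "r_hip", "l_knee", "r_knee", "pelvis",
]
EDGES = [
    ("head", "neck"), ("neck", "l_shoulder"), ("neck", "r_shoulder"),
    ("l_shoulder", "l_elbow"), ("l_elbow", "l_wrist"),
    ("r_shoulder", "r_elbow"), ("r_elbow", "r_wrist"),
    ("neck", "pelvis"), ("pelvis", "l_hip"), ("pelvis", "r_hip"),
    ("l_hip", "l_knee"), ("r_hip", "r_knee"),
]
# A pose estimate then reduces to one coordinate per joint; the edges
# constrain which joint pairs should have plausible bone lengths.
```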

A shape-based body model is another type of body model, in which human body parts are approximated using geometric shapes like rectangles, cylinders, and cones.

Fig 5: Body Model

Pose Estimation Pipeline

Pre-processing

Background removal: required for segmenting humans from the background, or for removing some noise (a minimal sketch of this step appears after these items).

Bounding box creation: some algorithms, especially in MPPE, create a bounding box for every human present in the image. Each bounding box is then evaluated separately for human pose.

Camera calibration and image registration: image registration is required when inputs from multiple cameras are used. In the case of 3D human pose estimation, camera calibration helps in converting the recorded ground truth into standard world coordinates.
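As an illustration of the background-removal step, here is a minimal OpenCV sketch using the built-in MOG2 background subtractor; the video path and parameter values are placeholders, not settings from any particular pipeline.

```python
import cv2

cap = cv2.VideoCapture("input_video.mp4")  # placeholder path
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)  # per-pixel foreground mask
    foreground = cv2.bitwise_and(frame, frame, mask=mask)
    # `foreground` isolates moving people from a static background and
    # could be passed downstream to the pose estimator.
cap.release()
```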

Feature Extraction

Feature extraction in machine learning is the derivation of values from raw data (here, images or videos) that can be used as input to a learning algorithm. There are two types of features, implicit and explicit. Conventional computer vision features like the Histogram of Oriented Gradients (HoG) and the Scale-Invariant Feature Transform (SIFT) are explicit features: they are calculated explicitly before the input is fed to the subsequent learning algorithm.

Fig 6: Feature Extraction

Implicit features refer to deep-learning-based feature maps, such as the outputs of deep convolutional neural networks (CNNs). These feature maps are never created explicitly but are part of a complete pipeline trained end-to-end.
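For a concrete example of the explicit kind, here is a minimal sketch computing HoG descriptors with scikit-image; the file name is a placeholder, and the parameter values are common library defaults rather than settings from any particular paper.

```python
from skimage import io
from skimage.feature import hog

image = io.imread("person.jpg", as_gray=True)  # placeholder file name
features, hog_image = hog(
    image,
    orientations=9,          # gradient-direction bins per cell
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    visualize=True,          # also return a visualization of the descriptor
)
# `features` is a flat vector that a classical learner (e.g. an SVM)
# can consume; a CNN would instead learn such maps implicitly.
```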

Inference

Confidence Maps:

A standard way to predict joint locations is to generate a confidence map for every joint. A confidence map is a probability distribution over the image, representing the confidence of the joint’s location at every pixel.

Fig 7. Confidence map examples
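A minimal sketch of what such a map looks like, assuming the common convention of a 2D Gaussian centred on the joint location (the map size and sigma here are arbitrary):

```python
import numpy as np

def joint_heatmap(height, width, cx, cy, sigma=2.0):
    """Confidence map for one joint: a 2D Gaussian centred on (cx, cy)."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

heatmap = joint_heatmap(64, 48, cx=20, cy=30)

# At inference time, the predicted joint is read off as the argmax:
y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
print(y, x)  # 30 20
```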

Bottom-Up Approach: Bottom-up methods first find all the body parts or joints of the one or more people in the image, and then assemble those parts into individual poses by associating each joint with a person.

In simple terms, the algorithm first predicts all the body parts/joints present in the image. This is followed by the formation of a graph, based on the body model, which connects the joints belonging to the same person. Integer linear programming (ILP) and bipartite graph matching are two common ways to build this graph.

Fig 8. Bottom-Up Approach
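A toy sketch of the grouping step using bipartite matching: given detected elbows and wrists, associate each elbow with its most compatible wrist. Real systems score candidate pairs with learned affinities (for example, part affinity fields); plain pixel distance stands in for that score here, and the coordinates are invented.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

elbows = np.array([[100.0, 200.0], [400.0, 210.0]])  # detected joints
wrists = np.array([[410.0, 260.0], [95.0, 255.0]])

# Cost matrix: distance between every elbow/wrist candidate pair
cost = np.linalg.norm(elbows[:, None, :] - wrists[None, :, :], axis=-1)
rows, cols = linear_sum_assignment(cost)  # optimal bipartite matching
for e, w in zip(rows, cols):
    print(f"elbow {e} -> wrist {w}")  # elbow 0 -> wrist 1, elbow 1 -> wrist 0
```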

Top-Down Approach: Top-down methods start with a person detection step, in which each person is first localized within a bounding box, followed by single-person pose estimation performed on each bounding box.

The per-box pose estimation step can be divided into generative, body-model-based approaches and deep learning approaches. Body-model-based methods try to fit the body model to the image, which guarantees that the final prediction resembles a human. Deep learning methods predict joint locations directly, so the final prediction carries no guarantee of human-like structure.

Fig 9. Top-Down Approach
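The control flow of a top-down pipeline is short enough to sketch. `detect_people` and `estimate_single_pose` below are hypothetical stand-ins for a person detector and a single-person pose network; only the structure is the point.

```python
def top_down_pose(image, detect_people, estimate_single_pose):
    """Detect each person, then run single-person estimation per box."""
    poses = []
    for (x0, y0, x1, y1) in detect_people(image):  # one box per person
        crop = image[y0:y1, x0:x1]
        keypoints = estimate_single_pose(crop)  # joints in crop coordinates
        keypoints[:, 0] += x0  # shift x back to full-image coordinates
        keypoints[:, 1] += y0  # shift y back to full-image coordinates
        poses.append(keypoints)
    return poses
```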

Post-processing

Many algorithms, both bottom-up and top-down, place no anatomical constraints on the final output. To put it in layman’s terms, an algorithm that predicts joint positions from an input image has no filter for rejecting or correcting predictions that are not physically natural. This can sometimes lead to bizarre human pose estimates.

Fig 10. Pose estimation using Kinect, showing a weird and unnatural pose
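A post-processing filter can catch some of these failures. Below is a sketch of an anatomical sanity check based on relative bone lengths; the thresholds are arbitrary placeholders, not values from any published method.

```python
import numpy as np

def is_plausible(joints, edges, min_ratio=0.25, max_ratio=4.0):
    """Reject a pose whose bone lengths are wildly inconsistent."""
    lengths = np.array([np.linalg.norm(joints[a] - joints[b])
                        for a, b in edges])
    ratios = lengths / (np.median(lengths) + 1e-9)
    return bool(np.all((ratios > min_ratio) & (ratios < max_ratio)))

# Example with a 4-joint chain (indices into the joints array)
edges = [(0, 1), (1, 2), (2, 3)]
joints = np.array([[0.0, 0.0], [0.0, 50.0], [0.0, 100.0], [0.0, 500.0]])
print(is_plausible(joints, edges))  # False: the 400-px bone is an outlier
```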

Datasets

In Human Pose Estimation, some common datasets used are:

  • MPII: This is a multi-person 2D pose estimation dataset, comprising images of hundreds of human activities collected from YouTube videos. It was the first dataset to contain such a diverse range of poses, and the first to launch a 2D pose estimation challenge.
  • COCO: The largest 2D pose estimation dataset, and considered the benchmark for testing 2D pose estimation algorithms. Its images are collected from Flickr (a minimal annotation-loading sketch follows this list).
  • HumanEva: This is a single-person 3D pose estimation dataset, and the first 3D pose estimation dataset of substantial size. It contains video sequences recorded using multiple RGB and grayscale cameras; ground-truth 3D poses are captured with marker-based motion capture (mocap) cameras.
  • Human3.6M: This is a single-person 2D/3D pose estimation dataset. It contains video sequences of 11 actors performing 15 different activities, recorded using RGB and depth cameras, with 3D poses captured by 10 mocap cameras. Human3.6M is, to date, the largest 3D pose estimation dataset of real recorded footage.
  • SURREAL: A single-person 2D/3D pose estimation dataset containing synthetic video animations created from mocap data recorded in the lab. It is the biggest 3D pose estimation dataset overall, but is not considered a benchmark for pose estimation algorithms because of its synthetic nature.
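As a practical pointer, here is a minimal sketch of reading COCO keypoint annotations with the pycocotools package; the annotation path is a placeholder and assumes the dataset has already been downloaded.

```python
from pycocotools.coco import COCO

coco = COCO("annotations/person_keypoints_val2017.json")  # placeholder path
person_cat = coco.getCatIds(catNms=["person"])
img_ids = coco.getImgIds(catIds=person_cat)

ann_ids = coco.getAnnIds(imgIds=img_ids[0], iscrowd=False)
for ann in coco.loadAnns(ann_ids):
    # Keypoints come as flat [x1, y1, v1, x2, y2, v2, ...] triplets, where
    # v is a visibility flag (0: not labeled, 1: occluded, 2: visible).
    print(ann["num_keypoints"], ann["keypoints"][:9])
```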

Use cases and applications

We’ll go over several real-world use cases for pose estimation in this section and also explore the possible impact of this computer vision approach across sectors.

We’ll look at how pose estimation is applied in the following areas:

Human movement and activity:

Pose estimation can be used to track and measure human movement, which is one of its most obvious applications. Tracking movement isn’t something that can be put into production in and of itself, but the applications built on top of it are dynamic and far-reaching with a little creative thinking. Consider an AI-powered personal trainer that operates by simply pointing a camera at a person conducting a workout and letting a human pose estimation model (trained on the specific poses relevant to a training regimen) determine whether or not a certain exercise has been executed correctly. This type of app could make home fitness programs safer and more inspiring, while simultaneously boosting accessibility and lowering the costs associated with professional physical trainers.

Fig 11: Human movement and activity

Experiences using augmented reality:

Pose estimation combined with augmented and virtual reality applications already provides consumers with a better online experience. Users can, for example, virtually learn how to play tennis from virtual coaches who are rendered through pose estimation. Pose estimation techniques can also be used in conjunction with augmented reality applications: the US Army, for example, is experimenting with augmented reality programs for use in combat. These programs are designed to aid soldiers in distinguishing between friendly and hostile forces, as well as improving night vision.

Fig 12: Experiences using augmented reality

Gaming & Animation:

Gaming has come a long way since developers used 8-bit graphics just a few years ago. The gaming industry has been transformed by technological breakthroughs such as facial recognition and virtual reality, and pose estimation has contributed as well. Previously, game designers had to animate characters manually, a time-consuming, costly, and exhausting process. Pose estimation, on the other hand, has gone a long way toward streamlining and automating animation: developers can simply capture motion in real time using computer vision technology. The ability to capture animations and map them onto the game’s characters automatically has improved the user experience in video games.

Fig 13: Gaming & Animation

Robotics:

Robotics has become an integral element of our daily life. In the manufacturing industry, particularly in China, companies have continued to employ robots at scale. Robots have traditionally relied on 2D vision systems to complete tasks. They were capable of doing a variety of jobs, but they had a number of drawbacks. For one thing, there were issues with mobility: it was particularly difficult to program a robot’s direction of movement. They also required extensive calibration procedures and were rigid and inflexible in response to environmental changes, so they needed to be reprogrammed on a regular basis. With the advent of 3D pose estimation, programmers were able to construct more flexible and accurate robots that respond better to environmental changes.

Fig 14: Robotics

Conclusion

Human Pose Estimation is an evolving discipline. In recent times, there has been a noticeable trend in Human Pose Estimation of moving towards the use of deep learning, specifically CNN-based approaches, due to their superior performance across tasks and datasets. One of the main reasons for the success of deep learning is the availability of large amounts of training data, especially with the advent of the COCO and Human3.6M datasets.
