
Personal AI Trainer - Body Pose Estimation and Repetition Counting (Part III)



In a previous article, I described a concept for a personal AI trainer that generates custom workout plans from voice-based user input, then monitors the user's execution of the workout and counts their repetitions. I already covered the conversational AI part of the project here, so this follow-up article covers the second necessary piece of the full project: pose estimation and automatic repetition counting. For pose estimation, I will briefly describe a neural network-based approach, and then continue with a custom algorithm that counts exercise repetitions.

The pre-trained model used in the article is available in the attachments. The source code and instructions on how to get started are available in the GitHub repository.


A visualisation of the output that the repetition counting program from this article generates.

2D Pose Estimation

A 2D pose estimator locates human skeletons in an image and allows us to query the locations of different body parts in the 2D space of the image. There are multiple methods to achieve this, but for this project, we'll be using a part affinity field method that allows real-time pose estimation. In this section, I will briefly explain how this method works. Although a rigorous understanding of the method is not required for this article, the algorithm we'll be using is described here.

The first stage of the algorithm passes the input image to a neural network. This network generates part confidence maps, which estimate the positions of different body parts. There is a separate confidence map for each part of the body. For example, in the image below two confidence maps are shown: one that highlights the left shoulders, and one for the left elbows.


Part confidence maps, generated from an input image. Adapted from
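To make the idea of a confidence map concrete, here is a minimal sketch of how candidate body part positions could be read out of one: keep every pixel that is above a threshold and larger than its direct neighbours. The function name `find_peaks`, the threshold value and the toy map are my own assumptions for illustration, not code from the repository.

```python
def find_peaks(conf_map, threshold=0.1):
    """Return (x, y, score) for each local maximum at or above threshold.

    conf_map is a 2D list of floats: one confidence map for one body part.
    """
    height, width = len(conf_map), len(conf_map[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            score = conf_map[y][x]
            if score < threshold:
                continue
            # Compare against the 4-connected neighbours that exist.
            neighbours = [
                conf_map[ny][nx]
                for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1))
                if 0 <= nx < width and 0 <= ny < height
            ]
            if all(score > n for n in neighbours):
                peaks.append((x, y, score))
    return peaks


# Toy "left shoulder" map with two people in the frame (made-up values).
left_shoulder_map = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.2, 0.8, 0.2],
]
peaks = find_peaks(left_shoulder_map, threshold=0.5)
print(peaks)  # one candidate per person
```

Each peak is only a candidate: with several people in the image, a single map yields several left shoulders, and nothing yet says which elbow belongs to which shoulder.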

So far, we know the locations of different body parts, but we need to figure out how they relate to one another. Evaluating every possible combination to find the best one is NP-hard and far too slow for real-time applications. That is why the input image is passed to a second neural network which generates a part affinity field for each limb. Essentially, a part affinity field stores a 2D vector at every pixel that lies on a limb, pointing along the limb from one of its body parts to the other (for example, from the shoulder towards the elbow). This is visualised below:


Part affinity fields, generated from an input image. Adapted from

The part affinity field provides a "shortcut" for the search algorithm, which can now greedily parse the body part positions from the part confidence maps in real time and generate 2D pose skeletons for the people in the image. Our implementation of the algorithm above has been adapted from this repository.


The part confidence maps and affinity fields are combined to generate the final pose estimate. Adapted from
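As a rough illustration of that greedy parsing step, the sketch below scores a candidate limb by sampling the part affinity field along the straight line between two candidate parts and averaging how well the field vectors align with that line. It then pairs candidates best score first. This is a simplified version under my own assumptions (the names `paf_score`, `greedy_match` and `toy_paf`, the sample count and the score threshold are all hypothetical), not the code from the repository.

```python
import math


def paf_score(paf, start, end, samples=10):
    """Average alignment between a PAF and the segment start -> end.

    paf(x, y) returns the field's 2D vector at that image position.
    """
    dx, dy = end[0] - start[0], end[1] - start[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length  # unit vector along the candidate limb
    total = 0.0
    for i in range(samples):
        t = i / (samples - 1)
        vx, vy = paf(start[0] + t * dx, start[1] + t * dy)
        total += vx * ux + vy * uy  # dot product: 1 means perfectly aligned
    return total / samples


def greedy_match(paf, starts, ends, min_score=0.5):
    """Pair start and end candidates greedily, highest score first."""
    scored = sorted(
        ((paf_score(paf, s, e), i, j)
         for i, s in enumerate(starts)
         for j, e in enumerate(ends)),
        reverse=True,
    )
    used_starts, used_ends, pairs = set(), set(), []
    for score, i, j in scored:
        if score < min_score or i in used_starts or j in used_ends:
            continue
        used_starts.add(i)
        used_ends.add(j)
        pairs.append((i, j))
    return sorted(pairs)


def toy_paf(x, y):
    """Made-up upper-arm field: two vertical arms at x=10 and x=50."""
    on_arm = (abs(x - 10) <= 3 or abs(x - 50) <= 3) and 10 <= y <= 30
    return (0.0, 1.0) if on_arm else (0.0, 0.0)


shoulders = [(10, 10), (50, 10)]
elbows = [(50, 30), (10, 30)]  # deliberately listed in the "wrong" order
pairs = greedy_match(toy_paf, shoulders, elbows)
print(pairs)  # (shoulder index, elbow index) for each matched arm
```

A line that crosses from one person's shoulder to the other person's elbow passes mostly through empty field, so its score stays low and the greedy pass rejects it.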

Automatic Repetition Counting

An exercise is defined by the periodic execution of repetitions. A single repetition begins at some starting body position, moves through a sequence of other positions, and ends back at the starting position. These changes in body position show up as changes in our joint angles. For example, when doing squats, we can observe a change in the angles of our knee joints.

The pose estimation algorithm gave us the locations of different body parts in the 2D space of the image. By picking three keypoints that span a joint (e.g. left ankle, left knee, left hip), we can estimate the angle formed at the middle point (here, the knee) using simple trigonometry.
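One way to do this calculation, as a minimal sketch: take the two vectors that run from the middle keypoint to the outer keypoints and measure the angle between them with `atan2`. The function name `joint_angle` and the pixel coordinates below are made up for illustration.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at point b, between the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    # atan2 of |cross| and dot gives an unsigned angle between 0 and 180.
    return math.degrees(math.atan2(abs(cross), dot))


# Image coordinates (x, y) in pixels, with y growing downwards.
hip, knee, ankle = (100, 200), (100, 300), (100, 400)
standing = joint_angle(hip, knee, ankle)  # straight leg, about 180 degrees

hip_squat = (200, 300)  # hip level with the knee
squatting = joint_angle(hip_squat, knee, ankle)  # about 90 degrees
print(standing, squatting)
```

Using `atan2` rather than `acos` avoids domain errors when rounding pushes the cosine slightly outside the range from -1 to 1.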

So we know how to estimate joint angles, and we know that the change in joint angles is what defines a single repetition. How do we use this to count repetitions? One simple strategy is to capture the initial joint angles at the starting position before the exercise begins. Then, at each video frame, we calculate the mean squared error (MSE) between the current joint angles and the initial joint angles. This gives us a single-value estimate of how "far" we are from the starting position.

An animation of the difference between the initial and current joint angles on both knees, and the corresponding MSE.
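The per-frame calculation can be sketched as follows. The function name `angle_mse` and the angle values are assumptions for illustration. The two lists hold the same joints in the same order, here the left and right knee.

```python
def angle_mse(current_angles, initial_angles):
    """Mean squared error between two equally long lists of joint angles."""
    if not initial_angles or len(current_angles) != len(initial_angles):
        raise ValueError("expected two non-empty lists of equal length")
    squared = [(c - i) ** 2 for c, i in zip(current_angles, initial_angles)]
    return sum(squared) / len(squared)


initial = [175.0, 170.0]  # both knees while standing, in degrees
bottom = [95.0, 100.0]    # both knees at the bottom of a squat

print(angle_mse(initial, initial))  # 0.0 at the starting position
print(angle_mse(bottom, initial))   # 5650.0 at the bottom of the squat
```

Because the differences are squared, the value is in degrees squared and grows quickly with distance from the starting pose, which is worth keeping in mind when choosing a threshold.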

Throughout a single repetition, we could expect the MSE to increase, reach a peak, and then go back to nearly zero. If we detect when the MSE crosses a certain threshold on its way down, we would know that the person doing the exercise is about to finish a repetition. This is visualised below:

An animation of the thresholded MSE. Counting the rising edges of the resulting square wave is equivalent to counting repetitions.

This simple threshold-crossing rule creates the square wave pattern above. Counting the rising edges in that wave gives us the number of repetitions.
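A minimal sketch of such a counter is shown below. It follows the "on its way down" rule described earlier: a repetition is counted each time the MSE returns to or below the threshold after having been above it. Counting the upward crossings instead would give the same total once the last repetition is complete. The function name `count_repetitions`, the threshold and the MSE values are assumptions for illustration.

```python
def count_repetitions(mse_values, threshold):
    """Count how often the MSE returns to the threshold from above it."""
    reps = 0
    above = False  # current level of the square wave
    for mse in mse_values:
        if mse > threshold:
            above = True
        elif above:
            # The MSE was above the threshold and has now come back down.
            reps += 1
            above = False
    return reps


# Made-up per-frame MSE values covering two squats.
mse_per_frame = [0, 200, 3000, 5000, 2500, 100, 50, 1800, 4800, 900, 20]
print(count_repetitions(mse_per_frame, threshold=1000))  # 2
```

With noisy pose estimates, the MSE may hover around the threshold and trigger extra counts. Using two thresholds (a higher one to arm the counter and a lower one to fire it) is one possible way to make the counter more robust.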


This article served as the second stepping stone towards a personal AI trainer by building a real-time repetition counting app. In the following article, we will integrate this work with the conversational AI apps to create an interactive personal AI trainer. Stay tuned!

I have a BEng in Electrical & Electronic Engineering and an MSc in Artificial Intelligence. I like to combine knowledge from both fields in my work. In short, I make computers walk, talk, hear, see and understand.
