Building a framework to track tennis games at scale

I stepped away from the first iteration of this project last year when I realized I would have to annotate thousands of images to create a robust player/ball detection model. Over the next few months, I'll design and build that modelling framework. This post outlines the methods and services it will be built on.

Tennis TV
The ATP has a streaming service called Tennis TV, which offers full games and daily highlights for most tournaments on the tour. The highlights are of particular interest because they are condensed versions of several different games, offering a diversity of players, outfits, visibility, crowds, etc., which should make the model more robust.

To maximize the diversity of the labelled set, I downloaded the quarter-final highlight video of 15 tournaments. Each video contains highlights from four games and is roughly 8 minutes long, which at 25fps is 12,000 frames, or 3,000 frames per game. I clipped each video into one-minute segments, then took every fifth frame, keeping at most 25 frames per minute. That works out to roughly 200 frames per tournament (or 50 frames per game), for 3,000 frames in total. A highlight video consists of roughly 65-70% actual gameplay, with the other 30-35% being replays, audience pans, etc. The final modelling set is therefore roughly 2,000 frames.
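
For illustration, here's a minimal version of that sampling step using OpenCV. File names are placeholders, and this uniform sampler simply targets the same budget of 25 frames per minute rather than reproducing my exact clipping:

```
import cv2

def sample_frames(video_path, out_dir, fps=25, frames_per_minute=25):
    """Save a fixed number of frames per minute from a highlight video."""
    cap = cv2.VideoCapture(video_path)
    step = (fps * 60) // frames_per_minute  # keep one frame every `step` frames
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:05d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# e.g. sample_frames("qf_highlights.mp4", "frames")
# -> ~200 frames from an 8-minute video at 25fps
```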

Roboflow
Using Roboflow, I labelled the front player, back player, and tennis ball in each frame that featured them. Roboflow is free, slick, and saved me hours of labelling work through its label assist feature. Basically, I annotated a couple of games manually, trained a model on those games, then loaded the model into the label assist feature to automate much of the annotation for the remaining games. The dataset is publicly available.

Roboflow also has an API that plugs easily into Google Colab.
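
Pulling the labelled dataset into a notebook takes a few lines with the roboflow package. The API key, workspace, and project names below are placeholders:

```
# pip install roboflow
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("my-workspace").project("tennis-player-ball")
dataset = project.version(1).download("yolov5")  # writes a YOLOv5-formatted dataset + data.yaml
print(dataset.location)
```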

Google Colab
I'm currently using Google Colab to train the computer vision model. I've looked around, and for my current needs (research), it fits: it's easy to use, compatible with Roboflow, and offers a free-to-use GPU (though it disconnects after 90 minutes of inactivity). I'm using it to train a high-speed "You Only Look Once" object detection model.
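
From there, training follows the standard YOLOv5 recipe in a Colab cell (`dataset` is the object returned by the Roboflow download above); the image size, batch size, and epoch count here are illustrative defaults, not my tuned values:

```
# Colab cell: clone YOLOv5 and train on the Roboflow export.
!git clone https://github.com/ultralytics/yolov5
%cd yolov5
!pip install -r requirements.txt
!python train.py --img 640 --batch 16 --epochs 100 \
    --data {dataset.location}/data.yaml --weights yolov5s.pt
```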

YOLOv5
The object detection model I chose to start with is YOLOv5. It's incredibly fast: inference runs at around 100fps, which makes it scalable when it comes time to track an entire tournament of gameplay. It works fine for players, but it doesn't leverage the spatio-temporal nature of tennis. This is particularly problematic when detecting the tennis ball, which, depending on the context, can appear as a blurred line, blend in with the court lines, be hidden behind a player, etc.
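
Loading the trained weights for inference is a one-liner through torch.hub; 'best.pt' stands in for whatever checkpoint train.py saved:

```
import torch

# Load the custom-trained weights from the training run above.
model = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')

results = model('frame_00042.jpg')     # accepts paths, URLs, numpy arrays, ...
detections = results.pandas().xyxy[0]  # one row per box: xmin, ymin, xmax, ymax, confidence, class, name
print(detections[detections['name'] == 'tennis-ball'])
```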

Some results from the latest model iteration:

overall: 91.0% P, 85.5% R, 85.8 mAP50
front-player: 99.5% P, 99.6% R, 99.5 mAP50
back-player: 99.5% P, 99.4% R, 99.3 mAP50
tennis-ball: 74.1% P, 57.0% R, 58.7 mAP50

Video Example
For a short example of the inference, click here.

Game-State Classification
Because of the labelling framework, players are only labelled when the camera is in 'gameplay' mode (the long static shot of the court with one player in front of the net and another behind it). As a result, the player detection model fails to detect players in other shots, like commercials, time-outs, or even replays. This allows for an indirect game-state classification strategy: whenever the players aren't detected, the frame is assumed to be non-gameplay. I'll dedicate a post to refining this.
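
As a rough sketch of what that strategy could look like (the smoothing window below is an illustrative choice, not a tuned value):

```
import numpy as np

def classify_game_state(per_frame_detections, window=25):
    """Label each frame gameplay/non-gameplay from player detections.

    per_frame_detections: list of sets of class names detected in each frame.
    A frame is provisionally 'gameplay' when both players are detected;
    a rolling average over `window` frames smooths out detection flicker.
    """
    raw = np.array([
        {'front-player', 'back-player'} <= names
        for names in per_frame_detections
    ], dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.convolve(raw, kernel, mode='same')
    return smoothed > 0.5  # True = gameplay

# e.g. classify_game_state([{'front-player', 'back-player'}, set(), ...])
```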

Detecting Court Lines
Currently I'm still using the Hough Lines method laid out in a previous post. This might be replaced with YOLOv5's instance segmentation model, where the court would still be detected with a bounding box but a segmentation polygon would be saved alongside it.
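
For reference, this is the general shape of a Hough Lines pass in OpenCV; the Canny and Hough thresholds here are illustrative and need tuning per broadcast:

```
import cv2
import numpy as np

def detect_court_lines(frame):
    """Find candidate court lines with a Canny + probabilistic Hough pass.

    Thresholds are illustrative; real footage needs tuning (and usually a
    white-pixel mask, since court lines are painted white).
    """
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                            minLineLength=100, maxLineGap=10)
    return [] if lines is None else [l[0] for l in lines]  # (x1, y1, x2, y2)
```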

Post-Processing of Tennis Players
The model performs strongly, so a simple imputation strategy using interpolation has worked so far. I'll dedicate a post to this.
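
A minimal sketch of what that interpolation could look like with pandas; the column names and gap limit are illustrative:

```
import pandas as pd

def impute_player_track(track):
    """Fill short detection gaps in a player's box-centre track.

    track: DataFrame indexed by frame number with 'cx'/'cy' columns,
    NaN where the detector missed the player. Linear interpolation works
    here because the detector only drops a frame or two at a time.
    """
    return track.interpolate(method='linear', limit=5, limit_area='inside')

# e.g.
frames = pd.DataFrame(
    {'cx': [410.0, None, 418.0, 421.0], 'cy': [615.0, None, 611.0, 609.0]},
    index=[100, 101, 102, 103],
)
print(impute_player_track(frames))  # frame 101 filled by linear interpolation
```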

Post-Processing of Tennis Ball
to be continued