Normalized cross-correlation with alpha-masked templates for object detection
How to recognize objects in a 2D game using a classical computer vision approach
Methods based on deep learning are the most popular for object recognition today. A good deep learning model can find an object at different scales and angles, under different lighting conditions, and even when it is partially overlapped by other objects.
To create such a model, you need a set of source data and an augmentation algorithm that can generate training examples from this data, as similar as possible to those the program will see in reality. The more accurately the model should work, the stricter the requirements for the training examples and the harder they are to create.
Deep learning is a powerful approach to object recognition, but for some tasks it is overkill, and it is easier to use classical computer vision techniques.
In this tutorial, we will look at how to recognize a 2D computer game character with template matching, using SuperTux as an example.
Template matching is one of the methods for detecting objects in an image. It involves moving the sample (template) around the image and calculating a measure of similarity between the sample and the part of the image beneath it. If the similarity is greater than a selected threshold, we assume that this part of the image contains the object.
There are many ways to calculate image similarity. We will consider normalized cross-correlation, which is calculated using the formula:
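With I, t, and the mask as defined below, and Ī(x, y), t̄ denoting the mean brightness over the mask, the masked normalized cross-correlation can be written (reconstructed here in its standard form, using m(x′, y′) as the mask indicator) as:

```latex
R(x, y) =
\frac{\sum_{x', y'} m(x', y')\,\bigl(I(x + x', y + y') - \bar{I}(x, y)\bigr)\,\bigl(t(x', y') - \bar{t}\bigr)}
     {\sqrt{\sum_{x', y'} m(x', y')\,\bigl(I(x + x', y + y') - \bar{I}(x, y)\bigr)^{2}\;
            \sum_{x', y'} m(x', y')\,\bigl(t(x', y') - \bar{t}\bigr)^{2}}}
```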
Normalized cross-correlation works well even in difficult conditions, when the image brightness changes or there is interference. Let’s see how it can be implemented.
Implementing normalized cross-correlation with linear filtering
As a basis, let’s take the filter2D function from the OpenCV library, which allows us to calculate expressions of the form:
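With K denoting the filter kernel, this is the standard linear-filtering (correlation) expression:

```latex
G(x, y) = \sum_{x', y'} K(x', y')\, I(x + x', y + y')
```

Note that filter2D computes a correlation, not a convolution: the kernel is not mirrored.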
This function uses the Fast Fourier Transform in its internal implementation (for sufficiently large kernels) and therefore has complexity O(N log N), which gives a significant performance gain compared to direct summation, whose complexity is O(MN), where N is the number of pixels in the image and M is the number of pixels in the template.
- I — image;
- t — template;
- M — template mask.
At the first stage, we calculate Ī(x, y), the average brightness of the image area within the template. We can obtain it by applying the template mask to the image as a linear filter. We then normalize the filtering result by the number of elements that participated in the summation, i.e. the number of non-zero mask elements (m(x′, y′) = 1 if the pixel belongs to the template, otherwise m(x′, y′) = 0):
Let’s code Ī(x, y) calculation:
The value of Ī in the normalized cross-correlation formula may differ for the same pixel of the source image: it depends on the window R(x, y) over which the summation is performed. Therefore we need to expand the brackets in both the numerator and the denominator.
After opening the brackets we get:
Let’s code the last expression:
After replacing the summation with filtering, we get the final expression of normalized cross-correlation:
The code for R(x, y) calculation is shown below.
The following variable substitution was used:
Let’s now apply our masked_normxcorr function to recognize Tux.
Making Tux recognition
First, make a template: cut Tux out of a screenshot and use an eraser to make the background pixels transparent.
Next, we load the grayscale Tux image and its mask (pixels whose opacity is greater than 50%). To avoid numerical errors during the calculation, we scale the pixel brightness to the interval [0, 1].
As a test image, we take another screenshot of the game, which we also convert to grayscale with brightness in [0, 1].
Let’s apply normalized cross-correlation to the test image and look at the result (a blur is used so that recognition works even when Tux differs slightly from the template, for example while he is walking).
We can see that in place of Tux there is a bright white spot: the peak of the normalized cross-correlation.
Now we need to apply threshold binarization to find the coordinates of our “object”. The threshold should be selected in such a way that, on the one hand, the Tux position is highlighted even if it differs from the template, and on the other hand, there are no false positives. I chose 0.74.
With this approach, we do not get a single point but a set of coordinates where Tux can be located. To remove duplicate detections, we implement non-maximum suppression within the size of the template using OpenCV’s dilate function.
Then, we extract the coordinates of the normalized cross-correlation peaks.
Now we can draw the recognized Tux locations and display the result on the screen.
The full source code can be downloaded here.