Using AI to play videogames has been nothing new. For instance, OpenAI’s OpenAI Five famously defeated the world champions of “Dota 2”, a team based MOBA game, in 2018. However, typically, videogame AIs use something called ‘Reinforcement Learning’. Essentially, each action the AI takes is given a certain reward. For instance, if the AI kills an enemy, it will be rewarded as it is a desirable action. The agent (AI) seeks to maximize its rewards, and thereby producing a set of actions that will allow it to “play” the game.

However, one major downside of such an approach is the difficult nature of acquiring input information. In most cases, the AI must have access to the game’s internal states, often via an external API or be hard-coded into the game itself. Unlike humans, who interpret the information using our eyes, the AI needs to be told exactly what those information are. Using Computer Vision is not exactly easy either, due to the computationally costly nature using technologies such as object detection or OCR to recognize information from a screencap.

This begs the question, could we instead turn this problem into a classification task? Humans provide footage of gameplay, categorized by actions (such as pressing a certain key), and the AI simply has to recognize the pattern behind each action.

The project source code can be found here.

Here is a demo of the project on Subway Surfers

Video

What is CNNs and how is it applicable

CNN, or Convolutional Neural Network, is a neural network architecture specifically designed for image recognition tasks, inspired by the biological features of our very own eyes. It works by looking at small sections of the image, called “kernel”, as opposed to the entire image at once. The benefit of doing this is that, instead of “memorizing” the image, CNNs can actually identify sub-features of an image for better recognition. Sub-features include things like edges, surfaces etc, which will be important to us later on.

In our case, we simply need to provide plenty of footage (more the better), and label each frame with a specific action. For example, in Subway Surfers (a parkour game), possible actions include going left, right, jumping or ducking. The idea is that when we, say, go left because there is an obstacle in front of us, the AI will hopefully learn this pattern and will perform a similar action when faced with a similar circumstance.

In this case, we will use RESNET18, which is a pre-trained CNN that is 18 layers deep, hence “18”. Using pre-trained CNNs has the benefit of retaining their capabilities of identifying basic features, such as edges/surfaces, which is contained in the earlier layers of the network, and only needing to retrain the latter “higher level” layers responsible for gathering the higher level features. This will not only save us massive computational costs (compared to training a brand new network) but also provide higher accuracy and faster recognition.

What does the AI actually “see”?

As you may or may not know, color in computers is stored as RGB values, or Red Green Blue values. Each color pixel is represented by a mixture of these 3 primary colors, hence giving us all the wonderful colors we can see. However, for a CNN, color image is not good news, as it requires more computational power. For a black and white image, one only needs to store the brightness of each pixel (from black to white), as opposed to the three (RGB) values required for each pixel for a color image, and more information means longer time to process for our AI.

However, in the real world, often times color does make a massive difference in terms of image recognition. A different color can sometimes mean an entirely new object. However, in our case (and in a lot of videogames), which is Subway Surfers, color does not matter. We care more about the shape of objects, much more than the colorful decoration. So, the most obvious thing for us to do is to make the image black and white.

But we don’t have to stop here. There are a lot of unnecessary information in the image that are not strictly relevant to the ability to play the game. In our case, things like graffiti on the train are merely decoration, and their existence generates unnecessary pixel information for our network to process. In fact, we only really need to capture the shape of each train carriage or obstacle in order to play the game. How do we do that? In comes Canny edge detection algorithm.

What is Canny edge detection algorithm

The canny algorithm is just as its name, it can detect edges in an image. For example, here is an image processed using the Canny algorithm:

Canny

Without going into a 20-page essay on the algorithm, we can in fact fine tune this algorithm on the amount of detail it captures using two parameters, the lower and upper threshold. In general, the smaller the difference between the two thresholds, the less information is captured.

So now we face an interesting dilemma. Obviously, less detail in the picture means better and faster processing, which is crucial for such a fast paced game like Subway Surfers. However, there isn’t really much point if all the AI can see is a black picture with a few lines here and there. The idea is to capture just enough information for the AI to make an informed decision, but not too much information in order to speed up the decision-making time of the AI.

Training and result

With the image processing sorted, all we have to do is provide some training footage for the AI to learn from. If you want to check out the scripts I used to capture image and categorize them, I have neatly packaged my code into a easy-to-use library with a single Class that performs all the necessary functions to train the AI to play your favorite game, project link here.

Of course, the results will strongly depend on the difficulty of the game in question. In my case, the AI was able to perform basic evasive maneuvers just after around 20 recorded games. It starts to show signs of strategy (such as collecting powerups and coins) after around 40 recorded games, and advanced strategy (such as choosing empty lanes to avoid dead-ends) after roughly 50 or so games.

The performance of the network is quite decent. On average, the network had a 85% - 90% validation accuracy, and it takes around 70 - 90 milliseconds to process and classify each individual image. In the end, the AI managed to achieve, on average, 3000 points, which is a respectable score when I can typically only achieve 15,000, taking into the processing delay of the AI.

Conclusion

Whilst reinforcement learning has become the ubiquitous and “gold standard” way of training AI to play games, this project demonstrates that there are often more than one way to approach a problem.