Predicting video game sales using Machine Learning in Python


Video games have become immensely popular over the past decade. The global games market in 2019 was estimated at $148.8 billion. In this article, you’ll learn how to implement a Machine Learning model that can predict the global sales of a video game depending on certain features such as its genre, critic reviews, and user reviews in Python.

Predicting video game sales using ML

As the global sales of a video game is a continuous quantity, we’ll have to implement a regression model. Regression is a form of supervised machine learning algorithm that can predict a target variable (which should be a continuous value) using a set of independent features. Some of the applications include salary forecasting, real estate predictions, etc.


You can download the dataset from kaggle. It contains 16719 observations/rows and 16 features/columns where the features include:

  • NA_Sales, EU_Sales, JP_Sales: Sales in North America, Europe and Japan (in millions).
  • Other_Sales: Sales in other parts of the world (in millions).
  • Global_Sales: Total worldwide sales (in millions).
  • Rating: The ESRB ratings.


Importing the dataset

# Importing the required libraries
import pandas as pd
import numpy as np

# Importing the dataset
dataset = pd.read_csv('Video_Games_Sales_as_at_22_Dec_2016.csv')

# Dropping certain less important features
dataset.drop(columns = ['Year_of_Release', 'Developer', 'Publisher', 'Platform'], inplace = True)

# To view the columns with missing values
print('Feature name || Total missing values')

We drop certain features in order to reduce the time required to train the model.



Feature name || Total missing values
Name               2
Genre              2
NA_Sales           0
EU_Sales           0
JP_Sales           0
Other_Sales        0
Global_Sales       0
Critic_Score    8582
Critic_Count    8582
User_Score      9129
User_Count      9129
Rating          6769

Splitting the dataset into Train & Test sets

X = dataset.iloc[:, :].values
X = np.delete(X, 6, 1)

y = dataset.iloc[:, 6:7].values

# Splitting the dataset into Train and Test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

# Saving name of the games in training and test set
games_in_training_set = X_train[:, 0]
games_in_test_set = X_test[:, 0]

# Dropping the column that contains the name of the games
X_train = X_train[:, 1:]
X_test = X_test[:, 1:]

Here, we initialize ‘X’ and ‘y’ where ‘X’ is the set of independent variables and ‘y’ the target variable i.e. the Global_Sales. The Global_Sales column which is present at index 6 in ‘X’ is removed using the np.delete() function before the dataset is split into training and test sets. We save the name of the games in a separate array named  ‘games_in_training_set’ and ‘games_in_test_set’ as these names will not be of much help when predicting the global sales.


Imputation in ML is a method of replacing the missing data with substituted values. Here, we’ll use the Imputer class from the scikit-learn library to impute the columns with missing values and to impute the columns with values of type string, we’ll use CategoricalImputer from sklearn_pandas and replace the missing values with ‘NA’ i.e. Not Available.

from sklearn.preprocessing import Imputer
imputer = Imputer(strategy = 'mean')
X_train[:, [5 ,6, 7, 8]] = imputer.fit_transform(X_train[:, [5, 6, 7, 8]])
X_test[:, [5 ,6, 7, 8]] = imputer.transform(X_test[:, [5, 6, 7, 8]])

from sklearn_pandas import CategoricalImputer
categorical_imputer = CategoricalImputer(strategy = 'constant', fill_value = 'NA')
X_train[:, [0, 9]] = categorical_imputer.fit_transform(X_train[:, [0, 9]])
X_test[:, [0, 9]] = categorical_imputer.transform(X_test[:, [0, 9]])


We encode the categorical columns of ‘X’ using ColumnTransformer and OneHotEncoder from the scikit-learn library. This will assign one separate column to each category present in a categorical column of ‘X’.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0, 9])], remainder = 'passthrough') 
X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)

Building the model

We’ll implement our model i.e. the regressor using XGBRegressor (where XGB stands for extreme gradient boosting). XGBoost is an ensemble machine learning algorithm based on decision trees similar to the RandomForest algorithm. However, unlike RandomForest that makes use of fully grown trees, XGBoost combines trees that are not too deep. Also, the number of trees combined in XGBoost is more in comparison to RandomForest. Ensemble algorithms effectively combine weak learners to produce a strong learner. XGBoost has additional features focused on performance and speed when compared to gradient boosting.

from xgboost import XGBRegressor
model = XGBRegressor(n_estimators = 200, learning_rate= 0.08), y_train)

Making predictions on the Test set

Global Sales i.e. the target variable ‘y’ for the games in the test set is predicted using the model.predict() method.

# Predicting test set results
y_pred = model.predict(X_test)

# Visualising actual and predicted sales
games_in_test_set = games_in_test_set.reshape(-1, 1)
y_pred = y_pred.reshape(-1, 1)
predictions = np.concatenate([games_in_test_set, y_pred, y_test], axis = 1)
predictions = pd.DataFrame(predictions, columns = ['Name', 'Predicted_Global_Sales', 'Actual_Global_Sales'])

First few rows of the ‘predictions’ data frame:


Evaluating model performance

We’ll use r2_score  and root mean squared error (RMSE) to evaluate the model performance where closer the r2_score is to 1 & lower the magnitude of RMSE, the better the model is.

from sklearn.metrics import r2_score, mean_squared_error
import math
r2_score = r2_score(y_test, y_pred)
rmse = math.sqrt(mean_squared_error(y_test, y_pred))

print(f"r2 score of the model : {r2_score:.3f}")
print(f"Root Mean Squared Error of the model : {rmse:.3f}")


r2 score of the model : 0.972
Root Mean Squared Error of the model : 0.242

As the r2_score is very close to 1, this indicates that the model is highly accurate. You can also try improving the model performance by tuning the hyperparameters of the XGBoost regressor.

Also read:

  1. Random forest for regression and its implementation
  2. Gradient Boosting with Scikit-learn

Leave a Reply

Your email address will not be published. Required fields are marked *