Machine Learning | Customer Churn Analysis Prediction

Hello folks!

In this article, we are going to see how to build a machine learning model for customer churn prediction. Customer churn means that a customer stops using a company's service. There are various machine learning algorithms, such as logistic regression and the decision tree classifier, that we can apply to this problem.

There are also various customer-churn datasets available online. For this article, we will use the Telco Customer Churn dataset from Kaggle.

This dataset contains both categorical and numerical features, so we will use the Pipeline from sklearn to preprocess them together and apply the Decision Tree Classifier learning algorithm to this problem.

Customer Churn Analysis Prediction Code in Python

We will write this code in Google Colab, which lets us upload the dataset directly from the browser. See the code below:

from google.colab import files
uploaded = files.upload()  # upload the Kaggle CSV from your machine
import pandas as pd
import io
df = pd.read_csv(io.BytesIO(uploaded['WA_Fn-UseC_-Telco-Customer-Churn.csv']))
df = df[~df.duplicated()]  # remove duplicate rows
# TotalCharges contains blank strings; drop those rows, then convert to numeric
total_charges_filter = df.TotalCharges == " "
df = df[~total_charges_filter]
df.TotalCharges = pd.to_numeric(df.TotalCharges)

Here we first upload our data and read the CSV file into a pandas DataFrame. We then drop duplicate rows, remove the rows whose TotalCharges value is a blank string, and convert TotalCharges to a numeric column.
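The cleaning steps above can be checked on a tiny synthetic frame (the column names mirror the Telco CSV, but the values here are made up for illustration):

```python
import pandas as pd

# Toy frame with one exact duplicate row and one blank TotalCharges,
# mimicking the issues present in the Telco CSV
toy = pd.DataFrame({
    "customerID": ["a", "a", "b", "c"],
    "tenure": [1, 1, 5, 0],
    "TotalCharges": ["29.85", "29.85", "151.65", " "],
})

toy = toy[~toy.duplicated()]        # drop the exact duplicate
toy = toy[toy.TotalCharges != " "]  # drop rows with blank TotalCharges
toy.TotalCharges = pd.to_numeric(toy.TotalCharges)

print(len(toy))                # 2 rows survive
print(toy.TotalCharges.dtype)  # float64
```

Converting to numeric only after removing the blank strings is what keeps `pd.to_numeric` from raising an error here.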

categorical_features = [
    "gender", "SeniorCitizen", "Partner", "Dependents", "PhoneService",
    "MultipleLines", "InternetService", "OnlineSecurity", "OnlineBackup",
    "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies",
    "Contract", "PaperlessBilling", "PaymentMethod",
]
numerical_features = ["MonthlyCharges", "tenure", "TotalCharges"]
output = "Churn"
df[numerical_features].hist(bins=40, figsize=(7, 7), color="green")

Then we list the categorical and numerical features present in the CSV file, name the target column, and plot histograms of the numerical data.

import matplotlib.pyplot as plt

# Overlay the numeric-feature histograms for churned vs. retained customers
fig, ax = plt.subplots(1, 3, figsize=(20, 5))
df[df.Churn == "No"][numerical_features].hist(bins=30, color="black", alpha=0.5, ax=ax)
df[df.Churn == "Yes"][numerical_features].hist(bins=30, color="green", alpha=0.5, ax=ax)


R, C = 4, 4
fig, ax = plt.subplots(R, C, figsize=(18, 18))
for i, categorical_feature in enumerate(categorical_features):
    row, col = i // C, i % C  # position of this subplot in the 4x4 grid
    df[categorical_feature].value_counts().plot(kind='bar', ax=ax[row, col]).set_title(categorical_feature)



from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn import tree

# One-hot encode the categorical columns
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Impute missing numeric values with the median, then standardize
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# Apply each transformer to its own set of columns
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features),
])

# Full pipeline: preprocessing followed by a decision tree
clf = Pipeline([
    ('preprocessor', preprocessor),
    ('clf', tree.DecisionTreeClassifier(max_depth=3, random_state=42)),
])

Then we import the sklearn pipeline utilities to combine the categorical and numerical preprocessing steps and feed the result into the decision tree model.

from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.20, random_state=42)
clf.fit(df_train, df_train[output])  # train the full pipeline on the training split
prediction = clf.predict(df_test)
from sklearn.metrics import classification_report
print(classification_report(df_test[output], prediction))

Then we split our data into training and test sets, fit the pipeline "clf" on the training set, and print a classification report for the predictions on the test set.
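Because the classifier is just the last pipeline step, it is easy to try one of the other algorithms mentioned earlier, such as logistic regression. Below is a minimal sketch on a small synthetic table (the column names echo the Telco data, but the values are made up for illustration):

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression

# Tiny synthetic churn-style table
data = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "Month-to-month", "One year"] * 5,
    "MonthlyCharges": [70.0, 20.0, 90.0, 45.0] * 5,
    "Churn": ["Yes", "No", "Yes", "No"] * 5,
})

# Same preprocessing idea as above: scale numerics, one-hot encode categoricals
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["MonthlyCharges"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
])

# Same pipeline shape, with LogisticRegression as the final step
model = Pipeline([
    ("preprocessor", preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(data, data["Churn"])
print(model.score(data, data["Churn"]))  # training accuracy on this toy data
```

Note that the `ColumnTransformer` only selects the listed feature columns, so passing the whole DataFrame to `fit` does not leak the `Churn` target into the features.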

I hope you enjoyed the article. Thank you!
