Calculate AUC With sklearn in Python

In this tutorial, we will explore the AUC (Area Under the ROC Curve) and its significance in evaluating machine learning models. We will also calculate the AUC in Python using sklearn (scikit-learn).

AUC

AUC is the area under the Receiver Operating Characteristic (ROC) curve and is mostly used to evaluate the performance of binary classification models such as Logistic Regression. Since there are many classification models and we work with a different dataset every time, we need a quantity that tells us how well a model fits a dataset and thereby helps us decide which model to use. Two quantities are used for this (a small worked example follows the definitions below):

Sensitivity refers to the ratio of the true positive class to the total number of actual positive classes. It is also called the True Positive Rate and tells us how many positive classes are correctly identified by the model.
Sensitivity = TP/(TP+FN).
Specificity refers to the ratio of the true negative class to the total number of actual negative classes. It is also called the True Negative Rate and tells us how many negative classes are correctly identified by the model.
Specificity = TN/(TN+FP).
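
To make these two formulas concrete, here is a small sketch that computes both from a confusion matrix. The labels and predictions below are made up purely for illustration and are not from the diabetes data.

from sklearn.metrics import confusion_matrix

# Illustrative (made-up) labels and predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary 0/1 labels, ravel() gives TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # True Positive Rate
specificity = tn / (tn + fp)  # True Negative Rate
print(f'Sensitivity: {sensitivity:.2f}, Specificity: {specificity:.2f}')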

The ROC curve is the plot of Sensitivity (TPR) against 1 - Specificity (FPR). The area under this curve is the AUC score. An AUC of 1.0 means perfect discrimination, while an AUC of 0.5 suggests that the model performs no better than random chance.
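
As a quick illustration of these two extremes, roc_auc_score returns 1.0 when every positive sample is scored above every negative one, and 0.5 when the scores carry no information at all. The toy numbers below are just for illustration.

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]

# Perfect ranking: both positives scored above both negatives -> AUC = 1.0
print(roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9]))

# Uninformative scores: identical for every sample -> AUC = 0.5
print(roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5]))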

Step 1: Importing Libraries

Let’s import all the necessary libraries. We will work with the Diabetes dataset preloaded in the sklearn library. The iris dataset can’t be used here because it is multiclass, and we need a binary-class dataset.

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve, auc
import matplotlib.pyplot as plt

Step 2: Loading the Dataset

Load the dataset from sklearn. Its target variable is a continuous number, so we need to convert it into binary values (1 for diabetes and 0 for non-diabetes). To do this, we take the mean of the target values: if a sample's target is greater than the mean, the person is labelled as having diabetes (1); otherwise 0. We also split the dataset into training and test sets in an 80:20 ratio.

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

y_binary = (y > y.mean()).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=2023)
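
If you want to sanity-check the binarization and the split, you can print the positive-class fraction and the set sizes. This is an optional check; the exact numbers depend on the dataset and the random_state.

# Optional sanity check: class balance after thresholding at the mean,
# and the sizes of the training and test sets
print(f'Positive class fraction: {y_binary.mean():.2f}')
print(f'Training samples: {len(X_train)}, Test samples: {len(X_test)}')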

Step 3: Training the model

Load the Logistic Regression model and train it. To calculate the AUC score, we need the predicted probability of the positive class for each sample in the test set, so we use model.predict_proba() and keep the column for class 1.

model = LogisticRegression()

model.fit(X_train, y_train)

y_prob = model.predict_proba(X_test)[:, 1]
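
If you are curious what predict_proba returns before we slice out the second column, you can inspect the first few rows. Each row holds the probabilities for class 0 and class 1 and sums to 1; the exact values will depend on the fitted model.

# Each row of predict_proba is [P(class 0), P(class 1)] and sums to 1
print(model.predict_proba(X_test)[:3])
# y_prob keeps only the probability of the positive class (class 1)
print(y_prob[:3])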

Step 4: Calculating the AUC

Use the roc_auc_score function from sklearn to calculate the AUC directly.

auc_score = roc_auc_score(y_test, y_prob)
print(f'AUC Score: {auc_score:.4f}')

Output:

AUC Score: 0.8318

Additional: Plotting the ROC Curve

You can also plot the ROC curve to get more insight. First, we calculate the false positive and true positive rates with roc_curve. We can also calculate the AUC from the FPR and TPR using auc, as shown in the code below.

fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 8))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'AUC = {roc_auc:.2f}')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()

Output:

[ROC curve plot: True Positive Rate vs. False Positive Rate, with the dashed chance diagonal; AUC = 0.83]

Here, FPR is the False Positive Rate, which equals 1 - True Negative Rate (i.e. 1 - Specificity), and TPR is the True Positive Rate. The orange line is the ROC curve of our model. The dashed diagonal line has a slope of 1 and represents random classification: with only two classes, a random classifier scores an AUC of 0.5, since there is a 50% chance of predicting Yes or No.
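
To see this 0.5 baseline in practice, you can score the same test set with random numbers. This is a quick sketch; the resulting AUC will vary slightly around 0.5 depending on the seed.

import numpy as np

# Random scores carry no information about the labels,
# so the resulting AUC should be close to 0.5
rng = np.random.default_rng(0)
random_scores = rng.random(len(y_test))
print(f'AUC with random scores: {roc_auc_score(y_test, random_scores):.4f}')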
