Calculate AUC With sklearn in Python
In this tutorial, we will explore the AUC (Area Under the ROC Curve) and its role in evaluating machine learning models. We will also calculate the AUC in Python using sklearn (scikit-learn).
AUC
AUC stands for the area under the Receiver Operating Characteristic (ROC) curve. It is mostly used to evaluate the performance of binary classification models such as Logistic Regression. There are many classification models, and the dataset changes from one problem to the next, so we need a quantity that tells us how well a model separates the classes on a given dataset. That quantity helps us decide which model to use. The ROC curve is built from two measures:
Sensitivity is the ratio of true positives (TP) to the total number of actual positives, that is, true positives plus false negatives (FN). It is also called the True Positive Rate and tells us what fraction of the positive samples the model identifies correctly.
Sensitivity = TP / (TP + FN)
Specificity is the ratio of true negatives (TN) to the total number of actual negatives, that is, true negatives plus false positives (FP). It is also called the True Negative Rate and tells us what fraction of the negative samples the model identifies correctly.
Specificity = TN / (TN + FP)
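To make the two formulas concrete, here is a minimal sketch that computes both from a confusion matrix. The labels and predictions are toy values made up for illustration; they are not taken from the dataset used later in this tutorial.

```python
from sklearn.metrics import confusion_matrix

# Toy labels and predictions, invented for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels {0, 1}, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # True Positive Rate
specificity = tn / (tn + fp)  # True Negative Rate

print(f"Sensitivity: {sensitivity:.2f}")  # Sensitivity: 0.75
print(f"Specificity: {specificity:.2f}")  # Specificity: 0.67
```

Here 3 of the 4 positives and 4 of the 6 negatives are identified correctly.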
The ROC curve plots Sensitivity (the True Positive Rate) against 1 - Specificity (the False Positive Rate) as the classification threshold varies. The area under this curve is the AUC score. An AUC of 1.0 means perfect discrimination, while an AUC of 0.5 suggests that the model performs no better than random chance.
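The construction of the curve can be sketched by hand: sweep a threshold over the predicted scores, record one (FPR, TPR) point per threshold, and sum the trapezoids under the points. The four labels and scores below are toy values chosen for illustration, and the loop is only a teaching sketch, not how sklearn implements it internally.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores, invented for illustration only
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Thresholds from high to low; starting at +inf makes the curve begin at (0, 0)
thresholds = np.concatenate(([np.inf], np.unique(y_score)[::-1]))

fpr_points, tpr_points = [], []
for t in thresholds:
    y_pred = (y_score >= t).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    tpr_points.append(tp / (tp + fn))  # Sensitivity
    fpr_points.append(fp / (fp + tn))  # 1 - Specificity

fpr_arr = np.array(fpr_points)
tpr_arr = np.array(tpr_points)

# Trapezoidal rule: width of each FPR step times the average TPR over that step
manual_auc = np.sum(np.diff(fpr_arr) * (tpr_arr[1:] + tpr_arr[:-1]) / 2)

print(f"Manual AUC:  {manual_auc:.2f}")                       # Manual AUC:  0.75
print(f"sklearn AUC: {roc_auc_score(y_true, y_score):.2f}")   # sklearn AUC: 0.75
```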
Step 1: Importing Libraries
Let’s import all the necessary libraries. I will be working on the Diabetes Dataset preloaded in the sklearn library. Here, the iris dataset can’t be used as it is multiclass, and we need a binary class dataset.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve, auc
import matplotlib.pyplot as plt
Step 2: Loading the dataset
Load the dataset from sklearn. Its target variable is a continuous value (a quantitative measure of disease progression), so we need to convert it into binary labels (1 for above-average disease progression and 0 otherwise). To do this, we take the mean of the target and assign 1 to every sample whose value y is greater than the mean and 0 to the rest. We then split the dataset into training and test sets in an 80:20 ratio.
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
y_binary = (y > y.mean()).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=2023)
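Since the binary labels come from a threshold at the mean, it can be worth checking how balanced the two classes are after the split. The sketch below does that with np.bincount. Note that it adds stratify=y_binary, which is not part of the original code; it is an optional variation that keeps the class proportions the same in the training and test sets.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
y_binary = (y > y.mean()).astype(int)

# stratify=y_binary is an optional addition (an assumption of this sketch,
# not used in the tutorial's own split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_binary, test_size=0.2, random_state=2023, stratify=y_binary
)

# Count of samples per class: index 0 is class 0, index 1 is class 1
print("All samples:", np.bincount(y_binary))
print("Training set:", np.bincount(y_train))
print("Test set:", np.bincount(y_test))
```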
Step 3: Training the model
Create a Logistic Regression model and train it. To calculate the AUC score, we need the predicted probability of the positive class for each sample in the test set, so we use model.predict_proba() and keep its second column (the probability of class 1).
model = LogisticRegression()
model.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]
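The reason for using probabilities rather than predicted labels is that AUC measures how well the model ranks samples. The sketch below, which repeats the setup above so that it runs on its own, compares three inputs: hard labels from predict(), probabilities from predict_proba(), and raw scores from decision_function(). Because the probability is a monotonic function of the raw score, the last two give the same ranking and so the same AUC, while hard labels collapse the ranking to a single threshold and generally give a different value.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
y_binary = (y > y.mean()).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_binary, test_size=0.2, random_state=2023
)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard labels, 0 or 1
y_prob = model.predict_proba(X_test)[:, 1]   # probability of class 1
y_dec = model.decision_function(X_test)      # raw scores (log-odds)

auc_from_labels = roc_auc_score(y_test, y_pred)
auc_from_prob = roc_auc_score(y_test, y_prob)
auc_from_dec = roc_auc_score(y_test, y_dec)

print(f"AUC from hard labels:       {auc_from_labels:.4f}")
print(f"AUC from probabilities:     {auc_from_prob:.4f}")
print(f"AUC from decision_function: {auc_from_dec:.4f}")
```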
Step 4: Calculate the AUC
Use sklearn's roc_auc_score() function to calculate the AUC directly from the true labels and the predicted probabilities.
auc_score = roc_auc_score(y_test, y_prob)
print(f'AUC Score: {auc_score:.4f}')
Output:
AUC Score: 0.8318
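A single train/test split gives one AUC value, which depends on how the split happened to fall. As a hedged sketch of one way to check this, the code below estimates the AUC with 5-fold cross-validation using scoring="roc_auc". The fold count and the use of StratifiedKFold are choices made for this example and are not part of the steps above.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_diabetes(return_X_y=True)
y_binary = (y > y.mean()).astype(int)

# 5 stratified folds; the number of folds is an arbitrary choice for this sketch
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2023)

# scoring="roc_auc" makes each fold report the AUC on its held-out part
scores = cross_val_score(LogisticRegression(), X, y_binary, cv=cv, scoring="roc_auc")

print("AUC per fold:", scores.round(4))
print(f"Mean AUC: {scores.mean():.4f} (std {scores.std():.4f})")
```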
Additional: plotting ROC curve
You can also plot the ROC curve to get more insight. First, we calculate the false positive and true positive rates with roc_curve(). We can also calculate the AUC from the FPR and TPR with auc(), as shown in the code below.
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 8))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'AUC = {roc_auc:.2f}')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
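Output: [Figure: Receiver Operating Characteristic (ROC) Curve, True Positive Rate (TPR) vs. False Positive Rate (FPR)]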
Here, FPR is the False Positive Rate, which equals 1 - True Negative Rate, and TPR is the True Positive Rate. The orange line is the ROC curve, the plot of TPR against FPR. The dashed navy line is the diagonal with slope = 1, which represents random classification: a model that guesses at random is as likely to rank a negative sample above a positive one as the reverse, so its curve follows this diagonal and its AUC is 0.5.
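The 0.5 baseline can be checked with a small simulation. The sketch below uses synthetic labels and scores generated with NumPy (not the diabetes data): scores unrelated to the labels give an AUC close to 0.5, scores that separate the classes perfectly give 1.0, and inverting those scores gives 0.0.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic data for illustration only; the seed and sample size are arbitrary
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=10_000)

random_scores = rng.random(10_000)       # scores unrelated to the labels
perfect_scores = y_true.astype(float)    # every positive scored above every negative
inverted_scores = 1.0 - perfect_scores   # every positive scored below every negative

auc_random = roc_auc_score(y_true, random_scores)
auc_perfect = roc_auc_score(y_true, perfect_scores)
auc_inverted = roc_auc_score(y_true, inverted_scores)

print(f"Random scores:   {auc_random:.3f}")    # close to 0.5
print(f"Perfect scores:  {auc_perfect:.3f}")   # 1.000
print(f"Inverted scores: {auc_inverted:.3f}")  # 0.000
```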