Imbalanced Multiclass Classification with the E.coli Dataset in Python
In this tutorial, we will work through an imbalanced multiclass classification problem using the E.coli dataset in Python.
Classification problems in which more than two labels can be predicted are known as multiclass classification problems. When, in addition, the data is skewed towards one or more of the classes, the problem becomes harder to handle; such problems are commonly known as imbalanced multiclass classification problems.
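To make the idea concrete, here is a minimal sketch of what a skewed label distribution looks like. The labels below are made up for illustration; they are not the E.coli data:

# A minimal sketch of an imbalanced multiclass label distribution.
# These labels are hypothetical, purely for illustration.
from collections import Counter

labels = ['a'] * 900 + ['b'] * 80 + ['c'] * 20
for k, v in Counter(labels).items():
    print('Class=%s, Count=%d, Percentage=%.1f%%' % (k, v, v / len(labels) * 100))

A classifier that always predicts class 'a' would already be 90% accurate on these labels, which is why we will look closely at the class distribution of the E.coli dataset before modeling.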
The dataset is available from the UCI Machine Learning Repository.
Imbalanced Multiclass Classification
Let us load the necessary libraries. Please make sure that you have recent versions of these libraries installed on your system:
from pandas import read_csv
from pandas import set_option
from collections import Counter
from matplotlib import pyplot
from numpy import mean
from numpy import std
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
It is now time to load the data into our Python script. We can print out the shape of the dataset, summarize the class distribution, and then move ahead accordingly. We can also parse through the whole dataset once if required.
filename = '/home/sumit/ecoli.csv'
df = read_csv(filename, header=None)
print(df.shape)
# summarize the class distribution
target = df.values[:, -1]
counter = Counter(target)
for k, v in counter.items():
    per = v / len(target) * 100
    print('Class=%s, Count=%d, Percentage=%.5f%%' % (k, v, per))
# summarize the input variables
set_option('display.precision', 5)
print(df.describe())
Output:
(336, 8)
Class=cp, Count=143, Percentage=42.55952%
Class=im, Count=77, Percentage=22.91667%
Class=imS, Count=2, Percentage=0.59524%
Class=imL, Count=2, Percentage=0.59524%
Class=imU, Count=35, Percentage=10.41667%
Class=om, Count=20, Percentage=5.95238%
Class=omL, Count=5, Percentage=1.48810%
Class=pp, Count=52, Percentage=15.47619%
               0          1          2  ...          4          5          6
count  336.00000  336.00000  336.00000  ...  336.00000  336.00000  336.00000
mean     0.50006    0.50000    0.49548  ...    0.50003    0.50018    0.49973
std      0.19463    0.14816    0.08850  ...    0.12238    0.21575    0.20941
min      0.00000    0.16000    0.48000  ...    0.00000    0.03000    0.00000
25%      0.34000    0.40000    0.48000  ...    0.42000    0.33000    0.35000
50%      0.50000    0.47000    0.48000  ...    0.49500    0.45500    0.43000
75%      0.66250    0.57000    0.48000  ...    0.57000    0.71000    0.71000
max      0.89000    1.00000    1.00000  ...    0.88000    1.00000    0.99000

[8 rows x 7 columns]
Next, we plot histograms of the input variables to get a better insight into the data. This will help us make better modeling choices later on.
df.hist(bins=25)
pyplot.show()
Output: a grid of histograms, one per input variable.
Two of the classes (imS and imL) have only two examples each, which is too few for stratified 5-fold cross-validation (each class needs at least as many examples as there are folds) and would cause an error. To handle this, we simply remove the rows belonging to those classes using the new_data() function below.
def new_data(filename):
    df = read_csv(filename, header=None)
    # drop the two classes with only two examples each
    df = df[df[7] != 'imS']
    df = df[df[7] != 'imL']
    data = df.values
    X, y = data[:, :-1], data[:, -1]
    # encode the string class labels as integers
    y = LabelEncoder().fit_transform(y)
    return X, y
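As a quick optional sanity check, we can confirm that the filtering works, reusing the filename and the Counter import from above:

# Sanity check: the two rows of each removed class are gone,
# leaving 332 of the original 336 rows across 6 classes.
X, y = new_data(filename)
print(X.shape, y.shape)   # expect (332, 7) and (332,)
print(Counter(y))         # 6 classes, label-encoded as integers 0..5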
Let us now evaluate the algorithms. The following five models will be tested on this dataset:
- RF: Random Forest
- ET: Extra Trees
- LDA: Linear Discriminant Analysis
- SVM: Support Vector Machine
- BAG: Bagged Decision Trees
def evaluate_model(X, y, model):
    # repeated stratified k-fold cross-validation
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

def get_models():
    models, names = list(), list()
    models.append(LinearDiscriminantAnalysis())
    names.append('LDA')
    models.append(LinearSVC())
    names.append('SVM')
    models.append(BaggingClassifier(n_estimators=1000))
    names.append('BAG')
    models.append(RandomForestClassifier(n_estimators=1000))
    names.append('RF')
    models.append(ExtraTreesClassifier(n_estimators=1000))
    names.append('ET')
    return models, names
Running the code below and plotting a boxplot of the scores will help us better understand the behaviour of the five algorithms on this dataset.
X, y = new_data(filename)
models, names = get_models()
results = list()
for i in range(len(models)):
    scores = evaluate_model(X, y, models[i])
    results.append(scores)
    print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
Output:
>LDA 0.881 (0.041)
>SVM 0.882 (0.040)
>BAG 0.855 (0.038)
>RF 0.887 (0.022)
>ET 0.877 (0.034)
Random Forest achieved the best mean accuracy (0.887), so let us now build a final model from scratch on the same data and print the predictions it makes alongside the expected results.
We will make a prediction for one known example from each of the remaining classes: om, cp, pp, imU, omL, im.
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier

def new_data(filename):
    df = read_csv(filename, header=None)
    df = df[df[7] != 'imS']
    df = df[df[7] != 'imL']
    data = df.values
    X, y = data[:, :-1], data[:, -1]
    # keep the encoder so we can map predictions back to class names
    le = LabelEncoder()
    y = le.fit_transform(y)
    return X, y, le

filename = '/home/sumit/ecoli.csv'
X, y, le = new_data(filename)
model = RandomForestClassifier(n_estimators=1000)
model.fit(X, y)

# known class "om"
row = [0.78, 0.68, 0.48, 0.50, 0.83, 0.40, 0.29]
q = model.predict([row])
l = le.inverse_transform(q)[0]
print('>Predicted=%s (expected om)' % (l))

# known class "cp"
row = [0.49, 0.29, 0.48, 0.50, 0.56, 0.24, 0.35]
q = model.predict([row])
l = le.inverse_transform(q)[0]
print('>Predicted=%s (expected cp)' % (l))

# known class "pp"
row = [0.74, 0.49, 0.48, 0.50, 0.42, 0.54, 0.36]
q = model.predict([row])
l = le.inverse_transform(q)[0]
print('>Predicted=%s (expected pp)' % (l))

# known class "imU"
row = [0.72, 0.42, 0.48, 0.50, 0.65, 0.77, 0.79]
q = model.predict([row])
l = le.inverse_transform(q)[0]
print('>Predicted=%s (expected imU)' % (l))

# known class "omL"
row = [0.77, 0.57, 1.00, 0.50, 0.37, 0.54, 0.0]
q = model.predict([row])
l = le.inverse_transform(q)[0]
print('>Predicted=%s (expected omL)' % (l))

# known class "im"
row = [0.06, 0.61, 0.48, 0.50, 0.49, 0.92, 0.37]
q = model.predict([row])
l = le.inverse_transform(q)[0]
print('>Predicted=%s (expected im)' % (l))
Output:
>Predicted=om (expected om)
>Predicted=cp (expected cp)
>Predicted=pp (expected pp)
>Predicted=imU (expected imU)
>Predicted=omL (expected omL)
>Predicted=im (expected im)
The model correctly predicts the expected class for every example. Congratulations!
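As an optional follow-up sketch, we can also inspect how confident the model is, reusing the fitted model and label encoder from the script above. predict_proba is a standard scikit-learn method on RandomForestClassifier:

# Optional: inspect class probabilities for the "om" row instead of
# just the hard label. Columns of predict_proba follow the encoded
# class order, which matches le.classes_.
row = [0.78, 0.68, 0.48, 0.50, 0.83, 0.40, 0.29]
probs = model.predict_proba([row])[0]
for cls, p in zip(le.classes_, probs):
    print('%s: %.3f' % (cls, p))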
Hope you had fun learning in this tutorial with me. Have a good day and happy learning.
Also read: AdaBoost Algorithm for Machine Learning in Python