Data preprocessing (splitting the dataset before training the model)

Training a model on data is not an easy task. Various parameters should be considered before training any model; whether it is an artificial neural network or a convolutional neural network, training depends on many fundamental parameters.

How to split a dataset into train and test data in Python – Data preprocessing

In this tutorial you will learn how to split a dataset before training any system. If the dataset is not split correctly, it can lead to overfitting or underfitting, and hence to a badly trained or untrained model (system).

What is underfitting?

Underfitting mainly occurs when a machine learning algorithm is not able to capture the underlying trend of the data, i.e., when the model does not fit the data well.

Another way to check: if the algorithm shows low variance but high bias, then it is underfitting. Underfitting should be avoided so that the data and the training effort are not wasted.
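
As a rough illustration (using synthetic data and a plain linear model, both assumptions for this sketch rather than part of this tutorial's dataset), a straight line fitted to clearly non-linear data scores poorly on both the training and the test data, which is the signature of underfitting:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic non-linear data: y depends on x squared (illustrative only)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# A straight line cannot capture the underlying quadratic trend,
# so both scores stay low -> underfitting (high bias, low variance)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_train, y_train))  # low R^2 on training data
print(model.score(X_test, y_test))    # low R^2 on test data too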

What is overfitting?

Overfitting occurs when a machine learning algorithm is trained too closely on the training dataset, so that it memorizes noise instead of learning the general pattern. This has a negative impact on the performance of the system on new data and leads to a wrong prediction model. Overfitting should be avoided so that the model generalizes well.
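
Conversely, here is a minimal sketch of overfitting under the same synthetic-data assumption: a very high-degree polynomial fits a small training set almost perfectly but generalizes badly.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Small synthetic dataset: 30 noisy samples of a quadratic trend
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=30)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# A degree-15 polynomial on only 24 training points memorizes noise:
# near-perfect training score, poor test score -> overfitting
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)
print(model.score(X_train, y_train))  # close to 1.0
print(model.score(X_test, y_test))    # much lower, may even be negative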

Implementing data preprocessing

Before proceeding, I would recommend going through the related tutorials first.

Now, let's write the code in Python.

Generally, the data is given in two forms:

  1. Training set
  2. Test set

Here we have to split the dataset into four parts:

  1. X_train
  2. y_train
  3. X_test
  4. y_test

We do not touch the test set while training the model; it is kept aside so the model can be evaluated on data it has never seen, instead of wasting the whole dataset on training.
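
To make these four parts concrete, here is a minimal sketch on a tiny made-up dataset (the arrays and shapes are purely illustrative):

import numpy as np
from sklearn.model_selection import train_test_split

# Tiny made-up dataset: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 80% of rows go to the training part, 20% to the test part
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(X_train.shape)  # (8, 2) - features used for training
print(y_train.shape)  # (8,)   - labels used for training
print(X_test.shape)   # (2, 2) - features held out for evaluation
print(y_test.shape)   # (2,)   - labels held out for evaluation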

Follow the code below to split your dataset:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset; adjust the column ranges to match your CSV
dataset = pd.read_csv('dataset.csv')
X = dataset.iloc[:, 3:13].values  # feature columns
y = dataset.iloc[:, 13].values    # label column

# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Label-encode the binary categorical column (index 2)
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])

# One-hot encode the categorical column at index 1.
# (The old categorical_features argument was removed from OneHotEncoder
# in newer scikit-learn; ColumnTransformer is the current way to do this.)
ct = ColumnTransformer([('onehot', OneHotEncoder(), [1])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

# Drop one dummy column to avoid the dummy variable trap
X = X[:, 1:]

# Splitting the dataset - training set & test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In the code above, test_size indicates the fraction of the whole dataset that is held out as test data, and random_state seeds the shuffling so the split is reproducible (here it is set to zero).

With test_size = 0.2, this splits the dataset into training and test sets in an 80:20 ratio.

For example, if you have 1000 rows in your dataset, the split will produce (as the quick check after this list confirms):

  • X_train and y_train = 800 rows
  • X_test and y_test = 200 rows
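
You can verify the 80:20 split quickly with a dummy array of 1000 rows (a minimal sketch; the zeros simply stand in for real features and labels):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((1000, 5))  # dummy features, 1000 rows
y = np.zeros(1000)       # dummy labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(len(X_train), len(y_train))  # 800 800
print(len(X_test), len(y_test))    # 200 200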

I hope you now have a good idea of how dataset splitting works. See you in the next tutorial; until then, enjoy learning.
