Data preprocessing (splitting the dataset before training the model)
Training a model on data is not an easy task. Various parameters should be considered before training any model; whether it is an artificial neural network or a convolutional neural network, training depends on many fundamental parameters.
How to split dataset into train and test data in Python – Data preprocessing
In this tutorial, learn how to split a dataset before training any system. If the dataset is not split correctly, it can lead to overfitting or underfitting, resulting in a badly trained or untrained model (system).
What is underfitting?
Underfitting mainly occurs when a machine learning algorithm is not able to capture the underlying trend of the data, which usually means the data is not well fitted by the model.
Another way to check: when the algorithm shows low variance but high bias, it is underfitting. Underfitting should be avoided so that neither the data nor the model goes to waste.
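To make this concrete, here is a minimal sketch (illustrative data, not from this tutorial) of underfitting: a straight line fitted to clearly non-linear data scores poorly even on the data it was trained on, which is the signature of high bias.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 60)).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.5, 60)  # quadratic trend plus noise

model = LinearRegression().fit(X, y)
# A low R^2 even on the training data itself signals high bias:
# the model is too simple to capture the underlying trend.
print(model.score(X, y))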
What is overfitting?
Overfitting occurs when a machine learning algorithm is trained too closely on a dataset, so that it memorizes the training data (noise included) instead of learning the general pattern. This has a negative impact on the performance of the system and leads to a wrong prediction model. Overfitting should be avoided so this negative impact on the system's performance is removed.
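As a matching sketch (again with made-up data), overfitting shows up as a large gap between training and test performance: a very high-degree polynomial fits the training points almost perfectly but generalizes badly.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 60)).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.5, 60)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression()).fit(X_train, y_train)
# A near-perfect training score with a much lower test score is the classic overfitting signal.
print("train R^2:", model.score(X_train, y_train))
print("test R^2:", model.score(X_test, y_test))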
Implementing data preprocessing
Before proceeding, I would recommend going through these tutorials:
- Importing a dataset using Pandas (Python data analysis library)
- How to import libraries for a deep learning model in Python?
- Using sklearn StandardScaler() to transform input dataset values (see the short sketch after this list)
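As a quick refresher (a minimal sketch, assuming X_train and X_test already exist from a split like the one shown later), the key point with StandardScaler is to fit it on the training data only and reuse the same transform on the test data, so no test-set information leaks into preprocessing:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # learn mean and std from the training data only
X_test = sc.transform(X_test)        # apply the same mean and std to the test data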
So finally, we will now code this in Python.
Generally, the data is given in two forms:
- Training set
- Test set
Here we have to split the dataset into four subparts:
- X_train
- y_train
- X_test
- y_test
We will not test our model on the same data it was trained on; a separate test set is held back so performance is measured on unseen data and no part of the dataset goes to waste.
Follow the code below to split your dataset:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('dataset.csv')
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values

# Encoding categorical data (here, columns 1 and 2 of X)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])

# One-hot encode column 1; OneHotEncoder's categorical_features argument
# was removed in scikit-learn 0.22, so ColumnTransformer is used instead
ct = ColumnTransformer([('ohe', OneHotEncoder(), [1])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
X = X[:, 1:]  # drop one dummy column to avoid the dummy variable trap

# Splitting the dataset into a training set & a test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
In this code segment, test_size indicates the fraction of the whole dataset that is split off as test data, and random_state is set to 0 (a fixed seed, so the split is reproducible).
This will split the dataset into training and test sets in the ratio 8:2.
For example, if you have 1000 rows in your dataset, then it will produce the following split (you can verify this with the quick check after the list):
- X_train, y_train = 800 samples
- X_test, y_test = 200 samples
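Here is a quick check of the split sizes (assuming the 1000-row example above and the variable names from the code):

print(X_train.shape[0], y_train.shape[0])  # 800 training samples
print(X_test.shape[0], y_test.shape[0])    # 200 test samples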
I hope you got a good idea of dataset splitting. Hope to see you in the next tutorial; until then, enjoy learning.