How to split a dataset with scikit-learn’s train_test_split() in Python
Dataset splitting plays a crucial role in machine learning. It lets us evaluate a model on data it did not see during training. In this tutorial, we will learn how to split a dataset using scikit-learn.
Splitting the dataset using scikit-learn
Steps involved:
- Importing packages
- Loading the dataset
- Splitting using sklearn
Importing the packages:
import pandas as pd
from sklearn.model_selection import train_test_split
To split the data, we import train_test_split from sklearn.model_selection.
Loading the dataset:
Let's consider Sample.csv as the dataset.
df = pd.read_csv("PATH OF THE DATASET")
df.shape
(614, 13)
df.columns
Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'], dtype='object')
In this dataset, Loan_Status is the dependent (target) variable, so we separate it from the remaining feature columns.
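# Use the columns keyword; recent pandas versions no longer accept axis as a positional argument
X = df.drop(columns=['Loan_Status'])
X.shape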
(614, 12)
y = df['Loan_Status']
y.shape
(614,)
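The feature/target separation above can be tried without the CSV file. The following is a minimal sketch on a tiny made-up DataFrame; the names toy, X_toy and y_toy and all the values are assumptions for illustration, and only the column names echo the dataset used in this tutorial.

```python
import pandas as pd

# Tiny made-up frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "ApplicantIncome": [5000, 3000, 4000],
    "LoanAmount": [120, 100, 90],
    "Loan_Status": ["Y", "N", "Y"],
})

X_toy = toy.drop(columns=["Loan_Status"])  # features: every column except the target
y_toy = toy["Loan_Status"]                 # target: a single Series

print(X_toy.shape, y_toy.shape)
```

X_toy keeps one row per sample and loses only the target column, while y_toy is one-dimensional, which mirrors the (614, 12) and (614,) shapes shown above.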
User input:
print("Enter the splitting factor:")
n = float(input())
Enter the splitting factor:
0.3
Here the user supplies the splitting factor, that is, the fraction of the dataset to reserve as test data. It should be a number between 0 and 1. Let us use 0.3 as the splitting factor.
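Since the factor comes from user input, it may be worth checking it before passing it on. The following is a minimal sketch, assuming the factor is meant as a fraction strictly between 0 and 1; parse_test_size is a hypothetical helper written for this tutorial, not a scikit-learn function.

```python
def parse_test_size(text):
    """Convert user input to a test fraction strictly between 0 and 1.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    value = float(text)  # raises ValueError for non-numeric input
    if not 0.0 < value < 1.0:
        raise ValueError("splitting factor must be between 0 and 1, got %r" % value)
    return value


n = parse_test_size("0.3")
print(n)
```

In the tutorial's flow, the call would become n = parse_test_size(input()), so that a bad value fails early with a clear message.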
Splitting using sklearn:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=n)
Here we split the dataset randomly into x_train, x_test, y_train, and y_test according to the given splitting factor. The rows are shuffled before splitting, and each row of X stays paired with its label in y.
NOTE: train_test_split(X, y, test_size=n, random_state=<any integer>) produces the same split on every execution, whereas train_test_split(X, y, test_size=n) produces a different split each time.
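The effect of random_state can be checked directly. The following is a minimal sketch on toy NumPy arrays; X_demo, y_demo and the value 42 are assumptions chosen for illustration and stand in for the real X and y.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for X and y (assumed for illustration only).
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same random_state should yield the same split.
train_a, test_a, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
train_b, test_b, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

print(np.array_equal(train_a, train_b), np.array_equal(test_a, test_b))
```

Fixing random_state is useful when results need to be reproducible, for example when comparing two models on the same split.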
Before Splitting:
print("Size of x:")
print(X.shape)
print("Size of y:")
print(y.shape)
Size of x:
(614, 12)
Size of y:
(614,)
After Splitting:
print("Size of x_train:")
print(x_train.shape)
print("Size of y_train:")
print(y_train.shape)
print("Size of x_test:")
print(x_test.shape)
print("Size of y_test:")
print(y_test.shape)
Size of x_train:
(429, 12)
Size of y_train:
(429,)
Size of x_test:
(185, 12)
Size of y_test:
(185,)
Because the splitting factor is 0.3, 30% of the dataset goes to the test set: 0.3 × 614 = 184.2, which is rounded up to 185 rows. The remaining 429 rows form the training set.
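The sizes reported above can be reproduced by hand. The following is a sketch of the arithmetic, assuming the test count is rounded up and the training set takes the remainder, which is consistent with the output shown; the variable names are chosen for illustration.

```python
import math

n_samples = 614   # rows in the dataset
test_size = 0.3   # splitting factor entered by the user

n_test = math.ceil(test_size * n_samples)  # 184.2 is rounded up
n_train = n_samples - n_test               # the rest goes to training

print(n_test, n_train)  # prints: 185 429
```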
In this way, the dataset is split into train and test sets using scikit-learn.