sklearn.model_selection.train_test_split in Python

In this post, I will explain scikit-learn’s “train_test_split” function. This utility lives in sklearn’s “model_selection” module and splits a data set into a training set, used to fit your machine learning model, and a testing set, used to check how close the model’s predictions are to the actual values.

Required modules and their versions:

import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

versions:

matplotlib:3.1.2
sklearn:0.22
pandas:0.25.3
numpy:1.18.0

Here, I have used sklearn’s well-known Iris data set to demonstrate the “sklearn.model_selection.train_test_split” function.

The syntax:

train_test_split(x, y, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)

In most cases only x, y and test_size are passed. shuffle is True by default, so the rows are shuffled before splitting and each set gets a random sample of the source data; pass random_state to make the split reproducible.
If neither test_size nor train_size is given, test_size defaults to 0.25 and the remaining 0.75 goes to training. A float between 0 and 1 is read as the fraction of the data to assign to that set, while an integer is read as an absolute number of samples.
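
As a quick check of these defaults, here is a minimal sketch on a toy array of 20 samples. The array and the names data and labels are made up for illustration and are not part of the Iris example. It shows the default 75/25 split, an integer test_size, a repeatable split with random_state, and what happens with shuffle=False.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, invented for this sketch: 20 samples with 2 features each.
data = np.arange(40).reshape(20, 2)
labels = np.arange(20)

# No sizes given: test_size falls back to 0.25, so 15 train / 5 test rows.
d_train, d_test, l_train, l_test = train_test_split(data, labels)
print(len(d_train), len(d_test))

# An integer test_size is an absolute number of samples, not a fraction.
d_train_int, d_test_int, _, _ = train_test_split(data, labels, test_size=4)
print(len(d_train_int), len(d_test_int))

# The same random_state gives the same shuffled split on every call.
first = train_test_split(data, labels, test_size=0.25, random_state=42)
second = train_test_split(data, labels, test_size=0.25, random_state=42)
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)

# shuffle=False keeps the original order: the last rows become the test set.
_, _, _, l_test_ordered = train_test_split(data, labels, shuffle=False)
print(l_test_ordered)
```

The value 42 for random_state is arbitrary; any fixed integer gives a repeatable split.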

Let’s take a short example. In this example:

  1. I’ll import the iris data set from the sklearn.datasets
  2. Then I’ll split the dataset into test and training datasets.
  3. Create a LinearRegression model from sklearn.linear_model
  4. Then fit the model, plot a scatter plot using matplotlib, and find the model score.

Importing the modules and data sets

import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

Naming the columns of the Iris dataset using a pandas data frame

col_names = "Sepal_Length Sepal_Width Petal_Length Petal_Width".split(' ')
iris_data = datasets.load_iris()
df = pd.DataFrame(iris_data.data,columns=col_names)
print(df.head(n=10)) # this is to print the first 10 rows of the data

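Splitting the data into training and test sets

# Features: columns 0-1 (Sepal_Length, Sepal_Width); targets: columns 1-2 (Sepal_Width, Petal_Length).
# Note: Sepal_Width is both a feature and a target, so that target can be predicted exactly.
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,0:2],df.iloc[:,1:3],test_size=0.35)
print(f'X_train shape: {X_train.shape}\ny_train shape: {y_train.shape}')  # dimensions of the training set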
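
To see where those shapes come from, the sketch below repeats the split on the 150-row Iris data with test_size=0.35. scikit-learn rounds the test count up (ceil(150 × 0.35) = 53) and leaves the remaining 97 rows for training. The names frame, Xa, Xb, ya and yb, and the value random_state=0, are choices made for this sketch only.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

col_names = ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"]
iris = datasets.load_iris()
frame = pd.DataFrame(iris.data, columns=col_names)

# random_state=0 is an arbitrary choice for this sketch, not part of the original example.
Xa, Xb, ya, yb = train_test_split(
    frame.iloc[:, 0:2], frame.iloc[:, 1:3], test_size=0.35, random_state=0
)

print(Xa.shape, Xb.shape)  # training and test features
print(ya.shape, yb.shape)  # training and test targets

# X and y are split with the same row selection, so their indexes line up.
rows_aligned = Xa.index.equals(ya.index) and Xb.index.equals(yb.index)
# Every one of the 150 rows lands in exactly one of the two sets.
no_overlap = set(Xa.index).isdisjoint(Xb.index)
print(rows_aligned, no_overlap)
```

Because the feature and target frames are split together, row i of the training features always corresponds to row i of the training targets.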

Creating the LinearRegression model

model_obj = linear_model.LinearRegression()

Prediction, plotting and finding the model score:

model_fit = model_obj.fit(X_train,y_train)
prediction = model_obj.predict(X_test)
print(f'prediction shape:{prediction.shape}\nX_test shape:{X_test.shape}')
print(f'------------prediction----------\n{prediction}\n{16*"--"}')
print(f'------------y_test----------\n{y_test}\n{16*"--"}')
plt.scatter(prediction,y_test)
plt.show()
print(model_fit.score(X_test,y_test))  # R^2 score (coefficient of determination) on the test set

I got an R² score of about 0.87 (the split is random, so your value will vary from run to run):

0.8727137843452777
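
For a regressor such as LinearRegression, score returns the coefficient of determination (R²) rather than a classification accuracy. Here is a minimal sketch of that equivalence using sklearn.metrics.r2_score. To keep it simple, it predicts a single target (petal length) instead of the two targets used above, and random_state=0 is an arbitrary value that is not in the original example.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
features = iris.data[:, 0:2]   # sepal length and sepal width
target = iris.data[:, 2]       # petal length only (single target, chosen for this sketch)

# random_state=0 is arbitrary; it only makes the sketch repeatable.
F_train, F_test, t_train, t_test = train_test_split(
    features, target, test_size=0.35, random_state=0
)

reg = linear_model.LinearRegression().fit(F_train, t_train)
predicted = reg.predict(F_test)

score_value = reg.score(F_test, t_test)
r2_value = r2_score(t_test, predicted)

# R^2 written out: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((t_test - predicted) ** 2)
ss_tot = np.sum((t_test - t_test.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(score_value, r2_value, manual_r2)  # three matching values
```

A score of 1.0 means the predictions match the test targets exactly, while 0.0 means the model does no better than always predicting the mean of the test targets.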


2 responses to “sklearn.model_selection.train_test_split in Python”

  1. Juan Gonzalez says:

    Great work!
    I get 2 errors.
    1. Not fit (which I fix with “model_obj.fit(X_train,y_train)”)
    2. “model_fit” not defined (which I fix with “model_fit = model_obj.fit(X_train,y_train)”)

    accuracy = 0.8191499801704696
