sklearn.model_selection.train_test_split in Python
In this post, I will explain scikit-learn’s “train_test_split” function. This utility function belongs to sklearn’s “model_selection” module. It splits a data set into a training set, used to fit your machine learning model, and a testing set, used to check how close the model’s predictions are to the real values.
Required modules and their versions:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
versions:
matplotlib:3.1.2
sklearn:0.22
pandas:0.25.3
numpy:1.18.0
Here, I have used sklearn’s well-known Iris data set to demonstrate the “sklearn.model_selection.train_test_split” function.
The syntax:
train_test_split(x,y,test_size,train_size,random_state,shuffle,stratify)
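The stratify parameter in the signature above is easy to overlook, so here is a minimal sketch of what it does. It assumes the Iris class labels (which the rest of this post does not use) are passed as y, and the values test_size=0.2 and random_state=0 are arbitrary choices for illustration. With stratify=y, each class keeps the same proportion in the training and test sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris has 150 samples: 3 classes with 50 samples each
X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

train_counts = np.bincount(y_train)  # samples per class in the training set
test_counts = np.bincount(y_test)    # samples per class in the test set
print(train_counts, test_counts)
```

Without stratify, the class counts in the test set would depend on how the shuffle happened to fall.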
Mostly, the parameters x, y and test_size are used. shuffle is True by default, so the rows are shuffled before splitting instead of being taken in their original order. If neither test_size nor train_size is given, test_size is set to 0.25 and the training set gets the remaining 0.75. When given as floats, these values are the fraction of the data assigned to each purpose; when given as integers, they are the absolute number of samples.
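To make the defaults concrete, here is a small sketch on the 150-row Iris data. The variable names are only for illustration. It shows the default split, an integer test_size, and what happens when shuffle is turned off.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples

# no sizes given: test_size falls back to 0.25
X_tr, X_te, y_tr, y_te = train_test_split(X, y)
default_sizes = (len(X_tr), len(X_te))

# an integer test_size is an absolute number of test samples
X_tr_n, X_te_n, _, _ = train_test_split(X, y, test_size=30)
int_sizes = (len(X_tr_n), len(X_te_n))

# shuffle=False keeps the original order: the last rows become the test set
X_tr_o, X_te_o, _, _ = train_test_split(X, y, test_size=0.25, shuffle=False)
ordered = np.array_equal(X_te_o, X[len(X_tr_o):])

print(default_sizes, int_sizes, ordered)
```

Note that the test size is rounded up, so 25% of 150 rows gives 38 test rows and 112 training rows.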
Let’s take a short example. In this example:
- I’ll import the Iris data set from sklearn.datasets
- Then I’ll split the data set into training and test data sets.
- Create the model using LinearRegression from sklearn.linear_model
- Then fit the model, plot a scatter plot using matplotlib, and find the model score.
Importing the modules and data sets
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
Naming the columns of the Iris dataset using a pandas data frame
col_names = "Sepal_Length Sepal_Width Petal_Length Petal_Width".split(' ')
iris_data = datasets.load_iris()
df = pd.DataFrame(iris_data.data, columns=col_names)
print(df.head(n=10))  # print the first 10 rows of the data
Splitting the data into train and test data sets
# X: Sepal_Length, Sepal_Width; y: Sepal_Width, Petal_Length (Sepal_Width is on both sides)
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 0:2], df.iloc[:, 1:3], test_size=0.35)
print(f'x_train shape: {X_train.shape}\ny_train shape:{y_train.shape}')  # dimensions of the training set
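Because shuffle is on by default, the split above gives different rows on every run. A minimal sketch of how random_state makes it repeatable follows. The seed 42 is an arbitrary choice, and the X and y names stand in for the two iloc slices used above.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

col_names = ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"]
df = pd.DataFrame(datasets.load_iris().data, columns=col_names)
X, y = df.iloc[:, 0:2], df.iloc[:, 1:3]  # same slices as in the post

# the same random_state gives the same split every time
split_a = train_test_split(X, y, test_size=0.35, random_state=42)
split_b = train_test_split(X, y, test_size=0.35, random_state=42)
same = all(a.equals(b) for a, b in zip(split_a, split_b))

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape, same)
```

With test_size=0.35 on 150 rows, 53 rows go to the test set and 97 to the training set.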
Creating the model using LinearRegression
model_obj = linear_model.LinearRegression()
Fitting, prediction, plotting and finding the model score:
model_fit = model_obj.fit(X_train, y_train)  # fit() returns the fitted model itself
prediction = model_obj.predict(X_test)
print(f'prediction shape:{prediction.shape}\nX_test shape:{X_test.shape}')
print(f'------------prediction----------\n{prediction}\n{16*"--"}')
print(f'------------y_test----------\n{y_test}\n{16*"--"}')
plt.scatter(prediction, y_test)
plt.show()
print(model_fit.score(X_test, y_test))  # R² score of the model on the test data
I got an R² score of about 0.87 (the exact value changes from run to run because the split is random):
0.8727137843452777
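For a regressor such as LinearRegression, score() returns the coefficient of determination (R²) rather than a classification accuracy. The sketch below checks this against sklearn.metrics.r2_score. To keep it simple it assumes a single target column (petal length) instead of the two targets used above, and random_state=0 is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

data = load_iris().data
X = data[:, 0:2]  # sepal length and sepal width
y = data[:, 2]    # petal length (single target, chosen for illustration)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.35, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
score = model.score(X_test, y_test)               # R² on the test set
manual = r2_score(y_test, model.predict(X_test))  # same quantity, computed explicitly
print(score, manual)
```

Changing random_state changes which rows land in the test set, so the score moves a little from split to split.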
Great work!
I get 2 errors.
1. Not fit (which I fix with “model_obj.fit(X_train,y_train)”)
2. “model_fit” not defined (which I fix with “model_fit = model_obj.fit(X_train,y_train)”)
accuracy = 0.8191499801704696
Sorry, I missed that part of the code… thanks for pointing it out.