How to split data into training and testing in Python without sklearn

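Splitting a dataset into training and testing subsets lets you evaluate a model on data it has not seen during training, which is how overfitting and underfitting are detected. In this tutorial, you will learn how to split data into training and testing sets in Python without sklearn.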

Splitting the data into training and testing in Python without sklearn

Steps involved:

Importing the packages
Load the dataset
Shuffling the dataset
Splitting the dataset

As an example, we use this dataset: mushrooms.csv

Importing packages:

import pandas as pd
import numpy as np
import math

Reading the dataset:

df = pd.read_csv("/content/mushrooms.csv")
df.shape

(8124, 23)

The dataset has 8124 rows and 23 columns.

Shuffling the dataset:

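Rows in a file are often stored in some order (for example, grouped by class), so cutting the data at a fixed position could give training and testing sets that do not resemble each other. Shuffling the rows first makes both subsets more representative of the whole dataset. We can shuffle the data frame by using the sample() method with frac = 1, which returns all rows in random order: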

df = df.sample(frac = 1)
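Because sample() draws a new random order on every run, the split changes each time the code is executed. The following is a minimal sketch, assuming a small made-up data frame (toy_df and its values are hypothetical, not the mushroom data), of how passing random_state makes the shuffle repeatable and how reset_index(drop=True) renumbers the rows afterwards:

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy_df = pd.DataFrame({"class": list("epepep"), "odor": list("abcdef")})

# random_state fixes the shuffle order; reset_index(drop=True) discards the old row labels
shuffled_a = toy_df.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = toy_df.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))   # same seed gives the same order
print(len(shuffled_a) == len(toy_df))  # shuffling keeps every row
```

A fixed seed is useful when results need to be compared across runs; leaving it out gives a fresh shuffle each time.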

Exploring the dataset shows that the “class” attribute is the dependent variable and the remaining attributes are independent variables. Let X hold the independent variables and y the dependent variable:

X = np.array(df.drop(columns=["class"]))
print("Shape of X:",X.shape)
print(X)

Shape of X: (8124, 22)
[['f' 's' 'n' ... 'w' 'v' 'd']
['f' 'f' 'g' ... 'h' 'y' 'p']
['f' 'y' 'c' ... 'w' 'c' 'd']
...
['f' 'f' 'g' ... 'k' 's' 'g']
['x' 'f' 'n' ... 'k' 'y' 'd']
['f' 'y' 'y' ... 'h' 'v' 'g']]

y = np.array(df["class"])
print("Shape of y:",y.shape)
print(y)

Shape of y: (8124,)
['p' 'p' 'p' ... 'e' 'e' 'p']

User input:

The user enters the splitting factor, that is, the fraction of rows (between 0 and 1) that goes into the training set. The remaining rows form the test set.

print("Enter the splitting factor (the fraction of rows used for training):")
s_f = float(input())

Enter the splitting factor (the fraction of rows used for training):
0.8
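A value outside the range 0 to 1 would produce an empty or nonsensical split, so it may be worth checking the input before using it. This is one possible sketch; the helper name parse_split_factor is an assumption introduced here for illustration and is not part of the original code:

```python
def parse_split_factor(text):
    """Convert user input to a float strictly between 0 and 1."""
    value = float(text)  # raises ValueError for non-numeric input
    if not 0.0 < value < 1.0:
        raise ValueError(f"splitting factor must be between 0 and 1, got {value}")
    return value

print(parse_split_factor("0.8"))

# An out-of-range value is rejected instead of silently producing a bad split
try:
    parse_split_factor("1.5")
except ValueError as err:
    print("Rejected:", err)
```

In the tutorial's flow, the call would be s_f = parse_split_factor(input()).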

Splitting:

Let us take 0.8 as the splitting factor. The training data then contains 80% of the rows: 0.8 × 8124 = 6499.2, which is rounded down to 6499. The test data contains the remaining 8124 − 6499 = 1625 rows.

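# Number of training rows, rounded down to a whole number
n_train = math.floor(s_f * X.shape[0])
# The test set is whatever is left, so the two counts always add up to the total
n_test = X.shape[0] - n_train
# The first n_train rows form the training set, the rest form the test set
X_train = X[:n_train]
y_train = y[:n_train]
X_test = X[n_train:]
y_test = y[n_train:]
print("Total Number of rows in train:",X_train.shape[0])
print("Total Number of rows in test:",X_test.shape[0])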

Total Number of rows in train: 6499
Total Number of rows in test: 1625
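The shuffling and slicing steps can also be wrapped in a single reusable function. The sketch below is one way to do it and is not the tutorial's exact code: the function name split_train_test, its seed parameter, and the toy arrays are assumptions, and it shuffles with a NumPy index permutation instead of df.sample() so that it works directly on arrays.

```python
import math
import numpy as np

def split_train_test(X, y, split_factor, seed=None):
    """Shuffle X and y together, then cut them at floor(split_factor * n_rows)."""
    X = np.asarray(X)
    y = np.asarray(y)
    if X.shape[0] != y.shape[0]:
        raise ValueError("X and y must have the same number of rows")
    rng = np.random.default_rng(seed)
    # A single permutation of row indices keeps each row aligned with its label
    order = rng.permutation(X.shape[0])
    n_train = math.floor(split_factor * X.shape[0])
    train_idx, test_idx = order[:n_train], order[n_train:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Hypothetical toy data: 10 rows with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)
X_tr, X_te, y_tr, y_te = split_train_test(X_demo, y_demo, 0.8, seed=0)
print(X_tr.shape, X_te.shape)
```

Indexing both arrays with the same permutation is the design choice that matters here: shuffling X and y separately would break the pairing between features and labels.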

Before splitting:

print("X:")
print(X)
print("y:")
print(y)

X:
[['f' 's' 'n' ... 'w' 'v' 'd']
['f' 'f' 'g' ... 'h' 'y' 'p']
['f' 'y' 'c' ... 'w' 'c' 'd']
...
['f' 'f' 'g' ... 'k' 's' 'g']
['x' 'f' 'n' ... 'k' 'y' 'd']
['f' 'y' 'y' ... 'h' 'v' 'g']]
y:
['p' 'p' 'p' ... 'e' 'e' 'p']

After splitting:

print("X_train:")
print(X_train)
print("\ny_train:")
print(y_train)
print("\nX_test")
print(X_test)
print("\ny_test")
print(y_test)

X_train:
[['f' 's' 'n' ... 'w' 'v' 'd']
['f' 'f' 'g' ... 'h' 'y' 'p']
['f' 'y' 'c' ... 'w' 'c' 'd']
...
['f' 'y' 'w' ... 'n' 's' 'u']
['f' 'f' 'g' ... 'n' 'v' 'd']
['f' 's' 'n' ... 'w' 'v' 'l']]

y_train:
['p' 'p' 'p' ... 'p' 'e' 'p']

X_test
[['x' 'f' 'g' ... 'w' 'n' 'g']
['f' 'f' 'e' ... 'n' 'y' 'd']
['f' 'y' 'n' ... 'w' 'v' 'd']
...
['f' 'f' 'g' ... 'k' 's' 'g']
['x' 'f' 'n' ... 'k' 'y' 'd']
['f' 'y' 'y' ... 'h' 'v' 'g']]

y_test
['e' 'e' 'p' ... 'e' 'e' 'p']
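Printing the arrays shows that a split happened, but not whether the two subsets look alike. One quick check is to compare the share of each class in the training and testing labels. The sketch below uses hypothetical label arrays (y_train_demo and y_test_demo) and a helper named class_proportions, both introduced here only for illustration; with the real data you would pass y_train and y_test instead.

```python
import numpy as np

def class_proportions(labels):
    """Return a dict mapping each class label to its share of the rows."""
    values, counts = np.unique(labels, return_counts=True)
    return {str(v): float(c) / len(labels) for v, c in zip(values, counts)}

# Hypothetical labels standing in for y_train and y_test
y_train_demo = np.array(["p", "e", "e", "p", "e", "p", "e", "p"])
y_test_demo = np.array(["e", "p"])

print(class_proportions(y_train_demo))
print(class_proportions(y_test_demo))
```

If the proportions differ a lot between the two subsets, the data may not have been shuffled, or the test set may simply be too small to be representative.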

In this way, we have split the dataset into X_train, X_test, y_train, and y_test without using sklearn.