Prepare your own data set for image classification in Machine learning Python

Many open source data sets for machine learning are available on the Internet, but for your own project you may need to build your own data set. Today, let’s discuss how to prepare a data set for image classification.

Collect Image data

The first task is to collect the data (images). You can take the pictures with a camera or download them from Google Images (copyrighted images need permission). There are many browser plugins for downloading images in bulk from Google Images. Suppose you want to distinguish cars from bikes: download the images of cars into one folder and the images of bikes into another.

Process the Data

The downloaded images may have varying pixel dimensions, but training the model requires images of the same size. So let’s resize the images using simple Python code. We will use the Pillow library, which is imported as PIL; it is not part of the standard library and can be installed with pip install Pillow.


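from PIL import Image
import os

def resize_multiple_images(src_path, dst_path):
    # src_path is the folder where the original images are saved;
    # dst_path is the folder where the resized copies are written.
    if not os.path.exists(dst_path):
        os.makedirs(dst_path)
    for filename in os.listdir(src_path):
        try:
            img = Image.open(os.path.join(src_path, filename))
            new_img = img.resize((64,64))
            new_img.save(os.path.join(dst_path, filename))
            print('Resized and saved {} successfully.'.format(filename))
        except OSError:
            # Pillow could not read or write this entry as an image; skip it.
            continue

src_path = '<Enter the source path>'
dst_path = '<Enter the destination path>'
resize_multiple_images(src_path, dst_path)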

The images should be small so that the number of features does not grow too large when they are fed into a neural network. For example, a 600x800 color image gives the network 600*800*3 = 1,440,000 input values per image, which is quite large. A 64x64 color image, on the other hand, has only 64*64*3 = 12,288 input values, which is far cheaper to process. Now that the images are resized, we need to rename the files so that the data set is properly labeled.
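As a quick check of the arithmetic above, here is a minimal sketch that counts the values in one flattened image. The helper name count_input_values is hypothetical and only used for illustration.

```python
def count_input_values(width, height, channels=3):
    """Return how many numbers one flattened image of this size contains."""
    return width * height * channels

large = count_input_values(600, 800)  # the 600x800 color image from the text
small = count_input_values(64, 64)    # the 64x64 target size used for resizing

print(large)                    # 1440000
print(small)                    # 12288
print(round(large / small, 1))  # 117.2
```

The resized image is roughly 117 times smaller as an input. With the size settled, the renaming step comes next.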


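import os

def rename_multiple_files(path,obj):
    # Rename every file in path to <obj><index><extension>, e.g. car0.jpg, car1.jpg.
    # Run this once per folder; a second run could overwrite files already renamed.
    i = 0
    for filename in os.listdir(path):
        try:
            old_path = os.path.join(path, filename)
            # f (the old name without its extension) is not needed; only the extension is kept.
            f,extension = os.path.splitext(old_path)
            os.rename(old_path, os.path.join(path, obj + str(i) + extension))
            i += 1
            print('Rename successful.')
        except OSError:
            print('Rename failed for {}.'.format(filename))

path = '<Enter the path of objects to be renamed>'
obj = '<Enter the prefix to be added to each file. For ex. car, bike, cat, dog, etc.>'
rename_multiple_files(path, obj)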

Now that the data is processed, merge the contents of the ‘car’ and ‘bikes’ folders into a single folder named ‘train set’. Then pull some car images and some bike images out of ‘train set’ and put them in a new folder named ‘test set’. Next, we have to import the images into our Python code so that each color image is represented as numbers and image classification algorithms can be applied to it.
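Pulling images out of ‘train set’ by hand works, but it can also be scripted. The following is only a sketch under stated assumptions: the function name move_random_subset, the 20% test fraction, and the fixed seed are all illustrative choices, not something prescribed by this tutorial.

```python
import os
import random
import shutil

def move_random_subset(train_dir, test_dir, test_fraction=0.2, seed=0):
    """Move a random fraction of the files in train_dir into test_dir.

    test_fraction and seed are assumed defaults; adjust them as needed.
    Returns the list of file names that were moved.
    """
    os.makedirs(test_dir, exist_ok=True)
    # Only consider regular files, in a stable order, so the seed is reproducible.
    filenames = sorted(
        name for name in os.listdir(train_dir)
        if os.path.isfile(os.path.join(train_dir, name))
    )
    rng = random.Random(seed)
    n_test = int(len(filenames) * test_fraction)
    chosen = rng.sample(filenames, n_test)
    for name in chosen:
        shutil.move(os.path.join(train_dir, name), os.path.join(test_dir, name))
    return chosen
```

A call might look like move_random_subset('train set', 'test set'). Because the files are picked at random from the merged folder, it is worth checking afterwards that the test folder contains both cars and bikes.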

Import Images in the Form of Arrays

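from PIL import Image
import os
import numpy as np
import re

def get_data(path):
    all_images_as_array = []
    label = []
    for filename in os.listdir(path):
        try:
            # Convert to RGB so every image has exactly three channels.
            img = Image.open(os.path.join(path, filename)).convert('RGB')
            np_array = np.asarray(img)
            l,b,c = np_array.shape
            # Flatten the image into a single row of l*b*c values.
            np_array = np_array.reshape(l*b*c,)
            all_images_as_array.append(np_array)
            # Files were renamed with a prefix earlier: 1 for 'car', 0 for anything else.
            if re.match(r'car',filename):
                label.append(1)
            else:
                label.append(0)
        except OSError:
            # Not a readable image; skip it so images and labels stay aligned.
            continue

    return np.array(all_images_as_array), np.array(label)

path_to_train_set = '<Enter the location of train set>'
path_to_test_set = '<Enter the location of test set>'
X_train,y_train = get_data(path_to_train_set)
X_test, y_test = get_data(path_to_test_set)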

print('X_train set : ',X_train)
print('y_train set : ',y_train)
print('X_test set : ',X_test)
print('y_test set : ',y_test)
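Printing whole arrays can be hard to read, so it may help to look at their shapes instead. The sketch below uses randomly generated arrays as stand-ins for real 64x64 color images, and the labels are hypothetical; it only illustrates what flattening and stacking produce.

```python
import numpy as np

# Three fake 64x64 RGB "images" with random pixel values from 0 to 255.
rng = np.random.default_rng(0)
fake_images = [
    rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8) for _ in range(3)
]

# Flatten each image into one long row, as the reshape call in get_data does.
flattened = [img.reshape(-1) for img in fake_images]

# Stacking the rows gives one row per image and one column per pixel value.
X = np.array(flattened)
y = np.array([1, 0, 1])  # hypothetical labels: 1 = car, 0 = bike

print(X.shape)  # (3, 12288)
print(y.shape)  # (3,)
```

With real data, X_train.shape should likewise be (number of images, 12288) when every image is a 64x64 color image.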

Woah! You made it. Your image classification data set is ready to be fed to the neural network model. Feel free to comment below.

8 responses to “Prepare your own data set for image classification in Machine learning Python”

  1. Divyesh Srivastava says:

    Nice post

  2. Mrityunjay Tripathi says:

    Thanks Divyesh!

  3. Dharmendra says:

    Helpful for fresher…thanks too

  4. Goldy Mazumdar says:

    very useful…..just what i was looking for.
    Thank you

  5. karolina says:

    f,extension is a variable right? Whats the purpose of f, ?

    • Mrityunjay Tripathi says:

      The RHS part returns the name of the file and the extension, so the first part which is ‘f’ is the name of the file, and the second part which is ‘extension’ is the extension of the file. Since we are going to rename the file, the old name doesn’t matter to us. That is why ‘f’ is not used further in the code but we need to preserve the extension of the file. I hope it was helpful.

  6. kpiz says:

    Hi Sir,
    I have two datasets train and test in a separate folder of eye image .
    Two csv file train_csv and test_csv with their label male and female.
    how can I load the file and proceed as in this tutorial?

    We were taught to load and prepare the dataset this way

    import cv2
    import pandas as pd

    labels = pd.read_csv('/content/content/eye_gender_data/Training_set.csv') # loading the labels
    file_paths = [[fname, '/content/content/eye_gender_data/train/' + fname] for fname in labels['filename']]
    images = pd.DataFrame(file_paths, columns=['filename', 'filepaths'])
    train_data = pd.merge(images, labels, how = 'inner', on = 'filename')

    data = [] # initialize an empty list
    image_size = 100 # image size taken is 100 here. one can take other size too
    for i in range(len(train_data)):
        img_array = cv2.imread(train_data['filepaths'][i], cv2.IMREAD_GRAYSCALE) # reading the image as gray scale
        new_img_array = cv2.resize(img_array, (image_size, image_size)) # resizing the image array
        data.append([new_img_array, train_data['label'][i]])

    From this step I don’t know how to carry out the data preprocessing. Any help will be appreciated.

    Thank you.

  7. Ekow Yamoah says:

    This has helped me a lot!!!!
    Thank you verrry much!!
