Prepare your own data set for image classification in Machine learning Python


There are many open-source data sets available on the Internet for Machine Learning, but for your own project you may need your own data set. Today, let’s discuss how we can prepare our own data set for image classification.

Collect Image data

The first task is to collect data (images). You can take pictures with a camera or download them from Google Images (copyrighted images need permission). There are many browser plugins for downloading images in bulk from Google Images. Suppose you want to classify cars versus bikes: download the car images into one folder and the bike images into another.

Process the Data

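The downloaded images may have different pixel dimensions, but to train the model we need images of the same size. So let’s resize them using simple Python code. We will be using the PIL library, which is provided by the third-party Pillow package (installed with pip install Pillow) and is not part of the standard library.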


Resize

from PIL import Image
import os

def resize_multiple_images(src_path, dst_path, size=(64, 64)):
    # src_path is the folder where the original images are saved;
    # dst_path is the folder where the resized copies are written.
    os.makedirs(dst_path, exist_ok=True)
    for filename in os.listdir(src_path):
        try:
            img = Image.open(os.path.join(src_path, filename))
            new_img = img.resize(size)
            new_img.save(os.path.join(dst_path, filename))
            print('Resized and saved {} successfully.'.format(filename))
        except (OSError, ValueError) as err:
            # Skip files that cannot be read or saved as images, but report them.
            print('Skipped {}: {}'.format(filename, err))

src_path = '<Enter the source path>'
dst_path = '<Enter the destination path>'
resize_multiple_images(src_path, dst_path)

The images should be small so that the number of features fed into the neural network does not become too large. For example, a color image of 600x800 pixels gives the network 600*800*3 = 1,440,000 input values per image, which is quite large. A color image of 64x64 pixels, on the other hand, gives only 64*64*3 = 12,288 input values, which is far lower and much cheaper to compute. Now that we have resized the images, we need to rename the files so that the data set is properly labeled.
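
The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch; the helper name num_input_features is made up for illustration and is not part of any library.

```python
def num_input_features(width, height, channels=3):
    """Number of values in one flattened image (one per pixel per channel)."""
    return width * height * channels

large = num_input_features(600, 800)  # the 600x800 color image
small = num_input_features(64, 64)    # the 64x64 color image

print(large)  # 1440000
print(small)  # 12288
```

Dividing the two numbers shows that the larger image carries roughly 117 times as many values per image as the 64x64 one.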

Rename

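import os

def rename_multiple_files(path, obj):
    # Rename every file in path to <obj><index><extension>, e.g. car0.jpg.
    for i, filename in enumerate(sorted(os.listdir(path))):
        # splitext returns (name, extension); only the extension is reused.
        f, extension = os.path.splitext(filename)
        src = os.path.join(path, filename)
        dst = os.path.join(path, obj + str(i) + extension)
        if src != dst and os.path.exists(dst):
            # Do not overwrite a file that already has the target name.
            print('Skipped {}: target name already exists.'.format(filename))
            continue
        try:
            os.rename(src, dst)
            print('Rename successful.')
        except OSError as err:
            print('Could not rename {}: {}'.format(filename, err))

path = '<Enter the path of objects to be renamed>'
obj = '<Enter the prefix to be added to each file. For ex. car, bike, cat, dog, etc.>'
rename_multiple_files(path, obj)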
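
The rename function relies on os.path.splitext to keep each file's extension while the old name is thrown away. Here is a small sketch of what that call returns; the file name used is made up purely for illustration.

```python
import os

# 'images/car_photo.jpg' is a made-up path used only for illustration.
name, extension = os.path.splitext('images/car_photo.jpg')
print(name)       # images/car_photo
print(extension)  # .jpg

# Build a new name from a prefix and a counter, keeping the extension.
new_name = 'car' + str(0) + extension
print(new_name)   # car0.jpg
```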

Now that we have processed our data, merge the contents of the ‘car’ and ‘bikes’ folders into one folder and name it ‘train set’. Then move some car images and some bike images out of the ‘train set’ folder into a new folder, ‘test set’. Next, we have to import the images into our Python code so that each color image is represented as numbers and image classification algorithms can be applied.
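
Moving the test images can be done by hand, but it can also be scripted. The sketch below is one possible way, not the only one: the function name move_test_split, the 20% fraction, and the fixed random seed are all assumptions chosen for illustration. It picks a random share of the files for each prefix, so both classes end up in the test set. The demo runs on a throwaway folder of empty placeholder files rather than real images.

```python
import os
import random
import shutil
import tempfile

def move_test_split(train_dir, test_dir, prefixes=('car', 'bike'),
                    fraction=0.2, seed=0):
    """Move a random fraction of the files for each prefix into test_dir."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    os.makedirs(test_dir, exist_ok=True)
    moved = []
    for prefix in prefixes:
        names = sorted(f for f in os.listdir(train_dir) if f.startswith(prefix))
        for name in rng.sample(names, int(len(names) * fraction)):
            shutil.move(os.path.join(train_dir, name),
                        os.path.join(test_dir, name))
            moved.append(name)
    return moved

# Demo: 10 placeholder 'car' files and 10 placeholder 'bike' files.
root = tempfile.mkdtemp()
train_dir = os.path.join(root, 'train set')
test_dir = os.path.join(root, 'test set')
os.makedirs(train_dir)
for i in range(10):
    for prefix in ('car', 'bike'):
        open(os.path.join(train_dir, prefix + str(i) + '.jpg'), 'w').close()

moved = move_test_split(train_dir, test_dir)
print(len(os.listdir(train_dir)), len(os.listdir(test_dir)))  # 16 4
```

On a real data set, train_dir and test_dir would point at your own folders instead of the temporary ones created here.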

Import Images in form of array

from PIL import Image
import os
import numpy as np

def get_data(path):
    all_images_as_array = []
    label = []
    for filename in sorted(os.listdir(path)):
        try:
            # Force three channels so every image flattens to the same length.
            img = Image.open(os.path.join(path, filename)).convert('RGB')
        except OSError:
            # Skip files that cannot be read as images.
            continue
        # Flatten (height, width, channels) into a single row of pixel values.
        all_images_as_array.append(np.asarray(img).reshape(-1))
        # Append the label only after the image loaded, so both lists stay aligned.
        # Files renamed with the 'car' prefix get label 1, everything else 0.
        label.append(1 if filename.startswith('car') else 0)

    return np.array(all_images_as_array), np.array(label)

path_to_train_set = '<Enter the location of train set>'
path_to_test_set = '<Enter the location of test set>'
X_train, y_train = get_data(path_to_train_set)
X_test, y_test = get_data(path_to_test_set)

print('X_train set : ', X_train)
print('y_train set : ', y_train)
print('X_test set : ', X_test)
print('y_test set : ', y_test)
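
To see what happens to a single image on its way into the array, here is a small sketch. It uses a synthetic 64x64 image created in memory instead of a real photo, which is an assumption made so that the snippet runs without any files on disk.

```python
import numpy as np
from PIL import Image

# A synthetic, solid red 64x64 RGB image stands in for one resized photo.
img = Image.new('RGB', (64, 64), color=(255, 0, 0))

pixels = np.asarray(img)    # shape (64, 64, 3): height, width, channels
flat = pixels.reshape(-1)   # one row of 64*64*3 = 12288 values

print(pixels.shape)  # (64, 64, 3)
print(flat.shape)    # (12288,)
print(flat[:3])      # [255   0   0]
```

If every image in a folder is 64x64 and in color, X_train should therefore come out with one row of 12,288 values per image, and y_train with one label per row.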

Woah! You made it. Your image classification data set is ready to be fed to the neural network model. Feel free to comment below.

8 responses to “Prepare your own data set for image classification in Machine learning Python”

Divyesh Srivastava says:

May 27, 2019 at 8:36 am

Nice post

Mrityunjay Tripathi says:

May 27, 2019 at 10:51 am

Thanks Divyesh!

Dharmendra says:

May 27, 2019 at 12:40 pm

Helpful for fresher…thanks too

Goldy Mazumdar says:

November 9, 2020 at 2:47 pm

very useful…..just what i was looking for.
Thank you

karolina says:

July 2, 2021 at 1:54 am

f,extension is a variable right? Whats the purpose of f, ?

- Mrityunjay Tripathi says:
  
  July 4, 2021 at 10:59 pm
  
  The RHS part returns the name of the file and the extension, so the first part which is ‘f’ is the name of the file, and the second part which is ‘extension’ is the extension of the file. Since we are going to rename the file, the old name doesn’t matter to us. That is why ‘f’ is not used further in the code but we need to preserve the extension of the file. I hope it was helpful.
  
kpiz says:

July 30, 2021 at 7:40 pm

Hi Sir,
I have two datasets of eye images, train and test, each in a separate folder.
There are two csv files, train_csv and test_csv, with the labels male and female.
How can I load the files and proceed as in this tutorial?

We were taught to load and prepare the dataset this way

import cv2
import pandas as pd

labels = pd.read_csv("/content/content/eye_gender_data/Training_set.csv") # loading the labels
file_paths = [[fname, '/content/content/eye_gender_data/train/' + fname] for fname in labels['filename']]
images = pd.DataFrame(file_paths, columns=['filename', 'filepaths'])
train_data = pd.merge(images, labels, how = 'inner', on = 'filename')

data = [] # initialize an empty list
image_size = 100 # image size taken is 100 here. one can take other size too
for i in range(len(train_data)):
    img_array = cv2.imread(train_data['filepaths'][i], cv2.IMREAD_GRAYSCALE) # reading the image as gray scale
    new_img_array = cv2.resize(img_array, (image_size, image_size)) # resizing the image array
    data.append([new_img_array, train_data['label'][i]])

From this step I don’t know how to carry out the data preprocessing. Any help will be appreciated.

Thank you.

Ekow Yamoah says:

May 7, 2022 at 6:16 pm

This has helped me a lot!!!!
Thank you verrry much!!

