Prepare your own data set for image classification in Machine Learning with Python
There are a large number of open source data sets available on the Internet for machine learning, but while working on your own project you may need your own data set. Today, let’s discuss how we can prepare our own data set for image classification.
Collect Image data
The first and foremost task is to collect data (images). You can take the pictures with a camera or download them from Google Images (copyrighted images need permission). There are many browser plugins for downloading images in bulk from Google Images. Suppose you want to classify cars versus bikes. Download the images of cars into one folder and the images of bikes into another folder.
Process the Data
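The downloaded images may have different pixel dimensions, but for training the model we need images of the same size. So let’s resize the images using simple Python code. We will be using PIL, the Python Imaging Library. It is not part of the standard library; it is installed through the Pillow package (pip install Pillow).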
Resize
from PIL import Image
import os


def resize_multiple_images(src_path, dst_path):
    # src_path is the folder where the original images are saved;
    # dst_path is the folder where the resized copies are written.
    os.makedirs(dst_path, exist_ok=True)
    for filename in os.listdir(src_path):
        try:
            img = Image.open(os.path.join(src_path, filename))
            new_img = img.resize((64, 64))
            new_img.save(os.path.join(dst_path, filename))
            print('Resized and saved {} successfully.'.format(filename))
        except (OSError, ValueError):
            # Skip anything that cannot be read or saved as an image.
            continue


src_path = '<Enter the source path>'
dst_path = '<Enter the destination path>'
resize_multiple_images(src_path, dst_path)
The images should be small so that the number of features stays manageable when the images are fed into a neural network. For example, a color image of 600x800 pixels gives the network 600*800*3 = 1,440,000 input values to handle, which is quite large. A color image of 64x64 pixels, on the other hand, has only 64*64*3 = 12,288 input values, which is far lower and computationally cheaper. Now that we have resized the images, we need to rename the files so that the data set is properly labeled.
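The arithmetic above can be checked with a few lines of Python. This is only a small sketch; the helper name flattened_feature_count is made up for illustration and is not part of any library.

```python
def flattened_feature_count(width, height, channels=3):
    """Number of values in one image once it is flattened into a vector."""
    return width * height * channels


# A color (3-channel) image at the two sizes discussed in the text.
large = flattened_feature_count(600, 800)
small = flattened_feature_count(64, 64)

print(large)          # 1440000
print(small)          # 12288
print(large / small)  # 117.1875
```

Resizing to 64x64 therefore cuts the input to less than one hundredth of the original size.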
Rename
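import os


def rename_multiple_files(path, obj):
    # Rename every file in `path` to <obj><index><extension>, e.g. car0.jpg.
    i = 0
    for filename in os.listdir(path):
        try:
            # Only the extension is kept; the old base name `f` is discarded.
            f, extension = os.path.splitext(filename)
            src = os.path.join(path, filename)
            dst = os.path.join(path, obj + str(i) + extension)
            os.rename(src, dst)
            print('Rename successful.')
        except OSError:
            print('Could not rename {}.'.format(filename))
        i += 1


path = '<Enter the path of objects to be renamed>'
obj = '<Enter the prefix to be added to each file. For ex. car, bike, cat, dog, etc.>'
rename_multiple_files(path, obj)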
Now that we have processed our data, merge the contents of the ‘car’ and ‘bikes’ folders and name the result ‘train set’. Pull out some images of cars and some of bikes from the ‘train set’ folder and put them in a new folder, ‘test set’. Next we have to import the images into our Python code so that each color image is represented as numbers and image classification algorithms can be applied to it. The labels come from the file names: a file whose name starts with ‘car’ gets label 1, and every other file gets label 0.
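Pulling the test images out by hand works for a small data set. For a larger one, the step could be scripted roughly as follows. This is a sketch under assumptions: the function name move_random_subset, the 20% fraction and the fixed seed are illustrative choices, not something the steps above prescribe.

```python
import os
import random
import shutil


def move_random_subset(train_dir, test_dir, fraction=0.2, seed=0):
    """Move a random fraction of the files in train_dir into test_dir."""
    os.makedirs(test_dir, exist_ok=True)
    # Sort first so that the shuffle is reproducible for a given seed.
    filenames = sorted(
        name for name in os.listdir(train_dir)
        if os.path.isfile(os.path.join(train_dir, name))
    )
    rng = random.Random(seed)
    rng.shuffle(filenames)
    n_test = int(len(filenames) * fraction)
    moved = filenames[:n_test]
    for name in moved:
        shutil.move(os.path.join(train_dir, name), os.path.join(test_dir, name))
    return moved


# Example call (the folder names are placeholders):
# move_random_subset('train set', 'test set', fraction=0.2)
```

A purely random pick does not guarantee that cars and bikes end up in the test set in equal proportion. If that matters, one option is to run the function once per class before merging the folders.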
Import Images in the form of an array
from PIL import Image
import os
import numpy as np
import re


def get_data(path):
    all_images_as_array = []
    label = []
    for filename in os.listdir(path):
        try:
            # Force three channels so every flattened image has the same length.
            img = Image.open(os.path.join(path, filename)).convert('RGB')
            np_array = np.asarray(img)
            l, b, c = np_array.shape
            # Flatten the (height, width, channels) array into one row.
            np_array = np_array.reshape(l * b * c,)
            all_images_as_array.append(np_array)
            # Add the label only after the image loaded, so both lists stay aligned.
            # File names starting with 'car' get label 1; everything else gets 0.
            label.append(1 if re.match(r'car', filename) else 0)
        except (OSError, ValueError):
            continue
    return np.array(all_images_as_array), np.array(label)


path_to_train_set = '<Enter the location of train set>'
path_to_test_set = '<Enter the location of test set>'
X_train, y_train = get_data(path_to_train_set)
X_test, y_test = get_data(path_to_test_set)
print('X_train set : ', X_train)
print('y_train set : ', y_train)
print('X_test set : ', X_test)
print('y_test set : ', y_test)
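To get a feel for the shape of the arrays that get_data returns, here is a small sketch that uses two synthetic 64x64 color images in place of real photos. The arrays and labels are made up for illustration.

```python
import numpy as np

# Two synthetic 64x64 RGB "images": one all black, one all white.
images = [
    np.zeros((64, 64, 3), dtype=np.uint8),
    np.full((64, 64, 3), 255, dtype=np.uint8),
]

# Flatten each image into one row, then stack the rows into a 2-D array.
X = np.array([img.reshape(-1) for img in images])
y = np.array([1, 0])  # pretend the first image is a car and the second a bike

print(X.shape)  # (2, 12288)
print(y.shape)  # (2,)
```

Each row of X is one image and each column is one pixel value, so a train set of N resized images should come out with shape (N, 12288).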
Woah! You made it. Your image classification data set is ready to be fed to the neural network model. Feel free to comment below.
Nice post
Thanks Divyesh!
Helpful for fresher…thanks too
very useful…..just what i was looking for.
Thank you
f,extension is a variable, right? What’s the purpose of f?
The RHS returns two values: the name of the file without its extension, and the extension itself. So the first part, ‘f’, is the name of the file, and the second part, ‘extension’, is the extension of the file. Since we are going to rename the file, the old name doesn’t matter to us. That is why ‘f’ is not used further in the code, but we do need to preserve the extension of the file. I hope this was helpful.
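A quick way to see this is to call os.path.splitext on a sample path. The file name below is made up for illustration.

```python
import os

f, extension = os.path.splitext('cars/IMG_0042.jpg')
print(f)          # cars/IMG_0042
print(extension)  # .jpg

# Only the extension is reused when the new name is built.
new_name = 'car' + str(0) + extension
print(new_name)   # car0.jpg
```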
Hi Sir,
I have two data sets of eye images, train and test, in separate folders.
There are two CSV files, train_csv and test_csv, with the labels male and female.
How can I load the files and proceed as in this tutorial?
We were taught to load and prepare the data set this way:
import cv2
import pandas as pd

labels = pd.read_csv("/content/content/eye_gender_data/Training_set.csv")  # loading the labels
file_paths = [[fname, '/content/content/eye_gender_data/train/' + fname] for fname in labels['filename']]
images = pd.DataFrame(file_paths, columns=['filename', 'filepaths'])
train_data = pd.merge(images, labels, how='inner', on='filename')
data = []  # initialize an empty list
image_size = 100  # image size taken is 100 here. one can take other size too
for i in range(len(train_data)):
    img_array = cv2.imread(train_data['filepaths'][i], cv2.IMREAD_GRAYSCALE)  # reading the image as gray scale
    new_img_array = cv2.resize(img_array, (image_size, image_size))  # resizing the image array
    data.append([new_img_array, train_data['label'][i]])
From this step I don’t know how to carry out the data preprocessing. Any help will be appreciated.
Thank you.
This has helped me a lot!!!!
Thank you verrry much!!