# How to Make Predictions with scikit-learn in Python

In this post, I’ll discuss, “How to make predictions using scikit-learn” in Python.

**How to Install “scikit-learn” :**

I’ll be using Python version

`3.7.6 (default, Dec 19 2019, 23:50:13) \n[GCC 7.4.0]`

and scikit-learn versio*n, sklearn.__version__ '0.22'*

In Windows :

`pip install scikit-learn`

In Linux :

`pip install --user scikit-learn`

## Importing scikit-learn into your Python code

`import sklearn`

## How to predict Using scikit-learn in Python:

`scikit-learn`

can be used in making **the Machine Learning** model, both for **supervised** and **unsupervised** ( *and some semi-supervised problems) t*o predict as well as to determine the accuracy of a model!

- To solve Regression problems (Linear, Logistic, multiple, polynomial regression)
- Fit and Evaluate the model
- For pre-processing a data available
- In feature extraction from categorical variables
- For Non-Linear Classification (in Decision Trees)
- In Clustering analysis

##### And more other advanced applications such as face-recognition, hand-writing recognition, etc…

## Starting With a Simple Example:-

For Example, you have data on cake sizes and their costs :

We can easily predict the price of a “cake” given the diameter :

# program to predict the price of cake using linear regression technique from sklearn.linear_model import LinearRegression import numpy as np # Step 1 : Training data x=[[6],[8],[10],[14],[18]] # cake size (diameter) in inches y=[[7],[9],[13],[17.5],[18]] # cake price in dollars # step 2: Create and fit the model model = LinearRegression() model.fit(x,y) size=int(input('Enter the size of the cake: ')) #step 3: make a prediction print(f'The price of a {size}" cake would be ${model.predict(np.array([size]).reshape(1,-1))[0][0]:.02f}')

### To Evaluate the model and find the fitness of the model:

To find out, that how good the prediction is,

you use the following test data set :

And use the following code :

## r-square using scikit learn x_test=[8,9,11,16,12] # test y_test=[11,8.5,15,18,11] # test x=[[6],[8],[10],[14],[18]] # cake size (diameter) in inches y=[[7],[9],[13],[17.5],[18]] # cake price in dollars model= LinearRegression() model.fit(x,y) r_square_value=model.score(np.array(x_test).reshape(-1,1),np.array(y_test).reshape(-1,1)) print(f'r-square value from Linear Regression: {r_square_value}')

And the output is:

** summary: **Till now you have learned to predict the outcome of any value if it is related linearly…

### Multiple Linear Regression

But suppose the cake price depends on the size of the toppings as well as the size of the cake! Then you will have to use:

And Use the following code to plot a graph against the training data set:

import matplotlib.pyplot as plt from sklearn.linear_model import LinearRegression x1=[[6,2],[8,1],[10,0],[14,2],[18,0]] # cake size (diameter) in inches y=[[7],[9],[13],[17.5],[18]] # cake price in dollars model= LinearRegression() model.fit(x1,y) x1_test=[[8,2],[9,0],[11,2],[16,2],[12,0]] y_test=[[11],[8.5],[15],[18],[11]] f=plt.figure() ax=f.add_subplot(111) plt.xlabel('cake size and toppings') plt.ylabel('cake price') predictions = model.predict(x1_test) v1,v2=[],[] for i,prediction in enumerate(predictions): print(f'predicted value : {prediction[0]:.02f} vs target value: {y_test[i][0]}') v1.append(prediction[0]) v2.append(y_test[i][0]) print(f'R-squared : {model.score(x1_test,y_test)}') ax.plot(v1,color='g',linestyle='--') ax.plot(v2,color='r',linestyle='--') plt.grid(True,linestyle='-',linewidth='0.5') plt.show() plt.close(f)

you’ll get this graph :

summary:So far you have learned about predicting data sets that are linearly related to some of the features.

Now you’ll learn how toExtract Featuresfrom Image andPre-processdata.

### Extracting points of Interest from an Image and Preprocessing

**Extracting Features :**

# extracting points of interest from an image # import os import numpy as np from skimage.feature import corner_harris,corner_peaks from skimage.color import rgb2gray import matplotlib.pyplot as plt import skimage.io as io from skimage.exposure import equalize_hist def view_corners(corners,image): f = plt.figure() plt.gray() # converting to grayscale plt.imshow(image) y_corner , x_corner = zip(*corners) plt.plot(x_corner,y_corner,'x') plt.xlim(0, image.shape[1]) f.set_size_inches(np.array(f.get_size_inches()) * 2.0) # to scale the display plt.show() if __name__=='__main__': my_image= io.imread('/home/tuhin/Pictures/crow image.jpg') my_image=equalize_hist(rgb2gray(my_image)) corners = corner_peaks(corner_harris(my_image),min_distance=2) view_corners(corners , my_image)

*image used: *

*the graph that you’ll see:*

This code is capable enough of detecting the points of interest from an image,

thus it is highly relevant to use in case of HD RGB images(with lots of pixels).

**Preprocessing:**** **Generally, predictive models perform well, when they are trained using preprocessed datasets.

**#**

**note:**These types of datasets have zero mean and unit variance.In scikit-learn, preprocessing can be done on a numpy array,

like this:

# preprocessing from sklearn import preprocessing import numpy as np data = np.array([[0,1,12,4,0,0],[12,4,5,6,0,1],[0,0,0,1,1,0]]) print(preprocessing.scale(data))

Output:

[[-0.70710678 -0.39223227 1.28684238 0.16222142 -0.70710678 -0.70710678] [ 1.41421356 1.37281295 -0.13545709 1.13554995 -0.70710678 1.41421356] [-0.70710678 -0.98058068 -1.15138528 -1.29777137 1.41421356 -0.70710678]]

### Logistic Regression:

This is a special case of the generalized “linear model” of scikit-learn.

This is used in classification purposes.

A very common example is * “spam filtering”* in messages.

Let’s take an example of the data set:

Here is a collection of some spam messages and some non-spam(*ham) *messages.

we’ll take the help of scikit-learn to classify spam-ham messages!

import pandas as pd from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.linear_model.logistic import LogisticRegression from sklearn.model_selection import train_test_split df = pd.read_csv('/home/tuhin/Downloads/smsspamcollection/SMSSpam.csv', delimiter='\t',header=None) print(df.head(10)) x_train_raw, x_test_raw, y_train, y_test =train_test_split(df[1],df[0]) # this function will split train and test data set in 75%-25% respectively vector = TfidfVectorizer() x_train = vector.fit_transform(x_train_raw) x_test = vector.transform(x_test_raw) classifier = LogisticRegression() classifier.fit(x_train,y_train) predictions = classifier.predict(x_test) x_test_rawList = list(x_test_raw.values) # x_test_raw is in pandas dataFrame format, converting it to list count=0 for i in predictions: print(f'Prediction:{i}; Message:{x_test_rawList[count]}') count += 1

link for the full dataset:

**output:**

And this code will **predict** which one is spam and which one is not!

### DECISION HIERARCHY WITH scikit-learn

As in the case of non-linear regression, there are problems like decision trees

*And we can also solve them using scikit-learn:*

*And scikit-learn’s ‘DecisionTreeClassifier’ does the job. *

*usage:*

from sklearn.tree import DecisionTreeClassifier from sklearn.pipeline import Pipeline from sklearn.model_selection import GridSearchCV pipelining = Pipeline([('clf', DecisionTreeClassifier(criterion='entropy'))]) #setting the parameters for the GridSearch parameters = {'clf__max_depth': (150, 155, 160),'clf__min_samples_split': (1, 2, 3),'clf__min_samples_leaf': (1, 2, 3)} # note that paramets will be different for different problems grid_search = GridSearchCV(pipelining, parameters, n_jobs=-1,verbose=1, scoring='f1') predictions = grid_search.predict(x_test) # we make predictions for the test data-set, where, x_test is the test_dataset # you can get the test_data set by using train_test_split() function mentioned previously

*# note: Here we count for the F1 score, of the model and that path of decision is selected, which has the best F1 score.*

**Clustering Methods in scikit-learn:**

And there are many more clustering algorithms available under the scikit-learn module of python,

*some of the popular ones are:*

**1. k Means clustering.**

from sklearn.cluster import k_means

**2. Affinity Propagation**

**usage: **`from sklearn.cluster import affinity_propagation`

**3. Mini Batch KMeans**

*usage: *`from sklearn.cluster import MiniBatchKMeans`

**4. Spectral Clustering:**

*usage: *`from sklearn.cluster import SpectralClustering`

**5. spectral biclustering:**

*usage: *`from sklearn.cluster import `

`SpectralBiclustering`

**6. spectral-co-clustering:**

**usage:**`from sklean.cluster import SpectralCoclustering`

**#note: Many other clustering algorithms are available under “sklearn.cluster”. **

**These are some of them because it is not possible to list them in a single post!**
## Leave a Reply