Pipeline in Machine Learning with scikit-learn in Python

In this post, I’ll explain how the Pipeline technique works in scikit-learn, Python’s machine learning library.
Before getting to the uses of Pipeline, it helps to briefly answer two questions:
“What is a pipeline?”
“How do you import it in Python code?”

What Pipeline is and how it works:

A pipeline is a way of organizing machine learning code, not a machine learning algorithm in itself. Often you need to apply a series of transformations to your data before it reaches the model (feature extraction, imputation, scaling, etc.). Done by hand, each of these steps needs its own ‘fit’ and ‘transform’ calls, and you have to pass the training data through each object one by one.

With ‘sklearn.pipeline’ you can do the same in a few lines of code, which keeps the code tidy and easier to read later. It also makes tuning easier, because the whole chain of steps is configured through a single object.
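To make the "single object" idea concrete, here is a minimal sketch of a pipeline that ends in a model. The synthetic dataset from make_classification and the choice of LogisticRegression are my own assumptions for illustration; the rest of this post only uses transformers.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Scaling and the model live inside a single object
model = Pipeline(steps=[('scaler', StandardScaler()),
                        ('clf', LogisticRegression())])

model.fit(X, y)                 # fits the scaler, then the classifier
predictions = model.predict(X)  # scales X with the fitted scaler, then predicts
accuracy = model.score(X, y)    # accuracy on the training data
print(accuracy)
```

Calling fit, predict, or score on the pipeline runs every step in order, so there is no separate scaler object to keep track of.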

Pipeline: Syntax and Usage in Python code

import:

from sklearn.pipeline import Pipeline

syntax:

Pipeline(steps, memory=None, verbose=False)

‘steps’ is a list of (name, estimator) tuples, one for each fit/transform step you want to apply to the data, in the order they should run.
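A small sketch of what such a list looks like. The step names 'imputer' and 'scaler' are just labels I picked; any unique strings should work.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is a (name, estimator) tuple; the names are free-form labels
steps = [('imputer', SimpleImputer(strategy='mean')),
         ('scaler', StandardScaler())]
pipe = Pipeline(steps=steps)

# Steps keep their order and can be looked up by name
step_names = [name for name, _ in pipe.steps]
print(step_names)
print(type(pipe.named_steps['scaler']).__name__)
```

The names matter later, because they are how you address an individual step's parameters.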

**Note:

If a Pipeline has ‘N’ steps, the first ‘N-1’ steps must implement both the ‘fit’ and ‘transform’ methods, and the Nth (final) step must implement at least ‘fit’.
Otherwise, an error will be raised!
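Here is a hedged sketch of that rule being broken. I put a classifier, which has no ‘transform’ method, in front of a scaler; the tiny arrays are made up for the example. As far as I know, the exact message and whether the error appears at construction or at fit time depend on the scikit-learn version, so the sketch wraps both.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up toy data for illustration
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 2.0]])
y = np.array([0, 0, 1, 1])

error_message = None
try:
    # LogisticRegression has no 'transform', so it cannot be an intermediate step
    bad_pipe = Pipeline(steps=[('clf', LogisticRegression()),
                               ('scaler', StandardScaler())])
    bad_pipe.fit(X, y)
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Swapping the two steps, so the scaler comes first and the classifier last, gives a valid pipeline.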

Example of Code implementing pipelining and comparing it with non-pipelined code

First I’ll create a random data matrix for my model.

import sklearn.datasets
test_matrix = sklearn.datasets.make_spd_matrix(10,random_state=2)

**Note: I have used random_state=2 to get reproducible output, similar to calling random.seed().
Here sklearn.datasets.make_spd_matrix(n_dim, random_state) generates a random “symmetric positive-definite” matrix, in this case of size (10,10).
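If you want to convince yourself of those properties, a quick check along these lines should do. The variable names are mine, and the eigenvalue test is just one common way to check positive-definiteness.

```python
import numpy as np
from sklearn.datasets import make_spd_matrix

a = make_spd_matrix(10, random_state=2)
b = make_spd_matrix(10, random_state=2)

same_seed_same_matrix = np.array_equal(a, b)  # reproducible with a fixed seed
is_symmetric = np.allclose(a, a.T)            # equals its own transpose
# A symmetric matrix is positive-definite when all its eigenvalues are positive
is_positive_definite = bool(np.all(np.linalg.eigvalsh(a) > 0))
print(a.shape, same_seed_same_matrix, is_symmetric, is_positive_definite)
```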

The matrix generated above has no missing values, so to demonstrate imputation I’ll first knock out some entries at random and then fill those gaps with SimpleImputer from sklearn.impute.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Randomly mark about 10% of the entries as missing
masking_array = np.random.binomial(1, .1, test_matrix.shape).astype(bool)
test_matrix[masking_array] = np.nan
# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_array = imputer.fit_transform(test_matrix)

Here I have built a boolean NumPy mask with the same shape as the input matrix (test_matrix) and set the masked entries to np.nan to simulate missing data.
Then I used sklearn.impute.SimpleImputer to fill those NaN entries with the ‘mean’ of their column.
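Mean imputation is easier to see on a tiny array than on a 10x10 matrix. The 3x2 array below is made up for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data with one missing value in the second column
small = np.array([[1.0, np.nan],
                  [3.0, 4.0],
                  [5.0, 6.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imputer.fit_transform(small)

# The NaN in the second column becomes the mean of 4.0 and 6.0, i.e. 5.0
print(filled)
print(imputer.statistics_)  # per-column means learned during fit: [3. 5.]
```

The means are computed per column and ignore the missing entries themselves.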
Next I have to standardize the data, which often helps a model reach a better performance score.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler(copy=True,with_mean=True,with_std=True)
scaled_nd_imputed = scaler.fit_transform(imputed_array)

Now scaled_nd_imputed holds the array that is ready to be used for training and prediction.
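To see what standardizing means in practice, here is a small check on made-up data: after StandardScaler, each column should have a mean of about 0 and a standard deviation of about 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data whose columns are on very different scales
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 60.0]])

scaled = StandardScaler().fit_transform(data)

# After scaling, each column has mean ~0 and (population) standard deviation ~1
column_means = scaled.mean(axis=0)
column_stds = scaled.std(axis=0)
print(column_means, column_stds)
```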

But instead of doing all these steps separately, you can do the same thing in just 2 lines of code with Pipeline!

Easy Approach Using sklearn.pipeline.Pipeline():

pipe = Pipeline(steps=[('imputer', SimpleImputer(missing_values=np.nan,strategy='mean')),
                ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True))])
new_mat = pipe.fit_transform(test_matrix)

So the values stored in 'scaled_nd_imputed' are exactly the same as those stored in 'new_mat'.
You can verify that with NumPy, as follows:

np.array_equal(scaled_nd_imputed,new_mat)

This will return True if the two matrices generated are the same.
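For anyone who wants to run the whole comparison in one go, here is a self-contained sketch. It differs from the snippets above in one respect: I draw the mask from a seeded generator (np.random.default_rng(0), my assumption) so the missing entries are the same on every run, and I compare with np.allclose, which tolerates tiny floating-point differences.

```python
import numpy as np
from sklearn.datasets import make_spd_matrix
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # assumed seed, so the mask is reproducible
test_matrix = make_spd_matrix(10, random_state=2)
test_matrix[rng.random(test_matrix.shape) < 0.1] = np.nan

# Step by step
imputed = SimpleImputer(strategy='mean').fit_transform(test_matrix)
step_by_step = StandardScaler().fit_transform(imputed)

# Same steps through a Pipeline
pipe = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')),
                       ('scaler', StandardScaler())])
piped = pipe.fit_transform(test_matrix)

results_match = np.allclose(step_by_step, piped)
print(results_match)
```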

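**Moreover, you can change the parameters of any step in the Pipeline using the set_params method, addressing each parameter as <step name>__<parameter name>.
The syntax for using it: pipe.set_params(imputer__strategy='median')
and on older scikit-learn versions the output looks like this (recent versions print only the parameters that differ from their defaults):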

output:

Pipeline(memory=None, steps=[('imputer', SimpleImputer(add_indicator=False, copy=True, fill_value=None, missing_values=nan, strategy='median', verbose=0)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True))], verbose=False)
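Since the printed representation varies between versions, a more dependable way to confirm the change is to read the parameter back. A minimal sketch, rebuilding the same two-step pipeline so it runs on its own:

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')),
                       ('scaler', StandardScaler())])

# '<step name>__<parameter>' reaches into the named step
pipe.set_params(imputer__strategy='median')

# Read the value back, both from the step itself and through get_params
strategy_on_step = pipe.named_steps['imputer'].strategy
strategy_via_get_params = pipe.get_params()['imputer__strategy']
print(strategy_on_step, strategy_via_get_params)
```

The same double-underscore names are what you would pass to a parameter search when tuning the pipeline.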

The Jupyter notebook for this post is available here: pipeline in machine learning scikit-learn
