Pipeline in Machine Learning with scikit-learn in Python
In this post, I’ll explain how the pipeline technique works in scikit-learn, Python’s machine learning library.
Before getting into the uses of Pipeline, it helps to have brief answers to two questions:
“What is a pipeline?”
“How do you import it in Python code?”
How the Pipeline function works:
Pipeline is not a machine learning algorithm itself; it is a utility for chaining processing steps. Often you need to apply a series of transformations to your data (feature extraction, imputation, scaling, etc.). To run all of these by hand, you have to call the ‘fit’ and ‘transform’ methods repeatedly, feeding the training data to each step one by one.
With ‘sklearn.pipeline’ you can do the same thing in a few lines of code, which keeps the code tidy and easy to read afterward. It also makes the ML model easier to tune, since you can configure the whole chain through a single object!
Pipeline: Syntax and Usage in Python code
from sklearn.pipeline import Pipeline
Here, ‘steps’ is the ordered list of (name, transform) pairs you want to apply to the data.
For a Pipeline to work, if it has ‘N’ steps then the first ‘N-1’ must implement both the ‘fit’ and ‘transform’ methods (i.e. be transformers), while the final (Nth) step only needs to implement ‘fit’.
Otherwise, an error will be thrown!
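As a minimal sketch of this rule (the step names and toy data here are my own, not from the post), a Pipeline can end in an estimator that only implements ‘fit’, as long as every earlier step is a transformer:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# First N-1 steps are transformers (fit + transform);
# the final step only needs fit (here, a classifier).
model = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])

# Hypothetical toy data: 4 samples, 2 features, binary labels
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.1, 0.9]])
y = np.array([0, 1, 1, 0])

model.fit(X, y)          # one call fits every step in order
preds = model.predict(X)
print(preds)
```

Calling `fit` on the Pipeline fits and transforms through the scaler, then fits the classifier on the transformed data.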
Example of Code implementing pipelining and comparing it with non-pipelined code
First I’ll create a random data matrix for my model.
import sklearn.datasets
test_matrix = sklearn.datasets.make_spd_matrix(10, random_state=2)
**Note: I have used random_state=2 to get reproducible output; it works like a random seed.
sklearn.datasets.make_spd_matrix(dimension, random_state) generates a random “symmetric positive-definite” matrix of size (dimension, dimension), here (10, 10).
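As a quick sanity check (my own addition, not in the original post), you can confirm the shape, symmetry, and positive-definiteness of the generated matrix:

```python
import numpy as np
import sklearn.datasets

m = sklearn.datasets.make_spd_matrix(10, random_state=2)

print(m.shape)                             # (10, 10)
print(np.allclose(m, m.T))                 # symmetric
print(np.all(np.linalg.eigvalsh(m) > 0))   # all eigenvalues positive => positive-definite
```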
The generated test_matrix is complete, so next I’ll artificially mask some of its entries as missing (NaN), and then fill up those gaps:
import numpy as np
from sklearn.impute import SimpleImputer

masking_array = np.random.binomial(1, .1, test_matrix.shape).astype(bool)
test_matrix[masking_array] = np.nan
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_array = imputer.fit_transform(test_matrix)
Here I have masked the input data matrix (test_matrix) with a boolean numpy array, flagging roughly 10% of the entries as NaN, and then used sklearn.impute.SimpleImputer to fill those missing values with the column ‘mean’.
Now I’ll standardize the data to obtain a better performance score.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
scaled_nd_imputed = scaler.fit_transform(imputed_array)
scaled_nd_imputed is now the array ready to be used for training and prediction!
But instead of doing all these steps separately, you can do the same thing with just 2 lines of code using Pipeline!
Easy Approach Using sklearn.pipeline.Pipeline():
pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True))
])
new_mat = pipe.fit_transform(test_matrix)
So the values stored in ‘scaled_nd_imputed’ are exactly the same as those stored in ‘new_mat’.
You can also verify that using the numpy module in Python: the check returns
True if the two matrices generated are the same.
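The comparison can be sketched end-to-end like this, using numpy’s array_equal (the fixed random seed is my addition, so the mask is reproducible):

```python
import numpy as np
import sklearn.datasets
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Build the masked test matrix (seeded so the NaN positions are reproducible)
rng = np.random.default_rng(0)
test_matrix = sklearn.datasets.make_spd_matrix(10, random_state=2)
test_matrix[rng.binomial(1, .1, test_matrix.shape).astype(bool)] = np.nan

# Step-by-step approach
imputed = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(test_matrix)
scaled_nd_imputed = StandardScaler().fit_transform(imputed)

# Pipeline approach
pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
])
new_mat = pipe.fit_transform(test_matrix)

print(np.array_equal(scaled_nd_imputed, new_mat))  # True
```

Both paths apply the exact same operations in the same order, so the results match element for element.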
**Moreover, you can access every object inside a Pipeline, for example through its named_steps attribute; and printing the pipe object itself shows its full configuration, with output like this:
Pipeline(memory=None,
         steps=[('imputer', SimpleImputer(add_indicator=False, copy=True,
                                          fill_value=None, missing_values=nan,
                                          strategy='mean', verbose=0)),
                ('scaler', StandardScaler(copy=True, with_mean=True,
                                          with_std=True))],
         verbose=False)
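For instance, here is a small sketch (the toy data is my own) showing how named_steps lets you pull out an individual fitted step and inspect what it learned:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
])

# Hypothetical data: the NaN in column 1 gets imputed with that column's mean
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [5.0, 6.0]])
pipe.fit_transform(X)

# Access steps by the names given in `steps`
imputer = pipe.named_steps['imputer']
print(imputer.statistics_)                  # per-column means learned by the imputer
print(pipe.named_steps['scaler'].mean_)     # per-column means learned by the scaler
```

Because each step is fitted during `fit_transform`, the retrieved objects carry their learned parameters (here, the column means [3.0, 5.0]).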
The Jupyter notebook for this post is available here: pipeline in machine learning scikit-learn