Feature Scaling in Machine Learning using Python
When building machine learning models, we often work with datasets whose features have very different ranges and units. These differences can become an obstacle for the learning algorithm.
Feature scaling is an important part of data preprocessing, which is one of the first steps in a machine learning workflow.
Python program for Feature Scaling in Machine Learning
Feature scaling is the process of bringing independent features onto a comparable scale or into a given range. It can improve the efficiency and accuracy of many machine learning models.
It is therefore part of data preprocessing, used to handle features with widely varying magnitudes or units.
Normalization (Min-Max scaling):
Normalization rescales the values of a feature so that they lie between 0 and 1.
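Written out, Min-Max scaling maps each value x of a feature to (x - min) / (max - min), where min and max are taken over that feature. The following is a minimal NumPy sketch of that formula, not the scikit-learn implementation. The helper name min_max_scale is made up for illustration, and the sketch assumes every column has at least two distinct non-missing values.

```python
import numpy as np

def min_max_scale(x):
    """Rescale each column of x to [0, 1], skipping NaN (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    col_min = np.nanmin(x, axis=0)   # per-column minimum, NaN skipped
    col_max = np.nanmax(x, axis=0)   # per-column maximum, NaN skipped
    return (x - col_min) / (col_max - col_min)

# Three ages: the smallest, one in between, and the largest.
ages = np.array([[27.0], [44.0], [50.0]])
scaled = min_max_scale(ages)
print(scaled.ravel())
```

The smallest age maps to 0, the largest to 1, and everything else falls proportionally in between.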
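Standardization:
Standardization is a technique in which the values are rescaled using the mean and standard deviation of the feature, so that the result has a mean of 0 and a standard deviation of 1.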
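Rescaling by the mean and standard deviation can be written as z = (x - mean) / std for each feature. Below is a minimal NumPy sketch of that formula under a few assumptions: the helper name standardize is hypothetical, missing values are skipped, and the population standard deviation (ddof=0) is used, which is also what scikit-learn's StandardScaler uses.

```python
import numpy as np

def standardize(x):
    """Center each column at 0 with unit standard deviation (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    mean = np.nanmean(x, axis=0)  # per-column mean, NaN skipped
    std = np.nanstd(x, axis=0)    # population standard deviation (ddof=0)
    return (x - mean) / std

values = np.array([[2.0], [4.0], [6.0]])
z = standardize(values)
print(z.ravel())
print("mean:", z.mean(), "std:", z.std())
```

Unlike Min-Max scaling, the result is not limited to a fixed interval; values far from the mean produce large positive or negative scores.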
Algorithms where Feature Scaling is important:
- K-Means: uses Euclidean distance to measure similarity between points, so features with large ranges dominate unless the data is scaled.
- Principal Component Analysis
- Gradient Descent
Generally, algorithms that rely on distances between data points are affected by the scale of the features.
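To see why distance-based algorithms are sensitive to scale, it helps to look at how much each feature contributes to a Euclidean distance. The sketch below uses two (Age, Salary) rows from the sample data shown later in this article; the variable names are only illustrative.

```python
import numpy as np

# Two (Age, Salary) rows from the article's sample data.
a = np.array([44.0, 72000.0])
b = np.array([27.0, 48000.0])

squared_diff = (a - b) ** 2               # per-feature share of the squared distance
distance = np.sqrt(squared_diff.sum())    # Euclidean distance on unscaled data
salary_share = squared_diff[1] / squared_diff.sum()

print("Euclidean distance:", distance)
print("Share of squared distance due to Salary:", salary_share)
```

On unscaled data the distance is almost entirely determined by Salary, so a 17-year age gap has practically no influence. Scaling both features first lets each one contribute on comparable terms.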
Example with code in Python:
- Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
- Loading dataset: Here, I am using Data_for_Missing_Values.csv.
   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
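The loading code itself is not shown, so here is a sketch of how a table like the one above could be read into a DataFrame named data. It is an assumption that the file is a comma-separated CSV with these four column names. To keep the example runnable without the file, it reads an in-memory stand-in containing only the five rows displayed above.

```python
import io
import pandas as pd

# Stand-in for Data_for_Missing_Values.csv: only the five rows shown above.
csv_text = """Country,Age,Salary,Purchased
France,44.0,72000.0,No
Spain,27.0,48000.0,Yes
Germany,30.0,54000.0,No
Spain,38.0,61000.0,No
Germany,40.0,,Yes
"""

# With the real file this would be pd.read_csv("Data_for_Missing_Values.csv"),
# assuming it sits in the working directory and uses these column names.
data = pd.read_csv(io.StringIO(csv_text))
print(data.head())
```

The empty Salary field in the last row is read as NaN, which is how the missing value appears in the table.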
- Selecting Features: The features we want to scale are stored in a separate variable using the iloc method. iloc[:, 1:3] selects all rows and the columns at positions 1 and 2 (Age and Salary); the end index 3 is excluded.
x = data.iloc[:, 1:3].values
print("Before scaling :", x)
Before scaling : [[4.4e+01 7.2e+04]
 [2.7e+01 4.8e+04]
 [3.0e+01 5.4e+04]
 [3.8e+01 6.1e+04]
 [4.0e+01     nan]
 [3.5e+01 5.8e+04]
 [    nan 5.2e+04]
 [4.8e+01 7.9e+04]
 [5.0e+01 8.3e+04]
 [3.7e+01 6.7e+04]]
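Positional slicing with iloc depends on column order, so it can be worth checking that it picks the intended columns. The sketch below rebuilds the first three rows of the sample data by hand (an illustrative stand-in, not the full file) and compares the positional selection with a selection by column name.

```python
import pandas as pd

# First three rows of the sample data, rebuilt by hand for illustration.
data = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

by_position = data.iloc[:, 1:3].values         # columns at positions 1 and 2
by_name = data[["Age", "Salary"]].to_numpy()   # same columns, selected by label

print(by_position.shape)
print((by_position == by_name).all())
```

Selecting by name is a little more verbose, but it keeps working if the column order in the file changes.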
- Performing Feature Scaling: To perform Min-Max scaling we will use the built-in class sklearn.preprocessing.MinMaxScaler. To perform standardization we will use the built-in class sklearn.preprocessing.StandardScaler.
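# Min-Max scaling: rescale each column to the range [0, 1]
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
x1 = min_max_scaler.fit_transform(x)
print("After min_max_scaling\n", x1)

# Standardization: rescale each column to zero mean and unit variance
std = preprocessing.StandardScaler()
x2 = std.fit_transform(x)
print("After standardisation\n", x2)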
After min_max_scaling
 [[0.73913043 0.68571429]
 [0.         0.        ]
 [0.13043478 0.17142857]
 [0.47826087 0.37142857]
 [0.56521739        nan]
 [0.34782609 0.28571429]
 [       nan 0.11428571]
 [0.91304348 0.88571429]
 [1.         1.        ]
 [0.43478261 0.54285714]]
After standardisation
 [[ 0.71993143  0.71101276]
 [-1.62367514 -1.36437583]
 [-1.21009751 -0.84552869]
 [-0.10722383 -0.24020701]
 [ 0.16849459         nan]
 [-0.52080146 -0.49963059]
 [        nan -1.01847774]
 [ 1.27136827  1.31633443]
 [ 1.54708669  1.66223253]
 [-0.24508304  0.27864014]]
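After Min-Max scaling, every non-missing value lies between 0 and 1. After standardization, the values are not bounded, but each column is centered on 0 with a standard deviation of 1. In both cases the NaN entries are passed through unchanged.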
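One way to check these properties is to compute NaN-aware summaries of the scaled arrays. The sketch below retypes the ten (Age, Salary) rows from the "Before scaling" output so it runs on its own, and it assumes a scikit-learn version in which both scalers ignore NaN when fitting (0.20 or later).

```python
import numpy as np
from sklearn import preprocessing

# The ten (Age, Salary) rows printed in the "Before scaling" output.
x = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, np.nan],
    [35, 58000], [np.nan, 52000], [48, 79000], [50, 83000], [37, 67000],
], dtype=float)

x1 = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit_transform(x)
x2 = preprocessing.StandardScaler().fit_transform(x)

# NaN-aware summaries, since both scalers pass missing values through.
print("min-max range:", np.nanmin(x1, axis=0), np.nanmax(x1, axis=0))
print("standardized mean:", np.nanmean(x2, axis=0))
print("standardized std:", np.nanstd(x2, axis=0))
print("standardized range:", np.nanmin(x2, axis=0), np.nanmax(x2, axis=0))
```

The Min-Max columns span exactly 0 to 1, while the standardized columns have mean 0 and standard deviation 1 but extend beyond the interval from -1 to 1.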