Binning method for data smoothing in Python

In this tutorial, we’ll learn about the binning method for data smoothing in Python.
Data smoothing is a pre-processing technique used to reduce noise in a dataset.
We’ll first cover its basics and then move on to its implementation in Python.
In this method, we first sort the data, then partition the sorted values into bins, and finally smooth the values within each bin.

Data smoothing can be performed in three different ways:

  1. Bin means: Each value in a bin is replaced by the mean of that bin.
  2. Bin median: Each value in a bin is replaced by the median of that bin.
  3. Bin boundary: The minimum and maximum values of a bin are taken as its boundaries, and every other value in the bin is replaced by the boundary it is closer to.

Now, let’s work through an example with a single bin:

Data before sorting: 

7, 10, 9, 18

Data after sorting: 

7, 9, 10, 18

Data after bin means:

11, 11, 11, 11

since the mean of 7, 9, 10, 18 is (7 + 9 + 10 + 18) / 4 = 11.

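Data after bin median:

9.5, 9.5, 9.5, 9.5

since the bin holds an even number of values, so the median is the average of the two middle values: (9 + 10) / 2 = 9.5.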

Data after bin boundary: 

7, 7, 7, 18

Since 7 and 18 are the minimum and maximum values of the bin, they are the bin boundaries. Both 9 and 10 are closer to 7 than to 18, so they are replaced by 7.
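
The worked example above can be checked with a few lines of NumPy. This is only a minimal sketch that treats the four values as one bin; the variable names are illustrative. Note that `np.median` averages the two middle values of an even-sized bin, so it reports 9.5 here.

```python
import numpy as np

# One bin holding the four example values, sorted first.
values = np.sort(np.array([7, 10, 9, 18]))

# Bin means: every value becomes the mean of the bin.
by_mean = np.full(values.shape, values.mean())

# Bin median: every value becomes the median of the bin.
by_median = np.full(values.shape, np.median(values))

# Bin boundary: each value snaps to the nearer of the bin's min and max.
lo, hi = values[0], values[-1]
by_boundary = np.where(values - lo < hi - values, lo, hi)

print(by_mean)
print(by_median)
print(by_boundary)
```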

Now, we’ll take a real-life example of stock market turnover and apply the binning method to it. The dataset we are using is NSE50. We’ll use only the turnover values.
First, import the following packages:

import numpy as np 
import math 
import pandas as pd

Now, read the CSV file using Pandas and extract only the turnover column.

df = pd.read_csv('nse50_data.csv')
data = df['Turnover (Rs. Cr)']

We’ll use only the first 30 values from the data for the sake of convenience.

data = data[:30]

Now, we’ll sort the data.

data=np.sort(data)
print(data)

The sorted data is as follows:

array([10388.69, 10843.92, 10858.35, 10896.89, 12012.41, 12113.53,
       12199.98, 12211.18, 12290.16, 12528.8 , 12649.4 , 12834.85,
       13320.2 , 13520.01, 13591.3 , 13676.58, 13709.57, 13837.03,
       13931.15, 14006.48, 14105.94, 14440.17, 14716.66, 14744.56,
       14932.51, 15203.09, 15787.28, 15944.45, 20187.98, 21595.33])

Now, we’ll create three matrices, each with 10 rows and 3 columns. Each row acts as one bin of 3 values, and each matrix will hold the result of one smoothing method.

b1=np.zeros((10,3)) 
b2=np.zeros((10,3)) 
b3=np.zeros((10,3))
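
As a side note, the same bin layout can be obtained directly from the sorted array with `reshape`, which may make the partitioning step easier to see. The sketch below is an illustration rather than part of the tutorial's code: it uses only the first nine sorted values, and `sample` and `bin_size` are assumed names.

```python
import numpy as np

# First nine sorted turnover values from the output above.
sample = np.array([10388.69, 10843.92, 10858.35,
                   10896.89, 12012.41, 12113.53,
                   12199.98, 12211.18, 12290.16])
bin_size = 3  # assumed bin size; len(sample) must be divisible by it

# Each row of the reshaped array is one bin of consecutive sorted values.
bins = sample.reshape(-1, bin_size)
print(bins.shape)
print(bins)
```

This only works when the number of values is an exact multiple of the bin size, which holds for 30 values and bins of 3.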

Now, we’ll compute the Mean Bin as follows:

for i in range(0, 30, 3):
  k = i // 3  # row index of the current bin
  mean = (data[i] + data[i+1] + data[i+2]) / 3
  for j in range(3):
    b1[k, j] = mean  # every value in the bin becomes the bin mean

print("-----------------Mean Bin:----------------- \n",b1)

The corresponding mean bin is as follows:

-----------------Mean Bin:----------------- 
 [[10696.98666667 10696.98666667 10696.98666667]
 [11674.27666667 11674.27666667 11674.27666667]
 [12233.77333333 12233.77333333 12233.77333333]
 [12671.01666667 12671.01666667 12671.01666667]
 [13477.17       13477.17       13477.17      ]
 [13741.06       13741.06       13741.06      ]
 [14014.52333333 14014.52333333 14014.52333333]
 [14633.79666667 14633.79666667 14633.79666667]
 [15307.62666667 15307.62666667 15307.62666667]
 [19242.58666667 19242.58666667 19242.58666667]]
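
The loop above can also be written without explicit indexing. The following is a minimal sketch, assuming the data has already been arranged as one bin per row; it uses only the first two bins of the sorted data, and the names `bins`, `row_means` and `mean_bins` are illustrative.

```python
import numpy as np

# First two bins of the sorted data, one bin per row.
bins = np.array([[10388.69, 10843.92, 10858.35],
                 [10896.89, 12012.41, 12113.53]])

# Row-wise mean, kept as a column so it can be repeated across each row.
row_means = bins.mean(axis=1, keepdims=True)
mean_bins = np.repeat(row_means, bins.shape[1], axis=1)
print(mean_bins)
```

The two rows should agree with the first two rows of the Mean Bin output shown above.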

Now, we’ll compute the Median Bin as follows:

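for i in range(0, 30, 3):
  k = i // 3  # row index of the current bin
  for j in range(3):
    # the data is sorted, so the middle of the 3 values is the bin median
    b2[k, j] = data[i+1]
print("-----------------Median Bin :----------------- \n",b2)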

The corresponding median bin is as follows:

-----------------Median Bin :----------------- 
 [[10843.92 10843.92 10843.92]
 [12012.41 12012.41 12012.41]
 [12211.18 12211.18 12211.18]
 [12649.4  12649.4  12649.4 ]
 [13520.01 13520.01 13520.01]
 [13709.57 13709.57 13709.57]
 [14006.48 14006.48 14006.48]
 [14716.66 14716.66 14716.66]
 [15203.09 15203.09 15203.09]
 [20187.98 20187.98 20187.98]]
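
Taking `data[i+1]` as the median relies on each bin holding exactly 3 sorted values. A sketch that does not depend on the bin size could use `np.median` along each row instead; as before, the two-bin `bins` array and the other names are assumptions made for illustration.

```python
import numpy as np

# First two bins of the sorted data, one bin per row.
bins = np.array([[10388.69, 10843.92, 10858.35],
                 [10896.89, 12012.41, 12113.53]])

# Row-wise median, kept as a column and repeated across each row.
row_medians = np.median(bins, axis=1, keepdims=True)
median_bins = np.repeat(row_medians, bins.shape[1], axis=1)
print(median_bins)
```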

Now, we’ll compute the Boundary Bin as follows:

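for i in range(0, 30, 3):
  k = i // 3  # row index of the current bin
  for j in range(3):
    # data[i] is the lower boundary and data[i+2] the upper boundary
    if (data[i+j] - data[i]) < (data[i+2] - data[i+j]):
      b3[k, j] = data[i]    # closer to the lower boundary
    else:
      b3[k, j] = data[i+2]  # closer to (or equally far from) the upper boundary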

print("-----------------Boundary Bin:----------------- \n",b3)

The corresponding boundary bin is as follows:

-----------------Boundary Bin:----------------- 
 [[10388.69 10858.35 10858.35]
 [10896.89 12113.53 12113.53]
 [12199.98 12199.98 12290.16]
 [12528.8  12528.8  12834.85]
 [13320.2  13591.3  13591.3 ]
 [13676.58 13676.58 13837.03]
 [13931.15 13931.15 14105.94]
 [14440.17 14744.56 14744.56]
 [14932.51 14932.51 15787.28]
 [15944.45 21595.33 21595.33]]
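
The boundary rule can likewise be expressed with `np.where`. This is a hedged sketch on two bins taken from the sorted data, with illustrative names; like the loop above, it sends a value that is equally far from both boundaries to the upper one.

```python
import numpy as np

# Two bins of the sorted data, one bin per row (each row is sorted).
bins = np.array([[10388.69, 10843.92, 10858.35],
                 [12199.98, 12211.18, 12290.16]])

lo = bins[:, [0]]   # lower boundary of each bin, as a column
hi = bins[:, [-1]]  # upper boundary of each bin, as a column

# Pick the lower boundary where it is strictly closer, else the upper one.
boundary_bins = np.where(bins - lo < hi - bins, lo, hi)
print(boundary_bins)
```

Here the middle value of the first bin moves to the upper boundary and the middle value of the second bin moves to the lower one, as in the corresponding rows of the output above.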

I hope you enjoyed this tutorial.
