Binning method for data smoothing in Python

In this tutorial, we’ll learn about the Binning method for Data smoothing in Python.

Binning is a technique for data smoothing that involves dividing your data into ranges, or bins, and replacing the values within each bin with a summary statistic, such as the mean or median. This can be useful for reducing noise in the data and making patterns more apparent.

Here is an example of how to perform binning in Python using the pandas library:

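import pandas as pd

df = pd.read_csv('data.csv')

# Bin edges (example values only; choose edges that cover your data's range)
bins = [0, 10, 20, 30]

# Divide the data into bins and take the mean of the numeric columns in each bin
binned_df = df.groupby(pd.cut(df['column_name'], bins), observed=False).mean(numeric_only=True)

# Plot the binned data
binned_df.plot()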

In this example, df is a DataFrame containing the data, column_name is the name of the column that you want to bin, and bins is a list or array of bin edges. The groupby method divides the data into bins based on the values in the specified column, and the mean method calculates the mean of each group. The resulting DataFrame, binned_df, contains the binned data.

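You can also use other summary statistics, such as the median or mode. The median method works the same way as mean. A grouped DataFrame has no mode method, so the mode is computed by applying the DataFrame's own mode method to each group.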

# Calculate the median of each bin
binned_df = df.groupby(pd.cut(df['column_name'], bins)).median()

# Calculate the mode of each bin
binned_df = df.groupby(pd.cut(df['column_name'], bins)).apply(lambda x: x.mode())
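The groupby calls above return one row per bin. To smooth the data itself, so that every original value is replaced by the statistic of its bin, one option is groupby followed by transform. The following is a minimal sketch; the DataFrame, the column name value, and the bin edges are made-up assumptions for illustration.

```python
import pandas as pd

# Hypothetical data; the values and bin edges are made up for illustration.
df = pd.DataFrame({"value": [4, 8, 15, 21, 24, 28]})
bins = [0, 10, 20, 30]  # edges of the intervals (0, 10], (10, 20], (20, 30]

# Label each row with the interval its value falls into.
labels = pd.cut(df["value"], bins)

# transform returns one result per original row, so each value is
# replaced by the mean of the bin it belongs to.
df["smoothed"] = df.groupby(labels, observed=True)["value"].transform("mean")

print(df)
```

Here 4 and 8 share a bin and both become 6.0, while 15 is alone in its bin and stays 15.0. Passing "median" instead of "mean" to transform should give smoothing by bin median in the same way.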

Data smoothing is a pre-processing technique used to remove noise from a dataset.
We’ll first cover its basics and then move on to its implementation in Python.
In the binning method, we first sort the data, then distribute the sorted values into bins, and finally smooth the values within each bin.

Data smoothing can be performed in three different ways:

Bin mean: Each value in a bin is replaced by the mean of that bin.
Bin median: Each value in a bin is replaced by the median of that bin.
Bin boundary: The minimum and maximum values of a bin are its boundaries and stay unchanged; every other value in the bin is replaced by the boundary it is closer to.

Now, let’s have an example as follows:

Data before sorting:

7, 10, 9, 18

Data after sorting:

7, 9, 10, 18

Data after bin means:

11, 11, 11, 11

as the mean of 7, 9, 10, 18 is 11.

Data after bin median:

9.5, 9.5, 9.5, 9.5

as the median of 7, 9, 10, 18 is (9 + 10) / 2 = 9.5.

Data after bin boundary:

7, 7, 7, 18

Since 7 and 18 are the minimum and maximum values of the bin, they are the bin boundaries. Both 9 and 10 are closer to 7 than to 18, so they are replaced by 7.
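The worked example can be checked with a few lines of NumPy. This is a small sketch that treats the four values as a single bin; the variable names are chosen here for illustration. Note that for an even number of values, np.median averages the two middle values.

```python
import numpy as np

# The four example values, sorted and treated as one bin.
bin_values = np.sort(np.array([7, 10, 9, 18]))

# Bin mean: every value becomes the mean of the bin.
by_mean = np.full(bin_values.shape, bin_values.mean())

# Bin median: every value becomes the median of the bin.
by_median = np.full(bin_values.shape, np.median(bin_values))

# Bin boundary: every value becomes the nearer of the bin's minimum and maximum.
low, high = bin_values[0], bin_values[-1]
by_boundary = np.where(bin_values - low < high - bin_values, low, high)

print(by_mean)      # [11. 11. 11. 11.]
print(by_median)    # [9.5 9.5 9.5 9.5]
print(by_boundary)  # [ 7  7  7 18]
```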

Now, we’ll take a real-life example, stock market turnover, and apply the binning method to it. The dataset we are using is NSE50, and we’ll use only its turnover values.
First, import the following packages:

import numpy as np
import pandas as pd

Now, read the CSV file using pandas and extract only the turnover column.

df = pd.read_csv('nse50_data.csv')
data = df['Turnover (Rs. Cr)']

We’ll use only 30 values from the data for the sake of convenience.

data = data[:30]

Now, we’ll sort the data.

data=np.sort(data)
print(data)

The corresponding data is as follows:

array([10388.69, 10843.92, 10858.35, 10896.89, 12012.41, 12113.53,
       12199.98, 12211.18, 12290.16, 12528.8 , 12649.4 , 12834.85,
       13320.2 , 13520.01, 13591.3 , 13676.58, 13709.57, 13837.03,
       13931.15, 14006.48, 14105.94, 14440.17, 14716.66, 14744.56,
       14932.51, 15203.09, 15787.28, 15944.45, 20187.98, 21595.33])

Now, we’ll create three matrices, each with 10 rows and 3 columns. Each row acts as one bin of three values (30 values / 3 per bin = 10 bins), and each matrix stores the result of one smoothing method.

b1=np.zeros((10,3)) 
b2=np.zeros((10,3)) 
b3=np.zeros((10,3))
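Because the data is sorted and every bin holds three consecutive values, the bins themselves can also be obtained by reshaping the array. The sketch below uses only the first six values of the sorted turnover data to stay short; the names sample, bin_size, and smoothed are illustrative assumptions.

```python
import numpy as np

# First six values of the sorted turnover data shown above.
sample = np.array([10388.69, 10843.92, 10858.35,
                   10896.89, 12012.41, 12113.53])

bin_size = 3
n_bins = len(sample) // bin_size

# Each row of `bins` is one bin of three consecutive sorted values.
# Any leftover values that do not fill a whole bin are dropped here.
bins = sample[:n_bins * bin_size].reshape(n_bins, bin_size)

# An empty matrix of the same shape, ready to receive smoothed values.
smoothed = np.zeros_like(bins)

print(bins)
print(smoothed.shape)
```

With the full 30 values and a bin size of 3, the same reshape would give the 10 x 3 layout used in this tutorial.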

Now, we’ll compute the Mean Bin as follows:

for i in range(0, 30, 3):
  k = i // 3  # index of the current bin
  mean = (data[i] + data[i+1] + data[i+2]) / 3
  b1[k, :] = mean  # fill the whole bin with its mean

print("-----------------Mean Bin:----------------- \n", b1)

The corresponding mean bin is as follows:

-----------------Mean Bin:----------------- 
 [[10696.98666667 10696.98666667 10696.98666667]
 [11674.27666667 11674.27666667 11674.27666667]
 [12233.77333333 12233.77333333 12233.77333333]
 [12671.01666667 12671.01666667 12671.01666667]
 [13477.17       13477.17       13477.17      ]
 [13741.06       13741.06       13741.06      ]
 [14014.52333333 14014.52333333 14014.52333333]
 [14633.79666667 14633.79666667 14633.79666667]
 [15307.62666667 15307.62666667 15307.62666667]
 [19242.58666667 19242.58666667 19242.58666667]]
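The same bin means can be computed without explicit loops. The following is a sketch, not the method used above: it reshapes the sorted values into rows of three and lets NumPy average each row. It uses the first six sorted turnover values, and the names sample, bins, and mean_bin are illustrative.

```python
import numpy as np

# First six values of the sorted turnover data shown above.
sample = np.array([10388.69, 10843.92, 10858.35,
                   10896.89, 12012.41, 12113.53])

# One row per bin, three sorted values per row.
bins = sample.reshape(-1, 3)

# Mean of each row, kept as a column vector of shape (n_bins, 1).
bin_means = bins.mean(axis=1, keepdims=True)

# Repeat each mean across its row so every value is replaced by it.
mean_bin = np.repeat(bin_means, bins.shape[1], axis=1)

print(mean_bin)
```

The two rows should agree with the first two rows of the mean bin printed above.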

Now, we’ll compute the Median Bin as follows:

for i in range(0, 30, 3):
  k = i // 3  # index of the current bin
  # The data is sorted, so the middle of three values is the median
  b2[k, :] = data[i+1]

print("-----------------Median Bin :----------------- \n", b2)

The corresponding median bin is as follows:

-----------------Median Bin :----------------- 
 [[10843.92 10843.92 10843.92]
 [12012.41 12012.41 12012.41]
 [12211.18 12211.18 12211.18]
 [12649.4  12649.4  12649.4 ]
 [13520.01 13520.01 13520.01]
 [13709.57 13709.57 13709.57]
 [14006.48 14006.48 14006.48]
 [14716.66 14716.66 14716.66]
 [15203.09 15203.09 15203.09]
 [20187.98 20187.98 20187.98]]
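Taking data[i+1] as the median works only because each bin holds exactly three sorted values. A sketch that does not depend on the bin size can use np.median on the reshaped array instead. It again uses the first six sorted turnover values, and the variable names are illustrative.

```python
import numpy as np

# First six values of the sorted turnover data shown above.
sample = np.array([10388.69, 10843.92, 10858.35,
                   10896.89, 12012.41, 12113.53])

# One row per bin, three sorted values per row.
bins = sample.reshape(-1, 3)

# Median of each row, kept as a column vector of shape (n_bins, 1).
bin_medians = np.median(bins, axis=1, keepdims=True)

# Repeat each median across its row so every value is replaced by it.
median_bin = np.repeat(bin_medians, bins.shape[1], axis=1)

print(median_bin)
```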

Now, we’ll compute the Boundary Bin as follows:

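for i in range(0, 30, 3):
  k = i // 3  # index of the current bin
  for j in range(3):
    # Replace each value with the nearer of the bin's minimum and maximum
    if (data[i+j] - data[i]) < (data[i+2] - data[i+j]):
      b3[k, j] = data[i]
    else:
      b3[k, j] = data[i+2]

print("-----------------Boundary Bin:----------------- \n", b3)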

The corresponding boundary bin is as follows:

-----------------Boundary Bin:----------------- 
 [[10388.69 10858.35 10858.35]
 [10896.89 12113.53 12113.53]
 [12199.98 12199.98 12290.16]
 [12528.8  12528.8  12834.85]
 [13320.2  13591.3  13591.3 ]
 [13676.58 13676.58 13837.03]
 [13931.15 13931.15 14105.94]
 [14440.17 14744.56 14744.56]
 [14932.51 14932.51 15787.28]
 [15944.45 21595.33 21595.33]]
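Boundary smoothing can also be written with np.where in place of the nested loops. This is a sketch under the same assumptions as the loop version: the data is sorted, each row is one bin, and a value exactly halfway between the boundaries goes to the upper one. It uses the first six sorted turnover values, and the variable names are illustrative.

```python
import numpy as np

# First six values of the sorted turnover data shown above.
sample = np.array([10388.69, 10843.92, 10858.35,
                   10896.89, 12012.41, 12113.53])

# One row per bin, three sorted values per row.
bins = sample.reshape(-1, 3)

# The data is sorted, so the first column holds each bin's minimum
# and the last column holds each bin's maximum.
lows = bins[:, :1]
highs = bins[:, -1:]

# Pick the lower boundary where a value is strictly closer to it,
# otherwise the upper boundary.
boundary_bin = np.where(bins - lows < highs - bins, lows, highs)

print(boundary_bin)
```

The two rows should agree with the first two rows of the boundary bin printed above.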

I hope you enjoyed this tutorial.
