Binning Data with Pandas qcut and cut in Python

In this tutorial, you will learn how to do Binning Data in Pandas by using qcut and cut functions in Python.

First, we will focus on qcut.

qcut

qcut is used to divide the data into equal size bins. The pandas documentation describes qcut as a “Quantile-based discretization function.

Understand with an example:-

Data = pd.DataFrame(dataset['SalePrice'])

Data.describe()

Output:

SalePrice
count
1460.000000
mean
180921.195890
std
79442.502883
min
34900.000000
25%
129975.000000
50%
163000.000000
75%
214000.000000
max
755000.000000

This is the data that we will use throughout the tutorial.

Data['Quantile_ex_1'] = pd.qcut(Data['SalePrice'], q=4)
Data['Quantile_ex_1'].value_counts()

Output:

(129975.0, 163000.0]     367
(163000.0, 214000.0]     366
(34899.999, 129975.0]    365
(214000.0, 755000.0]     362
Name: Quantile_ex_1, dtype: int64

Here, we used quantiles with “q=4” the bins match the percentiles from the “describe” function. Also, we can notice that the size of all bins is almost the same.

If we use “q=20” then the output shows the 20 bins with equal size. Here is the code.

Data['Quantile_ex_2'] = pd.qcut(Data['SalePrice'], q=20)
Data['Quantile_ex_2'].value_counts()

Output:-

(135500.0, 141000.0]    79
(147000.0, 155000.0]    78
(106475.0, 115000.0]    76
(198620.0, 214000.0]    76
(34899.999, 88000.0]    75
(163000.0, 172500.0]    75
(179280.0, 187500.0]    75
(129975.0, 135500.0]    74
(115000.0, 124000.0]    73
(326100.0, 755000.0]    73
(214000.0, 230000.0]    73
(278000.0, 326100.0]    72
(230000.0, 250000.0]    72
(250000.0, 278000.0]    72
(141000.0, 147000.0]    71
(187500.0, 198620.0]    71
(88000.0, 106475.0]     71
(124000.0, 129975.0]    70
(172500.0, 179280.0]    69
(155000.0, 163000.0]    65
Name: Quantile_ex_2, dtype: int64

Again, for  q=20 the size of the bins are almost the same.

 

If we want to label this price data into three different labels like [cheap, moderate, costly]. See the following code to do this.

bin_labels_3 = ['Cheap', 'Moderate', 'Costly']
Data['quantile_ex_3'] = pd.qcut(Data['SalePrice'],
                              q=[0, .3, .6, 1],
                              labels=bin_labels_3)
Data.head()

Output:-

SalePrice
Quantile_ex_1
Quantile_ex_2
quantile_ex_3
0
208500
(163000.0, 214000.0]
(198620.0, 214000.0]
Costly
1
181500
(163000.0, 214000.0]
(179280.0, 187500.0]
Costly
2
223500
(214000.0, 755000.0]
(214000.0, 230000.0]
Costly
3
140000
(129975.0, 163000.0]
(135500.0, 141000.0]
Moderate
4
250000
(214000.0, 755000.0]
(230000.0, 250000.0]
Costly

 

From the “quantile_ex_3” column you can notice that we have labeled the price data into three different categories.

 

If you want to know the frequency of each category here is the code.

Data['quantile_ex_3'].value_counts()

Output:

Costly      584
Cheap       439
Moderate    437
Name: quantile_ex_3, dtype: int64

 

If we use label = False in the above code then those three categories will automatically label by 0,1,2. Here is the code

Data['quantile_ex_4'] = pd.qcut(Data['SalePrice'],
                              q=[0, .3, .6, 1],
                              labels=False)
Data.head()
Output:
SalePrice
Quantile_ex_1
Quantile_ex_2
quantile_ex_3
quantile_ex_4
0
208500
(163000.0, 214000.0]
(198620.0, 214000.0]
Costly
2
1
181500
(163000.0, 214000.0]
(179280.0, 187500.0]
Costly
2
2
223500
(214000.0, 755000.0]
(214000.0, 230000.0]
Costly
2
3
140000
(129975.0, 163000.0]
(135500.0, 141000.0]
Moderate
1
4
250000
(214000.0, 755000.0]
(230000.0, 250000.0]
Costly
2

The last column shows that the three categories are labeled by 0,1,2.

cut

If you want the same size for all bins then you should use “cut”. While if you want the same frequency for different bins then you should use “qcut”.

When you use the cut function then you will not get the same frequency for all bins.

Data['Quantile_ex_1'] = pd.cut(Data['SalePrice'], bins=4)
Data['Quantile_ex_1'].value_counts()

Output:-

(34179.9, 214925.0]     1100
(214925.0, 394950.0]     330
(394950.0, 574975.0]      25
(574975.0, 755000.0]       5

Again, we divided price data into four different bins. Here, you can notice that here the size of every bins are the same with 180045 but the frequency of each bin is not the same. However, when we use “qcut” function the frequencies are the same for all the bins but the size of the bins is different.

Now, label the data with cut function as  we have does with qcut.

cut_labels_3 = ['Cheap', 'Moderate', 'Costly']
cut_bins = [34900 ,226500.0,453000.0,755000]
Data['cut_ex1'] = pd.cut(Data['SalePrice'], bins=cut_bins, labels=cut_labels_3)

Output:-

SalePrice
Quantile_ex_1
cut_ex1
0
208500
(34179.9, 214925.0]
Cheap
1
181500
(34179.9, 214925.0]
Cheap
2
223500
(214925.0, 394950.0]
Cheap
3
140000
(34179.9, 214925.0]
Cheap
4
250000
(214925.0, 394950.0]
Moderate

 

If you want to create 8 bins with equal sizes you can use np.linspace(0, 755000, 9).

Data['Quantile_ex_2'] = pd.cut(Data['SalePrice'], bins=np.linspace(0, 755000, 9))
Data['Quantile_ex_2'].value_counts()

Output:

(94375.0, 188750.0]     860
(188750.0, 283125.0]    366
(0.0, 94375.0]           97
(283125.0, 377500.0]     96
(377500.0, 471875.0]     30
(471875.0, 566250.0]      6
(566250.0, 660625.0]      3
(660625.0, 755000.0]      2
Name: Quantile_ex_2, dtype: int64

Here we create eights bins with the same size.

Now, label the data

Data['cut_ex2'] = pd.cut(Data['SalePrice'], bins=np.linspace(0, 755000, 9),labels = False)
Data.head()

Output:

SalePrice
Quantile_ex_1
cut_ex1
Quantile_ex_2
cut_ex2
0
208500
(34179.9, 214925.0]
Cheap
(188750.0, 283125.0]
2
1
181500
(34179.9, 214925.0]
Cheap
(94375.0, 188750.0]
1
2
223500
(214925.0, 394950.0]
Cheap
(188750.0, 283125.0]
2
3
140000
(34179.9, 214925.0]
Cheap
(94375.0, 188750.0]
1
4
250000
(214925.0, 394950.0]
Moderate
(188750.0, 283125.0]
2

You can notice in the last column with label data.

Instead of np.linspace(0, 755000, 9) you can use pd.interval_range. There is a downside to using interval_range. You can not define custom labels even you try ti define the label.

interval_range = pd.interval_range(start=0, freq=10000, end=755000)
Data['cut_ex2'] = pd.cut(Data['SalePrice'], bins=interval_range, labels=[0,1,2,3,4,5,6,7])
Data.head()
Output:-
SalePrice
Quantile_ex_1
cut_ex1
Quantile_ex_2
cut_ex2
0
208500
(34179.9, 214925.0]
Cheap
(188750.0, 283125.0]
(200000, 210000]
1
181500
(34179.9, 214925.0]
Cheap
(94375.0, 188750.0]
(180000, 190000]
2
223500
(214925.0, 394950.0]
Cheap
(188750.0, 283125.0]
(220000, 230000]
3
140000
(34179.9, 214925.0]
Cheap
(94375.0, 188750.0]
(130000, 140000]
4
250000
(214925.0, 394950.0]
Moderate
(188750.0, 283125.0]
(240000, 250000]

In the above code we tried to label data into [0,1,2,3,4,5,6,7] but the last column of the output shows only the ranges, not the label.

Leave a Reply