Binning Data with Pandas qcut and cut in Python

Post Views: 834

In this tutorial, you will learn how to do Binning Data in Pandas by using qcut and cut functions in Python.

First, we will focus on qcut.

qcut

qcut is used to divide the data into equal size bins. The Pandas documentation describes qcut as a “Quantile-based discretization function.

Understand with an example:-

Data = pd.DataFrame(dataset['SalePrice'])

Data.describe()

Output:

	SalePrice
count	1460.000000
mean	180921.195890
std	79442.502883
min	34900.000000
25%	129975.000000
50%	163000.000000
75%	214000.000000
max	755000.000000

This is the data that we will use throughout the tutorial.

Data['Quantile_ex_1'] = pd.qcut(Data['SalePrice'], q=4)
Data['Quantile_ex_1'].value_counts()

Output:

(129975.0, 163000.0]     367
(163000.0, 214000.0]     366
(34899.999, 129975.0]    365
(214000.0, 755000.0]     362
Name: Quantile_ex_1, dtype: int64

Here, we used quantiles with “q=4” the bins match the percentiles from the “describe” function. Also, we can notice that the size of all bins is almost the same.

If we use “q=20” then the output shows the 20 bins with equal size. Here is the code.

Data['Quantile_ex_2'] = pd.qcut(Data['SalePrice'], q=20)
Data['Quantile_ex_2'].value_counts()

Output:-

(135500.0, 141000.0]    79
(147000.0, 155000.0]    78
(106475.0, 115000.0]    76
(198620.0, 214000.0]    76
(34899.999, 88000.0]    75
(163000.0, 172500.0]    75
(179280.0, 187500.0]    75
(129975.0, 135500.0]    74
(115000.0, 124000.0]    73
(326100.0, 755000.0]    73
(214000.0, 230000.0]    73
(278000.0, 326100.0]    72
(230000.0, 250000.0]    72
(250000.0, 278000.0]    72
(141000.0, 147000.0]    71
(187500.0, 198620.0]    71
(88000.0, 106475.0]     71
(124000.0, 129975.0]    70
(172500.0, 179280.0]    69
(155000.0, 163000.0]    65
Name: Quantile_ex_2, dtype: int64

Again, for q=20 the size of the bins are almost the same.

If we want to label this price data into three different labels like [cheap, moderate, costly]. See the following Python program to do this.

bin_labels_3 = ['Cheap', 'Moderate', 'Costly']
Data['quantile_ex_3'] = pd.qcut(Data['SalePrice'],
                              q=[0, .3, .6, 1],
                              labels=bin_labels_3)
Data.head()

Output:-

	SalePrice	Quantile_ex_1	Quantile_ex_2	quantile_ex_3
0	208500	(163000.0, 214000.0]	(198620.0, 214000.0]	Costly
1	181500	(163000.0, 214000.0]	(179280.0, 187500.0]	Costly
2	223500	(214000.0, 755000.0]	(214000.0, 230000.0]	Costly
3	140000	(129975.0, 163000.0]	(135500.0, 141000.0]	Moderate
4	250000	(214000.0, 755000.0]	(230000.0, 250000.0]	Costly

From the “quantile_ex_3” column you can notice that we have labeled the price data into three different categories.

If you want to know the frequency of each category here is the code.

Data['quantile_ex_3'].value_counts()

Output:

Costly      584
Cheap       439
Moderate    437
Name: quantile_ex_3, dtype: int64

If we use label = False in the above code then those three categories will automatically label by 0,1,2. Here is the code

Data['quantile_ex_4'] = pd.qcut(Data['SalePrice'],
                              q=[0, .3, .6, 1],
                              labels=False)
Data.head()

Output:

	SalePrice	Quantile_ex_1	Quantile_ex_2	quantile_ex_3	quantile_ex_4
0	208500	(163000.0, 214000.0]	(198620.0, 214000.0]	Costly	2
1	181500	(163000.0, 214000.0]	(179280.0, 187500.0]	Costly	2
2	223500	(214000.0, 755000.0]	(214000.0, 230000.0]	Costly	2
3	140000	(129975.0, 163000.0]	(135500.0, 141000.0]	Moderate	1
4	250000	(214000.0, 755000.0]	(230000.0, 250000.0]	Costly	2

The last column shows that the three categories are labeled by 0,1,2.

cut

If you want the same size for all bins then you should use “cut”. While if you want the same frequency for different bins then you should use “qcut”.

When you use the cut function then you will not get the same frequency for all bins.

Data['Quantile_ex_1'] = pd.cut(Data['SalePrice'], bins=4)
Data['Quantile_ex_1'].value_counts()

Output:-

(34179.9, 214925.0]     1100
(214925.0, 394950.0]     330
(394950.0, 574975.0]      25
(574975.0, 755000.0]       5

Again, we divided price data into four different bins. Here, you can notice that here the size of every bins are the same with 180045 but the frequency of each bin is not the same. However, when we use “qcut” function the frequencies are the same for all the bins but the size of the bins is different.

Now, label the data with cut function as we have does with qcut.

cut_labels_3 = ['Cheap', 'Moderate', 'Costly']
cut_bins = [34900 ,226500.0,453000.0,755000]
Data['cut_ex1'] = pd.cut(Data['SalePrice'], bins=cut_bins, labels=cut_labels_3)

Output:-

	SalePrice	Quantile_ex_1	cut_ex1
0	208500	(34179.9, 214925.0]	Cheap
1	181500	(34179.9, 214925.0]	Cheap
2	223500	(214925.0, 394950.0]	Cheap
3	140000	(34179.9, 214925.0]	Cheap
4	250000	(214925.0, 394950.0]	Moderate

If you want to create 8 bins with equal sizes you can use np.linspace(0, 755000, 9).

Data['Quantile_ex_2'] = pd.cut(Data['SalePrice'], bins=np.linspace(0, 755000, 9))
Data['Quantile_ex_2'].value_counts()

Output:

(94375.0, 188750.0]     860
(188750.0, 283125.0]    366
(0.0, 94375.0]           97
(283125.0, 377500.0]     96
(377500.0, 471875.0]     30
(471875.0, 566250.0]      6
(566250.0, 660625.0]      3
(660625.0, 755000.0]      2
Name: Quantile_ex_2, dtype: int64

Here we create eights bins with the same size.

Now, label the data

Data['cut_ex2'] = pd.cut(Data['SalePrice'], bins=np.linspace(0, 755000, 9),labels = False)
Data.head()

Output:

	SalePrice	Quantile_ex_1	cut_ex1	Quantile_ex_2	cut_ex2
0	208500	(34179.9, 214925.0]	Cheap	(188750.0, 283125.0]	2
1	181500	(34179.9, 214925.0]	Cheap	(94375.0, 188750.0]	1
2	223500	(214925.0, 394950.0]	Cheap	(188750.0, 283125.0]	2
3	140000	(34179.9, 214925.0]	Cheap	(94375.0, 188750.0]	1
4	250000	(214925.0, 394950.0]	Moderate	(188750.0, 283125.0]	2

You can notice in the last column with label data.

Instead of np.linspace(0, 755000, 9) you can use pd.interval_range. There is a downside to using interval_range. You can not define custom labels even you try ti define the label.

interval_range = pd.interval_range(start=0, freq=10000, end=755000)
Data['cut_ex2'] = pd.cut(Data['SalePrice'], bins=interval_range, labels=[0,1,2,3,4,5,6,7])
Data.head()

Output:-

	SalePrice	Quantile_ex_1	cut_ex1	Quantile_ex_2	cut_ex2
0	208500	(34179.9, 214925.0]	Cheap	(188750.0, 283125.0]	(200000, 210000]
1	181500	(34179.9, 214925.0]	Cheap	(94375.0, 188750.0]	(180000, 190000]
2	223500	(214925.0, 394950.0]	Cheap	(188750.0, 283125.0]	(220000, 230000]
3	140000	(34179.9, 214925.0]	Cheap	(94375.0, 188750.0]	(130000, 140000]
4	250000	(214925.0, 394950.0]	Moderate	(188750.0, 283125.0]	(240000, 250000]

In the above code we tried to label data into [0,1,2,3,4,5,6,7] but the last column of the output shows only the ranges, not the label.

Binning Data with Pandas qcut and cut in Python

qcut

cut

Leave a Reply Cancel reply