Binning Data with Pandas qcut and cut in Python
In this tutorial, you will learn how to do Binning Data in Pandas by using qcut and cut functions in Python.
First, we will focus on qcut.
qcut
qcut is used to divide the data into equal size bins. The pandas documentation describes qcut as a “Quantile-based discretization function.
Understand with an example:-
Data = pd.DataFrame(dataset['SalePrice']) Data.describe()
Output:
SalePrice | |
---|---|
count | 1460.000000 |
mean | 180921.195890 |
std | 79442.502883 |
min | 34900.000000 |
25% | 129975.000000 |
50% | 163000.000000 |
75% | 214000.000000 |
max | 755000.000000 |
This is the data that we will use throughout the tutorial.
Data['Quantile_ex_1'] = pd.qcut(Data['SalePrice'], q=4) Data['Quantile_ex_1'].value_counts()
Output:
(129975.0, 163000.0] 367 (163000.0, 214000.0] 366 (34899.999, 129975.0] 365 (214000.0, 755000.0] 362 Name: Quantile_ex_1, dtype: int64
Here, we used quantiles with “q=4” the bins match the percentiles from the “describe” function. Also, we can notice that the size of all bins is almost the same.
If we use “q=20” then the output shows the 20 bins with equal size. Here is the code.
Data['Quantile_ex_2'] = pd.qcut(Data['SalePrice'], q=20) Data['Quantile_ex_2'].value_counts()
Output:-
(135500.0, 141000.0] 79 (147000.0, 155000.0] 78 (106475.0, 115000.0] 76 (198620.0, 214000.0] 76 (34899.999, 88000.0] 75 (163000.0, 172500.0] 75 (179280.0, 187500.0] 75 (129975.0, 135500.0] 74 (115000.0, 124000.0] 73 (326100.0, 755000.0] 73 (214000.0, 230000.0] 73 (278000.0, 326100.0] 72 (230000.0, 250000.0] 72 (250000.0, 278000.0] 72 (141000.0, 147000.0] 71 (187500.0, 198620.0] 71 (88000.0, 106475.0] 71 (124000.0, 129975.0] 70 (172500.0, 179280.0] 69 (155000.0, 163000.0] 65 Name: Quantile_ex_2, dtype: int64
Again, for q=20 the size of the bins are almost the same.
If we want to label this price data into three different labels like [cheap, moderate, costly]. See the following code to do this.
bin_labels_3 = ['Cheap', 'Moderate', 'Costly'] Data['quantile_ex_3'] = pd.qcut(Data['SalePrice'], q=[0, .3, .6, 1], labels=bin_labels_3) Data.head()
Output:-
SalePrice | Quantile_ex_1 | Quantile_ex_2 | quantile_ex_3 | |
---|---|---|---|---|
0 | 208500 | (163000.0, 214000.0] | (198620.0, 214000.0] | Costly |
1 | 181500 | (163000.0, 214000.0] | (179280.0, 187500.0] | Costly |
2 | 223500 | (214000.0, 755000.0] | (214000.0, 230000.0] | Costly |
3 | 140000 | (129975.0, 163000.0] | (135500.0, 141000.0] | Moderate |
4 | 250000 | (214000.0, 755000.0] | (230000.0, 250000.0] | Costly |
From the “quantile_ex_3” column you can notice that we have labeled the price data into three different categories.
If you want to know the frequency of each category here is the code.
Data['quantile_ex_3'].value_counts()
Output:
Costly 584 Cheap 439 Moderate 437 Name: quantile_ex_3, dtype: int64
If we use label = False in the above code then those three categories will automatically label by 0,1,2. Here is the code
Data['quantile_ex_4'] = pd.qcut(Data['SalePrice'], q=[0, .3, .6, 1], labels=False) Data.head()
Output:
SalePrice | Quantile_ex_1 | Quantile_ex_2 | quantile_ex_3 | quantile_ex_4 | |
---|---|---|---|---|---|
0 | 208500 | (163000.0, 214000.0] | (198620.0, 214000.0] | Costly | 2 |
1 | 181500 | (163000.0, 214000.0] | (179280.0, 187500.0] | Costly | 2 |
2 | 223500 | (214000.0, 755000.0] | (214000.0, 230000.0] | Costly | 2 |
3 | 140000 | (129975.0, 163000.0] | (135500.0, 141000.0] | Moderate | 1 |
4 | 250000 | (214000.0, 755000.0] | (230000.0, 250000.0] | Costly | 2 |
The last column shows that the three categories are labeled by 0,1,2.
cut
If you want the same size for all bins then you should use “cut”. While if you want the same frequency for different bins then you should use “qcut”.
When you use the cut function then you will not get the same frequency for all bins.
Data['Quantile_ex_1'] = pd.cut(Data['SalePrice'], bins=4) Data['Quantile_ex_1'].value_counts()
Output:-
(34179.9, 214925.0] 1100 (214925.0, 394950.0] 330 (394950.0, 574975.0] 25 (574975.0, 755000.0] 5
Again, we divided price data into four different bins. Here, you can notice that here the size of every bins are the same with 180045 but the frequency of each bin is not the same. However, when we use “qcut” function the frequencies are the same for all the bins but the size of the bins is different.
Now, label the data with cut function as we have does with qcut.
cut_labels_3 = ['Cheap', 'Moderate', 'Costly'] cut_bins = [34900 ,226500.0,453000.0,755000] Data['cut_ex1'] = pd.cut(Data['SalePrice'], bins=cut_bins, labels=cut_labels_3)
Output:-
SalePrice | Quantile_ex_1 | cut_ex1 | |
---|---|---|---|
0 | 208500 | (34179.9, 214925.0] | Cheap |
1 | 181500 | (34179.9, 214925.0] | Cheap |
2 | 223500 | (214925.0, 394950.0] | Cheap |
3 | 140000 | (34179.9, 214925.0] | Cheap |
4 | 250000 | (214925.0, 394950.0] | Moderate |
If you want to create 8 bins with equal sizes you can use np.linspace(0, 755000, 9).
Data['Quantile_ex_2'] = pd.cut(Data['SalePrice'], bins=np.linspace(0, 755000, 9)) Data['Quantile_ex_2'].value_counts()
Output:
(94375.0, 188750.0] 860 (188750.0, 283125.0] 366 (0.0, 94375.0] 97 (283125.0, 377500.0] 96 (377500.0, 471875.0] 30 (471875.0, 566250.0] 6 (566250.0, 660625.0] 3 (660625.0, 755000.0] 2 Name: Quantile_ex_2, dtype: int64
Here we create eights bins with the same size.
Now, label the data
Data['cut_ex2'] = pd.cut(Data['SalePrice'], bins=np.linspace(0, 755000, 9),labels = False) Data.head()
Output:
SalePrice | Quantile_ex_1 | cut_ex1 | Quantile_ex_2 | cut_ex2 | |
---|---|---|---|---|---|
0 | 208500 | (34179.9, 214925.0] | Cheap | (188750.0, 283125.0] | 2 |
1 | 181500 | (34179.9, 214925.0] | Cheap | (94375.0, 188750.0] | 1 |
2 | 223500 | (214925.0, 394950.0] | Cheap | (188750.0, 283125.0] | 2 |
3 | 140000 | (34179.9, 214925.0] | Cheap | (94375.0, 188750.0] | 1 |
4 | 250000 | (214925.0, 394950.0] | Moderate | (188750.0, 283125.0] | 2 |
You can notice in the last column with label data.
Instead of np.linspace(0, 755000, 9) you can use pd.interval_range. There is a downside to using interval_range. You can not define custom labels even you try ti define the label.
interval_range = pd.interval_range(start=0, freq=10000, end=755000) Data['cut_ex2'] = pd.cut(Data['SalePrice'], bins=interval_range, labels=[0,1,2,3,4,5,6,7]) Data.head()
Output:-
SalePrice | Quantile_ex_1 | cut_ex1 | Quantile_ex_2 | cut_ex2 | |
---|---|---|---|---|---|
0 | 208500 | (34179.9, 214925.0] | Cheap | (188750.0, 283125.0] | (200000, 210000] |
1 | 181500 | (34179.9, 214925.0] | Cheap | (94375.0, 188750.0] | (180000, 190000] |
2 | 223500 | (214925.0, 394950.0] | Cheap | (188750.0, 283125.0] | (220000, 230000] |
3 | 140000 | (34179.9, 214925.0] | Cheap | (94375.0, 188750.0] | (130000, 140000] |
4 | 250000 | (214925.0, 394950.0] | Moderate | (188750.0, 283125.0] | (240000, 250000] |
In the above code we tried to label data into [0,1,2,3,4,5,6,7] but the last column of the output shows only the ranges, not the label.
Leave a Reply