pandas.get_dummies in Python
In this tutorial, we will learn how to create dummy variables using get_dummies in Python. The method converts categorical variables in a data frame into dummy (indicator) variables, with one column per category. This is useful because most machine learning algorithms expect numeric input rather than text labels. So, let's begin the tutorial.
Creating Data Frame in Pandas
Here is a sample data frame that we are creating to demonstrate the get_dummies method:
import pandas as p

data1 = {
    '0': ['1', '2'],
    '1': ['Hyderabad', 'Delhi'],
}
d1 = p.DataFrame(data1)
pandas.get_dummies()
This method has 8 arguments. The full signature is:
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
Only the 'data' argument is mandatory; the others are optional. Let us look at each argument and what it does.
1) data
This is the data for which we will be creating the dummy variables. It can be an array-like object, a Series, or a DataFrame. The following code snippet shows how dummy variables are created using the get_dummies() method:
import pandas as p

data1 = {
    '0': ['1', '2'],
    '1': ['Hyderabad', 'Delhi'],
}
d1 = p.DataFrame(data1)
print(p.get_dummies(d1))
OUTPUT (on pandas versions before 2.0; newer versions print True/False in place of 1/0):
   0_1  0_2  1_Delhi  1_Hyderabad
0    1    0        0            1
1    0    1        1            0
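The data argument is not limited to data frames. As a minimal sketch, assuming a pandas Series is passed instead (the names `cities` and `encoded` are made up for this example), each distinct value becomes its own column and every row has exactly one column set:

```python
import pandas as pd

# A single categorical column, passed as a Series rather than a DataFrame.
cities = pd.Series(['Hyderabad', 'Delhi', 'Hyderabad'])

# dtype=int forces 0/1 output regardless of the pandas version in use.
encoded = pd.get_dummies(cities, dtype=int)

print(encoded)

# Each row marks exactly one category, so every row sums to 1.
row_sums = encoded.sum(axis=1).tolist()
print(row_sums)
```

With a Series and no prefix, the new columns are named after the category values themselves.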
2) prefix
This prefix is added to the column names of the dummy variables. By default, this argument is 'None', in which case the original column name is used as the prefix. It can be passed as a string, a list of strings, or a dictionary mapping column names to prefixes. When a list is passed, its length must match the number of columns being encoded.
import pandas as p

data1 = {
    '0': ['1', '2'],
    '1': ['Hyderabad', 'Delhi'],
}
d1 = p.DataFrame(data1)
print(p.get_dummies(d1, prefix=['f', 's']))
OUTPUT:
   f_1  f_2  s_Delhi  s_Hyderabad
0    1    0        0            1
1    0    1        1            0
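The dictionary form of prefix can be easier to read than a list, because each prefix is tied to a column name rather than to a position. A short sketch, where the prefixes 'num' and 'city' are made up for illustration:

```python
import pandas as pd

# Same sample frame as in the tutorial.
d1 = pd.DataFrame({'0': ['1', '2'], '1': ['Hyderabad', 'Delhi']})

# Map each original column name to the prefix its dummy columns should get.
encoded = pd.get_dummies(d1, prefix={'0': 'num', '1': 'city'})

column_names = list(encoded.columns)
print(column_names)
```

The resulting column names should be num_1, num_2, city_Delhi and city_Hyderabad.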
3) prefix_sep
This argument sets the separator placed between the prefix and the category value in each dummy column name. By default, it is '_'. Pass any other string to change it.
import pandas as p

data1 = {
    '0': ['1', '2'],
    '1': ['Hyderabad', 'Delhi'],
}
d1 = p.DataFrame(data1)
print(p.get_dummies(d1, prefix=['f', 's'], prefix_sep=':'))
OUTPUT:
   f:1  f:2  s:Delhi  s:Hyderabad
0    1    0        0            1
1    0    1        1            0
4) dummy_na
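By default, this argument is 'False', and missing values (NaN) are ignored. If it is set to 'True', an extra column is added for each encoded variable to indicate NaN. The sample data has no missing values, so the NaN columns in the output are all 0.
import pandas as p

data1 = {
    '0': ['1', '2'],
    '1': ['Hyderabad', 'Delhi'],
}
d1 = p.DataFrame(data1)
print(p.get_dummies(d1, dummy_na=True))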
OUTPUT:
   0_1  0_2  0_nan  1_Delhi  1_Hyderabad  1_nan
0    1    0      0        0            1      0
1    0    1      0        1            0      0
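The effect of dummy_na is easier to see when the data actually contains a missing value. The following is a small sketch under that assumption. The Series `cities` and the prefix 'city' are invented for the example:

```python
import numpy as np
import pandas as pd

# One value is missing on purpose.
cities = pd.Series(['Hyderabad', np.nan, 'Delhi'])

# Default behaviour: the missing row gets 0 in every dummy column.
without_nan = pd.get_dummies(cities, prefix='city', dtype=int)

# dummy_na=True: the missing row is flagged in its own column.
with_nan = pd.get_dummies(cities, prefix='city', dummy_na=True, dtype=int)

print(without_nan)
print(with_nan)
```

Without dummy_na, a missing value cannot be told apart from a row that simply belongs to none of the listed categories. The extra column keeps that information.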
5) columns
This argument specifies the columns for which the dummy variables are to be created. By default, it is 'None', and all columns with object, string, or category dtype are converted. If column names are specified, dummy variables are created only for those columns and the remaining columns are left unchanged.
import pandas as p

data1 = {
    '0': ['1', '2'],
    '1': ['Hyderabad', 'Delhi'],
}
d1 = p.DataFrame(data1)
print(p.get_dummies(d1, columns=['0']))
OUTPUT:
           1  0_1  0_2
0  Hyderabad    1    0
1      Delhi    0    1
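The columns argument also matters when a categorical variable is stored as numbers, since numeric columns are not encoded by default. A minimal sketch, assuming a hypothetical frame `df` with an integer-coded 'zone' column:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Hyderabad', 'Delhi', 'Delhi'],
    'zone': [1, 2, 1],  # integer codes, not strings
})

# Default: only the text column is encoded; 'zone' passes through as-is.
default_encoded = pd.get_dummies(df)

# Naming the columns explicitly encodes the integer column as well.
both_encoded = pd.get_dummies(df, columns=['city', 'zone'])

print(list(default_encoded.columns))
print(list(both_encoded.columns))
```

Columns that are not encoded are placed first in the result, followed by the dummy columns.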
6) sparse
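This argument specifies whether the dummy columns should be stored as sparse arrays ('True') or as regular NumPy arrays ('False'). By default, it is 'False'. The printed values are the same either way; only the underlying storage differs, which can save memory when there are many categories.
import pandas as p

data1 = {
    '0': ['1', '2'],
    '1': ['Hyderabad', 'Delhi'],
}
d1 = p.DataFrame(data1)
print(p.get_dummies(d1, sparse=True))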
OUTPUT:
   0_1  0_2  1_Delhi  1_Hyderabad
0    1    0        0            1
1    0    1        1            0
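Because the printed frame looks identical with and without sparse=True, the difference only shows up in the column dtypes. One way to check this, as a sketch using the same sample frame (the variable names are made up):

```python
import pandas as pd

d1 = pd.DataFrame({'0': ['1', '2'], '1': ['Hyderabad', 'Delhi']})

dense = pd.get_dummies(d1)
sparse = pd.get_dummies(d1, sparse=True)

# The dtypes reveal the storage: sparse columns report a Sparse[...] dtype.
print(dense.dtypes)
print(sparse.dtypes)

dense_is_sparse = [isinstance(t, pd.SparseDtype) for t in dense.dtypes]
sparse_is_sparse = [isinstance(t, pd.SparseDtype) for t in sparse.dtypes]
```

On a two-row frame the memory saving is negligible. The option is more likely to pay off when a column has many distinct values, since most dummy entries are then zero.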
7) drop_first
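This argument drops the first level of each encoded column, so a column with k categories produces k-1 dummy columns instead of k. By default, its value is 'False'. The dropped level is still recoverable: it is the case where all remaining dummies for that column are 0. This avoids redundant, perfectly collinear columns in models such as linear regression. In the output below, f_1 and s_Delhi have been dropped.
import pandas as p

data1 = {
    '0': ['1', '2'],
    '1': ['Hyderabad', 'Delhi'],
}
d1 = p.DataFrame(data1)
print(p.get_dummies(d1, prefix=['f', 's'], drop_first=True))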
OUTPUT:
   f_2  s_Hyderabad
0    0            1
1    1            0
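With only two categories per column, drop_first leaves a single dummy column, which can hide the general pattern. A sketch with three categories, using an invented Series `colors`, shows that one column is removed and the dropped category is encoded as a row of zeros:

```python
import pandas as pd

colors = pd.Series(['red', 'green', 'blue', 'green'])

# All three levels: blue, green, red (levels are sorted).
full = pd.get_dummies(colors, dtype=int)

# First level ('blue') dropped: only green and red remain.
reduced = pd.get_dummies(colors, drop_first=True, dtype=int)

print(full)
print(reduced)
```

In `reduced`, the row for 'blue' has 0 in both remaining columns.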
8) dtype
This argument specifies the data type of the dummy columns. In pandas versions before 2.0 the default is uint8, which is what the outputs in this tutorial show; from pandas 2.0 onward the default is bool. It can be changed explicitly by passing another data type.
import pandas as p

data1 = {
    '0': ['1', '2'],
    '1': ['Hyderabad', 'Delhi'],
}
d1 = p.DataFrame(data1)
print(p.get_dummies(d1, dtype='float'))
OUTPUT:
   0_1  0_2  1_Delhi  1_Hyderabad
0  1.0  0.0      0.0          1.0
1  0.0  1.0      1.0          0.0
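Since the default dtype depends on the installed pandas version, passing dtype explicitly is one way to get the same result everywhere. A minimal sketch comparing the default with explicit integer and float output (variable names are made up):

```python
import pandas as pd

d1 = pd.DataFrame({'0': ['1', '2'], '1': ['Hyderabad', 'Delhi']})

as_default = pd.get_dummies(d1)            # uint8 or bool, depending on version
as_int = pd.get_dummies(d1, dtype=int)     # always integer 0/1
as_float = pd.get_dummies(d1, dtype=float) # always 0.0/1.0

print(as_default.dtypes.unique())
print(as_int.dtypes.unique())
print(as_float.dtypes.unique())

int_values = as_int.to_numpy().tolist()
```

Explicit integer output is a reasonable choice when later code does arithmetic on the dummy columns or expects 0/1 rather than True/False.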
Putting everything together, the code is:
import pandas as p

data1 = {
    '0': ['1', '2'],
    '1': ['Hyderabad', 'Delhi'],
}
d1 = p.DataFrame(data1)
print(d1)
print(p.get_dummies(d1))
print(p.get_dummies(d1, prefix=['f', 's']))
print(p.get_dummies(d1, prefix=['f', 's'], prefix_sep=':'))
print(p.get_dummies(d1, dummy_na=True))
print(p.get_dummies(d1, columns=['0']))
print(p.get_dummies(d1, sparse=True))
print(p.get_dummies(d1, prefix=['f', 's'], drop_first=True))
print(p.get_dummies(d1, dtype='float'))