pandas.get_dummies in Python

In this tutorial, we will learn how to create dummy variables using get_dummies in Python. This function is useful for preparing categorical data for machine learning algorithms, most of which expect numeric input. It converts the categorical variables in a data frame into dummy (indicator) variables. So, let's begin the tutorial.

Creating Data Frame in Pandas

Here is the sample data frame that we will use to demonstrate the get_dummies function:

import pandas as pd

# Two string columns named '0' and '1'
data1 = {'0': ['1', '2'], '1': ['Hyderabad', 'Delhi']}
d1 = pd.DataFrame(data1)

pandas.get_dummies()

This function takes 8 arguments. Only 'data' is mandatory; the rest are optional. The syntax, along with the default values, is:

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)

Let us look at each argument and what it does.

1) data

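This is the data for which we will be creating the dummy variables. It can be a data frame, a Series, or any array-like object. The following code snippet shows how dummy variables are created using get_dummies(). Each new column is named after the original column and the value it represents. The outputs in this tutorial show 1 and 0, as printed by pandas versions before 2.0; newer versions print True and False by default (see the dtype argument).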

import pandas as pd

data1 = {'0': ['1', '2'], '1': ['Hyderabad', 'Delhi']}
d1 = pd.DataFrame(data1)
# Encode every column of the data frame
print(pd.get_dummies(d1))

OUTPUT:

  0_1  0_2  1_Delhi  1_Hyderabad
0 1    0    0        1
1 0    1    1        0
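The data argument does not have to be a data frame. As a minimal sketch, the snippet below passes a Series of made-up city names. dtype=int is passed here only so that the result prints as 0 and 1 on any pandas version.

```python
import pandas as pd

# A plain Series of labels; the values are made up for illustration.
cities = pd.Series(["Hyderabad", "Delhi", "Hyderabad"])

# One indicator column per distinct value, in sorted order.
# With a Series and no prefix, the column names are the values themselves.
city_dummies = pd.get_dummies(cities, dtype=int)
print(city_dummies)
```

Each row has a 1 in the column matching its original value and 0 elsewhere.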

2) prefix

This prefix is added to the names of the dummy variable columns. By default, this argument is None, in which case the original column names are used as prefixes. It can be passed as a single string (applied to every column), a list of strings (one per encoded column, in order), or a dictionary that maps column names to prefixes.

import pandas as pd

data1 = {'0': ['1', '2'], '1': ['Hyderabad', 'Delhi']}
d1 = pd.DataFrame(data1)
# One prefix per encoded column, in column order
print(pd.get_dummies(d1, prefix=['f', 's']))

OUTPUT:

  f_1  f_2  s_Delhi  s_Hyderabad
0 1    0    0        1
1 0    1    1        0
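The dictionary form of prefix can be sketched in the same way. The prefix names 'id' and 'city' below are hypothetical and chosen only for readability. dtype=int is passed so the values are 0 and 1 on any pandas version.

```python
import pandas as pd

d1 = pd.DataFrame({"0": ["1", "2"], "1": ["Hyderabad", "Delhi"]})

# Map each original column name to the prefix its dummy columns should get.
prefix_map = {"0": "id", "1": "city"}  # hypothetical prefix names
renamed = pd.get_dummies(d1, prefix=prefix_map, dtype=int)
print(renamed.columns.tolist())
```

A dictionary is convenient when the column order may change, because each prefix is tied to a column name rather than a position.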

3) prefix_sep

This argument sets the separator placed between the prefix and the value in each dummy column name. By default, it is '_'. Pass a different string to change it.

import pandas as pd

data1 = {'0': ['1', '2'], '1': ['Hyderabad', 'Delhi']}
d1 = pd.DataFrame(data1)
# Use ':' instead of '_' between prefix and value
print(pd.get_dummies(d1, prefix=['f', 's'], prefix_sep=':'))

OUTPUT:

  f:1 f:2 s:Delhi s:Hyderabad
0 1   0   0       1
1 0   1   1       0
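One practical reason to change the separator is to make the column names easy to take apart again. The sketch below assumes a separator ('::') that does not occur in the data, and splits each dummy column name back into its prefix and value.

```python
import pandas as pd

d1 = pd.DataFrame({"0": ["1", "2"], "1": ["Hyderabad", "Delhi"]})

# "::" is an arbitrary separator, chosen because it does not occur in the data.
dummies = pd.get_dummies(d1, prefix=["f", "s"], prefix_sep="::", dtype=int)

# Split each column name once on the separator to get (prefix, value) pairs.
pairs = [tuple(name.split("::", 1)) for name in dummies.columns]
print(pairs)
```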

4) dummy_na

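By default, this argument is False and missing values (NaN) are ignored, so a row with a missing value gets 0 in every dummy column. If it is set to True, an extra column is created for NaN. The sample data has no missing values, so the new columns below contain only zeros.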

import pandas as pd

data1 = {'0': ['1', '2'], '1': ['Hyderabad', 'Delhi']}
d1 = pd.DataFrame(data1)
# Add a NaN indicator column for each encoded column
print(pd.get_dummies(d1, dummy_na=True))

OUTPUT:

  0_1 0_2 0_nan 1_Delhi 1_Hyderabad 1_nan
0 1   0   0     0       1           0
1 0   1   0     1       0           0
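Because the sample data has no missing values, the effect of dummy_na is easier to see on data that does. Here is a small sketch with a made-up Series that contains one NaN. dtype=int is passed so the values are 0 and 1 on any pandas version.

```python
import numpy as np
import pandas as pd

# Same idea as the sample data, but with one missing city (made-up example).
cities = pd.Series(["Hyderabad", np.nan, "Delhi"])

without_na = pd.get_dummies(cities, dtype=int)
with_na = pd.get_dummies(cities, dummy_na=True, dtype=int)

print(without_na)  # the row with the missing value is all zeros
print(with_na)     # an extra NaN column marks the missing row
```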

5) columns

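This argument specifies the columns for which the dummy variables are to be created. By default, it is None, which means that every column with an object, string, or category data type is converted. If column names are given, dummy variables are created only for those columns and the remaining columns are left unchanged.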

import pandas as pd

data1 = {'0': ['1', '2'], '1': ['Hyderabad', 'Delhi']}
d1 = pd.DataFrame(data1)
# Encode only column '0'; column '1' is kept as it is
print(pd.get_dummies(d1, columns=['0']))

OUTPUT:

  1          0_1 0_2
0 Hyderabad  1   0
1 Delhi      0   1
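The difference between the default and an explicit columns list shows up more clearly when the data frame mixes numeric and text columns. The frame below is hypothetical and is not part of the tutorial's sample data.

```python
import pandas as pd

# Hypothetical frame with one numeric column and two text columns.
df = pd.DataFrame({
    "age": [25, 32],
    "city": ["Hyderabad", "Delhi"],
    "grade": ["A", "B"],
})

# columns=None: the text columns are encoded, the numeric column passes through.
all_text = pd.get_dummies(df, dtype=int)

# columns=[...]: only the listed columns are encoded, even numeric ones.
only_age = pd.get_dummies(df, columns=["age"], dtype=int)

print(all_text.columns.tolist())
print(only_age.columns.tolist())
```

In both cases the columns that are not encoded come first in the result, followed by the new dummy columns.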

6) sparse

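This argument specifies whether the dummy variable columns should be stored as sparse arrays instead of regular NumPy arrays. By default, it is False. The printed values are the same either way; only the storage differs, which can save memory when most of the entries are zero.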

import pandas as pd

data1 = {'0': ['1', '2'], '1': ['Hyderabad', 'Delhi']}
d1 = pd.DataFrame(data1)
# Same values as the default call, stored as sparse columns
print(pd.get_dummies(d1, sparse=True))

OUTPUT:

  0_1 0_2 1_Delhi 1_Hyderabad
0 1   0   0       1
1 0   1   1       0
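Since the printed output looks identical to the default, one way to confirm that sparse=True had an effect is to inspect the column dtypes. This is a small sketch; dtype=int is passed so the values are 0 and 1 on any pandas version.

```python
import pandas as pd

d1 = pd.DataFrame({"0": ["1", "2"], "1": ["Hyderabad", "Delhi"]})

dense = pd.get_dummies(d1, dtype=int)
sparse = pd.get_dummies(d1, sparse=True, dtype=int)

# Each sparse column reports a Sparse[...] dtype.
print(sparse.dtypes)

# Check the dtype of every column programmatically.
is_sparse = [isinstance(t, pd.SparseDtype) for t in sparse.dtypes]
print(is_sparse)

# Converting back to dense storage gives the same values as the default call.
same_values = (sparse.sparse.to_dense().to_numpy() == dense.to_numpy()).all()
print(same_values)
```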

7) drop_first

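This argument removes the first level of each encoded column, so that k categories produce k-1 dummy columns. By default, its value is False. The dropped level is not lost: it is represented by a row of all zeros in the remaining columns. This avoids perfectly correlated columns, which matters for models such as linear regression.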

import pandas as pd

data1 = {'0': ['1', '2'], '1': ['Hyderabad', 'Delhi']}
d1 = pd.DataFrame(data1)
# Drop the first level of each column ('1' and 'Delhi')
print(pd.get_dummies(d1, prefix=['f', 's'], drop_first=True))

OUTPUT:

  f_2 s_Hyderabad
0 0   1
1 1   0
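With only two levels per column, drop_first leaves a single column each, which can hide what is going on. The sketch below uses a made-up Series with three categories to show the k-1 behaviour.

```python
import pandas as pd

# Three made-up categories so the effect is easier to see.
colour = pd.Series(["red", "green", "blue", "green"])

full = pd.get_dummies(colour, dtype=int)
reduced = pd.get_dummies(colour, drop_first=True, dtype=int)

print(full.columns.tolist())     # all 3 levels
print(reduced.columns.tolist())  # 2 levels; "blue" sorts first and is dropped

# The row that held "blue" is now all zeros in the remaining columns.
print(reduced)
```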

8) dtype

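This argument specifies the data type of the values in the dummy variable columns. In pandas versions before 2.0, the default data type is uint8, so the values print as 1 and 0. From pandas 2.0 onwards, the default is bool, so the values print as True and False. Either way, the data type can be set explicitly by passing another data type.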

import pandas as pd

data1 = {'0': ['1', '2'], '1': ['Hyderabad', 'Delhi']}
d1 = pd.DataFrame(data1)
# Store the dummy values as floats
print(pd.get_dummies(d1, dtype='float'))

OUTPUT:

  0_1 0_2  1_Delhi 1_Hyderabad
0 1.0 0.0  0.0     1.0
1 0.0 1.0  1.0     0.0
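To get the same result regardless of the installed pandas version, the data type can be set explicitly. The following sketch compares integer and boolean dummies on the sample data.

```python
import pandas as pd

d1 = pd.DataFrame({"0": ["1", "2"], "1": ["Hyderabad", "Delhi"]})

# Integer dummies print as 0 and 1.
as_int = pd.get_dummies(d1, dtype=int)

# Boolean dummies print as False and True.
as_bool = pd.get_dummies(d1, dtype=bool)

print(as_int.dtypes)
print(as_bool.dtypes)
```

Integer or float dummies are a common choice when the result is passed to a numerical library that expects numbers rather than booleans.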

Putting everything together, the code is:

import pandas as pd

data1 = {'0': ['1', '2'], '1': ['Hyderabad', 'Delhi']}
d1 = pd.DataFrame(data1)
print(d1)
print(pd.get_dummies(d1))
print(pd.get_dummies(d1, prefix=['f', 's']))
print(pd.get_dummies(d1, prefix=['f', 's'], prefix_sep=':'))
print(pd.get_dummies(d1, dummy_na=True))
print(pd.get_dummies(d1, columns=['0']))
print(pd.get_dummies(d1, sparse=True))
print(pd.get_dummies(d1, prefix=['f', 's'], drop_first=True))
print(pd.get_dummies(d1, dtype='float'))
