Working with Text Data in pandas Python

Like most programming languages, Python works with two broad kinds of data:

  1. Numeric Data
  2. Text Data

Data types play a major role in any kind of analysis. pandas is a fast, powerful, and easy-to-use Python library for working with data.

What is text type data in Python?

Text data is represented as strings (str) in Python and has traditionally been stored with the object dtype in pandas. A string can contain any characters, including ones that look like an integer, a float (decimal), or a Boolean. To the Python interpreter, anything enclosed in quotation marks (" " or ' ') is a string.

The type of a value can be found with Python's built-in type function.

Syntax: type(variable_name)

 

# All three values are quoted, so all three are strings
a = '10'
b = '2.98'
c = 'Hi'
print(type(a), type(b), type(c))

Output:

 <class 'str'> <class 'str'> <class 'str'>
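
Because '10' and '2.98' are strings, operators treat them as text rather than as numbers. The short sketch below reuses the values from the example above to show the difference, and how the built-in int and float functions turn such text into numbers.

```python
a = '10'
b = '2.98'

# + concatenates strings instead of adding them
joined = a + b
print(joined)           # 102.98

# Convert to numbers first to do arithmetic
total = int(a) + float(b)
print(round(total, 2))  # 12.98
```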

Pandas in Python:

pandas is a high-level data manipulation tool. It is built on the NumPy package, and its key data structure is the DataFrame. DataFrames let the user store and manipulate data in the form of tables.

Importing pandas:

import pandas as pd

 

How to work on text data with pandas

Python can handle individual strings on its own, but working with whole columns of text is much easier with the pandas package.

How to create a Series with pandas:

A Series is a one-dimensional labeled array in pandas. It can hold data of any type and can be compared to a column in an Excel sheet. The index is used to access the data in a Series.

series = pd.Series(['x', 'y', 'z'], dtype='string')
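
To illustrate how the index gives access to the data, here is a minimal sketch. The labels 'a', 'b', and 'c' are made up for this example; without an index argument, pandas numbers the rows 0, 1, 2.

```python
import pandas as pd

# Labels 'a', 'b', 'c' are illustrative; they are not part of the original example
series = pd.Series(['x', 'y', 'z'], index=['a', 'b', 'c'], dtype='string')

by_label = series['b']        # look up by index label
by_position = series.iloc[0]  # look up by integer position

print(by_label, by_position)  # y x
```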

How to change the type of a variable:

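The astype method changes the dtype of a pandas Series or DataFrame. Plain Python values such as an int do not have an astype method, so the value must be held in a Series first.

Syntax: series_name.astype('type')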

 

a = pd.Series([10])
# astype returns a new Series; the original is left unchanged
a = a.astype('string')
print(a.dtype)

Output:

string
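
The conversion also works in the other direction. As a small sketch with made-up values, pd.to_numeric turns a Series of numeric-looking text back into numbers so that arithmetic works again.

```python
import pandas as pd

# Illustrative values: numbers stored as text
prices = pd.Series(['10', '20', '30'], dtype='string')

numbers = pd.to_numeric(prices)  # text -> numeric dtype
total = int(numbers.sum())
print(total)                     # 60
```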

How to create a text DataFrame with Pandas

DataFrame from a list:

import pandas as pd

# First create a list of strings
lst = ['Hi', 'this', 'is', 'an', 'Article', 'on pandas']

# Then pass the list to the DataFrame constructor of pandas
dataframe = pd.DataFrame(lst)
# astype returns a new DataFrame, so assign the result back
dataframe = dataframe.astype('string')
print(dataframe)

Output:

           0
0         Hi
1       this
2         is
3         an
4    Article
5  on pandas

DataFrame from a dictionary:

# First create a dictionary: keys become column names, lists become column values
data = {'Name': ['Anish', 'Kumar'],
        'Age': [20, 30]}
# Pass the dictionary to the DataFrame constructor of pandas
dataframe = pd.DataFrame(data)
print(dataframe)

Output:

    Name  Age
0  Anish   20
1  Kumar   30
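
In a mixed table like this one, only the text column needs the string dtype. The following sketch, which assumes the same two-row table as above, converts the Name column alone and then uses a string method on it.

```python
import pandas as pd

dataframe = pd.DataFrame({'Name': ['Anish', 'Kumar'], 'Age': [20, 30]})

# Convert only the text column; Age stays numeric
dataframe['Name'] = dataframe['Name'].astype('string')

upper_names = dataframe['Name'].str.upper().tolist()
print(upper_names)  # ['ANISH', 'KUMAR']
```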

How to change the case of the data:

Two kinds of case conversion are covered here:

  1. lower case
  2. upper case

Lower case conversion:

The str.lower method converts the text in a pandas Series to lower case.

Syntax: series_name.str.lower()

s = pd.Series(['A', 'B', 'C', 'dog', 'cat'], dtype="string")
# Convert the text in the Series to lower case
s.str.lower()

Output:

0      a
1      b
2      c
3    dog
4    cat
dtype: string
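
A common use of lower-case conversion is comparing text without regard to case. This is a minimal sketch with made-up values; it lower-cases the Series before comparing it with a lower-case target.

```python
import pandas as pd

# Illustrative values with inconsistent capitalization
animals = pd.Series(['Dog', 'DOG', 'cat'], dtype="string")

# Lower-case first so that 'Dog' and 'DOG' both match 'dog'
is_dog = (animals.str.lower() == 'dog').tolist()
print(is_dog)  # [True, True, False]
```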

Upper case conversion:

The str.upper method converts the text in a pandas Series to upper case.

Syntax: series_name.str.upper()

s = pd.Series(['A', 'B', 'C', 'dog', 'cat'], dtype="string")
# Convert the text in the Series to upper case
s.str.upper()

Output:

0      A
1      B
2      C
3    DOG
4    CAT
dtype: string
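
Besides str.lower and str.upper, the pandas str accessor offers a few related case methods. The sketch below, using the same Series, shows str.capitalize and str.swapcase as two examples.

```python
import pandas as pd

s = pd.Series(['A', 'B', 'C', 'dog', 'cat'], dtype="string")

capitalized = s.str.capitalize().tolist()  # first character upper, rest lower
swapped = s.str.swapcase().tolist()        # flip the case of every character

print(capitalized)  # ['A', 'B', 'C', 'Dog', 'Cat']
print(swapped)      # ['a', 'b', 'c', 'DOG', 'CAT']
```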

How to find the length:

The str.len method returns the length of each string in the Series.

Syntax: series_name.str.len()

s = pd.Series(['A', 'B', 'C','dog', 'cat'],dtype="string")
s.str.len()

Output:

0    1
1    1
2    1
3    3
4    3
dtype: Int64
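
The lengths can also be used to filter a Series. The following sketch uses made-up values, including one missing entry, and keeps only the strings longer than one character. Filling missing lengths with 0 is one possible choice here, not the only one.

```python
import pandas as pd

# Illustrative values, including one missing entry
s = pd.Series(['A', 'dog', pd.NA, 'cat'], dtype="string")

lengths = s.str.len()  # the missing value stays missing (<NA>)

# Treat missing lengths as 0 so the comparison yields plain True/False
long_words = s[lengths.fillna(0) > 1].tolist()
print(long_words)      # ['dog', 'cat']
```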

Encoding & Decoding

Encoding text labels as numbers, and decoding them back, can be done with the LabelEncoder class from the scikit-learn library.

LabelEncoder is a utility class that normalizes labels so that they contain only values between 0 and n_classes-1.

from sklearn import preprocessing
le = preprocessing.LabelEncoder()

# Fit the encoder on the given labels
le = le.fit(["paris", "paris", "tokyo", "amsterdam"])

# Print the classes; their positions 0 to n_classes-1 are the codes
# ("class" is a reserved word in Python, so another name is used)
classes = le.classes_.tolist()
print(classes)

# Transform the text into encoded numbers
encode = le.transform(["tokyo", "tokyo", "paris"])
print(encode)

# Transform the encoded numbers back into the original text
decode = le.inverse_transform([2, 2, 1]).tolist()
print(decode)

Output:

['amsterdam', 'paris', 'tokyo']
[2 2 1]
['tokyo', 'tokyo', 'paris']
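
When scikit-learn is not available, a similar encoding can be sketched with pandas alone. This uses the category dtype, a different technique from LabelEncoder, although for these labels it assigns the same codes because the categories are sorted alphabetically.

```python
import pandas as pd

cities = pd.Series(["paris", "paris", "tokyo", "amsterdam"])

# Categories are sorted, so amsterdam=0, paris=1, tokyo=2
as_category = cities.astype('category')
codes = as_category.cat.codes.tolist()
categories = as_category.cat.categories.tolist()

# Decode by looking each code up in the category list
decoded = [categories[c] for c in codes]

print(codes)    # [1, 1, 2, 0]
print(decoded)  # ['paris', 'paris', 'tokyo', 'amsterdam']
```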
