Working with Text Data in pandas (Python)
Like most programming languages, Python distinguishes between several data types. For data analysis, two broad kinds matter most:
- Numeric Data
- Text Data
Data types play a major role in any kind of analysis. pandas is a fast, powerful, and easy-to-use Python library for working with data.
What is text type data in Python?
Text data means strings (str) in Python, which pandas stores with the object dtype by default or with the dedicated string dtype. A string can hold characters that look like any other kind of value, such as an integer, a float (decimal), or a Boolean. To the Python interpreter, anything enclosed in quotation marks (" " or ' ') is a string.
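As a small illustration of why this matters, the sketch below contrasts joining two strings with adding the numbers they represent. The values '10' and '2.5' are made up for this example.

```python
# Two values that look numeric but are stored as text
a = '10'
b = '2.5'

# The + operator joins strings instead of adding them
joined = a + b

# Converting to numeric types first gives arithmetic addition
total = int(a) + float(b)

print(joined)  # 102.5
print(total)   # 12.5
```

So a value read in as text usually has to be converted before it can be used in calculations.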
The type of a value can be found with Python's built-in type function.
Syntax: type(variable_name)
a = '10'
b = '2.98'
char = 'Hi'
print(type(a), type(b), type(char))
Output:
<class 'str'> <class 'str'> <class 'str'>
Pandas in Python:
pandas is a high-level data manipulation tool. It is built on the NumPy package, and its key data structure is called the DataFrame. DataFrames allow the user to store and manipulate data in the form of tables.
Importing pandas:
import pandas as pd
How to work on text data with pandas
The examples below use the pandas package to work with text.
How to create a Series with pandas:
A Series is a one-dimensional labeled array. It can hold data of any type and can be compared to a column in an Excel sheet. The index gives access to the individual values of the Series.
series = pd.Series(['x', 'y', 'z'], dtype='string')
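Access through the index can be sketched as follows. The labels 'a', 'b', and 'c' are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Without an explicit index, pandas labels the values 0, 1, 2
s = pd.Series(['x', 'y', 'z'], dtype='string')
first = s[0]  # lookup by the default integer label

# A Series with custom (hypothetical) labels
labeled = pd.Series(['x', 'y', 'z'], index=['a', 'b', 'c'], dtype='string')
by_label = labeled['b']        # lookup by label
by_position = labeled.iloc[2]  # lookup by position

print(first, by_label, by_position)  # x y z
```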
How to change the type of a variable:
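The astype method changes the data type of a pandas object such as a Series or a DataFrame. A plain Python integer has no astype method, so the value is wrapped in a Series first.
Syntax: variable_name.astype('type')
a = pd.Series([10])
a.astype('string')
Output:
0    10
dtype: string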
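A common use of astype is turning numbers stored as text into real numbers, and back again. The following is a minimal sketch; the Series name prices and its values are made up for this example.

```python
import pandas as pd

# Numbers that arrived as text (hypothetical data)
prices = pd.Series(['10', '20', '30'])

# Convert the text to 64-bit integers so arithmetic works
as_int = prices.astype('int64')
total = as_int.sum()

# Convert the integers back to the pandas string dtype
back = as_int.astype('string')

print(total)  # 60
```

If any value cannot be parsed as an integer, astype raises an error, so this approach assumes the text is clean.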
How to create a text DataFrame with Pandas
DataFrame from list variable:
import pandas as pd

# First create a list of strings
lst = ['Hi', 'this', 'is', 'an', 'Article', 'on pandas']

# Then pass the list to the DataFrame constructor of pandas
dataframe = pd.DataFrame(lst)

# astype returns a new DataFrame, so assign the result back
dataframe = dataframe.astype('string')
print(dataframe)
Output:
           0
0         Hi
1       this
2         is
3         an
4    Article
5  on pandas
DataFrame from a dictionary:
# First create a dictionary
dictionary = {'Name': ['Anish', 'Kumar'], 'Age': [20, 30]}

# Pass the dictionary to the DataFrame constructor of pandas
dataframe = pd.DataFrame(dictionary)
print(dataframe)
Output:
    Name  Age
0  Anish   20
1  Kumar   30
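When a DataFrame mixes text and numeric columns, only the text column needs the string dtype. The sketch below reuses the Name and Age columns from above and checks each column with helper functions from pandas.api.types.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Anish', 'Kumar'], 'Age': [20, 30]})

# Convert only the text column to the dedicated string dtype
df['Name'] = df['Name'].astype('string')

# Check which columns hold text and which hold numbers
name_is_text = pd.api.types.is_string_dtype(df['Name'])
age_is_numeric = pd.api.types.is_numeric_dtype(df['Age'])

print(name_is_text, age_is_numeric)  # True True
```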
How to change the case of the data:
Two case conversions are covered here:
- lower case
- upper case
Lower case conversion:
The str.lower method converts the text in a pandas Series to lower case.
Syntax: series_name.str.lower()
s = pd.Series(['A', 'B', 'C', 'dog', 'cat'], dtype="string")

# Convert the text in the Series to lower case
s.str.lower()
Output:
0      a
1      b
2      c
3    dog
4    cat
dtype: string
Upper case conversion:
The str.upper method converts the text in a pandas Series to upper case.
Syntax: series_name.str.upper()
s = pd.Series(['A', 'B', 'C', 'dog', 'cat'], dtype="string")

# Convert the text in the Series to upper case
s.str.upper()
Output:
0      A
1      B
2      C
3    DOG
4    CAT
dtype: string
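One practical reason to change case is to compare text without regard to capitalization. The sketch below is one way to do that, using made-up values; it also shows str.title, a related pandas method that capitalizes the first letter of each word.

```python
import pandas as pd

# The same word written with different capitalization (hypothetical data)
s = pd.Series(['Dog', 'DOG', 'dog', 'Cat'], dtype='string')

# Normalize to lower case before comparing
normalized = s.str.lower()
is_dog = normalized == 'dog'
dog_count = int(is_dog.sum())

# Title case: first letter upper case, the rest lower case
titled = s.str.title()

print(dog_count)  # 3
```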
How to find the length:
The str.len method returns the length of each text value in the Series.
Syntax: series_name.str.len()
s = pd.Series(['A', 'B', 'C', 'dog', 'cat'], dtype="string")
s.str.len()
Output:
0    1
1    1
2    1
3    3
4    3
dtype: Int64
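The lengths can also be used to filter a Series. As a minimal sketch on the same data, the following keeps only the entries longer than one character.

```python
import pandas as pd

s = pd.Series(['A', 'B', 'C', 'dog', 'cat'], dtype='string')

# Length of every entry
lengths = s.str.len()

# Keep only the entries with more than one character
long_words = s[lengths > 1]

print(long_words.tolist())  # ['dog', 'cat']
```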
Encoding & Decoding
One way to encode and decode the text data of a DataFrame is the LabelEncoder class, which is part of the scikit-learn library for Python.
LabelEncoder is a utility class that normalizes labels so that they contain only values between 0 and n_classes-1.
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

# Fit the encoder on the given labels
le = le.fit(["paris", "paris", "tokyo", "amsterdam"])

# The learned classes are sorted and numbered from 0 to n_classes-1
classes = le.classes_.tolist()
print(classes)

# Transform the text into encoded numbers
encode = le.transform(["tokyo", "tokyo", "paris"])
print(encode)

# Transform the encoded numbers back into the original text
decode = le.inverse_transform([2, 2, 1]).tolist()
print(decode)
Output:
['amsterdam', 'paris', 'tokyo']
[2 2 1]
['tokyo', 'tokyo', 'paris']
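If scikit-learn is not available, a similar integer encoding can be sketched with pandas alone. This swaps LabelEncoder for the pandas category dtype, which is a different technique that happens to give the same sorted numbering for these labels; the variable names are made up for this example.

```python
import pandas as pd

cities = pd.Series(['paris', 'paris', 'tokyo', 'amsterdam'])

# Converting to the category dtype sorts the unique labels
cat = cities.astype('category')
classes = cat.cat.categories.tolist()

# Each value is replaced by the position of its label in classes
codes = cat.cat.codes.tolist()

# Decode by looking each code up in the list of classes
decoded = [classes[c] for c in codes]

print(classes)  # ['amsterdam', 'paris', 'tokyo']
print(codes)    # [1, 1, 2, 0]
```

Unlike a fitted LabelEncoder, this sketch does not store the mapping for reuse on new data, so the list of classes would have to be kept and applied separately.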