How to merge two csv files by specific column in Python

In this tutorial, we are going to learn how we can merge two CSV files by specific column in Python using Pandas.

Python is developed as a great tool for data analysis, since the presence of a large number of modules in Python which makes it one of the popular and widely used language for handling and getting insights from data, one such module is Pandas.

Pandas is developed on two different modules of Python(Numpy and Matplotlib) and specially used to deal with heterogeneous data, hence an important tool for data wrangling for analyzing real-time data.

Why do we need to merge two CSV files?

  • Since real-life datasets are quite large and they are generally stored in different CSVs for different categories.
  • For Example, if we need to analyze Police activity against crime we need to analyze each and every factor, whether the criminal was male or female, criminal’s race, weather on that day, time of the crime, etc.
  • So merging data frames is necessary for checking different factors affecting prediction.

Here I am using play store app data CSV, containing details of Apps downloaded from play store and there reviews CSV, you may get this CSV here:
apps.csv
user_reviews.csv
(For practicing there are many open-source datasets available on the web .)

Loading the CSV file as Data-Frame

Important – CSV file must be in your directory!.

#Importing Pandas with alias name pd
import pandas as pd

#Loading CSV Files
apps=pd.read_csv('Apps.csv')
reviews=pd.read_csv('Reviews.csv')

#Exploring the dataset
print(apps.info())
print(reviews.info())

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9659 entries, 0 to 9658
Data columns (total 14 columns):
Unnamed: 0        9659 non-null int64
App               9659 non-null object
Category          9659 non-null object
Rating            8196 non-null float64
Reviews           9659 non-null int64
Size              8432 non-null float64
Installs          9659 non-null object
Type              9659 non-null object
Price             9659 non-null object
Content Rating    9659 non-null object
Genres            9659 non-null object
Last Updated      9659 non-null object
Current Ver       9651 non-null object
Android Ver       9657 non-null object
dtypes: float64(2), int64(2), object(10)
memory usage: 1.0+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64295 entries, 0 to 64294
Data columns (total 5 columns):
App                       64295 non-null object
Translated_Review         37427 non-null object
Sentiment                 37432 non-null object
Sentiment_Polarity        37432 non-null float64
Sentiment_Subjectivity    37432 non-null float64
dtypes: float64(2), object(3)
memory usage: 2.5+ MB
None

It is a good practice to inspect our dataset hence we use the info() method and we can see that both datasets have an ‘App’ column as common so we merge these data set on the ‘App’ column.

DataFrame.merge(file1,file2,how=,on=)

1.Merging on Single column.

‘on=’ argument is used to take column on which we want to merge.

apps_reviews=pd.merge(apps,reviews,how='left',on='App')[['Reviews','Sentiment_Polarity']].head()
apps_reviews

Output:

Reviews	Sentiment_Polarity
0	159	NaN
1	967	-0.250
2	967	-0.725
3	967	0.000
4	967	NaN

I selected only two columns so don’t get confused, because we have 19 columns.

2. We can merge on multiple columns by passing a list of column names to ‘on= ‘ argument.
eg. on=[‘column1’ , ‘column2’],only if both columns are present in both CSVs.

3. We can also merge on column1 of file1 and column2 of file2 by using left_on and right_on argument.

apps_reviews=pd.merge(apps,reviews,how='right',left_on='Reviews',right_on='Sentiment_Polarity')[['Reviews','Sentiment_Polarity']].head()
apps_reviews

Output:

Reviews	Sentiment_Polarity
0	0.0	0.0
1	0.0	0.0
2	0.0	0.0
3	0.0	0.0
4	0.0	0.0

 

4. We can do the following types of merges only:
Using argument ‘how=’

  • ‘left’-All values of left CSV and common values of the right.
  • ‘right’-All values of right CSV and common values of the left.
  • ‘outer’-All values of both CSVs.
  • ‘inner’-Only common values.
apps_reviews=pd.merge(apps,reviews,how='right',on='App')[['Reviews','Sentiment_Polarity']].head()
apps_reviews
  Reviews	Sentiment_Polarity
0	967.0	-0.250
1	967.0	-0.725
2	967.0	0.000
3	967.0	NaN
4	967.0	0.500

 

For further reading:

 

Leave a Reply

Your email address will not be published. Required fields are marked *