How to merge two csv files by specific column in Python
In this tutorial, we are going to learn how we can merge two CSV files by specific column in Python using Pandas.
Python is developed as a great tool for data analysis, since the presence of a large number of modules in Python which makes it one of the popular and widely used language for handling and getting insights from data, one such module is Pandas.
Pandas is developed on two different modules of Python(Numpy and Matplotlib) and specially used to deal with heterogeneous data, hence an important tool for data wrangling for analyzing real-time data.
Why do we need to merge two CSV files?
- Since real-life datasets are quite large and they are generally stored in different CSVs for different categories.
- For Example, if we need to analyze Police activity against crime we need to analyze each and every factor, whether the criminal was male or female, criminal’s race, weather on that day, time of the crime, etc.
- So merging data frames is necessary for checking different factors affecting prediction.
Here I am using play store app data CSV, containing details of Apps downloaded from play store and there reviews CSV, you may get this CSV here:
apps.csv
user_reviews.csv
(For practicing there are many open-source datasets available on the web .)
Loading the CSV file as Data-Frame
Important – CSV file must be in your directory!.
#Importing Pandas with alias name pd import pandas as pd #Loading CSV Files apps=pd.read_csv('Apps.csv') reviews=pd.read_csv('Reviews.csv') #Exploring the dataset print(apps.info()) print(reviews.info())
Output:
<class 'pandas.core.frame.DataFrame'> RangeIndex: 9659 entries, 0 to 9658 Data columns (total 14 columns): Unnamed: 0 9659 non-null int64 App 9659 non-null object Category 9659 non-null object Rating 8196 non-null float64 Reviews 9659 non-null int64 Size 8432 non-null float64 Installs 9659 non-null object Type 9659 non-null object Price 9659 non-null object Content Rating 9659 non-null object Genres 9659 non-null object Last Updated 9659 non-null object Current Ver 9651 non-null object Android Ver 9657 non-null object dtypes: float64(2), int64(2), object(10) memory usage: 1.0+ MB None <class 'pandas.core.frame.DataFrame'> RangeIndex: 64295 entries, 0 to 64294 Data columns (total 5 columns): App 64295 non-null object Translated_Review 37427 non-null object Sentiment 37432 non-null object Sentiment_Polarity 37432 non-null float64 Sentiment_Subjectivity 37432 non-null float64 dtypes: float64(2), object(3) memory usage: 2.5+ MB None
It is a good practice to inspect our dataset hence we use the info() method and we can see that both datasets have an ‘App’ column as common so we merge these data set on the ‘App’ column.
DataFrame.merge(file1,file2,how=,on=)
1.Merging on Single column.
‘on=’ argument is used to take column on which we want to merge.
apps_reviews=pd.merge(apps,reviews,how='left',on='App')[['Reviews','Sentiment_Polarity']].head() apps_reviews
Output:
Reviews Sentiment_Polarity 0 159 NaN 1 967 -0.250 2 967 -0.725 3 967 0.000 4 967 NaN
I selected only two columns so don’t get confused, because we have 19 columns.
2. We can merge on multiple columns by passing a list of column names to ‘on= ‘ argument.
eg. on=[‘column1’ , ‘column2’],only if both columns are present in both CSVs.
3. We can also merge on column1 of file1 and column2 of file2 by using left_on and right_on argument.
apps_reviews=pd.merge(apps,reviews,how='right',left_on='Reviews',right_on='Sentiment_Polarity')[['Reviews','Sentiment_Polarity']].head() apps_reviews
Output:
Reviews Sentiment_Polarity 0 0.0 0.0 1 0.0 0.0 2 0.0 0.0 3 0.0 0.0 4 0.0 0.0
4. We can do the following types of merges only:
Using argument ‘how=’
- ‘left’-All values of left CSV and common values of the right.
- ‘right’-All values of right CSV and common values of the left.
- ‘outer’-All values of both CSVs.
- ‘inner’-Only common values.
apps_reviews=pd.merge(apps,reviews,how='right',on='App')[['Reviews','Sentiment_Polarity']].head() apps_reviews
Reviews Sentiment_Polarity 0 967.0 -0.250 1 967.0 -0.725 2 967.0 0.000 3 967.0 NaN 4 967.0 0.500
For further reading:
Leave a Reply