Movie Recommendation Using Collaborative Filtering with Machine Learning in Python
Table of contents:
- What are recommendation systems?
- Types of Recommendation systems
- Collaborative Filtering for Movie Recommendation
Let’s start with what a recommendation system is.
Recommendation Systems: A recommendation system suggests the items that are most relevant to a user, whether he or she is buying books or watching movies online. It runs in the background, looks at your behavior, and suggests items that you are most likely to engage with. Most major platforms that we use today rely on recommendation systems to improve the user experience. Recommendation systems are usually divided into two types:
- Content-based systems.
- Collaborative filtering systems.
Content-based systems use the features and attributes of the items themselves. For example, to recommend movies to a user, we look at the movies he or she liked in the past and recommend others with similar tags, directors, and actors.
In collaborative filtering systems, we recommend movies to a user based on ratings: the ratings he or she gave to movies already watched, combined with the ratings that other users gave.
Algorithm:
- Collect the data
- Standardize the collected data
- Fill null (NaN) values with zero.
- Find the similarity between the movies using correlation.
- Get similar movies.
I am using a Jupyter notebook. Here is a quick introduction for those who have not used one before: instead of writing scripts in a text editor or an IDE, many data scientists prefer a notebook because it adds interactivity and makes it easier to share work with others. Before starting your code, you must import the libraries you need, such as pandas.
import pandas as pd
I’ve downloaded a dataset containing movies and ratings from this zip file. First, download the zip file and unzip it. I am importing the CSV files movies.csv and ratings.csv.
# Raw strings (r"...") keep the backslashes literal; in a normal string, "\r" is a carriage return
movies_data = pd.read_csv(r"Downloads\movies.csv")
ratings_data = pd.read_csv(r"Downloads\ratings.csv")
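A note on the file paths: in a regular Python string, a backslash followed by certain letters is an escape sequence, so a Windows-style path can silently change. The small sketch below shows the effect and two common ways around it. The Downloads folder name is just the one used in this post, and no file is actually opened.

```python
from pathlib import Path

# In a normal string, "\r" is read as a single carriage-return character.
bad = "Downloads\ratings.csv"

# A raw string keeps the backslash and the "r" as two separate characters.
good = r"Downloads\ratings.csv"

# pathlib builds the path with the right separator for the current OS.
portable = Path("Downloads") / "ratings.csv"

print(len(bad), len(good))  # the normal string is one character shorter
print(portable.name)
```

Either the raw string or the pathlib version can be passed to pd.read_csv.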
Now, merge both datasets into one on movieId.
movies_data = pd.merge(movies_data,ratings_data,on='movieId')
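To see what the merge does, here is a minimal sketch on two tiny, made-up tables that stand in for movies.csv and ratings.csv. The titles and ratings are invented for illustration, and the real files may have more columns.

```python
import pandas as pd

# Toy stand-ins for movies.csv and ratings.csv (values are made up).
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
})
toy_ratings = pd.DataFrame({
    "userId": [10, 10, 11],
    "movieId": [1, 2, 1],
    "rating": [4.0, 3.5, 5.0],
})

# pd.merge defaults to an inner join: only movieIds found in both tables are kept.
toy_merged = pd.merge(toy_movies, toy_ratings, on="movieId")
print(toy_merged)
```

Each rating row picks up the title of its movie. "Movie C" has no ratings, so it does not appear in the merged result.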
Now, check whether the data contains any NaN values. If it does, fill them with zero.
# fillna returns a new DataFrame, so assign the result back
movies_data = movies_data.fillna(0)
ratings_data = ratings_data.fillna(0)
In my case, I have no NaN values.
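If you want to count the missing values before filling them, one possible check is isna().sum(). The sketch below uses a tiny made-up table, not the real dataset, and also shows that fillna leaves the original DataFrame unchanged unless you keep its return value.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing rating.
toy = pd.DataFrame({"title": ["Movie A", "Movie B"], "rating": [4.0, np.nan]})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# fillna returns a new DataFrame; `toy` itself still contains the NaN.
toy_filled = toy.fillna(0)
print(toy_filled)
```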
Next, we want item_similarity, a table of similarities between the movies, as listed in the algorithm above. Before computing item_similarity, we need a DataFrame with one row per userId and one column per movie title. To do this, we’ll use a function called pivot_table.
movies_data_df = movies_data.pivot_table(index=['userId'],columns=['title'],values='rating')
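Here is a small sketch of what pivot_table produces, again on made-up data. The column names match the ones used in this post, but the values are invented.

```python
import pandas as pd

# Long format: one row per (user, movie) rating. Values are made up.
toy_merged = pd.DataFrame({
    "userId": [10, 10, 11],
    "title": ["Movie A", "Movie B", "Movie A"],
    "rating": [4.0, 3.5, 5.0],
})

# Wide format: one row per user, one column per movie title.
toy_pivot = toy_merged.pivot_table(index="userId", columns="title", values="rating")
print(toy_pivot)
```

User 11 never rated "Movie B", so that cell is NaN. On a real ratings file most cells look like this, because each user rates only a small share of the movies.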
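Now, check the merged data again for NaN values and fill any with zero. Note that this applies to movies_data and ratings_data, not to the pivot table: the NaN entries in movies_data_df mark movies a user has not rated, and they are dealt with in the next step.
movies_data = movies_data.fillna(0)
ratings_data = ratings_data.fillna(0)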
No single user can rate every movie. If a user has not rated a movie, that does not mean the user gave it a rating of zero. So, let’s address this issue by standardizing the ratings. Let’s create a function called standard. It takes one column of the DataFrame at a time (apply works column by column by default, so each input holds all users’ ratings for one movie) and converts it so that new_ratings is the original rating minus the mean of that column, divided by the range (maximum minus minimum) of the ratings in that column.
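def standard(ratings):
    # apply() passes one column at a time by default: one movie's ratings from all users
    new_ratings = (ratings - ratings.mean()) / (ratings.max() - ratings.min())
    return new_ratings

movies_data_std = movies_data_df.apply(standard)

# Pearson correlation between every pair of movie columns
item_similarity = movies_data_std.corr()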
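The two steps above can be tried on a small, made-up pivot table. This is only a sketch under the assumption that rows are users, columns are movies, and NaN means "not rated". The ratings are invented.

```python
import numpy as np
import pandas as pd

# Rows are users, columns are movies; NaN means the user did not rate the movie.
toy_pivot = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, 1.0, np.nan],
        "Movie B": [4.0, 5.0, 2.0, 1.0],
        "Movie C": [1.0, np.nan, 5.0, 4.0],
    },
    index=[10, 11, 12, 13],
)

def standard(ratings):
    # Called once per column (movie); mean, max and min skip NaN values.
    return (ratings - ratings.mean()) / (ratings.max() - ratings.min())

toy_std = toy_pivot.apply(standard)

# corr() compares each pair of movies using only users who rated both.
toy_similarity = toy_std.corr()
print(toy_similarity)
```

Users who like "Movie A" also tend to like "Movie B", so that pair gets a positive score, while "Movie A" and "Movie C" get a negative one. Unrated cells stay NaN after standardizing. Pearson correlation is not affected by shifting or scaling a column, so on this toy data toy_pivot.corr() gives the same matrix.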
I’ll create a function called getsimiliarmovie. It takes a movie title and the rating that the user gave that movie after watching it, and it returns a similarity score for every movie, weighted by that rating. Once I have those scores from the DataFrame, I want to sort them in descending order. What I mean to say is,
similiar_score = similiar_score.sort_values(ascending=False)
def getsimiliarmovie(movie, rating):
    # Ratings above 2.5 give a positive weight; ratings below 2.5 give a negative weight
    similiar_score = item_similarity[movie] * (rating - 2.5)
    similiar_score = similiar_score.sort_values(ascending=False)
    return similiar_score
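To see how the (rating - 2.5) weight changes the ranking, here is a sketch with a hand-written similarity table. The table values and the helper name toy_similar are made up for illustration. Only the weighting formula comes from the function above.

```python
import pandas as pd

titles = ["Movie A", "Movie B", "Movie C"]

# Hand-written, symmetric similarity scores (made up for illustration).
toy_similarity = pd.DataFrame(
    [[1.0, 0.8, -0.5],
     [0.8, 1.0, -0.2],
     [-0.5, -0.2, 1.0]],
    index=titles,
    columns=titles,
)

def toy_similar(movie, rating):
    # Same formula as getsimiliarmovie, applied to the toy table.
    scores = toy_similarity[movie] * (rating - 2.5)
    return scores.sort_values(ascending=False)

liked = toy_similar("Movie A", 5)     # weight is +2.5
disliked = toy_similar("Movie A", 1)  # weight is -1.5

print(liked)
print(disliked)
```

With a rating of 5, the movies most similar to "Movie A" come first. With a rating of 1 the sign flips, so the movie least similar to "Movie A" moves to the top.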
Let’s test whether this function really works.
action_lover = [("eXistenZ (1999)", 5), ("xXx (2002)", 3)]

# One row of scores per rated movie (DataFrame.append was removed in pandas 2.0)
scores = [getsimiliarmovie(movie, rating) for movie, rating in action_lover]
similar_movies = pd.DataFrame(scores).reset_index(drop=True)

similar_movies.head()
# Add up the scores from all rated movies and rank the result
similar_movies.sum().sort_values(ascending=False)
Output:
eXistenZ (1999)                          2.581511
Happy Accidents (2000)                   1.299960
Dummy (2002)                             1.187514
Shape of Things, The (2003)              1.130215
Quills (2000)                            1.109947
...
Clear and Present Danger (1994)         -0.087791
Jungle Book, The (1994)                 -0.097731
Under Siege 2: Dark Territory (1995)    -0.099089
Disclosure (1994)                       -0.106220
Nine Months (1995)                      -0.113396
Length: 9719, dtype: float64
It’s working fine.