Similar movie prediction in Python
In this tutorial, we will learn how to build a Similar Movie Prediction model using the Scikit-learn library in Python. This model predicts similar movies by comparing their overviews.
Scikit-learn is one of the most popular and versatile machine learning libraries in Python. It provides simple and efficient tools for data mining and data analysis.
Prerequisites:
Basic familiarity with the scikit-learn library
Installing Necessary Libraries
First, open your command prompt or the terminal in your IDE and run the following command to install the necessary libraries:
pip install pandas scikit-learn
Importing Libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
TF-IDF stands for Term Frequency-Inverse Document Frequency. It evaluates the importance of a word in a document relative to a collection of documents. A word's weight grows with the number of times it appears in the document and is scaled down by the number of documents in the corpus that contain it, so words that appear everywhere carry little weight.
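To make the weighting concrete, here is a minimal sketch on a three-sentence corpus. The sentences are invented for illustration and are not taken from the dataset. In the first sentence, "space" appears twice and in no other sentence, while "ship" appears once and also occurs in the second sentence, so "space" should receive the higher weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus, invented for illustration only.
docs = [
    "space ship crew explores space",
    "ship sinks in the ocean",
    "crew survives the ocean storm",
]

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(docs)
vocab = vectorizer.vocabulary_  # maps each term to its column index

# TF-IDF weights of the first document, as a dense 1-D array
row = weights[0].toarray().ravel()
space_w = row[vocab["space"]]
ship_w = row[vocab["ship"]]

print("space:", round(space_w, 3))
print("ship: ", round(ship_w, 3))
print(space_w > ship_w)  # True: frequent here, rare elsewhere
```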
Cosine similarity measures how similar two non-zero vectors are by taking the cosine of the angle between them, which is the dot product of the vectors divided by the product of their lengths. For TF-IDF vectors, whose entries are never negative, the score ranges from 0 (no terms in common) to 1 (same direction). In the context of document similarity, cosine similarity measures how similar two documents are based on the terms they contain and their TF-IDF scores.
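The definition can be checked by hand on small vectors. The sketch below computes the cosine with NumPy and compares it with scikit-learn's cosine_similarity. The vectors and the helper name cosine are made up for this illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # points in the same direction as a
c = np.array([0.0, 0.0, 3.0])  # shares no non-zero component with a

def cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(a, b))  # approximately 1: same direction
print(cosine(a, c))  # 0.0: nothing in common

# scikit-learn expects 2-D inputs (one row per vector)
sk_value = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]
print(np.isclose(sk_value, cosine(a, b)))
```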
Loading the Dataset
df = pd.read_csv('tmdb_5000_movies.csv')
pd.read_csv() is a function from the pandas library that reads a comma-separated values (CSV) file into a DataFrame.
Here I am using the TMDB_5000_movies dataset. This dataset contains various types of information about movies, such as title, popularity, overview, genre, and more. However, you can use any other dataset that contains similar information if you prefer.
You can download the dataset from Kaggle, or use the link below to go directly to the TMDB_5000_movies dataset on Kaggle:
https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata?select=tmdb_5000_movies.csv
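After loading, it can help to confirm that the expected columns are present and to see how many overviews are missing. The sketch below uses a tiny in-memory CSV as a stand-in for the real file. Its two rows are made up, and only the column names title, popularity, and overview are taken from the description above.

```python
import io
import pandas as pd

# Stand-in for tmdb_5000_movies.csv: two made-up rows with a few columns.
csv_text = io.StringIO(
    "title,popularity,overview\n"
    "Movie A,12.5,A crew explores space.\n"
    "Movie B,3.1,\n"
)
sample_df = pd.read_csv(csv_text)

print(sample_df.shape)                     # (2, 3)
print(sample_df.columns.tolist())
print(sample_df["overview"].isna().sum())  # number of missing overviews
```

With the real file, the same three checks can be run directly on df.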
Selecting Required Columns
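df = df[['title', 'overview']].dropna().reset_index(drop=True)
We only need the title and overview columns from the dataset, so we select just those two.
The .dropna() method in pandas removes rows with missing values. Because it keeps the original index labels, .reset_index(drop=True) renumbers the rows from 0 so that each index label matches the row's position in the matrices built later.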
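One detail is easy to miss: dropna() removes rows but leaves the original index labels in place, so labels and row positions can drift apart. The sketch below shows this on a made-up three-row DataFrame and uses reset_index(drop=True) to renumber the rows.

```python
import pandas as pd

# Made-up rows for illustration; the real dataset has many more columns.
sample = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "overview": ["A crew explores space.", None, "A ship sinks."],
})

cleaned = sample[["title", "overview"]].dropna()
print(cleaned.index.tolist())      # [0, 2]: the label 1 is gone

renumbered = cleaned.reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1]: labels match positions again
```

This matters whenever an index label is later used to look up a row in a NumPy array or sparse matrix, which are addressed by position.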
Initialize the Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
The TfidfVectorizer converts the movie overviews into numerical vectors.
Setting stop_words='english' also removes common English stop words such as "the" and "is", which carry little meaning on their own.
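The effect of the stop_words setting is visible in the learned vocabulary. A minimal sketch on a single made-up sentence:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentence for illustration.
text = ["The crew of the ship is lost in the ocean"]

with_stop = TfidfVectorizer().fit(text)
without_stop = TfidfVectorizer(stop_words="english").fit(text)

# vocabulary_ maps each kept term to its column index
print(sorted(with_stop.vocabulary_))
print(sorted(without_stop.vocabulary_))
```

The second vocabulary is shorter: words like "the", "of", "is", and "in" are dropped, while content words such as "crew", "ship", and "ocean" remain.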
Transform the Overviews
tfidf_matrix = tfidf_vectorizer.fit_transform(df['overview'])
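fit_transform() learns the vocabulary from the overviews and transforms the text into a sparse matrix of TF-IDF features, with one row per movie and one column per term.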
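To get a feel for the result, the sketch below builds the matrix for three made-up overviews and inspects its shape. It also checks that each row has length 1, which follows from TfidfVectorizer's default norm='l2' setting.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up overviews standing in for df['overview'].
overviews = [
    "A ship sinks in the ocean",
    "Survivors of a ship wreck cross the ocean",
    "A detective hunts a thief in the city",
]

vec = TfidfVectorizer(stop_words="english")
matrix = vec.fit_transform(overviews)

print(matrix.shape)             # (number of overviews, vocabulary size)
print(sparse.issparse(matrix))  # True: only non-zero weights are stored
print(matrix.nnz)               # count of stored non-zero entries

# Rows are L2-normalised by default, so every row has length 1.
row_norms = np.linalg.norm(matrix.toarray(), axis=1)
print(np.allclose(row_norms, 1.0))
```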
Building a Similarity Matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
This creates a square similarity matrix in which the entry at row i, column j is the similarity score between movie i and movie j.
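A few properties of this matrix are worth verifying on a small case: it is square, symmetric, and has 1 on the diagonal because every overview is identical to itself. The sketch below uses three made-up overviews. The first two share the words "ship" and "ocean", and the third shares nothing with the first.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up overviews for illustration.
overviews = [
    "A ship sinks in the ocean",
    "Survivors of a ship wreck cross the ocean",
    "A detective hunts a thief in the city",
]

matrix = TfidfVectorizer(stop_words="english").fit_transform(overviews)
sim = cosine_similarity(matrix)  # equivalent to cosine_similarity(matrix, matrix)

print(sim.shape)                       # (3, 3)
print(np.allclose(sim, sim.T))         # symmetric
print(np.allclose(np.diag(sim), 1.0))  # each overview matches itself
print(sim[0, 1] > sim[0, 2])           # shared words give a higher score
```

Note that the full matrix has one entry per pair of movies, so its memory use grows with the square of the number of rows.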
Define a Function to Get Similar Movies
Now build a function that takes a movie title as input and returns the titles of the most similar movies based on cosine similarity scores.
If the movie name is not found in the dataset, it asks you to provide an overview. Then, it finds similar movies based on the overview you provide.
def get_similar_movies(input_title, cosine_sim=cosine_sim):
    movie = df[df['title'] == input_title]
    if not movie.empty:
        # Convert the index label to a row position so it lines up with cosine_sim
        idx = df.index.get_loc(movie.index[0])
    else:
        input_overview = input("Movie '{}' not found. Please enter its overview: ".format(input_title))
        # Vectorize the overview with the already fitted vectorizer (transform only, no refit)
        input_vector = tfidf_vectorizer.transform([input_overview])
        similarities = cosine_similarity(input_vector, tfidf_matrix)[0]
        # Sort in descending order and select the top 5 movies
        sim_indices = similarities.argsort()[::-1][:5]
        return df['title'].iloc[sim_indices].tolist()
    # Get similarity scores with all movies
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the list of tuples (index and score) by score in descending order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the indices of the 5 most similar movies (skipping the first entry, the movie itself)
    sim_scores = sim_scores[1:6]
    movie_indices = [i[0] for i in sim_scores]
    return df['title'].iloc[movie_indices].tolist()
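The ranking step inside the function can be looked at in isolation. The sketch below uses made-up scores for one movie against five movies and picks the top three. Instead of skipping the first sorted entry, it filters out the query position explicitly. This is an alternative to the slicing used above, not the tutorial's method, and it may be more robust when another movie ties with the query at a score of 1.0.

```python
import numpy as np

# Hypothetical similarity scores of the movie at position 2 against five movies.
scores = np.array([0.10, 0.40, 1.00, 0.05, 0.30])
query_pos = 2

order = scores.argsort()[::-1]     # positions sorted by score, highest first
order = order[order != query_pos]  # drop the query movie itself
top_k = order[:3]

print(top_k.tolist())              # [1, 4, 0]
```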
Usage
input_title = input("Enter a movie name: ")
similar_movies = get_similar_movies(input_title)
print("Movies similar to '{}':".format(input_title))
for movie in similar_movies:
    print(movie)
Output:
python similar_movies.py
Enter a movie name: Titanic
Movies similar to 'Titanic':
Raise the Titanic
Ghost Ship
I Can Do Bad All By Myself
Event Horizon
Niagara

Enter a movie name: Godfather
Movie 'Godfather' not found. Please enter its overview: The Godfather "Don" Vito Corleone is the head of the Corleone mafia family in New York. He is at the event of his daughter's wedding. Michael, Vito's youngest son and a decorated WWII Marine is also present at the wedding. Michael seems to be uninterested in being a part of the family business. Vito is a powerful man, and is kind to all those who give him respect but is ruthless against those who do not. But when a powerful and treacherous rival wants to sell drugs and needs the Don's influence for the same, Vito refuses to do it. What follows is a clash between Vito's fading old values and the new ways which may cause Michael to do the thing he was most reluctant in doing and wage a mob war against all the other mafia families which could tear the Corleone family apart.
Movies similar to 'Godfather':
The Godfather: Part II
The Godfather
The Godfather: Part III
Halloween 4: The Return of Michael Myers
The Last Godfather