Similar movie prediction in Python

In this tutorial, we will learn how to build a Similar Movie Prediction model using the Scikit-learn library in Python. This model predicts similar movies by comparing their overviews.

Scikit-learn is one of the most popular and versatile machine learning libraries in Python. It provides simple and efficient tools for data mining and data analysis.

Prerequisites:

Basic familiarity with the scikit-learn library

Installing Necessary Libraries

First, open your command prompt or your IDE's terminal and run the following command to install the necessary libraries:

pip install pandas scikit-learn

Importing Libraries

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

TF-IDF stands for Term Frequency-Inverse Document Frequency. It evaluates the importance of a word in a document relative to a collection of documents (the corpus). A word's weight grows with the number of times it appears in the document and is offset by the number of documents in the corpus that contain it, so words that appear almost everywhere carry little weight.

Cosine similarity measures the similarity between two non-zero vectors as the cosine of the angle between them: vectors that point in the same direction score 1, and orthogonal vectors score 0. In the context of document similarity, it measures how similar two documents are based on the terms they contain and their TF-IDF scores.
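To make the definition concrete, here is a minimal sketch that computes the cosine of the angle between small hand-picked vectors with NumPy and checks it against scikit-learn's cosine_similarity. The vectors and the helper name cosine are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice the length
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a

def cosine(u, v):
    """Cosine of the angle between u and v: dot product over the product of norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

manual_ab = cosine(a, b)  # close to 1: same direction, length is ignored
manual_ac = cosine(a, c)  # 0: no overlap between the vectors

# scikit-learn expects 2D inputs (one row per vector) and returns a matrix
sklearn_ab = cosine_similarity([a], [b])[0, 0]

print(manual_ab, manual_ac, sklearn_ab)
```

Because only the direction matters, a long overview and a short one can still be rated as very similar when they use the same terms in similar proportions.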

Loading the Dataset

df = pd.read_csv('tmdb_5000_movies.csv')

pd.read_csv() is a function from the pandas library. It reads a comma-separated values (CSV) file into a DataFrame.

Here I am using the TMDB_5000_movies dataset. This dataset contains various types of information about movies, such as title, popularity, overview, genre, and more. However, you can use any other dataset that contains similar information if you prefer.

You can download the dataset from Kaggle, or use the link below to go directly to the TMDB_5000_movies dataset page on Kaggle:

https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata?select=tmdb_5000_movies.csv

Selecting Required Columns

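df = df[['title', 'overview']].dropna().reset_index(drop=True)

We only need the title and overview columns from the dataset, so we select just those two.

The .dropna() method in pandas removes rows with missing values. Because dropping rows leaves gaps in the index labels, .reset_index(drop=True) renumbers the rows from 0, so that each row's index matches its row position in the TF-IDF and similarity matrices built in the following steps.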
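The sketch below shows, on a made-up three-row DataFrame, that dropna() keeps the original index labels, so labels and row positions can disagree afterwards; reset_index(drop=True) is one way to bring them back in line. The toy data and variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up data: the second movie has no overview
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["first plot", None, "third plot"],
})

cleaned = toy[["title", "overview"]].dropna()
# The labels keep a gap where the row was dropped
print(cleaned.index.tolist())

realigned = cleaned.reset_index(drop=True)
# The labels now run from 0 and match the row positions
print(realigned.index.tolist())
```

This matters later because the rows of the similarity matrix are addressed by position, not by the DataFrame's index labels.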

Initialize the Vectorizer

tfidf_vectorizer = TfidfVectorizer(stop_words='english')

The TfidfVectorizer helps us to convert the movie overviews into numerical vectors.

Passing stop_words='english' also removes common English stop words such as "the" and "and", which carry little meaning on their own.

Transform the Overviews

tfidf_matrix = tfidf_vectorizer.fit_transform(df['overview'])

This learns the vocabulary from the overviews and transforms the text into a sparse matrix of TF-IDF features, with one row per movie and one column per term.
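To get a feel for what fit_transform returns, the following sketch runs it on two made-up overviews and inspects the result. The sample sentences and variable names are assumptions for illustration; on the real dataset the matrix has thousands of rows and columns.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up overviews with no meaningful words in common
overviews = [
    "The young wizard attends a school of magic",
    "The retired detective solves a murder case",
]

vec = TfidfVectorizer(stop_words="english")
m = vec.fit_transform(overviews)

print(issparse(m))               # the result is a SciPy sparse matrix
print(m.shape)                   # (number of overviews, number of distinct terms)
print(m.nnz)                     # number of stored non-zero entries
print("the" in vec.vocabulary_)  # stop words never make it into the vocabulary
```

Only the non-zero entries are stored, which keeps memory usage manageable when most terms are absent from most overviews.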

Building a Similarity Matrix

cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

This creates a square similarity matrix in which the element at row i and column j is the similarity score between movie i and movie j.
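The structure of that matrix can be checked on a tiny example. The sketch below uses three made-up overviews (an assumption for illustration) and verifies that the matrix is square and symmetric, and that every movie has a similarity of 1 with itself.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up overviews: the first two share the terms "ship" and "ocean"
overviews = [
    "A ship sinks in the Atlantic ocean",
    "A ghost ship drifts across the ocean",
    "A detective investigates a murder in the city",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(overviews)
sim = cosine_similarity(tfidf, tfidf)

print(sim.shape)                          # one row and one column per overview
print(np.allclose(sim, sim.T))            # similarity is symmetric
print(np.allclose(np.diag(sim), 1.0))     # each overview matches itself perfectly
print(sim[0, 1] > sim[0, 2])              # shared terms give a higher score
```

The diagonal of ones is the reason a lookup by title has to skip the top-scoring entry: the most similar movie to any movie is itself.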

Define a Function to Get Similar Movies

Now build a function that takes a movie title as input and returns the titles of the most similar movies based on cosine similarity scores.

If the movie name is not found in the dataset, it asks you to provide an overview. Then, it finds similar movies based on the overview you provide.

def get_similar_movies(input_title, cosine_sim=cosine_sim):
    movie = df[df['title'] == input_title]
    if not movie.empty:
        idx = movie.index[0]
    else:
        input_overview = input("Movie '{}' not found. Please enter its overview: ".format(input_title))

        # Create a temporary DataFrame with the input overview
        temp_df = pd.DataFrame({'title': ['Input Movie'], 'overview': [input_overview]})

        # Transform the input overview with the already-fitted vectorizer (no refitting)
        input_vector = tfidf_vectorizer.transform(temp_df['overview'])

        similarities = cosine_similarity(input_vector, tfidf_matrix)
        similarities = similarities[0]

        # Sort the movie indices by similarity, in descending order
        sim_indices = similarities.argsort()[::-1]

        # Select the top 5 movies
        sim_indices = sim_indices[:5]

        return df['title'].iloc[sim_indices].tolist()

    # Get similarity scores with all movies
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the list of (index, score) tuples by score in descending order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the indices of the 5 most similar movies (excluding itself)
    sim_scores = sim_scores[1:6]

    movie_indices = [i[0] for i in sim_scores]
    return df['title'].iloc[movie_indices].tolist()
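The lookup above requires the title to match exactly, including capitalization. As an optional aid, the sketch below shows one way to list candidate titles for a partial, case-insensitive query before falling back to an overview. The helper find_title_matches and the sample titles are hypothetical and not part of the original tutorial.

```python
import pandas as pd

def find_title_matches(titles, query):
    """Return the titles that contain `query`, ignoring case (hypothetical helper)."""
    # regex=False treats the query as plain text, so characters like ":" or "(" are safe
    mask = titles.str.contains(query.strip(), case=False, regex=False)
    return titles[mask].tolist()

# Made-up sample; with the tutorial's data this would be df['title']
titles = pd.Series(["The Godfather", "The Godfather: Part II", "Titanic"])

matches = find_title_matches(titles, "godfather ")
print(matches)
```

One of the returned titles could then be passed to get_similar_movies unchanged, since it is guaranteed to be an exact match.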

Usage

input_title = input("Enter a movie name: ")
similar_movies = get_similar_movies(input_title)
print("Movies similar to '{}':".format(input_title))
for movie in similar_movies:
    print(movie)

Output:

python similar_movies.py
Enter a movie name: Titanic 
Movies similar to 'Titanic':
Raise the Titanic
Ghost Ship
I Can Do Bad All By Myself
Event Horizon
Niagara

Enter a movie name: Godfather
Movie 'Godfather' not found. Please enter its overview: The Godfather "Don" Vito Corleone is the head of the Corleone mafia family in New York. He is at the event of his daughter's wedding. Michael, Vito's youngest son and a decorated WWII Marine is also present at the wedding. Michael seems to be uninterested in being a part of the family business. Vito is a powerful man, and is kind to all those who give him respect but is ruthless against those who do not. But when a powerful and treacherous rival wants to sell drugs and needs the Don's influence for the same, Vito refuses to do it. What follows is a clash between Vito's fading old values and the new ways which may cause Michael to do the thing he was most reluctant in doing and wage a mob war against all the other mafia families which could tear the Corleone family apart.
Movies similar to 'Godfather':
The Godfather: Part II
The Godfather
The Godfather: Part III
Halloween 4: The Return of Michael Myers
The Last Godfather
