Content Based Movie Recommender System using Python
Table of Contents
What is a Recommender System?
Introduction to Content-Based Recommender System
Advantages and Disadvantages of Content-Based Recommender System
Creating a Content-Based Recommender System
Importing Necessary Libraries
Reading and Viewing the data
Retaining Relevant Data
Converting the Overview of Movies into Vectors
Vector Comparison using Sigmoid Kernel
Reverse Mapping Movie’s Name with Index
Function for returning similar movies
Output
Future Scope of this project
What is a Recommender System?
A Recommender System is an information filtering system that predicts which items a user is likely to prefer, based either on the user’s past selections or on information about the items the user has interacted with. These systems address the information overload problem by surfacing only the relevant items.
Recommender systems are now part of our daily life: from shopping in an online store to watching a new series on Netflix, these systems are deployed everywhere.
Introduction to Content-Based Recommender System
Although there are various types of recommender systems, each with its own way of generating recommendations for users, our prime focus will be the Content-Based Recommender System.
Now, what is a Content-Based Recommender System?
A Recommender System that bases its recommendations on item similarity is known as a Content-Based Recommender System. In broader terms, this type of recommender system recommends products that are similar to the products the user has already liked or viewed.
Have you ever wondered why you get recommendations for sci-fi movies after you watch Interstellar? You guessed it right: a content-based recommender system is at work, because Interstellar is a sci-fi movie and other sci-fi movies are similar to it.
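To make the idea concrete, here is a minimal sketch using a made-up catalogue. Each movie is described by hypothetical genre flags, and the other movies are ranked by cosine similarity to the one the user just watched. The placeholder titles, the flag values, and the helper names `cosine` and `most_similar` are assumptions for illustration only and are not part of the case study that follows.

```python
import numpy as np

# Hypothetical genre flags per movie: [sci-fi, drama, comedy]
catalogue = {
    "Interstellar": np.array([1.0, 1.0, 0.0]),
    "Sci-Fi Movie A": np.array([1.0, 1.0, 0.0]),
    "Sci-Fi Comedy B": np.array([1.0, 0.0, 1.0]),
    "Comedy C": np.array([0.0, 0.0, 1.0]),
}

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(title, k=2):
    """Return the k titles whose features are closest to `title`."""
    query = catalogue[title]
    scores = {
        name: cosine(query, features)
        for name, features in catalogue.items()
        if name != title  # never recommend the movie itself
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(most_similar("Interstellar"))
# ['Sci-Fi Movie A', 'Sci-Fi Comedy B']
```

The rest of this article follows the same pattern, except that the features come from the text of each movie's overview rather than from hand-written genre flags.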
Advantages and Disadvantages of Content-Based Recommender System
Let’s start with the advantages:
- It eases the cold start problem, i.e., even if the database holds no preference data from other users, it can still recommend items based on their descriptions.
- It easily adjusts its recommendations as the user's preferences change.
- It does not compare users with one another, so no profiles are shared and privacy is maintained.
These are some disadvantages of Content-Based Systems:
- Because the recommendations depend on item similarity, the system must be given a rich description of every item.
- Content overspecialization also occurs: only content similar to what is already in the user’s list is recommended, so the user is rarely shown anything new or different.
Creating a Content-Based Recommender System
Importing Necessary Libraries
# Importing the Necessary Libraries
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import sigmoid_kernel
Reading and Viewing the data
# Reading and viewing the data
# (the two CSV files, tmdb_5000_credits.csv and tmdb_5000_movies.csv, must be downloaded beforehand)
credits = pd.read_csv('tmdb_5000_credits.csv')
movies = pd.read_csv('tmdb_5000_movies.csv')
pd.set_option('display.max_columns', None)
credits.head()
   movie_id                                     title                                               cast                                               crew
0     19995                                    Avatar  [{"cast_id": 242, "character": "Jake Sully", "...  [{"credit_id": "52fe48009251416c750aca23", "de...
1       285  Pirates of the Caribbean: At World's End  [{"cast_id": 4, "character": "Captain Jack Spa...  [{"credit_id": "52fe4232c3a36847f800b579", "de...
2    206647                                   Spectre  [{"cast_id": 1, "character": "James Bond", "cr...  [{"credit_id": "54805967c3a36829b5002c41", "de...
3     49026                     The Dark Knight Rises  [{"cast_id": 2, "character": "Bruce Wayne / Ba...  [{"credit_id": "52fe4781c3a36847f81398c3", "de...
4     49529                               John Carter  [{"cast_id": 5, "character": "John Carter", "c...  [{"credit_id": "52fe479ac3a36847f813eaa3", "de...
movies.head()
Retaining Relevant Data
# Renaming the key column so that both tables share 'id', then merging them
credits = credits.rename(columns={'movie_id': 'id'})
df = movies.merge(credits, on='id')
# Dropping the unwanted columns ('title_x' and 'title_y' are the two copies of 'title' produced by the merge)
df = df.drop(['homepage', 'title_x', 'title_y', 'status', 'production_countries'], axis=1)
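The columns `title_x` and `title_y` do not exist in either CSV file. They appear because both tables carry a `title` column, and pandas adds the suffixes `_x` and `_y` to overlapping column names during a merge. The sketch below reproduces this on two tiny made-up tables (`movies_toy` and `credits_toy` are hypothetical stand-ins, not the real data).

```python
import pandas as pd

# Toy stand-ins for the two CSV files (values are made up for illustration)
movies_toy = pd.DataFrame(
    {"id": [1, 2], "title": ["A", "B"], "overview": ["first plot", "second plot"]}
)
credits_toy = pd.DataFrame(
    {"movie_id": [1, 2], "title": ["A", "B"], "cast": ["[]", "[]"]}
)

# Give both tables the same key name, then merge on it
merged = movies_toy.merge(credits_toy.rename(columns={"movie_id": "id"}), on="id")

print(list(merged.columns))
# ['id', 'title_x', 'overview', 'title_y', 'cast']
```

After the two suffixed copies are dropped, the real data still has the `original_title` column, which the later steps use to look movies up.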
Converting the Overview of Movies into Vectors
df['overview'][0]
'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'
# Creating a TfidfVectorizer object and fitting it on the movie overviews
tfidf = TfidfVectorizer(stop_words='english', analyzer='word', min_df=3, strip_accents='unicode',
                        ngram_range=(1, 3), token_pattern=r'\w{1,}', max_features=None)
# Replacing missing overviews with empty strings (plain assignment avoids an in-place change on a column selection)
df['overview'] = df['overview'].fillna('')
vector = tfidf.fit_transform(df['overview'])
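vector
<4803x10417 sparse matrix of type '<class 'numpy.float64'>'
    with 127220 stored elements in Compressed Sparse Row format>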
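As a small illustration of what the vectorizer produces, the sketch below fits a `TfidfVectorizer` on three made-up sentences (one of them empty, like a missing overview after `fillna('')`). The sentences and the names `docs`, `toy_tfidf`, `toy_matrix`, and `norms` are assumptions for this example, and it uses simpler settings than the case study. With the default `norm='l2'`, every non-empty document becomes a unit-length row and the empty document becomes an all-zero row.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up overviews; the last one mimics a missing overview filled with ''
docs = [
    "a marine is sent to an alien moon",
    "a spy is sent on a mission",
    "",
]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(docs)  # sparse matrix, one row per document

# One row per document, one column per term kept in the vocabulary
print(toy_matrix.shape)
print(sorted(toy_tfidf.vocabulary_))

# Row lengths: 1.0 for non-empty documents, 0.0 for the empty one
norms = np.linalg.norm(toy_matrix.toarray(), axis=1)
print(norms)
```

Because the rows are unit length, the dot product of two rows equals their cosine similarity, which is useful to keep in mind for the comparison step.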
Vector Comparison using Sigmoid Kernel
# Comparing every overview vector with every other overview vector
sigmoid = sigmoid_kernel(vector,vector)
sigmoid
array([[0.76163447, 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
        0.76159416],
       [0.76159416, 0.76163447, 0.76159416, ..., 0.76159513, 0.76159416,
        0.76159416],
       [0.76159416, 0.76159416, 0.76163447, ..., 0.76159486, 0.76159416,
        0.76159455],
       ...,
       [0.76159416, 0.76159513, 0.76159486, ..., 0.76163447, 0.76159483,
        0.76159473],
       [0.76159416, 0.76159416, 0.76159416, ..., 0.76159483, 0.76163447,
        0.76159461],
       [0.76159416, 0.76159416, 0.76159455, ..., 0.76159473, 0.76159461,
        0.76163447]])
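All of the scores above sit very close to 0.7616, and the following sketch shows why. scikit-learn's `sigmoid_kernel` computes `tanh(gamma * <x, y> + coef0)`, with defaults `gamma = 1 / n_features` and `coef0 = 1`. With a large vocabulary, `gamma` is tiny, so every score lands just above `tanh(1) ≈ 0.76159416`, which is exactly the value printed for pairs of overviews that share no terms. The toy matrix `X` below is a made-up stand-in for two TF-IDF rows.

```python
import numpy as np
from sklearn.metrics.pairwise import sigmoid_kernel

# Two unit-length toy vectors with no terms in common (made-up values)
X = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
])

K = sigmoid_kernel(X, X)  # defaults: gamma = 1 / n_features, coef0 = 1

# The same computation written out by hand
gamma = 1.0 / X.shape[1]
manual = np.tanh(gamma * (X @ X.T) + 1.0)

print(np.allclose(K, manual))  # True
print(K[0, 1])                 # tanh(1): the score of two vectors with no overlap
print(K[0, 0])                 # tanh(gamma + 1): the score of a vector with itself
```

Since `tanh` is increasing and `gamma` is positive, ranking movies by this score gives the same order as ranking them by the plain dot product of their TF-IDF vectors. The small differences in the last decimal places are therefore all that the ranking relies on.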
Reverse Mapping Movie’s Name with Index
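# Reverse mapping: movie title -> row position in df
# Note: drop_duplicates() compares the values (the row positions), not the titles, so repeated titles are kept
index = pd.Series(df.index, index=df['original_title']).drop_duplicates()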
index
original_title
Avatar                                         0
Pirates of the Caribbean: At World's End       1
Spectre                                        2
The Dark Knight Rises                          3
John Carter                                    4
                                            ...
El Mariachi                                 4798
Newlyweds                                   4799
Signed, Sealed, Delivered                   4800
Shanghai Calling                            4801
My Date with Drew                           4802
Length: 4803, dtype: int64
index['Avatar']
0
sigmoid[0]
array([0.76163447, 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
       0.76159416])
# Viewing the sigmoid score of every movie against this movie, paired with its row position
list(enumerate(sigmoid[index['Avatar']]))
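One detail of the title lookup is worth a hedged note. `Series.drop_duplicates()` removes repeated values, and here the values are row positions, which are already unique. If the dataset contains two movies with the same `original_title`, both entries stay in the mapping, and looking that title up returns a Series instead of a single integer. The sketch below shows this on a made-up table and one possible way to keep only the first occurrence of each title; `toy_df`, `toy_index`, and `deduped` are illustrative names, not part of the original code.

```python
import pandas as pd

# Made-up table in which the title "Alpha" appears twice
toy_df = pd.DataFrame({"original_title": ["Alpha", "Beta", "Alpha"]})

toy_index = pd.Series(toy_df.index, index=toy_df["original_title"])

# The values 0, 1, 2 are all distinct, so nothing is removed
print(len(toy_index.drop_duplicates()))  # 3

# Keep only the first row position for each title
deduped = toy_index[~toy_index.index.duplicated(keep="first")]
print(len(deduped))      # 2
print(deduped["Alpha"])  # 0
```

Keeping the first occurrence is only one possible policy; a fuller solution might tell same-titled movies apart by release year instead.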
Function for returning similar movies
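# Creating a function that returns the top 10 movies that are similar to the given movie
def content(title, sigmoid=sigmoid):
    # Row position of the requested movie
    position = index[title]
    # Pair each row position with its score and sort from most to least similar
    score = sorted(enumerate(sigmoid[position]), key=lambda x: x[1], reverse=True)
    # Skip the first entry (normally the movie itself) and keep the next 10
    indices = score[1:11]
    movie_indices = [i[0] for i in indices]
    # Top 10 most similar movies
    return df['original_title'].iloc[movie_indices]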
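The function above drops the first entry of the sorted list on the assumption that the movie itself always ranks first. If another movie had an identical score (for example, an identical overview), that assumption could fail. As a sketch of an alternative, the hypothetical helper `top_similar` below uses `np.argsort` and removes the query by its position rather than by its rank. It runs on a made-up score vector and is not the author's implementation.

```python
import numpy as np

def top_similar(scores, position, k=10):
    """Return the positions of the k highest scores, excluding `position` itself."""
    order = np.argsort(scores)[::-1]  # positions sorted from highest to lowest score
    order = order[order != position]  # remove the query movie explicitly
    return order[:k].tolist()

# Made-up scores: position 2 ties with the query at position 0
toy_scores = np.array([0.9, 0.2, 0.9, 0.5])

print(top_similar(toy_scores, position=0, k=2))
# [2, 3]
```

In the recommender, the positions returned this way could be passed to `df['original_title'].iloc[...]` exactly as `movie_indices` is.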
Output
content('Avatar')
1341                Obitaemyy Ostrov
634                       The Matrix
3604                       Apollo 18
2130                    The American
775                        Supernova
529                 Tears of the Sun
151                          Beowulf
311     The Adventures of Pluto Nash
847                         Semi-Pro
942                 The Book of Life
Name: original_title, dtype: object
Future Scope of this project
This is a first-generation recommender system; the industry currently employs more complex recommender systems that use neural networks for prediction. Still, this project gives you a basic intuition for what content-based recommender systems are, and you can embed such a system in a web application or an Android application.
About the Authors:
Utkarsh Bahukhandi
Utkarsh Bahukhandi is a B.Tech undergraduate at Maharaja Agrasen Institute of Technology. He is a data science enthusiast who explores challenging projects in ML and DS niches such as Natural Language Processing and Computer Vision.
Mohan Rai
Mohan Rai is an alumnus of IIM Bangalore. He completed his MBA and his Bachelor of Science (Statistics) at the University of Pune, and he is a Certified Data Scientist by EMC. Mohan is a learner who has enriched his experience throughout his career by taking on opportunities as an Advisor, Consultant, and Business Owner. He has more than 18 years’ experience in the field of Analytics and has worked as an Analytics SME in domains including IT, Banking, Construction, Real Estate, Automobile, Component Manufacturing, and Retail. His functional scope covers areas including Training, Research, Sales, Market Research, Sales Planning, and Market Strategy.