Have you ever wondered how Netflix knows which movies you'll probably like? Or why, if you watch Stranger Things, Netflix is likely to suggest E.T. the Extra-Terrestrial? Based on customer viewing behavior data, we can help users discover exactly the content they want to watch. The similar-products method is one way to build these prediction models, and it is especially useful when brand-new users haven't rated any movies yet.

**First, read the movie database files in Python.**

Let’s take a closer look at the datasets. Use pandas’ read_csv command to load the data into a DataFrame. The first column is the ID of the user who made the rating, the second column is the ID of the movie the user rated, and the third column is the rating that user gave the movie. Files to download: movie_ratings_data_set and movies.csv.
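A minimal loading sketch, where the column names are assumptions based on the description above and a small inline sample stands in for the downloaded ratings file:

```python
import io

import pandas as pd

# Inline stand-in for the ratings file (hypothetical rows); the real file is
# loaded the same way with pd.read_csv() pointed at the downloaded file.
csv_text = """user_id,movie_id,value
1,1,4
2,1,3
1,2,5
"""

# Each row: who rated (user_id), what they rated (movie_id), the rating (value).
raw_dataset_df = pd.read_csv(io.StringIO(csv_text))
print(raw_dataset_df.head())
```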

**Second, convert the running list of user ratings into a matrix.**

Use the user ID field for the rows (the index) of the pivot table, and use the movie ID for the columns. When summarizing data with a pivot table, it’s possible we’ll have duplicates: the same user viewed the same movie twice but gave it two different ratings. In that case, we can use an aggregate function to resolve the duplicates, passing in the parameter aggfunc=np.max (or aggfunc=np.mean). That way, if a user rated the same movie twice, we’ll take the higher rating or the mean of the two ratings.
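A sketch of the pivot step, using hypothetical sample ratings that include one duplicate (user 1 rated movie 1 twice):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample ratings; the first two rows are a duplicate pair where
# user 1 rated movie 1 twice with different scores.
csv_text = """user_id,movie_id,value
1,1,4
1,1,5
2,1,3
1,2,2
"""
raw_dataset_df = pd.read_csv(io.StringIO(csv_text))

# Rows are users, columns are movies; duplicates resolved by taking the max,
# so user 1 / movie 1 becomes 5. Unrated pairs come out as NaN.
ratings_df = pd.pivot_table(raw_dataset_df, index="user_id",
                            columns="movie_id", values="value",
                            aggfunc=np.max)
print(ratings_df)
```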

In the real world, most users will not review every product, so there will always be a lot of blank data. Don’t worry! Sparse datasets are normal for recommendation systems. With Python, we can fill in that missing data.

**Third, apply matrix factorization to find the latent features.**

To fill in the missing data in the movie rating matrix, we assign attributes to each user and each movie, then multiply them together and add up the results. Attributes could be “Action”, “Drama”, “Romance”, “Music”, “Dark”, etc. All we know is that each attribute represents some characteristic that attracts users to certain movies. These vectors are hidden information that we find by looking at review data. U (user attributes) x M (movie attributes) = movie ratings.

For example, suppose User 1 rates two movies, M1 and M2. User 1 likes crowd-pleasers and doesn’t like too much drama. M1 is an action movie like Pirates of the Caribbean: The Curse of the Black Pearl, and M2 is an arthouse movie that is not a crowd-pleaser, like Lost in Translation. As a result, User 1 will give M1 a score of 82 and M2 a score of -38.
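This worked example boils down to a dot product. The attribute vectors below are assumptions chosen purely to reproduce the scores mentioned above:

```python
import numpy as np

# Hypothetical latent attributes: [crowd-pleaser, drama]
user1 = np.array([10.0, -4.0])   # likes crowd-pleasers, dislikes drama
m1 = np.array([8.0, -0.5])       # action blockbuster, light on drama
m2 = np.array([-3.0, 2.0])       # quiet arthouse film, heavier on drama

# Predicted rating = dot product of user and movie attribute vectors.
print(user1 @ m1)   # 82.0 -> strong match, recommend
print(user1 @ m2)   # -38.0 -> poor match, don't recommend
```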

We can use the movie ratings we already know to work backward and find a U matrix and an M matrix that satisfy this equation. Finally, we’ll multiply the U and M matrices we found back together to get review scores for every user and every movie.

**Matrix Factorization**

- U (user attributes) x M (movie attributes) = movie ratings
- Set all elements in U and M to random numbers. Right now, multiplying U and M produces random ratings.
- Create a “cost function” that checks how far off U*M currently is from equaling the known values in the movie rating matrix.
- Using a numerical optimization algorithm, tweak the numbers in U and M a little at a time, with the goal of getting the cost function a little closer to zero. We can use SciPy’s fmin_cg() optimization function to find the minimum cost.
- Repeat the previous step until we can’t reduce the cost function any further. The U and M values we end up with give us the estimate U*M = movie ratings.

Use matrix factorization to calculate the U and M matrices. We define 15 attributes in each of the U and M matrices: num_features=15.
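The steps above can be sketched on a toy ratings matrix. The matrix values, the gradient helper, and num_features=3 are assumptions for illustration (the article uses num_features=15); only the use of SciPy’s fmin_cg() comes from the text:

```python
import numpy as np
from scipy.optimize import fmin_cg

# Toy ratings matrix (hypothetical values); NaN marks a missing rating.
ratings = np.array([[5.0, np.nan, 1.0],
                    [4.0, 2.0, np.nan],
                    [np.nan, 5.0, 4.0]])
known = ~np.isnan(ratings)
num_users, num_movies = ratings.shape
num_features = 3     # the article uses 15; a toy matrix needs fewer
reg_amount = 0.1

def unpack(x):
    # fmin_cg optimizes a flat vector, so U and M are packed into one array.
    U = x[:num_users * num_features].reshape(num_users, num_features)
    M = x[num_users * num_features:].reshape(num_features, num_movies)
    return U, M

def cost(x):
    U, M = unpack(x)
    # Error measured only on the known ratings, plus a regularization penalty.
    err = np.where(known, U @ M - ratings, 0.0)
    return (0.5 * np.sum(err ** 2)
            + 0.5 * reg_amount * (np.sum(U ** 2) + np.sum(M ** 2)))

def grad(x):
    U, M = unpack(x)
    err = np.where(known, U @ M - ratings, 0.0)
    return np.concatenate([(err @ M.T + reg_amount * U).ravel(),
                           (U.T @ err + reg_amount * M).ravel()])

# Start from small random numbers, then let fmin_cg drive the cost toward zero.
x0 = np.random.RandomState(0).randn(num_features * (num_users + num_movies)) * 0.1
x_best = fmin_cg(cost, x0, fprime=grad, disp=False)
U, M = unpack(x_best)
predicted_ratings = U @ M   # estimated scores for every user/movie pair
```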

**Regularization**

Regularization is a control in the model that limits how much weight we place on any single attribute when modeling users and products; it keeps the model from emphasizing specific data points too much. For example, suppose we have two romance movies: a romantic comedy like Notting Hill and a historical romance like Pearl Harbor. Both have romance elements, but some viewers prefer the funny movie and other viewers prefer the serious one; they are very different movies that appeal to different audiences. If we place too much weight on the “romance” attribute, the system will recommend Pearl Harbor to Notting Hill audiences.

Regularization helps the system recognize Notting Hill for both its romance and its comedy elements. The higher we set the regularization amount, the less weight we’ll put on any single attribute. We use a value of 0.1 in the code because we are working with a small dataset; for larger datasets, we might use 1.0, 10.0, and so on. We can experiment later with different regularization values to see how they affect the quality of the recommendations.
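As a sketch, this is the penalty term that regularization adds to the cost function; the function name and matrices here are illustrative:

```python
import numpy as np

# The penalty added to the cost: reg_amount scales how strongly large
# attribute weights are punished (0.1 in the article's code).
def regularization_penalty(U, M, reg_amount=0.1):
    return 0.5 * reg_amount * (np.sum(U ** 2) + np.sum(M ** 2))

# Doubling every weight quadruples the penalty, so the optimizer prefers
# moderate weights spread across attributes over one dominant attribute.
U = np.ones((2, 3))
M = np.ones((3, 2))
print(regularization_penalty(U, M))
print(regularization_penalty(2 * U, 2 * M))
```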

**Root-Mean-Square Error**

RMSE is a measurement of the difference between a user’s real rating and the rating we predicted: the lower it is, the more accurate the model. An RMSE of zero means our model perfectly guesses user ratings. To measure the accuracy of our recommendation system, we’ll randomly split our movie ratings data into two groups: 70% of the data will be our training dataset, and the other 30% will be our testing dataset.
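A minimal RMSE helper, with hypothetical real and predicted matrices (NaN marks an unrated movie, which is skipped in the measurement):

```python
import numpy as np

# RMSE over the entries where real ratings exist (NaN = user never rated it).
def rmse(real, predicted):
    mask = ~np.isnan(real)
    return np.sqrt(np.mean((real[mask] - predicted[mask]) ** 2))

real = np.array([[5.0, np.nan],
                 [3.0, 4.0]])
predicted = np.array([[4.5, 2.0],
                      [3.0, 5.0]])
print(rmse(real, predicted))
```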

Output: we got a training RMSE of 0.24 and a testing RMSE of about 1.2. The low training RMSE shows that our basic algorithm is working, but the testing RMSE is the more important number because it tells us how good our predictions are on data the model hasn’t seen. We can adjust the regularization amount parameter and watch how the RMSE changes. Moreover, the larger the movie review dataset, the more accurate the predictions.

**Use latent representations (U x M) to find similar products.**

Use NumPy’s transpose function to flip the M matrix so each column becomes a row. This just makes the data easier to work with; it doesn’t change the data itself.

To find other movies similar to this one, we just have to find the other movies whose numbers are closest to this movie’s numbers.

Subtracting the current movie’s attributes from every other movie’s attributes gives us the difference in scores between the current movie and every other movie in the database; we then take the absolute value of those differences and total them per movie. Finally, pandas provides a convenient sort_values function to rank the movies by that difference score.
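A sketch of this similarity search, with a hypothetical two-attribute M matrix standing in for the one learned by factorization:

```python
import numpy as np
import pandas as pd

# M holds one attribute column per movie (hypothetical toy values);
# transposing gives one row per movie, which is easier to work with.
M = np.array([[1.0, 0.9, -2.0, 1.1],
              [0.2, 0.3, 1.5, 0.1]])
movie_features = np.transpose(M)

movie_id = 0                          # find movies similar to the first movie
current_movie_features = movie_features[movie_id]

# Absolute difference between this movie's attributes and every movie's
# attributes, totaled into a single difference score per movie.
difference = np.abs(movie_features - current_movie_features).sum(axis=1)

# Smallest difference first: the most similar movies top the list.
ranked = pd.DataFrame({"difference_score": difference})
ranked = ranked.sort_values("difference_score")
print(ranked.head(5))
```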

Output:

Finally, we can print out the first five movies on the list. The first movie in the list is the movie itself, because a movie is most similar to itself. The other four movies look pretty similar to ours, so we can recommend them to the audience as similar products.

Reference:

LinkedIn Learning: Machine Learning & AI Foundations: Recommendations

Python Tutorial: Pandas Pivot Table Explained

Matrix Factorization: A Simple Tutorial and Implementation in Python