An Improved Slope One Algorithm Based on User Similarity Weights

This paper presents an improved Slope One algorithm based on user similarity and a user-interest forgetting function. To address the large number of users and the resulting noisy data, extremely inactive users are first filtered out by setting a user-activity threshold, and the neighbors of the target user are then obtained by computing user similarity. Items that have little influence on the current user are filtered out according to the interest forgetting function, which reduces the noise in the data and improves the accuracy of the algorithm. Experimental comparison shows that the improved algorithm is more accurate than the commonly used Weighted Slope One and Bi-Polar Slope One.


1. Preface
Collaborative Filtering (CF) is the most widely used and most successful class of personalized recommendation algorithms. It falls into two categories: user-based collaborative filtering and item-based collaborative filtering. The Slope One algorithm is a special form of item-based collaborative filtering, characterized by ease of use, accurate recommendations, and high computational efficiency. However, problems such as cold start and data sparsity restrict its development. To solve these problems, several improvements have been proposed [1]. For the data-sparsity problem, the most common approach is to compress the original data using dimension-reduction techniques [3]; irrelevant users or items can also be removed directly to reduce the data volume. For the cold-start problem, some researchers point out that a user's neighbors can be found initially from user attributes or item attributes, i.e., content-based recommendation [4,5]. Because the standard Slope One algorithm considers neither the change of user interest over time nor user similarity, this paper proposes an improved Slope One based on a user-interest forgetting function and a nearest-neighbor filtering strategy, in order to provide more accurate recommendations for target users.

2. Slope One Introduction
For collaborative filtering, a user-item rating matrix must be built first. For m users and n items in the recommendation system, the user set is U = {u1, ..., um}, the item set is P = {p1, ..., pn}, and the rating matrix is R, whose element r_{i,j} is the score given to item p_j by user u_i. If user u_i has not rated item p_j, the corresponding r_{i,j} is empty. The algorithm then predicts scores for the current user's unrated items in the rating matrix.
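As a minimal illustration, the rating matrix above can be stored sparsely as a dict of dicts, where a missing entry corresponds to an empty r_{i,j}; the data and helper name below are illustrative, not from the paper:

```python
from collections import defaultdict

def build_matrix(triples):
    # Dict-of-dicts keeps R sparse: a missing key means the user has not
    # rated that item (an "empty" r_{i,j} in the matrix).
    R = defaultdict(dict)
    for user, item, score in triples:
        R[user][item] = score
    return R

ratings = [("u1", "p1", 4), ("u1", "p2", 3), ("u2", "p1", 5)]
R = build_matrix(ratings)
print(R["u1"]["p2"], "p2" in R["u2"])  # 3 False
```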

The basic Slope One algorithm
The algorithm is based on a simple regression of the form w = f(v) = v + b, where v is a known score and b a deviation. Let U be the user set, S_{i,j}(U) the set of users who have rated both item_i and item_j, and Num(·) the number of elements in a set. The average deviation of item_i from item_j, denoted Dev_{i,j}, is

Dev_{i,j} = ( Σ_{u ∈ S_{i,j}(U)} (r_{u,i} - r_{u,j}) ) / Num(S_{i,j}(U))    (1)

Dev_{i,j} + r_{u,j} therefore gives one prediction of user u's score for item_i, and averaging over all usable items yields the final predicted value:

P(u)_i = ( Σ_{j ∈ R_i} (Dev_{i,j} + r_{u,j}) ) / Num(R_i)    (2)

where R_i denotes the set of items j, rated by user u, that satisfy the criteria i ≠ j and S_{i,j}(U) non-empty.
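Formulas (1) and (2) can be sketched in Python as follows; the toy data and function names are illustrative, not the paper's implementation:

```python
# Basic Slope One: average deviation dev(j, i) between two items (formula (1)),
# then predict user u's score for item j from every item i that u has rated
# and that shares at least one co-rater with j (formula (2)).
def dev(R, j, i):
    common = [u for u in R if j in R[u] and i in R[u]]
    if not common:
        return None
    return sum(R[u][j] - R[u][i] for u in common) / len(common)

def predict(R, u, j):
    preds = []
    for i in R[u]:
        if i == j:
            continue
        d = dev(R, j, i)
        if d is not None:
            preds.append(d + R[u][i])
    return sum(preds) / len(preds) if preds else None

R = {"a": {"p1": 5, "p2": 3}, "b": {"p1": 3, "p2": 4}, "c": {"p1": 2}}
# dev(p2, p1) = ((3-5) + (4-3)) / 2 = -0.5, so c's prediction for p2 is 2 - 0.5
print(predict(R, "c", "p2"))  # 1.5
```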

Improved Slope One algorithm
The current Slope One variants mainly include the Weighted Slope One algorithm and the Bi-Polar Slope One algorithm.
Weighted Slope One takes into account that deviations Dev_{j,i} averaged over different numbers of users have different credibility, and uses the number of users as a weight. Suppose 100 users rated both items j and i, while only 10 users rated both items j and k; the resulting Dev_{j,i} is clearly more persuasive than Dev_{j,k}, so each deviation is weighted in the final predicted value:

P(u)_j = ( Σ_{i ∈ S(u)-{j}} (Dev_{j,i} + r_{u,i}) · c_{j,i} ) / Σ_{i ∈ S(u)-{j}} c_{j,i},  where c_{j,i} = Num(S_{j,i}(U))    (3)

Bi-Polar Slope One divides the rating matrix into two categories, "like" and "dislike", according to whether a user's score for an item is greater than that user's average score: greater means "like", otherwise "dislike":

S^{like}(u) = { i ∈ S(u) | r_{u,i} > r̄_u }    (4)
S^{dislike}(u) = { i ∈ S(u) | r_{u,i} ≤ r̄_u }    (5)

In (4) and (5), S(u) represents the set of items rated by user u, and r̄_u represents the average score of user u. When predicting for the current user u, the predictor considers only those users who agree with u on the positive or negative polarity of the score; finally, the "like" and "dislike" predictions are combined to obtain the comprehensive forecast score.
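The weighted scheme of formula (3) can be sketched as follows; this is a minimal illustration with toy data, not the paper's implementation:

```python
def weighted_predict(R, u, j):
    # Weight each deviation dev(j, i) by c(j, i), the number of users who
    # rated both items, so better-supported deviations count more (formula (3)).
    num, den = 0.0, 0
    for i in R[u]:
        if i == j:
            continue
        common = [v for v in R if j in R[v] and i in R[v]]
        c = len(common)
        if c == 0:
            continue
        d = sum(R[v][j] - R[v][i] for v in common) / c
        num += (d + R[u][i]) * c
        den += c
    return num / den if den else None

R = {"a": {"p1": 5, "p2": 3}, "b": {"p1": 3, "p2": 4}, "c": {"p1": 2}}
print(weighted_predict(R, "c", "p2"))  # 1.5 (one co-rated pair, same as basic)
```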

3. Slope One Weighting Method Based On User Similarity And Forgetting Function
To address the existing problems of the Slope One algorithm, this paper proposes a weighted Slope One algorithm based on user similarity and a forgetting-function correction (SAF). Before user-item scores are predicted, extremely inactive users are removed to reduce the interference of noise samples. The similarity between users is then computed to obtain each user's nearest-neighbor set, and this similarity is used as the weight in score prediction. The item scores in the nearest-neighbor set are corrected by the interest forgetting function, and finally the prediction is carried out over the nearest-neighbor set.

User activity rating
When user similarity is computed, the number of rated items directly affects the complexity of the algorithm. Removing all extremely inactive users before computing similarity reduces time consumption and can also improve prediction accuracy to a certain extent. In addition, extremely inactive users have rated very few items and are most likely new users; for new users, because of the lack of data support, recommending popular products often gives better results. First, an activity threshold is set. Let Num(item_u) be the number of items rated by user u and Num(item) the total number of items. Users for whom Num(item_u)/Num(item) is less than 0.01% are defined as U_inact, and users for whom it is greater than or equal to 0.01% are defined as U_act, so that the user set U = U_inact ∪ U_act.
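The activity split described above might be implemented as follows; the function name is illustrative, the threshold is passed as a parameter (0.01% in the paper), and the toy data are assumptions:

```python
def split_by_activity(R, n_items, threshold=0.0001):
    # Partition users: rated-item fraction below the threshold -> U_inact
    # (filtered out before similarity computation); the rest -> U_act.
    u_act, u_inact = {}, {}
    for u, items in R.items():
        (u_act if len(items) / n_items >= threshold else u_inact)[u] = items
    return u_act, u_inact

# Illustrative: 1000 items total; "a" rated 5% of them, "b" only 0.1%.
R = {"a": {f"p{i}": 5 for i in range(50)}, "b": {"p1": 4}}
act, inact = split_by_activity(R, n_items=1000, threshold=0.01)
print(sorted(act), sorted(inact))  # ['a'] ['b']
```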

User Similarity Measures
The similarity measure is very important to the recommendation system. Traditional similarity measures mainly include cosine similarity, modified cosine similarity, and the Pearson correlation coefficient.
Compared with the traditional measures, similarity based on Euclidean distance is simple and well suited to sparse rating data: it can be computed as soon as two users share at least one common rating, and users with no common rating are assigned a similarity of 0. Over the set I_{u,v} of items rated by both users u and v, the Euclidean distance is

d(u, v) = sqrt( Σ_{i ∈ I_{u,v}} (r_{u,i} - r_{v,i})² )

and the similarity is

sim(u, v) = 1 / (1 + d(u, v))    (9)

Because it requires only one common rating between users, this measure is more suitable for sparse matrices, and its complexity is lower than that of the traditional measures. Therefore, this paper uses formula (9) for similarity calculation.
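A minimal sketch of the Euclidean-distance similarity of formula (9), with sim = 1/(1 + d) over commonly rated items; data and names are illustrative:

```python
from math import sqrt

def euclidean_sim(ru, rv):
    # Similarity 1 / (1 + d) over the items both users rated; users with
    # no common rating get similarity 0, per the text above.
    common = set(ru) & set(rv)
    if not common:
        return 0.0
    d = sqrt(sum((ru[i] - rv[i]) ** 2 for i in common))
    return 1.0 / (1.0 + d)

print(euclidean_sim({"p1": 4, "p2": 3}, {"p1": 4, "p2": 3}))  # 1.0
print(euclidean_sim({"p1": 4}, {"p1": 1}))                    # 0.25
```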

User interest forgetting function
The famous German psychologist Ebbinghaus discovered, through extensive experiments on human forgetting, that forgetting begins immediately after memorization and that the forgetting process is uneven. Ebbinghaus plotted his experimental results as a curve, the famous Ebbinghaus forgetting curve, shown in Fig 1. In the user-interest forgetting function (formula (11)), r_{i,j} represents the score of user i on item j, r'_{i,j} represents the remaining score of user i on item j after decaying over time, and t represents the difference between the current time and the time of the user's rating (unit: minute).
2). Define the nearest-neighbor set of user u as NU_n = { v ∈ U_act | a ≤ |sim(u, v)| }, where the threshold a takes a value in (0, 1) and the similarity between users u and v is given by formula (9); we define sim_{u,v} = |sim(u, v)|.
3). For the nearest-neighbor set NU_n obtained in step 2, compute r'_{i,j} from formula (11); once r'_{i,j} falls below 20% of r_{i,j}, it is held at that fixed value and no longer decayed. In NU_n, replace the original r_{i,j} with r'_{i,j} to obtain a new nearest-neighbor set NNU_n.
4). Using sim_{u,v} as the weight, compute the predicted score over NNU_n with formula (3).
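One possible reading of steps 2)-4) is sketched below. The rating times, the threshold value a = 0.3, and all names are illustrative assumptions; formula (11) is taken as r' = r · k/((lg t)^c + k) with the 20% floor from step 3, and the guard for t < 1 minute is an added assumption:

```python
from math import log10, sqrt

C, K = 1.25, 1.84   # Ebbinghaus constants (formula (10))
A = 0.3             # similarity threshold a -- an illustrative choice
FLOOR = 0.2         # step 3: decay stops at 20% of the original score

def sim(ru, rv):
    # Euclidean-distance similarity over commonly rated items (formula (9)).
    common = set(ru) & set(rv)
    if not common:
        return 0.0
    return 1.0 / (1.0 + sqrt(sum((ru[i] - rv[i]) ** 2 for i in common)))

def decayed(r, t_minutes):
    # Remaining score r' after t minutes (formula (11)), floored at 20% of r.
    b = K / (log10(max(t_minutes, 1.0)) ** C + K)
    return max(r * b, FLOOR * r)

def saf_predict(R, times, u, j):
    """R: user -> {item: score}; times: user -> {item: minutes since rating}."""
    # Step 2: nearest neighbours of u with similarity >= a.
    nn = {v: sim(R[u], R[v]) for v in R if v != u and sim(R[u], R[v]) >= A}
    # Step 3: replace neighbour scores by their decayed values.
    Rn = {v: {i: decayed(r, times[v][i]) for i, r in R[v].items()} for v in nn}
    # Step 4: similarity-weighted Slope One prediction over the neighbour set.
    total, count = 0.0, 0
    for i in R[u]:
        if i == j:
            continue
        devs, w = 0.0, 0.0
        for v, s in nn.items():
            if j in Rn[v] and i in Rn[v]:
                devs += (Rn[v][j] - Rn[v][i]) * s
                w += s
        if w:
            total += devs / w + R[u][i]
            count += 1
    return total / count if count else None

R = {"u": {"p1": 4}, "v": {"p1": 4, "p2": 5}}
times = {"v": {"p1": 1.0, "p2": 1.0}}   # rated one minute ago: no decay yet
print(saf_predict(R, times, "u", "p2"))  # 5.0
```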

Experimental data set
This paper uses the MovieLens data set to evaluate the algorithm. The data are users' ratings of movies on a scale from 1 to 5, where a higher score indicates that the user prefers the movie; the time of each rating is also recorded, and every user has rated at least 20 movies. MovieLens consists of three databases of different sizes; this paper uses the small-scale library containing 100,000 ratings of 1,682 movies by 943 independent users. The data sparsity of this dataset is 1 - 100000/(943 × 1682) = 0.9369.
The experiment divides the dataset into training and test sets. The dataset is first split randomly into five equal parts, of which four are used as the training set and one as the test set; in addition, the historical rating records of 200 users are removed from the training set and added to the test set to serve as predictions for new users.

Evaluation index
In this paper, the mean absolute error (MAE) is used as the metric; it is also the most commonly used measure of recommendation quality. For N predicted scores p_i and corresponding actual scores r_i,

MAE = (1/N) Σ_{i=1}^{N} |p_i - r_i|

The smaller the MAE, the more accurate the prediction.
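The MAE computation above is straightforward; a small sketch with illustrative numbers:

```python
def mae(predicted, actual):
    # Mean absolute error over N prediction/rating pairs.
    assert len(predicted) == len(actual)
    return sum(abs(p - r) for p, r in zip(predicted, actual)) / len(predicted)

print(mae([3.5, 4.0, 2.0], [4, 4, 3]))  # (0.5 + 0.0 + 1.0) / 3 = 0.5
```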

5. Conclusion
The current Slope One algorithm does not consider user similarity, which leads to low accuracy in personalized recommendation. This paper corrects item scores with a user-interest forgetting function, determines each user's neighbor set from the revised user similarity, and uses the similarity as a weight to improve the Slope One algorithm. Three groups of experiments show that removing inactive user samples mainly improves the computational efficiency of the algorithm, while selecting nearest neighbors by similarity and then correcting the neighbors' scores with the interest forgetting function makes the sample data more streamlined, significantly improves the accuracy of score prediction, and greatly reduces the computational workload of the whole algorithm.

Fig 1. Ebbinghaus forgetting curve

The mathematical formula of the Ebbinghaus forgetting curve is

b = 100k / ((lg t)^c + k)    (10)

where b is the memory retention (in percent), t is the time interval from memorization to the current time (in minutes), and c and k are two control constants. Ebbinghaus demonstrated through repeated experiments that with c = 1.25 and k = 1.84 the forgetting function is closest to the normal rule of forgetting. The user's remaining interest in an item at the current time, i.e., the remaining score, is measured by the user-interest forgetting function:

r'_{i,j} = r_{i,j} · k / ((lg t)^c + k)    (11)