Evaluation of Recommender Systems -- Recommender systems evaluation strategies, classic evaluation metrics
Evaluation in the recommender systems domain can follow three principal approaches: off-line experiments, user studies, and online experiments. When selecting an evaluation strategy, it helps to understand the differences among them.
Off-line experiments are run on pre-collected datasets of users choosing or rating items. Such datasets make it possible to simulate users’ behavior and to evaluate an algorithm’s performance. A significant advantage of this type of experiment is that it requires no interaction with real users. The downside is that off-line experiments can evaluate only a narrow set of aspects, primarily an algorithm’s prediction or recommendation accuracy. The goal of this approach is to filter out poorly performing algorithms before the recommender system is evaluated in a user study or an online experiment.
User studies are conducted with a group of people who are given a set of test tasks while their behavior is observed and recorded, quantitative measurements are taken, and questionnaire results are collected, among others. Unlike off-line experiments, user studies make it possible to observe how people behave when interacting with the system and to collect a wider range of quantitative measurements, for example the time spent before submitting feedback. The principal disadvantage is that they are expensive to conduct in terms of both time and cost.
Online experiments provide the strongest evidence that the recommender system has value. This type of experiment is usually conducted after off-line experiments and user studies have been completed and the system is ready to run in production. Online evaluation is unique in that it can measure whether the actual goal has been achieved, for example a profit increase for an e-commerce recommender system, or user retention. When running such experiments, one should consider the risk that poor recommendation quality or design might discourage real users and prevent them from ever using the system again. Therefore, the system should be evaluated carefully with the other strategies before this one is used. [herlocker2004evaluating]
Recommender systems can be evaluated with respect to various aspects, namely functional and non-functional ones.
Functional aspects typically concern the accuracy of the algorithms used. They include predictive accuracy metrics, which measure how accurately the recommender system predicts ratings or recommends top-N lists of items.
To measure accuracy in off-line experiments, a relevant dataset must be selected. Usually, the dataset is randomly divided into two parts: 80% of the ratings as a training set and 20% as a test set. The algorithm “learns” from the training set and predicts the ratings in the test set.
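The random 80/20 split described above can be sketched as follows. This is only an illustration: the function name `split_ratings`, the `(user, item, rating)` tuple layout, the fixed seed, and the synthetic ratings are assumptions made for the example and do not come from any particular dataset or library.

```python
import random


def split_ratings(ratings, test_fraction=0.2, seed=0):
    """Randomly split (user, item, rating) tuples into a training and a test set."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(ratings)               # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Synthetic example data: 10 users x 10 items, ratings on a 1-5 scale.
ratings = [(u, i, (u + i) % 5 + 1) for u in range(10) for i in range(10)]
train, test = split_ratings(ratings)
```

With 100 ratings, `train` holds 80 of them and `test` the remaining 20; every rating ends up in exactly one of the two sets.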
The difference between predicted and actual ratings forms the basis for accuracy metrics. One of the most widely used accuracy metrics is the Mean Absolute Error (MAE) [Herlocker 2004, Shani 2011]. It measures the average absolute difference between the predicted rating and the real rating over all ratings in the test set.
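As the name suggests, the mean absolute error is the average of the absolute errors $|e_i| = |f_i - y_i|$, where $f_i$ is the prediction and $y_i$ the true value: $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |f_i - y_i|$, with $n$ the number of ratings in the test set. Note that alternative formulations may include relative frequencies as weight factors.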
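As an illustration, the unweighted MAE can be computed as in the minimal sketch below, where `predicted` plays the role of the f_i and `actual` the role of the y_i. The function name and the sample values are made up for the example.

```python
def mean_absolute_error(predicted, actual):
    """Average of |f_i - y_i| over all test ratings (unweighted form)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one rating is required")
    return sum(abs(f - y) for f, y in zip(predicted, actual)) / len(predicted)


# Illustrative values only: four predicted ratings and their true ratings.
predicted = [4.0, 3.5, 2.0, 5.0]
actual = [5.0, 3.0, 2.0, 4.0]
mae = mean_absolute_error(predicted, actual)
```

Here the absolute errors are 1, 0.5, 0 and 1, so `mae` is 2.5 / 4 = 0.625.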
Another functional metric is coverage. It measures the percentage of items for which the recommender system can produce recommendations. Catalog coverage measures the percentage of items, out of the total number of available items, that the recommender system has ever actually recommended. The two can differ: there might be items for which the system can potentially make a recommendation, but which the algorithm never suggests.
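Catalog coverage, as described above, can be sketched as the share of catalog items that appear in at least one recommendation list. The function name, the input layout (one list of recommended items per user), and the toy catalog are assumptions for the example.

```python
def catalog_coverage(recommendation_lists, catalog):
    """Fraction of catalog items that were recommended at least once."""
    catalog = set(catalog)
    if not catalog:
        raise ValueError("catalog must not be empty")
    recommended = set()
    for items in recommendation_lists:
        recommended.update(items)
    # Intersect with the catalog so unknown item ids are not counted.
    return len(recommended & catalog) / len(catalog)


# Toy data: five catalog items, three users with top-2 recommendation lists.
catalog = ["a", "b", "c", "d", "e"]
recommendation_lists = [["a", "b"], ["b", "c"], ["a", "c"]]
coverage = catalog_coverage(recommendation_lists, catalog)
```

Items "d" and "e" are never suggested, so the coverage is 3 / 5 = 0.6, even if the algorithm could in principle score them.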
The learning rate also belongs to the non-accuracy functional metrics and measures how quickly the RS reaches a reasonable recommendation level for recently introduced items or users. It can also indicate the cutoff in the number of ratings that the system needs in order to make reasonable recommendations rather than almost random ones. Reasonable in this context means the level at which the system has already collected enough ratings, so that collecting more does not significantly improve accuracy. The most common method to evaluate the learning rate is to plot prediction quality versus the number of items.
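One way to obtain the points for such a plot is sketched below. It rests on several assumptions: quality is measured as MAE on a fixed test set, the x-axis is the number of training ratings made available to the algorithm, and `fit_item_mean` is a hypothetical baseline predictor (the item's mean training rating) that stands in for a real recommender.

```python
def fit_item_mean(ratings, default=3.0):
    """Hypothetical baseline: predict each item's mean training rating."""
    totals = {}
    for _user, item, rating in ratings:
        total, count = totals.get(item, (0.0, 0))
        totals[item] = (total + rating, count + 1)

    def predict(user, item):
        if item in totals:
            total, count = totals[item]
            return total / count
        return default  # unseen item: fall back to a neutral rating

    return predict


def learning_curve(train, test, fit, sizes):
    """Return (training size, MAE on the test set) pairs for each size."""
    curve = []
    for n in sizes:
        predict = fit(train[:n])  # train on the first n ratings only
        errors = [abs(predict(u, i) - r) for u, i, r in test]
        curve.append((n, sum(errors) / len(errors)))
    return curve


# Toy data, invented for the example.
train = [("u1", "a", 5), ("u2", "a", 5), ("u1", "b", 1), ("u3", "b", 1)]
test = [("u4", "a", 5), ("u4", "b", 1)]
curve = learning_curve(train, test, fit_item_mean, sizes=[0, 2, 4])
```

On this toy data the MAE falls from 2.0 with no training ratings to 1.0 with two and 0.0 with four. On real data, the point where the curve flattens would indicate the cutoff beyond which more ratings no longer improve accuracy noticeably.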
Non-functional aspects include usability, scalability, robustness, and users’ trust, among others. These aspects do not affect the quality of recommendations but have a direct influence on users’ satisfaction when they use the system.
To summarize, prediction accuracy does not always guarantee users’ satisfaction. Other aspects should also be considered for the design of the RS to succeed. The choice of experimental approach determines which evaluation metrics are available; therefore, it is beneficial to conduct a set of different experiments whenever possible. The following section describes the challenges that RS face. Resolving these limitations leads to an increase in users’ satisfaction.