Personalized ranking systems, also known as recommender systems, use a variety of big data methods, including collaborative filtering, graph random walks, matrix factorization, and latent-factor models. Given their wide use across social-network, e-commerce, and content platforms, platform operators and developers need better ways to choose the systems most suitable for their use cases.
The research literature on recommender systems describes a multitude of measures for evaluating the performance of different algorithms.
However, from the end-users' perspective, the large number of available measures offers little help in deciding which algorithm to deploy. Some of the measures are correlated, while others capture different aspects of recommendation performance, such as accuracy and diversity.
To address this problem, we propose a novel benchmarking framework that combines different evaluation measures to rank the recommender systems separately on each benchmark dataset.
Additionally, our approach identifies sets of correlated measures as well as sets of measures that are the least correlated.
We investigate the robustness of the proposed methodology using published results from an experimental study involving multiple big datasets and evaluation measures.
Our work provides a general framework that can handle an arbitrary number of evaluation measures and help end-users rank the systems available to them.
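
As an illustration only, the two ideas described above (combining several measures into one ranking per dataset, and detecting correlated measures) could be sketched as follows. The sketch assumes a simple mean-rank aggregation and Spearman rank correlation; both are stand-ins chosen for brevity and are not necessarily the method used in this work. The function names, the score values, and the assumption that higher scores are better for every measure are all hypothetical.

```python
import numpy as np

def rank_columns(scores):
    """Rank systems within each measure (column); rank 1 = highest score.

    Assumes higher is better for every measure and no tied scores.
    """
    order = np.argsort(-scores, axis=0, kind="stable")
    ranks = np.empty_like(order)
    positions = np.arange(1, scores.shape[0] + 1)
    for j in range(scores.shape[1]):
        ranks[order[:, j], j] = positions
    return ranks

def aggregate_ranking(scores):
    """Order systems by their mean rank across all measures (assumed rule)."""
    mean_rank = rank_columns(scores).mean(axis=1)
    return np.argsort(mean_rank, kind="stable"), mean_rank

def measure_correlation(scores):
    """Spearman correlation between measures: Pearson correlation of ranks."""
    ranks = rank_columns(scores).astype(float)
    return np.corrcoef(ranks, rowvar=False)

# Hypothetical scores on one dataset: rows = 4 systems,
# columns = 3 measures (e.g. precision, nDCG, diversity).
scores = np.array([
    [0.30, 0.40, 0.10],
    [0.25, 0.35, 0.40],
    [0.20, 0.30, 0.20],
    [0.10, 0.20, 0.30],
])

order, mean_rank = aggregate_ranking(scores)   # system indices, best first
corr = measure_correlation(scores)             # 3 x 3 correlation matrix
```

In this toy data the first two measures rank the systems identically, so their correlation is 1, while the third measure disagrees with both; a real study would repeat the computation for each benchmark dataset.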