Recommender systems rely on big-data methods and are widely deployed on social-network, e-commerce, and content platforms. As their relevance grows, online platforms and developers need better ways to choose the systems best suited to their use cases. At the same time, the research literature on recommender systems describes a multitude of measures for evaluating the performance of different algorithms. For the end-user, however, the large number of available measures does not provide much help in deciding which algorithm to deploy: some measures are correlated, while others capture different aspects of recommendation performance, such as accuracy and coverage. To address this problem, we propose a novel benchmarking framework that combines different evaluation measures to rank the recommender systems on each benchmark dataset separately. Additionally, our approach discovers sets of correlated measures as well as sets of measures that are least correlated. We investigate the robustness of the proposed methodology using published results from an experimental study involving multiple large datasets and evaluation measures. Our work provides a general framework that can handle an arbitrary number of evaluation measures and helps end-users rank the systems available to them.
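As a minimal illustration of the general idea (a hypothetical sketch, not the paper's actual algorithm), one can rank systems on a single dataset by aggregating their per-measure ranks, and inspect pairwise correlations between measures to find correlated sets. The data and the Borda-style mean-rank aggregation below are assumptions for demonstration only:

```python
import numpy as np

# Toy score matrix: rows are recommender systems, columns are evaluation
# measures (e.g., accuracy, diversity, coverage). Higher is assumed better
# for every measure in this illustrative example.
scores = np.array([
    [0.81, 0.40, 0.65],   # system A
    [0.78, 0.55, 0.70],   # system B
    [0.60, 0.52, 0.30],   # system C
])

# Per-measure ranks (0 = best) via double argsort on the negated scores.
ranks = np.argsort(np.argsort(-scores, axis=0), axis=0)

# Mix the measures: mean rank across columns (a simple Borda-style
# aggregation), then order systems from best to worst.
mean_rank = ranks.mean(axis=1)
order = np.argsort(mean_rank)  # indices of systems, best first

# Correlated measures: Pearson correlation between score columns; the
# off-diagonal entry with the largest magnitude identifies the most
# correlated pair of measures.
corr = np.corrcoef(scores, rowvar=False)
most_correlated = np.unravel_index(
    np.argmax(np.abs(corr) - np.eye(corr.shape[1])), corr.shape)
```

In this toy example, system B wins on mean rank even though it is not best on every individual measure, which is exactly the situation a multi-measure benchmark has to resolve.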