Fair Benchmarking for Dynamic Algorithm Configuration
2023 - 2025
Automated machine learning, such as hyperparameter optimization and neural architecture search, led to new state of the art in several modern AI applications and became more efficient by orders of magnitude in recent years. This facilitated efficient development of new AI applications, which were tedious, error-prone and time-consuming before. As the next logical step, Dynamic Algorithm Configuration (DAC) aims at dynamically adjusting the hyperparameters of a target algorithm to improve performance in a data-driven manner [Biedenkapp et al., 2020]. This allows to not only statically tune hyperparameters, but adjust them while the learning takes place. In contrast to previous methods, this allows us to deeply go into the algorithms and open up new potentials to further improvements of performance. Theoretical and empirical results demonstrated the advantages of dynamically controlling hyperparameters, e.g. in the domains of deep learning, evolutionary algorithms and AI planning [Daniel et al., 2016; Vermetten et al., 2019; Doerr & Doerr, 2020; Shala et al. 2020; Speck et al., 2021]. Until recently, it was not possible to compare different DAC approaches in a sound and reproducible manner and therefore, stopping research on DAC in making substantial progress. In order to foster research on DAC and to be able to compare algorithms in a fair way, DACBench [Eimer et al., 2021] has been developed by Prof. Lindauer and his colleagues. DACBench is a benchmark library to assess the performance and behavior of algorithms for Dynamic Algorithm Configuration (DAC). It addresses following issues entailed in DAC. First, the lack of a uniform interface harms accessibility and reproducibility. Building an interface for each new DAC optimizer to each new target problem is tedious and hinders comparability across benchmarks and optimizers. The intuitive interface of DACBench is based on the well-known OpenAI gym [Brockman et al., 2016]. Nevertheless, the interface is kept lightweight and is highly customizable, with a mechanism to log all details required to reproduce the benchmark. Therefore the purpose of DACBench is to ensure fair comparisons of methods across domains and to ensure reproducibility of previous work. In addition, it should serve as a template for future benchmarks such that new benchmarks can be easily added to the library. In recent works, Prof. Lindauer, his team and colleagues showed that DAC can be successfully applied to several AI domains and outperforms traditional approaches, including evolutionary strategies [Shala et al., 2020], AI Planning [Speck et al., 2021; Biedenkapp et al. 2022] and SGD for neural networks [Adriansen et al. 2022]. DACBench is actively maintained and developed and aims to include a larger selection of benchmarks, extending to the other AI domains like SAT or MIP and additional “toy” benchmarks to ease development of new DAC methods and meta-algorithms applied to DAC. However, there are several remaining challenges and opportunities to improve DAC w.r.t. building trust into DAC by considering the various aspects of trustworthy AI (transparency, explainability, fairness, robustness, and privacy) as defined in the EC Guidelines for Trustworthy AI and the Assessment List for Trustworthy AI. These include: • Typically AI models are overfitted to the selected data instances for training, which means that their selected hyperparameters can not generalize the learned knowledge on new data instances that were not involved in the training process. • There is an inherent bias presented in the learning process that originates from the quality of data used in the training process. • There are no quantitative measures to estimate the level of trust/confidence when using a developed AI model on new unseen data, which will indicate to which level generalization of the learned knowledge is possible. To overcome all the aforementioned challenges, a crucial role is the benchmarking process [Bartz-Beielstein et al., 2020]. The main task of a classical benchmarking in AI involves evaluation of the performance of an algorithm against other algorithms. To do this, there are three main questions that should be taken with a great care: i) which data instances should be selected for the comparison study, ii) how to design the experiments that lead to reproducible and replicable results, and iii) which performance measures should be analyzed and with which statistical approach. Even though great success and huge steps have been achieved by proposing more robust statistical methodologies for benchmarking [Eftimov and Korošec, 2019], the selection of the benchmark data instances that will be included in the analysis can have a huge impact on the experimental design of the learning task. It can happen that the same set of algorithm instances (algorithms with different hyperparameters) evaluated on different sets of data instances results in different winning algorithms. This can lead to unintentional false presentation of the results that make their newly developed algorithm instance/hyperparameters look superior to the others. At the same time, this fact decreases the generalization of the obtained results. Recently, Tome Eftimov and his colleagues have proposed an approach based on unsupervised learning [Eftimov et al., 2022] and graph theory [Cenikj et al., 2022] that can select only representative data instances that should be involved in the learning process. Testing it on data instances from single-objective optimization and univariate time series, the proposed approach shows and guarantees reproducible statistical outcomes in comparing performance of algorithm instances. These results point out that this kind of approach is extremely welcome to be researched to generalize the developed AI models and allow transferring of the learned knowledge. At the same time, it will allow the selection of the most promising hyperparameters and reduce the uncertainty when the model will be tested on new data instances. The goals of our project are: 1. To investigate different meta-feature representations for data instances for DAC learning tasks presented in the DACBench library, which will allow us to perform complementary analysis between them and perform a landscape coverage analysis of the problem/feature space, leading to a thorough and fair comparison of DAC methods. The utility of meta-feature representations will also be investigated by transforming them with matrix factorization and deep learning techniques. 2. To develop methodologies that will allow us automatically to select more representative data instances that will uniformly cover the landscape space of the data instances using their meta-feature representation, which can be used for benchmarking studies to produce reproducible and transferable results. 3. To define quantitative indicators that will measure the level of diversity between the data instances used for training DAC and instances used for testing by using their meta-feature representation. This will provide researchers the level of trust of applying the selected hyperparameters on new unseen data instances. 4. To determine computational time and energy variables that will be measured to estimate the greener level (i.e. encouraging a reduction in resources spent) of performing DAC experiments only on the selected representative data instances.
ARRS - Nemčija