This repository contains the implementation for the paper: On the Importance of Reward Design in Reinforcement Learning-based Dynamic Algorithm Configuration: A Case Study on OneMax with (1+($\lambda$,$\lambda$))-GA.
We propose applying RL to control the population size of the (1+($\lambda$,$\lambda$))-GA solving the OneMax problem.
We provide an example that visualizes the optimization progress on a problem of size 100 (shown as a 10×10 grid to save space), comparing two controllers: an RL-based policy and a random policy. Blue cells denote bits set to 1, while red cells denote bits set to 0. The optimal state is reached when the grid is completely filled with blue cells.
| RL-based Policy | Random Policy |
|---|---|
| ![]() | ![]() |
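For reference, a bitstring can be rendered as such a grid with a few lines of matplotlib. The snippet below is only an illustrative sketch using a random bitstring and the color convention described above; it is not the script that produced the figures.

```python
# Illustrative sketch of the grid view: reshape a length-100 bitstring into a
# 10x10 grid and draw 1-bits in blue, 0-bits in red. The random bitstring is
# just a stand-in for a GA individual; not the original plotting script.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=100)              # stand-in bitstring, n=100

plt.imshow(bits.reshape(10, 10),
           cmap=ListedColormap(["red", "blue"]),  # 0 -> red, 1 -> blue
           vmin=0, vmax=1)
plt.xticks([])
plt.yticks([])
plt.title("OneMax individual as a 10x10 grid")
plt.show()
```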
The repository is structured as follows:
OneMax-DAC/
├── notebooks/
│   └── test.ipynb           # Testing on-the-fly using trained DDQNs
├── resources/               # Additional resources for this project
│   ├── ddqn_ckpts           # DDQN checkpoints for all problem sizes
│   └── other_methods        # irace-based tuning and optimal policies
├── onemax_dac/              # Source code for the project
│   ├── train.py             # Script to train models
│   ├── dac/                 # Main components of DAC employed in this project
│   │   ├── trainer.py       # Module to train DAC
│   │   ├── buffer.py        # Module to store the experiences
│   │   ├── agent.py         # Module to hold the environment and replay buffer
│   │   ├── policy.py        # Q-network module
│   │   ├── eval.py          # Functions to evaluate the policy
│   │   ├── logger.py        # Module to monitor the training process
│   │   └── utils.py         # Helper functions
│   └── theory_env/          # Theoretical environments based on DACBench
│       └── onemax.py        # OneMax problem module
├── requirements.txt         # List of dependencies
├── README.md                # Project readme file
└── LICENSE                  # License for the project
To reproduce this project, you will need to have the following dependencies installed.
After installing Miniconda, you can create a new environment and install the required packages using the following commands:
conda create -n onemaxdac python=3.10
conda activate onemaxdac
To install torch, refer to this link: INSTALLING PREVIOUS VERSIONS OF PYTORCH. Then clone the repository and install the remaining dependencies:
pip install -r requirements.txt
We provide the best DDQN checkpoints, trained using the best reward-function settings for certain problem sizes, at resources/ddqn_ckpts.
To replicate the results reported in the paper, follow the notebook test.ipynb:
- Initialize the DDQN and OneMax environment objects.
- Load the trained checkpoint properly.
- Run the (1+($\lambda$,$\lambda$))-GA and observe the ERT.
Note: Please make sure the notebook kernel has the necessary packages installed. A rough sketch of these steps is shown below.
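The sketch uses placeholder names throughout: DDQN, OneMaxEnv, the checkpoint filename, and the gymnasium-style step() signature are all assumptions, not the project's actual API. The real interfaces live in onemax_dac/dac/policy.py and onemax_dac/theory_env/onemax.py, and test.ipynb shows the exact calls.

```python
# Rough outline of the notebook workflow (placeholder names; see test.ipynb
# for the real classes and arguments).
import torch

from onemax_dac.dac.policy import DDQN               # hypothetical class name
from onemax_dac.theory_env.onemax import OneMaxEnv   # hypothetical class name

# 1) Initialize the DDQN and OneMax environment objects.
env = OneMaxEnv(problem_size=100)
policy = DDQN(state_dim=2, n_actions=env.action_space.n)

# 2) Load the trained checkpoint.
policy.load_state_dict(
    torch.load("resources/ddqn_ckpts/<checkpoint>.pt", map_location="cpu"))
policy.eval()

# 3) Run the (1+(lambda,lambda))-GA under the learned policy for one episode;
#    averaging runtimes over many episodes gives the ERT.
obs, _ = env.reset(seed=1)
done, generations = False, 0
while not done:
    with torch.no_grad():
        q_values = policy(torch.as_tensor(obs, dtype=torch.float32))
    obs, reward, terminated, truncated, info = env.step(q_values.argmax().item())
    done = terminated or truncated
    generations += 1
print("generations until the optimum:", generations)
```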
We divide our experiments into three groups:
- Original reward function
- Reward scaling
- Reward shifting
The implementation of these families of reward functions can be found in onemax.py.
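As a schematic illustration of how these families relate, consider the sketch below. The exact definitions, including the adaptive shift, are in onemax_dac/theory_env/onemax.py; the function name, its default arguments, and the interpretation of delta_fitness here are placeholders, not the paper's settings.

```python
# Schematic only: the paper's reward functions are implemented in
# onemax_dac/theory_env/onemax.py. delta_fitness stands for the fitness gain
# of one generation; scale and shift are placeholder parameters.
def shaped_reward(delta_fitness: float,
                  choice: str = "original",
                  scale: float = 1.0,
                  shift: float = 0.0) -> float:
    if choice == "original":
        return delta_fitness              # plain fitness improvement
    if choice == "scaling":
        return scale * delta_fitness      # rescaled fitness improvement
    if choice == "shifting":
        return delta_fitness + shift      # fitness improvement plus a bias
    raise ValueError(f"unknown reward choice: {choice}")
```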
## Original reward function (problem size n=100, random seed 1, 4 CPU workers)
python onemax_dac/train.py \
    --problem_size 100 \
    --reward_choice original \
    --seed 1 \
    --num_workers 4

## Reward scaling
python onemax_dac/train.py \
    --problem_size 100 \
    --reward_choice scaling \
    --seed 1 \
    --num_workers 4

## Reward shifting with a fixed bias of -3
python onemax_dac/train.py \
    --problem_size 100 \
    --reward_choice shifting \
    --fixed_shift -3 \
    --seed 1 \
    --num_workers 4

## Reward shifting with an adaptive bias (no --fixed_shift)
python onemax_dac/train.py \
    --problem_size 100 \
    --reward_choice shifting \
    --seed 1 \
    --num_workers 4
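If you prefer launching all four groups from Python, a small launcher along these lines works with the same CLI flags. Running them sequentially is a simplification; parallelize across machines or a cluster as you see fit.

```python
# Convenience launcher for the four experiment groups above, using the same
# CLI flags as train.py. Runs the configurations one after another.
import subprocess

variants = [
    ["--reward_choice", "original"],
    ["--reward_choice", "scaling"],
    ["--reward_choice", "shifting", "--fixed_shift", "-3"],  # fixed bias
    ["--reward_choice", "shifting"],                         # adaptive bias
]

for extra in variants:
    subprocess.run(
        ["python", "onemax_dac/train.py",
         "--problem_size", "100", "--seed", "1", "--num_workers", "4",
         *extra],
        check=True,
    )
```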
After launching a run, the terminal output should look like this:
{'TrainingConfig': {'max_steps': 500000, 'buffer_size': 1000000, 'epsilon_start': 1.0, 'epsilon_end': 0.2, 'warmup_steps': 10000, 'batch_size': 2048, 'learning_rate': 0.001, 'gamma': 0.99, 'tau': 0.01, 'loss_fn': 'MSE', 'eval_interval': 2000, 'n_eval_episodes': 100, 'output_dir': 'outputs', 'accelerator': 'cpu', 'num_workers': 4, 'wandb': False, 'seed': 1, 'fixed_shift': None}, 'PolicyConfig': {'policy_name': 'DDQN', 'net_arch': [50, 50], 'activation_fn': 'ReLU'}, 'EnvConfig': {'problem_size': 100, 'state_dim': 2, 'discrete_action': True, 'action_choices': [], 'reward_choice': 'original', 'seed': 1, 'init_obj_rate': 0.5, 'kwargs': {}}}
Populating Buffer: 100%|████████████████████████████████████| 10000/10000 [00:03<00:00, 2512.55it/s]
[Training]: step: 14000.00 | shift: 0.00 | loss: 27.20 | best_val_rt: 689.01: 1%|▏ | 4110/490000 [00:25<50:44, 159.58it/s]
Observation:
- A dictionary containing the current running configurations.
- A message of Populating Buffer indicating that the warm-up process is running for N steps.
- Then the training process runs for the remaining steps (total steps minus warm-up steps), reporting:
  - step: current evaluation step
  - shift: value of shifting
  - loss: current training loss
  - best_val_rt: best evaluated expected runtime
During training, we can monitor the logs under outputs/checkpoints/<date>_<time>/seed_<#>. This directory contains:
outputs/checkpoints/<date>_<time>/seed_<#>/
├── config.yml           # Training configuration is stored here
├── evaluations.json     # Learned policies and ERTs from both evaluation and testing phases
├── best.pt              # Best checkpoint of the Q-network
├── learning_curve.pdf   # Evaluated ERT over 100 runs during training
└── policy.pdf           # Policy comparison
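To take a quick look at a finished run from Python, something like the following works. The keys inside config.yml and the structure of evaluations.json are assumptions here, so adapt them to what you actually find in the files.

```python
# Peek at a finished run. The directory layout follows the tree above; the
# keys used below (e.g. EnvConfig/reward_choice) are assumed, not guaranteed.
import json
import yaml  # pip install pyyaml

run_dir = "outputs/checkpoints/<date>_<time>/seed_1"  # fill in a real run path

with open(f"{run_dir}/config.yml") as f:
    config = yaml.safe_load(f)
print("reward choice:", config.get("EnvConfig", {}).get("reward_choice"))

with open(f"{run_dir}/evaluations.json") as f:
    evaluations = json.load(f)
print("first entries:", list(evaluations)[:5])
```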