This repository contains the implementation for the paper: On the Importance of Reward Design in Reinforcement Learning-based Dynamic Algorithm Configuration: A Case Study on OneMax with (1+($\lambda$,$\lambda$))-GA.
We propose applying RL to control the population size of the (1+($\lambda$,$\lambda$))-GA solving the OneMax problem.
We provide an example that visualizes the optimization progress on a problem of size 100 (shown as a 10×10 grid to save space), comparing two controllers: an RL-based policy and a random policy. Blue cells denote bits set to 1, while red cells denote bits set to 0. The optimal state is reached when the grid is completely filled with blue cells.
| RL-based Policy | Random Policy |
|---|---|
| ![]() | ![]() |
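For reference, a bitstring can be rendered as such a grid with a few lines of matplotlib. The snippet below is only an illustrative sketch using a random bitstring and the color convention described above; it is not the script that produced the figures.

```python
# Illustrative sketch of the grid view: reshape a length-100 bitstring into a
# 10x10 grid and draw 1-bits in blue, 0-bits in red. The random bitstring is
# just a stand-in for a GA individual; not the original plotting script.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=100)              # stand-in bitstring, n=100

plt.imshow(bits.reshape(10, 10),
           cmap=ListedColormap(["red", "blue"]),  # 0 -> red, 1 -> blue
           vmin=0, vmax=1)
plt.xticks([])
plt.yticks([])
plt.title("OneMax individual as a 10x10 grid")
plt.show()
```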
The repository is structured as follows:
OneMax-DAC/
├── notebooks/
│   └── test.ipynb           # Testing on-the-fly using trained DDQNs
├── resources/               # Additional resources for this project
│   ├── ddqn_ckpts           # DDQN checkpoints for all problem sizes
│   └── other_methods        # irace-based tuning and optimal policies
├── onemax_dac/              # Source code for the project
│   ├── train.py             # Script to train models
│   ├── dac/                 # Main components of DAC employed in this project
│   │   ├── trainer.py       # Module to train DAC
│   │   ├── buffer.py        # Module to store the experiences
│   │   ├── agent.py         # Module to hold the environment and replay buffer
│   │   ├── policy.py        # Q-network module
│   │   ├── eval.py          # Functions to evaluate the policy
│   │   ├── logger.py        # Module to monitor the training process
│   │   └── utils.py         # Helper functions
│   └── theory_env/          # Theoretical environments based on DACBench
│       └── onemax.py        # OneMax problem module
├── requirements.txt         # List of dependencies
├── README.md                # Project readme file
└── LICENSE                  # License for the project
To reproduce this project, you will need to have the following dependencies installed.
After installing Miniconda, you can create a new environment and install the required packages using the following commands:
conda create -n onemaxdac python=3.10
conda activate onemaxdac
To install torch, refer to this link: INSTALLING PREVIOUS VERSIONS OF PYTORCH. Then clone the repository and install the remaining dependencies:
pip install -r requirements.txt
We provide the best DDQN checkpoints, trained using the best reward-function settings for certain problem sizes, at resources/ddqn_ckpts.
To replicate the results reported in the paper, follow the notebook test.ipynb:
- Initialize the DDQN and OneMax environment objects.
- Load the trained checkpoint properly.
- Run the (1+($\lambda$,$\lambda$))-GA and observe the ERT.
Note: Please make sure the notebook kernel has the necessary packages installed. A rough sketch of these steps is shown below.
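The sketch uses placeholder names throughout: DDQN, OneMaxEnv, the checkpoint filename, and the gymnasium-style step() signature are all assumptions, not the project's actual API. The real interfaces live in onemax_dac/dac/policy.py and onemax_dac/theory_env/onemax.py, and test.ipynb shows the exact calls.

```python
# Rough outline of the notebook workflow (placeholder names; see test.ipynb
# for the real classes and arguments).
import torch

from onemax_dac.dac.policy import DDQN               # hypothetical class name
from onemax_dac.theory_env.onemax import OneMaxEnv   # hypothetical class name

# 1) Initialize the DDQN and OneMax environment objects.
env = OneMaxEnv(problem_size=100)
policy = DDQN(state_dim=2, n_actions=env.action_space.n)

# 2) Load the trained checkpoint.
policy.load_state_dict(
    torch.load("resources/ddqn_ckpts/<checkpoint>.pt", map_location="cpu"))
policy.eval()

# 3) Run the (1+(lambda,lambda))-GA under the learned policy for one episode;
#    averaging runtimes over many episodes gives the ERT.
obs, _ = env.reset(seed=1)
done, generations = False, 0
while not done:
    with torch.no_grad():
        q_values = policy(torch.as_tensor(obs, dtype=torch.float32))
    obs, reward, terminated, truncated, info = env.step(q_values.argmax().item())
    done = terminated or truncated
    generations += 1
print("generations until the optimum:", generations)
```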
We divide our experiments into three groups:
- Original reward function
- Reward scaling
- Reward shifting
The implementation of these families of reward functions can be found in onemax.py.
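As a schematic illustration of how these families relate, consider the sketch below. The exact definitions, including the adaptive shift, are in onemax_dac/theory_env/onemax.py; the function name, its default arguments, and the interpretation of delta_fitness here are placeholders, not the paper's settings.

```python
# Schematic only: the paper's reward functions are implemented in
# onemax_dac/theory_env/onemax.py. delta_fitness stands for the fitness gain
# of one generation; scale and shift are placeholder parameters.
def shaped_reward(delta_fitness: float,
                  choice: str = "original",
                  scale: float = 1.0,
                  shift: float = 0.0) -> float:
    if choice == "original":
        return delta_fitness              # plain fitness improvement
    if choice == "scaling":
        return scale * delta_fitness      # rescaled fitness improvement
    if choice == "shifting":
        return delta_fitness + shift      # fitness improvement plus a bias
    raise ValueError(f"unknown reward choice: {choice}")
```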
## Original reward function (problem size n=100, random seed 1, 4 CPU workers)
python onemax_dac/train.py \
    --problem_size 100 \
    --reward_choice original \
    --seed 1 \
    --num_workers 4

## Reward scaling
python onemax_dac/train.py \
    --problem_size 100 \
    --reward_choice scaling \
    --seed 1 \
    --num_workers 4

## Reward shifting with a fixed bias of -3
python onemax_dac/train.py \
    --problem_size 100 \
    --reward_choice shifting \
    --fixed_shift -3 \
    --seed 1 \
    --num_workers 4

## Reward shifting with an adaptive bias (no --fixed_shift)
python onemax_dac/train.py \
    --problem_size 100 \
    --reward_choice shifting \
    --seed 1 \
    --num_workers 4
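If you prefer launching all four groups from Python, a small launcher along these lines works with the same CLI flags. Running them sequentially is a simplification; parallelize across machines or a cluster as you see fit.

```python
# Convenience launcher for the four experiment groups above, using the same
# CLI flags as train.py. Runs the configurations one after another.
import subprocess

variants = [
    ["--reward_choice", "original"],
    ["--reward_choice", "scaling"],
    ["--reward_choice", "shifting", "--fixed_shift", "-3"],  # fixed bias
    ["--reward_choice", "shifting"],                         # adaptive bias
]

for extra in variants:
    subprocess.run(
        ["python", "onemax_dac/train.py",
         "--problem_size", "100", "--seed", "1", "--num_workers", "4",
         *extra],
        check=True,
    )
```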
After launching a run, the terminal output should look like this:
{'TrainingConfig': {'max_steps': 500000, 'buffer_size': 1000000, 'epsilon_start': 1.0, 'epsilon_end': 0.2, 'warmup_steps': 10000, 'batch_size': 2048, 'learning_rate': 0.001, 'gamma': 0.99, 'tau': 0.01, 'loss_fn': 'MSE', 'eval_interval': 2000, 'n_eval_episodes': 100, 'output_dir': 'outputs', 'accelerator': 'cpu', 'num_workers': 4, 'wandb': False, 'seed': 1, 'fixed_shift': None}, 'PolicyConfig': {'policy_name': 'DDQN', 'net_arch': [50, 50], 'activation_fn': 'ReLU'}, 'EnvConfig': {'problem_size': 100, 'state_dim': 2, 'discrete_action': True, 'action_choices': [], 'reward_choice': 'original', 'seed': 1, 'init_obj_rate': 0.5, 'kwargs': {}}}
Populating Buffer: 100%|████████████████████████████████████| 10000/10000 [00:03<00:00, 2512.55it/s]
[Training]: step: 14000.00 | shift: 0.00 | loss: 27.20 | best_val_rt: 689.01: 1%|▏ | 4110/490000 [00:25<50:44, 159.58it/s]
Observation:
- A dictionary containing the current running configurations.
- A message of Populating Buffer indicating that the warm-up process is running for N steps.
- Then the training process runs for the remaining steps (total steps minus warm-up steps), reporting:
  - step: current evaluation step
  - shift: value of shifting
  - loss: current training loss
  - best_val_rt: best evaluated expected runtime
During training, we can monitor the logs under outputs/checkpoints/<date>_<time>/seed_<#>. This directory contains:
outputs/checkpoints/<date>_<time>/seed_<#>/
├── config.yml           # Training configuration is stored here
├── evaluations.json     # Learned policies and ERTs from both evaluation and testing phases
├── best.pt              # Best checkpoint of the Q-network
├── learning_curve.pdf   # Evaluated ERT over 100 runs during training
└── policy.pdf           # Policy comparison
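To take a quick look at a finished run from Python, something like the following works. The keys inside config.yml and the structure of evaluations.json are assumptions here, so adapt them to what you actually find in the files.

```python
# Peek at a finished run. The directory layout follows the tree above; the
# keys used below (e.g. EnvConfig/reward_choice) are assumed, not guaranteed.
import json
import yaml  # pip install pyyaml

run_dir = "outputs/checkpoints/<date>_<time>/seed_1"  # fill in a real run path

with open(f"{run_dir}/config.yml") as f:
    config = yaml.safe_load(f)
print("reward choice:", config.get("EnvConfig", {}).get("reward_choice"))

with open(f"{run_dir}/evaluations.json") as f:
    evaluations = json.load(f)
print("first entries:", list(evaluations)[:5])
```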