DynaOpt: Dynamic Reward Adjustment in Multi-Reward Reinforcement Learning for Counselor Reflection Generation
This repository contains the code for the paper *Dynamic Reward Adjustment in Multi-Reward Reinforcement Learning for Counselor Reflection Generation*.
Clone the repository:
```bash
git clone https://github.com/mindojune/dynaopt.git
cd dynaopt
```
Next, create a new environment and install the required packages:
```bash
conda create -n dynaopt python=3.9
conda activate dynaopt
pip install -r requirements.txt
```
Download the pretrained weights here.
Place the weights inside `dynaopt/weights`.
First, train the warm-start model in a supervised manner.
```bash
python supervised_train.py --experiment MI --num_epochs 5
```
Save the path of your trained model and run the inference step over the test data:
```bash
start_dir={your supervised model}
python test_util.py --experiment MI_rl --model_start_dir $start_dir
```
Next, you can train the RL models. This project uses the k self-critical sequence training (k-SCST) algorithm. Refer to the following table to pick the model you want to train; example commands and illustrative sketches of the reward weighting and the k-SCST baseline follow the table.
Model | Script | `--learning_mode` flag
---|---|---
Round | `rl_train.py` | `round`
Uniform Weighted | `rl_train.py` | `weighted`
DORB | `rl_train.py` | `bandit`
DynaOpt (Ours) | `rl_train.py` | `bandit_weighted`
C-DynaOpt (Ours) | `con_rl_train.py` | (none)
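For intuition, the sketch below shows one way per-reward scores could be combined under a fixed uniform weighting (the "Uniform Weighted" mode) versus weights that are renormalized as training progresses (the idea behind the bandit-weighted modes). This is an illustrative assumption, not the repository's implementation; the reward names, function names, and update rule are made up for the example.

```python
import numpy as np

# Illustrative reward components; the actual reward models live in this repo.
REWARD_NAMES = ["reflection", "fluency", "coherence"]

def combine_rewards(component_scores, weights=None):
    """Collapse per-component scores (name -> array of shape (batch,))
    into one scalar reward per sample. Uses uniform weights if none given."""
    if weights is None:
        weights = {name: 1.0 / len(component_scores) for name in component_scores}
    total = np.zeros(len(next(iter(component_scores.values()))), dtype=float)
    for name, scores in component_scores.items():
        total += weights[name] * np.asarray(scores, dtype=float)
    return total

def update_weights(weights, recent_gains, lr=0.1):
    """Toy dynamic adjustment: push weight toward components whose reward has
    recently improved, then renormalize with a softmax. The real bandit update
    in rl_train.py may differ."""
    logits = np.array([np.log(weights[n] + 1e-8) + lr * recent_gains.get(n, 0.0)
                       for n in REWARD_NAMES])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return dict(zip(REWARD_NAMES, probs))

# Example: two generated reflections scored by three reward models.
scores = {"reflection": [0.7, 0.4], "fluency": [0.9, 0.8], "coherence": [0.6, 0.5]}
print(combine_rewards(scores))                       # uniform combination
w = update_weights({n: 1 / 3 for n in REWARD_NAMES},
                   recent_gains={"reflection": 0.2})
print(combine_rewards(scores, weights=w))            # adjusted combination
```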
Example commands for each mode (with `$i` as the random seed):
```bash
# DynaOpt (Ours)
python rl_train.py --learning_mode bandit_weighted --seed $i --experiment MI_rl --model_start_dir $start_dir
# C-DynaOpt (Ours)
python con_rl_train.py --seed $i --experiment MI_rl --model_start_dir $start_dir
# Uniform Weighted
python rl_train.py --learning_mode weighted --seed $i --experiment MI_rl --model_start_dir $start_dir
# Round
python rl_train.py --learning_mode round --seed $i --experiment MI_rl --model_start_dir $start_dir
# DORB
python rl_train.py --learning_mode bandit --seed $i --experiment MI_rl --model_start_dir $start_dir
```
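For reference, here is a minimal sketch of the k self-critical baseline these scripts rely on: draw k samples per input, score each one, and use the mean score of the k samples as the baseline, so samples scoring above their own batch mean receive a positive learning signal. Function and variable names are illustrative, and the scores are assumed to come from the reward models described above.

```python
import numpy as np

def k_scst_advantages(sample_scores):
    """k self-critical sequence training, sketched: for one input,
    `sample_scores` holds the combined reward of each of the k sampled
    outputs. The baseline is the mean of the k scores, and the
    policy-gradient weight (advantage) of sample i is score_i - baseline."""
    scores = np.asarray(sample_scores, dtype=float)
    return scores - scores.mean()

# Example: k = 4 sampled reflections for one prompt, already scored.
advantages = k_scst_advantages([0.62, 0.55, 0.71, 0.40])
# Samples above the k-sample mean get positive advantages (reinforced),
# those below get negative advantages (discouraged).
print(advantages)
```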
Finally, compute the statistics on the test data:
```bash
python compute_stats.py --dir ./outputs
```
Our project is licensed under the Apache License 2.0, ensuring open access and collaboration, with due credit given to the original work which forms the backbone of our codebase.
This code makes use of the Keep It Simple code by Laban et al. Specifically, we use their implementation of the k self-critical sequence training algorithm.
```bibtex
@inproceedings{laban2021keep_it_simple,
  title={Keep It Simple: Unsupervised Simplification of Multi-Paragraph Text},
  author={Philippe Laban and Tobias Schnabel and Paul N. Bennett and Marti A. Hearst},
  booktitle={Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics},
  volume={1},
  year={2021}
}
```
If you make use of the code, models, or algorithm, please cite our paper:
```bibtex
@inproceedings{min-etal-2024-bandit,
  title={Dynamic Reward Adjustment in Multi-Reward Reinforcement Learning for Counselor Reflection Generation},
  author={Min, Do June and P{\'e}rez-Rosas, Ver{\'o}nica and Resnicow, Kenneth and Mihalcea, Rada},
  booktitle={COLING 2024},
  year={2024}
}
```