Official implementation of the NAACL 2025 Main Conference paper "Modern LLMs are Few-Shot Parallel Detoxification Data Annotators".
- Put the source non-parallel data into the `data/` directory.
- Run `python src/get_data.py`.
- Generate few-shot demonstrations from the MultiParaDetox dataset using `python src/get_fewshot.py`.
- Generate a synthetic dataset using `generate_detox_data.sh`.
- Clean the synthetic dataset using `clean_dataset.sh`.
- Train the models using `train_mt0.sh`.
- Run inference with the trained models using `inference_of_trained_models.sh`.
- Run the final evaluation using `run_final_eval.sh`.
- Do SBS using `python src/sbs.py`.
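
The steps above run in a fixed order, so they can be chained in a small driver script. The following is only a sketch and not part of the repository: the `run_pipeline` helper and the `PIPELINE` list are hypothetical names, and it assumes that the `.sh` scripts sit in the repository root, that they can be invoked with `bash`, and that none of the steps needs extra arguments.

```python
import subprocess
from typing import List

# Pipeline steps in the order the list above gives them.
# Assumption: the .sh scripts live in the repository root and run under bash.
PIPELINE: List[List[str]] = [
    ["python", "src/get_data.py"],
    ["python", "src/get_fewshot.py"],
    ["bash", "generate_detox_data.sh"],
    ["bash", "clean_dataset.sh"],
    ["bash", "train_mt0.sh"],
    ["bash", "inference_of_trained_models.sh"],
    ["bash", "run_final_eval.sh"],
    ["python", "src/sbs.py"],
]


def run_pipeline(dry_run: bool = True) -> List[str]:
    """Run each step in order, stopping at the first failure.

    With dry_run=True nothing is executed; the command lines are only
    collected and returned so the order can be inspected first.
    """
    executed: List[str] = []
    for cmd in PIPELINE:
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later steps never run on top of a failed earlier step.
            subprocess.run(cmd, check=True)
        executed.append(" ".join(cmd))
    return executed


if __name__ == "__main__":
    # Print the planned commands; switch to dry_run=False to execute them.
    for line in run_pipeline(dry_run=True):
        print(line)
```

Run from the repository root, the script prints the commands without executing them. Calling `run_pipeline(dry_run=False)` executes them one by one and aborts as soon as a step fails, which matters here because every step consumes the output of the previous one.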