Gabriele Sarti • Vilém Zouhar • Grzegorz Chrupała • Ana Guerberof Arenas • Malvina Nissim • Arianna Bisazza
Abstract: Word-level quality estimation (QE) detects erroneous spans in machine translations, which can direct and facilitate human post-editing. While the accuracy of word-level QE systems has been assessed extensively, their usability and downstream influence on the speed, quality and editing choices of human post-editing remain understudied. Our QE4PE study investigates the impact of word-level QE on machine translation (MT) post-editing in a realistic setting involving 42 professional post-editors across two translation directions. We compare four error-span highlight modalities, including supervised and uncertainty-based word-level QE methods, for identifying potential errors in the outputs of a state-of-the-art neural MT model. Post-editing effort and productivity are estimated by behavioral logs, while quality improvements are assessed by word- and segment-level human annotation. We find that domain, language and editors' speed are critical factors in determining highlights' effectiveness, with modest differences between human-made and automated QE highlights underlining a gap between accuracy and usability in professional workflows.
This repository contains the data, scripts and notebooks associated with the paper "QE4PE: Word-level Quality Estimation for Human Post-Editing". If you use any of the following contents in your work, we kindly ask you to cite our paper:
```bibtex
@misc{sarti-etal-2024-qe4pe,
  title={{QE4PE}: Word-level Quality Estimation for Human Post-Editing},
  author={Gabriele Sarti and Vilém Zouhar and Grzegorz Chrupała and Ana Guerberof-Arenas and Malvina Nissim and Arianna Bisazza},
  year={2025},
  eprint={2503.03044},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.03044},
}
```
GroTE is a simple Gradio-based interface for post-editing machine translation outputs with error spans. It lets you visualize and edit translations in a web interface hosted on HF Spaces, with real-time logging of granular editing actions. Find out more about setting up and running GroTE in the GroTE repository.
Processed QE4PE logs for the `pre`, `main` and `post` tasks, MQM/ESA annotations and questionnaire responses are available as 🤗 Datasets. Summary of the data:
- Post-edits over NLLB 3.3B outputs for >400 segments from WMT23 (social media and biomedical abstracts): 15 edits per direction (3 oracle post-edits + 12 core set translators) for En->It and En->Nl.
- A single set of MQM and ESA annotations from 12 human annotators for MT outputs and all post-edited versions across both directions for a subset of ~150 segments.
- Fine-grained editing logs for core set translators across the `pre`, `main` and `post` editing phases.
- Pre- and post-task questionnaires for all post-editors.
The raw logfiles produced by our 🐮 GroTE interface are available in the `task` folder in the same repository as the datasets. Refer to the main QE4PE dataset readme and the readmes in each task folder for more details about the provided data.
This section provides a step-by-step guide to reproduce the data processing and analysis steps for the QE4PE study.
IMPORTANT: While we describe how to regenerate all outputs used in our analysis, they are all pre-computed and available in the 🤗 Datasets repository. We are adding the scripts little by little; please be patient and reach out if needed! 🤗
Install the required dependencies and the `qe4pe` package:

```shell
pip install -r requirements-dev.txt
pip install -e .
```
Download the QE4PE repository from the 🤗 Datasets repository and place it in the `data` folder (it can be pulled as a git submodule with `git submodule update --init --recursive` and `git submodule update --recursive`).
TODO: Add script for generation with NLLB 3.3B
The generated outputs are saved in `data/setup/wmt23/nllb_<SIZE>/wmttest2023.<LANG>`, with `<SIZE>` being either `3b` or `600m` and `<LANG>` being `ita` or `nld`.
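Assuming the path pattern above reads literally, the four expected output files can be enumerated programmatically. This is a minimal sketch; the helper name `nllb_output_paths` is hypothetical and not part of the `qe4pe` package:

```python
from itertools import product

# Hypothetical helper: expand the documented output-path pattern
# data/setup/wmt23/nllb_<SIZE>/wmttest2023.<LANG> over all documented values.
def nllb_output_paths(sizes=("3b", "600m"), langs=("ita", "nld")):
    return [
        f"data/setup/wmt23/nllb_{size}/wmttest2023.{lang}"
        for size, lang in product(sizes, langs)
    ]

for path in nllb_output_paths():
    print(path)
```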
TODO: Add script for XCOMET annotations
The generated outputs are saved in `data/setup/wmt23/nllb_<NLLB_SIZE>/wmttest2023_xcomet-<XCOMET_SIZE>_<LANG>.json`, with `<NLLB_SIZE>` being either `3b` or `600m`, `<LANG>` being `ita` or `nld`, and `<XCOMET_SIZE>` being `xl` or `xxl`.
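Likewise, the eight XCOMET annotation files implied by this pattern can be listed with a short sketch (the helper name `xcomet_annotation_paths` is again hypothetical):

```python
from itertools import product

# Hypothetical helper: expand the documented annotation-path pattern
# data/setup/wmt23/nllb_<NLLB_SIZE>/wmttest2023_xcomet-<XCOMET_SIZE>_<LANG>.json
def xcomet_annotation_paths(
    nllb_sizes=("3b", "600m"),
    xcomet_sizes=("xl", "xxl"),
    langs=("ita", "nld"),
):
    return [
        f"data/setup/wmt23/nllb_{n}/wmttest2023_xcomet-{x}_{lang}.json"
        for n, x, lang in product(nllb_sizes, xcomet_sizes, langs)
    ]

print(len(xcomet_annotation_paths()))  # 8 combinations
```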
Run `qe4pe filter-wmt-data` to recover the selected segments for the `pre`, `main` and `post` editing phases from the full set of WMT23 segments and their translations available in `data/setup/wmt23`. Intermediate outputs are saved in `data/setup/processed`.
TODO: Add scripts for generating highlights with XCOMET and the unsupervised methods.
Highlighted segments are saved in the `data/setup/highlights` folder.
Raw QA annotations are provided in `data/setup/qa/eng-ita` and `data/setup/qa/eng-nld`.
TODO: Add script for converting HTML annotations to a QA dataframe.
The final dataframe is saved in `data/setup/qa/qa_df.csv`.
Run `qe4pe process-task-data <TASK_PATH>` to preprocess the outputs and logs for a specific task in `data/setup/task`, e.g. `qe4pe process-task-data data/task/main`. The processing is controlled by the task's `processing_config.json` file, which specifies paths and additional info (e.g. for `main`, QA annotations are merged with other fields).
The processed data is saved in `data/processed/task` as `processed_<TASK>.csv`.
TODO: Add notebook with plots from the selection process.
Follow the analysis notebook to reproduce the main plots and results from the paper. While some plots were retouched in Inkscape for the final version (marked as `_edited` in `figures/`), we provide the code to generate them from the processed data.
Modeling results can be reproduced from the modeling notebook.
TODO: Add additional analysis scripts for appendix plots.
If you encounter any issues while running the scripts or notebooks, please open an issue in this repository. We will be happy to help you out!