A production-level solution for training and evaluating delexicalized Portuguese Variety Identification (VID) models following the AAAI 2025 paper: Enhancing Portuguese Variety Identification with Cross-Domain Approaches. In addition to the core training and evaluation pipelines, this repository also includes a simple Streamlit demo and a FastAPI endpoint.
Important Notice:
- Incomplete Repository: This repository is incomplete. It represents an industry-level refactor of a scientific research project that was submitted to AAAI.
- Branch Archive: The branch "AAAI" is an archival version and should not be extended.
- Production-Ready Code: Implements state-of-the-art delexicalized VID solutions.
- Research Integration: Based on methods described in the AAAI 2025 paper.
- Modular Design: Training and evaluation routines are packaged as a Python module.
- Interactive Demo & API: Run a Streamlit demo and a FastAPI endpoint for quick model interaction.
- Docker Support: Easily spin up the demo and API using Docker Compose.
- PyPI Package: The pt_vid package is available on PyPI, ensuring smooth installation and integration via GitHub Actions pipelines.
- HuggingFace Compatibility: Our best model is fully compatible with HuggingFace and runs off-the-shelf.
You can quickly run our best model using HuggingFace's pipeline API:
from transformers import pipeline
pipe = pipeline("text-classification", model="liaad/PtVId")
result = pipe("Olá tudo bem? Este trabalho é só um ponto de partida")
print(result)
This snippet loads the model from the HuggingFace Hub (downloading it on first use) and classifies the input text right away.
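When classifying several texts at once, it can help to pair each input with its prediction. The sketch below relies only on the standard output shape of a text-classification pipeline (one dict with "label" and "score" per input). The helper name classify_texts is hypothetical and not part of this repository, and the label names returned depend on the model's configuration.

```python
from typing import Callable, Dict, List, Sequence


def classify_texts(texts: Sequence[str], pipe: Callable) -> List[Dict[str, object]]:
    """Pair each input text with the label and score returned by `pipe`.

    `pipe` is any callable that behaves like a HuggingFace
    text-classification pipeline: given a list of strings, it returns a
    list with one {"label": str, "score": float} dict per input.
    """
    predictions = pipe(list(texts))
    return [
        {"text": text, "label": pred["label"], "score": float(pred["score"])}
        for text, pred in zip(texts, predictions)
    ]


# Usage with the real model (weights are downloaded on first call):
#   from transformers import pipeline
#   pipe = pipeline("text-classification", model="liaad/PtVId")
#   for row in classify_texts(["Olá tudo bem?", "Este trabalho é só um ponto de partida"], pipe):
#       print(row)
```

Passing the pipeline in as an argument keeps the helper easy to test with a stub, without loading the model.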
We recommend using Conda for an isolated Python environment. The recommended Python version is 3.10:
conda create --name .conda python=3.10
conda activate .conda
You can install the package directly from the source in editable mode:
pip install -e .
Alternatively, install the production-ready package from PyPI:
pip install -U pt-vid
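To confirm which version ended up in the active environment, a quick check with the standard library can be used. This is a minimal sketch: installed_version is a hypothetical helper, and "pt-vid" is the distribution name taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("pt-vid"))
```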
Examples of training and evaluation routines are provided in the /exec directory:
- Training: An example training script is available in /exec/Train.py. This script demonstrates how to execute the training pipeline.
- Evaluation & Result Plotting: An example evaluation and result plotting script is available in /exec/Test.py. This script shows how to evaluate the trained models and visualize the results.
A simple Streamlit demo and a FastAPI endpoint are included for quick testing and integration.
The recommended way to run the demo and API endpoints is via Docker Compose. Ensure Docker is installed, then run:
docker-compose -f dev.docker-compose.yml up
This command will launch both the Streamlit app and the FastAPI service in development mode.
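Once the services are up, the routes exposed by the API can be discovered from the OpenAPI schema that FastAPI serves at /openapi.json by default. The sketch below is illustrative only: the helper names are hypothetical, and the base URL (host and port) is an assumption that should be checked against dev.docker-compose.yml.

```python
import json
from typing import Dict, List
from urllib.request import urlopen

# HTTP methods that may appear as keys of an OpenAPI path item.
HTTP_METHODS = {"get", "post", "put", "patch", "delete", "options", "head", "trace"}


def fetch_openapi(base_url: str, timeout: float = 5.0) -> Dict:
    """Download the OpenAPI schema that FastAPI serves by default."""
    with urlopen(f"{base_url.rstrip('/')}/openapi.json", timeout=timeout) as resp:
        return json.load(resp)


def list_routes(schema: Dict) -> List[str]:
    """Return 'METHOD /path' strings for every operation in the schema."""
    routes = []
    for path, operations in sorted(schema.get("paths", {}).items()):
        for method in sorted(operations):
            # Skip non-operation keys such as "parameters" or "summary".
            if method in HTTP_METHODS:
                routes.append(f"{method.upper()} {path}")
    return routes


# Usage (the port below is an assumption, not taken from the repository):
#   for route in list_routes(fetch_openapi("http://localhost:8000")):
#       print(route)
```

The same information is available interactively in the Swagger UI that FastAPI serves at /docs, unless the application disables it.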
- /exec: Contains scripts to execute training and evaluation routines.
- /your_package: The main Python package implementing the VID solutions.
- dev.docker-compose.yml: Docker Compose file for running the demo and API endpoints.
- Other directories/files: Additional resources, utilities, and configurations.
The goal of introducing industry/production-level code in this experiment was to establish the major guidelines for extending our work. Our aim is to deliver this framework to research teams around the globe, who will adapt the code to their respective languages and needs.
Beyond open-sourcing the code, the authors intend to keep developing this package as a low-priority effort, improving and extending it over time.
- Migration to Apache Airflow: Move the scripts under the /exec directory to the Apache Airflow ecosystem, providing a more human-friendly way to track the processes abstracted in these scripts.
- Completion of Missing Parts: Implement the features covered in the AAAI submission but not yet integrated into this repository, including:
  - Extending the dataset generators.
  - Implementing the transformer-based training pipeline.
It is expected that any future contributions adhere to the foundations established post-AAAI, which are maintained in the main branch. These contributions will help complete the missing components in this repository and further advance the project.
If you use this repository in your research, please consider citing our work:
@misc{sousa2025enhancingportuguesevarietyidentification,
title={Enhancing Portuguese Variety Identification with Cross-Domain Approaches},
author={Hugo Sousa and Rúben Almeida and Purificação Silvano and Inês Cantante and Ricardo Campos and Alípio Jorge},
year={2025},
eprint={2502.14394},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.14394},
}
This project is licensed under the MIT License. Check the LICENSE file for further licensing information.
For any questions or issues, please open an issue on GitHub or contact:
- Rúben Almeida – [email protected]
- Hugo Sousa – [email protected]