Portuguese Variety Identifier

A production-level solution for training and evaluating delexicalized Portuguese Variety Identification (VID) models following the AAAI 2025 paper: Enhancing Portuguese Variety Identification with Cross-Domain Approaches. In addition to the core training and evaluation pipelines, this repository also includes a simple Streamlit demo and a FastAPI endpoint.

Warning: Current Status

Important Notice:

Incomplete Repository: This repository is incomplete. It represents an industry-level refactor of a scientific research project that was submitted to AAAI.
Branch Archive: The branch "AAAI" is an archival version and should not be extended.

Features

Production-Ready Code: Implements state-of-the-art delexicalized VID solutions.
Research Integration: Based on methods described in the AAAI 2025 paper.
Modular Design: Training and evaluation routines are packaged as a Python module.
Interactive Demo & API: Run a Streamlit demo and a FastAPI endpoint for quick model interaction.
Docker Support: Easily spin up the demo and API using Docker Compose.
PyPI Package: The pt_vid package is available on PyPI, ensuring smooth installation and integration via GitHub Actions pipelines.
HuggingFace Compatibility: Our best model is fully compatible with HuggingFace and runs off-the-shelf.

Quickstart: Using the HuggingFace Pipeline

You can quickly run our best model using HuggingFace's pipeline API:

from transformers import pipeline

pipe = pipeline("text-classification", model="liaad/PtVId")

result = pipe("Olá tudo bem? Este trabalho é só um ponto de partida")

print(result)

This command instantiates the model and performs text classification immediately.

Getting Started

1. Set Up a Virtual Environment

We recommend using Conda for an isolated Python environment. The recommended Python version is 3.10:

conda create --name .conda python=3.10
conda activate .conda

2. Install Dependencies

You can install the package directly from the source in editable mode:

pip install -e .

Alternatively, install the production-ready package from PyPI:

pip install -U pt-vid

Training and Evaluation

Examples of training and evaluation routines are provided in the /exec directory:

Training:
An example training script is available in /exec/Train.py. This script demonstrates how to execute the training pipeline.
Evaluation & Result Plotting:
An example evaluation and result plotting script is available in /exec/Test.py. This script shows how to evaluate the trained models and visualize the results.

Running the Demo and API

A simple Streamlit demo and a FastAPI endpoint are included for quick testing and integration.

Using Docker Compose

The recommended way to run the demo and API endpoints is via Docker Compose. Ensure Docker is installed, then run:

docker-compose -f dev.docker-compose.yml up

This command will launch both the Streamlit app and the FastAPI service in development mode.

Repository Structure

/exec: Contains scripts to execute training and evaluation routines.
/your_package: The main Python package implementing the VID solutions.
dev.docker-compose.yml: Docker Compose file for running the demo and API endpoints.
Other directories/files: Additional resources, utilities, and configurations.

Contribution and Future Work

The goal of introducing industry/production-level code in this experiment was to establish the major guidelines for extending our work. Our aim is to deliver this framework to research teams around the globe, who will adapt the code to their respective languages and needs.

Beyond making the code open-source, the authors also intend to continue developing this package with low priority, improving and extending it over time.

Future Increments (Without Time Estimate):

Migration to Apache Airflow: Move the scripts under the /exec directory to the Apache Airflow ecosystem to provide a more human-friendly track of the processes abstracted in these scripts.
Completion of Missing Parts: Implement the features covered in the AAAI submission but not yet integrated into this repository, including:
- Extending the dataset generators.
- Implementing the transformer-based training pipeline.

It is expected that any future contributions adhere to the foundations established post-AAAI, which are maintained in the main branch. These contributions will help complete the missing components in this repository and further advance the project.

Citation

If you use this repository in your research, please consider citing our work:

@misc{sousa2025enhancingportuguesevarietyidentification,
      title={Enhancing Portuguese Variety Identification with Cross-Domain Approaches}, 
      author={Hugo Sousa and Rúben Almeida and Purificação Silvano and Inês Cantante and Ricardo Campos and Alípio Jorge},
      year={2025},
      eprint={2502.14394},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.14394}, 
}

License

This project is licensed under the MIT License. Check the LICENSE file for further licensing information.

Contact

For any questions or issues, please open an issue on GitHub or contact:

Rúben Almeida – [email protected]
Hugo Sousa – [email protected]

Name		Name	Last commit message	Last commit date
Latest commit History 95 Commits
.github/workflows		.github/workflows
demo		demo
endpoint		endpoint
exec		exec
out		out
src/pt_vid		src/pt_vid
test		test
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Portuguese Variety Identifier

Warning: Current Status

Features

Quickstart: Using the HuggingFace Pipeline

Getting Started

1. Set Up a Virtual Environment

2. Install Dependencies

Training and Evaluation

Running the Demo and API

Using Docker Compose

Repository Structure

Contribution and Future Work

Future Increments (Without Time Estimate):

Citation

License

Contact

About

Releases

Packages

Contributors 2

Languages

License

LIAAD/portuguese_vid

Folders and files

Latest commit

History

Repository files navigation

Portuguese Variety Identifier

Warning: Current Status

Features

Quickstart: Using the HuggingFace Pipeline

Getting Started

1. Set Up a Virtual Environment

2. Install Dependencies

Training and Evaluation

Running the Demo and API

Using Docker Compose

Repository Structure

Contribution and Future Work

Future Increments (Without Time Estimate):

Citation

License

Contact

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages