Skip to content

MTCleanse is a parallel data cleaning toolkit designed for MT cleaning and preprocessing parallel corpora. Built for researchers, language technologists, and MT practitioners.

License

Notifications You must be signed in to change notification settings

Ancastal/mtcleanse

Repository files navigation

MTCleanse: Machine Translation Corpus Cleaning

PyPI version Python versions

MTCleanse is a powerful, state-of-the-art toolkit designed for cleaning and preprocessing parallel corpora to be used for neural machine translation (NMT) systems. Built for researchers, language technologists, and MT practitioners, it addresses the critical "garbage in, garbage out" problem that plagues many translation models.

By systematically removing noise, detecting misalignments, filtering problematic sentence pairs, and handling outliers, MTCleanse significantly improves the quality of training data, leading to more accurate, robust, and reliable translation models.

Features

  • Clean parallel text datasets with configurable parameters
  • Remove noise such as URLs, emails, and control characters
  • Filter texts based on length constraints
  • Detect and remove statistical outliers
  • Domain-based filtering using sentence embeddings
  • Export cleaned data in various formats (text files, JSON)
  • Comprehensive statistics on the cleaning process

Installation

pip install mtcleanse

Or install from source:

git clone https://github.com/yourusername/mtcleanse.git
cd mtcleanse
pip install -e .

Quick Start

from mtcleanse.cleaning import ParallelTextCleaner

# Initialize with default settings
cleaner = ParallelTextCleaner()

# Clean parallel text files
cleaner.clean_files(
    source_file="source.en",
    target_file="target.fr",
    output_source="clean_source.en",
    output_target="clean_target.fr"
)

# Or clean text directly
source_texts = ["Hello world", "This is a test"]
target_texts = ["Bonjour le monde", "C'est un test"]
clean_source, clean_target = cleaner.clean_texts(source_texts, target_texts)

Command Line Interface

MTCleanse also provides a command-line interface:

mtcleanse-clean --source source.en --target target.fr --output-source clean_source.en --output-target clean_target.fr

Configuration

You can customize the cleaning process with various parameters:

cleaner = ParallelTextCleaner({
    "min_chars": 10,
    "max_chars": 500,
    "min_words": 3,
    "max_words": 50,
    "enable_domain_filtering": True,
    "domain_contamination": 0.2
})

# This method returns the cleaned data and the statistics
clean_source, clean_target, stats = cleaner.clean_texts(
    source_texts=["Hello world", "This is a test"],
    target_texts=["Bonjour le monde", "C'est un test"]
)

# This method saves the cleaned data to disk and generates an HTML report
cleaner.clean_file(
    source_file="source.en",
    target_file="target.fr",
    output_source="clean_source.en",
    output_target="clean_target.fr",
    html_report="report.html"
)

Development

Setting up the development environment

# Clone the repository
git clone https://github.com/yourusername/mtcleanse.git
cd mtcleanse

# Install in development mode with development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pip install pre-commit
pre-commit install

# Run tests
pytest tests/ --cov=mtcleanse

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

About

MTCleanse is a parallel data cleaning toolkit designed for MT cleaning and preprocessing parallel corpora. Built for researchers, language technologists, and MT practitioners.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published