MTCleanse: Machine Translation Corpus Cleaning

MTCleanse is a toolkit for cleaning and preprocessing parallel corpora used to train neural machine translation (NMT) systems. Built for researchers, language technologists, and MT practitioners, it addresses the "garbage in, garbage out" problem: a translation model trained on noisy sentence pairs tends to reproduce that noise.

By removing noise, detecting misalignments, filtering problematic sentence pairs, and handling outliers, MTCleanse improves the quality of training data, which in turn supports more accurate and robust translation models.

Features

Clean parallel text datasets with configurable parameters
Remove noise such as URLs, emails, and control characters
Filter texts based on length constraints
Detect and remove statistical outliers
Domain-based filtering using sentence embeddings
Export cleaned data in various formats (text files, JSON)
Comprehensive statistics on the cleaning process
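
To make the noise-removal and outlier features above more concrete, here is a minimal standard-library sketch of what such steps can look like. It is not MTCleanse's implementation: the regular expressions, the function names `strip_noise` and `length_ratio_outliers`, and the z-score threshold are all assumptions chosen for illustration.

```python
import re
import statistics

# Illustrative patterns only; MTCleanse's actual rules may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
# C0 control characters except tab, newline, and carriage return, plus DEL.
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
WHITESPACE_RE = re.compile(r"\s+")


def strip_noise(text: str) -> str:
    """Remove URLs, emails, and control characters, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = CONTROL_RE.sub("", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def length_ratio_outliers(sources, targets, z_threshold=3.0):
    """Return indices of pairs whose target/source length ratio is a z-score outlier."""
    ratios = [len(t) / max(len(s), 1) for s, t in zip(sources, targets)]
    if len(ratios) < 2:
        return []
    mean = statistics.fmean(ratios)
    stdev = statistics.pstdev(ratios)
    if stdev == 0:
        return []  # All ratios identical, so nothing stands out.
    return [i for i, r in enumerate(ratios) if abs(r - mean) / stdev > z_threshold]
```

A pair flagged by `length_ratio_outliers` should be dropped from both sides at once so that the source and target lists stay aligned.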

Installation

pip install mtcleanse

Or install from source:

git clone https://github.com/Ancastal/mtcleanse.git
cd mtcleanse
pip install -e .

Quick Start

from mtcleanse.cleaning import ParallelTextCleaner

# Initialize with default settings
cleaner = ParallelTextCleaner()

# Clean parallel text files
cleaner.clean_files(
    source_file="source.en",
    target_file="target.fr",
    output_source="clean_source.en",
    output_target="clean_target.fr"
)

# Or clean text directly
source_texts = ["Hello world", "This is a test"]
target_texts = ["Bonjour le monde", "C'est un test"]
# clean_texts also returns statistics about the cleaning run
clean_source, clean_target, stats = cleaner.clean_texts(source_texts, target_texts)

Command Line Interface

MTCleanse also provides a command-line interface:

mtcleanse-clean --source source.en --target target.fr --output-source clean_source.en --output-target clean_target.fr
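
Both the file-based API and the command above take a source file and a target file that are expected to be line-aligned. As a sanity check before cleaning, a small helper like the following can confirm that the two files have the same number of lines. This is a sketch that assumes one sentence per line and UTF-8 encoding; `read_parallel` is a hypothetical name and is not part of MTCleanse.

```python
from itertools import zip_longest


def read_parallel(source_path, target_path, encoding="utf-8"):
    """Read two line-aligned files and return them as equal-length lists."""
    with open(source_path, encoding=encoding) as src, \
            open(target_path, encoding=encoding) as tgt:
        # zip_longest pads the shorter file with None, exposing a mismatch.
        pairs = list(zip_longest(src, tgt))

    sources, targets = [], []
    for line_no, (s, t) in enumerate(pairs, start=1):
        if s is None or t is None:
            raise ValueError(f"line count mismatch at line {line_no}")
        sources.append(s.rstrip("\n"))
        targets.append(t.rstrip("\n"))
    return sources, targets
```

If the helper raises `ValueError`, the corpus is misaligned at the file level and should be repaired before any sentence-level filtering is attempted.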

Configuration

You can customize the cleaning process with various parameters:

cleaner = ParallelTextCleaner({
    "min_chars": 10,
    "max_chars": 500,
    "min_words": 3,
    "max_words": 50,
    "enable_domain_filtering": True,
    "domain_contamination": 0.2
})

# This method returns the cleaned data and the statistics
clean_source, clean_target, stats = cleaner.clean_texts(
    source_texts=["Hello world", "This is a test"],
    target_texts=["Bonjour le monde", "C'est un test"]
)

# This method saves the cleaned data to disk and generates an HTML report
cleaner.clean_file(
    source_file="source.en",
    target_file="target.fr",
    output_source="clean_source.en",
    output_target="clean_target.fr",
    html_report="report.html"
)
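
The length parameters above can be read as per-side bounds on characters and words. The sketch below shows one plausible interpretation, in which a pair is kept only when both the source and the target fall within all four limits. It is an assumption about the semantics, not MTCleanse's actual code, and `within_limits` and `filter_by_length` are hypothetical names.

```python
def within_limits(text, config):
    """Check one side of a pair against the character and word limits."""
    n_chars = len(text)
    n_words = len(text.split())  # Whitespace tokenization, for illustration.
    return (
        config["min_chars"] <= n_chars <= config["max_chars"]
        and config["min_words"] <= n_words <= config["max_words"]
    )


def filter_by_length(sources, targets, config):
    """Keep only pairs where both sides satisfy the limits, preserving alignment."""
    kept = [
        (s, t)
        for s, t in zip(sources, targets)
        if within_limits(s, config) and within_limits(t, config)
    ]
    kept_sources = [s for s, _ in kept]
    kept_targets = [t for _, t in kept]
    return kept_sources, kept_targets


config = {"min_chars": 10, "max_chars": 500, "min_words": 3, "max_words": 50}
sources = ["Hello world", "This is a test sentence"]
targets = ["Bonjour le monde", "Ceci est une phrase de test"]
kept_sources, kept_targets = filter_by_length(sources, targets, config)
```

With these values, "Hello world" has only two words and falls below `min_words`, so that pair would be dropped under this reading. Whitespace-based word counts are a poor fit for languages written without spaces, so limits of this kind need adjusting for such language pairs.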

Development

Setting up the development environment

# Clone the repository
git clone https://github.com/Ancastal/mtcleanse.git
cd mtcleanse

# Install in development mode with development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pip install pre-commit
pre-commit install

# Run tests
pytest tests/ --cov=mtcleanse

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
