A unified pipeline to download, preprocess, assess the quality, merge, and push multilingual parallel corpora to the Hugging Face Hub. Supports sources like Hugging Face Datasets, GitHub, and OPUS.
- Download datasets from:
- Hugging Face Hub: Supports downloading datasets in various formats from the Hugging Face Hub. Specify the dataset format and provide the required details, such as column names.
- GitHub: Supports downloading datasets hosted on GitHub or any publicly accessible URL. Provide the source URL and the target destination.
- OPUS: Supports downloading OPUS datasets by specifying the dataset name and URL. You can also retrieve information about all available OPUS datasets using the `opus_info` function (available in `utils.py`).
- Preprocessing:
- Rule-based filtering: deduplication, dropping empty segments, and removing HTML tags.
- Semantic filtering: evaluates translation pairs using cosine similarity scores derived from sentence embedding models, via the SentenceTransformers library.
- Language detection filtering: discards segments that are unlikely to be in the expected language. Two language identification models are available: AfroLID and fastText.
- Quality estimation filtering: applies reference-free evaluation of the translations using COMET models and excludes segments that score below the threshold.
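The rule-based filtering step can be sketched in plain Python. This is a minimal illustration, not the repository's actual implementation; the function name `rule_filter` and its parameters simply mirror the config options described below:

```python
import html
import re

def rule_filter(pairs, min_length=1, max_length=200, max_length_ratio=3.0):
    """Hypothetical sketch of rule-based filtering: HTML removal,
    dropping empty/out-of-range segments, and deduplication."""
    tag_re = re.compile(r"<[^>]+>")
    seen = set()
    kept = []
    for src, tgt in pairs:
        # Strip HTML tags, unescape entities, and normalise whitespace.
        src = " ".join(html.unescape(tag_re.sub(" ", src)).split())
        tgt = " ".join(html.unescape(tag_re.sub(" ", tgt)).split())
        # Drop empty segments.
        if not src or not tgt:
            continue
        # Drop segments outside the allowed length range (in words).
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if not (min_length <= n_src <= max_length and min_length <= n_tgt <= max_length):
            continue
        # Drop pairs with a suspicious source/target length ratio.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_length_ratio:
            continue
        # Deduplicate on the exact (source, target) pair.
        key = (src, tgt)
        if key in seen:
            continue
        seen.add(key)
        kept.append((src, tgt))
    return kept
```

The real pipeline applies these rules over Hugging Face `Dataset` objects rather than plain tuples, but the filtering logic is of this general shape.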
- Quality Assessment:
- Estimates the translation quality of the preprocessed dataset using a COMET model.
- Merge, Deduplicate and Push:
- Deduplicate against the test dataset to avoid train/test overlap
- Combine all processed datasets into one
- Deduplicate globally to remove segments that appear in multiple datasets
- Push to Hugging Face Hub
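The merge-and-deduplicate stage above can be sketched as follows. This is an illustrative outline under the assumption that pairs are compared exactly; the function name `merge_and_deduplicate` is hypothetical and the actual logic lives in the repository's scripts:

```python
def merge_and_deduplicate(datasets, test_pairs):
    """Hypothetical sketch: drop pairs overlapping with the test set,
    combine all processed datasets, and deduplicate globally."""
    test_keys = {(s.strip(), t.strip()) for s, t in test_pairs}
    seen = set()
    merged = []
    for dataset in datasets:
        for src, tgt in dataset:
            key = (src.strip(), tgt.strip())
            if key in test_keys:   # avoid train/test overlap
                continue
            if key in seen:        # global deduplication across datasets
                continue
            seen.add(key)
            merged.append((src, tgt))
    return merged
```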
- Clone the repository
```bash
git clone https://github.com/amaneth/mt-data-processing.git
cd mt-data-processing
```
- Install dependencies
```bash
pip install -r requirements.txt
```
- Configure settings
Modify the `config.yaml` file to define your language pair, data sources, and pipeline settings.
- Preprocess the dataset
```bash
python process.py --config config.yaml
```
- Push to Hugging Face Hub
```bash
python push_to_hub.py --dataset data
```
The config.yaml file controls the entire pipeline. Here’s an overview of its sections:
- `lang_pair`: Source and target languages (e.g., `en-am`)
- `sources`: List of dataset types to include (`hf`, `github`, `opus`)
For `hf` sources:
- `name`: Identifier for the dataset
- `path`: HF dataset ID
- `split`: Train/test/dev split
- `config_name`: Config name if needed
- `src_col` / `tgt_col`: Source and target language fields

For `github` sources:
- `name`: Identifier for the dataset
- `src_url`: URL to the source file
- `tgt_url`: URL to the target file

For `opus` sources:
- `name`: Identifier for the dataset
- `url`: Download URL
- `pipelines`: List of preprocessing steps, e.g., `rule_filter`, `semantic_filter`
- `from_cache`: If true, checks whether the dataset has already been preprocessed in the `save_dir` and skips preprocessing
- Rule filter:
`min_length`, `max_length`, `max_length_ratio`
- Semantic filter:
- `threshold`: Similarity threshold
- `chunk_size`: Batch size for filtering
- Language detect filter:
- `batch_size`: Batch size for fastText processing
- `min_score`: Threshold value for filtering
- `prefix`: Prefix for filtered dataset files
- `format`: Final dataset format (e.g., `json`, `csv`, `parquet`)
- `save_dir`: Output directory for saving results
- `log_file`: Log filename
- `log_dir`: Directory for storing logs
- `level`: Log level (`INFO`, `DEBUG`, etc.)
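Putting the options above together, a `config.yaml` might look like the following. This is a hedged sketch: the field names follow the descriptions above, but the exact nesting, section names (e.g., the `output` and `logging` groupings), and the dataset entries are assumptions, not copied from the repository.

```yaml
lang_pair: en-am
sources:
  - hf
  - opus

hf:                               # hypothetical section layout
  - name: my-hf-corpus            # placeholder identifier
    path: some-org/some-dataset   # placeholder HF dataset ID
    split: train
    src_col: en
    tgt_col: am

pipelines:
  - rule_filter
  - semantic_filter
from_cache: true

rule_filter:
  min_length: 1
  max_length: 200
  max_length_ratio: 3

semantic_filter:
  threshold: 0.7
  chunk_size: 1000

output:
  prefix: filtered
  format: parquet
  save_dir: data

logging:
  log_file: pipeline.log
  log_dir: logs
  level: INFO
```

Check the repository's shipped `config.yaml` for the authoritative structure before adapting this sketch.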
This repository is part of the AfriNLLB project. If you use any part of the project's code, data, models, or approaches, please cite the following paper:
@inproceedings{moslem-etal-2026-afrinllb,
title = "{A}fri{NLLB}: Efficient Translation Models for African Languages",
author = "Moslem, Yasmin and
Wassie, Aman Kassahun and
Gizachew, Amanuel",
booktitle = "Proceedings of the Seventh Workshop on African Natural Language Processing (AfricaNLP)",
month = jul,
year = "2026",
address = "Rabat, Morocco",
publisher = "Association for Computational Linguistics",
}