A unified pipeline to download, preprocess, assess the quality, merge, and push multilingual parallel corpora to the Hugging Face Hub. Supports sources like Hugging Face Datasets, GitHub, and OPUS.
- Download datasets from:
- Hugging Face Hub: Supports downloading datasets in various formats from the Hugging Face Hub. Specify the dataset format and provide the required details, such as column names.
- GitHub: Supports downloading datasets hosted on GitHub or any publicly accessible URL. Provide the source URL and the target destination.
- OPUS: Supports downloading OPUS datasets by specifying the dataset name and URL. You can also retrieve information about all available OPUS datasets using the `opus_info` function (available in `utils.py`).
- Preprocessing:
- Rule-based filtering: deduplication, dropping empty segments, and removing HTML tags.
- Semantic filtering: evaluates translation pairs using cosine similarity scores derived from sentence embedding models, via the SentenceTransformers library.
- Language detection filtering: discards segments that are unlikely to be in the expected language. Two language identification models are available: AfroLID and fastText.
- Quality estimation filtering: applies reference-free evaluation of the translations using COMET models and excludes segments that score below the threshold.
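The rule-based filtering step can be sketched in plain Python. This is a minimal illustration, not the repository's actual implementation; the function name `rule_filter` and its parameters simply mirror the config options described below:

```python
import html
import re

def rule_filter(pairs, min_length=1, max_length=200, max_length_ratio=3.0):
    """Hypothetical sketch of rule-based filtering: HTML removal,
    dropping empty/out-of-range segments, and deduplication."""
    tag_re = re.compile(r"<[^>]+>")
    seen = set()
    kept = []
    for src, tgt in pairs:
        # Strip HTML tags, unescape entities, and normalise whitespace.
        src = " ".join(html.unescape(tag_re.sub(" ", src)).split())
        tgt = " ".join(html.unescape(tag_re.sub(" ", tgt)).split())
        # Drop empty segments.
        if not src or not tgt:
            continue
        # Drop segments outside the allowed length range (in words).
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if not (min_length <= n_src <= max_length and min_length <= n_tgt <= max_length):
            continue
        # Drop pairs with a suspicious source/target length ratio.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_length_ratio:
            continue
        # Deduplicate on the exact (source, target) pair.
        key = (src, tgt)
        if key in seen:
            continue
        seen.add(key)
        kept.append((src, tgt))
    return kept
```

The real pipeline applies these rules over Hugging Face `Dataset` objects rather than plain tuples, but the filtering logic is of this general shape.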
- Quality Assessment:
- Estimates the translation quality of the preprocessed dataset using a COMET model.
- Merge, Deduplicate and Push:
- Deduplicate against the test dataset to avoid train/test overlap
- Combine all processed datasets into one
- Deduplicate globally to remove segments that appear in multiple datasets
- Push to Hugging Face Hub
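The merge-and-deduplicate stage above can be sketched as follows. This is an illustrative outline under the assumption that pairs are compared exactly; the function name `merge_and_deduplicate` is hypothetical and the actual logic lives in the repository's scripts:

```python
def merge_and_deduplicate(datasets, test_pairs):
    """Hypothetical sketch: drop pairs overlapping with the test set,
    combine all processed datasets, and deduplicate globally."""
    test_keys = {(s.strip(), t.strip()) for s, t in test_pairs}
    seen = set()
    merged = []
    for dataset in datasets:
        for src, tgt in dataset:
            key = (src.strip(), tgt.strip())
            if key in test_keys:   # avoid train/test overlap
                continue
            if key in seen:        # global deduplication across datasets
                continue
            seen.add(key)
            merged.append((src, tgt))
    return merged
```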
- Clone the repository
```bash
git clone https://github.com/amaneth/mt-data-processing.git
cd mt-data-processing
```
- Install dependencies
```bash
pip install -r requirements.txt
```
- Configure settings
Modify the `config.yaml` file to define your language pair, data sources, and pipeline settings.
- Preprocess the dataset
```bash
python process.py --config config.yaml
```
- Push to Hugging Face Hub
```bash
python push_to_hub.py --dataset data
```
The config.yaml file controls the entire pipeline. Here’s an overview of its sections:
- `lang_pair`: Source and target languages (e.g., `en-am`)
- `sources`: List of dataset types to include (`hf`, `github`, `opus`)
For `hf` sources:
- `name`: Identifier for the dataset
- `path`: HF dataset ID
- `split`: Train/test/dev split
- `config_name`: Config name if needed
- `src_col` / `tgt_col`: Source and target language fields

For `github` sources:
- `name`: Identifier for the dataset
- `src_url`: URL to the source file
- `tgt_url`: URL to the target file

For `opus` sources:
- `name`: Identifier for the dataset
- `url`: Download URL
- `pipelines`: List of preprocessing steps, e.g., `rule_filter`, `semantic_filter`
- `from_cache`: If true, checks whether the dataset has already been preprocessed in the `save_dir` and skips preprocessing
- Rule filter:
`min_length`, `max_length`, `max_length_ratio`
- Semantic filter:
- `threshold`: Similarity threshold
- `chunk_size`: Batch size for filtering
- Language detect filter:
- `batch_size`: Batch size for fastText processing
- `min_score`: Threshold value for filtering
- `prefix`: Prefix for filtered dataset files
- `format`: Final dataset format (e.g., `json`, `csv`, `parquet`)
- `save_dir`: Output directory for saving results
- `log_file`: Log filename
- `log_dir`: Directory for storing logs
- `level`: Log level (`INFO`, `DEBUG`, etc.)
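Putting the options above together, a `config.yaml` might look like the following. This is a hedged sketch: the field names follow the descriptions above, but the exact nesting, section names (e.g., the `output` and `logging` groupings), and the dataset entries are assumptions, not copied from the repository.

```yaml
lang_pair: en-am
sources:
  - hf
  - opus

hf:                               # hypothetical section layout
  - name: my-hf-corpus            # placeholder identifier
    path: some-org/some-dataset   # placeholder HF dataset ID
    split: train
    src_col: en
    tgt_col: am

pipelines:
  - rule_filter
  - semantic_filter
from_cache: true

rule_filter:
  min_length: 1
  max_length: 200
  max_length_ratio: 3

semantic_filter:
  threshold: 0.7
  chunk_size: 1000

output:
  prefix: filtered
  format: parquet
  save_dir: data

logging:
  log_file: pipeline.log
  log_dir: logs
  level: INFO
```

Check the repository's shipped `config.yaml` for the authoritative structure before adapting this sketch.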
This repository is part of the AfriNLLB project. If you use any part of the project's code, data, models, or approaches, please cite the following paper:
@inproceedings{moslem-etal-2026-afrinllb,
title = "{A}fri{NLLB}: Efficient Translation Models for African Languages",
author = "Moslem, Yasmin and
Wassie, Aman Kassahun and
Gizachew, Amanuel",
booktitle = "Proceedings of the Seventh Workshop on African Natural Language Processing (AfricaNLP)",
month = jul,
year = "2026",
address = "Rabat, Morocco",
publisher = "Association for Computational Linguistics",
}