Dataset importers

Dataset importers can be used in the datasets section of the training config. Each entry combines an importer prefix and a dataset name, separated by an underscore.

Example:

  train:
    - opus_ada83/v1
    - mtdata_newstest2014_ruen
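As a minimal sketch, an entry such as `opus_ada83/v1` splits on the first underscore into an importer prefix and a dataset name (the function name here is hypothetical; the real pipeline's parsing may differ):

```python
def parse_dataset_entry(entry: str) -> tuple[str, str]:
    """Split a config dataset entry into (importer prefix, dataset name).

    Assumes the '<prefix>_<name>' convention shown above; this is an
    illustration, not the pipeline's actual parser.
    """
    prefix, _, name = entry.partition("_")
    if not name:
        raise ValueError(f"Malformed dataset entry: {entry!r}")
    return prefix, name

print(parse_dataset_entry("opus_ada83/v1"))            # ('opus', 'ada83/v1')
print(parse_dataset_entry("mtdata_newstest2014_ruen")) # ('mtdata', 'newstest2014_ruen')
```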
| Data source | Prefix | Name examples | Type | Comments |
|---|---|---|---|---|
| MTData | mtdata | newstest2017_ruen | parallel | Supports many datasets. Run `mtdata list -l ru-en` to see datasets for a specific language pair. |
| OPUS | opus | ParaCrawl/v7.1 | parallel | Many open-source datasets. Go to the website, choose a language pair, and check the links in the Moses column to see which name and version a link uses. |
| SacreBLEU | sacrebleu | wmt20 | parallel | Official evaluation datasets available in the SacreBLEU tool. Recommended for the datasets:test config section. Look up supported datasets and language pairs in the `sacrebleu.dataset` Python module. |
| Flores | flores | dev, devtest | parallel | Evaluation dataset from Facebook that supports 100 languages. |
| NTREX-128 | ntrex | devtest | parallel | Evaluation dataset from Microsoft that supports 128 languages. |
| Custom parallel | url | https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.[LANG].zst | parallel | A custom zst-compressed parallel dataset, for instance uploaded to GCS. The language pair should be split into two files; `[LANG]` is replaced with the source and target language codes. |
| News crawl | news-crawl | news.2019 | mono | Monolingual news datasets from WMT. |
| OPUS | opus | tldr-pages/v2023-08-29 | mono | Monolingual dataset from OPUS. |
| HPLT | hplt | mono/v3.0 | mono | HPLT monolingual corpus (mostly from Internet Archive, but also from Common Crawl). |
| Custom mono | url | https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.ru.zst | mono | A custom zst-compressed monolingual dataset, for instance uploaded to GCS. |
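For the custom URL importers, the `[LANG]` placeholder in the URL is substituted with each language code. A minimal sketch of that substitution, using the template URL from the table (the function name is hypothetical):

```python
def expand_custom_url(url_template: str, langs: list[str]) -> list[str]:
    """Replace the [LANG] placeholder with each language code in turn."""
    return [url_template.replace("[LANG]", lang) for lang in langs]

template = (
    "https://storage.googleapis.com/releng-translations-dev/"
    "data/en-ru/pytest-dataset.[LANG].zst"
)
# A parallel dataset expands to one file per language of the pair.
for url in expand_custom_url(template, ["en", "ru"]):
    print(url)
```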

You can also use the find-corpus tool to find all datasets for an importer and get them formatted for use in the config.

Set up a local poetry environment, then run:

task find-corpus -- en ru

The config generator uses find-corpus to generate a training config automatically, including all available datasets:

task config-generator -- ru en --name test

Make sure to check licenses of the datasets before using them.

Adding a new importer

Add Python code here for parallel data importers, or here for monolingual ones.
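The exact interface is defined by the existing importers in the codebase. As a purely hypothetical illustration (the registry, decorator, and function signature below are assumptions, not the pipeline's real API), a new importer is essentially a function registered under its config prefix that resolves a dataset name to downloadable files:

```python
from typing import Callable

# Hypothetical registry mapping importer prefixes (as used in the
# config, e.g. 'opus' or 'mtdata') to download functions.
Importer = Callable[[str, str, str], list[str]]
IMPORTERS: dict[str, Importer] = {}

def register(prefix: str):
    """Register an importer under its config prefix."""
    def wrap(fn: Importer) -> Importer:
        IMPORTERS[prefix] = fn
        return fn
    return wrap

@register("custom-mono")
def custom_mono(dataset: str, src: str, trg: str) -> list[str]:
    # A monolingual importer yields a single file per language; here
    # 'dataset' is a URL template containing the [LANG] placeholder.
    return [dataset.replace("[LANG]", src)]
```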