Dataset importers

Dataset importers can be used in the datasets section of the training config. Each entry combines an importer prefix and a dataset name, separated by an underscore.

Example:

  train:
    - opus_ada83/v1
    - mtdata_newstest2014_ruen
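As a minimal sketch, an entry such as `opus_ada83/v1` splits on the first underscore into an importer prefix and a dataset name (the function name here is hypothetical; the real pipeline's parsing may differ):

```python
def parse_dataset_entry(entry: str) -> tuple[str, str]:
    """Split a config dataset entry into (importer prefix, dataset name).

    Assumes the '<prefix>_<name>' convention shown above; this is an
    illustration, not the pipeline's actual parser.
    """
    prefix, _, name = entry.partition("_")
    if not name:
        raise ValueError(f"Malformed dataset entry: {entry!r}")
    return prefix, name

print(parse_dataset_entry("opus_ada83/v1"))            # ('opus', 'ada83/v1')
print(parse_dataset_entry("mtdata_newstest2014_ruen")) # ('mtdata', 'newstest2014_ruen')
```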
| Data source | Prefix | Name examples | Type | Comments |
|---|---|---|---|---|
| MTData | mtdata | newstest2017_ruen | parallel | Supports many datasets. Run `mtdata list -l ru-en` to see datasets for a specific language pair. |
| OPUS | opus | ParaCrawl/v7.1 | parallel | Many open-source datasets. Go to the website, choose a language pair, and check the links in the Moses column to see which name and version a link uses. |
| SacreBLEU | sacrebleu | wmt20 | parallel | Official evaluation datasets available in the SacreBLEU tool. Recommended for the datasets:test config section. Look up supported datasets and language pairs in the `sacrebleu.dataset` Python module. |
| Flores | flores | dev, devtest | parallel | Evaluation dataset from Facebook that supports 100 languages. |
| NTREX-128 | ntrex | devtest | parallel | Evaluation dataset from Microsoft that supports 128 languages. |
| Custom parallel | url | https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.[LANG].zst | parallel | A custom zst-compressed parallel dataset, for instance uploaded to GCS. The language pair should be split into two files; `[LANG]` is replaced with the source and target language codes. |
| News crawl | news-crawl | news.2019 | mono | Monolingual news datasets from WMT. |
| OPUS | opus | tldr-pages/v2023-08-29 | mono | Monolingual dataset from OPUS. |
| HPLT | hplt | mono/v3.0 | mono | HPLT monolingual corpus (mostly from Internet Archive, but also from Common Crawl). |
| Custom mono | url | https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.ru.zst | mono | A custom zst-compressed monolingual dataset, for instance uploaded to GCS. |
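For the custom URL importers, the `[LANG]` placeholder in the URL is substituted with each language code. A minimal sketch of that substitution, using the template URL from the table (the function name is hypothetical):

```python
def expand_custom_url(url_template: str, langs: list[str]) -> list[str]:
    """Replace the [LANG] placeholder with each language code in turn."""
    return [url_template.replace("[LANG]", lang) for lang in langs]

template = (
    "https://storage.googleapis.com/releng-translations-dev/"
    "data/en-ru/pytest-dataset.[LANG].zst"
)
# A parallel dataset expands to one file per language of the pair.
for url in expand_custom_url(template, ["en", "ru"]):
    print(url)
```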

You can also use the find-corpus tool to find all datasets for an importer and get them formatted for use in the config.

Set up a local poetry environment, then run:

task find-corpus -- en ru

The config generator uses find-corpus to generate a training config automatically, including all available datasets:

task config-generator -- ru en --name test

Make sure to check licenses of the datasets before using them.

Adding a new importer

Add Python code here for parallel data importers, or here for monolingual ones.
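The exact interface is defined by the existing importers in the codebase. As a purely hypothetical illustration (the registry, decorator, and function signature below are assumptions, not the pipeline's real API), a new importer is essentially a function registered under its config prefix that resolves a dataset name to downloadable files:

```python
from typing import Callable

# Hypothetical registry mapping importer prefixes (as used in the
# config, e.g. 'opus' or 'mtdata') to download functions.
Importer = Callable[[str, str, str], list[str]]
IMPORTERS: dict[str, Importer] = {}

def register(prefix: str):
    """Register an importer under its config prefix."""
    def wrap(fn: Importer) -> Importer:
        IMPORTERS[prefix] = fn
        return fn
    return wrap

@register("custom-mono")
def custom_mono(dataset: str, src: str, trg: str) -> list[str]:
    # A monolingual importer yields a single file per language; here
    # 'dataset' is a URL template containing the [LANG] placeholder.
    return [dataset.replace("[LANG]", src)]
```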