Dataset importers

Dataset importers can be used in the datasets sections of the training config. Each entry is written as an importer prefix, an underscore, and the dataset name.

Example:

  train:
    - opus_ada83/v1
    - mtdata_newstest2014_ruen
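
As a rough illustration of how such an entry can be read, the sketch below splits a dataset key into its importer prefix and dataset name. The function name split_dataset_key is hypothetical, and splitting on the first underscore is an assumption inferred from the examples and prefixes on this page, not a description of the pipeline's actual parsing code.

```shell
# Hypothetical helper: split "<prefix>_<name>" into its two parts.
# ASSUMPTION: the importer prefix is everything before the first underscore.
split_dataset_key() {
  key="$1"
  prefix="${key%%_*}"   # drop the longest suffix starting at the first "_"
  name="${key#*_}"      # drop the shortest prefix ending at the first "_"
  printf '%s %s\n' "$prefix" "$name"
}

split_dataset_key opus_ada83/v1
split_dataset_key mtdata_newstest2014_ruen
```

Under this assumption, prefixes that contain a hyphen (such as news-crawl) are not affected, because only the first underscore is treated as the separator.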

| Data source | Prefix | Name examples | Type | Comments |
| --- | --- | --- | --- | --- |
| MTData | mtdata | newstest2017_ruen | corpus | Supports many datasets. Run mtdata list -l ru-en to see datasets for a specific language pair. |
| OPUS | opus | ParaCrawl/v7.1 | corpus | Many open source datasets. Go to the website, choose a language pair, and check the links under the Moses column to see which name and version are used in a link. |
| SacreBLEU | sacrebleu | wmt20 | corpus | Official evaluation datasets available in the SacreBLEU tool. Recommended for the datasets:test config section. Look up supported datasets and language pairs in the sacrebleu.dataset Python module. |
| Flores | flores | dev, devtest | corpus | Evaluation dataset from Facebook that supports 100 languages. |
| Custom parallel | url | https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.[LANG].zst | corpus | A custom zst-compressed parallel dataset, for instance one uploaded to GCS. The language pair should be split into two files. [LANG] will be replaced with the source and target language codes. |
| Paracrawl | paracrawl-mono | paracrawl8 | mono | Datasets crawled from the web. This importer handles only monolingual datasets; the parallel corpus is available through the opus importer. |
| News crawl | news-crawl | news.2019 | mono | Some monolingual news datasets from WMT21. |
| Common crawl | commoncrawl | wmt16 | mono | Huge web crawl datasets. The links are posted on WMT21. |
| Custom mono | url | https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.ru.zst | mono | A custom zst-compressed monolingual dataset, for instance one uploaded to GCS. |
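
To make the [LANG] placeholder of the custom parallel importer concrete, the sketch below expands one URL template into the two per-language URLs. The helper name expand_lang is hypothetical and the sed substitution is only one possible way to do this; the actual importer may implement it differently.

```shell
# Hypothetical helper: replace the literal [LANG] placeholder with a language code.
expand_lang() {
  # $1: URL template containing [LANG], $2: language code (for example en)
  printf '%s\n' "$1" | sed "s/\[LANG\]/$2/g"
}

template='https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.[LANG].zst'
expand_lang "$template" en   # prints the URL of the English file
expand_lang "$template" ru   # prints the URL of the Russian file
```

Each language of the pair therefore needs its own file at the expanded URL.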

You can also use the find-corpus tool to find all datasets for an importer and get them formatted for use in the config.

First set up a local Poetry environment, then run:

task find-corpus -- en ru

Make sure to check the licenses of the datasets before using them.

Adding a new importer

Add a shell script named <prefix>.sh to the corpus or mono folder. It must accept the same parameters as the other scripts in that folder.
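
A minimal skeleton for such a script might look like the sketch below. Everything specific in it is an assumption for illustration: the file name example.sh, the four positional parameters (source language, target language, output prefix, dataset name), and the .zst output naming. Check an existing script in the same folder for the real parameter list before copying this.

```shell
#!/bin/bash
# Hypothetical importer skeleton, e.g. corpus/example.sh.
# ASSUMPTION: parameters are <src> <trg> <output_prefix> <dataset>;
# the real importers may take a different list.
set -eu

import_example() {
  src="$1"
  trg="$2"
  output_prefix="$3"
  dataset="$4"

  for lang in "$src" "$trg"; do
    # Placeholder step: replace this echo with the real download and
    # compression commands for your data source.
    echo "would download ${dataset} (${lang}) to ${output_prefix}.${lang}.zst"
  done
}

# Run the import only when all four arguments are supplied.
if [ "$#" -eq 4 ]; then
  import_example "$@"
fi
```

Keeping the logic in a function makes it easy to dry-run the script with sample arguments before wiring in real downloads.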