Dataset importers

Dataset importers can be used in datasets sections of the training config.

Example:

  train:
    - opus_ada83/v1
    - mtdata_newstest2014_ruen

Data source	Prefix	Name examples	Type	Comments
MTData	mtdata	newstest2017_ruen	parallel	Supports many datasets. Run `mtdata list -l ru-en` to see datasets for a specific language pair.
OPUS	opus	ParaCrawl/v7.1	parallel	Many open source datasets. Go to the website, choose a language pair, check links under Moses column to see what names and version is used in a link.
SacreBLEU	sacrebleu	wmt20	parallel	Official evaluation datasets available in SacreBLEU tool. Recommended to use in `datasets:test` config section. Look up supported datasets and language pairs in `sacrebleu.dataset` python module.
Flores	flores	dev, devtest	parallel	Evaluation dataset from Facebook that supports 100 languages.
NTREX-128	ntrex	devtest	parallel	Evaluation dataset from Microsoft that supports 128 languages.
Custom parallel	url	`https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.[LANG].zst`	parallel	A custom zst compressed parallel dataset, for instance uploaded to GCS. The language pairs should be split into two files. the `[LANG]` will be replaced with the `to` and `from` language codes.
News crawl	news-crawl	news.2019	mono	Monolingual news datasets from WMT
OPUS	opus	tldr-pages/v2023-08-29	mono	Monolingual dataset from OPUS.
HPLT	hplt	mono/v3.0	mono	HPLT monolingual corpus (mostly from Internet Archive, but also from Common Crawl).
Hugging Face	hf	orai-nlp/ZelaiHandi:train:text@fb8101a	mono	Hugging Face monolingual dataset. Dataset names have to include Hugging Face repository identifier, git commit revision (optional) dataset split and dataset field where text is located. The format is `{repo_name}:{subset}:{split}:{field}@{revision}`. Note that this importer is only for manually included datasets and `find-corpus` does not include Hugging Face search.
Hugging Face parallel data	hfp	ayymen/Weblate-Translations:default:train:source_string:target_string@fb8101a	parallel	Hugging Face parallel dataset. Dataset names have to include Hugging Face repository identifier, subset, split, field as srouce, field as target and git commit revision (optional). For further details see here. Note that this importer is only for manually included datasets and `find-corpus` does not include Hugging Face search.
Custom mono	url	`https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.ru.zst`	mono	A custom zst compressed monolingual dataset, for instance uploaded to GCS.

You can also use find-corpus tool to find all datasets for an importer and get them formatted to use in config.

Set up a local poetry environment.

task find-corpus -- en ru

The config generator uses find-corpus to generate a training config automatically and include all the available datasets:

task config-generator -- ru en --name test

Make sure to check licenses of the datasets before using them.

HuggingFace datasets

The HF dataset importers use the following parameters (formatted as {repo_name}:{subset}:{split}:{field}@{revision} or {repo_name}:{subset}:{split}:{src_field}:{trg_field}@{revision}): - repo_name: the Hugging Face dataset repository identifier (includes organization and name). - subset: the dataset subset. It is common that this is not shown in the dataset viewer at the HF page, in this case subset will probably be default. For parallel data it is also common to specify the language pair here. - split: the dataset split. Typically train. - field or (src_field and trg_field for parallel): the row field (column) where the text is located.

Monlingual

Monolingual HF dataset importer assumes each row is a document and will convert to a plain text without document boundaries and preserving endlines (sentences or paragraphs). No additial processing is performed.

Parallel

Parallel data HF importer supports two types of parallel datasets. Both can be used with the same pipeline config format (hfp_{repo_name}:{subset}:{split}:{src_field}:{trg_field}@{revision}) and the importer will automatically detect the type. However there are other types, described below, that are not supported.

The first type is the Translation HF format, which only has one feature (or column) called 'translation' and each row is a JSON containing language codes as keys and text for each translation side as values. Examples of datasets using this format are the Helsinki-NLP or WMT organizations datasets. To use a dataset using this format, like Helsinki-NLP/euconst, you should format the config entry like this: hfp_Helsinki-NLP/euconst:en-fr:train:fr:en, where fr and en are the JSON keys that will be used as source text and target text respectively.

The second type is for datasets that do not use the Translation HF format but instead a normal text dataset format. In this case the HF dataset viewer will show several columns for each row instead of a JSON. For this type specify in src_field and trg_field the dataset columns that have to be used as source and target sentence. For example to use this dataset: hfp_ayymen/Weblate-Translations:en-fr:train:source_string:target_string.

Other types of dataset for translation datasets that are currently not supported: - Datasets which have the same sentences for all languages and each language is separated in a subset or split, - Datasets with multiple language pairs in the same subset+split that need to be filtered using a field (column). For example Flores+ has this two formats and cannot be used with the HF importer.

Adding a new importer

Add Python code here for parallel data importers or here for monolingual ones

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search