Data Cleaning

Making datasets less noisy to improve quality of translation.

OpusCleaner

We use an all-in-one cleaning tool OpusCleaner by HPLT project.

Custom filter configs

The idea behind OpusCleaner is customizing filter rules for each language pair and dataset to get a training corpus with less noise and train higher quality translation models.

Filtering rules can be tuned in an interactive UI.

Installation

Install the OpusCleaner UI on a server. See the installation instructions in the OpusCleaner readme.

For local usage: run from a poetry shell task opuscleaner. Then go to http://0.0.0.0:8000.

Making filters

Choose a language pair and download the required OPUS datasets. They will correspond to opus_... training datasets in the training pipeline config.

Configure cleaning rules for the datasets in the UI.

Copy JSON files for the produced filters data/train-parts/*.filter.json to pipeline/clean/opuscleaner/configs/<src-lang-code>-<trg-lang-code>/ for language pair and dataset specific filters (such filters will also apply to the opposite language pair)

or to

pipeline/clean/opuscleaner/configs/ for dataset specific filters that will apply to all language pairs.

Make sure to replace the language codes to the template values <src> and <trg> and remove absolutes paths from the "files" section. See examples in the directory.

Default configs

Set opuscleaner-mode: custom (this is the default when generating a config) in the training config to use custom per-dataset and per-language pair configs.

If no custom config was specified for the dataset, the default config template will be used.

Modify if needed. Some rules require specifying source or target language. The <src> and <trg> in the template will be automatically replaced with the trained language pair. The generated default config will be copied to the target dataset cleaning directory.

The config is chosen based on this search order:

Dataset and language specific: configs/<language-pair>/<dataset>.filter.json
Language specific: configs/<language-pair>/default.filter.json
Dataset specific: configs/<dataset>.filter.json
Default: configs/default.filter.json

The first found config will be applied.

If the desired behaviour is to apply only the default config template and skip all possible custom configs for the current language pair and/or datasets, set opuscleaner-mode: defaults.

Language codes

OpusCleaner uses many external tools in its filters. It means support of language code schemes for specific tools can differ. For some languages it's required to replace <src>, <trg> in the filter.json to the tools specific language codes.

For example, for Chinese Traditional we use:

Pipeline code: zh_hant
It maps to OpusCleaner code: zh_Hant
It is changed in the OpusCleaner config for fasttext filter with openlid-v2 model to cmn:

    {
      "filter": "fasttext_filter",
      "parameters": {
        "FASTTEXT_MODEL_TYPE": "openlid-v2",
        "LANG1": "eng",
        "LANG2": "cmn"
    }

See more details about the supported languages and language code mappings here.

Bicleaner

It is recommended to use Bicleaner ML models to filter noisy data. See the bicleaner documentation for more details on how to configure it.

Monolingual cleaning

Currently, it does not run OpusCleaner as monolingual filters are not fully supported. It runs legacy Bergamot cleaning scripts that include alphabet ratios and fast text. Also, it runs Monocleaner to filter based on fluency.

Monocleaner thresholds can be adjusted in the training config:

  # Monocleaner filters sentences in monolingual corpus based on language fluency
  # Use sanitized dataset names for compatibility with Taskcluster (replace ".", "/", ":", "[", "]" to "_")
  monocleaner:
    mono-src:
      # News-crawl is typically clean, enable on dataset by dataset basis
      default-threshold: 0.0
      dataset-thresholds:
        # We already filter it by document score, remove only the noisiest segments
        hplt_mono_v2_0: 0.5
        # Filter only garbage from NLLB
        opus_NLLB_v1: 0.5
    mono-trg:
      # News-crawl is typically clean, enable on dataset by dataset basis
      default-threshold: 0.0
      dataset-thresholds:
        # We already filter HPLT by document score, so it's relatively clean,
        # but let's still apply the default threshold for monocleaner to get more fluent target texts for back-translations
        hplt_mono_v2_0: 0.7
        # Sentences for back-translations should be in fluent language, apply even more aggressive threshold for NLLB
        opus_NLLB_v1: 0.8

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search