Pipeline steps

The pipeline steps are based on the Bergamot student training recipe. For a visualization of the pipeline, see Training Pipeline DAGs, which breaks down the individual steps.

Toolchain

Installs dependencies and compiles Marian and other tools.

Data downloading

Downloads the datasets and samples an appropriate number of sentences from them. The running time depends on dataset size. Sampling huge monolingual datasets (100M+ sentences) is the most intensive operation.

Uses the datasets.train configuration section.
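
As an illustration of sampling from a dataset too large to hold in memory, here is a minimal sketch using reservoir sampling. It is not the pipeline's own sampling code; the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List


def reservoir_sample(lines: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Return a uniform random sample of up to k lines from a stream.

    Hypothetical helper for illustration only. The stream is read once,
    and at most k lines are held in memory at any time.
    """
    rng = random.Random(seed)
    sample: List[str] = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            sample.append(line)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds are inclusive
            if j < k:
                sample[j] = line
    return sample
```

Because the input is consumed as a stream, the same function works on a file object opened over a decompressed monolingual dataset.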

Analyze data

Runs data analysis on the downloaded datasets and outputs charts, for example the distribution of sentence lengths in a dataset.
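
A sentence length distribution can be computed along these lines. This is a sketch only: it assumes whitespace-separated words as the unit of length, and the name `length_histogram` and the `bin_size` parameter are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def length_histogram(sentences: Iterable[str], bin_size: int = 5) -> Dict[int, int]:
    """Count sentences per length bin; keys are the lower bound of each bin.

    Length is measured in whitespace-separated words (an assumption made
    for this example).
    """
    counts: Counter = Counter()
    for sentence in sentences:
        n_words = len(sentence.split())
        counts[(n_words // bin_size) * bin_size] += 1
    return dict(sorted(counts.items()))
```

For the inputs `"a b c"`, `"a b c d e f"` and `"a"` with `bin_size=5`, the result is `{0: 2, 5: 1}`.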

Data cleaning

Applies basic preprocessing plus dataset-specific, language-specific, rule-based and other filters to clean noisy data in parallel and monolingual datasets. Parallelizes well across CPU cores.

Uses OpusCleaner for parallel datasets.

Bicleaner AI

Filters noisy sentence pairs in a parallel corpus using the bicleaner-ai classifier. Cleaning thresholds are configurable per dataset.

See more details on the Bicleaner page.
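
Per-dataset thresholding can be sketched as follows, assuming each sentence pair already carries a classifier score. The function name, the `thresholds` mapping and the default value of 0.5 are hypothetical and do not reflect the pipeline's actual configuration keys or defaults.

```python
from typing import Dict, Iterable, Iterator, Tuple

ScoredPair = Tuple[str, str, float]  # (source, target, classifier score)


def filter_by_threshold(
    pairs: Iterable[ScoredPair],
    dataset: str,
    thresholds: Dict[str, float],
    default: float = 0.5,  # hypothetical fallback, not a documented default
) -> Iterator[ScoredPair]:
    """Yield only the pairs whose score reaches the dataset's threshold."""
    threshold = thresholds.get(dataset, default)
    for src, trg, score in pairs:
        if score >= threshold:
            yield src, trg, score
```

A noisier dataset would be given a higher threshold in the mapping, while datasets without an entry fall back to the default.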

Merge and dedupe

Merges all the cleaned datasets into one. Applies deduplication on the target side. When two or more pairs share the same target text, it keeps the pair with the best Bicleaner AI score.
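
The deduplication rule can be sketched like this. It is an in-memory illustration rather than the pipeline's implementation, and the name `dedupe_by_target` is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

ScoredPair = Tuple[str, str, float]  # (source, target, classifier score)


def dedupe_by_target(pairs: Iterable[ScoredPair]) -> List[ScoredPair]:
    """Keep one pair per distinct target text: the one with the highest score.

    Output order follows the first appearance of each target text.
    """
    best: Dict[str, ScoredPair] = {}
    for src, trg, score in pairs:
        kept = best.get(trg)
        if kept is None or score > kept[2]:
            best[trg] = (src, trg, score)
    return list(best.values())
```

At corpus scale a dictionary of all targets may not fit in memory, so a real implementation would more likely sort or hash to disk. The selection rule stays the same.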

Training vocabulary

Trains a SentencePiece vocabulary/tokenizer model on the parallel corpus.

See more details on choosing the size of vocabulary here.

Training backward model

Trains a shallow sequence-to-sequence RNN model in the opposite direction. It is useful for back-translations and cross-entropy filtering. It is based on a Marian example.

Augmentation with back-translations

Translates the monolingual corpus, combined from the monolingual datasets in the target language, using the backward model.

It is more useful for low-resource languages but is also recommended for high-resource ones.

Generating corpus word alignments

Produces ICU-tokenized alignments accepted by OpusTrainer using the eflomal library.

It trains alignments separately for the original parallel, backtranslated and student corpora. The backtranslated and student steps use eflomal priors extracted from the alignments trained for the original parallel corpus. Using the priors can improve accuracy for a smaller corpus, as well as performance.

It works with uncompressed datasets, so it can be heavy on disk.

Training teacher

Trains one or several big transformer models on the augmented dataset. They are later used for decoding as an ensemble. Runs OpusTrainer data augmentation on-the-fly.

Translation by teachers

Translates the corpus and the monolingual data in the source language (configurable in datasets.mono-src) using the trained teacher models.

This is the heaviest part of the pipeline, but it is highly parallelizable.

Cross-entropy filtering

Scores the translated corpus with the backward model and removes the part of the corpus with the lowest scores to reduce noise.

At this point the datasets are huge, so this step can be very disk intensive.
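
The filtering rule can be sketched as below, assuming each pair already has a score from the backward model where higher is better (for example a length-normalized log-probability). The name `drop_lowest_scored` and the `drop_fraction` parameter are hypothetical, and no particular fraction is implied.

```python
from typing import List, Sequence, Tuple

ScoredPair = Tuple[str, str, float]  # (source, translation, backward-model score)


def drop_lowest_scored(pairs: Sequence[ScoredPair], drop_fraction: float) -> List[ScoredPair]:
    """Remove the lowest-scoring fraction of pairs; higher scores are better.

    The result is sorted from best to worst score.
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    ranked = sorted(pairs, key=lambda pair: pair[2], reverse=True)
    n_drop = int(len(ranked) * drop_fraction)
    return ranked[: len(ranked) - n_drop]
```

Sorting the full corpus in memory is only practical for small inputs. At the sizes described above, an external sort on disk would be the more realistic choice.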

Training shortlist

Trains alignments on the SentencePiece-tokenized corpus using eflomal, similar to the alignment steps, and then extracts a lexical shortlist using the extract_lex tool.

Some tools require uncompressed datasets on disk, and they are huge at this point. Good CPU parallelization.

Training student

Trains a small transformer student model on the filtered data using the alignments. OpusTrainer remaps the alignments to SentencePiece-based tokenization. See more details on the OpusTrainer page.
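
To illustrate what remapping word-level alignments to subword tokenization involves, here is one simple strategy: every subword of an aligned source word is linked to every subword of the aligned target word. This is an assumption made for the example and is not claimed to be OpusTrainer's exact algorithm; both function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def spans_from_pieces(pieces_per_word: Sequence[Sequence[str]]) -> List[List[int]]:
    """For each word, list the indices of its subword pieces in the flat sequence."""
    spans: List[List[int]] = []
    next_index = 0
    for pieces in pieces_per_word:
        spans.append(list(range(next_index, next_index + len(pieces))))
        next_index += len(pieces)
    return spans


def remap_alignment(
    alignment: Sequence[Tuple[int, int]],
    src_spans: Sequence[Sequence[int]],
    trg_spans: Sequence[Sequence[int]],
) -> List[Tuple[int, int]]:
    """Expand word-index pairs into subword-index pairs (cartesian product)."""
    remapped: List[Tuple[int, int]] = []
    for src_word, trg_word in alignment:
        for src_piece in src_spans[src_word]:
            for trg_piece in trg_spans[trg_word]:
                remapped.append((src_piece, trg_piece))
    return remapped
```

For example, if the second word on each side is split into two pieces, the word-level link `(1, 1)` becomes the four subword links `(1, 1)`, `(1, 2)`, `(2, 1)` and `(2, 2)`.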

Fine-tuning student

Fine-tunes the student model by emulating 8-bit GEMM during training. Converges very quickly and then degrades.

Quantization

Applies 8-bit quantization to the fine-tuned student model. Marian CPU threads must be set to 1 for this step.
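
As a rough illustration of what 8-bit quantization does to a weight vector, here is a sketch of symmetric quantization with a single scale factor. It is a generic textbook scheme, not Marian's implementation, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def quantize_int8(weights: Sequence[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Avoid dividing by zero when all weights are zero or the input is empty.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: Sequence[int], scale: float) -> List[float]:
    """Recover approximate float values from the quantized integers."""
    return [q * scale for q in quantized]
```

Each recovered weight differs from the original by at most half a quantization step (`scale / 2`), which is the precision loss that the preceding fine-tuning step prepares the student model for.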

Evaluation

Calculates metrics for all models (BLEU, chrF, COMET22). It runs Marian decoding on GPU for all models except the quantized ones, which run on CPU.

It uses the datasets.test configuration section.

Export

Exports the trained model and the shortlist to bergamot-translator format.

Uploading

Uploads all useful artifacts to the production GCP bucket:

Models
Training config
Distillation corpus
Logs

Resource usage

Step	Bottleneck
Compiling Marian and tools	CPU
Data downloading	Network, Disk
Analyze data	CPU, Disk
Data cleaning	CPU
Bicleaner	GPU
Merge and dedupe	CPU, Disk
Training vocabulary	CPU
Training backward model	GPU
Augmentation with back-translations	GPU
Generating alignments	CPU, Disk
Training teacher	GPU
Translation by teacher	GPU
Cross-entropy filtering	GPU, CPU, Disk
Training shortlist	CPU, Disk
Training student	GPU
Fine-tuning student	GPU
Quantization	CPU
Evaluation	GPU
Export
Uploading	Network
