Final Evaluation

After models are trained, final evaluations can be triggered.

See the evaluation dashboard for the results.

Run an evaluation:

task eval -- --config taskcluster/configs/eval.yml

Make sure to update the eval.yml file to run it for specific metrics, translators, etc. The evals will be logged to trigger-eval.log and uploaded to the specified bucket.

See the example config taskcluster/configs/eval.yml for configuration details.

Storage

The evaluation results are saved on GCS as JSON files with the following path templates:

gs://<bucket>/final_evals/<src>-<trg>__<dataset>__<translator>__<model>__<timestamp>__translations.json
gs://<bucket>/final_evals/<src>-<trg>__<dataset>__<translator>__<model>__<timestamp>__<metric>.metrics.json
gs://<bucket>/final_evals/<src>-<trg>__<dataset>__<translator>__<model>__<timestamp>__<metric>.scores.json
# + copied as "latest" for easy access
gs://<bucket>/final_evals/<src>-<trg>__<dataset>__<translator>__<model>__latest__translations.json
gs://<bucket>/final_evals/<src>-<trg>__<dataset>__<translator>__<model>__latest__<metric>.metrics.json
gs://<bucket>/final_evals/<src>-<trg>__<dataset>__<translator>__<model>__latest__<metric>.scores.json

Examples:

final_evals/en-ru__wmt24pp__google__v2__latest__translations.json
final_evals/en-ru__bouquet__opusmt__opus-mt-en-ru__latest__comet22.metrics.json
final_evals/en-ru__bouquet__bergamot__retrain_base-memory_KJ23-iDVTcymG1ZldWY17w__20251122T003231__bleu.metrics.json

When running on the production bucket, previous results are not overwritten by default if a "latest" file is already present on GCS. To rerun a specific evaluation, set override: true in the config. This adds evaluations with a new timestamp and replaces the "latest" files.
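
Since the object names follow the template above, the skip/override behaviour can be sketched in a few lines. This is not the pipeline's actual code; the helper names and the use of the google-cloud-storage client are assumptions for illustration:

from datetime import datetime, timezone
from google.cloud import storage  # assumption: google-cloud-storage is installed

def metric_object_names(src, trg, dataset, translator, model, metric):
    # Build the timestamped and "latest" object names from the path template above.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    base = f"final_evals/{src}-{trg}__{dataset}__{translator}__{model}"
    return (f"{base}__{stamp}__{metric}.metrics.json",
            f"{base}__latest__{metric}.metrics.json")

def should_run(bucket_name, latest_name, override=False):
    # Skip the evaluation if the "latest" file is already present, unless override is set.
    bucket = storage.Client().bucket(bucket_name)
    return override or not bucket.blob(latest_name).exists()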

Language pairs

Any language pair with a two-letter ISO code can be used (some tools require code mapping; see pipeline/eval/langs.py).
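
The actual mapping lives in pipeline/eval/langs.py; as an illustration only (the entries below are examples, not the real table), Flores-style datasets need script-qualified codes while most translation APIs accept plain ISO codes:

# Illustrative only: the real mapping is defined in pipeline/eval/langs.py.
FLORES_CODES = {
    "en": "eng_Latn",
    "ru": "rus_Cyrl",
    "de": "deu_Latn",
}

def to_flores(code: str) -> str:
    # Map a two-letter ISO code to the dataset-specific code, falling back to the input.
    return FLORES_CODES.get(code, code)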

Non-English-centric language pairs have limited support: the two latest Bergamot models (or the models specified in the config) are used to run pivot translation through English.
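
Conceptually the pivot looks like the sketch below, where translate is a hypothetical callable that runs a single Bergamot model; the actual wiring is inside the eval pipeline:

def pivot_translate(lines, src, trg, translate):
    # Hypothetical helper: translate(lines, src, trg) runs one Bergamot model.
    # Non-English-centric pairs are evaluated by pivoting through English.
    english = translate(lines, src, "en")
    return translate(english, "en", trg)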

Even if a Bergamot model for a language pair is absent from storage, the evaluation can still be run for the other translators.

Datasets

Only the latest high-quality datasets with good language coverage are used.

  • Flores200-plus
  • WMT24++
  • Bouquet

Translators

  • Bergamot (Firefox models)
  • Google Translate API
  • Azure Translate API
  • opusmt HF models
  • NLLB 600M
  • Argos Translate

Bergamot runs the final quantized models that we deploy in Firefox, using the bergamot-translator inference engine compiled in native mode. This differs from the WASM mode used in Firefox.

Models

Each translator can discover its available models. For Bergamot, it discovers all the models exported to the Bergamot format for the language pairs that are available on GCS.

For example:

gs://moz-fx-translations-data--303e-prod-translations-data/models/en-ru/retrain_base-memory_KJ23-iDVTcymG1ZldWY17w/exported/lex.50.50.enru.s2t.bin.gz
gs://moz-fx-translations-data--303e-prod-translations-data/models/en-ru/retrain_base-memory_KJ23-iDVTcymG1ZldWY17w/exported/model.enru.intgemm.alphas.bin.gz
gs://moz-fx-translations-data--303e-prod-translations-data/models/en-ru/retrain_base-memory_KJ23-iDVTcymG1ZldWY17w/exported/vocab.enru.spm.gz

Other translators can also be extended to discover more than one model or API version.

The model IDs correspond to the model names in the file paths (see the discovery sketch after the examples below).

For example:

retrain_base-memory_KJ23-iDVTcymG1ZldWY17w
v2
opus-mt-en-ru
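
A minimal sketch of the Bergamot discovery step, assuming the GCS layout shown earlier (the real logic lives in the eval pipeline, and the bucket name is passed in by the caller):

from google.cloud import storage  # assumption: google-cloud-storage is installed

def discover_bergamot_models(bucket_name, src, trg):
    # Model IDs are the directory names under models/<src>-<trg>/ that contain
    # an exported/ folder with the Bergamot-format files.
    prefix = f"models/{src}-{trg}/"
    blobs = storage.Client().list_blobs(bucket_name, prefix=prefix)
    return sorted({
        blob.name.split("/")[2]
        for blob in blobs
        if "/exported/" in blob.name
    })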

To run the evaluation only for the latest uploaded Bergamot models per language pair, set models: ["latest"] in the config.

Metrics

Supported metrics:

  • chrF
  • chrF++
  • BLEU
  • spBLEU
  • COMET22
  • MetricX-24 XL
  • MetricX-24 XL QE (referenceless)
  • LLM (reference-based)
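
The string-based metrics can be reproduced with sacrebleu; a minimal sketch, not the pipeline's actual wrapper code:

import sacrebleu  # assumption: sacrebleu is installed

hyps = ["Привет, мир!"]
refs = [["Здравствуй, мир!"]]  # one reference stream, parallel to hyps

bleu = sacrebleu.corpus_bleu(hyps, refs)
chrf = sacrebleu.corpus_chrf(hyps, refs)
chrf_pp = sacrebleu.CHRF(word_order=2).corpus_score(hyps, refs)  # chrF++
# spBLEU is BLEU computed with a Flores SentencePiece tokenizer in recent sacrebleu versions.
print(bleu.score, chrf.score, chrf_pp.score)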

LLM Evaluation

An LLM can provide an evaluation using the OpenAI API. For each evaluation dataset this produces an analysis of the following aspects, with a score of 1-5 and an explanation of the score:

  • adequacy
  • fluency
  • terminology
  • hallucination
  • punctuation

See pipeline/eval/eval-batch-instructions.md for the full prompt for this analysis.
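
The full prompt lives in pipeline/eval/eval-batch-instructions.md; the sketch below only shows the shape of such a call (the model name and the compact prompt are placeholders, not the pipeline's actual prompt):

import json
from openai import OpenAI  # assumption: the official openai package is installed

ASPECTS = ["adequacy", "fluency", "terminology", "hallucination", "punctuation"]

def llm_scores(source: str, translation: str, reference: str) -> dict:
    # Ask for a 1-5 score and a short explanation per aspect, returned as JSON.
    prompt = (
        f"Source: {source}\nTranslation: {translation}\nReference: {reference}\n"
        f"Score the translation from 1 to 5 on {', '.join(ASPECTS)} "
        "and explain each score. Respond with a JSON object keyed by aspect."
    )
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)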

Running locally

Run under Docker with

task docker

Make sure translator-cli is compiled with

task inference-build

Install dependencies:

pip install -r taskcluster/docker/eval/final_eval.txt

Running some metrics, datasets, and translators requires setting environment variables with secrets:

# Hugging Face token to use restricted HF datasets ("bouquet", "flores200-plus")
export HF_TOKEN=...
# To use "llm-ref" OPEN AI API based metric
export OPENAI_API_KEY=...
# To use "microsoft" translator API
export AZURE_TRANSLATOR_KEY=...
# To use "google" translator API
export GOOGLE_APPLICATION_CREDENTIALS=<path>/creds.json

The output files are stored on disk in the --artifacts folder (by default data/final_evals/).

Run the evals script:

export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python 
export PYTHONPATH=$(pwd) 

python pipeline/eval/final_eval.py \
  --config=taskcluster/configs/eval.yml \
  --artifacts=data/final_evals \
  --bergamot-cli=inference/build/src/app/translator-cli