Training Pipeline - mozilla/translations

Training pipelines for Firefox Translations machine translation models.

The trained models are hosted in the firefox-translations-models repository. They are compatible with bergamot-translator and power Firefox's web page translation feature starting with Firefox 118.

The pipeline was originally developed as part of the Bergamot project, which focuses on improving client-side machine translation in the web browser.

Training pipeline

The pipeline can train a translation model for a language pair end to end. Translation quality depends on the chosen datasets, data cleaning procedures, and hyperparameters. Some settings, especially for low-resource languages, might require extra tuning.
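
To make those moving parts concrete, the sketch below shows the kind of information an end-to-end run has to specify: the language pair, the datasets, the cleaning thresholds, and the training hyperparameters. This is a hypothetical illustration in Python; the field names and dataset identifiers are assumptions chosen for clarity, not the pipeline's actual configuration schema.

```python
# Hypothetical sketch of an end-to-end training configuration.
# All field names and dataset identifiers are illustrative assumptions,
# not the actual schema used by this repository.
config = {
    "experiment": {
        "src": "en",        # source language
        "trg": "fi",        # target language
        "name": "baseline", # experiment name
    },
    "datasets": {
        "train": ["opus_ParaCrawl/v9"],        # parallel training corpora
        "devtest": ["flores_dev"],             # validation set
        "test": ["flores_devtest"],            # held-out evaluation set
        "mono-trg": ["news-crawl_news.2022"],  # monolingual data for back-translation
    },
    "cleaning": {
        "bicleaner-threshold": 0.5,  # drop sentence pairs scored below this
    },
    "hyperparameters": {
        "early-stopping": 20,  # stop after N stalled validation checks
        "beam-size": 4,        # beam size for decoding during evaluation
    },
}
```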

We use Marian, a fast neural machine translation engine.
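
For a rough idea of what a single Marian training step looks like, here is a minimal sketch that launches the `marian` trainer from Python. The flags shown are standard Marian options; the file paths and numeric values are placeholder assumptions, and the real pipeline drives Marian through its own generated configuration rather than a hand-written command like this.

```python
import subprocess

# Minimal sketch of a Marian training invocation.
# Paths and values are placeholders; the flags are standard `marian` options.
cmd = [
    "marian",
    "--type", "transformer",                   # transformer architecture
    "--model", "model/model.npz",              # checkpoint output path
    "--train-sets", "corpus.en", "corpus.fi",  # parallel training data
    "--vocabs", "vocab.spm", "vocab.spm",      # shared SentencePiece vocabulary
    "--valid-sets", "dev.en", "dev.fi",        # validation data
    "--valid-metrics", "ce-mean-words", "bleu-detok",
    "--early-stopping", "10",                  # stop when validation stalls
    "--mini-batch-fit",                        # fit batch size to the workspace
    "--workspace", "9000",                     # GPU workspace memory in MiB
    "--devices", "0", "1", "2", "3",           # train on four GPUs
    "--log", "train.log",
    "--valid-log", "valid.log",
]
subprocess.run(cmd, check=True)
```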

Learning resources

Acknowledgements

This project uses materials developed by:

- Bergamot project (github, website), which received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825303
- HPLT project (github, website), which received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546]
- OPUS-MT project (github, website)
- Many other open-source projects and research papers (see References)

References

Here is a list of selected publications on which the training pipeline is based. More related publications can be found on the Bergamot project website.

  1. V. M. Sánchez-Cartagena, M. Bañón, S. Ortiz-Rojas and G. Ramírez-Sánchez, "Prompsit's submission to WMT 2018 Parallel Corpus Filtering shared task", in Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers. Brussels, Belgium: Association for Computational Linguistics, October 2018

  2. Gema Ramírez-Sánchez, Jaume Zaragoza-Bernabeu, Marta Bañón and Sergio Ortiz Rojas, "Bifixer and Bicleaner: two open-source tools to clean your parallel data", in Proceedings of the 22nd Annual Conference of the European Association for Machine Translation. Lisboa, Portugal: European Association for Machine Translation, November 2020

  3. Mölder F, Jablonski KP, Letcher B, et al. Sustainable data analysis with Snakemake. F1000Res. 2021;10:33. Published 2021 Jan 18. doi:10.12688/f1000research.29032.2

  4. Edinburgh’s Submissions to the 2020 Machine Translation Efficiency Task (Bogoychev et al., NGT 2020)

  5. From Research to Production and Back: Ludicrously Fast Neural Machine Translation (Kim et al., EMNLP 2019)

  6. The University of Edinburgh’s Submissions to the WMT19 News Translation Task (Bawden et al., 2019)

  7. Jörg Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'2012)

  8. The University of Edinburgh’s Neural MT Systems for WMT17, Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich Germann, Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone, and Philip Williams. In Proceedings of the EMNLP 2017 Second Conference on Machine Translation (WMT17), 2017.

  9. Marian: Fast Neural Machine Translation in C++, Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. In Proceedings of ACL 2018, System Demonstrations, 2018.

  10. Improving Neural Machine Translation Models with Monolingual Data, Rico Sennrich, Barry Haddow, Alexandra Birch. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016.

  11. A Call for Clarity in Reporting BLEU Scores (Post, 2018)

  12. The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation (Goyal et al., Facebook AI, 2021)

  13. Many-to-English Machine Translation Tools, Data, and Pretrained Models (Gowda et al., ACL 2021)

  14. Chris Dyer, Victor Chahuneau, and Noah A. Smith. (2013). A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proc. of NAACL.

  15. Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., ACL 2016)

  16. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Taku Kudo, 2018)

  17. Bicleaner AI: Bicleaner Goes Neural (Zaragoza-Bernabeu et al., LREC 2022)

  18. Sequence-Level Knowledge Distillation (Yoon Kim, Alexander M. Rush, EMNLP 2016)