Snakemake

This section describes how to run the pipeline using the Snakemake orchestrator, either locally or on a Slurm cluster.

NOTICE: Mozilla has switched to Taskcluster for model training, and the Snakemake pipeline is no longer maintained. Feel free to contribute if you find bugs.

The Snakemake workflow manager implicitly infers the DAG of tasks from the specified inputs and outputs of the steps. It checks which output files are missing and runs the corresponding jobs, either locally or on a cluster, depending on the configuration.

Snakemake parallelizes steps that can be executed simultaneously. It is especially useful for teacher ensemble training and translation.
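
If you want to inspect the inferred DAG before running anything, Snakemake can print it in Graphviz format. A minimal sketch, assuming Snakemake and Graphviz are installed and that you pass the same profile and config that the Makefile would pass:

# Illustrative only: render the task graph to a PDF
snakemake --dag --profile profiles/<profile> --configfile configs/config.test.yml | dot -Tpdf > dag.pdf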

The main Snakemake process (the scheduler) should be launched interactively. It runs the job processes on worker nodes in cluster mode or on the local machine in local mode.
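
Because the scheduler has to stay alive for the whole run, it is convenient to start it inside a persistent terminal session. A hedged example using tmux (tmux itself is not part of the pipeline):

# Illustrative only: keep the interactive scheduler running across disconnects
tmux new -s snakemake
make run PROFILE=slurm-moz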

System requirements

Local mode

  • Ubuntu 18.04 (it can work on other Linux distributions, but might require fixes to the setup scripts; see the Marian installation instructions for more details).
  • One or several Nvidia GPUs with CUDA drivers installed and at least 8 GB of memory.
  • cuDNN installed
  • At least 16 CPU cores (some steps of the pipeline utilize multiple cores well, so the more the better).
  • 64 GB RAM (128 GB+ might be required for bigger datasets)
  • 200+ GB of disk space (mostly for datasets and intermediate transformations). The exact amount depends on the chosen datasets and can be significantly higher.
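
You can quickly check most of these requirements with standard tools (illustrative commands, not part of the pipeline):

nvidia-smi   # GPUs, driver and CUDA version
nproc        # number of CPU cores
free -h      # available RAM
df -h .      # free disk space in the current directory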

It was tested on:

  • Ubuntu 18.04
  • 56-core Xeon server
  • 128 GB of RAM
  • 8x NVIDIA RTX 2080 GPUs with 12 GB of memory
  • CUDA 11.2
  • 100 GB of local disk space
  • Many terabytes of NFS mounted storage

Cluster mode

  • Slurm cluster with CPU and Nvidia GPU nodes
  • CUDA 11.2 (also tested on 11.5)
  • cuDNN library installed
  • Singularity module if running with containerization (recommended)
  • If running without containerization, there is no automated procedure to configure the environment. All required modules (for example, parallel) should be preinstalled and loaded in ~/.bashrc, as shown in the example below
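
For example, a non-containerized setup might load the required modules in ~/.bashrc like this (module names are cluster-specific and shown only as an illustration):

# Illustrative only: adjust module names/versions to your cluster
module load parallel
module load cuda/11.2
module load cudnn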

The pipeline was tested on the Mozilla Slurm cluster using Singularity containers. It can also be launched on the CSD3 HPC cluster, but that setup was not fully tested.

Cloud mode

Snakemake workflows can run on Kubernetes, Google Cloud Life Sciences, and other cloud platforms. The pipeline was not tested in this mode and might require modifications.

Please refer to the Cloud execution section of the Snakemake documentation.

It is also possible to deploy a Slurm cluster in the cloud, for example using Slurm on Google Cloud Platform.

Configuration

  1. Clone the repo:
    git clone https://github.com/mozilla/firefox-translations-training.git
    cd firefox-translations-training
    
  2. Choose a Snakemake profile from profiles/ or create a new one
  3. Adjust paths in the Makefile if needed and set the PROFILE variable to the name of your profile
  4. Adjust Snakemake and workflow settings in profiles/<profile>/config.yaml; see the Snakemake CLI reference for details
  5. Configure the experiment and datasets in configs/config.prod.yml (or configs/config.test.yml for a test run); see the sketch after this list
  6. Change source code if needed for the experiment
  7. (Cluster mode) Adjust cluster settings in the cluster profile. For slurm-moz it is profiles/slurm-moz/config.cluster.yml. You can also modify profiles/slurm-moz/submit.sh or create a new Snakemake profile.
  8. (Cluster mode) Further tuning of the requested resources might be required in the Snakefile:
    • Use threads for a rule to adjust parallelism
    • Use resources: mem_mb=<memory> to adjust total memory requirements per task (default is set in profiles/slurm-moz/config.yaml)
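
A hedged sketch of steps 3-5, assuming you keep your experiment settings in a copy of the production config (the copied file name is illustrative):

cp configs/config.prod.yml configs/config.my-experiment.yml
# edit configs/config.my-experiment.yml and profiles/<profile>/config.yaml, then:
make run PROFILE=<profile> CONFIG=configs/config.my-experiment.yml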

Installation

See also Snakemake installation

  1. Install Mamba, a fast Conda package manager:
    make conda
  2. Install Snakemake:
    make snakemake
  3. Update git submodules:
    make git-modules
  4. (Optional) Install Singularity if running with containerization

Local mode: See Singularity installation (requires root)

Cluster mode:

For example,

module load singularity

but the way to load Singularity depends on the cluster installation

  5. (Optional) Prepare a container image if using Singularity

Either pull the prebuilt image:

make pull

Or build it (requires root):

make build

Running

Do a dry run first to check that everything is installed correctly:

make dry-run

To run the pipeline:

make run

To test the whole pipeline end to end (it should run relatively quickly and does not train anything useful):

make test

You can also run a specific profile or config by overriding variables from the Makefile:

make run PROFILE=slurm-moz CONFIG=configs/config.test.yml

Specific target

By default, all Snakemake rules are executed. To run the pipeline only up to a specific rule, use:

make run TARGET=<non-wildcard-rule-or-path>

For example, to collect and merge the corpus first:

make run TARGET=merge_corpus

You can also use the full file path, for example:

make run TARGET=/models/ru-en/bicleaner/teacher-base0/model.npz.best-ce-mean-words.npz

Rerunning

If you want to rerun a specific step or steps, delete the result files that are expected in the corresponding Snakemake rule output. Snakemake might then complain about a missing file and suggest running it with the --cleanup-metadata flag. In this case run:

make clean-meta TARGET=<missing-file-name>

and then as usual:

make run
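
For example, a hedged sketch of forcing the corpus merging step to rerun by removing its outputs (paths follow the directory structure shown below; adjust the shared root, language pair, and experiment name to your setup):

# Illustrative only: delete the outputs of the merge step, then rerun the pipeline
rm <SHARED_ROOT>/data/ru-en/test/merged/corpus.ru.gz <SHARED_ROOT>/data/ru-en/test/merged/corpus.en.gz
make run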

Reporting

To create a Snakemake HTML report, run:

make report

Results

See the Directory structure section below.

The main directories inside SHARED_ROOT are:

  • data/<lang_pair>/<experiment> - data produced by the pipeline jobs
  • logs/<lang_pair>/<experiment> - logs of the jobs for troubleshooting
  • experiments/<lang_pair>/<experiment> - saved experiment settings for future reference
  • models/<lang_pair>/<experiment> - all models produced by the pipeline. The final compressed models are in the exported folder.

Exported models example

/models/ru-en/test/exported/model.ruen.intgemm.alphas.bin.gz
/models/ru-en/test/exported/lex.50.50.ruen.s2t.bin.gz
/models/ru-en/test/exported/vocab.ruen.spm.gz

Directory structure

├ data
│   └ ru-en
│      └ test
│        ├ original
│        │   ├ corpus
│        │   │   ├ mtdata_JW300.en.gz
│        │   │   └ mtdata_JW300.ru.gz
│        │   ├ devset
│        │   │   ├ flores_dev.en.gz
│        │   │   └ flores_dev.ru.gz
│        │   ├ eval
│        │   │   ├ sacrebleu_wmt20.en.gz
│        │   │   └ sacrebleu_wmt20.ru.gz
│        │   ├ mono
│        │   │   ├ news-crawl_news.2020.ru.gz
│        │   │   └ news-crawl_news.2020.en.gz
│        │   ├ devset.ru.gz
│        │   └ devset.en.gz
│        ├ clean
│        │   ├ corpus
│        │   │   ├ mtdata_JW300.en.gz
│        │   │   └ mtdata_JW300.ru.gz
│        │   ├ mono
│        │   │   ├ news-crawl_news.2020.ru.gz
│        │   │   └ news-crawl_news.2020.en.gz
│        │   ├ mono.ru.gz
│        │   └ mono.en.gz
│        ├ biclean
│        │   ├ corpus
│        │   │   ├ mtdata_JW300.en.gz
│        │   │   └ mtdata_JW300.ru.gz
│        │   ├ corpus.ru.gz
│        │   └ corpus.en.gz
│        ├ translated
│        │   ├ mono.ru.gz
│        │   └ mono.en.gz
│        ├ augmented
│        │   ├ corpus.ru.gz
│        │   └ corpus.en.gz
│        ├ alignment
│        │   ├ corpus.aln.gz
│        │   └ lex.s2t.pruned.gz
│        ├ merged
│        │   ├ corpus.ru.gz
│        │   └ corpus.en.gz
│        └ filtered
│            ├ corpus.ru.gz
│            └ corpus.en.gz
├ models
│   └ ru-en
│       └ test
│          ├ backward
│          ├ teacher-base0
│          ├ teacher-base1
│          ├ teacher-finetuned0
│          ├ teacher-finetuned1
│          ├ student
│          ├ student-finetuned
│          ├ speed
│          ├ evaluation
│          │  ├ backward
│          │  ├ teacher-base0
│          │  ├ teacher-base1
│          │  ├ teacher-finetuned0
│          │  ├ teacher-finetuned1
│          │  ├ teacher-ensemble
│          │  ├ student
│          │  ├ student-finetuned
│          │  └ speed
│          └ exported
│
├ experiments
│   └ ru-en
│      └ test
│         └ config.sh
├ logs
│   └ ru-en
│      └ test
│         └ clean_corpus.log