Home | Previous - Setting up your DeepSpeech training environment | Next - Testing and evaluating your trained model
Before we can train a model, we need to make the training data available to the Docker container. This data was prepared earlier, in the instructions on formatting data; copy or extract it to the directory you specified in your bind mount, so that it is visible inside the Docker container.
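For example, if you have downloaded the Indonesian Common Voice data, you could extract the archive straight into the bind-mounted directory. The archive name id.tar.gz here is illustrative; use the name of the file you downloaded:

$ tar -xvzf id.tar.gz -C deepspeech-data/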
$ cd deepspeech-data
$ ls -las cv-corpus-6.1-2020-12-11/
total 12
4 drwxr-xr-x 3 kathyreid kathyreid 4096 Feb 9 10:42 ./
4 drwxrwxr-x 7 kathyreid kathyreid 4096 Feb 9 10:43 ../
4 drwxr-xr-x 3 kathyreid kathyreid 4096 Feb 9 10:43 id/
We’re now ready to begin training. First, we’re going to walk through some of the key parameters you can use with DeepSpeech.py.
python3 DeepSpeech.py \
--train_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv
Do not run this yet
The options --train_files, --dev_files and --test_files take a path to the relevant data, which was prepared in the section on data formatting.
As you train your model, DeepSpeech will store checkpoints to disk. A checkpoint allows training to be interrupted, then restarted from that point, saving hours of training time.
Because we have our training environment configured to use Docker, we must ensure that our checkpoint directories are stored in the directory used by the bind mount, so that they persist in the event of failure.
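For example, assuming your bind mount is the deepspeech-data directory used throughout this PlayBook, you can create the checkpoint directory ahead of time:

$ mkdir deepspeech-data/checkpoints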
To specify a checkpoint directory, use the --checkpoint_dir parameter with DeepSpeech.py:
python3 DeepSpeech.py \
--train_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir deepspeech-data/checkpoints
Do not run this yet
Checkpoints are stored as TensorFlow tf.Variable objects. This is a binary file format; that is, you won’t be able to read a checkpoint with a text editor. A checkpoint stores all the weights and biases of the current state of the neural network as training progresses.
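Although you cannot read a checkpoint with a text editor, TensorFlow can list the variables it contains. As a quick sketch, run from inside the container and assuming the checkpoint directory used later in this section:

$ python3 -c "import tensorflow as tf; print(tf.train.list_variables('deepspeech-data/checkpoints'))"

This prints the name and shape of each weight and bias variable in the most recent checkpoint, which can be handy for sanity-checking.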
Checkpoints are named by the total number of steps completed. For example, if you train for 100 epochs at 2000 steps per epoch, then the final checkpoint will be named 200000.
~/deepspeech-data/checkpoints-true-id$ ls -las
total 1053716
4 drwxr-xr-x 2 root root 4096 Feb 24 14:17 ./
4 drwxrwxr-x 5 root root 4096 Feb 24 13:18 ../
174376 -rw-r--r-- 1 root root 178557296 Feb 24 14:11 best_dev-12774.data-00000-of-00001
4 -rw-r--r-- 1 root root 1469 Feb 24 14:11 best_dev-12774.index
1236 -rw-r--r-- 1 root root 1262944 Feb 24 14:11 best_dev-12774.meta
4 -rw-r--r-- 1 root root 85 Feb 24 14:11 best_dev_checkpoint
4 -rw-r--r-- 1 root root 247 Feb 24 14:17 checkpoint
4 -rw-r--r-- 1 root root 3888 Feb 24 13:18 flags.txt
174376 -rw-r--r-- 1 root root 178557296 Feb 24 14:09 train-12774.data-00000-of-00001
4 -rw-r--r-- 1 root root 1469 Feb 24 14:09 train-12774.index
1236 -rw-r--r-- 1 root root 1262938 Feb 24 14:09 train-12774.meta
174376 -rw-r--r-- 1 root root 178557296 Feb 24 14:13 train-14903.data-00000-of-00001
4 -rw-r--r-- 1 root root 1469 Feb 24 14:13 train-14903.index
1236 -rw-r--r-- 1 root root 1262938 Feb 24 14:13 train-14903.meta
174376 -rw-r--r-- 1 root root 178557296 Feb 24 14:17 train-17032.data-00000-of-00001
4 -rw-r--r-- 1 root root 1469 Feb 24 14:17 train-17032.index
1236 -rw-r--r-- 1 root root 1262938 Feb 24 14:17 train-17032.meta
174376 -rw-r--r-- 1 root root 178557296 Feb 24 14:01 train-19161.data-00000-of-00001
4 -rw-r--r-- 1 root root 1469 Feb 24 14:01 train-19161.index
1236 -rw-r--r-- 1 root root 1262938 Feb 24 14:01 train-19161.meta
174376 -rw-r--r-- 1 root root 178557296 Feb 24 14:05 train-21290.data-00000-of-00001
4 -rw-r--r-- 1 root root 1469 Feb 24 14:05 train-21290.index
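TensorFlow also writes a small plain-text file named checkpoint into this directory, recording which checkpoint it will restore from next. You can inspect it with cat; the output will look along these lines (the checkpoint names here are illustrative):

$ cat checkpoint
model_checkpoint_path: "train-21290"
all_model_checkpoint_paths: "train-17032"
all_model_checkpoint_paths: "train-19161"
all_model_checkpoint_paths: "train-21290"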
Checkpoints can consume a lot of disk space (in the listing above, each checkpoint’s .data file alone is roughly 178 MB, so five retained checkpoints approach 1 GB), so you may wish to configure how often a checkpoint is written to disk, and how many checkpoints are stored.
--checkpoint_secs specifies the time interval between checkpoints being stored. The default is 600 seconds, that is, every ten minutes. You may wish to increase this if you have limited disk space.
--max_to_keep specifies how many checkpoints to keep. The default is 5. You may wish to decrease this if you have limited disk space.
In this example we will store a checkpoint every 30 minutes (1800 seconds), and keep only 3 checkpoints.
python3 DeepSpeech.py \
--train_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir deepspeech-data/checkpoints \
--checkpoint_secs 1800 \
--max_to_keep 3
Do not run this yet
In some cases, you may wish to load checkpoints from one location but save checkpoints to another location, for example if you are doing fine-tuning or transfer learning.
--load_checkpoint_dir specifies the directory to load checkpoints from, and --save_checkpoint_dir specifies the directory to save checkpoints to.
In this example we will load existing checkpoints from one directory and save new checkpoints to another.
python3 DeepSpeech.py \
--train_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--load_checkpoint_dir deepspeech-data/checkpoints-to-train-from \
--save_checkpoint_dir deepspeech-data/checkpoints-to-save-to
Do not run this yet
Again, because our training environment is configured to use Docker, we must ensure that our trained model is stored in the directory used by the bind mount, so that it persists if the Docker container fails.
To specify where the trained model should be saved, use the --export_dir parameter with DeepSpeech.py:
python3 DeepSpeech.py \
--train_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir deepspeech-data/checkpoints \
--export_dir deepspeech-data/exported-model
You can run this command to start training with DeepSpeech.py.
For a full list of parameters that can be passed to DeepSpeech.py, please consult the documentation.
DeepSpeech.py has many parameters - too many to cover in an introductory PlayBook. Here are some of the commonly used parameters that are useful to explore as you begin to train speech recognition models with DeepSpeech.
The n_hidden parameter
Neural networks work through a series of layers. Usually there is an input layer, which takes an input (in this case an audio recording), a series of hidden layers, which identify features of that input, and an output layer, which makes a prediction (in this case a character).
With a large dataset, you need a large hidden layer to arrive at an accurate trained model. With smaller datasets, often called toy corpora or toy datasets, the hidden layer does not need to be as large.
If you are learning how to train using DeepSpeech, and are working with a small dataset, you will save time by reducing the value of --n_hidden. This reduces the size of each hidden layer in the neural network, which both reduces the amount of computing resources consumed during training and makes training a model much faster.
The --n_hidden parameter has a default value of 2048.
python3 DeepSpeech.py \
--train_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir deepspeech-data/checkpoints \
--export_dir deepspeech-data/exported-model \
--n_hidden 64
In neural networks, the learning rate is the rate at which the neural network makes adjustments to the predictions it generates. The accuracy of predictions is measured using the loss; the lower the loss, the smaller the difference between the neural network’s predictions and the actual known values. If training is effective, loss will reduce over time. A neural network with a loss of 0 makes perfect predictions.
If the learning rate is too low, predictions will take a long time to align with the actual targets. If the learning rate is too high, predictions will overshoot the targets. The learning rate has to strike a balance between exploration and exploitation.
If loss is not reducing over time, then the training is said to have plateaued - that is, the adjustments to the predictions are not reducing loss. By adjusting the learning rate, and other parameters, we may escape the plateau and continue to decrease loss.
The --reduce_lr_on_plateau parameter instructs DeepSpeech.py to automatically reduce the learning rate if a plateau is detected. By default, this is false.
The --plateau_epochs parameter specifies the number of epochs of training without a reduction in loss that should be considered a plateau. The default value is 10.
The --plateau_reduction parameter specifies a multiplicative factor that is applied to the current learning rate if a plateau is detected. This number must be less than 1, otherwise it would increase the learning rate. The default value is 0.1. For example, with a learning rate of 0.001 and the default --plateau_reduction of 0.1, the learning rate would drop to 0.0001 when a plateau is detected.
An example of training with these parameters would be:
python3 DeepSpeech.py \
--train_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir deepspeech-data/checkpoints \
--export_dir deepspeech-data/exported-model \
--n_hidden 64 \
--reduce_lr_on_plateau true \
--plateau_epochs 8 \
--plateau_reduction 0.08
If training is not resulting in a reduction of loss over time, you can pass parameters to DeepSpeech.py that will stop training. This is called early stopping, and it is useful if you are using cloud or shared compute resources and can’t monitor the training continuously.
The --early_stop parameter enables early stopping. It is set to false by default.
The --es_epochs parameter takes an integer: the number of epochs with no improvement after which training will be stopped. It is set to 25 by default; this default applies if the parameter is omitted but --early_stop is set to true.
The --es_min_delta parameter is the minimum change in loss per epoch that qualifies as an improvement. By default it is set to 0.05.
An example of training with these parameters would be:
python3 DeepSpeech.py \
--train_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir deepspeech-data/checkpoints \
--export_dir deepspeech-data/exported-model \
--n_hidden 64 \
--reduce_lr_on_plateau true \
--plateau_epochs 8 \
--plateau_reduction 0.08 \
--early_stop true \
--es_epochs 10 \
--es_min_delta 0.06
In machine learning, one of the risks during training is overfitting. Overfitting is where training creates a model that does not generalize well; that is, it fits only the set of data on which it is trained, and during inference new data is not recognised accurately.
Dropout is a technique for reducing overfitting. In dropout, nodes are randomly dropped from the neural network during training. This simulates the effect of more diverse data, and is a computationally cheap way of reducing overfitting and improving the generalizability of the model.
Dropout can be set for any layer of a neural network. The parameter that has the most effect for DeepSpeech training is --dropout_rate, which controls the feedforward layers of the neural network. To see the full set of dropout parameters, consult the DeepSpeech documentation.
The --dropout_rate parameter specifies the proportion of nodes to drop from the neural network during training. The default value is 0.05. However, if you are training on less than thousands of hours of voice data, you will find that a value of 0.3 to 0.4 works better to prevent overfitting.
An example of training with this parameter would be:
python3 DeepSpeech.py \
--train_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir deepspeech-data/checkpoints \
--export_dir deepspeech-data/exported-model \
--n_hidden 64 \
--reduce_lr_on_plateau true \
--plateau_epochs 8 \
--plateau_reduction 0.08 \
--early_stop true \
--es_epochs 10 \
--es_min_delta 0.06 \
--dropout_rate 0.3
In training, a step is one update of the gradient; that is, one attempt to find the lowest, or minimal, loss. The amount of processing done in one step depends on the batch size. By default, DeepSpeech.py has a batch size of 1; that is, it processes one audio file in each step.
An epoch is one full cycle through the training data. That is, if you have 1000 files listed in your train.tsv file, then you should expect to process 1000 steps per epoch (assuming a batch size of 1).
To find out how many steps to expect in each epoch, you can count the number of lines in your train.tsv file:
~/deepspeech-data/cv-corpus-6.1-2020-12-11/id$ wc -l train.tsv
2131 train.tsv
In this case there would be roughly 2131 steps per epoch with a batch size of 1 (2130, strictly, since the first line of the file is a header row).
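Steps per epoch is simply the number of training samples divided by the batch size, rounded up, so you can estimate it with shell arithmetic. A sketch, assuming a single header row and a batch size of 8 (batch sizes are covered below):

$ echo $(( ($(wc -l < train.tsv) - 1 + 7) / 8 ))
267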
--epochs specifies how many epochs to train. It has a default of 75, which is appropriate when training on tens to hundreds of hours of audio. If you have thousands of hours of audio, you may wish to increase the number of epochs to around 150-300.
--train_batch_size, --dev_batch_size and --test_batch_size specify the batch size per step for the training, validation and test phases respectively. These all have a default value of 1. Increasing a batch size increases the amount of memory required to process each step; you need to be aware of this before increasing it.
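Here is a sketch of a training command that raises the batch sizes; the values are illustrative, and you should choose them to fit your available GPU memory:

python3 DeepSpeech.py \
--train_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir deepspeech-data/checkpoints \
--export_dir deepspeech-data/exported-model \
--train_batch_size 8 \
--dev_batch_size 4 \
--test_batch_size 4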
Advanced training options are available, such as feature caching and augmentation. They are beyond the scope of this PlayBook, but you can read more about them in the DeepSpeech documentation.
For a full list of parameters that can be passed to the DeepSpeech.py file, please consult the DeepSpeech documentation.
Monitoring GPU use with nvtop
In a separate terminal (that is, not the session where you have the Docker container open), run the command nvtop. You should see the DeepSpeech.py process consuming all available GPUs.
If you do not see the GPU(s) being heavily utilised, you may be training only on your CPUs, and you should double-check your environment.
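One way to double check is to ask TensorFlow which devices it can see from inside the container. This sketch uses the device_lib utility from the TensorFlow 1.x release that DeepSpeech depends on:

$ python3 -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"

If no GPU device appears in the output, training is running on CPU only.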
The "Failed to get convolution algorithm" error when training
You can safely skip this section if you have not encountered this error.
There have been several reports of an error similar to the below when training is initiated. Anecdotal evidence suggests that the error is more likely to be encountered if you are training using an RTX-model GPU.
The error will look like this:
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[]]
[[concat/concat/_99]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[]]
0 successful operations.
0 derived errors ignored.
To work around this error, you will need to set the TF_FORCE_GPU_ALLOW_GROWTH flag to True.
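Note that TF_FORCE_GPU_ALLOW_GROWTH is read by TensorFlow as an environment variable, so before editing any source you can try exporting it in the shell of the running container:

$ export TF_FORCE_GPU_ALLOW_GROWTH=true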
You can also make the equivalent change in code by editing the file DeepSpeech/training/deepspeech_training/util/config.py as below:
root@687a2e3516d7:/DeepSpeech/training/deepspeech_training/util# nano config.py
...
# Standard session configuration that'll be used for all new sessions.
c.session_config = tfv1.ConfigProto(allow_soft_placement=True,
                                    log_device_placement=FLAGS.log_placement,
                                    inter_op_parallelism_threads=FLAGS.inter_op_parallelism_threads,
                                    intra_op_parallelism_threads=FLAGS.intra_op_parallelism_threads,
                                    gpu_options=tfv1.GPUOptions(allow_growth=FLAGS.use_allow_growth))
# Set TF_FORCE_GPU_ALLOW_GROWTH to work around cuDNN error on RTX GPUs
c.session_config.gpu_options.allow_growth = True
Home | Previous - Setting up your DeepSpeech training environment | Next - Testing and evaluating your trained model