
Formatting your training data for DeepSpeech


DeepSpeech expects audio files in WAV format, mono-channel, with a 16 kHz sampling rate.
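As a rough way to confirm that a file meets these requirements before training, the sketch below reads the WAV header with Python's standard wave module. The function name check_wav and its parameters are our own illustration and are not part of DeepSpeech.

```python
import wave


def check_wav(path, expected_rate=16000, expected_channels=1):
    """Return a list of problems found in the WAV header (empty if none).

    check_wav is a hypothetical helper, not a DeepSpeech function.
    """
    problems = []
    # wave.open raises wave.Error if the file is not a readable WAV file.
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
    if channels != expected_channels:
        problems.append(f"expected {expected_channels} channel(s), found {channels}")
    if rate != expected_rate:
        problems.append(f"expected {expected_rate} Hz, found {rate} Hz")
    return problems
```

Calling check_wav on a mono, 16 kHz file returns an empty list; any other file returns one message per mismatch, which you can log before deciding whether to resample.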

For training, testing, and development, you need to supply CSV files that contain three columns: wav_filename,wav_filesize,transcript. The wav_filesize (i.e. the size of the file in bytes) is used to group audio of similar lengths together for efficient batching.
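To make the three-column layout concrete, here is a minimal sketch that writes such a CSV from a list of (WAV path, transcript) pairs. The helper name write_deepspeech_csv is hypothetical; only the column names come from the text above.

```python
import csv
import os

# Column names as described above.
FIELDNAMES = ["wav_filename", "wav_filesize", "transcript"]


def write_deepspeech_csv(samples, csv_path):
    """Write (wav_path, transcript) pairs to a three-column CSV file.

    write_deepspeech_csv is a hypothetical helper, not a DeepSpeech function.
    Returns the number of rows written.
    """
    rows = [
        {
            "wav_filename": wav_path,
            # Size on disk in bytes.
            "wav_filesize": os.path.getsize(wav_path),
            "transcript": transcript,
        }
        for wav_path, transcript in samples
    ]
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

You would typically call this once per split, producing one CSV each for training, development, and testing.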

Collecting data

This PlayBook is focused on training a speech recognition model, rather than on collecting the data that is required for an accurate model. However, a good model starts with data.

Punctuation and numbers

If you are collecting data that will be used to train a speech model, then you should remove punctuation marks such as dashes, tick marks, quote marks and so on. These will often be confused, and can hinder training an accurate model.

Numbers should be written in full (i.e. as a cardinal number): that is, as eight rather than 8.
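One possible way to apply these rules to existing transcripts is sketched below. It lowercases the text, replaces every character that is neither a letter, a digit, nor whitespace with a space, and flags transcripts that still contain digits so they can be spelled out by hand. The function names are our own. Two choices here are assumptions: lowercasing the text, and treating apostrophes like any other punctuation. Adjust both to suit the characters your alphabet.txt allows.

```python
import re


def normalize_transcript(text):
    """Lowercase, replace punctuation and symbols with spaces, collapse whitespace.

    Assumption: any character that is not alphanumeric or whitespace
    (dashes, tick marks, quote marks, apostrophes, ...) is unwanted.
    """
    cleaned = "".join(
        ch if (ch.isalnum() or ch.isspace()) else " "
        for ch in text.lower()
    )
    return re.sub(r"\s+", " ", cleaned).strip()


def contains_digits(text):
    """Return True if the transcript still has digits that need spelling out."""
    return any(ch.isdigit() for ch in text)
```

This sketch deliberately does not convert digits to words automatically, because the correct spelling depends on the language and on how the number was spoken.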

Preparing your data for training

Data from Common Voice

If you are using data from Common Voice for training a model, you will need to prepare it as outlined in the DeepSpeech documentation.

In this example we will prepare the Indonesian dataset for training, but you can use any language from Common Voice that you prefer. We’ve chosen Indonesian as it has the same orthographic alphabet as English, which means we don’t have to use a different alphabet.txt file for training; we can use the default.

This example assumes you have already set up a Docker environment for training. If you have not yet set up your Docker environment, we suggest you pause here and do this first.

First, download the dataset from Common Voice, and extract the archive into your deepspeech-data directory. This makes it available to your Docker container through a bind mount. Start your DeepSpeech Docker container with the deepspeech-data directory as a bind mount (this is covered in the environment section).

Your CV corpus data should be available from within the Docker container.

 root@3de3afbe5d6f:/DeepSpeech# ls  deepspeech-data/cv-corpus-6.1-2020-12-11/id/
 clips    invalidated.tsv  reported.tsv  train.tsv
 dev.tsv  other.tsv        test.tsv      validated.tsv

The deepspeech-training:v0.9.3 Docker image does not come with sox, which is a package used for processing Common Voice data. We need to install sox first.

root@4b39be3b0ffc:/DeepSpeech# apt-get -y update && apt-get install -y sox

Next, we will run the Common Voice importer that ships with DeepSpeech.

root@3de3afbe5d6f:/DeepSpeech# bin/ deepspeech-data/cv-corpus-6.1-2020-12-11/id

This processes all the CV data into the clips directory, where it can now be used for training.


DeepSpeech ships with several scripts that act as importers, preparing a corpus of data for training by DeepSpeech.

If you want to create importers for a new language, or a new corpus, you will need to fork the DeepSpeech repository, then add support for the new language and/or corpus by creating an importer for that language/corpus.

The existing importer scripts are a good starting point for creating your own importers.

They are located in the bin directory of the DeepSpeech repo:

root@3de3afbe5d6f:/DeepSpeech# ls bin | grep import

The importer scripts ensure that the .wav files and corresponding transcriptions are in the .csv format expected by DeepSpeech.
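If you write your own importer, a quick sanity check of its output can catch problems before training starts. The sketch below is one way to do that: it checks the header, that each WAV file exists, that wav_filesize matches the size on disk, and that no transcript is empty. The function name validate_csv is hypothetical, and resolving relative wav_filename values against the CSV's own directory is an assumption that may not match your layout.

```python
import csv
import os

EXPECTED_COLUMNS = ["wav_filename", "wav_filesize", "transcript"]


def validate_csv(csv_path):
    """Return a list of human-readable problems found in csv_path.

    validate_csv is a hypothetical helper, not a DeepSpeech function.
    """
    problems = []
    # Assumption: relative paths are relative to the CSV's directory.
    base_dir = os.path.dirname(os.path.abspath(csv_path))
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != EXPECTED_COLUMNS:
            return [f"unexpected header: {reader.fieldnames}"]
        # Data rows start on line 2, after the header.
        for line_no, row in enumerate(reader, start=2):
            wav_path = os.path.join(base_dir, row["wav_filename"])
            recorded_size = (row["wav_filesize"] or "").strip()
            if not os.path.isfile(wav_path):
                problems.append(f"line {line_no}: missing file {wav_path}")
            elif str(os.path.getsize(wav_path)) != recorded_size:
                problems.append(f"line {line_no}: wav_filesize does not match size on disk")
            if not (row["transcript"] or "").strip():
                problems.append(f"line {line_no}: empty transcript")
    return problems
```

An empty list means the file passed these basic checks; it does not guarantee that the audio itself is in the format DeepSpeech expects.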
