Home | Previous - About DeepSpeech | Next - The alphabet.txt file |
DeepSpeech expects audio files to be WAV format, mono-channel, and with a 16kHz sampling rate.
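As a quick sanity check before training, a clip's format can be verified with Python's standard-library `wave` module. This is a minimal sketch, not part of DeepSpeech itself; the 16-bit sample-width check is an extra assumption beyond the requirements stated above, and the function name is our own:

```python
import wave

def check_wav(path):
    """Return a list of problems that would make this clip unsuitable
    for DeepSpeech training. Assumes the target format is 16 kHz,
    mono, 16-bit PCM WAV (the 16-bit check is our assumption)."""
    problems = []
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1:
            problems.append("not mono")
        if w.getframerate() != 16000:
            problems.append(f"sample rate is {w.getframerate()} Hz, not 16000 Hz")
        if w.getsampwidth() != 2:
            problems.append("not 16-bit samples")
    return problems
```

Running this over your corpus before training is much cheaper than discovering format problems mid-run.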
For training, testing, and development, you need to feed DeepSpeech.py CSV files which contain three columns: wav_filename, wav_filesize, and transcript. The wav_filesize (the file's size in bytes) is used to group together audio of similar lengths for efficient batching.
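Producing such a CSV is straightforward with the standard library. The following sketch is our own (the function name is not part of DeepSpeech); it derives wav_filesize with `os.path.getsize`:

```python
import csv
import os

def write_manifest(pairs, csv_path):
    """Write a DeepSpeech-style CSV from (wav_path, transcript) pairs.
    wav_filesize is simply the WAV file's size in bytes."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in pairs:
            writer.writerow([wav_path, os.path.getsize(wav_path), transcript])
```

You would typically produce three such files, one each for the train, dev and test splits.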
This PlayBook is focused on training a speech recognition model, rather than on collecting the data required for an accurate model. However, a good model starts with good data.
- Ensure that your voice clips are 10-20 seconds in length. If they are longer or shorter than this, your model will be less accurate.
- Ensure that every character in your transcription of a voice clip appears in your alphabet.txt file.
- Ensure that your voice clips exhibit the same sort of diversity you expect to encounter in your runtime audio: a diversity of accents, genders, background noise and so on.
- Ensure that your voice clips are created using microphones similar to those you expect in your runtime audio. For example, if you expect to deploy your model on Android mobile phones, ensure that your training data is recorded on Android mobile phones.
- Ensure that the phrasing of your voice clips covers the phrases you expect to encounter in your runtime audio.
If you are collecting data that will be used to train a speech model, you should remove punctuation marks such as dashes, tick marks, quote marks and so on. These are often confused with one another and can hinder training an accurate model.
Numbers should be written out in full (i.e. as cardinals): that is, as eight rather than 8.
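The cleaning rules above can be sketched as a small normalisation function. This is illustrative only, not DeepSpeech code: the digit map below covers single digits (a real pipeline needs a full number-to-words library), and the punctuation set and alphabet are our assumptions:

```python
import re

# Illustrative single-digit map only; multi-digit numbers would be
# expanded digit by digit, which a real importer should avoid.
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def clean_transcript(text, alphabet):
    """Lower-case a transcript, spell out digits, strip common
    punctuation, and report any characters not in the alphabet."""
    text = text.lower()
    text = re.sub(r"\d", lambda m: f" {DIGITS[m.group(0)]} ", text)
    text = re.sub(r"[-'\"“”‘’,.!?;:]", " ", text)  # assumed punctuation set
    text = re.sub(r"\s+", " ", text).strip()
    missing = sorted(set(text) - set(alphabet))
    return text, missing
```

Any character reported in `missing` must either be removed from the transcript or added to your alphabet.txt file.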
If you are using data from Common Voice for training a model, you will need to prepare it as outlined in the DeepSpeech documentation.
In this example we will prepare the Indonesian dataset for training, but you can use any language from Common Voice that you prefer. We’ve chosen Indonesian as it has the same orthographic alphabet as English, which means we don’t have to use a different alphabet.txt file for training; we can use the default.
This example assumes you have already set up a Docker environment for training. If you have not yet set up your Docker environment, we suggest you pause here and do this first.
First, download the dataset from Common Voice, and extract the archive into your deepspeech-data directory. This makes it available to your Docker container through a bind mount. Start your DeepSpeech Docker container with the deepspeech-data directory as a bind mount (this is covered in the environment section).
Your CV corpus data should be available from within the Docker container.
root@3de3afbe5d6f:/DeepSpeech# ls deepspeech-data/cv-corpus-6.1-2020-12-11/id/
clips invalidated.tsv reported.tsv train.tsv
dev.tsv other.tsv test.tsv validated.tsv
The deepspeech-training:v0.9.3 Docker image does not come with sox, a package used for processing Common Voice data, so we need to install it first.
root@4b39be3b0ffc:/DeepSpeech# apt-get -y update && apt-get install -y sox
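Once installed, sox can also convert source audio into the 16kHz mono WAV format DeepSpeech expects, e.g. `sox input.mp3 -r 16000 -c 1 output.wav`. A small sketch that builds this command line for use with `subprocess` (the helper name and filenames are hypothetical):

```python
import subprocess

def sox_to_16k_mono(src, dst):
    """Build the sox command line that resamples any input sox
    understands to 16 kHz mono WAV: sox SRC -r 16000 -c 1 DST."""
    return ["sox", src, "-r", "16000", "-c", "1", dst]

# To actually run the conversion (requires sox on PATH):
# subprocess.run(sox_to_16k_mono("input.mp3", "output.wav"), check=True)
```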
Next, we will run the Common Voice importer that ships with DeepSpeech.
root@3de3afbe5d6f:/DeepSpeech# bin/import_cv2.py deepspeech-data/cv-corpus-6.1-2020-12-11/id
This will process all the Common Voice data into the clips directory, after which it can be used for training.
DeepSpeech ships with several scripts that act as importers, preparing a corpus of data for training with DeepSpeech.
If you want to train on a new language or a new corpus, you will need to fork the DeepSpeech repository and add an importer for that language or corpus.
The existing importer scripts are a good starting point for creating your own importers.
They are located in the bin directory of the DeepSpeech repo:
root@3de3afbe5d6f:/DeepSpeech# ls | grep import
import_aidatatang.py
import_aishell.py
import_ccpmf.py
import_cv.py
import_cv2.py
import_fisher.py
import_freestmandarin.py
import_gram_vaani.py
import_ldc93s1.py
import_librivox.py
import_lingua_libre.py
import_m-ailabs.py
import_magicdata.py
import_primewords.py
import_slr57.py
import_swb.py
import_swc.py
import_ted.py
import_timit.py
import_ts.py
import_tuda.py
import_vctk.py
import_voxforge.py
The importer scripts ensure that the .wav files and corresponding transcriptions are in the .csv format expected by DeepSpeech.
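The shape of an importer can be sketched as follows. This assumes a hypothetical corpus layout of clips/*.wav files, each with a same-named .txt transcript; real corpora (and the real importers above) differ, but the end product is always the three-column CSV:

```python
import csv
import glob
import os

def import_corpus(corpus_dir, csv_path):
    """Minimal importer sketch: pair every clips/*.wav file with a
    same-named .txt transcript and emit the three-column CSV that
    DeepSpeech expects. Returns the number of rows written."""
    rows = []
    for wav in sorted(glob.glob(os.path.join(corpus_dir, "clips", "*.wav"))):
        txt = os.path.splitext(wav)[0] + ".txt"
        if not os.path.exists(txt):
            continue  # skip clips that have no transcript
        with open(txt, encoding="utf-8") as f:
            transcript = f.read().strip()
        rows.append([wav, os.path.getsize(wav), transcript])
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        writer.writerows(rows)
    return len(rows)
```

A real importer would also split the rows into train, dev and test CSVs and apply the transcript-cleaning rules described earlier.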