1. Introduction
In this tutorial, we will cover how to train a monophone mixture component model and how to perform the recognition of some samples, using the transLectures-UPV toolkit’s basic, low-level tools.
This tutorial aims to provide a short introduction to TLK. It is not our aim here to describe all of the toolkit’s features, but only to get you acquainted with its most common usage.
For those looking for something less technical, TLK includes a separate self-contained tutorial that covers the whole automatic transcription process of real video lectures, using TLK’s high-level scripts.
2. Feature extraction
Commands used: tLextract
In this section, we explain how to extract MFCC features from WAV audio files using the transLectures-UPV toolkit. We start from a set of WAV files and a file list containing the paths to them, e.g.:
path-wavs/file1.wav
path-wavs/file2.wav
path-wavs/file3.wav
...
First, define a list containing the paths to the feature files that will be extracted. Each line of this file contains the path of the feature file to be generated for the corresponding WAV file, in the same order. For instance:
path-feas/file1.tLfea
path-feas/file2.tLfea
path-feas/file3.tLfea
...
Then, execute the following command to generate the features:
$ tLextract wav-audios.lst tLfeas.lst
These generated feature files can be used for training acoustic models or for recognition.
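Both lists can be generated with standard shell tools. The snippet below is a minimal sketch that assumes the WAV files live under path-wavs/ and the features are to be written under path-feas/, as in the examples above:

$ ls path-wavs/*.wav > wav-audios.lst
$ sed 's|^path-wavs/|path-feas/|; s|\.wav$|.tLfea|' wav-audios.lst > tLfeas.lst
$ mkdir -p path-feas
$ tLextract wav-audios.lst tLfeas.lst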
3. Training
In this section, we will explain how to train an acoustic model using the transLectures-UPV toolkit.
3.1. Preparation
First of all, we assume that three file lists and their corresponding files are provided:
- train.lst: a list of feature files in the transLectures-UPV toolkit format (see Feature extraction for more details).
- train-phonemes.lst: a list of transcription files, each of them containing the phoneme transcription of the corresponding feature file. For instance, the transcription file for the sentence "Hello world" would contain:
  SP hh eh l ow SP w er l d SP
- phonemes.lst: a list of the distinct phonemes that appear in the transcriptions (a way to generate it is sketched after this list). For instance, if only the phonemes of the previous example transcription are available, it would contain:
  hh eh l ow w er d SP
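If phonemes.lst is not given beforehand, it can be derived from the transcription files themselves. The one-liner below is a minimal sketch that assumes, as above, that each transcription file contains whitespace-separated phoneme symbols; note that the symbols will come out in sorted order:

$ awk '{for (i = 1; i <= NF; i++) print $i}' $(cat train-phonemes.lst) | sort -u | xargs > phonemes.lst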
3.2. Prototype creation
The first step to train an acoustic model is to create a prototype model, which is a first initialization of the model parameters for each phoneme. This model is initialized with the mean and variance of all samples.
First, we need to obtain this mean and variance:
$ tLmkproto DGaussian train.lst 39 -o proto
This command creates a file called proto containing the mean and variance of the 39 feature components (corresponding to standard acoustic features, i.e., amplitude, MFCCs and derivatives).
### proto file ###
AMODEL
DGaussian
D 39
SMOOTH 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
N 1
'#'
Q 1
Trans -7.06867
MU 4.16901 ...
VAR 191.646 ...
Next, the prototype model proto_amodel is defined using this mean and variance for each phoneme. Basically, the mean and variance of the proto file are copied for each phoneme in phonemes.lst, along with uniform probabilities for the transitions. Specifically, the prototype model is composed of the following parts:
- A header specifying the global parameters:
  AMODEL
  DGaussian
  D 39
  SMOOTH 0.191646 ...
  N 3
The parameter D specifies the number of feature components. The 39 (in this case) real numbers following the token SMOOTH correspond to the minimum variance that can be assigned to each component. A good rule of thumb is to set them to the previously obtained variances multiplied by a scaling factor of 0.001; e.g., the variance 191.646 from the proto file yields the value 0.191646 above. The value after the token N corresponds to the number of phonemes.
- For each phoneme, its corresponding parameters:
  'a'
  Q 3
  Trans -0.916291 -0.916291 -0.916291
  MU 4.16901 ...
  VAR 191.646 ...
  MU 4.16901 ...
  VAR 191.646 ...
  MU 4.16901 ...
  VAR 191.646 ...
The first line corresponds to the name of the phoneme, enclosed in single quotes ('). The next line contains the token Q, followed by the number of states of this phoneme. Next, the transition probabilities of the phoneme are defined; in this case, they define a left-to-right model with uniform probabilities. Probabilities are specified as logarithms. The following lines correspond to the mean and variance of each state, which are the previously obtained ones.
- Optionally, a special phoneme representing silence or background noise can be added to the model. The definition of this phoneme is the same as for the rest, except that it has only one state, with a self-loop transition. The phoneme would therefore be defined as:
  'SP'
  Q 1
  TransL
  I 1 -0.510826
  F -0.916291
  .
  1 1 -0.510826
  F -0.916291
  .
  MU 4.16901 ...
  VAR 191.646 ...
A full description of the model format can be found in tLformats.
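Since this construction is mechanical, it can be scripted. The script below is a minimal sketch of a hypothetical helper (it is not part of TLK) that builds proto_amodel from the proto file and phonemes.lst; it assumes the line-based layout shown above, a single MU/VAR pair in proto, three states per regular phoneme, and the special one-state phoneme named SP:

#!/bin/bash
# mkproto_amodel.sh (sketch): build proto_amodel from proto and phonemes.lst
D=$(awk '/^D /{print $2}' proto)
MU=$(grep '^MU' proto)
VAR=$(grep '^VAR' proto)
# SMOOTH: the proto variances scaled by the 0.001 rule of thumb
SMOOTH=$(grep '^VAR' proto | \
  awk '{printf "SMOOTH"; for (i = 2; i <= NF; i++) printf " %g", $i * 0.001; print ""}')
PHONEMES=$(cat phonemes.lst)
N=$(echo $PHONEMES | wc -w)
{
  echo "AMODEL"
  echo "DGaussian"
  echo "D $D"
  echo "$SMOOTH"
  echo "N $N"
  for ph in $PHONEMES; do
    echo "'$ph'"
    if [ "$ph" = "SP" ]; then
      # one-state silence model with a self-loop transition
      echo "Q 1"
      echo "TransL"
      echo "I 1 -0.510826"
      echo "F -0.916291"
      echo "."
      echo "1 1 -0.510826"
      echo "F -0.916291"
      echo "."
      echo "$MU"
      echo "$VAR"
    else
      # three-state left-to-right model with uniform (log) transitions
      echo "Q 3"
      echo "Trans -0.916291 -0.916291 -0.916291"
      for state in 1 2 3; do
        echo "$MU"
        echo "$VAR"
      done
    fi
  done
} > proto_amodel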
3.3. Monophone training
Once the prototype model has been created, a monophone model can be estimated. This procedure is performed by calling the tLtrain command. Given a prototype model called proto_amodel, the next command can be executed:
$ tLtrain -m 8 -o monophone_amodel proto_amodel train-phonemes.lst train.lst
A new file called monophone_amodel will be created, which corresponds to the application of eight iterations of the EM algorithm to the original model with the samples provided. The model obtained through this process corresponds to a trained monophone model with one Gaussian per state.
This model can be improved by estimating a mixture component model, in which each state emission is modelled by a mixture of Gaussians. The transLectures-UPV toolkit can train a model with any number of components. This is done by splitting each Gaussian of a model into multiple ones and training them sequentially. Typically, each Gaussian is split in two, followed by four training iterations, and this process is repeated until the desired number of Gaussians is reached. At this point, the number of training iterations is reduced (from eight to four) because the input model is already better estimated.
For instance, from the previous monophone model, we can obtain an equivalent model with mixture Gaussian distributions at the states by executing:
$ tLtomix -i monophone_amodel -o monophone_amodel_I1
And then a model of four mixture components per state can be obtained by executing:
$ tLmumix 0 -i monophone_amodel_I1 -o monophone_amodel_I2
$ tLtrain -m 4 -o monophone_amodel_I2 monophone_amodel_I2 train-phonemes.lst train.lst
$ tLmumix 0 -i monophone_amodel_I2 -o monophone_amodel_I4
$ tLtrain -m 4 -o monophone_amodel_I4 monophone_amodel_I4 train-phonemes.lst train.lst
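This split-and-retrain cycle extends naturally to larger mixtures. The loop below is a minimal sketch that keeps doubling the number of components up to, e.g., 64 per state, following the _I<n> naming convention used above:

# grow the mixture by doubling the components and retraining at each step
in=monophone_amodel_I1
for n in 2 4 8 16 32 64; do
  out=monophone_amodel_I$n
  tLmumix 0 -i $in -o $out
  tLtrain -m 4 -o $out $out train-phonemes.lst train.lst
  in=$out
done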
4. Recognition
Commands used: tLrecognise, tLlmformat
In this section, we will describe how to perform the recognition of a given set of samples. First of all, we need four files:
- amodel, which corresponds to a trained acoustic model (see Training for more details on how to train a model).
- rec.lst, a list of files containing the features of the samples to be recognized (see Feature extraction for more details on how to extract the features from WAVs).
- lexicon, a file containing all words that can be recognized and their corresponding annotation in terms of units (e.g., monophones) of the amodel. The format of this file is defined in tLformats.
- lm, a file containing the language model in the transLectures-UPV toolkit LM format. This file can be created from an ARPA language model file by using tLlmformat.
Then, the recognition is simply performed by executing:
$ tLrecognise lm amodel rec.lst -l lexicon
This command will print one line for each sample containing all the recognized acoustic units. However, it can take a long time to execute. To reduce the time needed to recognize a sample, pruning can be applied to the search. If the pruning parameters are chosen wisely, system performance will not be affected significantly. The next command employs the standard pruning parameters of the POLIMEDIA corpus and will print out the recognition in terms of words:
$ tLrecognise lm amodel rec.lst -l lexicon \
    --beam 180 --nmax-astates 20000 -s SP \
    --word-end-prunning 90 --lookahead-capacity 1000
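For reference, the whole tutorial can be replayed as a single script. This recap is a minimal sketch that simply chains the commands introduced above; proto_amodel, lexicon and lm are assumed to have been prepared as described in the previous sections:

#!/bin/bash
# end-to-end sketch: features, monophone training, mixtures, recognition
tLextract wav-audios.lst tLfeas.lst
tLmkproto DGaussian train.lst 39 -o proto
# ... build proto_amodel from proto and phonemes.lst (see Prototype creation) ...
tLtrain -m 8 -o monophone_amodel proto_amodel train-phonemes.lst train.lst
tLtomix -i monophone_amodel -o monophone_amodel_I1
tLmumix 0 -i monophone_amodel_I1 -o monophone_amodel_I2
tLtrain -m 4 -o monophone_amodel_I2 monophone_amodel_I2 train-phonemes.lst train.lst
tLrecognise lm monophone_amodel_I2 rec.lst -l lexicon \
    --beam 180 --nmax-astates 20000 -s SP \
    --word-end-prunning 90 --lookahead-capacity 1000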