1. Introduction
This tutorial aims to provide a full guide on how to transcribe an audio document using the high-level tools of the transLectures-UPV Toolkit (TLK).
Automatically transcribing an audio document is a complex sequential process. First, given a "training" set of audio files with their transcriptions, feature samples have to be extracted; then, an acoustic system is trained from them. Finally, the resulting model is used for the automatic recognition of the audio documents we are interested in (in the case of this tutorial, we will automatically transcribe one "test" audio file).
This sequential process requires the execution of several TLK tools, as well as the processing and generation of auxiliary files such as file lists or intermediate models. In order to simplify these tasks, TLK includes three high-level tools (tLtask-preprocess, tLtask-train and tLtask-recognise). These tools read a configuration file in which a task is defined (i.e. a set of audio samples to be transcribed) and execute the required commands, sparing the user from having to know the details.
In this tutorial, given the example data provided in the Tutorial Data (>1GB), which corresponds to these videos, we will explain how to build a speech recognizer (with the "training" part of the audio data), and how to use it to transcribe audio (the "test" part of the audio data).
Should any error arise during this tutorial, you should look at the specific log files that will be generated, solve the problem, and then re-execute the last command. The tools presented in this tutorial will continue the execution from the last step that was correctly performed.
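This resume behaviour can be pictured with a small sketch (purely illustrative, not TLK code): imagine each completed step leaves a marker file, so that on re-execution the steps whose marker already exists are skipped and the run continues from the first unfinished step.

```python
import os

def run_pipeline(steps, done_dir):
    """Run named steps in order, skipping those already completed.

    Each completed step leaves an empty marker file in done_dir, so a
    re-run after a failure resumes from the first unfinished step.
    (Hypothetical sketch; this is not how TLK is actually implemented.)
    """
    os.makedirs(done_dir, exist_ok=True)
    executed = []
    for name, action in steps:
        marker = os.path.join(done_dir, name + ".done")
        if os.path.exists(marker):
            continue  # already done in a previous run
        action()                       # may raise; marker is not written then
        open(marker, "w").close()      # mark the step as completed
        executed.append(name)
    return executed
```

On a second invocation with the same `done_dir`, no step is re-executed.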
2. Data preparation
First of all, download the Tutorial Data (>1GB) and extract it. Please note that in this tutorial we will assume that you have already installed TLK. Otherwise, please refer to TLK’s Download page and to the README.
After extracting the data, the following files and directories will be created:
tlk-tutorial-data/
tlk-tutorial-data/misc/
tlk-tutorial-data/misc/mono.lex
tlk-tutorial-data/misc/mlm.gz
tlk-tutorial-data/train/
tlk-tutorial-data/train/5199057f-ae11-a340-b7b1-50fc9da12159.wav
tlk-tutorial-data/train/7ffa062b-f46a-fb45-a10e-80f8bba701ad.trs
tlk-tutorial-data/train/e79d8953-e3fd-0649-ad49-24c62e9e7c50.wav
tlk-tutorial-data/train/87e21b74-ccfa-ff47-90a2-6b7c8ff43997.dfxp
tlk-tutorial-data/train/0af34fad-2796-4b4e-831f-3932d0fcd32b.dfxp
tlk-tutorial-data/train/a4accdd4-c3fe-9c41-bbbd-393b67214fd2.trs
tlk-tutorial-data/train/87e21b74-ccfa-ff47-90a2-6b7c8ff43997.trs
tlk-tutorial-data/train/5199057f-ae11-a340-b7b1-50fc9da12159.dfxp
tlk-tutorial-data/train/0b6c7ade-5ce4-3546-9abf-522b811bf14c.dfxp
tlk-tutorial-data/train/e79d8953-e3fd-0649-ad49-24c62e9e7c50.trs
tlk-tutorial-data/train/0af34fad-2796-4b4e-831f-3932d0fcd32b.trs
tlk-tutorial-data/train/87e21b74-ccfa-ff47-90a2-6b7c8ff43997.wav
tlk-tutorial-data/train/a4accdd4-c3fe-9c41-bbbd-393b67214fd2.wav
tlk-tutorial-data/train/0af34fad-2796-4b4e-831f-3932d0fcd32b.wav
tlk-tutorial-data/train/e79d8953-e3fd-0649-ad49-24c62e9e7c50.dfxp
tlk-tutorial-data/train/0b6c7ade-5ce4-3546-9abf-522b811bf14c.wav
tlk-tutorial-data/train/5199057f-ae11-a340-b7b1-50fc9da12159.trs
tlk-tutorial-data/train/0b6c7ade-5ce4-3546-9abf-522b811bf14c.trs
tlk-tutorial-data/train/7ffa062b-f46a-fb45-a10e-80f8bba701ad.wav
tlk-tutorial-data/train/7ffa062b-f46a-fb45-a10e-80f8bba701ad.dfxp
tlk-tutorial-data/train/a4accdd4-c3fe-9c41-bbbd-393b67214fd2.dfxp
tlk-tutorial-data/test/
tlk-tutorial-data/test/035040d6-7fd4-ab4a-80ff-e87d3a5d84db.wav
tlk-tutorial-data/test/035040d6-7fd4-ab4a-80ff-e87d3a5d84db.dfxp
tlk-tutorial-data/test/035040d6-7fd4-ab4a-80ff-e87d3a5d84db.trs
The train directory contains the data that will be used to train the system, while the test directory contains the data that will be automatically transcribed. These are audio files from lectures in Spanish recorded at Universitat Politècnica de València and their annotations in .trs and .dfxp format.
The additional files in the misc directory are a basic language model (LM) for Spanish and a lexicon, both of which will be needed for the recognition. These two files are provided ready-made because TLK does not include utilities to create them. Should you want to generate your own LMs, we recommend the SRILM toolkit for parameter estimation and tLlmformat for converting from ARPA to TLK format. For the lexicon format, please refer to tLformats.
3. Feature extraction
Once the data has been extracted, it has to be converted into a format suitable for TLK. This is done by tLtask-preprocess, which cuts each audio file in a directory into audio samples, i.e. the segments defined in the corresponding transcription files. Specifically, it first extracts standard MFCC features; next, it converts the transcriptions into monophones; and finally, it generates some file lists.
To perform the preprocessing of the data, simply execute
$ tLtask-preprocess es dfxp tlk-tutorial-data/train preprocess-train
$ tLtask-preprocess es dfxp tlk-tutorial-data/test preprocess-test
The first command will process the training data and the second one the test data.
These commands perform the feature extraction of the audio files in tlk-tutorial-data/train and tlk-tutorial-data/test, using the .dfxp transcription files in Spanish (es). The resulting files will be stored in the folders preprocess-train and preprocess-test, respectively. In addition, some messages stating the progress will be printed during the preprocessing. For instance, the first command will print
[II] tLtask-preprocess : Extracting samples from audio files
[II] tLtask-preprocess : Extracting audio and transcription samples from tlk-tutorial-data/train
[II] tLtask-preprocess : - tlk-tutorial-data/train/0af34fad-2796-4b4e-831f-3932d0fcd32b.wav: Done
[II] tLtask-preprocess : - tlk-tutorial-data/train/0b6c7ade-5ce4-3546-9abf-522b811bf14c.wav: Done
[II] tLtask-preprocess : - tlk-tutorial-data/train/5199057f-ae11-a340-b7b1-50fc9da12159.wav: Done
[II] tLtask-preprocess : - tlk-tutorial-data/train/7ffa062b-f46a-fb45-a10e-80f8bba701ad.wav: Done
[II] tLtask-preprocess : - tlk-tutorial-data/train/87e21b74-ccfa-ff47-90a2-6b7c8ff43997.wav: Done
[II] tLtask-preprocess : - tlk-tutorial-data/train/a4accdd4-c3fe-9c41-bbbd-393b67214fd2.wav: Done
[II] tLtask-preprocess : - tlk-tutorial-data/train/e79d8953-e3fd-0649-ad49-24c62e9e7c50.wav: Done
[II] tLtask-preprocess : Parsing 699 .tr files listed in 'preprocess-train/lists/samples.lst'
[II] tLtask-preprocess : Parsed 699 of 699 files (skipped 0)
[II] tLtask-preprocess : Transliterating 699 samples of 699
[II] tLtask-preprocess : Removing samples with empty (SP) transliteration or no .pho
[II] tLtask-preprocess : Train list now has 699 samples
[II] tLtask-preprocess : Generating phonemes list: preprocess-train/lists/samples-uniq-monophos.lst
[II] tLtask-preprocess : Clustering media files 'byProps' to 'preprocess-train/lists/clusters'
[II] tLtask-preprocess : Preprocessing done
Upon completion of this step, all the files required for the training will have been generated.
For the sake of clarity, let’s take a look at some of the most important ones. For instance, given an audio sample defined in the transcription file tlk-tutorial-data/train/0af34fad-2796-4b4e-831f-3932d0fcd32b.dfxp for the audio file tlk-tutorial-data/train/0af34fad-2796-4b4e-831f-3932d0fcd32b.wav
<tl:s sI="1" b="4.315" e="8.887">
  Hola, mi nombre es Germán Moltó, soy profesor de la Escuela Técnica Superior de Ingeniería Informática.
</tl:s>
tLtask-preprocess will create an audio sample (a .tLfea file) in TLK format from time mark 4.315 to 8.887, the content of which can be checked using tLbin2txt as follows
$ tLbin2txt realfea -i preprocess-train/samples/0af34fad-2796-4b4e-831f-3932d0fcd32b/sp/00004.315_00008.887-0af34fad-2796-4b4e-831f-3932d0fcd32b.tLfea
which will produce something like this
AKREALTF 39 455 -2.48114 0.761772 0.46773 1.34051 1.46015 0.570777 1.17656 0.249015 -1.17891 0.350033 -0.128974 1.7411 -4.52278 0.181867 -0.599321 -0.134459 -0.057059 0.733413 0.555112 0.429736 0.367868 0.303181 0.248381 0.220614 -1.16888 0.358793 1.9436 -0.417099 -1.14304 -2.46277 -1.14641 -0.342092 0.35895 -1.80809 1.49 -0.0456795 1.58645 -1.50347 4.24677 . . .
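As the path in the previous command suggests, the sample file name encodes the begin and end time marks, zero-padded to nine characters with three decimals, followed by the media identifier. This naming scheme, inferred from the paths above (it is an observation, not an official TLK API), can be sketched as:

```python
def sample_basename(begin, end, media_id):
    """Build a .tLfea basename in the style seen above: zero-padded
    begin/end times (width 9, 3 decimals) plus the media identifier.
    Hypothetical helper for illustration only."""
    return "{:09.3f}_{:09.3f}-{}.tLfea".format(begin, end, media_id)

print(sample_basename(4.315, 8.887, "0af34fad-2796-4b4e-831f-3932d0fcd32b"))
# 00004.315_00008.887-0af34fad-2796-4b4e-831f-3932d0fcd32b.tLfea
```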
In addition to the .tLfea file, a monophone transcription (.pho) will also be created in the same folder. In this case, the conversion to monophones has been performed using a Spanish transliterator:
SP o l a SP m i SP n o m b r e SP e s SP x e r m a n SP m o l t o SP s o i SP p r o f e s o r SP d e SP l a SP e s k u e l a SP t e k n i k a SP s u p e r i o r SP d e SP i n x e n i e r i a SP i n f o r m a t i k a SP
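A few of the rules visible in this output (silent h, c pronounced k, g before e/i pronounced x, accents stripped) can be pictured with a toy transliterator. This is a deliberately incomplete sketch for illustration; the transliterator actually used by TLK handles many more cases.

```python
import unicodedata

def toy_translit(word):
    """Very rough Spanish letter-to-monophone mapping (toy sketch only)."""
    # Lowercase and strip accent marks: "técnica" -> "tecnica"
    w = "".join(c for c in unicodedata.normalize("NFD", word.lower())
                if unicodedata.category(c) != "Mn")
    phones = []
    for i, c in enumerate(w):
        nxt = w[i + 1] if i + 1 < len(w) else ""
        if c == "h":
            continue                  # h is silent: "hola" -> o l a
        elif c == "c":
            phones.append("k")        # "técnica" -> t e k n i k a
        elif c == "g" and nxt in "ei":
            phones.append("x")        # "ingeniería" -> i n x e ...
        elif c == "v":
            phones.append("b")        # b and v merge in Spanish
        else:
            phones.append(c)
    return " ".join(phones)

print(toy_translit("hola"))        # o l a
print(toy_translit("ingeniería"))  # i n x e n i e r i a
```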
Apart from these files, other files such as a raw text transcription (.am.txt and .lm.txt) and cluster related files (.prop) will be created. However, these files are not needed for this tutorial and will not be analysed.
Finally, multiple file lists will be created inside the directories preprocess-train/lists and preprocess-test/lists. These file lists contain the paths of all the audio samples extracted, as well as of other files required for the training and recognition of these samples.
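Conceptually, such a list is just one sample path per line. The helper below is a hypothetical sketch of how one could be regenerated by walking the samples directory; it is not part of TLK, which writes these lists itself.

```python
import os

def write_sample_list(samples_dir, list_path):
    """Write a sorted list of .tLfea sample paths, one per line (sketch)."""
    paths = sorted(
        os.path.join(root, f)
        for root, _dirs, files in os.walk(samples_dir)
        for f in files
        if f.endswith(".tLfea")          # keep only feature samples
    )
    with open(list_path, "w") as out:
        for p in paths:
            out.write(p + "\n")
    return paths
```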
For a detailed reference of tLtask-preprocess, please refer to its manual (man) page.
4. Training
Given the training data preprocessed by tLtask-preprocess, we can now train an acoustic system using tLtask-train. First of all, create a directory for the training.
$ mkdir training
$ cd training
Then link the two directories inside preprocess-train into the training directory.
$ ln -s ../preprocess-train/samples ../preprocess-train/lists .
Next, create a template config file for tLtask-train by executing:
$ tLtask-train --write-example-config-file > config-file.ini
This configuration file contains the default parameters needed to train a standard acoustic system for Spanish. In order to train a system with the tutorial data, edit the Lists section to include the list created in section Feature extraction. To do so, you can simply edit the following lines:
[Lists]
set-name = lists/samples
...
[General]
...
prefix-name = training-tutorial
Then, execute the following command to perform the training:
$ tLtask-train config-file.ini --log-folder log
Now, tLtask-train will execute all the commands needed to train a standard acoustic system using TLK. Please note that, even though some steps are executed in parallel (depending on the computer), this process might take a long time.
As with tLtask-preprocess, some messages will be printed during the training. For instance, the first lines of the execution should be similar to these:
[II] Writing information about the training in launched-train.2013-12-05_12:09.log
[II] Training standard monophone model
[II] submit.sh : Job Name: training-tutorial.standard.make-proto  Log File(s): log/training-tutorial.standard.make-proto.o
[II] submit.sh : Job Name: training-tutorial.standard.monophone.train-init  Log File(s): log/training-tutorial.standard.monophone.train-init.o
[II] submit.sh : Job Name: training-tutorial.standard.monophone.train-em-MIX01-ITER1-estimate  Log File(s): log/training-tutorial.standard.monophone.train-em-MIX01-ITER1-estimate.o.*  Tasks: 1-10  Max Parallel: 8
[II] submit.sh : Job Name: training-tutorial.standard.monophone.train-em-MIX01-ITER1-update  Log File(s): log/training-tutorial.standard.monophone.train-em-MIX01-ITER1-update.o
[II] submit.sh : Job Name: training-tutorial.standard.monophone.train-em-MIX01-ITER2-estimate  Log File(s): log/training-tutorial.standard.monophone.train-em-MIX01-ITER2-estimate.o.*  Tasks: 1-10  Max Parallel: 8
[II] submit.sh : Job Name: training-tutorial.standard.monophone.train-em-MIX01-ITER2-update  Log File(s): log/training-tutorial.standard.monophone.train-em-MIX01-ITER2-update.o
[II] submit.sh : Job Name: training-tutorial.standard.monophone.train-em-MIX01-ITER3-estimate  Log File(s): log/training-tutorial.standard.monophone.train-em-MIX01-ITER3-estimate.o.*  Tasks: 1-10  Max Parallel: 8
[II] submit.sh : Job Name: training-tutorial.standard.monophone.train-em-MIX01-ITER3-update  Log File(s): log/training-tutorial.standard.monophone.train-em-MIX01-ITER3-update.o
[II] submit.sh : Job Name: training-tutorial.standard.monophone.train-em-MIX01-ITER4-estimate  Log File(s): log/training-tutorial.standard.monophone.train-em-MIX01-ITER4-estimate.o.*  Tasks: 1-10  Max Parallel: 8
[II] submit.sh : Job Name: training-tutorial.standard.monophone.train-em-MIX01-ITER4-update  Log File(s): log/training-tutorial.standard.monophone.train-em-MIX01-ITER4-update.o
[II] submit.sh : Job Name: training-tutorial.standard.monophone.train-em-MIX01-ITER5-estimate  Log File(s): log/training-tutorial.standard.monophone.train-em-MIX01-ITER5-estimate.o.*  Tasks: 1-10  Max Parallel: 8
[II] submit.sh : Job Name: training-tutorial.standard.monophone.train-em-MIX01-ITER5-update  Log File(s): log/training-tutorial.standard.monophone.train-em-MIX01-ITER5-update.o
[II] submit.sh : Job Name: training-tutorial.standard.monophone.train-em-MIX01-ITER6-estimate  Log File(s): log/training-tutorial.standard.monophone.train-em-MIX01-ITER6-estimate.o.*  Tasks: 1-10  Max Parallel: 8
[II] submit.sh : Job Name: training-tutorial.standard.monophone.train-em-MIX01-ITER6-update  Log File(s): log/training-tutorial.standard.monophone.train-em-MIX01-ITER6-update.o
. . .
[II] Training standard triphone model
[II] submit.sh : Job Name: training-tutorial.standard.generate-triphonemes  Log File(s): log/training-tutorial.standard.generate-triphonemes.o
[II] submit.sh : Job Name: training-tutorial.standard.convert-monophoneme-to-triphoneme  Log File(s): log/training-tutorial.standard.convert-monophoneme-to-triphoneme.o
[II] submit.sh : Job Name: training-tutorial.standard.triphoneme.train-init  Log File(s): log/training-tutorial.standard.triphoneme.train-init.o
[II] submit.sh : Job Name: training-tutorial.standard.triphoneme.train-em-MIX01-ITER1-estimate  Log File(s): log/training-tutorial.standard.triphoneme.train-em-MIX01-ITER1-estimate.o.*  Tasks: 1-10  Max Parallel: 8
. . .
[II] Training standard tiedphoneme model
[II] submit.sh : Job Name: training-tutorial.standard.generate-tied-init-model  Log File(s): log/training-tutorial.standard.generate-tied-init-model.o
[II] submit.sh : Job Name: training-tutorial.standard.generate-tied-init-model  Log File(s): log/training-tutorial.standard.generate-tied-init-model.o
[II] submit.sh : Job Name: training-tutorial.standard.generate-tiedphonemes  Log File(s): log/training-tutorial.standard.generate-tiedphonemes.o
[II] submit.sh : Job Name: training-tutorial.standard.tiedphoneme.train-init  Log File(s): log/training-tutorial.standard.tiedphoneme.train-init.o
[II] submit.sh : Job Name: training-tutorial.standard.tiedphoneme.train-em-MIX01-ITER1-estimate  Log File(s): log/training-tutorial.standard.tiedphoneme.train-em-MIX01-ITER1-estimate.o.*  Tasks: 1-10  Max Parallel: 8
. . .
By default, information logs will be stored in the log directory, and the resulting acoustic model in the models directory.
In addition, the launched commands and the configuration file used for the training will be stored in a file called launched-train.DATE_TIME.log.
For a detailed reference of all tLtask-train utilities, please refer to its manual (man) page.
5. Recognition
Once the acoustic system has been trained, we can start transcribing audio files using TLK. Given the preprocessed data generated in Feature extraction, tLtask-recognise executes the necessary commands to obtain its transcription, using the acoustic system obtained in Training.
First of all, create a folder, e.g. recognition.
$ cd ..
$ mkdir recognition
$ cd recognition
Then, inside it, link the samples, lists and references directories of the test data created in Feature extraction, as well as the models directory created in Training.
$ ln -s ../preprocess-test/samples ../preprocess-test/lists \
    ../preprocess-test/references ../training/models \
    .
Also, link the language model (for Spanish) and the lexicon provided with the tutorial data.
$ ln -s \
    ../tlk-tutorial-data/misc/mono.lex \
    ../tlk-tutorial-data/misc/mlm.gz \
    .
Please note that TLK does not include utilities to generate these two files; see the Data preparation section for recommendations on building your own.
Next, tLtask-recognise also needs a configuration file to operate. Generate a template by executing:
$ tLtask-recognise --write-example-config-file > config-file.ini
In order to recognize the test data using the acoustic system, simply edit the following lines of the configuration file:
...
[General]
prefix-name = tutorial
...
[HMM]
prefix-name = training-tutorial
...
[LM]
language-model = mlm.gz
lexicon = mono.lex
...
Finally, execute the following command to transcribe the test audio samples.
$ tLtask-recognise config-file.ini --log-folder log
Similarly to tLtask-train, some messages will be printed stating the progress. For instance, for the tutorial data:
[II] Writing information about the training in launched-recognition.2013-12-05_10:28.log
[II] Submitting the recognition
[II] Recognising the samples using the standard model
[II] submit.sh : Job Name: tutorial.recognition.standard-step1.recognise-init  Log File(s): log/tutorial.recognition.standard-step1.recognise-init.o
[II] submit.sh : Job Name: tutorial.recognition.standard-step1.create-tied-lex  Log File(s): log/tutorial.recognition.standard-step1.create-tied-lex.o
[II] submit.sh : Job Name: tutorial.recognition.standard-step1.recognise-body  Log File(s): log/tutorial.recognition.standard-step1.recognise-body.o.*  Tasks: 1-10  Max Parallel: 8
[II] submit.sh : Job Name: tutorial.recognition.standard-step1.recognise-finish  Log File(s): log/tutorial.recognition.standard-step1.recognise-finish.o
[II] Generating CMLLR samples
[II] submit.sh : Job Name: tutorial.recognition.cmllr.create-clusters  Log File(s): log/tutorial.recognition.cmllr.create-clusters.o
[II] submit.sh : Job Name: tutorial.recognition.cmllr.generate-features-cluster-MEDIA_NAME_995a0a5c_b921_e54a_984f_e41bc20e8bc7.SPEAKER_spk1  Log File(s): log/tutorial.recognition.cmllr.generate-features-cluster-MEDIA_NAME_995a0a5c_b921_e54a_984f_e41bc20e8bc7.SPEAKER_spk1.o
[II] Recognising the samples using the CMLLR model
[II] submit.sh : Job Name: tutorial.recognition.cmllr-step2.recognise-init  Log File(s): log/tutorial.recognition.cmllr-step2.recognise-init.o
[II] submit.sh : Job Name: tutorial.recognition.cmllr-step2.create-tied-lex  Log File(s): log/tutorial.recognition.cmllr-step2.create-tied-lex.o
[II] submit.sh : Job Name: tutorial.recognition.cmllr-step2.recognise-body  Log File(s): log/tutorial.recognition.cmllr-step2.recognise-body.o.*  Tasks: 1-10  Max Parallel: 8
[II] submit.sh : Job Name: tutorial.recognition.cmllr-step2.recognise-finish  Log File(s): log/tutorial.recognition.cmllr-step2.recognise-finish.o
In parallel, log messages will be printed in the log directory.
The resulting transcription will be stored in tutorial/cmllr_step2/transcription.
To measure the quality of the automatic transcription produced, we can compare it to the manual transcription that was included in the Tutorial Data (the "reference" transcription). We have already preprocessed this reference transcription in the previous Feature extraction step.
Now, to perform this comparison, we can use the sclite tool of SCTK. Just execute:
$ sclite -r references/035040d6-7fd4-ab4a-80ff-e87d3a5d84db.stm stm -h tutorial/cmllr_step2/transcription.ctm ctm
This command will compute the Word Error Rate (WER) of the automatic transcription.
This should be the result:
,-----------------------------------------------------------------.
|             tutorial/cmllr_step2/transcription.ctm              |
|-----------------------------------------------------------------|
| SPKR   | # Snt # Wrd  | Corr   Sub    Del    Ins    Err   S.Err |
|--------+--------------+-----------------------------------------|
| upv    |   60   1698  | 80.0   13.8   6.2    3.2    23.2  93.3  |
|=================================================================|
| Sum/Avg|   60   1698  | 80.0   13.8   6.2    3.2    23.2  93.3  |
|=================================================================|
|  Mean  | 60.0  1698.0 | 80.0   13.8   6.2    3.2    23.2  93.3  |
|  S.D.  |  0.0     0.0 |  0.0    0.0   0.0    0.0     0.0   0.0  |
| Median | 60.0  1698.0 | 80.0   13.8   6.2    3.2    23.2  93.3  |
`-----------------------------------------------------------------'
in which a WER of 23.2% is obtained; in other words, 23.2% of the words in the automatic transcription of the test file are erroneous.
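WER is the word-level edit distance between the reference and the hypothesis (substitutions + deletions + insertions), divided by the number of reference words. The core of the computation sclite performs (without its alignment and reporting machinery) can be sketched as:

```python
def wer(reference, hypothesis):
    """Word Error Rate in percent: word edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match/substitution
    return 100.0 * dp[len(ref)][len(hyp)] / len(ref)

print(round(wer("hola mi nombre es german", "hola mi nombre est german"), 1))
# 20.0
```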