TDNN-CTC

This page shows you how to run the yesno recipe. It covers:

    1. Prepare data for training

    2. Train a TDNN model

      1. View text-format logs and visualize TensorBoard logs

      2. Select the device type, i.e., CPU or GPU, for training

      3. Change training options

      4. Resume training from a checkpoint

    3. Decode with a trained model

      1. Select a checkpoint for decoding

      2. Model averaging

    4. Colab notebook

      1. It shows you step by step how to set up the environment, do training, and do decoding

      2. How to use a pre-trained model

    5. Inference with a pre-trained model

      1. Download a pre-trained model provided by us

      2. Decode a single sound file with a pre-trained model

      3. Decode multiple sound files at the same time

It does NOT show you:

    1. How to train with multiple GPUs

    The yesno dataset is so small that a CPU is more than enough for both training and decoding.

    2. How to use LM rescoring for decoding

    The dataset does not come with an LM for rescoring.

Hint

We assume you have read the page Installation and have set up the environment for icefall.

Hint

You don’t need a GPU to run this recipe. It can be run on a CPU. The training part takes less than 30 seconds on a CPU and you will get the following WER at the end:

[test_set] %WER 0.42% [1 / 240, 0 ins, 1 del, 0 sub ]

Data preparation

$ cd egs/yesno/ASR
$ ./prepare.sh

The script ./prepare.sh handles the data preparation for you, automagically. All you need to do is run it.

The data preparation consists of several stages. You can use the following two options:

  • --stage

  • --stop-stage

to control which stage(s) should be run. By default, all stages are executed.

For example,

$ cd egs/yesno/ASR
$ ./prepare.sh --stage 0 --stop-stage 0

means to run only stage 0.

To run stage 2 to stage 5, use:

$ ./prepare.sh --stage 2 --stop-stage 5
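Internally, --stage and --stop-stage follow the common Kaldi-style gating pattern: each stage is wrapped in a range check. A minimal sketch of this pattern (the stage names below are illustrative, not the actual stages of prepare.sh):

```shell
#!/usr/bin/env bash
# Sketch of Kaldi-style stage gating, as used by scripts like ./prepare.sh.
# In the real script, stage/stop_stage are set from --stage/--stop-stage.
stage=2
stop_stage=5

run_stage() {  # usage: run_stage <number> <description>
  if [ "$stage" -le "$1" ] && [ "$stop_stage" -ge "$1" ]; then
    echo "Running stage $1: $2"
  fi
}

run_stage 0 "download data"         # skipped: 0 < stage
run_stage 1 "prepare manifests"     # skipped: 1 < stage
run_stage 2 "compute fbank features"
run_stage 3 "prepare lang"
```

With stage=2 and stop_stage=5, only stages 2 and onward (up to 5) run, matching the ./prepare.sh --stage 2 --stop-stage 5 example above.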

Training

We provide only a TDNN model, contained in the tdnn folder, for yesno.

The command to run the training part is:

$ cd egs/yesno/ASR
$ export CUDA_VISIBLE_DEVICES=""
$ ./tdnn/train.py

By default, it will run 15 epochs. Training logs and checkpoints are saved in tdnn/exp.

In tdnn/exp, you will find the following files:

  • epoch-0.pt, epoch-1.pt, …

    These are checkpoint files, containing the model's state_dict and the optimizer's state_dict. To resume training from a checkpoint, say epoch-10.pt, you can use:

    $ ./tdnn/train.py --start-epoch 11
    
  • tensorboard/

    This folder contains TensorBoard logs. The training loss, validation loss, learning rate, etc., are recorded in these logs. You can visualize them by:

    $ cd tdnn/exp/tensorboard
    $ tensorboard dev upload --logdir . --description "TDNN training for yesno with icefall"
    

    It will print something like below:

    TensorFlow installation not found - running with reduced feature set.
    Upload started and will continue reading any new data as it's added to the logdir.
    
    To stop uploading, press Ctrl-C.
    
    New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/yKUbhb5wRmOSXYkId1z9eg/
    
    [2021-08-23T23:49:41] Started scanning logdir.
    [2021-08-23T23:49:42] Total uploaded: 135 scalars, 0 tensors, 0 binary objects
    Listening for new data in logdir...
    

    Note there is a URL in the above output, click it and you will see the following screenshot:

    Fig. 8 TensorBoard screenshot.

  • log/log-train-xxxx

    It is the detailed training log in text format, the same as the one printed to the console during training.
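The resume logic described above (epoch-10.pt leading to --start-epoch 11) can be sketched in plain Python. Here pickle stands in for torch.save/torch.load, and the key names ("model", "optimizer", "epoch") are assumptions for illustration, not icefall's exact checkpoint layout:

```python
import os
import pickle
import tempfile


def save_checkpoint(path, model_state, optimizer_state, epoch):
    # Store everything needed to resume training (key names are illustrative).
    with open(path, "wb") as f:
        pickle.dump(
            {"model": model_state, "optimizer": optimizer_state, "epoch": epoch}, f
        )


def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)


# Simulate resuming from epoch-10.pt:
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "epoch-10.pt")
    save_checkpoint(path, model_state={"w": 0.5}, optimizer_state={"lr": 0.01}, epoch=10)
    ckpt = load_checkpoint(path)
    start_epoch = ckpt["epoch"] + 1  # 11, matching --start-epoch 11
```

The point is simply that a checkpoint bundles both state dicts plus the epoch number, so training can pick up exactly where it left off.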

Note

By default, ./tdnn/train.py uses GPU 0 for training if GPUs are available. If you have two GPUs, say, GPU 0 and GPU 1, and you want to use GPU 1 for training, you can run:

$ export CUDA_VISIBLE_DEVICES="1"
$ ./tdnn/train.py

Since the yesno dataset is very small, containing only 30 sound files for training, and the model in use is also very small, we use:

$ export CUDA_VISIBLE_DEVICES=""

so that ./tdnn/train.py uses CPU during training.

If you don’t have GPUs, then you don’t need to run export CUDA_VISIBLE_DEVICES="".
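Why the empty string works, in a minimal sketch: CUDA_VISIBLE_DEVICES must be set before CUDA is initialized, and an empty value hides every GPU from the process, so a framework like PyTorch sees no devices and falls back to the CPU (the torch behavior is described in the comment since this sketch avoids importing it):

```python
import os

# Setting CUDA_VISIBLE_DEVICES to an empty string before CUDA is
# initialized hides all GPUs from this process.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# The visible-device list is now empty; in train.py this would make
# torch.cuda.is_available() return False, so the model runs on the CPU.
visible_gpus = [d for d in os.environ["CUDA_VISIBLE_DEVICES"].split(",") if d]
```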

To see available training options, you can use:

$ ./tdnn/train.py --help

Other training options, e.g., the learning rate and the results directory, are pre-configured in the function get_params() in tdnn/train.py. Normally you don't need to change them; if you do, you can modify the code.

Decoding

The decoding part uses checkpoints saved by the training part, so you have to run the training part first.

The command for decoding is:

$ export CUDA_VISIBLE_DEVICES=""
$ ./tdnn/decode.py

You will see the WER in the output log.

Decoded results are saved in tdnn/exp.

$ ./tdnn/decode.py --help

shows you the available decoding options.

Some commonly used options are:

  • --epoch

    It selects which checkpoint to use for decoding. For instance, ./tdnn/decode.py --epoch 10 means to use ./tdnn/exp/epoch-10.pt for decoding.

  • --avg

    It controls model averaging: it specifies the number of checkpoints to average, and the averaged model is used for decoding. For example, the following command:

    $ ./tdnn/decode.py --epoch 10 --avg 3
    

    uses the average of epoch-8.pt, epoch-9.pt and epoch-10.pt for decoding.

  • --export

    If it is True, i.e., ./tdnn/decode.py --export 1, the code will save the averaged model to tdnn/exp/pretrained.pt. See Pre-trained Model for how to use it.
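The averaging behind --avg is an element-wise mean of parameters over the selected checkpoints. A minimal sketch, with plain floats standing in for the tensors that icefall's implementation actually averages:

```python
def epochs_to_average(epoch, avg):
    # --epoch 10 --avg 3 selects epochs 8, 9, 10 (the last `avg` epochs
    # ending at `epoch`).
    return list(range(epoch - avg + 1, epoch + 1))


def average_checkpoints(state_dicts):
    # Element-wise mean over parameters; floats stand in for tensors here.
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n for k in state_dicts[0]}


selected = epochs_to_average(epoch=10, avg=3)  # [8, 9, 10]
averaged = average_checkpoints([{"w": 0.9}, {"w": 1.0}, {"w": 1.1}])
```

Averaging the parameters of the last few epochs often smooths out noise from the final optimization steps, which is why the averaged model is used for decoding.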

Pre-trained Model

We have uploaded the pre-trained model to https://huggingface.co/csukuangfj/icefall_asr_yesno_tdnn.

The following shows you how to use the pre-trained model.

Download the pre-trained model

$ cd egs/yesno/ASR
$ mkdir tmp
$ cd tmp
$ git lfs install
$ git clone https://huggingface.co/csukuangfj/icefall_asr_yesno_tdnn

Caution

You have to use git lfs to download the pre-trained model.

After downloading, you will have the following files:

$ cd egs/yesno/ASR
$ tree tmp
tmp/
`-- icefall_asr_yesno_tdnn
    |-- README.md
    |-- lang_phone
    |   |-- HLG.pt
    |   |-- L.pt
    |   |-- L_disambig.pt
    |   |-- Linv.pt
    |   |-- lexicon.txt
    |   |-- lexicon_disambig.txt
    |   |-- tokens.txt
    |   `-- words.txt
    |-- lm
    |   |-- G.arpa
    |   `-- G.fst.txt
    |-- pretrained.pt
    `-- test_waves
        |-- 0_0_0_1_0_0_0_1.wav
        |-- 0_0_1_0_0_0_1_0.wav
        |-- 0_0_1_0_0_1_1_1.wav
        |-- 0_0_1_0_1_0_0_1.wav
        |-- 0_0_1_1_0_0_0_1.wav
        |-- 0_0_1_1_0_1_1_0.wav
        |-- 0_0_1_1_1_0_0_0.wav
        |-- 0_0_1_1_1_1_0_0.wav
        |-- 0_1_0_0_0_1_0_0.wav
        |-- 0_1_0_0_1_0_1_0.wav
        |-- 0_1_0_1_0_0_0_0.wav
        |-- 0_1_0_1_1_1_0_0.wav
        |-- 0_1_1_0_0_1_1_1.wav
        |-- 0_1_1_1_0_0_1_0.wav
        |-- 0_1_1_1_1_0_1_0.wav
        |-- 1_0_0_0_0_0_0_0.wav
        |-- 1_0_0_0_0_0_1_1.wav
        |-- 1_0_0_1_0_1_1_1.wav
        |-- 1_0_1_1_0_1_1_1.wav
        |-- 1_0_1_1_1_1_0_1.wav
        |-- 1_1_0_0_0_1_1_1.wav
        |-- 1_1_0_0_1_0_1_1.wav
        |-- 1_1_0_1_0_1_0_0.wav
        |-- 1_1_0_1_1_0_0_1.wav
        |-- 1_1_0_1_1_1_1_0.wav
        |-- 1_1_1_0_0_1_0_1.wav
        |-- 1_1_1_0_1_0_1_0.wav
        |-- 1_1_1_1_0_0_1_0.wav
        |-- 1_1_1_1_1_0_0_0.wav
        `-- 1_1_1_1_1_1_1_1.wav

4 directories, 42 files
$ soxi tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav

Input File     : 'tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav'
Channels       : 1
Sample Rate    : 8000
Precision      : 16-bit
Duration       : 00:00:06.76 = 54080 samples ~ 507 CDDA sectors
File Size      : 108k
Bit Rate       : 128k
Sample Encoding: 16-bit Signed Integer PCM
  • 0_0_1_0_1_0_0_1.wav

    In the file name, 0 means NO and 1 means YES. The words are spoken not in English but in Hebrew. So this file contains NO NO YES NO YES NO NO YES.
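Since the file name encodes the transcript, the expected output can be read directly off it. A small helper to illustrate the naming scheme:

```python
def transcript_from_filename(name: str) -> str:
    # "0_0_1_0_1_0_0_1.wav" -> "NO NO YES NO YES NO NO YES"
    stem = name.rsplit(".", 1)[0]
    return " ".join("YES" if d == "1" else "NO" for d in stem.split("_"))
```

This is handy for checking decoding results against the ground truth by eye.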

Download kaldifeat

kaldifeat is used to extract features from one or more sound files. Please refer to https://github.com/csukuangfj/kaldifeat to install kaldifeat first.

Inference with a pre-trained model

$ cd egs/yesno/ASR
$ ./tdnn/pretrained.py --help

shows the usage information of ./tdnn/pretrained.py.

To decode a single file, we can use:

./tdnn/pretrained.py \
  --checkpoint ./tmp/icefall_asr_yesno_tdnn/pretrained.pt \
  --words-file ./tmp/icefall_asr_yesno_tdnn/lang_phone/words.txt \
  --HLG ./tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt \
  ./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav

The output is:

2021-08-24 12:22:51,621 INFO [pretrained.py:119] {'feature_dim': 23, 'num_classes': 4, 'sample_rate': 8000, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'checkpoint': './tmp/icefall_asr_yesno_tdnn/pretrained.pt', 'words_file': './tmp/icefall_asr_yesno_tdnn/lang_phone/words.txt', 'HLG': './tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt', 'sound_files': ['./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav']}
2021-08-24 12:22:51,645 INFO [pretrained.py:125] device: cpu
2021-08-24 12:22:51,645 INFO [pretrained.py:127] Creating model
2021-08-24 12:22:51,650 INFO [pretrained.py:139] Loading HLG from ./tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt
2021-08-24 12:22:51,651 INFO [pretrained.py:143] Constructing Fbank computer
2021-08-24 12:22:51,652 INFO [pretrained.py:153] Reading sound files: ['./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav']
2021-08-24 12:22:51,684 INFO [pretrained.py:159] Decoding started
2021-08-24 12:22:51,708 INFO [pretrained.py:198]
./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav:
NO NO YES NO YES NO NO YES


2021-08-24 12:22:51,708 INFO [pretrained.py:200] Decoding Done

You can see that for the sound file 0_0_1_0_1_0_0_1.wav, the decoding result is NO NO YES NO YES NO NO YES.

To decode multiple files at the same time, you can use

./tdnn/pretrained.py \
  --checkpoint ./tmp/icefall_asr_yesno_tdnn/pretrained.pt \
  --words-file ./tmp/icefall_asr_yesno_tdnn/lang_phone/words.txt \
  --HLG ./tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt \
  ./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav \
  ./tmp/icefall_asr_yesno_tdnn/test_waves/1_0_1_1_0_1_1_1.wav

The decoding output is:

2021-08-24 12:25:20,159 INFO [pretrained.py:119] {'feature_dim': 23, 'num_classes': 4, 'sample_rate': 8000, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'checkpoint': './tmp/icefall_asr_yesno_tdnn/pretrained.pt', 'words_file': './tmp/icefall_asr_yesno_tdnn/lang_phone/words.txt', 'HLG': './tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt', 'sound_files': ['./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav', './tmp/icefall_asr_yesno_tdnn/test_waves/1_0_1_1_0_1_1_1.wav']}
2021-08-24 12:25:20,181 INFO [pretrained.py:125] device: cpu
2021-08-24 12:25:20,181 INFO [pretrained.py:127] Creating model
2021-08-24 12:25:20,185 INFO [pretrained.py:139] Loading HLG from ./tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt
2021-08-24 12:25:20,186 INFO [pretrained.py:143] Constructing Fbank computer
2021-08-24 12:25:20,187 INFO [pretrained.py:153] Reading sound files: ['./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav',
'./tmp/icefall_asr_yesno_tdnn/test_waves/1_0_1_1_0_1_1_1.wav']
2021-08-24 12:25:20,213 INFO [pretrained.py:159] Decoding started
2021-08-24 12:25:20,287 INFO [pretrained.py:198]
./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav:
NO NO YES NO YES NO NO YES

./tmp/icefall_asr_yesno_tdnn/test_waves/1_0_1_1_0_1_1_1.wav:
YES NO YES YES NO YES YES YES

2021-08-24 12:25:20,287 INFO [pretrained.py:200] Decoding Done

You can see again that it decodes correctly.

Colab notebook

We provide a Colab notebook for this recipe.

yesno colab notebook

Congratulations! You have finished the simplest speech recognition recipe in icefall.