TDNN-CTC

This page shows you how to run the yesno recipe. It covers:

    1. Prepare data for training

    2. Train a TDNN model

      1. View text-format logs and visualize TensorBoard logs

      2. Select the device type, i.e., CPU or GPU, for training

      3. Change training options

      4. Resume training from a checkpoint

    3. Decode with a trained model

      1. Select a checkpoint for decoding

      2. Model averaging

    4. Colab notebook

      1. It shows you step by step how to set up the environment, do training, and do decoding

      2. How to use a pre-trained model

    5. Inference with a pre-trained model

      1. Download a pre-trained model provided by us

      2. Decode a single sound file with a pre-trained model

      3. Decode multiple sound files at the same time

It does NOT show you:

    1. How to train with multiple GPUs

    The yesno dataset is so small that a CPU is more than enough for both training and decoding.

    2. How to use LM rescoring for decoding

    The dataset does not come with an LM for rescoring.

Hint

We assume you have read the page Installation and have set up the environment for icefall.

Hint

You don’t need a GPU to run this recipe. It can be run on a CPU. The training part takes less than 30 seconds on a CPU and you will get the following WER at the end:

[test_set] %WER 0.42% [1 / 240, 0 ins, 1 del, 0 sub ]

Data preparation

$ cd egs/yesno/ASR
$ ./prepare.sh

The script ./prepare.sh handles the data preparation for you, automagically. All you need to do is run it.

The data preparation consists of several stages. You can use the following two options:

  • --stage

  • --stop-stage

to control which stage(s) should be run. By default, all stages are executed.

For example,

$ cd egs/yesno/ASR
$ ./prepare.sh --stage 0 --stop-stage 0

means to run only stage 0.

To run stage 2 to stage 5, use:

$ ./prepare.sh --stage 2 --stop-stage 5
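Internally, --stage and --stop-stage follow the common Kaldi-style gating pattern: each stage is wrapped in a range check. A minimal sketch of this pattern (the stage names below are illustrative, not the actual stages of prepare.sh):

```shell
#!/usr/bin/env bash
# Sketch of Kaldi-style stage gating, as used by scripts like ./prepare.sh.
# In the real script, stage/stop_stage are set from --stage/--stop-stage.
stage=2
stop_stage=5

run_stage() {  # usage: run_stage <number> <description>
  if [ "$stage" -le "$1" ] && [ "$stop_stage" -ge "$1" ]; then
    echo "Running stage $1: $2"
  fi
}

run_stage 0 "download data"         # skipped: 0 < stage
run_stage 1 "prepare manifests"     # skipped: 1 < stage
run_stage 2 "compute fbank features"
run_stage 3 "prepare lang"
```

With stage=2 and stop_stage=5, only stages 2 and onward (up to 5) run, matching the ./prepare.sh --stage 2 --stop-stage 5 example above.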

Training

We provide only a TDNN model, contained in the tdnn folder, for yesno.

The command to run the training part is:

$ cd egs/yesno/ASR
$ export CUDA_VISIBLE_DEVICES=""
$ ./tdnn/train.py

By default, it will run 15 epochs. Training logs and checkpoints are saved in tdnn/exp.

In tdnn/exp, you will find the following files:

  • epoch-0.pt, epoch-1.pt, …

    These are checkpoint files, containing the model's state_dict and the optimizer's state_dict. To resume training from a checkpoint, say epoch-10.pt, you can use:

    $ ./tdnn/train.py --start-epoch 11
    
  • tensorboard/

    This folder contains TensorBoard logs. The training loss, validation loss, learning rate, etc., are recorded in these logs. You can visualize them by:

    $ cd tdnn/exp/tensorboard
    $ tensorboard dev upload --logdir . --description "TDNN training for yesno with icefall"
    

    It will print something like below:

    TensorFlow installation not found - running with reduced feature set.
    Upload started and will continue reading any new data as it's added to the logdir.
    
    To stop uploading, press Ctrl-C.
    
    New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/yKUbhb5wRmOSXYkId1z9eg/
    
    [2021-08-23T23:49:41] Started scanning logdir.
    [2021-08-23T23:49:42] Total uploaded: 135 scalars, 0 tensors, 0 binary objects
    Listening for new data in logdir...
    

    Note there is a URL in the above output, click it and you will see the following screenshot:

    Fig. 8 TensorBoard screenshot.

  • log/log-train-xxxx

    It is the detailed training log in text format, the same as the one printed to the console during training.
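The resume logic described above (epoch-10.pt leading to --start-epoch 11) can be sketched in plain Python. Here pickle stands in for torch.save/torch.load, and the key names ("model", "optimizer", "epoch") are assumptions for illustration, not icefall's exact checkpoint layout:

```python
import os
import pickle
import tempfile


def save_checkpoint(path, model_state, optimizer_state, epoch):
    # Store everything needed to resume training (key names are illustrative).
    with open(path, "wb") as f:
        pickle.dump(
            {"model": model_state, "optimizer": optimizer_state, "epoch": epoch}, f
        )


def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)


# Simulate resuming from epoch-10.pt:
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "epoch-10.pt")
    save_checkpoint(path, model_state={"w": 0.5}, optimizer_state={"lr": 0.01}, epoch=10)
    ckpt = load_checkpoint(path)
    start_epoch = ckpt["epoch"] + 1  # 11, matching --start-epoch 11
```

The point is simply that a checkpoint bundles both state dicts plus the epoch number, so training can pick up exactly where it left off.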

Note

By default, ./tdnn/train.py uses GPU 0 for training if GPUs are available. If you have two GPUs, say, GPU 0 and GPU 1, and you want to use GPU 1 for training, you can run:

$ export CUDA_VISIBLE_DEVICES="1"
$ ./tdnn/train.py

Since the yesno dataset is very small, containing only 30 sound files for training, and the model in use is also very small, we use:

$ export CUDA_VISIBLE_DEVICES=""

so that ./tdnn/train.py uses CPU during training.

If you don’t have GPUs, then you don’t need to run export CUDA_VISIBLE_DEVICES="".
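Why the empty string works, in a minimal sketch: CUDA_VISIBLE_DEVICES must be set before CUDA is initialized, and an empty value hides every GPU from the process, so a framework like PyTorch sees no devices and falls back to the CPU (the torch behavior is described in the comment since this sketch avoids importing it):

```python
import os

# Setting CUDA_VISIBLE_DEVICES to an empty string before CUDA is
# initialized hides all GPUs from this process.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# The visible-device list is now empty; in train.py this would make
# torch.cuda.is_available() return False, so the model runs on the CPU.
visible_gpus = [d for d in os.environ["CUDA_VISIBLE_DEVICES"].split(",") if d]
```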

To see available training options, you can use:

$ ./tdnn/train.py --help

Other training options, e.g., the learning rate and the results directory, are pre-configured in the function get_params() in tdnn/train.py. Normally you don't need to change them; if you do, you can modify the code.

Decoding

The decoding part uses checkpoints saved by the training part, so you have to run the training part first.

The command for decoding is:

$ export CUDA_VISIBLE_DEVICES=""
$ ./tdnn/decode.py

You will see the WER in the output log.

Decoded results are saved in tdnn/exp.

$ ./tdnn/decode.py --help

shows you the available decoding options.

Some commonly used options are:

  • --epoch

    It selects which checkpoint to use for decoding. For instance, ./tdnn/decode.py --epoch 10 means to use ./tdnn/exp/epoch-10.pt for decoding.

  • --avg

    It controls model averaging: it specifies the number of checkpoints to average, and the averaged model is used for decoding. For example, the following command:

    $ ./tdnn/decode.py --epoch 10 --avg 3
    

    uses the average of epoch-8.pt, epoch-9.pt and epoch-10.pt for decoding.

  • --export

    If it is True, i.e., ./tdnn/decode.py --export 1, the code will save the averaged model to tdnn/exp/pretrained.pt. See Pre-trained Model for how to use it.
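The averaging behind --avg is an element-wise mean of parameters over the selected checkpoints. A minimal sketch, with plain floats standing in for the tensors that icefall's implementation actually averages:

```python
def epochs_to_average(epoch, avg):
    # --epoch 10 --avg 3 selects epochs 8, 9, 10 (the last `avg` epochs
    # ending at `epoch`).
    return list(range(epoch - avg + 1, epoch + 1))


def average_checkpoints(state_dicts):
    # Element-wise mean over parameters; floats stand in for tensors here.
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n for k in state_dicts[0]}


selected = epochs_to_average(epoch=10, avg=3)  # [8, 9, 10]
averaged = average_checkpoints([{"w": 0.9}, {"w": 1.0}, {"w": 1.1}])
```

Averaging the parameters of the last few epochs often smooths out noise from the final optimization steps, which is why the averaged model is used for decoding.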

Pre-trained Model

We have uploaded the pre-trained model to https://huggingface.co/csukuangfj/icefall_asr_yesno_tdnn.

The following shows you how to use the pre-trained model.

Download the pre-trained model

$ cd egs/yesno/ASR
$ mkdir tmp
$ cd tmp
$ git lfs install
$ git clone https://huggingface.co/csukuangfj/icefall_asr_yesno_tdnn

Caution

You have to use git lfs to download the pre-trained model.

After downloading, you will have the following files:

$ cd egs/yesno/ASR
$ tree tmp
tmp/
`-- icefall_asr_yesno_tdnn
    |-- README.md
    |-- lang_phone
    |   |-- HLG.pt
    |   |-- L.pt
    |   |-- L_disambig.pt
    |   |-- Linv.pt
    |   |-- lexicon.txt
    |   |-- lexicon_disambig.txt
    |   |-- tokens.txt
    |   `-- words.txt
    |-- lm
    |   |-- G.arpa
    |   `-- G.fst.txt
    |-- pretrained.pt
    `-- test_waves
        |-- 0_0_0_1_0_0_0_1.wav
        |-- 0_0_1_0_0_0_1_0.wav
        |-- 0_0_1_0_0_1_1_1.wav
        |-- 0_0_1_0_1_0_0_1.wav
        |-- 0_0_1_1_0_0_0_1.wav
        |-- 0_0_1_1_0_1_1_0.wav
        |-- 0_0_1_1_1_0_0_0.wav
        |-- 0_0_1_1_1_1_0_0.wav
        |-- 0_1_0_0_0_1_0_0.wav
        |-- 0_1_0_0_1_0_1_0.wav
        |-- 0_1_0_1_0_0_0_0.wav
        |-- 0_1_0_1_1_1_0_0.wav
        |-- 0_1_1_0_0_1_1_1.wav
        |-- 0_1_1_1_0_0_1_0.wav
        |-- 0_1_1_1_1_0_1_0.wav
        |-- 1_0_0_0_0_0_0_0.wav
        |-- 1_0_0_0_0_0_1_1.wav
        |-- 1_0_0_1_0_1_1_1.wav
        |-- 1_0_1_1_0_1_1_1.wav
        |-- 1_0_1_1_1_1_0_1.wav
        |-- 1_1_0_0_0_1_1_1.wav
        |-- 1_1_0_0_1_0_1_1.wav
        |-- 1_1_0_1_0_1_0_0.wav
        |-- 1_1_0_1_1_0_0_1.wav
        |-- 1_1_0_1_1_1_1_0.wav
        |-- 1_1_1_0_0_1_0_1.wav
        |-- 1_1_1_0_1_0_1_0.wav
        |-- 1_1_1_1_0_0_1_0.wav
        |-- 1_1_1_1_1_0_0_0.wav
        `-- 1_1_1_1_1_1_1_1.wav

4 directories, 42 files
$ soxi tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav

Input File     : 'tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav'
Channels       : 1
Sample Rate    : 8000
Precision      : 16-bit
Duration       : 00:00:06.76 = 54080 samples ~ 507 CDDA sectors
File Size      : 108k
Bit Rate       : 128k
Sample Encoding: 16-bit Signed Integer PCM
  • 0_0_1_0_1_0_0_1.wav

    In the file name, 0 means NO and 1 means YES. The words are spoken not in English but in Hebrew. So this file contains NO NO YES NO YES NO NO YES.
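Since the file name encodes the transcript, the expected output can be read directly off it. A small helper to illustrate the naming scheme:

```python
def transcript_from_filename(name: str) -> str:
    # "0_0_1_0_1_0_0_1.wav" -> "NO NO YES NO YES NO NO YES"
    stem = name.rsplit(".", 1)[0]
    return " ".join("YES" if d == "1" else "NO" for d in stem.split("_"))
```

This is handy for checking decoding results against the ground truth by eye.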

Download kaldifeat

kaldifeat is used to extract features from one or more sound files. Please refer to https://github.com/csukuangfj/kaldifeat to install kaldifeat first.

Inference with a pre-trained model

$ cd egs/yesno/ASR
$ ./tdnn/pretrained.py --help

shows the usage information of ./tdnn/pretrained.py.

To decode a single file, we can use:

./tdnn/pretrained.py \
  --checkpoint ./tmp/icefall_asr_yesno_tdnn/pretrained.pt \
  --words-file ./tmp/icefall_asr_yesno_tdnn/lang_phone/words.txt \
  --HLG ./tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt \
  ./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav

The output is:

2021-08-24 12:22:51,621 INFO [pretrained.py:119] {'feature_dim': 23, 'num_classes': 4, 'sample_rate': 8000, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'checkpoint': './tmp/icefall_asr_yesno_tdnn/pretrained.pt', 'words_file': './tmp/icefall_asr_yesno_tdnn/lang_phone/words.txt', 'HLG': './tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt', 'sound_files': ['./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav']}
2021-08-24 12:22:51,645 INFO [pretrained.py:125] device: cpu
2021-08-24 12:22:51,645 INFO [pretrained.py:127] Creating model
2021-08-24 12:22:51,650 INFO [pretrained.py:139] Loading HLG from ./tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt
2021-08-24 12:22:51,651 INFO [pretrained.py:143] Constructing Fbank computer
2021-08-24 12:22:51,652 INFO [pretrained.py:153] Reading sound files: ['./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav']
2021-08-24 12:22:51,684 INFO [pretrained.py:159] Decoding started
2021-08-24 12:22:51,708 INFO [pretrained.py:198]
./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav:
NO NO YES NO YES NO NO YES


2021-08-24 12:22:51,708 INFO [pretrained.py:200] Decoding Done

You can see that for the sound file 0_0_1_0_1_0_0_1.wav, the decoding result is NO NO YES NO YES NO NO YES.

To decode multiple files at the same time, you can use

./tdnn/pretrained.py \
  --checkpoint ./tmp/icefall_asr_yesno_tdnn/pretrained.pt \
  --words-file ./tmp/icefall_asr_yesno_tdnn/lang_phone/words.txt \
  --HLG ./tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt \
  ./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav \
  ./tmp/icefall_asr_yesno_tdnn/test_waves/1_0_1_1_0_1_1_1.wav

The decoding output is:

2021-08-24 12:25:20,159 INFO [pretrained.py:119] {'feature_dim': 23, 'num_classes': 4, 'sample_rate': 8000, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'checkpoint': './tmp/icefall_asr_yesno_tdnn/pretrained.pt', 'words_file': './tmp/icefall_asr_yesno_tdnn/lang_phone/words.txt', 'HLG': './tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt', 'sound_files': ['./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav', './tmp/icefall_asr_yesno_tdnn/test_waves/1_0_1_1_0_1_1_1.wav']}
2021-08-24 12:25:20,181 INFO [pretrained.py:125] device: cpu
2021-08-24 12:25:20,181 INFO [pretrained.py:127] Creating model
2021-08-24 12:25:20,185 INFO [pretrained.py:139] Loading HLG from ./tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt
2021-08-24 12:25:20,186 INFO [pretrained.py:143] Constructing Fbank computer
2021-08-24 12:25:20,187 INFO [pretrained.py:153] Reading sound files: ['./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav',
'./tmp/icefall_asr_yesno_tdnn/test_waves/1_0_1_1_0_1_1_1.wav']
2021-08-24 12:25:20,213 INFO [pretrained.py:159] Decoding started
2021-08-24 12:25:20,287 INFO [pretrained.py:198]
./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav:
NO NO YES NO YES NO NO YES

./tmp/icefall_asr_yesno_tdnn/test_waves/1_0_1_1_0_1_1_1.wav:
YES NO YES YES NO YES YES YES

2021-08-24 12:25:20,287 INFO [pretrained.py:200] Decoding Done

You can see again that it decodes correctly.

Colab notebook

We provide a Colab notebook for this recipe.

yesno colab notebook

Congratulations! You have finished the simplest speech recognition recipe in icefall.