TDNN-LiGRU-CTC

This tutorial shows you how to run a TDNN-LiGRU-CTC model with the TIMIT dataset.

Hint

We assume you have read the page Installation and have set up the environment for icefall.

Data preparation

$ cd egs/timit/ASR
$ ./prepare.sh

The script ./prepare.sh handles the data preparation for you automatically. All you need to do is run it.

The data preparation consists of several stages. You can use the following two options:

  • --stage

  • --stop-stage

to control which stage(s) should be run. By default, all stages are executed.

For example,

$ cd egs/timit/ASR
$ ./prepare.sh --stage 0 --stop-stage 0

means to run only stage 0.

To run stage 2 to stage 5, use:

$ ./prepare.sh --stage 2 --stop-stage 5
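
The two options select an inclusive range of stages. The Python sketch below only illustrates that semantics; prepare.sh itself is a shell script, and the function name stages_to_run and the last_stage value are hypothetical.

```python
from typing import List


def stages_to_run(stage: int = 0, stop_stage: int = 100, last_stage: int = 5) -> List[int]:
    """Return the stage numbers selected by --stage and --stop-stage.

    last_stage is a made-up value for illustration; the real number of
    stages is defined inside prepare.sh.
    """
    first = max(stage, 0)
    last = min(stop_stage, last_stage)
    return list(range(first, last + 1))


print(stages_to_run(0, 0))  # only stage 0
print(stages_to_run(2, 5))  # stages 2 through 5
```

Both bounds are inclusive, which is why --stage 0 --stop-stage 0 runs exactly one stage.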

Training

This section describes the training of the TDNN-LiGRU-CTC model, contained in the tdnn_ligru_ctc folder.

Hint

TIMIT is a very small dataset, so one GPU is enough.

The command to run the training part is:

$ cd egs/timit/ASR
$ export CUDA_VISIBLE_DEVICES="0"
$ ./tdnn_ligru_ctc/train.py

By default, it will run 25 epochs. Training logs and checkpoints are saved in tdnn_ligru_ctc/exp.

In tdnn_ligru_ctc/exp, you will find the following files:

  • epoch-0.pt, epoch-1.pt, …, epoch-29.pt

    These are checkpoint files, containing the model state_dict and the optimizer state_dict. To resume training from a checkpoint, say epoch-10.pt, you can use:

    $ ./tdnn_ligru_ctc/train.py --start-epoch 11
    
  • tensorboard/

    This folder contains TensorBoard logs. Training loss, validation loss, learning rate, etc., are recorded in these logs. You can visualize them with:

    $ cd tdnn_ligru_ctc/exp/tensorboard
    $ tensorboard dev upload --logdir . --description "TDNN ligru training for timit with icefall"
    
  • log/log-train-xxxx

    This is the detailed training log in text format, the same as the one printed to the console during training.
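
The relation between --start-epoch and the checkpoint that is loaded can be written down as a small helper. This is a minimal sketch based on the example above (--start-epoch 11 resumes from epoch-10.pt); the function name is hypothetical, and treating 0 as "train from scratch" is an assumption.

```python
from pathlib import Path
from typing import Optional


def checkpoint_for_start_epoch(
    start_epoch: int, exp_dir: str = "tdnn_ligru_ctc/exp"
) -> Optional[Path]:
    """Return the checkpoint that --start-epoch N resumes from, i.e. epoch-(N-1).pt."""
    if start_epoch <= 0:
        # Assumption: nothing is loaded and training starts from scratch.
        return None
    return Path(exp_dir) / f"epoch-{start_epoch - 1}.pt"


print(checkpoint_for_start_epoch(11))
```

Checking that the returned file exists before launching train.py is a cheap way to catch an off-by-one epoch number.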

To see available training options, you can use:

$ ./tdnn_ligru_ctc/train.py --help

Other training options, e.g., learning rate, results dir, etc., are pre-configured in the function get_params() in tdnn_ligru_ctc/train.py. Normally, you don’t need to change them. If you do want to change them, modify the code.

Decoding

The decoding part uses checkpoints saved by the training part, so you have to run the training part first.

The command for decoding is:

$ export CUDA_VISIBLE_DEVICES="0"
$ ./tdnn_ligru_ctc/decode.py

You will see the WER in the output log.

Decoded results are saved in tdnn_ligru_ctc/exp.
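
The error rate reported in the log is an edit-distance ratio between the reference and the decoded token sequence. The sketch below shows that computation on made-up phone strings; the function names are hypothetical, and the scoring code used by decode.py may differ in its details.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def error_rate(ref: str, hyp: str) -> float:
    """Edit distance divided by the number of reference tokens."""
    ref_tokens, hyp_tokens = ref.split(), hyp.split()
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# Made-up example: one substitution (ih -> ah) and one deletion (final sil).
print(error_rate("sil dh ih s sil", "sil dh ah s"))  # 0.4
```

Because the TIMIT recipe decodes phone sequences, each token here is a phone rather than an orthographic word.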

$ ./tdnn_ligru_ctc/decode.py --help

shows you the available decoding options.

Some commonly used options are:

  • --epoch

    It selects which checkpoint to use for decoding. For instance, ./tdnn_ligru_ctc/decode.py --epoch 10 uses ./tdnn_ligru_ctc/exp/epoch-10.pt for decoding.

  • --avg

    It’s related to model averaging. It specifies the number of checkpoints to be averaged, counting back from the one selected by --epoch. The averaged model is used for decoding. For example, the following command:

    $ ./tdnn_ligru_ctc/decode.py --epoch 25 --avg 17
    

    uses the average of epoch-9.pt, epoch-10.pt, epoch-11.pt, epoch-12.pt, epoch-13.pt, epoch-14.pt, epoch-15.pt, epoch-16.pt, epoch-17.pt, epoch-18.pt, epoch-19.pt, epoch-20.pt, epoch-21.pt, epoch-22.pt, epoch-23.pt, epoch-24.pt and epoch-25.pt for decoding.

  • --export

    If it is True, i.e., ./tdnn_ligru_ctc/decode.py --export 1, the code will save the averaged model to tdnn_ligru_ctc/exp/pretrained.pt. See Pre-trained Model for how to use it.
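
The way --epoch and --avg pick checkpoints, and what averaging means, can be sketched as follows. This is an illustration under stated assumptions: the helper names are hypothetical, plain Python lists stand in for the tensors of a real state_dict, and the averaging code in icefall may handle details differently.

```python
from typing import Dict, List


def checkpoints_to_average(epoch: int, avg: int) -> List[str]:
    """File names selected by --epoch and --avg: the last `avg` checkpoints ending at `epoch`."""
    start = epoch - avg + 1
    if avg < 1 or start < 0:
        raise ValueError("need 1 <= avg <= epoch + 1")
    return [f"epoch-{i}.pt" for i in range(start, epoch + 1)]


def average_params(states: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Element-wise mean of parameter dicts (lists stand in for tensors)."""
    n = len(states)
    return {
        name: [sum(values) / n for values in zip(*(s[name] for s in states))]
        for name in states[0]
    }


names = checkpoints_to_average(epoch=25, avg=17)
print(names[0], names[-1], len(names))  # epoch-9.pt epoch-25.pt 17

toy = [{"w": [1.0, 2.0]}, {"w": [3.0, 6.0]}]
print(average_params(toy))  # {'w': [2.0, 4.0]}
```

The first print confirms the range quoted above: --epoch 25 --avg 17 covers epoch-9.pt through epoch-25.pt.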

Pre-trained Model

We have uploaded the pre-trained model to https://huggingface.co/luomingshuang/icefall_asr_timit_tdnn_ligru_ctc.

The following shows you how to use the pre-trained model.

Install kaldifeat

kaldifeat is used to extract features for a single sound file or multiple sound files at the same time.

Please refer to https://github.com/csukuangfj/kaldifeat for installation.

Download the pre-trained model

$ cd egs/timit/ASR
$ mkdir tmp-ligru
$ cd tmp-ligru
$ git lfs install
$ git clone https://huggingface.co/luomingshuang/icefall_asr_timit_tdnn_ligru_ctc

Caution

You have to use git lfs to download the pre-trained model.

Caution

In order to use this pre-trained model, your k2 version has to be v1.7 or later.

After downloading, you will have the following files:

$ cd egs/timit/ASR
$ tree tmp-ligru
tmp-ligru/
`-- icefall_asr_timit_tdnn_ligru_ctc
    |-- README.md
    |-- data
    |   |-- lang_phone
    |   |   |-- HLG.pt
    |   |   |-- tokens.txt
    |   |   `-- words.txt
    |   `-- lm
    |       `-- G_4_gram.pt
    |-- exp
    |   `-- pretrained_average_9_25.pt
    `-- test_waves
        |-- FDHC0_SI1559.WAV
        |-- FELC0_SI756.WAV
        |-- FMGD0_SI1564.WAV
        `-- trans.txt

6 directories, 10 files

File descriptions:

  • data/lang_phone/HLG.pt

    It is the decoding graph.

  • data/lang_phone/tokens.txt

    It contains tokens and their IDs.

  • data/lang_phone/words.txt

    It contains words and their IDs.

  • data/lm/G_4_gram.pt

    It is a 4-gram LM, useful for LM rescoring.

  • exp/pretrained_average_9_25.pt

    It contains pre-trained model parameters, obtained by averaging checkpoints from epoch-9.pt to epoch-25.pt. Note: We have removed optimizer state_dict to reduce file size.

  • test_waves/*.WAV

    It contains some test sound files from the TIMIT TEST set.

  • test_waves/trans.txt

    It contains the reference transcripts for the sound files in test_waves/.
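
If you want to inspect tokens.txt or words.txt programmatically, a few lines of Python are enough. The sketch below assumes the usual symbol-table layout of one "symbol id" pair per line; the sample contents and IDs are made up and are not taken from the downloaded files.

```python
from typing import Dict


def read_symbol_table(text: str) -> Dict[str, int]:
    """Parse lines of the form '<symbol> <integer id>' into a dict."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        symbol, idx = line.rsplit(maxsplit=1)
        table[symbol] = int(idx)
    return table


# Made-up sample; the real files define their own symbols and IDs.
sample = """\
<eps> 0
sil 1
aa 2
"""
sym2id = read_symbol_table(sample)
id2sym = {i: s for s, i in sym2id.items()}
print(sym2id["sil"], id2sym[2])  # 1 aa
```

For a real file, pass the result of open(path, encoding="utf-8").read() to read_symbol_table.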

Information about the test sound files is listed below:

$ ffprobe -show_format tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FDHC0_SI1559.WAV

Input #0, nistsphere, from 'tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FDHC0_SI1559.WAV':
Metadata:
  database_id     : TIMIT
  database_version: 1.0
  utterance_id    : dhc0_si1559
  sample_min      : -4176
  sample_max      : 5984
Duration: 00:00:03.40, bitrate: 258 kb/s
  Stream #0:0: Audio: pcm_s16le, 16000 Hz, 1 channels, s16, 256 kb/s

$ ffprobe -show_format tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FELC0_SI756.WAV

Input #0, nistsphere, from 'tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FELC0_SI756.WAV':
Metadata:
  database_id     : TIMIT
  database_version: 1.0
  utterance_id    : elc0_si756
  sample_min      : -1546
  sample_max      : 1989
Duration: 00:00:04.19, bitrate: 257 kb/s
  Stream #0:0: Audio: pcm_s16le, 16000 Hz, 1 channels, s16, 256 kb/s

$ ffprobe -show_format tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FMGD0_SI1564.WAV

Input #0, nistsphere, from 'tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FMGD0_SI1564.WAV':
Metadata:
  database_id     : TIMIT
  database_version: 1.0
  utterance_id    : mgd0_si1564
  sample_min      : -7626
  sample_max      : 10573
Duration: 00:00:04.44, bitrate: 257 kb/s
  Stream #0:0: Audio: pcm_s16le, 16000 Hz, 1 channels, s16, 256 kb/s
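
The stream bitrate in these listings follows directly from the sample rate, sample width and channel count, as the short calculation below shows. The slightly higher figure on the Duration line (257 or 258 kb/s) presumably also counts the file header; that explanation is an assumption.

```python
sample_rate_hz = 16000   # "16000 Hz" in the ffprobe output
bits_per_sample = 16     # pcm_s16le: signed 16-bit little-endian PCM
channels = 1             # "1 channels"

bitrate_bps = sample_rate_hz * bits_per_sample * channels
print(bitrate_bps // 1000, "kb/s")  # 256 kb/s

duration_s = 3.40  # duration reported for FDHC0_SI1559.WAV
approx_samples = round(duration_s * sample_rate_hz)
print(approx_samples)  # roughly 54400 samples
```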

Inference with a pre-trained model

$ cd egs/timit/ASR
$ ./tdnn_ligru_ctc/pretrained.py --help

shows the usage information of ./tdnn_ligru_ctc/pretrained.py.

To decode with the 1best method, you can use:

./tdnn_ligru_ctc/pretrained.py \
  --method 1best \
  --checkpoint ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/exp/pretrained_average_9_25.pt \
  --words-file ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/data/lang_phone/words.txt \
  --HLG ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/data/lang_phone/HLG.pt \
  ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FDHC0_SI1559.WAV \
  ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FELC0_SI756.WAV \
  ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FMGD0_SI1564.WAV

The output is:

2021-11-08 20:41:33,660 INFO [pretrained.py:169] device: cuda:0
2021-11-08 20:41:33,660 INFO [pretrained.py:171] Creating model
2021-11-08 20:41:38,680 INFO [pretrained.py:183] Loading HLG from ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/data/lang_phone/HLG.pt
2021-11-08 20:41:38,695 INFO [pretrained.py:200] Constructing Fbank computer
2021-11-08 20:41:38,697 INFO [pretrained.py:210] Reading sound files: ['./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FDHC0_SI1559.WAV', './tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FELC0_SI756.WAV', './tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FMGD0_SI1564.WAV']
2021-11-08 20:41:38,704 INFO [pretrained.py:216] Decoding started
2021-11-08 20:41:39,819 INFO [pretrained.py:246] Use HLG decoding
2021-11-08 20:41:39,829 INFO [pretrained.py:267]
./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FDHC0_SI1559.WAV:
sil dh ih sh uw ah l iy v iy z ih sil p r aa sil k s ih m ey dx ih sil d w uh dx ih w ih s f iy l ih ng w ih th ih n ih m s eh l f sil jh

./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FELC0_SI756.WAV:
sil m ih sil t ih r iy s sil s er r ih m ih sil m aa l ih sil k l ey sil r eh sil d w ay sil d aa r sil b ah f sil jh

./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FMGD0_SI1564.WAV:
sil hh ah z sil b ih sil g r iy w ah z sil d aw n ih sil b ay s sil n ey sil w eh l f eh n s ih z eh n dh eh r w er sil g r ey z ih ng sil k ae dx l sil


2021-11-08 20:41:39,829 INFO [pretrained.py:269] Decoding Done
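
If you prefer to launch the script from Python, the same invocation can be assembled as an argument list. This is a minimal sketch that only uses the options shown in this section; the helper name build_pretrained_command and the MODEL_DIR constant are hypothetical.

```python
import shlex
from typing import List

MODEL_DIR = "./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc"


def build_pretrained_command(wavs: List[str], method: str = "1best") -> List[str]:
    """Assemble the argument list for ./tdnn_ligru_ctc/pretrained.py."""
    return [
        "./tdnn_ligru_ctc/pretrained.py",
        "--method", method,
        "--checkpoint", f"{MODEL_DIR}/exp/pretrained_average_9_25.pt",
        "--words-file", f"{MODEL_DIR}/data/lang_phone/words.txt",
        "--HLG", f"{MODEL_DIR}/data/lang_phone/HLG.pt",
        *wavs,
    ]


wavs = [
    f"{MODEL_DIR}/test_waves/{name}.WAV"
    for name in ("FDHC0_SI1559", "FELC0_SI756", "FMGD0_SI1564")
]
cmd = build_pretrained_command(wavs)
print(" ".join(shlex.quote(part) for part in cmd))
# To actually run it from egs/timit/ASR: subprocess.run(cmd, check=True)
```

Passing the arguments as a list means each sound file is its own argument, regardless of how the command is wrapped on screen.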

To decode with the whole-lattice-rescoring method, you can use:

./tdnn_ligru_ctc/pretrained.py \
  --method whole-lattice-rescoring \
  --checkpoint ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/exp/pretrained_average_9_25.pt \
  --words-file ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/data/lang_phone/words.txt \
  --HLG ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/data/lang_phone/HLG.pt \
  --G ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/data/lm/G_4_gram.pt \
  --ngram-lm-scale 0.1 \
  ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FDHC0_SI1559.WAV \
  ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FELC0_SI756.WAV \
  ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FMGD0_SI1564.WAV

The decoding output is:

2021-11-08 20:37:50,693 INFO [pretrained.py:169] device: cuda:0
2021-11-08 20:37:50,693 INFO [pretrained.py:171] Creating model
2021-11-08 20:37:54,693 INFO [pretrained.py:183] Loading HLG from ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/data/lang_phone/HLG.pt
2021-11-08 20:37:54,705 INFO [pretrained.py:191] Loading G from ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/data/lm/G_4_gram.pt
2021-11-08 20:37:54,714 INFO [pretrained.py:200] Constructing Fbank computer
2021-11-08 20:37:54,715 INFO [pretrained.py:210] Reading sound files: ['./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FDHC0_SI1559.WAV', './tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FELC0_SI756.WAV', './tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FMGD0_SI1564.WAV']
2021-11-08 20:37:54,720 INFO [pretrained.py:216] Decoding started
2021-11-08 20:37:55,808 INFO [pretrained.py:251] Use HLG decoding + LM rescoring
2021-11-08 20:37:56,348 INFO [pretrained.py:267]
./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FDHC0_SI1559.WAV:
sil dh ih sh uw ah l iy v iy z ah sil p r aa sil k s ih m ey dx ih sil d w uh dx iy w ih s f iy l iy ng w ih th ih n ih m s eh l f sil jh

./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FELC0_SI756.WAV:
sil m ih sil t ih r iy l s sil s er r eh m ih sil m aa l ih ng sil k l ey sil r eh sil d w ay sil d aa r sil b ah f sil jh ch

./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FMGD0_SI1564.WAV:
sil hh ah z sil b ih n sil g r iy w ah z sil b aw n ih sil b ay s sil n ey sil w er l f eh n s ih z eh n dh eh r w er sil g r ey z ih ng sil k ae dx l sil


2021-11-08 20:37:56,348 INFO [pretrained.py:269] Decoding Done

Colab notebook

We provide a Colab notebook for decoding with the pre-trained model.

timit tdnn_ligru_ctc colab notebook

Congratulations! You have finished the TDNN-LiGRU-CTC recipe on TIMIT in icefall.