TDNN-LSTM-CTC
This tutorial shows you how to run a TDNN-LSTM-CTC model with the LibriSpeech dataset.
Hint
We assume you have read the page Installation and have set up the environment for icefall.
Data preparation
$ cd egs/librispeech/ASR
$ ./prepare.sh
The script ./prepare.sh handles the data preparation for you, automagically. All you need to do is to run it.
The data preparation contains several stages. You can use the following two options:
--stage
--stop-stage
to control which stage(s) should be run. By default, all stages are executed.
For example,
$ cd egs/librispeech/ASR
$ ./prepare.sh --stage 0 --stop-stage 0
means to run only stage 0.
To run stage 2 to stage 5, use:
$ ./prepare.sh --stage 2 --stop-stage 5
We provide the following YouTube video showing how to run ./prepare.sh.
Note
To get the latest news about next-gen Kaldi, please subscribe to the YouTube channel by Nadira Povey.
Training
This section describes the training of the TDNN-LSTM-CTC model, which is contained in the tdnn_lstm_ctc folder.
The command to run the training part is:
$ cd egs/librispeech/ASR
$ export CUDA_VISIBLE_DEVICES="0,1,2,3"
$ ./tdnn_lstm_ctc/train.py --world-size 4
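The --world-size option should match the number of GPUs made visible through CUDA_VISIBLE_DEVICES. A quick sanity check (not part of the recipe) to confirm how many GPUs PyTorch can see:

import torch

# With CUDA_VISIBLE_DEVICES="0,1,2,3" this should print 4,
# matching --world-size 4 above.
print(torch.cuda.device_count())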
By default, it will run 20 epochs. Training logs and checkpoints are saved in tdnn_lstm_ctc/exp.
In tdnn_lstm_ctc/exp, you will find the following files:
epoch-0.pt, epoch-1.pt, …, epoch-19.pt
These are checkpoint files, containing the model state_dict and the optimizer state_dict. To resume training from some checkpoint, say epoch-10.pt, you can use:
$ ./tdnn_lstm_ctc/train.py --start-epoch 11
(A sketch of how to inspect a checkpoint file is shown after this list.)
tensorboard/
This folder contains TensorBoard logs. Training loss, validation loss, learning rate, etc., are recorded in these logs. You can visualize them by:
$ cd tdnn_lstm_ctc/exp/tensorboard
$ tensorboard dev upload --logdir . --description "TDNN LSTM training for librispeech with icefall"
log/log-train-xxxx
It is the detailed training log in text format, same as the one you saw printed to the console during training.
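If you are curious what a checkpoint stores, you can inspect it directly. A minimal sketch, assuming the model parameters live under a "model" key (the exact keys may differ between icefall versions):

import torch

# Load a checkpoint on CPU and list its top-level keys, e.g. the model
# and optimizer state_dicts described above.
ckpt = torch.load("tdnn_lstm_ctc/exp/epoch-10.pt", map_location="cpu")
print(list(ckpt.keys()))
print(type(ckpt["model"]))  # the "model" key name is an assumption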
To see available training options, you can use:
$ ./tdnn_lstm_ctc/train.py --help
Other training options, e.g., learning rate, results dir, etc., are pre-configured in the function get_params() in tdnn_lstm_ctc/train.py. Normally, you don't need to change them. You can change them by modifying the code, if you want.
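For orientation, such a configuration function typically just returns a dictionary of defaults. The sketch below is hypothetical; the names and values are illustrative, not the actual contents of tdnn_lstm_ctc/train.py:

# Hypothetical sketch of a get_params()-style configuration function.
# The real get_params() may use different names and values; edit the
# actual file to change training behavior.
def get_params() -> dict:
    return {
        "exp_dir": "tdnn_lstm_ctc/exp",  # where logs and checkpoints go
        "lr": 1e-3,                      # illustrative learning rate
        "num_epochs": 20,                # matches the default mentioned above
    }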
Decoding
The decoding part uses checkpoints saved by the training part, so you have to run the training part first.
The command for decoding is:
$ export CUDA_VISIBLE_DEVICES="0"
$ ./tdnn_lstm_ctc/decode.py
You will see the WER in the output log.
Decoded results are saved in tdnn_lstm_ctc/exp.
$ ./tdnn_lstm_ctc/decode.py --help
shows you the available decoding options.
Some commonly used options are:
--epoch
You can select which checkpoint to use for decoding. For instance, ./tdnn_lstm_ctc/decode.py --epoch 10 means to use ./tdnn_lstm_ctc/exp/epoch-10.pt for decoding.
--avg
It’s related to model averaging. It specifies the number of checkpoints to average. The averaged model is used for decoding. For example, the following command:
$ ./tdnn_lstm_ctc/decode.py --epoch 10 --avg 3
uses the average of epoch-8.pt, epoch-9.pt and epoch-10.pt for decoding. (A sketch of how such averaging works is shown after this list.)
--export
If it is True, i.e., ./tdnn_lstm_ctc/decode.py --export 1, the code will save the averaged model to tdnn_lstm_ctc/exp/pretrained.pt. See Pre-trained Model for how to use it.
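Conceptually, averaging means taking the element-wise mean of the model parameters across checkpoints. A minimal sketch of the idea, assuming each checkpoint stores its parameters under a "model" key (an assumption, and not the actual icefall implementation):

import torch

def average_model_states(filenames):
    # Accumulate parameters from each checkpoint's model state_dict,
    # then divide the floating-point tensors by the number of files.
    avg = torch.load(filenames[0], map_location="cpu")["model"]
    for f in filenames[1:]:
        state = torch.load(f, map_location="cpu")["model"]
        for k in avg:
            avg[k] = avg[k] + state[k]
    for k in avg:
        if avg[k].is_floating_point():
            avg[k] = avg[k] / len(filenames)
    return avg

# --epoch 10 --avg 3 corresponds to averaging these three checkpoints:
avg = average_model_states(
    [f"tdnn_lstm_ctc/exp/epoch-{i}.pt" for i in (8, 9, 10)]
)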
Pre-trained Model
We have uploaded the pre-trained model to https://huggingface.co/pkufool/icefall_asr_librispeech_tdnn-lstm_ctc.
The following shows you how to use the pre-trained model.
Install kaldifeat
kaldifeat is used to extract features for a single sound file or multiple sound files at the same time.
Please refer to https://github.com/csukuangfj/kaldifeat for installation.
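Once installed, extracting fbank features takes only a few lines. A minimal sketch (the option names follow kaldifeat's Kaldi-style options; treat the exact values, e.g. the 80 mel bins, as illustrative rather than the recipe's definitive settings):

import torch
import kaldifeat

opts = kaldifeat.FbankOptions()
opts.frame_opts.samp_freq = 16000  # the test waves are 16 kHz
opts.mel_opts.num_bins = 80        # illustrative feature dimension

fbank = kaldifeat.Fbank(opts)

wave = torch.randn(16000)          # 1 second of fake audio as a stand-in
features = fbank(wave)             # shape: (num_frames, num_bins)
print(features.shape)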
Download the pre-trained model
$ cd egs/librispeech/ASR
$ mkdir tmp
$ cd tmp
$ git lfs install
$ git clone https://huggingface.co/pkufool/icefall_asr_librispeech_tdnn-lstm_ctc
Caution
You have to use git lfs to download the pre-trained model.
Caution
In order to use this pre-trained model, your k2 version has to be v1.7 or later.
After downloading, you will have the following files:
$ cd egs/librispeech/ASR
$ tree tmp
tmp/
`-- icefall_asr_librispeech_tdnn-lstm_ctc
|-- README.md
|-- data
| |-- lang_phone
| | |-- HLG.pt
| | |-- tokens.txt
| | `-- words.txt
| `-- lm
| `-- G_4_gram.pt
|-- exp
| `-- pretrained.pt
`-- test_wavs
|-- 1089-134686-0001.flac
|-- 1221-135766-0001.flac
|-- 1221-135766-0002.flac
`-- trans.txt
6 directories, 10 files
File descriptions:
data/lang_phone/HLG.pt
It is the decoding graph. (A sketch of how to load it with k2 is shown after this list.)
data/lang_phone/tokens.txt
It contains tokens and their IDs.
data/lang_phone/words.txt
It contains words and their IDs.
data/lm/G_4_gram.pt
It is a 4-gram LM, useful for LM rescoring.
exp/pretrained.pt
It contains pre-trained model parameters, obtained by averaging checkpoints from epoch-14.pt to epoch-19.pt. Note: we have removed the optimizer state_dict to reduce the file size.
test_wavs/*.flac
It contains some test sound files from the LibriSpeech test-clean dataset.
test_wavs/trans.txt
It contains the reference transcripts for the sound files in test_wavs/.
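Both HLG.pt and G_4_gram.pt store FSAs serialized as dicts of tensors. A minimal sketch of loading them with k2 (assuming this serialization format):

import torch
import k2

# Load the decoding graph and the 4-gram LM on CPU.
HLG = k2.Fsa.from_dict(
    torch.load(
        "tmp/icefall_asr_librispeech_tdnn-lstm_ctc/data/lang_phone/HLG.pt",
        map_location="cpu",
    )
)
G = k2.Fsa.from_dict(
    torch.load(
        "tmp/icefall_asr_librispeech_tdnn-lstm_ctc/data/lm/G_4_gram.pt",
        map_location="cpu",
    )
)
print(HLG.shape, G.shape)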
Information about the test sound files is listed below:
$ soxi tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/*.flac
Input File : 'tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1089-134686-0001.flac'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:06.62 = 106000 samples ~ 496.875 CDDA sectors
File Size : 116k
Bit Rate : 140k
Sample Encoding: 16-bit FLAC
Input File : 'tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0001.flac'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:16.71 = 267440 samples ~ 1253.62 CDDA sectors
File Size : 343k
Bit Rate : 164k
Sample Encoding: 16-bit FLAC
Input File : 'tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0002.flac'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:04.83 = 77200 samples ~ 361.875 CDDA sectors
File Size : 105k
Bit Rate : 174k
Sample Encoding: 16-bit FLAC
Total Duration of 3 files: 00:00:28.16
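As the soxi output shows, the model expects 16 kHz, single-channel audio. If you want to try your own recordings, you can check (and, if necessary, resample) them in Python; this sketch assumes torchaudio is installed:

import torchaudio

# "your.flac" is a hypothetical input file.
wave, sample_rate = torchaudio.load("your.flac")
print(wave.shape, sample_rate)
if sample_rate != 16000:
    wave = torchaudio.functional.resample(wave, sample_rate, 16000)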
Inference with a pre-trained model
$ cd egs/librispeech/ASR
$ ./tdnn_lstm_ctc/pretrained.py --help
shows the usage information of ./tdnn_lstm_ctc/pretrained.py.
To decode with the 1best method, we can use:
./tdnn_lstm_ctc/pretrained.py \
--checkpoint ./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/exp/pretrained.pt \
--words-file ./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/data/lang_phone/words.txt \
--HLG ./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/data/lang_phone/HLG.pt \
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1089-134686-0001.flac \
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0001.flac \
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0002.flac
The output is:
2021-08-24 16:57:13,315 INFO [pretrained.py:168] device: cuda:0
2021-08-24 16:57:13,315 INFO [pretrained.py:170] Creating model
2021-08-24 16:57:18,331 INFO [pretrained.py:182] Loading HLG from ./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/data/lang_phone/HLG.pt
2021-08-24 16:57:27,581 INFO [pretrained.py:199] Constructing Fbank computer
2021-08-24 16:57:27,584 INFO [pretrained.py:209] Reading sound files: ['./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1089-134686-0001.flac', './tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0001.flac', './tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0002.flac']
2021-08-24 16:57:27,599 INFO [pretrained.py:215] Decoding started
2021-08-24 16:57:27,791 INFO [pretrained.py:245] Use HLG decoding
2021-08-24 16:57:28,098 INFO [pretrained.py:266]
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1089-134686-0001.flac:
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0001.flac:
GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0002.flac:
YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
2021-08-24 16:57:28,099 INFO [pretrained.py:268] Decoding Done
To decode with the whole-lattice-rescoring method, you can use:
./tdnn_lstm_ctc/pretrained.py \
--checkpoint ./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/exp/pretrained.pt \
--words-file ./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/data/lang_phone/words.txt \
--HLG ./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/data/lang_phone/HLG.pt \
--method whole-lattice-rescoring \
--G ./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/data/lm/G_4_gram.pt \
--ngram-lm-scale 0.8 \
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1089-134686-0001.flac \
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0001.flac \
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0002.flac
The decoding output is:
2021-08-24 16:39:24,725 INFO [pretrained.py:168] device: cuda:0
2021-08-24 16:39:24,725 INFO [pretrained.py:170] Creating model
2021-08-24 16:39:29,403 INFO [pretrained.py:182] Loading HLG from ./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/data/lang_phone/HLG.pt
2021-08-24 16:39:40,631 INFO [pretrained.py:190] Loading G from ./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/data/lm/G_4_gram.pt
2021-08-24 16:39:53,098 INFO [pretrained.py:199] Constructing Fbank computer
2021-08-24 16:39:53,107 INFO [pretrained.py:209] Reading sound files: ['./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1089-134686-0001.flac', './tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0001.flac', './tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0002.flac']
2021-08-24 16:39:53,121 INFO [pretrained.py:215] Decoding started
2021-08-24 16:39:53,443 INFO [pretrained.py:250] Use HLG decoding + LM rescoring
2021-08-24 16:39:54,010 INFO [pretrained.py:266]
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1089-134686-0001.flac:
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0001.flac:
GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0002.flac:
YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
2021-08-24 16:39:54,010 INFO [pretrained.py:268] Decoding Done
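The --ngram-lm-scale 0.8 used above weights the 4-gram LM scores when they are combined with the lattice scores during rescoring; schematically, total score = lattice score + ngram_lm_scale × LM score (a simplification of what the rescoring code computes). Larger values trust the n-gram LM more.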
Colab notebook
We provide a Colab notebook for decoding with the pre-trained model.
Congratulations! You have finished the TDNN-LSTM-CTC recipe on LibriSpeech in icefall.