LM rescoring for Transducer
LM rescoring is a commonly used approach to incorporate external LM information. Unlike shallow-fusion-based methods (see Shallow fusion for Transducer, LODR for RNN Transducer), rescoring re-ranks the n-best hypotheses after beam search. It is usually more efficient than shallow fusion since the external LM is evaluated only once per complete hypothesis, rather than at every decoding step. In this tutorial, we will show you how to use an external LM to rescore the n-best hypotheses decoded from neural transducer models in icefall.
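To make the idea concrete, here is a minimal sketch of n-best rescoring. The hypotheses, scores, and values below are purely illustrative (icefall implements this logic inside its modified_beam_search_lm_rescore decoding method):

```python
# A minimal sketch of n-best rescoring; all hypotheses and scores
# below are made up for illustration.

# (hypothesis, transducer log-prob) pairs produced by beam search
nbest = [
    ("HELLO WORD", -12.3),
    ("HELLO WORLD", -12.9),
    ("HELLO WORLDS", -14.1),
]

# log-probs assigned to the same hypotheses by the external LM
lm_scores = {
    "HELLO WORD": -9.8,
    "HELLO WORLD": -6.2,
    "HELLO WORLDS": -8.7,
}

lm_scale = 0.43  # plays the same role as --lm-scale below

# re-rank by the interpolated score and keep the best hypothesis
best_hyp, _ = max(nbest, key=lambda h: h[1] + lm_scale * lm_scores[h[0]])
print(best_hyp)  # -> HELLO WORLD
```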
Note
This tutorial is based on the recipe pruned_transducer_stateless7_streaming, which is a streaming transducer model trained on LibriSpeech. However, you can easily apply LM rescoring to other recipes. If you encounter any problems, please open an issue here.
Note
For simplicity, the training and testing corpora in this tutorial are the same (LibriSpeech). However, you can change the testing set to any other domain (e.g. GigaSpeech) and use an external LM trained on that domain.
Hint
We recommend using a GPU for decoding.
For illustration purposes, we will use a pre-trained ASR model from this link. If you want to train your model from scratch, please have a look at Pruned transducer statelessX.
As the initial step, let’s download the pre-trained model.
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
$ cd icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$ git lfs pull --include "pretrained.pt"
$ ln -s pretrained.pt epoch-99.pt # create a symbolic link so that the checkpoint can be loaded
$ cd ../data/lang_bpe_500
$ git lfs pull --include bpe.model
$ cd ../../..
As usual, we first test the model’s performance without external LM. This can be done via the following command:
$ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/
$ ./pruned_transducer_stateless7_streaming/decode.py \
--epoch 99 \
--avg 1 \
--use-averaged-model False \
--exp-dir $exp_dir \
--bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method modified_beam_search
The following WERs are achieved on test-clean and test-other:
$ For test-clean, WER of different settings are:
$ beam_size_4 3.11 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.93 best for test-other
Now, we will try to improve the above WER numbers via external LM rescoring. We will download a pre-trained LM from this link.
Note
This is an RNN LM trained on the LibriSpeech text corpus, so it might not be ideal for other corpora. You may also train an RNN LM from scratch. Please refer to this script for training an RNN LM and this script for training a transformer LM.
$ # download the external LM
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm
$ # create a symbolic link so that the checkpoint can be loaded
$ pushd icefall-librispeech-rnn-lm/exp
$ git lfs pull --include "pretrained.pt"
$ ln -s pretrained.pt epoch-99.pt
$ popd
With the RNN LM available, we can rescore the n-best hypotheses generated by modified_beam_search. Here, n should be the number of beams, i.e. --beam-size.
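Under the hood, the external LM assigns each complete hypothesis a sentence-level log-probability, i.e. the sum of its token-level log-probabilities, which is then interpolated with the transducer score as sketched above. Here is a hedged illustration of that scoring step; the tiny LSTM below is only a stand-in for the pre-trained RNN LM, not icefall's actual RnnLmModel API:

```python
import torch
import torch.nn.functional as F

# Stand-in components for illustration; the real RNN LM uses
# --rnn-lm-embedding-dim 2048, --rnn-lm-hidden-dim 2048 and
# --rnn-lm-num-layers 3.
vocab_size = 500  # matches --lm-vocab-size
embed = torch.nn.Embedding(vocab_size, 16)
lstm = torch.nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
proj = torch.nn.Linear(32, vocab_size)

tokens = torch.tensor([[1, 42, 7, 99, 2]])  # <sos> t1 t2 t3 <eos>
x, y = tokens[:, :-1], tokens[:, 1:]        # predict each next token

out, _ = lstm(embed(x))
log_probs = F.log_softmax(proj(out), dim=-1)
# sum the log-prob of each reference token: log p(y_t | y_<t)
lm_score = log_probs.gather(2, y.unsqueeze(-1)).sum().item()
print(lm_score)  # sentence-level LM log-probability of the hypothesis
```

The command for LM rescoring is as follows. Note that --decoding-method is set to modified_beam_search_lm_rescore and --use-shallow-fusion is set to False.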
$ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$ lm_dir=./icefall-librispeech-rnn-lm/exp
$ lm_scale=0.43
$ ./pruned_transducer_stateless7_streaming/decode.py \
--epoch 99 \
--avg 1 \
--use-averaged-model False \
--beam-size 4 \
--exp-dir $exp_dir \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method modified_beam_search_lm_rescore \
--bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
--use-shallow-fusion 0 \
--lm-type rnn \
--lm-exp-dir $lm_dir \
--lm-epoch 99 \
--lm-scale $lm_scale \
--lm-avg 1 \
--rnn-lm-embedding-dim 2048 \
--rnn-lm-hidden-dim 2048 \
--rnn-lm-num-layers 3 \
--lm-vocab-size 500
$ For test-clean, WER of different settings are:
$ beam_size_4 2.93 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.6 best for test-other
Great! We made some improvements! Increasing the size of the n-best list (i.e. the beam size) further boosts the performance; see the following table:
| Beam size | test-clean | test-other |
|---|---|---|
| 4 | 2.93 | 7.6 |
| 8 | 2.67 | 7.11 |
| 12 | 2.59 | 6.86 |
In fact, we can also apply LODR (see LODR for RNN Transducer) when doing LM rescoring. To do so, we need to download the bi-gram required by LODR:
$ # download the bi-gram
$ git lfs install
$ git clone https://huggingface.co/marcoyang/librispeech_bigram
$ pushd data/lang_bpe_500
$ ln -s ../../librispeech_bigram/2gram.arpa .
$ popd
Then we can perform LM rescoring + LODR by changing the decoding method to modified_beam_search_lm_rescore_LODR.
Note
This decoding method requires kenlm as a dependency. You can install it via this command: pip install https://github.com/kpu/kenlm/archive/master.zip.
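Conceptually, LM rescoring + LODR subtracts a scaled bi-gram score from the interpolated score to offset the transducer's internal LM (see LODR for RNN Transducer). Below is a sketch of the adjusted score for one hypothesis, using kenlm to query the bi-gram. The lodr_scale value and example numbers are illustrative, and a plain word string is used for readability even though icefall's 2gram.arpa is estimated on BPE tokens:

```python
import kenlm

# The path follows the symbolic link created above.
bigram = kenlm.Model("data/lang_bpe_500/2gram.arpa")

hyp = "HELLO WORLD"       # illustrative; icefall scores BPE token sequences
am_score = -12.9          # transducer log-prob from beam search
lm_score = -6.2           # RNN LM log-prob (see the sketch above)
lm_scale, lodr_scale = 0.43, 0.16  # lodr_scale is illustrative

# kenlm returns a log10 probability; a faithful implementation would
# first put all scores on the same log base.
bigram_score = bigram.score(hyp, bos=True, eos=True)
final_score = am_score + lm_scale * lm_score - lodr_scale * bigram_score
```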
$ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$ lm_dir=./icefall-librispeech-rnn-lm/exp
$ lm_scale=0.43
$ ./pruned_transducer_stateless7_streaming/decode.py \
--epoch 99 \
--avg 1 \
--use-averaged-model False \
--beam-size 4 \
--exp-dir $exp_dir \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method modified_beam_search_lm_rescore_LODR \
--bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
--use-shallow-fusion 0 \
--lm-type rnn \
--lm-exp-dir $lm_dir \
--lm-epoch 99 \
--lm-scale $lm_scale \
--lm-avg 1 \
--rnn-lm-embedding-dim 2048 \
--rnn-lm-hidden-dim 2048 \
--rnn-lm-num-layers 3 \
--lm-vocab-size 500
You should see the following WERs after executing the commands above:
$ For test-clean, WER of different settings are:
$ beam_size_4 2.9 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.57 best for test-other
It’s slightly better than LM rescoring alone. If we further increase the beam size, we will see further improvements from LM rescoring + LODR:
| Beam size | test-clean | test-other |
|---|---|---|
| 4 | 2.9 | 7.57 |
| 8 | 2.63 | 7.04 |
| 12 | 2.52 | 6.73 |
As mentioned earlier, LM rescoring is usually faster than shallow-fusion-based methods. Here, we benchmark their WERs and decoding speed (reported as test-clean/test-other WER; decoding time):
| Decoding method | beam=4 | beam=8 | beam=12 |
|---|---|---|---|
| modified_beam_search | 3.11/7.93; 132s | 3.1/7.95; 177s | 3.1/7.96; 210s |
| modified_beam_search_lm_shallow_fusion | 2.77/7.08; 262s | 2.62/6.65; 352s | 2.58/6.65; 488s |
| modified_beam_search_LODR | 2.61/6.74; 400s | 2.45/6.38; 610s | 2.4/6.23; 870s |
| modified_beam_search_lm_rescore | 2.93/7.6; 156s | 2.67/7.11; 203s | 2.59/6.86; 255s |
| modified_beam_search_lm_rescore_LODR | 2.9/7.57; 160s | 2.63/7.04; 203s | 2.52/6.73; 263s |
Note
Decoding is performed on a single 32G V100 with --max-duration set to 600. Decoding times are for reference only and may vary.