Stateless Transducer

This tutorial shows you how to do transducer training in icefall.

Hint

We use the term transducer here instead of RNN-T or RNN transducer because, as you will see, there are no RNNs in the model.

Hint

We assume you have read the page Installation and have set up the environment for icefall.

Hint

We recommend using one or more GPUs to run this recipe.

In this tutorial, you will learn:

What the transducer model looks like

How to prepare data for training and decoding

How to start the training, either with a single GPU or with multiple GPUs

How to do decoding after training, with greedy search, beam search, and modified beam search

How to use a pre-trained model provided by us to transcribe sound files

The Model

The transducer model consists of 3 parts:

Encoder: It is a conformer encoder with the following parameters
- Number of heads: 8
- Attention dim: 512
- Number of layers: 12
- Feedforward dim: 2048
Decoder: We use a stateless model consisting of:
- An embedding layer with embedding dim 512
- A Conv1d layer with a default kernel size 2 (i.e. it sees 2 symbols of left-context by default)
Joiner: It consists of a tanh activation (nn.Tanh()) followed by a nn.Linear().
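
To make the joiner concrete, here is a minimal pure-Python sketch of a tanh followed by a linear layer. It assumes that the encoder and decoder outputs are combined by element-wise addition before the tanh, which the text above does not state; the function name and the weight layout are hypothetical and are not taken from icefall.

```python
import math


def joiner(encoder_out, decoder_out, weight, bias):
    """Toy joiner for one (frame, symbol-context) pair.

    encoder_out, decoder_out: lists of floats with the same length.
    weight: one row per output symbol; bias: one value per output symbol.
    Assumption: the two inputs are added before the tanh.
    """
    hidden = [math.tanh(e + d) for e, d in zip(encoder_out, decoder_out)]
    # Linear layer: one logit per vocabulary entry.
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weight, bias)
    ]
```

With a vocabulary of 4336 symbols, weight would have 4336 rows, so the output is one logit per symbol for each combination of frame and decoder context.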

Caution

The decoder is stateless and very simple. It is borrowed from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9054419 (Rnn-Transducer with Stateless Prediction Network)

We make one modification to it: we place a Conv1d layer right after the embedding layer.

When using Chinese characters as the modelling unit, whose vocabulary size is 4336 in this specific dataset, the number of parameters of the model is 87939824, i.e., about 88 M.

The Loss

We use https://github.com/csukuangfj/optimized_transducer to compute the transducer loss. It removes extra padding in the loss computation to save memory.

Hint

optimized_transducer implements the techniques proposed in Improving RNN Transducer Modeling for End-to-End Speech Recognition to save memory.

Furthermore, it supports modified transducer, which limits the maximum number of symbols that can be emitted per frame to 1 and thereby simplifies the decoding process significantly. Experimental results also show that it does not degrade the performance.

See https://github.com/csukuangfj/optimized_transducer#modified-transducer for what exactly modified transducer is.

https://github.com/csukuangfj/transducer-loss-benchmarking shows that, in the unpruned case, optimized_transducer has an advantage in memory usage.
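
One way to see the effect of the one-symbol-per-frame limit is to count the alignments that each variant allows. The sketch below is a toy illustration only. It assumes that in the standard transducer a symbol is emitted without advancing the frame, that in the modified transducer emitting a symbol also advances the frame, and that a blank always advances the frame; count_alignments is a hypothetical helper, not part of optimized_transducer.

```python
from functools import lru_cache


def count_alignments(num_frames, num_symbols, modified):
    """Count the ways to emit num_symbols symbols over num_frames frames."""

    @lru_cache(maxsize=None)
    def paths(t, u):
        # t: frames consumed so far, u: symbols emitted so far.
        if t == num_frames:
            return 1 if u == num_symbols else 0
        total = paths(t + 1, u)  # blank: move to the next frame
        if u < num_symbols:
            if modified:
                total += paths(t + 1, u + 1)  # symbol also consumes the frame
            else:
                total += paths(t, u + 1)  # symbol stays on the same frame
        return total

    return paths(0, 0)
```

Under these assumptions, 3 frames and 2 symbols give 6 alignments for the standard variant and 3 for the modified one, and the modified variant has no valid alignment when there are more symbols than frames.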

Todo

Add tutorial about pruned_transducer_stateless that uses k2 pruned transducer loss.

Hint

You can use:

pip install optimized_transducer

to install optimized_transducer. Refer to https://github.com/csukuangfj/optimized_transducer for other alternatives.

Data Preparation

To prepare the data for training, please use the following commands:

cd egs/aishell/ASR
./prepare.sh --stop-stage 4
./prepare.sh --stage 6 --stop-stage 6

Note

You can also run ./prepare.sh in full, though it will additionally generate FSTs that are not used in transducer training.

When you finish running the script, you will get the following two folders:

data/fbank: It saves the pre-computed features

data/lang_char: It contains tokens that will be used in the training
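
Before starting a long training run, it can be worth checking that both folders exist. The following is a small sketch of such a check; missing_data_dirs is a hypothetical helper, and it only tests for the two directories named above, not for their contents.

```python
from pathlib import Path


def missing_data_dirs(root):
    """Return the expected data folders that do not exist under root."""
    expected = ["data/fbank", "data/lang_char"]
    return [d for d in expected if not (Path(root) / d).is_dir()]
```

Calling missing_data_dirs("egs/aishell/ASR") from the top of the icefall repository returns an empty list when both folders are present.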

Training

cd egs/aishell/ASR
./transducer_stateless_modified/train.py --help

shows you the training options that can be passed from the command line. The following options are used quite often:

--exp-dir

The experiment folder to save logs and model checkpoints, defaults to ./transducer_stateless_modified/exp.

--num-epochs

It is the number of epochs to train. For instance, ./transducer_stateless_modified/train.py --num-epochs 30 trains for 30 epochs and generates epoch-0.pt, epoch-1.pt, …, epoch-29.pt in the folder set by --exp-dir.

--start-epoch

It’s used to resume training. ./transducer_stateless_modified/train.py --start-epoch 10 loads the checkpoint from exp_dir/epoch-9.pt and starts training from epoch 10, based on the state from epoch 9.
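
The relation between --start-epoch, --num-epochs and the checkpoint file names can be sketched as follows. The helpers are hypothetical and only mirror the naming described above (epochs are numbered from 0, and epoch N is saved as epoch-N.pt); they assume that --start-epoch 0 means training from scratch.

```python
from pathlib import Path


def checkpoint_to_load(exp_dir, start_epoch):
    """Return the checkpoint used to resume, or None when starting fresh."""
    if start_epoch <= 0:
        return None  # assumption: epoch 0 starts from scratch
    return Path(exp_dir) / f"epoch-{start_epoch - 1}.pt"


def epochs_to_run(start_epoch, num_epochs):
    """Epoch indexes that remain, assuming epochs 0 .. num_epochs - 1."""
    return list(range(start_epoch, num_epochs))
```

For example, checkpoint_to_load("transducer_stateless_modified/exp", 10) points to epoch-9.pt, and epochs_to_run(10, 30) covers epochs 10 to 29.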

--world-size

It is used for single-machine multi-GPU DDP training.

If it is 1, then no DDP training is used.

If it is 2, then GPU 0 and GPU 1 are used for DDP training.

The following shows some use cases with it.
Use case 1: You have 4 GPUs, but you only want to use GPU 0 and GPU 2 for training. You can do the following:
$ cd egs/aishell/ASR
$ export CUDA_VISIBLE_DEVICES="0,2"
$ ./transducer_stateless_modified/train.py --world-size 2
Use case 2: You have 4 GPUs and you want to use all of them for training. You can do the following:
$ cd egs/aishell/ASR
$ ./transducer_stateless_modified/train.py --world-size 4
Use case 3: You have 4 GPUs but you only want to use GPU 3 for training. You can do the following:
$ cd egs/aishell/ASR
$ export CUDA_VISIBLE_DEVICES="3"
$ ./transducer_stateless_modified/train.py --world-size 1
Caution

Only single-machine multi-GPU DDP training is implemented at present. There is an ongoing PR https://github.com/k2-fsa/icefall/pull/63 that adds support for multi-machine multi-GPU DDP training.
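
The use cases above rely on the fact that CUDA_VISIBLE_DEVICES renumbers the visible GPUs starting from 0. The sketch below shows the resulting mapping from DDP rank to physical GPU. It assumes that rank r uses logical device r, as in the description of --world-size above; physical_gpus and its num_gpus parameter are hypothetical.

```python
def physical_gpus(cuda_visible_devices, world_size, num_gpus=4):
    """Map each DDP rank to the physical GPU index it ends up on.

    cuda_visible_devices: value of the environment variable, or None if unset.
    Assumption: rank r uses logical device r.
    """
    if cuda_visible_devices is None:
        visible = list(range(num_gpus))
    else:
        visible = [int(x) for x in cuda_visible_devices.split(",") if x.strip()]
    if world_size > len(visible):
        raise ValueError("world_size exceeds the number of visible GPUs")
    return {rank: visible[rank] for rank in range(world_size)}
```

With CUDA_VISIBLE_DEVICES="0,2" and --world-size 2, rank 0 runs on physical GPU 0 and rank 1 on physical GPU 2, which corresponds to use case 1.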
--max-duration

It specifies the total number of seconds over all utterances in a batch, before padding. If you encounter CUDA OOM, please reduce it. For instance, if you are using an NVIDIA V100 GPU with 32 GB of memory, we recommend setting it to 300 when the vocabulary size is 500.

Hint

Due to padding, the number of seconds of all utterances in a batch will usually be larger than --max-duration.

A larger value for --max-duration may cause OOM during training, while a smaller value may increase the training time. You have to tune it.
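
The following sketch illustrates why the padded duration of a batch can exceed --max-duration. It groups utterances greedily by duration, which is a simplification and not necessarily how the sampler used in training works; make_batches and padded_seconds are hypothetical helpers.

```python
def make_batches(durations, max_duration):
    """Greedily group utterance durations (seconds) into batches."""
    batches, current, total = [], [], 0.0
    for d in durations:
        # Close the batch when adding d would exceed the limit.
        if current and total + d > max_duration:
            batches.append(current)
            current, total = [], 0.0
        current.append(d)
        total += d
    if current:
        batches.append(current)
    return batches


def padded_seconds(batch):
    """Seconds actually processed once every utterance is padded to the longest."""
    return len(batch) * max(batch)
```

With durations [4.2, 4.12, 4.0, 10.0, 3.0] and a limit of 12.5 seconds, the first batch holds 12.32 seconds of speech but about 12.6 seconds after padding.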

--lr-factor

It controls the learning rate. If you use a single GPU for training, you may want to use a small value for it. If you use multiple GPUs for training, you may increase it.

--context-size

It specifies the kernel size in the decoder. The default value 2 means it functions as a tri-gram LM.
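
The tri-gram behaviour can be illustrated with a toy stateless decoder in pure Python. This is a sketch under simplifying assumptions: the weights are random, and the convolution uses one weight per context position and channel instead of a full Conv1d that mixes channels. make_decoder is a hypothetical name, not the icefall implementation.

```python
import random


def make_decoder(vocab_size, dim, context_size=2, seed=0):
    """Build a toy decoder: embedding lookup followed by a 1-D convolution."""
    rng = random.Random(seed)
    emb = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]
    kernel = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(context_size)]

    def decode(history):
        # Only the last context_size symbols are visible to the decoder.
        context = history[-context_size:]
        if len(context) != context_size:
            raise ValueError("history is shorter than context_size")
        out = [0.0] * dim
        for k, token in enumerate(context):
            for d in range(dim):
                out[d] += kernel[k][d] * emb[token][d]
        return out

    return decode
```

Two histories that end in the same two symbols, such as [5, 1, 2] and [9, 7, 1, 2], produce identical decoder outputs, so the next symbol depends on the previous two only, as in a tri-gram LM.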

--modified-transducer-prob

It specifies the probability to use modified transducer loss. If it is 0, then no modified transducer is used; if it is 1, then it uses modified transducer loss for all batches. If it is p, it applies modified transducer with probability p.
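
The per-batch choice can be pictured as a biased coin flip. The sketch below is an assumption about how such a choice could be implemented and is not copied from train.py; use_modified_transducer is a hypothetical helper.

```python
import random


def use_modified_transducer(prob, rng):
    """Return True when this batch should use the modified transducer loss."""
    # rng.random() lies in [0, 1), so prob=0 never selects it and prob=1 always does.
    return rng.random() < prob
```

For example, with rng = random.Random(0) and prob = 0.5, roughly half of the batches would use the modified loss.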

There are some training options, e.g., the number of warmup steps, that are not passed from the command line. They are pre-configured by the function get_params() in transducer_stateless_modified/train.py.

If you need to change them, please modify ./transducer_stateless_modified/train.py directly.

Caution

The training set is perturbed by speed with two factors: 0.9 and 1.1. Each epoch actually processes 3x150 == 450 hours of data.

Training logs

Training logs and checkpoints are saved in the folder set by --exp-dir (defaults to transducer_stateless_modified/exp). You will find the following files in that directory:

epoch-0.pt, epoch-1.pt, …

These are checkpoint files, containing model state_dict and optimizer state_dict. To resume training from some checkpoint, say epoch-10.pt, you can use:
$ ./transducer_stateless_modified/train.py --start-epoch 11
tensorboard/

This folder contains TensorBoard logs. Training loss, validation loss, learning rate, etc., are recorded in these logs. You can visualize them by:
$ cd transducer_stateless_modified/exp/tensorboard
$ tensorboard dev upload --logdir . --name "Aishell transducer training with icefall" --description "Training modified transducer, see https://github.com/k2-fsa/icefall/pull/219"
It will print something like below:
TensorFlow installation not found - running with reduced feature set.
Upload started and will continue reading any new data as it's added to the logdir.

To stop uploading, press Ctrl-C.

New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/laGZ6HrcQxOigbFD5E0Y3Q/

[2022-03-03T14:29:45] Started scanning logdir.
[2022-03-03T14:29:48] Total uploaded: 8477 scalars, 0 tensors, 0 binary objects
Listening for new data in logdir...
Note that there is a URL in the above output. Click it and you will see the following screenshot:

Fig. 3 TensorBoard screenshot.
log/log-train-xxxx

It is the detailed training log in text format, same as the one you saw printed to the console during training.

Usage examples

The following shows typical use cases:

Case 1

$ cd egs/aishell/ASR
$ ./transducer_stateless_modified/train.py --max-duration 250

It uses --max-duration of 250 to avoid OOM.

Case 2

$ cd egs/aishell/ASR
$ export CUDA_VISIBLE_DEVICES="0,3"
$ ./transducer_stateless_modified/train.py --world-size 2

It uses GPU 0 and GPU 3 for DDP training.

Case 3

$ cd egs/aishell/ASR
$ ./transducer_stateless_modified/train.py --num-epochs 10 --start-epoch 3

It loads the checkpoint ./transducer_stateless_modified/exp/epoch-2.pt and resumes training from epoch 3. Because --num-epochs is 10, training runs up to epoch 9, i.e., 10 epochs in total including the ones already completed.

Decoding

The decoding part uses checkpoints saved by the training part, so you have to run the training part first.

$ cd egs/aishell/ASR
$ ./transducer_stateless_modified/decode.py --help

shows the options for decoding.

The commonly used options are:

--decoding-method

This specifies the decoding method. Currently, it supports:

greedy_search. You can provide the commandline option --max-sym-per-frame to limit the maximum number of symbols that can be emitted per frame.

beam_search. You can provide the commandline option --beam-size.

modified_beam_search. You can also provide the commandline option --beam-size. To use this method, we assume that you have trained your model with modified transducer, i.e., used the option --modified-transducer-prob in the training.

The following command uses greedy search for decoding:
$ cd egs/aishell/ASR
$ ./transducer_stateless_modified/decode.py \
        --epoch 64 \
        --avg 33 \
        --exp-dir ./transducer_stateless_modified/exp \
        --max-duration 100 \
        --decoding-method greedy_search \
        --max-sym-per-frame 1
The following command uses beam search for decoding:
$ cd egs/aishell/ASR
$ ./transducer_stateless_modified/decode.py \
        --epoch 64 \
        --avg 33 \
        --exp-dir ./transducer_stateless_modified/exp \
        --max-duration 100 \
        --decoding-method beam_search \
        --beam-size 4
The following command uses modified beam search for decoding:
$ cd egs/aishell/ASR
$ ./transducer_stateless_modified/decode.py \
        --epoch 64 \
        --avg 33 \
        --exp-dir ./transducer_stateless_modified/exp \
        --max-duration 100 \
        --decoding-method modified_beam_search \
        --beam-size 4
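
To show what --max-sym-per-frame controls, here is a minimal sketch of greedy search with a per-frame symbol limit. It is not the code in beam_search.py. The scoring function stands in for the encoder, decoder and joiner, and both greedy_search as written here and toy_scores are hypothetical. The sketch assumes the blank ID is 0 and that the hypothesis starts with context_size blanks.

```python
def greedy_search(frames, score_fn, blank_id=0, context_size=2, max_sym_per_frame=1):
    """Greedy transducer decoding over a sequence of frames.

    score_fn(frame, context) returns one score per vocabulary entry.
    """
    hyp = [blank_id] * context_size  # assumption: history is padded with blanks
    for frame in frames:
        emitted = 0
        while emitted < max_sym_per_frame:
            scores = score_fn(frame, hyp[-context_size:])
            best = max(range(len(scores)), key=scores.__getitem__)
            if best == blank_id:
                break  # blank: move on to the next frame
            hyp.append(best)
            emitted += 1
    return hyp[context_size:]


def toy_scores(frame, context, vocab_size=4):
    """Toy model: emit the frame's label once, then prefer blank."""
    scores = [0.0] * vocab_size
    target = frame if context[-1] != frame else 0
    scores[target] = 1.0
    return scores
```

With frames [1, 0, 2, 2] and toy_scores, the result is [1, 2]. With max_sym_per_frame set to 1, the inner loop runs at most once per frame, which is the case the modified transducer is trained for.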
--max-duration

It has the same meaning as the one used in training. A larger value may cause OOM.

--epoch

It specifies the epoch whose checkpoint is used for decoding.

--avg

It specifies the number of models to average. For instance, if it is 3 and if --epoch=10, then it averages the checkpoints epoch-8.pt, epoch-9.pt, and epoch-10.pt and the averaged checkpoint is used for decoding.
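
The selection and averaging described above can be sketched as follows. The helpers are hypothetical, and the averaging operates on plain dictionaries of floats rather than real model state dicts.

```python
def checkpoints_to_average(epoch, avg):
    """File names of the avg checkpoints that end at the given epoch."""
    start = epoch - avg + 1
    if avg < 1 or start < 0:
        raise ValueError("avg must be between 1 and epoch + 1")
    return [f"epoch-{i}.pt" for i in range(start, epoch + 1)]


def average_states(states):
    """Element-wise mean of a list of {parameter name: value} dicts."""
    n = len(states)
    return {name: sum(s[name] for s in states) / n for name in states[0]}
```

For --epoch 64 and --avg 33, as in the decoding commands above, checkpoints_to_average(64, 33) ranges from epoch-32.pt to epoch-64.pt.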

After decoding, you can find the decoding logs and results in exp_dir/log/<decoding_method>, e.g., exp_dir/log/greedy_search.

Pre-trained Model

We have uploaded a pre-trained model to https://huggingface.co/csukuangfj/icefall-aishell-transducer-stateless-modified-2022-03-01

The following describes how to use the pre-trained model to transcribe one or more sound files.

Install kaldifeat

kaldifeat is used to extract features for a single sound file or multiple sound files at the same time.

Please refer to https://github.com/csukuangfj/kaldifeat for installation.

Download the pre-trained model

The following commands describe how to download the pre-trained model:

$ cd egs/aishell/ASR
$ mkdir tmp
$ cd tmp
$ git lfs install
$ git clone https://huggingface.co/csukuangfj/icefall-aishell-transducer-stateless-modified-2022-03-01

Caution

You have to use git lfs to download the pre-trained model.

After downloading, you will have the following files:

$ cd egs/aishell/ASR
$ tree tmp/icefall-aishell-transducer-stateless-modified-2022-03-01

tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/
|-- README.md
|-- data
|   `-- lang_char
|       |-- L.pt
|       |-- lexicon.txt
|       |-- tokens.txt
|       `-- words.txt
|-- exp
|   `-- pretrained.pt
|-- log
|   |-- errs-test-beam_4-epoch-64-avg-33-beam-4.txt
|   |-- errs-test-greedy_search-epoch-64-avg-33-context-2-max-sym-per-frame-1.txt
|   |-- log-decode-epoch-64-avg-33-beam-4-2022-03-02-12-05-03
|   |-- log-decode-epoch-64-avg-33-context-2-max-sym-per-frame-1-2022-02-28-18-13-07
|   |-- recogs-test-beam_4-epoch-64-avg-33-beam-4.txt
|   `-- recogs-test-greedy_search-epoch-64-avg-33-context-2-max-sym-per-frame-1.txt
`-- test_wavs
    |-- BAC009S0764W0121.wav
    |-- BAC009S0764W0122.wav
    |-- BAC009S0764W0123.wav
    `-- transcript.txt

5 directories, 16 files

File descriptions:

data/lang_char

It contains language related files. You can find the vocabulary size in tokens.txt.
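
A possible way to read the vocabulary size from tokens.txt is sketched below. It assumes that each line has the form "token id", that IDs start at 0, and that symbols starting with "#" are disambiguation symbols that do not count towards the vocabulary; these are assumptions about the file format, and read_vocab_size is a hypothetical helper.

```python
def read_vocab_size(lines):
    """Return the largest token ID plus one, skipping '#'-prefixed symbols."""
    ids = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        token, index = parts
        if token.startswith("#"):
            continue  # assumption: disambiguation symbol
        ids.append(int(index))
    return max(ids) + 1
```

It can be applied to a file with read_vocab_size(open(path, encoding="utf-8")), where path points to tokens.txt.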

exp/pretrained.pt

It contains pre-trained model parameters, obtained by averaging checkpoints from epoch-32.pt to epoch-64.pt. Note: We have removed optimizer state_dict to reduce file size.

log

It contains decoding logs and decoded results.

test_wavs

It contains some test sound files from the Aishell test set.

Information about the test sound files is listed below:

$ soxi tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/*.wav

Input File     : 'tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:04.20 = 67263 samples ~ 315.295 CDDA sectors
File Size      : 135k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM


Input File     : 'tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:04.12 = 65840 samples ~ 308.625 CDDA sectors
File Size      : 132k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM


Input File     : 'tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:04.00 = 64000 samples ~ 300 CDDA sectors
File Size      : 128k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM

Total Duration of 3 files: 00:00:12.32
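
If soxi is not available, the same basic information can be read with Python's standard wave module. The sketch below is a minimal example; wav_info and write_silence are hypothetical helpers, and write_silence exists only to create a file for trying the code out.

```python
import wave


def write_silence(path, num_samples, sample_rate=16000):
    """Write a mono 16-bit PCM file of silence (for testing only)."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = 16-bit
        f.setframerate(sample_rate)
        f.writeframes(b"\x00\x00" * num_samples)


def wav_info(path):
    """Return channels, sample rate, precision and duration of a WAV file."""
    with wave.open(path, "rb") as f:
        num_samples = f.getnframes()
        sample_rate = f.getframerate()
        return {
            "channels": f.getnchannels(),
            "sample_rate": sample_rate,
            "bits": 8 * f.getsampwidth(),
            "num_samples": num_samples,
            "seconds": num_samples / sample_rate,
        }
```

The duration is the number of samples divided by the sample rate. For BAC009S0764W0121.wav this is 67263 / 16000, or about 4.20 seconds, which agrees with the soxi output above.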

Usage

$ cd egs/aishell/ASR
$ ./transducer_stateless_modified/pretrained.py --help

displays the help information.

It supports three decoding methods:

greedy search

beam search

modified beam search

Note

In modified beam search, it limits the maximum number of symbols that can be emitted per frame to 1. To use this method, you have to ensure that your model has been trained with the option --modified-transducer-prob. Otherwise, it may give you poor results.

Greedy search

The command to run greedy search is given below:

$ cd egs/aishell/ASR
$ ./transducer_stateless_modified/pretrained.py \
    --checkpoint ./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/exp/pretrained.pt \
    --lang-dir ./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/data/lang_char \
    --method greedy_search \
    ./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav \
    ./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav \
    ./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav

The output is as follows:

2022-03-03 15:35:26,531 INFO [pretrained.py:239] device: cuda:0
2022-03-03 15:35:26,994 INFO [lexicon.py:176] Loading pre-compiled tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/data/lang_char/Linv.pt
2022-03-03 15:35:27,027 INFO [pretrained.py:246] {'feature_dim': 80, 'encoder_out_dim': 512, 'subsampling_factor': 4, 'attention_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'vgg_frontend': False, 'env_info': {'k2-version': '1.13', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'f4fefe4882bc0ae59af951da3f47335d5495ef71', 'k2-git-date': 'Thu Feb 10 15:16:02 2022', 'lhotse-version': '1.0.0.dev+missing.version.file', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'master', 'icefall-git-sha1': '50d2281-clean', 'icefall-git-date': 'Wed Mar 2 16:02:38 2022', 'icefall-path': '/ceph-fj/fangjun/open-source-2/icefall-aishell', 'k2-path': '/ceph-fj/fangjun/open-source-2/k2-multi-datasets/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-fj/fangjun/open-source-2/lhotse-aishell/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-2-0815224919-75d558775b-mmnv8', 'IP address': '10.177.72.138'}, 'sample_rate': 16000, 'checkpoint': './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/exp/pretrained.pt', 'lang_dir': PosixPath('tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/data/lang_char'), 'method': 'greedy_search', 'sound_files': ['./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav'], 'beam_size': 4, 'context_size': 2, 'max_sym_per_frame': 3, 'blank_id': 0, 'vocab_size': 4336}
2022-03-03 15:35:27,027 INFO [pretrained.py:248] About to create model
2022-03-03 15:35:36,878 INFO [pretrained.py:257] Constructing Fbank computer
2022-03-03 15:35:36,880 INFO [pretrained.py:267] Reading sound files: ['./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav']
2022-03-03 15:35:36,891 INFO [pretrained.py:273] Decoding started
/ceph-fj/fangjun/open-source-2/icefall-aishell/egs/aishell/ASR/transducer_stateless_modified/conformer.py:113: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  lengths = ((x_lens - 1) // 2 - 1) // 2
2022-03-03 15:35:37,163 INFO [pretrained.py:320]
./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav:
甚 至 出 现 交 易 几 乎 停 滞 的 情 况

./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav:
一 二 线 城 市 虽 然 也 处 于 调 整 中

./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav:
但 因 为 聚 集 了 过 多 公 共 资 源

2022-03-03 15:35:37,163 INFO [pretrained.py:322] Decoding Done
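
The UserWarning in the log above refers to the expression ((x_lens - 1) // 2 - 1) // 2 in conformer.py, which computes the sequence lengths after the encoder's subsampling. The sketch below evaluates the same expression on plain Python integers and shows how truncating and flooring division differ. subsampled_length and trunc_div are hypothetical helper names.

```python
def subsampled_length(num_frames):
    """Apply the length formula from the log to a single frame count."""
    return ((num_frames - 1) // 2 - 1) // 2


def trunc_div(a, b):
    """Division that rounds toward zero, like rounding_mode='trunc'."""
    return int(a / b)
```

For example, subsampled_length(100) is 24. Truncating and flooring division only disagree for negative values (trunc_div(-3, 2) is -1 while -3 // 2 is -2), so the two rounding modes give the same result whenever the operands are non-negative, as they are here for inputs of at least one frame.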

Beam search

The command to run beam search is given below:

$ cd egs/aishell/ASR

$ ./transducer_stateless_modified/pretrained.py \
    --checkpoint ./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/exp/pretrained.pt \
    --lang-dir ./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/data/lang_char \
    --method beam_search \
    --beam-size 4 \
    ./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav \
    ./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav \
    ./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav

The output is as follows:

2022-03-03 15:39:09,285 INFO [pretrained.py:239] device: cuda:0
2022-03-03 15:39:09,708 INFO [lexicon.py:176] Loading pre-compiled tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/data/lang_char/Linv.pt
2022-03-03 15:39:09,759 INFO [pretrained.py:246] {'feature_dim': 80, 'encoder_out_dim': 512, 'subsampling_factor': 4, 'attention_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'vgg_frontend': False, 'env_info': {'k2-version': '1.13', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'f4fefe4882bc0ae59af951da3f47335d5495ef71', 'k2-git-date': 'Thu Feb 10 15:16:02 2022', 'lhotse-version': '1.0.0.dev+missing.version.file', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'master', 'icefall-git-sha1': '50d2281-clean', 'icefall-git-date': 'Wed Mar 2 16:02:38 2022', 'icefall-path': '/ceph-fj/fangjun/open-source-2/icefall-aishell', 'k2-path': '/ceph-fj/fangjun/open-source-2/k2-multi-datasets/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-fj/fangjun/open-source-2/lhotse-aishell/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-2-0815224919-75d558775b-mmnv8', 'IP address': '10.177.72.138'}, 'sample_rate': 16000, 'checkpoint': './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/exp/pretrained.pt', 'lang_dir': PosixPath('tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/data/lang_char'), 'method': 'beam_search', 'sound_files': ['./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav'], 'beam_size': 4, 'context_size': 2, 'max_sym_per_frame': 3, 'blank_id': 0, 'vocab_size': 4336}
2022-03-03 15:39:09,760 INFO [pretrained.py:248] About to create model
2022-03-03 15:39:18,919 INFO [pretrained.py:257] Constructing Fbank computer
2022-03-03 15:39:18,922 INFO [pretrained.py:267] Reading sound files: ['./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav']
2022-03-03 15:39:18,929 INFO [pretrained.py:273] Decoding started
/ceph-fj/fangjun/open-source-2/icefall-aishell/egs/aishell/ASR/transducer_stateless_modified/conformer.py:113: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  lengths = ((x_lens - 1) // 2 - 1) // 2
2022-03-03 15:39:21,046 INFO [pretrained.py:320]
./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav:
甚 至 出 现 交 易 几 乎 停 滞 的 情 况

./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav:
一 二 线 城 市 虽 然 也 处 于 调 整 中

./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav:
但 因 为 聚 集 了 过 多 公 共 资 源

2022-03-03 15:39:21,047 INFO [pretrained.py:322] Decoding Done

Modified beam search

The command to run modified beam search is given below:

$ cd egs/aishell/ASR

$ ./transducer_stateless_modified/pretrained.py \
    --checkpoint ./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/exp/pretrained.pt \
    --lang-dir ./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/data/lang_char \
    --method modified_beam_search \
    --beam-size 4 \
    ./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav \
    ./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav \
    ./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav

The output is as follows:

2022-03-03 15:41:23,319 INFO [pretrained.py:239] device: cuda:0
2022-03-03 15:41:23,798 INFO [lexicon.py:176] Loading pre-compiled tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/data/lang_char/Linv.pt
2022-03-03 15:41:23,831 INFO [pretrained.py:246] {'feature_dim': 80, 'encoder_out_dim': 512, 'subsampling_factor': 4, 'attention_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'vgg_frontend': False, 'env_info': {'k2-version': '1.13', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'f4fefe4882bc0ae59af951da3f47335d5495ef71', 'k2-git-date': 'Thu Feb 10 15:16:02 2022', 'lhotse-version': '1.0.0.dev+missing.version.file', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'master', 'icefall-git-sha1': '50d2281-clean', 'icefall-git-date': 'Wed Mar 2 16:02:38 2022', 'icefall-path': '/ceph-fj/fangjun/open-source-2/icefall-aishell', 'k2-path': '/ceph-fj/fangjun/open-source-2/k2-multi-datasets/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-fj/fangjun/open-source-2/lhotse-aishell/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-2-0815224919-75d558775b-mmnv8', 'IP address': '10.177.72.138'}, 'sample_rate': 16000, 'checkpoint': './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/exp/pretrained.pt', 'lang_dir': PosixPath('tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/data/lang_char'), 'method': 'modified_beam_search', 'sound_files': ['./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav'], 'beam_size': 4, 'context_size': 2, 'max_sym_per_frame': 3, 'blank_id': 0, 'vocab_size': 4336}
2022-03-03 15:41:23,831 INFO [pretrained.py:248] About to create model
2022-03-03 15:41:32,214 INFO [pretrained.py:257] Constructing Fbank computer
2022-03-03 15:41:32,215 INFO [pretrained.py:267] Reading sound files: ['./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav']
2022-03-03 15:41:32,220 INFO [pretrained.py:273] Decoding started
/ceph-fj/fangjun/open-source-2/icefall-aishell/egs/aishell/ASR/transducer_stateless_modified/conformer.py:113: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  lengths = ((x_lens - 1) // 2 - 1) // 2
/ceph-fj/fangjun/open-source-2/icefall-aishell/egs/aishell/ASR/transducer_stateless_modified/beam_search.py:402: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  topk_hyp_indexes = topk_indexes // logits.size(-1)
2022-03-03 15:41:32,583 INFO [pretrained.py:320]
./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav:
甚 至 出 现 交 易 几 乎 停 滞 的 情 况

./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav:
一 二 线 城 市 虽 然 也 处 于 调 整 中

./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav:
但 因 为 聚 集 了 过 多 公 共 资 源

2022-03-03 15:41:32,583 INFO [pretrained.py:322] Decoding Done

Colab notebook

We provide a colab notebook for this recipe showing how to use a pre-trained model to transcribe sound files.