Pruned transducer statelessX
This tutorial shows you how to run a streaming conformer transducer model with the LibriSpeech dataset.
Note
The tutorial is suitable for pruned_transducer_stateless, pruned_transducer_stateless2, pruned_transducer_stateless4, and pruned_transducer_stateless5. We take pruned_transducer_stateless4 as the example in this tutorial.
Hint
We assume you have read the page Installation and have set up the environment for icefall.
Hint
We recommend using one or more GPUs to run this recipe.
Hint
Please scroll down to the bottom of this page to find download links for pretrained models if you don’t want to train a model from scratch.
We use pruned RNN-T to compute the loss.
Note
You can find the paper about pruned RNN-T at the following address: https://arxiv.org/abs/2206.13236
The transducer model consists of 3 parts:
Encoder, a.k.a. the transcription network. We use a Conformer model (the reworked version by Daniel Povey).
Decoder, a.k.a. the prediction network. We use a stateless model consisting of nn.Embedding and nn.Conv1d.
Joiner, a.k.a. the joint network.
Caution
Contrary to conventional RNN-T models, we use a stateless decoder; that is, it has no recurrent connections.
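To make the decoder structure concrete, here is a minimal sketch of such a stateless prediction network (an illustration only, not icefall's actual code; the name StatelessDecoder, the default context_size, and the exact layer configuration are assumptions):

import torch
import torch.nn as nn

class StatelessDecoder(nn.Module):
    """A stateless prediction network: an embedding followed by a 1-D
    convolution over the last few labels, with no recurrent connections."""

    def __init__(self, vocab_size: int, embed_dim: int, context_size: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Depthwise conv over the last `context_size` labels.
        self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=context_size,
                              groups=embed_dim, bias=False)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, num_labels) of token IDs
        emb = self.embedding(y).permute(0, 2, 1)   # (batch, embed_dim, num_labels)
        # Left-pad so the output at position t only depends on labels <= t.
        emb = nn.functional.pad(emb, (self.conv.kernel_size[0] - 1, 0))
        return self.conv(emb).permute(0, 2, 1)     # (batch, num_labels, embed_dim)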
Data preparation
Hint
The data preparation is the same as for other LibriSpeech recipes. If you have already finished this step, you can skip directly to Training.
$ cd egs/librispeech/ASR
$ ./prepare.sh
The script ./prepare.sh
handles the data preparation for you, automagically.
All you need to do is to run it.
The data preparation contains several stages. You can use the following two options:
--stage
--stop-stage
to control which stage(s) should be run. By default, all stages are executed.
For example,
$ cd egs/librispeech/ASR
$ ./prepare.sh --stage 0 --stop-stage 0
means to run only stage 0.
To run stage 2 to stage 5, use:
$ ./prepare.sh --stage 2 --stop-stage 5
Hint
If you have pre-downloaded the LibriSpeech and musan datasets, say, to /tmp/LibriSpeech and /tmp/musan, you can modify the dl_dir variable in ./prepare.sh to point to /tmp so that ./prepare.sh won't re-download them.
Note
All files generated by ./prepare.sh, e.g., features, lexicon, etc., are saved in the ./data directory.
We provide the following YouTube video showing how to run ./prepare.sh.
Note
To get the latest news about next-gen Kaldi, please subscribe to the following YouTube channel by Nadira Povey:
Training
Note
We put the streaming and non-streaming models in one recipe. To train a streaming model, you only need to add 4 extra options compared with training a non-streaming model. These options are --dynamic-chunk-training, --num-left-chunks, --causal-convolution, and --short-chunk-size.
You can see the configurable options below for their meanings or read https://arxiv.org/pdf/2012.05481.pdf for more details.
Configurable options
$ cd egs/librispeech/ASR
$ ./pruned_transducer_stateless4/train.py --help
shows you the training options that can be passed from the commandline. The following options are used quite often:
--exp-dir
The directory in which to save checkpoints, training logs, and tensorboard logs.
--full-libri
If it’s True, the training part uses all the training data, i.e., 960 hours. Otherwise, the training part uses only the subset train-clean-100, which has 100 hours of training data.
Caution
The training set is speed-perturbed with two factors: 0.9 and 1.1. If --full-libri is True, each epoch actually processes 3x960 == 2880 hours of data.
--num-epochs
It is the number of epochs to train. For instance,
./pruned_transducer_stateless4/train.py --num-epochs 30
trains for 30 epochs and generates epoch-1.pt, epoch-2.pt, …, epoch-30.pt in the folder ./pruned_transducer_stateless4/exp.
--start-epoch
It’s used to resume training.
./pruned_transducer_stateless4/train.py --start-epoch 10
loads the checkpoint ./pruned_transducer_stateless4/exp/epoch-9.pt and starts training from epoch 10, based on the state from epoch 9.
--world-size
It is used for multi-GPU single-machine DDP training.
If it is 1, then no DDP training is used.
If it is 2, then GPU 0 and GPU 1 are used for DDP training.
The following shows some use cases with it.
Use case 1: You have 4 GPUs, but you only want to use GPU 0 and GPU 2 for training. You can do the following:
$ cd egs/librispeech/ASR
$ export CUDA_VISIBLE_DEVICES="0,2"
$ ./pruned_transducer_stateless4/train.py --world-size 2
Use case 2: You have 4 GPUs and you want to use all of them for training. You can do the following:
$ cd egs/librispeech/ASR
$ ./pruned_transducer_stateless4/train.py --world-size 4
Use case 3: You have 4 GPUs but you only want to use GPU 3 for training. You can do the following:
$ cd egs/librispeech/ASR
$ export CUDA_VISIBLE_DEVICES="3"
$ ./pruned_transducer_stateless4/train.py --world-size 1
Caution
Only multi-GPU single-machine DDP training is implemented at present. Multi-GPU multi-machine DDP training will be added later.
--max-duration
It specifies the number of seconds over all utterances in a batch, before padding. If you encounter CUDA OOM, please reduce it.
Hint
Due to padding, the number of seconds of all utterances in a batch will usually be larger than --max-duration.
A larger value for --max-duration may cause OOM during training, while a smaller value may increase the training time. You have to tune it.
--use-fp16
If it is True, the model is trained with half precision. From our experimental results, half precision lets you train with a --max-duration about twice as large, giving an almost 2x speedup.
--dynamic-chunk-training
This flag indicates whether to train a streaming model. It MUST be True if you want to train a streaming model.
--short-chunk-size
When training a streaming attention model with chunk masking, the chunk size is either the max sequence length of the current batch or uniformly sampled from (1, short_chunk_size). The default value is 25; you don't have to change it most of the time. See the sketch after these options for how the chunk mask can be built.
--num-left-chunks
It indicates how much left context (in chunks) can be seen when calculating attention. The default value is 4; you don't have to change it most of the time.
--causal-convolution
Whether to use causal convolution in the conformer encoder layers. This MUST be True when training a streaming model.
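To illustrate what --short-chunk-size and --num-left-chunks control, here is a minimal sketch of how a dynamic chunk mask could be built (a simplified illustration; the function name and the probability of picking full context are assumptions, not icefall's actual implementation):

import random
import torch

def make_dynamic_chunk_mask(seq_len: int, short_chunk_size: int = 25,
                            num_left_chunks: int = 4,
                            full_context_prob: float = 0.5) -> torch.Tensor:
    """Return a (seq_len, seq_len) boolean mask; True means "may attend"."""
    if random.random() < full_context_prob:
        chunk_size = seq_len                    # full context for this batch
    else:
        chunk_size = random.randint(1, short_chunk_size)
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for t in range(seq_len):
        chunk_idx = t // chunk_size
        start = max(0, (chunk_idx - num_left_chunks) * chunk_size)
        end = min(seq_len, (chunk_idx + 1) * chunk_size)
        mask[t, start:end] = True               # current chunk + left chunks
    return mask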
Pre-configured options
There are some training options, e.g., number of encoder layers, encoder dimension, decoder dimension, number of warmup steps, etc., that are not passed from the commandline. They are pre-configured by the function get_params() in pruned_transducer_stateless4/train.py.
You don’t need to change these pre-configured parameters. If you really need to change
them, please modify ./pruned_transducer_stateless4/train.py
directly.
Note
The options for pruned_transducer_stateless5 are a little different from those of the other recipes. It allows you to configure --num-encoder-layers, --dim-feedforward, --nhead, --encoder-dim, --decoder-dim, and --joiner-dim from the commandline, so that you can train models of different sizes with pruned_transducer_stateless5.
Training logs
Training logs and checkpoints are saved in --exp-dir (e.g., pruned_transducer_stateless4/exp).
You will find the following files in that directory:
epoch-1.pt, epoch-2.pt, …
These are checkpoint files saved at the end of each epoch, containing model state_dict and optimizer state_dict. To resume training from some checkpoint, say epoch-10.pt, you can use:
$ ./pruned_transducer_stateless4/train.py --start-epoch 11
checkpoint-436000.pt, checkpoint-438000.pt, …
These are checkpoint files saved every --save-every-n batches, containing model state_dict and optimizer state_dict. To resume training from some checkpoint, say checkpoint-436000.pt, you can use:
$ ./pruned_transducer_stateless4/train.py --start-batch 436000
tensorboard/
This folder contains TensorBoard logs. Training loss, validation loss, learning rate, etc., are recorded in these logs. You can visualize them by:
$ cd pruned_transducer_stateless4/exp/tensorboard
$ tensorboard dev upload --logdir . --description "pruned transducer training for LibriSpeech with icefall"
It will print something like below:
TensorFlow installation not found - running with reduced feature set.
Upload started and will continue reading any new data as it's added to the logdir.
To stop uploading, press Ctrl-C.
New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/97VKXf80Ru61CnP2ALWZZg/
[2022-11-20T15:50:50] Started scanning logdir.
Uploading 4468 scalars...
[2022-11-20T15:53:02] Total uploaded: 210171 scalars, 0 tensors, 0 binary objects
Listening for new data in logdir...
Note there is a URL in the above output. Click it to view the training curves in your browser.
Hint
If you don’t have access to Google, you can use the following command to view the tensorboard log locally:
cd pruned_transducer_stateless4/exp/tensorboard
tensorboard --logdir . --port 6008
It will print the following message:
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.8.0 at http://localhost:6008/ (Press CTRL+C to quit)
Now start your browser and go to http://localhost:6008 to view the tensorboard logs.
log/log-train-xxxx
It is the detailed training log in text format, same as the one you saw printed to the console during training.
Usage example
You can use the following command to start the training using 4 GPUs:
export CUDA_VISIBLE_DEVICES="0,1,2,3"
./pruned_transducer_stateless4/train.py \
--world-size 4 \
--dynamic-chunk-training 1 \
--causal-convolution 1 \
--num-epochs 30 \
--start-epoch 1 \
--exp-dir pruned_transducer_stateless4/exp \
--full-libri 1 \
--max-duration 300
Note
Compared with training a non-streaming model, you only need to add two extra options: --dynamic-chunk-training 1 and --causal-convolution 1.
Decoding
The decoding part uses checkpoints saved by the training part, so you have to run the training part first.
Hint
There are two kinds of checkpoints:
(1) epoch-1.pt, epoch-2.pt, …, which are saved at the end of each epoch. You can pass --epoch to pruned_transducer_stateless4/decode.py to use them.
(2) checkpoint-436000.pt, checkpoint-438000.pt, …, which are saved every --save-every-n batches. You can pass --iter to pruned_transducer_stateless4/decode.py to use them.
We suggest that you try both types of checkpoints and choose the one that produces the lowest WERs.
Tip
To decode a streaming model, you can use either simulate streaming decoding in decode.py or real streaming decoding in streaming_decode.py. The difference between decode.py and streaming_decode.py is that decode.py processes all the acoustic frames at one time with masking (i.e., the same as training), while streaming_decode.py processes the acoustic frames chunk by chunk (so it can only see limited context).
Note
simulate streaming decoding in decode.py and real streaming decoding in streaming_decode.py should produce almost the same results given the same --decode-chunk-size and --left-context.
Simulate streaming decoding
$ cd egs/librispeech/ASR
$ ./pruned_transducer_stateless4/decode.py --help
shows the options for decoding. The following options are important for streaming models:
--simulate-streaming
If you want to decode a streaming model with decode.py, you MUST set --simulate-streaming to True. simulate here means that the acoustic frames are not processed frame by frame (or chunk by chunk); instead, the whole sequence is processed at one time with masking (the same as in training).
--causal-convolution
If True, the convolution modules in the encoder layers will be causal. This MUST be True when decoding a streaming model.
--decode-chunk-size
For streaming models, we calculate chunk-wise attention; --decode-chunk-size indicates the chunk length (in frames after subsampling) for chunk-wise attention. For simulate streaming decoding, decode-chunk-size is used to generate the attention mask.
--left-context
--left-context indicates how many left-context frames (after subsampling) can be seen by the current chunk when calculating chunk-wise attention. Normally, left-context should equal decode-chunk-size * num-left-chunks, where num-left-chunks is the option used to train this model. For simulate streaming decoding, the left-context is used to generate the attention mask; see the sketch below for how such a mask can be built.
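As a companion to the training-time sketch above, here is a minimal sketch of the mask used for simulate streaming decoding (a simplified illustration with a fixed chunk size and the left context given in frames; not icefall's actual code):

import torch

def make_decode_mask(seq_len: int, decode_chunk_size: int = 16,
                     left_context: int = 64) -> torch.Tensor:
    """Return a (seq_len, seq_len) boolean mask; True means "may attend"."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for t in range(seq_len):
        chunk_start = (t // decode_chunk_size) * decode_chunk_size
        start = max(0, chunk_start - left_context)           # left-context frames
        end = min(seq_len, chunk_start + decode_chunk_size)  # current chunk
        mask[t, start:end] = True
    return mask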
The following shows two examples (for the two types of checkpoints):
for m in greedy_search fast_beam_search modified_beam_search; do
for epoch in 25 20; do
for avg in 7 5 3 1; do
./pruned_transducer_stateless4/decode.py \
--epoch $epoch \
--avg $avg \
--simulate-streaming 1 \
--causal-convolution 1 \
--decode-chunk-size 16 \
--left-context 64 \
--exp-dir pruned_transducer_stateless4/exp \
--max-duration 600 \
--decoding-method $m
done
done
done
for m in greedy_search fast_beam_search modified_beam_search; do
for iter in 474000; do
for avg in 8 10 12 14 16 18; do
./pruned_transducer_stateless4/decode.py \
--iter $iter \
--avg $avg \
--simulate-streaming 1 \
--causal-convolution 1 \
--decode-chunk-size 16 \
--left-context 64 \
--exp-dir pruned_transducer_stateless4/exp \
--max-duration 600 \
--decoding-method $m
done
done
done
Real streaming decoding
$ cd egs/librispeech/ASR
$ ./pruned_transducer_stateless4/streaming_decode.py --help
shows the options for decoding. The following options are important for streaming models:
--decode-chunk-size
For streaming models, we calculate chunk-wise attention; --decode-chunk-size indicates the chunk length (in frames after subsampling) for chunk-wise attention. For real streaming decoding, we process decode-chunk-size acoustic frames at a time.
--left-context
--left-context indicates how many left-context frames (after subsampling) can be seen by the current chunk when calculating chunk-wise attention. Normally, left-context should equal decode-chunk-size * num-left-chunks, where num-left-chunks is the option used to train this model.
--num-decode-streams
The number of decoding streams that can be run in parallel (very similar to the batch size). For real streaming decoding, the batches are packed dynamically. For example, if num-decode-streams equals 10, sequences 1 to 10 are decoded first; once, say, sequences 1 and 2 finish, sequences 3 to 12 are processed in parallel in a batch. A sketch of this scheduling is shown below.
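Here is a minimal sketch of the dynamic packing described above (pure-Python bookkeeping only; the function name and the chunk-count abstraction are assumptions, not icefall's actual code):

from collections import deque

def run_streams(chunk_counts, num_decode_streams=10):
    """chunk_counts[i] is the number of chunks sequence i needs; finished
    streams free a slot that is immediately given to a waiting sequence."""
    waiting = deque(enumerate(chunk_counts))
    active = {}                                  # seq_id -> chunks remaining
    while waiting or active:
        while waiting and len(active) < num_decode_streams:
            seq_id, n = waiting.popleft()        # admit a waiting sequence
            active[seq_id] = n
        for seq_id in list(active):              # one batched chunk step
            active[seq_id] -= 1
            if active[seq_id] == 0:
                del active[seq_id]
                print(f"sequence {seq_id} finished")

run_streams([3, 1, 4, 2, 5], num_decode_streams=2)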
Note
We also tried adding --right-context in real streaming decoding, but it does not seem to benefit performance for all the models; the reason might be the mismatch between training and decoding. You can try decoding with --right-context to see if it helps. The default value is 0.
The following shows two examples (for the two types of checkpoints):
for m in greedy_search fast_beam_search modified_beam_search; do
for epoch in 25 20; do
for avg in 7 5 3 1; do
./pruned_transducer_stateless4/streaming_decode.py \
--epoch $epoch \
--avg $avg \
--decode-chunk-size 16 \
--left-context 64 \
--num-decode-streams 100 \
--exp-dir pruned_transducer_stateless4/exp \
--max-duration 600 \
--decoding-method $m
done
done
done
for m in greedy_search fast_beam_search modified_beam_search; do
for iter in 474000; do
for avg in 8 10 12 14 16 18; do
./pruned_transducer_stateless4/streaming_decode.py \
--iter $iter \
--avg $avg \
--decode-chunk-size 16 \
--left-context 64 \
--num-decode-streams 100 \
--exp-dir pruned_transducer_stateless4/exp \
--max-duration 600 \
--decoding-method $m
done
done
done
Tip
The supported decoding methods are as follows:
greedy_search: It takes the symbol with the largest posterior probability at each frame as the decoding result. A sketch of this algorithm is given after this list.
beam_search: It implements Algorithm 1 in https://arxiv.org/pdf/1211.3711.pdf, with espnet/nets/beam_search_transducer.py used as a reference. Basically, it keeps the top-k states for each frame and expands the kept states with their own contexts to the next frame.
modified_beam_search: It implements the same algorithm as beam_search above, but it runs in batch mode with --max-sym-per-frame=1 hardcoded.
fast_beam_search: It implements graph composition between the output log_probs and the given FSAs. It is hard to describe the details in a few lines of text; you can read our paper at https://arxiv.org/pdf/2211.00484.pdf or our RNN-T decoding code in k2. fast_beam_search can decode with FSAs on GPU efficiently.
fast_beam_search_LG: The same as fast_beam_search above, except that fast_beam_search uses a trivial graph that has only one state, while fast_beam_search_LG uses an LG graph (with an N-gram LM).
fast_beam_search_nbest: It produces the decoding results as follows:
Use fast_beam_search to get a lattice
Select num_paths paths from the lattice using k2.random_paths()
Unique the selected paths
Intersect the selected paths with the lattice and compute the shortest path from the intersection result
The path with the largest score is used as the decoding output.
fast_beam_search_nbest_LG: It implements the same logic as fast_beam_search_nbest; the only difference is that it uses fast_beam_search_LG to generate the lattice.
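For reference, here is a minimal sketch of transducer greedy search (an illustration of the algorithm only; decoder and joiner are stand-in callables for the prediction and joint networks, and the defaults are assumptions, not icefall's actual code):

import torch

def greedy_search(encoder_out, decoder, joiner, blank_id=0,
                  context_size=2, max_sym_per_frame=3):
    """encoder_out: (T, encoder_dim) for a single utterance."""
    hyp = [blank_id] * context_size              # initial decoder context
    for t in range(encoder_out.size(0)):
        for _ in range(max_sym_per_frame):       # cap symbols per frame
            dec_out = decoder(torch.tensor([hyp[-context_size:]]))
            logits = joiner(encoder_out[t], dec_out)
            y = int(logits.argmax(dim=-1))
            if y == blank_id:                    # blank: move to next frame
                break
            hyp.append(y)                        # non-blank: emit, stay here
    return hyp[context_size:]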
Note
The decoding methods supported in streaming_decode.py may be fewer than those in decode.py. If needed, you can implement them yourself or file an issue in icefall.
Export Model
pruned_transducer_stateless4/export.py supports exporting checkpoints from pruned_transducer_stateless4/exp
in the following ways.
Export model.state_dict()
Checkpoints saved by pruned_transducer_stateless4/train.py also include optimizer.state_dict(), which is useful for resuming training. But after training, we are interested only in model.state_dict(). You can use the following command to extract model.state_dict():
# Assume that --epoch 25 --avg 3 produces the smallest WER
# (You can get such information after running ./pruned_transducer_stateless4/decode.py)
epoch=25
avg=3
./pruned_transducer_stateless4/export.py \
--exp-dir ./pruned_transducer_stateless4/exp \
--streaming-model 1 \
--causal-convolution 1 \
--bpe-model data/lang_bpe_500/bpe.model \
--epoch $epoch \
--avg $avg
Caution
--streaming-model and --causal-convolution must be True to export a streaming model.
It will generate a file ./pruned_transducer_stateless4/exp/pretrained.pt.
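If you want to inspect the exported file from Python, here is a minimal sketch (it assumes the weights are stored under a "model" key in the checkpoint, which is how icefall's export script typically saves them):

import torch

ckpt = torch.load("pruned_transducer_stateless4/exp/pretrained.pt",
                  map_location="cpu")
state_dict = ckpt["model"]     # assumption: weights live under "model"
print(sum(p.numel() for p in state_dict.values()), "parameters")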
Hint
To use the generated pretrained.pt for pruned_transducer_stateless4/decode.py, you can run:
cd pruned_transducer_stateless4/exp
ln -s pretrained.pt epoch-999.pt
And then pass --epoch 999 --avg 1 --use-averaged-model 0 to ./pruned_transducer_stateless4/decode.py.
To use the exported model with ./pruned_transducer_stateless4/pretrained.py, you can run:
./pruned_transducer_stateless4/pretrained.py \
--checkpoint ./pruned_transducer_stateless4/exp/pretrained.pt \
--simulate-streaming 1 \
--causal-convolution 1 \
--bpe-model ./data/lang_bpe_500/bpe.model \
--method greedy_search \
/path/to/foo.wav \
/path/to/bar.wav
Export model using torch.jit.script()
./pruned_transducer_stateless4/export.py \
--exp-dir ./pruned_transducer_stateless4/exp \
--streaming-model 1 \
--causal-convolution 1 \
--bpe-model data/lang_bpe_500/bpe.model \
--epoch 25 \
--avg 3 \
--jit 1
Caution
--streaming-model and --causal-convolution must be True to export a streaming model.
It will generate a file cpu_jit.pt in the given exp_dir. You can later load it with torch.jit.load("cpu_jit.pt").
Note: cpu in the name cpu_jit.pt means the parameters, when loaded into Python, are on CPU. You can use to("cuda") to move them to a CUDA device.
Note
You will need this cpu_jit.pt when deploying with the Sherpa framework.
Download pretrained models
If you don’t want to train from scratch, you can download the pretrained models by visiting the following links:
See https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md for the details of the above pretrained models.
Deploy with Sherpa
Please see https://k2-fsa.github.io/sherpa/python/streaming_asr/conformer/index.html# for how to deploy the models in sherpa.