ConvEmformer transducer based streaming ASR
This page describes how to use sherpa for streaming ASR with ConvEmformer transducer models trained with pruned stateless transducer.
Hint
To be specific, the pre-trained model is trained on the LibriSpeech dataset using the code from https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/conv_emformer_transducer_stateless2.
The pre-trained model can be downloaded from https://huggingface.co/Zengwei/icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05
There are no recurrent modules in the transducer model:
The encoder network (i.e., the transcription network) is a ConvEmformer model.
The decoder network (i.e., the prediction network) is a stateless network, consisting of an nn.Embedding() and an nn.Conv1d().
The joiner network (i.e., the joint network) contains an adder, a tanh activation, and an nn.Linear().
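To make "stateless" concrete, here is a minimal, dependency-free Python sketch of the decoder and joiner structure described above. It is not the icefall implementation: the class names, dimensions, random initialization, and the per-channel (depthwise-style) convolution kernel are all assumptions made for illustration, and plain lists stand in for the nn.Embedding(), nn.Conv1d(), and nn.Linear() modules.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights (illustrative only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class StatelessDecoder:
    """Embedding lookup followed by a 1-D convolution over the last tokens.

    The output depends only on the `context_size` most recent tokens,
    so no recurrent hidden state is carried between calls.
    """

    def __init__(self, vocab_size, dim, context_size, rng):
        self.dim = dim
        self.context_size = context_size
        # Stands in for nn.Embedding(vocab_size, dim).
        self.embedding = make_matrix(vocab_size, dim, rng)
        # Stands in for nn.Conv1d with kernel_size=context_size.
        # Assumption: one weight per (position, channel), i.e. depthwise.
        self.kernel = make_matrix(context_size, dim, rng)

    def forward(self, context):
        assert len(context) == self.context_size
        out = [0.0] * self.dim
        for pos, token in enumerate(context):
            emb = self.embedding[token]
            for d in range(self.dim):
                out[d] += self.kernel[pos][d] * emb[d]
        return out


class Joiner:
    """Adder, tanh activation, then a linear layer producing vocabulary logits."""

    def __init__(self, dim, vocab_size, rng):
        # Stands in for nn.Linear(dim, vocab_size).
        self.weight = make_matrix(vocab_size, dim, rng)
        self.bias = [0.0] * vocab_size

    def forward(self, encoder_out, decoder_out):
        hidden = [math.tanh(e + d) for e, d in zip(encoder_out, decoder_out)]
        return [
            sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(self.weight, self.bias)
        ]


rng = random.Random(0)
decoder = StatelessDecoder(vocab_size=10, dim=4, context_size=2, rng=rng)
joiner = Joiner(dim=4, vocab_size=10, rng=rng)

# Stand-in for one output frame of the ConvEmformer encoder.
encoder_frame = [0.1, -0.2, 0.3, 0.0]
# The decoder sees only the last two emitted token IDs (here 0 and 3).
logits = joiner.forward(encoder_frame, decoder.forward([0, 3]))
```

Because the decoder output is a pure function of the most recent tokens, a streaming decoder only needs to keep those token IDs per stream rather than a recurrent hidden state.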
Streaming ASR in this section consists of two components: