Emformer transducer based streaming ASR

This page describes how to use sherpa for streaming ASR with Emformer transducer models trained with the pruned stateless transducer recipe.

Hint

Specifically, the pre-trained model was trained on the LibriSpeech dataset using the code from https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_stateless_emformer_rnnt2.

The pre-trained model can be downloaded from https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-stateless-emformer-rnnt2-2022-06-01

There are no recurrent modules in the transducer model:

The encoder network (i.e., the transcription network) is an Emformer model.

The decoder network (i.e., the prediction network) is a stateless network consisting of an nn.Embedding() and an nn.Conv1d().

The joiner network (i.e., the joint network) consists of an adder, a tanh activation, and an nn.Linear().
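To make the "no recurrent modules" point concrete, the following is a minimal pure-Python sketch of a stateless decoder and a joiner. It is not the icefall implementation: the class names, the sizes (vocab_size, dim, context_size) and the depthwise form of the convolution are all assumptions chosen for illustration, and plain lists stand in for the nn.Embedding(), nn.Conv1d() and nn.Linear() modules.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights (illustrative only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class StatelessDecoder:
    """Hypothetical stand-in for nn.Embedding() followed by nn.Conv1d().

    The output depends only on the last `context_size` tokens, so no
    recurrent state has to be carried between decoding steps.
    """

    def __init__(self, vocab_size, dim, context_size, rng):
        self.context_size = context_size
        self.embedding = make_matrix(vocab_size, dim, rng)
        # Assumption: a depthwise kernel, i.e. one weight per
        # (context position, channel) pair.
        self.kernel = make_matrix(context_size, dim, rng)

    def __call__(self, tokens):
        context = tokens[-self.context_size:]
        assert len(context) == self.context_size, "not enough context tokens"
        dim = len(self.embedding[0])
        out = [0.0] * dim
        for pos, tok in enumerate(context):
            emb = self.embedding[tok]
            for d in range(dim):
                out[d] += self.kernel[pos][d] * emb[d]
        return out


class Joiner:
    """Hypothetical stand-in for the adder, tanh and nn.Linear() joiner."""

    def __init__(self, dim, vocab_size, rng):
        self.weight = make_matrix(vocab_size, dim, rng)
        self.bias = [0.0] * vocab_size

    def __call__(self, encoder_out, decoder_out):
        # Adder followed by tanh.
        hidden = [math.tanh(e + d) for e, d in zip(encoder_out, decoder_out)]
        # Linear projection to one logit per vocabulary entry.
        return [
            sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(self.weight, self.bias)
        ]


rng = random.Random(0)
decoder = StatelessDecoder(vocab_size=10, dim=4, context_size=2, rng=rng)
joiner = Joiner(dim=4, vocab_size=10, rng=rng)

# Two token histories that differ only outside the context window.
out_a = decoder([1, 2, 3, 4])
out_b = decoder([9, 9, 3, 4])

# One made-up encoder frame with the same dimension as the decoder output.
encoder_frame = [0.1, -0.2, 0.3, 0.0]
logits = joiner(encoder_frame, out_a)
```

In this sketch out_a and out_b are identical because only the last two tokens are read. That is the sense in which the decoder is stateless: during streaming decoding, a hypothesis only needs to keep its most recent tokens rather than a hidden state.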

Streaming ASR in this section consists of two components:

Server
- Usage
Client
- Usage

The following YouTube video demonstrates how to use the server and the client.

Note

If you cannot access YouTube, the same video is available on bilibili at https://www.bilibili.com/video/BV1BU4y197bs