Conv-Emformer-transducer-based Models

Hint

Please refer to Installation to install sherpa-ncnn before you read this section.

marcoyang/sherpa-ncnn-conv-emformer-transducer-small-2023-01-09 (English)

This model is a small version of conv-emformer-transducer trained in icefall.

It has only 8.8 million parameters, so it can be deployed on embedded devices for real-time speech recognition. The model is available in fp16 and int8 formats at https://huggingface.co/marcoyang/sherpa-ncnn-conv-emformer-transducer-small-2023-01-09.

This model is trained on LibriSpeech, so it supports only English.

In the following, we show how to download it and deploy it with sherpa-ncnn on an embedded device whose CPU is an RV1126 (quad-core ARM Cortex-A7).

Please use the following commands to download it.

cd /path/to/sherpa-ncnn

wget https://github.com/k2-fsa/sherpa-ncnn/releases/download/models/sherpa-ncnn-conv-emformer-transducer-small-2023-01-09.tar.bz2
tar xvf sherpa-ncnn-conv-emformer-transducer-small-2023-01-09.tar.bz2
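Before decoding, it can help to confirm that the archive extracted completely. The following is a minimal sketch, not part of sherpa-ncnn: `check_model_dir` is a hypothetical helper name, and the file list is taken from the decoding commands on this page.

```shell
# Report any file that the decoding commands on this page expect but that
# is missing from an extracted model directory.
# Returns 0 when all seven files exist, 1 otherwise.
check_model_dir() {
  dir=$1
  missing=0
  for f in tokens.txt \
           encoder_jit_trace-pnnx.ncnn.param encoder_jit_trace-pnnx.ncnn.bin \
           decoder_jit_trace-pnnx.ncnn.param decoder_jit_trace-pnnx.ncnn.bin \
           joiner_jit_trace-pnnx.ncnn.param joiner_jit_trace-pnnx.ncnn.bin; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example:
#   check_model_dir ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09 && echo OK
```

The function only tests for the presence of the files; it does not validate their contents.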

Note

Please refer to Embedded Linux (arm) for how to compile sherpa-ncnn for a 32-bit ARM platform. In the following, we test the pre-trained model on an embedded device, whose CPU is RV1126 (Quad core ARM Cortex-A7).

Decode a single wave file with ./build/bin/sherpa-ncnn

Hint

It supports decoding only single-channel wave files with a sampling rate of 16 kHz.
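To check whether a file meets this requirement, the channel count and sampling rate can be read from the WAV header. This is a rough sketch under stated assumptions: the helper names are hypothetical, the file is assumed to use the canonical layout with the `fmt ` chunk at byte offset 12, and the host is assumed to be little-endian (which `od` relies on here).

```shell
# Print "<channels> <sample_rate>" read from a WAV header.
# Assumptions: canonical layout ("fmt " chunk at byte offset 12) and a
# little-endian host, because od decodes integers in host byte order.
wav_info() {
  channels=$(od -An -t u2 -j 22 -N 2 "$1" | tr -d '[:space:]')
  rate=$(od -An -t u4 -j 24 -N 4 "$1" | tr -d '[:space:]')
  echo "$channels $rate"
}

# Succeed only for single-channel input sampled at 16 kHz.
is_supported_wav() {
  [ "$(wav_info "$1")" = "1 16000" ]
}

# Example:
#   is_supported_wav foo.wav || echo "convert foo.wav to mono 16 kHz first"
```

Files with extra chunks before `fmt ` would need a real WAV parser; this check is only meant as a quick sanity test.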

cd /path/to/sherpa-ncnn

./build/bin/sherpa-ncnn \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/tokens.txt \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/encoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/encoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/decoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/decoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/joiner_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/joiner_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/test_wavs/1089-134686-0001.wav

The outputs are shown below. The CPU used for decoding is RV1126 (Quad core ARM Cortex-A7).

Decoding time and decoding result of float32 model

Note

The default option uses 4 threads and greedy_search for decoding.

Note

Please use ./build/bin/Release/sherpa-ncnn.exe for Windows.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your command line.

Decode a single wave file with ./build/bin/sherpa-ncnn (with int8 quantization)

Note

We also support int8 quantization to compress the model and speed up inference. Currently, only the encoder and joiner are quantized.

To decode the int8-quantized model, use the following command:

cd /path/to/sherpa-ncnn

./build/bin/sherpa-ncnn \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/tokens.txt \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/encoder_jit_trace-pnnx.ncnn.int8.param \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/encoder_jit_trace-pnnx.ncnn.int8.bin \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/decoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/decoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/joiner_jit_trace-pnnx.ncnn.int8.param \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/joiner_jit_trace-pnnx.ncnn.int8.bin \
  ./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09/test_wavs/1089-134686-0001.wav

The outputs are shown below. The CPU used for decoding is RV1126 (Quad core ARM Cortex-A7).

Decoding time and decoding result of int8 model

Compared to the original model in fp16 format, decoding is noticeably faster: the decoding time drops from 3.26 s to 2.44 s, a reduction of about 25%.

Note

When the model’s weights are quantized to float16, they are converted to float32 during computation.

When the model’s weights are quantized to int8, the computation itself uses int8.
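Since only the encoder and joiner have int8 variants, the two decoding commands in this section differ in just four file names. The sketch below prints the file list for either variant; `model_files` is a hypothetical helper name, and the file names are the ones used in the commands on this page.

```shell
# Print the seven model files, one per line, in the order used by the
# commands on this page. Pass "int8" as the second argument to select the
# quantized encoder and joiner; the decoder has no int8 variant.
model_files() {
  dir=$1
  suffix=
  if [ "${2:-}" = "int8" ]; then
    suffix=.int8
  fi
  printf '%s\n' \
    "$dir/tokens.txt" \
    "$dir/encoder_jit_trace-pnnx.ncnn${suffix}.param" \
    "$dir/encoder_jit_trace-pnnx.ncnn${suffix}.bin" \
    "$dir/decoder_jit_trace-pnnx.ncnn.param" \
    "$dir/decoder_jit_trace-pnnx.ncnn.bin" \
    "$dir/joiner_jit_trace-pnnx.ncnn${suffix}.param" \
    "$dir/joiner_jit_trace-pnnx.ncnn${suffix}.bin"
}

# Example (relies on word splitting, so the paths must not contain spaces):
#   d=./sherpa-ncnn-conv-emformer-transducer-small-2023-01-09
#   ./build/bin/sherpa-ncnn $(model_files "$d" int8) \
#     "$d/test_wavs/1089-134686-0001.wav"
```

Keeping the file list in one place makes it harder to mix quantized and unquantized files by accident.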

Hint

Even if we use only 1 thread for the int8 model, the resulting real-time factor (RTF) is still less than 1.
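The real-time factor is the decoding time divided by the duration of the audio, so a value below 1 means that decoding finishes faster than the audio plays. A minimal sketch of the calculation follows; the function name `rtf` and the numbers in the example are hypothetical and are not measurements from this page.

```shell
# rtf DECODING_SECONDS AUDIO_SECONDS
# Prints decoding time divided by audio duration, to three decimals.
rtf() {
  awk -v t="$1" -v d="$2" 'BEGIN { printf "%.3f\n", t / d }'
}

# Hypothetical numbers: 5 s of decoding for a 10 s recording.
rtf 5 10   # prints 0.500
```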

csukuangfj/sherpa-ncnn-conv-emformer-transducer-2022-12-06 (Chinese + English)

This model is converted from https://huggingface.co/ptrnull/icefall-asr-conv-emformer-transducer-stateless2-zh, which supports both Chinese and English.

Hint

If you want to train your own model that is able to support both Chinese and English, please refer to our training code:

You can also try the pre-trained models in your browser without installing anything by visiting:

In the following, we describe how to download and use it with sherpa-ncnn.

Please use the following commands to download it.

cd /path/to/sherpa-ncnn

wget https://github.com/k2-fsa/sherpa-ncnn/releases/download/models/sherpa-ncnn-conv-emformer-transducer-2022-12-06.tar.bz2
tar xvf sherpa-ncnn-conv-emformer-transducer-2022-12-06.tar.bz2

Decode a single wave file with ./build/bin/sherpa-ncnn

Hint

It supports decoding only single-channel wave files with a sampling rate of 16 kHz.

cd /path/to/sherpa-ncnn

./build/bin/sherpa-ncnn \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/test_wavs/0.wav

Note

Please use ./build/bin/Release/sherpa-ncnn.exe for Windows.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your command line.
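To decode several files with the same model, the single-file command can be wrapped in a loop. This is only a sketch: `decode_all` and the `SHERPA_NCNN_BIN` variable are hypothetical names introduced here, and it assumes that the model's `test_wavs` directory holds one or more `.wav` files.

```shell
# Decode every wave file in a model's test_wavs directory, one at a time.
# SHERPA_NCNN_BIN (hypothetical variable) overrides the binary path and
# defaults to the path used on this page.
decode_all() {
  dir=$1
  bin=${SHERPA_NCNN_BIN:-./build/bin/sherpa-ncnn}
  for wav in "$dir"/test_wavs/*.wav; do
    [ -f "$wav" ] || continue   # the glob matched nothing
    "$bin" \
      "$dir/tokens.txt" \
      "$dir/encoder_jit_trace-pnnx.ncnn.param" \
      "$dir/encoder_jit_trace-pnnx.ncnn.bin" \
      "$dir/decoder_jit_trace-pnnx.ncnn.param" \
      "$dir/decoder_jit_trace-pnnx.ncnn.bin" \
      "$dir/joiner_jit_trace-pnnx.ncnn.param" \
      "$dir/joiner_jit_trace-pnnx.ncnn.bin" \
      "$wav"
  done
}

# Example:
#   cd /path/to/sherpa-ncnn
#   decode_all ./sherpa-ncnn-conv-emformer-transducer-2022-12-06
```

Each file is decoded by a separate process, so the model is reloaded every time; that is acceptable for a quick check but not for benchmarking.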

Real-time speech recognition from a microphone with build/bin/sherpa-ncnn-microphone

cd /path/to/sherpa-ncnn
./build/bin/sherpa-ncnn-microphone \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin

Note

Please use ./build/bin/Release/sherpa-ncnn-microphone.exe for Windows.

It will print something like below:

Number of threads: 4
num devices: 4
Use default device: 2
  Name: MacBook Pro Microphone
  Max input channels: 1
Started

Speak and it will show you the recognition result in real-time.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your command line.

csukuangfj/sherpa-ncnn-conv-emformer-transducer-2022-12-08 (Chinese)

Hint

This is a very small model that can run in real time on embedded systems.

This model is trained on the WenetSpeech dataset and supports only Chinese.

In the following, we describe how to download and use it with sherpa-ncnn.

Please use the following commands to download it.

cd /path/to/sherpa-ncnn

wget https://github.com/k2-fsa/sherpa-ncnn/releases/download/models/sherpa-ncnn-conv-emformer-transducer-2022-12-08.tar.bz2
tar xvf sherpa-ncnn-conv-emformer-transducer-2022-12-08.tar.bz2

Decode a single wave file with ./build/bin/sherpa-ncnn

Hint

It supports decoding only single-channel wave files with a sampling rate of 16 kHz.

cd /path/to/sherpa-ncnn

./build/bin/sherpa-ncnn \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-08/tokens.txt \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-08/encoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-08/encoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-08/decoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-08/decoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-08/joiner_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-08/joiner_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-08/test_wavs/0.wav

Note

Please use ./build/bin/Release/sherpa-ncnn.exe for Windows.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your command line.

Real-time speech recognition from a microphone with build/bin/sherpa-ncnn-microphone

cd /path/to/sherpa-ncnn
./build/bin/sherpa-ncnn-microphone \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-08/tokens.txt \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-08/encoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-08/encoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-08/decoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-08/decoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-08/joiner_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-08/joiner_jit_trace-pnnx.ncnn.bin
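The model files are passed positionally, with each `.param` file followed by its `.bin` file, so a swapped pair is easy to miss in a long command. The following sketch checks the order before a program is launched; `check_arg_order` is a hypothetical helper and not part of sherpa-ncnn.

```shell
# Check that arguments 2..7 alternate ".param" then ".bin", as in the
# command above (argument 1 is tokens.txt).
check_arg_order() {
  if [ "$#" -lt 7 ]; then
    echo "expected at least 7 arguments, got $#" >&2
    return 1
  fi
  shift   # skip tokens.txt
  for expected in param bin param bin param bin; do
    case "$1" in
      *."$expected") ;;
      *) echo "expected a .$expected file, got: $1" >&2
         return 1 ;;
    esac
    shift
  done
}

# Example:
#   d=./sherpa-ncnn-conv-emformer-transducer-2022-12-08
#   check_arg_order "$d/tokens.txt" \
#     "$d/encoder_jit_trace-pnnx.ncnn.param" "$d/encoder_jit_trace-pnnx.ncnn.bin" \
#     "$d/decoder_jit_trace-pnnx.ncnn.param" "$d/decoder_jit_trace-pnnx.ncnn.bin" \
#     "$d/joiner_jit_trace-pnnx.ncnn.param" "$d/joiner_jit_trace-pnnx.ncnn.bin"
```

The check looks only at file extensions; it cannot tell whether an encoder file was passed where a decoder file belongs.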

Note

Please use ./build/bin/Release/sherpa-ncnn-microphone.exe for Windows.

It will print something like below:

Number of threads: 4
num devices: 4
Use default device: 2
  Name: MacBook Pro Microphone
  Max input channels: 1
Started

Speak and it will show you the recognition result in real-time.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your command line.

csukuangfj/sherpa-ncnn-conv-emformer-transducer-2022-12-04 (English)

This model is trained using GigaSpeech and LibriSpeech. It supports only English.

In the following, we describe how to download and use it with sherpa-ncnn.

Please use the following commands to download it.

cd /path/to/sherpa-ncnn

wget https://github.com/k2-fsa/sherpa-ncnn/releases/download/models/sherpa-ncnn-conv-emformer-transducer-2022-12-04.tar.bz2
tar xvf sherpa-ncnn-conv-emformer-transducer-2022-12-04.tar.bz2
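All download commands on this page follow the same URL pattern, so the two steps can be wrapped in a small function that skips the download when the model directory already exists. This is a sketch: `model_url` and `fetch_model` are hypothetical helper names, and the URL pattern is inferred from the `wget` commands shown on this page.

```shell
# Build the release URL for a model archive from its directory name.
model_url() {
  echo "https://github.com/k2-fsa/sherpa-ncnn/releases/download/models/$1.tar.bz2"
}

# Download and extract a model unless its directory already exists.
fetch_model() {
  name=$1
  if [ -d "$name" ]; then
    echo "$name already present, skipping download"
    return 0
  fi
  wget "$(model_url "$name")" && tar xvf "$name.tar.bz2"
}

# Example:
#   cd /path/to/sherpa-ncnn
#   fetch_model sherpa-ncnn-conv-emformer-transducer-2022-12-04
```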

Decode a single wave file with ./build/bin/sherpa-ncnn

Hint

It supports decoding only single-channel wave files with a sampling rate of 16 kHz.

cd /path/to/sherpa-ncnn

./build/bin/sherpa-ncnn \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-04/tokens.txt \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-04/encoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-04/encoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-04/decoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-04/decoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-04/joiner_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-04/joiner_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-04/test_wavs/1089-134686-0001.wav

Note

Please use ./build/bin/Release/sherpa-ncnn.exe for Windows.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your command line.

Real-time speech recognition from a microphone with build/bin/sherpa-ncnn-microphone

cd /path/to/sherpa-ncnn
./build/bin/sherpa-ncnn-microphone \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-04/tokens.txt \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-04/encoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-04/encoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-04/decoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-04/decoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-04/joiner_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-04/joiner_jit_trace-pnnx.ncnn.bin

Note

Please use ./build/bin/Release/sherpa-ncnn-microphone.exe for Windows.

It will print something like below:

Number of threads: 4
num devices: 4
Use default device: 2
  Name: MacBook Pro Microphone
  Max input channels: 1
Started

Speak and it will show you the recognition result in real-time.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your command line.