Conformer-transducer-based Models

Hint

Please refer to Installation to install sherpa-onnx before you read this section.

csukuangfj/sherpa-onnx-streaming-conformer-zh-2023-05-23 (Chinese)

This model is converted from

https://huggingface.co/luomingshuang/icefall_asr_wenetspeech_pruned_transducer_stateless5_streaming

which supports only Chinese as it is trained on the WenetSpeech corpus.

You can find the training code at

https://github.com/k2-fsa/icefall/tree/master/egs/wenetspeech/ASR/pruned_transducer_stateless5

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-conformer-zh-2023-05-23.tar.bz2

tar xvf sherpa-onnx-streaming-conformer-zh-2023-05-23.tar.bz2
rm sherpa-onnx-streaming-conformer-zh-2023-05-23.tar.bz2

Please check that the file sizes of the downloaded pre-trained models are correct. The expected sizes of the *.onnx files are shown below.

sherpa-onnx-streaming-conformer-zh-2023-05-23 fangjun$ ls -lh *.onnx
-rw-r--r--  1 fangjun  staff    11M May 23 14:44 decoder-epoch-99-avg-1.int8.onnx
-rw-r--r--  1 fangjun  staff    12M May 23 14:44 decoder-epoch-99-avg-1.onnx
-rw-r--r--  1 fangjun  staff   160M May 23 14:46 encoder-epoch-99-avg-1.int8.onnx
-rw-r--r--  1 fangjun  staff   345M May 23 14:47 encoder-epoch-99-avg-1.onnx
-rw-r--r--  1 fangjun  staff   2.7M May 23 14:44 joiner-epoch-99-avg-1.int8.onnx
-rw-r--r--  1 fangjun  staff    11M May 23 14:44 joiner-epoch-99-avg-1.onnx

Decode a single wave file

Hint

It supports decoding only single-channel wave files with 16-bit encoded samples. The sampling rate, however, does not need to be 16 kHz.
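
If your audio is in a different format, you can convert it first. The following is a minimal sketch using sox (sox is not part of sherpa-onnx and must be installed separately; input.mp3 is only a placeholder name):

# Convert to a single-channel wave file with 16-bit samples.
# The sampling rate is left unchanged, since it does not need to be 16 kHz.
sox input.mp3 -c 1 -b 16 converted.wav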

fp32

The following command shows how to use the fp32 models to decode a wave file:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx \
  --tokens=./sherpa-onnx-streaming-conformer-zh-2023-05-23/tokens.txt \
  --encoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-streaming-conformer-zh-2023-05-23/joiner-epoch-99-avg-1.onnx \
  ./sherpa-onnx-streaming-conformer-zh-2023-05-23/test_wavs/0.wav

Note

Please use ./build/bin/Release/sherpa-onnx.exe for Windows.

You should see the following output:

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineTransducerModelConfig(encoder_filename="./sherpa-onnx-streaming-conformer-zh-2023-05-23/encoder-epoch-99-avg-1.onnx", decoder_filename="./sherpa-onnx-streaming-conformer-zh-2023-05-23/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-streaming-conformer-zh-2023-05-23/joiner-epoch-99-avg-1.onnx", tokens="./sherpa-onnx-streaming-conformer-zh-2023-05-23/tokens.txt", num_threads=2, debug=False), lm_config=OnlineLMConfig(model="", scale=0.5), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=2.4, min_utterance_length=0), rule2=EndpointRule(must_contain_nonsilence=True, min_trailing_silence=1.2, min_utterance_length=0), rule3=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=0, min_utterance_length=20)), enable_endpoint=True, max_active_paths=4, decoding_method="greedy_search")
sampling rate of input file: 16000
wav filename: ./sherpa-onnx-streaming-conformer-zh-2023-05-23/test_wavs/0.wav
wav duration (s): 5.611
Started
Done!
Recognition result for ./sherpa-onnx-streaming-conformer-zh-2023-05-23/test_wavs/0.wav:
{"is_final":false,"segment":0,"start_time":0.0,"text":"对我做了介绍那么我想说的是大家如果对我的研究感兴趣呢","timestamps":"[0.00, 0.48, 0.76, 0.88, 1.08, 1.24, 2.00, 2.04, 2.16, 2.36, 2.56, 2.72, 2.92, 3.36, 3.44, 3.64, 3.72, 3.84, 3.96, 4.04, 4.16, 4.28, 4.48, 4.64, 4.84, 5.16]","tokens":["对","我","做","了","介","绍","那","么","我","想","说","的","是","大","家","如","果","对","我","的","研","究","感","兴","趣","呢"]}
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.559 s
Real time factor (RTF): 0.559 / 5.611 = 0.100

Caution

If you are using Windows and encounter encoding issues, please run:

CHCP 65001

in your command line.
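
The log above shows num threads: 2 and decoding method: greedy_search, which are the defaults. You can override them from the command line. The sketch below assumes the flag names --num-threads, --decoding-method, and --max-active-paths; if they differ in your build, run ./build/bin/sherpa-onnx --help to list the supported options:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx \
  --num-threads=4 \
  --decoding-method=modified_beam_search \
  --max-active-paths=4 \
  --tokens=./sherpa-onnx-streaming-conformer-zh-2023-05-23/tokens.txt \
  --encoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-streaming-conformer-zh-2023-05-23/joiner-epoch-99-avg-1.onnx \
  ./sherpa-onnx-streaming-conformer-zh-2023-05-23/test_wavs/0.wav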

int8

The following command shows how to use the int8 models to decode a wave file:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx \
  --tokens=./sherpa-onnx-streaming-conformer-zh-2023-05-23/tokens.txt \
  --encoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/encoder-epoch-99-avg-1.int8.onnx \
  --decoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-streaming-conformer-zh-2023-05-23/joiner-epoch-99-avg-1.int8.onnx \
  ./sherpa-onnx-streaming-conformer-zh-2023-05-23/test_wavs/0.wav

Note

Please use ./build/bin/Release/sherpa-onnx.exe for Windows.

You should see the following output:

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineTransducerModelConfig(encoder_filename="./sherpa-onnx-streaming-conformer-zh-2023-05-23/encoder-epoch-99-avg-1.int8.onnx", decoder_filename="./sherpa-onnx-streaming-conformer-zh-2023-05-23/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-streaming-conformer-zh-2023-05-23/joiner-epoch-99-avg-1.int8.onnx", tokens="./sherpa-onnx-streaming-conformer-zh-2023-05-23/tokens.txt", num_threads=2, debug=False), lm_config=OnlineLMConfig(model="", scale=0.5), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=2.4, min_utterance_length=0), rule2=EndpointRule(must_contain_nonsilence=True, min_trailing_silence=1.2, min_utterance_length=0), rule3=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=0, min_utterance_length=20)), enable_endpoint=True, max_active_paths=4, decoding_method="greedy_search")
sampling rate of input file: 16000
wav filename: ./sherpa-onnx-streaming-conformer-zh-2023-05-23/test_wavs/0.wav
wav duration (s): 5.611
Started
Done!
Recognition result for ./sherpa-onnx-streaming-conformer-zh-2023-05-23/test_wavs/0.wav:
{"is_final":false,"segment":0,"start_time":0.0,"text":"对我做了介绍那么我想说的是大家如果对我的研究感兴趣呢","timestamps":"[0.00, 0.48, 0.72, 0.88, 1.08, 1.24, 2.00, 2.04, 2.16, 2.28, 2.56, 2.72, 2.92, 3.36, 3.44, 3.60, 3.72, 3.84, 3.92, 4.04, 4.16, 4.28, 4.48, 4.60, 4.84, 5.16]","tokens":["对","我","做","了","介","绍","那","么","我","想","说","的","是","大","家","如","果","对","我","的","研","究","感","兴","趣","呢"]}
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.493 s
Real time factor (RTF): 0.493 / 5.611 = 0.088

Caution

If you are using Windows and encounter encoding issues, please run:

CHCP 65001

in your command line.

Real-time speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone \
  --tokens=./sherpa-onnx-streaming-conformer-zh-2023-05-23/tokens.txt \
  --encoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-streaming-conformer-zh-2023-05-23/joiner-epoch-99-avg-1.onnx

Hint

If your system is Linux (including embedded Linux) and sherpa-onnx-microphone does not work for you, you can use sherpa-onnx-alsa instead for real-time speech recognition from your microphone, as shown in the sketch below.
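
The following is a sketch of how sherpa-onnx-alsa can be invoked with this model. The trailing device name plughw:0,0 is only a placeholder; list the capture devices on your system with arecord -l and adjust it accordingly:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-alsa \
  --tokens=./sherpa-onnx-streaming-conformer-zh-2023-05-23/tokens.txt \
  --encoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-streaming-conformer-zh-2023-05-23/joiner-epoch-99-avg-1.onnx \
  plughw:0,0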