LSTM-transducer-based Models

Hint

Please refer to Installation to install sherpa-onnx before you read this section.

csukuangfj/sherpa-onnx-lstm-en-2023-02-17 (English)

This is a model trained using the GigaSpeech and LibriSpeech datasets.

Please see https://github.com/k2-fsa/icefall/pull/558 for how the model is trained.

You can find the training code at

https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/lstm_transducer_stateless2

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-lstm-en-2023-02-17.tar.bz2

# For Chinese users, please use the following mirror
# wget https://hub.nuaa.cf/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-lstm-en-2023-02-17.tar.bz2

tar xvf sherpa-onnx-lstm-en-2023-02-17.tar.bz2
rm sherpa-onnx-lstm-en-2023-02-17.tar.bz2

Please check that the file sizes of the pre-trained models are correct. The expected sizes of the *.onnx files are listed below.

sherpa-onnx-lstm-en-2023-02-17$ ls -lh *.onnx
-rw-r--r-- 1 kuangfangjun root  1.3M Mar 31 22:41 decoder-epoch-99-avg-1.int8.onnx
-rw-r--r-- 1 kuangfangjun root  2.0M Mar 31 22:41 decoder-epoch-99-avg-1.onnx
-rw-r--r-- 1 kuangfangjun root   80M Mar 31 22:41 encoder-epoch-99-avg-1.int8.onnx
-rw-r--r-- 1 kuangfangjun root  319M Mar 31 22:41 encoder-epoch-99-avg-1.onnx
-rw-r--r-- 1 kuangfangjun root  254K Mar 31 22:41 joiner-epoch-99-avg-1.int8.onnx
-rw-r--r-- 1 kuangfangjun root 1003K Mar 31 22:41 joiner-epoch-99-avg-1.onnx

Decode a single wave file

Hint

It supports decoding only single-channel wave files with 16-bit encoded samples. The sampling rate, however, does not need to be 16 kHz.
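
If your wave file has more than one channel or does not use 16-bit samples, you can convert it before decoding. Below is a minimal sketch assuming sox is installed (sox is not part of sherpa-onnx); input.wav and converted.wav are placeholder file names:

# Convert to a single channel with 16-bit samples; the sample rate is left unchanged.
sox input.wav -c 1 -b 16 converted.wav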

fp32

The following code shows how to use fp32 models to decode a wave file:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx \
  --tokens=./sherpa-onnx-lstm-en-2023-02-17/tokens.txt \
  --encoder=./sherpa-onnx-lstm-en-2023-02-17/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-lstm-en-2023-02-17/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-lstm-en-2023-02-17/joiner-epoch-99-avg-1.onnx \
  ./sherpa-onnx-lstm-en-2023-02-17/test_wavs/0.wav

Note

Please use ./build/bin/Release/sherpa-onnx.exe for Windows.

You should see the following output:

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineTransducerModelConfig(encoder_filename="./sherpa-onnx-lstm-en-2023-02-17/encoder-epoch-99-avg-1.onnx", decoder_filename="./sherpa-onnx-lstm-en-2023-02-17/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-lstm-en-2023-02-17/joiner-epoch-99-avg-1.onnx", tokens="./sherpa-onnx-lstm-en-2023-02-17/tokens.txt", num_threads=2, debug=False), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=2.4, min_utterance_length=0), rule2=EndpointRule(must_contain_nonsilence=True, min_trailing_silence=1.2, min_utterance_length=0), rule3=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=0, min_utterance_length=20)), enable_endpoint=True,
max_active_paths=4, decoding_method="greedy_search")
2023-03-31 22:53:22.120185169 [E:onnxruntime:, env.cc:251 ThreadMain] pthread_setaffinity_np failed for thread: 576406, index: 16, mask: {17, 53, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2023-03-31 22:53:22.120183162 [E:onnxruntime:, env.cc:251 ThreadMain] pthread_setaffinity_np failed for thread: 576405, index: 15, mask: {16, 52, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
sampling rate of input file: 16000
wav filename: ./sherpa-onnx-lstm-en-2023-02-17/test_wavs/0.wav
wav duration (s): 6.625
Started
Done!
Recognition result for ./sherpa-onnx-lstm-en-2023-02-17/test_wavs/0.wav:
 AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
num threads: 2
decoding method: greedy_search
Elapsed seconds: 2.927 s
Real time factor (RTF): 2.927 / 6.625 = 0.442
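
The pthread_setaffinity_np messages above are warnings printed by onnxruntime and can be ignored. As the messages suggest, you can also set the thread count explicitly. The sketch below assumes the --num-threads option of the sherpa-onnx binary; its current value appears as num_threads=2 in the config dump above:

cd /path/to/sherpa-onnx

# Same fp32 decoding as above, but with an explicit number of threads.
./build/bin/sherpa-onnx \
  --num-threads=4 \
  --tokens=./sherpa-onnx-lstm-en-2023-02-17/tokens.txt \
  --encoder=./sherpa-onnx-lstm-en-2023-02-17/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-lstm-en-2023-02-17/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-lstm-en-2023-02-17/joiner-epoch-99-avg-1.onnx \
  ./sherpa-onnx-lstm-en-2023-02-17/test_wavs/0.wav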

int8

The following code shows how to use int8 models to decode a wave file:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx \
  --tokens=./sherpa-onnx-lstm-en-2023-02-17/tokens.txt \
  --encoder=./sherpa-onnx-lstm-en-2023-02-17/encoder-epoch-99-avg-1.int8.onnx \
  --decoder=./sherpa-onnx-lstm-en-2023-02-17/decoder-epoch-99-avg-1.int8.onnx \
  --joiner=./sherpa-onnx-lstm-en-2023-02-17/joiner-epoch-99-avg-1.int8.onnx \
  ./sherpa-onnx-lstm-en-2023-02-17/test_wavs/0.wav

Note

Please use ./build/bin/Release/sherpa-onnx.exe for Windows.

You should see the following output:

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineTransducerModelConfig(encoder_filename="./sherpa-onnx-lstm-en-2023-02-17/encoder-epoch-99-avg-1.int8.onnx", decoder_filename="./sherpa-onnx-lstm-en-2023-02-17/decoder-epoch-99-avg-1.int8.onnx", joiner_filename="./sherpa-onnx-lstm-en-2023-02-17/joiner-epoch-99-avg-1.int8.onnx", tokens="./sherpa-onnx-lstm-en-2023-02-17/tokens.txt", num_threads=2, debug=False), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=2.4, min_utterance_length=0), rule2=EndpointRule(must_contain_nonsilence=True, min_trailing_silence=1.2, min_utterance_length=0), rule3=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=0, min_utterance_length=20)), enable_endpoint=True, max_active_paths=4, decoding_method="greedy_search")
2023-03-31 22:55:46.608941959 [E:onnxruntime:, env.cc:251 ThreadMain] pthread_setaffinity_np failed for thread: 578689, index: 16, mask: {17, 53, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2023-03-31 22:55:46.608939862 [E:onnxruntime:, env.cc:251 ThreadMain] pthread_setaffinity_np failed for thread: 578688, index: 15, mask: {16, 52, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
sampling rate of input file: 16000
wav filename: ./sherpa-onnx-lstm-en-2023-02-17/test_wavs/0.wav
wav duration (s): 6.625
Started
Done!
Recognition result for ./sherpa-onnx-lstm-en-2023-02-17/test_wavs/0.wav:
 AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.009 s
Real time factor (RTF): 1.009 / 6.625 = 0.152
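
On this machine the int8 model is roughly three times faster than the fp32 model (RTF 0.152 vs. 0.442) and produces the same transcript. If you want to experiment with a different decoding method, the sketch below assumes that --decoding-method and --max-active-paths are accepted by this binary; both values appear in the config dump above as decoding_method and max_active_paths:

cd /path/to/sherpa-onnx

# int8 decoding as above, but with modified_beam_search instead of greedy_search.
./build/bin/sherpa-onnx \
  --decoding-method=modified_beam_search \
  --max-active-paths=4 \
  --tokens=./sherpa-onnx-lstm-en-2023-02-17/tokens.txt \
  --encoder=./sherpa-onnx-lstm-en-2023-02-17/encoder-epoch-99-avg-1.int8.onnx \
  --decoder=./sherpa-onnx-lstm-en-2023-02-17/decoder-epoch-99-avg-1.int8.onnx \
  --joiner=./sherpa-onnx-lstm-en-2023-02-17/joiner-epoch-99-avg-1.int8.onnx \
  ./sherpa-onnx-lstm-en-2023-02-17/test_wavs/0.wav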

Real-time speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone \
  --tokens=./sherpa-onnx-lstm-en-2023-02-17/tokens.txt \
  --encoder=./sherpa-onnx-lstm-en-2023-02-17/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-lstm-en-2023-02-17/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-lstm-en-2023-02-17/joiner-epoch-99-avg-1.onnx

Hint

If your system is Linux (including embedded Linux), you can also use sherpa-onnx-alsa to do real-time speech recognition with your microphone if sherpa-onnx-microphone does not work for you.
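
Below is a minimal sketch of sherpa-onnx-alsa with this model. The trailing device name (default here) is an assumption and depends on your sound card; you can list the capture devices on your system with arecord -l:

cd /path/to/sherpa-onnx

# Replace "default" with your ALSA capture device, e.g. plughw:0,0.
./build/bin/sherpa-onnx-alsa \
  --tokens=./sherpa-onnx-lstm-en-2023-02-17/tokens.txt \
  --encoder=./sherpa-onnx-lstm-en-2023-02-17/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-lstm-en-2023-02-17/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-lstm-en-2023-02-17/joiner-epoch-99-avg-1.onnx \
  default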

csukuangfj/sherpa-onnx-lstm-zh-2023-02-20 (Chinese)

This is a model trained using the WenetSpeech dataset.

Please see https://github.com/k2-fsa/icefall/pull/595 for how the model is trained.

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-lstm-zh-2023-02-20.tar.bz2

# For Chinese users, you can use the following mirror
# wget https://hub.nuaa.cf/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-lstm-zh-2023-02-20.tar.bz2

tar xvf sherpa-onnx-lstm-zh-2023-02-20.tar.bz2
rm sherpa-onnx-lstm-zh-2023-02-20.tar.bz2

Please check that the file sizes of the pre-trained models are correct. The expected sizes of the *.onnx files are listed below.

sherpa-onnx-lstm-zh-2023-02-20$ ls -lh *.onnx
-rw-r--r-- 1 kuangfangjun root  12M Mar 31 20:55 decoder-epoch-11-avg-1.int8.onnx
-rw-r--r-- 1 kuangfangjun root  12M Mar 31 20:55 decoder-epoch-11-avg-1.onnx
-rw-r--r-- 1 kuangfangjun root  80M Mar 31 20:55 encoder-epoch-11-avg-1.int8.onnx
-rw-r--r-- 1 kuangfangjun root 319M Mar 31 20:55 encoder-epoch-11-avg-1.onnx
-rw-r--r-- 1 kuangfangjun root 2.8M Mar 31 20:55 joiner-epoch-11-avg-1.int8.onnx
-rw-r--r-- 1 kuangfangjun root  11M Mar 31 20:55 joiner-epoch-11-avg-1.onnx

Decode a single wave file

Hint

It supports decoding only single-channel wave files with 16-bit encoded samples. The sampling rate, however, does not need to be 16 kHz.

fp32

The following code shows how to use fp32 models to decode a wave file:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx \
  --tokens=./sherpa-onnx-lstm-zh-2023-02-20/tokens.txt \
  --encoder=./sherpa-onnx-lstm-zh-2023-02-20/encoder-epoch-11-avg-1.onnx \
  --decoder=./sherpa-onnx-lstm-zh-2023-02-20/decoder-epoch-11-avg-1.onnx \
  --joiner=./sherpa-onnx-lstm-zh-2023-02-20/joiner-epoch-11-avg-1.onnx \
  ./sherpa-onnx-lstm-zh-2023-02-20/test_wavs/0.wav

Note

Please use ./build/bin/Release/sherpa-onnx.exe for Windows.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your command line. It switches the active code page of the console to UTF-8 so that Chinese characters are displayed correctly.

You should see the following output:

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineTransducerModelConfig(encoder_filename="./sherpa-onnx-lstm-zh-2023-02-20/encoder-epoch-11-avg-1.onnx", decoder_filename="./sherpa-onnx-lstm-zh-2023-02-20/decoder-epoch-11-avg-1.onnx", joiner_filename="./sherpa-onnx-lstm-zh-2023-02-20/joiner-epoch-11-avg-1.onnx", tokens="./sherpa-onnx-lstm-zh-2023-02-20/tokens.txt", num_threads=2, debug=False), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=2.4, min_utterance_length=0), rule2=EndpointRule(must_contain_nonsilence=True, min_trailing_silence=1.2, min_utterance_length=0), rule3=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=0, min_utterance_length=20)), enable_endpoint=True, max_active_paths=4, decoding_method="greedy_search")
2023-03-31 22:58:59.348229346 [E:onnxruntime:, env.cc:251 ThreadMain] pthread_setaffinity_np failed for thread: 578800, index: 15, mask: {16, 52, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2023-03-31 22:58:59.348231417 [E:onnxruntime:, env.cc:251 ThreadMain] pthread_setaffinity_np failed for thread: 578801, index: 16, mask: {17, 53, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
sampling rate of input file: 16000
wav filename: ./sherpa-onnx-lstm-zh-2023-02-20/test_wavs/0.wav
wav duration (s): 5.611
Started
Done!
Recognition result for ./sherpa-onnx-lstm-zh-2023-02-20/test_wavs/0.wav:
对我做了介绍那么我想说的是大家如果对我的研究感兴趣
num threads: 2
decoding method: greedy_search
Elapsed seconds: 3.030 s
Real time factor (RTF): 3.030 / 5.611 = 0.540

int8

The following code shows how to use int8 models to decode a wave file:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx \
  --tokens=./sherpa-onnx-lstm-zh-2023-02-20/tokens.txt \
  --encoder=./sherpa-onnx-lstm-zh-2023-02-20/encoder-epoch-11-avg-1.int8.onnx \
  --decoder=./sherpa-onnx-lstm-zh-2023-02-20/decoder-epoch-11-avg-1.int8.onnx \
  --joiner=./sherpa-onnx-lstm-zh-2023-02-20/joiner-epoch-11-avg-1.int8.onnx \
  ./sherpa-onnx-lstm-zh-2023-02-20/test_wavs/0.wav

Note

Please use ./build/bin/Release/sherpa-onnx.exe for Windows.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your command line. It switches the active code page of the console to UTF-8 so that Chinese characters are displayed correctly.

You should see the following output:

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineTransducerModelConfig(encoder_filename="./sherpa-onnx-lstm-zh-2023-02-20/encoder-epoch-11-avg-1.int8.onnx", decoder_filename="./sherpa-onnx-lstm-zh-2023-02-20/decoder-epoch-11-avg-1.int8.onnx", joiner_filename="./sherpa-onnx-lstm-zh-2023-02-20/joiner-epoch-11-avg-1.int8.onnx", tokens="./sherpa-onnx-lstm-zh-2023-02-20/tokens.txt", num_threads=2, debug=False), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=2.4, min_utterance_length=0), rule2=EndpointRule(must_contain_nonsilence=True, min_trailing_silence=1.2, min_utterance_length=0), rule3=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=0, min_utterance_length=20)), enable_endpoint=True, max_active_paths=4, decoding_method="greedy_search")
2023-03-31 23:01:05.737519659 [E:onnxruntime:, env.cc:251 ThreadMain] pthread_setaffinity_np failed for thread: 578880, index: 15, mask: {16, 52, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2023-03-31 23:01:05.737521655 [E:onnxruntime:, env.cc:251 ThreadMain] pthread_setaffinity_np failed for thread: 578881, index: 16, mask: {17, 53, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
sampling rate of input file: 16000
wav filename: ./sherpa-onnx-lstm-zh-2023-02-20/test_wavs/0.wav
wav duration (s): 5.611
Started
Done!
Recognition result for ./sherpa-onnx-lstm-zh-2023-02-20/test_wavs/0.wav:
对我做了介绍那么我想说的是大家如果对我的研究感兴趣
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.091 s
Real time factor (RTF): 1.091 / 5.611 = 0.194
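
If you want to decode every wave file in the test_wavs directory in one go, a simple shell loop is enough. This is a sketch; adjust the glob if your copy of test_wavs contains files other than *.wav:

cd /path/to/sherpa-onnx

# Decode each test wave file with the int8 models, one after another.
for wav in ./sherpa-onnx-lstm-zh-2023-02-20/test_wavs/*.wav; do
  ./build/bin/sherpa-onnx \
    --tokens=./sherpa-onnx-lstm-zh-2023-02-20/tokens.txt \
    --encoder=./sherpa-onnx-lstm-zh-2023-02-20/encoder-epoch-11-avg-1.int8.onnx \
    --decoder=./sherpa-onnx-lstm-zh-2023-02-20/decoder-epoch-11-avg-1.int8.onnx \
    --joiner=./sherpa-onnx-lstm-zh-2023-02-20/joiner-epoch-11-avg-1.int8.onnx \
    "$wav"
done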

Real-time speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone \
  --tokens=./sherpa-onnx-lstm-zh-2023-02-20/tokens.txt \
  --encoder=./sherpa-onnx-lstm-zh-2023-02-20/encoder-epoch-11-avg-1.onnx \
  --decoder=./sherpa-onnx-lstm-zh-2023-02-20/decoder-epoch-11-avg-1.onnx \
  --joiner=./sherpa-onnx-lstm-zh-2023-02-20/joiner-epoch-11-avg-1.onnx

Hint

If your system is Linux (including embedded Linux), you can also use sherpa-onnx-alsa to do real-time speech recognition with your microphone if sherpa-onnx-microphone does not work for you.